1 Introduction
Statistical agencies collect household- or establishment-level data through survey instruments in order to produce summary statistics. These statistics are often rendered in summary tables defined by geographic and demographic categories. Researchers and data analysts, additionally, often seek access to the respondent-level, detailed data records to support their inferential goals; for example, by conducting regression analysis using variables in the respondent-level dataset. Statistical agencies in the U.S., such as the Census Bureau and the Bureau of Labor Statistics, are under legal obligation to protect the privacy of survey respondents and to prevent even the inadvertent disclosure of their identities. These agencies increasingly utilize hierarchical Bayesian models, estimated on the real, record-level data, as “synthesizers” to produce synthetic record-level data (drawn from the model posterior predictive distribution) that encode privacy protection by smoothing the real data distribution (Hu et al., 2018). Dimitrakakis et al. (2017) demonstrate a direct link between prior smoothing and privacy protection, which contrasts with the usual approach of adding randomly generated noise to the data (Dwork et al., 2006) in non-modeling contexts.
The degree of privacy protection encoded by synthetic data may be quantified for each record by computing an identification disclosure risk for that record and for the overall file. The identification disclosure risk is constructed relative to the knowledge held by a putative intruder and assumptions about her possible actions; for example, it is common to synthesize some variables (such as income) conditioned on known, unsynthesized variables where the publishing of these variables presents little disclosure risk. The intersections among levels of known (and unsynthesized) categorical variables, such as age, gender, and education level categories, form a collection of patterns (Hu and Savitsky, 2018). The intruder will seek the identity of a target record by examining the distribution of records in the pattern containing that record. We assume that the intruder knows the true value(s) of the synthesized variable(s) for the record whose identity they seek. The identification disclosure risk may be constructed as a probability formed as 1 minus the proportion of records in a pattern whose synthesized variable values lie inside some radius, $r$, of the true value (for a continuous variable); that is, the probability that synthesized records lie outside the ball of radius $r$ around the true record value of interest. Records whose values of the synthesized variable lie within the radius are deemed “close”. If there are relatively few records close to the true target record value, then that target record is relatively unique and its identity is easier for an intruder to discover (Quick et al., 2015).
A nonparametric synthesizer is able to data-adaptively shrink the real data distribution to promote a high utility for the synthetic data user; for example, the data analyst will achieve a relatively high utility if the coefficients of a regression model that she estimates on the synthetic data are close to the values that would be estimated on the real data. In general, a high utility for synthetic data is achieved when its distribution is similar to the real data distribution (Hu and Savitsky, 2018). Yet, there are some portions of the real data distribution that present a higher disclosure risk than other portions. For any synthesized variable, such as income, records in relatively sparsely populated portions of the distribution, such as the tails, may express high risks for disclosure to the extent that the synthesizer generates values close to the true value. Focusing on a high risk record in the tail, a flexible, nonparametric synthesizer may produce synthetic data that express too much risk by not adequately shrinking the tail portions towards the high mass regions of the distribution. By contrast, a parametric synthesizer would be expected to more fully shrink the distribution tails, but simultaneously to overly distort the higher mass regions, resulting in synthetic data that express low utility.
We introduce a collection of weights, $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_n)$, indexed by each of the $n$ observations, to exponentiate the likelihood contributions in a pseudo likelihood framework that, when convolved with the prior distributions, produces a joint pseudo posterior distribution. (See Savitsky and Toth (2016) for background on a pseudo posterior distribution constructed in the case of complex survey sampling.) Each $\alpha_i(\mathbf{y})$ (where $\mathbf{y}$ are the observed response values) is constructed as 1 minus the probability of identification disclosure for record $i$ multiplied by a record-specific constant, such that the likelihood contribution for a high risk record is effectively downweighted, which strengthens the influence of the prior distribution for that record. This straightforward approach may be viewed as setting an anti-informative prior for high risk observations that induces the synthesizer to surgically distort the high risk portions of the data distribution in a fashion that preserves the high mass portions. We expect this approach to express higher utility for the same risk level than would be the case under use of a scalar weight that introduces a global tempering (Bhattacharya et al., 2019).
Unlike the setup of Miller and Dunson (2018), our use of weighting to induce misspecification of the actual data distribution is purposeful in order to encode privacy protection. So we do not view the observed data as a corrupted version of the true data generated by some model, $P_{\theta_0}$, but rather our approach inserts corruption into the synthetic data. In the former case, where the observed data are viewed as corrupted, a scalar weight is used to induce tempering in the posterior distribution to express robustness by smoothing over the corruption. Since our purpose is to induce the minimal misspecification in the synthetic data needed to achieve a privacy protection threshold, while at the same time preserving a high utility for the synthetic data, it is more natural to use a vector of weights to mitigate the risk-utility tradeoff. We extend the theoretical result of Bhattacharya et al. (2019) on the frequentist consistency of the misspecified estimator (at a value, $\theta^{*}$, not including the true generating parameter, $\theta_0$) in the sequel, where we demonstrate that the contraction rate is injured by the degree of misspecification induced by the weights. Our specification of a vector of weights, $\boldsymbol{\alpha}$, depends on the observed data, $\mathbf{y}$, and also on draws of synthetic data from the model posterior predictive distribution, so that the weights in our privacy setting are constructed to be random with respect to the data generating distribution, unlike the specification of the scalar weight in the consistency result of Bhattacharya et al. (2019) and the implementation of Miller and Dunson (2018).
We illustrate the performance of our approach in a simulation study, and by utilizing respondent-level data for the Consumer Expenditure Surveys (CE), published by the U.S. Bureau of Labor Statistics (BLS). The CE publishes summary, domain-level statistics used for both policy-making and research, including the most widely used measure of inflation, the Consumer Price Index (CPI), and measures of poverty that determine thresholds for the U.S. Government’s Supplemental Poverty Measure. The CE consists of two surveys: i) the Quarterly Interview Survey, which aims to capture large purchases (such as rent, utilities, and vehicles), contains approximately 7,000 interviews, and is conducted every quarter; and ii) the Diary Survey, which is administered on an annual basis, focuses on capturing small purchases (such as food, beverages, and tobacco), and contains approximately 14,000 interviews of households. We focus on the CE public-use microdata (PUMD) that result from these instruments. Unlike published CE tables, which release information from the CE data in aggregated forms, the CE PUMD releases the CE data at the individual, respondent level, which potentially enables the CE data users to conduct research tailored to their interests. Directly releasing individual-level data, however, poses privacy concerns.
Section 1.1 introduces details of the CE sample data in the application and the CE survey program’s current top-coding practice for the family income variable in the CE PUMD for disclosure control. It also discusses the motivation for the development of our risk-weighted pseudo posterior method.
1.1 The CE data and the top-coded family income
The CE data sample in our application comes from the 2017 1st quarter Interview Survey. There are n = 6,208 consumer units (CU) in this sample. A CU defines a collection of related people, such as a family, who are financially independent of other collections of people who may reside in the same physical location or household; e.g., roommates. Generally, however, the CU used by CE may be thought of as a household. We focus on 11 variables: gender, age, education level, region, urban, marital status, urban type, CBSA, family size, earner, and family income. The first 10 variables are either categorical in nature or discretized from continuous variables. They are considered insensitive, and are therefore not synthesized but used as predictors. The family income variable is continuous, ranging from approximately -7K to 1,800K (rounded for confidentiality; negative family income values reflect investment and business losses). This variable is considered sensitive, and is therefore synthesized for disclosure protection. See Table 1 for details of the variables.
The sensitive family income variable is highly right-skewed, as shown by its density plot in Figure 2. The 97.5 percentile value for this variable is approximately $270K.

| Variable | Description |
| --- | --- |
| Gender | Gender of the reference person; 2 categories |
| Age | Age of the reference person; 5 categories |
| Education Level | Education level of the reference person; 8 categories |
| Region | Region of the CU; 4 categories |
| Urban | Urban status of the CU; 2 categories |
| Marital Status | Marital status of the reference person; 5 categories |
| Urban Type | Urban area type of the CU; 3 categories |
| CBSA | 2010 core-based statistical area (CBSA) status; 3 categories |
| Family Size | Size of the CU; 11 categories |
| Earner | Earner status of the reference person; 2 categories |
| Family Income | Imputed and reported income before tax of the CU; approximate range: (-7K, 1,800K) |

Currently, the CE PUMD releases a top-coded version of the family income values to the public. In the statistical disclosure control (SDC) literature, top-coding refers to the practice of choosing a top-code value in advance and censoring any values above it to that value (An and Little, 2007). While top-coding induces disclosure protection by not releasing the exact value of a CU’s family income for a certain portion of the distribution (especially for the extreme values), it may reduce the utility of the microdata by destroying important features of the distribution, especially in the tails. As is evident in Figure 3 and Figure 4, the density plots of the top-coded family income (green) deviate from those of the real family income (red) in approximately the top 2.5% of the distribution. This deviation will degrade inferences related to the right tail of the family income distribution when data analysts use the top-coded microdata in place of the confidential real microdata. There is also an implicit assumption in top-coding that high risk records are concentrated in the right tail of the income distribution, which we show in the sequel to be false.
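As a minimal illustration of the top-coding practice described above, the following sketch censors simulated, right-skewed incomes at an empirical 97.5th percentile. The function name `top_code`, the simulated log-normal incomes, and the simple order-statistic quantile rule are assumptions made for illustration only; they are not the CE program's production procedure.

```python
import random

def top_code(values, quantile=0.975):
    """Censor every value above the empirical `quantile` to that cutoff."""
    ordered = sorted(values)
    # Simple empirical quantile: the order statistic at the requested rank.
    rank = min(len(ordered) - 1, int(quantile * len(ordered)))
    cutoff = ordered[rank]
    return [min(v, cutoff) for v in values], cutoff

# Hypothetical right-skewed "income" values (not CE data).
random.seed(0)
income = [random.lognormvariate(11.0, 0.9) for _ in range(4000)]
coded, cutoff = top_code(income)

# Number of records whose released value differs from the true value.
changed = sum(c < v for c, v in zip(coded, income))
```

Only the records above the cutoff are altered, which is why top-coding leaves the bulk of the distribution intact while flattening the right tail.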
There is, therefore, an opportunity to propose alternatives to top-coding for the release of highly skewed continuous data. An and Little (2007) demonstrate through simulation studies and a real data application that synthetic record-level data provide better inferences than top-coding. The synthetic data approach allows straightforward methods of analysis while maintaining good statistical disclosure control properties. In this paper, we apply the synthetic data approach to the CE data sample and aim to generate partially synthetic CE data where the sensitive CU’s family income variable is synthesized. To evaluate the degree of privacy protection encoded by synthetic data, we propose measures that quantify the identification disclosure risks as record-level probabilities of disclosure. We end up with a collection of record-level identification disclosure risks for all records.
We start with a synthesizer that reproduces the real data distribution well, with high utility. For any record with unacceptably high disclosure risk, instead of developing new synthesizers to generate a new set of synthetic values for inducing higher disclosure protection, we introduce a record-level weight which is inversely proportional to its disclosure risk. The weight is then used to exponentiate the record’s likelihood contribution in a pseudo likelihood framework, effectively downweighting its likelihood contribution, and a new synthetic value is generated for the record from a weighted version of the original synthesizer. We propose to surgically distort the high risk portions in the data distribution after synthetic data generation. Our methods provide statistical agencies the flexibility to target high risk records when producing synthetic microdata for public release.
The remainder of the paper is organized as follows: Section 2 develops a nonparametric mixture synthesizer that we will use for the CE data sample. In Section 3, we provide a general specification for the weighted pseudo posterior distribution that includes our approach for constructing the weights, $\alpha_i$, based on the measured identification disclosure risk for each of the $n$ records. Section 4 constructs conditions that guarantee the frequentist consistency of our weighted synthesizer and reveals how the contraction rate is impacted by the weights. We further demonstrate the distortion induced into the asymptotic covariance matrix of the pseudo MLE and pseudo posterior by the weights. Section 5 implements a simulation study to demonstrate how the pseudo posterior produces synthetic data that distort the portion of the observed data distribution expressing high risk of identification disclosure. We apply our methods to the CE data sample for generating synthetic family income in Section 6. The paper concludes with discussion in Section 7.
2 Nonparametric Mixture Synthesizer
Our goal is to generate partially synthetic data for the CE data sample introduced in Section 1.1. Among the 11 variables, only the family income variable is sensitive and synthesized. It is continuous and highly skewed. The other 10 variables are insensitive, therefore unsynthesized and used as predictors in synthesizing family income.
Our proposed synthesizer is a nonparametric mixture synthesizer for a sensitive continuous variable, utilizing a number of available predictors. We now describe the synthesizer in the context of the CE data sample; however, we believe this synthesizer is generalizable and widely applicable for synthesizing skewed continuous data.
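Let $y_i$ be the logarithm of the family income for CU $i$, and $\mathbf{x}_i$ be the vector including an intercept and the values of the predictors of CU $i$. There are $n$ CUs in the sample.
$$y_i \mid \mathbf{x}_i, z_i, \boldsymbol{\beta}^{*}, \boldsymbol{\sigma}^{*} \sim \mathrm{Normal}\left(\mathbf{x}_i^{\top}\boldsymbol{\beta}^{*}_{z_i},\ \sigma^{*}_{z_i}\right) \tag{1}$$
$$z_i \mid \boldsymbol{\pi} \sim \mathrm{Multinomial}\left(1;\ \pi_1,\ldots,\pi_K\right) \tag{2}$$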
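We employ a truncated Dirichlet process (TDP) prior (Neal, 2000) for the unknown measure, $F$, that generates the model parameters, to allow the flexible clustering of CUs that employ the same generating distribution for $y_i$. Under our TDP prior, we sample locations, $(\boldsymbol{\beta}^{*}_k, \sigma^{*}_k)$, $k = 1,\ldots,K$, and cluster indicators, $z_i \in \{1,\ldots,K\}$, for CU $i$, where $K$ denotes the maximum number of clusters. The cluster indicators, $z_i$, are generated from Multinomial draws with cluster probabilities, $(\pi_1,\ldots,\pi_K)$, in Equation (2).
We achieve a TDP prior construction through our sampling of the cluster probabilities, $(\pi_1,\ldots,\pi_K)$, with,
$$(\pi_1,\ldots,\pi_K) \sim \mathrm{Dirichlet}\left(\frac{\alpha}{K},\ldots,\frac{\alpha}{K}\right) \tag{3}$$
$$\alpha \sim \mathrm{Gamma}\left(a_{\alpha}, b_{\alpha}\right), \tag{4}$$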
which empowers the TDP mixture synthesizer to pick the effective number of occupied clusters. A TDP prior on the model parameters marginally specifies a TDP mixture for $y_i$, which becomes arbitrarily close to a DP mixture as $K \rightarrow \infty$ (Neal, 2000). The hyperparameter, $\alpha$, induces sparsity in the number of non-zero cluster probabilities in Equation (3). Due to its influence on the number of clusters learned from the data, we place a further Gamma prior on $\alpha$.
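The sparsity effect of the concentration hyperparameter can be seen in a small simulation. The sketch below draws cluster probabilities from a symmetric Dirichlet with shape `alpha / K` (using normalized Gamma draws) and counts how many probabilities are non-negligible. The names `draw_cluster_probs` and `mean_occupied`, the truncation level `K = 20`, and the 0.01 threshold are all hypothetical choices for illustration; this is not the Stan implementation used in the paper.

```python
import random

def draw_cluster_probs(alpha, K, rng):
    """One draw of (pi_1, ..., pi_K) from Dirichlet(alpha/K, ..., alpha/K)."""
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    gammas = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_occupied(alpha, K, rng, draws=500, threshold=0.01):
    """Average number of cluster probabilities above `threshold`."""
    count = 0
    for _ in range(draws):
        probs = draw_cluster_probs(alpha, K, rng)
        count += sum(p > threshold for p in probs)
    return count / draws

probs = draw_cluster_probs(1.0, 20, random.Random(1))
sparse = mean_occupied(0.5, 20, random.Random(2))   # small alpha: few clusters
dense = mean_occupied(10.0, 20, random.Random(2))   # large alpha: many clusters
```

Smaller values of `alpha` concentrate the prior mass on a handful of clusters, which is why a further prior on the hyperparameter lets the data inform the effective number of clusters.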
We specify multivariate Normal priors for each vector of coefficient locations, $\boldsymbol{\beta}^{*}_k$, as in Equation (5), and t priors for each standard deviation location, $\sigma^{*}_k$, as in Equation (6).
(5)
(6)
where the correlation matrix, , receives a uniform prior over the space of correlation matrices (Stan Development Team, 2016) and each component of receives a Student-t prior with degrees of freedom.
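To generate synthetic family income of each CU, we first generate sample values of $(\boldsymbol{\pi}^{(l)}, \boldsymbol{\beta}^{*,(l)}, \boldsymbol{\sigma}^{*,(l)})$ from the posterior distribution at MCMC iteration $l$. We estimate our TDP mixture model using Stan (Stan Development Team, 2016), after marginalizing out the discrete cluster assignments, $z_i$. So we generate cluster assignments, a posteriori, from the full conditionals given $(\boldsymbol{\pi}^{(l)}, \boldsymbol{\beta}^{*,(l)}, \boldsymbol{\sigma}^{*,(l)})$, with,
$$p\left(z_i^{(l)} = k \mid y_i, \mathbf{x}_i, \boldsymbol{\pi}^{(l)}, \boldsymbol{\beta}^{*,(l)}, \boldsymbol{\sigma}^{*,(l)}\right) = \frac{\pi_k^{(l)}\, \phi\left(y_i \mid \mathbf{x}_i^{\top}\boldsymbol{\beta}_k^{*,(l)}, \sigma_k^{*,(l)}\right)}{\sum_{k'=1}^{K} \pi_{k'}^{(l)}\, \phi\left(y_i \mid \mathbf{x}_i^{\top}\boldsymbol{\beta}_{k'}^{*,(l)}, \sigma_{k'}^{*,(l)}\right)}, \tag{7}$$
where $\phi(\cdot)$ denotes the density function of a Normal distribution. We next generate synthetic family income, $y_i^{*,(l)}$, through a Normal draw given the predictor vector, $\mathbf{x}_i$, and samples of $\boldsymbol{\beta}^{*,(l)}_{z_i^{(l)}}$ and $\sigma^{*,(l)}_{z_i^{(l)}}$, as in Equation (1). Let $\mathbf{Z}^{(l)}$ denote a partially synthetic dataset at MCMC iteration $l$. We repeat the process $L$ times, creating $L$ independent partially synthetic datasets, $\mathbf{Z} = (\mathbf{Z}^{(1)},\ldots,\mathbf{Z}^{(L)})$.
3 Risk-weighted Pseudo Posterior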
When releasing partially synthetic data to the public, where sensitive variables are synthesized and insensitive variables are unsynthesized, there are two types of disclosure risks associated with the release: i) attribute disclosure risks, and ii) identification disclosure risks (Hu, 2018).
We use our CE data sample application as a working example. With family income synthesized, although an intruder can no longer know the true value of the synthesized family income of any CU, she can make guesses about the true value, a disclosure commonly known as attribute disclosure. Moreover, if the intruder possesses knowledge of a pattern of unsynthesized categorical variables (e.g. {gender, age, education level}, or any other subset of the 10 unsynthesized categorical variables) and the true value of the synthesized family income of a CU, she could make guesses about the identity of the CU, a disclosure commonly referred to as identification disclosure.
Given the CE program’s emphasis on identification disclosure in the CE PUMD release, we focus on identification disclosure and develop a record-level probability of identification disclosure risk measure for each CU. We proceed to construct record-level measures for identification disclosure risks in Section 3.1. Based on these measures, we construct the record-level weights for the pseudo likelihood framework in Section 3.2, which leads to our proposed risk-weighted pseudo posterior.
3.1 Probability of Identification Disclosure
The released synthetic datasets, $\mathbf{Z} = (\mathbf{Z}^{(1)},\ldots,\mathbf{Z}^{(L)})$, are publicly available. Suppose the intruder seeks identities of records within each synthetic dataset, $\mathbf{Z}^{(l)}$. In addition to $\mathbf{Z}$, assume the intruder has access to an external file, which contains the following information on each CU, $i$: i) a known pattern of the unsynthesized categorical variables, $p$; ii) the true value of synthesized family income, $y_i$; and iii) a name or identity of interest. With access to such external information, the intruder may attempt to determine the identity of the CU for record $i$ by first performing matches based on the known pattern, $p$, which is publicly available in each $\mathbf{Z}^{(l)}$. Let $M^{(l)}_{p,i}$ be the collection of CUs in $\mathbf{Z}^{(l)}$ in which each CU shares the same pattern, $p$, with CU $i$, and let the cardinality, $|M^{(l)}_{p,i}|$, denote the number of CUs in $M^{(l)}_{p,i}$. Note that the CUs in $M^{(l)}_{p,i}$ contain synthesized family income, not their real values.
Armed with the knowledge of the true value of family income of CU $i$, the intruder will seek those records whose synthetic values, $y_h^{*,(l)}$, are “close” to the true data value, $y_i$ (which we will next more formally define as within a specified ball around the truth). If there are few records with synthetic values close to $y_i$, the identification disclosure risk may be higher if the true record is among those records that are close to $y_i$, since there are few such records from which the intruder may randomly select a record (since each is otherwise identical in pattern and close to the truth).
Define $B(y_i, r)$ as a ball of radius $r$ around the true value, $y_i$, of record $i$, whose pattern, $p$, is where the intruder will concentrate her search. Let the indicator $T_i^{(l)} = 1$ if the synthetic value for the true record, $y_i^{*,(l)}$, is among those records, $h \in M^{(l)}_{p,i}$, whose $y_h^{*,(l)} \in B(y_i, r)$, and $T_i^{(l)} = 0$ otherwise. We define the probability of identification for record, $i$, as:
$$IR_i^{(l)} = \frac{\sum_{h \in M^{(l)}_{p,i}} \mathbb{I}\left(y_h^{*,(l)} \notin B(y_i, r)\right)}{\left|M^{(l)}_{p,i}\right|} \times T_i^{(l)}, \tag{8}$$
where $\mathbb{I}(\cdot)$ is the indicator function. This risk is constructed as the probability that the synthetic values for records in the focus pattern, $M^{(l)}_{p,i}$, are not close to the truth. This means that there are relatively few records in the pattern close (within a ball of radius $r$) to the truth, such that the intruder has a higher probability of guessing the record of the name they seek. The attribute risk, by contrast, is higher if there are more records with synthetic values, $y_h^{*,(l)}$, close to the truth because then the intruder has a higher probability of selecting a value near the true attribute through guessing. So the identification disclosure risk, $IR_i^{(l)}$, is inversely related to the attribute disclosure risk: if there are many records whose synthetic values are similar to the truth, then the intruder has more difficulty finding the correct record that matches the identity they seek.
It may be the case, however, that though there are only a few records with $y_h^{*,(l)}$ close to the truth, $y_i$, the true record may not be among those records, such that $T_i^{(l)} = 0$, which means that the intruder has a probability of 0 of finding the record for the name they seek. The event of $T_i^{(l)} = 0$ may occur because the synthesizer “mixes” values while preserving the true data distribution well. We will see an example of this in our CE application in the sequel. The radius, $r$, used to define closeness is a policy choice that we leave to the CE survey program in our application (though we explore sensitivity to multiple choices). If $T_i^{(l)}$ is equal between two different radii, the smaller value for $r$ will produce a relatively larger identification risk because it defines fewer synthesized record values as close to the true value. So selection of a smaller value for $r$ may be viewed as more conservative, though this will not always be the case as $T_i^{(l)}$ may change from 1 to 0 as the choice for radius, $r$, shrinks.
We take the average of $IR_i^{(l)}$ across the $L$ synthetic datasets and use $IR_i = \frac{1}{L}\sum_{l=1}^{L} IR_i^{(l)}$ as the final record-level identification disclosure risk for CU $i$. The higher $IR_i$ is, the higher the identification disclosure risk is for CU $i$.
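The risk measure described above can be sketched for a single target record as follows. The function names, the closed-ball convention (`<= radius`), and the toy values are assumptions for illustration; in particular, `pattern_members` stands in for the set of records sharing the target's pattern, and the numbers are not CE data.

```python
def identification_risk(y_true_i, i, pattern_members, y_synth, radius):
    """Risk for record i in one synthetic dataset.

    Share of records in the pattern whose synthetic value lies outside the
    ball around the true value, multiplied by the indicator that record i's
    own synthetic value lies inside the ball.
    """
    outside = sum(abs(y_synth[h] - y_true_i) > radius for h in pattern_members)
    t_i = 1 if abs(y_synth[i] - y_true_i) <= radius else 0
    return t_i * outside / len(pattern_members)

def average_risk(y_true_i, i, pattern_members, synth_datasets, radius):
    """Average the per-dataset risk over all released synthetic datasets."""
    risks = [identification_risk(y_true_i, i, pattern_members, y_synth, radius)
             for y_synth in synth_datasets]
    return sum(risks) / len(risks)

# Toy example: record 0 shares its pattern with records 1, 2 and 3.
members = [0, 1, 2, 3]
dataset_1 = [10.5, 20.0, 30.0, 10.2]   # record 0 stays close to its true value
dataset_2 = [12.0, 10.1, 9.9, 50.0]    # record 0 moves outside the ball
risk_1 = identification_risk(10.0, 0, members, dataset_1, radius=1.0)
risk_2 = identification_risk(10.0, 0, members, dataset_2, radius=1.0)
risk_avg = average_risk(10.0, 0, members, [dataset_1, dataset_2], radius=1.0)
```

In the second dataset the risk is zero even though few records are close to the truth, because the target's own synthetic value is no longer among the close records.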
The constructed $IR_i$’s will now be used to determine the level of privacy protection encoded in the synthetic datasets, $\mathbf{Z}$, for each CU individually. The $IR_i$’s will also be used to construct record-level weights, $\alpha_i$’s, which are inversely proportional to the $IR_i$’s. These weights will exponentiate the CU’s likelihood contribution in a pseudo likelihood framework. For CU $i$, the higher the $IR_i$, the lower the weight, $\alpha_i$, translating to a smaller likelihood contribution of CU $i$ in the pseudo likelihood. In this way, we downweight the likelihood contributions of relatively higher risk records. We can then generate new synthetic datasets from a weighted version of the original synthesizer. This procedure surgically distorts the high risk portions in the data distribution, and produces new synthetic datasets with higher disclosure protection.
We proceed to construct the record-level weights, $\alpha_i$, from the record-level identification disclosure risk measures, $IR_i$, and the pseudo posterior framework for synthetic data, with the goal of inducing higher disclosure protection at a minimal loss of utility.
3.2 Pseudo Posterior
When the identification disclosure risk of CU $i$ is relatively high, the likelihood contribution of CU $i$ will be downweighted, which, in turn, strengthens the influence of the prior distribution for CU $i$. Therefore, the record-level weights, $\alpha_i$, should be inversely proportional to the identification disclosure risks, $IR_i$. We propose the following formulation, which guarantees the weights $\alpha_i \in [0, 1]$:
$$\alpha_i = \max\left(1 - c_i \times IR_i,\ 0\right), \tag{9}$$
where $IR_i$ is the identification disclosure risk of CU $i$, and $c_i$ is a record-level constant to control the amount of augmentation or deflation of $IR_i$ to determine the likelihood contribution of CU $i$. We use a sigmoid curve for $c_i$, making the value of $c_i$ dependent on the value of $IR_i$, so that we augment the identification disclosure risk to a greater extent for relatively high risk values. We downweight more of the likelihood contribution of CU $i$ if its $IR_i$ is relatively high. We set the range of $c_i$ to be bounded below by and bounded above by , and bound $\alpha_i$ below by 0, as in Equation (9).
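A minimal sketch of the weight construction follows, reading the weight as 1 minus the risk multiplied by a record-level constant and then bounded below by 0. The sigmoid's lower and upper bounds (`c_min`, `c_max`), its `steepness`, and its `midpoint` are hypothetical values chosen only to make the example run; the bounds used in the application are not recoverable from the text above.

```python
import math

def risk_constant(ir, c_min=1.0, c_max=1.5, steepness=10.0, midpoint=0.5):
    """Hypothetical sigmoid: the constant rises from c_min to c_max with risk."""
    s = 1.0 / (1.0 + math.exp(-steepness * (ir - midpoint)))
    return c_min + (c_max - c_min) * s

def risk_weight(ir, **sigmoid_kwargs):
    """Weight = max(1 - c * risk, 0), so high risk records get small weights."""
    c = risk_constant(ir, **sigmoid_kwargs)
    return max(1.0 - c * ir, 0.0)

# Weights over a grid of risk values from 0 to 1.
risks = [k / 10 for k in range(11)]
weights = [risk_weight(r) for r in risks]
```

Because the constant grows with the risk, records with high risk are pushed toward a weight of 0 faster than a fixed constant would push them, while records with risk near 0 keep a weight near 1.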
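We formulate the following risk-adjusted pseudo posterior distribution to induce misspecification into our re-estimated synthesizer,
$$p^{\boldsymbol{\alpha}}\left(\boldsymbol{\theta} \mid \mathbf{y}\right) \propto \left[\prod_{i=1}^{n} p\left(y_i \mid \boldsymbol{\theta}\right)^{\alpha_i}\right] p\left(\boldsymbol{\theta}\right), \tag{10}$$
where $\boldsymbol{\theta}$ collects the parameters of the synthesizer and $\alpha_i \in [0, 1]$.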
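To illustrate how exponentiating likelihood contributions strengthens the prior for downweighted records, the sketch below uses a deliberately simple stand-in model: a Normal likelihood with known standard deviation and a conjugate Normal prior on the mean. This toy model, the function names, and the data values are assumptions for illustration; the paper's synthesizer is the TDP mixture, not this model.

```python
import math

def log_pseudo_likelihood(y, alpha, mu, sigma=1.0):
    """Sum of alpha_i * log Normal(y_i | mu, sigma): the weighted log likelihood."""
    total = 0.0
    for y_i, a_i in zip(y, alpha):
        log_density = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                       - (y_i - mu) ** 2 / (2 * sigma ** 2))
        total += a_i * log_density
    return total

def pseudo_posterior_mean(y, alpha, prior_mean=0.0, prior_sd=1.0, sigma=1.0):
    """Closed-form pseudo posterior mean of mu under the conjugate Normal prior."""
    precision = 1 / prior_sd ** 2 + sum(alpha) / sigma ** 2
    weighted_sum = (prior_mean / prior_sd ** 2
                    + sum(a * v for a, v in zip(alpha, y)) / sigma ** 2)
    return weighted_sum / precision

y = [1.0, 1.2, 0.8, 1.1, 6.0]            # the last value plays a tail record
full = pseudo_posterior_mean(y, [1, 1, 1, 1, 1], prior_mean=1.0, prior_sd=2.0)
down = pseudo_posterior_mean(y, [1, 1, 1, 1, 0.1], prior_mean=1.0, prior_sd=2.0)
```

With all weights equal to 1 the usual posterior is recovered; downweighting the tail record pulls the pseudo posterior mean back toward the bulk of the data and the prior.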
4 Consistency and Asymptotic Covariance of the Risk-weighted Pseudo Posterior
In this section we demonstrate the frequentist consistency properties of our risk-adjusted, weighted pseudo posterior estimator to a point, $\theta^{*}$, in a space that may not contain the true data generating model, $P_{\theta_0}$, due to our intentional inducing of misspecification in our synthesizer to reduce the probability of identification disclosure. We show that the rate of contraction is injured by the weights, $\alpha_i$ (where we recall that $\alpha_i$ denotes the risk-adjusted pseudo likelihood weight for data record, $i$). Our first result extends that of Bhattacharya et al. (2019) for misspecified models from a scalar weight to our setup of a vector of record-indexed weights. We show that the utilization of vector-weighting achieves consistency if the downweighting of record likelihood contributions grows progressively more sparse. We utilize this consistency result and proceed to show that the asymptotic covariance matrix for the pseudo MLE is different from that of the regular MLE due to scale and shape distortions induced by the $\alpha_i$. We further show that the sandwich form of the asymptotic covariance matrix for the pseudo MLE is different from the inverse Fisher information form for the pseudo posterior distribution, due to the failure of Bartlett’s second identity (because the pseudo likelihood is approximate). The implication is that uncertainty quantification from the credible intervals of the synthesizer will be incorrect, both because they are centered on $\theta^{*}$, rather than $\theta_0$, and because of the different forms for the asymptotic covariance matrices. Our second result specializes that of Kleijn and van der Vaart (2012) to our pseudo posterior construction.
4.1 Preliminaries
We address the case of independent observations for , where and we perform inference on . admits a density, with respect to dominating measure, , on the space, , where is a field of sets. We construct the measure, , dominated by on a product of measurable spaces, . In the sequel, we write and (rather than and ), though the dependence on is implied.
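We formulate our risk-weighted pseudo likelihood in a fashion to generalize Bhattacharya et al. (2019) as,
$$p^{\boldsymbol{\alpha}}_{\theta}(\mathbf{y}) = \prod_{i=1}^{n} p_{\theta}(y_i)^{\alpha_i(\mathbf{y})}, \tag{11}$$
where we allow $\alpha_i$ to be a function of $\mathbf{y}$, since it is constructed based on the disclosure probability for record, $i$. We suppress the dependence on $\mathbf{y}$ in the sequel for readability.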
We formulate the posterior distribution, , by convolving the pseudo likelihood, , with the prior distribution, , such that for any measurable set, ,
(12) 
where , which is a generalization of the definition from Bhattacharya et al. (2019) to incorporate risk-adjusted weights, $\alpha_i$.
Since our pseudo posterior formulation induces misspecification, we allow the true generating parameters, $\theta_0$, to lie outside the parameter space, $\Theta$. We will show in the sequel that our model contracts on $\theta^{*}$ in probability, where $\theta^{*}$ is the point that minimizes the KL divergence from $P_{\theta_0}$; that is,
(13) 
We show consistency under an extension of the Rényi divergence measure to a product measure space,
(14) 
where
(15) 
is defined as the affinity for observation, , such that , the affinity for the product measure space. We note that Bhattacharya et al. (2019) show and . These properties extend to the composition, on the product measure space.
We next construct an $\alpha$-weighted empirical distribution,
(16) 
where $\delta_{y_i}$ denotes the Dirac delta function with probability mass at $y_i$. We construct the associated scaled and centered empirical process, . The usual equally weighted empirical distribution, and associated, may be viewed as special cases. We may define the associated expectation functionals with respect to the weighted empirical distribution by .
We will construct the asymptotic covariance matrix for the following centered and scaled empirical process under our misspecified estimator,
(17) 
from which we define, , where denotes the pseudo MLE.
We next introduce two variance expressions that we will utilize to construct the asymptotic covariance matrix of the pseudo MLE under our employment of intentional misspecification,
(18)  
(19)  
(20) 
which is a risk-adjustment-weighted version of the Fisher information. The second weighted variance expression we define is,
(21)  
(22) 
which is the variance of the weighted score function and the middle term in the sandwich estimator for the asymptotic covariance matrix of the pseudo MLE. The unweighted versions of the above expressions are obtained by replacing $\alpha_i$ with $1$.
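The roles of the two weighted variance expressions can be sketched numerically in a toy case: a Normal mean with unit variance, where the weighted score for record $i$ is $\alpha_i (y_i - \mu)$. The names `h` (weighted information) and `j` (variance of the weighted score), the division by $n$, and the data are assumptions for illustration and need not match the normalization of Equations (18) to (22).

```python
def pseudo_mle_mean(y, alpha):
    """Pseudo MLE of a Normal mean: the alpha-weighted average of y."""
    return sum(a * v for a, v in zip(alpha, y)) / sum(alpha)

def sandwich_terms(y, alpha):
    """Return (h, j, h^-1 * j * h^-1) for the unit-variance Normal mean model."""
    n = len(y)
    mu_hat = pseudo_mle_mean(y, alpha)
    # h: average weighted negative second derivative of the log likelihood,
    # which equals alpha_i for each record in this model.
    h = sum(alpha) / n
    # j: average squared weighted score, (alpha_i * (y_i - mu_hat))^2.
    j = sum((a * (v - mu_hat)) ** 2 for a, v in zip(alpha, y)) / n
    return h, j, j / (h * h)

y = [1.0, 2.0, 3.0, 10.0]
h_unw, j_unw, v_unw = sandwich_terms(y, [1.0, 1.0, 1.0, 1.0])
h_w, j_w, v_w = sandwich_terms(y, [1.0, 1.0, 1.0, 0.2])
```

With all weights equal to 1, `h` is 1 and the sandwich reduces to the sample variance of the data; downweighting the extreme record changes both the information term and the score variance, so the sandwich no longer equals the inverse of the weighted information.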
4.2 Main Result
The following conditions guarantee the consistency result in Theorem 1 and the forms for the asymptotic covariance matrices of the distributions for the pseudo MLE and the pseudo posterior in the following theorems. Theorem 2 extends a theorem of van der Vaart (1998) to derive the asymptotic expansion of our form for a centered and scaled pseudo MLE. Theorem 3 specializes a result in Kleijn and van der Vaart (2012) on the sandwich form of the covariance matrix to our pseudo MLE. These conditions also allow use of Kleijn and van der Vaart (2012) and van der Vaart (1998) to specify the form of the asymptotic covariance matrix for the pseudo posterior distribution in Theorem 4. We demonstrate that the asymptotic covariance matrices are different for each of the MLE, the pseudo MLE and the pseudo posterior.
(A1) (Prior mass covering truth) We construct a KL neighborhood of with radius, , with,
(23)
Restrict the prior, , to place positive probability on this KL neighborhood,
(24)
(A2) (Control size of ) Let and , where denotes the number of elements in . Let and . such that for constants and sufficiently large,
(A3) (Continuity) For each (an open subset of Euclidean space), let be a measurable function (of ) and differentiable at for almost every (with derivative, ), such that for every and in a neighborhood of with , we have a Lipschitz condition:
(A4) (Local Quadratic Expansion) The Kullback-Leibler divergence with respect to has a second order Taylor expansion about ,
where is a positive definite matrix.
(A5) (Bartlett’s First Identity)
The first two conditions are required for consistency of our pseudo posterior estimator. The first condition requires the prior to place some mass on a KL ball near $\theta^{*}$, as defined in Equation (13). The second condition outlines a dyadic subgrouping of data records, where contains those records whose likelihood contributions are downweighted to lessen the estimated identification disclosure risk for those records in the resulting synthetic data. The second subset of records, , contains those records that are minimally downweighted due to nearly zero values for identification disclosure risks. Since , the constant value, , for all units in approaches from the left. We show in the sequel that the consistency result to $\theta^{*}$ for the synthesizer is dominated by the likelihood weighting for records in the downweighted set, . We set each $\alpha_i$ on the set based on the actual data value for observed record, $i$, $y_i$, and the synthetic data, with implicit conditioning on the model, , after integrating out from the synthesizer. So the $\alpha_i$ may be expected to express mutual dependence in the general case, unlike the $y_i$, which are assumed to be independent. While our consistency result allows for dependence among the $\alpha_i$, condition (A2) restricts the number of downweighted records (where ) to grow at a slower rate than the sample size, $n$, such that the downweighting becomes relatively more sparse. This restriction accords well with the assignment of a relatively high identification risk score to records in only small portions of the distribution mass, such as the tail.
We next list our results for consistency of the pseudo posterior distribution and an associated Bernstein-von Mises result for the pseudo MLE. We use this result to enumerate the form of the associated asymptotic covariance matrix for the pseudo MLE, followed by that for the pseudo posterior. All proofs are contained in a Supplement that accompanies this manuscript, except where otherwise noted.
Theorem 1
(Contraction of the pseudo posterior distribution). Let . Define and . Let and . Let be as defined in Equation (13). Assume that satisfies and suppose conditions 4.2 Main Result and condition 4.2 Main Result hold. Let . Then for any and ,
(25) 
hold with probability at least .
Since , while , the first term dominates with increasing , so that the is the dominating penalty on the contraction rate of the pseudo posterior. Even though the downweighting becomes relatively more sparse due to condition 4.2 Main Result, it is the maximum value of for on the set of downweighted records that penalizes the rate. We observe that the rate of contraction is injured by factors, and , where . Since , our result generalizes Bhattacharya et al. (2019) to allow a tempering of a portion of the posterior distribution and there is a penalty to be paid in terms of contraction rate for the tempering. Since we induce the misspecification through the weights, , the distance of the point of contraction, from the true generating parameters, , and the contraction rate on this point are both impacted by the induced misspecification. The requirement for increasing sparsity in the number of downweighted record likelihood contributions, however, ensures that will be relatively close to , which ensures the utility of our estimator.
Theorem 2
(Asymptotic normality of the pseudo MLE) Suppose conditions 4.2 Main Result4.2 Main Result hold. Then,
(26a)  
(26b)  
(26c) 
Theorem 3
(Asymptotic covariance matrix of the pseudo MLE) Suppose conditions 4.2 Main Result4.2 Main Result hold. Then,
(27a)  
(27b)  
(27c)  
(27d) 
The scale of the asymptotic sandwich estimator of the weighted pseudo MLE is inflated relative to the ordinary MLE, which will produce overly wide confidence regions. The shape of the confidence regions of the pseudo MLE may also be different than the ordinary MLE due to the weighting of each contribution in and , such that the induced misspecification will impact the shape and scale of the resulting pseudo confidence regions, as well as their centering on .
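The inflation in scale can be illustrated in a deliberately simplified setting that is not the setting of the theorems above: i.i.d. normal data with known variance, weights treated as fixed constants, and the weighted mean as the pseudo MLE. Under these assumptions the sandwich form reduces to sigma2 * sum(alpha^2) / (sum(alpha))^2, which is never smaller than the ordinary MLE variance sigma2 / n. The function name, the weight values, and the number of downweighted records below are hypothetical.

```python
import numpy as np

def weighted_mean_sandwich_variance(alpha, sigma2):
    """Sandwich variance H^{-1} J H^{-1} of the weighted mean.

    Assumes i.i.d. N(theta, sigma2) data and fixed (non-random) weights.
    """
    alpha = np.asarray(alpha, dtype=float)
    H = alpha.sum() / sigma2          # negative expected Hessian of the weighted log-likelihood
    J = (alpha ** 2).sum() / sigma2   # variance of the weighted score
    return J / H ** 2

n = 1000
sigma2 = 1.0

# Hypothetical weights: 50 high-risk records downweighted to 0.2, the rest left at 1.
alpha = np.ones(n)
alpha[:50] = 0.2

var_weighted = weighted_mean_sandwich_variance(alpha, sigma2)
var_mle = sigma2 / n  # variance of the ordinary (unweighted) MLE

print(f"weighted: {var_weighted:.6f}, unweighted: {var_mle:.6f}")
```

With all weights equal to 1 the two variances coincide, and any unequal weighting makes the sandwich variance strictly larger, by the Cauchy-Schwarz inequality.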
Theorem 4
(Asymptotic normality of the pseudo posterior) Suppose conditions 4.2 Main Result4.2 Main Result hold. Then
(28) 
where may be the pseudo MLE or the pseudo posterior mean.
Proof 1
The asymptotic credibility region of the pseudo posterior distribution will not contract on the frequentist confidence region for the pseudo MLE or the ordinary MLE, because Bartlett’s second identity fails under a misspecified likelihood, which prevents the collapse of the sandwich form of the variance estimator in Equation (27c). The implication is that the pseudo posterior credibility regions may under- or over-cover.
5 Simulation Study
We simulate 1,000 univariate outcome values from a 2-component mixture of lognormal distributions with 15 categorical predictors. The resulting distribution of the outcome variable, value, is highly skewed, as shown in Figure 5. We run the TDP mixture synthesizer developed in Section 2 on the log of the response variable.
Computation of the TDP mixture synthesizer is done using the Stan programming language. We run the TDP mixture synthesizer on the simulated data for 10,000 iterations with 5,000 burn-in. We set for the Gamma prior for the DP concentration parameter . We set the maximum number of learned clusters, , as we expect few clusters since the logged data are relatively symmetric, and generate synthetic datasets, .
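A minimal sketch of how data of this general shape could be simulated is given below. The mixture weights, component means and standard deviations, the number of levels per predictor, and the seed are all hypothetical, since the text does not state them. As a further simplification, the predictors here are drawn independently of the outcome, which need not match the study's actual generating model.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)  # hypothetical seed

n = 1000

# Hypothetical 2-component mixture settings on the log scale.
mix_weights = np.array([0.6, 0.4])
log_means = np.array([-2.0, 0.0])
log_sds = np.array([0.5, 0.5])

# Draw a component label per record, then a normal value on the log scale.
component = rng.choice(2, size=n, p=mix_weights)
log_value = rng.normal(loc=log_means[component], scale=log_sds[component])

# Exponentiating gives a skewed, strictly positive lognormal-mixture outcome.
value = np.exp(log_value)

# 15 categorical predictors with a hypothetical 3 levels each.
predictors = rng.integers(low=0, high=3, size=(n, 15))
```

A synthesizer would then be fit to `log_value`, which is the more symmetric scale.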
Using the methods in Section 3.1, we compute record-level identification disclosure risks, , for all records, based on the percentage radius , which is preferred by the CE program, and the intruder’s knowledge of a known pattern (containing 5 out of the 15 predictors), the true value of each record and their identities. We then calculate record-level weights, , using the methods in Section 3.2. To compute the record-level ’s, we use the scurve function in the LS2Wstat R package with the coefficient of slope of the curve, , and set and . The vectorized weights, , are then used to exponentiate the likelihood contribution of records. The new vector-weighted, pseudo-posterior version of the TDP mixture synthesizer is again estimated using Stan with the same prior specification, and a new set of synthetic datasets, , is generated. As a comparison, a single scalar weight, is used for every record. The scalar value is chosen as that value which produces nearly the same file-level identification disclosure risk measure as the vectorized weights. We use the same prior specification, and generate synthetic datasets, denoted by . Two sets of record-level identification disclosure risks, one for and another for , are computed.
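The two steps described above, computing record-level risks and turning them into weights, can be sketched as follows. This rests on stated assumptions rather than on the exact formulas of Sections 3.1 and 3.2. The risk for a record is taken to be the average over synthetic datasets of T / c, where c counts the records in the same known pattern whose synthetic values fall within a percentage radius of the target's true value, and T indicates whether the target's own synthetic value is among them. The weight map `1 - risk` is a simple linear stand-in for the s-curve function used in the study. Function names and the radius value are hypothetical.

```python
import numpy as np

def identification_risks(y, synthetic, pattern, radius):
    """Average of T_i / c_i over synthetic datasets (assumed form of the risk).

    y         : true values, shape (n,)
    synthetic : synthetic values, shape (L, n), one row per synthetic dataset
    pattern   : label of the known categorical pattern for each record, shape (n,)
    radius    : percentage radius around the true value, e.g. 0.2 for 20%
    """
    y = np.asarray(y, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    pattern = np.asarray(pattern)
    n_draws, n = synthetic.shape
    risks = np.zeros(n)
    for i in range(n):
        same_pattern = pattern == pattern[i]
        for l in range(n_draws):
            # Records whose synthetic value lies within the radius of record i's true value.
            close = np.abs(synthetic[l] - y[i]) <= radius * np.abs(y[i])
            count = np.sum(close & same_pattern)
            if close[i]:  # the target's own synthetic value is a match, so count >= 1
                risks[i] += 1.0 / count
    return risks / n_draws

def risk_weights(risks):
    """Hypothetical linear map from risk to weight: higher risk gives a lower weight."""
    return np.clip(1.0 - np.asarray(risks, dtype=float), 0.0, 1.0)

# Tiny illustration: three records in one pattern, one synthetic dataset.
y = np.array([1.0, 1.05, 10.0])
synthetic = np.array([[1.0, 1.1, 10.0]])
pattern = np.array([0, 0, 0])

risks = identification_risks(y, synthetic, pattern, radius=0.2)
weights = risk_weights(risks)
```

In this toy example the isolated third record is the only match for its own true value, so it receives the maximum risk and the smallest weight, while the first two records share a neighborhood and split the risk between them.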
First, we investigate the utility of synthetic datasets. To do so, we compare the density plots of the value variable generated by the three different synthesizers to its true density plot in simulated data. In Figure 6 and Figure 7, the red curve is for the value variable in the simulated data, denoted as Data; the green curve is for one synthetic dataset of , generated from the synthesizer before applying weights, denoted as Synthesizer; the blue curve denotes the synthetic dataset of , generated from the vector-weighted synthesizer, denoted as Weights_V; and the purple curve is for one synthetic dataset of , generated from the scalar-weighted synthesizer after applying scalar weights, denoted as Weights_S. Two ranges of the value variable are plotted to allow closer examination of the density plots.
In both plots, the Synthesizer distribution is the closest among the three synthesizers to the Data distribution, indicating good model fitting of our proposed TDP mixture synthesizer in Section 2. This suggests high utility of the synthetic datasets, , before applying weights. The Weights_S distribution shows larger deviation from the Data distribution than does the Weights_V, indicating a decrease of synthetic data utility when applying scalar weights. An interesting takeaway is that the Weights_V distribution shows a higher peak at the modes (one around 0.1 and another around 1). Values in the downward portion on the side of each mode express relatively higher identification disclosure risks because they are in portions of the synthetic data density with relatively few other records. So the vector weighting has the effect of concentrating the modes as those higher risk records are moved closer to the mode, which reduces their risk. Their assigned synthesized values are more concentrated around the modes than in Data. We also more clearly observe in Figure 7 that the Weights_V vector-weighted model shrinks high-risk records in the tail back towards the modes, while generally preserving (and even accentuating) the modes.
Figures 8 through 10 provide further support and insight for the behaviors of different synthesizers. These scatter plots display the values for the synthetic data versus the real data. The Synthesizer in Figure 8 clearly preserves the value variable record-by-record the best, as the dots lie closely along the line. The Weights_V in Figure 9 surgically downweights portions of the value distribution. We can see that most of the largest values at the upper right corner are assigned low synthesized values (around the two modes, 0.1 and 1). This phenomenon is expected because the vector weights are downweighting the likelihood contribution of records with high identification disclosure risks. Figure 9 also shows that other records with modest values can have high identification disclosure risks as well, and are shrunk by Weights_V. Applying scalar weights produces a fanning pattern of the dots in Figure 10, indicating a decrease of the preservation of the modes of the density under Weights_S. Scalar weights downweight the likelihood contribution of each record equally, neglecting the different degrees of privacy protection encoded by the synthesizer for different records. The corresponding densities, especially for Weights_S in Figure 7, show a flattening phenomenon that results in a decrease of data utility.
We now turn to the examination of the risk profiles for the synthetic datasets. Figure 11 presents the violin plots of the identification disclosure risks of the synthetic datasets generated from the vector-weighted synthesizer, Weights_V, and the scalar-weighted synthesizer, Weights_S, and plots them side-by-side. The figure also includes the risk distribution for the original synthesizer, labeled “Synthesizer”. The risk distribution plot for the original synthesizer shows a longer upper tail than the other two, indicating that both Weights_V and Weights_S successfully lower the maximum identification disclosure risks of the records. This is supported by Table 12, where the identification disclosure risks for the top 10 risky records are greatly decreased in the columns Weights_V and Weights_S, relative to the unweighted synthesizer shown in the column labeled “Synthesizer”; in fact, the vector-weighting successfully lowers the risks for 8 of the top 10 risky records to 0, and the remaining 2 to almost 0. The superior risk reduction performance of Weights_V over Weights_S further demonstrates that our vector-weighting is able to surgically distort high risk portions of the distribution, providing higher disclosure protection for targeted records.
Figure 11 also reveals a higher concentration of low identification disclosure risks by Weights_V, evident in its longer and bigger lower tail.
A closer examination of the column “Data Value” in Table 12 suggests that not all of the top 10 risky records are records with extremely high values. We, therefore, present Table 13, where the identification disclosure risks of the top 10 records by size or magnitude are shown, under the Synthesizer, Weights_V, and Weights_S. Again, Weights_V gives high performance in risk reduction. It is worth noting that the first record, with Data Value 1.710438 in Table 13, starts with a not-so-high identification disclosure risk of 0.3818 under Synthesizer. It ends up with a not-so-low identification disclosure risk of 0.1455 under Weights_V. This is not surprising because its weight is inversely proportional to its beginning identification disclosure risk. With a relatively low , its calculated weight would not downweight its likelihood contribution as much as the other large size records in the table.
In summary, our proposed TDP mixture synthesizer preserves high utility; however, its high utility results in portions of the data distribution with high identification disclosure risks. Using a vector-weighted version of the TDP mixture synthesizer can surgically distort those high risk portions, therefore producing synthetic data that provides higher disclosure protection. It also maintains a reasonably high level of utility, successfully balancing the risk-utility trade-off of synthetic data.
Table 12: Identification disclosure risks for the top 10 risky records under Synthesizer, Weights_V, and Weights_S.

Data Value   Synthesizer   Weights_V   Weights_S
0.5959832    0.9311        0.0486      0.2338
0.4713665    0.8756        0.0000      0.1936
3.3694715    0.9256        0.0000      0.0000
2.4854301    0.8264        0.0000      0.0972
0.6002610    0.9118        0.0000      0.2382
0.5880128    0.8264        0.0486      0.1917
0.5348753    0.8500        0.0000      0.0957
2.9259830    0.9194        0.0000      0.0484
0.6818285    0.8952        0.0000      0.5194
2.4694975    0.9212        0.0000      0.0000
Table 13: Identification disclosure risks for the top 10 records by size or magnitude under Synthesizer, Weights_V, and Weights_S.

Data Value   Synthesizer   Weights_V   Weights_S
1.710438     0.3818        0.1455      0.1439
3.369471     0.9256        0.0000      0.0000
2.485430     0.8264        0.0000      0.0972
1.665322     0.7155        0.0488      0.0952
2.963501     0.7266        0.0000      0.0484
2.245890     0.6811        0.0000      0.1946
2.091579     0.6290        0.0484      0.1935
2.925983     0.9194        0.0000      0.0484
2.469497     0.9212        0.0000      0.0000
2.005061     0.8130        0.0000      0.1891
6 Application to Synthesis of CE Family Income
We now apply our methods to synthesizing the sensitive family income variable, , in the CE data sample introduced in Section 1.1. We run the TDP mixture synthesizer on the log of the family income variable, and compute the record-level identification disclosure risk for each CU. We then construct the vector of weights, , and apply them to formulate the pseudo posterior synthesizer from which we generate new collections of synthetic datasets. The new vector-weighted version of the TDP mixture synthesizer is run using Stan with the same prior specification as in the simulation study, and a new set of synthetic datasets, , is generated. We will evaluate the utility and identification disclosure risk profiles of the vector-weighted synthesizer and comparator synthesizers introduced and discussed in Section 6.1 and Section 6.2. We employ radius, , as before, but also examine sensitivity of the identification risk to different choices of the percentage radius and compare our proposed methods to the CE survey program’s current practice of top-coding.
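Applying the weights to form the pseudo posterior amounts to raising each record's likelihood contribution to the power of its weight, that is, multiplying its log-likelihood contribution by the weight. The sketch below shows this for a single normal likelihood on the log scale, which is a stand-in assumption for the TDP mixture used in the paper; the function name and the example values are hypothetical.

```python
import numpy as np

def weighted_normal_loglik(y, alpha, mu, sigma):
    """Pseudo log-likelihood: sum_i alpha_i * log N(y_i | mu, sigma^2).

    A single normal component is used here only for illustration.
    """
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_dens = -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((y - mu) / sigma) ** 2
    return float(np.sum(alpha * log_dens))

# Hypothetical log-income values; the last record is treated as high risk.
log_income = np.array([10.2, 10.9, 11.1, 13.5])
alpha = np.array([1.0, 1.0, 1.0, 0.1])

full = weighted_normal_loglik(log_income, np.ones(4), mu=11.0, sigma=1.0)
pseudo = weighted_normal_loglik(log_income, alpha, mu=11.0, sigma=1.0)
```

A weight of 1 leaves a record's contribution unchanged and a weight of 0 removes it entirely, so downweighted records exert less pull on the fitted model. In a Stan program the same idea would typically be written by adding `alpha[i]` times the record's log density to `target`.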
6.1 Utility
Figures 14 and 15 compare the density plots of the family income variable in the CE data (denoted as Data), the synthesizer before weighting (denoted as Synthesizer), and the synthesizer after vector-weighting (denoted as Weights_V). In Figure 15, which displays the right tail of the data, Weights_V shows a higher peak than Data due to the relative concentration of the mode. As we saw in the simulation study, a high-peak mode will produce more isolated record values in the downward slope portions of the distribution, which induces a contraction of points towards the more populated mode after risk-based weighting.
The scatter plots in Figure 16 and Figure 17 show the synthetic versus true data values for the original synthesizer (labeled “Synthesizer”) and the vector-weighted synthesizer (labeled “Weights_V”), respectively, and provide more insight. The original synthesizer demonstrates a high degree of fanning, indicating a relatively large amount of distortion of the by-record values from the true data, even though the above density figures show very good utility (because the density for Synthesizer maps well to the true data distribution). This phenomenon results from a mixing of records that occurs when generating new synthetic data from our synthesizer estimated on the real data, and it encodes a higher level of identification disclosure protection; in particular, the binary indicator used in our proposed identification disclosure risk measure in Equation (8) is most likely 0 (the synthetic family income being outside the predefined radius from the true family income).
The increased fanning pattern of Figure 17 for Weights_V, as compared to that for the original synthesizer shown in Figure 16, appears to show generally more distortion. Yet, we note that the envelope of distortion is reduced (which is seen by looking at the left- and right-hand sides of the scatter plots), which indicates that the vector-weighting focuses the distortion on those high-risk records that lie in the portion of the distribution away from the main features of the data. The concentration of synthetic values at the mode induced by the weighting actually reduces distortion for some points as compared to the original synthesizer, while the overall data distribution, as shown in Figure 14, is well preserved. These results suggest that Weights_V is inducing a surgical distortion in high risk portions of the original data distribution.
Figure 18 represents the effects of top-coding, which is designed to distort the right tail of the family income distribution for especially high income values, while leaving untouched other portions of the data distribution. So, while it is even more surgical in targeting a portion of the real data distribution for distortion, this method implicitly assumes that the relatively high risk records are solely those that express large magnitude values for the income variable. We next reveal that this assumption is false when comparing the identification disclosure risks under the three methods.
6.2 Identification disclosure risks
To compare the risk profiles of the synthetic datasets, we use violin (density) plots shown in Figure 19. In addition to the Synthesizer and Weights_V plots on the left, we include the distributions for the identification disclosure risks of the top-coded family income (that is currently publicly available in CE PUMD). We use three different values of the percentage radius in the risks calculation: 5%, denoted as TC_5%; 10%, denoted as TC_10%; and 20%, denoted as TC_20%. Recall that for the Synthesizer and Weights_V plots, we set . Our known pattern is composed of the following categorical variables, {gender, age, education level, marital status, earner}, when calculating the identification disclosure risks.
Firstly, we observe that Weights_V lowers the overall identification disclosure risks as compared to Synthesizer. This confirms that our proposed vector-weighting method provides higher disclosure protection, reducing the peak record-level identification disclosure risks. The columns Synthesizer and Weights_V in Table 20, which shows the identification risk for the top 10 risky records as measured from the synthetic data produced by the original synthesizer, further support the risk reduction performance of Weights_V. In Table 20 and Table 21, the range, instead of the actual data value, is provided in column “Data Value” for confidentiality.
Secondly, we note that while top-coding should in theory provide disclosure protection, under our identification disclosure risk measures, top-coding only provides such protection for CUs with extremely large family income (as is evident in their long tails and large bulbs around 0 in Figure 19 and the column TC_20% in Table 21). At the same time, since the majority of the family income values are not top-coded (see Figure 18), top-coding fails to provide any disclosure protection to most of the CUs that express high identification disclosure risks (as evident in their upper portions in Figure 19). Their identification disclosure risks are high because their family income is unchanged, resulting in only a single value near the target income, while yet (which is a perfect risk situation).
We recommend that statistical agencies reevaluate their practice of top-coding for disclosure protection. Based on our findings, we recommend the alternative of synthetic data for the CE PUMD. For CUs with unacceptably high identification disclosure risks, we recommend using our proposed risk-weighted pseudo posterior approach to surgically downweight the likelihood contributions for high risk records for synthetic data generation.
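Table 20: Identification disclosure risks for the top 10 risky records, as measured under the original Synthesizer, for Synthesizer, Weights_V, and TC_20%. Data values are reported as ranges for confidentiality.

Data Value      Synthesizer   Weights_V   TC_20%
(10K, 50K)      0.6258        0.0925      0.9762
(50K, 100K)     0.5318        0.1717      0.8312
(100K, 250K)    0.5725        0.1635      0.8108
(100K, 250K)    0.5274        0.2261      0.8889
(100K, 250K)    0.5369        0.1560      0.7738
(100K, 250K)    0.5258        0.1348      0.8788
(10K, 50K)      0.5270        0.3250      0.9400
(100K, 250K)    0.5310        0.1708      0.8214
(100K, 250K)    0.5925        0.0460      0.9080
(50K, 100K)     0.5351        0.2629      0.7113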
Table 21: Identification disclosure risks for ten records in the (250K, 3,000K) family income range, for Synthesizer, Weights_V, and TC_20%. Data values are reported as ranges for confidentiality.

Data Value        Synthesizer   Weights_V   TC_20%
(250K, 3,000K)    0.0000        0.0486      0.0000
(250K, 3,000K)    0.0482        0.1446      0.0000
(250K, 3,000K)    0.0000        0.2473      0.0000
(250K, 3,000K)    0.0000        0.1440      0.0000
(250K, 3,000K)    0.0496        0.0987      0.0000
(250K, 3,000K)    0.0967        0.0000      0.9565
(250K, 3,000K)    0.0986        0.0989      0.0000
(250K, 3,000K)    0.0000        0.0986      0.0000
(250K, 3,000K)    0.0987        0.0000      0.0000
(250K, 3,000K)    0.0000        0.0000      0.0000
7 Conclusion
We propose a general framework, configured as a vector-weighted, pseudo posterior distribution, that facilitates achievement of a closer-to-optimal trade-off between identification disclosure risk and utility for the data analyst in synthetic datasets publicly released by statistical agencies. Our risk-weighted, pseudo posterior formulation surgically downweights the likelihood contributions for those records in relatively high risk portions of the true data distribution, while leaving the remainder of the data distribution relatively undistorted. This approach facilitates the utilization of a flexible, nonparametric synthesizer that reproduces the true data distribution well, because the risk-based weighting may be used to lower identification disclosure risks. So the statistical agencies are not required to conduct a search over the space of synthesizing models in the hope of finding one that both meets a desired identification risk threshold and provides excellent utility.
We demonstrate in a simulation study, and for a real data application to the CE data sample, that our proposed risk-weighted synthesizer maintains a high level of data utility while providing superior disclosure protection, successfully balancing the risk-utility trade-off of synthetic data. Moreover, our proposed risk-weighted pseudo posterior approach is generalizable, applicable to any synthesizer that expresses high risk portions of the resulting synthetic data distribution. Our approach is supported with a frequentist consistency guarantee under the condition that the percentage of records that are downweighted grows more sparse in the limit of the sample size, which accords well with the targeting of small portions of the real data distribution for distortion that express relatively high identification disclosure risks. From a theoretical standpoint, our use of vector-weighting achieves the goal of producing an estimator that contracts on a that is closer to the truth, , than is possible under scalar weighting because the downweighting is confined to a portion of the data distribution. Our novelty is to leverage the idea of misspecification in the literature, where it is used to address corruption in the data generating process, and redirect it to the purposeful inducing of misspecification in order to encode privacy protection.
Our proposed vector weights synthesizer has superior risk reduction performance over the scalar weights synthesizer, due to its flexibility of targeting records with high risks individually. An opportunity for future work is a vector-scalar combined weight, such that statistical agencies could insert a predefined risk threshold for all records with a scalar weight, and then add a record-specific vector weight to each record individually, if necessary. This combined approach can give statistical agencies even more flexibility in targeting high risk records when producing synthetic data.
Another path of future work is the further evaluation of the top-coding practice for disclosure protection, and its comparison to our proposed risk-weighted pseudo posterior synthesizer. For the CE survey program and its disclosure risk evaluation practice, further investigation of the choice of percentage radius is of particular interest.
References
An and Little (2007) An, D. and Little, R. J. A. (2007). Multiple imputation: an alternative to top coding for statistical disclosure control. Journal of the Royal Statistical Society, Series A 170, 923–940.
Bhattacharya et al. (2019) Bhattacharya, A., Pati, D., and Yang, Y. (2019). Bayesian fractional posteriors. The Annals of Statistics 47, 1, 39–66.
Dimitrakakis et al. (2017) Dimitrakakis, C., Nelson, B., Zhang, Z., Mitrokotsa, A., and Rubinstein, B. I. P. (2017). Differential privacy for Bayesian inference through posterior sampling. J. Mach. Learn. Res. 18, 1, 343–381.
Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, 265–284.
Ghosal et al. (2000) Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 500–531.
Hu (2018) Hu, J. (2018). Bayesian estimation of attribute and identification disclosure risks in synthetic data. ArXiv e-prints.
Hu et al. (2018) Hu, J., Reiter, J. P., and Wang, Q. (2018). Dirichlet process mixture models for modeling and generating synthetic versions of nested categorical data. Bayesian Analysis 13, 183–200.
Hu and Savitsky (2018) Hu, J. and Savitsky, T. D. (2018). Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys. ArXiv e-prints.
Kleijn and van der Vaart (2012) Kleijn, B. and van der Vaart, A. (2012). The Bernstein-von-Mises theorem under misspecification. Electron. J. Statist. 6, 354–381.
Miller and Dunson (2018) Miller, J. W. and Dunson, D. B. (2018). Robust Bayesian inference via coarsening. Journal of the American Statistical Association 0, 0, 1–13.
Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 2, 249–265.
Quick et al. (2015) Quick, H., Holan, S. H., Wikle, C. K., and Reiter, J. P. (2015). Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography. Spatial Statistics 14, 439–451.
Savitsky and Toth (2016) Savitsky, T. D. and Toth, D. (2016). Bayesian Estimation Under Informative Sampling. Electronic Journal of Statistics 10, 1, 1677–1708.
Stan Development Team (2016) Stan Development Team (2016). RStan: the R interface to Stan. R package version 2.14.1.
van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Williams and Savitsky (2018) Williams, M. R. and Savitsky, T. D. (2018). Bayesian Uncertainty Estimation Under Complex Sampling. ArXiv e-prints.
8 Proof of Theorem 1
Let us define the following subset of ,
which is the restricted set for which we will bound the pseudo posterior distribution, , from above to achieve the result of Theorem 1. We begin with the statement and proof of Lemma 1, which extends Lemma 8.1 of Ghosal et al. (2000) to our pseudo posterior in order to provide a concentration inequality that bounds, in probability, the denominator of the pseudo posterior distribution, , from below.
8.1 Enabling Lemma
Lemma 1
(Concentration Inequality) Suppose condition 4.2 Main Result holds. Define and . For every and measure on the set , we have for every , and sufficiently large,