Privacy protection is an important research topic that attracts attention from statistical agencies and private companies alike. A popular approach focuses on encoding privacy protection into a summary statistic composed from record-level data by adding noise proportional to the “sensitivity” of the statistic, defined as the supremum, over the space of databases, of the change in the value of the statistic from the inclusion of a single data record. Dwork et al. (2006) construct a mechanism that employs a Laplace-distributed perturbation of a target statistic such that the resultant statistic achieves a privacy guarantee under the Differential privacy (DP) framework. The guarantee is represented by a budget, some of which is expended for each query a user makes through the mechanism to the underlying database, which is closely held by the publishing agency.
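The Laplace perturbation described above can be sketched in a few lines of Python. This is only an illustration of the standard Laplace mechanism, not code from this work: the function name, the count-query example, and the budget value are assumptions, and the noise scale is taken to be the sensitivity divided by the privacy budget, as in the usual statement of the mechanism.

```python
import numpy as np


def laplace_mechanism(statistic, sensitivity, epsilon, rng=None):
    """Return `statistic` perturbed by Laplace noise with scale sensitivity / epsilon.

    `sensitivity` is the largest change in the statistic from including one record;
    `epsilon` is the privacy budget spent on this single query (both assumed known).
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return statistic + rng.laplace(loc=0.0, scale=scale)


# Hypothetical example: a count query changes by at most 1 when one record is included.
noisy_count = laplace_mechanism(120.0, sensitivity=1.0, epsilon=0.5,
                                rng=np.random.default_rng(0))
```

A smaller budget yields a larger noise scale, which is the usual trade-off between privacy and accuracy for a single released statistic.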
A related approach to privacy protection is the release of a synthetic record-level database. This approach replaces the entire closely-held (by the statistical agency) database with a synthetically generated record-level database. The synthetic database is released to the public who would use it to conduct any analyses of which they would conceive for the real, confidential record-level data. As a result of releasing a synthetic database encoded with privacy protection, the synthetic data approach replaces multiple queries performed on a summary statistic with the publication of the synthetic database, such that the synthetic data approach is independent of the specific queries performed by users or putative intruders.
Dimitrakakis et al. (2017) demonstrate theoretical results for the Bayesian posterior distribution, which may be employed as a mechanism for synthetic data generation; specifically, if the log-likelihood satisfies a Lipschitz condition with a finite bound, then the posterior mechanism achieves a DP guarantee for each posterior draw of the model parameter(s). However, Dimitrakakis et al. (2017) acknowledge that computing the Lipschitz bound in practice, over the space of databases under the use of the log-likelihood, is particularly difficult. They specify relatively simple Bayesian probability models where the Lipschitz bound is analytically available. They acknowledge the lack of generalization to more complicated models, which are required to produce a high level of utility for the user community and for which the bound is not analytically available. Relatively simple Differentially private Bayesian synthesizers are similarly proposed by Machanavajjhala et al. (2008), Abowd and Vilhuber (2008), McClure and Reiter (2012), and Bowen and Liu (2016). These methods are either limited to specific data types or computationally infeasible for data of a reasonable dimension.
A common approach for generating parameter draws is the exponential mechanism of McSherry and Talwar (2007), which takes a non-private mechanism for the parameters as input and produces parameter draws with a DP guarantee.
The exponential mechanism releases values of from a distribution proportional to,
where is a utility function, is the sensitivity, defined globally over ; is the Hamming distance between . Each draw of from the exponential mechanism satisfies , where is a budget target supplied by the publishing statistical agency.
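For intuition, the following Python sketch samples from an exponential mechanism over a finite grid of candidate parameter values. It uses the standard textbook form of McSherry and Talwar (2007), in which the selection probability is proportional to the exponential of the budget times the utility divided by twice the sensitivity; the finite candidate grid, the function name, and the toy utilities are assumptions made for illustration and are not the construction used in this paper.

```python
import numpy as np


def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity)).

    Returns the sampled candidate and the full vector of selection probabilities.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.asarray(utilities, dtype=float)
    logits = epsilon * utilities / (2.0 * sensitivity)
    logits = logits - logits.max()      # stabilise the exponentials numerically
    probs = np.exp(logits)
    probs = probs / probs.sum()         # normalise to a proper distribution
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx], probs


# Hypothetical grid of three parameter values with made-up utilities.
draw, probs = exponential_mechanism([0.0, 0.5, 1.0], [-3.0, -1.0, -2.0],
                                    sensitivity=1.0, epsilon=1.0,
                                    rng=np.random.default_rng(1))
```

A larger sensitivity flattens the selection probabilities toward uniform, which is one way to see why a worst-case sensitivity can cost utility.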
The exponential mechanism takes as inputs a utility function and its sensitivity, the latter constructed as a supremum over the space of data and, simultaneously, the parameter space. Wasserman and Zhou (2010) and Snoke and Slavkovic (2018) construct utility functions based on the real and synthetic datasets (e.g., the Kolmogorov-Smirnov distance between the empirical distributions of the real and synthetic datasets) that are naturally bounded, resolving the challenge of using the log-likelihood as the utility function. However, there is a large, and perhaps intractable, computational cost to the use of these naturally bounded utilities to draw parameter samples from the distribution constructed from the exponential mechanism. For example, Snoke and Slavkovic (2018) must compute their utility statistic multiple times for each proposed parameter value under a Metropolis-Hastings algorithm used to draw samples under the exponential mechanism. Furthermore, Snoke and Slavkovic (2018) assume the existence of some synthesizing distribution from which to draw the synthetic data needed to compute their utility. In practice, this synthesizing distribution will be defined as the posterior predictive distribution, which means the posterior distribution must be estimated.
This work focuses on extending the pseudo posterior synthesizer in Hu and Savitsky (2019a) as an alternative to the exponential mechanism, with the purpose of achieving a DP disclosure risk guarantee. Hu and Savitsky (2019a) design a record-indexed weight that is inversely proportional to their construction for the identification risk probability of each record. The vector of weights is subsequently applied to the likelihood function of all records to form the pseudo posterior, in which the likelihood contribution of each record is raised to the power of its weight and the resulting pseudo likelihood is multiplied by the prior distribution for the model parameters, given the model hyperparameters. This construction employs a record-indexed, risk-based weight vector to surgically downweight high-risk records in the estimation of a pseudo posterior distribution for the model parameters, which is subsequently used to generate and release a synthetic record-level database. The authors show that this selective downweighting of records reduces the average of by-record risks as compared to an unweighted synthesis, while inducing only a minor reduction in utility. Hu and Savitsky (2019a) base their risk measure on a calculated probability of identification for a record. They cast a radius around the true data value for each record and count the number of record values that lie outside of the radius, which directly measures the extent to which the target record is isolated and, therefore, easier for an intruder to discover by random guessing. While their risk measure appeals to intuition, it is based on an assumption about the behavior of a putative intruder. By contrast, the DP framework makes no assumptions about the behavior of an intruder.
In this work, we focus on the pseudo posterior synthesizer and the use of the log-likelihood to construct a Lipschitz bound, extending Dimitrakakis et al. (2017) to provide a practical, general formulation for using weights based on record-level sensitivities. We show in the sequel that this formulation achieves dramatic improvements in the DP guarantee as compared to the unweighted, non-private synthesizer. We compute a local sensitivity specific to our application to a Consumer Expenditure Surveys (CE) sample and reveal mild conditions that guarantee its contraction to a global sensitivity result over the space of databases. Our results may be applied to any synthesizing mechanism envisioned by the data analyst in a computationally tractable way that only involves estimation of a pseudo posterior distribution for the model parameters.
The remainder of the paper is organized as follows: Section 2 generalizes some results on the Lipschitz conditions that guarantee a DP result for our proposed pseudo posterior mechanism; we also present a result that provides a DP guarantee to the pseudo posterior predictive mechanism for generating synthetic data. In Section 3, we describe the computation details of the Lipschitz bound and the formulation of the vector weights, as well as the connection between the scalar-weighted pseudo posterior mechanism and the exponential mechanism. Section 4 focuses on our application to synthesizing the family income in the CE sample, and presents the risk and utility profiles of Differentially private synthetic data generated under the proposed pseudo posterior mechanism, compared to other competing methods. We conclude with a discussion in Section 5.
2 Differential Privacy for Pseudo Posterior
We proceed to generalize some results on the Lipschitz conditions that guarantee a DP privacy result from Dimitrakakis et al. (2017) based on the unweighted posterior distribution to the risk-weighted, pseudo posterior distribution. We further re-purpose a result from Wasserman and Zhou (2010) that provides a DP guarantee to the pseudo posterior predictive mechanism for generating synthetic data that is based on integrating with respect to the privately guaranteed pseudo posterior distribution mechanism (used to generate the model parameters).
We begin by constructing the probability space, , equipped with prior distribution, . Fixing a database sequence, , we formulate the pseudo likelihood, for each under and exchangeable sampling. The pseudo likelihood uses weights, , where are fixed and known weights that are inversely proportional to the identification risk for each database record. We later construct these record-level weights, , to be inversely proportional to the local (to our database) Lipschitz bound, where , the log-likelihood, and . Our results on conditions that provide a Differential privacy guarantee for our pseudo posterior mechanism, however, apply to other definitions for finite weights, , so we save the details for constructing to the next section. We continue and define the weighted log-likelihood, .
Given the prior and pseudo likelihood, we construct the pseudo posterior distribution,
where normalizes the pseudo posterior distribution.
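The weighted log-likelihood and the unnormalized pseudo posterior constructed above can be illustrated with a small Python sketch. The normal likelihood with known scale, the normal prior on the mean, the function names, and the restriction of weights to the unit interval are all assumptions chosen to keep the example short; they are not the synthesizer used in this paper.

```python
import numpy as np


def normal_log_density(x, mean, sd):
    """Log density of a normal distribution, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sd ** 2) - 0.5 * ((x - mean) / sd) ** 2


def log_pseudo_posterior_unnormalized(theta, x, alpha, prior_sd=10.0, lik_sd=1.0):
    """Weighted log-likelihood plus log prior, up to the normalizing constant.

    Each record's log-likelihood contribution is multiplied by its weight, which
    is the same as raising its likelihood contribution to the power of the weight.
    """
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    if x.shape != alpha.shape:
        raise ValueError("x and alpha must have the same shape")
    if np.any((alpha < 0.0) | (alpha > 1.0)):
        raise ValueError("weights are assumed to lie in [0, 1]")
    weighted_log_lik = np.sum(alpha * normal_log_density(x, theta, lik_sd))
    log_prior = normal_log_density(theta, 0.0, prior_sd)
    return float(weighted_log_lik + log_prior)


# Hypothetical data: the last record is isolated and receives a small weight.
x_obs = [0.2, -0.1, 0.4, 8.0]
weights = [1.0, 1.0, 1.0, 0.1]
value = log_pseudo_posterior_unnormalized(0.0, x_obs, weights)
```

Setting every weight to one recovers the usual unnormalized log posterior, while setting a weight to zero removes that record's influence entirely.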
We construct the empirical distribution for the random sequence, ,
from which we construct the empirical cumulative distribution function,,
which we later use to demonstrate the contraction of a local privacy result achieved for some observed, , to a global privacy result for all , where , almost surely.
Finally, we define the Hamming distance between databases that we use to estimate the sensitivity, , from the inclusion of a data record in the database.
(Hamming distance) Given databases and , let denote the Hamming distance between and :
2.2 Main Results
Our task is to specify assumptions that guarantee our pseudo posterior mechanism achieves an expenditure under the Differential privacy framework. We extend the definition of Differential privacy from Dimitrakakis et al. (2017) to our weighted pseudo posterior mechanism.
which intuitively limits the change in the estimated pseudo posterior distribution for sets, , from the inclusion of a single record. Although the pseudo posterior distribution mass assigned to depends on , the expenditure is defined as the supremum over all .
Our main assumption bounds the log-likelihood ratio, uniformly, for all databases, that are at a Hamming distance (i.e. over all and . The uniform bound defines a maximum sensitivity in the log-likelihood from the inclusion of a record (at a Hamming distance from each database in the space of databases). Our intuition that the magnitude of this sensitivity for the log-likelihood ratio is directly tied to the resulting sensitivity of the pseudo posterior, , that determines the Differential privacy expenditure is confirmed in two results below.
Fix a and construct the Lipschitz function of over the space of databases,
Assumption 1 restricts such that the Lipschitz function of is uniformly bounded from above,
We note that the Lipschitz function of , , is constructed using the pseudo log-likelihood, , using weights, , each of which is . Choosing an that strongly downweights a highly sensitive record for an unweighted posterior mechanism (with a high magnitude log-likelihood ratio for some ) will reduce the sensitivity of that record under our pseudo posterior mechanism. We see in our first two results that reducing the sensitivity of the log-likelihood ratio directly improves (reduces) the Differential privacy expenditure, , using the pseudo posterior mechanism.
Our first result connects the Lipschitz bound, , for the log-likelihood to the supremum over for of the Kullback-Liebler (KL) divergence between the posterior densities (given versus ) from the inclusion of a database record.
and satisfying Assumption 1,
From Assumption 1, , so
Our next result directly connects the Lipschitz bound, , for the log-likelihood of Assumption 1 to resulting Differential privacy expenditure, , for each draw of from the pseudo posterior distribution.
(where is the -algebra of measurable sets on ) under satisfying Assumption 1:
i.e. the pseudo posterior is .
Our next result extends our Differential privacy guarantee from posterior draws of for models that satisfy Assumption 1 to draws of synthetic data, , constructed from the model posterior predictive distribution, which is the focus for our pseudo posterior mechanism.
Suppose is the pseudo posterior predictive probability mass for in set (the algebra of sets for ), constructed from our pseudo posterior model for that satisfies Differential privacy with expenditure, . Let be independent draws from . This defines a mechanism for that satisfies Differential privacy for any .
Theorem 2, to follow, establishes a global supremum over the space of databases, . In practice, we compute the Lipschitz bound specified in Assumption 1, locally, for our dataset, but we seek a privacy result that is global over the space of databases. We, next, specify two mild assumptions under which our local privacy result at the observed, , contracts on the global privacy result specified in Theorem 2.
The sequence, , are independently and exchangeably drawn,
This assumption says that the joint distribution of the sequence does not depend on the ordering of its elements; that is, the joint distribution is unchanged under any permutation of the record indices. We use this assumption to guarantee that the empirical distribution converges, almost surely, to the true marginal distribution for a sufficiently large sample size. Our next assumption restricts the class of distributions to those which are continuous with respect to the Lebesgue measure, which is appropriate for continuous data on the real line.
Let be the measure space for with joint distribution, .
where denotes the Lebesgue measure on all disjoint open and closed intervals (in a product space) for all .
This assumption is used to transition from a privacy result specific to a local dataset to a result that is global for a sufficiently large sample size. We tailor Assumption 3 to our income data, which are continuous on the real line, though a similar assumption may be used for other data types (e.g., absolute continuity with respect to a counting measure).
Density, , specified in Assumption 3, is everywhere (on its support) differentiable,
This assumption bounds the log likelihood over our observed .
Then for sufficiently large,
That is, if a local result exists for our observed database (where ), it contracts on the global result over the space of databases.
From Assumption 2, Berti and Rigo (1997) guarantee that the empirical cdf, , pointwise, by an extension of the Glivenko-Cantelli theorem to exchangeable random sequences, at a rate of , which guarantees . Contraction of the marginal distribution for , along with Assumption 3 that is dominated by a Lebesgue measure, guarantees that for sufficiently large our observed local database, , includes all sets, . Our observed database values include all sets with the result that the supremum over all , fixing our observed contracts on the supremum over all . Yet, since is on a set of measure , then the supremum over all values, , is equivalent to the supremum over all . This proves that a local result, if it exists, where , contracts on the global result of Theorem 2 at . We place no restrictions on the boundedness for values of , nor have we made any further assumptions about the boundedness of the log-likelihoods, , to achieve our result.
This result does not guarantee the existence of a local result, for sufficiently large, but does guarantee that a local result, if it exists, contracts on a global result at . This theorem makes no assumptions about the boundedness of the log-likelihood and only requires very mild assumptions to guarantee contraction of the local privacy result. In practice, we generate a single Lipschitz bound from our observed database that ensures there is no additional leakage of information about the data that could occur if one were to calculate multiple local results, in combination or in a sequence, to achieve a global one.
This next lemma guarantees a Lipschitz bound for sufficiently large under an assumption about the differentiability of the marginal density, .
For sufficiently large,
That is, a local result is guaranteed to exist for sufficiently large.
The log-likelihood ratio of Assumption 1 reduces to for each under evaluation of all databases, , since we divide the full data likelihood by a leave-one-out likelihood. Therefore, achieving the bound for the log-likelihood ratios reduces to bounding the log-likelihood values for the individual data component contributions. Assumption 2 provides the existence of the marginal density, and de Finetti’s theorem guarantees that arises from,
where Equation 25c utilizes Jensen’s inequality. De Finetti guarantees the existence of the generating likelihood, , the integrability of the log likelihood contributions, , over . Assumption 4 guarantees that is absolutely bounded over its support. Since we bound the likelihood contributions by , from above, and is absolutely bounded then so are the, for all .
3 Computation of Lipschitz Bound and Weights
In this section, we describe, in Section 3.1, the implementation details for computing the Lipschitz bound on our database and the formulation of the vector weights. In Section 3.2, we lay out the connection between the scalar-weighted pseudo posterior mechanism and the exponential mechanism, with a discussion of the implications for the data utility of Differentially private synthetic data generated under the two mechanisms.
3.1 Details of Computing the Lipschitz Bound and Weights
We proceed to describe the details of computing the Lipschitz bound for a given dataset, using the MCMC draws of the model parameters and the records in the dataset.
Firstly, for each record, we compute a vector of absolute values of log-likelihood ratios over the sampled parameter values from the pseudo posterior distribution, leaving out the data value for that record. These computations produce a matrix of log-likelihood values, with one row for each MCMC draw and one column for each record.
Secondly, we compute the maximum over the vector of log-likelihood ratios for each MCMC draw, giving a vector with one entry per draw. We take a specified quantile of this vector and set it to the Lipschitz bound, and we drop all draws that produce a maximum exceeding this bound in order to truncate the support of the parameters. In our data application in the sequel, the truncated bound is little different from the bound obtained without truncating posterior draws, though we truncate because our utility is barely reduced and because we demonstrate, in principle, that under numerical estimation of the pseudo posterior distribution its support may be truncated to achieve a bound as specified in Assumption 1, a possibility also discussed in Dimitrakakis et al. (2017).
Thirdly, we drop the applicable rows from the matrix, which reduces its number of rows to the number of retained draws. Each column of the reduced matrix is a truncated vector for one record. We take the maximum of each column, which produces a vector of by-record sensitivities on the positive real line. We transform this vector using a linear transformation such that each transformed value lies in a bounded range, where values closer to 1 indicate higher risk (i.e. sensitivity).
Finally, we formulate by-record weights from the transformed values, using a scaling parameter and a shift parameter to tune the risk-utility trade-off. As discussed in Hu and Savitsky (2019b), the scaling parameter has a global effect while the shift parameter has a local effect on the risk-utility trade-off. We will show in Section 4 the effects of different configurations of the scaling and shift parameters on the risk and utility profiles of the Differentially private synthetic dataset for the CE sample, generated under our proposed vector-weighted pseudo posterior mechanism.
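The four steps above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function name, the default quantile level, the min-max form of the linear transformation, and the weight formula (scaling times one minus the transformed sensitivity, plus the shift, truncated to the unit interval) are assumptions made so that the example runs end to end.

```python
import numpy as np


def lipschitz_bound_and_weights(abs_log_lik, quantile=0.99, c=1.0, g=0.0):
    """Compute a truncated Lipschitz bound and by-record weights.

    abs_log_lik : array of shape (draws, records) holding absolute
                  log-likelihood values, one row per MCMC draw.
    quantile    : assumed quantile level used to set the bound.
    c, g        : assumed scaling and shift parameters for the weights.
    """
    L = np.asarray(abs_log_lik, dtype=float)
    per_draw_max = L.max(axis=1)                 # step 2: maximum over records per draw
    delta = float(np.quantile(per_draw_max, quantile))
    keep = per_draw_max <= delta                 # drop draws that exceed the bound
    L_trunc = L[keep]                            # step 3: truncated matrix
    f = L_trunc.max(axis=0)                      # by-record sensitivities
    span = f.max() - f.min()
    # Assumed min-max transformation; values near 1 indicate higher risk.
    f_scaled = (f - f.min()) / span if span > 0 else np.zeros_like(f)
    # Step 4 (assumed form): higher risk gives a lower weight, truncated to [0, 1].
    alpha = np.clip(c * (1.0 - f_scaled) + g, 0.0, 1.0)
    return delta, f, alpha, keep


# Toy matrix with 3 draws (rows) and 2 records (columns).
toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 1.0]])
delta, f, alpha, keep = lipschitz_bound_and_weights(toy, quantile=0.5)
```

In this toy example the third draw is dropped because its maximum exceeds the bound, so the extreme value of 10 does not enter the by-record sensitivities.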
3.2 Scalar-Weighted Pseudo Posterior Mechanism and the Exponential Mechanism
Wasserman and Zhou (2010); Zhang et al. (2016); Snoke and Slavkovic (2018) use the exponential mechanism to generate synthetic data with privacy guarantees from a non-private mechanism. Suppose we start with a non-private mechanism, such as an unweighted synthesizer in Equation (27),
Under our set-up, in which the log-likelihood function serves as the utility function, the exponential mechanism generates private samples from
where the prior, , is chosen as the “base” distribution as in Zhang et al. (2016) specified by McSherry and Talwar (2007) that ensures the exponential mechanism produces a proper density function. Furthermore,
which means that the exponential mechanism is equivalent to a risk-adjusted pseudo posterior synthesizer with scalar weight , where .
There are important implications of the exponential mechanism reducing to a scalar-weighted pseudo posterior under use of the log-likelihood as the utility function. Using the scalar weight shown in Equation (29), we expect lower utility for synthetic data draws under this mechanism than under our vector-weighted pseudo posterior shown in Equation (26). The vector-weighted pseudo posterior is more surgical and concentrates the downweighting on the high-risk records, whereas the exponential mechanism must downweight all records by the same amount. Downweighting all records by the same amount will be conservative because the scalar weight is based on the worst-case sensitivity over the entire database of records and the parameter space, which is required to achieve a privacy guarantee, and is not tuned to the risk of each record.
We see in Section 4 the reduction in utility of the Differentially private synthetic dataset generated under the exponential mechanism, compared to that under our proposed vector-weighted pseudo posterior mechanism, for the CE sample for an equivalent privacy guarantee for both mechanisms.
4 Application to the CE sample
In this section, we first introduce, in Section 4.1, the CE sample of consumer units (CU), where our goal is to synthesize family income, a highly skewed continuous variable, with a Differential privacy guarantee. In Section 4.2, we present risk and utility profiles of our Differentially private synthetic data drawn from our pseudo posterior mechanism and compare their performance to that of the exponential mechanism (EM) and our unweighted synthesizer. Section 4.3 showcases different scaling and shifting configurations for the vector weights in Equation (26) to examine the risk-utility trade-offs, with the purpose of providing the Bureau of Labor Statistics (BLS) options for selecting a configuration that matches their policy.
4.1 The CE Sample and Unweighted Synthesizer
Our application of our pseudo posterior mechanism focuses on income data published by the Consumer Expenditure Surveys (CE), administered by the Bureau of Labor Statistics (BLS) with the purpose of publishing income and expenditure patterns indexed by geographic domains to support policy-making by State and Federal governments. The description of the CE sample included here closely follows that in Hu and Savitsky (2019b). The CE contains data on expenditures, income, and tax statistics about CUs across the U.S. The CE public-use microdata (PUMD) is publicly available record-level data, published by the CE (for more information about CE PUMD, visit https://www.bls.gov/cex/pumd.htm). The CE PUMD has undergone masking procedures to provide privacy protection of survey respondents. Notably, the family income variable has undergone top-coding, a popular Statistical Disclosure Limitation (SDL) procedure which could result in reduced utility and insufficient privacy protection (An and Little, 2007; Hu and Savitsky, 2019a).
The CE sample in our application consists of CUs from the 2017 1st quarter CE Interview Survey. It includes the family income variable, which is highly right-skewed and deemed sensitive; refer to Figure 1 for its density plot. The CE sample also contains 10 categorical variables, listed in Table 1. These categorical variables are deemed insensitive and are used as predictors in building a flexible synthesizer for the synthesis of the sensitive family income variable.
| Variable | Description |
|---|---|
| Gender | Gender of the reference person; 2 categories |
| Age | Age of the reference person; 5 categories |
| Education Level | Education level of the reference person; 8 categories |
| Region | Region of the CU; 4 categories |
| Urban | Urban status of the CU; 2 categories |
| Marital Status | Marital status of the reference person; 5 categories |
| Urban Type | Urban area type of the CU; 3 categories |
| CBSA | 2010 core-based statistical area (CBSA) status; 3 categories |
| Family Size | Size of the CU; 11 categories |
| Earner | Earner status of the reference person; 2 categories |
| Family Income | Imputed and reported income before tax of the CU; approximate range: (-7K, 1,800K) |
To generate partially synthetic datasets for the CE sample with synthetic family income, we use an unweighted, non-private synthesizer. The synthesizer is constructed as an over-determined finite mixture, with probabilities of assignment to each cluster that encourage sparsity in the number of populated clusters, such that our parametric model becomes arbitrarily close to the Dirichlet Process mixture in the limit of the maximum number of clusters. This finite mixture synthesizer has been shown in previous work to produce synthetic data with high utility, but with a probably unacceptable level of disclosure risk (Hu and Savitsky, 2019a, b). We leave the details of the synthesizer to the Appendix for brevity and direct interested readers to the aforementioned work for further information.
4.2 Differentially Private Risk and Utility Comparisons of Mechanisms
To generate synthetic data and compare results, we apply four synthesizers: 1) the unweighted, non-private synthesizer, labeled “Unweighted”; 2) the private synthesizer under the pseudo posterior mechanism, labeled “DPweighted”, with a fixed scaling and shift configuration; 3) the private synthesizer under the exponential mechanism, labeled “EMweighted”, which is designed to meet the privacy target achieved by “DPweighted”; and 4) the weighted, though non-private, pseudo posterior synthesizer proposed by Hu and Savitsky (2019b), labeled “Countweighted”, that utilizes their method for measuring the by-record disclosure risk (based on an assumption about the behavior of an intruder). The labels are used throughout the remainder of this paper when presenting various risk and utility results.
We first look at the risk profiles of the four synthesizers. Figure 2 plots the distributions of the by-record Lipschitz bounds for the four synthesizers. For each synthesizer, we take the log-likelihood ratios over the posterior draws of the model parameters for each record and then take the maximum for that record, which yields a vector of by-record Lipschitz bounds over the records that we plot. The maximum value over all of the records is the Lipschitz bound for the mechanism.
The “Unweighted”, non-private synthesizer clearly has the highest maximum Lipschitz bound. The other non-private synthesizer, “Countweighted”, achieves a much lower maximum. The large reduction for the Countweighted synthesizer owes to the correlation between its by-record risks, each computed as the probability that the value for the target record is relatively isolated from those of other records, and the by-record (Lipschitz) sensitivities of the log-likelihood ratio used for our DPweighted mechanism. The two private synthesizers both achieve an even lower maximum Lipschitz bound, indicating the best risk profiles. The EMweighted mechanism was estimated by setting the scalar weight to target the privacy guarantee (expenditure) achieved by our pseudo posterior mechanism. Our intent is to compare the utility performances between the two private mechanisms where each achieves an equivalent privacy guarantee. It bears mention that while “DPweighted” under the pseudo posterior mechanism and “EMweighted” under the exponential mechanism achieve similar maximum bounds, which govern the DP guarantee, the exponential mechanism tends to produce notably lower risk for most records than the pseudo posterior mechanism, evident in the flattened shape of the violin plot. The exponential mechanism sets the scalar weight based on the risk of the worst-case records because the same level of downweighting must be applied to all records, in contrast with the by-record weighting under our pseudo posterior mechanism.
Figure 3 and Figure 4 show a collection of violin plots of the distribution (obtained from re-sampling) for each of the mean and the 90th quantile statistics, respectively, estimated on the synthetic data generated under each of our synthesizers and also on the original, confidential data for comparison, labeled “Data”. These figures allow us to compare the utility performances across our synthesizers by examining how well the real data distribution for each statistic is reproduced by the synthetic dataset for each of our synthesizers. For each synthesizer, a set of synthetic datasets was generated and the distribution for each statistic was estimated on each dataset (under re-sampling). The resulting barycenter of the individual distributions in the Wasserstein space of measures was computed by averaging the quantiles over the datasets (Srivastava et al., 2015). Our privacy guarantees apply to each synthetic draw from our mechanism, so the total privacy expenditure is that for each dataset shown in Figure 2 multiplied by the number of synthetic datasets. We compute utilities over multiple synthetic datasets for thoroughness, though the distribution of each statistic for a single synthetic dataset is very similar.
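The quantile-averaging step can be illustrated with a short Python sketch. It relies on the fact that, in one dimension, the equally weighted 2-Wasserstein barycenter has a quantile function equal to the average of the individual quantile functions. The function name, the probability grid, and the simulated re-sampled statistics are assumptions for illustration only and are not the values used in this paper.

```python
import numpy as np


def quantile_barycenter(samples_per_dataset, probs=None):
    """Average the empirical quantile functions of several 1-D samples.

    samples_per_dataset : list of 1-D arrays, e.g. re-sampled values of a
                          statistic, one array per synthetic dataset.
    Returns the probability grid and the averaged quantiles.
    """
    if probs is None:
        probs = np.linspace(0.005, 0.995, 199)   # assumed grid of quantile levels
    probs = np.asarray(probs, dtype=float)
    quantiles = np.vstack([
        np.quantile(np.asarray(s, dtype=float), probs)
        for s in samples_per_dataset
    ])
    return probs, quantiles.mean(axis=0)


# Hypothetical re-sampled means from three synthetic datasets.
rng = np.random.default_rng(2)
resampled = [rng.normal(loc=m, scale=1.0, size=500) for m in (9.5, 10.0, 10.5)]
grid, barycenter_quantiles = quantile_barycenter(resampled)
```

The averaged quantiles can then be plotted as a single summary distribution for each synthesizer.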
Our “DPweighted” synthesizer under the pseudo posterior mechanism outperforms the other three in utility preservation. Firstly, as is especially evident in Figure 4, “DPweighted” under the pseudo posterior mechanism provides better estimates than “EMweighted” under the exponential mechanism. The clearly worse utility preservation of the exponential mechanism is exactly the cost, discussed earlier, of setting the scalar weight applied to all records based on the highest-risk records. Since both mechanisms achieve the same maximum Lipschitz bound, which governs the DP guarantee, these results indicate that the exponential mechanism has to compromise a large amount of utility to achieve a similar DP guarantee compared to the pseudo posterior mechanism.
Secondly, while the non-private “Unweighted” synthesizer and our private “DPweighted” synthesizer provide equally good estimates for both the mean and the 90th quantile, the much greater maximum Lipschitz bound of the “Unweighted” synthesizer shown in Figure 2 indicates a much worse balance of the utility-risk trade-off compared to our “DPweighted” synthesizer. Thirdly, and as a minor point, the “Countweighted” synthesizer, albeit non-private, achieves only a slightly higher maximum Lipschitz bound compared to our private “DPweighted” synthesizer; however, its utility preservation is worse, as is especially evident in Figure 4 for the 90th quantile estimation.
In summary, our private “DPweighted” synthesizer under the pseudo posterior mechanism outperforms the other three synthesizers to achieve a highly satisfactory risk-utility trade-off balance. We next explore different scaling and shift configurations of and present their effects on the utility and risk of Differentially private synthetic datasets generated under the pseudo posterior mechanism.
4.3 Mapping the DP Risk-Utility Trade-off
We conclude by applying a scaling and a shift to the distribution of the vector of weights from our pseudo posterior mechanism in order to enumerate a range of risk-utility settings, for the purpose of allowing the Bureau of Labor Statistics (or, more generally, the owner of the closely-held private database) to discover the configuration that best represents their policy goal. We compare the risk-utility mapping produced by the pseudo posterior mechanism to that of the exponential mechanism, which we recall reduces to a scalar-weighted pseudo posterior under use of the log-likelihood as the utility measure. As discussed in Hu and Savitsky (2019b), applying a scaling constant will induce a compression in the distribution of the weights, while applying a shift will induce a downward shift in the distribution of the record-indexed weights. We apply the scaling and shifting in a manner that uses truncation to ensure that each of the resulting weights remains within its admissible range.
Each violin plot in Figure 5 presents a distribution of the quantile for a synthetic dataset generated under a particular (scale, shift) configuration. The plots are ordered from left to right, from less scaling and shifting (with a relatively higher privacy expenditure) to more scaling and shifting (with a relatively lower privacy expenditure). The sensitivity value associated with each configuration is shown in Table 2, where we recall that the privacy expenditure follows directly from the sensitivity, as shown in Section 2. Table 2 demonstrates a halving of the DP expenditure over the range of configurations (though all are much less than that of the non-private, unweighted synthesizer). Figure 5 demonstrates a much flatter, reduced deterioration of utility for the DPweighted pseudo posterior mechanism as compared to the exponential mechanism (EMweighted). This is not surprising, given the greater flexibility of the DPweighted formulation to concentrate downweighting on high-risk records, whereas EMweighted applies a scalar weight, determined by the high-risk records, to all records.
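The contrast between record-indexed and scalar weighting can be made concrete with a toy calculation. The sketch below assumes a small matrix of absolute log-likelihood values (parameter draws by records) and a hypothetical weighting rule that caps each record's weighted contribution at a target bound; the rule, the function names, and the numbers are illustrative assumptions and are not the weighting scheme used in the paper.

```python
from typing import List


def max_abs_by_record(abs_loglik: List[List[float]]) -> List[float]:
    """Largest |log-likelihood| for each record over all parameter draws.

    abs_loglik[s][i] holds the absolute log-likelihood of record i under draw s.
    """
    n_records = len(abs_loglik[0])
    return [max(row[i] for row in abs_loglik) for i in range(n_records)]


def vector_weights(abs_loglik: List[List[float]], target: float) -> List[float]:
    """Hypothetical rule: downweight only records whose maximum exceeds the target."""
    return [min(1.0, target / m) if m > 0 else 1.0
            for m in max_abs_by_record(abs_loglik)]


def scalar_weight(abs_loglik: List[List[float]], target: float) -> float:
    """A single weight for all records, set by the highest-risk record."""
    return min(vector_weights(abs_loglik, target))


def lipschitz_bound(abs_loglik: List[List[float]], weights: List[float]) -> float:
    """Maximum weighted |log-likelihood| over draws and records."""
    return max(w * m for w, m in zip(weights, max_abs_by_record(abs_loglik)))


# Two parameter draws, three records; the third record is high risk.
toy = [[0.5, 1.0, 8.0],
       [0.7, 1.5, 6.0]]
per_record = vector_weights(toy, target=2.0)
common = scalar_weight(toy, target=2.0)
bound_vector = lipschitz_bound(toy, per_record)
bound_scalar = lipschitz_bound(toy, [common] * 3)
```

In this toy case both weightings reach the same maximum weighted value, but the record-indexed weights leave the two low-risk records at full weight, while the scalar weight discounts every record to the level required by the single high-risk record. This is the intuition behind the flatter utility curve, not a reproduction of the reported results.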
This paper adapts the vector-weighted pseudo posterior synthesizer as a mechanism that achieves markedly lower Differential privacy expenditures for a synthetic dataset than both the non-private, unweighted synthesizer and the exponential mechanism. We construct a specific definition for Differential privacy under our pseudo posterior mechanism. The construction utilizes the log-likelihood to develop the Lipschitz bound, which we show in Section 2 guarantees a corresponding privacy expenditure. The absolute values of the log-likelihood ratios comparing all databases at a Hamming distance of one from our local database reduce to the absolute values of individual data component likelihood contributions. While these log-likelihood ratios are not naturally bounded, we nevertheless specify mild conditions that restrict the smoothness of the marginal density over the database such that our local result is guaranteed to contract on a global result over the space of databases. Our pseudo posterior mechanism accommodates any synthesizer model formulated by the statistical agency and offers a simple weighting scheme that guarantees a Differential privacy result. The simple weighting allows the posterior sampling scheme devised for the non-private synthesizer to be used, with minor modification, for synthesis under the Differentially private pseudo posterior mechanism.
- Abowd and Vilhuber (2008) Abowd, J. and Vilhuber, L. (2008) How protective are synthetic data? In Privacy in Statistical Databases (eds. J. Domingo-Ferrer and Y. Saygin), vol. 5262 of Lecture Notes in Computer Science, 239–246. Springer.
- An and Little (2007) An, D. and Little, R. J. A. (2007) Multiple imputation: an alternative to top coding for statistical disclosure control. Journal of the Royal Statistical Society, Series A, 170, 923–940.
- Berti and Rigo (1997) Berti, P. and Rigo, P. (1997) A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32, 385–391. URL: https://ideas.repec.org/a/eee/stapro/v32y1997i4p385-391.html.
- Bowen and Liu (2016) Bowen, C. M. and Liu, F. (2016) Comparative study of differentially private data synthesis methods. arXiv:1602.01063.
- Dimitrakakis et al. (2017) Dimitrakakis, C., Nelson, B., Zhang, Z., Mitrokotsa, A. and Rubinstein, B. I. P. (2017) Differential privacy for Bayesian inference through posterior sampling. J. Mach. Learn. Res., 18, 343–381.
- Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, 265–284.
- Hu and Savitsky (2019a) Hu, J. and Savitsky, T. D. (2019a) Bayesian pseudo posterior synthesis for data privacy protection. arXiv:1901.06462.
- Hu and Savitsky (2019b) Hu, J. and Savitsky, T. D. (2019b) Risk-efficient Bayesian data synthesis for privacy protection. arXiv:1908.07639.
- Machanavajjhala et al. (2008) Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. and Vilhuber, L. (2008) Privacy: Theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, 277–286. IEEE Computer Society.
- McClure and Reiter (2012) McClure, D. and Reiter, J. P. (2012) Differential privacy and statistical disclosure risk measures: An illustration with binary synthetic data. Transactions on Data Privacy, 5, 535–552.
- McSherry and Talwar (2007) McSherry, F. and Talwar, K. (2007) Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, 94–103.
- Snoke and Slavkovic (2018) Snoke, J. and Slavkovic, A. (2018) pMSE mechanism: Differentially private synthetic data with maximal distributional similarity. In Privacy in Statistical Databases (eds. J. Domingo-Ferrer and F. Montes), vol. 11126 of Lecture Notes in Computer Science, 138–159. Springer.
- Srivastava et al. (2015) Srivastava, S., Cevher, V., Dinh, Q. and Dunson, D. (2015) WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 912–920.
- Wasserman and Zhou (2010) Wasserman, L. and Zhou, S. (2010) A statistical framework for differential privacy. Journal of the American Statistical Association, 105, 375–389.
- Zhang et al. (2016) Zhang, Z., Rubinstein, B. I. P. and Dimitrakakis, C. (2016) On the differential privacy of Bayesian inference. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2365–2371. AAAI.
6.1 Unweighted, Non-private Synthesizer
Our description of the unweighted, non-private synthesizer closely follows that in Hu and Savitsky (2019b). To simulate partially synthetic data for the CE sample, where only the sensitive, continuous family income variable is synthesized, we propose using a flexible, parametric finite mixture synthesizer.
Equation (30) and Equation (31) present the first two levels of the hierarchical parametric finite mixture synthesizer, in which the response is the logarithm of the family income for each CU and the covariates form the predictor vector for that CU. The finite mixture uses a hyperparameter for the maximum number of mixture components (i.e., clusters), which is set to be over-determined to permit flexible clustering of CUs. The subset of CUs assigned to the same cluster share the same generating parameters for the response, which we term a “location”. Locations and the vector of cluster indicators are sampled for each CU.
In these equations, the matrix of regression locations collects the cluster-indexed regression coefficients for the predictors. The cluster probabilities are, in turn, assigned a sparsity-inducing Dirichlet distribution. We next describe our prior specification.
We induce sparsity in the number of clusters through the prior specification. We specify multivariate Normal priors for each cluster's vector of regression coefficient locations, where the correlation matrix receives a uniform prior over the space of correlation matrices and each component of the associated scale vector receives a Student-t prior.
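To illustrate why a Dirichlet prior with small concentration values induces sparsity in the number of occupied clusters, the sketch below draws cluster probabilities from a symmetric Dirichlet and counts how many components receive non-negligible mass. The concentration values, the number of components, the 0.01 threshold, and the function names are assumptions chosen for the illustration; they are not the hyperparameter settings used in the paper.

```python
import random
from typing import List


def draw_dirichlet(concentration: float, n_components: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet by normalizing independent Gamma variates."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(n_components)]
    total = sum(gammas)
    return [g / total for g in gammas]


def mean_occupied(concentration: float, n_components: int, n_draws: int,
                  rng: random.Random, threshold: float = 0.01) -> float:
    """Average number of components whose probability exceeds the threshold."""
    counts = []
    for _ in range(n_draws):
        probs = draw_dirichlet(concentration, n_components, rng)
        counts.append(sum(p > threshold for p in probs))
    return sum(counts) / n_draws


rng = random.Random(0)
sparse_count = mean_occupied(concentration=0.1, n_components=20, n_draws=200, rng=rng)
dense_count = mean_occupied(concentration=5.0, n_components=20, n_draws=200, rng=rng)
```

Under the small concentration value, most of the probability mass falls on a handful of the 20 available components, so setting the maximum number of clusters to be over-determined does not force all of them to be used.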
We proceed to describe how to generate partially synthetic data for the CE sample. To implement the finite mixture synthesizer, we first draw sample values of the model parameters from the posterior distribution at a given MCMC iteration. Second, for each CU, we generate a cluster assignment from its full conditional posterior distribution, given in Hu and Savitsky (2019a), using the posterior parameter samples. Lastly, we generate the synthetic family income for each CU from Equation (30), given its predictors and the sampled cluster assignment and parameters. We perform these draws for all CUs to obtain a partially synthetic dataset at that MCMC iteration. We repeat this process multiple times, creating independent partially synthetic datasets.
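The three synthesis steps can be sketched as follows for a finite mixture of Normal regressions. This is a minimal sketch under stated assumptions: the posterior draws are supplied by the caller as tuples of (cluster probabilities, cluster regression coefficients, cluster standard deviations), and the cluster indicator is sampled from the standard full conditional of such a mixture, which is assumed here to correspond to the one given in Hu and Savitsky (2019a). All names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Draw = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[float]]


def normal_pdf(y: float, mean: float, sd: float) -> float:
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def synthesize_once(y: Sequence[float], X: Sequence[Sequence[float]],
                    draw: Draw, rng: random.Random) -> List[float]:
    """Generate one partially synthetic response vector from one posterior draw."""
    probs, coefs, sds = draw
    synthetic = []
    for y_i, x_i in zip(y, X):
        # Cluster indicator: probability proportional to
        # probs[k] * Normal(y_i | x_i' coefs[k], sds[k]).
        scores = [p * normal_pdf(y_i, dot(x_i, b), s)
                  for p, b, s in zip(probs, coefs, sds)]
        if sum(scores) == 0.0:  # guard against numerical underflow
            scores = list(probs)
        k = rng.choices(range(len(probs)), weights=scores)[0]
        # Synthetic response drawn from the assigned cluster's regression.
        synthetic.append(rng.gauss(dot(x_i, coefs[k]), sds[k]))
    return synthetic


def synthesize(y: Sequence[float], X: Sequence[Sequence[float]],
               posterior_draws: Sequence[Draw], rng: random.Random) -> List[List[float]]:
    """One synthetic dataset per supplied posterior draw."""
    return [synthesize_once(y, X, draw, rng) for draw in posterior_draws]
```

For example, with an intercept-only predictor and two well-separated clusters, `synthesize([0.0, 10.0], [[1.0], [1.0]], [([0.5, 0.5], [[0.0], [10.0]], [0.1, 0.1])], random.Random(1))` returns one synthetic dataset whose two values fall near 0 and 10, respectively. Only the response is replaced; the predictors are left unchanged, matching the partially synthetic setting described above.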