
Noise-Aware Statistical Inference with Differentially Private Synthetic Data

While generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, analysis of synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation, and synthetic data generation using noise-aware Bayesian modeling into a pipeline NA+MI that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation from marginal queries, we develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.



1 Introduction

Availability of data for research is constrained by the dilemma between privacy preservation and potential gains obtained from sharing. As a result, many datasets are kept confidential to mitigate the possibility of privacy violations, with access only granted to researchers after a lengthy approval process, if at all, slowing down research.

One approach to solving the dilemma between free access and confidentiality is releasing synthetic data, as proposed by rubin1993statistical. The idea is that the data holder releases a synthetic dataset that is based on a real dataset. Data analysts can use the synthetic dataset instead of the real one for their downstream analysis.

The synthetic dataset should maintain population-level statistical properties of the original, which are of interest to the analysts. Privacy-protection of the synthetic data can be guaranteed by employing differential privacy (DP) dworkCalibratingNoiseSensitivity2006, which offers provable protection, unlike non-DP synthetic data generation methods. The current literature on DP synthetic data generation is sizable.

The analysts of synthetic data should be able to draw valid conclusions about the data generating process (DGP) of the real data using the synthetic data. An important component of the conclusions in any scientific research is an estimate of uncertainty, usually in the form of a confidence interval or p-value. However, as wildeFoundationsBayesianLearning2021 point out, simply using synthetic data as if it were real data only allows drawing conclusions about the synthetic data generating process, not the real DGP.

The issues of using synthetic data in place of real data are especially apparent with DP, as DP requires adding noise to the synthetic data generation process. We illustrate this with a simple toy data experiment. We generate 3-dimensional binary data, where one variable is generated from logistic regression on the other two, serving as the original dataset. Then, we generate synthetic data from the original data and compute confidence intervals for the coefficients of the logistic regression from the synthetic data. We test our pipeline NA+MI, and the existing algorithms PGM mckennaGraphicalmodelBasedEstimation2019, PEP liuIterativeMethodsPrivate2021, RAP aydoreDifferentiallyPrivateQuery2021 and PrivLCM nixonLatentClassModeling2022. A more detailed description of the setup is given in Section 5.1. Figure 1 shows the coverage of the resulting intervals, and highlights some of the intervals that were produced. Figure 3(a) shows the widths of the confidence intervals. Even with very loose privacy bounds, all algorithms except ours and PrivLCM produce overconfident confidence intervals that do not meet the target confidence level of 95%. PrivLCM manages to account for the extra uncertainty, but is too conservative and produces very wide intervals. Only our method NA+MI produces accurately calibrated confidence intervals.

Figure 1: Toy data experiment results of logistic regression on 3 binary variables, showing that all algorithms apart from ours and PrivLCM are overconfident, even with almost no privacy. The first two panels from the left show the fraction of the 95% confidence intervals that contain the true parameter value in 100 repeats, with the target confidence level of 95% highlighted by a black line. The third panel shows the confidence intervals from synthetic data generated by our mechanism, and the fourth panel shows the confidence intervals from PGM. The third and fourth panels show that the overconfidence stems from intervals that are too narrow, a result of not accounting for all uncertainty.

Our solution to overconfident uncertainty estimates builds on Rubin’s original work on synthetic data generation rubin1993statistical. He proposed generating multiple synthetic datasets, running the analysis task on all of them, and combining the results with simple combining rules called Rubin’s rules raghunathanMultipleImputationStatistical2003; reiter2002satisfying. This workflow is modeled after multiple imputation rubinMultipleImputationNonresponse1987, where it is used to deal with missing data. Generating multiple synthetic datasets allows the combining rules to account for the additional uncertainty that comes from the synthetic data generation process, which includes DP noise when a noise-aware Bayesian model is used to generate the synthetic data. We call the combined pipeline of noise-aware (NA) private synthetic data generation and analysis with multiple imputation (MI) the NA+MI pipeline. We give a more detailed description in Section 2.

To implement the NA step, we develop an algorithm called Noise-Aware Private Synthetic data Using Marginal Queries (NAPSU-MQ), which generates synthetic data for discrete tabular datasets using the noisy values of preselected marginal queries. We describe NAPSU-MQ in Section 4. In Section 5, we evaluate NAPSU-MQ on the UCI Adult dataset, showing that it can produce accurate confidence intervals.

1.1 Related Work

There is a sizable literature on DP synthetic data generation. Most recent work in the area either releases the values of a set of simple queries, such as counting queries, under DP and uses them as the basis of synthetic data hardtSimplePracticalAlgorithm2012; chenDifferentiallyPrivateHighDimensional2015; zhangPrivBayesPrivateData2017; mckennaGraphicalmodelBasedEstimation2019; mckennaWinningNISTContest2021a; mckennaAIMAdaptiveIterative2022; mckenna2018optimizing; bernsteinDifferentiallyPrivateLearning2017; cai2021data; vietriNewOracleEfficientAlgorithms2020; liuIterativeMethodsPrivate2021; aydoreDifferentiallyPrivateQuery2021; nixonLatentClassModeling2022, or trains some kind of generative model, often a GAN, using the whole real dataset under DP xieDifferentiallyPrivateGenerative2018; yoon2018pategan; chenGSWGANGradientsanitizedApproach2020; longGPATEScalableDifferentially2021; jalkoPrivacypreservingDataSharing2021. There are also hybrid approaches that use sophisticated queries that can capture all features of the dataset, and train a generative model using those harderDPMERFDifferentiallyPrivate2021; liewPEARLDataSynthesis2022. Of the existing DP synthetic data generation algorithms, NAPSU-MQ is closest to the PGM algorithm mckennaGraphicalmodelBasedEstimation2019, and can be seen as a noise-aware variant of it.

Rubin’s rules were originally developed for analyses on missing data, as part of an approach called multiple imputation rubinMultipleImputationNonresponse1987. Multiple imputation was later applied to generate and analyse synthetic data rubin1993statistical without DP. The variant of Rubin’s rules that we use, and describe in Supplemental Section B, was developed specifically for synthetic data generation raghunathanMultipleImputationStatistical2003; reiter2002satisfying.

Rubin’s rules have not been widely used with DP synthetic data generation, and we are only aware of three existing works studying the combination. charestHowCanWe2010 studied Rubin’s rules with a very simple early synthetic data generation algorithm, and concluded that Rubin’s rules are not appropriate for that algorithm. zhengDifferentialPrivacyBayesian2015 found that some simple one-dimensional methods developed by the multiple imputation community are in fact DP, but not with practically useful privacy bounds. nixonLatentClassModeling2022 propose using Rubin’s rules with the noise-aware synthetic data generation algorithm PrivLCM, but they only consider computing confidence intervals of query values on the real dataset, not confidence intervals of population parameters of arbitrary downstream analyses.

Noise-aware confidence intervals and other uncertainty estimates have been developed for specific DP analyses. Examples include linear regression with a URL dataset released by Facebook evansStatisticallyValidInferences2022 and general recipes for DP analyses without synthetic data ferrandoParametricBootstrapDifferentially2022; covingtonUnbiasedStatisticalEstimation2021. Bayesian examples include posterior inference for simple exponential family models bernsteinDifferentiallyPrivateBayesian2018, linear regression bernsteinDifferentiallyPrivateBayesian2019, generalised linear models kulkarniDifferentiallyPrivateBayesian2021, and approximate Bayesian computation gongExactInferenceApproximate2019.

Several works study techniques for mitigating the effect of DP noise. wildeFoundationsBayesianLearning2021 point out the importance of noise-aware synthetic data analysis under DP, and use publicly available data to augment the analysis and correct for the DP noise in Bayesian inference. Other examples include importance sampling for reducing bias ghalebikesabiBiasMitigatedLearning2021 and averaging GANs to improve the generator neunhoefferPrivatePostGANBoosting2021.

While some of the existing works address uncertainty estimates for specific analyses of synthetic data under DP, or address the uncertainty of statistics of the real data mckennaAIMAdaptiveIterative2022, not of the population, there is no existing method for proper uncertainty estimation for general downstream analyses of population-level quantities. We fill this gap with the NA+MI pipeline, which we implement for discrete tabular data with NAPSU-MQ.

2 The NA+MI Pipeline

The early work on synthetic data generation with multiple imputation showed that computing accurate uncertainty estimates when using synthetic data requires accounting for the additional uncertainty that comes from generating synthetic data rubin1993statistical; raghunathanMultipleImputationStatistical2003. Rubin rubin1993statistical proposed generating multiple synthetic datasets $X^{Syn}_1, \dots, X^{Syn}_m$ from the Bayesian posterior predictive distribution $p(X^{Syn} \mid X)$, where $X^{Syn}$ is a prediction of a future dataset, and $X$ is the observed real dataset. The downstream analysis is run on each $X^{Syn}_i$ as if it were the real dataset, and the results are combined using specialised combining rules raghunathanMultipleImputationStatistical2003.

The generation of multiple datasets from $p(X^{Syn} \mid X)$ is necessary to give the combining rules a way to estimate the variance of the synthetic data generation process, which would not be possible if only a single dataset was generated. For a parametric model,
$$p(X^{Syn} \mid X) = \int p(X^{Syn} \mid \theta)\, p(\theta \mid X)\, \mathrm{d}\theta,$$
where $\theta$ is the parameter, $p(X^{Syn} \mid \theta)$ is the likelihood, and $p(\theta \mid X)$ is the posterior of the parameter. $X^{Syn}_i$ is then generated in two steps: first $\theta_i \sim p(\theta \mid X)$ is sampled, then $X^{Syn}_i \sim p(X^{Syn} \mid \theta_i)$.

The combining rules require sampling $\theta_i$ from the posterior distribution $p(\theta \mid X)$ raghunathanMultipleImputationStatistical2003, so a non-Bayesian model that samples $X^{Syn}_i \sim p(X^{Syn} \mid \hat\theta)$ for some point estimate $\hat\theta$ is not suitable for synthetic data generation with multiple imputation.

Requiring the synthetic data generation to be DP complicates the picture, as only noisy observations of the real data can be made. We call inference algorithms that handle noisy observations and account for DP noise noise-aware. The combination of noise-aware inference and multiple imputation is the NA+MI pipeline, which we summarise in Figure 2. First, the data holder runs inference on a noise-aware Bayesian model using the private data, which we call the NA step. Different implementations of the NA step may set different requirements on the form of the data, the type of DP noise, and may provide different privacy guarantees.

After the inference, the data holder generates multiple synthetic datasets. The data holder can also release the posterior distribution in addition to the synthetic datasets, so that the analyst can also generate synthetic datasets if needed.

For each synthetic dataset, the analyst runs their analysis, and combines the results using multiple imputation raghunathanMultipleImputationStatistical2003; reiter2002satisfying; reiter2005significance. We call this the MI step. For frequentist downstream analyses, we use Rubin’s rules, which require that each analysis produces a point estimate $q_i$ and a variance estimate $u_i$ for the point estimate. The point estimates and the variance estimates are fed to Rubin’s rules raghunathanMultipleImputationStatistical2003, which give a $t$-distribution that the analyst can use to compute confidence intervals or hypothesis tests. For Bayesian downstream analyses, the posteriors from each synthetic dataset can be mixed, and conclusions can be based on the mixed posterior gelmanBayesianDataAnalysis2014. We describe Rubin’s rules in more detail in Supplemental Section B.

[Figure 2 diagram: Data → Noise-aware Generator → Synthetic Data (×m) → Analysis Result (×m) → Combined Result, with the data holder (NA step) and the data analyst (MI step) separated by a privacy barrier.]
Figure 2: NA+MI pipeline for noise-aware DP synthetic data generation and statistical inference. The nodes shaded in blue are computed by the data holder, and the nodes shaded in orange are computed by the data analyst. All nodes except Data (with red border) can be released to the public. The synthetic datasets can be generated by either party because the Generator is also released by the data holder.

3 Background for NAPSU-MQ

In this section we describe the datasets and queries NAPSU-MQ uses, and briefly describe key concepts from differential privacy, which we will use in Section 4.

Data and Marginal Queries

NAPSU-MQ uses tabular datasets of discrete variables, where the domains of the discrete variables, as well as the number of datapoints $n$, are known. We denote the set of possible datapoints by $\mathcal{X}$, and the set of possible datasets by $\mathcal{X}^n$.

A marginal query $a_{V,v}$ with a set of variables $V$ and a value $v$ is a function that takes a datapoint as input and returns 1 if the variables $V$ in the datapoint have the value $v$, and 0 otherwise. For a dataset $X$, we define $a_{V,v}(X) = \sum_{i=1}^{n} a_{V,v}(x_i)$, where $x_i$ is the $i$:th datapoint in $X$. When $V$ has $k$ variables, $a_{V,v}$ is called a $k$-way marginal query.

When evaluating multiple marginal queries $a_1, \dots, a_d$, we concatenate their values to a vector-valued function $a(x) = (a_1(x), \dots, a_d(x))$. We call the concatenation of marginal queries for all possible values of a set of variables the full set of marginals on those variables. (Some existing works mckennaGraphicalmodelBasedEstimation2019 use the term marginal query for the full set of marginal queries. We chose this terminology because we deal with individual marginal queries in Supplemental Section C.)
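To make the definitions concrete, the following is a minimal sketch of evaluating a full set of marginals on a small discrete dataset; the function names and the use of NumPy are our own illustrative choices, not part of any published implementation.

    import numpy as np
    from itertools import product

    def full_marginal_set(data, cols, domain_sizes):
        # One counting query a_{V,v} per value combination v of the variable
        # set V = cols, summed over the datapoints of `data`.
        counts = []
        for v in product(*(range(domain_sizes[c]) for c in cols)):
            match = np.all(data[:, cols] == np.array(v), axis=1)
            counts.append(int(match.sum()))
        return np.array(counts)

    # Example: full set of 2-way marginals on columns (0, 1) of binary data.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 3))
    print(full_marginal_set(X, cols=[0, 1], domain_sizes=[2, 2, 2]))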

Differential Privacy

Differential privacy (DP) dworkCalibratingNoiseSensitivity2006; dworkOurDataOurselves2006 is a definition aiming to quantify the privacy loss resulting from releasing the results of some algorithm. DP algorithms are also called mechanisms.

A mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if for all neighbouring datasets $X \sim X'$ and all measurable output sets $S$,
$$P(\mathcal{M}(X) \in S) \le e^{\epsilon} P(\mathcal{M}(X') \in S) + \delta. \quad (1)$$

The neighbourhood relation in the definition is domain specific. We use the substitute neighbourhood relation for tabular datasets, where datasets are neighbouring if they differ in at most one datapoint.

The mechanism we use to release marginal query values under DP is the Gaussian mechanism dworkOurDataOurselves2006. The Gaussian mechanism with noise variance $\sigma_{DP}^2$ adds Gaussian noise to the value of a function $f$ for input data $X$: $\mathcal{M}(X) = f(X) + \mathcal{N}(0, \sigma_{DP}^2 I)$.

The privacy bounds of the Gaussian mechanism depend on the sensitivity of the function $f$, which is an upper bound on the change in the value of $f$ for neighbouring datasets. The $L_2$-sensitivity of a function $f$ is $\Delta_2 f = \max_{X \sim X'} \|f(X) - f(X')\|_2$, where $X \sim X'$ denotes that $X$ and $X'$ are neighbouring.

Theorem 3. Let $a$ be the concatenation of $k$ full sets of marginal queries. Then $\Delta_2 a \le \sqrt{2k}$.

Proof.

We defer the proof to Supplemental Section A. ∎

Theorem (balleImprovingGaussianMechanism2018). The Gaussian mechanism for a function $f$ with $L_2$-sensitivity $\Delta$ and noise variance $\sigma_{DP}^2$ is $(\epsilon, \delta)$-DP with
$$\delta = \Phi\left(\frac{\Delta}{2\sigma_{DP}} - \frac{\epsilon\sigma_{DP}}{\Delta}\right) - e^{\epsilon}\, \Phi\left(-\frac{\Delta}{2\sigma_{DP}} - \frac{\epsilon\sigma_{DP}}{\Delta}\right), \quad (2)$$
where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution.
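As a concrete illustration of (2), here is a sketch of calibrating the Gaussian mechanism's noise level; the bisection helper is our own illustrative code, not taken from any of the cited implementations.

    import numpy as np
    from scipy.stats import norm

    def gaussian_mechanism_delta(eps, sigma, sensitivity):
        # delta as a function of epsilon for the Gaussian mechanism, Eq. (2).
        a = sensitivity / (2 * sigma)
        b = eps * sigma / sensitivity
        return norm.cdf(a - b) - np.exp(eps) * norm.cdf(-a - b)

    def calibrate_sigma(eps, delta, sensitivity, lo=1e-3, hi=1e3):
        # delta(eps) decreases as sigma grows, so bisect for the smallest
        # sigma whose delta is at most the target.
        for _ in range(100):
            mid = (lo + hi) / 2
            if gaussian_mechanism_delta(eps, mid, sensitivity) > delta:
                lo = mid
            else:
                hi = mid
        return hi

    sensitivity = np.sqrt(2 * 3)  # e.g. k = 3 full marginal sets (Theorem 3)
    sigma = calibrate_sigma(eps=1.0, delta=1e-6, sensitivity=sensitivity)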

An important property of DP is post-processing immunity: post-processing the result of a DP algorithm does not weaken the privacy bounds. Proposition (dworkAlgorithmicFoundationsDifferential2014). Let $\mathcal{M}$ be an $(\epsilon, \delta)$-DP mechanism, and let $g$ be any algorithm. Then the composition $g \circ \mathcal{M}$ is $(\epsilon, \delta)$-DP.

4 Noise-aware Synthetic Data Generation

In order to implement the NA step, the data holder needs to generate synthetic data from the posterior of a noise-aware Bayesian model. bernsteinDifferentiallyPrivateBayesian2018 develop noise-aware Bayesian inference under DP for simple exponential family models. We implement the NA step by adapting their algorithm to arbitrary marginal queries.

We start by considering an arbitrary set of marginal queries $a$. We would like to find an exponential family distribution that, in expectation, gives the same answers to $a$ as the real data $X$. We do not want to assume anything else about the distribution besides these expected query values, so we use the principle of maximum entropy jaynesInformationTheoryStatistical1957 to choose the distribution.

The distribution with maximal entropy that satisfies the constraint on the expected query values is
$$p(x \mid \theta) = \exp\left(\theta^T a(x) - \log Z(\theta)\right) \quad (3)$$
for some parameters $\theta \in \mathbb{R}^d$ wainwrightGraphicalModelsExponential2008, where $d$ is the number of queries. $Z(\theta)$ is the normalising constant of the distribution, so it is given by $Z(\theta) = \sum_{x \in \mathcal{X}} \exp(\theta^T a(x))$. We denote this distribution by $\mathrm{MED}_\theta$, and use $\mathrm{MED}_\theta^n$ to denote the distribution of $n$ i.i.d. samples from $\mathrm{MED}_\theta$.

$\mathrm{MED}_\theta$ is clearly an exponential family distribution, with sufficient statistic $a(x)$ and natural parameters $\theta$. $\mathrm{MED}_\theta$ is also a Markov network koller2009probabilistic, given in log-linear form.
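The following is a minimal sketch of the log-density of this distribution, with $\log Z(\theta)$ computed by brute-force enumeration over the domain; this is only feasible for tiny domains and serves purely to make Equation (3) concrete, since the actual algorithm exploits the Markov network structure described below.

    import numpy as np

    def log_Z(theta, query_matrix):
        # query_matrix: one row per element x of the domain, holding a(x).
        return np.logaddexp.reduce(query_matrix @ theta)

    def med_log_prob(a_x, theta, query_matrix):
        # log MED_theta(x) = theta^T a(x) - log Z(theta), Eq. (3).
        return a_x @ theta - log_Z(theta, query_matrix)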

The Bayesian model we consider is derived from the generative process of the noisy query values $\tilde a$, which are observed. Assuming that the data generating process is $\mathrm{MED}_\theta^n$, and knowing that the Gaussian mechanism adds noise with variance $\sigma_{DP}^2$, we get the probabilistic model
$$\theta \sim p(\theta), \qquad X \sim \mathrm{MED}_\theta^n, \qquad \tilde a \sim \mathcal{N}(a(X), \sigma_{DP}^2 I). \quad (4)$$
In principle, we could now sample from the posterior $p(\theta \mid \tilde a)$, with $X$ marginalised out. In practice, the marginalisation is not feasible, as $X$ is a discrete variable with a very large domain.

However, $a(X)$ is a sum of the query values of individual datapoints, so asymptotically $a(X)$ has a normal distribution according to the central limit theorem. We can substitute the normal approximation for $a(X)$ into the model, which allows us to marginalise $X$ out easily, resulting in
$$\tilde a \mid \theta \sim \mathcal{N}\left(n \mu(\theta),\; n \Sigma(\theta) + \sigma_{DP}^2 I\right), \quad (5)$$
where $\mu(\theta)$ and $\Sigma(\theta)$ denote the mean and covariance of $a(x)$ for a single sample $x \sim \mathrm{MED}_\theta$.
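A sketch of the resulting marginalised log-likelihood, assuming helper functions mu_fn and cov_fn for the single-datapoint query mean and covariance (these are computed as described in the following paragraphs):

    import numpy as np
    from scipy.stats import multivariate_normal

    def noisy_query_log_likelihood(theta, a_tilde, n, sigma_dp, mu_fn, cov_fn):
        # log p(a_tilde | theta) under the normal approximation, Eq. (5).
        mean = n * mu_fn(theta)
        cov = n * cov_fn(theta) + sigma_dp ** 2 * np.eye(len(mean))
        return multivariate_normal.logpdf(a_tilde, mean=mean, cov=cov)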

bernsteinDifferentiallyPrivateBayesian2018 use the same normal approximation as we do, but they use the Laplace mechanism instead of the Gaussian mechanism, which makes the final model more complicated. They also develop a custom inference algorithm that would require sampling the conjugate prior of $\theta$ in our setting. As sampling from the conjugate prior of $\theta$ is not tractable, we instead choose to use existing off-the-shelf posterior inference methods: specifically the Laplace approximation gelmanBayesianDataAnalysis2014, which approximates the posterior with a Gaussian distribution centered at the posterior mode, and the NUTS algorithm hoffmanNoUTurnSamplerAdaptively2014, a Markov chain Monte Carlo (MCMC) algorithm that samples the posterior directly using the gradients of the posterior log-density.

For the prior, we choose a Gaussian distribution with mean 0 and standard deviation 10, which is a simple and weak prior; other priors could also be used.

To compute $\mu(\theta)$ and $\Sigma(\theta)$, we use both the exponential family and Markov network structure of $\mathrm{MED}_\theta$. As $\mathrm{MED}_\theta$ is an exponential family distribution,
$$\mu(\theta) = \nabla_\theta \log Z(\theta), \qquad \Sigma(\theta) = \nabla_\theta^2 \log Z(\theta), \quad (6)$$
where $\nabla_\theta^2 \log Z(\theta)$ is the Hessian of $\log Z(\theta)$. Computing $\log Z(\theta)$ naively requires summing over the exponentially large domain $\mathcal{X}$, which is not tractable for complex domains. The Markov network structure gives a solution: $\log Z(\theta)$ can be computed with the variable elimination algorithm koller2009probabilistic. We can then autodifferentiate variable elimination to compute $\mu(\theta)$ and $\Sigma(\theta)$. Alternatively, $\mu(\theta)$ can be computed by belief propagation koller2009probabilistic, and $\Sigma(\theta)$ can be autodifferentiated from it. Using modern autodifferentiation software like JAX frostig2018compiling, we can differentiate through these computations to run gradient-based samplers like NUTS hoffmanNoUTurnSamplerAdaptively2014.
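A minimal JAX sketch of Equation (6), using a brute-force log Z over an enumerated domain as a stand-in for autodifferentiable variable elimination:

    import jax
    import jax.numpy as jnp

    def log_Z(theta, query_matrix):
        # Brute-force log-sum-exp over the domain; the real implementation
        # replaces this with variable elimination on the Markov network.
        return jax.scipy.special.logsumexp(query_matrix @ theta)

    mu_fn = jax.grad(log_Z)      # mu(theta): gradient of log Z, Eq. (6)
    cov_fn = jax.hessian(log_Z)  # Sigma(theta): Hessian of log Z, Eq. (6)

    # Example: domain {0,1}^2 with identity queries a(x) = x.
    query_matrix = jnp.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    print(mu_fn(jnp.zeros(2), query_matrix))
    print(cov_fn(jnp.zeros(2), query_matrix))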

The time complexities of computing $\mu(\theta)$ and $\Sigma(\theta)$ for inference, as well as of sampling after inference, depend on the sparsity of the Markov network graph that the selected set of queries induces. The time complexity is exponential in the treewidth of the graph koller2009probabilistic, which can be much lower than the dimensionality for sparse graphs, making inference and sampling tractable for sparse query sets.

If we include all possible marginal queries from the selected variable sets, the parametrisation of $\mathrm{MED}_\theta$ is not identifiable, as there are linear dependencies among the queries koller2009probabilistic. Non-identifiability causes NUTS sampling to be very slow, so we prune the queries to remove linear dependencies while preserving the information in the queries, using the canonical parametrisation of Markov networks koller2009probabilistic. We give a detailed description of the process in Supplemental Section C.

4.1 NAPSU-MQ Properties

We summarise NAPSU-MQ in Algorithm 1. The privacy bounds for NAPSU-MQ follow from the material of Section 3: NAPSU-MQ (Algorithm 1) is $(\epsilon, \delta)$-DP with regard to the real data $X$.

Proof.

The returned values of Algorithm 1 only depend on the real data through $\tilde a$. Releasing $\tilde a$ is $(\epsilon, \delta)$-DP because $\sigma_{DP}^2$ is selected using the sensitivity bound of Theorem 3 and the Gaussian mechanism bound (2). Computing the returned values from $\tilde a$ is post-processing, so by post-processing immunity, NAPSU-MQ is $(\epsilon, \delta)$-DP. ∎

While it is possible to prove that NA+MI results in valid confidence intervals under assumptions on the NA step and the downstream analysis, which we list in Supplemental Section B, we will not formally prove that NAPSU-MQ meets these assumptions; this is typical for the multiple imputation literature rubinMultipleImputationNonresponse1987; reiter2002satisfying; raghunathanMultipleImputationStatistical2003. Instead, we use the heuristic argument of rubinMultipleImputationNonresponse1987 that generating synthetic data using the posterior of a Bayesian model that accurately represents the data generating process of the observed values results in valid confidence intervals from the MI step. We summarise this as an informal conjecture, which is supported by our experimental results in Section 5: if the selected queries contain the relevant information for the downstream task and the downstream task computes accurate estimates with real data, NAPSU-MQ+MI computes valid confidence intervals. Noise-awareness is important to accurately represent the data generating process in the presence of DP, because only noisy query values are observed. We show this experimentally in Section 5.1.

Input: Real data $X$, marginal queries $a$, number of synthetic datasets $m$, size of synthetic datasets $n_{Syn}$, privacy bounds $(\epsilon, \delta)$.
Output: Posterior distribution $p(\theta \mid \tilde a)$, synthetic datasets $X^{Syn}_1, \dots, X^{Syn}_m$.
$a \leftarrow$ canonical queries for $a$ (Section C);
$\bar a \leftarrow a(X)$;
$\Delta \leftarrow$ sensitivity of $a$ (Theorem 3);
$\sigma_{DP}^2 \leftarrow$ required noise variance for $(\epsilon, \delta)$-DP with sensitivity $\Delta$ (Eq. (2));
Sample $\tilde a \sim \mathcal{N}(\bar a, \sigma_{DP}^2 I)$;
Run Bayesian inference algorithm to find $p(\theta \mid \tilde a)$ (Section 4);
Sample $\theta_i \sim p(\theta \mid \tilde a)$ and $X^{Syn}_i \sim \mathrm{MED}_{\theta_i}^{n_{Syn}}$ for $i = 1, \dots, m$;
return $p(\theta \mid \tilde a)$, $X^{Syn}_1, \dots, X^{Syn}_m$
Algorithm 1 NAPSU-MQ

5 Experiments

In this section, we give detailed descriptions of our two experiments: a simple toy data experiment, and our main experiment with the UCI Adult dataset.

5.1 Toy Data

To demonstrate the necessity of noise-awareness in synthetic data generation, we measure the coverage of confidence intervals computed from DP synthetic data on a generated toy dataset where the data generation process is known. We test the existing algorithms PGM mckennaGraphicalmodelBasedEstimation2019, PEP liuIterativeMethodsPrivate2021, RAP aydoreDifferentiallyPrivateQuery2021, PrivLCM nixonLatentClassModeling2022 and our pipeline NA+MI, where data generation is implemented with NAPSU-MQ. The authors of PrivLCM also propose using multiple imputation nixonLatentClassModeling2022, so we use Rubin’s rules raghunathanMultipleImputationStatistical2003 with the output of PrivLCM.

The original data consists of datapoints of 3 binary variables. The first two are sampled by independent coinflips. The third is sampled from logistic regression on the other two variables with coefficients (1, 0).
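A sketch of this data generating process (the dataset size here is illustrative, not the exact size used in the experiment):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000  # illustrative size
    x1 = rng.integers(0, 2, size=n)  # independent coinflips
    x2 = rng.integers(0, 2, size=n)
    logits = 1.0 * x1 + 0.0 * x2     # logistic regression coefficients (1, 0)
    x3 = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    data = np.column_stack([x1, x2, x3])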

For all algorithms except PrivLCM, we use the full set of 3-way marginal queries, released with the Gaussian mechanism. PrivLCM does not implement these, and instead uses all full sets of 2-way marginals and a different mechanism, which is $\epsilon$-DP nixonLatentClassModeling2022 instead of $(\epsilon, \delta)$-DP like the other algorithms. We use the Laplace approximation for NAPSU-MQ inference, as it is much faster than NUTS and works well in this simple setting.

For the privacy bounds, we fix $\delta$ and vary $\epsilon$. We generate synthetic datasets of a fixed size for all algorithms except RAP, where the synthetic dataset size is a function of two hyperparameters. We describe the hyperparameters in detail in Supplemental Section D.

The downstream task is inferring the logistic regression coefficients from synthetic data. We repeated all steps 100 times to measure the probability of sampling a dataset giving a confidence interval that includes the true parameter values.

Figure 1 shows the coverages, and Figure 3(a) shows the widths of the resulting confidence intervals. All of the algorithms apart from ours and PrivLCM are overconfident, even with very loose privacy bounds. Examining the confidence intervals shows the reason: PGM is unbiased, but it produces confidence intervals that are too narrow, while NAPSU-MQ produces wider confidence intervals. PrivLCM, on the other hand, produces much wider, overly conservative confidence intervals.

Ablation Study

We also conducted an ablation study on the toy data to show that both multiple imputation and noise-awareness are necessary for accurate confidence intervals. The results are presented in Figure 3(b). Without both multiple imputation and noise-awareness, NA+MI is overconfident like PGM, except at the loosest privacy bounds, where noise-awareness is not required. We show the confidence intervals produced by each method in Figure S4 in the Supplement.

Figure 3: (a) Toy data confidence interval widths. NA+MI produces slightly wider intervals than PGM or PEP, which is necessary to account for DP noise. PrivLCM produces very wide intervals. (b) Ablation study on the toy data. “-NA” refers to removing noise-awareness, and “-MI” refers to removing multiple imputation. Unless both are included, NAPSU-MQ is overconfident like PGM, except at the loosest privacy bounds, where noise-awareness is not necessary.

5.2 Adult Dataset

Our main experiment evaluates the performance of NAPSU-MQ on the UCI Adult dataset Adult1996. We include 10 of the original 15 columns to remove redundant columns and keep runtimes manageable, discretise the continuous columns, and drop rows with missing values. We give a detailed description of the dataset, query selection and the downstream task in Supplemental Section E.

As our downstream task, we use logistic regression with income as the dependent variable and a subset of the columns as independent variables, which allows us to include all the relevant marginals for synthetic data generation. Note that the synthetic dataset was still generated with all 10 columns.

NAPSU-MQ sometimes generates synthetic datasets in which no people of some race have high income. This causes the downstream logistic regression to produce an extremely wide confidence interval for the coefficient of that race. As Rubin’s rules average over estimates and estimated variances, even a single one of these bad coefficients makes the combined confidence interval extremely wide. To fix this issue, we remove the coefficients whose estimated standard deviations exceed a threshold before applying Rubin’s rules.
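A sketch of this filtering step; the threshold value is a tuning choice and not reproduced here:

    import numpy as np

    def filter_unstable(estimates, variances, sd_threshold):
        # Drop (q_i, u_i) pairs with extreme estimated standard deviations
        # before feeding the rest to Rubin's rules.
        q = np.asarray(estimates)
        u = np.asarray(variances)
        keep = np.sqrt(u) <= sd_threshold
        return q[keep], u[keep]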

As the input queries, we pick 2-way marginals that are relevant for the downstream task, and select the rest of the queries with the MST algorithm mckennaWinningNISTContest2021a. This selection was kept constant throughout the experiment. The size of the synthetic dataset was fixed for all algorithms except RAP, as in the toy data experiment (Section 5.1). The number of generated synthetic datasets for NAPSU-MQ and the number of repeats for repeated PGM were chosen with a preliminary experiment, presented in Figures S3 and S2 in the Supplement. We describe the hyperparameters in detail in Supplemental Section D. For the privacy budget, we fix $\delta$ for all runs, and vary $\epsilon$.

The Laplace approximation for NAPSU-MQ does not work well in this setting, because many of the queries have small values, so we use NUTS hoffmanNoUTurnSamplerAdaptively2014 for posterior inference. To speed up NUTS, we normalise the posterior before running the inference, using the mean and covariance of the Laplace approximation.

We compare NAPSU-MQ against PGM mckennaGraphicalmodelBasedEstimation2019, RAP aydoreDifferentiallyPrivateQuery2021 and PEP liuIterativeMethodsPrivate2021. We used the implementations published by their authors for all of them, with small modifications to ensure compatibility with new library versions and our experiments. The published implementation of PrivLCM only supports binary data and does not scale to datasets of this size, so it was not included in this experiment. We also include a naive noise-aware baseline that runs completely independent repeats of PGM, splitting the privacy budget appropriately, and uses Rubin’s rules with the generated synthetic datasets.

Results from 20 repeats of the experiment are shown in Figure 4. PGM, RAP and PEP produce overconfident confidence intervals that do not meet the given confidence levels. The bottom plots show the cause of the overconfidence: the confidence intervals have widths similar to those from the original data, while they should be wider for synthetic data because of DP noise.

With the repeats, PGM becomes overly conservative, and produces confidence intervals that are too wide. NAPSU-MQ is the only algorithm that produces properly calibrated confidence intervals, although repeated PGM is able to produce narrower intervals than NAPSU-MQ for some values of $\epsilon$.

Noise-awareness, especially with the increased accuracy from NUTS, comes at a steep computational cost: PGM ran in 15 s, while the Laplace approximation took several minutes, and NUTS took up to ten hours. All of the algorithms were run on 4 CPU cores of a cluster. The complete set of runtimes for all algorithms and values of $\epsilon$ is shown in Table S1 of the Supplement.

Figure 4: Top row: the fraction of downstream coefficients where the synthetic confidence interval contains the real data coefficient, averaged over 20 repeated runs. NAPSU-MQ is the only algorithm that is consistently around the diagonal, showing good calibration. The error bands are bootstrapped 95% confidence intervals of the average. Bottom row: confidence interval widths divided by real data confidence interval widths. Each bar is a median of medians over the different coefficients and repeats, and the black lines are 95% bootstrapped confidence intervals. The dashed line is at 1, showing where synthetic confidence intervals have the same width as original confidence intervals.

6 Limitations and Conclusion

Limitations

While our general pipeline NA+MI is applicable to all kinds of datasets in principle, the data generation algorithm NAPSU-MQ is currently only applicable to discrete tabular data and only supports sparse marginal queries perturbed with the Gaussian mechanism as input. We aim to generalise NAPSU-MQ to more general query classes, such as linear queries, in the future, but supporting other types of noise is likely much harder.

The Gaussian mechanism adds noise uniformly to all query values, so queries with small values are less accurate. This may reduce accuracy for groups with rare combinations of data values, such as minorities. We plan to examine ways to add noise proportionally to the magnitude of the query values to fix this.

Although we left query selection outside the scope of this paper, selecting the right queries to support downstream analysis is important, as NA+MI cannot guarantee confidence interval coverage if the selected queries do not contain enough information for the downstream task. We plan to study whether existing methods giving confidence bounds on query accuracy mckennaAIMAdaptiveIterative2022; nixonLatentClassModeling2022 can be adapted to give confidence intervals for arbitrary downstream analyses.

The runtime of NAPSU-MQ, especially when using NUTS, is another major limitation. As NAPSU-MQ is compatible with any non-DP posterior sampling method, recent hoffmanAdaptiveMCMCSchemeSetting2021 and future advances in MCMC and other sampling techniques are likely able to cut down on the runtime.

Building on the exponential family framework of bernsteinDifferentiallyPrivateBayesian2018 limits NAPSU-MQ to generating synthetic data based on query values, but it may be possible to add noise-awareness to other types of DP Bayesian inference methods, like DP variational inference jalkoDifferentiallyPrivateVariational2017 or DP MCMC heikkilaDifferentiallyPrivateMarkov2019; yildirimExactMCMCDifferentially2019, which would enable extending NA+MI to many other types of datasets.

Conclusion

The analysis of DP synthetic data has not received much attention in existing research. Our work patches a major hole in the current generation and analysis methods by developing the NA+MI pipeline, which allows computing accurate confidence intervals and p-values from DP synthetic data. We develop the NAPSU-MQ algorithm in order to implement NA+MI on nontrivial discrete datasets. While NAPSU-MQ has several limitations, NA+MI only depends on noise-aware posterior inference, not NAPSU-MQ specifically, and can thus be extended to other settings in the future. With the noise-aware inference algorithm, NA+MI allows conducting valid statistical analyses, including uncertainty estimates, with DP synthetic data, potentially unlocking existing privacy-sensitive datasets for widespread analysis.

Acknowledgements

This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; and grants 325572, 325573), the Strategic Research Council at the Academy of Finland (Grant 336032) as well as UKRI Turing AI World-Leading Researcher Fellowship, EP/W002973/1. The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources.

Appendix A Proof of Theorem 3

Theorem 3 (restated). Let $a$ be the concatenation of $k$ full sets of marginal queries. Then $\Delta_2 a \le \sqrt{2k}$.

Proof.

Let $a_1, \dots, a_k$ be the full sets of marginal queries that form $a$. Because all of the queries of $a_i$ have the same set of variables, the vector $a_i(x)$ has a single component of value 1, and the other components are 0, for any datapoint $x$. Then, for any neighbouring $X \sim X'$ that differ in the datapoints $x$ and $x'$, $a(X) - a(X') = a(x) - a(x')$. Then
$$\|a(X) - a(X')\|_2 = \|a(x) - a(x')\|_2 \quad (7)$$
$$= \sqrt{\sum_{i=1}^{k} \|a_i(x) - a_i(x')\|_2^2} \quad (8)$$
$$\le \sqrt{\sum_{i=1}^{k} 2} \quad (9)$$
$$= \sqrt{2k}. \quad (10)$$
∎

Appendix B Multiple Imputation

In order to compute uncertainty estimates for downstream analyses from the noise-aware posterior with NA+MI, we use Rubin’s rules for synthetic data raghunathanMultipleImputationStatistical2003; reiter2002satisfying.

After the synthetic datasets $X^{Syn}_i$ for $i = 1, \dots, m$ are released by the data holder, the data analyst runs their downstream analysis on each $X^{Syn}_i$. For each synthetic dataset, the analysis produces a point estimate $q_i$ and a variance estimate $u_i$ for $q_i$.

The estimates $q_i$ and $u_i$ are combined as follows raghunathanMultipleImputationStatistical2003:
$$\bar q_m = \frac{1}{m}\sum_{i=1}^{m} q_i, \qquad b_m = \frac{1}{m-1}\sum_{i=1}^{m} (q_i - \bar q_m)^2, \qquad \bar u_m = \frac{1}{m}\sum_{i=1}^{m} u_i. \quad (11)$$

We use $\bar q_m$ as the combined point estimate, and set
$$T_m = \left(1 + \frac{1}{m}\right) b_m - \bar u_m, \quad (12)$$
an estimate of the combined variance. $T_m$ can be negative, which is corrected by using a non-negative adjusted estimate in its place reiter2002satisfying.

We compute confidence intervals and hypothesis tests using the $t$-distribution with mean $\bar q_m$, variance $T_m$, and degrees of freedom
$$\nu = (m - 1)\left(1 - \frac{\bar u_m}{(1 + 1/m)\, b_m}\right)^2 \quad (13)$$
reiter2002satisfying.
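A sketch of these combining rules as code, assuming univariate estimates; the clamping of a negative $T_m$ is a crude stand-in for the adjusted estimator of reiter2002satisfying:

    import numpy as np
    from scipy import stats

    def rubins_rules_ci(estimates, variances, level=0.95):
        # Combine per-dataset point and variance estimates, Eqs. (11)-(13).
        q = np.asarray(estimates)
        u = np.asarray(variances)
        m = len(q)
        q_bar = q.mean()             # combined point estimate
        b = q.var(ddof=1)            # between-dataset variance b_m
        u_bar = u.mean()             # average within-dataset variance
        T = (1 + 1 / m) * b - u_bar  # combined variance, Eq. (12)
        T = max(T, 1e-12)            # crude guard against negative T_m
        dof = (m - 1) * (1 - u_bar / ((1 + 1 / m) * b)) ** 2  # Eq. (13)
        half_width = stats.t.ppf(0.5 + level / 2, dof) * np.sqrt(T)
        return q_bar - half_width, q_bar + half_width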

These combining rules apply when $q$ is a univariate estimate. reiter2005significance derives appropriate combining rules for multivariate estimates, which can also be applied with NA+MI.

Rubin’s rules make many assumptions on the different distributions that are involved raghunathanMultipleImputationStatistical2003; si2011comparison, such as the normality of the distribution of the point estimate when sampling data from the population. These assumptions may not hold for some types of estimates, such as probabilities marshallCombiningEstimatesInterest2009 or quantile estimates zhouNoteBayesianInference2010. Further work gelmanBayesianDataAnalysis2014; si2011comparison tries to reduce these assumptions, especially in the context of missing data. Their results for synthetic data generation can be applied with our method.

si2011comparison propose to remove some of these assumptions by approximating the integral that Rubin’s rules are derived from by sampling instead of using the analytical approximations in (11) and (12). They find that their sampling-based approximation can be effective, especially with a small number of datasets, but is computationally more expensive.

In the missing data context, when the downstream task uses Bayesian inference, gelmanBayesianDataAnalysis2014 propose to mix the samples of each downstream posterior and use the mixed posterior for inferences, which does not require the normality assumptions of Rubin’s rules. This is restricted to Bayesian downstream tasks and was originally proposed for missing data, but it may be applicable to synthetic data and our method.

B.1 Unbiasedness of Rubin’s Rules

Rubin’s rules make several assumptions on the downstream analysis method, and several normal approximations are made when deriving the rules. raghunathanMultipleImputationStatistical2003 derive conditions under which Rubin’s rules give an unbiased estimate.

Rubin’s rules aim to estimate a quantity $Q$ of the entire population $P$, of which $X$ is a sample. Conceptually, the sampling of the synthetic datasets is done in two stages: first, synthetic populations $P_i$ for $i = 1, \dots, m$ are sampled. Second, a synthetic dataset $X^{Syn}_i$ is sampled from $P_i$. This is equivalent to the sampling process for $X^{Syn}_i$ described in Section 2, and makes stating the assumptions of Rubin’s rules easier.

Let $Q_i$ denote the quantity of interest computed from the synthetic population $P_i$ instead of $P$. Let $V_i$ denote the sampling variance of $q_i$ from the synthetic population $P_i$. Let $q_i$ and $u_i$ be the point and variance estimates of $Q_i$ when sampling $X^{Syn}_i$ from the population $P_i$. The assumptions are:

(i) For all $i$, $q_i$ is unbiased for $Q_i$ and asymptotically normal with respect to sampling from the synthetic population $P_i$, with sampling variance $V_i$.

(ii) For all $i$, $u_i$ is unbiased for $V_i$, and the sampling variability in $u_i$ is negligible; that is, $u_i \approx V_i$. Additionally, the variation in $V_i$ across the synthetic populations is negligible.

(iii) The generation of the synthetic populations does not bias the downstream analysis: $Q_i$ is unbiased for $Q$ over the sampling of the synthetic populations.

Assumptions (i) and (ii) ensure that the downstream analysis method used to estimate $Q_i$ is accurate, for both point and variance estimates, when applied to real data, regardless of the population.

Assumption (iii) requires that the generation of synthetic datasets does not bias the downstream analysis. For query-based methods like NAPSU-MQ, it may not hold when the queries do not contain the relevant information for the downstream task.

With Assumptions (i)-(iii), raghunathanMultipleImputationStatistical2003 show that $\bar q_m$ is an unbiased estimate of $Q$, and $T_m$ is an asymptotically unbiased variance estimate.

[raghunathanMultipleImputationStatistical2003] Assumptions (i)-(iii) imply that

  1. $E(\bar q_m) = Q$,

  2. $E(T_m) \approx \mathrm{Var}(\bar q_m)$,

  3. Asymptotically, $(\bar q_m - Q) / \sqrt{T_m} \sim \mathcal{N}(0, 1)$,

  4. For moderate $m$, $(\bar q_m - Q) / \sqrt{T_m} \sim t_\nu$ with $\nu$ given by (13) reiter2002satisfying.

Appendix C Finding an Identifiable Parameterisation

In this section, we describe the process we use to ensure that the parameterisation of the posterior in NAPSU-MQ is identifiable. We ensure identifiability by dropping some of the selected queries, chosen using the canonical parameterisation of Markov networks to ensure that no information is lost. First, we give some background on Markov networks, which is necessary to understand the canonical parameterisation.

Markov Networks

A Markov network is a representation of a probability distribution that is factored according to an undirected graph. Specifically, a Markov network distribution $p$ is a product of factors. A factor $\phi_S$ is a function from a subset $S$ of the variables to the non-negative real numbers. The subset of variables is called the scope of the factor. The joint distribution is given by
$$p(x) = \frac{1}{Z} \prod_{S \in \mathcal{S}} \phi_S(x_S), \quad (14)$$
where $\mathcal{S}$ is the set of scopes for the factors and $Z$ is the normalising constant. The undirected graph is formed by representing each variable as a node, and adding edges such that the scope of each factor is a clique in the graph.

Canonical Parametrisation

The canonical parametrisation is given in terms of canonical factors abbeelLearningFactorGraphs2006. The canonical factors depend on an arbitrary assignment of the variables $\bar x$; we simply choose $\bar x = (0, \dots, 0)$. In the following, $x_U$ denotes the selection of the components in the set $U$ from the vector $x$, and $\bar x_{-U}$ denotes the selection of all components of $\bar x$ except those in $U$.

A canonical factor with scope $D$ is defined as
$$f^*_D(x_D) = \exp\left(\sum_{U \subseteq D} (-1)^{|D \setminus U|} \ln p(x_U, \bar x_{-U})\right).$$
The sum is over all subsets of $D$, including $D$ itself and the empty set. $|D \setminus U|$ is the size of the set difference of $D$ and $U$.

[abbeelLearningFactorGraphs2006 (Theorem 3)] Let $p$ be a Markov network with factor scopes $\mathcal{S}$. Let $\mathcal{S}^* = \{U \mid U \subseteq S \text{ for some } S \in \mathcal{S}\}$. Then
$$p(x) = \prod_{D \in \mathcal{S}^*} f^*_D(x_D).$$

There are more canonical factors than original factors, so it might seem that the canonical parametrisation has more parameters than the original parametrisation. However, many values in the canonical factors turn out to be ones. We can select the queries corresponding to the non-one canonical factor values to obtain a set of queries with the same information as the original queries, but without linear dependencies koller2009probabilistic. We call this set of queries the canonical queries.

Many of the canonical factor scopes are subsets of the original factor scopes, so using the canonical queries as-is would introduce new marginal query sets and potentially increase the sensitivity of the queries. As all of the new queries are sums of existing queries, we can replace each new query with the old queries that sum to it, and use the same parameter value for all of the added queries to preserve identifiability. If one of the added queries was already included, it does not need to be added again, because two instances of a single query can be collapsed into a single instance with its own parameter value. Because of this, we did not need to fix the values of any queries to the same value in the settings we studied.

Appendix D Hyperparameters

NAPSU-MQ

The hyperparameters of NAPSU-MQ are the choice of prior, the choice of inference algorithm, and the parameters of that algorithm. For the toy data experiment, we used the Laplace approximation for inference, which approximates the posterior with a Gaussian centered at the maximum a posteriori (MAP) estimate. We find the MAP for the Laplace approximation with the LBFGS optimisation algorithm, which we run until the per-iteration improvement in the loss falls below a small tolerance, up to a maximum of 500 iterations. Sometimes LBFGS failed to converge, which we detect by checking whether the loss increased by over 1000 in one iteration, and fix by restarting the optimisation from a different starting point. We also restarted if the maximum number of iterations was reached without convergence. For almost all runs, no restarts were needed, and at most 2 were needed.
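A minimal JAX sketch of a Laplace approximation for a generic negative log-posterior; BFGS here stands in for the LBFGS setup described above, and the target density is a placeholder:

    import jax
    import jax.numpy as jnp
    from jax.scipy.optimize import minimize

    def laplace_approximation(neg_log_posterior, theta0):
        # Gaussian approximation N(mean, cov), centered at the MAP estimate,
        # with covariance from the inverse Hessian at the mode.
        result = minimize(neg_log_posterior, theta0, method="BFGS")
        mean = result.x
        cov = jnp.linalg.inv(jax.hessian(neg_log_posterior)(mean))
        return mean, cov

    # Placeholder target: standard Gaussian negative log-density.
    neg_log_post = lambda t: 0.5 * jnp.sum(t ** 2)
    mean, cov = laplace_approximation(neg_log_post, jnp.ones(2))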

For the Adult experiment, we used NUTS hoffmanNoUTurnSamplerAdaptively2014. We ran 4 chains of 800 warmup samples and 2000 kept samples, and set the maximum tree depth of NUTS to 12. We normalised the posterior using the mean and covariance from the Laplace approximation. For the Laplace approximation, we used the same hyperparameters as with the toy dataset, except that we set the maximum number of iterations to 6000.

For the prior, we used a Gaussian distribution with mean 0 and standard deviation 10 for all components, without dependencies between components, for both experiments.

PGM and Repeated PGM

PGM finds the parameters that minimise the $L_2$-error between the expected query values and the noisy query values. The PGM implementation offers several algorithms for this optimisation problem, but we found that the default algorithm and number of iterations work well for both experiments.

RAP

RAP minimises the error on the selected queries of a continuous relaxation of the discrete synthetic dataset. After the optimal relaxed synthetic dataset is found, a discrete synthetic dataset is constructed by sampling. This gives two hyperparameters that control the size of the synthetic data: the size of the continuous dataset, and the number of samples for each datapoint in the continuous relaxation. We set the size of the continuous dataset to 1000 for both experiments, as recommended by the paper aydoreDifferentiallyPrivateQuery2021. For the Adult data experiment, we set the number of samples per datapoint to 46, so that the total size of the synthetic dataset is close to the size of the original dataset. For the toy data experiment, we set the number of samples per datapoint to 50. The RAP paper aydoreDifferentiallyPrivateQuery2021 finds that much smaller values are sufficient, but higher values should only increase accuracy.

In both cases, we weight the synthetic datapoints before the downstream logistic regression so that their total weight matches the original data size, ensuring that the logistic regression does not over- or underestimate variances because of a sample size different from the original data.

RAP also has two other hyperparameters that are relevant in our experiments: the number of iterations and the learning rate for the query error minimisation. After preliminary runs, we set the learning rate to 0.1 for both experiments, and set the number of iterations to 5000 for the toy data experiment and 10000 for the Adult data experiment.

PEP

PEP has two hyperparameters: the number of iterations used to find a maximum entropy distribution with approximately correct query values, and the allowed bound on the difference of the query values. The PEP implementation hardcodes the allowed difference to 0. We set the number of iterations to 1000 after preliminary runs for both experiments.

PrivLCM

PrivLCM samples the posterior of a Bayesian latent class model, where the number of classes is limited to make inference tractable. The model has hyperparameters for the prior and for the number of latent classes. We leave the prior hyperparameters at their defaults, and set the number of latent classes to 10, which the PrivLCM authors used in a 5-dimensional binary data experiment nixonLatentClassModeling2022. The remaining hyperparameter of PrivLCM is the number of posterior samples that are obtained. To keep the runtime of PrivLCM manageable, we set the number of samples to 500, after ensuring that the lower number of samples did not degrade the accuracy of the estimated probabilities of the joint distribution compared to using the default of 5000 samples.

Appendix E Adult Experiment Details

We include the columns Age, Workclass, Education, Marital Status, Race, Gender, Capital gain, Capital loss, Hours per week and Income of the Adult dataset, and discard the rest to remove redundant columns and keep computation times manageable. We discretise Age and Hours per week into 5 buckets, and discretise Capital gain and Capital loss into binary values indicating whether their value is positive. The Income column is binary from the start, and indicates whether a person's income exceeds $50,000 per year.

In the downstream logistic regression, we use income as the dependent variable, and Age, Race and Gender as independent variables. Age is transformed back to a continuous value for the logistic regression by picking the middle value of each discretisation bucket. We did not use all variables for the downstream task, as a smaller set of variables allows including the relevant marginals for synthetic data generation.

For the input queries, we include the 2-way marginals between Hours per week and each of the independent variables Age, Race, Gender and income, as well as the 2-way marginal between Race and Gender. The rest of the queries were selected with the MST algorithm mckennaWinningNISTContest2021a, using a separate privacy budget that we do not include in our figures, as we focus on the synthetic data generation, not query selection. The selected queries are shown in Figure S1. The selection is very stable: in 100 repeats of query selection, these queries were selected 99 times.

We chose the number of generated synthetic datasets for NAPSU-MQ and the number of repeats for repeated PGM by comparing the results of the Adult experiment for different choices. The results are shown in Figure S3 for NAPSU-MQ and Figure S2 for repeated PGM. We chose the setting with slightly better calibration than the other values for NAPSU-MQ, and the number of repeats with the best calibration overall for repeated PGM.

Figure S1: Markov network of selected queries for the Adult experiment. Each edge in the graph represents a selected 2-way marginal.

Appendix F Extra Results

Figure S2: Comparison of different numbers of repetitions for repeated PGM. We chose the best-calibrated number of repeats to represent repeated PGM in the main experiment, although the differences between the numbers of repeats are small.
Figure S3: Comparison of different numbers of generated synthetic datasets for NAPSU-MQ. The differences are small, but one number of synthetic datasets has the best calibration, so we chose it for the main experiment.
Figure S4: Ablation Study Confidence Intervals
Algorithm       Epsilon   Mean runtime      Standard deviation
LA              0.1       2 min 53 s        18.5 s
LA              0.3       3 min 53 s        29.4 s
LA              0.5       3 min 38 s        35.0 s
LA              1.0       3 min 25 s        25.5 s
NUTS            0.1       9 h 59 min 6 s    6506 s
NUTS            0.3       7 h 33 min 28 s   2701 s
NUTS            0.5       4 h 57 min 40 s   3185 s
NUTS            1.0       3 h 51 min 34 s   1274 s
PEP             0.1       6 min 50 s        25.4 s
PEP             0.3       7 min 18 s        31.2 s
PEP             0.5       7 min 0 s         33.1 s
PEP             1.0       7 min 7 s         33.7 s
PGM             0.1       15 s              0.5 s
PGM             0.3       17 s              1.5 s
PGM             0.5       15 s              0.4 s
PGM             1.0       15 s              0.6 s
PGM-repeat-10   0.1       2 min 35 s        3.3 s
PGM-repeat-10   0.3       2 min 53 s        13.0 s
PGM-repeat-10   0.5       2 min 37 s        5.0 s
PGM-repeat-10   1.0       2 min 36 s        4.4 s
PGM-repeat-20   0.1       5 min 15 s        10.9 s
PGM-repeat-20   0.3       5 min 58 s        28.4 s
PGM-repeat-20   0.5       5 min 10 s        10.2 s
PGM-repeat-20   1.0       5 min 13 s        12.6 s
PGM-repeat-5    0.1       1 min 17 s        2.7 s
PGM-repeat-5    0.3       1 min 28 s        6.7 s
PGM-repeat-5    0.5       1 min 18 s        2.6 s
PGM-repeat-5    1.0       1 min 18 s        1.9 s
RAP             0.1       32 s              2.4 s
RAP             0.3       34 s              2.2 s
RAP             0.5       32 s              2.1 s
RAP             1.0       31 s              2.1 s
Table S1: Runtimes of each inference run. Does not include the time taken to generate synthetic data, or run any downstream analysis. The LA rows record the runtime for obtaining the Laplace approximation for NAPSU-MQ that is used in the NUTS inference, so the total runtime for a NAPSU-MQ run with NUTS is the sum of the LA and NUTS rows. Experiments were run on 4 CPU cores of a cluster.