As data storage and analysis become more cost-effective, and data become more complex and unstructured, there is a growing need to share large datasets for research and learning purposes. This is in stark contrast to the previous statistical model, where a data curator would hold datasets and answer queries from (potentially external) analysts. Sharing entire datasets allows analysts the freedom to perform their analyses in-house with their own devices and toolkits, without having to pre-specify the analyses they wish to perform. However, datasets are often proprietary or sensitive, and cannot be shared directly. This motivates the need for synthetic data generation, where a new dataset is created that shares the same statistical properties as the original data. These data may not be of a single type (all binary, all categorical, or all real-valued); instead they may be of mixed type, containing data of multiple types in a single dataset. These data may also be unlabeled, requiring techniques for unsupervised learning, which is typically a more challenging task than supervised learning on labeled data.
Privacy challenges naturally arise when sharing highly sensitive datasets about individuals. Ad hoc anonymization techniques have repeatedly led to severe privacy violations when sharing “anonymized” datasets. Notable examples include the Netflix Challenge [NS08], the AOL Search Logs [nyt], and Massachusetts State Health data [Ohm10], where linkage attacks to publicly available auxiliary datasets were used to reidentify individuals in the dataset. Even deep learning models have been shown to inadvertently memorize sensitive personal information such as Social Security Numbers during training [carlini2019secret].
Differential privacy (DP) [DMNS06] (formally defined in Section 2) has become the de facto gold standard of privacy in the computer science literature. Informally, it bounds the extent to which an algorithm can depend on a single datapoint in its training set. This guarantee ensures that any differentially privately learned models do not overfit to individuals in the database, and therefore cannot reveal sensitive information about individuals. It is an information-theoretic notion that does not rely on any assumptions about an adversary’s computational power or auxiliary knowledge. Furthermore, it has been shown empirically that training machine learning models with differential privacy protects against membership inference and model inversion attacks [triastcyn2018generating, carlini2019secret]. Differentially private algorithms have been deployed at large scale in practice by organizations such as Apple, Google, Microsoft, Uber, and the U.S. Census Bureau.
Much of the prior work on differentially private synthetic data generation has been either theoretical algorithms for highly structured classes of queries [BLR08, HR10] or based on deep generative models such as Generative Adversarial Networks (GANs) or autoencoders. These architectures have been primarily designed for either all-binary or all-real-valued datasets, and have focused on the supervised setting, where datapoints are labelled.
In this work we introduce the DP-auto-GAN framework, which combines the low-dimensional representation of autoencoders with the flexibility of GANs. This framework can be used to take in raw sensitive data, and privately train a model for generating synthetic data that should satisfy the same statistical properties as the original data. This learned model can be used to generate arbitrary amounts of publicly available synthetic data, which can then be freely shared due to the post-processing guarantees of differential privacy. We implement this framework on both unlabeled binary data (for comparison with previous work) and unlabeled mixed-type data. We also introduce new metrics for evaluating the quality of synthetic mixed-type data, particularly in unsupervised settings, and empirically evaluate the performance of our algorithm according to these metrics on two datasets.
1.1 Our Contributions
In this work, we provide three main contributions: a new algorithmic framework for privately generating synthetic data, new evaluation metrics for measuring the quality of synthetic data in unsupervised settings, and empirical evaluations of our algorithmic framework using our new metrics, as well as standard metrics.
Algorithmic Framework. We propose a new data generation architecture which combines the versatility of an autoencoder [kingma2013auto] with the recent success of GANs on complex data. Our model extends previous autoencoder-based DP data generation [abay2018privacy, chen2018differentially] by removing the assumption that the distribution of the latent space follows a mixture of Gaussians. Instead, we incorporate GANs into the autoencoder framework so that the generator must learn the true latent distribution against the discriminator. We describe the composition analysis of differential privacy when the training consists of optimizing both autoencoders and GANs (with different noise parameters). Furthermore, in this analysis we halve the noise injected into the autoencoder relative to all existing works while provably maintaining the same mathematical privacy guarantee.
Unsupervised-Learning Evaluation Metric of Synthetic Data. We define several new metrics that evaluate the performance of synthetic data compared to the original data when the data is of mixed type. Previous metrics in the literature are applicable only to all-binary or all-real-valued datasets. Our new metrics generalize the previously used metrics [choi2017generating, xie2018differentially] from all-binary data to mixed-type data by training various learning models to predict each feature from the rest of the data in order to assess correlation between features. In addition, our metrics do not require a particular feature to be specified as a label, and therefore do not assume a supervised-learning nature of the data, as much of the previous work does [papernot2016semi, papernot2018scalable, jordon2018pate].
Empirical Results. We empirically evaluate the performance of our algorithmic framework on the MIMIC-III medical dataset [johnson2016mimic] and UCI ADULT Census dataset [uci_ml] using previously studied metrics in the literature [frigerio2019differentially, xie2018differentially]. Our experiments show that our algorithms perform better, and allow significantly improved values of the privacy parameter $\epsilon$, compared to prior work [xie2018differentially]. We evaluate our synthetic data using new quantitative and qualitative metrics, confirming that the performance of our algorithm remains high even for small values of $\epsilon$, corresponding to strong privacy guarantees. Our code is made publicly available for future use and research.
1.2 Related Work on Differentially Private Data Generation
Early work on differentially private synthetic data generation was focused primarily on theoretical algorithms for solving the query release problem of privately and accurately answering a large class of pre-specified queries on a given database. It was discovered that generating synthetic data on which the queries could be evaluated allowed for better privacy composition than simply answering all the queries directly [BLR08, HR10, HLM12, GGHRW14]. Bayesian inference has also been used for differentially private data generation [zhang2017privbayes, ping2017datasynthesizer] by estimating the correlation between features. See surendra2017review for a survey of techniques used in private synthetic data generation through 2016.
In 2016, abadi2016deep introduced a framework for training deep learning models with differential privacy. Non-convex optimization, which is required when training deep models, can be made differentially private by adding Gaussian noise to a clipped (norm-bounded) gradient in each training step. abadi2016deep also introduced the moment accountant privacy analysis for private stochastic gradient descent, which provided much tighter Gaussian-based privacy composition and allowed for significant improvements in accuracy over previously used composition techniques, such as advanced composition [DRV10]. The moment accountant was later defined in terms of Renyi Differential Privacy (RDP) [mironov2017renyi], which is a slight variant of differential privacy designed for easy composition, particularly for differentially private stochastic gradient descent (DP-SGD). Much of the work that followed on private data generation used deep (neural-network-based) generative models to generate synthetic data, and can be broadly categorized into two types: autoencoder-based and GAN-based. Our algorithmic framework is the first to combine both DP GANs and autoencoders into one framework.
Differentially Private Autoencoder-Based Models. A variational autoencoder (VaE) [kingma2013auto] is a generative model that compresses high-dimensional data to a smaller space called the latent space. The compression is commonly achieved through deep models and can be differentially privately trained [chen2018differentially, acs2018differentially]. VaE makes the (often unrealistic) assumption that the latent distribution is Gaussian. acs2018differentially uses a restricted Boltzmann machine (RBM) to learn the latent Gaussian distribution, and abay2018privacy uses expectation maximization to learn a Gaussian mixture. Our work extends this line of work by additionally incorporating GANs, a generative model that has also been shown to be successful in learning latent distributions.
Differentially Private GANs. GANs are a generative model proposed by GAN14 that have shown success in generating several different types of data [mogren2016c, saito2017temporal, salimans2016improved, jang2016categorical, kusner2016gans, wang2018graphgan]. As with other deep models, GANs can be trained privately using the aforementioned private stochastic gradient descent (formally introduced in Section 2.1). See Appendix C.1 for additional related work on performance improvements for differentially private training of deep models.
Variants of DP GANs have been used for synthetic data generation, including the Wasserstein GAN (WGAN) [arjovsky2017wasserstein, Gulrajani17] and DP-WGAN [uclanesl_dp_wgan, triastcyn2018generating], which use a Wasserstein-distance-based loss function in training; the conditional GAN (CGAN) [mirza2014conditional] and DP-CGAN [torkzadehmahani2019dp], which operate in a supervised (labeled) setting and use labels as auxiliary information in training; and Private Aggregation of Teacher Ensembles (PATE) [papernot2016semi, papernot2018scalable] for the semi-supervised setting of multi-label classification when some unlabeled public data are available (or PATEGAN [jordon2018pate] when no public data are available). Our work focuses on the unsupervised setting, where data are unlabeled and no (relevant) labeled public data are available.
These existing works in differentially private synthetic data generation are summarized in Table 1.
|TYPES||SUBTYPES||ALGORITHMS|
|Deep generative models||GAN||DPGAN [abadi2016deep]; PATEGAN [jordon2018pate]; DP Wasserstein GAN [uclanesl_dp_wgan]; DP Conditional GAN [torkzadehmahani2019dp]; Gumbel-softmax for categorical data [frigerio2019differentially]|
|Deep generative models||Autoencoder||DP-VaE [chen2018differentially, acs2018differentially]; RBM generative models in latent space [acs2018differentially]; Mixture of Gaussian model in latent space [abay2018privacy]|
|Deep generative models||Autoencoder and GAN||Autoencoder and DPGAN (ours)|
|Other models||||SmallDB [BLR08], PMW [HR10], MWEM [HLM12], DualQuery [GGHRW14], DataSynthesizer [ping2017datasynthesizer], PrivBayes [zhang2017privbayes]|
Differentially Private Generation of Mixed-Type Data.
Next we describe the three most relevant recent works on privately generating synthetic data of mixed type. abay2018privacy consider the problem of generating mixed-type labeled data with $k$ possible labels. Their algorithm, DP-SYN, partitions the dataset into $k$ sets based on the labels and trains a DP autoencoder on each partition. Then the DP expectation maximization (DP-EM) algorithm of park17EM is used to learn the distribution in the latent space of encoded data of the given label-class. The main workhorse, the DP-EM algorithm, is designed and analyzed for Gaussian mixture models and more general factor analysis models. chen2018differentially works in the same setting, but replaces the DP autoencoder and DP-EM with a DP variational autoencoder (DP-VaE). Their algorithm assumes that the mapping from real data to the Gaussian distribution can be efficiently learned by the encoder. Finally, frigerio2019differentially used a Wasserstein GAN (WGAN) to generate differentially private mixed-type synthetic data. This type of GAN uses a Wasserstein-distance-based loss function in training. Their algorithmic framework privatized the WGAN using DP-SGD, similar to the previous approaches for image datasets [zhang2018differentially, xie2018differentially]. The methodology of frigerio2019differentially for generating mixed-type synthetic data involved two main ingredients: changing discrete (categorical) data to binary data using one-hot encoding, and adding an output softmax layer to the WGAN generator for every discrete variable.
Our framework is distinct from these three approaches. We use a differentially private autoencoder which, unlike DP-VaE of chen2018differentially, does not require mapping data to a Gaussian distribution. This allows us to reduce the dimension of the problem handled by the WGAN, hence escaping the issues of high-dimensionality from the one-hot encoding of frigerio2019differentially. We also use DP-GAN, replacing DP-EM in abay2018privacy, for learning distributions in the latent encoded space.
NIST Differential Privacy Synthetic Data Challenge. The National Institute of Standards and Technology (NIST) recently hosted a challenge to find methods for privately generating synthetic mixed-type data [NIST2018Match3], using excerpts from the Integrated Public Use Microdata Sample (IPUMS) of the 1940 U.S. Census Data as training and test datasets. Four of the winning solutions have been made publicly available with open-source code [nistcode]. However, all of these approaches are highly tailored to the specific datasets and evaluation metrics used in the challenge, including specialized data pre-processing methods and hard-coding details of the dataset in the algorithm. As a result, they do not provide general-purpose methods for differentially private synthetic data generation, and it would be inappropriate, if not impossible, to use any of these algorithms as a baseline for other datasets such as the ones we consider in this paper.
Evaluation Metrics for Synthetic Data.
Various evaluation metrics have been considered in the literature to quantify the quality of the synthetic data (see charest2011can for a survey). The metrics can be broadly categorized into two groups: supervised and unsupervised. Supervised evaluation metrics are used when there are clear distinctions between features and labels of the dataset, e.g., for healthcare applications, a person’s disease status is a natural label. In these settings, a predictive model is typically trained on the synthetic data, and its accuracy is measured with respect to the real (test) dataset. Unsupervised evaluation metrics are used when no feature of the data can be decisively termed as a label. Recently proposed metrics include dimension-wise probability for binary data [choi2017generating], which compares the marginal distribution of real and synthetic data on each individual feature, and dimension-wise prediction, which measures how closely synthetic data captures relationships between features in the real data. The latter metric was proposed for binary data, and we extend it here to mixed-type data. Recently, NIST2018Match3 used a 3-way marginal evaluation metric which used three random features of the real and synthetic datasets to compute the total variation distance as a statistical score. See Appendix C.2 for more details on both categories of metrics, including Table 2 which summarizes the metrics’ applicability to various data types.
2 Preliminaries on Differential Privacy
In the setting of differential privacy, a dataset consists of individuals’ sensitive information, and two datasets are neighbors if one can be obtained from the other by the addition or deletion of one datapoint. Differential privacy requires that an algorithm produce similar outputs on neighboring datasets, thus ensuring that the output does not overfit to its input dataset, and that the algorithm learns from the population but not from the individuals.
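Definition 1 (Differential privacy [DMNS06]).
For $\epsilon, \delta \geq 0$, an algorithm $\mathcal{M}$ with output range $\mathcal{R}$ is $(\epsilon, \delta)$-differentially private if for any pair of neighboring databases $X, X'$ and any subset $S \subseteq \mathcal{R}$,
$$\Pr[\mathcal{M}(X) \in S] \leq e^{\epsilon} \cdot \Pr[\mathcal{M}(X') \in S] + \delta.$$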
A smaller value of $\epsilon$ implies stronger privacy guarantees (as the constraint above binds more tightly), but usually corresponds with decreased accuracy, relative to non-private algorithms or the same algorithm run with a larger value of $\epsilon$. Differential privacy is typically achieved by adding random noise that scales with the sensitivity of the computation being performed, which is the maximum change in the output value that can be caused by changing a single entry. Differential privacy has strong composition guarantees, meaning that the privacy parameters degrade gracefully as additional algorithms are run on the same dataset. It also has a post-processing guarantee, meaning that any function of a differentially private output will maintain the same privacy guarantees.
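To illustrate noise that scales with sensitivity, the following is a minimal sketch of the standard Laplace mechanism applied to a counting query. The Laplace mechanism is not named in the text above and is used here only as a simple instance; the function names and the toy data are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-DP via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 48]  # toy data, purely illustrative
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives a larger noise scale, which matches the privacy and accuracy trade-off described above.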
2.1 Differentially Private Stochastic Gradient Descent (DP-SGD)
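Training deep learning models reduces to minimizing some (empirical) loss function $f(X; \theta) := \frac{1}{m} \sum_{i=1}^{m} f(x_i; \theta)$ on a dataset $X = \{x_i\}_{i=1}^{m}$. Typically $f$ is a nonconvex function, and a common method to minimize $f$ is by iteratively performing stochastic gradient descent (SGD) on a batch $B$ of sampled data points:
$$g_B := \frac{1}{|B|} \sum_{i \in B} \nabla_\theta f(x_i; \theta), \qquad \theta \leftarrow \theta - \eta \cdot g_B. \qquad (1)$$
The size of $B$ is typically fixed as a moderate number to ensure quick computation of the gradient, while maintaining that $g_B$ is a good estimate of the true gradient $\nabla_\theta f(X; \theta)$.
To make SGD private, abadi2016deep proposed to first clip the gradient of each sample to ensure the $\ell_2$-norm is at most $C$:
$$\mathrm{CLIP}(x, C) := x \cdot \min\left(1, \frac{C}{\|x\|_2}\right).$$
Then multivariate Gaussian noise parametrized by noise multiplier $\psi$ is added before taking an average across the batch, leading to the noisy-clipped-averaged gradient estimate $\hat{g}_B$:
$$\hat{g}_B := \frac{1}{|B|} \left( \sum_{i \in B} \mathrm{CLIP}\left(\nabla_\theta f(x_i; \theta), C\right) + \mathcal{N}(0, C^2 \psi^2 I) \right).$$
The quantity $\hat{g}_B$ is now private and can be used for the descent step in place of $g_B$ in Equation 1.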
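The clip, noise, and average steps can be sketched in plain Python as follows. This is an illustrative sketch under the assumption that per-example gradients are already available as lists of floats; the function names are hypothetical and do not correspond to any library API.

```python
import math
import random

def clip(grad, c):
    """Rescale grad so that its L2 norm is at most c."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def noisy_clipped_gradient(per_example_grads, c, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add N(0, (c * noise_multiplier)^2 I), then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for j, g in enumerate(clip(grad, c)):
            total[j] += g
    sigma = c * noise_multiplier
    batch_size = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]

def dp_sgd_step(theta, per_example_grads, lr, c, noise_multiplier, rng):
    """One descent step using the privatized gradient estimate."""
    g_hat = noisy_clipped_gradient(per_example_grads, c, noise_multiplier, rng)
    return [t - lr * g for t, g in zip(theta, g_hat)]
```

Setting `noise_multiplier` to zero recovers clipped (non-private) SGD, which is a convenient way to check the clipping logic in isolation.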
In general, the descent step can be performed using other optimization methods, such as Adam or RMSProp, in a private manner, by replacing the gradient value within each step. Also, one does not need to clip the individual gradients, but can instead clip the gradient of a group of datapoints, called a microbatch [mcmahan2018general]. Mathematically, the batch $B$ is partitioned into microbatches $B_1, \ldots, B_k$ each of size $r$, and the gradient clipping is performed on the average of each microbatch:
$$\hat{g}_B := \frac{1}{k} \left( \sum_{i=1}^{k} \mathrm{CLIP}\left(\frac{1}{r} \sum_{j \in B_i} \nabla_\theta f(x_j; \theta), C\right) + \mathcal{N}(0, C^2 \psi^2 I) \right),$$
where $C$ is the clipping norm and $\psi$ is the noise multiplier.
Standard DP-SGD corresponds to setting $r = 1$, but setting higher values of $r$ (while holding $|B|$ fixed) significantly decreases the runtime and reduces the accuracy, and does not impact privacy significantly for large datasets. Other clipping strategies have also been suggested. We refer the interested reader to [mcmahan2018general] and Appendix C.1 for more details of clipping and other optimization strategies.
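The microbatch variant can be sketched as below, again assuming that per-example gradients are given as lists of floats and that the batch size is a multiple of the microbatch size. The function name is hypothetical.

```python
import math
import random

def microbatch_noisy_gradient(per_example_grads, r, c, noise_multiplier, rng):
    """Clip the average gradient of each size-r microbatch, add noise once, then average."""
    assert len(per_example_grads) % r == 0, "batch size must be a multiple of r"
    dim = len(per_example_grads[0])
    num_micro = len(per_example_grads) // r
    total = [0.0] * dim
    for k in range(num_micro):
        group = per_example_grads[k * r:(k + 1) * r]
        avg = [sum(g[j] for g in group) / r for j in range(dim)]
        norm = math.sqrt(sum(a * a for a in avg))
        factor = min(1.0, c / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += avg[j] * factor
    sigma = c * noise_multiplier
    return [(t + rng.gauss(0.0, sigma)) / num_micro for t in total]
```

With `r = 1` this reduces to per-example clipping; larger `r` means fewer clipping operations per batch, which is the source of the runtime savings mentioned above.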
The improved moment accountant privacy analysis of abadi2016deep (which has been implemented in TensorFlow Privacy and is widely used in practice) obtains a tighter privacy bound when data are subsampled, as in SGD. This analysis requires independently sampling each datapoint with a fixed probability $q$ in each step. Additional details are also given in Appendix C.1.
The DP-SGD framework (Algorithm 1) is generically applicable to private non-convex optimization. In our proposed model, we use this framework to train the autoencoder and GAN.
2.2 Renyi Differential Privacy Accountant
A variant notion of differential privacy, known as Renyi Differential Privacy (RDP) [mironov2017renyi], is often used to analyze privacy for DP-SGD. A randomized mechanism $\mathcal{M}$ is $(\alpha, \epsilon)$-RDP if for all neighboring databases $X, X'$ that differ in at most one entry,
$$D_\alpha\left(\mathcal{M}(X) \,\|\, \mathcal{M}(X')\right) \leq \epsilon,$$
where $D_\alpha(P \| Q) := \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]$ is the Renyi divergence or Renyi entropy of order $\alpha$ between two distributions $P$ and $Q$. Renyi divergence is better tailored to tightly capture the privacy loss from the Gaussian mechanism that is used in DP-SGD, and is a common analysis tool in the DP-SGD literature. To compute the final $(\epsilon, \delta)$-differential privacy parameters from iterative runs of DP-SGD, one must first compute the subsampled Renyi divergence, then compose privacy under RDP, and then convert the RDP guarantee into DP.
Step 1: Subsampled Renyi Divergence. Given sampling rate $q$ and noise multiplier $\psi$, one can obtain RDP privacy parameters as a function of $\alpha$ for one run of DP-SGD [mironov2017renyi]. We denote this function by $\mathrm{RDP}_{T=1}(\cdot)$, which will depend on $q$ and $\psi$.
Step 2: Composition of RDP. When DP-SGD is run iteratively, we can compose the Renyi privacy parameter across all runs using the following proposition.
Proposition 2 ([mironov2017renyi]).
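If $\mathcal{M}_1, \mathcal{M}_2$ respectively satisfy $(\alpha, \epsilon_1)$-RDP and $(\alpha, \epsilon_2)$-RDP for $\alpha \geq 1$, then the composition of the two mechanisms $(\mathcal{M}_1(X), \mathcal{M}_2(X))$ satisfies $(\alpha, \epsilon_1 + \epsilon_2)$-RDP.
Hence, we can compute RDP privacy parameters for $T$ iterations of DP-SGD as $\mathrm{RDP}_{T}(\cdot) = T \cdot \mathrm{RDP}_{T=1}(\cdot)$.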
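Step 3: Conversion to $(\epsilon, \delta)$-DP. After obtaining an expression for the overall RDP privacy parameter values, any $(\alpha, \epsilon)$-RDP guarantee can be converted into $(\epsilon, \delta)$-DP.
Proposition 3 ([mironov2017renyi]).
If $\mathcal{M}$ satisfies $(\alpha, \epsilon)$-RDP for $\alpha > 1$, then for all $\delta \in (0, 1)$, $\mathcal{M}$ satisfies $\left(\epsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-DP.
Since the privacy parameter $\epsilon$ of RDP is also a function of $\alpha$, this last step involves optimizing for the $\alpha$ that achieves the smallest privacy parameter in Proposition 3.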
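Steps 2 and 3 can be sketched as follows. As a simplifying assumption, the sketch replaces the subsampled bound of Step 1 with the RDP of a plain Gaussian mechanism with sensitivity 1, which is $\alpha / (2\psi^2)$ at order $\alpha$ [mironov2017renyi]. It therefore ignores the amplification from subsampling and is meant only to illustrate composition and conversion. The function names, the grid of orders, and the parameter values are hypothetical.

```python
import math

def gaussian_rdp(alpha, noise_multiplier):
    """RDP at order alpha of one Gaussian-mechanism step (sensitivity 1, no subsampling)."""
    return alpha / (2.0 * noise_multiplier ** 2)

def compose(rdp_one_step, steps):
    """Proposition 2: at a fixed order, RDP parameters add up across steps."""
    return steps * rdp_one_step

def rdp_to_dp(rdp_eps, alpha, delta):
    """Proposition 3: (alpha, rdp_eps)-RDP implies (eps, delta)-DP with this eps."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def best_epsilon(noise_multiplier, steps, delta, orders):
    """Return (eps, alpha) minimizing the converted epsilon over a grid of orders."""
    return min(
        (rdp_to_dp(compose(gaussian_rdp(a, noise_multiplier), steps), a, delta), a)
        for a in orders
    )

orders = [1.0 + x / 10.0 for x in range(1, 1000)]  # alpha in (1, 101)
eps, alpha = best_epsilon(noise_multiplier=4.0, steps=100, delta=1e-5, orders=orders)
```

The grid search over `orders` plays the role of the optimization over $\alpha$ described above; a finer grid gives a slightly smaller `eps`.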
3 Algorithmic Framework
The algorithm takes in $m$ raw data points, and pre-processes these points into $m$ vectors $x_1, \ldots, x_m \in \mathbb{R}^n$ to be read by DP-auto-GAN, where usually $n$ is very large. For example, categorical data may be pre-processed using one-hot encoding, or text may be converted into numerical values. Similarly, the output of DP-auto-GAN can be post-processed from $\mathbb{R}^n$ back to the data’s original form. We assume that this pre- and post-processing can be done based on public knowledge, such as possible categories for qualitative features and reasonable bounds on quantitative features, and therefore does not require privacy.
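A minimal sketch of such pre- and post-processing is given below, assuming one categorical and one numeric feature. The category list, the numeric bounds, and all function names are hypothetical stand-ins for the public knowledge mentioned above.

```python
def one_hot_encode(value, categories):
    """Map a categorical value to a 0/1 vector indexed by the public category list."""
    return [1.0 if value == c else 0.0 for c in categories]

def one_hot_decode(vector, categories):
    """Map a (possibly soft) vector back to the category with the largest entry."""
    best = max(range(len(categories)), key=lambda j: vector[j])
    return categories[best]

def scale(value, low, high):
    """Clamp a numeric value to public bounds and rescale it to [0, 1]."""
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)

def unscale(value, low, high):
    """Map a value in [0, 1] back to the original numeric range."""
    return low + value * (high - low)

EDUCATION = ["HS", "BSc", "MSc", "PhD"]  # hypothetical public category list
AGE_BOUNDS = (0.0, 100.0)                # hypothetical public bounds

def preprocess(record):
    return one_hot_encode(record["education"], EDUCATION) + [scale(record["age"], *AGE_BOUNDS)]

def postprocess(vector):
    k = len(EDUCATION)
    return {"education": one_hot_decode(vector[:k], EDUCATION),
            "age": unscale(vector[k], *AGE_BOUNDS)}
```

Because the category list and bounds are fixed in advance rather than computed from the sensitive data, these maps consume no privacy budget.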
Within the DP-auto-GAN, there are two main components: the autoencoder and the GAN. The autoencoder serves to reduce the dimensionality of the data before it is fed into the GAN. The GAN consists of a generator $G$ that takes in noise $z$ sampled from a distribution $Z$ and produces $G(z) \in \mathbb{R}^d$, and a discriminator $D$ that acts on points in $\mathbb{R}^n$. Because of the autoencoder, the generator only needs to synthesize data based on the latent distribution in $\mathbb{R}^d$, which is a much easier task than synthesizing in the original high-dimensional space $\mathbb{R}^n$. Both components of our architecture, as well as our algorithm’s overall privacy guarantee, are described in the remainder of this section.
3.1 Autoencoder Training
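The autoencoder consists of the encoder $En_\phi : \mathbb{R}^n \to \mathbb{R}^d$ and decoder $De_\theta : \mathbb{R}^d \to \mathbb{R}^n$ parametrized by edge weights $\phi$ and $\theta$, respectively. The architecture of the autoencoder assumes that high-dimensional data $x_i \in \mathbb{R}^n$ can be represented compactly in a low-dimensional space $\mathbb{R}^d$, also called the latent space. The encoder $En_\phi$ is trained to find such low-dimensional representations. We also need the decoder, $De_\theta$, to map this point in the latent space back to $\mathbb{R}^n$. A measure of the information preserved in this process is the error between the decoder’s image and the original $x_i$. Thus a good autoencoder should minimize the distance $\mathrm{dist}(De_\theta(En_\phi(x_i)), x_i)$ for each datapoint $x_i$ and the appropriate distance function dist. Our autoencoder uses binary cross entropy loss: $\mathrm{dist}(x, y) = -\sum_{j=1}^{n} y_{(j)} \log(x_{(j)}) - \sum_{j=1}^{n} (1 - y_{(j)}) \log(1 - x_{(j)})$ (where $x_{(j)}$ is the $j$th coordinate of $x$).
This also motivates a definition of a (true) loss function $\mathbb{E}_{x \sim \mathcal{Z}_X}\left[\mathrm{dist}(De_\theta(En_\phi(x)), x)\right]$ when data are drawn independently from an underlying distribution $\mathcal{Z}_X$. The corresponding empirical loss function when we have access to a sample $\{x_i\}_{i=1}^{m}$ is
$$L_{\mathrm{auto}}(\phi, \theta) := \frac{1}{m} \sum_{i=1}^{m} \mathrm{dist}\left(De_\theta(En_\phi(x_i)), x_i\right). \qquad (2)$$
The task of finding a good autoencoder reduces to optimizing $\phi$ and $\theta$ to yield small empirical loss as in Equation 2.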
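The binary cross entropy reconstruction loss can be sketched as follows, assuming data and reconstructions are lists of floats in $[0, 1]$. The `autoencode` argument is a hypothetical stand-in for the composition of decoder and encoder, and averaging over the sample is an assumption of this sketch.

```python
import math

def bce_dist(reconstruction, original, eps=1e-12):
    """Binary cross entropy between a reconstruction and the original vector."""
    total = 0.0
    for x_j, y_j in zip(reconstruction, original):
        x_j = min(max(x_j, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y_j * math.log(x_j) + (1.0 - y_j) * math.log(1.0 - x_j)
    return total

def empirical_loss(autoencode, data):
    """Average reconstruction loss of `autoencode` over the sample."""
    return sum(bce_dist(autoencode(x), x) for x in data) / len(data)

data = [[1.0, 0.0], [0.0, 1.0]]                       # toy binary data
uninformative = lambda x: [0.5] * len(x)              # ignores its input
loss = empirical_loss(uninformative, data)
```

An autoencoder that ignores its input, as in the toy example, pays $\log 2$ per coordinate; a reconstruction closer to the original gives a smaller value.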
We minimize Equation 2 privately using DP-SGD (described in Section 2.1). Our approach differs from previous work on private training of autoencoders [chen2018differentially, acs2018differentially, abay2018privacy] by not adding noise to the encoder during DP-SGD, whereas previous work adds noise to both the encoder and decoder. This improves performance by reducing the noise injected into the model by half, while still maintaining the same privacy guarantee (see Proposition 5). The full description of our autoencoder training is given in Algorithm 3 in Appendix A. In our DP-auto-GAN framework, the autoencoder is trained first until completion, and is then fixed for the second phase of training the GAN.
3.2 GAN Training
A GAN consists of the generator $G_w$ and discriminator $D_y$, parameterized respectively by edge weights $w$ and $y$. The aim of the generator is to synthesize (fake) data similar to the real dataset, while the aim of the discriminator is to determine whether an input is from the generator’s synthesized data (assigning label 0) or is real data (assigning label 1). The generator is seeded with random noise $z \sim Z$ that contains no information about the real dataset, such as a multivariate Gaussian vector, and aims to generate a distribution that is hard for $D_y$ to distinguish from the real data. In our framework the generator’s output lies in the latent space, so the synthesized datapoint seen by the discriminator is the decoded output $De_\theta(G_w(z))$. Hence, the generator wants to minimize the probability that $D_y$ makes a correct guess, $\mathbb{E}_{z \sim Z}\left[1 - D_y(De_\theta(G_w(z)))\right]$. At the same time, the discriminator wants to maximize its probability of a correct guess when the data is fake, $\mathbb{E}_{z \sim Z}\left[1 - D_y(De_\theta(G_w(z)))\right]$, and when the data is real, $\mathbb{E}_{x \sim \mathcal{Z}_X}\left[D_y(x)\right]$.
We generalize the output of $D_y$ to a continuous range $[0, 1]$, with the value indicating the confidence that a sample is real. We use the zero-sum objective for the discriminator and generator proposed by arjovsky2017wasserstein and motivated by the Wasserstein distance of two distributions. Although their proposed Wasserstein objective cannot be computed exactly, it can be approximated by optimizing the objective:
$$\min_{w} \max_{y} \; O(w, y) := \mathbb{E}_{x \sim \mathcal{Z}_X}\left[D_y(x)\right] - \mathbb{E}_{z \sim Z}\left[D_y(De_\theta(G_w(z)))\right]. \qquad (3)$$
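A Monte Carlo estimate of this kind of objective on a batch can be sketched as below. The discriminator, decoder, and generator are passed in as plain callables; the toy stand-ins at the bottom are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def wasserstein_objective(discriminator, decoder, generator, real_batch, noise_batch):
    """Estimate E[D(x)] - E[D(De(G(z)))] from a real batch and a noise batch."""
    real_term = sum(discriminator(x) for x in real_batch) / len(real_batch)
    fake_term = sum(discriminator(decoder(generator(z))) for z in noise_batch) / len(noise_batch)
    return real_term - fake_term

# Toy stand-ins (hypothetical): the critic scores a point by its mean coordinate,
# the generator maps scalar noise to a 1-D latent point, the decoder duplicates it.
toy_discriminator = lambda x: sum(x) / len(x)
toy_generator = lambda z: [z]
toy_decoder = lambda h: [h[0], h[0]]

value = wasserstein_objective(
    toy_discriminator, toy_decoder, toy_generator,
    real_batch=[[1.0, 1.0], [1.0, 0.0]],
    noise_batch=[0.2, 0.4],
)
```

The discriminator update would ascend this quantity and the generator update would descend it, with only the discriminator touching the real batch.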
We optimize Equation 3 privately using the DP-SGD framework described in Section 2.1. We differ from prior work on DP GANs in that our generator outputs data $G_w(z)$ in the latent space, which needs to be decoded to $De_\theta(G_w(z))$ before being fed into the discriminator $D_y$. The gradient with respect to $w$ is obtained by backpropagation through one more component, the decoder $De_\theta$. Hence, the training of the generator remains totally private because the additional component $De_\theta$ is fixed and never accesses the private data. The full description of our GAN training is given in Algorithm 5 in Appendix A.
At the end of the two-phase training (including autoencoder and GAN), the noise distribution $Z$, trained generator $G_w$, and trained decoder $De_\theta$ are released to the public. The public can then generate synthetic data by sampling $z \sim Z$ to obtain a synthesized datapoint $De_\theta(G_w(z))$, repeating this to obtain a synthetic dataset of any desired size.
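The public sampling procedure can be sketched as follows, assuming the noise distribution is a standard multivariate Gaussian (one of the options mentioned in Section 3.2). The generator and decoder below are hypothetical toy stand-ins for the released trained models.

```python
import math
import random

def sample_synthetic(generator, decoder, noise_dim, size, rng):
    """Draw `size` synthetic datapoints as decoder(generator(z)) with z ~ N(0, I)."""
    dataset = []
    for _ in range(size):
        z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        dataset.append(decoder(generator(z)))
    return dataset

# Toy stand-ins (hypothetical) for the released generator and decoder.
toy_generator = lambda z: [sum(z)]
def toy_decoder(h):
    p = 1.0 / (1.0 + math.exp(-h[0]))
    return [p, 1.0 - p]

synthetic = sample_synthetic(toy_generator, toy_decoder, noise_dim=3, size=5, rng=random.Random(0))
```

No access to the sensitive data is needed at this stage, so the loop can be run for any `size`.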
3.3 Privacy Accounting
Our autoencoder and GAN are trained privately by adding noise to the decoder and discriminator. Since the generator only accesses data through the discriminator’s (privatized) output, the trained parameters of the generator are also private by the post-processing guarantee of differential privacy. Finally, we release the privatized decoder and generator, together with the generator’s noise distribution and the post-processing procedure, both of which are assumed to be public knowledge.
The privacy accounting is therefore required for the two parts that access the real data $X$: training the autoencoder and training the discriminator. In each training procedure, we apply the RDP accountant described in Section 2.2 to analyze the privacy of the DP-SGD training algorithm and to compute the final $(\epsilon, \delta)$-DP bound. Our application of the RDP accountant diverges from the previous literature in two main ways.
First, we do not add noise to the encoder during the autoencoder training, which is contrary to prior work that adds noise to both the encoder and decoder. Our approach of not adding noise to the encoder does not affect the algorithm’s overall privacy guarantees. This claim is stated formally in the following corollary, which follows immediately by instantiating Proposition 5 with the RDP privacy accountant and then composing RDP using Proposition 2.
Corollary 4. If the decoder in DP-auto-GAN is trained privately with RDP privacy parameter $\mathrm{RDP}_{\mathrm{auto}}(\cdot)$ (the encoder can be trained non-privately) and the discriminator in DP-auto-GAN is trained with RDP privacy parameter $\mathrm{RDP}_{\mathrm{GAN}}(\cdot)$, then DP-auto-GAN is RDP with privacy parameter $\mathrm{RDP}_{\mathrm{auto}}(\cdot) + \mathrm{RDP}_{\mathrm{GAN}}(\cdot)$.
Second, the privacy analysis must account for two phases of training, usually with different privacy parameters (due to different batch sampling rates, noise, and number of iterations). One obvious solution is to calculate the $(\epsilon_i, \delta_i)$-DP parameters obtained from each phase $i \in \{1, 2\}$ and compose them to obtain $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-DP using basic composition of differential privacy [DMNS06]. However, we can obtain a tighter privacy bound by composing the privacy at the Renyi divergence level before translating the Renyi divergence into $(\epsilon, \delta)$-DP. In other words, we first apply Proposition 2 to compute the RDP of the two-phase training before applying Proposition 3 to translate RDP into DP, analogous to the approach described in Section 2.2 for RDP composition for DP-SGD. This is the approach highlighted in Corollary 4. In practice, this reduces the privacy parameter by about 30%.
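The difference between the two ways of composing can be sketched numerically. As a simplifying assumption, each phase is modeled as repeated Gaussian-mechanism steps with sensitivity 1 and no subsampling, whose RDP at order $\alpha$ is $T \alpha / (2\psi^2)$ [mironov2017renyi]. The noise multipliers, step counts, and the even split of $\delta$ are hypothetical choices, so the size of the gap here need not match the figure reported above.

```python
import math

def gaussian_rdp(alpha, noise_multiplier, steps):
    """RDP at order alpha of `steps` Gaussian-mechanism releases (sensitivity 1, no subsampling)."""
    return steps * alpha / (2.0 * noise_multiplier ** 2)

def to_dp(rdp_curve, delta, orders):
    """Convert an RDP curve alpha -> eps(alpha) into the best (eps, delta)-DP epsilon on a grid."""
    return min(rdp_curve(a) + math.log(1.0 / delta) / (a - 1.0) for a in orders)

def phase_auto(alpha):
    return gaussian_rdp(alpha, noise_multiplier=3.0, steps=50)   # hypothetical settings

def phase_gan(alpha):
    return gaussian_rdp(alpha, noise_multiplier=5.0, steps=200)  # hypothetical settings

orders = [1.0 + x / 10.0 for x in range(1, 1000)]
delta = 1e-5

# Option 1: convert each phase to DP with delta / 2, then add the epsilons (basic composition).
eps_basic = to_dp(phase_auto, delta / 2, orders) + to_dp(phase_gan, delta / 2, orders)

# Option 2: add the RDP curves first (Proposition 2), then convert once (Proposition 3).
eps_rdp = to_dp(lambda a: phase_auto(a) + phase_gan(a), delta, orders)
```

Under these assumptions `eps_rdp` is smaller than `eps_basic` for the same total $\delta$, which is the qualitative effect described above.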
4 Evaluation Metrics
In this section, we discuss the evaluation metrics that we use in the experiments (described in Section 5) to empirically measure the quality of the synthetic data. Some of these metrics have been used in the literature, while many are novel contributions in this work. The evaluation metrics are summarized in Table 2; our contributions are in bold.
|TYPES||EVALUATION METHODS||DATA TYPE: BINARY||DATA TYPE: CATEGORICAL||DATA TYPE: REAL-VALUED|
|Supervised||Label prediction* [chen2018differentially, abay2018privacy, frigerio2019differentially]||Yes||Yes||Yes|
|Supervised||Predictive model ranking* [jordon2018pate]||Yes||Yes||Yes|
|Unsupervised, prediction-based||Dimension-wise prediction plot*||Yes ([choi2017generating], ours)||Yes||Yes|
|Unsupervised, distributional-distance-based||Dimension-wise probability plot [choi2017generating]||Yes||No||No|
|Unsupervised, distributional-distance-based||3-way feature marginal, total variation distance [NIST2018Match3]||Yes||Yes||Yes|
|Unsupervised, distributional-distance-based||$k$-way feature marginal**||Yes||Yes||Yes|
|Unsupervised, distributional-distance-based||$k$-way PCA marginal**||Yes||Yes||Yes|
|Unsupervised, qualitative||1-way feature marginal (histogram)||Yes||Yes||Yes|
|Unsupervised, qualitative||2-way PCA marginal (data visualization)||||||||
For the first two metrics described below, the dataset should be partitioned into a training set $X_{\mathrm{train}} \in \mathbb{R}^{m_1 \times n}$ and a testing set $X_{\mathrm{test}} \in \mathbb{R}^{m_2 \times n}$, where $m = m_1 + m_2$ is the total number of samples in the real data, and $n$ is the number of features in the data. After training the DP-auto-GAN, we use it to create a synthetic dataset $\tilde{X} \in \mathbb{R}^{m_3 \times n}$, for sufficiently large $m_3$.
Dimension-wise probability. This metric is used when the entire dataset is binary, and it serves as a basic sanity check to verify whether DP-auto-GAN has correctly learned the marginal distribution of each feature. Specifically, it compares the proportion of 1’s (which can be thought of as estimators of the Bernoulli success probability) in each feature of the training set $X_{\mathrm{train}}$ and the synthetic dataset $\tilde{X}$.
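The per-feature comparison can be sketched as follows for datasets stored as lists of binary rows. The toy data and the function name are hypothetical.

```python
def dimension_wise_probability(dataset):
    """Proportion of 1's in each column of a binary dataset (list of equal-length rows)."""
    n_rows = len(dataset)
    return [sum(col) / n_rows for col in zip(*dataset)]

real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]       # toy data
synthetic = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]  # toy data

# One (real, synthetic) pair per feature; points near the diagonal indicate a good match.
pairs = list(zip(dimension_wise_probability(real), dimension_wise_probability(synthetic)))
```

Plotting `pairs` as a scatterplot gives the dimension-wise probability plot, with a perfect match lying on the line $y = x$.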
Dimension-wise prediction. This metric evaluates whether DP-auto-GAN has correctly learned the relationships between features. For the $k$-th feature of the training set and the synthetic dataset, we use the $k$-th column of each dataset as the labels of a classification or regression task, depending on the type of that feature, and the remaining features of each dataset are used for prediction. We train either a classification or regression model and measure goodness of fit based on the model’s accuracy using the following well-known metrics:
Area under the ROC curve (AUROC) score and $F_1$ score for classification: The $F_1$ score of a classifier is defined as
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where precision is the ratio of true positives to all predicted positives (true positives plus false positives), and recall is the ratio of true positives to all actual positives (true positives plus false negatives). The AUROC score is the area under the ROC (receiver operating characteristic) curve, and is only intended for binary data. Both metrics take values in the interval $[0, 1]$, with larger values implying a better fit.
$R^2$ score for regression: The $R^2$ score is defined as
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$
where $y_i$ is the true label, $\hat{y}_i$ is the predicted label, and $\bar{y}$ is the mean of the true labels. This is a popular metric used to measure goodness of fit as well as future prediction accuracy for regression.
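The two scores can be made concrete with a short sketch. It implements the standard binary $F_1$ score and the $R^2$ score directly from their definitions; the function names, the convention of returning 0 when there are no true positives, and the toy labels are assumptions made for illustration. Multi-class features would additionally need an averaging rule, which is not shown here.

```python
def f1_score(y_true, y_pred):
    """Binary F1 score: harmonic mean of precision and recall for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # assumed convention when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def r2_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true labels are equal")
    return 1.0 - ss_res / ss_tot


# Toy labels (hypothetical values).
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])

y = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r2_score(y, y)                       # perfect predictions
r2_mean = r2_score(y, [2.5, 2.5, 2.5, 2.5])       # always predicting the mean
r2_bad = r2_score(y, [4.0, 3.0, 2.0, 1.0])        # worse than predicting the mean
```

The last case shows that, unlike $F_1$ and AUROC, the $R^2$ score is not bounded below by 0: a model that fits worse than the constant mean predictor receives a negative score.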
We also propose the following novel evaluation metrics.
1-way feature marginal. This metric works as a sanity check for real-valued features. We compute histograms of the feature of interest for both real and synthetic data. The quality of the synthetic data with respect to this metric can be evaluated qualitatively through visual comparison of the histograms on real and synthetic data. This can be extended to $k$-way feature marginals and made into a quantitative measure by adding a distance measure between the histograms.
2-way PCA marginal. This metric generalizes the 3-way marginal score used in [NIST2018Match3]. In particular, we compute the principal components of the original data and form a projection operator onto the first two principal components. Let $P$ denote the projection matrix such that $RP$ is the projection of the real data $R$ onto its first two principal components. We then compute the projection $SP$ of the synthetic data $S$ and scatterplot the 2-D points of $RP$ and $SP$ for visual evaluation. For quantitative evaluation, we also compute the Wasserstein distance between $RP$ and $SP$. In the simulations described in Section 5, we used Wasserstein distance since we optimize for the WGAN objective, but any distributional divergence metric can be used. This approach can also be extended to $k$-way marginals by taking the projection matrix onto the first $k$ principal components.
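A minimal sketch of the projection step, assuming NumPy is available: the principal directions are fitted on the real data only, and the same projection is then applied to the synthetic data. Centering both datasets with the real-data mean is an assumption of this sketch, and the function name and random toy data are hypothetical.

```python
import numpy as np


def pca_projection_matrix(real, k=2):
    """Return (mean, P), where the columns of the n x k matrix P are the
    top-k principal directions of `real`."""
    mean = real.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(real - mean, full_matrices=False)
    return mean, vt[:k].T


# Toy data (hypothetical): 5 features with very different scales.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
real = rng.normal(size=(200, 5)) * scales
synthetic = rng.normal(size=(300, 5)) * scales

mean, P = pca_projection_matrix(real, k=2)
real_2d = (real - mean) @ P        # projection of the real data
synth_2d = (synthetic - mean) @ P  # same projection applied to synthetic data
```

The two arrays `real_2d` and `synth_2d` can then be scatterplotted side by side, or passed to a distributional distance of choice.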
Distributional distance. In this metric, we first compute the Wasserstein distance between the entire real and synthetic datasets . The Wasserstein score is then defined as
where the Wasserstein distance is normalized by the maximum distance possible of two datapoints in data universe . To compute the Wasserstein score on -way marginal PCA projection , we normalize the score with additional term , where
is the explained variance of:
For more details about implementation of these new evaluation metrics, their generalizations and relationships among them, we refer the reader to Appendix C.2.
In this section we present details of our datasets and show empirical results of our experiments. Throughout our experiments, we fix $\delta$ for training DP-auto-GAN and show results for different values of $\epsilon$, including $\epsilon = \infty$ (i.e., a non-private GAN), which serves as a benchmark. We also compare our results with existing works in the literature where relevant. Details of hyper-parameters and architecture can be found in the appendix. The code of our implementation is available at https://github.com/DPautoGAN/DPautoGAN.
5.1 Binary Data
First, we consider the MIMIC-III dataset [johnson2016mimic] which is a publicly available dataset consisting of medical records of 46K intensive care unit (ICU) patients over 11 years old. This is a binary dataset with 1071 features.
Even though our DP-auto-GAN framework can handle mixed-type data, we first evaluate it on the MIMIC-III dataset, which is all binary, since this dataset has been used in similar non-private [choi2017generating] and private [xie2018differentially] GAN frameworks. We use the same evaluation metrics used in these papers. First we plot dimension-wise probability for DP-auto-GAN run on this dataset.
As shown in Figure 2, the proportion of 1's in the marginal distribution for is similar on the real and synthetic datasets for and , because nearly all points fall close to the line $y = x$. The performance of DP-auto-GAN is affected marginally for , which can be noticed by the increased variance of points along the line $y = x$. For , DP-auto-GAN is unable to accurately learn the marginal distributions in the real data, as many of the features in the synthetic dataset have a much higher proportion of 0's. This trend in performance is expected for smaller values of $\epsilon$, which correspond to stronger privacy guarantees. We note that our results are significantly stronger than the ones obtained in [xie2018differentially] with because we obtain dramatically better performance with $\epsilon$ values that are two orders of magnitude smaller. For visual performance comparison, see Figures 4 and 5 of [xie2018differentially].
The $x$ and $y$ coordinates of each point represent the AUROC score of a logistic regression classifier trained on the real and synthetic datasets, respectively. The line $y = x$ corresponds to ideal performance.
Figure 3 shows the plots of dimension-wise prediction using DP-auto-GAN for different values of $\epsilon$. As shown in the figure, for , many points are concentrated along the lower side of the line $y = x$, which indicates that the AUROC score of the real dataset is only marginally higher than that of the synthetic dataset. For and , there is a gradual shift downwards relative to the line $y = x$, with larger variance in the plotted points. This indicates that the AUROC scores of real and synthetic data differ more for smaller values of $\epsilon$. The plot for shows the same trend, but has noticeably fewer datapoints plotted. This is because many features in the synthetic data under this small $\epsilon$ value have a high proportion of 0's, so the logistic regression classifier trained on these features uniformly outputs 0 on the hold-out test dataset. In such cases, the AUROC score is 1/2 by default and as such does not have any meaning, so we drop those features from the plot. The plots of dimension-wise prediction with these points included are given in Figure 7 in Appendix B.2, along with training specifications of DP-auto-GAN on the MIMIC-III dataset in Appendix B.1.
5.2 Mixed Data
Second, we consider the ADULT dataset [uci_ml] which is an extract of the U.S. Census and contains information about working adults. This dataset has 14 features out of which 10 features are categorical and four are real-valued.
Figure 4 shows the dimension-wise prediction plot of DP-auto-GAN on this dataset. For categorical features (represented by blue points and a single green point), we use a random forest classifier in order to compare our results with [frigerio2019differentially]. For real-valued features (represented by red points), we use a lasso regression model. The green point corresponds to the salary feature of the data, which is real-valued but treated as binary based on the condition salary > 50K, which is similarly used as a binary label in [frigerio2019differentially]. We use the $F_1$ score as our classification accuracy measure for categorical features in Figure 4, and the $R^2$ score as our regression accuracy measure for real-valued features. The $F_1$ score is preferred over the AUROC score for the ADULT dataset because it has many non-binary features, for which AUROC cannot be used. Each point in Figure 4 corresponds to one feature, and its $x$ and $y$ coordinates show the accuracy score on the real data and the synthetic data, respectively.
Similar to the MIMIC-III dataset, we see that for large values of $\epsilon$, points are scattered close to the line $y = x$, and as $\epsilon$ gets smaller, points gradually shift downward, implying that the accuracy of synthetic data decreases with stronger privacy guarantees. For the salary feature, we also compute accuracy scores for comparison with [frigerio2019differentially]. In Table 3, we report the accuracy of each synthetic dataset as well as benchmark accuracy. The results reported in [frigerio2019differentially] use , whereas our algorithms used parameter values . We see that our accuracy guarantees are higher than those of [frigerio2019differentially] with smaller $\epsilon$ values, and DP-auto-GAN achieved higher accuracy in the non-private setting.
Note that in the ADULT dataset, we have four real-valued features (age, capital gain, capital loss, and hours worked per week), but there are not four red points in each plot of Figure 4. While AUROC for the binary features is always supported on $[0, 1]$, the $R^2$ score for real-valued features can be negative if the predictive model is poor, and these values fell outside the range of Figure 4. As $\epsilon$ decreased, corresponding to stronger privacy and hence diminished accuracy, fewer red points are observed in Figure 4. We were not able to find a regression model with good fit (as measured by $R^2$ score) for the latter three features (capital gain, capital loss, and hours worked per week) in terms of the other features, even on the real data. We attempted several different approaches, ranging from simple regression models such as lasso to complex models such as neural networks, and all had a low $R^2$ score on both the real and synthetic data. The capital gain and capital loss attributes are inherently hard to predict because the data are sparse (mostly zero) in these attributes.
Since the $R^2$ score did not prove to be a good metric for these features, we instead plotted 1-way feature marginal histograms for each of these three remaining features to check whether the marginal distribution was learned correctly. These 1-way histograms are shown in Figure 5. The figure shows that DP-auto-GAN identifies the marginal distribution of capital gain and capital loss quite well, and it does reasonably well on the hours-per-week feature.
In order to understand the combined performance of all features, we use two metrics. First, we show the qualitative results from the 2-way PCA marginal score in Figure 6. A close qualitative inspection of the plots shows the similarity of trends between the plot for the real dataset and the plots for different values of $\epsilon$, as low as . We can turn this qualitative measure into a quantitative one by evaluating the Wasserstein distributional distance between the synthetic and real data, shown in Table 4. We measure this distance both on the 2-way PCA marginal distribution and on the full dataset. Computing the exact Wasserstein distance can be computationally expensive; in practice, we uniformly sample datapoints from the real and synthetic (projected) data to compute the Wasserstein distance. This sampling and distance computation are repeated several times, and the average of the distances over all iterations is used as the final Wasserstein distance.
|Method||2-way PCA score||Whole-data score|
We proposed a method called DP-auto-GAN for differentially private synthetic data generation. This method combines the efficient low-dimensional representation of variational autoencoders with the flexibility and versatility of GANs. Relative to prior work on differentially private autoencoders, we show that it suffices to only train the decoder privately, which allows the noise from privacy to be reduced by a factor of 2.
We show how this framework can be used to privately learn a model for generating synthetic data. Once trained, this model can be used to generate arbitrary amounts of synthetic data that enjoy the same privacy guarantees, due to the post-processing property of differential privacy. This method can be used for mixed-type data, which includes binary, categorical, and real-valued data.
We introduce a number of new metrics for evaluating the quality of mixed-type synthetic data, particularly in unsupervised settings. We then evaluate the performance of our DP-auto-GAN algorithm on two datasets (one all-binary and one mixed-type data) using our new metrics as well as existing metrics from the literature. We show that DP-auto-GAN performs better than existing techniques in terms of the privacy-accuracy tradeoff for a wide variety of accuracy metrics.
Appendix A Algorithm Description and Pseudocode of DP-Auto-GAN
In this appendix, we provide the pseudocode of the subroutines in DP-auto-GAN (Algorithm 2): DPTrain for the autoencoder, DPTrain for the discriminator, and Train for the generator. The complete DP-auto-GAN algorithm is specified by the architecture and training parameters of the encoder, decoder, generator, and discriminator.
After initial data pre-processing, the DPTrain algorithm for the autoencoder is run. Details of this training process are fully specified in Algorithm 3. As noted earlier, the decoder is trained privately by clipping the gradient norm and injecting Gaussian noise to obtain a privatized decoder gradient, while the encoder gradient can be used directly because the encoder can be trained non-privately.
The second phase of DP-auto-GAN is to train the GAN. As suggested by [GAN14], the discriminator is trained for several iterations per iteration of generator training. While the discriminator is being trained, the generator is fixed, and vice versa. The discriminator and generator training are described in Algorithms 4 (DPTrain) and 5 (Train), respectively. Since the discriminator receives real data samples as input for training, its training is made differentially private by clipping the norm of the gradient updates and adding Gaussian noise to the gradient. The generator does not use any real data in training (or any functions of the real data that were computed without differential privacy), and hence it can be trained without any need to clip the gradient norm or to inject noise into the gradient.
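The clip-then-noise step used for both the decoder and the discriminator can be sketched as follows. This is the generic DP-SGD pattern with one example per microbatch, written under the assumption that Algorithms 3 and 4 follow it; the function names and toy gradients are hypothetical and do not reproduce the released code.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Scale `grad` down so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_by_l2_norm(grad, clip_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# Toy per-example gradients (hypothetical values).
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)

# With zero noise the result is just the average of the clipped gradients.
noiseless = private_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The generator and the encoder would skip both the clipping and the noise and use the plain averaged gradient.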
Finally, the overall privacy analysis of DP-auto-GAN is done via the RDP accountant for each training phase, composing at the RDP level (as a function of $\alpha$) as described in Corollary 4. After the sum of the RDP privacy parameters is obtained (which is a function of $\alpha$), then for any fixed $\delta$, we optimize over $\alpha$ to get the best $\epsilon$ in Proposition 3. Because the value of $\epsilon$ obtained from Proposition 3 is a convex function of $\alpha$ [van2014renyi, wang2018subsampled], we implement ternary search to efficiently optimize over $\alpha$.
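A minimal sketch of this optimization is given below. It assumes the standard RDP-to-DP conversion $\epsilon = \epsilon_{\mathrm{RDP}}(\alpha) + \log(1/\delta)/(\alpha - 1)$ and, purely for illustration, an RDP curve for repeated Gaussian mechanisms without subsampling; the values of `sigma`, `steps`, and `delta`, as well as all function names, are hypothetical. The actual accountant used for training handles subsampling and is more involved.

```python
import math


def rdp_to_epsilon(alpha, rdp_curve, delta):
    """Convert an RDP guarantee at order alpha into an (epsilon, delta) guarantee."""
    return rdp_curve(alpha) + math.log(1.0 / delta) / (alpha - 1.0)


def ternary_search_min(f, lo, hi, iterations=200):
    """Approximately minimize a convex function f on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2  # the minimizer cannot lie in (m2, hi]
        else:
            lo = m1  # the minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2.0


# Hypothetical setting: `steps` compositions of a Gaussian mechanism with
# noise multiplier `sigma` and no subsampling, so the composed RDP curve is
# steps * alpha / (2 * sigma^2).
sigma, steps, delta = 4.0, 10, 1e-5


def rdp_curve(alpha):
    return steps * alpha / (2.0 * sigma ** 2)


best_alpha = ternary_search_min(
    lambda a: rdp_to_epsilon(a, rdp_curve, delta), 1.0 + 1e-6, 1000.0
)
best_epsilon = rdp_to_epsilon(best_alpha, rdp_curve, delta)
```

For this simple curve the minimizer has a closed form, $\alpha^* = 1 + \sqrt{\log(1/\delta)/c}$ with $c$ equal to `steps / (2 * sigma ** 2)`, which gives a convenient check on the search.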
Proposition 5. DP-auto-GAN trained with a differentially private algorithm on the decoder and a differentially private algorithm on the discriminator (and possibly a non-private algorithm on the encoder) achieves a differential privacy guarantee equivalent to that of the composition of these two private algorithms.
Proof. DP-auto-GAN needs to release only the generator and the decoder as output. Releasing the decoder incurs a privacy cost equal to that of the private algorithm used to train it. The generator accesses the data only through the discriminator, which is trained by a differentially private algorithm, so by post-processing, releasing the generator incurs the same privacy loss as that algorithm. Therefore, releasing both the decoder and the generator incurs the privacy loss of the composition of the two private algorithms. ∎
Proposition 5 is stated more formally using the RDP notion of privacy (where the privacy parameters are a function of $\alpha$) in Corollary 4 in the main body. That corollary follows immediately from Propositions 5 and 2.
Appendix B Additional Experimental Details
B.1 Model and Training Specification of Experiment on MIMIC-III data
The autoencoder was trained via Adam with Beta 1 = 0.9, Beta 2 = 0.999, and a learning rate of 0.001. It was trained on minibatches of size 100 and microbatches of size 1. L2 clipping norm was selected to be the median L2 norm observed in a non-private training loop, set to 0.8157. The noise multiplier was then calibrated to achieve the desired privacy guarantee.
The GAN was composed of two neural networks, the generator and the discriminator. The generator was a simple feed-forward neural network, trained via RMSProp with alpha = 0.99 with a learning rate of 0.001. The discriminator was also a simple feed-forward neural network, also trained via RMSProp with the same parameters. The L2 clipping norm of the discriminator was set to 0.35. The pair was trained on minibatches of size 1,000 and a microbatch size of 1, with 2 updates to the discriminator per 1 update to the generator. Again, the noise multiplier was then calibrated to achieve desired privacy guarantees.
A serialization of the model architectures used in the experiment can be found below.
(0): Linear(in-feature=1071, out-feature=128, bias=True)
(0): Linear(in-feature=128, out-feature=1071, bias=True)
(0): Linear(in-feature=128, out-feature=128)
(2): Linear(in-feature=128, out-feature=128)
(0): Linear(in-feature=1071, out-feature=256, bias=True)
(2): Linear(in-feature=256, out-feature=1, bias=True) )
B.2 Additional MIMIC-III Empirical Results
Here we show Figure 7, which is the full version of Figure 3 (dimension-wise prediction for the MIMIC-III dataset), before cleaning the data by removing features with sparse values of 1. As observed in the figure, smaller $\epsilon$ values cause more features in the synthetic data to have a high proportion of 0's, so the logistic regression classifier trained on these features uniformly outputs 0, causing a default AUROC score of 1/2. A closer inspection of the real data shows that nearly all of those features indeed have very sparse 1's (appearing less than 1% of the time). This suggests that with smaller $\epsilon$ values, the features that always output 0 have been learned accurately with respect to the training set, but may not necessarily generalize to the hold-out test set.
B.3 Model and Training Specification of Experiment on ADULT data
The autoencoder was trained via Adam with Beta 1 = 0.9, Beta 2 = 0.999, and a learning rate of 0.005 for 20,000 minibatches of size 64 and a microbatch size of 1. The L2 clipping norm was selected to be the median L2 norm observed in a non-private training loop, equal to 0.012. The noise multiplier was then calibrated to achieve the desired privacy guarantee.
The GAN was composed of two neural networks, the generator and the discriminator. The generator used a ResNet architecture, adding the output of each block to the output of the following block. It was trained via RMSProp with alpha = 0.99 with a learning rate of 0.005. The discriminator was a simple feed-forward neural network with LeakyReLU hidden activation functions, also trained via RMSProp with alpha = 0.99. The L2 clipping norm of the discriminator was set to 0.022. The pair was trained on 15,000 minibatches of size 128 and a microbatch size of 1, with 15 updates to the discriminator per 1 update to the generator. Again, the noise multiplier was then calibrated to achieve the desired privacy guarantee.
A serialization of the model architectures used in the experiment can be found below.
(0): Linear(in-feature=106, out-feature=60, bias=True)
(2): Linear(in-feature=60, out-feature=15, bias=True)
(0): Linear(in-feature=15, out-feature=60, bias=True)
(2): Linear(in-feature=60, out-feature=106, bias=True)
(0): Linear(in-feature=64, out-feature=64, bias=False)
(0): Linear(in-feature=64, out-feature=64, bias=False)
(0): Linear(in-feature=64, out-feature=15, bias=False)
(0): Linear(in-feature=106, out-feature=70, bias=True)
(2): Linear(in-feature=70, out-feature=35, bias=True)
(4): Linear(in-feature=35, out-feature=1, bias=True) )
Appendix C Additional Related Work
C.1 Differentially Private Training of Deep Models
There are numerous works on optimizing the performance of differentially private GANs, including data partitioning (either by class of labels in supervised setting or a private algorithm) [yu2019differentially, papernot2016semi, papernot2018scalable, jordon2018pate, abay2018privacy, acs2018differentially, chen2018differentially]; reducing the number of parameters in deep models [mcmahan2017learning]; changing the norm clipping for the gradient in DP-SGD during training [mcmahan2017learning, van2018three, thakkar2019differentially]; changing parameters of the Gaussian noise used during training [yu2019differentially]; and using publicly available data to pre-train the private model with a warm start [zhang2018differentially, mcmahan2017learning]. Clipping gradients per-layer of models [mcmahan2018general, mcmahan2017learning] and per-dynamic parameter grouping [zhang2018differentially] are also proposed. Additional details for some of these optimization approaches are given below.
Three ways are known to sample a batch from data in each optimization step. These methods are described in [mcmahan2018general]; we summarize them here for completeness. The first is to sample each individual's data independently with a fixed probability. This sampling procedure is the one used in the analysis of the subsampled moment accountant in [abadi2016deep, mcmahan2018general] and subsampled RDP composition in [mironov2017renyi]. This RDP composition is publicly available in TensorFlow Privacy [TensorflowPrivacy]. We implement this sampling procedure and use TensorFlow Privacy to account for the Rényi divergence during training. Another sampling policy is to sample uniformly at random a fixed-size subset of all datapoints. This achieves a different RDP guarantee, which was analyzed in [wang2018subsampled]. Finally, a common subsampling procedure is to shuffle the data via a uniformly random permutation and take fixed-size batches of consecutive points in shuffled order. The process is repeated after a pass over all datapoints (an epoch). Although this batch sampling is the most common in practice, no subsampled privacy composition is known in this case for the centralized model.
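The three sampling policies can be contrasted in a few lines. This is a minimal sketch over datapoint indices; the function names and parameter values are hypothetical.

```python
import random


def poisson_sample(n, rate, rng):
    """Include each of the n datapoints independently with probability `rate`.
    The resulting batch size is random."""
    return [i for i in range(n) if rng.random() < rate]


def fixed_size_sample(n, batch_size, rng):
    """Draw a uniformly random subset of exactly `batch_size` datapoints."""
    return rng.sample(range(n), batch_size)


def shuffled_batches(n, batch_size, rng):
    """Shuffle once, then cut the permutation into consecutive batches (one epoch)."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]


rng = random.Random(0)
n = 1000

batch_a = poisson_sample(n, rate=0.1, rng=rng)           # size varies around 100
batch_b = fixed_size_sample(n, batch_size=100, rng=rng)  # size is exactly 100
epoch = shuffled_batches(n, batch_size=100, rng=rng)     # 10 disjoint batches
```

Only the first policy matches the independent-inclusion assumption behind the subsampled RDP analysis described above; the other two require different analyses.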
Training deep learning models involves hyperparameter tuning to find good architecture and optimization parameters. This process is also done differentially privately, and the privacy budget must be accounted for. [abadi2016deep] accounts for hyperparameter search using the work of [gupta2010differentially]. [beaulieu2019privacy] uses Report Noisy Max [DR14] to privately select a model with top performance when a model evaluation metric is known. Some work has also been done to account for selecting high-performance models without spending much privacy budget [chaudhuri2013stability, liu2019private]. In our experimental work, we omit the privacy accounting of hyperparameter search, as this is not the focus of our contribution.
C.2 Evaluation Metrics for Synthetic Data
In this section, we review the evaluation schemes for measuring the quality of synthetic data and discuss our contribution of novel metrics in comparison with existing literature. Various evaluation metrics have been considered in the literature to quantify the quality of synthetic data [charest2011can]. Broadly, evaluation metrics can be divided into two major categories: supervised and unsupervised. Supervised evaluation metrics are used when clear distinctions exist between features and labels in the dataset, e.g., for healthcare applications, whether a person has a disease or not could be a natural label. Unsupervised evaluation metrics are used when no feature of the data can be decisively termed as a label. For example, a data analyst who wants to learn a pattern from synthetic data may not know what specific prediction tasks to perform, but rather wants to explore the data using an unsupervised algorithm such as Principal Component Analysis (PCA). Unsupervised metrics can then be divided into three broad types: prediction-based, distributional-distance-based, and qualitative (or visualization-based). We describe supervised evaluation metrics and all three types of unsupervised evaluation metrics below. Metrics in previous work and our proposed metrics are summarized in Table 2 in Section 4.
Supervised evaluation metrics.
The main aim of generating synthetic data in a supervised setting is to best understand the relationship between features and labels. A popular metric for such cases is to train a machine learning model on the synthetic data and report its accuracy on the real test data [xie2018differentially]. [zhang2018differentially] used inception scores on image data with classification tasks. Inception scores were proposed in [salimans2016improved] for images, and they measure both the quality and the diversity of the generated samples. Another metric, used in [jordon2018pate], reports whether the accuracy ranking of different machine learning models trained on the real data is preserved when the same models are trained on the synthetic data. Although these metrics are used for classification in the literature, they can be easily generalized to the regression setting.
Unsupervised evaluation metric, prediction-based.
Rather than measuring accuracy by predicting one particular feature as in the supervised setting, one can predict every individual feature using the rest of the features. A prediction score is therefore computed for each feature, creating a list of dimension-wise (or feature-wise) prediction scores. Good synthetic data should have dimension-wise prediction scores similar to those of the real data. Intuitively, similar dimension-wise prediction scores show that the synthetic data correctly captures the inter-feature relationships in the real data.
One metric of this type is proposed by [choi2017generating] for binary data. Although it was originally proposed for binary data, we extend it to mixed-type data by allowing a variety of predictive models appropriate for each data type present in the dataset. For each feature, we try predictive models on the real dataset in order of increasing complexity until a good accuracy score is achieved. For example, to predict a real-valued feature, we first used a linear model and then a neural network predictor. This ensures that the choice of predictive model is appropriate to the feature. Synthetic data is then evaluated by measuring the accuracy of the same predictive model (trained on the real data) on the synthetic data. Similarly high accuracy scores on synthetic data and real data indicate that the synthetic data closely approximates the real data.
provides an unsupervised Jensen-Shannon score metric, which measures the Jensen-Shannon divergence between the output of a discriminating neural network on the real and synthetic datasets and a Bernoulli random variable with probability 1/2. This metric differs from dimension-wise prediction in that the predictive model (discriminator) is trained over the whole dataset at once, rather than dimension-wise, to obtain a score.
Unsupervised evaluation metric, distributional-distance-based.
Another way to evaluate the quality of synthetic data is to compute a dimension-wise probability distribution, which was also proposed in [choi2017generating] for binary data. This metric compares the marginal distributions of real and synthetic data on each individual feature. Below we survey other metrics in this class that can extend to mixed-type data.
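3-way marginal: Recently, the [NIST2018Match3] challenge used a 3-way marginal evaluation metric in which three random features of the real and synthetic data are used to compute the total variation distance as a statistical score. This process is repeated a few times, and the average score is returned. In particular, the values of each of the three features are partitioned into 100 disjoint bins as follows:
$$b^{D}_{i,k} = \left\lfloor 100 \cdot \frac{x^{D}_{i,k} - x^{\min}_{k}}{x^{\max}_{k} - x^{\min}_{k}} \right\rfloor,$$
where $x^{D}_{i,k}$ is the value of the $i$-th datapoint's $k$-th feature in dataset $D$, for $D$ either the real dataset $R$ or the synthetic dataset $S$, and $x^{\min}_{k}$ and $x^{\max}_{k}$ are respectively the minimum and maximum value of the $k$-th feature in $R$. For example, if $k_1, k_2, k_3$ are the selected features, then the $i$-th data points of $R$ and $S$ are put into bins identified by the 3-tuples $(b^{R}_{i,k_1}, b^{R}_{i,k_2}, b^{R}_{i,k_3})$ and $(b^{S}_{i,k_1}, b^{S}_{i,k_2}, b^{S}_{i,k_3})$, respectively.
Let $B$ be the set of all 3-tuple bins in datasets $R$ and $S$, and let $p^{D}_{b}$ denote the number of datapoints of $D$ in 3-tuple bin $b$, normalized by the total number of data points in $D$. Then, the 3-way marginal metric reports the $\ell_1$-norm of the bin-wise difference of $p^{R}$ and $p^{S}$ as follows:
$$\sum_{b \in B} \left| p^{R}_{b} - p^{S}_{b} \right|.$$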
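The binning and the bin-wise comparison can be sketched as below for an arbitrary tuple of selected features. This is an illustrative sketch, not the official challenge scorer: the equal-width binning over the real data's range, the clamping of out-of-range values into the edge bins, and all names are assumptions.

```python
import math
from collections import Counter


def bin_index(value, lo, hi, n_bins=100):
    """Map `value` to one of n_bins equal-width bins spanning [lo, hi].
    Values outside the range (and the maximum itself) are clamped to an edge bin."""
    if hi == lo:
        return 0
    idx = math.floor(n_bins * (value - lo) / (hi - lo))
    return min(max(idx, 0), n_bins - 1)


def marginal_l1_distance(real, synthetic, features, n_bins=100):
    """L1 distance between the binned marginals of `real` and `synthetic`
    restricted to the selected feature indices."""
    lows = {j: min(row[j] for row in real) for j in features}
    highs = {j: max(row[j] for row in real) for j in features}

    def histogram(data):
        counts = Counter(
            tuple(bin_index(row[j], lows[j], highs[j], n_bins) for j in features)
            for row in data
        )
        return {b: c / len(data) for b, c in counts.items()}

    p_real, p_synth = histogram(real), histogram(synthetic)
    bins = set(p_real) | set(p_synth)
    return sum(abs(p_real.get(b, 0.0) - p_synth.get(b, 0.0)) for b in bins)


# Toy datasets with three features (hypothetical values).
real = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
synthetic = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

same = marginal_l1_distance(real, real, features=(0, 1, 2))
diff = marginal_l1_distance(real, synthetic, features=(0, 1, 2))
```

The score ranges from 0 for identical binned marginals to 2 for marginals with disjoint support; averaging it over several random feature triples gives the repeated version described above.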
Both aforementioned metrics (dimension-wise probability from [choi2017generating] and 3-way marginal from [NIST2018Match3]) involve two steps. First, a projection (or a selection of features) of the data is specified, and second, some statistical distance or visualization of the synthetic and real data in the projected space is computed. Dimension-wise probability for binary data corresponds to projecting data onto each single dimension and visualizing the synthetic and real distributions in the projected space by histograms (for binary data, the histogram can be specified by one single number: the probability of the feature being 1). The 3-way marginal metric first selects a three-dimensional space specified by three features as the space into which data are projected, discretizes the synthetic and real distributions on that space, and then computes a total variation distance between the discretized distributions. Our proposed metrics generalize both steps of designing the metric as follows.
Generalization of Data Projection: One can generalize the selection of three features (3-way marginal) to any $k$ features ($k$-way marginal). However, one can also select $k$ principal components instead of $k$ features. We distinguish these as $k$-way feature marginal (projection onto a space spanned by $k$ feature dimensions) and $k$-way PCA marginal (projection onto a space spanned by the first $k$ principal components of the original dataset). Intuitively, the $k$-way PCA marginal best compresses the information of the real data into a small $k$-dimensional space, and hence is a better candidate for comparing projected distributions.
Generalization of Distributional Distance: Total variation distance can be misleading as it does not encode any information on the distance between the supports of two distributions. In general, one can define any metric of choice (optionally with discretization) on two projected distributions, such as Wasserstein distance which also depends on the distance between the supports of the two distributions.
Distributional Distance: The distance between two distributions can also be computed without any data projections. Computing an exact statistical score on high-dimensional datasets is likely computationally hard. However, we can subsample uniformly at random points from two distributions to compute the score more efficiently, then average this distance over many iterations.
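The subsample-and-average idea can be illustrated with a deliberately simplified sketch restricted to one-dimensional data, where the 1-Wasserstein distance between two equal-size samples is simply the mean absolute difference of their sorted values. Multi-dimensional data would instead require an optimal transport solver, which is not shown; the function names, sample sizes, and toy data are hypothetical.

```python
import random


def wasserstein_1d(xs, ys):
    """Exact 1-Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def subsampled_wasserstein(real, synthetic, sample_size, repeats, rng):
    """Average the distance over `repeats` uniformly random subsamples."""
    total = 0.0
    for _ in range(repeats):
        total += wasserstein_1d(
            rng.sample(real, sample_size), rng.sample(synthetic, sample_size)
        )
    return total / repeats


# Toy 1-D data (hypothetical): the synthetic sample is the real one shifted by 5.
real = [float(i) for i in range(100)]
synthetic = [x + 5.0 for x in real]
rng = random.Random(0)

full = subsampled_wasserstein(real, synthetic, sample_size=100, repeats=3, rng=rng)
estimate = subsampled_wasserstein(real, synthetic, sample_size=20, repeats=50, rng=rng)
```

With the full sample the distance equals the shift exactly, while the subsampled estimate is noisier; the number of repeats trades computation for a more stable average.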
Unsupervised evaluation metric, qualitative.
As described above, dimension-wise probability is a specific application of comparing histograms under binary data. One can plot histograms of each feature (1-way feature marginal) for inspection. In practice, histogram visualization is particularly helpful when a feature is strongly skewed, sparse (majority zero), and/or hard to predict well by predictive models. An example of this occurred when predictive models did not have meaningful predictive accuracy on certain features of the ADULT dataset, making prediction-based metrics inappropriate. Instead, inspection of the histograms of those features on synthetic and real data (as in Figure 5) indicates that the synthetic data replicates those features well.
In addition, the 2-way PCA marginal is a visual representation of data that explains as much variance as possible in a plane, providing a good trade-off between information and ease of visualization on two datasets. This visualization can be augmented with a distributional distance of choice over the two projected distributions to get a quantitative metric.