Privacy-preserving data synthesis (PPDS) is a solution for sharing private data by constructing a generative model while preserving privacy. Differential privacy (DP) Dwork (2006) is a rigorous notion of privacy for releasing statistics, and is used in broad domains and applications Abowd (2018); Amin et al. (2019); Bindschaedler and Shokri (2016); Chaudhuri et al. (2019); Papernot et al. (2016); Schein et al. (2019); Wang and Xu (2019). In recent years, several works have proposed differentially private deep generative models Acs et al. (2018); Jordon et al. (2018); Torkzadehmahani et al. (2019); Xie et al. (2018).
Deep generative models have significantly improved in the past few years. The variational autoencoder (VAE) Kingma et al. (2016); Kingma and Welling (2013) is a likelihood-based model that reconstructs training inputs. A VAE also enables us to generate random samples from its learned representations. We often build a VAE with an appropriate prior distribution to describe the desired properties of the representations (such as encouraging clustering, sparsity, and disentanglement), and introduce a divergence as a regularization term to bring the learned representations close to the prior Alemi et al. (2016); Bengio et al. (2013); Eastwood and Williams (2018); Esmaeili et al. (2019); Mathieu et al. (2019). This paper studies how to learn variational autoencoders with a variety of divergences under differential privacy constraints.
A simple way to build a differentially private VAE is to employ differentially private stochastic gradient descent (DP-SGD) Abadi et al. (2016) in the learning process of a vanilla VAE. The key idea of DP-SGD is that it injects noise into stochastic gradients to give DP guarantees on the learned parameters. The noise scale is designed according to the stochastic gradient's sensitivity, which is the maximal change of the gradient when any single input is modified. To limit the gradient's sensitivity, DP-SGD first decomposes input samples into disjoint smaller groups (i.e., micro-batches). Then, DP-SGD computes a stochastic gradient for each group and clips the norm of the gradient by a constant. However, a misuse of the gradient aggregation can unknowingly cause privacy leakage.
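As a concrete illustration, the clip-average-noise pattern described above can be sketched in a few lines. This is a simplified sketch with numpy, not the full DP-SGD algorithm; the function name and noise placement are our own, with clipping bound C and noise multiplier sigma as discussed below.

```python
import numpy as np

def dp_sgd_step(per_example_grads, C, sigma, rng):
    """One DP-SGD aggregation step (sketch): clip each micro-batch
    gradient to L2 norm at most C, average them, then add Gaussian
    noise calibrated to the clipping bound."""
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    # Noise std sigma * C on the summed gradient corresponds to
    # sigma * C / B on the average over a batch of size B.
    noise = rng.normal(0.0, sigma * C / len(clipped), size=avg.shape)
    return avg + noise
```

With sigma set to 0, the output is simply the average of clipped gradients, whose norm cannot exceed the clipping bound.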
Our contributions are three-fold. First, we reveal that several divergences might increase the stochastic gradients' sensitivity when attached to the loss function. To discover the issues, we present a sensitivity study of the learning process of VAEs based on DP-SGD. Attaching the divergence increases the sensitivity from O(1/B) to O(1) in terms of the batch size B. Consequently, the sensitivity increase degrades the quality of the learned model since it directly amplifies the amount of noise. If we fail to notice the sensitivity increase, the resulting differential privacy guarantee may be weaker than intended.
Second, to solve the above issue, we propose term-wise DP-SGD, which crafts randomized gradients in two different ways tailored to the compositions of the loss terms. The term-wise DP-SGD keeps the sensitivity small even when attaching the divergence. We can therefore build a differentially private VAE with a small amount of noise by our proposed method.
Third, based on the term-wise DP-SGD, we present PriVAE, a general model to learn VAEs with attaching a variety of divergences while satisfying differential privacy. Our experiments demonstrate that our proposed method works well with two pairs of the prior distribution and the divergence.
This paper clarifies how to aggregate gradients in VAEs to satisfy differential privacy while restraining the amount of noise. Although we mainly study differentially private VAEs, these contributions are also of significant importance for other machine learning models that must satisfy differential privacy.
1.1 Related Works
Generative models under differential privacy have been studied over the last decade. Traditional approaches are based on capturing probabilistic models, low-rank structures, and learning statistical characteristics from the original sensitive database Chen et al. (2015); Zhang et al. (2014, 2016). Plausible deniability Bindschaedler et al. (2017) is a privacy metric that extends DP for building generative models.
Several studies have extended DP-SGD McMahan et al. (2017, 2018); Yu et al. (2019). Lee et al. Lee and Kifer (2018) demonstrated that DP-SGD can be improved with adaptive step sizes and careful allocation of privacy budgets between iterations. Bagdasaryan et al. Bagdasaryan et al. (2019) revealed that if the original model is unfair, the unfairness becomes worse once DP is applied.
2.1 Differential Privacy
Differential privacy Dwork (2006, 2011b, 2011a) is a rigorous mathematical privacy definition, which quantitatively evaluates the degree of privacy protection when we publish statistical outputs. The definition of differential privacy is as follows:
Definition 1 ((ε, δ)-differential privacy)
A randomized mechanism M satisfies (ε, δ)-differential privacy if, for any two neighboring inputs D and D′ and any subset of outputs S, it holds that
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.
Practically, we employ a randomized mechanism that ensures differential privacy for a function f. The mechanism perturbs the output of f to cover f's sensitivity, which is the maximum degree of change of f over any pair of neighboring inputs D and D′.
Definition 2 (Sensitivity)
The sensitivity of f for any two neighboring inputs D and D′ is
Δf = max_{D, D′} ‖f(D) − f(D′)‖,
where ‖·‖ is a norm function defined on f's output domain.
Based on the sensitivity of f, we design the scale of noise to ensure differential privacy. The Laplace mechanism and the Gaussian mechanism are well-known standard approaches.
Let M₁, …, M_k be mechanisms satisfying (ε₁, δ₁)-, …, (ε_k, δ_k)-differential privacy, respectively. Then, a mechanism sequentially applying M₁, …, M_k satisfies (Σᵢ εᵢ, Σᵢ δᵢ)-differential privacy. This fact is referred to as composability Dwork (2006). In particular, this composition is called sequential composition.
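The sequential composition rule is simply a coordinate-wise sum of the per-mechanism budgets; a minimal sketch (function name is ours):

```python
def sequential_composition(budgets):
    """Sequential composition: running mechanisms with budgets
    (eps_1, delta_1), ..., (eps_k, delta_k) on the same data yields
    (sum of eps, sum of delta)-differential privacy overall."""
    return (sum(e for e, _ in budgets), sum(d for _, d in budgets))
```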
2.2 Differentially Private SGD

Differentially private stochastic gradient descent (DP-SGD) Abadi et al. (2016) is a useful optimization technique for learning a model f under differential privacy constraints. The key idea of DP-SGD is that it adds noise to stochastic gradients during training to provide differential privacy guarantees on f's parameters θ. To obtain the scale of the noise, DP-SGD limits the L2-sensitivity of the stochastic gradient g by clipping its norm. The gradient clipping that limits the sensitivity up to C is denoted as follows:
ḡ = g / max(1, ‖g‖₂ / C). (3)
In DP-SGD, we compute an empirical loss for each micro-batch that includes only one sample. For each micro-batch, DP-SGD generates its clipped gradient. Based on the clipped gradients, DP-SGD crafts a randomized gradient by averaging the clipped gradients and adding Gaussian noise whose scale is defined by σ and C, where σ is a noise multiplier chosen to satisfy (ε, δ)-DP.
At last, DP-SGD takes a step based on the randomized gradient. Abadi et al. Abadi et al. (2016) also proposed a moments accountant that tracks the privacy loss more tightly than the sequential composition. In the moments accountant, σ has the following relationship against ε and δ (Theorem 1 in Abadi et al. (2016)):
σ ≥ c₂ · q · √(T · log(1/δ)) / ε,
where q is the sampling probability, T is the number of steps, and c₂ is a constant. To compute the privacy loss through the moments accountant, we can utilize the TensorFlow Privacy library.
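The relationship in Theorem 1 above can be evaluated directly. This is an illustrative sketch only: the constant c₂ is unspecified in the theorem (set to 1 here), so the result is a lower bound up to that constant, not a usable privacy accountant.

```python
import math

def sigma_lower_bound(q, T, eps, delta, c2=1.0):
    """Lower bound on the noise multiplier from Theorem 1 of
    Abadi et al. (2016): sigma >= c2 * q * sqrt(T * log(1/delta)) / eps.
    c2 is an unspecified constant; c2=1 is purely illustrative."""
    return c2 * q * math.sqrt(T * math.log(1.0 / delta)) / eps
```

Note how the bound scales: quadrupling the number of steps T doubles the required noise, while doubling ε halves it.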
2.3 Variational Autoencoder
Variational autoencoder (VAE) Kingma and Welling (2013) is a model to learn parametric latent variables by maximizing the marginal log-likelihood of the training data points. A VAE consists of two parts: an inference model q_φ(z|x) for the encoder, and a likelihood model p_θ(x|z) for the decoder.
Variational evidence lower bound.
Introducing an approximate posterior q_φ(z|x) enables us to construct the variational evidence lower bound (ELBO) on the log-likelihood as
log p_θ(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).
To implement the encoder and decoder as neural networks, we need to backpropagate through random sampling. However, gradients do not flow through random samples. To overcome this issue, VAE introduces the reparametrization trick. The trick can be described as z = μ + σ ⊙ ε, where ε ∼ N(0, I). After constructing a VAE, we can generate random samples in two steps: 1) choose a latent vector z ∼ p(z), and 2) generate x by decoding z.
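The reparametrization trick above can be sketched directly; a minimal numpy version (the function name is ours, with log-variance parametrization as is common for VAE encoders):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparametrization trick: z = mu + sigma * eps with
    eps ~ N(0, I), so the randomness is external to the network and
    gradients can flow through mu and log_var."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * np.asarray(log_var)) * eps
```

As the variance shrinks to zero, the sample collapses onto the mean, which is a quick sanity check for the parametrization.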
Attaching a divergence for regularization.
To capture the desired property in the learned representation space of VAEs, we can employ a variety of prior distributions p(z) together with an additional regularization term. We assume an additional regularization term D(q_φ(z), p(z)), that is, a divergence between the aggregate posterior q_φ(z) and the prior p(z). The ELBO with the regularization is described as follows:
E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)) − λ · D(q_φ(z), p(z)),
where λ controls the strength of the regularization.
Several divergences D are difficult to decompose into the micro-batch losses that DP-SGD requires.
3 Sensitivity Analysis
Here we present a sensitivity study of DP-SGD for VAEs with various loss functions to clarify the noise scale required for ensuring differential privacy on the parameters of VAEs.
3.1 Learning VAEs in DP-SGD
Let B = {x₁, …, x_B} be a set of samples randomly selected with sampling probability q. We assume the loss function of a VAE takes the following abstract form:
L(B) = (1/B) Σ_{x_i ∈ B} f(x_i) + g(B), (8)
where f is a function that computes a loss depending only on a single sample x_i, and g is a function that computes a loss value across all samples in the batch B (|B| = B). We call f the sample-wise term and g the batch-wise term. The loss function (8) can also be rewritten as follows:
L(B) = (1/B) Σ_{x_i ∈ B} L_i, L_i = f(x_i) + g(B), (9)
where L_i is a micro-batch loss. In DP-SGD, the stochastic gradient of each L_i is clipped by C as in (3). That means the sensitivity of each clipped gradient is bounded by the constant. At the last step in a batch, we craft a randomized gradient by aggregating the clipped gradients and injecting noise scaled to the sensitivity
to ensure differential privacy. This aggregation has the effect of reducing the variance of the noise. We call the construction of (9) micro aggregation.
Based on the above assumptions, we can see the following series of propositions.
Let ḡ_i be the clipped stochastic gradient of the sample-wise term f(x_i). Since f(x_i) is independent of the other samples in B, changing x_i only modifies its clipped gradient ḡ_i. Thus, the sensitivity is 2C/B.
Let ḡ_i be the clipped stochastic gradient of the batch-wise term g(B) in the i-th micro-batch. While g(B) is shared by all micro-batches i = 1, …, B, the change of any single sample modifies all ḡ_i. Thus, the sensitivity is 2C.
As in the proof of Proposition 2, since g(B) is shared by all micro-batch losses L_i, i = 1, …, B, the change of any single sample modifies all clipped gradients. Thus, the sensitivity is 2C.
From the above three propositions, we reach the following theorem about the sensitivity for learning differentially private VAEs in the DP-SGD manner.
The L2-sensitivity of the stochastic gradient for learning a vanilla VAE is either 2C/B or 2C, depending on how the loss is decomposed into micro-batch losses.
Let ℓ(x_i) be the reconstruction loss (i.e., negative log-likelihood) of x_i. The loss function of the vanilla VAE can be written in the form of (8), with f(x_i) = ℓ(x_i) and the KL term playing the role of g. For this formulation, the sensitivity is 2C from Proposition 3. Fortunately, the KL term can be decomposed as follows:
(1/B) Σ_{x_i ∈ B} KL(q_φ(z|x_i) ‖ p(z)).
Thus we can rewrite the loss in a sample-wise form that does not depend on the other samples:
L(B) = (1/B) Σ_{x_i ∈ B} [ℓ(x_i) + KL(q_φ(z|x_i) ‖ p(z))].
For this sample-wise form, the sensitivity is 2C/B from Proposition 1.
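The sample-wise decomposition of the KL term is concrete for the common case of a diagonal-Gaussian posterior and a standard-normal prior, where the KL has a closed form per sample. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).  Each
    sample's KL depends only on its own encoder outputs (mu, log_var),
    so the KL term decomposes into sample-wise losses."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)
```

Because this quantity never touches other samples in the batch, it belongs to the sample-wise term f and does not raise the sensitivity.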
Suppose a VAE introduces an additional regularization term D(q_φ(z), p(z)), and the regularization term cannot be decomposed into micro-batch losses in which every micro-batch depends on only a single input. Then the sensitivity of DP-SGD for learning the VAE with the regularization is 2C.
On the other hand, DP-SGD is applicable not only to micro-batches but also to the overall batch (i.e., a single micro-batch of size B). When we craft the randomized gradient from the overall batch, the stochastic gradient's sensitivity stays bounded by the clipping. We call this construction batch aggregation. By employing the batch aggregation, we can compute a divergence from all samples in the batch without increasing the sensitivity. However, the batch aggregation also injects a large amount of noise because it lacks the averaging factor that reduces the noise in the micro-batch organization. Therefore, DP-SGD usually organizes micro-batches of size one for crafting the randomized gradient.
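Batch aggregation can be sketched for contrast with the earlier micro aggregation. This is a simplified numpy sketch under our own naming; note that the injected noise is not averaged down by the batch size, which is exactly the drawback described above.

```python
import numpy as np

def batch_aggregation(per_example_grads, C, sigma, rng):
    """Batch aggregation (sketch): average the whole batch's gradients,
    clip that single gradient to norm C, and add noise.  The clipping
    bounds the sensitivity, but the noise is added to one gradient and
    is not divided by the batch size as in micro aggregation."""
    g = np.mean(per_example_grads, axis=0)
    g = g / max(1.0, np.linalg.norm(g) / C)
    return g + rng.normal(0.0, sigma * C, size=g.shape)
```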
3.2 Privacy Leakage
As the above study shows, an ill construction of the randomized gradient, one that aggregates micro-batch losses as in (9) and injects an insufficient scale of noise to cover the increased sensitivity, fails to provide the differential privacy guarantee we expect. In this case, unfortunately, the information of the batch-wise term g(B), which depends on inputs of the whole batch, is leaked. With this leaked sensitive information we might obtain beautiful results, but they are the result of a poor understanding of gradient constructions in the DP-SGD manner.
3.3 Augmentation for Estimating Reconstruction Error
Back in the original VAE Kingma and Welling (2013), the stochastic gradient variational Bayes (SGVB) estimator enables us to compute the ELBO over a single batch as:
(1/B) Σ_{x_i ∈ B} [(1/L) Σ_{l=1}^{L} log p_θ(x_i | z_{i,l}) − KL(q_φ(z|x_i) ‖ p(z))], where z_{i,l} = μ_i + σ_i ⊙ ε_l and ε_l ∼ N(0, I).
In the original VAE, we can set L = 1 if the batch size is large enough (Kingma and Welling (2013) mentioned that L can be set to 1 as long as the minibatch size is large enough). However, DP-SGD assumes micro-batches whose size is 1. In order to accurately estimate the log-likelihood around a single input, we should set L to no small number. Thanks to the gradient clipping (3), the sensitivity is still bounded even when utilizing a large L. Since each micro-batch loss is independent of the other samples, and the stochastic gradient including it is clipped by the constant C, the sensitivity is bounded by 2C/B for any L.
From the above discussion, we can utilize augmentations that reduce the reconstruction error without increasing the sensitivity. However, this consumes much more computation time and memory space.
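The L-sample estimate of the reconstruction term can be sketched as follows. This is a numpy sketch with our own names; decode_log_p is an assumed callable standing in for the decoder's log-likelihood.

```python
import numpy as np

def mc_reconstruction_loss(decode_log_p, mu, log_var, x, L, rng):
    """Monte-Carlo estimate of the reconstruction term with L posterior
    samples for one input.  decode_log_p(z, x) is an assumed callable
    returning log p(x|z).  Increasing L sharpens the estimate, while
    gradient clipping keeps the same sensitivity bound for any L."""
    total = 0.0
    for _ in range(L):
        eps = rng.standard_normal(np.shape(mu))
        z = mu + np.exp(0.5 * np.asarray(log_var)) * eps  # reparametrized sample
        total += decode_log_p(z, x)
    return -total / L  # negative log-likelihood estimate
```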
4 Proposed Method
Based on the sensitivity analysis, we present how to learn differentially private variational autoencoders while suppressing the amount of noise. We first introduce PriVAE, a general model that learns variational autoencoders in a differentially private way. Second, we propose a novel learning technique, term-wise DP-SGD, that reduces the amount of noise for DP by decomposing stochastic gradients into term-wise components. Our proposed method also utilizes the augmentation that attempts to reduce the reconstruction error, as discussed in Section 3.3.
4.1 PriVAE: a general model of differentially private VAE
Our basic idea is to decompose the terms of the loss function into two groups and compose a noisy gradient that ensures DP group by group. For each group, we separately run the gradient aggregation sequence for DP, which consists of computing stochastic gradients, clipping gradients, and adding noise, following the DP-SGD manner.
Towards reducing the amount of noise, we first introduce the notion of partitions. Let {B₁, …, B_k} be a partition of the batch B, where B = B₁ ∪ ⋯ ∪ B_k. Any pair of B_i and B_j (i ≠ j) is mutually disjoint, that is, B_i ∩ B_j = ∅.
Objective function of PriVAE.
4.2 Term-wise DP-SGD
We propose term-wise DP-SGD, which composes noisy gradients for DP in a term-wise way. The term-wise DP-SGD crafts the noisy gradients for the sample-wise terms f and the batch-wise term g separately. In the last phase, term-wise DP-SGD combines these noisy gradients and updates the parameters θ. The overall procedure of term-wise DP-SGD is shown in Algorithm 1.
Gradient aggregation for sample-wise term.
For each sample-wise term f(x_i), we craft its clipped gradient with clip size C₁. We then aggregate the clipped gradients as follows:
(1/B) Σ_{x_i ∈ B} clip(∇f(x_i), C₁). (13)
Gradient aggregation for batch-wise term.
For the batch-wise term g, we first partition B into sub-groups B₁, …, B_k, where B = B₁ ∪ ⋯ ∪ B_k. We then compute g(B_j) for each sub-group j = 1, …, k and aggregate their clipped gradients with clip size C₂ as described below:
(1/k) Σ_{j=1}^{k} clip(∇g(B_j), C₂). (14)
Term-wise noise injections and concatenation.
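Since Algorithm 1 is not reproduced here, the following numpy sketch shows one plausible reading of the term-wise aggregation just described; the function name, exact noise placement, and averaging choices are our assumptions, not the paper's definitive algorithm.

```python
import numpy as np

def termwise_gradient(sample_grads, partition_grads, C1, C2, sigma, rng):
    """Term-wise DP-SGD sketch: sample-wise gradients are clipped to C1
    and averaged over the batch of size B; the batch-wise term's
    gradients, one per partition, are clipped to C2 and averaged over
    the k partitions.  Noise is injected into each part separately
    before the two parts are combined."""
    clip = lambda g, C: g / max(1.0, np.linalg.norm(g) / C)
    B, k = len(sample_grads), len(partition_grads)
    g1 = np.mean([clip(g, C1) for g in sample_grads], axis=0)
    g2 = np.mean([clip(g, C2) for g in partition_grads], axis=0)
    g1 = g1 + rng.normal(0.0, sigma * C1 / B, size=g1.shape)
    g2 = g2 + rng.normal(0.0, sigma * C2 / k, size=g2.shape)
    return g1 + g2
```

Because each part is clipped with its own bound, a single sample's influence on the combined gradient stays small regardless of the batch-wise term.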
We here discuss the privacy guarantee and noise scale of our proposed method.
Term-wise DP-SGD with the noise scale σ satisfies (2ε, 2δ)-differential privacy if DP-SGD with σ satisfies (ε, δ)-differential privacy for a VAE that has no batch-wise terms.
σ is the noise scale that satisfies (ε, δ)-DP for each term. From the sequential composition of the first term and the second term in (16), the sum of the two terms satisfies (2ε, 2δ)-DP.
The L2-sensitivity of (13) is 2C₁/B. That means the sensitivity of the sample-wise part is O(1/B).
Since all B_i and B_j are disjoint, the change of any single sample x influences only f(x) and g(B_j), where x ∈ B_j. Thus, the L2-sensitivity of (13) and (14) is 2C₁/B and 2C₂/k, respectively. Finally, the L2-sensitivity of the combined gradient (16) is 2C₁/B + 2C₂/k.
In (13) and (16), the computation of g(B_j) for each partition results in an under-estimation against g(B), but it brings a reduction of the noise variance for the second term. In (16), the noise can be divided by the number of partitions k. Therefore, we can manipulate the degree of the trade-off between the estimation accuracy of g and the second term's noise scale by k.
Finally, we discuss the noise scale. In the existing method, DP-SGD with a divergence, the overall noise scale is 2σC because of the increased sensitivity, while our term-wise DP-SGD has 2σ(C₁/B + C₂/k) by using the term-wise aggregation. In terms of order, DP-SGD with a divergence has a noise scale of O(1), while our proposed method has O(1/k) since k ≤ B.
Table 1 summarizes the L2-sensitivity of the stochastic gradient and the noise scale of DP-SGD with micro aggregation, DP-SGD with batch aggregation, and our term-wise DP-SGD.
5 Experiments

In this section, we demonstrate the effectiveness of our proposed method PriVAE on two different tasks: a sparse coding task and a clustering task. Each task employs a different prior distribution p(z) and a different divergence D as the regularization term. The experimental settings, including datasets, neural network architectures, construction of prior distributions, regularization divergences, and evaluation metrics, follow the experiments in Mathieu et al. (2019). The experimental code is developed in Python 3.7 and PyTorch 1.5 Paszke et al. (2017) and run on machines with a Tesla V100 GPU.
5.1 Sparse Representations

We first consider sparse representations, in which only a small fraction of the available factors is employed for reconstruction. In this task, we utilize the Fashion-MNIST dataset Xiao et al. (2017). As in Mathieu et al. (2019), we construct a sparse prior as a per-dimension mixture of a standard Gaussian and a narrow Gaussian. This mixture distribution can be interpreted as a mixture of samples being either off or on, whose proportion is set by γ. We set γ = 0.8. The regularization term we utilize here is a dimension-wise MMD with a sum of Cauchy kernels on each dimension. To measure the sparsity of the latent representations, we employ the sparsity metric defined with the Hoyer extrinsic metric Hurley and Rickard (2009) as follows:
Sparsity = (√D − ‖s‖₁ / ‖s‖₂) / (√D − 1),
where s is a vector whose d-th dimensional value is the standard deviation of the d-th dimensional latent encoding taken over the dataset, and D is the dimensionality of the latent space. The metric is 0 for a fully dense vector and 1 for a fully sparse vector.
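The Hoyer metric above is a one-liner to compute; a small numpy sketch (function name is ours):

```python
import numpy as np

def hoyer_sparsity(v):
    """Hoyer extrinsic sparsity of a vector: 0 for a fully dense
    vector (all entries equal in magnitude) and 1 for a fully sparse
    (one-hot) vector."""
    d = v.size
    return (np.sqrt(d) - np.abs(v).sum() / np.linalg.norm(v)) / (np.sqrt(d) - 1.0)
```

In the experiment, v would be the vector of per-dimension standard deviations of the latent encodings over the dataset.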
We use the same convolutional neural networks for both the encoder and decoder as in Mathieu et al. (2019), with a 50-dimensional latent space. In this task, we use the SGD optimizer with =0.05, =0.001, =1, =256, =16, =1 for all privatized models, and =0.005 for PriVAE with the MMD and =0 for PriVAE without it. For non-private VAEs, we use the Adam optimizer with =0.0005, =256. For both VAE and PriVAE, we set =100 when attaching the MMD. We also compare with DP-SGD using micro-agg. and batch-agg. For these methods, we set =0.0002 to avoid exploding gradients. The other hyper-parameters are the same as for PriVAE with MMD. All models are trained for 10 epochs.
Figure 1 shows the sparsity induced by the sparse prior (Figure 1a), the log-likelihood (Figure 1b), and the MMD between q(z) and p(z) (Figure 1c); these results are observed at several privacy parameters ε. We plot the average over ten runs, and the shaded regions show one standard deviation around the averages. In Figure 1a, PriVAE with the regularization (PriVAE+MMD) demonstrates higher sparsity than the model without it. Although there is a gap from the non-private regularized model (VAE+MMD), our proposed model achieves increased sparsity even under differential privacy constraints. For the MMD between q(z) and p(z), PriVAE+MMD shows smaller values than PriVAE without it. By employing the regularization term, PriVAE could obtain sparsity and reduce the MMD, but it was not easy to simultaneously increase the log-likelihood. The trade-off between them appears more significant than in non-private models. To obtain more sparsity, PriVAE needs improved reconstruction performance.
5.2 Clustering Latent Space
Next, we consider a differentially private VAE that imposes clustering on the latent space. For this experiment, we utilize the pinwheel dataset from Johnson et al. (2016), with 400 observations clustered in 4 spirals. Following the experiment in Mathieu et al. (2019), we utilize a mixture of four Gaussians as the prior, a divergence between the aggregate posterior q_φ(z) and the prior as the regularization, and fully-connected neural networks for both the encoder and decoder. The prior is defined over a 2-dimensional latent space with 4 components of variance 0.03 each. We set =0.05, =0.01, =20, =1, =20 for all models, =0.0005, =0 for PriVAE with the regularization and =0, =1 for PriVAE without it.
We compare the clustering performance of PriVAE with and without the regularization term D. Figure 2 shows the reconstructions of the pinwheel data and the (clustered) representations. The first two columns demonstrate the results of PriVAE without D, and the others show those of PriVAE with D. In the figures, the red dots represent the original inputs, the yellow dots are their reconstructions, and the blue dots show the data points in the latent spaces. PriVAE without the regularization demonstrates poor reconstructions of the raw pinwheel clustered data. In contrast, PriVAE with D generates better reconstructions than the model without it, although the generated samples still show small reconstruction errors. The learned representations of PriVAE with the regularization are well clustered and fitted to the prior, the mixture of four Gaussians. Through these results, our proposed model works well with the prior and the regularization term that were intended to capture the clusters of the pinwheel data.
6 Conclusion

This paper studied how to learn variational autoencoders with various divergences under differential privacy constraints. We revealed that several divergences increase the sensitivity of the stochastic gradient from O(1/B) to O(1) in terms of the batch size B. To reduce the sensitivity and the amount of noise, we proposed term-wise DP-SGD, which crafts randomized gradients in two different ways tailored to the compositions of the loss terms. The term-wise DP-SGD keeps the sensitivity small even when attaching the divergence. In our experiments, we demonstrated that our method works well with two pairs of prior distribution and divergence. We mainly studied differentially private VAEs, but these contributions are also of significant importance for other machine learning models required to satisfy differential privacy.
- Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §1, §2.2.
- The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2867–2867. Cited by: §1.
- Differentially private mixture of generative neural networks. IEEE Transactions on Knowledge and Data Engineering 31 (6), pp. 1109–1121. Cited by: §1.
- Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §1.
- Differentially private covariance estimation. In Advances in Neural Information Processing Systems, pp. 14190–14199. Cited by: §1.
- Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems, pp. 15453–15462. Cited by: §1.1.
- Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1.
- Plausible deniability for privacy-preserving data synthesis. Proceedings of the VLDB Endowment 10 (5), pp. 481–492. Cited by: §1.1.
- Synthesizing plausible privacy-preserving location traces. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 546–563. Cited by: §1.
- Capacity bounded differential privacy. In Advances in Neural Information Processing Systems, pp. 3469–3478. Cited by: §1.
- Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138. Cited by: §1.1.
- Differential privacy. In Proceedings of the 33rd international conference on Automata, Languages and Programming-Volume Part II, pp. 1–12. Cited by: §1, §2.1, §2.1.
- A firm foundation for private data analysis. Communications of the ACM 54 (1), pp. 86–95. Cited by: §2.1.
- Differential privacy. Encyclopedia of Cryptography and Security, pp. 338–340. Cited by: §2.1.
- A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, Cited by: §1.
- Structured disentangled representations. The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2525–2534. Cited by: §1.
- Comparing measures of sparsity. IEEE Transactions on Information Theory 55 (10), pp. 4723–4741. Cited by: §5.1.
- Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954. Cited by: §5.2.
- PATE-gan: generating synthetic data with differential privacy guarantees. Cited by: §1.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.3, §3.3, footnote 1.
- Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §1.
- Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1656–1665. Cited by: §1.1.
- Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pp. 4402–4412. Cited by: §1, §5.1, §5.1, §5.2, §5.
- A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210. Cited by: §1.1.
- Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963. Cited by: §1.1.
- Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755. Cited by: §1.
- Automatic differentiation in pytorch. Cited by: §5.
- Locally private bayesian inference for count models. In International Conference on Machine Learning, pp. 5638–5648. Cited by: §1.
-  TensorFlow privacy. Note: https://github.com/tensorflow/privacy Cited by: §2.2.
- DP-cgan: differentially private synthetic data and label generation. Cited by: §1.
- On sparse linear regression in the local differential privacy model. In International Conference on Machine Learning, pp. 6628–6637. Cited by: §1.
- Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.1.
- Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739. Cited by: §1.
- Differentially private model publishing for deep learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 332–349. Cited by: §1.1.
- PrivBayes: private data release via bayesian networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1423–1434. Cited by: §1.1.
- Privtree: a differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 International Conference on Management of Data, pp. 155–170. Cited by: §1.1.