1 Introduction
Privacy-preserving data synthesis (PPDS) addresses the problem of sharing private data by constructing a generative model while preserving privacy. Differential privacy (DP) Dwork (2006) is a rigorous notion of privacy for releasing statistics, and it is used in a broad range of domains and applications Abowd (2018); Amin et al. (2019); Bindschaedler and Shokri (2016); Chaudhuri et al. (2019); Papernot et al. (2016); Schein et al. (2019); Wang and Xu (2019). In recent years, several works have proposed differentially private deep generative models Acs et al. (2018); Jordon et al. (2018); Torkzadehmahani et al. (2019); Xie et al. (2018).
Deep generative models have improved significantly in the past few years. The variational autoencoder (VAE) Kingma et al. (2016); Kingma and Welling (2013) is a likelihood-based model that reconstructs training inputs. A VAE also enables us to generate random samples from its learned representations. We often build a VAE with an appropriate prior distribution to describe the desired properties of the representations (such as encouraging clustering, sparsity, and disentanglement), and introduce a divergence as a regularization term that draws the learned representations close to the prior Alemi et al. (2016); Bengio et al. (2013); Eastwood and Williams (2018); Esmaeili et al. (2019); Mathieu et al. (2019). This paper studies how to learn variational autoencoders with a variety of divergences under differential privacy constraints.
A simple way to build a differentially private VAE is to employ differentially private stochastic gradient descent (DPSGD) Abadi et al. (2016) in the learning process of a vanilla VAE. The key idea of DPSGD is to inject noise into the stochastic gradients so as to give DP guarantees on the learned parameters. The noise scale is designed according to the stochastic gradient's sensitivity, which is the maximal change of the gradient when any single input is modified. To limit the gradient's sensitivity, DPSGD first decomposes the input samples into disjoint smaller groups (i.e., microbatches). Then, DPSGD computes a stochastic gradient for each group and clips the norm of the gradient by a constant. On the other hand, a misuse of the gradient aggregations might unconsciously cause privacy leakage.
Our contributions are threefold. First, we reveal that several divergences might increase the stochastic gradients' sensitivity when attached to the loss function. To discover the issues, we conduct a sensitivity study of the learning process of VAEs based on DPSGD. Attaching such a divergence increases the sensitivity from C to BC in terms of the batch size B, where C is the clipping bound. Consequently, the sensitivity increase degrades the quality of the learned model since it directly amplifies the amount of noise. If, unfortunately, we do not notice the sensitivity increase, we might end up with an insufficient differential privacy guarantee.
Second, to solve the above issue, we propose termwise DPSGD, which crafts randomized gradients in two different ways tailored to the compositions of the loss terms. Termwise DPSGD keeps the sensitivity constant in the batch size even when attaching the divergence. We can therefore build a differentially private VAE with a small amount of noise by our proposed method.
Third, based on the termwise DPSGD, we present PriVAE, a general model for learning VAEs with a variety of attached divergences while satisfying differential privacy. Our experiments demonstrate that our proposed method works well with two pairs of prior distribution and divergence.
This paper clarifies how to aggregate gradients in VAEs so as to satisfy differential privacy while restraining the amount of noise. Although we mainly study differentially private VAEs, these contributions are also of significant importance for other machine learning models that must satisfy differential privacy.
1.1 Related Works
Generative models under differential privacy have been studied over the last decade. Traditional approaches are based on capturing probabilistic models or low-rank structure and on learning statistical characteristics from the original sensitive database Chen et al. (2015); Zhang et al. (2014, 2016). Plausible deniability Bindschaedler et al. (2017) is a privacy metric related to DP for building a generative model.
Several studies have investigated DPSGD McMahan et al. (2017, 2018); Yu et al. (2019). Lee and Kifer (2018) demonstrated that DPSGD can be improved with adaptive step sizes and careful allocation of privacy budgets between iterations. Bagdasaryan et al. (2019) revealed that if the original model is unfair, the unfairness becomes worse once DP is applied.
2 Preliminaries
2.1 Differential Privacy
Differential privacy Dwork (2006, 2011b, 2011a) is a rigorous mathematical privacy definition, which quantitatively evaluates the degree of privacy protection when we publish statistical outputs. The definition of differential privacy is as follows:
Definition 1 ((ε, δ)-differential privacy)
A randomized mechanism M satisfies (ε, δ)-differential privacy if, for any two neighboring inputs D and D′ and any subset of outputs S, it holds that
Pr[M(D) ∈ S] ≤ exp(ε) Pr[M(D′) ∈ S] + δ. (1)
Practically, we employ a randomized mechanism that ensures differential privacy for a function f. The mechanism perturbs the output of f to cover f's sensitivity, which is the maximum degree of change of f over any pair of neighboring inputs D and D′.
Definition 2 (Sensitivity)
The sensitivity of f for any two neighboring inputs D and D′ is
Δf = max_{D, D′} ‖f(D) − f(D′)‖, (2)
where ‖·‖ is a norm function defined on f's output domain.
Based on the sensitivity of f, we design the degree of noise required to ensure differential privacy. The Laplace mechanism and the Gaussian mechanism are well-known standard approaches.
Let M₁, …, Mₖ be mechanisms satisfying (ε₁, δ₁)-, …, (εₖ, δₖ)-differential privacy, respectively. Then, a mechanism that sequentially applies M₁, …, Mₖ satisfies (Σᵢ εᵢ, Σᵢ δᵢ)-differential privacy. This fact is referred to as composability Dwork (2006). In particular, this composition is called sequential composition.
2.2 DPSGD
Differentially private stochastic gradient descent (DPSGD) Abadi et al. (2016) is a useful optimization technique for learning a model under differential privacy constraints. The key idea of DPSGD is to add noise to the stochastic gradients during training in order to obtain differential privacy guarantees on the model's parameters θ. To determine the scale of the noise, DPSGD limits the sensitivity of the stochastic gradient g by clipping its norm. The gradient clipping that limits the sensitivity up to C is denoted as follows:
ḡ = g / max(1, ‖g‖₂ / C). (3)
In DPSGD, we compute an empirical loss for each microbatch, which includes only one sample. For each microbatch, DPSGD generates its clipped gradient ḡᵢ. Based on the clipped gradients, DPSGD crafts a randomized gradient g̃ by computing the average over the clipped gradients and adding noise whose scale is defined by C and σ, where σ is the noise multiplier chosen to satisfy DP:
g̃ = (1/B) (Σᵢ ḡᵢ + N(0, σ²C²I)). (4)
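The clipping of (3) and the noisy averaging of (4) can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the authors' implementation; the function names are ours, and a real training loop would operate on framework tensors rather than Python lists.

```python
import math
import random

def clip(grad, C):
    """Clip a gradient vector to L2 norm at most C, as in (3)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = 1.0 / max(1.0, norm / C)
    return [g * scale for g in grad]

def dpsgd_gradient(per_sample_grads, C, sigma, rng=random):
    """Craft the randomized gradient of (4): sum the clipped
    per-microbatch gradients, add Gaussian noise N(0, sigma^2 C^2 I),
    and divide by the batch size B."""
    B = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip(g, C) for g in per_sample_grads]
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    return [(summed[d] + rng.gauss(0.0, sigma * C)) / B for d in range(dim)]
```

With σ = 0 and gradients already within the clipping bound, the randomized gradient reduces to the plain average, which makes the roles of C and σ easy to check in isolation.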
At last, DPSGD takes a step based on the randomized gradient g̃. Abadi et al. (2016) also proposed the moments accountant, which tracks the privacy loss more tightly than sequential composition. In the moments accountant, σ has the following relationship to ε and δ (Theorem 1 in Abadi et al. (2016)):
σ = c₂ q √(T log(1/δ)) / ε, (5)
where q is the sampling probability, T is the number of steps, and c₂ is a constant. To compute the privacy loss through the moments accountant, we can utilize a useful tool in TensorFlow Privacy [29].
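The asymptotic relationship (5) can be turned into a small helper for building intuition about how σ scales with q, T, ε, and δ. This is only a sketch of the bound's shape: the constant c₂ is unspecified in Theorem 1 of Abadi et al. (2016), so the `c2=1.0` default below is an arbitrary placeholder, and for real accounting one should use the TensorFlow Privacy tooling instead.

```python
import math

def sigma_from_moments_accountant(q, T, eps, delta, c2=1.0):
    """Noise multiplier suggested by (5):
    sigma = c2 * q * sqrt(T * log(1/delta)) / eps.
    c2 is the unknown constant of the theorem (placeholder value here)."""
    return c2 * q * math.sqrt(T * math.log(1.0 / delta)) / eps
```

The helper makes the qualitative behavior visible: the required noise grows with the square root of the number of steps and shrinks linearly in the privacy budget ε.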
2.3 Variational Autoencoder
Variational autoencoder (VAE) Kingma and Welling (2013) is a model that learns parametric latent variables by maximizing the marginal log-likelihood of the training data points. A VAE consists of two parts: an inference model q(z|x) for the encoder, and a likelihood model p(x|z) for the decoder.
Variational evidence lower bound.
Introducing an approximate posterior q(z|x) enables us to construct a variational evidence lower bound (ELBO) on the log-likelihood as
log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z)). (6)
To implement the encoder and decoder as neural networks, we need to backpropagate through random sampling. However, gradients do not flow through random samples. To overcome this issue, VAE introduces the reparametrization trick, described as z = μ + σ ⊙ ε, where ε ~ N(0, I). After constructing a VAE, we can generate random samples in two steps: 1) choose a latent vector z ~ p(z), and 2) generate x̂ by decoding z through the likelihood model p(x|z).
Attaching a divergence for regularization.
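The reparametrization trick above can be sketched as follows. This is a minimal stand-alone illustration (the function name is ours): the randomness is isolated in the auxiliary noise ε, so μ and σ remain deterministic inputs through which an autodiff framework can propagate gradients.

```python
import math
import random

def reparametrize(mu, sigma, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), elementwise.
    Only eps is random; mu and sigma stay on the differentiable path."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

Setting σ to zero recovers the mean exactly, which is a handy sanity check that the noise enters only through the scale term.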
To capture the desired property in the learned representation space of VAEs, we can employ a variety of prior distributions as p(z) together with an additional regularization term. We assume an additional regularization term D(q(z) ‖ p(z)), a divergence between the aggregate posterior q(z) and the prior p(z). The ELBO with the regularization is described as follows:
E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z)) − D(q(z) ‖ p(z)). (7)
Several divergences D(q(z) ‖ p(z)) are difficult to decompose into the microbatch losses that DPSGD requires.
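A concrete example of such a divergence is a sample-based maximum mean discrepancy (MMD) estimator, which the experiments later use with Cauchy kernels. The toy one-dimensional sketch below (our own, not the paper's implementation) shows why it resists decomposition: every pairwise kernel term couples multiple batch elements, so the loss cannot be split into per-sample microbatch losses.

```python
def cauchy_kernel(a, b, gamma=1.0):
    """A simple Cauchy kernel on scalars (illustrative choice)."""
    return gamma / (gamma + (a - b) ** 2)

def mmd2(xs, ys, kernel=cauchy_kernel):
    """Biased squared-MMD estimate between samples xs ~ q(z) and ys ~ p(z).
    The double sums over all pairs make this a batchwise quantity."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

The estimate vanishes when both sample sets coincide and grows when they separate; changing any single element of `xs` perturbs an entire row and column of kernel evaluations.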
3 Sensitivity Analysis
Here we present a sensitivity study of DPSGD for VAEs with various loss functions to clarify the noise scale required to ensure differential privacy on the parameters of VAEs.
3.1 Learning VAEs in DPSGD
Let B = {x₁, …, x_B} be a batch of randomly selected samples with sampling probability q. We assume the loss function of the VAE takes the following abstract form:
L(B) = Σᵢ f(xᵢ) + g(B), (8)
where f is a function that computes a loss depending only on xᵢ, and g is a function that computes a loss value across all samples in the batch (B = |B|). We call f the samplewise term and g the batchwise term. The loss function (8) can also be rewritten as follows:
L(B) = Σᵢ Lᵢ, Lᵢ = f(xᵢ) + g(B)/B, (9)
where Lᵢ is a microbatch loss. In DPSGD, the stochastic gradient of each Lᵢ is clipped by C as in (3). That means the sensitivity of each clipped gradient is bounded by the constant. At the last step of a batch, we craft a randomized gradient by aggregating the clipped gradients and injecting noise whose scale is chosen to ensure differential privacy. This aggregation has the effect of reducing the variance of the noise. We call the construction based on (9) micro aggregation.
Based on the above assumptions, we can derive the following series of propositions.
Proposition 1
Assume L(B) = Σᵢ f(xᵢ) and the stochastic gradient of each f(xᵢ) is clipped by (3) with the constant C. Then the sensitivity of the aggregated gradient is C.
Proof
Let ∇fᵢ be the stochastic gradient of f(xᵢ). Since f(xᵢ) is independent of the other samples xⱼ (j ≠ i), changing xᵢ only modifies its own clipped gradient. Thus, the sensitivity is C.
Proposition 2
Assume L(B) = Σᵢ g(B)/B and the stochastic gradient of each microbatch loss is clipped by (3) with the constant C. Then the sensitivity of the aggregated gradient is BC.
Proof
Let ∇gᵢ be the stochastic gradient of the i-th microbatch loss g(B)/B. Since g(B) is shared by all Lᵢ, i = 1, …, B, the change of a single sample modifies all B clipped gradients. Thus, the sensitivity is BC.
Proposition 3
Assume L(B) = Σᵢ (f(xᵢ) + g(B)/B) and the stochastic gradient of each microbatch loss is clipped by (3) with the constant C. Then the sensitivity of the aggregated gradient is BC.
Proof
As in the proof of Proposition 2, since g(B) is shared by all Lᵢ, i = 1, …, B, the change of a single sample modifies all B clipped gradients. Thus, the sensitivity is BC.
From the above three propositions, we reach the following theorem about the sensitivity for learning differentially private VAEs in the DPSGD manner.
Theorem 1
The sensitivity of the aggregated gradient for learning a vanilla VAE is either BC or C.
Proof
Let ℓ(xᵢ) be the reconstruction loss (i.e., the negative log-likelihood) of xᵢ. The loss function of a vanilla VAE can be written as L(B) = Σᵢ ℓ(xᵢ) + KL(q(z|B) ‖ p(z)). For this formulation, the sensitivity is BC from Proposition 3. Fortunately, the KL term can be decomposed as follows:
KL(q(z|B) ‖ p(z)) = Σᵢ KL(q(z|xᵢ) ‖ p(z)). (10)
Thus we can rewrite the loss in a samplewise form in which each microbatch loss does not depend on the other samples:
L(B) = Σᵢ (ℓ(xᵢ) + KL(q(z|xᵢ) ‖ p(z))). (11)
From Proposition 1, the sensitivity when we utilize (11) is C.
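The contrast between Propositions 1-3 can be checked numerically with a toy one-dimensional loss. The functions below are hypothetical stand-ins of our own (a samplewise gradient equal to the sample itself, plus an optional batchwise mean term); the point is only to count how many clipped microbatch gradients change when a single input changes.

```python
def clip_scalar(g, C):
    """1-D norm clipping: restrict g to [-C, C]."""
    return max(-C, min(C, g))

def microbatch_grads(batch, C, batchwise=False):
    """Clipped per-microbatch gradients of a toy loss.
    Samplewise part: gradient x per sample. Optional batchwise
    part: the batch mean, shared by every microbatch loss."""
    g_batch = sum(batch) / len(batch) if batchwise else 0.0
    return [clip_scalar(x + g_batch, C) for x in batch]

def num_changed(batch_a, batch_b, C, batchwise):
    """Count microbatch gradients that differ between neighboring batches."""
    ga = microbatch_grads(batch_a, C, batchwise)
    gb = microbatch_grads(batch_b, C, batchwise)
    return sum(1 for a, b in zip(ga, gb) if a != b)
```

For two batches differing in one sample, the samplewise loss changes exactly one clipped gradient (sensitivity proportional to C), while the batchwise term propagates the change to every microbatch (sensitivity proportional to BC), mirroring Propositions 1 and 2.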
Lemma 1
Suppose a VAE introduces an additional regularization term D(q(z) ‖ p(z)) that cannot be decomposed into microbatch losses, each of which depends on only a single input. Then the sensitivity of DPSGD for learning the VAE with the regularization is BC.
On the other hand, DPSGD is applicable not only to microbatches but also to the overall batch (a single microbatch of size B). When we craft the randomized gradient from the overall batch, the stochastic gradient's sensitivity stays at C. We call this construction batch aggregation. By employing batch aggregation, we can compute a divergence over all samples in the batch without increasing the sensitivity. However, batch aggregation also injects a large amount of noise because it lacks the averaging factor that reduces the noise in the microbatch organization. Therefore, DPSGD typically organizes microbatches of size one for crafting the randomized gradient.
3.2 Privacy Leakage
As the above study shows, ill-formed constructions of the randomized gradient, namely those that aggregate microbatch losses containing a batchwise term g but inject an insufficient scale of noise to cover the increased sensitivity, fail to provide the differential privacy guarantee that we expect. In this case, unfortunately, the information of g, which depends on the inputs of the whole batch, is leaked. With this leaked sensitive information we might obtain beautiful results, but they would be the product of a poor understanding of gradient constructions in the DPSGD manner.
3.3 Augmentation for Estimating Reconstruction Error
Back in the original VAE Kingma and Welling (2013), the stochastic gradient variational Bayes (SGVB) estimator enables us to compute the ELBO over a single batch as:
Σᵢ ( (1/L) Σ_{l=1}^{L} log p(xᵢ | z^{(i,l)}) − KL(q(z|xᵢ) ‖ p(z)) ), z^{(i,l)} ~ q(z|xᵢ). (12)
In the original VAE, we can set L = 1 if the batch size is large enough.^{1} However, DPSGD assumes microbatches of size 1. In order to accurately estimate the log-likelihood around xᵢ, we should set L to a reasonably large number. Thanks to the gradient clipping (3), the sensitivity is still bounded by C even when utilizing a large L: since each microbatch loss is independent of the other samples, and the stochastic gradient including the L-sample estimate is clipped by the constant C, the sensitivity is bounded by C for any L.
^{1}Kingma and Welling (2013) mentioned that L can be set to 1 as long as the minibatch size is large enough, e.g., 100.
From the above discussion, we can utilize augmentations that reduce the reconstruction error without increasing the sensitivity. However, they consume much more computational time and memory space.
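The key point of Section 3.3, that the clipped microbatch gradient stays bounded by C no matter how many reparametrized samples L are averaged, can be demonstrated with a toy sketch. The per-sample "gradient" here is a hypothetical stand-in (the input plus sampling noise); only the clipping behavior matters.

```python
import math
import random

def clip(grad, C):
    """L2 clipping as in (3)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / max(1.0, norm / C) for g in grad]

def microbatch_grad_with_L_samples(x, L, C, rng=random):
    """Toy microbatch gradient averaging L reparametrized samples.
    Regardless of L, the clipped norm never exceeds C."""
    avg = sum(x + rng.gauss(0.0, 1.0) for _ in range(L)) / L
    return clip([avg], C)
```

Averaging more samples reduces the variance of the estimate, while the clipping bound, and hence the sensitivity, is unaffected by L.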
4 Proposed Method
Based on the sensitivity analysis, we present how to learn differentially private variational autoencoders while suppressing the amount of noise. We first introduce PriVAE, a general model that learns variational autoencoders in a differentially private way. Second, we propose a novel learning technique, termwise DPSGD, which reduces the amount of noise required for DP by decomposing stochastic gradients into termwise components. Our proposed method also utilizes the augmentation that attempts to reduce the reconstruction error, as discussed in Section 3.3.
4.1 PriVAE: a general model of differentially private VAE
Our basic idea is to decompose the terms of the loss function into two groups and to compose a noisy gradient that ensures DP group by group. For each group, we separately run the gradient aggregation sequence for DP, which consists of computing stochastic gradients, clipping gradients, and adding noise, following the DPSGD manner.
Towards reducing the amount of noise, we first introduce the notion of partitions. Let {B₁, …, B_P} be a partition of the batch B, where ∪ₖ Bₖ = B. Any pair Bₖ and Bₗ (k ≠ l) is mutually disjoint, that is, Bₖ ∩ Bₗ = ∅.
Objective function of PriVAE.
4.2 Termwise DPSGD
We propose termwise DPSGD, which composes the noisy gradient for DP in a termwise way. Termwise DPSGD crafts the noisy gradients for the samplewise terms f(xᵢ) and the batchwise term g separately. In its last phase, termwise DPSGD combines these noisy gradients and updates the parameters θ. The overall procedure of termwise DPSGD is shown in Algorithm 1.
Gradient aggregation for samplewise term.
For each samplewise term f(xᵢ), we craft its clipped gradient with clip size C₁. We then aggregate the sum of the clipped gradients as follows:
s₁ = Σᵢ clip(∇f(xᵢ), C₁). (14)
Gradient aggregation for batchwise term.
For the batchwise term g, we first partition B into P subgroups B₁, …, B_P, where ∪ₖ Bₖ = B. We then compute g(Bₖ) for each subgroup and aggregate their clipped gradients with clip size C₂ as described below:
s₂ = Σₖ clip(∇g(Bₖ), C₂). (15)
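The two aggregations above can be sketched end to end. This is our own illustrative sketch of the termwise idea under the stated assumptions, not the paper's Algorithm 1: the samplewise gradients (B of them) and the per-partition batchwise gradients (P of them) are clipped and noised separately, then combined into one update direction.

```python
import math
import random

def clip(grad, C):
    """L2 clipping as in (3)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / max(1.0, norm / C) for g in grad]

def termwise_gradient(samplewise_grads, batchwise_grads, C1, C2, sigma,
                      rng=random):
    """Sketch of termwise DPSGD: clip and noise the samplewise term
    (clip size C1, averaged over B) and the batchwise term (clip size
    C2, averaged over the P partitions) separately, then sum them."""
    B, P = len(samplewise_grads), len(batchwise_grads)
    dim = len(samplewise_grads[0])
    s1 = [sum(clip(g, C1)[d] for g in samplewise_grads) for d in range(dim)]
    s2 = [sum(clip(g, C2)[d] for g in batchwise_grads) for d in range(dim)]
    noisy1 = [(s1[d] + rng.gauss(0.0, sigma * C1)) / B for d in range(dim)]
    noisy2 = [(s2[d] + rng.gauss(0.0, sigma * C2)) / P for d in range(dim)]
    return [noisy1[d] + noisy2[d] for d in range(dim)]
```

Because the two terms are clipped independently, a single input affects at most one samplewise gradient and one partition gradient, which is the disjointness that the sensitivity argument of Section 4.3 relies on.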
Termwise noise injections and concatenation.
4.3 Discussion
We here discuss the privacy guarantee and noise scale of our proposed method.
Theorem 2
Termwise DPSGD with a given noise scale satisfies (ε, δ)-differential privacy if DPSGD with that noise scale satisfies (ε/2, δ/2)-differential privacy for a VAE that has no batchwise terms.
Proof
By assumption, the noise scale satisfies (ε/2, δ/2)-DP for each term. From the sequential composition of the first term and the second term in (16), the sum of the two terms satisfies (ε, δ)-DP.
Lemma 2
The sensitivity of (13) is C₁ + C₂. That means the sensitivity is independent of the batch size B.
Proof
Since all Bₖ are disjoint, the change of any single xᵢ influences only ∇f(xᵢ) and ∇g(Bₖ) where xᵢ ∈ Bₖ. Thus, the sensitivities of the first and second terms are C₁ and C₂, respectively. Finally, the sensitivity of (13) is C₁ + C₂.
In (13) and (16), the computation of g(Bₖ) for each partition results in an underestimation of g(B), but it brings a reduction of the noise variance for the second term. In (16), the noise can be divided by the number of partitions P. Therefore, we can manipulate the degree of the tradeoff between the estimation accuracy of g and the second term's noise scale through P.
Finally, we discuss the noise scale. In the existing DPSGD with a divergence, the increased sensitivity BC inflates the overall noise scale, while our termwise DPSGD keeps the sensitivity at C₁ + C₂ by using the termwise aggregation. Consequently, the order of the noise scale of our proposed method is smaller than that of DPSGD with a divergence, since C₁ + C₂ does not grow with B.
Table 1 summarizes the sensitivity and noise scale of DPSGD and our termwise DPSGD.

Method | Sensitivity | Noise scale
DPSGD (micro agg.) | BC | σBC / B
DPSGD (batch agg.) | C | σC
Termwise DPSGD | C₁ + C₂ | σ(C₁/B + C₂/P)
5 Evaluation
In this section, we demonstrate the effectiveness of our proposed method PriVAE on two different tasks: a sparse coding task and a clustering task. Each task employs a different prior distribution p(z) and a different divergence as the regularization term. The experimental settings, including datasets, neural network architectures, construction of prior distributions, regularization divergences, and evaluation metrics, follow the experiments in Mathieu et al. (2019). The experimental code is developed in Python 3.7 and PyTorch 1.5 Paszke et al. (2017) and run on machines with a Tesla V100 GPU.
5.1 Sparsity
We first consider a sparse representation in which only a small fraction of the available factors are employed for reconstructions. In this task, we utilize the Fashion-MNIST dataset Xiao et al. (2017). As in Mathieu et al. (2019), we construct a sparse prior as a mixture distribution that can be interpreted as a mixture of samples being either off or on; we set the on proportion to 0.8. The regularization term we utilize here is a dimensionwise MMD with a sum of Cauchy kernels on each dimension. To measure the sparsity of the latent representations, we employ the sparsity metric defined with the Hoyer extrinsic metric Hurley and Rickard (2009) as follows:
Hoyer(z̄) = (√d − ‖z̄‖₁ / ‖z̄‖₂) / (√d − 1), (17)
where z̄ is a vector whose d-th dimensional value is z_d / σ_d, and σ_d is the standard deviation of the d-th dimensional latent encoding taken over the dataset. The metric is 0 for a fully dense vector and 1 for a fully sparse vector.
We use the same convolutional neural networks for both the encoder and decoder as in Mathieu et al. (2019) with a 50-dimensional latent space. In this task, we use the SGD optimizer with =0.05, =0.001, =1, =256, =16, =1 for all privatized models, and =0.005 for PriVAE with the MMD and =0 for PriVAE without it. For non-private VAEs, we use the Adam optimizer with =0.0005, =256. For both VAE and PriVAE, we set =100 when attaching the MMD. We also compare with DPSGD using micro aggregation and batch aggregation. For these methods, we set =0.0002 to avoid exploding gradients. The other hyperparameters are the same as for PriVAE with MMD. All models are trained for 10 epochs.
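The Hoyer metric of (17) is straightforward to compute; the sketch below (our own helper, operating on a plain normalized vector rather than the dataset-wide statistics) shows the two extremes the text describes.

```python
import math

def hoyer_sparsity(z):
    """Hoyer extrinsic sparsity as in (17):
    (sqrt(d) - ||z||_1 / ||z||_2) / (sqrt(d) - 1).
    0 for a fully dense (constant) vector, 1 for a one-hot vector."""
    d = len(z)
    l1 = sum(abs(v) for v in z)
    l2 = math.sqrt(sum(v * v for v in z))
    return (math.sqrt(d) - l1 / l2) / (math.sqrt(d) - 1)
```

In the paper's setting each coordinate is first normalized by its per-dimension standard deviation σ_d over the dataset before this score is taken.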
Figure 1 shows the sparsity induced by the sparse prior (Figure 1a), the log-likelihood (Figure 1b), and the MMD between q(z) and p(z) (Figure 1c), observed at several values of the privacy parameter ε. We plot the average over ten observations; the shaded regions are one standard deviation around the averages. In Figure 1a, PriVAE with the regularization (PriVAE+MMD) demonstrates higher sparsity than the model without it. Although there is a gap to the non-private regularized model (VAE+MMD), our proposed model succeeded in increasing the sparsity even under differential privacy constraints. In the MMD between q(z) and p(z), PriVAE+MMD shows smaller values than PriVAE without it. By employing the regularization term, PriVAE could attain sparsity and reduce the MMD, but it was not easy to simultaneously increase the log-likelihood. The tradeoff between them seems more significant than in non-private models. To obtain more sparsity, PriVAE needs to improve its reconstruction performance.
5.2 Clustering Latent Space
Next, we consider a differentially private VAE that imposes clustering on the latent space. For this experiment, we utilize the pinwheel dataset from Johnson et al. (2016), with 400 observations clustered in 4 spirals. Following the experiment in Mathieu et al. (2019), we utilize a mixture of four Gaussians as the prior, a divergence between q(z) and p(z) as the regularization, and fully-connected neural networks for both the encoder and decoder. The prior is defined as a mixture of four Gaussians with =2, =4, =0.03. We set =0.05, =0.01, =20, =1, =20 for all models, =0.0005, =0 for PriVAE with the divergence, and =0, =1 for PriVAE without it.
We compare the clustering performance of PriVAE with and without the regularization term. Figure 2 shows the reconstructions of the pinwheel data and the (clustered) representations. The first two columns show the results of PriVAE without the regularization, and the others show those of PriVAE with it. In the figures, the red dots represent the original inputs, the yellow dots their reconstructions, and the blue dots the data points in the latent spaces. PriVAE without the regularization produces poor reconstructions of the raw pinwheel clustered data. In contrast, PriVAE with the regularization generated better reconstructions than the model without it, although the generated samples still exhibit small reconstruction errors. The learned representations of PriVAE with the regularization are well clustered and fitted to the prior, the mixture of four Gaussians. These results show that our proposed model works well with a prior and a regularization term intended to capture the clusters of the pinwheel data.


6 Conclusion
This paper studied how to learn variational autoencoders with various divergences under differential privacy constraints. We revealed that several divergences increase the sensitivity of the stochastic gradient from C to BC in terms of the batch size B. To reduce the sensitivity and the amount of noise, we proposed termwise DPSGD, which crafts randomized gradients in two different ways tailored to the compositions of the loss terms. Termwise DPSGD keeps the sensitivity constant in the batch size even when attaching the divergence. In our experiments, we demonstrated that our method works well with two pairs of prior distribution and divergence. We mainly studied differentially private VAEs, but these contributions are also of significant importance for other machine learning models required to satisfy differential privacy.
References
Abadi et al. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308-318.
Abowd (2018). The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2867-2867.
Acs et al. (2018). Differentially private mixture of generative neural networks. IEEE Transactions on Knowledge and Data Engineering 31(6), pp. 1109-1121.
Alemi et al. (2016). Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
Amin et al. (2019). Differentially private covariance estimation. In Advances in Neural Information Processing Systems, pp. 14190-14199.
Bagdasaryan et al. (2019). Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems, pp. 15453-15462.
Bengio et al. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), pp. 1798-1828.
Bindschaedler et al. (2017). Plausible deniability for privacy-preserving data synthesis. Proceedings of the VLDB Endowment 10(5), pp. 481-492.
Bindschaedler and Shokri (2016). Synthesizing plausible privacy-preserving location traces. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 546-563.
Chaudhuri et al. (2019). Capacity bounded differential privacy. In Advances in Neural Information Processing Systems, pp. 3469-3478.
Chen et al. (2015). Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129-138.
Dwork (2006). Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming, Volume Part II, pp. 1-12.
Dwork (2011a). A firm foundation for private data analysis. Communications of the ACM 54(1), pp. 86-95.
Dwork (2011b). Differential privacy. Encyclopedia of Cryptography and Security, pp. 338-340.
Eastwood and Williams (2018). A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations.
Esmaeili et al. (2019). Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2525-2534.
Hurley and Rickard (2009). Comparing measures of sparsity. IEEE Transactions on Information Theory 55(10), pp. 4723-4741.
Johnson et al. (2016). Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946-2954.
Jordon et al. (2018). PATE-GAN: generating synthetic data with differential privacy guarantees.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kingma et al. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743-4751.
Lee and Kifer (2018). Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1656-1665.
Mathieu et al. (2019). Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pp. 4402-4412.
McMahan et al. (2018). A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210.
McMahan et al. (2017). Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963.
Papernot et al. (2016). Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755.
Paszke et al. (2017). Automatic differentiation in PyTorch.
Schein et al. (2019). Locally private Bayesian inference for count models. In International Conference on Machine Learning, pp. 5638-5648.
[29] TensorFlow Privacy. https://github.com/tensorflow/privacy
Torkzadehmahani et al. (2019). DP-CGAN: differentially private synthetic data and label generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Wang and Xu (2019). On sparse linear regression in the local differential privacy model. In International Conference on Machine Learning, pp. 6628-6637.
Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Xie et al. (2018). Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739.
Yu et al. (2019). Differentially private model publishing for deep learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 332-349.
Zhang et al. (2014). PrivBayes: private data release via Bayesian networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1423-1434.
Zhang et al. (2016). PrivTree: a differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 International Conference on Management of Data, pp. 155-170.