1 Introduction
Releasing statistical and sensory data to a broad community has contributed to advances in numerous machine learning (ML) techniques, e.g., object recognition (ImageNet Deng et al. (2009)), language modeling (RCV1 Lewis et al. (2004)), and recommendation systems (Netflix ratings Bennett et al. (2007)). However, in many sensitive domains (e.g., medical, financial), similar advances are often held back because the private nature of the collected data prohibits release in its original form. Privacy-preserving data publishing Dwork et al. (2009); Fung et al. (2010); Beaulieu-Jones et al. (2017) provides a reasonable solution, where only a sanitized form of the original data (with rigorous privacy guarantees) is publicly released.
Traditionally, sanitization is performed in a differentially private (DP) framework Dwork (2008). The sanitization method employed is often handcrafted for the given input data Li et al. (2016); Zhang et al. (2017); Mckenna et al. (2019) and the specific data-dependent task the sanitized data is intended for (e.g., answering linear queries) Dwork et al. (2010); Roth and Roughgarden (2010); Hardt and Rothblum (2010); Blum et al. (2013). As a result, such sanitization techniques greatly restrict the expressiveness of the released data distribution and fail to generalize to novel tasks unanticipated by the publisher. Instead, recent privacy-preserving techniques Xie et al. (2018); Zhang et al. (2018); Yoon et al. (2019); Beaulieu-Jones et al. (2017) build on top of successes in the generative adversarial network (GAN) Goodfellow et al. (2014) literature to generate synthetic data faithful to the original input distribution. Specifically, GANs are trained using a privacy-preserving algorithm (e.g., DP-SGD Abadi et al. (2016)) and demonstrate promising results in modeling a variety of real-world high-dimensional data distributions.
Common to most privacy-preserving training algorithms for neural network models is manipulating the gradient information generated during backpropagation. Manipulation most commonly involves clipping the gradients (to bound sensitivity) and adding calibrated random noise (to introduce stochasticity). Although recent techniques that employ such an approach demonstrate reasonable success, they are largely limited to shallow networks and fail to sufficiently capture the sample quality of the original data.
In this paper, towards the goal of a generative model capable of synthesizing high-quality samples in a privacy-preserving manner, we propose a differentially private GAN. We first identify that in such a data-publishing scenario, only a subset of the trained model (specifically the generator) and its parameters need to be publicly released. This insight allows us to surgically manipulate the gradient information during training, thereby allowing more meaningful gradient updates. By coupling the approach with a Wasserstein Arjovsky et al. (2017) objective with a gradient-penalty term Gulrajani et al. (2017), we further improve the amount of gradient information flow during training. The new objective additionally allows us to precisely estimate the gradient norms and analytically determine the sensitivity values. As an added benefit, we find our approach bypasses an intensive and fragile hyperparameter search for DP-specific hyperparameters (particularly clipping values).
Contributions. (i) We propose a novel gradient-sanitized Wasserstein GAN (GS-WGAN), which is capable of generating high-dimensional data with a DP guarantee; (ii) our approach naturally extends to both centralized and decentralized datasets. In decentralized scenarios, our work can provide a user-level DP guarantee Li et al. (2020) under an untrusted server; (iii) extensive evaluations on various datasets demonstrate that our method significantly improves the sample quality of privacy-preserving data over state-of-the-art approaches.
2 Related Work
We review several generative models in the area of differential privacy, as well as their relations to our work.
DP-SGD GAN. Training GANs via DP-SGD Abadi et al. (2016); Xie et al. (2018); Zhang et al. (2018); Beaulieu-Jones et al. (2017); Torkzadehmahani et al. (2019) has proven effective in generating high-dimensional sanitized data. However, DP-SGD relies on careful tuning of the clipping bound of the gradient norm, i.e., the sensitivity value. Specifically, the optimal clipping bound varies greatly with the model architecture and the training dynamics, making the implementation of DP-SGD difficult. Unlike DP-SGD, our framework provides a precise estimate of the sensitivity value, avoiding an intensive hyperparameter search.
PATE. Private Aggregation of Teacher Ensembles (PATE) has recently been adapted to generative models, and two main approaches have been studied: PATE-GAN Yoon et al. (2019) and G-PATE Long et al. (2019). PATE-GAN trains multiple teacher discriminators on disjoint data partitions together with a student discriminator. In contrast, we consider a simplified model without a student discriminator.
G-PATE Long et al. (2019) is similar to our work in that both train the discriminator non-privately while only training the generator with a DP guarantee, and both sanitize the gradients that the generator receives from the discriminator. However, G-PATE suffers from two main limitations: (i) gradients need to be discretized using manually selected bins in order to suit the PATE framework; and (ii) high-dimensional gradients in the PATE framework incur high privacy costs, and thus dimension reduction techniques are required. Our framework effectively avoids these two limitations and achieves better sample quality due to the novel gradient sanitization (see our experiments).
FedAvg GAN Augenstein et al. (2020). While many works focus on the centralized setting, the decentralized case has rarely been studied. To address this, Federated Average GAN (FedAvg GAN) adapts GAN training by using the DP-FedAvg McMahan et al. (2018) algorithm, providing a user-level DP guarantee under a trusted server. In comparison with FedAvg GAN, which works only on decentralized data, our work can tackle both centralized and decentralized data using a single framework. Note that because FedAvg GAN sanitizes parameter gradients of the discriminator in a similar way to DP-SGD, it also suffers from the difficulty of tuning hyperparameters.
3 Background
DP provides rigorous privacy guarantees for algorithms while allowing for quantitative privacy analysis. Below, we present several definitions and theorems that will be used in this work.
Definition 3.1.
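(Differential Privacy (DP) Dwork (2008)) A randomized mechanism $\mathcal{M}$ with range $\mathcal{R}$ is $(\varepsilon, \delta)$-DP, if
$$\Pr[\mathcal{M}(S) \in \mathcal{O}] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(S') \in \mathcal{O}] + \delta$$ (1)
holds for any subset of outputs $\mathcal{O} \subseteq \mathcal{R}$ and for any adjacent datasets $S$ and $S'$, where $S$ and $S'$ differ from each other in only one training example. $\mathcal{M}$ is the GAN training algorithm in our case, $\varepsilon$ corresponds to the upper bound of privacy loss, and $\delta$ is the probability of breaching DP constraints.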
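To make the inequality in Definition 3.1 concrete, the following minimal sketch checks it for a mechanism with a finite output space. The function names and the two-outcome toy distributions are hypothetical and chosen only for illustration; the check relies on the fact that, for finite outputs, the worst-case subset of outputs collects exactly those outcomes whose probability exceeds the scaled probability under the adjacent dataset.

```python
import math

def worst_case_excess(p, q, eps):
    """Largest value of Pr_p[O] - e^eps * Pr_q[O] over all output subsets O.

    For a finite output space, the maximizing subset contains exactly the
    outputs o with p[o] > e^eps * q[o].
    """
    return sum(max(0.0, po - math.exp(eps) * qo) for po, qo in zip(p, q))

def satisfies_dp(p, q, eps, delta, tol=1e-12):
    """Check the DP inequality in both directions for one adjacent pair.

    p and q are the output distributions of the mechanism on two adjacent
    datasets; tol absorbs floating-point rounding.
    """
    return (worst_case_excess(p, q, eps) <= delta + tol
            and worst_case_excess(q, p, eps) <= delta + tol)

# Hypothetical toy mechanism with two possible outputs.
p = [0.75, 0.25]  # output distribution on dataset S
q = [0.25, 0.75]  # output distribution on adjacent dataset S'

print(satisfies_dp(p, q, eps=math.log(3), delta=0.0))
print(satisfies_dp(p, q, eps=math.log(2), delta=0.0))
print(satisfies_dp(p, q, eps=math.log(2), delta=0.25))
```

A full DP guarantee requires the check to hold for every pair of adjacent datasets, not only the single pair used here.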
Definition 3.2.
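(Rényi Differential Privacy (RDP) Mironov (2017)) A randomized mechanism $\mathcal{M}$ is $(\lambda, \varepsilon)$-RDP with order $\lambda$, if
$$D_{\lambda}(\mathcal{M}(S) \,\|\, \mathcal{M}(S')) = \frac{1}{\lambda - 1} \log \mathbb{E}_{x \sim \mathcal{M}(S)} \left[ \left( \frac{\Pr[\mathcal{M}(S) = x]}{\Pr[\mathcal{M}(S') = x]} \right)^{\lambda - 1} \right] \le \varepsilon$$ (2)
holds for any adjacent datasets $S$ and $S'$, where $D_{\lambda}$ denotes the Rényi divergence. Moreover, a $(\lambda, \varepsilon)$-RDP mechanism is also $(\varepsilon + \frac{\log 1/\delta}{\lambda - 1}, \delta)$-DP.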
In contrast to DP, RDP provides convenient composition properties to accumulate privacy cost over a sequence of mechanisms.
Theorem 3.1.
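(Composition) For a sequence of mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_k$ s.t. $\mathcal{M}_i$ is $(\lambda, \varepsilon_i)$-RDP $\forall i$, the composition $\mathcal{M}_1 \circ \dots \circ \mathcal{M}_k$ is $(\lambda, \sum_i \varepsilon_i)$-RDP.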
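The composition property lends itself to simple privacy accounting, sketched below in plain Python. The sketch adds up per-step RDP costs at a fixed order and then applies the standard RDP-to-DP conversion of Mironov (2017), which adds log(1/delta)/(order - 1) to the accumulated RDP cost. All function names and the linear per-step cost used in the example are hypothetical.

```python
import math

def compose_rdp(eps_per_step):
    """RDP costs at a fixed order add up under composition."""
    return sum(eps_per_step)

def rdp_to_dp(order, rdp_eps, delta):
    """Convert an (order, rdp_eps)-RDP guarantee into an (eps, delta)-DP one."""
    if order <= 1:
        raise ValueError("order must be greater than 1")
    return rdp_eps + math.log(1.0 / delta) / (order - 1.0)

def best_dp_epsilon(orders, per_step_rdp, num_steps, delta):
    """Search the candidate orders for the smallest resulting DP epsilon.

    per_step_rdp is a callable mapping an order to the RDP cost of one step.
    Returns a tuple (best_order, best_epsilon).
    """
    best = None
    for order in orders:
        total = compose_rdp([per_step_rdp(order)] * num_steps)
        eps = rdp_to_dp(order, total, delta)
        if best is None or eps < best[1]:
            best = (order, eps)
    return best

# Hypothetical per-step cost that grows linearly with the order.
order, eps = best_dp_epsilon(range(2, 64), lambda lam: 0.01 * lam,
                             num_steps=100, delta=1e-5)
print(order, eps)
```

Searching over several orders is a common design choice because the conversion term shrinks with the order while the accumulated RDP cost typically grows with it.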
Definition 3.3.
Theorem 3.2.
(Post-processing Dwork et al. (2014)) If $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-DP, $F \circ \mathcal{M}$ will satisfy $(\varepsilon, \delta)$-DP for any function $F$, with $\circ$ denoting the composition operator.
4 Proposed Method
Generative Adversarial Networks (GANs) Goodfellow et al. (2014). Our approach models the underlying (private) data distribution using a generative neural network, building on top of recent successes of GANs. GANs (see Fig. 1(a)) formulate the task of sample generation as a zero-sum two-player game between two neural network models: a discriminator $D$ and a generator $G$ (with parameters $\theta_D$ and $\theta_G$, respectively). The discriminator $D$ is rewarded for correctly classifying whether a given sample is ‘real’ (i.e., from the input data distribution) or ‘fake’ (generated by the generator). In contrast, the task of the generator $G$ is (given some random noise $z$) to generate samples which fool the discriminator (i.e., cause misclassifications). After training the models in an adversarial manner, the discriminator is discarded and the generator is used as a proxy to draw samples from the original distribution.
Differentially Private GANs. Releasing the generator as a substitute for the original training data distribution entails privacy risks. Consequently, along the lines of recent work Beaulieu-Jones et al. (2017); Xie et al. (2018); Torkzadehmahani et al. (2019); Zhang et al. (2018), our goal is instead to train the GAN in a privacy-preserving manner, such that any privacy leakage upon disclosing the generator is bounded. A simple approach towards the goal is replacing the typical training procedure (SGD) with a differentially private variant (DP-SGD Abadi et al. (2016)), thereby limiting the contribution of a particular training example to the final trained model. DP-SGD enforces the desired privacy requirement by (i) clipping the gradients $g^{(t)}$ to have an $L_2$ norm of at most $C$ at each training step $t$; and (ii) sampling random noise and adding it to the gradients, before performing descent (with learning rate $\eta$) on the trained parameters $\theta^{(t)}$:
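$$g^{(t)} := \nabla_{\theta} \mathcal{L}(\theta_D, \theta_G)$$ (gradient) (5)
$$\hat{g}^{(t)} := \mathcal{M}_{\sigma, C}(g^{(t)}) = \mathrm{clip}(g^{(t)}, C) + \mathcal{N}(0, \sigma^2 C^2 I)$$ (6)
$$\theta^{(t+1)} := \theta^{(t)} - \eta \cdot \hat{g}^{(t)}$$ (gradient descent step) (7)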
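The three steps of Eqs. 5-7 can be sketched in plain Python for a single gradient vector. The function names and toy values are hypothetical, and the noise is assumed to be Gaussian with standard deviation equal to the noise scale times the clipping bound, as is standard for DP-SGD (Abadi et al. (2016)). Practical DP-SGD implementations clip per-example gradients before averaging, which this sketch omits.

```python
import math
import random

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip(grad, clip_bound):
    """Rescale grad so that its L2 norm is at most clip_bound."""
    scale = max(1.0, l2_norm(grad) / clip_bound)
    return [g / scale for g in grad]

def sanitize(grad, clip_bound, noise_scale, rng):
    """Clip, then add Gaussian noise with std noise_scale * clip_bound."""
    clipped = clip(grad, clip_bound)
    std = noise_scale * clip_bound
    return [g + rng.gauss(0.0, std) for g in clipped]

def dp_sgd_step(params, grad, lr, clip_bound, noise_scale, rng):
    """One descent step on the sanitized gradient."""
    noisy = sanitize(grad, clip_bound, noise_scale, rng)
    return [p - lr * g for p, g in zip(params, noisy)]

rng = random.Random(0)
# noise_scale=0.0 disables the noise so that the clipping effect is visible.
params = dp_sgd_step([0.0, 0.0], [3.0, 4.0], lr=0.1,
                     clip_bound=1.0, noise_scale=0.0, rng=rng)
print(params)
```

Setting the noise scale to zero, as in the usage example, removes any privacy guarantee and is done here only to make the effect of clipping easy to inspect.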
While such an approach provides rigorous privacy guarantees, there are multiple shortcomings: (i) the sanitization mechanism $\mathcal{M}_{\sigma, C}$, primarily due to clipping, significantly destroys the original gradient information, and thereby affects utility; and (ii) finding a reasonable clipping value $C$ in the mechanism to balance utility with privacy is especially challenging. In particular, as the gradient norms exhibit a heavy-tailed distribution, choosing a clipping value requires an exhaustive search. Moreover, since the clipping value is extremely sensitive to many other hyperparameters (e.g., learning rate, architecture), it requires persistent re-tuning. We now discuss how we address these shortcomings within our differentially private GAN approach.
Selectively applying Sanitization Mechanism. We begin by exploiting the fact that after training the GAN, only the generator is released. Consequently, we can perform gradient steps by selectively applying the sanitization mechanism only to the corresponding subset of parameters $\theta_G$:
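$$\theta_D^{(t+1)} := \theta_D^{(t)} - \eta_D \cdot g_D^{(t)}$$ (8)
$$\theta_G^{(t+1)} := \theta_G^{(t)} - \eta_G \cdot \hat{g}_G^{(t)}$$ (9)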
Apart from reducing the number of parameters sanitized, this also provides a benefit of more reliably training a discriminator. In addition, we exploit the chain rule to further narrow the scope of the sanitization mechanism:
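$$\hat{g}_G^{(t)} = \mathcal{M}_{\sigma, C}\left( \nabla_{\theta_G} \mathcal{L}_G(\theta_G) \right) = \mathcal{M}_{\sigma, C}\left( \nabla_{G(z)} \mathcal{L}_G(\theta_G) \cdot J_{\theta_G} G(z; \theta_G) \right)$$ (10)
$$\hat{g}_G^{(t)} = \mathcal{M}_{\sigma, C}\left( \nabla_{G(z)} \mathcal{L}_G(\theta_G) \right) \cdot J_{\theta_G} G(z; \theta_G)$$ (11)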
The above becomes easier to intuit by considering a typical loss function $\mathcal{L}_G(\theta_G) = -D(G(z; \theta_G))$. As illustrated in Fig. 1(b), Eq. 11 can then be considered as placing the privacy barrier on the gradient information backpropagating from the discriminator to the generator, by applying the sanitization mechanism to $\nabla_{G(z)} \mathcal{L}_G(\theta_G)$. Note that the second term ($J_{\theta_G} G(z; \theta_G)$) is the local generator Jacobian, computed independently of the training data, and hence does not require sanitization. Consequently, by applying the sanitization mechanism more precisely to the gradient information, our goal here is to maximally preserve the true gradient direction during training.
Bounding sensitivity using Wasserstein distance. To bound the sensitivity of the optimizer on individual training examples, a key step in sanitization mechanisms is to clip (Eq. 6) the gradient vector $g$ (Eq. 5) before updating parameters (Eq. 7). Clipping is typically performed in $L_2$ norm, by replacing the gradient vector $g$ by $g / \max(1, \|g\|_2 / C)$ to ensure $\|g\|_2 \le C$. However, clipping significantly destroys gradient information, as reasonable choices of $C$ (e.g., 4 Abadi et al. (2016)) are significantly lower than the gradient norms observed ($12 \pm 10$ in our case) when training neural networks using standard loss functions. We propose to alleviate the issue by leveraging a more suitable loss function, which generates bounded gradients (with norms close to 1) by construction. Specifically, we use as our loss the Wasserstein-1 metric Arjovsky et al. (2017), which measures the statistical distance between the real and generated data distributions. Here, the training process can be interpreted as minimizing integral probability metrics (IPMs) $\sup_{\|D\|_L \le 1} \mathbb{E}_{x \sim P}[D(x)] - \mathbb{E}_{x \sim Q}[D(x)]$ between real ($P$) and generated ($Q$) data distributions, where $\|D\|_L \le 1$ (i.e., the discriminator function is 1-Lipschitz continuous). It follows that the optimal discriminator has a gradient norm of 1 almost everywhere under $P$ and $Q$.
We incorporate the norm constraint into our training objective in the form of a gradient penalty term Gulrajani et al. (2017):
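$$\mathcal{L}_D = -\mathbb{E}_{x \sim P}[D(x)] + \mathbb{E}_{\tilde{x} \sim Q}[D(\tilde{x})] + \lambda_{\mathrm{GP}} \, \mathbb{E}\left[ \left( \|\nabla D(\alpha x + (1 - \alpha) \tilde{x})\|_2 - 1 \right)^2 \right]$$ (12)
$$\mathcal{L}_G = -\mathbb{E}_{z \sim P_z}[D(G(z))]$$ (13)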
where $\mathcal{L}_D$ and $\mathcal{L}_G$ represent training objectives for the discriminator and the generator, respectively. $\lambda_{\mathrm{GP}}$ is the hyperparameter for weighting the gradient penalty term and $P_z$ denotes the prior distribution for the latent code variable $z$. The variable $\alpha$, uniformly sampled from $[0, 1]$, regulates the interpolation between real and generated samples.
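As a rough illustration of the Wasserstein objectives with gradient penalty, the sketch below evaluates both losses for a toy critic whose input gradient is available in closed form. The critic D(x) = tanh(w . x), the function names, and the toy batches are assumptions made only so that the example runs without an automatic differentiation library; they are not the architecture used in this work.

```python
import math
import random

def critic(w, x):
    """Toy critic D(x) = tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def critic_input_grad_norm(w, x):
    """L2 norm of dD/dx, which equals |1 - D(x)^2| * ||w|| for the toy critic."""
    d = critic(w, x)
    return abs(1.0 - d * d) * math.sqrt(sum(wi * wi for wi in w))

def discriminator_loss(w, real_batch, fake_batch, gp_weight, rng):
    """Wasserstein critic loss plus gradient penalty on interpolated samples.

    real_batch and fake_batch are assumed to have the same length.
    """
    n = len(real_batch)
    loss_real = -sum(critic(w, x) for x in real_batch) / n
    loss_fake = sum(critic(w, x) for x in fake_batch) / n
    penalty = 0.0
    for x, x_tilde in zip(real_batch, fake_batch):
        alpha = rng.random()  # interpolation coefficient in [0, 1)
        x_hat = [alpha * a + (1.0 - alpha) * b for a, b in zip(x, x_tilde)]
        penalty += (critic_input_grad_norm(w, x_hat) - 1.0) ** 2
    return loss_real + loss_fake + gp_weight * penalty / n

def generator_loss(w, fake_batch):
    """Generator loss: negative mean critic score on generated samples."""
    return -sum(critic(w, x) for x in fake_batch) / len(fake_batch)

rng = random.Random(0)
real = [[0.0, 1.0], [0.0, -1.0]]
fake = [[0.0, 2.0], [0.0, 0.5]]
print(discriminator_loss([1.0, 0.0], real, fake, gp_weight=10.0, rng=rng))
print(generator_loss([1.0, 0.0], fake))
```

In the usage example every interpolated point has zero projection on w, so the critic's input gradient norm is exactly 1 there and the penalty term vanishes.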
As a natural consequence of the Wasserstein objective, reducing the norms of our target gradient (Equation 11) during training is integrated in our training objective (last term in Equation 12). Consequently, we observe significantly lower gradient norms during training (see Fig. 2(c) and 2(d)) compared to training using a standard GAN loss (see Fig. 2(a) and 2(b)). As a result, bounding the sensitivity (i.e., gradient norms) is now largely delegated to our training procedure, and clipping using the sanitization mechanism destroys significantly less information. Additionally, by choosing a clipping threshold of $C = 1$ (i.e., bounding the $L_2$ norm of the clipped gradient by 1), we obtain a fixed and bounded sensitivity, and eliminate the need for an intensive hyperparameter search for the optimal clipping threshold. Following this normalization solution, a data-independent privacy cost can be determined by the following theorem, whose proof is provided in the Appendix.
Theorem 4.1.
Each generator update step satisfies $(\lambda, 2B\lambda / \sigma^2)$-RDP, where $B$ is the batch size.
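The selective sanitization of Eq. 11, combined with a clipping threshold of 1, can be sketched for a toy linear generator G(z; W) = W z. Only the upstream gradient (the gradient of the generator loss with respect to the generated sample) is clipped and perturbed; the multiplication with the local Jacobian happens afterwards and touches no private data. The linear generator, the function names, and the Gaussian noise model are assumptions made for illustration only.

```python
import math
import random

def sanitize_upstream(g_up, noise_scale, rng, clip_bound=1.0):
    """Clip the upstream gradient to L2 norm <= clip_bound, then add noise."""
    norm = math.sqrt(sum(g * g for g in g_up))
    scale = max(1.0, norm / clip_bound)
    std = noise_scale * clip_bound
    return [g / scale + rng.gauss(0.0, std) for g in g_up]

def generator_param_grad(g_up_sanitized, z):
    """Backpropagate through the toy generator G(z; W) = W z.

    dL/dW[i][j] = g_up[i] * z[j]; this local Jacobian step is computed
    without access to the training data.
    """
    return [[g * zj for zj in z] for g in g_up_sanitized]

def generator_step(W, z, g_up, lr, noise_scale, rng):
    """One generator update using the sanitized upstream gradient."""
    g_hat = sanitize_upstream(g_up, noise_scale, rng)
    grad_W = generator_param_grad(g_hat, z)
    return [[w - lr * gw for w, gw in zip(row, grad_row)]
            for row, grad_row in zip(W, grad_W)]

rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0]]
# g_up would come from a discriminator trained on private data; here it is a
# fixed toy vector, and noise_scale=0.0 makes the result deterministic.
W_new = generator_step(W, z=[1.0, 2.0], g_up=[3.0, 4.0],
                       lr=1.0, noise_scale=0.0, rng=rng)
print(W_new)
```

Because the sanitized quantity lives in sample space rather than parameter space, its dimensionality does not grow with the size of the generator.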
Privacy Amplification by Subsampling. A well-known approach for increasing the privacy of a mechanism is to apply the mechanism to a random subsample of the database, rather than to the entire dataset Li et al. (2012); Balle et al. (2018); Wang et al. (2019). Intuitively, subsampling decreases the chances of leaking information about a particular individual, since nothing about that individual can be leaked if the individual is not included in the subsample. In order to further reduce the privacy cost, we subsample the whole dataset into different subsets and train multiple discriminators independently on each subset. At each training step, the generator randomly queries one discriminator, while the selected discriminator updates its parameters on the generated data and its associated subsampled dataset.
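The data partitioning and the random discriminator query described above can be sketched as follows. The function names, the round-robin split of shuffled indices, and the toy sizes are hypothetical choices for illustration.

```python
import random

def partition_dataset(num_examples, num_subsets, rng):
    """Shuffle example indices and split them into disjoint, near-equal subsets."""
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return [indices[k::num_subsets] for k in range(num_subsets)]

def pick_discriminator(num_subsets, rng):
    """Uniformly choose which discriminator the generator queries this step."""
    return rng.randrange(num_subsets)

rng = random.Random(0)
subsets = partition_dataset(num_examples=10, num_subsets=3, rng=rng)
for step in range(5):
    k = pick_discriminator(len(subsets), rng)
    # Discriminator k would now be updated on subsets[k] and on generated
    # samples, and would return upstream gradients to the generator.
    print(step, k, subsets[k])
```

Each discriminator only ever sees its own subset, so an individual example can influence a generator step only when its discriminator is the one selected.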
Extending to Federated Learning. In addition to improving the privacy guarantee, performing subsampling in our setup also naturally accommodates training a generative model on decentralized datasets (with a discriminator trained on each disjoint data subset). Recently, Augenstein et al. (2020) identified that such techniques are extremely relevant when training models in a federated setup McMahan et al. (2017), i.e., when the training data is private and distributed among edge devices. We outline our method to train a differentially private GAN in a federated setup in Figure 1(c) and remark on some subtle differences between our approach and FedAvg GAN Augenstein et al. (2020) here: (i) the discriminators are retained at each client in our framework, while they are shared between the server and client in FedAvg GAN; (ii) the gradients are sanitized at each client before being sent to the server, which provides a DP guarantee even under an untrusted server. In contrast, in FedAvg GAN the unprocessed information is accumulated at the server before being sanitized; and (iii) the gradients w.r.t. the samples are transferred in GS-WGAN, while FedAvg GAN transfers the gradients w.r.t. discriminator network parameters.
5 Experiment
5.1 Experiment Setup
To validate the applicability of our method to high-dimensional data, we conduct experiments on image datasets. In line with previous works, we use the MNIST LeCun et al. (1998) and Fashion-MNIST Xiao et al. (2017) datasets. We model the joint distribution of images and the corresponding labels, i.e., the label is supplied to both the generator and the discriminator, and the image is generated conditioned on the input label. During both training and inference, we use a uniform prior distribution for generating labels, which is independent of the training dataset and thus does not incur additional privacy cost (in contrast, Torkzadehmahani et al. (2019) needs to assume the labels are non-private).
Table 1: Quantitative results on MNIST and Fashion-MNIST.
Dataset  |  Method  |  IS  |  FID  |  MLP Acc  |  CNN Acc  |  Avg Acc  |  Calibrated Acc
MNIST  |  Real  |  9.80  |  1.02  |  0.98  |  0.99  |  0.88  |  100%
MNIST  |  G-PATE [1]  |  3.85  |  177.16  |  0.25  |  0.51  |  0.34  |  40%
MNIST  |  DP-SGD GAN  |  4.76  |  179.16  |  0.60  |  0.63  |  0.52  |  59%
MNIST  |  DP-Merf  |  2.91  |  247.53  |  0.63  |  0.63  |  0.57  |  66%
MNIST  |  DP-Merf AE  |  3.06  |  161.11  |  0.54  |  0.68  |  0.42  |  47%
MNIST  |  Ours  |  9.23  |  61.34  |  0.79  |  0.80  |  0.60  |  69%
Fashion-MNIST  |  Real  |  8.98  |  1.49  |  0.88  |  0.91  |  0.79  |  100%
Fashion-MNIST  |  G-PATE  |  3.35  |  205.78  |  0.30  |  0.50  |  0.40  |  54%
Fashion-MNIST  |  DP-SGD GAN  |  3.55  |  243.80  |  0.50  |  0.46  |  0.43  |  53%
Fashion-MNIST  |  DP-Merf  |  2.32  |  267.78  |  0.56  |  0.62  |  0.51  |  65%
Fashion-MNIST  |  DP-Merf AE  |  3.68  |  213.59  |  0.56  |  0.62  |  0.45  |  55%
Fashion-MNIST  |  Ours  |  5.32  |  131.34  |  0.65  |  0.65  |  0.53  |  67%
[1] PATE provides data-dependent $\varepsilon$, i.e., publishing the $\varepsilon$ value will introduce privacy cost. Thus, G-PATE is not directly comparable to other methods and is excluded from our analysis study (Section 5.3).
Evaluation Metrics. We evaluate along two fronts: privacy (determined by $\varepsilon$) and utility. For utility, we consider two metrics: (a) sample quality: realism of the samples produced, evaluated by Inception Score (IS) Salimans et al. (2016); Li et al. (2017) and Frechet Inception Distance (FID) Heusel et al. (2017) (standard in the GAN literature); and (b) usefulness for downstream tasks: we train downstream classifiers on 60k privately generated data points and evaluate the prediction accuracy on the real test set. We consider Multi-layer Perceptrons (MLP), Convolutional Neural Networks (CNN), and 11 scikit-learn Pedregosa et al. (2011) classifiers (e.g., SVMs, Random Forest). We include the following metrics in the main paper: MLP Acc (MLP accuracy), CNN Acc (CNN accuracy), Avg Acc (averaged accuracy of all classification models), and Calibrated Acc (averaged accuracy of all classification models normalized by the accuracy when trained on real data). The detailed results are presented in the Appendix.
Architecture and Warm-start. We highlight two strategies adopted for improving the sample quality as well as reducing the privacy cost: (i) Better model architecture: While previous works are limited to shallow networks, which bottlenecks the generated sample quality, our framework allows stable training with a complex model architecture (DCGAN Radford et al. (2016) architecture for the discriminator, ResNet architecture (adapted from BigGAN Brock et al. (2018)) for the generator) to help improve the sample quality; and (ii) Discriminator warm-starting: To bootstrap the training process, we pre-train discriminators along with a non-private generator for a few steps, and we subsequently train the private generator by using the warm-starting values of the discriminators. Note that our framework allows pre-training on the original private dataset without compromising privacy (in contrast, Zhang et al. (2018) needs to use external public datasets).
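The two aggregate accuracy metrics can be computed as sketched below. The sketch assumes that Calibrated Acc normalizes each classifier's accuracy by that classifier's accuracy when trained on real data and then averages the ratios; the text would also admit dividing the averaged accuracies instead. The function names, classifier names, and accuracy values are hypothetical.

```python
def average_accuracy(acc_by_model):
    """Avg Acc: mean accuracy over all downstream classifiers."""
    return sum(acc_by_model.values()) / len(acc_by_model)

def calibrated_accuracy(acc_synthetic, acc_real):
    """Calibrated Acc under the per-classifier normalization assumption.

    Each accuracy obtained by training on synthetic data is divided by the
    accuracy of the same classifier trained on real data, then averaged.
    """
    ratios = [acc_synthetic[name] / acc_real[name] for name in acc_synthetic]
    return sum(ratios) / len(ratios)

# Hypothetical accuracies for three downstream classifiers.
acc_real = {"mlp": 0.98, "cnn": 0.99, "svm": 0.90}
acc_syn = {"mlp": 0.49, "cnn": 0.495, "svm": 0.45}

print(average_accuracy(acc_syn))
print(calibrated_accuracy(acc_syn, acc_real))
```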
5.2 Comparison with Baselines
Baselines. We consider the following state-of-the-art methods designed for DP high-dimensional data generation: DP-Merf and DP-Merf AE Harder et al. (2020), DP-SGD GAN Torkzadehmahani et al. (2019); Xie et al. (2018); Zhang et al. (2018), and G-PATE Long et al. (2019). While PATE-GAN Yoon et al. (2019) demonstrates promising results on low-dimensional data, we currently do not consider it, as we were unable to extend it to our image datasets (more details in the Appendix) for a fair comparison. For DP-Merf, DP-Merf AE, and G-PATE, we use the source code provided by the authors. For DP-SGD GAN, we adopt the implementation of Torkzadehmahani et al. (2019), which is the only work that provides executable code with privacy analysis. For a fair comparison, we evaluate all methods with the same privacy budget $\varepsilon$ (the value consistently used in previous works) over 60K generated samples.
[Figure 3: Generated samples on MNIST and Fashion-MNIST for G-PATE, DP-SGD GAN, DP-Merf, DP-Merf AE, and Ours.]
Results. We present the qualitative results in Figure 3 and the quantitative results in Table 1. In terms of sample quality, we find (Table 1, columns IS and FID) that our method consistently provides significant improvements over baselines. For instance, considering inception scores, we find a relative improvement of 94% (9.23 vs. 4.76 of DP-SGD GAN) on MNIST and 45% on Fashion-MNIST (5.32 vs. 3.68 of DP-Merf AE).
Furthermore, our method also generates samples that better capture the statistical properties of the original data and thereby aid the performance of downstream tasks. For instance, our approach increases the performance of a downstream MLP classifier (Table 1, column MLP Acc) by 25% (0.79 vs. 0.63 of DP-Merf) on MNIST and 16% (0.65 vs. 0.56 of DP-Merf) on Fashion-MNIST. In summary, our approach demonstrates significant improvements across multiple metrics and high-dimensional image datasets.
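The relative improvements quoted above follow directly from the Table 1 entries. The short sketch below reproduces the arithmetic; the helper name and the dictionary keys are hypothetical, while the numbers are taken from Table 1.

```python
def relative_improvement(ours, baseline):
    """Relative improvement of ours over baseline, in percent."""
    return (ours - baseline) / baseline * 100.0

improvements = {
    "IS, MNIST (vs. DP-SGD GAN)": relative_improvement(9.23, 4.76),
    "IS, Fashion-MNIST (vs. DP-Merf AE)": relative_improvement(5.32, 3.68),
    "MLP Acc, MNIST (vs. DP-Merf)": relative_improvement(0.79, 0.63),
    "MLP Acc, Fashion-MNIST (vs. DP-Merf)": relative_improvement(0.65, 0.56),
}

for name, value in improvements.items():
    print(f"{name}: {value:.0f}%")
```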
5.3 Influence of Hyperparameters
The privacy/utility performance of our approach is primarily determined by three factors: (i) subsampling rate $\gamma$, (ii) number of training iterations, and (iii) noise scale $\sigma$. We now investigate how these factors influence privacy cost and utility (sample quality measured by IS and FID), and additionally compare with baselines:
(i) Subsampling rates: We evaluate the sample quality of our method considering multiple choices of subsampling rates ($\gamma$) over the training iterations. The results are presented in Figure 4(a), where the $x$-axis corresponds to the $\varepsilon$ value evaluated at different iterations. We observe that the subsampling rate should be sufficiently small to achieve a reasonable sample quality while providing a strong privacy guarantee. An intermediate value yields a relatively good privacy-utility trade-off (see Figure 4(a)), while further decreasing the subsampling rate does not necessarily improve the results.
(ii) Iterations: We evaluate all methods during the course of training, where more iterations lead to higher utility, but at the expense of accumulating a higher privacy cost $\varepsilon$. From Figure 4(b), we find our approach yields better sample quality with fewer iterations (and hence lower $\varepsilon$). Specifically, across the range of iterations, we find IS increases by 10-90%, while the FID decreases by 20-60% compared to baselines.
(iii) Noise scale: We calibrate the noise scale $\sigma$ of each method to a certain privacy budget $\varepsilon$ and show the resulting privacy-utility curves in Figure 4(c). Similar to the previous case, our method achieves a consistent improvement in both metrics spanning a broad range of noise scale $\sigma$ (privacy budget $\varepsilon$).
5.4 Federated Setting Evaluation
Table 2: Quantitative results on Federated EMNIST.
Method  |  IS  |  FID  |  $\varepsilon$  |  CT (byte)
FedAvg GAN  |  10.88  |  218.24  |    |  
Ours  |  11.25  |  60.76  |    |  
Our approach allows privacy-preserving training of a GAN in a federated setup, where the sensitive user dataset is partitioned across clients (e.g., edge devices). Such a training scheme is useful to privately inspect data for debugging. For evaluation, we consider a real-world debugging task introduced in Augenstein et al. (2020): to detect the erroneous flipping of pixel intensities, which occurs in a fraction of client devices. Two GAN models are trained: one on client data that are suspected to be erroneously flipped (with bug) and one on client data that are believed to be normal (without bug). The samples generated by these two GAN models should differ in appearance so that the bug can be detected by inspecting the generated samples.
We conduct experiments on the Federated EMNIST dataset Caldas et al. (2018) and compare our GS-WGAN with FedAvg GAN Augenstein et al. (2020). As shown in Figure 5(a) and 5(b), the presence of the bug is clearly identifiable by inspecting the samples generated by our model. Moreover, as shown in Table 2, our GS-WGAN yields better sample quality (an FID of 60.76 vs. 218.24, i.e., roughly 0.28 times that of FedAvg GAN) with a significantly lower privacy cost $\varepsilon$ compared to FedAvg GAN. Furthermore, our method shows better robustness against large injected noise. This is illustrated in Figure 5(e) and 5(f): a noise scale larger than 0.1 inevitably leads to failure in training FedAvg GAN, whereas our method can tolerate a 10 times larger noise scale. In addition, we show in the last column of Table 2 the amortized communication cost (CT) required for performing one update step on the generator. Specifically, this corresponds to the total number of transferred bytes (including both server-to-client and client-to-server) averaged over all participating clients. Our GS-WGAN allows each client to retain its discriminator locally, and only the gradients w.r.t. generated samples are communicated (which is significantly more compact than gradients w.r.t. model parameters, as done by FedAvg GAN). We observe that GS-WGAN markedly reduces the communication cost.
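To give a feeling for why exchanging sample-space gradients is more compact than exchanging parameter-space quantities, the sketch below counts transferred bytes per client for one generator update. Everything here is an assumption for illustration: the float32 encoding, the assumption that the server sends generated samples to the client in the sample-gradient scheme, the function names, and the batch size, sample dimension, and parameter count. The values do not reproduce the CT column of Table 2.

```python
BYTES_PER_FLOAT32 = 4

def sample_gradient_bytes_per_client(batch_size, sample_dim):
    """Bytes moved when samples go down and gradients w.r.t. samples come back."""
    server_to_client = batch_size * sample_dim * BYTES_PER_FLOAT32
    client_to_server = batch_size * sample_dim * BYTES_PER_FLOAT32
    return server_to_client + client_to_server

def parameter_exchange_bytes_per_client(num_discriminator_params):
    """Bytes moved when discriminator parameters go down and an update of the
    same size comes back."""
    return 2 * num_discriminator_params * BYTES_PER_FLOAT32

# Hypothetical sizes: 32 samples of 28 x 28 pixels, one million parameters.
sample_cost = sample_gradient_bytes_per_client(batch_size=32, sample_dim=28 * 28)
param_cost = parameter_exchange_bytes_per_client(num_discriminator_params=1_000_000)
print(sample_cost, param_cost, param_cost / sample_cost)
```

Under these assumed sizes the sample-space exchange depends only on the batch size and the sample dimension, whereas the parameter-space exchange scales with the size of the discriminator.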
6 Conclusion
In this paper, we presented a differentially private approach, GS-WGAN, to sanitize sensitive high-dimensional datasets with provable privacy guarantees while simultaneously preserving the informativeness of the sanitized samples. Our primary insight is that privacy-preserving training (which sacrifices utility) can be selectively applied only to the generator (which is publicly released), while the discriminator (which is discarded post-training) can be trained optimally. Additionally, introducing a Wasserstein training objective allowed us to exploit the Lipschitz property of the discriminator and led to tighter estimates of sensitivity values, and hence a better estimate of the privacy cost. Our extensive evaluation presents encouraging results: sensitive datasets can be effectively distilled to sanitized forms which nonetheless preserve the informativeness of the data and allow training downstream models.
References
 Abadi et al. [2016] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016.
 Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.
 Augenstein et al. [2020] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, and B. A. y Arcas. Generative models for effective ML on private, decentralized datasets. In International Conference on Learning Representations (ICLR), 2020.
 Balle et al. [2018] B. Balle, G. Barthe, and M. Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Beaulieu-Jones et al. [2017] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, and C. S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. bioRxiv, 2017.
 Bennett et al. [2007] J. Bennett, S. Lanning, et al. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007. Citeseer, 2007.
 Blum et al. [2013] A. Blum, K. Ligett, and A. Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2), 2013.
 Brock et al. [2018] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2018.
 Caldas et al. [2018] S. Caldas, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
 Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 Dwork [2008] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation (TAMC). Springer, 2008.
 Dwork et al. [2009] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM symposium on Theory of computing (STOC), 2009.
 Dwork et al. [2010] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), 2010.
 Dwork et al. [2014] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 2014.
 Fung et al. [2010] B. C. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (Csur), 42(4), 2010.
 Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems (NeurIPS), 2014.
 Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Harder et al. [2020] F. Harder, K. Adamczewski, and M. Park. Differentially private mean embeddings with random features (dp-merf) for simple & practical synthetic data generation. arXiv preprint arXiv:2002.11603, 2020.
 Hardt and Rothblum [2010] M. Hardt and G. N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), 2010.
 Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
 Lewis et al. [2004] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 5(Apr), 2004.
 Li et al. [2017] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Li et al. [2020] J. Li, M. Khodak, S. Caldas, and A. Talwalkar. Differentially private meta-learning. In International Conference on Learning Representations (ICLR), 2020.
 Li et al. [2012] N. Li, W. Qardaji, and D. Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, 2012.
 Li et al. [2016] N. Li, M. Lyu, D. Su, and W. Yang. Differential privacy: From theory to practice. Synthesis Lectures on Information Security, Privacy, & Trust, 8(4), 2016.
 Long et al. [2019] Y. Long, S. Lin, Z. Yang, C. A. Gunter, and B. Li. Scalable differentially private generative student model via pate. arXiv preprint arXiv:1906.09338, 2019.
 Mckenna et al. [2019] R. Mckenna, D. Sheldon, and G. Miklau. Graphical-model based estimation and inference for differential privacy. In International Conference on Machine Learning (ICML), 2019.
 McMahan et al. [2018] B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations (ICLR), 2018.

 McMahan et al. [2017] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
 Mironov [2017] I. Mironov. Rényi differential privacy. In IEEE 30th Computer Security Foundations Symposium (CSF), 2017.
 Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12, 2011.
 Radford et al. [2016] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Y. Bengio and Y. LeCun, editors, International Conference on Learning Representations (ICLR), 2016.
 Roth and Roughgarden [2010] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In Proceedings of the forty-second ACM Symposium on Theory of Computing (STOC), 2010.
 Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
 Torkzadehmahani et al. [2019] R. Torkzadehmahani, P. Kairouz, and B. Paten. DP-CGAN: Differentially private synthetic data and label generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

 Van Erven and Harremos [2014] T. Van Erven and P. Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7), 2014.
 Wang et al. [2019] Y.-X. Wang, B. Balle, and S. P. Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
 Xiao et al. [2017] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
 Xie et al. [2018] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
 Yoon et al. [2019] J. Yoon, J. Jordon, and M. van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations (ICLR), 2019.

 Zhang et al. [2017] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4), 2017.
 Zhang et al. [2018] X. Zhang, S. Ji, and T. Wang. Differentially private releasing via deep generative model (technical report). arXiv preprint arXiv:1801.01594, 2018.
Appendix A Privacy Analysis
The privacy cost ($\varepsilon$) computation consists of: (i) bounding the privacy loss of our gradient sanitization mechanism using RDP; (ii) applying the analytical moments accountant for subsampled RDP Wang et al. [2019] to obtain a tighter upper bound on the RDP parameters; (iii) tracking the overall privacy cost by multiplying the RDP parameter at each order by the number of training iterations and converting the result to an $(\varepsilon, \delta)$ pair (Definition 3.2 Mironov [2017]). We present the theoretical results below.
Theorem 4.1.
Each generator update step satisfies $(\lambda, 2B\lambda/\sigma^2)$-RDP, where $B$ is the batch size.
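Proof.
Let $f = \mathrm{clip}(g_G^{\text{upstream}}, C)$, i.e., the clipped gradient before being sanitized. The sensitivity can be derived via the triangle inequality:
$$\Delta_2 f = \max_{S, S'} \| f(S) - f(S') \|_2 \le 2C \tag{14}$$
with $C = 1$ in our case. Since the added Gaussian noise has standard deviation $\sigma C$, the Gaussian mechanism with this sensitivity has RDP parameter $\lambda (2C)^2 / (2\sigma^2 C^2) = 2\lambda/\sigma^2$ Mironov [2017]. Hence, $\mathcal{M}_{\sigma,C}$ is $(\lambda, 2\lambda/\sigma^2)$-RDP.
Each generator update step (which operates on a batch of data) can be expressed as
$$\hat{g}_G = \frac{1}{B} \sum_{i=1}^{B} \mathcal{M}_{\sigma,C}\big(\nabla_{G(z_i)} \mathcal{L}_G(\theta_G)\big) \cdot J_{\theta_G} G(z_i; \theta_G) \tag{15}$$
This can be seen as a composition of $B$ Gaussian mechanisms. Concretely, we want to bound the Rényi divergence $D_\lambda(\hat{g}_G(S) \,\|\, \hat{g}_G(S'))$ with $S, S'$ denoting the neighbouring datasets. We use the following properties of Rényi divergence Van Erven and Harremos [2014]:
(i) Data-processing inequality: $D_\lambda(P_Y \,\|\, Q_Y) \le D_\lambda(P_X \,\|\, Q_X)$ if the transition probabilities $A(Y|X)$ in the Markov chain $X \to Y$ are fixed.
(ii) Additivity: For arbitrary distributions $P_1, \dots, P_N$ and $Q_1, \dots, Q_N$, let $P^N = P_1 \times \cdots \times P_N$ and $Q^N = Q_1 \times \cdots \times Q_N$. Then $D_\lambda(P^N \,\|\, Q^N) = \sum_{n=1}^{N} D_\lambda(P_n \,\|\, Q_n)$.
Let $u$ and $v$ denote the output distribution of the sanitization mechanism $\mathcal{M}_{\sigma,C}$ when applied on $S$ and $S'$ respectively, and $h$ the post-processing function (i.e., multiplication with the local Jacobian). We have,
$$D_\lambda\big(\hat{g}_G(S) \,\|\, \hat{g}_G(S')\big) \le D_\lambda\big(h_1(u_1) * \cdots * h_B(u_B) \,\|\, h_1(v_1) * \cdots * h_B(v_B)\big) \tag{16}$$
$$\le D_\lambda\big((h_1(u_1), \dots, h_B(u_B)) \,\|\, (h_1(v_1), \dots, h_B(v_B))\big) \tag{17}$$
$$= \sum_{b=1}^{B} D_\lambda\big(h_b(u_b) \,\|\, h_b(v_b)\big) \tag{18}$$
$$\le \sum_{b=1}^{B} D_\lambda(u_b \,\|\, v_b) \tag{19}$$
$$\le B \cdot \max_{b} D_\lambda(u_b \,\|\, v_b) \tag{20}$$
$$\le \frac{2B\lambda}{\sigma^2} \tag{21}$$
where (16), (17), and (19) are based on the data-processing inequality; (18) follows from the additivity; and the last inequality follows from the $(\lambda, 2\lambda/\sigma^2)$-RDP of $\mathcal{M}_{\sigma,C}$. ∎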
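The sensitivity argument at the start of the proof can be checked numerically. The following is a minimal sketch, not taken from the released implementation: the function names, the example vectors, the rescaling form of the clipping, and the choice of noise standard deviation (noise scale times clipping bound) are all assumptions made for illustration. It clips two arbitrary gradients to norm $C$ and verifies that their distance is at most $2C$.

```python
import math
import random


def clip(grad, bound):
    """Rescale `grad` so that its L2 norm is at most `bound` (assumed clipping form)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def sanitize(grad, bound=1.0, sigma=1.07, rng=random):
    """Clip, then add Gaussian noise with standard deviation sigma * bound (assumption)."""
    clipped = clip(grad, bound)
    return [g + rng.gauss(0.0, sigma * bound) for g in clipped]


def l2_distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical upstream gradients, standing in for neighbouring datasets.
g1 = [3.0, -4.0, 12.0]
g2 = [-5.0, 0.5, -2.0]
C = 1.0

# After clipping, each vector lies in the ball of radius C, so by the
# triangle inequality their distance cannot exceed 2 * C.
gap = l2_distance(clip(g1, C), clip(g2, C))
print(gap <= 2 * C)
```

Because the bound holds for any pair of inputs, the noise can be calibrated to $2C$ without inspecting the data.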
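Theorem A.1.
(RDP for Subsampled Mechanisms Wang et al. [2019]) Given a dataset containing $n$ datapoints with domain $\mathcal{X}$ and a randomized mechanism $\mathcal{M}$ that takes an input from $\mathcal{X}^m$ for $m \le n$, let the randomized algorithm $\mathcal{M} \circ \mathrm{subsample}$ be defined as: (i) subsample: subsample without replacement $m$ datapoints of the dataset (with subsampling rate $\gamma = m/n$); (ii) apply $\mathcal{M}$: a randomized algorithm taking the subsampled dataset as the input. For all integers $\lambda \ge 2$, if $\mathcal{M}$ is $(\lambda, \varepsilon(\lambda))$-RDP, then $\mathcal{M} \circ \mathrm{subsample}$ is $(\lambda, \varepsilon'(\lambda))$-RDP where
$$\varepsilon'(\lambda) \le \frac{1}{\lambda - 1} \log\Big(1 + \gamma^2 \binom{\lambda}{2} \min\big\{4(e^{\varepsilon(2)} - 1),\; e^{\varepsilon(2)} \min\{2, (e^{\varepsilon(\infty)} - 1)^2\}\big\} + \sum_{j=3}^{\lambda} \gamma^j \binom{\lambda}{j} e^{(j-1)\varepsilon(j)} \min\{2, (e^{\varepsilon(\infty)} - 1)^j\}\Big).$$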
In practice, we adopt the official implementation of Wang et al. [2019] (https://github.com/yuxiangw/autodp) for computing the accumulated privacy cost (i.e., tracking the RDP orders and converting RDP to DP).
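To make the accounting steps concrete, the sketch below evaluates the subsampled-RDP bound of Wang et al. [2019] in log space, composes it over training steps, and converts the result to $(\varepsilon, \delta)$-DP. It is a hand-written illustration and does not use or reproduce the autodp API. All function names are hypothetical, the per-step RDP curve $2B\lambda/\sigma^2$ is assumed for the base mechanism, and the Gaussian mechanism is assumed to have no finite pure-DP parameter, so every inner minimum of the bound evaluates to 2.

```python
import math


def base_rdp(order, sigma, batch_size):
    """Assumed per-step RDP curve of the unsubsampled mechanism: 2 * B * order / sigma^2."""
    return 2.0 * batch_size * order / sigma ** 2


def _log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def subsampled_rdp(order, gamma, eps_fn):
    """Upper bound on the RDP parameter after subsampling, for integer order >= 2."""
    if order < 2 or int(order) != order:
        raise ValueError("order must be an integer >= 2")
    eps2 = eps_fn(2)
    # log(4 * (e^eps2 - 1)) and log(2 * e^eps2), computed without overflow.
    log_first = math.log(4.0) + eps2 + math.log(-math.expm1(-eps2))
    log_second = math.log(2.0) + eps2
    terms = [0.0]  # log(1)
    terms.append(2 * math.log(gamma) + _log_comb(order, 2) + min(log_first, log_second))
    for j in range(3, order + 1):
        terms.append(j * math.log(gamma) + _log_comb(order, j)
                     + (j - 1) * eps_fn(j) + math.log(2.0))
    return _logsumexp(terms) / (order - 1)


def rdp_to_dp(rdp_eps, order, delta):
    """Convert an (order, rdp_eps)-RDP guarantee to (eps, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (order - 1)


def epsilon_after(steps, delta, gamma, eps_fn, orders=range(2, 65)):
    """Compose over `steps` iterations and pick the best order."""
    return min(rdp_to_dp(steps * subsampled_rdp(a, gamma, eps_fn), a, delta)
               for a in orders)


# Hypothetical usage: batch_size=1 and delta=1e-5 are illustrative values only.
curve = lambda a: base_rdp(a, sigma=1.07, batch_size=1)
print(epsilon_after(steps=20000, delta=1e-5, gamma=1 / 1000, eps_fn=curve))
```

Working in log space avoids overflow when $(j-1)\varepsilon(j)$ becomes large, which happens quickly for large batch sizes or small noise scales.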
Appendix B Algorithm
Appendix C Experiment Setup
C.1 Hyperparameters
We adopt the hyperparameter settings of Gulrajani et al. [2017] for the GAN training, and list below the hyperparameters relevant for privacy computation.
Centralized Setting. We use by default a subsampling rate of $\gamma$=1/1000, noise scale $\sigma$=1.07, pretraining (warm-start) for 2K iterations and subsequently training for 20K iterations.
Federated Setting. We use by default a noise scale $\sigma$=1.07, pretraining (warm-start) for 2K iterations and subsequently training for 30K iterations.
C.2 Datasets
Centralized Setting. MNIST LeCun et al. [1998] and Fashion-MNIST Xiao et al. [2017] datasets contain 60K training images and 10K testing images. Each image has dimension $28 \times 28$ and belongs to one of the 10 classes.
Federated Setting. Federated EMNIST Caldas et al. [2018] dataset contains grayscale images of handwritten letters and numbers, grouped by user. The entire dataset contains 3400 users with 671,585 training examples and 77,483 testing examples. Following Augenstein et al. [2020], the users are filtered by the prediction accuracy of a 36-class (10 numeric digits + 26 letters) CNN classifier. For evaluating the sample quality, we train GAN models on the data of users that yields high classification accuracy (866 users). For simulating the debugging task, we randomly choose a fraction of the users and preprocess their data by flipping the pixel intensities. To mimic the real-world situation where the server is blind to the erroneous preprocessing, users with low classification accuracy (threshold 88.2%) are selected (2136 users) as they are suspected to be affected by erroneous flipping (with bug). Note that only a fraction of them is indeed affected by the bug (1720 with bug, 416 without bug). This has the realistic property that the client data is non-IID and poses additional difficulties in the GAN training.
C.3 Evaluation Metrics
In line with previous literature, we use Inception Score (IS) Salimans et al. [2016], Li et al. [2017] and Fréchet Inception Distance (FID) Heusel et al. [2017] for measuring sample quality, and classification accuracy for evaluating the usefulness of generated samples. We present below a detailed explanation of the evaluation metrics we adopted in the experiments.
Inception Score (IS). Formally, the IS is defined as follows,
$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim G(z)}\, D_{\mathrm{KL}}\big(\Pr(y|x) \,\|\, \Pr(y)\big)\Big)$$
which corresponds to the exponential of the KL divergence between the conditional class distribution $\Pr(y|x)$ and the marginal class distribution $\Pr(y)$, where both $\Pr(y|x)$ and $\Pr(y)$ are measured by the output distribution of a pretrained classifier when passing the generated samples as input. Intuitively, the IS should exhibit a high value if $\Pr(y|x)$ has low entropy (i.e., the generated images are sharp and contain clear objects) and $\Pr(y)$ is of high entropy (i.e., the generated samples have a high diversity covering all the different classes). In our experiments, we use classifiers pretrained on the real datasets (with test accuracy equal to 99.25%, 93.75%, and 92.16% on the MNIST, Fashion-MNIST and Federated EMNIST datasets respectively) for computing the IS (https://github.com/ChunyuanLI/MNIST_Inception_Score).
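As a worked illustration of this definition, the following minimal sketch computes the IS from a list of per-sample class-probability vectors. The function name and the toy inputs are hypothetical; in practice the probabilities would come from the pretrained classifier rather than being written by hand.

```python
import math


def inception_score(probs):
    """Compute exp(mean_x KL(p(y|x) || p(y))) from per-sample class probabilities."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y): average of the conditionals.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


# Uninformative predictions: conditional equals marginal, so the score is 1.
flat = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
# Confident and balanced predictions over 4 classes: the score reaches 4.
sharp = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(inception_score(flat), inception_score(sharp))
```

The two toy cases mark the extremes for four classes: the score ranges from 1 (no information in the predictions) to the number of classes (sharp predictions spread evenly over all classes).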
Fréchet Inception Distance (FID). The FID is formulated as follows,
$$\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$$
where $x_r \sim \mathcal{N}(\mu_r, \Sigma_r)$ and $x_g \sim \mathcal{N}(\mu_g, \Sigma_g)$ are the 2048-dimensional activations of the Inception-v3 pool3 layer for real and generated samples respectively. A lower FID value indicates a smaller discrepancy between the real and generated samples, which corresponds to a better sample quality and diversity. Following previous works (https://github.com/google/compare_gan), we rescale the images and convert them to RGB by repeating the grayscale channel three times before inputting them to the Inception network.
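The Fréchet distance between two sets of feature activations can be sketched as follows. This is a minimal NumPy illustration, not the compare_gan code: the function name is hypothetical, random vectors stand in for Inception activations, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product instead of an explicit matrix square root.

```python
import numpy as np


def frechet_distance(feat_real, feat_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature arrays."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # The product of two PSD matrices has real, non-negative eigenvalues, and
    # the trace of its square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    return mean_term + float(np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Toy features standing in for Inception activations (5 dimensions, not 2048).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
shifted = real + 2.0  # same covariance, mean shifted by 2 in every dimension
print(frechet_distance(real, real), frechet_distance(real, shifted))
```

With identical covariances only the mean term remains, so the shifted copy has a distance close to $5 \cdot 2^2 = 20$, while identical inputs give a distance close to zero.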
Classification Accuracy.
We consider the following classification models in our experiments: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), AdaBoost (adaboost), Bagging (bagging), Bernoulli Naive Bayes (bernoulli nb), Decision tree (decision tree), Gaussian Naive Bayes (gaussian nb), Gradient Boosting (gbm), Linear Discriminant Analysis (lda), Linear Support Vector Machine (linear svc), Logistic Regression (logistic reg), Random Forest (random forest), and XGBoost (xgboost). For implementing the CNN model, we use two hidden layers (with dropout) containing 32 and 64 kernels respectively and apply ReLU as the activation function. For implementing the MLP, we use one hidden layer with 100 neurons and set ReLU as the activation function. All the other classification models are implemented using the default hyperparameters supplied by the scikit-learn Pedregosa et al. [2011] package.
C.4 Baseline Methods
We present more details about the implementation of the baseline methods. In particular, we provide the default value of the privacy hyperparameters below.
DP-Merf (AE) (https://github.com/frhrdr/DifferentiallyPrivateMeanEmbeddingswithRandomFeaturesforSyntheticDataGeneration). We use as default a batch size=500 ($\gamma$=1/120), noise scale $\sigma$=0.588, training iterations=600 (epochs=5) for implementing DP-Merf, and batch size=500, noise scale $\sigma$=0.686, training iterations=2040 (epochs=17) for implementing DP-Merf AE.
DP-SGD GAN (https://github.com/reihanehtorkzadehmahani/DPCGAN). We set the default hyperparameters as follows: gradient clipping bound $C$=1.1, noise scale $\sigma$=2.1, batch size=600, training iterations=30K.
G-PATE. We use 2000 teacher discriminators with batch size of 30 and set noise scales $\sigma_1$=600 and $\sigma_2$=100, consensus threshold $T$=0.5. A random projection with projection dimension=10 is applied.
PATE-GAN (https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/2534877d99c8fdf19cbade16057990171e249ef3/alg/pategan/). When extending PATE-GAN to high-dimensional image datasets, we observe that after a few iterations, the generated samples are classified as fake by all teacher discriminators and the learning signals (gradients) for the student discriminator and the generator vanish. Consequently, the training gets stuck at an early stage where the losses remain unchanged and no progress can be observed. While the original paper reports that this issue is well resolved by careful design of the prior distribution, we find that this technique has a limited effect when applied to high-dimensional image datasets. In addition, we make the following attempts to address this issue: (i) changing the network initialization; (ii) increasing (or decreasing) the network capacity of the student discriminator, the teacher discriminators, and the generator; (iii) increasing the number of iterations for updating the student discriminator and/or the generator. Despite some progress in preserving the gradients for more iterations, none of the above attempts successfully eliminates the issue, as the training inevitably gets stuck within 1K iterations.
Appendix D Additional Results
Effects of gradient clipping. We show in Figure A1 the gradient norm distribution before and after gradient clipping. The clipping bound is set to be 1.1 for DP-SGD and 1 for our method. In contrast to DP-SGD, the clipping operation distorts less information in our framework, as witnessed by a much smaller difference in the average gradient norm before and after the clipping. Moreover, the gradients used in our method exhibit much less variance both before and after the clipping compared with DP-SGD.
Comparison to Baselines. We provide the detailed quantitative results in Tables A1 and A2, which are supplementary to Table 1 in the main paper. We show in parentheses the calibrated accuracy, i.e., the absolute accuracy of each classifier trained on generated data divided by the accuracy when trained on real data. The results are averaged over five runs.
| Classifier | Real | GAN (non-private) | G-PATE | DP-SGD GAN | DP-Merf | DP-Merf AE | Ours |
|---|---|---|---|---|---|---|---|
| MLP | 0.98 | 0.84 (85%) | 0.25 (26%) | 0.60 (61%) | 0.63 (64%) | 0.54 (55%) | 0.79 (81%) |
| CNN | 0.99 | 0.84 (85%) | 0.51 (52%) | 0.64 (65%) | 0.63 (64%) | 0.68 (69%) | 0.80 (81%) |
| adaboost | 0.73 | 0.28 (39%) | 0.11 (16%) | 0.32 (44%) | 0.38 (52%) | 0.21 (29%) | 0.21 (29%) |
| bagging | 0.93 | 0.46 (49%) | 0.36 (38%) | 0.44 (47%) | 0.43 (46%) | 0.33 (35%) | 0.45 (48%) |
| bernoulli nb | 0.84 | 0.80 (95%) | 0.71 (84%) | 0.62 (74%) | 0.76 (90%) | 0.50 (60%) | 0.77 (92%) |
| decision tree | 0.88 | 0.40 (45%) | 0.13 (14%) | 0.36 (41%) | 0.29 (33%) | 0.27 (31%) | 0.35 (40%) |
| gaussian nb | 0.56 | 0.71 (126%) | 0.61 (110%) | 0.37 (66%) | 0.57 (102%) | 0.17 (30%) | 0.64 (114%) |
| gbm | 0.91 | 0.50 (55%) | 0.11 (12%) | 0.45 (49%) | 0.36 (40%) | 0.20 (22%) | 0.39 (43%) |
| lda | 0.88 | 0.84 (95%) | 0.60 (68%) | 0.59 (67%) | 0.72 (82%) | 0.55 (63%) | 0.78 (89%) |
| linear svc | 0.92 | 0.81 (88%) | 0.24 (26%) | 0.56 (61%) | 0.58 (63%) | 0.43 (47%) | 0.76 (83%) |
| logistic reg | 0.93 | 0.83 (90%) | 0.26 (28%) | 0.60 (65%) | 0.66 (71%) | 0.55 (59%) | 0.79 (85%) |
| random forest | 0.97 | 0.39 (41%) | 0.33 (34%) | 0.63 (65%) | 0.66 (68%) | 0.45 (46%) | 0.52 (54%) |
| xgboost | 0.91 | 0.44 (49%) | 0.15 (16%) | 0.60 (66%) | 0.70 (77%) | 0.54 (59%) | 0.50 (55%) |
| Average | 0.88 | 0.63 (71%) | 0.34 (40%) | 0.52 (59%) | 0.57 (66%) | 0.42 (47%) | 0.60 (69%) |

| Classifier | Real | GAN (non-private) | G-PATE | DP-SGD GAN | DP-Merf | DP-Merf AE | Ours |
|---|---|---|---|---|---|---|---|
| MLP | 0.88 | 0.77 (88%) | 0.30 (34%) | 0.50 (57%) | 0.56 (64%) | 0.56 (64%) | 0.65 (74%) |
| CNN | 0.91 | 0.73 (80%) | 0.50 (54%) | 0.46 (51%) | 0.54 (59%) | 0.62 (68%) | 0.64 (70%) |
| adaboost | 0.56 | 0.41 (74%) | 0.42 (75%) | 0.21 (38%) | 0.33 (59%) | 0.26 (46%) | 0.25 (45%) |
| bagging | 0.84 | 0.57 (68%) | 0.38 (45%) | 0.32 (38%) | 0.40 (47%) | 0.45 (54%) | 0.47 (56%) |
| bernoulli nb | 0.65 | 0.59 (91%) | 0.57 (88%) | 0.50 (77%) | 0.62 (95%) | 0.54 (83%) | 0.55 (85%) |
| decision tree | 0.79 | 0.53 (67%) | 0.24 (30%) | 0.33 (42%) | 0.25 (32%) | 0.36 (46%) | 0.40 (51%) |
| gaussian nb | 0.59 | 0.55 (93%) | 0.57 (97%) | 0.28 (47%) | 0.59 (100%) | 0.12 (20%) | 0.48 (81%) |
| gbm | 0.83 | 0.44 (53%) | 0.25 (30%) | 0.38 (46%) | 0.27 (33%) | 0.30 (36%) | 0.38 (46%) |
| lda | 0.80 | 0.77 (96%) | 0.55 (69%) | 0.55 (69%) | 0.67 (84%) | 0.65 (81%) | 0.67 (84%) |
| linear svc | 0.84 | 0.77 (91%) | 0.30 (36%) | 0.39 (46%) | 0.46 (55%) | 0.40 (48%) | 0.65 (77%) |
| logistic reg | 0.84 | 0.76 (90%) | 0.35 (42%) | 0.51 (61%) | 0.59 (70%) | 0.50 (60%) | 0.68 (81%) |
| random forest | 0.88 | 0.69 (78%) | 0.33 (37%) | 0.51 (58%) | 0.61 (69%) | 0.55 (63%) | 0.54 (61%) |
| xgboost | 0.83 | 0.65 (78%) | 0.49 (59%) | 0.52 (63%) | 0.62 (75%) | 0.55 (66%) | 0.47 (57%) |
| Average | 0.79 | 0.61 (77%) | 0.40 (54%) | 0.42 (53%) | 0.50 (65%) | 0.45 (56%) | 0.53 (67%) |
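The calibrated accuracy reported in parentheses in the tables above is a simple ratio. The following minimal sketch, with a hypothetical function name, recomputes one entry from the rounded values in the first table.

```python
def calibrated_accuracy(acc_generated, acc_real):
    """Accuracy when trained on generated data, relative to training on real data."""
    return acc_generated / acc_real


# "Ours" and "Real" MLP entries of the first table: 0.79 and 0.98.
ours_mlp = 100 * calibrated_accuracy(0.79, 0.98)
print(round(ours_mlp))  # 81
```

Since the tabulated accuracies are rounded to two decimals, recomputing a percentage from them can differ from the reported value by one point (e.g., 0.84 / 0.98 gives about 86%, whereas the table reports 85%).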
Privacy-utility Curves. We show in Figure A2 the privacy-utility curves of different methods when applied to the Fashion-MNIST dataset. We evaluate over three runs and show the corresponding mean and standard deviation. Similar to the results shown in Figure 4 in the main paper, our method achieves a consistent improvement over prior methods across a broad range of privacy budget $\varepsilon$.