1 Introduction
The task of density estimation requires constructing an estimate of an unknown probability density function, given observed data. This density estimate can then be used to perform a variety of relevant analysis tasks, including log likelihood evaluation and synthetic data generation. In settings involving sensitive data, the construction and subsequent release of such an estimate could potentially leak private information. Without a rigorous privacy guarantee, nothing prevents a model from memorizing a row in the training set, assigning disproportionate density to a point, or exhibiting any other vulnerability that arises from arbitrary analysis of the learned parameters. Since density estimation remains a task of interest to the modeling community, continued attention is required to develop privacy-preserving methods for density estimation.
Differential privacy [17] has emerged as the predominant privacy notion in the context of statistical data analysis. At a high level, differentially private analyses limit the extent to which the distribution of outputs can change due to the inclusion or exclusion of any one individual from the analysis. Algorithms which adhere to this notion exhibit a number of desirable properties, including privacy guarantees which hold regardless of the auxiliary information an adversary may have and composition of privacy guarantees across multiple analyses. Hence differential privacy acts as a compelling gold standard in the design of privacy-preserving analyses.
Tools for density estimation have held longstanding interest due to their versatility. Their ability to address a wide range of distributional learning tasks is precisely why an accurate and privacy-preserving density estimator would be so valuable. For example, privately constructing such a model implicitly yields a differentially private approach to anomaly detection—a task of substantial investigation [3, 41, 21]—as an immediate application of likelihood inference. In addition, given that density estimators often enable efficient sampling, such a model would yield a method for privacy-preserving synthetic data generation. This task in particular has been of longstanding interest to the privacy community [47] as it addresses many of the limitations imposed by the query-release model [14] by allowing large numbers of arbitrary analyses. Privately generating a synthetic dataset only incurs a fixed privacy cost during the training process; all subsequent queries on the synthetic data are automatically differentially private due to the privacy notion’s post-processing guarantee, so the privacy cost does not scale with the number of downstream analyses performed.
Normalizing flow models are an attractive approach to the task of density estimation due to their empirical ability to approximate arbitrary, high-dimensional distributions. These models approach the task of density estimation by transforming a chosen base density through a sequence of invertible, nonlinear transformations, enabling density querying on the resulting distribution via an application of the change-of-variables formula. Approaches to density estimation in this manner include: Nonlinear Independent Components Estimation (NICE)
[9], Real NVP [10], Glow [33], and Masked Autoregressive Flows (MAF) [44]. Until this work, it was an open question whether normalizing flow models could be constructed in a differentially private manner to handle the task of privacy-preserving density estimation, combining the rigorous guarantees of differential privacy with the strong empirical performance exhibited by normalizing flows.

In this work we propose the use of normalizing flow models trained in a differentially private manner as a novel approach to the task of privacy-preserving density estimation. We provide an algorithm (DPNF, Algorithm 1 in Section 3) that privately optimizes the model parameters via gradient descent using DPSGD [1], which adds Gaussian noise to clipped gradient updates to ensure differential privacy. Additionally, we achieve tighter privacy guarantees than established in previous work [1] via composition with the recently introduced notion of Gaussian differential privacy [11]. We apply this optimization to the parameters of a Masked Autoregressive Flow [44], our primary architecture of consideration, and achieve empirical results (Section 4) which convincingly outperform previous approaches. Further, we show that our algorithm can be applied to solve the problem of differentially private anomaly detection (Section 5), and show that it leads to better true/false positive rates than existing private methods.
1.1 Related Work
Gaussian mixture models (GMMs) are known to be a particularly strong density estimation tool [43] since they are universal approximators of densities — that is, they are able to approximate any density function arbitrarily well given a sufficient number of components [36]
. They approach the task of density estimation by modeling the data distribution as a weighted sum of Gaussian distributions. The first differentially private algorithm for learning the parameters of a Gaussian mixture model comes from the work of
[40], which uses their sample-and-aggregate framework to convert non-private algorithms into private algorithms, applied to the task of learning mixtures of Gaussians. However, their approach exhibits strong assumptions on the range of the parameter space and assumes a uniform mixture of spherical Gaussians. Follow-up work of [31] proposes a modernized approach which improves upon the sample complexity of the aforementioned work and removes the strong a priori bounds on the parameters of the mixture components, although it makes the assumption that the components of the mixture are well-separated.

There has also been work in learning the parameters of a Gaussian mixture model through differentially private variants of the expectation maximization (EM) algorithm. One notable instance of this is DPGMM
[50], which achieves a privacy guarantee at each iteration of EM through the application of calibrated Laplace noise to the estimated model parameters following each maximization step. These individual privacy guarantees are then combined into an overall privacy guarantee via sequential composition, i.e., by taking the sum of privacy parameters across iterations. The work of [45] introduces DPEM, a general framework for privacy-preserving optimization via expectation maximization. Their approach follows a conceptually similar idea of applying either calibrated Laplace or Gaussian noise to the model parameters at the end of each EM iteration. They apply this method to learning mixtures of Gaussians, henceforth referred to as DPMoG, and they demonstrate significantly better privacy guarantees through composition via the moments accountant and zero-concentrated differential privacy (zCDP)
[6]. Given that their work makes no notable assumptions about the task and provides an empirical evaluation of their method, this is the most comparable approach to our own. As such, it is used as a baseline in our experimental results.

In addition, we take note of more classical approaches to the task of privacy-preserving density estimation. One of the simplest yet most widely used methods for density estimation is through the use of histograms, and previous work [8, 49]
has investigated their private estimation. Unfortunately, such an approach scales poorly with the dimension and complexity of the distribution while imposing an unrealistic discretization of the space. Kernel density estimation is another closely related approach, often characterized as the smooth analog to the classical discrete histogram. The work of
[26] proposes a method for privately querying the density of such an estimator through the addition of calibrated Gaussian noise. As a non-parametric approach, it has the drawback that it requires storage of the entire dataset at test time to enable querying (proving impractical for large-scale datasets) while still degrading similarly with dimension.

There have also been a number of deep learning based approaches to generative modeling which vary in their relevance. Although work of this nature technically allows for both sampling and likelihood evaluation, it does not allow for
exact likelihood inference as is the case for mixtures of Gaussians and normalizing flows. There is also an expansive literature concerning differentially private approaches to training Generative Adversarial Networks, yet these methods are strictly limited to sampling and do not provide a straightforward approach to likelihood inference.
Finally, we include a brief overview of the extensive literature concerning density estimation via normalizing flows. One important subset is those models characterized by coupling layers: transformations which partition the dimensions of their input and map them in a way that retains invertibility and a tractable Jacobian. This includes Nonlinear Independent Components Estimation (NICE) [9], as well as its subsequent generalization Real NVP [10]. Another notable approach, Glow [33], makes use of such coupling layers while also proposing the use of an invertible weight matrix decomposition to generalize the notion of permutation layers. Alternatively, some models make use of autoregressive transformations
, which are transformations that utilize the chain rule of probability to represent a joint distribution as a product of its conditionals. Such models include Masked Autoregressive Flow (MAF)
[44], a generalization of Real NVP optimized for density estimation, as well as its closely related Inverse Autoregressive Flow [34] optimized for variational inference, among others [42, 28, 20].

2 Preliminaries
2.1 Normalizing Flows
Let $p^*$ be the probability density function characterizing an unobservable distribution of interest, and let $X = \{x^{(1)}, \dots, x^{(n)}\}$ be observed i.i.d. samples from this distribution. The task of density estimation is to find an approximation $p_\theta$ of $p^*$ via some model given $X$. In the context of normalizing flows, this model is characterized by a prior distribution $\pi_u$, chosen to exhibit a simple and tractable density (e.g., the spherical multivariate Gaussian distribution), and a sequence of bijective functions $f_1, \dots, f_K$, parameterized fully by $\theta$. The composed function $f = f_K \circ \cdots \circ f_1$ acts as a transformation between the prior distribution $\pi_u$ and the approximated distribution $p_\theta$.
Given such a model, it can be used to efficiently sample by first sampling $u \sim \pi_u$ and then transforming the sample as $x = f(u)$. If $p_\theta$ is a good approximation of $p^*$, then this generative process gives an efficient (approximate) oracle for sampling from the unknown distribution.
Since $f$ is invertible, one can also perform exact likelihood evaluation on observed points from the data distribution via the change of variables formula, as follows:

$$\log p_\theta(x) = \log \pi_u\!\left(f^{-1}(x)\right) + \log\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|.$$
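The change-of-variables computation above can be made concrete with a minimal sketch. The example below uses a single affine flow layer $f(u) = a \odot u + b$ with a standard-normal prior; this is an illustration only, not the paper's MAF architecture, and the names `a`, `b`, `log_density` are our own.

```python
import numpy as np

# A minimal, hedged sketch of the change-of-variables formula for a
# single affine flow layer f(u) = a * u + b with a standard-normal
# prior pi_u.  This is an illustration, not the paper's MAF model.

def log_prior(u):
    """Log density of a standard multivariate normal prior."""
    return np.sum(-0.5 * u**2 - 0.5 * np.log(2 * np.pi), axis=-1)

def log_density(x, a, b):
    """log p_theta(x) = log pi_u(f^{-1}(x)) + log |det d f^{-1}(x) / dx|."""
    u = (x - b) / a                        # inverse transformation f^{-1}
    log_det = -np.sum(np.log(np.abs(a)))   # Jacobian of f^{-1} is diag(1/a)
    return log_prior(u) + log_det

def sample(n, a, b, rng):
    """Draw u ~ pi_u and push it through f to sample from p_theta."""
    u = rng.standard_normal((n, a.shape[0]))
    return a * u + b

rng = np.random.default_rng(0)
a, b = np.array([2.0, 0.5]), np.array([1.0, -1.0])
x = sample(5, a, b, rng)
densities = log_density(x, a, b)
```

For a richer flow, `log_density` would accumulate one log-determinant term per layer, but the structure of the computation is unchanged.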
Finding a good approximation is achieved through optimization of $\theta$ to minimize the negative log likelihood of the observed dataset:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_\theta\!\left(x^{(i)}\right). \qquad (1)$$
In practice, one will typically find the MLE $\hat{\theta} = \arg\min_\theta \mathcal{L}(\theta)$ using some non-convex optimization method, such as stochastic gradient descent.
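As a toy illustration of minimizing Equation (1) by gradient descent, the sketch below fits a one-parameter Gaussian location family $N(\theta, 1)$, standing in for the flow parameters; the learning rate and step count are illustrative assumptions.

```python
import numpy as np

# Toy sketch of minimizing the negative log likelihood by gradient
# descent.  The "model" is a one-parameter Gaussian location family
# N(theta, 1) standing in for the flow parameters.

def nll(theta, data):
    """Negative log likelihood of data under N(theta, 1)."""
    return float(np.sum(0.5 * (data - theta) ** 2 + 0.5 * np.log(2 * np.pi)))

def fit_mle(data, lr=0.1, steps=200):
    theta = 0.0
    for _ in range(steps):
        grad = np.mean(theta - data)  # average gradient of the NLL
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)
theta_hat = fit_mle(data)
```

For this model the MLE is the sample mean, so the loop converges to it; for a flow the same loop applies with automatic differentiation supplying the gradient.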
2.2 Differential Privacy
Differential privacy [17] has become the gold standard for ensuring the privacy of statistical analyses applied to sensitive databases. At a high level, it ensures that changing a single entry in the database will have only a small effect on the distribution of analysis results.
Definition 1 ([17]).
A randomized algorithm $M$ satisfies $(\epsilon, \delta)$-differential privacy (DP) if for any two input databases $X, X'$ that differ in a single entry and for any subset of outputs $S \subseteq \mathrm{Range}(M)$, it satisfies

$$\Pr[M(X) \in S] \le e^{\epsilon} \Pr[M(X') \in S] + \delta.$$
One common algorithmic approach for achieving differential privacy is adding noise that scales with the sensitivity of the function being evaluated, which is the maximum change in the function’s value that can result from changing a single data point. Differentially private algorithms are robust to post-processing, meaning that any data-independent function of a differentially private output retains the same privacy guarantee, and they enjoy composition, meaning that the privacy parameters degrade gracefully as additional analyses are performed on the dataset. The simplest version of composition is that the privacy parameters $\epsilon$ and $\delta$ “add up” over multiple analyses, although stronger versions of composition are also used.
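As a concrete instance of sensitivity-calibrated noise, the hedged sketch below releases a counting query with Laplace noise. A count changes by at most 1 when one record changes, so its $\ell_1$-sensitivity is 1 and noise of scale $1/\epsilon$ yields $\epsilon$-DP; the function names are our own.

```python
import numpy as np

# Hedged sketch of sensitivity-calibrated noise: a counting query has
# L1-sensitivity 1, and Laplace noise of scale sensitivity/epsilon
# yields epsilon-DP.

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value + Lap(sensitivity / epsilon)."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
data = np.array([0.3, 1.2, 0.7, 2.4])
true_count = float(np.sum(data > 1.0))   # = 2.0, sensitivity 1
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0, rng=rng)
```

Because the mechanism's noise is mean-zero, the released value is an unbiased estimate of the true count.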
Differentially Private Stochastic Gradient Descent (DPSGD, presented formally in Algorithm 3 in Appendix A.4) was introduced by [1] as a method for private non-convex optimization. At each step $t$, DPSGD subsamples a small set of data points (the original algorithm of [1] does this via Poisson subsampling, but it can also be done via uniform subsampling while retaining a privacy guarantee [48]) and uses this batch to compute a gradient update. To achieve a differential privacy guarantee, DPSGD adds mean-zero Gaussian noise to the average of the per-example gradients. The standard deviation of this noise is scaled with the sensitivity of the gradient estimation. Since this sensitivity is a priori unbounded, the per-example gradients are first clipped so that the $\ell_2$ norm of each is at most some input parameter $C$, thus bounding the sensitivity, and noise scaling with $C$ is then added.

[1] also introduced the moments accountant, which provides tight privacy composition across multiple gradient update steps in DPSGD. To describe the moments accountant, given an algorithm $M$ and two neighboring datasets $X, X'$, first we denote the privacy loss of a particular outcome $o$ as $c(o) = \log \frac{\Pr[M(X) = o]}{\Pr[M(X') = o]}$. The moments accountant calculates a privacy budget by bounding the moments of the privacy loss random variable. That is, if we consider the log of the moment generating function (MGF) of the privacy loss random variable evaluated at $\lambda$, i.e., $\alpha_M(\lambda) = \max_{X, X'} \log \mathbb{E}_{o \sim M(X)}\left[\exp\left(\lambda \, c(o)\right)\right]$, this worst case over all neighboring databases composes linearly across multiple mechanisms (Theorem 2.1 [1]) and allows for conversion to an associated differential privacy guarantee through the relation $\delta = \min_{\lambda} \exp\left(\alpha_M(\lambda) - \lambda \epsilon\right)$. Follow-up work of [5] introduced NoisySGD, which followed the same algorithmic structure but analyzed privacy composition under Gaussian differential privacy [11]. For the purpose of this work it is sufficient to simply note the associated benefits of analysis under Gaussian differential privacy: it naturally lends itself to composition under subsampling and allows for analytically tractable expressions of the privacy guarantees of NoisySGD, while providing a slightly tighter overall privacy bound than that achieved by the moments accountant. Further details are provided in Appendix A.

3 Differentially Private Normalizing Flows
In this section we introduce our algorithm for differentially private density estimation via normalizing flows, DPNF, presented in Algorithm 1. It is based on the DPSGD algorithm of [1], a differentially private method for performing stochastic gradient descent. We also briefly discuss performance improvements using data-dependent initialization of normalization layers and using a differentially private estimate of the distribution to act as a prior, both of which are explored further in Appendix B. We emphasize that our primary technical contribution is not in the design of these algorithms, but rather the novel application of these tools to the problem of differentially private density estimation in a way that yields substantial performance improvements over prior work, as demonstrated by our empirical results in Section 4.
3.1 DPNF Algorithm
Training a normalizing flow model corresponds to minimizing the loss function $\mathcal{L}(\theta)$ in Equation (1). This loss function is non-convex when applied to the optimization of a nonlinear normalizing flow model, and hence optimization is typically performed via gradient descent on $\mathcal{L}(\theta)$. To make this training private in Algorithm 1, we update $\theta$ using the DPSGD algorithm of [1] described in Section 2.2, with some subtle yet important augmentations to the standard minibatch gradient descent process to allow for an explicit privacy guarantee, in accordance with DPSGD. First, batches are sampled via uniform subsampling (Line 4). That is, each possible batch of size $L$ has equal likelihood of being chosen (as opposed to repeatedly shuffling the dataset and taking equally sized partitions of the dataset, which is often preferred in practice). Second, rather than computing the gradient with respect to the entire batch, the gradient with respect to each individual data point is calculated, clipped to have maximum norm $C$, averaged, and then combined with a randomly sampled Gaussian noise vector (Lines 6–9).
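The clip-average-noise step can be sketched as follows. This is a hedged, simplified stand-in for the corresponding lines of Algorithm 1 (not the paper's exact code), and the toy gradients here are arbitrary placeholders for gradients of the flow's loss.

```python
import numpy as np

# Minimal sketch of one DPNF-style update: clip each per-example
# gradient to L2 norm C, sum, add N(0, (sigma*C)^2) noise, average,
# and take a gradient step.

def clip_gradient(g, C):
    norm = np.linalg.norm(g)
    return g if norm <= C else g * (C / norm)

def dp_sgd_step(theta, per_example_grads, C, sigma, lr, rng):
    clipped = [clip_gradient(g, C) for g in per_example_grads]
    noise = rng.normal(0.0, sigma * C, size=theta.shape)
    noisy_avg = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return theta - lr * noisy_avg

rng = np.random.default_rng(0)
theta = np.zeros(3)
grads = [rng.normal(size=3) * 10.0 for _ in range(8)]  # deliberately large
theta_next = dp_sgd_step(theta, grads, C=1.0, sigma=1.0, lr=0.1, rng=rng)
```

The clipping bound $C$ caps each example's influence, so the noisy sum has sensitivity $C$ regardless of the raw gradient magnitudes.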
Algorithm 1 also requires a privacy accountant to be specified as input. This privacy accountant will dynamically track the privacy loss incurred by composition over all gradient update steps as a function of the training parameters, and will halt the algorithm once a prespecified budget is reached. A privacy accountant takes in the round of training $t$, the sampling probability of a single point (here a batch of size $L$ is sampled uniformly from a set of $n$ data points), the noise scale $\sigma$ that is added to preserve privacy, the bound $C$ on the norm of each gradient, and the privacy parameter $\delta$. At every time step, the privacy accountant maintains the current privacy budget that has been expended through round $t$ given the input parameters. Common choices for this accountant include the moments accountant (MA) [1] or composition via Gaussian differential privacy (GDP) [11]. In our experiments in Section 4, we obtain preferable results using a GDP privacy accountant.
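To make the two accountants concrete, here is a hedged numerical sketch. For the moments accountant we use the non-subsampled Gaussian mechanism's log-MGF and the standard conversion to $(\epsilon, \delta)$; for GDP we use the central-limit approximation and its $(\epsilon, \delta)$ dual. The exact constants should be treated as assumptions to verify against [1], [5], and [11], and production code should use an audited accountant implementation.

```python
import math

# Hedged accountant sketches.  Moments accountant: for the Gaussian
# mechanism, alpha(lam) = lam*(lam+1)/(2*sigma^2) composes additively
# over steps, and epsilon = min_lam (alpha + log(1/delta)) / lam.
# GDP: mu = p * sqrt(T * (exp(1/sigma^2) - 1)) with a Gaussian-CDF dual.

def eps_moments(sigma, steps, delta, max_lam=64):
    best = float("inf")
    for lam in range(1, max_lam + 1):
        alpha = steps * lam * (lam + 1) / (2.0 * sigma**2)
        best = min(best, (alpha + math.log(1.0 / delta)) / lam)
    return best

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def delta_gdp(p, sigma, steps, epsilon):
    mu = p * math.sqrt(steps * (math.exp(1.0 / sigma**2) - 1.0))
    return Phi(-epsilon / mu + mu / 2.0) - math.exp(epsilon) * Phi(-epsilon / mu - mu / 2.0)

eps = eps_moments(sigma=4.0, steps=1000, delta=1e-5)
d = delta_gdp(p=256 / 26733, sigma=2.0, steps=10000, epsilon=1.0)
```

Both functions expose the qualitative behavior the accountant relies on: more noise (larger $\sigma$) buys a smaller $\epsilon$, and for a fixed mechanism $\delta$ shrinks as $\epsilon$ grows.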
In summary, DPNF in Algorithm 1 is a modified version of DPSGD, instantiated to train a normalizing flow model with the analyst’s choice of privacy accountant.
The privacy guarantees of DPNF follow as an immediate corollary from those of DPSGD [1] when instantiated with the moments accountant, and from NoisySGD [5] when instantiated with the Gaussian differential privacy accountant.
Theorem 1.
DPNF is $(\epsilon, \delta)$-differentially private.
3.2 DPNF Extensions
In practice, one will find that many deep learning models (including the normalizing flow models used in our experiments) are much better optimized using adaptive learning rate optimization schemes. Given this, we found significant benefit in using a direct extension of DPSGD which applies noisy gradients to the model according to the Adam [32] optimizer. Both methods achieve identical privacy guarantees, since the first and second moment estimates are merely deterministic, data-independent functions of the noisy gradients. Thus the two differ only in the post-processing of the noisy gradients, and the privacy guarantees are unchanged.
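The post-processing argument can be seen directly in code: the sketch below feeds an already-privatized (noisy) gradient through the standard Adam moment updates, touching no raw data. This follows the textbook Adam formulation and is not the paper's exact implementation.

```python
import numpy as np

# Sketch of applying a privatized (noisy) gradient through Adam's
# moment updates.  The moment estimates are deterministic,
# data-independent functions of the already-noised gradient, so this
# is post-processing and does not change the privacy guarantee.

def adam_step(theta, noisy_grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * noisy_grad        # first-moment estimate
    v = b2 * v + (1 - b2) * noisy_grad**2     # second-moment estimate
    m_hat = m / (1 - b1**t)                   # bias corrections
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

theta = np.zeros(2)
state = (np.zeros(2), np.zeros(2), 0)
theta, state = adam_step(theta, np.array([1.0, -1.0]), state)
```

Any other optimizer that consumes only the noisy gradients (momentum SGD, RMSProp, etc.) inherits the same privacy guarantee by the same argument.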
Two further extensions of Algorithm 1 are proposed below, which may provide substantial improvements to empirical performance.
Data-Dependent Initialization of Normalization Layers. Intermediate normalization layers such as activation normalization [33] have been proposed as a means to improve the stability of normalizing flow models. Activation normalization is characterized by a feature-wise offset and scaling of inputs by a learned set of parameters $\mu$ and $\sigma$, i.e., $y = (x - \mu) / \sigma$. In practice, these parameters are typically set via data-dependent initialization [46] by setting $\mu$ and $\sigma$ as the per-feature means and standard deviations observed throughout a forward pass of a sampled batch of data. These parameters can also be estimated privately, e.g., by applying the Laplace Mechanism [17] to the clipped mean and standard deviation, thus allowing for data-dependent initialization of these normalization layers. For more details, see Appendix B.1.
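A hedged sketch of such a private initialization is given below: activations are clipped to a known range $[-B, B]$, and Laplace noise calibrated to the clipped statistics' sensitivity is added, splitting the budget between the two queries. The clipping bound $B$ and the budget split are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hedged sketch of privately estimating activation-normalization
# initialization statistics via the Laplace Mechanism on clipped data.

def dp_feature_stats(batch, B, epsilon, rng):
    x = np.clip(batch, -B, B)
    n, d = x.shape
    # one record moves each per-feature mean by at most 2B/n, so the
    # mean vector has L1-sensitivity 2*B*d/n
    mean = x.mean(axis=0) + rng.laplace(0.0, (2 * B * d / n) / (epsilon / 2), size=d)
    # x^2 lies in [0, B^2], so the second-moment vector has
    # L1-sensitivity B^2*d/n
    second = (x**2).mean(axis=0) + rng.laplace(0.0, (B**2 * d / n) / (epsilon / 2), size=d)
    std = np.sqrt(np.maximum(second - mean**2, 1e-6))  # clamp for stability
    return mean, std

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=0.5, size=(5000, 4))
mu, sigma = dp_feature_stats(batch, B=5.0, epsilon=1.0, rng=rng)
```

With a reasonably large batch the noise is small relative to the statistics, so the resulting initialization is close to the non-private one.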
Differentially Private Data-Dependent Priors. Section 2.1 suggested the analyst choose a data-independent prior $\pi_u$, such as the multivariate spherical Gaussian. However, recent work suggests that modest improvements in empirical results can be achieved through the use of more complex priors, such as a mixture of Gaussians [44], or by fitting a Gaussian mixture model to the data [30]. A natural privacy-preserving approach would be to first use DPMoG [45] with privacy budget $\epsilon_1$ to estimate a prior, and then refine the prior using DPNF with privacy budget $\epsilon_2$ to yield an encompassing normalizing flow model. By composition, this process would be differentially private with overall budget $\epsilon_1 + \epsilon_2$, and may yield preferable results in settings where the distribution is highly discontinuous, but also locally nonlinear. For more details, see Appendix B.2.
4 Experimental Results
In this section we present experimental results demonstrating the empirical performance of our approach, evaluating our algorithm on a variety of real and synthetic datasets and varying tasks. In the main body we focus our evaluation on a single dataset (the Life Science dataset [12], described next), and refer to Appendix C for all additional results on other real and synthetic datasets.
In all our experiments on the Life Science dataset, we used a fixed value of $\delta$ that is sublinear in the size of the database. Our baseline method for comparison [45] used a larger value of $\delta$, which on a dataset of this size is typically deemed unacceptably large in the privacy community. Smaller values of $\delta$ would not change our qualitative results, nor would they substantially change our quantitative results.
4.1 Datasets, Implementation, & Setup
The Life Science dataset is a standard density estimation benchmark dataset from the UCI machine learning repository
[12] containing 26,733 real-valued records of dimension 10. This dataset was used in the original evaluation of our baseline model [45]. Results using additional datasets are presented in Appendix C.

Experiments were run on a machine with 2 CPUs, 13 GB RAM, and a single NVIDIA Tesla K80 GPU, and took on the order of half an hour to five hours of wall-clock time, depending on the number of iterations and the dimensionality of the dataset. Models were implemented in the JAX [4]
deep learning framework, and used privacy accounting implementations from TensorFlow Privacy
[23].

Hyperparameter Search and Model Selection.
Reported privacy budgets in our results correspond only to the training of each model, and do not include privacy loss from hyperparameter search and model selection. We chose not to select hyperparameters in a privacy-preserving manner because this was not the focus of our contribution and because it was not done in our baseline method.
(These can be done privately: for example, [25] provides discrete optimization methods that can be used for private hyperparameter search over discrete model architectures, and [2] uses Report Noisy Max [16] for private model selection. Some work has also been done to account for high-performance models without having to spend a significant privacy budget [7, 35].) It was generally observed that choices in the network structure itself had relatively negligible impact on results. We found that training parameters such as the gradient clipping bound $C$ and batch size $L$ had a much more substantial impact on model performance, which is consistent with observations made in
[1].

Model Architecture. The architecture of the model used in our experiments was a variant of a Masked Autoregressive Flow (MAF) [44] composed of a repeated sequence of five blocks, each containing a MADE [22] layer, a reversal layer, and an optional activation normalization layer. Models were optimized via Adam with its default parameters for $\beta_1$ and $\beta_2$. Further details of training parameters and procedures are given in Appendix C.3.
4.2 Empirical Performance of DPNF
We implemented our algorithm for differentially private normalizing flows on the Life Science dataset (and other datasets as described in Appendix C), and evaluated our performance against the baseline of DPMoG [45] for a variety of quantitative and qualitative metrics related to density estimation tasks.
4.2.1 Quantitative Evaluation: Expected Log Likelihood.
Arguably the most foundational metric for density estimators is the expected log likelihood they assign to held-out test points. Figure 1 presents the average log likelihood assigned to a held-out test set under DPNF and the baseline method DPMoG [45] as a function of $\epsilon$. We divided the dataset into 10 pairs of training (90%) and test (10%) sets, and report the average test log likelihood per data point across the 10 independent trials. Better methods should assign higher log likelihood to points in the held-out test set, since these points were indeed sampled from the underlying distribution of interest. We found that DPNF reliably assigned much higher likelihoods to holdout data than DPMoG for identical privacy budgets, across a variety of privacy accountant methods.
The privacy guarantees of DPNF proved quite practical, providing substantial privacy improvements over DPMoG for the same model performance. For example, DPNF matched the peak performance of DPMoG at a small fraction of the privacy expenditure. These results are also listed in Table 1, with error bars showing standard deviation across 10 independent runs.
[Table 1: average held-out test log likelihood on the Life Science dataset for DPNF under the GDP and MA accountants, and DPMoG under the MA and zCDP accountants.]
[45] showed performance of DPMoG under several different privacy accountant methods, with the moments accountant of [1] providing the best performance. We compared DPNF using the moments accountant for fair comparison, and using the novel Gaussian differential privacy (GDP) accountant of [5]. Figure 1 and Table 1 show that DPNF outperforms DPMoG for all privacy accountant methods considered for either model, emphasizing that while the GDP accountant does provide some benefit, the vast majority of the performance improvements come from the DPNF method itself. The benefits of using the GDP accountant are further explored in the appendix.
4.2.2 Quantitative Evaluation: Downstream Machine Learning Tasks.
Next we further evaluate the quality of our model by measuring the performance of downstream machine learning models trained on its generated synthetic data. A proper method for evaluating the strength of density estimation approaches is through the quality of their synthetic data, as measured by the ability to train a machine learning model that performs well on future, real data.
To perform this evaluation, we trained DPNF and DPMoG (along with their non-private variants for reference) to learn the distribution of the training data, and then used these models to generate a synthetic dataset. We then trained a simple regressor—nearest neighbors with default library settings—on each synthetically generated dataset and evaluated its performance in predicting a target value on real, held-out Life Science data. The Life Science dataset does not have an immediately associated prediction task, as it is primarily used solely as a density estimation benchmark. To artificially construct a prediction task, we simply chose to isolate the last column to act as a label and treated the remaining nine columns as features.
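The evaluation protocol above can be sketched end-to-end. In the sketch, a simple Gaussian sampler stands in for the trained generative model, a bare 1-nearest-neighbor regressor stands in for the library regressor, and all names are illustrative.

```python
import numpy as np

# Sketch of the downstream evaluation protocol: fit a nearest-neighbor
# regressor on synthetic (feature, label) pairs and measure its mean
# squared error on real held-out data.

def one_nn_predict(train_X, train_y, test_X):
    """Predict with the single nearest training point (1-NN)."""
    preds = [train_y[np.argmin(np.linalg.norm(train_X - x, axis=1))] for x in test_X]
    return np.array(preds)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

rng = np.random.default_rng(0)
# "real" held-out data: the label is the sum of the features plus noise
real_X = rng.normal(size=(200, 2))
real_y = real_X.sum(axis=1) + 0.1 * rng.normal(size=200)
# synthetic data drawn from a stand-in generator of the same law
syn_X = rng.normal(size=(2000, 2))
syn_y = syn_X.sum(axis=1) + 0.1 * rng.normal(size=2000)
err = mse(real_y, one_nn_predict(syn_X, syn_y, real_X))
```

The closer the generator's distribution is to the real one, the closer this error gets to that of a regressor trained on the real data itself, which is the intuition behind the metric.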
Figure 2 shows the mean squared error attained by each regressor as a function of the privacy expenditure $\epsilon$ of the data generation approach. Horizontal lines denote that the approach is non-private. “Baseline” refers to training the regressor directly on the training dataset provided to the density estimator. Upon inspection, we find that DPNF generates data of convincingly higher quality than that of our comparison method DPMoG for all values of $\epsilon$. In addition, for higher values of $\epsilon$, DPNF converges to the quality achievable by a non-private MoG, whereas DPMoG hits an apparent plateau well before this point.
4.2.3 Qualitative Evaluations.
Figure 3 shows that DPNF provides a qualitative increase in sample quality under visualization. It presents dimension-wise histograms of synthetically generated features for three features of the Life Science dataset, using DPNF (left column) and DPMoG (right column) for comparison. (See Figure 11 in Appendix C for dimension-wise histograms of all 10 features.) Both methods were trained with the same values of $\epsilon$ and $\delta$. In every plot, the synthetic data in orange is superimposed over the real data in blue. We qualitatively see that for nearly all ten features, the distribution of data generated by DPNF closely matches that of the real data, while DPMoG was relatively unable to replicate regions of concentrated density for certain dimensions. This could be because, for a fixed number of components, the DPMoG model is constrained to cover the support of the distribution and must ignore nuanced details. Normalizing flow models, on the other hand, have heightened expressiveness over traditional statistical methods like Gaussian mixture models, and we see that they are able to capture these nuances more readily.
As another qualitative evaluation of sample visualization, Figure 4 shows the density of synthetic data generated by each model when projected to two-dimensional space via PCA, for varying values of $\epsilon$. The top row shows DPNF, the bottom row shows DPMoG, and the right figure shows the real data. In all plots, lighter pixels correspond to regions of higher density, and darker pixels indicate lower density. We see that DPNF is better able to capture some of the observable qualities exhibited in the real data, for example the gradual compression of density to the left of the distribution.
5 Application: Differentially Private Anomaly Detection
Our DPNF algorithm can be used as a tool for differentially private anomaly detection. Given a density estimator, a straightforward approach to anomaly detection is through a simple likelihood thresholding mechanism. For a given point, to determine whether it is in-distribution or out-of-distribution, we can simply return a binary value which denotes in-distribution if the log likelihood assigned to the point by the model is above some empirically derived threshold $\tau$, and out-of-distribution otherwise. For the purposes of our experiments, we assume that such a threshold is easily estimated, e.g., by selecting the value of $\tau$ which optimizes anomaly detection performance on a public test set. In the private setting, we can approach this task in a privacy-preserving manner by training either DPNF or DPMoG on the dataset. By the post-processing property of differential privacy, we can make arbitrarily many anomaly detection queries to the privately trained model while incurring no privacy loss beyond what is incurred during training.
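The thresholding rule itself is a one-liner once a density model is in hand. In the hedged sketch below, a fixed one-dimensional Gaussian stands in for the privately trained density model; by post-processing, thresholding its outputs incurs no additional privacy cost.

```python
import numpy as np

# Minimal sketch of the likelihood-thresholding rule: flag a point as
# out-of-distribution when the model's log likelihood falls below a
# threshold tau.  A fixed Gaussian stands in for the trained model.

def gaussian_log_density(x, mu=0.0, std=1.0):
    return -0.5 * ((x - mu) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def is_in_distribution(x, tau):
    """Return True when the model assigns log likelihood >= tau."""
    return bool(gaussian_log_density(x) >= tau)

tau = gaussian_log_density(2.0)   # e.g. calibrate tau at the 2-sigma point
flags = [is_in_distribution(x, tau) for x in (0.0, 1.5, 3.0)]
```

Sweeping the threshold over a range of values traces out exactly the ROC curves reported below.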
In Figure 5, we illustrate the efficacy of our approach in performing anomaly detection under this likelihood thresholding mechanism. We randomly generated points that were uniformly distributed around the tails of the test dataset, i.e., between the 5th and 30th percentiles and the 70th and 95th percentiles dimension-wise. The total number of synthetically generated anomalies was equal to the total number of test points. Figure 5 shows ROC curves of both private and non-private methods for this binary prediction problem. That is, it shows the trade-off between the true positive rate and the false positive rate in predicting in-distribution or out-of-distribution correctly for varying selections of the likelihood threshold $\tau$. We observe that DPNF outperformed the other private method DPMoG for the same privacy guarantee. Our approach performed comparably to a non-private MoG, and is of course upper-bounded by a non-private normalizing flow.

There are other possible methods for making anomaly detection algorithms differentially private, beyond the approach described above of using DPNF directly to privately train the model. An alternative approach is, rather than learning the parameters of the model in a differentially private manner, to partition the dataset into $k$ parts and train a non-private density estimator on each part. Then, given a new point of interest, each model casts a vote regarding its belief on the point being in-distribution or out-of-distribution by testing whether the density it assigns to the point exceeds $\tau$. We then aggregate these votes privately, to ensure that the final prediction is differentially private with respect to the training set. In our algorithm for differentially private anomaly detection (DPAD, Algorithm 2), we use the Exponential Mechanism [38] for this private aggregation. We note that our overall approach is an instantiation of the sample-and-aggregate framework of [40]. This approach is visualized in Figure 6 and presented formally in Algorithm 2.
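The private vote aggregation can be sketched as follows. Each detector votes a label, the utility of a label is its vote count (sensitivity 1, since changing one record affects at most one partition's model and hence one vote), and a label is sampled with probability proportional to $\exp(\epsilon \cdot \text{utility} / 2)$. This is a hedged illustration of the aggregation step, not the paper's Algorithm 2 verbatim.

```python
import numpy as np

# Hedged sketch of private vote aggregation with the Exponential
# Mechanism over the two labels, with vote count as the utility.

def exponential_mechanism(vote_counts, epsilon, rng):
    labels = list(vote_counts)
    logits = np.array([epsilon * vote_counts[l] / 2.0 for l in labels])
    probs = np.exp(logits - logits.max())    # stabilized softmax weights
    probs /= probs.sum()
    return labels[rng.choice(len(labels), p=probs)]

rng = np.random.default_rng(0)
votes = {"in-distribution": 9, "out-of-distribution": 1}
picks = [exponential_mechanism(votes, epsilon=2.0, rng=rng) for _ in range(1000)]
frac_in = picks.count("in-distribution") / 1000.0
```

With a lopsided vote, even a small $\epsilon$ returns the majority label with overwhelming probability, which is why the ensemble's accuracy degrades gracefully with the privacy budget.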
The privacy guarantees of DP-AD follow as an immediate corollary of those of sample-and-aggregate [40] and the Exponential Mechanism [38].
Theorem 2.
DP-AD is differentially private.
Although Algorithm 2 is instantiated with the Exponential Mechanism [38], one can opt for any differentially private aggregation method. In contexts requiring answers to a large but bounded number of queries, where the number of anomalies is expected to be small in comparison to the total number of queries (an appropriate assumption for many applications), an immediate extension of this approach using the Sparse Vector technique [18, 19] would be natural.
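A sketch of how such a Sparse Vector extension might look, using the standard noise scales from the Sparse (AboveThreshold) algorithm of [18, 19]. Queries are assumed to have sensitivity 1, and all names here are illustrative:

```python
import numpy as np

def sparse_vector(queries, threshold, epsilon, c, rng=None):
    """AboveThreshold / Sparse Vector sketch: answer a stream of
    sensitivity-1 queries against a threshold, paying privacy only for the at
    most `c` above-threshold answers. Noise scales follow the standard Sparse
    algorithm: threshold noise Lap(2c/eps), per-query noise Lap(4c/eps)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy_t = threshold + rng.laplace(scale=2.0 * c / epsilon)
    answers, count = [], 0
    for q in queries:
        if q + rng.laplace(scale=4.0 * c / epsilon) >= noisy_t:
            answers.append(True)
            count += 1
            if count >= c:          # budget for "above" answers exhausted
                break
        else:
            answers.append(False)
    return answers
```

In the anomaly detection setting, one would compare, say, negative log likelihoods against the threshold so that the rare "above" answers correspond to flagged anomalies, and the privacy cost scales only with the number of anomalies rather than the total number of queries.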
To evaluate this approach, we applied Algorithm 2 to the HalfMoons dataset, a synthetic dataset with 30,000 data points, each with two real-valued features. This dataset is visualized in Figure 7, and more details about it are given in Appendix C.
Figure 8 illustrates the data used and the results of this evaluation. The training data (dark purple) was partitioned into 10 pieces and used to fit a set of independent non-private models (all either MoG or NF). Then, anomalies (light purple) were added to the dataset, and both real and fake data were fed into the learned models. These data are illustrated in the top image of Figure 8. The Exponential Mechanism was used to perform differentially private aggregation to yield the final ensemble prediction when applied to held-out test points. The bottom image of Figure 8 presents the probability of a correct classification between "in-sample" and "out-of-sample" data as a function of the privacy parameter under this scheme. Both approaches were given the threshold that optimized their classification performance. Notice that using normalizing flows in this setting naturally yields a performance improvement over the existing method, due in part to the ability of normalizing flows to capture densities which do not adhere to a Gaussian distribution.
The benefit of this ensembled approach is that each individual model can be trained non-privately, which may substantially improve the quality of the learned models. For certain problems this may be necessary, for example in high-dimensional problems with image data, where differentially private optimization is known to degrade performance substantially. The associated downside is that the analyst's privacy budget now degrades as a function of the number of queries, whereas before the privacy budget was independent of the number of queries. The analyst would have to determine the application-specific trade-off between the costs and benefits of this approach, based on practical constraints imposed by the problem context.
6 Conclusion
Privacy is a subject of increasing importance and growing concern. Because our work adheres to the framework of differential privacy, one is able to make definitive statements regarding the privacy of participants involved in our analysis. Our results could also be used to enable differentially private synthetic data generation, which would allow data curators to provide privatized synthetic versions of their sensitive or protected datasets, thereby enabling broader access to these data.
In this work, we have demonstrated the efficacy of differentially private normalizing flow models as a novel approach to the task of privacy-preserving density estimation. We have shown the ability of these models to assign high likelihoods to holdout data and generate qualitatively realistic synthetic data, improving on existing state-of-the-art methods. Going forward, there exist several interesting directions for further development. For example, it remains to be seen how normalization layers such as activation normalization, whose parameters are likely disproportionately sensitive to perturbation during differentially private optimization, could be better adapted to this setting. Further, one might hypothesize that sampling via partitions of a shuffled dataset may yield improved results given the more regular sampling of each data point; associating such a sampling method with rigorous privacy guarantees, if possible, could yield empirical improvements. Finally, in this study we only considered a particular subset of the normalizing flows in existence. Many alternative neural density estimators capable of expressing highly discontinuous distributions are in active development, including FFJORD [24], Neural Spline Flows [13], Neural Autoregressive Flows [28], and Transformation Autoregressive Networks [42].
References
 Abadi et al. [2016] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM Conference on Computer and Communications Security, CCS ’16, pages 308–318, 2016.

BeaulieuJones et al. [2019]
Brett K. BeaulieuJones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P.
Bhavnani, James Brian Byrd, and Casey S. Greene.
Privacypreserving generative deep neural networks support clinical data sharing.
Circulation: Cardiovascular Quality and Outcomes, 12(7):e005122, 2019.  Bittner et al. [2018] D. Bittner, A. D. Sarwate, and R. Wright. Using noisy binary search for differentially private anomaly detection. In Proceedings of the 2nd International Symposium on Cyber Security Cryptography and Machine Learning (CSCML), volume 10879, pages 20–37, 2018.
 Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Bu et al. [2020]
Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J. Su.
Deep learning with gaussian differential privacy.
Harvard Data Science Review
, 2(3), 2020.  Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Proceedings of the 13th Conference on Theory of Cryptography, TCC ’16, pages 635–658, 2016.
 Chaudhuri and Vinterbo [2013] Kamalika Chaudhuri and Staal A Vinterbo. A stabilitybased validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems 26, NIPS ’13, pages 2652–2660, 2013.

 Chawla et al. [2005] Shuchi Chawla, Cynthia Dwork, Frank McSherry, and Kunal Talwar. On the utility of privacy-preserving histograms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI '05, 2005.
 Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation, 2014. arXiv preprint 1410.8516.
 Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations, ICLR '17, 2017.
 Dong et al. [2019] Jinshuo Dong, Aaron Roth, and Weijie J. Su. Gaussian differential privacy, 2019. arXiv preprint 1905.02383.
 Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, 2019.
 Dwork [2008] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation (TAMC), volume 4978 of Lecture Notes in Computer Science, pages 1–19. Springer Verlag, 2008.

Dwork and Lei [2009]
Cynthia Dwork and Jing Lei.
Differential privacy and robust statistics.
In
Proceedings of the 41st ACM Symposium on Theory of Computing
, STOC ’09, pages 371–380, 2009.  Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 2014.
 Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC ’06, pages 265–284, 2006.
 Dwork et al. [2009] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil P. Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In Proceedings of the 41st ACM Symposium on Theory of Computing, STOC ’09, pages 381–390, 2009.
 Dwork et al. [2010] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC ’10, pages 715–724, 2010.
 Fakoor et al. [2020] Rasool Fakoor, Pratik Chaudhari, Jonas Mueller, and Alexander J. Smola. Trade: Transformers for density estimation, 2020. arXiv preprint 2004.02441.
 Fan and Xiong [2013] Liyue Fan and Li Xiong. Differentially private anomaly detection with a case study on epidemic outbreak detection. In Proceedings of the IEEE 13th International Conference on Data Mining Workshops, pages 833–840, 2013.

 Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 881–889, 2015.
 Google [2018] Google. TensorFlow Privacy, 2018. URL https://github.com/tensorflow/privacy.
 Grathwohl et al. [2019] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In Proceedings of the International Conference on Learning Representations, ICLR '19, 2019.

 Gupta et al. [2010] Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '10, pages 1106–1125, 2010.
 Hall et al. [2013] Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(1):703–727, 2013.

 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV '15, pages 1026–1034, 2015.
 Huang et al. [2018] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2078–2087, 2018.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning  Volume 37, ICML ’15, pages 448–456, 2015.
 Izmailov et al. [2020] Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-supervised learning with normalizing flows. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4615–4630, 2020.
 Kamath et al. [2019] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private algorithms for learning mixtures of separated gaussians. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, 2019.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 2015 International Conference on Learning Representations, ICLR ’15, 2015.
 Kingma and Dhariwal [2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, NeurIPS ’18, 2018.
 Kingma et al. [2016] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, NIPS ’16, 2016.
 Liu and Talwar [2019] Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC '19, pages 298–309, 2019.
 McLachlan and Basford [1988] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988.
 McMahan et al. [2018] H. Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A general approach to adding differential privacy to iterative training procedures, 2018. arXiv preprint 1812.06210.
 McSherry and Talwar [2007] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, pages 94–103, 2007.
 Mironov [2017] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium, CSF ’17, pages 263–275, 2017.
 Nissim et al. [2007] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, STOC ’07, pages 75–84, 2007.

Okada et al. [2015]
Rina Okada, Kazuto Fukuchi, and Jun Sakuma.
Differentially private analysis of outliers.
In Proceedings of the 2015 European Conference on Machine Learning and Knowledge Discovery in Databases  Volume Part II, ECMLPKDD ’15, pages 458–473, 2015.  Oliva et al. [2018] Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, and Jeff Schneider. Transformation autoregressive networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3898–3907, 2018.
 Papamakarios [2019] George Papamakarios. Neural density estimation and likelihood-free inference, 2019. arXiv preprint 1910.13233.
 Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30, NIPS ’17, pages 2338–2347, 2017.
 Park et al. [2017] Mijung Park, James Foulds, Kamalika Chaudhuri, and Max Welling. DP-EM: Differentially Private Expectation Maximization. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 896–904, 2017.
 Salimans and Kingma [2016] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, NIPS ’16, 2016.
 Surendra and Mohan [2017] H. S. Surendra and H. S. Mohan. A review of synthetic data generation methods for privacy preserving data publishing. International Journal of Scientific & Technology Research, 6:95–101, 2017.
 Wang et al. [2019] Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS '19, pages 1226–1235, 2019.
 Xu et al. [2012] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In Proceedings of the IEEE 28th International Conference on Data Engineering, pages 32–43, 2012.
 Wu et al. [2016] Yuncheng Wu, Yao Wu, Hui Peng, Juru Zeng, Hong Chen, and Cuiping Li. Differentially private density estimation via Gaussian mixtures model. In Proceedings of the IEEE/ACM 24th International Symposium on Quality of Service (IWQoS), pages 1–6, 2016.
Appendix A Additional Privacy Preliminaries
A.1 Differential Privacy
We achieve differential privacy in Algorithm 1 through the Gaussian Mechanism, which adds mean-zero Gaussian noise to the value of a function evaluated on the data. The scale of the noise depends on the sensitivity of the function. The $\ell_2$-sensitivity of a function $f$ is denoted $\Delta_2 f$, and is the maximum change in the $\ell_2$ norm of $f$ if one entry in the database were to be changed. Formally, for neighboring databases $D, D'$, $\Delta_2 f = \max_{D, D'} \| f(D) - f(D') \|_2$. In our case, this function is the computation of the gradient given a sampled batch of data.
Theorem 3 (Gaussian Mechanism [16]).
Let $f : \mathcal{D} \to \mathbb{R}^k$ have $\ell_2$-sensitivity $\Delta_2 f$, and let $M(D) = f(D) + (Y_1, \ldots, Y_k)$, where the $Y_i$ are i.i.d. random variables drawn from $\mathcal{N}(0, \sigma^2)$ and $\sigma \geq \sqrt{2 \ln(1.25/\delta)} \cdot \Delta_2 f / \varepsilon$. Then $M$ is $(\varepsilon, \delta)$-differentially private.
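A minimal sketch of the Gaussian Mechanism with the classical calibration $\sigma \geq \sqrt{2\ln(1.25/\delta)}\,\Delta_2 f/\varepsilon$ (valid for $\varepsilon < 1$); function names here are illustrative:

```python
import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, epsilon, delta, rng=None):
    """Release f(D) with Gaussian noise calibrated to its l2-sensitivity.
    Uses the classical bound sigma >= sqrt(2 ln(1.25/delta)) * sens / eps,
    which yields (eps, delta)-DP for eps < 1."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return np.asarray(f_value, dtype=float) + rng.normal(scale=sigma, size=np.shape(f_value))
```

In DP-SGD-style training, `f_value` would be the (clipped, summed) per-example gradients of a batch, and `l2_sensitivity` would be the clipping norm.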
Differential privacy composes, meaning that the privacy guarantees degrade smoothly as more analyses are performed on the same dataset. The simplest version of privacy composition is that the $\varepsilon$'s and $\delta$'s "add up" across analyses. Tighter composition bounds are possible, including the approaches outlined in the following two subsections.
A.2 Moments Accountant
The moments accountant [1] was proposed initially as a means for tight composition of the privacy guarantees of DP-SGD. To characterize this analysis, for a mechanism $M$ with auxiliary input $\mathrm{aux}$ and neighboring datasets $D, D'$, we note the privacy loss associated with a given outcome $o$, given as $c(o; M, \mathrm{aux}, D, D') = \log \frac{\Pr[M(\mathrm{aux}, D) = o]}{\Pr[M(\mathrm{aux}, D') = o]}$. Further, for two given datasets we define the log moment generating function of this random variable evaluated at some value $\lambda$ as $\alpha_M(\lambda; \mathrm{aux}, D, D') = \log \mathbb{E}_{o \sim M(\mathrm{aux}, D)}\left[\exp\left(\lambda\, c(o; M, \mathrm{aux}, D, D')\right)\right]$. Finally, the "worst case" upper bound across all possible pairs of neighboring datasets is given as $\alpha_M(\lambda) = \max_{\mathrm{aux}, D, D'} \alpha_M(\lambda; \mathrm{aux}, D, D')$.
This notation can be used to give composition guarantees for the privacy parameters across multiple algorithms run on the same dataset.
Theorem 4 ([1]).
Suppose that an algorithm $M$ consists of a sequence of adaptive algorithms $M_1, \ldots, M_k$, where $M_i : \prod_{j=1}^{i-1} \mathcal{R}_j \times \mathcal{D} \to \mathcal{R}_i$. Then for any $\lambda$, $\alpha_M(\lambda) \leq \sum_{i=1}^{k} \alpha_{M_i}(\lambda)$. For any $\varepsilon > 0$, $M$ is $(\varepsilon, \delta)$-differentially private for $\delta = \min_\lambda \exp(\alpha_M(\lambda) - \lambda \varepsilon)$.
This analysis was later characterized under the framework of Rényi differential privacy (RDP) [39], a relaxation of differential privacy defined in a manner closely resembling the moments accountant analysis.
Definition 2 ((α, ε)-RDP [39]).
A randomized algorithm $M$ is said to have Rényi differential privacy of order $\alpha$ with parameter $\varepsilon$, or $(\alpha, \varepsilon)$-RDP for short, if for any adjacent $D, D'$ it holds that $D_\alpha(M(D) \,\|\, M(D')) \leq \varepsilon$, where $D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]$ is the Rényi divergence of order $\alpha$.
Finally, the privacy analysis performed in [1] regarding DP-SGD assumes the sampling is performed via Poisson subsampling, i.e., each individual example has an independent probability of being included in the batch at each iteration. In some contexts, fixed-size batches can enable a variety of performance improvements by allowing for compilation. Privacy analysis under uniform subsampling, where each batch is sampled uniformly across all possible batches of a fixed size, was considered in [48]:
Theorem 5 (Subsampled RDP [48]).
For all integers $\alpha \geq 2$, if $M$ is $(\alpha, \varepsilon(\alpha))$-RDP, then the randomized algorithm given by applying $M$ to a batch subsampled without replacement at sampling ratio $\gamma$ is $(\alpha, \varepsilon'(\alpha))$-RDP, where $\varepsilon'(\alpha)$ is bounded in terms of $\gamma$ and the moments of $M$; see [48] for the exact expression.
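In practice, an RDP guarantee obtained from such composition and subsampling results is converted back to an $(\varepsilon, \delta)$ guarantee. A sketch of the standard conversion from [39], an $(\alpha, \varepsilon)$-RDP mechanism being $(\varepsilon + \log(1/\delta)/(\alpha - 1), \delta)$-DP:

```python
import numpy as np

def rdp_to_dp(alpha, rdp_eps, delta):
    """Standard RDP-to-(eps, delta)-DP conversion [39]: an (alpha, eps)-RDP
    mechanism is (eps + log(1/delta)/(alpha - 1), delta)-DP."""
    return rdp_eps + np.log(1.0 / delta) / (alpha - 1.0)

def best_conversion(rdp_curve, delta):
    """Minimize the converted epsilon over orders, as accountants typically
    do. `rdp_curve` is a list of (alpha, eps) pairs."""
    return min(rdp_to_dp(a, e, delta) for a, e in rdp_curve)
```

Accountant implementations track the RDP curve at many orders and report the best resulting $\varepsilon$ for the target $\delta$.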
A.3 Gaussian Differential Privacy
Gaussian differential privacy (GDP) is a recently proposed relaxation of differential privacy established in [11], and further expanded upon in the context of deep learning in [5]. This definition exhibits several appealing properties, including simplified analysis under composition and subsampling and analytically tractable expressions for the privacy guarantees of NoisySGD, while providing a slightly tighter privacy bound than that which is achieved through analysis via the moments accountant. The framework of Gaussian differential privacy acts as the basis for our analysis.
We note that the Gaussian mechanism (Theorem 3) satisfies GDP [11]. Further detail concerning the privacy guarantees achieved when batches are subsampled is given in Section 2 of [5].
The overall privacy guarantee corresponding to the $k$-fold adaptive composition of mechanisms satisfying $\mu_1$-GDP, ..., $\mu_k$-GDP is $\sqrt{\mu_1^2 + \cdots + \mu_k^2}$-GDP. Finally, GDP allows for a conversion to a corresponding differential privacy guarantee using the fact that an algorithm is $\mu$-GDP if and only if it is $(\varepsilon, \delta(\varepsilon))$-differentially private for all $\varepsilon \geq 0$, where $\delta(\varepsilon) = \Phi\left(-\frac{\varepsilon}{\mu} + \frac{\mu}{2}\right) - e^{\varepsilon}\, \Phi\left(-\frac{\varepsilon}{\mu} - \frac{\mu}{2}\right)$ and $\Phi$ is the cumulative distribution function of the standard Normal distribution.
Figure 9 shows that GDP privacy accounting gives substantial improvements over the moments accountant method when used as the privacy accountant in DP-NF. For each number of iterations, GDP accounting yields a lower value of the privacy parameter $\varepsilon$.
A.4 DP-SGD
Appendix B DP-NF Extensions
B.1 Data-Dependent Initialization of Normalization Layers
Intermediate normalization layers such as batch normalization [29] and activation normalization [33] have been shown to improve the stability of normalizing flow models. In our context, batch normalization is incompatible with our approach, since batch statistics are shared when computing the forward pass of the layer, precluding the ability to calculate truly independent per-example gradients as required by NoisySGD. Activation normalization is more appropriate in our setting since no such batch statistics are calculated. Activation normalization is characterized by an offset and scaling of its inputs feature-wise by learned sets of parameters $b$ and $s$, i.e., $y = s \odot x + b$. In practice, these parameters are typically set via data-dependent initialization [46] by computing a forward pass on a sampled batch of data and setting $b$ and $s$ according to the per-feature means and standard deviations of the observed inputs, respectively, so that the initial outputs are standardized. Since these statistics are calculated directly from the data, this approach is not privacy-preserving.
One potential approach to making activation normalization differentially private is to privatize these statistics using the Laplace Mechanism [17]. This approach is outlined in Algorithm 4, where the batch values are first clipped to a fixed range (bounding the sensitivity of the subsequent statistics), the feature-wise means and standard deviations are computed and privatized, and the remaining parameters are set via some data-independent initialization method which maps standardized inputs to standardized outputs in expectation, e.g., He initialization [27].
We note that Algorithm 4 is far from the only approach for differentially private activation normalization. If the analyst has some domain knowledge about appropriate ranges of these parameters, she could use the differentially private Propose-Test-Release framework [15] to first normalize and then add noise proportional to her proposed (and tested) sensitivity. While this approach seems practical for the outer layers, it is unlikely that an analyst would have numerical intuition for appropriate parameter values in all layers (especially in deep models). Thus, even though the noise addition scheme described in Algorithm 4 may seem naive, it is likely the most practical approach.
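A sketch of what such a Laplace-based initialization might look like. The names are hypothetical, and the sensitivity bound used for the standard deviation is a crude assumption of this sketch rather than a claim about Algorithm 4:

```python
import numpy as np

def private_actnorm_init(batch, clip_c, epsilon, rng=None):
    """Laplace-mechanism sketch of data-dependent actnorm initialization.
    Values are clipped to [-C, C], so each per-feature mean has sensitivity
    2C/n; the mean and standard deviation are each privatized with half the
    budget. The std sensitivity bound below is a crude assumption."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = batch.shape
    x = np.clip(batch, -clip_c, clip_c)
    mean_sens = 2.0 * clip_c / n
    mu = x.mean(axis=0) + rng.laplace(scale=2.0 * mean_sens / epsilon, size=d)
    std_sens = 2.0 * clip_c / np.sqrt(n)     # assumed bound for this sketch
    sigma = x.std(axis=0) + rng.laplace(scale=2.0 * std_sens / epsilon, size=d)
    # Offset and scale so that the (privatized) batch is roughly standardized.
    b = -mu
    s = 1.0 / np.maximum(np.abs(sigma), 1e-3)
    return b, s
```

The floor on the noisy standard deviation guards against pathological scales when the Laplace noise pushes it near zero.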
Although the utility of activation normalization layers is quite evident, the original work [33] proposing such layers provided little evidence to support the idea that data-dependent initialization yielded statistically significant improvements over a default initialization scheme, i.e., $s = 1$ and $b = 0$. In our experiments, we observed little distinction in contexts where the input data was assumed to be standardized and parameters were initialized to maintain variance between layers. Despite this, we include the approach for completeness, for potential future contexts where data-dependent initialization of such parameters is deemed necessary.
B.2 DP-MoG as a Prior
Thus far it has been assumed that we use the spherical multivariate Gaussian distribution as the prior for our model. However, any distribution with a tractable density function could naturally act as such a prior. For example, simply extending the single standardized Gaussian to a mixture of Gaussians has been shown [44] to exhibit modest performance improvements. This mixture could also be fit to the data [30] a priori.
Hence, a natural extension to our proposed approach would be to first fit DP-MoG with privacy budget $\varepsilon_1$ to act as a prior, and then to refine this prior by training a sequence of non-linear bijective functions with privacy budget $\varepsilon_2$ to yield an encompassing normalizing flow model. This yields a worst-case $(\varepsilon_1 + \varepsilon_2)$-differentially private guarantee by sequential composition. This guarantee is easily improved by composing these privacy guarantees under some alternative privacy definition (e.g., RDP or GDP) before subsequently converting to a corresponding $(\varepsilon, \delta)$-DP guarantee. One might hypothesize that this approach would yield preferable results in contexts where the distribution at hand is composed of several discontinuous components, while exhibiting locally non-linear density within each component.
To capture this context, we evaluate the efficacy of this approach on the Pinwheel dataset, as illustrated in Figure 10 (further details of this dataset are given in Appendix C). The Pinwheel dataset is a common density estimation benchmark consisting of a number of disconnected components with non-linear boundaries. A Gaussian mixture model would naturally have difficulty approximating such a distribution for a small number of components, while classical normalizing flow models with a single standardized Gaussian as a prior might have difficulty expressing its discontinuous density.
As shown in Table 2, using a trained GMM as a prior can aid performance. First, we note that both DP-NF and DP-MoG demonstrate difficulty in achieving negative log likelihoods lower than 2.65–2.70, even when the number of components for DP-MoG is increased. DP-NF with DP-MoG as a prior modestly outperforms both alternatives used in isolation. Additionally, if a GMM prior of five components fit to the population can be assumed to be public, one achieves dramatic performance improvements over all methods.
Pinwheel

DP-NF (spherical Gaussian prior)
DP-NF (non-private GMM prior)
DP-NF (private GMM prior, ε = 0.2)
DP-MoG (5 components)
DP-MoG (10 components)
Appendix C Additional Results and Training Details
In this section, we provide additional results on the performance of DP-NF, compared to the baseline mechanism of DP-MoG, as evaluated on several real and synthetic datasets. These datasets are summarized in Table 3. Details of the real and synthetic datasets are given in Sections C.1 and C.2, respectively. Further details of hyperparameter tuning for all experiments are given in Section C.3.
Dataset | Type | Dimensions | Records
Life Science | Real | 10 | 26,733
Gowalla | Real | 2 | 100,000
Power | Real | 6 | 100,000
Gaussians | Synthetic | 2 | 30,000
HalfMoons | Synthetic | 2 | 30,000
Pinwheel | Synthetic | 2 | 30,000
C.1 Additional Results on Real Datasets
Our analysis aims to cover a range of both synthetic and real-world datasets. In this section, we provide short descriptions of each dataset used for evaluation, and provide results of DP-NF evaluated on these datasets.
Life Science. The Life Science dataset is a density estimation benchmark dataset from the UCI machine learning repository [12] used in our baseline [45] in their evaluation of DP-MoG. It contains 26,733 real-valued records of dimension 10 characterizing the principal components of measurements made in a variety of chemical and biological experiments.
Power. The Power dataset is a density estimation benchmark dataset from the UCI machine learning repository [12] used in much of the normalizing flow literature [44, 24]. It contains measurements of electric power consumption in a household over a period of 47 months, and was preprocessed according to the description given in [44].
Gowalla. The Gowalla dataset contains the locations, in terms of longitude and latitude, of the social network's users' check-ins. The total number of points is 1,256,384, which was reduced to 100,000 via a random sample. It was used in the evaluation of our baseline [45], but applied to the task of $k$-means clustering rather than learning the components of a Gaussian mixture model.
We evaluated the performance of DP-NF and the baseline DP-MoG for comparison on these real-world datasets. Results on the Life Science dataset are given in the body of the paper in Table 1. Results on the Power and Gowalla datasets are given in Table 4 below.
Power

DP-NF (GDP)
DP-NF (MA)
DP-MoG (10 components, MA)
DP-MoG (10 components, zCDP)

Gowalla

DP-NF (GDP)
DP-NF (MA)
DP-MoG (11 components, MA)
DP-MoG (11 components, zCDP)
We also provide Figure 11, which is the full version of Figure 3 in Section 4. Figure 11 provides dimension-wise histograms of the synthetically generated data for all ten dimensions of the Life Science dataset, presented in numerical order by axis index from left to right. The top two rows correspond to our baseline, and the bottom two rows correspond to our approach. For each image, we visualize the synthetically generated data (given in orange) and superimpose it over real data (given in blue) for comparison. We observe that for nearly all ten features, the distribution of data generated by DP-NF closely resembled that of the real data, while DP-MoG was unable to replicate regions of concentrated density for certain dimensions.
C.2 Additional Results on Synthetic Datasets
We also evaluate our DP-NF method on several synthetic datasets. We do this on the HalfMoons dataset (Figure 7), as well as a synthetically constructed mixture of 8 Gaussians (Figure 12). The former demonstrates the heightened expressiveness of our approach as compared to a Gaussian mixture approach, while the latter provides a worst-case scenario in which the data is truly generated by a mixture of Gaussians and DP-MoG would be expected to outperform our method. Results are presented in Table 5.
HalfMoons

DP-NF (GDP)
DP-NF (MA)
DP-MoG (3 components, MA)
DP-MoG (3 components, zCDP)

Gaussians

DP-NF (GDP)
DP-NF (MA)
DP-MoG (8 components, MA)
DP-MoG (8 components, zCDP)
C.3 Hyperparameters and Training Details
In this subsection we detail decisions about hyperparameter selection in training, including the gradient clipping parameter $C$, regularization of the loss function, and choice of privacy accountants.
For the gradient clipping parameter $C$, prior work [1] suggested that a reasonable heuristic is to set $C$ equal to the median of the norms of the unclipped gradients observed over the course of a non-private training execution. In the context of normalizing flows, we found that much larger values of $C$ yielded significantly preferable results. A natural explanation arises from considering log likelihood as an objective. In cases where a given point is assigned near-zero density, a large gradient update would be incurred to prevent further deterioration. When this gradient update is clipped, the resulting update may be insufficient to avoid the associated numerical instability if this point is assigned density further approaching zero. Despite excess noise being applied to updates with larger $C$, we found this to merely prolong training without a significant degradation in resulting model quality.

With respect to regularization, [44] suggested a modest amount of regularization in the context of non-private normalizing flows. We found this regularization approach to substantially degrade the quality of the resulting models, and it was generally omitted in our training. This makes intuitive sense: the suggested regularization of [44] serves to decrease model weights over the course of training, while differentially private optimization applies Gaussian noise vectors of constant variance to gradients throughout training. As model weights tend toward zero, one would expect the noise injected for privacy to eventually dominate the learned criteria of the model.
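The median-norm heuristic of [1] and the standard per-example clipping step it calibrates can be sketched as follows (illustrative, not the paper's training code):

```python
import numpy as np

def median_norm_clip(per_example_grads):
    """Heuristic from [1]: set the clipping parameter C to the median l2
    norm of unclipped per-example gradients from a non-private run; as noted
    above, larger values may work better for normalizing flows."""
    flat = per_example_grads.reshape(len(per_example_grads), -1)
    return float(np.median(np.linalg.norm(flat, axis=1)))

def clip_gradients(per_example_grads, C):
    """Standard DP-SGD clipping: scale each per-example gradient g by
    min(1, C / ||g||_2)."""
    flat = per_example_grads.reshape(len(per_example_grads), -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    factors = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return (flat * factors).reshape(per_example_grads.shape)
```

The clipped per-example gradients would then be summed and noised with the Gaussian mechanism before the optimizer step.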
For privacy accountants, we found that composition under Gaussian differential privacy (GDP) [11] consistently yielded the tightest privacy bounds throughout our experiments. See Figure 9 in Appendix A.3 for an illustration of the improvements in privacy composition that can be achieved by GDP over the moments accountant of [1]. Since our baseline method DP-MoG yielded its best performance under the moments accountant in [45], we included both GDP composition and the moments accountant for fair comparison. We found that DP-NF consistently outperformed DP-MoG even when both methods used the moments accountant, and that further performance improvements could be achieved by DP-NF using GDP composition as the privacy accountant.
Appendix D Limitations of Proposed Approach
Normalizing flow models are trained in a manner which minimizes the average negative log likelihood of the observed data. As such, it is not uncommon when training such models that a given point is assigned near-zero density, provoking a loss approaching infinity. We observed that this issue was somewhat exacerbated by differentially private optimization, in part due to noise injection, but primarily as a result of subsampling. In the case of uniform subsampling, as required in our privacy analysis, there is no strict guarantee that any individual point is regularly sampled. This is in contrast to typical sampling methodology, in which the dataset is repeatedly shuffled and partitioned into equal-sized batches over the course of an epoch. Rigorous privacy guarantees associated with equal-sized, disjoint sampling are, to the best of our knowledge, a currently unsettled issue [37] and a potential avenue for improvement given its theoretical convergence guarantees. We did not find this limitation to be ultimately prohibitive.

Additionally, given that DP-NF is ultimately a deep learning based approach to density estimation, it naturally involves the optimization of a large number of parameters and a high resource expenditure in terms of time and space complexity. This is especially highlighted in comparison to the DP-MoG baseline, which takes on the order of seconds to run on a CPU, whereas our method can take on the order of an hour on a GPU. However, we find this trade-off between resource expenditure and distribution quality to be justified, particularly in the context of differentially private data analysis and social science applications, which rarely demand strict resource constraints within reason.