Differentially Private Normalizing Flows for Privacy-Preserving Density Estimation

03/25/2021 · by Chris Waites, et al. · Columbia University · Stanford University

Normalizing flow models have risen as a popular solution to the problem of density estimation, enabling high-quality synthetic data generation as well as exact probability density evaluation. However, in contexts where individuals are directly associated with the training data, releasing such a model raises privacy concerns. In this work, we propose the use of normalizing flow models that provide explicit differential privacy guarantees as a novel approach to the problem of privacy-preserving density estimation. We evaluate the efficacy of our approach empirically using benchmark datasets, and we demonstrate that our method substantially outperforms previous state-of-the-art approaches. We additionally show how our algorithm can be applied to the task of differentially private anomaly detection.


1 Introduction

The task of density estimation requires constructing an estimate of an unknown probability density function, given observed data. This density estimate can then be used to perform a variety of relevant analysis tasks, including log likelihood evaluation and synthetic data generation. In settings involving sensitive data, the construction and subsequent release of such an estimate could potentially leak private information. Without a rigorous privacy guarantee, nothing prevents a model from memorizing a row in the training set, assigning disproportionate density to a single point, or exhibiting any other vulnerability that arises from arbitrary analysis of the learned parameters. Since density estimation remains a task of interest to the modeling community, continued attention is required to develop privacy-preserving methods for it.

Differential privacy [17] has emerged as the predominant privacy notion in the context of statistical data analysis. At a high level, differentially private analyses limit the extent to which the distribution of outputs can change due to the inclusion or exclusion of any one individual from the analysis. Algorithms which adhere to this notion exhibit a number of desirable properties, including privacy guarantees which hold regardless of the auxiliary information an adversary may have and composition of privacy guarantees across multiple analyses. Hence differential privacy acts as a compelling gold standard in the design of privacy-preserving analyses.

Tools for density estimation have held longstanding interest due to their versatility. Their ability to address a wide range of distributional learning tasks is precisely what makes an accurate, privacy-preserving density estimator so valuable. For example, privately constructing such a model implicitly yields a differentially private approach to anomaly detection (a task of substantial investigation [3, 41, 21]) as an immediate application of likelihood inference. In addition, given that density estimators often enable efficient sampling, such a model would yield a method for privacy-preserving synthetic data generation. This task in particular has been of longstanding interest to the privacy community [47], as it addresses many of the limitations imposed by the query-release model [14] by allowing large numbers of arbitrary analyses. Privately generating a synthetic dataset incurs only a fixed privacy cost during the training process; all subsequent queries on the synthetic data are automatically differentially private due to the privacy notion's post-processing guarantee, so the privacy cost does not scale with the number of downstream analyses performed.

Normalizing flow models are an attractive approach to the task of density estimation due to their empirical ability to approximate arbitrary, high-dimensional distributions. These models approach the task of density estimation via a transformation on a chosen base density by a sequence of invertible, non-linear transformations, enabling density querying on the resulting distribution via an application of the change-of-variables formula. Approaches to density estimation in this manner include: Non-linear Independent Components Estimation (NICE) [9], Real NVP [10], Glow [33], and Masked Autoregressive Flows (MAF) [44]. Until this work, it was an open question whether normalizing flow models could be constructed in a differentially private manner to handle the task of privacy-preserving density estimation, combining the rigorous guarantees of differential privacy with the strong empirical performance exhibited by normalizing flows.

In this work we propose the use of normalizing flow models trained in a differentially private manner as a novel approach to the task of privacy-preserving density estimation. We provide an algorithm (DP-NF, Algorithm 1 in Section 3) that privately optimizes the model parameters via gradient descent using DP-SGD [1], which adds Gaussian noise to clipped gradient updates to ensure differential privacy. Additionally, we achieve tighter privacy guarantees than established in previous work [1] via composition with the recently introduced notion of Gaussian differential privacy [11]. We apply this optimization to the parameters of a Masked Autoregressive Flow [44], our primary architecture of consideration, and achieve empirical results (Section 4) which convincingly outperform previous approaches. Further, we show that our algorithm can be applied to solve the problem of differentially private anomaly detection (Section 5), and that it leads to better true/false positive rates than existing private methods.

1.1 Related Work

Gaussian mixture models (GMMs) are known to be a particularly strong density estimation tool [43] since they are universal approximators of densities; that is, they are able to approximate any density function arbitrarily well given a sufficient number of components [36]. They approach the task of density estimation by modeling the data distribution as a weighted sum of Gaussian distributions. The first differentially private algorithm for learning the parameters of a Gaussian mixture model comes from the work of [40], which uses their sample-and-aggregate framework to convert non-private algorithms into private algorithms, applied to the task of learning mixtures of Gaussians. However, their approach exhibits strong assumptions on the range of the parameter space and assumes a uniform mixture of spherical Gaussians. Follow-up work of [31] proposes a modernized approach which improves upon the sample complexity of the aforementioned work and removes the strong a priori bounds on the parameters of the mixture components, although it makes the assumption that the components of the mixture are well-separated.

There has also been work in learning the parameters of a Gaussian mixture model through differentially private variants of the expectation maximization (EM) algorithm. One notable instance of this is DPGMM [50], which achieves a privacy guarantee at each iteration of EM through the application of calibrated Laplace noise to the estimated model parameters following each maximization step. These individual privacy guarantees are then combined into an overall privacy guarantee via sequential composition, i.e., by taking the sum of privacy parameters in each iteration. The work of [45] introduces DP-EM, a general framework for privacy-preserving optimization via expectation maximization. Their approach follows a conceptually similar idea of applying either calibrated Laplace or Gaussian noise to the model parameters at the end of each EM iteration. They apply this method to learning mixtures of Gaussians, henceforth referred to as DP-MoG, and they demonstrate significantly better privacy guarantees through composition via the moments accountant and zero-concentrated differential privacy (zCDP) [6]. Given that their work makes no notable assumptions about the task and provides an empirical evaluation of their method, this is the most comparable approach to our own. As such, it is used as a baseline in our experimental results.

In addition, we take note of more classical approaches to the task of privacy-preserving density estimation. One of the simplest yet most widely used methods for density estimation is through the use of histograms, and previous work [8, 49] has investigated their private estimation. Unfortunately, such an approach scales poorly with the dimension and complexity of the distribution while asserting an unrealistic discretization of the space. Kernel density estimation is another closely related approach, often characterized as the smooth analog to the classical discrete histogram. The work of [26] proposes a method for privately querying the density of such an estimator through the addition of calibrated Gaussian noise. As a non-parametric approach, it has the drawback that it requires storage of the entire dataset at test time to enable querying (proving impractical for large-scale datasets) while still degrading similarly with dimension.

There have also been a number of deep learning based approaches to generative modeling which vary in their relevance. Although work of this nature technically allows for both sampling and likelihood evaluation, it does not allow for exact likelihood inference as is the case for mixtures of Gaussians and normalizing flows. There is also expansive literature concerning differentially private approaches to training Generative Adversarial Networks, yet these methods are strictly limited to sampling and do not provide a straightforward approach to likelihood inference.

Finally, we include a brief overview of the extensive literature concerning density estimation via normalizing flows. One important subset are those characterized by coupling layers: transformations which partition the dimensions of their input and map them in a way that retains invertibility and a tractable Jacobian. This includes Non-linear Independent Components Estimation (NICE) [9], as well as its subsequent generalization Real NVP [10]. Another notable approach, Glow [33], makes use of such coupling layers while also proposing the use of an invertible weight matrix decomposition to generalize the notion of permutation layers. Alternatively, some models make use of autoregressive transformations, which utilize the chain rule of probability to represent a joint distribution as a product of its conditionals. Such models include Masked Autoregressive Flow (MAF) [44], a generalization of Real NVP optimized for density estimation, as well as the closely related Inverse Autoregressive Flow [34] optimized for variational inference, among others [42, 28, 20].

2 Preliminaries

2.1 Normalizing Flows

Let $p^*(x)$ be the probability density function characterizing an unobservable distribution of interest, and let $X = \{x_1, \dots, x_n\}$ be observed i.i.d. samples from this distribution. The task of density estimation is to find an approximation $p_\theta$ of $p^*$ via some model given $X$. In the context of normalizing flows, this model is characterized by a prior distribution $\pi(z)$, chosen to exhibit a simple and tractable density (e.g., the spherical multivariate Gaussian distribution), and a sequence of bijective functions composed as $f = f_K \circ \dots \circ f_1$, fully parameterized by $\theta$. The function $f$ acts as a transformation between the prior distribution $\pi(z)$ and the approximated distribution $p_\theta(x)$.

Given such a model, it can be used to efficiently sample by first sampling $z \sim \pi(z)$ and then transforming the sample as $x = f(z)$. If $p_\theta$ is a good approximation of $p^*$, then this generative process gives an efficient (approximate) oracle for sampling from the unknown distribution.

Since $f$ is invertible, one can also perform exact likelihood evaluation on observed points from the data distribution via the change of variables formula, as follows:

$$\log p_\theta(x) = \log \pi\big(f^{-1}(x)\big) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|.$$

Finding a good approximation is achieved through optimization of $\theta$ to minimize the negative log likelihood of the observed dataset:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_\theta(x_i). \qquad (1)$$

In practice, one will typically find the MLE $\hat{\theta} = \arg\min_\theta \mathcal{L}(\theta)$ using some non-convex optimization method, such as stochastic gradient descent.
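As a concrete illustration of this machinery, the following minimal sketch (our own toy example in JAX, not the MAF architecture used later in the paper) implements a single element-wise affine bijection with a standard Gaussian prior; it supports sampling by pushing prior draws through $f$, exact log-likelihood evaluation via the change of variables formula, and training by minimizing the negative log likelihood of Equation (1).

```python
# Toy normalizing flow: x = f(z) = z * exp(s) + t with a standard Gaussian prior.
# Illustrates sampling, exact log-likelihood via change of variables, and the
# negative log-likelihood objective of Equation (1). Not the paper's MAF model.
import jax
import jax.numpy as jnp

def sample(params, key, n, dim):
    """Draw z ~ N(0, I) and push it through f."""
    s, t = params
    z = jax.random.normal(key, (n, dim))
    return z * jnp.exp(s) + t                          # x = f(z)

def log_prob(params, x):
    """log p_theta(x) = log pi(f^{-1}(x)) + log |det d f^{-1}(x) / dx|."""
    s, t = params
    z = (x - t) * jnp.exp(-s)                          # f^{-1}(x)
    log_pi = -0.5 * jnp.sum(z ** 2 + jnp.log(2 * jnp.pi), axis=-1)
    log_det = -jnp.sum(s)                              # Jacobian of f^{-1} is diag(exp(-s))
    return log_pi + log_det

def nll_loss(params, batch):
    """Equation (1), normalized by the batch size."""
    return -jnp.mean(log_prob(params, batch))

# Fit the scale/shift parameters to toy data with plain gradient descent.
key = jax.random.PRNGKey(0)
data = 2.0 + 0.5 * jax.random.normal(key, (1024, 3))
params = (jnp.zeros(3), jnp.zeros(3))
grad_fn = jax.jit(jax.grad(nll_loss))
for _ in range(200):
    g = grad_fn(params, data)
    params = tuple(p - 0.05 * gi for p, gi in zip(params, g))
synthetic = sample(params, jax.random.PRNGKey(1), 5, 3)   # 5 synthetic samples
```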

2.2 Differential Privacy

Differential privacy [17] has become the gold standard for ensuring the privacy of statistical analyses applied to sensitive databases. At a high level, it ensures that changing a single entry in the database will have only a small effect on the distribution of analysis results.

Definition 1 ([17]).

A randomized algorithm $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy (DP) if for any two input databases $D, D'$ that differ in a single entry and for any subset of outputs $S \subseteq \mathrm{Range}(\mathcal{M})$, it satisfies
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta.$$

One common algorithmic approach for achieving differential privacy is adding noise that scales with the sensitivity of the function being evaluated, which is the maximum change in the function’s value that can result from changing a single data point. Differentially private algorithms are robust to post-processing, meaning that any data-independent function of a differentially private output retains the same privacy guarantee, and they enjoy composition, meaning that the privacy parameters degrade gracefully as additional analyses are performed on the dataset. The simplest version of composition is that the privacy parameters $\varepsilon$ and $\delta$ “add up” over multiple analyses, although stronger versions of composition are also used.

Differentially Private Stochastic Gradient Descent (DP-SGD, presented formally in Algorithm 3 in Appendix A.4) was introduced by [1] as a method for private non-convex optimization. At each step $t$, DP-SGD subsamples a small set of data points (the original algorithm of [1] does this via Poisson subsampling, but it can also be done via uniform subsampling while retaining a privacy guarantee [48]) and uses this batch to compute a gradient update. To achieve a differential privacy guarantee, DP-SGD adds mean-zero Gaussian noise to the average of the per-example gradients. The standard deviation of this noise is scaled with the sensitivity of the gradient estimation. Since this sensitivity is a priori unbounded, the per-example gradients are first clipped to ensure that their $\ell_2$-norm is at most some input parameter $C$, thus bounding the sensitivity, and noise scaled with $C$ is then added.
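A minimal sketch of this clip-and-noise step is shown below; for simplicity it assumes the model parameters form a single flat vector, and the function and argument names are ours rather than notation from [1].

```python
# One DP-SGD gradient estimate: per-example gradients, clipped to l2 norm at most C,
# summed, perturbed with N(0, sigma^2 C^2 I) noise, and averaged over the batch.
import jax
import jax.numpy as jnp

def noisy_clipped_gradient(loss_fn, params, batch, key, C, sigma):
    # Per-example gradients g_i, computed by vmapping the single-example gradient.
    per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0))(params, batch)

    # Clip: rescale each g_i so that ||g_i||_2 <= C, bounding the sensitivity.
    norms = jnp.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / jnp.maximum(1.0, norms / C)

    # Average of clipped gradients plus Gaussian noise scaled to the sensitivity C.
    noise = sigma * C * jax.random.normal(key, params.shape)
    return (jnp.sum(clipped, axis=0) + noise) / batch.shape[0]
```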

[1] also introduced the moments accountant, which provides tight privacy composition across multiple gradient update steps in DP-SGD. To describe the moments accountant, given an algorithm $\mathcal{M}$ and two neighboring datasets $D, D'$, first we denote the privacy loss of a particular outcome $o$ as $c(o) = \log \frac{\Pr[\mathcal{M}(D) = o]}{\Pr[\mathcal{M}(D') = o]}$. The moments accountant calculates a privacy budget by bounding the moments of the privacy loss random variable $c(\mathcal{M}(D))$. That is, if we consider the log of the moment generating function (MGF) of the privacy loss random variable evaluated at $\lambda$, i.e., $\alpha_{\mathcal{M}}(\lambda) = \max_{D, D'} \log \mathbb{E}\big[\exp\big(\lambda\, c(\mathcal{M}(D))\big)\big]$, the worst case over all neighboring databases composes linearly across multiple mechanisms (Theorem 2.1 of [1]) and allows for conversion to an associated $(\varepsilon, \delta)$-differential privacy guarantee through the relation $\delta = \min_{\lambda} \exp\big(\alpha_{\mathcal{M}}(\lambda) - \lambda \varepsilon\big)$. Follow-up work of [5] introduced NoisySGD, which followed the same algorithmic structure but analyzed privacy composition under Gaussian differential privacy [11]. For the purpose of this work it is sufficient to simply note the associated benefits of analysis under Gaussian differential privacy: it naturally lends itself to composition under subsampling and allows for analytically tractable expressions of the privacy guarantees of NoisySGD, while providing a slightly tighter overall privacy bound than that achieved by the moments accountant. Further details are provided in Appendix A.

3 Differentially Private Normalizing Flows

In this section we introduce our algorithm for differentially private density estimation via normalizing flows, DP-NF, presented in Algorithm 1. It is based on the DP-SGD algorithm of [1], a differentially private method for performing stochastic gradient descent. We also briefly discuss performance improvements using data-dependent initialization of normalization layers and using a differentially private estimate of the distribution to act as a prior, both of which are explored further in Appendix B. We emphasize that our primary technical contribution is not in the design of these algorithms, but rather the novel application of these tools to the problem of differentially private density estimation in a way that yields substantial performance improvements over prior work, as demonstrated by our empirical results in Section 4.

3.1 DP-NF Algorithm

Training a normalizing flow model corresponds to minimizing the loss function in Equation (1): $\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_\theta(x_i)$. This loss function is non-convex when applied to the optimization of a non-linear normalizing flow model, and hence optimization is typically performed via gradient descent on $\theta$. To make this training private in Algorithm 1, we update $\theta$ using the DP-SGD algorithm of [1] described in Section 2.2, with some subtle yet important augmentations to the standard minibatch gradient descent process to allow for an explicit privacy guarantee, in accordance with DP-SGD.

1:  Input: Dataset $X = \{x_1, \dots, x_n\}$, initialized parameters $\theta_0$, learning rate $\eta$, batch size $b$, noise scale $\sigma$, upper-bound on norm of per-example gradient $C$, training privacy budget $\varepsilon$, training privacy tolerance $\delta$, privacy accountant $\mathcal{A}$.
2:  $t \leftarrow 0$
3:  while $\mathcal{A}(t+1, b/n, \sigma, C, \delta) \le \varepsilon$ do
4:     Take a uniformly random subsample $B_t \subseteq X$ with batch size $b$.
5:     for $x_i \in B_t$ do
6:        $g_t(x_i) \leftarrow \nabla_{\theta_t}\big(-\log p_{\theta_t}(x_i)\big)$
7:        $\bar{g}_t(x_i) \leftarrow g_t(x_i) / \max\big(1, \|g_t(x_i)\|_2 / C\big)$
8:     end for
9:     $\tilde{g}_t \leftarrow \frac{1}{b}\Big(\sum_{x_i \in B_t} \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\Big)$
10:     $\theta_{t+1} \leftarrow \theta_t - \eta\, \tilde{g}_t$;  $t \leftarrow t + 1$
11:  end while
12:  Output $\theta_t$
Algorithm 1 DP-NF, Differentially private density estimation via normalizing flows

First, batches are sampled via uniform subsampling (Line 4). That is, each possible batch of size $b$ has equal likelihood of being chosen (as opposed to repeatedly shuffling the dataset and taking equally sized partitions of the dataset, which is often preferred in practice). Second, rather than computing the gradient with respect to the entire batch, the gradient with respect to each individual data point is calculated, clipped to have maximum norm $C$, averaged, and then perturbed with a randomly sampled Gaussian noise vector (Lines 6-9).

Algorithm 1 also requires a privacy accountant $\mathcal{A}$ to be specified as input. This privacy accountant dynamically tracks the privacy loss incurred by composition over all gradient update steps as a function of the training parameters, and halts the algorithm once a pre-specified budget is reached. A privacy accountant takes in the round $t$ of training, the sampling probability $b/n$ of a single point (here a batch of size $b$ is sampled uniformly from a set of $n$ data points), the noise scale $\sigma$ that is added to preserve privacy, the bound $C$ on the norm of each gradient, and the privacy parameter $\delta$. At every time step, the privacy accountant maintains the current privacy budget that has been expended up to round $t$ given the input parameters. Common choices for this accountant include the moments accountant (MA) [1] or composition via Gaussian differential privacy (GDP) [11]. In our experiments in Section 4, we yield preferable results using a GDP privacy accountant.

In summary, DP-NF in Algorithm 1 is a modified version of DP-SGD, instantiated to train a normalizing flow model with the analyst’s choice of privacy accountant.
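To make the control flow of Algorithm 1 concrete, the schematic loop below repeats noisy clipped gradient steps (reusing the `noisy_clipped_gradient` sketch from Section 2.2 and again assuming a flat parameter vector) until a caller-supplied accountant reports that the next step would exceed the budget; `log_prob_fn` and `accountant_eps` are hypothetical stand-ins, not APIs from the paper or from TensorFlow Privacy.

```python
# Schematic of the Algorithm 1 control flow. `log_prob_fn(params, x)` is the model's
# per-example log-likelihood and `accountant_eps(steps, sample_rate, sigma)` returns
# the privacy loss spent after that many steps; both are illustrative placeholders.
import jax
import jax.numpy as jnp

def train_dp_nf(params, data, log_prob_fn, accountant_eps, key, *,
                lr=1e-3, batch_size=256, C=1.0, sigma=1.1, eps_budget=4.0):
    nll = lambda p, x: -log_prob_fn(p, x)              # per-example loss (Equation (1))
    n, t = data.shape[0], 0
    while accountant_eps(t + 1, batch_size / n, sigma) <= eps_budget:   # halting rule
        key, k1, k2 = jax.random.split(key, 3)
        idx = jax.random.choice(k1, n, shape=(batch_size,), replace=False)
        batch = data[idx]                               # uniform subsample (Line 4)
        g = noisy_clipped_gradient(nll, params, batch, k2, C, sigma)    # Lines 5-9
        params = params - lr * g                        # Line 10
        t += 1
    return params
```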

The privacy guarantees of DP-NF follow as an immediate corollary from those of DP-SGD [1] when instantiated with the moments accountant, and from NoisySGD [5] when instantiated with the Gaussian differential privacy accountant.

Theorem 1.

DP-NF is $(\varepsilon, \delta)$-differentially private.

3.2 DP-NF Extensions

In practice, one will find that many deep learning models (including the normalizing flow models used in our experiments) are much better optimized using adaptive learning rate optimization schemes. Given this, we found significant benefit in using a direct extension of DP-SGD which applies the noisy gradients to the model according to the Adam [32] optimizer. Both methods achieve identical privacy guarantees, since the first and second moment estimates used by Adam are merely deterministic, data-independent functions of the noisy gradients. Thus they differ only in the post-processing of the noisy gradients, and the privacy guarantees are unchanged.
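As an illustration, the privatized gradient can simply be handed to an off-the-shelf Adam implementation; the sketch below uses optax, one possible library choice, and is not necessarily the implementation used in our experiments.

```python
# Post-processing the noisy gradient with Adam: the optimizer only ever sees the
# already-privatized gradient, so the privacy guarantee of each step is unchanged.
import optax

optimizer = optax.adam(learning_rate=1e-3)   # b1=0.9, b2=0.999 by default

def adam_step(params, opt_state, noisy_grad):
    updates, opt_state = optimizer.update(noisy_grad, opt_state, params)
    return optax.apply_updates(params, updates), opt_state

# Usage: opt_state = optimizer.init(params), then replace the plain descent step
# `params = params - lr * g` with `params, opt_state = adam_step(params, opt_state, g)`.
```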

Two further extensions of Algorithm 1 are proposed below, which may provide substantial improvements to empirical performance.

Data-Dependent Initialization of Normalization Layers. Intermediate normalization layers such as activation normalization [33] have been proposed as a means to improve the stability of normalizing flow models. Activation normalization is characterized by a feature-wise offset and scaling of inputs by a learned set of parameters $t$ and $s$, i.e., $y = s \odot x + t$. In practice, these parameters are typically set via data-dependent initialization [46], using the per-feature means and standard deviations observed throughout a forward pass of a sampled batch of data. These parameters can also be estimated privately, e.g., by applying the Laplace Mechanism [17] to the clipped mean and standard deviation, thus allowing for data-dependent initialization of these normalization layers. For more details, see Appendix B.1.

Differentially Private Data-Dependent Priors. Section 2.1 suggested the analyst choose a data-independent prior $\pi(z)$, such as the multivariate spherical Gaussian. However, recent work suggests that modest improvements in empirical results can be achieved through the use of more complex priors, such as a mixture of Gaussians [44], or by fitting a Gaussian mixture model to the data [30]. A natural privacy-preserving approach would be to first use DP-MoG [45] with privacy budget $\varepsilon_1$ to estimate a prior, and then refine the prior using DP-NF with privacy budget $\varepsilon_2$ to yield an encompassing normalizing flow model. By composition, this process would be $(\varepsilon_1 + \varepsilon_2)$-differentially private (with the corresponding $\delta$ terms adding as well), and may yield preferable results in settings where the distribution is highly discontinuous, but also locally non-linear. For more details, see Appendix B.2.

4 Experimental Results

In this section we present experimental results demonstrating the empirical performance of our approach, evaluating our algorithm on a variety of real and synthetic datasets across varying tasks. In the main body we focus our evaluation on a single dataset (the Life Science dataset [12], described next), and we refer to Appendix C for all additional results on other real and synthetic datasets.

In all our experiments on the Life Science dataset, we used a fixed value of $\delta$. Our baseline method for comparison [45] used a larger value of $\delta$, which on this dataset is typically deemed unacceptably large in the privacy community. Instead, our choice of $\delta$ is sublinear in the size of the database. Smaller values of $\delta$ would not change our qualitative results, nor would they substantially change our quantitative results.

4.1 Datasets, Implementation, & Setup

The Life Science dataset is a standard density estimation benchmark dataset from the UCI machine learning repository [12] containing 26,733 real-valued records of dimension 10. This dataset was used in the original evaluation of our baseline model [45]. Results using additional datasets are presented in Appendix C.

Experiments were run on a machine with 2 CPUs, 13 GB RAM, and a single NVIDIA Tesla K80 GPU, and took on the order of half an hour to five hours to run in wall-time, depending on the number of iterations and the dimensionality of the dataset. Models were implemented in the JAX [4] deep learning framework, and used privacy accounting implementations from TensorFlow Privacy [23].

Hyperparameter Search and Model Selection.

Reported privacy budgets in our results correspond only to the training of each model, and do not include privacy loss from hyperparameter search and model selection. We chose not to select hyperparameters in a privacy-preserving manner because this was not the focus of our contribution and because it was not done in our baseline method. (These steps could be done privately: for example, [25] provides discrete optimization methods that can be used for private hyperparameter search over discrete model architectures, and [2] uses ReportNoisyMax [16] for private model selection. Some work has also been done to account for high-performance models without having to spend a significant privacy budget [7, 35].)

It was generally observed that the choice of network structure itself had a relatively negligible impact on results. We found that training parameters such as the gradient clipping bound and batch size had a much more substantial impact on model performance, which is consistent with observations made in [1].

Model Architecture. The architecture of the model used in our experiments was a variant of a Masked Autoregressive Flow (MAF) [44] composed of a repeated sequence of five blocks, each containing a MADE [22] layer, a reversal layer, and an optional activation normalization layer. Models were optimized via Adam with its default parameters of $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Further details of training parameters and procedures are given in Appendix C.3.

4.2 Empirical Performance of DP-NF

We implemented our algorithm for differentially private normalizing flows on the Life Science dataset (and other datasets as described in Appendix C), and evaluated our performance against the baseline of DP-MoG [45] for a variety of quantitative and qualitative metrics related to density estimation tasks.

Figure 1: Average log likelihood (higher is better) across ten independent cross-validation splits as a function of the cumulative privacy loss $\varepsilon$. DP-MoG was configured to use the Gaussian mechanism with 3 components, as per the original work. DP-NF used composition via GDP (as well as MA for fair comparison).

4.2.1 Quantitative Evaluation: Expected Log Likelihood.

Arguably the most foundational metric for density estimators is the expected log likelihood they assign to held out test points. Figure 1 presents the average log likelihood assigned to a held out test set under DP-NF and the baseline method DP-MoG [45] as a function of $\varepsilon$. We divided the dataset into 10 pairs of training (90%) and test sets (10%), and reported the average test log likelihood per data point across the 10 independent trials. Better methods should assign higher log likelihood to points in the held out test set since these points were indeed sampled from the underlying distribution of interest. We found that DP-NF reliably assigned much higher likelihoods to holdout data than DP-MoG for identical privacy budgets, across a variety of privacy accountant methods.

The privacy guarantees of DP-NF proved quite practical, providing substantial privacy improvements over DP-MoG for the same model performance. For example, DP-NF matched the peak performance of DP-MoG at only a fraction of DP-MoG's privacy expenditure. These results are also listed in Table 1 with error bars showing standard deviation across 10 independent runs.

Table 1: Average test log likelihood on the Life Science dataset for varying privacy budgets $\varepsilon$, comparing DP-NF (GDP), DP-NF (MA), DP-MoG (MA), and DP-MoG (zCDP). Error bars denote standard deviation over ten independent cross-validation splits. Bolded results denote the best performing model for a given $\varepsilon$.

[45] showed performance of DP-MoG under several different privacy accountant methods, with the moments accountant of [1] providing the best performance. We compared DP-NF using the moments accountant for fair comparison, and using the novel Gaussian differential privacy (GDP) accountant of [5]. Figure 1 and Table 1 show that DP-NF outperforms DP-MoG for all privacy accountant methods considered for either model, emphasizing that while the GDP accountant does provide some benefit, the vast majority of the performance improvements come from the DP-NF method itself. The benefits of using the GDP accountant are further explored in the appendix.

4.2.2 Quantitative Evaluation: Downstream Machine Learning Tasks.

Next we further evaluate the quality of our model by measuring the performance of downstream machine learning models trained on its generated synthetic data. A proper method for evaluating the strength of density estimation approaches is through the quality of their synthetic data, as measured by the ability to train a machine learning model that performs well on future, real data.

To perform this evaluation, we trained DP-NF and DP-MoG (along with their non-private variants for reference) to learn the distribution of the training data, and then used these models to generate a synthetic dataset. We then trained a simple regressor ($k$-nearest neighbors with default library settings) on each synthetically generated dataset and evaluated its error in predicting a target value on real, held out Life Science data. The Life Science dataset does not have an immediately associated prediction task, as it is primarily used solely as a density estimation benchmark. To artificially construct a prediction task, we simply chose to isolate the last column to act as a label and treated the remaining nine columns as features.
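The evaluation protocol can be summarized by the following sketch using scikit-learn defaults; the array names and the label-in-last-column convention are ours.

```python
# Downstream-task evaluation: fit a k-nearest-neighbors regressor (library defaults)
# on synthetic data and score it on real, held-out data. The last column is the label.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def downstream_mse(synthetic: np.ndarray, real_test: np.ndarray) -> float:
    X_syn, y_syn = synthetic[:, :-1], synthetic[:, -1]
    X_real, y_real = real_test[:, :-1], real_test[:, -1]
    regressor = KNeighborsRegressor().fit(X_syn, y_syn)   # default library settings
    return mean_squared_error(y_real, regressor.predict(X_real))
```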

Figure 2: Mean squared error of the regressor ($k$-nearest neighbors with default library settings) on real, held out test data when trained on data synthetically generated by various approaches. Baseline refers to training the regressor on the real data.

Figure 2 shows the mean squared error attained by each regressor as a function of the privacy expenditure of the data generation approach. Horizontal lines denote approaches that are non-private (i.e., $\varepsilon = \infty$). “Baseline” refers to training the regressor directly on the training dataset provided to the density estimator. Upon inspection, we find that DP-NF generates data of convincingly higher quality than our comparison method DP-MoG for all values of $\varepsilon$. In addition, for higher values of $\varepsilon$, DP-NF converges to the quality achievable by a non-private MoG, whereas DP-MoG hits an apparent plateau well before this point.

4.2.3 Qualitative Evaluations.

Figure 3 shows that DP-NF provides a qualitative increase in sample quality under visualization. It presents dimension-wise histograms of synthetically generated features for three features of the Life Science dataset, using DP-NF (left column) and DP-MoG (right column) for comparison. (See Figure 11 in Appendix C for dimension-wise histograms of all 10 features.) Both methods used identical privacy parameters $\varepsilon$ and $\delta$. In every plot, the synthetic data in orange is superimposed over the real data in blue. We qualitatively see that for nearly all ten features, the distribution of data generated by DP-NF closely matches that of the real data, while DP-MoG was relatively unable to replicate regions of concentrated density for certain dimensions. This could be due to the fact that, for a fixed number of components, the DP-MoG model is constrained to cover the support of the distribution and must ignore nuanced details. Normalizing flow models, on the other hand, have heightened expressiveness over traditional statistical methods like Gaussian mixture models, and we see that they are able to capture these nuances more readily.

As another qualitative evaluation of sample visualization, Figure 4 shows the density of synthetic data generated by each model when projected to two dimensional space via PCA, for varying values of $\varepsilon$. The top row shows DP-NF, the bottom row shows DP-MoG, and the right figure shows the real data. In all plots, lighter pixels correspond to regions of higher density, and dark pixels indicate lower density. We see that DP-NF is better able to capture some of the observable qualities exhibited in the real data, for example the gradual compression of density to the left of the distribution.

Figure 3: Dimension-wise histograms of synthetically generated Life Science data, superimposed over real data, for identical privacy parameters $\varepsilon$ and $\delta$. Left Column: DP-NF. Right Column: DP-MoG. Note DP-NF’s ability to capture regions of concentrated density, whereas DP-MoG struggles in this respect.
Figure 4: Synthetically generated Life Science data for $\varepsilon$ = 2, 4, and 6, projected to two dimensions via PCA. Top row: DP-NF. Bottom row: DP-MoG. Right: Real data. Note the compression to the left of the distribution of real data that is captured by DP-NF as $\varepsilon$ increases, but not present in the synthetic data generated by DP-MoG.

5 Application: Differentially Private Anomaly Detection

Our DP-NF algorithm can be used as a tool for differentially private anomaly detection. Given a density estimator, a straightforward approach to anomaly detection is through a simple likelihood thresholding mechanism. For a given point, to determine whether it is in-distribution or out-of-distribution, we can simply return a binary value which denotes in-distribution if the log likelihood assigned to the point by the model is above some empirically derived threshold $\tau$, and out-of-distribution otherwise. For the purposes of our experiments, we assume that such a threshold is easily estimated, i.e., by selecting the value of $\tau$ which optimizes anomaly detection performance on a public test set. In the private setting, we can approach this task in a privacy-preserving manner by training either DP-NF or DP-MoG on the dataset. By the post-processing property of differential privacy, we can make arbitrarily many anomaly detection queries to the privately trained model while incurring no privacy loss beyond what is incurred during training.
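A sketch of this thresholding rule, together with the ROC computation obtained by sweeping the threshold, is given below; `model_log_prob` is a placeholder for the (privately) trained model's log-likelihood function.

```python
# Likelihood-thresholding anomaly detection on top of a trained density model, and
# the ROC curve obtained by sweeping the threshold tau over all log-likelihoods.
import numpy as np
from sklearn.metrics import roc_curve

def detect_in_distribution(model_log_prob, points, tau):
    """Return 1 ("in-distribution") when the log-likelihood exceeds the threshold tau."""
    return (model_log_prob(points) >= tau).astype(int)

def roc(model_log_prob, real_points, anomalies):
    """Scores are log-likelihoods; labels are 1 for real test points, 0 for anomalies."""
    scores = np.concatenate([model_log_prob(real_points), model_log_prob(anomalies)])
    labels = np.concatenate([np.ones(len(real_points)), np.zeros(len(anomalies))])
    return roc_curve(labels, scores)   # false positive rates, true positive rates, thresholds
```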

In Figure 5, we illustrate the efficacy of our approach in performing anomaly detection under this likelihood thresholding mechanism. We randomly generated points that were uniformly distributed around the tails of the test dataset, i.e., between the 5th and 30th percentiles and the 70th and 95th percentiles dimension-wise. The total number of synthetically generated anomalies was equal to the total number of test points. Figure 5 shows ROC curves of both private and non-private methods for this binary prediction problem. That is, it shows the tradeoff between the true positive rate and the false positive rate in predicting in-distribution or out-of-distribution correctly for varying selections of the likelihood threshold. We observe that DP-NF outperformed the other private method DP-MoG for the same privacy guarantee. Our approach performed comparably to non-private MoG, and is of course upper-bounded by a non-private normalizing flow.

Figure 5: ROC curves displaying the true positive rate and false positive rate for private and non-private likelihood threshold models. Privacy expenditure was calculated using the moments accountant with a fixed value of $\delta$.
1:  Input: Dataset $X$, example $x$, number of partitions $k$, likelihood threshold $\tau$, privacy budget $\varepsilon$.
2:  Partition $X$ into $k$ disjoint subsets $X_1, \dots, X_k$.
3:  Train a non-private density estimator $p_j$ on each partition $X_j$.
4:  $c \leftarrow 0$
5:  for $j = 1, \dots, k$ do
6:     if $\log p_j(x) \ge \tau$ then
7:        $c \leftarrow c + 1$
8:     end if
9:  end for
10:  Sample response as “in-distribution” with probability $\propto \exp(\varepsilon c / 2)$ and “out-of-distribution” with probability $\propto \exp\big(\varepsilon (k - c) / 2\big)$
11:  Output response
Algorithm 2 DP-AD, Differentially private anomaly detection via an ensemble of density estimators
Figure 6: Ensembled anomaly detection framework. In this setting, non-private density estimators are trained on separate partitions of the training dataset. At inference time, new examples (purple) are fed to each density estimator, which votes on whether the given example is “in-distribution” or “out-of-distribution” depending on whether its assigned likelihood exceeds a known threshold. Then, we perform a noisy aggregation of these votes to preserve privacy.

There are other possible methods for making anomaly detection algorithms differentially private, beyond the approach described above of using DP-NF directly to privately train the model. An alternative approach is to, rather than learning the parameters of the model in a differentially private manner, partition the dataset into $k$ parts and train a non-private density estimator on each part. Then, given a new point of interest, each model casts a vote regarding its belief on the point being in-distribution or out-of-distribution by testing whether the likelihood it assigns to the point exceeds $\tau$. We then aggregate these votes privately, to ensure that the final prediction is differentially private with respect to the training set. In our algorithm for differentially private anomaly detection (Algorithm 2), we use the Exponential Mechanism [38] for this private aggregation. We note that our overall approach is an instantiation of the sample-and-aggregate framework of [40]. This approach is visualized in Figure 6 and presented formally in Algorithm 2.
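A sketch of this vote-and-aggregate scheme follows; because a single training record affects at most one partition and therefore at most one vote, the vote count has sensitivity 1, which is the assumption behind the Exponential Mechanism weights below.

```python
# DP-AD sketch: k non-private density estimators each vote on a query point, and the
# binary answer is released via the Exponential Mechanism over the two outcomes.
import numpy as np

def dp_anomaly_vote(log_prob_fns, x, tau, epsilon, rng=np.random.default_rng()):
    k = len(log_prob_fns)
    votes_in = sum(int(lp(x) >= tau) for lp in log_prob_fns)   # "in-distribution" votes
    # Exponential Mechanism with utilities (votes_in, k - votes_in) and sensitivity 1.
    weights = np.array([np.exp(epsilon * votes_in / 2.0),
                        np.exp(epsilon * (k - votes_in) / 2.0)])
    p_in = weights[0] / weights.sum()
    return "in-distribution" if rng.random() < p_in else "out-of-distribution"
```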

The privacy guarantees of DP-AD follow as an immediate corollary from those of sample-and-aggregate [40] and the Exponential Mechanism [38].

Theorem 2.

DP-AD is $\varepsilon$-differentially private.

Although Algorithm 2 is instantiated with the Exponential Mechanism [38], one can opt for any differentially private aggregation method. In contexts requiring answers to a large but bounded number of queries where the number of anomalies is expected to be small in comparison to the total number of queries (an appropriate assumption for many applications), an immediate extension of this approach using the Sparse Vector technique [18, 19] would be natural.

Figure 7: Half-Moons Dataset
Figure 8: Anomaly detection through private aggregation of votes (either “in-distribution” or “out-of-distribution”) on synthetically generated data. The training set was partitioned into 10 pieces, and a non-private model (either MoG or NF) was fit to each piece. Differentially private aggregation was performed via the exponential mechanism.

To evaluate this approach, we applied Algorithm 2 to the Half-Moons dataset, which is a synthetic dataset with 30,000 data points, each with two real-valued features. This dataset is visualized in Figure 7, and more details about this synthetic dataset are given in Appendix C.

Figure 8 illustrates the data used and the results of this evaluation. The training data (dark purple) was partitioned into 10 pieces and used to fit a set of independent non-private models (all either MoG or NF). Then, anomalies (light purple) were added to the dataset, and both real and fake data were fed into the learned models. These data are illustrated in the top image of Figure 8. The Exponential Mechanism was used to perform differentially private aggregation to yield the final ensemble prediction when applied to held out test points. The bottom image of Figure 8 presents the probability of a correct classification between “in-sample” and “out-of-sample” data as a function of $\varepsilon$ under this scheme. Both approaches were given the threshold that optimized their classification performance. Notice that using normalizing flows in this setting naturally yields a performance improvement over the existing method, due in part to the ability of normalizing flows to capture densities which do not adhere to a Gaussian distribution.

The benefit of this ensembled approach is that each individual model can be trained non-privately, which may substantially improve the quality of the learned models. For certain problems, this might be necessary when differentially private optimization is known to degrade performance substantially, for example in high dimensional problems with image data. The associated downside is that the analyst’s privacy budget now degrades as a function of the number of queries, whereas before the privacy budget was independent of the number of queries. The analyst would have to determine the application-specific tradeoff between the costs and benefits of this approach, based on practical constraints imposed by the problem context.

6 Conclusion

Privacy is a subject of increasing importance and growing concern. Because our work adheres to the framework of differential privacy, one is able to make definitive statements regarding the privacy of participants involved in the analysis. Our results could also be used to enable differentially private synthetic data generation, which would allow data curators to provide privatized synthetic versions of their sensitive or protected datasets, thereby enabling broader access to these data.

In this work, we have demonstrated the efficacy of differentially private normalizing flow models as a novel approach to the task of privacy-preserving density estimation. We have shown the ability of these models to assign high likelihoods to holdout data and generate qualitatively realistic synthetic data, improving on existing state-of-the-art methods. Going forward, there exist several interesting directions for further development. For example, it remains to be seen how normalization layers such as activation normalization, whose parameters are likely disproportionately sensitive to perturbation during differentially private optimization, could be better adapted to this setting. Further, one might hypothesize that sampling via partitions of a shuffled dataset may yield improved results given more regular sampling of each data point, and associating such a sampling method with rigorous privacy guarantees, if possible, could yield empirical improvements. Finally, in this study we only considered a particular subset of the normalizing flows in existence. Many alternative neural density estimators capable of expressing highly discontinuous distributions are in continuous development, including FFJORD [24], Neural Spline Flows [13], Neural Autoregressive Flows [28], and Transformation Autoregressive Networks [42].

References

  • Abadi et al. [2016] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM Conference on Computer and Communications Security, CCS ’16, pages 308–318, 2016.
  • Beaulieu-Jones et al. [2019] Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, and Casey S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes, 12(7):e005122, 2019.
  • Bittner et al. [2018] D. Bittner, A. D. Sarwate, and R. Wright. Using noisy binary search for differentially private anomaly detection. In Proceedings of the 2nd International Symposium on Cyber Security Cryptography and Machine Learning (CSCML), volume 10879, pages 20–37, 2018.
  • Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Bu et al. [2020] Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J. Su. Deep learning with Gaussian differential privacy. Harvard Data Science Review, 2(3), 2020.
  • Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Proceedings of the 13th Conference on Theory of Cryptography, TCC ’16, pages 635–658, 2016.
  • Chaudhuri and Vinterbo [2013] Kamalika Chaudhuri and Staal A Vinterbo. A stability-based validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems 26, NIPS ’13, pages 2652–2660, 2013.
  • Chawla et al. [2005] Shuchi Chawla, Cynthia Dwork, Frank McSherry, and Kunal Talwar. On the utility of privacy-preserving histograms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI ’05, 2005.
  • Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation, 2014. arXiv preprint 1410.8516.
  • Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations, ICLR ’17, 2017.
  • Dong et al. [2019] Jinshuo Dong, Aaron Roth, and Weijie J. Su. Gaussian differential privacy, 2019. arXiv preprint 1905.02383.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, 2019.
  • Dwork [2008] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation (TAMC), volume 4978 of Lecture Notes in Computer Science, pages 1–19. Springer Verlag, 2008.
  • Dwork and Lei [2009] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the 41st ACM Symposium on Theory of Computing, STOC ’09, pages 371–380, 2009.
  • Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 2014.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC ’06, pages 265–284, 2006.
  • Dwork et al. [2009] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil P. Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In Proceedings of the 41st ACM Symposium on Theory of Computing, STOC ’09, pages 381–390, 2009.
  • Dwork et al. [2010] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC ’10, pages 715–724, 2010.
  • Fakoor et al. [2020] Rasool Fakoor, Pratik Chaudhari, Jonas Mueller, and Alexander J. Smola. Trade: Transformers for density estimation, 2020. arXiv preprint 2004.02441.
  • Fan and Xiong [2013] Liyue Fan and Li Xiong. Differentially private anomaly detection with a case study on epidemic outbreak detection. In Proceedings of the IEEE 13th International Conference on Data Mining Workshops, pages 833–840, 2013.
  • Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 881–889, 2015.
  • Google [2018] Google. Tensorflow privacy, 2018. URL https://github.com/tensorflow/privacy.
  • Grathwohl et al. [2019] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In Proceedings of International Conference on Learning Representations, ICLR ’19, 2019.
  • Gupta et al. [2010] Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’10, pages 1106–1125, 2010.
  • Hall et al. [2013] Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(1):703–727, 2013.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 1026–1034, 2015.
  • Huang et al. [2018] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2078–2087, 2018.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML ’15, pages 448–456, 2015.
  • Izmailov et al. [2020] Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-supervised learning with normalizing flows. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4615–4630, 2020.
  • Kamath et al. [2019] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private algorithms for learning mixtures of separated gaussians. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, 2019.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 2015 International Conference on Learning Representations, ICLR ’15, 2015.
  • Kingma and Dhariwal [2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, NeurIPS ’18, 2018.
  • Kingma et al. [2016] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, NIPS ’16, 2016.
  • Liu and Talwar [2019] Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC ’19, pages 298–309, 2019.
  • Mclachlan and Basford [1988] G. Mclachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988.
  • McMahan et al. [2018] H. Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A general approach to adding differential privacy to iterative training procedures, 2018. arXiv preprint 1812.06210.
  • McSherry and Talwar [2007] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, pages 94–103, 2007.
  • Mironov [2017] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium, CSF ’17, pages 263–275, 2017.
  • Nissim et al. [2007] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, STOC ’07, pages 75–84, 2007.
  • Okada et al. [2015] Rina Okada, Kazuto Fukuchi, and Jun Sakuma. Differentially private analysis of outliers. In Proceedings of the 2015 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, ECMLPKDD ’15, pages 458–473, 2015.
  • Oliva et al. [2018] Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, and Jeff Schneider. Transformation autoregressive networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3898–3907, 2018.
  • Papamakarios [2019] George Papamakarios. Neural density estimation and likelihood-free inference, 2019. arXiv preprint 1910.13233.
  • Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30, NIPS ’17, pages 2338–2347, 2017.
  • Park et al. [2017] Mijung Park, James Foulds, Kamalika Choudhary, and Max Welling. DP-EM: Differentially Private Expectation Maximization. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 896–904, 2017.
  • Salimans and Kingma [2016] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, NIPS ’16, 2016.
  • Surendra and Mohan [2017] H. S. Surendra and H. S. Mohan. A review of synthetic data generation methods for privacy preserving data publishing. International Journal of Scientific & Technology Research, 6:95–101, 2017.
  • Wang et al. [2019] Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS ’19, pages 1226–1235, 2019.
  • Xu et al. [2012] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In Proceedings of the IEEE 28th International Conference on Data Engineering, pages 32–43, 2012.
  • Yuncheng Wu et al. [2016] Yuncheng Wu, Yao Wu, Hui Peng, Juru Zeng, Hong Chen, and Cuiping Li. Differentially private density estimation via gaussian mixtures model. In Proceedings of the IEEE/ACM 24th International Symposium on Quality of Service (IWQoS), pages 1–6, 2016.

Appendix A Additional Privacy Preliminaries

A.1 Differential Privacy

We achieve differential privacy in Algorithm 1 through the Gaussian Mechanism, which adds mean-zero Gaussian noise to the value of a function evaluated on the data. The scale of the noise depends on the sensitivity of the function. The $\ell_2$-sensitivity of a function $f$ is denoted $\Delta_2 f$, and is the maximum change in the $\ell_2$ norm of $f$ if one entry in the database were to be changed. Formally, for neighboring databases $D, D'$ differing in a single entry, $\Delta_2 f = \max_{D, D'} \|f(D) - f(D')\|_2$. In our case, this function is the computation of the gradient given a sampled batch of data.

Theorem 3 (Gaussian Mechanism [16]).

Let $\varepsilon \in (0, 1)$ and let $\mathcal{M}(D) = f(D) + (Y_1, \dots, Y_d)$, where the $Y_i$ are i.i.d. random variables drawn from $\mathcal{N}(0, \sigma^2)$ and $\sigma \ge \sqrt{2 \ln(1.25/\delta)}\, \Delta_2 f / \varepsilon$. Then $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private.
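A minimal sketch of this mechanism with the stated noise calibration (the function name is ours):

```python
# Gaussian Mechanism: release f(D) plus i.i.d. Gaussian noise calibrated to the
# l2-sensitivity of f; valid for epsilon in (0, 1) per Theorem 3.
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=np.random.default_rng()):
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return value + rng.normal(scale=sigma, size=np.shape(value))
```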

Differential privacy composes, meaning that the privacy guarantees degrade smoothly as more analyses are performed on the same dataset. The simplest version of privacy composition is that the $\varepsilon$s and $\delta$s “add up” across analyses. Tighter composition bounds are possible, including the approaches outlined in the following two subsections.

A.2 Moments Accountant

The moments accountant [1] was proposed initially as a means for tight composition of the privacy guarantees of DP-SGD. To characterize this analysis, we note the privacy loss associated with a given outcome $o$, given as $c(o) = \log \frac{\Pr[\mathcal{M}(D) = o]}{\Pr[\mathcal{M}(D') = o]}$. Further, for two given datasets $D, D'$ we define the log of the MGF of this random variable evaluated at some value $\lambda$ as $\alpha_{\mathcal{M}}(\lambda; D, D') = \log \mathbb{E}\big[\exp\big(\lambda\, c(\mathcal{M}(D))\big)\big]$. Finally, the “worst case” upper bound across all possible pairs of datasets is given as $\alpha_{\mathcal{M}}(\lambda) = \max_{D, D'} \alpha_{\mathcal{M}}(\lambda; D, D')$.

This notation can be used to give composition guarantees for the privacy parameters across multiple algorithms run on the same dataset.

Theorem 4 ([1]).

Suppose that an algorithm $\mathcal{M}$ consists of a sequence of adaptive algorithms $\mathcal{M}_1, \dots, \mathcal{M}_k$, where each $\mathcal{M}_i$ may depend on the outputs of $\mathcal{M}_1, \dots, \mathcal{M}_{i-1}$. Then for any $\lambda$, $\alpha_{\mathcal{M}}(\lambda) \le \sum_{i=1}^{k} \alpha_{\mathcal{M}_i}(\lambda)$. For any $\varepsilon > 0$, $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private for $\delta = \min_{\lambda} \exp\big(\alpha_{\mathcal{M}}(\lambda) - \lambda \varepsilon\big)$.

This analysis was later characterized under the framework of Rényi differential privacy [39], a relaxation of $\varepsilon$-differential privacy which is defined in a manner closely resembling the moments accountant privacy analysis.

Definition 2 ($(\alpha, \varepsilon)$-RDP [39]).

A randomized algorithm $\mathcal{M}$ is said to have $\varepsilon$-Rényi differential privacy of order $\alpha$, or $(\alpha, \varepsilon)$-RDP for short, if for any adjacent $D, D'$ it holds that $D_\alpha\big(\mathcal{M}(D) \,\|\, \mathcal{M}(D')\big) \le \varepsilon$, where $D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\big[(P(x)/Q(x))^{\alpha}\big]$ is the Rényi divergence of order $\alpha$.

Finally, the privacy analysis performed in [1] regarding DP-SGD assumes the sampling is performed via Poisson subsampling, i.e., each individual example has an independent probability $b/n$ of being included in the batch at each iteration. In some contexts, fixed-sized batches can enable a variety of performance improvements by allowing for compilation. Privacy analysis under uniform subsampling, where each batch is sampled uniformly across all possible batches of size $b$, was considered in [48]:

Theorem 5 (RDP-DP Conversion [48]).

For all integers $\alpha \ge 2$, if $\mathcal{M}$ is $(\alpha, \varepsilon)$-RDP, then the randomized algorithm that applies $\mathcal{M}$ to a uniformly subsampled batch of the data (drawn without replacement at sampling rate $\gamma$) satisfies $(\alpha, \varepsilon')$-RDP, where $\varepsilon'$ is given by the subsampling amplification bound of [48].

A.3 Gaussian Differential Privacy

Gaussian differential privacy (GDP) is a recently proposed relaxation of $(\varepsilon, \delta)$-differential privacy established in [11], and further expanded upon in the context of deep learning in [5]. This definition exhibits several appealing properties, including simplified analysis under composition and subsampling and analytically tractable expressions for the privacy guarantees of NoisySGD, while providing a slightly tighter privacy bound than that which is achieved through analysis via the moments accountant. The framework of $\mu$-Gaussian differential privacy acts as the basis for our analysis.

We note that the Gaussian mechanism (Definition 3) is -GDP [11].333Further detail concerning the privacy guarantees achieved when batches are subsampled is given in Section 2 of [5].

The overall privacy guarantee corresponding to the $k$-fold adaptive composition of mechanisms each satisfying $\mu$-GDP is $\sqrt{k}\,\mu$-GDP. Finally, $\mu$-GDP allows for a conversion to a corresponding $(\varepsilon, \delta)$-differential privacy guarantee using the fact that an algorithm is $\mu$-GDP if and only if it is $(\varepsilon, \delta(\varepsilon))$-differentially private for all $\varepsilon \ge 0$, where
$$\delta(\varepsilon) = \Phi\!\left(-\frac{\varepsilon}{\mu} + \frac{\mu}{2}\right) - e^{\varepsilon}\,\Phi\!\left(-\frac{\varepsilon}{\mu} - \frac{\mu}{2}\right)$$
and $\Phi$ is the cumulative distribution function of the standard Normal distribution.
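A minimal Python sketch of GDP accounting under these two facts (composition by root-sum-of-squares of the per-step $\mu$ values, followed by the conversion above); the per-step $\mu$, step count, and $\varepsilon$ are hypothetical, and scipy is assumed to be available.

import numpy as np
from scipy.stats import norm

def gdp_compose(mus):
    # k-fold adaptive composition of mu_i-GDP mechanisms is sqrt(sum_i mu_i^2)-GDP.
    return float(np.sqrt(np.sum(np.square(mus))))

def gdp_to_dp_delta(mu, epsilon):
    # delta(eps) implied by mu-GDP: Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2).
    return float(norm.cdf(-epsilon / mu + mu / 2.0)
                 - np.exp(epsilon) * norm.cdf(-epsilon / mu - mu / 2.0))

# Hypothetical: 1000 steps, each 0.02-GDP, converted at epsilon = 2.
mu_total = gdp_compose([0.02] * 1000)
print(mu_total, gdp_to_dp_delta(mu_total, epsilon=2.0))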

Figure 9 shows that GDP privacy accounting gives substantial improvements over the moments accountant when used as the privacy accounting method in DP-NF. For each number of training iterations, GDP accounting yields a lower cumulative privacy cost $\varepsilon$.

Figure 9: Cumulative privacy loss given the Life Science training parameters as a function of training iterations.

a.4 DP-SGD

Algorithm 3 is a formal description of the DP-SGD algorithm of [1], on which our approach is based. Specifically, Algorithm 1 is an instantiation of DP-SGD, substituting the negative log likelihood for the generic loss function in Algorithm 3.

1:  Input: Dataset $D = \{x_1, \ldots, x_N\}$, loss function $\mathcal{L}(\theta, x)$, learning rate $\eta$, noise multiplier $\sigma$, batch size $B$, gradient norm bound $C$, number of iterations $T$.
2:  Initialize $\theta_0$ randomly
3:  for $t = 1, \ldots, T$ do
4:     Take a Poisson random subsample $S_t \subseteq D$ with per-example probability $q = B/N$.
5:     Compute gradient
6:     For each $x_i \in S_t$, compute $g_t(x_i) \leftarrow \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$
7:     Clip gradient
8:     $\bar{g}_t(x_i) \leftarrow g_t(x_i) \, / \, \max\!\big(1, \|g_t(x_i)\|_2 / C\big)$
9:     Add noise
10:     $\tilde{g}_t \leftarrow \frac{1}{B}\Big(\sum_{x_i \in S_t} \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 I)\Big)$
11:     Descend
12:     $\theta_{t+1} \leftarrow \theta_t - \eta \, \tilde{g}_t$
13:  end for
14:  Output $\theta_T$ and compute the overall $(\varepsilon, \delta)$ using a privacy accounting method.
Algorithm 3 DP-SGD, differentially private stochastic gradient descent [1]
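The following numpy sketch illustrates one DP-SGD update (steps 4–12 above) on a toy squared-error loss; it is not the paper's implementation, which instantiates the loss as the negative log likelihood of a normalizing flow, and all parameter values are hypothetical.

import numpy as np

def dp_sgd_step(theta, batch_x, batch_y, lr, clip_norm, noise_multiplier, rng):
    # Per-example gradients of the toy loss 0.5 * (x . theta - y)^2.
    residuals = batch_x @ theta - batch_y              # shape (B,)
    grads = residuals[:, None] * batch_x               # shape (B, d)
    # Clip each per-example gradient to l2 norm at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)
    # Sum, add Gaussian noise calibrated to the clipping bound, and average.
    # (Under Poisson sampling one would divide by the expected batch size instead.)
    noisy_sum = grads.sum(axis=0) + rng.normal(0.0, noise_multiplier * clip_norm, size=theta.shape)
    return theta - lr * noisy_sum / len(batch_x)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
theta = dp_sgd_step(np.zeros(5), X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(theta)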

Appendix B DP-NF Extensions

b.1 Data-Dependent Initialization of Normalization Layers

Intermediate normalization layers such as batch normalization [29] and activation normalization [33] have been shown to improve the stability of normalizing flow models. In our context, batch normalization is incompatible with our approach because batch statistics are shared across examples when computing the forward pass of the layer, precluding the ability to calculate truly independent per-example gradients as required by NoisySGD. Activation normalization is more appropriate in our setting since no such batch statistics are calculated. Activation normalization offsets and scales its inputs feature-wise by a learned set of parameters $b$ and $s$, i.e., $y = (x - b)/s$. In practice, these parameters are typically set via data-dependent initialization [46] by computing a forward pass on a sampled batch of data and setting $b$ and $s$ to be the per-feature means and standard deviations of the observed inputs, respectively. Since these statistics are calculated directly from the data, this approach is not privacy-preserving.

One potential approach to making activation normalization differentially private is to privatize these statistics using the Laplace Mechanism [17]. This approach is outlined in Algorithm 4, where $\mathrm{clip}(Z, a, b)$ clips the values of $Z$ to be in the range $[a, b]$, $\mathrm{mean}(Z)$ computes the feature-wise mean of $Z$, $\mathrm{std}(Z)$ computes the feature-wise standard deviation of $Z$, and $\mathrm{init}$ is some data-independent parameter initialization method which maps standardized inputs to standardized outputs in expectation, e.g., He initialization [27].

1:  Input: Dataset $X$, transformation $f_{\theta}$ (e.g. MADE [22]), number of layers $L$, initialization privacy budget $\varepsilon_0$, initialization privacy tolerance $\delta_0$, data-independent parameter initialization method $\mathrm{init}$ (e.g. He initialization [27]).
2:  $Z \leftarrow X$
3:  for $\ell = 1, \ldots, L$ do
4:     Initialize the layer's transformation parameters: $\theta_\ell \leftarrow \mathrm{init}(\cdot)$
5:     Push the data through the layer: $Z \leftarrow f_{\theta_\ell}(Z)$
6:     Privatize the clipped feature-wise statistics with a per-layer share of $(\varepsilon_0, \delta_0)$: $b_\ell \leftarrow \mathrm{mean}(\mathrm{clip}(Z, a, b)) + \mathrm{Lap}(\cdot)$, $s_\ell \leftarrow \mathrm{std}(\mathrm{clip}(Z, a, b)) + \mathrm{Lap}(\cdot)$
7:     Normalize: $Z \leftarrow (Z - b_\ell)/s_\ell$
8:  end for
9:  Output the initialized parameters $\{(\theta_\ell, b_\ell, s_\ell)\}_{\ell=1}^{L}$
Algorithm 4 DP-NF-INIT, data-dependent initialization of activation normalization layers
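A sketch of the core of this initialization for a single activation normalization layer is given below; the budget split, clipping range, and sensitivity bounds here are illustrative assumptions and not necessarily the calibration used by Algorithm 4.

import numpy as np

def dp_actnorm_init(z, lo, hi, eps, rng):
    # Laplace-noised per-feature mean and std of activations clipped to [lo, hi],
    # splitting the per-layer budget eps equally between the two statistics.
    n, d = z.shape
    zc = np.clip(z, lo, hi)
    sens_mean = (hi - lo) / n            # worst-case change of a clipped per-feature mean
    sens_std = (hi - lo) / np.sqrt(n)    # rough bound for the clipped per-feature std
    b = zc.mean(axis=0) + rng.laplace(0.0, sens_mean / (eps / 2.0), size=d)
    s = zc.std(axis=0) + rng.laplace(0.0, sens_std / (eps / 2.0), size=d)
    return b, np.maximum(s, 1e-3)        # keep the scale positive after noising

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 10))          # hypothetical activations at one layer
b, s = dp_actnorm_init(z, lo=-5.0, hi=5.0, eps=0.1, rng=rng)
print(b[:3], s[:3])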

We note that Algorithm 4 is far from the only approach for differentially private activation normalization. If the analyst has some domain knowledge about appropriate ranges of these parameters, she could use the differentially private Propose-Test-Release framework [15] to first normalize and then add noise proportional to her proposed (and tested) sensitivity. While this approach seems practical for the outer layers, it is unlikely that an analyst would have numerical intuition for appropriate parameter values in all layers (especially when $L$ is large). Thus even though the noise addition scheme described in Algorithm 4 may seem naive, it is likely the most practical approach.

Although the utility of activation normalization layers is quite evident, the original work [33] proposing such layers provided little evidence that data-dependent initialization yields statistically significant improvements over a default initialization scheme, i.e., $b = 0$ and $s = 1$. In our experiments, we observed little distinction in contexts where the input data was assumed to be standardized and parameters were initialized to maintain variance between layers. Despite this, we include the approach for completeness, for potential future contexts where data-dependent initialization of such parameters is deemed necessary.

b.2 DP-MoG as a Prior

Thus far it has been assumed that we use a spherical multivariate Gaussian distribution as the prior for our model. However, any distribution can naturally act as such a prior as long as it exhibits a tractable density function. For example, simply extending the single standardized Gaussian to a mixture of Gaussians has been shown [44] to yield modest performance improvements. This mixture could also be fit to the data a priori [30].

Hence, a natural extension to our proposed approach would be to first fit DP-MoG with privacy budget $(\varepsilon_1, \delta_1)$ to act as a prior, and then to refine this prior by training a sequence of nonlinear bijective functions with privacy budget $(\varepsilon_2, \delta_2)$ to yield an encompassing normalizing flow model. This yields a worst-case $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-differential privacy guarantee by sequential composition. However, this guarantee is easily improved by composing the two privacy guarantees under some alternative privacy definition (e.g. RDP or GDP) before subsequently converting to a corresponding $(\varepsilon, \delta)$-DP guarantee. One might hypothesize that this approach would yield preferable results in contexts where the distribution at hand is composed of several discontinuous components, while exhibiting locally nonlinear density within each component.
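To make the combined model concrete, the sketch below evaluates the log density of a flow under a GMM prior via the change-of-variables formula; the toy affine bijection, mixture parameters, and helper names are hypothetical, and training (together with its privacy accounting) is omitted.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_prob(z, weights, means, covs):
    # Log density of a Gaussian mixture prior evaluated at each row of z.
    comps = np.array([multivariate_normal(mean=m, cov=c).logpdf(z) for m, c in zip(means, covs)])
    return logsumexp(np.log(weights)[:, None] + comps, axis=0)

def flow_log_prob(x, forward, weights, means, covs):
    # Change of variables: log p_X(x) = log p_Z(f(x)) + log |det df/dx|.
    z, log_det = forward(x)
    return gmm_log_prob(z, weights, means, covs) + log_det

# Toy elementwise bijection z = a * x + b, whose log-determinant is sum(log|a|).
a, b = np.array([2.0, 0.5]), np.array([0.0, 1.0])
forward = lambda x: (a * x + b, np.full(len(x), np.log(np.abs(a)).sum()))
x = np.random.default_rng(0).normal(size=(4, 2))
weights, means, covs = np.array([0.5, 0.5]), [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)]
print(flow_log_prob(x, forward, weights, means, covs))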

Figure 10: Pinwheel Dataset

To capture this context, we evaluate the efficacy of this approach on the Pinwheel dataset, as illustrated in Figure 10 (further details of this dataset are given in Appendix C). The Pinwheel dataset is a common density estimation benchmark consisting of a number of disconnected components with nonlinear boundary. A Gaussian mixture model would naturally have difficulty approximating such a distribution for a small number of components, while classical normalizing flow models with a single standardized Gaussian as a prior might have difficulty expressing its discontinuous density.

As shown in Table 2, using a trained GMM as a prior can aid in performance. First, we note that both DP-NF and DP-MoG demonstrate difficulty in achieving negative log likelihoods lower than 2.65-2.70, even when the number of components for DP-MoG is increased. DP-NF with DP-MoG as a prior modestly outperforms both alternatives used in isolation. Additionally, if a GMM prior of five components fit to the population can be assumed to be public, one achieves dramatic performance improvements over all methods.

Pinwheel
DP-NF (spherical Gaussian prior)
DP-NF (non-private GMM prior)
DP-NF (private GMM prior)
DP-MoG (5 components)
DP-MoG (10 components)
Table 2: Test negative log likelihood (lower is better) on the Pinwheel dataset (Figure 10) for varying privacy budgets $\varepsilon$, composed via the moments accountant. From top to bottom: standard DP-NF with a single spherical Gaussian as a prior, DP-NF with a GMM prior of 5 components (fit non-privately), DP-NF with a GMM prior of 5 components (learned privately), DP-MoG with 5 components, DP-MoG with 10 components. When a privacy budget is expended on the prior, it is included in the overall cost indicated by the column headings.

Appendix C Additional Results and Training Details

In this section, we provide additional results on the performance of DP-NF, compared to the baseline mechanism of DP-MoG, as evaluated on several real and synthetic datasets. These datasets are summarized in Table 3. Details of the real and synthetic datasets are respectively given in Sections C.1 and C.2. Further details of hyperparameters and training for all experiments are given in Section C.3.

Dataset  Type  Dimension  Examples
Life Science Real 10 26,733
Gowalla Real 2 100,000
Power Real 6 100,000
Gaussians Synthetic 2 30,000
Half-Moons Synthetic 2 30,000
Pinwheel Synthetic 2 30,000
Table 3: Dimensionality and number of examples for each dataset.

c.1 Additional Results on Real Datasets

Our analysis covers a range of both synthetic and real-world datasets. In this section, we provide short descriptions of each dataset used for evaluation, and provide results of DP-NF evaluated on these datasets.

Life Science. The Life Science dataset is a density estimation benchmark dataset from the UCI machine learning repository [12] used in our baseline [45] in their evaluation of DP-MoG. It contains 26,733 real-valued records of dimension 10 characterizing the principal components of measurements made in a variety of chemical and biological experiments.

Power. The Power dataset is a density estimation benchmark dataset from the UCI machine learning repository [12] used in much of the normalizing flow literature [44, 24]. It contains measurements of electric power consumption in a household over a period of 47 months, and was preprocessed according to the description given in [44].

Gowalla. The Gowalla dataset contains the longitude and latitude of check-in locations from users of the Gowalla social network. The total number of points is 1,256,384, which was reduced to 100,000 via a random sample. It was used in the evaluation of our baseline [45], but applied to the task of $k$-means clustering rather than learning the components of a Gaussian mixture model.

We evaluated the performance of DP-NF and the baseline DP-MoG for comparison on these real-world datasets. Results on the Life Science dataset are given in the body of the paper in Table 1. Results on the Power and Gowalla datasets are given in Table 4 below.

Power
DP-NF (GDP) - -
DP-NF (MA)
DP-MoG$_{10}$ (MA)
DP-MoG$_{10}$ (zCDP)
Gowalla
DP-NF (GDP) - -
DP-NF (MA)
DP-MoG$_{11}$ (MA)
DP-MoG$_{11}$ (zCDP)
Table 4: Average test log likelihood for varying privacy budgets $\varepsilon$. Error bars denote standard deviation over ten independent cross-validation splits. DP-MoG$_k$ refers to a Gaussian mixture of $k$ components. Omitted entries occur when model training was early-stopped because it had already converged at lower $\varepsilon$ values.

We also provide Figure 11, which is the full version of Figure 3 in Section 4. Figure 11 provides dimension-wise histograms of the synthetically generated data for all ten dimensions of the Life Science dataset, presented in numerical order by axis index from left to right. The top two rows correspond to our baseline, and the bottom two rows correspond to our approach. For each image, we visualize the synthetically generated data (given in orange) and superimpose it over real data (given in blue) for comparison. We observe that for nearly all ten features, the distribution of data generated by DP-NF closely resembled that of the real data while DP-MoG was unable to replicate regions of concentrated density for certain dimensions.

Figure 11: Dimension-wise histograms of synthetically generated Life Science data, superimposed over real data. Dimensions are given in order from 0 to 9 left-to-right, top-to-bottom. Top two rows: DP-NF. Bottom two rows: DP-MoG. Note that synthetic data from DP-NF represents the real data well, while DP-MoG is relatively unable to capture concentrated regions of density in the real data.

c.2 Additional Results on Synthetic Datasets

We also evaluate our DP-NF method on several synthetic datasets: the Half-Moons dataset (Figure 7) and a synthetically constructed mixture of 8 Gaussians (Figure 12). The former demonstrates the heightened expressiveness of our approach relative to a Gaussian mixture approach; the latter represents a worst-case scenario in which the data are truly generated by a mixture of Gaussians, where DP-MoG would be expected to outperform our method. Results are presented in Table 5.

Figure 12: Gaussian Dataset
Half-Moons
DP-NF (GDP) - -
DP-NF (MA)
DP-MoG$_{3}$ (MA)
DP-MoG$_{3}$ (zCDP)
Gaussians
DP-NF (GDP) - -
DP-NF (MA)
DP-MoG$_{8}$ (MA)
DP-MoG$_{8}$ (zCDP)
Table 5: Average test log likelihood for varying privacy budgets $\varepsilon$. Error bars denote standard deviation over ten independent cross-validation splits. DP-MoG$_k$ refers to a Gaussian mixture of $k$ components. Omitted entries occur when model training was early-stopped because it had already converged at lower $\varepsilon$ values.

c.3 Hyperparameters and Training Details

In this subsection we detail decisions about hyperparameter selection in training, including the gradient clipping parameter $C$, regularization of the loss function, and the choice of privacy accountants.

For the gradient clipping parameter $C$, prior work [1] suggested that a reasonable heuristic is to set $C$ equal to the median of the norms of the unclipped per-example gradients observed over the course of a non-private training run. In the context of normalizing flows, we found that much larger values of $C$ yielded significantly better results. A natural explanation arises from considering log likelihood as the objective. When a given point is assigned near-zero density, a large gradient update is needed to prevent further deterioration; when this gradient is clipped, the resulting update may be insufficient to avoid the numerical instability that follows if the point's density drifts even closer to zero. Despite the additional noise applied to updates for larger $C$, we found this merely prolonged training without significantly degrading the quality of the resulting model.
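The heuristic from [1] can be expressed as a one-liner over per-example gradient norms collected during a non-private run; the gradients below are synthetic placeholders, and the larger multiple printed at the end only illustrates trying a clipping bound well above the median, not a value used in the paper.

import numpy as np

def median_grad_norm(per_example_grads):
    # Heuristic of [1]: set the clipping bound C to the median per-example gradient norm
    # observed during a non-private training run.
    flat = per_example_grads.reshape(len(per_example_grads), -1)
    return float(np.median(np.linalg.norm(flat, axis=1)))

grads = np.random.default_rng(0).normal(size=(512, 200))   # placeholder per-example gradients
C = median_grad_norm(grads)
print(C, 10.0 * C)   # the 10x multiple is purely illustrative of a "much larger" C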

With respect to regularization, [44] suggested a modest amount of weight regularization in the context of non-private normalizing flows. We found this regularization to substantially degrade the quality of the resulting models and generally omitted it in our training. This makes intuitive sense: the regularization suggested in [44] drives model weights toward zero over the course of training, while differentially private optimization applies Gaussian noise of constant variance to gradients throughout training. As model weights tend toward zero, one would expect the noise injected for privacy to eventually dominate the learned signal of the model.

For privacy accountants, we found that composition under Gaussian differential privacy (GDP) [11] consistently yielded the tightest privacy bounds throughout our experiments. See Figure 9 in Appendix A.3 for an illustration of the improvements in privacy composition that can be achieved by GDP over the moments accountant of [1]. Since our baseline method DP-MoG yielded the best performance under the moments accountant in [45], we report results under both GDP composition and the moments accountant for a fair comparison. We found that DP-NF consistently outperformed DP-MoG even when both methods used the moments accountant, and that further performance improvements could be achieved by DP-NF using GDP composition as the privacy accounting method.

Appendix D Limitations of Proposed Approach

Normalizing flow models are trained in a manner which minimizes the average negative log likelihood of the observed data. As such, it is not uncommon when training such models for a given point to be assigned near-zero density, provoking a loss approaching infinity. We observed this issue was somewhat exacerbated by differentially private optimization, in part due to noise injection, but primarily as a result of subsampling. Under uniform subsampling, as required in our privacy analysis, there is no strict guarantee that any individual point is regularly sampled. This contrasts with typical sampling methodology, in which the dataset is repeatedly shuffled and partitioned into equal-sized batches over the course of an epoch. Rigorous privacy guarantees for equal-sized, disjoint batches are, to the best of our knowledge, a currently unsettled issue [37] and a potential avenue for improvement given the theoretical convergence guarantees of that sampling scheme. We did not find this limitation to be ultimately confounding.

Additionally, given that DP-NF is ultimately a deep learning based approach to density estimation, it naturally involves the optimization of a large number of parameters and a high resource expenditure in terms of time and space complexity. This is especially highlighted in comparison to DP-MoG as a baseline, which takes on the order of seconds to run on CPU, whereas our method can take on the order of an hour on GPU. However, we find this tradeoff between resource expenditure and distribution quality to be justified, particularly in the context of differentially private data analysis and social science applications, which rarely demand strict resource constraints within reason.