Roundtrip: a deep generative neural density estimator
Density estimation is a fundamental problem in both statistics and machine learning. We proposed Roundtrip as a universal neural density estimator based on deep generative models. Roundtrip exploits the advantage of GANs for generating samples and estimates density by either importance sampling or Laplace approximation. Unlike prior studies that model the target density by constructing a tractable Jacobian with respect to a base density (e.g., Gaussian), Roundtrip learns the target density by generating a manifold from a base density to approximate the distribution of the observation data. In a series of experiments, Roundtrip achieves state-of-the-art performance in a diverse range of density estimation tasks.
Density estimation is a fundamental problem in statistics. Let $p_x(x)$ be a density on an $n$-dimensional Euclidean space $\mathbb{R}^n$. Our task is to estimate this density based on a set of independently and identically distributed data points drawn from it.
A large number of density estimators have been proposed. The histogram is perhaps the simplest nonparametric density estimator: it partitions the sample space into rectangular regions and estimates the density by the frequency of data points falling in each region. Data-driven histograms were studied in (scott1979optimal; lugosi1996consistency). Histogram-based methods are usually restricted to the univariate case due to their low efficiency. The kernel-based method (rosenblatt1956remarks; parzen1962estimation) is another popular nonparametric density estimator, where the density is estimated as a convolution of a kernel function with the empirical distribution of the observations. However, the success of the kernel density estimator (KDE) heavily depends on the bandwidth and the kernel function, and KDE is typically not effective in problems of dimension higher than five (lu2013multivariate). Density estimation for high-dimensional or highly structured data remains a challenging problem.
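As a concrete illustration of the kernel approach, here is a minimal one-dimensional Gaussian KDE sketch with the bandwidth chosen by Silverman's rule of thumb (the function name and interface are illustrative, not from any library or from this paper):

```python
import numpy as np

def gaussian_kde(samples, query, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate at the query points."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Silverman's rule of thumb for a Gaussian kernel in one dimension.
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # Convolution of the kernel with the empirical distribution of the samples.
    diffs = (np.asarray(query, dtype=float)[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=5000)
est = gaussian_kde(data, [0.0])   # true N(0,1) density at 0 is about 0.3989
```

The estimate is sensitive to the bandwidth choice, which is exactly the weakness noted above; in higher dimensions this sensitivity becomes much more severe.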
Recently, neural networks have been applied to constructing density estimators. There are mainly two families of such neural density estimators: autoregressive models (uria2016neural; germain2015made; papamakarios2017masked) and normalizing flows (rezende2015variational; balle2015density; dinh2016density). Autoregressive models decompose the joint density into a product of one-dimensional conditionals; each conditional probability is then modeled by a parametric density (e.g., a Gaussian or a mixture of Gaussians) whose parameters are learned by a neural network. Density estimators based on normalizing flows represent x as an invertible transformation of a latent variable z with known density, where the invertible transformation is a composition of a series of simple functions whose Jacobian is easy to compute. The parameters of these component functions are then learned by neural networks.
Kingma et al. (kingma2016improved) first pointed out that autoregressive models can be equivalently interpreted as normalizing flows in which the tractable Jacobian has a triangular structure. In principle, most current neural density estimators can be summarized as follows. Given a differentiable and invertible mapping $x = f(z)$ and a base density $p_z(z)$, the density of $x$ can be represented using the change of variable rule as
$$p_x(x) = p_z(z)\left|\det\left(\frac{\partial z}{\partial x}\right)\right| \qquad (1)$$
where the Jacobian matrix is $\partial z / \partial x = \partial f^{-1}(x) / \partial x$. To achieve efficient and practical computation of the Jacobian matrix, previous neural density estimators carefully designed model architectures that impose strong constraints on the Jacobian. For example, (berg2018sylvester) constructed the Jacobian as a low-rank perturbation of a diagonal matrix, (papamakarios2017masked; dinh2016density; kingma2016improved) used triangular Jacobians, and (kingma2018glow; karami2018generative) designed special convolutions to achieve a block-diagonal Jacobian. However, neural density estimators based on a tractable Jacobian have two major limitations. 1) The strong constraint on the Jacobian may sacrifice the expressiveness of the neural networks because of the induced feature dependency; for example, density estimators based on autoregressive models, which assume that later features depend only on earlier ones, are naturally sensitive to the order of the features. 2) The change of variable rule requires the base density and the target density to have equal dimension, and the computation of the Jacobian can still be inefficient in high-dimensional cases. No previous density estimator has considered the case where the dimensions of the base density and the target density are unequal.
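To make the change of variable rule concrete, the following sketch (not the paper's method) evaluates the density induced by a simple invertible affine map $x = Az + b$ of a standard Gaussian, applying the rule directly; all names are illustrative:

```python
import numpy as np

def pushforward_density(x, A, b):
    """Density of x = A z + b with z ~ N(0, I), via the change of variable rule."""
    d = len(b)
    z = np.linalg.solve(A, x - b)                       # invert the map: z = A^{-1}(x - b)
    log_base = -0.5 * (z @ z) - 0.5 * d * np.log(2 * np.pi)
    log_det = -np.log(abs(np.linalg.det(A)))            # |det dz/dx| = 1/|det A|
    return np.exp(log_base + log_det)

A = np.array([[2.0, 0.0], [0.0, 0.5]])
b = np.zeros(2)
p = pushforward_density(b, A, b)   # density at the mode; det A = 1, so p = 1/(2*pi)
```

For such a simple map the Jacobian determinant is trivial; the architectural tricks cited above exist precisely because computing it for a deep, expressive $f$ is the expensive part.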
Motivated by deep generative models such as GANs and the adversarial autoencoder (makhzani2015adversarial), we proposed Roundtrip as a universal neural density estimator (Figure 1). Two major characteristics distinguish Roundtrip from previous neural density estimators. 1) Roundtrip directly uses neural networks to generate samples that are similar to the observation data based on deep generative models; in contrast, previous neural density estimators use neural networks only to represent the component functions that build up the invertible transformation. 2) Roundtrip maps a base density in a low-dimensional space to the target density, which is approximated by a learned manifold, whereas previous methods require the base density and the target density to have equal dimension. More importantly, we proposed two strategies to estimate the density: importance sampling and a derived closed-form approximation. We summarize our major contributions in this study as follows.
We proposed Roundtrip as a universal neural density estimator based on deep generative models. Roundtrip requires far fewer model assumptions than previous neural density estimators.
We provided theoretical guarantees that ensure the feasibility of density estimation with deep generative models. Specifically, we proved that the change of variable rule used by previous neural density estimators is a special case of our Roundtrip framework.
We demonstrated the state-of-the-art performance of the Roundtrip model through a series of experiments, including density estimation tasks in simulations as well as in several real data applications.
Consider two random variables $z \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, where $z$ has a known density $p_z(z)$ (e.g., standard Gaussian) and $x$ is distributed according to a target density that we intend to estimate based on observations from it. We introduced two functions, $G: \mathbb{R}^m \to \mathbb{R}^n$ and $H: \mathbb{R}^n \to \mathbb{R}^m$, for learning the forward and backward mapping between the above two distributions. These two functions are the major components of Roundtrip and are implicitly learned by two neural networks (Figure 1). We denote $\tilde{x} = G(z)$ and $\tilde{z} = H(x)$. We assume that the forward mapping error follows a Gaussian distribution:
$$x = G(z) + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I_n) \qquad (2)$$
Typically, we set $m < n$, which means that $\tilde{x} = G(z)$ takes values in a manifold of $\mathbb{R}^n$ with intrinsic dimension $m$. Basically, the roundtrip model utilizes $G$ to produce a manifold and then approximates the target density as a mixture of Gaussians whose mixing density is the density induced on the manifold. In what follows, we set $p_z(z)$ to be a standard Gaussian $N(0, I_m)$. Then the target density can be expressed as
$$p_x(x) = \int p(x|z)\,p_z(z)\,dz = \mathbb{E}_{z \sim p_z(z)}\left[p(x|z)\right] \qquad (3)$$
where $p(x|z) = (2\pi\sigma^2)^{-n/2}\exp\left(-\|x - G(z)\|^2 / (2\sigma^2)\right)$. The density estimation problem has thus been transformed into computing the integral in equation (3). We can compute this integral approximately by either importance sampling or Laplace approximation.
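A minimal sketch of evaluating the Gaussian term $p(x|z)$ under the model assumption above; the value standing in for $G(z)$ is a placeholder, not the output of a trained network:

```python
import numpy as np

def log_p_x_given_z(x, Gz, sigma):
    """Log of p(x|z) = N(x; G(z), sigma^2 I) under the roundtrip error model."""
    n = x.size
    resid = x - Gz
    return -0.5 * (resid @ resid) / sigma**2 - 0.5 * n * np.log(2 * np.pi * sigma**2)

x = np.array([1.0, 2.0])
Gz = np.array([1.0, 2.0])        # placeholder: pretend G(z) reproduced x exactly
val = np.exp(log_p_x_given_z(x, Gz, sigma=1.0))
# At zero residual this equals (2*pi)^{-n/2} = 1/(2*pi) for n = 2.
```

Working in log space, as here, is the standard way to avoid underflow when $\|x - G(z)\|$ is large relative to $\sigma$.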
If we directly sample from $p_z(z)$ and discretize the expectation, then $p_x(x) \approx \frac{1}{N}\sum_{i=1}^{N} p(x|z_i)$, where $z_i \sim p_z(z)$. Such sampling may be extremely inefficient, as $p(x|z)$ typically takes low values at most values of $z$ sampled from $p_z(z)$. In importance sampling, we instead sample from an importance distribution $q(z)$ rather than the base density, which can be represented by
$$p_x(x) \approx \frac{1}{N}\sum_{i=1}^{N} p(x|z_i)\,w(z_i), \quad z_i \sim q(z) \qquad (4)$$
where $N$ is the sample size, $w(z) = p_z(z)/q(z)$ is the importance weight function, and the $z_i$ are samples from $q(z)$. We propose to set $q(z)$ to be a Student's t distribution centered at $\tilde{z} = H(x)$. This choice is motivated by the following considerations. 1) For a given $x$, $p(x|z)$ is likely to be maximized at values of $z$ near $\tilde{z} = H(x)$. 2) The Student's t distribution has a heavier tail than the Gaussian, which provides control of the variance of the summand in (4). See more details in Appendix A.
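The importance sampling scheme can be sketched in one dimension with plain numpy. This is an illustrative toy, not the paper's implementation: `G` and `H` below are simple stand-in functions, and with $G = H = \mathrm{id}$ and $z \sim N(0,1)$ the true marginal is $N(0, 1+\sigma^2)$, which gives a check on the estimate:

```python
import numpy as np
from math import gamma, sqrt, pi

def t_pdf(t, df):
    """Density of the standard Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t**2 / df) ** (-(df + 1) / 2)

def importance_estimate(x, G, H, sigma, N=40000, df=3, seed=0):
    """Approximate p(x) = E_{z~p_z}[p(x|z)] with a Student's t proposal at H(x)."""
    rng = np.random.default_rng(seed)
    center = H(x)
    z = center + rng.standard_t(df, size=N)               # z_i ~ q(z)
    p_x_given_z = np.exp(-0.5 * (x - G(z)) ** 2 / sigma**2) / sqrt(2 * pi * sigma**2)
    p_z = np.exp(-0.5 * z**2) / sqrt(2 * pi)              # base density N(0, 1)
    weights = p_z / t_pdf(z - center, df)                 # w(z_i) = p_z(z_i)/q(z_i)
    return np.mean(p_x_given_z * weights)

# Toy check: with G = H = identity, the true marginal density is N(0, 1 + sigma^2).
sigma = 0.5
est = importance_estimate(1.0, G=lambda z: z, H=lambda x: x, sigma=sigma)
true = np.exp(-1.0 / (2 * (1 + sigma**2))) / sqrt(2 * pi * (1 + sigma**2))
```

Centering the proposal at $H(x)$ is what makes this efficient: nearly all samples land where $p(x|z)$ is non-negligible, unlike naive sampling from the base density.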
We can also obtain an approximation to the integral in (3) by Laplace's method. To achieve this, we expand $G(z)$ around $\tilde{z} = H(x)$ to obtain a quadratic approximation to $\|x - G(z)\|^2$, which then leads to a multivariate Gaussian integral that is solvable in closed form. The expansion of $G(z)$ can be represented as
$$G(z) \approx G(\tilde{z}) + J(z - \tilde{z}) \qquad (5)$$
where $J$ is the Jacobian matrix of $G$ at $\tilde{z}$. Substituting (5) into $p(x|z)$, we have
Next, we made variable substitutions, where $I_m$ is the identity matrix. The integral over $z$ in (3) can now be solved by constructing a multivariate Gaussian distribution in the substituted variable $w$ in (8) as follows
where $|\Sigma|$ denotes the determinant of the covariance matrix. The constructed mean $\mu$ and covariance matrix $\Sigma$ of the multivariate Gaussian are given by
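The spirit of this step can be illustrated in one dimension: expand the log-integrand to second order around its mode and integrate the resulting Gaussian in closed form. This is an illustrative sketch of Laplace's method generally, not the paper's multivariate derivation:

```python
import numpy as np

def laplace_integral(log_f, z0, h=1e-4):
    """Approximate the integral of exp(log_f(z)) via a quadratic expansion at z0."""
    # Second derivative of log_f at the mode z0, by central finite differences.
    second = (log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2
    # Gaussian integral of exp(log_f(z0) + 0.5 * second * (z - z0)^2); second < 0.
    return np.exp(log_f(z0)) * np.sqrt(2 * np.pi / -second)

# For a purely Gaussian integrand the quadratic expansion is exact:
approx = laplace_integral(lambda z: -0.5 * z**2, z0=0.0)
exact = np.sqrt(2 * np.pi)
```

Because the model's $p(x|z)$ becomes exactly Gaussian in $z$ after the linearization in (5), the resulting integral has the closed form stated above rather than being merely approximate at that step.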
There are two important properties related to the closed-form approximation, which we list as follows.
The constructed covariance matrix $\Sigma$ is positive definite, with all eigenvalues positive, which ensures that the constructed multivariate Gaussian is valid. See proof in Appendix B1.
The change of variable rule represented by (1), where the mapping is a differentiable and invertible function, is a special case of our framework when $m = n$, $H = G^{-1}$, and we take the limit $\sigma^2 \to 0$. See proof in Appendix B2.
The key idea of Roundtrip is to approximate the target distribution as the convolution of a Gaussian with a distribution induced on a manifold by transforming a base distribution, where the transformation is learned by jointly training two GAN models (Figure 1). For the forward mapping function $G$ and its discriminator $D_x$, $G$ aims at generating samples that are similar to the observation data while $D_x$ tries to discern observation data (positive) from generated samples (negative). The backward mapping function $H$ and its discriminator $D_z$ follow the same training principle. The overall training process can be regarded as two min-max problems, $\min_G \max_{D_x} L(G, D_x)$ and $\min_H \max_{D_z} L(H, D_z)$, where
During training, we also minimize the roundtrip loss, which is built from the distances between $z$ and $H(G(z))$ and between $x$ and $G(H(x))$. The principle is to minimize the distance a data point travels when it goes through a roundtrip transformation between the two data domains; e.g., keeping $G(H(x))$ close to $x$ ensures that an observed data point stays close to the learned manifold after a roundtrip transformation. In practice, we used the $L_2$ loss for both terms, as minimizing the $L_2$ loss implies the data is drawn from a Gaussian distribution (mathieu2015deep), which exactly matches our model assumption. We denote the roundtrip loss as
where $\alpha$ and $\beta$ are two constant coefficients. The idea of the roundtrip loss, which exploits transitivity for regularizing structured data, can also be found in previous works (zhu2017unpaired; yi2017dualgan).
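A minimal numpy sketch of the $L_2$ roundtrip loss for a batch, assuming mapping functions `G` and `H` are given as callables (here simple inverse linear maps rather than trained networks, so the loss should be near zero):

```python
import numpy as np

def roundtrip_loss(G, H, z_batch, x_batch, alpha=10.0, beta=10.0):
    """alpha * ||z - H(G(z))||^2 + beta * ||x - G(H(x))||^2, averaged over the batch."""
    z_rt = H(G(z_batch))                                   # z -> x-domain -> back to z
    x_rt = G(H(x_batch))                                   # x -> z-domain -> back to x
    loss_z = np.mean(np.sum((z_batch - z_rt) ** 2, axis=1))
    loss_x = np.mean(np.sum((x_batch - x_rt) ** 2, axis=1))
    return alpha * loss_z + beta * loss_x

# With exactly inverse linear maps the roundtrip loss vanishes (up to float error).
G = lambda z: 2.0 * z + 1.0
H = lambda x: (x - 1.0) / 2.0
z = np.random.default_rng(0).normal(size=(8, 2))
x = G(np.random.default_rng(1).normal(size=(8, 2)))
loss = roundtrip_loss(G, H, z, x)
```

In actual training this term is added to the adversarial losses and both pairs of networks are updated against it jointly.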
Combining the adversarial training loss and the roundtrip loss, the full training loss is the sum of the two adversarial losses and the roundtrip loss. To achieve joint training of the two GAN models, we iteratively updated the parameters of the two generative models and the two discriminative models, respectively. The roundtrip transformation used in our density estimator can thus be represented as
The model architecture of Roundtrip is highly flexible. In most cases, when it is used for density estimation on vector-valued data, we used fully-connected networks for both the generative and the discriminative networks. Specifically, the $G$ network contains 10 fully-connected layers with 512 hidden nodes per layer, while the $H$ network contains 10 fully-connected layers with 256 hidden nodes per layer. The discriminative networks contain 4 fully-connected layers with 256 hidden nodes per layer; additional conditional inputs (e.g., class labels) are combined into the hidden representations of the networks by concatenation. Tensor-valued data such as images are flattened and reshaped when taken as input and output of the networks in Roundtrip, respectively. Similar to the model architecture in DCGAN (radford2015unsupervised), we used transposed convolutional layers for upsampling images from the latent space in the $G$ network, and traditional convolutional neural networks for the discriminative network on images, while $H$ still adopts a fully-connected architecture. Note that batch normalization (ioffe2015batch) is applied after each convolutional or transposed convolutional layer.
We test the performance of the Roundtrip model in a series of experiments, including simulation studies and real data studies. In these experiments, we compared Roundtrip to the widely used Gaussian kernel density estimator as well as several neural density estimators, including MADE (germain2015made), Real NVP (dinh2016density), and MAF (papamakarios2017masked); in outlier detection tasks we additionally compared with One-class SVM (scholkopf2001estimating) and Isolation Forest (liu2008isolation). Note that the default setting of the Roundtrip model was based on the importance sampling strategy. Results of the Roundtrip density estimator based on Laplace approximation are reported in Appendix C.
The neural networks in the Roundtrip model were implemented with TensorFlow (abadi2016tensorflow). The reproducible code of Roundtrip can be found at https://github.com/kimmo1019/Roundtrip. In all experiments, we set $\alpha = 10$ and $\beta = 10$ in equation (13). For the parameter $\sigma$ in our model assumption, we selected the value that maximizes the average likelihood on the validation set. The sample size $N$ in importance sampling is set to 40,000. An Adam optimizer with a learning rate of 0.0002 was used for backpropagation and updating model parameters. We took the Gaussian kernel density estimator (KDE) as a baseline, where the bandwidth is selected by Silverman's "rule of thumb" (silverman1986density) or Scott's rule (scott1992multivariate), and we present the one with better results. The three alternative neural density estimators (MADE, Real NVP, and MAF) were implemented through https://github.com/gpapamak/maf. In outlier detection tasks, we implemented One-class SVM and Isolation Forest using the scikit-learn library (scikit-learn) with default parameters. To ensure fair model comparison, both simulation and real data were randomly split into a 90% training set and a 10% test set. For neural density estimators including Roundtrip, 10% of the training set was kept as a validation set. The image datasets come with predefined training and test sets, which require no further data split.
For simulation datasets where the true density can be calculated, we evaluate different density estimators by calculating the Spearman (rank) correlation between the true density and the estimated density. For real data where the ground truth is not available, the average estimated log-density (natural log-likelihood) on the test set is used as the measurement. In the application of outlier detection, we measure performance by the precision at $k$, defined as the proportion of correct results in the top $k$ ranks, where $k$ is set to the number of outliers in the test set.
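The precision-at-$k$ metric for density-based outlier detection can be sketched as follows (an illustrative helper, not from the paper's code; outliers are the points with the lowest estimated density):

```python
import numpy as np

def precision_at_k(densities, is_outlier, k=None):
    """Fraction of true outliers among the k lowest-density points."""
    densities = np.asarray(densities, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    if k is None:
        k = int(is_outlier.sum())            # k = number of outliers in the test set
    top_k = np.argsort(densities)[:k]        # lowest estimated density first
    return is_outlier[top_k].mean()

densities = [0.9, 0.01, 0.8, 0.02, 0.7]      # two points with very low density
labels = [0, 1, 0, 1, 0]                     # which points are true outliers
prec = precision_at_k(densities, labels)     # k = 2, both low-density points are hits
```

With $k$ tied to the true outlier count, a perfect estimator scores 1.0 and a random ranking scores roughly the outlier fraction.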
We first designed three 2D simulation datasets to test the performance of different neural density estimators where the true density can be calculated.
a) Independent Gaussian mixture. , =1,2.
b) 8-octagon Gaussian mixture. where and , =1,…,8.
c) Involute. , where
20,000 points were sampled from the true data distribution. After model training, we directly estimated the density in a 2D bounded region (a 100×100 grid) with different methods (Figure 2). For the independent Gaussian mixture in case (a), Roundtrip clearly separates the independent components in the Gaussian mixture, while other neural density estimators either failed (MADE) or produced obvious trajectories between different components (Real NVP and MAF). Roundtrip captures a better density distribution even for the highly non-linear structure in case (c). We then took case (a) for further study by increasing the dimension up to 10, where the number of mixture modes grows rapidly with dimension. The performance of the kernel density estimator (KDE) decreases dramatically as the dimension increases. Roundtrip still achieves a Spearman correlation of 0.829 at dimension 10, compared to 0.669 for Real NVP, 0.595 for MAF, and 0.14 for KDE (see Appendix C).
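The evaluation metric used here, Spearman rank correlation between true and estimated densities on a grid, can be computed with plain numpy (an illustrative helper without tie handling, which is fine for continuous densities):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    ra, rb = rank(a), rank(b)
    ra -= ra.mean()
    rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

# Any monotone distortion of the true density still yields correlation 1,
# so the metric rewards getting the *ordering* of density values right.
true_density = np.array([0.05, 0.40, 0.30, 0.10, 0.15])
estimated = true_density ** 2            # monotone transform of the truth
rho = spearman(true_density, estimated)
```

Rank correlation is a sensible choice here because density estimators may be miscalibrated in scale while still ranking high- and low-density regions correctly.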
We collected five datasets (AReM, CASP, HEPMASS, BANK and YPMSD) from the UCI machine learning repository (Dua:2019, ) with dimensions ranging from 6 to 90 and sample size from 42,240 to 515,345 (see more details about data description and data preprocessing in Appendix D). Unlike simulation data, these real datasets have no ground truth for the density. Hence, we evaluated different methods by calculating the average log-likelihood on the test set. Table 1 illustrates the performance of Roundtrip and other neural density estimators. A Gaussian kernel density estimator (KDE) fitted to the training data is reported as a baseline. Roundtrip outperforms other neural density estimators on every dataset, which again demonstrates the superiority of our model.
Performance of different methods on five UCI datasets. The average log-likelihood (in nats) and 2 standard deviations are shown. The model with the best performance is shown in bold.
We further applied the Roundtrip model to generate images and assess the quality of the generated images by their estimated density. Deep generative models have demonstrated their power in generating synthetic images; however, a deep generative model alone cannot provide quality scores for generated images. Here, we propose to use our Roundtrip method to generate images together with a quality score (the estimated density of the image). We tested this approach on two commonly used image datasets, MNIST (lecun2010mnist) and CIFAR-10 (krizhevsky2009learning), in each of which the images come from 10 distinct classes. The Roundtrip model was modified by introducing an additional one-hot encoded class label y as input to the generative and discriminative networks, and convolutional layers were used in the networks handling images (see Methods). We then model the conditional density p(x|y), with y ~ Cat(10), where Cat(10) denotes a categorical distribution with 10 distinct classes. We use this modified Roundtrip model to simultaneously generate images conditional on a class label and compute the within-class density of each image. The competing neural density estimators typically require several tricks to achieve image generation and density estimation, including rescaling pixel values, transforming the bounded pixel values into an unbounded logit space, and adding uniform noise; Roundtrip requires no additional transformation except rescaling. In Figure 3, the generated images of each class are sorted by decreasing likelihood. Images generated by Roundtrip are more realistic than those generated by MAF (the best among the alternative neural density estimators; see Figure 2 and Table 1). Furthermore, the density provided by Roundtrip correlates well with the quality of the generated images.
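The pixel preprocessing required by the compared flow-based estimators (dequantize with uniform noise, rescale, map into unbounded logit space) can be sketched as follows; the function names and the boundary constant `lam` are illustrative choices, not taken from the paper:

```python
import numpy as np

def logit_preprocess(pixels, lam=1e-6, seed=0):
    """Dequantize 8-bit pixels, rescale into (0, 1), and map to logit space."""
    rng = np.random.default_rng(seed)
    x = (np.asarray(pixels, dtype=float) + rng.uniform(size=np.shape(pixels))) / 256.0
    x = lam + (1 - 2 * lam) * x               # keep values strictly inside (0, 1)
    return np.log(x) - np.log1p(-x)           # logit(x), unbounded

def logit_inverse(y, lam=1e-6):
    """Map logit-space values back to [0, 1] pixel intensities."""
    x = 1.0 / (1.0 + np.exp(-y))
    return (x - lam) / (1 - 2 * lam)

img = np.array([[0, 128], [255, 64]])
y = logit_preprocess(img)
back = logit_inverse(y)   # recovers the dequantized intensities in [0, 1]
```

The dequantization noise makes discrete pixel values continuous, and the logit map removes the boundary at 0 and 1; both are needed because change-of-variable models assume an unbounded continuous target space.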
Finally, we applied the Roundtrip model to an outlier detection task, where a data point with an extremely low density value is regarded as a likely outlier. We tested this method on three outlier detection datasets (Shuttle, Mammography, and ForestCover) from the ODDS database (http://odds.cs.stonybrook.edu/). Each dataset is split into training, validation, and test sets (details of the data description can be found in Appendix D). Besides the neural density estimators, we also introduced two baselines, One-class SVM (scholkopf2001estimating) and Isolation Forest (liu2008isolation). The results are shown in Table 2. Roundtrip achieves the best or comparable results across the outlier detection tasks. In particular, on the ForestCover dataset, in which the outlier percentage is only 0.9%, Roundtrip still achieves a precision of 17.7%, while the precision of the other neural density estimators is below 6%.
We proposed Roundtrip as a novel neural density estimator based on deep generative models. Unlike prior studies that model an invertible transformation from a base density, with parameters learned by neural networks, Roundtrip directly learns the joint distribution of the data through deep generative models. Roundtrip outperforms previous neural density estimators in a variety of density estimation tasks, including simulation and real data studies and an outlier detection application. We also demonstrated the high flexibility of Roundtrip, as it can be used to estimate densities for both vector-valued and tensor-valued data (e.g., images).
This work is supported by NSF grants DMS1721550, DMS1811920, National Key Research and Development Program of China No. 2018YFC0910404, the National Natural Science Foundation of China Nos. 61873141, 61721003, 61573207, and the Tsinghua-Fuzhou Institute for Data Technology.
E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849–2857, 2017.