Roundtrip: A Deep Generative Neural Density Estimator

04/20/2020
by Qiao Liu, et al. (Tsinghua University, Stanford University)

Density estimation is a fundamental problem in both statistics and machine learning. We propose Roundtrip, a universal neural density estimator based on deep generative models. Roundtrip exploits the advantage of GANs for generating samples and estimates density by either importance sampling or Laplace approximation. Unlike prior studies that model the target density by constructing a tractable Jacobian with respect to a base density (e.g., Gaussian), Roundtrip learns the target density by generating a manifold from a base density to approximate the distribution of the observed data. In a series of experiments, Roundtrip achieves state-of-the-art performance in a diverse range of density estimation tasks.


Code Repositories

Roundtrip: a deep generative neural density estimator
https://github.com/kimmo1019/Roundtrip

1 Introduction

Density estimation is a fundamental problem in statistics. Let $p_x(x)$ be a density on the $n$-dimensional Euclidean space $\mathbb{R}^n$. Our task is to estimate this density based on a set of independently and identically distributed data points drawn from it.

A large number of density estimators have been proposed. The histogram is perhaps the simplest nonparametric density estimator: it partitions $\mathbb{R}^n$ into rectangular regions and estimates the density by the frequency of data points falling in each region. Data-driven histograms were studied in (scott1979optimal; lugosi1996consistency). Histogram-based methods are usually restricted to the univariate case due to their low efficiency. Kernel-based methods (rosenblatt1956remarks; parzen1962estimation) are another popular family of nonparametric density estimators, in which the density is estimated as a convolution of a kernel function with the empirical distribution of the observations. However, the success of the kernel density estimator (KDE) depends heavily on the bandwidth and the kernel function, and KDE is typically not effective in problems of dimension higher than five (lu2013multivariate). Density estimation for high-dimensional or highly structured data remains a challenging problem.

Recently, neural networks have been applied to construct density estimators. There are mainly two families of such neural density estimators: autoregressive models (uria2016neural; germain2015made; papamakarios2017masked) and normalizing flows (rezende2015variational; balle2015density; dinh2016density). Autoregression-based neural density estimators decompose the density into a product of conditional densities via the probability chain rule, $p(x) = \prod_i p(x_i \mid x_{1:i-1})$. Each conditional probability is then modeled by a parametric density (e.g., a Gaussian or a mixture of Gaussians) whose parameters are learned by a neural network. Density estimators based on normalizing flows represent x as an invertible transformation of a latent variable z with known density, where the invertible transformation is a composition of a series of simple functions whose Jacobians are easy to compute. The parameters of these component functions are then learned by neural networks.

Kingma et al. (kingma2016improved) first pointed out that autoregressive models can be equivalently interpreted as normalizing flows in which the tractable Jacobian has a triangular structure. In principle, most current neural density estimators can be summarized as follows. Given a differentiable and invertible mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ and a base density $p_z(z)$, the density of $x = f(z)$ can be represented using the change of variable rule as

p_x(x) = p_z(z) \left| \det\frac{\partial f(z)}{\partial z} \right|^{-1}    (1)

where the Jacobian matrix is $\partial f(z)/\partial z \in \mathbb{R}^{n \times n}$. To achieve efficient and practical computation of the Jacobian, previous neural density estimators carefully designed model architectures that impose strong constraints on the Jacobian matrix. For example, (berg2018sylvester) constructed the Jacobian as a low-rank perturbation of a diagonal matrix, (papamakarios2017masked; dinh2016density; kingma2016improved) used triangular Jacobians, and (kingma2018glow; karami2018generative) designed special convolutions to achieve a block-diagonal Jacobian. However, neural density estimators based on a tractable Jacobian may have two major limitations. 1) The strong constraints on the Jacobian may sacrifice the expressiveness of the neural networks because of the induced feature dependency; for example, density estimators based on autoregressive models, which assume that later features depend only on earlier ones, are naturally sensitive to the order of the features. 2) The change of variable rule requires the base density and the target density to have equal dimension, and the computation of the Jacobian may still be inefficient in high-dimensional cases. No previous density estimator has considered the case where the dimensions of the base density and the target density are not equal.
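As a concrete illustration of Eq. (1), the sketch below evaluates the density induced by a hand-picked invertible affine map; the map and its parameters are our own illustrative assumptions, not part of any of the cited models, which compose many learned invertible layers.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Change-of-variable rule (Eq. 1) for an illustrative invertible affine
# map f(z) = A z + b with a standard Gaussian base density.
A = np.array([[2.0, 0.3],
              [0.0, 0.5]])        # invertible: det(A) = 1.0 != 0
b = np.array([1.0, -1.0])

def log_density_x(x):
    z = np.linalg.solve(A, x - b)                    # z = f^{-1}(x)
    log_pz = multivariate_normal.logpdf(z, mean=np.zeros(2))
    log_abs_det_J = np.log(abs(np.linalg.det(A)))    # Jacobian of f is A here
    return log_pz - log_abs_det_J                    # Eq. (1) in log space

print(log_density_x(np.array([1.5, -0.8])))
```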

Motivated by advances in deep generative models, such as GAN (goodfellow2014generative), CycleGAN (zhu2017unpaired) and the adversarial autoencoder (makhzani2015adversarial), we propose Roundtrip, a universal neural density estimator based on deep generative models (Figure 1). Two major characteristics distinguish Roundtrip from previous neural density estimators. 1) Roundtrip directly uses neural networks to generate samples that resemble the observed data, whereas previous neural density estimators use neural networks only to represent the component functions that build up an invertible transformation. 2) Roundtrip maps a base density in a low-dimensional space to the target density, which is approximated by a learned manifold, while previous methods require the base density and target density to have equal dimension. Moreover, we propose two strategies to estimate the density: importance sampling and a derived closed-form (Laplace) approximation. We summarize the major contributions of this study as follows.

  • We propose Roundtrip, a universal neural density estimator based on deep generative models. Roundtrip requires far fewer model assumptions than previous neural density estimators.

  • We provide theoretical guarantees that ensure the feasibility of density estimation with deep generative models. In particular, we prove that the change of variable rule underlying previous neural density estimators is a special case of our Roundtrip framework.

  • We demonstrate the state-of-the-art performance of the Roundtrip model through a series of experiments, including density estimation tasks in simulations as well as in several real data applications.

2 Methods

2.1 Density modeling in Roundtrip

Consider two random variables $z \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$, where z has a known density $\pi(z)$ (e.g., a standard Gaussian) and x is distributed according to a target density $p_x(x)$ that we intend to estimate based on observations drawn from it. We introduce two functions, $G(\cdot)$ and $H(\cdot)$, for learning the forward and backward mappings between the two distributions. These two functions are the major components of Roundtrip and are implicitly learned by two neural networks (Figure 1). We denote $\tilde{x} = G(z)$ and $\tilde{z} = H(x)$. We assume that the forward mapping error follows a Gaussian distribution:

x = G(z) + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I_n)    (2)

Typically, we set $m < n$, which means that $\tilde{x} = G(z)$ takes values in a manifold of $\mathbb{R}^n$ with intrinsic dimension m. This roundtrip model uses $G(z)$ to produce a manifold and then approximates the target density as a mixture of Gaussians whose mixing density is the density induced on the manifold. In what follows, we set $\pi(z)$ to be a standard Gaussian $N(0, I_m)$. The target density can then be expressed as

p_x(x) = \int p_x(x|z)\,\pi(z)\,dz = \mathbb{E}_{z \sim \pi(z)}[p_x(x|z)]    (3)

where $p_x(x|z) = (2\pi\sigma^2)^{-n/2} e^{-\|x - G(z)\|^2 / (2\sigma^2)}$. The density estimation problem has thus been transformed into computing the integral in equation (3). We can compute this integral approximately by either importance sampling or Laplace approximation.

Figure 1: Overview of the Roundtrip framework.

Importance sampling

If we directly sample from $\pi(z)$ by discretizing the expectation, then $\hat{p}_x(x) = \frac{1}{N}\sum_{i=1}^{N} p_x(x|z_i)$ with $z_i \sim \pi(z)$. Such sampling can be extremely inefficient, as $p_x(x|z)$ typically takes low values for most z sampled from $\pi(z)$. In importance sampling, we instead sample from an importance distribution $q(z)$ rather than the base density, which gives

p_x(x) = \mathbb{E}_{z \sim q(z)}\left[ p_x(x|z)\,\frac{\pi(z)}{q(z)} \right] \approx \frac{1}{N}\sum_{i=1}^{N} p_x(x|z_i)\,\frac{\pi(z_i)}{q(z_i)}, \quad z_i \sim q(z)    (4)

where N is the sample size, $\pi(z)/q(z)$ is the importance weight function, and the $z_i$ are samples from $q(z)$. We propose to set $q(z)$ to be a Student's t distribution centered at $\tilde{z} = H(x)$. This choice is motivated by the following considerations. 1) For a given x, $p_x(x|z)$ is likely to be maximized at values of z near $\tilde{z} = H(x)$. 2) The Student's t distribution has heavier tails than the Gaussian, which provides control of the variance of the summand in (4). See Appendix A for more details.
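A minimal sketch of this estimator is given below, assuming the trained mappings `G` and `H` are available as callables (batched $z \to x$ and pointwise $x \to z$, respectively); the function name, the degrees of freedom of the Student's t proposal, and the default σ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal, multivariate_t

def log_density_is(x, G, H, m, sigma=0.5, N=40000, df=3):
    """Importance-sampling estimate of Eq. (4), returned in log space.
    G: (N, m) -> (N, n) forward mapping; H: (n,) -> (m,) backward mapping."""
    q = multivariate_t(loc=H(x), shape=np.eye(m), df=df)   # proposal centered at H(x)
    z = q.rvs(size=N)                                      # z_i ~ q(z)
    # p(x|z_i) = N(x; G(z_i), sigma^2 I) = N(G(z_i); x, sigma^2 I) by symmetry
    log_p_x_given_z = multivariate_normal.logpdf(G(z), mean=x, cov=sigma**2)
    log_pi = multivariate_normal.logpdf(z, mean=np.zeros(m))
    log_w = log_pi - q.logpdf(z)                           # log importance weights
    terms = log_p_x_given_z + log_w
    mx = terms.max()                                       # log-sum-exp for stability
    return mx + np.log(np.mean(np.exp(terms - mx)))
```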

Laplace approximation

We can also obtain an approximation to the integral in (3) by Laplace's method. To achieve this, we expand $G(z)$ around $\tilde{z} = H(x)$ to obtain a quadratic approximation to the exponent of $p_x(x|z)$, which then leads to a multivariate Gaussian integral that is solvable in closed form. The expansion of $G(z)$ can be represented as

G(z) \approx G(\tilde{z}) + J(z - \tilde{z})    (5)

where $J = \partial G(z)/\partial z\,|_{z=\tilde{z}} \in \mathbb{R}^{n \times m}$ is the Jacobian matrix at $\tilde{z}$. Substituting (5) into $p_x(x|z)$, we have

p_x(x|z) \approx (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{\|x - G(\tilde{z}) - J(z - \tilde{z})\|^2}{2\sigma^2} \right)    (6)

Next, we make the variable substitution

v = x - G(\tilde{z}) + J\tilde{z}    (7)

Substituting (6) and (7) into equation (3), we finally get

p_x(x) \approx (2\pi)^{-m/2}(2\pi\sigma^2)^{-n/2} \int \exp\left( -\frac{\|v - Jz\|^2}{2\sigma^2} - \frac{\|z\|^2}{2} \right) dz    (8)

The integral over z in (8) can now be solved by matching its exponent to that of a multivariate Gaussian

N(z; \mu, \Sigma) = (2\pi)^{-m/2} |\det\Sigma|^{-1/2} \exp\left( -\frac{1}{2}(z - \mu)^T \Sigma^{-1} (z - \mu) \right)    (9)

where $|\det\Sigma|$ denotes the determinant of the covariance matrix. Completing the square shows that the constructed mean and covariance matrix of this multivariate Gaussian should be

\mu = \frac{1}{\sigma^2}\Sigma J^T v, \quad \Sigma = \sigma^2 (J^T J + \sigma^2 I_m)^{-1}    (10)

where $I_m$ is the $m \times m$ identity matrix. Substituting (9) into (3), we obtain the final closed-form expression for the density at a point x:

p_x(x) \approx (2\pi\sigma^2)^{-n/2} |\det\Sigma|^{1/2} \exp\left( -\frac{1}{2}\left( \frac{v^T v}{\sigma^2} - \mu^T \Sigma^{-1} \mu \right) \right)    (11)

There are two important properties of this closed-form approximation, which we list as follows.

  • The constructed covariance matrix $\Sigma$ is positive definite, with all eigenvalues positive, which ensures that the constructed multivariate Gaussian is valid. See the proof in Appendix B1.

  • The change of variable rule in (1), where $f$ is a differentiable and invertible function, is a special case of our framework obtained by setting $m = n$, $G = f$, and taking the limit $\sigma \to 0$. See the proof in Appendix B2.
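Under the same assumptions as the importance-sampling sketch above, the closed-form approximation of Eqs. (5)-(11) can be written directly; `jacobian` stands for any routine (e.g., autodiff) that returns the n×m Jacobian of G at a point, and is our own placeholder.

```python
import numpy as np

def log_density_laplace(x, G, H, jacobian, sigma=0.5):
    """Closed-form Laplace approximation of p_x(x), Eqs. (5)-(11).
    Pointwise version: G maps (m,) -> (n,) and H maps (n,) -> (m,)."""
    z_tilde = H(x)                       # expansion point z~ = H(x), Eq. (5)
    J = jacobian(z_tilde)                # n x m Jacobian of G at z~
    n, m = J.shape
    v = x - G(z_tilde) + J @ z_tilde     # substitution of Eq. (7)
    Sigma = sigma**2 * np.linalg.inv(J.T @ J + sigma**2 * np.eye(m))  # Eq. (10)
    mu = Sigma @ J.T @ v / sigma**2                                   # Eq. (10)
    _, logdet = np.linalg.slogdet(Sigma)            # Sigma is positive definite
    quad = v @ v / sigma**2 - mu @ np.linalg.solve(Sigma, mu)
    # log of Eq. (11)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) + 0.5 * logdet - 0.5 * quad
```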

2.2 Adversarial training loss

The key idea of Roundtrip is to approximate the target distribution as the convolution of a Gaussian with a distribution induced on a manifold by transforming a base distribution, where the transformation is learned by jointly training two GAN models (Figure 1). For the forward mapping function $G$ and its discriminator $D_x$, $G$ aims to generate samples similar to the observed data while $D_x$ tries to discern observed data (positive) from generated samples (negative). The backward mapping function $H$ and its discriminator $D_z$ follow the same training principle. The overall training process can be regarded as two min-max (adversarial) problems, one for each GAN, where in the least-squares form the discriminators minimize

L(D_x) = \mathbb{E}_{x \sim p_x(x)}[(D_x(x) - 1)^2] + \mathbb{E}_{z \sim \pi(z)}[D_x(G(z))^2]
L(D_z) = \mathbb{E}_{z \sim \pi(z)}[(D_z(z) - 1)^2] + \mathbb{E}_{x \sim p_x(x)}[D_z(H(x))^2]    (12)

while the generators G and H minimize $\mathbb{E}_{z \sim \pi(z)}[(D_x(G(z)) - 1)^2]$ and $\mathbb{E}_{x \sim p_x(x)}[(D_z(H(x)) - 1)^2]$, respectively. Note that the least-squares loss used in equation (12) is discussed in detail in LSGAN (mao2017least).
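These objectives can be written compactly as below; this is a sketch in the LSGAN form the paper cites, with `G`, `H`, `D_x`, `D_z` assumed to be Keras models (our naming, not necessarily the released code's).

```python
import tensorflow as tf

# Least-squares adversarial losses in the spirit of Eq. (12). Each
# discriminator pushes real inputs toward 1 and generated ones toward 0;
# each generator pushes the scores of its outputs toward 1.
def disc_loss(D, real, fake):
    return (tf.reduce_mean(tf.square(D(real) - 1.0))
            + tf.reduce_mean(tf.square(D(fake))))

def gen_loss(D, fake):
    return tf.reduce_mean(tf.square(D(fake) - 1.0))

# Forward GAN:  disc_loss(D_x, x_batch, G(z_batch)), gen_loss(D_x, G(z_batch))
# Backward GAN: disc_loss(D_z, z_batch, H(x_batch)), gen_loss(D_z, H(x_batch))
```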

2.3 Roundtrip loss

During training, we also aim to minimize the roundtrip loss, defined through the distances $\|x - G(H(x))\|$ and $\|z - H(G(z))\|$. The principle is to minimize the distance incurred when a data point goes through a roundtrip transformation between the two data domains; e.g., keeping $\|x - G(H(x))\|$ small ensures that an observed data point stays close to the learned manifold after a roundtrip transformation. In practice, we use the $L_2$ loss for both terms, as minimizing the $L_2$ loss implies that the data are drawn from a Gaussian distribution (mathieu2015deep), which exactly matches our model assumption. We denote the roundtrip loss as

L^{RT}(G, H) = \alpha\,\mathbb{E}_{x \sim p_x(x)}\|x - G(H(x))\|_2^2 + \beta\,\mathbb{E}_{z \sim \pi(z)}\|z - H(G(z))\|_2^2    (13)

where α and β are two constant coefficients. The idea of exploiting transitivity for regularizing structured data with a roundtrip loss can also be found in previous works (zhu2017unpaired; yi2017dualgan).
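The roundtrip loss of Eq. (13) is then a pair of squared-L2 reconstruction terms; α = β = 10 matches the setting reported in Section 3.1, and the rest is a sketch under the same Keras-model assumption as above.

```python
import tensorflow as tf

def roundtrip_loss(G, H, x_batch, z_batch, alpha=10.0, beta=10.0):
    # Eq. (13): squared L2 distance after a roundtrip through both domains
    loss_x = tf.reduce_mean(tf.reduce_sum(tf.square(x_batch - G(H(x_batch))), axis=-1))
    loss_z = tf.reduce_mean(tf.reduce_sum(tf.square(z_batch - H(G(z_batch))), axis=-1))
    return alpha * loss_x + beta * loss_z
```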

2.4 Full training loss

Combining the adversarial training losses and the roundtrip loss, we obtain the full training loss $L_{GAN}(G, D_x) + L_{GAN}(H, D_z) + L^{RT}(G, H)$. To jointly train the two GAN models, we iteratively update the parameters of the two generative models and the two discriminative models, respectively. Thus, the roundtrip transformation used in our density estimator is given by the optimized pair

(G^*, H^*) = \arg\min_{G,H}\max_{D_x,D_z}\; L_{GAN}(G, D_x) + L_{GAN}(H, D_z) + L^{RT}(G, H)    (14)

2.5 Model architecture

The architecture of the Roundtrip model is highly flexible. In most density estimation tasks with vector-valued data, we use fully-connected networks for both the generative and the discriminative networks. Specifically, the $G(\cdot)$ network contains 10 fully-connected layers with 512 hidden nodes per layer, while the $H(\cdot)$ network contains 10 fully-connected layers with 256 hidden nodes per layer. The $D_x(\cdot)$ network contains 4 fully-connected layers with 256 hidden nodes per layer, while the $D_z(\cdot)$ network contains 2 fully-connected layers with 128 hidden nodes per layer. A leaky-ReLU activation function is applied after each layer.
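A sketch of these fully-connected networks follows; the layer counts and widths are taken from the text, while the input/output dimensions and other details (e.g., the linear output layer) are illustrative assumptions.

```python
import tensorflow as tf

def make_mlp(in_dim, out_dim, n_layers, width):
    inputs = tf.keras.Input(shape=(in_dim,))
    h = inputs
    for _ in range(n_layers):
        h = tf.keras.layers.Dense(width)(h)
        h = tf.keras.layers.LeakyReLU()(h)      # leaky-ReLU after each layer
    return tf.keras.Model(inputs, tf.keras.layers.Dense(out_dim)(h))

m, n = 10, 20                                   # latent / data dims (assumed)
G   = make_mlp(m, n, n_layers=10, width=512)    # forward mapping z -> x
H   = make_mlp(n, m, n_layers=10, width=256)    # backward mapping x -> z
D_x = make_mlp(n, 1, n_layers=4,  width=256)    # discriminator on data space
D_z = make_mlp(m, 1, n_layers=2,  width=128)    # discriminator on latent space
```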

We also extended Roundtrip to estimate the density of tensor-valued data (e.g., images) by introducing a one-hot encoded class label y as an additional input to both the $G$ and $H$ networks, in the manner of a conditional GAN (mirza2014conditional). The label y is combined with the hidden representations in the $G$ and $H$ networks by concatenation. Compared to vector-valued data, tensor-valued data such as images are flattened and reshaped when taken as input and output of the networks in Roundtrip, respectively. Similar to the architecture of DCGAN (radford2015unsupervised), we use transposed convolutional layers in the $G$ network to upsample images from the latent space. We use traditional convolutional neural networks for $D_x$, while $H$ still adopts a fully-connected architecture. Note that batch normalization (ioffe2015batch) is applied after each convolutional or transposed convolutional layer.
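For the image case, the conditioning-by-concatenation and DCGAN-style upsampling described above might look as follows; the layer sizes and MNIST-shaped output are illustrative assumptions rather than the exact released architecture.

```python
import tensorflow as tf

z_in = tf.keras.Input(shape=(100,))                  # latent code (assumed size)
y_in = tf.keras.Input(shape=(10,))                   # one-hot class label y
h = tf.keras.layers.Concatenate()([z_in, y_in])      # y joins the hidden representation
h = tf.keras.layers.Dense(7 * 7 * 128)(h)
h = tf.keras.layers.BatchNormalization()(h)          # BN after each dense/conv block
h = tf.keras.layers.LeakyReLU()(h)
h = tf.keras.layers.Reshape((7, 7, 128))(h)
h = tf.keras.layers.Conv2DTranspose(64, 5, strides=2, padding="same")(h)  # 14 x 14
h = tf.keras.layers.BatchNormalization()(h)
h = tf.keras.layers.LeakyReLU()(h)
img = tf.keras.layers.Conv2DTranspose(1, 5, strides=2, padding="same",
                                      activation="tanh")(h)               # 28 x 28
G_cond = tf.keras.Model([z_in, y_in], img)           # conditional generator G(z, y)
```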

3 Results

3.1 Experiment setup

We test the performance of the Roundtrip model in a series of experiments, including simulation studies and real data studies. In these experiments, we compare Roundtrip to the widely used Gaussian kernel density estimator as well as several neural density estimators, including MADE (germain2015made), Real NVP (dinh2016density) and MAF (papamakarios2017masked). In the outlier detection experiment, we additionally compare to two commonly used outlier detection methods: One-class SVM (scholkopf2001estimating) and Isolation Forest (liu2008isolation). Note that the default setting of the Roundtrip model is based on the importance sampling strategy; results of the Roundtrip density estimator based on the Laplace approximation are reported in Appendix C.

The neural networks in the Roundtrip model were implemented in TensorFlow (abadi2016tensorflow); the reproducible code of Roundtrip can be found at https://github.com/kimmo1019/Roundtrip. In all experiments, we set α = 10 and β = 10 in equation (13). The parameter σ in our model assumption (2) was selected as the value that maximizes the average likelihood on the validation set. The sample size N in importance sampling is set to 40,000. An Adam optimizer with a learning rate of 0.0002 was used for backpropagation and updating model parameters. We took the Gaussian kernel density estimator (KDE) as a baseline, with the bandwidth selected by Silverman's "rule of thumb" (silverman1986density) or Scott's rule (scott1992multivariate), whichever gave better results. The three alternative neural density estimators (MADE, Real NVP, and MAF) were run using the implementation at https://github.com/gpapamak/maf. In the outlier detection tasks, we implemented One-class SVM and Isolation Forest using the scikit-learn library (scikit-learn) with default parameters. To ensure fair model comparison, both simulation and real data were randomly split into a 90% training set and a 10% test set; for neural density estimators including Roundtrip, 10% of the training set was held out as a validation set. The image datasets come with predefined training and test sets, so no further data split was required.
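For reference, the KDE baseline with either bandwidth rule is available directly in SciPy; the random arrays below are stand-ins for the actual training/test splits.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(size=(900, 6))     # stand-in for a 90% training split
test = rng.normal(size=(100, 6))      # stand-in for a 10% test split

# scipy expects data with shape (dim, n_samples) and supports both rules
kde = gaussian_kde(train.T, bw_method="silverman")   # or bw_method="scott"
avg_log_likelihood = np.log(kde(test.T)).mean()      # test-set average log-likelihood
print(avg_log_likelihood)
```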

3.2 Evaluation

For simulation datasets where the true density can be calculated, we evaluate different density estimators by the Spearman (rank) correlation between the true density and the estimated density. For real data, where the ground truth is not available, the average estimated log-likelihood (in nats) on the test set is used as the measurement. In the outlier detection application, we measure performance by the precision at k, defined as the proportion of correct results in the top k ranks; we set k to the number of outliers in the test set.
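Both evaluation metrics are easy to make precise in code; the sketch below shows the Spearman correlation and the precision-at-k computation described above (function names are ours).

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_score(true_density, est_density):
    # rank correlation between true and estimated densities (simulations)
    rho, _ = spearmanr(true_density, est_density)
    return rho

def precision_at_k(is_outlier, est_density, k):
    # fraction of the k lowest-density test points that are true outliers
    lowest_k = np.argsort(est_density)[:k]
    return float(np.mean(is_outlier[lowest_k]))
```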

3.3 Simulation studies

We first designed three 2D simulation datasets to test the performance of different neural density estimators where the true density can be calculated.

a) Independent Gaussian mixture: each coordinate is drawn independently from a two-component Gaussian mixture, so the d-dimensional version has 2^d modes (see the sampler sketched after this list).

b) 8-octagon Gaussian mixture: a mixture of eight Gaussian components whose means are equally spaced on a circle, forming an octagon.

c) Involute: points distributed along an involute (spiral) curve with additive Gaussian noise, a highly non-linear structure.
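As an example of how such simulation data can be generated, here is a sampler in the style of case (a); the component means and variances are illustrative assumptions, since only the mixture structure is described above.

```python
import numpy as np

def sample_indep_gaussian_mixture(n_samples, dim=2, sep=3.0, seed=0):
    """Each coordinate independently picks one of two components, so the
    joint density has 2**dim modes (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice([-sep, sep], size=(n_samples, dim))  # one mode per coordinate
    return centers + rng.normal(size=(n_samples, dim))        # unit-variance noise

x_train = sample_indep_gaussian_mixture(20000)
```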

20,000 points were sampled from each true data distribution. After model training, we directly estimated the density over a 2D bounded region (a 100×100 grid) with different methods (Figure 2). For the independent Gaussian mixture in case (a), Roundtrip clearly separates the independent components of the Gaussian mixture, while other neural density estimators either fail (MADE) or produce visible trajectories between components (Real NVP and MAF). Roundtrip also captures a better density distribution for the highly non-linear structure in case (c). We then took case (a) for a further study by increasing the dimension up to 10 (containing 2^10 modes). The performance of the kernel density estimator (KDE) decreases dramatically as the dimension increases. Roundtrip still achieves a Spearman correlation of 0.829 at dimension 10, compared to 0.669 for Real NVP, 0.595 for MAF and 0.14 for KDE (see Appendix C).

Figure 2: True density and densities estimated by different neural density estimators on three simulation datasets. Density plots are shown on a 100×100 grid over a 2D bounded region.

3.4 Real data studies

UCI datasets

We collected five datasets (AReM, CASP, HEPMASS, BANK and YPMSD) from the UCI machine learning repository (Dua:2019), with dimensions ranging from 6 to 90 and sample sizes from 42,240 to 515,345 (see Appendix D for data description and preprocessing details). Unlike the simulation data, these real datasets have no ground-truth density, so we evaluated the different methods by the average log-likelihood on the test set. Table 1 shows the performance of Roundtrip and the other neural density estimators; a Gaussian kernel density estimator (KDE) fitted to the training data is reported as a baseline. Roundtrip outperforms the other neural density estimators on every dataset, which again demonstrates the superiority of our model.

            AReM         CASP         HEPMASS      BANK         YPMSD
KDE         6.26±0.07    20.47±0.10   -25.46±0.03  15.84±0.12   247.03±0.61
MADE        6.00±0.11    21.82±0.23   -15.15±0.02  14.97±0.53   273.20±0.35
Real NVP    9.52±0.18    26.81±0.15   -18.71±0.02  26.33±0.22   287.74±0.34
MAF         9.49±0.17    27.61±0.13   -17.39±0.02  20.09±0.20   290.76±0.33
Roundtrip   11.74±0.04   28.38±0.08   -4.18±0.02   35.16±0.14   297.98±0.52

Table 1: Performance of different methods on five UCI datasets. The average log-likelihood (in nats) ± 2 standard deviations is shown. The model with the best performance on each dataset is shown in bold in the original.

Image datasets

We further applied the Roundtrip model to generate images and to assess the quality of the generated images by their estimated density. Deep generative models have demonstrated their power in generating synthetic images; however, a deep generative model alone cannot provide a quality score for a generated image. Here, we propose to use Roundtrip to generate images together with a quality score (e.g., the density of the image). We test this approach on two commonly used image datasets, MNIST (lecun2010mnist) and CIFAR-10 (krizhevsky2009learning), in each of which the images come from 10 distinct classes. The Roundtrip model was modified by introducing an additional one-hot encoded class label y to both the G and H networks, with convolutional layers used in G and D_x (see Methods). We then model the density conditional on the class label, p_x(x|y) with y ∼ Cat(10), where Cat(10) denotes a categorical distribution over the 10 classes. This modified Roundtrip model simultaneously generates images conditional on a class label and computes the within-class density of each image. Competing neural density estimators typically require a number of tricks to achieve image generation and density estimation, including rescaling pixel values, transforming the bounded pixel values into an unbounded logit space, and adding uniform noise; Roundtrip requires no additional transformation except rescaling. In Figure 3, the generated images in each class are sorted by decreasing likelihood. Images generated by Roundtrip are more realistic than those generated by MAF (the best among the alternative neural density estimators; see Figure 2 and Table 1). Furthermore, the density estimated by Roundtrip correlates well with the perceived quality of the generated images.

Figure 3: (a) True and generated images of MNIST. (b) True and generated images of CIFAR-10. Note that images generated by Roundtrip and MAF were sorted by decreasing likelihood for each class. The results of MAF were collected directly from its original paper.

3.5 Outlier detection

Finally, we applied the Roundtrip model to an outlier detection task, in which a data point with an extremely low density value is regarded as a likely outlier. We tested this method on three outlier detection datasets (Shuttle, Mammography, and ForestCover) from the ODDS database (http://odds.cs.stonybrook.edu/). Each dataset was split into training, validation and test sets (see Appendix D for data descriptions). Besides the neural density estimators, we also included two baselines, One-class SVM (scholkopf2001estimating) and Isolation Forest (liu2008isolation). The results are shown in Table 2: Roundtrip achieves the best or comparable results across the outlier detection tasks. In particular, on the ForestCover dataset, where the outlier percentage is only 0.9%, Roundtrip still achieves a precision of 17.7% while the precision of the other neural density estimators is below 6%.

OC-SVM I-Forest Real NVP MAF Roundtrip
Shuttle 0.953 0.973 0.784 0.929 0.973
Mammography 0.370 0.482 0.482 0.407 0.482
ForestCover 0.127 0.058 0.054 0.046 0.177
Table 2: The precision at k of different methods on three ODDS datasets.

4 Discussion

We proposed Roundtrip as a novel neural density estimator based on deep generative models. Unlike prior studies that model an invertible transformation from a base density, with parameters learned by neural networks, Roundtrip directly learns the data distribution through deep generative models. Roundtrip outperforms previous neural density estimators in a variety of density estimation tasks, including simulation and real data studies and an outlier detection application. We also demonstrated the high flexibility of Roundtrip, which can estimate densities for both vector-valued and tensor-valued (e.g., image) data.

Acknowledgements

This work is supported by NSF grants DMS1721550, DMS1811920, National Key Research and Development Program of China No. 2018YFC0910404, the National Natural Science Foundation of China Nos. 61873141, 61721003, 61573207, and the Tsinghua-Fuzhou Institute for Data Technology.

References