Deep Learning for Sampling from Arbitrary Probability Distributions

01/12/2018 ∙ by Felix Horger, et al. ∙ FAU

This paper proposes a fully connected neural network model to map samples from a uniform distribution to samples of any explicitly known probability density function. During training, the Jensen-Shannon divergence between the distribution of the model's output and the target distribution is minimized. We experimentally demonstrate that our model converges towards the desired state. It provides an alternative to existing sampling methods such as inversion sampling, rejection sampling, Gaussian mixture models, and Markov-Chain-Monte-Carlo. Our model has high sampling efficiency and is easily applied to any probability distribution, without the need for further analytical or numerical calculations. It can produce correlated samples, such that the output distribution converges faster towards the target than it would for independent samples. It is, however, also able to produce independent samples, if single values are fed into the network and the input values are independent as well. We focus on one-dimensional sampling, but additionally illustrate a two-dimensional example with a target distribution of dependent variables.

I Introduction

Sampling from a given probability density function (PDF) is required in various applications, e.g. computer graphics, Monte-Carlo methods, or physical simulations. Existing methods for generating such samples include inversion sampling, rejection sampling, Gaussian mixture, or Markov-Chain-Monte-Carlo methods. In all cases, random samples from simple distributions, such as a uniform distribution, are used to generate samples of the desired distribution.

Here we examine how a fully connected neural network (FCNN) model performs on this task. Our model is constructed to map an input vector of samples from a uniform distribution to an output vector of the same dimension. The target distribution has to be explicitly known and normalized.

Such a model yields a high sampling efficiency, since each input sample gives rise to exactly one output sample. It is flexible towards the choice of the target PDF and needs little manual effort.

This paper gives a brief overview of the mentioned sampling methods and compares them to our FCNN model in one dimension, regarding the properties of the produced samples and both the computational and the manual effort. We also present an example of sampling from a two-dimensional PDF.

II Conventional Sampling Methods

II-A Inversion Sampling

This method provides a function which maps samples from an arbitrary distribution $q$ to samples from the target PDF $p$. Let $f$ be this function; then, assuming that $f$ is a bijection, the differential equation

$$ p(f(x)) \left| \frac{\mathrm{d}f(x)}{\mathrm{d}x} \right| = q(x) \qquad (1) $$

holds. If $q$ is the uniform distribution over $[0, 1]$, then $f$ can be determined to be equal to the inverse cumulative distribution function (CDF) corresponding to $p$. In many cases, the CDF or its inverse might not have an analytical representation. If the integration and inversion are performed numerically, the computational effort increases and the quality of the samples decreases. In contrast, if the inverse CDF is explicitly known, this method has high efficiency and produces samples with exactly the desired properties. If the samples drawn from $q$ are independent, the same holds for the produced samples.

This method may be applied to higher-dimensional PDFs, using either the separability of PDFs of uncorrelated random variables or Bayes' theorem for correlated random variables, to split the sampling process into multiple one-dimensional sampling steps. For this approach, all the conditional one-dimensional PDFs have to be known [1, 526ff].
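As a rough illustration of the numerical variant of one-dimensional inversion sampling (this is not the authors' code; the grid, the example density, and the helper name are assumptions chosen for the sketch):

```python
import numpy as np

def inversion_sample(pdf, grid, n_samples, rng=None):
    """Draw samples from a 1D target PDF by numerically inverting its CDF.

    pdf  : callable, the (normalized) target density, evaluated on `grid`.
    grid : increasing 1D array of support points covering the bulk of the target.
    """
    rng = np.random.default_rng() if rng is None else rng
    density = pdf(grid)
    # Numerical CDF via cumulative trapezoidal integration.
    cdf = np.concatenate(([0.0],
                          np.cumsum(np.diff(grid) * 0.5 * (density[1:] + density[:-1]))))
    cdf /= cdf[-1]                        # renormalize to compensate for domain truncation
    u = rng.uniform(0.0, 1.0, n_samples)  # uniform samples on [0, 1]
    # Invert the CDF by interpolation: x = F^{-1}(u).
    return np.interp(u, cdf, grid)

# Example: exponential density restricted to a finite grid.
samples = inversion_sample(lambda x: np.exp(-x), np.linspace(0.0, 10.0, 2001), 10_000)
```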

Fig. 1: Comparison of the model to the mixture of Gaussians method on a bimodal Gaussian target, given by (4). (a) Average of KDEs, each computed from random values produced with the mixture of Gaussians method. (b) Mean of the KDEs of output vectors; the training was performed on input vectors drawn from the uniform distribution. (c) Two randomly chosen elements of the output vectors plotted against each other, showing the dependence of the model's output values.

II-B Rejection Sampling

This method requires a proposal distribution $q$ from which sampling can be performed. Additionally, there has to exist a constant $k$ such that $k\,q(x) \ge p(x)$ for all $x$.

The procedure starts with drawing a sample $x$ from $q$ and another sample $u$ from the uniform distribution over $[0, k\,q(x)]$. If $u \le p(x)$, then $x$ is a valid sample from $p$; otherwise $x$ is rejected. This process is continued until enough valid samples have been generated.

The major disadvantage of this method is that the sampling efficiency depends on how close the proposal distribution lies to the target distribution. On the other hand, the produced samples are independent samples from $p$ if the samples drawn from $q$ and from the uniform distribution are independent. Another advantage is that the target does not have to be normalized [1, 528ff].
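A minimal sketch of rejection sampling, assuming an unnormalized target and helper callables (`proposal_sample`, `proposal_pdf`) invented for this example rather than taken from the paper:

```python
import numpy as np

def rejection_sample(target, proposal_sample, proposal_pdf, k, n_samples, rng=None):
    """Rejection sampling: `k * proposal_pdf(x) >= target(x)` must hold for all x.

    `target` may be unnormalized; `proposal_sample(n)` draws n proposal samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    accepted = []
    while len(accepted) < n_samples:
        x = proposal_sample(1000)                  # candidate batch from the proposal q
        u = rng.uniform(0.0, k * proposal_pdf(x))  # uniform on [0, k*q(x)] for each candidate
        accepted.extend(x[u <= target(x)])         # keep candidates that fall under the target
    return np.array(accepted[:n_samples])

# Example: half-normal target (unnormalized) with an exponential proposal.
rng = np.random.default_rng(0)
samples = rejection_sample(target=lambda x: np.exp(-0.5 * x**2),
                           proposal_sample=lambda n: rng.exponential(1.0, n),
                           proposal_pdf=lambda x: np.exp(-x),
                           k=1.7, n_samples=10_000)
```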

II-C Mixture of Gaussians

The quality of the samples produced by this method depends on how well a suitable sum of Gaussians approximates the target. Samples from the approximation can be obtained by randomly choosing a Gaussian mode from the sum, with probability proportional to its weight, and generating a sample from it.

This method is easy to apply, since sampling from a Gaussian PDF can be done using inversion sampling, and if the Gaussian samples are independent, the generated samples are independent as well [1, 110ff].
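For illustration, sampling from a one-dimensional Gaussian mixture can be sketched as follows; the weights, means, and standard deviations are placeholder values, not those of any target used in the paper:

```python
import numpy as np

def sample_gaussian_mixture(weights, means, stds, n_samples, rng=None):
    """Sample a 1D Gaussian mixture: first pick a mode by its weight, then sample it."""
    rng = np.random.default_rng() if rng is None else rng
    modes = rng.choice(len(weights), size=n_samples, p=weights)   # mode indices ~ weights
    return rng.normal(np.asarray(means)[modes], np.asarray(stds)[modes])

# Example: bimodal approximation with modes at -2 and 3.
samples = sample_gaussian_mixture([0.3, 0.7], [-2.0, 3.0], [0.5, 1.0], 10_000)
```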

II-D Markov-Chain-Monte-Carlo

The aim of this method is to construct a Markov chain with a stationary distribution equal to the target. The Metropolis-Hastings algorithm is a commonly used method of doing so. Starting from an initial sample, the current sample $x_t$ is used to propose a possible next sample $x^\ast$, drawn from an arbitrary conditional distribution $q(x^\ast \mid x_t)$. The sample $x^\ast$ is accepted as $x_{t+1}$ if

$$ u \le \min\!\left(1,\; \frac{p(x^\ast)\, q(x_t \mid x^\ast)}{p(x_t)\, q(x^\ast \mid x_t)}\right), \qquad (2) $$

where $u$ is drawn from the uniform distribution over $[0, 1]$. If $x^\ast$ is rejected, $x_{t+1}$ is set equal to $x_t$. This process is continued until enough samples have been obtained.
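A minimal sketch of the Metropolis-Hastings algorithm with a symmetric Gaussian random-walk proposal, a common special case in which the proposal densities cancel in the acceptance ratio; the example target and step size are only illustrative:

```python
import numpy as np

def metropolis_hastings(target, x0, n_samples, proposal_std=1.0, rng=None):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal.

    The target may be unnormalized; the proposal terms cancel because it is symmetric.
    """
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        x_star = rng.normal(x, proposal_std)                       # propose near the current sample
        if rng.uniform() <= min(1.0, target(x_star) / target(x)):
            x = x_star                                             # accept the proposal
        chain[t] = x                                               # otherwise keep the old sample
    return chain

# Example: bimodal target; a wide proposal helps the chain switch between the modes.
chain = metropolis_hastings(lambda x: np.exp(-0.5 * (x - 3.0)**2) + np.exp(-0.5 * (x + 3.0)**2),
                            x0=0.0, n_samples=50_000, proposal_std=2.0)
```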

The proposal distribution has a great impact on the convergence of the distribution of the samples towards the target distribution. Consider, for example, a target with two modes placed far apart and a narrow proposal distribution located at the current sample: it is very unlikely to switch between the modes, leading to slow convergence. Further, the initial sample has an impact on the convergence: if the target probability at the initial sample is small, the chain might need some time to reach areas of higher probability ("burn-in"). Since the next sample is generated from the last one, the samples depend on each other [1, 539ff].

There is a possibility to link Markov-Chain-Monte-Carlo methods and generative adversarial networks [2] in order to produce random numbers [3]. The generator is used as the transition kernel of the Markov chain if samples from the target distribution are accessible during training. If this is not the case, the generator is trained to propose samples. Again, the proposal distribution has a great impact on the convergence and on the correlation of the samples; this method therefore optimizes the proposal step, leading to fast convergence and low correlation.

Fig. 2: Comparison of the model to the inversion sampling method on an exponential target given by (5). (a) Histogram of output values from the model. (b) Histogram of random values produced with the inversion sampling method.

III Sampling with a Fully Connected Neural Network

III-A Concept and Setup

Our model is a simple FCNN which maps an input vector of samples from the input distribution to an output vector of the same dimension. The target distribution has to be explicitly known. The term "sample" refers in this context only to an element of an input or output vector and is not to be understood as a "sample from the training set".

Every layer of the model has the same number of units, and the exponential linear unit (ELU) activation [4] is used in every layer except the last. Ten layers have proven to lead to good results. The input dimension is limited by the resources of the hardware used.

The ADAM optimizer [5] was used for the training process, the weights were initialized using the Xavier Glorot uniform distribution [6], and the biases were set to zero. The models were implemented using Python and Keras [7] with a TensorFlow [8] backend.
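The architecture described above can be sketched in Keras as follows. This is not the authors' original code, and the concrete layer width and input dimension are placeholders rather than the values used in the paper:

```python
import tensorflow as tf

def build_sampler(input_dim=512, units=512, n_layers=10):
    """Sketch of the described FCNN: equal-width layers, ELU activations in all but
    the last layer, Glorot-uniform weights, zero biases. Dimensions are hypothetical."""
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    # Hidden layers: same width, ELU activation.
    for _ in range(n_layers - 1):
        x = tf.keras.layers.Dense(units, activation="elu",
                                  kernel_initializer="glorot_uniform",
                                  bias_initializer="zeros")(x)
    # Last layer: linear activation, output dimension equal to the input dimension.
    outputs = tf.keras.layers.Dense(input_dim, activation=None,
                                    kernel_initializer="glorot_uniform",
                                    bias_initializer="zeros")(x)
    return tf.keras.Model(inputs, outputs)

# Training would use the ADAM optimizer together with the distribution-matching
# loss described in the text, e.g. model.compile(optimizer="adam", loss=...).
model = build_sampler()
```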

The loss function of the model consists of three parts. The kernel density estimation (KDE) [9] of each output vector in a mini-batch is compared to the target PDF. Additionally, the $i$-th element from each output vector in a mini-batch is extracted, treated as a set of samples, and its KDE is compared to the target; this is performed for all $i$. This second part promotes diversity, since otherwise the model would produce the same output vector with the correct distribution for any input.

The comparison of the KDE and the target distribution may be done using the mean squared error or the Jensen-Shannon divergence [10]; it was empirically found that the latter yields better results in most cases. The Jensen-Shannon divergence

$$ \mathrm{JSD}(p \,\|\, q) = \frac{1}{2} \int p(x) \log \frac{p(x)}{m(x)} \, \mathrm{d}x + \frac{1}{2} \int q(x) \log \frac{q(x)}{m(x)} \, \mathrm{d}x, \quad m = \tfrac{1}{2}(p + q), \qquad (3) $$

between two PDFs $p$ and $q$ measures their similarity. The above integrals have to be approximated numerically, which is possible due to the properties of $p$ and $q$; they are evaluated on a finite set of discrete values. The third part of the loss function confines the output values between the borders of this set using a "potential well", which was chosen to have linearly increasing sides.
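To make the first two loss terms concrete, the following NumPy sketch evaluates a Gaussian KDE on a fixed grid and the discretized Jensen-Shannon divergence against the target. A trainable version would have to be expressed with differentiable TensorFlow operations; the bandwidth, grid, and example target here are assumptions, not values from the paper:

```python
import numpy as np

def gaussian_kde_on_grid(samples, grid, bandwidth=0.1):
    """Gaussian kernel density estimate of `samples`, evaluated on a fixed grid."""
    diffs = (grid[None, :] - samples[:, None]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=0)

def jensen_shannon_divergence(p, q, grid, eps=1e-12):
    """Discretized Jensen-Shannon divergence between two densities given on `grid`."""
    dx = grid[1] - grid[0]
    p, q = p + eps, q + eps          # avoid log(0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m)) * dx
    kl_qm = np.sum(q * np.log(q / m)) * dx
    return 0.5 * (kl_pm + kl_qm)

# Example: compare the KDE of some samples to a standard normal target.
grid = np.linspace(-5.0, 5.0, 501)
target = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
samples = np.random.default_rng(0).normal(0.0, 1.0, 1024)
loss_value = jensen_shannon_divergence(gaussian_kde_on_grid(samples, grid), target, grid)
```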

A uniform distribution was used to generate the input samples. Any input distribution with zero mean leads to equal results, while other distributions perform significantly worse. This is caused by internal covariate shift, since no batch normalization was used in our model [11].

Fig. 3: Comparison of our model to rejection sampling and the Metropolis-Hastings algorithm on the target given by (6). (a) Histogram of output values from our model. (b) Histogram of random values produced with the rejection sampling method. (c) Histogram of random values produced with the Metropolis-Hastings algorithm.

Fig. 4: Comparison of the final KDE of our model's output values to the two-dimensional bimodal Gaussian target. (a) 2D KDE of output values from the model. (b) The two-dimensional bimodal Gaussian target PDF.

III-B Results

Using this setup, our model is able to produce samples of the given target distribution, i.e., the kernel density estimate of the output values converges towards the target PDF. The model has ten layers. The target is

(4)

which is a bimodal asymmetric Gaussian. The resulting KDE of the output values is shown in Figure 1(b).

Since many values are fed into the network at once, the output values may depend on each other. This property allows the KDE of the output values to lie closer to the target PDF than if the values were drawn independently, e.g. using the mixture of Gaussians method (see Figure 1(a)). Consider that the KDE is calculated from only a limited number of values, too few for a reasonable estimation of the underlying PDF; in fact, the model makes the output values interdependent in order to overcome this issue. The dependence can be seen more clearly in Figure 1(c), where two randomly chosen elements of the output vectors are plotted against each other. If they were independent, peaks would also appear at the remaining combinations of the two modes.

If independent samples are required, the input dimension may be reduced to one. Such a setup introduces no further correlation, and thus the output values are independent if the input values are independent as well. The model used in this scenario has ten layers. We can show experimentally that the model with input dimension one represents the mapping function given by the inversion sampling method. This is not surprising, because the differential equation (1) has a single solution on every finite interval, given any boundary condition (Picard-Lindelöf theorem; note that there exist other mapping functions which are not bijections, see Chapter II-A).

The model with input dimension one was compared to the inversion sampling method for the target

(5)

an exponential distribution with extended domain. The histogram displayed in Figure 2(a) was calculated from the model's output values after training; in contrast, Figure 2(b) shows the histogram of just as many random values obtained with the inversion sampling method.

Comparing these two yields no difference, except that in Figure 2(b) higher values occur. This may be caused by the numerical precision and by a more complicated fit at the borders, since the model is represented by a continuous function, whereas the inverse CDF diverges at zero and one. The model was further compared to the rejection sampling method with a target

(6)

that has an inverse CDF with no analytical representation. Figure 3(a) shows the resulting histogram of output values after training. For comparison, the histogram of the same number of values produced with the rejection sampling method is displayed in Figure 3(b). The proposal distribution was chosen to be equal to (5). This is not the best choice, since it has its maximum where the target is zero. Note that a bimodal Gaussian proposal cannot be used, since no constant fulfills the condition given in Chapter II-B. However, a poor choice of the proposal distribution does not affect the quality of the samples, only the computational effort.

Comparing Figures 3(a) and 3(b) shows that the histogram of the samples produced by our model approximates the target as well as the samples produced with the rejection sampling method.

The same target was used for the Metropolis-Hastings algorithm; the proposal distribution was chosen to be a Gaussian located at the current sample. As depicted in Figure 3(c), the histogram of the samples approximates the target, and the goodness of the fit is comparable to Figures 3(a) and 3(b).

Further, the model is able to sample from two-dimensional PDFs of dependent variables. The model was trained on two-dimensional input samples for a 2D bimodal Gaussian target with two peaks and variances of one in each direction. The result, together with the target, is depicted in Figure 4.

The computational effort for the comparison of the estimated output distribution and the target scales exponentially with the dimension, so there is a trade-off between training time and correct PDF estimation. A possible solution is to manually split the PDF into its conditional one-dimensional parts and train a separate model for each dimension.

IV Conclusion

In summary, our FCNN model is able to sample from any target PDF.

The presented findings show that our model produces results with a goodness of fit comparable to the existing sampling methods. The quality of the approximation can be tuned via the size of the model and the duration of the training.

In order to apply our model, a normalized target PDF is required, in contrast to the Metropolis-Hastings algorithm or the rejection sampling method. On the other hand, no proposal distribution, constant (rejection sampling), or location of Gaussian modes (Gaussian mixture) has to be determined for our model. The only parameter of our model that has to be adjusted by hand is the width of the kernel function for the KDE of the output values.

Our model has the highest possible sampling efficiency, equal to that of the inversion sampling method, whereas the other described methods transform multiple samples into a single one. In particular, the rejection sampling method may have low sampling efficiency.

Another important advantage is the flexibility towards the choice of the target. Compared to inversion sampling, no integration or inversion is required, neither analytical nor numerical. Our model with input dimension one tries to represent the inverse CDF. This raises the question of whether the inversion could not simply be performed numerically from the beginning: instead of using our model, the fit could be performed in any other way. However, an important advantage of FCNN models is that they are universal function approximators. It is neither required to choose how to proceed with the inversion, nor is one bound to an approximation with Gaussian modes.

For high input dimensions, our model is able to generate dependent samples such that their distribution converges faster towards the target than the distribution of independent samples would. If the input dimension is set to one, our model is able to produce independent samples, provided the input values are independent as well. This makes it more attractive than the Metropolis-Hastings algorithm, which produces highly dependent values and may exhibit slow convergence.

Sampling from two-dimensional PDFs of dependent variables is also possible, but the curse of dimensionality has not yet been overcome. Splitting the target into conditional one-dimensional distributions using Bayes’ theorem is a possible solution.

References