I. Introduction
Sampling from a given probability density function (PDF) is required in various applications, e.g. computer graphics, Monte Carlo or physical simulations. Existing methods for generating such samples include inversion sampling, rejection sampling, Gaussian mixture, or Markov Chain Monte Carlo methods. In any case, random samples of simple distributions, such as a uniform distribution, are used to generate samples of the desired distribution.
Here we examine how a fully connected neural network (FCNN) model performs on this task. Our model is constructed to map an input vector consisting of samples from a uniform distribution to an output vector of the same dimension. The target distribution has to be explicitly known and normalized. Such a model yields a high sampling efficiency, since each input sample yields one output sample. It is flexible towards the choice of the target PDF and needs little manual effort.
This paper gives a brief overview of the mentioned sampling methods and compares them to our FCNN model in one dimension, regarding both the properties of the produced samples and the computational and manual effort. We also work through an example of sampling from two-dimensional PDFs.
II. Conventional Sampling Methods
II-A. Inversion Sampling
This method provides a function which maps samples from an arbitrary distribution $p$ to samples from the target PDF $q$. Let $h$ be this function; then, assuming that $h$ is a bijection, the differential equation

$$q(h(z)) \, \frac{\mathrm{d}h(z)}{\mathrm{d}z} = p(z) \qquad (1)$$
holds. If $p$ is the uniform distribution over $[0, 1]$, then $h$ can be determined to be equal to the inverse cumulative distribution function (CDF) corresponding to $q$. In many cases, the CDF or its inverse might not have an analytical representation. If the integration and inversion are performed numerically, the computational effort increases and the quality of the samples decreases. In contrast, if $h$ is explicitly known, this method has high efficiency and produces samples with exactly the desired properties. If the samples drawn from $p$ are independent, the same holds for the produced samples. This method may be applied to higher-dimensional PDFs, using either the separability of PDFs of uncorrelated random variables or Bayes' theorem for correlated random variables, to split the sampling process into multiple one-dimensional sampling steps. For this approach all the conditional one-dimensional PDFs have to be known
[1, p. 526ff].

II-B. Rejection Sampling
This method requires a proposal distribution $p$ from which sampling can be performed. Additionally, there has to exist a constant $k$ such that $k\,p(x) \ge q(x)$ for all $x$. The procedure starts with drawing a sample $x$ from $p$ and another sample $u$ from the uniform distribution over $[0, k\,p(x)]$. If $u \le q(x)$, then $x$ is a valid sample from $q$; otherwise $x$ is rejected. This process is continued until enough valid samples have been generated.
The major disadvantage of this method is that the sampling efficiency depends on how close the proposal distribution lies to the target distribution. Besides that, the produced samples are independent samples from $q$ if the samples drawn from $p$ and the uniform distribution are independent. Another advantage is that the target does not have to be normalized [1, p. 528ff].
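The procedure can be sketched in a few lines. The target here is an illustrative (assumed) Beta(2, 2) density, not a target from the paper, with a uniform proposal and $k = 1.5$:

```python
import numpy as np

# Rejection sampling sketch. The target is the illustrative (assumed)
# Beta(2, 2) density q(x) = 6x(1-x) on [0, 1], not a target from the paper.
# The proposal p is U(0, 1), and k = 1.5 guarantees k * p(x) >= q(x).
def rejection_sample(n, k=1.5, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n:
        x = rng.uniform()                # draw a sample x from the proposal p
        u = rng.uniform(0.0, k * 1.0)    # draw u from U(0, k * p(x)); p(x) = 1
        if u <= 6.0 * x * (1.0 - x):     # accept x if u <= q(x)
            samples.append(x)
    return np.array(samples)

x = rejection_sample(50_000)
print(round(x.mean(), 2))  # the mean of Beta(2, 2) is 0.5
```

Since the acceptance probability is $1/k$, about two out of three proposals are accepted here; a proposal lying further from the target forces a larger $k$ and lowers the efficiency accordingly.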
II-C. Mixture of Gaussians
The quality of the samples produced by this method depends on how well a suitable sum of Gaussians approximates the target. Samples from the approximation can be obtained by randomly choosing a Gaussian mode from the sum with probability proportional to its weight and generating a sample from it.
This method is easy to perform, since sampling from a Gaussian PDF can be done using inversion sampling, and if the Gaussian samples are independent, the generated samples are independent as well [1, p. 110ff].
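As a minimal sketch, with illustrative (assumed) weights, means, and standard deviations:

```python
import numpy as np

# Mixture-of-Gaussians sampling sketch with illustrative (assumed) weights,
# means, and standard deviations. A mode is chosen per sample with
# probability equal to its weight, then a Gaussian sample is drawn from it.
def sample_mixture(n, weights, means, stds, seed=0):
    rng = np.random.default_rng(seed)
    modes = rng.choice(len(weights), size=n, p=weights)  # pick a mode index
    return rng.normal(np.asarray(means)[modes], np.asarray(stds)[modes])

x = sample_mixture(100_000, weights=[0.3, 0.7], means=[-2.0, 1.0], stds=[0.5, 1.0])
print(round(x.mean(), 2))  # expected mean: 0.3 * (-2.0) + 0.7 * 1.0 = 0.1
```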
II-D. Markov Chain Monte Carlo
The aim of this method is to construct a Markov chain with a stationary distribution equal to the target. The Metropolis-Hastings algorithm is a commonly used method of doing so. An initial sample $x_t$ is used to propose a possible next sample $x'$, drawn from an arbitrary conditional distribution $g(x' \mid x_t)$. The sample is accepted as $x_{t+1}$ if

$$u < \min\!\left(1, \; \frac{q(x')\, g(x_t \mid x')}{q(x_t)\, g(x' \mid x_t)}\right) \qquad (2)$$

where $u$ is drawn from the uniform distribution over $[0, 1]$. If $x'$ is rejected, $x_{t+1}$ is set equal to $x_t$. This process is continued until enough samples have been obtained.
The proposal distribution has a great impact on the convergence of the distribution of the samples towards the target distribution. Consider for example a target with two modes placed distant from each other and a narrow proposal distribution located at the current sample. It is then very unlikely to switch between the modes, leading to slow convergence. Further, the initial sample has an impact on the convergence: if the target density at the initial sample is small, the chain might need some time to reach areas of higher probability ("burn-in"). Taking into account that the next sample is generated using the last one yields that the samples depend on each other [1, p. 539ff].
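A minimal sketch of the Metropolis-Hastings loop, using a symmetric Gaussian proposal (so the factors $g$ cancel in the acceptance ratio) and an illustrative standard-normal target:

```python
import numpy as np

# Metropolis-Hastings sketch with an illustrative standard-normal target.
# The proposal g is a Gaussian centred at the current sample; since it is
# symmetric, the factors g cancel and the acceptance ratio is q(x')/q(x_t).
def metropolis_hastings(n, step=1.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    q = lambda x: np.exp(-0.5 * x * x)        # unnormalized target density
    x, chain = x0, []
    for _ in range(n):
        x_prop = x + rng.normal(0.0, step)    # propose x' ~ N(x_t, step^2)
        if rng.uniform() < min(1.0, q(x_prop) / q(x)):
            x = x_prop                        # accept x' as x_{t+1}
        chain.append(x)                       # on rejection, x_{t+1} = x_t
    return np.array(chain)

chain = metropolis_hastings(100_000)
burned = chain[5_000:]                        # discard the burn-in phase
print(round(burned.std(), 1))                 # approaches 1 for N(0, 1)
```

Note that consecutive chain values are correlated by construction, in contrast to the independent samples produced by the previous methods.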
There is a possibility to link Markov Chain Monte Carlo methods and generative adversarial networks [2] in order to produce random numbers [3]. The generator is used as the transition kernel of the Markov chain if samples from the target distribution are accessible during the training. If this is not the case, the generator is trained to propose samples. In either case the proposal distribution has a great impact on the convergence and also on the correlation of the samples; this method optimizes the proposal step, leading to fast convergence and low correlation.
III. Sampling with a Fully Connected Neural Network
III-A. Concept and Setup
Our model is a simple FCNN which maps an input vector consisting of samples from the input distribution to an output vector of the same dimension. The target distribution has to be explicitly known. The term "sample" refers in this context only to an element of any input or output vector and is not to be understood as a "sample from the training set".
The model has the same number of units in every layer and exponential linear unit (ELU) activations [4] in each layer but the last. A depth of ten layers has proven to lead to good results. The input dimension is limited by the resources of the hardware used.
The ADAM optimizer [5] was used for the training process, the weights were initialized using the Xavier Glorot uniform distribution [6], and the biases were set to zero. The models were implemented using Python and Keras [7] with a TensorFlow [8] backend.

The loss function of the model consists of three parts. First, the kernel density estimation (KDE) [9] of each output vector in a minibatch is compared to the target PDF. Second, the $i$-th element of each output vector in a minibatch is extracted, treated as a set of samples, and its KDE is compared to the target; this is performed for all $i$. This part promotes diversity: without it, the model produces the same output vector with the correct distribution for any input. The comparison of the KDE and the target distribution may be done using the mean squared error or the Jensen-Shannon divergence [10]; it was empirically found that the latter yields better results in most cases. The Jensen-Shannon divergence
$$\mathrm{JSD}(p \,\|\, q) = \frac{1}{2} \int p(x) \log\frac{2\,p(x)}{p(x) + q(x)} \,\mathrm{d}x + \frac{1}{2} \int q(x) \log\frac{2\,q(x)}{p(x) + q(x)} \,\mathrm{d}x \qquad (3)$$
between the PDFs $p$ and $q$ measures their similarity. The above integral has to be approximated numerically, which is possible due to the properties of $p$ and $q$; it is calculated on a finite set of discrete values. The third part of the loss function confines the output values between the borders of this set using a "potential well", which was chosen to have linearly increasing sides.
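The numerical approximation of the Jensen-Shannon divergence on a finite grid can be sketched as follows; this is an illustrative implementation, not necessarily the one used for the model's loss:

```python
import numpy as np

# Numerical Jensen-Shannon divergence on a finite grid, approximating the
# integral in Eq. (3) by a Riemann sum. Illustrative implementation; the
# small eps only guards against division by zero where both densities vanish.
def jensen_shannon(p, q, dx, eps=1e-12):
    m = 0.5 * (p + q)                                      # mixture density
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps))) * dx
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps))) * dx
    return 0.5 * (kl_pm + kl_qm)

grid = np.linspace(-5.0, 5.0, 1001)                        # finite set of values
dx = grid[1] - grid[0]
p = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)          # N(0, 1)
q = np.exp(-0.5 * (grid - 1.0)**2) / np.sqrt(2.0 * np.pi)  # N(1, 1)

print(jensen_shannon(p, p, dx))        # identical densities give 0.0
print(jensen_shannon(p, q, dx) > 0.0)  # differing densities give a positive value
```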
A zero-mean uniform distribution was used to generate input samples. Any input distribution with zero mean leads to equal results; other distributions perform significantly worse. This is caused by internal covariate shift, since no batch normalization was used in our model
[11].

III-B. Results
Using this setup, our model is able to produce samples of the given target distribution, i.e. the kernel density estimation of the output values converges towards the target PDF. The model used here has ten layers. The target is
(4) 
which is a bimodal asymmetric Gaussian. The resulting KDE of the output values is shown in Figure 1(b).
Since many values are fed at once into the network, the output values may depend on each other. This property allows the KDE of the output values to lie closer to the target PDF than if the values were drawn independently, e.g. using the mixture of Gaussians method (see Figure 1(a)). Consider that the KDE is calculated from only as many values as the output vector contains, which are too few for a reasonable estimation of the underlying PDF. In fact, the model makes the output values interdependent in order to overcome this issue. The dependence can be seen more clearly in Figure 1(c), where two randomly chosen elements of the output vectors are plotted against each other. If they were independent, there would be peaks at the remaining mode combinations, too.
If independent samples are required, the input dimension may be reduced to one. Such a setup introduces no further correlation, and thus the output values are independent if the input values are independent as well. The model used in this scenario also has ten layers. We can show experimentally that the model with input dimension one represents the mapping function given by the inversion sampling method. This is not surprising, because the differential equation (1) has a single solution on every finite subset of its domain, given any boundary condition (Picard-Lindelöf theorem; note that there exist other mapping functions which are not bijections, see Chapter II-A).
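For concreteness, the mapping function given by inversion sampling can be written down analytically for a plain exponential target (an illustrative choice; the extended-domain target used for the comparison below differs):

```python
import numpy as np

# Inversion sampling sketch for an illustrative exponential target
# q(x) = lam * exp(-lam * x) on x >= 0 (not the paper's exact target).
# The inverse CDF is F^{-1}(u) = -ln(1 - u) / lam; this is the mapping
# function h that the model with input dimension one learns to represent.
def inverse_cdf_exponential(u, lam=1.0):
    return -np.log(1.0 - u) / lam

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)     # independent samples from U(0, 1)
x = inverse_cdf_exponential(u)    # independent samples from the target

print(round(x.mean(), 2))         # the sample mean approaches 1/lam
```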
The model with input dimension one was compared to the inversion sampling method for the target
(5) 
an exponential distribution with extended domain. The training was performed on a set of input values, and the histogram displayed in Figure 2(a) was calculated from the resulting output values. In contrast, Figure 2(b) shows the histogram of just as many random values obtained from the inversion sampling method. Comparing the two yields no difference, except that in Figure 2(b) higher values occur. This may be caused by the numerical precision and a more complicated fit at the borders, since the model is represented by a continuous function, whereas the inverse CDF diverges at zero and one. This model was further compared to the rejection sampling method for a target
(6) 
that has an inverse CDF with no analytical representation. Figure 3(a) shows the resulting histogram of the output values after training. Comparatively, the histogram of the same number of values produced with the rejection sampling method is displayed in Figure 3(b). The proposal distribution was chosen to be equal to (5). This is not the best choice, since it has its maximum where the target is zero. Note that a bimodal Gaussian proposal cannot be used, since no constant $k$ fulfills the condition given in Chapter II-B. But a poor choice of the proposal distribution does not affect the quality of the samples, only the computational effort.
Comparing Figures 3(a) and 3(b) yields that the histogram of the samples produced by our model approximates the target as well as the samples produced with the rejection sampling method.
The same target was used for the Metropolis-Hastings algorithm; the proposal distribution was chosen to be a Gaussian located at the current sample. As depicted in Figure 3(c), the histogram of the samples approximates the target, and the goodness of fit is comparable to Figures 3(a) and 3(b).

Further, the model is able to sample from two-dimensional PDFs of dependent variables. The model was trained for a 2D bimodal Gaussian target with variances of one in each direction. The result together with the target is depicted in Figure 4. The computational effort for the comparison of the estimated output distribution and the target scales exponentially with the dimension, so there is a trade-off between training time and a correct PDF estimation. A possible solution is to manually split the PDF into its conditional one-dimensional parts and train a separate model for each dimension.
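The proposed splitting into conditional one-dimensional parts can be sketched for an illustrative correlated Gaussian $q(x, y) = q(x)\,q(y \mid x)$; the bimodal 2D target above would be handled analogously, with each one-dimensional step potentially taken over by a separate model:

```python
import numpy as np

# Sketch of splitting a two-dimensional PDF q(x, y) = q(x) * q(y|x) into two
# one-dimensional sampling steps via Bayes' theorem. The target here is an
# illustrative correlated Gaussian, not the bimodal 2D target from the paper.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0.0, 1.0, size=n)  # step 1: sample the marginal q(x) = N(0, 1)
y = rng.normal(x, 1.0)            # step 2: sample the conditional q(y|x) = N(x, 1)

# The two coordinates are dependent: Cov(x, y) = Var(x) = 1.
print(round(np.cov(x, y)[0, 1], 1))
```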
IV. Conclusion
Summarizing this paper, our FCNN model is able to sample from a given, explicitly known target PDF.
The presented findings show that our model produces results with a goodness of fit comparable to the existing sampling methods. The quality of the approximation can be tuned with the size of the model and the duration of the training.
In order to apply our model, a normalized target PDF is required, in contrast to the Metropolis-Hastings algorithm or the rejection sampling method. On the other hand, no proposal distribution, constant $k$ (rejection sampling), or location of Gaussian modes (Gaussian mixture) has to be determined for our model. The only parameter of our model that has to be adjusted by hand is the width of the kernel function for the KDE of the output values.
Our model has the highest possible sampling efficiency, equal to that of the inversion sampling method, whereas the other described methods transform multiple input samples into a single output sample. Especially the rejection sampling method may have a low sampling efficiency.
Another important advantage is the flexibility towards the choice of the target. Compared to inversion sampling, no integration or inversion is required, neither analytical nor numerical. Our model with input dimension one tries to represent the inverse CDF. This raises the question of whether the inversion may not simply be performed numerically from the beginning; instead of using our model, the fit could be performed in any other way. But an important advantage of FCNN models is that they are universal function approximators. It is neither required to choose how to proceed with the inversion, nor to be bound to an approximation with Gaussian modes.
For high input dimensions, our model is able to generate dependent samples such that their distribution converges faster towards the target than the distribution of independent samples would do. If the input dimension is set to one, our model is able to produce independent samples if the input values are independent as well. This makes it more attractive than the MetropolisHastings algorithm, which produces highly dependent values and may have slow convergence.
Sampling from two-dimensional PDFs of dependent variables is also possible, but the curse of dimensionality has not yet been overcome. Splitting the target into conditional one-dimensional distributions using Bayes' theorem is a possible solution.
References

[1] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[2] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial networks," 2014, http://arxiv.org/abs/1406.2661.
[3] J. Song, S. Zhao, and S. Ermon, "A-NICE-MC: adversarial training for MCMC," 2017, http://arxiv.org/abs/1706.07561.
[4] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," 2015, http://arxiv.org/abs/1511.07289.
 [5] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, http://arxiv.org/abs/1412.6980.

[6] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pp. 249–256. [Online]. Available: http://www.jmlr.org/proceedings/papers/v9/glorot10a.html
[7] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
 [9] B. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC, 1986.
 [10] D. M. Endres and J. E. Schindelin, “A new metric for probability distributions,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, July 2003.
 [11] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015, http://arxiv.org/abs/1502.03167.