    # Efficient sampling generation from explicit densities via Normalizing Flows

For many applications, such as computing the expected value of different quantities, sampling from a known probability density function, the target density, is crucial but challenging when no explicit inverse transform is available. In these cases, rejection and importance sampling require suitable proposal densities, which can be evaluated and sampled from efficiently. We present a method based on normalizing flows, proposing a solution for the common problem of an exploding reverse Kullback-Leibler divergence, caused by the target density taking values of 0 in regions of the flow transformation. The performance of the method is demonstrated using a complex multi-mode density function.


## 1 Introduction

Generating samples from a *target probability density* to obtain data sets or compute expected values is a fundamental task in many science and engineering disciplines. The generated Monte Carlo (MC) data is especially important when working within a high-dimensional space, since the integrals become intractable to compute analytically. Additionally, when trying to sample in a high-dimensional space, numerical methods such as Markov Chain Monte Carlo (MCMC) have to be applied, since finding an explicit inverse transform to a simple-to-sample base distribution is usually impossible. Other methods, such as rejection and importance sampling, rely on defining a suitable *proposal density function* similar enough to the target density in order to obtain a good estimate for either sampling or computing expectations efficiently, as will be shown in Sec. 2.1.

Additionally, the proposal density has to satisfy two properties:

1. Sampling from it has to be fast and efficient.

2. The proposal density of the samples has to be evaluable.

Normalizing flows provide an expressive family of parametrized density functions which satisfy precisely these conditions. As will be described in Sec. 2.2, they consist of finding a differentiable and invertible transformation from a base distribution to a target distribution which allows for both of these tasks, making them ideal candidates for defining a suitable proposal function.

This approach of utilizing normalizing flows for importance sampling was proposed previously by Müller et al. (2018). We add two modifications to obtain an optimal proposal function through normalizing flows, discussed in Sec. 3:

• Usage of the reverse Kullback-Leibler (KL) divergence as the objective function.

• Presentation of a solution to the vanishing (exploding) gradient for the (reverse) Kullback-Leibler divergence via a convex combination of the target density and a support density.

In Sec. 4, a toy problem is defined, which has a complicated vanishing and multi-mode density. We show that by using the reverse KL divergence with the convex combination of target and support densities, we can train the normalizing flow to find an adequate proposal function to perform rejection and importance sampling.

## 2 Background

In this Section we discuss the theoretical background of both the sampling methodologies and the proposal function framework. Sec. 2.1 describes how rejection and importance sampling work, and how their performance depends on the proposal function. In Sec. 2.2, the concept of normalizing flows is introduced to find such a suitable proposal function. In particular, we focus on a specific implementation of normalizing flows, the Neural Spline Flows (NSF) (Durkan et al., 2019b).

### 2.1 Rejection and Importance sampling

Very often one is in the situation that one has a complex probability density function, $p(x)$, which can be evaluated for any given $x$, and one would like to generate a data set following this distribution. However, it is too complex to sample from directly. A common approach in these cases is to find a simpler function, $q(x)$, the proposal function, from which one can easily draw samples.

If the aim is to calculate the expected value of a function $f(x)$ with $x$ drawn from $p(x)$, a suitable MC technique is Importance Sampling.

Using the proposal function, the expected value $\mathbb{E}_{x\sim p(x)}[f(x)]$ can be approximated by:

$$
\begin{aligned}
\mathbb{E}_{x\sim p(x)}[f(x)] &= \int p(x)\,f(x)\,dx && (1)\\
&= \int q(x)\,\frac{p(x)}{q(x)}\,f(x)\,dx && (2)\\
&= \mathbb{E}_{x\sim q(x)}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] && (3)\\
&\approx \frac{1}{N}\sum_{n=1}^{N} w_n\, f(x_n) \quad \text{with } x_n \sim q(x), && (4)
\end{aligned}
$$

where the factors $w_n = p(x_n)/q(x_n)$ are known as the importance weights. The distribution of the weights indicates the efficiency of the importance sampling. For the ideal case of $q(x)$ being identical to $p(x)$, all $w_n$ will be equal to 1. A broad weight distribution, on the other hand, indicates that a larger number of samples is necessary to achieve the same precision for the expected value.

The variance of $w\,f(x)$ under the proposal can be estimated by

$$
\sigma_q^2 \approx \frac{1}{N-1}\sum_{n=1}^{N}\left(w_n f(x_n) - \hat{\mu}\right)^2 \quad \text{with } x_n \sim q(x), \qquad (5)
$$

where $\hat{\mu}$ is the approximated expected value obtained from Eq. (4). Note that the larger the values of $w_n$ are, the higher the estimated variance. The error on the estimated mean value is then

$$
\sigma_{\hat{\mu}} = \frac{\sigma_q}{\sqrt{N}}, \quad \text{with } N = \text{number of samples}. \qquad (6)
$$
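The estimator of Eqs. (4)-(6) can be sketched in a few lines of numpy. This is an illustrative example, not the paper's code: the target $p$, proposal $q$, and test function $f(x)=x^2$ are arbitrary choices with a known analytic answer ($\mathbb{E}_p[x^2] = 1.25$).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    """Log density of a 1-D normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Illustrative densities: target p(x) = N(1, 0.5), proposal q(x) = N(0, 2).
# The proposal is broader than the target, so it covers its support.
mu_p, s_p = 1.0, 0.5
mu_q, s_q = 0.0, 2.0

N = 200_000
x = rng.normal(mu_q, s_q, size=N)                 # x_n ~ q(x)
w = np.exp(log_normal_pdf(x, mu_p, s_p) - log_normal_pdf(x, mu_q, s_q))

f = x ** 2                                        # E_p[f] = mu^2 + sigma^2 = 1.25
mu_hat = np.mean(w * f)                           # Eq. (4)
sigma_q_est = np.sqrt(np.sum((w * f - mu_hat) ** 2) / (N - 1))  # Eq. (5)
err = sigma_q_est / np.sqrt(N)                    # Eq. (6)

print(mu_hat, err)
```

The mean of the weights themselves should be close to 1 when both densities are normalized, which is a useful sanity check in practice.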

Another MC technique for sampling is Rejection Sampling, which has the advantage over importance sampling that it produces samples which directly follow the distribution $p(x)$, a feature of interest for example for high energy physics MC generators.

In rejection sampling, the proposal distribution is used to create a comparison function, $k\,q(x)$, with $k$ being a constant factor, which has to satisfy

$$
k\,q(x) \geq p(x) \quad \forall x : p(x) > 0. \qquad (7)
$$

The procedure is the following: First a sample $x$ is generated following $q(x)$. In a second step a random number, $u$, is generated uniformly in the range $[0, k\,q(x)]$. If $u$ fulfills the condition $u \leq p(x)$, the sample is accepted; otherwise it is rejected. The probability that a sample is accepted is proportional to $1/k$, i.e., $k$ gives an intuition of the number of tries until we obtain an accepted sample.
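The accept/reject loop just described can be sketched as follows. The standard normal target and uniform proposal are illustrative choices, not the paper's densities; the safety factor on $k$ is an assumption to guarantee Eq. (7) holds at the peak.

```python
import numpy as np

rng = np.random.default_rng(1)

def p(x):
    """Illustrative target: standard normal density."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

# Proposal q(x): uniform on [-5, 5]. Choose k so k*q(x) >= p(x) everywhere (Eq. 7).
q_density = 1.0 / 10.0
k = p(0.0) / q_density * 1.01          # 1% safety margin above the target's peak

N = 100_000
x = rng.uniform(-5, 5, size=N)          # step 1: sample x ~ q(x)
u = rng.uniform(0, k * q_density, N)    # step 2: u ~ Uniform[0, k*q(x)]
samples = x[u <= p(x)]                  # step 3: accept iff u <= p(x)

accept_rate = samples.size / N          # should be close to 1/k
print(accept_rate, samples.mean(), samples.std())
```

The accepted samples follow the target exactly, at the cost of throwing away a fraction $1 - 1/k$ of the proposals, which is why a proposal close to the target (small $k$) matters.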

Thus, for both sampling techniques it is crucial to find a $q(x)$ as similar as possible to $p(x)$ which at the same time fulfills the condition $q(x) > 0$ for all $x$ with $p(x) > 0$. If this last condition is not satisfied, the methods fail to produce the desired result. In the following Sections we describe a method to achieve this.

### 2.2 Normalizing flows

In the following, a short introduction to the concept of normalizing flows is given, continuing with a concrete implementation, the Neural Spline Flows, and finishing with the objective function used to train this kind of neural network.

#### 2.2.1 General introduction

Normalizing flows are a mechanism for constructing flexible probability densities for continuous random variables. A comprehensive review on the topic can be found in (Papamakarios et al., 2019), from which a brief summary is given in this Section on how they are defined and how the parameters of the transformation are obtained.

Consider a random variable $u$ defined over $\mathbb{R}^D$, with known probability density $p_u(u)$. A normalizing flow characterizes itself by a transformation $T$ from this known density to another density $p_{\text{target}}(x)$ of a random variable $x$, the target density, in the same space $\mathbb{R}^D$, via

$$
x = T(u), \quad \text{with } u \sim p_u(u). \qquad (8)
$$

The density $p_u(u)$ is known as the base density, and has to be easy to sample from and easy to evaluate (e.g., a multivariate normal, or a uniform in dimension $D$). The transformation $T$ has to be invertible, and both $T$ and $T^{-1}$ have to be differentiable, i.e., $T$ defines a diffeomorphism over $\mathbb{R}^D$.

This allows us to sample from $p_{\text{target}}(x)$ by sampling from $p_u(u)$ and applying the transformation. Additionally, we are able to evaluate the target density by evaluating the base density, using the change of variables for density functions,

$$
p_{\text{target}}(x) = p_u(u)\,\left|\det J_T(u)\right|^{-1} \quad \text{with } u = T^{-1}(x), \qquad (9)
$$

where the Jacobian $J_T(u)$ is the matrix of the partial derivatives of the transformation $T$:

$$
J_T(u) = \begin{bmatrix}
\dfrac{\partial T_1}{\partial u_1} & \cdots & \dfrac{\partial T_1}{\partial u_D} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial T_D}{\partial u_1} & \cdots & \dfrac{\partial T_D}{\partial u_D}
\end{bmatrix}. \qquad (10)
$$
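Eq. (9) can be checked numerically with the simplest possible flow, a 1-D affine transformation $T(u) = au + b$, whose Jacobian determinant is the constant $a$. All concrete values here are illustrative:

```python
import numpy as np

def p_u(u):
    """Base density: standard normal in 1-D."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

a, b = 2.0, 1.0                       # affine transformation T(u) = a*u + b
T_inv = lambda x: (x - b) / a         # inverse transformation T^{-1}
det_J = a                             # Jacobian determinant of T (constant here)

def p_x(x):
    # Change of variables, Eq. (9): p_x(x) = p_u(T^{-1}(x)) * |det J_T|^{-1}
    return p_u(T_inv(x)) / abs(det_J)

# The transformed density should match the analytic density of N(b, a^2).
x = 3.0
analytic = np.exp(-0.5 * ((x - b) / a) ** 2) / (a * np.sqrt(2 * np.pi))
print(p_x(x), analytic)
```

A normalizing flow replaces the fixed affine map by a learned, much more flexible diffeomorphism, but the bookkeeping is exactly this.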

The transformation in a normalizing flow is defined partially through a neural network with parameters $\phi$, as will be described below. If the transformation is flexible enough, the flow can be used to sample from and evaluate any continuous density in $\mathbb{R}^D$. In practice, the property that the composition of diffeomorphisms is a diffeomorphism is used, allowing the construction of a complex transformation via composition of simpler transformations. Consider the transformation $T$ as a composition of $K$ simpler transformations:

$$
T = T_K \circ \cdots \circ T_1. \qquad (11)
$$

Assuming $z_0 = u$ and $z_K = x$, the forward evaluation and Jacobian determinant are

$$
z_k = T_k(z_{k-1}), \quad k = 1{:}K, \qquad (12)
$$
$$
\left|\det J_T(u)\right| = \left|\prod_{k=1}^{K} \det J_{T_k}(z_{k-1})\right|. \qquad (13)
$$

These two computations (plus their inverses) are the building blocks of a normalizing flow (Rezende and Mohamed, 2015). Hence, to make a transformation efficient, both operations have to be efficient. From now on, we will focus on a simple transformation $T$, since constructing a flow from it is simply a matter of composition.

To define a transformation satisfying both efficiency properties, the transformation is broken down into autoregressive one-dimensional ones:

$$
x_i = \tau(u_i;\, h_i) \quad \text{with } h_i = c_i(u_{<i};\, \phi), \qquad (14)
$$

where $x_i$ is the $i$-th component of $x$ and $u_i$ the $i$-th of $u$. $\tau$ is the transformer, a one-dimensional diffeomorphism with respect to $u_i$ with parameters $h_i$. $c_i$ is the $i$-th conditioner, which depends on $u_{<i}$, i.e., the previous components of $u$, and on $\phi$, the parameters of the neural network. The transformer is chosen to be a differentiable monotonic function, since it then satisfies the requirements of a diffeomorphism. This structure makes the transformation easy to compute in parallel, and decomposing it into one-dimensional autoregressive transformers makes the computation of the Jacobian determinant trivial, because of its triangular shape. To compute the parameters $h_i$ of each transformer, one would naively need to evaluate a neural network with input $u_{<i}$ for each component, a total of $D$ times.

Masked autoregressive neural networks (Germain et al., 2015) make it possible to compute all the conditioner functions simultaneously in a single forward pass of the neural network. This is done by masking out, with a binary matrix, the connections of the $i$-th output with respect to all components with index greater than or equal to $i$, making it a function of the $u_{<i}$ components only.
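The masking idea can be sketched with numpy for a single hidden layer. The degree-based construction follows Germain et al. (2015), but the dimensions and the cyclic degree assignment here are illustrative choices:

```python
import numpy as np

D, H = 4, 8                                   # input dimension, hidden units

# Assign a "degree" to every unit: inputs/outputs get 1..D,
# hidden units get degrees 1..D-1 cyclically.
deg_in = np.arange(1, D + 1)
deg_hidden = (np.arange(H) % (D - 1)) + 1
deg_out = np.arange(1, D + 1)

# Input->hidden mask: hidden unit i may see input j only if deg_hidden[i] >= deg_in[j].
M1 = (deg_hidden[:, None] >= deg_in[None, :]).astype(float)
# Hidden->output mask: output i may see hidden j only if deg_out[i] > deg_hidden[j].
M2 = (deg_out[:, None] > deg_hidden[None, :]).astype(float)

# Composed connectivity: which inputs can influence each output.
# Multiplying weight matrices elementwise by M1/M2 enforces the same structure.
connectivity = (M2 @ M1) > 0
print(connectivity.astype(int))
```

The composed connectivity matrix is strictly lower triangular: output $i$ depends only on inputs with index smaller than $i$, which is exactly the autoregressive property needed for a triangular Jacobian.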

The transformer can be defined by any monotonic function, such as affine transformations (Papamakarios et al., 2017), monotonic neural networks (Huang et al., 2018; Cao et al., 2019; Wehenkel and Louppe, 2019), sum-of-squares polynomials (Jaini et al., 2019) or monotonic splines (Müller et al., 2018; Durkan et al., 2019a, b). In this work we will focus on a specific implementation of monotonic splines, the Neural Spline Flows.

#### 2.2.2 Neural Spline Flows

In their work on Neural Spline Flows, Durkan et al. (2019b) advocate utilizing monotonic rational-quadratic splines as transformers $\tau$: they are easily differentiable, more flexible than previous attempts at using polynomials for these transformations (since their Taylor-series expansion is infinite), and analytically invertible.

The monotonic rational-quadratic transformation is defined by a quotient of two quadratic polynomials. In particular, the splines map the interval $[-B, B]$ to itself, and outside of it the identity function is used. The splines are parametrized following (Gregory and Delbourgo, 1982), where $K$ different rational-quadratic functions are used, with boundaries set by the pairs of coordinates $\{(x^{(k)}, y^{(k)})\}_{k=0}^{K}$, known as the knots of the spline, which are the points it passes through. Note that $(x^{(0)}, y^{(0)}) = (-B, -B)$ and $(x^{(K)}, y^{(K)}) = (B, B)$. Additionally, the derivative values at the intermediate knots need to be provided, and have to be positive for the spline to be monotonic. At the boundary points, the derivatives are set to 1 to match the identity function.

With this in mind, the conditioner given by the neural network outputs a vector $\theta_i = [\theta_i^w, \theta_i^h, \theta_i^d]$ of dimension $3K - 1$ for the transformer $\tau(\cdot;\theta_i)$. $\theta_i^w$ and $\theta_i^h$ give the widths and heights between the knots, while $\theta_i^d$ are the positive derivatives at the intermediate knots.

#### 2.2.3 Objective function

Neural density estimators, such as the normalizing flows, are used for two main tasks:

1. Estimate a density through samples in order to be able to evaluate it.

2. Estimate a density through its density function in order to sample data from it.

In both cases, the goal is to minimize the discrepancy between the approximate density $q_\phi(x)$ of the neural network and the target density $p_{\text{target}}(x)$. A common measure of discrepancy is the Kullback-Leibler (KL) divergence. Because of its asymmetry, it can be defined in two ways:

1. The forward KL divergence:

$$
\begin{aligned}
&D_{\mathrm{KL}}\!\left(p_{\text{target}}(x)\,\|\,q_\phi(x)\right) && (15)\\
&= \int p_{\text{target}}(x)\log\!\left(\frac{p_{\text{target}}(x)}{q_\phi(x)}\right)dx. && (16)
\end{aligned}
$$

When minimizing it with respect to the parameters of the neural network, the objective function becomes:

$$
\begin{aligned}
&\arg\min_\phi D_{\mathrm{KL}}\!\left(p_{\text{target}}(x)\,\|\,q_\phi(x)\right) && (17)\\
&= \arg\min_\phi -\int p_{\text{target}}(x)\log q_\phi(x)\,dx && (18)\\
&= \arg\max_\phi \int p_{\text{target}}(x)\log q_\phi(x)\,dx && (19)\\
&\approx \arg\max_\phi \sum \log q_\phi(x) \quad \text{with } x \sim p_{\text{target}}(x). && (20)
\end{aligned}
$$
2. The reverse KL divergence:

$$
\begin{aligned}
&D_{\mathrm{KL}}\!\left(q_\phi(x)\,\|\,p_{\text{target}}(x)\right) && (21)\\
&= \int q_\phi(x)\log\!\left(\frac{q_\phi(x)}{p_{\text{target}}(x)}\right)dx. && (22)
\end{aligned}
$$

When minimizing, the objective function becomes:

$$
\begin{aligned}
&\arg\min_\phi D_{\mathrm{KL}}\!\left(q_\phi(x)\,\|\,p_{\text{target}}(x)\right) && (23)\\
&= \arg\min_\phi \int q_\phi(x)\left[\log q_\phi(x) - \log p_{\text{target}}(x)\right]dx && (24)\\
&\approx \arg\min_\phi \sum \left[\log q_\phi(x) - \log p_{\text{target}}(x)\right] \quad \text{with } x \sim q_\phi(x). && (25)
\end{aligned}
$$

When considering the nature of the problem being solved, the forward KL divergence is suited to estimating the target density from samples, since it does not require evaluating it. In the case of the reverse KL divergence, sampling from the target density is not required, but being able to evaluate it is. The target density does not need to be normalized.

In the following Sections we will consider the reverse KL divergence as our objective function, since our goal is to find a good approximation to a given target density.
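The access requirements of the two directions can be made concrete with a Monte Carlo sketch: the forward direction needs samples of $p$, the reverse direction needs evaluations of $p$. The two Gaussians below are illustrative stand-ins for the target and the flow density, chosen so a closed form exists for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_n(x, mu, sigma):
    """Log density of a 1-D normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

mu_p, s_p = 0.0, 1.0      # stand-in target p
mu_q, s_q = 0.5, 1.5      # stand-in model q

N = 500_000
# Forward KL, cf. Eq. (20): needs samples from p, evaluations of q.
x_p = rng.normal(mu_p, s_p, N)
fwd = np.mean(log_n(x_p, mu_p, s_p) - log_n(x_p, mu_q, s_q))
# Reverse KL, cf. Eq. (25): needs samples from q, evaluations of p.
x_q = rng.normal(mu_q, s_q, N)
rev = np.mean(log_n(x_q, mu_q, s_q) - log_n(x_q, mu_p, s_p))

# Closed form KL between two 1-D Gaussians, for comparison.
kl = lambda m1, s1, m2, s2: np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5
print(fwd, kl(mu_p, s_p, mu_q, s_q))
print(rev, kl(mu_q, s_q, mu_p, s_p))
```

Note the asymmetry: the two estimates converge to different values, which is why the choice of direction is a modeling decision and not a technicality.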

## 3 Method

Consider an analytical or numerical target density $p_{\text{target}}(x)$ from which one wants to either sample to create a data set or compute an expected value. As described in Sec. 2.1, to perform either rejection or importance sampling, a suitable proposal density is necessary. In the following, we use normalizing flows, implemented as NSF, to find such a proposal function $q_\phi(x)$.

When using the reverse KL divergence in Eq. (25) for this task, however, depending on the support of $p_{\text{target}}(x)$, the objective function might explode if there exists $x \sim q_\phi(x)$ such that $p_{\text{target}}(x) = 0$. For densities with a small support area, such as the one of the toy problem in Sec. 4, this could hold for a major proportion of the samples when starting the training, making the updates of the neural network's parameters far from ideal.

We propose to redefine the target function as a convex combination of the target density and a support density $p_{\text{support}}(x)$:

$$
p_{\text{new}}(x) = (1-\alpha)\cdot p_{\text{target}}(x) + \alpha\cdot p_{\text{support}}(x), \qquad (26)
$$

with $\alpha \in (0, 1)$. Effectively, background noise in the form of the support density is added to the target density, with magnitude proportional to $\alpha$.
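The modification of Eq. (26) is a one-liner. The sketch below uses an exponential target (which is exactly zero for $x < 0$, mimicking a density with holes), a standard normal support, and an illustrative $\alpha$; none of these concrete choices come from the paper:

```python
import numpy as np

alpha = 0.05   # illustrative value; the method only requires a small alpha in (0, 1)

def p_target(x):
    """Toy target with a hole: Exp(1) density, exactly zero for x < 0."""
    return np.where(x >= 0, np.exp(-np.clip(x, 0, None)), 0.0)

def p_support(x):
    """Support density: standard normal, positive everywhere."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def p_new(x):
    # Eq. (26): the convex combination stays normalized and has no zeros.
    return (1 - alpha) * p_target(x) + alpha * p_support(x)

x = np.linspace(-5, 5, 1001)
print(np.min(p_target(x)), np.min(p_new(x)))   # first is 0, second is strictly positive
```

Since both components are normalized densities and the weights sum to 1, $p_{\text{new}}$ is automatically a normalized density as well.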

The support function should be positive over a domain which is a union of the supports of both the target density and the initialized neural network density at the beginning of the training. In the case of NSF, for instance, the initial neural network density is mapped into a bounding box of $[-B, B]^D$, suggesting that a good support function might be a multivariate normal density with mean zero and covariance matrix $\sigma^2 I_D$, where $I_D$ is the identity matrix in dimension $D$. Experimentally this has given better results than using a uniform distribution in the bounding box, due to the gradient of the density being different from zero.

If $\alpha$ is small, $p_{\text{new}}(x) \approx p_{\text{target}}(x)$, allowing us to use the learned $q_\phi(x)$ as a proposal function for rejection and importance sampling. An important remark is that $q_\phi(x)$ does not have to fit $p_{\text{new}}(x)$ nor $p_{\text{target}}(x)$ perfectly for these algorithms to sample or compute expected values accurately from the target function. Since $p_{\text{target}}(x)$ can be evaluated, the algorithms from Sec. 2.1 will correct the small discrepancies. The discrepancies come from adding background noise through the support density and from the imperfection of the training, but will in practice be small enough to produce an efficient sampling, as will be shown experimentally in Sec. 4. $p_{\text{new}}(x)$ is used only for training, while $p_{\text{target}}(x)$ is used for the sampling algorithms. This approach avoids the issue of $p_{\text{target}}(x) = 0$ when computing the divergence during the training, while introducing only a small inefficiency.

Figure 1: Three views of the toy model target density $p_{\text{target}}(x)$, with $x = (x_1, x_2, x_3)$, marginalizing one of the variables in each case to visualize the corresponding 2-dimensional densities. It displays the three sharp modes of the density, with a small support over the transformation area.

The problem of exploding KL divergence is not unique to the reverse KL divergence. If instead the forward KL divergence, Eq. (16), is slightly modified using importance sampling to approximate the integral, as proposed in (Müller et al., 2018), the objective function has the expression:

$$
\arg\max_\phi \sum \frac{p_{\text{target}}(x)}{q_\phi(x)} \log q_\phi(x) \quad \text{with } x \sim q_\phi(x). \qquad (27)
$$

Note that if there exists $x \sim q_\phi(x)$ such that $p_{\text{target}}(x) = 0$, then the neural network cannot update its parameters properly through training. The problem occurs under exactly the same condition as when the reverse KL divergence is used. The solution of tweaking the target density with a support function, Eq. (26), also helps in case one wants to use the forward KL divergence, as it eliminates this problem as well.

## 4 Toy Problem

A test of the previously described methodology was performed on a toy model. In Sec. 4.1 the density is described as a combination of three sharp modes. Then, the training procedure is detailed in Sec. 4.2, using a support density to avoid the exploding gradient of the reverse KL-divergence. The results of the training (both training and evaluation were performed on an Intel Core i7-8700 @ 3.20GHz CPU with a GeForce RTX 2080Ti GPU) are discussed in Sec. 4.3, where it is shown that the usage of a support function not only avoids the problem of the support domain, but also does not have a negative impact on the performance of the learned modified density.

### 4.1 Toy multi-mode model description

The following toy model describes a non-trivial 3-dimensional density with 3 modes. The three modes are transformations of a base density $p_u(u)$, following different rotations, scalings and translations in each case (see Fig. 1). $u$ is sampled through the following procedure:

$$
\begin{aligned}
v_1 &\sim \mathcal{N}(0, 1), && (28)\\
(v_2 + 3) &\sim \mathrm{Gamma}(|v_1| + 3,\; 0.3), && (29)\\
v_3 &\sim \mathrm{SkewNormal}(|v_1 \cdot v_2|), && (30)\\
v &= (v_1, v_2, v_3), && (31)\\
R_{\text{mix}} &= \begin{pmatrix} 0.29 & -0.19 & 0.06 \\ -0.19 & 0.37 & 0.015 \\ 0.06 & 0.015 & 0.11 \end{pmatrix}, && (32)\\
u &= v\, R_{\text{mix}}, && (33)
\end{aligned}
$$

with

• $\mathrm{Gamma}(k, \theta)$ the Gamma distribution with shape parameter $k$ and scale parameter $\theta$.

• $\mathrm{SkewNormal}(a)$ the skew normal distribution with shape parameter $a$.

• $R_{\text{mix}}$ an invertible matrix to mix the components, generated randomly once and fixed for reproducibility.

Figure 2: Three views of the marginalized initialized neural network density of the Neural Spline Flow. The density spreads over the whole space initially, making it highly likely that $p_{\text{target}}(x) = 0$ with $x \sim q_\phi(x)$, thus producing the exploding reverse KL-divergence of Eq. (25).
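The hierarchical sampling of Eqs. (28)-(33) can be sketched with numpy alone. This is not the authors' code: the skew normal sampler uses the standard $|Z|$ construction for a skew normal with shape parameter $a$, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_skewnormal(a, size, rng):
    """Standard skew normal with shape a, via the |Z| construction."""
    delta = a / np.sqrt(1 + a ** 2)
    z1 = np.abs(rng.standard_normal(size))
    z2 = rng.standard_normal(size)
    return delta * z1 + np.sqrt(1 - delta ** 2) * z2

def sample_u(n, rng):
    """Hierarchical sampling of v, Eqs. (28)-(31), then mixing, Eqs. (32)-(33)."""
    v1 = rng.standard_normal(n)                              # (28)
    v2 = rng.gamma(shape=np.abs(v1) + 3, scale=0.3) - 3      # (29)
    v3 = sample_skewnormal(np.abs(v1 * v2), n, rng)          # (30)
    v = np.stack([v1, v2, v3], axis=1)                       # (31)
    R_mix = np.array([[0.29, -0.19, 0.06],
                      [-0.19, 0.37, 0.015],
                      [0.06, 0.015, 0.11]])                  # (32)
    return v @ R_mix                                         # (33)

u = sample_u(10_000, rng)
print(u.shape, u.mean(axis=0))
```

Note that each conditional draw depends on previously sampled components ($v_2$ on $v_1$, $v_3$ on both), which is what makes the resulting density non-trivial despite being built from standard distributions.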

Starting from $p_u(u)$, $p_{\text{target}}(x)$ is defined as

$$
\begin{aligned}
p_{\text{target}}(x) &= \sum_{i=1}^{3} \alpha_i\, p_u\!\left(f_i(x)\right), && (34)\\
f_i(x) &= (x - t_i)\, R(\theta_i)^{-1} \cdot s_i, && (35)\\
R(\theta) &= R_z(\theta)\, R_y(\theta)\, R_x(\theta), && (36)
\end{aligned}
$$

with

• $\alpha_i$ the weights of each mode.

• $t_i$ the translation vectors of each mode.

• $R(\theta_i)$ a composition of rotations by angle $\theta$ on the three axes, with one angle $\theta_i$ per mode.

• $s_i$ the scale factor of each mode.

The three marginalized 2-dimensional densities can be seen in Fig. 1, composed of three modes, each scaled, rotated and translated according to the definition above. This density has overall small support (a small fraction of the cube volume), concentrated in localized regions of the volume.

### 4.2 Training setup

The setup for the Neural Spline Flow was the following (refer to (Durkan et al., 2019b) for a comprehensive description of the hyperparameters; these values were chosen due to the proximity of the nature of the density to the ones explored in the original NSF paper for similar problems, and worked well): 128 hidden features, tail bound of 3.5 (i.e., the space transformed within the cube), 10 composed transformations, a batch size of 1024, a validation size of 500k, and 100k training steps with a cosine annealing scheduler for the learning rate. The complete configuration file can be found in the attached code.

Fig. 2 shows the initialized neural network density one could obtain, with the three 2-dimensional marginalized densities. Notice that the density spreads over the whole space in this particular case, making it likely to sample points in regions outside of the support of the target density, thus causing the exploding reverse KL-divergence of Eq. (25).

We redefine the target probability as described in Eq. (26), with a multivariate standard normal distribution in dimension 3 as the support function and a small $\alpha$. With this new objective, the exploding divergence is no longer an issue, as $p_{\text{new}}(x) > 0$ inside the cube.

The training proceeds and converges properly to a value close to zero for the new modified target density, as depicted in Fig. 3. $q_\phi(x)$ is taken as the model with the lowest KL-divergence on the validation set, with a KL-divergence value of 0.017. Since new samples are constantly generated from $q_\phi(x)$ during training, no overfitting is possible.

Figure 3: Kullback-Leibler divergence during the training of the modified density function with added support, converging to a value close to zero (0.017) for the training samples and the validation set. Training time was 5h 32m.

### 4.3 Results

For the purpose of comparison, two straightforward alternative proposal functions aside from the NSF were used:

• A uniform distribution over the cube.

• A multivariate normal distribution. To find the appropriate mean and covariance matrix, one million samples were drawn uniformly in the cube and assigned weights according to $p_{\text{target}}(x)$. The mean of the distribution is the weighted average, and the covariance is the weighted covariance matrix multiplied by a factor of 3 to cover a larger space.

A sample of one million points is generated with each of the proposals, whose weights are computed as $w = p_{\text{target}}(x)/q(x)$, with $q(x)$ either the NSF, the uniform or the multivariate normal distribution. Table 1 shows statistics of the weights of these samples, in particular: mean, variance, maximum value, ratio of zeros, quantile 0.99 and quantile 0.9999. Notice that even though all of them have a mean close to 1 (although the NSF agrees to more significant digits), the variance and quantiles of the uniform and normal distributions are orders of magnitude bigger than those of the NSF.
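The weight diagnostics of Table 1 can be reproduced for any proposal with a few numpy calls. The densities below are simple Gaussian stand-ins for the paper's target and proposals, used only to show the computation:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_n(x, mu, sigma):
    """Log density of a 1-D normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Stand-in densities: target N(0, 1), proposal N(0, 1.5).
N = 1_000_000
x = rng.normal(0.0, 1.5, N)
w = np.exp(log_n(x, 0.0, 1.0) - log_n(x, 0.0, 1.5))

# The same diagnostics reported in Table 1 of the paper.
stats = {
    "mean": w.mean(),
    "variance": w.var(ddof=1),
    "max": w.max(),
    "ratio_zeros": np.mean(w == 0.0),
    "q0.99": np.quantile(w, 0.99),
    "q0.9999": np.quantile(w, 0.9999),
}
print(stats)
```

A mean near 1 checks normalization, while the variance, tail quantiles and zero ratio measure how much sampling efficiency the proposal loses relative to a perfect fit.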

Figures 4 and 5 show the explicit weight distributions. For the NSF case in Fig. 4, notice how, aside from the 2.6% of zeros, the weights are well distributed around 1, with a slight bias towards values bigger than 1. This is due to the convex combination of the target and the support function, making the learned density in general slightly smaller than the original one. In Fig. 5, long tails can be observed, which account for a bigger variance of the computed expectation value due to Eq. (5). Additionally, the proportion of zero-weight values is quite large compared to that of the NSF, as seen in Table 1.

Figure 4: Weight distribution for the Neural Spline Flow proposal function. Weights are centered around 1, with a slight bias towards values greater than 1, due to the objective density modified through the convex combination.

Figure 5: Weight values of the uniform (left) and multivariate normal (right) proposal functions. Long tails are observed, together with large values of the weights, aside from a great number of zero-weight samples.

To test the performance of the method, two functions of $x$ are defined whose expected values are computed with all three proposals:

$$
\begin{aligned}
f_1(x) &= x_1 + x_2 + x_3, && (37)\\
f_2(x) &= \sqrt{x_1^2 + x_2^2 + x_3^2}. && (38)
\end{aligned}
$$

First, the expected values over the target density for these two functions are computed using importance sampling with one million samples and their respective weights. Results are shown in Table 2, comparing them with the expectation computed from real samples of $p_{\text{target}}(x)$. Notice how for the NSF the relative error of the mean is smaller, at least one order of magnitude better than for the other two proposal functions. The estimated error on the mean is of the same order of magnitude as the true deviation of the result.

Next, sets of one million samples from the target distribution are generated using rejection sampling, with the constant factor $k$ from Eq. (7) set to the quantile 0.9999 of the weights generated for Table 1. The results are shown in Table 3, together with the time it took to generate the data sets with each proposal function. Although all three proposal functions produce highly accurate results (as expected when using rejection sampling), the NSF performs 146 times faster than the uniform proposal and 18.6 times faster than the multivariate normal one. Moreover, given that it took only 8.19 s to generate one million samples, rejection sampling can even be preferred over importance sampling for this task, since the result is more accurate.

Overall, the NSF provides an important improvement in the toy problem over the other two proposal functions, both in precision when computing the expected value directly through importance sampling (Table 2) and in the time needed to sample exactly from the distribution through rejection sampling (Table 3).

## 5 Future plans

Normalizing flows offer a new and powerful way of finding adequate proposal functions for rejection and importance sampling in order to perform sampling and compute expected values. This has been seen in this work through the implementation of normalizing flows via neural spline flows, applying it to a complex multimode target density, with a small volume of non-zero values. The modification of the target density through a support function makes the tool useful for real world applications, which usually have narrow, delimited regions of non-zero density.

In the near future, the authors' aim is to apply the methodology tested here to real science problems. In particular, the interest lies in applications to High Energy Physics, to generate samples from a cross section and compute expectations related to it.

## References

• N. D. Cao, I. Titov, and W. Aziz (2019) Block neural autoregressive flow. In UAI, Cited by: §2.2.1.
• C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019a) Cubic-spline flows. ArXiv abs/1906.02145. Cited by: §2.2.1.
• C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019b) Neural spline flows. In NeurIPS, Cited by: §2.2.1, §2.2.2, §2, §4.2.
• M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015) MADE: masked autoencoder for distribution estimation. In ICML, Cited by: §2.2.1.