 # Exhaustive Neural Importance Sampling applied to Monte Carlo event generation

The generation of accurate neutrino-nucleus cross-section models needed for neutrino oscillation experiments require simultaneously the description of many degrees of freedom and precise calculations to model nuclear responses. The detailed calculation of complete models makes the Monte Carlo generators slow and impractical. We present Exhaustive Neural Importance Sampling (ENIS), a method based on normalizing flows to find a suitable proposal density for rejection sampling automatically and efficiently, and discuss how this technique solves common issues of the rejection algorithm.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Motivation

In modern science and engineering disciplines, the generation of random samples from a probability density function to obtain data sets or compute expectation values has become an essential tool. These theoretical models can be described by a target probability density function

. Ideally, to generate samples following , the inverse transformation method is used. To perform the inverse transformation, the cumulative probability has to be calculated and the inverse to this function has to be found. Numerical methods have to be applied to obtain the Monte Carlo (MC) data when this is not feasible computationally. This is especially true for high-dimensional spaces, where the integrals required to find such inverse transformation become analytically challenging.

A standard numerical method to obtain such data set is to perform a Markov Chain Monte Carlo (MCMC) algorithm

Hastings (1970), which provides good results for expected value calculations. Compared to other methods, it has the advantage that, in general, it requires very little calibration, and high dimensions can be broken down into conditional smaller dimension densities Geman and Geman (1984). However, the MCMC method produces samples that form a correlated sequence. Also, the convergence of the samples’ chain to the target density cannot be guaranteed for all possible models.

Another standard algorithm to produce MC samples is the acceptance-rejection or simply rejection sampling Casella et al. (2004); Martino and Míguez (2010); Bishop and Nasrabadi (2007); von Neumann (1951), which produces i.i.d. (independent and identically distributed) samples from the target density via an auxiliary proposal function. The proposal has to satisfy being a density which can both be sampled from and evaluated efficiently, as well as being as close to the target density as possible. The main disadvantages of the method are the following Robert and Casella (2004):

1. Designing the proposal function close to a particular target density can be very costly in human dedication.

2. If taken a generic proposal function, such as a uniform distribution over the domain, the algorithm is usually very inefficient.

3. The inefficiency grows rapidly with the number of dimensions.

Ideally, to avoid these inconveniences, one would like to have a method to find a proposal function that adapts to a given target density automatically. This would solve simultaneously the human time cost as well as the inefficiency of generic proposal densities.

An approach of the usage of normalizing flows to find a suitable proposal for a given target density has been suggested previously as Neural Importance Sampling (NIS) Müller et al. (2018), focused on the integration of functions via importance sampling Kahn and Marshall (1953). Normalizing flows provide an expressive family of parametrized density functions

through neural networks, by defining a differentiable and invertible transformation from a base distribution to a target distribution, allowing to evaluate and sample from complex distributions by transforming simpler ones. The concept of integrating via importance sampling with normalizing flows for High Energy Physics (HEP) has been explored in

Gao et al. (2020), to simulate collider experimental observables for the Large Hadron Collider.

In this work we further explore the possibility of utilizing normalizing flows to find a proposal function for a given target density to perform rejection sampling for MC samples , and analyze its viability through the following points:

• We discuss the importance of adding a background density to the target one to assure the coverage of the whole phase space when performing rejection sampling.

• We define a two-phase training scheme for the normalizing flow to boost initial inefficiency in the optimization when adjusting the initialized random density towards the target one.

• We measure the performance of the method and argue for relaxing the rejection sampling constant factor to improve largely the efficiency of acceptance while quantifying the error committed in doing this approximation via the concept of coverage.

Considering the proposed algorithm covers the whole phase space by modifying NIS with the background density, we denote this method by Exhaustive Neural Importance Sampling (ENIS).

We apply the above algorithm to a HEP problem, in the form of the charged current quasi-elastic (CCQE) cross-section for anti-neutrinos interactions with nuclei, performing in-depth analysis and discussion of the efficiency of the method. Neutrino-nucleus cross-section modeling is one of the main sources of systematic uncertainties in neutrino oscillations measurements Alvarez-Ruso and others (2018). Cross-section models are either analytically simple but poorly describing the experimental data or involving complex numerical computations, normally related to the description of the nucleus, that imposes limitations in their MC implementation. New tendencies in the field also call for a fully exclusive description of the interaction adding complexity to the calculations. The analytical model utilized in this paper is simple, but it is a realistic one and a good reference to demonstrate the capabilities of the proposed method to generate neutrino-nucleus cross-sections efficiently. We will show that ENIS opens the possibility to incorporate efficiently complex theoretical models in the existing MC models enhancing the physics reach of running and future neutrino oscillation experiments.

ENIS algorithm may be used beyond the scope of neutrino physics. Further applications to be evaluated in detail in the future are particle/nuclear physics experiments, detector responses for medical physics, engineering studies or theoretical modelling. In general, it could be applied to any Monte Carlo simulation that is limited by the algorithm’s speed, such as for importance sampling to provide fast Monte Carlo with sufficient accuracy (i.e. fast detector simulation, design studies, minimum bias background simulations, etc.). Additionally, the technique may help model developers extract expected values from their theoretical predictions in realistic conditions by including simple detector effects in models, such as effects of detector acceptance cuts, impact of model degrees of freedom on the predictions or uncertainty propagation.

## Ii Framework

In this Section we will describe background and framework needed for the rest of the paper. Sec. II.1 explains the physical model of charged current quasi-elastic neutrino interaction we will apply ENIS to in Sec. IV. As a summary and to introduce our notation, Sec. II.2 overviews the rejection sampling algorithm. Finally, in Sec. II.3 we make a modest introduction to normalizing flows, focusing on the implementation of Neural Spline Flows.

### ii.1 Model of Charged Current Quasi-Elastic Anti-Neutrinos Interactions with Nuclei

The Charged Current Quasi-elastic (CCQE) is a basic model of neutrino interactions that might be expressed in simple formulae. The CCQE model has many advantages during this exploratory work, as it can be implemented in a simple software function, while at the same time it is also a realistic environment to understand the implications of modeling cross-sections with the proposed methodology. The selected model to describe CCQE is the well established Smith-Monith Smith and Moniz (1972). The nucleon momentum distribution follows a Relativistic Fermi Gas (non-interacting nucleons in a nuclear potential well) with a

GeV/c Fermi level. The model includes the Pauli blocking, preventing the creation of final state nucleons below the nucleus Fermi level. The model can be applied both to neutrinos and antineutrinos interactions. Antineutrinos are selected for this study due to the vector axial current cancellation imposing more stringent conditions at the edges of the kinematic phase space. The model includes the following degrees of freedom generated by the Monte Carlo model: the neutrino energy, the

momentum and angle, and the target nucleon Fermi momentum. Contrary to other MC implementations, the neutrino energy is not a fixed input value but it is generated by the algorithm to add complexity to the calculations and to check the capabilities of the calculations to reproduce the cross-section as a function of the neutrino energy. The implementation of this model for fixed energy value is also possible. The basic kinematic distributions obtained with this model will be discussed in Sec. IV.

### ii.2 Rejection sampling

Rejection sampling is a well known technique Casella et al. (2004); Martino and Míguez (2010); Bishop and Nasrabadi (2007); von Neumann (1951); Robert and Casella (2004) to obtain MC samples from a target density which can be evaluated (up to a constant), but cannot be sampled from through the inverse transform. It relies on an auxiliary proposal function , from which one should be able to sample from and evaluate efficiently. A constant is introduced which has to satisfy that

 k⋅q(x)≥p(x)∀x:p(x)>0. (1)

The resulting function is called the comparison function.

The procedure to sample from the target density is then the following:

1. A sample is generated following , .

2. A random number is generated uniformly in the range , .

3. If fulfills the condition , the sample is accepted; otherwise, it is rejected.

Additionally, if is normalized, the probability that a sample is accepted is proportional to , i.e., gives an intuition of the number of tries until we obtain an accepted sample.

### ii.3 Neural density estimation using Neural Spline Flows

A family of density functions over the real D-dimensional space parametrized by satisfies that for all , and that for all . Normalizing flows are a mechanism of constructing such flexible probability density families

. A comprehensive review on the topic can be found in Papamakarios et al. (2019), from which a brief summary will be shown in this Section on how normalizing flows are defined, and how the parameters are obtained, together with a specific implementation, the Neural Spline Flows (NSF) Durkan et al. (2019b).

Consider a random variable defined over , with known probability density . A normalizing flow characterizes itself by a transformation from another density of a random variable defined also over , the target density, to this known density, via

 u=T(x), with x∼p(x). (2)

The density is known as base density, and has to satisfy that it is easy to evaluate (e.g., a multivariate -dimensional normal, as will be chosen through this work, or a uniform distribution in dimension ). The transformation has to be invertible, and both and have to be differentiable, i.e., defines a diffeomorphism over .

This allows us to evaluate the target density by evaluating the base density using the change of variables for density functions,

 (3)

where the Jacobian is a matrix of the partial derivatives of the transformation :

 JT(x)=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣∂T1∂x1⋯∂T1∂xD⋮⋱⋮∂TD∂x1⋯∂TD∂xD⎤⎥ ⎥ ⎥ ⎥ ⎥⎦. (4)

The transformation in a normalizing flow is defined partially through a neural network with parameters , as will be described below, defining a density

 qϕ(x)=pu(Tϕ(x))|detJTϕ(x)|. (5)

The subindex of will be omitted in the following, simply denoting the transformation of the neural network by .

If the transformation is flexible enough, the flow could be used to evaluate any continuous density in . In practice, however, the property that the composition of diffeomorphisms is a diffeomorphism is used, allowing to construct a complex transformation via composition of simpler transformations. Consider the transformation as a composition of simpler transformations:

 T=TK∘⋯∘T1. (6)

Assuming and , the forward evaluation and Jacobian are

 zk =Tk(zk−1),k=1:K, (7) |JT(x)| =∣∣ ∣∣K∏k=1JTk(zk−1)∣∣ ∣∣. (8)

These two computations (plus their inverse) are the building blocks of a normalizing flow Rezende and Mohamed (2015). Hence, to make a transformation efficient, both operations have to be efficient. From now on forth, we will focus on a simple transformation , since constructing a flow from it is simply applying compositions.

To define a transformation satisfying both operations to be efficient, the transformation is broken down into autoregressive one-dimensional ones for each dimension of :

 ui=τ(xi;hi) with hi=ci(x

where is the -th component of and the -th of . is the transformer, which is a one-dimensional diffeomorphism with respect to with parameters . is the -th conditioner, a neural network, which takes as input , i.e., the previous components of , and are the parameters of the neural network. The conditioner provides the parameters of the -th transformer of depending on the previous components , defining implicitly a conditional density over with respect to . The transformer is chosen to be a differentiable monotonic function, since then it satisfies the requirements to be a diffeomorphism. The transformer also satisfies that it makes the transformation easily computable in parallel and decomposing the transformation in one dimensional autoregressive transformers allows the computation of the Jacobian to be trivial, because of its triangular shape. To compute the parameter of each transformer, one would need to process a neural network with input for each component, a total of times.

Masked autoregressive neural networks Germain et al. (2015) enable us to compute all the conditional functions simultaneously in a single forward iteration of the neural network. This is done by masking out, with a binary matrix, the connections of the -th output with respect to all the components with index bigger or equal to , , making it a function of the components.

The transformer can be defined by any monotonic function, such as affine transformations Papamakarios et al. (2017), monotonic neural networks Huang et al. (2018); Cao et al. (2019); Wehenkel and Louppe (2019), sum-of-squares polynomials Jaini et al. (2019) or monotonic splines Müller et al. (2018); Durkan et al. (2019a, b). In this work we will focus on a specific implementation of monotonic splines, the Neural Spline Flows.

In their work on Neural Spline Flows Durkan et al. (2019b), Durkan et al. advocate for utilizing monotonic rational-quadratic splines as transformers , which are easily differentiable, more flexible than previous attempts of using polynomials for these transformers, since their Taylor-series expansion is infinite, and are analytically invertible.

Each monotonic rational-quadratic function in the splines is defined by a quotient of two quadratic polynomial. In particular, the splines map the interval to , and outside of it the identity function is considered. The splines are parametrized following Gregory and Delbourgo Gregory and Delbourgo (1982), where different rational-quadratic functions are used, with boundaries set by the pair of coordinates , known as knots of the spline and are the points where it passes through. Note that and . Additionally, we need intermediate positive derivative values, since the boundary points derivatives are set to 1 to match the identity function.

Having this in mind, the conditioner given by the neural network outputs a vector of dimension for the transformer , . and give the width and height of the bins, while is the positive derivative at the intermediate knots.

Stacking up many of these transformations, a highly flexible neural density estimator, the NSF, can be build, which satisfies:

1. It is easy to sample from using the inverse transform in Eq. (2) by sampling .

2. Eq. (5) allows to evaluate the densities of these samples when generating them in an efficient way.

This density estimator will be the one utilized during this work.

## Iii Methodology

With the framework introduced in Sec. II, we are now in a position to define the ENIS method and the different magnitudes we will use to measure its performance.

We start in Sec. III.1 by showing the objective function to be minimized by the NSF to adjust its proposal function to the target density . Then, in Sec. III.2, we discuss the importance of adding background noise to both ensure coverage of the whole phase space of and to boost the initial phase of the training of ENIS. The exact training scheme is then shown in Sec. III.3, differentiating the warm-up phase from the iterative phase. Finally, in Sec. III.4, the performance metrics are introduced, explaining the concept of coverage and effective sample size when considering a more relaxed condition on the rejection constant .

### iii.1 Optimizing the parameters of the NSF

Consider a target probability density function which can be evaluated for all but from which we are unable to sample data directly through an analytical inverse transform. If we could approximate this target density by our neural density estimator , then we could exactly sample from the target density using rejection sampling, since we can both sample and evaluate from .

To obtain the parameters of given a density

which can be evaluated, we want to minimize the Kullback-Leibler divergence (KL-divergence)

Kullback and Leibler (1951) between both distributions, which is and only equal to zero if both distributions match:

 DKL(p(x)∥qϕ(x))=∫p(x)log(p(x)qϕ(x))dx. (10)

When minimizing with respect to , the KL-divergence is simplified to

 (11) argminϕDKL=argminϕ∫p(x)log(p(x)qϕ(x))dx (12) argminϕDKL=argminϕ−∫p(x)logqϕ(x)dx (13) (14) argminϕDKL=argmaxϕEx∼p(x)[logqϕ(x)]. (15)

The last expression could be approximated numerically if we could sample , since it corresponds to approximating an expected value of a function we can evaluate, , and is equal to maximizing the log-likelihood of these samples:

 L=1NN∑i=1log(qϕ(xi)) with xi∼p(x). (16)

Müller et al. Müller et al. (2018) propose a solution for computing the gradient with respect to for this maximization problem. They suggest using importance sampling Kahn and Marshall (1953) for this particular expected value:

 ∇ϕEx∼p(x)[logqϕ(x)] =∫p(x)∇ϕlogqϕ(x)dx (17) =∫qϕ(x)p(x)qϕ(x)∇ϕlogqϕ(x)dx (18) (19) ≈1NN∑i=1w(xi)∇ϕlogqϕ(xi), (20)

with and the weights defined as . Notice how we only need to be able to evaluate to compute this quantity. With this gradient, we are able to minimize the KL-divergence in Eq. (10) if the support of (i.e., the domain where the function is non-zero) contains the support of to perform the importance sampling of Eq. (20) correctly. Notice that, in order to properly optimize the parameters , does not need to be normalized, since this simply changes the magnitude of the gradient, but not its direction. The lack of proper normalization can be properly handled by standard neural network optimizers such as Adam Kingma and Ba (2014).

The method described by Eq. (20) implies an iterative way of optimizing with the following steps:

1. A batch of is generated according to the current state of the neural network, .

2. Using this batch, the parameters of the neural network are optimized via the gradient of Eq. (20).

3. This updated neural network then generates the next batch.

### iii.2 Relevance of background noise

As briefly discussed in previous Section II, in order to optimize the neural network following Eq. (20), the gradient is only correctly computed if the support of contains the one of . Moreover, if we want to use as our proposal function to sample from via rejection sampling, this also has to hold.

To ensure the proper support, we introduce the concept of a background density function, . In HEP, as in many other scientific areas, the density is restricted to a certain domain of , e.g., the cosine has to be in , the magnitude of the momentum in an experiment has to be positive and has a maximum value of , there are constraints in the conservation of energy and momentum, etc… Hence should be a density that has a support beyond these phase-space boundaries. In what follows, a uniform distribution will be considered, with limits in each dimension according to the phase space of that coordinate. The selection of the functional form of the is arbitrary and it can be selected to adapt to the requirements of each project.

The background density will be used for two tasks:

1. Improve initial training: At the beginning of the training, we cannot assure that the support of contains the one of . Hence, instead of using for the importance sampling of Eq. (20), will be used during the warm-up phase of the training. The distribution of the weight function might span several orders of magnitude, but this way we ensure the full support of . This strategy gives a better approximation than the one obtained by the randomly initialized neural network at the start of the training.

2. Ensure exhaustive coverage of the phase space: The target density that the neural network will learn will be constructed as a linear combination of the true target density and the background :

 ptarget(x)=(1−α)⋅p(x)+α⋅pbg(x), (21)

with . This implementation adds a certain percentage of background noise to the target density, spreading it over all the domain of the background density, allowing to properly apply the methods rejection and importance sampling with as the proposal function, covering exhaustively the phase space. Experimentally we have found good compromise with .

Optimizing to match of Eq. (21) instead of will make the proposal slightly worse for rejection/importance sampling. By performing the optimization to directly in an iterative way, as explained at the end of the last Section, some regions of the phase space might disappear for future samplings. These regions are located normally close to the boundaries of sampled volume. Having a constant background noise prevents these losses from appearing, as the neural network has to also learn to generate this noise, covering properly the required phase-space volume. We will discuss the impact of the background term on the method performance in Section IV.

### iii.3 ENIS training scheme of the proposal function

The training procedure to obtain from following ENIS consists of two phases:

1. Warm-up phase:

1. Sample and compute their weights .

2. Sample background with associated weights , where .

3. Optimize the parameters of via Eq. (20) using with weights .

2. Iterative phase:

1. Sample and compute their weights .

2. Sample background with associated weights , where .

3. Optimize the parameters of via Eq. (20) using with weights .

Fig. 1 depicts a flow diagram of the training method for ENIS, showing on the left block the warm-up phase, while on the right block the iterative phase.

Steps 1. (ii) and 2. (ii) allow the method to add background following Eq. (21) to construct even if is not normalized. Figure 1: Exhaustive Neural Importance Sampling flow diagram. The warm-up phase is depicted on the left block and the iterative phase on the right block. See text for a detail description.

### iii.4 Measuring the performance of the proposal function

The proposed method is not to use as a direct approximation of , but as proposal function to perform either rejection (Sec. II.2) or importance sampling Kahn and Marshall (1953). This allows for the methods to correct any deviation in the neural network modeling of the exact density while utilizing its proximity to such density.

We use the learned probability density function to generate samples via rejection sampling (see Sec. II.2), which, in HEP, is of high interest and costly via standard procedures. The parameter of the rejection algorithm has to be estimated empirically. Consider samples generated with the proposal function , with weights , satisfying:

• The average of the weights is

 ⟨w⟩=1NN∑i=1w(xi)≈∫q(x)w(x)dx=C, (22)

where is the normalization of the density , i.e., its volume.

• , the smallest constant such that the inequality of Eq. (1) holds, is equal to .

In real conditions, the parameter can be relaxed. Instead of choosing the maximum value among the empirically computed weight distribution, it can be taken as the inverse of the

-quantile of these weights,

, denoted by :

 kQ=(Q-quantile(w))−1=w−1Q. (23)

This is equivalent of clipping the weights’ maximum value to the -quantile of , capping the desired density function we are generating using these weights for the rejection. The new weights are simply:

 w′(xi)={w(xi)% if w(xi)≤wQwQif w(xi)>wQ. (24)

The ratio of volume with respect to the original density we are maintaining by clipping the weights this way defines the coverage we have of the rejection sampling, and is equal to

 Coverage=∑Ni=1w′(xi)∑Ni=1w(xi). (25)

This allows us to quantify the error we are committing when choosing a quantile over the maximum of weights when defining a constant for rejection sampling.

The idea behind relaxing this constant is that we will approximate wrongly only a small region of with . In that small region, the ratio is large compared to the rest of the domain but still it is occupying a small volume of the density . This region can be ignored by relaxing , making the overall ratio of closer to 1 and improving drastically the rejection sampling at the cost of this small discrepancy which we are committing, quantified in Eq. (25).

As an additional qualitative measurement of the goodness of different proposals under different constants , the effective sample size (ESS) will be used Liu (1996), which corresponds approximately to the number of independent samples drawn. The ESS for samples of weights is defined as:

 NESS=(N∑i=1w(xi))2/N∑i=1w(xi)2. (26)

This is a rule of thumb to obtain the number of independent samples. The closer is to the number of samples , the more uncorrelated the weighted samples are. If large weights appear, then the independence of the samples will be diminished, as a same sample gets represented many times. We define as a rough estimate for the ratio of independence of the samples.

## Iv Monte Carlo generation of the CCQE antineutrino cross-section

We will now proceed to apply ENIS to the CCQE antineutrino cross-section density. In Sec. IV.1, we discuss how the training for the NSF was performed, describing the background we added to cover the phase space. We show qualitatively the obtained densities and compare them to the target one. After obtaining a suitable proposal, we discuss in depth the performance of the obtained result in Sec. IV.2, comparing the ENIS proposal to a generic uniform one, demonstrating its potential while justifying the relaxation on the constant for the rejection sampling.

### iv.1 Training

To find the proposal function via NSF for the CCQE antineutrino interaction cross-section density, described in Sec. II.1, we followed the training scheme from Sec. III.3.

The background chosen is a uniform distribution, covering a range of for the incoming neutrino energy (in GeV), for the outgoing muon momentum (in GeV/c), for the angle of the outgoing muon (in rad), and for the target nucleon Fermi momentum (in GeV/c). These bounds were expanded by covering a slightly more extended domain, of an additional 2% at the beginning and end of each dimension, to assure that the physical boundaries are completely covered. This expanded background was added with a contribution to the cross-section density in Eq. (21), as well as used during training for the warm-up phase.

As for the hyperparameters of the NSF, we have chosen the

Adam optimizer Kingma and Ba (2014) with learning rate , batch size , training steps , flow steps, transform blocks, hidden features and bins. This gives a total dimension of for the parameters of . This configuration for the NSF was chosen experimentally to have a relatively low number of parameters (one can have easily six million parameters instead of the we have) since a lower number speeds up the generation and evaluation of samples . Additionally, the learning rate was decreased during the training using a cosine scheduler to ensure stabilization at the end of the training procedure.

The training consists in maximizing the log-likelihood of Eq. (16) by computing its gradient via Eq. (20), and is shown over the iterations in Fig. 2. In the grey area, the training is performed with samples of the background distribution , while in the white area the samples of the training samples are generated by the current proposal distribution

. Notice that since the samples are generated in real-time during the training, there is no need to worry about possible overfitting of the parameters of the neural network, which is a common issue in many machine-learning applications. The values of Fig.

2 are computed every one thousand steps, for a batch of samples . The log probability can be seen to converge at the end of the training, which is mainly due to the cosine scheduler, but also due to the saturation over the family of parametrized densities . Figure 2: Neural Spline Flow training log probability for estimating the modified CCQE cros-section with background noise, following Eq. (16). In grey, the warm-up phase is performed, using pbg(x) to generate the weighted samples, while in the white area the current state of the NSF qϕ(x) is used. The log probability stabilizes during the training to converge to a certain value which depends on the expressiveness of the network and the normalization of the target density.

To have a visual representation, Fig. 3 shows the marginalized 1-dimensional densities of the four cross-section variables of the target density (blue) vs the NSF proposal (orange). The plots show a small discrepancy in each variable, but an overall agreement between the two densities. Aside from a mismodeling on the side of in certain regions, the differences can also come from the fact that the NSF is learning a modified target density (Eq. (21)). Figure 3: p(x) (blue) vs qϕ(x) (orange) 1-dimensional normalized histograms of the marginalized CCQE cross-section density for each of the variables. The plots show light discrepancy in each variable, but an overall agreement between the NSF proposal qϕ(x) and the CCQE cross-section density p(x). Figure 4: 2D histograms comparison of cross-section density for the real cross-section p(x) (left) and the proposal density qϕ(x)(right). Visually, an overall agreement can be seen.

To asses qualitatively that the correlation between the variables are also captured by , Fig. 4 shows 2D-histograms for both the real density (left) and the the proposal density (right). Visually, an overall agreement can be seen. There is a slight discrepancy for high energy values, where the attenuation indicates that for the NSF proposal function it is more spread due to the background noise it is also learning (Eq. (21)).

In what follows the performance of the NSF proposal will be discussed in more quantitative ways, and compared it to a uniform proposal.

### iv.2 Performance and discussion

In this section we will focus on analyzing the performance of the proposal density obtained by the NSF while also comparing it to a uniform proposal density, , which in our case will be the same as , defined in Sec. IV.1. Figure 5: Logarithmic weight distribution for ten million samples from the NSF proposal qϕ(x) (left) vs the same number of samples from the uniform proposal pUnif(x) (right). Notice that we are only computing the logarithm for weights > 0. For the NSF, all weights are concentrated around log10w=0 with a small dispersion around it, while for the uniform distribution the spectrum of weights goes over 9 orders of magnitude.

We start by generating ten million samples from each proposal density and compute their associated weights. The proportion of samples with weight equal to zero is 5.56% for the NSF proposal, compared to the 98.03% for the uniform one. To understand the distribution of such weights, Fig. 5 shows the logarithmic scale of them (for the weights 0), assuming the average of the weights is equal to 1. For the NSF (left), all weights are concentrated around with a small dispersion around it. Notice that there are only three weights in ten million barely greater than hundred. This shape justifies using not the maximum value of to perform rejection sampling, but some quantile of it, as we will discuss below. Contrary, for the uniform distribution (right) we can see that the spectrum of weights goes over nine orders of magnitude. The mean for is , while for we obtain an average of , indicating a huge fluctuation in the magnitude of the weights.

The results of the performance test for rejection sampling are summarized in Tab. 1, where we compare various quantities for the NSF and uniform proposal functions. For this, different quantiles for the constant for the rejection method are used, following Eq. (23), relaxing its restriction as discussed in Sec. III.4. The quantiles for were chosen using the ten million weights computed for the previous discussion of the weight magnitudes, as well as the probability of acceptance, the coverage, and the effective sample size. We considered a case of sampling one million accepted samples via rejection sampling, where samples from the proposal were generated and checked for acceptance/rejection in parallel, in batches of samples. The purpose of the parallelization is to exploit the computational capacities of a GPU. We denoted each of these batches of generating and checking a cycle of the rejection sampling. The values in Tab. 1, for each quantile value and a proposal function, are the following:

• : probability of accepting a single event, given by the average of . If , it is taken as 1 for the computation.

• Cycles: number of rejection sampling cycles of size samples needed to obtain one million accepted samples:

 Cycles=⌈106paccept⋅3×105⌉ (27)
• Time: seconds it takes to compute these cycles and obtain one million accepted samples: .

• Coverage: volume of the original density covering when taking with a certain quantile (Eq. (23)), following Eq. (25).

• : the ratio of effective sample size over the total number of samples, quantifying an estimate of the ratio of independence of the events. This was computed for a sample size of 10 million.

The probability of acceptance, , for the NSF is at least one order of magnitude higher than the one obtained from uniform sampling. Additionally, NSF grows rapidly towards acceptance while also covering of the original density volume, as shown in the Coverage column. This is not the case for the uniform distribution, which, while being only one order of magnitude behind NSF with regards to acceptance, is missing a large volume of coverage of the original density.

The number of rejection sampling cycles needed to achieve the desired number of accepted samples is inversely proportional to , as shown in Eq. (27). In a cycle, the algorithm has to sample from the proposal and evaluate both the proposal and . For the NSF, the cycles get stalled when reaching a high percentage of acceptance, since the number of cycles has to be a whole number, which is equivalent to the whole number +1. Notice how the number of cycles for the NSF is two orders of magnitude smaller compared to the uniform one, however, the coverage of the uniform drops drastically when decreasing the quantile, and hence the quality of the samples. We will discuss more in-depth in Appendix A.

When looking at the time it takes to obtain one million accepted samples, it is directly proportional to the cycles for each proposal. The main difference is that a cycle for the uniform proposal takes a fraction of the time of a cycle for the NSF. This is because sampling and evaluating for the NSF is heavy computationally compared to doing this task for a uniform distribution. As mentioned before, see Appendix A for a more in-depth discussion.

The coverage is the main quantity of measurement of the quality of the produced samples since it measures the volume conserved of the original distribution when performing rejection sampling with certain quantiles. For all the chosen quantiles, the NSF drops a volume  %, while for the uniform distribution the loss is of  % for quantile ,  % for , and  % for , which is unacceptable when trying to produce samples from the original distribution. For the NSF this level of performance when taking the above quantiles is expected, as in Fig. 5 we have seen that the upper tail of weights with large magnitudes is barely a percentage over the whole distribution. However, for the uniform distribution, the loss of coverage is caused by two facts: (i) of the weights are zero, hence placing the whole distribution on a of the weights. (ii) These weights, as seen in Fig. 5, span over many orders of magnitude, making a cut on the quantile of their distribution more noticeable, as will be discussed below.

To visualize the coverage and the regions missing by choosing a quantile and different proposal functions, Fig. 6 shows 2-dimensional histogram representation of the marginalized coverage bin-to-bin, taking the variables in pairs, where each bin quantifies the coverage of that bin (i.e., the sum of weights in that bin after choosing a certain quantile over the sum of weights of those weights without clipping). On the left, the coverage for the NSF is presented and shows that only few regions of the phase space have values smaller than 1, and even in those regions the coverage has no noticeable discrepancies. On the right, the coverage of the uniform proposal is shown for the same quantile , marking clear regions where the coverage drops drastically to values close to zero. Figure 6: 2D histogram representation of the marginalized coverage bin-to-bin for NSF proposal (left) and for the uniform proposal (right), for k-quantile=0.999. The coverage of the NSF presents barely discrepancies in small areas, justifying the use of a quantile for k to improve acceptance and time, as shown in Tab. 1. For the uniform proposal, the coverage presents an important size of the total area with significant low coverage, which is unacceptable when trying to perform rejection sampling from it.

When comparing both coverage regions in Fig. 6, a clear pattern be seen for the uniform one, while it looks quite random for the NSF. This is because the coverage is related to the ratio , with the corresponding proposal density. For the NSF, has a shape very closely related to , as shown in Fig. 3 and Fig. 4, so the coverage would correspond to regions where the discrepancy is large, which has a chaotic behavior. Contrary, for the uniform proposal, , this ratio is proportional to , hence, by clipping, we are doing so according to that particular shape, making the coverage less chaotic and more structured. This translates into making highly probable areas equally likely than others with less probability, affecting this exact group of regions as we will now analyze.

Fig. 6 gives us an overall picture of where the densities are wrongly estimated by choosing certain quantile, but it does not quantify or indicate the amount of error, that is, it is not telling us whether the coverage is poor in areas of small or high density. To answer this question, a multidimensional histogram overall four dimensions was performed, with a binning of 20 in each dimension. Then, for each bin, we compute the percentage of weight for a proposal ,

 \% wq of bin =∑x∈binwq(x)/∑xwq(x), (28)

which is equivalent to the percentage of density in that bin, and the coverage for the quantile . Fig. 7 shows a histogram of the number of bins according to their % of bin vs their coverage. Notice how % is taken on the logarithmic scale. For the NSF (Fig. 7 left), the regions of coverage visibly smaller than one are two to four orders of magnitude smaller in %  than the denser high %  region on the top right. This means that the areas being misrepresented by taking the quantile are relatively small. Also, most of the area with coverage are close to full coverage. Contrary, the uniform proposal (Fig. 7 right) shows the coverage dropping for high values of % , indicating that important regions of the original density are being trimmed down by choosing . This observation is in agreement with the total coverage we are seeing in Tab. 1. Figure 7: Representation of the number of bins according to their %wq of bin vs their coverage for four-dimensional bins in p(x), taking kQ0.999. For the NSF (left), the coverage is barely lower than one in most of the bins, and when it drops it is for low %wq. Contrary, for the uniform proposal (right), the coverage drops for high %wq, making the overall coverage way smaller than the one for NSF, as shown in Tab. 1.

Concerning the ratio of ESS, we can see that for the NSF it is larger than , giving highly uncorrelated events. For the uniform proposal however, the ESS drops to the range of , even for lower quantiles. The differences can be understood from Eq. (26) and by looking at the weight distribution for each proposal in Fig. 5. A large percentage of NSF weights have the same order of magnitude while for the uniform we go over a spectrum of 8 orders of magnitudes. Additionally this ratio of ESS has to be considered for the area of the original distribution given by the coverage, where the uniform distribution misses a lot by choosing smaller quantiles.

To summarize the analysis performed on the results of Tab. 1, in the case of the NSF, by lowering the quantile, the probability of acceptance grows until reaching almost , reducing the time to a  % of the original one while maintaining coverage of over  %. These scores allow us to justify using a smaller quantile for rejection sampling to improve significantly the performance in time, while also quantifying the misrepresentation we are doing by lowering the constant . Contrary, for the uniform proposal, the analysis showed the weakness when trying to utilize smaller quantiles, lowering the coverage by over  % when using a quantile of only . Additionally, looking at the and Fig. 5, we can see that most of the distribution is concentrated in a relatively small number of samples.

## V Conclusion

In this letter we have presented Exhaustive Neural Importance Sampling (ENIS), a framework to find accurate proposal density functions in the form of normalizing flows. This proposal density is subsequently used to perform rejection sampling on a target density to obtain MC data sets or compute expected values. We argue that ENIS solves the main issues associated with rejection sampling algorithms as described in the introduction: (i) The training to find a good proposal is done automatically from the target density, with little configuration needed from the human point of view. (ii) Compared to generic proposal functions such as the uniform one, the normalizing flow adapts its density over the target one, getting rid of the inefficiencies which usually are on the downside of the method. (iii) The proposal function is generated based on a set of trivial normally distributed random numbers transformed through the flow, without any rejection method applied.

The performance of the method has been demonstrated and analyzed in a real case scenario, the CCQE cross-section of the antineutrino interaction with a nucleus , where the density is four-dimensional. We have shown that, for the normalizing flow proposal, we can relax the condition in the constant , used to construct the comparison function, boosting greatly the efficiency (up to of acceptance rate) while committing a very small error on the target density (less than a ), bringing orders of magnitude of speed up in computing time compared to the same error committed by a uniform proposal. Additionally, we investigated the coverage of the generation method as a function of the constant . We showed that the areas of the phase space where the error is committed are less relevant to the final result compared to the error in the uniform case.

The method can be used to generate fast MC samples in cases where the precision is less relevant versus the algorithm speed. High Energy Physics presents some of these examples such as extensive statistical studies based on ”Asimov data sets”, fast detector simulations, or simply in fast studies for detector developments and designs. In those cases, the learned proposal function might be sufficiently precise and easy to generate with a set of simple normal random generators transformed through the flow.

As of usage, ENIS brings the possibility of applying the same normalizing flow for rejection sampling of similar densities, e.g., densities coming from a model where the parameters are changed slightly, altering the overall density smoothly. The weight distribution of the ratio between target and proposal will be altered, but no additional training of the neural network would be needed regarding the theoretical model remains similar to the original one. This is a huge advantage compared to methods like MCMC, where one would have to rerun the complete algorithm to obtain samples from each of the different densities.

As a last remark, we have demonstrated the potential of ENIS for the four-dimensional CCQE interaction density. We believe this will only be more noticeable when applying it to higher dimensions, as the original paper of NSF Durkan et al. (2019b) shows a remarkable performance on spaces of dimensions up to sixty-three. The advantages will be also more obvious as the underlying cross-section model becomes more complex and computationally involved.

###### Acknowledgements.
The authors have received funding from the Spanish Ministerio de Economía y Competitividad (SEIDI-MINECO) under Grants No. FPA2016-77347-C2-2-P and SEV-2016-0588, from the Swiss National Foundation Grant No. 200021_85012 and acknowledge the support of the Secretariat of Universities and Research of the Department of Business and Knowledge of the Generalitat of Catalonia. The authors also are indebted to the Servei d’Estadística Aplicada from the Universitat Autònoma de Barcelona for their continuous support. The authors would also like to thank C. Durkan et al. for the code provided in their paper Durkan et al. (2019a) which served as a basis to build ENIS, and Tobias Golling for the valuable discussion he brought to the table.

## References

• L. Alvarez-Ruso et al. (2018) NuSTEC White Paper: Status and challenges of neutrino–nucleus scattering. Prog. Part. Nucl. Phys. 100, pp. 1–68. External Links: 1706.03621, Document Cited by: §I.
• C. M. Bishop and N. M. Nasrabadi (2007) Pattern recognition and machine learning. J. Electronic Imaging 16, pp. 049901. Cited by: §I, §II.2.
• N. D. Cao, I. Titov, and W. Aziz (2019) Block neural autoregressive flow. In UAI, Cited by: §II.3.
• G. Casella, C. P. Robert, and M. T. Wells (2004) Generalized accept-reject sampling schemes. In A Festschrift for Herman Rubin, A. DasGupta (Ed.), Lecture Notes–Monograph Series, Vol. Volume 45, pp. 342–347. External Links: Cited by: §I, §II.2.
• C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019a) Cubic-spline flows. ArXiv abs/1906.02145. Cited by: §II.3, §V.
• C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019b) Neural spline flows. In NeurIPS, Cited by: §II.3, §II.3, §II.3, §V.
• C. Gao, S. Höche, J. Isaacson, C. Krause, and H. Schulz (2020) Event generation with normalizing flows. Phys. Rev. D 101, pp. 076002. External Links: Cited by: §I.
• S. Geman and D. Geman (1984) Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6), pp. 721–741. Cited by: §I.
• M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015)

.
In ICML, Cited by: §II.3.
• J. A. Gregory and R. Delbourgo (1982) Piecewise rational quadratic interpola-tion to monotonic data. Cited by: §II.3.
• W. K. Hastings (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), pp. 97–109. External Links: ISSN 0006-3444, Document, Link Cited by: §I.
• C. Huang, D. Krueger, A. Lacoste, and A. C. Courville (2018) Neural autoregressive flows. ArXiv abs/1804.00779. Cited by: §II.3.
• P. Jaini, K. A. Selby, and Y. Yu (2019) Sum-of-squares polynomial flow. In ICML, Cited by: §II.3.
• H. Kahn and A. W. Marshall (1953) Methods of reducing sample size in monte carlo computations. Journal of the Operations Research Society of America 1 (5), pp. 263–278. External Links: ISSN 00963984, Link Cited by: §I, §III.1, §III.4.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §III.1, §IV.1.
• S. Kullback and R. A. Leibler (1951) On information and sufficiency. Ann. Math. Statist. 22 (1), pp. 79–86. External Links: Cited by: §III.1.
• J. S. Liu (1996) Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing 6 (2), pp. 113–119. External Links: Cited by: §III.4.
• L. Martino and J. Míguez (2010) Generalized rejection sampling schemes and applications in signal processing. Signal Processing 90 (11), pp. 2981 – 2995. External Links: ISSN 0165-1684, Document, Link Cited by: §I, §II.2.
• T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák (2018) Neural importance sampling. ACM Trans. Graph. 38, pp. 145:1–145:19. Cited by: §I, §II.3, §III.1.
• G. Papamakarios, I. Murray, and T. Pavlakou (2017) Masked autoregressive flow for density estimation. In NIPS, Cited by: §II.3.
• G. Papamakarios, E. T. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan (2019) Normalizing flows for probabilistic modeling and inference. ArXiv abs/1912.02762. Cited by: §II.3.
• D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. ArXiv abs/1505.05770. Cited by: §II.3.
• C. P. Robert and G. Casella (2004) Monte carlo statistical methods. Springer New York. External Links: Cited by: §I, §II.2.
• R.A. Smith and E.J. Moniz (1972) NEUTRINO REACTIONS ON NUCLEAR TARGETS. Nucl. Phys. B 43, pp. 605. Note: [Erratum: Nucl.Phys.B 101, 547 (1975)] External Links: Document Cited by: §II.1.
• J. von Neumann (1951) Various techniques used in connection with random digits. In Monte Carlo Method, A. S. Householder, G. E. Forsythe, and H. H. Germond (Eds.), National Bureau of Standards Applied Mathematics Series, Vol. 12, pp. 36–38. Cited by: §I, §II.2.
• A. Wehenkel and G. Louppe (2019) Unconstrained monotonic neural networks. In BNAIC/BENELEARN, Cited by: §II.3.

## Appendix A Model complexity discussion

We can see in Tab. 1 that although the number of rejection sampling cycles is remarkably lower for the NSF proposal, the time for obtaining one million samples is fairly similar. This is because the time for a cycle is the sum of the time in generating and evaluating the proposal plus the time it takes to evaluate , in this case the CCQE cross-section. On average, for one rejection cycle of this experiment, generating+sampling for the NSF a batch of  samples takes  seconds, for a uniform proposal it only takes  seconds, and evaluating samples for the CCQE cross-section model takes  seconds. When summing up the time for the selected proposal plus the time of the model, we can see that the NSF takes a large toll of the computational time (more than hundred times more than evaluating the cross-section model), so overall a cycle for the NSF proposal takes over a hundred cycles of uniform proposal. This is because the CCQE cross-section model we have chosen for this study (Sec. II.1) is relatively simple, especially compared to other applications in HEP. Additionally we are still relatively low in the number of dimensions. As a thought experiment, we considered different cases for a similar set up: sampling ten million samples using batches of in each cycle, where the time of evaluating the model is increased by a factor of .

Tab. 2 shows the results for obtaining one million samples under these circumstances for the quantiles , and with the same acceptance probability as in Tab. 1, consider different hypothetical models with factors compared to the CCQE cross-section. We show the time (in seconds) it takes to generate these ten milllion samples for the NSF, , and for the uniform proposal, , while also computing their ratio . For models of with larger computationally complexity, we can see that even for the maximum quantile NSF is faster to obtain ten million accepted samples than using a uniform proposal, gaining a whole magnitude in time when the model is at least hundred times more complex. Choosing quantiles smaller than 1.0, for models which take thousand times more to compute compared to the CCQE cross-section one, the NSF proposal generates these ten million samples over hundred times faster than the uniform distribution. Even for simpler models with only hundred more computation complexity, the gain of using the NSF proposal is noticeable.