i-flow: High-dimensional Integration and Sampling with Normalizing Flows

01/15/2020
by   Christina Gao, et al.

In many fields of science, high-dimensional integration is required. Numerical methods have been developed to evaluate these complex integrals. We introduce the code i-flow, a python package that performs high-dimensional numerical integration utilizing normalizing flows. Normalizing flows are machine-learned, bijective mappings between two distributions. i-flow can also be used to sample random points according to complicated distributions in high dimensions. We compare i-flow to other algorithms for high-dimensional numerical integration and show that i-flow outperforms them for high dimensional correlated integrals.


I Introduction

Simulation based on first principles is an important practice, because it is the only way that a theoretical model can be checked against experiments or real-world data. In high-energy physics (HEP) experiments, a thorough understanding of the properties of known physics forms the basis of any search for new effects. This can only be achieved by accurate simulation, which in many cases boils down to performing an integral and sampling from it. Important theory calculations often require high-dimensional phase-space integrals with non-trivial correlations between dimensions. Monte-Carlo (MC) methods remain the most important techniques for solving such high-dimensional problems across many fields, including for instance biology Hobolth et al. (2008); Mode et al. (2011), chemistry Rosenbluth and Rosenbluth (1955), astronomy MacGillivray and Dodd (1982), medical physics Rogers (2006), finance Jäckel (2002) and image rendering Szirmay-Kalos (2008). In high-energy physics, all analyses at the Large Hadron Collider (LHC) rely strongly on multipurpose Monte Carlo event generators Webber (1986); Buckley et al. (2011) for signal and background predictions. However, the extraordinary performance of the experiments requires an amount of simulated data that soon cannot be delivered with current algorithms and computational resources Buckley (2019); The ATLAS collaboration.

A main endeavour in the field of MC methods is to reduce the variance of the estimate. In particular, stratified sampling (dividing the integration domain into sub-domains) and importance sampling (sampling from non-uniform distributions) James (1968) are two ways of reducing the variance. Currently, the most widely used numerical algorithm that exploits importance sampling is the VEGAS algorithm Lepage (1978, 1980). However, VEGAS assumes the factorizability of the integrand, which can be a poor approximation if the variables have complex correlations amongst one another. Foam Jadach (2003) is a popular alternative that tries to address this issue: it uses an adaptive strategy to model correlations, but requires exponentially large samples in high dimensions.

Lately, the burgeoning field of machine learning (ML) has brought new techniques into the game. For the following discussion, we restrict ourselves to progress made in the field of high-energy physics; see Bourilkov (2019) for a recent review. However, these techniques are also widely applied in other areas of research. Concerning event generation, Bendavid (2017) used boosted decision trees and generative adversarial networks (GANs) to improve MC integration. Reference Klimek and Perelstein (2018) proposed a novel idea that uses a dense neural network (DNN) to learn the phase space directly and shows promising results. In principle, once the neural-network (NN) based algorithm for MC integration is trained, one can invert the network and use it for sampling. However, the inversion of the NN requires evaluating its Jacobian, which incurs a computational cost that grows rapidly with the number of dimensions of the integral [1]. Therefore, it is extremely inefficient to use a standard NN-based algorithm for sampling.

[1] The phase space of an n-particle final state is itself a high-dimensional integral, and its dimensionality increases further when the recursive multichannel selection is included in the integral.

In addition to generating events from scratch, it is possible to generate additional events from a set of precomputed events. References Otten et al. (2019); Hashemi et al. (2019); Di Sipio et al. (2020); Butter et al. (2019b); Carrazza and Dreyer (2019); Ahdida et al. (2019); Butter et al. (2019a) used GANs and variational autoencoders (VAEs) to achieve this goal. The major advantage of this approach is the drastic speed improvement over standard techniques: these works report improvements in generation speed by a factor of around 1000. However, the approach also has a few downsides. It requires a significant number of events to be generated beforehand, which may be cost prohibitive for interesting, high-multiplicity problems. Furthermore, it can only generate events similar to those already produced; it therefore does not improve the sparsely populated corners of distributions and can even result in an incorrect total cross-section. Yet another approach to speeding up event generation is to use a NN as an interpolator to learn the matrix element Bishara and Montull (2019).

Our goal is to explore NN architectures that allow both efficient MC integration and sampling. A ML algorithm based on normalizing flows (NF) provides such a candidate. The idea was first proposed as non-linear independent components estimation (NICE) Dinh et al. (2015, 2016), and generalized, for example, in Müller et al. (2019) and Durkan et al. (2019). These works introduced coupling layers (CLs), which allow the inclusion of NNs in the construction of a bijective mapping between the target and initial distributions such that the evaluation of the Jacobian reduces to an analytic expression. This expression can be evaluated efficiently, with a cost that grows only linearly with the number of dimensions (see Sec. III).

Our contribution is a complete, openly available implementation of normalizing flows in TensorFlow Abadi et al. (2015), to be used for any high-dimensional integration problem at hand. Our code includes the original proposal of Müller et al. (2019) and the additions of Durkan et al. (2019). We further include various loss functions based on the class of f-divergences Nielsen and Nock (2014). The paper is organized in the following way. The basic principles of MC integration and importance sampling are reviewed in Section II. In Section III, we review the concept of normalizing flows and the work on CL-based flows by Dinh et al. (2015, 2016); Müller et al. (2019); Durkan et al. (2019), and we investigate the minimum number of CLs required to capture the correlations between all input dimensions. Section IV sets the stage for a comparison between our code, VEGAS, and Foam on various test functions, for which we give results in Section V. Section VI contains our conclusions and outlook.

II Monte Carlo Integrators

While techniques exist for accurate one-dimensional integration, such as double exponential integration Takahasi and Mori (1974), using them for high-dimensional integrals requires repeated evaluation of one-dimensional integrals. This leads to an exponential growth in computation time as a function of the number of dimensions, often referred to as the curse of dimensionality: when the dimensionality of the integration domain increases, the points become more and more sparse, and no statistically significant statement can be made without increasing the number of points exponentially. This can be seen in the ratio of the volume of a $d$-dimensional hypersphere to that of the $d$-dimensional hypercube, which vanishes as $d$ goes to infinity. Monte-Carlo techniques, however, are statistical in nature and converge as $1/\sqrt{N}$ in the number of sample points $N$, for any number of dimensions.

Therefore, MC integration is the most important technique for solving high-dimensional integrals numerically. The naïve MC approach samples uniformly on the integration domain $\Omega$. Given $N$ uniform samples $\vec{x}_1,\dots,\vec{x}_N \in \Omega$, the integral of $f$ can be approximated by

$$ I = \int_\Omega f(\vec{x})\,\mathrm{d}\vec{x} \;\approx\; \frac{V}{N}\sum_{i=1}^{N} f(\vec{x}_i) \equiv E_N , \qquad (1) $$

and the uncertainty is determined by the standard deviation of the mean,

$$ \sigma_N = V\,\sqrt{\frac{\langle f^2\rangle - \langle f\rangle^2}{N-1}} , \qquad (2) $$

where $V$ is the volume encompassed by $\Omega$ and $\langle\,\cdot\,\rangle$ indicates that the average is taken with respect to a uniform distribution in $\Omega$. While this works for simple or low-dimensional problems, it soon becomes inefficient for high-dimensional problems, which is what our work is concerned with. In particular, we are going to focus on improving current methods for MC integration that are based on importance sampling.
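As a concrete illustration of Eqs. (1) and (2) as written above (a minimal NumPy sketch, not part of the i-flow package):

import numpy as np

# Naive MC estimate of a d-dimensional integral over the unit hypercube (V = 1).
def naive_mc(f, d, n):
    x = np.random.rand(n, d)              # n uniform points in [0,1]^d
    fx = f(x)
    estimate = fx.mean()                  # Eq. (1) with V = 1
    error = fx.std(ddof=1) / np.sqrt(n)   # Eq. (2): standard deviation of the mean
    return estimate, error

# Example: f(x) = prod_i 2*x_i in 4 dimensions, whose exact integral is 1.
est, err = naive_mc(lambda x: np.prod(2.0 * x, axis=1), d=4, n=100_000)
print(f"{est:.4f} +- {err:.4f}")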

In importance sampling, instead of sampling from a uniform distribution, one samples from a distribution $g(\vec{x})$ that ideally has the same shape as the integrand $f(\vec{x})$. Using the transformation $\vec{x} = G^{-1}(\vec{y})$, with $G$ the cumulative distribution function of $g$, one obtains

$$ I = \int_\Omega f(\vec{x})\,\mathrm{d}\vec{x} = \int \frac{f\!\left(G^{-1}(\vec{y})\right)}{g\!\left(G^{-1}(\vec{y})\right)}\,\mathrm{d}\vec{y} \;\approx\; \frac{1}{N}\sum_{i=1}^{N}\frac{f(\vec{x}_i)}{g(\vec{x}_i)} , \qquad \vec{x}_i \sim g . \qquad (3) $$

In the ideal case $g(\vec{x}) = f(\vec{x})/I$, Eq. (3) would be estimated with vanishing uncertainty. However, this requires already knowing the analytic solution of the integral! The goal is thus to find a distribution $g$ that resembles the shape of $f$ as closely as possible, while being integrable and invertible, such as to allow for fast sampling. In the following, we review the MC integrators that are currently widely used, especially in the field of high-energy physics.
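A minimal one-dimensional sketch of Eq. (3), assuming a hand-picked proposal $g$ (a narrow Gaussian) that roughly matches the integrand; in i-flow this role of $g$ is played by the trained normalizing flow:

import numpy as np

# Integrand: a narrow Gaussian peak; its integral over [0,1] is ~0.05*sqrt(2*pi).
f = lambda x: np.exp(-((x - 0.5) ** 2) / (2 * 0.05**2))
# Proposal density g: a slightly wider Gaussian, normalized (its tails outside
# [0,1] are negligible for these parameters).
g = lambda x: np.exp(-((x - 0.5) ** 2) / (2 * 0.06**2)) / (0.06 * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
N = 100_000
est_uniform = f(rng.random(N)).mean()            # Eq. (1)
x = rng.normal(0.5, 0.06, size=N)                # sample x ~ g
x = x[(x > 0.0) & (x < 1.0)]
est_importance = np.mean(f(x) / g(x))            # Eq. (3)
print(est_uniform, est_importance)               # both ~0.1253; the latter has a much smaller variance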

VEGAS Lepage (1978, 1980) approximates all one-dimensional projections of the integrand using histograms and an adaptive algorithm. This algorithm adjusts the bin widths such that the areas of the bins are roughly equal. Sampling a random point from VEGAS is then done in two steps: first, a bin is selected randomly in each dimension; second, a point is sampled uniformly within each selected bin. However, this algorithm is limited because it assumes that the integrand factorizes, i.e.

$$ g(\vec{x}) = g_1(x_1)\,g_2(x_2)\cdots g_d(x_d) , \qquad (4) $$

where $\vec{x} = (x_1,\dots,x_d)$ and each $g_i$ is a normalized one-dimensional distribution. High-dimensional integrals with non-trivial correlations between integration variables, which are often needed for LHC data analyses, cannot be integrated efficiently with the VEGAS algorithm (cf. Höche et al. (2019)). The resulting uncertainty can be reduced further by applying stratified sampling on top of the VEGAS algorithm, after the binning Lepage.
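For reference, the VEGAS algorithm is available as a standalone Python package (vegas) by Lepage; a typical usage looks roughly as follows (a sketch from memory of that package's documented interface; consult its documentation for details):

import numpy as np
import vegas   # G. P. Lepage's VEGAS implementation

# A 4-dimensional, factorizable Gaussian centered in the unit hypercube.
def f(x):      # x is a single point with shape (4,)
    return np.exp(-np.sum((x - 0.5) ** 2) / 0.04) / (0.2 * np.sqrt(np.pi)) ** 4

integ = vegas.Integrator(4 * [[0.0, 1.0]])   # integration limits per dimension
result = integ(f, nitn=10, neval=10_000)     # 10 adaptive iterations
print(result.mean, result.sdev)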

Foam Jadach (2003) uses a cellular approximation of the integrand and is therefore able to learn correlations between the variables. In the first phase of the algorithm, the so-called exploration phase, the cell grid is built by subsequent binary splits of existing cells. Since the first cell consists of the full integration domain, all regions of the integration space are explored by construction. The second phase of the algorithm uses this grid to generate points, either for importance sampling or as an event generator. In this work we use the implementation of Hoeche, which adds an additional reweighting of the cells at the end of the optimization.

However, both Foam and VEGAS are based on histograms, whose edge effects can be detrimental to numerical analyses that demand high precision; such edge effects are an important source of uncertainty for high-precision physics Dulat et al. (2018). As we explain below, our code i-flow uses a spline approximation, which does not suffer from these effects.

III Importance Sampling with Normalizing Flows

As we detailed in the previous section, importance sampling requires finding an approximation of the integrand that can easily be integrated and subsequently inverted, so that we can use it for sampling. Mathematically, this corresponds to a coordinate transformation whose inverse Jacobian determinant is given by the approximating distribution $g(\vec{x})$ itself. General ML algorithms incorporate NNs to learn this transformation directly, which inevitably involves evaluating the Jacobian of the NNs and therefore results in inefficient sampling. Normalizing-flow algorithms based on coupling layers circumvent precisely this problem. To begin, let us review the concept of a normalizing flow (NF).

Let $C_i$, with $i = 1,\dots,n$, be a series of bijective mappings applied to the random variable $\vec{x}$:

$$ \vec{y} = C_n \circ C_{n-1} \circ \cdots \circ C_1(\vec{x}) . \qquad (5) $$

Based on the chain rule, the probability distribution of the output $\vec{y}$, $q(\vec{y})$, can be inferred from the base probability distribution $p(\vec{x})$ from which $\vec{x}$ is drawn:

$$ q(\vec{y}) = p(\vec{x}) \prod_{i=1}^{n} \left|\det \frac{\partial C_i(\vec{x}_{i-1})}{\partial \vec{x}_{i-1}}\right|^{-1} , \qquad \vec{x}_0 \equiv \vec{x},\;\; \vec{x}_i \equiv C_i(\vec{x}_{i-1}) . \qquad (6) $$

One sees that the target and base distributions are related by the inverse Jacobian determinant of the transformation. For practical uses, the Jacobian determinant must be easy to compute, restricting the allowed functional forms of the $C_i$. However, with the help of coupling layers, first proposed by Dinh et al. (2015, 2016) and then generalized by Müller et al. (2019); Durkan et al. (2019), one can incorporate NNs into the construction of the $C_i$, thus greatly enhancing the complexity and expressiveness of the NF without introducing any intractable Jacobian computations.
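As a quick one-dimensional numerical check of the change-of-variables relation in Eq. (6) (illustrative only, not part of i-flow):

import numpy as np

# Push u ~ U(0,1) through the bijection x = C(u) = -log(1-u); Eq. (6) then
# predicts the induced density q(x) = p(u) |dC/du|^{-1} = exp(-x).
rng = np.random.default_rng(0)
u = rng.random(1_000_000)
x = -np.log(1.0 - u)                                   # forward map C
hist, edges = np.histogram(x, bins=50, range=(0.0, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
q_pred = np.exp(-centers)                              # inverse-Jacobian prediction
print(np.max(np.abs(hist / q_pred - 1.0)))             # percent level: MC noise plus the truncated tail beyond x = 5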

Figure 1: Structure of a coupling layer. The output of the neural network, applied to the first subset of inputs, defines the coupling transform that is applied to the second subset.

Figure 1 shows the basic structure of a coupling layer, which is a special design of the bijective mapping $C$. For each map, the input variable $\vec{x}$ is partitioned into two subsets, $\vec{x}_A$ and $\vec{x}_B$, which can be chosen arbitrarily so long as neither is the empty set. This partitioning will be referred to as masking. Without loss of generality, one simple partitioning is given by $\vec{x}_A = (x_1,\dots,x_k)$ and $\vec{x}_B = (x_{k+1},\dots,x_d)$; different maskings can be achieved via permutations of this simple example. Under the bijective map, the variable transforms as

$$ \vec{x}_A' = \vec{x}_A , \qquad \vec{x}_B' = C\!\left(\vec{x}_B; m(\vec{x}_A)\right) . \qquad (7) $$

The NN takes $\vec{x}_A$ as input and outputs $m(\vec{x}_A)$, which represents the parameters of the invertible "coupling transform" $C$ that is applied to $\vec{x}_B$. The inverse map is given by

$$ \vec{x}_A = \vec{x}_A' , \qquad \vec{x}_B = C^{-1}\!\left(\vec{x}_B'; m(\vec{x}_A')\right) , \qquad (8) $$

which leads to the simple Jacobian

$$ \left|\det\frac{\partial(\vec{x}_A',\vec{x}_B')}{\partial(\vec{x}_A,\vec{x}_B)}\right| = \left|\det \begin{pmatrix} \mathbb{1} & 0 \\[2pt] \frac{\partial C}{\partial m}\frac{\partial m}{\partial \vec{x}_A} & \frac{\partial C}{\partial \vec{x}_B}\end{pmatrix}\right| = \left|\det\frac{\partial C(\vec{x}_B; m(\vec{x}_A))}{\partial \vec{x}_B}\right| . \qquad (9) $$

Note that Eq. (9) does not require the computation of the gradient of $m(\vec{x}_A)$, which would scale unfavourably with the number of dimensions. In addition, taking $\partial C/\partial \vec{x}_B$ to be diagonal further reduces the computational complexity of the determinant to be linear with respect to the dimensionality of the problem. Linear scaling makes this approach tractable even for high-dimensional problems. In summary, the NN learns the parameters of a transformation and not the transformation itself, and thus the Jacobian can be calculated analytically.
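To make the structure explicit, here is a minimal NumPy sketch of a coupling layer with a simple affine (RealNVP-style) coupling transform; i-flow itself uses the spline transforms of App. A, but the pass-through/transform split and the triangular Jacobian are the same:

import numpy as np

def affine_coupling_forward(x, net, d_A):
    """Toy affine coupling layer; `net` maps x_A to (scale, shift) parameters
    for each dimension of x_B."""
    x_A, x_B = x[:, :d_A], x[:, d_A:]
    s, t = net(x_A)                         # parameters depend only on x_A
    y_B = x_B * np.exp(s) + t               # element-wise (diagonal) transform C
    log_det = np.sum(s, axis=1)             # log|det dC/dx_B| = sum of log-scales
    return np.concatenate([x_A, y_B], axis=1), log_det

def affine_coupling_inverse(y, net, d_A):
    y_A, y_B = y[:, :d_A], y[:, d_A:]
    s, t = net(y_A)                         # same parameters, since y_A = x_A
    x_B = (y_B - t) * np.exp(-s)            # C is inverted analytically
    return np.concatenate([y_A, x_B], axis=1)

# A hypothetical "network": any function of x_A works for this illustration.
toy_net = lambda x_A: (0.1 * x_A.sum(axis=1, keepdims=True) * np.ones((len(x_A), 2)),
                       np.zeros((len(x_A), 2)))
x = np.random.rand(5, 4)
y, log_det = affine_coupling_forward(x, toy_net, d_A=2)
assert np.allclose(affine_coupling_inverse(y, toy_net, d_A=2), x)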

To construct a complete normalizing flow, one simply compounds a series of coupling layers, with the freedom of choosing which subset of the outputs of the previous layer is transformed in the subsequent layer. We show in Sec. III.1 that at most $2\lceil \log_2 d\rceil$ coupling layers are required in order to express non-separable structures of a $d$-dimensional integrand.

III.1 Number of Coupling Layers

The minimum number of coupling layers required to capture all the correlations between the dimensions of the integration variable depends on the dimensionality $d$ of the integral Müller et al. (2019). For $d=2$ and $d=3$, each dimension is transformed once based on the other dimension(s), and two and three coupling layers suffice, respectively. This way of counting could be generalized to higher $d$; in fact, this is what autoregressive flows are based on Papamakarios et al. (2017). Here we show that the number of coupling layers required to capture all the correlations is bounded below by 3 and above by $2\lceil \log_2 d\rceil$.

Let $\vec{x}$ be a $d$-dimensional input variable that gets split into two subsets, $\vec{x}_A$ and $\vec{x}_B$. After applying three coupling layers (transforming first $\vec{x}_B$ given $\vec{x}_A$, then $\vec{x}_A$ given the new $\vec{x}_B$, then $\vec{x}_B$ again), all the input dimensions have been transformed based on information from one another, at least in an indirect way. However, since the transformation introduced by a single coupling layer is not expressive enough for a generic integrand, the actual number of coupling layers that we use is larger. In addition, the flow just described transforms each dimension of $\vec{x}_A$ only once, while the dimensions in $\vec{x}_B$ are transformed twice. To approximate the integrand equally well in all dimensions, we require that all of them are transformed the same number of times, which raises the minimum number of coupling layers accordingly.

To systematically construct coupling layers that ensure a direct dependence of all input dimensions on each other, we require that no set of two (or more) variables is always transformed together, and that all variables are transformed an equal number of times. Under these conditions, the minimum number of coupling layers can be shown to be $2\lceil \log_2 d\rceil$. The proof is as follows. Since each dimension can either be transformed or not in a given layer, the first step is to number the dimensions from $0$ to $d-1$ and take the binary representation of each number, using exactly as many bits as are required to represent the number $d-1$. The first layer is found by taking the most significant bit of each number, transforming the dimensions with a 1 and passing through the dimensions with a 0. The second layer is done the same way, but transforming those with a 0 and passing through those with a 1, which ensures that each dimension is transformed equally often. The remaining layers are found by iterating through the remaining bits in the same manner. Thus, the number of layers is given by twice the number of bits needed to represent $d-1$ in binary, i.e. $2\lceil \log_2 d\rceil$. See Table 1 for an example with $d=12$; a code sketch that generates these maskings follows the table.

Dimension    0 1 2 3 4 5 6 7 8 9 10 11
Bit 3 (MSB)  0 0 0 0 0 0 0 0 1 1 1  1
Bit 2        0 0 0 0 1 1 1 1 0 0 0  0
Bit 1        0 0 1 1 0 0 1 1 0 0 1  1
Bit 0 (LSB)  0 1 0 1 0 1 0 1 0 1 0  1
Table 1: Finding the maskings that capture all correlations in a 12-dimensional space.
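A short sketch of this masking construction (the function name is ours, not part of the i-flow interface):

import numpy as np

def binary_maskings(d):
    """Generate the 2*ceil(log2(d)) boolean masks described in Sec. III.1:
    dimensions are numbered 0..d-1, each bit of the binary representation
    defines one mask, and each mask is used together with its complement."""
    n_bits = max(1, int(np.ceil(np.log2(d))))
    masks = []
    for bit in reversed(range(n_bits)):            # most significant bit first
        mask = np.array([(i >> bit) & 1 for i in range(d)], dtype=bool)
        masks.append(mask)                         # transform dimensions with bit = 1
        masks.append(~mask)                        # then the complementary set
    return masks

for m in binary_maskings(12):
    print(m.astype(int))                           # reproduces the rows of Table 1 and their complements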

III.2 Using i-flow

Figure 2: Illustration of one step in the training of i-flow. Users need to provide a normalizing flow network, a function to integrate, and a loss function. The Monte-Carlo estimate of the integral is computed from the sample of points drawn from the current flow.

The i-flow package requires three pieces of information from the user: the function to be integrated, the normalizing flow network, and the method of optimizing the network. Figure 2 shows schematically how one step in the training of i-flow works. The code is publicly available on gitlab at https://gitlab.com/i-flow/i-flow.

III.2.1 Integrand

The function to be integrated has very few requirements on how it is implemented in the code. Firstly, the function must accept an array of points of shape (nbatch, ndims), where nbatch is the number of points to sample per training step and ndims is the dimensionality of the integral. Secondly, the function must return an array of shape (nbatch,) containing the integrand value for each point, to be used to estimate the integral. Finally, the number of dimensions of the integral is required to be at least 2; however, one dimension can be treated as a dummy dimension integrated from 0 to 1, which does not change any result.
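A minimal integrand in this format might look as follows (written with TensorFlow ops since i-flow is TensorFlow-based; the function and argument names are illustrative):

import tensorflow as tf

# Receives a batch of points of shape (nbatch, ndims) and returns one value
# per point, shape (nbatch,).
def integrand(pts):
    # f(x) = prod_i 2*x_i, which integrates to 1 on the unit hypercube
    return tf.reduce_prod(2.0 * pts, axis=1)

pts = tf.random.uniform([5000, 4])     # 5000 points in 4 dimensions
print(integrand(pts).shape)            # (5000,)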

III.2.2 Normalizing flow network

A normalizing flow network consists of a series of coupling layers compounded together. To construct each coupling layer, one needs to specify the choice of coupling transform (cf. App. A), the number of coupling layers, the masking for each layer, and the neural network that parametrizes the coupling transform. We provide the ability to automatically generate the maskings and the number of layers according to Sec. III.1.

The neural networks must be provided by the user; however, we provide examples for a dense network and the u-shape network of Müller et al. (2019). This gives the user the flexibility to achieve the expressiveness required for their specific problem.
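As an illustration of the kind of network a user might supply, the following Keras sketch maps the untransformed subset of inputs to raw bin parameters for the coupling transform; the layer sizes and output shape are illustrative and not tied to the actual i-flow interface:

import tensorflow as tf

def make_dense_net(n_in, n_transformed, n_bins, n_hidden=32, n_layers=4):
    inputs = tf.keras.Input(shape=(n_in,))
    net = inputs
    for _ in range(n_layers):
        net = tf.keras.layers.Dense(n_hidden, activation="relu")(net)
    # raw (unnormalized) parameters: one set of n_bins values per transformed dimension
    outputs = tf.keras.layers.Dense(n_transformed * n_bins)(net)
    return tf.keras.Model(inputs, outputs)

model = make_dense_net(n_in=2, n_transformed=2, n_bins=16)
print(model(tf.random.uniform([10, 2])).shape)   # (10, 32)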

III.2.3 Optimizing the network

To uniquely define the optimization of the network, two pieces of information are required. Firstly, the loss function to be minimized: we supply a large set of loss functions from the class of f-divergences, which can be found in App. B; by default, the i-flow code uses the exponential divergence. Secondly, an optimizer: in the examples we used the ADAM optimizer Kingma and Ba (2014), but the code can use any optimizer implemented within TensorFlow.

III.2.4 Hyperparameters

The setup we presented here has several hyperparameters that can be adjusted for better performance. In addition, i-flow gives the user the flexibility to implement additional features in each component beyond what is discussed below, which would come with additional hyperparameters as well.

The first group concerns the architecture of the NNs. Once the general type of network (dense or u-shape) is set, the number of layers and the number of nodes per layer have to be specified. In the case of the u-shape network, the user specifies the number of nodes in the first layer and the number of “downward” steps.

The second group of hyperparameters concerns the optimization process. Apart from setting an optimizer (e.g. ADAM Kingma and Ba (2014)), a learning schedule (e.g. constant or exponentially decaying), an initial learning rate, and a loss function have to be specified. Some of these options come with their own additional hyperparameters. The number of points per training epoch and the number of epochs have to be set as well.

The third group of hyperparameters concerns the setup of i-flow directly. As was discussed in Müller et al. (2019), there are two ways to pass the inputs into the NNs: either directly or with one-blob encoding; i-flow supports both of these options. If the latter is used, the number of input bins has to be specified. Further, the type of coupling function, the number of output bins, the number of CLs, and the maskings have to be set.

III.2.5 Putting it all together

The networks are trained by sampling a fixed number of points using the current state of the flow [2]. We use one of the statistical divergences as a measure of how closely the flow distribution resembles the shape of the integrand, and an optimizer to minimize it. Because we can generate an unlimited number of random points and evaluate the target function for each of them, this approach corresponds to supervised learning with an infinite dataset. Drawing a new set of points at every training epoch also automatically ensures that the networks cannot overfit.

[2] Since we initialize the last layer of each network with vanishing bias and weights, the distribution produced in the first sampling is constant.
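The following toy sketch shows the logic of one such training run, with a simple one-parameter distribution standing in for the flow networks and the Kullback–Leibler divergence of App. B as the loss; none of the names correspond to the actual i-flow API:

import tensorflow as tf

# Stand-in "flow": q_theta(x) = theta * x**(theta-1) on [0,1], sampled by
# inverting its CDF, x = u**(1/theta).  Target integrand: f(x) = 3*x**2,
# which is already normalized, so it directly plays the role of the target
# density (i-flow instead divides by its running estimate of the integral).
target = lambda x: 3.0 * x**2
theta = tf.Variable(1.0)                               # q starts out uniform
opt = tf.keras.optimizers.Adam(learning_rate=0.05)

for epoch in range(500):
    u = tf.random.uniform([5000], minval=1e-6, maxval=1.0)
    with tf.GradientTape() as tape:
        x = u ** (1.0 / theta)                         # sample x ~ q_theta (reparametrized)
        q = theta * x ** (theta - 1.0)                 # flow density at the sampled points
        p = target(x)
        loss = tf.reduce_mean(p / q * tf.math.log(p / q))   # KL divergence D(p||q), estimated with x ~ q
    opt.apply_gradients([(tape.gradient(loss, theta), theta)])

print(theta.numpy())                                   # converges towards 3, i.e. q matches f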

IV Integrator Comparison

To illustrate the performance of i-flow and compare it to VEGAS and Foam, we present a set of five test functions, each highlighting a different aspect of high-dimensional integration and sampling. These functions demonstrate how each algorithm handles the cases of a purely separable function, functions with correlations, and functions with non-factorizing hard cuts. In most cases, an analytic solution to the integral is known.

The first test function is an $n$-dimensional Gaussian, serving as a sanity check:

$$ f_1(\vec{x}) = \left(\frac{1}{\alpha\sqrt{\pi}}\right)^{\!n} \exp\!\left(-\,\frac{\sum_{i=1}^{n}(x_i - 0.5)^2}{\alpha^2}\right) . \qquad (10) $$

The result of integrating $f_1$ over the unit hypercube is given by:

$$ I_1 = \mathrm{erf}\!\left(\frac{1}{2\alpha}\right)^{\!n} . \qquad (11) $$

In the following, we use $\alpha = 0.2$.

The second test function is an $n$-dimensional Camel function, i.e. a sum of two Gaussians, which shows how i-flow learns correlations that VEGAS (without stratified sampling) would not learn:

$$ f_2(\vec{x}) = \frac{1}{2}\left(\frac{1}{\alpha\sqrt{\pi}}\right)^{\!n} \left[\exp\!\left(-\,\frac{\sum_{i=1}^{n}(x_i - \tfrac13)^2}{\alpha^2}\right) + \exp\!\left(-\,\frac{\sum_{i=1}^{n}(x_i - \tfrac23)^2}{\alpha^2}\right)\right] . \qquad (12) $$

The result of integrating $f_2$ over the unit hypercube is given by:

$$ I_2 = \left[\frac{1}{2}\left(\mathrm{erf}\!\left(\frac{1}{3\alpha}\right) + \mathrm{erf}\!\left(\frac{2}{3\alpha}\right)\right)\right]^{n} . \qquad (13) $$

As before, we use $\alpha = 0.2$.
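The Gaussian and Camel functions as reconstructed above can be written down directly; the peak positions (0.5, and 1/3 and 2/3) and $\alpha = 0.2$ are inferred from the closed-form integrals matching the true values quoted in Tab. 2:

import numpy as np
from math import erf

def gaussian(x, alpha=0.2):                    # Eq. (10)
    n = x.shape[1]
    return np.exp(-np.sum((x - 0.5) ** 2, axis=1) / alpha**2) / (alpha * np.sqrt(np.pi)) ** n

def camel(x, alpha=0.2):                       # Eq. (12)
    n = x.shape[1]
    pref = 0.5 / (alpha * np.sqrt(np.pi)) ** n
    return pref * (np.exp(-np.sum((x - 1.0 / 3.0) ** 2, axis=1) / alpha**2)
                   + np.exp(-np.sum((x - 2.0 / 3.0) ** 2, axis=1) / alpha**2))

# Closed-form integrals over the unit hypercube for n = 2, Eqs. (11) and (13):
print(erf(1.0 / 0.4) ** 2)                             # 0.999186...
print((0.5 * (erf(1.0 / 0.6) + erf(2.0 / 0.6))) ** 2)  # 0.98166...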

The third case is given by

(14)

This function has two circles with shifted centers, varying thickness and height; it also exhibits non-factorizing behavior. There is no known analytic solution for this function, but its integral over the unit square can be computed numerically using Mathematica Inc.; the resulting reference value is quoted in Tab. 2.

The fourth case is a ring function with hard cuts:

(15)

This function demonstrates how i-flow learns hard, non-factorizing cuts. The true value of its integral over the unit square is given in Tab. 2.

The fifth case is motivated by high-energy physics: a one-loop scalar box integral with an internal top-quark loop, representative of the integrals that contribute to the total production cross-section of the Higgs boson in the Standard Model. As explained in App. C, after Feynman parametrisation and sector decomposition Binoth et al. (2003), the integral of interest is given by

(16)

The reference result for integrating Eq. (16) over the unit hypercube can be obtained with LoopTools Hahn and Pérez-Victoria (1999), which provides the numerical value for our choice of external momenta and masses.

Further applications to event generation for high-energy particle collisions are discussed in Gao et al. (2020) and in Bothmann et al. (2020). These papers investigate using normalizing flows to improve phase-space integration for event simulation at particle colliders.

V Results

Dim VEGAS Foam i-flow true value
Gaussian 2 0.99907(5) 0.999186
4 0.99981(35) 0.998373
8 0.99780(320) 0.996749
16 0.72388(11428) 0.993509
Camel 2 0.98169(5) 0.98166
4 0.96356(30) 0.963657
8 0.93007(142) 0.928635
16 0.96498(17337) 0.862363
Entangled circles 2 0.013685(3) 0.0136848
Ring w. cuts 2 0.51083(16) 0.510508
Scalar-top-loop 3
Table 2: Results of the integrals based on a sample of 1M points after training with 5M points; $\alpha = 0.2$ for the Gaussian and Camel functions.
Dim VEGAS Foam i-flow
Gaussian 2
4
8
16
Camel 2
4
8
16
Entangled circles 2
Ring w. cuts 2
Scalar-top-loop 3
Table 3: Same as Table 2, but showing the relative deviation.

In this section we show the performance of i-flow and compare it to VEGAS and Foam on the test functions introduced in Sec. IV. For the VEGAS algorithm, we set the number of bins to 100 and use stratified sampling as implemented in Lepage. The setup of Foam requires a number of points per cell, which we fix to 1000. For i-flow, we use the number of coupling layers and the maskings discussed in Sec. III.1, with the coupling transform taken to be a piecewise rational quadratic spline (Appendix A.3). The neural network in each CL is a DNN of 5 layers, with 32 nodes in each of the first four layers; the number of nodes in the last layer depends on the coupling transform and on the number of dimensions to be transformed, since it has to provide all spline parameters for every transformed dimension. We further set the number of bins in each output dimension to 16. The learning rate was set to a common fixed value in all cases except for the ring function; see details further below.

We used in total five million (5M) points to train each integrator. For VEGAS and i-flow, this number is split into 1000 epochs with 5000 points per epoch; for Foam, we used 5000 cells with 1000 points per cell. After the training is completed, we use one million (1M) points to evaluate the integral and report the results in Tab. 2. Since this way of presenting the results highlights the precision of the integrators, we also show in Tab. 3 the relative deviation from the true value in units of standard deviations, i.e. the accuracy of the result. The relative deviation is defined as

$$ \Delta = \frac{I - I_{\rm true}}{\sqrt{\sigma_I^2 + \sigma_{\rm true}^2}} , \qquad (17) $$

where $I$ is the result from VEGAS, Foam, or i-flow, $I_{\rm true}$ is the true value of the integral, and the $\sigma$ terms signify the corresponding uncertainties. Note that $\sigma_{\rm true}$ is only non-zero for the entangled circles, whose reference value is itself obtained numerically.

It is clear from these numbers that for the low-dimensional integrals, all three integrators achieve reasonable results. However, i-flow is the only one that shows consistent performance up to 16 dimensions. For example, VEGAS fails completely in the integration of the 16-dimensional Camel function (it misses one of the peaks), and Foam has a large uncertainty on the final result; Foam also performs poorly for the Gaussian in 16 dimensions. In both of these cases, Foam requires roughly $\bar{n}^{\,d}$ cells to map out all the features of a $d$-dimensional function, where $\bar{n}$ is the average number of bins needed in each dimension. Even if $\bar{n}$ is taken to be 2, the number of cells required in 16 dimensions is at least $2^{16} \approx 6.5\times 10^{4}$, far greater than the 5000 cells used here. Therefore, when dealing with high-dimensional integrals, Foam is the least efficient of the three integrators.

Not only does i-flow perform better for high-dimensional integrals, it also yields a smaller uncertainty than VEGAS and Foam after the full training with 5M points, even though the latter two converge much faster in the initial stage. This is a common behaviour of i-flow, independent of the type of function it is trained on, as illustrated in Figure 3. Interestingly, out of all the examples plotted, VEGAS performs worst for the 4-dimensional Camel function and the entangled circles, precisely because of its inability to deal with non-factorizable functions: since VEGAS assumes that the integrand factorizes, its uncertainties for the Camel function, the entangled circles, the ring with cuts, and the scalar top loop are all larger than those of Foam or i-flow. It is interesting to note that, in the case of the ring with cuts, Foam slightly outperforms i-flow in terms of precision, but not in terms of accuracy.

(a) 4-dimensional Gaussian
(b) 4-dimensional Camel
(c) Entangled circles
(d) Scalar top loop
Figure 3: Relative uncertainty of the integral as a function of number of points in millions (M) used in training for four different types of functions.

As discussed in the earlier sections, i-flow also allows efficient sampling once it has "learned" the integrand to within a small uncertainty. As an example, Fig. 4 shows a sample distribution after training i-flow with 5M points (1000 epochs with 5000 points per epoch) on the ring function of Eq. (15); the resulting value for the integral is given in Tab. 2. For this training, we used a learning schedule with exponential decay, in which the initial learning rate is halved every 250 epochs. The cut efficiency is high: about 90% of the sampled points fall inside the cuts (cf. Fig. 4). Figure 5 shows the weights of 10000 points sampled after training with 10M points on the entangled circles of Eq. (14). In the ideal case, in which the flow distribution is proportional to the integrand, we expect the weight distribution to approach a delta function. In Fig. 5, we see that the trained result is much closer to a delta function than the flat prior, showing a significant improvement in the ability to draw samples from this function.

Figure 4: A set of 7500 points sampled after training i-flow with 5M points on the Ring function. 6720 are inside (blue), 780 outside (red).

Figure 5: Weights of 10000 points, sampled after training i-flow on the entangled circles of Eq. (14). Before training, the sampling distribution is flat; after training, it approximately resembles the shape of the integrand.

VI Conclusion and Outlook

As shown in the previous section, i-flow tends to perform better than both VEGAS and Foam for all the test cases provided. However, i-flow comes with a few downsides. Since i-flow has to learn all the correlations of the function, it takes significantly longer to achieve optimal performance than the other integrators, as can be seen in Fig. 3; this translates into longer training times. Additionally, the memory footprint of i-flow is much larger, since the optimizer has to store additional information for the parameter updates of the NNs. Both of these issues can be overcome with future improvements.

There are several directions in which we plan to improve the presented setup in the future. So far, we only used simple NN architectures in our coupling layers. Using convolutional NNs instead might improve the convergence of the normalizing flow for complicated integrands, as these networks have the ability to learn complicated shapes in images.

The setup suggested in Tran et al. (2019) would allow the extension of i-flow to discrete distributions, which also has applications in HEP Gao et al. (2020); Bothmann et al. (2020). Another way to implement this type of information is to utilize conditional normalizing flows Winkler et al. (2019).

The implementation of transflow-learning, which was suggested in Gambardella et al. (2019), would allow a trained normalizing flow to be used on different, but similar, problems without retraining the network. Such problems arise in HEP when new-physics effects from high energy scales slightly modify scattering properties at low energies and are described in an effective field theory framework. Another application of transflow-learning would be to train one network for a given dimensionality and adapt the network to another problem with the same dimensionality.

Using techniques like gradient checkpointing Chen et al. (2016) has the potential to reduce the memory usage substantially, thereby allowing more points per training step or larger NN architectures. Currently, if too many points per training step are used, the program crashes because the corresponding arrays do not fit into RAM. Such cases can be circumvented by breaking the large set of points into several smaller mini-batches: the gradients are computed for each mini-batch consecutively and accumulated before they are applied.
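A sketch of this mini-batch gradient accumulation in TensorFlow (the model and loss are stand-ins, not the i-flow networks):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1)])
loss_fn = lambda batch: tf.reduce_mean(tf.square(model(batch)))
opt = tf.keras.optimizers.Adam()

points = tf.random.uniform([20000, 4])             # too many points for a single pass (illustrative)
accum = None
for batch in tf.split(points, num_or_size_splits=4):
    with tf.GradientTape() as tape:
        loss = loss_fn(batch) / 4.0                # scale so the sum equals the full-batch loss
    grads = tape.gradient(loss, model.trainable_variables)
    accum = grads if accum is None else [a + g for a, g in zip(accum, grads)]
opt.apply_gradients(zip(accum, model.trainable_variables))   # one update for the full set of points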

The setup presented in Kingma and Dhariwal (2018), which utilizes invertible 1×1 convolutions, showed improved performance over the vanilla implementation of normalizing flows; this possibly also applies to our case. Additionally, it would modify the maximum number of coupling layers required, since it provides more expressive permutations.

Acknowledgements

We thank Joao M. Goncalves Caldeira, Felix Kling, Luisa Lucie-Smith, Tilman Plehn, Holger Schulz, Nhan Tran, and the participants of the Aspen workshop “The Energy Frontier Beyond the LHC Run 2” for their comments and discussions. We further thank Stefan Höche for helpful discussions, comments, and for his Foam implementation Hoeche .
This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. This work was performed in part at Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. C.K. acknowledges the support of the Alexander von Humboldt Foundation.

Appendix A Coupling Layer details

The coupling layers available in i-flow are detailed below. They are based on the work of Müller et al. (2019); Durkan et al. (2019) and are reproduced here for the convenience of the reader.

A.1 Piecewise Linear

For the piecewise linear coupling layer Müller et al. (2019), the unit interval of each dimension to be transformed is divided into $K$ bins of equal width $w = 1/K$, and the neural network outputs a normalized probability $Q_i$ for each bin. The probability density function (PDF) is defined as:

$$ q(x) = \frac{Q_b}{w} \quad \text{for } x \text{ in bin } b . \qquad (18) $$

The cumulative distribution function (CDF) is defined by the integral, giving:

$$ C(x) = \sum_{k=1}^{b-1} Q_k + Q_b\,\frac{x - (b-1)w}{w} , \qquad (19) $$

where $b$ is the bin in which $x$ occurs, i.e. $(b-1)w \le x < bw$. The inverse CDF is given by:

$$ x = C^{-1}(y) = (b-1)w + w\,\frac{y - \sum_{k=1}^{b-1} Q_k}{Q_b} , \qquad (20) $$

where $b$ is now defined as the maximal bin for which $\sum_{k=1}^{b-1} Q_k \le y$. The Jacobian for this transformation is straightforward to calculate and gives:

$$ \left|\det\frac{\partial C(\vec{x}_B; Q)}{\partial \vec{x}_B}\right| = \prod_{i} \frac{Q_{b_i}}{w} , \qquad (21) $$

where the product runs over the transformed dimensions and $b_i$ denotes the bin in which the $i$-th coordinate falls. The piecewise linear layers require fixed bin widths in each layer; for details on why this is required, see Appendix B of Müller et al. (2019).
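A NumPy sketch of this transform for a single dimension, with Q standing in for the bin probabilities produced by the network:

import numpy as np

def piecewise_linear_cdf(x, Q):
    K = len(Q)
    w = 1.0 / K                                    # fixed bin width, Eq. (18)
    b = np.minimum((x / w).astype(int), K - 1)     # bin index of each point
    cum = np.concatenate([[0.0], np.cumsum(Q)])    # CDF value at the left edge of each bin
    return cum[b] + Q[b] * (x - b * w) / w         # Eq. (19)

def piecewise_linear_inverse(y, Q):
    K = len(Q)
    w = 1.0 / K
    cum = np.concatenate([[0.0], np.cumsum(Q)])
    b = np.clip(np.searchsorted(cum, y, side="right") - 1, 0, K - 1)
    return b * w + w * (y - cum[b]) / Q[b]         # Eq. (20)

Q = np.array([0.1, 0.2, 0.4, 0.3])                 # example bin probabilities (sum to 1)
x = np.random.rand(1000)
y = piecewise_linear_cdf(x, Q)
assert np.allclose(piecewise_linear_inverse(y, Q), x)
# The Jacobian factor of Eq. (21) for each point is simply Q[b] / w.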

A.2 Piecewise Quadratic

For the piecewise quadratic coupling layer Müller et al. (2019), given bins with widths , with vertex heights given by , the PDF is defined as:

(22)

Integrating the above equation leads to the CDF:

(23)

where the bin index is defined as the solution of the corresponding bin-selection condition and the remaining variable is the relative position of the point within that bin. Inverting the CDF leads to:

(24)

where is defined as the solution to

(25)

and is the relative position of in the bin , and is given by:

(26)

A.3 Piecewise Rational Quadratic

For the piecewise rational quadratic coupling layer Durkan et al. (2019), given $K+1$ knot points $\{(x^{(k)}, y^{(k)})\}_{k=0}^{K}$ that are monotonically increasing, with $(x^{(0)}, y^{(0)}) = (0,0)$ and $(x^{(K)}, y^{(K)}) = (1,1)$, and $K+1$ non-negative derivatives $\{d^{(k)}\}$, the CDF can be calculated using the algorithm from Gregory and Delbourgo (1982), which is roughly reproduced below.

First, we define the bin widths $w_k = x^{(k+1)} - x^{(k)}$ and the slopes $s_k = (y^{(k+1)} - y^{(k)})/w_k$. We next obtain the fractional distance $\xi = (x - x^{(k)})/w_k$ between the two knots between which the point of interest $x$ lies, where $k$ is the bin that $x$ lies in. The CDF is given by:

(27)

where it is written as a quotient whose numerator and denominator are detailed in Gregory and Delbourgo (1982). Its derivative simplifies to:

(28)

which is noted to be less prone to numerical issues Gregory and Delbourgo (1982). The inverse can be found by solving a quadratic equation Durkan et al. (2019):

(29)

where the coefficients are given in Durkan et al. (2019). Solving this equation for the solution that gives a monotonically increasing transformation results in:

(30)

where the second form is numerically more precise and is also valid when the quadratic coefficient vanishes Durkan et al. (2019).
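For the reader's convenience, the standard rational quadratic expressions of Durkan et al. (2019), which we believe correspond to Eqs. (27) and (28) (using $w_k$, $s_k$ and $\xi$ as defined above), read:

$$ C(x) = y^{(k)} + \frac{\left(y^{(k+1)} - y^{(k)}\right)\left[s_k\,\xi^2 + d^{(k)}\,\xi(1-\xi)\right]}{s_k + \left[d^{(k+1)} + d^{(k)} - 2 s_k\right]\xi(1-\xi)} , $$

$$ \frac{\mathrm{d}C}{\mathrm{d}x} = \frac{s_k^2\left[d^{(k+1)}\,\xi^2 + 2 s_k\,\xi(1-\xi) + d^{(k)}(1-\xi)^2\right]}{\left(s_k + \left[d^{(k+1)} + d^{(k)} - 2 s_k\right]\xi(1-\xi)\right)^2} . $$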

Appendix B Loss functions

We implemented several different divergences that can be used as loss functions. They differ in their symmetry properties, the relative weight they give to small and large deviations, their treatment of vanishing densities (also in the derivative), and their numerical complexity. All of them belong to the class of f-divergences Nielsen and Nock (2014). A code sketch of estimators for a few of them follows the list below.
Pearson divergence:

(31)

Kullback–Leibler divergence:

(32)

squared Hellinger distance:

(33)

Jeffreys divergence:

(34)

Chernoff’s α-divergence:

(35)

exponential divergence:

(36)

-product divergence:

(37)

Jensen–Shannon divergence:

(38)
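As an illustration, the following sketch implements sample-based estimators for a few of the divergences above, using their textbook definitions (the exact conventions and prefactors used in i-flow may differ). As in the training, the expectation is taken over points drawn from the flow distribution $g$, so a divergence $\int d(f(x), g(x))\,\mathrm{d}x$ is estimated as the sample mean of $d(f,g)/g$:

import numpy as np

# f and g are arrays holding the (normalized) integrand and flow densities
# evaluated at points sampled from g.
def pearson_chi2(f, g):
    return np.mean((f - g) ** 2 / g**2)            # int (f-g)^2 / g dx

def kullback_leibler(f, g):
    return np.mean(f / g * np.log(f / g))          # int f log(f/g) dx

def squared_hellinger(f, g):
    return np.mean(0.5 * (np.sqrt(f) - np.sqrt(g)) ** 2 / g)   # (1/2) int (sqrt f - sqrt g)^2 dx

def jensen_shannon(f, g):
    m = 0.5 * (f + g)
    return np.mean(0.5 * (f * np.log(f / m) + g * np.log(g / m)) / g)

# Every divergence vanishes when the flow matches the integrand exactly:
f = g = np.random.rand(1000) + 0.5
print(pearson_chi2(f, g), kullback_leibler(f, g), squared_hellinger(f, g), jensen_shannon(f, g))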

Appendix C Sector decomposition of scalar loop integrals

Following Binoth et al. (2003), we give the integral representations of the triangle and box functions in 4 dimensions using the Feynman parametrisation. To begin with, the triangle integral with external particles of given energies and internal propagators of given masses is given by

(39)

The 3-dimensional integral is further split into 3 sectors by the decomposition:

(40)

For example, in the first sector, after a suitable variable transformation the integral simplifies to

(41)

Therefore,

(42)

One can perform the same trick to treat the box integral with 4 external fields and 4 propagators. After sector decomposition, one gets