## 1 Introduction

Recent works have investigated the use of a particular class of deep generative machine learning models, called normalizing flows, in lattice field theory [flowsforlattice, flowsforlattice2, flowsforlattice3, del2021efficient, hackett2021flow, nicoli2021estimation, de2021scaling] following similar approaches in quantum chemistry [noe2019boltzmann, wirnsberger2020targeted, wirnsberger2021normalizing] and statistical physics [wu2019solving, nicoli2020asymptotically, nicoli2019comment] (see also [bachtis2021quantum, urban2018reducing, tanaka2017towards, bulusu2021generalization]). These works are proof-of-principle demonstrations for simple two-dimensional field theories and aim to reduce the integrated autocorrelation time for systems close to criticality by using the flow to generate decorrelated field samples.

Another important application of flows was recently pointed out in [nicoli2020asymptotically]: they can directly estimate the free energy of a lattice field theory (which can also be accomplished with methods such as tensor networks, see, e.g., [akiyama2020tensor] and references therein). The free energy is important as it allows one to compute the entropy, pressure and the equation of state of the considered physical system. In the case of Quantum Chromodynamics, such thermodynamic observables are of the utmost importance in the physics of the early universe and are probed by heavy ion experiments [busza2018heavy].

In the following, we will review this deep-learning-based estimation technique of the free energy and discuss the important issue of mode collapse. To illustrate the mode collapse of the flow in a concrete example, consider a target density with two modes, as is the case for a quantum mechanical particle in a double well potential or scalar $\phi^4$-theory in the broken phase. The latter example will be discussed in detail in Section 3.2. For both systems, the theory has a spontaneously broken symmetry and thus two modes corresponding to the two vacuum expectation values. As we will discuss in Section 2.1, the training process can however lead to a flow that only approximates one mode of the target density of the lattice field theory and assigns (almost) vanishing probability mass to the other [nicoli2021estimation, hackett2021flow]. This will lead to systematic errors of the free energy estimate which can be difficult to detect. In this contribution, we report on both mitigation and detection techniques for such a mode collapse and demonstrate their effectiveness for the example of two-dimensional scalar $\phi^4$-theory.

## 2 Normalizing Flows

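Let $f: \mathcal{M} \to \mathcal{N}$ be an orientation-preserving diffeomorphism between two orientable $n$-dimensional Riemannian manifolds $\mathcal{M}$ and $\mathcal{N}$. We assume that there is a probability measure $q_Z \, \mu_{\mathcal{M}}$ defined on $\mathcal{M}$, where $\mu_{\mathcal{M}}$ is the measure associated with the volume form on $\mathcal{M}$ and $q_Z$ is a positive smooth map. In particular, it holds that $\int_{\mathcal{M}} q_Z \, \mathrm{d}\mu_{\mathcal{M}} = 1$. The push-forward measure $f_*(q_Z \, \mu_{\mathcal{M}})$ is then a probability measure on $\mathcal{N}$.

In coordinates on $\mathcal{N}$, the push-forward takes the form of the base density evaluated at $f^{-1}(\phi)$ and rescaled by the inverse of $\left| \partial f / \partial z \right|$, where $\left| \partial f / \partial z \right|$ is the determinant of the Jacobian. In the machine learning literature, one therefore often refers to

$$ q(\phi) = q_Z\big(f^{-1}(\phi)\big) \left| \frac{\partial f}{\partial z} \right|^{-1} \tag{1} $$

as the push-forward density of $q_Z$.

We will be interested in the case $\mathcal{M} = \mathcal{N} = \mathbb{R}^n$ since we will consider real-valued scalar fields. The basic idea of a normalizing flow is to define a family of diffeomorphisms $f_\theta$ with parameters $\theta$. We then adjust these parameters such that the push-forward density $q_\theta$ closely approximates a certain target density $p$.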

In practice, the diffeomorphism $f_\theta$ is parameterized by a deep neural network. Neural networks are composite functions of the form

$$ g_\theta = l_L \circ l_{L-1} \circ \dots \circ l_1, \tag{2} $$

where $g_\theta$ is a composition of $L$ layers $l_i$ defined by

$$ l_i(x) = \sigma(W_i x + b_i) $$

with weights $W_i$ and biases $b_i$ being the free parameters of the neural network, i.e. $\theta = \{W_i, b_i\}_{i=1}^{L}$. (We restrict to all weights and biases being of the same dimensionality since we will be interested in networks that can be used to model invertible maps.) Furthermore, $\sigma$ is a non-linear function which is applied element-wise to each component of the vector $W_i x + b_i$. A neural network is called deep if the number of layers $L$ is large (although there is no clearly defined threshold).

There are various approaches for parameterizing diffeomorphisms by neural networks. We will restrict to a particularly straightforward approach, called *Non-linear Independent Component Estimation* (NICE), which splits the input $z = (z_1, z_2)$ in two parts $z_1 = (z^1, \dots, z^k)$ and $z_2 = (z^{k+1}, \dots, z^n)$ for given $k \in \{1, \dots, n-1\}$. A diffeomorphism is then given by

$$ f_\theta(z_1, z_2) = \big(z_1,\; z_2 + m_\theta(z_1)\big), \tag{3} $$

where $m_\theta$ is a (not necessarily invertible) neural network of the form (2). Due to the splitting of the input $z$, this can be easily inverted by

$$ f_\theta^{-1}(\phi_1, \phi_2) = \big(\phi_1,\; \phi_2 - m_\theta(\phi_1)\big). $$

For the NICE architecture, the determinant of the Jacobian is given by

$$ \left| \frac{\partial f_\theta}{\partial z} \right| = \det \begin{pmatrix} \mathbb{1} & 0 \\ \frac{\partial m_\theta}{\partial z_1} & \mathbb{1} \end{pmatrix} = 1. $$

As a result, the diffeomorphism is volume-preserving, i.e. $q_\theta(f_\theta(z)) = q_Z(z)$. In practice, we compose several of these volume-preserving diffeomorphisms. This combination is again a volume-preserving diffeomorphism because these maps form a group under composition.
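The additive coupling construction can be checked numerically. The following is a minimal sketch in plain Python, not the implementation used in this work: the helper names (`make_shift_network`, `coupling_forward`, `coupling_inverse`, `jacobian_determinant`), the random tanh network with a linear output layer, and the dimensions $n = 4$, $k = 2$ are all assumptions made for illustration. It verifies that the inverse recovers the input and that a finite-difference Jacobian has unit determinant.

```python
import math
import random


def make_shift_network(k, n_out, hidden, rng):
    """Random tanh network m: R^k -> R^n_out (hypothetical stand-in for the shift network)."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(n_out)]
    b2 = [rng.gauss(0.0, 1.0) for _ in range(n_out)]

    def m(z1):
        h = [math.tanh(sum(w * x for w, x in zip(row, z1)) + b)
             for row, b in zip(w1, b1)]
        return [sum(w * x for w, x in zip(row, h)) + b
                for row, b in zip(w2, b2)]

    return m


def coupling_forward(z, k, m):
    """Additive coupling: (z1, z2) -> (z1, z2 + m(z1))."""
    z1, z2 = z[:k], z[k:]
    return z1 + [a + s for a, s in zip(z2, m(z1))]


def coupling_inverse(phi, k, m):
    """Inverse coupling: (phi1, phi2) -> (phi1, phi2 - m(phi1))."""
    phi1, phi2 = phi[:k], phi[k:]
    return phi1 + [a - s for a, s in zip(phi2, m(phi1))]


def jacobian_determinant(f, z, eps=1e-5):
    """Central finite-difference Jacobian of f at z, determinant by elimination."""
    n = len(z)
    cols = []
    for j in range(n):
        up, dn = z[:], z[:]
        up[j] += eps
        dn[j] -= eps
        cols.append([(a - b) / (2.0 * eps) for a, b in zip(f(up), f(dn))])
    a = [[cols[j][i] for j in range(n)] for i in range(n)]  # a[i][j] = d f_i / d z_j
    det = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))  # partial pivoting
        if a[p][i] == 0.0:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            det = -det
        det *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det


rng = random.Random(0)
n, k = 4, 2
m = make_shift_network(k, n - k, hidden=8, rng=rng)
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
phi = coupling_forward(z, k, m)
z_back = coupling_inverse(phi, k, m)
round_trip_error = max(abs(a - b) for a, b in zip(z, z_back))
det_jacobian = jacobian_determinant(lambda v: coupling_forward(v, k, m), z)
print(round_trip_error, det_jacobian)
```

Note that the shift network is never inverted: only its output is added or subtracted, which is why it does not need to be invertible itself.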

A normalizing flow is typically chosen to be a push-forward of a simple base density, e.g. a standard normal $q_Z = \mathcal{N}(0, \mathbb{1})$. This allows for efficient sampling by first drawing $z \sim q_Z$ and then applying the diffeomorphism to the sample $z$, i.e.

$$ \phi = f_\theta(z) \sim q_\theta, \tag{4} $$

where the push-forward density $q_\theta$ is given by (1).

### 2.1 Training of the Flow

A lattice field theory can be described by a probability density of the form

$$ p(\phi) = \frac{1}{Z} \, e^{-S(\phi)}, \tag{5} $$

where $\phi$, $S$ and $Z$ denote the field, its action, and the partition function respectively.

A similarity measure between two densities $q_\theta$ and $p$ is given by the Kullback–Leibler (KL) divergence

$$ \mathrm{KL}(q_\theta \,\|\, p) = \int \mathcal{D}\phi \; q_\theta(\phi) \, \ln \frac{q_\theta(\phi)}{p(\phi)}. \tag{6} $$

The KL divergence is non-negative and vanishes if and only if both densities are equal, i.e. $q_\theta = p$. (We restrict to continuous densities here. Otherwise, the densities can have different values on a set of zero measure.) We can therefore train the flow to approximate the target density by minimizing this KL divergence using gradient descent, i.e. $\theta \leftarrow \theta - \eta \, \nabla_\theta \mathrm{KL}(q_\theta \,\|\, p)$ with learning rate $\eta$. For this, we observe that the KL divergence can be rewritten as

$$ \mathrm{KL}(q_\theta \,\|\, p) = \mathbb{E}_{\phi \sim q_\theta}\big[ S(\phi) + \ln q_\theta(\phi) \big] + \ln Z, $$

where the last summand contains terms independent of $\theta$ and can thus be ignored for gradient descent. We now sample from the flow to obtain its Monte-Carlo estimator, i.e.

$$ \mathrm{KL}(q_\theta \,\|\, p) \approx \frac{1}{N} \sum_{i=1}^{N} \big[ S(\phi_i) + \ln q_\theta(\phi_i) \big] + \ln Z, \qquad \phi_i \sim q_\theta. $$

The log probability $\ln q_\theta(\phi_i)$ can efficiently be calculated by (1). Additionally, we can very efficiently sample from the flow by pushing forward samples from the base density, see (4). However, the training of the flow may yield poor results for a multi-modal target density. This is because the training relies on self-sampling. During training, self-sampling may lead to a collapse of almost all the flow’s probability mass to a subset of the modes of the target density $p$. The estimator does not penalize this behaviour, since it only evaluates the action on samples drawn from the flow, which no longer produces samples from the other modes of the target density $p$. We will discuss both detection and mitigation of mode collapse in the next section.
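To make the weak penalty on mode collapse concrete, the sketch below compares the two directions of the KL divergence in a one-dimensional toy model. Everything in it is an assumption chosen for illustration: the target is an equal mixture of two unit-variance Gaussians at $\pm M$ with $M = 4$, and the "collapsed flow" is a single Gaussian sitting on the right-hand mode. Under these assumptions the divergence sampled from the flow stays near $\ln 2$, while the divergence sampled from the target is large.

```python
import math
import random

M = 4.0  # hypothetical mode separation of the toy target


def log_normal(x, mu):
    """Log density of a unit-variance Gaussian centred at mu."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)


def log_p(x):
    """Toy bimodal target: equal mixture of Gaussians at -M and +M."""
    a, b = log_normal(x, -M), log_normal(x, M)
    hi = max(a, b)
    return math.log(0.5) + hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def log_q(x):
    """Toy mode-collapsed model: covers only the mode at +M."""
    return log_normal(x, M)


def kl_sampled_from_model(n, rng):
    """Monte-Carlo estimate of KL(q || p) using samples from q (self-sampling)."""
    xs = [rng.gauss(M, 1.0) for _ in range(n)]
    return sum(log_q(x) - log_p(x) for x in xs) / n


def kl_sampled_from_target(n, rng):
    """Monte-Carlo estimate of KL(p || q) using exact samples from p."""
    xs = [rng.gauss(rng.choice((-M, M)), 1.0) for _ in range(n)]
    return sum(log_p(x) - log_q(x) for x in xs) / n


rng = random.Random(0)
kl_reverse = kl_sampled_from_model(20000, rng)
kl_forward = kl_sampled_from_target(20000, rng)
print(kl_reverse, kl_forward)
```

In this toy, no sample drawn from the collapsed model ever lands near the missing mode, so a self-sampled loss carries essentially no signal pushing probability mass towards it.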

## 3 Flow-based Estimation of Free Energy

A promising application of normalizing flows is estimating the free energy $F$ of a lattice field theory at temperature $T$ defined by

$$ F = -T \ln Z, \tag{7} $$

where $Z$ is the partition function. The temperature is given by $T = 1/(a N_T)$ with lattice spacing $a$ and $N_T$ denoting the number of lattice points along the temporal direction of the lattice.

### 3.1 MCMC-based Estimates of Free Energy

Estimating the free energy with MCMC is challenging. To illustrate this fact, we discuss a reweighting procedure [de2001t, philipsen2013qcd] which starts from the observation that the difference in free energies between two different points $A$ and $B$ in parameter space can be calculated by

$$ \Delta F_{AB} = F_B - F_A = -T \ln \frac{Z_B}{Z_A} = -T \ln \mathbb{E}_{\phi \sim p_A}\Big[ e^{-(S_B(\phi) - S_A(\phi))} \Big]. \tag{8} $$

This expectation value can be estimated by MCMC. If we choose the point $A$ in parameter space such that the free energy $F_A$ can be calculated exactly or approximately, we can obtain the value of the free energy at the point $B$ by $F_B = F_A + \Delta F_{AB}$.

In practice, the variance of the estimator (8) will become prohibitively large if the two distributions $p_A$ and $p_B$ have a small overlap. This can be avoided by choosing intermediate distributions $p_0 = p_A, p_1, \dots, p_K = p_B$ such that neighbouring distributions $p_i$ and $p_{i+1}$ overlap sufficiently. The free energy difference can then be obtained by

$$ \Delta F_{AB} = \sum_{i=0}^{K-1} \Delta F_{i,i+1}. \tag{9} $$

This comes at the price of an accumulated error of all free energy differences $\Delta F_{i,i+1}$. The error therefore crucially depends on all points of the (discretized) trajectory connecting the points $A$ and $B$ in parameter space.
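The chained reweighting procedure can be sketched on a toy problem where the answer is known. The example below is an illustration under stated assumptions, not the setup used in the paper: the "actions" are one-dimensional quadratics $S_k(x) = k x^2 / 2$ with a hypothetical stiffness parameter $k$ playing the role of the point in parameter space, exact Gaussian sampling replaces MCMC, and $T = 1$. Since $Z_k = \sqrt{2\pi/k}$, the exact free energy difference is available for comparison.

```python
import math
import random

T = 1.0  # temperature, set to one for this toy


def delta_f(k_from, k_to, n_samples, rng):
    """Estimate F(k_to) - F(k_from) by reweighting samples drawn from p_{k_from}.

    Toy actions S_k(x) = k x^2 / 2, so p_k is a Gaussian with variance 1/k
    and can be sampled exactly (a stand-in for MCMC).
    """
    sigma = 1.0 / math.sqrt(k_from)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        total += math.exp(-(k_to - k_from) * x * x / 2.0)  # exp(-(S_to - S_from))
    return -T * math.log(total / n_samples)


def delta_f_chain(points, n_samples, rng):
    """Sum the free energy differences between neighbouring points of a trajectory."""
    return sum(delta_f(a, b, n_samples, rng) for a, b in zip(points, points[1:]))


rng = random.Random(1)
k_start, k_end, n_steps = 16.0, 1.0, 8
ratio = (k_end / k_start) ** (1.0 / n_steps)
path = [k_start * ratio ** i for i in range(n_steps + 1)]  # geometric trajectory

exact = -T * 0.5 * math.log(k_start / k_end)  # from Z_k = sqrt(2 pi / k)
chained = delta_f_chain(path, 20000, rng)
direct = delta_f(k_start, k_end, n_steps * 20000, rng)  # same total sample budget
print(exact, chained, direct)
```

In this toy the single-step weights $e^{-(S_B - S_A)}$ have infinite variance when going from the narrow to the wide Gaussian, whereas each step of the chained trajectory has finite variance, which is the overlap argument made above.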

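### 3.2 Example: Two-dimensional $\phi^4$-Theory

This dependence on the trajectory can lead to serious problems, as we illustrate in a concrete example of the $\phi^4$-theory in two dimensions with the action

$$ S(\phi) = \sum_{x \in \Lambda} \Big[ -2\kappa \sum_{\hat\mu=1}^{2} \phi(x)\,\phi(x+\hat\mu) + (1 - 2\lambda)\,\phi(x)^2 + \lambda\,\phi(x)^4 \Big], \tag{10} $$

where $\kappa$ is the hopping parameter and $\lambda$ denotes the bare coupling. For vanishing hopping parameter $\kappa = 0$, the free energy can be calculated analytically [nicoli2021estimation] and is given by

$$ F = -T \, |\Lambda| \, \ln z(\lambda), $$

where $|\Lambda|$ denotes the number of sites of the lattice $\Lambda$ and

$$ z(\lambda) = \sqrt{\frac{1-2\lambda}{4\lambda}} \; \exp\!\left(\frac{(1-2\lambda)^2}{8\lambda}\right) K_{1/4}\!\left(\frac{(1-2\lambda)^2}{8\lambda}\right) $$

with $K_\nu$ being the modified Bessel function of the second kind.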
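At vanishing hopping parameter the lattice sites decouple, so the partition function is a product of identical single-site integrals. The sketch below evaluates that integral numerically as a cross-check for a closed-form expression. It assumes the standard hopping-parameter form of the single-site action, $(1 - 2\lambda)\phi^2 + \lambda\phi^4$, which should be compared against the action actually used; the function names, the integration cutoff, and the example lattice size are hypothetical.

```python
import math


def single_site_integral(lam, cutoff=10.0, steps=20000):
    """Simpson-rule estimate of the integral of exp(-(1 - 2 lam) phi^2 - lam phi^4).

    Assumes the single-site action (1 - 2 lam) phi^2 + lam phi^4; `steps` must be even.
    """
    h = 2.0 * cutoff / steps

    def weight(phi):
        return math.exp(-(1.0 - 2.0 * lam) * phi ** 2 - lam * phi ** 4)

    total = weight(-cutoff) + weight(cutoff)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * weight(-cutoff + i * h)
    return total * h / 3.0


def free_energy_at_zero_hopping(lam, n_sites, temperature):
    """F = -T * n_sites * ln z(lam), valid when the sites are decoupled."""
    return -temperature * n_sites * math.log(single_site_integral(lam))


z_free = single_site_integral(0.0)      # pure Gaussian: sqrt(pi)
z_quartic = single_site_integral(0.5)   # pure quartic: Gamma(1/4) / (2 * 0.5**0.25)
f_example = free_energy_at_zero_hopping(0.02, n_sites=64, temperature=1.0)
print(z_free, z_quartic, f_example)
```

The two limiting cases have elementary closed forms, which gives a quick sanity check of the integration before the result is used as the reference point of a reweighting trajectory.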

As the hopping parameter $\kappa$ is increased, spontaneous breaking of the $\mathbb{Z}_2$-symmetry is observed. This is illustrated in Figure 1. Now, suppose we want to calculate the free energy with MCMC for parameters $(\kappa, \lambda)$ in the broken phase. We can then choose a trajectory through parameter space for which the bare coupling $\lambda$ is kept constant and the initial hopping parameter is $\kappa_0 = 0$. We then increase the hopping parameter in fixed steps up to an intermediate value and then use a smaller step size in order to ensure sufficient overlap. Crucially, the estimate of the free energy in the broken phase will now suffer from critical slowing down as the corresponding trajectory has to cross the phase transition in order to connect to the initial hopping parameter $\kappa_0$. This will lead to a significant increase in the statistical error, see Figure 1.

### 3.3 Flow-based Estimators of the Free Energy

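Normalizing flows allow us to directly estimate the free energy at a given point in parameter space and therefore allow us to avoid critical slowing down in the specific situations discussed in the previous section. This can be seen by observing that we can estimate the partition function $Z$ using a trained flow $q_\theta$ in two different ways. Firstly, we can use samples from the flow

$$ \hat{Z}_q = \frac{1}{N} \sum_{i=1}^{N} \tilde{w}(\phi_i), \qquad \phi_i \sim q_\theta, \qquad \tilde{w}(\phi) = \frac{e^{-S(\phi)}}{q_\theta(\phi)}. $$

Using this definition, we obtain the *reverse estimator* of the free energy by

$$ \hat{F}_q = -T \ln \hat{Z}_q. \tag{11} $$

Secondly, one can use samples from the target density

$$ \hat{Z}_p^{-1} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\tilde{w}(\phi_i)}, \qquad \phi_i \sim p, $$

to obtain the *forward estimator*

$$ \hat{F}_p = -T \ln \hat{Z}_p = T \ln \hat{Z}_p^{-1}. \tag{12} $$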

Both estimators have relative strengths and weaknesses. If we are confident that the flow closely approximates the target density $p$, it is advisable to use the reverse estimator (11) because sampling from the flow is more efficient. However, this estimator may lead to incorrect results if the flow is mode-dropping. In contrast, the forward estimator uses samples from $p$ and thus cannot neglect any mode of the target density $p$. If mode-dropping is a risk (for example in the broken phase of the $\phi^4$-theory), one should therefore also use the forward estimator (12) as a consistency check.
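The consistency check between the two estimators can be illustrated on a one-dimensional toy target with a known partition function. The sketch below is not the experiment reported in this paper. It assumes an unnormalised target equal to $Z$ times an equal mixture of two unit-variance Gaussians at $\pm M$, with hypothetical values $M = 5$ and $\ln Z = 3$, a mode-collapsed "flow" given by a single Gaussian on the right-hand mode, exact sampling from the target in place of MCMC, and $T = 1$.

```python
import math
import random

M = 5.0      # hypothetical position of the two modes at -M and +M
LOG_Z = 3.0  # hypothetical exact ln Z of the toy target
T = 1.0      # temperature, set to one for this toy


def log_normal(x, mu):
    """Log density of a unit-variance Gaussian centred at mu."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)


def log_boltzmann(x):
    """ln exp(-S(x)): unnormalised symmetric two-mode target with known ln Z."""
    a, b = log_normal(x, -M), log_normal(x, M)
    hi = max(a, b)
    return LOG_Z + math.log(0.5) + hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def log_q_collapsed(x):
    """Toy mode-collapsed model density: covers only the mode at +M."""
    return log_normal(x, M)


def log_mean_exp(values):
    """Numerically stable ln( mean( exp(values) ) )."""
    hi = max(values)
    return hi + math.log(sum(math.exp(v - hi) for v in values) / len(values))


def reverse_free_energy(n, rng):
    """Reverse estimator: samples from the model, F = -T ln mean(w)."""
    xs = [rng.gauss(M, 1.0) for _ in range(n)]
    return -T * log_mean_exp([log_boltzmann(x) - log_q_collapsed(x) for x in xs])


def forward_free_energy(n, rng):
    """Forward estimator: samples from the target, F = T ln mean(1 / w)."""
    xs = [rng.gauss(rng.choice((-M, M)), 1.0) for _ in range(n)]
    return T * log_mean_exp([log_q_collapsed(x) - log_boltzmann(x) for x in xs])


rng = random.Random(2)
exact = -T * LOG_Z
f_reverse = reverse_free_energy(20000, rng)
f_forward = forward_free_energy(20000, rng)
print(exact, f_reverse, f_forward)
```

Under these assumptions the reverse estimate is off by roughly $T \ln 2$, because the collapsed model only ever sees half of the target mass, while the forward estimate stays close to the exact value. A visible gap between the two is therefore a practical warning sign of mode collapse.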

## 4 Numerical Experiments

In the following, we will illustrate the difference in using the forward and the reverse variants for the free energy estimation in the presence of mode-dropping. To this end, we consider two normalizing flows trained for the two-dimensional scalar $\phi^4$-theory at a fixed hopping parameter $\kappa$ and bare coupling $\lambda$ on a fixed lattice. The parameters are chosen such that the theory is in its broken phase, see Figure 2. One of the flows is mode-collapsed on a single mode of the target density $p$, while the other flow covers both modes, as can be seen on the right of Figure 2.

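For both flows, we then use the forward estimator and reverse estimator as defined in (12) and (11) respectively. The estimated values for the free energy are visualized in Figure 2. For the mode-collapsed flow, we see a clear discrepancy in the prediction while the mode-covering flow leads to consistent values of the forward and reverse estimators. The fact that the forward estimator gives the correct result for the mode-collapsed model can heuristically be understood by assuming that the flow is approximately $q_\theta(\phi) \approx 2\,p(\phi)$ for the covered mode and $q_\theta(\phi) \approx 0$ for the other mode. This implies that

$$ \frac{1}{N} \sum_{i=1}^{N} \frac{q_\theta(\phi_i)}{e^{-S(\phi_i)}} \approx \frac{1}{N} \cdot \frac{N}{2} \cdot \frac{2}{Z} = \frac{1}{Z}, \qquad \phi_i \sim p, \tag{13} $$

since roughly half of the $N$ samples drawn from $p$ fall into the covered mode, where each contributes $2\,p(\phi_i)/e^{-S(\phi_i)} = 2/Z$.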

In summary, this experiment clearly illustrates that forward estimation of the free energy is crucial in the presence of mode collapse.

## 5 Conclusion

Deep generative models, in particular normalizing flows, allow for a direct estimation of the free energy. Current normalizing flow architectures are however far from perfect. For example, they are challenging to train in the broken phase (particularly for larger lattices) and can suffer from mode collapse for multi-modal densities. In this contribution, we have briefly outlined how forward estimation of the free energy can help to mitigate this weakness.

## 6 Acknowledgements

K.A.N., C.A., P.K. and S.N. are funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A). P.S. is supported by Agencia Estatal de Investigación (“Severo Ochoa” Center of Excellence CEX2019-000910-S, Plan National FIDEUA PID2019-106901GB-I00/10.13039/501100011033, FPI), Fundació Privada Cellex, Fundació Mir-Puig, and by Generalitat de Catalunya (AGAUR Grant No. 2017 SGR 1341, CERCA program). L.F. is partially supported by the Co-Design Center for Quantum Advantage (C2QA) under subcontract number 390034, by the DOE QuantiSED Consortium under subcontract number 675352, by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/), and by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under grant contract numbers DE-SC0011090 and DE-SC0021006. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Industry Canada and by the Province of Ontario through the Ministry of Colleges and Universities.