# On Estimation of Thermodynamic Observables in Lattice Field Theories with Deep Generative Models

In this work, we demonstrate that applying deep generative machine learning models for lattice field theory is a promising route for solving problems where Markov Chain Monte Carlo (MCMC) methods are problematic. More specifically, we show that generative models can be used to estimate the absolute value of the free energy, which is in contrast to existing MCMC-based methods which are limited to only estimate free energy differences. We demonstrate the effectiveness of the proposed method for two-dimensional ϕ^4 theory and compare it to MCMC-based methods in detailed numerical experiments.

## Authors

• 4 publications
• 7 publications
• 2 publications
• 3 publications
• 4 publications
• 9 publications
• 22 publications
• 2 publications
• ### Flow-based generative models for Markov chain Monte Carlo in lattice field theory

A Markov chain update scheme using a machine-learned flow-based generati...
04/26/2019 ∙ by M. S. Albergo, et al. ∙ 0

• ### Machine Learning of Thermodynamic Observables in the Presence of Mode Collapse

Estimating the free energy, as well as other thermodynamic observables, ...
11/22/2021 ∙ by Kim A. Nicoli, et al. ∙ 0

• ### Computing Absolute Free Energy with Deep Generative Models

Fast and accurate evaluation of free energy has broad applications from ...
05/01/2020 ∙ by Xinqiang Ding, et al. ∙ 0

• ### Deep Involutive Generative Models for Neural MCMC

We introduce deep involutive generative models, a new architecture for d...
06/26/2020 ∙ by Span Spanbauer, et al. ∙ 6

• ### Machine Learning and Variational Algorithms for Lattice Field Theory

In lattice quantum field theory studies, parameters defining the lattice...
06/03/2021 ∙ by Gurtej Kanwar, et al. ∙ 0

• ### Markov-Chain Monte Carlo Approximation of the Ideal Observer using Generative Adversarial Networks

The Ideal Observer (IO) performance has been advocated when optimizing m...
01/26/2020 ∙ by Weimin Zhou, et al. ∙ 0

• ### Towards Sampling from Nondirected Probabilistic Graphical models using a D-Wave Quantum Annealer

A D-Wave quantum annealer (QA) having a 2048 qubit lattice, with no miss...
05/01/2019 ∙ by Yaroslav Koshka, et al. ∙ 12

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## Appendix A Conventions for Action

The form of the action used in the main text is

 S(ϕ)=∑x∈Λ−2κ2∑^μ=1φ(x) φ(x+^μ)+(1−2λ)φ(x)2 +λφ(x)4. (11)

It can be obtained by starting from the (more standard) action

 S(φ)=∑x∈Λa2122∑^μ=1 φ(x+a^μ)−φ(x)a2 +m202φ2(x)+g04!φ4(x) (12)

and performing the following re-definitions

 φ =(2κ)12ϕ, (13) (am0)2 =1−2λκ−4, (14) a2g0 =6λκ2. (15)

## Appendix B A brief overview of Deep Learning

### Neural Networks

Neural networks are a machine learning algorithm which has proven to be particularly powerful. A neural network is build of layers which are defined by

 y(l)(x)=σ(Wlx+bl),

where , are input and output of the layer. The output of the layer is also often called the activation of the layer. The weights and the bias are the learnable parameters of the neural network. The non-linearity is a non-linear function which is applied element-wise to the components of

. Widely-used activation function are

or .

A neural network consists of such layers, i.e.

 g(x)=(y(L)∘⋯∘y(1))(x).

It is important to note that the weights and biases do not have to be of the same dimensionality for each layer (although we did not make this explicit in our notation). It is also important to note that we merely described the most simple type of neural network, namely a fully-connected neural network. There is a zoo of other neural networks but we will refrain from a more detailed discussion as it is not needed for our purposes (see Goodfellow et al. (2016) for an overview).

### Learning Parameters with Backpropagation

The parameters of the neural networks,

 W={(Wl,bl),i=1,…,L},

are determined by minimizing a certain loss function

by gradient descent (see (7) for the particular loss function used in this work). It is important to emphasize that the number of parameters are typically large (of order for typical modern neural networks). It is therefore clear that one cannot determine the gradient by finite-difference (as we would need calculate the finite difference ratio for each of these parameters which is prohibitively expensive).

The basic idea for calculating the gradient is to use the fact that we know the functional form of the neural network: the gradient of the loss is given by

 ∇WL=∂L∂yL∂yL∂yL−1…∂y1∂x.

Each term in this expression is known, for example

 ∂yl+1∂yl=σ′(yl)Wl.

For a fixed non-linearity, we know the analytical form of the derivative . This observation leads to the following algorithm: we first perform a forward pass of the neural network, i.e. starting from the input , we calculate the activations for each layer and store them in memory. This process ends with the final activation which is, by definition, the output of the neural network. The gradient of each layer can then directly be calculated (as we have stored the activation ). Crucially, we only need the matrix product of these Jacobians and it is efficient to start by calculating the gradient with respect to the output layer , then the layer and so forth. This is because the loss function has a scalar output value and therefore the matrix product of the Jacobians

 ∂L∂yL∂yL∂yL−1…∂yl+1∂yl

is a vector with the same number of components as

. We can therefore save memory by simply overwriting the stored activation . This algorithm is called backpropagation and allows us to calculate the gradient for roughly the same cost as a forward pass of the neural network.

## Appendix C Relation between Var(C) and KL divergence

###### Theorem.

Let . The following relation between the KL divergence and the variance of holds:

 KL(qθ||p)=12Varq(C)+O(Eq[|w−1|3]),

where is the normalized importance weight.

###### Proof.

The expectation value of the normalized importance weight is

 Eq[pq]=∫D[ϕ]p(ϕ)=1.

The Kullback-Leibler divergence can be rewritten in terms of the normalized importance weight

 KL(q|p)=Eq[lnqp]=−Eq[lnw].

We now expand the KL divergence around the expectation value of the normalized importance weight

 KL(q|p) =−Eq[ln(1−(w−1))] =Eq[∞∑j=0(−1)jj(w−1)j] =Eq[w−1]=0+12Eq[(w−1)2]+O(Eq[|w−1|3]).

We now relate this expression to the variance of . To this end, we first observe that , where is the unnormalized importance weight. We then rewrite the expectation value of as

 Eq[C] =−Eq[ln~w] =−Eq[lnw+lnZ] =−lnZ−Eq[lnw] =−lnZ+KL(q|p) =−lnZ+O(Eq[(w−1)2]),

where the last step uses the expansion for the KL divergence derived above. It then follows its variance is given by

 Varq(C) =Eq[(C−Eq[C])2] =Eq[(−ln~w+Eq[ln~w])2] =Eq[(−ln~w+lnZ=−lnw+O(Eq[(w−1)2]))2]

Expanding the logarithm around again, we obtain

 Varq[C]=Eq[(w−1)2]+O(Eq[|w−1|3]).

Combining this expression with the expansion derived for the KL divergence, we obtain the claim of the theorem. ∎

The higher-order moments will be small towards the end of the training process for which

and thus . Thus, the variance of will become small. We indeed observe this behaviour in our numerical experiments, see Figure 3 for an example.

## Appendix D Analytic Solution for Partition Function

The action of the scalar field theory is given by

 S=∑x∈Λ−2κ2∑^μ=1φ(x)φ(x+^μ)+(1−2λ)φ(x)2+λφ(x)4.

We want to calculate the partion function

 Z=∫D[ϕ]exp(−S(ϕ)),

which for vanishing hopping parameter decouples in independent integrals of each lattice site of the lattice :

 Z=∏x∈Λ(∫dϕ(x)exp(−λϕ(x)4−(1−2λ)ϕ(x)2))

The partition function can then be calculated analytically using the integral

 ∫exp(−ax4−bx2)dx=√b4aexp(b28a)K14(b28a),

where is the modified Bessel function of the second kind. Using this formula, we obtain the following analytic form of the free energy

 F=−T|Λ|ln(z),

where we have defined

 z(λ)=√1−2λ4λexp((1−2λ)28λ)K14((1−2λ)28λ).

This result corresponds to the zeroth order of the hopping expansion Gattringer and Lang (2009) of the partition function and one may, in principle, also calculate higher-order corrections. However, they are not needed for our purposes.

Our discussion is based on Nowozin (2018) which discussed the same results in the context of variational inference. We provide a review here since these results may be hard to extract for physicist not familiar with variational inference. We also point out a subtlety that was not discussed in the previous work.

The estimator for the free energy is given by

 ^F=−Tln1NN∑i=0~w(ϕi), ϕi∼qθ. (16)

Theorems for the variance and bias of this estimator are discussed in the following. For this, we use the delta method of moments which is summarized in the following theorem.

###### Theorem.

Let be the sample mean of independent and identically distributed random variables with for . Let be a real-valued function with uniformly bounded derivatives. It then holds that

 E[h(^XN)]=c0+c1N+O(1N2),

where

 c0=h(μ), c1=h′′(μ)σ22,

with and .

We refer to Chapter 5.3 of Bickel and Doksum (2015) for a proof and more details.

The application of the delta method to the free energy estimator is, in practise, subject to a subtlety regarding the bounded differentiablity of the function . We will ignore this subtlety in the following and return to it at the end of the section.

###### Theorem.

The bias of is given by

 B[−β^F]=−12NEq[(~w−Eq[~w])2]Eq[~w]2+O(N−2),

assuming that for .

###### Proof.

The bias of is given by

 B[−β^F]=Eq[ln^Z]−lnZ.

Using the delta method for moments, we derive that

 Eq[ln^Z]=lnZ−12NZ2Eq[(~w−Eq[~w])2]+O(N−2), (17)

where we have used that has second derivative . The proof then concludes by observing that

 Eq[~w]=Z.

###### Theorem.

The variance of is given by

assuming that for .

###### Proof.

The variance can be written as

 Var[−β^F]=Eq[(ln^Z)2]−Eq[ln^Z]2.

We now evaluate both terms on the right-hand-side individually using the delta method. For the first term, we use the delta method with which has second derivative

 h′′(x)=2x2−2ln(x)x2.

Using this expression, we then obtain that is equal to

 (lnZ)2+(1Z2−logZZ2)Eq[(~w−Eq[~w])2]N+O(N−2)

For the squared expectation value, we use the expansion (17) derived in the proof for the bias. This gives that is equal to

 (lnZ−12NZ2Eq[(~w−Eq[~w])2]+O(N−2))2 =(lnZ)2−1NZ2Eq[(~w−Eq[~w])2]+O(N−2).

Subtracting these two expressions, it then follows

 Var[−β^F]=1Z2Eq[(~w−Eq[~w])2]N+O(N−2),

and the proof concludes by observing that . ∎

A few remarks are in order: from the theorems, it follows that the standard deviation of the estimator

is of order . In the large limit, we can therefore neglect the bias correction as it is of order . Furthermore, we can replace the expectation values in the theorems by the sample mean up to (negligible) higher-order corrections. In practise, we therefore use these results to estimate the variance and bias of . Alternatively, one can use a standard Jackknife analysis to estimate variance and bias (see for example Gattringer and Lang (2009)). In our experiments, we use both methods to estimate the errors and check that they lead to consistent results. Lastly, we remark that error estimators for general observables involving the partition function can be derived, see Nicoli et al. (2020).

As mentioned above, the delta method requires that the derivatives of the function are (uniformly) bounded. For a generic LQFT, this will not be the case for since its derivatives diverge for . To the best of our knowledge, the same problem will generically arise in the context of variational inference but seems to have not been discussed in the literature.

To address this subtlety, one could require that the action of the lattice quantum field theory is bounded. For example, this can be ensured by putting the field theory in a box potential. Since only very high energy configurations are affected by this (for suitably large choice of the box potential) and since these configurations are extremely unlikely to be sampled, this modification will have no practical effect on the numerical experiments. After this modification, is bounded from below and is also bounded as a result.

More rigorously, the result for the variance can be derived without assumptions on a bound for the derivatives by using the delta method for in law approximation which takes the following form

###### Theorem.

Let be the sample mean of independent and identically distributed random variables with for . Let be a differential function at . Then

 √n(h(^XN)−h(μ))D→N(0,σ2(h)),

where .

For a proof, we again refer to Bickel and Doksum (2015), see Theorem 5.3.3. Applying this theorem to the free energy estimator , we obtain the same expression for its variance as derived above. However, the theorem does not require any bound on the derivatives of .

## Appendix F Step Size Analysis

As explained in the main text, the free energy difference is calculated in steps

 ΔFeb=ΔFe,iK+ΔFiKiK−1+⋯+ΔFi1b.

In this appendix, we will analyze the dependency of our results on the chosen steps.

We start from an initial step size corresponding to a change in hopping parameter of . Between and , we however take a finer step size of . Since we are interested in the free energy difference between and , this corresponds to running Markov chains. We focus on the lattice and use an overelaxed HMC algorithm to sample configurations for each chain. The overrelaxation is performed every steps.

We then repeatedly refine the step size in a certain subregion around the critical value. The details can be found in Table 1.

The results of this analysis are shown in Figure 5. We observe that the error of the estimator does not significantly decrease. We note that the error of the flow decreases during refinement because its free energy estimation uses the same number of samples as all Markov chains combined (and this number increases by the additional refinement steps).

In order to ensure that the distributions and in

have sufficient overlap, we also estimate by exchanging with in the relation above. We then check that this leads to compatible results, see Figure 6. We note that this consistency check is relatively cheap as it requires running one additional Markov chain.

We also study the dependence of the integrated autocorrelation of the free energy on this refinement procedure, see Figure 4.

## Appendix G Details on Numerical Experiments

### Hmc:

We use a HMC algorithm with overrelaxation. Each Markov chain has 5k thermalization steps followed by 400k estimation steps. The sign of the field configuration is flipped every ten steps.

### Training of flow:

For every lattice, we use a normalizing flow with six coupling layers. Each coupling layer (5) has neural network with five fully-connected layers with no bias and Tanh non-linearities. The hidden layers of

consist of 1000 neurons each. We train the flow for 1M steps using an 8k mini-batch. We use ReduceLROnPlateau learning rate scheduler of PyTorch with an initial learning rate of

and patience of 3k steps. The minimum learning rate was set to .

### Estimation:

As described in the main text, for HMC-based estimation we use a step size of for and a step size of for all other values of the hopping parameter. As a result 14 Markov chains are run. In total, the HMC-based method therefore uses configurations. We use the same number of samples for the flow-based estimation. For efficiency, we sample these configurations in mini-batches of 3k samples.

### Error estimation:

we use both the uwerr Wolff et al. (2004) and jackknife method to estimate uncertainties for HMC. In order to deal with autocorrelation for jackknife, we perform binning with a 1k bin size. Error estimation for flow is performed by the delta method and also by jackknife, see Figure 7.

From the free energy estimates, one can then derive other thermodynamic observables such as the entropy. We refer to the main text for a discussion of this. Figure 8 shows estimation of entropy. Errors were estimated using both the Jackknife and uwerr method. Both error analysis methods lead to consistent results.