Log In Sign Up

On Estimation of Thermodynamic Observables in Lattice Field Theories with Deep Generative Models

by   Kim A. Nicoli, et al.

In this work, we demonstrate that applying deep generative machine learning models for lattice field theory is a promising route for solving problems where Markov Chain Monte Carlo (MCMC) methods are problematic. More specifically, we show that generative models can be used to estimate the absolute value of the free energy, which is in contrast to existing MCMC-based methods which are limited to only estimate free energy differences. We demonstrate the effectiveness of the proposed method for two-dimensional ϕ^4 theory and compare it to MCMC-based methods in detailed numerical experiments.


page 1

page 2

page 3

page 4


Stochastic normalizing flows as non-equilibrium transformations

Normalizing flows are a class of deep generative models that provide a p...

Machine Learning of Thermodynamic Observables in the Presence of Mode Collapse

Estimating the free energy, as well as other thermodynamic observables, ...

Computing Absolute Free Energy with Deep Generative Models

Fast and accurate evaluation of free energy has broad applications from ...

Deep Involutive Generative Models for Neural MCMC

We introduce deep involutive generative models, a new architecture for d...

Machine Learning and Variational Algorithms for Lattice Field Theory

In lattice quantum field theory studies, parameters defining the lattice...

Markov-Chain Monte Carlo Approximation of the Ideal Observer using Generative Adversarial Networks

The Ideal Observer (IO) performance has been advocated when optimizing m...

Hierarchical autoregressive neural networks for statistical systems

It was recently proposed that neural networks could be used to approxima...


Appendix A Conventions for Action

The form of the action used in the main text is


It can be obtained by starting from the (more standard) action


and performing the following re-definitions


Appendix B A brief overview of Deep Learning

Neural Networks

Neural networks are a machine learning algorithm which has proven to be particularly powerful. A neural network is build of layers which are defined by

where , are input and output of the layer. The output of the layer is also often called the activation of the layer. The weights and the bias are the learnable parameters of the neural network. The non-linearity is a non-linear function which is applied element-wise to the components of

. Widely-used activation function are

or .

A neural network consists of such layers, i.e.

It is important to note that the weights and biases do not have to be of the same dimensionality for each layer (although we did not make this explicit in our notation). It is also important to note that we merely described the most simple type of neural network, namely a fully-connected neural network. There is a zoo of other neural networks but we will refrain from a more detailed discussion as it is not needed for our purposes (see Goodfellow et al. (2016) for an overview).

Learning Parameters with Backpropagation

The parameters of the neural networks,

are determined by minimizing a certain loss function

by gradient descent (see (7) for the particular loss function used in this work). It is important to emphasize that the number of parameters are typically large (of order for typical modern neural networks). It is therefore clear that one cannot determine the gradient by finite-difference (as we would need calculate the finite difference ratio for each of these parameters which is prohibitively expensive).

The basic idea for calculating the gradient is to use the fact that we know the functional form of the neural network: the gradient of the loss is given by

Each term in this expression is known, for example

For a fixed non-linearity, we know the analytical form of the derivative . This observation leads to the following algorithm: we first perform a forward pass of the neural network, i.e. starting from the input , we calculate the activations for each layer and store them in memory. This process ends with the final activation which is, by definition, the output of the neural network. The gradient of each layer can then directly be calculated (as we have stored the activation ). Crucially, we only need the matrix product of these Jacobians and it is efficient to start by calculating the gradient with respect to the output layer , then the layer and so forth. This is because the loss function has a scalar output value and therefore the matrix product of the Jacobians

is a vector with the same number of components as

. We can therefore save memory by simply overwriting the stored activation . This algorithm is called backpropagation and allows us to calculate the gradient for roughly the same cost as a forward pass of the neural network.

Appendix C Relation between and KL divergence


Let . The following relation between the KL divergence and the variance of holds:

where is the normalized importance weight.


The expectation value of the normalized importance weight is

The Kullback-Leibler divergence can be rewritten in terms of the normalized importance weight

We now expand the KL divergence around the expectation value of the normalized importance weight

We now relate this expression to the variance of . To this end, we first observe that , where is the unnormalized importance weight. We then rewrite the expectation value of as

where the last step uses the expansion for the KL divergence derived above. It then follows its variance is given by

Expanding the logarithm around again, we obtain

Combining this expression with the expansion derived for the KL divergence, we obtain the claim of the theorem. ∎

The higher-order moments will be small towards the end of the training process for which

and thus . Thus, the variance of will become small. We indeed observe this behaviour in our numerical experiments, see Figure 3 for an example.

Figure 3: The variance decreases during training. We use hopping parameter and bare coupling for a lattice.

Appendix D Analytic Solution for Partition Function

Figure 4: Integrated autocorrelation time of the free energy during refinement of the step size. The shaded red areas refers to the interval in hopping parameter for which the refinement is applied. Darker shading indicates narrower step size. The experiments were performed using the overrelaxed HCM algorithm.

The action of the scalar field theory is given by

We want to calculate the partion function

which for vanishing hopping parameter decouples in independent integrals of each lattice site of the lattice :

The partition function can then be calculated analytically using the integral

where is the modified Bessel function of the second kind. Using this formula, we obtain the following analytic form of the free energy

where we have defined

This result corresponds to the zeroth order of the hopping expansion Gattringer and Lang (2009) of the partition function and one may, in principle, also calculate higher-order corrections. However, they are not needed for our purposes.

Appendix E Error Analysis for Free Energy Estimator

Our discussion is based on Nowozin (2018) which discussed the same results in the context of variational inference. We provide a review here since these results may be hard to extract for physicist not familiar with variational inference. We also point out a subtlety that was not discussed in the previous work.

The estimator for the free energy is given by


Theorems for the variance and bias of this estimator are discussed in the following. For this, we use the delta method of moments which is summarized in the following theorem.


Let be the sample mean of independent and identically distributed random variables with for . Let be a real-valued function with uniformly bounded derivatives. It then holds that


with and .

We refer to Chapter 5.3 of Bickel and Doksum (2015) for a proof and more details.

The application of the delta method to the free energy estimator is, in practise, subject to a subtlety regarding the bounded differentiablity of the function . We will ignore this subtlety in the following and return to it at the end of the section.


The bias of is given by

assuming that for .


The bias of is given by

Using the delta method for moments, we derive that


where we have used that has second derivative . The proof then concludes by observing that


The variance of is given by

assuming that for .


The variance can be written as

We now evaluate both terms on the right-hand-side individually using the delta method. For the first term, we use the delta method with which has second derivative

Using this expression, we then obtain that is equal to

For the squared expectation value, we use the expansion (17) derived in the proof for the bias. This gives that is equal to

Subtracting these two expressions, it then follows

and the proof concludes by observing that . ∎

A few remarks are in order: from the theorems, it follows that the standard deviation of the estimator

is of order . In the large limit, we can therefore neglect the bias correction as it is of order . Furthermore, we can replace the expectation values in the theorems by the sample mean up to (negligible) higher-order corrections. In practise, we therefore use these results to estimate the variance and bias of . Alternatively, one can use a standard Jackknife analysis to estimate variance and bias (see for example Gattringer and Lang (2009)). In our experiments, we use both methods to estimate the errors and check that they lead to consistent results. Lastly, we remark that error estimators for general observables involving the partition function can be derived, see Nicoli et al. (2020).

As mentioned above, the delta method requires that the derivatives of the function are (uniformly) bounded. For a generic LQFT, this will not be the case for since its derivatives diverge for . To the best of our knowledge, the same problem will generically arise in the context of variational inference but seems to have not been discussed in the literature.

To address this subtlety, one could require that the action of the lattice quantum field theory is bounded. For example, this can be ensured by putting the field theory in a box potential. Since only very high energy configurations are affected by this (for suitably large choice of the box potential) and since these configurations are extremely unlikely to be sampled, this modification will have no practical effect on the numerical experiments. After this modification, is bounded from below and is also bounded as a result.

More rigorously, the result for the variance can be derived without assumptions on a bound for the derivatives by using the delta method for in law approximation which takes the following form


Let be the sample mean of independent and identically distributed random variables with for . Let be a differential function at . Then

where .

For a proof, we again refer to Bickel and Doksum (2015), see Theorem 5.3.3. Applying this theorem to the free energy estimator , we obtain the same expression for its variance as derived above. However, the theorem does not require any bound on the derivatives of .

Appendix F Step Size Analysis

Figure 5: Free energy at obtained by both MCMC-based and flow-based method. For the MCMC-based method, different step size of the hopping parameter were used. Details on the step sizes are summarized in Table 1. For the flow-based method, we use the same number of samples as for the corresponding refined MCMC method. As a result, also the error of the flow’s estimate decreases.
Figure 6: Free energy at and using both (down) and (up). We use the same setup (i.e. number of steps, hopping parameter change , etc) as for Figure 2.

As explained in the main text, the free energy difference is calculated in steps

In this appendix, we will analyze the dependency of our results on the chosen steps.

We start from an initial step size corresponding to a change in hopping parameter of . Between and , we however take a finer step size of . Since we are interested in the free energy difference between and , this corresponds to running Markov chains. We focus on the lattice and use an overelaxed HMC algorithm to sample configurations for each chain. The overrelaxation is performed every steps.

We then repeatedly refine the step size in a certain subregion around the critical value. The details can be found in Table 1.

The results of this analysis are shown in Figure 5. We observe that the error of the estimator does not significantly decrease. We note that the error of the flow decreases during refinement because its free energy estimation uses the same number of samples as all Markov chains combined (and this number increases by the additional refinement steps).

In order to ensure that the distributions and in

have sufficient overlap, we also estimate by exchanging with in the relation above. We then check that this leads to compatible results, see Figure 6. We note that this consistency check is relatively cheap as it requires running one additional Markov chain.

We also study the dependence of the integrated autocorrelation of the free energy on this refinement procedure, see Figure 4.

chains samples refined region
14 5.6 M 0.20-0.30
24 9.6 M 0.20-0.30
40 16 M 0.22-0.30
76 30.4 M 0.24-0.30
88 35.2 M 0.267-0.279
Table 1: Details on the refinement analysis. In each refinement stage, we take smaller steps in a certain subregion of the hopping parameter trajectory (see last column). The step size taken in this region is shown in the first column. Outside of the most refined region, the same step sizes as in the previous refinement stage are taken. Since each chain is taken to be of the same length, the total number of samples (third column) grows proportional to the number of chains (second column).

Appendix G Details on Numerical Experiments

Figure 7: Free energy estimation with error analysis by both Jackknife and delta method. Both lead to compatible results. We use the same data as in Figure 2.


We use a HMC algorithm with overrelaxation. Each Markov chain has 5k thermalization steps followed by 400k estimation steps. The sign of the field configuration is flipped every ten steps.

Training of flow:

For every lattice, we use a normalizing flow with six coupling layers. Each coupling layer (5) has neural network with five fully-connected layers with no bias and Tanh non-linearities. The hidden layers of

consist of 1000 neurons each. We train the flow for 1M steps using an 8k mini-batch. We use ReduceLROnPlateau learning rate scheduler of PyTorch with an initial learning rate of

and patience of 3k steps. The minimum learning rate was set to .


As described in the main text, for HMC-based estimation we use a step size of for and a step size of for all other values of the hopping parameter. As a result 14 Markov chains are run. In total, the HMC-based method therefore uses configurations. We use the same number of samples for the flow-based estimation. For efficiency, we sample these configurations in mini-batches of 3k samples.

Error estimation:

we use both the uwerr Wolff et al. (2004) and jackknife method to estimate uncertainties for HMC. In order to deal with autocorrelation for jackknife, we perform binning with a 1k bin size. Error estimation for flow is performed by the delta method and also by jackknife, see Figure 7.

From the free energy estimates, one can then derive other thermodynamic observables such as the entropy. We refer to the main text for a discussion of this. Figure 8 shows estimation of entropy. Errors were estimated using both the Jackknife and uwerr method. Both error analysis methods lead to consistent results.

Figure 8: Entropy density estimation with error analysis by both Jackknife and delta method. Both lead to compatible results. We use the same setup as for Figure 2.