References
 Busza et al. (2018) W. Busza, K. Rajagopal, and W. van der Schee, Heavy ion collisions: The big picture and the big questions, Annual Review of Nuclear and Particle Science 68, 339 (2018).
 Philipsen (2013) O. Philipsen, The QCD equation of state from the lattice, Progress in Particle and Nuclear Physics 70, 55 (2013).
 De Forcrand et al. (2001) P. de Forcrand, M. D’Elia, and M. Pepe, ’t Hooft loop in SU(2) Yang-Mills theory, Physical Review Letters 86, 1438 (2001).
 Jarzynski (1997a) C. Jarzynski, Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach, Physical Review E 56, 5018 (1997a).
 Jarzynski (1997b) C. Jarzynski, Nonequilibrium equality for free energy differences, Physical Review Letters 78, 2690 (1997b).
 Caselle et al. (2018) M. Caselle, A. Nada, and M. Panero, QCD thermodynamics from lattice calculations with nonequilibrium methods: The SU(3) equation of state, Physical Review D 98, 054513 (2018).
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep learning (MIT Press, 2016).
 Albergo et al. (2019) M. S. Albergo, G. Kanwar, and P. E. Shanahan, Flow-based generative models for Markov chain Monte Carlo in lattice field theory, Phys. Rev. D 100, 034515 (2019).
 Kanwar et al. (2020) G. Kanwar, M. S. Albergo, D. Boyda, K. Cranmer, D. C. Hackett, S. Racanière, D. J. Rezende, and P. E. Shanahan, Equivariant flow-based sampling for lattice gauge theory, arXiv preprint arXiv:2003.06413 (2020).
 Urban and Pawlowski (2018) J. M. Urban and J. M. Pawlowski, Reducing autocorrelation times in lattice simulations with generative adversarial networks, arXiv preprint arXiv:1811.03533 (2018).
 Wirnsberger et al. (2020) P. Wirnsberger, A. J. Ballard, G. Papamakarios, S. Abercrombie, S. Racanière, A. Pritzel, D. J. Rezende, and C. Blundell, Targeted free energy estimation via learned mappings, arXiv preprint arXiv:2002.04913 (2020).
 Zwanzig (1955) R. W. Zwanzig, High-temperature equation of state by a perturbation method. II. Polar gases, The Journal of Chemical Physics 23, 1915 (1955).
 Akiyama et al. (2020) S. Akiyama, D. Kadoh, Y. Kuramashi, T. Yamashita, and Y. Yoshimura, Tensor renormalization group approach to four-dimensional complex theory at finite density, arXiv preprint arXiv:2005.04645 (2020).
 Asakawa et al. (2014) M. Asakawa, T. Hatsuda, E. Itou, M. Kitazawa, H. Suzuki, et al. (FlowQCD Collaboration), Thermodynamics of SU(3) gauge theory from gradient flow on the lattice, Physical Review D 90, 011501 (2014).
 Giusti and Pepe (2014) L. Giusti and M. Pepe, Equation of state of a relativistic theory from a moving frame, Physical Review Letters 113, 031601 (2014).
 Papamakarios et al. (2019) G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, Normalizing flows for probabilistic modeling and inference, arXiv preprint arXiv:1912.02762 (2019).
 Dinh et al. (2014) L. Dinh, D. Krueger, and Y. Bengio, NICE: Non-linear independent components estimation, arXiv preprint arXiv:1410.8516 (2014).
 MacKay (2003) D. J. C. MacKay, Information theory, inference and learning algorithms (Cambridge University Press, 2003).
 (19) More precisely, the two distributions have to be identical only almost everywhere, i.e., up to a set of measure zero.
 Kingma and Welling (2013) D. P. Kingma and M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
 Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint arXiv:1401.4082 (2014).
 Rezende and Mohamed (2015) D. J. Rezende and S. Mohamed, Variational inference with normalizing flows, arXiv preprint arXiv:1505.05770 (2015).
 Müller et al. (2019) T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák, Neural importance sampling, ACM Transactions on Graphics (TOG) 38, 1 (2019).
 Noé et al. (2019) F. Noé, S. Olsson, J. Köhler, and H. Wu, Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning, Science 365 (2019).
 Wu et al. (2019) D. Wu, L. Wang, and P. Zhang, Solving statistical mechanics using variational autoregressive networks, Physical Review Letters 122, 080602 (2019).

 Nicoli et al. (2020) K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K.-R. Müller, and P. Kessel, Asymptotically unbiased estimation of physical observables with neural samplers, Physical Review E 101, 023304 (2020).
 Nicoli et al. (2019) K. Nicoli, P. Kessel, N. Strodthoff, W. Samek, K.-R. Müller, and S. Nakajima, Comment on "Solving statistical mechanics using VANs": Introducing saVaNt - VANs enhanced by importance and MCMC sampling, arXiv preprint arXiv:1903.11048 (2019).
 Adler (1981) S. L. Adler, Over-relaxation method for the Monte Carlo evaluation of the partition function for multiquadratic actions, Physical Review D 23, 2901 (1981).
 Whitmer (1984) C. Whitmer, Over-relaxation methods for Monte Carlo simulations of quadratic and multiquadratic actions, Physical Review D 29, 306 (1984).
 Callaway and Rahman (1983) D. J. Callaway and A. Rahman, Lattice gauge theory in the microcanonical ensemble, Physical Review D 28, 1506 (1983).
 Fodor and Jansen (1994) Z. Fodor and K. Jansen, Overrelaxation algorithm for coupled gauge-Higgs systems, Physics Letters B 331, 119 (1994).
 Wolff et al. (2004) U. Wolff (ALPHA Collaboration), Monte Carlo errors with less errors, Computer Physics Communications 156, 143 (2004).
 Burda et al. (2015) Y. Burda, R. Grosse, and R. Salakhutdinov, Importance weighted autoencoders, arXiv preprint arXiv:1509.00519 (2015).

 Nowozin (2018) S. Nowozin, Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference, ICLR 2018 (2018).

 Teh et al. (2007) Y. W. Teh, D. Newman, and M. Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2007) pp. 1353–1360.
 Gattringer and Lang (2009) C. Gattringer and C. Lang, Quantum chromodynamics on the lattice: An introductory presentation, Vol. 788 (Springer Science & Business Media, 2009).
 Bickel and Doksum (2015) P. J. Bickel and K. A. Doksum, Mathematical statistics: basic ideas and selected topics, volume I, Vol. 117 (CRC Press, 2015).
Appendix A Conventions for the Action
The form of the action used in the main text is
(11) 
It can be obtained by starting from the (more standard) action
(12) 
and performing the following redefinitions
(13)  
(14)  
(15) 
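Equations (11)–(15) did not survive extraction. For orientation, the standard hopping-parameter conventions for a lattice scalar theory (see, e.g., Gattringer and Lang (2009)) take the following form; the symbols and exact normalizations below are assumptions chosen to match this standard convention, not recovered from the source:

```latex
% Hopping-parameter form (assumed to correspond to Eq. (11)):
S(\phi) = \sum_x \Big[ -2\kappa \sum_{\mu=1}^{d} \phi_x\,\phi_{x+\hat\mu}
    + (1-2\lambda)\,\phi_x^2 + \lambda\,\phi_x^4 \Big]

% "More standard" form (assumed to correspond to Eq. (12)):
S(\varphi) = \sum_x \Big[ \tfrac{1}{2}\sum_{\mu=1}^{d}
    \big(\varphi_{x+\hat\mu}-\varphi_x\big)^2
    + \tfrac{m_0^2}{2}\,\varphi_x^2 + \tfrac{g_0}{4!}\,\varphi_x^4 \Big]

% Redefinitions relating the two (cf. Eqs. (13)-(15)):
\varphi_x = \sqrt{2\kappa}\,\phi_x, \qquad
m_0^2 = \frac{1-2\lambda}{\kappa} - 2d, \qquad
g_0 = \frac{6\lambda}{\kappa^2}
```

Substituting the field redefinition into the standard form and collecting the quadratic and quartic terms reproduces the hopping-parameter form up to an irrelevant additive constant.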
Appendix B A brief overview of Deep Learning
Neural Networks
Neural networks are a class of machine learning models that have proven to be particularly powerful. A neural network is built from layers, which are defined by
where and are the input and output of the layer. The output of the layer is often also called its activation. The weights and the bias are the learnable parameters of the neural network. The nonlinearity is a nonlinear function which is applied elementwise to the components of .
Widely used activation functions are or . A neural network consists of such layers, i.e.
It is important to note that the weights and biases do not have to be of the same dimensionality for each layer (although we did not make this explicit in our notation). It is also important to note that we have described only the simplest type of neural network, namely a fully-connected neural network. There is a zoo of other neural network architectures, but we refrain from a more detailed discussion as it is not needed for our purposes (see Goodfellow et al. (2016) for an overview).
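As a concrete illustration of the fully-connected layers described above, here is a minimal sketch in numpy; the layer sizes, the tanh nonlinearity, and the random initialization are arbitrary choices for the example, not taken from the text:

```python
import numpy as np

def layer(x, W, b, sigma=np.tanh):
    """A single fully-connected layer: affine map followed by an
    elementwise nonlinearity."""
    return sigma(W @ x + b)

rng = np.random.default_rng(0)

# Arbitrary layer sizes 4 -> 8 -> 8 -> 2; note that the weight matrices W
# and biases b may have different shapes in each layer.
shapes = [(8, 4), (8, 8), (2, 8)]
params = [(0.1 * rng.normal(size=s), np.zeros(s[0])) for s in shapes]

def network(x, params):
    """Forward pass: the composition of all layers."""
    for W, b in params:
        x = layer(x, W, b)
    return x

y = network(rng.normal(size=4), params)
```

The final output `y` has two components here simply because the last weight matrix has two rows.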
Learning Parameters with Backpropagation
The parameters of the neural networks,
are determined by minimizing a certain loss function
by gradient descent (see (7) for the particular loss function used in this work). It is important to emphasize that the number of parameters is typically large (of order for typical modern neural networks). It is therefore clear that one cannot determine the gradient by finite differences (as we would need to calculate a finite-difference ratio for each of these parameters, which is prohibitively expensive). The basic idea for calculating the gradient is to use the fact that we know the functional form of the neural network: the gradient of the loss is given by
Each term in this expression is known, for example
For a fixed nonlinearity, we know the analytical form of the derivative . This observation leads to the following algorithm: we first perform a forward pass of the neural network, i.e. starting from the input , we calculate the activations for each layer and store them in memory. This process ends with the final activation, which is, by definition, the output of the neural network. The gradient of each layer can then be calculated directly (as we have stored the activation ). Crucially, we only need the matrix product of these Jacobians, and it is efficient to start by calculating the gradient with respect to the output layer , then the layer , and so forth. This is because the loss function has a scalar output value and therefore the matrix product of the Jacobians
is a vector with the same number of components as
. We can therefore save memory by simply overwriting the stored activation . This algorithm is called backpropagation and allows us to calculate the gradient at roughly the same cost as a forward pass of the neural network.
Appendix C Relation between and KL divergence
Theorem.
Let . The following relation between the KL divergence and the variance of holds:
where is the normalized importance weight.
Proof.
The expectation value of the normalized importance weight is
The KullbackLeibler divergence can be rewritten in terms of the normalized importance weight
We now expand the KL divergence around the expectation value of the normalized importance weight
We now relate this expression to the variance of . To this end, we first observe that , where is the unnormalized importance weight. We then rewrite the expectation value of as
where the last step uses the expansion for the KL divergence derived above. It then follows that its variance is given by
Expanding the logarithm around again, we obtain
Combining this expression with the expansion derived for the KL divergence, we obtain the claim of the theorem. ∎
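Since the explicit equations did not survive extraction, the second-order relation that the proof establishes can be summarized as follows; here $w$ denotes the unnormalized importance weight and $\tilde{w} = w/\mathbb{E}_q[w]$ the normalized one (notation assumed, chosen to match the surrounding text):

```latex
\mathbb{E}_q[\tilde{w}] = 1, \qquad
D_{\mathrm{KL}}(q\,\|\,p) \;=\; \mathbb{E}_q\!\left[\ln\frac{q}{p}\right]
  \;=\; -\,\mathbb{E}_q\big[\ln \tilde{w}\big].

% Expanding -\ln\tilde{w} around \tilde{w} = 1:
-\ln\tilde{w} \;=\; -(\tilde{w}-1) + \tfrac{1}{2}(\tilde{w}-1)^2
  + \mathcal{O}\big((\tilde{w}-1)^3\big),

% the linear term averages to zero, leaving
D_{\mathrm{KL}}(q\,\|\,p) \;\approx\; \tfrac{1}{2}\,\mathrm{Var}_q\big(\tilde{w}\big).
```

The linear term vanishes precisely because the normalized importance weight has unit expectation value under $q$.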
Appendix D Analytic Solution for Partition Function
The action of the scalar field theory is given by
We want to calculate the partition function
which for vanishing hopping parameter decouples into independent integrals over each site of the lattice :
The partition function can then be calculated analytically using the integral
where is the modified Bessel function of the second kind. Using this formula, we obtain the following analytic form of the free energy
where we have defined
This result corresponds to the zeroth order of the hopping expansion Gattringer and Lang (2009) of the partition function, and one may, in principle, also calculate higher-order corrections. However, they are not needed for our purposes.
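Because the explicit integral and the resulting expression were lost in extraction, the single-site calculation can be checked numerically. The sketch below assumes the single-site Boltzmann weight exp(-(1 - 2λ)φ² - λφ⁴), consistent with the hopping-parameter conventions of Appendix A (an assumption, not recovered from the source), and compares direct quadrature against the closed form in terms of the modified Bessel function of the second kind:

```python
import numpy as np
from scipy import integrate, special

def z_site_numeric(lam):
    """Single-site partition function at vanishing hopping parameter,
    assuming the weight exp(-(1 - 2*lam)*phi**2 - lam*phi**4)."""
    a = 1.0 - 2.0 * lam
    val, _ = integrate.quad(lambda p: np.exp(-a * p**2 - lam * p**4),
                            -np.inf, np.inf)
    return val

def z_site_bessel(lam):
    """Closed form: the integral of exp(-a x^2 - b x^4) over the real line
    equals (1/2) sqrt(a/b) exp(a^2/(8b)) K_{1/4}(a^2/(8b)),
    with K the modified Bessel function of the second kind."""
    a, b = 1.0 - 2.0 * lam, lam
    u = a**2 / (8.0 * b)
    # kve(v, u) = exp(u) * kv(v, u) avoids overflow of the explicit exponential.
    return 0.5 * np.sqrt(a / b) * special.kve(0.25, u)

lam = 0.02  # illustrative value; the couplings actually used appear in the main text
z_num = z_site_numeric(lam)
z_bessel = z_site_bessel(lam)
print(z_num, z_bessel)  # the two should agree
```

The free energy at zero hopping then follows as minus the log of this single-site result times the lattice volume.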
Appendix E Error Analysis for Free Energy Estimator
Our discussion is based on Nowozin (2018), which discussed the same results in the context of variational inference. We provide a review here since these results may be hard to extract for physicists not familiar with variational inference. We also point out a subtlety that was not discussed in that work.
The estimator for the free energy is given by
(16) 
Theorems for the variance and bias of this estimator are discussed in the following. For this, we use the delta method of moments which is summarized in the following theorem.
Theorem.
Let be the sample mean of independent and identically distributed random variables with for . Let be a realvalued function with uniformly bounded derivatives. It then holds that
where
with and .
We refer to Chapter 5.3 of Bickel and Doksum (2015) for a proof and more details.
The application of the delta method to the free energy estimator is, in practice, subject to a subtlety regarding the bounded differentiability of the function . We will ignore this subtlety in the following and return to it at the end of the section.
Theorem.
The bias of is given by
assuming that for .
Proof.
The bias of is given by
Using the delta method for moments, we derive that
(17) 
where we have used that has second derivative . The proof then concludes by observing that
∎
Theorem.
The variance of is given by
assuming that for .
Proof.
The variance can be written as
We now evaluate both terms on the right-hand side individually using the delta method. For the first term, we use the delta method with , which has second derivative
Using this expression, we then obtain that is equal to
For the squared expectation value, we use the expansion (17) derived in the proof for the bias. This gives that is equal to
Subtracting these two expressions, it then follows that
and the proof concludes by observing that . ∎
A few remarks are in order: from the theorems, it follows that the standard deviation of the estimator
is of order . In the large limit, we can therefore neglect the bias correction, as it is of order . Furthermore, we can replace the expectation values in the theorems by sample means up to (negligible) higher-order corrections. In practice, we therefore use these results to estimate the variance and bias of . Alternatively, one can use a standard jackknife analysis to estimate the variance and bias (see for example Gattringer and Lang (2009)). In our experiments, we use both methods to estimate the errors and check that they lead to consistent results. Lastly, we remark that error estimators for general observables involving the partition function can be derived, see Nicoli et al. (2020).
As mentioned above, the delta method requires that the derivatives of the function are (uniformly) bounded. For a generic LQFT, this will not be the case for , since its derivatives diverge for . To the best of our knowledge, the same problem generically arises in the context of variational inference but seems not to have been discussed in the literature.
To address this subtlety, one could require that the action of the lattice quantum field theory is bounded. For example, this can be ensured by putting the field theory in a box potential. Since only very high energy configurations are affected by this (for suitably large choice of the box potential) and since these configurations are extremely unlikely to be sampled, this modification will have no practical effect on the numerical experiments. After this modification, is bounded from below and is also bounded as a result.
More rigorously, the result for the variance can be derived without assumptions on a bound for the derivatives by using the delta method for approximation in law, which takes the following form:
Theorem.
Let be the sample mean of independent and identically distributed random variables with for . Let be a function differentiable at . Then
where .
For a proof, we again refer to Bickel and Doksum (2015), see Theorem 5.3.3. Applying this theorem to the free energy estimator , we obtain the same expression for its variance as derived above. However, the theorem does not require any bound on the derivatives of .
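The elided formulas can be illustrated by simulation. To leading order, the delta method gives for the estimator F̂ = -ln(w̄_N) a bias of Var(w)/(2Nμ²) and a variance of Var(w)/(Nμ²), where μ = E[w]; this specific form is our reading of the standard result (cf. Nowozin (2018)), and the lognormal "importance weights" below are an arbitrary stand-in for a real sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "importance weights": lognormal with known mean and variance.
sigma_ln = 0.5
mu = np.exp(sigma_ln**2 / 2)               # E[w]
var_w = (np.exp(sigma_ln**2) - 1) * mu**2  # Var(w)

def bias_and_variance(N, reps=100_000):
    """Empirical bias and variance of F_hat = -log(sample mean of N weights),
    estimated over many independent replications."""
    w = rng.lognormal(mean=0.0, sigma=sigma_ln, size=(reps, N))
    F_hat = -np.log(w.mean(axis=1))
    return F_hat.mean() - (-np.log(mu)), F_hat.var()

results = {N: bias_and_variance(N) for N in (10, 50, 100)}
for N, (bias, variance) in results.items():
    # Leading-order delta-method predictions (assumed form, see lead-in):
    print(N, bias, var_w / (2 * N * mu**2), variance, var_w / (N * mu**2))
```

The simulated bias shrinks like 1/N while the standard deviation shrinks like 1/sqrt(N), matching the remark above that the bias correction becomes negligible at large sample size.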
Appendix F Step Size Analysis
As explained in the main text, the free energy difference is calculated in steps
In this appendix, we analyze the dependence of our results on the chosen steps.
We start from an initial step size corresponding to a change in hopping parameter of . Between and , however, we take a finer step size of . Since we are interested in the free energy difference between and , this corresponds to running Markov chains. We focus on the lattice and use an overrelaxed HMC algorithm to sample configurations for each chain. The overrelaxation is performed every steps.
We then repeatedly refine the step size in a certain subregion around the critical value. The details can be found in Table 1.
The results of this analysis are shown in Figure 5. We observe that the error of the estimator does not decrease significantly. We note that the error of the flow decreases during refinement because its free energy estimation uses the same number of samples as all Markov chains combined (and this number grows with the additional refinement steps).
In order to ensure that the distributions and in
have sufficient overlap, we also estimate by exchanging with in the relation above. We then check that this leads to compatible results, see Figure 6. We note that this consistency check is relatively cheap as it requires running one additional Markov chain.
We also study the dependence of the integrated autocorrelation of the free energy on this refinement procedure, see Figure 4.
chains   samples   refined region
14       5.6 M     0.20–0.30
24       9.6 M     0.20–0.30
40       16 M      0.22–0.30
76       30.4 M    0.24–0.30
88       35.2 M    0.267–0.279
Appendix G Details on Numerical Experiments
HMC:
We use a HMC algorithm with overrelaxation. Each Markov chain has 5k thermalization steps followed by 400k estimation steps. The sign of the field configuration is flipped every ten steps.
Training of flow:
For every lattice, we use a normalizing flow with six coupling layers. Each coupling layer (5) has a neural network with five fully-connected layers with no bias and Tanh nonlinearities. The hidden layers of
consist of 1000 neurons each. We train the flow for 1M steps using an 8k minibatch size. We use the ReduceLROnPlateau learning rate scheduler of PyTorch with an initial learning rate of
and a patience of 3k steps. The minimum learning rate was set to .
Estimation:
As described in the main text, for HMC-based estimation we use a step size of for and a step size of for all other values of the hopping parameter. As a result, 14 Markov chains are run. In total, the HMC-based method therefore uses configurations. We use the same number of samples for the flow-based estimation. For efficiency, we sample these configurations in minibatches of 3k samples.
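The flow used here is built from affine coupling layers of the NICE/RealNVP type (Dinh et al. (2014)). A minimal numpy sketch of one such layer follows; the scale and shift functions are fixed toy functions standing in for the trained fully-connected networks described under "Training of flow", and the mask and sizes are illustrative:

```python
import numpy as np

def coupling_forward(z, mask):
    """Affine coupling layer: components selected by `mask` pass through
    unchanged and condition a scale s and shift t applied to the remaining
    components. s and t are toy functions here, not trained networks."""
    frozen = z * mask
    s = 0.5 * np.tanh(frozen.sum(axis=-1, keepdims=True))  # toy scale "network"
    t = 0.1 * np.roll(frozen, 1, axis=-1)                  # toy shift "network"
    x = frozen + (1 - mask) * (z * np.exp(s) + t)
    log_det = ((1 - mask) * s).sum(axis=-1)  # Jacobian is triangular
    return x, log_det

def coupling_inverse(x, mask):
    """Exact inverse: the frozen components of x equal those of z, so the
    same s and t can be recomputed and undone."""
    frozen = x * mask
    s = 0.5 * np.tanh(frozen.sum(axis=-1, keepdims=True))
    t = 0.1 * np.roll(frozen, 1, axis=-1)
    return frozen + (1 - mask) * (x - t) * np.exp(-s)

rng = np.random.default_rng(3)
mask = np.array([1.0, 0.0, 1.0, 0.0])  # checkerboard-style mask
z = rng.normal(size=(5, 4))            # a batch of 5 toy "configurations"
x, log_det = coupling_forward(z, mask)
z_back = coupling_inverse(x, mask)
```

The exact invertibility and the cheap log-determinant are what make the sample-wise importance weights of the flow-based estimation tractable.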
Error estimation:
We use both the uwerr method Wolff et al. (2004) and the jackknife method to estimate uncertainties for HMC. In order to deal with autocorrelation in the jackknife analysis, we perform binning with a 1k bin size. Error estimation for the flow is performed by the delta method and also by the jackknife, see Figure 7.
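The binned jackknife mentioned above can be sketched as follows; the AR(1) test chain and the bin size are illustrative stand-ins, not the paper's HMC data:

```python
import numpy as np

def binned_jackknife(samples, bin_size):
    """Jackknife error of the mean with pre-binning: averaging into bins of
    length bin_size suppresses autocorrelation before the leave-one-out step."""
    n_bins = len(samples) // bin_size
    bins = samples[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    # Leave-one-bin-out replicas of the mean.
    replicas = (bins.sum() - bins) / (n_bins - 1)
    mean = bins.mean()
    err = np.sqrt((n_bins - 1) / n_bins * np.sum((replicas - mean) ** 2))
    return mean, err

# Illustrative correlated chain: an AR(1) process with known autocorrelation.
rng = np.random.default_rng(2)
noise = rng.normal(size=400_000)
x = np.empty_like(noise)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + noise[t]

mean, err = binned_jackknife(x, bin_size=1000)
naive_err = x.std() / np.sqrt(len(x))  # ignores autocorrelation, too small
print(mean, err, naive_err)
```

For this chain the binned jackknife error is several times larger than the naive one, which is exactly the autocorrelation effect the binning is meant to capture.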
From the free energy estimates, one can then derive other thermodynamic observables such as the entropy; we refer to the main text for a discussion. Figure 8 shows the estimated entropy. Errors were estimated using both the jackknife and uwerr methods; both lead to consistent results.