Comment on "Solving Statistical Mechanics Using VANs": Introducing saVANt - VANs Enhanced by Importance and MCMC Sampling

In this comment on "Solving Statistical Mechanics Using Variational Autoregressive Networks" by Wu et al., we propose a subtle yet powerful modification of their approach. We show that the inherent sampling error of their method can be corrected by neural-network-based MCMC or importance sampling, which leads to asymptotically unbiased estimators for physical quantities. This modification is possible due to a distinctive property of VANs, namely that they provide the exact sample probability. With these modifications, we believe that their method could have a substantially greater impact on various important fields of physics, including strongly interacting field theories and statistical physics.






Appendix A Additional Details on the Algorithm

A.1 Lightning Review of VAN

Wu et al. approximate Boltzmann distributions with an autoregressive generative model $q_\theta$ by minimizing the KL divergence

$$D_{\mathrm{KL}}(q_\theta \,\|\, p) = \sum_s q_\theta(s) \left[ \ln q_\theta(s) + \beta H(s) \right] + \ln Z,$$

where $p(s) = e^{-\beta H(s)}/Z$ denotes the Boltzmann distribution. A PixelCNN is used, which allows exact evaluation of the probability $q_\theta(s)$ and relatively efficient sampling. Note also that the partition function $Z$ is a constant, so the last summand $\ln Z$ does not contribute to the gradient. As a result, VAN can be trained by sampling from the model $q_\theta$, evaluating the probability and the Hamiltonian for these samples, and applying gradient descent.
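The resulting batch loss can be sketched as follows (a minimal NumPy illustration, not the reference PixelCNN implementation; `log_q` and `energies` stand for the log-probabilities and Hamiltonian values supplied by the model, and $\ln Z$ is omitted since it does not affect the gradient):

```python
import numpy as np

def free_energy_estimate(log_q, energies, beta):
    """Sample estimate of the variational objective
    F_q = E_q[ln q(s) + beta * H(s)]; the constant ln Z is
    dropped because it contributes nothing to the gradient."""
    log_q = np.asarray(log_q, dtype=float)
    energies = np.asarray(energies, dtype=float)
    return float(np.mean(log_q + beta * energies))
```

In practice this mean is differentiated through the model parameters (e.g. via a score-function estimator); the sketch only shows the quantity being minimized.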

A.2 Bounding the output probabilities of VAN

We can simply interpret the original network output $\hat{s}$ as the probability after the following mapping:

$$\hat{s} \mapsto \epsilon + (1 - 2\epsilon)\, \hat{s},$$

which takes values in $[\epsilon, 1-\epsilon]$ for outputs in $[0, 1]$.
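A minimal sketch of such a bounding map (the affine form and the default value of $\epsilon$ are assumptions consistent with the `epsilon` hyperparameter in Table 1; the function name is ours):

```python
import numpy as np

def bound_prob(p, eps=1e-7):
    """Affine map sending a raw network output p in [0, 1] into
    [eps, 1 - eps], so that every configuration is assigned a
    strictly positive probability by the sampler."""
    p = np.asarray(p, dtype=float)
    return eps + (1.0 - 2.0 * eps) * p
```

Strict positivity is exactly what the unbiasedness proofs in Section A.3 require of the sampling distribution.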

A.3 Proof: Estimators are Asymptotically Unbiased

Assume that the support of the sampling distribution $q$ contains the support of the target distribution $p$. This property is ensured by bounding the probabilities to take values in $[\epsilon, 1-\epsilon]$, as described in Section A.2.

A.3.1 Neural Importance Sampling

Then, importance sampling with respect to $q$, i.e.

$$\langle O \rangle \approx \frac{1}{N} \sum_{i=1}^{N} w(s_i)\, O(s_i), \qquad s_i \sim q, \qquad w(s) = \frac{p(s)}{q(s)},$$

is an asymptotically unbiased estimator of the expectation value $\langle O \rangle_p$ because

$$\mathbb{E}_q\!\left[ w\, O \right] = \sum_s q(s)\, \frac{p(s)}{q(s)}\, O(s) = \sum_s p(s)\, O(s) = \langle O \rangle_p,$$

where $p(s) = e^{-\beta H(s)}/Z$. The partition function can be similarly determined:

$$Z = \sum_s e^{-\beta H(s)} = \mathbb{E}_q\!\left[ \frac{e^{-\beta H(s)}}{q(s)} \right] \approx \frac{1}{N} \sum_{i=1}^{N} \hat{w}(s_i).$$

Combining the previous equations, we obtain

$$\langle O \rangle_p \approx \frac{\sum_{i=1}^{N} \hat{w}(s_i)\, O(s_i)}{\sum_{i=1}^{N} \hat{w}(s_i)} \qquad \text{with} \qquad \hat{w}(s) = \frac{e^{-\beta H(s)}}{q(s)}. \tag{5}$$
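The self-normalized estimator of Eq. (5) can be sketched numerically as follows (a minimal NumPy illustration, not the authors' implementation; the weights are handled in log space for numerical stability):

```python
import numpy as np

def nis_estimate(obs, energies, log_q, beta):
    """Self-normalized neural importance sampling, Eq. (5):
    weights w_i = exp(-beta * H(s_i)) / q(s_i). A constant shift
    of the log-weights cancels in the ratio, so we subtract the
    maximum to avoid overflow."""
    log_w = -beta * np.asarray(energies) - np.asarray(log_q)
    log_w = log_w - log_w.max()
    w = np.exp(log_w)
    return float(np.sum(w * np.asarray(obs)) / np.sum(w))
```

When $q$ is exactly proportional to the Boltzmann distribution, all weights are equal and the estimator reduces to the plain sample mean, as expected.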

A.3.2 Neural MCMC Sampling

The sampler $q$ can be used as a trial distribution for a Markov chain that uses the following acceptance probability in its Metropolis step:

$$A(s \to s') = \min\!\left(1,\ \frac{q(s)\, e^{-\beta H(s')}}{q(s')\, e^{-\beta H(s)}}\right).$$

This fulfills the detailed balance condition

$$p(s)\, W(s \to s') = p(s')\, W(s' \to s)$$

because the total transition probability is given by $W(s \to s') = q(s')\, A(s \to s')$ and therefore

$$p(s)\, q(s')\, A(s \to s') = \frac{1}{Z} \min\!\left( e^{-\beta H(s)}\, q(s'),\ e^{-\beta H(s')}\, q(s) \right) = p(s')\, q(s)\, A(s' \to s),$$

where we have used the fact that the min operator is symmetric and that all factors are strictly positive. The latter property is ensured by the bounding of the probabilities described in Section A.2, which makes $q(s)$ strictly positive.
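The acceptance step can be sketched as follows (an illustrative log-space implementation, not the authors' code; working with log-probabilities and energies directly avoids overflow of the Boltzmann factors):

```python
import math

def accept_prob(log_q_cur, log_q_prop, E_cur, E_prop, beta):
    """Metropolis acceptance probability for a proposal drawn from q:
    A(s -> s') = min(1, q(s) e^{-beta H(s')} / (q(s') e^{-beta H(s)})),
    evaluated entirely in log space."""
    log_a = (log_q_cur - beta * E_prop) - (log_q_prop - beta * E_cur)
    return math.exp(min(0.0, log_a))  # min(0, .) implements min(1, .) after exp
```

Because the proposal is drawn independently from the pre-trained sampler, the chain can move between distant regions of configuration space in a single step.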

Appendix B Additional Details on Experiments

B.1 Setup

We use a Tesla P100 GPU with 16GB of memory both for training and sampling. A two-dimensional Ising lattice is considered. The reference values are generated using the Wolff algorithm with 2M steps and 100k warm-up steps. We use the ResNet version of VAN with the following hyperparameter choices (chosen to match the ones used by Wu et al.):

name value
net depth 6
net width 3
half kernel size 3
bias true
epsilon 1e-07
Table 1: Hyperparameters for VAN used in our experiments.

The reference implementation of Wu et al. is used to train the VANs. For estimating the results of VAN and saVANt-NIS, we use 1000 iterations, sampling 500 configurations each. For saVANt-NMCMC, we use 100k steps, sampling 500 candidate configurations in a batch. No warm-up steps are required because candidates are sampled from a pre-trained VAN. As demonstrated in Table 4, all algorithms have roughly the same runtime for sampling, but saVANt leads to a significant reduction in training time, as explained in the main text.

B.2 Additional Results

B.2.1 Properties of saVANt-NMCMC

saVANt-NMCMC is less likely to get stuck in local minima and has reduced autocorrelation times compared to the Metropolis algorithm. This is demonstrated in Tables 2 and 3.

Metropolis       saVANt-NMCMC     VAN
-0.99 ± 2e-5     -4e-4 ± 3e-3     -6e-4 ± 2e-3
Table 2: Magnetization at the considered inverse temperature $\beta$. By $\mathbb{Z}_2$ symmetry, the correct result has to be close to vanishing. In contrast to the Metropolis algorithm, both saVANt and VAN can jump in configuration space between (regions close to) the two degenerate minima of the Ising model.
$\beta$    Metropolis       saVANt-NMCMC
0.44       231.1 ± 14.0     0.50 ± 0.01
0.45       542.5 ± 47.4     0.50 ± 0.01
Table 3: Integrated autocorrelation time of the magnetization for $\beta = 0.44$ and $\beta = 0.45$. Both runs use 5M steps.
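For reference, the integrated autocorrelation time reported above can be estimated as $\tau_{\mathrm{int}} = \tfrac{1}{2} + \sum_{t \geq 1} \rho(t)$, where $\rho$ is the normalized autocorrelation function; in this convention, uncorrelated samples give $\tau_{\mathrm{int}} = 0.5$, matching the saVANt-NMCMC values in Table 3. A minimal sketch (the simple positivity cutoff is an illustrative windowing choice, not necessarily the one used for Table 3):

```python
import numpy as np

def integrated_autocorr_time(x):
    """tau_int = 1/2 + sum_t rho(t); rho is the normalized
    autocorrelation of the series x. The sum is truncated at the
    first nonpositive estimate, where noise starts to dominate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / n
    tau = 0.5
    for t in range(1, n // 2):
        rho = np.dot(x[:-t], x[t:]) / ((n - t) * var)
        if rho <= 0.0:  # crude cutoff; more refined windows exist
            break
        tau += rho
    return float(tau)
```

For strongly correlated chains such as a Metropolis run near criticality, the sum over $\rho(t)$ dominates and $\tau_{\mathrm{int}}$ grows far beyond 0.5, as in the Metropolis column of Table 3.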

B.2.2 Using Models Trained at Different Temperatures

As already shown in the main text, Fig. 2 and Fig. 3 demonstrate that the bias of VAN persists at different values of $\beta$; in particular, we looked at two further values of $\beta$. As for the main study shown in the manuscript, we trained a VAN with the aforementioned setup at the reference $\beta$ and subsequently used this model to predict energies at different $\beta$. We note that VAN does not reproduce the reference values for almost all values of $\beta$. This discrepancy is particularly pronounced as one approaches the critical temperature. Fig. 4 shows the aforementioned transfer property from a different perspective.

1h 7m     1h 7m     1h 8m
Table 4: Runtime for sampling 500k configurations on a Tesla P100 GPU with 16GB of memory. This does not include training time, which is significantly lower for saVANt than for VAN.
Figure 2: Analogous plot to the main plot in the paper, but for a different value of $\beta$.
Figure 3: Analogous plot to the main plot in the paper, but for a value of $\beta$ closer to the inverse critical temperature. Note that the sampling error of VAN becomes more pronounced in this regime.
Figure 4: In this plot, we show the transfer property for two reference values of $\beta$. Unlike in the previous plots, the horizontal axis denotes the value of $\beta$ used for training. We trained samplers at many different values of $\beta$ and then estimated energies at a reference value ($\beta = 0.48$ on the left and $\beta = 0.57$ on the right) using both saVANt-NMCMC and saVANt-NIS. The orange line shows the estimate of the energy at the reference $\beta$ as determined by the Wolff algorithm.