Appendix A Additional Details on Algorithm
A.1 Lightning Review of VAN
Wu et al. approximate Boltzmann distributions $p(s) = e^{-\beta H(s)}/Z$ with an autoregressive generative model $q_\theta$ by minimizing the KL divergence

$$D_{\mathrm{KL}}(q_\theta \,\|\, p) = \sum_s q_\theta(s)\left[\ln q_\theta(s) + \beta H(s)\right] + \ln Z \,.$$

A PixelCNN is used, which allows exact evaluation of the probability and relatively efficient sampling. Also note that the partition function is a constant, so the last summand contributes nothing to the gradient. As a result, VAN can be trained by sampling from the model $q_\theta$, evaluating the probability and the Hamiltonian for these samples, and applying gradient descent.
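Since the gradient of the KL divergence can be written as an expectation over model samples, training reduces to a score-function (REINFORCE-style) update. The following is a minimal, self-contained sketch of this loop on a toy two-spin system, with a product-Bernoulli model standing in for the PixelCNN; the toy Hamiltonian and all names are illustrative, not taken from the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the autoregressive network: two independent Bernoulli
# spins with logits theta (illustrative only; VAN uses a PixelCNN).
theta = np.zeros(2)
beta, lr = 1.0, 0.05

def H(s):
    # Non-interacting spins in an external field h = 0.5, chosen so the
    # exact Boltzmann marginals are representable by the product model.
    return -0.5 * s.sum(axis=1)

def probs():
    return 1.0 / (1.0 + np.exp(-theta))  # P(s_i = +1)

def sample(n):
    return np.where(rng.random((n, 2)) < probs(), 1.0, -1.0)

def log_q(s):
    p = probs()
    return np.where(s > 0, np.log(p), np.log(1.0 - p)).sum(axis=1)

for _ in range(3000):
    s = sample(256)
    # Score-function gradient of KL(q || p): E_q[(ln q + beta*H) grad ln q],
    # using the batch mean of f as a variance-reducing baseline.
    f = log_q(s) + beta * H(s)
    grad_log = (s > 0).astype(float) - probs()  # d ln q / d theta_i
    grad = ((f - f.mean())[:, None] * grad_log).mean(axis=0)
    theta -= lr * grad

# For h = 0.5 and beta = 1 the optimal logit per spin is beta * 2h = 1.0.
```

Because the target here factorizes, the learned logits converge to the exact value; for interacting Hamiltonians the autoregressive conditioning of the PixelCNN is what supplies the missing expressiveness.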
A.2 Bounding the output probabilities of VAN
We can simply interpret the original network output $\hat{s}(x) \in [0, 1]$ as the probability after the following mapping:

$$\hat{s}_\epsilon(x) = \epsilon + (1 - 2\epsilon)\,\hat{s}(x) \,,$$

which guarantees $\hat{s}_\epsilon(x) \in [\epsilon, 1-\epsilon]$ for a small $\epsilon > 0$.
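One way to realize such a bound is an affine rescaling of each raw conditional probability. The sketch below is an illustration under that assumption; the function name `bounded_prob` and the parameter `eps` are ours, not from the reference implementation.

```python
import numpy as np

def bounded_prob(raw_prob, eps=1e-6):
    """Map a raw probability in [0, 1] into [eps, 1 - eps].

    The affine map keeps complementary outcomes summing to 1:
    if p maps to eps + (1 - 2*eps) * p, then (1 - p) maps to
    1 - (eps + (1 - 2*eps) * p).
    """
    raw_prob = np.asarray(raw_prob, dtype=float)
    return eps + (1.0 - 2.0 * eps) * raw_prob
```

With this mapping every conditional probability of the autoregressive factorization stays strictly inside $(0, 1)$, so importance weights and Metropolis ratios remain finite.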
A.3 Proof: Estimators are Asymptotically Unbiased
Assume that the support of the sampling distribution $q$ contains the support of the target distribution $p$. This property is ensured by constraining the output probabilities to take values in $[\epsilon, 1-\epsilon]$ with $\epsilon > 0$.
A.3.1 Neural Importance Sampling
Then, importance sampling with respect to $q$, i.e.

$$\hat{O}_N = \frac{1}{N} \sum_{i=1}^{N} \hat{w}(s_i)\, O(s_i) \,, \qquad s_i \sim q \,,$$

is an asymptotically unbiased estimator of the expectation value $Z \langle O \rangle_p$ because

$$\lim_{N\to\infty} \hat{O}_N = \sum_s q(s)\, \hat{w}(s)\, O(s) = \sum_s e^{-\beta H(s)}\, O(s) = Z \langle O \rangle_p \,,$$

where $\hat{w}(s) = e^{-\beta H(s)} / q(s)$. The partition function can be similarly determined:

$$\hat{Z}_N = \frac{1}{N} \sum_{i=1}^{N} \hat{w}(s_i) \;\xrightarrow{N\to\infty}\; \sum_s e^{-\beta H(s)} = Z \,.$$

Combining the previous equations, we obtain

$$\langle O \rangle_p = \lim_{N\to\infty} \frac{\hat{O}_N}{\hat{Z}_N} = \lim_{N\to\infty} \frac{\sum_{i} \hat{w}(s_i)\, O(s_i)}{\sum_{i} \hat{w}(s_i)} \,.$$
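To make these estimators concrete, the following self-contained sketch applies neural importance sampling to a toy three-spin chain, where $Z$ and the mean energy can also be computed exactly by enumeration. The sampler $q$ is a deliberately imperfect product distribution standing in for a trained VAN with full support; the toy Hamiltonian and all names are ours.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
beta = 0.7

def H(s):
    # Toy three-spin open chain (illustrative, not the paper's lattice).
    return -(s[..., 0] * s[..., 1] + s[..., 1] * s[..., 2])

# Exact reference values by enumerating all 2^3 configurations.
configs = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
boltz = np.exp(-beta * H(configs))
Z = boltz.sum()
exact_energy = (boltz * H(configs)).sum() / Z

# Sampler q: independent spins with P(s_i = +1) = 0.6, an imperfect
# stand-in for a trained VAN whose support covers all configurations.
p_up = 0.6

def sample_q(n):
    return np.where(rng.random((n, 3)) < p_up, 1.0, -1.0)

def q_prob(s):
    return np.where(s > 0, p_up, 1.0 - p_up).prod(axis=-1)

# Importance weights w = exp(-beta * H) / q; their mean estimates Z,
# and the self-normalized ratio estimates the expectation value.
s = sample_q(500_000)
w = np.exp(-beta * H(s)) / q_prob(s)
Z_hat = w.mean()
E_hat = (w * H(s)).sum() / w.sum()
```

Both estimates converge to the enumerated reference values as the number of samples grows, illustrating the asymptotic unbiasedness argued above.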
A.3.2 Neural MCMC Sampling
The sampler $q$ can be used as a trial distribution

$$q(s \to s') = q(s')$$

for a Markov chain which uses the following acceptance probability in its Metropolis step:

$$a(s \to s') = \min\!\left(1,\; \frac{q(s)\, e^{-\beta H(s')}}{q(s')\, e^{-\beta H(s)}}\right) .$$

This fulfills the detailed balance condition

$$p(s)\, w(s \to s') = p(s')\, w(s' \to s) \,,$$

because the total transition probability is given by $w(s \to s') = q(s')\, a(s \to s')$ and therefore

$$e^{-\beta H(s)}\, q(s')\, \min\!\left(1, \frac{q(s)\, e^{-\beta H(s')}}{q(s')\, e^{-\beta H(s)}}\right) = \min\!\left(e^{-\beta H(s)}\, q(s'),\; e^{-\beta H(s')}\, q(s)\right) = e^{-\beta H(s')}\, q(s)\, \min\!\left(1, \frac{q(s')\, e^{-\beta H(s)}}{q(s)\, e^{-\beta H(s')}}\right) ,$$

where we have used the fact that the min operator is symmetric and that all factors are strictly positive. The latter property is ensured by the fact that $q(s) > 0$, which the bounding of the output probabilities guarantees.
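The acceptance step above can be sketched as an independence-Metropolis chain on a toy three-spin chain, with an imperfect product distribution again standing in for the pre-trained network; the toy Hamiltonian and all names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
beta = 0.7

def H(s):
    # Toy three-spin open chain (illustrative, not the paper's lattice).
    return -(s[..., 0] * s[..., 1] + s[..., 1] * s[..., 2])

# Exact reference energy by enumerating all 2^3 configurations.
configs = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
boltz = np.exp(-beta * H(configs))
exact_energy = (boltz * H(configs)).sum() / boltz.sum()

# Trial distribution q: independent spins with P(s_i = +1) = 0.6,
# a stand-in for a pre-trained VAN with full support.
p_up = 0.6

def sample_q(n):
    return np.where(rng.random((n, 3)) < p_up, 1.0, -1.0)

def log_q(s):
    return np.log(np.where(s > 0, p_up, 1.0 - p_up)).sum(axis=-1)

# Metropolis step with independence proposals: candidates come from q,
# and a = min(1, q(s) p_hat(s') / (q(s') p_hat(s))) only needs the
# unnormalized target p_hat = exp(-beta * H).
n_steps = 100_000
cand, u = sample_q(n_steps), rng.random(n_steps)
cur = sample_q(1)[0]
energies = np.empty(n_steps)
for t in range(n_steps):
    new = cand[t]
    log_a = (log_q(cur) - beta * H(new)) - (log_q(new) - beta * H(cur))
    if np.log(u[t]) < log_a:  # accept with probability min(1, exp(log_a))
        cur = new
    energies[t] = H(cur)
```

Because candidates are drawn independently of the current state, the chain needs no local-update warm-up beyond discarding a short initial segment.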
Appendix B Additional Details on Experiments
We use a Tesla P100 GPU with 16 GB of memory for both training and sampling. A lattice is considered. The reference values are generated with the Wolff algorithm, using 2M steps and 100k warm-up steps. We use the ResNet version of VAN with the following hyperparameter choices (chosen to match the ones used by Wu et al.):
| hyperparameter | value |
| --- | --- |
| half kernel size | 3 |
The reference implementation of Wu et al. is used to train the VANs. For estimating the results of VAN and saVANt-NIS, we use 1000 iterations, sampling 500 configurations each. For saVANt-NMCMC, we use 100k steps, sampling 500 candidate configurations per batch. No warm-up steps are required because candidates are sampled from a pre-trained VAN. As demonstrated in Table 4, all algorithms have roughly the same sampling runtime, but saVANt leads to a significant reduction in training time, as explained in the main text.
B.2 Additional Results
B.2.1 Properties of saVANt-NMCMC
| -0.99 ± 2e-5 | -4e-4 ± 3e-3 | -6e-4 ± 2e-3 |

| 0.44 | 231.1 ± 14.0 | 0.50 ± 0.01 |
| 0.45 | 542.5 ± 47.4 | 0.50 ± 0.01 |
B.2.2 Using Models Trained at Different Temperatures
As already shown in the main text, Fig. 2 and Fig. 3 demonstrate that the bias of VAN persists for different values of $\beta$; in particular, we looked at two additional values of $\beta$. As in the main study presented in the manuscript, here we trained a VAN with the aforementioned setup at a reference $\beta$, and we subsequently used this model to predict energies at different values of $\beta$. We note that VAN does not reproduce the reference values for almost all values of $\beta$. The discrepancy is particularly pronounced as one approaches the critical temperature. Fig. 4 shows this transfer property from a different perspective.
| VAN | saVANt-NIS | saVANt-NMCMC |
| --- | --- | --- |
| 1h 7m | 1h 7m | 1h 8m |