## Appendix A Additional Details on Algorithm

### A.1 Lightning Review of VAN

Wu et al. approximate Boltzmann distributions $p(s) = e^{-\beta H(s)}/Z$ with an auto-regressive generative model $q_\theta$ by minimizing the KL divergence

$$D_{\mathrm{KL}}(q_\theta \,\|\, p) = \sum_s q_\theta(s) \left[ \ln q_\theta(s) + \beta H(s) \right] + \ln Z \,. \tag{1}$$

A PixelCNN is used, which allows exact evaluation of the probability $q_\theta(s)$ and relatively efficient sampling. Also note that the partition function $Z$ is a constant and therefore the last summand contributes nothing to the gradient. As a result, VAN can be trained by sampling from the model $q_\theta$, evaluating the probability $q_\theta(s)$ and the Hamiltonian $H(s)$ for these samples, and using gradient descent.
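As a toy illustration of this training loop, the quantity being minimized, $F_q = \mathbb{E}_{s \sim q_\theta}[\ln q_\theta(s) + \beta H(s)]$ (which equals Eq. (1) up to the constant $\ln Z$), can be estimated from model samples. The fully factorized Bernoulli "model" and the 1D Ising chain below are illustrative stand-ins for the PixelCNN and the actual Hamiltonian, not the authors' implementation:

```python
import numpy as np

def ising_energy_1d(s):
    """Energy of a 1D Ising chain with periodic boundary: H = -sum_i s_i s_{i+1}."""
    return -np.sum(s * np.roll(s, -1, axis=-1), axis=-1)

def sample_and_logprob(p, n_samples, rng):
    """Sample +/-1 spins from a factorized Bernoulli model; p[i] = P(s_i = +1).

    A stand-in for an autoregressive model: it supports sampling and exact
    log-probability evaluation, which is all that VAN training requires.
    """
    u = rng.random((n_samples, p.shape[0]))
    s = np.where(u < p, 1.0, -1.0)
    log_q = np.sum(np.where(s > 0, np.log(p), np.log1p(-p)), axis=-1)
    return s, log_q

def free_energy_estimate(p, beta, n_samples, rng):
    """Monte Carlo estimate of E_q[ln q(s) + beta * H(s)], the VAN training loss."""
    s, log_q = sample_and_logprob(p, n_samples, rng)
    return np.mean(log_q + beta * ising_energy_1d(s))

rng = np.random.default_rng(0)
p = np.full(8, 0.5)                     # uniform model over 8 spins
f = free_energy_estimate(p, beta=0.0, n_samples=5000, rng=rng)
# at beta = 0 the loss reduces to E_q[ln q] = -8 ln 2 for the uniform model
```

In practice the gradient of this expectation is taken with a score-function (REINFORCE) estimator, since the samples themselves are discrete.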

### A.2 Bounding the output probabilities of VAN

We can simply interpret the original network output $\hat{s}_i \in [0, 1]$ as the probability by the following mapping:

$$\hat{s}_i \mapsto \epsilon + (1 - 2\epsilon)\, \hat{s}_i \,,$$

where $\epsilon > 0$ is a small constant, so that every conditional probability takes values in $[\epsilon, 1 - \epsilon]$.
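In code, this bounding amounts to a simple affine clamp of the network's output (function and variable names here are illustrative, not from the reference implementation):

```python
import numpy as np

def bound_probabilities(s_hat, epsilon=1e-7):
    """Map raw network outputs s_hat in [0, 1] to probabilities in [eps, 1-eps].

    This guarantees every conditional probability is strictly positive, so the
    model's support covers the whole configuration space.
    """
    return epsilon + (1.0 - 2.0 * epsilon) * np.asarray(s_hat)

probs = bound_probabilities(np.array([0.0, 0.5, 1.0]))
# 0.0 -> 1e-7, 0.5 -> 0.5, 1.0 -> 1 - 1e-7
```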

### A.3 Proof: Estimators are Asymptotically Unbiased

Assume that the support of the sampling distribution $q_\theta$ contains the support of the target distribution $p$. This property is guaranteed by the fact that the conditional probabilities take values in $[\epsilon, 1-\epsilon]$ with $\epsilon > 0$, as described in the previous section.

#### A.3.1 Neural Importance Sampling

Then, importance sampling with respect to $q_\theta$, i.e.

$$\hat{O}_N = \frac{1}{N} \sum_{i=1}^{N} w(s_i)\, O(s_i)\,, \qquad s_i \sim q_\theta\,, \tag{2}$$

is an asymptotically unbiased estimator of the expectation value $\langle O \rangle_p$ because, by the law of large numbers,

$$\hat{O}_N \xrightarrow{N \to \infty} \sum_s q_\theta(s)\, w(s)\, O(s) = \frac{1}{Z} \sum_s e^{-\beta H(s)}\, O(s) = \langle O \rangle_p\,, \tag{3}$$

where $w(s) = \frac{e^{-\beta H(s)}}{Z\, q_\theta(s)}$. The partition function can be similarly determined:

$$\hat{Z}_N = \frac{1}{N} \sum_{i=1}^{N} \hat{w}(s_i) \xrightarrow{N \to \infty} \sum_s q_\theta(s)\, \frac{e^{-\beta H(s)}}{q_\theta(s)} = Z\,. \tag{4}$$

Combining the previous equations, we obtain

$$\langle O \rangle_p \approx \frac{\sum_{i=1}^{N} \hat{w}(s_i)\, O(s_i)}{\sum_{i=1}^{N} \hat{w}(s_i)} \qquad \text{with} \qquad \hat{w}(s) = \frac{e^{-\beta H(s)}}{q_\theta(s)}\,. \tag{5}$$
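Equations (2)-(5) translate directly into a few lines of code. The sketch below works with log-weights for numerical stability; the tiny two-spin system and observable are illustrative choices used only to check the estimator against an exactly solvable case:

```python
import numpy as np

def nis_estimate(log_q, energies, observables, beta):
    """Self-normalized neural importance sampling, as in Eq. (5).

    log_q:       ln q_theta(s_i) for samples s_i ~ q_theta
    energies:    H(s_i)
    observables: O(s_i)
    Returns the estimate of <O>_p with p proportional to exp(-beta * H).
    """
    log_w = -beta * np.asarray(energies) - np.asarray(log_q)
    log_w -= np.max(log_w)              # stabilize before exponentiating
    w = np.exp(log_w)
    return np.sum(w * np.asarray(observables)) / np.sum(w)

# sanity check on an exactly solvable toy case: two coupled spins, H = -s1*s2,
# observable O = s1*s2, "samples" = full enumeration under a uniform q
states = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
H = -states[:, 0] * states[:, 1]
O = states[:, 0] * states[:, 1]
log_q = np.full(4, np.log(0.25))
beta = 0.7
est = nis_estimate(log_q, H, O, beta)
exact = np.tanh(beta)                   # analytic <s1*s2> for two coupled spins
```

With a uniform $q$ and full enumeration, the self-normalized weights reproduce the Boltzmann average exactly; with finite samples from a trained $q_\theta$, the same code gives the asymptotically unbiased estimate of Eq. (5).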

#### A.3.2 Neural MCMC Sampling

The sampler $q_\theta$ can be used as a trial distribution for a Markov chain which uses the following acceptance probability in its Metropolis step:

$$A(s \to s') = \min\!\left(1,\; \frac{q_\theta(s)\, e^{-\beta H(s')}}{q_\theta(s')\, e^{-\beta H(s)}}\right)\,. \tag{6}$$

This fulfills the detailed balance condition

$$p(s)\, W(s \to s') = p(s')\, W(s' \to s) \tag{7}$$

because the total transition probability is given by $W(s \to s') = q_\theta(s')\, A(s \to s')$ and therefore

$$p(s)\, W(s \to s') = \frac{e^{-\beta H(s)}}{Z}\, q_\theta(s')\, \min\!\left(1,\; \frac{q_\theta(s)\, e^{-\beta H(s')}}{q_\theta(s')\, e^{-\beta H(s)}}\right) = \frac{1}{Z}\, \min\!\left(q_\theta(s')\, e^{-\beta H(s)},\; q_\theta(s)\, e^{-\beta H(s')}\right) = p(s')\, W(s' \to s)\,,$$

where we have used the fact that the min operator is symmetric and that all factors are strictly positive. The latter property is ensured by the fact that $q_\theta(s) \geq \epsilon > 0$, as described in Appendix A.2.
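A single neural-MCMC update can be sketched as follows, working in log space; the proposal is drawn independently from the trained sampler, which is why no warm-up is needed. Function names and the state representation are illustrative, not the authors' implementation:

```python
import numpy as np

def accept_prob(log_q_cur, log_q_prop, e_cur, e_prop, beta):
    """Metropolis acceptance probability of Eq. (6), evaluated in log space:
    A = min(1, q(s) exp(-beta H(s')) / (q(s') exp(-beta H(s)))).
    """
    log_a = (log_q_cur - log_q_prop) + beta * (e_cur - e_prop)
    return min(1.0, np.exp(log_a))

def nmcmc_step(state, log_q_state, energy_state, proposal, beta, rng):
    """One neural-MCMC step: accept the independent proposal or keep the state.

    proposal is a tuple (s', log_q(s'), H(s')) drawn from the trained sampler.
    """
    s_prop, log_q_prop, e_prop = proposal
    a = accept_prob(log_q_state, log_q_prop, energy_state, e_prop, beta)
    if rng.random() < a:
        return s_prop, log_q_prop, e_prop
    return state, log_q_state, energy_state

# if q_theta equals the Boltzmann distribution (log q = -beta H up to a
# constant), the log acceptance ratio cancels to zero and every proposal
# is accepted
beta = 0.5
e_cur, e_prop = 1.3, -0.4
a = accept_prob(-beta * e_cur, -beta * e_prop, e_cur, e_prop, beta)
```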

## Appendix B Additional Details on Experiments

### B.1 Setup

We use a Tesla P100 GPU with 16GB of memory both for training and sampling. A lattice is considered. The reference values are generated using the Wolff algorithm with 2M steps and 100k warm-up steps. We use the ResNet version of VAN with the following hyperparameter choices (chosen to match those used by Wu et al.):

| name | value |
|---|---|
| net depth | 6 |
| net width | 3 |
| half kernel size | 3 |
| bias | true |
| epsilon | 1e-07 |

The reference implementation of Wu et al. is used to train the VANs. For estimating the results of VAN and saVANt-NIS, we use 1000 iterations, sampling 500 configurations each. For saVANt-NMCMC, we use 100k steps, sampling 500 candidate configurations per batch. No warm-up steps are required because candidates are sampled from a pre-trained VAN. As demonstrated in Table 4, all algorithms have roughly the same runtime for sampling, but saVANt leads to a significant reduction in training time, as explained in the main text.

### B.2 Additional Results

#### B.2.1 Properties of saVANt-NMCMC

saVANt-NMCMC is less likely to get stuck in local minima and has reduced autocorrelation times compared to Metropolis. This is demonstrated in Tables 2 and 3.

| Metropolis | saVANt-NMCMC | VAN |
|---|---|---|
| -0.99 ± 2e-5 | -4e-4 ± 3e-3 | -6e-4 ± 2e-3 |

| β | Metropolis | saVANt-NMCMC |
|---|---|---|
| 0.44 | 231.1 ± 14.0 | 0.50 ± 0.01 |
| 0.45 | 542.5 ± 47.4 | 0.50 ± 0.01 |
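Integrated autocorrelation times like those reported above can be estimated with a simple windowed estimator. The sketch below uses the convention $\tau_{\mathrm{int}} = \tfrac{1}{2} + \sum_{t \geq 1} \rho(t)$, under which uncorrelated samples give $\tau \approx 0.5$ (consistent with the saVANt-NMCMC values in the table); it is a generic sketch, not necessarily the exact procedure used for these results:

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=100):
    """Windowed estimate of tau_int = 1/2 + sum_{t=1}^{max_lag} rho(t),
    where rho(t) is the normalized autocorrelation of the series x at lag t.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.mean(x * x)
    tau = 0.5
    for t in range(1, max_lag + 1):
        rho = np.mean(x[:n - t] * x[t:]) / var
        tau += rho
    return tau

rng = np.random.default_rng(1)
# an uncorrelated chain should give tau close to the iid value of 0.5
tau_iid = integrated_autocorr_time(rng.standard_normal(50000), max_lag=20)
```

In practice the window `max_lag` must be chosen adaptively (large enough to capture the correlations, small enough to control noise); here it is fixed for simplicity.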

#### B.2.2 Using Models Trained at Different Temperatures

As already shown in the main text, Fig. 2 and Fig. 3 demonstrate that the bias of VAN estimates persists across different values of β. In particular, we looked at two further values of β. As in the main study shown in the manuscript, here we trained a VAN with the aforementioned setup at a reference β and subsequently used this model to predict energies at different values of β. We note that VAN does not reproduce the reference values for almost all values of β. This discrepancy is particularly pronounced as one approaches the critical temperature. Fig. 4 shows the aforementioned transfer property from a different perspective.

| VAN | saVANt-NMCMC | saVANt-NIS |
|---|---|---|
| 1h 7m | 1h 7m | 1h 8m |