Appendix A Variational autoregressive networks
The variational mean-field methods approximate the Boltzmann distribution by tractable variational distributions from certain families. Due to the limitations of computing the variational free energy, these variational distributions are usually not very expressive. Recent developments in deep neural networks can provide much more expressive variational distributions. The universal approximation theorem DLbook ensures that a neural network containing a large-enough hidden layer can approximate any continuous function, so a deep neural network can in principle approximate the Boltzmann distribution very accurately. This idea has recently been used to build a neural-network-based variational mean-field method for statistical mechanics problems, the variational autoregressive network (VAN) wu2019solving.
VANs factorize the variational distribution by writing the joint probability as a product of conditional probabilities,
$$q_\theta(\mathbf{s}) = \prod_{i=1}^{N} q_\theta(s_i \mid \mathbf{s}_{<i}), \qquad (6)$$
where $\mathbf{s}_{<i}$ represents the state of all variables with index less than $i$. Here we can see that the state of variable $i$ is determined by the preceding variables but is independent of all variables after it. This is called the autoregressive property in the machine learning community. From the viewpoint of graphical models, Eq. (6) connects all variables through their indices into a complete directed graph (with each variable represented by a vertex). This connection pattern is universal for all graphical models, since it makes no conditional independence assumptions about the variables.
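As a concrete illustration of Eq. (6), the following sketch (plain Python; the joint distribution over three $\pm 1$ spins is a hypothetical randomly generated one, chosen only for the demonstration) extracts the conditionals from a joint distribution by the chain rule and checks that their product recovers the joint:

```python
import itertools
import random

# Toy joint distribution over three +/-1 spins (hypothetical numbers,
# chosen only to be positive and normalized).
random.seed(0)
states = list(itertools.product([-1, 1], repeat=3))
weights = [random.random() for _ in states]
Z = sum(weights)
joint = {s: w / Z for s, w in zip(states, weights)}

def marginal(prefix):
    """Probability that the first len(prefix) spins equal `prefix`."""
    return sum(p for s, p in joint.items() if s[:len(prefix)] == prefix)

def conditional(s_i, prefix):
    """q(s_i | s_<i) obtained from the joint by the chain rule."""
    return marginal(prefix + (s_i,)) / marginal(prefix)

# The product of conditionals reproduces the joint for every state.
for s in states:
    prod = 1.0
    for i in range(3):
        prod *= conditional(s[i], s[:i])
    assert abs(prod - joint[s]) < 1e-12
```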
The conditional probabilities are then parametrized by carefully designed neural networks that represent the many-body interactions of the variables. Without loss of generality, for the simplest case of a one-layer autoregressive network without bias, the outputs can be written as
$$\hat{s}_i = \sigma\Big(\sum_{j<i} W_{ij} s_j\Big),$$
where the autoregressive property is achieved by masking the weights, $W_{ij} = 0$ for $j \ge i$. Since the sigmoid function $\sigma(x) = 1/(1+e^{-x})$ limits the outputs to the range $(0,1)$, the output $\hat{s}_i$ can naturally be interpreted as a probability. In this case, the conditional probability can be written as
$$q_\theta(s_i \mid \mathbf{s}_{<i}) = \hat{s}_i\, \delta_{s_i,+1} + (1-\hat{s}_i)\, \delta_{s_i,-1}, \qquad (7)$$
which is a Bernoulli distribution with mean $\hat{s}_i$. The variational joint distribution then becomes
$$q_\theta(\mathbf{s}) = \prod_i q_\theta(s_i \mid \mathbf{s}_{<i}), \qquad (8)$$
where $\theta$ denotes the network parameters, i.e. the weights and biases.
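Normalization of the joint distribution in Eq. (8) follows automatically from the autoregressive structure. A minimal NumPy sketch of the masked one-layer network of Eqs. (7) and (8) (the weights and sizes here are arbitrary illustrative choices, not the settings of any experiment):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 4
W = np.tril(rng.normal(size=(n, n)), k=-1)  # mask: output i sees only inputs j < i

def s_hat(s):
    """hat{s}_i = sigma(sum_{j<i} W_ij s_j): probability that s_i = +1."""
    return sigmoid(W @ s)

def log_q(s):
    """ln q(s) = sum_i ln q(s_i | s_<i) for spins s_i in {-1, +1}."""
    p = s_hat(s)
    per_site = np.where(s == 1, p, 1.0 - p)
    return np.log(per_site).sum()

# Because each conditional is a normalized Bernoulli that depends only on
# previous spins, the probabilities over all 2^n states sum to one.
states = [np.array(bits) for bits in np.ndindex(*(2,) * n)]
total = sum(np.exp(log_q(2 * s - 1)) for s in states)
assert abs(total - 1.0) < 1e-10
```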
Variational methods adjust the parameters to bring $q_\theta$ close to the Boltzmann distribution $p$. The distance between two distributions is usually quantified by the Kullback-Leibler (KL) divergence mackay2003information
$$D_{\mathrm{KL}}(q_\theta \,\|\, p) = \sum_{\mathbf{s}} q_\theta(\mathbf{s}) \ln \frac{q_\theta(\mathbf{s})}{p(\mathbf{s})} = \beta\,(F_q - F), \qquad (9)$$
where
$$F_q = \frac{1}{\beta} \sum_{\mathbf{s}} q_\theta(\mathbf{s}) \big[\beta E(\mathbf{s}) + \ln q_\theta(\mathbf{s})\big] \qquad (10)$$
is the variational free energy and $F$ is the true free energy. Since the KL divergence is non-negative, minimizing it is equivalent to minimizing $F_q$ toward its lower bound $F$. Thus the variational free energy is taken as the loss function for VANs. Here $E(\mathbf{s})$ is the energy function determined by the model, and $\ln q_\theta(\mathbf{s})$ can be calculated by combining Eq. (7) and Eq. (8):
$$\ln q_\theta(\mathbf{s}) = \sum_i \ln q_\theta(s_i \mid \mathbf{s}_{<i}). \qquad (11)$$
Minimizing $F_q$ requires optimization over the network parameters $\theta$; here the score-function estimator williams1992simple is adopted to calculate the gradient of the variational free energy with respect to the network parameters,
$$\nabla_\theta F_q = \frac{1}{\beta}\, \mathbb{E}_{\mathbf{s} \sim q_\theta} \Big[ \big(\beta E(\mathbf{s}) + \ln q_\theta(\mathbf{s})\big)\, \nabla_\theta \ln q_\theta(\mathbf{s}) \Big]. \qquad (12)$$
Using the Adam optimizer kingma2014adam to minimize the loss function drives the variational free energy toward its lower bound. After training, the variational distribution $q_\theta$ is a good approximation to the Boltzmann distribution $p$.
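Eq. (12) is an exact identity, since the extra term $\mathbb{E}_{q}[\nabla_\theta \ln q_\theta]$ vanishes. The sketch below checks this on a toy two-spin model with a hypothetical three-parameter autoregressive distribution (all names and numbers are illustrative): the score-function expectation, evaluated exactly by enumeration, agrees with a finite-difference gradient of $F_q$.

```python
import itertools
import numpy as np

beta = 0.7
J = 1.0  # toy ferromagnetic coupling between the two spins

def energy(s):
    return -J * s[0] * s[1]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_q(s, theta):
    """Autoregressive log-probability q(s1) q(s2|s1): logits are
    theta = (a, b, w), with a for spin 1 and b + w*s1 for spin 2."""
    a, b, w = theta
    p1 = sigmoid(a)
    p2 = sigmoid(b + w * s[0])
    lp = np.log(p1 if s[0] == 1 else 1 - p1)
    lp += np.log(p2 if s[1] == 1 else 1 - p2)
    return lp

states = list(itertools.product([-1, 1], repeat=2))

def free_energy(theta):
    """F_q = (1/beta) * sum_s q(s) [beta E(s) + ln q(s)], Eq. (10)."""
    return sum(np.exp(log_q(s, theta)) * (energy(s) + log_q(s, theta) / beta)
               for s in states)

def score_gradient(theta):
    """Eq. (12) evaluated exactly: (1/beta) E_q[(beta E + ln q) grad ln q]."""
    eps = 1e-6
    grad = np.zeros(3)
    for s in states:
        q = np.exp(log_q(s, theta))
        f = energy(s) + log_q(s, theta) / beta
        for k in range(3):
            d = np.zeros(3); d[k] = eps
            dlogq = (log_q(s, theta + d) - log_q(s, theta - d)) / (2 * eps)
            grad[k] += q * f * dlogq
    return grad

theta = np.array([0.3, -0.2, 0.1])
# The score-function expectation equals the true gradient of F_q.
eps = 1e-5
for k in range(3):
    d = np.zeros(3); d[k] = eps
    fd = (free_energy(theta + d) - free_energy(theta - d)) / (2 * eps)
    assert abs(fd - score_gradient(theta)[k]) < 1e-4
```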
The workflow of a VAN is shown in Fig. 4. First, an autoregressive network is constructed according to the factorization of the joint probability. Then samples obeying the variational distribution are drawn from the autoregressive network vertex by vertex in a given ordering, where the probability of the next vertex given the former vertices is given by Eq. (7). With these samples we can estimate the energy $E$, the entropy, and the variational free energy $F_q$. Finally, the gradients of $F_q$ are propagated back to update the network parameters according to Eq. (12).
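The vertex-by-vertex (ancestral) sampling step of this workflow can be sketched as follows; the masked weight matrix here is a random illustrative choice, not a trained network:

```python
import numpy as np

def sample(W, rng):
    """Draw one configuration vertex by vertex (ancestral sampling):
    the conditional for spin i uses only the spins j < i already drawn."""
    n = W.shape[0]
    s = np.zeros(n)
    for i in range(n):
        p_up = 1.0 / (1.0 + np.exp(-(W[i, :i] @ s[:i])))  # Eq. (7)
        s[i] = 1.0 if rng.random() < p_up else -1.0
    return s

rng = np.random.default_rng(0)
n = 6
W = np.tril(rng.normal(size=(n, n)), k=-1)  # autoregressive mask
samples = np.array([sample(W, rng) for _ in range(2000)])
assert np.all(np.isin(samples, [-1.0, 1.0]))
# Spin 0 has no parents and no bias, so its marginal is exactly 1/2.
assert abs(samples[:, 0].mean()) < 0.1
```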
Appendix B Calculation example of Ising model
As a representative model, consider the classic Ising spin glass, whose energy function is defined as
$$E(\mathbf{s}) = -\sum_{(i,j) \in \mathcal{E}} J_{ij} s_i s_j - \sum_i h_i s_i, \qquad (13)$$
where $\mathbf{s} = (\mathbf{s}_{\mathrm{FVS}}, \mathbf{s}_{\mathrm{forest}})$ is the joint state of the FVS and the forest, $\mathcal{E}$ denotes the set of edges, $J_{ij}$ are the couplings of the Ising model, and $h_i$ is the external field acting on vertex $i$.
Given the configuration $\mathbf{s}_{\mathrm{FVS}}$ of the FVS, the external field that the FVS vertices exert on a forest vertex $i$ is
$$\tilde{h}_i = \sum_{j \in \partial i \cap \mathrm{FVS}} J_{ij} s_j, \qquad (14)$$
where $\partial i$ is the set of neighbors of vertex $i$. The energy of the forest is then
$$E_{\mathrm{forest}} = -\sum_{(i,j) \in \mathcal{E}_{\mathrm{forest}}} J_{ij} s_i s_j - \sum_{i \in \mathrm{forest}} H_i s_i, \qquad (15)$$
where $\mathcal{E}_{\mathrm{forest}}$ denotes the edges of the forest and $H_i = h_i + \tilde{h}_i$ is the overall external field that vertex $i$ feels.
When performing leaf removal (which recursively removes all leaves of the graph), the whole forest is hierarchized into levels, with leaves on top and roots at the bottom. Our leaf-removal ordering is constructed by appending these levels from top to bottom, which ensures that after the vertices in an upper level are removed, the vertices in the next level are all leaves.
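A possible implementation of this level-by-level leaf-removal ordering (our own illustrative sketch, not the authors' code; `leaf_removal_order` is a hypothetical name, and a forest input is assumed):

```python
def leaf_removal_order(n, edges):
    """Return the vertices of a forest ordered level by level: all current
    leaves (degree <= 1) are removed at once, repeatedly, so every vertex
    is a leaf or an isolated root at the moment it is appended."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    order, alive = [], set(range(n))
    while alive:
        level = [v for v in alive if len(adj[v]) <= 1]
        for v in level:
            for u in adj[v]:
                adj[u].discard(v)
            adj[v].clear()
            alive.discard(v)
        order.extend(level)
    return order

# A small star tree 0-1, 1-2, 1-3: leaves 0, 2, 3 come first, root 1 last.
order = leaf_removal_order(4, [(0, 1), (1, 2), (1, 3)])
assert order[-1] == 1 and set(order[:3]) == {0, 2, 3}
```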
When a leaf vertex $i$ with neighbor $j$ is removed from the forest, the partition function can be written as
$$Z = \sum_{\mathbf{s}} e^{-\beta E} = \sum_{\mathbf{s} \setminus s_i} e^{-\beta E_{\setminus i}} \sum_{s_i} e^{\beta (J_{ij} s_i s_j + H_i s_i)} = \sum_{\mathbf{s} \setminus s_i} e^{-\beta E_{\setminus i}}\, 2\cosh\!\big(\beta (J_{ij} s_j + H_i)\big), \qquad (16)$$
where $E_{\setminus i}$ represents the energy with vertex $i$ and its interaction excluded, and $H_i$ is the external field that combines the original external field $h_i$ of vertex $i$ and the fields exerted on $i$ by previously removed vertices. Thus, when a leaf vertex $i$ is removed from the graph, it contributes a factor
$$2\sqrt{\cosh\!\big(\beta(J_{ij} + H_i)\big)\cosh\!\big(\beta(J_{ij} - H_i)\big)}$$
to the partition function and changes the external field of its neighbor $j$ to
$$H_j \leftarrow H_j + \frac{1}{2\beta} \ln \frac{\cosh\!\big(\beta(J_{ij} + H_i)\big)}{\cosh\!\big(\beta(J_{ij} - H_i)\big)}. \qquad (17)$$
For a root vertex $i$ the factor is simply $2\cosh(\beta H_i)$, since it has no neighbors when removed.
Multiplying these factors together and combining Eq. (3) and Eq. (4), the effective energy of the FVS can be analytically expressed as
$$E_{\mathrm{eff}}(\mathbf{s}_{\mathrm{FVS}}) = -\sum_{(i,j) \in \mathcal{E}_{\mathrm{FVS}}} J_{ij} s_i s_j - \sum_{i \in \mathrm{FVS}} h_i s_i - \frac{1}{\beta} \sum_{i \in \mathrm{forest} \setminus R} \ln \Big[ 2\sqrt{\cosh\!\big(\beta(J_{ij} + H_i)\big)\cosh\!\big(\beta(J_{ij} - H_i)\big)} \Big] - \frac{1}{\beta} \sum_{i \in R} \ln \big[ 2\cosh(\beta H_i) \big]. \qquad (18)$$
Here $R$ denotes the set of roots of the remaining forest; in the third sum $j$ is the neighbor of vertex $i$ at the moment $i$ is removed, and $H_i$ is the overall effective field acting on vertex $i$ at that moment, accumulated through the updates of Eq. (17).
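The decimation rules of Eqs. (16) and (17) can be checked numerically on a small chain: multiplying the leaf factors and the final root factor must reproduce the exact partition function. A sketch with arbitrary toy couplings and fields:

```python
import itertools
import numpy as np

beta = 0.5

def decimate_leaf(J, h_leaf):
    """Summing out a leaf with total field h_leaf, coupled by J to its
    neighbour s, gives 2*cosh(beta*(J*s + h_leaf)) = A * exp(beta*dh*s):
    A is the partition-function factor, dh shifts the neighbour's field."""
    cp = np.cosh(beta * (J + h_leaf))
    cm = np.cosh(beta * (J - h_leaf))
    A = 2.0 * np.sqrt(cp * cm)
    dh = np.log(cp / cm) / (2.0 * beta)
    return A, dh

# Chain 0-1-2 with couplings J01, J12 and fields h (toy numbers).
J01, J12 = 0.8, -0.3
h = [0.2, -0.1, 0.4]

def exact_Z():
    Z = 0.0
    for s in itertools.product([-1, 1], repeat=3):
        E = -J01*s[0]*s[1] - J12*s[1]*s[2] - sum(hi*si for hi, si in zip(h, s))
        Z += np.exp(-beta * E)
    return Z

# Decimate leaf 0 and leaf 2 (both neighbours of 1), then root 1.
A0, dh0 = decimate_leaf(J01, h[0])
A2, dh2 = decimate_leaf(J12, h[2])
h1 = h[1] + dh0 + dh2
Z_dec = A0 * A2 * 2.0 * np.cosh(beta * h1)
assert abs(Z_dec - exact_Z()) < 1e-10
```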
Appendix C Experimental settings
In the first experiment, we compare the correlations given by three methods on a spin glass model, with MCMC results serving as the baseline. The hyperparameter settings are roughly the same as in the other experiments, so we do not specify them here. Calculating connected correlations with MCMC is trivial, since samples obeying the Boltzmann distribution have already been obtained; thus we only describe how to calculate connected correlations for the other two methods.
For belief propagation, we adopt the scheme of zhou2015spin by propagating the cavity fields $h_{i \to j}$. Once the cavity fields reach their fixed point, physical quantities can be calculated from them. The magnetization of vertex $i$ is
$$m_i = \tanh\Big(\beta h_i + \sum_{j \in \partial i} \tanh^{-1}\big[\tanh(\beta J_{ij}) \tanh(\beta h_{j \to i})\big]\Big), \qquad (19)$$
where $\partial i$ represents all neighbors of vertex $i$. The connected correlation of vertices $i$ and $j$ is
$$C_{ij} = \frac{t_{ij} + t_{i \to j}\, t_{j \to i}}{1 + t_{ij}\, t_{i \to j}\, t_{j \to i}} - m_i m_j, \qquad (20)$$
where $t_{ij} = \tanh(\beta J_{ij})$ and $t_{i \to j} = \tanh(\beta h_{i \to j})$.
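On a graph consisting of a single edge, belief propagation is exact and the cavity fields reduce to the local fields, so the pair-correlation expression entering Eq. (20) can be checked against direct enumeration (toy parameter values):

```python
import itertools
import numpy as np

beta, J, hi, hj = 0.6, 0.9, 0.3, -0.5  # toy values

# Exact two-spin average <s_i s_j> by enumeration.
num = den = 0.0
for si, sj in itertools.product([-1, 1], repeat=2):
    w = np.exp(beta * (J*si*sj + hi*si + hj*sj))
    num += si * sj * w
    den += w
exact = num / den

# BP pair formula; on a single edge the cavity fields equal the local fields.
tJ, ti, tj = np.tanh(beta*J), np.tanh(beta*hi), np.tanh(beta*hj)
bp = (tJ + ti*tj) / (1 + tJ*ti*tj)
assert abs(bp - exact) < 1e-12
```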
For our method, the calculation is a little less direct, since we only have samples of the FVS vertices while correlations on all edges are needed. We can, however, start from the definition of the correlation,
$$\langle s_i s_j \rangle = \frac{\sum_{\mathbf{s}'} s_i s_j\, e^{-\beta E(\mathbf{s}')}}{\sum_{\mathbf{s}'} e^{-\beta E(\mathbf{s}')}}, \qquad (21)$$
where $\mathbf{s}'$ is the state of the FVS vertices plus vertices $i$ and $j$. Both the numerator and the denominator can be calculated using our method, and $\langle s_i \rangle$ can be calculated similarly, so the expressions above allow us to compute the connected correlations $C_{ij} = \langle s_i s_j \rangle - \langle s_i \rangle \langle s_j \rangle$ on the entire graph.
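The definition in Eq. (21) can be sanity-checked by full enumeration on a tiny open chain at zero field, where the connected correlation of the end spins is known exactly to be $\tanh(\beta J_{01})\tanh(\beta J_{12})$ (the couplings are toy values, and this brute-force check is our illustration, not the method itself):

```python
import itertools
import numpy as np

beta = 0.4
J01, J12 = 1.0, -0.7  # toy chain 0-1-2, zero external field

def weight(s):
    E = -J01*s[0]*s[1] - J12*s[1]*s[2]
    return np.exp(-beta * E)

states = list(itertools.product([-1, 1], repeat=3))
Z = sum(weight(s) for s in states)

def avg(f):
    """Boltzmann average of f, Eq. (21) style: weighted sum over Z."""
    return sum(f(s) * weight(s) for s in states) / Z

# At zero field the magnetizations vanish and the open chain obeys the
# exact identity <s0 s2> = tanh(beta*J01) * tanh(beta*J12).
c02 = avg(lambda s: s[0]*s[2]) - avg(lambda s: s[0]) * avg(lambda s: s[2])
assert abs(c02 - np.tanh(beta*J01)*np.tanh(beta*J12)) < 1e-12
```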
Table 1: Hyperparameter settings of the three architectures.

| | Dense | Convolution | FVS |
|---|---|---|---|
| Batch | | | |
| Net depth | 3 | 3 | 2 |
| Net width | 4 | 64¹ | 3 |
| Max step | | | |
| Learning rate | | | |
| Number of parameters² | 1577216 | 714113 | 22532 |

¹ For convolution, this number represents channels.
² This quantity is not a hyperparameter, but we list it here for comparison.
In the second experiment, our method, two architectures from the original version of VAN, and belief propagation are applied to a 2D ferromagnetic Ising model with open boundary conditions. For the dense and convolution architectures we borrow the hyperparameter settings from wu2019solving. All hyperparameters used in this experiment are listed in Table 1. Since all methods have been fine-tuned for their best performance, there are some differences in the hyperparameter settings. We can clearly see that our approach achieves better performance than the others with far fewer parameters, which demonstrates the advantage of our construction.
In the third experiment in the main text, the two methods are run on random regular graphs to compare their running times and convergence speeds. Hyperparameters such as batch size, net depth, net width, and learning rate used in Fig. 3(a) are identical for both methods, in order to isolate the differences in performance. The resulting numbers of trainable parameters nevertheless differ greatly between the FVS network and the dense VAN on a random graph of the same size: our method massively decreases the number of trainable parameters.