 # Neural-network based general method for statistical mechanics on sparse systems

We propose a general method for solving statistical mechanics problems defined on sparse graphs, such as random graphs, real-world networks, and low-dimensional lattices. Our approach extracts a small feedback vertex set of the sparse graph, converting the sparse system into a strongly correlated system with many-body, dense interactions on the feedback set, and then solves it with a neural-network-based variational method to estimate the free energy and observables and to generate unbiased samples via direct sampling. Extensive experiments show that our approach is more accurate than existing approaches for sparse spin glass systems. On random graphs and real-world networks, it significantly outperforms standard methods for sparse systems such as belief propagation; on structured sparse systems such as two-dimensional lattices, it is significantly faster and more accurate than the recently proposed variational autoregressive networks based on convolutional neural networks.


## Appendix A Variational autoregressive networks

Variational mean-field methods approximate the Boltzmann distribution using tractable variational distributions from certain families. Because the variational free energy must remain computable, these variational distributions are usually not very expressive. Recent developments in deep neural networks can provide much more expressive variational distributions. The universal approximation theorem DLbook ensures that a neural network containing a large enough hidden layer can approximate any continuous function, so a deep neural network can in principle approximate the Boltzmann distribution very accurately. This idea was recently used in variational autoregressive networks (VANs) to build a neural-network-based variational mean-field method for statistical mechanics problems wu2019solving.

VANs factorize the variational distribution by writing the joint probability as the product of conditional probabilities

$$q(s) = \prod_{i=1}^{N} q(s_i \mid s_{<i}), \tag{6}$$

where s_{<i} represents the state of all variables with index less than i. For the variable with index i, the state is determined by the previous variables but is independent of all variables after it; this is called the autoregressive property in the machine learning community. From the viewpoint of graphical models, Eq. (6) connects all variables, ordered by their indices, into a complete directed graph (with each variable represented by a vertex). This connection pattern is universal for all graphical models, since it makes no conditional-independence assumptions about the variables.

The conditional probabilities are then parametrized by carefully designed neural networks that represent the many-body interactions of the variables. Without loss of generality, for the simplest case of a one-layer autoregressive network without bias, the outputs can be written as

$$s'_i = \sigma\Big(\sum_{j<i} W_{ij}\, s_j\Big), \tag{7}$$

where σ(x) = 1/(1+e^{−x}) is the sigmoid function. Since the sigmoid limits the outputs to the range (0, 1), the outputs can naturally be interpreted as probabilities. In this case, the conditional probability can be written as

$$q(s_i \mid s_{<i}) = s_i s'_i + (1 - s_i)(1 - s'_i), \tag{8}$$

which is a Bernoulli distribution with mean s′_i. The variational joint distribution then becomes

$$q_\theta(s) = \prod_{i=1}^{N} q_\theta(s_i \mid s_{<i}),$$

where θ denotes the network parameters, such as the weights and biases.
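As a concrete illustration, the masked one-layer network of Eqs. (6)–(8) can be sketched in a few lines of NumPy. This is a toy sketch of ours, not the authors' implementation; the function names and sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditionals(W, s):
    """Eq. (7): s'_i = sigmoid(sum_{j<i} W_ij s_j).

    W is an N x N weight matrix; the strictly lower-triangular mask
    enforces the autoregressive property (output i sees only s_j, j < i).
    """
    mask = np.tril(np.ones_like(W), k=-1)
    return sigmoid((W * mask) @ s)

def sample(W, rng):
    """Draw one configuration s in {0,1}^N vertex by vertex, using Eq. (8)."""
    N = W.shape[0]
    s = np.zeros(N)
    for i in range(N):
        p = conditionals(W, s)[i]   # depends only on s_0 .. s_{i-1}
        s[i] = float(rng.random() < p)
    return s

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
s = sample(W, rng)
```

Note that with no bias term the first conditional is always σ(0) = 1/2; a deeper network or a bias vector removes this restriction.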

Variational methods adjust the parameters θ to bring q_θ close to the Boltzmann distribution. The distance between two distributions is usually quantified by the Kullback-Leibler (KL) divergence mackay2003information

$$D_{\mathrm{KL}}(q\,\|\,p) = \sum_s q(s)\ln\frac{q(s)}{p(s)} = \beta(F_q - F), \tag{9}$$

where

$$F_q = \sum_s q_\theta(s)\Big[E(s) + \frac{1}{\beta}\ln q_\theta(s)\Big] \tag{10}$$

is the variational free energy and F is the true free energy. Since the KL divergence is non-negative, F_q ≥ F, so minimizing the KL divergence is equivalent to minimizing F_q toward its lower bound F. The variational free energy is therefore taken as the loss function for VANs. Here E(s) is the energy function determined by the model, and ln q_θ(s) can be calculated by combining Eq. (7) and Eq. (8):

$$\ln q_\theta(s) = \sum_{i=1}^{N}\ln\big[s_i s'_i + (1-s_i)(1-s'_i)\big]. \tag{11}$$

Minimizing F_q requires optimization over the network parameters θ. Here the score-function estimator williams1992simple is adopted to calculate the gradient of the variational free energy with respect to the network parameters:

$$\nabla_\theta F_q = \sum_s q_\theta(s)\Big\{\Big[E(s) + \frac{1}{\beta}\ln q_\theta(s)\Big]\nabla_\theta \ln q_\theta(s)\Big\}. \tag{12}$$

Using the Adam optimizer kingma2014adam to minimize this loss drives the variational free energy toward its lower bound. After training, the variational distribution q_θ is a good approximation to the Boltzmann distribution p.

The workflow of a VAN is shown in Fig. 4. First, an autoregressive network is constructed according to the factorization of the joint probability. Then samples obeying the variational distribution are drawn from the autoregressive network by sampling vertex by vertex in a given ordering; the probability of the next vertex given the former vertices is given by Eq. (7). With these samples we can estimate the energy E, the entropy S, and the variational free energy F_q. Finally, the gradients of F_q are propagated back to update the network parameters according to Eq. (12).
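The whole loop of Eqs. (10)–(12) can be sketched on a tiny Ising chain. This is a schematic of ours, not the paper's code: plain SGD with a mean baseline stands in for Adam, the one-layer network has no bias, and all sizes and rates are toy choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_van(J, beta=1.0, steps=500, batch=128, lr=0.1, seed=0):
    """Schematic VAN training for an Ising chain with nearest-neighbor
    couplings J (spins s in {0,1} mapped to sigma = 2s - 1)."""
    rng = np.random.default_rng(seed)
    N = len(J) + 1
    W = np.zeros((N, N))                          # one-layer AR net, Eq. (7)
    mask = np.tril(np.ones((N, N)), k=-1)
    J = np.asarray(J, float)
    for _ in range(steps):
        # 1) ancestral sampling from q_theta, vertex by vertex
        s = np.zeros((batch, N))
        for i in range(N):
            p_i = sigmoid(s @ (W * mask).T)[:, i]
            s[:, i] = (rng.random(batch) < p_i).astype(float)
        p = sigmoid(s @ (W * mask).T)
        # 2) ln q_theta(s), Eq. (11)
        logq = np.log(np.where(s > 0.5, p, 1.0 - p)).sum(axis=1)
        # 3) energy of the chain in +/-1 variables
        sigma = 2.0 * s - 1.0
        E = -(J * sigma[:, :-1] * sigma[:, 1:]).sum(axis=1)
        # 4) score-function gradient of F_q, Eq. (12), with a baseline
        f = E + logq / beta
        Fq = f.mean()                             # sample estimate of Eq. (10)
        f = f - Fq                                # variance-reduction baseline
        dlogq = (s - p)[:, :, None] * s[:, None, :]   # d ln q / d W_ij
        W -= lr * (f[:, None, None] * dlogq).mean(axis=0) * mask
    return W, Fq
```

On a two-spin ferromagnet the learned weight becomes positive, so sampled spins tend to align, and the estimated F_q drops below the value −2 ln 2 attained by the uniform distribution.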

## Appendix B Calculation example of Ising model

As a calculation example, we consider the classic Ising spin glass as a representative model, in which the energy function is defined as

$$E(\sigma) = -\sum_{(ij)\in E} J_{ij}\,\sigma_i\sigma_j - \sum_i \theta_i\,\sigma_i, \tag{13}$$

where σ = (s, t) is the joint state composed of the FVS state s and the forest state t, E denotes the set of edges, J_ij are the couplings of the Ising model, and θ_i is the external field acting on node i.
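Eq. (13) translates directly into code. The helper below (our own naming; a plain transcription, not the paper's implementation) evaluates the energy of a spin configuration on an arbitrary edge list:

```python
import numpy as np

def ising_energy(sigma, edges, J, theta):
    """Eq. (13): E = -sum_{(i,j) in E} J_ij sigma_i sigma_j - sum_i theta_i sigma_i.

    sigma: array of +/-1 spins; edges: list of (i, j) pairs;
    J: one coupling per edge; theta: external field per vertex.
    """
    pair = sum(Jij * sigma[i] * sigma[j] for (i, j), Jij in zip(edges, J))
    field = float(np.dot(theta, sigma))
    return -pair - field
```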

Given the configuration s of the FVS Ω, the external field that the FVS vertices exert on forest vertex i is

$$h^0_i = \sum_{j\in\Omega\cap n(i)} \big(J_{ij}\,s_j + \theta_j\,s_j\big), \tag{14}$$

where n(i) denotes the set of neighbors of vertex i. The energy of the forest is then

$$E(t) = -\sum_{(ij)\in E_T} J_{ij}\, t_i t_j - \sum_{i\in T} t_i h_i. \tag{15}$$

Here E_T denotes the set of edges of the forest T, and h_i is the overall external field that vertex i feels.

When performing leaf removal (which recursively removes all leaves of the graph), the whole forest is organized into levels, with leaves on top and roots at the bottom. Our leaf-removal ordering is constructed by appending these levels from top to bottom, which ensures that after the vertices in one level are removed, all vertices in the next level become leaves.
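The level construction can be written as a short routine (an illustrative sketch of ours; the function name is hypothetical):

```python
from collections import defaultdict

def leaf_removal_order(n, edges):
    """Peel a forest level by level: repeatedly remove all current leaves
    (degree <= 1).  Returns the levels from leaves (top) to roots (bottom);
    concatenating them gives the removal ordering described in the text."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    alive = set(range(n))
    levels = []
    while alive:
        leaves = [v for v in alive if len(adj[v]) <= 1]
        levels.append(sorted(leaves))
        for v in leaves:
            for u in adj[v]:
                adj[u].discard(v)
            adj[v].clear()
        alive -= set(leaves)
    return levels
```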

When a leaf i with neighbor j is removed from the forest, the partition function of the forest can be written as

$$
\begin{aligned}
Z(s) &= \sum_{t\in T} e^{-\beta E(s,t)}
      = \sum_{t_{\setminus i}} e^{-\beta E_{\setminus i}(t_{\setminus i})} \sum_{t_i} e^{\beta t_i (h_i + J_{ij} t_j)} \\
     &= \sum_{t_{\setminus i}} 2\cosh\beta(J_{ij} t_j + h_i)\, e^{-\beta E_{\setminus i}(t_{\setminus i})} \\
     &= 2\sqrt{\cosh\beta(J_{ij}+h_i)\cosh\beta(-J_{ij}+h_i)}
        \sum_{t_{\setminus i}} \exp\!\left[\frac{t_j}{2}\ln\frac{\cosh\beta(J_{ij}+h_i)}{\cosh\beta(-J_{ij}+h_i)}\right] e^{-\beta E_{\setminus i}(t_{\setminus i})} \\
     &= 2\sqrt{\cosh\beta(J_{ij}+h_i)\cosh\beta(-J_{ij}+h_i)}
        \sum_{t_{\setminus i}} e^{\beta h'_j t_j}\, e^{-\beta E_{\setminus ij}(t_{\setminus i})} \\
     &= 2\sqrt{\cosh\beta(J_{ij}+h_i)\cosh\beta(-J_{ij}+h_i)}\; Z_{\setminus i},
\end{aligned}
\tag{16}
$$

where E_{∖i} denotes the energy with vertex i and its interactions excluded, and h′_j is the external field that combines the original external field of vertex j with the effective field that vertex i exerts on j.

Thus, when a leaf vertex i is removed from the graph, it contributes a factor 2√(cosh β(J_ij+h_i) cosh β(−J_ij+h_i)) to the partition function and changes the external field of its neighbor j to

$$h'_j = h_j + \frac{1}{2\beta}\ln\frac{\cosh\beta(J_{ij}+h_i)}{\cosh\beta(-J_{ij}+h_i)}. \tag{17}$$

For a root vertex, the factor is 2cosh(βh_i), since it has no remaining neighbors when removed.
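As a sanity check of Eqs. (16)–(17), the recursion can be run on a small forest and compared against brute-force enumeration. The code below is our own toy implementation, not the paper's:

```python
import numpy as np
from itertools import product

def forest_log_z(n, edges, J, h, beta):
    """ln Z of a forest by leaf removal, Eqs. (16)-(17): each removed leaf i
    with neighbor j contributes ln 2*sqrt(cosh b(J+h_i) cosh b(-J+h_i)) and
    shifts h_j by (1/2b) ln[cosh b(J+h_i)/cosh b(-J+h_i)]; a root contributes
    ln 2 cosh(b h)."""
    adj = {v: {} for v in range(n)}
    for (i, j), Jij in zip(edges, J):
        adj[i][j] = Jij
        adj[j][i] = Jij
    h = [float(x) for x in h]
    logz = 0.0
    alive = set(range(n))
    while alive:
        leaves = [v for v in alive if len(adj[v]) <= 1]
        for v in leaves:
            if adj[v]:                       # leaf with one neighbor, Eq. (16)
                (u, Juv), = adj[v].items()
                cp = np.cosh(beta * (Juv + h[v]))
                cm = np.cosh(beta * (-Juv + h[v]))
                logz += np.log(2.0) + 0.5 * np.log(cp * cm)
                h[u] += np.log(cp / cm) / (2 * beta)   # Eq. (17)
                del adj[u][v]
            else:                            # root: factor 2 cosh(beta h)
                logz += np.log(2 * np.cosh(beta * h[v]))
            adj[v].clear()
            alive.discard(v)
    return logz

def brute_log_z(n, edges, J, h, beta):
    """Exact ln Z by enumerating all 2^n spin configurations."""
    z = 0.0
    for sigma in product([-1, 1], repeat=n):
        E = -sum(Jij * sigma[i] * sigma[j] for (i, j), Jij in zip(edges, J))
        E -= sum(hi * si for hi, si in zip(h, sigma))
        z += np.exp(-beta * E)
    return np.log(z)
```

On any tree or forest the two routines agree to machine precision, which is the content of the decimation identity.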

Multiplying these factors together and combining Eq. (3) and Eq. (4), the effective energy of the FVS can be expressed analytically as

$$\tilde E(s) = E_\Omega(s) - \frac{1}{\beta}\left[\sum_{(i,j)\in E_T}\ln\sqrt{4\cosh\beta(J_{ij}+h_i)\cosh\beta(-J_{ij}+h_i)} + \sum_{i\in R}\ln\big(2\cosh(\beta h_i)\big)\right]. \tag{18}$$

Here R denotes the set of roots of the remaining forest, and h_i is the effective field acting on vertex i at the moment of its removal, with j its remaining neighbor.

## Appendix C Experimental settings

In the first experiment, we compare the correlations obtained by three methods on the spin glass model, taking MCMC results as the baseline. Hyperparameter settings are roughly the same as in the main text, so we do not repeat them here. Calculating connected correlations with MCMC is trivial, since samples obeying the Boltzmann distribution have already been obtained; we therefore only describe how the connected correlations of the other two methods are calculated.

For belief propagation, we adopt the scheme of zhou2015spin by propagating cavity fields h_{i→j}. Once the cavity fields have reached their fixed point, we can calculate physical quantities from them. The magnetization of vertex i is

$$m_i = \tanh\Big[\beta\theta_i + \sum_{j\in\partial i}\operatorname{arctanh}\big(\tanh(\beta J_{ij})\tanh(\beta h_{j\to i})\big)\Big], \tag{19}$$

where ∂i denotes the set of neighbors of vertex i. The connected correlation of vertices i and j is

$$\langle s_i s_j\rangle_c = \frac{e^{\beta J_{ij}}\cosh\beta h^+_{ij} - e^{-\beta J_{ij}}\cosh\beta h^-_{ij}}{e^{\beta J_{ij}}\cosh\beta h^+_{ij} + e^{-\beta J_{ij}}\cosh\beta h^-_{ij}} - m_i m_j, \tag{20}$$

where h^+_{ij} = h_{i→j} + h_{j→i} and h^−_{ij} = h_{i→j} − h_{j→i}.
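Since BP is exact on trees, Eqs. (19)–(20) are easy to sanity-check. The sketch below is our own code (written for ±1 spins with synchronous updates; names and update schedule are our choices): it iterates the cavity fields to their fixed point and evaluates the magnetizations of Eq. (19) against exact enumeration:

```python
import numpy as np
from itertools import product

def bp_magnetizations(n, edges, J, theta, beta, iters=200):
    """Cavity-field BP for +/-1 spins.  Update:
    h_{j->i} = theta_j + (1/b) sum_{k in dj\\i} atanh[tanh(b J_jk) tanh(b h_{k->j})],
    then m_i follows Eq. (19)."""
    nbr = {v: {} for v in range(n)}
    for (i, j), Jij in zip(edges, J):
        nbr[i][j] = Jij
        nbr[j][i] = Jij
    h = {(j, i): 0.0 for i in range(n) for j in nbr[i]}   # cavity field j -> i
    for _ in range(iters):
        new = {}
        for (j, i) in h:
            u = sum(np.arctanh(np.tanh(beta * nbr[j][k]) * np.tanh(beta * h[(k, j)]))
                    for k in nbr[j] if k != i)
            new[(j, i)] = theta[j] + u / beta
        h = new
    m = []
    for i in range(n):
        u = sum(np.arctanh(np.tanh(beta * nbr[i][j]) * np.tanh(beta * h[(j, i)]))
                for j in nbr[i])
        m.append(np.tanh(beta * theta[i] + u))
    return np.array(m)

def exact_magnetizations(n, edges, J, theta, beta):
    """Exact magnetizations by enumerating all 2^n configurations."""
    num, Z = np.zeros(n), 0.0
    for sigma in product([-1, 1], repeat=n):
        E = -sum(Jij * sigma[i] * sigma[j] for (i, j), Jij in zip(edges, J))
        E -= sum(t * s for t, s in zip(theta, sigma))
        w = np.exp(-beta * E)
        Z += w
        num += w * np.array(sigma)
    return num / Z
```

On a tree the two routines agree to numerical precision; on loopy graphs BP is only approximate.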

For our method, the calculation is somewhat less direct, since we only have samples of the FVS vertices while correlations are needed for all edges. We can, however, start from the definition of the correlation

$$
\begin{aligned}
\langle s_i s_j\rangle &= \sum_{s_i,s_j} s_i s_j\, p(s_i,s_j)
 = \sum_{s_i,s_j} s_i s_j \frac{\sum_{s\setminus\{s_i,s_j\}} e^{-\beta E(s)}}{Z} \\
&= \sum_{s_i,s_j} s_i s_j \frac{\sum_{s_{\mathrm{fvs}}} e^{-\beta\tilde E(s_{\mathrm{aug}})}}{\sum_{s_{\mathrm{fvs}}} e^{-\beta\tilde E(s_{\mathrm{fvs}})}}
 = \frac{Z_{++}+Z_{--}-Z_{+-}-Z_{-+}}{\sum_{s_{\mathrm{fvs}}} e^{-\beta\tilde E(s_{\mathrm{fvs}})}},
\end{aligned}
\tag{21}
$$

where s_aug denotes the state of the FVS vertices together with vertices i and j, and Z_{±±} is the sum of e^{−βẼ(s_aug)} with s_i and s_j clamped to the corresponding values. Both the numerator and the denominator can then be calculated with our method; ⟨s_i⟩ and ⟨s_j⟩ can be calculated similarly, so the expressions above yield connected correlations for the entire graph.
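The decomposition into clamped partition functions in Eq. (21) can be illustrated by brute force on a tiny system, with exact enumeration standing in for the neural-network estimates of Z_{±±} (our own toy code; the connected part is obtained by further subtracting ⟨s_i⟩⟨s_j⟩):

```python
import numpy as np
from itertools import product

def z_clamped(n, edges, J, beta, i, j, si, sj):
    """Partition function with spins i and j clamped to si, sj (+/-1)."""
    z = 0.0
    for sigma in product([-1, 1], repeat=n):
        if sigma[i] != si or sigma[j] != sj:
            continue
        E = -sum(Jab * sigma[a] * sigma[b] for (a, b), Jab in zip(edges, J))
        z += np.exp(-beta * E)
    return z

def correlation(n, edges, J, beta, i, j):
    """Eq. (21): <s_i s_j> = (Z++ + Z-- - Z+- - Z-+) / Z."""
    zpp = z_clamped(n, edges, J, beta, i, j, +1, +1)
    zmm = z_clamped(n, edges, J, beta, i, j, -1, -1)
    zpm = z_clamped(n, edges, J, beta, i, j, +1, -1)
    zmp = z_clamped(n, edges, J, beta, i, j, -1, +1)
    return (zpp + zmm - zpm - zmp) / (zpp + zmm + zpm + zmp)
```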

In the second experiment, our method, the two architectures from the original VAN, and belief propagation are applied to a 2D ferromagnetic Ising model with open boundary conditions. For the dense and convolutional architectures, we borrow the hyperparameter settings from wu2019solving. All hyperparameters used in this experiment are listed in Table 1. Since all methods have been fine-tuned for their best performance, there are some differences in the hyperparameter settings. Our approach achieves better performance than the others with fewer hyperparameters, which demonstrates the advantage of our construction.

In the third experiment in the main text, the two methods are run on random regular graphs to compare their running times and convergence speeds. Hyperparameters such as batch size, network depth, network width, and learning rate used in Fig. 3(a) are identical for both methods, so that the comparison isolates the difference in performance. The resulting numbers of trainable parameters nevertheless differ greatly between the FVS network and the dense VAN on a random graph of the same size: our method massively decreases the number of trainable parameters.

The hyperparameters of Fig. 3(b) are similar to those of Fig. 3(a); the only difference is the number of vertices.