1 Introduction
Developing a deep understanding of the structure and dynamics of collective/emergent manybody phenomena in complex systems has been an elusive goal for decades across many disciplines, including physics, chemistry, biology, and computer science. Spinglasses provide a prototypical and computationally universal language for such systems Stein and Newman (2013)
. Indeed, there is a onetoone correspondence between finding the low energy states of spinglasses and many important computational tasks, such as finding highquality solutions in NPhard combinatorial optimization problems
Mezard and Montanari (2009) and performing sampling and inference in graphical models Moore and Mertens (2011). Metastable states and nonequilibrium dynamics of spinglasses can also represent steadystate attractors in dynamical systems Nishimori and Ortiz (2010), associative memory retrieval in neuroscience Nishimori and Ortiz (2010), and the training and inference over energybased models of neural networks
LeCun et al. (2006).Unfortunately, the standard theoretical and computational techniques are intrinsically inadequate to tackle realistic spinglasses systems of interest that are far from thermodynamic limit, exhibit longranged powerlaw interactions, and have intermediate spatial dimensions. These systems reside in an underexplored intermediate regime between the wellstudied limiting cases of the shortrange EdwardsAnderson Mezard and Montanari (2009) and infiniterange SherringtonKirkpatrick models Sherrington and Kirkpatrick (1975)
. The state space of these complex systems, which typically have a discrete structure, grows exponentially with the degrees of freedom, rendering bruteforce approaches intractable
Moore and Mertens (2011). Moreover, such systems often contain significant nonlinearities or higherorder nonlocal correlations, which induce exotic underlying geometries for the system. Here, the disorders are the essence, or the main characteristic of each problem, not a perturbation over pure/idealized models such as regular lattices. Thus, these systems are not prone to typical analytical approaches such as perturbation theory.There is no universal approach for studying such systems. Meanfield techniques could capture certain properties for some toy models in the thermodynamic limit, such as random energy models or spin models Castellani and Cavagna (2005), but they fail to capture the properties of systems for which large fluctuations cause the meanfield approximation to break down Mezard and Montanari (2009). In principle, the renormalization group (RG) could be used to capture critical or fixed point properties near a phase transition, but these methods are extremely hard to formalize for arbitrary systems Nishimori and Ortiz (2010). Moreover, RG techniques typically have poor performance for problems of practical interest that tend to be highly inhomogeneous and disordered, and they often miss key computational properties as one has to crudely coarsegrain the microscopic degrees of freedom Nishimori and Ortiz (2010).
Recent advances in deep learning could open up the possibility that important nontrivial and nonperturbative emergent properties could be machinelearned in an instancewise fashion. Indeed, deep neural networks have already been used in manybody physics to identify phases and phase transitions for certain classical
Carrasquilla and Melko (2017) and quantum critical systems Biamonte et al. (2017); Carleo and Troyer (2017) and they have also been used to accelerate Monte Carlo sampling Huang and Wang (2017); Liu et al. (2017a); Shen et al. (2018); Li and Wang (2018). However, the intrinsic strongly disordered and frustrated nature of spinglasses defined over discrete variables significantly limits the applicability of machine learning algorithms to such systems.In order to tackle such problem, we have recently developed a continuous probability density theory for spinglass systems with generic connectivity graph and local fields that could reside in the unexplored regime of intermediate spatial dimensions
Hartnett and Mohseni (2019). Here, we are interested in using such continuous representation for learning lowenergy states and critical properties of discrete spinglass distributions described by the following family of Hamiltonians(1) 
where is an Ising spin, is the coupling matrix, is the external magnetic field, and is an index which runs from 1 to . The equilibrium properties of these systems at temperature are governed by the usual Boltzmann distribution
(2) 
where is the partition function. This family of spin systems is extremely broad. It includes systems that are solvable using meanfield techniques as well as systems that continue to elude both theoretical and numerical treatments. Most methods for treating these spin systems aim to exploit some structure contained in the couplings  for example, symmetries due to an underlying lattice structure. In contrast, deep generative models are learned, thus allowing for bespoke methods tailored for a specific system or disorder realization. For this reason, the successful demonstration of deep generative spinglass models is an important accomplishment which should have broad applications for combinatorial optimization, sampling, and inference.
In this work, we perform a first step towards achieving this goal and use normalizing flows Rezende and Mohamed (2015) to model the wellknown spinglass system known as the SherringtonKirkpatrick (SK) model Sherrington and Kirkpatrick (1975). The SK model is a perfect system to test these new methods because it exhibits a spinglass phase characterized by exponentially many metastable states separated by high energy barriers, and so represents a challenging family of distributions for the normalizing flow to learn. However, despite the complexity of the SK model, it still is amenable to theoretical treatments, and the model has been very thoroughly studied since it was introduced almost half a century ago. The SK model is therefore an excellent candidate for the first numerical experiment of this sort as it presents a formidable challenge, and at the same time the ground truth is known and can be used to verify our spinglass learning algorithm. In this work, we will show that the SK model can be successfully machinelearned by normalizing flows. In future work, we hope to extend this success to more challenging spinglasses relevant for computational problems of practical interest.
The layout of this paper is as follows. In Sec. 2, we briefly review the continuous formulation of spinglasses introduced in Hartnett and Mohseni (2019) and use it to recast the SK model in terms of continuous variables. Then, in Sec. 3 we review and discuss normalizing flows from the perspective of statistical physics, and in Sec. 4 we successfully train normalizing flows to approximate the SK model, including in the spinglass phase. We argue that the internal layers of the flow perform a kind of coarsegraining, and in Sec. 5 we show the evidence for a spinglass phase transition within the normalizing flow itself. Lastly, we conclude with a discussion and outlook in Sec. 6.
2 Probability density formulation of spinglasses
The spinglass family represented by Eq. 1 is defined in terms of discrete variables. However, many of the most promising recent algorithms for generative modeling, such as normalizing flows and Generative Adversarial Networks (GANs) Goodfellow et al. (2014), are formulated in terms of continuous variables. In many applications this fact is ignored, and the generative algorithm is nonetheless successfully trained on discrete data. In the context of physicsbased applications, this was done in Liu et al. (2017b), in which GANs were trained to simulate the 2d Ising model. However, in this work, we shall make use the probability density formulation of discrete spinglasses introduced recently in Hartnett and Mohseni (2019) to convert the Boltzmann distribution over discrete spin variables into a probability density defined over continuous variables. A similar approach was used by Li and Wang (2018) to study the phase transition of the 2d Ising model using normalizing flows.
In the probability density formulation of spinglasses, discrete Boltzmann distributions of the form , with given by Eq. 1
, are equivalently described in terms of a continuous random variable
, with probability density . The probability density may also be written as a Boltzmann distribution, i.e. , where is the Hamiltonian density, and the term density refers to how transforms under a change of variables Hartnett and Mohseni (2019). The Hamiltonian density is given by(3) 
We have added a subscript to the Hamiltonian density to emphasize its dependence on the inverse temperature. Here is the partition function of the variable, which is related to the original partition function of the variable via:
(4) 
The Hamiltonian density is written in terms of a shifted coupling matrix , given by
(5) 
where with . The shift ensures that is integrable.
The variable may be interpreted as the integration variable in a somewhat nonstandard application of the HubbardStratonovich transformation. The probability density formulation of Hartnett and Mohseni (2019) (which builds off of the technique introduced in Zhang et al. (2012)) treats and
as random variables with joint distribution
and conditional distributions , . The conditional distribution is simply a multivariate Gaussian:(6) 
with mean and covariance matrix . The other conditional distribution is given by
(7) 
Importantly, both conditional distributions are relatively easy to sample from. Thus, given a sample of the discrete spins , a corresponding continuous configuration may be easily obtained, and vice versa.
2.1 The SherringtonKirkpatrick model
As mentioned in the introduction, the family of Boltzmann distributions with the Hamiltonian of Eq. 1 is very general. Therefore, as a first step towards answering the broader question of whether normalizing flows can successfully learn complex spinglass distributions, we shall restrict our attention to one particular spinglass, the SK model Sherrington and Kirkpatrick (1975). The SK model is defined by specifying that the couplings
be drawn from an iid Gaussian distribution:
(8) 
where the values are fixed by the symmetry of to be the same as the values, and the diagonal entries are zero. In order to machine learn this complex distribution, we shall work in terms of the continuous variable .
Despite its simple form, the SK model is a toy model with a very rich theoretical structure. In particular, there is a spinglass phase transition at a critical temperature of . Above this temperature the spinglass exists in a disordered phase. For lower temperatures ergodicity is broken by the appearance of an exponentially large number of metastable states separated by large energy barriers. Thus, the SK model represents an excellent testing ground for modeling complex spinglass distributions using deep neural networks.
A key result of Hartnett and Mohseni (2019) regarding the continuous formulation of spinglass distributions is the existence of a convex/nonconvex transition which occurs for the SK model at a temperature of
(9) 
Above the Hamiltonian density is convex, and it becomes nonconvex as the temperature is lowered past this value. Importantly, is much higher than the spinglass critical temperature. Thus, it is not obvious a priori whether the continuous formulation will be useful for the task of machine learning the SK model, since becomes “complex” (i.e. has a nonconvex energy landscape) well before the spinglass phase appears.
3 Normalizing flows and statistical physics
In this section, we introduce an application of real normalizing flows Rezende and Mohamed (2015) for modeling statistical manybody systems with real degrees of freedom. Normalizing flows are mappings from a latent random variable to a physical random variable : . The latent variable is drawn from a simple prior distribution, for example a multivariate Gaussian. The flow transforms this into a new distribution , and in most machine learning applications the goal is then to modify the parameters of such that this induced distribution closely matches some target distribution. In our case, is the continuous spin variable introduced above, and the target distribution is the Boltzmann distribution . We will argue below that the physical interpretation of is that it provides a coarsegrained description of the system. Denoting the prior as , the distribution induced by the flow can be obtained using the change of variables formula
(10) 
where the second factor on the RHS is just due to the Jacobian of the transformation.
Real NonVolume Preserving (NVP) flows Dinh et al. (2016) are normalizing flows for which the mapping
is composed of multiple simpler invertible functions called affine layers, which we denote by
, and whose precise definition is given in Appendix A. Letting denote the number of such layers, the flow may be defined recursively as , for , with the understanding that is the identity and corresponds to the complete flow. The output of each layer of the flow represents a random variable , with and .The change of variables formula may be applied to each intermediate variable to obtain an expression for the associated probability density. For applications to statistical physics, it is useful to represent each of these as Boltzmann distributions. In other words, for each layer we introduce an associated Hamiltonian density implicitly defined by
(11) 
Taking the prior to be an isotropic Gaussian with zeromean and standard deviation equal to one, the Hamiltonian density of the prior variable
is just the quadratic function(12) 
The change of variable formula may then be used to obtain for any of the intermediate layers:
(13) 
Similarly, the flow partition function may be expressed as an integral over any of the :
(14) 
Evaluating this expression for , it is easily seen that .
Describing the intermediate layers of the flow as random variables with their own Hamiltonian densities allows for a nice physical interpretation of the flow, see Fig. 1. Initially, the sequence of flow Hamiltonian densities are randomly initialized, and have no relation to the physical Hamiltonian density describing the system of interest . The goal of training the normalizing flow is to modify the parameters of so that the final flow Hamiltonian serves as a better and better approximation to the physical one. We will first investigate two distinct training methods based on different objective functions, and evaluate how well the trained models are able to capture the key properties of the spinglass phase. Later, in Sec. 5, we will present evidence that the trained flows implement a reversible coarsegraining on the physical system which is different than irreversible RG techniques.
4 Selfsupervised learning of deep generative spinglass models
SelfSupervised learning generally refers to a machine learning paradigm where all the labels in the training data can be generated autonomously Doersch and Zisserman (2017). In this paradigm, the architecture and the learning process could be fully supervised, however one does not need to manually label the training data. For example, the data can be labelled automatically by finding or computing the correlations between different input signals. Selfsupervised learning is intrinsically suitable for online learning when there is steady stream of the new data that should be automatically digested on the fly, or whenever robots/agents need to automatically adapting within a changing environment, or whenever new instances of data or partial data can be generated by the future output of the model itself. Here, we use two different selfsupervised learning approaches to train our deep generative spinglass models using forward or backward KL divergence on the data that is automatically generated by either Monte Carlo sampling or by the model itself respectively.
Generative models such as normalizing flows are typically trained by maximizing the log likelihood of some training data. Here, a priori we do not have any dataset, thus we need to generate our data; e.g.; through traditional Monte Carlo techniques such as parallel tempering Earl and Deem (2005)
. Another major difference is that a tractable expression for the true probability distribution is known; i.e. the Boltzmann distribution over
with Hamiltonian density is given by Eq. 3. This allows for two viable approaches towards training the flow to approximate the true distribution  one based on minimizing the reverse KL divergence and one based on minimizing the forward KL divergence.4.1 Reverse KL minimization
In this approach the loss function is given by the reverse KullbackLiebler (KL) divergence, plus a constant offset:
(15) 
The divergence is said to be “reversed” because in most machine learning applications the “forward” version is used.^{1}^{1}1The KL divergence of two probability densities and is defined to be
(16) 
where is the Helmholtz free energy. The subscript is meant to indicate that these free energies are associated with the continuous spinglass distribution , rather than the discrete one. Thus, minimization of the reverse KL loss is equivalent to the minimization of the Gibbs free energy. The global minimum is obtained when , in which case the KL divergence vanishes and the Gibbs and Helmholtz free energies agree, .
In order to employ this loss in a machine learning algorithm, we will make use of the fact that the reverse KL loss may also be written as an expectation over the prior:
(17) 
Thus, rather than estimating a loss over a training set, with this training method the model itself generates a number of samples, and the loss is then estimated by computing the difference of the Hamiltonian densities (or equivalently, the log likelihoods). This therefore represents a potentially very powerful learning method, since the training is limited only by computational constraints, and not by the quantity or quality of data. Since the KL divergence is nonnegative, this loss provides a lower bound for the loss function/Gibbs free energy:
(18) 
The tightness of this bound provides one measure of how well the flow has learned the spinglass distribution.
4.2 Forward KL minimization
A more conventional approach towards training the normalizing flow is to first assemble an unlabeled dataset consisting of spinglass configurations distributed according to the Boltzmann distribution. We will generate such a dataset by performing Monte Carlo sampling of the true spinglass distribution . The loss function will then be taken to be the expectation of the negative loglikelihood over the dataset:
(19) 
which is equivalent to the forward KL divergence plus the Shannon entropy of :
(20) 
may also be written as the the cross entropy between and . Similar to the reverse KL loss function, the KL loss function is lowerbounded by the Shannon entropy,
(21) 
4.3 Training deep generative spinglasses with normalizing flows
We considered the problem of training normalizing flows to approximate the SK model for spins^{2}^{2}2This is a small number of spins by some standards  for example it represents a square lattice in two dimensions. However, recall that the SK model is defined over a complete graph, which in this case contains 32,640 bonds between spins. For context, a square lattice with periodic boundary conditions and spins contains only 512 bonds. and , with the range of temperatures chosen to encompass both the spinglass and paramagnetic phases of the SK model. This range extends far beyond the phase transition temperature of so that we may study the effect of the transition between a convex and nonconvex Hamiltonian density, which occurs for . For each temperature, we considered 10 different realizations of the disorder (i.e. 10 random draws of the matrix ). Also, we set
, so that the smallest eigenvalue of the shifted coupling matrix is
(which effectively adjusts to be ).We performed experiments for both training methods discussed above, minimizing either the reverse KL divergence or the forward KL divergence. For the forward KL approach, we first assembled a dataset of spin configurations using paralleltempering (PT) (see for example Katzgraber (2010)). We simulated 20 parallel systems (called replicas), each at a different temperature. We recorded the spinconfigurations throughout the parallel tempering after each sweep of all spins, until 100,000 samples had been obtained for each temperature. We then repeated this procedure for 10 realizations of the coupling matrix . Lastly, to train the normalizing flow the discrete Ising spin configurations must be converted to continuous configurations. We used each spin configuration to generate a single configuration using the joint distribution (given by Eq. 6).^{3}^{3}3Since the variables and are related probabilistically, each configuration could have been used to generate any number of configurations. We chose a 1:1 conversion ratio for simplicity.
In Fig. 2 (a) we plot the reverse KL loss throughout the course of training. The loss is roughly constant for the first 100 iterations, after which it drops sharply. For it then continues to decrease slowly with additional training. The magnitude of the drop increases for low temperatures. A measure of the quality of the trained flow may be obtained by examining the tightness of the free energy bound in Eq. 18, or equivalently, the smallness of the reverse KL divergence. In order to estimate the KL divergence, we first evaluated the Helmholtz free energy using the relation between and in Eq. 4. In the spinglass phase the discrete partition function was estimated using the results of the parallel tempering described above, while in the paramagnetic phase () we used the replicasymmetric expression , where . Importantly, the replicasymmetric expression is exact in the paramagnetic phase and in the large limit. Fig. 2 (b) shows that the reverse KL divergence is rather high at low temperatures, and decreases rapidly as the temperature is raised. The inset shows the estimated value of the two free energies, Gibbs and Helmholtz.
Similarly, in Fig. 3 (a) we plot the forward KL loss throughout the course of training. In the spinglass phase, learning proceeds slowly at first 100 iterations or so, and then decreases much more rapidly afterwards. In the paramagnetic phase, the loss decreases quickly with the first 100 iterations, and then decreases much more slowly afterwards. For the critical temperature, , the loss curve is essentially flat, and there is not a period of steep decline. Another interesting observation is that the variation in the value of the loss increases as decreases. The data for the lowest temperature considered, , shows that after about 1000 training iterations there are occasionally large fluctuations in the value of the loss.
As with the reverse KL loss function, the quality of the learned flow can be measured by estimating the KL divergence and the tightness of the bound Eq. 21. This is depicted in Fig. 3 (b), and the inset compares the asymptotic value of the forward KL loss with the Shannon entropy of the physical distribution , which we evaluated using the results of the parallel tempering simulations. Compared to the reverse KL divergence, the forward KL divergence first increases rapidly as the temperature is raised, reaches a peak at about , then decreases rapidly across the spinglass phase transition, and finally approaches an asymptotic value as the temperature is increased further. The forward KL divergence achieves much smaller values than the reverse KL divergence in the spinglass phase  the maximum value of the forward KL divergence is an order of magnitude smaller than the maximum value of the reverse KL divergence. One important point is that the estimate of the forward KL divergence actually becomes negative for high temperatures. As the KL divergence is nonnegative, this indicates that the error in the estimate is large enough to change the value from positive to negative. We initially observed a similar phenomenon in estimating the reverse KL loss, but we were able to improve our estimate in the paramagnetic phase by using the exact replica symmetric expression for the free energy. We are not aware of an analogous expression for the Shannon entropy .
4.4 Order parameter of generative spinglasses
Spinglasses are characterized by the presence of a large number of metastable states. These are typically separated from one another by large energy barriers, and as a result ergodicity is broken at low temperatures and the system will become stuck in one of the many states. The closeness of two such states and may be measured by the overlap between them:
(22) 
Here, is the average persite magnetization, or marginal probabilities, within the state , and similarly for . As shown by Parisi, the distribution of overlap values for a given spinglass instance may be used as an order parameter for the spinglass phase Parisi (1980). This represents a key distinction from more conventional phases of matter where the order parameter is simply a real number  here, it is an entire function
. Equivalently, the order parameter for a spinglass could be considered to be the infinite collection of real numbers defined by the moments of the overlap distribution. This unique nature of the spinglass order parameter underscores the complexity of these systems compared to more conventional phases of matter.
In order to evaluate the extent to which the trained normalizing flows have successfully approximated the spinglass phase we examined the overlap distribution. may be approximated by sampling the spinglass and computing the configurational overlaps between two randomly chosen spin configurations and :
(23) 
As sample size grows, the distribution of configurational overlaps will approach the overlap distribution . In Fig. 4 we plot numerical estimates of the nondisorderaveraged overlap distribution for the parallel tempering data and samples obtained from both the forward and reverse KLtrained flows.^{4}^{4}4The flow generates samples of the continuous variable ; samples of the discrete spin configuration were then obtained by sampling from the conditional distribution . Neither the reverse or forwardtrained flow perfectly matches the PT result, although the forward and PT results qualitatively match. The fact that the forwardtrained flow is able to produce a bimodal overlap distribution demonstrates that it has captured an important characteristic of the spinglass phase. In contrast, the overlap distribution for the reverse KLtrained flow has just a single peak near for , which indicates that most of the samples are very close to one another. This suggests that the flow has not successfully modeled the key property of the spinglass phase, namely the existence of many metastable states. Further evidence for this conclusion comes from examining the magnetization, . We find that for all sampling methods, but the reverse KLtrained flow is unique in that it produces nonzero persite magnetizations, i.e. . This is consistent with the hypothesis that the flow distribution has collapsed onto just a single mode of the true distribution. Indeed, this is a known failure mode of models trained on the reverse KL divergence.^{5}^{5}5 A nice discussion on this point may be found in Sec. 21.2.2 of Murphy (2012).
4.5 Ultrametricity in generative spinglasses
One of the most remarkable properties of spinglasses is that their state space can be endowed with highly nontrivial topological structure, which emerges naturally from seemingly unstructured Hamiltonians. For instance, the simple SK model Hamiltonian, with a featureless fully connected underlying graph and iid random couplings, can be shown to generate a simple hierarchical structure in its solution space. Specifically, the low energy states of the SK model obey an interesting relation known as ultrametricity Mézard et al. (1984). Given any triplet of states, , , and , the 3 possible overlaps , and may be converted into a set of Hamming distances which form the 3 sides of the triangle connecting the states. The standard triangle inequality is , and in the SK model, these distances are known to obey the stronger inequality which defines ultrametric spaces. This condition implies some geometric properties, for example that all triangles are either equilateral or acute isosceles. The distribution of triangles may be computed analytically in the SK model using the the replica approach with Parisi’s replica symmetry breaking ansatz. One finds that in the spinglass phase 1/4 of the triangles are the equilateral, and 3/4 are acute isosceles.^{6}^{6}6By isosceles, here we mean those triangles which are isosceles and not also equilateral, since all equilateral triangles are also trivially isosceles.
Therefore, as a final test of the quality of our learned continuous spinglasses trained with normalizing flows, it would be essential to examine if such models can actually generate the emergent hierarchical/ultrametric structure of the SK Hamiltonian. We plot the distribution of triangle side lengths in Fig. 5. Letting denote the ordered distances, the cluster near the origin of Fig. 5 corresponds to equilateral triangles and the other cluster on the axis corresponds to acute isosceles triangles. This same plot was used in Hartnett et al. (2018) in the context of the bipartite SK model. Fig 5 (a) depicts the triangle distribution for samples generated by parallel tempering simulations, and Fig. 5 (b) depicts the forward KLtrained flow distribution. These are in close agreement with oneanother, and both exhibit two wellseparated clusters of equilateral and isosceles triangles. Moreover, in both cases the data is very close to the 1:3 theoretical ratio of equilateral to isosceles triangles. In contrast, the samples generated from the reverse KLtrained flow depicted in Fig. 5 (c) form several isolated clusters very near the origin, which is a manifestation of the mode collapse in this generative model.
5 A phase transition within the internal layers of the normalizing flow
We have demonstrated that normalizing flows may be successfully trained to learn the spinglass phase of the SK model. Recall that the flow induces a sequence of Hamiltonian densities
which interpolate from a Gaussian energy landscape at
to the complex, multimodal landscape that corresponds to the spinglass phase for . This suggests that there is a phase transition within the flow, e.g., in passing from layer to layer .In order to confirm this conjecture, would ideally be a continuous parameter which would allow for a critical layer to be found, and then critical exponents could be calculated with respect to . One way to achieve this scenario would be to take , so that the normalizing flow is composed of an infinite number of layers, each of which implements an infinitesimal diffeormorphism. This would allow for additional theoretical analysis but would diminish the practical relevance since it is unclear how the abstracted normalizing flow with infinite layers relates to the finite
trained models considered above. An interesting alternative would be to use a form of flow based on a 1parameter diffeomorphism such as Neural Ordinary Differential Equations (Neural ODEs)
Chen et al. (2018). In lieu of taking the limit or considering a different class of generative models, we will simply investigate the statistics of the internal layers of our trained flows and establish that, as increases, there appears to be a transition from the paramagnetic phase to the spinglass phase.The Boltzmann distributions corresponding to the intermediate layers of the flow may be sampled by first sampling the Gaussian prior, and then only partially evolving the flow up to the corresponding intermediate layer. Following this approach, in Fig. 6 we plot the sequence of overlap distributions for each internal layer of the flow for 3 distinct temperatures. Fig. 6 (a) shows that deep in the spinglass phase the overlap distribution is transformed from the unimodal distribution of the Gaussian prior to a bimodal distribution in going from to . Each additional layer then increases the separation of the two modes as the flow distribution becomes a closer approximation to the physical distribution. Fig. 6 (b) shows that increasing the temperature (but keeping within the spinglass phase) causes the unimodal to bimodal transition to occur deeper in the flow. Lastly, Fig. 6 (c) demonstrates that in the paramagnetic phase the overlap distribution of all layers is unimodal, and there is no sign of a phase transition within the flow.
To gain more insight into the nature of the transition that occurs as is increased within the flow, in Fig. 7 we plot the triangle distance distribution for each internal layer for . As in Fig. 6, in going from to the distribution appears to qualitatively change, and the triangle distribution becomes bimodal. In this case the two modes correspond to equilateral and acute isosceles triangles. In the paramagnetic phase (and in the strict large limit), there are only equilateral triangles, whereas the spinglass phase is characterized by both equilateral and acute isosceles. Taken together, the results of Fig. 6 and Fig. 7 show that the flow transforms a quadratic energy landscape into a spinglass landscape via an incremental process, with each affine layer bringing the normalizing flow distribution closer to the true spinglass distribution.
6 Discussion and outlook
We have demonstrated that real NVP flows can be successfully trained to model complex spinglass distributions. We evaluated two methods for training the flow, one based on minimizing the Gibbs free energy, or reverse KL divergence, and another based on minimizing the forward KL divergence. Although the reverse KL approach led to a relatively tight bound for the free energy, which suggested that the model was successfully trained, the reverse KL divergence is quite large at low temperatures, and moreover the model was found to suffer from serious mode collapse. In contrast, the forward KLtrained model was able to capture the key properties of the spinglass lowenergy distributions, including a bimodal overlap distribution and topological structure in the state space known as ultrametricity.
Normalizing flows are composed of a series of continuous changes of variables (i.e. diffeomorphisms), and thus they seem illsuited for studying discrete systems such as spinglasses directly. For this reason, as a first step before applying the normalizing flows we used the probability density formulation recently introduced by us in Ref. Hartnett and Mohseni (2019) to convert the discrete Boltzmann distribution into a physically equivalent continuous one. This dual continuous distribution exhibits an important transition at for the SK model. As this temperature is crossed from above, the Hamiltonian density transitions from being convex to being nonconvex. As part of this transition, the critical point at becomes unstable and a pair of new critical points with appears. Because this transition occurs at temperatures well above the spinglass transition, one might suspect that the loss of convexity might hinder the learning task for approximating the spinglass phase. However, as we have showed in Hartnett and Mohseni (2019) the spinglass transition remains intact under continuous transformation and here we find no signature of any nonphysical effects when we use standard loss function based on forward KL divergence. Thus the continuous formulation can be used for spinglass density estimation and generative modeling.^{7}^{7}7Other recent approaches for applying normalizing flows to discrete data may be found in Hoogeboom et al. (2019); Tran et al. (2019). It would be interesting to investigate how these approaches compare to the continuous relaxation approach of Zhang et al. (2012).
Compared to traditional machine learning settings, the reverse KL minimization approach seems appealing because it involves minimizing a natural cost function, i.e., the Gibbs free energy, and the amount of training data is essentially unlimited. Unfortunately, as we observed here, the quality of the selfgenerated “data” was not good enough to yield a reliable performance. In the forward KL approach the amount of data was finite (100,000 samples), although this is an artificial bound because we generated the data ourselves through traditional Monte Carlo approaches. This observation suggests an online extension of the forward KL learning where the Monte Carlo data generation occurs in tandem with the gradientbased optimization. Rather than generating a fixed dataset of Monte Carlo samples, a better approach would be to continuously sample from the parallel tempering replicas, and to periodically perform a gradient descent update for the normalizing flow once enough samples have been collected to form a new minibatch. This would eliminate the potential for overfitting since for each gradient descent update the minibatch would consist of previouslyunseen Monte Carlo samples.
Real NVP flows implement a sequence of changes of variables, also known as diffeomorphisms. An important mathematical result along these lines is that any two distributions over can be related by a diffeomorphism. However, it is a nontrivial problem to find such a transformation, especially for spinglasses, which are notoriously complex. Our successfully trained flow implements a very particular sequence of diffeomorphisms  the inverse flow maps the the energy landscape of a spinglass to the simple quadratic function . Because a rough and multimodal landscape is mapped to a quadratic one, we can think of as a coarsegraining. Importantly, the coarsegraining is with respect to the ruggedness of the energy landscape  no degrees of freedom have been integrated out, and no information has been lost because the flow is invertible. This resembles simulated annealing Kirkpatrick et al. (1983) and adiabatic quantum computation Albash and Lidar (2018) for discrete optimization in which one starts with significant thermal or quantum fluctuations, smoothing the energy landscape to be originally convex, and then gradually (adiabatically) transform it to a nontrivial noncovex distribution via very slow (exponential in the worsecase instances) passage through a thermal or quantum phase transition respectively. In the case of adiabatic quantum annealing the process is in principle deterministic and fully reversible. However, in practice one has to deal with finitetemperature dissipations, manybody quantum localization, and Griffiths singularities for strongly disordered quantum spin glasses in low dimensions Boixo and others (2016); Mohseni et al. (2018).
There is also an apparent strong connection between our work and the renormalization group techniques. One might attempt to formally identify the inverse normalizing flow with a renormalization group flow. However, we are not aware of any formulation of the renormalization group which applies directly to probability densities such as the ones considered here. A possible connection of between RG and normalizing flows can be done through the Neural ODEs representations Chen et al. (2018). The renormalization of spinglasses appears to be most wellstudied for hierarchical models, and thus these would be a natural starting point to develop this connection further.^{8}^{8}8The thesis Castellana (2013) contains a good overview of the renormalization of these systems.
Some years ago an exact mapping between a variational RG transformation and Boltzmann Machines (BMs) with hidden layers was proposed in
Mehta and Schwab (2014). There are clearly some similarities between their setup and ours. In our work, the internal layers of the generative model are drawn from Boltzmann distributions with Hamiltonian densities , and moreover these internal distributions may be easily sampled. In contrast, in Ref. Mehta and Schwab (2014) BMs were studied over discrete variables, whose internal layers are hard to sample from. Perhaps the most significant difference between their work and ours is that in our case the marginal distribution of the zeroth layer is constrained to be a simple Gaussian, whereas for BMs the marginal distribution over the final hidden layer is intractable and depends on the weights in a very nontrivial way. The fact that in normalizing flows the distribution over the latent variable is fixed to be a Gaussian means that the inverse flow may be interpreted as a coarsegraining map, since the endpoint of the inverse flow is always a quadratic energy landscape.Further developing the connection between the renormalization group and normalizing flows would be worthwhile since, unlike traditional coarsegraining or renormalization group approaches which often rely on symmetries or other assumptions not likely to be found in systems of practical interest, the coarsegraining is separately learned for each spinglass system, making our approach broadly applicable. In particular, Li and Wang (2018)
have shown that the coarsegraining performed by the inverse flow may be used to improve Hamiltonian Markov Chain sampling of the system.
Albergo et al. (2019) also recently used trained normalizing flows to accelerate sampling of a simple lattice quantum field theory. The most important practical application of our work would be to obtain similar improvements in the computational tasks associated with spinglasses, such as combinatorial optimization, sampling of multimodal distributions, and inference in graphical models.Acknowledgments
We would like to thank S. Isakov, D. Ish, K. Najafi, and E. Parker, for useful discussions and comments on this manuscript.
References
 Adiabatic quantum computation. Rev. Mod. Phys. 90, pp. 015002. External Links: Document, Link Cited by: §6.
 Flowbased generative models for markov chain monte carlo in lattice field theory. Phys. Rev. D 100, pp. 034515. External Links: Document, Link Cited by: §6.
 Quantum machine learning. Nature 549 (7671), pp. 195–202. Cited by: §1.
 Computational multiqubit tunnelling in programmable quantum annealers. Nat. Commun. 7, pp. 10327 EP –. External Links: Link Cited by: §6.
 Solving the quantum manybody problem with artificial neural networks. Science 355 (6325), pp. 602–606. Cited by: §1.
 Machine learning phases of matter. Nature Physics 13 (5), pp. 431. Cited by: §1.
 The renormalization group for disordered systems. arXiv preprint arXiv:1307.6891. Cited by: footnote 8.
 Spinglass theory for pedestrians. Journal of Statistical Mechanics: Theory and Experiment 2005 (05), pp. P05012. External Links: Document, Link Cited by: §1.
 Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583. Cited by: §5, §6.
 Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §3.

Multitask selfsupervised visual learning.
In
The IEEE International Conference on Computer Vision (ICCV)
, Cited by: §4.  Parallel tempering: theory, applications, and new perspectives. Phys. Chem. Chem. Phys. 7, pp. 3910–3916. External Links: Document, Link Cited by: §4.
 Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
 Replica symmetry breaking in bipartite spin glasses and neural networks. Physical Review E 98 (2), pp. 022116. Cited by: §4.5.
 A probability density theory for spinglass systems. "to appear". Cited by: §1, §1, §2.1, §2, §2, §2, §6.
 Integer discrete flows and lossless compression. arXiv preprint arXiv:1905.07376. Cited by: footnote 7.

Accelerated monte carlo simulations with restricted boltzmann machines
. Physical Review B 95 (3), pp. 035105. Cited by: §1.  Introduction to monte carlo methods. Technical report Cited by: §4.3.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.1.
 Optimization by simulated annealing. Science 220 (4598), pp. 671–680. External Links: Document, ISSN 00368075, Link, https://science.sciencemag.org/content/220/4598/671.full.pdf Cited by: §6.
 A tutorial on energybased learning. In PREDICTING STRUCTURED DATA, Cited by: §1.
 Neural network renormalization group. Physical review letters 121 (26), pp. 260601. Cited by: §A.1, §1, §2, §4.1, §6.
 Selflearning monte carlo method. Physical Review B 95 (4), pp. 041101. Cited by: §1.
 Simulating the ising model with a deep convolutional generative adversarial network. arXiv preprint arXiv:1710.04987. Cited by: §2.
 An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831. Cited by: §6.
 Information, physics, and computation. Oxford University Press, Inc., New York, NY, USA. External Links: ISBN 019857083X, 9780198570837 Cited by: §1, §1, §1.
 Nature of the spinglass phase. Physical review letters 52 (13), pp. 1156. Cited by: §4.5.
 Engineering nonequilibrium quantum phase transitions via causally gapped Hamiltonians. New J. Phys. 20 (10), pp. 105002. External Links: Link Cited by: §6.
 The nature of computation. Oxford University Press, Inc.. External Links: ISBN 0199233217, 9780199233212, Link Cited by: §1, §1.
 Machine learning: a probabilistic perspective. MIT press. Cited by: footnote 5.
 Phase transitions and critical phenomena. Elements of Phase Transitions and Critical Phenomena, pp. 1–15. External Links: Document Cited by: §1, §1.
 Parallel wavenet: fast highfidelity speech synthesis. arXiv preprint arXiv:1711.10433. Cited by: §4.1.
 The order parameter for spin glasses: a function on the interval 01. Journal of Physics A: Mathematical and General 13 (3), pp. 1101. Cited by: §4.4.
 Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §1, §3.
 Selflearning monte carlo with deep neural networks. Physical Review B 97 (20), pp. 205140. Cited by: §1.
 Solvable model of a spinglass. Physical review letters 35 (26), pp. 1792. Cited by: §1, §1, §2.1.
 Spin glasses and complexity. Princeton University Press. External Links: ISBN 9780691147338, Link Cited by: §1.
 Discrete flows: invertible generative models of discrete data. In ICLR 2019 Workshop on Deep Generative Models for Highly Structured Data, Cited by: footnote 7.
 Continuous relaxations for discrete hamiltonian monte carlo. In Advances in Neural Information Processing Systems, pp. 3194–3202. Cited by: §2, footnote 7.
Appendix A Real nonvolume preserving flows
In this appendix we discuss the affine layers which compose real NVP flows in more detail and introduce some notation we found useful.
The efficient invertibility of the flow derives from the efficient invertibility of the affine layers, which in turn relies on the fact that each layer acts only on a subset of the variables. Denoting this subset as , and letting be the complement, then the action of a single affine layer on a variable is given by:
(24a)  
(24b) 
for , . Furthermore, each layer depends on two functions, , which we shall take to be neural networks. The inverse of an affine coupling layer may be easily worked out to be
(25a)  
(25b) 
Importantly, note that the functions need not be inverted in order to construct the inverse of the affine layer (in fact, they do not even need to be invertible). The Jacobian matrix can be shown to have a blocktriangular form, and therefore the determinant may be efficiently calculated to be:
(26) 
Real NVP flows may be constructed by composing multiple affine layers, each with their own . Denoting the th afffine layer as , the flow is defined as
(27) 
for , with the understanding that is the identity and corresponds to the complete flow. The output of each layer of the flow represents a random variable , with and . The determinant for the full generator map is then simply the product of the determinants for each individual affine layer:
(28) 
This equation, used in conjunction with the change of variable formula Eq. 10, allows the likelihood of a sample to be efficiently determined.
a.1 Implementation details
The details of the normalizing flow are as follows. For the prior we used an isotropic Gaussian with unit variance. The flow function
was composed of affine coupling layers. For each even layer , the set of variable indices which the flow acts upon nontrivially, i.e. , was chosen randomly. Each index was included inwith probability 1/2. For each odd layer, the complement was used,
, so that each variable is acted upon by the flow every 2 layers. Each layer used a separate set of neural networks, each with their own set of weights. The neural networks were taken to be multilayer perceptrons (MLPs) with 3 hidden layers and one final layer, each with
units.^{9}^{9}9As described here, the neural networks are functions from to , whereas in Appendix A we indicated that . This discrepancy is due to the fact that for ease of implementation we considered the full set of variables and used a binary mask to enforce the fact that the flow should only act upon the set of indices .Lastly, the activation function for each layer in both networks was taken to be LeakyRelu, except for the final layers. The final activation for
was , whereas the final activation for was the identity.For the reverse KLbased learning we employed a trick introduced by Li and Wang (2018), which encouraged the trained flow to approximately respect the reflection symmetry of the physical distribution, corresponding to invariance under . The trick consists of replacing in the objective function Eq. 17 with the symmetrized version .
In both cases we performed the training by minimizing the corresponding objective with the Adam optimizer Kingma and Ba (2014) with learning rate , and used batch sizes of 50. For the reverse KL training, we chose to perform 250,000 minibatch updates.^{10}^{10}10Note that in this case there is no dataset, and so the standard concept of training epochs does not apply.
For the forward KL training we also performed 250,000 updates, which is equivalent to 125 epochs (entire passes through the data) with each epoch consisting of 100,000/50 = 2000 minibatch updates.
Comments
There are no comments yet.