I Introduction
Machine learning has emerged as a powerful tool in the physical sciences, seeing both rapid adoption and experimentation in recent years. Already, there have been demonstrations of learning operators Mills and Tamblyn (2018), detecting unseen patterns within data Ch’ng et al. (2017), “discovering” physical equations Wang et al. (2019), and predicting trends within the scientific literature Tshitoyan et al. (2019). Within the field of statistical mechanics, machine learning was recently used to estimate the value of the partition function Desgranges and Delhommelle (2018) and to solve canonical models Wu et al. (2019), and generative models conditioned on the Boltzmann distribution have been shown to sample statistical mechanical ensembles efficiently Noé et al. (2019). The connection between physics and machine learning continues to strengthen, with new results appearing daily.
Despite these and other successes, machine learning has severe limitations. A major obstacle to the more widespread use of machine learning in the physical sciences is that learned models typically exhibit poor transferability. Using a model outside of the training or parameterization set can be unreliable Goodfellow et al. (2016). This, along with the lack of well-defined and well-behaved error estimates, can produce errors whose magnitude is often uncontrolled. While transferability can be somewhat improved through techniques such as regularization Srivastava et al. (2014), there is currently no general approach which can guarantee transferability to new conditions or ensure reliability of a model when it is presented with previously unseen data.
Here we demonstrate a new approach which overcomes the transferability problem by coupling machine learning concepts with physical principles in a new way. The approach, which we call distribution-consistent learning (DCL), enables fully transferable learning with a minimal number of observations: using it, we can collect observations at a single set of conditions, yet make accurate predictions for all others (including in different physical phases and across phase boundaries). Using DCL, it is possible to extrapolate observations over a wide range of conditions, including those far from the training set. This is achieved through the straightforward yet rigorous application of the physical concept of equilibrium and the postulate of the uniformity of physical law.
To explain DCL, we will first focus on simple classical spin models (ferromagnetic Ising and Potts models with coupling constant $J$) for the purposes of illustration and validation. For such simple cases, the application of statistical methods to “invert” observations is not new Nguyen et al. (2017); Valleti et al. (2019).
Conceptually, DCL is valid when ensemble probabilities are governed by known statistics (e.g. Boltzmann or Fermi distributions, which describe equilibrium and some near-equilibrium processes) and the interaction energies do not depend on the control parameters.
Finally, we demonstrate the generalizability of DCL explicitly by applying it, without modification, to an unsolved inversion problem: a semi-local “spin glass” where all couplings between neighbours are selected randomly from a Gaussian distribution.
II Probabilities at equilibrium
When a system is at equilibrium, its macroscopic properties can be interpreted as weighted averages over the character of a large number of microstates, $\sigma$. For an example such as a polymer, such microstates may be different molecular conformations, whereas for a spin system they are the different possible arrangements of up and down spins. Such microstates are visited with a probability given by the Boltzmann distribution. For the case of the canonical ensemble in particular (constant particle number, $N$, volume, $V$, and temperature, $T$), the ensemble probability of a particular microstate is given as
$$P(\sigma) = \frac{e^{-\beta H(\sigma)}}{Z}, \tag{1}$$
where $H$ is the classical Hamiltonian (and the energy, $E$, of $\sigma$ is given by $E = H(\sigma)$), $\beta$ is the inverse temperature, $\beta = 1/k_B T$, and $Z = \sum_{\sigma} e^{-\beta H(\sigma)}$ is the partition function.
We note that in principle, if one were able to observe an equilibrium system long enough, it would be possible to estimate the ensemble probabilities of each microstate simply by counting: count how often the system visits a particular microstate and divide by the number of observed events. This information, combined with the observation temperature ($T$) and Eqn. 1, would be sufficient to determine the energy difference between any two microstates $\sigma$ and $\sigma'$:
$$E(\sigma) - E(\sigma') = -\frac{1}{\beta} \ln \frac{P(\sigma)}{P(\sigma')}. \tag{2}$$
Of course, for all but trivial state spaces, such an approach is completely impractical. The probability of visiting microstates above the ground state at finite temperature is exponentially small, and the probability of visiting them multiple times within an observation period (which would be necessary in order to collect reliable estimates) is even smaller. Microstates visited during a macroscopic experiment represent only a vanishingly small fraction of the possible configurations which can be realized. Watching and counting is not an option.
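For a state space small enough to enumerate, the counting argument can be checked directly. The following is a minimal sketch, assuming a hypothetical three-spin open Ising chain with $k_B = 1$ and illustrative values for the coupling and observation temperature (none of these choices come from the study itself). It tallies visits from independent equilibrium samples and inverts the Boltzmann ratio to recover an energy gap.

```python
import itertools
import math
import random

J = 1.0        # ferromagnetic coupling (assumed units)
T_OBS = 2.0    # observation temperature in units of J/k_B (illustrative value)

def energy(spins):
    """Energy of an open chain of +/-1 spins with nearest-neighbour coupling J."""
    return -J * sum(a * b for a, b in zip(spins, spins[1:]))

# Enumerate every microstate of a 3-spin chain (2**3 = 8 states).
states = list(itertools.product([-1, 1], repeat=3))
weights = [math.exp(-energy(s) / T_OBS) for s in states]
Z = sum(weights)
exact_p = {s: w / Z for s, w in zip(states, weights)}

def delta_e(p_a, p_b, temperature):
    """E_a - E_b from two ensemble probabilities (Boltzmann ratio, k_B = 1)."""
    return -temperature * math.log(p_a / p_b)

# "Watch and count": draw independent equilibrium samples and tally visits.
rng = random.Random(0)
n_obs = 200_000
counts = {s: 0 for s in states}
for s in rng.choices(states, weights=weights, k=n_obs):
    counts[s] += 1
counted_p = {s: c / n_obs for s, c in counts.items()}

a, b = (1, 1, 1), (1, -1, 1)
true_gap = energy(a) - energy(b)
estimated_gap = delta_e(counted_p[a], counted_p[b], T_OBS)
print(true_gap, estimated_gap)
```

With only eight microstates, a few hundred thousand samples are enough for the counted estimate to land close to the true gap. The number of microstates doubles with every added spin, which is why this direct route fails for realistic lattices.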
III Essence of the approach
Rather than attempt to infer probabilities from counting, we instead consider the structure of the microstates themselves. It is generally known that interactions between spins can give rise to correlations on a range of scales, and that the emergence of such correlations is temperature dependent. In some sense, these correlations are similar to patterns that emerge in language due to the rules of grammar. We can “unwrap” each microstate into a sequence (Fig. 1) and analyze these sequences using language-based sequential machine learning methods. While many such sequence models exist, recurrent neural networks (RNN) are a particularly powerful deep learning technique which has seen significant use in recent years. Primarily, RNN have been used in natural language processing (e.g. predictive text and translation), as well as in predicting time-series data such as stock markets and weather. More recently, autoregressive RNN have been adapted to spatially structured inputs such as images van den Oord et al. (2016).
We first train an RNN on the microstates observed at a single, fixed experimental temperature; DCL requires only data collected at a single temperature. Formally, the precise temperature of observation does not matter, so long as the system is approximately ergodic. We tested this explicitly by increasing the observation temperature, which did not change our results qualitatively.
Our training set consists of observations as well as the thermodynamic conditions under which they were collected (i.e. the observation temperature). Importantly, the Hamiltonian operator and values of the energy are not included in our training data. This is analogous to what occurs experimentally: one does not generally have knowledge of the interaction potential between particles, but can always conduct an experiment at fixed conditions and observe which microstates are explored during such an experiment.
To train the RNN, we provided it with examples of spin microstates as unwrapped configurations (Fig. 1). We experimented with two “unwrapping” techniques (which we call “snake” and “spiral”); tests confirm that our results are not sensitive to this choice, so long as the procedure is used consistently. Our training procedure and network architecture follow standard techniques and are reported as Supplementary Information.
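To make the unwrapping step concrete, here is a minimal sketch of two possible traversal orders. The exact paths used in Fig. 1 are not reproduced in this text, so the definitions below (a boustrophedon row scan for “snake” and a clockwise inward walk for “spiral”) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def unwrap_snake(lattice):
    """Row-by-row traversal, reversing every other row so that consecutive
    sequence elements are also neighbours on the lattice."""
    rows = [row if i % 2 == 0 else row[::-1] for i, row in enumerate(lattice)]
    return np.concatenate(rows)

def unwrap_spiral(lattice):
    """Clockwise inward spiral starting from the top-left corner."""
    grid = np.array(lattice)
    out = []
    while grid.size:
        out.extend(grid[0])        # peel off the current top row
        grid = np.rot90(grid[1:])  # rotate the remainder counter-clockwise
    return np.array(out)

# Label each site of a 3x3 lattice by its row-major index to expose the path.
demo = np.arange(9).reshape(3, 3)
print(unwrap_snake(demo).tolist())   # [0, 1, 2, 5, 4, 3, 6, 7, 8]
print(unwrap_spiral(demo).tolist())  # [0, 1, 2, 5, 8, 7, 6, 3, 4]
```

Either order maps a lattice to a one-dimensional sequence. What matters is that the same mapping is applied during both training and evaluation.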
Once the RNN has been trained, we can use it to determine the probability $P(\sigma)$ of a particular microstate $\sigma$ by computing a series of conditional probabilities of the unwrapped sequence of spins $s_1, s_2, \ldots$. To do this, we initialize the RNN with the first spin value $s_1$ of $\sigma$ and record its prediction for $s_2$. Applying this now-initialized RNN model to $s_2$ produces a conditional probability prediction $P(s_3 \mid s_1, s_2)$. The total probability of $\sigma$ is found from the product of many conditional probabilities
$$P(\sigma) = \prod_{i=1}^{L^d} P(s_i \mid s_1, \ldots, s_{i-1}), \tag{3}$$
where $L$ is the linear dimension of the system and $d$ is the dimension.
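The chain-rule factorization can be illustrated without a trained network. In this minimal sketch a hand-written conditional stands in for the RNN: it conditions only on the immediately preceding spin, whereas an RNN carries the whole history in its hidden state. The function names and the parameters `w` and `b` are hypothetical, illustrative choices.

```python
import itertools
import math

def p_up(prev_spin, w=0.8, b=0.0):
    """Toy conditional P(next spin = +1 | previous spin); a stand-in for the RNN.
    The parameters w and b are arbitrary illustrative values."""
    x = b if prev_spin is None else w * prev_spin + b
    return 1.0 / (1.0 + math.exp(-2.0 * x))

def log_prob(sequence):
    """log P of an unwrapped sequence as a sum of log conditionals,
    i.e. the logarithm of the product of conditional probabilities."""
    total, prev = 0.0, None
    for spin in sequence:
        p = p_up(prev)
        total += math.log(p if spin == 1 else 1.0 - p)
        prev = spin
    return total

# Each conditional is normalised, so the product is normalised as well:
# summing P over all 2**n sequences should give 1.
n_spins = 6
norm = sum(math.exp(log_prob(s))
           for s in itertools.product([-1, 1], repeat=n_spins))
print(norm)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow for long sequences, and the log-probabilities are exactly what is needed when forming energy differences.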
For a two-state system (Ising), the RNN has only a single output since $P(\downarrow) = 1 - P(\uparrow)$. The Ising model has an input vector of size one (for the spin) and an output vector of the same size. We generalize this for the $q$-state Potts model, which has a one-hot input vector of size $q$ (either a 1 or a 0 in each element). The output is also a $q$-wide vector.
Using Eqn. 3, we can compare the probabilities of any two microstates, $\sigma$ and $\sigma'$. Inserting these into Eqn. 2 (combined with the temperature at which the samples were obtained) provides us with an estimate of the difference in internal energy between the microstates, $\Delta E = E(\sigma) - E(\sigma')$. We are able to compute such an estimate for any pair, including configurations which were never visited during the training process. Henceforth, we refer to this process of estimating $\Delta E$ as the RNN energy model.
IV Using the RNN Energy Model
The top row of Fig. 2 shows the performance of the RNN energy models across a range of energies for both the Ising and Potts models. In both cases, the RNN energy model does an excellent job of estimating the energy difference between any two microstates. We note that it is exactly this quantity which is needed to perform finite-temperature Metropolis Monte Carlo simulations to predict the thermodynamic properties of a spin system.
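The link between learned probabilities and a Metropolis simulation at a different temperature can be sketched as follows. This is an illustration under stated assumptions rather than the simulation code used for Fig. 2: $k_B = 1$, the function names are hypothetical, and a two-level system with known log-probabilities replaces the trained RNN.

```python
import math
import random

def delta_e_from_log_probs(log_p_old, log_p_new, t_obs):
    """E_new - E_old implied by the Boltzmann ratio (k_B = 1) for
    probabilities learned at the observation temperature t_obs."""
    return -t_obs * (log_p_new - log_p_old)

def metropolis_accept(log_p_old, log_p_new, t_obs, t_sim, rng):
    """Metropolis test at simulation temperature t_sim using the learned gap."""
    d_e = delta_e_from_log_probs(log_p_old, log_p_new, t_obs)
    return d_e <= 0.0 or rng.random() < math.exp(-d_e / t_sim)

# Two-level check: energies 0 and 1, "observed" at t_obs and simulated at t_sim.
t_obs, t_sim = 2.0, 0.5
log_p = {0: 0.0, 1: -1.0 / t_obs}  # unnormalised; the constant cancels in differences

rng = random.Random(1)
state, visits = 0, [0, 0]
for _ in range(100_000):
    trial = 1 - state
    if metropolis_accept(log_p[state], log_p[trial], t_obs, t_sim, rng):
        state = trial
    visits[state] += 1

ratio = visits[1] / visits[0]  # should approach exp(-1 / t_sim)
print(ratio, math.exp(-1.0 / t_sim))
```

Because only differences of log-probabilities enter, the normalisation of the learned distribution never has to be known. The simulation temperature appears only in the acceptance test, which is what allows observations at one temperature to drive sampling at another.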
Using these models, we carry out such numerical simulations under conditions which are far away from the training set. When we compare results for the phase transitions generated using the true Hamiltonian with those generated with the RNN energy models, we see excellent agreement for both Ising and Potts models (Fig. 2, bottom row). This confirms that the RNN model errors in the predicted energy differences are small enough that they do not alter the essential physics of the systems under study.
Since the RNN already allows us to estimate ensemble probabilities (and thus energy labels from observation), we might think that nothing more can be extracted from our initial observations. It turns out, however, that we can incorporate more physics knowledge through the use of a second neural network. We will demonstrate that incorporating this second network will both overcome some limitations of the RNN energy model and improve the accuracy of our predictions without any new observations or labels.
One of the most obvious disadvantages of the RNN energy model is that it has no concept of locality with respect to energy updates. Whenever we wish to know the difference in energy between two microstates, we must reevaluate all of the conditional probabilities which make up their total probabilities. For interactions which are short ranged (as is the case for both the Ising and Potts models), this is very inefficient. In conventional simulations of such systems, it is customary to reevaluate only interactions which have changed as a result of the MC trial move. With the RNN energy model, there is no general way to achieve such “local” energy updates. This is because, for spin flips at some locations in the lattice, one may need to recompute only some conditionals, but in general one has to recompute an extensive number of them.
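For comparison, the conventional “local” update mentioned above looks like this when the Hamiltonian is known. The sketch assumes a ferromagnetic nearest-neighbour Ising model on a periodic square lattice with an illustrative coupling $J = 1$; it shows that the energy change from a single spin flip depends only on the four neighbouring spins, so its cost does not grow with system size.

```python
import numpy as np

J = 1.0  # ferromagnetic coupling (assumed units)

def total_energy(spins):
    """Nearest-neighbour Ising energy on a periodic square lattice.
    Each bond is counted once via the right and down neighbours."""
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -J * np.sum(spins * (right + down))

def local_delta_e(spins, i, j):
    """Energy change from flipping spin (i, j), using only its four neighbours."""
    n, m = spins.shape
    neighbours = (spins[(i - 1) % n, j] + spins[(i + 1) % n, j]
                  + spins[i, (j - 1) % m] + spins[i, (j + 1) % m])
    return 2.0 * J * spins[i, j] * neighbours

rng = np.random.default_rng(0)
lattice = rng.choice([-1, 1], size=(8, 8))

flipped = lattice.copy()
flipped[3, 5] *= -1
full = total_energy(flipped) - total_energy(lattice)  # cost grows with lattice size
local = local_delta_e(lattice, 3, 5)                  # cost is constant
print(full, local)
```

An autoregressive probability model offers no such shortcut in general, because a flipped spin changes the conditioning history of every spin that follows it in the unwrapped sequence.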
Another limitation is that the RNN energy model is only able to make predictions for system sizes which are the same as those within the training set. Ideally, one would like to be able to observe a small system, learn something from it, and then make predictions for a larger case.
V The EDNN Energy Model
In previous work Mills et al. (2019), we showed that with the proper construction, neural networks have the ability to directly learn the locality length scale of operators such as the Hamiltonian. By locality length scale, we mean the amount of information in the neighbourhood of a focus region which is necessary to compute extensive properties. Magnetization, for example, is a fully local operator: it is possible to evaluate it separately for every site in the system, record the values, and sum them at the end (since it is an extensive quantity). For a nearest-neighbour spin model, additional context, i.e. the values of the spins in the neighbourhood, is needed in order to determine site energies. Using an extensive deep neural network (EDNN), these locality scales can be learned directly from the data through hyperparameter optimization.
Initially, our motivation for using an EDNN in this investigation was to be able to learn from small-scale systems (a small spin model in this case) and apply that learning to a larger system. Training an EDNN on labels produced by the RNN energy model yields an EDNN energy model. As expected, the EDNN is able to take the small-scale examples and transfer that learning to larger systems (Fig. 3 shows the performance of the EDNN energy model). Creating an EDNN energy model had another, unforeseen benefit, however.
By construction, EDNN topologies require that physical laws are the same everywhere. They are designed to learn a function which, when applied across a configuration, maps the sum of outputs to an extensive quantity such as the internal energy. Interestingly, in this case, we find that this physics-based network design requirement results in improved performance in predicting the underlying interactions even when the labels used in training have noise introduced by the imperfect RNN energy model.
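The structural idea, one shared local function applied to every tile and summed, can be sketched in a few lines. This is not the EDNN implementation of Mills et al. (2019): the learned network is replaced here by a hand-written Ising site energy, the helper names are hypothetical, and the lattice size is assumed to be a multiple of the focus size.

```python
import numpy as np

def tiles(config, focus=1, context=1):
    """Yield one square tile per focus region, padded by `context` sites on
    every side, using periodic boundaries."""
    padded = np.pad(config, context, mode="wrap")
    width = focus + 2 * context
    n, m = config.shape
    for i in range(0, n, focus):
        for j in range(0, m, focus):
            yield padded[i:i + width, j:j + width]

def extensive_sum(config, local_fn, focus=1, context=1):
    """Apply the same local function to every tile and sum the outputs."""
    return sum(local_fn(t) for t in tiles(config, focus, context))

def ising_site_energy(tile, J=1.0):
    """Hand-written stand-in for the learned network: energy assigned to the
    centre spin of a 3x3 tile (half of each bond, to avoid double counting)."""
    c = tile[1, 1]
    return -0.5 * J * c * (tile[0, 1] + tile[2, 1] + tile[1, 0] + tile[1, 2])

# The same local function works unchanged on lattices of different size.
small = extensive_sum(np.ones((4, 4)), ising_site_energy)
large = extensive_sum(np.ones((16, 16)), ising_site_energy)
print(small, large)
```

Because the local function never sees the global lattice dimensions, a function fitted on small configurations can be evaluated on larger ones, and the summed output scales with system size as an extensive quantity should.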
As discussed above, we first train an RNN to predict energy differences. In the second step, we train an EDNN to reproduce predictions from the RNN. The EDNN has never seen labels other than those estimated by the RNN. Despite this, when we compare EDNN predictions with the true values, it performs better than the RNN itself. We suspect that the physics-based construction of the EDNN enables it to see through noise introduced by the imperfect RNN and achieve a better estimate of the underlying operator (the root mean squared error, RMSE, of the RNN energy model for Ising is 2.06J compared with 1.47J for the EDNN, Fig. 4). The EDNN builds the postulate of uniformity of physical law into our training procedure; our performance improves as a result.
VI A general method
As a demonstration of the generality of DCL, we now consider the case of a spin glass where interactions between neighbours are sampled from a random distribution (the Edwards-Anderson model). Even with a much more complex and rich Hamiltonian, the RNN is able to learn from observations at only a single fixed temperature, yet can be used to make accurate predictions across a wide range of conditions. Again, only unlabelled observations at a single fixed temperature are required to determine the behaviour of the system under arbitrary conditions (Fig. 5; see S.I. for another example of random couplings).
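For readers unfamiliar with the model, the following is a minimal sketch of an Edwards-Anderson energy function with Gaussian nearest-neighbour couplings. The two-dimensional periodic lattice, the default mean and standard deviation, and the function names are all assumptions made for illustration; the parameters of the distribution used in the study are not recoverable from this text.

```python
import numpy as np

def sample_couplings(n, rng, mean=0.0, std=1.0):
    """Independent Gaussian couplings for the right and down bond of each site.
    The mean and standard deviation are placeholder values."""
    j_right = rng.normal(mean, std, size=(n, n))
    j_down = rng.normal(mean, std, size=(n, n))
    return j_right, j_down

def ea_energy(spins, j_right, j_down):
    """Edwards-Anderson energy on a periodic square lattice: each bond carries
    its own coupling instead of a single constant J."""
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -np.sum(spins * (j_right * right + j_down * down))

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(6, 6))
j_right, j_down = sample_couplings(6, rng)
print(ea_energy(spins, j_right, j_down))
```

Setting every coupling to the same positive constant recovers the ferromagnetic Ising model, which makes the spin glass a natural stress test: the learner must now infer a separate interaction strength for every bond rather than a single number.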
By applying DCL to a spin glass, we have demonstrated the ability to effectively learn the Hamiltonian operator directly without ever knowing its form or symmetries, or seeing directly labelled examples. This is the first time such an inversion has been demonstrated for nontrivial systems.
VII Conclusion
With knowledge only of the experimental constraints (i.e. the statistical ensemble) and a statistically significant number of observations from only a single set of thermodynamic conditions, distribution-consistent learning (DCL) is able to extract enough information about a physical system to fully predict its behaviour over a wide range of unseen conditions, across different length scales, and through phase transitions.
DCL can predict the relative energy of microstates with sufficient accuracy that it can be used to reproduce the energetic cost of excitations between states in a size-consistent manner. The model consists of two deep neural network topologies which are able to learn cooperatively. By using a combination of deep learning and physical constraints, we have shown that full transferability is possible. We expect that there will be many applications of this new method, including optical lattices and the growth of molecules on surfaces, among others.
VIII Acknowledgments
The authors acknowledge useful discussions with M. Fraser. IT and JC acknowledge support from NSERC. JC acknowledges support from SHARCNET, Compute Canada, and the Canada CIFAR AI chair program. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958. Work carried out at NRC was done under the auspices of the AI4D Program. Work by SW was performed as part of a user project at the Molecular Foundry, Lawrence Berkeley National Laboratory, supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
IX Supplementary information
Details of neural network topologies, training data, as well as additional tests and examples are outlined below.
X Neural network and training details
We used 8 configurations per spin for our training data (generated from two independent seeds). Given the high temperature of our observation, our simulations equilibrated rapidly. For test data, we sampled all possible energies (8000 samples from each). At the extremes, this is equivalent to reselecting certain configurations over and over again (consistent with what occurs physically). Test and train data were chosen such that they did not overlap. Our RNN are very simple, consisting of a single gated recurrent unit Cho et al. (2014) (we also tried up to 4 stacked units, which gave only a modest improvement in our results). The size of the hidden state was between 378 and 512 neurons. All RNN models converged quickly; results reported here are based on only 30 epochs (with a range of learning rates and batch sizes between 4000 and 8000). For all models, we used a dropout rate of 0.9. All of our networks are implemented in TensorFlow and are available online along with training data (http://clean.energyscience.ca/codes).
For the EDNN, we used random spin configurations for training data and achieved a converged result within only 60 epochs for all models. The EDNN was built using a previously reported architecture Mills et al. (2019), with 2 fully connected layers of 32 and 64 neurons, respectively. Throughout, we used rectified linear units (ReLU) as activation functions. Our goal was to use a simple and consistent set of parameters and training; it is very likely that there exist better choices of hyperparameters than those presented here.
We also note that the RNN we have used here are very simple in form. Recently, attention mechanisms Bahdanau et al. (2014); Kim et al. (2017) have been shown to improve the performance of sequence models significantly (e.g. reducing the number of samples needed to achieve fixed fidelity). We expect that more advanced sequence models, including attention-only “Transformer” networks Vaswani et al. (2017) and related models, could also be of benefit here, particularly with experimental data.
References
Bahdanau et al. (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
Ch’ng et al. (2017) Machine Learning Phases of Strongly Correlated Fermions. Physical Review X 7 (3), pp. 031038. arXiv:1609.02552, ISSN 2160-3308.
Cho et al. (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv:1409.1259.
Desgranges and Delhommelle (2018) A new approach for the prediction of partition functions using machine learning techniques. The Journal of Chemical Physics 149 (4), pp. 044118.
Goodfellow et al. (2016) Deep learning. The MIT Press. ISBN 0262035618, 9780262035613.
Kim et al. (2017) Structured attention networks. arXiv:1702.00887.
Mills et al. (2019) Extensive deep neural networks for transferring small scale learning to large scale systems. Chem. Sci. 10, pp. 4129–4140.
Mills and Tamblyn (2018) Deep neural networks for direct, featureless learning through observation: The case of two-dimensional spin models. Physical Review E 97 (3), pp. 032119. arXiv:1706.09779v1, ISSN 2470-0045.
Nguyen et al. (2017) Inverse statistical problems: from the inverse Ising problem to data science. Advances in Physics 66 (3), pp. 197–261. https://doi.org/10.1080/00018732.2017.1341604
Noé et al. (2019) Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 365 (6457), pp. eaaw1147. ISSN 0036-8075, 1095-9203.
Srivastava et al. (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
Tshitoyan et al. (2019) Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 (7763), pp. 95–98. ISSN 1476-4687.
Valleti et al. (2019) Inversion of lattice models from the observations of microscopic degrees of freedom: parameter estimation with uncertainty quantification. arXiv:1909.09244.
van den Oord et al. (2016) Pixel recurrent neural networks. arXiv:1601.06759.
Vaswani et al. (2017) Attention is all you need. arXiv:1706.03762.
Wang et al. (2019) Emergent Schrödinger equation in an introspective machine learning architecture. Science Bulletin 64 (17), pp. 1228–1233. ISSN 2095-9273.
Wu et al. (2019) Solving statistical mechanics using variational autoregressive networks. Phys. Rev. Lett. 122, pp. 080602.