Short sighted deep learning

02/07/2020 ∙ by Ellen de Mello Koch, et al. ∙ University of the Witwatersrand

A theory explaining how deep learning works is yet to be developed. Previous work suggests that deep learning performs a coarse graining, similar in spirit to the renormalization group (RG). This idea has been explored in the setting of a local (nearest neighbor interactions) Ising spin lattice. We extend the discussion to the setting of a long range spin lattice. Markov Chain Monte Carlo (MCMC) simulations determine both the critical temperature and scaling dimensions of the system. The model is used to train both a single RBM (restricted Boltzmann machine) network, as well as a stacked RBM network. Following earlier Ising model studies, the trained weights of a single layer RBM network define a flow of lattice models. In contrast to results for nearest neighbor Ising, the RBM flow for the long ranged model does not converge to the correct values for the spin and energy scaling dimension. Further, correlation functions between visible and hidden nodes exhibit key differences between the stacked RBM and RG flows. The stacked RBM flow appears to move towards low temperatures whereas the RG flow moves towards high temperature. This again differs from results obtained for nearest neighbor Ising.


I Introduction

In recent years machine learning has shown remarkable success in a wide range of problems. These algorithms can outperform humans at a number of tasks, including image classification and complex strategy games Krizhevsky et al. (2012); Vinyals et al. (2019); Ranzato et al. (2010); Salakhutdinov et al. (2007); Le Roux and Bengio (2010, 2008). In many fields, efforts to harness the power machine learning offers are being pursued. In physics specifically, machine learning is being explored as a tool to uncover properties of systems that are difficult to study analytically Carrasquilla and Melko (2017); Wang (2016); Deng et al. (2017); Broecker et al. (2017); Aoki and Kobayashi (2016); Nomura et al. (2017); Morningstar and Melko (2017); Howard (2018). However, there is at present no convincing explanation of how these networks learn and why they can achieve such remarkable feats Paul and Venkatasubramanian (2014); Zhang et al. (2016); Jordan and Mitchell (2015); Lin et al. (2017).

Recent studies have aimed to gain insight into how and why deep learning works by employing the statistical mechanics of spin systems as simple toy models Mehta and Schwab (2014); Iso et al. (2018); Koch et al. (2019); Li and Wang (2018); Kim and Kim (2018); Funai and Giataganas (2018); Koch-Janusz and Ringel (2018); Bény (2013). The idea is that deep learning, which is a form of coarse graining, is related to the renormalization group (RG) of statistical mechanics and quantum field theory. This is a natural idea since both problems are concerned with reducing many degrees of freedom to an effective description employing far fewer degrees of freedom. Spin systems are an ideal setting in which to explore this proposal: they are simple and their statistical mechanics is well understood, making them a reasonable laboratory for deep learning studies. Further, spin systems are not far removed from more standard applications of deep learning such as image analysis: the state of a spin system is composed of local interacting patches, much in the same way that an image is made up of local collections of highly correlated pixels Saremi and Sejnowski (2013).

Although a number of recent studies have explored the link between Kadanoff’s variational renormalization group (RG) and deep learning Mehta and Schwab (2014); Iso et al. (2018); Koch et al. (2019); Li and Wang (2018); Kim and Kim (2018); Funai and Giataganas (2018); Koch-Janusz and Ringel (2018); Bény (2013), it is still not settled whether such a link exists. Studies have so far focused on nearest neighbor Ising models. We add to this discussion by considering long range interacting spin systems which, as we will see, exhibit important differences from their short ranged cousins.

The paper Iso et al. (2018) trained three restricted Boltzmann machine (RBM) networks on Ising model data. The first was trained on data generated at temperatures ranging from above to below the critical temperature, the second on data above the critical temperature and the third on data below the critical temperature. The trained weights of each network were used to define a map from the visible spins back to the visible spins. Iterating this map defines an RBM flow. Results produced by the first network show interesting behavior, suggesting that RBMs learn the critical temperature in the sense that the RBM flow terminates on the critical point of the Ising model. This is in contrast to RG flows: the critical point of the Ising model is an unstable fixed point and fine tuning is required to construct an RG flow trajectory that terminates on this critical point.

In a follow up paper, Koch et al. (2019), scaling dimensions are used to provide precise quantitative tests that probe whether or not the network reproduces the critical point dynamics. The approach recognizes that the critical point of the Ising model is described by a conformal field theory (CFT), so that correlation functions of the spins in patterns generated by the RBM can be compared to known CFT predictions. The results of Koch et al. (2019) show that the network correctly reproduces the smallest scaling dimension, which belongs to the Ising spin itself. The network does not correctly reproduce the next smallest scaling dimension, which is related to the energy density. This implies that although the RBM has correctly learned the largest scale correlations in the critical spin states, it fails on shorter length scales.

Our goal is to repeat the above investigations for a two-dimensional long range spin lattice. Training a network on a system with a different range of interactions allows us to probe how robust the claim is that RBMs learn the critical temperature of a given spin system, and to ask again what length scales are captured by training. We use Markov Chain Monte Carlo (MCMC) methods to determine the critical temperature of the long range spin lattice and the corresponding scaling dimensions, as well as to generate the RBM training data. Our results suggest that the RBM flow does not converge to the critical temperature. Further, stacking RBMs does not appear to generate a flow that matches even the coarse features of RG flows. Consequently, one lesson we learn is that it is dangerous to draw general conclusions from numerical experiments conducted on a single system. An interesting question which suggests itself (but which we will not manage to answer) is to which classes of models a given conclusion applies.

The long range spin lattice we study is described in Section I.1. A quick overview of RG and RBMs is given in sections I.2 and I.3. MCMC is discussed in Section I.4 along with the numerical approximations made for various properties of the long range spin lattice. The results of our study are presented in Section II.

I.1 Long range spin lattice model

The long range spin lattice is a model of a magnet made up of binary spins. In the two-dimensional model we have a square lattice of sites with a spin at each site. Each spin, s_i, can take the values ±1, and each site i is labelled by a two-dimensional vector of integer coordinates \vec{r}_i. Spins on sites i and j interact with a strength J_{ij}. Each spin also interacts with an external magnetic field, with strength B.

The Hamiltonian of the long range spin lattice model is given by

H = -\sum_{i<j} J_{ij} s_i s_j - B \sum_i s_i .   (1)

The interaction strength is inversely proportional to a power of the distance between sites i and j

J_{ij} = \frac{J}{|\vec{r}_i - \vec{r}_j|^{\sigma}} ,   (2)

where σ sets the range of the interaction. The constants J and σ are held fixed in all of our numerical experiments.
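To make the model concrete, the following sketch evaluates the Hamiltonian of equation (1) for a single configuration. It is a minimal illustration only: the values of J, σ and B are placeholders, not the parameters used in our simulations, and open (non-periodic) boundaries are assumed.

```python
import numpy as np

def couplings(L, J=1.0, sigma=3.0):
    """Matrix J_ij = J / |r_i - r_j|^sigma for an L x L lattice (J, sigma illustrative).
    The diagonal is left at zero since a spin does not interact with itself."""
    coords = np.array([(x, y) for x in range(L) for y in range(L)], dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    Jmat = np.zeros((L * L, L * L))
    mask = dist > 0
    Jmat[mask] = J / dist[mask] ** sigma
    return Jmat

def energy(spins, Jmat, B=0.0):
    """H = -(1/2) sum_{i != j} J_ij s_i s_j - B sum_i s_i, i.e. equation (1);
    spins is a flat array of +-1 and the 1/2 compensates for double counting."""
    return -0.5 * spins @ Jmat @ spins - B * spins.sum()

# usage: a random 10 x 10 configuration
L = 10
s = np.random.choice([-1, 1], size=L * L)
print(energy(s, couplings(L)))
```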

For the two-dimensional nearest neighbour Ising model, analytic results give precise values for the critical temperature and scaling dimensions. When moving to models such as the long range spin lattice, no analytic results exist and we thus rely on MCMC methods to determine these properties before training the RBM Geyer (1992).

The critical point of the long range spin system will again be described by a CFT. The two point functions of primary operators of this CFT have the usual power law fall off, something that can be explored numerically. The two point functions of primary operators at the fixed point of the spin lattice are of the form Poland et al. (2019)

\langle \mathcal{O}(\vec{r}_1) \mathcal{O}(\vec{r}_2) \rangle = \frac{C}{|\vec{r}_1 - \vec{r}_2|^{2\Delta}} ,   (3)

where Δ is the scaling dimension and C is a constant. To probe the largest scales of the patterns generated by the RBM we focus on the two operators of lowest dimension. The first operator is the spin s(\vec{r}) itself. The two point correlator between spins, with scaling dimension Δ_s, takes the form

\langle s(\vec{r}_1)\, s(\vec{r}_2) \rangle = \frac{C_s}{|\vec{r}_1 - \vec{r}_2|^{2\Delta_s}} ,   (4)

with Δ_s to be determined numerically. The second operator of interest is the energy density ε(\vec{r}), for which we study the correlator with scaling dimension Δ_ε

\langle \varepsilon(\vec{r}_1)\, \varepsilon(\vec{r}_2) \rangle = \frac{C_\varepsilon}{|\vec{r}_1 - \vec{r}_2|^{2\Delta_\varepsilon}} ,   (5)

with the correlators estimated from the sampled configurations as

\langle s(\vec{r}_1)\, s(\vec{r}_2) \rangle \approx \frac{1}{N} \sum_{a=1}^{N} s^{(a)}(\vec{r}_1)\, s^{(a)}(\vec{r}_2) ,   (6)
\langle \varepsilon(\vec{r}_1)\, \varepsilon(\vec{r}_2) \rangle \approx \frac{1}{N} \sum_{a=1}^{N} \varepsilon^{(a)}(\vec{r}_1)\, \varepsilon^{(a)}(\vec{r}_2) ,   (7)

where N is the number of sample configurations summed and a denotes the a-th sample. We use MCMC sampling methods to determine the critical temperature of this model, which is found to be approximately 7.7 (details are discussed in Section I.4).
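As a rough illustration of how a scaling dimension is extracted, the sketch below estimates the spin two point function from a stack of sampled configurations and fits the power law of equation (4) on a log-log scale. It assumes periodic boundaries and a simple axis-aligned separation for simplicity; the sample array is a hypothetical MCMC output.

```python
import numpy as np

def spin_correlator(samples, L):
    """Estimate <s(0) s(r)> along one lattice axis for r = 1 .. L//2, averaged over
    samples and reference sites.  samples has shape (N, L, L); periodic wrap assumed."""
    rs = np.arange(1, L // 2 + 1)
    corr = [np.mean(samples * np.roll(samples, r, axis=2)) for r in rs]
    return rs, np.array(corr)

def fit_scaling_dimension(rs, corr):
    """Fit log corr = log C - 2*Delta*log r (equation (4)) and return Delta."""
    slope, _ = np.polyfit(np.log(rs), np.log(np.abs(corr)), 1)
    return -slope / 2.0

# usage with a hypothetical array `samples` of shape (2000, 10, 10):
# rs, corr = spin_correlator(samples, L=10)
# print(fit_scaling_dimension(rs, corr))
```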

I.2 RG

RG defines a coarse graining transformation which is applied repeatedly to a microscopic description of a system to eventually arrive at a macroscopic description of the system Wilson and Kogut (1974); Efrati et al. (2014). Each application of RG results in a rescaling in which unimportant degrees of freedom are removed and important degrees of freedom remain. The degrees of freedom, or operators, are characterized as relevant, irrelevant or marginal. At each step of RG the couplings between variables change and we obtain a new coarse grained Hamiltonian Wilson and Kogut (1974); the only change at each step is this update of the couplings. Relevant couplings grow and irrelevant couplings shrink as more steps of coarse graining are applied, while marginal couplings remain unchanged.

In this paper we study a discrete spin lattice and so are interested in a discrete version of RG, defined by Kadanoff, which performs a block spin average Kadanoff et al. (1976). Block spin averaging takes four spins, shown in red in Figure 1, and averages them to produce the blue spin at the centre of each group of four red spins. The average value of the red spins becomes the value of the new central blue spin. By performing this averaging we rescale lengths in the system by a factor of 2.

Figure 1: Block-spin averaging as introduced by Kadanoff. Groups of four red sites are averaged to obtain blue sites within their centre.
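A minimal sketch of the block spin average of Figure 1: each 2 by 2 block of spins is replaced by a single coarse spin carrying the sign of the block average. The tie-breaking rule (ties go to +1) is our assumption; in practice ties are rare and could also be broken at random.

```python
import numpy as np

def block_spin(spins):
    """Coarse grain an L x L array of +-1 spins (L even) to an (L/2) x (L/2) array
    by averaging each 2 x 2 block and taking the sign; ties are broken towards +1."""
    L = spins.shape[0]
    blocks = spins.reshape(L // 2, 2, L // 2, 2).mean(axis=(1, 3))
    return np.where(blocks >= 0, 1, -1)

# usage: three RG steps on a 16 x 16 configuration
s = np.random.choice([-1, 1], size=(16, 16))
for _ in range(3):
    s = block_spin(s)
print(s.shape)  # (2, 2)
```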

The RG flow halts at a critical point where the system exhibits scale invariant properties. This is because all couplings remain unchanged regardless of the length scale studied Poland et al. (2019). A basic observable of the fixed point is the spectrum of scaling dimensions.

I.3 RBM

Restricted Boltzmann machines are made up of two layers of nodes, an input layer called the visible layer and an output layer called the hidden layer. The number of nodes in the visible layer is determined by the number of data points in each training sample. Selecting the number of nodes in the hidden layer is an iterative process which tries to select the architecture that gives the best results. Nodes in both layers take binary values. Every visible node is connected to every hidden node, and connections between nodes in the same layer are forbidden. The connection between visible node v_i and hidden node h_j has an associated weight W_{ij}. Each visible node and each hidden node has an associated bias, b_i and c_j respectively.

The energy function of an RBM is given by

E(v, h) = -\sum_{i=1}^{n_v} \sum_{j=1}^{n_h} v_i W_{ij} h_j - \sum_{i=1}^{n_v} b_i v_i - \sum_{j=1}^{n_h} c_j h_j ,   (8)

where n_v is the number of visible nodes and n_h is the number of hidden nodes. This energy determines the probability of a visible and hidden vector occurring together, which is given by

p(v, h) = \frac{e^{-E(v,h)}}{Z} ,   (9)

where

Z = \sum_{v, h} e^{-E(v,h)}   (10)

is the partition function. The goal of training is to match the distribution q(v) of the input data to the distribution p(v) generated by the RBM model Hinton (2012), where

p(v) = \sum_{h} p(v, h) = \frac{1}{Z} \sum_{h} e^{-E(v,h)} .   (11)

To achieve this the weights and biases are updated using derivatives of the Kullback-Leibler divergence

KL(q \,\|\, p) = \sum_{v} q(v) \log \frac{q(v)}{p(v)} .   (12)
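For a very small RBM the probability of equation (9) can be evaluated exactly by brute force. The sketch below makes equations (8)-(10) explicit, assuming binary 0/1 units (our convention choice), and also makes clear why this enumeration is hopeless at realistic sizes.

```python
import numpy as np
from itertools import product

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -v.W.h - b.v - c.h, i.e. equation (8)."""
    return -(v @ W @ h) - b @ v - c @ h

def partition_function(W, b, c):
    """Brute-force Z = sum over all binary v and h of exp(-E(v, h)) (equation (10));
    the number of terms grows as 2^(n_v + n_h), so this only works for tiny networks."""
    nv, nh = W.shape
    Z = 0.0
    for v in product([0, 1], repeat=nv):
        for h in product([0, 1], repeat=nh):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), W, b, c))
    return Z

# usage: a 4-visible, 2-hidden RBM with random parameters
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(4, 2)), rng.normal(size=4), rng.normal(size=2)
Z = partition_function(W, b, c)
v, h = np.array([1, 0, 1, 1]), np.array([0, 1])
print(np.exp(-rbm_energy(v, h, W, b, c)) / Z)  # p(v, h), equation (9)
```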

Computation of the partition function Z is practically impossible due to the enormous number of states summed over. To overcome this, an approximate method, contrastive divergence, is used to train the network Hinton (2002, 2012); Sutskever and Tieleman (2010). Rather than summing over the entire space of vectors, we sum only over the vectors given in the training set Carreira-Perpinan and Hinton (2005). During training the weights and biases are updated according to the following update rules

\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\rm data} - \langle v_i h_j \rangle_{\rm model} \right) ,   (13)
\Delta b_i = \eta \left( \langle v_i \rangle_{\rm data} - \langle v_i \rangle_{\rm model} \right) ,   (14)
\Delta c_j = \eta \left( \langle h_j \rangle_{\rm data} - \langle h_j \rangle_{\rm model} \right) ,   (15)

where η is the learning rate.

We use values from the training data set to determine the expectations ⟨·⟩_data, and values from the model to determine the expectations ⟨·⟩_model. To obtain ⟨v_i⟩_data we simply use the training data set; however, we do not have a set of hidden vectors in the training set with which to determine ⟨h_j⟩_data or ⟨v_i h_j⟩_data. To find a set of data hidden vectors to use in the data expectations, we sample hidden vectors from the training data with the probability given in equation (16)

p(h_j = 1 \,|\, v) = \sigma\!\left( \sum_i v_i W_{ij} + c_j \right) , \qquad \sigma(x) = \frac{1}{1 + e^{-x}} .   (16)

We also do not have a set of visible and hidden vectors for the model expectation values. We use the data hidden vectors (generated from the training data using equation (16)) to generate model visible vectors with equation (17)

p(v_i = 1 \,|\, h) = \sigma\!\left( \sum_j W_{ij} h_j + b_i \right) .   (17)

Then, using these model visible vectors, we generate a set of model hidden vectors using equation (16). This is done at each step of training in order to calculate the updates given by equations (13), (14) and (15).
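The sketch below implements a single contrastive divergence (CD-1) update along the lines of equations (13)-(17). The 0/1 unit convention, the learning rate and the single Gibbs step are assumptions of this illustration rather than the exact training settings used here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b, c, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 step on a batch of binary visible vectors v_data of shape (batch, n_v)."""
    # data expectations: sample hidden vectors from the data (equation (16))
    ph_data = sigmoid(v_data @ W + c)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # model expectations: reconstruct visibles (equation (17)), then resample hiddens
    pv_model = sigmoid(h_data @ W.T + b)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + c)
    # parameter updates (equations (13)-(15)), averaged over the batch
    n = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / n
    b += lr * (v_data - v_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c
```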

Once training is complete, we can generate configurations, which we call an RBM flow, using the trained RBM Iso et al. (2018); Funai and Giataganas (2018). Starting from a set of visible vectors (which could be the initial training data or MCMC data at a specific temperature), we generate a flow of visible and hidden vectors by sampling with the probabilities given in equations (16) and (17), each time using the most recently generated set of vectors

v_0 \to h_0 \to v_1 \to h_1 \to v_2 \to \cdots .   (18)

To get h_0 we sample hidden nodes given the visible vectors v_0. To get v_1 we sample visible nodes given the hidden vectors h_0. Iterating this process generates an RBM flow. In Section II.1 we discuss results derived from RBM flows.
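A sketch of the RBM flow of equation (18): starting from a set of visible configurations, we repeatedly sample hidden and then visible vectors with the trained weights. Function and variable names are illustrative, and binary 0/1 units are again assumed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_flow(v0, W, b, c, steps=10, rng=np.random.default_rng()):
    """Generate the RBM flow v0 -> h0 -> v1 -> h1 -> ... and return the visible
    vectors at each step.  v0 has shape (num_samples, n_v)."""
    flow = [v0]
    v = v0
    for _ in range(steps):
        h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(v @ W + c)).astype(float)    # eq. (16)
        v = (rng.random((h.shape[0], W.shape[0])) < sigmoid(h @ W.T + b)).astype(float)  # eq. (17)
        flow.append(v)
    return flow
```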

I.4 Markov chain Monte Carlo

Studying the long range spin lattice analytically is difficult. We rely on MCMC numerical simulations to determine key statistical properties of the fixed point of the long range spin lattice. The fixed point theory is a conformal field theory, with a special class of operators, primary operators, whose two point functions have a power law fall off. Probing these correlation functions provides detailed quantitative tests for the RBM output. We are most interested in two primary operators: the basic spin and energy density operators given in equations (4) and (5).

We begin with a randomly initialized lattice, with each spin taking the value ±1, and a fixed temperature T. A single step of MCMC consists of the following:

  1. Randomly select a site i.

  2. Determine the change in energy, ΔE, that results from flipping the spin s_i.

  3. If ΔE ≤ 0, accept the new configuration with probability 1.

  4. If ΔE > 0, accept the new configuration with probability e^{-ΔE/T}.

We can generate a chain of lattice states by repeating the above process. Starting from a randomly initialized lattice means that a number of MCMC steps are required before we reach states that resemble states at the temperature we are concerned with. To ensure we sample valid states, we allow a burn-in period of sufficiently many MCMC steps before keeping the samples generated. The number of steps required for the burn-in period is not known precisely, but a good rule of thumb is to measure observables such as the magnetization or specific heat of the samples generated and ensure these values have stabilized before keeping samples.
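A sketch of this sampling loop, specialised to the long range couplings of equation (2) (the coupling matrix here is the one built in the earlier Hamiltonian sketch, with zero diagonal). The temperature, burn-in length and zero external field are placeholder choices, and no thinning between kept samples is applied.

```python
import numpy as np

def metropolis_chain(L, Jmat, T, n_samples, burn_in=50_000, B=0.0,
                     rng=np.random.default_rng()):
    """Metropolis sampling of the long range model.  Jmat is the (L*L) x (L*L)
    coupling matrix J_ij with zero diagonal; spins take values +-1.  Returns
    n_samples configurations collected after burn_in steps."""
    s = rng.choice([-1, 1], size=L * L)
    samples = []
    for step in range(burn_in + n_samples):
        i = rng.integers(L * L)                            # 1. pick a random site
        dE = 2.0 * s[i] * (Jmat[i] @ s + B)                # 2. energy change of flipping s_i
        if dE <= 0 or rng.random() < np.exp(-dE / T):      # 3./4. Metropolis acceptance
            s[i] *= -1
        if step >= burn_in:
            samples.append(s.reshape(L, L).copy())
    return np.array(samples)
```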

We generate 2000 samples at each temperature using a burn-in period of 50000 MCMC steps. The magnetization plot in Figure 2(c) brackets the critical temperature within a narrow range.

We can give an independent determination of the critical temperature by studying correlation functions of the primary operators given in equations (4) and (5). The scaling dimension related to the energy, Δ_ε, must have the value 1 at the critical point. From Figure 2(a) we see that Δ_ε = 1 when T = 7.7. This value for the critical temperature lies in the range given by the magnetization plot. Using T = 7.7 we determine Δ_s = 0.53 from Figure 2(b), as shown by the vertical and horizontal lines.

(a) Δ_ε versus temperature for the long range spin lattice with a lattice size of 10 by 10.
(b) Δ_s versus temperature for the long range spin lattice with a lattice size of 10 by 10.
(c) Magnetization versus temperature curve for various lattice sizes.
Figure 2: Scaling dimensions of the spin lattice, with the Hamiltonian given in equation (1), determined using MCMC. Δ_ε = 1 and Δ_s = 0.53 at T = 7.7. These values are at the intercept of the vertical and horizontal red lines in plots (a) and (b).

II Numerical results

We perform two separate numerical experiments: one investigating a single RBM network and one investigating a simple "deep" network. The first experiment studies the RBM flow, while the second explores the possibility that the layers in a stacked RBM network mimic steps in an RG flow.

The RBM flow is generated using a trained RBM. The flow produces sets of visible vectors, one for each given initial vector, as described in Section I.3. Our RBM flows for the two-dimensional long range spin lattice can be compared to existing results for the two-dimensional Ising spin lattice with nearest neighbour interactions. As we will see, this comparison sheds light on the possible connection between RBM and RG flows. Existing work shows that for the two-dimensional nearest neighbour Ising model, the RBM flows converge to the critical temperature of the system Iso et al. (2018); Funai and Giataganas (2018). This is in contrast to RG flows because the Ising critical point is unstable and significant fine tuning is needed to ensure that a flow will terminate at the critical point.

Do RBM flows of the long range spin lattice flow to the critical point? We have interrogated this question by comparing scaling dimensions for the spin and energy density operators, as well as temperatures, calculated using the configurations generated by the RBM flow. By calculating these quantities for successive configurations in an RBM flow we determine whether there is a fixed point of the flow and whether or not this fixed point is related to the critical point of the long range spin lattice. The results obtained from the RBM flows are reported in Section II.1.

The second experiment chooses the number of nodes in the stacked RBM networks in such a way that, if there is a correspondence to RG, the block spinning of each step combines spins in groups of 4. This corresponds to rescaling the input lattices by a factor of 2. To accomplish this, we choose the number of hidden nodes to be one quarter of the number of visible nodes. We study the flow generated by each RBM layer. Each layer of the RBM network produces a set of hidden vectors that are input to the next RBM network in the stack. Comparing each RBM's hidden layer configurations to the coarse grained configurations produced at each step of RG gives a quantitative comparison between the two methods.

The stacked network of RBMs is trained in a greedy layer-wise manner Bengio et al. (2006). In greedy layer-wise training, each layer is trained independently and the hidden vectors produced by a trained layer are used as training data for the next layer. Once all layers have been trained, we use the initial set of visible configurations (generated by MCMC at a single fixed temperature) and the hidden configurations from each layer to calculate correlations between visible and hidden nodes. This correlator provides a diagnostic for comparing the coarse graining performed by RG with that performed by stacked RBM networks.
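A schematic of the greedy layer-wise procedure is given below. It reuses the hypothetical sigmoid and cd1_update helpers from the CD-1 sketch above (cd1_update must be in scope for it to run), and the epoch count, learning rate and example layer sizes are illustrative choices, not the settings used in our experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(v_data, n_hidden, epochs=30, lr=0.01, rng=np.random.default_rng()):
    """Train a single RBM with CD-1 (cd1_update from the earlier sketch)."""
    nv = v_data.shape[1]
    W, b, c = 0.01 * rng.normal(size=(nv, n_hidden)), np.zeros(nv), np.zeros(n_hidden)
    for _ in range(epochs):
        W, b, c = cd1_update(v_data, W, b, c, lr=lr, rng=rng)
    return W, b, c

def train_stack(v_data, layer_sizes, rng=np.random.default_rng()):
    """Greedy layer-wise training: each layer's sampled hidden vectors become the
    training data for the next layer.  layer_sizes follows the one-quarter rule,
    e.g. [256, 64, 16, 4]."""
    params, data = [], v_data
    for n_hidden in layer_sizes[1:]:
        W, b, c = train_rbm(data, n_hidden, rng=rng)
        params.append((W, b, c))
        data = (rng.random((data.shape[0], n_hidden)) < sigmoid(data @ W + c)).astype(float)
    return params
```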

For the variational RG correlators, the fine grained configurations are the initial set generated by MCMC, while the coarse grained configurations are generated by block spinning. Results comparing RG to the stacked RBM are reported in Section II.2.

II.1 RBM flows

In this experiment we train a single RBM network and study the associated RBM flow. An RBM flow is a sequence of configurations generated by the trained network, starting from a set of initial visible configurations (see Section I.3 for more details). The RBM network we consider has one visible neuron for each of the 10 by 10 lattice sites, together with a fixed number of hidden neurons.

(a) Δ_s versus flow length.
(b) Δ_ε versus flow length.
(c) Average probability of temperature versus temperature, for flows of various lengths. The flow tends to a temperature of 6.
Figure 3: RBM flows starting from low temperature configurations. The training set comprises 2000 configurations at each of the training temperatures, giving 30000 configurations in total.
(a) Δ_s versus flow length.
(b) Δ_ε versus flow length.
(c) Average probability of temperature versus temperature, for flows of various lengths.
Figure 4: RBM flows starting from configurations near the critical temperature. The training set comprises 2000 configurations at each of the training temperatures, giving 30000 configurations in total.

The data used to train the network is generated by MCMC using the long range spin lattice Hamiltonian; 2000 configurations are generated at each of 15 temperatures. The network is trained using 30000 training steps, a fixed learning rate and batches of 1000 samples. Once the network has been trained, we generate flows starting from (i) low temperature, (ii) critical temperature and (iii) high temperature configurations. The configurations generated at different flow lengths determine the scaling dimensions of the spin, Δ_s, and energy density, Δ_ε, operators. In addition, by employing a supervised neural network, we also measure the average temperature of flows at different lengths. This allows us to determine how both the dimensions and the temperature evolve with the RBM flow. The supervised network is trained on the same data as is used to train the RBM; note however that for the supervised case this data carries temperature labels.
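The supervised temperature measurement can be realised in many ways. As one minimal, purely illustrative possibility (not necessarily the network used here), the sketch below fits a softmax classifier over the discrete training temperatures and reports the probability-weighted average temperature of a set of generated configurations, which is how the "average probability of temperature" curves can be read off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_temperature_model(configs, temps):
    """configs: (N, L*L) array of spins; temps: length-N array of temperature labels."""
    model = LogisticRegression(max_iter=1000)
    model.fit(configs, temps.astype(str))   # each training temperature becomes a class
    return model

def average_temperature(model, configs):
    """Average predicted temperature, weighting each temperature class by its
    predicted probability for the given configurations."""
    probs = model.predict_proba(configs)           # shape (N, n_classes)
    class_temps = model.classes_.astype(float)     # aligned with predict_proba columns
    return float(np.mean(probs @ class_temps))
```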

Results for flows starting from (i) low temperature configurations are given in Figure 3, (ii) the critical temperature in Figure 4 and (iii) high temperature in Figure 5. In Figure 3(a), Δ_s converges with increasing flow length to a value below the MCMC estimate Δ_s = 0.53, shown by the red horizontal line. This implies that the patterns generated by the RBM are more correlated on large length scales than they should be. In contrast, results obtained in Koch et al. (2019) show that flows generated by an RBM trained on two-dimensional Ising model configurations, starting from low temperature configurations, converge to the correct value of Δ_s. There is no improvement when these RBM configurations are used to estimate the scaling dimension of the energy density: Figure 3(b) shows that the flows generate a value of Δ_ε which again underestimates the correct value Δ_ε = 1. The RBM has consistently underestimated the scaling dimensions, implying that distant spins are more correlated than they should be. Intuitively, more correlation suggests that we are below the critical temperature, and we indeed find that the flows converge to a temperature of approximately 6, which is below the critical temperature of 7.7. A configuration below the critical temperature has more correlation between distant spins than one at the critical temperature.

(a) Δ_s versus flow length.
(b) Δ_ε versus flow length.
(c) Average probability of temperature versus temperature, for flows of various lengths.
Figure 5: RBM flows starting from configurations at a high temperature. The training set is made up of 2000 configurations at each of the training temperatures (30000 configurations in total).

Clearly the RBM has not correctly determined the critical point of the long range spin lattice, which is in tension with conclusions reached when studying the nearest neighbor Ising spin lattice: in that case flows converge to the critical temperature Iso et al. (2018); Funai and Giataganas (2018).

Starting the flow from configurations at the critical temperature gives the results shown in Figure 4(a). The flows converge to a value of Δ_s close to 0, well below the critical point value. For Δ_ε the flow converges to a value near 1.2, as shown in Figure 4(b); this is an overestimate of the critical point value Δ_ε = 1. Figure 4(c) shows that the flow converges to a temperature below the critical temperature. These results suggest that the configurations generated by the RBM flow have not managed to capture the physics of the long range spin lattice.

For the RBM flow starting from high temperature, the results in Figure 5(a) suggest a negative value of Δ_s, which is completely unphysical: it corresponds to a situation where the correlation between two spins grows with the distance between them. The results in Figure 5(b) suggest that the value of Δ_ε does not converge. In the flow starting from high temperature configurations, the temperature appears to remain high as the RBM flow length increases.

II.2 Stacked RBM and RG comparison

In this section we study the flow generated by successive layers of a 3-layer deep RBM network. Each layer has one quarter as many hidden nodes as visible nodes, and the hidden nodes of one layer serve as the visible nodes of the next. The networks are trained in a greedy layer-wise manner, i.e. each layer is trained separately and in order.

The first network is trained on MCMC configurations generated at a single fixed temperature Bengio et al. (2006). Once this network is trained, hidden configurations are generated by feeding the training data to the first network. The hidden configurations generated by the first RBM are the training data for the second RBM. Once trained, the second RBM generates training data for the third and final layer.

To test whether the stacked RBM flow resembles an RG flow, we compare it to three steps of an RG flow. The deep network we have studied keeps a quarter of the spins in each layer, which we mirror by block spinning groups of four spins. Consequently, the first step of RG coarse grains the original lattice to one with a quarter as many spins, the second step reduces the number of spins by a further factor of four, and the third step by a factor of four again.

To develop a quantitative comparison between the flows of the stacked RBM and those of block spin RG, we calculate correlators between the spins populating the initial lattice and the spins populating each subsequent coarse grained lattice. The result is a matrix of values, indexed by a label for a site of the original lattice and a label for a site of a coarse grained lattice. Selecting a specific column gives the correlation of a specific coarse grained spin with all of the original spins. Rearranging the resulting vector into a matrix produces a visual pattern of the correlation between the coarse grained spin and the original spins. In this way our results produce a picture of how the coarse graining is achieved.
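This correlator matrix can be estimated directly from paired fine and coarse configurations. The sketch below computes the raw correlator and reshapes one column into a 2D image of the kind plotted in Figures 6 and 7; the array shapes and the choice of a raw (rather than connected) correlator are assumptions of this illustration.

```python
import numpy as np

def vis_coarse_correlator(visible, coarse):
    """Correlator C_ij = <v_i h_j> between original spins and coarse grained (or
    hidden) spins.  visible: (N, n_v), coarse: (N, n_h).  A connected version
    would additionally subtract the product of the individual means."""
    return visible.T @ coarse / visible.shape[0]

def column_as_image(C, j, L):
    """Reshape the correlations of coarse spin j with all original spins of an
    L x L lattice into an L x L image for plotting."""
    return C[:, j].reshape(L, L)
```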

Figure 6 shows plots obtained from the RG flow. Plots (a) to (d) show the typical correlation between the input spins and a block spin after one application of RG. These first four plots show a small, highly correlated local patch, with other areas having lower correlation values. The highly correlated patch corresponds to the set of spins averaged to produce the block spin, so the coarse graining is indeed reflected in this correlator plot. Plots (e) to (h) show the correlations between the original spins and a block spin produced by two steps of RG. The plots clearly demonstrate that the patch of high correlation has increased in size, a consequence of the fact that after two steps of RG each block spin averages 16 of the original spins. Plots (i) to (l) show the correlations between the original spins and block spins produced by three steps of RG. The high correlation region has again increased, reflecting the fact that now each block spin summarizes 64 of the original spins. A significant feature of these plots is that the regions of correlation are local, which is a hallmark of the RG coarse graining in which short distance features are removed first.

Figure 7 shows the corresponding plots for the stacked RBM. Here plots (a) to (d) show correlations between the training configurations (visible nodes) and the hidden configurations of the first RBM, plots (e) to (h) show correlations between the training configurations and the hidden configurations of the second RBM, and plots (i) to (l) show correlations between the training configurations and the hidden configurations of the third RBM. There is no clear difference between the plots for the first, second and third layers. There are some local patches, but these are not as distinct as those of the RG flow in Figure 6. Comparison of Figures 6 and 7 reveals little similarity, suggesting that the two coarse graining schemes are quite different.

Figure 6: RG correlation plots.
Plots (a), (b), (c) and (d) show typical correlators between an input spin and a block spin produced by one step of RG.
Plots (e), (f), (g) and (h) show typical correlators between an input spin and a block spin produced by two steps of RG.
Plots (i), (j), (k) and (l) show typical correlators between an input spin and a block spin produced by three steps of RG.
In all 12 plots a single block spin's correlator with the complete collection of input spins is plotted.
Figure 7: RBM correlation plots.
Plots (a), (b), (c) and (d) show a sample of the correlators between an input spin and a hidden spin produced by the first RBM in the stacked network.
Plots (e), (f), (g) and (h) show a sample of the correlators between an input spin and a hidden spin produced by the second RBM in the stacked network.
Plots (i), (j), (k) and (l) show a sample of the correlators between an input spin and a hidden spin produced by the third RBM in the stacked network.
In all 12 plots a single hidden node's correlator with the complete collection of input spins is plotted.
(a) Stacked RBM flow
(b) RG flow
Figure 8: Average probability of temperature versus temperature along the flow, for a stacked RBM network and for an RG flow of 3 steps. The RG and RBM flows are generated by applying both methods to the same training set of lattices, generated by MCMC at a single fixed temperature.

Another interesting aspect to explore is the temperature at each successive step of the flow. Here we make use of the same supervised network used to measure temperature in Section II.1. Figure 8 shows the average measured temperature of configurations at different steps in the RG and stacked RBM flows. After each step of RG, the temperature of the coarse-grained configurations appears to increase. After one application of RG we measure an average temperature of 6.25, which increases to 7 after two applications and further to 7.5 after the third step, as seen in Figure 8(b).

The stacked RBM appears to cause a decrease in temperature as we move through successive layers. The hidden vectors from the first RBM are at a high temperature with a maximum probability at a temperature of 12.5, the second and third RBMs have hidden vectors at a temperature just above 7.5.

Considering both the comparison of the correlation plots and the temperature measurements along the flows for RG and the stacked RBM network, we do not see convincing evidence to suggest that the two are equivalent. For the measured temperatures the behavior is very different, with the temperature flowing in opposite directions: with RG the temperature increases with the number of steps in the flow, while with the stacked RBM the temperature decreases through the successive hidden vectors produced by the RBM layers.

For the two-dimensional Ising model as explored in Koch et al. (2019), the flows went to a higher temperature in both the case of RG and that of the stacked RBM network. This contrasts with the flow of the long range spin model where we see an increase in temperature when applying successive steps of RG and a decrease in temperature when applying successive RBMs.

III Conclusions and Discussion

We have explored the link between RG and deep learning using RBMs trained on data taken from states of a long range spin lattice. This generalizes existing work Iso et al. (2018); Funai and Giataganas (2018) exploring the two-dimensional Ising model with nearest neighbor interactions. Our results show important differences from those of Iso et al. (2018); Funai and Giataganas (2018), which are discussed and interpreted in this section.

We have explored the RBM flow generated by the weights of a single RBM network, trained using spin configurations of a two-dimensional long range spin lattice. Flows starting from data sets generated at low temperature, at high temperature and near the critical temperature have been considered. Our results demonstrate that, regardless of the temperature from which the flow begins, the RBM flow fixed point does not resemble the critical point of the long range spin lattice. The scaling dimensions for the spin and energy operators do not converge to the values we expect at the critical point. In addition, when measuring the temperature of the flows, they do not converge to the critical temperature. This contrasts with results obtained for the Ising spin lattice, where the spin correlator is correctly reproduced and the flows appear to converge to the critical temperature Koch et al. (2019).

Our study has also compared the RG flow to the flow produced by a stacked deep RBM network. Following proposals put forward in Koch et al. (2019), this comparison was performed using the correlators resulting from the two flows. The correlation plots from RG show local correlated patches surrounded by a large uncorrelated region. As the number of steps in the RG flow increases, so does the contrast between the patch of high correlation and the surrounding uncorrelated environment. This is very clearly a signature of the locality built into the block spinning RG procedure. The plots for the stacked RBM show random patches of correlation surrounded by dark uncorrelated regions. The patches lack the regularity of the patterns that emerge in the case of RG, and they are not enhanced by subsequent layers of the deep RBM. As we move through successive layers of the stacked RBM, the plots show no noteworthy organized change in the size or shape of the correlated patches of the kind observed for the RG flow.

In addition to the correlation plots, the average temperature at successive steps of both the RG and stacked RBM flows was determined. As expected, the RG flow shows a clear increase in temperature with the length of the flow. This simply reflects the presence of relevant operators in the theory, so that the fixed point is unstable. The RBM, however, shows a different behavior: the temperature decreases with the depth of the stacked network. For nearest neighbor Ising lattices the temperature increases both along the RG flow and with the depth of the stacked RBM network, showing agreement on this very basic question. For the long range spin lattice, the RBM flow and RG flow disagree even on this most basic question.

Our results clearly demonstrate that the coarse graining performed by the RBM does not correspond to block spin RG of the long range spin lattice. For the nearest neighbor Ising model there is at least some evidence that deep RBMs reproduce some features of the RG flow. Why does the long range spin lattice differ so significantly from the nearest neighbor Ising lattice? In the long range spin lattice, interactions decrease smoothly in strength as the distance between the interacting spins increases. The net result is that more distant spins are correlated by interactions in the long range spin lattice, as compared to the Ising lattice, whose interactions switch off abruptly beyond nearest neighbors. Consequently, it is harder for the RBM to recognize locality in the emergent patterns. Without locality, the resulting coarse graining is nothing like RG.

Setting aside this disagreement, there may be a more fundamental obstacle to comparing the flows. The RBM energy function is simple and linear in the visible and hidden node values. Setting the coefficient of the term linear in the visible nodes corresponds to fixing the single spin expectation value, i.e. the magnetization of the lattice. There are no further parameters in the RBM energy function that could be used to fix higher order correlators, such as two point or three point spin correlators. This implies that not all relevant parameters in the theory have been fixed, so it would be surprising if the RBM and RG flows terminated at the same point in the space of all possible lattice models. Small differences between the two will be magnified by the flow, making them look very different. Including higher order terms in the RBM energy function might allow the network to exploit additional properties present in the input data during training, and so to correctly learn essential physical characteristics of the long range spin lattice. Clearly our most credible conclusion is that more work is required to explore the connection that may exist between RG and deep learning!

Acknowledgements

We would like to thank Robert de Mello Koch for useful discussions.

References

  • K. Aoki and T. Kobayashi (2016) Restricted Boltzmann machines for the long range Ising models. Modern Physics Letters B 30 (34), pp. 1650401.
  • Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2006) Greedy layer-wise training of deep networks. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, Cambridge, MA, USA, pp. 153–160.
  • C. Bény (2013) Deep learning and the renormalization group. arXiv preprint arXiv:1301.3124 [quant-ph].
  • P. Broecker, J. Carrasquilla, R. G. Melko, and S. Trebst (2017) Machine learning quantum phases of matter beyond the fermion sign problem. Scientific Reports 7 (1), pp. 8823.
  • J. Carrasquilla and R. G. Melko (2017) Machine learning phases of matter. Nature Physics 13 (5), pp. 431.
  • M. A. Carreira-Perpinan and G. E. Hinton (2005) On contrastive divergence learning. In AISTATS, Vol. 10, pp. 33–40.
  • D. Deng, X. Li, and S. D. Sarma (2017) Machine learning topological states. Physical Review B 96 (19), pp. 195145.
  • E. Efrati, Z. Wang, A. Kolan, and L. P. Kadanoff (2014) Real-space renormalization in statistical mechanics. Reviews of Modern Physics 86 (2), pp. 647.
  • S. S. Funai and D. Giataganas (2018) Thermodynamics and feature extraction by machine learning. arXiv preprint arXiv:1810.08179.
  • C. J. Geyer (1992) Practical Markov chain Monte Carlo. Statistical Science, pp. 473–483.
  • G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
  • G. E. Hinton (2012) A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pp. 599–619.
  • E. Howard (2018) Holographic renormalization with machine learning. arXiv preprint arXiv:1803.11056.
  • S. Iso, S. Shiba, and S. Yokoo (2018) Scale-invariant feature extraction of neural network and renormalization group flow. Physical Review E 97 (5), pp. 053304.
  • M. I. Jordan and T. M. Mitchell (2015) Machine learning: trends, perspectives, and prospects. Science 349 (6245), pp. 255–260.
  • L. P. Kadanoff, A. Houghton, and M. C. Yalabik (1976) Variational approximations for renormalization group transformations. Journal of Statistical Physics 14 (2), pp. 171–203.
  • D. Kim and D. Kim (2018) Smallest neural network to learn the Ising criticality. Physical Review E 98 (2), pp. 022138.
  • E. d. M. Koch, R. d. M. Koch, and L. Cheng (2019) Is deep learning an RG flow?. arXiv preprint arXiv:1906.05212.
  • M. Koch-Janusz and Z. Ringel (2018) Mutual information, neural networks and the renormalization group. Nature Physics 14 (6), pp. 578.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • N. Le Roux and Y. Bengio (2008) Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation 20 (6), pp. 1631–1649.
  • N. Le Roux and Y. Bengio (2010) Deep belief networks are compact universal approximators. Neural Computation 22 (8), pp. 2192–2207.
  • S. Li and L. Wang (2018) Neural network renormalization group. Physical Review Letters 121 (26), pp. 260601.
  • H. W. Lin, M. Tegmark, and D. Rolnick (2017) Why does deep and cheap learning work so well?. Journal of Statistical Physics 168 (6), pp. 1223–1247. arXiv:1608.08225.
  • P. Mehta and D. J. Schwab (2014) An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831 [cond-mat, stat].
  • A. Morningstar and R. G. Melko (2017) Deep learning the Ising model near criticality. The Journal of Machine Learning Research 18 (1), pp. 5975–5991.
  • Y. Nomura, A. S. Darmawan, Y. Yamaji, and M. Imada (2017) Restricted Boltzmann machine learning for solving strongly correlated quantum systems. Physical Review B 96 (20), pp. 205152.
  • A. Paul and S. Venkatasubramanian (2014) Why does deep learning work? A perspective from group theory. arXiv preprint arXiv:1412.6621 [cs, stat].
  • D. Poland, S. Rychkov, and A. Vichi (2019) The conformal bootstrap: theory, numerical techniques, and applications. Reviews of Modern Physics 91 (1), pp. 015002.
  • M. Ranzato, A. Krizhevsky, and G. Hinton (2010) Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 621–628.
  • R. Salakhutdinov, A. Mnih, and G. Hinton (2007) Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, New York, NY, USA, pp. 791–798.
  • S. Saremi and T. J. Sejnowski (2013) Hierarchical model of natural images and the origin of scale invariance. Proceedings of the National Academy of Sciences of the United States of America 110 (8), pp. 3071–3076.
  • I. Sutskever and T. Tieleman (2010) On the convergence properties of contrastive divergence. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 789–795.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pp. 1–5.
  • L. Wang (2016) Discovering phase transitions with unsupervised learning. Physical Review B 94 (19), pp. 195105.
  • K. G. Wilson and J. Kogut (1974) The renormalization group and the ε expansion. Physics Reports 12 (2), pp. 75–199.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 [cs].