Info: Neural Network, Deep Learning , Recursive Net etc, RNN, BM,RBM,HMM
Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
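In statistical physics, one often considers an ensemble of $N$ binary spins $\{v_i\}$ that can take the values $\pm 1$. The index $i$ labels the position of spin $v_i$ in some lattice. In thermal equilibrium, the probability of a spin configuration is given by the Boltzmann distribution
$$P(\{v_i\}) = \frac{e^{-H(\{v_i\})}}{Z}, \tag{1}$$
where we have defined the Hamiltonian $H(\{v_i\})$ and the partition function
$$Z = \mathrm{Tr}_{v_i}\, e^{-H(\{v_i\})} \equiv \sum_{v_1,\ldots,v_N = \pm 1} e^{-H(\{v_i\})}. \tag{2}$$
Note that throughout the paper we set the temperature equal to one, without loss of generality. Typically, the Hamiltonian depends on a set of couplings or parameters, $K = \{K_s\}$, that parameterizes the set of all possible Hamiltonians. For example, with binary spins, the $K$ could be the couplings describing the spin interactions of various orders:
$$H[\{v_i\}] = -\sum_i K_i v_i - \sum_{ij} K_{ij} v_i v_j - \sum_{ijk} K_{ijk} v_i v_j v_k + \ldots \tag{3}$$
Finally, we can define the free energy of the spin system in the standard way:
$$F^v = -\log Z = -\log\left(\mathrm{Tr}_{v_i}\, e^{-H(\{v_i\})}\right). \tag{4}$$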
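For a handful of spins, the Boltzmann distribution, partition function, and free energy can be evaluated by brute-force enumeration. The sketch below is only an illustration: it assumes an open chain, keeps just the single-spin and nearest-neighbor terms of the Hamiltonian, and uses hypothetical names (`hamiltonian`, `boltzmann`, `K1`, `K2`) that do not come from the paper.

```python
import itertools
import math

def hamiltonian(spins, K1, K2):
    """Energy of one configuration on an open chain, keeping only a uniform
    single-spin term (K1) and a nearest-neighbor pair term (K2)."""
    field = -K1 * sum(spins)
    pair = -K2 * sum(a * b for a, b in zip(spins, spins[1:]))
    return field + pair

def boltzmann(n_spins, K1, K2):
    """Enumerate all 2**n_spins configurations (temperature set to one).

    Returns the configurations, their probabilities, Z, and F = -log Z."""
    configs = list(itertools.product((-1, 1), repeat=n_spins))
    weights = [math.exp(-hamiltonian(s, K1, K2)) for s in configs]
    Z = sum(weights)
    probs = [w / Z for w in weights]
    return configs, probs, Z, -math.log(Z)

configs, probs, Z, F = boltzmann(n_spins=4, K1=0.0, K2=0.5)
```

With `K1 = 0` the result can be checked against the known closed form for an open chain, $Z = 2\,(2\cosh K_2)^{N-1}$. The enumeration cost grows as $2^N$, so this approach is only practical for very small systems.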
The idea behind RG is to find a new coarse-grained description of the spin system where one has “integrated out” short-distance fluctuations. To this end, let us introduce $M < N$ new binary spins, $\{h_j\}$. Each of these spins will serve as a coarse-grained degree of freedom where fluctuations on small scales have been averaged out. Typically, such a coarse-graining procedure increases some characteristic length scale describing the system, such as the lattice spacing. For example, in the block spin renormalization picture introduced by Kadanoff, each $h_j$ represents the state of a local block of physical spins, $v_i$. Figure 1 shows such a block-spin procedure for a two-dimensional spin system on a square lattice, where each $h_j$ represents a $2 \times 2$ block of visible spins. The result of such a coarse-graining procedure is that the lattice spacing is doubled at each step of the renormalization procedure.
In general, the interactions (statistical correlations) between the $\{v_i\}$ induce interactions (statistical correlations) between the coarse-grained spins, $\{h_j\}$. In particular, the coarse-grained system can be described by a new coarse-grained Hamiltonian of the form
$$H^{RG}[\{h_j\}] = -\sum_i \tilde{K}_i h_i - \sum_{ij} \tilde{K}_{ij} h_i h_j - \sum_{ijk} \tilde{K}_{ijk} h_i h_j h_k + \ldots \tag{5}$$
where $\{\tilde{K}\}$ describe interactions between the hidden spins, $\{h_j\}$. In the physics literature, such a renormalization transformation is often represented as a mapping between couplings, $\{K\} \rightarrow \{\tilde{K}\}$. Of course, the exact mapping depends on the details of the RG scheme used.
In the variational RG scheme proposed by Kadanoff, the coarse-graining procedure is implemented by constructing a function, $T_\lambda(\{v_i\},\{h_j\})$, that depends on a set of variational parameters $\{\lambda\}$ and encodes (typically pairwise) interactions between the physical and coarse-grained degrees of freedom. After coupling the auxiliary spins $\{h_j\}$ to the physical spins $\{v_i\}$, one can then integrate out (marginalize over) the visible spins to arrive at a coarse-grained description of the physical system entirely in terms of the $\{h_j\}$. The function $T_\lambda$ then naturally defines a Hamiltonian for the $\{h_j\}$ through the expression
$$e^{-H^{RG}_\lambda[\{h_j\}]} \equiv \mathrm{Tr}_{v_i}\, e^{T_\lambda(\{v_i\},\{h_j\}) - H(\{v_i\})}. \tag{6}$$
We can also define a free energy for the coarse-grained system in the usual way:
$$F^h_\lambda = -\log\left(\mathrm{Tr}_{h_j}\, e^{-H^{RG}_\lambda(\{h_j\})}\right). \tag{7}$$
Thus far we have ignored the problem of choosing the variational parameters $\lambda$ that define our RG transformation $T_\lambda$. Intuitively, it is clear we should choose $\lambda$ to ensure that the long-distance physical observables of the system are invariant under this coarse-graining procedure. This is done by choosing the parameters $\lambda$ to minimize the free energy difference, $\Delta F = F^h_\lambda - F^v$, between the physical and coarse-grained systems. Notice that
$$\Delta F = 0 \iff \mathrm{Tr}_{h_j}\, e^{T_\lambda(\{v_i\},\{h_j\})} = 1. \tag{8}$$
Thus, for any exact RG transformation, we know that
$$\mathrm{Tr}_{h_j}\, e^{T_\lambda(\{v_i\},\{h_j\})} = 1. \tag{9}$$
In general, it is not possible to choose the parameters $\lambda$ to satisfy the condition above, and various variational schemes (e.g. bond moving) have been proposed to choose $\lambda$ to minimize this $\Delta F$.
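The link between a vanishing free energy difference and the normalization of the coupling function can be made explicit in a short derivation. The steps below are a sketch that uses only the definitions of the coarse-grained Hamiltonian and the two free energies; the notation ($v$ for the physical spins, $h$ for the coarse-grained spins, $T_\lambda$ for the coupling function, $H$ for the physical Hamiltonian) is assumed to follow the surrounding text.

```latex
\begin{aligned}
e^{-F^h_\lambda}
  &= \mathrm{Tr}_{h}\, e^{-H^{RG}_\lambda[h]}
   = \mathrm{Tr}_{h}\, \mathrm{Tr}_{v}\, e^{T_\lambda(v,h) - H(v)} \\
  &= \mathrm{Tr}_{v} \left[ e^{-H(v)} \, \mathrm{Tr}_{h}\, e^{T_\lambda(v,h)} \right].
\end{aligned}
% If Tr_h e^{T_\lambda(v,h)} = 1 for every configuration v, the bracket reduces to e^{-H(v)}:
e^{-F^h_\lambda} = \mathrm{Tr}_{v}\, e^{-H(v)} = e^{-F^v}
\quad\Longrightarrow\quad
\Delta F = F^h_\lambda - F^v = 0 .
```

The only manipulation used is exchanging the order of the two traces, which is always allowed for finite sums over spin configurations.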
We will show below that this variational RG procedure has a natural interpretation as a deep learning scheme based on a powerful class of energy-based models called Restricted Boltzmann Machines (RBMs) Hinton and Salakhutdinov (2006); Salakhutdinov et al. (2007); Larochelle and Bengio (2008); Smolensky (1986); Teh and Hinton (2001). We will restrict our discussion to RBMs acting on binary data Hinton and Salakhutdinov (2006) drawn from some probability distribution, $P(\{v_i\})$, with $\{v_i\}$ binary spins labeled by an index $i = 1, \ldots, N$. For example, for black and white images each spin $v_i$ encodes whether a given pixel is on or off, and the distribution $P(\{v_i\})$ encodes the statistical properties of the ensemble of images (e.g. the set of all handwritten digits in the MNIST dataset).
To model the data distribution, RBMs introduce new hidden spin variables, $\{h_j\}$ ($j = 1, \ldots, M$), that couple to the visible units. The interactions between visible and hidden units are modeled using an energy function of the form
$$E(\{v_i\},\{h_j\}) = \sum_j b_j h_j + \sum_{ij} v_i w_{ij} h_j + \sum_i c_i v_i, \tag{10}$$
where $\lambda = \{b_j, w_{ij}, c_i\}$ are variational parameters of the model. In terms of this energy function, the joint probability of observing a configuration of hidden and visible spins can be written as
$$p_\lambda(\{v_i\},\{h_j\}) = \frac{e^{-E(\{v_i\},\{h_j\})}}{Z}. \tag{11}$$
This joint distribution also defines a variational distribution for the visible spins,
$$p_\lambda(\{v_i\}) = \sum_{\{h_j\}} p_\lambda(\{v_i\},\{h_j\}) = \mathrm{Tr}_{h_j}\, p_\lambda(\{v_i\},\{h_j\}), \tag{12}$$
as well as a marginal distribution for the hidden spins themselves:
$$p_\lambda(\{h_j\}) = \sum_{\{v_i\}} p_\lambda(\{v_i\},\{h_j\}) = \mathrm{Tr}_{v_i}\, p_\lambda(\{v_i\},\{h_j\}). \tag{13}$$
Finally, for future reference it will be helpful to define a “variational” RBM Hamiltonian for the visible units:
$$p_\lambda(\{v_i\}) \equiv \frac{e^{-H^{RBM}_\lambda[\{v_i\}]}}{Z}, \tag{14}$$
and an RBM Hamiltonian for the hidden units:
$$p_\lambda(\{h_j\}) \equiv \frac{e^{-H^{RBM}_\lambda[\{h_j\}]}}{Z}. \tag{15}$$
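For a very small RBM, the joint distribution and both marginals can be tabulated exactly, which makes the definitions above concrete. The following is a minimal sketch, assuming $\pm 1$ spins and an energy with hidden biases `b`, visible biases `c`, and a weight matrix `w`; the function names and the random parameter values are illustrative only.

```python
import itertools
import numpy as np

def rbm_energy(v, h, b, w, c):
    """E(v, h) = sum_j b_j h_j + sum_ij v_i w_ij h_j + sum_i c_i v_i."""
    return h @ b + v @ w @ h + v @ c

def rbm_distributions(b, w, c):
    """Tabulate the joint p(v, h) and both marginals by exhaustive enumeration."""
    n_vis, n_hid = w.shape
    vs = [np.array(s, dtype=float) for s in itertools.product((-1, 1), repeat=n_vis)]
    hs = [np.array(s, dtype=float) for s in itertools.product((-1, 1), repeat=n_hid)]
    # Unnormalized Boltzmann weights exp(-E) for every (v, h) pair.
    weights = np.array([[np.exp(-rbm_energy(v, h, b, w, c)) for h in hs] for v in vs])
    joint = weights / weights.sum()
    p_v = joint.sum(axis=1)  # trace over the hidden spins
    p_h = joint.sum(axis=0)  # trace over the visible spins
    return vs, hs, joint, p_v, p_h

rng = np.random.default_rng(0)
n_vis, n_hid = 3, 2
b = rng.normal(size=n_hid)
c = rng.normal(size=n_vis)
w = rng.normal(size=(n_vis, n_hid))
vs, hs, joint, p_v, p_h = rbm_distributions(b, w, c)
```

Rows of `joint` index visible configurations and columns index hidden configurations, so the two marginals are simply row and column sums.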
Since the objective of the RBM for our purposes is unsupervised learning, the parameters in the RBM are chosen to minimize the Kullback-Leibler divergence between the true distribution of the data, $P(\{v_i\})$, and the variational distribution $p_\lambda(\{v_i\})$:
$$D_{KL}\left(P(\{v_i\}) \,\|\, p_\lambda(\{v_i\})\right) = \sum_{\{v_i\}} P(\{v_i\}) \log\left(\frac{P(\{v_i\})}{p_\lambda(\{v_i\})}\right). \tag{16}$$
Furthermore, notice that when the RBM exactly reproduces the visible data distribution,
$$D_{KL}\left(P(\{v_i\}) \,\|\, p_\lambda(\{v_i\})\right) = 0. \tag{17}$$
In general it is not possible to explicitly minimize $D_{KL}\left(P(\{v_i\}) \,\|\, p_\lambda(\{v_i\})\right)$, and this minimization is usually performed using approximate numerical methods such as contrastive divergence Hinton (2002). Note that if the number of hidden units is restricted (i.e. less than $2^N$), the RBM cannot be made to match an arbitrary distribution exactly Le Roux and Bengio (2008).
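To make the approximate minimization concrete, here is a minimal sketch of a single contrastive divergence (CD-1) update. It is not the code used in the paper: it assumes $\pm 1$ spins, an energy of the form $E = \sum_j b_j h_j + \sum_{ij} v_i w_{ij} h_j + \sum_i c_i v_i$ with $p \propto e^{-E}$, and hypothetical function names. With this sign convention the probability that a hidden spin is $+1$ is $\sigma(-2(b_j + \sum_i v_i w_{ij}))$, and the update direction is the model statistics minus the data statistics.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_spins(p_up, rng):
    """Draw +/-1 spins given the probability of each spin being +1."""
    return np.where(rng.random(p_up.shape) < p_up, 1.0, -1.0)

def p_hidden_up(v, b, w):
    # p(h_j = +1 | v) for p ~ exp(-E) with E = b.h + v.w.h + c.v
    return sigmoid(-2.0 * (b + v @ w))

def p_visible_up(h, c, w):
    # p(v_i = +1 | h) under the same convention
    return sigmoid(-2.0 * (c + h @ w.T))

def cd1_step(v0, b, w, c, lr, rng):
    """One CD-1 update on a mini-batch v0 of shape (batch, n_vis)."""
    h0 = sample_spins(p_hidden_up(v0, b, w), rng)   # data-driven hidden sample
    v1 = sample_spins(p_visible_up(h0, c, w), rng)  # one-step reconstruction
    h1 = sample_spins(p_hidden_up(v1, b, w), rng)
    n = v0.shape[0]
    # The log-likelihood gradient is <dE/dtheta>_model - <dE/dtheta>_data;
    # CD-1 replaces the model average by the one-step reconstruction.
    w = w + lr * (v1.T @ h1 - v0.T @ h0) / n
    b = b + lr * (h1.mean(axis=0) - h0.mean(axis=0))
    c = c + lr * (v1.mean(axis=0) - v0.mean(axis=0))
    return b, w, c
```

Many implementations use $\{0, 1\}$ units and the opposite sign for the energy, in which case the update is written as data statistics minus model statistics; the two forms describe the same learning rule.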
In a DNN, RBMs are stacked on top of each other so that, once trained, the hidden layer of one RBM serves as the visible layer of the next RBM. In particular, one can map a configuration of visible spins to a configuration in the hidden layer via the conditional probability distribution, $p_\lambda(\{h_j\}|\{v_i\})$. Thus, after training an RBM, we can treat the activities of the hidden layer in response to each visible data sample as data for learning a second layer of hidden spins, and so on.
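The layer-by-layer construction can be sketched as a loop that samples each hidden layer from the one below and passes the result upward. This is an illustration under stated assumptions: $\pm 1$ spins, a conditional of the form $p(h_j = +1 | v) = 1/(1 + e^{2(b_j + \sum_i v_i w_{ij})})$, and a hypothetical `layers` list of already trained `(b, w)` pairs.

```python
import numpy as np

def propagate_up(v, layers, rng):
    """Sample every hidden layer of a stack of RBMs, bottom to top.

    v      : array of shape (batch, n_0) with entries +/-1
    layers : list of (b, w) pairs, with w of shape (n_k, n_{k+1})
    Returns the list of sampled activities, starting with v itself."""
    activities = [v]
    for b, w in layers:
        below = activities[-1]
        # Probability that each hidden spin is +1 given the layer below.
        p_up = 1.0 / (1.0 + np.exp(2.0 * (b + below @ w)))
        activities.append(np.where(rng.random(p_up.shape) < p_up, 1.0, -1.0))
    return activities
```

During greedy training, the activities returned for layer $k$ would serve as the training data for the RBM connecting layers $k$ and $k+1$.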
In variational RG, the couplings between the hidden and visible spins are encoded by the operators $T_\lambda(\{v_i\},\{h_j\})$. In RBMs, an analogous role is played by the joint energy function $E(\{v_i\},\{h_j\})$. In fact, as we will show below, these objects are related through the equation
$$T_\lambda(\{v_i\},\{h_j\}) = -E(\{v_i\},\{h_j\}) + H[\{v_i\}], \tag{18}$$
where $H[\{v_i\}]$ is the Hamiltonian defined in Eq. 3 that encodes the data probability distribution $P(\{v_i\})$. This equation defines a one-to-one mapping between the variational RG scheme and RBM-based DNNs.
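Using this definition, it is easy to show that the Hamiltonian $H^{RG}_\lambda[\{h_j\}]$, originally defined in Eq. 6 as the Hamiltonian of the coarse-grained degrees of freedom after performing RG, also describes the hidden spins in the RBM. This is equivalent to the statement that the marginal distribution $p_\lambda(\{h_j\})$ describing the hidden spins of the RBM is of the Boltzmann form with a Hamiltonian $H^{RG}_\lambda[\{h_j\}]$. To prove this, we divide both sides of Eq. 6 by $Z$ to get
$$\frac{e^{-H^{RG}_\lambda[\{h_j\}]}}{Z} = \mathrm{Tr}_{v_i}\, \frac{e^{T_\lambda(\{v_i\},\{h_j\}) - H(\{v_i\})}}{Z}. \tag{19}$$
Substituting Eq. 18 into this equation yields
$$\frac{e^{-H^{RG}_\lambda[\{h_j\}]}}{Z} = \mathrm{Tr}_{v_i}\, \frac{e^{-E(\{v_i\},\{h_j\})}}{Z} = p_\lambda(\{h_j\}). \tag{20}$$
Substituting Eq. 15 into the right-hand side yields the desired result:
$$H^{RG}_\lambda[\{h_j\}] = H^{RBM}_\lambda[\{h_j\}]. \tag{21}$$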
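These results also provide a natural interpretation for variational RG entirely in the language of probability theory. The operator $e^{T_\lambda(\{v_i\},\{h_j\})}$ can be viewed as a variational approximation for the conditional probability of the hidden spins given the visible spins. To see this, notice that
$$\begin{aligned} e^{T_\lambda(\{v_i\},\{h_j\})} &= e^{-E(\{v_i\},\{h_j\}) + H[\{v_i\}]} \\ &= \frac{p_\lambda(\{v_i\},\{h_j\})}{p_\lambda(\{v_i\})}\, e^{H[\{v_i\}] - H^{RBM}_\lambda[\{v_i\}]} \\ &= p_\lambda(\{h_j\}|\{v_i\})\, e^{H[\{v_i\}] - H^{RBM}_\lambda[\{v_i\}]}, \end{aligned} \tag{22}$$
where in going from the first line to the second line we have used Eqs. 11 and 14. This implies that when an RG can be performed exactly (i.e. the RG transformation satisfies the equality $\mathrm{Tr}_{h_j}\, e^{T_\lambda(\{v_i\},\{h_j\})} = 1$), the variational Hamiltonian is identical to the true Hamiltonian describing the data, $H[\{v_i\}] = H^{RBM}_\lambda[\{v_i\}]$, and $e^{T_\lambda(\{v_i\},\{h_j\})}$ is exactly the conditional probability $p_\lambda(\{h_j\}|\{v_i\})$. In the language of probability theory, this means that the variational distribution $p_\lambda(\{v_i\})$ exactly reproduces the true data distribution $P(\{v_i\})$ and $D_{KL}\left(P(\{v_i\}) \,\|\, p_\lambda(\{v_i\})\right) = 0$.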
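The identities relating the RG and RBM descriptions can be checked numerically on a toy system by enumerating every configuration. The sketch below assumes the mapping $T = -E + H$, takes a three-spin open chain with a hypothetical coupling `K` as the "physical" Hamiltonian, and uses random RBM parameters; all names and values are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_vis, n_hid = 3, 2
vs = np.array(list(itertools.product((-1, 1), repeat=n_vis)), dtype=float)
hs = np.array(list(itertools.product((-1, 1), repeat=n_hid)), dtype=float)

# Hypothetical physical Hamiltonian: open chain with nearest-neighbor coupling K.
K = 0.7
H = -K * (vs[:, :-1] * vs[:, 1:]).sum(axis=1)                # shape (8,)

# Random RBM parameters and the energy table E[v, h].
b, c = rng.normal(size=n_hid), rng.normal(size=n_vis)
w = rng.normal(size=(n_vis, n_hid))
E = (hs @ b)[None, :] + vs @ w @ hs.T + (vs @ c)[:, None]    # shape (8, 4)

T = -E + H[:, None]                                           # assumed mapping T = -E + H

# Coarse-grained Hamiltonian from variational RG: exp(-H_RG) = Tr_v exp(T - H).
H_RG = -np.log(np.exp(T - H[:, None]).sum(axis=0))

# RBM Hamiltonians defined through the marginals p(h) and p(v).
Z = np.exp(-E).sum()
joint = np.exp(-E) / Z
H_RBM_h = -np.log(Z * joint.sum(axis=0))
H_RBM_v = -np.log(Z * joint.sum(axis=1))

# exp(T) should equal p(h|v) * exp(H - H_RBM_v).
p_h_given_v = joint / joint.sum(axis=1, keepdims=True)
rhs = p_h_given_v * np.exp(H - H_RBM_v)[:, None]

print(np.allclose(H_RG, H_RBM_h), np.allclose(np.exp(T), rhs))
```

Because the RBM parameters here are random rather than trained, `H` and `H_RBM_v` differ, so the trace of `np.exp(T)` over the hidden spins is not one; it equals `np.exp(H - H_RBM_v)` for each visible configuration, which is the factor that measures how far the transformation is from exact.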
In general, it is not possible to perform the variational RG transformation exactly. Instead, one constructs a family of variational approximations for the exact RG transform Kadanoff et al. (1976); Kadanoff (2000); Efrati et al. (2014). The discussion above makes it clear that these variational approximations work at the level of the Hamiltonians and free energies. In contrast, in the machine learning literature, these variational approximations are usually made by minimizing the KL divergence $D_{KL}\left(P(\{v_i\}) \,\|\, p_\lambda(\{v_i\})\right)$. Thus, the two approaches employ distinct variational approximation schemes for coarse graining. Finally, notice that the correspondence does not rely on the explicit form of the energy $E(\{v_i\},\{h_j\})$ and hence holds for any Boltzmann Machine.
To gain intuition about the mapping between RG and deep learning, it is helpful to consider some simple examples in detail. We begin by examining the one-dimensional nearest-neighbor Ising model where the RG transformation can be carried out exactly. We then numerically explore the two-dimensional nearest-neighbor Ising model using an RBM-based deep learning architecture.
The one-dimensional Ising model describes a collection of binary spins $\{v_i\}$ organized along a one-dimensional lattice with lattice spacing $a$. Such a system is described by a Hamiltonian of the form
$$H = -K \sum_i v_i v_{i+1}, \tag{23}$$
where $K > 0$ is a ferromagnetic coupling that energetically favors configurations where neighboring spins align. To perform an RG transformation, we decimate (marginalize over) every other spin. This doubles the lattice spacing and results in a new effective interaction between spins (see Figure 2). If we denote the coupling after performing $n$ successive RG transformations by $K^{(n)}$, then a standard calculation shows that these coefficients satisfy the RG equations
$$\tanh\left[K^{(n+1)}\right] = \tanh^2\left[K^{(n)}\right], \tag{24}$$
where we have defined $K^{(0)} = K$ Kadanoff (2000). This recursion relationship can be visualized as a one-dimensional flow in the coupling space from $K = \infty$ to $K = 0$. Thus, after performing RG the interactions become weaker and weaker, and $K^{(n)} \rightarrow 0$ as $n \rightarrow \infty$.
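The decimation recursion is easy to iterate numerically, which shows how quickly the coupling flows toward zero. The sketch below assumes the standard one-dimensional result $\tanh K' = \tanh^2 K$ and cross-checks it against the coupling obtained by summing explicitly over one intermediate spin, $K' = \tfrac{1}{2}\log\cosh(2K)$; the function names and the starting value are illustrative.

```python
import math

def rg_step(K):
    """One decimation step of the 1D Ising chain: tanh(K') = tanh(K)**2."""
    return math.atanh(math.tanh(K) ** 2)

def rg_flow(K0, n_steps):
    """Couplings K^(0), K^(1), ..., K^(n_steps) under repeated decimation."""
    flow = [K0]
    for _ in range(n_steps):
        flow.append(rg_step(flow[-1]))
    return flow

def decimated_coupling(K):
    """Effective coupling after summing over the spin s between neighbors a and b.

    sum_s exp(K s (a + b)) = 2 cosh(K (a + b)), so the ratio of aligned to
    anti-aligned weights is cosh(2K) = exp(2 K'), giving K' = 0.5 log cosh(2K)."""
    return 0.5 * math.log(math.cosh(2.0 * K))

flow = rg_flow(K0=1.0, n_steps=6)
for n, K in enumerate(flow):
    print(n, K)
```

Starting from any finite coupling the sequence decreases monotonically, reflecting the absence of a finite-temperature phase transition in one dimension.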
This RG transformation also naturally gives rise to the deep learning architecture shown in Figure 2. The spins at a given layer of the DNN have a natural interpretation as the decimated spins when performing the RG transformation in the layer below. Notice that the coupled spins in the bottom two layers of the DNN in Fig. 2B form an “effective” one-dimensional chain isomorphic to the original spin chain. Thus, marginalizing over spins in the bottom layer of the DNN is identical to decimating every other spin in the original spin system. This implies that the “hidden” spins in the second layer of the DNN are also described by the RG-transformed Hamiltonian with a coupling $K^{(1)}$ between neighboring spins. Repeating this argument for spins coupled between the second and third layers and so on, one obtains the deep learning architecture shown in Fig. 2B, which implements decimation.
The advantage of the simple deep architecture presented here is that it is easy to interpret and requires no calculations to construct. However, an important shortcoming is that it contains no information about half of the visible spins, namely the spins that do not couple to the hidden layer.
We next applied deep learning techniques to numerically coarse-grain the two-dimensional nearest-neighbor Ising model on a square lattice. This model is described by a Hamiltonian of the form
$$H(\{v_i\}) = -J \sum_{\langle ij \rangle} v_i v_j, \tag{25}$$
where $\langle ij \rangle$ indicates that $i$ and $j$ are nearest neighbors and $J$ is a ferromagnetic coupling that favors configurations where neighboring spins align. Unlike the one-dimensional Ising model, the two-dimensional Ising model has a phase transition that occurs at a critical value of the coupling $J$ (recall we have set the temperature equal to one). At the phase transition, the characteristic length scale of the system, the correlation length, diverges. For this reason, near a critical point the system can be productively coarse-grained using a procedure similar to Kadanoff’s block spin renormalization (see Fig. 1) Kadanoff (2000).
Inspired by our mapping between variational RG and DNNs, we applied standard deep learning techniques to samples generated from the 2D Ising model at a coupling corresponding to a temperature just above the critical temperature. Samples were generated from a periodic 2D Ising model using standard equilibrium Monte Carlo techniques and served as input to an RBM-based deep neural network of four layers (see Fig. 3A). We furthermore imposed an L1 penalty on the weights between layers in the RBM and trained the network using contrastive divergence Hinton (2002) (see Materials and Methods). The L1 penalty serves as a sparsity-promoting regularizer that encourages weights in the RBM to be zero and prevents overfitting due to the finite number of samples. In practice, it ensures that visible and hidden spins interact with only a small subset of all the spins in an RBM. (Note that we did not use a convolutional network that explicitly builds in spatial locality or translational invariance.)
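Training samples of this kind can be produced with a simple single-spin-flip Metropolis sampler on a periodic square lattice. The text only says "standard equilibrium Monte Carlo techniques", so the choice of Metropolis updates is an assumption, as are the lattice size, coupling, burn-in, and thinning values in the sketch below.

```python
import numpy as np

def metropolis_sweep(spins, J, rng):
    """One sweep of single-spin-flip Metropolis updates on a periodic L x L lattice."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        neighbors = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                     + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        # Energy change of flipping spin (i, j) for H = -J sum_<ij> v_i v_j.
        delta_E = 2.0 * J * spins[i, j] * neighbors
        # Accept with probability min(1, exp(-delta_E)); temperature is set to one.
        if delta_E <= 0 or rng.random() < np.exp(-delta_E):
            spins[i, j] *= -1
    return spins

def sample_ising(L, J, n_samples, burn_in, thin, seed=0):
    """Return an array of shape (n_samples, L*L) of flattened +/-1 configurations."""
    rng = np.random.default_rng(seed)
    spins = rng.choice(np.array([-1, 1]), size=(L, L))
    for _ in range(burn_in):
        metropolis_sweep(spins, J, rng)
    samples = []
    for _ in range(n_samples):
        for _ in range(thin):
            metropolis_sweep(spins, J, rng)
        samples.append(spins.flatten())  # flatten() returns a copy
    return np.array(samples)

# Small illustrative run; all values here are placeholders, not the paper's settings.
samples = sample_ising(L=8, J=0.4, n_samples=5, burn_in=20, thin=2)
```

Near the critical point single-spin-flip dynamics decorrelate slowly, so in practice one would use long burn-in and thinning intervals or a cluster algorithm.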
The architecture of the resulting DNN suggests that it is implementing a coarse-graining scheme similar to block spin renormalization (see Fig. 3). Each spin in a hidden layer couples to a local block of spins in the layer below. This iterative blocking is consistent with Kadanoff’s intuitive picture of how coarse-graining should be implemented near the critical point. Moreover, the blocks coupling to each hidden unit in a layer are of approximately the same size (Fig. 3B,C), and the characteristic size increases with layer (Fig. 3D). Surprisingly, this local block spin structure emerges from the training process, suggesting the DNN is self-organizing to implement block spin renormalization. Furthermore, as shown in Fig. 3E, reconstructions from the coarse-grained DNN can qualitatively reproduce the macroscopic features of individual samples despite the much smaller number of spins in the top layer.
Deep learning is one of the most successful paradigms for unsupervised learning to emerge over the last ten years. The enormous success of deep learning techniques at a variety of practical machine learning tasks ranging from voice recognition to image classification raises natural questions about its theoretical underpinnings. Here, we have demonstrated that there is a one-to-one mapping between RBM-based Deep Neural Networks and the variational renormalization group. We illustrated this mapping by analytically constructing a DNN for the 1D Ising model and numerically examining the 2D Ising model. Surprisingly, we found that these DNNs self-organize to implement a coarse-graining procedure reminiscent of Kadanoff block renormalization. This suggests that deep learning may be implementing a generalized RG-like scheme to learn important features from data.
RG plays a central role in our modern understanding of statistical physics and quantum field theory. A central finding of RG is that the long-distance physics of many disparate physical systems is dominated by the same long-distance fixed points. This gives rise to the idea of universality: many microscopically dissimilar systems exhibit macroscopically similar properties at long distances. Physicists have developed elaborate technical machinery for exploiting fixed points and universality to identify the salient long-distance features of physical systems. It will be interesting to see what, if any, of this more complex machinery can be imported to deep learning. A potential obstacle for importing ideas from physics into the deep learning framework is that RG is commonly applied to physical systems with many symmetries. This is in contrast to deep learning, which is often applied to data with limited structure.
Recently, it was suggested that modern RG techniques developed in the context of quantum systems, such as matrix product states and tensor networks, have a natural interpretation in terms of variational RG Efrati et al. (2014). These new techniques exploit ideas such as entanglement entropy and disentanglers, which create features with a minimum amount of redundancy. It is an open question whether these ideas can be imported into deep learning algorithms. Our mapping also suggests a route for applying real-space renormalization techniques to complicated physical systems. Real-space renormalization techniques such as variational RG have often been limited by their inability to make good approximations. Techniques from deep learning may represent a possible route for overcoming these problems.
Details are given in the SI Materials and Methods. Stacked RBMs were trained with a variant of the code from Hinton and Salakhutdinov (2006). This code is available at https://code.google.com/p/matrbm/. In particular, only the unsupervised learning phase was performed. Individual RBMs were trained with contrastive divergence with momentum, using mini-batches of samples drawn from the 2D Ising model. Additionally, L1 regularization was implemented instead of weight decay. The L1 regularization strength was chosen to ensure that one could not have all-to-all couplings between layers in the DNN. Reconstructions were performed as in Hinton and Salakhutdinov (2006). See Supplementary files for a Matlab variable containing the learned model.
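One common way to combine momentum with an L1 penalty in place of weight decay is sketched below. This is not the Matlab code referenced above: the function name, the argument names, and every numerical value are hypothetical, and `grad` is assumed to be a gradient estimate in the direction that increases the log-likelihood (for instance from contrastive divergence).

```python
import numpy as np

def l1_momentum_update(w, grad, velocity, lr, momentum, l1_strength):
    """One weight update with momentum and an L1 (sparsity) penalty.

    The penalty contributes -l1_strength * sign(w), which pushes every
    nonzero weight toward zero by a constant amount per step."""
    velocity = momentum * velocity + lr * (grad - l1_strength * np.sign(w))
    return w + velocity, velocity

# Illustrative values only: with a zero gradient the weights shrink toward zero.
w = np.array([1.0, -1.0, 0.0])
velocity = np.zeros_like(w)
w_new, velocity = l1_momentum_update(w, np.zeros_like(w), velocity,
                                     lr=0.1, momentum=0.5, l1_strength=0.2)
```

Unlike quadratic weight decay, whose pull is proportional to the weight, the L1 term applies the same pull to small and large weights, which is why it tends to drive many couplings to zero.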
The effective receptive field is a way to visualize which spins in the visible layer couple to a given spin in one of the hidden layers. We denote the effective receptive field matrix of layer $i$ by $r^{(i)}$ and the number of spins in layer $i$ by $N_i$, with the visible layer corresponding to $i = 0$. Each column in $r^{(i)}$ is an $N_0$-dimensional vector that encodes the receptive field of a single spin in hidden layer $i$. It can be computed by composing the weight matrices $W^{(i)}$ encoding the weights between the spins in layers $i-1$ and $i$. To compute $r^{(i)}$, we first set $r^{(1)} = W^{(1)}$ and then used the recursion relationship $r^{(i)} = r^{(i-1)} W^{(i)}$ for $i > 1$. Thus, the effective receptive field of a spin is a measure of how much that hidden spin influences the spins in the visible layer.
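The recursion for the effective receptive field amounts to a running product of the weight matrices. A minimal sketch, assuming each weight matrix is stored with shape (spins in the lower layer, spins in the upper layer) and using hypothetical layer sizes:

```python
import numpy as np

def effective_receptive_fields(weights):
    """Effective receptive field matrix of every hidden layer.

    weights[k] couples layer k to layer k+1 and has shape (N_k, N_{k+1}).
    The first field is the first weight matrix itself; each later field is
    the previous field multiplied by the next weight matrix, so every
    returned matrix has N_0 rows, one per visible spin."""
    fields = [weights[0]]
    for w in weights[1:]:
        fields.append(fields[-1] @ w)
    return fields

# Illustrative stack with placeholder layer sizes 16 -> 8 -> 4.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(8, 4))]
fields = effective_receptive_fields(weights)
```

Each column of a returned matrix can be reshaped to the visible lattice (4 x 4 in this toy case) to display the receptive field of one hidden spin as an image.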