During the last decade, machine learning has experienced a rapid development, both in everyday life with the incredible success of image recognition used in various applications, and in researchgoodfellow2016deep,mehta2019high where many different communities are now involved. This common effort involves fundamental aspects such as why it works or how to build new architectures and at the same time a search for new applications of machine learning to other fields, like for instance improving biomedical images segmentationronneberger2015u or detecting automatically phase transitions in physical systemcarrasquilla2017machine. Machine learning classical tasks are divided into at least two big categories: supervised and unsupervised learning (putting aside reinforcement learning and the more recently introduced approach of self-supervised learning). Supervised learning consists in learning a specific task — for instance recognizing an object on an image or a word in a speech— by giving the machine a set of samples together with the correct answer and correcting the prediction of the machine by minimizing a well-design and easy computable loss function. Unsupervised learning consists in learning a representation of the data given an explicit or implicit probability distribution, hence adjusting a likelihood function on the data. In this latter case, no label is assigned to the data and the result depends thus solely on the structure of the considered model and of the dataset.
In this review, we are interested in a particular model: the Restricted Boltzmann Machine (RBM). Originally called Harmonium Smolensky or product of experts hinton2002training, RBMs were designedackley1985learning to perform unsupervised tasks even though they can also be used to accomplish supervised learning in some sense. RBMs are part of what is called generative models which aim to learn a latent representation of the data in order to later be used to generate statistically similar new data —but different from those of the training set. There are Markov Random Fields (or Ising model for physicists), that were designed as a way to automatically interpret an image using a parallel architecture including a direct encoding of the probability of each “hypothesis” (latent description of a small portion of an image). Later on, RBMs started to take an important role in the Machine Learning community, when a simple learning algorithm introduced by Hinton et al.hinton2002training, the contrastive divergence (CD), managed to learn a non trivial dataset such as MNISTlecun1998gradient. It was in the same period that RBMs became very popular in the ML community for its capability to pre-train deep neural networks (for instance deep auto-encoder), in a layer wise style. And, it was then showed that RBMs are universal approximatorle2008representational of discrete distributions, that is, an arbitrary large RBM can approximate arbitrarily well any discrete distribution (which led to many rigorous results about the modelization mechanism of RBMsmontufar2016restricted). In addition, RBMs offer the possibility to be stacked to form a multi-layer generative model known as a deep Boltzmann machine (DBM)salakhutdinov2009deep. In the more recent years, RBMs continued to attract scientific interest. Firstly because it can be used on continuous or discrete variable very easilykrizhevsky2009learning,MuTa,cho2011improved,yamashita2014bernoulli. Secondly, because the possible interpretations of the hidden nodes can be very usefulhjelm2014restricted,hu2018latent. Interestingly, in some cases, more elaborate methods such as GANgoodfellow2014generative are not working betteryelmen2019creating. Finally it can be used for other tasks as well such as classification or representation learningZHANG20181186. Besides all these positive aspects, the learning process itself of the RBM remains poorly understood. The reasons are twofold: firstly, the gradient can be computed only in an approximated way as we will see; secondly, simple changes may have terrible impact on the learning or, messed up completely with the other meta-parameters. For instance making a naive change of variable in the MNIST datasetcho2011enhanced,tang2011data can affect importantly the training performance111
In MNIST, it is usual to consider binary variableto describe the dataset. Taking instead naively will affect dramatically the learning of the RBM.. Another example, when varying the number of hidden nodes, keeping the other mete-parameters fixed, will affect not only the representational power of the rbm but also the learning dynamics itself.
The statistical physics community, on its side, has a long tradition of studying inference and learning process with its own tools. Using idealized inference problems, it has managed in the past to shed light on the learning process of many ML models. For instance, in the Hopfield modelhopfield1982neural,AmGuSo1,AmGuSo2,AmGuSo3, a retrieval phase was characterized where the maximum number of patterns that can be retrieved can be expressed as a function of the temperature. Another example is the computation of the storage capacity of the Perceptronrosenblatt1958perceptron on synthetic datasetsgardner1988space,Derrida-Gardner. In these approaches, the formalism of statistical physics explains the macroscopic behavior of the model in term of its position on a phase diagram in the large size limit.
From a purely technical point of view, the RBM can be seen for a physicist as a disordered Ising model on a bipartite graph. Yet, the difference with respect to the usual models that are studied in statistical physics is that the phase diagram of a trained RBM involves a highly non-trivial coupling matrix where the components are correlated as a result of the learning process. These dependencies make it non-trivial to adapt classical tools from statistical mechanics, such as the replica theorymezard1987spin. We will illustrate in this article how methods from statistical physics still have helped to characterize both the equilibrium phase of an idealized RBM where the coupling matrix has a structured spectrum, and how the learning dynamics can be analyzed in some specific regimes, both results being obtained with traditional mean-field approaches.
The paper is organized as follows. We will first give the definition of the RBM and review the typical learning algorithm used to train the model in Section (2). Then, in Section (3), we will review different types of RBMs by changing the prior on its variables and show explicit links with other models. In Section (4), we will review two approaches that characterize the phase diagram of the RBM and in particular its compositional phase, based on two different hypothesis over the structure of the parameters of the model. Finally, in Section (5) we will show some theoretical development helping to understand the formation of patterns inside the machine and how we can use the mean-field or TAP equations to learn the model.
2 Definition of the model and learning equations
2.1 Definition of the RBM
The RBM is an Ising model (or equivalently, a Markov random field), defined on a bipartite graph structure over two layers of variables: the visible nodes , for and the hidden nodes , denoting and the number of visible and hidden nodes respectively. In the following, we will use to enumerate the visible variables and for the hidden ones. No connection between any pair of visible or hidden nodes occurs . Hence, we will call the coupling or weight matrix and denote its elements as since no other interactions are present (such as or ). In addition to the pairwise coupling matrix , each visible and hidden node can have a local magnetic field, or local bias (we will refer to it as bias in the rest of the article), respectively named and . We can introduce the following Hamiltonian
from which we define a Boltzmann distribution
where is given by
The structure of the RBM is represented on Figure 1
where the visible nodes are represented by black dots, the hidden nodes by red dots and the weight matrix by blue dotted lines.
The benefit of having a bipartite structure is that, when keeping fixed an entire layer, in our case all the visible or all the hidden nodes, the variables of the other layer become statistically independent. In other words, the measure and
factorizes over the visible/hidden nodes respectively. This is an important property to keep in mind since it will be used in the learning procedure of the model. We will see that this property is widely used during the learning in order to draw new samples using a Monte-Carlo Markov Chain (MCMC) by Gibbs sampling.
Historically, the RBM was first defined with binary
variables for both the visible and the hidden nodes in line with the sigmoid activation function of the perceptron, hence being directly intepretable as spin-glass model of statistical mechanics. A more general definition is considered here by introducing a prior distribution function for both the visible and hidden variables, allowing us to consider discrete or continuous variables. This generalization will allow us to see the links between RBMs and other well-known models of machine learning. From now on we will write all the equations for the generic case using the notationand to indicate an arbitrary choice of “prior” distribution. Averaging over the RBM measure corresponding to Hamiltonian (1) will then be denoted by
where here can represent both discrete sums or integrals and with the RBM distribution defined from now on as
It is worth mentioning that, the choice of the prior distribution can be rephrased in terms of an activation function on the conditioned distribution over the visible or hidden variables. Therefore, when specifying a prior distribution, we will systematically indicate the corresponding activation function for the hidden layer, that is
, which is obtained using the Bayes theorem
Before entering more into the technical details about the RBM, it is important to recall that it has been designed as a “learnable” generative model in practice. In that sense, the usual procedure is to feed the RBM with a dataset, tune its parameter , and such that the equilibrium properties of the learned RBM reproduce faithfully the correlations (or the patterns) present in the dataset. In other words, it is expected that the learned model is able to produce new data statistically similar but distinct from the training set. To do so, the classical procedure is to proceed with a stochastic gradient ascent (to be explained in Section 2.2
) of the likelihood function that can be easily expressed. Usually the learning of ML models involves the minimization of a loss function which happens here to be minus the log likelihood, thus in the following we will refer to Stochastic Gradient Descent (SGD) instead. First, consider a set of datapoints, where is the index of the data. The log-likelihood is given by
The gradient w.r.t. the different parameters will then take a simple form. Let us detail the computation of the gradient w.r.t. the weight matrix. By deriving the log-likelihood w.r.t. the weight matrix we get
where we used the following notation
The gradients for the biases (or magnetic fields) are
It is interesting to note that, in expression (4), the gradient is very similar to the one obtained in the traditional inverse Ising problem with the difference that in the inverse Ising problem the first term (sometimes coined “positive term”) depends only on the data, while for the RBM, we have a dependence on the model (yet simple to compute). Once the gradient is computed, the parameters of the model are updated in the following way
where called the learning rate tunes the speed at which the parameters are updated in a given direction, the superscript being the index of iteration. A continuous limit of the learning process can be formally defined by considering real and replacing by , by and letting .
The difficulty to train an RBM resides in the difficulty to compute the second term of the gradient, the so-called “negative term”, which represents, in the gradient over the weight matrix, the correlation between a visible node and a hidden node under the RBM distribution. Similarly, the gradient over the biases is difficult to compute, where here the negative term is given by the mean value over the visible/hidden nodes. Depending on the value of the parameters of the model (the couplings and the biases), we can either be in a phase where it is easy to sample configurations from , (usually called paramagnetic phase); either be (if unlucky) in a spin glass phase, where it is exponentially hard to escape from the spurious free energy minima; either be (if lucky) in a ”recall” phase where the dominant states correspond to data-like configurations. But even in the latter case, it might be difficult to transit from one state to another one with random jumps if these states are separated by large energy or free energy barriers, as in the Hopfield model for instance.
2.2 Stochastic Gradient Descent
Considering the difficulty to use the eq. (4
) to learn the model (the computation of the negative term scales exponentially with the system size, and Monte Caro Markov chains (MCMC) can be very slow to converge), an efficient approximative scheme name contrastive divergencehinton2002training (CD) has been developed in order to approximate this term. First of all, the dataset is partitioned into small subsets called minibatches, and the gradient ascent is performed sequentially over all these minibatches in a random order. As a result each gradient step is performed only over a small subset of the whole dataset at a time. In order to estimate the negative term, the principle of CD is to start many Monte-Carlo chains in parallel, as many as the number of samples in a minibatch, and to use each sample of the minibatch as an initial condition for the chain. The idea being that starting from desired equilibrium configurations and makingsteps — the number of MC steps is coined in the method : CD-k—, we expect to explore nearby configurations representative of the dataset when the machine is learned; if otherwise the chains flow away they will “teach” the RBM how to adjust the parameters. The interpretation of CD is that it tends to create a basin of ”attraction” centered on the datapoints where nearby configurations will be attractive to these datapoint under the Gibbs dynamics. In practice, starting from a datapoint a random configuration of the hidden layer is sampled; in turn given this a configuration of the visible layer is sampled and so on for steps. For this we take advantage of the bipartite structure of the model to draw a whole visible or hidden layer at once thanks to the factorization of the conditional distribution and :
finally and are used to estimate the negative term. It is clear that the CD-k is not directly minimizing the likelihood, or equivalently the Kullback Leibler (KL) divergence between the data distribution and the Boltzmann one . In reality it minimizes the KL divergence between the data distribution and the distribution obtained after MC steps that is defined as
Incarreira2005contrastive it is argued that this procedure is roughly equivalent to minimizing the following KL difference
up to an extra term considered to be small without much theoretical guaranty. The major drawback of this method is that the phase space of the learned RBM is never explored since we limit ourselves to MC steps around the data configurations, therefore it can lead to estimate very poorly the probability distribution for configurations that lie “far away” from the dataset. A simple modification has been proposed to deal with this issue in tieleman2008training. The new algorithm is called persistent-CD (pCD) and consists of having again a set of parallel MC chains, but instead of using the dataset as initial condition, they are first initialized from random initial conditions and then the state of the chains is saved from one update of the parameters to the next one. In other words, the chains are initialized one time at the beginning of the learning and are then constantly updated a few MC steps further at each update of the parameters. In that case, it is not longer needed to have as many chains as the number of samples in the mini-batch even though in order to keep the statistical error comparable between the positive and the negative term it should be of the same order. More details can be found intieleman2008training about PCD and infischer2014training for a more general introduction to the learning behavior using MC. In Section 5 we will intend to understand some theoretical and numerical aspect of the RBMs learning process.
3 Overview of various RBM settings
Before investigating the learning behavior of RBMs, let us have a glimpse at various RBM settings and their relation to other models, by looking at common possible priors used for the visible and hidden nodes.
3.1 Gaussian-Gaussian RBM
The most elementary setting is the linear RBM, where both visible and hidden nodes have Gaussian priors:
the intrinsic variance of the visible and hidden variables respectively. After summing over hidden variables we get a multi-variate Gaussian distribution over the visible ones. If not very sophisticated, the model is yet interesting because it presents a non-trivial learning dynamics that can be written exactlykarakida2014analyzing,karakida2016dynamical,decelle2018thermodynamics,decelle2017spectral. When using Gaussian prior, the corresponding activation functionare Gaussian centered on :
Let us write the marginal distribution over the visible nodes (we omit the hidden bias since it can be canceled by a redefinition of the visible one), starting from eq. (3) and integrating over the hidden variables we get
where we define the precision matrix . Now we can also identify the conditions for the existence of the measure . We need the matrix
to be strictly positive definite, hence that the highest eigenvalue ofremains strictly below . More interestingly, the Gaussian prior let us write in closed form the stochastic gradients (in fact we solve the deterministic equation, not the stochastic one), hence given us some hints on the nature of the learning dynamics of non-linear RBMs, since in any case we expect a linear regime to take place at the beginning of the learning process. In the present case, we can rewrite eq. (4) as
where is the correlation between the nodes and in the dataset, and
the inverse of the precision matrix. At this point, followingdecelle2018thermodynamics, it is convenient to use the singular value decomposition (SVD) of. We note the eigen-decomposition of the rectangular weight matrix , where the matrix and
correspond to the left (resp. right) eigenvectors ofassociated to the visible (resp. hidden) variables and the eigenvalue associated to the mode . As can be seen in eq. (12), this transformation will diagonalize the interaction term of the Hamiltonian of the system. We can now make the following change of variables
under this change of variable, the Gaussian measure factorizes where and therefore
Writing the distribution in this new basis we obtain
Hence, we can obtain an exact equation for the gradient in the basis of the SVD of the weight matrix . First, we project the equation eq. (13) on the modes of the SVD of
Now to simplify we discard the fluctuations associated to the stochastic gradient, by considering instead the full gradient and an infinitesimal learning rate such that we can consider the iteration time to be continuous and identify . As a result we obtain the time derivative of the matrix decomposed over its eigenmodes
This equation shows that, the gradient update of can be decomposed when projected on the SVD basis of into a gradient over the mode and a rotation of the matrices and . Noticing first that
, we therefore end up with the following dynamics for the singular values:
where in eq (14) denotes the variance of the components of the data on the mode :
This first result tells us that when keeping the matrices and fixed, the SGD on the mode will adjust the value of such that the r.h.s matches the variance in the direction given by , giving the following limit values:
We remark that, if the empirical variance given by the data is smaller than the prior variance of the visible variables the corresponding mode is filtered out. The evolution of the matrices and can also be obtained decelle2017spectral from the following expression in the present case222Actually these equations are given with a wrong sign in decelle2017spectral which is corrected here.
of the infinitesimal rotations of the vectorsand . In the particular case of the Gaussian-Gaussian RBM, we can note the absence of term averaged over the model . This is due to the fact that the SVD corresponds to the eigendecomposition of the RBM measure (that is, the Gaussian measure factorizes over the singular modes) and that the eqs (16-17) involve correlation between modes which are zero here. From eq. (16-17), we see that a steady state is found when a direction is found that diagonalizes the empirical covariance matrix of the dataset.
In short, the Gaussian-Gaussian RBM learns the principal components of the dataset and for each principal axes the weight matrix is adjusted until the strength of the corresponding modes reaches the value given by eq. (15). Of course, modes above threshold acquire a variance which matches the variance of the dataset in this direction . We can somehow say that the Gaussian-Gaussian RBM is performing a sort of SVD of the dataset, keeping only the modes above a given threshold. It is worth noting that an analysis has been done inkarakida2016dynamical where it is shown that updating the parameters of the model using the CD approximation converges toward the same solution as the one obtained by maximizing the likelihood of the model.
We can illustrate the learning mechanism in simple cases where it is possible to solve explicitly the dynamics. First assume that the RBM has found the principal axes, i.e. consider the matrices and to be fixed. In this case the quantity remains constant. Letting
and rescaling time as , equation (14) then rewrites as
and we obtain a solution of the form
For we get a sigmoid type behaviour
To illustrate the rotation of the modes, consider now the situation where there are modes , which are a linear combination of two dominant modes of the data with identical orientation taken in this order, all other modes considered to be already properly aligned with the data. Let then represent the angle between and (and also between and see Figure (2). Equation (16) for this pair of modes rewrites then as
so that finally we get a dynamical system of the form
Note that at fixed and the dynamics of corresponds to the motion of a pendulum w.r.t the variable shown on Figure 2.
The Gaussian-Gaussian case is interesting as a solvable model of RBM but of limited scope, since reduces in the end to a multivariate Gaussian. Next, a simple non-linear RBM which remains exactly solvable is based on the so-called spherical modelberlin1952spherical,stanley1968spherical. For this model, it is possible to compute the phase diagram and the equilibrium states once the coupling matrix is given —more precisely, when the spectral density of the coupling matrix is given. Here we chose the following priors to impose a spherical constraint on the hidden nodes:
where is a parameter of the modeldecelle2020gaussian. The interest of such an RBM is first that the spherical constraint can be dealt with analyticallydecelle2020gaussian,genovese2020legendre. Secondly the model can exhibit a phase transition unlike the Gaussian-Gaussian case. Absorbing the parameter in the definition of the weight matrix, to follow the computation ofdecelle2020gaussian, a simple analysis in the thermodynamic limit tells us that the phase transition takes place when the parameter exceeds the value , where depends on the value of the highest mode and of the form of the spectrum of (typically, , where the pre-factor depends on the form of the spectrum). The condensation along this mode of the visible (resp. hidden) magnetization is then given by
where we defined . This type of RBM is again of limited scope to represent data. In the thermodynamic limit a finite number of modes can condense. They necessarily accumulate at the top of the spectrum of the weight matrix and represent a distribution concentrated on an -dimensional sphere in absence of external fields while other non-condensed modes are responsible for transverse Gaussian fluctuations. The dynamical aspect of this model will be discussed in Section (5).
To end up this section let us also mention that the finite size regime is amenable to an exact analysis when restricting the weight matrix spectrum to have the property of being doubly degenerated (seedecelle2020gaussian for details).
The case of the Gaussian mixture if rarely viewed like that, fits actually perfectly the RBM architecture. Consider here the case of Gaussian visible nodes and a set of discrete hidden variables with a constraint corresponding to the softmax activation function nijman1997symmetry:
With this formulation, we indeed see that the conditional probability of activating a hidden node is a softmax function
It is easy from this expression to recognize the equations of the Gaussian mixture model (GMM)mackay2003information,bishop2006pattern, where the latent variableindicates if a sample belong or not to the center . The position of the associated center is given by the vector . It is even clearer when writing the marginal over the visible nodes after summing over the hidden nodes in eq. (3)
the weight of the mode in the Gaussian mixture centered in . Now we can see that the extra parameter can be absorbed in the definition of the weight matrix . It turns out that the positive term of the gradient in equation (4) (ignoring
) corresponds to the gradient that is obtained in the GMM. This can be reformulated into the Expectation Maximization (EM) update by considering thatdo not depend on , hence doing the “expectation” step:
If we impose that the gradient is zero, doing now the “maximization” step, we obtain
where the l.h.s. is to be understood as the new values for the parameters while the conditional distribution on the r.h.s. depends on . For an RBM, one would also compute the negative term of the gradient, involving the derivative of w.r.t. . We obtain the negative term
Again, we can recover with equation (24) the EM update for the density of the Gaussian mode in the GMM, by first considering that the conditioned distribution does not depend on (expectation step) and by putting the l.h.s. to zero (maximization step). The fact that when using the RBM formalism we do not obtain directly the same EM equations as in the GMM is due to the different parametrization of the parameters. In the GMM, the density of each Gaussian is defined right from the beginning as an independent parameter while when using the RBM, the density of the Gaussian depends on other parameters such as the weight matrix .
Phase transition in the learning process —
an interesting phenomena occurs in this model when learning position of the centers of the Gaussian while submitting the variances of the Gaussian to an annealing processrose1990statistical. First of all, starting from a very high variance (equivalently, very high temperature), we can convince ourselves that the learning will end up finding the center of mass of the dataset. Let us therefore consider that we centered the dataset beforehand: ,
. Then, reducing slowly the variance of each component of the mixture, we can look for the moment at which point the degenerate solution corresponding to all the centers placed at the center of masses of the dataset becomes unstable. Linearizing the EM equations (23) around this point with and , where the are small perturbations, we can derive the threshold where the linear perturbations get amplified. The linear stability analysis leads to the following equations for the perturbation :
where is the covariance matrix of the dataset. From this expression, one sees that when the variance is higher that the largest eigenvalue of , i.e. , the solution is stable. Then, when , the solution is unstable and the system starts to learn something more about the dataset besides its center of mass. It is interesting to note that this threshold is very similar to the one obtained in equation (15) for the Gaussian-Gaussian RBM. In this model, it is then possible to study the cascade of phase transition, occurring in a hierarchical way on structured datasetsKloppenburg1997,Kappen2000. We stress here that, even if it is possible to project the learning equations on the SVD of the weight matrix as in the two previous analysis, it does not provide much more insight since this case cannot be solved exactly by this transformation.
It is also interesting to investigate the behavior of the exact gradient (not using EM) in the presence of a learning rate . When using the gradient, the update equations are given by . In that case we obtain the following equation for the linear stability
Interestingly, the threshold does not depend on the value of in that case, meaning that the instability is a generic properties of the learning dynamics. The only change is the speed with which the instabilities will develop.
3.4 Bernoulli-Gaussian RBM
The next case is the Bernoulli-Gaussian RBM where we consider the following prior