Thermodynamics of Restricted Boltzmann Machines and related learning dynamics

03/05/2018 ∙ by Aurelien Decelle, et al.

We analyze the learning process of the restricted Boltzmann machine (RBM), a type of generative model used in the context of unsupervised learning. In a first step, we investigate its thermodynamic properties by considering a realistic statistical ensemble of RBMs, assuming the information content of an RBM to be mainly reflected by the spectral properties of its weight matrix W. A phase diagram is obtained which at first sight looks similar to that of the Sherrington-Kirkpatrick (SK) model with ferromagnetic couplings. The main difference resides in the structure of the ferromagnetic phase, which may or may not be of compositional type, depending mainly on the kurtosis of the distribution of the components of the singular vectors of W. In a second step, the learning dynamics of an RBM trained on arbitrary data is studied in the thermodynamic limit. A "typical" learning trajectory is shown to solve an effective dynamical equation, based on the aforementioned ensemble average and explicitly involving order parameters obtained from the thermodynamic analysis. This accounts in particular for the evolution of the dominant singular values and for how this evolution is driven by the input data: in the linear regime, at the beginning of the learning, they correspond to unstable deformation modes of W reflecting the dominant covariance modes of the data. In the non-linear regime it is seen how the selected modes interact in later stages of the learning procedure, eventually imposing a matching between the order parameters and their empirical counterparts estimated from the data. Experiments on both artificial and real data illustrate these considerations, showing in particular how the RBM operates in the compositional ferromagnetic phase.







1 Introduction

The Restricted Boltzmann Machine (RBM) [1] is an important machine learning tool used in many applications, by virtue of its ability to model complex probability distributions. It is a neural network which serves as a generative model, in the sense that it is able to approximate the probability distribution corresponding to the empirical distribution of any set of high-dimensional data points living in a discrete or real space. From the theoretical point of view, the RBM is of high interest as it is one of the simplest neural network generative models, and the probability distribution that it defines has a simple analytic form. Moreover, there are clear connections between RBMs and well-known disordered systems in statistical physics. As an example, when the data are composed of vectors with binary components, the discrete RBM takes the form of a heterogeneous Ising model made of one layer of visible units (the observed variables) connected to one layer of hidden units (the latent variables building up the dependencies between the visible ones), in which couplings and fields are obtained from the training data through a learning procedure. In order to build more powerful models, RBMs can be stacked to form “deep” architectures. In such a case, they can form a multi-layer generative model known as a Deep Boltzmann Machine (DBM) [2], or they can be stacked and trained layerwise as a pre-training procedure for neural networks [3].

The standard learning algorithms in use are contrastive divergence [4] (CD) and the refined persistent CD [5] (PCD), which are based on a quick Monte Carlo estimation of the response functions of the RBM and are efficient and well documented [6]. Nevertheless, despite some interesting interpretations of CD in terms of non-equilibrium statistical physics [7], the learning of RBMs remains a set of obscure recipes from the statistical physics point of view: hyperparameters (such as the size of the hidden layer) are set empirically, without any theoretical guidelines.

Historically, statistical physics played a central role in studying the theoretical foundations of neural networks. In particular, during the 1980s many works on the Hopfield model [8, 9, 10, 11] managed to define its learning capacity and to compute the number of independent patterns that it could store. It is worth noticing that, as RBMs are ultimately defined as a Boltzmann distribution with pairwise interactions on a bipartite graph, they can be studied in a way similar to that used for the Hopfield model. The analogy is even stronger since the connection between the Hopfield model and RBMs has been made explicit in the case of Gaussian hidden variables [12], the number of patterns of the Hopfield model then corresponding to the number of hidden units. Motivated by a renewed excitement for neural networks, recent works actually propose to exploit the statistical physics formulation of the RBM to understand its learning capacity and how mean-field methods can be exploited to improve the model. In [13, 14, 15], mean-field based learning methods using TAP equations are developed. TAP solutions are usually expected to define a decomposition of the measure in terms of pure thermodynamical states; they are useful both as an algorithm to compute the marginals of the variables of the model and as a way to identify the pure states when these are yet unknown. For instance, in a sparse explicit Boltzmann machine (i.e. without latent variables) this implicit clustering can be done by means of belief propagation fixed points (a somewhat different form of the TAP equations) combined with simple empirical learning rules [16]. In [17, 18], an analysis of the static properties of RBMs is carried out for a given weight matrix W, in order to understand collective phenomena in the latent representation, i.e. the way latent variables organize themselves in a compositional phase [19, 20] to represent actual data.
These analyses make use of the replica trick (or equivalent methods) under the common assumption that the components of the weight matrix are i.i.d. While this approach may give some insight into the retrieval phase, the approximation is problematic: for a realistic RBM, i.e. one learned on data, the learning mechanism introduces correlations within the weights of W, and it then seems rather crude to keep assuming independence and hope to understand the realistic statistical properties of the model.

Concerning the learning procedure of neural networks, many recent statistical physics based analyses have been proposed, most of them within the teacher-student setting [21]. This imposes a rather strong assumption on the data, namely that they are generated from a model belonging to the parametric family of interest, thereby hiding the role played by the data themselves in the procedure. From the analysis of related linear models [22, 23], it is already a well-established fact that, in the linear case, a selection of the most important modes of the singular value decomposition (SVD) of the data is performed. In fact, in the simpler context of linear feed-forward models the learning dynamics can be fully characterized by means of the SVD of the data matrix [24], showing in particular the emergence of each mode in order of importance with respect to the corresponding singular values.
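This mode-by-mode picture is easy to visualize on a toy problem. The sketch below (a hedged illustration on hypothetical synthetic data, not the exact models of [22, 23, 24]) runs gradient descent on a linear reconstruction loss; each covariance mode of the data then relaxes at a rate proportional to its eigenvalue, so the weights align with the dominant mode first:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20

# Synthetic data with one dominant covariance direction (hypothetical setup)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
X = rng.standard_normal((n, d)) + 4.0 * rng.standard_normal((n, 1)) * u

C = X.T @ X / n                         # empirical covariance
evals, evecs = np.linalg.eigh(C)        # eigenvalues in ascending order
top, weak = evecs[:, -1], evecs[:, 0]   # dominant and weakest covariance modes

# Gradient descent on the reconstruction loss ||X - X W^T||^2 / (2n)
# reduces to W <- W + lr * (I - W) C: each covariance mode relaxes
# toward 1 at a rate proportional to its eigenvalue.
W = np.zeros((d, d))
lr = 0.01
for _ in range(50):
    W += lr * (C - W @ C)

align_top = top @ W @ top     # close to 1: dominant mode learned first
align_weak = weak @ W @ weak  # still far from 1 after the same number of steps
```

After 50 steps the projection of W on the dominant mode is essentially converged while weak modes are still far from their asymptote, which is the linear-regime ordering discussed above.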

First steps along this line have been taken in [25], in the context of a general RBM, to address the shortcomings of previous analyses, in particular concerning the assumptions about the weight distribution. To this end it has been proposed to characterize both the learned RBM and the learning process itself by means of the SVD spectrum of the weight matrix, in order to single out the information content of the RBM. It is assumed that the SVD spectrum splits into a continuous bulk of singular vectors corresponding to noise and a set of outliers that represent the information content. This makes it possible to go beyond the usual unrealistic assumption of i.i.d. weights made when analyzing RBMs. Proceeding along this direction, in the present work we first present a thermodynamic analysis of RBMs under these more realistic assumptions on the weight matrix. Then, on the same basis, the learning dynamics of RBMs is studied by direct analysis of the dynamics of the SVD modes, both in the linear and non-linear regimes.


Figure 1: Bipartite structure of the RBM.

The paper is organized as follows: in Section 2 we introduce the RBM model and its associated learning procedures. Section 3 presents the static thermodynamic properties of the RBM under realistic hypotheses on its weights: a statistical ensemble of weight matrices is discussed in Section 3.1; mean-field equations in the replica-symmetric (RS) framework are given in Section 3.2, and the corresponding phase diagram is studied in Section 3.3, with a proper delimitation of the RS domain where the learning procedure is supposed to take place. The ferromagnetic phase is studied in great detail in Section 3.4, looking in particular at the conditions leading to a compositional phase. Section 4 is devoted to the learning dynamics. In Section 4.1, a deterministic learning equation is derived in the thermodynamic limit and a set of dynamical parameters is shown to emerge naturally from the SVD of the weight matrix. This equation is analyzed for linear RBMs in Section 4.2 in order to identify the unstable deformation modes of W that result in the first patterns emerging at the beginning of the learning process; the non-linear regime is described in Section 4.3, on the basis of the thermodynamic analysis, by numerically solving the effective learning equations in simple cases. Our analysis is finally illustrated and validated in Section 5 by actual tests on the MNIST dataset.

2 The RBM and its associated learning procedure

An RBM is a Markov random field with pairwise interactions defined on a bipartite graph formed by two layers of non-interacting variables: the visible nodes and the hidden nodes, representing respectively data configurations and latent representations (see Figure 1). The former, denoted v_i (i = 1, …, N_v), correspond to explicit representations of the data, while the latter, denoted h_j (j = 1, …, N_h), are there to build arbitrary dependencies among the visible units; they play the role of an interacting field among visible nodes. Usually the nodes are binary-valued (of Boolean type or Bernoulli distributed), but Gaussian distributions or, more broadly, arbitrary distributions on real-valued bounded support are also used, ultimately making RBMs suited to more heterogeneous data sets. Here, to simplify, visible and hidden nodes are taken as binary variables v_i, h_j ∈ {−1, +1} (using ±1 values gives the advantage of working with symmetric equations, hence avoiding the “hidden” biases on the variables that appear when considering {0, 1} variables). As in the Hopfield model [8], which can actually be cast into an RBM [12], an energy function is defined for a configuration of nodes,

E(v, h) = − ∑_ij v_i W_ij h_j − ∑_i a_i v_i − ∑_j b_j h_j,

and this is exploited to define a joint distribution between visible and hidden units, namely the Boltzmann distribution

p(v, h) = e^{−E(v, h)} / Z,
where W is the weight matrix and a and b are biases, or external fields, on the variables; Z is the partition function of the system. The distribution over the visible variables is then obtained by summing over the hidden ones. In this context, learning the parameters of the RBM means that, given a dataset of M samples of N_v variables, we want to infer values of W, a and b such that new data generated by sampling this distribution are similar to the input data. The general method to infer the parameters is to maximize the log likelihood of the model, where the pdf (2) has first been summed over the hidden variables:

L = (1/M) ∑_m log ∑_h p(v^(m), h).
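On a tiny RBM this distribution and its normalization can be checked by brute-force enumeration. The sketch below (with hypothetical sizes and random parameters, chosen only so that Z is enumerable) computes the partition function and the visible marginal for ±1 units:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nv, nh = 3, 2                       # tiny layers so Z is enumerable
W = 0.5 * rng.standard_normal((nv, nh))
a, b = np.zeros(nv), np.zeros(nh)   # biases / external fields

def energy(v, h):
    """E(v,h) = -v.W.h - a.v - b.h for +-1 units."""
    return -(v @ W @ h) - a @ v - b @ h

vs = [np.array(c) for c in itertools.product([-1, 1], repeat=nv)]
hs = [np.array(c) for c in itertools.product([-1, 1], repeat=nh)]

# partition function: sum of Boltzmann weights over all joint configurations
Z = sum(np.exp(-energy(v, h)) for v in vs for h in hs)

def p_visible(v):
    """Marginal p(v): Boltzmann weight summed over hidden configurations."""
    return sum(np.exp(-energy(v, h)) for h in hs) / Z

total = sum(p_visible(v) for v in vs)   # normalization check, should be 1
```

The log likelihood of a dataset is then the average of log p_visible over the samples; for realistic layer sizes this enumeration is of course intractable, which is precisely why the approximate methods below are needed.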
Different learning methods have been set up and proven to work efficiently, in particular the contrastive divergence (CD) algorithm of Hinton [4] and, more recently, TAP-based learning [13]. They all amount to expressing the gradient ascent on the likelihood as

ΔW_ij = η ( ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model ),
Δa_i = η ( ⟨v_i⟩_data − ⟨v_i⟩_model ),
Δb_j = η ( ⟨h_j⟩_data − ⟨h_j⟩_model ),

where η is the learning rate. The main problem is the model terms on the right-hand side of (4-6): these are not tractable, and the various learning methods basically differ in the way they estimate those terms (Markov chain Monte Carlo, naive mean-field, TAP…). For an efficient learning, the data terms must also be approximated, by making use of random mini-batches of data at each step.
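The CD estimate of the model terms by a single Gibbs step can be sketched as follows. This is a simplified illustration for ±1 units, not the exact implementation of [4]; for ±1 units the conditional means are tanh of the local fields, and sampling uses the corresponding probabilities (1 + tanh)/2:

```python
import numpy as np

def cd1_step(W, a, b, v_data, lr, rng):
    """One CD-1 parameter update for an RBM with +-1 units (sketch)."""
    # positive phase: conditional hidden means given the data
    ph_data = np.tanh(v_data @ W + b)           # <h_j | v> = tanh(local field)
    # negative phase: sample h, one Gibbs step back to v, then hidden means
    h = np.where(rng.random(ph_data.shape) < (1 + ph_data) / 2, 1.0, -1.0)
    v = np.where(rng.random((h.shape[0], W.shape[0]))
                 < (1 + np.tanh(h @ W.T + a)) / 2, 1.0, -1.0)
    ph_model = np.tanh(v @ W + b)
    # gradient ascent on the likelihood: data term minus model term
    n = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v.T @ ph_model) / n
    a += lr * (v_data.mean(0) - v.mean(0))
    b += lr * (ph_data.mean(0) - ph_model.mean(0))
    return W, a, b

rng = np.random.default_rng(2)
nv, nh, n = 8, 4, 100
W = 0.01 * rng.standard_normal((nv, nh))
a, b = np.zeros(nv), np.zeros(nh)
v_data = np.where(rng.random((n, nv)) < 0.5, 1.0, -1.0)  # a dummy mini-batch
W, a, b = cd1_step(W, a, b, v_data, lr=0.05, rng=rng)
```

PCD differs only in that the negative chain is persistent across updates instead of being re-initialized at the data.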

3 Static thermodynamical properties of an RBM

3.1 Statistical ensemble of RBMs

When analyzing the thermodynamic properties of RBMs, it is common to assume that the weights are i.i.d. random variables, as for example in [20, 17, 18]. This generally leads to a Marchenko-Pastur (MP) distribution [27] of the singular values of W, which is unrealistic.

In order to clarify our notation, let us recall the definition of the singular value decomposition (SVD). As a generalization of the eigenmode decomposition to rectangular matrices, the SVD of the N_v × N_h weight matrix W of an RBM is given by

W = U Σ V^T,

where U is an orthogonal matrix whose columns are the left singular vectors u^α, V is an orthogonal matrix whose columns are the right singular vectors v^α, and Σ is a diagonal matrix whose elements are the singular values w_α. The separation into left and right singular vectors is due to the rectangular nature of the decomposed matrix, and the similarity with the eigenmode decomposition is revealed by the following SVD equations:

W v^α = w_α u^α,    W^T u^α = w_α v^α.
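These relations are straightforward to verify numerically; a minimal sketch on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
Nv, Nh = 6, 4
W = rng.standard_normal((Nv, Nh))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T

# SVD equations: W v^a = w_a u^a  and  W^T u^a = w_a v^a
for alpha in range(len(s)):
    assert np.allclose(W @ V[:, alpha], s[alpha] * U[:, alpha])
    assert np.allclose(W.T @ U[:, alpha], s[alpha] * V[:, alpha])

# equivalently, the v^a are eigenvectors of W^T W with eigenvalues w_a^2,
# which is the link to the eigenmode decomposition mentioned above
assert np.allclose(W.T @ W @ V, V * s**2)
```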

In [25] it is argued that the MP distribution of SVD modes actually corresponds to the noise of the weight matrix, while the information content of the RBM is better expressed by the presence of SVD modes outside of this bulk. This leads us to write the weight matrix as

W_ij = ∑_{α=1}^K w_α u_i^α v_j^α + r_ij,

where the w_α are isolated singular values (describing a rank-K matrix), the u^α and v^α are the dominant singular vectors of the SVD, and the r_ij are i.i.d. terms corresponding to noise, with zero mean and a variance scaling inversely with the system size, so that the noise singular values remain of order one. The u^α and v^α are two sets of respectively N_v- and N_h-dimensional orthonormal vectors, so that their components are typically of order N_v^{−1/2} and N_h^{−1/2}. We assume K to be the rank of the deterministic part, with non-degenerate singular values w_1 > w_2 > … > w_K > 0. Note that in the thermodynamic limit, with the ratio of layer sizes and K kept fixed, W has a spectral density composed of a Marchenko-Pastur bulk and a set of K discrete modes.

The interpretation of the noise term is the presence of an extensive number of modes at the bottom of the spectrum, along which the variables won’t be able to condense but which still contribute to the fluctuations. In the present form our model of RBM is similar to the Hopfield model and its recent generalizations [28], the patterns being represented by the SVD modes outside of the bulk. The main difference, in addition to the bipartite structure of the graph, is the non-degeneracy of the singular values w_α. The choice made here is to consider K finite, which means that a feature α is detected when an extensive number of spins align with the corresponding singular vector. In addition, this allows us to assume simple distributions for the components of u^α and v^α (for instance, considering them i.i.d.). Altogether, this defines the statistical ensemble of RBMs to which we restrict our analysis of the learning procedure.
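A numerical sketch of this ensemble (with hypothetical sizes, noise scale, and planted singular values; the bulk-edge formula below assumes the stated noise normalization) shows the K planted modes as outliers above the noise bulk of the singular value spectrum:

```python
import numpy as np

rng = np.random.default_rng(4)
Nv, Nh, K = 400, 300, 3
sigma = 1.0

# planted orthonormal singular vectors and order-one singular values
u, _ = np.linalg.qr(rng.standard_normal((Nv, K)))
v, _ = np.linalg.qr(rng.standard_normal((Nh, K)))
w = np.array([6.0, 5.0, 4.0])

# i.i.d. noise, variance scaled so the bulk singular values stay of order one
R = sigma * rng.standard_normal((Nv, Nh)) / (Nv * Nh) ** 0.25
W = (u * w) @ v.T + R

s = np.linalg.svd(W, compute_uv=False)       # sorted in decreasing order
kappa = Nv / Nh
bulk_edge = sigma * (kappa ** 0.25 + kappa ** (-0.25))  # approximate MP edge
n_outliers = int((s > bulk_edge + 0.5).sum())           # expect K of them
```

The K largest singular values sit close to the planted w_α (slightly shifted upward by the noise), while the remaining ones fill a bulk below the edge.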

Another approach would be to consider K extensive, thereby assuming that all modes can potentially condense even though they are associated with subdominant singular values. In that case, the separation between the condensed modes and the rest should be made when order parameters are introduced, and the noise would then correspond to uncondensed modes. If the number of condensed modes is assumed to be extensive, we should instead consider an average over the orthogonal group, which would lead to a slightly different mean-field theory [29, 30].

3.2 Replica-symmetric mean-field equations

Our analysis in the thermodynamic limit follows classical treatments using replicas, such as [31, 9] for the Hopfield model or [17] for bipartite models. The starting point is to express the average, over the modes and the noise part of W, of the log partition function in (2) with the help of the replica trick. The average over the noise is performed first. After this averaging, four sets of order parameters are introduced with the help of two distinct Hubbard-Stratonovich transformations: the first one decouples the terms quadratic in the replicated spins, while the second one is aimed at extracting the magnetization contributions correlated with the modes.

These variables represent, respectively, the correlations of the hidden [resp. visible] states with the left [resp. right] singular vectors, and the Edwards-Anderson (EA) order parameters measuring the correlation between replicas of hidden or visible states; the corresponding averages are taken with respect to the rescaled components of the SVD modes. The transformations involve pairs of complex integration variables because of the asymmetry introduced by the two-layer structure, in contrast to fully connected models.

We then obtain a representation of the replicated partition function in terms of these order parameters. Since the K singular vectors form an incomplete basis, we also need to take care of the potential residual transverse parts of the visible and hidden configurations, i.e. their components orthogonal to the retained modes. To keep things tractable, both transverse parts will be considered negligible in the sequel. Taking these components into account would add a random field to the effective RS field of the variables and eventually lead to a richer set of saddle-point solutions. Note that the order of magnitude of the transverse parts is at this stage an assumption: if they were uncorrelated with the modes, they would scale as O(N^{−1/2}). Moreover, regarding the ensemble average, we will consider the singular values and their number fixed in the sequel.

The thermodynamic properties are obtained by making a saddle-point approximation, letting the system size go to infinity first and taking the limit of the number of replicas n → 0 afterwards. We restrict the discussion here to RS saddle points [32]. The breakdown of RS can actually be determined by computing the so-called AT line [33] (see Appendix A); at this point we assume unbroken replica symmetry. The set of EA parameters then reduces to a pair of spin-glass parameters q and q̄, one per layer and identical for all replica pairs, while the quenched magnetizations along the SVD directions are now represented by a single pair (m_α, m̄_α) per mode.

Taking the n → 0 limit yields the limiting free energy. Assuming a replica-symmetric phase, the saddle-point equations (16,17) determine the magnetizations (m_α, m̄_α) and the spin-glass parameters (q, q̄), the brackets denoting averages over a Gaussian variable and over the rescaled components of the SVD modes. We note that the equations are symmetric under the exchange of the two layers, i.e. under swapping m_α with m̄_α and q with q̄, given that the components of u^α and v^α have the same distribution. In addition, for independently distributed components and vanishing fields, solutions corresponding to non-degenerate magnetizations have symmetric counterparts: each pair of non-vanishing magnetizations (m_α, m̄_α) can be negated independently, generating new solutions. So to one solution presenting k condensed modes, there correspond 2^k distinct solutions.

3.3 Phase Diagram

The fixed-point equations (16,17) can be solved numerically to tell us how the variables condense on the SVD modes within each equilibrium state of the distribution, and whether a spin-glass or a ferromagnetic phase is present. The important point here is that, with K finite and a non-degenerate spectrum, the mode with the highest singular value dominates the ferromagnetic phase.

In the absence of bias, and once the noise variance is interpreted as a temperature and the singular values w_α as ferromagnetic couplings, we get a phase diagram similar to that of the Sherrington-Kirkpatrick (SK) model, with three distinct phases (see Figure 2):

  • a paramagnetic phase (m_α = 0, q = 0) (P),

  • a ferromagnetic phase (m_α ≠ 0) (F),

  • a spin glass phase (q ≠ 0; m_α = 0) (SG).
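The simplest caricature of the condensation mechanism is a single mode with the spin-glass terms dropped; this is a Curie-Weiss-like reduction of the saddle-point equations, not the full bipartite system (16,17), used here only to illustrate the P-F transition at unit coupling:

```python
import numpy as np

def magnetization(w, m0=0.5, tol=1e-12, max_iter=100000):
    """Solve m = tanh(w*m) by fixed-point iteration (single-mode toy model)."""
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(w * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

m_para = magnetization(0.8)   # coupling below 1: only the m = 0 solution
m_ferro = magnetization(1.5)  # coupling above 1: non-zero magnetization
```

In the full problem the role of the coupling is played by the singular value of the mode, rescaled by the noise, which is why only modes above a threshold can condense.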

In general, the lines separating the different phases correspond to second order phase transitions and can be obtained by a stability analysis of the Hessian of the free energy. They are related to unstable modes of the linearized mean-field equations and correspond to an eigenvalue of the Hessian becoming negative.

The (SG-P) line is obtained by looking at the Hessian in the spin-glass sector, from which it results that the spin glass phase develops when the noise variance exceeds a threshold (note that in [17] a different dependence on the layer sizes is found; this dependence is hidden in our definition of the noise variance, which differs from theirs by a factor involving the layer-size ratio). This transition line can be understood by taking directly into account the spectral properties of the weight matrix. Classically, this is done with the help of the linearized TAP equations, exploiting the Marchenko-Pastur distribution [32]. In our context, the linearized TAP equations couple the magnetizations of the two layers through W, given the variance of the weights in the absence of dominant modes. We can then show that the paramagnetic phase becomes unstable when the highest eigenvalue of the matrix on the right-hand side equals 1: if w is a singular value of W, the corresponding eigenvalues are increasing functions of w, from which it is clear that the largest eigenvalue corresponds to the largest singular value. Since the largest noise singular value sticks to the edge of the Marchenko-Pastur bulk, the transition line is readily obtained.
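The fact that the largest singular value of a pure-noise matrix sticks to the Marchenko-Pastur bulk edge, so that only modes appreciably above the edge carry information, can be checked numerically. The sketch below uses an assumed normalization (entries of variance 1/N_h), for which the bulk edge is 1 + sqrt(N_v/N_h):

```python
import numpy as np

rng = np.random.default_rng(5)
Nv, Nh = 1000, 500

# i.i.d. noise with entries of variance 1/Nh (assumed normalization)
R = rng.standard_normal((Nv, Nh)) / np.sqrt(Nh)
s_max = np.linalg.svd(R, compute_uv=False)[0]

# bulk edge for this normalization: largest singular value of pure noise
edge = 1 + np.sqrt(Nv / Nh)
```

The deviation of s_max from the edge is tiny at these sizes (Tracy-Widom fluctuations of order N^{-2/3}).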

For the (F-SG) frontier we can look at the sector of the Hessian corresponding to the emergence of a single mode, written in the spin-glass phase. From this it is clear that the first mode to become unstable is the one with the highest singular value, and this occurs when the spin-glass parameters q and q̄, solutions of (16,17), reach a value fixed by that singular value. As for the SK model, this line appears to lie well below the de Almeida-Thouless (AT) line, the line above which the RS solution is stable (see Figure 2, and Appendix A for the computation of the AT line). This means that in principle a replica-symmetry-breaking treatment would be necessary to properly separate the two phases. However, we leave this point aside since we are mainly interested in the practical aspects, namely the ability of the RBM to learn arbitrary data, and so we are mostly concerned with the ferromagnetic phase above the AT line.

For the (P-F) line we consider the same sector of the Hessian, now written in the paramagnetic phase, i.e. setting q = q̄ = 0 in the above equation; this simply yields the emergence of the mode with the largest singular value once it exceeds the instability threshold.

Note that all of this is independent of how the statistical average over the components of u^α and v^α is performed. Instead, as we shall see later on, the way of averaging influences the nature of the ferromagnetic phase.

Figure 2: Phase diagram in the absence of bias and with a finite number of modes, with Gaussian and Laplace distributions for the components of u^α and v^α. The dotted line separates the spin glass phase from the ferromagnetic phase under the RS hypothesis. The RS phase is unstable below the AT line. The influence of the layer-size ratio on the AT and SG-F lines is shown. In all cases, the hypothetical SG-F line lies well inside the broken-RS phase. Inset: high-temperature stability gap corresponding to a fixed point associated with a given mode, expressed as a function of its singular value, for various distributions.

Regarding the stability of the RS solution, the computation of the AT line reported in Appendix A is similar to the classical one for the SK model, though slightly more involved. In fact, we were not able to fully characterize, in replica space, all the possible instabilities of the Hessian which could potentially lead to a breakdown of replica symmetry. At least the instability responsible for the RS breakdown in the ordinary SK model has a counterpart in the bipartite case, which gives a necessary condition for the stability of the RS solution. For symmetric layers the terms below the radical become identical, and the condition reduces to that of the SK model, except for the mode averages, which are not present in the SK model. Figure 2 shows the influence on the phase diagram of the layer-size ratio and of the type of average made on the components of u^α and v^α.

3.4 Nature of the Ferromagnetic phase

Some subtleties arise when considering the various ways of averaging over the components of the singular vectors. In [19, 20] the importance for networks of being able to reproduce compositional states, structured as combinations of hidden variables, is emphasized. In our representation we don’t have direct access to this property but, in some sense, to its dual, given by states corresponding to combinations of modes. Their presence and their structure are rather sensitive to the way the average over the components of u^α and v^α is performed. In this respect, the case in which these components are i.i.d. Gaussian is very special: all fixed points associated with dominated modes can be shown to be unstable, and fixed points associated with combinations of modes are not allowed. To see this, first notice that in this case the magnetization part of the saddle-point equations (16,17) takes the simple closed form (18,19).

Since the role of the bias is mainly to introduce some asymmetry between otherwise degenerate fixed points, obtained from one another by sign reversal of at least one pair (m_α, m̄_α), let us analyze the situation without fields. We immediately see that, as long as the singular values are non-degenerate, only one single mode may condense at a time: if mode α condenses, the fixed-point relation it must satisfy can be verified by only one mode at a time. Looking at the stability of the fixed points, we see that only the fixed point associated with the largest singular value is actually stable (details are reported after the introduction of Lemma 3.1).

For other distributions, like uniform, Bernoulli or Laplace, stable fixed points associated with many different single modes, or with combinations of modes, can instead exist and contribute to the thermodynamics. In order to analyze this question in more general terms, we first rewrite the mean-field equations in a convenient way, which requires some preliminary remarks. We restrict the discussion to i.i.d. variables, so that we can consider single-variable distributions; joint distributions are distinguished from single-variable distributions by the use of bold symbols, with K the (finite) number of modes susceptible to condense.

Given the single-variable distribution of the components and assuming it to be even, we define a related distribution attached to each mode α, obtained by reweighting the original distribution according to the condensation on that mode. This distribution has some useful properties.

Lemma 3.1.

Given that the original distribution is centered with unit variance and kurtosis κ, the distribution attached to mode α is a centered probability distribution whose variance involves κ and the magnetization of the mode.

To see this, consider the moments of the attached distribution. For odd order they vanish, while for even order they relate to higher-order moments of the original distribution. The lemma then follows from the fact that the original distribution has unit variance.

In this respect, the Gaussian averaging is special because the attached distribution is then independent of the mode and coincides with the original Gaussian. The mean-field equations (16,17) corresponding to the magnetizations can be rewritten in a form similar to (18,19) by introducing effective, mode-dependent variables, yielding equations (21,22). This rewriting will also prove very useful in the next section, when analyzing the learning dynamics.

Let us now assume, in the absence of bias, a non-degenerate fixed point associated with some given mode α, with finite magnetizations m_α and m̄_α. The fixed-point equation imposes a relation between these magnetizations and the singular value w_α (equation (25)). The stability of such a fixed point with respect to any other mode β is related to the positive definiteness of the corresponding block of the Hessian. In the Gaussian averaging case, given (25), this block necessarily has negative eigenvalues whenever w_β > w_α. This means that if the mode α is dominated by another mode β, the magnetization of mode β will develop while that of mode α will vanish.

For the general case of i.i.d. variables, assuming the components of u^α and v^α obey the same distribution, consider the cumulative distributions associated respectively with the original distribution and with the one attached to mode α. Given the magnetizations obtained from the fixed point associated with mode α, we have the following property:

Proposition 3.2.

The cumulative distribution attached to mode α and the original one are ordered, in a direction that depends on the distribution (property (i) or (ii)), which in turn implies a corresponding ordering of the averages entering the magnetization equations.

This is obtained by straightforward integration by parts, respectively over the components of u^α and v^α, in the equations (16,17) relative to the magnetizations.

In other words, when the ordering is favorable there is a positive stability gap, such that there is a non-empty range of higher singular values for which the fixed point associated with mode α still corresponds to a local minimum of the free energy. Note that property (i) [resp. (ii)] is analogous (in the sense that it implies it) to the attached distribution having a larger [resp. smaller] variance than the original one. Therefore, distributions with negative relative kurtosis will tend to favor the presence of metastable states, while the situation is more complex for distributions with positive relative kurtosis. Indeed, in the latter case the fixed point associated with the highest mode might not correspond to a stable state if lower modes in a suitable range are present, and fixed points associated with combinations of modes have to be considered. Note that, contrary to the Gaussian case, this can happen because the attached distribution is different for each mode, so that equations (21,22) offer more flexibility than equations (18,19).

Let us give some examples, denoting the relative kurtosis by δκ. As already said, the Gaussian distribution is a special case, with δκ = 0. For the Bernoulli, uniform and Laplace distributions we have the following properties, illustrated in the inset of Figure 2:

  • Bernoulli (δκ = −2): condition (i) holds, yielding a positive stability gap.

  • Uniform (δκ = −6/5): it can be verified that condition (i) again holds, yielding a positive stability gap.

  • Laplace (δκ = +3): here condition (ii) holds, yielding a negative stability gap.

These three examples fall either under condition (i) or under condition (ii), with a stability gap that is either always positive or always negative, independently of the singular values. We can also provide examples for which the stability condition varies. Consider for instance a sparse Bernoulli distribution with sparsity parameter ρ: each component takes the values ±ρ^{−1/2} with probability ρ/2 each, and 0 with probability 1 − ρ. The relative kurtosis is in this case δκ = 1/ρ − 3. Looking at the corresponding cumulative distributions, it is seen that neither condition (i) nor condition (ii) is fulfilled in general, except for ρ = 1, which corresponds to the plain Bernoulli case. As we see in the inset of Figure 2, for small enough ρ the stability gap is always negative, meaning that a unimodal ferromagnetic phase is not stable and is replaced by a compositional ferromagnetic phase at all temperatures. Instead, for larger ρ and at sufficiently high temperature, the single-mode fixed point dominates the ferromagnetic phase.
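The relative kurtosis values quoted above are easy to check numerically; a sketch, with unit-variance normalizations chosen for illustration:

```python
import numpy as np

def excess_kurtosis(x):
    """Sample kurtosis relative to the Gaussian value (0 for a Gaussian)."""
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(6)
n = 10 ** 6
bern = rng.choice([-1.0, 1.0], size=n)               # Bernoulli:        -2
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # uniform:          -6/5
lapl = rng.laplace(0.0, 1 / np.sqrt(2), size=n)      # Laplace:          +3
rho = 0.2                                            # sparse Bernoulli:
spar = rng.choice([-rho ** -0.5, 0.0, rho ** -0.5],  # 1/rho - 3 = +2
                  size=n, p=[rho / 2, 1 - rho, rho / 2])
```

With ρ = 0.2 the sparse Bernoulli case already has positive relative kurtosis, matching the regime where the compositional phase is expected.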

Laplace distribution: let us examine the properties of the phase diagram in the case where the singular vectors’ components are i.i.d. Laplace variables, a case in which a negative stability gap is expected, possibly leading to a compositional phase. For this we need the distribution of a sum of Laplace variables in order to compute the averages involved in (16,17). For this purpose, we define the following distributions: