# Gaussian-Spherical Restricted Boltzmann Machines

We consider a special type of Restricted Boltzmann Machine (RBM), namely a Gaussian-spherical RBM, where the visible units have Gaussian priors while the vector of hidden variables is constrained to stay on an $L_2$ sphere. Since the spherical constraint admits exact asymptotic treatments, various scaling regimes can be explicitly identified based solely on the spectral properties of the coupling matrix (also called the weight matrix of the RBM). Incidentally, these happen to be formally related to similar scaling behaviours obtained in a different context, namely the spatial condensation of zero-range processes. More specifically, when the spectrum of the coupling matrix is doubly degenerate, an exact treatment can be proposed to deal with finite-size effects. Interestingly, the known parallel between the ferromagnetic transition of the spherical model and Bose-Einstein condensation can be made explicit in that case. More importantly, this gives us the ability to extract all the response functions needed by the training algorithm of the RBM with arbitrary precision. This then allows us to integrate numerically, and precisely, the dynamics of the spectrum of the weight matrix during learning. This dynamics reveals in particular a sequential emergence of modes from the Marchenko-Pastur bulk of singular vectors of the coupling matrix.


## 1 Introduction

In the last decade, the field of machine learning has become a center of attention both for the general public and for scientific research. With the development of deep neural networks taking advantage of GPU technology, performance on classification tasks started to exceed human level at image recognition and, more recently, generative models such as generative adversarial networks (GANs) [1] have been able to generate images that cannot be distinguished from true ones [2]. Despite recent significant advances [3, 4], the theoretical understanding of deep learning lags behind this progress in various respects, for instance on the interplay between adequate network architecture and the complexity of the data.

Statistical physics has been helpful in the past to clarify the learning process in idealized inference problems. In the 1980s, before the A.I. winter, many works on neural networks proposed elements of understanding in terms of the theoretical phase diagram of some models. For instance, a retrieval phase of the Hopfield model was determined, along with the number of patterns that can be retrieved in that phase [5, 6, 7, 8]. Another example deals with the perceptron where, again, the capacity for storing a synthetic dataset can be computed [9, 10, 11]. The storage of information in layered neural networks was also analyzed in [12] with mean-field techniques. These approaches could then be adapted to many different contexts, such as community detection [13], compressed sensing [14] or traffic inference [15], to mention only a few. Typically, in this kind of approach, the formalism of statistical physics relates the behaviour of the model to its position on a phase diagram in the large-size limit, mean-field equations being used to characterize the free energy landscape and to sample the system efficiently.

In this work a somewhat similar path is followed to study generative models, by focusing on a "tractable" version of the restricted Boltzmann machine (RBM). While the RBM is considered to be a basic tool of machine learning, introduced more than 30 years ago [16], it is still attracting a lot of interest, both from the machine learning and the statistical physics communities. First, it is a model that can be handled without the need of a GPU and can be run on an ordinary computer in reasonable time while solving non-trivial tasks. Second, it has only one hidden layer in its classical formulation, which makes it possible to get some understanding of the learned hidden features, since they are directly linked to the visible variables. Finally, it can be expressed as an Ising model and therefore many standard tools developed by the statistical physics community can be used to determine its properties.

Originally, the RBM played an important role in deep learning as a way to pre-train deep auto-encoders layerwise [17]. It is also in principle possible to stack many RBMs to form a multi-layer generative model known as a Deep Boltzmann Machine (DBM) [18]. In recent years, the RBM has continuously attracted the interest of the research community, firstly because it can easily be used for both continuous and discrete variables [19, 20, 21, 22] and the activation can be tuned to be either binary or ReLU [23]; secondly because, for datasets of modest size, it is able to deliver good results [24, 25], comparable to the ones obtained from more elaborate networks such as GANs (see for instance [26]).

However, even for such a simple model, the learning procedure (what is learned, and how it is learned) is still very difficult to analyze with non-linear activation functions, in order to identify the key features and mechanisms allowing it to work properly. Even for practical purposes, it is intrinsically difficult to estimate numerically the gradient w.r.t. the parameters of the model in an efficient way, as soon as the network has learned non-trivial modes. Empirical procedures have been proposed, first contrastive divergence (CD) [27] and the refined persistent CD (PCD) [28], and later on a mean-field estimate [29], none of these being fully satisfactory (see e.g. [30] for a more detailed discussion), especially if one is willing to learn an empirical distribution with good accuracy. On the theoretical side, recent works [31, 32] using the analogy between the RBM and the Hopfield model characterize the retrieval capacity of RBMs. RBMs with a sparse weight matrix have been considered in [33] to analyze the compositional mechanisms by which features combine to create complex patterns. Other works have focused on a mean-field theory of RBMs, first to approximate the gradient and second to probe the mean-field landscape, in the general case [29, 34], in the spherical case [35], or even to compute the entropy in a very simple case [36]. Recently we also used a mean-field approach to understand the phase diagram of the model as a function of the spectrum of the weight matrix [37, 38]. We characterized in some way how the singular modes of the weight matrix evolve and interact during learning, bringing forward a clustering-model interpretation of the RBM in terms of mean-field fixed points.

In this paper we study an RBM with continuous symmetry, consisting of one layer of Gaussian variables (the visible one) and one layer of real variables with a spherical constraint. In the spirit of the original spherical Ising model introduced by Berlin and Kac [39], this offers the possibility to say something relevant about the original model by solving a simpler one. It turns out that in a special setting the thermodynamical properties of the Gaussian-spherical RBM can be obtained exactly. This allows one to devise an exact gradient ascent of the likelihood to learn the model, despite the fact that, as we shall see, this model is able to encode only rather specific data. The observation made previously on the spectral dynamics of the learning procedure [37, 38], in particular that the modes of the weight matrix are learned in order of importance, will be illustrated by an exact integration of the dynamical equations introduced in these works. In addition, this solution provides a benchmark to assess approximate mean-field methods and empirical learning strategies.

The paper is organized as follows. In Section 2 we define the RBM with a spherical hidden layer and derive the likelihood and the response functions exactly in the case of Gaussian visible units, together with the dominant behavior in the thermodynamic limit. In Section 3 we examine some properties specific to the spherical constraint, such as the way the onset of ferromagnetic order occurs and the critical behavior of the magnetization associated to mode condensation, remarking and exploiting in passing a connection with the spatial condensation in particle processes explored in [40]. Next, Section 4 focuses on a particular case where the spectrum of the weight matrix is doubly degenerate, which allows one to compute the partition function exactly for finite-size systems. Finally, in Section 5 we exploit these results to numerically integrate the spectral dynamics of the weight matrix during learning.

## 2 Model definition

### 2.1 Boltzmann measure and associated likelihood

The basic structure of the RBM is shown on Figure 1. It is a bipartite model connecting one layer of visible variables to one layer of hidden variables, the latter acting as a field to generate interactions among the visible variables. We define the visible variables $s_i$, $i=1,\dots,N_v$, and the hidden variables $\sigma_j$, $j=1,\dots,N_h$, both real valued, where $N_v$ and $N_h$ denote the number of visible (resp. hidden) variables. $L=\sqrt{N_v N_h}$ will represent the size of the system and $\kappa=\sqrt{N_h/N_v}$ its shape. We define the energy function by

$$E(s,\sigma) = -\sum_{i,j} w_{ij}\, s_i \sigma_j + \sum_i \frac{s_i^2}{2} - \sum_i \eta_i s_i - \sum_j \theta_j \sigma_j, \tag{1}$$

where $w$ is the weight matrix between the visible and hidden variables, and $\eta$, $\theta$ are local fields exerted on the variables. In this form the visible variables have a Gaussian prior in the absence of hidden variables. The spherical constraint imposes an additional prior distribution on the hidden variables. Overall, the distribution over $s$ and $\sigma$ is defined as

$$p(s,\sigma) = \frac{1}{Z}\, e^{-E(s,\sigma)}\, \delta\Bigl(\sum_j \sigma_j^2 - \bar\sigma^2 L\Bigr), \tag{2}$$

where $Z$ is the normalization factor and $\bar\sigma$ a parameter of the model. In this setting, it is possible to diagonalize the distribution by using the singular value decomposition (SVD) of the matrix $w$:

$$w_{ij} = \sum_\alpha w_\alpha\, u_i^\alpha v_j^\alpha,$$

where the $u^\alpha$ are the left singular vectors, attached to the visible space, the $v^\alpha$ are the right singular vectors, attached to the hidden space, and the $w_\alpha$ are the singular values. Depending on whether $N_v > N_h$ or $N_v < N_h$, either the set $\{u^\alpha\}$ or the set $\{v^\alpha\}$ is not a complete orthonormal basis of, respectively, the visible or the hidden space. If we assume for instance that $N_v > N_h$, the matrix $U$ corresponding to the left singular vectors has to be complemented by $N_v - N_h$ arbitrary orthonormal vectors to form a complete basis of the visible space. For the moment we don't need to specify whether $N_v$ is larger than $N_h$; we denote $N = \min(N_v, N_h)$ and assume that $U$ and $V$ represent complete bases of, respectively, the visible and hidden spaces.
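As a concrete illustration, the decomposition and the basis-completion bookkeeping above can be sketched with numpy (a minimal sketch with hypothetical sizes; `full_matrices=True` returns the completed orthonormal bases):

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, Nh = 6, 4                       # hypothetical layer sizes (here Nv > Nh)
L = np.sqrt(Nv * Nh)                # system size, as defined in the text

w = rng.normal(size=(Nv, Nh)) / np.sqrt(L)

# SVD: w_ij = sum_alpha w_alpha * U[i, alpha] * V[j, alpha].
U, sv, Vt = np.linalg.svd(w, full_matrices=True)
N = min(Nv, Nh)                     # number of singular values

# Only N singular triplets enter the reconstruction; the remaining
# |Nv - Nh| columns of the larger basis are arbitrary orthonormal
# complements of the corresponding space.
w_rec = sum(sv[a] * np.outer(U[:, a], Vt[a, :]) for a in range(N))
assert np.allclose(w, w_rec)

# U and V being complete orthonormal bases, the change of basis
# preserves squared norms: sum_i s_i^2 = L * sum_alpha s_hat_alpha^2.
s = rng.normal(size=Nv)
s_hat = U.T @ s / np.sqrt(L)
assert np.isclose(np.sum(s**2), L * np.sum(s_hat**2))
```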

The distribution (2) is conveniently expressed by means of the components of the visible and hidden vectors in these bases:

$$\hat s_\alpha = \frac{1}{\sqrt{L}} \sum_{i=1}^{N_v} U_{\alpha i}\, s_i, \qquad \hat\eta_\alpha = \frac{1}{\sqrt{L}} \sum_{i=1}^{N_v} U_{\alpha i}\, \eta_i,$$

for $\alpha = 1,\dots,N_v$, and

$$\hat\sigma_\alpha = \frac{1}{\sqrt{L}} \sum_{j=1}^{N_h} V_{\alpha j}\, \sigma_j, \qquad \hat\theta_\alpha = \frac{1}{\sqrt{L}} \sum_{j=1}^{N_h} V_{\alpha j}\, \theta_j,$$

for $\alpha = 1,\dots,N_h$. These obey the following normalization rules:

$$\sum_{i=1}^{N_v} s_i^2 = L \sum_{\alpha=1}^{N_v} \hat s_\alpha^2, \qquad \sum_{j=1}^{N_h} \sigma_j^2 = L \sum_{\alpha=1}^{N_h} \hat\sigma_\alpha^2, \qquad \sum_{i=1}^{N_v} \eta_i s_i = L \sum_{\alpha=1}^{N_v} \hat\eta_\alpha \hat s_\alpha, \qquad \sum_{j=1}^{N_h} \theta_j \sigma_j = L \sum_{\alpha=1}^{N_h} \hat\theta_\alpha \hat\sigma_\alpha.$$

Rewriting the distribution (2) with these variables, we obtain

$$p(\hat s,\hat\sigma) = \frac{1}{Z} \exp\biggl( L \sum_{\alpha=1}^{N} \bigl[ w_\alpha \hat s_\alpha \hat\sigma_\alpha + \hat\eta_\alpha \hat s_\alpha + \hat\theta_\alpha \hat\sigma_\alpha \bigr] - L \sum_{\alpha=1}^{N_v} \frac{\hat s_\alpha^2}{2} \biggr)\, \delta\Bigl( L \sum_{\alpha=1}^{N_h} \hat\sigma_\alpha^2 - \bar\sigma^2 L \Bigr).$$

From these transformations, we expect $\hat s_\alpha$ and $\hat\sigma_\alpha$ to be of order $O(1)$. In this representation the SVD modes are coupled by the spherical constraint. To get the distribution of the visible variables alone, we have to integrate over the hidden variables, which can be done by first using the Fourier representation of the $\delta$ function:

$$p(\hat s,\hat\sigma) = \frac{1}{2i\pi Z} \int_{a-i\infty}^{a+i\infty} dz\, \exp\biggl( L \Bigl( \sum_\alpha \bigl[ w_\alpha \hat s_\alpha \hat\sigma_\alpha + \hat\eta_\alpha \hat s_\alpha + \hat\theta_\alpha \hat\sigma_\alpha \bigr] - \sum_\alpha \frac{\hat s_\alpha^2}{2} - z \Bigl( \sum_\alpha \hat\sigma_\alpha^2 - \bar\sigma^2 \Bigr) \Bigr) \biggr), \tag{3}$$

with $a > 0$. Performing the Gaussian integration over $\hat\sigma$ and making the change of variable $z \to z\,\Sigma(\hat s)/(2\bar\sigma)$, we get

$$p(\hat s) = \frac{1}{2iLZ} \left( \frac{2\pi}{L\bar\sigma\Sigma(\hat s)} \right)^{N_h/2-1} \exp\biggl( L \sum_{\alpha=1}^{N_v} \Bigl( \hat\eta_\alpha \hat s_\alpha - \frac{\hat s_\alpha^2}{2} \Bigr) \biggr) \int_{a-i\infty}^{a+i\infty} \frac{dz}{z^{N_h/2}} \exp\Bigl( \frac{L\bar\sigma\Sigma(\hat s)}{2} \Bigl( z + \frac{1}{z} \Bigr) \Bigr),$$

with

$$\Sigma^2(\hat s) \overset{\text{def}}{=} \sum_{\alpha=1}^{N} \bigl( w_\alpha \hat s_\alpha + \hat\theta_\alpha \bigr)^2.$$

The integration over $z$ can actually be rewritten as (for $a > 0$)

$$p(\hat s) = \frac{(2\pi)^{N_h/2}}{2LZ}\, \tilde I_{N_h/2-1}\bigl( L\bar\sigma\Sigma(\hat s) \bigr) \exp\biggl( L \sum_{\alpha=1}^{N_v} \Bigl( \hat\eta_\alpha \hat s_\alpha - \frac{\hat s_\alpha^2}{2} \Bigr) \biggr), \tag{4}$$

where

$$\tilde I_\nu(x) = x^{-\nu} I_\nu(x),$$

with $I_\nu$ the modified Bessel function of the first kind.
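The step from the contour integral to eq. (4) rests on the classical contour-integral representation of $I_\nu$; for integer order it reduces to the unit-circle form, which can be checked numerically (a small sketch with arbitrary test values):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import iv

# Integral representation of the modified Bessel function (integer order):
# I_n(x) = (1/2pi) * int_0^{2pi} exp(x cos t) cos(n t) dt,
# which is the unit-circle version of the contour integral above.
def bessel_contour(n, x):
    val, _ = quad(lambda t: np.exp(x * np.cos(t)) * np.cos(n * t), 0, 2 * np.pi)
    return val / (2 * np.pi)

for n, x in [(0, 1.0), (3, 2.5), (7, 3.0)]:
    assert np.isclose(bessel_contour(n, x), iv(n, x), rtol=1e-5)
```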

The partition function is likewise given by a single integral, obtained by integrating the form (3) over both the visible and hidden variables and rescaling $z$:

$$Z = \frac{1}{2i\pi} \int_{a-i\infty}^{a+i\infty} dz\, e^{L\phi(z)}, \tag{5}$$

where

$$\phi(z) = \frac{\bar\sigma^2 z}{2} - \frac{\delta}{2}\log(z) + \frac{h_0^2}{2z} + \frac{1}{2}\sum_{\alpha=1}^{N_v} \hat\eta_\alpha^2 + \frac{1}{2}\sum_{\alpha=1}^{N} \biggl[ \frac{h_\alpha^2}{z - w_\alpha^2} - \frac{1}{L}\log\bigl( z - w_\alpha^2 \bigr) \biggr], \tag{6}$$

up to a constant, and where

$$\delta \overset{\text{def}}{=} \bigl( \kappa - \kappa^{-1} \bigr)\, \mathbb{1}_{\{N_h > N_v\}}, \qquad h_0^2 \overset{\text{def}}{=} \mathbb{1}_{\{N_h > N_v\}} \sum_{\alpha=N_v+1}^{N_h} \hat\theta_\alpha^2, \qquad h_\alpha \overset{\text{def}}{=} \hat\eta_\alpha w_\alpha + \hat\theta_\alpha.$$

### 2.2 Learning algorithm

The objective of the standard learning procedure of the RBM is to find the set of parameters such that the likelihood of a given dataset is maximal. This is done by conventional gradient ascent of the log likelihood (LL). The gradient of the LL w.r.t. the parameters is given by

$$\frac{\partial \mathcal{L}}{\partial w_{ij}} = \bigl\langle s_i \langle \sigma_j \rangle_{p(\sigma|s)} \bigr\rangle_{\rm Data} - \langle s_i \sigma_j \rangle_{\rm RBM}, \qquad \frac{\partial \mathcal{L}}{\partial \eta_i} = \langle s_i \rangle_{\rm Data} - \langle s_i \rangle_{\rm RBM}, \qquad \frac{\partial \mathcal{L}}{\partial \theta_j} = \bigl\langle \langle \sigma_j \rangle_{p(\sigma|s)} \bigr\rangle_{\rm Data} - \langle \sigma_j \rangle_{\rm RBM}.$$

This requires computing various response functions of the RBM and the conditional probability $p(\sigma|s)$. As shown in [37, 38], it is convenient to rewrite the gradient in the frame defined by the SVD modes of the weight matrix. As already seen, this is especially well adapted here, since the RBM measure is naturally expressed in this frame. In addition, a specificity of the Gaussian-spherical model is that the joint distribution of the visible variables (4) is invariant w.r.t. rotations of the right singular vectors $v^\alpha$. This means that we can use a more economical gradient: in addition to modifications of the singular values $w_\alpha$, we are led to consider infinitesimal rotations $\Omega_{\alpha\beta}$ between modes $\alpha$ and $\beta$ of the visible SVD basis only. Here $\Omega$ is a skew-symmetric operator corresponding to the change

$$du^\alpha = \Omega_{\alpha\beta}\, u^\beta, \qquad du^\beta = -\Omega_{\alpha\beta}\, u^\alpha.$$

In this frame the gradient reads

$$\frac{1}{L}\frac{\partial \mathcal{L}}{\partial w_\alpha} = \bigl\langle \hat s_\alpha \langle \hat\sigma_\alpha \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} - \langle \hat s_\alpha \hat\sigma_\alpha \rangle_{\rm RBM},$$
$$\frac{1}{L}\frac{\partial \mathcal{L}}{\partial \hat\eta_\alpha} = \langle \hat s_\alpha \rangle_{\rm Data} - \langle \hat s_\alpha \rangle_{\rm RBM},$$
$$\frac{1}{L}\frac{\partial \mathcal{L}}{\partial \hat\theta_\alpha} = \bigl\langle \langle \hat\sigma_\alpha \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} - \langle \hat\sigma_\alpha \rangle_{\rm RBM},$$
$$\frac{1}{L}\frac{\partial \mathcal{L}}{\partial \Omega_{\alpha\beta}} = \bigl\langle \langle w_\alpha \hat s_\alpha \hat\sigma_\beta - w_\beta \hat s_\beta \hat\sigma_\alpha \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} - \langle w_\alpha \hat s_\alpha \hat\sigma_\beta - w_\beta \hat s_\beta \hat\sigma_\alpha \rangle_{\rm RBM},$$

with

$$p(\hat\sigma|\hat s) = \frac{\bar\sigma^2 L}{(2\pi)^{N_h/2}}\, \frac{\exp\bigl( L \sum_\alpha \bigl( w_\alpha \hat s_\alpha \hat\sigma_\alpha + \hat\theta_\alpha \hat\sigma_\alpha \bigr) \bigr)}{\tilde I_{N_h/2-1}\bigl( L\bar\sigma\Sigma(\hat s) \bigr)}\, \delta\Bigl( L \sum_\alpha \hat\sigma_\alpha^2 - \bar\sigma^2 L \Bigr).$$

This results in the following (continuous) update equations for the parameters, given the dataset $S = \{\hat s^k\}$:

$$\frac{dw_\alpha}{dt} = \bigl\langle \hat s_\alpha \langle \hat\sigma_\alpha \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} - \langle \hat s_\alpha \hat\sigma_\alpha \rangle_{\rm RBM}, \tag{7}$$
$$\frac{d\hat\eta_\alpha}{dt} = \langle \hat s_\alpha \rangle_{\rm Data} - \langle \hat s_\alpha \rangle_{\rm RBM} - \sum_\beta \Omega_{\alpha\beta}\, \hat\eta_\beta, \tag{8}$$
$$\frac{d\hat\theta_\alpha}{dt} = \bigl\langle \langle \hat\sigma_\alpha \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} - \langle \hat\sigma_\alpha \rangle_{\rm RBM}, \tag{9}$$
$$\frac{d\hat s^k_\alpha}{dt} = -\sum_\beta \Omega_{\alpha\beta}\, \hat s^k_\beta, \qquad \forall k \in S, \tag{10}$$

with

$$\Omega_{\alpha\beta} = \bigl\langle \langle w_\alpha \hat s_\alpha \hat\sigma_\beta - w_\beta \hat s_\beta \hat\sigma_\alpha \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} - \langle w_\alpha \hat s_\alpha \hat\sigma_\beta - w_\beta \hat s_\beta \hat\sigma_\alpha \rangle_{\rm RBM}, \tag{11}$$

where $t$ denotes the training time, the learning rate being absorbed in its definition. Here the last update equation simply corresponds to adapting the projection of the data on the rotated basis. Note that the same is done for the second update equation concerning the field projections $\hat\eta_\alpha$, which is optional here but coherent with the conventional update rules and useful in practice. Note also that the singularity of the conventional gradient observed in [37] for pairs of modes with identical singular values has disappeared. Hence, computing the gradient requires evaluating one- and two-point correlation functions of the SVD variables.
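The structure of this update scheme can be sketched as follows. The moment estimators below are placeholder stubs (in the actual algorithm they come from the exact response functions of the model); the sketch only illustrates the Euler step for eqs. (7) and (10)-(11), and the fact that $\Omega$ is skew-symmetric, so that the induced change of basis remains orthogonal:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
N, K, dt = 5, 100, 1e-2            # modes, data points, time step (hypothetical)

w = np.sort(rng.uniform(0.1, 1.0, size=N))[::-1]   # singular values w_alpha
S = rng.normal(size=(K, N)) / np.sqrt(N)           # data projections s_hat^k

def correlations(w, S):
    """Stub for the clamped minus free correlations <s_a sig_b>; a real
    implementation would use the exact response functions of the model."""
    clamped = (S.T @ S / len(S)) * w[None, :]       # placeholder data term
    free = np.diag(0.1 * w)                         # placeholder model term
    return clamped - free

G = correlations(w, S)                              # gradient-like matrix

w = w + dt * np.diag(G)                             # eq. (7): singular values

# Eq. (11): Omega_ab = w_a G_ab - w_b G_ba is skew-symmetric by construction.
Omega = w[:, None] * G - w[None, :] * G.T
assert np.allclose(Omega, -Omega.T)

# Eq. (10): rotate the data projections; exponentiating the skew-symmetric
# generator keeps the change of basis exactly orthogonal over a finite step.
R = expm(-dt * Omega)
S = S @ R.T
assert np.allclose(R @ R.T, np.eye(N), atol=1e-10)
```

The matrix exponential of a skew-symmetric generator is one convenient way to discretize eq. (10) without losing orthonormality of the rotated basis.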

As seen in the previous section, the LL takes the form

$$\mathcal{L} = \Bigl\langle \log\Bigl( \tilde I_{N_h/2-1}\bigl[ L\bar\sigma\Sigma(\hat s) \bigr] \Bigr) + L \sum_{\alpha=1}^{N_v} \Bigl( \hat\eta_\alpha \hat s_\alpha - \frac{\hat s_\alpha^2}{2} \Bigr) \Bigr\rangle_{\rm Data} - \log(Z).$$

Compared to the simple Gaussian RBM likelihood, we see one important difference: the eigenvalues of $w$ do interact, in particular via the empirical term, which is now a nonlinear, monotonically increasing function of $\Sigma(\hat s)$. With the help of the identity

$$\frac{d\tilde I_\nu(x)}{dx} = x\, \tilde I_{\nu+1}(x),$$

we get from this the gradient of the log likelihood in the form:

$$\frac{1}{L}\frac{\partial \mathcal{L}}{\partial w_\alpha} = \bar\sigma \Bigl\langle \hat s_\alpha \bigl( w_\alpha \hat s_\alpha + \hat\theta_\alpha \bigr)\, \frac{I_{N_h/2}\bigl( L\bar\sigma\Sigma(\hat s) \bigr)}{\Sigma(\hat s)\, I_{N_h/2-1}\bigl( L\bar\sigma\Sigma(\hat s) \bigr)} \Bigr\rangle_{\rm Data} - \frac{1}{L}\frac{\partial \log(Z)}{\partial w_\alpha}.$$

Using the following asymptotic expression of $I_\nu$ for large $\nu$ (see e.g. [41]),

$$I_\nu(\nu z) \sim \frac{e^{\nu\eta}}{\sqrt{2\pi\nu}\, (1+z^2)^{1/4}}, \qquad \text{with} \quad \eta = \sqrt{1+z^2} + \log\frac{z}{1+\sqrt{1+z^2}},$$

resulting from a saddle point approximation of the modified Bessel function, we obtain the asymptotic expression

$$\bigl\langle \hat s_\alpha \langle \hat\sigma_\beta \rangle_{p(\hat\sigma|\hat s)} \bigr\rangle_{\rm Data} = \bar\sigma \Bigl\langle \hat s_\alpha\, \frac{w_\beta \hat s_\beta + \hat\theta_\beta}{1 + \sqrt{1 + \bar\sigma^2\Sigma(\hat s)^2}} \Bigr\rangle_{\rm Data},$$

valid for large $L$.
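Both the derivative identity for $\tilde I_\nu$ and the large-order asymptotics used above can be verified numerically (a quick sketch with arbitrary test values; `ive` is scipy's exponentially scaled Bessel function, used to avoid overflow at large order):

```python
import numpy as np
from scipy.special import iv, ive

def I_tilde(nu, x):
    """I~_nu(x) = x^(-nu) I_nu(x)."""
    return x**(-nu) * iv(nu, x)

# Identity d I~_nu/dx = x * I~_{nu+1}(x), checked by central differences.
nu1, x1, h = 3.5, 2.0, 1e-6
deriv = (I_tilde(nu1, x1 + h) - I_tilde(nu1, x1 - h)) / (2 * h)
assert np.isclose(deriv, x1 * I_tilde(nu1 + 1, x1), rtol=1e-6)

# Uniform (saddle point) asymptotics of log I_nu(nu*z) for large nu;
# ive(nu, x) = I_nu(x) * exp(-x), so log I_nu(x) = log ive(nu, x) + x.
nu2, z = 400.0, 1.3
exact = np.log(ive(nu2, nu2 * z)) + nu2 * z
r = np.sqrt(1 + z**2)
eta = r + np.log(z / (1 + r))
approx = nu2 * eta - 0.5 * np.log(2 * np.pi * nu2) - 0.25 * np.log(1 + z**2)
assert abs(exact - approx) < 1e-3
```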

The remaining point to address, in order to be able to train such a machine, is the estimation of the partition function and of its derivatives. In the rest of the paper, the local fields on the visible variables will be set to zero to lighten the presentation.

## 3 Thermodynamical properties

The expression of the partition function given by eqs. (5-6) indicates that, in the absence of fields, the physical properties of the Gaussian-spherical RBM depend only on the spectrum of its weight matrix, as for the ordinary spherical model [42]. Standard treatments of the spherical model (see [39, 43, 44]) rely on a saddle point approximation of the contour integral representation of $Z$ given by eq. (5). Here we recall and straightforwardly adapt these arguments to our needs, by making simple assumptions on the limiting spectrum of $w$ when $L \to \infty$. This leads us in a second step to establish, incidentally, a connection with the condensation phase transition analyzed in the context of factorized steady states [40].

### 3.1 Ferromagnetic transition

First notice that $\phi$ given in (6) is convex on the domain of interest:

$$\phi''(z) = \frac{\delta}{2z^2} + \frac{h_0^2}{z^3} + \sum_\alpha \biggl[ \frac{h_\alpha^2}{(z - w_\alpha^2)^3} + \frac{1}{2L}\, \frac{1}{(z - w_\alpha^2)^2} \biggr] > 0, \qquad \text{for } z > w_{\rm max}^2,$$

with $w_{\rm max}$ the highest singular value, so that there is only one solution $z_0$ to the saddle point equation, allowing for the following approximation:

$$Z \underset{L\to\infty}{\sim} \frac{\exp\bigl( L\phi(z_0) \bigr)}{\sqrt{2\pi L\, |\phi''(z_0)|}}.$$
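The saddle point machinery is easy to reproduce numerically. The sketch below uses a toy spectrum with hypothetical numbers, in a simplified version of eq. (6) with $N_v = N_h$ (so $\delta = h_0 = 0$) and fields entering only through $h_\alpha$; it locates the unique saddle point to the right of $w_{\rm max}^2$ and evaluates the saddle-point estimate of $\log Z$:

```python
import numpy as np
from scipy.optimize import brentq

L = 50.0
w2 = np.array([0.2, 0.5, 0.9, 1.0])       # squared singular values, w2max = 1
h = np.full(4, 0.05)                      # fields h_alpha
sigma2 = 1.5                              # sigma_bar^2

def phi(z):
    return sigma2 * z / 2 + 0.5 * np.sum(h**2 / (z - w2) - np.log(z - w2) / L)

def dphi(z):
    return sigma2 / 2 - 0.5 * np.sum(h**2 / (z - w2)**2 + 1 / (L * (z - w2)))

# phi is convex to the right of w2max, so the saddle point z0 is unique:
# dphi goes from -inf at w2max+ to a positive value at large z.
z0 = brentq(dphi, w2.max() + 1e-9, 100.0)
assert z0 > w2.max()

# Positive curvature at z0, and the saddle-point estimate of log Z.
d2 = (phi(z0 + 1e-5) - 2 * phi(z0) + phi(z0 - 1e-5)) / 1e-10
assert d2 > 0
logZ = L * phi(z0) - 0.5 * np.log(2 * np.pi * L * d2)
```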

At the saddle point the free energy per degree of freedom is given by

$$f = -\phi(z_0, \eta, \theta).$$

From these quantities we can in principle obtain all the needed response functions (see Appendix A). We will focus here on the computation of the spontaneous magnetizations as a function of the spectrum of $w$. Their expressions can be obtained using

$$\langle \hat s_\alpha \rangle = -\frac{\partial f}{\partial \hat\eta_\alpha} = \frac{w_\alpha h_\alpha}{z_0 - w_\alpha^2}, \tag{12}$$
$$\langle \hat\sigma_\alpha \rangle = -\frac{\partial f}{\partial \hat\theta_\alpha} = \frac{h_\alpha}{z_0 - w_\alpha^2}. \tag{13}$$

This gives us relations between the magnetizations $\langle \hat s_\alpha \rangle$ and $\langle \hat\sigma_\alpha \rangle$. In order to analyze the thermodynamic properties of the system further, some assumptions have to be made on the spectral properties of $w$. Let us define the spectral density (SD) associated to $w$:

$$\rho_L(E) \overset{\text{def}}{=} \frac{1}{L} \sum_{\alpha=1}^{N_h} \delta\bigl( E - w_\alpha^2 \bigr),$$

(which includes zero modes for $\alpha > N_v$ whenever $N_h > N_v$). In the thermodynamic limit it is assumed that the SD tends to a well defined limit distribution

$$\lim_{L\to\infty} \rho_L(E) = \rho(E).$$

In this limit $\phi$ takes the form

$$\phi(z) = \frac{1}{2}\bar\sigma^2 z + \frac{1}{2} \int_0^{E_{\rm max}} dE\, \rho(E) \biggl( \frac{h(E)^2}{z - E} - \log(z - E) \biggr),$$

with $E_{\rm max}$ an upper bound of the support of the SD, and where $h(E)$ is any smooth function taking the value $h_\alpha$ for $E = w_\alpha^2$ at finite $L$, $h(E)$ being expected to be of order $O(1)$ in general. $\rho(E)h(E)^2$ represents the SD of the external field. As in [43] for instance, we have to distinguish between the situation where the SD has isolated dominant modes and the situation with just a continuous bulk of modes bounded by $E_{\rm max}$. When the SD has no isolated dominant mode we look for a solution to the saddle point equation

$$\phi'(z) = \frac{1}{2}\bigl( g_1(z) - g_2(z) \bigr) = 0,$$

where

$$g_1(z) \overset{\text{def}}{=} \bar\sigma^2 - \int_0^{E_{\rm max}} dE\, \frac{\rho(E)\, h(E)^2}{(z - E)^2}, \qquad g_2(z) \overset{\text{def}}{=} \int_0^{E_{\rm max}} dE\, \frac{\rho(E)}{z - E}.$$

Let us call

$$\bar\sigma_c^2 \overset{\text{def}}{=} g_2(E_{\rm max}). \tag{14}$$

The properties of the system depend on the behavior of $\rho(E)$ near $E_{\rm max}$. A thorough discussion of its influence on the physical properties of the spherical model can be found in [42]. Here we restrict the discussion to behaviors of the type $\rho(E) \sim (E_{\rm max} - E)^\gamma$, with exponent $\gamma \in\, ]-1,1[$. This covers various cases of interest, like for instance regular lattices in finite dimension, or $\gamma = 1/2$ for i.i.d. random matrices. In order to get closed form expressions we shall consider the following beta distributions for the SD:

$$\rho(E) = \frac{\kappa}{B(1-\gamma, \gamma+1)}\, \frac{E^{-\gamma}\bigl( E_{\rm max} - E \bigr)^\gamma}{E_{\rm max}}, \qquad \gamma \in\, ]-1,1[, \tag{15}$$
$$\rho(E)\, h(E)^2 = \frac{h^2}{B(\beta+1, 1-\beta)}\, \frac{E^{\beta}\bigl( E_{\rm max} - E \bigr)^{-\beta}}{E_{\rm max}}, \qquad \beta \in\, ]-1,1[, \tag{16}$$

where the beta function takes here the special form

$$B(1-\gamma, 1+\gamma) = \frac{\gamma\pi}{\sin(\gamma\pi)},$$
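This special value of the beta function, as well as the beta-function ratio produced when $g_2(E_{\rm max})$ is evaluated on the SD (15), can be checked directly (a quick numerical sketch):

```python
import numpy as np
from scipy.special import beta

for g in [0.25, 0.5, 0.75]:
    # B(1 - gamma, 1 + gamma) = gamma * pi / sin(gamma * pi).
    assert np.isclose(beta(1 - g, 1 + g), g * np.pi / np.sin(g * np.pi))
    # Ratio entering g2(Emax) for the beta-distributed SD:
    # B(1 - gamma, gamma) / B(1 - gamma, 1 + gamma) = 1 / gamma,
    # so the bulk contributes kappa/(gamma * Emax) to the critical value.
    assert np.isclose(beta(1 - g, g) / beta(1 - g, 1 + g), 1 / g)
```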

and with $h^2$ the squared norm of the external field, $\rho(E)h(E)^2$ representing the SD of the external fields. This setting can be useful to study the response functions at the top of the spectrum, simply by letting $\beta \to 1$. $\bar\sigma_c^2$ is infinite for $\gamma \le 0$ and finite otherwise, with

$$\bar\sigma_c^2 = \frac{1}{E_{\rm max}} \Bigl( \delta + \frac{\kappa}{\gamma} \Bigr), \qquad \gamma > 0,$$

in the latter case. The different scenarios for obtaining $z_0$ are sketched on Figure 2.

• When $\gamma \le 0$, $g_2$ diverges close to $E_{\rm max}$ and therefore the intersection with $g_1$ always occurs at a point $z_0 > E_{\rm max}$. In that case, there is no way that $z_0 - w_\alpha^2$ goes to zero for any $\alpha$ when the fields vanish, and therefore all the magnetizations (12, 13) vanish.

• When $\gamma > 0$ we have to distinguish between two cases. First, we consider condensation on modes with $w_\alpha^2 < E_{\rm max}$, by applying small vanishing fields on these modes. In that case, since $z_0 \ge E_{\rm max}$, the magnetization will simply vanish, because the denominator in (12, 13) remains finite and non-zero. For modes at $E_{\rm max}$, if $\bar\sigma^2 < \bar\sigma_c^2$, then $z_0 > E_{\rm max}$ when we put the field to zero, therefore giving the same result as in the first scenario. Now, if instead $\bar\sigma^2 > \bar\sigma_c^2$, the ratio $h_\alpha/(z_0 - E_{\rm max})$ has a finite limit, given below. We obtain a spontaneous magnetization in that case.

The spontaneous magnetization is obtained as follows, in the cases where a saddle point solution exists. If a field $h_{\rm max}$ is concentrated on the largest mode, which is equivalent to choosing $\rho(E)h(E)^2 = h_{\rm max}^2\, \delta(E - E_{\rm max})$, we get from the saddle point equation:

$$\bar\sigma^2 - \frac{h_{\rm max}^2}{\bigl( z_0(h_{\rm max}) - E_{\rm max} \bigr)^2} = \bar\sigma_c^2.$$

Now, when $\bar\sigma^2 > \bar\sigma_c^2$ and we let $h_{\rm max} \to 0$, we obtain

$$\lim_{h_{\rm max}\to 0} \frac{h_{\rm max}}{z_0(h_{\rm max}) - E_{\rm max}} = \sqrt{\bar\sigma^2 - \bar\sigma_c^2}.$$

Eliminating $h_{\rm max}$ in (12, 13) yields the spontaneous magnetization

$$\langle \hat s_\alpha \rangle = w_{\rm max} \sqrt{\bar\sigma^2 - \bar\sigma_c^2}, \qquad \langle \hat\sigma_\alpha \rangle = \sqrt{\bar\sigma^2 - \bar\sigma_c^2},$$

and

$$\langle \hat s_\alpha \hat\sigma_\beta \rangle = \delta_{\alpha\beta}\, w_\alpha \bigl( \bar\sigma^2 - \bar\sigma_c^2 \bigr).$$

When the highest mode acquires a macroscopic magnetization in the ferromagnetic phase, the resulting distribution becomes bimodal along this direction. It is also worth noticing that when the highest mode is degenerate $n$ times, in the absence of any external field the system has an $O(n)$ symmetry corresponding to rotations in the subspace defined by these vectors. In that case the result is a distribution concentrated on an $(n-1)$-dimensional sphere in the ferromagnetic phase. The specific shape of the condensed distribution will be studied in the next subsection.

### 3.2 Condensation mechanism in thermodynamic limits

The scaling form (15, 16) of the SD of singular values and field densities allows us to make an explicit connection with scaling functions derived in a different context, namely the condensation of factorized steady states [40]. After dropping irrelevant terms and making the change of variable $z \to E_{\rm max}(1+z)$, while absorbing $E_{\rm max}$ into the definitions of $\bar\sigma$ and $h$, we get the following form of the partition function:

$$Z_{L,N}[\bar\sigma, h] = \frac{1}{2i\pi} \int_{-i\infty}^{i\infty} dz\, e^{\frac{L}{2}\phi(z, \bar\sigma, h)},$$

with (see Appendix B)

$$\phi(z, \bar\sigma, h) = \bar\sigma^2 (z+1) - \frac{\kappa}{\gamma} \int_0^z du \Bigl[ 1 - \Bigl( \frac{u}{1+u} \Bigr)^{\gamma} \Bigr] + \frac{h^2}{\beta} \Bigl[ \Bigl( \frac{1+z}{z} \Bigr)^{\beta} - 1 \Bigr]. \tag{17}$$

For large $L$, the rescaling $z \to z\, L^{-\frac{1}{1+\gamma}}$ leads to the scaling behavior

$$Z_{L,N}(\bar\sigma, h) = e^{\frac{L}{2}\left( \bar\sigma^2 - \frac{h^2}{\beta} \right)} \biggl( L^{-\frac{1}{\gamma+1}}\, V_{\gamma,\beta}\Bigl( L^{\frac{\gamma}{1+\gamma}}\bigl( \bar\sigma^2 - \bar\sigma_c^2 \bigr),\; L^{\frac{2+\beta+\gamma}{1+\gamma}} h^2 \Bigr) + O\Bigl( L^{-\frac{2}{1+\gamma}} \Bigr) \biggr),$$

valid in the critical region, where the following scaling function has been introduced:

$$V_{\gamma,\beta}(x,y) = \int_{-i\infty}^{i\infty} dz\, \exp\Bigl( \frac{1}{2}xz + b\, z^{\gamma+1} + c\, y\, z^{1+\beta} \Bigr) = \frac{1}{\pi} \int_0^\infty du\; e^{-b_2 u^{\gamma+1} - c_2 y\, u^{\beta+1}} \cos\Bigl( \frac{xu}{2} - b_1 u^{\gamma+1} + c_1 y\, u^{\beta+1} \Bigr),$$

with

$$b = \frac{\kappa}{2\gamma(\gamma+1)}, \qquad c = \frac{1}{2\beta},$$

while $b_1, b_2$ (resp. $c_1, c_2$) denote the imaginary and real parts generated by the terms in $b$ (resp. $c\,y$) when the contour is rotated onto the imaginary axis,

and where the change of variable $z = iu$ is used to obtain the second line. In [40] the same scaling function (at $y = 0$) is encountered, albeit in a different context. We can therefore closely follow their analysis to describe the transition to ferromagnetic order of the present spherical model.

As seen previously, when $\gamma > 0$ there is a possibility for dominant modes to generate ferromagnetic order when $\bar\sigma^2 > \bar\sigma_c^2$. In the presence of an external field ($h > 0$) there is always a solution to the saddle point equation and no transition occurs. Instead, in the absence of external fields, there is no solution to the saddle point equation when $\bar\sigma^2 > \bar\sigma_c^2$, while there is always one in the opposite case, and the transition corresponds to the onset of ferromagnetic order, materialized by a condensation along the dominant modes. In that case a finite fraction of the overall variance of the distribution is captured by one, or possibly a small number, of modes.

In order to study the condensate we need to express the marginal probabilities $p_{\hat s_\alpha}$ and $p_{\hat\sigma_\alpha}$. For any given mode $\alpha$ we have (in the absence of external fields)

$$p_{\hat s_\alpha}(x) = \int dy\; p_{\hat\sigma_\alpha}(y)\, e^{L\left( \sqrt{\epsilon_\alpha}\, x y - \frac{x^2}{2} \right)}, \qquad p_{\hat\sigma_\alpha}(x) = \frac{Z_{L,N-1}\bigl( \bar\sigma^2 - x^2 \bigr)}{Z_{L,N}}\, \exp\Bigl( \frac{L}{2}\epsilon_\alpha x^2 \Bigr),$$

with $\epsilon_\alpha = w_\alpha^2 / E_{\rm max}$, and where it is assumed that $Z_{L,N}$ corresponds to the system with one single mode at $\epsilon_\alpha$ added to the SD (15), while $Z_{L,N-1}$ corresponds to the SD (15) alone. Let us call

$$V_{\rm ex} = \bar\sigma^2 - \bar\sigma_c^2,$$

the “excess of variance” in the system. We get for the condensate the following behavior

$$p_{\hat\sigma_\alpha}(x) \propto W_\gamma\Bigl( L^{\frac{\gamma}{1+\gamma}}\bigl( x^2 - V_{\rm ex} \bigr),\; L^{\frac{1}{1+\gamma}}\bigl( 1 - \epsilon_\alpha \bigr) \Bigr),$$

with now

$$W_\gamma(x,y) = e^{-\frac{xy}{2}} \int_0^\infty du\; e^{-b_2 u^{\gamma+1}} \cos\Bigl( \frac{xu}{2} - b_1 u^{\gamma+1} \Bigr).$$
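This oscillatory integral is straightforward to evaluate numerically. The sketch below uses illustrative placeholder constants $b_1 = b_2 = 1$ (their actual values depend on $\gamma$ and $\kappa$), and illustrates the exponential suppression of the $x > 0$ region as soon as $y > 0$:

```python
import numpy as np
from scipy.integrate import quad

def W(x, y, gamma=0.5, b1=1.0, b2=1.0):
    """Scaling function W_gamma(x, y); b1, b2 are illustrative placeholders
    (their actual values depend on gamma and kappa)."""
    integrand = lambda u: (np.exp(-b2 * u**(gamma + 1))
                           * np.cos(x * u / 2 - b1 * u**(gamma + 1)))
    val, _ = quad(integrand, 0, np.inf)
    return np.exp(-x * y / 2) * val

# The condensate contribution at x > 0 survives at y = 0 but is
# exponentially suppressed by exp(-x*y/2) as soon as y > 0.
assert W(2.0, 0.0) > 0
assert W(2.0, 1.0) < W(2.0, 0.0)
```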

The plot of $W_\gamma$ is shown on Figure 3, and its asymptotic behavior for large $|x|$ is given in Appendix B. This helps us to determine how many modes condense, and the shape of the distribution along these modes, in the vicinity of the upper boundary of the spectrum corresponding to $\epsilon_\alpha = 1$. Strictly speaking, the bump observed on Figure 3 represents the condensation of a mode only for $y = 0$, because as soon as $y$ is strictly positive the contribution of the bump to the distribution is suppressed exponentially, by a factor $e^{-xy/2}$, by comparison to contributions near $x = 0$. Still, we see that the bump is present for some values $y > 0$. To know to which modes this corresponds, first note that