# 'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space

Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of Restricted Boltzmann Machines (RBM), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize, a symmetry-breaking phenomenon, in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive field over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, depending on the network size and the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.


## I Introduction

Many high-dimensional inputs or data enjoy various kinds of low-dimensional invariances, which are at the basis of the so-called manifold hypothesis manifold. For instance, the pictures of somebody's face are related to each other through a set of continuous symmetries corresponding to the degrees of freedom characterizing the relative position of the camera (rotations, translations, changes of scale) as well as the internal deformations of the face (controlled by muscles). While well-understood symmetries can be explicitly taken care of through adequate procedures, e.g. convolutional networks, not all invariances may be known a priori. An interesting question is therefore if and how these residual symmetries affect the representations of the data achieved by learning models.

This question does not arise solely in the context of machine learning, but is also of interest in computational neuroscience, where it is of crucial importance to understand how the statistical structure of input stimuli, be they visual, olfactory, auditory, tactile, … shapes their encoding by sensory brain areas and their processing by higher cortical regions. Information theory provides a mathematical framework to answer this question Laughlin, and was applied, in the case of linear models of neurons, to a variety of situations, including the prediction of the receptive fields of retinal ganglion cells atick1992could, the determination of cone fractions in the human retina Vijay, or the efficient representation of odor-variable environments Tibi. In the case of natural images, which enjoy approximate translational and rotational invariances, non-linear learning rules resulting from adequate modification of Oja's dynamics gerstner or sparse-representation learning procedures sparse produce local edge detectors, as does independent component analysis hyvarinen2000independent. These detectors bear strong similarities with the neural receptive fields measured in the visual cortex (V1 area) of mammals.

It is therefore natural to wonder whether the existence of localized receptive fields is a general feature to be expected from representations of invariant distributions of inputs. Gardner's theory of optimal learning for single-layer neural networks (perceptrons) predicts that spatially correlated patterns, e.g. drawn from a translationally-invariant distribution, lead to a localized profile of weights monasson93. Further supporting evidence was recently brought by several works focusing on the production of such receptive fields in the context of unsupervised learning. Learning of symmetric data with similarity-preserving representations sengupta or with auto-encoders benna both led to localized receptive fields tiling the underlying manifold, in striking analogy with place cells and spatial maps in the hippocampus. In turn, such high-dimensional place-cell-like representations have putative functional advantages: they can be efficiently and accurately learned by recurrent neural networks, and thus allow for the storage and retrieval of multiple cognitive low-dimensional maps battista19.

The present work is an additional effort to investigate this issue in a highly simplified and idealized framework of unsupervised learning, where both the data distribution and the machine are under full control. Similarly to previous studies mehtaschwab; ringel, we consider synthetic data with controlled invariances generated by standard models in statistical physics, such as the Ising and XY models. These data are then used to train Restricted Boltzmann Machines (RBM), a simple albeit powerful framework for representation learning, where a layer of hidden (latent) units accounts for the correlation structure in the data configurations. We show how the receptive fields of the hidden units undergo a symmetry-breaking transition in the space of couplings: units individually cover localized regions of the input space, but concur to tile the space as best as possible, in much the same way as hippocampal place cells do. The symmetry is dynamically restored if we let the training algorithm run for very long times (well beyond the training time needed to saturate the log-likelihood of the test set): while keeping their localized shape, the centers of the receptive/place fields diffuse along the input space, effectively ensuring the invariance of the learned distribution. We also show that this symmetry-breaking phenomenon requires a minimum number of data, an illustration of the general phenomenon of retarded learning watkin, also encountered in random matrix theory in the context of the so-called spiked covariance model reimann; bbp.

Our paper is organized as follows. RBM and their learning algorithms are introduced in Section II. We consider the case of a data distribution with a single invariance in Section III, and with two symmetries in Section IV. A detailed theoretical analysis of the learning dynamics and of the receptive fields emerging through the symmetry-breaking transition can be found in Section V. Conclusions and speculative connections with experiments in neuroscience are proposed in Section VI.

## II Restricted Boltzmann Machines

### II.1 Definition and log-likelihood

A Restricted Boltzmann Machine (RBM) is a bipartite, undirected stochastic neural network with two layers, see Fig. 1:

• the visible layer includes units $v_i$, $i = 1, \dots, N$, which carry the configurations of data. For simplicity, we assume here that visible units take binary values, $v_i = \pm 1$.

• the hidden layer includes units $h_\mu$, $\mu = 1, \dots, M$, on which the representations of the data configurations are expressed. Hidden, or latent, variables can take real or binary values.

The model is formally defined by a Gibbs probability distribution over the sets of visible ($\mathbf{v} = \{v_i\}$) and hidden ($\mathbf{h} = \{h_\mu\}$) variable configurations:

$$p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v},\mathbf{h})}\ , \quad \text{where} \quad Z = \sum_{\mathbf{v}} \int d\mathbf{h}\; e^{-E(\mathbf{v},\mathbf{h})} \qquad (1)$$

is the partition function, such that $p$ is normalized to unity, and the energy function is given by

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{N} \sum_{\mu=1}^{M} w_{i\mu}\, v_i\, h_\mu - \sum_{i=1}^{N} b_i\, v_i + \sum_{\mu=1}^{M} U_\mu(h_\mu)\ . \qquad (2)$$

In the formula above, $w_{i\mu}$ is the real-valued weight (coupling) connecting hidden unit $\mu$ and visible unit $i$, the $b_i$ are real-valued bias terms, also called fields, and the $U_\mu$ are the hidden-unit potentials. We consider two possible choices for $U_\mu$:

• For binary ($h_\mu = \pm 1$) hidden units, a regular field term similar to the one acting on the visible units. In that case, Eqn. 2 is a special case of an Ising distribution, with couplings only between units belonging to different layers.

• For real-valued hidden units, the symmetric double-well potential $U_\mu(h_\mu) = \frac{h_\mu^2}{2} + \theta_\mu |h_\mu|$. For $\theta_\mu = 0$, the potential is quadratic and the corresponding variable is Gaussian; for $\theta_\mu < 0$, the potential has two minima at $h_\mu = \pm|\theta_\mu|$. This choice of potential effectively interpolates between Gaussian ($\theta_\mu = 0$) and binary ($\theta_\mu \to -\infty$) hidden units barra2017phase.

Due to the absence of connections between the units within a layer, the conditional probability of hidden units given the visible units factorizes as follows:

$$p(\mathbf{h}|\mathbf{v}) = \prod_{\mu=1}^{M} p\big(h_\mu \,\big|\, I_\mu(\mathbf{v})\big)\ , \qquad (3)$$

where $I_\mu(\mathbf{v}) = \sum_i w_{i\mu} v_i$ is the total input received from the visible layer by hidden unit $\mu$, and $p(h_\mu|I_\mu) \propto e^{-U_\mu(h_\mu) + h_\mu I_\mu}$. Therefore, sampling from the conditional distribution is simply done by first computing the hidden-layer inputs $I_\mu(\mathbf{v})$, then sampling each hidden unit independently given its input according to its hidden-unit potential. Similarly, the average activity of a hidden unit given the visible units, $\langle h_\mu|\mathbf{v}\rangle$, is a non-linear function of the input $I_\mu(\mathbf{v})$; for binary hidden units, we have $\langle h_\mu|\mathbf{v}\rangle = \tanh\big(I_\mu(\mathbf{v})\big)$. Therefore, RBM can be viewed as linear-nonlinear models, similar to other feature-extraction methods such as Independent Component Analysis. Symmetric formulas can be written for the conditional probability of visible units given the hidden units.
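As an illustration, the factorized sampling step of Eqn. (3) can be sketched in a few lines of numpy for binary ($\pm 1$) hidden units; the layer sizes, weight scale, and function name below are our own illustrative choices, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 20, 3                               # visible / hidden layer sizes (illustrative)
W = 0.1 * rng.standard_normal((N, M))      # weights w_{i mu}

def sample_h_given_v(v, W, rng):
    """Sample binary (+/-1) hidden units from p(h|v) = prod_mu p(h_mu | I_mu)."""
    I = v @ W                              # inputs I_mu(v) = sum_i w_{i mu} v_i
    p_up = 1.0 / (1.0 + np.exp(-2.0 * I))  # p(h_mu = +1 | I_mu) = e^I / (e^I + e^-I)
    h = np.where(rng.random(I.shape) < p_up, 1.0, -1.0)
    return h, np.tanh(I)                   # sample and conditional mean <h_mu|v>

v = rng.choice([-1.0, 1.0], size=N)        # an arbitrary visible configuration
h, mean_h = sample_h_given_v(v, W, rng)
```

The same routine, transposed, samples the visible layer given the hidden one.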

In addition, the marginal distribution over the visible units can be written in closed form:

$$p(\mathbf{v}) = \int d\mathbf{h}\; p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\, e^{\sum_{i=1}^N b_i v_i} \prod_{\mu=1}^{M} \int dh_\mu\, e^{-U_\mu(h_\mu) + h_\mu I_\mu(\mathbf{v})} = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} b_i v_i + \sum_{\mu=1}^{M} \Gamma_\mu\big(I_\mu(\mathbf{v})\big) \right) \equiv \frac{1}{Z}\, e^{-E_{\rm eff}(\mathbf{v})}\ ,$$

where $\Gamma_\mu(I) = \log \int dh\, e^{-U_\mu(h) + h I}$ is the cumulant-generating function, or log Laplace transform, associated to the potential $U_\mu$; for binary hidden units, $\Gamma_\mu(I) = \log 2\cosh(I)$. Note that, by construction, $\Gamma'_\mu(I_\mu)$ is the average value of the hidden unit given its input $I_\mu$; therefore the hidden-unit potential determines the transfer function of the hidden unit. Importantly, although the joint distribution $p(\mathbf{v},\mathbf{h})$ is pairwise, the marginal distribution $p(\mathbf{v})$ is not in general, as the $\Gamma_\mu$ functions are not quadratic. Therefore, RBM generate effective high-order interactions between the units $v_i$, and are capable of expressing complex measures over the visible configurations le2008representational; tubiana2017emergence.
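For binary hidden units, the closed form $\Gamma_\mu(I) = \log 2\cosh(I)$ can be checked on a tiny RBM by brute-force enumeration of the hidden states; the sketch below (layer sizes, weight scales, and function names are ours) compares the two expressions for every visible configuration.

```python
import itertools
import numpy as np

N, M = 4, 2                                # tiny sizes so enumeration is cheap
rng = np.random.default_rng(1)
W = 0.5 * rng.standard_normal((N, M))
b = 0.2 * rng.standard_normal(N)
spins = [-1.0, 1.0]

def boltz(v, h):
    """Unnormalized joint weight e^{-E(v,h)}, binary hidden units without hidden fields."""
    return np.exp(v @ W @ h + b @ v)

def marg_brute(v):
    """Marginal over h by explicit sum over the 2^M hidden configurations."""
    v = np.array(v)
    return sum(boltz(v, np.array(h)) for h in itertools.product(spins, repeat=M))

def marg_closed(v):
    """Closed form: exp( sum_i b_i v_i + sum_mu log 2 cosh(I_mu) )."""
    v = np.array(v)
    return np.exp(b @ v + np.sum(np.log(2.0 * np.cosh(v @ W))))

errs = [abs(marg_brute(v) - marg_closed(v)) / marg_brute(v)
        for v in itertools.product(spins, repeat=N)]
max_err = max(errs)
```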

### II.2 Training algorithm

Training the RBM is the process of fitting the parameters $\Theta = \{w_{i\mu}, b_i, U_\mu\}$ to maximize the average log-likelihood of the data items $\mathbf{v}^{\rm data}$, assumed to be independently drawn from $p(\mathbf{v})$. While this may be done with the gradient-ascent method, calculating the likelihood is computationally intensive as it requires evaluating the partition function, and sampling methods such as Markov Chain Monte Carlo (MCMC), in the form of Gibbs sampling, are used.

For the model with parameters $\Theta$, the log-likelihood of a single training example is

$$\log \mathcal{L}(\mathbf{v}^{\rm data}|\Theta) = \log p(\mathbf{v}^{\rm data}) = -E_{\rm eff}(\mathbf{v}^{\rm data}) - \log Z = -E_{\rm eff}(\mathbf{v}^{\rm data}) - \log\left[\sum_{\mathbf{v}} e^{-E_{\rm eff}(\mathbf{v})}\right]\ . \qquad (4)$$

Taking the partial derivative with respect to $\Theta$ gives

$$\frac{\partial \log \mathcal{L}(\mathbf{v}^{\rm data}|\Theta)}{\partial \Theta} = -\frac{\partial E_{\rm eff}(\mathbf{v}^{\rm data})}{\partial \Theta} - \frac{1}{Z}\frac{\partial Z}{\partial \Theta} = -\frac{\partial E_{\rm eff}(\mathbf{v}^{\rm data})}{\partial \Theta} + \left\langle \frac{\partial E_{\rm eff}(\mathbf{v})}{\partial \Theta} \right\rangle_{\rm RBM}\ , \qquad (5)$$

where $\langle \cdot \rangle_{\rm RBM}$ denotes the average according to the marginal distribution $p(\mathbf{v})$ over the visible units with parameter values $\Theta$.

In particular, for the weights $w_{i\mu}$, we have, according to (II.1), $\partial E_{\rm eff}(\mathbf{v})/\partial w_{i\mu} = -v_i\, \Gamma'_\mu\big(I_\mu(\mathbf{v})\big) = -v_i\, \langle h_\mu|\mathbf{v}\rangle$. The gradient of the total log-likelihood is then

$$\frac{\partial \langle \log \mathcal{L}(\mathbf{v}^{\rm data}|\Theta)\rangle_{\rm data}}{\partial w_{i\mu}} = \Big\langle v_i^{\rm data}\, \langle h_\mu|\mathbf{v}^{\rm data}\rangle \Big\rangle_{\rm data} - \Big\langle v_i\, \langle h_\mu|\mathbf{v}\rangle \Big\rangle_{\rm RBM}\ . \qquad (6)$$

Equation (6) is an example of a moment-matching condition, as it imposes that the correlation between the variables $v_i$ and $h_\mu$ computed from the data coincides with its counterpart defined by the RBM model distribution. The gradients of $\log \mathcal{L}$ over the $b_i$ and the $U_\mu$ lead to similar moment-matching conditions for, respectively, the average values of $v_i$ and of $h_\mu$.

#### II.2.2 Approximating the log-likelihood gradient

In the gradient of the log-likelihood of Eqn. (6), the model-distribution moment is not computationally tractable, as it requires summing over all values of the visible and hidden variables. In practice, an approximate value for this term is obtained by Markov Chain Monte Carlo (MCMC) methods. The Markov Chain is defined by repeated iterations of Gibbs sampling, which consists in sampling $\mathbf{h}$ from $\mathbf{v}$ and $\mathbf{v}$ from $\mathbf{h}$ using Eqn. 3. In principle, one should run a full MCMC simulation at each gradient step, but this is computationally prohibitive. For our RBM training we use the Persistent Contrastive Divergence (PCD) algorithm tieleman2008training: Markov Chains are initialized at the beginning of the training and updated with only a few Gibbs Monte Carlo steps between each evaluation of the gradient; see fischer2015 for a more detailed review. This approximation works very well for the data distributions studied here because they are in a paramagnetic phase (i.e. monomodal), hence the Markov Chains mix very rapidly.
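A minimal sketch of one PCD gradient estimate for a binary-binary RBM, assuming the $\tanh$ transfer function of the previous section; the batch size, learning rate, number of Gibbs steps, and function names are our own illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_layer(inputs, rng):
    """Sample +/-1 units given their total inputs (binary-binary RBM)."""
    p_up = 1.0 / (1.0 + np.exp(-2.0 * inputs))
    return np.where(rng.random(inputs.shape) < p_up, 1.0, -1.0)

def pcd_gradient(W, v_data, v_chains, rng, k=1):
    """One PCD estimate of d<logL>/dW; the chains persist across calls."""
    # Positive (data) term: <v_i <h_mu|v>>_data, with <h_mu|v> = tanh(I_mu)
    pos = v_data.T @ np.tanh(v_data @ W) / len(v_data)
    # Negative (model) term: k Gibbs steps starting from the stored chains
    for _ in range(k):
        h = sample_layer(v_chains @ W, rng)
        v_chains = sample_layer(h @ W.T, rng)
    neg = v_chains.T @ np.tanh(v_chains @ W) / len(v_chains)
    return pos - neg, v_chains

N, M, B = 10, 2, 50
W = 0.01 * rng.standard_normal((N, M))
chains = rng.choice([-1.0, 1.0], size=(B, N))      # persistent chains
v_batch = rng.choice([-1.0, 1.0], size=(B, N))     # stand-in for a data minibatch
grad, chains = pcd_gradient(W, v_batch, chains, rng)
W += 0.05 * grad                                   # one stochastic gradient-ascent step
```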

#### II.2.3 Stochastic Optimization

The RBM is trained using Stochastic Gradient Ascent (SGA), the gold standard for neural-network optimization. SGA is a variant of ordinary gradient ascent where, at each step, only a small subset of the data set (the minibatch), of size $B$ examples, is used to evaluate the average log-likelihood, see Eqn. 7, where $\nu$ is the learning rate, which dictates how much to change the parameters in the direction of the steepest gradient. The dataset is divided into mini-batches, and for each epoch of training we perform one SGA update for each mini-batch. An epoch consists of using all the subsets for the update, such that each data sample is used once. After every epoch the subsets are again drawn randomly. Several dozens of epochs are usually required to reach convergence.

$$\Theta^{t+1} = \Theta^{t} + \nu \left[ \frac{1}{B} \sum_{b\,\in\,{\rm Batch}(t)} \nabla_\Theta \log \mathcal{L}\big(\mathbf{v}^{{\rm data},b}\big|\Theta\big) \right] \qquad (7)$$

Compared to ordinary gradient ascent, SGA serves several purposes. First and foremost, its computational cost is significantly reduced, as only a small batch is used per update; yet the update is usually reliable thanks to data redundancy. Second, the stochastic evaluation of the gradient introduces noise in the learning process. This prevents the dynamics from getting trapped in local maxima, which is crucial for non-convex optimization landscapes, and it also directs the dynamics toward maxima with wider basins of attraction jastrzkebski2017three. It has been argued that the latter effect contributes to improving generalization performance hochreiter1997flat; keskar2016large; chaudhari2016entropy. Though SGA has a slower asymptotic convergence rate than ordinary gradient descent, this often does not matter in practice for finite data sets, as the performance on the test set usually does not improve anymore once the asymptotic regime is reached bottou2008tradeoffs.

The noise level of the SGA is directly related to the batch-size and learning-rate parameters, see for instance smith2017bayesian. Briefly speaking, assuming i.i.d. samples and an infinite number of them, the SGA parameter increment has mean value $\nu\, \nabla_\Theta \langle \log \mathcal{L}\rangle$ and variance proportional to $\nu^2/B$; in the large-$B$ limit it is also Gaussian distributed, according to the central limit theorem. In comparison, the increments of a continuous-time Langevin equation with energy landscape $-\langle \log \mathcal{L}\rangle$ and noise covariance matrix $D$, integrated over a time step $\nu$, have the same mean value and a covariance proportional to $\nu D$. Identifying both noises gives the following scaling law for the SGA noise, $D \propto \nu/B$. Reducing the learning rate and increasing the batch size therefore decreases the noise level, and vice versa. In all our experiments, both learning rates and batch sizes are kept fixed throughout a training session.
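The $1/B$ scaling of the minibatch-gradient variance can be illustrated numerically with i.i.d. unit-variance per-sample gradients, a toy stand-in for the true per-sample gradients:

```python
import numpy as np

rng = np.random.default_rng(3)

def minibatch_grad_var(batch_size, n_batches=20000):
    """Variance of a minibatch-averaged gradient when per-sample gradients
    are i.i.d. with unit variance: expected to scale as 1/batch_size."""
    g = rng.standard_normal((n_batches, batch_size)).mean(axis=1)
    return g.var()

# Increasing B tenfold should reduce the gradient noise variance tenfold.
ratio = minibatch_grad_var(10) / minibatch_grad_var(100)
```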

## III Learning Data with a Single Invariance

### III.1 Data distribution: Ising model

Our first toy distribution for the data is the celebrated one-dimensional Ising model from statistical physics. Here each $v_i$ is a spin which can either be up or down, i.e. can take binary values $v_i = \pm 1$. The corresponding joint probability distribution of the visible units reads

$$p_{\rm data}(v_1, v_2, \dots, v_N) = \frac{1}{Z_{\rm Ising}}\, e^{\beta \sum_{i=1}^{N} v_i v_{i+1}}\ , \qquad (8)$$

where the partition function $Z_{\rm Ising}$ normalizes this probability over the visible configurations, and $\beta$ is referred to as the inverse temperature. We enforce periodic boundary conditions through $v_{N+1} \equiv v_1$.

As is well known, under distribution (8) all visible units have average values equal to zero, and the correlation function decays exponentially with the distance separating the corresponding units on the ring,

$$\langle v_i v_j \rangle = \sum_{\mathbf{v}} p_{\rm data}(v_1, v_2, \dots, v_N)\, v_i v_j = e^{-|i-j|/\xi}\ , \quad \text{where} \quad \xi = \frac{1}{\ln \coth \beta} \qquad (9)$$

is the correlation length. The above expression for the correlation holds when $|i-j| \ll N$.
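The exponential decay of Eqn. (9) can be checked by exact sequential sampling of an open Ising chain, which approximates the ring well when $N \gg \xi$; the chain length, inverse temperature, and sample count below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_ising_chain(N, beta, n_samples, rng):
    """Exact sampling of an open 1D Ising chain: each spin copies its left
    neighbour with probability e^beta / (2 cosh(beta))."""
    p_same = np.exp(beta) / (2.0 * np.cosh(beta))
    v = np.empty((n_samples, N))
    v[:, 0] = rng.choice([-1.0, 1.0], size=n_samples)
    for i in range(1, N):
        flip = rng.random(n_samples) >= p_same
        v[:, i] = np.where(flip, -v[:, i - 1], v[:, i - 1])
    return v

beta, d = 1.0, 3
v = sample_ising_chain(50, beta, 200000, rng)
corr = (v[:, 10] * v[:, 10 + d]).mean()    # empirical <v_i v_{i+d}>
theory = np.tanh(beta) ** d                # = exp(-d/xi), with xi = 1/ln(coth beta)
```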

Formula (8) defines a simple example of a distribution invariant under the set of translations (or, better, rotations) on the $N$-site ring. More precisely, for any integer $k$, we have

$$p_{\rm data}(v_1, v_2, \dots, v_N) = p_{\rm data}(v_{k+1}, v_{k+2}, \dots, v_{k+N})\ , \qquad (10)$$

where the index $k+i$ is to be intended modulo $N$. Figure 2 shows a number of configurations drawn independently and at random from this probability distribution using the Gibbs sampling algorithm.

### III.2 Initial learning and emergence of place cells

#### III.2.1 Case of a single hidden unit

First we train the RBM with a single hidden unit ($M=1$) and $N$ visible units. Such a limited machine is, of course, not expected to reproduce accurately the Ising-model distribution underlying the data. However, this is an interesting limit case to study how the RBM can make the most of the single set of weights attached to its unit. We use a large number of data configurations for training, which makes our empirical distribution approximately invariant under rotations on the ring.

We initialize the weights with small-amplitude Gaussian random values; since the data are symmetric, we further impose $b_i = 0$. The results of the training phase after 100 epochs, i.e. the weights $w_{i1}$, are shown in Fig. 3(a). We observe that the weights are not uniform, as could have been naively expected from rotational invariance, but focus on a limited portion (place) of the $N$-site ring. The position of the peak depends on the initial conditions for the weights; it may also be influenced by the small irregularities in the data set coming from the finite number of training configurations.

To understand what determines the width of the weight peak, we train different RBMs with data at different inverse temperatures $\beta$, and calculate their average peak widths over multiple runs. We plot the peak width as a function of the correlation length $\xi$ in Fig. 3(b). We observe that the peak width scales proportionally to $\xi$. Interestingly, despite its very limited expressive power, our single-unit RBM has correctly learned to coarse-grain the visible-unit configurations on the relevant length scale in the data, $\xi$. Having wider receptive, or place, fields would not be as informative. For instance, with a set of uniform weights, the hidden unit would simply estimate the average magnetization of (mean value of all visible units in) the data configurations, which are all equal to zero up to fluctuations of the order of $1/\sqrt{N}$, and would completely miss the correlated structure of the data. Conversely, narrower place fields would have lower signal-to-noise ratios: the strong correlations of visible units over the length $\xi$ allow one to reliably estimate the local magnetization and the correlation structure on this scale.

#### III.2.2 Case of multiple hidden units

We next show results obtained when training RBMs with $M=3$ hidden units on the same data. Figure 4 shows that each of the three sets of weights has roughly the same peaked structure (same width) as in the $M=1$ case, but the peaks are centered at different places along the ring. The roughly equal distance between successive peaks shows the existence of an effective repulsion between the weights of any two hidden units. This phenomenon is easy to understand on intuitive grounds: strongly overlapping place fields would produce highly redundant hidden units, and would not help capturing the spatial correlations in the data, which spread over the entire ring.

Training of RBMs with a larger number of hidden units shows the same pattern of production of place fields attached to different hidden units, covering the visible space (ring) in an approximately uniform way, see Fig. 4(c). The only notable difference is that the width of the place fields shrinks as $M$ gets very large. This happens when the single-hidden-unit peaks would start to largely overlap.

### III.3 Long-time learning and restoration of invariance through place-field diffusion

We now let the training dynamics evolve for a much larger number of epochs. In the case of an RBM with one hidden unit only, the weight vector shows the overall peak structure of Fig. 3(a) at all times (after a short initial transient during which the localized peak emerges). However, the location of the peak may change on very long training time scales. Figure 5(a) shows ten trajectories of the center of the peak, corresponding to ten random initializations of the weights (equal to small values drawn independently and randomly). We observe that the centers of the peaks undergo a seemingly random motion. When the number of data items used for training is very large (which erases any tiny non-homogeneity in the empirical distribution), this random motion looks like pure diffusion.

Figure 5(b) shows that the Mean Square Displacement (MSD) of the peak center grows roughly linearly with the training time (number of epochs), which defines the effective diffusion coefficient of the weight peak. For intermediate numbers of data items, diffusion is activated: due to the inhomogeneities in the empirical data distribution, some places along the ring are preferred and have a tendency to trap the weight peak for some time.

Repeating the same analysis for an RBM with $M=3$ hidden units allows us to observe the diffusion of the three peak centers, see Fig. 6. We see that the motions of these centers are coupled so as to maintain a constant distance between each other. This is a clear signature of the effective repulsion between the hidden-unit weight vectors already discussed in Section III.2.2.

### III.4 Case of few data: Retarded learning transition

The emergence of a pronounced peak in the weight vector attached to a hidden unit, reported above, takes place only if the number of data items is sufficiently large. For very few data, the RBM weights do not show any clear spatial structure and seem to overfit the data. Similarly, for a fixed number of data samples, a transition is observed between the overfitting and spatially-structured regimes as the correlation length $\xi$ (or the inverse temperature $\beta$), that is, the spatial signal in the data, is increased. To distinguish these two regimes, we introduce the empirical order parameter

$$W = \left| \sum_{i=1}^{N} w_{i1} \right|\ , \qquad (11)$$

which is expected to be large when place fields emerge and the weights are spatially structured, and much smaller (and vanishingly small in the large-$N$ limit) in the overfitting regime.
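A quick numerical illustration of the order parameter of Eqn. (11): a localized, single-sign place field gives a large $W$, while zero-mean unstructured weights (mimicking the overfitting regime) give a much smaller one. The Gaussian peak shape and the normalizations below are our own choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100

def order_parameter(w):
    """W = |sum_i w_i1|: large for a localized, same-sign peak,
    O(1) for zero-mean random weights of scale 1/sqrt(N)."""
    return abs(w.sum())

# A place-field-like weight profile: a localized bump on the ring
w_peak = np.exp(-0.5 * ((np.arange(N) - 50) / 5.0) ** 2)
# Unstructured, overfitting-like weights with the same typical norm scale
w_noise = rng.standard_normal(N) / np.sqrt(N)

W_peak = order_parameter(w_peak)     # ~ 5 * sqrt(2*pi) ~ 12.5
W_noise = order_parameter(w_noise)   # |N(0,1)|, typically O(1)
```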

Figure 7(a) shows the value of the order parameter $W$ as a function of the intensity of spatial correlations $\beta$ for a fixed number of data samples. For small values of $\beta$ (and $\xi$), $W$ vanishes: the very weak spatial structure in the available data is not learned by the RBM. At large $\beta$, a place field emerges, focusing on a finite portion of the ring, and $W$ is non-zero. The same transition is observed when $\beta$ is fixed and the number of training samples is varied, see Fig. 7(b). For few samples or, equivalently, large noise levels $r$, the RBM overfits the data and $W$ vanishes. For small values of $r$, $W$ becomes non-zero, signalling the emergence of a place field focusing on a finite portion of the ring.

This transition is an example of the very general mechanism of the so-called retarded-learning phenomenon watkin, also encountered in the context of random correlation matrices and the spiked covariance model. The connection with random matrices will be made explicit in Section V.1.

## IV Learning data with multiple invariances

### IV.1 Data distribution: discretized XY model

The classical XY model is a popular model in statistical physics, used in particular to study topological phase transitions in two dimensions. We consider here the one-dimensional version of this model, which shows no such phase transition but is nonetheless very useful for our study due to its additional symmetry with respect to the Ising model. In the XY model each lattice site $i$ carries an angle $\theta_i$ with respect to some arbitrary, fixed direction. The energy function reads, up to a scale factor that can be absorbed in the temperature definition,

$$E(\theta_1, \theta_2, \dots, \theta_N) = -\sum_{i=1}^{N} \cos(\theta_i - \theta_{i+1}) \qquad (12)$$

with periodic boundary condition $\theta_{N+1} \equiv \theta_1$. We then discretize the set of angle values in multiples of $2\pi/P$, where $P$ is an integer. The resulting model is a Potts model over the integer-valued variables $v_i \in \{0, 1, \dots, P-1\}$, with probability distribution (with periodic boundary conditions)

$$p_{\rm data}(v_1, v_2, \dots, v_N) = \frac{1}{Z}\, e^{\beta \sum_{i=1}^{N} M(v_i, v_{i+1})}\ , \qquad (13)$$

where the interaction kernel mimics the XY energy function,

$$M(v, v') = \cos\left(\frac{2\pi}{P}\,(v - v')\right)\ , \qquad (14)$$

and the partition function $Z$ normalizes the distribution. This distribution enjoys two symmetries, compared to the single symmetry of the Ising model in (10): for any integers $k$ and $L$ we have

$$p_{\rm data}(v_1, v_2, \dots, v_N) = p_{\rm data}(v_{k+1} + L,\, v_{k+2} + L,\, \dots,\, v_{k+N} + L)\ , \qquad (15)$$

where the index $k+i$ and the state $v+L$ are to be intended, respectively, modulo $N$ and modulo $P$. Figure 8 shows a set of 100 configurations over the $N$ sites, generated independently and at random from this model.
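Because the interaction kernel (14) depends only on the difference $v_i - v_{i+1}$, an open discretized-XY chain can be sampled exactly by drawing i.i.d. angular increments from the transfer probabilities; a sketch, with parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_potts_xy_chain(N, P, beta, n_samples, rng):
    """Open-chain sampling of the discretized XY (Potts) model:
    p(v_{i+1} | v_i) is proportional to exp(beta * cos(2*pi*(v_i - v_{i+1})/P))."""
    diffs = np.arange(P)
    w = np.exp(beta * np.cos(2.0 * np.pi * diffs / P))
    p_step = w / w.sum()                  # distribution of (v_{i+1} - v_i) mod P
    v = np.empty((n_samples, N), dtype=int)
    v[:, 0] = rng.integers(0, P, size=n_samples)
    steps = rng.choice(P, size=(n_samples, N - 1), p=p_step)
    v[:, 1:] = (v[:, [0]] + np.cumsum(steps, axis=1)) % P
    return v

v = sample_potts_xy_chain(N=25, P=10, beta=2.0, n_samples=1000, rng=rng)
```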

### IV.2 Symmetry-breaking in both spaces

#### IV.2.1 Case of a single hidden unit

We consider an RBM with $N$ visible Potts-type units $v_i$, which can take one out of $P$ values, and with a single hidden unit. The weight set is now a vector $\{w_{i,1,v}\}$, with $i = 1, \dots, N$ and $v = 0, \dots, P-1$. The component $w_{i,1,v}$ of this vector is the connection between the hidden unit and the visible unit $i$ when it carries the Potts state $v$.

We first train an RBM with a single hidden unit, which takes real values and is subject to a double-well potential. Figure 9(a) shows the weights obtained after training from a very large number of configurations, starting from small white-noise initial conditions for the $w_{i,1,v}$. We observe a strong modulation of the weights in the space and angle directions, achieving peak values around some site and angle. Similar results were found for a binary-valued hidden unit, with a slightly weaker localization of the weights and at a different location, see Fig. 9(b). In the following, we show results obtained for the RBM with the real-valued hidden unit only.

Since the interaction matrix in the Potts model takes the cosine form, our RBM should learn the same functional dependence from the data samples. We show in Fig. 9(c) the quantity

$$W^{\rm angular}_i = \left| \sum_{v=0}^{P-1} w_{i,1,v}\; e^{2\pi \mathrm{i}\, v/P} \right|\ , \qquad (16)$$

which measures the angular modulation of the weights on each site $i$. We see a strong spatial localization: the weights only take non-zero values near some location on the ring. This location is arbitrary, similarly to the place-field formation accompanying the breaking of translation symmetry over space observed for the Ising model. In addition, at the location of the maximum, the weight vector is very well approximated by a cosine function, see Fig. 9(d). The RBM has learned the correct frequency, equal to $2\pi/P$, while the phase takes an arbitrary value. Indeed, the phase is a free parameter due to the invariance against choices of $L$ in (15).

To obtain a more precise picture of the receptive field, we then consider, for each site $i$, the $P$-dimensional vector of the weights $\{w_{i,1,v}\}_{v=0,\dots,P-1}$. We then fit this vector with a cosine function of adjustable frequency and phase, referred to as, respectively, $2\pi/P_i$ and $\phi_i$. We show, as functions of the site index $i$, the periods $P_i$ and the phases $\phi_i$ in Fig. 9(e). We observe that the period takes the expected value $P$ over the receptive field (sites ranging approximately between … and 50). Similarly, the phase is constant (and takes an arbitrary value) over the same region of space. Informally speaking, when the hidden unit is on, all the XY spins supported by the sites in the receptive field point in the same direction.
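The per-site cosine fit can be done in closed form by projecting the $P$-dimensional weight vector onto the frequency-$2\pi/P$ Fourier mode; the sketch below fixes the frequency to exactly $2\pi/P$ (the adjustable-frequency fit of the text would additionally scan over modes) and tests the recovery on a synthetic weight vector of our own making.

```python
import numpy as np

def fit_cosine(w_site, P):
    """Least-squares fit of a P-dimensional weight vector to A*cos(2*pi*v/P + phi);
    returns the amplitude A and phase phi."""
    v = np.arange(P)
    c = np.cos(2.0 * np.pi * v / P)
    s = np.sin(2.0 * np.pi * v / P)
    a = 2.0 * (w_site @ c) / P          # projection on the cosine mode: A*cos(phi)
    b = 2.0 * (w_site @ s) / P          # projection on the sine mode:  -A*sin(phi)
    return np.hypot(a, b), np.arctan2(-b, a)

# Synthetic check: recover a known amplitude and phase exactly
P = 10
w_site = 0.8 * np.cos(2.0 * np.pi * np.arange(P) / P + 1.2)
A, phi = fit_cosine(w_site, P)
```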

#### IV.2.2 Case of multiple hidden units

We also train an RBM with $M=5$ real-valued hidden units with double-well potentials, with results shown in Fig. 10. We see that the receptive fields of the hidden units are mutually separated in space, and show the same phenomenon of repulsion between the units observed for the Ising data. In addition, the angular dependences of the five weight vectors exhibit the same frequency (equal to $2\pi/P$), but the phases also show a nice equi-separation, due to repulsion along the angular direction.

Though we expected to see a diffusion of the receptive fields along both the spatial and angular dimensions for very large learning times, we did not observe this phenomenon, even for RBM trained with 1,000,000 samples. This is likely due to the fact that the landscape is still rough for this amount of data, so that diffusion remains activated. We have not tried to increase the number of samples because of the high computational cost.

### IV.3 Differentiated retarded learning transitions

In this section, we show that RBM trained with data generated by the discretized XY model show retarded-learning phase transitions. However, as there are two potential symmetry-breaking directions in this model, one corresponding to the angular space and the other to the positional space, the breaking of symmetry along these directions may take place at two different values of the noise ratio $r$, i.e. for different numbers of samples in the training data set. The reason is that the number of Potts states in the angular direction, $P$, may largely differ from the number of sites on the lattice, $N$. Consequently, the effective system sizes along the two directions are different.

This phenomenon of differentiated retarded-learning phase transitions is reported in Fig. 11. We show in panel (a) of the figure the spatial modulation defined through

$$W^{\rm spatial}_v = \sum_{i=1}^{N} w_{i,1,v}\ , \qquad (17)$$

as a function of the Potts angular state variable $v$. We observe that, for large $r$, the spatial modulation vanishes all over the angular space: small amounts of data are not sufficient for the RBM to capture the angular correlations in the configurations. For large enough data sets (small $r$), the spatial modulation shows a clear dependence on $v$. We then show in panel (b) of Fig. 11 the angular modulation $W^{\rm angular}_i$ as a function of the lattice site index $i$ for varied levels of sampling noise $r$. Again, for large $r$, no modulation is seen. However, for very small noise levels, we do observe that $W^{\rm angular}_i$ is peaked around some well-defined site. Interestingly, in an intermediate range of $r$, the angular modulation does not significantly vary over space, while the spatial modulation varies over angles, compare panels (a) and (b). We conclude that, for intermediate values of $r$, the RBM has created a place field along the angular direction, but not along the spatial direction.

To test the generality of the phenomenon of differentiated transitions, we also generated data samples from variants of the discretized XY model. We modified the XY model by changing the interaction matrix in (13) from the cosine function to short-range couplings, and by extending the Hamiltonian to include not only nearest-neighbor but also long-range couplings in positional space. The resulting models display a variety of phase transitions in the RBM weights after training, with positional symmetry breaking arising before angular ordering (i.e. for smaller amounts of training data) in some cases (not shown).

## V Theoretical analysis

Hereafter, we study analytically the dynamics of learning of the weights of the RBM with binary hidden units when trained with data. Two limiting cases will be considered:

• The case of few data, which allows us to establish the connection with random matrix theory and the so-called retarded learning transition;

• The case of a large amount of data, with weak correlations, which we analyze in detail to understand the formation and shape of the place field, as well as the interactions between different place fields arising through learning.

While we will focus on the learning dynamics of the weights, we assume that the RBM has correctly learned the local fields, so we set them to zero from the beginning in the case of unbiased binary data, $v_i=\pm1$. In addition, we assume that hidden units are binary, $h_\mu=\pm1$. The log-likelihood therefore reads

$$\log L=\Big\langle\sum_{\mu=1}^{M}\log\cosh\Big(\sum_{i=1}^{N}w_{i\mu}v_i\Big)\Big\rangle_{\rm data}-\log Z(\{w_{i\mu}\})\;, \qquad (18)$$

where the partition function is

$$Z(\{w_{i\mu}\})=\sum_{\{v_1,v_2,\dots,v_N\}}\ \prod_{\mu=1}^{M}\cosh\Big(\sum_{i=1}^{N}w_{i\mu}v_i\Big)\;. \qquad (19)$$

Taking the partial derivative with respect to $w_{i\mu}$, we get the following expression for the gradient of the log-likelihood:

$$\frac{\partial\log L}{\partial w_{i\mu}}=\Big\langle v_i\tanh\Big(\sum_{j=1}^{N}w_{j\mu}v_j\Big)\Big\rangle_{\rm data}-\frac{1}{Z(\{w_{i\mu}\})}\sum_{\{v_1,v_2,\dots,v_N\}}v_i\sinh\Big(\sum_{j=1}^{N}w_{j\mu}v_j\Big)\prod_{\lambda(\neq\mu)}\cosh\Big(\sum_{j=1}^{N}w_{j\lambda}v_j\Big)\;. \qquad (20)$$

The continuous-time dynamical equations for the evolution of the weights during training, assuming that the batch size is maximal, i.e. that all the data are used for training, are

$$\frac{dw_{i\mu}}{dt}=\nu\,\frac{\partial\log L}{\partial w_{i\mu}}\;, \qquad (21)$$

where $\nu$ is the learning rate.
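For small systems, the log-likelihood (18)-(19) and its gradient (20) can be evaluated exactly by enumerating all visible configurations, which gives a direct way to integrate (21) by explicit Euler steps. A minimal sketch (function names, sizes, and the synthetic data are ours, not from the paper):

```python
import numpy as np
from itertools import product

def log_likelihood(w, data):
    """Eq. (18): w is (N, M); data is an (S, N) array of +-1 spins."""
    ll_data = np.log(np.cosh(data @ w)).sum(axis=1).mean()
    # partition function (19) by brute-force enumeration (small N only)
    configs = np.array(list(product([-1, 1], repeat=w.shape[0])))
    Z = np.prod(np.cosh(configs @ w), axis=1).sum()
    return ll_data - np.log(Z)

def gradient(w, data):
    """Eq. (20): data term minus model term."""
    grad_data = np.einsum('si,sm->im', data, np.tanh(data @ w)) / len(data)
    configs = np.array(list(product([-1, 1], repeat=w.shape[0])))
    p = np.prod(np.cosh(configs @ w), axis=1)     # unnormalised Boltzmann weights
    model_corr = np.einsum('s,si,sm->im', p / p.sum(), configs,
                           np.tanh(configs @ w))  # <v_i tanh(.)> under the model
    return grad_data - model_corr

rng = np.random.default_rng(1)
N, M = 6, 2
w = 0.01 * rng.standard_normal((N, M))
data = rng.choice([-1, 1], size=(50, N)).astype(float)

# check the analytic gradient (20) against a finite difference of (18)
g = gradient(w, data)
eps = 1e-6
w2 = w.copy(); w2[0, 0] += eps
fd = (log_likelihood(w2, data) - log_likelihood(w, data)) / eps
print(abs(g[0, 0] - fd) < 1e-4)   # True

# one Euler step of the gradient ascent (21)
nu = 0.1
w = w + nu * g
```

The enumeration of $2^N$ configurations limits this check to small $N$, but it removes any sampling error from the model term of (20).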

### v.1 Few data: Small weight expansion and the retarded learning transition

#### v.1.1 Linearized equations of the dynamics

In this Section, we assume that the weights have initially very small (random) values. For small enough learning times, we may linearize the dynamical equations (21). We obtain

$$\frac{dw_{i\mu}}{dt}=\nu\sum_{j=1}^{N}\big(C_{ij}-\delta_{ij}\big)\,w_{j\mu}\;, \qquad (22)$$

where

$$C_{ij}=\langle v_i\,v_j\rangle_{\rm data} \qquad (23)$$

is the empirical covariance matrix estimated from the data. Let $\Lambda$ be the largest eigenvalue of $C$, and $\mathbf{e}$ the associated eigenvector, with components $e_i$. As the diagonal elements are equal to unity ($C_{ii}=1$), we have $\Lambda>1$, unless $C$ is the identity matrix and the data show no correlations at all. Hence, according to (22), all weight vectors align along $\mathbf{e}$; this result holds within the linear approximation, and is therefore expected to be valid at short times only.

Let us consider the noise ratio $r=N/M$, equal to the number of visible units (system size) divided by the number of training samples. For bad sampling (large $r$), the empirical covariance matrix can be approximated by the covariance matrix of a null model, in which all visible units are independent and unbiased: $v_i$ is equal to $\pm1$ with equal probabilities ($\tfrac12$), independently of the other $v_j$'s. The asymptotic distribution of the eigenvalues of such a random matrix has a special form, called the Marcenko-Pastur (MP) spectrum, whose right edge (top eigenvalue) is given by

$$\Lambda_{\rm noise}=\Lambda_{\rm MP}=\big(1+\sqrt{r}\big)^2\;, \qquad (24)$$

and the corresponding top eigenvector has random, Gaussian distributed components.

Conversely, for good sampling (small $r$), we expect the empirical covariance to be similar to the covariance matrix computed from the model distribution from which the data were generated. Due to the translational invariance of the model, the top eigenvector of the model covariance matrix has the same symmetry: its components are all equal, up to a normalization factor. Hence, we expect the empirical top eigenvector to be roughly uniform as well. In the double limit of large $N$ and large $M$, the two regimes may be separated by a sharp transition, taking place at a critical value of $r$. To locate this value, we compute below the top eigenvalue of the model covariance matrix, and compare it to its MP counterpart (24). The crossover between the bad- and good-sampling regimes takes place when both eigenvalues are equal.
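The MP prediction (24) is easy to probe numerically: draw independent unbiased spins, build the empirical covariance, and compare its top eigenvalue to the edge $(1+\sqrt{r})^2$. A small sketch (sizes are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

for M in (100, 400, 1600):                     # noise ratios r = 2.0, 0.5, 0.125
    r = N / M
    V = rng.choice([-1.0, 1.0], size=(M, N))   # independent unbiased spins
    C = V.T @ V / M                            # empirical covariance, C_ii = 1
    top = np.linalg.eigvalsh(C)[-1]            # largest eigenvalue
    mp_edge = (1 + np.sqrt(r)) ** 2            # eq. (24)
    print(f"r = {r:5.3f}: top eigenvalue = {top:.3f}, MP edge = {mp_edge:.3f}")
```

At these finite sizes the empirical top eigenvalue fluctuates around the MP edge; the agreement sharpens as $N$ and $M$ grow at fixed $r$.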

#### v.1.2 Case of Ising data

Let us consider the case of the one-dimensional Ising model. When a large number of configurations is available, we have $C_{ij}=(\tanh\beta)^{|i-j|}$, see (9). Due to the rotational invariance along the ring, the top eigenvector has all its components equal. Therefore the top eigenvalue of the covariance matrix is

$$\Lambda_{\rm Ising}(\beta)=\sum_{j=1}^{N}C_{ij}=1+2\left[\tanh\beta+\tanh^2\beta+\tanh^3\beta+\dots\right]\simeq\frac{1+\tanh\beta}{1-\tanh\beta}=e^{2\beta}\;. \qquad (25)$$

When the inverse temperature $\beta$ is small, this 'signal' eigenvalue is smaller than the 'noise' eigenvalue (24) locating the right edge of the MP spectrum. In this case, we expect the top eigenvector of the empirical covariance matrix to be noisy, and not to capture the correlations between the Ising variables. In this regime, the RBM overfits the data and no receptive field with a localized weight structure can emerge. As $\beta$ increases above

$$\beta(r)=\log\big(1+\sqrt{r}\big)\;, \qquad (26)$$

the signal eigenvalue becomes larger than the MP edge, and we expect the top eigenvector of the empirical covariance matrix to have components of comparable magnitude, similar to the uniform top eigenvector of the model.

The above statement is corroborated by the results shown in Fig. 12(Top), which shows the top eigenvalue of the correlation matrix (23) as a function of the noise ratio $r=N/M$, where $M$ is the number of samples. For large $r$ (few samples), the top eigenvalue is very well approximated by the MP edge (24), while, for small $r$ (many samples), it gets very close to $\Lambda_{\rm Ising}=e^{2\beta}$, as expected. The crossover between these two regimes takes place at values of $r$ satisfying (26).
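The comparison underlying this crossover can be sketched as follows: build the model covariance $C_{ij}=(\tanh\beta)^{|i-j|}$ on a ring, check that its top eigenvalue matches $e^{2\beta}$, and locate the threshold (26) where signal and noise eigenvalues coincide (the chosen $N$, $\beta$ and $r$ are ours):

```python
import numpy as np

def ising_top_eigenvalue(beta, N=200):
    """Top eigenvalue of C_ij = (tanh beta)^{|i-j|} on a ring of N sites."""
    d = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
    d = np.minimum(d, N - d)                   # periodic distances
    C = np.tanh(beta) ** d
    return np.linalg.eigvalsh(C)[-1]

beta, r = 0.8, 1.0
signal = ising_top_eigenvalue(beta)            # ~ e^{2 beta}, eq. (25)
print(signal, np.exp(2 * beta))

# retarded-learning threshold (26): the signal eigenvalue e^{2 beta}
# crosses the MP edge (1 + sqrt(r))^2 exactly at beta = log(1 + sqrt(r))
beta_c = np.log(1 + np.sqrt(r))
print(np.exp(2 * beta_c), (1 + np.sqrt(r)) ** 2)   # the two sides coincide
```

The geometric-series truncation on a finite ring is exponentially small, so the row-sum eigenvalue agrees with $e^{2\beta}$ to high accuracy already at $N=200$.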

Figure 12(c) shows how the top eigenvector of the data correlation matrix changes as more and more samples are considered. One clearly sees a phase transition from a random vector to the uniform eigenvector.

#### v.1.3 Case of XY data

For the discretized XY model, the correlation matrix in the perfect-sampling limit $M\to\infty$ can be computed as well, using the transfer matrix formalism. We find

$$C_{ij}(v,v')=\frac{1}{P-1}\sum_{p=1}^{P-1}\left(\frac{\lambda_p(\beta)}{\lambda_0(\beta)}\right)^{|j-i|}\cos\left(\frac{2\pi p\,(v-v')}{P}\right)\;, \qquad (27)$$

where

$$\lambda_p(\beta)=\sum_{v=0}^{P-1}\exp\left[\beta\cos\left(\frac{2\pi v}{P}\right)\right]\cos\left(\frac{2\pi v\,p}{P}\right)\;. \qquad (28)$$

$C$ enjoys translational invariance along both axes, hence its eigenvectors are discrete two-dimensional Fourier modes; after some computation, we find that the top eigenvalue is

$$\Lambda_{XY}(\beta)=\frac{P}{P-1}\;\frac{\lambda_0(\beta)+\lambda_1(\beta)}{\lambda_0(\beta)-\lambda_1(\beta)}\;, \qquad (29)$$

with a corresponding eigenspace of dimension 2, spanned by the two $p=1$ modes $\cos(2\pi v/P)$ and $\sin(2\pi v/P)$, uniform over the sites $i$. The top eigenvectors are uniform over space, as for the Ising model, but not over the angular variables, see Fig. 13(c).

The 'noise' eigenvalue is similarly given by the MP spectrum, although slightly modified: the dimension-to-sample-size ratio is now $rP$, and, in the perfect-sampling limit, the correlation matrix of the null model has top eigenvalue $P/(P-1)$, owing to the anticorrelations between Potts variables on the same site (exactly one Potts state is occupied on each site). We obtain:

$$\Lambda_{\rm noise}=\frac{P}{P-1}+\big(1+\sqrt{rP}\big)^2\;. \qquad (30)$$

Similarly to the case of Ising data, when $\beta$ is small, the 'signal' eigenvalue is small compared to the 'noise' eigenvalue, and the empirical top eigenvector has a small projection on the two-dimensional eigenspace described above, see Fig. 13(b,c,d). The crossover between the two regimes takes place at values of $r$ such that $\Lambda_{XY}(\beta)=\Lambda_{\rm noise}$. The first retarded-learning transition of the RBM occurs in the same range of $r$, see Fig. 11.
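The signal-versus-noise comparison of eqs. (28)-(30) can be evaluated directly; the sketch below locates, for a fixed $\beta$, which side of the crossover a given noise ratio $r$ falls on (parameter values are ours):

```python
import numpy as np

def lambda_p(beta, p, P):
    """Eq. (28): Fourier eigenvalues of the XY transfer matrix."""
    v = np.arange(P)
    return np.sum(np.exp(beta * np.cos(2 * np.pi * v / P))
                  * np.cos(2 * np.pi * v * p / P))

def signal_eigenvalue(beta, P):
    """Eq. (29): top eigenvalue of the model correlation matrix."""
    l0, l1 = lambda_p(beta, 0, P), lambda_p(beta, 1, P)
    return P / (P - 1) * (l0 + l1) / (l0 - l1)

def noise_eigenvalue(r, P):
    """Eq. (30): MP-like noise eigenvalue for P-state Potts data."""
    return P / (P - 1) + (1 + np.sqrt(r * P)) ** 2

P, beta = 10, 1.5
for r in (0.01, 0.1, 1.0):
    s, n = signal_eigenvalue(beta, P), noise_eigenvalue(r, P)
    print(f"r = {r}: signal {s:.2f} vs noise {n:.2f} ->",
          "signal wins" if s > n else "noise wins")
```

Scanning $r$ at fixed $\beta$ (or vice versa) locates the crossover $\Lambda_{XY}(\beta)=\Lambda_{\rm noise}$ numerically.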

### v.2 Many data: Small β expansion

After some training time, the linear equation (22) for the weights breaks down and nonlinearities must be taken into account decelle. We derive below an approximation to the RBM learning dynamics (with one or two hidden units) for the one-dimensional Ising model, which is exact for small (but non-vanishing) inverse temperature $\beta$. We show that this equation is free of any external parameter after appropriate rescaling of the weights. We compare the numerical solutions of this equation with the results of RBM training and find parameter-independent agreement in the shape and structure of the weights. We also cast the equation into a continuous form, and map the system onto a standard reaction-diffusion instability problem, with the weights acting as the inducer and the sum of squared weights as the repressor.

#### v.2.1 One Hidden Unit System: formation of receptive field

For one hidden unit, equations (20,21) become, after some elementary manipulation,

$$\frac{\partial\log L}{\partial w_i}=\Big\langle v_i\tanh\Big(\sum_{j=1}^{N}w_j v_j\Big)\Big\rangle_{\rm data}-\tanh w_i\;, \qquad (31)$$

where we have dropped the hidden-unit index $\mu$ for the sake of clarity. Expanding the hyperbolic tangents to the third power of their arguments, we obtain

$$\frac{\partial\log L}{\partial w_i}=\sum_j\langle v_i v_j\rangle_{\rm data}\,w_j-\frac{1}{3}\sum_{j,k,l}\langle v_i v_j v_k v_l\rangle_{\rm data}\,w_j w_k w_l-w_i+\frac{1}{3}w_i^3+O(w^4)\;. \qquad (32)$$

Let us now assume that a large number of samples is available. At the lowest order in $\beta$, we have

$$\langle v_i v_j\rangle=\begin{cases}1&\text{if }i=j\;,\\ \beta&\text{if }i=j\pm1\;,\\ 0&\text{otherwise}\;,\end{cases} \qquad (33)$$

and

$$\langle v_i v_j v_k v_l\rangle=\begin{cases}1&\text{if }i=j,\ k=l\text{ or any permutation}\;,\\ \beta&\text{if }i=j\pm1,\ k=l\text{ or any permutation}\;,\\ \beta&\text{if }i=j,\ k=l\pm1\text{ or any permutation}\;,\\ 0&\text{otherwise}\;,\end{cases} \qquad (34)$$

for, respectively, the 2- and 4-point correlations. We therefore obtain

$$\frac{\partial\log L}{\partial w_i}=\beta\,(w_{i+1}+w_{i-1})-w_i\sum_k w_k^2+w_i^3+O(w^4,\beta w^3)\;. \qquad (35)$$

Upon appropriate rescaling of the weights, $w_i\to w_i/\sqrt{\beta}$, and of the learning rate, $\nu\to\beta\,\nu$, we obtain, in the small-$\beta$ regime, the non-trivial, parameter-free dynamical equation

$$\frac{1}{\nu}\frac{dw_i}{dt}=w_{i+1}+w_{i-1}-w_i\sum_k w_k^2+w_i^3\;. \qquad (36)$$

The stationary solution of this equation is shown in Fig. 14(a).
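A quick way to reproduce such a stationary profile is to integrate (36) by explicit Euler steps on a ring. In the sketch below (ring size and step size are ours), the initial bump's height and width are chosen close to the stationary profile derived in the next subsection, and the squared norm $b=\sum_i w_i^2$ relaxes to a value near $8-4\sqrt{2}\approx2.34$:

```python
import numpy as np

N, dt, steps = 100, 0.01, 5000
b_star = 8 - 4 * np.sqrt(2)      # stable value of the squared norm, derived below
x = np.arange(N) - 50.0
w = np.sqrt(2 * (b_star - 2)) / np.cosh(x * np.sqrt(b_star - 2))

for _ in range(steps):
    lap = np.roll(w, 1) + np.roll(w, -1)         # w_{i+1} + w_{i-1} on the ring
    w += dt * (lap - w * np.sum(w ** 2) + w ** 3)

print(np.sum(w ** 2))            # stays close to 8 - 4*sqrt(2) ~ 2.34
```

The discrete lattice introduces small corrections to the continuum value, but the peak remains pinned and localized.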

This equation can be cast in a continuous form over space, where we use the Laplacian to describe spatial diffusion. The corresponding continuous partial differential equation reads

$$\frac{1}{\nu}\frac{\partial w}{\partial t}(x,t)=\frac{\partial^2 w}{\partial x^2}(x,t)+\big(2-b(t)\big)\,w(x,t)+w(x,t)^3\;, \qquad (37)$$

where

$$b(t)=\int_0^L w(x,t)^2\,dx\;. \qquad (38)$$

These coupled dynamical equations lead to non-trivial spatial pattern formation through the so-called Turing reaction-diffusion instability mechanism book1. The field $w$ diffuses over space and activates itself (self-promotion, through the cubic term), but is inhibited by another species, $b$. This repressor diffuses with an infinite diffusion coefficient, i.e. it is spatially uniform, and depends on $w$ through (38). As $w$ grows due to self-activation, so does the repressor $b$, until $w$ reaches a stationary profile. We show in Appendix A that the above dynamical equations satisfy the general criteria for stable pattern formation.

#### v.2.2 One hidden unit: Profile of the receptive field

Consider the stationary continuous equation satisfied by the weights after learning, see (37),

$$0=\frac{d^2w}{dx^2}(x)+(2-b)\,w(x)+w(x)^3\;. \qquad (39)$$

Multiplying both sides by $dw/dx$ and integrating over $x$, we obtain that

$$E(x)\equiv\frac{1}{2}\left(\frac{dw}{dx}\right)^2+w(x)^2+\frac{1}{4}w(x)^4-\frac{b}{2}\,w(x)^2 \qquad (40)$$

has a uniform value, independent of $x$. When $|x|\to\infty$, both $w$ and $dw/dx$ tend to 0, which sets $E=0$. We deduce that

$$\frac{dw}{dx}(x)=\pm\,w(x)\sqrt{b-2-\frac{w(x)^2}{2}}\;. \qquad (41)$$

We now explicitly break the symmetry by fixing the centre of the peak of the weights at $x=0$, choosing the $+$ sign in (41) for $x<0$ and the $-$ sign for $x>0$. Imposing that the derivative of the weights with respect to $x$ vanishes at the maximum, i.e. that $w$ is twice differentiable at $x=0$, gives

$$w(0)=\sqrt{2(b-2)}\;. \qquad (42)$$

Integrating (41) with condition (42), we find

$$w(x)=\frac{\sqrt{2(b-2)}}{\cosh\big(x\sqrt{b-2}\big)}\;. \qquad (43)$$
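As a sanity check, one can verify numerically that the profile (43) solves the stationary equation (39); at this stage any $b>2$ works, and the finite-difference grid below is an arbitrary choice of ours:

```python
import numpy as np

# Finite-difference check that the profile (43) solves the stationary
# equation (39); b = 3 is chosen arbitrarily among the values b > 2.
b, h = 3.0, 1e-3
x = np.arange(-10.0, 10.0, h)
w = np.sqrt(2 * (b - 2)) / np.cosh(x * np.sqrt(b - 2))

w_xx = (w[2:] - 2 * w[1:-1] + w[:-2]) / h**2      # second derivative of w
residual = w_xx + (2 - b) * w[1:-1] + w[1:-1] ** 3
print(np.max(np.abs(residual)))                   # vanishes up to O(h^2) error
```

The maximum of the profile also matches the boundary condition (42), $w(0)=\sqrt{2(b-2)}$.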

Using definition (38) for $b$, we then find

$$b=\int_{-\infty}^{\infty}w(x)^2\,dx=4\sqrt{b-2}\;, \qquad (44)$$

whose solutions are $b=8\pm4\sqrt{2}$. The corresponding profiles of the weights are shown in Fig. 14(c). We now study the stability of the solution under a time-dependent perturbation, $w(x)\to w(x)+\epsilon(x,t)$, where $w(x)$ is given by (43). According to equations (37) and (39), we have, to first order in $\epsilon$,

$$\frac{1}{\nu}\frac{\partial\epsilon}{\partial t}(x,t)=\frac{\partial^2\epsilon}{\partial x^2}(x,t)+(2-b)\,\epsilon(x,t)-2\left(\int dy\,w(y)\,\epsilon(y,t)\right)w(x)+3\,w(x)^2\,\epsilon(x,t)\;. \qquad (45)$$

Multiplying by $w(x)$ and integrating over $x$, we get the following equation:

$$\frac{1}{\nu}\frac{d}{dt}\int dx\,w(x)\,\epsilon(x,t)=-\sqrt{b-2}\int dx\,w(x)\,\epsilon(x,t)\left(8-\frac{b}{\cosh^2\big(x\sqrt{b-2}\big)}\right)\;. \qquad (46)$$

We deduce that the weight profile is stable if and only if $b<8$. Therefore, the solution $b=8+4\sqrt{2}$ is unstable against small variations of the peak amplitude near $x=0$, and the solution $b=8-4\sqrt{2}$ is the correct, stable one. Notice that the width of the peak of the weights, $1/\sqrt{b-2}$ according to expression (43), remains finite in the small-$\beta$ limit. This phenomenon was also observed in the RBM training results in Fig. 3(b), where the peak width obtained by linear fit was also positive and finite.
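The two roots of (44) and the stability criterion can be checked with a few lines (a sketch; `np.roots` is applied to the squared form $b^2-16b+32=0$ obtained from $b=4\sqrt{b-2}$):

```python
import numpy as np

# Fixed-point condition (44): b = 4*sqrt(b-2). Squaring gives
# b^2 - 16 b + 32 = 0, whose roots are b = 8 +- 4*sqrt(2); both satisfy
# the original (unsquared) condition, and only the root below 8 is stable.
roots = np.sort(np.roots([1.0, -16.0, 32.0]).real)
print(roots)                                    # [2.3431..., 13.6568...]

for b in roots:
    print(f"b = {b:.4f}: residual {b - 4 * np.sqrt(b - 2):.1e}, "
          f"stable (b < 8): {b < 8}")
```

Both roots solve $b=4\sqrt{b-2}$ exactly (residuals at machine precision), and the stability test singles out $b=8-4\sqrt{2}$.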

#### v.2.3 Two hidden units: interaction between receptive fields

For two hidden units, equations (20,21) become, after some simple manipulation,

$$\frac{\partial\log L}{\partial w_{i1}}=\Big\langle v_i\tanh\Big(\sum_{j=1}^{N}w_{j1}v_j\Big)\Big\rangle_{\rm data}-\frac{\tanh(w_{i1}+w_{i2})+\tanh(w_{i1}-w_{i2})\displaystyle\prod_{k=1}^{N}\frac{\cosh(w_{k1}-w_{k2})}{\cosh(w_{k1}+w_{k2})}}{1+\displaystyle\prod_{k=1}^{N}\frac{\cosh(w_{k1}-w_{k2})}{\cosh(w_{k1}+w_{k2})}}\;, \qquad (47)$$

together with a similar equation for the second weight vector, obtained by swapping the hidden-unit indices 1 and 2. Notice that this equation simplifies to (31) when the second weight vector is set to zero, i.e. when $w_{i2}=0$ for all visible units $i$, and the number of hidden units is effectively $M=1$.

Let us now expand (47) in powers of the weights. The first term on the right-hand side of the equation (involving the average over the data distribution) has the same expansion as in the $M=1$ case above, see (32). For the second term, using

$$1+\prod_{k=1}^{N}\frac{\cosh(w_{k1}-w_{k2})}{\cosh(w_{k1}+w_{k2})}=2-2\sum_{k=1}^{N}w_{k1}w_{k2}+O(w^4)\;, \qquad (48)$$
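The expansion (48) can be checked numerically by comparing both sides at decreasing weight amplitudes; the discrepancy should shrink roughly like the fourth power of the scale (sizes and seed are ours):

```python
import numpy as np

# Numerical check of the quadratic approximation (48): the error of the
# right-hand side shrinks ~ eps^4 as the weight scale eps decreases.
rng = np.random.default_rng(0)
u1, u2 = rng.standard_normal(20), rng.standard_normal(20)

def expansion_error(eps):
    w1, w2 = eps * u1, eps * u2
    lhs = 1 + np.prod(np.cosh(w1 - w2) / np.cosh(w1 + w2))
    rhs = 2 - 2 * np.sum(w1 * w2)
    return abs(lhs - rhs)

for eps in (0.1, 0.05, 0.025):
    print(eps, expansion_error(eps))
```

Halving the scale should reduce the error by roughly a factor 16, confirming that the neglected terms are $O(w^4)$.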

and rescaling the weights ($w_{i\mu}\to w_{i\mu}/\sqrt{\beta}$) and the learning rate ($\nu\to\beta\,\nu$) as before, we obtain

$$\frac{1}{\nu}\frac{dw_{i1}}{dt}=w_{i+1,1}+w_{i-1,1}-w_{i1}\sum_k w_{k1}^2+w_{i1}^3+w_{i1}w_{i2}^2-w_{i2}\sum_k w_{k1}w_{k2}\;. \qquad (49)$$

Similarly, we find

$$\frac{1}{\nu}\frac{dw_{i2}}{dt}=w_{i+1,2}+w_{i-1,2}-w_{i2}\sum_k w_{k2}^2+w_{i2}^3+w_{i2}w_{i1}^2-w_{i1}\sum_k w_{k1}w_{k2}\;. \qquad (50)$$
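The coupled dynamics for the two weight vectors can be integrated by explicit Euler steps, as in the one-hidden-unit case. In the sketch below (ring size, step size, and initial noise are ours, and the outcome from random initial conditions may vary), the cross terms $w_{i1}w_{i2}^2-w_{i2}\sum_k w_{k1}w_{k2}$ couple the two receptive fields through their overlap:

```python
import numpy as np

# Euler integration of (49) and its mirror for two hidden units on a ring,
# starting from small random weights; the overlap term couples the fields.
rng = np.random.default_rng(3)
N, dt, steps = 100, 0.01, 5000
w1 = 0.01 * rng.standard_normal(N)
w2 = 0.01 * rng.standard_normal(N)

for _ in range(steps):
    S = np.sum(w1 * w2)                        # overlap sum_k w_k1 w_k2
    d1 = (np.roll(w1, 1) + np.roll(w1, -1) - w1 * np.sum(w1 ** 2) + w1 ** 3
          + w1 * w2 ** 2 - w2 * S)
    d2 = (np.roll(w2, 1) + np.roll(w2, -1) - w2 * np.sum(w2 ** 2) + w2 ** 3
          + w2 * w1 ** 2 - w1 * S)
    w1 += dt * d1
    w2 += dt * d2

# peak positions of the two receptive fields after training
print(np.argmax(np.abs(w1)), np.argmax(np.abs(w2)))
```

Inspecting the final profiles shows how the two place fields arrange themselves with respect to each other on the ring.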