Statistical mechanics of complex neural systems and high dimensional data

01/30/2013
by Madhu Advani, et al.
Stanford University

Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? And second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction, and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.


1 Introduction

Neuronal networks are highly complex dynamical systems consisting of large numbers of neurons interacting through synapses [1, 2, 3]. Such networks subserve dynamics over multiple timescales. For example, on fast timescales, on the order of milliseconds, synaptic connectivity is approximately constant, and this connectivity directs the flow of electrical activity through neurons. On slower timescales, on the order of seconds to minutes and beyond, the synaptic connectivity itself can change through synaptic plasticity induced by the statistical structure of experience, which itself can stay constant over even longer timescales. These synaptic changes are thought to underlie our ability to learn from experience. To the extent that such separations of timescale hold, one can exploit powerful tools from the statistical physics of disordered systems to obtain a remarkably precise understanding of neuronal dynamics and synaptic learning in basic models. For example, the replica method and the cavity method, which we introduce and review below, become relevant because they allow us to understand the statistical properties of many interacting degrees of freedom that are coupled to each other through fixed, or quenched, interactions that may be highly heterogeneous, or disordered.

However, such networks of neurons and synapses, as well as the dynamical processes that occur on them, are not simply tangled webs of complexity that exist for their own sake. Instead they have been sculpted over time, through the processes of evolution, learning and adaptation, to solve important computational problems necessary for survival. Thus biological neuronal networks serve a function that is useful for an organism in terms of improving its evolutionary fitness. The very concept of function does not of course arise in statistical physics, as large disordered statistical mechanical systems, like glasses or non-biological polymers, do not arise through evolutionary processes. In general, the function that a biological network performs (which may not always be a priori obvious) can provide a powerful way to understand both its structure and the details of its complex dynamics [4]. As the functions performed by neuronal networks are often computational in nature, it can be useful to turn to ideas from distributed computing algorithms in computer science for sources of insight into how networks of neurons may learn and compute in a distributed manner. In this review we focus in particular on distributed message passing algorithms whose goal is to compute the marginal probability distribution of a single degree of freedom in a large interacting system. Many problems in computer science, including error correcting codes and constraint satisfaction, can be formulated as message passing problems [5]. As we shall review below, message passing is intimately related to the replica and cavity methods of statistical physics, and can serve as a framework for thinking about how specific dynamical processes of neuronal plasticity and network dynamics may solve computational problems like learning and inference.

This combination of ideas from statistical physics and computer science is useful not only for thinking about how network dynamics and plasticity may mediate computation, but also for thinking about ways to analyze large scale datasets arising from high throughput experiments in neuroscience. Consider a data set consisting of $P$ points in an $N$ dimensional feature space. Much of the edifice of classical statistics and machine learning has been tailored to the situation in which $P$ is large and $N$ is small. This is the low dimensional data scenario in which we have large amounts of data. In such situations, many classical unsupervised machine learning algorithms can easily find structures or patterns in data, when they exist. However, the advent of high throughput techniques in neuroscience has pushed us into a high dimensional data scenario in which both $N$ and $P$ are large, but their ratio $N/P$ is O(1). For example, we can simultaneously measure the activity of a large number $N$ of neurons, but often only under a limited number $P$ of trials (so that $P$ is comparable to $N$) for any given experimental condition. Also, we can measure the single cell expression levels of a large number of genes, but only in a limited number of cells. In such a high dimensional scenario, it can be difficult to find statistically significant patterns in the data, as classical unsupervised machine learning algorithms often yield illusory structures. The statistical physics of disordered systems again provides a powerful tool to understand high dimensional data, because many machine learning algorithms can be formulated as the minimization of a data dependent energy function over a set of parameters. We review below how statistical physics plays a useful role in understanding possible illusions of structure in high dimensional data, as well as approaches like random projections and compressed sensing, which are tailored to the high dimensional data limit.
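
To make the notion of illusory structure concrete, here is a minimal sketch (assuming Python with NumPy; not code from the paper): the eigenvalues of the empirical covariance of pure i.i.d. noise with $N/P$ of order one spread over a wide range, even though the true covariance is the identity, so the leading principal components can masquerade as structure. The spread is described by the Marcenko-Pastur distribution reviewed in section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 500, 1000                      # dimensionality and number of samples, N/P = 0.5
X = rng.standard_normal((P, N))       # pure noise: true covariance is the identity
C = X.T @ X / P                       # empirical covariance matrix
evals = np.linalg.eigvalsh(C)

alpha = N / P
lam_lo, lam_hi = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
print(f"empirical eigenvalues span [{evals.min():.2f}, {evals.max():.2f}]")
print(f"Marcenko-Pastur bulk edges:  [{lam_lo:.2f}, {lam_hi:.2f}]")
```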

We give an outline and summary of this review as follows. In section 2 we introduce the fundamental techniques of the replica method and cavity method within the context of a paradigmatic example, the Sherrington-Kirkpatrick (SK) model [6] of a spin glass [7, 8, 9]. In a neuronal network interpretation, such a system qualitatively models a large network in which the heterogeneous synaptic connectivity is fixed and plays the role of quenched disorder. On the other hand, neuronal activity can fluctuate, and we are interested in understanding the statistical properties of the neuronal activity. We will find that certain statistical properties, termed self-averaging properties, do not depend on the detailed realization of the disordered connectivity matrix. This is a recurring theme in these notes: in large random systems with microscopic heterogeneity, striking levels of almost deterministic macroscopic order can arise in ways that do not depend on the details of the heterogeneity. Such order can govern dynamics and learning in neuronal networks, as well as the performance of machine learning algorithms in analyzing data, and moreover, this order can be understood theoretically through the replica and cavity methods.

We end section 2 by introducing message passing, which provides an algorithmic perspective on the replica and cavity methods. Many models in equilibrium statistical physics are essentially equivalent to joint probability distributions over many variables, objects known as graphical models in computer science [10]. Moreover, many computations in statistical physics involve computing marginal probabilities of a single variable in such graphical models. Message passing, also known in special cases as belief propagation [11], refers to a class of algorithms that yield dynamical systems whose fixed points are designed to approximate marginal probabilities in graphical models. Another recurring theme in these notes is that certain aspects of neuronal dynamics may profitably be viewed through the lens of message passing; in essence, these neuronal (and also synaptic) dynamics can be viewed as approximate versions of message passing in a suitably defined graphical model. This correspondence between neuronal dynamics and message passing allows for the possibility both of understanding the computational significance of existing neuronal dynamics, and of deriving hypotheses for new forms of neuronal dynamics from a computational perspective.

In section 3 we apply the ideas of replicas, cavities and messages introduced in section 2 to the problem of learning in neuronal networks as well as machine learning (see [12] for a beautiful book length review of this topic). In this context, training examples, or data, play the role of quenched disorder, and the synaptic weights of a network, or the learning parameters of a machine learning algorithm, play the role of fluctuating statistical mechanical degrees of freedom. In the zero temperature limit, these degrees of freedom are optimized, or learned, by minimizing an energy function. The learning error, as well as aspects of the learned structure, can be described by macroscopic order parameters that do not depend on the detailed realization of the training examples, or data. We show how to compute these order parameters for the classical perceptron [13, 14], thereby computing its storage capacity. We also compute these order parameters for classical learning algorithms, including Hebbian learning, principal components analysis (PCA), and K-means clustering, revealing that all of these algorithms are prone to discovering illusory structures that reliably arise in random realizations of high dimensional noise. Finally, we end section 3 by discussing an application of message passing to learning with binary valued synapses, known to be an NP-complete problem [15, 16]. The authors of [17, 18] derived a biologically plausible learning algorithm capable of solving random instantiations of this problem by approximating message passing in a joint probability distribution over synaptic weights determined by the training examples.

In section 4, we discuss the eigenvalue spectrum of random matrices. Matrices from many random matrix ensembles have eigenvalue spectra whose probability distributions display fascinating macroscopic structures that do not depend on the detailed realization of the matrix elements. These spectral distributions play a central role in a wide variety of fields [19, 20]; within the context of neural networks, for example, they play a role in understanding the stability of linear neural networks, the transition to chaos in nonlinear networks [21], and the analysis of high dimensional data. We begin section 4 by showing how replica theory can also provide a general framework for computing the typical eigenvalue distribution of a variety of random matrix ensembles. Then we focus on understanding an ensemble of random empirical covariance matrices (the Wishart ensemble [22]) whose eigenvalue distribution, known as the Marcenko-Pastur distribution [23], provides a null model for the outcome of PCA applied to high dimensional data. Moreover, we review how the eigenvalues of many random matrix ensembles can be thought of as Coulomb charges living in the complex plane, and the distribution of these eigenvalues can be thought of as the thermally equilibrated charge density of this Coulomb gas, which is stabilized via the competing effects of a repulsive two dimensional Coulomb interaction and an attractive confining external potential. We also review how the statistics of the largest eigenvalue, which obeys the Tracy-Widom distribution [24, 25], can be understood simply in terms of thermal fluctuations of this Coulomb gas [26, 27]. The statistics of this largest eigenvalue will make an appearance later in section 5 when we discuss how random projections distort the geometry of manifolds. Overall, section 4 illustrates the power of the replica formalism, and plays a role in connecting the statistical physics of two dimensional Coulomb gases to PCA in section 3.5 and geometric distortions induced by dimensionality reduction in section 5.3.

In section 5 we discuss the notion of random dimensionality reduction. High dimensional data can be difficult to both model and process. One approach to circumvent such difficulties is to reduce the dimensionality of the data; indeed many machine learning algorithms search for optimal directions on which to project the data. As discussed in section 3.5, such algorithms can yield projected data distributions that reveal low dimensional, illusory structures that do not exist in the data. An alternate approach is to simply project the data onto a random subspace. As the dimensionality of this subspace is lower than the ambient dimensionality of the feature space in which the data resides, features of the data will necessarily be lost. However, it is often the case that interesting data sets lie along low dimensional submanifolds in their ambient feature space. In such situations, a random projection above a critical dimension, one more closely related to the dimensionality of the submanifold than to the dimensionality of the ambient feature space, often preserves a surprising amount of the structure of the submanifold. In section 5 we review the theory of random projections and their ability to preserve the geometry of data submanifolds. We end section 5 by introducing a statistical mechanics approach to random dimensionality reduction of simple random submanifolds, like point clouds and hyperplanes. This analysis connects random dimensionality reduction to extremal fluctuations of 2D Coulomb gases discussed in sections 4.3 and 4.4.

The manifold of sparse signals forms a ubiquitous and interesting low dimensional structure that accurately captures many types of data. The field of compressed sensing [28, 29], discussed in section 6, rests upon the central observation that a sparse high dimensional signal can be recovered from a random projection down to a surprisingly low dimension by solving a computationally tractable convex optimization problem, known as L1 minimization. In section 6 we focus mainly on the analysis of L1 minimization based on statistical mechanics and message passing. For readers who are more interested in applications of random projections, compressed sensing and L1 minimization to neuronal information processing and data analysis, we refer them to [30]. There, diverse applications of how the techniques in sections 5 and 6 can be used to acquire and analyze high dimensional neuronal data are discussed, including magnetic resonance imaging [31, 32, 33], compressed gene expression arrays [34], compressed connectomics [35, 36], receptive field measurements, and fluorescence microscopy [37, 38] of multiple molecular species at high spatiotemporal resolution [39] using single pixel camera [40, 41] technology. Diverse applications of these same techniques to neuronal information processing are also discussed in [30], including semantic information processing [42, 43, 44], short-term memory [45, 46], neural circuits for L1 minimization [47], learning sparse representations [48, 49], regularized learning of high dimensional synaptic weights from limited examples [50], and axonally efficient long range brain communication through random projections [51, 52, 53, 54].

After introducing CS in section 6.1, we show how replica theory can be used to analyze its performance in section 6.2. Remarkably, the performance of CS, unlike that of the other algorithms discussed in section 3.5, displays a phase transition. For any given level of signal sparsity, there is a critical lower bound on the dimensionality of a random projection that is required to accurately recover the signal; this critical dimension decreases with increasing sparsity. Also, in section 6.3 we review how the L1 minimization problem can be formulated as a message passing problem [55]. This formulation yields a message passing dynamical system that qualitatively mimics neural network dynamics, but with a crucial history dependence term. L1 minimization via gradient descent has been proposed as a framework for neuronal dynamics underlying sparse coding in both vision [56] and olfaction [57]. On the other hand, the efficiency of message passing in solving L1 minimization, demonstrated in [55], may motivate revisiting the issue of sparse coding in neuroscience, and the role of history dependence in sparse coding network dynamics.

Finally, the appendix in section 8 provides an overview of the replica method, in a general form that is immediately applicable to spin glasses, perceptron learning, unsupervised learning, random matrices and compressed sensing. Overall, the replica method is a powerful, if non-rigorous, method for analyzing the statistical mechanics of systems with quenched disorder. We hope that this exposition of the replica method, combined with the cavity and message passing methods discussed in this review within a wide variety of disparate contexts, will help enable students and researchers in both theoretical neuroscience and physics to learn about exciting interdisciplinary advances made in the last few decades at the intersection of statistical physics, computer science, and neurobiology.

2 Spin Glass Models of Neural Networks

The SK Model [6] is a prototypical example of a disordered statistical mechanical system. It has been employed as a simple model of spin glasses [7, 8], as well as neural networks [58], and has made a recent resurgence in neuroscience within the context of maximum entropy modeling of spike trains [59, 60]. It is defined by the energy function

$$H(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} J_{ij} s_i s_j, \qquad (1)$$

where the $s_i$ are spin degrees of freedom taking the values $\pm 1$. In a neural network interpretation, $s_i$ represents the activity state of neuron $i$ and $J_{ij}$ is the synaptic connectivity matrix of the network. This Hamiltonian yields an equilibrium Gibbs distribution of neural activity given by

$$P(\mathbf{s}) = \frac{1}{Z} e^{-\beta H(\mathbf{s})}, \qquad (2)$$

where

$$Z = \sum_{\mathbf{s}} e^{-\beta H(\mathbf{s})} \qquad (3)$$

is the partition function, and $\beta$ is an inverse temperature reflecting sources of noise. The connectivity matrix is chosen to be random, where each $J_{ij}$ (for $i < j$, with $J_{ij} = J_{ji}$) is an independent, identically distributed (i.i.d.) zero mean Gaussian with variance $1/N$.
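
To make this concrete, the following is a minimal sketch (assuming Python with NumPy; not code from the paper) that draws a coupling matrix with these statistics and samples activity patterns from the Gibbs distribution (2) by Metropolis dynamics; the empirical self-overlap $(1/N)\sum_i \langle s_i \rangle^2$ of the sampled states is the kind of self-averaging quantity analyzed below.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 200, 1.2                      # network size and inverse temperature
A = rng.standard_normal((N, N))
J = (A + A.T) / np.sqrt(2 * N)          # symmetric couplings with variance 1/N
np.fill_diagonal(J, 0.0)

s = rng.choice([-1, 1], size=N)         # random initial activity pattern
m = np.zeros(N)                         # running estimate of <s_i>
n_sweeps, n_burn = 3000, 1000
for sweep in range(n_sweeps):
    for i in rng.permutation(N):
        dE = 2 * s[i] * (J[i] @ s)      # energy change from flipping spin i under (1)
        if dE < 0 or rng.random() < np.exp(-beta * dE):
            s[i] = -s[i]                # Metropolis accept/reject
    if sweep >= n_burn:
        m += s
m /= (n_sweeps - n_burn)
print("empirical self-overlap q =", np.mean(m ** 2))
```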

The main property of interest is the statistical structure of high probability (low energy) activity patterns. Much progress in spin glass theory [7] has revealed a physical picture in which the Gibbs distribution in (2) decomposes at low temperature (large $\beta$) into many “lumps” of probability mass (more rigorously, pure states [61]) concentrated on subsets of activity patterns. Equivalently, these lumps can be thought of as concentrated on the minima of a free energy landscape with many valleys. Each lump, indexed by $a$, is characterized by a mean activity pattern $\langle s_i \rangle_a$, where $\langle \cdot \rangle_a$ is an average over configurations belonging to the free energy valley $a$, and by a probability mass $W_a$ (the probability that a random activity pattern belongs to valley $a$). In the large $N$ limit, free energy barriers between valleys diverge, so that in dynamical versions of this model, if an activity pattern starts in one valley, it will stay in that valley for infinite time. Thus ergodicity is broken, as time averaged activity patterns are not equal to the full Gibbs average activity pattern. The network can thus maintain multiple steady states, and we are interested in understanding the structure of these steady states.

Now the detailed activity pattern in any free energy minimum (i.e. the mean pattern $\langle s_i \rangle_a$) depends on the detailed realization of the connectivity $J_{ij}$, and is hard to compute. However, many interesting quantities that involve averages over all neurons are self-averaging, which by definition means that their fluctuations across different realizations of $J_{ij}$ vanish in the large $N$ limit. As we see below, typical values of such quantities, for any given realization of $J_{ij}$, can be computed theoretically by computing their average over all $J_{ij}$. One interesting quantity that probes the geometry of free energy minima is the distribution of overlaps between all pairs of activity patterns. If the activity patterns belong to two valleys, $a$ and $b$, then the overlap is

$$Q_{ab} = \frac{1}{N} \sum_{i=1}^{N} \langle s_i \rangle_a \langle s_i \rangle_b. \qquad (4)$$

Now since $W_a$ is the probability that a randomly chosen activity pattern belongs to valley $a$, the distribution of overlaps between any two pairs of activity patterns independently chosen from (2) is given by

$$P(q) = \sum_{a,b} W_a W_b\, \delta(q - Q_{ab}). \qquad (5)$$

This distribution turns out not to be self-averaging (it fluctuates across realizations of $J_{ij}$), unless there is only one valley, or state (modulo the reflection symmetry $\mathbf{s} \to -\mathbf{s}$), in which case the distribution becomes concentrated at a single number $q$, which is the self-overlap of the state, $q = \frac{1}{N}\sum_i \langle s_i \rangle^2$. If there is indeed one state, then $q$ does not depend on the detailed realization of $J_{ij}$ and provides a measure of the variability of mean activity across neurons due to the quenched disorder in the connectivity. In the case of multiple valleys, one can also compute the disorder averaged overlap distribution $\langle\langle P(q) \rangle\rangle_J$; despite the fact that the overlap distribution may not be self-averaging, its average over $J$ can still yield a wealth of information about the geometric organization of free energy minima in neural activity space. This can be done using the replica method, which we now introduce.

2.1 Replica Solution

To understand the statistical properties of the Gibbs distribution in (2), it is useful to compute its free energy $F = -\frac{1}{\beta} \ln Z$. Correlations between neurons can then be computed via suitable derivatives of the free energy. Fortunately, the free energy is self-averaging, which means that to understand the free energy for any realization of $J_{ij}$, it suffices to compute its average over all $J_{ij}$:

$$\langle\langle F \rangle\rangle_J = -\frac{1}{\beta} \langle\langle \ln Z \rangle\rangle_J, \qquad (6)$$

where $\langle\langle \cdot \rangle\rangle_J$ denotes an average over the disorder $J_{ij}$. This average is difficult to do because the logarithm appears inside the average. The replica trick circumvents this difficulty by exploiting the identity

$$\ln Z = \lim_{n \to 0} \frac{Z^n - 1}{n}. \qquad (7)$$

This identity is useful because it allows us to first average over an integer power of $Z$, which can be done more easily, and then take the $n \to 0$ limit. Appendix 8 provides a general outline of the replica approach that can be used for many problems. Basically, to compute the average over $Z^n$ it is useful to introduce replicated neuronal activity patterns $\mathbf{s}^a$, for $a = 1, \dots, n$, yielding

$$\langle\langle Z^n \rangle\rangle_J = \left\langle\!\left\langle\, \sum_{\{\mathbf{s}^a\}} e^{-\beta \sum_{a=1}^{n} H(\mathbf{s}^a)} \right\rangle\!\right\rangle_J. \qquad (8)$$

Now the average over $J_{ij}$ can be performed because it is reduced to a set of Gaussian integrals. To do so, we use the fundamental identity

$$\left\langle e^{\lambda z} \right\rangle = e^{\frac{1}{2} \lambda^2 \sigma^2}, \qquad (9)$$

where $z$ is a zero mean Gaussian random variable with variance $\sigma^2$. Applying this to (8) with $z = J_{ij}$, $\sigma^2 = 1/N$, and $\lambda = \beta \sum_a s_i^a s_j^a$ yields

$$\langle\langle Z^n \rangle\rangle_J = \sum_{\{\mathbf{s}^a\}} e^{\frac{N \beta^2}{4} \sum_{a,b} Q_{ab}^2}, \qquad (10)$$

where

$$Q_{ab} = \frac{1}{N} \sum_{i=1}^{N} s_i^a s_i^b \qquad (11)$$

is the overlap matrix between replicated activity patterns.

Thus although for any fixed realization of the quenched disorder $J_{ij}$, the replicated activity patterns were independent, marginalizing over, or integrating out, the disorder introduces attractive interactions between the replicas. Consistent with the general framework presented in section 8, the interaction between replicas depends only on the overlap matrix $Q$, and we have, in the notation of (120), an energy function $E(Q) \propto -\sum_{a,b} Q_{ab}^2$. Thus minimization of this energy function promotes alignment of the replicas. The intuition is that for any fixed realization of $J_{ij}$, the replicas will prefer certain patterns. Which patterns are preferred will vary across realizations of $J_{ij}$. However, for any fixed realization of $J_{ij}$, the preferred set of patterns will be similar across replicas since the fluctuations of each replicated neuronal activity pattern are controlled by the same quenched connectivity $J_{ij}$. Thus even after averaging over $J_{ij}$, we expect this similarity to survive, and hence we expect average overlaps between replicas to be nonzero.

Figure 1: Probability lumps in free energy valleys. Schematic figures of the space of all possible neuronal or spin configurations (large circle) and the space of spin configurations with non-negligible probability under the Gibbs distribution in (2) (shaded areas). (A) At high temperature all spin configurations are explored by the Gibbs distribution. Thus two random spin configurations drawn from the Gibbs distribution will typically have a negligible inner product, and so the replica order parameter is $q = 0$. (B) The replica symmetric ansatz for a low temperature phase: the spins freeze into a small set of configurations (a free energy valley), which can differ from realization to realization of the connectivity $J_{ij}$. However, the inner product between two random spin configurations, and therefore also the replica order parameter, takes a nonzero value $q$ that does not depend on the realization of $J_{ij}$. (C) One possible ansatz for replica symmetry breaking (RSB) in which the replica overlap matrix is characterized by two order parameters, $q_0 < q_1$. This ansatz, known as 1-step RSB, corresponds to a scenario in which the Gibbs distribution breaks into multiple lumps, with $q_1$ describing the typical inner product between two configurations chosen from the same lump, and $q_0$ describing the typical inner product between configurations from different lumps. (D) There exists a series of k-step RSB schemes describing scenarios in which the Gibbs distribution decomposes into a nested hierarchy of lumps of depth $k$. This figure describes a possible scenario for $k = 2$. The true low temperature phase of the SK model is thought to be described by a particular RSB ansatz in the limit of an infinite number of RSB steps [7].

However, minimization of the energy alone does not determine the overlap matrix $Q$. One must still sum over $\{\mathbf{s}^a\}$ in (10), which yields an entropic term corresponding to the number of replicated activity patterns with a given set of overlaps. While energy minimization drives overlaps to be large, entropy maximization drives overlaps to be small, since there are many more replicated configurations with small, rather than large, overlaps. This competition between energy and entropy leads to a potentially nontrivial overlap matrix. After computing this entropic term, the most likely value of the overlap matrix can be computed via the saddle point method, yielding a set of self-consistent equations for $Q$ (a special case of (126), (127)):

$$Q_{ab} = \langle s^a s^b \rangle_{H_{\mathrm{eff}}}, \qquad (12)$$

where $\langle \cdot \rangle_{H_{\mathrm{eff}}}$ denotes an average with respect to the Gibbs distribution $P(s^1, \dots, s^n) \propto e^{-\beta H_{\mathrm{eff}}}$, with $H_{\mathrm{eff}} = -\frac{\beta}{2} \sum_{a \neq b} Q_{ab}\, s^a s^b$ an effective single-site Hamiltonian over the $n$ replicated spins.

Now the physical meaning of the saddle point replica overlap matrix is explained in 8.2; it is simply related to the disorder averaged overlap distribution:

$$\langle\langle P(q) \rangle\rangle_J = \lim_{n \to 0} \frac{2}{n(n-1)} \sum_{a < b} \delta(q - Q_{ab}), \qquad (13)$$

where $P(q)$ is given by (5). So the distribution of overlaps between pairs of free energy minima (weighted by their probability) is simply the distribution of off-diagonal matrix elements of the replica overlap matrix. Thus, in searching for solutions to (12), any ansatz about the structure of $Q_{ab}$ is implicitly an ansatz about the geometry and multiplicity of free energy valleys in (2), averaged over $J_{ij}$.

Now the effective Hamiltonian yielding the average in (12) is symmetric with respect to permutations of the replica indices (i.e. permuting the rows and columns of $Q$). Therefore it is natural to search for a replica symmetric saddle point in which $Q_{ab} = q$ for all $a \neq b$. This is equivalent to an assumption that there is only one free energy valley, and $q$ measures its heterogeneity. Taking the $n \to 0$ limit with this replica symmetric ansatz yields a saddle point equation for $q$ (see (142) for the derivation):

$$q = \int Dz \, \tanh^2\!\left( \beta \sqrt{q}\, z \right), \qquad (14)$$

where $Dz$ denotes the standard zero mean, unit variance Gaussian measure. At high temperature ($\beta < 1$), $q = 0$ is the only solution, representing a “paramagnetic” state (Fig. 1A) in which activity patterns fluctuate over all possible configurations, and the average neural activity is $\langle s_i \rangle = 0$ for all $i$. At lower temperature ($\beta > 1$), a nonzero solution rises continuously from $q = 0$, suggesting a phase transition to a “frozen” state corresponding to a single valley (Fig. 1B) in which each neuron has a different mean activity $\langle s_i \rangle \neq 0$.
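
As a quick numerical aside (a minimal sketch, not code from the paper), the replica symmetric saddle point equation (14) can be solved by simple fixed point iteration, with the Gaussian average approximated by Gauss-Hermite quadrature; the solution bifurcates away from $q = 0$ as $\beta$ crosses 1, as described above.

```python
import numpy as np

# Solve q = \int Dz tanh^2(beta * sqrt(q) * z) by fixed point iteration.
z, w = np.polynomial.hermite_e.hermegauss(101)   # nodes/weights for weight e^{-z^2/2}
w = w / np.sqrt(2 * np.pi)                       # normalize to the Gaussian measure Dz

def solve_q(beta, q0=0.5, n_iter=500):
    q = q0
    for _ in range(n_iter):
        q = np.sum(w * np.tanh(beta * np.sqrt(q) * z) ** 2)
    return q

for beta in [0.5, 0.9, 1.1, 1.5, 2.0]:
    print(f"beta = {beta:.1f}  ->  q = {solve_q(beta):.4f}")
```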

While this scenario seems plausible, a further analysis of this solution [6, 62] yields inconsistent physical predictions (like negative entropy for the system). Within the replica framework, this inconsistency can be detected by showing that the replica symmetric saddle point for $Q_{ab}$ is unstable [63], and so one must search for solutions in which $Q_{ab}$ breaks the replica symmetry. This corresponds to a physical picture in which there are many free energy minima. A great deal of work has led to a remarkably rich ansatz which predicts a nested hierarchical, tree-like organization of the space of free energy minima (see Fig. 1CD), known as an ultrametric structure [64]. It is striking that this highly symmetric and ordered low temperature hierarchical structure emerges generically from purely random, disordered couplings $J_{ij}$. Unfortunately, we will not explore this phenomenon further here, since for most of the applications of replica theory to neuronal processing and data analysis discussed below, a replica symmetric analysis turns out to be correct.

2.2 Chaos in the SK Model and the Hopfield Solution

So far, in order to introduce the replica method, we have analyzed a toy neuronal network with a random symmetric connectivity matrix $J_{ij}$, and found that such a network exhibits broken replica symmetry corresponding to a hierarchy of low energy states that are stable with respect to thermal or noise induced fluctuations. It is tempting to explore the possibility that this multiplicity of states may be useful for performing neural information processing tasks. However, several works have noted that while these states are stable with respect to thermal fluctuations, they are not structurally stable with respect to perturbations either to the inverse temperature $\beta$ or the connectivity matrix $J_{ij}$ [65, 66, 67]. Indeed very small changes to $\beta$ or $J_{ij}$ induce macroscopic changes in the location of energy minima in the space of neuronal activity patterns. This sensitive dependence of low energy activity patterns on either $\beta$ or $J_{ij}$ was called temperature or disorder chaos, respectively, in [66]. For neural information processing, it would be useful to instead have network connectivities whose noisy dynamics not only thermally stabilize a prescribed set of neuronal activity patterns, but do so in a manner that is structurally stable with respect to changes in either the connectivity or the level of noise.

An early proposal to do just this was the Hopfield model [68]. Suppose one wishes to find a network connectivity $J_{ij}$ that stabilizes a prescribed set of $N$-dimensional patterns $\xi^\mu$, for $\mu = 1, \dots, P$, where $\xi_i^\mu = \pm 1$. Hopfield’s proposal was to choose

$$J_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu. \qquad (15)$$

This choice reflects the outcome of a Hebbian learning rule [69] in which each synapse from neuron $j$ to neuron $i$ changes its synaptic weight by an amount proportional to the correlation between the activity of its presynaptic and postsynaptic neurons. When the activity pattern $\xi^\mu$ is imposed upon the network, this correlation is proportional to $\xi_i^\mu \xi_j^\mu$, and when all $P$ patterns are imposed upon the network in succession, the learned synaptic weights are given by (15).

This synaptic connectivity induces an equilibrium probability distribution over neuronal activity patterns through (2). Ideally, this distribution should have $2P$ free energy valleys, corresponding to lumps of probability mass located near the patterns $\xi^\mu$ and their reflections $-\xi^\mu$. If so, then when network activity is initialized to either a corrupted or partial version of one of the learned patterns $\xi^\mu$, the network will relax (under a dynamics whose stationary distribution is given by (2)) to the free energy valley corresponding to $\xi^\mu$. This relaxation process is often called pattern completion. Thus Hopfield’s prescription provided a unifying framework for thinking about learning and memory: the structure of past experience (i.e. the patterns $\xi^\mu$) is learned, or stored, in the network’s synaptic weights (i.e. through (15)), and subsequent network dynamics can be viewed as motion down a free energy landscape determined by the weights. If learning is successful, the minima of this free energy landscape correspond to past experiences, and the process of recalling past experience corresponds to completing partial or corrupted initial network activity patterns induced by current stimuli.
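
The following is a minimal sketch (assuming Python with NumPy; not code from the paper) of this pattern completion process: P random patterns are stored via the Hebbian rule (15), one stored pattern is corrupted, and the network is relaxed with zero temperature asynchronous dynamics in which each neuron aligns with its local field, a dynamics that descends the energy (1).

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 50                                  # storage saturation P/N = 0.05
xi = rng.choice([-1, 1], size=(P, N))            # random patterns xi^mu
J = (xi.T @ xi) / N                              # Hebbian weights, eq. (15)
np.fill_diagonal(J, 0.0)

s = xi[0].copy()
flip = rng.choice(N, size=N // 5, replace=False) # corrupt 20% of pattern 1
s[flip] *= -1

for _ in range(20):                              # zero temperature asynchronous updates
    for i in rng.permutation(N):
        s[i] = 1 if J[i] @ s >= 0 else -1

print("overlap with stored pattern:", (s @ xi[0]) / N)   # ~1 if recall succeeds
```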

A key issue then is storage capacity: how many patterns $P$ can a network of $N$ neurons store? This issue was addressed in [70, 71] via the replica method in the situation where the stored patterns are random and uncorrelated (each $\xi_i^\mu$ is chosen independently to be $+1$ or $-1$ with equal probability). These works extensively analyzed the properties of free energy valleys in the Gibbs distribution (2) with connectivity (15), as a function of the inverse temperature $\beta$ and the level of storage saturation $\alpha = P/N$. This problem fits the classic mold of disordered statistical physics, where the patterns $\xi^\mu$ play the role of quenched disorder, and neuronal activity patterns play the role of thermal degrees of freedom. In particular, the structure of free energy minima can be described by a collection of self-averaging order parameters $m^\mu = \frac{1}{N} \sum_i \xi_i^\mu \langle s_i \rangle$, denoting the overlap of neuronal activity with pattern $\xi^\mu$. Successful pattern completion is possible if there are free energy valleys such that the average of $m^\mu$ in each valley is large for one pattern and small for all the rest. These free energy valleys can be thought of as recall states. The replica method in [70, 71] yields a set of self-consistent equations for these averages. Solutions to the replica equations in which precisely one order parameter $m^\mu$ is large are found at low temperature only when $\alpha < \alpha_c \approx 0.138$. For $\alpha > \alpha_c$, the system is in a spin glass state with many free energy minima, none of which have a macroscopic overlap with any of the patterns (in the solutions to the replica equations, no average $m^\mu$ is O(1) as $N \to \infty$ with $\alpha$ held fixed). At such high levels of storage, the many patterns “confuse” the network, so that its low energy states do not look like any one pattern $\xi^\mu$. Indeed the free energy landscape of the Hopfield model as $\alpha$ becomes large behaves like the low temperature spin glass phase of the SK model discussed in the previous section.

Even for $\alpha < \alpha_c$, at low enough temperatures, spurious, metastable free energy valleys corresponding to mixtures of patterns can also arise. These mixture states are characterized by solutions to the replica equations in which the average $m^\mu$ is large for more than one $\mu$. However, as the temperature is increased, such mixture states melt away. This phenomenon illustrates a beneficial role for noise in associative memory operation. However, there is a tradeoff to melting away mixture states by increasing temperature, as the critical capacity $\alpha_c$ decreases with increasing temperature. Nevertheless, in summary, there is a robust region in the $\alpha$-$T$ phase plane, with $\alpha < \alpha_c$ and correspondingly low temperatures, in which the recall states dominate the free energy landscape over neural activity patterns, and the network can successfully operate as a pattern completion, or associative memory, device. Many important details about the phase diagram of free energy valleys as a function of $\alpha$ and $T$ can be found in [70, 71].

2.3 Cavity Method

We now return to an analysis of the SK model through an alternate method that sheds light on the physical meaning of the saddle point equation for the replica symmetric order parameter in (14), which may seem a bit obscure. In particular, we give an alternate derivation of (14) through the cavity method [7, 72] which provides considerable physical intuition for (14) by describing it as a self-consistency condition. In general, the cavity method, while indirect, can often provide intuition for the final results derived via more direct replica methods.

The starting point involves noting that the SK Hamiltonian (1) governing the fluctuations of $N$ neurons can be written as

$$H(s_1, \dots, s_N) = -s_1 h_1 + H_{\mathrm{cav}}(s_2, \dots, s_N), \qquad (16)$$

where

$$h_1 = \sum_{j=2}^{N} J_{1j} s_j \qquad (17)$$

is the local field acting on neuron 1, and

$$H_{\mathrm{cav}}(s_2, \dots, s_N) = -\frac{1}{2} \sum_{i \neq j \geq 2} J_{ij} s_i s_j \qquad (18)$$

is the Hamiltonian of the rest of the neurons $s_2, \dots, s_N$. Since $h_1$ is a sum of many terms, it is tempting to approximate its thermal fluctuations in the full system of $N$ neurons in (16) by a Gaussian distribution. However, such a Gaussian approximation is generally invalid because the individual terms $J_{1j} s_j$ are correlated with each other. One source of correlation arises from a common coupling of all the neurons $s_2, \dots, s_N$ to $s_1$. For example, because $s_j$ interacts with $s_1$ through the symmetric coupling $J_{1j}$, whenever $s_1 = +1$ (or $s_1 = -1$) this exerts a positive (or negative) effect on the combination $J_{1j} s_j$. Thus all the individual terms in (17) exhibit correlated fluctuations due to their common coupling to the fluctuating neuron $s_1$.

The key idea behind the cavity method is to consider not the distribution of the local field $h_1$ acting on neuron 1 in the full system of $N$ neurons in (16), but instead the distribution of $h_1$ in a “cavity system” of $N-1$ neurons obtained by removing $s_1$ from the system, thereby leaving a “cavity” (see Fig. 2AB). Then $h_1$ is known as the cavity field, or the field exerted on neuron 1 by all the others in the absence of neuron 1, and its distribution is given by that of (17) in a Gibbs distribution with respect to (18):

$$P_{\mathrm{cav}}(h_1) = \frac{1}{Z_{\mathrm{cav}}} \sum_{s_2, \dots, s_N} \delta\!\left( h_1 - \sum_{j=2}^{N} J_{1j} s_j \right) e^{-\beta H_{\mathrm{cav}}(s_2, \dots, s_N)}. \qquad (19)$$

The joint distribution of $s_1$ and its local field $h_1$ in the full system of $N$ spins can be written in terms of the cavity field distribution as follows:

$$P(s_1, h_1) = \frac{1}{Z_1}\, P_{\mathrm{cav}}(h_1)\, e^{\beta s_1 h_1}, \qquad (20)$$

where $Z_1$ is a normalization constant.

The advantage of writing the joint distribution of $s_1$ and $h_1$ in terms of the cavity field distribution is that one can now plausibly make a Gaussian approximation to $P_{\mathrm{cav}}(h_1)$, i.e. the distribution of (17) in the cavity system (18) of $N-1$ neurons in the absence of $s_1$. Because the cavity system does not couple to neuron 1, it does not know about the set of couplings $J_{1j}$, and therefore the thermal fluctuations of cavity activity patterns $s_2, \dots, s_N$, while of course correlated with each other, must be uncorrelated with the couplings $J_{1j}$, unlike the case of these same fluctuations in the presence of $s_1$. Motivated by this lack of correlation, we can make a Gaussian approximation to the thermal fluctuations of $h_1$ in the cavity system (18). Note that this does not imply that the local field $h_1$ in the full system is Gaussian. Indeed, if $P_{\mathrm{cav}}(h_1)$ in (20) is Gaussian, then $P(h_1)$ obtained by marginalizing out $s_1$ in $P(s_1, h_1)$ cannot be Gaussian; as discussed above, this non-gaussianity arises due to positive correlations between $s_1$ and $h_1$ induced by their coupling. The simplification in replacing the network with a fluctuating field is shown in the transition from Fig. 2B to 2C.

Figure 2: The cavity method. (A) A network of $N$ neurons, or spins. (B) A cavity surrounding a single neuron, $s_1$, that has been removed from the system. (C) In a replica symmetric approximation, the full distribution of the field $h_1$ exerted on the cavity (in the absence of $s_1$) by all other neurons can be approximated by a Gaussian distribution, while the joint distribution of $s_1$ and $h_1$ takes the form in equation (20).

Under a Gaussian approximation, $P_{\mathrm{cav}}(h_1)$ is characterized by its mean

$$\bar h_1 = \sum_{j=2}^{N} J_{1j} \langle s_j \rangle_{\mathrm{cav}} \qquad (21)$$

and variance

$$\sigma_1^2 = \sum_{j,k=2}^{N} J_{1j} J_{1k} \left( \langle s_j s_k \rangle_{\mathrm{cav}} - \langle s_j \rangle_{\mathrm{cav}} \langle s_k \rangle_{\mathrm{cav}} \right) \qquad (22)$$
$$\approx \sum_{j=2}^{N} J_{1j}^2 \left( 1 - \langle s_j \rangle_{\mathrm{cav}}^2 \right) \qquad (23)$$
$$\approx 1 - q, \qquad (24)$$

where $q$ is the order parameter

$$q = \frac{1}{N} \sum_{j} \langle s_j \rangle^2, \qquad (25)$$

$\langle \cdot \rangle_{\mathrm{cav}}$ denotes a thermal average in the cavity system, and we have used the fact that $J_{1j}^2$ is of order $1/N$. Here we have neglected various terms that vanish in the large $N$ limit, but most importantly, in going from (22) to (23), we have made a strong assumption that the connected correlation $\langle s_j s_k \rangle_{\mathrm{cav}} - \langle s_j \rangle_{\mathrm{cav}} \langle s_k \rangle_{\mathrm{cav}}$ vanishes in the large $N$ limit fast enough that we can neglect all off-diagonal terms in (22). This can be true if the cavity system (and consequently the full system) is accurately described by a single free energy valley. On the other hand, if the system is described by multiple free energy valleys, the connected correlation will receive contributions from fluctuations across valleys, and we cannot neglect the off-diagonal terms [7]. Thus the validity of this cavity approximation is tantamount to an assumption of replica symmetry, or a single valley in the free energy landscape. As discussed above, under the assumption of a single valley, we expect $q$ to be self-averaging: it does not depend on the detailed realization of $J_{ij}$ in the large $N$ limit. Finally, we note that the cavity method can be extended to scenarios in which replica symmetry is broken and there are multiple valleys [7].

In the replica symmetric scenario, under the Gaussian approximation to $P_{\mathrm{cav}}(h_1)$, (20) becomes

$$P(s_1, h_1) = \frac{1}{Z_1}\, e^{-\frac{(h_1 - \bar h_1)^2}{2 \sigma_1^2}}\, e^{\beta s_1 h_1}, \qquad (26)$$

allowing us to compute the mean activity of neuron 1 in the full system of $N$ neurons, in terms of the mean and variance of its cavity field,

$$\langle s_1 \rangle = \tanh(\beta \bar h_1). \qquad (27)$$

But now we must compute $\bar h_1$, which we can do by demanding self-consistency of the cavity approximation. First of all, we note that there was nothing special about neuron 1; the above procedure of forming a cavity system by removing a neuron could have been done with any neuron. Thus (26, 27) hold individually for all neurons $i$, and we can average these equations over all $i$ to obtain an expression for $q$:

$$q = \frac{1}{N} \sum_{i=1}^{N} \langle s_i \rangle^2 = \frac{1}{N} \sum_{i=1}^{N} \tanh^2(\beta \bar h_i). \qquad (28)$$

However, we do not yet know $\bar h_i$ for each $i$. For each $i$, $\bar h_i = \sum_{j \neq i} J_{ij} \langle s_j \rangle_{\mathrm{cav}}$ is a random variable due to the randomness of the couplings $J_{ij}$, which are uncorrelated with $\langle s_j \rangle_{\mathrm{cav}}$ by virtue of the fact that this thermal average occurs in a cavity system in the absence of neuron $i$. Thus we expect the distribution of $\bar h_i$ over random realizations of $J_{ij}$ to be Gaussian, with a mean and variance that are easily computed to be $0$ and $q$, respectively. Furthermore, we expect this distribution to be self-averaging: i.e. the distribution of $\bar h_i$ for a fixed $i$ across different realizations of $J_{ij}$ should be the same as the distribution of $\bar h_i$ across different neurons $i$ for a fixed realization of $J_{ij}$, in the large $N$ limit. Under this assumption, although we may not know each individual $\bar h_i$, we can replace the average over neurons in (28) with an average over a Gaussian distribution, yielding

$$q = \left\langle\!\left\langle\, \langle s \rangle^2 \,\right\rangle\!\right\rangle_z. \qquad (29)$$

Here $\langle\langle \cdot \rangle\rangle_z$ denotes a “quenched” average with respect to a zero mean, unit variance Gaussian variable $z$, reflecting the heterogeneity of the mean cavity field across neurons, and the thermal average $\langle s \rangle$ is computed via (26, 27); it reflects the thermal fluctuations of a single neuron in the presence of a cavity field with mean $\bar h = \sqrt{q}\, z$ and variance $\sigma^2 = 1 - q$.

Equation (29) is a self-consistent equation for the order parameter $q$, which is itself a measure of the heterogeneity of mean activity across neurons. So physically, (29) reflects a demand that the statistical properties of the cavity fields be consistent with the heterogeneity of mean neural activity. Now finally, since (27) gives $\langle s \rangle = \tanh(\beta \bar h)$ with $\bar h = \sqrt{q}\, z$, equation (29) becomes $q = \int Dz\, \tanh^2(\beta \sqrt{q}\, z)$, and we recover the self-consistent equation for $q$ in (14), derived via the replica method.

2.4 Message Passing

So far, we have seen two methods which allow us to calculate self-averaging quantities (for example $q$) that do not depend on the detailed realization of the disorder $J_{ij}$. However, we may wish to understand the detailed pattern of mean neural activity, i.e. $\langle s_i \rangle$ for all $i$, for some fixed realization of $J_{ij}$. Mathematically, this corresponds to computing the marginal distribution of a single neuron in the full joint distribution given by (2). Here we introduce efficient distributed message passing algorithms from computer science [5, 11, 10] that have been developed to compute such marginals in probability distributions which obey certain factorization properties.

Consider for example a joint distribution over $N$ variables $x_1, \dots, x_N$ that factorizes into a set of factors, or interactions, indexed by $a$:

$$P(x_1, \dots, x_N) = \frac{1}{Z} \prod_{a} \psi_a(\mathbf{x}_a). \qquad (30)$$

Here $x_i$ is any arbitrary variable that could be either continuous or discrete, and $\mathbf{x}_a$ denotes the collection of variables that factor $a$ depends on. Thus we systematically abuse notation and think of each factor index $a$ also as a subset of the variables, with variable $i \in a$ if and only if factor $\psi_a$ depends on $x_i$. The factorization properties of $P$ can be visualized in a factor graph, which is a bipartite graph whose nodes correspond either to variables $i$ or factors $a$, and there is an edge between a factor $a$ and a variable $i$ if and only if $i \in a$, or equivalently factor $\psi_a$ depends on $x_i$ (see Fig. 3A). For example, the SK model, or more generally any neural system with an equilibrium distribution, corresponds to a factor graph in which the neurons are the variables, and the factors correspond to nonzero synaptic weights connecting pairs of neurons. Thus each $a$ corresponds to a neuron pair $(i, j)$, and in the SK model of equation (2), $\psi_a(s_i, s_j) = e^{\beta J_{ij} s_i s_j}$.

Figure 3: Message Passing. (A) A factor graph in which variable nodes are represented by circles, and factor nodes are represented by squares. (B) The flow of messages involved in the update of the message $M_{a \to i}(x_i)$. (C) The message passing approximation to the joint distribution of the variables adjacent to a factor $a$. Here the interaction $\psi_a$ is treated exactly, while the effect of all other interactions besides $a$ is approximated by a product of messages. (D) Exact message passing in a chain; the marginal on a spin is computed exactly as a product of two messages.

The utility of the factor graph representation is that an iterative algorithm to compute the marginals,

$$P_i(x_i) = \sum_{\{x_j,\, j \neq i\}} P(x_1, \dots, x_N), \qquad (31)$$

for all $i$ can be visualized as the flow of messages along the factor graph (Fig. 3B). We first define this iterative algorithm and then later give justification for it. Every message is a probability distribution over a single variable, and at any given time there are two types of messages, one from variables to factors, and the other from factors to variables. We denote by $M_{i \to a}(x_i)$ the message from variable $i$ to factor $a$, and by $M_{a \to i}(x_i)$ the message from factor $a$ to variable $i$. Intuitively, we can think of $M_{i \to a}(x_i)$ as an approximation to the distribution on $x_i$ induced by all other interactions besides interaction $a$. In contrast, we can think of $M_{a \to i}(x_i)$ as an approximation to the distribution on $x_i$ induced by the direct influence of interaction $a$ alone. These messages will be used below to approximate the marginal of $x_i$ in the full joint distribution of all interactions (see e.g. (34)).

The (unnormalized) update equation for a factor to variable message is given by

$$M_{a \to i}(x_i) = \sum_{\{x_j,\, j \in a \setminus i\}} \psi_a(\mathbf{x}_a) \prod_{j \in a \setminus i} M_{j \to a}(x_j), \qquad (32)$$

where $a \setminus i$ denotes the set of all variables connected to factor node $a$ except variable $i$ (see Fig. 3B). Intuitively, the direct influence of $a$ alone on $x_i$ (the left hand side of (32)) is obtained by marginalizing out all variables other than $x_i$ in the factor $\psi_a$, supplemented by accounting for the effect of all of the other interactions besides $a$ on the variables $x_j$ by the product of messages $M_{j \to a}(x_j)$ (see Fig. 3B). The (unnormalized) update equation for the variable to factor messages is then given by

$$M_{i \to a}(x_i) = \prod_{b \ni i,\, b \neq a} M_{b \to i}(x_i). \qquad (33)$$

Intuitively, the distribution on $x_i$ induced by all other interactions besides interaction $a$ (the left hand side of (33)) is simply the product of the direct influences of all interactions that involve variable $i$, except for interaction $a$ (see Fig. 3B). Message passing involves randomly initializing all the messages and then iteratively running the update equations (32, 33) until convergence. One exception to the random initialization is the situation where a variable $i$ is connected to only a single factor node $a$. In this case, $M_{i \to a}(x_i)$ is initialized to be a uniform distribution over $x_i$, since in the absence of $a$, variable $i$ feels no influence from the rest of the graph. Under the message passing dynamics, $M_{i \to a}(x_i)$ will remain a uniform distribution. Now for general factor graphs, convergence is not guaranteed, but if the algorithm does converge, then the marginal distribution of a variable can be approximated via

$$P_i(x_i) \approx \frac{1}{Z_i} \prod_{a \ni i} M_{a \to i}(x_i), \qquad (34)$$

and indeed the joint distribution of the variables $\mathbf{x}_a$ adjacent to any factor $a$ can be approximated via

$$P(\mathbf{x}_a) \approx \frac{1}{Z_a}\, \psi_a(\mathbf{x}_a) \prod_{i \in a} M_{i \to a}(x_i). \qquad (35)$$

The update equations (32, 33), while intuitive, lead to two natural questions: for which factor graphs will they converge, and if they converge, how well will the fixed point messages $M_{i \to a}$ and $M_{a \to i}$ approximate the true marginals through equations (34, 35)? A key intuition arises from the structure of the approximation to the joint marginal of the variables $\mathbf{x}_a$ in (35) (see also Fig. 3C). This approximation treats the coupling of the variables through interaction $a$ by explicitly including the factor $\psi_a(\mathbf{x}_a)$. However, it approximates the effects of all other interactions on these variables by a simple product of messages $M_{i \to a}(x_i)$. An exactly analogous approximation is made in the update equation (32). Such approximations might be expected to work well whenever removing the interaction $a$ leads to a factor graph in which all the variables that were previously connected to $a$ are now weakly coupled (ideally independent) under all the remaining interactions.

This weak coupling assumption under the removal of a single interaction holds exactly whenever the factor graph is a tree, with no loops. Indeed in such a case, removing any one interaction $a$ removes all paths through the factor graph between the variables $\mathbf{x}_a$. In the absence of any such paths, all pairs of these variables are independent, and their joint distribution factorizes, consistent with the approximations made in (32) and (35). In general, whenever the factor graph is a tree, the message passing equations converge in finite time, and the fixed point messages yield the true marginals [11]. We will not give a general proof of this fact, but we will illustrate it in the case of a one dimensional Ising chain (see Fig. 3D). Consider a chain of $N$ spins $x_1, \dots, x_N$ with nearest-neighbor factors $\psi_j(x_j, x_{j+1})$, and the marginal distribution of a spin at position $i$ in the chain. This spin feels an interaction to its left and right, and so (34) tells us the marginal is a product of the two converged messages:

$$P_i(x_i) = \frac{1}{Z_i}\, M_{(i-1,i) \to i}(x_i)\; M_{(i,i+1) \to i}(x_i). \qquad (36)$$

Each of these two messages can be computed by iterating messages from either end of the chain to position $i$. For example, the rightward iteration for computing $M_{(i-1,i) \to i}(x_i)$ is

$$M_{j \to (j,j+1)}(x_j) = M_{(j-1,j) \to j}(x_j), \qquad M_{(j,j+1) \to j+1}(x_{j+1}) = \sum_{x_j} \psi_j(x_j, x_{j+1})\, M_{j \to (j,j+1)}(x_j), \qquad (37)$$

where the first equality is a special case of (33) and the second is a special case of (32). The first message in this iteration, $M_{1 \to (1,2)}(x_1)$, is initialized to be a uniform distribution, since spin $x_1$ is only connected to the single interaction $\psi_1(x_1, x_2)$. A similar leftward iteration leads to the calculation of $M_{(i,i+1) \to i}(x_i)$. Each iteration converges in an amount of time given by the path length from each corresponding end to $i$, and after convergence, we have

$$M_{(i-1,i) \to i}(x_i) \propto \sum_{x_1, \dots, x_{i-1}} \prod_{j=1}^{i-1} \psi_j(x_j, x_{j+1}), \qquad (38)$$
$$M_{(i,i+1) \to i}(x_i) \propto \sum_{x_{i+1}, \dots, x_N} \prod_{j=i}^{N-1} \psi_j(x_j, x_{j+1}). \qquad (39)$$

Inserting (38) and (39) into (36) yields the correct marginal for $x_i$, and the normalization factor $Z_i$ can be fixed at the end by demanding $\sum_{x_i} P_i(x_i) = 1$. Note that whereas a naive sum over all spin configurations to compute the marginal over $x_i$ would require $O(2^N)$ operations, this iterative procedure for computing the marginal requires only $O(N)$ operations. Moreover, two sweeps through the chain allow us to compute all the messages, and therefore all marginals, simultaneously, as (36) holds for all $i$. Overall, this method is essentially identical to the transfer matrix method for the Ising chain, and is a generalization of the Bethe approximation [73].
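
As a concrete check of this exactness (a minimal sketch, not code from the paper, assuming Python with NumPy), the following computes the single-site marginals of a short random-coupling Ising chain via the two message passing sweeps described above, and compares them against brute force enumeration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, beta = 10, 1.0
J = rng.standard_normal(N - 1)                           # chain couplings J_1 .. J_{N-1}
x = np.array([-1, 1])                                    # spin values
psi = [np.exp(beta * Jj * np.outer(x, x)) for Jj in J]   # 2x2 factor tables psi_j

right = [np.ones(2) / 2]                                 # message into site i from the left
for j in range(N - 1):
    m = psi[j].T @ right[j]                              # eq. (37): one rightward step
    right.append(m / m.sum())
left = [np.ones(2) / 2]                                  # message into site i from the right
for j in range(N - 2, -1, -1):
    m = psi[j] @ left[-1]
    left.append(m / m.sum())
left = left[::-1]

marginals = np.array([r * l / np.sum(r * l) for r, l in zip(right, left)])  # eq. (36)

# brute force check against exact enumeration of all 2^N configurations
states = np.array(np.meshgrid(*[x] * N)).reshape(N, -1).T
logw = np.array([beta * np.sum(J * s[:-1] * s[1:]) for s in states])
w = np.exp(logw - logw.max())
exact = np.array([[w[states[:, i] == v].sum() for v in x] for i in range(N)])
exact /= exact.sum(axis=1, keepdims=True)
print(np.max(np.abs(marginals - exact)))   # near machine precision: exact on a chain
```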

Although message passing is only exact on trees, it can nevertheless be applied to graphical models with loops, and as discussed above, it should yield good approximate marginals whenever the variables adjacent to a factor node are weakly correlated upon removal of that factor node. We will see successful examples of this in the contexts of learning in section 3.6 and compressed sensing in section 6.3. An early theoretical advance in partially justifying the application of message passing to graphical models with loops was a variational connection: each solution to the fixed point equations of message passing is in one to one correspondence with an extremum of a certain Bethe free energy [74], an approximation to the Gibbs free energy in variational approaches to inference in graphical models that is exact on trees (see [75] for a review). However, there are no known general and precise conditions under which message passing in graphical models with many loops is theoretically guaranteed to converge to messages that yield a good approximation to the marginals. Nevertheless, in practice, message passing achieves empirical success in approximating marginals when correlations between variables adjacent to a factor node are indeed weak after removal of that factor.

We conclude this section by connecting message passing back to the replica method. In general, suitable averages of the message passing equations reduce to both the cavity equations and the replica equations [5]. To illustrate this in the special case of the SK model, we outline the derivation of the replica saddle point equation (14) from the perspective of message passing. We first note that since every factor node in the SK model has degree 2, the update of a message from a factor $a = (ij)$ to variable $i$, i.e. $M_{(ij) \to i}(s_i)$, depends only on the message $M_{j \to (ij)}(s_j)$ through

$$M_{(ij) \to i}(s_i) = \sum_{s_j} e^{\beta J_{ij} s_i s_j}\, M_{j \to (ij)}(s_j), \qquad (40)$$

which is a special case of (32). Thus we can take one set of messages, for example the node to factor messages $M_{i \to (ij)}(s_i)$, as the essential degrees of freedom upon which the message passing dynamics operates. We simplify notation a little by letting $M_{i \to j}(s_i) \equiv M_{i \to (ij)}(s_i)$. Then the remaining message passing update (33) yields the dynamics

$$M^{t+1}_{i \to j}(s_i) \propto \prod_{k \in \partial i \setminus j}\, \sum_{s_k} e^{\beta J_{ik} s_i s_k}\, M^{t}_{k \to i}(s_k), \qquad (41)$$

where $k \in \partial i$ if and only if $s_k$ is coupled to $s_i$ through a nonzero $J_{ik}$.

Now each message is a distribution over a binary variable, and all such distributions can be usefully parameterized by a single scalar parameter:

$$M_{i \to j}(s_i) = \frac{e^{\beta h_{i \to j} s_i}}{2 \cosh(\beta h_{i \to j})}. \qquad (42)$$

Here the scalar parameter $h_{i \to j}$ can be thought of as a type of cavity field; as $t \to \infty$, if message passing is successful, $h_{i \to j}$ converges to the field exerted on spin $s_i$ by all the other spins in a cavity system in which the interaction $J_{ij}$ is removed. In terms of this parameterization, the message passing updates (41) yield a dynamical system on the cavity fields [76],

$$h^{t+1}_{i \to j} = \sum_{k \in \partial i \setminus j} u\!\left( J_{ik}, h^{t}_{k \to i} \right). \qquad (43)$$

Here $u(J, h)$ is defined implicitly through the relation

$$e^{\beta u(J, h)\, s} \propto \sum_{s' = \pm 1} e^{\beta J s s' + \beta h s'}. \qquad (44)$$

Physically, $u(J, h)$ is the effective field on a binary spin $s$ coupled with strength $J$ to another spin $s'$ that experiences an external field of strength $h$, after marginalizing out $s'$. Explicitly,

$$u(J, h) = \frac{1}{\beta} \tanh^{-1}\!\left[\, \tanh(\beta J) \tanh(\beta h) \,\right]. \qquad (45)$$

In the weak coupling limit of small $J$, $u(J, h) \approx J \tanh(\beta h)$, which reflects the simple approximation that the average magnetization of $s'$ due to the external field (which would be $\tanh(\beta h)$ if the back-reaction from $s$ could be neglected) exerts a field $J \tanh(\beta h)$ on $s$. The more complex form of $u(J, h)$ in (45) reflects the back-reaction of $s$ on $s'$ that becomes non-negligible at larger values of the bidirectional coupling $J$. In (43), the updated cavity field $h_{i \to j}$ turns out to be a simple sum over all the spins $s_k$ (besides $s_j$) of this same effective field, obtained by marginalizing out each $s_k$ in the presence of its own cavity field $h_{k \to i}$.
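
For concreteness, here is a minimal sketch (not code from the paper, assuming Python with NumPy) of the cavity field dynamics (43), (45) run on a single realization of the SK couplings; after convergence, the local magnetizations are read out as $\langle s_i \rangle = \tanh(\beta \sum_k u(J_{ik}, h_{k \to i}))$. With $\beta < 1$ the fixed point is paramagnetic and the estimated $q$ is close to zero, consistent with (14).

```python
import numpy as np

rng = np.random.default_rng(4)
N, beta = 200, 0.5                        # beta < 1: paramagnetic phase
A = rng.standard_normal((N, N))
J = (A + A.T) / np.sqrt(2 * N)            # symmetric couplings with variance 1/N
np.fill_diagonal(J, 0.0)

def u(J, h, beta):
    # effective field after marginalizing out a neighboring spin, eq. (45)
    return np.arctanh(np.tanh(beta * J) * np.tanh(beta * h)) / beta

h = 0.01 * rng.standard_normal((N, N))    # h[i, j] approximates the cavity field h_{i->j}
for _ in range(200):
    U = u(J, h.T, beta)                   # U[i, k] = u(J_ik, h_{k->i})
    h = U.sum(axis=1, keepdims=True) - U  # sum over k != j, eq. (43)

m = np.tanh(beta * u(J, h.T, beta).sum(axis=1))   # full local field includes all neighbors
print("message passing estimate of q:", np.mean(m ** 2))
```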

Using (43), we are now ready to derive (14). The key point is to consider self-consistency conditions for the distribution of cavity fields as $N \to \infty$. We can think of this distribution in two ways. First, for a fixed $i$ and $j$, $h_{i \to j}$ is a random variable due to the random choice of couplings $J_{ij}$. Second, for a fixed realization of $J_{ij}$, at a message passing fixed point, there is an empirical distribution of cavity fields across all choices of pairs $i$ and $j$. The assumption of self-averaging means that as $N \to \infty$, the latter empirical distribution converges to the distribution of the former random variable. In any case, we would like to write down a self-consistent equation for this distribution, by observing that this distribution must be self-reproducing under the update equation (43). More precisely, in (43), if the couplings $J_{ik}$ are drawn i.i.d. from a distribution $P(J)$, and the cavity fields $h_{k \to i}$ are drawn i.i.d. from a distribution $P(h)$, then the induced distribution on $h_{i \to j}$ should be identical to $P(h)$. This yields a recursive distributional equation characterizing the distribution of cavity fields at a message passing fixed point:

$$P(h) = \int \prod_{k=1}^{N-1} \left[ dJ_k\, P(J_k)\; dh_k\, P(h_k) \right] \delta\!\left( h - \sum_{k=1}^{N-1} u(J_k, h_k) \right). \qquad (46)$$

Here we have suppressed the arbitrary indices $i$ and $j$. More generally, one can track the time-dependent evolution of the distribution of cavity fields, an algorithmic analysis technique known as density evolution [5].
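
A standard way to handle such a distributional equation numerically (a minimal sketch, not from the paper, assuming Python with NumPy) is population dynamics: represent $P(h)$ by a large pool of samples and repeatedly regenerate each sample according to the right hand side of (46).

```python
import numpy as np

rng = np.random.default_rng(5)
beta, N, pool_size, n_iter = 1.5, 500, 5000, 50
h_pool = rng.standard_normal(pool_size)             # initial guess for samples of P(h)

for _ in range(n_iter):
    J = rng.standard_normal((pool_size, N)) / np.sqrt(N)          # couplings, variance 1/N
    h = h_pool[rng.integers(pool_size, size=(pool_size, N))]      # fields resampled from P(h)
    u = np.arctanh(np.tanh(beta * J) * np.tanh(beta * h)) / beta  # eq. (45)
    h_pool = u.sum(axis=1)                           # fresh samples of h via eq. (46)

q = np.mean(np.tanh(beta * h_pool) ** 2)             # order parameter, compare with eq. (14)
print(f"population dynamics at beta = {beta}: q = {q:.3f}")
```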

In general, it can be difficult to solve the distributional equation for . However, in the SK model, one could make an approximation that the distribution of cavity fields is a zero mean Gaussian with variance . Then the distributional equation for reduces to a self-consistency condition for by taking the expectation of on each side of (46). The left hand side is by definition . To simplify the right hand side, since the couplings have variance of , we can use the small coupling approximation . Then averaging on both sides of (46) yields

$q = \langle\!\langle h^2 \rangle\!\rangle \qquad (47)$
$\phantom{q} = \sum_{k=1}^{N-2} \langle\!\langle u(J_k, h_k)^2 \rangle\!\rangle \qquad (48)$
$\phantom{q} \approx \sum_{k=1}^{N-2} \langle\!\langle J_k^2 \rangle\!\rangle \, \langle\!\langle \tanh^2(\beta h_k) \rangle\!\rangle \qquad (49)$
$\phantom{q} = \langle\!\langle \tanh^2(\beta h) \rangle\!\rangle. \qquad (50)$

Here the cross terms vanish in (48) because $u(J, h)$ is odd in $J$ and the couplings have zero mean, and the last equality uses $\langle\!\langle J_k^2 \rangle\!\rangle = 1/N$ together with $N - 2 \approx N$ at large $N$.

Now, since we have assumed $h$ is a zero mean Gaussian with variance $q$, the right hand side of (50) is $\int Dz \, \tanh^2(\beta \sqrt{q}\, z)$, and this is equivalent to the replica symmetric saddle point equation (14).
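As a quick numerical illustration, this self-consistency condition can be solved by fixed-point iteration. The sketch below assumes the standard SK form $q = \int Dz \, \tanh^2(\beta \sqrt{q}\, z)$ with $z$ a standard Gaussian; if equation (14) carries different coupling-scale conventions, the prefactors would change accordingly.

```python
import numpy as np

def rs_saddle_q(beta, n_samples=200000, iters=100, seed=0):
    # iterate q <- E_z[ tanh^2(beta * sqrt(q) * z) ], z ~ N(0,1),
    # estimating the Gaussian average by Monte Carlo
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    q = 0.5
    for _ in range(iters):
        q = np.mean(np.tanh(beta * np.sqrt(q) * z) ** 2)
        if q < 1e-12:          # fixed point has collapsed to q = 0
            return 0.0
    return q

for beta in [0.5, 1.0, 1.5, 2.0]:
    print(f"beta = {beta:.1f}  ->  q = {rs_saddle_q(beta):.3f}")
```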

In summary, we have employed a toy model of a neural network, the SK spin glass model, to introduce the replica, cavity and message passing approaches to analyzing disordered statistical mechanical systems. In each case we have discussed in detail the simplest possible ansatz concerning the structure of the free energy landscape, namely the replica symmetric ansatz, corresponding to a single valley with weak connected correlations between degrees of freedom. While this assumption is not true for the SK model, the model nevertheless provides a good example system in which to gain familiarity with the various methods. And fortunately, for many of the applications discussed below, the assumption of a single free energy valley governing fluctuations will turn out to be correct. Finally, we note that just as the replica and cavity methods can be extended [7] to scenarios in which replica symmetry is broken, corresponding to many free energy valleys and long range correlations, so too can message passing approaches. Indeed, viewing optimization and inference problems through the lens of statistical physics has led to a new algorithm, known as survey propagation [77, 78], which can find good marginals, or minimize costs, in free energy landscapes characterized by many metastable minima that can confound more traditional, local algorithms.

3 Statistical Mechanics of Learning

In the above sections, we have reviewed powerful machinery designed to understand the statistical mechanics of fluctuating neural activity patterns in the presence of disordered synaptic connectivity matrices. A key conceptual advance made by Gardner [79, 80] was that this same machinery could be applied to the analysis of learning, by performing statistical mechanics directly on the space of synaptic connectivities, with the training examples presented to the system playing the role of quenched disorder. In this section we will explore this viewpoint and its applications to diverse phenomena in neural and unsupervised learning (see [12] for an extensive review of this topic).

3.1 Perceptron Learning

The perceptron is a simple neuronal model defined by a vector of $N$ synaptic weights $\mathbf{w}$, which linearly sums a pattern of incoming activity $\boldsymbol{\xi}$ and fires depending on whether or not the summed input is above a threshold. Mathematically, in the case of zero threshold, it computes the function $\sigma = \mathrm{sgn}(\mathbf{w} \cdot \boldsymbol{\xi})$, where $\sigma = +1$ represents the firing state and $\sigma = -1$ represents the quiescent state. Geometrically, it separates its input space into two classes, each on opposite sides of the $N-1$ dimensional hyperplane orthogonal to the weight vector $\mathbf{w}$. Since the absolute scale of the weight vector is not relevant to the problem, we will normalize the weights to satisfy $\mathbf{w} \cdot \mathbf{w} = N$, so that the set of perceptrons lives on an $N-1$ dimensional sphere.

Suppose we wish to train a perceptron to memorize a desired set of $P$ input-output associations, $\{\boldsymbol{\xi}^\mu, \sigma^\mu\}$ for $\mu = 1, \dots, P$. Doing so requires a learning rule (an algorithm for modifying the synaptic weights based on the inputs and outputs) that finds a set of synaptic weights $\mathbf{w}$ satisfying the inequalities

$\sigma^\mu \left( \mathbf{w} \cdot \boldsymbol{\xi}^\mu \right) > 0, \qquad \mu = 1, \dots, P. \qquad (51)$

We will see below that, as long as there exists a simultaneous solution to these inequalities, then remarkably, a learning rule known as the perceptron learning rule [13] can find a solution. The main remaining question is then: under what conditions on the training data does a solution to the inequalities exist?
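For concreteness, here is a minimal Python sketch of the classical perceptron learning rule applied to random associations; the pattern statistics and the load $\alpha = P/N$ chosen here are illustrative assumptions.

```python
import numpy as np

def perceptron_learn(xi, sigma, max_sweeps=1000):
    # classical perceptron rule: whenever example mu is misclassified
    # (sigma^mu * w . xi^mu <= 0), update w += sigma^mu * xi^mu
    P, N = xi.shape
    w = np.zeros(N)
    for _ in range(max_sweeps):
        errors = 0
        for mu in range(P):
            if sigma[mu] * (w @ xi[mu]) <= 0:
                w += sigma[mu] * xi[mu]
                errors += 1
        if errors == 0:
            return w, True
    return w, False

# toy example: random associations at a load alpha = P/N well below capacity
rng = np.random.default_rng(0)
N, alpha = 200, 0.5
P = int(alpha * N)
xi = rng.standard_normal((P, N))        # random input patterns
sigma = rng.choice([-1, 1], size=P)     # random desired outputs
w, ok = perceptron_learn(xi, sigma)
print("all inequalities (51) satisfied:", ok)
```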

A statistical mechanics based approach to answering this question involves defining an energy function on the $N-1$ dimensional sphere of perceptrons as follows,

$E(\mathbf{w}) = \sum_{\mu=1}^{P} V(\lambda^\mu), \qquad (52)$

where $\lambda^\mu = \sigma^\mu \, \mathbf{w} \cdot \boldsymbol{\xi}^\mu / \sqrt{N}$ is the alignment of example $\mu$ with the weight vector $\mathbf{w}$. Successfully memorizing all the patterns requires all alignments to be positive, so $V(\lambda)$ should be a potential that penalizes negative alignments and favors positive ones. Indeed, a wide variety of learning algorithms for the perceptron architecture can be formulated as gradient descent on $E(\mathbf{w})$ for various choices of the potential function $V$ in (52) [12]. However, if we are interested in probing the space of solutions to the inequalities (51), it is useful to take $V(\lambda) = \Theta(-\lambda)$, where $\Theta$ is the Heaviside function ($\Theta(x) = 1$ for $x > 0$, and $\Theta(x) = 0$ otherwise). With this choice, the energy function in (52) simply counts the number of misclassified examples, and so the Gibbs distribution

$P(\mathbf{w}) = \frac{1}{Z} \, e^{-\beta E(\mathbf{w})} \qquad (53)$

in the zero temperature ($\beta \to \infty$) limit becomes a uniform distribution on the space of perceptrons satisfying (51) (see Fig. 4).

Figure 4: Perceptron Learning. (A) The total sphere of all perceptron weights (grey circle) and a single example (black arrow). The blue region is the set of perceptron weights that yield the desired output on the example. (B) The same as (A), but for a different example. (C) The set of weights that yield the desired outputs on both examples in (A) and (B). (D) As more examples are added, the space of correct weights shrinks; its typical volume can be expressed in terms of the replica order parameter $q$ introduced in section 3.3.

Thus the volume of the space of solutions to (51), and in particular, whether or not it is nonzero, can be computed by analyzing the statistical mechanics of (53) in the zero temperature limit.
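As a crude numerical illustration of this volume, one can sample weight vectors uniformly from the sphere and count the fraction satisfying all the inequalities (51). The sketch below is an aside, not the replica computation described in the text; such brute-force estimates only work at small $P$, which is precisely why the statistical mechanics approach is needed at large $N$ and $P$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, samples = 20, 10, 200000
xi = rng.standard_normal((P, N))            # random examples
sigma = rng.choice([-1, 1], size=P)         # random desired outputs

# sample weight directions uniformly (only the direction of w matters)
w = rng.standard_normal((samples, N))
ok = ((sigma * (w @ xi.T)) > 0).all(axis=1) # all inequalities (51) satisfied?
print("estimated fractional volume of solution space:", ok.mean())
print("for comparison, the average over random data is 2^-P =", 2.0 ** -P)
```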

3.2 Unsupervised Learning

This same statistical mechanics formulation can be extended to more general unsupervised learning scenarios. In unsupervised learning, one often starts with a set of data vectors $\mathbf{x}^\mu$, for $\mu = 1, \dots, P$, where each vector is of dimension $N$. For example, each vector could be a pattern of expression of $N$ genes across $P$ experimental conditions, or a pattern of activity of $N$ neurons in response to $P$ stimuli. The overall goal of unsupervised learning is to find simple hidden structures or patterns in the data. The simplest approach is to find an interesting single dimension spanned by a vector $\mathbf{w}$, such that the projections $\lambda^\mu = \mathbf{w} \cdot \mathbf{x}^\mu / \sqrt{N}$ of the data onto this single dimension yield a useful one dimensional coordinate system for the data. This interesting dimension can often be defined by minimizing the energy function (52), with the choice of $V$ determining the particular unsupervised learning algorithm. One choice, $V(\lambda) = -\lambda$, corresponds to Hebbian learning. Upon minimization of $E(\mathbf{w})$, this choice leads to $\mathbf{w} \propto \sum_\mu \mathbf{x}^\mu$; i.e. $\mathbf{w}$ points in the direction of the center of mass of the data. In situations in which the data has its center of mass at the origin, a useful choice is $V(\lambda) = -\lambda^2$. Under this choice, $\mathbf{w}$ points in the direction of the eigenvector of maximal eigenvalue of the data covariance matrix $C = \frac{1}{P} \sum_{\mu=1}^{P} \mathbf{x}^\mu {\mathbf{x}^\mu}^{\mathrm{T}}$. This is the direction of maximal variance in the data, also known as the first principal component of the data; i.e. it is the direction which maximizes the variance of the distribution of $\lambda^\mu$ across data points $\mu$.
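The following Python sketch checks this numerically: projected gradient descent on $E(\mathbf{w}) = -\sum_\mu (\mathbf{w} \cdot \mathbf{x}^\mu)^2 / N$, with the weight norm held fixed, recovers the top eigenvector of the data covariance matrix. The synthetic data, step size, and unit-norm constraint used here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 500
# synthetic centered data with one dominant direction
true_dir = rng.standard_normal(N); true_dir /= np.linalg.norm(true_dir)
X = rng.standard_normal((P, N)) + 3.0 * rng.standard_normal((P, 1)) * true_dir
X -= X.mean(axis=0)                      # put the center of mass at the origin

# projected gradient descent on E(w) = -sum_mu (w . x^mu)^2 / N, |w| fixed
w = rng.standard_normal(N); w /= np.linalg.norm(w)
for _ in range(200):
    grad = -2.0 * X.T @ (X @ w) / N      # dE/dw
    w -= 0.01 * grad
    w /= np.linalg.norm(w)               # stay on the sphere

# compare with the top eigenvector of the data covariance matrix
C = X.T @ X / P
top = np.linalg.eigh(C)[1][:, -1]
print("overlap |w . v_max| =", abs(w @ top))
```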

Beyond finding an interesting dimension in the data, another unsupervised learning task is to find clusters in the data. A popular algorithm for doing so is $k$-means clustering. This is an iterative algorithm in which one maintains a guess about $k$ potential cluster centroids in the data, $\mathbf{w}^1, \dots, \mathbf{w}^k$. At each iteration of the algorithm, each cluster $a$ is defined to be the set of data points closer to centroid $\mathbf{w}^a$ than to any other centroid. Then all the cluster centroids are optimized by minimizing the sum of the distances from $\mathbf{w}^a$ to those data points assigned to cluster $a$. In the case where the distance measure is Euclidean distance, this step simply sets each centroid $\mathbf{w}^a$ to be the center of mass of the data points assigned to cluster $a$. The cluster assignments of the data are then recomputed with the new centroids, and the whole process repeats. The idea is that if there are well separated clusters in the data, this iterative procedure should converge so that each $\mathbf{w}^a$ is the center of mass of cluster $a$; a minimal implementation is sketched below.
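Here is a minimal Python sketch of this alternating procedure; the toy data and the choice $k = 2$ are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # plain k-means: assign each point to its nearest centroid, then move
    # each centroid to the center of mass of its assigned points
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for a in range(k):
            if np.any(labels == a):
                centroids[a] = X[labels == a].mean(axis=0)
    return centroids, labels

# toy data: two well separated Gaussian clusters in N dimensions
rng = np.random.default_rng(1)
N = 20
X = np.vstack([rng.standard_normal((100, N)) + 4.0,
               rng.standard_normal((100, N)) - 4.0])
centroids, labels = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels))
```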

For general $k$, this iterative procedure can be viewed as an alternating minimization of a joint energy function over cluster centroids and cluster membership assignments. For the special case of $k = 2$, and when both the data and cluster centroids are normalized to have norm $\sqrt{N}$, this energy function can be written as

$E(\mathbf{w}^1, \mathbf{w}^2) = \sum_{\mu=1}^{P} V(\lambda_1^\mu, \lambda_2^\mu), \qquad (54)$

where

$V(\lambda_1, \lambda_2) = -\max(\lambda_1, \lambda_2), \qquad (55)$

and,

$\lambda_1^\mu = \frac{\mathbf{w}^1 \cdot \mathbf{x}^\mu}{\sqrt{N}}, \qquad (56)$
$\lambda_2^\mu = \frac{\mathbf{w}^2 \cdot \mathbf{x}^\mu}{\sqrt{N}}. \qquad (57)$

Gradient descent on this energy function forces each centroid to perform Hebbian learning only on the data points that are currently closest to it.
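Assuming the energy takes the form $E = -\sum_\mu \max(\lambda_1^\mu, \lambda_2^\mu)$ as in (54)-(55), this Hebbian interpretation can be checked directly: the gradient with respect to $\mathbf{w}^a$ only involves the data points whose alignment with $\mathbf{w}^a$ is currently the largest. The sketch below is a numerical illustration under those assumptions; the data, learning rate, and $\sqrt{N}$ normalization are also assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 40, 400
# two clusters of data points, each rescaled to norm sqrt(N)
c = rng.standard_normal(N); c /= np.linalg.norm(c)
X = np.vstack([rng.standard_normal((P // 2, N)) + 4 * c,
               rng.standard_normal((P // 2, N)) - 4 * c])
X *= np.sqrt(N) / np.linalg.norm(X, axis=1, keepdims=True)

# two centroid weight vectors, also constrained to norm sqrt(N)
w = rng.standard_normal((2, N))
w *= np.sqrt(N) / np.linalg.norm(w, axis=1, keepdims=True)

eta = 0.05
for _ in range(200):
    lam = X @ w.T / np.sqrt(N)            # alignments lambda_a^mu
    closest = lam.argmax(axis=1)          # which centroid each point prefers
    for a in range(2):
        # -dE/dw^a: Hebbian update from the points currently closest to w^a
        grad = X[closest == a].sum(axis=0) / np.sqrt(N)
        w[a] += eta * grad
        w[a] *= np.sqrt(N) / np.linalg.norm(w[a])

print("overlap of each centroid with the true cluster axis:",
      np.round(w @ c / np.sqrt(N), 2))
```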

3.3 Replica Analysis of Learning

Both perceptron learning and unsupervised learning, when formulated as statistical mechanics problems as above, can be analyzed through the replica method. A natural question for perceptron learning is: how many associations can a perceptron with $N$ synapses memorize? One benchmark is the case of random associations, where each $\boldsymbol{\xi}^\mu$ is a random vector drawn from a uniform distribution on a sphere of radius $\sqrt{N}$ and each $\sigma^\mu = \pm 1$ with probability one half. Similarly, a natural question for unsupervised learning is: how do we assess the statistical significance of any structure or pattern we find in a high dimensional dataset consisting of $P$ points in $N$ dimensions? To address this question it is often useful to analyze what structure we may find in a null data distribution that itself has no structure, for example when the data points are drawn from a uniform distribution on the sphere (or equivalently, in the large $N$ limit, from a Gaussian distribution with identity covariance matrix).
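For random associations, whether the inequalities (51) admit a solution can be probed numerically at moderate $N$ by posing it as a linear programming feasibility problem. The sketch below is an illustrative aside rather than part of the replica analysis; the fraction of solvable instances drops from one to zero around $\alpha = P/N = 2$ at large $N$, the classical perceptron capacity for random dichotomies.

```python
import numpy as np
from scipy.optimize import linprog

def dichotomy_feasible(xi, sigma):
    # check whether some w satisfies sigma^mu (w . xi^mu) > 0 for all mu,
    # via the equivalent LP feasibility problem sigma^mu (w . xi^mu) >= 1
    P, N = xi.shape
    A_ub = -(sigma[:, None] * xi)          # -sigma^mu xi^mu . w <= -1
    b_ub = -np.ones(P)
    res = linprog(c=np.zeros(N), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * N, method="highs")
    return res.status == 0                 # status 0 means a solution exists

rng = np.random.default_rng(3)
N, trials = 100, 20
for alpha in [1.0, 1.5, 2.0, 2.5, 3.0]:
    P = int(alpha * N)
    frac = np.mean([dichotomy_feasible(rng.standard_normal((P, N)),
                                       rng.choice([-1, 1], size=P))
                    for _ in range(trials)])
    print(f"alpha = {alpha:.1f}: fraction of solvable instances = {frac:.2f}")
```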

In both cases, the analysis simplifies in the "thermodynamic" limit $N \to \infty$, $P \to \infty$, with the ratio $\alpha = P/N$ held constant. Fortunately, this is the limit of relevance to neural models with many synaptic weights, and to high dimensional data. The starting point of the analysis involves understanding the low energy configurations of the Gibbs distribution in (53). In the thermodynamic limit, important observables, like the volume of low energy configurations or the distribution of data along the optimal direction(s), become self-averaging; they do not depend on the detailed realization of the examples $\{\boldsymbol{\xi}^\mu, \sigma^\mu\}$ or the data points $\{\mathbf{x}^\mu\}$. Therefore we can compute these observables by averaging over these realizations. This can be done by first averaging the replicated partition function

$\langle\!\langle Z^n \rangle\!\rangle = \int \prod_{a=1}^{n} d\mathbf{w}^a \; \left\langle\!\!\left\langle \, e^{-\beta \sum_{a=1}^{n} E(\mathbf{w}^a)} \right\rangle\!\!\right\rangle, \qquad (58)$

where