Criticality & Deep Learning I: Generally Weighted Nets

02/26/2017 · by Dan Oprisa, et al.

Motivated by the idea that criticality and universality of phase transitions might play a crucial role in achieving and sustaining learning and intelligent behaviour in biological and artificial networks, we analyse a theoretical and a pragmatic experimental setup for critical phenomena in deep learning. On the theoretical side, we use results from statistical physics to carry out critical point calculations in feed-forward/fully connected networks, while on the experimental side we set out to find traces of criticality in deep neural networks. This is our first step in a series of upcoming investigations to map out the relationship between criticality and learning in deep networks.


1 Introduction

Various systems in nature display patterns, forms, attractors and recurrent behavior which are not caused by a single governing law per se; the ubiquity of such systems and the similar statistical properties of the order they exhibit has led to the term ”universality”, since such phenomena show up in cosmology, the fur of animals [1], chemical and physical systems [2], landscapes, biological prey-predator systems and countless others [3]. Furthermore, because of universality, it turns out that even the most simplistic mathematical models exhibit the same statistical properties when their parameters are tuned correctly. It thus suffices to study N-particle systems with simple, ”atomistic” components and interactions, since they already exhibit many non-trivial emergent properties in the large-N limit. Certain ”order” parameters change behavior in a non-classical fashion at specific noise levels. Using the rich and deep knowledge gained in statistical physics about those systems, we map out the mathematical properties and learn about novel behaviors in deep learning setups. Specifically, we look at a collection of N units on a lattice with various pair interactions; when the units are binary spins with values ±1, the model is known as a Curie-Weiss model. From a physical point of view, this is one of the basic, analytically solvable models which still possesses the rich emergent properties of critical phenomena. Moreover, given its general mathematical structure, the model has already been used to explain population dynamics in biology [4], opinion formation in society [5], machine learning [6, 7, 8] and many others [9, 10]. All those systems, with rich and diverse origins, possess almost identical behavior at criticality. In the latter case of machine learning, the Curie-Weiss model encodes, to first order, information about fully connected and feed-forward architectures. Similar work was done in [11, 12], where insights from Ising models and fully connected layers are drawn and applied to net architectures; in [13] a natural link between the energy function and an autoencoder is established. We will address the generalisation of the fully connected system and understand its properties before moving to the deep learning network and applying the same techniques and intuition there.

The article is organised as follows: section 2 gives a short introduction to critical systems with appropriate examples from physics; in section 3 we map a concrete, non-linear, feed-forward net to its physical counterpart and discuss other architectures as well; we then turn to the practical question of whether we can spot traces of criticality in current deep learning nets in section 4. Finally, we summarise our findings in section 5 and hint at future directions for the rich map between statistical systems and deep learning.

2 Brief summary of critical phenomena

Critical phenomena were first thoroughly explained and analysed in the field of statistical mechanics, although they had been observed in various other systems without a theoretical understanding. The study of criticality belongs to statistical physics and is an incredibly rich and wide field, hence we can only briefly summarise a few results of interest for the present article; a much more comprehensive coverage can be found in, e.g., [14, 15, 16, 17]. In a nutshell, the subject is concerned with the behavior of systems in the neighbourhood of their critical points [18]. One looks at systems composed of (families of) many identical particles and tries to derive properties of macroscopic parameters, such as density or polarisation, from the microscopic properties and interactions of the particles; statistical mechanics can hence be understood as a bridge between macroscopic phenomenology (e.g. thermodynamics) and microscopic dynamics (e.g. molecular or quantum-mechanical interacting collections of particles). Criticality is achieved when macroscopic parameters show anomalous, divergent behavior at a phase transition. Depending on the system at hand, the parameters might be magnetisation, polarisation, correlation, density, etc. Specifically, it is the correlation function of the ”components” which then displays divergent behavior and signals strong coordinated group behavior over a wide range of magnitudes. Usually it is the noise (temperature) which at certain values induces the phase transition accompanied by the critical anomalous behavior. Given its relevance in physics and its mathematical analogy to our deep learning networks, we will briefly review here the Curie-Weiss model with non-constant coupling and examine its behavior at criticality.

2.1 Curie-Weiss model

A simplistic, fully solvable model for a magnet is the Curie-Weiss model (CW) [19]. It possesses many interesting features, exhibits critical behavior and correctly predicts some of the experimental findings. As its mathematics is used later on in our deep learning setup, we briefly present its main properties and solutions to keep the article self-contained.

The Hamiltonian of the CW model is given by

H = -\frac{J}{2N}\sum_{i,j=1}^{N} \sigma_i \sigma_j \;-\; h\sum_{i=1}^{N} \sigma_i    (1)

Here the σ_i are a collection of N interacting ”particles”, in our physical case spins, that interact with each other via the coupling J; they take values ±1 and interact pairwise with each other at long distances; the inclusion of a factor of 1/N multiplying the quadratic spin term makes this long-range interaction tractable in the large-N limit. Furthermore, there is a directed external magnetic field h which couples to every spin via hσ_i. Since the coupling between spins is a constant and since every spin interacts with every other spin (self-interactions only amount to a constant shift), the Hamiltonian can be rewritten as

H = -\frac{J}{2N}\Big(\sum_{i=1}^{N}\sigma_i\Big)^{2} - h\sum_{i=1}^{N}\sigma_i    (2)

With β = 1/T being the inverse temperature, the partition function can be formulated as

Z = \sum_{\{\sigma\}} e^{-\beta H}    (3)
  = \sum_{\sigma_1=\pm 1}\cdots\sum_{\sigma_N=\pm 1} \exp\Big(\frac{\beta J}{2N}\Big(\sum_{i}\sigma_i\Big)^{2} + \beta h\sum_{i}\sigma_i\Big)    (4)

which can be fully solved [19] by summing over each of the 2^N states; given an explicit partition function Z, the free energy F can be computed via

F = -\frac{1}{\beta}\ln Z    (5)

Once we have F, various macroscopic values of interest can be inferred, such as the magnetisation m of the system, i.e. the first derivative of F with respect to h. This is a so-called ”order parameter”, which carries various other denominations, such as polarisation, density, opinion imbalance, etc., depending on the system at hand. It basically measures how arranged or homogeneous the system is under the influence of the outside field h which couples to the spins via hσ_i. A full treatment and derivation of the model including all its critical behavior can be found in [20], from where we get the equation of state for the magnetisation

m = \tanh\big(\beta(Jm + h)\big)    (6)

with m = \frac{1}{N}\sum_i \langle\sigma_i\rangle. The analysis of this equation for various temperatures T and couplings J reveals a phase transition at the critical temperature T_c = J. Introducing the dimensionless parameter t = (T - T_c)/T_c and expanding (6) for small magnetisation near the critical point (at vanishing field), the famous power-law dependence of the magnetisation on temperature emerges:

m \propto (-t)^{1/2}, \qquad T \to T_c^{-}    (7)

Here we recognise one of the very typical power laws which are ubiquitous in critical systems. The quantity we are most interested in, though, is the second derivative of the free energy with respect to h, which is basically the 2-point correlation function of the spins, i.e. the susceptibility. Again, expanding the second derivative of the free energy and looking in the neighbourhood of the critical temperature yields

\chi \;\propto\; |t|^{-1}    (8)

again displaying power-law behavior, here with a power coefficient of -1. The innocent-looking equation (8) actually has tremendous consequences, as it implies that correlations are not simply restricted to nearest neighbours but extend over very long distances, decaying only slowly; further, because of the power-law behavior, there will be self-similar, fractal patterns in the system: islands of equal magnetisation will form within other islands and so on, through all scales. Also, the correlation diverges at the critical point T = T_c. We will carry out the explicit calculations for our case of interest - non-constant matrix couplings - later on, in section 3.1.
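The two power laws above are easy to verify numerically. The following is a minimal sketch, not taken from the paper: it solves the equation of state (6) by damped fixed-point iteration and estimates the susceptibility by a finite difference; the coupling J = 1, the damping factor, the sampled temperatures and the function names are all illustrative choices.

# Minimal numerical sketch of the Curie-Weiss equation of state (6):
# m = tanh(beta * (J * m + h)), solved by damped fixed-point iteration.
import numpy as np

def magnetisation(T, J=1.0, h=0.0, m0=0.9, iters=10_000):
    """Solve m = tanh((J*m + h)/T) by damped fixed-point iteration."""
    beta, m = 1.0 / T, m0
    for _ in range(iters):
        m = 0.5 * m + 0.5 * np.tanh(beta * (J * m + h))
    return m

def susceptibility(T, J=1.0, h=0.0, dh=1e-5):
    """Finite-difference estimate of chi = dm/dh."""
    return (magnetisation(T, J, h + dh) - magnetisation(T, J, h - dh)) / (2 * dh)

if __name__ == "__main__":
    J = 1.0                      # critical temperature T_c = J
    for T in (0.90, 0.99, 1.01, 1.10):
        m, chi = magnetisation(T, J), susceptibility(T, J)
        # Below T_c: m ~ (T_c - T)^(1/2); near T_c: chi ~ |T - T_c|^(-1)
        print(f"T={T:.2f}  m={m:.4f}  chi={chi:.2f}")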

2.2 Criticality in real-world networks

Two of the main motivations for why we look for criticality and exploit it in artificial networks are the universal emergence of this phenomenon and the various hints of its occurrence in biological [4] and neural systems [25, 22]; once systems get ”sizable” enough and gain complexity, critical behavior emerges, which also applies to man-made nets [23]. Various measures can be formulated to detect criticality, and they all show power-law distribution behavior. In the world wide web, e.g., the number of links pointing to a source and the number of links pointing away from a source both exhibit a power-law distribution

P(k) \propto k^{-\gamma}    (9)

for some power coefficient γ. Similar behavior can be uncovered in various other networks, if they are sizable enough, such as the citation behavior of scientific articles, social networks, etc. A simple, generic metric to detect criticality in networks is the degree distribution, i.e. the distribution of the number of (weighted) links connecting to a node.

Further, the correlation between nodes is also non-trivial: nodes with similar degree have a higher probability of being connected than nodes of different degree ([23], chapter VII). We will follow a similar path as proposed above and grow an experimental network in which new nodes attach with the simplest preferential and directed attachment towards existing nodes, as a function of their degree:

\Pi(k) = \frac{k}{\sum_j k_j}    (10)

Here, Π(k) denotes the probability that a new node will grow a link to an existing node of degree k. Hence, every new node will prefer nodes with higher degrees, leading to the overall power-law distribution observed in real-world systems; a minimal sketch of this growth rule is given below. Additional metrics we look at are single neuron activity as well as layer activity and pattern behavior; more details on that in section 4.
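The sketch below implements the growth rule (10) under the assumption of standard linear preferential attachment; the network size, the number of links added per new node and the seed graph are illustrative choices rather than the paper's exact experimental setup.

# Sketch of the preferential-attachment growth rule of eq. (10): a new node
# attaches to an existing node of degree k with probability proportional to k.
import random
from collections import Counter

def grow_network(n_nodes=10_000, links_per_node=2, seed=0):
    random.seed(seed)
    # Start from a tiny seed graph so every node has degree > 0.
    degree = {0: 1, 1: 1}
    edges = [(0, 1)]
    targets = [0, 1]                         # node i appears degree[i] times
    for new in range(2, n_nodes):
        chosen = set()
        while len(chosen) < links_per_node:
            chosen.add(random.choice(targets))   # P(k) proportional to k
        degree[new] = 0
        for old in chosen:
            edges.append((new, old))
            degree[new] += 1
            degree[old] += 1
            targets += [new, old]            # keeps sampling degree-weighted
    return degree, edges

if __name__ == "__main__":
    degree, _ = grow_network()
    counts = Counter(degree.values())
    # A log-log plot of degree k vs. number of nodes with that degree should
    # be roughly linear, i.e. P(k) ~ k^(-gamma).
    for k in sorted(counts)[:10]:
        print(k, counts[k])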

3 Criticality in deep learning nets

3.1 From feed-forward to fully connected architecture

We will now focus on a feed-forward network with two layers, v and h, connected via a weight matrix w. In order to probe our system for criticality, we write down its Hamiltonian

H(v, h) = -\sum_{i,j} v_i\, w_{ij}\, h_j \;-\; b\sum_{j} h_j    (11)

which was first formulated in the seminal paper [9]. Here, the values of the v_i and h_j are in {0, 1}. Further, by absorbing the biases into the weight matrix in the usual way, we can assume our weight matrix has the form

(12)

while the input vector is extended by a constant unit accordingly.

This Hamiltonian describes a two-layer net containing rectified linear units (ReLU) in the h-layer with a common bias term b. The weight matrix w sums the binary inputs coming from the v-layer, and those sums are fed into the h-layer; depending on whether the ReLU threshold has been reached, a unit is activated, hence the binary values allowed for both the inputs and the h-layer.

Further, we show in appendix A that the partition function is, up to a constant, the same for units taking values in {0, 1} or {-1, +1}. After redefining the units accordingly, we can formulate the partition function as

Z \;=\; \sum_{\{v\},\,\{h\}} e^{-\beta H(v,\,h)}    (13)

where β = 1/T is the inverse temperature. This is the partition function of a bipartite graph with non-constant connection matrix w.

However, it turns out that the partition function of the fully connected layer is the highest (first order) contribution of our feed-forward network (see appendix B), further simplifying the expression to

(14)

We will now proceed and compute the free energy F, defined as F = -\frac{1}{\beta}\ln Z, using the procedure presented in [10]. From the free energy we then find all quantities of interest, especially the 2-point correlation function of the neurons.

3.2 Fully connected architecture with non-constant weights

In order to solve the CW model analytically, one has to perform the sum over spins, which is hindered by the quadratic term in the spins. The standard way to overcome this problem is the Gaussian linearisation trick, which replaces the quadratic term by a term linear in the spins at the price of one additional continuous variable - the ”mean” field m - which is integrated over the entire real line:

e^{\frac{\beta J}{2N}\left(\sum_i \sigma_i\right)^{2}} = \sqrt{\frac{N\beta J}{2\pi}} \int_{-\infty}^{\infty} dm\; e^{-\frac{N\beta J}{2}\, m^{2} + \beta J\, m \sum_i \sigma_i}    (15)

which in physics is known as the Hubbard–Stratonovich transform.
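As a quick numerical sanity check of identity (15), the snippet below compares both sides for arbitrary illustrative values of βJ, N and the spin sum; the chosen numbers carry no special meaning.

# Numerical check of the Hubbard-Stratonovich identity (15).
import numpy as np
from scipy.integrate import quad

c, N, S = 0.9, 12, 7.0        # c = beta*J, N spins, S = sum of spins (illustrative)
lhs = np.exp(c * S**2 / (2 * N))
integral, _ = quad(lambda m: np.exp(-0.5 * N * c * m**2 + c * m * S),
                   -np.inf, np.inf)
rhs = np.sqrt(N * c / (2 * np.pi)) * integral
print(lhs, rhs)               # the two values agree to numerical precision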

Unfortunately our coupling is not a scalar, hence we will linearise the sum term by term to keep track of all the weight matrix entries. First we insert identities via the Dirac delta function into our Hamiltonian as used in (14):

(16)

With the definition of the delta function, the partition function (14) now reads

(17)

As already stated, we can perform the sum over the binary units, since they show up linearly in the exponential after the change of variables via the delta identity (in general we are not interested in numerical multiplicative constants, as later on, when taking the logarithm of the partition function and computing the free energy, those terms become simple additive constants with no contribution after differentiation); we have effectively converted the sum over binary values into integrals over continuous variables, leading to

(18)

with a generalised Hamiltonian

(19)

Ultimately we are interested in the free energy per unit f, which contains the partition function via

f = -\frac{1}{\beta N}\ln Z    (20)

From f we can now obtain all quantities of interest via derivatives, in our case with respect to the external field h. The partition function still contains a product of double integrals, which can be solved via the saddle point approximation; we recall here the one-dimensional case

\int dx\; e^{-N g(x)} \;\approx\; \sqrt{\frac{2\pi}{N\, g''(x_0)}}\; e^{-N g(x_0)}    (21)

where x_0 is the stationary value of g(x) and g''(x_0) is, in our case, the Hessian evaluated at the stationary point:

(22)

while in our case g is given in (3.2).

The expression (18) can now be computed by applying the saddle point conditions for both integrals simultaneously. The stationarity conditions (keeping in mind that we enlarged the weight matrix to contain the biases as well, hence the explicit equations are mutually dependent) give

(24)

which, combined, deliver the self-consistency mean field equation of the fully connected layer (3.2). Further, the Hamiltonian evaluated at the stationarity conditions reads

(25)

Equation (25) already manifestly displays the consistency equation for the mean field, as taking its first derivative with respect to the mean field leaves exactly the consistency equation, by construction.

We can now rewrite the free energy (20) as

(26)

We now need to address the large-N limit; the second term, coming from the determinant, clearly vanishes in this limit, as the logarithm grows only slowly while we divide by N; the first term - a double sum over the units - is extensive and hence a well defined average in the limit; the last term, when expanded, is again linear in the sum (the interior sum is an average, hence well defined in the limit; after expansion we are left with the outer sum, which is again a well defined average when divided by N), hence we are left with the free energy

(27)

We are now at the point where all quantities of interest can be derived from the free energy f; the order parameter (aka magnetisation when dealing with spins) per unit is defined as

(28)

The second term on the right vanishes identically, as we recognize it as being evaluated at the stationarity condition for the Hamiltonian. The contribution of the first term is:

m_i = \tanh\Big(\beta\Big(\sum_j w_{ij}\, m_j + h_i\Big)\Big)    (29)

which is (the weighted-sum version of) the iconic self-consistency mean field equation of the CW magnet (6).

The critical point is located where the correlation function diverges for vanishing external field; the 2-point correlation function (aka susceptibility when dealing with spins) is the second derivative of F, i.e. the derivative of (3.2) with respect to h:

\chi_{ij} \;=\; \frac{\partial m_i}{\partial h_j} \;=\; \beta\,(1 - m_i^{2})\Big(\sum_k w_{ik}\,\chi_{kj} + \delta_{ij}\Big)    (30)

where we used the original equation (3.2) when taking the derivatives. It is worth contemplating equations (3.2) and (30) first. They both capture the essence of the criticality of our system, including its power-law behavior. When the weight matrix reduces to a scalar coupling, both equations reduce to the classical CW system and display the behavior shown in (7) and (8). Furthermore, eq. (30) encodes all the information needed for finding the critical point of the matrix system at hand; we recall that all the mean fields (and their derivatives) are already implicitly ”solved” in terms of the couplings and the fields via the stationarity equation (3.2), and hence they are just placeholders for functions of the couplings and fields; we are thus left with a non-linear system of first order differential equations in N² variables, which will produce poles for specific values of the couplings and temperature at criticality.
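As an illustration of how such a critical point could be located in practice, the sketch below assumes the tanh form of the self-consistency equation (29) and the linear-response relation (30): it iterates the mean-field equations for a random symmetric weight matrix and monitors when the matrix (I - βDW), whose inversion gives the susceptibility, loses an eigenvalue to zero. The weight matrix, its scaling and the temperature scan are illustrative assumptions, not the paper's trained networks.

# Sketch: locate the divergent-correlation point of the weighted mean-field
# equations, assuming the tanh self-consistency form of (29).
import numpy as np

rng = np.random.default_rng(0)
N = 50
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
W = 0.5 * (W + W.T)                      # symmetric coupling matrix (assumption)
h = np.zeros(N)

def solve_mean_field(beta, iters=5000):
    m = rng.uniform(-0.1, 0.1, N)
    for _ in range(iters):
        m = 0.5 * m + 0.5 * np.tanh(beta * (W @ m + h))
    return m

for beta in np.linspace(0.2, 2.0, 10):
    m = solve_mean_field(beta)
    D = np.diag(1.0 - m**2)
    # Linear response: chi = beta * (I - beta*D*W)^(-1) * D ; it diverges
    # when the smallest eigenvalue of (I - beta*D*W) crosses zero.
    gap = np.min(np.real(np.linalg.eigvals(np.eye(N) - beta * D @ W)))
    print(f"beta={beta:.2f}  min eigenvalue of (I - beta D W) = {gap:+.3f}")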

4 Experimental results

Figure 1: Feed-forward net: Layer 3 weight distribution

After investigating criticality through the partition function in our theoretical setup, we now turn to a practical question: do current deep learning networks exhibit critical behaviour, or, put differently, can we spot traces of critical phenomena in them? Instead of directly attacking the partition function of real-world deep neural nets, we start with the practical observation that systems around criticality show power-law distributions in certain internal attributes.

Figure 2: Feed-forward net: Log-log plot of layer activation pattern frequencies by rank
Figure 3: Autoencoder: Log-log plot of layer activation pattern frequencies by rank

Concretely for networks [23, 24] we look for traces of power laws in weight distributions, layer activation pattern frequencies, single node activation frequencies and average layer activations. In the following we will present experimental results for multilayer feed-forward networks, convolutional neural nets and autoencoders.

For all networks we ran experiments on the CIFAR-10 dataset, training each model for 200 epochs using ReLU activations and the Adam optimizer without gradient clipping, and running inference for 100 epochs. The feed-forward network had 3 layers with 500, 400 and 200 nodes, the CNN had 3 convolutional layers followed by 3 fully connected layers, and the autoencoder had one hidden layer with 500 nodes.
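For concreteness, a PyTorch sketch of the feed-forward configuration described above might look as follows; the batch handling, learning rate and output head are assumptions, since they are not specified above.

# Illustrative sketch of the feed-forward model: 3 hidden layers of 500, 400
# and 200 ReLU units on CIFAR-10, trained with Adam and no gradient clipping.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 500), nn.ReLU(),
    nn.Linear(500, 400), nn.ReLU(),
    nn.Linear(400, 200), nn.ReLU(),
    nn.Linear(200, 10),              # CIFAR-10 has 10 classes
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()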

For weight distributions we looked at the sum of absolute values of the outgoing weights at each node, as a weighted order of the node. Fig. 1 shows a log-log plot of counts versus the node order defined above; we detect no linear behavior.
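A possible implementation of this ”weighted order” statistic and of the data for the corresponding log-log plot is sketched below; the binning is an illustrative choice.

# Sketch of the "weighted order" statistic of fig. 1: for every node, sum the
# absolute values of its outgoing weights, then histogram the result.
import numpy as np
import torch.nn as nn

def node_orders(layer: nn.Linear) -> np.ndarray:
    # weight has shape (out_features, in_features); column j collects the
    # outgoing weights of input node j.
    w = layer.weight.detach().abs().numpy()
    return w.sum(axis=0)

def loglog_histogram(orders: np.ndarray, bins: int = 30):
    counts, edges = np.histogram(orders, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0
    return np.log10(centers[mask]), np.log10(counts[mask])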

Figure 4: CNN: Log-log plot of layer activation pattern frequencies by rank
Figure 5: CNN: Log-log plot of layer activation pattern frequencies by rank

For layer activation patterns we counted the frequency of each layer activation pattern through the inference epochs. Figures 2 and 3 are log-log plots of layer activation pattern frequencies versus their respective counts for the feed-forward net and the autoencoder. As we see, the hidden layer activation pattern frequencies of the autoencoder resemble a truncated straight line, indicating that learning hidden features in an unsupervised manner can give rise to scale-free, power-law phenomena, in accordance with the findings of [24]; none of the other architectures show traces of a power law.

For single node activation frequencies we counted the frequency of each node's activations through the inference epochs.
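The two counting procedures can be sketched as follows; the binarization threshold (activation > 0 for ReLU units) is the natural choice but still our assumption.

# Sketch of the two activation statistics: (i) frequency of each binarized
# layer activation pattern and (ii) per-node activation frequency, collected
# over inference passes.
from collections import Counter
import numpy as np

pattern_counts = Counter()
node_counts = None

def record(layer_activations: np.ndarray):
    """layer_activations: array of shape (batch, units) from one layer."""
    global node_counts
    active = layer_activations > 0
    if node_counts is None:
        node_counts = np.zeros(active.shape[1], dtype=np.int64)
    node_counts += active.sum(axis=0)                  # (ii) per-node counts
    for row in active:
        pattern_counts[row.tobytes()] += 1             # (i) pattern counts

def rank_frequency(counter: Counter):
    """Frequencies sorted descending; a log-log plot of rank vs. frequency
    should be roughly linear for power-law (Zipf-like) behavior."""
    freqs = np.array(sorted(counter.values(), reverse=True))
    ranks = np.arange(1, len(freqs) + 1)
    return np.log10(ranks), np.log10(freqs)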

Figure 6: Feed-forward net: Log-log plot of single node activation frequencies by rank
Figure 7: CNN: Log-log plot of single node activation frequencies by rank

Figures 6 and 7 depict the behavior of the feed-forward and convolutional networks. The flat, nearly horizontal line in the latter architecture is again a sign of a missing exponent altogether.

As a last measure we employed the sum of activations, defined as the average activation on each layer throughout the inference epochs.

Figure 8: Feed-forward net: Average layer activation distribution
Figure 9: CNN: Average layer activation distribution

Spontaneous and detectable criticality did not arise in the classical architectures, so the next step will be to create and experiment with systems that have induced criticality and learning rules that take criticality into account. Our first approach was to grow a fully connected net using the preferential attachment algorithm to induce at least some power law in the node weights, and to use this fully connected net as a hidden-to-hidden module. We further experimented with different solutions regarding input and read-out of activations from this hidden-to-hidden module, without changing the power-law distribution (this would simulate a system located at a critical state, with a power-law weight distribution). Our findings so far show that learning in these systems is very unstable, without any improvement in learning and inference. The fundamental missing part is how to naturally induce a critical state in a network which is equipped with learning rules that inherently take the critical state into account. For that we need new architectures and new learning rules, derived from the critical point equations (30).

5 Summary and outlook

Summary: In this article we took our first steps in investigating the relationship between criticality and deep learning networks. After a short introduction to criticality in statistical physics and real-world networks, we started with the theoretical setup of a fully connected layer. We used continuous mean field approximation techniques to tackle the partition function of the system, ending up with a system of differential equations that determine the critical behaviour of the system. These equations can be the starting point for a possible network architecture with induced criticality and learning rules exploiting criticality. After that we presented results of experiments aiming to find traces of power-law distributions in current deep learning networks such as multilayer feed-forward nets, convolutional networks and autoencoders. The results - except for the autoencoder - were negative, establishing as the next step the necessity to create networks with induced criticality and learning rules that exploit the critical state.

Outlook: Obviously the fully connected layer, which can be solved analytically on the theoretical side, is of limited importance, as it translates into a rather simplistic architecture; more realistic, widely used set-ups, e.g. convolutional or recurrent nets, do contain the feed-forward mechanism but deviate strongly from it and hence map only partially onto our theoretical treatment; it would definitely be essential to address theoretically the convolution mechanism of deep nets and establish a link between the theoretical and experimental side; also, inducing criticality into the net via eq. (30) could prove beneficial and might very well affect learning behavior and the flow on the surface of the loss function.

Appendix

Appendix A Different unit values

We here show that the partition function with Hamiltonian

(31)

whose units take values in {0, 1} has the same qualities as encoded in the partition function with a Hamiltonian whose units take values in {-1, +1}.

We rewrite the Hamiltonian in (31) with units taking values in {-1, +1} (using Einstein's summation convention over repeated indices):

(32)

where the new units take values in {-1, +1}. Carrying out the multiplications in (32) yields

(33)

with a correspondingly shifted coupling. Hence, when computing the partition function Z with (32), we obtain

(34)

where the right hand side contains the original Hamiltonian with a shifted coupling. The additional constant factors out completely and hence, when taking the logarithm and the second derivative, it does not change the outcome. Also we note that the second derivative with respect to the field differs only by a constant factor.

Appendix B First order contribution

We consider here the Hamiltonian of the bipartite graph connected via the weight matrix w (with Einstein summation convention):

H(v, h) = -\, v_i\, w_{ij}\, h_j    (35)

with the free energy

F = -\frac{1}{\beta}\ln Z    (36)

Without any loss of generality we set the temperature to β = 1 and will not keep track of it. Carrying out the partial sum over the h units yields

Z = \sum_{\{v\}} \prod_{j} \Big(1 + e^{\sum_i v_i w_{ij}}\Big)    (37)

The sum over the h units is understood as a collection of terms, one for each unique combination of 0's and 1's in the vector representing that specific state of the spins; however, the sum can conveniently be written as a product of binary summands, where each factor contains exactly the two possible states of the j-th spin - this is where the product over j in the formula above comes from. Expanding now to lowest order in the weights we obtain

(38)

where the Hamiltonian of the fully connected graph appears, defined as (Einstein summation convention)

(39)

A few notes are in order regarding eq. (39): the matrix is now symmetric by construction and hence mediates between equally sized (actually identical) layers; further, all higher terms of the expansion are even, hence all further contributions are higher-order, symmetric interactions of the layer with itself.
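The statement about the even higher-order terms can be checked directly: each factor in the product form (37) contributes log(1 + e^x) to the exponent, and its Taylor expansion around x = 0 contains, beyond the linear term, only even powers. A quick symbolic check (assuming sympy is available):

# Expansion of log(1 + exp(x)) around x = 0: the linear and quadratic terms
# give the first-order fully connected contribution; all higher terms are even.
import sympy as sp

x = sp.symbols('x')
print(sp.series(sp.log(1 + sp.exp(x)), x, 0, 8))
# -> log(2) + x/2 + x**2/8 - x**4/192 + x**6/2880 + O(x**8):
#    no odd terms beyond the linear one.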

References