Machine learning methods built on deep neural networks have had unparalleled success across a dizzying array of tasks ranging from image recognition (Krizhevsky et al., 2012) to translation (Wu et al., 2016) to speech recognition and synthesis (Hinton et al., 2012). Amidst the rapid progress of machine learning at large, a theoretical understanding of neural networks has proceeded more modestly. In part, this difficulty stems from the complexity of neural networks, which have come to be composed of millions or even billions (Shazeer et al., 2017) of parameters with complicated topology.
Recently, a number of promising theoretical results have made progress by considering neural networks that are, in some sense, random. For example, Choromanska et al. (2015) showed that random rectified linear neural networks could, with approximation, be mapped onto spin glasses; Saxe et al. (2014) explored the learning dynamics of randomly initialized networks; Daniely et al. (2016) and Daniely et al. (2017) studied an induced duality between neural networks with random pre-activations and compositions of kernels; Raghu et al. (2017) and Poole et al. (2016) studied the expressivity of deep random neural networks; and Schoenholz et al. (2017) studied information propagation through random networks. Work on random networks in the context of Bayesian neural networks has a longer history (Neal, 1996, 2012; Cho and Saul, 2009). Overall it seems increasingly likely that statements about randomly initialized neural networks might be able to inform practical design questions.
In a seemingly unrelated context, the past century has witnessed significant advances in theoretical physics, many of which may be attributed to the development of statistical, classical, and quantum field theories. Field theory has been used to understand a remarkably diverse set of physical phenomena, ranging from the standard model of particle physics (Weinberg, 1967), which represents the sum of our collective knowledge about subatomic particles, to the codification of phase transitions using Landau theory and the renormalization group (Chaikin and Lubensky, 2000). Consequently, an extremely wide array of tools has been developed to understand and approximate field theories.
In this paper we elucidate an explicit connection between random neural networks and statistical field theory. We demonstrate how well-established techniques in statistical physics can be used to study random neural networks in a quantitative and robust way. We begin by constructing an ensemble of random neural networks that we believe has a number of appealing properties. In particular, one limit of this ensemble is equivalent to studying neural networks after random initialization while another limit corresponds to probing the statistics of minima in the loss landscape. We then introduce a change of variables which shows that the weights and biases may be integrated out analytically to give a distribution over the pre-activations of the neural network alone. This distribution is identical to that of a statistical lattice model and this mapping is exact. We examine an expansion of our results as the network width grows large, and obtain concise and interpretable results for deep linear networks and deep rectified linear networks. We show that there exist well-defined mean field theories whose fluctuations may be fully characterized, and that there exist corresponding continuum field theories that govern the long wavelength behavior of these random networks. We compare our theory to simulations of random neural networks and find exceptional agreement. Thus we show that the behavior of wide random networks can be very precisely characterized.
This work leaves open a wide array of avenues that may be pursued in the future. In particular, the ensemble that we develop allows for a loss to be incorporated in the randomness. Looking forward, it seems plausible that statements about the distribution of local optima could be obtained using these methods. Moreover early training dynamics could be investigated by treating the small loss limit as a perturbation. Finally, generalization to arbitrary neural network architectures and correlated weight matrices is possible.
We now briefly discuss lattice models in statistical physics and their corresponding effective field theories, using the ubiquitous Ising model as an example. Many materials are composed of a lattice of atoms. Magnets are such materials where the electrons orbiting the atoms all spin in the same direction. To model this behavior, physicists introduced a very simple model that involves “spins” placed on vertices of a lattice. In our simple example we consider spins sitting on a one-dimensional chain of length $L$ at sites indexed by $i$. The spins can be modeled in many ways, but in the simplest formulation of the problem we take $\sigma_i \in \{-1, 1\}$. This represents spins that are either aligned or anti-aligned. For ease of analysis we consider a periodic chain defined so that the first site is connected to the last site and $\sigma_{L+1} = \sigma_1$.
The statistics of the spins in such a system are determined by the Boltzmann distribution, which gives the probability of a configuration of spins as $P(\{\sigma_i\}) = e^{-\beta H(\{\sigma_i\})}/Z$, where $\beta$ is the reciprocal of the thermodynamic temperature. Here, $H(\{\sigma_i\}) = -J \sum_{i=1}^{L} \sigma_i \sigma_{i+1}$ is the “energy” of the system, where $J$ is a coupling constant. This energy is minimized when all of the sites point in the same direction. The normalization constant $Z$ is the partition function, and is given by $Z = \sum_{\{\sigma_i\}} e^{-\beta H(\{\sigma_i\})}$.
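The definitions above can be checked numerically on a short chain. The sketch below is a minimal illustration, assuming a periodic chain with zero external field and using hypothetical function names; it evaluates the partition function by brute-force enumeration and compares it with the well-known transfer-matrix closed form for this case.

```python
import itertools
import math

def ising_energy(spins, J=1.0):
    """Energy -J * sum_i s_i * s_{i+1} of a periodic chain of +/-1 spins."""
    L = len(spins)
    return -J * sum(spins[i] * spins[(i + 1) % L] for i in range(L))

def partition_function(L, beta, J=1.0):
    """Brute-force sum of Boltzmann weights over all 2**L configurations."""
    return sum(math.exp(-beta * ising_energy(s, J))
               for s in itertools.product((-1, 1), repeat=L))

def partition_function_transfer(L, beta, J=1.0):
    """Closed form from the 2x2 transfer matrix (periodic chain, zero field)."""
    return (2 * math.cosh(beta * J)) ** L + (2 * math.sinh(beta * J)) ** L
```

Enumeration costs $2^L$ evaluations, so it is only practical for short chains; the closed form is what makes the one-dimensional model exactly solvable.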
Despite the relative simplicity of the Ising model it has had enormous success in predicting the qualitative behavior of an extremely wide array of materials. In particular, it can successfully explain transitions of physical systems from disordered to ordered states.
In general, lattice models are often unwieldy, and their successful predictions are sometimes surprising given how approximately they treat interactions present in real materials. The resolution to both of these concerns lies in the realization that we often are most concerned with the behavior of systems at very large distances relative to the atomic separations. For example, in the case of magnets we care much more about the behavior of the whole material rather than how any two spins in the material behave.
This led to the development of Effective Field Theory (EFT) where we compute a field $\sigma(x)$ by averaging over many $\sigma_i$ using, for example, a Gaussian window centered at $x$. The field is defined at every point in space and so here $x$ corresponds to a continuous relaxation of $i$. It turns out that a number of features of the original model, such as symmetries and locality, survive the averaging process. In EFT we study a minimal energy that captures these essential (or long range) aspects of our original theory.
The energy describing the EFT is typically a complicated function of the field and its derivatives. However, it has been shown that the long wavelength behavior of the system can successfully be described by considering a low-order expansion of this energy. For the Ising model, for example, the effective energy contains a gradient term together with low-order even powers of the field.
It can be shown, using the theory of irrelevant operators, that as long as the coefficient of the highest power retained is positive, higher order powers of the field do not change qualitative aspects of the resulting theory. Only even powers are allowed because the Ising model has a global $\mathbb{Z}_2$ symmetry, and the gradient term encodes the propensity of spins to align with one another. EFTs such as this one have been very successful at describing the large scale behavior of lattice models such as the Ising model. A great triumph of modern condensed matter physics was the realization that phases of matter could be characterized by their symmetries in this way.
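For orientation, a textbook Landau-Ginzburg form of such an effective energy can be written as follows. The symbols here are assumed notation rather than taken from the text: $\sigma(x)$ for the coarse-grained field and $r$, $u$ for the expansion coefficients. The normalization of each term is a convention.

```latex
\beta H[\sigma] \;=\; \int dx \left[ \frac{1}{2}\bigl(\nabla \sigma(x)\bigr)^2
  + \frac{r}{2}\,\sigma(x)^2 + \frac{u}{4}\,\sigma(x)^4 \right],
\qquad u > 0 .
```

Every term is invariant under $\sigma \to -\sigma$, which is how the global symmetry of the lattice model restricts the expansion to even powers.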
A very large effort has been devoted to developing techniques to analyze lattice models and EFTs. Consequently, any theory that can be written as a lattice model or EFT has access to a wide array of approximate analytic and numerical techniques that can be leveraged to study it. We will employ several of these techniques, such as the saddle point approximation, here. This paper is therefore a way of opening the door to use this extensive toolset to study neural networks.
3 An Exact Correspondence
Consider a fully-connected feed-forward neural network, $f$, with $L$ layers of width $N_l$ parametrized by weights $W^l$, biases $b^l$, and nonlinearities $\phi^l$. The network is defined by the equation,
We equip this network with a loss function, $\ell(f(x), t)$, where $x$ is the input to the network and $t$ is a target that we would like to model. Given a dataset of inputs together with targets, $\{(x_a, t_a)\}$, we can define a “data” piece of our loss,
Throughout this text we will use Roman subscripts to specify an input to the network and Greek subscripts to denote individual neurons. We then combine this with an $L_2$ regularization term on both the weights and the biases to give a total loss,
Here we have introduced a parameter that controls the relative influence of the data on the loss. It can also be understood as the reciprocal of the regularization parameter. In this equation and in what follows, we adopt the Einstein summation convention in which there is an implied summation over repeated Greek indices.
To construct a stochastic ensemble of networks we must place a measure on the space of networks. As the objective we hope to minimize is the total loss, following Jaynes (1957) we select the maximum entropy distribution over networks subject to a measurement of the expected total loss,
This gives a probability of finding a network to be given by, where is the partition function,
Here we have introduced the notation and for simplicity.
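The step from the maximum entropy principle to a Boltzmann-like weight is the standard Lagrange multiplier argument. It is sketched here in assumed notation: $P(f)$ for the density over networks, $\mathcal{L}(f)$ for the total loss, $\bar{\mathcal{L}}$ for its measured expectation, and $\lambda$, $\mu$ for the multipliers.

```latex
\begin{aligned}
S[P] &= -\int df\, P(f)\ln P(f)
      \;-\; \lambda\left(\int df\, P(f)\,\mathcal{L}(f) - \bar{\mathcal{L}}\right)
      \;-\; \mu\left(\int df\, P(f) - 1\right), \\
0 = \frac{\delta S}{\delta P(f)} &= -\ln P(f) - 1 - \lambda\,\mathcal{L}(f) - \mu
\quad\Longrightarrow\quad
P(f) = \frac{e^{-\lambda \mathcal{L}(f)}}{Z},
\qquad Z = \int df\, e^{-\lambda \mathcal{L}(f)} .
\end{aligned}
```

The multiplier $\lambda$ can be absorbed into the overall scale of the loss, which leaves a distribution of the same form as the Boltzmann distribution of a lattice model.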
While a choice of ensemble in this context will always be somewhat arbitrary, we argue that this particular ensemble has several interesting features that make it worthy of study. First, if we set the parameter controlling the influence of the data to zero, then this amounts to studying the distribution of untrained, randomly initialized neural networks whose weights and biases follow zero-mean Gaussian distributions with variances set by the regularization strengths. This situation also amounts to considering a Bayesian neural network with a Gaussian distributed prior on the weights and biases as in Neal (2012). When this parameter is small, but nonzero, we may treat the loss as a perturbation about the random case. We speculate that the small-parameter regime should be tractable given the work presented here. Studying the case where the parameter is large will probably require methodology beyond what is introduced in this paper; however, if progress could be made in this regime, it would give insight into the distribution of minima in the loss landscape.
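In the limit where the data term is switched off, sampling from the ensemble reduces to drawing Gaussian weights and biases and running a forward pass. The following is a minimal sketch under stated assumptions, not the authors' code: the weight variance is taken as $\sigma_w^2 / N_{\text{in}}$ and the bias variance as $\sigma_b^2$, the input feeds the first layer directly, and the names `init_network` and `forward` are hypothetical.

```python
import numpy as np

def init_network(widths, sigma_w=1.0, sigma_b=0.1, seed=0):
    """Draw Gaussian weights and biases for a fully-connected network.

    Assumed scaling: W ~ N(0, sigma_w^2 / n_in), b ~ N(0, sigma_b^2).
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """Return the list of pre-activations, one array per layer, for input x."""
    zs = []
    h = x
    for W, b in params:
        z = W @ h + b      # pre-activation of this layer
        zs.append(z)
        h = phi(z)         # activation passed to the next layer
    return zs
```

Returning the pre-activations rather than only the output is a deliberate choice here, since they are the variables in which the lattice model is formulated.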
Our main result is to rewrite eq. (8) in a form that is more amenable to analysis. In particular, the weights and biases may be integrated out analytically, resulting in a distribution that depends only on the pre-activations in each layer. This formulation elucidates the statistical structure of the network and allows for systematic approximation. By a change of variables we arrive at the following theorem (for a proof see appendix 8.1).
Through the change of variables, , the distribution over weights and biases defined by eq. (8) can be converted to a distribution over the pre-activations of the neural network. When the distribution over the pre-activations is described by a statistical lattice model defined by the partition function,
Here $z^l_\alpha$ is a vector whose components are the pre-activations of neuron $\alpha$ in layer $l$ corresponding to different inputs to the network, and $\Sigma^l$ is the correlation matrix between activations of the network from different inputs. The lattice is a one-dimensional chain indexed by layer $l$ and the “spins” are the vectors $z^l_\alpha$. We term this class of lattice model the Stochastic Neural Network.
Eq. (9) is the full joint distribution for the pre-activations in a random network with arbitrary activation functions and layer widths, with no reference to the weights or biases. We see that this lattice model features coupling between adjacent layers of the network as well as between different inputs to the network. Finally, we see that the loss now only features the pre-activation of the last layer of the network. The input to the network and the loss therefore act as boundary conditions on the lattice.
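The coupling between different inputs can be illustrated for a single layer. The sketch below assumes Gaussian weights with variance $\sigma_w^2/N$ and biases with variance $\sigma_b^2$; under that assumption, conditioning on the previous layer's activations makes each neuron's pre-activations across inputs jointly Gaussian with covariance $\sigma_w^2 \, h_a \cdot h_b / N + \sigma_b^2$. The function names and this normalization are assumptions for illustration and may differ from the paper's definition of the correlation matrix.

```python
import numpy as np

def conditional_covariance(H, sigma_w, sigma_b):
    """Covariance across inputs of one neuron's pre-activation.

    H has shape (n_inputs, N): previous-layer activations, one row per input.
    """
    N = H.shape[1]
    return sigma_w**2 / N * (H @ H.T) + sigma_b**2

def empirical_covariance(H, sigma_w, sigma_b, n_neurons=100_000, seed=0):
    """Monte Carlo estimate of the same covariance by sampling weights and biases."""
    rng = np.random.default_rng(seed)
    N = H.shape[1]
    W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(n_neurons, N))
    b = rng.normal(0.0, sigma_b, size=n_neurons)
    Z = W @ H.T + b[:, None]          # shape (n_neurons, n_inputs)
    return Z.T @ Z / n_neurons        # neurons are i.i.d. with zero mean
```

Because each neuron shares the same previous-layer activations, averaging over many independent neurons plays the role of averaging over the weights.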
There is a qualitative as well as methodological similarity between this formalism and the use of replica theory to study spin glasses (Mézard et al., 1987). Qualitatively, we notice that the replicated partition function in spin glasses involves the overlap function, which measures the correlation between spins in different replicas, while eq. (9) is naturally written in terms of the correlation matrix, which measures the correlation between activations due to different inputs to the network. Methodologically, when using the replica trick to analyze spin glasses, one assumes that the interactions between spins are Gaussian distributed and shared between different replicas of the system; by integrating out the couplings analytically, the different replicas naturally become coupled. In this case the weights play a similar role to the interactions and their integration leads to a coupling between the signals due to different inputs to the network.
Samples from a stochastic network with can be seen in fig. (1).
In this framework we can see that the mean field approximation of Poole et al. (2016) amounts to the replacement of the correlation matrix $\Sigma^l$ by $\langle \Sigma^l \rangle$, where $\langle \cdot \rangle$ denotes an expectation. This procedure decouples adjacent layers and replaces the complex joint distribution over pre-activations by a factorial Gaussian distribution. As a result, this approximation is unable to capture any cross-layer fluctuations that might be present in random neural networks. We can see this in fig. (1) where the black dashed lines denote the prediction of this particular mean field approximation. Note that while changes to the variance are correctly predicted, fluctuations are absent. Both Poole et al. (2016) and Schoenholz et al. (2017) study this particular factorial approximation to the full joint distribution, eq. (9). Additionally, the composition kernels of Daniely et al. (2016, 2017) can be viewed as studying correlation functions in this mean-field formalism over a broader class of network topologies.
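For a single input, the factorial mean field approximation reduces to a deterministic recursion for the pre-activation variance from layer to layer. The sketch below follows the form of the variance map used by Poole et al. (2016), $q^{l} = \sigma_w^2 \, \mathbb{E}_{u \sim \mathcal{N}(0,1)}[\phi(\sqrt{q^{l-1}}\,u)^2] + \sigma_b^2$, with the expectation evaluated by Gauss-Hermite quadrature. The function names and the number of quadrature nodes are assumptions made for illustration.

```python
import numpy as np

# Nodes and weights for the weight function exp(-x^2 / 2), normalized so
# that the weights sum to one (i.e. an average over a standard Gaussian).
_X, _W = np.polynomial.hermite_e.hermegauss(61)
_W = _W / np.sqrt(2 * np.pi)

def variance_map(q, phi, sigma_w, sigma_b):
    """One step of the mean field variance recursion."""
    return sigma_w**2 * np.sum(_W * phi(np.sqrt(q) * _X) ** 2) + sigma_b**2

def iterate_variance(q0, phi, sigma_w, sigma_b, depth):
    """Variance in each layer, starting from q0 at the input."""
    qs = [q0]
    for _ in range(depth):
        qs.append(variance_map(qs[-1], phi, sigma_w, sigma_b))
    return np.array(qs)
```

The recursion is deterministic, which is precisely why it predicts the variance but carries no information about fluctuations between layers.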
The mean field theory of Poole et al. (2016) is analytically tractable for arbitrary activation functions and so it is interesting to study. However, the explicit independence assumption makes it an uncontrolled approximation, especially when generalizing to neural network topologies that are not fully connected feed-forward networks. Additionally, there are many interesting questions that one might wish to ask about correlations between pre-activations in different layers of random neural networks. Finally, it is unclear how to move beyond a mean field analysis in this framework. To overcome these issues, we pursue a more principled solution to eq. (9) by considering a controlled expansion for large network width.
While results involving the distribution of pre-activations resulting from a single input are an interesting first step, we know from Poole et al. (2016), Schoenholz et al. (2017), and Daniely et al. (2016) that correlations between the pre-activations due to different inputs are important when analyzing notions of expressivity and trainability. We therefore believe that extending these results to nontrivial datasets will be fruitful. To this end, it might be useful to take inspiration from the spin-glass community and seek to rephrase eq. (9) in terms of an overlap and to look for replica-symmetry breaking.
4 The Stochastic Neural Network On A Ring
With the stochastic neural network defined in eq. (9), we consider a specific network topology that is unusual in machine learning but is commonplace in physics. In particular, as in the Ising model described above, we consider a stochastic network whose final layer feeds back into its first layer. Since this topology is incompatible with a loss defined in terms of network inputs and outputs, we drop the data term of the loss in this case.
A schematic of this network can be seen in fig. (2). The substantial advantage of considering this periodic topology is that we can neglect the effect of boundary conditions and focus on the “bulk” behavior of the network. The boundary effects can be taken into account once a theory for the bulk has been established. This method of dealing with lattice models is extremely common. We additionally set and independent of layer.
The stochastic network on a ring is described by the energy,
subject to the identification . We will call this lattice model the stochastic neural network on a ring. For the remainder of this paper we will consider systematic approximations to eq. (11).
5 Linear Stochastic Neural Networks
To gain intuition for the stochastic network on a ring we will begin by considering a linear network with $\phi(z) = z$. In this case it is clear that the energy in eq. (11) is isotropic. It is therefore possible to change variables into hyper-spherical coordinates and integrate out the angular part explicitly (which will give a constant factor that may be neglected). Consequently, the energy for the stochastic linear network is given by (see appendix 8.2),
A controlled approximation to eq. (12) as $N \to \infty$ can be constructed using the Laplace approximation (sometimes called the saddle point approximation). The essence of the Laplace approximation is that integrals of the form $\int dx\, e^{-N f(x)}$ can be approximated by $e^{-N f(x^*)}$ as $N \to \infty$, where $x^*$ minimizes $f(x)$. Consequently, we first seek a minimum of eq. (12) to expand around.
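The Laplace approximation is easy to check numerically in one dimension. The sketch below uses a hypothetical test function `f` with a known minimizer, chosen only for illustration, and compares a direct numerical integral against the saddle point estimate including the standard Gaussian prefactor $\sqrt{2\pi / (N f''(x^*))}$.

```python
import numpy as np

def f(x):
    """Illustrative test function with its minimum at x* = 1."""
    return 0.3 + 0.5 * (x - 1.0) ** 2 + 0.25 * (x - 1.0) ** 4

X_STAR, F_STAR, F_PP = 1.0, 0.3, 1.0   # minimizer, f(x*), f''(x*)

def integral(N, lo=-9.0, hi=11.0, n=400_001):
    """Riemann-sum estimate of the integral of exp(-N f(x)) over [lo, hi]."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return np.sum(np.exp(-N * f(x))) * dx

def laplace(N):
    """Saddle point estimate with the Gaussian fluctuation prefactor."""
    return np.exp(-N * F_STAR) * np.sqrt(2 * np.pi / (N * F_PP))
```

As $N$ grows, `-np.log(integral(N)) / N` approaches `F_STAR` and the ratio `integral(N) / laplace(N)` approaches one, with a relative error that shrinks like $1/N$.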
where measures the distance to criticality. This solution can be tested by generating many instantiations of stochastic linear networks and then computing the average norm of the pre-activations after the transient from the input has decayed.
In fig. (3) we plot the empirical norm measured in this way against the theoretical prediction. We see excellent agreement between the numerical result and the theory. (Footnote 1: Note that while we are measuring the average norm of the linear stochastic network, we are predicting the saddle point value, which is the mode of the distribution. However, these quantities are equal in the large $N$ limit of the Laplace approximation.)
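A numerical test of this kind can be sketched as follows. This is an illustration rather than the authors' experiment: it assumes weights with variance $\sigma_w^2/N$ and biases with variance $\sigma_b^2$, tracks the squared norm per neuron $q = |z|^2/N$ instead of the norm itself, and compares against the elementary fixed point $q^* = \sigma_b^2/(1 - \sigma_w^2)$ of the exact relation $\mathbb{E}[q^l \mid q^{l-1}] = \sigma_w^2 q^{l-1} + \sigma_b^2$, which holds for $\sigma_w^2 < 1$. The function names are hypothetical.

```python
import numpy as np

def linear_network_norms(N, depth, sigma_w, sigma_b, seed=0):
    """Squared norm per neuron in each layer of one random linear network."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=N)
    q = np.empty(depth)
    for l in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N)
        z = W @ z + b            # linear network: phi(z) = z
        q[l] = z @ z / N
    return q

def fixed_point(sigma_w, sigma_b):
    """Stationary value of the variance recursion, valid for sigma_w^2 < 1."""
    return sigma_b**2 / (1.0 - sigma_w**2)
```

Discarding the first layers before averaging removes the transient from the input; the remaining layer-to-layer scatter around the fixed point is the fluctuation that the Laplace expansion is designed to describe.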
Nonuniform fluctuations around the minimum can now be computed. Let and expand the energy to quadratic order in . Writing we find that (see appendix 8.2),
As in the work of Poole et al. (2016), here we also approximate the behavior of the full joint distribution by a Gaussian. However, the Laplace approximation retains the coupling between layers and therefore is able to capture inter-layer fluctuations.
Together eq. (13) and eq. (14) fully characterize the behavior of the linear stochastic network as $N \to \infty$. By expanding beyond quadratic order, higher-order corrections in $1/N$ can be computed. One application of this would be to reprise the analysis of signal propagation in deep networks in Schoenholz et al. (2017), but for networks of finite rather than infinite width.
As our network is topologically equivalent to a ring, we can perform a coordinate transformation of eq. (14) to the Fourier basis by writing . To respect the periodic boundary conditions of the ring, will be summed from 0 to in units of . It follows that (see appendix 8.2),
The Fourier transformation therefore diagonalizes eq. (14) and so we predict that the different Fourier modes ought to be distributed as independent Gaussians. Since the variance of each mode is positive for , the optimum that we identified in eq. (13) is indeed a minimum.
This calculation gives very precise predictions about the behavior of pre-activations in wide, deep stochastic networks.
To test these predictions we generate samples from linear stochastic networks of width and depth . For each sample we take the norm of the pre-activations in the last layers of the network and compute the fluctuation of the pre-activation around (eq. (13)). For each sample we then compute the FFT of the norm of the pre-activations. Finally, we compute the variance of each Fourier mode (for more details and plots see appendix 8.3). We plot the results of this calculation in fig. (4) for different values of . In each case we see strong agreement between our numerical experiments and the prediction of our theory. Note that the factorial Gaussian approximation discussed briefly above is unable to capture these fluctuations.
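The measurement described here can be sketched in a few lines. The function below assumes the per-layer norms have already been collected into an array with one row per sampled network, and that the saddle point value is supplied by the caller; the name `mode_variances` and the unitary FFT normalization are choices made for this illustration.

```python
import numpy as np

def mode_variances(norm_samples, r_star):
    """Variance of each Fourier mode of the norm fluctuations.

    norm_samples: array of shape (n_samples, depth), per-layer norms.
    r_star: the saddle point value the fluctuations are measured around.
    """
    eps = norm_samples - r_star
    # Unitary normalization, so uncorrelated noise of variance s^2 gives
    # a flat spectrum equal to s^2.
    eps_q = np.fft.fft(eps, axis=1) / np.sqrt(eps.shape[1])
    return np.mean(np.abs(eps_q) ** 2, axis=0)
```

A flat spectrum would indicate layers that fluctuate independently; any dependence on the mode index is a signature of the inter-layer coupling that the factorial approximation discards.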
The long wavelength behavior of fluctuations in the deep linear network is well described by an effective field theory. This effective field theory can be constructed by expanding eq. (15) to quadratic order in , approximating sums by integrals and differences by derivatives. We find that the effective field theory is defined by the energy (see appendix 8.2),
We note that this field theory features explicitly as well as symmetry. Perhaps expectedly this implies that information can equally travel forward and backwards through the network.
Both the effective field theory and the lattice model have long wavelength fluctuations that are given by the limit of eq. (15),
Given this equation we can read off the length-scale governing fluctuations to be We therefore see that stochastic linear networks feature a phase transition at with an accompanying diverging depth-scale in the fluctuations.
6 Rectified Linear Stochastic Neural Networks
Having discussed the linear stochastic neural network we now move on to the more complicated case of the stochastic neural network on a ring with rectified linear activations, $\phi(z) = \max(0, z)$. Again we seek to construct the Laplace approximation to eq. (11).
In this case we notice that the norm squared of any $z^l$ decomposes into two terms, $|z^l|^2 = |z^l_+|^2 + |z^l_-|^2$. Here, $z^l_+$ and $z^l_-$ are the vectors of positive and negative components of $z^l$ respectively. With this decomposition, the energy for the rectified linear stochastic neural network can be written as,
The integral over each $z^l$ can be decomposed as a sum of integrals over each of the different orthants. In each orthant, the set of positive and negative components of $z^l$ is fixed; consequently, we may apply independent hyperspherical coordinate transformations to $z^l_+$ and to $z^l_-$ within each orthant.
With this in mind, let $k$ be the number of positive components of $z^l$ in a given orthant with the remaining $N - k$ components being negative. It is clear that the number of orthants with $k$ positive components will be $\binom{N}{k}$. The partition function for the rectified linear network can therefore be written as (see appendix 8.4),
Here and is the norm of the positive and negative components of the pre-activations respectively. In the limit, the sum over orthants can be converted into an integral and the functions can be approximated using Stirling’s formula. We therefore see that, unlike in the case of linear networks, the lattice model for rectified linear networks contains three interaction fields, , , and .
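The Stirling step can be made explicit for the orthant count. Assuming the number of orthants with $k$ positive components out of $N$ is the binomial coefficient, and writing $p = k/N$ for the fraction of positive components (assumed notation), the leading large-$N$ behavior is sketched below, with subexponential prefactors suppressed in the second line.

```latex
\begin{aligned}
\ln \binom{N}{k} &= -N\bigl[\,p\ln p + (1-p)\ln(1-p)\,\bigr] + O(\ln N),
\qquad p = \frac{k}{N}, \\
\sum_{k=0}^{N} \binom{N}{k}\,(\cdots)
&\;\longrightarrow\;
N \int_0^1 dp\; e^{-N\left[\,p\ln p + (1-p)\ln(1-p)\,\right]}\,(\cdots).
\end{aligned}
```

The entropy term on its own is maximized at $p = 1/2$, so the fraction of positive components enters the energy as a continuous field on the same footing as the norms.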
As in the linear case, we can now construct the Laplace approximation for this network. We first make an ansatz of a constant solution, and , independent of the layer . Solving for the minimum of the energy we arrive at the following saddle point conditions (see appendix 8.4),
Perhaps this result should not be surprising given the symmetry of the random weights. We expect that in the limit the network will settle into a state where half the pre-activations are negative and half the pre-activations are positive.
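The symmetry argument can be checked directly by simulation. The sketch below is illustrative only: it assumes zero-mean Gaussian weights with variance $\sigma_w^2/N$ and biases with variance $\sigma_b^2$, and the function name is hypothetical. It propagates a signal through a random rectified linear network and records the fraction of positive pre-activations in each layer.

```python
import numpy as np

def relu_fraction_positive(N, depth, sigma_w, sigma_b, seed=0):
    """Fraction of positive pre-activations in each layer of a random ReLU network."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=N)
    fracs = np.empty(depth)
    for l in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N)
        z = W @ np.maximum(z, 0.0) + b   # rectified linear activation
        fracs[l] = np.mean(z > 0)
    return fracs
```

Conditioned on the previous layer, each pre-activation is a zero-mean Gaussian under these assumptions, so the fraction concentrates around one half with fluctuations of order $1/\sqrt{N}$.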
We can test the results of this prediction in fig. (5) by sampling instances of 1024 layer deep rectified linear stochastic neural networks with . As in the case of the deep linear stochastic network we see excellent agreement between theory and numerical simulation.
Nonuniform fluctuations around the saddle point can once again be computed. To do this we write and . We now expand the energy and make the substitutions and to find an energy cost for fluctuations (see appendix 8.4),
We can understand some of these fluctuations in an intuitive way; for example, fluctuations in the fraction of positive components and the norm of the negative components are anti-correlated. But in general rectified linear networks have subtle and interesting fluctuations, and to our knowledge this work presents the first quantitative theoretical description of the statistics of random rectified linear networks. We note in passing that the fully factorial mean field theory would not be able to capture any of the anisotropy in the fluctuations identified here.
As in the linear case, the layer-layer coupling can be diagonalized by moving into Fourier space. In the rectified linear case, however, this transformation retains covariance between the different fluctuations. In particular, we can write the energy in Fourier space as where is a vector of fluctuations and
is the Fourier space inverse covariance matrix between different fields (see appendix 8.4).
We can compare our theoretical predictions for the covariance matrix against numerical results generated in an analogous manner to the linear case.
The results of this comparison can be seen in fig. (6) for different elements of the covariance matrix and different values of . As in the linear case we see excellent agreement between the theoretical predictions and the numerical simulations. Finally, we can complete our analysis by computing an effective field theory that governs long wavelength fluctuations (see appendix 8.4).
Once again we can identify an effective field theory that governs long wavelength fluctuations. We find that it is given by (see appendix 8.4),
Note that unlike in the case of the stochastic linear network both the and
symmetries are broken when acting on any given field. This symmetry breaking makes sense since the network treats the different fields quite asymmetrically and the forward and backward propagation dynamics are quite different. In physics, the symmetries and symmetry breaking have been shown to dictate the behavior of systems over large regions of their parameters. Thus, as in Landau theory, many systems are classified based on the symmetries they possess. The presence of this symmetry breaking between linear networks and rectified-linear networks suggests that such an approach might be fruitfully applied to neural networks. As with the deep linear network, the long-wavelength limit of the effective field theory and the lattice model agree.
Here we have shown that for fully-connected feed forward neural networks there is a correspondence between random neural networks and lattice models in statistical physics. While we have not discussed it here, this correspondence actually holds for a very large set of neural network topologies. Lattice models can also be constructed for ensembles of random neural networks that have weights and biases whose distributions are more complicated than factorial Gaussian. In general, the effect of nontrivial network topology and correlations between weights will be to couple spins in the lattice model. Thus, the topology of the neural network will generically induce a topology of the corresponding lattice model. For example, convolutional networks will have corresponding lattice models that feature interactions between the set of all the pre-activations in a given layer that share a filter.
As in physics, it seems likely that lattice models for complex neural networks will be fairly intractable compared to the relatively simple examples presented here. On the other hand, the success of effective field theories at describing the long wavelength fluctuations of random neural networks suggests that even complex networks may be tractable in this limit. Moreover, as neural networks get larger and more complex the behavior of long wavelength fluctuations will become increasingly relevant when thinking about the behavior of the neural network as a whole.
We believe it is likely that there exist universality classes of neural networks whose effective field theories contain the same set of relevant operators. Classifying neural networks in this way would allow us to make statements about the behavior of entire classes of networks. This would transition the paradigm of neural network design away from specific architectural decisions towards a more general discussion about which class of models was most suitable for a specific problem.
Finally, we note that there has been significant effort made to understand biological neural activity leveraging similar analogies to lattice models and statistical field theory. Notably, Schneidman et al. (2006) noticed that Ising-like models can quantitatively capture the statistics of neural activity in vertebrate retina; Buice and Chow (2013) developed field theoretic extensions to older mean-field theories of populations of neurons; far earlier, Ermentrout and Cowan (1979) used similar techniques to investigate how hallucinations between two similar patterns might come about. By placing artificial neural networks into the context of field theory it may be possible to find subtle relationships with their biological counterparts.
8.1 Proof of the Main Result
In this section we prove the main result of the paper. We do so in two steps. First we examine the partition function,
and introduce the pre-activations at the cost of adding $\delta$-function constraints. We use the Fourier representation of these constraints to bring them into the exponent. This requires introducing auxiliary variables that enforce the constraints. Once in this form it becomes apparent that the weights and biases are Gaussian distributed and may therefore be integrated out explicitly. Finally we integrate out the constraints that we introduced in the preceding step to convert the distribution into a distribution over the pre-activations alone.
The partition function for the maximum entropy distribution of a fully-connected feed-forward neural network can be written as,
where we have let for notational convenience.
We can repeat this process iteratively until all of the pre-activations have been introduced. We find,
where we will use interchangeably for notational simplicity. This procedure has essentially used a change of variables to introduce the pre-activations explicitly into the partition function.
Here, $\delta$-functions constrain the pre-activations to their correct values given the weights. To complete the proof we leverage the Fourier representation of the $\delta$-function as $\delta(x) = \frac{1}{2\pi} \int d\lambda\, e^{i \lambda x}$. In particular we introduce one Fourier-space auxiliary variable for each pre-activation constraint. We therefore find,
as required. ∎
Provided , the weights, biases, and fictitious fields can be integrated out of eq. (25) to give a stochastic process involving only the pre-activations as,
where is a vector of pre-activations corresponding to each input to the network, if , and is the correlation matrix between activations of the network from different inputs.
We proceed directly by completing the square and integrating out Gaussian variables. For notational simplicity we temporarily let and be linear. We then integrate out the weights and biases by completing the square,
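The elementary identity behind this step is the Gaussian integral with an imaginary linear term. It is written here for a single scalar weight in assumed notation: $W$ for the weight, $s^2$ for its variance, $\lambda$ for the auxiliary Fourier variable, and $h$ for the activation that multiplies the weight.

```latex
\begin{aligned}
\int_{-\infty}^{\infty} dW\, \exp\!\left(-\frac{W^2}{2 s^2} + i\,\lambda h\, W\right)
&= \int_{-\infty}^{\infty} dW\, \exp\!\left(-\frac{\bigl(W - i s^2 \lambda h\bigr)^2}{2 s^2}
   - \frac{s^2 \lambda^2 h^2}{2}\right) \\
&= \sqrt{2\pi s^2}\; \exp\!\left(-\frac{s^2 \lambda^2 h^2}{2}\right).
\end{aligned}
```

Applying this to every weight and bias leaves an exponent that is quadratic in the auxiliary variables, so they can in turn be integrated out as Gaussians, which is what produces a distribution over the pre-activations alone.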