Unified Field Theory for Deep and Recurrent Neural Networks

by   Kai Segadlo, et al.

Understanding capabilities and limitations of different network architectures is of fundamental importance to machine learning. Bayesian inference on Gaussian processes has proven to be a viable approach for studying recurrent and deep networks in the limit of infinite layer width, n→∞. Here we present a unified and systematic derivation of the mean-field theory for both architectures that starts from first principles by employing established methods from statistical physics of disordered systems. The theory elucidates that while the mean-field equations are different with regard to their temporal structure, they yet yield identical Gaussian kernels when readouts are taken at a single time point or layer, respectively. Bayesian inference applied to classification then predicts identical performance and capabilities for the two architectures. Numerically, we find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks and the convergence speed depends non-trivially on the parameters of the weight prior as well as the depth or number of time steps, respectively. Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in 1/n. The formalism thus paves the way to investigate the fundamental differences between recurrent and deep architectures at finite widths n.


page 1

page 2

page 3

page 4


A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks

We develop a mathematically rigorous framework for multilayer neural net...

Non-Gaussian processes and neural networks at finite widths

Gaussian processes are ubiquitous in nature and engineering. A case in p...

The edge of chaos: quantum field theory and deep neural networks

We explicitly construct the quantum field theory corresponding to a gene...

Towards a General Theory of Infinite-Width Limits of Neural Classifiers

Obtaining theoretical guarantees for neural networks training appears to...

Nonlinear Hawkes Process with Gaussian Process Self Effects

Traditionally, Hawkes processes are used to model time–continuous point ...

Dynamic mean field programming

A dynamic mean field theory is developed for model based Bayesian reinfo...

Mean-field inference methods for neural networks

Machine learning algorithms relying on deep neural networks recently all...

1 Introduction

Deep learning has brought a dramatic improvement of the state-of-the-art in many fields of data science, ranging from speech recognition and translation to visual object classification [1, 2, 3, 4]. Any progress in the empirically-driven improvement of algorithms must be accompanied by a profound understanding of why and how deep learning works. Such an understanding is needed to provide guarantees, for example about the accuracy and the robustness of the networks, and will help preventing the frequently reported failures of deep learning, such as its vulnerability to adversarial examples [5].

A common method to obtain analytical insight into deep networks is to study the overparametrized limit in which the width of all layers tends to infinity. In this limit, it has been shown with mean-field theory that under a Gaussian prior on the weights in each layer, the pre-activations follow a Gaussian process with an iteratively determined covariance [6, 7, 8, 9]

; in particular, the pre-activations across different layers and across different neurons become independently Gaussian distributed. This approach allows one to investigate learning and prediction in the framework of Bayesian inference


Often, analogies are drawn between deep neural networks (DNNs) and discrete-time recurrent neural networks (RNNs): Unrolling time in RNNs formally converts them to DNNs, however with shared weights

across layers of identical size

. This led to parallel developments in terms of training strategies for both architectures, such as backpropagation

[10] and backpropagation through time [11].

There are, however, a number of open issues when applying mean-field theory to deep and recurrent neural networks. First of all, the approximation as a Gaussian process relies on the central limit theorem and is thus strictly valid only in the limit of infinite layer widths

. Moreover, due to weight sharing, pre-activations for different points in time are not statistically independent in RNNs; the central limit theorem is thus not applicable and the mean-field approximation becomes uncontrolled. Several studies still find that the mean-field theories of DNNs and RNNs appear to be closely related, culminating in ref. [12]

which formulates a variety of network architectures as tensor programs and finds that most common network architectures, under certain conditions on the non-linearities and piors, converge in distribution to a Gaussian process. But the relationship between the Gaussian processes for RNNs and DNNs has so far not been addressed.

Currently, there is no systematic approach that would allow one to simultaneously study DNNs and RNNs and that would be extendable to finite layer width ; the agreement of the mean-field predictions with the performance of finite-size networks is based on numerical evidence so far. Furthermore, in the limit of infinite width the number of trainable parameters of a DNN, , and of an RNN, , both tend to infinity and do not enter explicitly in the result of the Gaussian approximation. The Gaussian process thus has limited capability of quantifying the expressivity of neural networks in relation to the required resources, such as the number of trained weights. Studies on finite-size corrections beyond the limit are so far restricted to DNNs [13, 14, 15, 16, 17, 18]. Understanding the limits of the putative equivalence of DNNs and RNNs on the mean-field level requires a common theoretical basis for the two architectures that would extend to finite and finite .

To overcome these limitations, we here combine the established view of Bayesian inference by Gaussian processes [19] with the emerging methods of statistical field theory applied to neural networks [20, 21, 22, 23, 24, 25]. The latter methods have been developed in the field of disordered systems, which are systems with random parameters, such as spin glasses [26, 27, 28]. These methods are able to extract the typical behavior of a system with a large number of interacting components. For example, this approach has recently been used to characterize the typical richness, represented by the entropy, of Boolean functions computed in the output layer of DNNs, RNNs, and sparse Boolean circuits [29].

Concretely, in this paper, we present a systematic derivation of the mean-field theories for DNNs and RNNs that is based on the well-established approach of field theory for recurrent networks [20, 30, 24], which allows a unified treatment of the two architectures [29]. This paves the way for extensions to finite , enabled by a rich set of systematic methods available in the mathematical physics literature to compute corrections beyond the leading order [31, 32]. Already to leading order, we find that the mean-field theories of DNNs and RNNs are in fact qualitatively different with regard to correlations across layers or time, respectively. The predictive distribution in Bayesian training is therefore in general different between the two architectures. Nonetheless, the structure of the mean-field equations can give rise to the same Gaussian processes kernel in the limit of infinite width for both DNNs and RNNs if the readout in the RNN is taken from a single time step. This finding holds for single inputs, as pointed out in ref. [29]

, as well as input sequences. Furthermore, for a point-symmetric activation function

[29], there is no observable difference between DNNs and RNNs on the mean-field level if the biases are uncorrelated in time and the input is only supplied in the first time step.

2 Theoretical background

2.1 Bayesian supervised learning

First, we briefly review the Bayesian approach to supervised learning

[33]. Let be a model (here DNN or RNN) that maps inputs to outputs and that depends on a set of parameters . Conventional training of such a model corresponds to finding a particular parameter set that maximizes the likelihood for some given training data , with and . A prediction for the output caused by an unseen test input is then given by . In the Bayesian view, one instead assumes a prior distribution of parameters to obtain, via Bayes’ rule, an entire posterior distribution of the parameters


The conditioning on the training data in can be interpreted as selecting, among all possible parameter sets given by the prior , those parameter sets that accomplish the mapping . A Bayesian prediction for some unseen test input correspondingly results from marginalizing the likelihood over the posterior distribution of the parameters


Inserting eq:param_posterior yields the predictive distribution


that depends on the model-dependent network priors


The network priors encompass all input-output relationships which are compatible with the prior and the model. The difference between the two network priors, eq:network_prior and eq:network_prior_with_test, is the information on the additional test input and output .

Note that the Bayesian approach to supervised learning can also be used for input sequences with . To this end, it is sufficient to replace and in the above formulas.

In the following, we use a field theoretic approach to calculate the network priors for both deep and recurrent neural networks. Conditioning on the training data, eq:output, then yields the Bayesian prediction of the output.

2.2 Network architectures

Deep feedforward neural networks (DNNs) and discrete-time recurrent neural networks (RNNs) can both be described by a set of pre-activations

that are determined by an affine linear transformation


of activations . The pre-activations are transformed by an activation function

which is applied element-wise to vectors. For DNNs,

denotes the weight matrix from layer to layer , and represents biases in layer . Inputs are typically only applied to the first layer such that the input matrices vanish for . For RNNs, the index denotes different time steps. The weight matrix, input matrix, and biases are static over time, , , and , and couple activities across successive time steps. For both architectures, we include an additional input and output layer


with , which allow us to set independent input and output dimensions. Here, denotes the final layer for the DNN and the final time point for the RNN. The set of trainable parameters is the collection of and .

2.3 Parameter priors

We use Gaussian priors for all model parameters, that is for the RNN


and for the DNN


as well as


for both architectures (where for the RNN). These priors on the parameters are used to calculate the network prior .

3 Unified field theory for RNNs and DNNs

The network prior

, eq:network_prior, is a joint distribution of all outputs

, each corresponding to a single training input . Its calculation is tantamount to a known problem in physics, the replica calculation [34, 31]. Here, each replicon is a copy of the network with the same parameters but a different input . For simplicity, in the following we illustrate the derivation of for a single input that is presented to the first layer of the DNN or at the first time point for the RNN, respectively. We present the more cumbersome but conceptually similar general case of multiple inputs, or multiple input sequences, in app:replicated-Field-theory-DNN-RNN.

The network prior is defined as the probability of the output given the input,


marginalized over the parameter prior, where


follows by enforcing the set of equations, eq:eq_of_motion to eq:output_layer, using Dirac constraints. Throughout the manuscript, we use the abbreviation .

3.1 Marginalization of the parameter prior

From eq:Ansatz, it follows that the computation of the marginalization of the parameters in eq:prior_single_input can be reduced to


To proceed, it is advantageous to represent the Dirac -distributions as Fourier integrals,


with the inner product and , because it leads to averages of the form which are analytically solvable. Using for , the network prior for a single replicon, , takes the form (details in app:replicated-Field-theory-DNN-RNN)


where and . The exponent , commonly called action, is given by




is the action of the input and the recurrent layer of the RNN or the inner part of the DNN, respectively, and


is the action for the output layer. Note that is diagonal in neuron indices with respect to the explicitly appearing fields and and couplings across neurons are only mediated by terms of the form .

For RNNs, the shared connectivity and biases at different time points imply correlations across time steps; for DNNs, in contrast, the connectivity and biases are realized independently across layers, so that the action decomposes into a sum of individual layers. In eq:S_rec, this leads to


which is the only difference between DNN and RNN in this formalism.

3.2 Auxiliary variables

An action that is quadratic in and corresponds to a Gaussian and therefore to an analytically solvable integral. However, the post-activations in and introduce a non-quadratic part and the terms cause a coupling across neurons. To deal with this difficulty, we introduce new auxiliary variables


where , a common practice originating from dynamic spin-glass theory [35] and used for random networks [24, 23, 25, 36]. The second term in eq:def_aux_field contains the sum of post-activations over all neuron indices. Assuming sufficiently weak correlations among the , we expect the sum to be close to its mean value with decreasing variations as grows; for large the sum is thus close to a Gaussian. This intuition is made precise below by a formal saddle point approximation in .

Enforcing the auxiliary variables through Dirac- constraints, analogous to eq:Dirac_Fourier (see app:replicated-Field-theory-DNN-RNN for details), leads to the prior distribution


where the distribution is described by the action


Here, with is the size of a layer in the DNN relative to the size of the RNN. The recurrent part and the input, which decouple in the neurons, are together described for the DNN by



a centered Gaussian with layer-dependent variance

and for the RNN by


with and a centered Gaussian across time with covariance matrix .

3.3 Saddle-point approximation

The factor in eq:RNN_decoupled_action, which stems from the decoupling across neurons, for large leads to a strongly peaked distribution of and . Therefore we can use a saddle point approximation to calculate the average over and in eq:single_output_prob. In the limit this approximation becomes exact.

We thus search for stationary points of the action, and , which yields a coupled set of self-consistency equations for the mean values and , commonly called mean-field equations:

, which follows from the normalization of the probability distribution

[37], and


with . eq:MFT comprises both DNN and RNN; the difference between eq:W_rec_DNN and eq:W_rec_RNN leads to the appearance of on the r.h.s. The average on the r.h.s. has to be taken with respect to a theory that only includes two layers or time points. This is due to the marginalization property of the Gaussian distribution of the pre-activations , which results from inserting the saddle-point solutions eq:MFT for and Accordingly, we are left with a closed system of equations for the saddle-point values that are the layer- or time-dependent correlations. These equations need to be solved recursively from the input to the output .

3.4 Network prior

Computing the Gaussian integral over in the saddle-point approximation of eq:single_output_prob, one obtains the distribution of the outputs as independent Gaussians across neurons


An analogous calculation for multiple input sequences (see app:replicated-Field-theory-DNN-RNN) yields the equivalent mean-field equations


with and . These lead to the joint network prior


where the covariance matrix is the Gram matrix of the kernel [19],


Here denotes the -th row of the output matrix that comprises the output of neuron to all input sequences .

In principle, it is also possible to use independent biases or input weights across time steps in the RNN. This would lead to the respective replacements and in eq:MFT_replicated.

3.5 Predictive distribution

We split into training data (indexed by subscript ) and test data (indexed by subscript ). The conditioning on the training data via eq:output can here be done analytically because the network priors are Gaussian [19]. For scalar inputs, this yields the predictive distribution




which are fully determined by the kernel matrix . For input sequences, it is again sufficient to replace and .

4 Comparison of RNNs and DNNs

Figure 1: Mean-field theory for DNN and RNN with a single input. a) Average variance in mean-field theory

(eq:MFT; solid gray curve) and estimate

from simulation, averaged over realizations of networks, for biases that are uncorrelated across time/layers (blue crosses RNN; orange dots DNN). b) Cross-covariance as a function of the hidden layer index for fixed

and uncorrelated biases. RNN: Mean-field theory (solid dark blue and dark magenta). Mean (blue / purple dots) and standard error of the mean (light blue / light purple tube) of

estimated from simulation of network realizations. DNN: Mean (orange dots) and standard error of the mean of estimated from simulation of network realizations. Other parameters , , finite layer width ,

hidden layers, ReLU activation

and Gaussian inputs with . c) Same as a) but for biases that are static across time/layers. d) Same as b) but for the static bias case.

Above, we derived the mean-field equations (33) for the kernel matrix using a field-theoretic approach. Here, we investigate differences in the mean-field distributions of the different network architectures, starting with the kernel and considering the predictive distribution afterwards.

4.1 Kernel

The diagonal elements, for the single-input case in eq:MFT and equivalently for the multiple-input-sequences case in eq:MFT_replicated, are identical for RNNs and DNNs, because for both architectures. This implies that the equal-time or within-layer statistics, correspondingly, is the same in both architectures. The reason is that the iterations eq:MFT and eq:MFT_replicated for equal-time points form closed sets of equations; they can be solved independently of the statistics for different time points . Formally, this follows from the marginalization property of the Gaussian, which implies that any subset of a multivariate Gaussian is Gaussian, too, with a covariance matrix that is the corresponding sector of the covariance matrix of all variables [19]. The precise agreement of this mean-field prediction with the average correlation estimated from direct simulation is shown in fig:Mean-field-theory-for-DNN-RNNa and c for the single-input case for both uncorrelated (a) and static biases (c) across time or layers, respectively.

A notable difference between RNN and DNN is that activity in the RNN is correlated across time steps due to the shared weights, even if biases are uncorrelated in time, as shown in fig:Mean-field-theory-for-DNN-RNNb. Static biases simply strengthen the correlations across time steps (see fig:Mean-field-theory-for-DNN-RNNd). For DNNs, in contrast, cross-layer correlations only arise due to biases that are correlated across layers, because weights are drawn independently for each layer. This is shown in fig:Mean-field-theory-for-DNN-RNNb and d: Correlations vanish for DNNs in the uncorrelated bias case (b) and take on the value , the variance of the bias, in the static bias case (d). Again, the mean-field theory accurately predicts the non-zero correlations across time in the RNN as well as the correlations across layers generated by the correlated biases in the DNN. In the RNN, temporal correlations show a non-trivial interplay due to the shared weights across time. We observe an instability that can build up by this mechanism in finite-size RNNs, even in parameter regimes that are deemed stable in mean-field theory (see app:RNN_instability, fig:finite_size_effect).

In a particular case, the correlations across time steps also vanish for the RNN: we show by induction that off-diagonal elements vanish for point-symmetric activation functions if inputs are only provided in the initial time step, , and the bias is absent, (or uncorrelated across time steps). Assuming that , we have


with and . Hence, if the pre-activations at points and are uncorrelated, also will be uncorrelated. The base case of the induction proof follows from the independence of the input weights and the recurrent weights : correlations between time point zero and other time points are zero. Therefore, by induction in time, time points will be uncorrelated at any point, meaning that for odd activations and the considered input layer, the solutions of the mean-field equations are the same for DNNs and RNNs.

4.2 Predictive distribution

Figure 2: Convergence of RNN and DNN towards the mean-field theory. Maximum mean-discrepancy

for a radial basis function kernel with length scale

[38] between the empirical distribution of scalar outputs and the Gaussian distribution with covariance matrix predicted by MFT eq:MFT_replicated. Empirical estimation across realizations . Average over realizations of , and (error bars showing standard error of the mean). ReLU activation . a) as a function of the width of the network layer for and (inset), with and and uncorrelated biases across time/layers. b) as a function of the depth or duration , for width , , and and uncorrelated biases. c) Same as a) but for biases that are static across time/layers. d) Same as b) but for the static bias case.

Coming back to the general case, we next ask if the different off-diagonal elements of the mean-field equations for RNN and DNN have observable consequences. The answer is no if a linear readout is taken at a single time point or layer , correspondingly (cf. eq:output_layer for the readout): This is a direct consequence of the identical diagonal elements of the covariance , so that the predictive distribution eq:predictive_dist_final for the RNN and the DNN is identical in mean-field theory; the two architectures have the same Gram matrix and thus the same predictive distribution eq:predictive_dist_final. This means that the two architectures have identical computational capabilities in the limit of infinite layer width.

To check how quickly the mean-field theory is approached by finite-size networks, we measure the maximum mean-discrepancy [38, 9] between the Gaussian distribution with covariance matrix and the empirical joint distribution of a set of scalar outputs , eq:output_layer, across realizations of and . The inputs are random patterns presented to the first layer or time step, respectively. We find that convergence is rather fast for both architectures (fig:Convergence-of-RNN-DNN a and c). For sufficiently deep architectures as well as both uncorrelated and static biases, RNNs systematically show a slower convergence than DNNs, which could be anticipated due to the smaller number of independently drawn Gaussian weights, versus . This observation is in line with the MMD being larger for the RNN than for the DNN for (fig:Convergence-of-RNN-DNN b and d). This is also consistent with the coherent interplay of shared connectivity and correlated activity across time steps in the RNN (see app:RNN_instability, fig:finite_size_effect). Overall, we find a faster convergence for uncorrelated biases than for biases that are static over time or layers, respectively.

The temporal correlations present in RNNs become relevant in the case of sequence processing. In such a setting, the network in each time step receives a time dependent input with a non-trivial temporal correlation structure that drives the temporal correlations of the RNN activations for , see eq:MFT_replicated. If the latter are read out in each time step, temporal correlations enter the kernel and thus influence task performance.

We finally note that we here use a separate readout layer. The realization of readout weights as independent Gaussian variables causes vanishing temporal correlations between the readouts and the activity in previous layers or time steps, respectively. For the Gaussian kernel, however, the presence or absence of a readout layer does not make any difference. Alternatively, the readout of signals could be taken from an arbitrary choice of neurons in the last layer or time step, respectively, leading to the same kernel.

5 Discussion

We present a unified derivation of the mean-field theory for deep (DNN) and recurrent neural networks (RNN) using field-theoretical methods. The derivation in particular yields the Gaussian process kernel that predicts the performance of networks trained in a Bayesian way.

The mean-field theories for the statistics within a layer of the DNN and for the equal-time statistics of the RNN are identical, even if temporally correlated input sequences are supplied to the latter network. The reason is that the mean-field equations (33) form a closed system of equations for this subset of the statistics; they can be solved independently of the correlations across time or layers, respectively. This justifies the ‘annealed approximation’ [28] for RNNs where the couplings are redrawn at each time step—which corresponds to the DNN-prior. It is also compatible with earlier work [39] which compares simulations of networks with tied weights (RNN) to the mean-field theory for untied weights (DNN). Intriguingly, the equivalence of the equal-time statistics implies that the predictive distributions of DNNs and RNNs are identical, given the readout is taken only from the final layer or the last time step, respectively.

There are qualitative differences between the mean-field theories for the correlations across time in the RNN and across layers of the DNN: correlations across layers vanish in the DNN, while the weight sharing in the RNN generally causes non-trivial correlations across time. For point-symmetric activation functions, these correlations also vanish in the RNN if the bias is absent (or uncorrelated across time steps) and the input is provided only in the first step. In general, a linear readout from activations that are taken across different time points or layers, respectively, yields different Gaussian process kernels for the RNN compared to the DNN.

Numerically, the convergence of finite-size networks of both architectures to the mean-field theory is generally fast. The RNN converges typically slower than the DNN, at least for long times and correspondingly deep networks. We hypothesize that the temporally correlated activity in the RNN is the cause: The realization of the coupling matrix is the same for all time steps. Also, fluctuations of the activity are coherent over time. Activity and connectivity therefore interact coherently over multiple time steps, so that variations of the connectivity across realizations may cause a corresponding variability of activity. In a DNN, in contrast, both activity and connectivity are uncorrelated across layers, so that variations due to different realizations of the couplings average out over layers.

Identical mean-field theories in the single-input case and for point-symmetric activation functions were already presented in ref. [29] in the context of a characterization of the space of Boolean functions implemented by randomly coupled DNNs and RNNs. Since our work differs on a conceptual level, the implications of the results differ: In the Bayesian inference picture, the equivalent mean-field theories imply identical performance of the trained networks for both architectures at large width; for the characterization of computed Boolean functions, the equivalent mean-field theories imply an equivalent set of functions implemented by any two random instances of the two architectures at large width. The conceptual difference leads to further differences on the technical level: The inputs and outputs considered here include analog values and they are presented not only to the first layer or time step, respectively, but also in a sequential manner at subsequent times or layers. Finally, the disorder average plays a subtle but fundamentally different role in the two works: In ref. [29], the disorder average extracts the typical behavior of any single, sufficiently large, instance of a randomly coupled network. In contrast, in the Bayesian framework considered in this manuscript, the disorder average naturally arises from the marginalization of the parameter prior, i.e., one here considers ensembles of random networks.

The analysis of RNNs and DNNs in this manuscript is based on methods from statistical field theory and our results are formulated in that language [31]. It is worth noting that this field-theoretical approach can be connected to a mathematically more rigorous approach based on large-deviation theory [40].

The main limitation of the presented results is their validity for networks with large widths. There has been previous theoretical work on networks of finite width that is, however, restricted to DNNs: Refs. [13, 14, 17, 41] have presented approaches based on perturbation theory, while refs. [15, 18] employed an Edgeworth expansion. The dynamics of the neural-tangent kernel for deep networks with finite width has been studied in ref. [16]. For specific deep networks of finite width with linear or ReLU activation functions the single-input prior was computed exactly in terms of the Meijer G function in ref. [42]. The formalism proposed here paves the way for a systematic study of generic deep and recurrent networks beyond the leading order in . Computing finite-size corrections in the presented formalism amounts to calculating fluctuation corrections of the auxiliary fields which is a standard task in field theory [31, 32]. It requires the spectrum of the Hessian , evaluated at the mean-field point. Such an approach yields small non-Gaussian corrections to the prior that, moreover, depend on . The corrections therefore provide quantitative insight into the limits of the equivalence between RNNs and DNNs at finite widths, and offer a handle to study the capacity of a network in relation to its resources, the number of weights.

We would like to thank Bo Li, Alexander Mozeika, and David Saad for bringing their related work to our attention. This work was partially supported by the European Union’s Horizon 2020 research and innovation program under Grant agreement No. 945539 (Human Brain Project SGA3), the Helmholtz Association Initiative and Networking Fund under project number SO-092 (Advanced Computing Architectures, ACA), the German Federal Ministry for Education and Research (BMBF Grant 01IS19077A), and the Excellence Initiative of the German federal and state governments (ERS PF-JARA-SDS005).

6 Appendix

6.1 Unified field theory for multiple input sequences

Here, we show the derivation of the mean-field equations with more than one input sequence , the generalization of the derivation presented in the main text. We introduce Greek indices for the different input vectors that we also call ‘replicas’ in the following. Equations for the single-replicon case in the main text can be obtained by setting ; the non-sequential input case follows by setting for and all .

6.1.1 Action and auxiliary variables

We start from the parameterized likelihood for multiple replicas

Expressing the Dirac distributions as integrals , we obtain for the network prior the expression