1 Introduction
Kernel methods and deep neural networks are currently two of the most remarkable machine learning methodologies, and recent years have witnessed much work on the connection between the two.
lee2017deep pointed out that randomly initializing the parameters of an infinitely wide network gives rise to a Gaussian process, which is referred to as a neural network Gaussian process (NNGP). Owing to the widespread appeal of this idea, NNGP studies have been extended to more types of networks, such as attention-based models hron2020infinite and recurrent networks yang2019:rnn. A Gaussian process is a classical nonparametric model. The equivalence between an infinitely wide fully-connected network and a Gaussian process was established by
neal1996:priors; lee2017deep. Given a fully-connected multilayer network whose parameters are randomly initialized i.i.d., the output of each neuron is an aggregation of the neurons in the preceding layer, whose outputs are i.i.d. When the network width goes to infinity, by the Central Limit Theorem
fischer2010history, the output of each neuron conforms to a Gaussian distribution. As a result, the output function expressed by the network is essentially a Gaussian process. The correspondence between neural networks and Gaussian processes allows exact Bayesian inference with the neural network
lee2017deep. Despite the achievements of current NNGP theory, one important limitation remains unaddressed. So far, the neural network Gaussian process is induced by increasing width, regardless of how many layers the network stacks. But in the era of deep learning, what concerns us most is depth and how depth shapes the behavior of a neural network, since depth is the major factor behind the power of deep learning. Although the current NNGP theory is elegant in form, it cannot accommodate this concern adequately. It is therefore necessary to expand the scope of the existing theory to cover depth. Specifically, our natural curiosity is what happens with an infinitely deep but finitely wide network. Can an NNGP be derived by increasing depth rather than width, thereby contributing to the true picture of deep learning? A positive answer would reconcile the success of deep networks with the elegance of the NNGP theory. Moreover, as a valuable addition, a depth-induced NNGP greatly enlarges the scope of the existing NNGP theory, which should open many doors for research and translation opportunities in this area.
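As a numerical illustration of the width-induced construction described above (our own sketch, not part of the paper), the following code samples many randomly initialized one-hidden-layer networks of large width and checks that the scalar output is approximately Gaussian, as the Central Limit Theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_draws = 2000, 5000          # "infinite" width approximated by a large value

x = rng.standard_normal(4)           # one fixed input
outs = np.empty(n_draws)
for i in range(n_draws):
    # i.i.d. random initialization, scaled so variances stay O(1)
    W1 = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
    b1 = rng.standard_normal(width)
    w2 = rng.standard_normal(width) / np.sqrt(width)
    # the output is a sum of `width` i.i.d. terms -> CLT gives a Gaussian limit
    outs[i] = w2 @ np.tanh(W1 @ x + b1)

z = (outs - outs.mean()) / outs.std()
# for a Gaussian, skewness and excess kurtosis are both near 0
print(abs((z**3).mean()), abs((z**4).mean() - 3))
```

The width, number of draws, and tanh activation are illustrative choices; any well-posed activation in the sense of Definition 2 behaves similarly.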
The above idea is well motivated by a width-depth symmetry consideration. Previously, lu2017expressive and hornik1989multilayer respectively proved that width-bounded and depth-bounded neural networks are universal approximators. fan2020quasi suggested that a wide network and a deep network can be converted into each other with a negligible error by using De Morgan's law. Since a certain symmetry exists between width and depth, it is quite probable that deepening a neural network under suitable conditions can lead to an NNGP as well. Along this direction, we investigate the feasibility of inducing an NNGP by depth (NNGP), with a network that has a topology of shortcuts, as shown in Figure 1. The characteristic of this topology is that the outputs of intermediate layers with a gap of are aggregated in the final layer, yielding the network output. Such a shortcut topology has been successfully applied to medical imaging you2019ct and computer vision fan2018sparse as a backbone structure.

An NNGP by width (NNGP) is obtained by summing the i.i.d. output terms of infinitely many neurons and applying the Central Limit Theorem. In contrast, for the topology in Figure 1, as the depth increases, the outputs of increasingly many neurons are aggregated. We constrain the random weights and biases so that the summed neurons become weakly dependent by virtue of their separation. Consequently, when going infinitely deep, the network illustrated in Figure 1 is also a function drawn from a Gaussian process, according to the generalized Central Limit Theorem under weak dependence (b1995:clt). Beyond the proposed NNGP, we theoretically prove that NNGP is uniformly tight and provide a tight bound on the smallest eigenvalue of the concerned NNGP kernel. From the former, one can determine properties of NNGP such as its functional limit and continuity, while the nontrivial lower and upper bounds mirror the characteristics of the derived kernel, which constitutes a fundamental cornerstone for the optimization and generalization theory of deep learning.
Main Contributions.
In this work, we establish the NNGP by increasing depth, in contrast to the mainstream NNGPs induced by increasing width. Our work substantially enlarges the scope of the existing elegant NNGP theory, making a stride towards understanding the true picture of deep learning. Furthermore, we investigate the essential properties of the proposed NNGP and its associated kernel, which lays a solid foundation for future research and applications. Lastly, we implement an NNGP kernel and apply it to regression experiments on two well-known data sets.

2 Preliminaries
Let be the set for an integer . Given a function , we denote by if there exist positive constants and such that for every ; if there exist positive constants and such that for every ; if there exist positive constants and such that for every . Let denote the matrix norm for the matrix . Throughout this paper, we employ the maximum spectral norm
as the matrix norm (meyer2000:norm), where denotes the th singular value of the matrix . Let denote the number of elements, e.g., . Finally, we provide several definitions for the characterization of inputs and parameters.

Definition 1
A data distribution is said to be well-scaled if the following conditions hold for :

- ;
- ;
- .
Definition 2
A function is said to be well-posed if is first-order differentiable and its derivative is bounded by a certain constant . In particular, commonly used activation functions such as ReLU, tanh, and sigmoid are well-posed (see Table 1).

Definition 3
A matrix is said to be stable-pertinent for a well-posed activation function , in short , if the inequality holds.
Table 1: Well-posedness of common activation functions.

Activations | Well-posedness
ReLU |
sigmoid |
3 Main Results
In this section, we formally present the neural network Gaussian process NNGP, induced by an infinitely deep but finitely wide neural network with i.i.d. weight parameters. We also derive valuable characterizations of NNGP and its associated NNGP kernel: uniform tightness with increasing depth and a bound on the kernel's smallest eigenvalue.
3.1 Neural Network Gaussian Process with Increasing Depth
Consider an -layer neural network whose topology is illustrated in Figure 1; the feedforward propagation follows
(1) 
where and are the weight matrix and bias vector of the th layer, respectively, and is the activation function. Owing to the shortcut connections, the final output of this network is the mean of () previous layers with an equal separation :

(2)

where the ones matrix indicates the unit shortcut connection between and the final layer, and denotes the number of aggregated hidden neurons.
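A minimal sketch of this forward pass, reflecting our reading of Eqs. (1)-(2) (the width, depth, separation, and tanh activation are illustrative assumptions, not values fixed by the paper): each layer applies an affine map followed by the activation, and the hidden states of every `sep`-th layer are averaged into the final output via unit shortcut connections.

```python
import numpy as np

def shortcut_net(x, depth=60, sep=3, seed=0):
    """Forward pass of a narrow deep net whose final output averages the
    hidden states of every `sep`-th layer through unit shortcut connections."""
    rng = np.random.default_rng(seed)
    d = x.size
    h = x.astype(float)
    collected = []
    for layer in range(1, depth + 1):
        W = rng.standard_normal((d, d)) / np.sqrt(d)   # i.i.d. weights, Eq. (1)
        b = 0.1 * rng.standard_normal(d)               # i.i.d. biases
        h = np.tanh(W @ h + b)
        if layer % sep == 0:                           # shortcut every `sep` layers
            collected.append(h)
    return np.mean(collected, axis=0)                  # Eq. (2): averaged shortcuts

out = shortcut_net(np.ones(8))
print(out.shape)
```

Note that, unlike the width-induced construction, the aggregated states here are not independent: they are successive states of one forward chain, which is exactly why weak dependence (Section 3.1) is needed.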
Let be the concatenation of all vectorized weight matrices and . Regarding the neural network , we present the first main theorem as follows.

Theorem 1
Theorem 1 states that our proposed neural network converges to a Gaussian process as . Given a data set , the limit output variables of this network belong to a multivariate Gaussian distribution whose mean equals 0 and whose covariance matrix is an matrix, the entry of which is defined as
(3) 
The key idea of proving Theorem 1 is to show that our proposed neural network converges to a Gaussian process as the depth increases, according to the generalized Central Limit Theorem with weakly dependent variables instead of i.i.d. ones. To implement this idea, we intentionally constrain the weights and biases so that the random variables of two hidden layers with a sufficient separation become weakly dependent, i.e., form mixing processes. By aggregating the weakly dependent variables at the final layer via shortcut connections, the output of the proposed network converges to a Gaussian process as the depth goes to infinity. The key steps are formally stated in Lemmas 1 and 2.

Lemma 1
Provided a well-posed activation and stable-pertinent parameter matrices, the concerned neural network comprises a stochastic sequence of weakly dependent variables as the depth goes to infinity .
Proof. Let denote the distribution of the random variable sequence , where , and indicates the vector of random variables before the timestamp . We define a coefficient as
where stands for a conditional probability distribution and denotes a probability measure, or equivalently the algebra of events (joe1997:Sklar), which satisfies

for two probability distributions and . According to Eq. (1), we have

where
Given the wellposed and stablepertinent parameter matrices, i.e., for any , it holds true that
where and . Thus, we have
Therefore, the sequence generated by Eq. (1) is mixing, or equivalently weakly dependent, which completes the proof.
Lemma 2
Suppose that (i) a random variable sequence is weakly dependent, satisfying mixing with an exponential convergence rate, (ii) for , we have
Let , then we have
Further, the limit variable converges in distribution to as , provided .
Lemma 2 is a variant of the generalized Central Limit Theorem under weak dependence. The proof idea can be summarized as follows. From (doukhan2012:mixing), it is observed that an mixing sequence with an exponential convergence rate can be covered by the mixing one with . Thus, the conditions of Lemma 2 satisfy the preconditions of the generalized Central Limit Theorem under weak dependence (b1995:clt, Theorem 27.5). This lemma also admits alternative proofs based on the encyclopedic treatment of limit theorems under mixing conditions; interested readers can refer to bradley2007:mixing for more details.
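To illustrate the mechanism behind Lemma 2, consider an AR(1) sequence, a textbook example of a geometrically mixing (hence weakly dependent) process: its normalized partial sums are still asymptotically Gaussian even though the summands are not independent. This is our own illustrative sketch, not the paper's construction; the sequence length, replication count, and AR coefficient are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, rho = 2000, 4000, 0.5          # sequence length, replications, AR coefficient

# AR(1): x_t = rho * x_{t-1} + eps_t is mixing with a geometric rate for |rho| < 1
eps = rng.standard_normal((m, n))
x = np.zeros((m, n))
x[:, 0] = eps[:, 0]
for t in range(1, n):
    x[:, t] = rho * x[:, t - 1] + eps[:, t]

# normalized partial sums: the CLT under weak dependence predicts a Gaussian limit
sums = x.sum(axis=1) / np.sqrt(n)
z = (sums - sums.mean()) / sums.std()
# skewness and excess kurtosis should both be near 0
print(abs((z**3).mean()), abs((z**4).mean() - 3))
```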
Proof of Theorem 1. Let denote the output variables of the th layer, which satisfy and . Because the weights and biases are i.i.d., the sequence forms a stochastic process, and the post-activations in the same layer, such as and , are independent for . Given an integer , we select a subsequence of as follows:
for and , which satisfy . From Lemma 1, the sequence leads to a weakly dependent stochastic process. Aggregating this subsequence with shortcut connections to the output layer, the output of the concerned neural network converges to a Gaussian process as as well as , from Lemma 2.
Remark. To the best of our knowledge, our proposed NNGP is the first NNGP induced by increasing depth. Currently, there is no rigorous definition of width and depth. Our claim of depth aligns with the conventional usage of width and depth for a neural network, in which depth is understood as the maximum number of neurons along any route from the input to the output, and width is the maximum number of neurons in a layer. As illustrated in Figure 2, if examined in an unravelled view, our network is simultaneously wide and deep due to layer reuse across different routes. However, we argue that this does not affect our claim, because not every layer has infinite width in the unravelled view, which differs from the key characteristic of NNGP. Moreover, the conventional usage is more widely accepted than the unravelled view; otherwise, contrary to common sense, a ResNet would also have to be regarded as a wide network.
3.2 Uniform Tightness of NNGP
In this subsection, we delineate the asymptotic behavior of NNGP as the depth goes to infinity. Here, we assume that the weights and biases are i.i.d. sampled from . Per the conditions of Theorem 1, we have the following theorem.
Theorem 2
For any , the stochastic process, described in Lemma 1, is uniformly tight in .
Theorem 2 reveals that the stochastic process contained by our network (illustrated in Figure 1) is uniformly tight, which is an intrinsic characteristic of NNGP. Based on Theorem 2, one can obtain the functional limit and continuity properties of NNGP, in analogy to the results for NNGP (bracale2020:asymptotic).
Similarly, we start the proof of Theorem 2 with some useful lemmas.
Lemma 3
Let denote a sequence of random variables in . This stochastic process is uniformly tight in , if (1) is a uniformly tight point of () in ; (2) for any and , there exist , such that .
Lemma 3 is the core guidance for proving Theorem 2. This lemma can be straightforwardly derived from Kolmogorov Continuity Theorem (stroock1997:Kolmogorov), provided the Polish space .
Lemma 4
Based on the notations of Lemma 3, is a uniformly tight point of () in .
Proof. It suffices to prove that 1) is a tight point of () in and 2) the statistic converges in distribution as . Note that 1) is self-evident since every probability measure in is tight; 2) has been proved by Theorem 1. Therefore, this lemma is completely proved.
Remark. Notice that the convergence in distribution () from Lemmas 2 and 4 paves the way for convergence of expectations. Specifically, given a linear bounded functional and a function satisfying , we have and according to the General Transformation Theorem (van2000:asymptotic, Theorem 2.3) and Uniform Integrability (billingsley2013:convergence), respectively. These results may serve as a solid basis for future applications of NNGP.
Lemma 5
Based on the notations of Lemma 3, for any and , there exist , such that
Proof. The proof proceeds by mathematical induction. Before that, we show a preliminary result. Let be one element of the augmented matrix at the th layer; then we can formulate its characteristic function as

where denotes the imaginary unit with . Thus, the variance of the hidden random variables at the th layer becomes

(4)
Since the activation is a wellposed function and , we affirm that is Lipschitz continuous (with Lipschitz constant ).
Now we start the mathematical induction. When , for any and , we have
where . By the induction hypothesis, for , we have
Thus, one has
(5) 
where
Thus, Eq. (5) becomes
where
Iterating this argument, we obtain
where
The above induction holds for any positive even . Let , then this lemma is proved as desired.
3.3 Tight Bound for the Smallest Eigenvalue
In this subsection, we provide a tight bound for the smallest eigenvalue of the NNGP kernel. For the NNGP with ReLU activation, we have the following theorem.
Theorem 3
Suppose that are i.i.d. sampled from and is a wellscaled distribution, then for an integer , with probability , we have , where
Theorem 3 provides a tight bound for the smallest eigenvalue of the NNGP kernel. This nontrivial estimate mirrors the characteristics of the kernel and is often used as a key assumption in optimization and generalization analyses.
The key idea of proving Theorem 3 is based on the following inequalities for the smallest eigenvalue of real-valued symmetric matrices. Given two symmetric matrices , it is observed that
(6) 
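Eq. (6) appears to be the Weyl-type bound for the smallest eigenvalue of a sum of symmetric matrices, lambda_min(A + B) >= lambda_min(A) + lambda_min(B); the following quick numerical check (our own sketch) verifies it on random symmetric pairs:

```python
import numpy as np

rng = np.random.default_rng(2)

def lam_min(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(M).min()

# Weyl's inequality: lam_min(A + B) >= lam_min(A) + lam_min(B)
for _ in range(100):
    A = rng.standard_normal((6, 6)); A = (A + A.T) / 2
    B = rng.standard_normal((6, 6)); B = (B + B.T) / 2
    assert lam_min(A + B) >= lam_min(A) + lam_min(B) - 1e-10
print("Weyl lower bound verified on 100 random symmetric pairs")
```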
From Eqs. (2) and (3), we can unfold as a sum of covariances of the sequence of random variables . Thus, we can bound by via the chain of feedforward compositions in Eq. (1). We begin the proof with the following lemmas.
Lemma 6
Let be a Lipschitz continuous function with constant and denote the Gaussian distribution , then for , there exists , s.t.
(7) 
Lemma 6 shows that the Gaussian distribution corresponding to our samples satisfies the log-Sobolev inequality (i.e., Eq. (7)) with constants unrelated to the dimension . This result also holds for uniform distributions on the sphere or the unit hypercube (nguyen2021:eigenvalues). Detailed proofs are given in the Appendix.

Lemma 7
Suppose that are i.i.d. sampled from , then with probability , we have
for , where
Proof. From Definition 1, we have
Since are i.i.d. sampled from , for , we have with probability at least . Provided , the single-sided inner product is Lipschitz continuous with constant . Thus, from Lemma 6, for , we have
Thus, for , we have
We complete the proof by setting .
Proof of Theorem 3. We start this proof with some notations. Recall the empirical NNGP kernel . For convenience, we force . We also abbreviate the covariance as and pick throughout this proof.
Unfolding Eq. (3), we have
(8) 
where
in which the subscript indicates the th element of vector . From Theorem 1, the sequence of random variables is weakly dependent with as . Thus, is an infinitesimal with respect to when and is sufficiently large.
Applying Eq. (6) to Eq. (8), we have
(9)  
(10) 
Iterating Eq. (10) and then substituting it into Eq. (9), we have
(11) 
From the Hermite expansion of the ReLU function, we have
(12) 
where indicates the expansion order. Thus, we have
(13)  
where the superscript denotes the th Khatri-Rao power of the matrix ; the first inequality follows from Eq. (12), the second from the Gershgorin Circle Theorem (salas1999gershgorin), and the third from Lemma 7. Therefore, we obtain the lower bound of the smallest eigenvalue by plugging Eq. (13) into Eq. (11).
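The Gershgorin Circle Theorem invoked above confines every eigenvalue of a matrix to disks centered at the diagonal entries; for a symmetric matrix this yields the lower bound lambda_min(A) >= min_i (a_ii - sum_{j != i} |a_ij|). A quick numerical check (our own sketch, separate from the proof):

```python
import numpy as np

rng = np.random.default_rng(3)

def gershgorin_lower(A):
    """Gershgorin lower bound for a symmetric matrix:
    min over rows of (diagonal entry minus off-diagonal absolute row sum)."""
    radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return (np.diag(A) - radii).min()

for _ in range(100):
    G = rng.standard_normal((5, 5))
    A = (G + G.T) / 2                       # random symmetric test matrix
    assert np.linalg.eigvalsh(A).min() >= gershgorin_lower(A) - 1e-10
print("Gershgorin lower bound verified on 100 random symmetric matrices")
```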
4 Experiments
Generally, depth can endow a network with more powerful representation ability than width. However, it is unclear whether this superiority of depth is sustained in the NNGP setting, where all parameters are random rather than trained. In other words, it is unclear whether the established NNGP is more expressive than NNGP. To answer this question, in this section we apply the NNGP kernel to a generic regression task and compare its performance on the FashionMNIST (FMNIST) and CIFAR10 data sets with that of NNGP.

NNGP regression. Given the data set , where is the input and is the corresponding label, our goal is to predict for the test sample . From Theorem 1, and belong to a multivariate Gaussian distribution , whose mean equals 0 and whose covariance matrix has the following form
(15) 
where is an matrix computed by Eq. (3), and the th element of is for . Eq. (15) partitions the covariance into blocks corresponding to the training set and the test sample, respectively. Thus, we have with
(16) 
where denotes the label vector. When the observations are corrupted by additive Gaussian noise , Eq. (16) becomes
(17) 
where is the identity matrix.
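The posterior in Eqs. (16)-(17) is standard Gaussian-process regression. A minimal sketch follows, with a toy RBF kernel standing in for the NNGP kernel and an assumed noise level (both our own illustrative choices):

```python
import numpy as np

def gp_predict(K, k_star, k_ss, y, noise=0.1):
    """Noisy GP regression in the form of Eq. (17):
    mean = k_*^T (K + noise*I)^{-1} y,
    var  = k_** - k_*^T (K + noise*I)^{-1} k_*."""
    A = K + noise * np.eye(K.shape[0])
    alpha = np.linalg.solve(A, y)
    mean = k_star @ alpha
    var = k_ss - k_star @ np.linalg.solve(A, k_star)
    return mean, var

# toy example: an RBF kernel stands in for the NNGP kernel
X = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * X)
kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / 0.1**2)
mean, var = gp_predict(kern(X, X), kern(np.array([0.5]), X)[0], 1.0, y)
print(mean, var)
```

The same routine applies verbatim once `kern` is replaced by the (empirical) NNGP kernel of Eq. (18).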
For numerical implementation, we calculate the NNGP and NNGP kernels as follows:
(18) 
where indicates the deep network or wide network.
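We read Eq. (18) as a Monte Carlo estimate: average the product of outputs of the same randomly initialized network evaluated at the two inputs, over many initializations. A hedged sketch of this estimator (the network `toy_net`, its width, and the initialization count are our own illustrative assumptions):

```python
import numpy as np

def toy_net(x, r, width=64):
    """A randomly initialized one-hidden-layer scalar-output network."""
    W1 = r.standard_normal((width, x.size)) / np.sqrt(x.size)
    b1 = r.standard_normal(width)
    w2 = r.standard_normal(width) / np.sqrt(width)
    return w2 @ np.tanh(W1 @ x + b1)

def empirical_kernel(f, x1, x2, n_init=500, seed=0):
    """Estimate K(x1, x2) by averaging f(x1) * f(x2) over random inits;
    crucially, both inputs must be fed through the SAME initialization."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_init):
        s = rng.integers(2**32)        # one shared seed per initialization
        total += f(x1, np.random.default_rng(s)) * f(x2, np.random.default_rng(s))
    return total / n_init

x = np.ones(4)
print(empirical_kernel(toy_net, x, x))   # a diagonal kernel entry, positive
```

Replacing `toy_net` with the deep shortcut network of Eqs. (1)-(2), or with a one-hidden-layer wide network, yields the empirical NNGP and NNGP kernels, respectively.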
Experimental setups. We conduct regression experiments on the FMNIST and CIFAR10 data sets. We sample 1k, 2k, and 3k data points from the training sets to construct the two kernels, and then test the kernels on the test sets. Here, we employ a one-hidden-layer wide network for computing the NNGP kernel, whereas the width of the deep network is set to the number of classes, the smallest possible width for prediction tasks. For a fair comparison, the depth of NNGP and the width of NNGP are set equal (). For the regression task, the class labels are encoded into a regression format in which incorrect classes are and the correct class is (lee2017deep). For both networks, we employ as the activation function. Following the setting of NNGP lee2017deep, all weights are initialized from a Gaussian distribution with mean and variance for normalization in each layer, where is the number of neurons in the th layer. The initialization is repeated times to compute the empirical statistics of NNGP and NNGP based on Eq. (18). We also run each experiment times to compute the mean and variance of the accuracy. All experiments are conducted on an Intel Core i7-6500U CPU.
Table 2: Test accuracy of the NNGP and NNGP kernels on FMNIST and CIFAR10.

Model | Training size | FMNIST test accuracy | CIFAR10 test accuracy
NNGP | 1k | 0.345 ± 0.016 | 0.166 ± 0.018
NNGP | 1k | 0.342 ± 0.021 | 0.187 ± 0.018
NNGP | 2k | 0.352 ± 0.019 | 0.178 ± 0.007
NNGP | 2k | 0.373 ± 0.030 | 0.188 ± 0.012
NNGP | 3k | 0.372 ± 0.024 | 0.182 ± 0.005
NNGP | 3k | 0.365 ± 0.007 | 0.185 ± 0.019
Results. Table 2 lists the performance of the regression experiments using the NNGP and NNGP kernels. The test accuracies of the two kernels are comparable, which implies that the NNGP and NNGP kernels have similar representation ability. We attribute this to the following: neither kernel is a stacked kernel, and their difference lies mainly in whether independent or weakly dependent variables are aggregated; thus, their discriminative abilities should be similar (lee2017deep).
Next, we use the angular plot to investigate how the separation affects the representation ability of the NNGP kernel. The angle is computed according to
and the angular plot depicts the relationship between kernel values and angles. If an angular plot stays near zero, the kernel cannot recognize differences between samples well; otherwise, the kernel is regarded as having better discriminative ability. We set the network depth so that the NNGP kernel is empirically computed by aggregating shortcut connections with a separation of between neighboring shortcut connections. Figure 3 illustrates the angular plots of NNGP kernels with for the FMNIST 1k training data. The angular plot of the kernel with is compressed closer to zero than that of the kernel with . This result suggests that the separation should be set to a smaller number for a powerful NNGP kernel.
5 Related Work
Deep Learning and Kernel Methods. There have been great efforts on the correspondence between deep neural networks and Gaussian processes. neal1996:priors presented the seminal work by showing that a one-hidden-layer network of infinite width turns into a Gaussian process. cho:MKMs linked multilayer networks using rectified polynomial activations with compositional Gaussian kernels. lee2017deep showed that infinitely wide fully-connected neural networks with commonly used activation functions converge to Gaussian processes. Recently, the NNGP has been scaled to many types of networks, including Bayesian networks
(novak2018bayesian), deep networks with convolution (garriga2019:bayesian), and recurrent networks (yang2019:rnn). Furthermore, wang2020:bridging wrote an inclusive review of studies connecting neural networks and kernel learning. Despite great progress, all existing works on NNGP rely on increasing width to induce the Gaussian process; in contrast, we investigate the depth paradigm and offer an NNGP induced by increasing depth, which not only complements the existing theory but also enhances our understanding of the true picture of "deep" learning.

Developments of NNGPs. Recent years have witnessed a growing interest in neural network Gaussian processes. NNGPs can provide a quantitative characterization of how likely certain outcomes are when some aspects of the system are not exactly known. In the experiments of lee2017deep, an explicit estimate in the form of a variance prediction is given for each test sample. Besides, pang2019neural showed that the NNGP is good at handling noisy data and is superior to discretizing differential operators when solving some linear or nonlinear partial differential equations. park2020towards employed the NNGP kernel to measure the performance of network architectures so as to speed up neural architecture search. dutordoir2020bayesian presented a translation-insensitive convolutional kernel by relaxing the translation invariance of deep convolutional Gaussian processes. lu2020interpretable proposed an interpretable NNGP by approximating an NNGP with its low-order moments.
6 Conclusions and Prospects
In this paper, we have presented the first NNGP induced by depth, based on the width-depth symmetry consideration. We formulate the associated NNGP kernel via a network of increasing depth. We have then characterized the basic properties of the proposed NNGP kernel by proving its uniform tightness and estimating its smallest eigenvalue. These results serve as a solid basis for future research, such as the generalization and optimization of NNGP and Bayesian inference with the NNGP. Lastly, we have conducted regression experiments on image classification and showed that the proposed NNGP kernel achieves performance comparable to the NNGP kernel. Future efforts can be put into extending the proposed NNGP kernel to more applications.