Currently, kernel methods and deep neural networks are two of the most remarkable machine learning methodologies. Recent years have witnessed many works on the connection between these two methodologies. lee2017deep pointed out that randomly initializing the parameters of an infinitely wide network gives rise to a Gaussian process, which is referred to as the neural network Gaussian process (NNGP). Due to the widespread attraction of this idea, studies of NNGPs have been extended to more types of networks, such as attention-based models hron2020infinite and recurrent networks yang2019:rnn.
A Gaussian process is a classical non-parametric model. The equivalence between an infinitely wide fully-connected network and a Gaussian process has been established by neal1996:priors; lee2017deep. Given a fully-connected multi-layer network whose parameters are i.i.d. randomly initialized, the output of each neuron is an aggregation of neurons in the preceding layer whose outputs are i.i.d. When the network width goes infinitely large, according to the Central Limit Theorem fischer2010history, the output of each neuron conforms to a Gaussian distribution. As a result, the output function expressed by the network is essentially a Gaussian process. The correspondence between neural networks and Gaussian processes allows exact Bayesian inference using the neural network lee2017deep.
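As a minimal numerical sketch of this width-induced Gaussianity (illustrative only; the one-hidden-layer architecture, tanh activation, and 1/fan-in weight variance are our assumptions, not the paper's exact setup), one can sample many randomly initialized networks and inspect the output distribution at a fixed input:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_net_output(x, width, rng):
    """One forward pass of a randomly initialized one-hidden-layer net.

    Weights use the standard NNGP scaling (variance 1/fan_in), so the
    output variance stays O(1) as the width grows.
    """
    d = x.shape[0]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d))
    b1 = rng.normal(0.0, 1.0, size=width)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(1, width))
    h = np.tanh(W1 @ x + b1)   # hidden post-activations, i.i.d. across neurons
    return float(W2 @ h)       # sum of `width` i.i.d. terms -> CLT applies

x = np.ones(3) / np.sqrt(3.0)
outputs = np.array([random_net_output(x, width=2048, rng=rng) for _ in range(2000)])

# Across random initializations the output is approximately zero-mean Gaussian:
print(round(outputs.mean(), 2), round(outputs.std(), 2))
```

The histogram of `outputs` tightens to a fixed Gaussian as `width` increases, which is exactly the distribution the NNGP kernel describes analytically.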
Despite the achievements of the current NNGP theory, there is an important limitation that has not been addressed satisfactorily. So far, the neural network Gaussian process is essentially induced by increasing width, regardless of how many layers are stacked in a network. But in the era of deep learning, what concerns us more is depth and how depth affects the behavior of a neural network, since depth is the major element accountable for the power of deep learning. Although the current NNGP theory is beautiful and elegant in its form, it unfortunately cannot accommodate this concern adequately. Therefore, it is highly necessary to expand the scope of the existing theory to include the depth issue. Specifically, our natural curiosity is what happens if we have an infinitely deep but finitely wide network. Can we derive an NNGP by increasing depth rather than width, thereby contributing to understanding the true picture of deep learning? If this question is answered positively, we can reconcile the successes of deep networks with the elegance of the NNGP theory. What's more, as a valuable addition, the depth-induced NNGP greatly enlarges the scope of the existing NNGP theory, which is posited to open many doors for research and translation opportunities in this area.
The above idea is well motivated by a width-depth symmetry consideration. Previously, lu2017expressive and hornik1989multilayer respectively proved that width-bounded and depth-bounded neural networks are universal approximators. fan2020quasi suggested that a wide network and a deep network can be converted into each other with a negligible error by using De Morgan's law. Since there exists a certain symmetry between width and depth, it is very probable that deepening a neural network under certain conditions can lead to an NNGP as well. Along this direction, we investigate the feasibility of inducing an NNGP by depth (NNGP), with a network that has a topology of shortcuts, as shown in Figure 1. The characteristic of this topology is that the outputs of intermediate layers with a fixed separation are aggregated in the final layer, yielding the network output. Such a shortcut topology has been successfully applied to medical imaging you2019ct and computer vision fan2018sparse as a backbone structure.
An NNGP by width (NNGP) is accomplished by summing the i.i.d. output terms of infinitely many neurons and applying the Central Limit Theorem. In contrast, for the topology in Figure 1, as the depth increases, the outputs of increasingly many neurons are aggregated together. We constrain the random weights and biases such that those summed neurons become weakly dependent by virtue of their separation. Consequently, when going infinitely deep, the network illustrated in Figure 1 is also a function drawn from a Gaussian process, according to the generalized Central Limit Theorem under weak dependence (b1995:clt). Beyond proposing the NNGP, we theoretically prove that it is uniformly tight and provide a tight bound on the smallest eigenvalue of the associated NNGP kernel. From the former, one can determine properties of the NNGP such as its functional limit and continuity, while the non-trivial lower and upper bounds mirror the characteristics of the derived kernel, which constitutes a fundamental cornerstone for the optimization and generalization theory of deep learning.
In this work, we establish the NNGP by increasing depth, in contrast to the present mainstream NNGPs that are induced by increasing width. Our work substantially enlarges the scope of the existing elegant NNGP theory, making a stride towards understanding the true picture of deep learning. Furthermore, we investigate the essential properties of the proposed NNGP and its associated kernel, which lays a solid foundation for future research and applications. Lastly, we implement an NNGP kernel and apply it to regression experiments on two well-known data sets.
Let be the set for an integer . Given a function , we denote by if there exist positive constants and such that for every ; by if there exist positive constants and such that for every ; and by if there exist positive constants and such that for every . Let denote the matrix norm for the matrix . Throughout this paper, we employ the maximum spectral norm as the matrix norm (meyer2000:norm), where denotes the -th singular value of the matrix. Let denote the number of elements, e.g., . Finally, we provide several definitions for the characterization of inputs and parameters.
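For concreteness, the spectral matrix norm defined above (the largest singular value) can be computed directly, e.g. with numpy; this small helper is our illustration, not code from the paper:

```python
import numpy as np

def spectral_norm(A):
    """Largest singular value of A, i.e. the spectral norm used in the text."""
    # np.linalg.svd returns singular values in descending order.
    return np.linalg.svd(A, compute_uv=False)[0]

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])
print(spectral_norm(A))  # 4.0 for this diagonal matrix
```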
A data distribution is said to be well-scaled, if the following conditions hold for :
A matrix is said to be stable-pertinent for a well-posed activation function , in short , if the inequality holds.
3 Main Results
In this section, we formally present the neural network Gaussian process NNGP, led by an infinitely deep but finitely wide neural network with i.i.d. weight parameters. We also derive valuable characterizations of the NNGP and its associated kernel: uniform tightness as the depth increases, and a bound estimate for the kernel's smallest eigenvalue.
3.1 Neural Network Gaussian Process with Increasing Depth
Consider an -layer neural network whose topology is illustrated in Figure 1; the feed-forward propagation follows
are the weight matrix and bias vector of the -th layer, respectively, and is the activation function. Invoking shortcut connections, the final output of this network is a mean of () previous layers with an equal separation and
where the ones matrix indicates the unit shortcut connection between and the final layer, and denotes the summed number of concerned hidden neurons
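A minimal sketch of this forward pass (the tanh activation, layer shapes, and initialization below are hypothetical stand-ins, not the paper's exact implementation) might look like:

```python
import numpy as np

def shortcut_net_forward(x, weights, biases, sep):
    """Forward pass of the shortcut topology sketched in Figure 1.

    Hidden layers are computed as usual; the output of every `sep`-th
    hidden layer is collected through a unit shortcut, and the final
    network output is the mean of the collected layer outputs.
    """
    h = x
    collected = []
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        h = np.tanh(W @ h + b)
        if l % sep == 0:               # unit shortcut with separation `sep`
            collected.append(h)
    return np.mean(collected, axis=0)  # mean over the aggregated layers

rng = np.random.default_rng(1)
width, depth, sep = 8, 12, 3
weights = [rng.normal(0, 1 / np.sqrt(width), (width, width)) for _ in range(depth)]
biases = [rng.normal(0, 0.1, width) for _ in range(depth)]
x = rng.normal(0, 1, width)
y = shortcut_net_forward(x, weights, biases, sep)
print(y.shape)  # (8,)
```

As the depth grows, more and more (weakly dependent) layer outputs enter this mean, which is the aggregation the depth-induced CLT argument operates on.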
Let be the concatenation of all vectorized weight matrices and bias vectors. Regarding the neural network , we present the first main theorem as follows.
Theorem 1 states that our proposed neural network converges to a Gaussian process as . Given a data set , the limit output variables of this network belong to a multivariate Gaussian distribution whose mean equals 0 and whose covariance matrix is an matrix, the -entry of which is defined as
The key idea of proving Theorem 1 is to show that our proposed neural network converges to a Gaussian process as depth increases, according to the generalized Central Limit Theorem with weakly dependent variables instead of i.i.d. ones. To implement this idea, we intentionally constrain the weights and biases so that the random variables of two hidden layers with a sufficient separation degenerate to weak dependence, i.e., form mixing processes. By aggregating the weakly dependent variables in the final layer via shortcut connections, the output of the proposed network converges to a Gaussian process as the depth goes to infinity. The key steps are formally stated in Lemmas 1 and 2 as follows.
Provided well-posed and stable-pertinent parameter matrices, the concerned neural network comprises a stochastic sequence of weakly dependent variables as the depth goes to infinity.
Proof. Let denote the distribution of the random variable sequence , where , and indicates the vector of random variables before the timestamp . We define a coefficient as
stands for a conditional probability distribution and denotes a probability measure, or equally the -algebra of events (joe1997:Sklar), which satisfies
for two probability distributions and . According to Eq. (1), we have
Given the well-posed and stable-pertinent parameter matrices, i.e., for any , it holds true that
where and . Thus, we have
Therefore, the sequence led by Eq. (1) is -mixing, or equivalently weakly dependent, which completes the proof.
Suppose that (i) a random variable sequence is weakly dependent, satisfying -mixing with an exponential convergence rate, and (ii) for , we have
Let , then we have
Further, the limit variable converges in distribution to as , provided .
Lemma 2 is a variant of the generalized Central Limit Theorem under weak dependence. The proof idea can be summarized as follows. From (doukhan2012:mixing), it is observed that an -mixing sequence with an exponential convergence rate can be covered by the -mixing one with . Thus, the conditions of Lemma 2 satisfy the preconditions of the generalized Central Limit Theorem under weak dependence (b1995:clt, Theorem 27.5). This lemma also admits alternative proofs following the encyclopedic treatment of limit theorems under mixing conditions; interested readers can refer to bradley2007:mixing for more details.
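To make the CLT under weak dependence concrete, a small simulation with an exponentially mixing AR(1) sequence (our illustrative choice, not part of the proof) shows normalized partial sums of dependent variables behaving like a Gaussian with the long-run variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_ar1_mean(n, phi, rng):
    """Normalized partial sum S_n / sqrt(n) of an AR(1) sequence.

    AR(1) with |phi| < 1 is exponentially mixing, so the generalized CLT
    applies even though consecutive terms are dependent.
    """
    eps = rng.normal(size=n)
    x = np.empty(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x.sum() / np.sqrt(n)

samples = np.array([standardized_ar1_mean(2000, 0.5, rng) for _ in range(1000)])
# The long-run variance of these partial sums is sigma^2 / (1 - phi)^2 = 4 here,
# not sigma^2 = 1: dependence changes the limiting variance, not the Gaussianity.
print(round(samples.mean(), 1), round(samples.var(), 1))
```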
Proof of Theorem 1. Let denote the output variables of the -th layer, which satisfies that and . Because the weights and biases are taken to be i.i.d., the sequence leads to a stochastic process, and the post-activations in the same layer, such as and are independent for . Given an integer , we select a sub-sequence of as follows:
for and , which satisfy . From Lemma 1, the sequence leads to a weakly dependent stochastic process. Aggregating this sub-sequence with shortcut connections to the output layer, the output of the concerned neural network converges to a Gaussian process as as well as , from Lemma 2.
Remark. To the best of our knowledge, the proposed NNGP is the first NNGP induced by increasing depth. Currently, there is no rigorous definition of width and depth. Our usage of depth aligns with the conventional usage for a neural network, in which the depth is understood as the maximum number of neurons among all possible routes from the input to the output, and the width is the maximum number of neurons in a layer. As illustrated in Figure 2, if examined in an unravelled view, our network is simultaneously wide and deep due to layer reuse in different routes. However, we argue that this does not affect our claim, because not every layer has an infinite width in the unravelled view, which differs from the key characteristic of NNGP. What's more, the conventional usage is more acceptable than the unravelled view; otherwise, it would be against common sense to also regard a ResNet as a wide network.
3.2 Uniform Tightness of NNGP
In this subsection, we delineate the asymptotic behavior of NNGP as the depth goes to infinity. Here, we assume that the weights and biases are i.i.d. sampled from . Per the conditions of Theorem 1, we have the following theorem.
For any , the stochastic process, described in Lemma 1, is uniformly tight in .
Theorem 2 reveals that the stochastic process contained in our network (illustrated in Figure 1) is uniformly tight, which is an intrinsic characteristic of the NNGP. Based on Theorem 2, one can obtain the functional limit and continuity properties of the NNGP, in analogy to the results for NNGP (bracale2020:asymptotic).
Similarly, we start the proof of Theorem 2 with some useful lemmas.
Let denote a sequence of random variables in . This stochastic process is uniformly tight in , if (1) is a uniformly tight point of () in ; (2) for any and , there exist , such that .
Based on the notations of Lemma 3, is a uniformly tight point of () in .
Proof. It suffices to prove that 1) is a tight point of () in and 2) the statistic converges in distribution as . Note that 1) is self-evident since every probability measure in is tight, and 2) has been proved by Theorem 1. Therefore, the lemma is proved.
Remark. Notice that the convergence in distribution () from Lemmas 2 and 4 paves the way for the convergence of expectations. Specifically, provided a linear and bounded functional and a function satisfying , we have and according to the General Transformation Theorem (van2000:asymptotic, Theorem 2.3) and Uniform Integrability (billingsley2013:convergence), respectively. These results may serve as a solid basis for applications of the NNGP in the future.
Based on the notations of Lemma 3, for any and , there exist , such that
Proof. This proof follows mathematical induction. Before that, we show the following preliminary result. Let be one element of the augmented matrix at the -th layer; then we can formulate its characteristic function as
where denotes the imaginary unit with . Thus, the variance of the hidden random variables at the -th layer becomes
Since the activation is a well-posed function and , we affirm that is Lipschitz continuous (with Lipschitz constant ).
Now we start the mathematical induction. When , for any and , we have
where . Per mathematical induction, for , we have
Thus, one has
Thus, Eq. (5) becomes
Iterating this argument, we obtain
The above induction holds for any positive even integer . Let ; then the lemma is proved as desired.
3.3 Tight Bound for the Smallest Eigenvalue
In this subsection, we provide a tight bound for the smallest eigenvalue of the NNGP kernel. For the NNGP with ReLU activation, we have the following theorem.
Suppose that are i.i.d. sampled from and is a well-scaled distribution, then for an integer , with probability , we have , where
Theorem 3 provides a tight bound for the smallest eigenvalue of the NNGP kernel. This nontrivial estimate mirrors the characteristics of the kernel and is usually used as a key assumption in optimization and generalization analyses.
The key idea of proving Theorem 3 is based on the following inequalities about the smallest eigenvalue of real-valued symmetric matrices. Given two symmetric matrices , it holds that
From Eqs. (2) and (3), we can unfold as a sum of covariance of the sequence of random variables . Thus, we can bound by via a chain of feedforward compositions in Eq. (1). We begin this proof with the following lemmas.
Let be a Lipschitz continuous function with constant and denote the Gaussian distribution ; then for , there exists , s.t. . This result also holds for the uniform distributions on the sphere or unit hypercube (nguyen2021:eigenvalues). Detailed proofs can be found in the Appendix.
Suppose that are i.i.d. sampled from , then with probability , we have
for , where
Proof. From Definition 1, we have
Since are i.i.d. sampled from , for , we have with probability at least . Provided , the single-sided inner product is Lipschitz continuous with the constant . Thus, from Lemma 6, for , we have
Thus, for , we have
We complete the proof by setting .
Proof of Theorem 3. We start this proof with some notations. Recall the empirical NNGP kernel . For convenience, we set . We also abbreviate the covariance as and pick throughout this proof.
Unfolding Eq. (3), we have
in which the subscript indicates the -th element of the vector . From Theorem 1, the sequence of random variables is weakly dependent with as . Thus, is infinitesimal with respect to when and is sufficiently large.
From the Hermite expansion of ReLU function, we have
where indicates the expansion order. Thus, we have
where the superscript denotes the -th Khatri Rao power of the matrix , the first inequality follows from Eq. (12), the second one holds from Gershgorin Circle Theorem (salas1999gershgorin), and the third one follows from Lemma 7. Therefore, we can obtain the lower bound of the smallest eigenvalue by plugging Eq. (13) into Eq. (11)
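The Gershgorin step in the chain above is easy to sanity-check numerically; the symmetric matrix below is an arbitrary example of ours, not the NNGP kernel itself:

```python
import numpy as np

def gershgorin_lower_bound(A):
    """Gershgorin lower bound on the smallest eigenvalue of a symmetric
    matrix: lambda_min(A) >= min_i (A_ii - sum_{j != i} |A_ij|)."""
    off = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return (np.diag(A) - off).min()

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5.0 * np.eye(5)          # symmetric with a boosted diagonal
lam_min = np.linalg.eigvalsh(A).min()  # exact smallest eigenvalue
print(gershgorin_lower_bound(A) <= lam_min)  # True: the bound holds
```

The bound is useful precisely when the diagonal dominates the off-diagonal mass, which is the structure exploited in Eq. (13).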
Generally, depth can endow a network with a more powerful representation ability than width. However, it is unclear whether the superiority of depth persists in the NNGP setting, where all parameters are random rather than trained. In other words, it is unclear whether the established NNGP is more expressive than NNGP. To answer this question, in this section we apply the NNGP kernel to the generic regression task and compare its performance on the Fashion-MNIST (FMNIST) and CIFAR10 data sets with that of the NNGP.
NNGP regression. Provided the data set , where is the input and is the corresponding label, our goal is to predict for the test sample . From Theorem 1, and belong to a multivariate Gaussian process whose mean equals 0 and whose covariance matrix has the following form
where is an matrix computed by Eq. (3), and the -th element of is for . Eq. (15) provides a natural block partition corresponding to the training set and the test sample, respectively. Thus, we have with
where denotes the label vector. When the observations are corrupted by the Gaussian additive noise of , Eq. (16) becomes
where is the identity matrix.
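These posterior formulas can be sketched in a few lines. The RBF kernel below merely stands in for the NNGP kernel, and all names are illustrative, not the paper's implementation:

```python
import numpy as np

def gp_posterior_mean(K_train, k_star, y, noise_var):
    """Posterior mean of GP regression with additive Gaussian noise:
    mean = k_*^T (K + noise_var * I)^{-1} y, as in Eqs. (16)-(17)."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + noise_var * np.eye(n), y)
    return k_star @ alpha

def rbf(X, Z):
    """Toy RBF kernel standing in for the NNGP kernel."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

X = np.linspace(0, 1, 5)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
x_star = np.array([[0.5]])
pred = gp_posterior_mean(rbf(X, X), rbf(x_star, X)[0], y, noise_var=1e-4)
print(round(float(pred), 2))
```

Swapping `rbf` for the NNGP kernel of Eq. (18) gives the regression procedure used in the experiments.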
For numerical implementation, we calculate the NNGP and NNGP kernels as follows:
where indicates the deep network or wide network.
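A Monte Carlo version of this empirical kernel computation might look as follows; the plain tanh stack and all hyperparameters here are our assumptions for illustration, not the exact shortcut architecture of Figure 1:

```python
import numpy as np

def empirical_nngp_kernel(X, width, depth, n_init, rng):
    """Monte Carlo estimate of an empirical NNGP kernel in the spirit of
    Eq. (18): average the outer product of the network outputs over many
    random initializations."""
    n, d = X.shape
    K = np.zeros((n, n))
    for _ in range(n_init):
        h = X.T
        fan_in = d
        for _ in range(depth):
            W = rng.normal(0, 1 / np.sqrt(fan_in), (width, fan_in))
            b = rng.normal(0, 0.1, (width, 1))
            h = np.tanh(W @ h + b)
            fan_in = width
        w_out = rng.normal(0, 1 / np.sqrt(width), (1, width))
        f = (w_out @ h).ravel()   # one output sample per data point
        K += np.outer(f, f)
    return K / n_init

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
K = empirical_nngp_kernel(X, width=64, depth=3, n_init=200, rng=rng)
print(np.allclose(K, K.T))  # True: the estimate is symmetric by construction
```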
Experimental setups. We conduct regression experiments on the FMNIST and CIFAR10 data sets. We respectively sample 1k, 2k, and 3k data from the training sets to construct the two kernels, and then test the performance of the kernels on the test sets. Here, we employ a one-hidden-layer wide network for computing the NNGP kernel, whereas the width of the deep network is set to the number of classes, which is the smallest possible width for prediction tasks. For a fair comparison, the depth of the NNGP and the width of the NNGP are set equal (). For the regression tasks, the class labels are encoded into an apposite regression format where incorrect classes are and the correct class is (lee2017deep). For both networks, we employ as the activation function. Following the setting of NNGP lee2017deep, all weights are initialized with a Gaussian distribution of mean and variance for normalization in each layer, where is the number of neurons in the -th layer. The initialization is repeated times to compute the empirical statistics of the NNGP and NNGP based on Eq. (18). We also run each experiment times to compute the mean and variance of the accuracy. All experiments are conducted on an Intel Core-i7-6500U.
[Table 2: test accuracy of the NNGP and NNGP kernels on FMNIST and CIFAR10.]
Results. Table 2 lists the performance of the regression experiments using the NNGP and NNGP kernels. It is observed that the test accuracies of the two kernels are comparable, which implies that the NNGP and NNGP kernels are similar in representation ability. We think the reason is as follows: neither kernel is a stacked kernel, and their difference lies mainly in whether independent or weakly dependent variables are aggregated. Thus, their discriminative ability should be similar (lee2017deep).
Next, we use the angular plot to investigate how the separation affects the representation ability of the NNGP kernel. The angle is computed according to
and the angular plot manifests the relationship between kernel values and angles. If an angular plot stays near zero, the kernel cannot recognize the differences between samples well; otherwise, the kernel is regarded as having better discriminative ability. We set the network depth so that the NNGP kernel is empirically computed by aggregating shortcut connections with a separation of between neighboring shortcut connections. Figure 3 illustrates the angular plots of NNGP kernels with for the FMNIST-1k training data. It is observed that the angular plot of the kernel with is compressed closer to zero relative to that of the kernel with . This result suggests that the separation should be set to a smaller number for a powerful NNGP kernel.
5 Related Work
Deep Learning and Kernel Methods. There have been great efforts on the correspondence between deep neural networks and Gaussian processes. neal1996:priors presented the seminal work by showing that a one-hidden-layer network of infinite width turns into a Gaussian process. cho:MKMs linked multi-layer networks using rectified polynomial activations with compositional Gaussian kernels. lee2017deep showed that infinitely wide fully-connected neural networks with commonly used activation functions converge to Gaussian processes. Recently, the NNGP has been extended to many types of networks, including Bayesian networks (novak2018bayesian), deep convolutional networks (garriga2019:bayesian), and recurrent networks (yang2019:rnn). Furthermore, wang2020:bridging wrote an inclusive review of studies connecting neural networks and kernel learning. Despite great progress, all existing works on NNGPs rely on increasing width to induce the Gaussian process; in contrast, we investigate the depth paradigm and offer an NNGP induced by increasing depth, which not only complements the existing theory to a good degree but also enhances our understanding of the true picture of "deep" learning.
Developments of NNGPs. Recent years have witnessed a growing interest in neural network Gaussian processes. NNGPs can provide a quantitative characterization of how likely certain outcomes are when some aspects of the system are not exactly known. In the experiments of lee2017deep, an explicit estimate in the form of a variance prediction is given for each test sample. Besides, pang2019neural showed that the NNGP is good at handling noisy data and is superior to discretizing differential operators in solving some linear or nonlinear partial differential equations. park2020towards employed the NNGP kernel in the performance measurement of network architectures for the purpose of speeding up neural architecture search. dutordoir2020bayesian presented the translation-insensitive convolutional kernel by relaxing the translation invariance of deep convolutional Gaussian processes. lu2020interpretable proposed an interpretable NNGP by approximating an NNGP with its low-order moments.
6 Conclusions and Prospects
In this paper, we have presented the first NNGP induced by depth, based on a width-depth symmetry consideration. We formulate the associated NNGP kernel via a network with increasing depth. Next, we have characterized the basic properties of the proposed NNGP kernel by proving its uniform tightness and estimating its smallest eigenvalue, respectively. Such results serve as a solid basis for future research, such as the generalization and optimization of the NNGP and Bayesian inference with the NNGP. Lastly, we have conducted regression experiments on image classification and shown that the proposed NNGP kernel achieves performance comparable to that of the NNGP kernel. Future efforts can be devoted to scaling the proposed NNGP kernel to more applications.