Neural Network Gaussian Processes by Increasing Depth

08/29/2021
by Shao-Qun Zhang, et al.
Rensselaer Polytechnic Institute

Recent years have witnessed an increasing interest in the correspondence between infinitely wide networks and Gaussian processes. Despite the effectiveness and elegance of the current neural network Gaussian process theory, to the best of our knowledge, all the neural network Gaussian processes are essentially induced by increasing width. However, in the era of deep learning, what concerns us more regarding a neural network is its depth as well as how depth impacts the behaviors of a network. Inspired by a width-depth symmetry consideration, we use a shortcut network to show that increasing the depth of a neural network can also give rise to a Gaussian process, which is a valuable addition to the existing theory and contributes to revealing the true picture of deep learning. Beyond the proposed Gaussian process by depth, we theoretically characterize its uniform tightness property and the smallest eigenvalue of its associated kernel. These characterizations can not only enhance our understanding of the proposed depth-induced Gaussian processes, but also pave the way for future applications. Lastly, we examine the performance of the proposed Gaussian process by regression experiments on two real-world data sets.



1 Introduction

Currently, kernel methods and deep neural networks are two of the most remarkable machine learning methodologies. Recent years have witnessed lots of works on the connection between these two methodologies.

lee2017deep pointed out that randomly initializing the parameters of an infinitely wide network gives rise to a Gaussian process, which is referred to as the neural network Gaussian process (NNGP). Due to the widespread attraction of this idea, NNGP results have been extended to more types of networks, such as attention-based models hron2020infinite and recurrent networks yang2019:rnn.

A Gaussian process is a classical non-parametric model. The equivalence between an infinitely wide fully-connected network and a Gaussian process was established by neal1996:priors and lee2017deep. Given a fully-connected multi-layer network whose parameters are randomly initialized i.i.d., the output of each neuron is an aggregation of i.i.d. outputs of neurons in the preceding layer. When the network width goes to infinity, according to the Central Limit Theorem (fischer2010history), the output of each neuron converges to a Gaussian distribution. As a result, the function expressed by the network is essentially a Gaussian process. This correspondence between neural networks and Gaussian processes allows exact Bayesian inference using the neural network (lee2017deep).
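A minimal sketch (not from the paper) of the width-induced construction just described: over repeated i.i.d. initializations, the scalar output of a one-hidden-layer network is a sum of many i.i.d. terms, so its distribution approaches a Gaussian as the width grows. The sizes, the tanh nonlinearity, and the variance scalings below are illustrative assumptions.

import numpy as np

def sample_output(width, x, n_draws=2000, sigma_w=1.0, sigma_b=0.1):
    # Draw the scalar output of a randomly initialized one-hidden-layer network
    # n_draws times; each draw is one sample of f(x) under random initialization.
    d = x.shape[0]
    outs = np.empty(n_draws)
    for i in range(n_draws):
        W1 = np.random.randn(width, d) * sigma_w / np.sqrt(d)
        b1 = np.random.randn(width) * sigma_b
        W2 = np.random.randn(width) * sigma_w / np.sqrt(width)
        b2 = np.random.randn() * sigma_b
        h = np.tanh(W1 @ x + b1)      # hidden post-activations
        outs[i] = W2 @ h + b2         # output: a sum of `width` i.i.d. terms
    return outs

x = np.random.randn(16)
for width in (4, 64, 1024):
    f = sample_output(width, x)
    kurt = ((f - f.mean()) ** 4).mean() / f.var() ** 2
    print(width, round(f.var(), 3), round(kurt, 3))  # kurtosis drifts toward 3 (Gaussian) as width grows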

Despite the achievements of the current NNGP theory, an important limitation remains unaddressed. So far, the neural network Gaussian process is essentially induced by increasing width, regardless of how many layers are stacked in a network. But in the era of deep learning, what concerns us more regarding a neural network is its depth and how the depth affects the behaviors of the network, since depth is the major element accountable for the power of deep learning. Although the current NNGP theory is beautiful and elegant in its form, it unfortunately cannot accommodate this concern adequately. Therefore, it is highly necessary to expand the scope of the existing theory to include the depth issue. Specifically, our natural curiosity is what happens if we have an infinitely deep but finitely wide network. Can we derive an NNGP by increasing depth rather than increasing width, which would contribute to understanding the true picture of deep learning? If this question can be answered in the affirmative, we are able to reconcile the successes of deep networks with the elegance of the NNGP theory. Moreover, as a valuable addition, the depth-induced NNGP greatly enlarges the scope of the existing NNGP theory, which is posited to open many doors for research and translation opportunities in this area.

Figure 1: A deep topology that can induce a neural network Gaussian process by increasing depth.

The above idea is well motivated by a width-depth symmetry consideration. Previously, lu2017expressive and hornik1989multilayer respectively proved that width-bounded and depth-bounded neural networks are universal approximators. fan2020quasi suggested that a wide network and a deep network can be converted into each other with a negligible error by using De Morgan's law. Since there exists a degree of symmetry between width and depth, it is plausible that deepening a neural network under certain conditions can also lead to an NNGP. Along this direction, we investigate the feasibility of inducing an NNGP by depth (NNGP), with a network that has a topology of shortcuts, as shown in Figure 1. The characteristic of this topology is that the outputs of intermediate layers with a fixed gap are aggregated in the final layer, yielding the network output. Such a shortcut topology has been successfully applied to medical imaging you2019ct and computer vision fan2018sparse as a backbone structure.

An NNGP by width (NNGP) is obtained by summing the i.i.d. output terms of infinitely many neurons and applying the Central Limit Theorem. In contrast, for the topology in Figure 1, as the depth increases, the outputs of increasingly many neurons are aggregated together. We constrain the random weights and biases such that the summed neurons become weakly dependent by virtue of their separation. Consequently, when going infinitely deep, the network illustrated in Figure 1 is also a function drawn from a Gaussian process, according to the generalized Central Limit Theorem under weak dependence (b1995:clt). Beyond the proposed NNGP, we theoretically prove that it is uniformly tight and provide a tight bound on the smallest eigenvalue of the associated NNGP kernel. From the former, one can determine properties of the NNGP such as its functional limit and continuity; the latter's non-trivial lower and upper bounds mirror the characteristics of the derived kernel and constitute a fundamental cornerstone for the optimization and generalization theory of deep learning.

Main Contributions.

In this work, we establish the NNGP by increasing depth, in contrast to the present mainstream NNGPs that are induced by increasing width. Our work substantially enlarges the scope of the existing elegant NNGP theory, making a stride towards understanding the true picture of deep learning. Furthermore, we investigate the essential properties of the proposed NNGP and its associated kernel, which lays a solid foundation for future research and applications. Lastly, we implement an NNGP kernel and apply it to regression experiments on two well-known data sets.

2 Preliminaries

Let  be the set for an integer . Given a function , we denote by  if there exist positive constants  and  such that  for every ;  if there exist positive constants  and  such that  for every ; and  if there exist positive constants  and  such that  for every . Let  denote the matrix norm of the matrix . Throughout this paper, we employ the spectral norm, i.e., the largest singular value of the matrix, as the matrix norm (meyer2000:norm). Let  denote the number of elements, e.g., . Finally, we provide several definitions for the characterization of inputs and parameters.
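As a quick aside on this convention (an illustrative snippet, not part of the paper), the spectral norm of a matrix equals its largest singular value and can be checked in NumPy as follows.

import numpy as np

A = np.random.randn(5, 3)
sigma_max = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value
assert np.isclose(sigma_max, np.linalg.norm(A, 2))   # equals the induced 2-norm of A
print(sigma_max)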

Definition 1

A data distribution is said to be well-scaled, if the following conditions hold for :

  1. ;

  2. ;

  3. .

Definition 2

A function  is said to be well-posed if it is first-order differentiable and its derivative is bounded by a certain constant. In particular, the commonly used activation functions such as ReLU, tanh, and sigmoid are well-posed (see Table 1).

Definition 3

A matrix is said to be stable-pertinent for a well-posed activation function , in short , if the inequality holds.

Activation | Well-posedness
ReLU       | well-posed (bounded derivative)
sigmoid    | well-posed (bounded derivative)

Table 1: Well-posedness of the commonly used activation functions.

3 Main Results

In this section, we formally present the neural network Gaussian process NNGP induced by an infinitely deep but finitely wide neural network with i.i.d. weight parameters. We also derive valuable characterizations of the NNGP and its associated NNGP kernel: uniform tightness as the depth increases, and a bound on the kernel's smallest eigenvalue.

3.1 Neural Network Gaussian Process with Increasing Depth

Consider an -layer neural network whose topology is illustrated in Figure 1; the feed-forward propagation follows

(1)

where  and  are the weight matrix and bias vector of the corresponding layer, respectively, and  is the activation function. Owing to the shortcut connections, the final output of this network is a mean over () previous layers with an equal separation,

(2)

where the ones matrix indicates the unit shortcut connection between the corresponding hidden layer and the final layer, and  denotes the number of summed hidden neurons.
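To make the construction concrete, the following sketch implements one reading of Eqs. (1) and (2): hidden layers propagate as usual, every layer separated by a fixed gap is routed to the output through a unit shortcut, and the routed outputs are averaged. The width, depth, gap, tanh activation, and initialization scales are illustrative assumptions, not the paper's settings.

import numpy as np

def shortcut_forward(x, weights, biases, gap):
    # Eq. (1): h_l = activation(W_l h_{l-1} + b_l); Eq. (2): the network output is
    # the mean of the hidden outputs collected every `gap` layers via unit shortcuts.
    h = x
    collected = []
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        h = np.tanh(W @ h + b)
        if l % gap == 0:
            collected.append(h)
    return np.mean(collected, axis=0)

# Illustrative sizes: width 10, depth 60, separation (gap) 5; input dimension = width.
width, depth, gap = 10, 60, 5
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]
biases = [rng.normal(0.0, 0.1, size=width) for _ in range(depth)]
x = rng.normal(size=width)
print(shortcut_forward(x, weights, biases, gap))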

Let  be the concatenation of all vectorized weight matrices and bias vectors. Regarding the neural network defined above, we present the first main theorem as follows.

Theorem 1

The infinitely deep neural network, defined by Eqs. (1) and (2), is equivalent to a Gaussian process NNGP, if is well-posed and the augmented parameter matrix of each layer is stable-pertinent for , that is, , for .

Theorem 1 states that our proposed neural network converges to a Gaussian process as the depth goes to infinity. Given a data set , the limit output variables of this network follow a multivariate Gaussian distribution whose mean equals zero and whose covariance matrix is an  matrix, the -entry of which is defined as

(3)

The key idea in proving Theorem 1 is to show that our proposed neural network converges to a Gaussian process as the depth increases, according to the generalized Central Limit Theorem for weakly dependent variables instead of i.i.d. ones. To implement this idea, we intentionally constrain the weights and biases so that the random variables of two hidden layers with a sufficient separation degenerate to weak dependence, i.e., they form a mixing process. By aggregating the weakly dependent variables at the final layer via shortcut connections, the output of the proposed network converges to a Gaussian process as the depth goes to infinity. The key steps are formally stated in Lemmas 1 and 2 as follows.

Lemma 1

Provided that the activation is well-posed and the parameter matrices are stable-pertinent, the concerned neural network comprises a stochastic sequence of weakly dependent variables as the depth goes to infinity.

Proof.  Let  denote the distribution of the random variable sequence , where  indicates the vector of random variables before the timestamp . We define a coefficient as

where  stands for a conditional probability distribution and  denotes a probability measure, or equivalently the -algebra of events (joe1997:Sklar), which satisfies

for two probability distributions  and . According to Eq. (1), we have

where

Given the well-posed activation and stable-pertinent parameter matrices, i.e.,  for any , it holds true that

where  and . Thus, we have

Therefore, the sequence led by Eq. (1) is -mixing, or equivalently weakly dependent, which completes the proof.

Lemma 2

Suppose that (i) a random variable sequence is weakly dependent, satisfying -mixing with an exponential convergence rate, and (ii) for , we have

Let , then we have

Further, the limit variable converges in distribution to as , provided .

Lemma 2 is a variant of the generalized Central Limit Theorem under weak dependence. The proof idea can be summarized as follows. From (doukhan2012:mixing), it is observed that an -mixing sequence with an exponential convergence rate can be covered by an -mixing one with . Thus, the conditions of Lemma 2 satisfy the preconditions of the generalized Central Limit Theorem under weak dependence (b1995:clt, Theorem 27.5). Alternative proofs of this lemma can be found in the encyclopedic treatment of limit theorems under mixing conditions; interested readers may refer to bradley2007:mixing for more details.
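For orientation, the generic shape of such a mixing central limit theorem can be written as follows; this display is only a schematic paraphrase, and the precise moment and mixing-rate conditions are those of the cited theorem.

% Schematic CLT under strong (alpha-) mixing: for a zero-mean sequence {X_t} with
% uniformly bounded higher moments and mixing coefficients alpha(k) decaying
% sufficiently fast,
\[
  \frac{1}{\sigma_n} \sum_{t=1}^{n} X_t \;\xrightarrow{d}\; \mathcal{N}(0, 1),
  \qquad
  \sigma_n^2 = \operatorname{Var}\!\Big(\sum_{t=1}^{n} X_t\Big),
\]
% provided sigma_n^2 diverges, so that no single summand dominates the sum.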

Proof of Theorem 1. Let  denote the output variables of the -th layer, which satisfy  and . Because the weights and biases are taken to be i.i.d., the sequence forms a stochastic process, and the post-activations in the same layer, such as  and , are independent for . Given an integer , we select a sub-sequence of  as follows:

for  and , which satisfies . From Lemma 1, this sub-sequence forms a weakly dependent stochastic process. Aggregating this sub-sequence with shortcut connections to the output layer, the output of the concerned neural network converges to a Gaussian process as  as well as , from Lemma 2.

Remark. To the best of our knowledge, our proposed NNGP is the first NNGP induced by increasing depth. Currently, there is no rigorous definition of width and depth. Our usage of depth simply aligns with the conventional usage for a neural network, in which the depth is understood as the maximum number of neurons among all possible routes from the input to the output, and the width is the maximum number of neurons in a layer. As illustrated in Figure 2, if examined in an unraveled view, our network is simultaneously wide and deep due to the layer reuse across different routes. However, we argue that this does not affect our claim, because not every layer has an infinite width in the unraveled view, which differs from the key character of the width-induced NNGP. Moreover, the conventional usage is more widely accepted than the unraveled view; otherwise, it would go against common sense to also regard ResNet as a wide network.

Figure 2: Both ResNet and ours can be regarded as wide networks in the unraveled view.

3.2 Uniform Tightness of NNGP

In this subsection, we delineate the asymptotic behavior of NNGP as the depth goes to infinity. Here, we assume that the weights and biases are i.i.d. sampled from . Per the conditions of Theorem 1, we have the following theorem.

Theorem 2

For any , the stochastic process, described in Lemma 1, is uniformly tight in .

Theorem 2 reveals that the stochastic process induced by our network (illustrated in Figure 1) is uniformly tight, which is an intrinsic characteristic of the proposed NNGP. Based on Theorem 2, one can obtain the functional limit and continuity properties of the NNGP, in analogy to the results for the width-induced NNGP (bracale2020:asymptotic).

Similarly, we start the proof of Theorem 2 with some useful lemmas.

Lemma 3

Let denote a sequence of random variables in . This stochastic process is uniformly tight in , if (1) is a uniformly tight point of () in ; (2) for any and , there exist , such that .

Lemma 3 is the core guidance for proving Theorem 2. This lemma can be straightforwardly derived from the Kolmogorov Continuity Theorem (stroock1997:Kolmogorov), provided the Polish space .

Lemma 4

Based on the notations of Lemma 3, is a uniformly tight point of () in .

Proof.  It suffices to prove that 1)  is a tight point of () in  and 2) the statistic converges in distribution as . Note that 1) is self-evident since every probability measure in  is tight, and 2) has been proved by Theorem 1. This completes the proof.

Remark. Notice that the convergence in distribution from Lemmas 2 and 4 paves the way for the convergence of expectations. Specifically, given a linear and bounded functional  and a function  satisfying , we have  and  according to the General Transformation Theorem (van2000:asymptotic, Theorem 2.3) and Uniform Integrability (billingsley2013:convergence), respectively. These results may serve as a solid basis for future applications of the NNGP.

Lemma 5

Based on the notations of Lemma 3, for any and , there exist , such that

Proof.  This proof follows mathematical induction. Before that, we show the following preliminary result. Let  be one element of the augmented parameter matrix at the -th layer; then we can formulate its characteristic function as

where  denotes the imaginary unit. Thus, the variance of the hidden random variables at this layer becomes

(4)

Since the activation is a well-posed function and , we affirm that is Lipschitz continuous (with Lipschitz constant ).

Now we start the mathematical induction. When , for any and , we have

where . By the induction hypothesis, for , we have

Thus, one has

(5)

where

Thus, Eq. (5) becomes

where

Iterating this argument, we obtain

where

The above induction holds for any positive even . Let , then this lemma is proved as desired.

Finally, Theorem 2 is proved by plugging Lemmas 4 and 5 into Lemma 3.

3.3 Tight Bound for the Smallest Eigenvalue

In this subsection, we provide a tight bound for the smallest eigenvalue of the NNGP kernel. For the NNGP with ReLU activation, we have the following theorem.

Theorem 3

Suppose that are i.i.d. sampled from and is a well-scaled distribution, then for an integer , with probability , we have , where

Theorem 3 provides a tight bound for the smallest eigenvalue of the NNGP kernel. This nontrivial estimate mirrors the characteristics of the kernel and is usually used as a key assumption in optimization and generalization analyses.

The key idea of proving Theorem 3 is based on the following inequalities about the smallest eigenvalue of real-valued symmetric square matrices. Given two symmetric matrices , it’s observed that

(6)
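For reference, for real symmetric matrices A and B, the standard inequalities of this kind (consequences of Weyl's inequality) read:

\[
  \lambda_{\min}(A + B) \;\ge\; \lambda_{\min}(A) + \lambda_{\min}(B),
  \qquad
  \lambda_{\max}(A + B) \;\le\; \lambda_{\max}(A) + \lambda_{\max}(B).
\]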

From Eqs. (2) and (3), we can unfold  as a sum of covariances of the sequence of random variables . Thus, we can bound  by  via a chain of feed-forward compositions in Eq. (1). We begin this proof with the following lemmas.

Lemma 6

Let be a Lipschitz continuous function with constant and denote the Gaussian distribution , then for , there exists , s.t.

(7)

Lemma 6 shows that the Gaussian distribution corresponding to our samples satisfies the log-Sobolev inequality (i.e., Eq. (7)) with constants that do not depend on the dimension. This result also holds for uniform distributions on the sphere or the unit hypercube (nguyen2021:eigenvalues). Detailed proofs can be found in the Appendix.

Lemma 7

Suppose that are i.i.d. sampled from , then with probability , we have

for , where

Proof.  From Definition 1, we have

Since are i.i.d. sampled from , for , we have with probability at least . Provided , the single-sided inner product is Lipschitz continuous with the constant . Thus, from Lemma 6, for , we have

Thus, for , we have

We complete the proof by setting .

Proof of Theorem 3. We start this proof with some notation. Recall the empirical NNGP kernel . For convenience, we set . We also abbreviate the covariance as  and fix  throughout this proof.

Unfolding Eq. (3), we have

(8)

where

in which the subscript indicates the -th element of the vector . From Theorem 1, the sequence of random variables  is weakly dependent with  as . Thus,  is negligible relative to  when  and  is sufficiently large.

Plugging Eq. (6) into Eq. (8), we have

(9)
(10)

Iterating Eq. (10) and then substituting it into Eq. (9), we have

(11)

From the Hermite expansion of the ReLU function, we have

(12)

where indicates the expansion order. Thus, we have

(13)

where the superscript denotes the -th Khatri-Rao power of the matrix , the first inequality follows from Eq. (12), the second one follows from the Gershgorin Circle Theorem (salas1999gershgorin), and the third one follows from Lemma 7. Therefore, we obtain the lower bound on the smallest eigenvalue by plugging Eq. (13) into Eq. (11).

On the other hand, it’s observed that for ,

(14)

Thus, we have

where the second inequality follows from Eq. (8), the third one follows from Eq. (14), and the fourth one follows from Lemma 7. This completes the proof.

4 Experiments

Generally, depth can endow a network with a more powerful representation ability than width. However, it is unclear whether this superiority of depth is sustained in the NNGP setting, where all parameters are random rather than trained. In other words, it is unclear whether the established depth-induced NNGP is more expressive than the width-induced NNGP. To answer this question, in this section we apply the NNGP kernel to the generic regression task and compare its performance on the Fashion-MNIST (FMNIST) and CIFAR10 data sets with that of the width-induced NNGP.

NNGP regression. Given the data set , where  is the input and  is the corresponding label, our goal is to predict  for the test sample . From Theorem 1,  and  follow a multivariate Gaussian process , whose mean equals zero and whose covariance matrix has the following form

(15)

where  is an  matrix computed by Eq. (3), and the -th element of  is  for . Eq. (15) provides a block partition corresponding to the training set and the test sample, respectively. Thus, we have  with

(16)

where  denotes the label vector. When the observations are corrupted by additive Gaussian noise , Eq. (16) becomes

(17)

where is the identity matrix.
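As a concrete illustration of the prediction rule in Eqs. (16) and (17), the following sketch computes the posterior mean from a precomputed kernel. The function and variable names are illustrative, not the paper's notation.

import numpy as np

def gp_predict_mean(K_train, k_star, y, sigma_noise=0.0):
    # K_train: (n, n) kernel matrix over training inputs; k_star: (n,) kernel values
    # between the test input and the training inputs; y: (n,) or (n, c) targets.
    # With observation noise, Eq. (16) becomes Eq. (17) by adding sigma^2 I.
    n = K_train.shape[0]
    K_noisy = K_train + (sigma_noise ** 2) * np.eye(n)
    alpha = np.linalg.solve(K_noisy, y)   # (K + sigma^2 I)^{-1} y
    return k_star @ alpha                 # posterior mean for the test sample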

For numerical implementation, we calculate the NNGP and NNGP kernels as follows:

(18)

where indicates the deep network or wide network.
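One way to realize the empirical kernels of Eq. (18) numerically is a Monte Carlo average of output products over repeated random initializations; the sketch below follows that reading, with the initialization and forward pass supplied as callables since the exact parameterization is only summarized in the text.

import numpy as np

def empirical_kernel(X, init_fn, forward_fn, n_init=50):
    # Monte Carlo estimate of the NNGP kernel in Eq. (18).
    # X:          (n, d) array of inputs.
    # init_fn:    callable returning a fresh set of random parameters.
    # forward_fn: callable (params, x) -> scalar output of the random network.
    # K[i, j] averages f(x_i) * f(x_j) over independent re-initializations.
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_init):
        params = init_fn()
        f = np.array([forward_fn(params, x) for x in X])
        K += np.outer(f, f)
    return K / n_init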

Experimental setups. We conduct regression experiments on the FMNIST and CIFAR10 data sets. We respectively sample 1k, 2k, and 3k examples from the training sets to construct the two kernels, and then evaluate the kernels on the test sets. We employ a one-hidden-layer wide network for computing the width-induced NNGP kernel, whereas the width of the deep network is set to the number of classes, which is the smallest possible width for prediction tasks. For a fair comparison, the depth of the deep network and the width of the wide network are set equal (). For the regression tasks, the class labels are encoded into an apposite regression format where incorrect classes are  and the correct class is  (lee2017deep). Both networks employ  as the activation function. Following the setting of lee2017deep, all weights are initialized from a Gaussian distribution with mean  and variance  for per-layer normalization, where  is the number of neurons in the -th layer. The initialization is repeated  times to compute the empirical statistics of the two kernels based on Eq. (18). We also run each experiment  times to compute the mean and variance of the accuracy. All experiments are conducted on an Intel Core i7-6500U.

Model | Training size | FMNIST test accuracy | CIFAR10 test accuracy
NNGP  | 1k            | 0.345±0.016          | 0.166±0.018
NNGP  | 1k            | 0.342±0.021          | 0.187±0.018
NNGP  | 2k            | 0.352±0.019          | 0.178±0.007
NNGP  | 2k            | 0.373±0.030          | 0.188±0.012
NNGP  | 3k            | 0.372±0.024          | 0.182±0.005
NNGP  | 3k            | 0.365±0.007          | 0.185±0.019

Table 2: Test accuracy of regression experiments based on the width- and depth-induced NNGP kernels.

Results. Table 2 lists the performance of the regression experiments using the width- and depth-induced NNGP kernels. The test accuracies of the two kernels are comparable, which implies that they are similar in representation ability. We think the reason is as follows: neither kernel is a stacked kernel, and their difference lies mainly in whether independent or weakly dependent variables are aggregated. Thus, their discriminative abilities should be similar (lee2017deep).

Next, we use the angular plot to investigate how the separation affects the representation ability of the NNGP kernel. The angle is computed according to

and the angular plot depicts the relationship between kernel values and angles. If an angular plot stays near zero, the kernel cannot well distinguish the samples; otherwise, the kernel is regarded as having better discriminative ability. We fix the network depth so that the NNGP kernel is empirically computed by aggregating shortcut connections with a given separation between neighboring shortcut connections. Figure 3 illustrates the angularities of NNGP kernels with different separations for the FMNIST-1k training data. It is observed that the angular plot of the kernel with the larger separation is compressed closer to zero relative to that of the kernel with the smaller separation. This result suggests that the separation should be set to a small number to obtain a powerful NNGP kernel.
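The angle between two inputs used for this plot is the usual arccosine of their normalized inner product; a small illustrative helper (not the paper's code) is given below.

import numpy as np

def pairwise_angles(X):
    # Angles theta(x, x') = arccos(<x, x'> / (||x|| ||x'||)) between all input pairs.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(Xn @ Xn.T, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return np.arccos(cos)

# An angular plot pairs these angles with the corresponding kernel entries K[i, j];
# curves hugging zero indicate a kernel that barely distinguishes the samples.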

5 Related Work

Deep Learning and Kernel Methods. There have been great efforts on the correspondence between deep neural networks and Gaussian processes. neal1996:priors presented the seminal work by showing that a one-hidden-layer network of infinite width turns into a Gaussian process. cho:MKMs linked multi-layer networks using rectified polynomial activations with compositional Gaussian kernels. lee2017deep showed that infinitely wide fully-connected neural networks with commonly used activation functions converge to Gaussian processes. Recently, the NNGP has been extended to many types of networks, including Bayesian networks (novak2018bayesian), deep convolutional networks (garriga2019:bayesian), and recurrent networks (yang2019:rnn). Furthermore, wang2020:bridging wrote an inclusive review of studies connecting neural networks and kernel learning. Despite the great progress made, all existing works on NNGPs rely on increasing width to induce the Gaussian processes, whereas we investigate the depth paradigm and offer an NNGP by increasing depth, which not only complements the existing theory but also enhances our understanding of the true picture of "deep" learning.

Figure 3: Angularities of NNGP kernels with various .

Developments of NNGPs. Recent years have witnessed a growing interest in neural network Gaussian processes. NNGPs can provide a quantitative characterization of how likely certain outcomes are when some aspects of the system are not exactly known. In the experiments of lee2017deep, an explicit estimate in the form of a variance prediction is given for each test sample. Besides, pang2019neural showed that the NNGP is good at handling noisy data and is superior to discretizing differential operators for solving some linear or nonlinear partial differential equations. park2020towards employed the NNGP kernel in the performance measurement of network architectures to speed up neural architecture search. dutordoir2020bayesian presented the translation-insensitive convolutional kernel by relaxing the translation invariance of deep convolutional Gaussian processes. lu2020interpretable proposed an interpretable NNGP by approximating an NNGP with its low-order moments.

6 Conclusions and Prospects

In this paper, we have presented the first NNGP induced by depth, based on a width-depth symmetry consideration. We formulate the associated NNGP kernel with a network of increasing depth. Next, we have characterized the basic properties of the proposed NNGP kernel by proving its uniform tightness and estimating its smallest eigenvalue, respectively. Such results serve as a solid basis for future research, such as the generalization and optimization of the depth-induced NNGP and Bayesian inference with it. Lastly, we have conducted regression experiments on image classification and shown that our proposed NNGP kernel achieves performance comparable to the width-induced NNGP kernel. Future efforts can be devoted to scaling the proposed NNGP kernel to more applications.

References