Neural networks have shown unprecedented success in the hands of practitioners in several fields. Examples of successful applications include image classification (krizhevsky2012imagenet; lecun2015deep) and generative modeling (goodfellow2014generative; arjovsky2017wasserstein). This practical success has accelerated the development of analytical tools for understanding the properties of neural networks. A long-standing dilemma in the analysis of neural networks is their favorable generalization error despite typically being extremely overparameterized. It is well known that classic statistical learning methods for bounding the generalization error, such as the Vapnik–Chervonenkis dimension (bartlett1998sample; anthony2009neural) and Rademacher complexity (bartlett2002rademacher; neyshabur2015norm), often result in vacuous bounds in the case of neural networks (dziugaite2017computing).
Motivated by the problem mentioned above, we investigate generalization bounds for overparameterized neural networks. In particular, let be a possibly noisy dataset generated according to a true model . Let be the model learned by gradient-based training of an overparameterized neural network on . We are interested in bounding uniformly over all test points belonging to a particular domain .
A recent line of work has shown that gradient-based learning of certain overparameterized neural networks can be analyzed using a projection on the tangent space. This leads to the equivalence of the neural network solution to kernel ridge regression, that is, a linear regression in the possibly infinite-dimensional feature space of the corresponding kernel. The particular kernel resulting from this approach is called the neural tangent (NT) kernel (jacot2018neural). This equivalence is based on the observation that, during training, the parameters of the neural network remain within a close vicinity of the random initialization. This regime is sometimes referred to as lazy training (chizat2018lazy).
We propose the maximal information gain (MIG), , of the NT kernel as a measure of the complexity of the learning problem (see Section 4.1 for the definition of ). The MIG is closely related to the effective dimension of the NT kernel in its feature space (see Section 4.1). We use the Mercer decomposition of the NT kernel and some recent results on its spectral properties to prove a bound on its MIG. We then show how a bound on the MIG can be used to bound the generalization error.
The contributions of this paper are summarized as follows.
We propose the MIG of the NT kernel as a measure of complexity of the learning problem in overparameterized neural networks. We provide data-independent and kernel-specific bounds on the MIG, which can be used to prove novel bounds on the generalization error of overparameterized neural networks in kernel regimes. Unlike most classical statistical learning methods, which result in vacuous error bounds, we analytically prove decaying error bounds for networks with an infinite number of parameters. More importantly, our results precisely capture the rate of decay of the error bound depending on the properties of the activation functions.
Our analysis crucially uses Mercer’s theorem and the spectral decomposition of the NT kernel in the basis of spherical harmonics (which are special functions on the surface of the hypersphere; see Section 3.1). We derive bounds on the decay rate of the Mercer eigenvalues (referred to as the eigendecay) based on the smoothness properties of the activation functions. These eigendecays are obtained by a careful characterization of a recent result relating the eigendecay to the endpoint expansions of the kernel given in (bietti2020deep). In particular, we consider times differentiable activation functions . We show that, with these activation functions and a hyperspherical support, the eigenvalues of the corresponding NT kernel decay at the rate . Our bounds on MIG are of the form (throughout the paper, the notations and are used to denote standard mathematical order, and order up to logarithmic factors, respectively), which we show to imply error bounds of the form when belongs to the corresponding RKHS and is distributed in a sufficiently informative fashion. In particular, we consider a that is collected sequentially in a way that maximally reduces the uncertainty in the model. This results in a quasi-uniform dataset over the entire domain (see Section 5.2 for the details).
We note that the RKHS of the NT kernel is a very general class of functions. In particular, as a byproduct of our results, we show that the RKHS of the NT kernel with activation function is equivalent to the RKHS of a Matérn kernel with smoothness parameter . The RKHS of the Matérn family of kernels is known to uniformly approximate almost all continuous functions (srinivas2009gaussian). It was previously shown that , and consequently the RKHS of the corresponding NT kernel is equivalent to that of the Laplace kernel (a special Matérn kernel with ) (geifman2020similarity; chen2020deep). Our contribution includes generalizing this result by showing the equivalence between the RKHS of the NT kernel, under various smoothness of the activation functions, and that of a Matérn kernel with the corresponding smoothness (see Section 3.3).
In Section 6, we provide a discussion of the implications of our results for regret bounds in reinforcement learning (RL) and bandit problems that use overparameterized neural network models.
As an intermediate step, we also consider the simpler random feature (RF) model, which is the result of partial training of the neural network. Also referred to as the conjugate kernel or neural network Gaussian process (NNGP) model (cho2012kernel; daniely2016toward; lee2017deep; matthews2018gaussian), this model corresponds to the case where only the last layer of the neural network is trained (see Section 2 for the details). A summary of our results on the MIG and error bounds is reported in Table 1.
We note that the decay rate of the error bound is faster for the RF kernel with the same activation function. Also, the decay rate of the error bound is faster when is larger. In interpreting these results, however, we should take into account the RKHS of the corresponding kernels. The RKHS of the RF kernel is smaller than the RKHS of the NT kernel. Also, when is larger, the RKHSs are smaller for both the RF and NT kernels.
1.2 Related Work
There are classical statistical learning methods that address the generalization error of machine learning models. Two notable approaches are the Vapnik–Chervonenkis (VC) dimension (bartlett1998sample; anthony2009neural) and Rademacher complexity (bartlett2002rademacher; neyshabur2015norm). In the former, the expected generalization loss is bounded by the square root of the VC dimension divided by the square root of . In the latter, the expected generalization loss scales with the Rademacher complexity of the hypothesis class. For a layer neural network of width , both of these approaches are shown to result in a generalization bound of that is vacuous for overparameterized neural networks with a very large .
A more recent approach to studying the generalization error is PAC-Bayes. It provides non-vacuous error bounds that depend on the distribution of parameters (dziugaite2017computing). However, the distribution of parameters must be known, or estimated, in order to use those bounds. In contrast, our bounds are explicit and are much stronger in the sense that they hold uniformly in .
The recent work of wang2020regularization proved that the norm of the error is bounded as , for a two-layer neural network with ReLU activation functions. bordelon2020spectrum decomposed the norm of the error into a series corresponding to the eigenfunctions of the kernel and provided bounds based on the corresponding eigenvalues. In contrast, our results are stronger because they are given in terms of the absolute error instead of a norm. In addition, we provide explicit rates for the error decay depending on the activation functions.
The MIG has been studied for popular kernels such as the Matérn and squared exponential kernels (srinivas2009gaussian; vakili2021information). Our proof technique is similar to that of vakili2021information. Their analysis, however, does not directly apply to the NT kernel on the hypersphere, because vakili2021information assume uniformly bounded eigenfunctions. In our analysis, we use a Mercer decomposition of the NT kernel in the basis of spherical harmonics, which are not uniformly bounded. Technical details are provided in the analysis of Theorem 4. Applying a proof technique similar to srinivas2009gaussian results in suboptimal bounds in our case.
1.3 General Notation
We use the notations and , respectively. For a matrix (a vector ), () denotes its transpose. In addition, and are used to denote the determinant of and its logarithm, respectively. The notation is used to denote the norm of a vector . For and a symmetric positive definite matrix , denotes a normal distribution with mean and covariance . The Kronecker delta is denoted by . The notation denotes the -dimensional hypersphere in . For example, is the usual sphere. For two sequences and , we use the notation when and .
2 Overparameterized Neural Networks in Kernel Regimes
In this section, we briefly overview the equivalence of overparameterized neural network models to kernel methods. In overparameterized regimes, it was shown that gradient-based training of the neural network reaches a global minimum while the weights remain within a close vicinity of the random initialization. In particular, let denote a network with a large width, parameterized by . The model can be approximated by its linear projection on the tangent space at random initialization as
The approximation error can be bounded by the second-order term , which is shown to diminish as grows, where is the spectral norm of the Hessian matrix of (liu2020linearity). The approximation becomes an equality when the width of the hidden layers approaches infinity. Known as the lazy training regime (chizat2018lazy), this model is equivalent to the NT kernel (jacot2018neural), given by
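At finite width, the NT kernel can be computed directly as the Gram matrix of parameter gradients at initialization. The following minimal numpy sketch illustrates this for a hypothetical width-m two-layer ReLU network with 1/sqrt(m) scaling; the architecture, scaling, and all function names are our own illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_layer_net(params, x):
    """f(x) = v^T relu(W x) / sqrt(m): a width-m two-layer ReLU network."""
    W, v = params
    return v @ np.maximum(W @ x, 0.0) / np.sqrt(v.shape[0])

def param_grad(params, x):
    """Gradient of f(x) w.r.t. all parameters (W, v), flattened."""
    W, v = params
    m = v.shape[0]
    pre = W @ x
    dv = np.maximum(pre, 0.0) / np.sqrt(m)                                 # d f / d v
    dW = ((pre > 0).astype(float) * v)[:, None] * x[None, :] / np.sqrt(m)  # d f / d W
    return np.concatenate([dW.ravel(), dv])

def empirical_ntk(params, X):
    """K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> at the given parameters."""
    G = np.stack([param_grad(params, x) for x in X])
    return G @ G.T

d, m, n = 3, 2000, 5
params = (rng.standard_normal((m, d)), rng.standard_normal(m))
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # put inputs on the unit sphere
K = empirical_ntk(params, X)                    # empirical NT kernel matrix
f_vals = np.array([two_layer_net(params, x) for x in X])  # outputs at initialization
```

By construction the resulting matrix is a symmetric positive semi-definite Gram matrix; as the width m grows, it concentrates around the infinite-width NT kernel.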
2.1 Neural Kernels for a Layer Network
Consider the following layer neural network with width and a proper normalization
where is the activation function, is a -dimensional input to the network, , initialized randomly according to , are the weights in the second layer, and , initialized randomly according to , are the weights in the first layer. Then, the NT kernel of this network can be expressed as
where is the derivative of .
In addition, when are fixed at initialization and only are trained with regularization, the model is known to be equivalent to a Gaussian process (GP) (neal2012bayesian), often referred to as a random feature (RF) model. The RF kernel is given as (rahimi2007random)
Note that the RF kernel is the same as the second term in the expression for the NT kernel given in equation 2.
The right-hand side of equation 1 may be scaled by a constant factor , resulting in a scaling of the kernel, for a proper normalization. Such constant scalings do not affect our results on the eigendecay, the MIG, or the rate of the error bounds.
2.2 Rotational Invariance
When the input domain is the hypersphere , by the spherical symmetry of the Gaussian distribution, the neural kernels are invariant to unitary transformations (rotationally invariant kernels are also sometimes referred to as dot product or zonal kernels) and can be expressed as a function of . We use the notations and to refer to the NT and RF kernels with -positively homogeneous activation functions , respectively.
The particular form of the kernel depends on . For example, in the special case of , corresponding to the ReLU activation function, the following closed-form expressions can be given
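These closed-form ReLU expressions can be checked numerically. The sketch below uses the standard arc-cosine formulas for the degree-0 and degree-1 kernels (our normalization convention, which may differ from the paper's by constant factors) and compares them to Monte Carlo averages over Gaussian weights.

```python
import numpy as np

def kappa0(u):
    """Degree-0 arc-cosine kernel: 2 * E[relu'(w.x) relu'(w.x')], w ~ N(0, I)."""
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    """Degree-1 arc-cosine kernel: 2 * E[relu(w.x) relu(w.x')]."""
    return (np.sqrt(1 - u**2) + u * (np.pi - np.arccos(u))) / np.pi

def k_rf(u):
    # RF kernel for ReLU (up to normalization)
    return kappa1(u)

def k_nt(u):
    # NT kernel: first-layer part u * kappa0(u) plus last-layer part kappa1(u)
    return u * kappa0(u) + kappa1(u)

# Monte Carlo check on unit vectors x, x' with inner product u
rng = np.random.default_rng(0)
d, n_mc, u = 3, 400_000, 0.3
x = np.array([1.0, 0.0, 0.0])
xp = np.array([u, np.sqrt(1 - u**2), 0.0])
W = rng.standard_normal((n_mc, d))
mc1 = 2 * np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))  # estimates kappa1(u)
mc0 = 2 * np.mean((W @ x > 0) * (W @ xp > 0))                    # estimates kappa0(u)
```

The Monte Carlo estimates agree with the closed forms up to sampling error, which is the sense in which the infinite-width kernels arise as expectations over the random initialization.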
2.3 Neural Kernels for Multilayer Networks
For an layer network, the NT kernel is recursively given as (jacot2018neural)
Here, corresponds to the RF kernel (all the weights before the last layer are fixed, and only the weights in the last layer are trained), as shown in cho2012kernel; daniely2016toward; lee2017deep; matthews2018gaussian.
3 Mercer’s Theorem and the RKHS
In this section, we overview Mercer’s theorem, as well as the reproducing kernel Hilbert spaces (RKHSs) associated with the kernels, which are crucial for our analysis. Mercer’s theorem (mercer1909functions) provides a spectral decomposition of the kernel in terms of an infinite-dimensional feature map (see, e.g., steinwart2008support, Theorem ).
Theorem 1 (Mercer’s Theorem )
Let be a compact domain. Let be a continuous square integrable kernel with respect to a finite Borel measure . Define a positive definite operator
Then, there exists a sequence of eigenvalue–eigenfunction pairs such that , and , for . Moreover, the kernel function can be represented as
where the convergence of the series holds uniformly on .
Let denote the RKHS corresponding to , defined as a Hilbert space equipped with an inner product satisfying the following: , , and (reproducing property). As a consequence of Mercer’s theorem, can be represented in terms of using Mercer’s representation theorem (see, e.g., steinwart2008support, Theorem ).
Theorem 2 (Mercer’s Representation Theorem)
Let be the same as in Mercer’s Theorem. Then, the RKHS corresponding to is given by
Mercer’s representation theorem indicates that form an orthonormal basis for : . It also provides a constructive definition of the RKHS as the span of this orthonormal basis, and a definition of the norm of a member as the norm of its weight vector.
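Mercer's theorem can be illustrated with a toy example of our own (not from the paper): for the Brownian-motion kernel min(s, t) on [0, 1] with the Lebesgue measure, the Mercer eigenvalues are known in closed form, and discretizing the integral operator on a grid recovers them numerically.

```python
import numpy as np

# Discretize the integral operator (T f)(s) = \int_0^1 min(s, t) f(t) dt on a grid.
# Its eigenvalues approximate the Mercer eigenvalues of k(s, t) = min(s, t).
n = 1000
t = (np.arange(n) + 0.5) / n                # midpoint grid on [0, 1]
K = np.minimum.outer(t, t)                  # kernel matrix on the grid
evals = np.linalg.eigvalsh(K / n)[::-1]     # operator eigenvalues, descending

# Known closed form: lambda_j = 1 / ((j - 1/2)^2 pi^2), eigenfunctions sin((j - 1/2) pi t)
theory = 1.0 / (((np.arange(1, 6) - 0.5) * np.pi) ** 2)
```

The leading numerical eigenvalues match the closed form to within discretization error, mirroring the role the spherical-harmonic decomposition plays for the neural kernels below.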
3.1 Spectral Decomposition of the Neural Kernels
In the case of the RF and NT kernels, when the domain is the hypersphere, as discussed earlier, the kernel takes a rotationally invariant (dot product) form . Rotationally invariant kernels can be diagonalized in the basis of spherical harmonics, which are special functions on the surface of the hypersphere. Specifically, applying Mercer’s theorem to , with being the uniform distribution on , we have
where is the ’th spherical harmonic of degree , and is the corresponding eigenvalue.
We have slightly amended the notation to enumerate the eigenfunctions (spherical harmonics) using two indices, and (instead of the single index used in the presentation of Mercer’s theorem above). This change of notation facilitates a more concise representation due to the multiplicity of the eigenvalues. In particular, the number of spherical harmonics of degree , which all share the same eigenvalue , is . Consequently, the corresponding RKHS can be written as
3.2 The Decay Rate of the Eigenvalues
As discussed, both RF and NT kernels can be represented in the basis of spherical harmonics. Our analysis of the MIG relies on the decay rate of the corresponding eigenvalues. In this section, we derive the eigendecay depending on the particular activation functions.
In the special case of the ReLU activation function, several recent works (geifman2020similarity; chen2020deep; bietti2020deep) have proven an eigendecay for the NT kernel. In this work, we use a recent result from bietti2020deep to derive the eigendecay based on the properties of the activation functions in a more general setting.
Consider the neural kernels on the hypersphere with -positively homogeneous activation functions . Consider the Mercer decomposition of the kernel in the basis of spherical harmonics. Then, the eigendecays are as reported in Table 2.
| Kernel | Eigendecay |
| --- | --- |
| NT | For of the same parity of : |
| NT | For of the opposite parity of : |
| RF | For of the opposite parity of : |
| RF | For of the same parity of : |
Proof Sketch. The proof follows from Theorem of bietti2020deep, which proved that the eigendecay of a rotationally invariant kernel can be determined from its asymptotic expansions around the endpoints . In particular, when
for polynomial and non-integer , the eigenvalues decay at the rate for even , and for odd . A formal statement of this result is given in Appendix A. They derived these expansions for the ReLU activation function. For completeness, in Appendix A, we derive the asymptotic endpoint expansions for the neural kernels with -positively homogeneous activation functions, in the case of and layer networks. Our derivation is based on Lemma 2 in Appendix A, which proves (when the normalization of the kernels ensures for all ). This result allows us to recursively derive the endpoint expansions of from those of through integration. We thus obtain and recursively, resulting in the eigendecays reported in Table 2.
The parity constraints in Table 2 may imply undesirable behavior for layer infinitely wide networks. ronen2019convergence studied this behavior and showed that adding a bias term removes these constraints, resulting in for the NT kernel. Although the parity-constrained eigendecay affects the representation power of the neural kernels, it does not affect our analysis of the MIG and error rates, since for that analysis we only use .
It may appear surprising that decays faster as becomes larger. We should, however, take into account the multiplicity of the eigenvalues corresponding to the spherical harmonics, which implies , for in Mercer’s theorem (see Appendix C). For example, in the case of the NT kernel with , , which shows a slower decay rate of the Mercer eigenvalues with larger .
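The eigendecay on the sphere can be observed numerically in the lowest-dimensional case: on the circle S^1 the spherical harmonics reduce to Fourier modes, so the Mercer eigenvalues of a zonal kernel are the Fourier coefficients of t -> k(cos t). The sketch below (assuming the closed-form two-layer ReLU NT kernel with our own normalization) computes them with an FFT; positive semi-definiteness shows up as nonnegative coefficients, and the parity constraints show up as near-zero coefficients for one parity.

```python
import numpy as np

def k_nt_relu(u):
    """Closed-form two-layer ReLU NT kernel as a function of u = <x, x'>."""
    u = np.clip(u, -1.0, 1.0)
    k0 = (np.pi - np.arccos(u)) / np.pi
    k1 = (np.sqrt(1 - u**2) + u * (np.pi - np.arccos(u))) / np.pi
    return u * k0 + k1

# On S^1 the harmonics are cos(k t), sin(k t); the eigenvalues of the zonal
# kernel are the Fourier cosine coefficients of t -> k(cos t).
n = 4096
t = 2 * np.pi * np.arange(n) / n
coeffs = np.fft.rfft(k_nt_relu(np.cos(t))).real / n
```

Plotting `coeffs` against the degree k on a log-log scale exhibits the polynomial decay discussed above (we do not assert a specific exponent here, since normalization and parity effects enter).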
3.3 Equivalence of the RKHSs of NT and Matérn kernels
Our bounds on the decay rate of the eigenvalues of the neural kernels show the connection between the RKHS corresponding to the neural kernels and , and that corresponding to the Matérn family of kernels. Let denote the RKHS corresponding to the Matérn kernel with smoothness parameter .
When the domain is the hypersphere and , () is equivalent to , with (). When , we have (), with ().
Proof Sketch. The proof follows from the eigendecay of the neural kernels reported in Table 2, the eigendecay of the Matérn family of kernels on manifolds given in borovitskiy2020mat, and Mercer’s representation theorem. Details are given in Appendix B.
The observation that when , we only have , and not vice versa, is a result of the parity constraints in Table 2. The subset relation can change to an equivalence with a bias term, which removes the parity constraints (see ronen2019convergence).
A special case of Theorem 3, when , was recently proven in (geifman2020similarity; chen2020deep). This special case pertains to the ReLU activation function and the Matérn kernel with smoothness parameter , also referred to as the Laplace kernel.
The RKHS of a Matérn kernel with smoothness parameter is well known to be equivalent to a Sobolev space with parameter (see, e.g., Kanagawa2018). This observation provides an intuitive interpretation of the RKHS of the neural kernels as a consequence of Theorem 3. That is, () is proportional to the cumulative norm of the weak derivatives of up to order (). I.e., () translates to the existence of weak derivatives of up to order (), which can be understood as a versatile measure of the smoothness of , controlled by . For the details on the definition of weak derivatives and Sobolev spaces, see Hunter2001Analysis.
4 Maximal Information Gain of the Neural Kernels
In this section, we use the eigendecay of the NT kernel shown in the previous section to derive bounds on its information gain, which will then be used to prove uniform bounds on the generalization error.
4.1 Information Gain: a Measure of Complexity
To define the information gain, we need a fictitious Gaussian process (GP) model. Assume is a zero-mean GP indexed on with kernel . The information gain then refers to the mutual information between the data values and . From the closed-form expression of the mutual information between two multivariate Gaussian distributions (cover1999elementsold), it follows that
Here, is the kernel matrix , and is a regularization parameter. We also define the data-independent and kernel-specific maximal information gain (MIG)
The MIG is analogous to the log of the maximum volume of the ellipsoid generated by points in , which captures the geometric structure of (huang2021short).
The MIG is also closely related to the effective dimension of the kernel. In particular, although the feature spaces of the neural kernels are infinite-dimensional, for a finite dataset the number of features with a significant impact on the regression model can be finite. The effective dimension of a kernel is often defined as (zhang2005learning; Valko2013kernelbandit)
It is known that the information gain and the effective dimension are the same up to logarithmic factors. Specifically, , and (Calandriello2019Adaptive).
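These definitions can be checked numerically. The sketch below (a toy setup of our own, with the two-layer ReLU RF kernel and an arbitrary regularization value) computes the information gain both via a log-determinant and via the eigenvalues of the kernel matrix, together with the effective dimension. The elementary inequality log(1 + x) >= x / (1 + x), applied per eigenvalue, gives d_eff <= 2 * gamma, consistent with the two quantities matching up to logarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Kernel matrix on n points on the sphere, using the two-layer ReLU RF kernel
n, d, lam = 200, 3, 1.0
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
U = np.clip(X @ X.T, -1.0, 1.0)
K = (np.sqrt(1 - U**2) + U * (np.pi - np.arccos(U))) / np.pi  # RF (arc-cosine) kernel

# Information gain: 0.5 * logdet(I + K / lam) = 0.5 * sum_i log(1 + sigma_i / lam)
sign, logdet = np.linalg.slogdet(np.eye(n) + K / lam)
gain = 0.5 * logdet
sig = np.linalg.eigvalsh(K)
gain_eig = 0.5 * np.sum(np.log1p(sig / lam))

# Effective dimension: trace(K (K + lam I)^{-1}) = sum_i sigma_i / (sigma_i + lam)
d_eff = np.sum(sig / (sig + lam))
```

Both routes give the same information gain, and the effective dimension is of the same order, which is the numerical counterpart of the statement above.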
The interpretation of the MIG as the effective dimension (up to a factor) provides further intuition into why it can capture the complexity of the learning problem. Roughly speaking, our bounds on the generalization error are analogous to the ones for a linear model, which has a feature dimension of .
4.2 Bounding the Information Gain for Neural Kernels
In the following theorem, we provide a bound on the maximal information gain of the neural kernels.
For the neural kernels presented in Section 2, with activations , , and number of layers , we have
Proof Sketch. The main components of the analysis are the eigendecay given in Proposition 1, a projection onto a finite-dimensional RKHS, a technique proposed in vakili2021information, and the bound on the sum of spherical harmonics of degree determined by the Legendre polynomials, sometimes referred to as the Legendre addition theorem (malevcek2001inductiveaddition).
5 Uniform Bounds on the Generalization Error
In this section, we use the bound on the MIG from the previous section to establish uniform bounds on the generalization error.
5.1 Implicit Generalization Bounds
We first overview the implicit (data-dependent) generalization bounds for kernel methods under the following assumptions.
Assume the true data-generating model is in the RKHS corresponding to a neural kernel . In particular, , for some . In addition, assume that the observation noise terms are independent sub-Gaussian random variables. That is, , where , .
Under Assumption 1, we have, with probability at least ,
where , and is the posterior variance of the fictitious GP model with kernel , .
Equation 7 provides a high-probability bound on . The decay rate of this bound in terms of is, however, not explicitly given.
5.2 Explicit Generalization Bounds
In this section, we use the bounds on the MIG and the results from the previous section to provide explicit (in ) generalization bounds. It is clear that, with no assumption on the distribution of the data, the generalization error cannot be bounded nontrivially. For example, if all the data points are collected from a small region of the input domain, the error cannot be expected to be small far from this region. We therefore consider a dataset that is distributed over the input space in a sufficiently informative way. For this purpose, we introduce a data collection module that collects the data points based on the current uncertainty level of the kernel model. In particular, consider a dataset collected as follows: , where a tie is broken arbitrarily and is defined in equation 8. We have the following result.
Proof Sketch. The proof follows the same steps as the proof of Theorem of vakili2021optimal. The key step is bounding the total uncertainty in the model by the information gain, that is (srinivas2009gaussian)
That allows us to bound the term on the right-hand side of equation 7 by , up to multiplicative constants. Plugging in the bounds on from Theorem 4, and using the implicit error bound given in equation 7, subject to a union bound over a discretization of the domain of size , we obtain equation 9.
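The uncertainty-driven data collection rule of Section 5.2 can be sketched as follows (a hypothetical toy implementation: the candidate set, the kernel choice, and the regularization value are our own assumptions). Greedily querying the point of largest posterior variance of the fictitious GP drives the maximal uncertainty down monotonically, which is what makes the resulting dataset sufficiently informative.

```python
import numpy as np

rng = np.random.default_rng(2)

def rf_kernel(A, B):
    """Two-layer ReLU RF (arc-cosine) kernel between rows of A and B (unit vectors)."""
    U = np.clip(A @ B.T, -1.0, 1.0)
    return (np.sqrt(1 - U**2) + U * (np.pi - np.arccos(U))) / np.pi

def posterior_var(X_train, X_test, lam):
    """Posterior variance of the fictitious GP at each test point."""
    if len(X_train) == 0:
        return np.diag(rf_kernel(X_test, X_test)).copy()
    K = rf_kernel(X_train, X_train) + lam * np.eye(len(X_train))
    Kx = rf_kernel(X_test, X_train)
    return np.diag(rf_kernel(X_test, X_test)) - np.einsum(
        "ij,jk,ik->i", Kx, np.linalg.inv(K), Kx)

# Greedy max-variance data collection over a candidate set on the sphere
cand = rng.standard_normal((300, 3))
cand /= np.linalg.norm(cand, axis=1, keepdims=True)
lam, picked, max_vars = 0.1, [], []
for _ in range(25):
    var = posterior_var(np.array(picked).reshape(-1, 3), cand, lam)
    i = int(np.argmax(var))          # x_t = argmax of the current uncertainty
    max_vars.append(float(var[i]))
    picked.append(cand[i])
```

Since adding observations never increases the posterior variance of a GP (with fixed regularization), the recorded maxima are nonincreasing, giving a quasi-uniform spread of queries over the domain.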
We proved uniform bounds on the generalization error of overparameterized neural networks in kernel regimes. For this purpose, we derived bounds on the MIG of the RF and NT kernels. This analysis is based on the decay rate of the eigenvalues of the corresponding kernels, which is determined by their endpoint expansions. We note that the existing analysis of the MIG does not directly apply here. In particular, the assumption of bounded Mercer eigenfunctions in vakili2021information does not hold in the case of spherical harmonics. In addition, our derivation of the eigendecay for various activation functions extends the existing results, which mainly consider the ReLU activation function, and establishes the equivalence between the RKHS of the neural kernels and that of the Matérn family of kernels with the corresponding smoothness. This provides further intuition into the representation power of the neural kernels.
Regret bounds in RL and bandits:
We further provide a remark on the consequences of our bounds on the MIG for some recent results in the analysis of RL and bandit problems that use overparameterized neural network models. In particular, uniform generalization bounds were used as an intermediate step in the analysis of RL and bandit problems to show an regret bound, in zhou2020neuralUCB; zhang2020neuralTS; yang2020function; gu2021batched. These works, however, did not characterize the bound on , which leaves the regret bounds implicit, if not incomplete. A consequence of our bounds on is an explicit (in ) regret bound for this class of problems. Our results, however, have mixed implications, showing that the existing regret bounds are not necessarily sublinear in . In particular, plugging our bound on into the existing regret bounds, we get , which is sublinear only when (which, for example, excludes the ReLU activation function for dimension ). Our analysis of various activation functions is thus essential in showing sublinear regret bounds for at least some cases. As recently formalized in vakili2021open, it remains an open problem whether the analysis of RL and bandit algorithms in kernel regimes can be improved to have an always sublinear regret bound. We note that this open problem and the analysis of RL and bandit problems were not considered in this work. Our remark is mainly concerned with the bound on the MIG that appears in the regret bounds.
Due to the properties of the rotationally invariant kernels on the hypersphere mentioned in Section 2, we assumed a hyperspherical domain throughout this paper. We would like to emphasize that this is a standard practice in the literature and is not a limitation of the scope of our analysis. The reason is that we can transform any compact domain to a compact subdomain of the hypersphere. In this transformation, any vector is augmented with a dummy variable to obtain , which is then projected to the hypersphere as . The results then apply after this transformation.
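The lifting described above can be sketched as follows (a minimal illustration; choosing the dummy coordinate so that the lifted vector has norm exactly R is one standard convention, not necessarily the paper's exact mapping):

```python
import numpy as np

def to_sphere(x, R):
    """Map x in a ball of radius R in R^d to the unit hypersphere S^d in R^(d+1)
    by appending a dummy coordinate and rescaling."""
    x = np.asarray(x, dtype=float)
    dummy = np.sqrt(max(R**2 - x @ x, 0.0))  # chosen so that ||(x, dummy)|| = R
    z = np.append(x, dummy)
    return z / R

z = to_sphere([0.3, -0.4, 0.5], R=1.0)
```

The original coordinates are preserved (up to the 1/R scaling), so a function on the compact domain induces a function on the corresponding spherical cap.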
Appendix A Proof of Proposition 1
This proposition follows from Theorem of bietti2020deep, which proved that the eigendecay of a rotationally invariant kernel can be determined from its asymptotic expansions around the endpoints . Their result is formally given in the following lemma.
Lemma 1 (Theorem in bietti2020deep)
Assume is and has the following asymptotic expansions around
for , where are polynomials, and is not an integer. Also, assume that the derivatives of admit similar expansions obtained by differentiating the ones above. Then, there exists such that, when is even, if : , and, when is odd, if : . In the case , we have for one of the two parities. If is infinitely differentiable on , so that no such exists, the eigenvalues decay faster than any polynomial.
Building on this result, the eigendecays can be obtained by deriving the endpoint expansions of the NT and RF kernels. These expansions were derived in bietti2020deep for the ReLU activation function. For completeness, here we derive the endpoint expansions for the RF and NT kernels associated with -positively homogeneous activation functions .
Recall that for a layer network, the corresponding RF and NT kernels are given by
For , these expressions can be calculated in closed form using
where . These kernels have the following asymptotic endpoint expansions.
a.1 Endpoint Expansions for Layer Networks with
Without loss of generality, we normalize the neural kernels so that for all . The eigendecays are not affected by such normalization. In particular, when . Here, . We thus normalize the expressions given above for and by dividing them by .
We prove that the endpoint expansions of the neural kernels for the activation functions with , can be constructed in a recursive way, based on the following lemma.
For the normalized RF kernel () corresponding to a two layer network, we have
Proof of Lemma 2. The lemma is proven by taking the derivative of with respect to and applying Stein’s lemma. In particular, let and , so that and . Let . We then have
The third equality results from applying Stein’s lemma twice: once with and , and once with and .
Therefore, using Lemma 2, we can recursively find the expansion of from that of , by integration. We obtain
Here, , subject to (which follows from ). Thus, using induction over , can also be expressed in closed form, starting from . For example,
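For the ReLU case, the recursion underlying Lemma 2 reduces (under our normalization assumptions) to the identity kappa_1'(u) = kappa_0(u) between the degree-1 and degree-0 arc-cosine kernels, which can be verified by finite differences:

```python
import numpy as np

def kappa0(u):
    """Degree-0 arc-cosine kernel."""
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    """Degree-1 arc-cosine kernel."""
    return (np.sqrt(1 - u**2) + u * (np.pi - np.arccos(u))) / np.pi

# Central-difference check that kappa1'(u) = kappa0(u) on (-1, 1)
u = np.linspace(-0.95, 0.95, 39)
h = 1e-6
fd = (kappa1(u + h) - kappa1(u - h)) / (2 * h)
```

This is the sense in which the expansion of the degree-s kernel can be recovered from the degree-(s-1) kernel by integration, as used in the recursive derivation above.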