
# Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis

Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and spherical harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.


## 1 Introduction

Several recent theoretical and empirical advances indicate that understanding generalization in deep learning requires incorporating the properties of the data distributions as well as the optimization algorithms (Zhang et al., 2017). One view of the generalization capabilities of deep neural networks (DNNs) is that their training dynamics result in successive layers capturing functions that are invariant to high-frequency perturbations or actions in the input space, while incorporating low-frequency functions such as classes.

Such properties in the trained models are desirable since most realistic data distributions correspond to data manifolds of objects, such as the set of real images, along with groups of actions, such as rotations and shifts of objects, acting on the data. A network that incorporates the invariance of properties such as classes and the presence of objects to such group actions is expected to have better generalization capabilities. A dual view of how functions change under such actions is provided by Fourier analysis on the corresponding groups.

As a step towards explaining the generalization behaviour of Deep Neural Networks, Rahaman et al. (2019) highlighted the intriguing phenomenon of “Spectral Bias” in ReLU networks, empirically demonstrating the bias of ReLU networks towards learning low-frequency functions.

While a theoretical understanding of the “Spectral Bias” in finite-width neural networks remains elusive, several recent works have attempted to shed light on the phenomenon through the properties of the spectrum of the Neural Tangent Kernel (NTK). The NTK (Jacot et al., 2018) describes the time evolution of the output function’s value at an input through the similarity between gradients of the outputs at the training points and at the given input. In the infinite-width limit, under suitable scaling, the NTK converges to a fixed kernel. The evolution of the output function of the Deep Neural Network in such regimes corresponds to kernel regression using the NTK. This enables understanding the optimization and generalization properties of the training dynamics by studying the properties of the corresponding NTK. Recent theoretical and empirical results have demonstrated that the NTK has large eigenvalues for functions corresponding to low-frequency variations in the input space. The large eigenvalues of such low-frequency functions lead to faster convergence along directions in function space that vary more smoothly in the input space.

The relationship between the spectral properties of Kernels and generalization is well known, and has been used to derive explicit generalization bounds (Bartlett and Mendelson, 2001) for Kernel methods. However, we argue that for the case of Deep Neural Networks, it can provide insights into not only the generalization capabilities but also the role played by different layers as well as the spectral properties of the learnt features.

Our analysis reveals that the eigenvalues corresponding to high-frequency functions of the contribution to the NTK from initial layers are larger than those corresponding to later layers. This is primarily due to the application of the differentiation operation to the activation function while back-propagating the gradient through later layers, leading to the amplification of high-frequency components. This explains the predisposition of the initial layers towards contributing to the learning of high-frequency functions. For example, in CNNs, the initial layers often detect high-frequency artifacts such as edges, whereas the later layers detect smoother properties such as the presence of a class.

By decomposing the NTK of the full network into layer-wise contributions, we characterize the contributions of different layers to the decrease in the training cost. Furthermore, we prove that the ratio of contributions of different layers to the decrease in generalization error along a direction in the function space, when training on a finite number of data points, can be approximately described by the ratio of the function’s squared norms under the inner products defined by the two kernels. This ratio differs across functions of different frequencies. By exploiting the shared basis of eigenvectors for the kernels corresponding to different layers, we relate this ratio for functions of different frequencies to the corresponding ratio of eigenvalues. We then derive a general approach to analyze these ratios without explicitly computing the eigenvalues.

## 2 Setup and Notation

We consider a deep neural network (DNN) with $L$ layers numbered $\ell = 0, \dots, L-1$, trained under gradient descent on a finite number of training points $x_1, \dots, x_n$. We use $f(x)$ and $\theta$ to denote the value of the DNN’s output function at a point $x$ and the set of parameters respectively.

## 3 Preliminaries: Neural Tangent Kernel

Our analysis relies on the framework of the Neural Tangent Kernel, proposed by Jacot et al. (2018), who showed that under appropriate scaling, the training dynamics of neural networks in the limit of infinite width can be described by a fixed kernel $\kappa_{NTK}$, whose value at two given points $x$ and $x'$ is simply the inner product of the output’s gradients w.r.t. the parameters at the two points, i.e.:

$$\kappa_{NTK}(x, x') = \left\langle \nabla_\theta f(x), \nabla_\theta f(x') \right\rangle$$

The above inner product arises naturally when considering the evolution of the output function’s value at a point $x$ upon training using a set of points $x_1, \dots, x_n$. For instance, for the regression task with the corresponding target values $y_1, \dots, y_n$, the output evolves as:

$$\frac{\partial f(x)}{\partial t} = -\frac{1}{n} \sum_{i=1}^{n} \left\langle \nabla_\theta f(x), \nabla_\theta f(x_i) \right\rangle \left( f(x_i) - y_i \right)$$

Thus the spectral analysis of the NTK can reveal sensitivity of the output’s gradients w.r.t the parameters upon training with different target functions.
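The evolution equation for $f(x)$ follows from the chain rule. As a brief sketch, assuming a squared loss $\mathcal{L}(\theta) = \frac{1}{2n}\sum_{i=1}^{n}(f(x_i) - y_i)^2$ and gradient flow $\frac{d\theta}{dt} = -\nabla_\theta \mathcal{L}(\theta)$ (the symbol $\mathcal{L}$ and its normalization are assumptions made here for illustration):

$$\begin{aligned}
\frac{\partial f(x)}{\partial t} &= \left\langle \nabla_\theta f(x), \frac{d\theta}{dt} \right\rangle = -\left\langle \nabla_\theta f(x), \nabla_\theta \mathcal{L}(\theta) \right\rangle \\
&= -\frac{1}{n} \sum_{i=1}^{n} \left\langle \nabla_\theta f(x), \nabla_\theta f(x_i) \right\rangle \left( f(x_i) - y_i \right).
\end{aligned}$$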

## 4 Related Work

A number of works have analyzed the spectrum of the Neural Tangent Kernel under different conditions, such as two-layer ReLU networks without bias (Bietti and Mairal, 2019; Cao et al., 2019) under a uniform input distribution on the sphere, in the presence of bias and non-uniform density (Basri et al., 2020), and on boolean input space (Yang and Salman, 2020). More recently, Bietti and Bach (2021) utilized the regularity of the kernels corresponding to ReLU networks to prove that the NTK’s spectrum for $L$-layer ReLU networks has the same asymptotic decay as that of two-layer ReLU networks and the Laplace Kernel. The equivalence between the NTK for ReLU networks and the Laplace Kernel was also derived independently by Chen and Xu (2021). Our work instead focuses on the relative contributions of different layers to the NTK, to enable understanding the propagation of output gradients throughout the network. Moreover, our results are applicable to arbitrary activation functions satisfying minor smoothness assumptions. The spectral bias during training for DNNs was first studied empirically by Rahaman et al. (2019) for MNIST and toy datasets.

Valle-Perez et al. (2019) highlighted a different kind of simplicity bias, by demonstrating that the mapping from the parameters to the output functions is biased towards simpler functions. However, unlike their analysis based solely on the output functions at initialization, the Neural Tangent Kernel (NTK) framework offers the advantage of incorporating the properties of the training algorithm, namely gradient descent, while still depending only on the network at initialization. While the entire network’s spectrum explains the network’s bias towards learning low-frequency functions, and consequently the improved generalization performance for “smooth” target functions, the layer-wise contributions explain how some layers are biased towards learning higher or lower frequencies in comparison with the other layers. A number of works (Schoenholz et al., 2017) have analyzed the forward information propagation in Deep Neural Networks by recursively describing the covariance at layer $\ell$ in terms of layer $\ell - 1$ to determine when the ratio of covariances is a stable fixed point. Our work instead sheds light on the propagation of gradients across layers and the sensitivity of the output to changes in different layers. Our work also builds upon a long line of work on the analysis of dot-product kernels (Smola et al., 2001; Scetbon and Harchaoui, 2021).

## 5 Analysis

### 5.1 Ratio of Decrements

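Let $\theta_\ell$ denote the subset of parameters corresponding to the $\ell$-th layer. Since the NTK’s value at given points $x, x'$ corresponds to the inner product of the output’s gradients w.r.t. the parameters $\theta$, it can be expressed as a sum of the inner products over $\ell = 0, \dots, L-1$ as follows:

$$\left\langle \nabla_\theta f(x), \nabla_\theta f(x') \right\rangle = \sum_{\ell=0}^{L-1} \left\langle \nabla_{\theta_\ell} f(x), \nabla_{\theta_\ell} f(x') \right\rangle$$

The NTK for the entire network can thus be decomposed into the contributions from each layer as follows:

$$\kappa_{NTK}(x, x') = \sum_{\ell=0}^{L-1} \kappa^{(\ell)}(x, x') \tag{1}$$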
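The decomposition in Equation 1 can be illustrated numerically. The following is a minimal sketch, assuming a finite-width two-layer network with a `tanh` activation in the parametrization of Section 5.4; the width, the inputs and all function names are illustrative choices rather than the paper's experimental code.

```python
import math
import random


def two_layer_grads(x, W, v):
    """Gradients of f(x) = (1/sqrt(m)) * sum_j v[j] * tanh(<W[j], x>), split by layer.

    Returns (grad_first, grad_second): flattened gradients w.r.t. the
    first-layer weights W and the second-layer weights v.
    """
    m = len(v)
    scale = 1.0 / math.sqrt(m)
    grad_first, grad_second = [], []
    for w_j, v_j in zip(W, v):
        pre = sum(w * xi for w, xi in zip(w_j, x))
        act = math.tanh(pre)
        # d f / d v_j = tanh(pre) / sqrt(m)
        grad_second.append(scale * act)
        # d f / d w_j = v_j * tanh'(pre) * x / sqrt(m), with tanh' = 1 - tanh^2
        grad_first.extend(scale * v_j * (1.0 - act * act) * xi for xi in x)
    return grad_first, grad_second


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def layerwise_ntk(x, x_prime, W, v):
    """Return (kappa_0, kappa_1): per-layer contributions to the empirical NTK."""
    g0, g1 = two_layer_grads(x, W, v)
    h0, h1 = two_layer_grads(x_prime, W, v)
    return dot(g0, h0), dot(g1, h1)


rng = random.Random(0)
m, d = 64, 3  # illustrative width and input dimension
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
v = [rng.gauss(0, 1) for _ in range(m)]
x, x_prime = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]

kappa_0, kappa_1 = layerwise_ntk(x, x_prime, W, v)

# Full NTK computed from the concatenated parameter gradient.
g0, g1 = two_layer_grads(x, W, v)
h0, h1 = two_layer_grads(x_prime, W, v)
kappa_full = dot(g0 + g1, h0 + h1)

print(kappa_0, kappa_1, kappa_full)
```

Because the parameter vector is a concatenation of the per-layer blocks, the full inner product splits into the per-layer sums, which is all Equation 1 asserts.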

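Following Jacot et al. (2018), we represent the cost function as a functional $C$ defined on a function space on training inputs. A given cost functional $C$ and a given output function $f(t)$ then define an element $d|_{f(t)}$ in the dual space of functionals w.r.t. the input distribution $p^{in}$, which represents the functional derivative of the cost at $f(t)$. The evolution of the cost is thus described by (Jacot et al., 2018):

$$\partial_t C|_{f(t)} = -\left\langle d|_{f(t)}, \nabla_{\kappa^{(L)}_{NTK}} C|_{f(t)} \right\rangle_{p^{in}} = -\left\| d|_{f(t)} \right\|^2_{\kappa^{(L)}_{NTK}}, \tag{2}$$

where the inner product w.r.t. a kernel $\kappa$ is defined as $\langle f, g \rangle_\kappa = \mathbb{E}_{x, x' \sim p^{in}}\left[ f(x)\, \kappa(x, x')\, g(x') \right]$. To isolate the effect of each layer on the evolution of the training loss, we define $d^{(\ell)}$ as the contribution of the $\ell$-th layer to the evolution of the training loss:

$$d^{(\ell)}\left(d|_{f(t)}\right) = \left\| d|_{f(t)} \right\|^2_{\kappa^{(\ell)}} \tag{3}$$

The ratio between the contributions of layers $\ell_1$ and $\ell_2$ to the decrements of the cost is thus given by:

$$\frac{d^{(\ell_1)}\left(d|_{f(t)}\right)}{d^{(\ell_2)}\left(d|_{f(t)}\right)} = \frac{\left\| d|_{f(t)} \right\|^2_{\kappa^{(\ell_1)}}}{\left\| d|_{f(t)} \right\|^2_{\kappa^{(\ell_2)}}} \tag{4}$$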

For simplicity, we consider $p^{in}$ to be the uniform distribution on the unit sphere, i.e. the cost is computed w.r.t. inputs drawn uniformly from the sphere. Moreover, as we demonstrate in Section 5.3 using standard concentration arguments, the ratio of decrements for finite training points and a function on the given training points can be related to the corresponding ratio of decrements for the integral operators and the associated eigenfunction.

### 5.2 Relation with Mercer Decomposition

Given an input space and a measure such that the space of square-integrable functions along with the corresponding inner product constitutes a Hilbert space, kernels defined on the space can be interpreted as symmetric Hilbert-Schmidt operators. By Mercer’s theorem, continuous positive symmetric operators acting on functions defined on a compact input space can be diagonalized by a countable orthonormal basis, known as the Mercer decomposition, which plays a similar role to a spectral decomposition. Suppose $d|_{f(t)}$ (the functional appearing in Equation 2) is continuous and bounded, so that it lies in the Hilbert space. Furthermore, suppose that it is an eigenvector of both $\kappa^{(\ell_1)}$ and $\kappa^{(\ell_2)}$ with eigenvalues $\lambda_1, \lambda_2$ respectively. We obtain

$$\frac{d^{(\ell_1)}\left(d|_{f(t)}\right)}{d^{(\ell_2)}\left(d|_{f(t)}\right)} = \frac{\left\| d|_{f(t)} \right\|^2_{\kappa^{(\ell_1)}}}{\left\| d|_{f(t)} \right\|^2_{\kappa^{(\ell_2)}}} = \frac{\lambda_1}{\lambda_2} \tag{5}$$

Thus, the ratio of decrements for such a direction in the function space is simply given by the ratio of the corresponding eigenvalues.
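The step behind this ratio can be spelled out as follows. Write $T_\kappa$ for the integral operator $(T_\kappa g)(x) = \mathbb{E}_{x' \sim p^{in}}\left[\kappa(x, x')\, g(x')\right]$ (a symbol introduced here only for illustration), abbreviate $d = d|_{f(t)}$, and assume $T_{\kappa^{(\ell_j)}} d = \lambda_j d$ for $j = 1, 2$. Then

$$\left\| d \right\|^2_{\kappa^{(\ell_j)}} = \left\langle d, T_{\kappa^{(\ell_j)}} d \right\rangle_{p^{in}} = \lambda_j \left\| d \right\|^2_{p^{in}}, \qquad j = 1, 2,$$

so the common factor $\left\| d \right\|^2_{p^{in}}$ cancels in the ratio and only $\lambda_1 / \lambda_2$ remains.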

### 5.3 Finite training data

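Let $g$ denote an arbitrary function, corresponding to a given target direction in the function space. Consider the functions $\phi_1$ and $\phi_2$ that describe the contributions of $\kappa^{(\ell_1)}$ and $\kappa^{(\ell_2)}$ respectively to the decrease in the expected risk due to a single training point $x_i$. Thus, $\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)$ and $\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)$ describe the corresponding contributions while training on the finite collection of training points with inputs $x_1, \dots, x_n$. Since the inputs lie on a compact space (the unit sphere), for continuous kernels $\kappa^{(\ell_1)}$ and $\kappa^{(\ell_2)}$ and function $g$, their magnitudes are bounded, say with constants $A$, $B$ and $C$ respectively. Thus $\phi_1$ and $\phi_2$ are bounded as well, since we have:

$$\phi_1(x_i) = \int g(x)\, \kappa^{(\ell_1)}(x, x_i)\, g(x_i)\, dx \le AC^2$$

where the bound is independent of $x_i$. Similarly, for $\phi_2$, we have:

$$\phi_2(x_i) \le BC^2$$

We then utilize Hoeffding’s Inequality to obtain the following two bounds:

$$\begin{aligned}
\Pr\left[ \left| \frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i) - \iint g(x)\, \kappa^{(\ell_1)}(x, x')\, g(x')\, dx\, dx' \right| \ge t \right] &\le 2e^{-\frac{2nt^2}{AC^2}} \\
\Pr\left[ \left| \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i) - \iint g(x)\, \kappa^{(\ell_2)}(x, x')\, g(x')\, dx\, dx' \right| \ge t \right] &\le 2e^{-\frac{2nt^2}{BC^2}}
\end{aligned}$$

Let $\lambda_1$ and $\lambda_2$ denote $\iint g(x)\, \kappa^{(\ell_1)}(x, x')\, g(x')\, dx\, dx'$ and $\iint g(x)\, \kappa^{(\ell_2)}(x, x')\, g(x')\, dx\, dx'$ respectively. These correspond to the squared norms of the function $g$ under the inner products defined by $\kappa^{(\ell_1)}$ and $\kappa^{(\ell_2)}$ respectively. Let us assume that . As we prove in the appendix, we can utilize the above inequalities to bound the difference between the ratios $\frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)}$ and $\frac{\lambda_1}{\lambda_2}$ to arrive at the following concentration inequality:

$$\Pr\left[ \left| \frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)} - \frac{\lambda_1}{\lambda_2} \right| \ge \epsilon \right] \le 2e^{-\frac{2n\lambda_1^2\epsilon^2}{AC^2}} + 2e^{-\frac{2n\lambda_2^2\epsilon^2}{BC^2}}$$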

Thus we obtain the following theorem:

###### Theorem 1.

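Let $g$ be an arbitrary function with decrements $\lambda_1$ and $\lambda_2$ along kernels $\kappa^{(\ell_1)}$ and $\kappa^{(\ell_2)}$. Let the corresponding decrements in total risk due to a single point $x_i$ be given by $\phi_1(x_i)$ and $\phi_2(x_i)$. Then for $n$ training points, with probability at least $1 - \delta$, we have:

$$\left| \frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)} - \frac{\lambda_1}{\lambda_2} \right| \le \left( \log\frac{4}{\delta} \right) \frac{1}{\sqrt{n}} \max\left\{ \sqrt{\frac{AC^2}{2\lambda_1^2}}, \sqrt{\frac{BC^2}{2\lambda_2^2}} \right\}$$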
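To get a feel for how the bound in Theorem 1 scales, the right-hand side can be evaluated directly. The helper below is a small sketch: `phi1_bound` and `phi2_bound` stand for the two constants bounding $\phi_1$ and $\phi_2$ that appear under the square roots, and the numerical values in the usage lines are arbitrary illustrations, not quantities from the paper.

```python
import math


def ratio_deviation_bound(n, delta, phi1_bound, phi2_bound, lam1, lam2):
    """Evaluate the right-hand side of the Theorem 1 bound.

    n          -- number of training points
    delta      -- failure probability, 0 < delta < 1
    phi1_bound -- constant bounding phi_1 (first term under the square root)
    phi2_bound -- constant bounding phi_2 (second term under the square root)
    lam1, lam2 -- squared norms of the target direction under the two kernels
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if n <= 0 or lam1 <= 0 or lam2 <= 0:
        raise ValueError("n, lam1 and lam2 must be positive")
    worst = max(
        math.sqrt(phi1_bound / (2.0 * lam1 ** 2)),
        math.sqrt(phi2_bound / (2.0 * lam2 ** 2)),
    )
    return math.log(4.0 / delta) * worst / math.sqrt(n)


# Illustrative values only: quadrupling n halves the bound.
bound_small_n = ratio_deviation_bound(100, 0.05, 1.0, 1.0, 0.5, 0.25)
bound_large_n = ratio_deviation_bound(400, 0.05, 1.0, 1.0, 0.5, 0.25)
print(bound_small_n, bound_large_n)
```

The $1/\sqrt{n}$ dependence is visible in the usage lines, and the bound is dominated by whichever kernel assigns the smaller squared norm to the target direction.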

### 5.4 Two layer Network

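We consider a two-layer network of width $m$, with the output for the $i$-th input vector $x_i$ given by $f(x_i) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} v_j\, \sigma(w_j^\top x_i)$, with $w_j$ denoting the $j$-th row of the weight matrix of the first layer. For simplicity, we consider networks without bias parameters and assume that all the parameters are initialized independently as $\mathcal{N}(0, 1)$. The gradients of the output w.r.t. the parameters of the two layers are given by:

$$\frac{\partial f(x_i)}{\partial v_j} = \frac{1}{\sqrt{m}}\, \sigma(w_j^\top x_i) \tag{6}$$

and

$$\nabla_{w_j} f(x_i) = \frac{1}{\sqrt{m}}\, v_j\, \sigma'(w_j^\top x_i)\, x_i \tag{7}$$

In the infinite-width limit $m \to \infty$, for inputs $x, x'$ on the unit sphere, we can express the contributions of the two layers to the NTK as follows:

$$\kappa^{(0)}(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \langle x, x' \rangle\, \sigma'(\langle w, x \rangle)\, \sigma'(\langle w, x' \rangle) \right] \tag{8}$$

$$\kappa^{(1)}(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma(\langle w, x \rangle)\, \sigma(\langle w, x' \rangle) \right] \tag{9}$$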
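As a sanity check on the two expectations above, they can be estimated by Monte Carlo for a specific activation. The sketch below assumes the ReLU activation, for which the standard arc-cosine kernel closed forms are available for unit-norm inputs; the sample size, seed and function names are illustrative choices.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def relu_grad(z):
    return 1.0 if z > 0.0 else 0.0


def two_layer_kernels_mc(x, x_prime, num_samples=100_000, seed=0):
    """Monte Carlo estimates of (kappa_0, kappa_1) for ReLU, with w ~ N(0, I)."""
    rng = random.Random(seed)
    d = len(x)
    inner = sum(a * b for a, b in zip(x, x_prime))
    acc_grad, acc_act = 0.0, 0.0
    for _ in range(num_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        a = sum(wi * xi for wi, xi in zip(w, x))
        b = sum(wi * xi for wi, xi in zip(w, x_prime))
        acc_grad += relu_grad(a) * relu_grad(b)
        acc_act += relu(a) * relu(b)
    return inner * acc_grad / num_samples, acc_act / num_samples


def two_layer_kernels_closed_form(x, x_prime):
    """Arc-cosine closed forms for ReLU and unit-norm inputs."""
    inner = sum(a * b for a, b in zip(x, x_prime))
    theta = math.acos(max(-1.0, min(1.0, inner)))
    kappa_0 = inner * (math.pi - theta) / (2.0 * math.pi)
    kappa_1 = (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / (2.0 * math.pi)
    return kappa_0, kappa_1


x, x_prime = [1.0, 0.0], [0.6, 0.8]  # two unit vectors in the plane
estimate = two_layer_kernels_mc(x, x_prime)
exact = two_layer_kernels_closed_form(x, x_prime)
print("Monte Carlo:", estimate)
print("closed form:", exact)
```

The two printed pairs should agree up to Monte Carlo error, which shrinks as `num_samples` grows.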

#### 5.4.1 Hermite Polynomials

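Since $\langle w, x \rangle$ and $\langle w, x' \rangle$ correspond to correlated Gaussian random variables, it is natural to utilize the Hermite expansion for the activation $\sigma$. We recall that Hermite polynomials form an orthonormal basis for the Hilbert space $L^2(\mathbb{R}, \gamma)$, where $\gamma$ denotes the one-dimensional Gaussian measure. For $\sigma \in L^2(\mathbb{R}, \gamma)$, we have a.e. w.r.t. the Gaussian measure:

$$\sigma(\langle w, x \rangle) = \sum_{i=0}^{\infty} a_i H_i(\langle w, x \rangle) \tag{10}$$

We recall that for standard Gaussian random variables $x, y$ having correlation $\rho$, we have (Daniely et al., 2016):

$$\mathbb{E}\left[ H_n(x) H_m(y) \right] = \begin{cases} \rho^n & n = m \\ 0 & \text{otherwise} \end{cases} \tag{11}$$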

Equation 11 and the bounded convergence theorem imply the following lemma (full proof in the Appendix):

###### Lemma 1.

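Let $\xi$ be any function admitting a Hermite series expansion $\xi = \sum_{i=0}^{\infty} e_i H_i$. Then for two standard Gaussian random variables $x, y$ with correlation $\rho$, the correlation $\mathbb{E}[\xi(x)\xi(y)]$ depends only on the correlation $\rho$ between $x$ and $y$, and can be expressed as:

$$\mathbb{E}\left[ \xi(x)\xi(y) \right] = \sum_{i=0}^{\infty} e_i^2 \rho^i \tag{12}$$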
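The orthogonality relation in Equation 11 can be checked numerically. The sketch below assumes the normalized probabilists' Hermite polynomials $h_n = He_n / \sqrt{n!}$, for which the identity holds exactly as written (for the unnormalized $He_n$ the right-hand side picks up a factor $n!$). The expectation is approximated by a Riemann sum on a grid; the grid width and step are illustrative choices.

```python
import math


def hermite_normalized(n, z):
    """h_n(z) = He_n(z) / sqrt(n!), using He_{k+1}(z) = z He_k(z) - k He_{k-1}(z)."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, z
    for k in range(1, n):
        prev, cur = cur, z * cur - k * prev
    return cur / math.sqrt(math.factorial(n))


def hermite_correlation(n, m, rho, half_width=8.0, step=0.1):
    """Approximate E[h_n(x) h_m(y)] for standard Gaussians x, y with correlation rho."""
    num = int(round(2.0 * half_width / step)) + 1
    grid = [-half_width + i * step for i in range(num)]
    weights = [math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * step for z in grid]
    c = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for z1, w1 in zip(grid, weights):
        hx = hermite_normalized(n, z1) * w1
        for z2, w2 in zip(grid, weights):
            # y = rho * z1 + sqrt(1 - rho^2) * z2 has correlation rho with x = z1
            total += hx * hermite_normalized(m, rho * z1 + c * z2) * w2
    return total


rho = 0.6
for n, m in [(1, 1), (2, 2), (3, 3), (2, 3)]:
    target = rho ** n if n == m else 0.0
    print(n, m, hermite_correlation(n, m, rho), target)
```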

To simplify subsequent computations, we define:

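$$\begin{aligned}
\psi_0(x, x') &= \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma'(\langle w, x \rangle)\, \sigma'(\langle w, x' \rangle) \right] \\
\psi_1(x, x') &= \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma(\langle w, x \rangle)\, \sigma(\langle w, x' \rangle) \right]
\end{aligned}$$

so that $\kappa^{(0)}(x, x') = \langle x, x' \rangle\, \psi_0(x, x')$ and $\kappa^{(1)}(x, x') = \psi_1(x, x')$.

Since $\langle w, x \rangle$ and $\langle w, x' \rangle$ are correlated Gaussian random variables with correlation $\langle x, x' \rangle$, and $\psi_0$ and $\psi_1$ correspond to correlations of functions of $\langle w, x \rangle$ and $\langle w, x' \rangle$, Lemma 1 and the series expansion in Equation 10 imply that they can be expressed as follows:

$$\psi_1(\langle x, x' \rangle) = \sum_{i=0}^{\infty} a_i^2\, \langle x, x' \rangle^i, \qquad \psi_0(\langle x, x' \rangle) = \sum_{i=0}^{\infty} (a'_i)^2\, \langle x, x' \rangle^i \tag{13}$$

Equation 11 then implies . To relate $a_n$ and $a'_n$, we recall the following recurrence relations:

$$H_{n+1}(w) = w H_n(w) - n H_{n-1}(w) \tag{14}$$

$$H'_n(w) = n H_{n-1}(w) \tag{15}$$

We denote by $a'_n$ the coefficient of the Hermite series expansion corresponding to $\sigma'$. We note that, assuming , using integration by parts (full proof in the Appendix), we have:

$$\begin{aligned}
a'_n &= \frac{1}{n!}\, \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma'(w) H_n(w) \right] \\
&= \frac{1}{n!}\, \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma(w)\, w\, H_n(w) \right] - \frac{1}{n!}\, \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma(w) H'_n(w) \right] \\
&= \frac{1}{n!}\, \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma(w) H_{n+1}(w) \right] \qquad \text{(using Equations 14 and 15)} \\
&= \frac{n+1}{(n+1)!}\, \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma(w) H_{n+1}(w) \right] \\
&= (n+1)\, a_{n+1}
\end{aligned} \tag{16}$$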

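Therefore, for , by substituting the relation $a'_n = (n+1)\, a_{n+1}$ from Equation 16 into the definition of $\kappa^{(0)}$ in Equation 8, together with the expansion in Equation 13, we obtain:

$$\begin{aligned}
\kappa^{(0)}(\langle x, x' \rangle) &= \langle x, x' \rangle\, \mathbb{E}_{w \sim \mathcal{N}(0,1)}\left[ \sigma'(\langle w, x \rangle)\, \sigma'(\langle w, x' \rangle) \right] \\
&= \langle x, x' \rangle \sum_{i=1}^{\infty} (a'_{i-1})^2\, \langle x, x' \rangle^{i-1} \\
&= \sum_{i=1}^{\infty} i^2 a_i^2\, \langle x, x' \rangle^i
\end{aligned} \tag{17}$$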

Therefore, we have the following theorem:

###### Theorem 2.

Assuming an activation function $\sigma$ such that $\sigma$ and $\sigma'$ admit Hermite series expansions, the ratio of the coefficients of the degree-$i$ term in the power series expansions of the kernels corresponding to layers $0$ and $1$ satisfies $\frac{i^2 a_i^2}{a_i^2} = i^2$.

While our analysis relies on Hermite polynomials, the amplification of high-frequency terms due to differentiation is a general phenomenon. For instance, the derivative of a degree-$k$ sinusoid $\sin(kx)$ is given by $k\cos(kx)$. Thus, we expect a similar analysis to hold in other suitable bases of functions.
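The integration-by-parts relation $a'_n = (n+1)\, a_{n+1}$ of Equation 16 can be checked numerically for a concrete activation. The sketch below assumes $\sigma = \tanh$ and uses the unnormalized probabilists' Hermite polynomials generated by the recurrence in Equation 14, with coefficients defined through the $\frac{1}{n!}$ factor as in Equation 16. The quadrature grid is an illustrative choice.

```python
import math


def hermite(n, z):
    """Unnormalized probabilists' Hermite polynomial He_n(z) via the recurrence (14)."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, z
    for k in range(1, n):
        prev, cur = cur, z * cur - k * prev
    return cur


def gaussian_expectation(func, half_width=12.0, step=0.01):
    """Riemann-sum approximation of E[func(w)] for w ~ N(0, 1)."""
    num = int(round(2.0 * half_width / step)) + 1
    total = 0.0
    for i in range(num):
        z = -half_width + i * step
        total += func(z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)


def hermite_coefficient(func, n):
    """Coefficient (1/n!) E[func(w) He_n(w)], matching the convention of Equation 16."""
    return gaussian_expectation(lambda z: func(z) * hermite(n, z)) / math.factorial(n)


def sigma(z):
    return math.tanh(z)


def sigma_prime(z):
    return 1.0 - math.tanh(z) ** 2


for n in range(5):
    lhs = hermite_coefficient(sigma_prime, n)        # a'_n
    rhs = (n + 1) * hermite_coefficient(sigma, n + 1)  # (n + 1) a_{n+1}
    print(n, lhs, rhs)
```

Since $\tanh$ is odd, both sides vanish for odd $n$; the even rows carry the non-trivial check.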

### 5.5 Conversion to Legendre series

Since Legendre polynomials (or Gegenbauer polynomials for arbitrary dimensions) form a basis of polynomial functions on $[-1, 1]$, they provide a convenient way to isolate the components of a dot-product kernel corresponding to different degrees of variation w.r.t. the input space. Moreover, the Funk-Hecke theorem allows us to express Legendre polynomials applied to the inner product in terms of the spherical harmonic functions acting on the constituent vectors. This relationship can be utilized to obtain the Mercer decomposition of the given dot-product kernel, as described in the subsequent sections. For our results, we utilize the following property (Chen and Xu, 2021) of positive definite functions on spheres from the classical paper by Schoenberg (1942):

###### Lemma 2.

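Power series expansion: An inner product kernel $\kappa$, defined by a continuous function $f : [-1, 1] \to \mathbb{R}$ as $\kappa(x, x') = f(\langle x, x' \rangle)$ for $x, x' \in \mathbb{S}^{d-1}$, is a positive definite kernel for every dimension $d$ if and only if $f(u) = \sum_{i=0}^{\infty} b_i u^i$ for a sequence $(b_i)$ satisfying $b_i \ge 0$ and $\sum_{i=0}^{\infty} b_i < \infty$.

To obtain the Legendre series coefficients from the corresponding power series, we start by expressing $u^i$ in the basis of Legendre polynomials:

$$u^i = \sum_{l=0}^{\infty} \beta_{il} P_l(u) \tag{18}$$

By applying Lemma 2 to the function $u^i$, we note that $u^i$ is a positive definite function. Thus $\beta_{il} \ge 0$. Moreover, since $u^i$ lies in the span of Legendre polynomials of degree at most $i$, it is orthogonal to all higher-degree Legendre polynomials w.r.t. the corresponding measure. Thus $\beta_{il} = 0$ for $l > i$.

Consider a bounded function $g$ admitting a power series expansion $g(u) = \sum_{m=0}^{\infty} g_m u^m$ with positive coefficients, convergent in $[-1, 1]$. Since $g$ is bounded and integrable in $[-1, 1]$, applying the bounded convergence theorem yields:

$$\int_{-1}^{1} g(u) P_l(u)\, d\mu = \sum_{m=l}^{\infty} g_m \int_{-1}^{1} u^m P_l(u)\, d\mu = \sum_{m=l}^{\infty} g_m \beta_{ml},$$

where the sum starts from $m = l$ since $\beta_{ml} = 0$ for $m < l$, and $\mu$ denotes the corresponding dimension-dependent normalizing measure. The details of the measure and the assumptions are further described in the Appendix. Let us now consider two functions $g, h$ with the corresponding power series expansions $g(u) = \sum_{m=0}^{\infty} g_m u^m$ and $h(u) = \sum_{m=0}^{\infty} h_m u^m$. The ratio of the components along the degree-$l$ Legendre polynomial for $g$ and $h$ is then given by:

$$\frac{g_{P_l}}{h_{P_l}} = \frac{\int_{-1}^{1} g(u) P_l(u)\, d\mu}{\int_{-1}^{1} h(u) P_l(u)\, d\mu} = \frac{\sum_{m=l}^{\infty} g_m \beta_{ml}}{\sum_{m=l}^{\infty} h_m \beta_{ml}} \tag{19}$$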

This leads us to the following lemma:

###### Lemma 3.

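Let $g, h$ denote two functions on $[-1, 1]$ admitting power series expansions $g(u) = \sum_{m=0}^{\infty} g_m u^m$ and $h(u) = \sum_{m=0}^{\infty} h_m u^m$ such that $\frac{g_m}{h_m}$ is a non-decreasing function of $m$. Then the ratio of the corresponding Legendre coefficients satisfies $\frac{g_{P_l}}{h_{P_l}} \ge \frac{g_l}{h_l}$.

###### Proof.

$$\frac{g_{P_l}}{h_{P_l}} = \frac{\sum_{m=l}^{\infty} g_m \beta_{ml}}{\sum_{m=l}^{\infty} h_m \beta_{ml}} = \frac{\sum_{m=l}^{\infty} \frac{g_m}{h_m} h_m \beta_{ml}}{\sum_{m=l}^{\infty} h_m \beta_{ml}} \ge \frac{\sum_{m=l}^{\infty} \frac{g_l}{h_l} h_m \beta_{ml}}{\sum_{m=l}^{\infty} h_m \beta_{ml}} = \frac{g_l}{h_l}$$

where the inequality follows from the non-decreasing nature of $\frac{g_m}{h_m}$, together with the non-negativity of the terms $h_m \beta_{ml}$. ∎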
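Lemma 3 can be checked on a small example with exact rational arithmetic. The sketch below assumes the classical Legendre polynomials (the three-dimensional case, with the uniform weight on $[-1, 1]$) and a truncated power series; the choice $h_m = 1/m!$, $g_m = m^2 h_m$ and the truncation degree are arbitrary illustrations.

```python
import math
from fractions import Fraction


def legendre_coeffs(max_degree):
    """Monomial coefficients of P_0, ..., P_max_degree via Bonnet's recurrence."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for l in range(1, max_degree):
        prev, cur = polys[l - 1], polys[l]
        nxt = [Fraction(0)] * (l + 2)
        # (l + 1) P_{l+1}(u) = (2l + 1) u P_l(u) - l P_{l-1}(u)
        for k, c in enumerate(cur):
            nxt[k + 1] += Fraction(2 * l + 1, l + 1) * c
        for k, c in enumerate(prev):
            nxt[k] -= Fraction(l, l + 1) * c
        polys.append(nxt)
    return polys[: max_degree + 1]


def monomial_integral(k):
    """Integral of u^k over [-1, 1]."""
    return Fraction(2, k + 1) if k % 2 == 0 else Fraction(0)


def beta(m, l, polys):
    """Coefficient of P_l in the Legendre expansion of u^m."""
    inner = sum(c * monomial_integral(m + k) for k, c in enumerate(polys[l]))
    return Fraction(2 * l + 1, 2) * inner


def legendre_component(series, l, polys):
    """Component along P_l of the polynomial with monomial coefficients `series`."""
    return sum(c * beta(m, l, polys) for m, c in enumerate(series) if m >= l)


max_degree = 10
polys = legendre_coeffs(max_degree)
h = [Fraction(1, math.factorial(m)) for m in range(max_degree + 1)]
g = [m * m * h_m for m, h_m in enumerate(h)]  # g_m / h_m = m^2 is non-decreasing

for l in range(1, 6):
    ratio = legendre_component(g, l, polys) / legendre_component(h, l, polys)
    print(l, float(ratio), float(g[l] / h[l]))
```

In each printed row the first number (the ratio of Legendre components) is at least the second (the ratio of power series coefficients), as the lemma states.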

### 5.6 Relation to the NTK’s Spectrum

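Substituting $g(u) = u\,\psi_0(u)$ and $h(u) = \psi_1(u)$ in the above Lemma leads to:

$$\frac{\left( u\,\psi_0(u) \right)_{P_l}}{\left( \psi_1(u) \right)_{P_l}} \ge \frac{(a'_{l-1})^2}{a_l^2} = l^2.$$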

We formalize the above conclusion through the following corollary of Theorem 2 and Lemma 3:

###### Corollary 1.

For two-layer networks with an activation function $\sigma$ such that $\sigma$ and $\sigma'$ admit Hermite series expansions, let $\lambda_{0,l}, \lambda_{1,l}$ denote the components along the degree-$l$ Legendre polynomial of the contributions to the NTK by the first and second layer respectively. Then, $\frac{\lambda_{0,l}}{\lambda_{1,l}} \ge l^2$.

#### 5.6.1 Spherical Harmonics

Since the inner product between two points on the unit sphere lies in the range $[-1, 1]$, any dot-product kernel can be expressed in the basis of Legendre polynomials. Subsequently, using the Funk-Hecke formula (Frye and Efthimiou, 2012), we can obtain the components of the kernel in the basis of products of the corresponding spherical harmonics, leading to the Mercer decomposition of the kernels. Analogous to the exponential functions on the real line, the spherical harmonics are eigenfunctions of the Laplace–Beltrami operator defined on the unit sphere. Let $\lambda_{0,k}, \lambda_{1,k}$ denote the coefficients in the Legendre series expansions of $u\,\psi_0(u)$ and $\psi_1(u)$ respectively. We utilize the following standard identity, whose proof for arbitrary dimensions can be found in Frye and Efthimiou (2012):

###### Lemma 4.

The value of the degree-$k$ Legendre polynomial as a function of the inner product of two points $x, x'$ lying on the unit sphere $\mathbb{S}^{d-1}$ can be diagonalized in the basis of the tensor product of spherical harmonics as follows:

$$P_k(\langle x', x \rangle) = \frac{1}{N(d,k)} \sum_{j=1}^{N(d,k)} Y_{k,j}(x')\, Y_{k,j}(x)$$
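As a concrete instance of this identity, consider $d = 2$ and $k \ge 1$, with points parametrized by their polar angles, $x = (\cos\theta, \sin\theta)$ and $x' = (\cos\theta', \sin\theta')$. Assuming harmonics normalized w.r.t. the uniform probability measure on the circle, so that $N(2,k) = 2$ with $Y_{k,1}(x) = \sqrt{2}\cos(k\theta)$ and $Y_{k,2}(x) = \sqrt{2}\sin(k\theta)$, the identity reduces to the angle-difference formula:

$$\frac{1}{2}\left[ 2\cos(k\theta')\cos(k\theta) + 2\sin(k\theta')\sin(k\theta) \right] = \cos\left(k(\theta - \theta')\right) = P_k\left(\cos(\theta - \theta')\right),$$

where $P_k(\cos\varphi) = \cos(k\varphi)$ is the degree-$k$ Chebyshev polynomial, which plays the role of the Legendre polynomial in two dimensions.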

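Substituting in the Legendre series expansions for $u\,\psi_0(u)$ and $\psi_1(u)$, we obtain:

$$\begin{aligned}
\langle x', x \rangle\, \psi_0(\langle x', x \rangle) &= \sum_{k=0}^{\infty} \lambda_{0,k} P_k(\langle x', x \rangle) = \sum_{k=0}^{\infty} \lambda_{0,k} \frac{1}{N(d,k)} \sum_{j=1}^{N(d,k)} Y_{k,j}(x')\, Y_{k,j}(x) \\
\psi_1(\langle x', x \rangle) &= \sum_{k=0}^{\infty} \lambda_{1,k} P_k(\langle x', x \rangle) = \sum_{k=0}^{\infty} \lambda_{1,k} \frac{1}{N(d,k)} \sum_{j=1}^{N(d,k)} Y_{k,j}(x')\, Y_{k,j}(x)
\end{aligned} \tag{20}$$

Thus the ratio of the eigenvalues of the two kernels corresponding to a degree-$k$ spherical harmonic is simply given by $\frac{\lambda_{0,k}}{\lambda_{1,k}}$. Using the above relationship and Corollary 1, we have the following corollary: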

###### Corollary 2.

For a functional derivative $d|_{f(t)}$ lying in the eigenspace of degree-$k$ spherical harmonics, the ratio of decrements satisfies $\frac{d^{(0)}\left(d|_{f(t)}\right)}{d^{(1)}\left(d|_{f(t)}\right)} \ge k^2$.

### 5.7 Convergence Rate while Training Individual Layers

The spectral bias of the kernel corresponding to a given layer is related to the rate of convergence along different eigenfunctions while training only the corresponding layer. Concretely, we observe that while training only the parameters of the $\ell$-th layer, i.e. $\theta_\ell$, the changes in the output are described by the kernel $\kappa^{(\ell)}$. Thus, analogous to Equation 2, the evolution of the cost upon training the $\ell$-th layer while keeping the other layers fixed is given by:

$$\partial_t C|_{f(t)} = -\left\| d|_{f(t)} \right\|^2_{\kappa^{(\ell)}}, \tag{21}$$

where $f(t)$ denotes the output function corresponding to the full network. Now, suppose that the functional derivative of the cost can be decomposed along the eigenfunctions of $\kappa^{(\ell)}$ as $d|_{f(t)} = \sum_{i=1}^{k} d^{(\lambda_i)}|_{f(t)}$, where $d^{(\lambda_i)}|_{f(t)}$ corresponds to the component along the eigenfunction with eigenvalue $\lambda_i$. Then, due to the orthonormality of the eigenfunctions, we obtain:

$$\partial_t C|_{f(t)} = -\sum_{i=1}^{k} \lambda_i \left\| d^{(\lambda_i)}|_{f(t)} \right\|^2_{p^{in}}, \tag{22}$$

Thus, the decrements in the cost can be decomposed into the contributions from different eigenvalues, with the rate of decrease of each contribution being proportional to the magnitude of the eigenvalue. Our analysis thus predicts that, while training the initial layers, the relative rate of convergence of the function along different eigenfunctions should be larger for higher-frequency directions than when training the later layers.
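The dependence of the convergence rate on the eigenvalue can be made concrete in a simplified setting. The sketch below assumes a squared loss and a kernel that stays fixed during training, so that the residual component along an eigenfunction with eigenvalue $\lambda_i$ obeys $\dot{r}_i = -\lambda_i r_i$ and decays as $e^{-\lambda_i t}$; the eigenvalues used are arbitrary illustrative numbers, not values computed from any network.

```python
import math


def residual_after_training(initial_residuals, eigenvalues, time, num_steps=10_000):
    """Integrate d r_i / dt = -lambda_i * r_i with explicit Euler steps."""
    dt = time / num_steps
    residuals = list(initial_residuals)
    for _ in range(num_steps):
        residuals = [r - dt * lam * r for r, lam in zip(residuals, eigenvalues)]
    return residuals


eigenvalues = [4.0, 1.0, 0.25]  # illustrative values only
initial = [1.0, 1.0, 1.0]
simulated = residual_after_training(initial, eigenvalues, time=1.0)
closed_form = [r * math.exp(-lam * 1.0) for r, lam in zip(initial, eigenvalues)]

for lam, sim, exact in zip(eigenvalues, simulated, closed_form):
    print(lam, sim, exact)
```

Directions with larger eigenvalues are fitted first, which is the mechanism by which a layer whose kernel assigns relatively larger eigenvalues to high-frequency directions converges relatively faster along them.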

## 6 Experiments

We empirically verify the validity of our theoretical analysis on both synthetic datasets of spherical harmonics and the high-dimensional image dataset MNIST (Deng, 2012). In all our settings, we measure the norms of different directions in function space, for a given kernel, as defined in Equation 3, through the quadratic forms on the corresponding Gram matrix defined on training points. To compare relative contributions, we further divide each layer’s quadratic form by the corresponding value for the last layer. Thus in all our plots, the “contributions” denote the ratio of the projections of a given target vector along the Gram matrices corresponding to the given layer and the last layer. We include a full definition of the plotted quantities in the Appendix. Additional results for other datasets, details of the experiments, and experiments for rates of convergence, as discussed in Section 5.7, are also provided in the Appendix. The relative values of the contributions in different settings indicate the prevalence of the layer-wise spectral bias in finite dimensional networks, supporting our analysis. In all our plots, confidence intervals are evaluated over random initializations.
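The normalized quadratic forms described above can be sketched as follows. This is a minimal illustration, assuming the per-layer Gram matrices have already been computed and are given as nested lists; the function names and the toy $2 \times 2$ matrices are hypothetical and are not the paper's experimental code.

```python
def quadratic_form(gram, target):
    """Compute target^T * gram * target for a square Gram matrix."""
    n = len(target)
    return sum(target[i] * gram[i][j] * target[j] for i in range(n) for j in range(n))


def relative_contributions(layer_grams, target):
    """Quadratic form of each layer's Gram matrix, divided by the last layer's value."""
    forms = [quadratic_form(gram, target) for gram in layer_grams]
    return [form / forms[-1] for form in forms]


# Toy Gram matrices sharing the eigenvectors (1, 1) and (1, -1).
first_layer = [[2.0, 1.0], [1.0, 2.0]]    # eigenvalues 3 and 1
last_layer = [[3.0, -1.0], [-1.0, 3.0]]   # eigenvalues 2 and 4

print(relative_contributions([first_layer, last_layer], [1.0, -1.0]))  # -> [0.25, 1.0]
print(relative_contributions([first_layer, last_layer], [1.0, 1.0]))   # -> [1.5, 1.0]
```

When the target vector is a shared eigenvector, as in the toy example, the relative contribution equals the ratio of the corresponding eigenvalues, mirroring Equation 5.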

### 6.1 Spherical Harmonics

We plot the layer-wise contributions for spherical harmonics corresponding to input dimension 2 and 10 in Figures 1 and 2 respectively.

#### 6.1.1 Two Dimensions

In two dimensions, the degree-$k$ Legendre polynomial is simply $P_k(\cos\theta) = \cos(k\theta)$, and the spherical harmonics simply correspond to the cosine and sine functions of degree $k$ in terms of the polar angle of the input points. For this simple setup, we validate our theoretical analysis through experiments on a regression task with uniformly distributed data on the unit sphere corresponding to dimension $d = 2$. Each input datapoint is thus described as $x = (\cos\theta, \sin\theta)$ with $\theta$ being uniformly distributed in the interval $[0, 2\pi)$. We use a fully connected network with four layers, and consider the quadratic forms for the kernel Gram matrix, evaluated at functions $\cos(k\theta)$ and $\sin(k\theta)$. To obtain the relative magnitudes of the contributions of different layers for each degree, we normalize each contribution by the corresponding contribution of the last layer.

### 6.2 Higher Dimensions

For higher dimensions, we utilize the fact that functions of the form $\sum_{l=1}^{L} a_l P_k(\langle x_l, x \rangle)$, for fixed vectors $x_1, \dots, x_L$, are linear combinations of degree-$k$ spherical harmonics. This can be proved by substituting the expansion of $P_k$ in terms of spherical harmonics, as given by Lemma 4:

$$\sum_{l=1}^{L} a_l P_k(\langle x_l, x \rangle) = \sum_{l=1}^{L} \frac{a_l}{N(d,k)} \sum_{j=1}^{N(d,k)} Y_{k,j}(x_l)\, Y_{k,j}(x) = \sum_{j=1}^{N(d,k)} b_j Y_{k,j}(x),$$

where $b_j = \frac{1}{N(d,k)} \sum_{l=1}^{L} a_l Y_{k,j}(x_l)$. Thus, we can sample functions lying in the space spanned by the spherical harmonics of degree $k$ by taking linear combinations of the degree-$k$ Legendre polynomial evaluated at the inner product with randomly sampled points. For our experiments, we consider input dimension $d = 10$ and sample points from the uniform distribution on the unit sphere. Note that, since we evaluate the ratio of contributions, the values remain the same upon scaling the functions by arbitrary constants.
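The sampling procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the $d$-dimensional Legendre polynomial is computed as a Gegenbauer polynomial with parameter $\alpha = (d-2)/2$, normalized so that $P_k(1) = 1$ (valid for $d \ge 3$); the number of centers and the Gaussian coefficients $a_l$ are arbitrary choices not specified in the text.

```python
import math
import random


def legendre_d(k, t, d):
    """Degree-k Legendre polynomial in dimension d >= 3, normalized so P_k(1) = 1."""
    alpha = (d - 2) / 2.0

    def gegenbauer(u):
        # n C_n(u) = 2 u (n + alpha - 1) C_{n-1}(u) - (n + 2 alpha - 2) C_{n-2}(u)
        if k == 0:
            return 1.0
        prev, cur = 1.0, 2.0 * alpha * u
        for n in range(2, k + 1):
            prev, cur = cur, (2.0 * u * (n + alpha - 1.0) * cur
                              - (n + 2.0 * alpha - 2.0) * prev) / n
        return cur

    return gegenbauer(t) / gegenbauer(1.0)


def sample_sphere(d, rng):
    """Draw a point uniformly from the unit sphere by normalizing a Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z]


def sample_degree_k_function(k, d, num_centers, rng):
    """Return g(x) = sum_l a_l P_k(<x_l, x>) with random centers x_l and coefficients a_l."""
    centers = [sample_sphere(d, rng) for _ in range(num_centers)]
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(num_centers)]

    def g(x):
        return sum(
            a * legendre_d(k, sum(c * xi for c, xi in zip(center, x)), d)
            for a, center in zip(coeffs, centers)
        )

    return g


rng = random.Random(0)
g = sample_degree_k_function(k=3, d=10, num_centers=5, rng=rng)
x = sample_sphere(10, rng)
print(g(x))
```

Since only ratios of contributions are evaluated, the overall scale of the sampled coefficients does not affect the resulting measurements.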

In the appendix, we provide additional results for other activation functions and settings.

### 6.3 MNIST

While our theoretical analysis is based on inputs distributed on the sphere, we hypothesize that a similar phenomenon of amplification of the initial layers’ contributions for high-frequency directions occurs in the case of more complex high-dimensional datasets. To validate this, we consider the MNIST dataset (Deng, 2012), corresponding to images of dimension $28 \times 28$. Since an analysis of the Mercer decomposition for such a dataset is intractable, we utilize the following set of functions to represent different frequencies of variation in the input space:

1. Following Rahaman et al. (2019), we consider radial noise of different frequencies, i.e. noise of the form $\sin(k\|x\|)$ for different frequencies $k$.

2. Following Xu (2020), we consider the sine and cosine functions of different degrees, defined along the top principal component of the input data.

3. Unlike the case of a uniform distribution on the sphere, the kernels corresponding to contributions from different layers may not be diagonalizable in a common basis of orthonormal eigenfunctions. However, we can approximate such a shared basis for the layer-wise kernels using the eigenvectors of the kernel corresponding to the full network.

In each case, we consider a ReLU network containing 4 fully connected layers and evaluate the quadratic forms for the NTK Gram matrices of different layers at functions of different frequencies. We provide the results for the first setting in Figure 3, while the results for the remaining settings are provided in the Appendix.
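The second family of target functions listed above (sinusoids along the top principal component) can be sketched as follows. This is a rough illustration only: power iteration is used here as one simple way to obtain the top principal component, the synthetic two-dimensional data stands in for flattened images, and the frequency scaling (projections rescaled to $[-1, 1]$ and multiplied by $\pi k$) is an assumption rather than the paper's stated choice.

```python
import math
import random


def top_principal_component(data, num_iters=200, seed=0):
    """Return (mean, direction) where direction is the top principal component."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(num_iters):
        # Apply the empirical covariance matrix to v without forming it explicitly.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        new_v = [sum(p * r[j] for p, r in zip(proj, centered)) / n for j in range(d)]
        norm = math.sqrt(sum(c * c for c in new_v))
        v = [c / norm for c in new_v]
    return mean, v


def sinusoid_targets(data, degree):
    """Evaluate sin(pi * degree * p / max|p|), with p the projection on the top component."""
    mean, v = top_principal_component(data)
    proj = [sum((row[j] - mean[j]) * v[j] for j in range(len(v))) for row in data]
    scale = max(abs(p) for p in proj)
    return [math.sin(math.pi * degree * p / scale) for p in proj]


# Synthetic stand-in data concentrated along the direction (2, 1).
rng = random.Random(0)
data = []
for _ in range(200):
    t = rng.gauss(0.0, 1.0)
    data.append([2.0 * t + rng.gauss(0.0, 0.01), 1.0 * t + rng.gauss(0.0, 0.01)])

targets = sinusoid_targets(data, degree=3)
print(targets[:5])
```

The resulting target vector can then be plugged into the quadratic forms of the layer-wise Gram matrices in the same way as the spherical harmonic targets.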

## 7 Future Work and Limitations

While the NTK framework provides insights into the inductive biases of Deep Neural Networks trained under gradient descent, its applicability is limited to settings belonging to the “lazy training” regime (Chizat et al., 2019). In particular, since the NTK setting assumes that the weights remain near the initialization, it does not directly explain the properties of feature learning in deep neural networks. However, the kernel and related objects at initialization can still allow us to characterize the initial phase of training. A promising direction could be to extend our analysis to the spectral properties of intermediate layers near initialization. Extending our analysis to multiple layers is complicated by the presence of products and compositions of power series. Future work could involve avoiding such impediments by utilizing more general characterizations of frequencies in the kernel feature space.

## References

• Bartlett and Mendelson (2001) P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. In D. P. Helmbold and R. C. Williamson, editors, Computational Learning Theory, 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16-19, 2001, Proceedings, volume 2111 of Lecture Notes in Computer Science, pages 224–240. Springer, 2001.
• Basri et al. (2020) R. Basri, M. Galun, A. Geifman, D. Jacobs, Y. Kasten, and S. Kritchman. Frequency bias in neural networks for input of non-uniform density. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 685–694. PMLR, 13–18 Jul 2020.
• Bietti and Bach (2021) A. Bietti and F. Bach. Deep equals shallow for ReLU networks in kernel regimes. In International Conference on Learning Representations, 2021.
• Bietti and Mairal (2019) A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels, 2019.
• Cao et al. (2019) Y. Cao, Z. Fang, Y. Wu, D.-X. Zhou, and Q. Gu. Towards understanding the spectral bias of deep learning, 2019.
• Chen and Xu (2021) L. Chen and S. Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS. In International Conference on Learning Representations, 2021.
• Chizat et al. (2019) L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
• Daniely et al. (2016) A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
• Deng (2012) L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
• Frye and Efthimiou (2012) C. Frye and C. J. Efthimiou. Spherical harmonics in p dimensions, 2012.
• Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2018.
• Rahaman et al. (2019) N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5301–5310. PMLR, 09–15 Jun 2019.
• Scetbon and Harchaoui (2021) M. Scetbon and Z. Harchaoui. A spectral analysis of dot-product kernels. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3394–3402. PMLR, 13–15 Apr 2021.
• Schoenberg (1942) I. J. Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96 – 108, 1942.
• Schoenholz et al. (2017) S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
• Smola et al. (2001) A. Smola, Z. Óvári, and R. C. Williamson. Regularization with dot-product kernels. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.
• Valle-Perez et al. (2019) G. Valle-Perez, C. Q. Camargo, and A. A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations, 2019.
• Xu (2020) Z.-Q. J. Xu. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, Jun 2020. ISSN 1991-7120.
• Yang and Salman (2020) G. Yang and H. Salman. A fine-grained spectral perspective on neural networks, 2020.
• Zhang et al. (2017) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

## 8 Full Proof of Theorem 1

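We follow the notation in Section 5.3. We recall that Hoeffding’s inequality leads to the following two inequalities:

$$\begin{aligned}
\Pr\left[ \left| \frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i) - \lambda_1 \right| \ge t \right] &\le 2e^{-\frac{2nt^2}{AC^2}} \\
\Pr\left[ \left| \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i) - \lambda_2 \right| \ge t \right] &\le 2e^{-\frac{2nt^2}{BC^2}}
\end{aligned}$$

Our aim is to utilize the above two inequalities to bound the deviation of the empirical ratio $\frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)}$ from $\frac{\lambda_1}{\lambda_2}$. We proceed by bounding the absolute difference between the ratios in terms of the absolute differences between the numerators and denominators as follows:

$$\begin{aligned}
\left| \frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)} - \frac{\lambda_1}{\lambda_2} \right|
&= \left| \frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)\, \lambda_2 - \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)\, \lambda_1}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)\, \lambda_2} \right| \\
&= \left| \frac{\frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i)\, \lambda_2 - \lambda_1\lambda_2 + \lambda_1\lambda_2 - \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)\, \lambda_1}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)\, \lambda_2} \right| \\
&\le \frac{\left| \frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i) - \lambda_1 \right|}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)} + \frac{\left| \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i) - \lambda_2 \right| \lambda_1}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)\, \lambda_2}
\end{aligned}$$

Next, we observe that to ensure that the above difference is bounded by $\epsilon\frac{\lambda_1}{\lambda_2}$, it is sufficient to have:

$$\frac{\left| \frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i) - \lambda_1 \right|}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)} \le \frac{\epsilon}{2}\frac{\lambda_1}{\lambda_2}, \qquad \frac{\left| \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i) - \lambda_2 \right| \lambda_1}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)\, \lambda_2} \le \frac{\epsilon}{2}\frac{\lambda_1}{\lambda_2}$$

Now suppose we have $\left| \frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i) - \lambda_1 \right| \le \frac{\epsilon}{4}\lambda_1$ and $\left| \frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i) - \lambda_2 \right| \le \frac{\epsilon}{4}\lambda_2$ for some $\epsilon$. The second condition gives $\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i) \ge \lambda_2\left(1 - \frac{\epsilon}{4}\right)$, so the ratios can be bounded as follows:

$$\frac{\left| \frac{1}{n}\sum_{i=1}^{n} \phi_1(x_i) - \lambda_1 \right|}{\frac{1}{n}\sum_{i=1}^{n} \phi_2(x_i)} \le \frac{\epsilon\lambda_1}{4\lambda_2\left(1 - \frac{\epsilon}{4}\right)}$$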