Over-parameterized neural networks have achieved great success in many applications such as computer vision (He et al., 2016) and speech recognition (Hinton et al., 2012). It has been shown that over-parameterized neural networks can fit complicated target functions or even randomly labeled data (Zhang et al., 2017) and still exhibit good generalization performance when trained with real labels. Intuitively, this is at odds with traditional notions of generalization based on model complexity. To understand neural network training, a line of work (Soudry et al., 2018; Gunasekar et al., 2018b, a) has approached the question from the perspective of “implicit bias”, which states that training algorithms for deep learning implicitly impose an inductive bias on the training process and lead to a solution with low complexity, as measured by certain norms in the parameter space of the neural network.
Among many attempts to establish implicit bias, Rahaman et al. (2019) pointed out an intriguing phenomenon called spectral bias: during training, neural networks tend to learn components of lower complexity faster. The concept of spectral bias is appealing because it may intuitively explain why over-parameterized neural networks can achieve good generalization performance without overfitting. During training, the networks fit the low-complexity components first and thus lie in a concept class of low complexity. Arguments like this may lead to a rigorous guarantee for generalization.
Great efforts have been made in search of explanations for spectral bias. Rahaman et al. (2019) evaluated the Fourier spectrum of ReLU networks and empirically showed that lower frequencies are learned first and are also more robust to random perturbation. Andoni et al. (2014) showed that for a sufficiently wide two-layer network, gradient descent with respect to the second layer can learn any low-degree bounded polynomial. Xu (2018) applied Fourier analysis to two-layer networks and showed similar empirical results on one-dimensional functions and real data. Nakkiran et al. (2019) used an information-theoretic approach to show that networks obtained by stochastic gradient descent can be explained by a linear classifier during early training. All these studies provide certain explanations of why neural networks exhibit spectral bias in real tasks, but the theoretical explanations, where they exist, are somewhat restricted. For example, the popular Fourier analysis is usually done in the one-dimensional setting, and thus lacks generality.
Meanwhile, a recent line of work (Jacot et al., 2018; Du et al., 2019b; Li and Liang, 2018; Chizat et al., 2019) has shed light on new approaches to analyzing neural networks. In particular, they show that under certain over-parameterization conditions, the neural network trained by gradient descent behaves similarly to the kernel regression predictor using the neural tangent kernel (NTK) (Jacot et al., 2018). Du et al. (2019b) showed that convergence is provably guaranteed under certain over-parameterization conditions determined by the smallest eigenvalue of the NTK. Arora et al. (2019a) further gave a finer characterization of error convergence based on the eigenvalues of the NTK’s Gram matrix. Su and Yang (2019) improved the convergence guarantee in terms of the -th largest eigenvalue for certain target functions.
Inspired by the works mentioned above, we present a theoretical explanation for spectral bias. In the NTK regime, we establish a precise characterization of the training process of neural networks. More specifically, we prove that the training process of over-parameterized neural networks is controlled by the eigenvalues of the integral operator defined by the NTK. In the specific case of the uniform distribution on the unit sphere, we give an exact calculation of these eigenvalues and show that the lower frequencies have larger eigenvalues, which leads to faster convergence. We also conduct experiments to corroborate the theory we establish.
Our contributions are highlighted as follows:
We prove a generic theorem for arbitrary data distributions, which states that under certain sample complexity and over-parameterization conditions, the error term’s convergence along different directions actually relies on the corresponding eigenvalues. This theorem gives a more precise control on the regression residual than Su and Yang (2019), where the authors focused on the case when the labeling function is close to the subspace spanned by the first few eigenfunctions.
We present a more general result about the spectra of the neural tangent kernel. In particular, we show that the order of eigenvalues appears as . Our result is better than the bound derived in Bietti and Mairal (2019) when , which is clearly a more practical setting.
We establish a rigorous explanation for the spectral bias based on the aforementioned theoretical results without any specific assumptions on the target function. We show that the error terms from different frequencies are provably controlled by the eigenvalues of the NTK, and the lower-frequency components can be learned with fewer training examples and narrower networks at a faster convergence rate. As far as we know, this is the first attempt to give a comprehensive theory justifying the existence of spectral bias.
1.1 Additional Related Work
Recently, a rich literature has emerged on the properties of the neural tangent kernel. Jacot et al. (2018) first showed that during training, the network function follows a descent along the kernel gradient with respect to the Neural Tangent Kernel (NTK) in the infinite-width setting. Li and Liang (2018) and Du et al. (2019b) implicitly built a connection between the Neural Tangent Kernel and gradient descent by showing that GD can provably optimize sufficiently wide two-layer neural networks. In Du et al. (2019b), it is shown that gradient descent can achieve zero training loss at a linear convergence rate when training a two-layer ReLU network with square loss. Allen-Zhu et al. (2019); Du et al. (2019a); Zou et al. (2019); Cao and Gu (2019b); Arora et al. (2019b); Zou and Gu (2019); Cao and Gu (2019a); Frei et al. (2019) further studied the optimization and generalization of deep neural networks. These papers are all in the so-called neural tangent kernel regime, and their requirements on the network width depend either implicitly or explicitly on the smallest eigenvalue of the kernel Gram matrix. Later, Su and Yang (2019) showed that this smallest eigenvalue actually scales in the number of samples and will eventually converge to . In order to obtain a constant convergence rate, Su and Yang (2019) assumed that the target function can be approximated by the first few eigenfunctions of the integral operator where is the NTK function and is the input distribution, and proved a linear convergence rate up to this approximation error.
A few theoretical results have been established towards understanding the spectra of neural tangent kernels. Bach (2017) studied two-layer ReLU networks by relating them to kernel methods, and proposed a harmonic decomposition for the functions in the reproducing kernel Hilbert space, which we utilize in our proof. Based on the technique in Bach (2017), Bietti and Mairal (2019) studied the eigenvalue decay of the integral operator defined by the NTK on the unit sphere by using spherical harmonics. Vempala and Wilmes (2019) calculated the eigenvalues of the NTK corresponding to two-layer neural networks with sigmoid activation function. Basri et al. (2019) established results similar to those of Bietti and Mairal (2019), but considered the case of training the first layer parameters of a two-layer network with bias terms. Yang and Salman (2019) studied the eigenvalues of the integral operator with respect to the NTK on the Boolean cube by Fourier analysis.
In this section we introduce the basic problem setup including the neural network structure and the training algorithm, as well as some background on the neural tangent kernel proposed recently in Jacot et al. (2018) and the corresponding integral operator.
We use lower case, lower case bold face, and upper case bold face letters to denote scalars, vectors and matrices respectively. For a vector and a number , we denote its norm by . We also define infinity norm by . For a matrix , we use to denote the number of non-zero entries of , and use to denote its Frobenius norm. Let for , and . For two matrices , we define . We use if is positive semi-definite. In addition, we define the asymptotic notations , , and as follows. Suppose that and are two sequences. We write if , and if . We use and to hide the logarithmic factors in and .
2.2 Problem Setup
Here we introduce the basic problem setup. We consider two-layer fully connected neural networks of the form
where , are the first and second layer weight matrices respectively, and is the entry-wise ReLU activation function. Here the dimension of the input is , since throughout this paper we assume that all training data lie in the -dimensional unit sphere . The network is trained according to the square loss on training examples :
where is a small coefficient to control the effect of initialization, and the data inputs are assumed to follow some unknown distribution on the unit sphere . Without loss of generality, we also assume that .
We first randomly initialize the parameters of the network, and then apply gradient descent to optimize both layers. We present our detailed neural network training algorithm in Algorithm 1.
The initialization scheme for given in Algorithm 1 is known as He initialization (He et al., 2015). This scheme generates each entry of the weight matrices from a Gaussian distribution with mean zero. The variances of the Gaussian distributions are chosen following the principle that the initialization does not change the magnitudes of inputs in each layer. The second layer parameter is not associated with the ReLU activation function, thus it is initialized with variance instead of .
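The magnitude-preserving principle behind this initialization can be checked numerically. The following is a minimal sketch, not Algorithm 1 itself: the variances `2/m` (first layer) and `1/m` (second layer), the width `m`, and the helper names `init_two_layer` and `hidden_activations` are assumptions chosen for illustration, following the usual He-style convention for ReLU layers.

```python
import numpy as np

def init_two_layer(m, input_dim, rng):
    """Gaussian initialization with ASSUMED variances 2/m (layer 1) and 1/m (layer 2)."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, input_dim))
    W2 = rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m))
    return W1, W2

def hidden_activations(W1, x):
    """Entry-wise ReLU applied to the first-layer pre-activations."""
    return np.maximum(W1 @ x, 0.0)

rng = np.random.default_rng(0)
m, input_dim = 20000, 10          # hypothetical width and input dimension

# A unit-norm input, matching the assumption that data lie on the unit sphere.
x = rng.normal(size=input_dim)
x /= np.linalg.norm(x)

W1, W2 = init_two_layer(m, input_dim, rng)
h = hidden_activations(W1, x)

# With variance 2/m, the squared norm of the hidden layer concentrates
# around the squared norm of the input (here 1) as m grows.
hidden_sq_norm = float(h @ h)
print(hidden_sq_norm)
```

The factor 2 in the first-layer variance compensates for ReLU zeroing out roughly half of the pre-activations, which is why the layer without a ReLU uses the smaller variance.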
2.3 Neural Tangent Kernel
Many attempts have been made to study the convergence of gradient descent assuming the width of the network is extremely large (Du et al., 2019b; Li and Liang, 2018). When the width of the network goes to infinity, with certain initialization on parameters, the inner product of gradients of the output function would converge to a limiting kernel, i.e., Neural Tangent Kernel (Jacot et al., 2018). In this paper, we denote it by and we have
Since we apply gradient descent to both layers, the Neural Tangent Kernel is the sum of two different kernel functions, and clearly it can be reduced to the one-layer training setting. These two kernels are arc-cosine kernels of degree 0 and 1 (Cho and Saul, 2009). Their explicit expressions are given as
2.4 Integral Operator
The theory of integral operators with respect to kernel functions has been well studied in machine learning (Smale and Zhou, 2007; Rosasco et al., 2010), so we only give a brief introduction here. Let be the Hilbert space of square-integrable functions with respect to a Borel measure from . For any continuous kernel function and we can define an integral operator on by
It has been pointed out in Cho and Saul (2009) that arc-cosine kernels are positive semi-definite. Thus the kernel function defined by (1) is positive semi-definite being a product and a sum of positive semi-definite kernels. Clearly this kernel is also continuous and symmetric. Thus we know that the neural tangent kernel is a Mercer kernel.
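As a numerical sanity check of the positive semi-definiteness argument, the sketch below evaluates the two arc-cosine kernels on unit vectors and assembles a kernel of the form "inner product times degree-0 kernel, plus degree-1 kernel". The normalization follows Cho and Saul (2009), so constant factors may differ from the expressions used in this paper; the function names are hypothetical.

```python
import numpy as np

def kappa0(t):
    """Arc-cosine kernel of degree 0 on unit vectors, as a function of t = <x, x'>."""
    t = np.clip(t, -1.0, 1.0)
    return (np.pi - np.arccos(t)) / np.pi

def kappa1(t):
    """Arc-cosine kernel of degree 1 on unit vectors, as a function of t = <x, x'>."""
    t = np.clip(t, -1.0, 1.0)
    return (t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t * t)) / np.pi

def ntk(t):
    """Sum of a first-layer term t * kappa0(t) and a second-layer term kappa1(t).
    The relative constants are an ASSUMPTION for this sketch."""
    return t * kappa0(t) + kappa1(t)

def ntk_gram(X):
    """Gram matrix of the kernel on the rows of X (rows are assumed unit-norm)."""
    return ntk(X @ X.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)

K = ntk_gram(X)
min_eig = float(np.linalg.eigvalsh(K).min())
print(min_eig)  # non-negative up to rounding if the Gram matrix is PSD
```

Clipping the inner products to [-1, 1] guards against rounding errors that would otherwise make `arccos` or the square root return NaN on the diagonal.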
3 Main Results
In this section we present our main results. In Section 3.1, we give a general result on the convergence rate of gradient descent along different eigendirections of the neural tangent kernel. Motivated by this result, in Section 3.2, we give a case study on the spectrum of when the input data are uniformly distributed over the unit sphere . In Section 3.3, we combine the spectrum analysis with the general convergence result to give an explicit convergence rate for uniformly distributed data on the unit sphere.
3.1 Convergence Analysis of Gradient Descent
In this section we study the convergence of Algorithm 1. Instead of studying the standard convergence of the loss function value, we aim to provide a refined analysis of the speed of convergence along different directions defined by the eigenfunctions of . We first introduce the following definitions and notations.
Let with be the strictly positive eigenvalues of , and be the corresponding orthonormal eigenfunctions. Set , . Note that may have eigenvalues with multiplicities larger than and , are not distinct. Therefore for any integer , we define as the sum of the multiplicities of the first distinct eigenvalues of . Define . By definition, , are rescaled restrictions of orthonormal functions in on the training examples. Therefore we can expect them to form a set of almost orthonormal bases in the vector space . The following lemma follows from a standard concentration inequality.
Suppose that for all and . For any , with probability at least ,
where is an absolute constant.
Denote and , . Then Lemma 3.1 shows that the convergence rate of roughly represents the speed at which gradient descent learns the components of the target function corresponding to the first eigenvalues. The following theorem gives the convergence guarantee of .
Suppose for and . For any and integer , if , , then with probability at least , Algorithm 1 with , satisfies
Theorem 3.1 provides a more refined convergence analysis than existing results (Arora et al., 2019a; Su and Yang, 2019), which only focus on the full residual. By studying the projections of the residual along different directions, our result theoretically reveals the spectral bias of deep learning. Specifically, as long as the network is wide enough and the sample size is large enough, gradient descent first learns the target function along the eigendirections of the neural tangent kernel with larger eigenvalues, and then learns the remaining components corresponding to smaller eigenvalues. Moreover, by showing that learning the components corresponding to larger eigenvalues can be done with a smaller sample size and narrower networks, our theory pushes the study of neural networks in the NTK regime towards a more practical setting. For these reasons, we believe that Theorem 3.1 to a certain extent explains the empirical observations given in Rahaman et al. (2019), and demonstrates that the difficulty of a function to be learned by a neural network can be characterized in the eigenspace of the neural tangent kernel: if the target function has a component corresponding to a small eigenvalue of the neural tangent kernel, then learning this function up to good accuracy takes longer, and requires more examples and a wider network.
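The mechanism described above can be illustrated with idealized linear dynamics. The sketch below is not Algorithm 1: it iterates the kernel-regression residual update `r <- (I - eta * K) r` for a fixed positive semi-definite matrix `K`, which is the simplification one obtains when the network is replaced by its linearization. The random matrix `K`, the step size `eta`, and the function name `residual_projections` are all assumptions made for illustration. In this idealized setting the projection of the residual onto the i-th eigenvector shrinks exactly by the factor `|1 - eta * lambda_i|` per step, so directions with larger eigenvalues are fitted first.

```python
import numpy as np

def residual_projections(K, r0, eta, steps):
    """Iterate r <- (I - eta*K) r and record |<v_i, r>| for every eigenvector v_i of K.

    Returns the eigenvalues in descending order and an array of shape
    (steps + 1, n) whose row t holds the projection lengths after t steps.
    """
    lam, V = np.linalg.eigh(K)          # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]      # reorder to descending
    r = r0.copy()
    history = [np.abs(V.T @ r)]
    for _ in range(steps):
        r = r - eta * (K @ r)
        history.append(np.abs(V.T @ r))
    return lam, np.array(history)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))
K = A @ A.T / 20.0                       # stand-in PSD matrix, not an actual NTK Gram matrix
r0 = rng.normal(size=20)                 # stand-in initial residual
eta = 0.5 / np.linalg.eigvalsh(K).max()  # step size small enough for monotone decay
steps = 10

lam, hist = residual_projections(K, r0, eta, steps)

# Closed-form prediction: each projection decays geometrically at its own rate.
predicted = hist[0] * np.abs(1.0 - eta * lam) ** steps
print(np.max(np.abs(hist[-1] - predicted)))
```

Theorem 3.1 can be read as a statement that, with enough width and samples, the actual training dynamics stay close to this kind of direction-wise geometric decay for the leading eigendirections.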
3.2 Spectral analysis of Neural Tangent Kernel for Uniform Distribution
After presenting a general theorem (without assumptions on the data distribution) in the previous subsection, we now study the case when the data inputs are uniformly distributed over the unit sphere. We present our spectral analysis of the neural tangent kernel, which extends Proposition 5 in Bietti and Mairal (2019). We give the Mercer decomposition of the neural tangent kernel in the two-layer setting, together with explicit expressions for the eigenvalues and their orders in both cases when and . For any , we have the Mercer decomposition of the neural tangent kernel ,
where for are linearly independent spherical harmonics of degree in variables with and orders of are given by
where . More specifically, we have when and when , .
In the above theorem, the coefficients are actually different eigenvalues of the integral operator on defined by
where is the uniform probability measure on unit sphere . Therefore the in Theorem 3.1 is just given in Theorem 3.2 when is uniform distribution. Vempala and Wilmes (2019) studied two-layer neural networks with sigmoid activation function, and established guarantees to achieve error with iteration complexity under the over-parameterization condition , where is the target function, and is certain function approximation error. Another highly related work is Bietti and Mairal (2019), which gives . The order of eigenvalues we present appears as . This is better when , which is closer to the practical setting.
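The link between the integral operator and a finite sample can be explored numerically: for points drawn uniformly from the sphere, the eigenvalues of the Gram matrix divided by the sample size approximate the operator's eigenvalues, repeated according to their multiplicities. The sketch below does this on the two-dimensional sphere in three-dimensional space, where the spaces of spherical harmonics of degree 0, 1 and 2 have dimensions 1, 3 and 5. The kernel normalization (Cho and Saul, 2009), the sample size, and the choice of dimension are assumptions for illustration and do not reproduce the constants of Theorem 3.2.

```python
import numpy as np

def ntk(t):
    """Two-layer ReLU NTK on unit vectors as a function of t = <x, x'>.
    Normalization is an ASSUMPTION (arc-cosine kernels as in Cho and Saul, 2009)."""
    t = np.clip(t, -1.0, 1.0)
    k0 = (np.pi - np.arccos(t)) / np.pi
    k1 = (t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t * t)) / np.pi
    return t * k0 + k1

rng = np.random.default_rng(0)
n = 1000

# Uniform samples on the unit sphere in R^3: normalize standard Gaussian vectors.
X = rng.normal(size=(n, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Eigenvalues of the scaled Gram matrix, sorted in descending order.
eigs = np.linalg.eigvalsh(ntk(X @ X.T) / n)[::-1]

# Group the leading eigenvalues by the expected multiplicities 1, 3, 5.
for degree, (lo, hi) in enumerate([(0, 1), (1, 4), (4, 9)]):
    print(f"degree {degree}: {eigs[lo:hi].round(3)}")
```

The printed groups should appear as clusters of nearly equal values that decrease with the degree, which is the finite-sample counterpart of lower frequencies having larger eigenvalues.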
3.3 Explicit Convergence Rate for Uniformly Distributed Data
In this subsection, we combine our results in the previous two subsections and give an explicit convergence rate for uniformly distributed data on the unit sphere. Note that the first distinct eigenvalues of NTK have spherical harmonics up to degree as eigenfunctions. Suppose that , and the sample follows the uniform distribution on the unit sphere . For any and integer , if , , then with probability at least , Algorithm 1 with , satisfies
where and with being a set of orthonormal spherical harmonics of degrees up to .
Suppose that , and the sample follows the uniform distribution on the unit sphere . For any and integer , if , , then with probability at least , Algorithm 1 with , satisfies
where and with being a set of orthonormal spherical harmonics of degrees up to . Corollaries 3.3 and 3.3 further illustrate the spectral bias of neural networks by providing exact calculations of , and in Theorem 3.1. They show that if the input distribution is uniform over the unit sphere, then spherical harmonics with lower degrees are learned first by over-parameterized neural networks.
Corollaries 3.3 and 3.3 show that the conditions on and depend exponentially on either or . We would like to emphasize that such exponential dependency is reasonable and unavoidable. Take the setting as an example. The exponential dependency in is a natural consequence of the fact that in high-dimensional space, there are a large number of linearly independent polynomials even for very low degrees. It is only reasonable to expect to learn fewer than independent components of the true function, which means that it is unavoidable to assume
Similar arguments can apply to the requirement of and the setting.
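To make the counting argument concrete, the sketch below computes the number of linearly independent spherical harmonics of a given degree using the standard dimension formula for harmonic homogeneous polynomials. The convention that the sphere has dimension `d` and sits in `d + 1` variables, and the function name `num_harmonics`, are assumptions made for this illustration.

```python
from math import comb

def num_harmonics(d, k):
    """Number of linearly independent spherical harmonics of degree k on the
    d-dimensional unit sphere (d + 1 variables):
        C(k + d, d) - C(k + d - 2, d),
    where the second term is taken to be zero when k + d - 2 < 0.
    """
    lower = comb(k + d - 2, d) if k + d - 2 >= 0 else 0
    return comb(k + d, d) - lower

# Growth of the count with the dimension, for degrees 0 through 4.
for d in (2, 9, 99):
    print(d, [num_harmonics(d, k) for k in range(5)])
```

For a fixed degree the count grows polynomially in the dimension with exponent equal to the degree, and for a fixed dimension it grows polynomially in the degree, which is the source of the exponential dependency discussed above.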
In this section we illustrate our results by training neural networks on synthetic data. Across all tasks, we train a two-layer neural network with 4096 hidden neurons and initialize it exactly as defined in the setup. The optimization method is vanilla full gradient descent. We sample 1000 training points uniformly from the unit sphere in .
4.1 Learning combination of spherical harmonics
First, we show a result when the target function is exactly a linear combination of spherical harmonics. The target function is explicitly defined as
where is the Gegenbauer polynomial, and , are fixed vectors that are independently generated from the uniform distribution on the unit sphere in in our experiments. Note that according to the addition formula , every normalized Gegenbauer polynomial is a spherical harmonic, so is a linear combination of spherical harmonics of order 1, 2 and 4. The higher odd-order Gegenbauer polynomials are omitted because the spectral analysis showed that for .
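A target of this type can be generated with a few lines of NumPy. The sketch below evaluates normalized Gegenbauer polynomials with the standard three-term recurrence and combines degrees 1, 2 and 4 along random unit directions. The input dimension `dim = 10`, the Gegenbauer parameter `alpha = (dim - 2) / 2`, and the coefficients `(1.0, 2.0, 4.0)` are placeholders chosen for illustration; they are not taken from the experiments reported here.

```python
import numpy as np

def normalized_gegenbauer(k, alpha, t):
    """P_k(t) = C_k^{(alpha)}(t) / C_k^{(alpha)}(1) for alpha > 0, using the recurrence
    n * C_n = 2 (n + alpha - 1) t C_{n-1} - (n + 2 alpha - 2) C_{n-2}."""
    t = np.asarray(t, dtype=float)
    prev, curr = np.ones_like(t), 2.0 * alpha * t   # C_0 and C_1
    prev_at_1, curr_at_1 = 1.0, 2.0 * alpha         # the same polynomials evaluated at t = 1
    if k == 0:
        return prev
    for n in range(2, k + 1):
        a, b = 2.0 * (n + alpha - 1.0), n + 2.0 * alpha - 2.0
        prev, curr = curr, (a * t * curr - b * prev) / n
        prev_at_1, curr_at_1 = curr_at_1, (a * curr_at_1 - b * prev_at_1) / n
    return curr / curr_at_1

def unit_rows(Z):
    """Normalize each row to unit Euclidean norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def target(X, zetas, coeffs, degrees, alpha):
    """Linear combination of normalized Gegenbauer polynomials of <zeta_k, x>."""
    return sum(a * normalized_gegenbauer(k, alpha, X @ z)
               for a, k, z in zip(coeffs, degrees, zetas))

rng = np.random.default_rng(0)
dim = 10                                   # hypothetical input dimension
alpha = (dim - 2) / 2.0                    # Gegenbauer parameter for the sphere in R^dim
X = unit_rows(rng.normal(size=(1000, dim)))
zetas = unit_rows(rng.normal(size=(3, dim)))

# Placeholder coefficients, larger for higher degrees.
y = target(X, zetas, coeffs=(1.0, 2.0, 4.0), degrees=(1, 2, 4), alpha=alpha)
print(y.shape)
```

For `alpha = 0.5` the normalized polynomials reduce to the classical Legendre polynomials, which gives a quick way to check the recurrence against known closed forms.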
where and is the neural network function.
Here is the projection length onto an approximate vector. In the function space, we can also project the residual function onto the orthonormal Gegenbauer functions . Replacing the training data with randomly sampled data points can lead to a random estimate of the projection length in function space. We provide the corresponding result for freshly sampled points in Appendix F.1.
The results are shown in Figure 1. It can be seen that at the beginning of training, the residual at the lowest frequency () converges to zero first, followed by the second lowest (). The highest frequency component is the last one to converge. Following the setting of Rahaman et al. (2019), we assign high frequencies a larger scale, expecting that a larger scale will lead to a faster descent. Still, the low frequencies are regressed first.
4.2 Learning functions of simple form
Apart from the synthesized low frequency function, we also show the dynamics of the projections of ordinary functions onto . These functions, though simple in form, have non-zero components in almost all frequencies. In this subsection we further show that our results still apply when all frequencies exist in the target function, which is given by or , where is a fixed unit vector. The coefficients of given components are calculated in the same way as in Section 4.1.
Figure 2 shows that even for arbitrarily chosen functions of simple form, the networks still learn the low frequency components of the target function first. Notice that early in training not all the curves descend; we believe this is due to the influence of the unseen components on the gradient. Again, as the training proceeds, the convergence is controlled at the predicted rate.
The reason why we only use the cosine function and an even polynomial is that the only odd basis function with non-zero eigenvalue is . To show a general tendency it is better to restrict the target function to the space of even functions.
5 Conclusion and Discussion
In this paper, we give a theoretical justification for spectral bias through a detailed analysis of the convergence behavior of two-layer neural networks with ReLU activation function. We show that the convergence of gradient descent in different directions depends on the corresponding eigenvalues and essentially exhibits different convergence rates. We give the Mercer decomposition of the neural tangent kernel and the explicit order of the eigenvalues of the integral operator with respect to the neural tangent kernel when the data is uniformly distributed on the unit sphere . Combined with the convergence analysis, we give the exact order of the convergence rate along different directions. We also conduct experiments on synthetic data to support our theoretical results.
So far, we have considered the upper bound for convergence with respect to low frequency components and presented a comprehensive theorem to explain the spectral bias. One desired improvement is to give a lower bound for convergence with respect to high frequency components, which is essential to establish a tighter characterization of spectral-biased optimization. It is also interesting to extend our result to other training algorithms like Adam, where the analysis in Wu et al. (2019); Zhou et al. (2018) might be applied with a more careful quantification of the projection of the residual along different directions. Another potential improvement is to generalize the result to multi-layer neural networks, which might require different techniques since our analysis relies heavily on exactly computing the eigenvalues of the neural tangent kernel. It is also an important direction to weaken the requirement on over-parameterization, or to study the spectral bias in a non-NTK regime to further close the gap between theory and practice.
Appendix A Review on spherical harmonics
In this section, we give a brief review of relevant concepts in spherical harmonics. For more details, see Bach (2017), Bietti and Mairal (2019), Frye and Efthimiou (2012) and Atkinson and Han (2012).
We consider the unit sphere , whose surface area is given by and denote the uniform measure on the sphere. For any , we consider a set of spherical harmonics
They form an orthonormal basis and satisfy the following equation . Moreover, since they are homogeneous functions of degree , it is clear that has the same parity as .
We have the addition formula
where is the Legendre polynomial of degree in dimensions, explicitly given by (Rodrigues’ formula)
We can also see that , the Legendre polynomial of degree shares the same parity with . By the orthogonality and the addition formula (6) we have,
Further we have the recurrence relation for the Legendre polynomials,
for and for .
The Hecke-Funk formula is given for a spherical harmonic of degree
Appendix B Proof of Main Theorems
B.1 Proof of Theorem 3.1
In this section we give the proof of Theorem 3.1. The core idea of our proof is to establish connections between neural network gradients throughout training and the neural tangent kernel. To do so, we first introduce the following definitions and notations.
Define , . Let , be the eigenvalues of , and be the corresponding eigenvectors. Set , . For notational simplicity, we denote , , .
The following lemma is partly summarized from the proof of equation (44) in Su and Yang (2019). Its purpose is to further connect the eigenfunctions of NTK with their finite-width, finite-sample counterparts.
Suppose that for all . There exist absolute constants , such that for any and integer with , if , then with probability at least ,
The following two lemmas give some preliminary bounds on the function value and gradients of the neural network around random initialization. They are proved in Cao and Gu (2019a).
[Cao and Gu (2019a)] For any , if for a large enough absolute constant , then with probability at least , for all .
[Cao and Gu (2019a)] There exists an absolute constant such that, with probability at least , for all , and with , it holds uniformly that
The following lemma is the key to characterizing the dynamics of the residual throughout training. The bounds in Lemma B.1 are the ones that distinguish our analysis from previous works on neural network training in the NTK regime (Du et al., 2019b; Su and Yang, 2019), since our analysis provides a more careful characterization of the residual along different directions.
Suppose that the iterates of gradient descent are inside the ball . If and , then with probability at least ,
for all .
Now we are ready to prove Theorem 3.1.