1 Introduction
Deep learning has achieved remarkable performance in many applications, such as computer vision, natural language processing, and scientific computing. One of the key reasons behind these successes is that neural networks can efficiently approximate certain high-dimensional functions without the curse of dimensionality (CoD). Much effort has been devoted to establishing the mathematical foundations of this observation, for shallow neural networks [4, 5, 2, 35, 16, 41] and deep neural networks [36, 13, 17, 19, 23, 44, 27], to name just a few. However, the understanding is still far from complete even for two-layer neural networks.
The most important progress in understanding two-layer neural networks was made in the seminal work [4]. Specifically, [4] constructs a function class using the Fourier transform. For target functions in this class, it is shown that the approximation error of two-layer neural networks obeys the Monte Carlo rate, which is independent of the input dimension. On the other hand, the approximation error of any linear method suffers from CoD. Together, these results establish the first quantitative separation between neural networks and linear methods, thereby explaining the superiority of neural networks. Along this line, [15, 28, 2, 16, 41] identify more function classes that separate the two methods.
Despite these advances, we still lack an explicit understanding of the separation, since all the hardness results are based on worst-case analysis. In addition, it is also unclear how the hardness of linear approximation is affected by other important factors, including the activation function and the norms of the inner-layer weights. Recently, [16] numerically showed that approximating a single neuron with random features [37] suffers from CoD, and a theoretical explanation was provided in [48]. Specifically, [48] proves that approximating a single neuron requires the number of random features (or the magnitude of the coefficients) to be exponential in the input dimension. However, this result is still unsatisfying in two respects. First, the restriction on the coefficient magnitudes means that we cannot truly claim the hardness of approximation. Second, it is still, in a way, a worst-case result, since the bias term needs to be chosen adversarially.
Our contributions
In this paper, we advance our understanding of the above questions by providing a comprehensive analysis of the linear approximability of two-layer neural networks, quantified by the Kolmogorov width [31]. To this end, we develop a spectral-based approach, which reduces the study of linear approximability to estimating the eigenvalue decay of an associated kernel. Specifically, our contributions are summarized as follows.

We first consider the simplest case: a single neuron without a bias term. We show that the linear approximability is controlled by the eigenvalue decay of the associated kernel. Then, we proceed to the specific case where both the input and the weight lie on the unit sphere. In this case, the eigenvalues admit an integral representation, which allows us to derive explicit estimates and perform numerical computations. Specifically, for a class of nonsmooth activation functions, including the ReLU and the Heaviside step function, we prove that the linear approximation suffers from CoD. In contrast, for smooth activations, the linear approximation is free of CoD.

Then, the above results are applied to analyze two-layer neural networks. By exploiting the rotational invariance, we prove that the Kolmogorov width is also controlled by the eigenvalue decay of the associated kernel. Suppose that the norm of the inner-layer weights of each neuron is bounded. We show that when this bound is sufficiently small, networks with smooth activation functions can be efficiently approximated by spherical harmonics, which are the eigenfunctions of the associated kernel, without CoD. However, when the bound is large, linear approximation cannot avoid CoD. As a comparison, for nonsmooth activation functions such as ReLU, restricting the weight norm does not change the expressiveness.
As a direct consequence, we also show that, for any bias term, the linear approximation of the single neuron with random features suffers from CoD for the activation functions considered. This successfully removes all the restrictions of [48] (see the related-work section below for more details). Hence, we can truly claim the separation between neural networks and random feature models for this explicit target function.
1.1 Related work
The learning of a single neuron has been widely studied and plays an important role in understanding neural networks, e.g., the training dynamics [49, 21], the hardness of learning [40, 26], and the expressiveness of neural networks [22, 16, 49]. In this paper, we provide a full characterization of the linear approximation of single neurons.
As mentioned before, our results on single neurons are closely related to [48], which also shows the hardness of approximating single neurons with random features. Specifically, the authors prove that there exist constants such that, if the number of features and the coefficients satisfy certain bounds, then there exists a bias such that the error of approximating the neuron with random features is no less than a dimension-dependent threshold. The differences from our work are as follows. First, in [48], the bias needs to be chosen adversarially, while our results hold without any adversarial choice of the bias. Second, we do not have any restriction on the coefficients. Removing this restriction is critical; otherwise, we cannot truly claim that the linear approximation suffers from CoD. Third, our results work for more general activation functions, whereas [48] only considers the ReLU activation.
Our work is also related to [22], which shows that random feature models effectively fit polynomials. Specifically, for a single-neuron target function under the standard normal input distribution, [22] implies that the approximation error with random features is roughly controlled by the projection of the target orthogonal to the subspace spanned by low-degree polynomials. However, no explicit estimate is provided, and it is therefore unclear how the approximability depends on the input dimension. Moreover, the analysis of [22] is limited to specific forms of features. By contrast, our analysis is applicable to very general features.
[30] proved that two-layer neural networks activated by the sigmoid function can be efficiently approximated by polynomials if the norms of the inner-layer weights are bounded by a constant independent of the input dimension. They then use this expressiveness result to show that these networks can be learned in polynomial time (see also [50]). In this paper, we extend the preceding result to more general smooth activations, which include all the commonly used ones. Moreover, we prove that when the norms of the inner-layer weights are larger than a low-order polynomial of the dimension, there do not exist fixed features for which linear approximation can avoid CoD. These improvements over [30] benefit from the spectral-based approach developed in our study, which is quite different from the techniques used in [30].
Lastly, we remark that the eigenvalue decay of the kernel also plays an important role in analyzing the corresponding random feature model [8, 3, 33]. Recently, it has been extensively explored to understand neural networks in the kernel regime [14, 47, 25, 45, 9, 7, 6, 38]. In contrast to these works, our study reveals that the eigenvalue decay of this kernel is also related to the separation between neural networks and linear methods, which is essentially a property beyond the kernel regime.
1.2 Notation.
Let denote the unit sphere, its surface area, and the uniform distribution over it, respectively. We use if there exist absolute constants such that ; means for an absolute constant , and is defined analogously. We use to denote the set of probability measures over . For any probability measure , we use and to denote the inner product and norm, respectively.
2 Preliminaries
In this paper, we consider the two-layer neural network: where denotes all the parameters, denotes the network width, and is a nonlinear activation function. Let be a set of two-layer neural networks (to be determined later). The linear approximability of this set is described by the Kolmogorov width [31]:
(1) 
where denotes the input distribution. We are interested in how the Kolmogorov width, and thereby the linear approximability, is affected by the input dimension, the choice of the activation function, and the norm of the inner-layer weights.
Legendre polynomials and spherical harmonics.
This paper mainly focuses on the case where the input lies on the unit sphere. Hence, we need to prepare some basic techniques for analyzing functions on the sphere. Denote by the associated Legendre polynomials of degree in dimensions [1]. Rodrigues' formula is given by
(2) 
The polynomial is even (resp. odd) when its degree is even (resp. odd). Moreover,
(3) 
where denotes the multiplicity of spherical harmonics of degree on .
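For concreteness, the Legendre polynomials in d dimensions can be evaluated numerically through Gegenbauer polynomials. The sketch below (for d >= 3) assumes the standard normalization P_{k,d}(1) = 1 and checks the parity and orthogonality properties stated above; the helper name `legendre_d` is ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_gegenbauer

def legendre_d(k, d, t):
    """P_{k,d}(t): degree-k Legendre polynomial in d >= 3 dimensions,
    i.e., the Gegenbauer polynomial C_k^{(d-2)/2} normalized so P_{k,d}(1) = 1."""
    lam = (d - 2) / 2
    return eval_gegenbauer(k, lam, t) / eval_gegenbauer(k, lam, 1.0)

# Parity: P_{k,d} is even/odd according to the parity of k.
t = 0.37
assert np.isclose(legendre_d(4, 6, -t), legendre_d(4, 6, t))
assert np.isclose(legendre_d(5, 6, -t), -legendre_d(5, 6, t))

# Orthogonality on [-1, 1] with weight (1 - t^2)^{(d-3)/2}.
d = 5
w = lambda s: (1 - s**2) ** ((d - 3) / 2)
ip, _ = quad(lambda s: legendre_d(2, d, s) * legendre_d(4, d, s) * w(s), -1, 1)
print(abs(ip))  # ~0: distinct degrees are orthogonal
```
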
Let denote the th spherical harmonic of degree . Then and , and the spherical harmonics form an orthonormal basis of . They are related to the Legendre polynomials:
(4) 
For any and , the Hecke-Funk formula is given by
(5) 
We refer to [39] for more details about the analysis of functions on the sphere.
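The Hecke-Funk formula can be checked numerically for d = 3: integrating f of an inner product against a degree-1 spherical harmonic reproduces that harmonic, scaled by a one-dimensional integral of f against the Legendre polynomial. A sketch under our own normalization (uniform probability measure on the sphere), with f = ReLU; for d = 3 the weight is 1 and P_{1,3}(t) = t.

```python
import numpy as np
from scipy.integrate import dblquad, quad

relu = lambda t: np.maximum(t, 0.0)

# Hecke-Funk eigenvalue for d = 3, degree k = 1 (P_{1,3}(t) = t):
lam1, _ = quad(lambda t: relu(t) * t, -1, 1)
lam1 *= 0.5                        # ratio |S^1| / |S^2| = 2*pi / (4*pi)

# Left-hand side at a fixed x on S^2, with Y(xi) = xi_1 (degree-1 harmonic):
x = np.array([0.6, 0.8, 0.0])
def integrand(phi, theta):
    xi = np.array([np.cos(theta),
                   np.sin(theta) * np.cos(phi),
                   np.sin(theta) * np.sin(phi)])
    # uniform probability measure: sin(theta) dtheta dphi / (4*pi)
    return relu(x @ xi) * xi[0] * np.sin(theta) / (4 * np.pi)
lhs, _ = dblquad(integrand, 0, np.pi, 0, 2 * np.pi)

print(lhs, lam1 * x[0])  # the two sides agree
```

Only the degree-1 component of ReLU survives the integration against Y, so the identity is exact rather than an approximation.
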
2.1 Linear approximation of general parametric functions
Consider a general parametric function with and . The single neuron corresponds to . Let and define
(6) 
Assume that is continuous on for any and that is compact. Then, Mercer's theorem implies the decomposition: Here, are the eigenvalues in nonincreasing order, and are the corresponding eigenfunctions, which satisfy . The trace of satisfies . In particular, we are interested in the following quantity
(7) 
For any , and , we have
(8) 
The equality is reached by the optimal fixed features: for .
Proof.
Eq. (8) provides a lower bound on the average approximation error. This can be converted to a worst-case result as follows
(10) 
To obtain a sharp lower bound, one can choose the distribution such that the eigenvalues decay as slowly as possible. In the remainder of this paper, we omit the subscript of for simplicity, since the input distribution is henceforth fixed.
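The content of Proposition 2.1 has a transparent finite-dimensional analogue: averaged over the parameter distribution, the best m fixed features span the top-m eigenspace of the empirical kernel, and the residual is exactly the tail eigenvalue sum. A discretized sketch with a toy family of our own choosing, relu(a*x + b) on a one-dimensional grid:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, m = 300, 80, 5                      # parameter samples, grid size, #features
xs = np.linspace(-1.0, 1.0, D)
a, b = rng.normal(size=(2, n))

# Rows: the family phi(x; a_i, b_i) = relu(a_i * x + b_i) sampled on the grid.
F = np.maximum(a[:, None] * xs[None, :] + b[:, None], 0.0)

# Empirical kernel k(x, x') = E_theta[phi(x; theta) phi(x'; theta)], eigenvalues.
K = F.T @ F / n
eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]

# Best m fixed features = top-m right singular vectors (Eckart-Young); the
# average squared residual equals the tail eigenvalue sum of the kernel.
_, _, Vt = np.linalg.svd(F, full_matrices=False)
P = Vt[:m].T @ Vt[:m]                     # projector onto the optimal subspace
avg_err = np.mean(np.sum((F - F @ P) ** 2, axis=1))
tail = eigvals[m:].sum()
print(avg_err, tail)  # equal up to floating point
```
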
3 Linear approximability of single neurons
We first consider the single neuron: . The results established in this section will be utilized later to analyze two-layer neural networks.
According to Proposition 2.1, what remains is to estimate the eigenvalues of the kernel. Assume that the input is uniformly distributed on the sphere. By rotational invariance, the kernel can be written in dot-product form: with . Following [43], its spectral decomposition is given by:
(11) 
where is the eigenvalue and the spherical harmonics are the corresponding eigenfunctions: . Note that are the eigenvalues counted with multiplicity, while are the eigenvalues counted without multiplicity. We refer to [39, 43] for more details about the spectral decomposition of a dot-product kernel on the sphere. The following integral representation allows both explicit estimation and numerical computation of the eigenvalues.
For the single neuron, we have with
(12) 
Proof.
3.1 Nonsmooth activations
We consider a family of nonsmooth activation functions: with . The ReLU and step function correspond to and , respectively. The case of also has many applications in scientific computing [46, 42, 29]. In particular, for , [12] shows
(13) 
Let with . There exists a constant that depends polynomially on the dimension such that In particular, .
The proof is deferred to Appendix A.1. Figure 1 shows how decays with for various 's and . In this case, the total trace is . Therefore, the eigenvalues are computed by integrating Eq. (12) numerically. It is clear that the decay suffers from CoD, and the explicit rate provided in Proposition 3.1 aligns very well with the ground truth.
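The qualitative gap between nonsmooth and smooth activations can already be seen in the one-dimensional integrals of Eq. (12). A sketch for d = 3, where the weight is 1 and the P_{k,3} are the classical Legendre polynomials, comparing the Heaviside step, ReLU, and arctan; we check only the qualitative decay, not the exact rates.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

def coeff(f, k):
    """Hecke-Funk-type coefficient (up to normalization) in d = 3:
    integral of f(t) * P_k(t) over [-1, 1]."""
    val, _ = quad(lambda t: f(t) * eval_legendre(k, t), -1, 1)
    return val

step = lambda t: float(t > 0)
relu = lambda t: max(t, 0.0)
arctan = np.arctan

# ReLU = |t|/2 + t/2: its odd-degree coefficients vanish for k >= 3.
print(coeff(relu, 3), coeff(relu, 5))        # both ~0

# The step function's odd coefficients decay only polynomially ...
step_ratio = abs(coeff(step, 9)) / abs(coeff(step, 1))
# ... while arctan (analytic in a neighborhood of [-1,1]) decays much faster.
arctan_ratio = abs(coeff(arctan, 9)) / abs(coeff(arctan, 1))
print(step_ratio, arctan_ratio)
```
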
Let for . Then, there exists a constant that depends polynomially on the dimension such that the following statements hold.

For any fixed features , we have
(14) 
Consider the random feature: . We assume the features are rotationally invariant, i.e., for any and any orthogonal matrix , and is sampled from a rotation-invariant distribution . Let . Then, for any ,
(15)
Proof.
Eq. (14) follows from a simple combination of Proposition 3.1 and Proposition 2.1. To prove Eq. (15), we need to exploit the rotational invariance of the random features. Let . For any , let be an orthogonal matrix such that . Then,
where and follow from the rotational invariance of and , respectively. Therefore, is constant with respect to . By Proposition 2.1, we have
Then, applying Proposition 3.1 completes the proof. ∎
Theorem 3.1 shows that the linear approximation of single neurons with nonsmooth activation functions suffers from CoD. In particular, (15) shows that the lower bound holds for any bias if the features are rotationally invariant. A typical form of rotation-invariant random features is with . This includes the features emerging in analyzing neural networks and kernel predictors with dot-product kernels [43], among others. In contrast with [48, Theorem 4.1], we successfully remove the restriction on the coefficient magnitudes and do not need to choose the bias term adversarially.
3.2 Smooth activations
Now, we turn to smooth activation functions, such as sigmoid, softplus, and GELU/SiLU [24, 20], which are also widely used in practice.
Assume that . Then,
(16) 
Proof.
Assume that . All the popular smooth activation functions satisfy the above assumption as shown below.

For and , .

Consider the sigmoid function: , which can be viewed as a complex function . The singular points of are . For any , let . Then, all the singular points must be outside the curve . Using Cauchy’s integral formula, for any , we have
(17) 
For all the other commonly used smooth activation functions, similar estimates of the higher-order derivatives can be obtained using Cauchy's integral formula.
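These Cauchy-type derivative bounds can be verified numerically. The sigmoid satisfies the recurrence sigma' = sigma * (1 - sigma), so each derivative is a polynomial in sigma, and the Cauchy estimate |sigma^(k)(0)| <= k! * M / r^k holds for any radius r below the distance pi to the nearest singularities at +/- i*pi. A sketch with r = 3 and the crude circle bound M = 8; both constants are our choices.

```python
import math

# sigma' = sigma * (1 - sigma): represent the k-th derivative as a polynomial
# P_k(sigma), coefficients in ascending order; P_0(x) = x represents sigma itself.
def next_poly(p):
    dp = [i * c for i, c in enumerate(p)][1:]          # differentiate P
    out = [0.0] * (len(dp) + 2)                        # multiply P' by (x - x^2)
    for i, c in enumerate(dp):
        out[i + 1] += c
        out[i + 2] -= c
    return out

def eval_poly(p, x):
    return sum(c * x**i for i, c in enumerate(p))

p = [0.0, 1.0]                                         # P_0(x) = x
derivs = []
for k in range(1, 13):
    p = next_poly(p)
    derivs.append(eval_poly(p, 0.5))                   # sigma^(k)(0), sigma(0)=1/2

r, M = 3.0, 8.0      # r < pi; M bounds |sigma| on the circle |z| = 3
bounds = [M * math.factorial(k) / r**k for k in range(1, 13)]
for d_k, b_k in zip(derivs, bounds):
    assert abs(d_k) <= b_k
print(derivs[:4])  # sigma'(0) = 1/4, sigma''(0) = 0, ...
```
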
Under Assumption 3.2, we have: The proof is deferred to Appendix A.2. We remark that the above estimate is rather rough for most smooth activation functions, where the quantity is actually much smaller, as demonstrated in Eq. (3.2). However, it is sufficient to show that the linear approximation is free of CoD. A simple combination of Proposition 3.2 and Proposition 2.1 gives
(18) 
where are the leading spherical harmonics. Comparing with Theorem 3.1, we see that smooth activations behave very differently from nonsmooth ones. Moreover, by exploiting the rotational invariance, we can obtain the following stronger result. Let be the leading spherical harmonics. Assume that there exists a nonincreasing function such that and . Then, for any ,
(19) 
The proof can be found in Appendix A.3. In fact, one can choose to obtain a tight bound. The introduction of facilitates the explicit calculation of , since the exact rate of is rarely available. When , . For the nonsmooth activation functions considered in Section 3.1, taking yields . For smooth activation functions that satisfy Assumption 3.2, we can take , for which . Hence, we conclude that spherical harmonics can approximate all single neurons uniformly well. In particular, for smooth activation functions, the approximation is free of CoD.
4 Expressiveness of two-layer neural networks
To analyze the expressiveness of the two-layer neural network, we define
(20) 
where . For any , let
(21) 
Obviously, every finite-width neural network belongs to . We use to denote the unit ball of . Similar function spaces have been widely studied and play a fundamental role in the theoretical analysis of two-layer neural networks [2, 16, 17, 10, 11, 32, 18]. Our definition is slightly different from the ones in [2, 17], which are specific to ReLU.
For any , define . Let denote the trace decay of defined according to Eq. (7).
Let and . Then,
Proof.
In the proof, the key ingredient is the uniform approximability of single neurons established in Proposition 3.2. Proposition 4 implies that the trace decay provides a tight characterization of the linear approximability of two-layer neural networks. Next, we study how the linear approximability is affected by the weight-norm bound for different activation functions.
4.1 Influence of the norms of inner-layer weights
For ReLU, the function classes (up to rescaling) are obviously the same for different norm bounds because of homogeneity. In particular, Theorem 3.1 implies
(22) 
where depends polynomially on the dimension. Hence, for any norm bound, the linear approximation of two-layer neural networks activated by ReLU suffers from CoD.
However, for general activation functions, it is unclear whether restricting the weight norm affects the linear approximability. For example, is there a significant difference between and ? Suppose that satisfies Assumption 3.2 and . Then: The equality is attained by choosing the spherical harmonics as the basis functions.
This theorem follows from a simple combination of Proposition 3.2 and Proposition 4. It shows that two-layer networks activated by smooth functions are no more expressive than polynomials if the norms of the inner-layer weights are restricted to be too small. This is clearly different from the ReLU case, where the expressiveness is independent of the norm bound.
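This contrast can be observed directly in one dimension: with a small inner weight, a smooth activation is essentially a low-degree polynomial on the relevant range, while with a large weight it is not. A sketch fitting degree-5 polynomials to scaled sigmoids; the scales 0.5 and 20 are illustrative choices of ours.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
xs = np.linspace(-1.0, 1.0, 2001)

def best_poly_err(R, deg=5):
    """RMS error of a degree-`deg` least-squares polynomial fit
    to sigmoid(R * x) on [-1, 1]."""
    y = sigmoid(R * xs)
    c = np.polynomial.Chebyshev.fit(xs, y, deg)
    return np.sqrt(np.mean((y - c(xs)) ** 2))

err_small, err_large = best_poly_err(0.5), best_poly_err(20.0)
print(err_small, err_large)  # the small-scale neuron is nearly polynomial
```
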
For the specific arctangent activation function, we have a refined characterization of how the linear approximability depends on the weight norm. Assume . We have
(23) 
The equality is attained by choosing the spherical harmonics as the basis functions. The proof is deferred to Appendix B.1. The idea is to estimate the integral (16) in the Fourier domain using Parseval's theorem. For the arctangent function, the Fourier transform has an explicit formula, which facilitates the refined estimate of the dependence on the weight norm.
The preceding proposition implies that the error rate decreases with but is independent of if . We conjecture that similar results hold for general smooth activation functions, and numerical support is provided in Figure 2. Specifically, we examine four activation functions: two sigmoid-like activations, Arctan and Sigmoid, and two ReLU-like activations, SiLU and Softplus. According to Proposition 4, is a good proxy of the Kolmogorov width. The eigenvalues are numerically computed using Eq. (12). In the experiments, we find that for all the activation functions examined. Figure 2 shows that in all cases, the rate is independent of for a fixed , and decreases with for a fixed . This is consistent with Eq. (23), which is only proved for Arctan.
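The numerics behind Figure 2 can be reproduced in miniature via the one-dimensional integrals: for arctan with inner scale R, the complex singularities sit at +/- i/R, so increasing R slows the coefficient decay. A sketch for d = 3 (where the weight in the integral is 1); the `coeff` helper below is simply Eq. (12)-style numerical quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

def coeff(R, k):
    """Legendre coefficient of arctan(R * t) on [-1, 1] (the d = 3 case)."""
    val, _ = quad(lambda t: np.arctan(R * t) * eval_legendre(k, t), -1, 1)
    return val

for R in (1.0, 4.0, 16.0):
    decay = abs(coeff(R, 9)) / abs(coeff(R, 1))
    print(f"R = {R:5.1f}   |a_9| / |a_1| = {decay:.2e}")
# larger inner-layer scale R => slower coefficient decay
```

Since arctan is odd, the even-degree coefficients vanish identically, mirroring the parity structure of the Legendre expansion.
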




Note that a result similar to Proposition 4.1 has been provided in [30] for the sigmoid activation function. Ours differs from [30] in two respects. First, we show that the same observation holds for more general smooth activation functions. In particular, Proposition 4 combined with Lemma 3 provides an easy way to numerically compute upper bounds. Second, we can establish a hardness result, given below. These improvements benefit from our spectral-based analysis, and these results cannot be obtained using the techniques in [30].
Suppose that the activation function satisfies either or for any and some constant . This assumption is satisfied by all commonly used activation functions.
Suppose that satisfies Assumption 4.1 and . Then, there exists a constant such that if , we have
Proof.
We only present the proof for the sigmoid-like activation functions; the proof for ReLU-like ones is similar and can be found in Appendix B.2. For any ,