
Linear approximability of two-layer neural networks: A comprehensive analysis based on spectral decay

08/10/2021
by   Jihao Long, et al.

In this paper, we present a spectral-based approach to study the linear approximation of two-layer neural networks. We first consider the case of a single neuron and show that the linear approximability, quantified by the Kolmogorov width, is controlled by the eigenvalue decay of an associated kernel. Then, we show that similar results also hold for two-layer neural networks. This spectral-based approach allows us to obtain upper bounds, lower bounds, and explicit hard examples in a unified manner. In particular, these bounds imply that for networks activated by smooth functions, restricting the norms of inner-layer weights may significantly impair the expressiveness. By contrast, for non-smooth activation functions, such as ReLU, the network expressiveness is independent of the inner-layer weight norms. In addition, we prove that for a family of non-smooth activation functions, including ReLU, approximating any single neuron with random features suffers from the curse of dimensionality. This provides an explicit separation of expressiveness between neural networks and random feature models.




1 Introduction

Deep learning has achieved remarkable performance in many applications, such as computer vision, natural language processing, and scientific computing. One of the key reasons behind these successes is that neural networks can efficiently approximate certain high-dimensional functions without the curse of dimensionality (CoD). Much effort has been made to establish the mathematical foundations behind this observation, for shallow neural networks [4, 5, 2, 35, 16, 41] and deep neural networks [36, 13, 17, 19, 23, 44, 27], to name a few. However, the understanding is still far from complete, even for two-layer neural networks.

The most important progress in understanding two-layer neural networks was made by the seminal work of [4]. Specifically, [4] constructs a function class using the Fourier transform. For target functions in this class, it is shown that the error rate of approximation with two-layer neural networks obeys the Monte-Carlo estimate, which is independent of the input dimension d. On the other hand, the approximation error of any linear method suffers from CoD. Together, these results establish the first quantitative separation between neural networks and linear methods, thereby explaining the superiority of neural networks. Along this line, [15, 28, 2, 16, 41] identify more function classes that separate the two methods.

Despite these advances, we still lack an explicit understanding of this separation, since all the hardness results are based on worst-case analyses. In addition, it is also unclear how the hardness of linear approximation is affected by other important factors, including the activation function and the norms of the inner-layer weights. Recently, [16] numerically showed that approximating a single neuron with random features [37] suffers from CoD, and a theoretical explanation was provided in [48]. Specifically, [48] proves that approximating a single neuron requires the number of random features (or the magnitude of the coefficients) to be exponential in d. However, this result is still unsatisfying in two respects. First, the restriction on the coefficient magnitudes means that we cannot truly claim the hardness of approximation. Second, it is still, in a way, a worst-case result, since the bias term needs to be chosen adversarially.

Our contributions

In this paper, we advance our understanding of the above questions by providing a comprehensive analysis of the linear approximability, quantified by the Kolmogorov width [31], of two-layer neural networks. To this end, we develop a spectral-based approach, which reduces the study of linear approximability to estimating the eigenvalue decay of an associated kernel. Specifically, our contributions are summarized as follows.

  • We first consider the simplest case: a single neuron σ(wᵀx) without bias term. We show that its linear approximability is controlled by the eigenvalue decay of the associated kernel k(x, x′) = E_w[σ(wᵀx) σ(wᵀx′)]. Then, we proceed to the specific case where both the input and the weight lie on the unit sphere. In this case, the eigenvalues admit an integral representation, which allows us to derive explicit estimates and perform numerical computations. Specifically, for a class of non-smooth activation functions, including the ReLU and the Heaviside step function, we prove that the linear approximation suffers from CoD. On the contrary, for smooth activations, the linear approximation is free of CoD.

  • Then, the above results are applied to analyze two-layer neural networks. By exploiting the rotational invariance, we prove that the Kolmogorov width is also controlled by the eigenvalue decay of the associated kernel. Suppose that the norm of the inner-layer weights of each neuron is bounded by B. We show that when B is small, the networks activated by smooth activation functions can be efficiently approximated by spherical harmonics, which are the eigenfunctions of the associated kernel, without CoD. However, if B is large, the linear approximation cannot avoid CoD. As a comparison, for non-smooth activation functions such as ReLU, restricting B does not change the expressiveness.

  • As a direct consequence, we also show that, for any target direction, approximating the single neuron with random features suffers from CoD for this family of non-smooth activation functions, including ReLU. This successfully removes all the restrictions of [48] (see the related-work section below for more details). Hence, we can truly claim the separation between neural networks and random feature models for this explicit target function.

1.1 Related work

The learning of a single neuron has been widely studied previously and plays an important role in understanding neural networks, e.g., the training dynamics [49, 21], the hardness of learning [40, 26], and the expressiveness of neural networks [22, 16, 49]. In this paper, we provide a full characterization of the linear approximation of single neurons.

As mentioned before, our results on single neurons are closely related to [48], which also shows the hardness of approximating single neurons with random features. Specifically, the authors prove that there exist constants such that, if the number of features and the coefficient magnitudes are bounded as stated there, then there exists a bias term such that the error of approximating the neuron with random features is bounded away from zero. The differences from our work are elaborated as follows. First, in [48], the bias term needs to be chosen in an adversarial way, while our results imply that a zero bias is already enough. Second, we do not impose any restriction on the coefficients. Removing this restriction is critical; otherwise, we cannot truly claim that the linear approximation suffers from CoD. Third, our results work for more general activation functions, whereas [48] only considers the ReLU activation.

Our work is also related to [22], which shows that random feature models effectively fit polynomials. Specifically, for the single neuron target function with standard normal inputs, [22] implies that the approximation error with random features is roughly controlled by the part of the target orthogonal to low-degree polynomials when the number of features grows only polynomially with the dimension. However, no explicit estimate is provided, and it is therefore unclear how the approximability depends on the input dimension. Moreover, the analysis of [22] is limited to specific forms of features. By contrast, our analysis is applicable to very general features.

[30] proved that two-layer neural networks activated by the sigmoid function can be efficiently approximated by polynomials, if the norms of the inner-layer weights are bounded by a constant independent of the input dimension d. They then use this expressiveness result to show that these networks can be learned in polynomial time (see also [50]). In this paper, we extend the preceding result to more general smooth activations, which include all the commonly used ones. Moreover, we prove that when the norms of the inner-layer weights are larger than a low-order polynomial of d, there do not exist fixed features for which the linear approximation can avoid CoD. These improvements over [30] benefit from the spectral-based approach developed in our study, which is quite different from the techniques used in [30].

Lastly, we remark that the eigenvalue decay of the kernel also plays an important role in analyzing the corresponding random feature model [8, 3, 33]. Recently, it has been extensively explored to understand neural networks in the kernel regime [14, 47, 25, 45, 9, 7, 6, 38]. In contrast to these works, our study reveals that the eigenvalue decay of this kernel is also related to the separation between neural networks and linear methods, which is essentially a property beyond the kernel regime.

1.2 Notation.

Let S^{d−1} denote the unit sphere in R^d, ω_{d−1} be the surface area of S^{d−1}, and τ_{d−1} be the uniform distribution over S^{d−1}. We write a ≍ b if there exist absolute constants c₁, c₂ > 0 such that c₁ b ≤ a ≤ c₂ b; a ≲ b means a ≤ C b for an absolute constant C, and a ≳ b is defined analogously. We use P(Θ) to denote the set of probability measures over a set Θ. For any probability measure ρ, we use ⟨·,·⟩_ρ and ‖·‖_ρ to denote the L²(ρ) inner product and norm, respectively.

2 Preliminaries

In this paper, we consider the two-layer neural network f_m(x) = Σ_{j=1}^{m} a_j σ(w_jᵀ x), where {(a_j, w_j)} denote all the parameters and m denotes the network width; σ is a nonlinear activation function. Let F be a set of two-layer neural networks (to be determined later). The linear approximability of F is described by the Kolmogorov width [31]:

W_{n,ρ}(F) := inf_{e_1, …, e_n} sup_{f ∈ F} inf_{c_1, …, c_n} ‖ f − Σ_{i=1}^{n} c_i e_i ‖_{L²(ρ)},    (1)

where ρ denotes the input distribution. We are interested in how W_{n,ρ}(F), and thereby the linear approximability, is affected by the input dimension d, the choice of the activation function σ, and the norm of the inner-layer weights ‖w_j‖.
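To make the definition concrete, the following sketch numerically estimates an L²-averaged proxy of the width for a sampled family of single neurons: the family is evaluated on sample points and the best n-dimensional feature space is obtained by SVD. This is only an illustration under assumed choices (ReLU neurons, d = 10, and the sample sizes below are arbitrary); it is not the construction used in the analysis, and the averaged quantity only lower-bounds the worst-case width in Eq. (1).

```python
import numpy as np

def avg_width_proxy(funcs, x_samples, n):
    """Average (L2) linear-approximation error of a family of functions by its
    best n-dimensional space of features, estimated from samples.
    This lower-bounds the worst-case Kolmogorov width in Eq. (1)."""
    # Rows: functions in the family, columns: evaluation points.
    A = np.array([[f(x) for x in x_samples] for f in funcs])
    s = np.linalg.svd(A, compute_uv=False)   # principal values of the sampled family
    m = len(x_samples)
    # Mean squared residual after projecting onto the top-n principal directions.
    return float(np.sqrt((s[n:] ** 2).sum() / (len(funcs) * m)))

# Illustrative family: single ReLU neurons with directions on the unit sphere (d = 10).
rng = np.random.default_rng(0)
d = 10
sphere = lambda k: (lambda g: g / np.linalg.norm(g, axis=1, keepdims=True))(rng.standard_normal((k, d)))
W, X = sphere(200), sphere(2000)
funcs = [lambda x, w=w: max(float(w @ x), 0.0) for w in W]
print([round(avg_width_proxy(funcs, X, n), 4) for n in (1, 5, 20, 80)])
```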

Legendre polynomials and spherical harmonics.

This paper mainly focuses on the case where the inputs lie on the unit sphere S^{d−1}. Hence, we need to prepare some basic techniques for analyzing functions on S^{d−1}. Denote by P_k the Legendre polynomial of degree k in d dimensions [1]. The Rodrigues formula is given by

(2)

The polynomial P_k is even (resp. odd) when k is even (resp. odd). Moreover,

(3)

where N(d, k) denotes the multiplicity of spherical harmonics of degree k on S^{d−1}.

Let Y_{k,j} denote the j-th spherical harmonic of degree k. Then, {Y_{k,j} : 1 ≤ j ≤ N(d, k), k ≥ 0} forms an orthonormal basis of L²(τ_{d−1}). The spherical harmonics are related to the Legendre polynomials through the addition formula:

Σ_{j=1}^{N(d,k)} Y_{k,j}(x) Y_{k,j}(x′) = N(d, k) P_k(⟨x, x′⟩).    (4)

For any f : [−1, 1] → R and any spherical harmonic Y_k of degree k, the Hecke-Funk formula is given by

∫_{S^{d−1}} f(⟨x, ξ⟩) Y_k(ξ) dτ_{d−1}(ξ) = (ω_{d−2} / ω_{d−1}) Y_k(x) ∫_{−1}^{1} f(t) P_k(t) (1 − t²)^{(d−3)/2} dt.    (5)

We refer to [39] for more details about function analysis on the sphere.
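For reference, the following helpers evaluate the d-dimensional Legendre polynomial P_k (via its standard identification with a normalized Gegenbauer polynomial, P_k(t) = C_k^{(d−2)/2}(t)/C_k^{(d−2)/2}(1), valid for d ≥ 3) and the multiplicity N(d, k). This is a small sketch based on standard formulas, not code from the paper; the normalization should be checked against [1, 39] before use.

```python
from math import comb
from scipy.special import gegenbauer

def legendre_poly(k, d):
    """Degree-k Legendre polynomial in d >= 3 dimensions, normalized so that P_k(1) = 1.
    Uses P_k(t) = C_k^{(d-2)/2}(t) / C_k^{(d-2)/2}(1) with C the Gegenbauer polynomial."""
    C = gegenbauer(k, (d - 2) / 2.0)
    c1 = C(1.0)
    return lambda t: C(t) / c1

def harmonic_multiplicity(k, d):
    """N(d, k): dimension of the space of spherical harmonics of degree k on S^{d-1}."""
    if k == 0:
        return 1
    if k == 1:
        return d
    return comb(k + d - 1, k) - comb(k + d - 3, k - 2)

# Example: P_3 in dimension 10 at a few points, and the multiplicities N(10, k).
P3 = legendre_poly(3, 10)
print([round(P3(t), 4) for t in (-1.0, 0.0, 0.5, 1.0)])
print([harmonic_multiplicity(k, 10) for k in range(5)])
```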

2.1 Linear approximation of general parametric functions

Consider a general parametric function φ(x; θ) with input x ∈ X and parameter θ ∈ Θ. The single neuron corresponds to φ(x; w) = σ(wᵀx). Let π ∈ P(Θ) and define

k_π(x, x′) = ∫_Θ φ(x; θ) φ(x′; θ) dπ(θ).    (6)

Assume that φ(·; θ) is continuous on X for any θ ∈ Θ and that X is compact. Then, Mercer's theorem implies the decomposition k_π(x, x′) = Σ_{j≥1} λ_j e_j(x) e_j(x′). Here, {λ_j} are the eigenvalues in non-increasing order, and {e_j} are the corresponding eigenfunctions, which satisfy ∫ k_π(x, x′) e_j(x′) dρ(x′) = λ_j e_j(x). The trace of k_π satisfies Σ_{j≥1} λ_j = E_{θ∼π} ‖φ(·; θ)‖²_ρ. In particular, we are interested in the following quantity:

Λ_n(k_π) := Σ_{j>n} λ_j.    (7)

Proposition 2.1. For any n ∈ N, any π ∈ P(Θ), and any fixed features g_1, …, g_n ∈ L²(ρ), we have

E_{θ∼π} inf_{c∈R^n} ‖ φ(·; θ) − Σ_{i=1}^{n} c_i g_i ‖²_ρ ≥ Σ_{j>n} λ_j.    (8)

The equality is reached by the optimal fixed features g_i = e_i for i = 1, …, n.

Proof.

Without loss of generality, assume that g_1, …, g_n are orthonormal in L²(ρ); otherwise, we can perform Gram-Schmidt orthonormalization. Then,

E_{θ∼π} inf_{c∈R^n} ‖ φ(·; θ) − Σ_{i=1}^{n} c_i g_i ‖²_ρ = E_{θ∼π} ‖φ(·; θ)‖²_ρ − Σ_{i=1}^{n} E_{θ∼π} ⟨φ(·; θ), g_i⟩²_ρ.    (9)

The second term in (9) is a standard PCA problem for the kernel k_π: its supremum over orthonormal g_1, …, g_n is reached at g_i = e_i and equals Σ_{i=1}^{n} λ_i. Plugging this back into (9) completes the proof. ∎

Eq. (8) provides a lower bound on the average approximation error. It can be converted to a worst-case result as follows:

sup_{θ∈Θ} inf_{c∈R^n} ‖ φ(·; θ) − Σ_{i=1}^{n} c_i g_i ‖²_ρ ≥ sup_{π∈P(Θ)} Σ_{j>n} λ_j(k_π).    (10)

To obtain a sharp lower bound, one can choose the distribution π such that the eigenvalues of k_π decay as slowly as possible. In the remainder of this paper, we omit this subscript for simplicity, since the distribution is always fixed to be the uniform distribution on the sphere.
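As a numerical counterpart of the bound in Eq. (8), the sketch below estimates the kernel eigenvalues by Monte Carlo (assuming the kernel in Eq. (6) is the θ-average of φ(x; θ)φ(x′; θ), as written above) and returns the trace tail Σ_{j>n} λ_j. Function names, the ReLU example, and the sample sizes are illustrative, not taken from the paper.

```python
import numpy as np

def trace_tail(phi, thetas, xs, n):
    """Monte-Carlo estimate of the lower bound in Eq. (8): the sum of the kernel
    eigenvalues beyond the n leading ones.  phi(x, theta) is the parametric function;
    thetas ~ pi and xs ~ rho are i.i.d. samples."""
    # F[i, j] = phi(x_i, theta_j); the empirical kernel matrix is K = F F^T / len(thetas).
    F = np.array([[phi(x, th) for th in thetas] for x in xs])
    K = F @ F.T / len(thetas)
    # Eigenvalues of K / len(xs) approximate the Mercer eigenvalues of k_pi in L^2(rho).
    lam = np.sort(np.linalg.eigvalsh(K / len(xs)))[::-1]
    return float(lam[n:].sum())

# Illustrative single-neuron example on the sphere in d = 8.
rng = np.random.default_rng(0)
d = 8
sphere = lambda m: (lambda g: g / np.linalg.norm(g, axis=1, keepdims=True))(rng.standard_normal((m, d)))
phi = lambda x, w: max(float(w @ x), 0.0)          # ReLU single neuron
thetas, xs = sphere(500), sphere(500)
print([round(trace_tail(phi, thetas, xs, n), 5) for n in (1, 10, 50)])
```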

3 Linear approximability of single neurons

We first consider the single neuron σ(wᵀx). The results established in this section will be utilized later to analyze two-layer neural networks.

According to Proposition 2.1, what remains is to estimate the eigenvalues of the associated kernel. Assume that both the input x and the weight w lie on the unit sphere S^{d−1} and that w is uniformly distributed. By rotational invariance, the kernel can be written in a dot-product form k(x, x′) = κ(⟨x, x′⟩) with κ : [−1, 1] → R. Following [43], the spectral decomposition of k is given by:

k(x, x′) = Σ_{k≥0} μ_k Σ_{j=1}^{N(d,k)} Y_{k,j}(x) Y_{k,j}(x′),    (11)

where μ_k is the eigenvalue of degree k and the spherical harmonics Y_{k,j}, j = 1, …, N(d, k), are the corresponding eigenfunctions. Note that {λ_j} are the eigenvalues counted with multiplicity, while {μ_k} are the eigenvalues counted without multiplicity. We refer to [39, 43] for more details about the spectral decomposition of a dot-product kernel on the sphere. The following integral representation allows both explicit estimation and numerical computation of μ_k.

For the single neuron, we have

μ_k = ( (ω_{d−2} / ω_{d−1}) ∫_{−1}^{1} σ(t) P_k(t) (1 − t²)^{(d−3)/2} dt )².    (12)
Proof.

By the Hecke-Funk formula (5), each spherical harmonic Y_{k,j} satisfies ∫ σ(⟨w, x⟩) Y_{k,j}(w) dτ_{d−1}(w) = β_k Y_{k,j}(x), where β_k is the quantity inside the square in (12); expanding both factors of the kernel in this basis and using orthonormality gives μ_k = β_k². ∎

3.1 Non-smooth activations

We consider a family of non-smooth activation functions: σ_α(t) = max(t, 0)^α with α ≥ 0. The ReLU and the Heaviside step function correspond to α = 1 and α = 0, respectively. Other values of α also have many applications in scientific computing [46, 42, 29]. In particular, for a specific choice of α, [12] shows

(13)

Proposition 3.1. Let σ = σ_α with α ≥ 0. There exists a constant that depends on d polynomially such that the eigenvalues μ_k obey an explicit polynomial decay rate in k; in particular, the resulting decay suffers from CoD.

The proof is deferred to Appendix A.1. Figure 1 shows how μ_k decays with k for various values of d and α. In this case, the total trace can be computed in closed form, and μ_k is computed by integrating Eq. (12) numerically. It is clear that the decay suffers from CoD, and the explicit rate provided in Proposition 3.1 aligns very well with the ground truth.

Figure 1: The decay of μ_k for various values of d. The dashed curve corresponds to the explicit estimate given in Proposition 3.1.
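The numerical computation behind Figure 1 can be reproduced, up to normalization, with a one-dimensional quadrature. The sketch below assumes Eq. (12) takes the Hecke-Funk form stated above, μ_k = ((ω_{d−2}/ω_{d−1}) ∫_{−1}^{1} σ(t) P_k(t) (1 − t²)^{(d−3)/2} dt)²; if the constant in the paper's Eq. (12) differs, all μ_k are rescaled uniformly and the decay rate is unchanged. The choice d = 20 and the tanh comparison (previewing Section 3.2) are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, gegenbauer

def mu_k(sigma, k, d):
    """Degree-k eigenvalue of the single-neuron kernel via numerical quadrature,
    assuming the Hecke-Funk form of Eq. (12) described above."""
    C = gegenbauer(k, (d - 2) / 2.0)
    P = lambda t: C(t) / C(1.0)                  # d-dimensional Legendre polynomial
    ratio = gamma(d / 2.0) / (np.sqrt(np.pi) * gamma((d - 1) / 2.0))   # |S^{d-2}| / |S^{d-1}|
    val, _ = quad(lambda t: sigma(t) * P(t) * (1.0 - t * t) ** ((d - 3) / 2.0), -1.0, 1.0)
    return (ratio * val) ** 2

d = 20
relu = lambda t: max(t, 0.0)
print("ReLU:", [f"{mu_k(relu, k, d):.2e}" for k in range(10)])
print("tanh:", [f"{mu_k(np.tanh, k, d):.2e}" for k in range(10)])
```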

Theorem 3.1. Let σ = σ_α for some α ≥ 0. Then, there exists a constant that depends on d polynomially such that the following statements hold.

  • For any fixed features g_1, …, g_n, we have

    (14)
  • Consider the random features φ(·; θ_1), …, φ(·; θ_n). We assume the features are rotationally invariant, i.e., φ(Qx; Qθ) = φ(x; θ) for any orthonormal matrix Q, and each θ_i is sampled from a rotation-invariant distribution ρ₀. Then, for any target direction v ∈ S^{d−1},

    (15)
Proof.

Eq. (14) follows from a simple combination of Proposition 3.1 and Proposition 2.1. To prove Eq. (15), we need to exploit the rotational invariance of the random features. Let E(v) denote the expected approximation error for the target direction v. For any v ∈ S^{d−1}, let Q be an orthonormal matrix such that Qv = v₀ for a fixed v₀ ∈ S^{d−1}. Then,

where the two steps follow from the rotational invariance of φ and of ρ₀, respectively. Therefore, E(v) is constant with respect to v. By Proposition 2.1, we have

Then, applying Proposition 3.1 completes the proof. ∎

Theorem 3.1 shows that the linear approximation of single neurons activated by these non-smooth activation functions suffers from CoD. In particular, (15) shows that the lower bound holds for any target direction v if the features are rotationally invariant. A typical form of rotation-invariant random features is φ(x; w) = ψ(wᵀx) with w drawn from a rotation-invariant distribution. This includes the random feature models emerging in analyzing neural networks and kernel predictors with dot-product kernels [43], among others. In contrast with [48, Theorem 4.1], we successfully remove the restriction on the coefficient magnitudes and do not need to adversarially choose the bias term.
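The separation claimed in Theorem 3.1 can be probed empirically. The sketch below fits a single ReLU neuron with rotation-invariant random ReLU features by ridge regression and reports the test error; all sizes, the ridge parameter, and the protocol are illustrative assumptions, not the paper's experiment. Increasing d while keeping the number of features fixed should make the error deteriorate, consistent with the curse of dimensionality.

```python
import numpy as np

def rf_fit_error(d, n_features, n_train=4000, n_test=2000, reg=1e-6, seed=0):
    """Ridge-regression fit of the target x -> max(v.x, 0), with v on the unit sphere,
    using random ReLU features whose directions are drawn uniformly from S^{d-1}."""
    rng = np.random.default_rng(seed)
    sphere = lambda m: (lambda g: g / np.linalg.norm(g, axis=1, keepdims=True))(rng.standard_normal((m, d)))
    v, W = sphere(1)[0], sphere(n_features)
    X_tr, X_te = sphere(n_train), sphere(n_test)
    feats = lambda X: np.maximum(X @ W.T, 0.0)
    y_tr, y_te = np.maximum(X_tr @ v, 0.0), np.maximum(X_te @ v, 0.0)
    F = feats(X_tr)
    coef = np.linalg.solve(F.T @ F + reg * np.eye(n_features), F.T @ y_tr)
    return float(np.mean((feats(X_te) @ coef - y_te) ** 2))

for d in (5, 20, 80):
    print(d, rf_fit_error(d, n_features=500))
```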

3.2 Smooth activations

Now, we turn to smooth activation functions, such as Sigmoid, Softplus, and GELU/SiLU [24, 20], which are also widely used in practice.

Assume that σ is k-times continuously differentiable on [−1, 1]. Then,

(16)
Proof.

Substituting the Rodrigues formula (2) into Eq. (12) leads to

where the last equality follows from repeated integration by parts, which completes the proof. ∎

Assumption 3.2. Assume that the higher-order derivatives of σ on [−1, 1] grow at a controlled rate in the order k. All the popular smooth activation functions satisfy this assumption, as shown below.

  • For and , .

  • Consider the sigmoid function σ(t) = 1/(1 + e^{−t}), which can be viewed as a complex function of t. The singular points of σ are {(2m + 1)πi : m ∈ Z}. For any r < π, let Γ_r denote the circle of radius r centered at a point t ∈ [−1, 1]. Then, all the singular points must lie outside the curve Γ_r. Using Cauchy's integral formula, for any k, we have

    (17)
  • For all the other commonly used smooth activation functions, we can obtain similar estimates of the k-th order derivatives by using Cauchy's integral formula.
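These Cauchy-integral estimates can be checked numerically. The sketch below evaluates the k-th derivative of the sigmoid at a point t ∈ [−1, 1] by discretizing Cauchy's integral formula on a circle of radius r < π (so the poles of the sigmoid stay outside the contour) and compares it with the crude bound (k!/r^k) max|σ| on that contour. The radius r = 3 and the node count are arbitrary illustrative choices.

```python
import math
import numpy as np

def sigmoid_deriv_cauchy(k, t=0.0, r=3.0, n_nodes=4096):
    """k-th derivative of sigma(t) = 1/(1+exp(-t)) at t via Cauchy's integral formula,
    discretized by the trapezoidal rule on the circle |z - t| = r (requires r < pi)."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_nodes, endpoint=False)
    z = t + r * np.exp(1j * theta)
    f = 1.0 / (1.0 + np.exp(-z))
    # f^{(k)}(t) = k!/(2*pi*i) * contour integral of f(z)/(z-t)^{k+1} dz
    deriv = (math.factorial(k) / r**k) * np.mean(f * np.exp(-1j * k * theta)).real
    bound = (math.factorial(k) / r**k) * np.abs(f).max()   # Cauchy estimate on this contour
    return deriv, bound

for k in (1, 5, 10, 20):
    d_k, b_k = sigmoid_deriv_cauchy(k)
    print(f"k={k:2d}  derivative~{d_k:.3e}  Cauchy bound={b_k:.3e}")
```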

Under Assumption 3.2, we obtain an explicit estimate of the eigenvalue decay; the proof is deferred to Appendix A.2. We remark that this estimate is rather rough for most smooth activation functions, for which μ_k is actually much smaller, as demonstrated by the sharper estimates above. However, it is sufficient to show that the linear approximation is free of CoD. A simple combination of Proposition 3.2 and Proposition 2.1 gives

(18)

where the basis functions are the n leading spherical harmonics. By comparing with Theorem 3.1, we see that smooth activations behave very differently from non-smooth ones. Moreover, by exploiting the rotational invariance, we can obtain a stronger result as follows. Let the basis functions be the n leading spherical harmonics, and assume that there exists a non-increasing function h that upper-bounds the eigenvalue decay. Then, for any w ∈ S^{d−1},

(19)

The proof can be found in Appendix A.3. In fact, one can choose h appropriately to obtain a tight bound. The introduction of h facilitates the explicit calculation of the bound, since we rarely have the exact rate of the eigenvalue decay. For the non-smooth activation functions considered in Section 3.1, taking h to match the polynomial decay yields a rate that suffers from CoD. For smooth activation functions that satisfy Assumption 3.2, we can take a much faster-decaying h, for which the rate is free of CoD. Hence, we conclude that the spherical harmonics can approximate all the single neurons uniformly well. In particular, for smooth activation functions, the approximation is free of CoD.

4 Expressiveness of two-layer neural networks

To analyze the expressiveness of the two-layer neural network, we define

(20)

where B bounds the norms of the inner-layer weights. For any B > 0, let

(21)

Obviously, all finite-width two-layer neural networks belong to this space. We use F_B to denote its unit ball. Similar function spaces have been widely studied previously and play a fundamental role in the theoretical analysis of two-layer neural networks [2, 16, 17, 10, 11, 32, 18]. Our definition is slightly different from the ones in [2, 17], which are specific to ReLU.
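As a concrete illustration of the networks in this class, the sketch below samples a finite-width two-layer network whose inner-layer weights all have norm exactly B. The parameter distribution and the outer-layer scaling are placeholders chosen for illustration; the actual definitions (20)-(21) constrain the parameters through a norm that this sketch does not reproduce.

```python
import numpy as np

def sample_two_layer_net(d, width, B, activation=np.tanh, seed=0):
    """A random two-layer network f(x) = sum_j a_j * activation(w_j . x) with ||w_j|| = B.
    Returns a callable that evaluates f on a batch X of shape (m, d)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, d))
    W = B * W / np.linalg.norm(W, axis=1, keepdims=True)   # inner-layer weights on the radius-B sphere
    a = rng.standard_normal(width) / width                 # illustrative outer-layer scaling

    def f(X):
        return activation(X @ W.T) @ a

    return f

# Evaluate a width-64 network with B = 4 on random points of the unit sphere (d = 16).
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(sample_two_layer_net(d=16, width=64, B=4.0)(X))
```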

For any B > 0, define the associated kernel k_B as in Eq. (6), with the weight uniformly distributed on the radius-B sphere. Let Λ_n(B) denote the trace decay of k_B defined according to Eq. (7).

Proposition 4. Let the function class be the unit ball F_B and let the input distribution be the uniform distribution on S^{d−1}. Then, the Kolmogorov width is bounded from above and below in terms of the trace decay Λ_n(B).

Proof.

For the lower bound, fix any features g_1, …, g_n. Then, for any w with ‖w‖ ≤ B,

where the second inequality follows from Proposition 2.1. Hence, the lower bound is proved. For the upper bound, let the features be the leading spherical harmonics. Then, by Proposition 3.2 with a suitable choice of h,

For any f with unit norm, there must exist coefficients a_j and weights w_j with ‖w_j‖ ≤ B such that f is well approximated by the corresponding finite-width network with controlled coefficient mass. Then,

Thus, we complete the proof. ∎

In the proof, the key ingredient is the uniform approximability of single neurons shown in Proposition 3.2. Proposition 4 implies that the trace decay Λ_n(B) provides a tight characterization of the linear approximability of two-layer neural networks. Next, we study how the linear approximability is affected by the value of B for different activation functions.

4.1 Influence of the norms of inner-layer weights

For the ReLU activation, the classes F_B (up to a rescaling) are obviously the same for different values of B because of the positive homogeneity of ReLU. In particular, Theorem 3.1 implies

(22)

where the constant depends on d polynomially. Hence, for any B > 0, the linear approximation of two-layer neural networks activated by ReLU suffers from CoD.

However, for general activation functions, it is unclear whether restricting B affects the linear approximability. For example, is there a significant difference between small and large values of B? Suppose that σ satisfies Assumption 3.2 and that B is bounded by a constant. Then, the Kolmogorov width admits a dimension-independent bound, and the bound is attained by choosing the spherical harmonics as the basis functions.

This theorem follows from a simple combination of Proposition 3.2 and Proposition 4. It shows that two-layer networks activated by smooth functions are not more expressive than polynomials if the norms of the inner-layer weights are restricted to be too small. This is clearly different from the ReLU case, where the expressiveness is independent of B.

For the arctangent activation function, we have a refined characterization of how the linear approximability depends on the value of B (Proposition 4.1). Assume σ = arctan. We have

(23)

The equality is reached by choosing the spherical harmonics as the basis functions. The proof is deferred to Appendix B.1. The idea is to estimate the integral (16) in the Fourier domain using Parseval's theorem. For the arctangent function, the relevant Fourier transform has an explicit formula, which facilitates the refined estimate of the dependence on B.

The preceding proposition implies that the error rate decreases with B but is independent of d once B is sufficiently large. We conjecture that similar results hold for general smooth activation functions, and some numerical support is provided in Figure 2. Specifically, we examine four activation functions, including two sigmoid-like activations, Arctan and Sigmoid, and two ReLU-like activations, SiLU and Softplus. According to Proposition 4, the trace decay is a good proxy of the Kolmogorov width. The eigenvalues are numerically computed using Eq. (12). In experiments, we observe the same qualitative behavior for all the activation functions examined. Figure 2 shows that in all cases, the rate is independent of d for a fixed B and decreases with B for a fixed d. This is consistent with Eq. (23), which is only proved for Arctan.

Figure 2: How the trace decay, and thereby the linear approximability, changes with d for a fixed B (left) and with B for a fixed d (right). Panels: (a) Arctan, (b) Sigmoid, (c) SiLU, (d) Softplus — two sigmoid-like and two ReLU-like smooth activation functions.
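The B-dependence in Figure 2 can be explored in the same way: for a neuron with ‖w‖ = B and inputs on the sphere, σ(wᵀx) = σ(B⟨w/‖w‖, x⟩), so one can reuse the quadrature of Eq. (12) with the rescaled activation t ↦ σ(Bt) and sum the tail of the eigenvalues (with multiplicity) as a proxy for the trace decay. As before, the Hecke-Funk normalization, the Arctan example, and the choices of d, B, and truncation degree below are assumptions for illustration; for high degrees the quadrature becomes numerically delicate.

```python
import numpy as np
from math import comb
from scipy.integrate import quad
from scipy.special import gamma, gegenbauer

def mu_k_scaled(sigma, B, k, d):
    """Degree-k eigenvalue for the neuron sigma(B * <w, x>), with w and x on S^{d-1},
    assuming the Hecke-Funk form of Eq. (12)."""
    C = gegenbauer(k, (d - 2) / 2.0)
    P = lambda t: C(t) / C(1.0)
    ratio = gamma(d / 2.0) / (np.sqrt(np.pi) * gamma((d - 1) / 2.0))
    val, _ = quad(lambda t: sigma(B * t) * P(t) * (1.0 - t * t) ** ((d - 3) / 2.0), -1.0, 1.0)
    return (ratio * val) ** 2

def trace_tail_proxy(sigma, B, d, n_degrees, k_max=20):
    """Sum of eigenvalues, counted with multiplicity N(d, k), beyond the first n_degrees degrees."""
    mult = lambda k: 1 if k == 0 else (d if k == 1 else comb(k + d - 1, k) - comb(k + d - 3, k - 2))
    return sum(mult(k) * mu_k_scaled(sigma, B, k, d) for k in range(n_degrees, k_max + 1))

for B in (1.0, 4.0, 16.0):
    print(B, [f"{trace_tail_proxy(np.arctan, B, d=10, n_degrees=n):.2e}" for n in (2, 4, 8)])
```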

Note that a result similar to Proposition 4.1 has been provided in [30] for the sigmoid activation function. Ours differs from [30] in two aspects. First, we show that the same observation holds for more general smooth activation functions. In particular, Proposition 4 combined with Lemma 3 provides an easy way to numerically compute upper bounds. Second, we establish the hardness result given below. Both benefit from our spectral-based analysis; it is impossible to obtain these results using the techniques in [30].

Assumption 4.1. Suppose that the activation function σ is either sigmoid-like, i.e., |σ(t)| ≤ C for any t, or ReLU-like, i.e., |σ(t) − max(t, 0)| ≤ C for any t, for some constant C. This assumption is satisfied by all the commonly used activation functions.

Suppose that σ satisfies Assumption 4.1. Then, there exists a constant such that, if B is larger than a low-order polynomial of d, we have

Proof.

We only present the proof for the sigmoid-like activation functions; the proof for the ReLU-like ones is similar and can be found in Appendix B.2. For any,