1 Introduction and main results
In the canonical statistical learning problem, we are given independent and identically distributed (i.i.d.) pairs $(y_i, x_i)$, $i \le n$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \mathbb{R}$ is a label or response variable. We would like to construct a function $f: \mathbb{R}^d \to \mathbb{R}$ which allows us to predict future responses. Throughout this paper, we will measure the quality of a predictor via its square prediction error (risk): $R(f) = \mathbb{E}\{(y - f(x))^2\}$.
Current practice suggests that, for a number of important applications, the best learning method is a multi-layer neural network. The simplest model in this class is given by two-layer networks (NN):
$$ f_{NN}(x; a, W) = \sum_{i=1}^{N} a_i \, \sigma(\langle w_i, x \rangle). $$
Here $N$ is the number of neurons and $\sigma: \mathbb{R} \to \mathbb{R}$ is an activation function. Over the last several years, considerable attention has been devoted to two classes of models that can be regarded as linearizations of two-layer networks. The first one is the random features model of Rahimi and Recht [RR08], which only optimizes over the second-layer weights $a_i$, while keeping the first layer fixed:
$$ \mathcal{F}_{RF}(W) = \Big\{ f(x) = \sum_{i=1}^{N} a_i \, \sigma(\langle w_i, x \rangle) \, : \; a_i \in \mathbb{R} \Big\}. $$
Here $W \in \mathbb{R}^{N \times d}$ is a matrix whose $i$-th row is the vector $w_i$. In the RF model, this matrix is chosen randomly, independently of the data.
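As an illustration, fitting the RF model amounts to ordinary least squares on random nonlinear features. The sketch below is a minimal numpy version; the sizes, the ReLU activation, and the quadratic toy target are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 20, 100, 500            # dimension, neurons, samples (illustrative values)

# First-layer weights: rows w_i of W, uniform on the unit sphere, never trained.
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

relu = lambda u: np.maximum(u, 0.0)

# Covariates uniform on the sphere of radius sqrt(d).
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 0] ** 2 - 1.0            # a toy quadratic target (assumption)

# RF model: only the second-layer coefficients a are fit, by least squares.
Phi = relu(X @ W.T)               # n x N feature matrix, entries sigma(<w_i, x_j>)
a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
f_hat = Phi @ a                   # fitted values on the training set
```

The key point is that the optimization over $a$ is convex (a linear least-squares problem), which is what makes the RF model a "linearization" of the network.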
The second model is the neural tangent kernel model of Jacot, Gabriel and Hongler [JGH18], which we define as
$$ \mathcal{F}_{NT}(W) = \Big\{ f(x) = \sum_{i=1}^{N} \langle a_i, x \rangle \, \sigma'(\langle w_i, x \rangle) \, : \; a_i \in \mathbb{R}^d \Big\}. $$
Again, $W \in \mathbb{R}^{N \times d}$ is a matrix of weights that is not optimized over, but instead drawn at random. Further, $\sigma'$ is the derivative of the activation function with respect to its argument (if $w$ has a density, $\sigma$ only needs to be weakly differentiable). This model can be viewed as a first-order Taylor expansion of the neural network model around a random initialization [JGH18]. Several recent papers argue that this linearization indeed captures the behavior of the original neural network when the latter is fitted using stochastic gradient descent (SGD), provided the model is sufficiently overparametrized (see Section 2 for pointers to this line of work).
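A matching sketch of the NT feature map: each neuron contributes the $d$ features $x\,\sigma'(\langle w_i, x\rangle)$, so the model is linear in the $Nd$ coefficients. Sizes are illustrative, and ReLU is used so that $\sigma'$ is the step function.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n = 10, 30, 200             # illustrative sizes

W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)

step = lambda u: (u >= 0.0).astype(float)    # weak derivative of the ReLU

# NT features: for each neuron i, the d-vector x * sigma'(<w_i, x>); N*d in total.
S = step(X @ W.T)                            # n x N matrix of sigma'(<w_i, x_j>)
Phi_nt = (S[:, :, None] * X[:, None, :]).reshape(n, N * d)
```

Note the parameter count: $Nd$ for NT versus $N$ for RF, which is the source of the different scalings in the results below.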
1.1 A numerical experiment
[Figure captions; plots omitted.] We use least squares to estimate the model coefficients from $n$ samples and report the test error over fresh samples. Data points correspond to averages over independent repetitions, and the risk is normalized by the risk of the trivial (constant) predictor.
The starting point of this paper is a simple, and yet surprising, simulation study. We consider feature vectors normalized so that $\|x_i\|_2 = \sqrt{d}$, and otherwise uniformly random, and responses $y_i = f_d(x_i)$ for a certain function $f_d$. Indeed, this will be the setting throughout the paper: $x_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ (where $\mathbb{S}^{d-1}(r)$ denotes the sphere with radius $r$ in $d$ dimensions) and $y_i = f_d(x_i)$. We draw random weights $w_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(1))$, and use $n$ samples to learn a model in $\mathcal{F}_{RF}(W)$ or $\mathcal{F}_{NT}(W)$. We estimate the risk (test error) using fresh samples, and normalize it by the risk of the trivial constant predictor.
We use shifted ReLU activations and learn the model parameters using least squares. If the model is overparametrized, we select the minimum $\ell_2$-norm solution. (We refer to Appendix A for simulations using ridge regression instead.)
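The minimum $\ell_2$-norm solution in the overparametrized regime can be computed with the pseudoinverse. A small sanity check (with arbitrary Gaussian features, an assumption made purely for illustration): any other interpolating coefficient vector differs by a null-space component and therefore has larger norm.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 50                     # overparametrized: more parameters than samples
Phi = rng.standard_normal((n, p))
y = rng.standard_normal(n)

a_min = np.linalg.pinv(Phi) @ y   # minimum l2-norm interpolating solution

# Any other interpolator adds a null-space vector, orthogonal to a_min.
z = rng.standard_normal(p)
a_other = a_min + (np.eye(p) - np.linalg.pinv(Phi) @ Phi) @ z
```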
The results are somewhat disappointing: in two cases (the first and third figures) these advanced machine learning methods do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and its risk appears to decrease to zero as the number of samples gets large. We also note that the risk shows a cusp when $n \approx p$, with $p$ the number of parameters of the model ($p = N$ for RF, and $p = Nd$ for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19] for relevant work). We will instead focus on the population behavior $n \to \infty$.
The reader might wonder whether these poor performances are due to the choice of an extremely complex ground truth $f_d$. This is not the case: Figures 1 and 2 use a simple quadratic function, while in Figure 3 we instead try to learn a third-order polynomial. In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third-order polynomial. This is surprising, especially in view of two remarks:
General theory implies that both these functions can be represented arbitrarily well with an unbounded number of neurons (see, e.g., [Cyb89]).
Both functions can in fact be fitted accurately by a two-layer network with a moderate number of neurons, provided the first-layer weights are suitably chosen.
We demonstrate the second point empirically in Fig. 4 by choosing weight vectors supported on randomly chosen coordinates: the nonzero entries sit at i.i.d. uniformly random indices, with a suitable scaling factor. Fixing these random first-layer weights, we fit the second-layer weights by least squares. The risk achieved is an upper bound on the minimum risk within the neural network model, and is significantly smaller than the baseline. (The risk reported in Fig. 4 can also be interpreted as a 'random features' risk. However, the specific distribution of the vectors $w_i$ is tailored to the function $f_d$, and hence not achievable within the RF model.)
1.2 Main results
The origin of the mismatch between classical theory and the findings of Figures 1 to 3 is that universal approximation results apply to the case of a fixed dimension $d$, as the number of neurons $N$ grows to infinity. In this paper we focus on the population behavior (i.e. $n \to \infty$) and unveil a remarkably simple behavior of the RF and NT models when $d$ and $N$ grow together. Our results can be summarized as follows:
If $N = O(d^{2-\delta})$ for some $\delta > 0$, then RF does not outperform linear regression in the raw covariates (i.e. least squares with the model $\hat{f}(x) = a + \langle b, x \rangle$, $a \in \mathbb{R}$, $b \in \mathbb{R}^d$).
More generally, if $N = O(d^{\ell+1-\delta})$, then RF does not outperform linear regression over all monomials of degree at most $\ell$ in $x$.
If $N = O(d^{2-\delta})$, then NT does not outperform linear regression over monomials of degree at most two in $x$ (i.e. least squares with the model $\hat{f}(x) = a + \langle b, x \rangle + \langle x, C x \rangle$, $a \in \mathbb{R}$, $b \in \mathbb{R}^d$, $C \in \mathbb{R}^{d \times d}$).
More generally, if $N = O(d^{\ell-\delta})$, then NT does not outperform linear regression over all monomials of degree at most $\ell$ in $x$.
In the following, we state these results formally. We define the minimum population risk of a model $M \in \{RF, NT\}$ by
$$ R_{M}(f_d, W) = \inf_{f \in \mathcal{F}_{M}(W)} \mathbb{E}\big\{ (f_d(x) - f(x))^2 \big\}. $$
Notice that this is a random variable because of the random features encoded in the matrix $W$. It also depends implicitly on the dimension $d$, but we will make this dependence explicit only when necessary.
For $\ell \ge 0$, we denote by $P_{\le \ell}$ the orthogonal projector onto the subspace of polynomials of degree at most $\ell$. (We also let $P_{> \ell} = I - P_{\le \ell}$.) In other words, $P_{\le \ell} f$ is the function obtained by linear regression of $f$ onto monomials of degree at most $\ell$.
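In finite samples, the degree-$\ell$ baseline that the theorems compare against is just least squares on all monomials of degree at most $\ell$. A minimal sketch (the target here is an assumption, chosen to lie exactly in the degree-2 span so that the residual vanishes):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree):
    """All monomials of degree <= degree in the columns of X (constant included)."""
    n, d = X.shape
    cols = [np.ones(n)]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), k):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]       # a degree-2 target (assumption)

Z = poly_features(X, 2)                      # 1 + 5 + 15 = 21 monomials
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
residual = y - Z @ coef                      # ~0: the target has degree <= 2
```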
Our main theorems formalize the above discussion, under certain technical conditions on the activation function. For the RF model we only require that $\sigma$ does not grow too fast at infinity (in particular, exponential growth is fine) and is not a low-degree polynomial (non-trivial results are obtained already when $\sigma$ is a polynomial of low degree).

Theorem 1 (Risk of the RF model). Fix an integer $\ell \ge 1$, and let $\{f_d\}_{d \ge 1}$ be a sequence of functions $f_d \in L^2(\mathbb{S}^{d-1}(\sqrt{d}))$. Let $\sigma$ be an activation function such that $|\sigma(u)| \le c_0 e^{c_1 |u|}$ for some constants $c_0, c_1 > 0$. Further assume that $\sigma$ is not a polynomial of degree at most $\ell$.

Let $W = (w_1, \dots, w_N)$ with $w_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(1))$ independently. Then, for any $\delta > 0$ and $N \le d^{\ell + 1 - \delta}$, the following happens with high probability:
$$ \big| R_{RF}(f_d, W) - R_{RF}(P_{\le \ell} f_d, W) - \| P_{> \ell} f_d \|_{L^2}^2 \big| \le \varepsilon_d \, \| f_d \|_{L^2}^2, \qquad (2) $$
where $\varepsilon_d = o_d(1)$.
For the NT model we require the same growth condition, although on the weak derivative of $\sigma$, denoted by $\sigma'$. We further require the Hermite decomposition of $\sigma'$ to satisfy a mild 'genericity' condition. Recall that the $k$-th Hermite coefficient of a function $\varphi \in L^2(\mathbb{R}, \gamma)$ can be defined as $\mu_k(\varphi) = \mathbb{E}\{\varphi(G) \, \mathrm{He}_k(G)\}$, $G \sim \mathsf{N}(0,1)$, where $\mathrm{He}_k$ is the $k$-th Hermite polynomial (see Section 3 for further background).

Theorem 2 (Risk of the NT model). Fix an integer $\ell \ge 1$, and let $\{f_d\}_{d \ge 1}$ be a sequence of functions $f_d \in L^2(\mathbb{S}^{d-1}(\sqrt{d}))$. Let $\sigma$ be a weakly differentiable activation function, with weak derivative $\sigma'$ such that $|\sigma'(u)| \le c_0 e^{c_1 |u|}$ for some constants $c_0, c_1 > 0$. Further assume the Hermite coefficients to be such that there exists $k \ge \ell$ with $\mu_k(\sigma') \neq 0$ and $\mu_{k+2}(\sigma') \neq 0$.

Let $W = (w_1, \dots, w_N)$ with $w_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(1))$ independently. Then, for any $\delta > 0$ and $N \le d^{\ell - \delta}$, the following happens with high probability:
$$ \big| R_{NT}(f_d, W) - R_{NT}(P_{\le \ell} f_d, W) - \| P_{> \ell} f_d \|_{L^2}^2 \big| \le \varepsilon_d \, \| f_d \|_{L^2}^2, \qquad (4) $$
where $\varepsilon_d = o_d(1)$.
In words, Eq. (2) says that the risk of the random features model approximately decomposes into two parts, each non-negative, and each with a simple interpretation:
The second contribution, $\| P_{> \ell} f_d \|_{L^2}^2$, is simply the risk achieved by linear regression with respect to polynomials of degree at most $\ell$. In the special case $\ell = 1$, this is the risk of simple linear regression with respect to the raw features. The first contribution, $R_{RF}(P_{\le \ell} f_d, W)$, is the risk of the RF model when applied to the low-degree component of $f_d$ (the linear component for $\ell = 1$). In general this will be strictly positive. Equation (4) yields a similar decomposition for the NT model. It is easy to check that the conditions on the activation function in Theorem 1 and Theorem 2 hold for all $\ell$, for all commonly used activations.
For instance, the ReLU activation $\sigma(u) = \max(u, 0)$ obviously satisfies the assumptions of Theorem 1 (it has subexponential growth and is not a polynomial). As for Theorem 2, its weak derivative is the step function $\sigma'(u) = \mathbf{1}\{u \ge 0\}$, which is bounded and hence has subexponential growth. Further, its Hermite coefficients can be computed in closed form, and they satisfy the required condition for each $\ell$. (In checking the condition, it is useful to notice the recursive relation between consecutive nonzero Hermite coefficients of the step function.)
In the next section, we briefly overview related literature. Section 3 provides some technical background, in particular on orthogonal polynomials, that is useful for the proofs. We prove the statement for the RF model, Theorem 1, in Section 4. The proof for the NT model, Theorem 2, is similar but technically more involved, and is presented in Section 5.
2 Related work
Approximation properties of neural networks and, more generally, nonlinear approximation were studied in detail in the nineties, see e.g. [DHM89, GJP95, Mha96]. The main concern of the present paper is quite different, since we focus on the random features model and the (recently proposed) neural tangent model. Further, our focus is on the high-dimensional regime in which $d$ grows with the number of neurons. Most of the approximation theory literature considers $d$ fixed, with the number of neurons $N \to \infty$.
The random features model RF has been studied in considerable depth since the original work [RR08]. The classical viewpoint suggests that $\mathcal{F}_{RF}(W)$ should be regarded as an approximation of the reproducing kernel Hilbert space (RKHS) defined by the kernel (see [BTA11] for general background)
$$ K(x_1, x_2) = \mathbb{E}_{w}\big\{ \sigma(\langle w, x_1 \rangle) \, \sigma(\langle w, x_2 \rangle) \big\}. $$
Indeed, $\mathcal{F}_{RF}(W)$ is the RKHS defined by the following finite-rank approximation of this kernel:
$$ K_N(x_1, x_2) = \frac{1}{N} \sum_{i=1}^{N} \sigma(\langle w_i, x_1 \rangle) \, \sigma(\langle w_i, x_2 \rangle). $$
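The convergence of the finite-rank kernel $K_N$ to the population kernel $K$ is easy to check by Monte Carlo: two independent draws of $W$ with large $N$ produce nearly identical kernel values. Dimensions, activation, and sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 30
relu = lambda u: np.maximum(u, 0.0)

def sphere(r):
    v = rng.standard_normal(d)
    return r * v / np.linalg.norm(v)

x1, x2 = sphere(np.sqrt(d)), sphere(np.sqrt(d))

def K_N(N):
    """Finite-rank kernel (1/N) sum_i sigma(<w_i, x1>) sigma(<w_i, x2>)."""
    W = rng.standard_normal((N, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    return float(np.mean(relu(W @ x1) * relu(W @ x2)))

# Two independent draws agree closely: both approximate K(x1, x2).
k_a, k_b = K_N(100_000), K_N(100_000)
```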
The most refined analyses provide upper and lower bounds in terms of the eigenvalues of the kernel, which match up to logarithmic terms (see also [Bac13, AM15, RR17] for related work).
Such approximation results can be used to derive risk bounds. Namely, a given function $f$ in a suitable smoothness class (e.g. a Sobolev space) can be approximated by a function in $\mathcal{F}_{RF}(W)$ with a sufficiently small norm of the coefficients $a$. This implies that the risk decays to zero as the number of samples and the number of neurons diverge, for any fixed dimension.
Of course, this approach generally breaks down if the dimension $d$ is large (technically, if it grows with the number of samples and neurons). This 'curse of dimensionality' is already revealed by classical lower bounds in functional approximation, see e.g. [DHM89, Bac17]. However, previous work does not clarify what happens precisely in this high-dimensional regime. In contrast, the picture emerging from our work is remarkably simple. In particular, in the regime $N = O(d^{2-\delta})$, random features models perform vanilla linear regression with respect to the raw features.
Intriguing similarities between some properties of modern deep learning models and large-scale kernel learning were pointed out in recent work. A concrete explanation for this analogy was proposed in [JGH18] via the NT model. This explanation postulates that, for large neural networks, the network weights do not change much during the training phase. Considering a random initialization $\theta_0$, and denoting by $\delta$ the change of the weights during the training phase, we linearize the neural network as
$$ f(x; \theta_0 + \delta) \approx f(x; \theta_0) + \langle \nabla_{\theta} f(x; \theta_0), \delta \rangle. $$
Assuming $f(x; \theta_0) \approx 0$ (which is reasonable for certain random initializations), this suggests that a two-layer neural network learns a model in $\mathcal{F}_{RF}(W) + \mathcal{F}_{NT}(W)$ (if both layers are trained), or simply $\mathcal{F}_{NT}(W)$ (if only the first layer is trained). The analysis of [DZPS18, DLL18, AZLS18, ZCZG18] establishes that this linearization is indeed accurate in a certain highly overparametrized regime, namely when the number of neurons grows polynomially with the sample size, with a sufficiently large exponent. Empirical evidence in the same direction was presented in [LXS19].
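The first-order Taylor expansion behind the NT model is easy to verify numerically on a tiny two-layer network. The sketch below uses tanh so that the remainder is smooth; all sizes, scalings, and the perturbation magnitude are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 8, 16
W0 = rng.standard_normal((N, d)) / np.sqrt(d)   # random initialization
a = rng.standard_normal(N) / np.sqrt(N)         # second layer, held fixed
x = rng.standard_normal(d)

sig = np.tanh
dsig = lambda u: 1.0 - np.tanh(u) ** 2

f = lambda Wm: float(a @ sig(Wm @ x))

# NT linearization around W0:
#   f(x; W0 + Delta) ~ f(x; W0) + sum_i <Delta_i, x> a_i sig'(<w_i, x>)
Delta = 1e-4 * rng.standard_normal((N, d))
nt_term = float(np.sum((Delta @ x) * a * dsig(W0 @ x)))
lin = f(W0) + nt_term
err = abs(f(W0 + Delta) - lin)                  # second order in ||Delta||
```

The approximation error is quadratic in the perturbation, which is why the linearization becomes accurate when training moves the weights only slightly.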
Does this mean that large (wide) neural networks can be interpreted as random feature approximations to certain kernel methods? Our results suggest some caution: in high dimension, the actual models learnt by random features methods are surprisingly naive. The recent paper [YS19] also suggests caution by showing that a single neuron cannot be approximated by random feature models with a subexponential number of neurons.
It is worth mentioning that an alternative approach to the analysis of two-layer neural networks, in the limit of a large number of neurons, was developed in [MMN18, RVE18, SS18, CB18, MMM19]. Unlike the neural tangent approach, this theory describes the evolution of the network weights beyond the linear regime.
3 Technical background
In this section we introduce some notation and technical background which will be useful for the proofs in the next sections. In particular, we will use decompositions in (hyper-)spherical harmonics on the sphere and in orthogonal polynomials on the real line. All of the properties listed below are classical; we will however prove a few facts that are slightly less standard. We refer the reader to [EF14, Sze39, Chi11] for further information on these topics.
3.1 Functional spaces over the sphere
For $d \ge 1$ and $r > 0$, we let $\mathbb{S}^{d-1}(r)$ denote the sphere with radius $r$ in $\mathbb{R}^d$. We will mostly work with the sphere of radius $\sqrt{d}$, and will denote by $\tau_d$ the uniform probability measure on $\mathbb{S}^{d-1}(\sqrt{d})$. All functions in the following are assumed to be elements of $L^2(\mathbb{S}^{d-1}(\sqrt{d}), \tau_d)$, with scalar product and norm denoted as $\langle \cdot, \cdot \rangle_{L^2}$ and $\| \cdot \|_{L^2}$:
$$ \langle f, g \rangle_{L^2} = \int_{\mathbb{S}^{d-1}(\sqrt{d})} f(x) \, g(x) \, \tau_d(\mathrm{d}x). $$
For $k \ge 0$, let $\tilde{V}_{d,k}$ be the space of homogeneous harmonic polynomials of degree $k$ on $\mathbb{R}^d$ (i.e. homogeneous polynomials $q$ satisfying $\Delta q = 0$), and denote by $V_{d,k}$ the linear space of functions obtained by restricting the polynomials in $\tilde{V}_{d,k}$ to $\mathbb{S}^{d-1}(\sqrt{d})$. With these definitions, we have the following orthogonal decomposition:
$$ L^2(\mathbb{S}^{d-1}(\sqrt{d}), \tau_d) = \bigoplus_{k=0}^{\infty} V_{d,k}. $$
The dimension of each subspace is given by
$$ \dim(V_{d,k}) = B(d, k) = \frac{2k + d - 2}{k} \binom{k + d - 3}{k - 1}, \qquad k \ge 1, $$
with $B(d, 0) = 1$.
For each $k$, the spherical harmonics $\{ Y_{k,j}^{(d)} \}_{1 \le j \le B(d,k)}$ (with $B(d,k) = \dim(V_{d,k})$) form an orthonormal basis of $V_{d,k}$:
$$ \langle Y_{k,j}^{(d)}, Y_{k',j'}^{(d)} \rangle_{L^2} = \delta_{k,k'} \, \delta_{j,j'}. $$
Note that our convention is different from the more standard one, which defines the spherical harmonics as functions on $\mathbb{S}^{d-1}(1)$. It is immediate to pass from one convention to the other by a simple scaling. We will drop the superscript $(d)$ and write $Y_{k,j}$ whenever clear from the context.
We denote by $P_k$ the orthogonal projection onto $V_{d,k}$ in $L^2(\mathbb{S}^{d-1}(\sqrt{d}), \tau_d)$. This can be written in terms of spherical harmonics as
$$ P_k f(x) = \sum_{j} \langle f, Y_{k,j} \rangle_{L^2} \, Y_{k,j}(x), $$
the sum running over an orthonormal basis of $V_{d,k}$. We also define $P_{\le \ell} = \sum_{k=0}^{\ell} P_k$, $P_{> \ell} = I - P_{\le \ell}$, and $P_{< \ell} = P_{\le \ell - 1}$, $P_{\ge \ell} = I - P_{< \ell}$.
3.2 Gegenbauer polynomials
The $k$-th Gegenbauer polynomial $Q_k^{(d)}$ is a polynomial of degree $k$. Consistently with our convention for spherical harmonics, we view $Q_k^{(d)}$ as a function on $[-d, d]$. The set $\{ Q_k^{(d)} \}_{k \ge 0}$ forms an orthogonal basis of $L^2([-d, d], \tilde{\tau}_d)$, where $\tilde{\tau}_d$ is the distribution of $\langle x_1, x_2 \rangle$ when $x_1, x_2 \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently, satisfying the normalization condition:
$$ \langle Q_j^{(d)}, Q_k^{(d)} \rangle_{L^2(\tilde{\tau}_d)} = \frac{1}{B(d, k)} \, \delta_{j,k}, $$
with $B(d, k) = \dim(V_{d,k})$. In particular, they are normalized so that $Q_k^{(d)}(d) = 1$. As above, we will omit the superscript $(d)$ when clear from the context.
Gegenbauer polynomials are directly related to spherical harmonics as follows. Fix $v \in \mathbb{S}^{d-1}(\sqrt{d})$ and consider the subspace of $V_{d,k}$ formed by all functions that are invariant under rotations of $\mathbb{R}^d$ that keep $v$ unchanged. It is not hard to see that this subspace has dimension one, and coincides with the span of the function $x \mapsto Q_k^{(d)}(\langle v, x \rangle)$.
We will use the following properties of Gegenbauer polynomials
3.3 Hermite polynomials
The Hermite polynomials $\{ \mathrm{He}_k \}_{k \ge 0}$ form an orthogonal basis of $L^2(\mathbb{R}, \gamma)$, where $\gamma(\mathrm{d}x) = e^{-x^2/2} \, \mathrm{d}x / \sqrt{2\pi}$ is the standard Gaussian measure, and $\mathrm{He}_k$ has degree $k$. We will follow the classical normalization (here and below, expectation is with respect to $G \sim \mathsf{N}(0, 1)$):
$$ \mathbb{E}\{ \mathrm{He}_j(G) \, \mathrm{He}_k(G) \} = k! \, \delta_{j,k}. $$
As a consequence, for any function $\varphi \in L^2(\mathbb{R}, \gamma)$, we have the decomposition
$$ \varphi(x) = \sum_{k=0}^{\infty} \frac{\mu_k(\varphi)}{k!} \, \mathrm{He}_k(x), \qquad \mu_k(\varphi) = \mathbb{E}\{ \varphi(G) \, \mathrm{He}_k(G) \}. $$
The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials are constructed by Gram-Schmidt orthogonalization of the monomials $\{ t^k \}_{k \ge 0}$ with respect to the measure $\tilde{\tau}_d$, while Hermite polynomials are obtained by Gram-Schmidt orthogonalization with respect to the Gaussian measure $\gamma$. Since the law of $\langle x_1, x_2 \rangle / \sqrt{d}$ converges weakly to $\gamma$ as $d \to \infty$, it is immediate to show that, for any fixed integer $k$, the coefficients of the rescaled and $L^2$-normalized Gegenbauer polynomial $t \mapsto Q_k^{(d)}(\sqrt{d}\, t)$ converge to the coefficients of the normalized Hermite polynomial $\mathrm{He}_k / \sqrt{k!}$. Here and below, for $p$ a polynomial, $\mathrm{coeff}(p)$ denotes the vector of the coefficients of $p$.
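This convergence can be checked numerically by running Gram-Schmidt under the two measures on a fine grid. The sketch below uses the law of a single coordinate of $\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, which coincides with the law of $\langle x_1, x_2 \rangle / \sqrt{d}$; the grid, the value of $d$, and the maximum degree are illustrative choices.

```python
import numpy as np

KMAX = 3

def gs_coeffs(t, w):
    """Coefficient vectors (monomial basis) of orthonormal polynomials for the
    discretized measure w(t) dt, via Gram-Schmidt on the monomials 1, t, t^2, ..."""
    dt = t[1] - t[0]
    V = np.vander(t, KMAX + 1, increasing=True)        # columns 1, t, ..., t^KMAX
    inner = lambda a, b: np.sum((V @ a) * (V @ b) * w) * dt
    polys = []
    for k in range(KMAX + 1):
        v = np.zeros(KMAX + 1)
        v[k] = 1.0
        for q in polys:
            v = v - inner(v, q) * q
        polys.append(v / np.sqrt(inner(v, v)))
    return polys

d = 4000
t = np.linspace(-12.0, 12.0, 200_001)                  # mass outside is negligible
w_sph = np.maximum(1.0 - t**2 / d, 0.0) ** ((d - 3) / 2)
w_sph /= np.sum(w_sph) * (t[1] - t[0])                 # coordinate law on the sphere
w_gau = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)     # standard Gaussian

geg = gs_coeffs(t, w_sph)   # rescaled, normalized Gegenbauer
her = gs_coeffs(t, w_gau)   # normalized Hermite he_k = He_k / sqrt(k!)
```

For $d$ in the thousands the two coefficient vectors already agree to a few decimal places, consistent with an $O(1/d)$ rate.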
Throughout the proofs, $O_d(\cdot)$ (resp. $o_d(\cdot)$) denotes the standard big-O (resp. little-o) notation, where the subscript emphasizes the asymptotic variable. We denote by $O_{d,\mathbb{P}}(\cdot)$ (resp. $o_{d,\mathbb{P}}(\cdot)$) the big-O (resp. little-o) in probability notation: $h_1(d) = O_{d,\mathbb{P}}(h_2(d))$ if, for any $\varepsilon > 0$, there exist $C_\varepsilon > 0$ and $d_\varepsilon$ such that
$$ \mathbb{P}\big( |h_1(d)| > C_\varepsilon \, |h_2(d)| \big) \le \varepsilon, \qquad \forall \, d \ge d_\varepsilon; $$
and respectively $h_1(d) = o_{d,\mathbb{P}}(h_2(d))$ if $h_1(d)/h_2(d)$ converges to $0$ in probability.
We will occasionally hide logarithmic factors using the notation $\tilde{O}_d(\cdot)$ (resp. $\tilde{o}_d(\cdot)$): $h_1(d) = \tilde{O}_d(h_2(d))$ if there exists a constant $C$ such that $h_1(d) \le C (\log d)^C \, h_2(d)$. Similarly, we will denote by $\tilde{O}_{d,\mathbb{P}}(\cdot)$ (resp. $\tilde{o}_{d,\mathbb{P}}(\cdot)$) the big-O (resp. little-o) in probability notation up to logarithmic factors.
4 Proof of Theorem 1, RF model
4.1 Preliminaries
We begin with some notation and simple remarks. Assume $\sigma$ is an activation function with $|\sigma(u)| \le c_0 e^{c_1 |u|}$ for some constants $c_0, c_1 > 0$. Then the following hold:
1. $\sigma \in L^2(\mathbb{R}, \gamma)$, i.e. $\mathbb{E}\{ \sigma(G)^2 \} < \infty$ for $G \sim \mathsf{N}(0, 1)$.
2. Let $T_d = \langle w, x \rangle$ for a fixed $w \in \mathbb{S}^{d-1}(1)$ and $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$. Then there exists a constant $C < \infty$ such that, for all $d$ large enough, $\mathbb{E}\{ \sigma(T_d)^2 \} \le C$.
3. There exists a coupling of $T_d$ and $G \sim \mathsf{N}(0, 1)$ such that $T_d \to G$ almost surely as $d \to \infty$.
Claim 1 is obvious.
For claim 2, note that the probability distribution of $T_d$, when $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, is given by
$$ p_d(t) = \frac{1}{Z_d} \Big( 1 - \frac{t^2}{d} \Big)_+^{(d-3)/2}, $$
with $Z_d$ a normalization constant. A simple calculation shows that $Z_d \to \sqrt{2\pi}$ as $d \to \infty$, and hence $p_d(t) \le C \, e^{-t^2/4}$ for all $d$ larger than an absolute constant. Therefore
$$ \mathbb{E}\{ \sigma(T_d)^2 \} \le C \int_{\mathbb{R}} c_0^2 \, e^{2 c_1 |t|} \, e^{-t^2/4} \, \mathrm{d}t < \infty. $$
Finally, for claim 3, without loss of generality we take $w = e_1$, so that $T_d = x_1$. Letting $g \sim \mathsf{N}(0, \mathrm{I}_d)$ be a standard Gaussian vector, we construct the coupling via
$$ T_d = \sqrt{d} \, \frac{g_1}{\| g \|_2}, \qquad G = g_1. $$
Indeed, $\sqrt{d} \, g / \| g \|_2 \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, and $\sqrt{d} / \| g \|_2 \to 1$ almost surely by the law of large numbers, so $T_d - G \to 0$ almost surely. (A truncation argument, approximating $\sigma$ by a bounded continuous function, upgrades this to convergence of the corresponding expectations.) ∎
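The coupling in claim 3 is easy to simulate: representing the sphere point via a normalized Gaussian vector makes $T_d$ and $G$ share the same underlying randomness. A quick numerical check (dimension and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 400, 20_000

g = rng.standard_normal((n, d))
G = g[:, 0]                                            # standard Gaussian
T = np.sqrt(d) * g[:, 0] / np.linalg.norm(g, axis=1)   # law of <w, x>, x ~ Unif(S^{d-1}(sqrt d))

coupling_gap = float(np.mean((T - G) ** 2))            # -> 0 as d grows, roughly like 1/d
```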
We denote the Hermite decomposition of $\sigma$ by
$$ \sigma(u) = \sum_{k=0}^{\infty} \frac{\mu_k(\sigma)}{k!} \, \mathrm{He}_k(u), \qquad \mu_k(\sigma) = \mathbb{E}\{ \sigma(G) \, \mathrm{He}_k(G) \}. $$
For future reference, we state separately the two assumptions we use to prove Theorem 1 for the RF model.
Assumption 1 (Integrability condition). There exist constants $c_0, c_1 > 0$ such that, for all $u \in \mathbb{R}$, $|\sigma(u)| \le c_0 e^{c_1 |u|}$.
Assumption 2 (Non-trivial Hermite components). The activation function $\sigma$ is not a polynomial of degree at most $\ell$. Equivalently, there exists $k > \ell$ such that $\mu_k(\sigma) \neq 0$.
4.2 Proof of Theorem 1: Outline
Recall that $w_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(1))$ independently. We define $\bar{w}_i = \sqrt{d} \, w_i$ for $i \le N$, so that $\bar{w}_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently. Let $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, independent of $W$. We denote by $\mathbb{E}_x$ the expectation operator with respect to $x$, by $\mathbb{E}_W$ the expectation operator with respect to $W$, and by $\mathbb{E}_{x,W}$ the expectation operator with respect to both.
Define the random vectors , , , with
Define the random matrix