Linearized two-layers neural networks in high dimension

04/27/2019, by Behrooz Ghorbani et al.

We consider the problem of learning an unknown function f_d on the d-dimensional sphere with respect to the square loss, given i.i.d. samples {(y_i, x_i)}_{i ≤ n} where x_i is a feature vector uniformly distributed on the sphere and y_i = f_d(x_i). We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: (RF) The random feature model of Rahimi-Recht; (NT) The neural tangent kernel model of Jacot-Gabriel-Hongler. Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and hence enjoy universal approximation properties when the number of neurons N diverges, for a fixed dimension d. We prove that, if both d and N are large, the behavior of these models is instead remarkably simpler. If N = o(d^2), then RF performs no better than linear regression with respect to the raw features x_i, and NT performs no better than linear regression with respect to degree-one and degree-two monomials in the x_i. More generally, if N = o(d^{ℓ+1}) then RF fits at most a degree-ℓ polynomial in the raw features, and NT fits at most a degree-(ℓ+1) polynomial.


1 Introduction and main results

In the canonical statistical learning problem, we are given independent and identically distributed (i.i.d.) pairs (y_i, x_i), i ≤ n, where x_i ∈ R^d is a feature vector and y_i ∈ R is a label or response variable. We would like to construct a function f: R^d → R which allows us to predict future responses. Throughout this paper, we will measure the quality of a predictor via its square prediction error (risk): R(f) = E{[y − f(x)]²}.

Current practice suggests that —for a number of important applications— the best learning method is a multi-layer neural network. The simplest model in this class is given by two-layers networks (NN):

    f(x; a, W) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩).    (NN)

Here N is the number of neurons and σ: R → R is an activation function. Over the last several years, considerable attention has been devoted to two classes of models that can be regarded as linearizations of two-layers networks. The first one is the random features model of Rahimi and Recht [RR08], which only optimizes over the weights a_i's, while keeping the first layer fixed:

    F_RF(W) = { f(x; a) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a ∈ R^N }.    (RF)

Here W ∈ R^{N×d} is a matrix whose i-th row is the vector w_i. In the RF model, this matrix is chosen randomly, and independent of the data.

The second model is the neural tangent kernel model of Jacot, Gabriel and Hongler [JGH18], which we define as

    F_NT(W) = { f(x; S) = Σ_{i=1}^N ⟨s_i, x⟩ σ'(⟨w_i, x⟩) : s_i ∈ R^d, 1 ≤ i ≤ N }.    (NT)

Again, W is a matrix of weights that is not optimized over, but instead drawn at random. Further, σ' is the derivative of the activation function with respect to its argument (if the distribution of the w_i's has a density, σ only needs to be weakly differentiable). This class can be viewed as the first-order Taylor expansion of the neural network model (NN) around a random initialization [JGH18]. Several recent papers argue that this linearization indeed captures the behavior of the original neural network, when the latter is fitted using stochastic gradient descent (SGD), and provided the model is sufficiently overparametrized (see Section 2 for pointers to this line of work).
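To fix ideas, here is a minimal sketch of the two linearized classes for a generic activation. The Gaussian weight initialization, the ReLU activation and the sizes below are illustrative assumptions (in the setting of this paper the weights are drawn uniformly on a sphere); the point is only to make explicit which parameters are free in RF (the vector a) and in NT (the matrix S).

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 50
sigma = lambda u: np.maximum(u, 0.0)           # ReLU, as an illustrative activation
sigma_prime = lambda u: (u > 0).astype(float)  # its (weak) derivative

W = rng.standard_normal((N, d)) / np.sqrt(d)   # random first-layer weights, never optimized

def f_RF(x, a):
    """Random features model: only the second-layer coefficients a in R^N are free."""
    return a @ sigma(W @ x)

def f_NT(x, S):
    """Neural tangent model: one free vector s_i in R^d per neuron (N*d parameters)."""
    return np.sum((S @ x) * sigma_prime(W @ x))

x = rng.standard_normal(d)
print(f_RF(x, rng.standard_normal(N)))         # scalar prediction, N free parameters
print(f_NT(x, rng.standard_normal((N, d))))    # scalar prediction, N*d free parameters
```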

1.1 A numerical experiment

Figure 1: Risk of the random features model RF for learning a quadratic function, for four parameter settings (top left, top right, bottom left and bottom right). We use least squares to estimate the model coefficients from n samples and report the test error over fresh samples. Data points correspond to averages over independent repetitions, and the risk is normalized by the risk of the trivial (constant) predictor.

Figure 2: Risk (test error) of the neural tangent model NT in learning a quadratic function, for two parameter settings (left and right frames). The other settings are the same as in Figure 1.

Figure 3: Risk (test error) of the neural tangent model NT in learning a third-order polynomial, for two parameter settings (left and right frames). The other settings are the same as in Figures 1 and 2.

The starting point of this paper is a simple —and yet surprising— simulation study. We consider feature vectors x_i ∈ R^d normalized so that ‖x_i‖₂ = √d, and otherwise uniformly random, and responses y_i = f_d(x_i), for a certain function f_d. Indeed, this will be the setting throughout the paper: x_i ∼ Unif(S^{d−1}(√d)) (where S^{d−1}(r) denotes the sphere with radius r in d dimensions) and y_i = f_d(x_i). We draw random first-layer weights w_1, …, w_N uniformly on the sphere, and use n samples to learn a model in F_RF(W) or F_NT(W). We estimate the risk (test error) using fresh samples, and normalize it by the risk of the trivial constant predictor.

Figures 1, 2, 3 report the results of such a simulation using RF –for Figure 1– and NT –for Figures 2 and 3 (which differ by the choice of the function f_d). We use shifted ReLU activations (a ReLU shifted by a constant offset), and learn the model parameters using least squares. If the model is overparametrized, we select the minimum ℓ2-norm solution. (We refer to Appendix A for simulations using ridge regression instead.)
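The following sketch re-implements the basic experiment in the spirit of Figures 1-3, using numpy's least-squares solver (which returns the minimum-norm solution in the overparametrized case). The shift in the activation, the target function f_d and the problem sizes are placeholders chosen for illustration, not the exact values used in the figures.

```python
import numpy as np

def sphere(m, d, r, rng):
    """m points drawn uniformly from the sphere of radius r in R^d."""
    z = rng.standard_normal((m, d))
    return r * z / np.linalg.norm(z, axis=1, keepdims=True)

def normalized_risk(F_train, F_test, y_train, y_test):
    """Least-squares fit (minimum-l2-norm solution when overparametrized);
    test risk normalized by the risk of the trivial constant predictor."""
    coef, *_ = np.linalg.lstsq(F_train, y_train, rcond=None)
    return np.mean((y_test - F_test @ coef) ** 2) / np.mean((y_test - y_test.mean()) ** 2)

rng = np.random.default_rng(1)
d, N, n_train, n_test = 40, 400, 4000, 4000
shift = 0.5                                        # illustrative offset for the "shifted ReLU"
act = lambda u: np.maximum(u - shift, 0.0)

# Illustrative centred quadratic target (not necessarily the f_d used in the figures):
beta = rng.standard_normal(d) / np.sqrt(d)
f_d = lambda X: (X @ beta) ** 2 - np.sum(beta ** 2)   # E[<beta, x>^2] = ||beta||^2 on S^{d-1}(sqrt(d))

W = sphere(N, d, 1.0, rng)                         # random first-layer weights, held fixed
X_tr, X_te = sphere(n_train, d, np.sqrt(d), rng), sphere(n_test, d, np.sqrt(d), rng)
risk = normalized_risk(act(X_tr @ W.T), act(X_te @ W.T), f_d(X_tr), f_d(X_te))
print(f"normalized RF test risk: {risk:.3f}")      # Figure 1 regime: expected near 1 for large d when N << d^2
```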

The results are somewhat disappointing: in two cases (the first and third figures) these advanced machine learning methods do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and its risk appears to decrease towards zero as the number of samples gets large. We also note that the risk shows a cusp when n is close to the number of parameters of the model (N for RF, and Nd for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19] for relevant work). We will instead focus on the population behavior n = ∞.

Figure 4: Upper bounds on the optimal risk of the neural network model NN when used to learn the third-order polynomial of Figure 3 (same target function), for two parameter settings (left and right frames). We use n train samples and report the test error over fresh samples. Data points correspond to averages over independent repetitions, and the risk is normalized by the risk of the trivial (constant) predictor. Training uses oracle knowledge of the function f_d.

The reader might wonder whether these poor performances are due to the choice of an extremely complex ground truth f_d. This is not the case: Figures 1 and 2 use a simple quadratic function, and in Figure 3 we instead try to learn a third-order polynomial. In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third-order polynomial. This is surprising, especially in view of two remarks:

  • General theory implies that both these functions can be represented arbitrarily well with an unbounded number of neurons (see, e.g., [Cyb89]).

  • There exist models in F_NN, with a number of neurons N comparable to the ones used in these simulations, that achieve a small risk at both of these target functions (see, e.g., [Bac17], or [MMN18, Proposition 1]).

We demonstrate the second point empirically in Fig. 4 by choosing first-layer weight vectors w_i aligned with the directions that enter the definition of f_d, where the directions are selected by i.i.d. uniformly random indices, and the scaling factor is chosen appropriately. Fixing these random first-layer weights, we fit the second-layer weights by least squares. The risk achieved is an upper bound on the minimum risk of the NN model, namely the infimum of R(f) over all networks f of the form (NN), and is significantly smaller than the trivial baseline. (The risk reported in Fig. 4 can also be interpreted as a 'random features' risk. However, the specific distribution of the vectors w_i is tailored to the function f_d, and hence not achievable within the RF model.)

1.2 Main results

The origin of the mismatch between classical theory and the findings of Figures 1 to 3 is that universal approximation results apply to the case of a fixed dimension d, as the number of neurons N grows to infinity. In this paper we focus on the population behavior (i.e. n = ∞) and unveil a remarkably simple behavior of the models RF and NT when N and d grow together. Our results can be summarized as follows:

  1. If N = o(d²), then RF does not outperform linear regression with respect to the raw covariates (i.e. least squares with the model f(x) = b₀ + ⟨b, x⟩, b₀ ∈ R, b ∈ R^d); a numerical illustration of points 1 and 2 is sketched right after this list.

    More generally, if N = o(d^{ℓ+1}), then RF does not outperform linear regression over all monomials of degree at most ℓ in x.

  2. If N = o(d²), then NT does not outperform linear regression over monomials of degree at most two in x (i.e. least squares with the model f(x) = b₀ + ⟨b, x⟩ + ⟨x, B x⟩, b₀ ∈ R, b ∈ R^d, B ∈ R^{d×d}).

    More generally, if N = o(d^{ℓ+1}), then NT does not outperform linear regression over all monomials of degree at most ℓ + 1 in x.
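A quick way to see points 1 and 2 side by side is to fit, on the same centred quadratic target, plain linear regression on the raw covariates, the RF model, and the NT model. The target, the activation and the sizes below are illustrative choices; the comments describe what the theory predicts in the high-dimensional limit rather than what any single run must show.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, n = 25, 150, 2000                      # N well below d^2 = 625
relu = lambda u: np.maximum(u, 0.0)
step = lambda u: (u > 0).astype(float)       # weak derivative of the ReLU

def sphere(m, dim, r):
    z = rng.standard_normal((m, dim))
    return r * z / np.linalg.norm(z, axis=1, keepdims=True)

def risk(F_tr, F_te, y_tr, y_te):
    """Min-norm least squares; test risk normalized by the trivial constant predictor."""
    coef, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
    return np.mean((y_te - F_te @ coef) ** 2) / np.mean((y_te - y_te.mean()) ** 2)

beta = rng.standard_normal(d) / np.sqrt(d)
f_d = lambda X: (X @ beta) ** 2 - np.sum(beta ** 2)      # centred quadratic, no linear part

X_tr, X_te = sphere(n, d, np.sqrt(d)), sphere(n, d, np.sqrt(d))
y_tr, y_te = f_d(X_tr), f_d(X_te)
W = sphere(N, d, 1.0)

def nt(X):   # NT features: x * sigma'(<w_i, x>), stacked over neurons -> N*d columns
    return (step(X @ W.T)[:, :, None] * X[:, None, :]).reshape(len(X), -1)

lin = lambda X: np.hstack([np.ones((len(X), 1)), X])      # intercept + raw covariates
print("linear:", round(risk(lin(X_tr), lin(X_te), y_tr, y_te), 3))
print("RF    :", round(risk(relu(X_tr @ W.T), relu(X_te @ W.T), y_tr, y_te), 3))
print("NT    :", round(risk(nt(X_tr), nt(X_te), y_tr, y_te), 3))
# Theory (points 1 and 2): for N = o(d^2), RF should asymptotically do no better than the linear
# fit on this purely quadratic target, while NT (whose span contains degree-2 components) can improve on it.
```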

In the following, we state these results formally. We define the minimum population risk of any of the two models by

    R_M(f_d; W) = inf_{f ∈ F_M(W)} E{ [f_d(x) − f(x)]² },    M ∈ {RF, NT}.    (1)

Notice that this is a random variable because of the random features encoded in the matrix W. Also, it depends implicitly on the dimension d, but we will make this dependence explicit only when necessary.

For ℓ ≥ 0, we denote by P_{≤ℓ} the orthogonal projector onto the subspace of polynomials of degree at most ℓ. (We also let P_{>ℓ} ≡ I − P_{≤ℓ}.) In other words, P_{≤ℓ} f_d is the function obtained by linear regression of f_d onto monomials of degree at most ℓ.
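Since monomials of degree at most ℓ restricted to the sphere span exactly the polynomials of degree at most ℓ, the projection P_{≤ℓ} f_d can be approximated by a Monte Carlo least-squares regression onto such monomials. The test function and the sizes below are illustrative; the residual mean square approximates ‖P_{>ℓ} f_d‖²_{L²}.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomials(X, ell):
    """Design matrix of all monomials of degree <= ell in x (including the constant)."""
    cols = [np.ones(len(X))]
    for deg in range(1, ell + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)

def low_degree_fit(f, d, ell, n=20_000, seed=0):
    """Monte Carlo surrogate for P_{<=ell} f on S^{d-1}(sqrt(d)): regress f onto monomials of
    degree <= ell; the mean squared residual approximates ||P_{>ell} f||^2."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    X = np.sqrt(d) * z / np.linalg.norm(z, axis=1, keepdims=True)
    Phi, y = monomials(X, ell), f(X)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coef, np.mean((y - Phi @ coef) ** 2)

d = 15
f = lambda X: X[:, 0] * X[:, 1] + 0.5 * X[:, 0] ** 3        # degree-3 test function (illustrative)
for ell in (1, 2, 3):
    _, resid = low_degree_fit(f, d, ell)
    print(ell, round(resid, 3))       # estimate of ||P_{>ell} f||^2: positive for ell = 1, 2, ~0 for ell = 3
```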

Our main theorems formalize the above discussion, under certain technical conditions on the activation function. For the RF model we only require that σ does not grow too fast at infinity (in particular, exponential growth is fine), and is not a low-degree polynomial (but non-trivial results are obtained already for a degree-ℓ polynomial activation).

Theorem 1.2 [Risk of the RF model]. Assume N = o(d^{ℓ+1}) for a fixed integer ℓ, and let {f_d}_{d ≥ 1} be a sequence of functions f_d ∈ L²(S^{d−1}(√d)). Let σ be an activation function such that σ(u)² ≤ c₀ e^{c₁ u²/2} for some constants c₀, c₁, with c₁ < 1. Further assume that σ is not a polynomial of degree smaller than ℓ.

Let W = (w_i)_{i ≤ N} with w_i ∼ Unif(S^{d−1}(1)) independently. Then, for any ε > 0, the following happens with high probability:

    | R_RF(f_d; W) − R_RF(P_{≤ℓ} f_d; W) − ‖P_{>ℓ} f_d‖²_{L²} | ≤ ε ‖f_d‖²_{L²}.    (2)

For the NT model we require the same growth condition, although on the weak derivative of σ, denoted by σ'. We further require the Hermite decomposition of σ' to satisfy a mild 'genericity' condition. Recall that the k-th Hermite coefficient of a function h ∈ L²(R, γ) can be defined as μ_k(h) = E{h(G) He_k(G)}, G ∼ N(0, 1), where He_k is the k-th Hermite polynomial (see Section 3 for further background).

Theorem 1.3 [Risk of the NT model]. Assume N = o(d^{ℓ+1}) for a fixed integer ℓ, and let {f_d}_{d ≥ 1} be a sequence of functions f_d ∈ L²(S^{d−1}(√d)). Let σ be an activation function which is weakly differentiable, with weak derivative σ' such that σ'(u)² ≤ c₀ e^{c₁ u²/2} for some constants c₀, c₁, with c₁ < 1. Further assume the Hermite coefficients {μ_k(σ')}_{k ≥ 0} to be such that there exists k such that μ_k(σ') ≠ 0 and

(3)

Let W = (w_i)_{i ≤ N} with w_i ∼ Unif(S^{d−1}(1)) independently. Then, for any ε > 0, the following happens with high probability:

    | R_NT(f_d; W) − R_NT(P_{≤ℓ+1} f_d; W) − ‖P_{>ℓ+1} f_d‖²_{L²} | ≤ ε ‖f_d‖²_{L²}.    (4)

In words, Eq. (2) says that the risk of the random features model can be approximately decomposed into two parts, each non-negative, and each with a simple interpretation:

    R_RF(f_d; W) ≈ R_RF(P_{≤ℓ} f_d; W) + ‖P_{>ℓ} f_d‖²_{L²}.    (5)

The second contribution, ‖P_{>ℓ} f_d‖²_{L²}, is simply the risk achieved by linear regression with respect to polynomials of degree at most ℓ. In the special case ℓ = 1, this is the risk of simple linear regression with respect to the raw features. The first contribution, R_RF(P_{≤ℓ} f_d; W), is the risk of the RF model when applied to the low-degree component of f_d (the linear component for ℓ = 1). In general this will be strictly positive. Equation (4) yields a similar decomposition for the NT model, with degree ℓ + 1 in place of ℓ. It is easy to check that the conditions on the activation function in Theorem 1.2 and Theorem 1.3 hold, for all ℓ, for all commonly used activations.

For instance, the ReLU activation σ(u) = max(u, 0) obviously satisfies the assumptions of Theorem 1.2 (it has subexponential growth and is not a polynomial). As for Theorem 1.3, the weak derivative of the ReLU is the step function σ'(u) = 1_{u > 0}, which has subexponential growth as well. Further, its Hermite coefficients are μ₀(σ') = 1/2, μ₁(σ') = 1/√(2π), and

    μ_{2k}(σ') = 0,    μ_{2k+1}(σ') = (−1)^k (2k − 1)!! / √(2π),    k ≥ 1,    (6)

which satisfy the required condition for each ℓ. (In checking the condition, it might be useful to notice the relation μ_k(σ') = μ_{k+1}(σ), which follows from Gaussian integration by parts.)
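These Hermite coefficients are easy to check numerically. The sketch below estimates μ_k by Monte Carlo for the ReLU and its weak derivative, and also verifies the integration-by-parts relation μ_k(σ') = μ_{k+1}(σ) mentioned above; the sample size is an arbitrary choice and the estimates carry the usual Monte Carlo error.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(5)
G = rng.standard_normal(2_000_000)

def mu(f, k):
    """Monte Carlo estimate of the k-th Hermite coefficient mu_k(f) = E[f(G) He_k(G)]."""
    return np.mean(f(G) * hermeval(G, [0.0] * k + [1.0]))

relu = lambda u: np.maximum(u, 0.0)
step = lambda u: (u > 0).astype(float)          # weak derivative of the ReLU

print("mu_0(sigma') =", round(mu(step, 0), 3))  # exact value 1/2
print("mu_1(sigma') =", round(mu(step, 1), 3))  # exact value 1/sqrt(2*pi) ~ 0.399
print("mu_3(sigma') =", round(mu(step, 3), 3))  # ~ -0.399 = -1/sqrt(2*pi), consistent with Eq. (6)
for k in range(4):
    # Gaussian integration by parts: mu_k(sigma') = mu_{k+1}(sigma); the two estimates agree
    # up to Monte Carlo error.
    print(k, round(mu(step, k), 3), round(mu(relu, k + 1), 3))
```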

In the next section, we briefly overview the related literature. Section 3 provides some technical background, in particular on orthogonal polynomials, that is useful for the proofs. We prove the statement for the RF model, Theorem 1.2, in Section 4. The proof for the NT model, Theorem 1.3, is similar but technically more involved, and is presented in Section 5.

2 Related work

Approximation properties of neural networks and, more generally, nonlinear approximation have been studied in detail in the nineties, see e.g. [DHM89, GJP95, Mha96]. The main concern of the present paper is quite different, since we focus on the random features model, and the (recently proposed) neural tangent model. Further, our focus is on the high-dimensional regime in which N and d grow together. Most of the approximation theory literature considers d fixed, and N → ∞.

The random features model RF has been studied in considerable depth since the original work in [RR08]. The classical viewpoint suggests that F_RF(W) should be regarded as an approximation of the reproducing kernel Hilbert space (RKHS) defined by the kernel (see [BTA11] for general background)

    K(x₁, x₂) = E_w{ σ(⟨w, x₁⟩) σ(⟨w, x₂⟩) }.    (7)

Indeed the space F_RF(W) is the RKHS defined by the following finite-rank approximation of this kernel:

    K_N(x₁, x₂) = (1/N) Σ_{i=1}^N σ(⟨w_i, x₁⟩) σ(⟨w_i, x₂⟩).    (8)

The paper [RR08] proved the convergence of K_N to K as N → ∞. Subsequent work established quantitative approximation guarantees for the random features model F_RF(W) with respect to the RKHS of K. In particular, [Bac17] provides upper and lower bounds in terms of the eigenvalues of the kernel K, which match up to logarithmic terms (see also [Bac13, AM15, RR17] for related work).
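The convergence of the finite-rank kernel K_N to K is straightforward to observe numerically. In the sketch below both kernels are estimated by sampling random weights uniformly on the unit sphere, with a very large N playing the role of the population kernel; the activation, the dimension and the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 50
relu = lambda u: np.maximum(u, 0.0)

def sphere(m, r):
    z = rng.standard_normal((m, d))
    return r * z / np.linalg.norm(z, axis=1, keepdims=True)

x1, x2 = sphere(2, np.sqrt(d))

def K_N(N):
    """Finite-rank random features kernel: (1/N) sum_i sigma(<w_i, x1>) sigma(<w_i, x2>)."""
    W = sphere(N, 1.0)
    return np.mean(relu(W @ x1) * relu(W @ x2))

K_ref = K_N(200_000)                 # large-N Monte Carlo stand-in for the population kernel K(x1, x2)
for N in (10, 100, 1_000, 10_000):
    print(N, round(abs(K_N(N) - K_ref), 4))   # fluctuations decay roughly like 1/sqrt(N)
```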

Such approximation results can be used to derive risk bounds. Namely, a given function f_d in a suitable smoothness class (e.g. a Sobolev space) can be approximated by a function in F_RF(W), with a sufficiently small norm of the coefficients a. This implies that the risk decays to 0 as the number of samples and the number of neurons diverge, for any fixed dimension.

Of course, this approach generally breaks down if the dimension d is large (technically, if d grows with the sample size and the number of neurons). This 'curse of dimensionality' is already revealed by classical lower bounds in functional approximation, see e.g. [DHM89, Bac17]. However, previous work does not clarify what happens precisely in this high-dimensional regime. In contrast, the picture emerging from our work is remarkably simple. In particular, in the regime N = o(d²), random features models perform vanilla linear regression with respect to the raw features.

The connection between kernel methods and neural networks was recently revived by the work of Belkin and coauthors [BMM18, BRT18], who pointed out intriguing similarities between some properties of modern deep learning models and large-scale kernel learning. A concrete explanation for this analogy was proposed in [JGH18] via the NT model. This explanation postulates that, for large neural networks, the network weights do not change much during the training phase. Considering a random initialization W₀ = (w_i⁰)_{i ≤ N} and denoting by S = (s_i)_{i ≤ N} the change of the first-layer weights during the training phase, we linearize the neural network as

    f(x; a, W₀ + S) = Σ_{i=1}^N a_i σ(⟨w_i⁰ + s_i, x⟩)    (9)
                    ≈ Σ_{i=1}^N a_i σ(⟨w_i⁰, x⟩) + Σ_{i=1}^N a_i ⟨s_i, x⟩ σ'(⟨w_i⁰, x⟩).    (10)

Assuming the contribution of the first term Σ_i a_i σ(⟨w_i⁰, x⟩) to be negligible (which is reasonable for certain random initializations), this suggests that a two-layers neural network learns a model in F_RF(W₀) + F_NT(W₀) (if both layers are trained), or simply F_NT(W₀) (if only the first layer is trained). The analysis of [DZPS18, DLL18, AZLS18, ZCZG18] establishes that indeed this linearization is accurate in a certain highly overparametrized regime, namely when N ≥ n^{c₀} for a certain constant c₀. Empirical evidence in the same direction was presented in [LXS19].
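The linearization (9)-(10) can be checked directly on a toy two-layer network. In the sketch below only the first-layer weights are perturbed and the second-layer weights are held fixed; the scalings are arbitrary illustrative choices. For the ReLU the expansion is in fact exact as long as no ⟨w_i, x⟩ changes sign under the perturbation.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 10, 20
relu, relu_prime = lambda u: np.maximum(u, 0.0), lambda u: (u > 0).astype(float)

a = rng.standard_normal(N) / np.sqrt(N)          # second-layer weights, held fixed
W0 = rng.standard_normal((N, d)) / np.sqrt(d)    # random initialization of the first layer
x = rng.standard_normal(d)

def f(W):
    """Two-layer network sum_i a_i sigma(<w_i, x>)."""
    return a @ relu(W @ x)

Delta = 1e-3 * rng.standard_normal((N, d))       # small perturbation of the first layer
# NT linearization: f(W0 + Delta) ~ f(W0) + sum_i a_i sigma'(<w_i, x>) <Delta_i, x>
lin = f(W0) + np.sum(a * relu_prime(W0 @ x) * (Delta @ x))
print(abs(f(W0 + Delta) - lin))                  # small (exactly 0 if no <w_i, x> changes sign)
```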

Does this mean that large (wide) neural networks can be interpreted as random feature approximations to certain kernel methods? Our results suggest some caution: in high dimension, the actual models learnt by random features methods are surprisingly naive. The recent paper [YS19] also suggests caution by showing that a single neuron cannot be approximated by random feature models with a subexponential number of neurons.

It is worth mentioning that an alternative approach to the analysis of two-layers neural networks, in the limit of a large number of neurons, was developed in [MMN18, RVE18, SS18, CB18, MMM19]. Unlike in the neural tangent approach, the evolution of network weights is described beyond the linear regime in this theory.

3 Technical background

In this section we introduce some notation and technical background which will be useful for the proofs in the next sections. In particular, we will use decompositions in (hyper-)spherical harmonics on the sphere S^{d−1}(√d) and in orthogonal polynomials on the real line. All of the properties listed below are classical; we will however prove a few facts that are slightly less standard. We refer the reader to [EF14, Sze39, Chi11] for further information on these topics.

3.1 Functional spaces over the sphere

For d ≥ 1 and r > 0, we let S^{d−1}(r) = {x ∈ R^d : ‖x‖₂ = r} denote the sphere with radius r in R^d. We will mostly work with the sphere of radius √d, S^{d−1}(√d), and will denote by τ_d the uniform probability measure on S^{d−1}(√d). All functions in the following are assumed to be elements of L²(S^{d−1}(√d), τ_d), with scalar product and norm denoted as ⟨·, ·⟩_{L²} and ‖·‖_{L²}:

    ⟨f, g⟩_{L²} ≡ ∫_{S^{d−1}(√d)} f(x) g(x) τ_d(dx).    (11)

For ℓ ≥ 0, let Ṽ_{d,ℓ} be the space of homogeneous harmonic polynomials of degree ℓ on R^d (i.e. homogeneous polynomials q(x) satisfying Δq(x) = 0), and denote by V_{d,ℓ} the linear space of functions obtained by restricting the polynomials in Ṽ_{d,ℓ} to S^{d−1}(√d). With these definitions, we have the following orthogonal decomposition

    L²(S^{d−1}(√d), τ_d) = ⊕_{ℓ=0}^∞ V_{d,ℓ}.    (12)

The dimension of each subspace is given by

    dim(V_{d,ℓ}) = B(d, ℓ) = C(d+ℓ−1, ℓ) − C(d+ℓ−3, ℓ−2),    (13)

where C(n, k) denotes the binomial coefficient (with the convention C(n, k) = 0 for k < 0).

For each ℓ ≥ 0, the spherical harmonics {Y_{ℓj}^{(d)}}_{1 ≤ j ≤ B(d,ℓ)} form an orthonormal basis of V_{d,ℓ}: ⟨Y_{ℓj}^{(d)}, Y_{ℓ'j'}^{(d)}⟩_{L²} = δ_{ℓℓ'} δ_{jj'}.

Note that our convention is different from the more standard one, which defines the spherical harmonics as functions on the unit sphere S^{d−1}(1). It is immediate to pass from one convention to the other by a simple scaling. We will drop the superscript d and write Y_{ℓj} = Y_{ℓj}^{(d)} whenever clear from the context.

We denote by P_ℓ the orthogonal projection onto V_{d,ℓ} in L²(S^{d−1}(√d), τ_d). This can be written in terms of spherical harmonics as

    P_ℓ f(x) = Σ_{j=1}^{B(d,ℓ)} ⟨f, Y_{ℓj}⟩_{L²} Y_{ℓj}(x).    (14)

We also define P_{≤ℓ} = Σ_{k=0}^ℓ P_k, P_{>ℓ} = I − P_{≤ℓ}, and P_{<ℓ} = P_{≤ℓ−1}, P_{≥ℓ} = I − P_{<ℓ}.
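The dimension B(d, ℓ) in Eq. (13) grows like d^ℓ/ℓ! for fixed ℓ and large d, which is the kind of count that underlies the N = o(d^{ℓ+1}) thresholds in the main theorems. A few lines suffice to check this numerically (the formula below is the expression in Eq. (13), with the small-ℓ cases handled separately to avoid negative binomial arguments).

```python
from math import comb, factorial

def B(d, ell):
    """Dimension of the space of degree-ell spherical harmonics in d dimensions."""
    if ell < 2:
        return 1 if ell == 0 else d
    return comb(d + ell - 1, ell) - comb(d + ell - 3, ell - 2)

for d in (100, 1_000, 10_000):
    for ell in (1, 2, 3):
        ratio = B(d, ell) / (d ** ell / factorial(ell))
        print(d, ell, B(d, ell), round(ratio, 3))   # ratio -> 1: B(d, ell) ~ d^ell / ell!
```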

3.2 Gegenbauer polynomials

The k-th Gegenbauer polynomial Q_k^{(d)} is a polynomial of degree k. Consistently with our convention for spherical harmonics, we view Q_k^{(d)} as a function [−d, d] → R. The set {Q_k^{(d)}}_{k ≥ 0} forms an orthogonal basis of L²([−d, d], τ̃_d), where τ̃_d is the distribution of ⟨x₁, x₂⟩ when x₁, x₂ ∼ Unif(S^{d−1}(√d)) independently, satisfying the normalization condition:

    ⟨Q_j^{(d)}, Q_k^{(d)}⟩_{L²(τ̃_d)} = (1 / B(d, k)) δ_{jk}.    (15)

In particular, they are normalized so that Q_k^{(d)}(d) = 1. As above, we will omit the superscript (d) when clear from the context.

Gegenbauer polynomials are directly related to spherical harmonics as follows. Fix v ∈ S^{d−1}(√d) and consider the subspace of V_{d,ℓ} formed by all functions that are invariant under rotations in R^d that keep v unchanged. It is not hard to see that this subspace has dimension one, and coincides with the span of the function x ↦ Q_ℓ^{(d)}(⟨v, x⟩).

We will use the following properties of Gegenbauer polynomials:

  1. For x, y ∈ S^{d−1}(√d):

    ⟨Q_j^{(d)}(⟨x, ·⟩), Q_k^{(d)}(⟨y, ·⟩)⟩_{L²} = (δ_{jk} / B(d, k)) Q_k^{(d)}(⟨x, y⟩).    (16)

  2. For x, y ∈ S^{d−1}(√d):

    Q_k^{(d)}(⟨x, y⟩) = (1 / B(d, k)) Σ_{i=1}^{B(d,k)} Y_{ki}^{(d)}(x) Y_{ki}^{(d)}(y).    (17)

  3. Recurrence formula:

    (t/d) Q_k^{(d)}(t) = ((k + d − 2) / (2k + d − 2)) Q_{k+1}^{(d)}(t) + (k / (2k + d − 2)) Q_{k−1}^{(d)}(t).    (18)

3.3 Hermite polynomials

The Hermite polynomials {He_k}_{k ≥ 0} form an orthogonal basis of L²(R, γ), where γ(dx) = e^{−x²/2} dx / √(2π) is the standard Gaussian measure, and He_k has degree k. We will follow the classical normalization (here and below, expectation is with respect to G ∼ N(0, 1)):

    E{ He_j(G) He_k(G) } = k! δ_{jk}.    (19)

As a consequence, for any function g ∈ L²(R, γ), we have the decomposition

    g(x) = Σ_{k=0}^∞ (μ_k(g) / k!) He_k(x),    μ_k(g) ≡ E{ g(G) He_k(G) }.    (20)

The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, up to a rescaling of their argument by √d, the Gegenbauer polynomials are constructed by Gram-Schmidt orthogonalization of the monomials with respect to the distribution of ⟨x, e⟩ (for x ∼ Unif(S^{d−1}(√d)) and e a fixed unit vector), while Hermite polynomials are obtained by Gram-Schmidt orthogonalization with respect to γ. Since the law of ⟨x, e⟩ converges weakly to γ as d → ∞, it is immediate to show that, for any fixed integer k,

    lim_{d→∞} Coeff{ B(d, k)^{1/2} Q_k^{(d)}(√d x) } = Coeff{ (k!)^{−1/2} He_k(x) }.    (21)

Here and below, for P a polynomial, Coeff{P} is the vector of the coefficients of P.
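The coefficient convergence in Eq. (21) can be verified directly: the orthonormal polynomials for the law of ⟨x, e⟩ (a rescaling of the Gegenbauer polynomials) can be built from its moments by a Cholesky-based Gram-Schmidt, and compared with the normalized Hermite polynomials taken from numpy. The moment recursion used below is exact for the first coordinate of a uniform point on S^{d−1}(√d); the values of d and K are illustrative.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import herme2poly

def sphere_coord_moments(d, K):
    """Moments E[t^j], j <= 2K, of t = <x, e_1> for x ~ Unif(S^{d-1}(sqrt(d))); odd moments vanish."""
    mom = np.zeros(2 * K + 1)
    mom[0] = 1.0
    for m in range(1, K + 1):
        mom[2 * m] = mom[2 * m - 2] * (2 * m - 1) * d / (d + 2 * m - 2)
    return mom

def orthonormal_polys(moments, K):
    """Rows hold the monomial-basis coefficients of the orthonormal polynomials of degree 0..K
    obtained by Gram-Schmidt of 1, t, ..., t^K with respect to the given moment sequence."""
    M = np.array([[moments[i + j] for j in range(K + 1)] for i in range(K + 1)])
    L = np.linalg.cholesky(M)
    return np.linalg.inv(L)            # lower triangular: row k is the degree-k polynomial

d, K = 1000, 4
P = orthonormal_polys(sphere_coord_moments(d, K), K)
for k in range(K + 1):
    he_k = herme2poly([0.0] * k + [1.0]) / np.sqrt(factorial(k))   # normalized Hermite He_k / sqrt(k!)
    print(k, np.round(P[k, :k + 1], 3), np.round(he_k, 3))         # coefficients agree as d grows
```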

3.4 Notations

Throughout the proofs, O_d(·) (resp. o_d(·)) denotes the standard big-O (resp. little-o) notation, where the subscript emphasizes the asymptotic variable d. We denote by O_{d,P}(·) (resp. o_{d,P}(·)) the big-O (resp. little-o) in probability notation: h₁(d) = O_{d,P}(h₂(d)) if, for any ε > 0, there exist C_ε > 0 and d_ε ∈ Z_{>0}, such that

    P( |h₁(d) / h₂(d)| > C_ε ) ≤ ε    for all d ≥ d_ε,

and, respectively, h₁(d) = o_{d,P}(h₂(d)) if h₁(d)/h₂(d) converges to 0 in probability.

We will occasionally hide logarithmic factors using the notation Õ_d(·) (resp. õ_d(·)): h₁(d) = Õ_d(h₂(d)) if there exists a constant C such that h₁(d) ≤ C (log d)^C h₂(d). Similarly, we will denote by Õ_{d,P}(·) (resp. õ_{d,P}(·)) the big-O (resp. little-o) in probability notation up to a logarithmic factor.

4 Proof of Theorem 1.2, RF model

4.1 Preliminaries

We begin with some notation and simple remarks. Assume σ is an activation function with σ(u)² ≤ c₀ e^{c₁ u²/2} for some constants c₀ < ∞ and c₁ < 1. Then:

  1. σ ∈ L²(R, γ), where γ is the standard Gaussian measure.

  2. Let x ∼ Unif(S^{d−1}(√d)) and fix a unit vector e ∈ S^{d−1}(1). Then there exists d₀ = d₀(c₁) such that, for d ≥ d₀,

    E{ σ(⟨e, x⟩)² } ≤ C₀,    (22)

    for a constant C₀ depending only on c₀, c₁.

  3. Let G ∼ N(0, 1). Then there exists a coupling of x ∼ Unif(S^{d−1}(√d)) and G such that

    lim_{d→∞} E{ [σ(⟨e, x⟩) − σ(G)]² } = 0.    (23)
Proof.

Claim 1 is obvious.

For claim 2, note that the probability density of ⟨e, x⟩ when x ∼ Unif(S^{d−1}(√d)) is given by

    p_d(t) = C_d (1 − t²/d)_+^{(d−3)/2},    (24)
    C_d = Γ(d/2) / ( √(πd) Γ((d−1)/2) ).    (25)

A simple calculation shows that C_d → 1/√(2π) as d → ∞, and hence sup_d C_d < ∞. Therefore

    E{ σ(⟨e, x⟩)² } ≤ c₀ C_d ∫_{−√d}^{√d} e^{c₁ t²/2} (1 − t²/d)_+^{(d−3)/2} dt    (26)
    ≤ c₀ C_d ∫_R e^{c₁ t²/2 − (d−3) t²/(2d)} dt ≤ C₀,    (27)

where the last inequality holds provided (d − 3)/d > c₁, i.e. for all d ≥ d₀(c₁).

Finally, for point 3, without loss of generality we will take e = e₁, so that ⟨e, x⟩ = x₁. By the same argument given above (and since both x₁ and G have densities bounded uniformly in d), for any ε > 0 we can choose σ_ε bounded continuous so that, for any d ≥ d₀,

    E{ [σ(x₁) − σ_ε(x₁)]² } ≤ ε,    E{ [σ(G) − σ_ε(G)]² } ≤ ε.    (28)

It is therefore sufficient to prove the claim for σ bounded continuous. Letting z ∼ N(0, I_{d−1}), independent of G, we construct the coupling via

    x = √d (G, z₁, …, z_{d−1})ᵀ / (G² + ‖z‖₂²)^{1/2},    (29)

where we set x₁ = ⟨e₁, x⟩ = √d G / (G² + ‖z‖₂²)^{1/2}. We thus have x₁ − G → 0 almost surely, and the claim follows by weak convergence. ∎
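The coupling argument above says, in particular, that ⟨e, x⟩ is approximately a standard Gaussian for large d. A direct empirical check (sampling only the first coordinate, which suffices by rotational invariance) is sketched below; the Kolmogorov-Smirnov distance to N(0, 1) is computed with scipy, and the sample size is an arbitrary choice.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(4)

def first_coord(d, n):
    """<x, e_1> for x ~ Unif(S^{d-1}(sqrt(d))), sampled without storing full d-dimensional vectors."""
    z1 = rng.standard_normal(n)
    rest = rng.chisquare(d - 1, n)               # squared norm of the other d - 1 Gaussian coordinates
    return np.sqrt(d) * z1 / np.sqrt(z1 ** 2 + rest)

n = 400_000
for d in (4, 16, 64, 256):
    stat = kstest(first_coord(d, n), "norm").statistic
    print(d, round(stat, 4))    # Kolmogorov-Smirnov distance to N(0, 1) shrinks as d grows
                                # (down to the Monte Carlo resolution ~ 1/sqrt(n))
```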

We denote the Hermite decomposition of σ by

    σ(u) = Σ_{k=0}^∞ (μ_k(σ) / k!) He_k(u),    μ_k(σ) ≡ E{ σ(G) He_k(G) }.    (30)

For future reference, we state separately the two assumptions we use to prove Theorem 1.2 for the RF model. [Integrability condition] There exist constants c₀, c₁, with c₀ < ∞ and c₁ < 1, such that, for all u ∈ R, σ(u)² ≤ c₀ e^{c₁ u²/2}.

[Non-trivial Hermite components] The activation function σ is not a polynomial of degree smaller than ℓ.

Equivalently, there exists k ≥ ℓ such that μ_k(σ) ≠ 0.

4.2 Proof of Theorem 1.2: Outline

Recall that w_i ∼ Unif(S^{d−1}(1)) independently. We define θ_i = √d w_i for i ≤ N, so that θ_i ∼ Unif(S^{d−1}(√d)) independently. Let x ∼ Unif(S^{d−1}(√d)), and Θ = (θ_1, …, θ_N)ᵀ. We denote by E_x the expectation operator with respect to x, by E_Θ the expectation operator with respect to Θ, and by E_{x,Θ} the expectation operator with respect to both x and Θ.

Define the random vectors , , , with

(31)
(32)
(33)

Define the random matrix