On the Inductive Bias of Neural Tangent Kernels

05/29/2019 ∙ by Alberto Bietti, et al. ∙ Inria

State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks.


Code repository: ckn_kernel (computation of convolutional kernels, CKN and NTK, in C++).

1 Introduction

The large number of parameters in state-of-the-art deep neural networks makes them very expressive, with the ability to approximate large classes of functions [22, 33]. Since many networks can potentially fit a given dataset, the optimization method, typically a variant of gradient descent, plays a crucial role in selecting a model that generalizes well [31].

A recent line of work [11, 16, 17, 23] has shown that when training deep networks in a certain over-parameterized regime, the dynamics of gradient descent behave like those of a linear model on (non-linear) features determined at initialization. In the over-parameterization limit, these features correspond to a kernel known as the neural tangent kernel. In particular, in the case of a regression loss, the obtained model behaves similarly to a minimum norm kernel least squares solution, suggesting that this kernel may play a key role in determining the inductive bias of the learning procedure and its generalization properties. While it is still not clear if this regime is at play in state-of-the-art deep networks, there is some evidence that this phenomenon of “lazy training” [11], where weights only move very slightly during training, may be relevant for early stages of training and for the outermost layers of deep networks [25, 42], motivating a better understanding of its properties.

In this paper, we study the inductive bias of this regime by studying properties of functions in the space associated with the neural tangent kernel for a given architecture (that is, the reproducing kernel Hilbert space, or RKHS). Such kernels can be defined recursively using certain choices of dot-product kernels at each layer that depend on the activation function. For the convolutional case with rectified linear unit (ReLU) activations and arbitrary patches and linear pooling operations, we show that the NTK can be expressed through kernel feature maps defined in a tree-structured hierarchy.

We study smoothness and stability properties of the kernel mapping for two-layer networks and CNNs, which control the variations of functions in the RKHS. In particular, a useful inductive bias when dealing with natural signals such as images is stability of the output to deformations of the input, such as translations or small rotations. A precise notion of stability to deformations was proposed by Mallat [28], and was later studied in [8] in the context of CNN architectures, showing the benefits of different architectural choices such as small patch sizes. In contrast to the kernels studied in [8], which for instance cover the limiting kernels that arise from training only the last layer of a ReLU CNN, we find that the obtained NTK kernel mappings for the ReLU activation lack a desired Lipschitz property which is needed for stability to deformations in the sense of [8, 9, 28]. Instead, we show that a weaker smoothness property similar to Hölder smoothness holds, and this allows us to show that the kernel mapping is stable to deformations, albeit with a different guarantee.

In order to balance our observations on smoothness, we also consider approximation properties for the NTK of two-layer ReLU networks, by characterizing the RKHS using a Mercer decomposition of the kernel in the basis of spherical harmonics [5, 38, 39]. In particular, we study the decay of eigenvalues for this decomposition, which is then related to the regularity of functions in the space, and provides rates of approximation for Lipschitz functions [5]. We find that the full NTK has better approximation properties compared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instance when only training the weights in the last layer, or when considering Gaussian process limits of ReLU networks (e.g., [20, 24, 32]).

Contributions.

Our main contributions can be summarized as follows:


  • We provide a derivation of the NTK for convolutional networks with generic linear operators for patch extraction and pooling, and express the corresponding kernel feature map hierarchically using these operators.

  • We study smoothness properties of the kernel mapping for ReLU networks, showing that it is not Lipschitz but satisfies a weaker Hölder smoothness property. For CNNs, we then provide a guarantee on deformation stability.

  • We characterize the RKHS of the NTK for two-layer ReLU networks by providing a spectral decomposition of the kernel and studying its spectral decay. This leads to improved approximation properties compared to other function classes based on ReLU.

Related work.

Neural tangent kernels were introduced in [23], and similar ideas were used to obtain more quantitative guarantees on the global convergence of gradient descent for over-parameterized neural networks [1, 2, 3, 11, 16, 17]. The papers [3, 16, 41] also derive NTKs for convolutional networks, but focus on simpler architectures. Kernel methods for deep neural networks were studied for instance in [13, 15, 27]. Stability to deformations was originally introduced in the context of the scattering representation [9, 28], and later extended to neural networks through kernel methods in [8]. The inductive bias of optimization in neural network learning was considered, e.g., by [1, 31, 40]. [5, 21, 37] study function spaces corresponding to two-layer ReLU networks. In particular, [21] also analyzes properties of the NTK, but studies a specific high-dimensional limit for generic activations, while we focus on ReLU networks, studying the corresponding eigenvalue decays in finite dimension.

2 Neural Tangent Kernels

In this section, we provide some background on “lazy training” and neural tangent kernels (NTKs), and introduce the kernels that we study in this paper. In particular, we derive the NTK for generic convolutional architectures on discrete signals. For simplicity of exposition, we consider scalar-valued functions, noting that the kernels may be extended to the vector-valued case, as done, e.g., in [23].

2.1 Lazy training and neural tangent kernels

Multiple recent works studying global convergence of gradient descent in neural networks (e.g., [2, 16, 17, 23]) show that when a network is sufficiently over-parameterized, weights remain close to initialization during training. The model is then well approximated by its linearization around initialization. For a neural network f(x; θ) with parameters θ and initialization θ₀, we then have¹

f(x; θ) ≈ f(x; θ₀) + ⟨θ − θ₀, ∇_θ f(x; θ₀)⟩.   (1)

¹While we use gradients in our notations, we note that weak differentiability (e.g., with ReLU activations) is sufficient when studying the limiting NTK [23].

This regime where weights barely move has also been referred to as “lazy training” [11], in contrast to other situations such as the “mean-field” regime (e.g.[12, 30, 29]), where weights move according to non-linear dynamics. Yet, with sufficient over-parameterization, the (non-linear) features of the linearized model (1) become expressive enough to be able to perfectly fit the training data, by approximating a kernel method.

Neural Tangent Kernel (NTK).

When the width of the network tends to infinity, assuming an appropriate initialization of the weights, the features of the linearized model tend to a limiting kernel K, called the neural tangent kernel [23]:

K(x, x') = lim ⟨∇_θ f(x; θ₀), ∇_θ f(x'; θ₀)⟩,   (2)

where the limit is over the widths of the network.

In this limit and under some assumptions, one can show that the weights move very slightly and the kernel remains fixed during training [23], and that gradient descent will then lead to the minimum norm kernel least-squares fit of the training set in the case of the ℓ2 loss (see [23] and [29, Section H.7]). Similar interpolating solutions have been found to perform well for generalization, both in practice [7] and in theory [26]. When the number of neurons is large but finite, one can often show that the kernel only deviates slightly from the limiting NTK, at initialization and throughout training, thus allowing convergence as long as the initial kernel matrix is non-degenerate [3, 11, 16, 17].
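
To make the linearized regime concrete, the following sketch (Python/NumPy, with helper names of our choosing) computes the finite-width tangent kernel ⟨∇_θ f(x; θ₀), ∇_θ f(x'; θ₀)⟩ of a two-layer ReLU network at a random initialization; as the width m grows, this quantity concentrates around its infinite-width limit.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def finite_width_ntk(x, xp, m, seed=0):
    """Tangent kernel <grad_theta f(x), grad_theta f(x')> at a random init
    for f(z; theta) = sqrt(2/m) * sum_j v_j * relu(w_j^T z)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    W = rng.standard_normal((m, d))   # first-layer weights w_j
    v = rng.standard_normal(m)        # second-layer weights v_j
    scale = np.sqrt(2.0 / m)

    def grads(z):
        pre = W @ z                                        # pre-activations w_j^T z
        act = relu(pre)
        der = (pre > 0).astype(float)                      # relu'(pre), defined a.e.
        grad_W = scale * (v * der)[:, None] * z[None, :]   # d f / d w_j
        grad_v = scale * act                               # d f / d v_j
        return np.concatenate([grad_W.ravel(), grad_v])

    return grads(x) @ grads(xp)

d = 5
rng = np.random.default_rng(1)
x, xp = rng.standard_normal(d), rng.standard_normal(d)
for m in [100, 10_000, 200_000]:
    print(m, finite_width_ntk(x, xp, m))  # stabilizes as m grows
```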

NTK for two-layer ReLU networks.

Consider a two-layer network of the form f(x; θ) = √(2/m) Σ_{j=1}^m v_j σ(w_j^⊤ x), where σ(u) = max(0, u) is the ReLU activation, θ = (w_1, …, w_m, v_1, …, v_m), and w_j ∈ ℝ^d, v_j ∈ ℝ are parameters initialized as standard Gaussians N(0, 1). Practitioners often include the factor 2/m in the variance of the initialization of the weights, but we treat it as a scaling factor following [16, 17, 23], noting that this leads to the same predictions. The factor 2 is simply a normalization constant specific to the ReLU activation and commonly used by practitioners, which avoids vanishing or exploding behavior for deep networks. The corresponding NTK is then given by [11, 17]:

K(x, x') = ‖x‖ ‖x'‖ κ(⟨x, x'⟩ / (‖x‖ ‖x'‖)),   (3)

where

κ(u) = u κ₀(u) + κ₁(u),   (4)
κ₀(u) = (1/π) (π − arccos(u)),   κ₁(u) = (1/π) (u (π − arccos(u)) + √(1 − u²)).   (5)

The expressions for κ₀ and κ₁ follow from standard calculations for arc-cosine kernels of degree 0 and 1 (see [13]). Note that in this two-layer case, the non-linear features obtained with finitely many neurons correspond to a random features kernel [34], which is known to approximate the full kernel relatively well even with a moderate number of neurons [6, 34, 35]. One can also extend the derivation to other activation functions, which may lead to explicit expressions for the kernel in some cases [15].
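
As a concrete illustration, the limiting kernel can be evaluated in closed form; the sketch below (with helper names of our choosing) assumes the standard arc-cosine expressions for κ₀ and κ₁ given in (5), and its output should agree with the wide-network value computed in the previous sketch.

```python
import numpy as np

def kappa0(u):
    # arc-cosine kernel of degree 0: kappa0(u) = (1/pi) * (pi - arccos(u))
    u = np.clip(u, -1.0, 1.0)
    return 1.0 - np.arccos(u) / np.pi

def kappa1(u):
    # arc-cosine kernel of degree 1:
    # kappa1(u) = (1/pi) * (u * (pi - arccos(u)) + sqrt(1 - u^2))
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

def ntk_two_layer(x, xp):
    """Limiting two-layer ReLU NTK: ||x|| ||x'|| * (u*kappa0(u) + kappa1(u))."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    u = x @ xp / (nx * nxp)
    return nx * nxp * (u * kappa0(u) + kappa1(u))

rng = np.random.default_rng(1)
x, xp = rng.standard_normal(5), rng.standard_normal(5)
print(ntk_two_layer(x, xp))
```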

NTK for fully-connected deep ReLU networks.

We define a fully-connected neural network by , with , and

where the parameters are initialized with i.i.d. Gaussian entries, and the ReLU activation σ is applied element-wise. Following [23], the corresponding NTK is defined recursively by  with , and for ,

where . Using a change of variables and definitions of arc-cosine kernels of degrees 0 and 1 [13], it is easy to show that

(6)
(7)

where  and  are defined in (5).
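
For deeper fully-connected networks, this recursion can be evaluated directly in kernel space. Below is a minimal sketch, assuming the recursion takes the standard form K^l = Σ^l + Σ̇^l K^{l−1} with the arc-cosine expressions above and a normalization that preserves norms across layers; indexing conventions may differ slightly from (6)-(7). With two layers it recovers the kernel in (3).

```python
import numpy as np

def kappa0(u):
    u = np.clip(u, -1.0, 1.0)
    return 1.0 - np.arccos(u) / np.pi

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

def ntk_fc_relu(x, xp, n_layers):
    """NTK of a fully-connected ReLU network with n_layers layers,
    via the recursion K^l = Sigma^l + kappa0(rho) * K^{l-1}."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    sigma = x @ xp          # Sigma^1(x, x') = <x, x'>
    ntk = sigma             # K^1 = Sigma^1
    for _ in range(1, n_layers):
        rho = sigma / (nx * nxp)        # norms are preserved under this normalization
        sigma_dot = kappa0(rho)
        sigma = nx * nxp * kappa1(rho)  # Sigma^l
        ntk = sigma + sigma_dot * ntk   # K^l
    return ntk

x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
for L in [1, 2, 3, 5]:
    print(L, ntk_fc_relu(x, xp, L))
```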

Feature maps construction.

We now provide a reformulation of the previous kernel in terms of explicit feature maps, which provides a representation of the data and makes our study of stability in Section 4 more convenient. For a given input Hilbert space , we denote by the kernel mapping into the RKHS  for the kernel , and by  the kernel mapping into the RKHS  for the kernel . We will abuse notation and hide the input space, simply writing  and .

Lemma 1 (NTK feature map for fully-connected network).

The NTK for the fully-connected network can be defined as , with and for ,

where 

is the tensor product.

2.2 Neural tangent kernel for convolutional networks

In this section we study NTKs for convolutional networks (CNNs) on signals, focusing on the ReLU activation. We consider signals in , that is, signals  with  denoting the location, , and (for instance, and for RGB images). The infinite support allows us to avoid dealing with boundary conditions when considering deformations and pooling. The precise study of  membership is deferred to Section 4.

Patch extraction and pooling operators  and .

Following [8], we define two linear operators  and  on  for extracting patches and performing (linear) pooling at layer , respectively. For an -valued signal , is defined by , where  is a finite subset of  defining the patch shape (e.g., a 3×3 box). Pooling is defined as a convolution with a linear filter , e.g., a Gaussian filter at scale  as in [8], that is, . In this discrete setting, we can easily include a downsampling operation with factor  by changing the definition of  to  (in particular, if  is a Dirac at 0, we obtain a CNN with “strided convolutions”). In fact, our NTK derivation supports general linear operators on scalar signals.
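
As an illustration of these operators in a simple discrete setting, the sketch below (hypothetical, for 1D signals with finite support, zero padding, and a truncated Gaussian filter) implements patch extraction and pooling with optional downsampling; the infinite-support setting of the paper is only approximated here.

```python
import numpy as np

def extract_patches(x, patch_size):
    """P: map a signal x[u] (length n, c channels) to the signal of patches
    (x[u + v])_{v in S}, with S = {0, ..., patch_size - 1} and zero padding."""
    n, c = x.shape
    xp = np.pad(x, ((0, patch_size - 1), (0, 0)))
    return np.stack([xp[v:v + n] for v in range(patch_size)], axis=1).reshape(n, -1)

def gaussian_pool(x, sigma, stride=1):
    """A: convolve each channel with a normalized Gaussian filter at scale sigma,
    then downsample by the given stride (stride=1 means no downsampling)."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    h = np.exp(-t ** 2 / (2 * sigma ** 2))
    h /= h.sum()
    out = np.stack([np.convolve(x[:, j], h, mode="same") for j in range(x.shape[1])], axis=1)
    return out[::stride]

x = np.random.default_rng(0).standard_normal((32, 3))  # signal with 3 channels
patches = extract_patches(x, patch_size=3)             # shape (32, 9)
pooled = gaussian_pool(patches, sigma=2.0, stride=2)   # shape (16, 9)
print(patches.shape, pooled.shape)
```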

For defining the NTK feature map, we also introduce the following non-linear point-wise operator , given for two signals , by

(8)

where  are kernel mappings of arc-cosine 0/1 kernels, as defined in Section 2.1.

CNN definition and NTK.

We consider a network , with

where and are initialized with  entries, and  denotes the signal with  applied element-wise to . We are now ready to state our result on the NTK for this model.

Proposition 2 (NTK feature map for CNN).

The NTK for the above CNN, obtained when the number of feature maps at each layer tends to infinity (sequentially), is given by , with , where  and  are defined recursively for a given input  by , and for ,

with the abuse of notation  for a signal .

The proof is given in Appendix A.2, where we also show that in the over-parameterization limit, the pre-activations tend to a Gaussian process with covariance  (this is related to recent papers [20, 32] studying Gaussian process limits of Bayesian convolutional networks). The proof is by induction and relies on similar arguments to [23] for fully-connected networks, in addition to exploiting linearity of the operators  and , as well as recursive feature maps for hierarchical kernels. The recent papers [3, 41] also study NTKs for certain convolutional networks; in contrast to these works, our derivation considers general signals in , supports intermediate pooling or downsampling by changing , and provides a more intuitive construction through kernel mappings and the operators  and . Note that the feature maps  are defined independently from the , and in fact correspond to more standard multi-layer deep kernel machines [8, 13, 15, 27] or covariance functions of certain deep Bayesian networks [20, 24, 32]. They can also be seen as the feature maps of the limiting kernel that arises when only training weights in the last layer and fixing other layers at initialization (see, e.g., [15]).

3 Two-Layer Networks

In this section, we study smoothness and approximation properties of the RKHS defined by neural tangent kernels for two-layer networks. For ReLU activations, we show that the NTK kernel mapping is not Lipschitz, but satisfies a weaker smoothness property. In Section 3.2, we characterize the RKHS for ReLU activations and study its approximation properties and benefits. Finally, we comment on the use of other activations in Section 3.3.

3.1 Smoothness of two-layer ReLU networks

Here we study the RKHS H of the NTK for two-layer ReLU networks, defined in (3), focusing on smoothness properties of the kernel mapping, denoted Φ. Recall that smoothness of the kernel mapping guarantees smoothness of functions f ∈ H, through the relation

|f(x) − f(x')| ≤ ‖f‖_H ‖Φ(x) − Φ(x')‖_H.   (9)

We begin by showing that the kernel mapping for the NTK is not Lipschitz. This is in contrast to the kernel κ₁ in (5), obtained by fixing the weights in the first layer and training only the second layer weights (its kernel mapping is 1-Lipschitz by [8, Lemma 1]).

Proposition 3 (Non-Lipschitzness).

The kernel mapping Φ of the two-layer NTK is not Lipschitz:

sup_{x ≠ x'} ‖Φ(x) − Φ(x')‖_H / ‖x − x'‖ = +∞.

This is true even when looking only at points x, x' on the sphere. It follows that the RKHS H contains unit-norm functions with arbitrarily large Lipschitz constant.

Note that the instability is due to κ₀, which comes from gradients w.r.t. first-layer weights. We now show that a weaker guarantee holds nevertheless, resembling 1/2-Hölder smoothness; a numerical illustration is given after Proposition 4.

Proposition 4 (Smoothness for ReLU NTK).

We have the following smoothness properties:

  1. For  such that , the kernel mapping  satisfies .

  2. For general non-zero , we have .

  3. The kernel mapping  of the NTK then satisfies
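
These properties can be checked numerically using the RKHS identity ‖Φ(x) − Φ(x')‖² = K(x, x) + K(x', x') − 2K(x, x') together with the closed-form kernel of Section 2.1. In the sketch below (assuming the arc-cosine expressions for κ₀ and κ₁), the ratio ‖Φ(x) − Φ(x')‖/‖x − x'‖ blows up as x' approaches x on the sphere, while the ratio against ‖x − x'‖^{1/2} remains bounded.

```python
import numpy as np

def kappa0(u):
    u = np.clip(u, -1.0, 1.0)
    return 1.0 - np.arccos(u) / np.pi

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

def ntk(x, xp):
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    u = x @ xp / (nx * nxp)
    return nx * nxp * (u * kappa0(u) + kappa1(u))

x = np.array([1.0, 0.0])
print("eps, ratio vs ||x-x'||, ratio vs ||x-x'||^(1/2)")
for eps in [1e-1, 1e-2, 1e-4, 1e-6]:
    xp = np.array([np.cos(eps), np.sin(eps)])   # point on the sphere at angle eps
    dist = np.linalg.norm(x - xp)
    feat_dist = np.sqrt(max(ntk(x, x) + ntk(xp, xp) - 2 * ntk(x, xp), 0.0))
    print(eps, feat_dist / dist, feat_dist / np.sqrt(dist))
```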

3.2 Approximation properties for the two-layer ReLU NTK

In the previous section, we found that the NTK  for two-layer ReLU networks yields weaker smoothness guarantees compared to the kernel  obtained when the first layer is fixed. We now show that the NTK has better approximation properties, by studying the RKHS through a spectral decomposition of the kernel and the decay of the corresponding eigenvalues. This highlights a tradeoff between smoothness and approximation.

The next proposition gives the Mercer decomposition of the NTK in (4), for inputs x, x' on the unit sphere. The decomposition is given in the basis of spherical harmonics, as is common for dot-product kernels [38, 39], and our derivation uses results by Bach [5] on similar decompositions of positively homogeneous activations of the form (u)₊^α. See Appendix C for background and proofs.

Proposition 5 (Mercer decomposition of ReLU NTK).

For any , we have the following decomposition of the NTK :

(10)

where are spherical harmonic polynomials of degree , and the non-negative eigenvalues  satisfy , if with , and otherwise as , with  a constant depending only on . Then, the RKHS is described by:

(11)

The zero eigenvalues prevent certain functions from belonging to the RKHS, namely those with non-zero Fourier coefficients on the corresponding basis elements. Here, a sufficient condition for all such coefficients to be zero is that the function is even [5]. Note that for the arc-cosine 1 kernel κ₁, we have a faster decay, leading to a “smaller” RKHS (see Lemma 17 in Appendix C and [5]). Moreover, the asymptotic equivalent comes from the term u κ₀(u) in the definition (4) of κ, which comes from gradients of first layer weights; the second layer gradients yield κ₁, whose contribution to the eigenvalues becomes negligible at large orders. We use an identity also used in the recent paper [21], which compares similar kernels in a specific high-dimensional limit for generic activations; in contrast to [21], we focus on ReLUs and study eigenvalue decays in finite dimension. Note that our result is also related to eigenvalue decays of integral operators for learning problems (up to a change of measure), which can determine, e.g., non-parametric rates of convergence (e.g., [10, 19]) as well as degrees-of-freedom quantities for kernel approximation (e.g., [6, 35]). It is also related to the rate of convergence of gradient descent in the lazy training regime, which depends on the minimum eigenvalue of the empirical kernel matrix in [11, 16, 17].
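
A quick numerical illustration of such spectral decays (not a substitute for the Mercer decomposition, since it uses a finite sample rather than the uniform measure on the sphere) is to build kernel matrices on points sampled from the sphere and compare the eigenvalue decay of the NTK κ with that of the arc-cosine 1 kernel κ₁; the helpers below are as in the earlier sketches.

```python
import numpy as np

def kappa0(u):
    u = np.clip(u, -1.0, 1.0)
    return 1.0 - np.arccos(u) / np.pi

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
U = np.clip(X @ X.T, -1.0, 1.0)                 # Gram matrix of inner products

K_ntk = U * kappa0(U) + kappa1(U)               # NTK: kappa(u) = u*kappa0(u) + kappa1(u)
K_arc1 = kappa1(U)                              # arc-cosine kernel of degree 1

eig_ntk = np.sort(np.linalg.eigvalsh(K_ntk))[::-1] / n
eig_arc1 = np.sort(np.linalg.eigvalsh(K_arc1))[::-1] / n
for i in [10, 50, 200, 1000]:
    print(i, eig_ntk[i], eig_arc1[i])           # the NTK eigenvalues decay more slowly
```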

We now provide sufficient conditions for a function to be in , as well as rates of approximation of Lipschitz functions on the sphere, adapting results of [5] (specifically Propositions 2 and 3 in [5]) to our NTK setting.

Corollary 6 (Sufficient condition for ).

Let  be an even function such that all -th order derivatives exist and are bounded by  for , with . Then with , where  is a constant that only depends on .

Corollary 7 (Approximation of Lipschitz functions).

Let  be an even function such that and , for all . There is a function with , where is larger than a constant depending only on , such that

For both results, there is an improvement over , for which Corollary 6 requires  bounded derivatives, and Corollary 7 leads to a weaker rate in  (see [5, Propositions 2 and 3] with ). These results show that in the over-parameterized regime of the NTK, training multiple layers leads to better approximation properties compared to only training the last layer, which corresponds to using  instead of . In the different regime of “convex neural networks” (e.g.[5, 37]) where neurons can be selected with a sparsity-promoting penalty, the approximation rates shown in [5] for ReLU networks are also weaker than for the NTK in the worst case (though the regime presents benefits in terms of adaptivity), suggesting that perhaps in some situations the “lazy” regime of the NTK could be preferred over the regime where neurons are selected using sparsity.

Homogeneous case.

When inputs do not lie on the sphere  but in , the NTK for two-layer ReLU networks takes the form of a homogeneous dot-product kernel (3), which defines a different RKHS  that we characterize below in terms of the RKHS  of the NTK on the sphere.

Proposition 8 (RKHS of the homogeneous NTK).

The RKHS  of the kernel on  consists of functions of the form with , where  is the RKHS on the sphere, and we have .

Note that while such a restriction to homogeneous functions may be limiting, one may easily obtain non-homogeneous functions by considering an augmented variable  and defining , where  is now defined on the -sphere . When inputs are in a ball of radius , this reformulation preserves regularity properties (see [5, Section 3]).

3.3 Smoothness with other activations

In this section, we look at smoothness of two-layer networks with different activation functions. Following the derivation for the ReLU in Section 2.1, the NTK for a general activation  is given by

We then have the following result.

Proposition 9 (Lipschitzness for smooth activations).

Assume that  is twice differentiable and that the quantities for are bounded, with . Then, for  on the unit sphere, the kernel mapping  of  satisfies

The proof uses results from [15] on relationships between activations and the corresponding kernels, as well as smoothness results for dot-product kernels in [8] (see Appendix B.3). If, for instance, we consider the exponential activation , we have for all  (using results from [15]), so that the kernel mapping is Lipschitz with constant . For the soft-plus activation , we may evaluate the integrals numerically, obtaining , so that the kernel mapping is Lipschitz with constant .
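
Such Gaussian integrals can be evaluated numerically, e.g., with Gauss-Hermite quadrature. The sketch below (with helper names of our choosing, and without reproducing the exact constants of Proposition 9) computes the Gaussian moments of the soft-plus activation and its first two derivatives.

```python
import numpy as np

# Nodes/weights for integration against exp(-t^2 / 2) ("probabilists'" Hermite).
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
weights = weights / np.sqrt(2 * np.pi)          # normalize to the standard Gaussian

def gaussian_moment(f):
    """Approximate E[f(Z)] for Z ~ N(0, 1)."""
    return np.sum(weights * f(nodes))

softplus = lambda t: np.log1p(np.exp(t))
softplus_d1 = lambda t: 1.0 / (1.0 + np.exp(-t))             # first derivative (sigmoid)
softplus_d2 = lambda t: softplus_d1(t) * (1.0 - softplus_d1(t))  # second derivative

print("E[sigma(Z)^2]   =", gaussian_moment(lambda t: softplus(t) ** 2))
print("E[sigma'(Z)^2]  =", gaussian_moment(lambda t: softplus_d1(t) ** 2))
print("E[sigma''(Z)^2] =", gaussian_moment(lambda t: softplus_d2(t) ** 2))
```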

4 Deep Convolutional Networks

In this section, we study smoothness and stability properties of the NTK kernel mapping for convolutional networks with ReLU activations. In order to properly define deformations, we consider continuous signals in L²(ℝ^d) instead of discrete ones, following [8, 28]. The goal of deformation stability guarantees is to ensure that the data representation (in this case, the kernel mapping Φ) does not change too much when the input signal is slightly deformed, for instance with a small translation or rotation of an image, a useful inductive bias for natural signals. For a C¹-diffeomorphism τ, denoting by L_τ the action operator of the diffeomorphism, L_τ x(u) = x(u − τ(u)), we will show a guarantee of the form

‖Φ(L_τ x) − Φ(x)‖ ≤ (ω(‖∇τ‖_∞) + C ‖τ‖_∞) ‖x‖,

where ‖∇τ‖_∞ is the maximum operator norm of the Jacobian ∇τ over the domain, ω is an increasing function and C a positive constant. The second term controls translation invariance, and C typically decreases with the scale of the last pooling layer (defined below), while the first term controls deformation stability, since ‖∇τ‖_∞ measures the “size” of deformations. The function ω is typically a linear function of its argument in other settings [8, 28]; here we will obtain a faster growth of order ‖∇τ‖_∞^{1/2} for small ‖∇τ‖_∞, due to the weaker smoothness that arises from the arc-cosine 0 kernel mappings.
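
To make the action of deformations concrete, the sketch below (hypothetical, in one dimension with linear interpolation on a finite grid, values outside the grid being clamped) applies L_τ x(u) = x(u − τ(u)) for a small smooth τ and reports the quantities ‖τ‖_∞ and ‖∇τ‖_∞ appearing in the bound.

```python
import numpy as np

def deform(x, tau, grid):
    """Action of a diffeomorphism: (L_tau x)(u) = x(u - tau(u)),
    evaluated on a 1D grid with linear interpolation."""
    return np.interp(grid - tau(grid), grid, x)

grid = np.linspace(0.0, 1.0, 512)
x = np.sin(8 * np.pi * grid) + 0.3 * np.cos(20 * np.pi * grid)   # toy signal

amp = 0.01
tau = lambda u: amp * np.sin(2 * np.pi * u)      # smooth, small deformation

x_def = deform(x, tau, grid)
tau_inf = np.max(np.abs(tau(grid)))                              # ||tau||_inf
grad_tau_inf = np.max(np.abs(np.gradient(tau(grid), grid)))      # ||grad tau||_inf (finite differences)
print(tau_inf, grad_tau_inf, np.linalg.norm(x - x_def) / np.linalg.norm(x))
```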

Properties of the operators.

In this continuous setup, is now given for a signal  by , where  is the Lebesgue measure. We then have , and considering normalized Gaussian pooling filters, we have by Young’s inequality [8]. The non-linear operator  is defined point-wise analogously to (8), and satisfies . We thus have that the feature maps in the continuous analog of the NTK construction in Proposition 2 are in  as long as  is in . Note that this does not hold for some smooth activations, where may be a positive constant even when , leading to unbounded  norm for . The next lemma studies the smoothness of , extending results from Section 3.1 to signals in .

Lemma 10 (Smoothness of operator ).

For two signals , we have

(12)

Assumptions on architecture.

Following [8], we introduce an initial pooling layer , corresponding to an anti-aliasing filter, which is necessary to allow stability and is a reasonable assumption given that in practice the inputs are discrete signals, for which high frequencies have typically been filtered by an acquisition device. Thus, we consider the kernel representation , with  as in Proposition 2. We also assume that patch sizes are controlled by the scale of pooling filters, that is

(13)

for some constant , where  is the scale of the pooling operation , which typically increases exponentially with depth, corresponding to a fixed downsampling factor at each layer in the discrete case. By a simple induction, we can show the following.

Lemma 11 (Norm and smoothness of ).

We have , and

Deformation stability bound.

We now present our main guarantee on deformation stability for the NTK kernel mapping.

Proposition 12 (Stability of NTK).

Let , and assume . We have the following stability bound:

where  are constants depending only on , and  also depends on  defined in (13).

The proof is given in Appendix B. Compared to the bound in [8], the first term shows weaker stability due to faster growth with , which comes from (12). The dependence in  is also poorer ( instead of ), however note that in contrast to [8], the norm and smoothness constants of  in Lemma 11 grow with  here, partially explaining this gap. We also note that as in [8], choosing small  (i.e., small patches in a discrete setting) is more helpful to improve stability than a small number of layers , given that  increases with  as , while  typically decreases with  as  when one seeks a fixed target level of translation invariance (see [8, Section 3.2]).

By fixing weights of all layers but the last, we would instead obtain feature maps of the form  (using notation from Proposition 2), which satisfy the improved stability guarantee of [8]. This again hints at a tradeoff between stability and approximation, suggesting that one may be able to learn less stable but more discriminative functions in the NTK regime by training all layers.

5 Discussion

In this paper, we have studied the inductive bias of the “lazy training” regime for over-parameterized neural networks, by considering the neural tangent kernel of different architectures, and analyzing properties of the corresponding RKHS, which characterizes the functions that can be learned efficiently in this regime. We find that the NTK for ReLU networks has better approximation properties compared to other neural network kernels, but weaker smoothness properties, although these can still guarantee a form of stability to deformations for CNN architectures, providing an important inductive bias for natural signals. While these properties may help obtain better performance when large amounts of data are available, they can also lead to a poorer estimation error when data is scarce, a setting in which smoother kernels or better regularization strategies may be helpful.

It should be noted that while our study of functions in the RKHS may determine what target functions can be learned by over-parameterized networks, the obtained networks with finite neurons do not belong to the same RKHS, and hence may be less stable than such target functions, at least outside of the training data, due to approximations both in the linearization (1) and between the finite neuron and limiting kernels. Finally, we note that while this “lazy” regime is interesting and could partly explain the success of deep learning methods, it does not explain, for instance, the common behavior in early layers where neurons move to select useful features in the data, such as Gabor filters, as pointed out in [11]. In particular, such a behavior might provide better statistical efficiency by adapting to simple structures in the data (see, e.g., [5]), something which is not captured in a kernel regime like the NTK. It would be interesting to study inductive biases in a regime somewhere in between, where neurons may move at least in the first few layers.

Acknowledgements

This work was supported by the ERC grant number 714381 (SOLARIS project) and by the MSR-Inria joint centre. The authors thank Francis Bach and Lénaïc Chizat for useful discussions.

References

Appendix A Proofs of NTK derivations

A.1 Proof of Lemma 1

Proof of Lemma 1.

By induction, using (6) and (7) and the corresponding definitions of , we can write

The result follows by using the following relation, given three pairs of vectors , and  in arbitrary Hilbert spaces:

A.2 Proof of Proposition 2 (NTK for CNNs)

In this section, we will denote by (resp ) the feature maps associated to an input (resp ), as defined in Proposition 2. We follow the proofs of Jacot et al. [23, Proposition 1 and Theorem 1].

We begin by proving the following lemma, which characterizes the Gaussian process behavior of the pre-activations , seen as a function of  and , in the over-parameterization limit.

Lemma 13.

As , the pre-activations  for tend (in law) to i.i.d. centered Gaussian processes with covariance

(14)
Proof.

We show this by induction. For , is clearly Gaussian, and we have

Writing the vector of weights for the filter associated to the input feature map  and output feature map , we have . Then we have

by noticing that .

Now, for , we have by similar arguments that conditioned on , is Gaussian, with covariance

By the inductive hypothesis, the pre-activations of the previous layer, as functions of the location and the input, tend to Gaussian processes in the limit. By the law of large numbers, we have, as the number of feature maps tends to infinity,

Since this covariance is deterministic, the pre-activations  are also unconditionally a Gaussian process in the limit, with covariance .

Now it remains to show that

Notice that by linearity of  and , it suffices to show

for any  (the last equality follows from the definition of ). Noting that