computation of convolutional kernels (CKN and NTK) in C++
State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks.
The large number of parameters in state-of-the-art deep neural networks makes them very expressive, with the ability to approximate large classes of functions [22, 33]. Since many networks can potentially fit a given dataset, the optimization method, typically a variant of gradient descent, plays a crucial role in selecting a model that generalizes well.
A recent line of work [11, 16, 17, 23] has shown that when training deep networks in a certain over-parameterized regime, the dynamics of gradient descent behave like those of a linear model on (non-linear) features determined at initialization. In the over-parameterization limit, these features correspond to a kernel known as the neural tangent kernel. In particular, in the case of a regression loss, the obtained model behaves similarly to a minimum norm kernel least squares solution, suggesting that this kernel may play a key role in determining the inductive bias of the learning procedure and its generalization properties. While it is still not clear if this regime is at play in state-of-the-art deep networks, there is some evidence that this phenomenon of “lazy training”, where weights only move very slightly during training, may be relevant for early stages of training and for the outermost layers of deep networks [25, 42], motivating a better understanding of its properties.
In this paper, we study the inductive bias of this regime by studying properties of functions in the space associated with the neural tangent kernel for a given architecture (that is, the reproducing kernel Hilbert space, or RKHS). Such kernels can be defined recursively using certain choices of dot-product kernels at each layer that depend on the activation function. For the convolutional case with rectified linear unit (ReLU) activations and arbitrary patches and linear pooling operations, we show that the NTK can be expressed through kernel feature maps defined in a tree-structured hierarchy.
We study smoothness and stability properties of the kernel mapping for two-layer networks and CNNs, which control the variations of functions in the RKHS. In particular, a useful inductive bias when dealing with natural signals such as images is stability of the output to deformations of the input, such as translations or small rotations. A precise notion of stability to deformations was proposed by Mallat, and was later studied in the context of CNN architectures, showing the benefits of different architectural choices such as small patch sizes. In contrast to the kernels studied in that line of work, which for instance cover the limiting kernels that arise from training only the last layer of a ReLU CNN, we find that the obtained NTK kernel mappings for the ReLU activation lack a desired Lipschitz property which is needed for stability to deformations in the sense of [8, 9, 28]. Instead, we show that a weaker smoothness property similar to Hölder smoothness holds, and this allows us to show that the kernel mapping is stable to deformations, albeit with a different guarantee.
In order to balance our observations on smoothness, we also consider approximation properties for the NTK of two-layer ReLU networks, by characterizing the RKHS using a Mercer decomposition of the kernel in the basis of spherical harmonics [5, 38, 39]. In particular, we study the decay of eigenvalues for this decomposition, which is then related to the regularity of functions in the space, and provides rates of approximation for Lipschitz functions. We find that the full NTK has better approximation properties compared to other function classes typically defined for ReLU activations [5, 13, 15], which arise for instance when only training the weights in the last layer, or when considering Gaussian process limits of ReLU networks (e.g., [20, 24, 32]).
Our main contributions can be summarized as follows:
We provide a derivation of the NTK for convolutional networks with generic linear operators for patch extraction and pooling, and express the corresponding kernel feature map hierarchically using these operators.
We study smoothness properties of the kernel mapping for ReLU networks, showing that it is not Lipschitz but satisfies a weaker Hölder smoothness property. For CNNs, we then provide a guarantee on deformation stability.
We characterize the RKHS of the NTK for two-layer ReLU networks by providing a spectral decomposition of the kernel and studying its spectral decay. This leads to improved approximation properties compared to other function classes based on ReLU.
Neural tangent kernels were introduced in , and similar ideas were used to obtain more quantitative guarantees on the global convergence of gradient descent for over-parameterized neural networks [1, 2, 3, 11, 16, 17]. The papers [3, 16, 41] also derive NTKs for convolutional networks, but focus on simpler architectures. Kernel methods for deep neural networks were studied for instance in [13, 15, 27]. Stability to deformations was originally introduced in the context of the scattering representation [9, 28], and later extended to neural networks through kernel methods in . The inductive bias of optimization in neural network learning was considered, e.g., by [1, 31, 40]. [5, 21, 37] study function spaces corresponding to two-layer ReLU networks. In particular,  also analyzes properties of the NTK, but studies a specific high-dimensional limit for generic activations, while we focus on ReLU networks, studying the corresponding eigenvalue decays in finite dimension.
In this section, we provide some background on “lazy training” and neural tangent kernels (NTKs), and introduce the kernels that we study in this paper. In particular, we derive the NTK for generic convolutional architectures on signals. For simplicity of exposition, we consider scalar-valued functions, noting that the kernels may be extended to the vector-valued case.
Multiple recent works studying global convergence of gradient descent in neural networks (e.g., [2, 16, 17, 23]) show that when a network is sufficiently over-parameterized, weights remain close to initialization during training. The model is then well approximated by its linearization around initialization. For a neural network $f(x;\theta)$ with parameters $\theta$ and initialization $\theta_0$, we then have:
$$f(x;\theta) \approx f(x;\theta_0) + \langle \theta - \theta_0, \nabla_\theta f(x;\theta_0) \rangle. \tag{1}$$
Footnote 1: While we use gradients in our notations, we note that weak differentiability (e.g., with ReLU activations) is sufficient when studying the limiting NTK.
This regime where weights barely move has also been referred to as “lazy training”, in contrast to other situations such as the “mean-field” regime (e.g., [12, 30, 29]), where weights move according to non-linear dynamics. Yet, with sufficient over-parameterization, the (non-linear) features of the linearized model (1) become expressive enough to be able to perfectly fit the training data, by approximating a kernel method.
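When the width $m$ of the network tends to infinity, assuming an appropriate initialization on weights, the features of the linearized model tend to a limiting kernel $K$, called the neural tangent kernel:
$$K(x, x') = \lim_{m \to \infty} \langle \nabla_\theta f(x;\theta_0), \nabla_\theta f(x';\theta_0) \rangle. \tag{2}$$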
In this limit and under some assumptions, one can show that the weights move very slightly and the kernel remains fixed during training, and that gradient descent will then lead to the minimum norm kernel least-squares fit of the training set in the case of the $\ell_2$ loss (see [29, Section H.7]). Similar interpolating solutions have been found to perform well for generalization, both in practice and in theory. When the number of neurons is large but finite, one can often show that the kernel only deviates slightly from the limiting NTK, at initialization and throughout training, thus allowing convergence as long as the initial kernel matrix is non-degenerate [3, 11, 16, 17].
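To make the minimum norm kernel least-squares fit concrete, here is a minimal sketch in pure Python. It is an illustration rather than the authors' code: the helper names (`ntk`, `solve`, `fit_min_norm`) are hypothetical, and the kernel used is the standard closed form of the two-layer ReLU NTK, which is assumed here. The fit solves $K\alpha = y$ on the training set and predicts with $f(x) = \sum_i \alpha_i K(x_i, x)$.

```python
import math


def ntk(x, y):
    """Two-layer ReLU NTK (assumed closed form): |x||y| (u k0(u) + k1(u))."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    u = max(-1.0, min(1.0, dot / (nx * ny)))
    k0 = (math.pi - math.acos(u)) / math.pi
    k1 = (u * (math.pi - math.acos(u)) + math.sqrt(1.0 - u * u)) / math.pi
    return nx * ny * (u * k0 + k1)


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_min_norm(xs, ys, kernel):
    """Return the minimum RKHS-norm interpolant of (xs, ys) for the kernel."""
    gram = [[kernel(a, b) for b in xs] for a in xs]
    alpha = solve(gram, ys)
    return lambda x: sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))


# Toy data: three orthonormal inputs with arbitrary targets.
train_x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
train_y = [1.0, -1.0, 0.5]
predictor = fit_min_norm(train_x, train_y, ntk)
```

The sketch assumes a non-degenerate kernel matrix, which holds for the toy inputs above since the Gram matrix is strictly diagonally dominant.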
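Consider a two-layer network of the form $f(x;\theta) = \sqrt{\frac{2}{m}} \sum_{j=1}^m v_j \sigma(w_j^\top x)$, where $\sigma(u) = (u)_+ = \max(0, u)$ is the ReLU activation, $x \in \mathbb{R}^p$, and $\theta = (w_1^\top, \ldots, w_m^\top, v^\top)$ are parameters with values initialized as $\mathcal{N}(0, 1)$. Practitioners often include the factor $\sqrt{2/m}$ in the variance of the initialization of $v_j$, but we treat it as a scaling factor following [16, 17, 23], noting that this leads to the same predictions. The factor $\sqrt{2}$ is simply a normalization constant specific to the ReLU activation and commonly used by practitioners, which avoids vanishing or exploding behavior for deep networks. The corresponding NTK is then given by [11, 17]:
$$K(x, x') = 2 (x^\top x') \, \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[\mathbb{1}\{w^\top x \geq 0\} \mathbb{1}\{w^\top x' \geq 0\}\big] + 2 \, \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[(w^\top x)_+ (w^\top x')_+\big] = \|x\| \|x'\| \, \kappa\Big(\frac{\langle x, x' \rangle}{\|x\| \|x'\|}\Big), \tag{3}$$
where
$$\kappa(u) := u \kappa_0(u) + \kappa_1(u), \tag{4}$$
$$\kappa_0(u) = \frac{1}{\pi} \big(\pi - \arccos(u)\big), \qquad \kappa_1(u) = \frac{1}{\pi} \Big(u \cdot \big(\pi - \arccos(u)\big) + \sqrt{1 - u^2}\Big). \tag{5}$$
The expressions for $\kappa_0$ and $\kappa_1$ follow from standard calculations for arc-cosine kernels of degree 0 and 1. Note that in this two-layer case, the non-linear features obtained for finite neurons correspond to a random features kernel, which is known to approximate the full kernel relatively well even with a moderate amount of neurons [6, 34, 35]. One can also extend the derivation to other activation functions, which may lead to explicit expressions for the kernel in some cases.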
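The relation between the finite-width gradient features and the limiting kernel can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the closed forms of the arc-cosine kernels of degree 0 and 1 are the standard ones and are assumed here, and the network is assumed to be $f(x;\theta) = \sqrt{2/m} \sum_j v_j (w_j^\top x)_+$ with standard Gaussian initialization. It compares the closed-form NTK with the inner product of parameter gradients of a width-$m$ network.

```python
import math
import random


def kappa0(u):
    """Arc-cosine kernel of degree 0 (assumed standard closed form)."""
    u = max(-1.0, min(1.0, u))
    return (math.pi - math.acos(u)) / math.pi


def kappa1(u):
    """Arc-cosine kernel of degree 1 (assumed standard closed form)."""
    u = max(-1.0, min(1.0, u))
    return (u * (math.pi - math.acos(u)) + math.sqrt(1.0 - u * u)) / math.pi


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def ntk_two_layer(x, y):
    """Limiting NTK: |x||y| (u kappa0(u) + kappa1(u)), u = cosine of the angle."""
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    u = dot(x, y) / (nx * ny)
    return nx * ny * (u * kappa0(u) + kappa1(u))


def ntk_finite_width(x, y, m, seed=0):
    """Inner product of parameter gradients at initialization, width m."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        v = rng.gauss(0.0, 1.0)
        wx, wy = dot(w, x), dot(w, y)
        # Gradients w.r.t. v_j contribute relu(w.x) * relu(w.y).
        total += max(wx, 0.0) * max(wy, 0.0)
        # Gradients w.r.t. w_j contribute v_j^2 * 1{w.x > 0} 1{w.y > 0} <x, y>.
        if wx > 0.0 and wy > 0.0:
            total += v * v * dot(x, y)
    return 2.0 * total / m


x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
exact = ntk_two_layer(x, y)
approx = ntk_finite_width(x, y, m=40000)
print(f"closed form: {exact:.4f}, width-40000 estimate: {approx:.4f}")
```

The Monte Carlo estimate fluctuates at the usual $O(m^{-1/2})$ scale, so agreement is only expected up to a few hundredths at this width.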
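We define a fully-connected neural network by $f(x;\theta) = \sqrt{\frac{2}{m_n}} \langle w^{n+1}, a^n \rangle$, with $a^1 = \sigma(W^1 x)$, and
$$a^\ell = \sigma\Big(\sqrt{\tfrac{2}{m_{\ell-1}}} \, W^\ell a^{\ell-1}\Big), \quad \ell = 2, \ldots, n,$$
where $W^\ell \in \mathbb{R}^{m_\ell \times m_{\ell-1}}$ (with $m_0 = p$) and $w^{n+1} \in \mathbb{R}^{m_n}$ are initialized with i.i.d. $\mathcal{N}(0, 1)$ entries, and $\sigma(u) = (u)_+$ is the ReLU activation and is applied element-wise. The corresponding NTK is defined recursively by $K(x, x') = K_n(x, x')$ with $K_0(x, x') = \Sigma_0(x, x') = x^\top x'$, and for $\ell \geq 1$,
$$\Sigma_\ell(x, x') = 2 \, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, B_\ell)}[\sigma(u) \sigma(v)], \qquad K_\ell(x, x') = \Sigma_\ell(x, x') + 2 K_{\ell-1}(x, x') \, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, B_\ell)}[\sigma'(u) \sigma'(v)],$$
where $B_\ell = \begin{pmatrix} \Sigma_{\ell-1}(x, x) & \Sigma_{\ell-1}(x, x') \\ \Sigma_{\ell-1}(x, x') & \Sigma_{\ell-1}(x', x') \end{pmatrix}$. Using a change of variables and definitions of arc-cosine kernels of degrees 0 and 1, it is easy to show that
$$\Sigma_\ell(x, x') = \sqrt{\Sigma_{\ell-1}(x, x) \Sigma_{\ell-1}(x', x')} \; \kappa_1\Big(\frac{\Sigma_{\ell-1}(x, x')}{\sqrt{\Sigma_{\ell-1}(x, x) \Sigma_{\ell-1}(x', x')}}\Big),$$
$$K_\ell(x, x') = \Sigma_\ell(x, x') + K_{\ell-1}(x, x') \; \kappa_0\Big(\frac{\Sigma_{\ell-1}(x, x')}{\sqrt{\Sigma_{\ell-1}(x, x) \Sigma_{\ell-1}(x', x')}}\Big),$$
where $\kappa_0$ and $\kappa_1$ are defined in (5).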
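The recursion can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `ntk_fully_connected` is hypothetical, and the recursion assumed is $\Sigma_\ell = \sqrt{\Sigma_{\ell-1}(x,x)\Sigma_{\ell-1}(x',x')}\,\kappa_1(u_\ell)$ and $K_\ell = \Sigma_\ell + K_{\ell-1}\,\kappa_0(u_\ell)$, with $u_\ell$ the normalized covariance at layer $\ell - 1$ and $K_0 = \Sigma_0 = x^\top x'$.

```python
import math


def kappa0(u):
    """Arc-cosine kernel of degree 0 (assumed standard closed form)."""
    u = max(-1.0, min(1.0, u))
    return (math.pi - math.acos(u)) / math.pi


def kappa1(u):
    """Arc-cosine kernel of degree 1 (assumed standard closed form)."""
    u = max(-1.0, min(1.0, u))
    return (u * (math.pi - math.acos(u)) + math.sqrt(1.0 - u * u)) / math.pi


def ntk_fully_connected(x, y, n_layers):
    """NTK of a ReLU network with n_layers hidden layers, via the recursion."""
    dot = lambda a, b: sum(s * t for s, t in zip(a, b))
    sxx, syy, sxy = dot(x, x), dot(y, y), dot(x, y)  # Sigma_0
    k = sxy  # K_0
    norm = math.sqrt(sxx * syy)
    for _ in range(n_layers):
        u = sxy / norm
        sxy_next = norm * kappa1(u)       # Sigma_l(x, y)
        k = sxy_next + k * kappa0(u)      # K_l(x, y)
        sxy = sxy_next
        # Sigma_l(x, x) = Sigma_{l-1}(x, x) since kappa1(1) = 1,
        # so the normalization `norm` does not change across layers.
    return k


print(ntk_fully_connected([1.0, 0.0], [0.0, 1.0], n_layers=3))
```

With one hidden layer the function reduces to the two-layer NTK, and on the diagonal it gives $K_n(x, x) = (n + 1)\|x\|^2$, since $\kappa_0(1) = \kappa_1(1) = 1$.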
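We now provide a reformulation of the previous kernel in terms of explicit feature maps, which provides a representation of the data and makes our study of stability in Section 4 more convenient. For a given input Hilbert space $\mathcal{H}$, we denote by $\varphi_{\mathcal{H},1}$ the kernel mapping into the RKHS for the kernel $(z, z') \mapsto \|z\| \|z'\| \kappa_1(\langle z, z' \rangle / \|z\| \|z'\|)$, and by $\varphi_{\mathcal{H},0}$ the kernel mapping into the RKHS for the kernel $(z, z') \mapsto \kappa_0(\langle z, z' \rangle / \|z\| \|z'\|)$. We will abuse notation and hide the input space, simply writing $\varphi_1$ and $\varphi_0$.
The NTK for the fully-connected network can be defined as $K(x, x') = \langle \Phi_n(x), \Phi_n(x') \rangle$, with $\Phi_0(x) = \Psi_0(x) = x$ and for $\ell \geq 1$,
$$\Psi_\ell(x) = \varphi_1(\Psi_{\ell-1}(x)), \qquad \Phi_\ell(x) = \begin{pmatrix} \varphi_0(\Psi_{\ell-1}(x)) \otimes \Phi_{\ell-1}(x) \\ \varphi_1(\Psi_{\ell-1}(x)) \end{pmatrix},$$
where $\otimes$ is the tensor product.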
In this section we study NTKs for convolutional networks (CNNs) on signals, focusing on the ReLU activation. We consider signals $x$ in $\ell^2(\mathbb{Z}^d, \mathbb{R}^{m_0})$, that is, signals $x[u]$ with $u \in \mathbb{Z}^d$ denoting the location, $x[u] \in \mathbb{R}^{m_0}$, and $\|x\|^2 := \sum_{u \in \mathbb{Z}^d} \|x[u]\|^2 < \infty$ (for instance, $d = 2$ and $m_0 = 3$ for RGB images). The infinite support allows us to avoid dealing with boundary conditions when considering deformations and pooling. The precise study of $\ell^2$ membership is deferred to Section 4.
We define two linear operators $P^k$ and $A^k$ on $\ell^2(\mathbb{Z}^d)$ for extracting patches and performing (linear) pooling at layer $k$, respectively. For an $\mathcal{H}$-valued signal $x[u]$, $P^k$ is defined by $P^k x[u] = |S_k|^{-1/2} (x[u + v])_{v \in S_k} \in \mathcal{H}^{|S_k|}$, where $S_k$ is a finite subset of $\mathbb{Z}^d$ defining the patch shape (e.g., a 3x3 box). Pooling is defined as a convolution with a linear filter $h_k[u]$, e.g., a Gaussian filter at scale $\sigma_k$, that is, $A^k x[u] = \sum_{v \in \mathbb{Z}^d} h_k[u - v] x[v]$. In this discrete setting, we can easily include a downsampling operation with factor $s_k$ by changing the definition of $A^k$ to $A^k x[u] = \sum_{v \in \mathbb{Z}^d} h_k[s_k u - v] x[v]$ (in particular, if $h_k$ is a Dirac at 0, we obtain a CNN with “strided convolutions”). In fact, our NTK derivation supports general linear operators $A^k$ on scalar signals.
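The two operators are easy to prototype on a one-dimensional scalar signal. The sketch below is illustrative and makes several assumptions that are not in the text: signals are finite lists with circular (periodic) boundary handling instead of infinite support, patches are normalized by $|S|^{-1/2}$ so that patch extraction preserves the norm, pooling with downsampling factor $s$ is taken to be $\sum_v h[su - v]\, x[v]$, and all function names are hypothetical.

```python
import math


def extract_patches(x, offsets):
    """Patch extraction: u -> |S|^(-1/2) * (x[u + v] for v in S), circular."""
    n = len(x)
    scale = 1.0 / math.sqrt(len(offsets))
    return [[scale * x[(u + v) % n] for v in offsets] for u in range(n)]


def gaussian_filter(scale, radius):
    """Normalized Gaussian filter h[t] for t in [-radius, radius]."""
    taps = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (t / scale) ** 2) for t in taps]
    total = sum(weights)
    return {t: w / total for t, w in zip(taps, weights)}


def pool(x, h, stride=1):
    """Linear pooling with downsampling: u -> sum_t h[t] * x[stride * u - t]."""
    n = len(x)
    return [
        sum(w * x[(stride * u - t) % n] for t, w in h.items())
        for u in range(n // stride)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
patches = extract_patches(signal, offsets=[-1, 0, 1])
smoothed = pool(signal, gaussian_filter(scale=1.0, radius=2), stride=2)
strided = pool(signal, {0: 1.0}, stride=2)  # Dirac filter: plain subsampling
```

With a Dirac filter the pooling step reduces to subsampling, which corresponds to the strided-convolution case mentioned above.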
For defining the NTK feature map, we also introduce the following non-linear point-wise operator , given for two signals , by
where are kernel mappings of arc-cosine 0/1 kernels, as defined in Section 2.1.
We consider a network , with
where and are initialized with entries, and denotes the signal with applied element-wise to . We are now ready to state our result on the NTK for this model.
The NTK for the above CNN, obtained when the number of feature maps (sequentially), is given by , with , where and are defined recursively for a given input by , and for ,
with the abuse of notation for a signal .
The proof is given in Appendix A.2, where we also show that in the over-parameterization limit, the pre-activations tend to a Gaussian process with covariance (this is related to recent papers [20, 32] studying Gaussian process limits of Bayesian convolutional networks). The proof is by induction and relies on similar arguments to  for fully-connected networks, in addition to exploiting linearity of the operators and , as well as recursive feature maps for hierarchical kernels. The recent papers [3, 41] also study NTKs for certain convolutional networks; in contrast to these works, our derivation considers general signals in , supports intermediate pooling or downsampling by changing , and provides a more intuitive construction through kernel mappings and the operators and . Note that the feature maps are defined independently from the , and in fact correspond to more standard multi-layer deep kernel machines [8, 13, 15, 27] or covariance functions of certain deep Bayesian networks [20, 24, 32]. They can also be seen as the feature maps of the limiting kernel that arises when only training weights in the last layer and fixing other layers at initialization (see, e.g., ).
In this section, we study smoothness and approximation properties of the RKHS defined by neural tangent kernels for two-layer networks. For ReLU activations, we show that the NTK kernel mapping is not Lipschitz, but satisfies a weaker smoothness property. In Section 3.2, we characterize the RKHS for ReLU activations and study its approximation properties and benefits. Finally, we comment on the use of other activations in Section 3.3.
Here we study the RKHS $\mathcal{H}$ of the NTK for two-layer ReLU networks, defined in (3), focusing on smoothness properties of the kernel mapping, denoted $\Phi(\cdot)$. Recall that smoothness of the kernel mapping guarantees smoothness of functions $f \in \mathcal{H}$, through the relation
$$|f(x) - f(y)| \leq \|f\|_{\mathcal{H}} \, \|\Phi(x) - \Phi(y)\|_{\mathcal{H}}.$$
We begin by showing that the kernel mapping for the NTK is not Lipschitz. This is in contrast to the kernel $\kappa_1$ in (5), obtained by fixing the weights in the first layer and training only the second layer weights (its kernel mapping $\varphi_1$ is 1-Lipschitz by [8, Lemma 1]).
The kernel mapping $\Phi$ of the two-layer NTK is not Lipschitz:
$$\sup_{x \neq y} \frac{\|\Phi(x) - \Phi(y)\|_{\mathcal{H}}}{\|x - y\|} = +\infty.$$
This is true even when looking only at points $x, y$ on the sphere. It follows that the RKHS $\mathcal{H}$ contains unit-norm functions with arbitrarily large Lipschitz constant.
Note that the instability is due to $\varphi_0$, the kernel mapping of the arc-cosine 0 kernel, which comes from gradients w.r.t. first layer weights. We now show that a weaker guarantee holds nevertheless, resembling 1/2-Hölder smoothness.
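We have the following smoothness properties:
1. For $x, y$ such that $\|x\| = \|y\| = 1$, the kernel mapping $\varphi_0$ of the arc-cosine 0 kernel satisfies $\|\varphi_0(x) - \varphi_0(y)\| \leq \sqrt{\|x - y\|}$.
2. For general non-zero $x, y$, we have $\|\varphi_0(x) - \varphi_0(y)\| \leq \sqrt{\frac{1}{\min(\|x\|, \|y\|)} \|x - y\|}$.
3. The kernel mapping $\Phi$ of the NTK then satisfies
$$\|\Phi(x) - \Phi(y)\| \leq \sqrt{\min(\|x\|, \|y\|) \, \|x - y\|} + 2 \|x - y\|.$$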
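Both the failure of the Lipschitz property and the weaker Hölder-type behavior can be observed numerically on the sphere. The sketch below is an illustration under assumptions: it uses the standard closed form of the two-layer ReLU NTK for unit vectors at angle $\theta$, the identity $\|\Phi(x) - \Phi(y)\|^2 = K(x, x) + K(y, y) - 2K(x, y)$, and a candidate bound of the form $\sqrt{\|x - y\|} + 2\|x - y\|$ for unit-norm inputs. Function names are hypothetical.

```python
import math


def ntk_sphere(theta):
    """Two-layer ReLU NTK between unit vectors at angle theta in [0, pi]."""
    u = math.cos(theta)
    k0 = 1.0 - theta / math.pi
    k1 = (u * (math.pi - theta) + math.sin(theta)) / math.pi
    return u * k0 + k1


def feature_distance(theta):
    """RKHS distance between kernel mappings; K(x, x) = 2 on the sphere."""
    return math.sqrt(max(0.0, 4.0 - 2.0 * ntk_sphere(theta)))


def input_distance(theta):
    """Euclidean distance between unit vectors at angle theta."""
    return 2.0 * math.sin(theta / 2.0)


def lipschitz_ratio(theta):
    return feature_distance(theta) / input_distance(theta)


def holder_bound(theta):
    d = input_distance(theta)
    return math.sqrt(d) + 2.0 * d


for theta in (1e-1, 1e-2, 1e-3, 1e-4):
    ok = feature_distance(theta) <= holder_bound(theta)
    print(f"theta={theta:.0e}  ratio={lipschitz_ratio(theta):8.2f}  bound holds: {ok}")
```

The ratio keeps growing as the two points approach each other, roughly like $\theta^{-1/2}$, which is the signature of 1/2-Hölder rather than Lipschitz behavior.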
In the previous section, we found that the NTK for two-layer ReLU networks yields weaker smoothness guarantees compared to the kernel obtained when the first layer is fixed. We now show that the NTK has better approximation properties, by studying the RKHS through a spectral decomposition of the kernel and the decay of the corresponding eigenvalues. This highlights a tradeoff between smoothness and approximation.
The next proposition gives the Mercer decomposition of the NTK in (4), where are in the sphere . The decomposition is given in the basis of spherical harmonics, as is common for dot-product kernels [38, 39], and our derivation uses results by Bach  on similar decompositions of positively homogeneous activations of the form . See Appendix C for background and proofs.
For any , we have the following decomposition of the NTK :
where are spherical harmonic polynomials of degree , and the non-negative eigenvalues satisfy , if with , and otherwise as , with a constant depending only on . Then, the RKHS is described by:
The zero eigenvalues prevent certain functions from belonging to the RKHS, namely those with non-zero Fourier coefficients on the corresponding basis elements. Here, a sufficient condition for all such coefficients to be zero is that the function is even. Note that for the arc-cosine 1 kernel $\kappa_1$, we have a faster decay, leading to a “smaller” RKHS (see Lemma 17 in Appendix C). Moreover, the asymptotic equivalent comes from the term $u \kappa_0(u)$ in the definition (4) of $\kappa$, which comes from gradients of first layer weights; the second layer gradients yield $\kappa_1$, whose contribution to the eigenvalues becomes negligible for large degree. We use an identity also used in a recent paper which compares similar kernels in a specific high-dimensional limit for generic activations; in contrast to that work, we focus on ReLUs and study eigenvalue decays in finite dimension. Note that our result is also related to eigenvalue decays of integral operators for learning problems (up to a change of measure), which can determine, e.g., non-parametric rates of convergence (e.g., [10, 19]) as well as degrees-of-freedom quantities for kernel approximation (e.g., [6, 35]). It is also related to the rate of convergence of gradient descent in the lazy training regime, which depends on the minimum eigenvalue of the empirical kernel matrix in [11, 16, 17].
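The vanishing odd eigenvalues and the slower decay of the NTK relative to the arc-cosine 1 kernel can be illustrated in the simplest special case of the unit circle ($p = 2$). This special case is an assumption made for illustration and is not treated separately in the text: on the circle, spherical harmonics of degree $k$ reduce to $\cos(k\theta)$ and $\sin(k\theta)$, and the eigenvalues are proportional to the cosine coefficients of $\theta \mapsto \kappa(\cos\theta)$. The sketch, with hypothetical function names, computes these coefficients with a midpoint rule.

```python
import math


def kappa_ntk(theta):
    """kappa(cos theta) for the two-layer ReLU NTK, theta in [0, pi]."""
    return 2.0 * math.cos(theta) * (1.0 - theta / math.pi) + math.sin(theta) / math.pi


def kappa_arccos1(theta):
    """kappa_1(cos theta) for the arc-cosine kernel of degree 1."""
    return math.cos(theta) * (1.0 - theta / math.pi) + math.sin(theta) / math.pi


def cosine_coefficient(kernel, k, n_grid=20000):
    """(2 / pi) * integral over [0, pi] of kernel(theta) cos(k theta)."""
    h = math.pi / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) * h  # midpoint of the i-th cell
        total += kernel(theta) * math.cos(k * theta)
    return 2.0 / math.pi * total * h


for k in range(1, 9):
    a_ntk = cosine_coefficient(kappa_ntk, k)
    a_ac1 = cosine_coefficient(kappa_arccos1, k)
    print(f"k={k}  NTK: {a_ntk: .2e}  arc-cosine 1: {a_ac1: .2e}")
```

In this circle case, a direct calculation gives the even-degree NTK coefficients as $\frac{4}{\pi^2} \frac{k^2 + 3}{(k^2 - 1)^2}$, decaying like $k^{-2}$, whereas those of the arc-cosine 1 kernel are $\frac{8}{\pi^2 (k^2 - 1)^2}$, decaying like $k^{-4}$; coefficients of odd degree $k \geq 3$ vanish for both.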
We now provide sufficient conditions for a function to be in , as well as rates of approximation of Lipschitz functions on the sphere, adapting results of  (specifically Proposition 2 and 3 in ) to our NTK setting.
Let be an even function such that all -th order derivatives exist and are bounded by for , with . Then with , where is a constant that only depends on .
Let be an even function such that and , for all . There is a function with , where is larger than a constant depending only on , such that
For both results, there is an improvement over , for which Corollary 6 requires bounded derivatives, and Corollary 7 leads to a weaker rate in (see [5, Propositions 2 and 3] with ). These results show that in the over-parameterized regime of the NTK, training multiple layers leads to better approximation properties compared to only training the last layer, which corresponds to using instead of . In the different regime of “convex neural networks” (e.g., [5, 37]) where neurons can be selected with a sparsity-promoting penalty, the approximation rates shown in  for ReLU networks are also weaker than for the NTK in the worst case (though the regime presents benefits in terms of adaptivity), suggesting that perhaps in some situations the “lazy” regime of the NTK could be preferred over the regime where neurons are selected using sparsity.
When inputs do not lie on the sphere but in , the NTK for two-layer ReLU networks takes the form of a homogeneous dot-product kernel (3), which defines a different RKHS that we characterize below in terms of the RKHS of the NTK on the sphere.
The RKHS of the kernel on consists of functions of the form with , where is the RKHS on the sphere, and we have .
Note that while such a restriction to homogeneous functions may be limiting, one may easily obtain non-homogeneous functions by considering an augmented variable and defining , where is now defined on the -sphere . When inputs are in a ball of radius , this reformulation preserves regularity properties (see [5, Section 3]).
In this section, we look at smoothness of two-layer networks with different activation functions. Following the derivation for the ReLU in Section 2.1, the NTK for a general activation is given by
We then have the following result.
Assume that is twice differentiable and that the quantities for are bounded, with . Then, for on the unit sphere, the kernel mapping of satisfies
The proof uses results from  on relationships between activations and the corresponding kernels, as well as smoothness results for dot-product kernels in  (see Appendix B.3). If, for instance, we consider the exponential activation , we have for all (using results from ), so that the kernel mapping is Lipschitz with constant . For the soft-plus activation , we may evaluate the integrals numerically, obtaining , so that the kernel mapping is Lipschitz with constant .
In this section, we study smoothness and stability properties of the NTK kernel mapping for convolutional networks with ReLU activations. In order to properly define deformations, we consider continuous signals in instead of (i.e., we have ), following [8, 28]. The goal of deformation stability guarantees is to ensure that the data representation (in this case, the kernel mapping ) does not change too much when the input signal is slightly deformed, for instance with a small translation or rotation of an image—a useful inductive bias for natural signals. For a -diffeomorphism , denoting the action operator of the diffeomorphism, we will show a guarantee of the form
where is the maximum operator norm of the Jacobian over , , is an increasing function and a positive constant. The second term controls translation invariance, and typically decreases with the scale of the last pooling layer ( below), while the first term controls deformation stability, since measures the “size” of deformations. The function is typically a linear function of in other settings [8, 28], here we will obtain a faster growth of order for small , due to the weaker smoothness that arises from the arc-cosine 0 kernel mappings.
In this continuous setup, is now given for a signal by , where is the Lebesgue measure. We then have , and considering normalized Gaussian pooling filters, we have by Young’s inequality . The non-linear operator is defined point-wise analogously to (8), and satisfies . We thus have that the feature maps in the continuous analog of the NTK construction in Proposition 2 are in as long as is in . Note that this does not hold for some smooth activations, where may be a positive constant even when , leading to unbounded norm for . The next lemma studies the smoothness of , extending results from Section 3.1 to signals in .
For two signals , we have
Following , we introduce an initial pooling layer , corresponding to an anti-aliasing filter, which is necessary to allow stability and is a reasonable assumption given that in practice the inputs are discrete signals, for which high frequencies have typically been filtered by an acquisition device. Thus, we consider the kernel representation , with as in Proposition 2. We also assume that patch sizes are controlled by the scale of pooling filters, that is
for some constant , where is the scale of the pooling operation , which typically increases exponentially with depth, corresponding to a fixed downsampling factor at each layer in the discrete case. By a simple induction, we can show the following.
We have , and
We now present our main guarantee on deformation stability for the NTK kernel mapping.
Let , and assume . We have the following stability bound:
where are constants depending only on , and also depends on defined in (13).
The proof is given in Appendix B. Compared to the bound in , the first term shows weaker stability due to faster growth with , which comes from (12). The dependence in is also poorer ( instead of ), however note that in contrast to , the norm and smoothness constants of in Lemma 11 grow with here, partially explaining this gap. We also note that as in , choosing small (i.e., small patches in a discrete setting) is more helpful to improve stability than a small number of layers , given that increases with as , while typically decreases with as when one seeks a fixed target level of translation invariance (see [8, Section 3.2]).
By fixing weights of all layers but the last, we would instead obtain feature maps of the form (using notation from Proposition 2), which satisfy the improved stability guarantee of . This again hints at a tradeoff between stability and approximation, suggesting that one may be able to learn less stable but more discriminative functions in the NTK regime by training all layers.
In this paper, we have studied the inductive bias of the “lazy training” regime for over-parameterized neural networks, by considering the neural tangent kernel of different architectures, and analyzing properties of the corresponding RKHS, which characterizes the functions that can be learned efficiently in this regime. We find that the NTK for ReLU networks has better approximation properties compared to other neural network kernels, but weaker smoothness properties, although these can still guarantee a form of stability to deformations for CNN architectures, providing an important inductive bias for natural signals. While these properties may help obtain better performance when large amounts of data are available, they can also lead to a poorer estimation error when data is scarce, a setting in which smoother kernels or better regularization strategies may be helpful.
It should be noted that while our study of functions in the RKHS may determine what target functions can be learned by over-parameterized networks, the obtained networks with finite neurons do not belong to the same RKHS, and hence may be less stable than such target functions, at least outside of the training data, due to approximations both in the linearization (1) and between the finite neuron and limiting kernels. Finally, we note that while this “lazy” regime is interesting and could partly explain the success of deep learning methods, it does not explain, for instance, the common behavior in early layers where neurons move to select useful features in the data, such as Gabor filters. In particular, such a behavior might provide better statistical efficiency by adapting to simple structures in the data, something which is not captured in a kernel regime like the NTK. It would be interesting to study inductive biases in a regime somewhere in between, where neurons may move at least in the first few layers.
This work was supported by the ERC grant number 714381 (SOLARIS project) and by the MSR-Inria joint centre. The authors thank Francis Bach and Lénaïc Chizat for useful discussions.
Proceedings of the International Conference on Machine Learning (ICML), 2019.
Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research (JMLR), 18(19):1–53, 2017.
Learning with kernels: support vector machines, regularization, optimization, and beyond. 2001.
We begin by proving the following lemma, which characterizes the Gaussian process behavior of the pre-activations , seen as a function of and , in the over-parameterization limit.
As , the pre-activations for tend (in law) to i.i.d. centered Gaussian processes with covariance
We show this by induction. For , is clearly Gaussian, and we have
Writing the vector of weights for the filter associated to the input feature map and output feature map , we have . Then we have
by noticing that .
Now, for , we have by similar arguments that conditioned on , is Gaussian, with covariance
By the inductive hypothesis, as a function of and tend to Gaussian processes in the limit. By the law of large numbers, we have, as,
Since this covariance is deterministic, the pre-activations are also unconditionally a Gaussian process in the limit, with covariance .
Now it remains to show that
Notice that by linearity of and , it suffices to show
for any (the last equality follows from the definition of ). Noting that