Despite the empirical success of neural networks and other highly parameterized machine learning methods, a major open question remains: why do these methods perform well? Classical learning theory does not predict or explain the success of large, often overparameterized models (Lawrence et al., 1997; Harvey et al., 2017; Bartlett et al., 2019). Most models are highly expressive (Poole et al., 2016; Raghu et al., 2017), yet can still generalize when trained on far fewer data points than parameters (He et al., 2016; Szegedy et al., 2016; Dai et al., 2019).
In particular, many functions of importance to physics (Mills et al., 2017; Gao and Duan, 2017; Carrasquilla and Melko, 2017; Raissi and Karniadakis, 2018), chemistry (Rupp et al., 2012; Faber et al., 2017), and biology (Kosciolek and Jones, 2016; Ainscough et al., 2018; Xu, 2019), among other fields, can be learned using neural networks and kernel methods. This leads to the question: can we understand what types of functions can be learned efficiently with particular methods? Recent theoretical work has focused on answering this question by constructing bounds on the generalization error, given properties of the model and data distribution. Specifically, for very wide networks with specific kernels or activations (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019; Arora et al., 2019; Lee et al., 2019), data-dependent generalization bounds can be derived by relating wide networks to kernel learning with a specific network-induced kernel, known as the neural tangent kernel (NTK) (Jacot et al., 2018). These bounds, while not tight, can be used to mathematically justify why neural networks trained on noisy labels can achieve low training error but fail to generalize, while those trained on real data generalize well. It is unclear, however, whether they give a sense of the relative difficulty of learning different types of functions with different types of methods.
1.1 Our contributions
To explain and further understand the efficacy of deep learning in numerous applications, we present generalization bounds on learning analytic functions on the unit sphere with any kernel method or sufficiently wide neural network (Section 3.1). In the particular case of a wide, one-hidden-layer ReLU network, we present a succinct bound on the number of samples needed to guarantee low test error. Informally, we prove the following:
Theorem 1.1 (informal).
Given an analytic function , the function , for fixed and inputs is learnable to error with samples, with
where the are the power series coefficients of .
We prove a much more general version for multivariate analytic functions:
Theorem 1.2 (informal).
Given a multivariate analytic function for in the -dimensional unit ball, there is a function as defined in Theorem 1.1 such that is learnable to error with samples.
Using Theorem 1.2, we develop a calculus of bounds, showing that the sum, product, and composition of learnable functions is also learnable, with bounds constructed using the familiar product and chain rules of univariate calculus (Corollaries 1.1, 1.2, and 1.3). These bounds can also be applied when the function has a singularity, provided that the data is sampled away from the singularity.
Since many functions used in scientific theories and models fall into this function class, our calculus allows for a clear quantifiable explanation for why neural network models have had successful applications to many of those fields. As an important example from physics, we consider the forces between bodies with positions interacting via Newtonian gravitation:
We show that, as long as there is some minimum distance between the and , we can still use the calculus of bounds to show that the
force vectors can be efficiently learned. We prove the following:
Theorem 2 (informal).
A wide, one-hidden layer ReLU network can learn the force law between gravitational bodies up to error using only samples.
Lastly, we compare our generalization bounds for the ReLU network with those for more traditional kernel learning methods. Specifically, we show asymptotically weaker bounds for other models, including for kernel regression with Gaussian kernels, providing some theoretical evidence for why neural networks with ReLU activation (or their induced kernels) often outperform standard Gaussian kernels. We support our theoretical work with numerical experiments on a synthetic dataset that show that wide networks can learn the gravitational force law with minimal fine-tuning, and achieve lower test and generalization error than the standard Gaussian kernel counterpart. Our results suggest that networks with better theoretical learning bounds may in fact perform better in practice as well, even when theoretical bounds are pessimistic.
Classical tools like VC dimension (Vapnik, 2000) are insufficient to explain the performance of overparameterized neural networks (Bartlett et al., 2019). Networks generalize well even though they are often very expressive (Poole et al., 2016; Raghu et al., 2017), can memorize random data (Zhang et al., 2017; Arpit et al., 2017), and have correspondingly large VC dimensions (Maass, 1994; Harvey et al., 2017; Bartlett et al., 2019).
Some progress has been made by focusing on the “implicit regularization” provided by training dynamics (Gunasekar et al., 2018; Du and Lee, 2018; Arora et al., 2018). In particular, SGD biases networks to solutions with small weight changes under the norm (plus any additional regularization), which has been used to inspire various norm-based bounding strategies (Neyshabur et al., 2018; Allen-Zhu et al., 2019; Arora et al., 2019). While many of these bounds must be computed post-training, some bounds can be computed using the architecture alone, and show that function classes like outputs of small networks with smooth activations can be efficiently learned with large networks (Allen-Zhu et al., 2019; Arora et al., 2019; Du et al., 2019).
Recently, learning bounds for very wide networks have been derived by combining insights on learning dynamics with more classical generalization error bounds in kernel learning. In the limit of infinite width, the total change in each individual parameter is small, the outputs of the network are linear in the weight changes, and it can be shown that the learning dynamics are largely governed by the neural tangent kernel (NTK) of the corresponding network (Jacot et al., 2018). Given the -dimensional vector of model outputs on the training data, as a function of trainable parameters , the empirical tangent kernel is the matrix given by
Since the derivatives are only taken with respect to the parameters that are trained, the tangent kernel is different if particular layers of the network are fixed after initialization. The empirical kernel concentrates around some limiting matrix in the limit of a large number of parameters. The NTK kernel evaluated on two inputs corresponds to the limiting value of evaluated at and .
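As a concrete illustration of this definition, the empirical tangent kernel of a small network can be estimated directly from parameter gradients. The following sketch is our own (not from the paper): the network sizes are arbitrary, the Jacobian of the outputs with respect to all trainable parameters is built by central finite differences, and the kernel is formed as K = J Jᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 3, 64, 5                         # input dim, hidden width, data points
X = rng.normal(size=(n, d))
W = rng.normal(size=(h, d)) / np.sqrt(d)   # hidden-layer weights
v = rng.normal(size=h) / np.sqrt(h)        # output weights

def forward(params, x):
    # one-hidden-layer ReLU network, all parameters flattened into one vector
    W_, v_ = params[:h * d].reshape(h, d), params[h * d:]
    return v_ @ np.maximum(W_ @ x, 0.0)

theta = np.concatenate([W.ravel(), v])

def jacobian(theta, X, eps=1e-6):
    # J[i, j] = d f(x_i) / d theta_j, via central differences
    J = np.zeros((len(X), theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = eps
        J[:, j] = [(forward(theta + e, x) - forward(theta - e, x)) / (2 * eps)
                   for x in X]
    return J

J = jacobian(theta, X)
K = J @ J.T                                     # empirical tangent kernel, n x n
assert np.allclose(K, K.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-8)    # positive semidefinite
```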
As an example, the kernel function for inputs and into a single hidden layer fully-connected network, with non-linearity , with only final layer weights trained is (Jacot et al., 2018; Lee et al., 2019)
where the expectation is taken over the hidden layer weights. Note here that is a function of , and only.
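For the ReLU non-linearity, this expectation over random hidden weights has a known closed form (the degree-1 arc-cosine kernel). As a sanity check of that correspondence, the sketch below (our own; the sample count is arbitrary) compares a Monte Carlo estimate of the expectation with the closed form for unit-norm inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x  = rng.normal(size=d); x /= np.linalg.norm(x)
xp = rng.normal(size=d); xp /= np.linalg.norm(xp)

# Monte Carlo estimate of E_w[relu(w.x) relu(w.x')], with w ~ N(0, I)
W = rng.normal(size=(500_000, d))
mc = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))

# closed form (degree-1 arc-cosine kernel) for unit-norm inputs
t = float(np.clip(x @ xp, -1, 1))
theta = np.arccos(t)
exact = (np.sin(theta) + (np.pi - theta) * t) / (2 * np.pi)

assert abs(mc - exact) / exact < 0.05
```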
With MSE loss, the learning dynamics in the wide network regime is similar to kernel regression or kernel learning. Rademacher complexity (Koltchinskii and Panchenko, 2000) can be used to generate learning bounds for wide networks near the infinite width limit, such as in (Arora et al., 2019). The following theorem shows that if for training labels , the product is bounded by , then wide networks trained with SGD have error less than when trained with samples:
Theorem ((Arora et al., 2019), 3.3).
Let be a function over , and be a distribution over the inputs. Let be a 1-Lipschitz loss function. Consider training a two-layer ReLU network to learn using SGD with MSE loss on i.i.d. samples from . Define the generalization error of the trained model as
where is the Gram matrix whose elements correspond to the NTK kernel evaluated at pairs of the i.i.d. training examples, and is the -dimensional vector of training labels.
If there is some -independent constant such that , then with probability at least , samples are sufficient to ensure generalization error less than .
We define to be the Euclidean norm unless otherwise specified, and to be the dot product between vectors . For a vector and a scalar , we define and other operations analogously. For the remainder of this paper, we focus on the learnability of functions under different learning algorithms, and so we define efficiently learnable functions as follows:
Given a learning algorithm, we say that a function over a distribution of inputs is efficiently learnable if, given an error scale , with probability greater than , the generalization error of the trained model with respect to any 1-Lipschitz loss function is less than when the training data consists of at least i.i.d. samples drawn from , for some -independent constant .
3.1 Kernel learning bounds
In this section, we extend the bounds derived in (Arora et al., 2019) to any kernel that can be written as a power series in the dot product of inputs . We emphasize that our kernel learning bounds also apply in the setting where we train a wide neural network on our data. In Appendix A, we make this relation rigorous and show that Equation 6 applies when training only the upper layer of any wide network, which is equivalent to a draw from the posterior of a Gaussian process with the NTK kernel given by Equation 4. Therefore, we focus on kernels in this section.
We can extend the following corollary, originally proved for wide ReLU networks with trainable hidden layer only:
Corollary ((Arora et al., 2019), 6.2).
Consider the function given by:
Then, if is restricted to , and the NTK kernel can be written as , the function can be learned efficiently with a wide one-hidden-layer network in the sense of Definition 1 with
up to -independent constants of , where . In the particular case of a ReLU network, the bound is
if the are non-zero only for or even.
Extension of Corollary ((Arora et al., 2019), 6.2).
Consider a kernel method or appropriately wide network with only the upper layer trained, with kernel
Then the learning bound in Equation 8 holds for these models as well.
Building off of this learning bound, we will prove in Section 3.2 that all analytic functions are efficiently learnable, via both kernel methods and wide networks.
Equation 8 suggests that kernels with slowly decaying (but still convergent) will give the best bounds for learning polynomials. Many popular kernels do not meet this criterion. For example, for inputs on the sphere of radius , the Gaussian kernel can be written as . This has , which increases rapidly with . This provides theoretical justification for the empirically inferior performance of the Gaussian kernel, which we will present in Section 4.
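The factorial blow-up for the Gaussian kernel can be made concrete. On a sphere of squared radius R², the squared distance is 2R² − 2(x·x'), so the Gaussian kernel factors as a constant times exp(x·x'), whose power-series coefficients in the dot product are c_k = e^{−R²}/k!. The snippet below (with an illustrative radius of our choosing) checks that 1/c_k, which enters the bound, grows factorially:

```python
import math

R2 = 1.0   # squared radius of the input sphere; an illustrative choice
# On the sphere, |x - x'|^2 = 2*R2 - 2*(x . x'), so the Gaussian kernel
# exp(-|x - x'|^2 / 2) = exp(-R2) * exp(x . x') has power-series coefficients
# c_k = exp(-R2) / k! in the dot product (x . x').
c = [math.exp(-R2) / math.factorial(k) for k in range(12)]
inv = [1.0 / ck for ck in c]
# Bounds containing 1/c_k therefore blow up factorially: 1/c_{k+1} = (k+1)/c_k
for k in range(11):
    assert abs(inv[k + 1] / inv[k] - (k + 1)) < 1e-6 * (k + 1)
```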
Guided by this theory, we focus on kernels where , for all . While the ReLU NTK kernel (with inputs on the sphere) satisfies this bound for even positive powers, it fails to satisfy our criterion for odd values of . One way to ensure the bound holds for all is to construct a kernel by hand: for example, with is a valid slowly decaying kernel on the sphere.
Another approach, which keeps the model similar to those used in practice, is to introduce a novel kernel by applying the following modification to the NTK kernel. Consider appending a constant component to the input so that the new input to the network is . The kernel then becomes:
Re-writing the power series as an expansion around , we have terms of all powers. An asymptotic analysis of the coefficients (Appendix B.1) shows that the coefficients are asymptotically , meeting our needs. In particular, this means that the bound in Equation 9 applies to these kernels, without restriction to even . Note that for , the constant function can be learned with samples.
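The claim that shifting the expansion point produces terms of all powers can be checked symbolically. The sketch below uses SymPy with the standard degree-1 arc-cosine (ReLU) kernel for unit inputs; the shift point 1/2 is an arbitrary illustrative stand-in for the effect of appending a constant bias component:

```python
import sympy as sp

t, u = sp.symbols('t u')
# Degree-1 arc-cosine (ReLU) kernel for unit-norm inputs; t = x . x'
kappa = (sp.sqrt(1 - t**2) + (sp.pi - sp.acos(t)) * t) / (2 * sp.pi)

# Around t = 0: odd-power coefficients beyond the linear term vanish ...
s0 = sp.series(kappa, t, 0, 8).removeO()
assert sp.simplify(s0.coeff(t, 3)) == 0
assert sp.simplify(s0.coeff(t, 5)) == 0

# ... but expanding around a shifted point t0 > 0 (the effect of a constant
# bias component appended to the input) produces terms of all powers.
t0 = sp.Rational(1, 2)
s1 = sp.series(kappa.subs(t, t0 + u), u, 0, 6).removeO()
assert all(sp.simplify(s1.coeff(u, k)) != 0 for k in range(6))
```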
3.2 Learning analytic functions
For the remainder of this section, we assume that we are using a GP/wide network with a kernel of the form . Unless otherwise noted, we also assume that for large so Equation 9 applies for all powers of . We will use this to show that all univariate analytic functions are efficiently learnable, and then extend the results to multivariate functions.
Analytic functions are a rich class with a long history of use in the sciences and applied mathematics. A function is analytic if it is infinitely differentiable and locally equal to its Taylor series; this is equivalent to having a locally convergent power series representation, a fact which we will exploit in many of our proofs.
3.2.1 Univariate analytic functions
We start with the univariate case and first prove the following:
Let be a function analytic around , with radius of convergence . Define the auxiliary function by the power series
where the are the power series coefficients of . Then the function , for some fixed vector with is efficiently learnable in the sense of Definition 1 using a model with the slowly decaying kernel with
if the norm is less than .
We first note that the radius of convergence of the power series of is also since is analytic. Applying Equation 9, pulling out the th order term, and factoring out , we get
since . ∎
The relationship between and the original depends on the power series representation of . For example, for , the power series has all positive coefficients and . The worst case scenario is when the power series has alternating sign; for example, for , .
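This coefficient flip can be illustrated numerically. In the sketch below (our own illustration), the auxiliary function is taken to be the series with each coefficient replaced by its absolute value; for the exponential with alternating-sign coefficients, this recovers the exponential with positive argument:

```python
import math

def bar_f(coeffs, x):
    # auxiliary series: replace each power-series coefficient by its absolute value
    return sum(abs(a) * x**k for k, a in enumerate(coeffs))

K = 30
exp_pos = [1 / math.factorial(k) for k in range(K)]          # f(x) = e^x
exp_neg = [(-1)**k / math.factorial(k) for k in range(K)]    # f(x) = e^{-x}

x = 0.7
# e^x already has positive coefficients, so the auxiliary function equals f ...
assert abs(bar_f(exp_pos, x) - math.exp(x)) < 1e-9
# ... while e^{-x} has alternating signs, and the auxiliary function is e^{x},
# which is much larger than f(x) = e^{-x}
assert abs(bar_f(exp_neg, x) - math.exp(x)) < 1e-9
```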
3.2.2 Multivariate analytic functions
The above class of efficiently learnable functions is somewhat limiting; it is “closed” over addition (the sum of learnable functions is learnable), but not over products and composition. The following lemma, proved in Appendix B.2, allows us to generalize:
Given a collection of vectors in , the function is efficiently learnable with
Using this lemma we can prove:
Let be a function with multivariate power series representation:
where the elements of index the th order terms of the power series. We define with coefficients
If the power series of converges at then with high probability can be learned efficiently in the sense of Definition 1 with .
Since the set of efficiently learnable functions is now appropriately “closed” over addition and multiplication, the standard machinery of calculus can be used to prove learning bounds for combinations of functions with known bounds. For example, we have:
Corollary 1.1 (Product rule).
Let and meet the conditions of Theorem 1. Then the product is efficiently learnable as well, with bound
Consider the power series of , which exists and is convergent since each individual series exists and is convergent. Let the elements of and index the th order terms of and the th order terms of respectively. The individual terms in the series look like:
for all terms with and for the term with .
Distribute the product, and first focus on the term only. Summing over all the for all , we get
Now summing over the and we get . If we do the same for the term, after summing we get . These bounds add, and we get the desired formula for , which, up to the additional term, is the product rule applied to and . ∎
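One ingredient of this argument, that the absolute-coefficient series of a product is dominated pointwise (for non-negative arguments) by the product of the absolute-coefficient series, follows from the triangle inequality on each convolution term and can be checked numerically on random polynomials:

```python
import random

def conv(a, b):
    # coefficient sequence of the product of two polynomials
    n = len(a) + len(b) - 1
    return [sum(a[i] * b[k - i] for i in range(len(a)) if 0 <= k - i < len(b))
            for k in range(n)]

def bar(coeffs, x):
    # series with coefficients replaced by absolute values
    return sum(abs(c) * x**k for k, c in enumerate(coeffs))

random.seed(0)
for _ in range(100):
    a = [random.uniform(-1, 1) for _ in range(8)]
    b = [random.uniform(-1, 1) for _ in range(8)]
    x = random.uniform(0, 2)
    # bar(f*g)(x) <= bar(f)(x) * bar(g)(x) for x >= 0
    assert bar(conv(a, b), x) <= bar(a, x) * bar(b, x) + 1e-9
```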
One immediate application for this corollary is the product of many univariate analytic functions. If we define
where each of the corresponding have the appropriate convergence properties, then is efficiently learnable with bound given by
We can also derive the equivalent of the chain rule for function composition:
Corollary 1.2 (Chain rule).
Let be an analytic function and be efficiently learnable, with auxiliary functions and respectively. Then the composition is efficiently learnable as well with bound
provided that and converge (equivalently, if and are in the radius of convergence of ).
Writing out as a power series in , we have:
We can bound each term individually, and use the -wise product rule to bound each term of . Doing this, we have:
Factoring out from the first term and then evaluating each of the series gets us the desired result. ∎
The chain rule bound can be generalized to a -dimensional outermost function , as proved in Appendix C:
Let be analytic, and let be the function obtained by taking the multivariate power series of and replacing all coefficients with their absolute values. Then, if and are both efficiently learnable, is as well, with bound
provided converges at .
3.3 Learning dynamical systems
We can use the product and chain rules to show that many functions important in scientific applications can be efficiently learnable. This is true even when the function has a singularity. As an example demonstrating both, we prove the following bound on learning Newton’s law of gravitation:
Consider a system of bodies with positions and masses , interacting via the force:
where . We assume that , the ratio between the largest and smallest pairwise distance between any two bodies, is constant. Suppose the have been rescaled to be between and . Then the force law is efficiently learnable in the sense of Definition 1 using the modified ReLU kernel to generalization error less than using samples.
We will prove learning bounds for each component of separately, showing efficient learning with probability greater than . Then, using the union bound, the probability of simultaneously learning all the components efficiently will be .
There are two levels of approximation: first, we will construct a function which is within of the original force law, but more learnable. Secondly, we will prove bounds on learning that function to within error .
We first rescale the collective vector of positions so that its length is at most . In these new units, this gives us . The first component of the force on can be written as:
If we find a bound for an individual contribution to the force, we can get a bound on the total . Consider an individual force term in the sum. The force has a singularity at . In addition, the function itself is non-analytic due to the branch cut at .
We instead will approximate the force law with a finite power series in , and get bounds on learning said power series. The power series representation of is . If we approximate the function with terms, the error can be bounded using Taylor’s theorem. The Lagrange form of the error gives us the bound
where we use for large . We can use the above expansion by rewriting
for some shift . Approximation with , the first terms of the power series in gives us the error:
which we want to be small over the range .
The bound is optimized when it takes the same value at and , so we set . In the limit that , where learning is most difficult, the bound becomes
where , which is constant by assumption.
In order to estimate an individual contribution to the force to error (so the total error is ), we must have:
This allows us to choose the smallest which gives us this error. Taking the logarithm of both sides, we have:
where we use that after rescaling. The choice ensures error less than per term.
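The truncation analysis above can be reproduced numerically for a representative inverse-power term. In the sketch below, the exponent −3/2, the range q = 1/2, and the tolerance are illustrative stand-ins for the paper's unspecified constants; the smallest number of series terms M achieving the target error is found by search:

```python
def binom_coeff(alpha, k):
    # generalized binomial coefficient C(alpha, k)
    out = 1.0
    for i in range(k):
        out *= (alpha - i) / (i + 1)
    return out

def trunc_error(M, q, alpha=-1.5, grid=400):
    # worst-case error of the M-term binomial series of (1+u)^alpha on |u| <= q
    worst = 0.0
    for i in range(grid + 1):
        u = -q + 2 * q * i / grid
        approx = sum(binom_coeff(alpha, k) * u**k for k in range(M))
        worst = max(worst, abs((1 + u)**alpha - approx))
    return worst

q, eps = 0.5, 1e-6
M = next(M for M in range(1, 200) if trunc_error(M, q) < eps)
# error decays roughly like q^M, so M scales like log(1/eps) / log(1/q)
assert trunc_error(M, q) < eps and trunc_error(M // 2, q) > eps
```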
Using this approximation, we can use the product and chain rules to get learning bounds on the force law. We can write the approximation
where and . The number of samples needed for efficient learning is bounded by , for
Evaluating, we have
which, after using and , gives us the bound
The asymptotic behavior is
since is bounded.
We can therefore learn an -approximation of one component of , with probability at least and error with samples. Therefore, we can learn to error with the same number of samples. Using a union bound, with probability at least we can simultaneously learn all components of all with that number of samples. ∎
We note that since the cutoff of the power series at dominates the bound, we can easily compute learning bounds for other power-series kernels as well. If the th power series coefficient of the kernel is , then the bound on is increased by . For example, for the Gaussian kernel, since , the bound becomes
which increases the exponent of by a factor of .
We empirically validated our analytical learning bounds by training models to learn the gravitational force function for bodies (with ranging from to ) in a dimensional space. We created synthetic datasets by randomly drawing points from corresponding to the locations of the bodies, and computing the gravitational force (according to Equation 2) on a target body also drawn randomly from . To avoid singularities, we ensured a minimum distance of between the target body and the other bodies (corresponding to the choice ). As predicted by the theory, none of the models learn well if is not fixed. We randomly drew the masses corresponding to the bodies from . We generated million such examples, each example with features corresponding to the location and mass of each of the bodies, and a single label corresponding to the gravitational force on the target body along the -axis. We held out of the dataset as test data to compute the root mean square error (RMSE) in prediction. We trained three different neural networks on this data, corresponding to the various kernels we analyzed in the previous section:
A wide one hidden-layer ReLU network (corresponding to the ReLU NTK kernel).
A wide one hidden-layer ReLU network with a constant bias feature added to the input (corresponding to the NTK kernel in Equation 12).
A wide one hidden-layer network with exponential activation function, where only the top layer of the network is trained (corresponding to the Gaussian kernel).
We used a hidden layer of width for all the networks, as we observed that increasing the network width further did not improve results significantly. All the hidden layer weights were initialized randomly.
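For concreteness, a data-generation routine in the spirit of the setup above might look as follows. The number of bodies, the minimum separation, and the mass range are illustrative placeholders for the unspecified values in the text:

```python
import numpy as np

def make_example(rng, k=8, dim=3, delta=0.1, tries=1000):
    # rejection-sample a configuration respecting the minimum distance delta
    for _ in range(tries):
        target = rng.uniform(-1, 1, size=dim)       # body we predict the force on
        pos = rng.uniform(-1, 1, size=(k, dim))     # the other k bodies
        dists = np.linalg.norm(pos - target, axis=1)
        if np.all(dists > delta):
            break
    mass = rng.uniform(0.5, 1.0, size=k)
    # Newtonian force on the target (unit target mass and G); label = x-component
    force = np.sum(mass[:, None] * (pos - target) / dists[:, None]**3, axis=0)
    features = np.concatenate([pos.ravel(), mass])
    return features, force[0]

rng = np.random.default_rng(0)
X, y = zip(*(make_example(rng) for _ in range(256)))
X, y = np.asarray(X), np.asarray(y)
assert X.shape == (256, 8 * 3 + 8) and np.all(np.isfinite(y))
```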
In Figure 1 we show the normalized RMSE (RMSE/) for each of the neural networks for different values of the number of bodies .
All three networks are able to learn the gravitational force equation with small normalized RMSE for hundreds of bodies. Both the ReLU network and ReLU with bias outperform the network corresponding to the Gaussian kernel (in terms of RMSE) as increases. In particular, the Gaussian kernel learning seems to quickly degrade at around bodies, with a normalized RMSE exceeding . This is consistent with the learning bounds for these kernels in Section 3.1, and suggests that those bounds may in fact be useful to compare the performances of different networks in practice.
We did not, however, observe much difference in the performance of the ReLU network when adding a bias to the input, which suggests that the inability to get an analytical bound due to only even powers in the ReLU NTK kernel might be a shortcoming of the proof technique, rather than a property which fundamentally limits the model.
Our theoretical work shows that the broad and important class of analytic functions is provably learnable with the right kernels (or the equivalent wide networks). The methods which we developed may be useful for proving learnability of other classes of functions, such as the flows induced by finite-time integration of differential equations. Furthermore, in general, there is an open question as to whether these generalization bounds can be substantially improved.
Our experiments suggest that these bounds may be useful for distinguishing which types of models are suited for specific problems. Further experimental and theoretical work is necessary to ascertain whether this holds for the finite-width networks used in practice, or when common hyperparameter tuning/regularization strategies are used during training, such as ARD in kernel learning.
- A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nature Genetics 50 (12), pp. 1735–1743.
- Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. In Advances in Neural Information Processing Systems 32, pp. 6155–6166.
- Learning Polynomials with Neural Networks. In International Conference on Machine Learning, pp. 1908–1916.
- Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. In International Conference on Machine Learning, pp. 322–332.
- Stronger Generalization Bounds for Deep Nets via a Compression Approach. In International Conference on Machine Learning, pp. 254–263.
- A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 233–242.
- Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. Journal of Machine Learning Research 20 (63), pp. 1–17.
- Machine learning phases of matter. Nature Physics 13 (5), pp. 431–434.
- Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988.
- On the Power of Over-parametrization in Neural Networks with Quadratic Activation. In International Conference on Machine Learning, pp. 1329–1338.
- Gradient Descent Provably Optimizes Over-parameterized Neural Networks. In 7th International Conference on Learning Representations (ICLR 2019).
- Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error. Journal of Chemical Theory and Computation 13 (11), pp. 5255–5264.
- Efficient representation of quantum many-body states with deep neural networks. Nature Communications 8 (1), pp. 662.
- Characterizing Implicit Bias in Terms of Optimization Geometry. In International Conference on Machine Learning, pp. 1832–1841.
- Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Conference on Learning Theory, pp. 1064–1068.
- Deep Residual Learning for Image Recognition. pp. 770–778.
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pp. 8571–8580.
- Rademacher Processes and Bounding the Risk of Function Learning. In High Dimensional Probability II, Progress in Probability, pp. 443–457.
- Accurate contact predictions using covariation techniques and machine learning. Proteins: Structure, Function, and Bioinformatics 84 (S1), pp. 145–151.
- Lessons in neural network training: overfitting may be harder than expected. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence (AAAI'97/IAAI'97), pp. 540–545.
- Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. In Advances in Neural Information Processing Systems 32, pp. 8570–8581.
- Neural Nets with Superlinear VC-Dimension. Neural Computation 6 (5), pp. 877–884.
- Deep learning and the Schrödinger equation. Physical Review A 96 (4), pp. 042113.
- The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations.
- Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pp. 3360–3368.
- On the Expressive Power of Deep Neural Networks. In International Conference on Machine Learning, pp. 2847–2854.
- Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics 357, pp. 125–141.
- Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Physical Review Letters 108 (5), pp. 058301.
- Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- The Nature of Statistical Learning Theory. 2nd edition, Information Science and Statistics, Springer-Verlag, New York.
- Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences 116 (34), pp. 16856–16865.
- Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations (ICLR 2017).
Appendix A Kernels and two layer networks
Previous work focused on generalization bounds for training the hidden layers of wide networks with SGD. Here we show that these bounds also apply to the case where only the final layer weights are trained (corresponding to the NNGP kernel in (Lee et al., 2019)). The proof strategy consists of showing that finite-width networks have a sensible infinite-width limit, and showing that training causes only a small change in parameters of the network.
Let be the number of hidden units, and be the number of data points. Let be a random matrix denoting the activations of the hidden layer (as a function of the weights of the lower layer) for all data points. Similarly to (Arora et al., 2019; Du et al., 2019), we will argue that for large enough , even if we take a random input layer and train only the upper layer weights, the generalization error is at most . For our purposes, we define:
which is the NNGP kernel from (Lee et al., 2019).
If the kernel function which generates is given by an infinite Taylor series in , it can be argued that has full rank for most real-world distributions. For example, for the ReLU activation this holds as long as no two data points are co-linear (see Definition 5.1 in (Arora et al., 2019)). We can prove this more explicitly in the general case.
If all the data points are distinct and the Taylor series of in has positive coefficients everywhere then is not singular.
First consider the case where the input is a scalar. Since the Taylor series corresponding to consists of monomials of all degrees of , we can view it as some inner product in a kernel space induced by the function , where the inner product is diagonal (but with potentially different weights) in this basis. For any distinct set of inputs, the set of vectors is linearly independent. The first columns produce the Vandermonde matrix obtained by stacking rows for different values of , which is well known to be non-singular (since a zero eigenvector would correspond to a degree- polynomial with distinct roots).
This extends to the case of multidimensional if the values, projected along some dimension, are distinct. In this case, the kernel space corresponds to the direct sum of copies of applied elementwise to each coordinate . If all the points are distinct and far apart from each other, the probability that a given pair coincides under a random projection is negligible. By a union bound, the probability that any pair coincides is also small, so there must be directions such that the projections along that direction are distinct. Therefore, can be considered to be invertible in general. ∎
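The Vandermonde step used in the scalar case is easy to confirm numerically; the node count below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=5))
assert np.all(np.diff(x) > 0)               # nodes are distinct
V = np.vander(x, increasing=True)           # rows (1, x_i, x_i^2, ...)
assert np.linalg.matrix_rank(V) == 5        # full rank, hence nonsingular
```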
As , concentrates to its expected value. More precisely, approaches for large if we assume that the smallest eigenvalue , which from the above lemma we know to be true for fixed . (For the ReLU NTK kernel the difference becomes negligible with high probability for (Arora et al., 2019).) This allows us to replace with in any bounds involving the former.
We can get learning bounds in terms of in the following manner. The output of the network is given by , where is the vector of upper layer weights and is the vector of training output values. The outputs are linear in . Training only the , and assuming is invertible (which the above arguments show is true with high probability for large ), the following lemma holds:
If we initialize a random lower layer and train the weights of the upper layer, then there exists a solution with norm .
The minimum norm solution to is
The norm of this solution is given by .
We claim that . To show this, consider the SVD decomposition . Expanding we have
Evaluating the right hand side gets us .
Therefore, the norm of the minimum norm solution is . ∎
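The identity used in this proof, that the minimum-norm interpolating solution has squared norm yᵀK⁻¹y with K = ZZᵀ, can be verified directly on random data (the sizes below are arbitrary, with width much larger than the number of points):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 6, 200                            # data points, hidden width (h >> n)
Z = rng.normal(size=(n, h))              # hidden-layer activation matrix
y = rng.normal(size=n)                   # training labels

K = Z @ Z.T                              # n x n kernel matrix, invertible w.h.p.
w = Z.T @ np.linalg.solve(K, y)          # minimum-norm solution to Z w = y

assert np.allclose(Z @ w, y)                          # interpolates the labels
assert np.isclose(w @ w, y @ np.linalg.solve(K, y))   # |w|^2 = y^T K^{-1} y
```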
For large , the norm approaches . Since the lower layer is fixed, the optimization problem is linear and therefore convex in the trained weights . Therefore SGD with a small learning rate will reach this optimal solution. The Rademacher complexity of this function class is at most . The optimal solution has train error based on the assumption that is full rank, and the test error will be no more than this Rademacher complexity, identical to the previous results for training a ReLU network (Arora et al., 2019; Du et al., 2019).
Note that although we have argued here assuming only the upper layer is trained, (Arora et al., 2019; Du et al., 2019; Andoni et al., 2014) show that even if both layers are trained, for large enough the training dynamics are governed by the NTK kernel, and the lower layer changes so little over the training steps that remains close to throughout gradient descent.