Understanding the models (i.e., pairs of an input distribution and a target function) on which neural network algorithms are guaranteed to learn a good predictor is at the heart of deep learning theory today. In recent years, there has been impressive progress in this direction. It is now known that neural network algorithms can learn, in polynomial time, linear models, certain kernel spaces, polynomials, and memorization models (e.g., [1, 6, 5, 4, 12, 20, 13, 7, 2, 17, 14, 9, 3]).
Yet, while such models have been shown to be learnable in polynomial time with polynomial-sized networks, the required size (i.e., number of parameters) of the networks is still very large, unless the model is linearly separable, or the activation is a polynomial. This means that the proofs are valid for networks whose size is significantly larger than the minimal size of a network that implements a good predictor.
We make progress in this direction, and prove that certain NN algorithms can learn memorization models, polynomials, and kernel spaces with near-optimal network size, sample complexity, and runtime (i.e., SGD iterations). Specifically, we assume that the instance space is and consider depth networks with hidden neurons. Such networks compute a function of the form
We assume that the network is trained via SGD, starting from random weights sampled from the following variant of Xavier initialization: will be initialized to be a duplication of a matrix of standard Gaussians, and will be a duplication of the all- vector in dimension , for some , with its negation. We will use a rather large , which will depend on the model that we want to learn. We will prove the following results.
In the problem of memorization, we consider SGD training on top of a sample . The goal is to understand how large the network should be, and (to a somewhat lesser extent) how many SGD steps are needed, in order to memorize a fraction of the examples, where an example is considered memorized if for the output function . Most results assume that the points are random or "look like random" in some sense.
In order to memorize even just slightly more than half of the examples, we need a network with at least parameters (up to poly-log factors). However, unless (in which case the points are linearly separable), the best known results require much more than parameters, and the current state-of-the-art results [17, 14] require parameters. We show that if the points are sampled uniformly at random from , then any fraction of the examples can be memorized by a network with parameters, and
Our results for polynomials and kernels will depend on what we call the boundedness of the data distribution. We say that a distribution on is -bounded if for every , . To help the reader calibrate our results, we first note that by Cauchy-Schwartz, any distribution is -bounded, and this bound is tight in the case that is supported on a single point. Despite that, many distributions of interest are -bounded or even -bounded. This includes the uniform distribution on , the uniform distribution on the discrete cube , the uniform distribution on random points, and more (see section 4.4). For simplicity, we will phrase our results in the introduction for -bounded distributions. We note that if the distribution is -bounded (rather than -bounded), our results suffer a multiplicative factor of in the number of parameters, while the runtime (number of SGD steps) remains the same.
For the sake of clarity, we will describe our result for learning even polynomials, with ReLU networks, and the loss being the logistic loss or the hinge loss. Fix a constant integer and consider the class of even polynomials of degree and coefficient vector norm at most . Namely,
where for and we denote and . Learning the class requires a network with at least parameters (and this remains true even if we restrict to -bounded distributions). We show that for -bounded distributions, SGD learns , with error parameter (that is, it returns a predictor with error ), using a network with parameters and SGD iterations.
The connection between networks and kernels has a long history (early work includes [19, 15], for instance). In recent years, this connection has been utilized to analyze the capabilities of neural network algorithms (e.g., [1, 6, 5, 4, 12, 20, 13, 7, 2, 17, 14, 9]). In fact, virtually all known non-linear learnable models, including memorization models, polynomials, and kernel spaces, utilize this connection. Our paper is no different: our result for polynomials is a corollary of a more general result about learning certain kernel spaces, which we describe next. Our result about memorization is not a direct corollary, but is also a refinement of that result. We consider the kernel given by
which is a variant of the Neural Tangent Kernel (see section 2.6). We show that for -bounded distributions, SGD learns functions with norm in the corresponding kernel space, with error parameter , using a network with parameters and SGD iterations. We note that the network size is optimal up to the dependency on and poly-log factors, and the number of iterations is optimal up to a constant factor. This result is valid for most Lipschitz losses, including the hinge loss and the log-loss, and for most popular activation functions, including the ReLU.
For weights and , we denote by the gradient, w.r.t. the hidden weights , of . Our initialization scheme ensures that SGD on the network, at the onset of the optimization process, is approximately equivalent to linear SGD starting at , on top of the embedding , where are the initial weights. Now, it holds that
where is the kernel defined in (1). Hence, if the network is large enough, we expect that , and therefore that SGD on the network, at the onset of the optimization process, is approximately equivalent to linear SGD starting at , w.r.t. the kernel . Our main technical contribution is an analysis of the rate (in terms of the size of the network) at which converges to .
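The concentration of the empirical tangent kernel around its expectation can be illustrated numerically. The sketch below is a simplification of our setting (a one-hidden-layer ReLU network with ±1 output weights and our own choice of dimensions, not the paper's exact scaling); it compares the empirical kernel of the hidden-weight gradients with its closed-form expectation, which for ReLU is ⟨x, y⟩(π − θ)/(2π), where θ is the angle between the unit vectors x and y.

```python
import numpy as np

# For f(x) = (1/sqrt(r)) * sum_i u_i relu(<w_i, x>) with u_i in {+1, -1},
# the gradient w.r.t. w_i at x is (1/sqrt(r)) * u_i * 1[<w_i,x> > 0] * x, so
# <grad f(x), grad f(y)> = <x, y> * (1/r) * sum_i 1[<w_i,x>>0] 1[<w_i,y>>0].
rng = np.random.default_rng(0)
d, r = 3, 200_000
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.6, 0.8, 0.0])          # unit vectors

W = rng.standard_normal((r, d))
act_x = (W @ x > 0)
act_y = (W @ y > 0)
k_emp = (x @ y) * np.mean(act_x & act_y)

theta = np.arccos(np.clip(x @ y, -1, 1))
# P(<w,x> > 0 and <w,y> > 0) = (pi - theta) / (2 pi) for Gaussian w
k_exp = (x @ y) * (np.pi - theta) / (2 * np.pi)
print(k_emp, k_exp)   # close for large width r
```

For this pair of points the common sign pattern occurs with probability (π − θ)/(2π) ≈ 0.35, and with width r = 200,000 the empirical average is within a few thousandths of the expectation.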
We would like to mention Fiat et al., whose result shares some ideas with our proof. In their paper, it is shown that for the square loss and the ReLU activation, linear optimization over the embedding can memorize points with parameters.
We denote vectors by bold-face letters (e.g., ), matrices by upper-case letters (e.g., ), and collections of matrices by bold-face upper-case letters (e.g., ). We denote the 'th row of a matrix by . The -norm of is denoted by , and for a matrix , is the spectral norm. We will also use the convention that . For a distribution on a space and , we denote . We use to hide poly-log factors.
2.2 Supervised learning
The goal in supervised learning is to devise a mapping from the input space to an output space , based on a sample , where the examples are drawn i.i.d. from a distribution over . In our case, the instance space will always be
. A supervised learning problem is further specified by a loss function , and the goal is to find a predictor whose loss, , is small. The empirical loss is commonly used as a proxy for the loss . When is defined by a vector of parameters, we will use the notations , and . For a class of predictors from to , we denote and
Regression problems correspond to and, for instance, the squared loss . Classification is captured by and, say, the zero-one loss or the hinge loss . A loss is -Lipschitz if for all , the function is -Lipschitz. Likewise, it is convex if is convex for every . We say that is -decent if for every , is convex, -Lipschitz, and twice differentiable in all but finitely many points.
2.3 Neural network learning
We will consider fully connected neural networks of depth with hidden neurons and activation function . Throughout, we assume that the activation function is continuous, is twice differentiable at all but finitely many points, and that there is such that for every point at which is twice differentiable. We call such an activation a decent activation. This includes most popular activations, including the ReLU activation , as well as most sigmoids.
We also denote by the aggregation of all weights.
We next describe the learning algorithm that we analyze in this paper.
We will use a variant of the popular Xavier initialization  for the network weights, which we call Xavier initialization with zero outputs.
The neurons will be arranged in pairs, where each pair consists of two neurons that are initialized identically, up to sign. Concretely, the weight matrix will be initialized to be a duplication of a matrix of standard Gaussians¹, and will be a duplication of the all- vector in dimension , for some , with its negation. We denote the distribution of this initialization scheme by . Note that if then w.p. 1, . Finally, the training algorithm is described in 1.

¹It is more standard to assume that the instances have norm (or infinity norm ), and that the entries of have variance . For the sake of notational convenience we chose a different scaling: we divide the instances by and accordingly multiply the initial matrix by . Identical results can be derived for the more standard convention.
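As a concrete illustration, the pairing scheme above can be sketched in Python. This is a hypothetical toy instantiation (ReLU activation, our own choices of the input dimension d, number of pairs q, and output scale a, not the paper's exact scaling); the point is that the mirrored output weights make the network compute the zero function at initialization.

```python
import numpy as np

# "Xavier initialization with zero outputs" (sketch): 2q hidden neurons
# arranged in q mirrored pairs, so the output is zero at initialization.
def init_zero_output(d, q, a=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W_half = rng.standard_normal((q, d))       # q x d standard Gaussians
    W = np.vstack([W_half, W_half])            # duplicated hidden weights
    u = np.concatenate([a * np.ones(q), -a * np.ones(q)])  # all-a vector and its negation
    return W, u

def relu(z):
    return np.maximum(z, 0.0)

def net(W, u, x):
    return u @ relu(W @ x)

W, u = init_zero_output(d=5, q=8)
x = np.random.default_rng(1).standard_normal(5)
print(net(W, u, x))   # zero, up to floating-point error
```

Each pair contributes a·relu(⟨w, x⟩) − a·relu(⟨w, x⟩) = 0, so the initial function vanishes identically, as claimed above.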
2.4 Kernel spaces
Let be a set. A kernel is a function such that for every the matrix is positive semi-definite. A kernel space is a Hilbert space of functions from to such that for every the linear functional is bounded. The following theorem describes a one-to-one correspondence between kernels and kernel spaces.
For every kernel there exists a unique kernel space such that for every , . Likewise, for every kernel space there is a kernel for which . We denote the norm and inner product in by and . The following theorem describes a tight connection between kernels and embeddings of into Hilbert spaces. A function is a kernel if and only if there exists a mapping to some Hilbert space for which . In this case, where . Furthermore, and the minimizer is unique.
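The kernel–embedding correspondence can be made concrete with a standard toy example (not specific to this paper): the quadratic kernel k(x, y) = ⟨x, y⟩² is realized by the explicit feature map φ(x) = vec(xxᵀ).

```python
import numpy as np

# The quadratic kernel k(x, y) = <x, y>^2 equals <phi(x), phi(y)> for the
# feature map phi(x) = vec(x x^T), since sum_ij x_i x_j y_i y_j = (<x,y>)^2.
def phi(x):
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(4)
print(phi(x) @ phi(y), (x @ y) ** 2)   # equal up to floating-point error
```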
A special type of kernels that will be useful for us are inner product kernels. These are kernels of the form
for scalars with . It is well known that for any such sequence, is a kernel. The following lemma summarizes a few properties of inner product kernels. Let be the inner product kernel . Suppose that
If then and furthermore
For every , the function belongs to and
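As a numerical sketch (with hypothetical coefficients chosen purely for illustration), one can check that an inner-product kernel with nonnegative coefficients yields a positive semi-definite Gram matrix on any finite point set.

```python
import numpy as np

# Inner-product kernel k(x, y) = sum_j b_j <x, y>^j with b_j >= 0;
# the Gram matrix of any point set should be PSD.
def ip_kernel(b, x, y):
    rho = x @ y
    return sum(bj * rho ** j for j, bj in enumerate(b))

b = [0.5, 0.3, 0.2]                            # b_0 + b_1*rho + b_2*rho^2
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # points on the unit sphere
G = np.array([[ip_kernel(b, xi, xj) for xj in X] for xi in X])
print(np.linalg.eigvalsh(G).min())             # >= 0 up to numerical error
```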
For a kernel and we denote . We note that spaces of the form often form a benchmark for learning algorithms.
2.5 Hermite Polynomials and the dual activation
Hermite polynomials are the sequence of orthonormal polynomials corresponding to the standard Gaussian measure on . Fix an activation . Following the terminology of  we define the dual activation of as
It holds that if then
In particular, is an inner product kernel.
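For the ReLU activation, the dual activation has a known closed form (the degree-one arc-cosine kernel of Cho and Saul, stated here as a background fact): for jointly Gaussian (u, v) with unit variances and correlation ρ, E[relu(u)·relu(v)] = (√(1 − ρ²) + (π − arccos ρ)·ρ)/(2π). A Monte Carlo sanity check:

```python
import numpy as np

# Estimate E[relu(u) relu(v)] for correlated standard Gaussians (u, v)
# and compare with the closed-form dual activation of ReLU.
rng = np.random.default_rng(0)
rho = 0.3
n = 2_000_000
u = rng.standard_normal(n)
v = rho * u + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)  # corr(u, v) = rho
mc = np.mean(np.maximum(u, 0) * np.maximum(v, 0))
closed = (np.sqrt(1 - rho ** 2) + (np.pi - np.arccos(rho)) * rho) / (2 * np.pi)
print(mc, closed)
```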
2.6 The Neural Tangent Kernel
Fix network parameters and . The neural tangent kernel corresponding to the weights is (the division by is for notational convenience)
The neural tangent kernel space, , is a linear approximation of the trajectories along which changes when is changed a bit. Specifically, if and only if there is such that
Furthermore, we have that is the minimal Euclidean norm of that satisfies equation (2). The expected initial neural tangent kernel is
We will later see that depends only on and . If the network is large enough, we can expect that at the onset of the optimization process, . Hence, approximately, consists of the directions in which the initial function computed by the network can move. Since the initial function (according to Xavier initialization with zero outputs) is , is a linear approximation of the space of functions computed by the network in the vicinity of the initial weights. NTK theory is based on the fact that close enough to the initialization point, the linear approximation is good, and hence SGD on the NN can learn functions in that have sufficiently small norm. The main question is how small the norm should be, or alternatively, how large the network should be.
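The linearization just described can be illustrated numerically: near the initial weights, the change in the network output along a direction D is well approximated by the inner product of the hidden-weight gradient with D, and the approximation improves as the step shrinks. The sketch below uses a small hypothetical ReLU network (our own dimensions and scaling, for illustration only).

```python
import numpy as np

# First-order check: f(W0 + t*D, x) - f(W0, x) is approximately t * <G, D>,
# where G is the gradient of f w.r.t. the hidden weights at W0.
def relu(z):
    return np.maximum(z, 0.0)

def f(W, u, x):
    return u @ relu(W @ x)

rng = np.random.default_rng(0)
d, r = 4, 10
W0 = rng.standard_normal((r, d))
u = rng.choice([-1.0, 1.0], size=r)
x = rng.standard_normal(d)
D = rng.standard_normal((r, d))

# Row i of the gradient w.r.t. W at W0 is u_i * 1[<w_i, x> > 0] * x.
G = (u * (W0 @ x > 0))[:, None] * x[None, :]

errs = []
for t in (1e-1, 1e-2, 1e-3):
    lin = t * np.sum(G * D)
    errs.append(abs(f(W0 + t * D, u, x) - f(W0, u, x) - lin))
print(errs)   # the linearization error shrinks (or vanishes) as t -> 0
```

Since ReLU is piecewise linear, the error is exactly zero once t is small enough that no pre-activation changes sign along the perturbation.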
We next derive a formula for . We have, for
Taking expectation we get
Finally, we decompose the expected initial neural tangent kernel into two kernels, that corresponds to the hidden and output weights respectively. Namely, we let
3.1 Learning the neural tangent kernel space with SGD on NN
Fix a decent activation function and a decent loss . Our first result shows that algorithm 1 can learn the class using a network with parameters and examples. We note that unless is linear, the number of samples is optimal up to a constant factor, and the number of parameters is optimal up to poly-log factors and the dependency on . This remains true even if we restrict to -bounded distributions.
Given , , and there is a choice of , , as well as and , such that for every -bounded distribution and batch size , the function returned by algorithm 1 satisfies
As an application, we conclude that for the ReLU activation, algorithm 1 can learn even polynomials of bounded norm with near optimal sample complexity and network size. We denote
Fix a constant and assume that the activation is ReLU. Given , , and there is a choice of , , as well as and , such that for every -bounded distribution and batch size , the function returned by algorithm 1 satisfies We note that as in theorem 3.1, the number of samples is optimal up to constant factor, and the number of parameters is optimal, up to poly-log factor and the dependency on , and this remains true even if we restrict to -bounded distributions.
Theorem 3.1 can be applied to analyze memorization by SGD. Assume that is the hinge loss (similar results are valid for many other losses, such as the log-loss) and is any decent non-linear activation. Let be random, independent and uniform points in with for some . Suppose that we run SGD on top of . Namely, we run algorithm 1 where the underlying distribution is the uniform distribution on the points in . Let be the output of the algorithm. We say that the algorithm memorized the 'th example if . The memorization problem investigates how many points the algorithm can memorize, where most of the focus is on how large the network should be in order to memorize a fraction of the points.
As shown in section 4.4, the uniform distribution on the examples in is -bounded w.h.p. over the choice of . Likewise, it is not hard to show that w.h.p. over the choice of , there is a function such that for all . By theorem 3.1, we conclude that by running SGD on a network with parameters for steps, the network will memorize a fraction of the points. This network size is optimal up to poly-log factors and the dependency on . This is satisfactory if is considered a constant. However, for small , more can be desired. For instance, in the case that we want to memorize all points, we need , and we get a network with parameters. To circumvent that, we perform a more refined analysis of this memorization problem and show that even perfect memorization of points can be done via SGD on a network with parameters, which is optimal up to poly-log factors.
There is a choice of , , as well as and , such that for every batch size , w.p. , the function returned by algorithm 1 memorizes fraction of the examples.
We emphasize that our result is true for any non-linear and decent activation function.
3.3 Open Questions
The most obvious open question is to generalize our results to the standard Xavier initialization, where is a matrix of independent Gaussians of variance , while is a vector of independent Gaussians of variance . Another open question is to generalize our result to deeper networks.
4.1 Reduction to SGD over vector random features
We will prove our result via a reduction to linear learning over the initial neural tangent kernel space corresponding to the hidden weights.
That is, we define by the gradient of the function w.r.t. the hidden weights. Namely,
Denote and consider algorithm 2.
Fix a decent activation as well as a convex and decent loss . There is a choice of , such that for every input distribution the following holds. Let be the functions returned by algorithm 1 with parameters and by algorithm 2 with parameters . Then,
Given , , and there is a choice of , , as well as , such that for every -bounded distribution and batch size , the function returned by algorithm 2 satisfies
Our next step is to rephrase algorithm 2 in the language of (vector) random features. We note that algorithm 2 is SGD on top of the random embedding . This embedding is composed of i.i.d. random mappings , where is a standard Gaussian. This can be slightly simplified to SGD on top of the i.i.d. random mappings . Indeed, if we make this change, the inner products between the different examples after the mapping is applied do not change (up to multiplication by ), and SGD depends only on these inner products. This falls within the framework of learning with (vector) random features schemes, which we define next and analyze in the next section.
Let be a measurable space and let be a kernel. A random features scheme (RFS) for is a pair where
is a probability measure on a measurable space, and is a measurable function, such that
We often refer to (rather than ) as the RFS. The NTK RFS is given by the mapping defined by
and being the standard Gaussian measure on . It is an RFS for the kernel (see section 2.6). We define the norm of as . We say that is -bounded if . We note that the NTK RFS is -bounded for . We say that an RFS is factorized if there is a function such that . We note that the NTK RFS is factorized.
A random -embedding generated from is the random mapping
where are i.i.d. We next consider an algorithm for learning , by running SGD on top of random features.
Assume that is a factorized and -bounded RFS for , that is convex and -Lipschitz, and that has a -bounded marginal. Let be the function returned by algorithm 3. Fix a function . Then
In particular, if and we have
The next section is devoted to the analysis of RFSs, and in particular to the proof of theorem 4.1. We note that since the NTK RFS is factorized and -bounded (for ), theorem 4.1 follows from theorem 4.1. Together with lemma 4.1, this implies theorem 3.1.
4.2 Vector random feature schemes
For the rest of this section, let us fix a -bounded RFS for a kernel and a random embedding . The random -kernel corresponding to is . Likewise, the random -kernel space corresponding to is . For every , is an average of independent random variables whose expectation is . By Hoeffding's bound we have the following. [Kernel Approximation] Assume that ; then for every we have . We next discuss approximation of functions in by functions in . It will be useful to consider the embedding
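The Hoeffding-style concentration above can be illustrated with a standard RFS that is different from the NTK one (random Fourier features for the Gaussian kernel, stated as a background fact rather than part of our construction): the random R-kernel concentrates around the true kernel as the number of features R grows.

```python
import numpy as np

# Random Fourier features: z(x) = sqrt(2/R) * cos(W x + b) with Gaussian W
# and uniform phases b gives <z(x), z(y)> ~ exp(-||x - y||^2 / 2) on average.
rng = np.random.default_rng(0)
d = 3
x, y = rng.standard_normal(d), rng.standard_normal(d)
k_true = np.exp(-np.linalg.norm(x - y) ** 2 / 2)

def k_R(R, seed):
    r = np.random.default_rng(seed)
    W = r.standard_normal((R, d))
    b = r.uniform(0, 2 * np.pi, R)
    zx = np.sqrt(2.0 / R) * np.cos(W @ x + b)
    zy = np.sqrt(2.0 / R) * np.cos(W @ y + b)
    return zx @ zy

errs = [abs(k_R(R, seed=1) - k_true) for R in (100, 10_000)]
print(errs)   # error is small for large R, in line with Hoeffding's bound
```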
and for every ,
Fix with Hermite expansion and let and
Consider the RFS with being the standard Gaussian measure on . We have that is an RFS for the kernel . Consider the function . We claim that . Indeed, we have,
Consider the NTK RFS with being the standard Gaussian measure on . We have that is an RFS for the kernel . Consider the function . As in the item above, it is not hard to show that .
Let us denote