# Neural Networks Learning and Memorization with (almost) no Over-Parameterization

Many results in recent years established polynomial-time learnability of various models via neural network algorithms. However, unless the model is linearly separable, or the activation is a polynomial, these results require very large networks, much larger than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth-two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with $\tilde{O}(m/d)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ random labeled points in $\mathbb{S}^{d-1}$.


## 1 Introduction

Understanding the models (i.e., pairs of input distribution and target function) on which neural network algorithms are guaranteed to learn a good predictor is at the heart of deep learning theory today. In recent years, there has been impressive progress in this direction. It is now known that neural network algorithms can learn, in polynomial time, linear models, certain kernel spaces, polynomials, and memorization models (e.g. [1, 6, 5, 4, 12, 20, 13, 7, 2, 17, 14, 9, 3]).

Yet, while such models have been shown to be learnable in polynomial time with polynomially sized networks, the required size (i.e., number of parameters) of the networks is still very large, unless the model is linearly separable [3], or the activation is a polynomial [9]. This means that the proofs are valid for networks whose size is significantly larger than the minimal size of a network that implements a good predictor.

We make progress in this direction, and prove that certain neural network algorithms can learn memorization models, polynomials, and kernel spaces, with near-optimal network size, sample complexity, and runtime (i.e., SGD iterations). Specifically, we assume that the instance space is $\mathbb{S}^{d-1}$ and consider depth-two networks with $2q$ hidden neurons. Such networks calculate a function of the form

$$h_{W,\mathbf{u}}(\mathbf{x})=\sum_{i=1}^{2q}u_i\,\sigma(\langle\mathbf{w}_i,\mathbf{x}\rangle)=\langle\mathbf{u},\sigma(W\mathbf{x})\rangle$$
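To fix notation, the computation above can be sketched in a few lines of numpy (the dimensions, random weights, and choice of ReLU below are illustrative only, not the paper's training setup):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(W, u, x, activation=relu):
    # h_{W,u}(x) = <u, sigma(W x)> = sum_i u_i * sigma(<w_i, x>)
    return u @ activation(W @ x)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # 2q = 4 hidden neurons, d = 3
u = rng.standard_normal(4)
x = rng.standard_normal(3)
x /= np.linalg.norm(x)               # instances live on the unit sphere

h1 = forward(W, u, x)
h2 = sum(u[i] * relu(W[i] @ x) for i in range(4))
assert np.isclose(h1, h2)            # the two expressions for h agree
```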

We assume that the network is trained via SGD, starting from random weights that are sampled from the following variant of Xavier initialization [10]: $W$ will be initialized to be a duplication of a $q\times d$ matrix of standard Gaussians, and $\mathbf{u}$ will be a duplication of the all-$B$ vector in dimension $q$, for some $B>0$, with its negation. We will use a rather large $B$, which will depend on the model that we want to learn. We will prove the following results.

#### Memorization

In the problem of memorization, we consider SGD training on top of a sample $S=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_m,y_m)\}$. The goal is to understand how large the network should be, and (to a somewhat lesser extent) how many SGD steps are needed, in order to memorize a $1-\epsilon$ fraction of the examples, where an example $(\mathbf{x}_i,y_i)$ is considered memorized if $\mathrm{sign}(h(\mathbf{x}_i))=y_i$ for the output function $h$. Most results assume that the points are random or "look like random" in some sense.

In order to memorize even just slightly more than half of the examples, we need a network with at least $m$ parameters (up to poly-log factors). However, unless $m\lesssim d$ (in which case the points are linearly separable), the best known results require much more than $m$ parameters, and the current state-of-the-art results [17, 14] still require substantially more. We show that if the points are sampled uniformly at random from $\mathbb{S}^{d-1}$, then any $1-\epsilon$ fraction of the examples can be memorized by a network with $\tilde{O}(m)$ parameters, using a near-optimal number of SGD iterations. Our result is valid for the hinge loss, and most popular activation functions, including the ReLU.

#### Bounded distributions

Our results for polynomials and kernels will depend on what we call the boundedness of the data distribution. We say that a distribution $\mathcal{D}$ on $\mathbb{S}^{d-1}$ is $C$-bounded if for every $\mathbf{u}\in\mathbb{S}^{d-1}$, $\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\langle\mathbf{u},\mathbf{x}\rangle^2\le\frac{C^2}{d}$. To help the reader calibrate our results, we first note that by Cauchy-Schwartz, any distribution is $\sqrt{d}$-bounded, and this bound is tight in the case that $\mathcal{D}$ is supported on a single point. Despite that, many distributions of interest are $O(1)$-bounded or even $(1+o(1))$-bounded. This includes the uniform distribution on $\mathbb{S}^{d-1}$, the uniform distribution on the (normalized) discrete cube $\{\pm 1/\sqrt{d}\}^d$, the uniform distribution on $m$ random points, and more (see Section 4.4). For simplicity, we will phrase our results in the introduction for $O(1)$-bounded distributions. We note that if the distribution is $C$-bounded (rather than $O(1)$-bounded), our results suffer a multiplicative factor that depends on $C$ in the number of parameters, while the runtime (number of SGD steps) remains the same.

#### Polynomials

For the sake of clarity, we will describe our result for learning even polynomials, with ReLU networks, and with the loss being the logistic loss or the hinge loss. Fix a constant integer $c$ and consider the class of even polynomials of degree at most $c$ and coefficient vector norm at most $M$. Namely,

$$\mathcal{P}^c_M=\left\{p(\mathbf{x})=\sum_{|\alpha|\text{ even},\,|\alpha|\le c}a_\alpha\mathbf{x}^\alpha\;:\;\sum_{|\alpha|\text{ even},\,|\alpha|\le c}a_\alpha^2\le M^2\right\}$$

where for a multi-index $\alpha\in\mathbb{Z}_{\ge 0}^d$ and $\mathbf{x}\in\mathbb{R}^d$ we denote $\mathbf{x}^\alpha=\prod_{i=1}^d x_i^{\alpha_i}$ and $|\alpha|=\sum_{i=1}^d\alpha_i$. Learning the class $\mathcal{P}^c_M$ requires a network whose size grows with $M$ (and this remains true even if we restrict to $O(1)$-bounded distributions). We show that for $O(1)$-bounded distributions, SGD learns $\mathcal{P}^c_M$ with error parameter $\epsilon$ (that is, it returns a predictor whose error is at most $\epsilon$ larger than the best error achievable in the class), using a network with a near-optimal number of parameters and a near-optimal number of SGD iterations.

#### Kernel Spaces

The connection between networks and kernels has a long history (early work includes [19, 15], for instance). In recent years, this connection was utilized to analyze the capabilities of neural network algorithms (e.g. [1, 6, 5, 4, 12, 20, 13, 7, 2, 17, 14, 9]). In fact, virtually all known non-linear learnable models, including memorization models, polynomials, and kernel spaces, utilize this connection. Our paper is no different, and our result for polynomials is a corollary of a more general result about learning certain kernel spaces, which we describe next. Our result about memorization is not a direct corollary, but is also a refinement of that result. We consider the kernel given by

$$k(\mathbf{x},\mathbf{y})=\langle\mathbf{x},\mathbf{y}\rangle\cdot\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(0,I)}\,\sigma'(\langle\mathbf{w},\mathbf{x}\rangle)\,\sigma'(\langle\mathbf{w},\mathbf{y}\rangle)\qquad(1)$$

which is a variant of the Neural Tangent Kernel [11] (see Section 2.6). We show that for $O(1)$-bounded distributions, SGD learns functions with norm at most $M$ in the corresponding kernel space, with error parameter $\epsilon$, using a network with a near-optimal number of parameters and SGD iterations. We note that the network size is optimal up to the dependency on $\epsilon$ and poly-log factors, and the number of iterations is optimal up to a constant factor. This result is valid for most Lipschitz losses, including the hinge loss and the log-loss, and for most popular activation functions, including the ReLU.
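For the ReLU activation, $\sigma'$ is the step function and the expectation in (1) has a classical closed form, $\mathbb{E}_{\mathbf{w}}\!\left[\mathbb{1}(\langle\mathbf{w},\mathbf{x}\rangle>0)\,\mathbb{1}(\langle\mathbf{w},\mathbf{y}\rangle>0)\right]=\frac{\pi-\arccos\langle\mathbf{x},\mathbf{y}\rangle}{2\pi}$, which a short Monte Carlo check reproduces (the dimensions and sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

# Two unit vectors with a known inner product rho.
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)
rho = x @ y

# Monte Carlo estimate of k(x, y) = <x,y> * E_w[sigma'(<w,x>) sigma'(<w,y>)]
# for the ReLU, whose derivative is the step function.
W = rng.standard_normal((200_000, d))
mc = rho * np.mean((W @ x > 0) & (W @ y > 0))

# Closed form: E_w[step(<w,x>) step(<w,y>)] = (pi - arccos(rho)) / (2 pi).
closed_form = rho * (np.pi - np.arccos(rho)) / (2 * np.pi)

assert abs(mc - closed_form) < 1e-2
```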

#### Technical Contribution

For weights $W$ and $\mathbf{u}$ we denote by $\Psi_{W,\mathbf{u}}(\mathbf{x})$ the gradient, w.r.t. the hidden weights $W$, of $h_{W,\mathbf{u}}(\mathbf{x})$. Our initialization scheme ensures that SGD on the network, at the onset of the optimization process, is approximately equivalent to linear SGD starting at $0$, on top of the embedding $\mathbf{x}\mapsto\Psi_{W^0,\mathbf{u}^0}(\mathbf{x})$, where $(W^0,\mathbf{u}^0)$ are the initial weights. Now, it holds that

$$k(\mathbf{x},\mathbf{y})=\mathbb{E}_{W}\left[\frac{\langle\Psi_{W,\mathbf{u}}(\mathbf{x}),\Psi_{W,\mathbf{u}}(\mathbf{y})\rangle}{2qB^2}\right]$$

where $k$ is the kernel defined in (1). Hence, if the network is large enough, we would expect that $\frac{\langle\Psi_{W,\mathbf{u}}(\mathbf{x}),\Psi_{W,\mathbf{u}}(\mathbf{y})\rangle}{2qB^2}\approx k(\mathbf{x},\mathbf{y})$, and therefore that SGD on the network, at the onset of the optimization process, is approximately equivalent to linear SGD starting at $0$, w.r.t. the kernel $k$. Our main technical contribution is the analysis of the rate (in terms of the size of the network) at which the empirical kernel converges to $k$.

We would like to mention Fiat et al. [8], whose result shares some ideas with our proof. In their paper it is shown that for the square loss and the ReLU activation, linear optimization over the embedding $\Psi$ can memorize random points with a near-optimal number of parameters.

## 2 Preliminaries

### 2.1 Notation

We denote vectors by bold-face letters (e.g. $\mathbf{x}$), matrices by upper-case letters (e.g. $W$), and collections of matrices by bold-face upper-case letters (e.g. $\mathbf{W}$). We denote the $i$'th row in a matrix $W$ by $\mathbf{w}_i$. The $\ell_2$-norm of a vector $\mathbf{x}$ is denoted by $\|\mathbf{x}\|$, and for a matrix $W$, $\|W\|$ is the spectral norm. For a distribution $\mathcal{D}$ on a space $\mathcal{X}$ and a function $f:\mathcal{X}\to\mathbb{R}$, we denote $\mathbb{E}_{\mathcal{D}}f=\mathbb{E}_{x\sim\mathcal{D}}f(x)$. We use the $\tilde{O}$ notation to hide poly-log factors.

### 2.2 Supervised learning

The goal in supervised learning is to devise a mapping from the input space $\mathcal{X}$ to an output space $\mathcal{Y}$, based on a sample $S=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_m,y_m)\}$ drawn i.i.d. from a distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$. In our case, the instance space will always be $\mathbb{S}^{d-1}$. A supervised learning problem is further specified by a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\to[0,\infty)$, and the goal is to find a predictor $h$ whose loss, $\mathcal{L}_{\mathcal{D}}(h)=\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\,\ell(y,h(\mathbf{x}))$, is small. The empirical loss $\mathcal{L}_S(h)=\frac{1}{m}\sum_{i=1}^m\ell(y_i,h(\mathbf{x}_i))$ is commonly used as a proxy for the loss $\mathcal{L}_{\mathcal{D}}$. When $h$ is defined by a vector of parameters $\mathbf{w}$, we will use the notations $\mathcal{L}_{\mathcal{D}}(\mathbf{w})$ and $\mathcal{L}_S(\mathbf{w})$. For a class $\mathcal{H}$ of predictors from $\mathcal{X}$ to $\mathcal{Y}$ we denote $\mathcal{L}_{\mathcal{D}}(\mathcal{H})=\inf_{h\in\mathcal{H}}\mathcal{L}_{\mathcal{D}}(h)$ and $\mathcal{L}_S(\mathcal{H})=\inf_{h\in\mathcal{H}}\mathcal{L}_S(h)$.

Regression problems correspond to $\mathcal{Y}=\mathbb{R}$ and, for instance, the squared loss $\ell(y,\hat{y})=(y-\hat{y})^2$. Classification is captured by $\mathcal{Y}=\{\pm 1\}$ and, say, the zero-one loss $\ell(y,\hat{y})=\mathbb{1}[\mathrm{sign}(\hat{y})\ne y]$ or the hinge loss $\ell(y,\hat{y})=\max(0,1-y\hat{y})$. A loss $\ell$ is $L$-Lipschitz if for all $y\in\mathcal{Y}$, the function $\hat{y}\mapsto\ell(y,\hat{y})$ is $L$-Lipschitz. Likewise, it is convex if this function is convex for every $y\in\mathcal{Y}$. We say that $\ell$ is $L$-decent if for every $y\in\mathcal{Y}$, the function $\hat{y}\mapsto\ell(y,\hat{y})$ is convex, $L$-Lipschitz, and twice differentiable in all but finitely many points.
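A minimal sketch of these losses in the $\ell(y,\hat{y})$ convention used here, together with a spot-check of the Lipschitz property (in $\hat{y}$, for fixed $y$):

```python
import numpy as np

# The losses mentioned above, written as ell(y, y_hat).
squared = lambda y, yhat: (y - yhat) ** 2                 # regression
zero_one = lambda y, yhat: float(np.sign(yhat) != y)      # classification
hinge = lambda y, yhat: max(0.0, 1.0 - y * yhat)          # convex, 1-Lipschitz

# The hinge loss is 1-Lipschitz in its second argument:
for y in (-1.0, 1.0):
    a, b = 0.7, -2.3
    assert abs(hinge(y, a) - hinge(y, b)) <= abs(a - b) + 1e-12
```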

### 2.3 Neural network learning

We will consider fully connected neural networks of depth two with $2q$ hidden neurons and activation function $\sigma:\mathbb{R}\to\mathbb{R}$. Throughout, we assume that the activation function $\sigma$ is continuous, is twice differentiable in all but finitely many points, and that there is a constant $A>0$ such that $|\sigma'(t)|\le A$ and $|\sigma''(t)|\le A$ for every point $t$ at which $\sigma$ is twice differentiable. We call such an activation a decent activation. This includes most popular activations, including the ReLU activation $\sigma(t)=\max(0,t)$, as well as most sigmoids.

Denote

$$\mathcal{N}^{\sigma}_{d,q}=\left\{h_{W,\mathbf{u}}(\mathbf{x})=\langle\mathbf{u},\sigma(W\mathbf{x})\rangle\ :\ W\in M_{2q,d},\ \mathbf{u}\in\mathbb{R}^{2q}\right\}.$$

We also denote by $\mathbf{W}=(W,\mathbf{u})$ the aggregation of all weights. We next describe the learning algorithm that we analyze in this paper. We will use a variant of the popular Xavier initialization [10] for the network weights, which we call Xavier initialization with zero outputs. The neurons will be arranged in pairs, where each pair consists of two neurons that are initialized identically, up to sign. Concretely, the weight matrix $W$ will be initialized to be a duplication of a $q\times d$ matrix of standard Gaussians, and $\mathbf{u}$ will be a duplication of the all-$B$ vector in dimension $q$, for some $B>0$, with its negation. (It is more standard to assume that the instances have norm $\sqrt{d}$, or infinity norm $1$, and that the entries of $W$ have variance $\frac{1}{d}$. For the sake of notational convenience we chose a different scaling: we divided the instances by $\sqrt{d}$ and accordingly multiplied the initial matrix by $\sqrt{d}$. Identical results can be derived for the more standard convention.) We denote the distribution of this initialization scheme by $\mathcal{I}(d,q,B)$. Note that if $\mathbf{W}\sim\mathcal{I}(d,q,B)$ then w.p. 1, $h_{\mathbf{W}}\equiv 0$. Finally, the training algorithm is described in Algorithm 1.
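A small numpy sketch of this pairing scheme makes the zero-output property concrete (the value of $B$ and the dimensions are arbitrary, and the ReLU stands in for a generic activation):

```python
import numpy as np

def init_xavier_zero_outputs(q, d, B, rng):
    """Sketch of 'Xavier initialization with zero outputs': W duplicates a
    q x d standard-Gaussian matrix, and u pairs the all-B vector in
    dimension q with its negation, so paired neurons cancel exactly."""
    W0 = rng.standard_normal((q, d))
    W = np.vstack([W0, W0])                         # 2q x d, identical halves
    u = np.concatenate([B * np.ones(q), -B * np.ones(q)])
    return W, u

rng = np.random.default_rng(0)
W, u = init_xavier_zero_outputs(q=8, d=5, B=0.3, rng=rng)
x = rng.standard_normal(5); x /= np.linalg.norm(x)

relu = lambda z: np.maximum(z, 0.0)
h = u @ relu(W @ x)
assert np.isclose(h, 0.0)       # the network starts as the zero function
```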

### 2.4 Kernel spaces

Let $\mathcal{X}$ be a set. A kernel is a function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ such that for every $\mathbf{x}_1,\ldots,\mathbf{x}_n\in\mathcal{X}$ the matrix $\left(k(\mathbf{x}_i,\mathbf{x}_j)\right)_{i,j}$ is positive semi-definite. A kernel space is a Hilbert space $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}$ such that for every $\mathbf{x}\in\mathcal{X}$ the linear functional $f\mapsto f(\mathbf{x})$ is bounded. The following theorem describes a one-to-one correspondence between kernels and kernel spaces.

For every kernel $k$ there exists a unique kernel space $\mathcal{H}_k$ such that for every $f\in\mathcal{H}_k$ and $\mathbf{x}\in\mathcal{X}$, $f(\mathbf{x})=\langle f,k(\cdot,\mathbf{x})\rangle_{\mathcal{H}_k}$. Likewise, for every kernel space $\mathcal{H}$ there is a kernel $k$ for which $\mathcal{H}=\mathcal{H}_k$. We denote the norm and inner product in $\mathcal{H}_k$ by $\|\cdot\|_k$ and $\langle\cdot,\cdot\rangle_k$. The following theorem describes a tight connection between kernels and embeddings of $\mathcal{X}$ into Hilbert spaces. A function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a kernel if and only if there exists a mapping $\Phi:\mathcal{X}\to\mathcal{H}$ to some Hilbert space for which $k(\mathbf{x},\mathbf{x}')=\langle\Phi(\mathbf{x}),\Phi(\mathbf{x}')\rangle_{\mathcal{H}}$. In this case, $\mathcal{H}_k=\{f_{\mathbf{v}}:\mathbf{v}\in\mathcal{H}\}$ where $f_{\mathbf{v}}(\mathbf{x})=\langle\mathbf{v},\Phi(\mathbf{x})\rangle_{\mathcal{H}}$. Furthermore, $\|f\|_k=\min\{\|\mathbf{v}\|_{\mathcal{H}}:f=f_{\mathbf{v}}\}$, and the minimizer is unique.

A special type of kernels that will be useful for us are inner product kernels. These are kernels of the form

$$k(\mathbf{x},\mathbf{y})=\sum_{n=0}^{\infty}b_n\langle\mathbf{x},\mathbf{y}\rangle^n$$

for scalars $b_n\ge 0$ with $\sum_{n=0}^{\infty}b_n<\infty$. It is well known that $k$ is a kernel for any such sequence. The following lemma summarizes a few properties of inner product kernels. Let $k$ be the inner product kernel $k(\mathbf{x},\mathbf{y})=\sum_{n=0}^{\infty}b_n\langle\mathbf{x},\mathbf{y}\rangle^n$.

1. If $p(\mathbf{x})=\sum_{\alpha}a_\alpha\mathbf{x}^\alpha$ is a polynomial supported on degrees $n$ with $b_n>0$, then $p\in\mathcal{H}_k$, and furthermore $\|p\|_k^2\le\sum_{\alpha}\frac{a_\alpha^2}{b_{|\alpha|}}$.

2. For every $\mathbf{x}_0\in\mathbb{S}^{d-1}$ and $n$ with $b_n>0$, the function $f(\mathbf{x})=\langle\mathbf{x}_0,\mathbf{x}\rangle^n$ belongs to $\mathcal{H}_k$ and $\|f\|_k^2\le\frac{1}{b_n}$.

For a kernel $k$ and $M>0$ we denote $\mathcal{H}^M_k=\{f\in\mathcal{H}_k:\|f\|_k\le M\}$. We note that spaces of the form $\mathcal{H}^M_k$ often form a benchmark for learning algorithms.

### 2.5 Hermite Polynomials and the dual activation

Hermite polynomials $h_0,h_1,h_2,\ldots$ are the sequence of orthonormal polynomials corresponding to the standard Gaussian measure on $\mathbb{R}$. Fix an activation $\sigma$. Following the terminology of [6], we define the dual activation of $\sigma$ as

$$\hat{\sigma}(\rho)=\mathbb{E}_{X,Y\text{ are }\rho\text{-correlated standard Gaussians}}\,\sigma(X)\sigma(Y)$$

It holds that if $\sigma=\sum_{n=0}^{\infty}a_n h_n$ is the Hermite expansion of $\sigma$, then

$$\hat{\sigma}(\rho)=\sum_{n=0}^{\infty}a_n^2\rho^n$$
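For the ReLU, the dual activation has the known closed form $\hat{\sigma}(\rho)=\frac{\sqrt{1-\rho^2}+\rho(\pi-\arccos\rho)}{2\pi}$; a Monte Carlo estimate over $\rho$-correlated Gaussians (a sketch, with an arbitrary sample size) matches it:

```python
import numpy as np

def dual_activation_mc(sigma, rho, n=1_000_000, seed=0):
    """Monte Carlo estimate of sigma_hat(rho) = E[sigma(X) sigma(Y)]
    for rho-correlated standard Gaussians X, Y."""
    rng = np.random.default_rng(seed)
    g1 = rng.standard_normal(n)
    g2 = rng.standard_normal(n)
    X = g1
    Y = rho * g1 + np.sqrt(1 - rho**2) * g2   # correlation exactly rho
    return np.mean(sigma(X) * sigma(Y))

relu = lambda z: np.maximum(z, 0.0)
rho = 0.4

# Known closed form of the ReLU dual activation.
closed = (np.sqrt(1 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2 * np.pi)
mc_val = dual_activation_mc(relu, rho)

assert abs(mc_val - closed) < 5e-3
```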

In particular, $k_\sigma(\mathbf{x},\mathbf{y}):=\hat{\sigma}(\langle\mathbf{x},\mathbf{y}\rangle)$ is an inner product kernel.

### 2.6 The Neural Tangent Kernel

Fix network parameters $W\in M_{2q,d}$ and $\mathbf{u}\in\mathbb{R}^{2q}$, and let $\mathbf{W}=(W,\mathbf{u})$. The neural tangent kernel corresponding to the weights $\mathbf{W}$ is (the division by $2qB^2$ is for notational convenience)

$$\tilde{k}_{\mathbf{W}}(\mathbf{x},\mathbf{y})=\frac{\langle\nabla_{\mathbf{W}}h_{\mathbf{W}}(\mathbf{x}),\nabla_{\mathbf{W}}h_{\mathbf{W}}(\mathbf{y})\rangle}{2qB^2}$$

The neural tangent kernel space, $\mathcal{H}_{\tilde{k}_{\mathbf{W}}}$, is a linear approximation of the trajectories along which $h_{\mathbf{W}}$ changes when $\mathbf{W}$ is changed slightly. Specifically, $h\in\mathcal{H}_{\tilde{k}_{\mathbf{W}}}$ if and only if there is $\mathbf{U}$ such that

$$\forall\mathbf{x}\in\mathbb{S}^{d-1},\quad h(\mathbf{x})=\lim_{\epsilon\to 0}\frac{h_{\mathbf{W}+\epsilon\mathbf{U}}(\mathbf{x})-h_{\mathbf{W}}(\mathbf{x})}{\epsilon}\qquad(2)$$

Furthermore, we have that $\|h\|_{\tilde{k}_{\mathbf{W}}}$ is, up to normalization, the minimal Euclidean norm of a $\mathbf{U}$ that satisfies equation (2). The expected initial neural tangent kernel is

$$\tilde{k}_{\sigma,B}(\mathbf{x},\mathbf{y})=\tilde{k}_{\sigma,d,q,B}(\mathbf{x},\mathbf{y})=\mathbb{E}_{\mathbf{W}\sim\mathcal{I}(d,q,B)}\,\tilde{k}_{\mathbf{W}}(\mathbf{x},\mathbf{y})$$

We will later see that $\tilde{k}_{\sigma,B}$ depends only on $\sigma$ and $B$. If the network is large enough, we can expect that at the onset of the optimization process, $\tilde{k}_{\mathbf{W}}\approx\tilde{k}_{\sigma,B}$. Hence, approximately, $\mathcal{H}_{\tilde{k}_{\sigma,B}}$ consists of the directions in which the initial function computed by the network can move. Since the initial function (according to Xavier initialization with zero outputs) is $0$, $\mathcal{H}_{\tilde{k}_{\sigma,B}}$ is a linear approximation of the space of functions computed by the network in the vicinity of the initial weights. NTK theory is based on the fact that, close enough to the initialization point, the linear approximation is good, and hence SGD on neural networks can learn functions in $\mathcal{H}_{\tilde{k}_{\sigma,B}}$ that have sufficiently small norm. The main question is how small the norm should be, or alternatively, how large the network should be.

We next derive a formula for $\tilde{k}_{\sigma,B}$. For $\mathbf{W}=(W,\mathbf{u})$ in the support of $\mathcal{I}(d,q,B)$, we have

$$
\begin{aligned}
\tilde{k}_{\mathbf{W}}(\mathbf{x},\mathbf{y})&=\frac{\langle\nabla_{\mathbf{W}}h_{\mathbf{W}}(\mathbf{x}),\nabla_{\mathbf{W}}h_{\mathbf{W}}(\mathbf{y})\rangle}{2qB^2}\\
&=\frac{1}{qB^2}\sum_{i=1}^{q}\langle B\sigma'(\langle\mathbf{w}_i,\mathbf{x}\rangle)\mathbf{x},\,B\sigma'(\langle\mathbf{w}_i,\mathbf{y}\rangle)\mathbf{y}\rangle+\frac{1}{qB^2}\sum_{i=1}^{q}\sigma(\langle\mathbf{w}_i,\mathbf{x}\rangle)\sigma(\langle\mathbf{w}_i,\mathbf{y}\rangle)\\
&=\frac{\langle\mathbf{x},\mathbf{y}\rangle}{q}\sum_{i=1}^{q}\sigma'(\langle\mathbf{w}_i,\mathbf{x}\rangle)\sigma'(\langle\mathbf{w}_i,\mathbf{y}\rangle)+\frac{1}{qB^2}\sum_{i=1}^{q}\sigma(\langle\mathbf{w}_i,\mathbf{x}\rangle)\sigma(\langle\mathbf{w}_i,\mathbf{y}\rangle)
\end{aligned}
$$

Taking expectation we get

$$\tilde{k}_{\sigma,B}(\mathbf{x},\mathbf{y})=\langle\mathbf{x},\mathbf{y}\rangle\,\widehat{\sigma'}(\langle\mathbf{x},\mathbf{y}\rangle)+\frac{1}{B^2}\hat{\sigma}(\langle\mathbf{x},\mathbf{y}\rangle)=\langle\mathbf{x},\mathbf{y}\rangle\,k_{\sigma'}(\mathbf{x},\mathbf{y})+\frac{1}{B^2}k_{\sigma}(\mathbf{x},\mathbf{y})$$

Finally, we decompose the expected initial neural tangent kernel into two kernels, corresponding to the hidden and output weights respectively. Namely, we let

$$\tilde{k}_{\sigma,B}=\tilde{k}^h_{\sigma}+\tilde{k}^o_{\sigma,B}\quad\text{for}\quad\tilde{k}^h_{\sigma}(\mathbf{x},\mathbf{y})=\langle\mathbf{x},\mathbf{y}\rangle\,\widehat{\sigma'}(\langle\mathbf{x},\mathbf{y}\rangle)\quad\text{and}\quad\tilde{k}^o_{\sigma,B}(\mathbf{x},\mathbf{y})=\frac{1}{B^2}\hat{\sigma}(\langle\mathbf{x},\mathbf{y}\rangle)$$

## 3 Results

### 3.1 Learning the neural tangent kernel space with SGD on NN

Fix a decent activation function $\sigma$ and a decent loss $\ell$. Our first result shows that Algorithm 1 can learn the class $\mathcal{H}^M_{\tilde{k}_{\sigma,B}}$ using a network with a near-optimal number of parameters and examples. We note that unless $\sigma$ is linear, the number of samples is optimal up to a constant factor, and the number of parameters is optimal up to poly-log factors and the dependency on $\epsilon$. This remains true even if we restrict to $O(1)$-bounded distributions.

**Theorem 3.1.** Given $\epsilon>0$, $M>0$, and $C>0$, there is a choice of $q$ and $B$, as well as a learning rate $\eta$ and a number of steps $T$, such that for every $C$-bounded distribution $\mathcal{D}$ and every batch size, the function $f$ returned by Algorithm 1 satisfies $\mathbb{E}\,\mathcal{L}_{\mathcal{D}}(f)\le\mathcal{L}_{\mathcal{D}}(\mathcal{H}^M_{\tilde{k}_{\sigma,B}})+\epsilon$.

As an application, we conclude that for the ReLU activation, Algorithm 1 can learn even polynomials of bounded norm with near-optimal sample complexity and network size. We denote

$$\mathcal{P}^c_M=\left\{p(\mathbf{x})=\sum_{|\alpha|\text{ even},\,|\alpha|\le c}a_\alpha\mathbf{x}^\alpha\;:\;\sum_{|\alpha|\text{ even},\,|\alpha|\le c}a_\alpha^2\le M^2\right\}$$

**Theorem.** Fix a constant $c$ and assume that the activation is the ReLU. Given $\epsilon>0$ and $M>0$, there is a choice of $q$ and $B$, as well as $\eta$ and $T$, such that for every $O(1)$-bounded distribution $\mathcal{D}$ and every batch size, the function $f$ returned by Algorithm 1 satisfies $\mathbb{E}\,\mathcal{L}_{\mathcal{D}}(f)\le\mathcal{L}_{\mathcal{D}}(\mathcal{P}^c_M)+\epsilon$. We note that, as in Theorem 3.1, the number of samples is optimal up to a constant factor, and the number of parameters is optimal up to poly-log factors and the dependency on $\epsilon$; this remains true even if we restrict to $O(1)$-bounded distributions.

### 3.2 Memorization

Theorem 3.1 can be applied to analyze memorization by SGD. Assume that $\ell$ is the hinge loss (a similar result is valid for many other losses, such as the log-loss) and that $\sigma$ is any decent non-linear activation. Let $(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_m,y_m)$ be examples in which the $\mathbf{x}_i$'s are random, independent and uniform points in $\mathbb{S}^{d-1}$, and $y_i\in\{\pm 1\}$. Suppose that we run SGD on top of this sample $S$. Namely, we run Algorithm 1 where the underlying distribution is the uniform distribution on the points in $S$. Let $h$ be the output of the algorithm. We say that the algorithm memorized the $i$'th example if $\mathrm{sign}(h(\mathbf{x}_i))=y_i$. The memorization problem investigates how many points the algorithm can memorize, where most of the focus is on how large the network should be in order to memorize a $1-\epsilon$ fraction of the points.

As shown in Section 4.4, the uniform distribution on the examples in $S$ is $O(1)$-bounded w.h.p. over the choice of the $\mathbf{x}_i$'s. Likewise, it is not hard to show that w.h.p. over the choice of $S$ there is a function of bounded norm in $\mathcal{H}_{\tilde{k}_{\sigma,B}}$ that fits all the examples. By Theorem 3.1 we can conclude that by running SGD on a sufficiently large network for sufficiently many steps, the network will memorize a $1-\epsilon$ fraction of the points. This network size is optimal up to poly-log factors and the dependency on $\epsilon$. This is satisfactory if $\epsilon$ is considered a constant. However, for small $\epsilon$, more can be desired. For instance, in the case that we want to memorize all points, we need to take $\epsilon<\frac{1}{m}$, which results in a network whose size is no longer near-linear in $m$. To circumvent that, we perform a more refined analysis of this memorization problem and show that even perfect memorization of the $m$ points can be done via SGD on a network with $\tilde{O}(m)$ parameters, which is optimal up to poly-log factors.

**Theorem.** There is a choice of $q$ and $B$, as well as $\eta$ and $T$, such that for every batch size, w.h.p., the function returned by Algorithm 1 memorizes a $1-\epsilon$ fraction of the examples.

We emphasize that our result is true for any non-linear and decent activation function.

### 3.3 Open Questions

The most obvious open question is to generalize our results to the standard Xavier initialization, where $W$ is a matrix of independent Gaussians of variance $\frac{1}{d}$, while $\mathbf{u}$ is a vector of independent Gaussians of variance $\frac{1}{2q}$. Another open question is to generalize our results to deeper networks.

## 4 Proofs

### 4.1 Reduction to SGD over vector random features

We will prove our results via a reduction to linear learning over the initial neural tangent kernel space corresponding to the hidden weights.

That is, we define $\Psi_{\mathbf{W}}(\mathbf{x})$ as the gradient of the function $W\mapsto h_{\mathbf{W}}(\mathbf{x})$ w.r.t. the hidden weights $W$. Namely,

$$\Psi_{\mathbf{W}}(\mathbf{x})=\left(u_1\sigma'(\langle\mathbf{w}_1,\mathbf{x}\rangle)\mathbf{x},\ \ldots,\ u_{2q}\sigma'(\langle\mathbf{w}_{2q},\mathbf{x}\rangle)\mathbf{x}\right)\in\mathbb{R}^{2q\times d}$$

Consider Algorithm 2, which runs linear SGD on top of this embedding.

It is not hard to show that by taking a large enough $B$, Algorithm 1 is essentially equivalent to Algorithm 2. Namely,

**Lemma 4.1.** Fix a decent activation $\sigma$ as well as a convex and decent loss $\ell$. There is a choice of $B$ such that for every input distribution the following holds. Let $f_1,f_2$ be the functions returned by Algorithm 1 and Algorithm 2, respectively, when run with the same remaining parameters. Then, the expected losses of $f_1$ and $f_2$ are arbitrarily close.

By Lemma 4.1, in order to prove Theorem 3.1 it is enough to analyze Algorithm 2. Specifically, Theorem 3.1 follows from the following theorem:

**Theorem.** Given $\epsilon>0$, $M>0$, and $C>0$, there is a choice of $q$ and $B$, as well as $\eta$ and $T$, such that for every $C$-bounded distribution $\mathcal{D}$ and every batch size, the function $f$ returned by Algorithm 2 satisfies $\mathbb{E}\,\mathcal{L}_{\mathcal{D}}(f)\le\mathcal{L}_{\mathcal{D}}(\mathcal{H}^M_{\tilde{k}_{\sigma,B}})+\epsilon$.

Our next step is to rephrase Algorithm 2 in the language of (vector) random features. We note that Algorithm 2 is SGD on top of the random embedding $\Psi_{\mathbf{W}}$. This embedding is composed of $2q$ i.i.d. random mappings $\mathbf{x}\mapsto u_i\sigma'(\langle\mathbf{w}_i,\mathbf{x}\rangle)\mathbf{x}$, where $\mathbf{w}_i$ is a standard Gaussian. This can be slightly simplified to SGD on top of $q$ i.i.d. random mappings $\mathbf{x}\mapsto\sigma'(\langle\mathbf{w}_i,\mathbf{x}\rangle)\mathbf{x}$. Indeed, if we make this change, the inner products between the different examples after the mapping is applied do not change (up to multiplication by a constant), and SGD only depends on these inner products. This falls into the framework of learning with (vector) random features schemes, which we define next and analyze in the next section.

Let $\mathcal{X}$ be a measurable space and let $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a kernel. A random features scheme (RFS) for $k$ is a pair $(\psi,\mu)$, where $\mu$ is a probability measure on a measurable space $\Omega$, and $\psi:\Omega\times\mathcal{X}\to\mathbb{R}^d$ is a measurable function, such that

$$\forall\mathbf{x},\mathbf{x}'\in\mathcal{X},\quad k(\mathbf{x},\mathbf{x}')=\mathbb{E}_{\omega\sim\mu}\left[\langle\psi(\omega,\mathbf{x}),\psi(\omega,\mathbf{x}')\rangle\right].\qquad(3)$$

We often refer to $\psi$ (rather than the pair $(\psi,\mu)$) as the RFS. The NTK RFS is given by the mapping $\psi:\mathbb{R}^d\times\mathbb{S}^{d-1}\to\mathbb{R}^d$ defined by

$$\psi(\omega,\mathbf{x})=\sigma'(\langle\omega,\mathbf{x}\rangle)\mathbf{x}$$

with $\mu$ being the standard Gaussian measure on $\mathbb{R}^d$. It is an RFS for the kernel $k$ defined in (1) (see Section 2.6). We define the norm of $\psi$ as $\|\psi\|:=\sup_{\omega,\mathbf{x}}\|\psi(\omega,\mathbf{x})\|$, and we say that $\psi$ is $C$-bounded if $\|\psi\|\le C$. We note that the NTK RFS is $C$-bounded for $C=\sup_t|\sigma'(t)|$. We say that an RFS is factorized if there is a scalar function $s$ such that $\psi(\omega,\mathbf{x})=s(\omega,\mathbf{x})\mathbf{x}$. We note that the NTK RFS is factorized.

A random $q$-embedding generated from $(\psi,\mu)$ is the random mapping

$$\Psi_{\boldsymbol{\omega}}(\mathbf{x})\stackrel{\text{def}}{=}\frac{\left(\psi(\omega_1,\mathbf{x}),\ldots,\psi(\omega_q,\mathbf{x})\right)}{\sqrt{q}},$$

where $\omega_1,\ldots,\omega_q\sim\mu$ are i.i.d. We next consider an algorithm for learning $\mathcal{H}^M_k$ by running SGD on top of random features (Algorithm 3).
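A sketch of the random $q$-embedding for the NTK RFS with the ReLU activation; for unit vectors, the empirical kernel built from the embedding should be close to the closed-form ReLU NTK kernel $\langle\mathbf{x},\mathbf{y}\rangle\frac{\pi-\arccos\langle\mathbf{x},\mathbf{y}\rangle}{2\pi}$ (dimensions and $q$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 10, 50_000

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

def psi(Omega, x):
    """NTK random features for the ReLU: psi(omega, x) = sigma'(<omega, x>) x,
    stacked over the rows omega of Omega."""
    return (Omega @ x > 0)[:, None] * x[None, :]

# Random q-embedding Psi(x) = (psi(omega_1, x), ..., psi(omega_q, x)) / sqrt(q).
Omega = rng.standard_normal((q, d))
Px = psi(Omega, x) / np.sqrt(q)
Py = psi(Omega, y) / np.sqrt(q)

approx = np.sum(Px * Py)                                  # empirical kernel
rho = x @ y
exact = rho * (np.pi - np.arccos(rho)) / (2 * np.pi)      # ReLU NTK kernel

assert abs(approx - exact) < 1e-2
```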

**Theorem 4.1.** Assume that $\psi$ is a factorized and $C$-bounded RFS for $k$, that $\ell$ is convex and $L$-Lipschitz, and that $\mathcal{D}$ has an $R$-bounded marginal. Let $f$ be the function returned by Algorithm 3. Fix a function $f^*\in\mathcal{H}_k$. Then

$$\mathbb{E}\,\mathcal{L}_{\mathcal{D}}(f)\le\mathcal{L}_{\mathcal{D}}(f^*)+\frac{LRC\|f^*\|_k}{\sqrt{qd}}+\frac{\|f^*\|_k^2}{2\eta T}+\frac{\eta L^2C^2}{2}$$

In particular, if $\|f^*\|_k\le M$ and $\eta=\frac{M}{LC\sqrt{T}}$, we have

$$\mathbb{E}\,\mathcal{L}_{\mathcal{D}}(f)\le\mathcal{L}_{\mathcal{D}}(f^*)+\frac{LRCM}{\sqrt{qd}}+\frac{LCM}{\sqrt{T}}$$

The next section is devoted to the analysis of RFSs and, in particular, to the proof of Theorem 4.1. We note that since the NTK RFS is factorized and $C$-bounded (for $C=\sup_t|\sigma'(t)|$), Theorem 4.1 applies to it; together with Lemma 4.1, this implies Theorem 3.1.

### 4.2 Vector random feature schemes

For the rest of this section, let us fix a $C$-bounded RFS $\psi$ for a kernel $k$, and a random $q$-embedding $\Psi_{\boldsymbol{\omega}}$ generated from $\psi$. The random $q$-kernel corresponding to $\Psi_{\boldsymbol{\omega}}$ is $k_{\boldsymbol{\omega}}(\mathbf{x},\mathbf{x}')=\langle\Psi_{\boldsymbol{\omega}}(\mathbf{x}),\Psi_{\boldsymbol{\omega}}(\mathbf{x}')\rangle$. Likewise, the random $q$-kernel space corresponding to $\Psi_{\boldsymbol{\omega}}$ is $\mathcal{H}_{k_{\boldsymbol{\omega}}}$. For every $\mathbf{x},\mathbf{x}'\in\mathcal{X}$,

$$k_{\boldsymbol{\omega}}(\mathbf{x},\mathbf{x}')=\frac{1}{q}\sum_{i=1}^{q}\langle\psi(\omega_i,\mathbf{x}),\psi(\omega_i,\mathbf{x}')\rangle$$

is an average of $q$ independent random variables whose expectation is $k(\mathbf{x},\mathbf{x}')$. By Hoeffding's bound we have:

**Lemma (Kernel Approximation).** Assume that $\psi$ is $C$-bounded. Then for every $\mathbf{x},\mathbf{x}'\in\mathcal{X}$ and $t>0$ we have $\Pr\left(|k_{\boldsymbol{\omega}}(\mathbf{x},\mathbf{x}')-k(\mathbf{x},\mathbf{x}')|\ge t\right)\le 2e^{-\frac{qt^2}{2C^4}}$.

We next discuss approximation of functions in $\mathcal{H}_k$ by functions in $\mathcal{H}_{k_{\boldsymbol{\omega}}}$. It will be useful to consider the embedding

$$\mathbf{x}\mapsto\Psi_{\mathbf{x}}\quad\text{where}\quad\Psi_{\mathbf{x}}\stackrel{\text{def}}{=}\psi(\cdot,\mathbf{x})\in L^2(\Omega,\mathbb{R}^d).\qquad(4)$$

From (3) it holds that for any $\mathbf{x},\mathbf{x}'\in\mathcal{X}$, $k(\mathbf{x},\mathbf{x}')=\langle\Psi_{\mathbf{x}},\Psi_{\mathbf{x}'}\rangle_{L^2(\Omega,\mathbb{R}^d)}$. In particular, from Theorem 2.4, for every $f\in\mathcal{H}_k$ there is a unique function $\check{f}\in L^2(\Omega,\mathbb{R}^d)$ such that

$$\|\check{f}\|_{L^2(\Omega)}=\|f\|_k\qquad(5)$$

and for every $\mathbf{x}\in\mathcal{X}$,

$$f(\mathbf{x})=\langle\check{f},\Psi_{\mathbf{x}}\rangle_{L^2(\Omega,\mathbb{R}^d)}=\mathbb{E}_{\omega\sim\mu}\langle\check{f}(\omega),\psi(\omega,\mathbf{x})\rangle.\qquad(6)$$

**Example.** Fix $\sigma$ with Hermite expansion $\sigma=\sum_{n=0}^{\infty}a_n h_n$, let $\mathbf{x}_0\in\mathbb{S}^{d-1}$, and let $f(\mathbf{x})=\langle\mathbf{x}_0,\mathbf{x}\rangle^n$ for some $n$ with $a_n\ne 0$.

1. Consider the RFS $\psi(\omega,\mathbf{x})=\sigma(\langle\omega,\mathbf{x}\rangle)$ with $\mu$ being the standard Gaussian measure on $\mathbb{R}^d$. We have that $\psi$ is an RFS for the kernel $k_\sigma(\mathbf{x},\mathbf{y})=\hat{\sigma}(\langle\mathbf{x},\mathbf{y}\rangle)$. Consider the function $\check{f}(\omega)=\frac{1}{a_n}h_n(\langle\mathbf{x}_0,\omega\rangle)$. We claim that $\check{f}$ satisfies (5) and (6). Indeed, we have,

$$\mathbb{E}_{\omega\sim\mu}\,\sigma(\langle\omega,\mathbf{x}\rangle)\frac{1}{a_n}h_n(\langle\mathbf{x}_0,\omega\rangle)=\frac{1}{a_n}\sum_{k=0}^{\infty}a_k\,\mathbb{E}_{\omega\sim\mu}\,h_k(\langle\omega,\mathbf{x}\rangle)h_n(\langle\mathbf{x}_0,\omega\rangle)=\frac{1}{a_n}\sum_{k=0}^{\infty}a_k\delta_{kn}\langle\mathbf{x},\mathbf{x}_0\rangle^k=\langle\mathbf{x},\mathbf{x}_0\rangle^n$$

and

$$\left\|\omega\mapsto\frac{1}{a_n}h_n(\langle\mathbf{x}_0,\omega\rangle)\right\|_{L^2(\Omega)}^2=\mathbb{E}_{\omega\sim\mu}\,\frac{1}{a_n^2}h_n^2(\langle\mathbf{x}_0,\omega\rangle)=\frac{1}{a_n^2}=\|f\|_k^2$$

2. Consider the NTK RFS $\psi(\omega,\mathbf{x})=\sigma'(\langle\omega,\mathbf{x}\rangle)\mathbf{x}$ with $\mu$ being the standard Gaussian measure on $\mathbb{R}^d$. We have that $\psi$ is an RFS for the kernel defined in (1). As in the item above, it is not hard to exhibit a corresponding $\check{f}$ (built from a Hermite polynomial in $\langle\mathbf{x}_0,\omega\rangle$, pointing in the direction $\mathbf{x}_0$) and show that it satisfies the analogues of (5) and (6).
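The orthogonality identity $\mathbb{E}_{\omega}\,h_k(\langle\omega,\mathbf{x}\rangle)h_n(\langle\omega,\mathbf{x}_0\rangle)=\delta_{kn}\langle\mathbf{x},\mathbf{x}_0\rangle^n$ used in the computation above can be sanity-checked by Monte Carlo with the first normalized Hermite polynomials, $h_1(t)=t$ and $h_2(t)=(t^2-1)/\sqrt{2}$ (the dimension and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 1_000_000

x = rng.standard_normal(d); x /= np.linalg.norm(x)
x0 = rng.standard_normal(d); x0 /= np.linalg.norm(x0)
rho = x @ x0

Om = rng.standard_normal((m, d))
A, B = Om @ x, Om @ x0            # jointly Gaussian pairs with correlation rho

h1 = lambda t: t                               # normalized Hermite polynomials
h2 = lambda t: (t**2 - 1) / np.sqrt(2)

assert abs(np.mean(h2(A) * h2(B)) - rho**2) < 2e-2   # k = n = 2 gives rho^n
assert abs(np.mean(h1(A) * h2(B))) < 1e-2            # k != n vanishes
```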

Let us denote