 # A Priori Estimates of the Generalization Error for Two-layer Neural Networks

New estimates for the generalization error are established for the two-layer neural network model. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model. In contrast, most existing results for neural networks are a posteriori in nature in the sense that the bounds depend on some norms of the model parameters. The error rates are comparable to that of the Monte Carlo method for integration problems. Moreover, these bounds are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset.

## 1 Introduction

One of the most important theoretical challenges in machine learning comes from the fact that classical learning theory cannot explain the effectiveness of over-parametrized models in which the number of parameters is much larger than the size of the training set. This is especially the case for neural network models, which have achieved remarkable performance for a wide variety of problems [2, 16, 26]. Understanding the mechanism behind these successes requires developing new analytical tools that can work effectively in the over-parametrized regime.

Our work is partly motivated by the situation in classical approximation theory and finite element analysis. There are two kinds of error bounds in finite element analysis, depending on whether the target solution (the ground truth) or the numerical solution enters into the bounds. Let $f^*$ and $\hat{f}_n$ be the true solution and the “numerical solution”, respectively. In “a priori” error estimates, only norms of the true solution enter into the bounds, namely

$$\|\hat{f}_n - f^*\|_1 \le C \|f^*\|_2.$$

In “a posteriori” error estimates, the norms of the numerical solution enter into the bounds:

$$\|\hat{f}_n - f^*\|_1 \le C \|\hat{f}_n\|_3.$$

Here $\|\cdot\|_i$, $i = 1, 2, 3$, denote various norms.

In this language, most recent theoretical efforts [24, 4, 10, 21, 22, 23] on estimating the generalization error of neural networks should be viewed as “a posteriori” analysis, since the bounds depend on various norms of the solutions. Unfortunately, the numerical values of these norms have been observed to be quite large in real situations, yielding vacuous bounds.

In this paper we pursue a different line of attack by providing an “a priori” analysis. For this purpose, a suitably regularized two-layer network is considered. It is proved that the generalization error of the regularized solutions is asymptotically sharp, with constants depending only on the properties of the target function. Numerical experiments show that these a priori bounds are non-vacuous for datasets of practical interest, such as MNIST and CIFAR-10. In addition, our experimental results suggest that such regularization terms are necessary for the model to be “well-posed” (see Section 6 for the precise meaning).

### 1.1 Setup

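We will focus on the regression problem. Let $f^* : \Omega \to \mathbb{R}$ be the target function, with $\Omega = [-1, 1]^d$, and let $S = \{(x_i, y_i)\}_{i=1}^n$ be the training set. Here $\{x_i\}_{i=1}^n$ are i.i.d. samples drawn from an underlying distribution $\pi$ with $\mathrm{supp}(\pi) \subset \Omega$, and $y_i = f^*(x_i) + \varepsilon_i$ with $\varepsilon_i$ being the noise. Our aim is to recover $f^*$ by fitting $S$ using a two-layer fully connected neural network with ReLU (rectified linear unit) activation:

$$f(x; \theta) = \sum_{k=1}^{m} a_k \sigma(b_k \cdot x + c_k), \tag{1}$$

where $\sigma(t) = \max(0, t)$ is the ReLU function, $b_k \in \mathbb{R}^d$, and $\theta = \{(a_k, b_k, c_k)\}_{k=1}^m$ represents all the parameters to be learned from the training data. Here $m$ denotes the network width. To control the complexity of networks, we use the following scale-invariant norm.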

###### Definition 1 (Path norm).

For a two-layer ReLU network (1), the path norm is defined as

$$\|\theta\|_P = \sum_{k=1}^{m} |a_k| \left( \|b_k\|_1 + |c_k| \right).$$
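
To make the definition concrete, the following minimal sketch evaluates a small network of the form (1) and its path norm in plain Python. The helper names `forward` and `path_norm` and the toy parameter values are illustrative assumptions, not part of the original text. The sketch also checks the scale invariance mentioned above: rescaling neuron $k$ as $(a_k/s,\ s\,b_k,\ s\,c_k)$ with $s > 0$ should leave both the function and the path norm unchanged.

```python
def relu(t):
    """ReLU activation: max(0, t)."""
    return t if t > 0.0 else 0.0


def forward(x, a, b, c):
    """Evaluate f(x; theta) = sum_k a_k * relu(b_k . x + c_k)."""
    total = 0.0
    for a_k, b_k, c_k in zip(a, b, c):
        pre_activation = sum(w * x_j for w, x_j in zip(b_k, x)) + c_k
        total += a_k * relu(pre_activation)
    return total


def path_norm(a, b, c):
    """Path norm: sum_k |a_k| * (||b_k||_1 + |c_k|)."""
    return sum(
        abs(a_k) * (sum(abs(w) for w in b_k) + abs(c_k))
        for a_k, b_k, c_k in zip(a, b, c)
    )


# Toy network of width 2 in dimension 2 (values chosen only for illustration).
a = [1.0, -2.0]
b = [[0.5, -0.5], [1.0, 2.0]]
c = [0.25, -1.0]

# Rescale every neuron by s > 0: (a_k / s, s * b_k, s * c_k).
s = 4.0
a_s = [a_k / s for a_k in a]
b_s = [[s * w for w in b_k] for b_k in b]
c_s = [s * c_k for c_k in c]

x = [1.0, 0.5]
same_output = abs(forward(x, a, b, c) - forward(x, a_s, b_s, c_s)) < 1e-12
same_norm = abs(path_norm(a, b, c) - path_norm(a_s, b_s, c_s)) < 1e-12
```

Because ReLU is positively homogeneous, the factor $s$ pulled out of the activation cancels against $a_k/s$, which is why both checks are expected to succeed.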
###### Definition 2 (Spectral norm).

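Given $f \in L^2(\Omega)$, denote by $F \in L^2(\mathbb{R}^d)$ an extension of $f$ to $\mathbb{R}^d$. Let $\hat{F}$ be the Fourier transform of $F$, so that $f(x) = \int_{\mathbb{R}^d} e^{i \omega \cdot x} \hat{F}(\omega)\, d\omega$ for $x \in \Omega$. We define the spectral norm of $f$ by

$$\gamma(f) = \inf_{F \in L^2(\mathbb{R}^d),\ F|_\Omega = f|_\Omega} \int_{\mathbb{R}^d} \|\omega\|_1^2\, |\hat{F}(\omega)|\, d\omega. \tag{2}$$

We also define $\hat\gamma(f) = \max\{\gamma(f), 1\}$.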

###### Assumption 1.

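Following [6, 14], we consider target functions that are bounded and have finite spectral norm:

$$\mathcal{F}_s = L^\infty(\Omega) \cap \{ f : \Omega \to \mathbb{R} \mid \gamma(f) < \infty \}. \tag{3}$$

We assume that $f^* \in \mathcal{F}_s$.

Since $f^*$ is bounded, we can truncate the output of the network accordingly. By an abuse of notation, in the following we still use $f(x; \theta)$ to denote the truncated network. The ultimate goal is to minimize the generalization error (expected risk)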

$$L(\theta) = \mathbb{E}_{x, y}\left[ \ell(f(x; \theta), y) \right].$$

In practice, we only have at our disposal the empirical risk

$$\hat{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i).$$

The generalization gap is defined as the difference between the expected and the empirical risk. Here the loss function is the squared loss, unless it is specified otherwise.

### 1.2 Our Results

We propose a regularized estimator (defined in Section 3) and prove a priori estimates for its generalization error, shown in Table 1. As a comparison, we also list the result of a related work that analyzed a similar problem. It is worth mentioning that this result requires the network width to be of a specific order, whereas we allow arbitrary network width. See Theorems 8 and 9 for more details about the results.

## 2 Preliminary Results

In this section, we summarize some results on the approximation error and generalization bound for two-layer ReLU networks. These results are required by our subsequent a priori analysis.

### 2.1 Approximation Properties

Most of the content in this part is adapted from [3, 6, 14].

###### Proposition 1.

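For any bounded $f$ with $\gamma(f) < \infty$, one has the integral representation:

$$f(x) - f(0) - x \cdot \nabla f(0) = v \int_{\{-1,1\} \times [0,1] \times \mathbb{R}^d} h(x; z, t, \omega)\, dp(z, t, \omega),$$

where

$$\begin{aligned}
p(z, t, \omega) &= |\hat{f}(\omega)|\, \|\omega\|_1^2\, |\cos(\|\omega\|_1 t - z b(\omega))| / v, \\
s(z, t, \omega) &= -\mathrm{sign}\left(\cos(\|\omega\|_1 t - z b(\omega))\right), \\
h(x; z, t, \omega) &= s(z, t, \omega) \left( z\, x \cdot \omega / \|\omega\|_1 - t \right)_+ .
\end{aligned}$$

Here $b(\omega)$ is the phase of $\hat{f}(\omega)$, and $v$ is the normalization constant such that $p$ integrates to one, which satisfies $v \le 2\gamma(f)$.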

###### Proof.

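By an abuse of notation, let $f$ be its own extension in $\mathbb{R}^d$. Since $f \in L^2(\mathbb{R}^d)$, the quantity $f(x) - x \cdot \nabla f(0) - f(0)$ can be written as

$$\int_{\mathbb{R}^d} \left( e^{i \omega \cdot x} - i \omega \cdot x - 1 \right) \hat{f}(\omega)\, d\omega. \tag{4}$$

Note that the identity

$$-\int_0^c \left[ (z - s)_+ e^{is} + (-z - s)_+ e^{-is} \right] ds = e^{iz} - iz - 1$$

holds when $|z| \le c$. Choosing $c = \|\omega\|_1$, $z = \omega \cdot x$, we have

$$|z| \le \|\omega\|_1 \|x\|_\infty \le c.$$

Let $s = \|\omega\|_1 t$ with $t \in [0, 1]$, and $\hat\omega = \omega / \|\omega\|_1$; then we have

$$-\|\omega\|_1^2 \int_0^1 \left[ (\hat\omega \cdot x - t)_+ e^{i \|\omega\|_1 t} + (-\hat\omega \cdot x - t)_+ e^{-i \|\omega\|_1 t} \right] dt = e^{i \omega \cdot x} - i \omega \cdot x - 1. \tag{5}$$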
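
As a sanity check of the one-dimensional identity used above, the following sketch compares both sides numerically. The function names `lhs` and `rhs`, the midpoint quadrature rule, and the test values of $z$ and $c$ are illustrative assumptions introduced here; they are not part of the original argument.

```python
import cmath


def lhs(z, c, steps=20000):
    """Midpoint-rule approximation of
    -int_0^c [(z - s)_+ e^{is} + (-z - s)_+ e^{-is}] ds."""
    width = c / steps
    total = 0.0 + 0.0j
    for j in range(steps):
        s = (j + 0.5) * width
        total += max(z - s, 0.0) * cmath.exp(1j * s)
        total += max(-z - s, 0.0) * cmath.exp(-1j * s)
    return -total * width


def rhs(z):
    """Right-hand side e^{iz} - iz - 1."""
    return cmath.exp(1j * z) - 1j * z - 1.0


c = 2.0
# Largest discrepancy over a few values with |z| <= c.
max_gap = max(abs(lhs(z, c) - rhs(z)) for z in (-1.5, 0.3, 2.0))
```

Under these assumptions `max_gap` should be at the level of the quadrature error, consistent with the identity holding whenever $|z| \le c$.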

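Let $\hat{f}(\omega) = |\hat{f}(\omega)|\, e^{i b(\omega)}$. Inserting (5) into (4) and using that $f$ is real-valued yields

$$f(x) - x \cdot \nabla f(0) - f(0) = \int_{\mathbb{R}^d} \int_0^1 g(t, \omega)\, dt\, d\omega,$$

where

$$g(t, \omega) = -\|\omega\|_1^2\, |\hat{f}(\omega)| \left[ (\hat\omega \cdot x - t)_+ \cos(\|\omega\|_1 t + b(\omega)) + (-\hat\omega \cdot x - t)_+ \cos(\|\omega\|_1 t - b(\omega)) \right].$$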

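Consider a density on $\{-1, 1\} \times [0, 1] \times \mathbb{R}^d$ defined by

$$p(z, t, \omega) = |\hat{f}(\omega)|\, \|\omega\|_1^2\, |\cos(\|\omega\|_1 t - z b(\omega))| / v, \tag{6}$$

where the normalization constant $v$ is given by

$$v = \int_{\mathbb{R}^d} \int_0^1 |\hat{f}(\omega)|\, \|\omega\|_1^2 \left( |\cos(\|\omega\|_1 t + b(\omega))| + |\cos(\|\omega\|_1 t - b(\omega))| \right) dt\, d\omega.$$

Since $\gamma(f) < \infty$, we have

$$v \le 2\gamma(f) < +\infty, \tag{7}$$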

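and therefore the density is well-defined. To simplify the notation, we let

$$\begin{aligned}
s(z, t, \omega) &= -\mathrm{sign}\left(\cos(\|\omega\|_1 t - z b(\omega))\right), \\
h(x; z, t, \omega) &= s(z, t, \omega) \left( z\, \hat\omega \cdot x - t \right)_+ .
\end{aligned}$$

Then we have

$$f(x) - x \cdot \nabla f(0) - f(0) = v \int_{\{-1,1\} \times [0,1] \times \mathbb{R}^d} h(x; z, t, \omega)\, dp(z, t, \omega).$$

Since $u = (u)_+ - (-u)_+$ for any $u \in \mathbb{R}$, we obtain

$$f(x) = f(0) + (x \cdot \nabla f(0))_+ - (-x \cdot \nabla f(0))_+ + v \int_{\{-1,1\} \times [0,1] \times \mathbb{R}^d} h(x; z, t, \omega)\, dp(z, t, \omega).$$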

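For simplicity, in the rest of this paper, we assume $f(0) = 0$ and $\nabla f(0) = 0$. We take $m$ samples $T_m = \{(z_k, t_k, \omega_k)\}_{k=1}^m$, with each $(z_k, t_k, \omega_k)$ randomly drawn from $p(z, t, \omega)$, and consider the empirical average $\hat{f}_m(x) = \frac{v}{m} \sum_{k=1}^m h(x; z_k, t_k, \omega_k)$, which is exactly a two-layer ReLU network of width $m$. The central limit theorem (CLT) tells us that the approximation error is roughly

$$\mathbb{E}_{(z,t,\omega)}\left[ h(x; z, t, \omega) \right] - \frac{1}{m} \sum_{k=1}^m h(x; z_k, t_k, \omega_k) \approx \sqrt{\frac{\mathrm{Var}_{(z,t,\omega)}\left[ h(x; z, t, \omega) \right]}{m}}.$$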

So as long as we can bound the variance at the right-hand side, we will have an estimate of the approximation error. The following result formalizes this intuition.
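
The $\sqrt{\mathrm{Var}/m}$ heuristic can be illustrated on a toy integrand. The sketch below is not the sampling scheme of the paper: it assumes a hypothetical one-dimensional problem, estimating $\mathbb{E}[U^2] = 1/3$ for $U$ uniform on $[0,1]$, for which $\mathrm{Var}[U^2] = 4/45$. The helper name `mc_rmse`, the seed, and the sample sizes are likewise illustrative choices.

```python
import math
import random


def mc_rmse(m, trials, rng):
    """Root-mean-square error of the m-sample Monte Carlo estimate of E[U^2],
    U ~ Uniform[0, 1], measured over `trials` independent repetitions."""
    exact = 1.0 / 3.0
    squared_error = 0.0
    for _ in range(trials):
        estimate = sum(rng.random() ** 2 for _ in range(m)) / m
        squared_error += (estimate - exact) ** 2
    return math.sqrt(squared_error / trials)


def predicted_error(m):
    """CLT-style prediction sqrt(Var / m) with Var[U^2] = 1/5 - 1/9 = 4/45."""
    return math.sqrt((4.0 / 45.0) / m)


rng = random.Random(0)
# Map each sample size m to (observed RMSE, predicted sqrt(Var / m)).
results = {m: (mc_rmse(m, 1000, rng), predicted_error(m)) for m in (10, 100, 1000)}
```

In this toy setting the observed and predicted errors should agree up to statistical fluctuation, and both should shrink by roughly $\sqrt{10}$ each time $m$ grows tenfold.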

###### Theorem 2.

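For any distribution $\pi$ with $\mathrm{supp}(\pi) \subset \Omega$ and any $m \in \mathbb{N}_+$, there exists a two-layer network $f(x; \tilde\theta)$ of width $m$ such that

$$\mathbb{E}_{x \sim \pi} |f(x) - f(x; \tilde\theta)|^2 \le \frac{16 \gamma^2(f)}{m}.$$

Furthermore, $\|\tilde\theta\|_P \le 4\gamma(f)$, i.e. the path norm of $\tilde\theta$ can be bounded by the spectral norm of the target function.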

###### Proof.

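Let $\hat{f}_m(x) = \frac{v}{m} \sum_{k=1}^m h(x; z_k, t_k, \omega_k)$ be the Monte-Carlo estimator. We have

$$\begin{aligned}
\mathbb{E}_{T_m} \mathbb{E}_x |f(x) - \hat{f}_m(x)|^2 &= \mathbb{E}_x \mathbb{E}_{T_m} |f(x) - \hat{f}_m(x)|^2 \\
&= \frac{1}{m} \mathbb{E}_x \left( v^2\, \mathbb{E}_{(z,t,\omega)}\left[ h^2(x; z, t, \omega) \right] - f^2(x) \right) \\
&\le \frac{v^2}{m} \mathbb{E}_x \mathbb{E}_{(z,t,\omega)}\left[ h^2(x; z, t, \omega) \right].
\end{aligned}$$

Furthermore, for any fixed $x$, the variance can be upper bounded since

$$\mathbb{E}_{(z,t,\omega)}\left[ h^2(x; z, t, \omega) \right] \le \mathbb{E}_{(z,t,\omega)}\left[ (z \hat\omega \cdot x - t)_+^2 \right] \le \mathbb{E}_{(z,t,\omega)}\left[ (|\hat\omega \cdot x| + t)^2 \right] \le 4.$$

Hence we have

$$\mathbb{E}_{T_m} \mathbb{E}_x |f(x) - \hat{f}_m(x)|^2 \le \frac{4 v^2}{m} \le \frac{16 \gamma^2(f)}{m}.$$

Therefore there must exist a set of samples $T_m$ such that the corresponding empirical average satisfies

$$\mathbb{E}_x |f(x) - \hat{f}_m(x)|^2 \le \frac{16 \gamma^2(f)}{m}.$$

Due to the special structure of the Monte-Carlo estimator, we have $\|\tilde\theta\|_P \le 2v$. It follows from Equation (7) that $\|\tilde\theta\|_P \le 4\gamma(f)$. ∎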
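
The path-norm bound in the last step can be spelled out as follows. This is a sketch of one way to read "the special structure of the Monte-Carlo estimator"; the identification of the parameters $(a_k, b_k, c_k)$ below is our reading of the construction and is not stated explicitly in the text.

$$\begin{aligned}
\hat{f}_m(x) &= \sum_{k=1}^m \underbrace{\frac{v\, s(z_k, t_k, \omega_k)}{m}}_{a_k}\, \sigma\big( \underbrace{z_k \hat\omega_k}_{b_k} \cdot x + \underbrace{(-t_k)}_{c_k} \big), \\
\|\tilde\theta\|_P &= \sum_{k=1}^m |a_k| \left( \|b_k\|_1 + |c_k| \right) \le \sum_{k=1}^m \frac{v}{m} \left( 1 + t_k \right) \le 2v \le 4\gamma(f),
\end{aligned}$$

where we used $|s| \le 1$, $\|z_k \hat\omega_k\|_1 = 1$, $t_k \in [0, 1]$, and the bound (7).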

### 2.2 Estimating the Generalization Gap

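Let $\mathcal{H}$ be a hypothesis space, i.e. a set of functions. The Rademacher complexity of $\mathcal{H}$ with respect to samples $S = (z_1, z_2, \dots, z_n)$ is defined as

$$\hat{R}(\mathcal{H}) = \frac{1}{n} \mathbb{E}_\xi \left[ \sup_{h \in \mathcal{H}} \sum_{i=1}^n h(z_i)\, \xi_i \right],$$

where $\{\xi_i\}_{i=1}^n$ are independent random variables with $\mathbb{P}(\xi_i = +1) = \mathbb{P}(\xi_i = -1) = 1/2$.

The generalization gap can be estimated via the Rademacher complexity by the following theorem (see [5, 25]).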

###### Theorem 3.

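Fix a hypothesis space $\mathcal{H}$. Assume that for any $h \in \mathcal{H}$ and $z$, $|h(z)| \le c$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of $S = (z_1, z_2, \dots, z_n)$, we have

$$\sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n h(z_i) - \mathbb{E}_z[h(z)] \right| \le 2\, \mathbb{E}_S\left[ \hat{R}(\mathcal{H}) \right] + c \sqrt{\frac{2 \log(2/\delta)}{n}}.$$

Before providing the upper bound for the Rademacher complexity of two-layer networks, we first need the following two lemmas.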

###### Lemma 1 (Lemma 26.11 of ).

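Let $x_1, \dots, x_n$ be vectors in $\mathbb{R}^d$. Then the Rademacher complexity of $\mathcal{H}_1 = \{ x \mapsto u \cdot x : \|u\|_1 \le 1 \}$ has the following upper bound,

$$\hat{R}(\mathcal{H}_1) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2d)}{n}}.$$

The above lemma characterizes the Rademacher complexity of a linear predictor with $\ell_1$ norm bounded by $1$. To handle the influence of the nonlinear activation function, we need the following contraction lemma.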

###### Lemma 2 (Lemma 26.9 of ).

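Let $\phi : \mathbb{R} \to \mathbb{R}$ be a $\rho$-Lipschitz function, i.e. for all $\alpha, \beta \in \mathbb{R}$ we have $|\phi(\alpha) - \phi(\beta)| \le \rho |\alpha - \beta|$. For any hypothesis space $\mathcal{H}$, let $\phi \circ \mathcal{H} = \{ \phi \circ h : h \in \mathcal{H} \}$; then we have

$$\hat{R}(\phi \circ \mathcal{H}) \le \rho\, \hat{R}(\mathcal{H}).$$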
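
The bound of Lemma 1 can be checked numerically. The sketch below assumes, as in the cited lemma, that $\mathcal{H}_1$ is the class of linear maps $x \mapsto u \cdot x$ with $\|u\|_1 \le 1$, so the supremum over $u$ is attained at a signed coordinate vector and equals $\|\sum_i \xi_i x_i\|_\infty$. The function names, the random data, the seed, and the sizes `n` and `d` are illustrative assumptions.

```python
import math
import random


def rademacher_l1_ball(xs, draws, rng):
    """Monte Carlo estimate of (1/n) E_xi sup_{||u||_1 <= 1} sum_i xi_i u . x_i.
    For the l1 unit ball the supremum equals ||sum_i xi_i x_i||_inf."""
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(draws):
        xi = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        column_sums = [sum(xi[i] * xs[i][j] for i in range(n)) for j in range(d)]
        total += max(abs(value) for value in column_sums)
    return total / (draws * n)


def lemma1_bound(xs):
    """Right-hand side of Lemma 1: max_i ||x_i||_inf * sqrt(2 log(2d) / n)."""
    n, d = len(xs), len(xs[0])
    max_inf_norm = max(max(abs(value) for value in x) for x in xs)
    return max_inf_norm * math.sqrt(2.0 * math.log(2.0 * d) / n)


rng = random.Random(0)
n, d = 200, 20
xs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
estimate = rademacher_l1_ball(xs, 500, rng)
bound = lemma1_bound(xs)
```

Under these assumptions `estimate` should not exceed `bound`; the gap reflects that the lemma is a worst-case statement over all sample configurations with the same maximal $\ell_\infty$ norm.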

We are now ready to characterize the Rademacher complexity of two-layer networks. Specifically, the path norm is used to control the complexity.

###### Lemma 3.

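Let $\mathcal{F}_Q = \{ f(x; \theta) : \|\theta\|_P \le Q \}$ be the set of two-layer networks with path norm bounded by $Q$; then we have

$$\hat{R}(\mathcal{F}_Q) \le 2Q \sqrt{\frac{2 \log(2d)}{n}}.$$

###### Proof.

To simplify the proof, we let $c_k = 0$; otherwise we can define $\tilde{b}_k = (b_k^T, c_k)^T$ and $\tilde{x} = (x^T, 1)^T$. Writing $\hat{b}_k = b_k / \|b_k\|_1$, we have

$$\begin{aligned}
n \hat{R}(\mathcal{F}_Q) &= \mathbb{E}_\xi \left[ \sup_{\|\theta\|_P \le Q} \sum_{i=1}^n \xi_i \sum_{k=1}^m a_k \|b_k\|_1\, \sigma(\hat{b}_k^T x_i) \right] \\
&\le \mathbb{E}_\xi \left[ \sup_{\|\theta\|_P \le Q,\ \|u_k\|_1 = 1} \sum_{i=1}^n \xi_i \sum_{k=1}^m a_k \|b_k\|_1\, \sigma(u_k^T x_i) \right] \\
&= \mathbb{E}_\xi \left[ \sup_{\|\theta\|_P \le Q,\ \|u_k\|_1 = 1} \sum_{k=1}^m a_k \|b_k\|_1 \sum_{i=1}^n \xi_i\, \sigma(u_k^T x_i) \right] \\
&\le \mathbb{E}_\xi \left[ \sup_{\|\theta\|_P \le Q} \sum_{k=1}^m \big| a_k \|b_k\|_1 \big| \sup_{\|u\|_1 = 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right] \\
&\le Q\, \mathbb{E}_\xi \left[ \sup_{\|u\|_1 = 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right] \le Q\, \mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right].
\end{aligned}$$

Due to the symmetry, we have that

$$\begin{aligned}
\mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right] &\le \mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) + \sup_{\|u\|_1 \le 1} \sum_{i=1}^n -\xi_i\, \sigma(u^T x_i) \right] \\
&= 2\, \mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right].
\end{aligned}$$

Since $\sigma$ is Lipschitz continuous with Lipschitz constant $1$, by applying Lemma 2 and Lemma 1, we obtain

$$\hat{R}(\mathcal{F}_Q) \le 2Q \sqrt{\frac{2 \log(2d)}{n}}.$$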

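###### Proposition 4.

Assume the loss function $\ell(\cdot, y)$ is $\rho$-Lipschitz continuous and bounded by $B$. Then with probability at least $1 - \delta$ we have

$$\sup_{\|\theta\|_P \le Q} \left| L(\theta) - \hat{L}_n(\theta) \right| \le 4 \rho Q \sqrt{\frac{2 \log(2d)}{n}} + B \sqrt{\frac{2 \log(2/\delta)}{n}}. \tag{8}$$

###### Proof.

Define $\mathcal{H}_Q = \{ \ell \circ f : f \in \mathcal{F}_Q \}$; then we have $\hat{R}(\mathcal{H}_Q) \le 2 \rho Q \sqrt{2 \log(2d)/n}$, which follows from Lemmas 2 and 3. Directly applying Theorem 3 then yields the result. ∎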

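###### Theorem 5 (A posteriori generalization bound).

Assume the loss function $\ell(\cdot, y)$ is $\rho$-Lipschitz continuous and bounded by $B$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of the training set $S$, we have, for any two-layer network $f(x; \theta)$,

$$\left| L(\theta) - \hat{L}_n(\theta) \right| \le 4 \rho \left( \|\theta\|_P + 1 \right) \sqrt{\frac{2 \log(2d)}{n}} + B \sqrt{\frac{2 \log\left( 2c (1 + \|\theta\|_P)^2 / \delta \right)}{n}}, \tag{9}$$

where $c = \sum_{k=1}^\infty 1/k^2$.

We can see that the generalization gap is roughly bounded by $\|\theta\|_P / \sqrt{n}$ up to some logarithmic terms.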
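
To get a feel for the size of the right-hand side of (9), the following sketch evaluates it for given values of the path norm, sample size, and input dimension. It assumes $c = \sum_{k \ge 1} 1/k^2 = \pi^2/6$, the value suggested by the union-bound argument in the proof. The function name `a_posteriori_bound`, the default values `rho=1.0` and `B=1.0`, and the example sizes (MNIST-like $n = 60000$, $d = 784$) are illustrative assumptions, not values taken from the text.

```python
import math


def a_posteriori_bound(path_norm, n, d, delta, rho=1.0, B=1.0):
    """Right-hand side of (9) for a network with the given path norm.
    Assumes c = sum_{k>=1} 1/k^2 = pi^2 / 6."""
    c = math.pi ** 2 / 6.0
    complexity_term = 4.0 * rho * (path_norm + 1.0) * math.sqrt(2.0 * math.log(2.0 * d) / n)
    confidence_term = B * math.sqrt(
        2.0 * math.log(2.0 * c * (1.0 + path_norm) ** 2 / delta) / n
    )
    return complexity_term + confidence_term


# Hypothetical sizes, chosen only to illustrate the scaling.
gap_small_norm = a_posteriori_bound(path_norm=10.0, n=60000, d=784, delta=0.05)
gap_large_norm = a_posteriori_bound(path_norm=1000.0, n=60000, d=784, delta=0.05)
```

The first term grows linearly in the path norm while the second grows only logarithmically, so for large path norms the bound is dominated by $\|\theta\|_P \sqrt{\log(2d)/n}$. Both terms scale as $1/\sqrt{n}$, so quadrupling $n$ halves the bound.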

###### Proof.

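Consider the decomposition $\mathcal{F} = \cup_{l=1}^\infty \mathcal{F}_l$, where $\mathcal{F}_l = \{ f(x; \theta) : \|\theta\|_P \le l \}$. Let $\delta_l = \delta / (c\, l^2)$, where $c = \sum_{l=1}^\infty 1/l^2$. According to Proposition 4, if we fix $l$ in advance, then with probability at least $1 - \delta_l$ over the choice of $S$,

$$\sup_{\|\theta\|_P \le l} \left| L(\theta) - \hat{L}_n(\theta) \right| \le 4 \rho\, l \sqrt{\frac{2 \log(2d)}{n}} + B \sqrt{\frac{2 \log(2/\delta_l)}{n}}.$$

So the probability that there exists at least one $l$ such that the above inequality fails is at most $\sum_{l=1}^\infty \delta_l = \delta$. In other words, with probability at least $1 - \delta$, the above inequality holds for all $l$.

Given an arbitrary set of parameters $\theta$, denote $l_0 = \min\{ l : \|\theta\|_P \le l \}$; then $l_0 \le \|\theta\|_P + 1$. The above inequality with $l = l_0$ implies that

$$\begin{aligned}
\left| L(\theta) - \hat{L}_n(\theta) \right| &\le 4 \rho\, l_0 \sqrt{\frac{2 \log(2d)}{n}} + B \sqrt{\frac{2 \log(2 c\, l_0^2 / \delta)}{n}} \\
&\le 4 \rho \left( \|\theta\|_P + 1 \right) \sqrt{\frac{2 \log(2d)}{n}} + B \sqrt{\frac{2 \log\left( 2c (1 + \|\theta\|_P)^2 / \delta \right)}{n}}.
\end{aligned}$$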

## 3 A Priori Estimates

For simplicity we first consider the noiseless case, i.e. $\varepsilon_i = 0$. In the next section, we deal with the noise.

We see that the path norm of the special solution which achieves the optimal approximation error is independent of the network size, and this norm can also be used to bound the generalization gap (Theorem 5). Therefore, if the path norm is suitably penalized during training, we should be able to control the generalization gap without harming the approximation accuracy. One possible implementation of this idea is through the structural empirical risk minimization [29, 25].

###### Definition 4 (Path-norm regularized estimator).

Define the regularized risk by

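$$J_\lambda(\theta) := \hat{L}_n(\theta) + \lambda \sqrt{\frac{2 \log(2d)}{n}} \left( 1 + \|\theta\|_P \right), \tag{10}$$

where $\lambda$ is a positive constant. The path-norm regularized estimator is defined as

$$\hat\theta_n = \operatorname{argmin}_\theta J_\lambda(\theta). \tag{11}$$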

It is worth noting that the minimizer is not necessarily unique, and $\hat\theta_n$ should be understood as any of the minimizers. Without loss of generality, we also assume . In the following, we provide a detailed analysis of the generalization error of the regularized estimator.
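
The regularized risk (10) is straightforward to evaluate for a given parameter set. The sketch below is one possible implementation under stated assumptions: the loss is taken to be $\frac{1}{2}(f(x;\theta) - y)^2$ (the normalization of the squared loss is an assumption), the truncation of the network output is omitted, and the names `forward`, `path_norm`, and `regularized_risk`, as well as the representation of $\theta$ as a list of `(a_k, b_k, c_k)` tuples, are hypothetical. It only evaluates $J_\lambda$ and does not attempt the minimization in (11).

```python
import math


def forward(x, theta):
    """f(x; theta) = sum_k a_k * max(0, b_k . x + c_k); theta is a list of (a_k, b_k, c_k)."""
    return sum(
        a_k * max(0.0, sum(w * x_j for w, x_j in zip(b_k, x)) + c_k)
        for a_k, b_k, c_k in theta
    )


def path_norm(theta):
    """sum_k |a_k| * (||b_k||_1 + |c_k|)."""
    return sum(abs(a_k) * (sum(abs(w) for w in b_k) + abs(c_k)) for a_k, b_k, c_k in theta)


def regularized_risk(theta, xs, ys, lam):
    """J_lambda(theta) = empirical risk + lam * sqrt(2 log(2d) / n) * (1 + path norm).
    Assumes the loss 0.5 * (f - y)^2 and no output truncation."""
    n, d = len(xs), len(xs[0])
    empirical = sum(0.5 * (forward(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / n
    penalty = lam * math.sqrt(2.0 * math.log(2.0 * d) / n) * (1.0 + path_norm(theta))
    return empirical + penalty


# Tiny illustrative data set in dimension 2.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, 0.0]
theta_zero = [(0.0, [1.0, 1.0], 0.0)]   # outputs 0 everywhere, path norm 0
theta_fit = [(1.0, [1.0, 0.0], 0.0)]    # fits both points exactly, path norm 1
risk_zero = regularized_risk(theta_zero, xs, ys, lam=4.0)
risk_fit = regularized_risk(theta_fit, xs, ys, lam=4.0)
```

The example shows the trade-off encoded in $J_\lambda$: `theta_fit` has zero empirical risk but pays a larger penalty through its path norm, while `theta_zero` pays only the constant part of the penalty plus its data misfit.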

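Since the path norm of the network constructed in Theorem 2 is bounded, we have the following estimate of its regularized risk.

###### Proposition 6.

Let $\tilde\theta$ be the network constructed in Theorem 2. Then with probability at least $1 - \delta$, we have

$$J_\lambda(\tilde\theta) \le L(\tilde\theta) + \frac{1}{\sqrt{n}} \left( \hat\gamma(f^*) \left( 4 + 5(\lambda + 4) \sqrt{2 \log(2d)} \right) + \sqrt{2 \log(2c/\delta)} \right). \tag{12}$$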
###### Proof.

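According to Definition 4 and the property that $\|\tilde\theta\|_P \le 4\gamma(f^*)$, the regularized cost of $\tilde\theta$ satisfies

$$\begin{aligned}
J_\lambda(\tilde\theta) &= \hat{L}_n(\tilde\theta) + \lambda \sqrt{\frac{2 \log(2d)}{n}} \left( \|\tilde\theta\|_P + 1 \right) \\
&\overset{(1)}{\le} L(\tilde\theta) + (4 + \lambda) \sqrt{\frac{2 \log(2d)}{n}} \left( \|\tilde\theta\|_P + 1 \right) + \sqrt{\frac{2 \log\left( 2c (1 + \|\tilde\theta\|_P)^2 / \delta \right)}{n}} \\
&\le L(\tilde\theta) + (4 + \lambda) \sqrt{\frac{2 \log(2d)}{n}} \left( 4\gamma(f^*) + 1 \right) + \sqrt{\frac{2 \log\left( 2c (1 + 4\gamma(f^*))^2 / \delta \right)}{n}},
\end{aligned} \tag{13}$$

where (1) follows from the generalization bound in Theorem 5, applied with $\rho = 1$ and $B = 1$. The last term can be simplified by using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and $\log(1 + a) \le a$ for $a \ge 0$. So we have

$$\begin{aligned}
\sqrt{2 \log\left( 2c (1 + 4\gamma(f^*))^2 / \delta \right)} &\le \sqrt{2 \log(2c/\delta)} + \sqrt{4 \log(1 + 4\gamma(f^*))} \\
&\le \sqrt{2 \log(2c/\delta)} + 4 \sqrt{\gamma(f^*)} \le \sqrt{2 \log(2c/\delta)} + 4 \hat\gamma(f^*).
\end{aligned}$$

By plugging it into Equation (13), and using $4\gamma(f^*) + 1 \le 5 \hat\gamma(f^*)$, we obtain

$$J_\lambda(\tilde\theta) \le L(\tilde\theta) + \frac{1}{\sqrt{n}} \left( \hat\gamma(f^*) \left( 4 + 5(\lambda + 4) \sqrt{2 \log(2d)} \right) + \sqrt{2 \log(2c/\delta)} \right).$$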

###### Proposition 7 (Properties of the regularized estimator).

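The path-norm regularized estimator satisfies:

$$\begin{aligned}
J_\lambda(\hat\theta_n) &\le J_\lambda(\tilde\theta), \\
\frac{\|\hat\theta_n\|_P}{\sqrt{n}} &\le \lambda^{-1} L(\tilde\theta) + n^{-1/2} \left( \hat\gamma(f^*) \left( 5 + 24 \lambda^{-1} \right) + \lambda^{-1} \sqrt{2 \log(2c/\delta)} \right).
\end{aligned}$$

###### Proof.

The first claim follows from the definition of $\hat\theta_n$. For the second claim, we have $\lambda \sqrt{2 \log(2d)/n}\, \|\hat\theta_n\|_P \le J_\lambda(\hat\theta_n) \le J_\lambda(\tilde\theta)$, so $\|\hat\theta_n\|_P / \sqrt{n} \le J_\lambda(\tilde\theta) / (\lambda \sqrt{2 \log(2d)})$. By using Proposition 6 and $2 \log(2d) \ge 1$, we have

$$\begin{aligned}
\frac{\|\hat\theta_n\|_P}{\sqrt{n}} &\le \frac{\lambda^{-1}}{\sqrt{2 \log(2d)}} L(\tilde\theta) + \frac{\lambda^{-1}}{\sqrt{n}} \left( \hat\gamma(f^*) \left( \frac{4}{\sqrt{2 \log(2d)}} + 5(\lambda + 4) \right) + \sqrt{\frac{2 \log(2c/\delta)}{2 \log(2d)}} \right) \\
&\le \lambda^{-1} L(\tilde\theta) + n^{-1/2} \left( \hat\gamma(f^*) \left( 5 + 24 \lambda^{-1} \right) + \lambda^{-1} \sqrt{2 \log(2c/\delta)} \right).
\end{aligned}$$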

###### Remark 1.

The above proposition establishes the connection between the regularized solution and the special solution constructed in Theorem 2. In particular, the upper bound on $\|\hat\theta_n\|_P / \sqrt{n}$, which controls the generalization gap of the regularized solution, tends to $\lambda^{-1} L(\tilde\theta)$ as $n \to \infty$. This suggests that our regularization term is added appropriately, which forces the generalization gap to be roughly of the same order as the approximation error.

###### Theorem 8 (Main result, noiseless case).

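Under Assumption 1, there exists a pure constant $C$ such that for any $\delta > 0$ and $\lambda \ge 4$, with probability at least $1 - \delta$ over the choice of the training set $S$, the generalization error of estimator (11) satisfies

$$\mathbb{E} |f(x; \hat\theta_n) - f^*(x)|^2 \le \frac{C \gamma^2(f^*)}{m} + \frac{C}{\sqrt{n}} \left( \sqrt{\log(2d)}\, \hat\gamma(f^*)\, \lambda + \frac{\gamma(f^*)}{\sqrt{m \lambda}} + \sqrt{\log(n/\delta)} \right) + \sqrt{\frac{C \left( \hat\gamma(f^*) + \log^{1/2}(1/\delta) \right)}{n}}. \tag{14}$$

###### Remark 2.

It should be noted that the terms on the right-hand side of the above result have a Monte Carlo nature (as in numerical integration). From this viewpoint, the result is quite sharp.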

###### Proof.

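We first have that

$$\begin{aligned}
L(\hat\theta_n) &\overset{(1)}{\le} \hat{L}_n(\hat\theta_n) + 4 \left( \|\hat\theta_n\|_P + 1 \right) \sqrt{\frac{2 \log(2d)}{n}} + \sqrt{\frac{\log\left( 2c (1 + \|\hat\theta_n\|_P)^2 / \delta \right)}{n}} \\
&\overset{(2)}{\le} J_\lambda(\hat\theta_n) + \sqrt{\frac{\log\left( 2c (1 + \|\hat\theta_n\|_P)^2 / \delta \right)}{n}},
\end{aligned}$$

where (1) follows from the a posteriori generalization bound in Theorem 5, and (2) is due to $\lambda \ge 4$. Furthermore,

$$\begin{aligned}
\sqrt{\log\left( 2c (1 + \|\hat\theta_n\|_P)^2 / \delta \right)} &\le \sqrt{\log(2nc/\delta)} + \sqrt{2 \log\left( 1 + n^{-1/2} \|\hat\theta_n\|_P \right)} \\
&\le \sqrt{\log(2nc/\delta)} + \sqrt{2 n^{-1/2} \|\hat\theta_n\|_P}.
\end{aligned}$$

By Proposition 7 and the condition $\lambda \ge 4$, we have that

$$\frac{\|\hat\theta_n\|_P}{\sqrt{n}} \le \lambda^{-1} L(\tilde\theta) + \frac{20}{\sqrt{n}} \left( \hat\gamma(f^*) + \sqrt{\log(1/\delta)} \right).$$

Thus we obtain that

$$\sqrt{\frac{\log\left( 2c (1 + \|\hat\theta_n\|_P)^2 / \delta \right)}{n}} \le \sqrt{\frac{\log(2nc/\delta)}{n}} + \sqrt{\frac{2 L(\tilde\theta)}{\lambda n}} + \sqrt{\frac{40 \left( \hat\gamma(f^*) + \log^{1/2}(1/\delta) \right)}{n}}. \tag{15}$$

On the other hand, for $\lambda \ge 4$ we have

$$J_\lambda(\hat\theta_n) \le J_\lambda(\tilde\theta) \le L(\tilde\theta) + \frac{1}{\sqrt{n}} \left( 11 \lambda\, \hat\gamma(f^*) \sqrt{2 \log(2d)} + \sqrt{2 \log(2c/\delta)} \right). \tag{16}$$

By combining Equations (15) and (16), we obtain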

 L(