# Kolmogorov Width Decay and Poor Approximators in Machine Learning: Shallow Neural Networks, Random Feature Models and Neural Tangent Kernels

We establish a scale separation of Kolmogorov width type between subspaces of a given Banach space under the condition that a sequence of linear maps converges much faster on one of the subspaces. The general technique is then applied to show that reproducing kernel Hilbert spaces are poor L^2-approximators for the class of two-layer neural networks in high dimension, and that two-layer networks with small path norm are poor approximators for certain Lipschitz functions, also in the L^2-topology.


## 1. Introduction

It has been known since the early 1990s that two-layer neural networks with sigmoidal or ReLU activation can approximate arbitrary continuous functions on compact sets in the uniform topology [Cyb89, Hor91]. In fact, when approximating a suitable (infinite-dimensional) class of functions in the L²-topology of any compactly supported Radon probability measure, two-layer networks can evade the curse of dimensionality. In this article, we prove two statements in the opposite direction:

1. infinitely wide random feature functions with norm bounds are much worse approximators in high dimension compared to two-layer neural networks.

2. infinitely wide neural networks are subject to the curse of dimensionality when approximating general Lipschitz functions in high dimension.

In both cases, we consider approximation in the L²-topology. Both statements apply more generally. In the first point, we can consider more general kernel methods instead of random features (including certain neural tangent kernels), and the second claim also holds true for deep ResNets of bounded width. We conjecture that Lipschitz functions in the second statement could be replaced with C^k functions for fixed k. Precise statements of the results are given in Corollary 3.4 and Example 4.3.

To prove these results, we show more generally that if X and Y are subspaces of a Banach space Z and a sequence of linear maps converges much faster on X than on Y, then there must be a Kolmogorov width-type separation between X and Y. The classical notion of Kolmogorov width is considered in Lemma 2.1 and later extended to a stronger notion of separation in Lemma 2.3.

We apply the abstract result to the pairs Barron space (for two-layer networks) / Lipschitz space, and RKHS / Barron space. In the first case, the sequence of linear maps is given by a type of Monte-Carlo integration, in the second case by projection onto the eigenspaces of the RKHS kernel.

This article is structured as follows. In Section 2, we prove the abstract result which we apply to Barron and Lipschitz space in Section 3 and to RKHS and Barron space in Section 4. We conclude by discussing our results and some open questions in Section 5. In appendices A and B, we review the natural function spaces for shallow neural networks and kernel methods respectively. In Appendix B, we specifically focus on kernels arising from random feature models and neural tangent kernels for two-layer neural networks.

### 1.1. Notation

We denote the closed ball of radius t > 0 around the origin in a Banach space X by B^X_t and the unit ball by B^X. The space of continuous linear maps between Banach spaces X and Y is denoted by L(X, Y) and the continuous dual space of X by X*.

## 2. An Abstract Lemma

### 2.1. Kolmogorov Width Version

The Kolmogorov width of a function class F in another function class G with respect to a metric d on the union of both classes is defined as the biggest distance of an element in G from the class F:

 w_d(F; G) = sup_{g∈G} dist(g, F) = sup_{g∈G} inf_{f∈F} d(f, g).

In this article, we consider the case where G is the unit ball B^Y_1 in a Banach space Y, F is the ball B^X_t of radius t in a Banach space X, and d is induced by the norm of a Banach space Z into which both X and Y embed densely. As t increases, points in B^Y_1 are approximated to higher degrees of accuracy by elements of B^X_t. The rate of decay

 ρ(t) := w_Z(B^X_t, B^Y_1)

provides a quantitative measure of the density of X in Y with respect to the topology of Z. For a different point of view on width focusing on approximation by finite-dimensional spaces, see [Lor66, Chapter 9].
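As a toy illustration of the quantity ρ(t), not taken from the paper, one can take X = (R², ‖·‖_{ℓ¹}), Y = (R², ‖·‖_{ℓ∞}), Z = (R², ‖·‖_{ℓ²}) and estimate ρ(t) = sup over the ℓ∞ unit sphere of the ℓ²-distance to the scaled ℓ¹-ball; by convexity the sup is attained at the corners (±1, ±1), so ρ(t) = max(0, 2 − t)/√2. A minimal numerical sketch:

```python
import math

def seg_dist(p, a, b):
    # Euclidean distance from point p to the segment [a, b] in R^2
    dx, dy = b[0] - a[0], b[1] - a[1]
    L2 = dx * dx + dy * dy
    s = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / L2))
    return math.hypot(p[0] - (a[0] + s * dx), p[1] - (a[1] + s * dy))

def dist_to_l1_ball(p, t):
    # l2-distance from p to the l1-ball of radius t (a diamond in R^2)
    if abs(p[0]) + abs(p[1]) <= t:
        return 0.0
    v = [(t, 0.0), (0.0, t), (-t, 0.0), (0.0, -t)]
    return min(seg_dist(p, v[i], v[(i + 1) % 4]) for i in range(4))

def rho(t, m=400):
    # sup of the distance over the boundary of the l_infty unit ball
    best = 0.0
    for i in range(m + 1):
        s = -1.0 + 2.0 * i / m
        for p in [(s, 1.0), (s, -1.0), (1.0, s), (-1.0, s)]:
            best = max(best, dist_to_l1_ball(p, t))
    return best
```

For 0 ≤ t ≤ 2 this returns (2 − t)/√2, and for t ≥ 2 it vanishes. The finite-dimensional picture only illustrates the definition; in the infinite-dimensional setting of this article, ρ(t) remains positive for all t and it is the rate of its decay that matters.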

In the following lemma, we show that if there exists a sequence of linear operators A_n which behaves sufficiently differently on X and Y, then ρ(t) must decay slowly as t → ∞.

###### Lemma 2.1.

Let X, Y, Z, W be Banach spaces such that X and Y embed continuously into Z. Assume that A_n, A : Z → W are continuous linear operators such that

 ‖A_n − A‖_{L(X,W)} ≤ C_X n^{−α}, ‖A_n − A‖_{L(Y,W)} ≥ c_Y n^{−β}, ‖A_n − A‖_{L(Z,W)} ≤ C_Z

for α > β > 0 and constants c_Y, C_X, C_Z > 0. Then

 (2.1) ρ(t) ≥ 2^{−β} (c_Y/2)^{α/(α−β)} (C_Z C_X^{β/(α−β)})^{−1} t^{−β/(α−β)} ∀ t ≥ c_Y/(2C_X)

and

 (2.2) liminf_{t→∞} ( t^{β/(α−β)} ρ(t) ) ≥ (c_Y/2)^{α/(α−β)} (C_Z C_X^{β/(α−β)})^{−1}.
###### Proof.

Choose a sequence y_n ∈ B^Y_1 such that ‖(A_n − A) y_n‖_W ≥ c_Y n^{−β} and such that

 x_n ∈ argmin_{‖x‖_X ≤ t_n} ‖x − y_n‖_Z for t_n := (c_Y/(2C_X)) n^{α−β}

(see Remark 2.2). Then

 c_Y n^{−β} ≤ ‖(A_n − A) y_n‖_W
 ≤ ‖(A_n − A)(y_n − x_n)‖_W + ‖(A_n − A) x_n‖_W
 ≤ C_Z ‖y_n − x_n‖_Z + C_X n^{−α} ‖x_n‖_X
 ≤ C_Z ‖x_n − y_n‖_Z + (c_Y/2) n^{−β}.

We therefore have

 ‖x_n − y_n‖_Z ≥ (c_Y/(2C_Z)) n^{−β} = (c_Y/(2C_Z)) ((2C_X/c_Y) t_n)^{−β/(α−β)} = (c_Y/2)^{α/(α−β)} (C_Z C_X^{β/(α−β)})^{−1} t_n^{−β/(α−β)}.

In particular ρ(t_n) ≥ ‖x_n − y_n‖_Z since ‖y_n‖_Y ≤ 1. For general t ≥ t_1, take n such that t_{n−1} ≤ t ≤ t_n. Then

 ρ(t) ≥ ρ(t_n) ≥ (c_Y/2)^{α/(α−β)} (C_Z C_X^{β/(α−β)})^{−1} t_n^{−β/(α−β)}
 ≥ (c_Y/2)^{α/(α−β)} (C_Z C_X^{β/(α−β)})^{−1} (t_{n−1}/t_n)^{β/(α−β)} t^{−β/(α−β)}
 = (c_Y/2)^{α/(α−β)} (C_Z C_X^{β/(α−β)})^{−1} ((n−1)/n)^β t^{−β/(α−β)}.

Since ((n−1)/n)^β ≥ 2^{−β} for n ≥ 2, this proves (2.1). As t → ∞, so does n, and the n-dependent factor converges to 1, which proves (2.2). ∎
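The exponent arithmetic in the proof, namely that the lower bound for ‖x_n − y_n‖_Z rewritten in terms of t_n equals (c_Y/(2C_Z)) n^{−β}, can be sanity-checked numerically; the constants below are arbitrary hypothetical values with α > β:

```python
import math

# arbitrary hypothetical constants with alpha > beta > 0
alpha, beta = 0.8, 0.3
cY, CX, CZ = 1.7, 2.2, 3.1

for n in [3, 10, 250]:
    tn = cY / (2 * CX) * n ** (alpha - beta)
    # lower bound expressed through t_n ...
    via_t = (cY / 2) ** (alpha / (alpha - beta)) / (CZ * CX ** (beta / (alpha - beta))) \
            * tn ** (-beta / (alpha - beta))
    # ... agrees with the bound (c_Y / (2 C_Z)) n^{-beta}
    via_n = cY / (2 * CZ) * n ** (-beta)
    assert math.isclose(via_t, via_n, rel_tol=1e-12)
```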

###### Remark 2.2.

Generally, elements like x_n and y_n may not exist if the extremum is not attained. In that case, we can choose x_n such that ‖x_n − y_n‖_Z is sufficiently close to its infimum and y_n such that ‖(A_n − A) y_n‖_W is sufficiently close to its supremum. To simplify our presentation, we assume that the supremum and infimum are attained.

The choice of x_n as a minimizer is valid if

1. X embeds into Z compactly, so the minimum of the continuous function x ↦ ‖x − y_n‖_Z is attained on the compact set B^X_{t_n}, or

2. the embedding X ↪ Z maps closed bounded sets to closed sets and Z admits continuous projections onto closed convex sets (for example, if Z is uniformly convex).

In the applications below, the first condition will be met.

### 2.2. Improved Estimate

In the previous section, we have shown by elementary means that the estimate

 liminf_{t→∞} ( t^γ sup_{‖y‖_Y ≤ 1} inf_{‖x‖_X ≤ t} ‖x − y‖_Z ) ≥ c > 0

holds for suitable γ > 0 if a sequence of linear maps between Z and another Banach space behaves very differently on the subspaces X and Y of Z. So intuitively, on each scale t there exists an element of B^Y_1 which is poorly approximable by elements of B^X_t on this scale. In this section, we establish that there exists a single point y ∈ B^Y_1 which is poorly approximable across infinitely many scales. This statement has applications in Wasserstein gradient flows for machine learning which we discuss in a companion article [WE20].

###### Lemma 2.3.

Let X, Y, Z, W be Banach spaces such that X and Y embed continuously into Z. Assume that A_n, A : Z → W are operators such that

 ‖A_n − A‖_{L(X,W)} ≤ C_X n^{−α}, ‖A_n − A‖_{L(Y,W)} ≥ c_Y n^{−β}, ‖A_n − A‖_{L(Z,W)} ≤ C_Z

for α > β > 0 and constants c_Y, C_X, C_Z > 0. Then there exists y ∈ B^Y_1 such that for every γ > β/(α−β) we have

 limsup_{t→∞} ( t^γ inf_{‖x‖_X ≤ t} ‖x − y‖_Z ) = ∞.

The result is stronger than the previous one in that it fixes a single point y which is poorly approximable on infinitely many scales t. While on each scale there exists a point which is poorly approximable, we only show that y is poorly approximable on infinitely many scales, not on all scales.

###### Proof of Lemma 2.3.

Since Y embeds continuously into Z, there exists a constant C_Y > 0 such that ‖A_n − A‖_{L(Y,W)} ≤ C_Y for all n.

Definition of y. Choose sequences y_n ∈ B^Y_1 and w*_n ∈ B^{W*}_1 such that

 w*_n ∘ (A_n − A)(y_n) ≥ c_Y n^{−β}.

Consider two sequences of strictly increasing integers (n_k), (m_k) such that

 ∑_{k=1}^∞ 1/n_k ≤ 1.

We will impose further conditions below. Set

 y := ∑_{k=1}^∞ (ε_k/n_k) y_{m_k}

where the signs ε_k ∈ {−1, +1} are chosen inductively such that

 ε_K · w*_{m_K} ∘ (A_{m_K} − A)( ∑_{k=1}^{K−1} (ε_k/n_k) y_{m_k} ) ≥ 0.

Clearly

 ‖y‖_Y ≤ ∑_{k=1}^∞ (|ε_k|/n_k) ‖y_{m_k}‖_Y = ∑_{k=1}^∞ 1/n_k ≤ 1.

To shorten notation, define L_K := w*_{m_K} ∘ (A_{m_K} − A) and note that the estimates for A_{m_K} − A transfer to L_K. If ε_K = 1 we have

 L_K y = L_K( ∑_{k=1}^{K−1} (ε_k/n_k) y_{m_k} ) + (1/n_K) L_K y_{m_K} + L_K( ∑_{k=K+1}^∞ (ε_k/n_k) y_{m_k} )
 ≥ 0 + (1/n_K) L_K y_{m_K} − C_Y ∑_{l=K+1}^∞ 1/n_l
 ≥ (1/n_K) ( c_Y m_K^{−β} − C_Y n_K ∑_{l=K+1}^∞ 1/n_l )

and similarly, if ε_K = −1, we obtain

 L_K y ≤ −(1/n_K) ( c_Y m_K^{−β} − C_Y n_K ∑_{l=K+1}^∞ 1/n_l ).

Slow approximation rate. Choose

 t_k := (c_Y m_k^{α−β}) / (2 C_X n_k), x_k ∈ argmin_{‖x‖_X ≤ t_k} ‖x − y‖_Z.

Then

 (1/n_k) ( c_Y m_k^{−β} − C_Y n_k ∑_{l=k+1}^∞ 1/n_l ) ≤ |L_k y|
 ≤ |L_k (y − x_k)| + |L_k x_k|
 ≤ C_Z ‖y − x_k‖_Z + ‖A_{m_k} − A‖_{L(X,W)} ‖x_k‖_X
 ≤ C_Z ‖y − x_k‖_Z + C_X m_k^{−α} t_k.

Since t_k was chosen precisely such that

 C_X m_k^{−α} t_k = (c_Y / (2 n_k)) m_k^{−β},

we obtain that

 (2.3) (1/(2 C_Z n_k)) ( c_Y m_k^{−β} − 2 C_Y n_k ∑_{l=k+1}^∞ 1/n_l ) ≤ ‖y − x_k‖_Z = min_{‖x‖_X ≤ t_k} ‖x − y‖_Z.

For this lower bound to be meaningful, the first term in the bracket has to dominate the second term. We specify the scaling relationship between m_k and n_k as

 m_k = n_k^{k/(α−β)}.

In this definition, m_k is not typically an integer unless 1/(α−β) is an integer (or, to hold along a subsequence, rational). In the general case, we choose the integer closest to n_k^{k/(α−β)}. To simplify the presentation, we proceed with the non-integer quantity and note that the results are insensitive to perturbations of this order.

We obtain

 t_k = (c_Y/(2C_X)) m_k^{α−β}/n_k = (c_Y/(2C_X)) n_k^{k−1},
 m_k^{−β}/n_k = n_k^{−βk/(α−β) − 1} = n_k^{−(β(k−1)+α)/(α−β)} = ((2C_X/c_Y) t_k)^{−β/(α−β) − α/((k−1)(α−β))}.

In particular, note that t_k → ∞ as k → ∞. In order for

 n_k ∑_{l=k+1}^∞ 1/n_l

to be small, we need n_k to grow super-exponentially. We specify n_k := 2^{k^k}, which satisfies ∑_{k=1}^∞ 1/n_k ≤ ∑_{k=1}^∞ 2^{−k} = 1 since k^k ≥ k, and compute

 ∑_{l=k+1}^∞ 1/n_l = ∑_{l=1}^∞ 2^{−(k+l)^{k+l}} ≤ ∑_{l=1}^∞ 2^{−k^k (k+l)^l} = ∑_{l=1}^∞ (1/n_k)^{(k+l)^l}
 ≤ 2 n_k^{−(k+1)} ≪ n_k^{−βk/(α−β) − 1} = m_k^{−β}/n_k

for large enough k. Thus we can neglect the negative term on the left-hand side of (2.3) at the price of a slightly smaller constant:

 (c_Y/(4 C_Z)) ((2C_X/c_Y) t_k)^{−β/(α−β) − α/((k−1)(α−β))} = (c_Y/(4 C_Z n_k)) m_k^{−β} ≤ min_{‖x‖_X ≤ t_k} ‖x − y‖_Z.

Finally, we conclude that for all γ > β/(α−β) we have

 limsup_{t→∞} ( t^γ inf_{‖x‖_X ≤ t} ‖x − y‖_Z ) ≥ limsup_{k→∞} ( t_k^γ inf_{‖x‖_X ≤ t_k} ‖x − y‖_Z )
 ≥ C_{X,Y,Z} limsup_{k→∞} t_k^{γ − β/(α−β) − α/((k−1)(α−β))} = ∞. ∎
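The choice n_k = 2^{k^k} and the tail bound ∑_{l>k} 1/n_l ≤ 2 n_k^{−(k+1)} used in the proof can be checked with exact rational arithmetic for small k (a numerical sanity check, not part of the proof):

```python
from fractions import Fraction

def nk(k):
    # the super-exponentially growing sequence n_k = 2^(k^k)
    return 2 ** (k ** k)

# sum_k 1/n_k <= 1, as required for the construction of y
assert sum(Fraction(1, nk(k)) for k in range(1, 6)) <= 1

for k in [2, 3]:
    # truncated tail; the dropped terms are astronomically smaller still
    tail = sum(Fraction(1, nk(l)) for l in range(k + 1, k + 4))
    # the bound used in the proof: tail <= 2 * n_k^{-(k+1)}
    assert tail <= 2 * Fraction(1, nk(k) ** (k + 1))
```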

## 3. Approximating Lipschitz Functions by Functions of Low Complexity

In this section, we apply Lemma 2.3 to the situation where general Lipschitz functions are approximated by functions in a space of much lower complexity. Examples include function spaces for infinitely wide neural networks with a single hidden layer and spaces for deep ResNets of bounded width. For simplicity, we first consider uniform approximation and then modify the ideas to also cover L²-approximation.

### 3.1. Approximation in L∞

Consider the case where

1. Z is the space of continuous functions on the unit cube Q = [0,1]^d with the norm

 ‖ϕ‖_Z = sup_{x∈Q} |ϕ(x)|,

2. Y is the space of Lipschitz-continuous functions on Q with the norm

 ‖ϕ‖_Y = sup_{x∈Q} |ϕ(x)| + sup_{x≠y} |ϕ(x) − ϕ(y)| / |x − y|, and

3. X is a Banach space of functions on Q such that

• X embeds continuously into Z, and

• the Monte-Carlo estimate

 E_{X_i ∼ L^d|_Q iid} { sup_{ϕ∈B^X} [ (1/n) ∑_{i=1}^n ϕ(X_i) − ∫_Q ϕ(x) dx ] } ≤ C_X/√n

holds.

Examples of admissible spaces X are Barron space for two-layer ReLU networks and the compositional function space for deep ReLU ResNets of finite width, see [EMW19a, EMW18, EMW19b]. A brief review of Barron space is provided in Appendix A. The Monte-Carlo estimate is proved by estimating the Rademacher complexity of the unit ball in the respective function space; the corresponding constants for Barron space and for compositional function space are given in [EMW19b, Theorems 6 and 12].
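The n^{−1/2} Monte-Carlo rate assumed above can be illustrated numerically for a single fixed smooth integrand (a much weaker statement than the uniform bound over B^X, and only a sketch; the test function is a hypothetical choice):

```python
import cmath, math, random

random.seed(0)
d = 10

def phi(x):
    # hypothetical fixed test integrand on Q = [0, 1]^d
    return math.sin(sum(x))

# exact value: the integral of sin(x_1 + ... + x_d) over Q equals Im[((e^i - 1)/i)^d]
truth = (((cmath.exp(1j) - 1) / 1j) ** d).imag

def mc_error(n, reps=25):
    # mean absolute Monte-Carlo error over several repetitions
    err = 0.0
    for _ in range(reps):
        est = sum(phi([random.random() for _ in range(d)]) for _ in range(n)) / n
        err += abs(est - truth)
    return err / reps

e_small, e_large = mc_error(100), mc_error(10000)
# a 100-fold increase in sample size shrinks the error roughly 10-fold,
# independently of the dimension d
```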

We observe the following: if (X_1, …, X_n) is a vector of iid random variables sampled from the uniform distribution on Q, then

 sup_{ϕ 1-Lipschitz} ( (1/n) ∑_{i=1}^n ϕ(X_i) − ∫_Q ϕ(x) dx ) = W_1( L^d|_Q , (1/n) ∑_{i=1}^n δ_{X_i} )

is the 1-Wasserstein distance between d-dimensional Lebesgue measure on the cube and the empirical measure generated by the random points; see [Vil08, Chapter 5] for further details on Wasserstein distances and the link between Lipschitz functions and optimal transport theory. The distance on R^d for which the Wasserstein transportation cost is computed is the same for which ϕ is 1-Lipschitz.

Empirical measures converge to the underlying distribution slowly in high dimension [FG15], by which we mean that

 E_{X ∼ (L^d|_Q)^n} W_1( L^d|_Q , (1/n) ∑_{i=1}^n δ_{X_i} ) ≥ c_d n^{−1/d}

for some dimension-dependent constant c_d > 0. Observe that also

 sup_{ϕ 1-Lipschitz} ( (1/n) ∑_{i=1}^n ϕ(X_i) − ∫_Q ϕ(x) dx )
 = sup_{ϕ 1-Lipschitz, ϕ(0)=0} ( (1/n) ∑_{i=1}^n ϕ(X_i) − ∫_Q ϕ(x) dx )
 ≤ [1 + diam(Q)] sup_{‖ϕ‖_Y ≤ 1, ϕ(0)=0} ( (1/n) ∑_{i=1}^n ϕ(X_i) − ∫_Q ϕ(x) dx )

where diam(Q) is the diameter of the d-dimensional unit cube with respect to the norm for which ϕ is 1-Lipschitz. Here we used that replacing ϕ by ϕ + c for c ∈ R does not change the difference of the two integrals, and that on the space of functions with ϕ(0) = 0 the equivalence

 [ϕ]_Y := sup_{x≠y} |ϕ(x) − ϕ(y)| / |x − y| ≤ ‖ϕ‖_Y ≤ (1 + diam(Q)) [ϕ]_Y

holds. By ω_d we denote the Lebesgue measure of the unit ball in R^d with respect to the correct norm.
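The slow n^{−1/d} behaviour has an elementary geometric counterpart: the distance from a typical point of the cube to the nearest of n uniform samples scales like n^{−1/d}, so it degrades quickly with the dimension. A rough numerical illustration (not from the paper; periodic torus distance as in Section 3.2):

```python
import math, random

random.seed(1)

def nearest_dist(d, n, queries=100):
    # average distance from a uniform random point of the torus [0,1)^d
    # to the nearest of n uniform sample points (periodic distance)
    pts = [[random.random() for _ in range(d)] for _ in range(n)]
    total = 0.0
    for _ in range(queries):
        q = [random.random() for _ in range(d)]
        total += min(
            math.sqrt(sum(min(abs(qi - pi), 1.0 - abs(qi - pi)) ** 2
                          for qi, pi in zip(q, p)))
            for p in pts
        )
    return total / queries

for d in [2, 4, 8]:
    # compare the empirical covering distance to the scale n^{-1/d}
    print(d, round(nearest_dist(d, 256), 3), round(256 ** (-1 / d), 3))
```

With n fixed, the covering distance grows rapidly in d, mirroring the lower bound of [FG15].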

###### Lemma 3.1.

For every n ∈ N we can choose points x_1, …, x_n in Q such that

 W_1( L^d|_Q , (1/n) ∑_{i=1}^n δ_{x_i} ) ≥ (d/(d+1)) [(d+1) ω_d]^{−1/d} n^{−1/d}

and

 sup_{ϕ∈B^X} [ (1/n) ∑_{i=1}^n ϕ(x_i) − ∫_Q ϕ(x) dx ] ≤ C_X n^{−1/2}.
###### Proof.

First, we prove the following. Claim: Let x_1, …, x_n be any collection of points in Q. Then

 W_1( L^d|_Q , (1/n) ∑_{i=1}^n δ_{x_i} ) ≥ (d/(d+1)) [(d+1) ω_d]^{−1/d} n^{−1/d}.

Proof of claim: Choose ε > 0 and consider the set

 U = ⋃_{i=1}^n B_{ε n^{−1/d}}(x_i).

We observe that

 L^d(U ∩ Q) ≤ L^d(U) ≤ ∑_{i=1}^n L^d( B_{ε n^{−1/d}}(x_i) ) = n ω_d (ε n^{−1/d})^d = ω_d ε^d.

So any transport plan between L^d|_Q and the empirical measure needs to transport mass of at least 1 − ω_d ε^d by a distance of at least ε n^{−1/d}. We conclude that

 W_1( L^d|_Q , (1/n) ∑_{i=1}^n δ_{x_i} ) ≥ sup_{ε>0} (1 − ω_d ε^d) ε n^{−1/d}.

The supremum is attained when

 0 = 1 − (d+1) ω_d ε^d ⇔ ε = [(d+1) ω_d]^{−1/d} ⇒ 1 − ω_d ε^d = 1 − 1/(d+1) = d/(d+1).

This concludes the proof of the claim.

Proof of the Lemma: Using the claim, any points x_1, …, x_n such that

 sup_{ϕ∈B^X} [ (1/n) ∑_{i=1}^n ϕ(x_i) − ∫_Q ϕ(x) dx ] ≤ E{ sup_{ϕ∈B^X} [ (1/n) ∑_{i=1}^n ϕ(X_i) − ∫_Q ϕ(x) dx ] } ≤ C_X n^{−1/2}

satisfy the conditions; such points exist since the bound holds in expectation. ∎
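The optimization over ε in the claim can be checked numerically: a grid maximum of (1 − ω_d ε^d) ε matches the closed-form constant (d/(d+1)) [(d+1) ω_d]^{−1/d}. A small sketch (the grid size is an arbitrary choice):

```python
import math

def omega(d):
    # Lebesgue measure of the d-dimensional Euclidean unit ball
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def claim_constant(d):
    # the constant d/(d+1) * [(d+1) * omega_d]^(-1/d) from the claim
    return d / (d + 1) * ((d + 1) * omega(d)) ** (-1 / d)

def grid_max(d, m=100000):
    # maximize (1 - omega_d * eps^d) * eps over a grid of eps values
    w = omega(d)
    e_max = w ** (-1 / d)  # beyond this the transported-mass factor is <= 0
    return max((1 - w * (e_max * i / m) ** d) * (e_max * i / m)
               for i in range(1, m))
```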

For any n ∈ N, we fix such a collection of points x_i^n and define

 A_n : Z → R, A_n(ϕ) = (1/n) ∑_{i=1}^n ϕ(x_i^n), A : Z → R, A(ϕ) = ∫_Q ϕ(x) dx.

Clearly

 |Aϕ|, |A_nϕ| ≤ ‖ϕ‖_{C^0} = ‖ϕ‖_Z.

Thus we can apply Lemma 2.3 with α = 1/2, β = 1/d and (for d > 2)

 β/(α−β) = (1/d) / (1/2 − 1/d) = (1/d) / ((d−2)/(2d)) = 2/(d−2).
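The exponent computed above can be verified directly; note that it tends to 0 as d → ∞, so the guaranteed decay of the width becomes arbitrarily slow in high dimension (a trivial numerical check):

```python
alpha = 0.5          # Monte-Carlo rate of the space X
for d in [3, 4, 10, 100]:
    beta = 1.0 / d   # Wasserstein rate in dimension d
    gamma = beta / (alpha - beta)
    assert abs(gamma - 2.0 / (d - 2)) < 1e-12
    # gamma -> 0 as d -> infinity: the width barely decays in high dimension
```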
###### Corollary 3.2.

There exists a 1-Lipschitz function ϕ on Q such that

 limsup_{t→∞} ( t^γ inf_{‖f‖_X ≤ t} ‖ϕ − f‖_{L^∞(Q)} ) = ∞

for all γ > 2/(d−2).

### 3.2. Approximation in L²

Point evaluation functionals are no longer well defined if we choose Z = L²(Q). We therefore need to replace A_n by functionals of the type

 A_n(ϕ) = (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(x_i^n)} ϕ dx

for sample points x_i^n and find a balance between the radii ε_n shrinking too fast (causing the operator norms to blow up) and shrinking too slowly (leading to better approximation properties on Lipschitz functions).

We interpret Q as the unit cube [0,1]^d for function spaces, but as a d-dimensional flat torus when considering balls. Namely, the ball B_r(x) in Q is to be understood as the projection of the ball of radius r around x in R^d onto Q. This allows us to avoid boundary effects.

###### Lemma 3.3.

For every n ∈ N we can choose points x_1, …, x_n in Q such that the estimates

 sup_{ϕ∈B^X} [ (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(x_i)} ϕ dx − ∫_Q ϕ(x) dx ] ≤ 3 C_X n^{−1/2}
 sup_{ϕ∈B^Y} [ (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(x_i)} ϕ dx − ∫_Q ϕ(x) dx ] ≥ c_d n^{−1/d}
 sup_{ϕ∈B^Z} [ (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(x_i)} ϕ dx − ∫_Q ϕ(x) dx ] ≤ C_d

hold. Here c_d, C_d > 0 are dimension-dependent constants and

 ε_n = γ_d n^{−1/d}

for a dimension-dependent γ_d > 0.

###### Proof of Lemma 3.3.

L²-estimate. In all of the following, we rely on the interpretation of balls as periodic to avoid boundary effects. For a sample S = (X_1, …, X_n), denote

 A_S(ϕ) = (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(X_i)} ϕ dx.

Observe that

 sup_{‖ϕ‖_{L²}≤1} ( A_S(ϕ) − A(ϕ) ) = sup_{‖ϕ‖_{L²}≤1} [ (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(X_i)} ϕ dx − ∫ ϕ dx ]
 = sup_{‖ϕ‖_{L²}≤1, ∫ϕ=0} (1/n) ∑_{i=1}^n ⨍_{B_{ε_n}(X_i)} ϕ dx
 ≤ (1/(n ω_d ε_n^d)) ‖ ∑_{i=1}^n 1_{B_{ε_n}(X_i)} ‖_{L²}.

We compute

 ‖ ∑_{i=1}^n 1_{B_{ε_n}(x_i)} ‖²_{L²} = ∑_{i=1}^n ‖ 1_{B_{ε_n}(x_i)} ‖²_{L²} + ∑_{i≠j} ∫ 1_{B_{ε_n}(x_i)} 1_{B_{ε_n}(x_j)} dx
 (3.1) = n ω_d ε_n^d + ∑_{i≠j} | B_{ε_n}(x_i) ∩ B_{ε_n}(x_j) |.

It is easy to see that

 E_{(X_i, X_j) ∼ U_{Q×Q}} | B_{ε_n}(X_i) ∩ B_{ε_n}(X_j) | = E_{X ∼ U_Q} | B_{ε_n}(X) ∩ B_{ε_n}(0) |
 = ∫_{B_{2ε_n}} | B_{ε_n}(x) ∩ B_{ε_n}(0) | dx
 = ε_n^d ∫_{B_2} | B_{ε_n}(ε_n x) ∩ B_{ε_n}(0) | dx
 = ε_n^d ∫_{B_2} ε_n^d | B_1(x) ∩ B_1(0) | dx
 (3.2) = ε_n^{2d} c̄_d 2^d ω_d,

where

 c̄_d := (1/(ω_d 2^d)) ∫_{B_2} | B_1(x) ∩ B_1(0) | dx

is a dimension-dependent constant. Thus combining (3.1) and (3.2) we find that

 E_{S ∼ (L^d|_Q)^n} ‖ ∑_{i=1}^n 1_{B_{ε_n}(X_i)} ‖²_{L²} = ω_d ε_n^d [ n + n(n−1) c̄_d (2ε_n)^d ] ≤ ω_d n ε_n^d [ 1 + c̄_d 2^d n ε_n^d ].

This allows us to estimate

 E_S [ sup_{‖ϕ‖_{L²}≤1} ( A_S ϕ − A ϕ ) ] ≤ (1/(n ω_d ε_n^d)) E_S ‖ ∑_{i=1}^n 1_{B_{ε_n}(X_i)} ‖_{L²}
 ≤ (1/(n ω_d ε_n^d)) ( E_S ‖ ∑_{i=1}^n 1_{B_{ε_n}(X_i)} ‖²_{L²} )^{1/2}
 ≤ (1/(n ω_d ε_n^d)) √( ω_d n ε_n^d [ 1 + c̄_d 2^d n ε_n^d ] )
 = √( (1 + c̄_d 2^d n ε_n^d) / (ω_d n ε_n^d) ).
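The constant c̄_d is an average overlap volume and can be approximated by Monte-Carlo. In dimension one, |B_1(x) ∩ B_1(0)| = max(0, 2 − |x|) and ω_1 2¹ = 4, so c̄_1 = 1 exactly, which makes d = 1 a convenient test case (a numerical sanity check, not part of the proof):

```python
import random

random.seed(2)

# In d = 1: the overlap B_1(x) ∩ B_1(0) has length max(0, 2 - |x|)
n = 200000
acc = 0.0
for _ in range(n):
    x = random.uniform(-2.0, 2.0)  # uniform sample of B_2 in dimension one
    acc += max(0.0, 2.0 - abs(x))
integral = 4.0 * acc / n           # |B_2| times the average overlap length
c_bar_1 = integral / 4.0           # divide by omega_1 * 2^1
```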