# On overcoming the Curse of Dimensionality in Neural Networks

Let $H$ be a reproducing kernel Hilbert space. For $i=1,\dots,N$, let $x_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}^m$ comprise our dataset. Let $f^*\in H$ be the unique global minimiser of the functional $J(f)=\frac{1}{2}\|f\|_H^2+\frac{1}{N}\sum_{i=1}^N\frac{1}{2}|f(x_i)-y_i|^2$. In this paper we show that for each $n\in\mathbb{N}$ there exists a two layer network where the first layer has $nm$ basis functions $\Phi_{x_{i_k},j}$ for $i_1,\dots,i_n\in\{1,\dots,N\}$, $j=1,\dots,m$ and the second layer takes a weighted summation of the first layer, such that the functions $f_n$ realised by these networks satisfy $\|f_n-f^*\|_H\le O(1/\sqrt{n})$ for all $n\in\mathbb{N}$. Thus the error rate is independent of the input dimension $d$, the output dimension $m$ and the data size $N$.


## 1. Introduction

Regularisation networks were introduced in [2]. As cited in [1] and [3], a function can be approximated by various forms of such networks whose first layer has $n$ components, with an error estimate that depends on the input dimension $d$ and on the smoothness of the function. Thus in higher dimensions one needs exponentially more neurons to achieve the same error, or the functions to be approximated must be very smooth.

In this paper, for the specific minimiser of the regularised empirical loss function, we prove a dimension independent result with an error estimate of $O(1/\sqrt{n})$.

### 1.1. Structure of this paper

In Section 2 we present the main result of this paper. In Section 3 we outline the main properties of reproducing kernel Hilbert spaces that we use. In Section 4 we prove the preliminary properties of the functionals appearing in our minimisation problem. In Section 5 we prove the existence and uniqueness of the global minimiser of our problem. In Section 6 we obtain the convergence rate of the stochastic gradient descent sequence. In Section 7 we prove that the sequence generated by stochastic gradient descent is realisable by our networks and prove our main result. In Section 8 we outline our current research directions.

## 2. Main Result

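Let $H$ be a reproducing kernel Hilbert space. Let us denote by $H^*$ the dual space of $H$. Let $R_H:H^*\to H$ be the Riesz representation operator.

Let $x\in\mathbb{R}^d$ and $j\in\{1,\dots,m\}$; then we may consider the linear functional $\ell_{x,e_j}\in H^*$ defined by

$$\langle\ell_{x,e_j},\varphi\rangle=\varphi_j(x).$$

Let us denote

$$\Phi_{x,j}=R_H(\ell_{x,e_j}),$$

i.e. $\Phi_{x,j}\in H$ and

$$(\Phi_{x,j},\varphi)_H=\langle\ell_{x,e_j},\varphi\rangle=\varphi_j(x)\quad\text{for all }\varphi\in H.$$

Let us note that for each $x\in\mathbb{R}^d$ and $j\in\{1,\dots,m\}$, $\Phi_{x,j}:\mathbb{R}^d\to\mathbb{R}^m$ is a vector valued function.

For $i=1,\dots,N$ let $x_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}^m$ comprise our dataset.

For $f\in H$ we consider the minimisation of the functional

$$J(f)=\frac{1}{2}\|f\|_H^2+\frac{1}{N}\sum_{i=1}^N\frac{1}{2}|f(x_i)-y_i|^2$$

where $\frac{1}{2}\|f\|_H^2$ is a regularisation term and $\frac{1}{2}|f(x_i)-y_i|^2$ are the corresponding losses.

Let us also denote

$$J_i(f)=\frac{1}{2}\|f\|_H^2+\frac{1}{2}|f(x_i)-y_i|^2.$$

Clearly we have $J(f)=\frac{1}{N}\sum_{i=1}^N J_i(f)$.

As we will see, $J$ has a unique global minimiser $f^*\in H$.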

The following theorem is our main result.

###### Theorem 1.

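For each $n\in\mathbb{N}$ there exists a two layer network where the first layer has $nm$ basis functions $\Phi_{x_{i_k},j}$ for $i_1,\dots,i_n\in\{1,\dots,N\}$ and $j=1,\dots,m$ and the second layer takes a weighted summation of the first layer, such that the function $f_n$ realised by this network has the governing error rate $O(1/\sqrt{n})$. More precisely, for any $p>1$ there exists $C_p>0$ such that

$$\|f_n-f^*\|_H^2\le C_p\left(\|f^*\|_H^2\frac{1}{n^p}+E\left[\|DJ_I(f^*)\|_{H^*}^2\right]\frac{1}{n}\right)\quad\text{for all }n\in\mathbb{N}.$$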
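To make the objects of this section concrete, the following is a minimal numerical sketch and not part of the paper's argument. It assumes a specific kernel that the paper does not fix: the matrix-valued kernel $k(x,z)\,\mathrm{Id}_m$ with $k$ a scalar Gaussian kernel, for which $\Phi_{x,j}(z)=k(x,z)e_j$. Under that assumption, the first-order condition for $J$ gives $f^*=\sum_i k(x_i,\cdot)\,c_i$ with $(K+N\,\mathrm{Id})c=Y$, where $K$ is the Gram matrix. The names `gaussian_gram`, `fit_minimiser`, `evaluate`, `J` and the synthetic dataset are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-|a - b|^2 / (2 sigma^2)); the kernel is an assumption."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_minimiser(X, Y):
    """Coefficients c (N x m) of f*(z) = sum_i k(x_i, z) c_i, solving (K + N I) c = Y."""
    N = X.shape[0]
    K = gaussian_gram(X, X)
    return np.linalg.solve(K + N * np.eye(N), Y)

def evaluate(X, c, Z):
    """Evaluate f(z) = sum_i k(x_i, z) c_i at the rows of Z."""
    return gaussian_gram(Z, X) @ c

def J(X, Y, c):
    """J(f) = 1/2 ||f||_H^2 + 1/N sum_i 1/2 |f(x_i) - y_i|^2 for f given by coefficients c."""
    K = gaussian_gram(X, X)
    reg = 0.5 * np.sum(c * (K @ c))
    fit = 0.5 * np.mean(np.sum((K @ c - Y) ** 2, axis=1))
    return reg + fit

# Synthetic dataset (illustrative only): N points in R^d with targets in R^m.
rng = np.random.default_rng(0)
N, d, m = 20, 3, 2
X = 0.3 * rng.standard_normal((N, d))
Y = np.column_stack([np.cos(X.sum(axis=1)), 1.0 + 0.5 * np.sin(X[:, 0])])

c_star = fit_minimiser(X, Y)
K = gaussian_gram(X, X)
# Coefficients of R_H(DJ(f*)); they should vanish at the minimiser.
stationarity = c_star + (K @ c_star - Y) / N
```

The array `stationarity` holds the coefficients of $R_H(DJ(f^*))$ in the basis $k(x_i,\cdot)$, so it should be zero up to rounding, and `J(X, Y, c_star)` should not exceed the value of `J` at any other coefficient array.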

## 3. The Reproducing Kernel Hilbert space H

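Let $H$ be a reproducing kernel Hilbert space of functions defined on $\mathbb{R}^d$ with values in $\mathbb{R}^m$. By definition this means that $H$ is continuously embedded in the space of bounded continuous functions $C_b(\mathbb{R}^d;\mathbb{R}^m)$, i.e. there exists $M>0$ such that

$$\|\varphi\|_{C_b(\mathbb{R}^d;\mathbb{R}^m)}\le M\|\varphi\|_H\quad\text{for all }\varphi\in H. \tag{3.1}$$

Let $R_H:H^*\to H$ be the Riesz representation operator and $L_H=R_H^{-1}:H\to H^*$ its inverse.

Let $x\in\mathbb{R}^d$ and $c\in\mathbb{R}^m$; then we may consider the linear functional $\ell_{x,c}$ defined on $C_b(\mathbb{R}^d;\mathbb{R}^m)$ by

$$\langle\ell_{x,c},\varphi\rangle=c\cdot\varphi(x).$$

We have that the dual of $C_b(\mathbb{R}^d;\mathbb{R}^m)$ is continuously embedded in the dual of $H$. Therefore we have $\ell_{x,c}\in H^*$.

In the following we may also use the notation $c^T\delta_x=\ell_{x,c}$. We compute

$$\|c^T\delta_x\|_{H^*}=\sup_{\|\varphi\|_H\le1}\langle c^T\delta_x,\varphi\rangle_{H^*,H}=\sup_{\|\varphi\|_H\le1}\big(c\cdot\varphi(x)\big)\le|c|\sup_{\|\varphi\|_H\le1}|\varphi(x)|\le M|c|. \tag{3.2}$$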

## 4. Preliminary Properties of the Minimisation Functional

In this section $H$ is a reproducing kernel Hilbert space.

###### Lemma 1 (Derivative functional, Uniform convexity and $C^{1,1}$ regularity of $J_i$).

We have $DJ_i:H\to H^*$ defined by

$$DJ_i(f)=L_Hf+(f(x_i)-y_i)^T\delta_{x_i} \tag{4.1}$$

with the following properties

$$\|DJ_i(f_2)-DJ_i(f_1)\|_{H^*}\le(1+M^2)\|f_2-f_1\|_H \tag{4.2}$$

and

$$\langle DJ_i(f_2)-DJ_i(f_1),f_2-f_1\rangle_{H^*,H}\ge\|f_2-f_1\|_H^2 \tag{4.3}$$

for all $f_1,f_2\in H$.

###### Proof.

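Let $f,\varphi\in H$ and $t\in\mathbb{R}$. We compute

$$\begin{aligned}J_i(f+t\varphi)&=\frac{1}{2}\|f+t\varphi\|_H^2+\frac{1}{2}|(f+t\varphi)(x_i)-y_i|^2\\&=\frac{1}{2}\|f\|_H^2+t(f,\varphi)_H+\frac{1}{2}t^2\|\varphi\|_H^2+\frac{1}{2}|f(x_i)-y_i|^2+t(f(x_i)-y_i)\cdot\varphi(x_i)+\frac{1}{2}t^2|\varphi(x_i)|^2\end{aligned}$$

and therefore

$$\frac{d}{dt}J_i(f+t\varphi)=(f,\varphi)_H+t\|\varphi\|_H^2+(f(x_i)-y_i)\cdot\varphi(x_i)+t|\varphi(x_i)|^2.$$

Thus we have

$$\langle DJ_i(f),\varphi\rangle_{H^*,H}=\frac{d}{dt}J_i(f+t\varphi)\Big|_{t=0}=(f,\varphi)_H+(f(x_i)-y_i)\cdot\varphi(x_i)=\langle L_Hf+(f(x_i)-y_i)^T\delta_{x_i},\varphi\rangle_{H^*,H}$$

which proves (4.1).

Now let $f_1,f_2\in H$. We compute

$$DJ_i(f_2)-DJ_i(f_1)=L_H(f_2-f_1)+(f_2(x_i)-f_1(x_i))^T\delta_{x_i}$$

and using (3.1) and (3.2) estimate

$$\begin{aligned}\|DJ_i(f_2)-DJ_i(f_1)\|_{H^*}&=\|L_H(f_2-f_1)+(f_2(x_i)-f_1(x_i))^T\delta_{x_i}\|_{H^*}\\&\le\|L_H(f_2-f_1)\|_{H^*}+\|(f_2(x_i)-f_1(x_i))^T\delta_{x_i}\|_{H^*}\\&\le\|f_2-f_1\|_H+M|f_2(x_i)-f_1(x_i)|\\&\le\|f_2-f_1\|_H+M^2\|f_2-f_1\|_H=(1+M^2)\|f_2-f_1\|_H\end{aligned}$$

which proves (4.2).

For $f_1,f_2\in H$ we estimate

$$\begin{aligned}\langle DJ_i(f_2)-DJ_i(f_1),f_2-f_1\rangle_{H^*,H}&=\langle L_H(f_2-f_1)+(f_2(x_i)-f_1(x_i))^T\delta_{x_i},f_2-f_1\rangle_{H^*,H}\\&=\langle L_H(f_2-f_1),f_2-f_1\rangle_{H^*,H}+\langle(f_2(x_i)-f_1(x_i))^T\delta_{x_i},f_2-f_1\rangle_{H^*,H}\\&=\|f_2-f_1\|_H^2+|f_2(x_i)-f_1(x_i)|^2\ge\|f_2-f_1\|_H^2\end{aligned}$$

which proves (4.3). ∎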

## 5. Existence of Unique Global Minimiser

In this section $H$ is a general Hilbert space, not necessarily a reproducing kernel Hilbert space.

###### Lemma 2.

Let $U:H\to\mathbb{R}$ be a differentiable functional. Assume there exist $\lambda,\Lambda>0$ such that, for all $f_1,f_2\in H$,

$$\|DU(f_2)-DU(f_1)\|_{H^*}\le\Lambda\|f_2-f_1\|_H \tag{5.1}$$

and

$$\langle DU(f_2)-DU(f_1),f_2-f_1\rangle_{H^*,H}\ge\lambda\|f_2-f_1\|_H^2. \tag{5.2}$$

Then $U$ has a unique global minimiser in $H$.

###### Proof.

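Let us define

$$S(f)=f-\eta R_H(DU(f))$$

where $\eta>0$ is to be chosen. We compute

$$\begin{aligned}\|S(f)-S(g)\|_H^2&=\|(f-g)-\eta R_H(DU(f)-DU(g))\|_H^2\\&=\|f-g\|_H^2-2\eta\big(R_H(DU(f)-DU(g)),f-g\big)_H+\eta^2\|R_H(DU(f)-DU(g))\|_H^2\\&=\|f-g\|_H^2-2\eta\langle DU(f)-DU(g),f-g\rangle_{H^*,H}+\eta^2\|DU(f)-DU(g)\|_{H^*}^2\\&\le\|f-g\|_H^2-2\eta\lambda\|f-g\|_H^2+\eta^2\Lambda^2\|f-g\|_H^2\\&=(1-2\eta\lambda+\eta^2\Lambda^2)\|f-g\|_H^2.\end{aligned}$$

By choosing $\eta>0$ small enough we obtain

$$\|S(f)-S(g)\|_H\le\alpha\|f-g\|_H$$

where $\alpha=\sqrt{1-2\eta\lambda+\eta^2\Lambda^2}<1$.

Now from the Banach fixed point theorem we obtain that $S$ has a unique fixed point $f^*$ in $H$. From $S(f^*)=f^*$ it follows that $DU(f^*)=0$.

Now let us show that $f^*$ is the unique global minimiser of $U$.

Let $g\in H$ and for $t\in[0,1]$ we define

$$\gamma(t)=U(f^*+t(g-f^*)).$$

We compute

$$\begin{aligned}U(g)=\gamma(1)&=\gamma(0)+\int_0^1\gamma'(t)\,dt\\&=\gamma(0)+\int_0^1\langle DU(f^*+t(g-f^*)),g-f^*\rangle\,dt\\&=\gamma(0)+\int_0^1\frac{1}{t}\langle DU(f^*+t(g-f^*))-DU(f^*),(f^*+t(g-f^*))-f^*\rangle\,dt\\&\ge\gamma(0)+\int_0^1\frac{1}{t}\lambda\|(f^*+t(g-f^*))-f^*\|_H^2\,dt\\&=\gamma(0)+\lambda\|g-f^*\|_H^2\int_0^1t\,dt=\gamma(0)+\frac{1}{2}\lambda\|g-f^*\|_H^2\\&=U(f^*)+\frac{1}{2}\lambda\|g-f^*\|_H^2.\end{aligned}$$

Thus we have shown that

$$U(g)\ge U(f^*)+\frac{1}{2}\lambda\|g-f^*\|_H^2\quad\text{for all }g\in H$$

and from this it follows that $f^*$ is the unique global minimiser of $U$. ∎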
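The contraction argument in the proof above can be observed numerically. The sketch below is illustrative only and rests on assumptions not made in the paper: $U=J$ with the matrix-valued kernel $k(x,z)\,\mathrm{Id}_m$, $k$ a scalar Gaussian kernel (so that $M=1$, $\lambda=1$, $\Lambda=1+M^2=2$), functions represented as $f=\sum_i k(x_i,\cdot)\,a_i$, and a synthetic dataset. The names `gaussian_gram`, `S` and `h_norm` are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of the (assumed) Gaussian kernel k(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
N, d, m = 15, 2, 1
X = rng.standard_normal((N, d))
Y = np.sin(X[:, :1])
K = gaussian_gram(X, X)

lam, Lam = 1.0, 2.0            # lambda = 1 and Lambda = 1 + M^2 with M = 1 for this kernel
eta = lam / Lam**2             # one admissible step size
alpha = np.sqrt(1.0 - 2.0 * eta * lam + eta**2 * Lam**2)   # contraction factor from the proof

def S(a):
    """S(f) = f - eta R_H(DJ(f)) acting on the coefficients a of f = sum_i k(x_i, .) a_i."""
    return a - eta * (a + (K @ a - Y) / N)

def h_norm(a):
    """H-norm of sum_i k(x_i, .) a_i."""
    return np.sqrt(np.sum(a * (K @ a)))

c_star = np.linalg.solve(K + N * np.eye(N), Y)   # coefficients of the fixed point f*

a = np.zeros((N, m))
ratios = []                    # observed ||S(f) - f*||_H / ||f - f*||_H
for _ in range(25):
    a_next = S(a)
    ratios.append(h_norm(a_next - c_star) / h_norm(a - c_star))
    a = a_next
for _ in range(200):           # keep iterating towards the fixed point
    a = S(a)
```

Each entry of `ratios` should stay below `alpha`, and after the second loop `a` should agree with `c_star` up to rounding, in line with the Banach fixed point argument.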

## 6. Convergence Rate of Stochastic Gradient Descent

In this section $H$ is a general Hilbert space, not necessarily a reproducing kernel Hilbert space.

###### Theorem 2.

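Let $U_i:H\to\mathbb{R}$ for $i=1,\dots,N$ be differentiable functionals. Assume there exist $\lambda,\Lambda>0$ such that

$$\|DU_i(f_2)-DU_i(f_1)\|_{H^*}\le\Lambda\|f_2-f_1\|_H \tag{6.1}$$

and

$$\langle DU_i(f_2)-DU_i(f_1),f_2-f_1\rangle_{H^*,H}\ge\lambda\|f_2-f_1\|_H^2 \tag{6.2}$$

for all $f_1,f_2\in H$ and $i=1,\dots,N$.

Let us define

$$U(f)=\frac{1}{N}\sum_{i=1}^NU_i(f). \tag{6.3}$$

From (6.1), (6.2) and (6.3) it follows that $U$ satisfies (5.1) and (5.2).

Let $F_1\in H$ and let $I_1,I_2,\dots$ be a sequence of independent and identically uniformly distributed random variables taking values in $\{1,\dots,N\}$.

Let us consider the stochastic gradient descent sequence

$$F_{k+1}=F_k-\eta_kR_H(DU_{I_k}(F_k))$$

for $k\in\mathbb{N}$, where $\eta_k>0$.

Let

$$p>1,\quad b=2\left(\frac{\Lambda}{\lambda}\right)^2p\quad\text{and}\quad\eta_k=\frac{p}{\lambda}\,\frac{1}{b+k}\quad\text{for }k\in\mathbb{N};$$

then there exists $C_p>0$ such that

$$E\left[\|F_n-f^*\|_H^2\right]\le C_p\left(E\left[\|F_1-f^*\|_H^2\right]\frac{1}{n^p}+E\left[\|DU_I(f^*)\|_{H^*}^2\right]\frac{1}{n}\right)\quad\text{for all }n\in\mathbb{N};$$

here $I$ denotes a random variable with the same distribution as the $I_k$, and $f^*$ is the unique global minimiser of $U$ as in Lemma 2.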

###### Proof.

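By considering $U_i(f^*+\cdot)$ in place of $U_i$ we might assume that $f^*=0$.

Let us consider the decomposition

$$R_H(DU_{I_k}(F_k))=A_k+B_k$$

where

$$A_k=R_H(DU_{I_k}(F_k))-R_H(DU_{I_k}(0))$$

and

$$B_k=R_H(DU_{I_k}(0)).$$

Using (6.1) we estimate

$$\|A_k\|_H=\|R_H(DU_{I_k}(F_k))-R_H(DU_{I_k}(0))\|_H=\|DU_{I_k}(F_k)-DU_{I_k}(0)\|_{H^*}\le\Lambda\|F_k\|_H \tag{6.4}$$

and using (6.2) we estimate

$$(A_k,F_k)_H=\big(R_H(DU_{I_k}(F_k))-R_H(DU_{I_k}(0)),F_k\big)_H=\langle DU_{I_k}(F_k)-DU_{I_k}(0),F_k\rangle_{H^*,H}\ge\lambda\|F_k\|_H^2. \tag{6.5}$$

Using Young's inequality and (6.4) we estimate

$$(A_k,B_k)_H\le\|A_k\|_H\|B_k\|_H\le\Lambda\|F_k\|_H\|B_k\|_H\le\Lambda\left(\frac{\Lambda}{2}\|F_k\|_H^2+\frac{1}{2\Lambda}\|B_k\|_H^2\right)=\frac{\Lambda^2}{2}\|F_k\|_H^2+\frac{1}{2}\|B_k\|_H^2. \tag{6.6}$$

Using (6.4), (6.5) and (6.6) we estimate

$$\begin{aligned}\|F_{k+1}\|_H^2&=\|F_k-\eta_kR_H(DU_{I_k}(F_k))\|_H^2=\|F_k-\eta_k(A_k+B_k)\|_H^2\\&=\|F_k\|_H^2+\eta_k^2\|A_k\|_H^2+\eta_k^2\|B_k\|_H^2-2\eta_k(F_k,A_k)_H-2\eta_k(F_k,B_k)_H+2\eta_k^2(A_k,B_k)_H\\&\le\|F_k\|_H^2+\eta_k^2\Lambda^2\|F_k\|_H^2+\eta_k^2\|B_k\|_H^2-2\eta_k\lambda\|F_k\|_H^2-2\eta_k(F_k,B_k)_H+2\eta_k^2\left(\frac{\Lambda^2}{2}\|F_k\|_H^2+\frac{1}{2}\|B_k\|_H^2\right)\\&=(1-2\lambda\eta_k+2\Lambda^2\eta_k^2)\|F_k\|_H^2+2\eta_k^2\|B_k\|_H^2-2\eta_k(F_k,B_k)_H.\end{aligned} \tag{6.7}$$

We have that $F_k$ depends on $I_1,\dots,I_{k-1}$ and $B_k$ depends only on $I_k$. Because $I_k$ is independent of $I_1,\dots,I_{k-1}$ we obtain that $B_k$ is independent of $F_k$.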

We compute

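$$E[B_k]=E[R_H(DU_{I_k}(0))]=E[R_H(DU_I(0))]=R_H(DU(0))=0$$

and

$$E[(F_k,B_k)_H]=(E[F_k],E[B_k])_H=(E[F_k],0)_H=0. \tag{6.8}$$

We compute

$$E\left[\|B_k\|_H^2\right]=E\left[\|R_H(DU_{I_k}(0))\|_H^2\right]=E\left[\|DU_{I_k}(0)\|_{H^*}^2\right]=E\left[\|DU_I(0)\|_{H^*}^2\right]. \tag{6.9}$$

Taking the expectation in (6.7) and using (6.8) and (6.9) we obtain

$$E\left[\|F_{k+1}\|_H^2\right]\le(1-2\lambda\eta_k+2\Lambda^2\eta_k^2)E\left[\|F_k\|_H^2\right]+2\eta_k^2E\left[\|DU_I(0)\|_{H^*}^2\right]. \tag{6.10}$$

By our choice of $\eta_k$ we have

$$\eta_k\le\frac{\lambda}{2\Lambda^2}$$

and thus we have

$$1-2\lambda\eta_k+2\Lambda^2\eta_k^2\le1-\lambda\eta_k\le e^{-\lambda\eta_k}. \tag{6.11}$$

From (6.10) and (6.11) we obtain

$$E\left[\|F_{k+1}\|_H^2\right]\le E\left[\|F_k\|_H^2\right]e^{-\lambda\eta_k}+2E\left[\|DU_I(0)\|_{H^*}^2\right]\eta_k^2$$

and by iteration we obtain

$$E\left[\|F_n\|_H^2\right]\le E\left[\|F_1\|_H^2\right]e^{-\lambda\sum_{i=1}^{n-1}\eta_i}+2E\left[\|DU_I(0)\|_{H^*}^2\right]\sum_{k=1}^{n-1}\eta_k^2e^{-\lambda\sum_{i=k+1}^{n-1}\eta_i}. \tag{6.12}$$

By our choice of $\eta_k$ one may see that we have

$$e^{-\lambda\sum_{i=1}^{n-1}\eta_i}\le\left(\frac{b+1}{b+n}\right)^p \tag{6.13}$$

and

$$\sum_{k=1}^{n-1}\eta_k^2e^{-\lambda\sum_{i=k+1}^{n-1}\eta_i}\le\left(\frac{p}{\lambda}\right)^2\left(1+\frac{2}{b}\right)^p\frac{1}{p-1}\,\frac{1}{n+b}. \tag{6.14}$$

From (6.12), (6.13) and (6.14) the result of the Theorem follows. ∎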

## 7. Application in Neural Networks (Proof of Theorem 1)

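In this section $H$ is a reproducing kernel Hilbert space.

###### Lemma 3.

Let $F_1=0$ and for $k\in\mathbb{N}$

$$F_{k+1}=F_k-\eta_kR_H(DJ_{I_k}(F_k)).$$

Then for $n\ge2$, $F_n$ is a linear combination of $\Phi_{x_{I_k},j}$ for $k=1,\dots,n-1$ and $j=1,\dots,m$.

###### Proof.

Let us denote

$$V_k=y_{I_k}-F_k(x_{I_k})$$

and using (4.1) we compute

$$\begin{aligned}F_{k+1}&=F_k-\eta_kR_H(DJ_{I_k}(F_k))=F_k-\eta_kR_H\big(L_HF_k+(F_k(x_{I_k})-y_{I_k})^T\delta_{x_{I_k}}\big)\\&=F_k-\eta_k\Big(F_k+R_H\big((F_k(x_{I_k})-y_{I_k})^T\delta_{x_{I_k}}\big)\Big)\\&=(1-\eta_k)F_k-\eta_kR_H\big((F_k(x_{I_k})-y_{I_k})^T\delta_{x_{I_k}}\big)\\&=(1-\eta_k)F_k+\eta_kR_H\big((y_{I_k}-F_k(x_{I_k}))^T\delta_{x_{I_k}}\big)=(1-\eta_k)F_k+\eta_kR_H\big(V_k^T\delta_{x_{I_k}}\big)\\&=(1-\eta_k)F_k+\eta_kR_H\Big(\sum_{j=1}^mV_{k,j}e_j^T\delta_{x_{I_k}}\Big)=(1-\eta_k)F_k+\eta_k\sum_{j=1}^mV_{k,j}R_H\big(e_j^T\delta_{x_{I_k}}\big)\\&=(1-\eta_k)F_k+\eta_k\sum_{j=1}^mV_{k,j}\Phi_{x_{I_k},j}.\end{aligned} \tag{7.1}$$

Because $F_1=0$, from (7.1) it follows that

$$F_2=\eta_1\sum_{j=1}^mV_{1,j}\Phi_{x_{I_1},j}$$

thus the claim of the lemma holds for $n=2$.

Now from (7.1), using induction, the lemma is proved. ∎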

###### Proof of Theorem 1.

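By Lemma 1 the functionals $U_i=J_i$ satisfy (6.1) and (6.2) with $\lambda=1$ and $\Lambda=1+M^2$, so Theorem 2 applies to the sequence of Lemma 3 with $F_1=0$. Since the bound of Theorem 2 holds in expectation, it holds for at least one realisation $i_1,\dots,i_n$ of the random indices, and by Lemma 3 the corresponding iterate is realised by a network of the stated form. ∎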
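The recursion (7.1) translates directly into an update of second layer weights, which can be sketched as follows. This is a minimal illustration under assumptions not made in the paper: the kernel is $k(x,z)\,\mathrm{Id}_m$ with $k$ a scalar Gaussian kernel (so $\Phi_{x,j}(z)=k(x,z)e_j$, $M=1$, $\lambda=1$, $\Lambda=2$), $p=2$, and the data are synthetic. Weights belonging to a repeated index $I_k$ are merged into one row of `a`. The names `gaussian_gram`, `sgd_network` and `sq_error` are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of the (assumed) Gaussian kernel k(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def sgd_network(K, Y, n_steps, p=2.0, lam=1.0, Lam=2.0, seed=0):
    """Run (7.1) for n_steps steps on the weights a of F_k = sum_i k(x_i, .) a_i."""
    rng = np.random.default_rng(seed)
    N, m = Y.shape
    b = 2.0 * (Lam / lam) ** 2 * p           # b = 2 (Lambda / lambda)^2 p
    a = np.zeros((N, m))                     # F_1 = 0
    for k in range(1, n_steps + 1):
        eta = (p / lam) / (b + k)            # eta_k = (p / lambda) / (b + k)
        i = rng.integers(N)                  # I_k, uniform on the N data indices
        V = Y[i] - K[i] @ a                  # V_k = y_{I_k} - F_k(x_{I_k})
        a *= 1.0 - eta                       # (1 - eta_k) F_k
        a[i] += eta * V                      # + eta_k sum_j V_{k,j} Phi_{x_{I_k}, j}
    return a

# Synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
N, d, m = 20, 3, 2
X = 0.3 * rng.standard_normal((N, d))
Y = np.column_stack([np.cos(X.sum(axis=1)), 1.0 + 0.5 * np.sin(X[:, 0])])
K = gaussian_gram(X, X)
c_star = np.linalg.solve(K + N * np.eye(N), Y)   # weights of f* for this kernel

def sq_error(a):
    """||F - f*||_H^2 for F with weights a."""
    e = a - c_star
    return float(np.sum(e * (K @ e)))

a_one = sgd_network(K, Y, 1)                 # F_2: a single data point is active
err_start = sq_error(np.zeros((N, m)))       # ||F_1 - f*||_H^2 = ||f*||_H^2
err_end = sq_error(sgd_network(K, Y, 20000))
```

Comparing `err_end` with `err_start` for increasing `n_steps` gives an empirical view of the decay stated in Theorem 1; a single run shows one realisation of the random indices, whereas Theorem 2 bounds the expectation.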

## 8. Further Research

We are currently working on using results similar to those in this paper to obtain a satisfactory bound on the generalisation error, and on investigating in which cases a deep neural network realising the function achieves better approximation and generalisation errors.

## References

• [1] Mhaskar, H. N.; Poggio, T., Deep vs. shallow networks: an approximation theory perspective, Anal. Appl. (Singap.) 14 (2016), no. 6, 829–848.
• [2] Poggio, T.; Girosi, F., Regularization algorithms for learning that are equivalent to multilayer networks, Science 247 (1990), no. 4945, 978–982.
• [3] Poggio, T.; Mhaskar, H.; Rosasco, L. et al., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, Int. J. Autom. Comput. (2017) 14: no. 5, 503–519.