On overcoming the Curse of Dimensionality in Neural Networks

09/02/2018 ∙ Karen Yeressian et al. ∙ KTH Royal Institute of Technology

Let H be a reproducing Kernel Hilbert space. For i = 1, …, N, let x_i ∈ R^d and y_i ∈ R^m comprise our dataset. Let f^* ∈ H be the unique global minimiser of the functional J(f) = 1/2 ‖f‖_H^2 + 1/N ∑_{i=1}^N 1/2 ‖f(x_i) − y_i‖^2. In this paper we show that for each n ∈ ℕ there exists a two layer network, where the first layer has nm basis functions Φ_{x_{i_k}, j} for i_1, …, i_n ∈ {1, …, N}, j = 1, …, m, and the second layer takes a weighted summation of the first layer, such that the functions f_n realised by these networks satisfy ‖f_n − f^*‖_H ≤ O(1/√n) for all n ∈ ℕ. Thus the error rate is independent of the input dimension d, the output dimension m and the data size N.


1. Introduction

Regularisation networks were introduced in [2]. As cited in [1] and [3], a function with r derivatives in dimension d can be approximated with various forms of such networks, with the first layer having n components and achieving an error estimate of order n^{−r/d}. Thus in higher dimensions one needs exponentially more neurons to achieve the same error, or one should have very smooth functions to approximate.

In this paper, for the specific minimiser of the regularised empirical loss functional, we prove a dimension independent result with an error estimate of order 1/√n, where n is the number of basis functions in the first layer.

1.1. Structure of this paper

In Section 2 we present the main result of this paper. In Section 3 we outline the main properties of reproducing Kernel Hilbert spaces that we use. In Section 4 we prove preliminary properties of the functionals appearing in our minimisation problem. In Section 5 we prove the existence and uniqueness of the global minimiser of our problem. In Section 6 we obtain the convergence rate of the stochastic gradient descent sequence. In Section 7 we prove that the sequence generated by stochastic gradient descent is realisable by our networks and prove our main result. In Section 8 we outline our current research directions.

2. Main Result

Let H be a reproducing Kernel Hilbert space. Let us denote by H^* the dual space of H. Let R : H^* → H be the Riesz representation operator.

Let x ∈ R^d and j ∈ {1, …, m}, then we may consider the linear functional L_{x,j} ∈ H^* defined by

L_{x,j}(f) = f_j(x) for f ∈ H.

Let us denote

Φ_{x,j} = R L_{x,j},

i.e. Φ_{x,j} ∈ H and

⟨f, Φ_{x,j}⟩_H = L_{x,j}(f) = f_j(x) for all f ∈ H.

Let us note that for each x ∈ R^d and j ∈ {1, …, m}, Φ_{x,j} : R^d → R^m is a vector valued function.

For i = 1, …, N let x_i ∈ R^d and y_i ∈ R^m comprise our dataset.

For f ∈ H we consider the minimisation of the functional

J(f) = 1/2 ‖f‖_H^2 + 1/N ∑_{i=1}^N 1/2 ‖f(x_i) − y_i‖^2,

where 1/2 ‖f‖_H^2 is a regularisation term and 1/2 ‖f(x_i) − y_i‖^2 are the corresponding losses.

Let us also denote

J_i(f) = 1/2 ‖f‖_H^2 + 1/2 ‖f(x_i) − y_i‖^2 for i = 1, …, N.

Clearly we have J(f) = 1/N ∑_{i=1}^N J_i(f).

As we will see, J has a unique global minimiser f^* ∈ H.

The following theorem is our main result.

Theorem 1.

For each n ∈ ℕ there exists a two layer network, where the first layer has nm basis functions Φ_{x_{i_k}, j} for i_1, …, i_n ∈ {1, …, N} and j = 1, …, m and the second layer takes a weighted summation of the first layer, such that the function f_n realised by this network has the governing error rate 1/√n; more precisely, there exists C > 0 such that

‖f_n − f^*‖_H ≤ C/√n for all n ∈ ℕ.
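To make the statement concrete, here is a minimal numerical sketch of a network of the stated form, under assumptions not made in the paper: a diagonal (separable) Gaussian kernel, so that Φ_{x,j}(z) = k(z, x) e_j, and made-up data and weights. The first layer evaluates the nm basis functions at the input and the second layer takes their weighted summation.

```python
import numpy as np

# Illustration only: we assume a diagonal (separable) kernel K(z, x) = k(z, x) * I_m
# with a scalar Gaussian kernel k, so that Phi_{x, j}(z) = k(z, x) * e_j.
# Neither the kernel choice nor the names below come from the paper.

def k(z, x, sigma=1.0):
    """Scalar Gaussian kernel k(z, x)."""
    return np.exp(-np.sum((z - x) ** 2) / (2.0 * sigma ** 2))

def two_layer_network(z, centers, weights, sigma=1.0):
    """Evaluate f_n(z) = sum_{k, j} weights[k, j] * Phi_{centers[k], j}(z) in R^m.

    centers : (n, d) array of selected data points x_{i_1}, ..., x_{i_n}
    weights : (n, m) array of second-layer weights
    """
    # First layer: with a diagonal kernel, the basis function Phi_{x_{i_k}, j}
    # responds to z with k(z, x_{i_k}) in the j-th output coordinate.
    first_layer = np.array([k(z, c, sigma) for c in centers])   # shape (n,)
    # Second layer: weighted summation of the first-layer responses.
    return first_layer @ weights                                # shape (m,)

# Tiny usage example with made-up data (d = 3, m = 2, n = 4).
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))
weights = rng.normal(size=(4, 2))
print(two_layer_network(rng.normal(size=3), centers, weights))
```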

3. The Reproducing Kernel Hilbert space

Let H be a reproducing Kernel Hilbert space of functions defined on R^d with values in R^m. By definition this means that H is continuously embedded in the space of bounded continuous functions C_b(R^d; R^m), i.e. there exists a constant C_H > 0 such that

(3.1) sup_{x ∈ R^d} ‖f(x)‖ ≤ C_H ‖f‖_H for all f ∈ H.

Let R : H^* → H be the Riesz representation operator and R^{-1} : H → H^* its inverse.

Let x ∈ R^d and j ∈ {1, …, m}, then we may consider the linear functional L_{x,j} defined on H by

L_{x,j}(f) = f_j(x) for f ∈ H.

We have that the dual of C_b(R^d; R^m) is continuously embedded in the dual of H. Therefore we have L_{x,j} ∈ H^*.

In the following we may also use the notation Φ_{x,j} = R L_{x,j}. We compute

(3.2) ‖Φ_{x,j}‖_H = ‖R L_{x,j}‖_H = ‖L_{x,j}‖_{H^*} ≤ C_H.
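As a sanity check of (3.1) and (3.2) in a setting where everything is finite dimensional, the following sketch assumes a scalar-valued RKHS (m = 1) given by an explicit bounded feature map φ, with f_w(z) = ⟨w, φ(z)⟩ and ‖f_w‖_H = ‖w‖; the feature map, the constant C_H = 1 and all names are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustration only: a finite-dimensional scalar RKHS given by an explicit,
# bounded feature map phi : R^2 -> R^3, with f_w(z) = <w, phi(z)> and
# ||f_w||_H = ||w||.  The feature map and constant below are made-up choices.

def phi(z):
    return np.array([1.0, np.sin(z[0]), np.cos(z[1])]) / np.sqrt(3.0)  # ||phi(z)|| <= 1

rng = np.random.default_rng(1)
w = rng.normal(size=3)            # coefficients of some f in H
x = rng.normal(size=2)            # an evaluation point

def f(z):
    """Evaluate f(z) = <w, phi(z)>, i.e. the action of L_z on f."""
    return w @ phi(z)

Phi_x = phi(x)                    # Riesz representer of L_x, as a weight vector

# Reproducing property:  <f, Phi_x>_H = f(x).
print(np.isclose(w @ Phi_x, f(x)))

# (3.1) with C_H = 1 (since ||phi(z)|| <= 1):  |f(x)| <= ||f||_H.
print(abs(f(x)) <= np.linalg.norm(w) + 1e-12)

# (3.2):  ||Phi_x||_H = ||phi(x)|| <= C_H = 1.
print(np.linalg.norm(Phi_x) <= 1.0)
```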

4. Preliminary Properties of the Minimisation Functional

In this section H is a reproducing Kernel Hilbert space.

Lemma 1 (Derivative functional, uniform convexity and Lipschitz regularity of DJ_i).

For i = 1, …, N we have DJ_i : H → H defined by

(4.1) DJ_i(f) = f + ∑_{j=1}^m (f_j(x_i) − y_{i,j}) Φ_{x_i, j}

with the following properties

(4.2) ‖DJ_i(f) − DJ_i(g)‖_H ≤ (1 + m C_H^2) ‖f − g‖_H

and

(4.3) ⟨DJ_i(f) − DJ_i(g), f − g⟩_H ≥ ‖f − g‖_H^2

for all f, g ∈ H.

Proof.

Let f, h ∈ H and t ∈ R. We compute

J_i(f + t h) = 1/2 ‖f + t h‖_H^2 + 1/2 ‖f(x_i) + t h(x_i) − y_i‖^2

and therefore

d/dt J_i(f + t h)|_{t=0} = ⟨f, h⟩_H + ∑_{j=1}^m (f_j(x_i) − y_{i,j}) h_j(x_i).

Thus we have

⟨DJ_i(f), h⟩_H = ⟨f, h⟩_H + ∑_{j=1}^m (f_j(x_i) − y_{i,j}) ⟨Φ_{x_i, j}, h⟩_H for all h ∈ H,

which proves (4.1).

Now let f, g ∈ H. We compute

DJ_i(f) − DJ_i(g) = f − g + ∑_{j=1}^m (f_j(x_i) − g_j(x_i)) Φ_{x_i, j}

and using (3.1) and (3.2) estimate

‖DJ_i(f) − DJ_i(g)‖_H ≤ ‖f − g‖_H + ∑_{j=1}^m |f_j(x_i) − g_j(x_i)| ‖Φ_{x_i, j}‖_H ≤ (1 + m C_H^2) ‖f − g‖_H,

which proves (4.2).

For f, g ∈ H we estimate

⟨DJ_i(f) − DJ_i(g), f − g⟩_H = ‖f − g‖_H^2 + ∑_{j=1}^m (f_j(x_i) − g_j(x_i))^2 ≥ ‖f − g‖_H^2,

which proves (4.3). ∎
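In the same finite-dimensional surrogate (scalar output and explicit feature map, both assumptions made only for illustration), J_i(w) = 1/2 ‖w‖^2 + 1/2 (⟨w, φ(x_i)⟩ − y_i)^2 and its gradient is w + (⟨w, φ(x_i)⟩ − y_i) φ(x_i), which mirrors (4.1). The sketch below checks this formula against finite differences and checks the uniform convexity and Lipschitz estimates of Lemma 1 numerically.

```python
import numpy as np

# Illustration only: scalar output (m = 1) and an explicit feature map phi,
# so that J_i(w) = 0.5*||w||^2 + 0.5*(<w, phi(x_i)> - y_i)^2.  These choices
# are assumptions made for this sketch, not part of the paper.

def phi(z):
    return np.array([1.0, np.sin(z[0]), np.cos(z[1])]) / np.sqrt(3.0)

def J_i(w, x_i, y_i):
    return 0.5 * w @ w + 0.5 * (w @ phi(x_i) - y_i) ** 2

def DJ_i(w, x_i, y_i):
    # Gradient: w plus the residual times the representer phi(x_i), cf. (4.1).
    return w + (w @ phi(x_i) - y_i) * phi(x_i)

rng = np.random.default_rng(2)
x_i, y_i = rng.normal(size=2), rng.normal()
w, v = rng.normal(size=3), rng.normal(size=3)

# Finite-difference check of the gradient formula.
eps = 1e-6
fd = np.array([(J_i(w + eps * e, x_i, y_i) - J_i(w - eps * e, x_i, y_i)) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(fd, DJ_i(w, x_i, y_i), atol=1e-5))

# Uniform convexity: <DJ_i(w) - DJ_i(v), w - v> >= ||w - v||^2.
gap = (DJ_i(w, x_i, y_i) - DJ_i(v, x_i, y_i)) @ (w - v)
print(gap >= (w - v) @ (w - v) - 1e-12)

# Lipschitz gradient: ||DJ_i(w) - DJ_i(v)|| <= (1 + sup_z ||phi(z)||^2) ||w - v||.
lhs = np.linalg.norm(DJ_i(w, x_i, y_i) - DJ_i(v, x_i, y_i))
print(lhs <= 2.0 * np.linalg.norm(w - v) + 1e-12)
```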

5. Existence of Unique Global Minimiser

In this section H is a general Hilbert space, not necessarily a reproducing Kernel Hilbert space.

Lemma 2.

Let J : H → R be a differentiable functional. Assume there exist constants 0 < μ ≤ L such that

(5.1) ⟨DJ(f) − DJ(g), f − g⟩_H ≥ μ ‖f − g‖_H^2

and

(5.2) ‖DJ(f) − DJ(g)‖_H ≤ L ‖f − g‖_H

for all f, g ∈ H. Then J has a unique global minimiser f^* in H.

Proof.

Let us define

T(f) = f − ρ DJ(f) for f ∈ H,

where ρ > 0 is to be chosen. We compute

‖T(f) − T(g)‖_H^2 = ‖f − g‖_H^2 − 2ρ ⟨DJ(f) − DJ(g), f − g⟩_H + ρ^2 ‖DJ(f) − DJ(g)‖_H^2 ≤ (1 − 2ρμ + ρ^2 L^2) ‖f − g‖_H^2.

By choosing ρ small enough we obtain

‖T(f) − T(g)‖_H ≤ θ ‖f − g‖_H for all f, g ∈ H,

where θ = (1 − 2ρμ + ρ^2 L^2)^{1/2} < 1.

Now from the Banach Fixed Point theorem we obtain that T has a unique fixed point f^* in H. From T(f^*) = f^* it follows that DJ(f^*) = 0.

Now let us show that f^* is the unique global minimiser of J.

Let f ∈ H with f ≠ f^* and for t ∈ [0, 1] we define

g(t) = J(f^* + t (f − f^*)).

We compute

g'(t) = ⟨DJ(f^* + t (f − f^*)), f − f^*⟩_H = ⟨DJ(f^* + t (f − f^*)) − DJ(f^*), f − f^*⟩_H ≥ t μ ‖f − f^*‖_H^2.

Thus we have shown that

J(f) − J(f^*) = ∫_0^1 g'(t) dt ≥ (μ/2) ‖f − f^*‖_H^2 > 0

and from this it follows that f^* is the unique global minimiser of J. ∎
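The fixed-point construction in the proof can be observed concretely in the finite-dimensional surrogate used above (explicit feature map, scalar output; both illustrative assumptions). For a small step ρ the map T(w) = w − ρ DJ(w) is a contraction, and iterating it converges to the minimiser, which here can also be obtained by a direct linear solve.

```python
import numpy as np

# Illustration only: continue the finite-dimensional surrogate (scalar output,
# explicit feature map).  The map T(w) = w - rho * DJ(w) is iterated to its
# unique fixed point, which is the global minimiser of J.

def phi(z):
    return np.array([1.0, np.sin(z[0]), np.cos(z[1])]) / np.sqrt(3.0)

rng = np.random.default_rng(3)
N = 50
X = rng.normal(size=(N, 2))
y = rng.normal(size=N)
Phi = np.array([phi(x) for x in X])                  # (N, p) feature matrix

def DJ(w):
    # Gradient of J(w) = 0.5*||w||^2 + (1/N) * sum_i 0.5*(<w, phi(x_i)> - y_i)^2.
    return w + Phi.T @ (Phi @ w - y) / N

rho = 0.5                                            # small enough step
w = np.zeros(3)
for _ in range(200):
    w = w - rho * DJ(w)                              # w <- T(w)

# Direct solve of DJ(w*) = 0:  (I + Phi^T Phi / N) w* = Phi^T y / N.
w_star = np.linalg.solve(np.eye(3) + Phi.T @ Phi / N, Phi.T @ y / N)
print(np.allclose(w, w_star, atol=1e-8))
```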

6. Stochastic Gradient Descent

In this section H is a general Hilbert space, not necessarily a reproducing Kernel Hilbert space.

Theorem 2.

Let J_i : H → R for i = 1, …, N be differentiable functionals. Assume there exist constants 0 < μ ≤ L such that

(6.1) ⟨DJ_i(f) − DJ_i(g), f − g⟩_H ≥ μ ‖f − g‖_H^2

and

(6.2) ‖DJ_i(f) − DJ_i(g)‖_H ≤ L ‖f − g‖_H

for all f, g ∈ H and i = 1, …, N.

Let us define

(6.3) J(f) = 1/N ∑_{i=1}^N J_i(f).

From (6.1), (6.2) and (6.3) it follows that J satisfies (5.1) and (5.2).

Let f_0 ∈ H and let (i_k)_{k ≥ 1} be a sequence of independent and identically uniformly distributed random variables taking values in {1, …, N}.

Let us consider the stochastic gradient descent sequence

f_{k+1} = f_k − η_k DJ_{i_{k+1}}(f_k)

for k = 0, 1, 2, …, where η_k > 0 denotes the step size.

Let the step sizes η_k be suitably chosen, decaying in k; then there exists C > 0 such that

E[‖f_k − f^*‖_H^2] ≤ C/(k+1) for all k ∈ ℕ,

here f^* is the unique global minimiser of J as in Lemma 2.

Proof.

By considering the functionals f ↦ J_i(f + f^*) in place of J_i we may assume that f^* = 0.

Let us consider the decomposition

where

and

Using (6.1) we estimate

(6.4)

and using (6.2) we estimate

(6.5)

Using Young’s inequality and (6.4) we estimate

(6.6)

Using (6.4), (6.5) and (6.6) we estimate

(6.7)

We have that f_k depends on i_1, …, i_k and DJ_{i_{k+1}} depends only on i_{k+1}. Because i_{k+1} is independent of (i_1, …, i_k) we obtain that DJ_{i_{k+1}} is independent of f_k.

We compute

and

(6.8)

We compute

(6.9)

Taking the expectation in (6.7) and using (6.8) and (6.9) we obtain

(6.10)

By our choice of η_k we have

and thus we have

(6.11)

From (6.10) and (6.11) we obtain

and by iteration we obtain

(6.12)

By our choice of η_k one may see that we have

(6.13)

and

(6.14)

From (6.12), (6.13) and (6.14) the result of the Theorem follows. ∎
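A numerical sketch of the rate in Theorem 2, again in the finite-dimensional surrogate and with the step-size schedule η_k = 1/(k+1) chosen only for this illustration (the paper's schedule is not reproduced here): averaging over independent runs, k · E‖f_k − f^*‖^2 should stay roughly bounded as k grows, i.e. the error in norm decays like 1/√k.

```python
import numpy as np

# Illustration only: stochastic gradient descent on
#   J(w) = (1/N) * sum_i J_i(w),  J_i(w) = 0.5*||w||^2 + 0.5*(<w, phi(x_i)> - y_i)^2,
# in the finite-dimensional surrogate used above.  The step sizes eta_k = 1/(k+1)
# are an assumption made for this sketch.

def phi(z):
    return np.array([1.0, np.sin(z[0]), np.cos(z[1])]) / np.sqrt(3.0)

rng = np.random.default_rng(4)
N = 50
X = rng.normal(size=(N, 2))
y = rng.normal(size=N)
Phi = np.array([phi(x) for x in X])

# Exact minimiser for reference.
w_star = np.linalg.solve(np.eye(3) + Phi.T @ Phi / N, Phi.T @ y / N)

def sgd(n_steps, rng):
    w = np.zeros(3)
    for k in range(n_steps):
        i = rng.integers(N)                          # i_{k+1} uniform on {1, ..., N}
        grad = w + (Phi[i] @ w - y[i]) * Phi[i]      # DJ_{i_{k+1}}(w)
        w = w - grad / (k + 1)                       # eta_k = 1/(k+1)
    return w

for n in (10, 100, 1000):
    errs = [np.sum((sgd(n, rng) - w_star) ** 2) for _ in range(200)]
    print(n, n * np.mean(errs))                      # roughly bounded in n
```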

7. Application in Neural Networks
(Proof of Theorem 1)

In this section H is a reproducing Kernel Hilbert space.

Lemma 3.

Let f_0 = 0 and for k = 0, 1, 2, … let

f_{k+1} = f_k − η_k DJ_{i_{k+1}}(f_k)

be the stochastic gradient descent sequence of Theorem 2, with J_i as in Section 2. Then for n ∈ ℕ, f_n is a linear combination of Φ_{x_{i_k}, j} for k = 1, …, n and j = 1, …, m.

Proof.

Let us denote by (f_k)_j the j-th component of the function f_k and using (4.1) we compute

(7.1) f_{k+1} = (1 − η_k) f_k − η_k ∑_{j=1}^m ((f_k)_j(x_{i_{k+1}}) − (y_{i_{k+1}})_j) Φ_{x_{i_{k+1}}, j}.

Because f_0 = 0, from (7.1) it follows that

f_1 = η_0 ∑_{j=1}^m (y_{i_1})_j Φ_{x_{i_1}, j},

thus the claim of the lemma holds for n = 1.

Now from (7.1), using induction the lemma is proved. ∎

Proof of Theorem 1.

This follows from Theorem 2 and Lemma 3: by Theorem 2 there is a realisation of the stochastic gradient descent sequence with ‖f_n − f^*‖_H ≤ C/√n, and by Lemma 3 each such f_n is a linear combination of the nm basis functions Φ_{x_{i_k}, j}, i.e. it is realised by a two layer network of the stated form. ∎
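The construction behind Theorem 1 can be sketched directly: starting from f_0 = 0 and storing the iterate as second-layer weights over the basis functions Φ_{x_{i_k}, j}, each stochastic gradient step rescales the existing weights and appends one new group of m weights, so after n steps the iterate is a two layer network with at most nm basis functions in the first layer. The kernel (diagonal Gaussian), the step sizes and the data below are assumptions made only for this illustration.

```python
import numpy as np

# Illustration only: kernel SGD whose iterate is stored as a two-layer network
#   f_k(z) = sum_{l < k} sum_{j=1}^m W[l, j] * Phi_{centers[l], j}(z),
# assuming a diagonal Gaussian kernel so that Phi_{x, j}(z) = k(z, x) * e_j.
# Kernel, step sizes and data are assumptions made for this sketch.

def k(z, x, sigma=1.0):
    return np.exp(-np.sum((z - x) ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(5)
N, d, m = 30, 3, 2
X = rng.normal(size=(N, d))
Y = rng.normal(size=(N, m))

centers, W = [], []                     # first-layer centres and second-layer weights

def f(z):
    """Evaluate the current iterate f_k(z) in R^m."""
    if not centers:
        return np.zeros(m)
    K = k(z, np.array(centers))         # kernel responses of the first layer
    return K @ np.array(W)              # weighted summation (second layer)

n_steps = 200
for step in range(n_steps):
    eta = 1.0 / (step + 1)              # assumed step-size schedule
    i = rng.integers(N)                 # i_{k+1} uniform on {1, ..., N}
    residual = f(X[i]) - Y[i]           # f_k(x_{i_{k+1}}) - y_{i_{k+1}} in R^m
    # f_{k+1} = (1 - eta) f_k - eta * sum_j residual_j * Phi_{x_{i_{k+1}}, j}, cf. (7.1):
    W = [(1.0 - eta) * w for w in W]    # rescale existing second-layer weights
    centers.append(X[i])
    W.append(-eta * residual)           # one new group of m weights

# After n steps the iterate uses at most n*m basis functions, as in Theorem 1.
print(len(centers) * m, "basis functions;  f_n(x_1) =", f(X[0]))
```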

8. Further Research

We are currently working on using results similar to those in this paper to obtain a satisfactory bound on the generalisation error, and on investigating in which cases a deep neural network realising the function achieves better approximation and generalisation errors.

References

  • [1] Mhaskar, H. N.; Poggio, T., Deep vs. shallow networks: an approximation theory perspective, Anal. Appl. (Singap.) 14 (2016), no. 6, 829–848.
  • [2] Poggio, T.; Girosi, F., Regularization algorithms for learning that are equivalent to multilayer networks, Science 247 (1990), no. 4945, 978–982.
  • [3] Poggio, T.; Mhaskar, H.; Rosasco, L. et al., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, Int. J. Autom. Comput. 14 (2017), no. 5, 503–519. https://doi.org/10.1007/s11633-017-1054-2