# Some negative results for Neural Networks

We demonstrate some negative results for approximation of functions with neural networks.


## 1. Introduction

The standard model of single hidden layer feedforward neural networks leads to the problem of approximation of arbitrary functions by elements of the set

$$\Sigma^n_{\sigma,d}=\left\{\sum_{k=1}^{n}c_k\,\sigma(w_k\cdot x-b_k):\ w_k\in\mathbb{R}^d,\ c_k,b_k\in\mathbb{R}\right\},$$

where $\sigma:\mathbb{R}\to\mathbb{R}$ is the given activation function of the net and $w_k\cdot x$ is the dot product of the vectors $w_k$ and $x$.

It is well known that $\Sigma_{\sigma,d}=\bigcup_{n\in\mathbb{N}}\Sigma^n_{\sigma,d}$ is dense in $C(\mathbb{R}^d)$ (for the topology of uniform convergence on compact subsets of $\mathbb{R}^d$) if and only if $\sigma$ is not a polynomial (see, e.g., [9, Theorem 1] and [8] for a different density result). In particular, this means that feedforward networks with a nonpolynomial activation function can approximate any continuous function and, thus, are good for any learning objective in the sense that, given a precision $\varepsilon>0$, there exists $n\in\mathbb{N}$ with the property that the associated feedforward neural network with one hidden layer and $n$ units can be (in principle) trained to approximate the associated learning function $f$ with uniform accuracy smaller than $\varepsilon$. In other words, we know that for any nonpolynomial activation function $\sigma$, any compact set $K\subset\mathbb{R}^d$ and any $\varepsilon>0$,

$$E(f,\Sigma^n_{\sigma,d})_{C(K)}=\inf_{g\in\Sigma^n_{\sigma,d}}\|f-g\|_{C(K)}\le\varepsilon$$

for $n$ large enough. This is of course good news for people working with neural networks. On the other hand, several examples of highly oscillating functions, such as those studied in [14], [16], have shown that sometimes it is necessary to increase dramatically the number of units (and layers) of a neural network if one wants to approximate certain functions. But these examples do not contain, by themselves, any information about the decay of the best approximation errors when the size of the neural network goes to infinity, that is, when one considers the full sequence of best approximation errors. In this paper we demonstrate a negative result which, in philosophical terms, claims the existence of learning functions which are as difficult to approximate with neural networks with one hidden layer as one may want, and we also demonstrate an analogous result for neural networks with an arbitrary number of layers for some special types of activation functions $\sigma$. Concretely, in Section 2 we demonstrate that for any nonpolynomial activation function $\sigma$, for any natural number $d\ge 2$, for any compact set $K\subset\mathbb{R}^d$ with nonempty interior and for any given decreasing sequence of positive real numbers $\{\varepsilon_n\}$ which converges to zero, there exists a continuous function $f\in C(K)$ such that

$$E(f,\Sigma^n_{\sigma,d})_{C(K)}\ge\varepsilon_n\quad\text{for all } n=1,2,\dots$$

We also demonstrate the same type of result for the $L^q(K)$ norms, $1\le q<\infty$. The proofs of these theorems are based on the combination of a general negative result in approximation theory demonstrated by Almira and Oikhberg in 2012 [1] (see also [2]) with the estimates of the deviations of Sobolev classes from ridge functions proved by Gordon, Maiorov, Meyer and Reisner in 2002 [5]. It is important to stress the fact that our result requires the use of functions of several variables and can't be applied to univariate functions. This is natural not only because of the method of proof we use, which is based on a negative result for approximation by ridge functions that holds true only for $d\ge 2$, but also because quite recently Guliyev and Ismailov [6, Theorems 4.1 and 4.2] have demonstrated that for the case $d=1$ a general negative result for approximation by single hidden layer feedforward neural networks is impossible. In particular, they have explicitly constructed an infinitely differentiable sigmoidal activation function $\sigma$ such that the corresponding networks with a fixed number of hidden units form a dense subset of $C(\mathbb{R})$ with the topology of uniform convergence on compact subsets of $\mathbb{R}$.

On the other hand, as a consequence of Kolmogorov's Superposition Theorem [4, Chapter 17, Theorem 1.1], a general negative result for approximation by several hidden layer feedforward neural networks is impossible not only for univariate ($d=1$) but also for arbitrary multivariate ($d\ge 2$) functions (see [10] for a proof of this claim and [7, 11, 12] for other interesting related results). Nevertheless, in Section 3 we demonstrate several negative results for specific choices of the activation function and for arbitrary $d$. Concretely, we prove that if $\sigma$ is either a rational (nonpolynomial) function or a linear spline with finitely many pieces (which is nonpolynomial too), then for any pair of sequences of natural numbers $\{r_k\}$, $\{n_k\}$, any non-increasing sequence $\{\varepsilon_k\}$ which converges to $0$, and appropriate compact subsets $K$ of $\mathbb{R}^d$, there exists a function $f\in C(K)$ such that

$$E(f,\tau^{r_k,n_k}_{\sigma,d})_{C(K)}\ge\varepsilon_k\quad\text{for all } k=1,2,\dots,$$

where $\tau^{r,n}_{\sigma,d}$ denotes the set of functions defined on $\mathbb{R}^d$ by a neural network with activation function $\sigma$ and at most $r$ layers and $n$ units in each layer. It is important to stress the fact that the results of this section apply for all values of $d$ and, moreover, they also apply to neural networks with the activation functions

$$\sigma(t)=\mathrm{ReLU}(t)=\begin{cases}0 & t<0\\ t & t\ge 0\end{cases}$$

and

$$\sigma(t)=\mathrm{Hard\,Tanh}(t)=\begin{cases}-1 & t<-1\\ t & -1\le t\le 1\\ 1 & t>1,\end{cases}$$

which are two of the most widely used activation functions among people working with neural networks.
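The two activation functions above can be written down directly; the following Python sketch (our own illustration, not part of the paper) makes explicit that each is a linear spline with two and three pieces, respectively.

```python
# Sketch: the two activation functions of Section 3, written explicitly
# as linear splines with finitely many pieces.

def relu(t: float) -> float:
    # Linear spline with two pieces: 0 on (-inf, 0), t on [0, inf)
    return 0.0 if t < 0 else t

def hard_tanh(t: float) -> float:
    # Linear spline with three pieces: -1, then t, then 1
    if t < -1:
        return -1.0
    if t > 1:
        return 1.0
    return t
```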

## 2. A general negative result for feedforward neural networks with one hidden layer

Let us start by introducing some notation and terminology from Approximation Theory. Given a Banach space $X$, we say that $(X,\{A_n\}_{n\in\mathbb{N}})$ is an approximation scheme (or that $\{A_n\}$ is an approximation scheme in $X$) if the following properties hold:

• $A_1\subseteq A_2\subseteq\cdots$ is a nested sequence of subsets of $X$ (with strict inclusions).

• There exists a map $K:\mathbb{N}\to\mathbb{N}$ (the so-called jump function) such that $K(n)\ge n$ and $A_n+A_n\subseteq A_{K(n)}$ for all $n\in\mathbb{N}$.

• $\lambda A_n\subseteq A_n$ for all $n\in\mathbb{N}$ and all scalars $\lambda$.

• $\bigcup_{n\in\mathbb{N}}A_n$ is dense in $X$.

It is known that for any activation function $\sigma$ and any infinite compact set $K\subset\mathbb{R}^d$, $(C(K),\{\Sigma^n_{\sigma,d}\})$ is an approximation scheme, with jump function $K(n)=2n$, as soon as $\sigma$ is not a polynomial, since the sets $\Sigma^n_{\sigma,d}$ always satisfy $\Sigma^n_{\sigma,d}+\Sigma^n_{\sigma,d}\subseteq\Sigma^{2n}_{\sigma,d}$ and $\lambda\Sigma^n_{\sigma,d}\subseteq\Sigma^n_{\sigma,d}$ and, given an infinite compact set $K$, they also satisfy the density property if (and only if) $\sigma$ is not a polynomial (see [9]). In fact, if $\sigma$ is not a polynomial, then $(L^q(K),\{\Sigma^n_{\sigma,d}\})$ is an approximation scheme for every compact set $K$ of positive measure and all $1\le q<\infty$.
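The jump relation $\Sigma^n_{\sigma,d}+\Sigma^n_{\sigma,d}\subseteq\Sigma^{2n}_{\sigma,d}$ is simply concatenation of hidden units; a minimal Python sketch (our own illustration, with $\tanh$ standing in for a generic nonpolynomial $\sigma$):

```python
import math

# Our own illustration of the jump relation: the sum of two n-unit
# single-hidden-layer networks is a 2n-unit network, obtained by
# concatenating the lists of hidden units.

def eval_net(x, units):
    # units: list of (c_k, w_k, b_k); network value sum_k c_k*sigma(w_k.x - b_k)
    return sum(c * math.tanh(sum(wi * xi for wi, xi in zip(w, x)) - b)
               for c, w, b in units)

net1 = [(1.0, [1.0, 2.0], 0.5), (-0.5, [0.0, 1.0], -1.0)]   # n = 2 units
net2 = [(2.0, [1.0, -1.0], 0.0), (0.3, [2.0, 0.5], 1.0)]    # n = 2 units

summed = net1 + net2   # a 2n = 4 unit network
x = [0.2, -0.4]
assert abs(eval_net(x, summed) - (eval_net(x, net1) + eval_net(x, net2))) < 1e-12
```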

Another important example of approximation scheme, again with jump function $K(n)=2n$, is given by $(X,\{R^d_n\})$, where $X=C(K)$ or $X=L^q(K)$ (with $1\le q<\infty$) for some compact set $K\subset\mathbb{R}^d$ with positive Lebesgue measure, and, for $n\in\mathbb{N}$,

$$R^d_n=\left\{\sum_{i=1}^{n}g_i(a^i\cdot x):\ a^i\in\mathbb{S}^{d-1},\ g_i\in C(\mathbb{R}),\ i=1,\dots,n\right\},$$

where $\mathbb{S}^{d-1}$ denotes the unit sphere of $\mathbb{R}^d$. The elements of $R^d_n$ are called ridge functions (in $d$ variables). The density of ridge functions in these spaces is a well known fact (see [13]).

We prove, for the sake of completeness, the following result:

###### Proposition 1.
$$\Sigma^n_{\sigma,d}\subset R^d_n.$$

Proof. Let $\phi(x)=\sum_{k=1}^{n}c_k\sigma(w_k\cdot x-b_k)$ be any element of $\Sigma^n_{\sigma,d}$ and let us set $a^k=w_k/\|w_k\|$ and $g_k(t)=c_k\sigma(\|w_k\|t-b_k)$ (if $w_k=0$ we may take $a^k\in\mathbb{S}^{d-1}$ arbitrary and $g_k(t)=c_k\sigma(-b_k)$, a constant). Then $a^k\in\mathbb{S}^{d-1}$ and $g_k\in C(\mathbb{R})$ for $k=1,\dots,n$. Moreover,

$$\sum_{k=1}^{n}g_k(a^k\cdot x)=\sum_{k=1}^{n}c_k\sigma(\|w_k\|\,a^k\cdot x-b_k)=\sum_{k=1}^{n}c_k\sigma(w_k\cdot x-b_k)=\phi(x),$$

which means that $\phi\in R^d_n$.
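The rewriting in the proof can be checked numerically. The following Python sketch is our own illustration ($\tanh$ stands in for an arbitrary nonpolynomial $\sigma$): it evaluates a random element of $\Sigma^n_{\sigma,d}$ and its ridge-function form at a test point.

```python
import math
import random

# Illustration of Proposition 1: a hidden unit c*sigma(w.x - b) equals the
# ridge function g(a.x) with a = w/||w|| and g(t) = c*sigma(||w|| t - b).

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

random.seed(0)
d, n = 3, 4
w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

def phi(x):  # an element of Sigma_{sigma,d}^n
    return sum(c[k] * math.tanh(dot(w[k], x) - b[k]) for k in range(n))

def ridge(x):  # the same function written as a sum of n ridge functions
    total = 0.0
    for k in range(n):
        norm = math.sqrt(dot(w[k], w[k]))
        a = [wi / norm for wi in w[k]]                   # a^k in S^{d-1}
        g = lambda t: c[k] * math.tanh(norm * t - b[k])  # g_k in C(R)
        total += g(dot(a, x))
    return total

x = [0.3, -1.2, 0.7]
assert abs(phi(x) - ridge(x)) < 1e-12
```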

In [1, Theorems 2.2 and 3.4] (see also [2, Theorem 1.1]) the following result was proved:

###### Theorem 2 (Almira and Oikhberg, 2012).

Given an approximation scheme $(X,\{A_n\})$ with jump function $K(n)$, the following claims are equivalent:

• (a) For every non-increasing sequence $\{\varepsilon_n\}\searrow 0$ there exists an element $x\in X$ such that

$$E(x,A_n)=\inf_{a_n\in A_n}\|x-a_n\|_X\ge\varepsilon_n\quad\text{for all } n\in\mathbb{N}.$$

• (b) There exist a constant $c>0$ and an infinite set $J\subseteq\mathbb{N}$ such that, for all $n\in J$, there exists $x_n\in X$ with $E(x_n,A_{K(n)})>0$ such that

$$E(x_n,A_n)\le c\,E(x_n,A_{K(n)}).$$

###### Remark 3.

If $(X,\{A_n\})$ satisfies (a) or (b) of Theorem 2 we say that the approximation scheme $(X,\{A_n\})$ satisfies Shapiro's theorem.

Let us state the main result of this section:

###### Theorem 4.

Let $\sigma$ be any nonpolynomial activation function, let $d\ge 2$ be a natural number and let $K\subset\mathbb{R}^d$ be a compact set with nonempty interior. For any non-increasing sequence of real numbers $\{\varepsilon_n\}$ such that $\lim_{n\to\infty}\varepsilon_n=0$, there exists a learning function $f\in C(K)$ such that

$$E(f,\Sigma^n_{\sigma,d})_{C(K)}\ge E(f,R^d_n)_{C(K)}\ge\varepsilon_n\quad\text{for all } n=1,2,\dots$$

Proof. It is enough to demonstrate that the approximation scheme $(C(K),\{R^d_n\})$ satisfies condition (b) of Theorem 2, since Proposition 1 implies that

$$E(f,\Sigma^n_{\sigma,d})_{C(K)}\ge E(f,R^d_n)_{C(K)}$$

for all $f\in C(K)$ and all $n\in\mathbb{N}$. In order to do this, we need to use the following result (see [5, Theorem 1.1]):

###### Theorem 5 (Gordon, Maiorov, Meyer and Reisner, 2002).

Let $d\ge 2$, let $r>0$ and let $1\le q\le p\le\infty$. Then the asymptotic relation

$$\operatorname{dist}\big(B(W^{r,d}_p),R^d_n,L^q\big)\asymp n^{-\frac{r}{d-1}}$$

holds.

Here, $B(W^{r,d}_p)$ denotes the unit ball of the Sobolev–Slobodeckii class $W^{r,d}_p$ of functions defined on the unit ball $B^d$ of $\mathbb{R}^d$ (but the result holds true, with the same asymptotics, for all balls of $\mathbb{R}^d$) and

$$\operatorname{dist}\big(B(W^{r,d}_p),R^d_n,L^q\big)=\sup_{f\in B(W^{r,d}_p)}E(f,R^d_n)_{L^q(B^d)}.$$

We apply Theorem 5 to functions defined on a ball $B^d(t)$ of $\mathbb{R}^d$ of radius $t>0$. Thus, given $r$, $p$ and $q$ as above, there exist two positive constants $c_0,c_1$ such that:

• For any $f\in B(W^{r,d}_p)$,

$$E(f,R^d_n)_{L^q(B^d(t))}\le c_1\,n^{-\frac{r}{d-1}}.$$

• For any $m\in\mathbb{N}$ there exists $f_m\in B(W^{r,d}_p)$ such that

$$E(f_m,R^d_m)_{L^q(B^d(t))}\ge c_0\,m^{-\frac{r}{d-1}}.$$

Combining these inequalities for $m=2n$, $n\in\mathbb{N}$, we have that

$$E(f_{2n},R^d_{2n})_{L^q(B^d(t))}\ \ge\ c_0(2n)^{-\frac{r}{d-1}}=c_0\,2^{-\frac{r}{d-1}}n^{-\frac{r}{d-1}}\ \ge\ 2^{-\frac{r}{d-1}}\frac{c_0}{c_1}\,E(f_{2n},R^d_n)_{L^q(B^d(t))}.$$

Hence

$$E(f_{2n},R^d_n)_{L^q(B^d(t))}\le 2^{\frac{r}{d-1}}\frac{c_1}{c_0}\,E(f_{2n},R^d_{2n})_{L^q(B^d(t))}\quad\text{for all } n\in\mathbb{N},$$

and $(L^q(B^d(t)),\{R^d_n\})$ satisfies condition (b) of Theorem 2 with $K(n)=2n$, which implies that, for any non-increasing sequence of positive numbers $\{\varepsilon_n\}\searrow 0$, there exists a function $f\in L^q(B^d(t))$ such that $E(f,R^d_n)_{L^q(B^d(t))}\ge\varepsilon_n$ for all $n\in\mathbb{N}$.

Taking a ball $B^d(t)\subseteq K$ and any extension $\overline{f}$ of $f$ to the compact set $K$, we also have that

$$E(\overline{f},R^d_n)_{L^q(K)}\ge\varepsilon_n\quad\text{for all } n\in\mathbb{N}.$$

Moreover, in the case $q=\infty$, we can use Sobolev's embedding theorem to guarantee that, for $r$ big enough, the Sobolev class $W^{r,d}_p$ is formed by continuous functions, which implies that the inequalities

$$E(f_{2n},R^d_n)_{L^\infty(B^d(t))}\le 2^{\frac{r}{d-1}}\frac{c_1}{c_0}\,E(f_{2n},R^d_{2n})_{L^\infty(B^d(t))}\quad\text{for all } n\in\mathbb{N}$$

can be stated for continuous functions:

$$E(f_{2n},R^d_n)_{C(B^d(t))}\le 2^{\frac{r}{d-1}}\frac{c_1}{c_0}\,E(f_{2n},R^d_{2n})_{C(B^d(t))}\quad\text{for all } n\in\mathbb{N},$$

and we can apply Theorem 2 to the approximation scheme $(C(B^d(t)),\{R^d_n\})$ and, consequently, to the approximation scheme $(C(K),\{R^d_n\})$. This ends the proof.

## 3. Negative results for feedforward neural networks with several hidden layers

In [10, Theorem 4 and the comments below it] it was proved that, for a proper choice of the activation function $\sigma$, which may be chosen real analytic and of sigmoidal type, a feedforward neural network with two layers and a fixed finite number of units in each layer is enough for uniform approximation of arbitrary continuous functions on any compact set $K\subset\mathbb{R}^d$. Concretely, the result was demonstrated for neural networks with a fixed number of units in the first layer and a fixed number of units in the second layer, both depending only on $d$. Moreover, relaxing the restrictions on $\sigma$, the same result can be obtained for a two layer neural network with a different fixed number of units in each layer. Finally, Guliyev and Ismailov have recently exhibited an algorithmically constructed two hidden layer feedforward neural network with fixed weights which has a fixed total number of units (depending only on $d$) and uniformly approximates arbitrary continuous functions on compact subsets of $\mathbb{R}^d$ (see [7]).

In this section we prove that, for certain natural choices of $\sigma$, a negative result holds true for neural networks with any number of layers and units in each layer. Concretely, we demonstrate a negative result for multivariate uniform approximation by rational functions and by spline functions with free partitions and, as a consequence, we get a negative result for approximation by feedforward neural networks with an arbitrary number of layers when the activation function is either a rational (nonpolynomial) function or a linear spline with finitely many pieces (e.g., the well known ReLU and Hard Tanh functions are linear spline functions with two and three pieces, respectively). Note that the use of rational and/or spline approximation tools in connection with the study of these classical neural networks is natural and has been considered by several other authors (see, e.g., [14, 15, 16, 17, 18]).

In [1, Sections 6.3 and 6.4] the one dimensional cases of rational approximation and of spline approximation with free partitions were studied, so that we can state the following two results:

###### Theorem 6.

Let $R^1_n(I)$ denote the set of rational functions $p/q$ with $\max\{\deg p,\deg q\}\le n$ and such that $q$ does not vanish on $I$, and assume that $I=[a,b]$ with $a<b$. Let $\{n_i\}$ be an increasing sequence of natural numbers. Let $X=C(I)$ and $A_i=R^1_{n_i}(I)$, $i\in\mathbb{N}$. Then $(X,\{A_i\})$ is an approximation scheme which satisfies Shapiro's theorem.

Proof. The result follows from [1, Theorem 6.9] for the case $n_i=i$, $i\in\mathbb{N}$. Let us now assume that $\{n_i\}$ is any other increasing sequence. Given a non-increasing sequence $\{\varepsilon_i\}$ which converges to $0$, we introduce the new sequence

$$\epsilon_n=\varepsilon_i\quad\text{when } n_{i-1}<n\le n_i\quad(\text{with } n_0=0).$$

Then $\{\epsilon_n\}$ is non-increasing and converges to $0$. Thus, there exists a continuous function $f\in C(I)$ such that $E(f,R^1_n(I))\ge\epsilon_n$ for all $n\in\mathbb{N}$. In particular,

$$E(f,A_i)=E(f,R^1_{n_i}(I))\ge\epsilon_{n_i}=\varepsilon_i\quad\text{for all } i\in\mathbb{N}.$$

This ends the proof.

###### Theorem 7.

Let $S^1_{r,n}(I)$ denote the set of polynomial splines of degree $\le r$ with at most $n$ free knots in the interval $I=[a,b]$. Let $\{r_k\}$ and $\{n_k\}$ be a pair of increasing sequences of natural numbers. Let $A_k=S^1_{r_k,n_k}(I)$, $k\in\mathbb{N}$, and let $1\le q<\infty$. Then $(C(I),\{A_k\})$ and $(L^q(I),\{A_k\})$ are approximation schemes which satisfy Shapiro's Theorem.

Proof. See [1, Theorem 6.12].

Let us now study the multivariate case. For the case of rational approximation we consider, for compact sets $K\subset\mathbb{R}^d$, the approximation schemes $(C(K),\{R^d_n(K)\})$, where

$$R^d_n(K)=\left\{\frac{p(x_1,\dots,x_d)}{q(x_1,\dots,x_d)}:\ p,q\ \text{are polynomials},\ \max\{\deg(p),\deg(q)\}\le n\right\}\cap C(K).$$
###### Theorem 8.

For any convex compact set $K\subset\mathbb{R}^d$ and any increasing sequence of natural numbers $\{n_k\}$, the approximation scheme $(C(K),\{R^d_{n_k}(K)\})$ satisfies Shapiro's theorem.

Proof. Let $f\in C(K)$. We may assume, with no loss of generality, that

$$K\cap(\mathbb{R}\times\{0\}\times\cdots\times\{0\})=[a,a+\rho]\times\{0\}\times\cdots\times\{0\}$$

for certain $a\in\mathbb{R}$ and $\rho>0$, since rotations and translations of the space preserve the set of rational functions of any given order. Thus, the functions $\phi(s)=r(s,0,\dots,0)$ with $r\in R^d_n(K)$ satisfy $\phi\in R^1_n([a,a+\rho])$. Moreover, the convexity of $K$ and the fact that $[a,a+\rho]\times\{0\}\times\cdots\times\{0\}\subseteq K$ imply that every function $\phi\in R^1_n([a,a+\rho])$ extends to a function $r\in R^d_n(K)$, just taking $r(x_1,\dots,x_d)=\phi(x_1)$.

Hence, for any $f\in C(K)$ we have that

$$\begin{aligned}E(f,R^d_n(K))_{C(K)}&=\inf_{r\in R^d_n(K)}\ \sup_{x\in K}|f(x)-r(x)|\\&\ge\inf_{r\in R^d_n(K)}\ \sup_{x\in[a,a+\rho]\times\{0\}\times\cdots\times\{0\}}|f(x)-r(x)|\\&=\inf_{\phi\in R^1_n([a,a+\rho])}\ \sup_{s\in[a,a+\rho]}|f(s,0,\dots,0)-\phi(s)|\\&=E(g,R^1_n([a,a+\rho]))_{C[a,a+\rho]},\end{aligned}$$

where $g(s)=f(s,0,\dots,0)$ is a continuous function on the interval $[a,a+\rho]$. Now, given any function $g\in C[a,a+\rho]$ there exists a continuous function $f\in C(K)$ such that $f(s,0,\dots,0)=g(s)$ for all $s\in[a,a+\rho]$ (e.g., we may take $f(x_1,\dots,x_d)=\tilde g(x_1)$, where $\tilde g$ is any continuous extension of $g$ to $\mathbb{R}$). Hence $E(f,R^d_n(K))_{C(K)}\ge E(g,R^1_n([a,a+\rho]))_{C[a,a+\rho]}$ for all $n\in\mathbb{N}$. The proof ends applying Theorem 6 to the space $C[a,a+\rho]$.

Given a polyhedron $K\subseteq\mathbb{R}^d$, we say that a finite family $\Gamma$ of subsets of $K$ is a triangulation of $K$ if:

• Every set $\Delta\in\Gamma$ is a $d$-simplex. This means that $\Delta$ is the convex hull of a set of $d+1$ affinely independent points of $\mathbb{R}^d$.

• $K=\bigcup_{\Delta\in\Gamma}\Delta$ and, for $\Delta\ne\Delta'$ in $\Gamma$, $\Delta\cap\Delta'$ is either empty or an $s$-simplex for some $s<d$ (thus, it is the convex hull of a set of $s+1$ affinely independent points of $\mathbb{R}^d$).

If the polyhedron $K$ admits a triangulation we say that it is triangularizable. For example, every convex polyhedron is triangularizable. We consider, for any triangularizable polyhedron $K$, the approximation schemes $(C(K),\{S^d_{r_k,n_k}(K)\})$, where $S^d_{r,n}(K)$ denotes the set of spline functions of degree $\le r$ defined on a partition of $K$ with at most $n$ simplices. Concretely, $\phi$ is an element of $S^d_{r,n}(K)$ if and only if it is a continuous function on $K$ and satisfies the identity

$$\phi(x)=\sum_{\Delta\in\Gamma}p_\Delta(x)\chi_\Delta(x)\quad\text{for all } x\in K,$$

for some triangulation $\Gamma$ of $K$ with $\#\Gamma\le n$, where $p_\Delta$ denotes, for each $\Delta\in\Gamma$, a polynomial of total degree $\le r$.

To prove Shapiro's theorem for approximation by spline functions on free partitions we need the following technical result on triangulations:

###### Proposition 9.

Given $d\in\mathbb{N}$, there exists a function $h:\mathbb{N}^2\to\mathbb{N}$ such that, if $K\subset\mathbb{R}^d$ is a convex polyhedron and $\Gamma_1,\Gamma_2$ are two triangulations of $K$ with cardinalities $\#\Gamma_1\le n$, $\#\Gamma_2\le m$, then there exists a triangulation $\Gamma_3$ of $K$ which is a refinement of both $\Gamma_1$ and $\Gamma_2$ and satisfies $\#\Gamma_3\le h(n,m)$.

Proof. Given two triangulations $\Gamma_1,\Gamma_2$ of $K$ with cardinalities $\#\Gamma_1\le n$, $\#\Gamma_2\le m$, let us consider the following set:

$$\Theta=\{\Delta_1\cap\Delta_2:\ \Delta_i\ \text{is a simplex of}\ \Gamma_i,\ i=1,2\}.$$

Obviously, $\Theta$ is a polyhedral subdivision of $K$ which refines both $\Gamma_1$ and $\Gamma_2$, but in many cases $\Theta$ is not a triangulation of $K$. Moreover, the number of vertices of $\Theta$ will, in general, be bigger than the number of vertices of $\Gamma_1$ and of $\Gamma_2$. Let us find an upper bound for the number $N$ of vertices of $\Theta$. Assume that $v$ is a new vertex of $\Theta$ (i.e., a vertex of $\Theta$ which is not a vertex of $\Gamma_1$ or $\Gamma_2$) and let us consider the smallest simplices $\tau_i$ of $\Gamma_i$ ($i=1,2$) which contain $v$ in their relative interior (that is, in the interior of the simplex with its relative topology). Then $v\in\tau_1\cap\tau_2$. Hence the number of new vertices can't be bigger than the product of the number of simplices (including all dimensions) of $\Gamma_1$ by the number of simplices of $\Gamma_2$ (including again all dimensions), which gives the following upper bound for $N$:

$$N\le nm\,2^{2d}.$$

Now, using [3, Lemma 2.3.10] we can refine the polyhedral subdivision $\Theta$, without adding any new vertex, to get a triangulation $\Gamma_3$ of $K$. In this way, we get a triangulation of $K$ with at most $nm\,2^{2d}$ vertices. Now, the Upper Bound Theorem (see [3, Corollary 2.6.5]) applies to this triangulation of $K$, since $K$ is a topological ball of dimension $d$. In particular, we get that

$$h(n,m)=f_d\big(C(nm\,2^{2d}+1,\,d+1)\big)-(d+1)$$

satisfies the proposition. Here $f_d(S)$ denotes the number of faces of dimension $d$ (i.e., of $d$-simplices) of the simplicial complex $S$ and $C(v,d+1)$ denotes the simplicial $d$-sphere given by the proper faces of the cyclic $(d+1)$-polytope with $v$ vertices, which is the convex hull of $v$ arbitrary points taken from the moment curve $\gamma(t)=(t,t^2,\dots,t^{d+1})$.

###### Theorem 10.

Let $K\subset\mathbb{R}^d$ be a convex polyhedron with non-empty interior, and let $\{r_k\}$, $\{n_k\}$ be two increasing sequences of natural numbers. Then the following holds true:

• (a) $(C(K),\{S^d_{r_k,n_k}(K)\})$ is an approximation scheme.

• (b) $(C(K),\{S^d_{r_k,n_k}(K)\})$ satisfies Shapiro's theorem.

• (c) For any decreasing sequence $\{\varepsilon_k\}$ which converges to zero, there exists $f\in C(K)$ such that

$$E(f,S^d_{r_k,n_k}(K))_{C(K)}\ge\varepsilon_k\quad\text{for all } k\in\mathbb{N}.$$

Proof. It follows from Proposition 9 that there exists a function $h$ such that

$$S^d_{r_k,n_k}(K)+S^d_{r_s,n_s}(K)\subseteq S^d_{\max\{r_k,r_s\},\,h(n_k,n_s)}(K),$$

which proves (a), since this inclusion provides the required jump map and the other properties needed to be an approximation scheme are well known for these sets. To prove (b) and (c) we use an argument similar to the one used in the proof of Theorem 8. Concretely, given the convex polyhedron $K$, we can assume with no loss of generality that

$$K\cap(\mathbb{R}\times\{0\}\times\cdots\times\{0\})=[a,a+\rho]\times\{0\}\times\cdots\times\{0\}$$

for certain $a\in\mathbb{R}$ and $\rho>0$, since rotations and translations of the space transform triangulations into triangulations and preserve the spaces of polynomials.

For each $\phi\in S^d_{r_k,n_k}(K)$ there exists a triangulation $\Gamma$ of $K$ with $\#\Gamma\le n_k$ such that

$$\phi(x)=\sum_{\Delta\in\Gamma}p_\Delta(x)\chi_\Delta(x)\quad\text{for all } x\in K,$$

where $p_\Delta$ denotes, for each $\Delta\in\Gamma$, a polynomial of total degree $\le r_k$. Now, given the triangulation $\Gamma$, it is clear that $\Gamma$ induces a triangulation (i.e., a partition) of $[a,a+\rho]$ which contains at most $n_k$ intervals, with the property that $\phi(s,0,\dots,0)\in S^1_{2r_k,n_k}([a,a+\rho])$. Hence, for any $f\in C(K)$ we have that

$$E(f,S^d_{r_k,n_k}(K))_{C(K)}\ge E(g,S^1_{2r_k,n_k}([a,a+\rho]))_{C[a,a+\rho]},$$

where $g(s)=f(s,0,\dots,0)$. Obviously, the convexity of $K$ and the fact that $[a,a+\rho]\times\{0\}\times\cdots\times\{0\}\subseteq K$ again imply that, given $g\in C[a,a+\rho]$, the function $f(x_1,\dots,x_d)=\tilde g(x_1)$ (with $\tilde g$ any continuous extension of $g$ to $\mathbb{R}$) belongs to $C(K)$ and satisfies $f(s,0,\dots,0)=g(s)$ for all $s\in[a,a+\rho]$. Hence, parts (b) and (c) follow from Theorem 7.

Let us now demonstrate a negative result for approximation by feedforward neural networks with many layers. Let $\sigma$ be a nonpolynomial function and let us consider the approximation of continuous learning functions defined on compact subsets $K$ of $\mathbb{R}^d$ with activation function $\sigma$. We denote by $\tau^{r,n}_{\sigma,d}$ the set of functions defined on $\mathbb{R}^d$ computed by a feedforward neural network with at most $r$ layers and $n$ units in each layer, with activation function $\sigma$. For example, an element of $\tau^{1,n}_{\sigma,d}$ is a function of the form

$$\phi(x)=\sum_{i=1}^{n}c_i\,\sigma(w_i\cdot x+b_i),\quad\text{where } w_i,x\in\mathbb{R}^d\ \text{and}\ c_i,b_i\in\mathbb{R},$$

and an element of $\tau^{2,n}_{\sigma,d}$ is a function of the form

$$\phi(x)=\sum_{i=1}^{n}d_i\,\sigma\Big(\sum_{j=1}^{n}c_{ij}\,\sigma(w_{ij}\cdot x+b_{ij})+\delta_i\Big),\quad\text{where } w_{ij},x\in\mathbb{R}^d\ \text{and}\ c_{ij},b_{ij},d_i,\delta_i\in\mathbb{R}.$$
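These two formulas translate directly into code. The following sketch is our own illustration (ReLU stands in for $\sigma$, and the weights are arbitrary): it evaluates an element of $\tau^{1,n}_{\sigma,d}$ and an element of $\tau^{2,n}_{\sigma,d}$.

```python
# Sketch: evaluation of elements of tau_{sigma,d}^{1,n} and tau_{sigma,d}^{2,n},
# i.e. one- and two-hidden-layer networks with n units per layer.

def sigma(t):
    return max(0.0, t)  # ReLU; any activation could be substituted

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi1(x, w, b, c):
    # phi(x) = sum_i c_i * sigma(w_i . x + b_i)
    return sum(ci * sigma(dot(wi, x) + bi) for wi, bi, ci in zip(w, b, c))

def phi2(x, w, b, c, delta, dcoef):
    # phi(x) = sum_i d_i * sigma( sum_j c_ij * sigma(w_ij . x + b_ij) + delta_i )
    n = len(dcoef)
    out = 0.0
    for i in range(n):
        inner = sum(c[i][j] * sigma(dot(w[i][j], x) + b[i][j]) for j in range(n))
        out += dcoef[i] * sigma(inner + delta[i])
    return out

# Tiny example with d = 2, n = 2:
x = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]; b1 = [0.0, 0.0]; c1 = [1.0, 1.0]
print(phi1(x, w1, b1, c1))  # sigma(1) + sigma(-1) = 1.0
```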

The following result holds:

###### Theorem 11.

Let $\{r_k\}$, $\{n_k\}$ be two sequences of natural numbers and let $\{\varepsilon_k\}$ be any non-increasing sequence which converges to $0$. Then:

• (a) If $K\subset\mathbb{R}^d$ is compact and convex and $\sigma$ is a rational function, there exists $f\in C(K)$ such that

$$E(f,\tau^{r_k,n_k}_{\sigma,d})_{C(K)}\ge\varepsilon_k\quad\text{for all } k=1,2,\dots$$

• (b) If $K\subset\mathbb{R}^d$ is a convex polyhedron and $\sigma$ is a linear spline with finitely many pieces, there exists $f\in C(K)$ such that

$$E(f,\tau^{r_k,n_k}_{\sigma,d})_{C(K)}\ge\varepsilon_k\quad\text{for all } k=1,2,\dots$$

In particular, this result applies to neural networks with the activation functions $\mathrm{ReLU}$ and $\mathrm{Hard\,Tanh}$.

Proof. If $\sigma=p/q$ is a (nonpolynomial) rational function of order $N=\max\{\deg p,\deg q\}$, then all elements of $\tau^{r,n}_{\sigma,d}$ are also rational functions and their orders are uniformly bounded by a function $o(r,n,N)$. Moreover, for the estimation of the errors $E(f,\tau^{r,n}_{\sigma,d})_{C(K)}$ with $f\in C(K)$, we only need to consider the elements $\phi\in\tau^{r,n}_{\sigma,d}\cap C(K)$ since, if $\phi\in\tau^{r,n}_{\sigma,d}$ has a pole on $K$, then $\|f-\phi\|_{C(K)}=\infty$ and this function does not contribute to the estimation of the error $E(f,\tau^{r,n}_{\sigma,d})_{C(K)}$. Thus, we can claim that

$$E(f,\tau^{r,n}_{\sigma,d})_{C(K)}\ge E(f,R^d_{o(r,n,N)}(K))_{C(K)}$$

and, taking $m_k=o(r_k,n_k,N)$, we have that

$$E(f,\tau^{r_k,n_k}_{\sigma,d})_{C(K)}\ge E(f,R^d_{m_k}(K))_{C(K)},\quad k=1,2,\dots,$$

and (a) follows from Theorem 8.

Let us now prove (b). If $\sigma$ is a continuous linear spline with finitely many pieces, say $N$, then every element of $\tau^{r,n}_{\sigma,d}$ is again a continuous linear spline (in $d$ variables), defined on a finite set of polyhedra. For $r=1$ we have that every element $\phi\in\tau^{1,n}_{\sigma,d}$ is of the form

$$\phi(x)=\sum_{i=1}^{n}c_i\,\sigma(w_i\cdot x+b_i)\quad\text{with } w_i,x\in\mathbb{R}^d\ \text{and}\ c_i,b_i\in\mathbb{R},$$

where $\sigma(t)=A_k t+B_k$ for all $t\in I_k$ and $\{I_k\}_{k=1}^{N}$ is a fixed partition of $\mathbb{R}$ in $N$ intervals (two of them are semi-infinite and the others are finite intervals). This means that the evaluation of $\phi$ takes the form

$$\phi(x)=\sum_{i=1}^{n}c_i\big(A_{k(i)}(w_i\cdot x+b_i)+B_{k(i)}\big),\quad\text{where } w_i\cdot x+b_i\in I_{k(i)},\ i=1,\dots,n,$$

which proves that $\phi$ is piecewise linear and the number of pieces (where the function changes from one linear polynomial to another) is finite and controlled by a finite set of linear inequalities, so that each piece is a convex polyhedron with a well controlled number of faces. An induction argument, which takes into account the definition of neural networks with several layers, shows that all elements of $\tau^{r,n}_{\sigma,d}$ belong to $S^d_{1,\mu(r,n,N)}(K)$ for a certain fixed function $\mu$. The result follows from part (c) of Theorem 10.
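The piecewise-linearity argument can be observed directly in the univariate case: a network $\sum_i c_i\,\mathrm{ReLU}(w_i x+b_i)$ can only change slope at the breakpoints $x=-b_i/w_i$, so it has at most $n+1$ linear pieces. A small sketch (our own illustration, not part of the paper):

```python
# Sketch: a one-hidden-layer ReLU network on R is a linear spline whose
# breakpoints lie among x = -b_i/w_i, hence it has at most n+1 linear pieces.

def relu(t):
    return max(0.0, t)

def net(x, w, b, c):
    return sum(ci * relu(wi * x + bi) for wi, bi, ci in zip(w, b, c))

w = [1.0, -2.0, 0.5]
b = [0.0, 1.0, -1.0]
c = [1.0, 1.0, -3.0]

breakpoints = sorted(-bi / wi for wi, bi in zip(w, b) if wi != 0)
# Between consecutive breakpoints the network is affine: the second
# difference of three equally spaced samples inside one piece vanishes.
lo, hi = breakpoints[0], breakpoints[1]
xs = [lo + (hi - lo) * (i + 1) / 5 for i in range(3)]
second_diff = net(xs[0], w, b, c) - 2 * net(xs[1], w, b, c) + net(xs[2], w, b, c)
assert abs(second_diff) < 1e-12
```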

## References

• [1] J. M. Almira, T. Oikhberg Approximation schemes satisfying Shapiro’s Theorem, Journal of Approximation Theory 164 (2012) 534-571.
• [2] J. M. Almira, T. Oikhberg Shapiro’s theorem for subspaces, Journal of Mathematical Analysis and Applications, 388 (2012) 282-302.
• [3] J. A. De Loera, J. Rambau and F. Santos, Triangulations. Structures for Algorithms and Applications, Springer, 2010.
• [4] G.G. Lorentz, M. v. Golitschek, Y. Makovoz, Constructive Approximation. Advanced Problems, Springer, 1996.
• [5] Y. Gordon, V. Maiorov, M. Meyer, S. Reisner, On the best approximation by ridge functions in the uniform norm, Constructive Approximation 18 (2002) 61-85.
• [6] N. J. Guliyev, V. E. Ismailov, On the approximation by single hidden layer feedforward neural networks with fixed weights, Neural Networks, 98 (2018) 296-304.
• [7] N. J. Guliyev, V. E. Ismailov, Approximation capability of two hidden layer feedforward neural networks with fixed weights, Neurocomputing, 316 (2018) 262-269.
• [8] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359-366.
• [9] M. Leshno, V. Y. Lin, A. Pinkus, S. Schocken, Multilayer Feedforward Networks with a non polynomial activation function can approximate any function, Neural Networks 6 (1993) 861-867.
• [10] V. Maiorov, A. Pinkus, Lower bounds for approximations by MLP neural networks, Neurocomputing 25 (1999) 81-91.
• [11] V. Kůrková, Kolmogorov’s theorem and multilayer neural networks, Neural Networks 5 (1992) 501-506.
• [12] V. Kůrková, Kolmogorov’s theorem is relevant, Neural Computation 3 (1991) 617-622.
• [13] A. Pinkus, Approximation theory of the MLP model in Neural Networks, Acta Numerica 8 (1999), 143-196.
• [14] K-Y. Siu, V. P. Roychowdhury, T. Kailath, Rational approximation techniques for analysis of neural networks, IEEE Transactions on Information Theory 40 (2) (1994) 455-466.
• [15] M. Telgarsky, Neural networks and rational functions, Proc. Machine Learning Research ICML (2017) 3387-3393.
• [16] M. Telgarsky, Benefits of depth in neural networks, JMLR: Workshop and Conference Proceedings 49 (2016) 1-23.
• [17] R. C. Williamson, Rational parametrization of neural networks, Advances in Neural Inf. Processing Systems 6 (1993) 623-630.
• [18] R. C. Williamson, P. L. Bartlett, Splines, rational functions and neural networks, Advances in Neural Inf. Processing Systems 5 (1992) 1040-1047.