# Barron Spaces and the Compositional Function Spaces for Neural Network Models

One of the key issues in the analysis of machine learning models is to identify the appropriate function space for the model. This is the space of functions that the particular machine learning model can approximate with good accuracy, endowed with a natural norm associated with the approximation process. In this paper, we address this issue for two representative neural network models: the two-layer networks and the residual neural networks. We define Barron space and show that it is the right space for two-layer neural network models in the sense that optimal direct and inverse approximation theorems hold for functions in the Barron space. For residual neural network models, we construct the so-called compositional function space, and prove direct and inverse approximation theorems for this space. In addition, we show that the Rademacher complexity has the optimal upper bounds for these spaces.


## 1 Introduction

The task of supervised learning is to approximate a function using a given set of data. This type of problem has been the subject of classical numerical analysis and approximation theory for a long time. The theory of splines and the theory of finite element methods are very successful examples of such classical results [10, 9]; both are concerned with approximating functions using piecewise polynomials. In these theories, one starts from a function in a particular function space, say a Sobolev or Besov space, and proceeds to derive optimal error estimates for this function. The optimal error estimates depend on the regularity encoded in the function space as well as the approximation scheme. They are the most important pieces of information for understanding the underlying approximation scheme. It should be noted that when discussing a particular function space, we are not just concerned with the set of functions it contains; the norm associated with the space is also crucial.

Identifying the right function space is the most crucial step in this analysis. Sobolev/Besov type spaces are good function spaces for these classical theories since:

1. One can prove direct and inverse approximation theorems for these spaces. Roughly speaking, a function can be approximated by piecewise polynomials with a certain convergence rate if and only if the function is in a certain Sobolev/Besov space.

2. The functions we are interested in, e.g. solutions of partial differential equations (PDEs), are in these spaces. This is at the heart of the regularity theory for PDEs.

However, these spaces are tied to the piecewise polynomial basis used in the approximation scheme. These approximation schemes suffer from the curse of dimensionality, i.e. the number of parameters needed to achieve a certain level of accuracy grows exponentially with dimension. Consequently, Sobolev/Besov type spaces are not the right function spaces for studying machine learning models that can potentially address the curse of dimensionality problem.

Another inspiration for this paper comes from kernel methods. It is well-known that the right function space associated with a kernel method is the corresponding reproducing kernel Hilbert space (RKHS) [1]. RKHS and kernel methods provide one of the first examples for which dimension-independent error estimates can be established.

The main purpose of this paper is to construct and identify the analog of these spaces for two-layer and residual neural network models. For two-layer neural network models, we show that the right function space is the so-called “Barron space”. Roughly speaking, a function belongs to the Barron space if and only if it can be approximated by “well-behaved” two-layer neural networks. The analog of the Barron space for deep residual neural networks is the “compositional function space” that we construct in the second part of this paper. We will establish direct and inverse approximation theorems for these spaces as well as the optimal Rademacher complexity estimates.

One important difference between approximation theory in low and high dimensions is that in high dimensions, the best error rate (or order of convergence) that one can hope for is the Monte Carlo error rate. Therefore, using the error rate as an indicator to distinguish the quality of different approximation schemes or machine learning models is not a good option. The function spaces or the associated norms seem to be a better alternative. We take the viewpoint that a function space is defined by its approximation property using a particular approximation scheme. In this sense, Sobolev/Besov spaces are the result when we consider approximation by piecewise polynomials or wavelets. Barron space is the analog when we consider approximation by two-layer neural networks, and the compositional function space is the analog when we consider approximation by deep residual networks. The norms that are associated with these spaces may seem a bit unusual at first sight, but they arise naturally in the approximation process when we consider direct and inverse approximation theorems.

Although this work was motivated by the problem of understanding approximation theory for neural network models in machine learning, we believe that it should have an implication for high dimensional analysis in general. One natural follow-up question is whether one can show that solutions to high dimensional partial differential equations (PDE) belong to the function spaces introduced here. At least for linear parabolic PDEs, the work in [12] suggests that some close analog of the compositional spaces should serve the purpose.

In Section 2, we introduce the Barron space for two-layer neural networks. Although not all the results in this section are new (some have appeared in various forms in [13, 2]), they are useful for illustrating our angle of attack and they are also useful for the work in Section 3 where we introduce the compositional function space for residual networks.

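Notations: Let $\mathbb{S}^d = \{w \in \mathbb{R}^{d+1} : \|w\|_1 = 1\}$. We define $\hat{w} = w/\|w\|_1$ if $w \neq 0$, otherwise $\hat{w} = 0$. For simplicity, we fix the domain of interest to be $X = [0,1]^d$. We denote by $x \in X$ the input variable, and let $\tilde{x} = (x^T, 1)^T$. We sometimes abuse notation and use $f(x)$ (or some other analogs) to denote the function $f$ in order to signify the independent variable under consideration. We use $\|f\|$ to denote the $L^2$ norm of the function $f$ defined by

$$\|f\| = \left(\int_X |f(x)|^2\,\mu(dx)\right)^{1/2}, \tag{1}$$

where $\mu$ is a probability distribution on $X$. We do not specify $\mu$ in this paper.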

One important point when working in high dimensions is the dependence of the constants on the dimension. We will use $C$ to denote constants that are independent of the dimension.

## 2 Barron spaces

### 2.1 Definition of the Barron spaces

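We will consider functions that admit the following representation

$$f(x) = \int_\Omega a\,\sigma(b^T x + c)\,\rho(da, db, dc), \quad x \in X, \tag{2}$$

where $\Omega = \mathbb{R}^1 \times \mathbb{R}^d \times \mathbb{R}^1$, $\rho$ is a probability distribution on $(\Omega, \Sigma_\Omega)$, with $\Sigma_\Omega$ being the Borel $\sigma$-algebra on $\Omega$, and $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function. This representation can be considered as the continuum analog of two-layer neural networks:

$$f_m(x; \Theta) := \frac{1}{m}\sum_{j=1}^m a_j\,\sigma(b_j^T x + c_j), \tag{3}$$

where $\Theta = (a_1, b_1, c_1, \dots, a_m, b_m, c_m)$ denotes all the parameters. It should be noted that in general, the $\rho$'s for which (2) holds are not unique.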
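To make the finite-width model (3) concrete, here is a minimal sketch in plain Python. The function names and the two-neuron parameter list `theta` are illustrative assumptions and do not come from the paper.

```python
def relu(t):
    """ReLU activation sigma(t) = max(t, 0)."""
    return t if t > 0.0 else 0.0


def two_layer(x, params):
    """Evaluate f_m(x; Theta) = (1/m) * sum_j a_j * relu(b_j . x + c_j), as in (3).

    params is a list of (a, b, c) triples: a and c are floats,
    b is a list of floats with the same length as x.
    """
    m = len(params)
    total = 0.0
    for a, b, c in params:
        pre_activation = sum(bi * xi for bi, xi in zip(b, x)) + c
        total += a * relu(pre_activation)
    return total / m


# Hypothetical parameters for a network with m = 2 neurons in dimension d = 2.
theta = [
    (2.0, [1.0, -1.0], 0.0),
    (-1.0, [0.5, 0.5], -0.25),
]
value = two_layer([1.0, 0.5], theta)
print(value)
```

Note the $1/m$ normalization: it is what lets (3) be read as an empirical average approximating the integral in (2).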

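To get some intuition about the representation (2), we write the Fourier representation of a function $f$ as:

$$f(x) = \int_{\mathbb{R}^d} \hat{f}(\omega)\cos(\omega^T x)\,d\omega = \int_{\mathbb{R}^1 \times \mathbb{R}^d} a\cos(\omega^T x)\,\rho(da, d\omega), \tag{4}$$

$$\rho(da, d\omega) = \delta(a - \hat{f}(\omega))\,da\,d\omega.$$

This can be thought of as the analog of (2) for the case when $\sigma(z) = \cos(z)$, except for the fact that the $\rho$ defined in (4) is not normalizable.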

For functions that admit the representation (2), we define the Barron norm:

$$\|f\|_{\mathcal{B}_p} = \inf_\rho \left(\mathbb{E}_\rho\big[|a|^p(\|b\|_1 + |c|)^p\big]\right)^{1/p}, \quad 1 \le p \le +\infty. \tag{5}$$

Here the infimum is taken over all $\rho$ for which (2) holds for all $x \in X$. Barron spaces $\mathcal{B}_p$ are defined as the set of continuous functions that can be represented by (2) with finite Barron norm. We name these spaces after Barron to honor his contribution to the mathematical analysis of two-layer neural networks, in particular the work in [4, 5, 13].
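As a quick sanity check of definition (5), consider a single ReLU unit. The parameters $a_0$, $b_0$, $c_0$ are illustrative and are not taken from the text. Choosing $\rho$ to be a point mass gives a valid representation (2):

$$f(x) = a_0\,\sigma(b_0^T x + c_0), \qquad \rho = \delta_{(a_0, b_0, c_0)},$$

$$\left(\mathbb{E}_\rho\big[|a|^p(\|b\|_1 + |c|)^p\big]\right)^{1/p} = |a_0|\,(\|b_0\|_1 + |c_0|) \quad \text{for every } 1 \le p \le +\infty.$$

Since the norm is an infimum over all admissible $\rho$, this only yields the upper bound $\|f\|_{\mathcal{B}_p} \le |a_0|(\|b_0\|_1 + |c_0|)$. It also shows why the value does not depend on $p$ when $|a|(\|b\|_1 + |c|)$ is constant on the support of $\rho$.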

As a consequence of Hölder's inequality, we trivially have

$$\mathcal{B}_\infty \subset \cdots \subset \mathcal{B}_2 \subset \mathcal{B}_1. \tag{6}$$

However, the reverse inclusion also holds.

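###### Proposition 1.

For any $f \in \mathcal{B}_1$, we have $f \in \mathcal{B}_\infty$ and

$$\|f\|_{\mathcal{B}_1} = \|f\|_{\mathcal{B}_\infty}.$$

As a consequence, we have that for any $1 \le p \le \infty$, $\mathcal{B}_p = \mathcal{B}_\infty$ and $\|f\|_{\mathcal{B}_p} = \|f\|_{\mathcal{B}_\infty}$. Hence, we can use $\mathcal{B}$ and $\|\cdot\|_{\mathcal{B}}$ to denote the Barron space and Barron norm.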

A natural question is: What kind of functions are in the Barron space? The following is a restatement of an important result proved in [8]. It is an extension of the Fourier analysis of two-layer sigmoidal neural networks in Barron’s seminal work [4] (see also [13] for another proof).

###### Theorem 2.

Let $f \in C(X)$, the space of continuous functions on $X$, and assume that $f$ satisfies:

$$\gamma(f) := \inf_{\hat{f}} \int_{\mathbb{R}^d} \|\omega\|_1^2\,|\hat{f}(\omega)|\,d\omega < \infty,$$

where $\hat{f}$ is the Fourier transform of an extension of $f$ to $\mathbb{R}^d$. Then $f$ admits a representation as in (2). Moreover,

$$\|f\|_{\mathcal{B}} \le 2\gamma(f) + 2\|\nabla f(0)\|_1 + 2|f(0)|. \tag{7}$$

###### Remark 1.

In Section 9 of [4], examples of functions with bounded $\gamma(f)$ are given (e.g. Gaussian, positive definite functions, linear functions, radial functions, etc.). By Theorem 2, these functions belong to the Barron space.

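In addition, the Barron space is also closely related to a family of RKHS. Let $w = (b^T, c)^T$. Due to the scaling invariance of $\sigma$, we can assume $\|w\|_1 = 1$. Then (2) can be written as

$$f(x) = \int_{\mathbb{S}^d} a\,\sigma(w^T\tilde{x})\,\rho(da, dw) = \int_{\mathbb{S}^d} a(w)\,\sigma(w^T\tilde{x})\,\pi(dw),$$

where

$$a(w) = \frac{\int_{\mathbb{R}} a\,\rho(a, w)\,da}{\pi(w)}, \qquad \pi(w) = \int_{\mathbb{R}} \rho(a, w)\,da.$$

Moreover,

$$\|f\|^2_{\mathcal{B}_2} = \inf_{\pi \in P(\mathbb{S}^d)} \mathbb{E}_\pi\big[|a(w)|^2\big], \tag{8}$$

where $P(\mathbb{S}^d)$ denotes the collection of the probability measures over $(\mathbb{S}^d, \Sigma_{\mathbb{S}^d})$, and $\Sigma_{\mathbb{S}^d}$ is the Borel $\sigma$-algebra on $\mathbb{S}^d$.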

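Given a fixed probability distribution $\pi$, we can define a kernel:

$$k_\pi(x, x') = \mathbb{E}_{w \sim \pi}\big[\sigma(w^T\tilde{x})\,\sigma(w^T\tilde{x}')\big].$$

Let $\mathcal{H}_{k_\pi}$ denote the RKHS induced by $k_\pi$. Then we have the following theorem.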

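###### Theorem 3.

$$\mathcal{B} = \bigcup_{\pi \in P(\mathbb{S}^d)} \mathcal{H}_{k_\pi}.$$

###### Proof.

According to [17], we have the following characterization of $\mathcal{H}_{k_\pi}$:

$$\mathcal{H}_{k_\pi} = \left\{\int_{\mathbb{S}^d} a(w)\,\sigma(w^T\tilde{x})\,d\pi(w) : \mathbb{E}_\pi\big[|a(w)|^2\big] < \infty\right\}.$$

In addition, for any $h \in \mathcal{H}_{k_\pi}$, $\|h\|^2_{\mathcal{H}_{k_\pi}} = \mathbb{E}_\pi\big[|a_h(w)|^2\big]$. It is obvious that for any $\pi \in P(\mathbb{S}^d)$, $\mathcal{H}_{k_\pi} \subset \mathcal{B}_2$, which implies that $\bigcup_\pi \mathcal{H}_{k_\pi} \subset \mathcal{B}_2$. Conversely, for any $f \in \mathcal{B}_2$, there exists a probability distribution $\tilde{\pi}$ that satisfies

$$f(x) = \int_{\mathbb{S}^d} a(w)\,\sigma(w^T\tilde{x})\,\tilde{\pi}(dw) \quad \forall x \in X, \tag{9}$$

and $\mathbb{E}_{\tilde{\pi}}\big[|a(w)|^2\big] < \infty$. Hence we have $f \in \mathcal{H}_{k_{\tilde{\pi}}}$, which implies $\mathcal{B}_2 \subset \bigcup_\pi \mathcal{H}_{k_\pi}$. Therefore $\mathcal{B}_2 = \bigcup_\pi \mathcal{H}_{k_\pi}$. Together with Proposition 1, we complete the proof. ∎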

### 2.2 Direct and inverse approximation theorems

With (2), approximating $f$ by two-layer networks becomes a Monte Carlo integration problem.

###### Theorem 4.

For any $f \in \mathcal{B}$ and any $m \in \mathbb{N}_+$, there exists a two-layer neural network $f_m(\cdot; \Theta)$ such that

$$\|f(\cdot) - f_m(\cdot; \Theta)\|^2 \le \frac{3\|f\|^2_{\mathcal{B}}}{m}.$$

Furthermore, we have

$$\|\Theta\|_P := \frac{1}{m}\sum_{j=1}^m |a_j|\,(\|b_j\|_1 + |c_j|) \le 2\|f\|_{\mathcal{B}}.$$

###### Remark 2.

We call $\|\Theta\|_P$ the path norm of two-layer neural networks. This is the analog of the Barron norm of functions in $\mathcal{B}$. Hence, when studying approximation properties, it is natural to study two-layer neural networks with bounded path norm.
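The path norm is easy to compute for a finite network. The sketch below, with illustrative function names and hypothetical parameters, also checks the scaling invariance of ReLU: rescaling each neuron so that $\|b\|_1 + |c| = 1$ changes neither the function nor the path norm.

```python
def relu(t):
    return t if t > 0.0 else 0.0


def two_layer(x, params):
    """f_m(x; Theta) = (1/m) * sum_j a_j * relu(b_j . x + c_j)."""
    total = 0.0
    for a, b, c in params:
        total += a * relu(sum(bi * xi for bi, xi in zip(b, x)) + c)
    return total / len(params)


def path_norm(params):
    """||Theta||_P = (1/m) * sum_j |a_j| * (||b_j||_1 + |c_j|)."""
    total = 0.0
    for a, b, c in params:
        total += abs(a) * (sum(abs(bi) for bi in b) + abs(c))
    return total / len(params)


def normalize(params):
    """Rescale each neuron so that ||b||_1 + |c| = 1, moving the scale into a."""
    out = []
    for a, b, c in params:
        scale = sum(abs(bi) for bi in b) + abs(c)
        if scale == 0.0:
            out.append((0.0, list(b), c))  # this neuron is identically zero
        else:
            out.append((a * scale, [bi / scale for bi in b], c / scale))
    return out


# Hypothetical two-neuron network in dimension d = 2.
theta = [(2.0, [1.0, -1.0], 0.0), (-1.0, [0.5, 0.5], -0.25)]
theta_normalized = normalize(theta)
x = [1.0, 0.5]
print(path_norm(theta), path_norm(theta_normalized))
print(two_layer(x, theta), two_layer(x, theta_normalized))
```

After normalization, the path norm reduces to the average of $|a_j|$, which is the form used in the proof of the inverse approximation theorem.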

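One can also prove an inverse approximation theorem. To state this result, we define:

$$\mathcal{N}_Q = \left\{\frac{1}{m}\sum_{k=1}^m a_k\,\sigma(b_k^T x + c_k) : \frac{1}{m}\sum_{k=1}^m |a_k|\,(\|b_k\|_1 + |c_k|) \le Q,\ m \in \mathbb{N}_+\right\}.$$

###### Theorem 5.

Let $f^*$ be a continuous function on $X$. Assume there exists a constant $Q$ and a sequence of functions $(f_m) \subset \mathcal{N}_Q$ such that

$$f_m(x) \to f^*(x)$$

for all $x \in X$. Then there exists a probability distribution $\rho^*$ on $\Omega$, such that

$$f^*(x) = \int a\,\sigma(b^T x + c)\,\rho^*(da, db, dc),$$

for all $x \in X$. Furthermore, we have $f^* \in \mathcal{B}$ with

$$\|f^*\|_{\mathcal{B}} \le Q.$$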

### 2.3 Estimates of the Rademacher complexity

Next, we show that the Barron spaces we defined have low complexity. We show this by bounding the Rademacher complexity of bounded sets in the Barron spaces.

###### Definition 1 (Rademacher complexity).

Given a set of functions $\mathcal{F}$ and data samples $S = \{x_1, x_2, \dots, x_n\}$, the Rademacher complexity of $\mathcal{F}$ with respect to $S$ is defined as

$$\mathrm{Rad}_n(\mathcal{F}) = \frac{1}{n}\,\mathbb{E}_\xi \sup_{f \in \mathcal{F}} \sum_{i=1}^n \xi_i f(x_i), \tag{10}$$

where $\xi = (\xi_1, \dots, \xi_n)$ is a vector of $n$ i.i.d. random variables that satisfy $\mathbb{P}(\xi_i = 1) = \mathbb{P}(\xi_i = -1) = 1/2$.

The following theorem gives an estimate of the Rademacher complexity of the Barron space. Similar results can be found in [2]. We include the proof for completeness.

###### Theorem 6.

Let $\mathcal{F}_Q = \{f \in \mathcal{B} : \|f\|_{\mathcal{B}} \le Q\}$. Then we have

$$\mathrm{Rad}_n(\mathcal{F}_Q) \le 2Q\sqrt{\frac{2\ln(2d)}{n}}. \tag{11}$$

From Theorem 8 in [6], we see that the above result implies that functions in the Barron spaces can be learned efficiently.

### 2.4 Proofs

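#### 2.4.1 Proof of Proposition 1

Take $f \in \mathcal{B}_1$. For any $\varepsilon > 0$, there exists a probability measure $\rho$ that satisfies

$$f(x) = \int_\Omega a\,\sigma(b^T x + c)\,\rho(da, db, dc), \quad \forall x \in X, \tag{12}$$

and

$$\mathbb{E}_\rho\big[|a|(\|b\|_1 + |c|)\big] < \|f\|_{\mathcal{B}_1} + \varepsilon. \tag{13}$$

Let $\Lambda = \{(b, c) : \|b\|_1 + |c| = 1\}$, and consider two measures $\rho_+$ and $\rho_-$ on $\Lambda$ defined by

$$\rho_+(A) = \int_{\{(a,b,c):\ (\hat{b}, \hat{c}) \in A,\ a > 0\}} |a|(\|b\|_1 + |c|)\,\rho(da, db, dc), \tag{14}$$

$$\rho_-(A) = \int_{\{(a,b,c):\ (\hat{b}, \hat{c}) \in A,\ a < 0\}} |a|(\|b\|_1 + |c|)\,\rho(da, db, dc), \tag{15}$$

for any Borel set $A \subset \Lambda$, where

$$\hat{b} = \frac{b}{\|b\|_1 + |c|}, \qquad \hat{c} = \frac{c}{\|b\|_1 + |c|}. \tag{16}$$

Obviously $\rho_+(\Lambda) + \rho_-(\Lambda) = \mathbb{E}_\rho\big[|a|(\|b\|_1 + |c|)\big]$, and

$$f(x) = \int_\Lambda \sigma(b^T x + c)\,\rho_+(db, dc) - \int_\Lambda \sigma(b^T x + c)\,\rho_-(db, dc). \tag{17}$$

Next, we define extensions of $\rho_+$ and $\rho_-$ to $\{-1, 1\} \times \Lambda$ by

$$\tilde{\rho}_+(A') = \rho_+\big(\{(b, c) : (1, b, c) \in A'\}\big), \tag{18}$$

$$\tilde{\rho}_-(A') = \rho_-\big(\{(b, c) : (-1, b, c) \in A'\}\big), \tag{19}$$

for any Borel set $A' \subset \{-1, 1\} \times \Lambda$, and let $\tilde{\rho} = \tilde{\rho}_+ + \tilde{\rho}_-$. Then we have $\tilde{\rho}(\{-1, 1\} \times \Lambda) = \rho_+(\Lambda) + \rho_-(\Lambda)$ and

$$f(x) = \int_{\{-1,1\} \times \Lambda} a\,\sigma(b^T x + c)\,\tilde{\rho}(da, db, dc). \tag{20}$$

Therefore, we can normalize $\tilde{\rho}$ to be a probability measure, and

$$\|f\|_{\mathcal{B}_\infty} \le \tilde{\rho}(\{-1, 1\} \times \Lambda) \le \|f\|_{\mathcal{B}_1} + \varepsilon. \tag{21}$$

Taking the limit $\varepsilon \to 0$, we have $\|f\|_{\mathcal{B}_\infty} \le \|f\|_{\mathcal{B}_1}$. Since $\|f\|_{\mathcal{B}_1} \le \|f\|_{\mathcal{B}_\infty}$ from Hölder's inequality, we conclude that $\|f\|_{\mathcal{B}_1} = \|f\|_{\mathcal{B}_\infty}$. ∎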
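The normalization argument can be checked numerically when $\rho$ is a finite discrete measure. The following sketch is an illustration under that assumption, with hypothetical atoms and function names. It moves the mass $|a|(\|b\|_1 + |c|)$ of each atom into its probability, so that every new atom has the same value of $|a|(\|b\|_1 + |c|)$.

```python
def relu(t):
    return t if t > 0.0 else 0.0


def l1_size(b, c):
    return sum(abs(bi) for bi in b) + abs(c)


def integrate(x, atoms):
    """f(x) = sum_k p_k * a_k * relu(b_k . x + c_k) for atoms (p, a, b, c)."""
    return sum(p * a * relu(sum(bi * xi for bi, xi in zip(b, x)) + c)
               for p, a, b, c in atoms)


def b1_cost(atoms):
    """E_rho[|a| (||b||_1 + |c|)], the quantity minimized in the B_1 norm."""
    return sum(p * abs(a) * l1_size(b, c) for p, a, b, c in atoms)


def binf_cost(atoms):
    """Essential supremum of |a| (||b||_1 + |c|) under rho (the p = infinity cost)."""
    return max(abs(a) * l1_size(b, c) for p, a, b, c in atoms if p > 0.0)


def flatten(atoms):
    """Discrete analog of (14)-(20): reweight atoms and put them on {-M, M} x Lambda.

    Assumes the total mass M = b1_cost(atoms) is strictly positive.
    """
    total_mass = b1_cost(atoms)
    out = []
    for p, a, b, c in atoms:
        size = l1_size(b, c)
        mass = p * abs(a) * size
        if mass == 0.0:
            continue  # atoms with a = 0 or (b, c) = 0 contribute nothing to f
        sign = 1.0 if a > 0.0 else -1.0
        out.append((mass / total_mass, sign * total_mass,
                    [bi / size for bi in b], c / size))
    return out


# Hypothetical discrete rho with three atoms (p, a, b, c).
atoms = [
    (0.5, 2.0, [1.0, -1.0], 0.0),
    (0.3, -1.0, [0.5, 0.5], -0.25),
    (0.2, 4.0, [0.0, 2.0], 1.0),
]
flat = flatten(atoms)
x = [0.7, 0.4]
print(integrate(x, atoms), integrate(x, flat))
print(b1_cost(atoms), binf_cost(atoms), binf_cost(flat))
```

For this choice of atoms, the $p = \infty$ cost of the original measure is larger than its $p = 1$ cost, while the reweighted measure represents the same function with the two costs equal.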

#### 2.4.2 Proof of Theorem 4

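Let $\varepsilon$ be a positive number such that $\varepsilon < 1/5$. Let $\rho$ be a probability distribution such that $f(x) = \mathbb{E}_\rho[a\,\sigma(b^T x + c)]$ and $\mathbb{E}_\rho\big[|a|^2(\|b\|_1 + |c|)^2\big] \le (1 + \varepsilon)\|f\|^2_{\mathcal{B}_2}$. Let $\phi(x; \theta) = a\,\sigma(b^T x + c)$ with $\theta = (a, b, c) \sim \rho$. Then we have $\mathbb{E}_{\theta \sim \rho}[\phi(x; \theta)] = f(x)$. Let $\Theta = \{\theta_j\}_{j=1}^m$ be i.i.d. random variables drawn from $\rho$, and consider the following empirical average,

$$\hat{f}_m(x; \Theta) = \frac{1}{m}\sum_{j=1}^m \phi(x; \theta_j).$$

Let $\mathcal{E}(\Theta) = \mathbb{E}_x|\hat{f}_m(x; \Theta) - f(x)|^2$ be the approximation error. Then we have

$$\begin{aligned}
\mathbb{E}_\Theta[\mathcal{E}(\Theta)] &= \mathbb{E}_\Theta\mathbb{E}_x|\hat{f}_m(x; \Theta) - f(x)|^2 \\
&= \mathbb{E}_x\mathbb{E}_\Theta\Big|\frac{1}{m}\sum_{j=1}^m \phi(x; \theta_j) - f(x)\Big|^2 \\
&= \frac{1}{m^2}\,\mathbb{E}_x\sum_{j,k=1}^m \mathbb{E}_{\theta_j, \theta_k}\big[(\phi(x; \theta_j) - f(x))(\phi(x; \theta_k) - f(x))\big] \\
&\le \frac{1}{m^2}\sum_{j=1}^m \mathbb{E}_x\mathbb{E}_{\theta_j}\big[(\phi(x; \theta_j) - f(x))^2\big] \\
&\le \frac{1}{m}\,\mathbb{E}_x\mathbb{E}_{\theta \sim \rho}\big[\phi^2(x; \theta)\big] \\
&\le \frac{(1 + \varepsilon)\|f\|^2_{\mathcal{B}_2}}{m}.
\end{aligned}$$

The cross terms with $j \neq k$ vanish because the $\theta_j$ are independent and $\mathbb{E}_\theta[\phi(x; \theta)] = f(x)$. In addition,

$$\mathbb{E}_\Theta[\|\Theta\|_P] = \frac{1}{m}\sum_{j=1}^m \mathbb{E}_{\theta_j}\big[|a_j|(\|b_j\|_1 + |c_j|)\big] \le (1 + \varepsilon)\|f\|_{\mathcal{B}_2}.$$

Define the events $E_1 = \{\mathcal{E}(\Theta) < 3\|f\|^2_{\mathcal{B}_2}/m\}$ and $E_2 = \{\|\Theta\|_P < 2\|f\|_{\mathcal{B}_2}\}$. By Markov's inequality, we have

$$\mathbb{P}\{E_1\} = 1 - \mathbb{P}\{E_1^c\} \ge 1 - \frac{\mathbb{E}_\Theta[\mathcal{E}(\Theta)]}{3\|f\|^2_{\mathcal{B}_2}/m} \ge \frac{2 - \varepsilon}{3},$$

$$\mathbb{P}\{E_2\} = 1 - \mathbb{P}\{E_2^c\} \ge 1 - \frac{\mathbb{E}_\Theta[\|\Theta\|_P]}{2\|f\|_{\mathcal{B}_2}} \ge \frac{1 - \varepsilon}{2}.$$

Therefore we have

$$\mathbb{P}\{E_1 \cap E_2\} \ge \mathbb{P}\{E_1\} + \mathbb{P}\{E_2\} - 1 \ge \frac{2 - \varepsilon}{3} + \frac{1 - \varepsilon}{2} - 1 = \frac{1 - 5\varepsilon}{6} > 0.$$

Choose any $\Theta$ in $E_1 \cap E_2$. Since $\|f\|_{\mathcal{B}_2} = \|f\|_{\mathcal{B}}$ by Proposition 1, the two-layer neural network model defined by this $\Theta$ satisfies both requirements in the theorem. ∎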
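The Monte Carlo argument can be illustrated with a small experiment. This is a sketch under assumptions that are not in the paper: $\rho$ is a hypothetical discrete distribution with three atoms, $\mu$ is replaced by an empirical measure on 50 random points of $[0,1]^2$, and the second moment of this particular $\rho$ stands in for $\|f\|^2_{\mathcal{B}_2}$ (it is only an upper bound on it).

```python
import random


def relu(t):
    return t if t > 0.0 else 0.0


def phi(x, theta):
    """Single-neuron feature phi(x; theta) = a * relu(b . x + c)."""
    a, b, c = theta
    return a * relu(sum(bi * xi for bi, xi in zip(b, x)) + c)


# Hypothetical discrete rho: atoms theta_k = (a, b, c) with probabilities p_k.
ATOMS = [
    (2.0, [1.0, -1.0], 0.0),
    (-1.0, [0.5, 0.5], -0.25),
    (4.0, [0.0, 2.0], 1.0),
]
PROBS = [0.5, 0.3, 0.2]


def f_exact(x):
    """f(x) = E_rho[phi(x; theta)], computable exactly because rho is discrete."""
    return sum(p * phi(x, th) for p, th in zip(PROBS, ATOMS))


def second_moment():
    """E_rho[|a|^2 (||b||_1 + |c|)^2], an upper bound on ||f||_{B_2}^2."""
    return sum(p * (a * (sum(abs(bi) for bi in b) + abs(c))) ** 2
               for p, (a, b, c) in zip(PROBS, ATOMS))


def squared_error(m, xs, rng):
    """Empirical L2 error of one random m-neuron network drawn i.i.d. from rho."""
    thetas = rng.choices(ATOMS, weights=PROBS, k=m)
    total = 0.0
    for x in xs:
        fm = sum(phi(x, th) for th in thetas) / m
        total += (fm - f_exact(x)) ** 2
    return total / len(xs)


rng = random.Random(0)
xs = [[rng.random(), rng.random()] for _ in range(50)]
errors = {}
for m in (25, 100, 400):
    # Average over 20 independent draws of Theta to estimate E_Theta[E(Theta)].
    errors[m] = sum(squared_error(m, xs, rng) for _ in range(20)) / 20
    print(m, errors[m], 3.0 * second_moment() / m)
```

The averaged error should shrink roughly in proportion to $1/m$ and stay below the printed reference value $3\,\mathbb{E}_\rho[|a|^2(\|b\|_1 + |c|)^2]/m$.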

#### 2.4.3 Proof of Theorem 5

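Without loss of generality, we assume that $\|b\|_1 + |c| = 1$; otherwise, due to the scaling invariance of $\sigma$, we can redefine the parameters as follows,

$$a \leftarrow a(\|b\|_1 + |c|), \qquad b \leftarrow \frac{b}{\|b\|_1 + |c|}, \qquad c \leftarrow \frac{c}{\|b\|_1 + |c|}.$$

Let $\Theta_m = \{(a_k^{(m)}, b_k^{(m)}, c_k^{(m)})\}_{k=1}^m$ be the parameters in the two-layer neural network model $f_m$, and let $A_m = \frac{1}{m}\sum_{k=1}^m |a_k^{(m)}|$ and $\alpha_k = |a_k^{(m)}| / \sum_{j=1}^m |a_j^{(m)}|$. Then we can define a probability measure:

$$\rho_m = \sum_{k=1}^m \alpha_k\,\delta\big(a - \mathrm{sign}(a_k^{(m)})A_m\big)\,\delta\big(b - b_k^{(m)}\big)\,\delta\big(c - c_k^{(m)}\big),$$

which satisfies

$$f_m(x; \Theta_m) = \int a\,\sigma(b^T x + c)\,\rho_m(da, db, dc).$$

Let

$$K_Q = \{(a, b, c) : |a| \le Q,\ \|b\|_1 + |c| \le 1\}.$$

It is obvious that $\mathrm{supp}(\rho_m) \subset K_Q$ for all $m$. Since $K_Q$ is compact, the sequence of probability measures $(\rho_m)$ is tight. By Prokhorov's Theorem, there exists a subsequence $(\rho_{m_k})$ and a probability measure $\rho^*$ such that $\rho_{m_k}$ converges weakly to $\rho^*$.

The fact that $\mathrm{supp}(\rho_m) \subset K_Q$ implies $\mathrm{supp}(\rho^*) \subset K_Q$. Therefore, we have

$$\|f^*\|_{\mathcal{B}} = \|f^*\|_{\mathcal{B}_\infty} \le Q.$$

For any $x \in X$, $a\,\sigma(b^T x + c)$ is continuous with respect to $(a, b, c)$ and bounded from above by $Q$ on $K_Q$. Since $\rho^*$ is the weak limit of $\rho_{m_k}$, we have

$$f^*(x) = \lim_{k \to \infty}\int a\,\sigma(b^T x + c)\,d\rho_{m_k} = \int a\,\sigma(b^T x + c)\,d\rho^*(da, db, dc).$$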

#### 2.4.4 Proof of Theorem 6

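Let $w = (b^T, c)^T$ and, with a slight abuse of notation, let $x_i$ also denote the augmented sample $(x_i^T, 1)^T$. For any $\varepsilon > 0$ and $f \in \mathcal{F}_Q$, let $\rho_f^\varepsilon$ be a distribution satisfying $f(x) = \mathbb{E}_{\rho_f^\varepsilon}[a\,\sigma(w^T x)]$ and $\mathbb{E}_{\rho_f^\varepsilon}[|a|\,\|w\|_1] < (1 + \varepsilon)\|f\|_{\mathcal{B}}$. Then,

$$\begin{aligned}
n\,\mathrm{Rad}_n(\mathcal{F}_Q) &= \mathbb{E}_\xi\Big[\sup_{f \in \mathcal{F}_Q}\sum_{i=1}^n \xi_i\,\mathbb{E}_{\rho_f^\varepsilon}[a\,\sigma(w^T x_i)]\Big] \\
&= \mathbb{E}_\xi\Big[\sup_{f \in \mathcal{F}_Q}\mathbb{E}_{\rho_f^\varepsilon}\Big[\sum_{i=1}^n \xi_i\,a\,\sigma(w^T x_i)\Big]\Big] \\
&\le \mathbb{E}_\xi\Big[\sup_{f \in \mathcal{F}_Q}\mathbb{E}_{\rho_f^\varepsilon}\Big[|a|\,\|w\|_1\Big|\sum_{i=1}^n \xi_i\,\sigma(\hat{w}^T x_i)\Big|\Big]\Big] \\
&\le (1 + \varepsilon)\,Q\,\mathbb{E}_\xi\Big[\sup_{\|w\|_1 \le 1}\Big|\sum_{i=1}^n \xi_i\,\sigma(w^T x_i)\Big|\Big].
\end{aligned} \tag{22}$$

Due to the symmetry, we have

$$\begin{aligned}
\mathbb{E}_\xi\Big[\sup_{\|w\|_1 \le 1}\Big|\sum_{i=1}^n \xi_i\,\sigma(w^T x_i)\Big|\Big] &\le \mathbb{E}_\xi\Big[\sup_{\|w\|_1 \le 1}\sum_{i=1}^n \xi_i\,\sigma(w^T x_i)\Big] + \mathbb{E}_\xi\Big[\sup_{\|w\|_1 \le 1}-\sum_{i=1}^n \xi_i\,\sigma(w^T x_i)\Big] \\
&= 2\,\mathbb{E}_\xi\Big[\sup_{\|w\|_1 \le 1}\sum_{i=1}^n \xi_i\,\sigma(w^T x_i)\Big] \\
&\le 2\,\mathbb{E}_\xi\Big[\sup_{\|w\|_1 \le 1}\sum_{i=1}^n \xi_i\,w^T x_i\Big],
\end{aligned} \tag{23}$$

where the last inequality follows from the contraction property of Rademacher complexity (see Lemma 26.9 in [18]) and the fact that $\sigma$ is Lipschitz continuous with Lipschitz constant $1$. Applying Lemma 26.11 in [18] and plugging (23) into (22), we obtain

$$\mathrm{Rad}_n(\mathcal{F}_Q) \le 2(1 + \varepsilon)\,Q\sqrt{\frac{2\ln(2d)}{n}}. \tag{24}$$

Taking $\varepsilon \to 0$, we complete the proof. ∎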
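The last step of the proof bounds the Rademacher complexity of the linear class $\{x \mapsto w^T x : \|w\|_1 \le 1\}$. For this class the supremum has a closed form, $\sup_{\|w\|_1 \le 1}\sum_i \xi_i w^T x_i = \|\sum_i \xi_i x_i\|_\infty$, so the quantity can be estimated by direct sampling. The sketch below uses hypothetical random data in $[0,1]^d$ and illustrative function names, and compares the estimate with the bound $\sqrt{2\ln(2d)/n}$, which holds when $\|x_i\|_\infty \le 1$.

```python
import math
import random


def rademacher_linear_l1(xs, n_draws, rng):
    """Monte Carlo estimate of (1/n) * E_xi || sum_i xi_i x_i ||_inf.

    This equals the empirical Rademacher complexity of
    {x -> w . x : ||w||_1 <= 1} on the sample xs.
    """
    n = len(xs)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        column_sums = [sum(s * x[j] for s, x in zip(signs, xs)) for j in range(dim)]
        total += max(abs(v) for v in column_sums)
    return total / (n_draws * n)


def l1_linear_bound(n, dim):
    """sqrt(2 ln(2 dim) / n), valid when every sample has sup-norm at most 1."""
    return math.sqrt(2.0 * math.log(2.0 * dim) / n)


rng = random.Random(0)
n, dim = 100, 5
xs = [[rng.random() for _ in range(dim)] for _ in range(n)]  # points in [0, 1]^dim
estimate = rademacher_linear_l1(xs, n_draws=200, rng=rng)
bound = l1_linear_bound(n, dim)
print(estimate, bound)
```

Multiplying this linear-class bound by $2Q$, as in (22) and (23), gives the right-hand side of (11).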

## 3 Compositional function spaces

In this section, we carry out a similar program for residual neural networks. Due to the compositional and incremental form of the residual networks, the natural function spaces and norms associated with the residual neural networks are also compositional in nature. We will call them compositional spaces and compositional norms, respectively. Similar to what was done in the last section, we establish a natural connection between these function spaces and residual neural networks, by proving direct and inverse approximation theorems. We also prove an optimal complexity bound for the compositional space.

We postpone all the proofs to the end of this section.

### 3.1 The compositional law of large numbers

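We consider residual neural networks defined by

$$\begin{aligned}
z_{0,L}(x) &= Vx, \\
z_{l+1,L}(x) &= z_{l,L}(x) + \frac{1}{L}\,U_l\,\sigma\circ(W_l z_{l,L}(x)), \\
f_L(x; \Theta) &= \alpha^T z_{L,L}(x),
\end{aligned} \tag{25}$$

where $x \in \mathbb{R}^d$ is the input, $V \in \mathbb{R}^{D \times d}$, $W_l \in \mathbb{R}^{m \times D}$, $U_l \in \mathbb{R}^{D \times m}$, $\alpha \in \mathbb{R}^D$, and we use $\Theta := \{U_0, \dots, U_{L-1}, W_0, \dots, W_{L-1}, \alpha\}$ to denote all the parameters to be learned from data. Without loss of generality, we will fix $V$ to be

$$V = \begin{bmatrix} I_{d \times d} \\ 0_{(D-d) \times d} \end{bmatrix}. \tag{26}$$

We will fix $D$ and $m$ throughout this paper, and when there is no danger of confusion we will omit $\Theta$ in the notation and use $f_L(x)$ to denote the residual network for simplicity.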

For two-layer neural networks, if the parameters $\{(a_k, b_k, c_k)\}$ are sampled i.i.d. from a probability distribution $\rho$, then we have

$$\frac{1}{m}\sum_{k=1}^m a_k\,\sigma(b_k^T x + c_k) \to \int a\,\sigma(b^T x + c)\,\rho(da, db, dc), \tag{27}$$

when $m \to \infty$, as a consequence of the law of large numbers. To get some intuition in the current situation, we will first study a similar setting for residual networks in which $U_l$ and $W_l$ are sampled i.i.d. from a probability distribution $\rho$ on $\mathbb{R}^{D \times m} \times \mathbb{R}^{m \times D}$. To this end, we will study the behavior of $z_{L,L}(\cdot)$ as $L \to \infty$. The sequence of mappings we obtain is the repeated composition of many i.i.d. random near-identity maps.

The following theorem can be viewed as a compositional version of the law of large numbers. The “compositional mean” is defined with the help of the following ordinary differential equation (ODE) system:

$$\begin{aligned}
z(x, 0) &= Vx, \\
\frac{d}{dt}z(x, t) &= \mathbb{E}_{(U,W) \sim \rho}\,U\sigma(Wz(x, t)).
\end{aligned} \tag{28}$$

###### Theorem 7.

Assume that $\sigma$ is Lipschitz continuous and

$$\mathbb{E}_\rho\big\||U||W|\big\|_F^2 < \infty. \tag{29}$$

Then, the ODE (28) has a unique solution. For any $x \in X$, we have

$$z_{L,L}(x) \to z(x, 1) \tag{30}$$

in probability as $L \to +\infty$. Moreover, we have

$$\lim_{L \to \infty}\sup_{x \in X}\mathbb{E}\|z_{L,L}(x) - z(x, 1)\|^2 = 0, \tag{31}$$

which means the convergence is uniform with respect to $x \in X$.
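A small simulation can illustrate the compositional law of large numbers. Everything specific in this sketch is an assumption made for illustration: $\rho$ is uniform on three fixed pairs $(U, W)$ with $D = d = m = 2$ (so $V$ is the identity), the ODE (28) is integrated with a plain forward Euler scheme, and the error is averaged over 20 random networks.

```python
import random


def relu_vec(v):
    return [t if t > 0.0 else 0.0 for t in v]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Hypothetical rho: uniform distribution on three (U, W) pairs, all 2 x 2.
PAIRS = [
    ([[0.5, -0.3], [0.2, 0.4]], [[1.0, 0.0], [0.0, 1.0]]),
    ([[-0.4, 0.6], [0.3, -0.2]], [[0.5, -0.5], [0.5, 0.5]]),
    ([[0.1, 0.2], [-0.6, 0.3]], [[-1.0, 0.5], [0.2, 0.8]]),
]


def mean_field(z):
    """E_{(U,W)~rho}[U relu(W z)], the right-hand side of the ODE (28)."""
    out = [0.0] * len(z)
    for U, W in PAIRS:
        increment = matvec(U, relu_vec(matvec(W, z)))
        out = [o + inc / len(PAIRS) for o, inc in zip(out, increment)]
    return out


def ode_endpoint(z0, steps=2000):
    """Forward Euler approximation of z(x, 1)."""
    z = list(z0)
    h = 1.0 / steps
    for _ in range(steps):
        z = [zi + h * di for zi, di in zip(z, mean_field(z))]
    return z


def random_resnet_endpoint(z0, depth, rng):
    """z_{L,L}(x) for a depth-L residual network with i.i.d. layers drawn from rho."""
    z = list(z0)
    for _ in range(depth):
        U, W = rng.choice(PAIRS)
        increment = matvec(U, relu_vec(matvec(W, z)))
        z = [zi + inc / depth for zi, inc in zip(z, increment)]
    return z


def mean_distance(z0, depth, trials, rng):
    target = ode_endpoint(z0)
    total = 0.0
    for _ in range(trials):
        z = random_resnet_endpoint(z0, depth, rng)
        total += sum((a - b) ** 2 for a, b in zip(z, target)) ** 0.5
    return total / trials


rng = random.Random(0)
x0 = [0.8, 0.3]  # D = d here, so z(x, 0) = V x = x
distances = {depth: mean_distance(x0, depth, 20, rng) for depth in (100, 6400)}
print(distances)
```

The average distance to the ODE endpoint should be clearly smaller for the deeper network, in line with (30) and (31).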

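This result can be extended to situations where the distribution $\rho$ is time-dependent, which is the right setting in applications.

###### Theorem 8.

Let $\{\rho_t,\ t \in [0, 1]\}$ be a family of probability distributions on $\mathbb{R}^{D \times m} \times \mathbb{R}^{m \times D}$ with the property that there exist constants $c_1$ and $c_2$ such that

$$\mathbb{E}_{\rho_t}\big\||U||W|\big\|_F^2 < c_1 \tag{32}$$

and

$$\big|\mathbb{E}_{\rho_t}U\sigma(Wz) - \mathbb{E}_{\rho_s}U\sigma(Wz)\big| \le c_2|t - s||z| \tag{33}$$

for all $s, t \in [0, 1]$ and all $z \in \mathbb{R}^D$. Assume $\sigma$ is Lipschitz continuous. Let $z$ be the solution of the following ODE,

$$\begin{aligned}
z(x, 0) &= Vx, \\
\frac{d}{dt}z(x, t) &= \mathbb{E}_{(U,W) \sim \rho_t}\,U\sigma(Wz(x, t)).
\end{aligned}$$

Then, for any fixed $x \in X$, we have

$$z_{L,L}(x) \to z(x, 1) \tag{34}$$

in probability as $L \to +\infty$. Moreover, the convergence is uniform in $x$.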

Similar results have been proved in the context of stochastic approximation, for example in [14, 7].

### 3.2 The compositional function spaces

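Motivated by the previous results, we consider the set of functions $f_{\alpha, \{\rho_t\}}$ defined by:

$$\begin{aligned}
z(x, 0) &= Vx, \\
\dot{z}(x, t) &= \mathbb{E}_{(U,W) \sim \rho_t}\,U\sigma(Wz(x, t)), \\
f_{\alpha, \{\rho_t\}}(x) &= \alpha^T z(x, 1),
\end{aligned} \tag{35}$$

where $V$ is given in (26), $U \in \mathbb{R}^{D \times m}$, $W \in \mathbb{R}^{m \times D}$, and $\alpha \in \mathbb{R}^D$. To define a norm for these functions, we consider the following linear ODEs ($p \ge 1$)

$$\begin{aligned}
N_p(0) &= e, \\
\dot{N}_p(t) &= \big(\mathbb{E}_{\rho_t}(|U||W|)^p\big)^{1/p}N_p(t),
\end{aligned} \tag{36}$$

where $e$ is the all-one vector in $\mathbb{R}^D$, and the matrices $|U|$ and $|W|$ are obtained from $U$ and $W$ respectively by taking element-wise absolute values. This linear system of equations has a unique solution as long as the expected value is integrable as a function of $t$. If $f$ admits a representation as in (35), we can define the $\mathcal{D}_p$ norm of $f$.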

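###### Definition 2.

Let $f$ be a function that satisfies $f = f_{\alpha, \{\rho_t\}}$ for a pair $(\alpha, \{\rho_t\})$. Then we define

$$\|f\|_{\mathcal{D}_p(\alpha, \{\rho_t\})} = |\alpha|^T N_p(1) \tag{37}$$

to be the $\mathcal{D}_p$ norm of $f$ with respect to the pair $(\alpha, \{\rho_t\})$, where $|\alpha|$ is obtained from $\alpha$ by taking element-wise absolute values. We define

$$\|f\|_{\mathcal{D}_p} = \inf_{f = f_{\alpha, \{\rho_t\}}} |\alpha|^T N_p(1) \tag{38}$$

to be the $\mathcal{D}_p$ norm of $f$.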

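As an example, if $\rho_t$ is constant in $t$, then the solution of (36) is a matrix exponential applied to $e$, and the norm becomes

$$\|f\|_{\mathcal{D}_p} = \inf_{f = f_{\alpha, \rho}} |\alpha|^T \exp\Big(\big(\mathbb{E}_\rho(|U||W|)^p\big)^{1/p}\Big)\,e. \tag{39}$$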
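For a time-independent $\rho$ and $p = 1$, the quantity $|\alpha|^T N_1(1)$ for one particular representation can be computed with a truncated Taylor series for the matrix exponential. The sketch below assumes a hypothetical discrete $\rho$ that is uniform on a list of $(U, W)$ pairs; all names are illustrative. Because the norm is an infimum over representations, the value computed here is only an upper bound on $\|f\|_{\mathcal{D}_1}$.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def abs_matmul(U, W):
    """|U| |W|, with absolute values taken element-wise before multiplying."""
    rows, inner, cols = len(U), len(W), len(W[0])
    return [[sum(abs(U[i][k]) * abs(W[k][j]) for k in range(inner))
             for j in range(cols)] for i in range(rows)]


def mean_abs_product(pairs):
    """E_rho[|U| |W|] for rho uniform on the given (U, W) pairs."""
    D = len(pairs[0][0])
    mean = [[0.0] * D for _ in range(D)]
    for U, W in pairs:
        product = abs_matmul(U, W)
        for i in range(D):
            for j in range(D):
                mean[i][j] += product[i][j] / len(pairs)
    return mean


def expm_times_vector(A, v, terms=40):
    """exp(A) v via the truncated series sum_k A^k v / k!."""
    result = list(v)
    term = list(v)
    for k in range(1, terms):
        term = [t / k for t in matvec(A, term)]
        result = [r + t for r, t in zip(result, term)]
    return result


def d1_representation_cost(alpha, pairs):
    """|alpha|^T N_1(1) with N_1(1) = exp(E_rho |U||W|) e, as in (39) with p = 1."""
    A = mean_abs_product(pairs)
    n1 = expm_times_vector(A, [1.0] * len(A))
    return sum(abs(a) * n for a, n in zip(alpha, n1))


# One-dimensional check (D = m = 1): E|U||W| = 1, so N_1(1) = e and the cost is 3e.
scalar_cost = d1_representation_cost([3.0], [([[2.0]], [[-0.5]])])

# A hypothetical D = 2, m = 2 example.
pairs = [
    ([[0.5, -0.3], [0.2, 0.4]], [[1.0, 0.0], [0.0, 1.0]]),
    ([[-0.4, 0.6], [0.3, -0.2]], [[0.5, -0.5], [0.5, 0.5]]),
]
cost = d1_representation_cost([1.0, -2.0], pairs)
print(scalar_cost, cost)
```

Since $\mathbb{E}_\rho|U||W|$ has non-negative entries, every component of $N_1(1)$ is at least $1$, so the cost is never smaller than $\|\alpha\|_1$.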

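Given this definition, the compositional function spaces on $X$ are defined as the set of continuous functions that can be represented as in (35) with finite $\mathcal{D}_p$ norm, where for any $t \in [0, 1]$, $\rho_t$ is a probability distribution defined on $(\Omega, \Sigma_\Omega)$, $\Omega = \mathbb{R}^{D \times m} \times \mathbb{R}^{m \times D}$, and $\Sigma_\Omega$ is the Borel $\sigma$-algebra on $\Omega$. We use $\mathcal{D}_p$ to denote these function spaces.

Besides $\mathcal{D}_p$, we introduce another class of function spaces $\tilde{\mathcal{D}}_p$, which independently controls $\|N_p(1)\|_1$. In these spaces we can provide good control of the Rademacher complexity.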

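###### Definition 3.

Let $f$ be a function that satisfies $f = f_{\alpha, \{\rho_t\}}$ for a pair $(\alpha, \{\rho_t\})$. Then we define

$$\|f\|_{\tilde{\mathcal{D}}_p(\alpha, \{\rho_t\})} = |\alpha|^T N_p(1) + \|N_p(1)\|_1 - D \tag{40}$$

to be the $\tilde{\mathcal{D}}_p$ norm of $f$ with respect to the pair $(\alpha, \{\rho_t\})$. We define

$$\|f\|_{\tilde{\mathcal{D}}_p} = \inf_{f = f_{\alpha, \{\rho_t\}}} |\alpha|^T N_p(1) + \|N_p(1)\|_1 - D \tag{41}$$

to be the $\tilde{\mathcal{D}}_p$ norm of $f$. The space $\tilde{\mathcal{D}}_p$ is defined as all the continuous functions that admit the representation in (35) with finite $\tilde{\mathcal{D}}_p$ norm.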

Note that in the definitions above, the only condition on $\{\rho_t\}$ is the existence and uniqueness of $z$ defined by (35). Hence, $\rho_t$ can be discontinuous as a function of $t$. However, the compositional law of large numbers, which is the underlying reason behind the approximation theorem that we will discuss next (Theorem 8), requires $\{\rho_t\}$ to satisfy some continuity condition. To that end, we define the following “Lipschitz coefficient” and “Lipschitz norm” for $\{\rho_t\}$.

###### Definition 4.

Given a family of probability distributions $\{\rho_t,\ t \in [0, 1]\}$, the “Lipschitz coefficient” of $\{\rho_t\}$, which is denoted by $\mathrm{Lip}_{\{\rho_t\}}$, is defined as the infimum of all the numbers $\mathrm{Lip}_{\{\rho_t\}}$ that satisfy

$$\big|\mathbb{E}_{\rho_t}U\sigma(Wz) - \mathbb{E}_{\rho_s}U\sigma(Wz)\big| \le \mathrm{Lip}_{\{\rho_t\}}|t - s||z|, \tag{42}$$

and

 ∣∣∥∥Eρt|U||W|∥∥1,1−∥