The task of supervised learning is to approximate a function using a given set of data. This type of problem has long been the subject of classical numerical analysis and approximation theory. The theory of splines and the theory of finite element methods are very successful examples of such classical results [10, 9]; both are concerned with approximating functions using piecewise polynomials. In these theories, one starts from a function in a particular function space, say a Sobolev or Besov space, and proceeds to derive optimal error estimates for this function. The optimal error estimates depend on the regularity encoded in the function space as well as the approximation scheme. They are the most important pieces of information for understanding the underlying approximation scheme. It should be noted that when discussing a particular function space, we are concerned not just with the set of functions it contains; the norm associated with the space is also crucial.
Identifying the right function space is the most crucial step in this analysis. Sobolev/Besov-type spaces are good function spaces for these classical theories for two reasons:
1. One can prove direct and inverse approximation theorems for these spaces. Roughly speaking, a function can be approximated by piecewise polynomials with a certain convergence rate if and only if the function is in a certain Sobolev/Besov space.
2. The functions we are interested in, e.g. solutions of partial differential equations (PDEs), are in these spaces. This is at the heart of the regularity theory for PDEs.
However, these spaces are tied to the piecewise polynomial basis used in the approximation scheme. These approximation schemes suffer from the curse of dimensionality, i.e. the number of parameters needed to achieve a certain level of accuracy grows exponentially with the dimension. Consequently, Sobolev/Besov-type spaces are not the right function spaces for studying machine learning models that can potentially address the curse of dimensionality.
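As a rough illustration of this exponential growth, the sketch below counts the coefficients of a piecewise-constant approximation on a uniform tensor-product grid over the unit cube. The function name `grid_parameter_count` and the choice of piecewise constants are assumptions made for illustration only; higher-order piecewise polynomials change the constants but not the exponential dependence on the dimension.

```python
def grid_parameter_count(cells_per_axis: int, dim: int) -> int:
    """Number of cells (one coefficient each) in a uniform grid on [0, 1]^dim."""
    return cells_per_axis ** dim


if __name__ == "__main__":
    # A mesh width of 1/10 per axis: the parameter count is 10^dim.
    for dim in (1, 2, 10, 100):
        print(f"dim={dim:3d}  parameters={grid_parameter_count(10, dim)}")
```

With the mesh width held fixed, each additional input dimension multiplies the number of parameters by the number of cells per axis.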
Another inspiration for this paper comes from kernel methods. It is well-known that the right function space associated with a kernel method is the corresponding reproducing kernel Hilbert space (RKHS). RKHS and kernel methods provide one of the first examples for which dimension-independent error estimates can be established.
The main purpose of this paper is to construct and identify the analog of these spaces for two-layer and residual neural network models. For two-layer neural network models, we show that the right function space is the so-called “Barron space”. Roughly speaking, a function belongs to the Barron space if and only if it can be approximated by “well-behaved” two-layer neural networks. The analog of the Barron space for deep residual neural networks is the “compositional function space” that we construct in the second part of this paper. We will establish direct and inverse approximation theorems for these spaces as well as the optimal Rademacher complexity estimates.
One important difference between approximation theory in low and high dimensions is that in high dimensions, the best error rate (or order of convergence) that one can hope for is the Monte Carlo error rate. Therefore, using the error rate as an indicator to distinguish the quality of different approximation schemes or machine learning models is not a good option. The function spaces or the associated norms seem to be a better alternative. We take the viewpoint that a function space is defined by its approximation property using a particular approximation scheme. In this sense, Sobolev/Besov spaces are the result when we consider approximation by piecewise polynomials or wavelets. The Barron space is the analog when we consider approximation by two-layer neural networks, and the compositional function space is the analog when we consider approximation by deep residual networks. The norms that are associated with these spaces may seem a bit unusual at first sight, but they arise naturally in the approximation process, when we consider direct and inverse approximation theorems.
Although this work was motivated by the problem of understanding approximation theory for neural network models in machine learning, we believe that it should have implications for high dimensional analysis in general. One natural follow-up question is whether one can show that solutions to high dimensional partial differential equations (PDEs) belong to the function spaces introduced here. At least for linear parabolic PDEs, the work in  suggests that some close analog of the compositional spaces should serve the purpose.
In Section 2, we introduce the Barron space for two-layer neural networks. Although not all the results in this section are new (some have appeared in various forms in [13, 2]), they are useful for illustrating our angle of attack and they are also useful for the work in Section 3 where we introduce the compositional function space for residual networks.
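Notations: Let $\mathbb{S}^d = \{w \in \mathbb{R}^{d+1} : \|w\|_1 = 1\}$. We define $\hat{w} = w/\|w\|_1$ if $w \neq 0$; otherwise $\hat{w} = 0$. For simplicity, we fix the domain of interest to be $X = [0,1]^d$. We denote by $x \in X$ the input variable, and let $\tilde{x} = (x^T, 1)^T$. We sometimes abuse notation and use $f(x)$ (or some other analogs) to denote the function $f$ in order to signify the independent variable under consideration. We use $\|f\|$ to denote the $L^2$ norm of the function $f$, defined by
$$\|f\| = \left( \int_X |f(x)|^2 \, \mu(dx) \right)^{1/2},$$
where $\mu$ is a probability distribution on $X$. We do not specify $\mu$ in this paper.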
One important point for working in high dimension is the dependence of the constants on the dimension. We will use $C$ to denote constants that are independent of the dimension.
2 Barron spaces
2.1 Definition of the Barron spaces
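We will consider functions $f: X \to \mathbb{R}$ that admit the following representation
$$f(x) = \int_\Omega a \, \sigma(b^T x + c) \, \rho(da, db, dc), \quad x \in X, \tag{2}$$
where $\Omega = \mathbb{R}^1 \times \mathbb{R}^d \times \mathbb{R}^1$, $\rho$ is a probability distribution on ($\Omega$, $\Sigma_\Omega$), with $\Sigma_\Omega$ being a Borel $\sigma$-algebra on $\Omega$, and $\sigma(z) = \max\{z, 0\}$ is the ReLU activation function. This representation is the continuum analog of two-layer neural networks
$$f_m(x; \Theta) = \frac{1}{m} \sum_{j=1}^m a_j \, \sigma(b_j^T x + c_j),$$
where $\Theta = (a_1, b_1, c_1, \dots, a_m, b_m, c_m)$ denotes all the parameters. It should be noted that in general, the $\rho$’s for which (2) holds are not unique.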
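The non-uniqueness can be seen already at the level of finite networks. The sketch below assumes the usual two-layer ReLU form, $f_m(x) = \frac{1}{m}\sum_j a_j \max(b_j^T x + c_j, 0)$, and checks numerically that rescaling each neuron as $(a_j, b_j, c_j) \mapsto (a_j/\lambda_j, \lambda_j b_j, \lambda_j c_j)$ with $\lambda_j > 0$ leaves the function unchanged. The helper names `two_layer_net` and `rescale` are hypothetical and introduced only for this example.

```python
import numpy as np


def relu(z):
    """ReLU activation, applied elementwise."""
    return np.maximum(z, 0.0)


def two_layer_net(x, a, b, c):
    """Evaluate (1/m) * sum_j a_j * relu(b_j . x + c_j) for each row of x.

    Shapes: x is (n, d), a is (m,), b is (m, d), c is (m,); returns (n,).
    """
    hidden = relu(x @ b.T + c)       # (n, m) hidden-layer outputs
    return hidden @ a / a.shape[0]   # average over the m neurons


def rescale(a, b, c, lam):
    """Map (a, b, c) to (a / lam, lam * b, lam * c) with per-neuron lam > 0."""
    return a / lam, b * lam[:, None], c * lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 5, 3, 8
    x = rng.uniform(0.0, 1.0, size=(n, d))
    a, b, c = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
    lam = rng.uniform(0.5, 2.0, size=m)
    same = np.allclose(two_layer_net(x, a, b, c),
                       two_layer_net(x, *rescale(a, b, c, lam)))
    print("rescaled parameters give the same function:", same)
```

The rescaling works because ReLU is positively homogeneous, so different parameter distributions can represent the same function.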
To get some intuition about the representation (2), we write the Fourier representation of a function as:
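For functions that admit the representation (2), we define the Barron norm:
$$\|f\|_{\mathcal{B}_p} = \inf_{\rho} \left( \mathbb{E}_\rho \left[ |a|^p (\|b\|_1 + |c|)^p \right] \right)^{1/p}, \quad 1 \le p \le +\infty.$$
Here the infimum is taken over all $\rho$ for which (2) holds for all $x \in X$. Barron spaces $\mathcal{B}_p$ are defined as the set of continuous functions that can be represented by (2) with finite Barron norm. We name these spaces after Barron to honor his contribution to the mathematical analysis of two-layer neural networks, in particular the work in [4, 5, 13].
As a consequence of Hölder’s inequality, we trivially have
$$\mathcal{B}_\infty \subset \cdots \subset \mathcal{B}_2 \subset \mathcal{B}_1.$$
However, the reverse inclusion also holds.
Proposition 1. For any $f \in \mathcal{B}_1$, we have $f \in \mathcal{B}_\infty$ and
$$\|f\|_{\mathcal{B}_1} = \|f\|_{\mathcal{B}_\infty}.$$
As a consequence, we have that for any $1 \le p \le \infty$, $\mathcal{B}_p = \mathcal{B}_\infty$ and $\|f\|_{\mathcal{B}_p} = \|f\|_{\mathcal{B}_\infty}$. Hence, we can use $\mathcal{B}$ and $\|\cdot\|_{\mathcal{B}}$ to denote the Barron space and Barron norm.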
A natural question is: What kind of functions are in the Barron space? The following is a restatement of an important result proved in . It is an extension of the Fourier analysis of two-layer sigmoidal neural networks in Barron’s seminal work  (see also  for another proof).
In addition, the Barron space is also closely related to a family of RKHS. Let . Due to the scaling invariance of , we can assume . Then (2) can be written as
where denotes the collection of the probability measures over , is the Borel -algebra on .
Given a fixed probability distribution , we can define a kernel:
Let denote the RKHS induced by . Then we have the following theorem.
According to , we have the following characterization of :
In addition, for any , . It is obvious that for any , , which implies that . Conversely, for any , there exists a probability distribution that satisfies
and . Hence we have , which implies . Therefore . Together with Proposition 1, we complete the proof. ∎
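To make the kernel construction concrete, the following sketch estimates a kernel of the form $k_\pi(x, x') = \mathbb{E}_{w \sim \pi}[\sigma(w^T \tilde{x}) \sigma(w^T \tilde{x}')]$ by Monte Carlo, with $\sigma$ taken to be ReLU and $\tilde{x} = (x^T, 1)^T$. Both this form of the kernel and the particular distribution $\pi$ used here (signed exponential draws normalized to unit $\ell_1$ norm) are assumptions made for illustration, and the helper names are hypothetical.

```python
import numpy as np


def sample_l1_sphere(n_samples, dim, rng):
    """Draw points with unit l1 norm (one convenient, assumed choice of pi)."""
    magnitudes = rng.exponential(size=(n_samples, dim))
    signs = rng.choice([-1.0, 1.0], size=(n_samples, dim))
    w = magnitudes * signs
    return w / np.abs(w).sum(axis=1, keepdims=True)


def random_feature_gram(x, w):
    """Monte Carlo estimate of k(x_i, x_j) = E_w[relu(w . x~_i) * relu(w . x~_j)]."""
    x_tilde = np.hstack([x, np.ones((x.shape[0], 1))])   # append the bias coordinate
    features = np.maximum(x_tilde @ w.T, 0.0)            # (n, M) random ReLU features
    return features @ features.T / w.shape[0]            # average over the M samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=(6, 4))
    w = sample_l1_sphere(5000, 5, rng)
    gram = random_feature_gram(x, w)
    print("smallest eigenvalue of the Gram matrix:", np.linalg.eigvalsh(gram).min())
```

Since the estimate is an average of outer products of feature vectors, the resulting Gram matrix is symmetric and positive semidefinite, as a kernel matrix must be.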
2.2 Direct and inverse approximation theorems
With (2), approximating by two-layer networks becomes a Monte Carlo integration problem.
For any , there exists a two-layer neural network such that
Furthermore, we have
We call the path norm of two-layer neural networks. This is the analog of the Barron norm of functions in . Hence, when studying approximation properties, it is natural to study two-layer neural networks with bounded path norm.
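The Monte Carlo viewpoint can be sketched numerically. The example below assumes a toy distribution $\rho$ chosen so that the target has a closed form: $a = 1$, $b$ a uniformly chosen coordinate vector, and $c \sim \mathrm{Uniform}(-1, 0)$, which gives $f(x) = \frac{1}{2d} \sum_k x_k^2$ on $[0,1]^d$. It also assumes the path norm has the form $\frac{1}{m} \sum_j |a_j| (\|b_j\|_1 + |c_j|)$. The distribution, the function names, and the sample sizes are all hypothetical choices for illustration.

```python
import numpy as np


def sample_neurons(m, d, rng):
    """Draw m neurons (a_j, b_j, c_j) i.i.d. from the toy distribution rho."""
    k = rng.integers(0, d, size=m)
    b = np.zeros((m, d))
    b[np.arange(m), k] = 1.0             # b_j is a random coordinate vector
    c = rng.uniform(-1.0, 0.0, size=m)   # random bias in (-1, 0)
    a = np.ones(m)
    return a, b, c


def two_layer_net(x, a, b, c):
    """Empirical average (1/m) * sum_j a_j * relu(b_j . x + c_j)."""
    return np.maximum(x @ b.T + c, 0.0) @ a / a.shape[0]


def path_norm(a, b, c):
    """Assumed path norm: (1/m) * sum_j |a_j| * (||b_j||_1 + |c_j|)."""
    return float(np.mean(np.abs(a) * (np.abs(b).sum(axis=1) + np.abs(c))))


def target(x):
    """Closed form of E_rho[a * relu(b . x + c)] for x in [0, 1]^d."""
    return 0.5 * np.mean(x ** 2, axis=1)


def mean_squared_error(m, d=10, n_test=500, seed=0):
    """Squared L2 error of an m-neuron sample, averaged over uniform test points."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_test, d))
    a, b, c = sample_neurons(m, d, rng)
    return float(np.mean((two_layer_net(x, a, b, c) - target(x)) ** 2))


if __name__ == "__main__":
    for m in (10, 100, 1000, 10000):
        print(f"m={m:6d}  squared error={mean_squared_error(m):.2e}")
```

In this toy setting the squared error is expected to shrink roughly like $1/m$ regardless of $d$, while the path norm of the sampled network stays bounded (here between 1 and 2).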
One can also prove an inverse approximation theorem. To state this result, we define:
Let be a continuous function on . Assume there exists a constant and a sequence of functions such that
for all . Then there exists a probability distribution on , such that
for all . Furthermore, we have with
2.3 Estimates of the Rademacher complexity
Next, we show that the Barron spaces we defined have low complexity. We show this by bounding the Rademacher complexity of bounded sets in the Barron spaces.
Definition 1 (Rademacher complexity).
The following theorem gives an estimate of the Rademacher complexity of the Barron space. Similar results can be found in . We include the proof for completeness.
Let . Then we have
From Theorem 8 in , we see that the above result implies that functions in the Barron spaces can be learned efficiently.
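For readers less familiar with Rademacher complexity, the sketch below estimates the empirical quantity $\frac{1}{n} \mathbb{E}_\xi \sup_f \sum_i \xi_i f(x_i)$ by Monte Carlo for a much simpler class than the Barron ball: linear functions $x \mapsto w^T x$ with $\|w\|_1 \le 1$. For this class the supremum has the closed form $\|\frac{1}{n} \sum_i \xi_i x_i\|_\infty$, and the standard bound $\max_i \|x_i\|_\infty \sqrt{2 \log(2d)/n}$ applies. The choice of class, the function name, and the sample sizes are assumptions made for illustration.

```python
import numpy as np


def empirical_rademacher_l1_linear(x, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w . x : ||w||_1 <= 1} on the sample x of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    xi = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    # sup over the l1 ball of (1/n) * sum_i xi_i * (w . x_i) is an l-infinity norm.
    sups = np.abs(xi @ x / n).max(axis=1)
    return float(sups.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 20
    x = rng.uniform(0.0, 1.0, size=(n, d))
    estimate = empirical_rademacher_l1_linear(x)
    bound = np.abs(x).max() * np.sqrt(2.0 * np.log(2 * d) / n)
    print(f"estimate={estimate:.4f}  bound={bound:.4f}")
```

The estimate decays like $1/\sqrt{n}$ and depends on the dimension only logarithmically, which is the kind of behavior the complexity bound for the Barron space is meant to capture.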
2.4.1 Proof of Theorem 1
Take . For any , there exists a probability measure that satisfies
Let , and consider two measures and on defined by
for any Borel set , where
Obviously , and
Next, we define extensions of and to by
for any Borel sets , and let . Then we have and
Therefore, we can normalize to be a probability measure, and
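Taking the limit $\varepsilon \to 0$, we have $\|f\|_{\mathcal{B}_\infty} \le \|f\|_{\mathcal{B}_1}$. Since $\|f\|_{\mathcal{B}_1} \le \|f\|_{\mathcal{B}_\infty}$ from Hölder’s inequality, we conclude that $\|f\|_{\mathcal{B}_1} = \|f\|_{\mathcal{B}_\infty}$. ∎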
2.4.2 Proof of Theorem 4
Let be a positive number such that . Let be a probability distribution such that and . Let with . Then we have . Let be i.i.d. random variables drawn from , and consider the following empirical average,
Let be the approximation error. Then we have
Define the event , and . By Markov’s inequality, we have
Therefore we have
Choose any in . The two-layer neural network model defined by this satisfies both requirements in the theorem. ∎
2.4.3 Proof of Theorem 5
Without loss of generality, we assume that , otherwise due to the scaling invariance of we can redefine the parameters as follows,
Let be the parameters in the two-layer neural network model and let and . Then we can define a probability measure:
It is obvious that for all . Since is compact, the sequence of probability measures is tight. By Prokhorov’s Theorem, there exists a subsequence and a probability measure such that converges weakly to .
The fact that implies . Therefore, we have
For any , is continuous with respect to and bounded from above by . Since is the weak limit of , we have
2.4.4 Proof of Theorem 6
Let and . For any and , let be a distribution satisfying and . Then,
Due to the symmetry, we have
where the last inequality follows from the contraction property of Rademacher complexity (see Lemma 26.9 in ) and that is Lipschitz continuous with Lipschitz constant . Applying Lemma 26.11 in  and plugging (2.4.4) into (2.4.4), we obtain
Taking , we complete the proof. ∎
3 Compositional function spaces
In this section, we carry out a similar program for residual neural networks. Due to the compositional and incremental form of the residual networks, the natural function spaces and norms associated with the residual neural networks are also compositional in nature. We will call them compositional spaces and compositional norms, respectively. Similar to what was done in the last section, we establish a natural connection between these function spaces and residual neural networks, by proving direct and inverse approximation theorems. We also prove an optimal complexity bound for the compositional space.
We postpone all the proofs to the end of this section.
3.1 The compositional law of large numbers
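We consider residual neural networks defined by
$$z_{0,L}(x) = V \tilde{x}, \qquad z_{l+1,L}(x) = z_{l,L}(x) + \frac{1}{L} U_l \, \sigma(W_l z_{l,L}(x)), \quad l = 0, \dots, L-1, \qquad f_L(x; \Theta) = \alpha^T z_{L,L}(x),$$
where $x \in X$ is the input, $V \in \mathbb{R}^{D \times (d+1)}$, $W_l \in \mathbb{R}^{m \times D}$, $U_l \in \mathbb{R}^{D \times m}$, $\alpha \in \mathbb{R}^D$, and we use $\Theta$ to denote all the parameters to be learned from data. Without loss of generality, we will fix $V$ to be
$$V = \begin{bmatrix} I_{d+1} \\ 0 \end{bmatrix}.$$
We will fix $D$ and $m$ throughout this paper, and when there is no danger of confusion we will omit $\Theta$ in the notation and use $f_L(x)$ to denote the residual network for simplicity.
For two-layer neural networks, if the parameters $\{(a_j, b_j, c_j)\}_{j=1}^m$ are sampled i.i.d. from a probability distribution $\rho$, then we have
$$\frac{1}{m} \sum_{j=1}^m a_j \, \sigma(b_j^T x + c_j) \to \int a \, \sigma(b^T x + c) \, \rho(da, db, dc) \quad \text{as } m \to \infty$$
as a consequence of the law of large numbers. To get some intuition in the current situation, we will first study a similar setting for residual networks in which $U_l$ and $W_l$ are sampled i.i.d. from a probability distribution $\rho$ on $\mathbb{R}^{D \times m} \times \mathbb{R}^{m \times D}$. To this end, we will study the behavior of $z_{L,L}(\cdot)$ as $L \to \infty$. The sequence of mappings we obtain is the repeated composition of many i.i.d. random near-identity maps.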
The following theorem can be viewed as a compositional version of the law of large numbers. The “compositional mean” is defined with the help of the following ordinary differential equation (ODE) system:
Assume that is Lipschitz continuous and
Then, the ODE (28) has a unique solution. For any , we have
in probability as . Moreover, we have
which means the convergence is uniform with respect to .
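A small simulation can illustrate this compositional law of large numbers. The sketch assumes the residual update $z_{l+1} = z_l + \frac{1}{L} U_l \sigma(W_l z_l)$ with ReLU $\sigma$ and the limiting ODE $\dot{z} = \mathbb{E}_{(U,W) \sim \rho}[U \sigma(W z)]$. The scalar distribution used here, $U = W = s$ with $s = \pm 1$ equally likely, is a hypothetical choice for which the mean field is $z/2$, so the ODE solution at time 1 is $z(0) e^{1/2}$.

```python
import numpy as np


def random_residual_flow(z0, depth, sample_uw, rng):
    """Compose `depth` random near-identity maps z -> z + U relu(W z) / depth."""
    z = np.array(z0, dtype=float)
    for _ in range(depth):
        u, w = sample_uw(rng)                       # fresh i.i.d. draw per layer
        z = z + u @ np.maximum(w @ z, 0.0) / depth
    return z


def sample_uw(rng):
    """Toy distribution: U = W = s with s uniform on {-1, +1} (1x1 matrices)."""
    s = rng.choice([-1.0, 1.0])
    return np.array([[s]]), np.array([[s]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ode_value = np.exp(0.5)                         # solution of z' = z / 2, z(0) = 1
    for depth in (10, 100, 1000, 10000):
        z = random_residual_flow([1.0], depth, sample_uw, rng)
        print(f"L={depth:6d}  z_L={z[0]:.4f}  ODE value={ode_value:.4f}")
```

For a positive starting point, each layer either multiplies $z$ by $(1 + 1/L)$ or leaves it unchanged, so the output is $(1 + 1/L)^K$ with $K$ binomial, which concentrates around $e^{1/2}$ as $L$ grows.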
This result can be extended to situations when the distribution is time-dependent, which is the right setting in applications.
Let be a family of probability distributions on with the property that there exist constants and such that
for all . Assume is Lipschitz continuous. Let be the solution of the following ODE,
Then, for any fixed , we have
in probability as . Moreover, the convergence is uniform in .
3.2 The compositional function spaces
Motivated by the previous results, we consider the set of functions defined by:
where is given in (26), , , and . To define a norm for these functions, we consider the following linear ODEs ()
where is the all-one vector in , and the matrices and are obtained from and respectively by taking element-wise absolute values. This linear system of equations has a unique solution as long as the expected value is integrable as a function of . If admits a representation as in (35), we can define the norm of .
Let be a function that satisfies for a pair of (), then we define
to be the norm of with respect to the pair (, ), where is obtained from by taking element-wise absolute values. We define
to be the norm of .
As an example, if is constant in , then the norm becomes
Given this definition, the compositional function spaces on are defined as the set of continuous functions that can be represented as in (35) with finite norm, where for any , is a probability distribution defined on , , is the Borel -algebra on . We use to denote these function spaces.
Besides , we introduce another class of function spaces , which independently controls . In these spaces we can provide good control of the Rademacher complexity.
Let be a function that satisfies for a pair of (), then we define
to be the norm of with respect to the pair (, ). We define
to be the norm of . The space is defined as all the continuous functions that admit the representation in (35) with finite norm.
Note that in the definitions above, the only condition on is the existence and uniqueness of defined by (35). Hence, can be discontinuous as a function . However, the compositional law of large numbers, which is the underlying reason behind the approximation theorem that we will discuss next (Theorem 8), requires to satisfy some continuity condition. To that end, we define the following “Lipschitz coefficient” and “Lipschitz norm” for
Given a family of probability distributions , the “Lipschitz coefficient” of , which is denoted by , is defined as the infimum of all the numbers that satisfy