An important problem in machine learning is to approximate a target function defined on some compact subset of a Euclidean space by a model, e.g., a neural network, a radial basis function network, or a kernel based model. A central problem in this theory is to estimate the complexity of approximation; i.e., (loosely speaking) to obtain a bound on the number of parameters in the model needed to ensure that the target function can be approximated by the model within a prescribed accuracy. Typically, the number of parameters grows exponentially as a function of the dimension of the Euclidean space, a phenomenon known as the curse of dimensionality.
One approach to mitigate the curse of dimensionality is to assume that the data comes from an unknown manifold of a low dimension embedded in the ambient Euclidean space. The subject of manifold learning deals with questions of approximation on this manifold, typically based on the eigenfunctions of some differential operator on the manifold or some kernel based methods (e.g., [29, 30, 25]). One major problem in this domain of ideas is that the models used for the approximation are based on the manifold alone, which is determined by the existing data. Therefore, if a new datum arrives, it might require a change of the manifold or, equivalently, starting the computation all over again. This is called the problem of out-of-sample extension. In kernel based methods, it is typically solved using the so-called Nyström extension, but estimating the degree of approximation on the ambient space is an open problem as far as we are aware.
In [26, 23], we have argued that deep networks are able to overcome the curse of dimensionality using what we have called the “blessing of compositionality”. We have observed that many functions of practical interest have a compositional structure. Although shallow networks cannot take advantage of this fact, deep networks can be built to have the same compositional structure. For example, we consider a function of variables with the structure
We then construct shallow networks to approximate the bivariate functions respectively. Under appropriate assumptions on the smoothness classes of these functions, the number of parameters in a deep network required to ensure an accuracy of in the approximation of is , where is a parameter associated with the smoothness of the functions involved. In contrast, a shallow network is unable to simulate the compositional structure, and hence, must treat as a function of variables. The resulting estimate on the number of parameters is then .
We note that compositionality is a property of the expression of a function, not an intrinsic property of the function itself. A simple example in the univariate case is the constant function , , that can also be expressed as a compositional function
It is therefore natural to ask for which functions shallow networks can already avoid the curse of dimensionality.
The main purpose of this paper is to address the following two problems: (1) dimension-independent bounds in approximation by shallow networks, (2) approximation bounds for an out-of-sample extension in manifold learning. There is a by-product of our results that is of interest in information based complexity. An important problem in that theory is to obtain bounds on the discretization error for integrals in a high dimensional setting. Much of the work in this direction is focused on integration on a cube (or the whole Euclidean space) with a weight function having a tensor product structure. Our result proves dimension independent bounds in a very general setting that requires a tensor product structure neither in the domain of integration nor for the measure with respect to which the integral is taken.
In Section 2, we give a more technical introduction, including in precise terms the notion of curse of dimensionality, and a review of some ideas involved in function approximation. We explain our set-up and discuss the main theorem in Section 3. The main theorem is illustrated by a number of examples in Section 4: approximation of functions on the sphere (and hence, the Euclidean space) by networks using ReLU activation functions (Corollary 4.1), approximation of functions on the sphere using a class of zonal function networks with a positive definite activation function (Corollary 4.2), approximation of functions on a cube using certain radial basis function networks (Corollary 4.3), and approximation of functions on a manifold and their out-of-sample (Nyström) extensions to the ambient space (Corollary 4.4). The proof of the main theorem is given in Section 5.
2 Technical introduction
We consider the problem of approximating a function defined on a compact subset of some Euclidean space by mappings of the form , where is a kernel (not necessarily symmetric), , and ’s are real numbers. We will refer to such a mapping as a (shallow) -network with neurons. For example, the action of a neuron, , can be expressed as , where , and , that of a radial basis function by , etc. A deep -network is obtained by composition of such networks according to some directed acyclic graph.
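To make the definition concrete, the following minimal Python sketch evaluates a shallow network as a weighted sum of kernel evaluations at fixed parameters. The function names (`shallow_network`, `relu_kernel`, `gaussian_kernel`), the packing of a neuron's weight vector and bias into a single parameter, and the particular Gaussian are illustrative assumptions of ours, not notation from the paper.

```python
import math


def relu_kernel(x, y):
    """Hypothetical neuron kernel: y = (w, b) and G(x, y) = max(0, <x, w> + b)."""
    w, b = y
    return max(0.0, sum(xi * wi for xi, wi in zip(x, w)) + b)


def gaussian_kernel(x, y):
    """Hypothetical radial basis kernel: G(x, y) = exp(-|x - y|^2)."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def shallow_network(kernel, centers, coeffs):
    """Return the mapping x -> sum_k a_k * G(x, y_k)."""
    if len(centers) != len(coeffs):
        raise ValueError("need exactly one coefficient per center")

    def network(x):
        return sum(a * kernel(x, y) for a, y in zip(coeffs, centers))

    return network


# A network with two ReLU neurons on the plane.
relu_net = shallow_network(
    relu_kernel,
    centers=[((1.0, 0.0), 0.0), ((0.0, 1.0), -0.5)],
    coeffs=[2.0, -1.0],
)
relu_value = relu_net((0.5, 1.0))  # 2*max(0, 0.5) - max(0, 1.0 - 0.5)

# A network with a single Gaussian neuron centered at the origin.
rbf_net = shallow_network(gaussian_kernel, centers=[(0.0, 0.0)], coeffs=[3.0])
rbf_value = rbf_net((0.0, 0.0))
```

The same `shallow_network` routine serves both cases because the kernel is passed as an argument; only the meaning of the parameter attached to each neuron changes.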
In theoretical analysis, one assumes some prior on , encoded by the assumption that for some compact subset of a Banach space . The set is known in approximation theory parlance as the smoothness class. A central problem in the theory is then to estimate the worst case error
From a practical point of view, one seeks a constructive procedure to realize at least sub-optimally the infimum expression above. This is described in abstract terms as follows. Let the parameter selector be given by . If is a continuous map, we say that it is a robust parameter selector. We define the error in approximation to using this mapping by
Thus, in the definition of the worst case error, the parameters , are allowed to depend upon the individual function in question in an unspecified manner, while in the definition of the error in approximation, all functions involved are approximated using the same parameter selection procedure. Instead of one seeks to estimate
where the infimum is taken over all robust parameter selectors .
Many classes used in this theory suffer from the so-called curse of dimensionality:
where is a “smoothness parameter” associated with .
We digress to make a note on terminology. The term degree of approximation of from is defined by
However, the quantity is also referred to as the degree of approximation to by networks prescribed by the summation expression in (2.2). The terms error in approximation and approximation error are also used to indicate the degree of approximation. The terms rate of approximation and accuracy of approximation refer to an upper estimate on the degree of approximation.
The curse of dimensionality is avoided either by assuming a different prior on the target function or by dropping the requirement that the parameter selector be robust. The purpose of this paper is to explore the second option. We will show that even the smoothness classes typically studied in the literature that give rise to the curse of dimensionality do not suffer from it if we do not require the parameter selector to be robust. This phenomenon has been observed before in a setting where there was no restriction on the parameter selector and the recovery algorithm, so that a space-filling curve could be used in theory. In our setting, the parameter selector has a specific meaning and the recovery algorithm consists of constructing a -network using these parameters.
In order to motivate our work, we first review some of the ideas in the existing work on the estimation of degree of approximation by shallow networks.
First, it is clear that if the parameter selector is robust, then , where is a constant independent of . It is sometimes assumed (or even proved under suitable conditions) in the literature that can be chosen independent of as well (e.g., [32, 33]). Then it is easy to see that in order for the sequence of networks to converge to , it is necessary that must admit a representation of the form
for some (signed) measure having a bounded total variation on . This total variation has been referred to as the -variation of [14, 15]. Using probabilistic estimates, it is then possible to obtain that leads to the bound . Many other results of this type have been obtained in the literature (e.g., [2, 3, 12, 18, 13]). All of these papers either assume explicitly, or deduce from their assumptions, that a representation of the form (2.5) holds. Also, the error bounds neither require nor depend upon the smoothness of .
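The probabilistic idea behind such estimates can be sketched numerically. In the Python sketch below, we assume (purely for illustration) that the measure is the uniform probability measure on [0, 1] and that the kernel is a Gaussian, so that the target function has a closed form in terms of the error function. The network is obtained by drawing the centers independently from the measure and giving every neuron the same coefficient 1/N; for independent samples the pointwise error is of Monte Carlo type, which typically decays like N^{-1/2}. All names are hypothetical.

```python
import math
import random


def kernel(x, y):
    """Illustrative kernel G(x, y) = exp(-(x - y)^2)."""
    return math.exp(-((x - y) ** 2))


def target(x):
    """Closed form of the integral of G(x, y) over y in [0, 1]."""
    return 0.5 * math.sqrt(math.pi) * (math.erf(1.0 - x) + math.erf(x))


def sampled_network(num_neurons, rng):
    """Network (1/N) * sum_k G(x, y_k) with centers y_k drawn uniformly from [0, 1]."""
    centers = [rng.random() for _ in range(num_neurons)]

    def network(x):
        return sum(kernel(x, y) for y in centers) / num_neurons

    return network


def max_error(num_neurons, seed=0, grid_size=21):
    """Largest error on an equispaced grid, as a proxy for the uniform norm."""
    rng = random.Random(seed)
    net = sampled_network(num_neurons, rng)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(target(x) - net(x)) for x in grid)


# Errors for increasing numbers of neurons (values depend on the seed).
errors = {n: max_error(n) for n in (100, 400, 1600, 6400)}
```

Note that the centers here are random, so the parameter selection is not a continuous function of the target; the sketch uses no smoothness of the kernel, in line with the remark that bounds of this type do not depend on smoothness.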
A representation of the form (2.5) holds also for many classes for which the curse of dimensionality applies. For example, let be the unit sphere of the Euclidean space , be the (negative) Laplace-Beltrami operator on , be an integer, and we consider to be the class of all continuous functions on for which is continuous. The so-called non-linear width for this class is (cf. ). However, if is the Green function for the operator , then every has a representation of the form
where is the volume element of . Indeed, this fact is used critically in our work on approximation by zonal function networks using as the activation function, where we gave explicit constructions based entirely on the data with no stipulations on the locations of the sampling nodes. Similar representations play a critical role in similar estimates in many other papers of ours, e.g., [20, 22]. In some sense, this is the other extreme of the kind of results on the degree of approximation, where the smoothness of is the only determining factor. Clearly, probabilistic ideas can be used to obtain dimension independent bounds instead, if only we give up the requirement of a robust parameter selection. However, the challenge here is not to lose the advantage offered by the smoothness of .
To summarize, in both of these approaches, one has an integral representation of the form (2.5), but obtains different bounds depending upon the norm and the method used.
In this paper, we will consider a very general set-up where can be an arbitrary compact metric measure space, and consider functions that admit a representation of the form (2.5). Giving up the requirement of a robust parameter selector, we will use an idea in the paper of Bourgain and Lindenstrauss to obtain dimension independent bounds (cf. Definition 3.1) in the uniform norm provided some very mild conditions hold. This technique involves aspects of both the approaches mentioned above. Thus, we will use concentration inequalities as in the first approach. Our conditions on will be in terms of approximation of using a fixed basis as in the second approach.
We will elaborate on the highlights of our results at appropriate places in the paper, but they can be summarized as follows.
Our bounds are in the uniform norm. We have argued in  that the usual measurement of generalization error using the expected value of the least square loss is not applicable for approximation theory for deep networks; one has to use the uniform approximation to take full advantage of the compositional structure. Moreover, results about shallow networks can then be “lifted” to those about deep networks using a property called good propagation of errors.
Our results combine the advantages of the probabilistic approach to obtain dimension independent bounds and the classical approximation theory approach where the higher the smoothness of the activation function, the better the bounds on the degree of approximation.
We allow the measure to be, for example, supported on a sub-manifold of a manifold . Under certain conditions, the constants involved in the bounds on the degree of approximation depend upon the sub-manifold alone.
At the same time, taking to be a kernel well defined on the ambient space , our bounds provide estimates on the degree of approximation for the out-of-sample extension of the target function to the entire space . The asymptotic behavior of these bounds is also independent of the dimension of the ambient space, but the constants may depend upon the dimension of the ambient space.
3 The set-up and main theorem
Let be a compact metric space, be the metric on , and be a probability measure on . If , , the ball of radius centered at is denoted by ; i.e.,
In the sequel, we assume that for each , and that there exist such that
As the examples below show, serves as a dimension parameter for the ambient space . In Definition 3.1, we will define the notion of dimension more formally, without requiring a measure.
If , the symbol denotes the class of bounded, real valued, uniformly continuous functions on , equipped with the supremum norm . We will omit the subscript if , and write .
Let be a nested sequence of finite dimensional subspaces of : , with the dimension of being , .
In the sequel, the symbols will denote generic positive constants depending only on the fixed quantities under discussion, such as the smoothness parameters, , , , the dimensions, etc. Their values may be different at different occurrences, even within a single formula. The notation means .
Example 3.1. We let be the geodesic distance on , be the volume measure on , normalized to be a probability measure. The space is the space of all spherical polynomials of degree ; i.e., the restriction to of algebraic polynomials in variables of total degree . The dimension of is .
Example 3.2. Let , being the Lebesgue measure, normalized to be a probability measure, . Let be the class of all polynomials of total degree . The dimension of is .
It is possible to convert any measure space that admits a non-atomic measure into a compact metric measure space as described above. Let be any measure space with a non-atomic probability measure ; i.e., for any measurable with , there exists a measurable subset with . Then using ideas described in [11, Chapter VIII, Section 40], we obtain a nested sequence of partitions of such that , , and for , each , where is the integer part of . For each , we can then associate with an interval of the form . Thus, every point in corresponds to a unique number of the form , where each , and an infinite tail of ’s is prohibited in the sequence . We define an equivalence relation on by writing if , and replace by its corresponding quotient space, so that the correspondence is one-to-one. We note that for any measurable subset , is preserved under this operation. We define a metric on by
where denotes the exclusive or between the digits. Then is a compact metric space with this metric. It is not difficult to verify that if (equivalently, ), , and then as well. Conversely, if then . So, by the construction of the partition , (3.2) is satisfied with . A nested sequence of subspaces of can then be constructed using the ideas in , where the question of degree of approximation is also considered in detail. However, since the “dimension parameter” in this case, we will not pursue this example further in this paper.
Next, we define some abstract ideas, starting with the notion of a dimension for a subset of . Specific examples will be given in Section 4.
For a finite subset with , and a compact subset , we write
We will omit the mention of if . Let . A finite subset is called -distinguishable if . It is easy to check that if is a maximal -distinguishable set then
In particular, . Moreover, using a volume argument and (3.2), we see that if , then .
If , , we denote by the number of points in a maximal -distinguishable set for the closure of .
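For a finite point cloud, a maximal distinguishable set can be produced greedily. The Python sketch below assumes the usual reading of the definition, namely that a set is ε-distinguishable when all pairwise distances between its points are at least ε, and uses the Euclidean distance on random points in the unit square as a stand-in for the metric space. The names `greedy_distinguishable`, `min_separation`, and `mesh_norm` are hypothetical.

```python
import math
import random


def greedy_distinguishable(points, eps, dist=math.dist):
    """Scan the points once, keeping each point that is at distance >= eps
    from every point kept so far."""
    selected = []
    for p in points:
        if all(dist(p, c) >= eps for c in selected):
            selected.append(p)
    return selected


def min_separation(selected, dist=math.dist):
    """Smallest pairwise distance among the selected points (needs >= 2 points)."""
    return min(
        dist(a, b) for i, a in enumerate(selected) for b in selected[i + 1:]
    )


def mesh_norm(points, selected, dist=math.dist):
    """Largest distance from a point of the cloud to its nearest selected point."""
    return max(min(dist(p, c) for c in selected) for p in points)


rng = random.Random(1)
cloud = [(rng.random(), rng.random()) for _ in range(500)]
eps = 0.2
centers = greedy_distinguishable(cloud, eps)
```

The output is maximal within the cloud: any rejected point lies at distance less than ε from some selected point, so the selected set is ε-separated and, at the same time, ε-dense in the cloud. Counting the selected points for decreasing ε gives an empirical handle on the dimension of the cloud in the sense described in the text.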
Let . A family of subsets of is called (at most) -dimensional if there exists a constant such that
A subset is -dimensional if is -dimensional.
Next, we introduce the conditions on our kernels and measures.
We need to define a local smoothness for the kernel. In classical wavelet analysis and theory of partial differential equations, it is customary to define the local smoothness of a function at a point in terms of the degree of approximation of the function by polynomials of a fixed degree over the neighborhoods of , measured in terms of the diameters of these neighborhoods. In our analysis, we will use the spaces for this purpose. However, our definition is a bit more complicated in the absence of any structure on and detailed spline-like approximation theory for the spaces .
If , , let
Let , . A function is called -smooth on if there exists such that,
Let , . A function will be called a kernel of class if
and for every , there is a compact subset such that is -smooth on and -smooth on , with
Since we do not assume to be symmetric, it is possible to define it on instead of , where is another compact metric measure space with a measure satisfying a condition analogous to (3.2). The conditions on can be formulated for instead of . This will only complicate the presentation of the paper without adding any new insights. Therefore, we will use the above definition, but in fact, will be applying it with the restriction of to , where is the support of a measure on .
Before stating our measure theoretic notions, we recall some preliminaries. The term measure will mean a signed or positive Borel measure on . The total variation measure of a signed measure on is defined by
where the sum is over all countable partitions of into Borel measurable sets . We will denote . The support is the set of all for which for every . It is easy to see that is a compact subset of .
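For a discrete signed measure with finitely many distinct atoms, these notions reduce to elementary sums: the supremum over partitions is attained by separating the atoms, so the total variation of a set is the sum of the absolute values of the weights of the atoms it contains, and the support is the set of atoms with non-zero weight. The Python sketch below illustrates this special case only; the dictionary representation and the function names are our own assumptions.

```python
def signed_mass(atoms, in_set=lambda y: True):
    """Signed measure of a set: sum of the weights of the atoms in the set."""
    return sum(w for y, w in atoms.items() if in_set(y))


def total_variation(atoms, in_set=lambda y: True):
    """Total variation of a set: sum of |weight| over the atoms in the set."""
    return sum(abs(w) for y, w in atoms.items() if in_set(y))


def support(atoms):
    """Support of the discrete measure: atoms carrying non-zero weight."""
    return sorted(y for y, w in atoms.items() if w != 0.0)


# Hypothetical signed measure on the real line: location -> weight.
tau = {0.1: 0.5, 0.4: -0.25, 0.9: 0.75, 0.95: 0.0}

whole_mass = signed_mass(tau)          # cancellation between signs is allowed
whole_variation = total_variation(tau)  # no cancellation
left_variation = total_variation(tau, in_set=lambda y: y < 0.5)
```

The gap between `whole_mass` and `whole_variation` shows why bounds in terms of the total variation are the natural ones for signed measures.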
Let . A measure on will be called -admissible if has a bounded total variation , is -dimensional subset of , and
Our main theorem can now be stated as follows.
Let , and be a -admissible measure on . Let , , , , and for each , be an -dimensional family of subsets of . Let be defined by
Then for , there exists an integer , points with , and numbers such that
Here, all the constants involved depend upon , in addition to the other fixed parameters in the definition of , in particular, on .
In the case when is -smooth on , we may choose , and obtain the upper bound in (3.12) to be . In particular, if the function is -smooth on for every , then there is no saturation in the degree of approximation. For any , we may take , , , and obtain the bound .
(Tractability of integration) Theorem 3.1 has another interesting consequence, perhaps not directly relevant to the theme of the present paper. In information based complexity, one is interested in approximating integrals of the form for high dimensional spaces so as to obtain the error in the approximation independent of the dimension, except possibly for the constant factors involved. A great deal of research is devoted to this subject, where is considered to be a cube and is the Lebesgue measure. A typical assumption on the class of functions for which these results are applicable also involves a representation of the form
for some measure supported on , and having a bounded total variation. A simple application of Fubini’s theorem in (3.12) leads to the following corollary.
In contrast to much of the literature on this subject, we note that there is no tensor product structure required here. Moreover, as explained above, the estimates improve without saturation as the smoothness of increases.
4 Examples and applications
In this section, we illustrate the implications of Theorem 3.1 using a number of examples.
4.1 Approximation by ReLU networks
We assume the set-up described in Example 3.1. It has been observed in [1, 23] that approximation on a compact subset of a Euclidean space by ReLU networks is equivalent to the approximation of an even function on by networks of the form . (We note that for all , , .) An estimate on the degree of approximation in this context is obtained in  for functions that admit an integral representation as required in Theorem 3.1. Our methods are constructive using a robust parameter selector, and yield a bound of the form . (cf.  for the optimality of this bound.) We have considered in  a slightly more general class of activation functions , so that the case corresponds to the approximation using ReLU networks. On the -dimensional family of sets , is -smooth, and is infinitely differentiable on . Clearly, if is any measure and is a -dimensional set, the family is either -dimensional or -dimensional. Therefore, Theorem 3.1 yields the following corollary.
4.2 Approximation by certain zonal function networks
We assume the set-up described in Example 3.1, and let . For the class of functions satisfying (3.11), with , the error in approximation with a robust parameter selector and completely constructive procedure given in  is . In order to apply Theorem 3.1, we note that , , and we may take , . Then we obtain the following corollary.
Let . We use Theorem 3.1 with . Then
4.3 Approximation on a cube by radial basis function networks
We assume the set-up described in Example 3.2. Let where is at least times continuously differentiable on except at finitely many points in whose neighborhoods is Lipschitz continuous. Apart from continuous piecewise linear functions , a typical example is . Then , , , and . Theorem 3.1 yields the following corollary.
4.4 Manifold learning and out-of-sample extension
We discuss the scenario used commonly in manifold learning. Let be a compact, -dimensional subset of , for example, a -dimensional compact Riemannian manifold embedded in (and hence, without loss of generality, in ). Let be a measure supported on that satisfies, in place of (3.10), the stronger condition (analogous to (3.2)):
Then we may use Theorem 3.1 with in place of , , , and use the restrictions of the spaces to . Then the estimate (3.12) holds with the norm taken over in place of with constants depending only on quantities related to without any reference to the ambient space .
On the other hand, the original estimate (3.12) is an estimate on the degree of approximation to an out-of-sample (Nyström) extension of using the formula (3.11), albeit now with constants depending upon as well.
In kernel based learning on manifolds, it is customary to choose a kernel defined on that is infinitely smooth (e.g., the Gaussian). In this case, as remarked in Remark 3.2, our estimate (3.12) not only gives bounds on the degree of approximation without saturation on the manifold itself, but as just remarked, also for the degree of approximation on the ambient space, without using an explicit Nyström extension.
We summarize this in the following corollary.
where the constant is independent of , and the points . If is infinitely smooth, then we have for every ,
Our proof of Theorem 3.1 is an application of the ideas in in a more general context. The first step in this direction is to obtain a partition of . This will be described in detail in Section 5.1. The next step is to construct a set of random variables to which a concentration inequality can be applied. The basic tools for this are developed in Section 5.2. The proof of Theorem 3.1 is then completed in Section 5.3.
5.1 Partition of the space
Our main objective in this section is to prove the following theorem.
Let be a positive measure on , , be a maximal -distinguishable subset of , and . Then there exists a subset and a partition of with each of the following properties.
(volume property) For , , , and
(density property) , .
(intersection property) Let be a compact, -dimensional subset. Then there exists constant independent of , , and such that
In particular, if is a -dimensional set, then .
Our proof of this theorem requires some preparation, which we organize in two lemmas. The first is the observation that among a finite collection of balls, the number of balls that can intersect each other is bounded independently of the number of balls one starts with (cf. [10, Lemma 7.1]).
Let be a finite subset of , , . Then
Thus, for any , the number of balls that can have a non-empty intersection with , does not exceed a fixed number, the number being independent of .
This proves (5.1).
The proof of the following lemma is almost verbatim the same as that of [10, Lemma 7.2], which in turn, is based on some ideas in the book [6, Appendix 1], but we reproduce a somewhat modified proof, both for the sake of completeness, and because the lemma was not stated in  in the form needed here.
Let , be a positive measure on , . Let be a finite set, and be a partition of such that for every . Then there exists a subset and a partition of such that for each , , and
Proof. In this proof, we write . In view of Lemma 5.1, at most of the balls can intersect each other. In this proof, let . If , then the lemma is proved with with no further effort. So, let , and . Now, we define a function as follows. If , we write . Otherwise, let . Since is a partition of , we have
Since each , it follows that at most of the ’s have a nonempty intersection with . So, there must exist for which
Clearly, each such . We imagine an enumeration of , and among the ’s for which is maximum, pick the one with the lowest index. We then define to be this . Necessarily, , and is nonempty. So,
Now, we define
For each , . Since is a partition of , . If , for , then with and with . Since is a partition of , it follows that , and hence . Thus, is a partition of , , and
With this preparation, we are now ready to prove Theorem 5.1.
Proof of Theorem 5.1.
Let . We set , and for , . Then is a partition of satisfying the conditions of Lemma 5.2 with .
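A partition of this kind can be sketched computationally. The Python sketch below assumes the standard disjointification of a cover by balls: the first cell is the ball around the first center, and each later cell is the ball around its center with all earlier balls removed. Equivalently, every point is assigned to the first center within distance ε. The grid of points, the four centers, the radius, and the name `ball_partition` are illustrative assumptions.

```python
import math


def ball_partition(points, centers, eps, dist=math.dist):
    """Cell k collects the points within eps of center k that are not within
    eps of any earlier center, so the cells are pairwise disjoint."""
    cells = [[] for _ in centers]
    for p in points:
        for k, c in enumerate(centers):
            if dist(p, c) <= eps:
                cells[k].append(p)
                break
        else:
            raise ValueError("the balls do not cover the point %r" % (p,))
    return cells


# An 11 x 11 grid on the unit square, covered by four balls of radius 0.4.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
ball_centers = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
radius = 0.4
cells = ball_partition(grid, ball_centers, radius)
cell_sizes = [len(cell) for cell in cells]
```

The cells depend on the enumeration of the centers, and earlier cells tend to be larger than later ones; balancing the measures of the cells is the role played by Lemma 5.2 in the argument.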