In this era of big data, data-sets of massive size and with various features are routinely acquired. This creates a crucial challenge for machine learning: the design of learning strategies for data management, and in particular for the realization of certain data features. Deep learning is a state-of-the-art approach for realizing such features, including localized position information [3, 5], geometric structures of data-sets [4, 29], and data sparsity [18, 17]. For this and other reasons, deep learning has recently received much attention, and has been successful in various application domains, such as computer vision, speech recognition, image classification, fingerprint recognition and earthquake forecasting.
Affine transformation-invariance, and particularly rotation-invariance, is an important data feature, prevalent in such areas as statistical physics, early warning of earthquakes, 3-D point-cloud segmentation, and image rendering. Theoretically, neural networks with one hidden layer (to be called shallow nets) are incapable of embodying rotation-invariance features, in the sense that their performance in handling these features is analogous to the failure of algebraic polynomials in handling this task. The primary goal of this paper is to construct neural networks with at least two hidden layers (called deep nets) to realize rotation-invariant features by deriving “fast” approximation and learning rates of radial functions as target functions.
Recall that a function defined on the dimensional ball, with radius where , is called a radial function, if there exists a univariate real-valued function defined on the interval such that , for all . For convenience, we allow to include the Euclidean space with . Hence, all radial-basis functions (RBF’s) are special cases of radial functions. In this regard, it is worthwhile to mention that the most commonly used RBF’s are the multiquadric and Gaussian , where
. For these and some other RBF’s, existence and uniqueness of scattered data interpolation from the linear span of, for arbitrary distinct centers and for any , are assured. The multiquadric RBF is popular because of the fast convergence rates of its interpolants to the target function, and the Gaussian RBF is popular because it is commonly used as the activation function for constructing radial networks that possess the universal approximation property and other useful features. Our paper departs from the construction of radial networks in the following way: since RBF’s are radial functions, they qualify as target functions for our general-purpose deep nets with general activation functions. Hence, if the centers of the desired RBF have been chosen and the coefficients have been pre-computed, then the target function
can be realized by using one extra hidden layer for the standard arithmetic operations of additions and multiplications and an additional outer layer for the input of RBF centers and coefficients to the deep net constructed in this paper.
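The notions of radial function and RBF expansion used above can be illustrated numerically. The following is a minimal sketch, not part of the paper's construction: the helper names (`radial`, `rbf_expansion`, `random_orthogonal`) are hypothetical, the convention f(x) = g(|x|) and the parametrizations of the Gaussian and multiquadric profiles are assumptions chosen for illustration, and the paper's own normalizations may differ.

```python
import numpy as np

def radial(g, x):
    """Evaluate a radial function f(x) = g(|x|) row-wise (assumed convention)."""
    return g(np.linalg.norm(x, axis=-1))

def gaussian(r, width=1.0):
    # One common Gaussian profile; the width parameter is an assumption.
    return np.exp(-(r / width) ** 2)

def multiquadric(r, c=1.0):
    # One common multiquadric profile; the shape parameter c is an assumption.
    return np.sqrt(r ** 2 + c ** 2)

def rbf_expansion(profile, centers, coeffs, x):
    """Linear combination sum_j coeffs[j] * profile(|x - centers[j]|)."""
    diffs = x[..., None, :] - centers      # shape (..., n_centers, d)
    return radial(profile, diffs) @ coeffs

def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
Q = random_orthogonal(3, rng)
# A radial function takes the same values on x and on the rotated points x Q^T.
invariant = np.allclose(radial(gaussian, x), radial(gaussian, x @ Q.T))
```

Here `invariant` evaluates to `True` because orthogonal maps preserve the Euclidean norm. An RBF expansion with centers away from the origin is a sum of shifted radial functions and is in general not itself rotation-invariant about the origin.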
The main results of this paper are three-fold. We will first derive a lower bound estimate for approximating radial functions by deep nets. We will then construct a deep net with four hidden layers to achieve this lower bound (up to a logarithmic multiplicative factor) to illustrate the power of depth in realizing rotation-invariance. Finally, based on the prominent approximation ability of deep nets, we will show that implementation of the empirical risk minimization (ERM) algorithm in deep nets yields fast learning rates that are independent of the dimension.
The presentation of this paper is organized as follows. Main results will be stated in Section 2, where near-optimal approximation order and learning rate of deep nets are established. In Section 3, we will establish our main tools for constructing deep nets with two hidden layers for approximation of univariate smooth functions. Proofs of the main results will be provided in Section 4. Finally, derivations of the auxiliary lemmas that are needed for our proof of the main results are presented in Section 5.
2 Main Results
Let denote the unit ball in with center at the origin. Then any radial function defined on is represented by for some function . Here and throughout the paper, the standard notation of the Euclidean norm is used for . In this section, we present the main results on approximation and learning of radial functions .
2.1 Deep nets with tree structure
Consider the collection
of shallow nets with activation function , where . The deep nets considered in this paper are defined recursively in terms of shallow nets according to the tree structure, as follows:
Let , , and , , be univariate activation functions. Set
Then a deep net with the tree structure of layers can be formulated recursively by
where for each , , and . Let denote the set of output functions for at the -th layer.
Figure 1 illustrates a deep net with the tree structure, showing sparse and tree-based connections among neurons. Due to its concise mathematical formulation, this definition of deep nets has been widely used to illustrate their advantages over shallow nets. In particular, it was shown in  that deep nets with the tree structure can be constructed to overcome the saturation phenomenon of shallow nets; in  that deep nets, with two hidden layers, tree structure, and finitely many neurons, can be constructed to possess the universal approximation property; and in [26, 12] that deep nets with the tree structure are capable of embodying tree structures for data management. In addition, a deep net with the tree structure was constructed in  to realize manifold data.
As a result of the sparse connections of deep nets with the tree structure, it follows from Definition 1 and Figure 1 that there are a total of
free parameters for . For , we introduce the notation
For functions in this class, the parameters of deep nets are bounded. This is indeed a necessary condition, since it follows from the results in [19, 20] that there exists some with finitely many neurons but infinite capacity (pseudo-dimension). The objective of this paper is to construct deep nets of the form (3) for some and , for the purpose of approximating and learning radial functions.
2.2 Lower bounds for approximation by deep nets
In this subsection, we show the power of depth in approximating radial functions by establishing lower bounds for approximation by deep nets under a certain smoothness assumption on the radial functions.
For , and , with and , let denote the collection of univariate -times differentiable functions , whose -th derivatives satisfy the Lipschitz condition
In particular, for , let denote the set of radial functions with .
We point out that the above Lipschitz continuity assumption is standard for radial basis functions (RBF’s) in Approximation Theory, and was adopted in [13, 14] to quantify the approximation abilities of polynomials and ridge functions. For and , we denote by
the deviations of from in . The following main result shows that shallow nets are incapable of embodying the rotation-invariance property.
The proof of Theorem 1 is postponed to Section 4. Comparing (5) with (6), we observe that Theorem 1 exhibits an interesting phenomenon in the approximation of radial functions by deep nets, namely that the depth plays a crucial role. For instance, the lower bound for deep nets is a significant improvement over the lower bound for shallow nets, for dimensions .
2.3 Near-optimal approximation rates for deep nets
In this subsection, we show that the lower bound (6) is achievable up to a logarithmic factor by some deep net with layers for certain commonly used activation functions that satisfy the following smoothness condition.
The activation function is assumed to be infinitely differentiable, with both and bounded by , such that for some and all and that
It is easy to see that all of the logistic function: , the hyperbolic tangent function: , the arctan function: , and the Gompertz function: , satisfy Assumption 1, in which we essentially impose three conditions on the activation function , namely: infinite differentiability, non-vanishing of all derivatives at the same point, and the sigmoidal property (7). On the other hand, we should point out that such strong assumptions are stated only for the sake of brevity, and can be relaxed to Assumption 2 in Section 3 below. In particular, the infinite differentiability condition on can be replaced by a much weaker smoothness property, comparable to that of the target function . The following is our second main result, which shows that deep nets can be constructed to realize the rotation-invariance property of by exhibiting a dimension-independent approximation error bound, which is much smaller than that for shallow nets.
Note that the deep net in Theorem 2 has the number of free parameters satisfying
We would like to mention an earlier work on approximating radial functions by deep ReLU networks, where it was shown that for each, there exists a fully connected deep net with ReLU activation function, , and at least parameters and at least layers, such that
for some absolute constant and constant independent of . The novelties of our results in the present paper, as compared with those in , can be summarized as follows. First, noting that for , we may conclude that only an upper bound (without approximation order estimation) was provided in , while both near-optimal approximation error estimates and achievable lower bounds are derived in our present paper on the approximation of functions in . In addition, while fully connected deep nets were considered in , we construct a deep net with sparse connectivity in our paper. Finally, to achieve upper bounds for any (as opposed to merely ), non-trivial techniques, such as the “product-gate” and the approximation of smooth functions by products of deep nets and Taylor polynomials, are introduced in Section 3. It would be of interest to obtain results similar to Theorem 2 for deep ReLU nets, but this is not considered in the present paper.
2.4 Learning rate analysis for empirical risk minimization on deep nets
Based on near-optimal approximation error estimates in Theorem 2, we shall deduce a near-optimal learning rate for the algorithm of empirical risk minimization (ERM) over . Our analysis will be carried out in the standard regression framework , with samples
drawn independently according to an unknown Borel probability measure on , with and for some .
The primary objective is to learn the regression function that minimizes the generalization error where denotes the conditional distribution at induced by . To do so, we consider the learning rate for the ERM algorithm
Here, is the parameter appearing in the definition of . Since , it is natural to project the final output to the interval by the truncation operator The following theorem is our third main result on a near-optimal dimension-independent learning rate for .
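The two ingredients just described, empirical risk minimization and truncation of the output, can be sketched in a few lines. This is an illustrative sketch only: the least-squares form of the empirical risk, the symbol `M` for the bound on the outputs, and the finite list of candidate functions standing in for the deep-net hypothesis class are all assumptions made for this example, and the helper names are hypothetical.

```python
import numpy as np

def truncate(values, M):
    """Project values onto the interval [-M, M] (truncation operator)."""
    return np.clip(values, -M, M)

def empirical_risk(predict, X, y):
    """Mean squared error of a candidate predictor on the sample (X, y)."""
    return float(np.mean((predict(X) - y) ** 2))

def erm(candidates, X, y):
    """Return the candidate with the smallest empirical risk.

    A finite candidate list is used here purely for illustration; in the
    paper the minimization runs over a class of deep nets.
    """
    return min(candidates, key=lambda h: empirical_risk(h, X, y))

# Toy sample whose outputs depend on x only through |x|^2.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.sum(X ** 2, axis=1)
candidates = [lambda Z: np.sum(Z, axis=1), lambda Z: np.sum(Z ** 2, axis=1)]
best = erm(candidates, X, y)
# Final estimator: the ERM output truncated to [-M, M], here with M = 4.
estimate = truncate(best(X), 4.0)
```

Truncation never increases the squared error when the outputs themselves lie in the interval, which is why it is a natural final step.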
We emphasize that the learning rate in (10) is independent of the dimension , and is much better than the optimal learning rate for learning -smooth (but not necessarily radial) functions on [10, 15, 16]. For shallow nets, it follows from (5) that to achieve a learning rate similar to (11), we need at least neurons to guarantee the bias. For , since
, the capacity of neural networks is large. Consequently, it is difficult to obtain a satisfactory variance estimate, so that deriving an almost optimal learning rate similar to (11) for ERM on shallow nets is also difficult. Thus, Theorem 3 demonstrates that ERM on deep nets can embody the rotation-invariance property by deducing the learning rate of order .
3 Approximation by Deep Nets without Saturation
Construction of neural networks to approximate smooth functions is a classical and long-standing topic in approximation theory. Generally speaking, there are two approaches: one constructs neural networks to approximate algebraic polynomials, and the other constructs neural networks with localized approximation properties. The former usually requires extremely large norms of weights [24, 32], and the latter frequently suffers from the well-known saturation phenomenon [2, 3], in the sense that the approximation rate cannot be improved any further when the regularity of the target function goes beyond a specific level. The novelty of our method is to combine ideas from both approaches to construct a deep net with two hidden layers, with controllable norms of weights and without saturation, by considering the “exchange-invariance” between polynomials and shallow nets, the localized approximation of neural networks, a recently developed “product-gate” technique, and a novel Taylor formula. For this purpose, we need to impose differentiability and the sigmoid property on activation functions, as follows.
Let , with and . Assume that
is a sigmoidal function with
is a sigmoidal function with, such that for all , for some .
It is obvious that Assumption 2 is much weaker than the smoothness property of in Assumption 1. Furthermore, it removes the restriction (7) on the use of sigmoid functions as activation functions, by considering only the general sigmoidal property:
In view of this property, we introduce the notation
where , and observe that .
3.1 Exchange-invariance of univariate polynomials and shallow nets
In this subsection, a shallow net with one neuron is constructed to replace a univariate homogeneous polynomial together with a polynomial of lower degree. It is shown in the following proposition that such a replacement does not degrade the polynomial approximation property.
Under Assumption 2 with , and , let and with . Then for an arbitrary ,
The proof of Proposition 1 requires the following Taylor representation which is an easy consequence of the classical Taylor formula
with remainder in integral form, and using the formula To obtain the Taylor polynomial of degree , this formula does not require to be -times differentiable. This observation is important throughout our analysis.
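The role of the integral remainder can be checked numerically. The sketch below compares two representations: the classical Taylor formula of degree n-1 with remainder involving the n-th derivative, and a shifted variant in which the degree-n term is moved into the polynomial and the remainder involves the difference between the n-th derivative and its value at the base point. The shifted variant follows from the elementary identity that the integral of (t-u)^(n-1) over [t0, t] equals (t-t0)^n / n. Whether this is exactly the form used in the lemma below is an assumption; the function names and the trapezoidal quadrature are choices made for illustration.

```python
import math
import numpy as np

def integrate(integrand, a, b, n_grid=20001):
    """Composite trapezoidal rule on [a, b]."""
    u = np.linspace(a, b, n_grid)
    vals = integrand(u)
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(u)) / 2.0)

def taylor_poly(derivs_at_t0, t0, t):
    """sum_k derivs_at_t0[k] / k! * (t - t0)^k."""
    return sum(c / math.factorial(k) * (t - t0) ** k
               for k, c in enumerate(derivs_at_t0))

def classical_taylor(f_derivs, t0, t, n):
    """Degree n-1 polynomial plus integral remainder with f^(n)."""
    poly = taylor_poly([f_derivs[k](t0) for k in range(n)], t0, t)
    rem = integrate(lambda u: f_derivs[n](u) * (t - u) ** (n - 1), t0, t)
    return poly + rem / math.factorial(n - 1)

def shifted_taylor(f_derivs, t0, t, n):
    """Degree n polynomial plus remainder with f^(n)(u) - f^(n)(t0).

    Only n derivatives of f are needed, although the polynomial has degree n.
    """
    poly = taylor_poly([f_derivs[k](t0) for k in range(n + 1)], t0, t)
    rem = integrate(
        lambda u: (f_derivs[n](u) - f_derivs[n](t0)) * (t - u) ** (n - 1),
        t0, t)
    return poly + rem / math.factorial(n - 1)

# f = sin and its first three derivatives, indexed by order.
sin_derivs = [np.sin, np.cos, lambda u: -np.sin(u), lambda u: -np.cos(u)]
classical_value = classical_taylor(sin_derivs, 0.3, 1.1, 3)
shifted_value = shifted_taylor(sin_derivs, 0.3, 1.1, 3)
```

Both values agree with sin(1.1) up to quadrature error, which illustrates that a Taylor polynomial of degree n can be obtained from only n derivatives.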
Let and be -times differentiable on . Then for ,
We are now ready to prove Proposition 1.
for . It follows that
with defined by (15). What is left is to estimate the remainder . To this end, we observe, for the case , from the definition of , that for any ,
For , we may apply the estimate
to compute, for any
Finally, for , we may apply the Lipschitz property of to obtain
so that for any , we have
This completes the proof of Proposition 1.
3.2 Approximation of univariate polynomials by neural networks and the product gate
Our second tool, to be presented in the following proposition, shows that the approximation capability of shallow nets is no worse than that of polynomials whose order (degree ) matches the cardinality of weights of the shallow nets.
Under Assumption 2 with and , let and . Then for an arbitrary , there exists a shallow net
for , such that
where is a constant depending only on , , and , to be specified explicitly in the proof.
We remark, however, that to arrive at a fair comparison with polynomial approximation, the polynomial degree should be sufficiently large, so that the norm of weights of the shallow nets could also be extremely large. In the following discussion, we require to be independent of in order to reduce the norm of the weights. Based on Proposition 2, we are able to derive the following proposition, which yields a “product-gate” property of deep nets.
Under Assumption 2 with and , for , there exists a shallow net
for , such that for any ,
where is a constant depending only on , , and .
Proof. For , we apply Proposition 2 to the polynomial to derive a shallow net
for , such that
and implies , we have
This completes the proof of Proposition 3 by scaling to .
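A “product-gate” of this kind can be illustrated with a small numerical sketch. The version below is a standard device and is not claimed to be the construction in Propositions 2 and 3: it approximates the square function by a second-order finite difference of a smooth activation at a point where its second derivative does not vanish, and then recovers the product from the polarization identity uv = ((u+v)^2 - (u-v)^2)/4. The logistic activation, the base point `theta`, and the step `delta` are assumptions chosen for illustration.

```python
import numpy as np

def sigma(t):
    """Logistic activation (assumed for this sketch)."""
    return 1.0 / (1.0 + np.exp(-t))

def sigma_dd(t):
    """Second derivative of the logistic function."""
    s = sigma(t)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def square_net(t, theta=1.0, delta=1e-2):
    """Shallow net with three neurons approximating t^2.

    Taylor expansion of sigma around theta gives
    sigma(theta + delta t) - 2 sigma(theta) + sigma(theta - delta t)
      = delta^2 t^2 sigma''(theta) + O(delta^4 t^4),
    so dividing by delta^2 sigma''(theta) leaves t^2 + O(delta^2 t^4).
    """
    num = sigma(theta + delta * t) - 2.0 * sigma(theta) + sigma(theta - delta * t)
    return num / (delta ** 2 * sigma_dd(theta))

def product_gate(u, v, theta=1.0, delta=1e-2):
    """Approximate u * v via the polarization identity."""
    return (square_net(u + v, theta, delta) - square_net(u - v, theta, delta)) / 4.0

approx = product_gate(0.5, -0.7)   # close to 0.5 * (-0.7) = -0.35
```

The trade-off visible here mirrors the remark above on weight norms: a smaller `delta` reduces the approximation error but enlarges the outer weights, which scale like `1 / delta**2`.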
To end this subsection, we present the proof of Proposition 2.