In physical or biological systems, engineering applications, financial studies, and many other fields, only a finite number of samples can be obtained. Supervised learning aims at synthesizing a function (or mapping) to represent or approximate an unknown but definite relation between the input and output, based on the input-output samples. The learning process is accomplished by selecting a hypothesis space and a learning algorithm. The hypothesis space is a family of functions endowed with certain structures, very often a space spanned by a set of parameterized functions like
where is a parameter for specifying the -th function and is a class of parameters. A typical example is the three-layer feed-forward neural network (FNN), in which is the response of the -th neuron in the hidden layer, with being all the connection weights connected to that neuron. A learning algorithm is then defined by some optimization scheme to derive an estimator in based on the given samples. To distinguish the types of parameters, we call each a hidden predictor, a hidden parameter, and a bright parameter. Then, for a nonlinear function , it follows from (1) that hidden parameters determine the attributes of the hidden predictors of the estimator (nonlinear mechanism), while bright parameters characterize how hidden predictors are linearly combined (linear mechanism). In this sense, supervised learning boils down to determining hidden and bright parameters in a parameterized hypothesis space.
In the traditional learning paradigm, hidden and bright parameters are not distinguished and are trained simultaneously. Such a scheme is referred to as one-stage learning (OSL). The well-known support vector machine (SVM), kernel ridge regression, and FNNs are typical examples of the OSL scheme. OSL is theoretically attractive in the sense that it can attain the optimal generalization error bounds [32, 44, 26]. However, it inevitably requires solving a nonlinear optimization problem, which is usually time-consuming, especially for problems with large sample sizes.
To circumvent this difficulty, a two-stage learning (TSL) scheme, featured as learning through random assignment of hidden parameters (LtRaHP), was developed and widely used [4, 17, 18, 29, 35] in the last two decades. LtRaHP randomly assigns the hidden parameters in the first stage and determines the bright parameters by solving a linear least-squares problem in the second stage. Typical examples of LtRaHP include the random vector functional-link networks (RVFLs), the echo-state neural networks (ESNs), the random weight neural networks (RWNNs), and the extreme learning machine (ELM). LtRaHP significantly reduces the computational burden of OSL without sacrificing much prediction accuracy, as partially justified in our recent theoretical studies [24, 28]. However, due to the randomness of the hidden parameters, a satisfactory generalization capability of LtRaHP is achieved only in the sense of expectation. This leads to an uncertainty problem: it is uncertain whether a single trial of the scheme succeeds or not. Consequently, to yield a convincing result, multiple trials are required in the training process of LtRaHP.
From these studies, we draw a simple conclusion on the pros and cons of existing learning schemes. OSL possesses promising generalization capabilities, but at the cost of a high computational burden, while LtRaHP has attractive computational advantages but suffers from an uncertainty problem. Thus, finding an efficient and feasible learning scheme remains open, especially when the size of the data is huge. Our aim in the present paper is to develop a new TSL scheme. Our core idea is to apply a deterministic mechanism for the assignment of hidden parameters in place of the random assignment in LtRaHP. Accordingly, the new TSL scheme will be featured as learning through deterministic assignment of hidden parameters (LtDaHP). We will show that LtDaHP outperforms LtRaHP in the sense that LtDaHP avoids the uncertainty problem of LtRaHP without increasing the computational complexity.
Given the popularity of neural networks in recent years [10, 36, 7, 48], we equip the LtDaHP scheme with an FNN instance to demonstrate its advantages. Taking inner weights as minimal Riesz energy points on a sphere and thresholds as equally spaced points (ESPs) in an interval, we can define an FNN-realization of LtDaHP. We theoretically justify that the LtDaHP so defined outperforms both LtRaHP and OSL in several ways. First, LtDaHP can achieve the almost optimal generalization error bounds of the OSL schemes; second, LtDaHP significantly reduces the computational burden of OSL; finally, unlike LtRaHP, LtDaHP can find a satisfactory estimator in a single trial. Thus, LtDaHP provides an effective way of overcoming both the high computational burden of OSL and the uncertainty problem of LtRaHP. We also provide a series of simulations and application examples to support the advantages of LtDaHP.
The rest of the paper is organized as follows. Section II introduces the new FNN-realization of LtDaHP, together with a brief introduction to the minimal Riesz energy configuration problem on the sphere. In Section III, we verify the almost optimality of LtDaHP in the framework of statistical learning theory. In Section IV, we provide simulations and application examples to support the advantages of LtDaHP and the correctness of the theoretical assertions we have made. We conclude the paper in Section V with some remarks.
II FNN-Realization of LtDaHP
In this section, after providing the motivation of the LtDaHP scheme and briefly introducing minimal Riesz energy points on the sphere, we formalize an FNN-realization of LtDaHP.
FNNs, taking three-layer FNNs with one output neuron as an example, look for estimators of the form where is the inner weight which connects the input layer to the -th hidden neuron, is the threshold of the -th hidden neuron, is the nonlinear activation function, and is the outer weight that connects the -th hidden neuron to the output layer. In FNNs, the hidden parameters are and the bright parameters are . FNNs conventionally generate their estimators by solving the optimization problem
It is obvious that (2) does not distinguish hidden parameters from bright parameters and is actually an OSL scheme.
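For orientation, the standard three-layer FNN form and the associated least-squares training problem can be written as below. This is a sketch in assumed notation: the symbols $a_j$, $\omega_j$, $b_j$, $\sigma$, $n$, and $m$, as well as the sign convention inside the activation, are not recoverable from the text above and may differ from the notation used in (2).

```latex
% Assumed notation: a_j outer weights, \omega_j inner weights, b_j thresholds,
% \sigma activation, n hidden neurons, (x_i, y_i)_{i=1}^m training samples.
f(x) = \sum_{j=1}^{n} a_j \, \sigma(\omega_j \cdot x + b_j),
\qquad
\min_{\{a_j, \omega_j, b_j\}_{j=1}^{n}} \;
\sum_{i=1}^{m} \bigl( f(x_i) - y_i \bigr)^2 .
```

Written this way, the problem is linear in the outer weights $a_j$ once $\omega_j$ and $b_j$ are fixed, and nonlinear only through the hidden parameters.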
Our idea to design a TSL system based on FNNs mainly stems from two interesting observations. On the one hand, we observe in the theoretical literature [37, 30] that, to realize the optimal approximation capability, the inner weights of an FNN can be restricted to the unit sphere embedded in the input space. This theoretical finding suggests designing efficient FNN-based learning schemes by shrinking the class of parameters. On the other hand, the existing LtRaHP schemes [17, 18, 28, 35] show that the uniform distribution for hidden parameters is usually effective. This prompts us to assign the hidden parameters as uniformly as possible. An extreme assignment is to deterministically select the hidden parameters as equally spaced points (ESPs), rather than by random sketching. Combining these two observations, it is reasonable to generate inner weights as ESPs on the unit sphere and thresholds as ESPs on some interval.
The problem is, of course, whether ESPs on the sphere can be practically constructed. This problem, known as the Tammes problem or the hard sphere problem, is a well-known and long-standing open question. This perhaps explains why only LtRaHP has been widely utilized in TSL up to now, even though several authors have already conjectured that LtDaHP may outperform its random counterpart. However, because the sphere has no boundary, the Tammes problem is the limiting case of another famous problem: the minimal Riesz energy configuration problem. The latter problem, listed as the -th problem of Smale’s “problems for this century”, can be approximately solved by using several methods, such as the equal-area partition and recursive zonal sphere partitioning. Thus, the hidden parameters of FNNs can be selected by appropriately combining the minimal Riesz energy points on the sphere with the equally spaced points in an interval.
II-B Minimal Riesz energy points on the sphere
Let denote the unit sphere in the -dimensional Euclidean space , and be a collection of distinct points (a configuration) on . The Riesz -energy associated with , denoted by , is defined by 
Here is the Euclidean norm. We use to denote the -point minimal -energy over , that is,
where the minimization is taken over all -point configurations of . If is a minimizer of (3), i.e.,
then is called a minimal -energy configuration of , and the points in are called the minimal -energy points.
The elegant work in showed that the minimal -energy points of are an effective approximation of the equally spaced points (ESPs) on the sphere whenever . Thus, one can use the minimal -energy points as a substitute for ESPs in applications. Formulated as Smale’s problem, the generation of minimal -energy configurations and minimal -energy points on has triggered enormous research activity [16, 22, 41] in the past thirty years.
Up till now, there have been several well established approaches to approximately solve the minimal -energy configuration problems [19, 22, 41], among which two widely used procedures are Saff et al.’s equal-area partitioning  and Leopardi’s recursive zonal sphere partitioning procedure . Both of them have been justified to be able to approximately generate the minimal -energy points of for a certain with a “cheap” computational cost, more precisely, with an asymptotic time complexity .
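To make the notion of Riesz energy concrete, the following minimal sketch evaluates the energy of a configuration and compares an equally spaced configuration on the unit circle with a clustered one. It is illustrative only: the function names are hypothetical, the sum is taken over unordered pairs (summing over ordered pairs doubles the value but leaves the minimizers unchanged), only a positive exponent is handled, and the circle is used merely because equally spaced points are trivial to write down there. It is not the equal-area or recursive zonal partitioning procedure cited above.

```python
import math
from itertools import combinations


def riesz_energy(points, s):
    """Riesz s-energy of a point configuration, for s > 0.

    Assumed convention: sum of |x_i - x_j|^(-s) over unordered pairs {i, j}.
    """
    if s <= 0:
        raise ValueError("this sketch only covers s > 0")
    return sum(math.dist(p, q) ** (-s) for p, q in combinations(points, 2))


def equally_spaced_on_circle(n):
    """n equally spaced points on the unit circle in the plane."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def clustered_on_circle(n, spread=0.5):
    """n distinct points squeezed into an arc of `spread` radians (a deliberately poor configuration)."""
    return [(math.cos(spread * k / n), math.sin(spread * k / n))
            for k in range(n)]


even = equally_spaced_on_circle(4)
bad = clustered_on_circle(4)
energy_even = riesz_energy(even, s=1.0)   # four sides of length sqrt(2), two diagonals of length 2
energy_bad = riesz_energy(bad, s=1.0)     # nearby points make the energy blow up
```

Pairs of nearby points dominate the sum, so well-separated configurations have low energy; this is the intuition behind using minimal-energy points as a proxy for equally spaced points.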
II-C The LtDaHP Scheme
Let be the set of samples with and . Without loss of generality, we assume the input space and the output space , where is the unit ball in and . Our idea is to solve an FNN-learning problem by the TSL approach which deterministically assigns the hidden parameters at the first stage, and solves a linear least-square problem at the second stage. In particular, we propose to deterministically assign inner weights to be minimal Riesz -energy points of and thresholds to be ESPs in the interval Consequently, our suggested FNN-realization of LtDaHP can be formalized as follows:
LtDaHP Scheme: Given the training samples , the nonlinear function and a splitting we generate the LtDaHP estimator via the following two stages:
Stage 1: Take the inner weights to be minimal Riesz -energy points of with , and to be ESPs in the interval that is,
We then obtain a parameterized hypothesis space
Stage 2: The LtDaHP estimator is defined by
Classical neural network approximation literature [6, 14, 3] shows that neural networks with fixed inner weights are sufficient to approximate univariate functions. We adopt this approach in our construction (4) by using to approximate univariate functions. Then, we use an approach from to extend univariate approximation to multivariate approximation (see Section C in the Appendix for the detailed construction), which requires different inner weights on the sphere and yields an FNN of the form (4) with good approximation properties. It should be mentioned that in the splitting is the main parameter controlling the approximation accuracy, while , which depends on , is required for the dimensional extension. Based on the splitting, all inner weights share the same thresholds in constructing FNNs, which differs from the classical FNNs in (2). The structure of the functions in is shown in Figure 1.
Based on the deterministic assignment of hidden parameters, LtDaHP transforms the nonlinear optimization problem (2) into a linear one (5), which heavily reduces the computational burden. It should be mentioned that for ESNs, there is another approach to deterministically construct hidden parameters. However, ESNs focus on training recurrent neural networks rather than the standard FNNs studied in this paper.
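The two stages can be sketched as follows for two-dimensional inputs. This is a minimal illustration under stated assumptions, not the authors' implementation: equally spaced points on the unit circle stand in for minimal Riesz energy points, the thresholds are equally spaced in an assumed interval [-1/2, 1/2], the scale factor inside the logistic activation, the small ridge parameter `lam`, the toy target function, and all function names are hypothetical choices made for this example.

```python
import math
import random


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def circle_points(n):
    """Equally spaced points on the unit circle (stand-in for minimal Riesz energy points)."""
    return [(math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
            for j in range(n)]


def equally_spaced(count, lo, hi):
    """`count` equally spaced thresholds in [lo, hi], endpoints included."""
    if count == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (count - 1)
    return [lo + k * step for k in range(count)]


def hidden_layer(x, weights, thresholds, scale):
    """Responses sigma(scale * (w . x - b)) for every weight/threshold pair."""
    return [logistic(scale * (w[0] * x[0] + w[1] * x[1] - b))
            for w in weights for b in thresholds]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    size = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, size))
        sol[r] = (aug[r][size] - tail) / aug[r][r]
    return sol


def fit_two_stage(X, y, n_weights, n_thresholds, scale=2.0, lam=1e-6,
                  interval=(-0.5, 0.5)):
    # Stage 1: deterministic hidden parameters (no training, no randomness).
    weights = circle_points(n_weights)
    thresholds = equally_spaced(n_thresholds, *interval)
    # Stage 2: ridge-regularized linear least squares for the outer weights,
    # solved through the normal equations (H^T H + lam * I) a = H^T y.
    H = [hidden_layer(x, weights, thresholds, scale) for x in X]
    p = len(H[0])
    gram = [[sum(h[i] * h[j] for h in H) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    moment = [sum(h[i] * t for h, t in zip(H, y)) for i in range(p)]
    coef = solve(gram, moment)
    return lambda x: sum(c * h for c, h in
                         zip(coef, hidden_layer(x, weights, thresholds, scale)))


# Toy data: inputs in the disc of radius 1/2, hypothetical target 1 + x1 * x2.
rng = random.Random(0)
X = []
while len(X) < 40:
    pt = (rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
    if pt[0] ** 2 + pt[1] ** 2 <= 0.25:
        X.append(pt)
y = [1.0 + pt[0] * pt[1] for pt in X]
model = fit_two_stage(X, y, n_weights=6, n_thresholds=4)
train_rss = sum((model(pt) - t) ** 2 for pt, t in zip(X, y))
```

Because nothing in Stage 1 is random, refitting on the same data returns the same estimator. The small ridge term matters in this sketch: with symmetric weights and thresholds, pairs of logistic features sum to a constant, so the unregularized normal equations would be singular.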
III Theoretical assessment
In this section we study theoretical behaviors of LtDaHP. After reviewing some basic notations of learning theory , we prove that the FNN-realization of LtDaHP provides an almost optimal generalization error bound as long as the regression function is smooth.
III-A Statistical Learning Theory
Suppose that are drawn independently and identically from according to an unknown probability distribution which admits the decomposition
Assume that is a function that characterizes the correspondence between the input and output, as induced by . A natural measurement of the error incurred by using for this purpose is the generalization error, defined by
which is minimized by the regression function [9, Chap.1]
We do not know this ideal minimizer since is unknown, but we have access to random examples from sampled according to .
Let be the Hilbert space of -square-integrable functions on , with norm It is known that, for every , there holds [9, Chap.1]
So, the goal of learning is to find a best approximation of the regression function .
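For reference, the quantities discussed in this subsection take the following standard form in the least-squares learning framework of [9, Chap.1]. The symbols below ($\rho$, $\rho_X$, $Z = X \times Y$, $\mathcal{E}$, $f_\rho$) are assumed notation, since the displayed formulas are not recoverable from the text above.

```latex
% Assumed notation: \rho is the sampling distribution on Z = X \times Y,
% \rho_X its marginal on X, and \rho(y \mid x) the conditional distribution.
\mathcal{E}(f) := \int_{Z} \bigl( f(x) - y \bigr)^2 \, d\rho ,
\qquad
f_\rho(x) := \int_{Y} y \, d\rho(y \mid x) ,
\qquad
\mathcal{E}(f) - \mathcal{E}(f_\rho) = \| f - f_\rho \|_{\rho}^{2}
  = \int_{X} \bigl( f(x) - f_\rho(x) \bigr)^2 \, d\rho_X .
```

The last identity is what makes approximating the regression function equivalent to minimizing the generalization error.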
If we have a specific estimator of in hand, the error clearly depends on and therefore has a stochastic nature. As a result, it is impossible to say anything about in general for a fixed . Instead, we can look at its behavior in probability as measured by the following expected error
where the expectation is taken over all realizations obtained for a fixed , and is the -fold tensor product of .
III-B An Almost Optimal Generalization Bound
In general, it is impossible to get a nontrivial generalization error bound of a learning algorithm without knowing any information on [12, Thm.3.1]. So, some types of a-priori information of the regression function have to be imposed. Let be the set of positive integers and with each The -th order derivative of a function is defined by
where and . The classical Sobolev class is then defined for any by
Let be the identity mapping
and is called the distortion of , which measures how much distorts the Lebesgue measure. We assume that the distribution satisfies and , which is standard and utilized in vast literature [12, 9, 32, 43, 23, 26].
Since the generalization capability of LtDaHP also depends on the activation function , certain restrictions on should be imposed. We say that is a sigmoid function if satisfies
By definition, for any sigmoid function there exists a positive constant such that
where is the number of different thresholds in the LtDaHP scheme. We further suppose that for arbitrary closed set in is square integrable, which is denoted by
The proof of Theorem 1 will be presented in the Appendix. Some immediate remarks to explain this result are as follows.
III-C1 On optimality of generalization error
We see that, modulo the logarithmic factor , the established learning rate (9) is optimal in a minimax sense. That is, up to a logarithmic factor, the upper and lower bounds of the learning rate are asymptotically identical. We further show that this learning rate is also almost optimal among all learning schemes. Let be the class of all Borel measures satisfying and . We enter into a competition over all estimators and define
Then, quantitatively measures the quality of and it was shown in [12, Chap. 3] that
where is a constant depending only on , and (10) shows that if and learning rates of all learning strategies based on samples cannot be faster than Consequently, the learning rate established in (9) is almost optimal among all learning schemes.
In this sense, Theorem 1 says that even when the hidden parameters are not trained and just preassigned deterministically, LtDaHP does not degrade the generalization capability of FNNs which train hidden and bright parameters together by some OSL scheme. It is noted that a similar almost optimal learning rate has also been proved in  for a typical scheme of LtRaHP (ELM):
in which the expectation is taken over all possible random assignments of hidden parameters. We refer the readers to  for detailed definitions of and . There is an additional expectation term that brings the uncertainty problem of LtRaHP. Comparing (11) with (9), we can see that the LtDaHP dismisses the -expectation term. Furthermore, we notice that LtRaHP, as shown in , may break the almost optimal generalization error for certain specific activation functions even in the -expectation sense. Theorem 1 thus implies that the LtDaHP improves on LtRaHP not only in circumventing the uncertainty problem, but also guaranteeing the generalization capability further.
III-C2 On how to specify the activation function
In Theorem 1, three conditions have been imposed on the activation function : (i) with , where is defined by (7); (ii) is a bounded sigmoid function; and (iii) . Conditions (ii) and (iii) are clearly mild; for instance, both the widely applied Heaviside function and the logistic function satisfy these assumptions, where
The most crucial assumption is (i), i.e., should be carefully chosen. It is observed that in (8) is with respect to and , while depends merely on and . So can be specified when is given. For example, if is the logistic function, then can be selected as any positive number satisfying .
The problem is that there are infinitely many choices of such . How to specify the best thus becomes a practical issue. According to [31, Thm. 2.4], the complexity of increases monotonically with respect to . Due to the well-known bias-variance trade-off principle, we then recommend choosing in practice. For example, when the logistic function is utilized, we may take if and if .
III-C3 On almost optimality of the number of hidden neurons
Theorem 1 has presented the number of hidden neurons to be . We observe from (4) that is an -dimensional linear space. Hence, according to the well-known linear width theory , for , must not be smaller than if one wants to achieve an approximation error of . This means that the number of hidden neurons required in Theorem 1 cannot be reduced.
At first glance, there are two parameters, and , that need to be specified, which is more complicated than in OSL and LtRaHP . In fact, there is only one essential parameter, since and have a relation . If the smoothness information of is known, one can directly take as in Theorem 1. However, this is usually infeasible, since we do not know the concrete value of when faced with real-world applications. Instead, we turn to a model-selection strategy such as the “cross-validation” approach [12, Chap.8] to determine and .
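The model-selection step can be sketched with a generic k-fold cross-validation routine. The sketch below is illustrative only: the function names are hypothetical, and a piecewise-constant regressor whose number of cells plays the role of the number of hidden neurons is used as a stand-in learner, so that the example stays short and self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle 0..n_samples-1 once and deal the indices into k folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validate(fit, X, y, candidates, k=10, seed=0):
    """Return (best candidate, mean validation MSE per candidate)."""
    folds = k_fold_indices(len(X), k, seed)
    scores = {}
    for cand in candidates:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            predict = fit([X[i] for i in train], [y[i] for i in train], cand)
            err = sum((predict(X[i]) - y[i]) ** 2 for i in held_out) / len(held_out)
            fold_errors.append(err)
        scores[cand] = sum(fold_errors) / k
    best = min(candidates, key=lambda c: scores[c])
    return best, scores


def fit_histogram(X, y, n_bins):
    """Hypothetical stand-in learner: piecewise-constant fit on [0, 1) with n_bins cells."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, t in zip(X, y):
        cell = min(int(x * n_bins), n_bins - 1)
        sums[cell] += t
        counts[cell] += 1
    overall = sum(y) / len(y)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


# Toy data: noisy samples of x^2 on [0, 1).
rng = random.Random(1)
X = [rng.random() for _ in range(200)]
y = [x * x + rng.gauss(0.0, 0.1) for x in X]
candidates = [1, 2, 4, 8, 16, 32, 64]
best_bins, cv_scores = cross_validate(fit_histogram, X, y, candidates)
```

Too few cells underfit and too many overfit, which mirrors the trade-off that governs the choice of the number of hidden neurons; replacing `fit_histogram` with a two-stage FNN learner gives the tuning loop described in the text.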
III-C4 On why LtDaHP works
It is known [9, Chap.1] that a satisfactory generalization capability of a learning scheme can only result from an appropriate trade-off between the approximation capability and the capacity of the hypothesis space. We use this principle to explain the success of LtDaHP.
Given a hypothesis space and a function the approximation capability of can be measured by its best approximation error: while the capacity of can be measured by its pseudo-dimension , denoted by . We compare the approximation capabilities and capacities of the hypothesis spaces of FNNs and LtDaHP. The hypothesis space of FNNs is the family of functions of the form
if is the logistic function. Furthermore, in , it was verified that the approximation capability of satisfies
In comparison, we find in  that
and, similarly, we can prove in Lemma 6 that
as long as satisfies (8). Comparing (12) and (13) with (14) and (15), we thus conclude that for appropriately tuned , both the approximation capabilities and capacities of and are almost the same. This explains why the LtDaHP scheme performs at least as well as conventional FNNs.
IV Experimental Studies
In this section, we present both toy simulations and real-world data experiments to assess the performance of LtDaHP as compared with support vector regression (SVR), Gaussian process regression (GPR), and a typical LtRaHP scheme (ELM), where the learning algorithm is the same as LtDaHP, except that the inner weights and thresholds are randomly sampled according to the uniform distribution. In our experiments, the minimal Riesz energy points were approximately generated by the recursive zonal sphere partitioning using the EQSP toolbox (http://www.mathworks.com/matlabcentral/fileexchange/13356-eqsp-recursive-zonal-sphere-partitioning-toolbox). SVR and GPR were realized by the Matlab functions fitrsvm and fitrgp (with subsampling of 1000 atoms), respectively. For a fair comparison, we applied the 10-fold cross-validation method [12, Chap.8] to select all parameters (more specifically, to select three parameters in SVR, namely the width of the Gaussian kernel, the regularization parameter, and the epsilon-insensitive band, and two parameters in LtRaHP and LtDaHP, namely the number of hidden neurons for both methods, for LtDaHP, and for LtRaHP), and the time for parameter tuning was included in the recorded training time.
All the simulations and experiments were conducted in Matlab R2017b on a workstation with 64 GB RAM and an E5-2667 v2 3.30 GHz CPU.
IV-A Toy Simulations
This series of simulations was designed to support the correctness of Theorem 1 and to compare the learning performance of LtDaHP, LtRaHP, GPR, and SVR. For this purpose, the regression function is supposed to be known and given by
where Direct computation shows and . We generated the training sample set with variable data size by independently and randomly sampling from according to the uniform distribution, and with being white noise. The learning performance of the algorithms was tested by applying the resultant estimators to the test set which was generated similarly to with the difference that In the simulations, we took and implemented LtDaHP and LtRaHP with where for LtDaHP, as suggested in Subsection III-C. The for LtDaHP and for LtRaHP were tuned by 10-fold cross-validation. In addition, to avoid the risk of singularity, we implemented the least-squares problem (5) for LtDaHP and LtRaHP with a very small, fixed regularization parameter .
In the first simulation, to illustrate the difference between LtDaHP and LtRaHP, we conducted a phase diagram study. To this end, the number of samples varied from to and the number of neurons ranged from to . For each pair , we ran 100 independent simulations with LtDaHP and LtRaHP. The average root mean square errors (RMSEs) were then recorded. We plotted all the simulation results in a figure, called the phase diagram, with the -axis and -axis being respectively the number of neurons and the number of samples, and the colors from blue to red corresponding to RMSE values from small to large. The simulation results are reported in Fig. 2.
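The bookkeeping behind such a phase diagram can be sketched as follows. This is an illustrative outline only: `run_trial` is a hypothetical callback standing in for one training and testing run, and `dummy_trial` is a placeholder experiment used to exercise the code, not one of the learners studied here.

```python
import math


def rmse(predictions, targets):
    """Root mean square error between two equally long sequences."""
    if len(predictions) != len(targets):
        raise ValueError("length mismatch")
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


def phase_diagram(sample_sizes, neuron_counts, run_trial, n_trials=100):
    """Average test RMSE for every (number of samples, number of neurons) pair.

    `run_trial(m, n, trial)` is a hypothetical callback that trains on m samples
    with n neurons and returns (predictions, targets) on a test set.
    """
    grid = {}
    for m in sample_sizes:
        for n in neuron_counts:
            errors = [rmse(*run_trial(m, n, trial)) for trial in range(n_trials)]
            grid[(m, n)] = sum(errors) / n_trials
    return grid


def dummy_trial(m, n, trial):
    """Placeholder experiment: every prediction is off by exactly 1 / n."""
    targets = [0.0] * 5
    predictions = [1.0 / n] * 5
    return predictions, targets


diagram = phase_diagram([10, 20], [1, 2, 4], dummy_trial, n_trials=3)
```

Each entry of `diagram` corresponds to one colored cell of the phase diagram; for a deterministic scheme, repeated trials differ only through the sampled data, whereas for a randomized scheme they also differ through the drawn hidden parameters.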
Fig. 2(a) shows that for suitable choice of , the LtDaHP estimator maintains always very low RMSE, which coincides with Theorem 1. Furthermore, comparing (a) with (b) in Fig. 2 demonstrates several obvious differences between LtDaHP and LtRaHP: (i) the test errors of LtDaHP are much smaller than those of LtRaHP. This can be observed not only for the best choice of the number of neurons, but also for every fixed number of neurons as well. This difference reveals that as far as the generalization capability is concerned, LtDaHP outperforms LtRaHP in this example. (ii) LtDaHP exhibits a somewhat tidy phase change phenomenon: there is a clear range of and such that the LtDaHP performs well. Similar phase change phenomenon does not appear in LtRaHP, as exhibited in Fig. 2(b), due to the uncertainty. This difference implies that LtDaHP is more robust to the specification of neuron number than LtRaHP and selecting an appropriate neuron number for LtDaHP is much easier than that for LtRaHP. All these differences show the advantages of LtDaHP over LtRaHP.
In the second simulation, we studied the pros and cons of LtDaHP, LtRaHP, GPR, and SVR. We implemented these four algorithms independently 50 times and calculated the average RMSEs. The obtained RMSEs, as well as the corresponding computational time and model sparsity, were plotted as a function of the number of training samples in Fig. 3.
From Fig. 3(a), we can see that GPR has the best generalization capability, followed by LtDaHP, LtRaHP, and finally SVR. More specifically, we observe that the test errors of LtDaHP are very close to those of GPR when the sample size is over . LtRaHP, however, requires more samples to reach comparable performance. From Fig. 3(b) we can see that LtDaHP and LtRaHP have consistently low training times. In this simulation, GPR has a lower training time when the number of samples is smaller than , but its training time is less predictable due to its more complex algorithm. In addition, SVR always takes more time for training, since there are 3 hyper-parameters to be tuned. Finally, from Fig. 3(c) we can see that LtDaHP and LtRaHP have much smaller model sparsity than SVR, which suggests less time in prediction.
All these simulations support the outperformance of LtDaHP and the theoretical assertions made in the previous sections.
IV-B Real World Benchmark Data Experiments
Table I (column headings): Data sets | Training samples | Testing samples | Attributes
We further apply LtDaHP, LtRaHP, GPR, and SVR to a family of real-world benchmark data sets. We include 10 problems covering different fields (http://archive.ics.uci.edu/ml and https://www.dcc.fc.up.pt/ltorgo/). With the training and testing samples drawn as in Table I, we used 10-fold cross-validation to select all the parameters involved in each algorithm. Then we ran each algorithm independently 50 times and calculated the root mean square error (TestRMSE) of the estimator. We also recorded the corresponding average training time (TrainMT) for each algorithm. To compare testing complexity, we recorded the average number of hidden neurons (MSparsity) involved in LtDaHP, LtRaHP, and SVR. The results are listed in Table II.
We can see from Table II that LtDaHP works well for most of the data sets, exhibiting an almost similar or comparable generalization performance to GPR. Both LtRaHP and SVR failed in certain data sets.
As far as the training time and testing complexity are concerned, LtDaHP and LtRaHP significantly outperform SVR, and are better than GPR when sample size is higher than 10000. Furthermore, we can observe that LtDaHP and LtRaHP always keep a similar training time and testing complexity.
IV-C Real World Massive Data Experiments
In this section we assess the performance of LtDaHP and LtRaHP by applying the algorithms to a real-world massive data set.
The problem we consider is the household electric power consumption data set. The task is to predict the global active power from primary features. The data set contains samples, and is thus a genuinely large-scale problem. We applied LtDaHP and LtRaHP to this problem by dividing the data set into a training set containing samples and a test set containing samples. 10 random partitions of the data were used and the results were recorded in Table III. Under this experimental setting, the RMSE achieved by GPR is , while the deep kernel machine (DKL) can achieve , as reported in .
We only compare the performance of LtDaHP and LtRaHP because of the extremely high computational burden of SVR for such a large scale problem. Both algorithms were applied with neuron number varying from to . We plot the obtained RMSEs of test error as a function of to demonstrate the performances of LtDaHP and LtRaHP in Fig. 4. Fig. 4 shows that for most choices of , LtDaHP performs much better than LtRaHP.
To compare the performance of LtDaHP and LtRaHP further, we ran the algorithms with parameters chosen by the 5-fold cross-validation method. The resultant RMSEs, TrainMT, and MSparsity are shown in Table III.
From Table III we see that LtDaHP performs much better than LtRaHP with respect to the TestRMSE, and that they achieve a similar training time and model sparsity. In addition, LtDaHP achieves a result comparable to DKL and slightly better than GPR. It should be noted, however, that both DKL and GPR lack theoretical guarantees. On the other hand, both LtDaHP and LtRaHP cost more computation time than DKL (which is 3600s as reported in ) due to an additional procedure of parameter tuning. We believe that with some additional precision-promoting techniques, like “divide and conquer” in [25, 5], the performance of LtDaHP can be further improved.
All these simulations and experiments support that, as a new TSL scheme, LtDaHP outperforms LtRaHP in generalization capability and SVR in computational complexity. It is also comparable with GPR in numerical ability but possesses almost optimal theoretical guarantees.
In this paper, we proposed a new TSL scheme: learning through deterministic assignment of hidden parameters (LtDaHP). The main contributions can be summarized as follows:
Borrowing an approximate solution to the classical Tammes problem, we suggested setting the inner weights of FNNs as minimal Riesz -energy points on the sphere and the thresholds as equally spaced points in an interval;
We proved that with the suggested deterministic assignment mechanism of the hidden parameters, LtDaHP achieves an almost optimal generalization bound (learning rate). In particular, it does not degrade the generalization capability of the classical one-stage learning schemes very much.
A series of simulations and application examples were provided to support the correctness of the theoretical assertions and the effectiveness of the LtDaHP scheme.
Additionally, the advantage of LtDaHP over LtRaHP demonstrated in the household electric power prediction problem in Section IV-C shows that LtDaHP may be more effectively applied to practical problems, especially large-scale problems. We finish this section with two additional remarks.
It should be remarked that LtDaHP involves an adjustable parameter that may have a crucial impact on its performance. We have suggested a criterion for the specification of for the logistic function, and the simulations in Section IV have substantiated the validity and effectiveness of such a criterion. This criterion is, however, by no means universal. In other words, it is perhaps inadequate for other activation functions. Thus, how to generally set an appropriate K in implementations of LtDaHP is still open. We leave it for our future study.
The minimal Riesz energy points were approximated by recursive zonal sphere partitioning. However, the EQSP algorithm is not robust in high-dimensional cases, which makes the corresponding LtDaHP scheme unstable when training on high-dimensional data. In future work, we will study data-driven methods that use the sample distribution to determine the hidden parameters.
Appendix: Proof of Theorem 1
We divide the proof of Theorem 1 into five parts. The first part concerns the orthonormal system in . The second part focuses on the ridge representations for polynomials. The third one aims at constructing an FNN in , while the fourth part pursues its approximation ability. In the last part, we analyze the learning rate of (5).
V-A Orthonormal basis for multivariate polynomials on the unit ball
Let be the Gegenbauer polynomial  with index . It is known that the family of polynomials is a complete orthogonal system in the weighted space with and , that is,
Then, is a complete orthonormal system for the weighted space with . With this, we introduce the univariate Sobolev spaces
where It is easy to see that
where denotes the algebraic polynomials defined on of degrees at most .
Denote by and the class of all spherical harmonics of degree and the class of all spherical polynomials with total degrees , respectively. It can be found in  that . Since the dimension of is given by
the dimension of is Let be an arbitrary orthonormal system of . The well known addition formula is given by 
where , and denotes the area element of
For , define
where . Then it follows from  that
is an orthonormal basis for with Based on the orthonormal system, we define the Sobolev space on , denoted by , as the space
where and .
V-B Ridge function representation for multivariate polynomials on the ball
For and , let a discretization quadrature rule
hold exactly for . By the sequence of works in [20, 21], we find that every minimal -energy configuration with is a discretization quadrature rule . The following positive cubature formula can be found in .
If , then there exists a set of numbers such that
We then present the main tool for our analysis in the following proposition.
Let . If is a discretization quadrature rule for spherical polynomials of degree up to , then for arbitrary , there holds
and is a constant depending only on .
We postpone the proof of Proposition 1 to the end of this subsection and first introduce the following lemma concerning important properties of .
Let be defined as above. Then for each we have