Research on artificial feed-forward neural networks (FNNs) has become increasingly active, as FNNs are among the most powerful tools in the field of artificial intelligence. Moreover, they have been widely studied and applied in a great number of engineering fields, such as genetic research, rainfall prediction, the analysis of biological emotion and humanoid robot control. Because an FNN has a formidable capability to approximate complex and nonlinear functions directly through training on samples, a computational model capturing the characteristics of an objective system can be built conveniently without knowing its exact information.
Leshno has proved that multi-layer FNNs with nonpolynomial activation functions are able to approximate any continuous function. In the past few years, the single-hidden-layer feed-forward neural network (SLFN) model has been proposed by Huang et al. [14, 13, 11], which can approximate functions with zero error by learning diverse samples, using at most as many hidden neurons as there are distinct training samples, together with the hidden-layer biases.
For traditional SLFNs, the parameters need to be tuned iteratively to ensure the optimality of the network, and frequent parameter tuning leads to a complicated relationship between the parameters of different layers. The famous back-propagation (BP) algorithm, based on gradient descent, has proved to be the most effective iterative learning algorithm for SLFNs. However, BP-type training algorithms have some inherent weaknesses, such as slow convergence and local minima, which result in inevitably slow operation and poor performance.
To address this problem, Huang et al. proposed and investigated a simple but effective learning algorithm for SLFNs, called the extreme learning machine (ELM). It has been observed that the ELM has unparalleled advantages such as extremely fast training speed and excellent generalization performance [14, 13, 11]. Unlike the traditional BP algorithm, ELM does not tune the input weights and neuron biases; instead, the input weights and hidden-layer biases can be assigned arbitrarily [12, 9].
Although the ELM enjoys promising merits in application prospects, it still has several shortcomings that it cannot easily overcome. The first is that, due to the mechanism of randomly assigned weights, the output of ELM fluctuates and its performance can be unstable. The second is that the hidden-layer matrix may not be of full column rank, or may even be ill-conditioned, when the input weights and biases are generated randomly, which can incur serious computational robustness problems in obtaining the optimal output weights.
Therefore, in order to overcome the aforementioned weaknesses of BP-type FNNs and ELM, in this paper we focus on utilizing Gegenbauer orthogonal polynomials as activation functions and constructing an orthogonal-polynomial FNN with structural simplicity, such as fixed input weights and biases. Such settings have been proved effective in [41, 43]. Further, to avoid lengthy training and determine the optimal output weights directly, a method called weights direct determination (WDD), based on the pseudo-inverse and used in previous works [41, 43], is first reviewed. However, WDD focuses only on empirical risk minimization, which can easily lead to an overly complicated model and a higher risk of over-fitting [32, 3]. Also, the networks proposed in previous works are single-output networks, which is inconsistent with the fact that FNNs with multiple outputs are the most effective way to tackle multi-classification problems.
In this paper, a novel FNN termed the Gegenbauer neural network (GNN) is constructed and investigated based on Gegenbauer polynomial series and polynomial function approximation. Then, based on equality-constrained optimization, a regularized WDD (R-WDD) is proposed, with a regularization term to prevent the model from becoming too complicated. Also, based on the number of network output nodes, a unified R-WDD learning mechanism for binary and multi-class classification problems is given. Besides, with regard to computational scalability and efficiency, two different solutions to R-WDD are given, depending on the scale of the problem and the dimensionality of the hypothesis feature space. Specifically, the main contributions of this paper lie in the following facts.
The main merit of the ELM algorithm is its extremely fast training speed. The proposed GNN has computational complexity similar to that of ELM, mainly involving the inversion of a square matrix. Thus, it has training speed comparable to ELM's.
Compared with ELM, the proposed algorithm is stable and has lower computational risks, since the input weights are fixed and the tensor products of Gegenbauer orthogonal polynomials are utilized as activation functions.
Experiments on several real-world data sets have illustrated that GNN has comparable or even better generalization performance in binary and multi-classification problems, compared with the least squares support vector machine (LS-SVM) with a Gaussian kernel and ELM with a Gaussian kernel.
Experiments have shown that GNN has great computational scalability and efficiency, and still performs very well when the network contains only a few neurons.
Experiments on data sets contaminated with random noises demonstrate the superior classification robustness of GNN.
The remainder of the article is organized as follows. Section II presents and analyzes the theoretical basis of the GNN, and the GNN with single output nodes (SOGNN) and the GNN with multiple output nodes (MOGNN) are constructed and discussed respectively. In Section III, the proposed R-WDD is discussed and analyzed based on equality-constrained optimization. Section IV presents several comparison experiments on LS-SVM, ELM and GNN with R-WDD. Finally, concluding remarks are given in Section V.
II. Theoretical Basis for GNN
In this section, the fundamentals of Gegenbauer Polynomials are first presented. Then, the theoretical basis of Gegenbauer polynomials series (GPS) for multivariate function approximation is briefly analyzed. Furthermore, based on the analysis, the GNN model is constructed and investigated.
II-A Fundamentals of Gegenbauer Polynomials
where $\lambda$ is a parameter of the polynomials satisfying $\lambda > -1/2$, $n$ is the degree of the Gegenbauer polynomial, and $\Gamma(\cdot)$ is the Gamma function. Equivalently, Gegenbauer polynomials can also be defined by the recurrence relation below:
$C_n^{(\lambda)}(x) = \frac{1}{n}\big[2x(n+\lambda-1)C_{n-1}^{(\lambda)}(x) - (n+2\lambda-2)C_{n-2}^{(\lambda)}(x)\big]$, where $C_0^{(\lambda)}(x) = 1$ and $C_1^{(\lambda)}(x) = 2\lambda x$.
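As a concrete illustration, the recurrence above can be evaluated numerically. The sketch below (our own naming; $\lambda > -1/2$ assumed) returns the value of the degree-$n$ Gegenbauer polynomial:

```python
def gegenbauer(n, lam, x):
    """Evaluate the degree-n Gegenbauer polynomial C_n^{(lam)}(x)
    via the three-term recurrence (illustrative sketch; lam > -1/2)."""
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, 2.0 * lam * x  # C_0 and C_1
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * x * (k + lam - 1.0) * c_curr
                                  - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c_curr
```

For `lam = 1` the recurrence reproduces the Chebyshev polynomials of the second kind, and for `lam = 0.5` the Legendre polynomials, which gives a quick sanity check of the implementation.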
II-B GPS for Multivariate Function Approximation
First, a function approximation theorem, i.e., the Weierstrass approximation theorem, is presented for further discussion.
(Weierstrass Approximation Theorem)
Let $f \in C[a, b]$ with $-\infty < a < b < \infty$. Then there exists a sequence of polynomials converging uniformly to $f$ on $[a, b]$:
The theorem can also be generalized to the multivariate case: for a continuous real-valued function defined on a compact subset of $\mathbb{R}^m$, there is a sequence of multivariate polynomials converging uniformly to it on that subset:
Based on the Gegenbauer polynomials, theories of function approximation (i.e., Theorem 1) and polynomial interpolation [16, 43, 40], a univariate continuous function defined on $[-1, 1]$ can be approximated by a GPS whose order is the total number of polynomials used:
where is the coefficient corresponding to , denotes -order GPS and is the error function related to . According to approximation theories  and Theorem 1, the error function could converge to zero when is large enough, which could be expressed as:
Let $f$ be a continuous real-valued function of $m$ variables on $[0, 1]^m$. Then the generalized multivariate Bernstein polynomial of $f$ converges uniformly to $f$ and can be structured as follows:
where with , and is the binomial coefficient (i.e., ).
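For intuition, the univariate counterpart of the Bernstein construction in Lemma 1 is easy to demonstrate in code. The following sketch (our own helper, using the standard Bernstein polynomial on $[0, 1]$) shows the uniform approximation at work:

```python
from math import comb

def bernstein(f, n, x):
    """Degree-n Bernstein polynomial of f on [0, 1] (univariate sketch;
    Lemma 1 in the text is the multivariate generalization)."""
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))
```

Bernstein polynomials reproduce affine functions exactly, and for smooth functions the error shrinks as the degree grows, consistent with the uniform convergence claimed in the lemma.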
According to Lemma 1, for a target function , we have the following result:
Then, on the basis of polynomial interpolation theory, we give the following theorem.
A real-valued function of several variables, continuous on its domain, can be approximated uniformly by a GPS.
where denotes the coefficient corresponding to .
where is the total number of Gegenbauer polynomials utilized to approximate the target function, is the residual error function related to it, is the coefficient for , and , termed the Gegenbauer elementary function (GEF) in this paper, is the tensor product of one-dimensional Gegenbauer polynomials, which can be expressed as:
is the weight for the corresponding GEF, and is the total number of Gegenbauer polynomial tensor products (GEFs) utilized to approximate the target function.
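A hidden-layer activation matrix built from such tensor products could be sketched as follows; the graded ordering of the degree multi-indices and the value of λ are our assumptions, not prescriptions from the paper:

```python
import itertools
from math import prod

def gegenbauer(n, lam, x):
    # Three-term recurrence for C_n^{(lam)}(x).
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, 2.0 * lam * x
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * x * (k + lam - 1.0) * c_curr
                                  - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c_curr

def gef_activations(X, max_total_degree, lam=0.75):
    """Hidden activations: each column is a tensor product of 1-D
    Gegenbauer polynomials, one factor per input dimension, with the
    total degree bounded by max_total_degree (graded ordering assumed)."""
    m = len(X[0])
    degrees = sorted(
        (d for d in itertools.product(range(max_total_degree + 1), repeat=m)
         if sum(d) <= max_total_degree),
        key=sum)
    H = [[prod(gegenbauer(dj, lam, xj) for dj, xj in zip(d, x))
          for d in degrees] for x in X]
    return H, degrees
```

Each row of `H` corresponds to one input sample and each column to one GEF, which is exactly the role the hidden layer plays in the network constructed next.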
II-C Network Architecture
Based on the analysis in Section II-B, the SOGNN and MOGNN are hereby constructed and investigated. The output patterns of both forms are then illustrated.
II-C1 SOGNN Model
According to the theoretical analysis above, the architecture of the SOGNN is constructed. As shown in Fig. 1, the network consists of three layers: an input layer, a hidden layer and an output layer. Let denote the input vector; all input data should be normalized to the interval on which the Gegenbauer polynomials are defined (i.e., $[-1, 1]$). In this model, the neurons in the input layer and the single neuron in the output layer are both chosen to be linear. Besides, the activation functions in the hidden layer are a series of GEFs, and denotes the total number of hidden neurons. Thus, the output of the SOGNN can be expressed as follows:
where denotes the Gegenbauer activation vector with respect to the input, which actually maps the input space to the hidden-layer feature space, and denotes the connecting-weights vector.
If the number of samples is , their outputs can be written in matrix form, i.e.,
Based on Theorem 2 (specifically, Eq. (11) and (12)), with a sufficiently large number of neurons and optimal weights, the network can best approximate the target function. Also, in this model, all the input weights are fixed, and so are the threshold values for all neurons.
Notably, the aforementioned parametric settings can effectively simplify the network architecture, lower the computational complexity, make it more convenient for future implementation on hardware, and have been proved effective in previous works [41, 43].
Even more, such parametric simplification, on the one hand, maintains the approximation capability and optimality of the network, since the network (shown in Fig. 1) with the aforementioned parametric settings matches the theoretical analysis in Eq. (14) and (12) well. On the other hand, such fixed parametric settings can also improve the robustness of the model against disturbances in the input. According to [21, 22], when the input weights and hidden-layer biases are randomly assigned, robustness can be poor, since the change in the output-weights matrix can sometimes be very large when input disturbance is encountered, which results in increased network risk.
Further, with the activation functions being orthogonal elementary functions, our model prevents the hidden activation matrix from being rank-deficient or ill-conditioned, according to orthogonal polynomial theory. Also, with the relevant matrix being invertible, the output weights can be computed directly and easily. Notably, this lays the basis for the computational robustness of the proposed method in Section III-B.
II-C2 MOGNN Model
As shown in Fig. 2, the MOGNN model is constructed based on the SOGNN. To be more specific, a MOGNN with multiple outputs consists of several SOGNNs: the output of the MOGNN is a vector whose th component is the output of the th SOGNN. The parametric settings are the same as those of the SOGNN. The output of the th output node is given below:
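For completeness, once a MOGNN is trained, a K-class decision is commonly read off as the index of the largest output node. The sketch below uses that convention, which is our assumption; the paper does not spell out the decision rule here:

```python
import numpy as np

# With a trained MOGNN, take the predicted class as the index of the
# largest output node (a common convention, assumed here).
outputs = np.array([[0.1, 0.9, -0.2],
                    [0.8, -0.1, 0.3]])   # two samples, K = 3 output nodes
predicted = outputs.argmax(axis=1) + 1   # classes numbered 1..K
```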
III. Proposed Regularized Weights Direct Determination
In this section, a detailed explanation of our proposed R-WDD for GNN is presented. First, the original WDD method is reviewed, after which its main drawback is discussed. Then, based on equality-constrained optimization, a method minimizing the empirical risk and the structural risk, with better generalization ability, is provided and analyzed for the two forms of GNN, respectively. Finally, the unified R-WDD for GNN weights determination under different cases is given.
III-A Review of the Original Weights Direct Determination
For a given GNN model, the objective loss function of WDD is:
where is the label of samples.
With WDD, the solution of the optimal output weights can be calculated by the following equation:
where is the Moore-Penrose generalized inverse of the activation matrix, also termed the pseudo-inverse.
According to , there are several methods to calculate the Moore-Penrose generalized inverse of a matrix: the orthogonal projection method, iterative methods and singular value decomposition (SVD).
Since the WDD directly calculates the least-squares solution, involving the inversion of one matrix, the speed of calculation can be guaranteed. Further, according to previous works [43, 41], networks equipped with WDD have demonstrated outstanding performance, such as excellent approximation ability and easy implementation.
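A minimal sketch of the WDD computation described above, using the SVD-based pseudo-inverse (the variable names are illustrative):

```python
import numpy as np

# WDD sketch: output weights as the least-squares solution w = pinv(H) @ y.
rng = np.random.default_rng(0)
H = rng.standard_normal((100, 10))   # hidden activation matrix (illustrative)
w_true = rng.standard_normal(10)
y = H @ w_true                       # noise-free targets for the demo
w = np.linalg.pinv(H) @ y            # Moore-Penrose pseudo-inverse (SVD-based)
```

With noise-free targets and a full-column-rank activation matrix, the recovered weights match the generating weights exactly, which is the zero-error fitting behavior the original WDD relies on.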
However, although the WDD method has the aforementioned merits, it still has a main drawback. As shown in Eq. (16), WDD focuses only on empirical risk minimization, and no measure is taken to prevent the model hypothesis from becoming too complicated, which can easily lead to an over-fitted model [32, 3].
III-B Equality-Constrained Optimization Problems for GNN
Due to the aforementioned drawback of the original WDD method, in this paper our model seeks not only the smallest training error but also the smallest norm of output weights. According to the famous Bartlett theory, a model with good generalization ability is one that reaches the best trade-off between the empirical risk and the structural risk. Generally, the empirical risk is the training error of the neural network, represented by the $l_2$-norm of the training errors; the structural risk is the $l_2$-norm of the output weights, which reflects the complexity of the model. Thus, the optimization objective and the mathematical model can be formulated as below:
According to ridge regression theory, adding a positive value to the diagonal of the matrix to be inverted makes the solution much more stable and tends to yield better generalization ability. Also, according to statistical learning theory [22, 28], the real risk of a model is the sum of the empirical risk and the structural risk, and these two kinds of risk should be considered together. Since the proposed GNN has two forms (i.e., SOGNN and MOGNN), the solution for each form is discussed in the following, respectively. It will then be seen that the solution for the SOGNN is actually a specific case of that for the MOGNN, so the two solutions are finally unified, which leads to the final solution of R-WDD.
III-B1 SOGNN for Classification
Based on the analysis in Section II, the SOGNN can approximate any continuous target function on the input space. For classification problems, the output of the SOGNN can be made close enough to the target class labels in the corresponding regions. Thus, the objective function of the proposed R-WDD for the SOGNN can be described as below:
where the weight factor is the regularizer. By tuning its value, the proportion between the empirical risk and the structural risk can be adjusted to reach the best trade-off. Based on the KKT theorem, solving the R-WDD optimization problem for the SOGNN is equivalent to solving the following Lagrangian equation:
where denotes the Lagrangian multiplier vector for the equality constraints of Eq. (19). By setting the gradients of the Lagrangian equation with respect to each of the variables equal to zero, we can obtain the following KKT optimality conditions of Eq. (20):
III-B2 MOGNN for Classification
According to Section II-C, the MOGNN is actually a generalized form of the SOGNN. For a multi-class classification problem, the number of output nodes of the MOGNN should equal the number of classes; when the original class label of a sample is given, the expected output of the MOGNN is a vector whose corresponding component is close to the positive target value, while the others are close to the negative one. Thus, the objective function of the proposed R-WDD for the MOGNN can be described as below:
where is the target vector corresponding to the sample, is the error vector, is the output-weights matrix whose th column is the vector of output weights corresponding to the th output node, and is the Frobenius norm.
Similar to SOGNN, the Lagrangian equation for R-WDD optimization for MOGNN is:
where denotes the Lagrangian multiplier matrix. By setting the partial derivatives of Eq. (23) equal to zero respectively, we can also obtain the KKT optimality conditions of Eq. (23) as follows:
It can be seen from Eq. (21) and Eq. (24) that solving the R-WDD optimization for the SOGNN is actually one particular case of solving the R-WDD optimization for the MOGNN, with the number of output nodes fixed to one. To be more specific, some of the matrices in the MOGNN reduce to vectors in the SOGNN. Thus, the weight solutions for the SOGNN and MOGNN can be analyzed together. It is worth mentioning that the size of the activation matrix remains the same in both cases, and it depends only on the number of training samples and the number of neurons.
III-C Proposed Regularized Weights Direct Determination
Based on the aforementioned KKT optimality conditions for the SOGNN and MOGNN, a unified solution for the output weights can be obtained. For efficiency reasons, however, different solutions to the above KKT optimality conditions can be derived based on the size of the training data set.
III-C1 Small Training Data Sets
where is an identity matrix, whose size is the same as that of .
When the number of training samples is not large, the calculation of R-WDD mainly involves the inversion of one matrix. It is worth mentioning that, for the SOGNN, the corresponding quantities are column vectors, while for the MOGNN they become matrices.
III-C2 Huge Training Data Sets
where is an identity matrix of the same size as . Thus, the calculation mainly involves the inversion of a matrix whose size equals the number of hidden neurons, so the speed hinges only on the number of neurons when the number of training samples grows large.
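The two solution forms discussed in this subsection can be checked numerically. The sketch below follows the common ELM-style regularized least-squares formulas (our notation: `H` is the activation matrix, `T` the target matrix, `C` the regularizer) and verifies that both forms agree:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, C = 50, 8, 100.0              # samples, hidden neurons, regularizer (illustrative)
H = rng.standard_normal((N, L))     # hidden activation matrix
T = rng.standard_normal((N, 3))     # targets for 3 output nodes

# Small-data form: invert an N x N matrix.
W_small = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
# Large-data form: invert an L x L matrix.
W_large = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
```

The two expressions coincide by the push-through identity $H^{T}(I/C + HH^{T})^{-1} = (I/C + H^{T}H)^{-1}H^{T}$, so one can always pick whichever inversion is smaller.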
Based on the aforementioned discussion under two circumstances and Eq. (14), the final network output equation is:
As shown in Eq. (29), the calculation of R-WDD always involves only the inversion of a matrix of relatively small size; thus the rapidness of the R-WDD calculation is guaranteed.
IV. Experiments and Analysis
In this section, to verify the performance of the GNN with the proposed R-WDD, we first compare the generalization performance of GNN with different algorithms (i.e., LS-SVM with a Gaussian kernel and ELM with a Gaussian kernel, two other well-known algorithms based on equality-constrained optimization) on real-world benchmark binary and multi-class classification cases. To make it more intuitive, the decision boundaries of the different algorithms are presented to demonstrate the classification capability of the proposed method. Then, we compare the computational scalability of the three algorithms, and the computational efficiency of GNN and ELM. Since a model's sensitivity to hyper parameter selection is also one key feature of an algorithm, experiments between GNN and ELM are conducted to test each model's tolerance for hyper parametric changes. At last, we conduct an experiment in which noises of different amplitudes are added, to test the robustness of the different algorithms to random noises.
The simulations in Subsections IV-B and IV-F are carried out in the MATLAB R2018b environment running on an Intel(R) Core(TM) i5-8520U 1.60 GHz CPU with 8-GB RAM, and the simulations in Subsections IV-D and IV-E are carried out in the MATLAB R2018a environment running on an Intel(R) Core(TM) i7-8700T 2.40 GHz CPU with 16-GB RAM. The codes for LS-SVM and ELM are downloaded from and , respectively.
IV-A Benchmark Data Sets and Hyper Parameter Selection
To verify the classification performance of the different algorithms, 5 binary classification cases and 5 multi-classification cases have been tested in our experiments. All data sets are downloaded from the UCI Machine Learning Repository.
According to , the selection of hyper parameters greatly influences the performance of the algorithms. LS-SVM with the popular Gaussian kernel mainly has two hyper parameters: the cost parameter (preventing the model from becoming too complex) and the kernel parameter. ELM with a Gaussian kernel has three hyper parameters: the cost parameter, the kernel parameter and the number of nodes. The proposed GNN with R-WDD also has three hyper parameters: the regularizer, the parameter of the GPS and the number of neurons. In this paper, the value of the GPS parameter is fixed. Also, the number of nodes for ELM and the proposed GNN with R-WDD is fixed to 1000 for both binary and multi-classification problems.
Specifically, for a binary problem, only one LS-SVM, one single-output ELM and one SOGNN are trained; for a multi-class problem, multiple binary LS-SVMs need to be trained, while for ELM and GNN only one network with multiple output nodes is required. Thus, ELM and GNN are more flexible and applicable mechanisms for multi-classification.
In the experiments, for LS-SVM and ELM, the two hyper parameters are each searched over 31 different values via 4-fold cross-validation grid search, resulting in 961 different combinations. The value of the regularizer for GNN is also searched over 31 different values. At last, the specifications of each experiment (including the divisions of the data sets, the number of features and classes of each data set, and the hyper parameter selections for each algorithm) are reported in TABLES I and II.
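The cross-validated grid search described above could be sketched as follows; `train_eval` is a hypothetical user-supplied routine returning validation accuracy, and all names and grids are illustrative:

```python
import itertools
import numpy as np

def grid_search_4fold(X, y, train_eval, C_grid, sigma_grid, k=4):
    """k-fold cross-validated grid search over (C, sigma) pairs.
    train_eval(X_trn, y_trn, X_val, y_val, C, sigma) -> accuracy."""
    idx = np.arange(len(X))
    np.random.default_rng(0).shuffle(idx)   # reshuffle, as in the experiments
    folds = np.array_split(idx, k)
    best_acc, best_params = -1.0, None
    for C, s in itertools.product(C_grid, sigma_grid):
        accs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(train_eval(X[trn], y[trn], X[val], y[val], C, s))
        mean_acc = float(np.mean(accs))
        if mean_acc > best_acc:
            best_acc, best_params = mean_acc, (C, s)
    return best_acc, best_params
```

With a 31-value grid for each parameter, the outer loop visits 31 × 31 = 961 combinations, matching the count reported in the text.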
Note that for each experiment, fifty trials are conducted to ensure the results are not accidental, and the average results are reported. Besides, all the data are reshuffled at each trial and normalized before the experiments.
IV-B Performance Comparison on Real-World Benchmark Data Sets
The tables present the performance comparisons of LS-SVM with a Gaussian kernel, ELM with a Gaussian kernel and GNN with R-WDD. Note that the simulation results include the mean testing classification accuracy (Acc), the corresponding standard deviation (Std) and the average training time, and the best results are highlighted in boldface.
For binary classification problems, GNN achieves comparable or even better generalization performance than LS-SVM and ELM, with comparably fast learning speed. Take the results on the Banana data set as an instance.
For the Banana data set, the best Acc is achieved by LS-SVM, while the Acc of GNN is only 0.11% lower. Also, GNN demonstrates extremely fast training speed comparable to ELM's, with a difference within 0.01 s.
For multi-classification problems, the solutions achieved by GNN are better than those of LS-SVM and ELM, with higher Acc, and more stable, with lower Std. To be illustrative, take the results on the Glass and Ecoli data sets as examples.
For the Glass data set, the Acc of GNN is much higher, being 5.46% and 4.47% above the results achieved by LS-SVM and ELM, respectively.
For the Ecoli data set, the Acc of GNN is at least 2% higher than those of LS-SVM and ELM, while its Std is much lower. Although the training time is a little longer, the training speed of GNN with R-WDD is still very fast and practical.
IV-C Comparison of Decision Boundaries
Fig. 3: The decision boundaries of each algorithm. It can be observed that all three algorithms classify the nonlinearly separable data set well.
As a more intuitive proof of GNN's classification capability, we also conduct a simple experiment aiming to show the decision boundaries of the different algorithms on nonlinearly separable data classes. The experiment is conducted on a synthetic two-dimensional, two-class data set called double moon, with a pattern of two intertwining moons. Since it is a binary classification problem, the pattern vectors of Class 1 are tagged with "+1" labels and those of Class 2 with "-1" labels.
For this experiment, all the classifiers are trained on a data set with 3000 samples. The decision boundary of each algorithm is shown in Fig. 3 (a)-(c), respectively. Note that this simple experiment is designed to demonstrate the classification ability of each classifier intuitively; thus, all the samples were used for training and no testing accuracy rate is given.
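A generator for such a double-moon set might look like the sketch below; the paper does not specify its exact generator, so the parameters and construction here are illustrative:

```python
import numpy as np

def double_moon(n, radius=1.0, width=0.4, distance=0.2, rng=None):
    """Synthetic 'double moon' data: Class +1 is an upper half-ring,
    Class -1 a lower half-ring shifted horizontally by `radius` and
    vertically by `distance` (all parameter values are illustrative)."""
    rng = rng or np.random.default_rng(0)
    r = radius + width * (rng.random((2, n)) - 0.5)   # ring radii with width
    theta = np.pi * rng.random((2, n))                # half-circle angles
    upper = np.c_[r[0] * np.cos(theta[0]), r[0] * np.sin(theta[0])]
    lower = np.c_[radius + r[1] * np.cos(theta[1]),
                  -r[1] * np.sin(theta[1]) - distance]
    X = np.vstack([upper, lower])
    y = np.concatenate([np.ones(n), -np.ones(n)])     # "+1" / "-1" labels
    return X, y
```

Calling `double_moon(1500)` yields the 3000-sample training set used in this experiment; negative `distance` values make the two moons overlap and the problem harder.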
IV-D Computational Scalability and Efficiency
One of the biggest merits of ELM originates from its excellent computational scalability and efficiency, which means the training speed of ELM is maintained as the scale of the problem grows and the number of hidden neurons increases. In this subsection, experiments on the Liver data set are conducted to verify the computational scalability and efficiency of GNN. To be specific, since the experiments here are designed to evaluate computational scalability and efficiency, we simply amplify the size of the Liver data set by randomly copying samples. To avoid accidental results, each data point plotted is the average value over 100 trials.
IV-D1 Computational Scalability
In the simple example shown in Fig. (4), we can see that GNN has comparable (or even better) computational scalability to ELM. Specifically, according to the analysis in , for LS-SVM the main computational cost comes from calculating the Lagrange multipliers, which is highly associated with the size of the data set. However, since ELM and GNN with R-WDD use a different equation for calculating the output weights when the number of samples is large (refer to Eq. (29)), their computational cost reduces dramatically. Thus, ELM and GNN have better computational scalability than LS-SVM.
IV-D2 Computational Efficiency
The results of the comparison on computational efficiency are shown in Fig. (5): the training times consumed by both ELM and GNN do not increase remarkably as the number of neurons increases. Even more, the training time consumed by GNN grows a bit more slowly than that of ELM, with a smoother slope.
Based on the above two experiments and analysis, it can be seen that GNN still trains extremely fast when the number of samples and the number of neurons become very large. Thus, it can be concluded that GNN has great computational scalability and efficiency with regard to the size of the training data set and the number of neurons.
IV-E Sensitivity to Hyper Parameters
Axiomatically, the selection of hyper parameters is vital to the performance of classification algorithms. However, if a model is not very sensitive to changes of its hyper parameters, i.e., the best generalization performance can be achieved over a wide range of hyper parameter combinations, the model has better utility value, since it requires less user intervention in real-world implementation. Thus, in this subsection, experiments are conducted to test the sensitivity of ELM and GNN to hyper parameters. Even more, the generalization performances of ELM and GNN when the number of neurons is very small are also explored. To avoid accidental results, each data point is the average value over 100 trials, and ELM's Gaussian parameter and GNN's GPS parameter follow the optimal settings listed in Subsection IV-A.
IV-E1 Sensitivity to Hyper Parameters
The experiments are conducted on a binary problem (Liver data set) and a multi-class problem (Vehicle data set) to show GNN's sensitivity to the combinations of its two hyper parameters. Similarly, as a comparison, the same experiments are conducted on ELM to test its sensitivity to the combinations of its two corresponding hyper parameters.
According to the results shown in Fig. (6) and Fig. (7), the best performance of GNN can be achieved over a wider range of hyper parameter combinations. In particular, the performance of GNN is not sensitive to the number of neurons, and GNN can still perform well even when only a few neurons exist in the network. Thus, only the regularizer needs to be specified to achieve the best performance. However, ELM is relatively more sensitive to its hyper parameter combinations, and good performance can be achieved only when the number of neurons is large enough.
IV-E2 Performance When the Number of Neurons Is Small
From Fig. (6) and Fig. (7), we can also see that when neurons are few, the generalization performance of GNN is still good, while the generalization performance of ELM grows slowly as the number of neurons grows. Thus, experiments on the same two data sets are conducted to compare the generalization performance of the two algorithms when the number of neurons is small. The basic experimental settings also follow those in Subsection IV-A, except for the hyper parameter settings of ELM and GNN.
From the results shown in Fig. (8) and Fig. (9), it is clear that GNN still works very well, with high Accs, when the hypothesis feature space is not complicated, while for ELM the performance deteriorates greatly when it lacks enough neurons. Thus, with appropriate values of the regularizer, the approximation capability of GNN is much better than that of ELM, which needs more neurons to ensure its approximation capability. In conclusion, GNN has much better generalization performance than ELM when the network hypothesis feature space is simple (i.e., the number of neurons is small).
Thus, through the two aforementioned experiments, we can conclude that GNN is less sensitive to changes of hyper parameters than ELM. What is more, it still performs excellently when the dimensionality of the hypothesis feature space is small. Thus, the proposed model has great utility value and is easy to implement.
IV-F Comparison of Classification Robustness
To investigate the classification robustness of the proposed GNN with R-WDD, we conduct the following experiments on the same real-world benchmark multi-classification data sets as in Subsection IV-B. We contaminate the data sets by adding random noises of different amplitudes to the training sets. The test sets are kept noise free, since this experiment is designed to evaluate the classification robustness of the algorithms when noises exist. All other experimental settings follow those in Subsection IV-A.
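The contamination step described above amounts to perturbing the normalized training features with zero-mean random noise of a given amplitude; the uniform distribution in the sketch below is our assumption:

```python
import numpy as np

def add_uniform_noise(X_train, amplitude, rng=None):
    """Contaminate training features with zero-mean uniform noise in
    [-amplitude, amplitude]; the test set is left untouched, as in the
    robustness experiment (the noise distribution is assumed)."""
    rng = rng or np.random.default_rng(0)
    noise = amplitude * (2.0 * rng.random(X_train.shape) - 1.0)
    return X_train + noise
```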
TABLES V and VI show the Accs of the three algorithms on the different data sets under the two noise amplitudes, respectively. Also, we compare the results in TABLE IV with those in TABLES V-VI and calculate the changes in Acc before and after the noises are added, to show the influence of random noises on the performance of the algorithms.
The performances of all algorithms deteriorate as random noises are added. However, the decreases in the Accs of GNN are much smaller than those of LS-SVM and ELM. To be illustrative, we take the Wine data set as an example.
For the Wine data set, at the lower noise amplitude the decrease of GNN in Acc is only 0.77%, while those of LS-SVM and ELM are 2.82% and 1.91%, respectively; when the amplitude grows, the decreases of LS-SVM and ELM in Acc are 4.90% and 2.54%, while that of GNN is only 0.33%, i.e., 1/15 and 1/8 of the decreases of LS-SVM and ELM. Besides, the Acc of GNN is at least 2% higher than those of LS-SVM and ELM when the amplitude of noises is 10%.
Based on the above analysis, it can be concluded that the classification robustness of GNN with R-WDD is higher than that of LS-SVM and ELM.
V. Conclusion
In this paper, to tackle the slow operation of BP-type algorithms and computational robustness problems, a novel GNN model with structural simplicity has been proposed and investigated, based on GPS and polynomial function approximation. Then, in order to overcome the disadvantage of easy over-fitting and improve the model's generalization performance, an R-WDD method, based on equality-constrained optimization, has been proposed to minimize the network's empirical risk and structural risk. For better computational scalability and efficiency, two solutions to the R-WDD optimization are given, depending on the scale of the problem and the complexity of the model.
As verification, comparison experiments on various real-world benchmark data sets have shown that GNN has comparable (or better) generalization performance, especially on multi-classification problems, compared with LS-SVM and ELM. Besides, the excellent computational scalability and efficiency of GNN have been confirmed in the comparison experiments. As for the hyper-parameter experiments, less sensitivity of GNN to the combinations of hyper-parameters has been observed. In particular, the generalization ability of GNN hardly relies on the value of , and good performance can be achieved even when the dimensionality of the hypothesis feature space is very small. Finally, the experiments on contaminated data sets also verify the classification robustness of the proposed method.
-  (1992) Improving convergence of back-propagation by handling flat-spots in the output layer. Artificial Neural Networks, pp. 1003–1009. Cited by: §I.
-  (1971) Generalized inverse of matrices and its applications. Cited by: §III-A.
-  (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory 44 (2), pp. 525–536. Cited by: §I, §III-A, §III-B.
-  (2006) Extension of bernstein polynomials to infinite dimensional case. Journal of Approximation Theory 140 (2), pp. 191–202. Cited by: §II-B.
-  (2004) Orthogonal polynomials: computation and approximation. In Numerical Mathematics & Scientific Computation, Cited by: §II-C1.
-  (2005) Proof of weierstrass approximation theorem using band-limited functions. Proceedings of the IEEE 61 (4), pp. 512–512. Cited by: §II-B.
-  (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §III-B.
-  (2013) Robust extreme learning machine. Neurocomputing 102 (2), pp. 31–44. Cited by: §II-C1.
-  (2012) Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B 42 (2), pp. 513–529. Cited by: §I, §IV-A, §IV-D1.
-  Note: School of Electric. and Electro. Engi., Nanyang Tech. Univ., Singapore [Online]. http://www.ntu.edu.sg/home/egbhuang/elm_codes.html Cited by: §IV.
-  (1998) Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks 9 (1), pp. 224–229. Cited by: §I, §I.
-  (2003) Universal approximation using incremental feedforward networks with arbitrary input weights. Revised and resubmitted to IEEE Transactions on Neural Networks. Cited by: §I.
-  (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 985–990. Cited by: §I, §I.
-  (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1), pp. 489–501. Cited by: §I, §I.
-  (2008) Direct and converse results for multivariate generalized bernstein polynomials. Cited by: §II-B, §II-B.
-  (2015) Dynamic load identification for stochastic structures based on gegenbauer polynomial approximation and regularization method. Mechanical Systems & Signal Processing 56-57, pp. 35–54. Cited by: §II-B.
-  (2010) . In Intelligent Computation Technology and Automation (ICICTA), 2010 International Conference on, Vol. 1, pp. 655–658. Cited by: §I.
-  (2014) ERNN: a biologically inspired feedforward neural network to discriminate emotion from EEG signal. IEEE Transactions on Neural Networks and Learning Systems 25 (3), pp. 609–620. Cited by: §I.
-  (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks 6 (6), pp. 861–867. Cited by: §I.
-  (2014) Construction of linear dynamic gene regulatory network based on feedforward neural network. In Natural Computation (ICNC), 2014 10th International Conference on, pp. 99–107. Cited by: §I.
-  (2011) A new robust training algorithm for a class of single-hidden layer feedforward neural networks. Neurocomputing 74 (16), pp. 2491–2501. Cited by: §II-C1.
-  (1999) Neural network learning: theoretical foundations. AI Magazine 22 (2), pp. 99–100. Cited by: §II-C1, §III-B.
-  (2014) Rainfall prediction in kemayoran jakarta using hybrid genetic algorithm (GA) and partially connected feedforward neural network (PCFNN). In Information and Communication Technology (ICoICT), 2014 2nd International Conference on, pp. 166–171. Cited by: §I.
-  LS-SVMlab toolbox. Note: Dept. Elect. Eng., ESAT-SCD-SISTA, Leuven, Belgium, 2002 [Online]. http://www.esat.kuleuven.be/sista/lssvmlab/ Cited by: §IV.
-  (1986) Learning representations by back-propagating errors. Nature 323 (3), pp. 533–536. Cited by: §I.
-  (2014) Orthogonal polynomials of several variables. Journal of Approximation Theory 112 (3), pp. 318–319. Cited by: §II-B.
-  (2012) Fast evolutionary programming-based hybrid multi-layer feedforward neural network for predicting grid-connected photovoltaic system output. In Signal Processing and its Applications (CSPA), 2012 IEEE 8th International Colloquium on, pp. 44–47. Cited by: §I.
-  (2002) Weighted least squares support vector machines : robustness and sparse approximation. Neurocomputing 48 (1), pp. 85–105. Cited by: §III-B.
-  (1999) Least squares support vector machine classifiers. Neural Processing Letters 9 (3), pp. 293–300. Cited by: §III-B.
-  (1980) Practical methods of optimization, volume 2: constrained optimization (r. fletcher). Cited by: §III-B1.
-  UCI repository of machine learning databases. Note: Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, 1998 [Online].http://www.ics.uci.edu/~mlearn/MLRepository.html Cited by: §IV-A.
-  (1995) The nature of statistical learning theory. Cited by: §I, §III-A.
-  (2011) A study on effectiveness of extreme learning machine. Neurocomputing 74 (16), pp. 2483–2490. Cited by: §I.
-  (2013) Interval uncertain method for multibody mechanical systems using chebyshev inclusion functions. International Journal for Numerical Methods in Engineering 95 (7), pp. 608–630. Cited by: §II-A.
-  (2014) Uncertainty propagation in SEA for structural-acoustic coupled systems with non-deterministic parameters. Journal of Sound & Vibration 333 (17), pp. 3949–3965. Cited by: §II-A.
-  A unified method for the response analysis of interval/random variable models of acoustic fields with uncertain-but-bounded parameters. International Journal for Numerical Methods in Engineering 111. Cited by: §II-A.
-  (2006) A new fast learning algorithm for multi-layer feedforward neural networks. In Machine Learning and Cybernetics, 2006 International Conference on, pp. 2928–2934. Cited by: §I.
-  (2016) A survey of randomized algorithms for training neural networks. Information Sciences 364, pp. 146–155. Cited by: §I.
-  (2009) Gegenbauer orthogonal basis neural network and its weights-direct-determination method. Electronics Letters 45 (23), pp. 1184. Cited by: §II-A.
-  (2008) A weights-directly-determined simple neural network for nonlinear system identification. In IEEE International Conference on Fuzzy Systems, Cited by: §II-B.
-  (2014) Weights and structure determination of multiple-input feed-forward neural network activated by chebyshev polynomials of class 2 via cross-validation. Neural Computing & Applications 25 (7-8), pp. 1761–1770. Cited by: §I, §II-B, §II-B, §II-C1, §III-A, §III-A.
-  (2015) Neural-dynamic-method-based dual-arm cmg scheme with time-varying constraints applied to humanoid robots. IEEE Transactions on Neural Networks & Learning Systems 26 (12), pp. 3251–3262. Cited by: §I.
-  Two-input power-activation neural network weights-direct-determination and structure optimized by particle swarm optimization. In the Ninth International Conference on Intelligent Control and Information Processing (ICICIP), pp. 191–198. Cited by: §I, §II-B, §II-C1, §III-A, §III-A.