A Gegenbauer Neural Network with Regularized Weights Direct Determination for Classification

10/25/2019, by Jie He, et al.

Single-hidden-layer feed-forward neural networks (SLFNs) are widely used in pattern classification problems, but a major bottleneck is the slow speed and poor performance of traditional iterative gradient-based learning algorithms. Although the famous extreme learning machine (ELM) has successfully addressed the problem of slow convergence, it still suffers from computational robustness problems caused by randomly assigned input weights and biases. Thus, in order to overcome the aforementioned problems, in this paper a novel type of neural network based on Gegenbauer orthogonal polynomials, termed GNN, is constructed and investigated. This model overcomes the computational robustness problems of ELM while retaining comparable structural simplicity and approximation capability. Based on this, we propose a regularized weights direct determination (R-WDD) method, based on equality-constrained optimization, to determine the optimal output weights. The R-WDD minimizes both the empirical risk and the structural risk of the network, thus lowering the risk of over-fitting and improving the generalization ability. This leads to the final GNN with R-WDD, which is a unified learning mechanism for binary and multi-class classification problems. Finally, as verified in various comparison experiments, GNN with R-WDD achieves comparable (or even better) generalization performance, computational scalability and efficiency, and classification robustness, compared with the least-squares support vector machine (LS-SVM) and ELM with Gaussian kernel.


I Introduction

Research on artificial feed-forward neural networks (FNNs) has become increasingly active and popular, as FNNs are among the most powerful tools in the artificial intelligence field. Moreover, they have been widely studied and applied in a great number of engineering fields, such as genetic research [20], the prediction of rainfall [23], the analysis of biological emotion [18] and humanoid robot control [42]. Because an FNN has a formidable capability to approximate complex and nonlinear functions directly through training on samples, a computational model capturing the system characteristics can be built conveniently by training on input samples, without knowing the exact information of the objective system.

For the development of this field, many researchers have explored the abilities of multi-layer FNNs [37, 27, 17]. Leshno et al. have proved that multi-layer FNNs with nonpolynomial activation functions are able to approximate any function [19]. In the past few years, the single-hidden-layer feed-forward neural network (SLFN) model has been investigated by Huang et al. [14, 13, 11], which can approximate training samples with zero error by learning diverse samples; an SLFN with at most N hidden neurons and arbitrarily chosen hidden-layer biases can learn N distinct samples exactly.

For traditional SLFNs, the parameters need to be tuned iteratively to ensure the optimality of the network, and frequent parameter tuning leads to a complicated relationship between the parameters of different layers. As a solution, the famous back-propagation (BP) algorithm based on gradient descent has been proved to be one of the most effective iterative learning algorithms for SLFNs [25]. However, BP-type training algorithms have some inherent weaknesses, such as slow convergence [1] and local minima [38], resulting in inevitably slow operation and poor performance.

To address this problem, Huang et al. proposed and investigated a simple but effective learning algorithm for SLFNs, called the extreme learning machine (ELM). It has been observed that ELM has unparalleled advantages, such as an extremely fast training speed and excellent generalization performance [14, 13, 11]. Unlike the traditional BP algorithm, ELM does not tune the input weights and the hidden neuron biases; the input weights and hidden-layer biases can be arbitrarily assigned [12, 9].

Although ELM enjoys promising merits in application prospects, it still has several shortcomings that are not easily overcome. The first is that, due to the mechanism of randomly assigning weights, the output of ELM fluctuates and its performance can be unstable. The second problem is that the hidden-layer matrix may not be of full column rank, or may even be ill-conditioned, when the input weights and biases are generated randomly, which can incur serious computational robustness problems in obtaining the optimal output weights [33].

Therefore, in order to overcome the aforementioned weaknesses of BP-type FNNs and ELM, in this paper we focus on utilizing Gegenbauer orthogonal polynomials as activation functions and constructing an orthogonal-polynomial FNN with structural simplicity, such as fixed input weights and biases. Such settings have been proved effective in [41, 43]. Further, to avoid lengthy training and determine the optimal output weights directly, a method called weights direct determination (WDD), based on the pseudo-inverse and used in previous works [41, 43], is first reviewed. However, WDD focuses only on empirical risk minimization, which might easily lead to an overly complicated model and a higher risk of over-fitting [32, 3]. Also, in previous works, the proposed networks are single-output networks, which is not consistent with the fact that FNNs with multiple outputs are the most effective way to tackle multi-class classification problems.

In this paper, a novel FNN termed the Gegenbauer neural network (GNN) is constructed and investigated based on Gegenbauer polynomial series and polynomial function approximation. Then, based on equality-constrained optimization, a regularized WDD (R-WDD) is proposed, with a regularization term to prevent the model from becoming too complicated. Also, based on the number of network output nodes, a unified learning mechanism of R-WDD for binary and multi-class classification problems is given. Besides, with regard to computational scalability and efficiency, two different solutions to R-WDD are given, depending on the scale of the problem and the dimensionality of the hypothesis feature space. Specifically, the main contributions of this paper lie in the following facts.

  • The main merit of the ELM algorithm is its extremely fast training speed. In this paper, the proposed GNN has a computational complexity similar to that of ELM, mainly involving the inversion of a square matrix. Thus, it has a comparably excellent training speed.

  • Compared with ELM, the proposed algorithm is stable and has lower computational risks, since the input weights are fixed and the tensor products of Gegenbauer orthogonal polynomials are utilized as activation functions.

  • Experiments on several real-world data sets have illustrated that the GNN has comparable or even better generalization performance in binary and multi-class classification problems, compared with the least-squares support vector machine (LS-SVM) with Gaussian kernel and ELM with Gaussian kernel.

  • Experiments have shown that GNN has great computational scalability and efficiency, and still performs very well when there are only very few neurons in the network.

  • Experiments on data sets contaminated with random noises demonstrate the better classification robustness of GNN.

The remainder of the article is organized as follows. Section II presents and analyzes the theoretical basis of the GNN, and the GNN with a single output node (SOGNN) and the GNN with multiple output nodes (MOGNN) are constructed and discussed, respectively. In Section III, the proposed R-WDD is discussed and analyzed based on equality-constrained optimization. Section IV presents several comparison experiments on LS-SVM, ELM and GNN with R-WDD. Final remarks are given in Section V.

II Theoretical Basis for GNN

In this section, the fundamentals of Gegenbauer Polynomials are first presented. Then, the theoretical basis of Gegenbauer polynomials series (GPS) for multivariate function approximation is briefly analyzed. Furthermore, based on the analysis, the GNN model is constructed and investigated.

II-A Fundamentals of Gegenbauer Polynomials

Definition 1. The general expression of the Gegenbauer polynomials can be defined as below [39, 36]:

$$C_n^{(\lambda)}(x)=\sum_{k=0}^{\lfloor n/2\rfloor}\frac{(-1)^{k}\,\Gamma(n-k+\lambda)}{\Gamma(\lambda)\,k!\,(n-2k)!}\,(2x)^{n-2k},\qquad(1)$$

where $\lambda$ is a parameter of the polynomials that satisfies $\lambda>-1/2$, $\lambda\neq 0$, $n$ is the degree of the Gegenbauer polynomial, and $\Gamma(\cdot)$ is the Gamma function. Equivalently, the Gegenbauer polynomials can also be defined by the recurrence relation below [36]:

$$C_n^{(\lambda)}(x)=\frac{2x\,(n+\lambda-1)\,C_{n-1}^{(\lambda)}(x)-(n+2\lambda-2)\,C_{n-2}^{(\lambda)}(x)}{n},\qquad(2)$$

where $C_0^{(\lambda)}(x)=1$ and $C_1^{(\lambda)}(x)=2\lambda x$.

According to orthogonal polynomial theories [34, 35], the Gegenbauer polynomials are orthogonal with respect to the weight function $w(x)=(1-x^{2})^{\lambda-1/2}$ on the interval $[-1,1]$, and have the orthogonality property below:

$$\int_{-1}^{1}C_p^{(\lambda)}(x)\,C_q^{(\lambda)}(x)\,(1-x^{2})^{\lambda-1/2}\,dx=
\begin{cases}
0, & p\neq q,\\[4pt]
\dfrac{\pi\,2^{1-2\lambda}\,\Gamma(p+2\lambda)}{p!\,(p+\lambda)\,[\Gamma(\lambda)]^{2}}, & p=q,
\end{cases}\qquad(3)$$

where $p,q\in\mathbb{N}$ denote polynomial degrees and $\lambda>-1/2$ with $\lambda\neq 0$.
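For illustration, a minimal NumPy sketch of Eq. (2) and (3) is given below; the function name and the numerical check are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def gegenbauer(n, lam, x):
    """Evaluate the Gegenbauer polynomial C_n^(lambda)(x) via the
    three-term recurrence of Eq. (2); x may be a scalar or an array."""
    x = np.asarray(x, dtype=float)
    c_prev = np.ones_like(x)          # C_0(x) = 1
    if n == 0:
        return c_prev
    c_curr = 2.0 * lam * x            # C_1(x) = 2*lambda*x
    for k in range(2, n + 1):
        c_next = (2.0 * x * (k + lam - 1.0) * c_curr
                  - (k + 2.0 * lam - 2.0) * c_prev) / k
        c_prev, c_curr = c_curr, c_next
    return c_curr

# Numerical check of Eq. (3): the weighted inner product of two Gegenbauer
# polynomials of different degrees on [-1, 1] is approximately zero.
lam = 0.5                              # lambda = 1/2 reduces to the Legendre case
xs = np.linspace(-1.0 + 1e-6, 1.0 - 1e-6, 20001)
w = (1.0 - xs ** 2) ** (lam - 0.5)
dx = xs[1] - xs[0]
inner = float(np.sum(gegenbauer(2, lam, xs) * gegenbauer(4, lam, xs) * w) * dx)
print(f"<C_2, C_4>_w = {inner:.6f}")   # close to zero, as Eq. (3) predicts
```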

II-B GPS for Multivariate Function Approximation

First, a function approximation theorem, i.e., the Weierstrass approximation theorem, is presented for the further discussion.

Theorem 1.

(Weierstrass Approximation Theorem)
Let $f\in C([a,b])$ with $a<b$. Then there exists a sequence of polynomials $\{p_n(x)\}$ converging uniformly to $f$ on $[a,b]$:

$$\lim_{n\to\infty}\,\max_{x\in[a,b]}\bigl|p_n(x)-f(x)\bigr|=0,\qquad(4)$$

with $n\in\mathbb{N}$. The theorem can also be generalized to the multivariate case. Set $D=[a_1,b_1]\times\cdots\times[a_m,b_m]\subset\mathbb{R}^{m}$ and let $f\in C(D)$. Then there is a sequence of multivariate polynomials $\{p_n(\mathbf{x})\}$ converging uniformly to $f$ on $D$:

$$\lim_{n\to\infty}\,\max_{\mathbf{x}\in D}\bigl|p_n(\mathbf{x})-f(\mathbf{x})\bigr|=0,\qquad(5)$$

with $n\in\mathbb{N}$.

Proof:

The proof of the famous Weierstrass approximation theorem is given in [6]. Thus, the proof of Theorem 1 is completed. ∎

Based on the Gegenbauer polynomials and the theories of function approximation, i.e., Theorem 1, and polynomial interpolation [16, 43, 40], a univariate continuous function $f(x)$ defined on $[-1,1]$ can be approximated by the $K$-order GPS (where $K$ is the total number of polynomials used):

$$f(x)=\sum_{i=0}^{K-1}w_i\,C_i^{(\lambda)}(x)+\varepsilon_K(x),\qquad(6)$$

where $w_i$ is the coefficient corresponding to $C_i^{(\lambda)}(x)$, $\sum_{i=0}^{K-1}w_i C_i^{(\lambda)}(x)$ denotes the $K$-order GPS and $\varepsilon_K(x)$ is the error function related to $K$. According to approximation theories [41] and Theorem 1, the error function converges to zero when $K$ is large enough, which can be expressed as:

$$\lim_{K\to\infty}\,\max_{x\in[-1,1]}\bigl|\varepsilon_K(x)\bigr|=0.\qquad(7)$$
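To make Eq. (6) and (7) concrete, the short sketch below fits a $K$-term GPS to an arbitrary smooth test function by ordinary least squares, using SciPy's `eval_gegenbauer`; the test function, grid and names are illustrative assumptions, and the reported residual shrinks as $K$ grows.

```python
import numpy as np
from scipy.special import eval_gegenbauer

lam, K = 0.5, 10
x = np.linspace(-1.0, 1.0, 400)
f = np.exp(x) * np.sin(3.0 * x)        # arbitrary continuous test function

# Design matrix: column i holds C_i^(lambda)(x)
G = np.column_stack([eval_gegenbauer(i, lam, x) for i in range(K)])
coeffs, *_ = np.linalg.lstsq(G, f, rcond=None)

residual = np.max(np.abs(G @ coeffs - f))
print(f"max |epsilon_K(x)| on the grid: {residual:.2e}")   # shrinks as K grows
```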

For the function approximation by GPS in the $m$-dimensional case, we have the following lemma [15, 4].

Lemma 1.

Let $U_m=[0,1]^{m}$ and let $f$ be a continuous real-valued function with $m$ variables on $U_m$ (i.e., $f\in C(U_m)$). Then $B_n(f;\mathbf{x})$, the generalized multivariate Bernstein polynomial of $f$, converges uniformly to $f$ and can be structured as follows:

$$B_n(f;\mathbf{x})=\sum_{k_1=0}^{n}\cdots\sum_{k_m=0}^{n} f\!\Bigl(\frac{k_1}{n},\ldots,\frac{k_m}{n}\Bigr)\prod_{j=1}^{m}\binom{n}{k_j}x_j^{k_j}(1-x_j)^{n-k_j},\qquad(8)$$

where $\mathbf{x}=[x_1,\ldots,x_m]^{\mathrm T}$ with $x_j\in[0,1]$, and $\binom{n}{k_j}$ is the binomial coefficient (i.e., $\binom{n}{k_j}=\frac{n!}{k_j!\,(n-k_j)!}$).

According to Lemma 1, for a target function $f(\mathbf{x})$, we have the following result:

$$\lim_{n\to\infty}\,\max_{\mathbf{x}\in U_m}\bigl|B_n(f;\mathbf{x})-f(\mathbf{x})\bigr|=0.\qquad(9)$$

Then, on the basis of polynomial interpolation theory [15], we give the following theorem.

Theorem 2.

A real-valued function $f$ with $m$ variables continuous on $[-1,1]^{m}$ can be approximated by GPS uniformly.

Proof:

Without loss of generality, in light of Eq. (6) and (7), each univariate factor of the polynomial in Eq. (8), being continuous on the working interval, can be approximated by GPS as follows:

$$g(x_j)=\sum_{i=0}^{K-1}w_i\,C_i^{(\lambda)}(x_j)+\varepsilon_K(x_j),\qquad(10)$$

where $w_i$ denotes the coefficient corresponding to $C_i^{(\lambda)}(x_j)$.

Then, in light of the lemma given above, we can substitute Eq. (10) into Eq. (8) and collect terms:

$$f(\mathbf{x})=\sum_{k=1}^{L}w_k\,\psi_k(\mathbf{x})+\varepsilon_L(\mathbf{x}),\qquad(11)$$

where $L$ is the total amount of Gegenbauer polynomial tensor products utilized to approximate $f$, $\varepsilon_L(\mathbf{x})$ is the residual error function related to $f$, $w_k$ is the weight for $\psi_k(\mathbf{x})$, and $\psi_k(\mathbf{x})$, termed the Gegenbauer elementary function (GEF) in this paper, is the $m$-order tensor product of the one-dimensional Gegenbauer polynomials, which can be expressed as:

$$\psi_k(\mathbf{x})=C_{k_1}^{(\lambda)}(x_1)\,C_{k_2}^{(\lambda)}(x_2)\cdots C_{k_m}^{(\lambda)}(x_m).$$

Besides, according to Eq. (10) and Eq. (7), the approximation residual error function of $f$ using GPS converges to zero:

$$\lim_{L\to\infty}\,\max_{\mathbf{x}}\bigl|\varepsilon_L(\mathbf{x})\bigr|=0.\qquad(12)$$

Thus, the proof of GPS for multivariate function approximation is completed. ∎

Note that $L$ should be large enough to guarantee the approximation capability of the GPS. Therefore, according to Theorem 2, a series of GEFs $\{\psi_1(\mathbf{x}),\ldots,\psi_L(\mathbf{x})\}$, sorted in graded lexicographic order [26, 41], with optimal weights $\{w_1,\ldots,w_L\}$ can minimize the residual error function and best approximate the function $f$.

II-C Network Architecture

Based on the analysis in Section II-B, the SOGNN and MOGNN are hereby constructed and investigated. The output patterns of both forms are then illustrated.

II-C1 SOGNN Model

Fig. 1: The network architecture of the SOGNN.

According to the theoretical analysis above, the architecture of the SOGNN is constructed. As shown in Fig. 1, the network architecture of the SOGNN consists of three layers: the input layer, the hidden layer and the output layer. Let $\mathbf{x}=[x_1,\ldots,x_m]^{\mathrm T}$ denote the $m$-dimensional input vector; all input data should be normalized to $[-1,1]^{m}$. In this model, the neurons in the input layer and the single neuron in the output layer are both chosen to be linear. Besides, the activation functions in the hidden layer are a series of GEFs $\{\psi_1(\mathbf{x}),\ldots,\psi_L(\mathbf{x})\}$, where $L$ denotes the total amount of hidden neurons. Thus, the output of the SOGNN can be expressed as follows:

$$y=\boldsymbol{\psi}(\mathbf{x})\,\mathbf{w}=\sum_{k=1}^{L}w_k\,\psi_k(\mathbf{x}),\qquad(13)$$

where $\boldsymbol{\psi}(\mathbf{x})=[\psi_1(\mathbf{x}),\ldots,\psi_L(\mathbf{x})]$ denotes the Gegenbauer activation vector with respect to $\mathbf{x}$, which actually maps the $m$-dimensional input space to the $L$-dimensional hidden-layer feature space, and $\mathbf{w}=[w_1,\ldots,w_L]^{\mathrm T}$ denotes the connecting (output) weights vector.

If the number of samples is $N$, their outputs can be written in matrix form, i.e.,

$$\mathbf{y}=\Psi\,\mathbf{w},\qquad(14)$$

where $\mathbf{y}=[y_1,\ldots,y_N]^{\mathrm T}$ and

$$\Psi=\begin{bmatrix}\boldsymbol{\psi}(\mathbf{x}_1)\\ \vdots\\ \boldsymbol{\psi}(\mathbf{x}_N)\end{bmatrix}
=\begin{bmatrix}\psi_1(\mathbf{x}_1)&\cdots&\psi_L(\mathbf{x}_1)\\ \vdots&\ddots&\vdots\\ \psi_1(\mathbf{x}_N)&\cdots&\psi_L(\mathbf{x}_N)\end{bmatrix}\in\mathbb{R}^{N\times L}.$$

Based on Theorem 2 (specifically, Eq. (11) and (12)), with $L$ large enough and optimal weights $\mathbf{w}$, $\Psi\mathbf{w}$ can best approximate the target function. Also, in this model, all the input weights are fixed to 1, and the threshold values (biases) of all hidden neurons are fixed to 0.
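A minimal sketch of the forward pass in Eq. (13) and (14) is given below. The multi-index generation, the degree cap, the truncation to the first $L$ tensor products and all names are illustrative assumptions, and the output weights are random placeholders (they are determined by the R-WDD of Section III).

```python
import itertools
import numpy as np
from scipy.special import eval_gegenbauer

def gef_indices(m, L, max_degree=6):
    """First L multi-indices (k_1, ..., k_m), ordered by total degree
    and then lexicographically (an approximation of graded lex order)."""
    idx = [k for d in range(max_degree + 1)
           for k in itertools.product(range(d + 1), repeat=m) if sum(k) == d]
    return idx[:L]

def activation_matrix(X, indices, lam):
    """Hidden-layer matrix Psi of Eq. (14): Psi[i, k] = prod_j C_{k_j}(x_{ij})."""
    N, m = X.shape
    Psi = np.ones((N, len(indices)))
    for k, multi in enumerate(indices):
        for j, deg in enumerate(multi):
            Psi[:, k] *= eval_gegenbauer(deg, lam, X[:, j])
    return Psi

# Toy usage: 5 samples, 2 features normalized to [-1, 1], L = 10 GEF neurons
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5, 2))
indices = gef_indices(m=2, L=10)
Psi = activation_matrix(X, indices, lam=0.5)
w = rng.standard_normal(10)            # placeholder output weights
y = Psi @ w                            # SOGNN outputs, Eq. (13)/(14)
print(Psi.shape, y.shape)              # (5, 10) (5,)
```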

Notably, the aforementioned parametric settings can effectively simplify the network architecture, lower the computational complexity, make it more convenient for future implementation on hardware, and have been proved effective in previous works [41, 43].

Moreover, such parametric simplification, on the one hand, maintains the approximation capability and the optimality of the network, since the network (shown in Fig. 1) with the aforementioned parametric settings well matches the theoretical analysis in Eq. (14) and (12). On the other hand, such fixed parametric settings also improve the robustness of the model when faced with disturbances in the input. According to [21, 22], when the input weights and hidden-layer biases are randomly assigned, the robustness can be poor, since the change of the output weights matrix can sometimes be very large when an input disturbance is encountered, which results in an increased risk for the network.

Further, with the activation functions being orthogonal elementary functions, our model prevents the hidden activation matrix $\Psi$ from being not of full column rank or ill-conditioned, according to orthogonal polynomial theory [5]. Also, with $\Psi^{\mathrm T}\Psi$ being invertible, the direct computation of $\mathbf{w}$ can be easily obtained. Notably, this sets the basis of the computational robustness of the proposed method in Section III-B [8].

II-C2 MOGNN Model

Fig. 2: The network architecture of the MOGNN.

As shown in Fig. 2, the MOGNN model is constructed based on the SOGNN. To be more specific, the MOGNN with $c$ outputs consists of $c$ SOGNNs. The output of the MOGNN is a $c$-dimensional vector $\mathbf{y}=[y_1,\ldots,y_c]^{\mathrm T}$, where the $j$th output of the MOGNN is the output of the $j$th SOGNN. The parametric settings are the same as those of the SOGNN. The output of the $j$th output node is given below:

$$y_j=\boldsymbol{\psi}(\mathbf{x})\,\mathbf{w}_j=\sum_{k=1}^{L}w_{kj}\,\psi_k(\mathbf{x}),\qquad j=1,\ldots,c.\qquad(15)$$

III Proposed Regularized Weights Direct Determination

In this section, a detailed explanation of our proposed R-WDD for GNN is presented. First, the original WDD method is reviewed, after which its main drawback is discussed. Then, based on equality-constrained optimization, a method that minimizes the empirical risk and the structural risk, with better generalization ability, is provided, and its analysis is carried out for the two forms of GNN, respectively. Finally, the unified R-WDD for determining the GNN weights under the different cases is given.

III-A Review of the Original Weights Direct Determination

According to the previous works in [43, 41], for a given GNN model, the objective loss function of WDD is:

$$\min_{\mathbf{w}}\;E=\|\Psi\mathbf{w}-\mathbf{y}\|_2^{2},\qquad(16)$$

where $\mathbf{y}=[y_1,\ldots,y_N]^{\mathrm T}$ is the vector of sample labels.

With WDD, the optimal output weights can be calculated by the following equation:

$$\mathbf{w}=\Psi^{\dagger}\,\mathbf{y},\qquad(17)$$

where $\Psi^{\dagger}$ is the Moore-Penrose generalized inverse of the activation matrix $\Psi$, also termed the pseudo-inverse.

According to [2], there are several methods to calculate the Moore-Penrose generalized inverse of a matrix: the orthogonal projection method, iterative methods, and the singular value decomposition (SVD).

Since WDD directly calculates the least-squares solution, involving only the inversion of a matrix, the speed of the calculation is guaranteed. Further, according to the previous works [43, 41], the networks equipped with WDD have demonstrated outstanding performance, such as excellent approximation ability and easy implementation.
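As a sketch (with stand-in data and illustrative names), the original WDD of Eq. (17) is a one-line computation once the activation matrix and the label vector are available:

```python
import numpy as np

rng = np.random.default_rng(1)
Psi = rng.standard_normal((200, 30))     # stand-in activation matrix (N = 200, L = 30)
y = np.sign(rng.standard_normal(200))    # stand-in +/-1 labels

w_wdd = np.linalg.pinv(Psi) @ y          # Eq. (17): w = pinv(Psi) y (SVD-based pseudo-inverse)
print(w_wdd.shape)                       # (30,)
```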

However, although the WDD method has the aforementioned merits, it still has a main drawback. As shown in Eq. (16), WDD focuses only on empirical risk minimization, and no measure is taken to prevent the model hypothesis from becoming too complicated, which might easily lead to an over-fitting model [32, 3].

III-B Equality-constrained Optimization Problems for GNN

Due to the aforementioned drawback of the original WDD method, in this paper our model seeks not only the smallest training error but also the smallest norm of the output weights. According to the famous Bartlett's theory [3], a model with good generalization ability should be one that reaches the best trade-off between the empirical risk and the structural risk. Generally, the empirical risk is the training error of the neural network, which can be represented by the $\ell_2$-norm of the error, i.e., $\|\Psi\mathbf{w}-\mathbf{y}\|_2^{2}$; the structural risk is the $\ell_2$-norm of the output weights, i.e., $\|\mathbf{w}\|_2^{2}$, which reflects the complexity of the model [29]. Thus, the optimization objective and the mathematical model can be formulated as below:

$$\min_{\mathbf{w}}\;\frac{1}{2}\|\mathbf{w}\|_2^{2}+\frac{\gamma}{2}\,\|\Psi\mathbf{w}-\mathbf{y}\|_2^{2}.\qquad(18)$$

According to ridge regression theory [7], adding a positive value to the diagonal of $\Psi^{\mathrm T}\Psi$ or $\Psi\Psi^{\mathrm T}$ makes the solution much more stable and tends to yield better generalization ability. Also, according to statistical learning theory [22, 28], the real risk of a model is the sum of the empirical risk and the structural risk, and these two kinds of risk should be considered together. Since the proposed GNN has two forms (i.e., SOGNN and MOGNN), the solution for each form is discussed in the following, respectively. After that, it can be seen that the solution for the SOGNN is actually one specific case of that for the MOGNN. Thus, these two solutions are finally unified, which leads to the final solution of R-WDD.

III-B1 SOGNN for Classification

Based on the analysis in Section II, the SOGNN can approximate any continuous target function on the variable space. For classification problems, the expected output of the SOGNN can be made close enough to the target class labels in the corresponding regions. Thus, the objective function of the proposed R-WDD for the SOGNN can be described as below:

$$\min_{\mathbf{w},\,\boldsymbol{\xi}}\;\frac{1}{2}\|\mathbf{w}\|_2^{2}+\frac{\gamma}{2}\sum_{i=1}^{N}\xi_i^{2}\quad\text{s.t.}\quad \boldsymbol{\psi}(\mathbf{x}_i)\,\mathbf{w}=y_i-\xi_i,\;\;i=1,\ldots,N,\qquad(19)$$

where the weight factor $\gamma$ is the regularizer. By tuning the value of $\gamma$, the proportions of the empirical risk and the structural risk can be adjusted to reach the best trade-off. Based on the KKT theorem [30], solving the R-WDD optimization problem for the SOGNN is equivalent to solving the following Lagrangian equation:

$$\mathcal{L}(\mathbf{w},\boldsymbol{\xi},\boldsymbol{\alpha})=\frac{1}{2}\|\mathbf{w}\|_2^{2}+\frac{\gamma}{2}\sum_{i=1}^{N}\xi_i^{2}-\sum_{i=1}^{N}\alpha_i\bigl(\boldsymbol{\psi}(\mathbf{x}_i)\,\mathbf{w}-y_i+\xi_i\bigr),\qquad(20)$$

where $\boldsymbol{\alpha}=[\alpha_1,\ldots,\alpha_N]^{\mathrm T}$ denotes the Lagrangian multiplier vector associated with the equality constraints of Eq. (19). By setting the gradients of the Lagrangian with respect to $\mathbf{w}$, $\boldsymbol{\xi}$ and $\boldsymbol{\alpha}$ equal to zero, respectively, we obtain the following KKT optimality conditions of Eq. (20):

$$\frac{\partial\mathcal{L}}{\partial\mathbf{w}}=0\;\Rightarrow\;\mathbf{w}=\Psi^{\mathrm T}\boldsymbol{\alpha},\qquad(21\text{a})$$
$$\frac{\partial\mathcal{L}}{\partial\xi_i}=0\;\Rightarrow\;\alpha_i=\gamma\,\xi_i,\;\;i=1,\ldots,N,\qquad(21\text{b})$$
$$\frac{\partial\mathcal{L}}{\partial\alpha_i}=0\;\Rightarrow\;\boldsymbol{\psi}(\mathbf{x}_i)\,\mathbf{w}-y_i+\xi_i=0,\;\;i=1,\ldots,N.\qquad(21\text{c})$$

III-B2 MOGNN for Classification

According to Section II-C, the MOGNN is actually a generalized form of the SOGNN. For $c$-class classification problems, the number of output nodes of the MOGNN should be $c$, and when the original class label of sample $\mathbf{x}_i$ is $j$, the expected output of the MOGNN should be a $c$-dimensional vector $\mathbf{t}_i$ whose $j$th component is close to $+1$, while the others are close to $-1$. Thus, the objective function of the proposed R-WDD for the MOGNN can be described as below:

$$\min_{W,\,\Xi}\;\frac{1}{2}\|W\|_F^{2}+\frac{\gamma}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|_2^{2}\quad\text{s.t.}\quad \boldsymbol{\psi}(\mathbf{x}_i)\,W=\mathbf{t}_i^{\mathrm T}-\boldsymbol{\xi}_i^{\mathrm T},\;\;i=1,\ldots,N,\qquad(22)$$

where $\mathbf{t}_i$ is the target vector corresponding to sample $\mathbf{x}_i$, $\boldsymbol{\xi}_i$ is the error vector, $W=[\mathbf{w}_1,\ldots,\mathbf{w}_c]$ is a matrix of size $L\times c$, where $\mathbf{w}_j$ is the output-weights vector corresponding to the $j$th output node, and $\|\cdot\|_F$ is the Frobenius norm.
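Before solving Eq. (22), the target matrix $T=[\mathbf{t}_1,\ldots,\mathbf{t}_N]^{\mathrm T}$ has to be built from the raw class labels. A minimal sketch, assuming integer labels in $\{0,\ldots,c-1\}$ (the helper name is illustrative):

```python
import numpy as np

def encode_targets(labels, c):
    """Map integer class labels to the (+1/-1) target rows used by the MOGNN."""
    T = -np.ones((len(labels), c))
    T[np.arange(len(labels)), labels] = 1.0
    return T

print(encode_targets(np.array([0, 2, 1]), c=3))
# [[ 1. -1. -1.]
#  [-1. -1.  1.]
#  [-1.  1. -1.]]
```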

Similar to the SOGNN case, the Lagrangian for the R-WDD optimization of the MOGNN is:

$$\mathcal{L}(W,\Xi,A)=\frac{1}{2}\|W\|_F^{2}+\frac{\gamma}{2}\|\Xi\|_F^{2}-\operatorname{tr}\!\bigl(A^{\mathrm T}(\Psi W-T+\Xi)\bigr),\qquad(23)$$

where $A\in\mathbb{R}^{N\times c}$ denotes the Lagrangian multiplier matrix, $T=[\mathbf{t}_1,\ldots,\mathbf{t}_N]^{\mathrm T}$ and $\Xi=[\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_N]^{\mathrm T}$. By setting the partial derivatives of Eq. (23) with respect to $W$, $\Xi$ and $A$ equal to zero, respectively, we also obtain the KKT optimality conditions of Eq. (23) as follows:

$$\frac{\partial\mathcal{L}}{\partial W}=0\;\Rightarrow\;W=\Psi^{\mathrm T}A,\qquad(24\text{a})$$
$$\frac{\partial\mathcal{L}}{\partial\Xi}=0\;\Rightarrow\;A=\gamma\,\Xi,\qquad(24\text{b})$$
$$\frac{\partial\mathcal{L}}{\partial A}=0\;\Rightarrow\;\Psi W-T+\Xi=0.\qquad(24\text{c})$$

It can be seen from Eq. (21) and Eq. (24) that solving the R-WDD optimization for the SOGNN is actually solving one particular case of the R-WDD optimization for the MOGNN, with the number of output nodes fixed to $c=1$. To be more specific, some of the matrices relevant to $c$ in the MOGNN reduce to vectors in the SOGNN. Thus, the solutions of the weights for the SOGNN and MOGNN can be analyzed together. It is worth mentioning that the size of the activation matrix $\Psi$ remains the same in both cases, as it depends only on the number of training samples $N$ and the neuron number $L$.

III-C Proposed Regularized Weights Direct Determination

Based on the aforementioned KKT optimality conditions for the SOGNN and MOGNN, a unified solution of the output weights for the SOGNN and MOGNN can be obtained. However, due to efficiency concerns, different solutions to the above KKT optimality conditions are used depending on the size of the training data set.

III-C1 Small Training Data Sets

When $N$ is not huge, or more specifically, when $N$ is less than $L$, by substituting Eq. (21a)/(24a) and Eq. (21b)/(24b) into Eq. (21c)/(24c), respectively, we get:

$$A=\Bigl(\frac{I}{\gamma}+\Psi\Psi^{\mathrm T}\Bigr)^{-1}T,\qquad(25)$$

where $I$ is an identity matrix whose size is $N\times N$, the same as the size of $\Psi\Psi^{\mathrm T}$.

By substituting Eq. (25) into Eq. (21a) and (24a), and eliminating $A$, the following equation can be obtained:

$$W=\Psi^{\mathrm T}\Bigl(\frac{I}{\gamma}+\Psi\Psi^{\mathrm T}\Bigr)^{-1}T.\qquad(26)$$

Hence, when the number of training samples is not large, the calculation of R-WDD mainly involves the inversion of an $N\times N$ matrix. It is worth mentioning that, for the SOGNN, $W$ and $T$ reduce to the column vectors $\mathbf{w}$ and $\mathbf{y}$, while for the MOGNN, $W$ and $T$ are matrices.

III-C2 Huge Training Data Sets

When $N$ is huge, i.e., greater than $L$, by substituting Eq. (21c)/(24c) into Eq. (21b)/(24b), respectively, the explicit expression for $A$ can be obtained as follows:

$$A=\gamma\,(T-\Psi W).\qquad(27)$$

Based on this, we can obtain the following solution for the weights, by substituting Eq. (27) into Eq. (24a):

$$W=\Bigl(\frac{I}{\gamma}+\Psi^{\mathrm T}\Psi\Bigr)^{-1}\Psi^{\mathrm T}T,\qquad(28)$$

where $I$ is an identity matrix of size $L\times L$, the same as the size of $\Psi^{\mathrm T}\Psi$. Thus, the calculation mainly involves the inversion of an $L\times L$ matrix, and the speed hinges only on $L$ when $N$ grows large.

Based on the aforementioned discussion of the two circumstances and Eq. (14), the final network output equation is:

$$\mathbf{y}(\mathbf{x})=\boldsymbol{\psi}(\mathbf{x})\,W=
\begin{cases}
\boldsymbol{\psi}(\mathbf{x})\,\Psi^{\mathrm T}\Bigl(\dfrac{I}{\gamma}+\Psi\Psi^{\mathrm T}\Bigr)^{-1}T, & N\le L,\\[10pt]
\boldsymbol{\psi}(\mathbf{x})\,\Bigl(\dfrac{I}{\gamma}+\Psi^{\mathrm T}\Psi\Bigr)^{-1}\Psi^{\mathrm T}T, & N>L.
\end{cases}\qquad(29)$$

As shown in Eq. (29), the calculation of R-WDD always involves only the inversion of a matrix of relatively small size (either $N\times N$ or $L\times L$, whichever is smaller), so the rapidness of the R-WDD calculation is guaranteed.
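The two branches in Eq. (29) (i.e., Eq. (26) and (28)) can be wrapped in a single routine that picks the smaller system depending on whether $N$ exceeds $L$; the sketch below uses NumPy and illustrative names.

```python
import numpy as np

def r_wdd(Psi, T, gamma):
    """Regularized weights direct determination (Eq. (26)/(28)).

    Psi  : (N, L) hidden activation matrix
    T    : (N,) labels for SOGNN or (N, c) target matrix for MOGNN
    gamma: regularizer trading structural risk against empirical risk
    """
    N, L = Psi.shape
    if N <= L:
        # Small-sample branch, Eq. (26): solve an N x N system
        A = np.linalg.solve(np.eye(N) / gamma + Psi @ Psi.T, T)
        return Psi.T @ A
    # Large-sample branch, Eq. (28): solve an L x L system
    return np.linalg.solve(np.eye(L) / gamma + Psi.T @ Psi, Psi.T @ T)

# Toy usage with random stand-ins for Psi and T
rng = np.random.default_rng(2)
Psi = rng.standard_normal((500, 40))
T = np.sign(rng.standard_normal((500, 3)))     # 3-class +/-1 targets
W = r_wdd(Psi, T, gamma=100.0)
print(W.shape)                                  # (40, 3)
```

Note that the sketch calls `np.linalg.solve` rather than forming the matrix inverse explicitly; the linear system is mathematically equivalent to the inversions in Eq. (26) and (28) but numerically more stable.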

Data Sets | Training | Testing | Classes | Features | LS-SVM | ELM | GNN
Australian | 484 | 206 | 2 | 6 | 1500 | 1500 | 0.05
Banana | 1591 | 3709 | 2 | 2 | 1500 | 1500 | 0.05
Diabetes | 537 | 231 | 2 | 8 | 1500 | 1500 | 0.05
Liver | 241 | 104 | 2 | 6 | 1500 | 1500 | 0.05
Ionosphere | 245 | 106 | 2 | 33 | 1500 | 1500 | 0.05
TABLE I: Specifications of the Experiments on Binary Classification Problems
Data Sets | Training | Testing | Classes | Features | LS-SVM | ELM | GNN
Iris | 105 | 45 | 3 | 3 | 1500 | 1000 | 0.05
Glass | 158 | 56 | 6 | 9 | 1500 | 1000 | 0.05
Wine | 126 | 52 | 3 | 13 | 1500 | 1000 | 0.05
Ecoli | 243 | 93 | 8 | 7 | 1500 | 1000 | 0.05
Vehicle | 594 | 252 | 4 | 18 | 1500 | 1000 | 0.05
TABLE II: Specifications of the Experiments on Multi-Classification Problems

IV Experiments and Analysis

In this section, to verify the performance of the GNN with the proposed R-WDD, we first compare the generalization performance of GNN with different algorithms (i.e., LS-SVM with Gaussian kernel and ELM with Gaussian kernel, two other famous algorithms based on equality-constrained optimization) on real-world benchmark binary and multi-class classification cases. To make the comparison more intuitive, the decision boundaries of the different algorithms are presented to demonstrate the classification capability of the proposed method. Then, we compare the computational scalability of the three algorithms, and the computational efficiency of GNN and ELM. Since a model's sensitivity to hyper-parameter selection is also one key feature of an algorithm, experiments on GNN and ELM are conducted to test their tolerance of hyper-parametric changes. At last, we conduct an experiment in which noises of different amplitudes are added, to test the robustness of the different algorithms to random noises.

The simulations in Subsections IV-B and IV-F are carried out in the MATLAB R2018b environment running on an Intel(R) Core(TM) i5-8520U 1.60 GHz CPU with 8-GB RAM, and the simulations in Subsections IV-D and IV-E are carried out in the MATLAB R2018a environment running on an Intel(R) Core(TM) i7-8700T 2.40 GHz CPU with 16-GB RAM. The codes for LS-SVM and ELM are downloaded from [24] and [10], respectively.

IV-A Benchmark Data Sets and Hyper-Parameter Selection

To verify the classification performance of the different algorithms, 5 binary classification cases and 5 multi-classification cases have been tested in our experiments. All data sets are downloaded from the UCI Machine Learning Repository [31].

According to [9], the selection of the hyper parameters greatly influences the performance of the algorithms. For LS-SVM with the popular Gaussian kernel (i.e., $K(\mathbf{u},\mathbf{v})=\exp(-\|\mathbf{u}-\mathbf{v}\|^{2}/\sigma^{2})$), there are mainly two hyper parameters: the cost parameter $C$ (preventing the model from being too complex) and the kernel parameter $\sigma$. ELM using the Gaussian kernel has three hyper parameters: the cost parameter $C$, the kernel parameter $\sigma$ and the number of nodes $L$. The proposed GNN with R-WDD also has three hyper parameters: the regularizer $\gamma$, the GPS parameter $\lambda$ and the number of neurons $L$. In this paper, the value of $\lambda$ is kept fixed. Also, the number of nodes $L$ for ELM and for the proposed GNN with R-WDD is fixed to 1000 for both binary and multi-classification problems.

Specifically, for a binary problem, only one LS-SVM, one ELM with a single output node and one SOGNN are trained; for a $c$-class multi-classification problem, multiple binary LS-SVMs need to be trained, while for ELM and GNN, only one network with $c$ output nodes is required. Thus, ELM and GNN are more flexible and applicable mechanisms for multi-classification.

In the experiments, for LS-SVM and ELM, the hyper parameters (i.e., $C$ and $\sigma$) are each searched over 31 different values in a 4-fold cross-validation grid search, which results in 961 different combinations of ($C$, $\sigma$). The value of $\gamma$ for GNN is likewise searched over 31 different values. At last, the specifications of each experiment (including the divisions of the data sets, the numbers of features and classes of each data set, and the hyper-parameter selections for each algorithm) are reported in TABLEs I and II.

Note that for each experiment, fifty trials are conducted to ensure that the results are not accidental, and the average result is reported. Besides, all the data are reshuffled at each trial and normalized before the experiments.
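The selection protocol described above can be sketched as a plain grid search with 4-fold cross-validation; `fit_and_score` below is only a placeholder for the actual GNN training and evaluation, and the grid values are arbitrary examples rather than the ranges used in the paper.

```python
import numpy as np

def fit_and_score(X_tr, y_tr, X_va, y_va, gamma, L):
    """Placeholder for training a GNN with hyper parameters (gamma, L) on the
    training fold and returning the validation accuracy; a real run would build
    the GEF activation matrix and apply the R-WDD solution of Section III."""
    return 0.5 + 0.001 * np.cos(np.log10(gamma))   # dummy score, illustration only

def cross_val_accuracy(X, y, gamma, L, k=4, seed=0):
    """Mean k-fold cross-validation accuracy for one (gamma, L) pair."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(fit_and_score(X[tr], y[tr], X[val], y[val], gamma, L))
    return float(np.mean(scores))

# Example: search gamma over 31 candidate values with L held fixed
# (the grid shown is an arbitrary example, not the range used in the paper).
X, y = np.zeros((80, 6)), np.zeros(80)             # stand-in data
gammas = np.logspace(-5, 5, 31)
best_gamma = max(gammas, key=lambda g: cross_val_accuracy(X, y, gamma=g, L=1000))
print("selected gamma:", best_gamma)
```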

IV-B Performance Comparison on Real-World Benchmark Data Sets

TABLEs III and IV present the performance comparisons of LS-SVM with Gaussian kernel, ELM with Gaussian kernel and GNN with R-WDD. Note that the simulation results include the mean testing classification accuracies (Acc), the corresponding standard deviations (Std) and the average training time, and the best results are highlighted in boldface.

For binary classification problems, GNN achieves generalization performance comparable to or even better than LS-SVM and ELM, with a comparably fast learning speed. Take the results on the Banana data set as an instance.

  • For the Banana data set, the best Acc is achieved by LS-SVM, while the Acc of GNN is only 0.11% lower. Also, GNN demonstrates an extremely fast training speed comparable to ELM, with a difference within 0.01 s.

For multi-classification problems, the solutions achieved by GNN are better than those of LS-SVM and ELM, with higher Acc, and more stable, with lower Std. To be illustrative, take the results on the Glass and Ecoli data sets as examples.

  • For the Glass data set, the Acc of GNN is much higher, being 5.46% and 4.47% higher than the results achieved by LS-SVM and ELM, respectively.

  • For the Ecoli data set, the Acc of GNN is at least 2% higher than that of LS-SVM and ELM, while the Std is much lower. Although the training time is a little longer, the training speed of GNN with R-WDD is still very fast and accessible.

Data Sets | LS-SVM: Acc (%) / Std (%) / Training Time (s) | ELM: Acc (%) / Std (%) / Training Time (s) | GNN: Acc (%) / Std (%) / Training Time (s)
Australian | 72.08 / 3.44 / 0.2462 | 74.69 / 1.42 / 0.2267 | 75.91 / 3.11 / 0.2639
Banana | 89.63 / 0.33 / 0.1241 | 89.04 / 0.46 / 0.1124 | 89.52 / 0.61 / 0.1206
Diabetes | 77.01 / 2.41 / 0.1550 | 77.52 / 2.46 / 0.1328 | 77.29 / 2.05 / 0.1421
Liver | 71.17 / 3.92 / 0.2541 | 74.87 / 1.44 / 0.2817 | 72.68 / 4.04 / 0.2370
Ionosphere | 91.58 / 2.74 / 0.2151 | 92.12 / 4.73 / 0.1761 | 91.03 / 2.66 / 0.2240
TABLE III: Performance Comparison of LS-SVM, ELM and GNN: Binary-Class Data Sets
Data Sets | LS-SVM: Acc (%) / Std (%) / Training Time (s) | ELM: Acc (%) / Std (%) / Training Time (s) | GNN: Acc (%) / Std (%) / Training Time (s)
Iris | 96.22 / 2.36 / 0.0231 | 96.04 / 2.37 / 0.0222 | 96.58 / 2.34 / 0.0257
Glass | 67.32 / 5.04 / 0.0417 | 68.41 / 4.27 / 0.0326 | 72.68 / 4.94 / 0.0421
Wine | 97.63 / 1.82 / 0.0343 | 98.48 / 4.46 / 0.0326 | 98.54 / 1.63 / 0.0457
Ecoli | 85.93 / 3.52 / 0.0344 | 87.48 / 2.91 / 0.0382 | 89.55 / 2.23 / 0.0481
Vehicle | 83.19 / 1.93 / 0.1029 | 83.16 / 1.89 / 0.0931 | 84.05 / 1.75 / 0.1266
TABLE IV: Performance Comparison of LS-SVM, ELM and GNN: Multi-Class Data Sets

IV-C Comparison of Decision Boundaries

(a) LS-SVM. (b) ELM. (c) GNN.
Fig. 3: The decision boundaries of each algorithm. It can be observed that all three algorithms classify the nonlinearly separable data set well.

As a more intuitive proof of GNN's classification capability, we also conduct a simple experiment aimed at showing the decision boundaries of the different algorithms on nonlinearly separable data classes. This experiment is conducted on a synthetic two-dimensional, two-class data set called double moon, with a pattern of two intertwining moons. Since it is a binary classification problem, the pattern vectors of Class 1 are tagged with "+1" labels and those of Class 2 receive "-1" labels.

For this experiment, all the classifiers are trained on a data set with 3000 samples. The decision boundary of each algorithm is shown in Fig. 3 (a)-(c), respectively. Note that this simple experiment is set up to demonstrate the classification ability of each classifier intuitively; thus all samples were used for training and testing accuracy rates are not given.

IV-D Computational Scalability and Efficiency

One of the biggest merits of ELM originates from its excellent computational scalability and efficiency, which means the training speed of ELM is maintained when the scale of the problem grows and the number of hidden neurons increases. In this subsection, experiments on the Liver data set are conducted to verify the computational scalability and efficiency of GNN. To be specific, since the experiments here are set up to evaluate the computational scalability and efficiency of the algorithms, we simply amplify the size of the Liver data set by randomly copying samples. To avoid accidental results, each data point plotted is the average value of 100 trials.


Fig. 4: Computational scalability of the different algorithms: this experiment runs on the amplified Liver data set (20000 samples), and the number of neurons of GNN and ELM is 100. We can see that as $N$ grows, the training time consumed by LS-SVM increases dramatically, while the training time consumed by ELM (Gaussian kernel) and GNN increases very slowly.

IV-D1 Computational Scalability

In the simple example shown in Fig. 4, we can see that the GNN has computational scalability comparable to (or even better than) ELM. Specifically, according to the analysis in [9], for LS-SVM the main computational cost comes from calculating the Lagrange multipliers, which is highly associated with the size of the data set $N$. However, since ELM and GNN with R-WDD use a different equation to calculate the output weights when $N$ is large (refer to Eq. (29)), the computational cost of ELM and GNN is reduced dramatically. Thus, ELM and GNN have better computational scalability compared to LS-SVM.

IV-D2 Computational Efficiency

The results of the comparison of computational efficiency are shown in Fig. 5; we can see that the training times consumed by both ELM and GNN do not increase remarkably as $L$ increases. Moreover, the training time consumed by GNN grows a bit more slowly than that of ELM, with a smoother slope.

Based on the above two experiments and analysis, it can be seen that the GNN still has an extremely fast training speed when $N$ and $L$ become very large. Thus, it can be concluded that the GNN has great computational scalability and efficiency with regard to the size of the training data set $N$ and the number of neurons $L$.

IV-E Sensitivity to Hyper Parameters

Axiomatically, the selection of the hyper parameters is vital to the performance of classification algorithms. However, if a model is not very sensitive to changes of the hyper parameters, i.e., the best generalization performance can be achieved over a wide range of hyper-parameter combinations, the model has better utility value, since it requires less user intervention in real-world implementation. Thus, in this subsection, experiments are conducted to test the sensitivity of ELM and GNN to the hyper parameters. Moreover, the generalization performances of ELM and GNN when $L$ is very small are also explored. To avoid accidental results, each data point is the average value of 100 trials, and ELM's Gaussian kernel parameter and GNN's GPS parameter $\lambda$ follow the optimal settings listed in Subsection IV-A.

IV-E1 Sensitivity to Hyper Parameters

The experiments are conducted on a binary problem (Liver data set) and a multi-class problem (Vehicle data set) to show GNN's sensitivity to the combinations of its two hyper parameters (i.e., $\gamma$ and $L$). Similarly, as a comparison, the same experiments on ELM are conducted to test its sensitivity to the combinations of its two analogous hyper parameters (i.e., $C$ and $L$).


Fig. 5: Computational efficiency of ELM and GNN: this experiment runs on the amplified Liver data set (1000 samples). We can see that as the hypothesis feature space becomes more complex (i.e., $L$ grows), the training time consumed by both ELM (Gaussian kernel) and GNN increases slowly.

According to the results shown in Fig. 6 and Fig. 7, the best performance of GNN can be achieved over a wider range of combinations of ($\gamma$, $L$). In particular, the performance of GNN is not sensitive to $L$ and remains good even when there are only very few neurons in the network. Thus, only $\gamma$ needs to be specified to achieve the best performance. However, ELM is relatively more sensitive to the combinations of ($C$, $L$), and good performance can be achieved only when $L$ is large enough.

(a) ELM on the Liver data set. (b) GNN on the Liver data set.
Fig. 6: The hyper-parametric sensitivity test on the Liver data set (binary). Both ELM and GNN can achieve good generalization performance over a wide range of combinations of ($C$, $L$) or ($\gamma$, $L$). However, GNN is less sensitive to $L$ and performs well even when $L$ is small.
(a) ELM on the Vehicle data set. (b) GNN on the Vehicle data set.
Fig. 7: The hyper-parametric sensitivity test on the Vehicle data set (multi-class). GNN can achieve good generalization performance over a wider range of combinations of ($\gamma$, $L$) than ELM. GNN's performance is sensitive only to the value of $\gamma$, while ELM can achieve its best performance only in a narrow range of combinations of ($C$, $L$).

IV-E2 Performance when L is Small

From Fig. 6 and Fig. 7, we can also see that when the neurons are few, the generalization performance of GNN is still good, while the generalization performance of ELM grows slowly as $L$ grows. Thus, experiments on the same two data sets are conducted to compare the generalization performance of these two algorithms when $L$ is small. The basic experimental settings follow those in Subsection IV-A, except for ($C$, $L$) for ELM and ($\gamma$, $L$) for GNN.

From the results shown in Fig. 8 and Fig. 9, it is clear that the GNN still works very well with high Accs when the hypothesis feature space is not complicated (i.e., $L$ is small), while for ELM, the performance deteriorates greatly when there are not enough neurons. Thus, with appropriate values of $\gamma$, the approximation capability of GNN is much better than that of ELM, which needs more neurons to ensure its approximation capability. In conclusion, GNN has much better generalization performance than ELM when the network hypothesis feature space is simple (i.e., $L$ is small).

Thus, through the two aforementioned experiments, we can conclude that the GNN is less sensitive to changes of the hyper parameters than ELM. What is more, it still performs excellently when the dimensionality of the hypothesis feature space is small. Thus, the proposed model has great utility value and is easy to implement.

(a) ELM on the Liver data set with very few neurons. (b) GNN on the Liver data set with very few neurons.
Fig. 8: The experiment runs on the Liver data set (binary). ELM starts with a low generalization accuracy when the neurons are very few, and its accuracy grows slowly, while GNN reaches a relatively high generalization accuracy once $L$ is larger than 10.
(a) ELM on the Vehicle data set with very few neurons. (b) GNN on the Vehicle data set with very few neurons.
Fig. 9: The experiment runs on the Vehicle data set (multi-class). ELM starts with a very low generalization accuracy when the neurons are very few, and its accuracy grows slowly, while GNN reaches a relatively high generalization accuracy once $L$ is larger than 10. Also, both ELM and GNN are sensitive to the values of $C$ or $\gamma$.
Data Sets | LS-SVM: Acc (%) / Change (%) | ELM: Acc (%) / Change (%) | GNN: Acc (%) / Change (%)
Iris | 94.31 / -1.91 | 95.21 / -0.83 | 96.02 / -0.56
Glass | 63.21 / -5.11 | 63.76 / -4.65 | 68.55 / -4.13
Wine | 93.81 / -2.82 | 96.57 / -1.91 | 97.77 / -0.77
Ecoli | 80.42 / -5.51 | 82.75 / -4.73 | 85.41 / -4.14
Vehicle | 79.21 / -3.98 | 79.51 / -3.65 | 81.89 / -2.16
TABLE V: Performance Comparison of LS-SVM, ELM and GNN: Multi-Class Data Sets with Random Noises (smaller noise amplitude)
Data Sets | LS-SVM: Acc (%) / Change (%) | ELM: Acc (%) / Change (%) | GNN: Acc (%) / Change (%)
Iris | 91.87 / -4.35 | 92.46 / -3.58 | 94.59 / -1.99
Glass | 60.19 / -7.13 | 61.72 / -6.69 | 66.77 / -6.23
Wine | 92.73 / -4.90 | 95.94 / -2.54 | 98.21 / -0.33
Ecoli | 78.80 / -7.13 | 80.13 / -7.35 | 84.51 / -5.04
Vehicle | 75.27 / -7.92 | 77.37 / -5.79 | 79.25 / -4.80
TABLE VI: Performance Comparison of LS-SVM, ELM and GNN: Multi-Class Data Sets with Random Noises (larger noise amplitude)

IV-F Comparison of Classification Robustness

To investigate the classification robustness of the proposed GNN with R-WDD, we conduct the following experiments on the same real-world benchmark multi-classification data sets as in Subsection IV-B. Then, we contaminate the data sets by adding random noises of different amplitudes to the training sets. The test sets are kept noise-free, since this experiment evaluates the classification robustness of the algorithms when noises exist. All experimental settings follow those in Subsection IV-A.
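A minimal sketch of the contamination step, assuming zero-mean uniform noise whose amplitude is a fraction of the normalized feature range (the exact noise model of the paper is not specified here):

```python
import numpy as np

def add_uniform_noise(X_train, amplitude, seed=0):
    """Add zero-mean uniform noise of the given relative amplitude to the
    (normalized) training features; the test set is left untouched."""
    rng = np.random.default_rng(seed)
    return X_train + rng.uniform(-amplitude, amplitude, size=X_train.shape)

# Toy usage: contaminate normalized training features with 10%-amplitude noise
X_train = np.random.default_rng(1).uniform(-1.0, 1.0, size=(100, 7))
X_noisy = add_uniform_noise(X_train, amplitude=0.10)
```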

TABLEs V and VI show the Accs of the three algorithms on the different data sets under the two noise amplitudes, respectively. Also, we compare the results in TABLE IV with TABLEs V and VI, and calculate the changes in Acc before and after the noises are added, to show the influence of the random noises on the performance of the algorithms.

The performances of all algorithms deteriorate as the random noises are added. However, the decreases in Acc of GNN are much smaller than those of LS-SVM and ELM. To be illustrative, we take the Wine data set as an example.

  • For the Wine data set, at the lower noise amplitude, the decrease of GNN in Acc is only 0.77%, while those of LS-SVM and ELM are 2.82% and 1.91%, respectively; at the higher amplitude, the decreases of LS-SVM and ELM in Acc are 4.90% and 2.54%, while that of GNN is only 0.33%, about 1/15 and 1/8 of the decreases of LS-SVM and ELM. Besides, the Acc of GNN is at least 2% higher than that of LS-SVM and ELM when the noise amplitude is 10%.

Based on the above analysis, it can be concluded that the classification robustness of GNN with R-WDD is higher than that of LS-SVM and ELM.

V Conclusion

In this paper, to tackle the slow operation of BP-type algorithms and the computational robustness problems of ELM, a novel GNN model with structural simplicity has been proposed and investigated, based on GPS and polynomial function approximation. Then, in order to overcome the disadvantage of easy over-fitting and improve the model's generalization performance, an R-WDD method, based on equality-constrained optimization, has been proposed to minimize the network's empirical risk and structural risk. For better computational scalability and efficiency, two solutions to the R-WDD optimization are given, depending on the scale of the problem and the complexity of the model.

As verification, the comparison experiments on various real-world benchmark data sets have shown that GNN has comparable (or better) generalization performance, especially in multi-class classification problems, compared with LS-SVM and ELM. Besides, the excellent computational scalability and efficiency of GNN have been confirmed in the comparison experiments. As for the hyper-parametric experiments, less sensitivity of GNN to the combinations of hyper parameters has been observed. In particular, the generalization ability of GNN hardly relies on the value of $L$, and good performance can be achieved even when the dimensionality of the hypothesis feature space is very small. Finally, the experiments on contaminated data sets also verify the classification robustness of the proposed method.

References

  • [1] K. Balakrishnan and V. Honavar (1992) Improving convergence of back-propagation by handling flat-spots in the output layer. Artificial Neural Networks, pp. 1003–1009. Cited by: §I.
  • [2] K. S. Banerjee (1971) Generalized inverse of matrices and its applications. Cited by: §III-A.
  • [3] P. L. Bartlett (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Information Theory IEEE Transactions on 44 (2), pp. 525–536. Cited by: §I, §III-A, §III-B.
  • [4] L. D'Ambrosio (2006) Extension of bernstein polynomials to infinite dimensional case. Journal of Approximation Theory 140 (2), pp. 191–202. Cited by: §II-B.
  • [5] W. Gautschi (2004) Orthogonal polynomials: computation and approximation. In Numerical Mathematics & Scientific Computation, Cited by: §II-C1.
  • [6] C. R. Giardina and P. M. Chirlian (2005) Proof of weierstrass approximation theorem using band-limited functions. Proceedings of the IEEE 61 (4), pp. 512–512. Cited by: §II-B.
  • [7] A. E. Hoerl and R. W. Kennard (2000) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §III-B.
  • [8] P. Horata, S. Chiewchanwattana, and K. Sunat (2013) Robust extreme learning machine. Neurocomputing 102 (2), pp. 31–44. Cited by: §II-C1.
  • [9] G. B. Huang, H. Zhou, X. Ding, and R. Zhang (2012) Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems Man & Cybernetics Part B 42 (2), pp. 513–529. Cited by: §I, §IV-A, §IV-D1.
  • [10] G. B. Huang, Q. Y. Zhu, and C. K. Siew Note: School of Electric. and Electro. Engi., Nanyang Tech. Univ., Singapore [Online].http://www.ntu.edu.sg/home/egbhuang/elm_codes.html Cited by: §IV.
  • [11] G. Huang and H. A. Babri (1998) Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. Neural Networks, IEEE Transactions on 9 (1), pp. 224–229. Cited by: §I, §I.
  • [12] G. Huang, L. Chen, and C. Siew (2003) Universal approximation using incremental feedforward networks with arbitrary input weights. Revised and resubmitted to IEEE Transactions on Neural Networks. Cited by: §I.
  • [13] G. Huang, Q. Zhu, and C. Siew (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, Vol. 2, pp. 985–990. Cited by: §I, §I.
  • [14] G. Huang, Q. Zhu, and C. Siew (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1), pp. 489–501. Cited by: §I, §I.
  • [15] E. Ibikli (2008) Direct and converse results for multivariate generalized bernstein polynomials. Cited by: §II-B, §II-B.
  • [16] L. Jie, X. Sun, H. Xu, J. Chao, and D. Yu (2015) Dynamic load identification for stochastic structures based on gegenbauer polynomial approximation and regularization method. Mechanical Systems & Signal Processing 56-57, pp. 35–54. Cited by: §II-B.
  • [17] C. Junhong, W. Zhuobin, H. Tao, and X. Xianwei (2010) Analysis of feature extraction criterion function maximum in nonlinear multi-layer feedforward neural networks for pattern recognition. In Intelligent Computation Technology and Automation (ICICTA), 2010 International Conference on, Vol. 1, pp. 655–658. Cited by: §I.
  • [18] R. Khosrowabadi, C. Quek, K. K. Ang, and A. Wahab (2014) ERNN: a biologically inspired feedforward neural network to discriminate emotion from eeg signal. Neural Networks and Learning Systems, IEEE Transactions on 25 (3), pp. 609–620. Cited by: §I.
  • [19] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks 6 (6), pp. 861–867. Cited by: §I.
  • [20] L. Liu, M. Liu, and M. Ma (2014) Construction of linear dynamic gene regulatory network based on feedforward neural network. In Natural Computation (ICNC), 2014 10th International Conference on, pp. 99–107. Cited by: §I.
  • [21] Z. Man, K. Lee, D. Wang, Z. Cao, and C. Miao (2011) A new robust training algorithm for a class of single-hidden layer feedforward neural networks. Neurocomputing 74 (16), pp. 2491–2501. Cited by: §II-C1.
  • [22] M. Anthony and P. L. Bartlett (1999) Neural network learning: theoretical foundations. Cambridge University Press. Cited by: §II-C1, §III-B.
  • [23] S. Nurcahyo, F. Nhita, et al. (2014) Rainfall prediction in kemayoran jakarta using hybrid genetic algorithm (ga) and partially connected feedforward neural network (pcfnn). In Information and Communication Technology (ICoICT), 2014 2nd International Conference on, pp. 166–171. Cited by: §I.
  • [24] K. Pelckmans LS-svmlab toolbox. Note: Dept. Elect. Eng., ESAT-SCD-SISTA, Leuven, Belgium, 2002 [Online]http://www.esat.kuleuven.be/sista/lssvmlab/ Cited by: §IV.
  • [25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323 (3), pp. 533–536. Cited by: §I.
  • [26] J. V. Stokman, C. F. Dunkl, and Y. Xu (2014) Orthogonal polynomials of several variables. Journal of Approximation Theory 112 (3), pp. 318–319. Cited by: §II-B.
  • [27] S. I. Sulaiman, T. K. A. Rahman, I. Musirin, and S. Shaari (2012) Fast evolutionary programming-based hybrid multi-layer feedforward neural network for predicting grid-connected photovoltaic system output. In Signal Processing and its Applications (CSPA), 2012 IEEE 8th International Colloquium on, pp. 44–47. Cited by: §I.
  • [28] J. A. K. Suykens, J. D. Brabanter, L. Lukas, and J. Vandewalle (2002) Weighted least squares support vector machines : robustness and sparse approximation. Neurocomputing 48 (1), pp. 85–105. Cited by: §III-B.
  • [29] J. A. K. Suykens and J. Vandewalle (1999) Least squares support vector machine classifiers. Neural Processing Letters 9 (3), pp. 293–300. Cited by: §III-B.
  • [30] R. Tapia (1980) Practical methods of optimization, volume 2: constrained optimization (r. fletcher). Cited by: §III-B1.
  • [31] UCI repository of machine learning databases. Note: Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, 1998 [Online].http://www.ics.uci.edu/~mlearn/MLRepository.html Cited by: §IV-A.
  • [32] V. N. Vapnik (1995) The nature of statistical learning theory. Cited by: §I, §III-A.
  • [33] Y. Wang, F. Cao, and Y. Yuan (2011) A study on effectiveness of extreme learning machine. Neurocomputing 74 (16), pp. 2483–2490. Cited by: §I.
  • [34] J. Wu, Z. Luo, Y. Zhang, N. Zhang, and L. Chen (2013) Interval uncertain method for multibody mechanical systems using chebyshev inclusion functions. International Journal for Numerical Methods in Engineering 95 (7), pp. 608–630. Cited by: §II-A.
  • [35] M. Xu, Z. Qiu, and X. Wang (2014) Uncertainty propagation in sea for structural cacoustic coupled systems with non-deterministic parameters. Journal of Sound & Vibration 333 (17), pp. 3949–3965. Cited by: §II-A.
  • [36] S. Yin, D. Yu, Y. Hui, and B. Xia (2017) A unified method for the response analysis of interval/random variable models of acoustic fields with uncertain-but-bounded parameters: a unified polynomial method for interval/random variable models. International Journal for Numerical Methods in Engineering 111. Cited by: §II-A.
  • [37] D. Zhang, C. Liu, Z. Wang, and N. Liu (2006) A new fast learning algorithm for multi-layer feedforward neural networks. In Machine Learning and Cybernetics, 2006 International Conference on, pp. 2928–2934. Cited by: §I.
  • [38] L. Zhang and P. N. Suganthan (2016) A survey of randomized algorithms for training neural networks. Information Sciences 364, pp. 146–155. Cited by: §I.
  • [39] Y. Zhang and W. Li (2009) Gegenbauer orthogonal basis neural network and its weights-directdetermination method. Electronics Letters 45 (23), pp. 1184. Cited by: §II-A.
  • [40] Y. Zhang, L. Wei, C. Yi, and C. Ke (2008) A weights-directly-determined simple neural network for nonlinear system identification. In IEEE IEEE International Conference on Fuzzy Systems, Cited by: §II-B.
  • [41] Y. Zhang, X. Yu, D. Guo, Y. Yin, and Z. Zhang (2014) Weights and structure determination of multiple-input feed-forward neural network activated by chebyshev polynomials of class 2 via cross-validation. Neural Computing & Applications 25 (7-8), pp. 1761–1770. Cited by: §I, §II-B, §II-B, §II-C1, §III-A, §III-A.
  • [42] Z. Zhang, Z. Li, Y. Zhang, Y. Luo, and Y. Li (2015) Neural-dynamic-method-based dual-arm cmg scheme with time-varying constraints applied to humanoid robots. IEEE Transactions on Neural Networks & Learning Systems 26 (12), pp. 3251–3262. Cited by: §I.
  • [43] Z. Zhang, J. He, Z. Chen, and L. Kong (2018) Two-input power-activation neural network weights-direct-determination and structure optimized by particle swarm optimization. In the Ninth International Conference on Intelligent Control and Information Processing (ICICIP), pp. 191–198. Cited by: §I, §II-B, §II-C1, §III-A, §III-A.