Pulling back error to the hidden-node parameter technology: Single-hidden-layer feedforward network without output weight

05/06/2014 · by Yimin Yang, et al.

According to conventional neural network theories, the representational power of single-hidden-layer feedforward neural networks (SLFNs) rests on the parameters of the weighted connections and the hidden nodes. SLFNs are universal approximators when at least both kinds of network parameters, the hidden-node parameters and the output weights, exist. Unlike these theories, this paper shows that, for SLFNs to work as universal approximators, one may simply calculate the hidden-node parameters; the output weights are not needed at all. In other words, the proposed neural network architecture can be considered a standard SLFN with the output weights fixed to a unit vector. Furthermore, this paper presents experiments which show that the proposed learning method tends to reduce the network output error to a very small value with only one hidden node. Simulation results demonstrate that the proposed method can be several to thousands of times faster than other learning algorithms, including BP, SVM/SVR, and other ELM methods.


I Introduction

The widespread popularity of neural networks in many fields is mainly due to their ability to approximate complex nonlinear mappings directly from the input samples. In the past two decades, owing to their universal approximation capability, feedforward neural networks (FNNs) have been extensively used in classification and regression problems [1]. According to Jaeger's estimation [2], 95% of the literature is mainly on FNNs. As a specific type of FNN, the single-hidden-layer feedforward network (SLFN) plays an important role in practical applications [3]. For N arbitrary distinct samples (x_j, t_j), where x_j ∈ ℝ^n and t_j ∈ ℝ^m, an SLFN with L hidden nodes and activation function g(x) is mathematically modeled as

f_L(x_j) = Σ_{i=1}^{L} β_i g(a_i · x_j + b_i) = o_j,  j = 1, …, N    (1)

where g(a_i · x_j + b_i) denotes the output of the i-th hidden node with hidden-node parameters (a_i, b_i), and β_i is the output weight between the i-th hidden node and the output node. a_i · x_j denotes the inner product of the vectors a_i and x_j in ℝ^n.

An active topic on the universal approximation capability of SLFNs is then how to determine the parameters a_i, b_i, and β_i such that the network output f_L can approximate a given target function f. The representational power of SLFNs rests on the parameters of the output weights and the hidden nodes. According to conventional neural network theories, SLFNs are universal approximators when all the parameters of the network, including the hidden-node parameters and the output weights, are allowed to be adjusted [4], [5].
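The SLFN model of equation (1) can be sketched numerically as follows. This is a minimal illustration, assuming sigmoid additive hidden nodes; the function name `slfn_output` and the concrete shapes are ours, not the paper's.

```python
import numpy as np

def slfn_output(X, A, b, beta):
    """Evaluate the SLFN of equation (1).

    X: (N, n) inputs; A: (n, L) input weights a_i stacked as columns;
    b: (L,) biases; beta: (L, m) output weights.
    Returns the (N, m) network output.
    """
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))  # hidden-layer output matrix
    return H @ beta                          # weighted sum over hidden nodes

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))      # N = 5 samples, n = 3 inputs
A = rng.standard_normal((3, 4))      # L = 4 hidden nodes
b = rng.standard_normal(4)
beta = rng.standard_normal((4, 2))   # m = 2 outputs
Y = slfn_output(X, A, b, beta)
```

The hidden-layer output matrix H built here is the same H used in the notation of Section II.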

Unlike the above theories, in which all network parameters are adjustable, other researchers have proposed semi-random network theories [6], [7], [8]. For example, Lowe [8] focused on a specific RBF network: the centers in [8] can be randomly selected from the training data instead of being tuned, but the impact factor of the RBF hidden nodes is not randomly selected and is usually determined by the user.

Unlike the above semi-random network theories, in 2006 Huang et al. [9] showed that iterative techniques are not required for adjusting the parameters of SLFNs at all. Based on this idea, Huang et al. proposed a simple and efficient learning method referred to as the extreme learning machine (ELM). In [10], [11], [12], Huang et al. proved that SLFNs with randomly generated hidden-node parameters can work as universal approximators by computing only the output weights linking the hidden layer to the output nodes. Recent ELM developments [13] show that ELM unifies FNNs and SVM/LS-SVM. Compared to ELM, LS-SVM and PSVM achieve suboptimal solutions and have a higher computational cost.
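The basic ELM step described above can be sketched as follows: the hidden-node parameters are drawn at random and never tuned, and only the output weights are solved for via the Moore-Penrose pseudoinverse of the hidden-layer output matrix. This is a sketch of the standard formulation, not of this paper's method; the function name `elm_train` is illustrative.

```python
import numpy as np

def elm_train(X, T, L, rng):
    """One-shot ELM training with L random sigmoid hidden nodes."""
    n_features = X.shape[1]
    A = rng.uniform(-1.0, 1.0, (n_features, L))  # random input weights (not tuned)
    b = rng.uniform(-1.0, 1.0, L)                # random biases (not tuned)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))       # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                 # least-squares output weights
    return A, b, beta

# Fit a smooth 1-D target with 20 random hidden nodes.
rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
T = np.sin(np.pi * X)
A, b, beta = elm_train(X, T, 20, rng)
H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
train_rmse = np.sqrt(np.mean((H @ beta - T) ** 2))
```

Only `beta` is computed from the data; this is the step that the present paper argues can itself be removed.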

The above neural network theories indicate that, for SLFNs to work as universal approximators, at least the hidden-node parameters (which can be generated randomly) and the output weights must exist. In this paper, however, we show that the output weights do not need to exist in SLFNs at all.

In [14] we proposed a learning algorithm, called the bidirectional extreme learning machine (B-ELM), in which half of the hidden-node parameters are not randomly selected but are instead calculated by pulling the network residual error back to the input weights. The experimental results in [14] indicated that B-ELM tends to reduce the network output error to a very small value at an extremely early learning stage. Furthermore, our recent experimental results indicate that in B-ELM [14], the output weights play a very minor role in the network's learning effectiveness. Inspired by these experimental results, in this paper we show that SLFNs without output weights can approximate any continuous target function and classify any disjoint regions if one pulls the error back to the hidden-node parameters. In particular, this paper makes the following contributions.

1) The learning speed of the proposed method can be several to thousands of times faster than that of other learning methods, including SVM, BP, and other ELMs. Furthermore, it provides good generalization performance and can be applied directly to regression and classification applications.

2) Unlike conventional SLFNs, in which both the hidden-node parameters and the output weights are needed, we prove that SLFNs without output weights can still approximate any continuous target function and classify any disjoint regions. The architecture of this single-parameter neural network is thus much simpler than that of traditional SLFNs.

3) Unlike other neural networks, which require a large number of hidden nodes (in [13], Huang et al. note that "the generalization performance of ELM is not sensitive to the dimensionality of the feature space (the number of hidden nodes) as long as it is set large enough for all the real-world cases tested in our simulations"), our experimental study shows that the proposed learning method with only one hidden node can give significant improvements in accuracy, instead of maintaining a hidden layer with a large number of nodes.

II Preliminaries and Notation

II-A Notations and Definitions

The sets of real, integer, positive real, and positive integer numbers are denoted by ℝ, ℤ, ℝ⁺, and ℤ⁺, respectively. Similar to [1], let L²(X) be a space of functions f on a compact subset X of the n-dimensional Euclidean space ℝⁿ such that |f|² is integrable, that is, ∫_X |f(x)|² dx < ∞. For u, v ∈ L²(X), the inner product ⟨u, v⟩ is defined by

⟨u, v⟩ = ∫_X u(x) v(x) dx    (2)

The norm in the L²(X) space will be denoted ‖·‖, and L denotes the number of hidden nodes. For N training samples, X denotes the input matrix of the network and T denotes the desired output matrix. H is called the hidden-layer output matrix of the SLFN; the i-th column of H is the i-th hidden node's output with respect to the inputs. The hidden-layer output matrix is said to be a randomly generated function sequence if the corresponding hidden-node parameters (a_i, b_i) are randomly generated. e_n denotes the residual error function of the current network with n hidden nodes, and I is the unit matrix.

III Bidirectional ELM for Regression Problems

Theorem 1

[14] Given N training samples that come from the same continuous function, and given the sigmoid or sine activation function g, consider an error feedback function sequence defined by

(3)

If the activation function is sin/cos, one normalized function is used; if the activation function is sigmoid, another is used. Then, for any continuous target function f and any randomly generated function sequence, lim_{n→∞} ‖f − f_n‖ = 0 holds with probability one if

(4)
(5)
(6)
(7)

where u⁻¹ and g⁻¹ represent the corresponding inverse functions. If g is the sine activation function, g⁻¹(u) = arcsin(u); if g is the sigmoid activation function, g⁻¹(u) = ln(u/(1 − u)).

Remark 1

Compared with B-ELM, the proposed method makes only two changes. The first is that we set the output weight equal to the unit matrix. The second is that the pseudoinverse of the input data is replaced with its ridge-regression form, based on ridge-regression theory. Although these changes are very small, the experimental results show that with the proposed learning method, a one-hidden-node SLFN without output weights (output weights equal to the unit matrix) can achieve generalization performance similar to that of other standard SLFNs with hundreds of hidden nodes. Furthermore, unlike B-ELM [14], which works only for regression problems, the proposed method can be applied to both regression and multi-class classification applications.
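The second change, the ridge-regression form of the pseudoinverse, can be sketched as follows. The expression (XᵀX + I/C)⁻¹Xᵀ with a user constant C is the form popularized in the ELM literature [13]; its use here is our reading of the (incomplete) text, so treat it as an assumption.

```python
import numpy as np

def ridge_pinv(X, C=1e6):
    """Ridge-regularized pseudoinverse: (X'X + I/C)^(-1) X'."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(n) / C, X.T)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
# For large C and a well-conditioned X this approaches the plain
# Moore-Penrose pseudoinverse; for small C it is better conditioned.
diff = np.max(np.abs(ridge_pinv(X, C=1e10) - np.linalg.pinv(X)))
```

The regularization term I/C is what makes the solution stable when XᵀX is near-singular, at the cost of a small bias controlled by C.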

IV SLFNs Without Output Weights

Basic idea 1: Our recent experimental results indicate that in B-ELM [14], the output weights play a very minor role in the network's learning effectiveness. Inspired by these experimental results, in the proposed method we directly set the output weights equal to the unit matrix.

Theorem 2

Given N training samples that come from the same continuous function, and given an SLFN with any bounded nonconstant piecewise continuous activation function for additive nodes or sine nodes, for any continuous target function f, the obtained error feedback function sequence converges with probability one if

(8)
(9)

The validity of this theorem is immediate: with the output weight fixed to the unit matrix, the network output reduces to the hidden-layer output, and conditions (8)-(9) then give the stated convergence.

Remark 2

When the hidden-node output matches the target, it is easy to see that the proposed method reduces the network output error to 0. Thus, the learning problem has been converted into finding the optimal hidden-node parameters that achieve this.

Basic idea 2: For a fixed output weight equal to the unit matrix or vector, as seen from equations (8)-(9), training an SLFN is simply equivalent to finding a least-squares solution of a linear system. If the activation function is invertible, training an SLFN is simply equivalent to pulling the residual error back to the input weights. For example, for N arbitrary distinct samples, if the activation function is the sine function, training an SLFN is simply equivalent to finding a least-squares solution of the linear system:

(10)

According to [16], the smallest-norm least-squares solution of the above linear system is given by the Moore-Penrose generalized inverse. Based on this idea, we give the following theorem.
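The pull-back step for a sine node can be sketched as follows: sin(X a) ≈ t becomes the linear system X a ≈ arcsin(t), whose smallest-norm least-squares solution is obtained with the pseudoinverse. This is a minimal sketch assuming t has already been normalized into [−1, 1] so that arcsin is defined; the function name is ours.

```python
import numpy as np

def pull_back_sine(X, t):
    """Smallest-norm least-squares input weight for sin(X @ a) ~ t."""
    return np.linalg.pinv(X) @ np.arcsin(t)  # pull t back through arcsin

# Recover a known input weight.  a_true is kept small so that X @ a_true
# stays inside [-pi/2, pi/2], where arcsin(sin(z)) = z and recovery is exact.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
a_true = np.array([0.1, -0.05, 0.08])
t = np.sin(X @ a_true)
a_hat = pull_back_sine(X, t)
```

Outside that range the inversion is no longer exact, which is one reason the method needs the normalized function introduced below.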

Lemma 1

[1] Given a bounded nonconstant piecewise continuous function g, we have

(11)
Theorem 3

Given N arbitrary distinct samples, given the sigmoid or sine activation function g, for any continuous desired output T, the conclusion holds with probability one if

(12)
(13)

where, if the activation function is sin/cos, one normalized function is used; if the activation function is sigmoid, another is used. u⁻¹ and g⁻¹ represent the corresponding inverse functions: if g is the sine activation function, g⁻¹(u) = arcsin(u); if g is the sigmoid activation function, g⁻¹(u) = ln(u/(1 − u)).

For an activation function g, the hidden-node output is given by

(14)

In order to keep the argument within the domain of the inverse activation function, here we give a normalized function: one form if the activation function is sin/cos, another if the activation function is sigmoid. Then, for a sine hidden node,

(15)

For a sigmoid hidden node,

(16)

Solving for the input weight, for the sine activation function we have

(17)

For the sigmoid activation function, we have

(18)

where X† is the Moore-Penrose generalized inverse of the input matrix of the given set of training examples [15]. Similar to [16], we have: 1) the solution is one of the least-squares solutions of the general linear system, meaning that the smallest error can be reached by this solution:

(19)

2) the special solution has the smallest norm among all the least-squares solutions of the system, which guarantees that the input weights remain bounded. Although the smallest error can be reached by equations (17)-(18), we can still reduce the error by adding a bias b. For the sine activation function:

(20)

For the sigmoid activation function,

(21)

According to (19) and Lemma 1, we have

(22)

We consider the residual error as

(23)

Let

(24)

Because of (24), equation (23) is still valid for

(25)

Now, based on equation (25), the sequence {‖e_n‖} is decreasing and bounded below by zero, and hence the sequence converges.

Remark 3

According to Theorems 2 and 3, for N arbitrary distinct samples (x_j, t_j), where x_j ∈ ℝ^n and t_j ∈ ℝ^m, the proposed network with n hidden nodes and activation function g is mathematically modeled as

(26)

where u is a normalized function. The proposed method for SLFNs can be summarized in Algorithm 1.

  Initialization: Given a training set, the hidden-node output function g, and a continuous target function f, set the number of hidden nodes n = 0 and the residual error e = T.
  Learning step:
  while the residual error e is larger than the target error do
     Increase the number of hidden nodes by one: n = n + 1;
     Step 1) set the output weight of the new hidden node equal to one;
     Step 2) calculate the input weight a_n and bias b_n based on equation (12);
     Step 3) calculate the residual error e after adding the new hidden node n;
  end while
Algorithm 1: The proposed learning method
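The loop above can be sketched end to end for sine hidden nodes. This is a hedged sketch, not the paper's exact procedure: the scaling factor `s`, which keeps the residual inside arcsin's domain, stands in for the normalized function in the text, and the function name is ours.

```python
import numpy as np

def train_no_output_weight(X, T, max_nodes=5, tol=1e-8):
    """Incrementally add sine hidden nodes with output weight fixed at 1.

    Each new node's input weight is obtained by pulling the current
    residual back through arcsin via the pseudoinverse of X.
    """
    e = T.copy()                          # residual error, e_0 = T
    Xp = np.linalg.pinv(X)
    nodes = []
    for _ in range(max_nodes):
        s = max(1.0, np.max(np.abs(e)))   # keep e/s inside arcsin's domain
        a = Xp @ np.arcsin(e / s)         # Step 2: pull residual back to input weight
        h = s * np.sin(X @ a)             # new hidden-node output (output weight = 1)
        e = e - h                         # Step 3: update the residual
        nodes.append((a, s))
        if np.linalg.norm(e) < tol:       # while-condition of Algorithm 1
            break
    return nodes, e

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
T = np.sin(X @ np.array([0.1, 0.05, -0.08, 0.02]))  # realizable by one sine node
nodes, e = train_no_output_weight(X, T)
```

On a target that is exactly realizable by a single sine node, the residual vanishes after the first node is added, which mirrors the one-hidden-node behavior reported in Section V-C.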
Remark 4

Unlike other neural-network learning methods, in which the output weight parameters must be adjusted, in the proposed method the output weights of the SLFN can be set equal to the unit matrix; thus the proposed neural network does not need output weights at all. The architecture and computational cost of the proposed method are therefore much smaller than those of traditional SLFNs.

Remark 5

Subsection V-C presents experiments which show that the proposed method with only one hidden node can give better generalization performance than the proposed network with n hidden nodes. Based on these experimental results, for N arbitrary distinct samples (x_j, t_j), where x_j ∈ ℝ^n and t_j ∈ ℝ^m, the proposed network is mathematically modeled as

(27)

where u is a normalized function. Thus, Algorithm 1 can be modified into Algorithm 2.

  Initialization: Given a training set, the hidden-node output function g, and a continuous target function f, set the number of hidden nodes n = 1.
  Learning step:
  Step 1) set the output weight of the hidden node equal to one;
  Step 2) calculate the input weight a and bias b based on equation (12);
Algorithm 2: The proposed learning method with one hidden node

V Experimental Verification

Datasets #Attributes #Train #Test
Auto MPG 8 200 192
Machine CPU 6 100 109
Fried 11 20768 20000
Wine Quality 12 2898 2000
Puma 9 4500 3692
California Housing 8 16000 4000
House 8L 9 16000 6784
Parkinsons motor 26 4000 1875
Parkinsons total 26 4000 1875
Puma 9 6000 2192
Delta elevators 6 6000 3000
Abalone 9 3000 1477
TABLE I: Specification of regression problems
Datasets #Feature #Train # Test
A9a 123 32561 16281
colon-cancer 2000 40 22
USPS 256 7291 2007
Sonar 60 150 58
Hill Valley 101 606 606
Protein 357 17766 6621
TABLE II: Specification of small/medium-sized classification problems
Datasets #Feature #Train # Test
Covtype.binary 54 300000 280000
Mushrooms 112 4000 4122
Gisette 5000 6000 1000
Leukemia 7129 38 34
Duke 7129 29 15
Connect-4 126 50000 17557
Mnist 780 40000 30000
DNA 180 1046 1186
w3a 300 4912 44837
TABLE III: Specification of large-sized classification problems

To examine the performance of our proposed algorithm, in this section we test it on benchmark regression and classification problems. The compared neural network methods include SVR, SVM, BP, EM-ELM, I-ELM, EI-ELM, B-ELM, ELM, and the proposed method.

V-a Benchmark Data Sets

In order to extensively verify the performance of the different algorithms, a wide variety of data sets have been tested in our simulations: small and large sizes, medium and high dimensions. These data sets include 12 regression problems and 15 classification problems. Most of the data sets are taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) and the LIBSVM data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

Regression Data Sets: The 12 regression data sets (cf. Table I) can be classified into two groups:

1) data sets with relatively small size and low dimensions, e.g., Auto MPG, Machine CPU, Puma, Wine, Abalone;

2) data sets with relatively medium size and low dimensions, e.g., Delta, Fried, California Housing, Parkinsons;

Classification Data Sets: The 15 classification data sets (cf. Tables II and III) can be classified into three groups:

1) data sets with relatively medium size and medium dimensions, e.g., Sonar, Hill Valley, W3a, DNA, Mushrooms, A9a, USPS;

2) data sets with relatively small size and high dimensions, e.g., Colon-cancer, Leukemia, Duke;

3) data sets with relatively large size and high dimensions, e.g., Protein, Covtype.binary, Gisette, Mnist, Connect-4.

In these data sets, the input data and the regression output data are normalized into fixed ranges. All data sets have been preprocessed in the same way (hold-out method). Ten different random permutations of the whole data set are taken without replacement; part (see the tables) is used to create the training set and the remainder is used as the test set. The average results are obtained over 50 trials for all problems.
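The normalization step can be sketched with standard min-max scaling. The exact target ranges are not recoverable from this copy of the text, so they are left as parameters here; ELM papers conventionally map inputs into [−1, 1], but that is an assumption.

```python
import numpy as np

def minmax_scale(X, lo=-1.0, hi=1.0):
    """Scale each column of X into [lo, hi] (per-attribute min-max scaling)."""
    xmin = X.min(axis=0)
    xmax = X.max(axis=0)
    # Small offset guards against zero-range (constant) attributes.
    return lo + (hi - lo) * (X - xmin) / (xmax - xmin + 1e-12)

data = np.array([[0.0, 10.0],
                 [5.0, 20.0],
                 [10.0, 40.0]])
scaled = minmax_scale(data)
```

The scaling constants must be computed on the training split and reused on the test split, consistent with the hold-out protocol above.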

V-B Simulation Environment Settings

The simulations of the different algorithms on the data sets shown in Tables I and II are carried out in the MATLAB 2009a environment, running on the same Windows 7 machine with 2 GB of memory and an i5-430 (2.33 GHz) processor. The codes for SVM and SVR were downloaded from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/); the codes for B-ELM, ELM, and I-ELM were downloaded from the ELM website (http://www.ntu.edu.sg/home/egbhuang/elmcodes.html).

For SVM and SVR, in order to achieve good generalization performance, the cost parameter C and kernel parameter γ need to be chosen appropriately. We tried a wide range of C and γ. For each data set, similar to [17], we used 30 different values of C and of γ, resulting in a total of 900 pairs of (C, γ). Average results of 50 trials of simulations with each combination of (C, γ) were obtained, and the best performance obtained by SVM/SVR is reported in this paper.

For BP, the number of hidden nodes is gradually increased in intervals of 5, and the nearly optimal number of nodes is then selected by cross-validation. Average results of 50 trials of simulations for each fixed size of the SLFN were obtained, and the best performance obtained by BP is likewise reported in this paper.

Simulations on the large data sets (cf. Table III) are carried out on a high-performance computer with an Intel Xeon E3-1230 v2 (3.2 GHz) processor and 16 GB of memory.

V-C Generalization performance comparison of ELM methods with different hidden nodes

The aim of this subsection is to show that the proposed method with only one hidden node generally achieves better generalization performance than other learning methods, and also that the proposed method performs better with one hidden node than with n hidden nodes. In this subsection, I-ELM, ELM, EI-ELM, and the proposed method are compared on one regression problem and three classification problems: Fried, DNA, USPS, and Mushrooms. In these cases, all the algorithms increase the hidden nodes one by one. More importantly, we find that the testing accuracy obtained by the proposed method reaches a very high value when only one hidden node is used, and that this accuracy does not increase but instead decreases as hidden nodes are added one by one. This means the proposed method only needs to calculate the one-hidden-node parameters once, after which an SLFN without output weights achieves generalization performance similar to that of other learning methods with hundreds of hidden nodes. Thus, in the following experiments, the number of hidden nodes is set to one for the proposed method.

Datasets I-ELM (200 nodes) B-ELM (200 nodes) EI-ELM (200 nodes) the proposed method (1 node)
Mean time(s) Mean time(s) Mean time(s) Mean time(s)
House 8L 0.0946 1.1872 0.0818 3.8821 0.0850 10.7691 0.0819 0.0020
Auto MPG 0.1000 0.2025 0.0920 0.3732 0.0918 1.3004 0.0996 <0.0001
Machine CPU 0.0591 0.1909 0.0554 0.3469 0.0551 1.2633 0.0489 <0.0001
Fried 0.1135 0.8327 0.0857 5.5063 0.0856 7.4016 0.0834 0.0051
Delta ailerons 0.0538 0.4680 0.0431 1.3946 0.0417 3.5478 0.0453 <0.0001
PD motor 0.2318 0.4639 0.2241 4.7680 0.2251 3.9016 0.2210 0.0037
PD total 0.2178 0.4678 0.2137 4.9278 0.2124 3.7854 0.2136 0.0023
Puma 0.1860 0.5070 0.1832 2.1846 0.1830 4.2161 0.1808 0.0012
Delta ele 0.1223 0.5313 0.1156 1.6206 0.1155 4.0240 0.1174 <0.0001
Abalone 0.0938 0.3398 0.0808 1.2549 0.0848 2.6676 0.0828 0.0017
Wine 0.1360 0.3516 0.1264 1.7098 0.1266 2.7126 0.1250 0.0031
California house 0.1801 1.1482 0.1450 7.2625 0.1505 12.0832 0.1420 0.0078
TABLE IV: Performance comparison (Mean: mean testing RMSE; Time: training time)
Datasets EM-ELM (200 nodes) ELM (200 nodes) the proposed method (1 node)
Mean time(s) Mean time(s) Mean time(s)
House 8L 0.0663 7.0388 0.0718 0.8369 0.0819 0.0020
Auto MPG 0.0968 0.0075 0.0976 0.0156 0.0996 <0.0001
Machine CPU 0.0521 0.1385 0.0513 0.0069 0.0489 <0.0001
Fried 0.0618 18.0290 0.0619 1.3135 0.0834 0.0051
Delta ailerons 0.0421 0.1342 0.0431 0.0616 0.0453 <0.0001
PD motor 0.2196 0.7394 0.2190 0.2730 0.2203 0.0037
PD total 0.2094 0.5944 0.2076 0.2838 0.2136 0.0023
Puma 0.1478 4.8392 0.1602 0.3728 0.1808 0.0012
Abalone 0.0817 0.1638 0.0824 0.0761 0.0828 0.0017
Wine 0.1216 0.3806 0.1229 0.1950 0.1250 0.0031
California house 0.1302 3.5574 0.1354 0.9753 0.1420 0.0078
TABLE V: Performance comparison (Mean: mean testing RMSE; Time: training time)
Datasets ELM (1 node) the proposed method (1 node)
Mean time(s) Mean time(s)
House 8L 0.1083 0.0009 0.0819 0.0020
Auto MPG 0.2126 <0.0001 0.0996 <0.0001
Machine CPU 0.1331 <0.0001 0.0489 <0.0001
Fried 0.2207 0.0031 0.0834 0.0051
Delta ailerons 0.0864 <0.0001 0.0453 <0.0001
PD motor 0.2620 0.0020 0.2210 0.0037
PD total 0.2548 0.0007 0.2136 0.0023
Puma 0.2856 0.0012 0.1808 0.0012
Delta ele 0.1454 <0.0001 0.1174 <0.0001
Abalone 0.1363 0.0007 0.0828 0.0017
Wine 0.1750 0.0006 0.1250 0.0031
California house 0.2496 0.0027 0.1420 0.0078
TABLE VI: Performance comparison (Mean: mean testing RMSE; Time: training time)
Datasets Epsilon-SVR BP the proposed method (1 node)
Mean time(s) Mean time(s) Mean time(s)
House 8L 0.0799 53.6531 0.0790 27.8462 0.0819 0.0020
Auto MPG 0.0985 0.0234 0.0953 1.6034 0.0996 <0.0001
Machine CPU 0.0727 0.0187 0.0843 0.7129 0.0489 <0.0001
Fried 0.0829 197.9534 0.0591 81.8774 0.0834 0.0051
Delta ailerons 0.0402 6.8718 0.0415 12.6735 0.0453 <0.0001
California house 0.1529 35.2250 0.1435 54.3081 0.1420 0.0078
PD total 0.2082 7.2540 0.2120 12.6438 0.2136 0.0023
TABLE VII: Performance comparison (Mean: mean testing RMSE; Time: training time)
Datasets SVM ELM the proposed method (1 node)
Mean time(s) Mean time(s) #node Mean time(s)
Covtype.binary 74.84% 413.5275 77.27% 36.5947 500 76.55% 1.2043
Mushrooms 86.90% 38.6247 46.97% 0.9126 500 88.84% 0.0047
Gisette 77.68% 309.3968 88.69% 6.4093 500 94.10% 48.2027
Leukemia 82.58% 2.3914 76.47% 9.0340 5000 85.29% 20.9915
W3a 97.18% 4.5552 97.25% 0.9095 500 98.17% 0.1872
Duke 86.36% 0.0156 79.27% 7.8437 5000 92.67% 20.0352
Connect-4 66.01% 569.6221 76.55% 7.3757 500 75.40% 0.7597
Mnist 70.85% 478.4707 91.60% 8.1651 500 84.20% 8.8858
DNA 93.70% 0.4680 84.94% 0.2122 500 92.41% 0.0187
TABLE VIII: Performance comparison (Mean: mean testing accuracy; Time: training time)
Datasets SVM ELM the proposed method (1 node)
Mean time(s) Mean time(s) #node Mean time(s)
A9a 77.39% 295.0603 85.10% 4.5871 500 85.57% 0.5714
Colon 76.67% 10.0156 80.67% 11.6283 5000 85.06% 0.9719
USPS 94.65% 146.4942 93.54% 2.0639 500 88.86% 0.4898
Sonar 86.29% 0.0172 80.86% 0.0686 500 75.69% <0.0001
Hill Valley 58.67% 0.1295 64.31% 0.1647 500 67.61% 0.0047
Protein 51.18% 253.5796 67.09% 5.0919 500 68.76% 1.9953
TABLE IX: Performance comparison (Mean: mean testing accuracy; Time: training time)

V-D Real-world regression problems

The experimental results comparing the proposed method with other incremental ELMs (B-ELM, I-ELM, and EI-ELM) are given in Tables IV and V. In these tables, close results obtained by different algorithms are underlined, and clearly better results are shown in boldface. All the incremental ELMs (I-ELM, B-ELM, EI-ELM) increase the hidden nodes one by one until the number of nodes reaches 200, while the fixed ELMs (ELM, EM-ELM) use 200 hidden nodes. It can be seen that the proposed method always achieves performance similar to the other ELMs with a much higher learning speed. In Table IV, for the Machine CPU problem, the proposed method runs 1900, 3400, and 12000 times faster than I-ELM, B-ELM, and EI-ELM, respectively. For the Abalone problem, the proposed method runs 200, 700, and 1600 times faster than I-ELM, B-ELM, and EI-ELM, respectively. In Table V, for the Wine problem, the proposed method runs 120 and 60 times faster than EM-ELM and ELM, respectively. The testing RMSE of EI-ELM is 2 times larger than that of B-ELM; B-ELM also runs 1.5 times faster than I-ELM, whose testing RMSE is 5 times larger than that of B-ELM.

If only one hidden node is used, ELM methods such as I-ELM, ELM, EM-ELM, and B-ELM can be considered the same learning method (ELM [13]). Thus, in Table VI, we carried out a performance comparison between the proposed method and one-hidden-node ELM. As observed from Table VI, the average testing RMSE obtained by the proposed method is much better than that of ELM. For the California house and Delta ailerons problems, the testing RMSE obtained by ELM is 2 times larger than that of the proposed method. In real applications, an SLFN with only one hidden node has an extremely small network structure, meaning that after training, this small network may respond to new, unknown external stimuli much faster and more accurately than other ELM algorithms in real deployment.

V-E Real-world classification problems

In order to demonstrate the classification performance of the proposed method, the testing accuracy of the proposed method and other algorithms has also been compared. Tables VIII and IX display the performance comparison of SVM, ELM, and the proposed method. In these tables, close results obtained by different algorithms are underlined, and clearly better results are shown in boldface. As seen from the simulation results given in these tables, the proposed method always achieves performance comparable to SVM and ELM with a much faster learning speed. Take Covtype.binary (a large number of training samples with medium input dimensions) and Gisette (a medium number of training samples with high input dimensions) as examples.

1) For Covtype.binary data set, the proposed method runs 1403 times and 35 times faster than ELM and SVM, respectively.

2) For Gisette data set, the proposed method runs 341 times and 1.7 times faster than ELM and SVM, respectively.

Huang et al. [1], [18], [16], [13] have systematically investigated the performance of ELM, SVM/SVR, and BP on most of the data sets tested in this work. They found that ELM obtains generalization performance similar to that of SVM/SVR, but in a much simpler and faster way. In line with those works, our testing results (cf. Tables VII-IX) show that the proposed method always provides performance comparable to SVM/SVR and BP with a much faster learning speed.

On the other hand, the proposed method requires less human intervention than SVM, BP, and the other ELM methods. Unlike SVM, which is sensitive to the combination of parameters (C, γ), or the other ELM methods, in which the number of hidden nodes must be specified by the user, the proposed method has no user-specified parameters and is easy to use in its implementation.

VI Conclusion

Unlike other SLFN learning methods, in our new approach one may simply calculate the hidden-node parameters once; the output weights are not needed at all. It has been rigorously proved that the proposed method can greatly enhance learning effectiveness, reduce computational cost, and further increase learning speed. The simulation results with sigmoid hidden nodes show that, compared to other learning methods including SVM/SVR, BP, and ELMs, the new approach can reduce neural-network training time by several to thousands of times and can be applied to regression and classification problems. Thus, this method can be used efficiently in many applications.

However, we find an interesting phenomenon in this method that we are not yet able to prove, and which is worth pointing out. Experimental results show that the proposed learning method with one hidden node can achieve better generalization performance than the same method with n hidden nodes. This phenomenon brings many advantages, but if researchers can uncover its nature, it could have far-reaching consequences for the generalization ability of neural networks.

References

  • [1] Guang-Bin Huang, Lei Chen, and Chee-Kheong Siew. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, 17(4):879–892, 2006.
  • [2] Herbert Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. 2002 (revised 2005), pp. 1–46.
  • [3] G. B. Huang, P. Saratchandran, and N. Sundararajan. An efficient sequential learning algorithm for growing and pruning rbf (gap-rbf) networks. IEEE Transactions on Systems Man and Cybernetics Part B-cybernetics, 34(6):2284–2292, December 2004.
  • [4] Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.
  • [5] Rui Zhang, Yuan Lan, Guang-Bin Huang, and Zong-Ben Xu. Universal approximation of extreme learning machine with adaptive growth of hidden nodes. IEEE Transactions on Neural Networks and Learning Systems, 23(2):365–371, February 2012.
  • [6] B. Igelnik and Y. H. Pao. Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 6(6):1320–9, 1995.
  • [7] Y. H. PAO, G. H. PARK, and D. J. SOBAJIC. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, April 1994.
  • [8] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
  • [9] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine. In Technical Report ICIS/03/2004 (also in http://www.ntu.edu.sg/eee/icis/cv/egbhuang.htm), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, January 2004.
  • [10] Ming-Bin Li, Guang-Bin Huang, P. Saratchandran, and Narasimhan Sundararajan. Fully complex extreme learning machine. Neurocomputing, 68:306–314, 2005.
  • [11] Guang-Bin Huang and Chee-Kheong Siew. Extreme learning machine with randomly assigned RBF kernels. International Journal of Information Technology, 11(1):16–24, 2005.
  • [12] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Real-time learning capability of neural networks. In Technical Report ICIS/45/2003, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, April 2003.
  • [13] G. B. Huang, H. M. Zhou, X. J. Ding, and R. Zhang. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems Man and Cybernetics Part B-cybernetics, 42(2):513–529, April 2012.
  • [14] Y. M. Yang, Y. N. Wang, and X. F. Yuan. Parallel chaos search based incremental extreme learning machine. Neural Processing Letters, 37(3):277–301, June 2013.
  • [15] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1):80–86, February 2000.
  • [16] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70:489–501, 2006.
  • [17] C. W. Hsu and C. J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002.
  • [18] Guorui Feng, Guang-Bin Huang, Qingping Lin, and Robert Gay. Error minimized extreme learning machine with growth of hidden nodes and incremental learning. IEEE Transactions on Neural Networks, 20(8):1352–1357, 2009.