The widespread popularity of neural networks in many fields is mainly due to their ability to approximate complex nonlinear mappings directly from input samples. Over the past two decades, owing to their universal approximation capability, feedforward neural networks (FNNs) have been extensively used in classification and regression problems. According to Jaeger's estimation, 95% of the literature is mainly on FNNs. As a specific type of FNN, the single-hidden-layer feedforward network (SLFN) plays an important role in practical applications. For N arbitrary distinct samples (x_i, t_i), where x_i ∈ R^d and t_i ∈ R^m, an SLFN with n hidden nodes and activation function g is mathematically modeled as

f_n(x) = Σ_{i=1}^{n} β_i g(a_i · x + b_i)

where g(a_i · x + b_i) denotes the output of the i-th hidden node with hidden-node parameters (a_i, b_i), and β_i is the output weight between the i-th hidden node and the output node. a_i · x denotes the inner product of the vectors a_i and x in R^d.
An active topic concerning the universal approximation capability of SLFNs is then how to determine the parameters a_i, b_i, and β_i such that the network output f_n can approximate a given target f. The power of SLFNs rests on two sets of parameters: the output weights and the hidden-node parameters. According to conventional neural network theories, SLFNs are universal approximators when all the parameters of the network, including the hidden-node parameters and the output weights, are allowed to be adjustable.
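As an illustration of this model, the following sketch evaluates f_n(x) for a batch of inputs (the dimensions, random weights, and tanh activation here are illustrative choices, not the paper's):

```python
import numpy as np

def slfn_output(X, A, b, beta, g=np.tanh):
    """f_n(x) = sum_i beta_i * g(a_i . x + b_i), evaluated for N inputs.
    X: (N, d) inputs; A: (d, n) input weights (a_i as columns);
    b: (n,) hidden biases; beta: (n, m) output weights."""
    H = g(X @ A + b)   # (N, n): output of each hidden node for each sample
    return H @ beta    # (N, m): weighted sum over the n hidden nodes

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))                     # N = 4 samples, d = 3
A, b = rng.standard_normal((3, 5)), rng.standard_normal(5)
beta = rng.standard_normal((5, 1))                  # n = 5 nodes, m = 1 output
Y = slfn_output(X, A, b, beta)
```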
Unlike the above neural network theories, in which all the parameters of the network are adjustable, other researchers have proposed semi-random network theories. For example, Lowe focused on a specific RBF network: the centers can be randomly selected from the training data instead of being tuned, but the impact factor of the RBF hidden nodes is not randomly selected and is usually determined by users.
Unlike the above semi-random network theories, in 2006 Huang et al. showed that iterative techniques are not required for adjusting the parameters of SLFNs at all. Based on this idea, Huang et al. proposed simple and efficient learning steps referred to as the extreme learning machine (ELM). Huang et al. proved that SLFNs with randomly generated hidden-node parameters can work as universal approximators by only calculating the output weights linking the hidden layer to the output nodes. Recent ELM developments show that ELM unifies FNNs and SVM/LS-SVM; compared to ELM, LS-SVM and PSVM achieve suboptimal solutions and have a higher computational cost.
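The ELM recipe can be sketched in a few lines (a sketch under assumed details; the node count, weight range, and toy target are illustrative):

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """ELM: hidden-node parameters are generated randomly and never tuned;
    only the output weights are computed, in closed form via a pseudoinverse."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-3, 3, (X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-3, 3, n_hidden)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))          # sigmoid hidden layer
    beta = np.linalg.pinv(H) @ T                    # least-squares output weights
    return A, b, beta

# toy regression: approximate sin(3x) on [-1, 1]
X = np.linspace(-1, 1, 100).reshape(-1, 1)
T = np.sin(3 * X)
A, b, beta = elm_train(X, T, n_hidden=40)
H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
train_rmse = float(np.sqrt(np.mean((H @ beta - T) ** 2)))
```

Because the hidden layer is fixed, training reduces to one linear least-squares solve, which is the source of ELM's speed.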
The above neural network theories indicate that SLFNs can work as universal approximators provided that, at a minimum, the hidden-node parameters and the output weights exist (the hidden-node parameters can be generated randomly). In this paper, however, we show that the output weights need not exist in SLFNs at all.
In earlier work we proposed a learning algorithm called the bidirectional extreme learning machine (B-ELM), in which half of the hidden-node parameters are not randomly selected but are instead calculated by pulling the network residual error back to the input weights. The experimental results indicated that B-ELM tends to reduce the network output error to a very small value at an extremely early learning stage. Furthermore, our recent experimental results indicate that in B-ELM the output weights play a very minor role in the network's learning effectiveness. Inspired by these experimental results, in this paper we show that SLFNs without output weights can approximate any target continuous function and classify any disjoint regions if one pulls the error back to the hidden-node parameters. In particular, the following contributions are made in this paper.
1) The learning speed of the proposed method can be several to thousands of times faster than that of other learning methods, including SVM, BP, and other ELM variants. Furthermore, it provides good generalization performance and can be applied directly to regression and classification applications.
2) Different from conventional SLFNs, in which both the hidden-node parameters and the output weights are needed, we prove that SLFNs without output weights can still approximate any target continuous function and classify any disjoint regions. Thus the architecture of this single-parameter neural network is much simpler than that of traditional SLFNs.
3) Different from other neural networks, which require a large number of hidden nodes (Huang et al. indicate that "the generalization performance of ELM is not sensitive to the dimensionality of the feature space (the number of hidden nodes) as long as it is set large enough"), our experimental study shows that the proposed learning method with only one hidden node can give significant improvements in accuracy instead of maintaining a hidden layer with a large number of nodes.
II Preliminaries and Notation
II-A Notations and Definitions
The sets of real, integer, positive real and positive integer numbers are denoted by R, Z, R+ and Z+, respectively. Let L2(X) be a space of functions f on a compact subset X of the d-dimensional Euclidean space R^d such that |f|^2 is integrable, that is, ∫_X |f(x)|^2 dx < ∞. For u, v ∈ L2(X), the inner product ⟨u, v⟩ is defined by

⟨u, v⟩ = ∫_X u(x) v(x) dx.
The norm in the L2(X) space will be denoted by ‖·‖, and n denotes the number of hidden nodes. For N training samples, X denotes the input matrix of the network and T denotes the desired output matrix. H is called the hidden-layer output matrix of the SLFN; the i-th column of H is the i-th hidden node's output with respect to the N inputs. The hidden-layer output matrix is said to be a randomly generated function sequence if the corresponding hidden-node parameters (a_i, b_i) are randomly generated. e_n denotes the residual error function for the current network f_n with n hidden nodes. I is the identity matrix.
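In this notation, H, the output weights, and the residual e_n can be formed directly (a sketch with made-up dimensions and a sine node type):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n = 6, 3, 2                     # samples, input dimension, hidden nodes
X = rng.standard_normal((N, d))       # input matrix
T = rng.standard_normal((N, 1))       # desired output matrix
A = rng.standard_normal((d, n))       # randomly generated hidden-node weights
b = rng.standard_normal(n)
H = np.sin(X @ A + b)                 # column j = output of node j over all N inputs
beta = np.linalg.pinv(H) @ T          # output weights of the current network
e_n = T - H @ beta                    # residual error of the n-hidden-node network
```

A least-squares residual is orthogonal to the column space of H, which is what makes e_n the natural quantity to feed back in incremental schemes.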
III Bidirectional ELM for regression problems
Given N training samples drawn from the same continuous function and the sigmoid or sine activation function g, define an error-feedback function sequence as follows. If the activation function is sin/cos, a normalized function u is given; if the activation function is sigmoid, a corresponding normalized function u is given. Then, for any continuous target function f and any randomly generated function sequence, the stated convergence holds with probability one if the conditions below are satisfied, where u and u^{-1} denote the normalized function and its inverse, respectively; one form applies if g is the sine activation function and another if g is the sigmoid activation function.
Compared with B-ELM, the proposed method makes only two changes. First, we set the output weight equal to the identity matrix. Second, the pseudoinverse of the input data is replaced by its ridge-regression counterpart. Although these changes are very small, the experimental results show that with the proposed learning method, a one-hidden-node SLFN without output weights (output weight equal to the identity matrix) can achieve generalization performance similar to that of other standard SLFNs with hundreds of hidden nodes. Furthermore, unlike B-ELM, which only works for regression problems, the proposed method can be applied to regression and multi-class classification applications.
IV SLFNs without output weight
Basic idea 1: our recent experimental results indicate that in B-ELM the output weights play a very minor role in the network's learning effectiveness. Inspired by these experimental results, in the proposed method we directly set the output weight equal to the identity matrix.
Given N training samples drawn from the same continuous function and an SLFN with any bounded nonconstant piecewise continuous activation function for additive nodes or sine nodes, for any continuous target function f and the obtained error-feedback function sequence, the stated identity holds with probability one if
The validity of this theorem follows immediately from the preceding definitions, since the two sides are equal by construction.
It is then easy to see that the proposed method can reduce the network output error to 0. Thus the learning problem has been converted into finding the optimal hidden-node parameters that achieve this reduction.
Basic idea 2: with the output weight fixed to the identity matrix (or a unit vector), it follows from equations (8)-(9) that training an SLFN is simply equivalent to finding a least-squares solution of a linear system. If the activation function is invertible, training an SLFN is simply equivalent to pulling the residual error back to the input weights. For example, for N arbitrary distinct samples, if the activation function is the sine function, training an SLFN is simply equivalent to finding a least-squares solution of the following linear system:
The smallest-norm least-squares solution of the above linear system is given by the Moore-Penrose generalized inverse. Based on this idea, we give the following theorem.
Given a bounded nonconstant piecewise continuous function g, we have
Given N arbitrary distinct samples and the sigmoid or sine activation function g, for any continuous desired output T, there exist parameters such that the stated identity holds with probability one if
where u denotes the normalized function defined below (one form if the activation function is sin/cos, another if it is sigmoid) and u^{-1} denotes its inverse.
For an activation function g, this is given by
In order to keep the normalized target within the invertible range of the activation, we give a normalized function u: one form if the activation function is sin/cos, and another if the activation function is sigmoid. Then for a sine hidden node,

For a sigmoid hidden node,

For the sine activation function, we then have

For the sigmoid activation function, we have
where X† denotes the Moore-Penrose generalized inverse of the input matrix of the given training examples. We then have: 1) this is one of the least-squares solutions of the general linear system, meaning that the smallest error can be reached by this solution; 2) this special solution has the smallest norm among all the least-squares solutions, which guarantees its boundedness. Although the smallest error can be reached by equations (17)-(18), we can still reduce the error by adding a bias b. For the sine activation function:
For the sigmoid activation function:
According to equation (19) and Lemma 1, we have
We consider the residual error as
It follows that equation (23) is still valid for
Now, based on equation (25), the sequence of residual norms is decreasing and bounded below by zero, so the sequence converges.
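The two properties of the Moore-Penrose solution used above, smallest residual and smallest norm among all least-squares solutions, can be checked numerically (an illustrative check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 8))        # underdetermined: infinitely many exact solutions
y = rng.standard_normal((5, 1))
a_star = np.linalg.pinv(X) @ y         # Moore-Penrose least-squares solution

# a direction in the null space of X: adding it keeps X @ a unchanged
null_dir = np.linalg.svd(X)[2][-1:].T  # last right singular vector, shape (8, 1)
a_other = a_star + 0.5 * null_dir      # another least-squares solution, larger norm
```

Since a_star is orthogonal to the null space of X, any other solution obtained this way has a strictly larger norm while attaining the same (here zero) residual.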
According to Theorems 2-3, for N arbitrary distinct samples (x_i, t_i), where x_i ∈ R^d and t_i ∈ R^m, the proposed network with n hidden nodes and activation function g is mathematically modeled as
where u is a normalized function. The proposed method for SLFNs can be summarized in Algorithm 1.
Different from other neural network learning methods, in which the output weight parameters must be adjusted, in the proposed method the output weight of the SLFN can be set equal to the identity matrix, so the proposed neural network does not need output weights at all. Thus the architecture and computational cost of the proposed method are much smaller than those of other traditional SLFNs.
Subsection V-C presents experiments showing that the proposed method with only one hidden node can give better generalization performance than the proposed network with more hidden nodes. Based on these experimental results, for N arbitrary distinct samples (x_i, t_i), where x_i ∈ R^d and t_i ∈ R^m, the proposed network is mathematically modeled as
where u is a normalized function. Thus Algorithm 1 can be modified as Algorithm 2.
V Experimental Verification
To examine the performance of our proposed algorithm, in this section we test it on several benchmark regression and classification problems. The compared methods are SVR, SVM, BP, EM-ELM, I-ELM, EI-ELM, B-ELM, ELM, and the proposed method.
V-A Benchmark Data Sets
In order to extensively verify the performance of the different algorithms, a wide range of data sets has been tested in our simulations, covering small and large sizes as well as medium and high dimensions. These data sets include 12 regression problems and 15 classification problems. Most of the data sets are taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) and the LIBSVM data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
Regression Data Sets: The 12 regression data sets (cf. Table I) can be classified into two groups of data:
1) data sets with relatively small size and low dimensions, e.g., Auto MPG, Machine CPU, Puma, Wine, Abalone;
2) data sets with relatively medium size and low dimensions, e.g., Delta, Fried, California Housing, Parkinsons;
Classification Data Sets: The 15 classification data sets (cf. Table II and Table III) can be classified into three groups of data:
1) data sets with relatively medium size and medium dimensions, e.g., Sonar, Hill Valley, Wa3, DNA, Mushrooms, A9a, USPS;
2) data sets with relatively small size and high dimensions, e.g., Colon-cancer, Leukemia, Duke;
3) data sets with relatively large size and high dimensions, e.g., Protein, Covtype.binary, Gisette, Mnist, Connect-4.
In these data sets, the input data are normalized into [−1, 1], while the output data for regression are normalized into the range [0, 1]. All data sets have been preprocessed in the same way (hold-out method): ten different random permutations of the whole data set are taken without replacement, some samples (see the tables) are used to create the training set, and the remainder is used for the test set. The average results are obtained over 50 trials for all problems.
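The preprocessing described above amounts to a per-column linear rescaling plus a random hold-out split; a sketch (the [−1, 1] input and [0, 1] target ranges and the split sizes are assumed here, the tables give the per-data-set choices):

```python
import numpy as np

def minmax(A, lo, hi):
    """Rescale each column of A linearly into [lo, hi]."""
    mn, mx = A.min(axis=0), A.max(axis=0)
    return lo + (hi - lo) * (A - mn) / (mx - mn)

rng = np.random.default_rng(4)
data = rng.standard_normal((30, 4))
X = minmax(data[:, :3], -1.0, 1.0)   # inputs into [-1, 1]
T = minmax(data[:, 3:], 0.0, 1.0)    # regression targets into [0, 1]

# hold-out split from one random permutation, as in the experiments
perm = rng.permutation(len(X))
train, test = perm[:20], perm[20:]
```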
V-B Simulation Environment Settings
The simulations of the different algorithms on the data sets shown in Table I and Table II are carried out in the MATLAB 2009a environment running on the same Windows 7 machine with 2 GB of memory and an Intel Core i5-430 (2.33 GHz) processor. The codes used for SVM and SVR are downloaded from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and the codes used for B-ELM, ELM and I-ELM are downloaded from the ELM web site (http://www.ntu.edu.sg/home/egbhuang/elmcodes.html).
For SVM and SVR, in order to achieve good generalization performance, the cost parameter C and the kernel parameter γ need to be chosen appropriately. We have tried a wide range of C and γ: for each data set, 30 different values of each were used, resulting in a total of 900 (C, γ) pairs. Average results of 50 trials of simulations with each combination of (C, γ) are obtained, and the best performance obtained by SVM/SVR is reported in this paper.
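The exhaustive (C, γ) search can be sketched generically (the candidate lists and scoring function below are placeholders; the paper's 30 values per parameter are not reproduced here):

```python
from itertools import product

def grid_search(evaluate, Cs, gammas):
    """Try every (C, gamma) pair and keep the best validation score."""
    return max(product(Cs, gammas), key=lambda cg: evaluate(*cg))

# toy score with a known optimum at C = 4, gamma = 0.5
score = lambda C, g: -((C - 4) ** 2 + (g - 0.5) ** 2)
Cs = [2 ** k for k in range(-2, 5)]       # placeholder candidate values
gammas = [2 ** k for k in range(-4, 3)]
best_C, best_g = grid_search(score, Cs, gammas)
```

In practice `evaluate` would train an SVM/SVR with the given pair and return the cross-validation accuracy or negative RMSE.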
For BP, the number of hidden nodes is gradually increased by an interval of 5, and the nearly optimal number of nodes for BP is then selected by cross-validation. Average results of 50 trials of simulations for each fixed size of SLFN are obtained, and the best performance obtained by BP is likewise reported in this paper.
Simulations on the large data sets (cf. Table III) are carried out on a high-performance computer with an Intel Xeon E3-1230 v2 (3.2 GHz) processor and 16 GB of memory.
V-C Generalization performance comparison of ELM methods with different hidden nodes
The aim of this subsection is to show that the proposed method with only one hidden node generally achieves better generalization performance than other learning methods, and moreover that the proposed method with one hidden node achieves better performance than the proposed method with more hidden nodes. In this subsection, I-ELM, ELM, EI-ELM and the proposed method are compared on one regression problem and three classification problems: Fried, DNA, USPS and Mushroom. In these cases, all the algorithms increase the hidden nodes one by one. More importantly, we find that the testing accuracy obtained by the proposed method reaches a very high value when only one hidden node is used, and that the testing accuracy is not increased but rather reduced as hidden nodes are added one by one. This means the proposed method only needs to calculate the one-hidden-node parameters once, and then the SLFN without output weights can achieve generalization performance similar to other learning methods with hundreds of hidden nodes. Thus, in the following experiments, the number of hidden nodes is set to one for the proposed method.
| Datasets | I-ELM (200 nodes) | B-ELM (200 nodes) | EI-ELM (200 nodes) | the proposed method (1 node) |
| Datasets | EM-ELM (200 nodes) | ELM (200 nodes) | the proposed method (1 node) |
| Datasets | ELM (1 node) | the proposed method (1 node) |
| Auto MPG | 0.2126 | < 0.0001 | 0.0996 | < 0.0001 |
| Datasets | Epsilon-SVR | BP | the proposed method (1 node) |
| Datasets | SVM | ELM | the proposed method (1 node) |
| Datasets | SVM | ELM | the proposed method (1 node) |
V-D Real-world regression problems
The experimental results comparing the proposed method with some other incremental ELMs (B-ELM, I-ELM, and EI-ELM) are given in Table IV and Table V. In these tables, close results obtained by different algorithms are underlined and the apparently better results are shown in boldface. All the incremental ELMs (I-ELM, B-ELM, EI-ELM) increase the hidden nodes one by one until the node count equals 200, while for the fixed ELMs (ELM, EM-ELM), 200 hidden nodes are used. It can be seen that the proposed method can always achieve performance similar to the other ELMs with much higher learning speed. In Table IV, for the Machine CPU problem, the proposed method runs 1900, 3400, and 12000 times faster than I-ELM, B-ELM and EI-ELM, respectively. For the Abalone problem, the proposed method runs 200, 700, and 1600 times faster than I-ELM, B-ELM and EI-ELM, respectively. In Table V, for the Wine problem, the proposed method runs 120 and 60 times faster than EM-ELM and ELM, respectively. Moreover, the testing RMSE of EI-ELM is 2 times larger than that of B-ELM, and B-ELM runs 1.5 times faster than I-ELM while the testing RMSE obtained by I-ELM is 5 times larger than that of B-ELM.
If only one hidden node is used, ELM methods such as I-ELM, ELM, EM-ELM and B-ELM can be considered the same learning method (ELM). Thus, in Table VI, we carry out performance comparisons between the proposed method and one-hidden-node ELM. As observed from Table VI, the average testing RMSE obtained by the proposed method is much better than that of ELM; for the California Housing and Delta Ailerons problems, the testing RMSE obtained by ELM is 2 times larger than that of the proposed method. In real applications, an SLFN with only one hidden node is an extremely small network structure, meaning that after training this small network may respond to new, unknown external stimuli much faster and more accurately than other ELM algorithms in real deployment.
V-E Real-world classification problems
In order to demonstrate the advantage of the proposed method on classification performance, the testing accuracy of the proposed method has also been compared with that of other algorithms. Tables VIII and IX display the performance comparison of SVM, ELM and the proposed method. In these tables, close results obtained by different algorithms are underlined and the apparently better results are shown in boldface. As seen from the simulation results in these tables, the proposed method can always achieve performance comparable to SVM and ELM with much faster learning speed. Take Covtype.binary (a large number of training samples with medium input dimensions) and Gisette (a medium number of training samples with high input dimensions) as examples:
1) For Covtype.binary data set, the proposed method runs 1403 times and 35 times faster than ELM and SVM, respectively.
2) For Gisette data set, the proposed method runs 341 times and 1.7 times faster than ELM and SVM, respectively.
Huang et al. have systematically investigated the performance of ELM, SVM/SVR and BP for most of the data sets tested in this work, finding that ELM obtains generalization performance similar to SVM/SVR but in a much simpler and faster way. Consistent with those works, our testing results (cf. Tables VII-IX) show that the proposed method always provides performance comparable to SVM/SVR and BP with much faster learning speed.
On the other hand, the proposed method requires less human intervention than SVM, BP and other ELM methods. Different from SVM, which is sensitive to the combination of parameters (C, γ), or from other ELM methods, in which the number of hidden nodes needs to be specified by users, the proposed method has no user-specified parameters and is easy to use.
Unlike other SLFN learning methods, in our new approach one may simply calculate the hidden-node parameters once, and the output weights are not needed at all. It has been rigorously proved that the proposed method can greatly enhance learning effectiveness, reduce computational cost, and eventually further increase learning speed. The simulation results with sigmoid hidden nodes show that, compared to other learning methods including SVM/SVR, BP and ELMs, the new approach can reduce the NN training time by several to thousands of times and can be applied to regression and classification problems. Thus this method can be used efficiently in many applications.
However, we find an interesting phenomenon in this method which we are not yet able to prove, and which is worth pointing out. Experimental results show that the proposed learning method with one hidden node can achieve better generalization performance than the same method with more hidden nodes. This phenomenon brings about many advantages, but if researchers can uncover its underlying cause, it could have far-reaching consequences for the generalization ability of neural networks.
-  Guang-Bin Huang, Lei Chen, and Chee-Kheong Siew. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, 17(4):879–892, 2006.
-  H. Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Technical report, 2002.
-  G. B. Huang, P. Saratchandran, and N. Sundararajan. An efficient sequential learning algorithm for growing and pruning rbf (gap-rbf) networks. IEEE Transactions on Systems Man and Cybernetics Part B-cybernetics, 34(6):2284–2292, December 2004.
-  Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.
-  Rui Zhang, Yuan Lan, Guang-Bin Huang, and Zong-Ben Xu. Universal approximation of extreme learning machine with adaptive growth of hidden nodes. IEEE Transactions on Neural Networks and Learning Systems, 23(2):365–371, February 2012.
-  B. Igelnik and Y. H. Pao. Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 6(6):1320–9, 1995.
-  Y. H. PAO, G. H. PARK, and D. J. SOBAJIC. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, April 1994.
-  D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
-  Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine. In Technical Report ICIS/03/2004 (also in http://www.ntu.edu.sg/eee/icis/cv/egbhuang.htm), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, January 2004.
-  Ming-Bin Li, Guang-Bin Huang, P. Saratchandran, and Narasimhan Sundararajan. Fully complex extreme learning machine. Neurocomputing, 68:306–314, 2005.
-  Guang-Bin Huang and Chee-Kheong Siew. Extreme learning machine with randomly assigned RBF kernels. International Journal of Information Technology, 11(1):16–24, 2005.
-  Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Real-time learning capability of neural networks. In Technical Report ICIS/45/2003, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, April 2003.
-  G. B. Huang, H. M. Zhou, X. J. Ding, and R. Zhang. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems Man and Cybernetics Part B-cybernetics, 42(2):513–529, April 2012.
-  Y. M. Yang, Y. N. Wang, and X. F. Yuan. Parallel chaos search based incremental extreme learning machine. Neural Processing Letters, 37(3):277–301, June 2013.
-  A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1):80–86, February 2000.
-  Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70:489–501, 2006.
-  C. W. Hsu and C. J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002.
-  Guorui Feng, Guang-Bin Huang, Qingping Lin, and Robert Gay. Error minimized extreme learning machine with growth of hidden nodes and incremental learning. IEEE Transactions on Neural Networks, 20(8):1352–1357, 2009.