A Vest of the Pseudoinverse Learning Algorithm

05/20/2018 · by Ping Guo, et al.

In this letter, we briefly review the basic scheme of the pseudoinverse learning (PIL) algorithm and present some discussions on the PIL, as well as its variants. The PIL algorithm, first presented in 1995, is a non-gradient descent algorithm for multi-layer neural networks and has several advantages compared with gradient descent based algorithms. We also show that the so-called extreme learning machine (ELM) is a vest (another name) of the PIL algorithm for single hidden layer feedforward neural networks.


I Introduction

The multi-layer perceptron (MLP) is a kind of feedforward neural network that was most intensively studied in the mid-eighties of the last century. With more than three hidden layers, an MLP is called a deep neural network (DNN), and training a DNN is a deep learning procedure. The MLP has already been found to be successful for various supervised learning tasks. Both theoretical and empirical studies have shown that the MLP has powerful capabilities for pattern classification and universal approximation [1]. When there are few hidden layers, the weight parameters of the network can be learned by a gradient descent learning algorithm, namely the well-known error back propagation (BP) algorithm [12][10]. As is well known, the BP algorithm has several disadvantages. It usually has a poor convergence rate and sometimes falls into local minima [14]. The selection of hyperparameters in the BP algorithm, such as the learning rate and momentum constant, is often crucial to the success of the algorithm.

In order to solve these problems arising in the BP algorithm, Guo et al. [4] proposed a non-gradient descent algorithm, later named the pseudoinverse learning (PIL) algorithm [5]. Unlike the BP algorithm, the PIL algorithm calculates the network weights exactly, rather than finding the weights by iterative optimization. The PIL algorithm adopts only generalized linear algebraic methods, e.g., pseudoinverse operations and matrix inner products. Moreover, the PIL does not need to explicitly set any control parameters, which usually have to be specified by users empirically.

II Basic Scheme of the Pseudoinverse Learning Algorithm

Fig. 1: A schematic diagram of a single hidden layer neural network.

The BP algorithm is a kind of gradient descent algorithm. Evidently, it was a great discovery in neural network learning, but it has a few well-known disadvantages, such as slow convergence and local minima. Non-gradient descent algorithms have been considered as alternative approaches, especially in the late 1990s and early 2000s, among which the PIL algorithm is a successful one [4][5], especially for the multi-layer perceptron (multi-layer neural networks). In the following we take a single hidden layer neural network (SHLN) as an example to review the learning algorithms.

Figure 1 is a schematic diagram of a single hidden layer neural network.

The SHLN shown in Fig. 1 has three layers: one input layer, one hidden layer and one output layer. There are n neurons in the input layer, p neurons in the hidden layer, and m neurons in the output layer. We use x to express the n-dimensional input vector, and o stands for the m-dimensional output vector. x_0 in the input layer is called the bias neuron, and h_0 is the hidden layer bias neuron. V is the weight matrix connecting the input and hidden neurons, and W is the weight matrix connecting the hidden and output neurons. The network mapping function is expressed as

o = g(x; Θ) = f( f(x^T V) W ),   (1)

where Θ stands for the network parameter group, including the connecting weights V, W and the bias neurons, and f(·) is an activation function; the most used function types include the sigmoid, hyperbolic tangent, step and radial basis functions.

From the above equation, we can see that f(x^T V) is the hidden layer output, and f(f(x^T V) W) is the last layer output. If we let x ← (1, x_1, …, x_n)^T and include the bias weights in V, the weight matrix V now becomes an augmented matrix. In the literature, the bias neuron is used to prevent a zero input vector from destroying the weight update in sequential training; the value of the bias is usually set to 1, but some researchers also treat it as a variable. For conciseness of the mathematical expressions, the hidden layer bias neuron is often omitted. Please note that b_i in Ref. [9] is called the threshold of the i-th hidden node, but it plays no role in any threshold operation.

When a data set D = {(x_i, t_i)}_{i=1}^{N} is given, where t_i is the expected output (target) for the input x_i, training the network means finding the weight parameters that minimize a cost function. The cost function, also known as the loss function or error function, is used to measure the difference between the actual outputs and the expected outputs of the network. There are many forms of the error function, determined by the probability distribution of the error. When the error distribution is Gaussian, the cost function is the sum of squared errors (SSE):

E = Σ_{i=1}^{N} Σ_{k=1}^{m} ( o_k(x_i) − t_{ik} )².   (2)

For simplicity, we can write this system error function in matrix form:

E = ‖ O − T ‖²_F,   (3)

where the subscript F stands for the Frobenius norm, O is the matrix of actual network outputs, and T is the matrix of expected outputs (targets).

The purpose of neural network training is to find the weight parameters that minimize the cost function. The traditional learning algorithm is the BP algorithm. As mentioned earlier, learning-algorithm-related hyperparameters, such as the learning rate and momentum constant, need to be selected by the user in the BP algorithm. The choice of these hyperparameters is quite difficult for most beginners in neural network research.

In order to overcome the drawbacks of the BP algorithm, Guo et al. [4] proposed the PIL algorithm to train a SHLN in 1995. In that work, the activation function is taken as the hyperbolic tangent function Tanh(x). The weight parameter matrix is found by minimizing the following error function:

E = ‖ H W − B ‖²,   (4)

where H = f(XV) is the output matrix of the hidden layer, X is the input matrix consisting of the input vectors as its rows (its number of columns is the input vector dimension plus 1), B = ArcTanh(T), and T is the target label matrix which consists of the label vectors as its rows, with the number of columns equal to the target vector dimension.

Eq. (4) is formally a least squares problem in linear algebra. However, only the matrices X and B are known at present; the weight matrices V and W are not yet known. The task of network learning is to find these matrices. According to the theorems of linear algebra, a formal solution for W in Eq. (4) is W = H^+ B, where H^+ is the pseudoinverse of H. Bringing this formal solution into Eq. (4), we obtain the following mathematical form:

‖ H H^+ B − B ‖² = 0.   (5)

If Eq. (5) holds, an intuitive explanation is that H H^+ = I will satisfy the requirement. So if W = H^+ B is the final solution, H should be a full rank matrix. With this new objective function, we can adopt some methods to set the input weight matrix V so as to make the hidden layer output matrix H approach full rank. A simple way is to set V to be a random value matrix, which can reach this goal together with the nonlinear transformation, or we can set V as the pseudoinverse of the input matrix X.

As we know, in a SHLN the number of hidden neurons is a hyperparameter of the neural network architecture. When a given problem is formulated, the number of neurons in the input layer is determined by the dimensionality of the input data, and the number of neurons in the output layer depends on the specific problem. The number of hidden layer neurons is the only architecture hyperparameter of a SHLN, and choosing it is perhaps the hardest problem for most beginners (see [3] for more details). In the work of Guo et al. [4], the number of hidden neurons is set to be N, the number of training samples, for the purpose of exact learning.

Hence, we have a fast learning algorithm which computes the weight matrices directly instead of using an iterative approach. For most problems, it needs only one step to reach perfect learning. The algorithm can be summarized as follows:

Algorithm PIL: Given a data set, we draw N pair samples as the training set, take the activation function f(x) = Tanh(x), and set the hidden neuron number to be N.

  • Step 1: Compute V = X^+ and the hidden layer output matrix H = f(XV),

  • Step 2: Compute H^+ and B = ArcTanh(T),

  • Step 3: Compute the output weight matrix W = H^+ B.

And the network output is O = f(HW).
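To make the scheme concrete, here is a minimal NumPy sketch of the steps above, under our assumptions that the bias column has already been appended to X and that the entries of T lie strictly inside (-1, 1) so that ArcTanh(T) is defined; the function names are ours, not from the original papers.

```python
import numpy as np

def pil_train(X, T):
    """Minimal PIL sketch for a single hidden layer network.

    X : (N, n+1) augmented input matrix, one input vector per row.
    T : (N, m) target matrix with entries strictly inside (-1, 1).
    Returns the input weight matrix V and output weight matrix W.
    """
    V = np.linalg.pinv(X)        # Step 1: V = X^+, so H = f(XV) is N x N
    H = np.tanh(X @ V)           # hidden layer output matrix H = f(XV)
    B = np.arctanh(T)            # Step 2: B = ArcTanh(T)
    W = np.linalg.pinv(H) @ B    # Step 3: W = H^+ B
    return V, W

def pil_predict(X, V, W):
    """Network output O = f(f(XV) W)."""
    return np.tanh(np.tanh(X @ V) @ W)
```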

As we know, the common practice in the BP algorithm is random initialization of the weight parameters, which are then updated with the delta learning rule. In the PIL algorithm, by contrast, the weight parameters are computed with the pseudoinverse solution and do not need to be adjusted further. In the work of Guo et al. [4], randomly set input weights were also investigated. The following sentences are copied from Ref. [4]:

“A simple method is to set V as a random n by N matrix. In practice, this is not a proper method. As we mentioned above, we use Tanh(x) as the activate function. If the matrix Z = XV contains elements of large values, it will result in complex numbers which was not desirable. So it is better to choose a proper matrix V so that Z has no elements of large value. One way is to set the values of the elements as small as possible.”

Writing the above description in the following form, we can regard it as a variant of the PIL algorithm:

Algorithm PIL0: Given a data set, we draw N pair samples as the training set, take the activation function f(x) = Tanh(x), and set the hidden neuron number to be N.

  • Step 1: Randomly assign the input weight matrix V (the element values in V should lie in a small interval, say [-1, +1]),

  • Step 2: Compute the hidden layer output matrix H = f(XV),

  • Step 3: Compute the output weight matrix W = H^+ B, where B = ArcTanh(T).

And the network output is O = f(HW).
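Under the same assumptions as the previous sketch, PIL0 differs only in how the input weight matrix V is chosen (the interval [-1, +1] matches the text above; the seed is an illustrative choice of ours):

```python
import numpy as np

def pil0_train(X, T, rng=None):
    """Minimal PIL0 sketch: random input weights, pseudoinverse output weights.

    X : (N, n+1) augmented input matrix, T : (N, m) targets inside (-1, 1).
    The hidden neuron number is set to N, the number of training samples.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    N = X.shape[0]
    V = rng.uniform(-1.0, 1.0, size=(X.shape[1], N))  # Step 1: small random interval
    H = np.tanh(X @ V)                                # Step 2: H = f(XV), an N x N matrix
    W = np.linalg.pinv(H) @ np.arctanh(T)             # Step 3: W = H^+ B, B = ArcTanh(T)
    return V, W
```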

Remarks

  1. Setting the hidden neuron number equal to N is for exact learning; if a training error is allowed, we can set the hidden neuron number to be less than N.

  2. The activation function can be taken as any nonlinear transformation function, such as the sigmoid, a Gaussian kernel, and so on.

  3. When the last layer activation function is taken as a linear function, W = H^+ T [5].

Recently, we found that Huang et al. [8][9] created a name called extreme learning machine (ELM). Comparing the ELM algorithm described in [8][9] with ours, we can easily find that it is exactly the same as our PIL0 algorithm for SHLN in its learning scheme. So we consider the ELM algorithm to be a variant created by simple name alternation (VEST) of the PIL algorithm.

III PIL Variants

For a number of data sets, the PIL algorithm for SHLN can reach accurate learning. But for some other data, the learning accuracy of the SHLN cannot meet high precision requirements if we simply assign the input weight matrix V as the pseudoinverse of X, or as a random value matrix without any constraints. In 2001, Guo et al. [5] proposed a new solution that extended the neural network architecture from a single hidden layer to multiple hidden layers. Later, Guo et al. extended their work and published it in the journal Neurocomputing (copyright 2003) [6].

The PIL algorithm for multilayer neural networks is summarized as follows:

Algorithm ePIL: Given a data set, we draw N pair samples as the training set, take an activation function f(·), and set the hidden neuron number of each layer to be N.

  • Step 1: Set H^0 = X, l = 0, and compute (H^0)^+,

  • Step 2: Compute ‖ H^l (H^l)^+ T − T ‖². If it is less than the given error ε, go to step 5. If not, go on to the next step.

  • Step 3: Let W^l = (H^l)^+. Feed forward the result to the next layer, and compute H^{l+1} = f(H^l W^l).

  • Step 4: Compute (H^{l+1})^+, set l ← l + 1, and go to step 2.

  • Step 5: Let the final layer output weight matrix W^L = (H^L)^+ T.

And the network output is

O = G(X) = f( ··· f( f( X W^0 ) W^1 ) ··· ) W^L.   (6)
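A minimal sketch of this dynamically grown scheme, as we read steps 1-5 above (the Tanh activation, error threshold, and maximum depth are illustrative assumptions of ours):

```python
import numpy as np

def epil_train(X, T, err=1e-4, max_layers=10):
    """Minimal ePIL sketch: dynamically grown multilayer PIL.

    X : (N, n+1) augmented input matrix, T : (N, m) target matrix.
    Returns the list of weight matrices [W^0, ..., W^L].
    """
    weights = []
    H = X                                    # Step 1: H^0 = X
    for _ in range(max_layers):
        H_pinv = np.linalg.pinv(H)
        # Step 2: stop when || H^l (H^l)^+ T - T ||^2 is below the threshold
        if np.linalg.norm(H @ H_pinv @ T - T) ** 2 < err:
            break
        W = H_pinv                           # Step 3: W^l = (H^l)^+
        weights.append(W)
        H = np.tanh(H @ W)                   # H^{l+1} = f(H^l W^l); Step 4 loops back
    weights.append(np.linalg.pinv(H) @ T)    # Step 5: W^L = (H^L)^+ T
    return weights

def epil_predict(X, weights):
    """Network output of Eq. (6): nonlinear hidden layers, linear last layer."""
    H = X
    for W in weights[:-1]:
        H = np.tanh(H @ W)
    return H @ weights[-1]
```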

Some new viewpoints on the PIL are as follows:

Remarks

  1. Eq. (6) shows a deep neural network architecture.

  2. The depth of this DNN grows dynamically and is data dependent.

  3. If the procedure stops with ‖ H^L (H^L)^+ T − T ‖² = 0, we get an identity orthogonal projector, H^L (H^L)^+ = I.

  4. When we let the target matrix be the input matrix itself (T = X), the PIL algorithm becomes an unsupervised learning algorithm; it realizes vector normalization in high dimensional space.

In the work of Guo et al. [6], another variant of the PIL algorithm is discussed. It is stated in the discussion section of Ref. [6]:

“But if we intend to reduce the network complexity, we can add a same-dimension Gaussian noise matrix to perturb the transformed matrix in step 4 of the PIL algorithm. The inverse function of the perturbed matrix will exist with probability one because the noise is an identical and independent distribution. In such a strategy, we can constrain the hidden layers to at most two to reach the perfect learning”.

Writing the above description in mathematical algorithm form:

Algorithm PIL1: Given a data set, we draw N pair samples as the training set, take an activation function f(·), set the hidden neuron number to be N, and set a same-dimension Gaussian noise perturbation matrix R.

  • Step 1: Set H^0 = X, l = 0, and compute (H^0)^+,

  • Step 2: Compute ‖ H^l (H^l)^+ T − T ‖². If it is less than the given error ε, go to step 5. If not, go on to the next step.

  • Step 3: Let W^l = (H^l)^+ + R. Feed forward the result to the next layer, and compute H^{l+1} = f(H^l W^l).

  • Step 4: Compute (H^{l+1})^+, set l ← l + 1, and go to step 2.

  • Step 5: Let the final layer output weight matrix W^L = (H^L)^+ T.

Remark

  1. Adding noise to the weight matrix is equivalent to adding noise to the input matrix, and training with noise is a kind of regularization [2].
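Relative to the ePIL sketch above, our reading of PIL1 amounts to perturbing the pseudoinverse weights of a hidden layer before feeding forward; a minimal sketch (the noise scale sigma and the exact placement are our assumptions, not values from the papers):

```python
import numpy as np

def pil1_layer_weights(H, sigma=1e-3, rng=None):
    """PIL1 twist on ePIL step 3: perturb (H^l)^+ with a same-dimension
    Gaussian noise matrix before feeding forward to the next layer.
    sigma is an assumed noise scale, not a value from the papers."""
    rng = np.random.default_rng(0) if rng is None else rng
    H_pinv = np.linalg.pinv(H)
    return H_pinv + sigma * rng.standard_normal(H_pinv.shape)
```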

III-A VEST Analysis

III-A1 Neural Network Architecture

For an MLP, the number of hidden layers and the number of hidden layer neurons belong to the architecture hyperparameters, and there is no theory to guide how to select them. For a SHLN, the only architecture hyperparameter is the number of hidden layer neurons, and choosing it is also a difficult problem for most beginners. In Ref. [4], Guo et al. detailed the reason for choosing the number of hidden neurons as follows:

“From the above definitions (B is an N × m matrix, Y is an N × p matrix [p being the number of hidden units]), we know that W is a p × m matrix. Based on linear algebra we know that if p < N, it is impossible that Y has a right inverse [6] (here [6] in Ref. [4] is “Ben Noble and James W. Daniel, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs, NJ, 1988”). This requires at least p = N if we hope to get the inverse. (Of course, we can choose p > N, but this will increase computation time.)” It is explicit that a suggestion on the selection of the number of hidden neurons is given in the PIL. In contrast, a specific value for the number of hidden neurons was not given in the ELM papers, and they did not give any effective suggestion for selecting this hyperparameter beyond saying “random nodes”. In fact, in Refs. [8][9], they simply follow these discussions that the number of hidden neurons should be set to N; no other specific value was suggested.

As for the activation function, the most commonly used functions in MLPs are the sigmoidal function and the hyperbolic tangent function. Here are some discussions on activation functions. In [6], it is stated: “From the learning procedure, it is obvious that no differentiable activate function is needed. We only require that the activate function can perform nonlinear transform to raise the rank of the weight matrix.” In [8], our “no differentiable activate function” statement is repeated without any explanation: “Unlike the traditional classic gradient-based learning algorithms which only work for differentiable activation functions, the ELM learning algorithm can be used to train SLFNs with non-differentiable activation functions”. From this point, it can hardly be doubted that the authors of the paper [8] had read our papers previously. In [9], a similar statement appears in the discussion section: “Unlike the traditional classic gradient-based learning algorithms which only work for differentiable activation functions, as easily observed the ELM learning algorithm could be used to train SLFNs with many nondifferentiable activation functions”.

However, still in [9], in order to prove Theorem 2.1, the authors required the activation function to be infinitely differentiable. It is also stated that “This paper rigorously proves that for any infinitely differentiable activation function SLFNs with N hidden nodes can learn N distinct samples exactly and SLFNs may require less than N hidden nodes if learning error is allowed”. The question is why the activation function, nondifferentiable in 2004 [8], becomes infinitely differentiable in 2006 [9]; in [9], the authors did not give any explanation.

III-A2 Weight Parameters

Weight parameters are an important set of parameters in a neural network. The initialization of the weight parameters directly determines the generalization performance and convergence rate of the network when the BP algorithm is used for learning. After a neural network architecture is designed, it is common to adopt the BP algorithm to find these weight parameters, and the traditional way to initialize them when applying the BP algorithm is to set their values randomly. In [4][5][6], the weight matrix W, which connects the hidden and output neurons, is computed with the pseudoinverse solution W = H^+ B. In [8][9], exactly the same method as in the PIL is described. As for the weight matrix V, which connects the input and hidden neurons, three methods have been investigated by Guo et al.:

  1. It can be set as the pseudoinverse of the input matrix, as presented in [4][5][6];

  2. It can be initialized randomly without further tuning, as stated in [4];

  3. It can be set as the pseudoinverse of the input matrix with additive Gaussian noise, as stated in [6].

From these facts, it is clear that the random weight generation in [8][9] is just one of the choices within the PIL.

III-B Other Variants

Wang and Wan pointed out in their Comments [13]: “The output weights can be adjusted in one of the following ways: 1) using pseudoinverse (also known as Moore–Penrose generalized inverse); 2) incrementally (at each iteration, a new random hidden neuron is added); or 3) online sequentially (as new data arrive in real-time applications)”. We note the following:

1) is exactly the same as that in the PIL papers.

2) is a simple extension of Greville’s theorem (adding a neuron) and the bordering algorithm (deleting a neuron), which are discussed in [5][6].

3) is also a simple extension of Greville’s theorem, which is discussed in [5][6].

Wang and Wan also pointed out, “In conclusion, feedforward networks (both RBF and MLP) with randomly fixed hidden neurons (RHN) have previously been proposed and discussed by other authors in papers and textbooks. These RHN networks have been shown, both theoretically and experimentally, to be fast and accurate. Hence, it is not necessary to introduce a new name ‘ELM’.” Here we notice that in [8][9] only the MLP feedforward network with a sigmoidal activation function is referred to, and they only discussed that with N hidden neurons the training error can reach zero, as discussed in the PIL papers. Hence our discussion here is restricted to only these two papers, without referring to Huang’s other papers (more discussions can be found at https://elmorigin.weebly.com) or to other PIL variants after the year 2004. Under this restriction, we can see that, from the learning scheme to the concrete methods, the authors of the papers [8][9] followed our PIL work and created a new name called ELM, with nothing new except leaving the hardest work, the selection of the number of hidden nodes, to users. From this discussion, it is easy to see that the ELM is a VEST of the PIL algorithm.

IV Some Statements

In this section, we will point out some incorrect statements and false claims in [8][9].

IV-A Data Interval

Theorem 2.1 in [9] claims that for an activation function which is infinitely differentiable in any interval, with N training samples and the input weights and hidden layer biases randomly chosen from any intervals, the hidden layer output matrix H is invertible. However, this theorem is incorrect. It is a common pitfall to draw values randomly from an arbitrary interval: for example, if we take Tanh(x) as the activation function and randomly choose the input weights and biases from arbitrary intervals, the hidden output matrix H is often not invertible, as investigated in [4]. The reason is that very big or very small values in the input weight matrix make the Tanh(x) function saturate, so that the corresponding elements of H assume the same values (H becomes rank deficient) and H is non-invertible in numerical simulations.

In fact, strictly speaking, we have:

Theorem 1. For any bounded activation function f(x), if there is no constraint on its input interval, the hidden layer output matrix H is not always invertible.

Proof.

Given a standard SHLN, define the interval [-a, a] with a → ∞. When we randomly choose the input weights from the interval (a, +∞) or from the interval (-∞, -a), the elements of the hidden layer output matrix H take the boundary values of the activation function as a → ∞. For example, the elements of H = sigmoid(WX) will be all 1 or all 0 for a finite-valued training data set X. The rank of the matrix H is then at most 1, which is not full when N > 1, and hence H is NOT invertible with probability one. ∎

When practically implementing an algorithm on a computer, this is a pitfall if no constraint is considered. The IEEE floating-point standard specifies positive and negative infinity values. The effective single precision floating point range is about ±3.4 × 10^38; numbers greater than the positive bound or smaller than the negative bound are regarded as infinity.
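A quick NumPy check of the saturation effect described in the proof above (the sizes and intervals here are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 5))          # 20 samples, 5 nonnegative inputs
W_small = rng.uniform(-1.0, 1.0, size=(5, 20))   # weights from a small interval
W_large = rng.uniform(1e5, 1e6, size=(5, 20))    # weights from a huge, unconstrained interval

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H_small = sigmoid(X @ W_small)
H_large = sigmoid(X @ W_large)                   # saturates: every entry becomes 1.0

print(np.linalg.matrix_rank(H_small))            # well above 1
print(np.linalg.matrix_rank(H_large))            # 1, i.e. rank deficient and not invertible
```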

Another example, given in [11], shows that even when we randomly choose the input weights and biases in [-1, 1], the rank of the hidden output matrix H may NOT be full. Let the target function be the one defined over [0, 1] in [11]; if this function is approximated with a SHLN, it will be found that Theorem 2.1 in [9] is incorrect.

IV-B Training Error

Theorem 2.2 in [9] claims that if any small positive value ε is given, there exists a number of hidden nodes Ñ ≤ N such that, for any randomly chosen input weight values, ‖HW − T‖ < ε. The proof was given only for the case Ñ = N, according to Theorem 2.1. Here we do not consider whether Theorem 2.1 is correct or not, but first discuss only the case Ñ = N. From linear algebra textbooks it is known that if H is a square full rank matrix, the inverse of H exists and exact learning can be reached, as we discussed in [4]. So for the case Ñ = N, there is no need to prove again that the training error can be made arbitrarily small, because it can be zero. Even if a proof were needed, one could simply cite a linear algebra textbook, as most researchers would do. The key issue with which we are concerned is the case Ñ < N, which means that the number of hidden neurons is smaller than the number of training samples. According to the theorems in linear algebra textbooks, for example Ref. [6] listed in [4], as stated in Section III-A1, it is impossible to obtain a right inverse of H. In other words, when we set an infinitely small positive value ε, the learning error cannot be made smaller than ε when the number of hidden neurons is less than N.

For a finite N, we know that the pseudoinverse solution W = H^+ T is the best approach for the output weight matrix W. If we substitute it into the SSE function, we find that the learning problem becomes minimizing ‖(HH^+ − I)T‖², where HH^+ is an orthogonal projection operator. If the projector HH^+ has a nontrivial null space, then for most training data sets the norm of those components of T which lie in that null space cannot be less than ε. Furthermore, the least squares solution of the overdetermined problem has been studied by many researchers in linear algebra, and no theory can guarantee that the error can be made arbitrarily small for most data sets. Therefore, Theorem 2.2 is incorrect as well.
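A small numerical illustration of this projection argument (the sizes and random data are arbitrary choices of ours): when the number of hidden columns is smaller than N, the least squares residual is generally bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_hidden, m = 50, 10, 3                       # fewer hidden nodes than samples
H = np.tanh(rng.standard_normal((N, n_hidden)))  # a generic N x n_hidden hidden output
T = rng.standard_normal((N, m))                  # generic target matrix

P = H @ np.linalg.pinv(H)                        # orthogonal projector onto range(H)
residual = np.linalg.norm((np.eye(N) - P) @ T, 'fro')
print(residual)                                  # stays far from zero for generic T
```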

For more theoretical analysis of the incorrectness of these theorems, please refer to [11].

IV-C SSE Function and Generalization

As we know, the SSE function is the most used cost function in neural network research. In mathematical expressions, the L2 norm is often adopted for the SSE function. When a finite-size training data set is given, there are mainly two categories of approaches to avoid underfitting and overfitting and hence to obtain good generalization: one is called model selection, and the other is regularization. In the introduction section of [9], it is stated, “Different from traditional learning algorithms the proposed learning algorithm not only tends to reach the smallest training error but also the smallest norm of weights. Therefore, the proposed learning algorithm tends to have good generalization performance for feedforward neural networks”. It is also claimed in the discussion section of [9], “(2) The proposed ELM has better generalization performance than the gradient-based learning such as back propagation in most cases”. We know that when minimizing the SSE function (Eq. 2) with respect to the weights W, the pseudoinverse solution is the best approach. This solution has properties such as the minimum training error, the smallest norm of weights, and uniqueness of the minimum norm least-squares solution. The ELM authors have misunderstood the meaning of the smallest norm of weights for the pseudoinverse solution. Here the smallest norm of weights is only in comparison with other least-squares solutions; it cannot guarantee that the network has good generalization performance if overfitting occurs. To avoid overfitting and obtain good generalization, one of the techniques is weight decay regularization, which adds a penalty term to the SSE function. This constraint term is usually the norm of the weights times a regularization constant, as most researchers, including Guo et al. [7], have studied in the literature. In the case where the SSE function is adopted, other techniques to reach good generalization performance of the neural network include early stopping and stacked generalization, as discussed in [6]. But the study of different cost functions or regularization techniques to reach good generalization is beyond the scope of this letter. Here we simply wish to point out that if we only minimize the SSE function without any other constraints, it is impossible to get good generalization unless the network’s architecture is well designed for the given data set.
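As a small illustration of the weight decay idea mentioned above (a generic ridge-regularized alternative to W = H^+ B, not the specific estimator of [7]; lam is an assumed regularization constant):

```python
import numpy as np

def ridge_output_weights(H, B, lam=1e-3):
    """Weight-decay (L2) regularized alternative to W = H^+ B.

    Minimizes ||H W - B||^2 + lam * ||W||^2, shrinking the weights instead of
    relying only on the minimum-norm property of the pseudoinverse solution.
    """
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ B)
```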

IV-D Hidden Nodes

In the abstract of Ref. [9], there is the following statement: “This paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses hidden nodes and analytically determines the output weights of SLFNs”. We think that here the phrase “randomly chooses hidden nodes” means randomly choosing the number of hidden nodes and, with this randomly chosen value, constructing a SLFN. This selection method for the number of hidden nodes is indeed simple, but it does NOT work at all in practice. As we know, in order to obtain good generalization performance for a SLFN on a given data set, one of the techniques is model selection, that is, selecting an optimal network architecture: a structure neither too simple, which underfits, nor too complex, which overfits. For a SHLN, there is only one hyperparameter of the network structure, namely the number of hidden nodes, once the numbers of input and output neurons are assigned. In the past decades, many research papers have addressed this model selection problem, and we know that this hyperparameter depends on many factors [3]. It is a pitfall to try to realize good generalization through randomly choosing the number of hidden nodes. Let us do a thought experiment to illustrate that the random nodes method is useless. Suppose that a given data set has N training samples and that there are N users applying this method. Every user randomly chooses a number of hidden nodes in the range [1, N], which leads to many different network structures being generated. Those users with too few hidden nodes will suffer from underfitting, and those with too many hidden nodes may suffer from overfitting. Both underfitting and overfitting yield poor generalization. Only a small number of lucky users will get good generalization performance in their experiments. This thought experiment shows that the number of hidden nodes should be chosen in a sophisticated way, instead of merely randomly. Furthermore, in the experiments of [9], it can be found that the cross-validation method was adopted to choose the optimal number of hidden nodes. This confirms not only that their statements contradict themselves, but also that the learning speed of ELM is NOT as fast as claimed once the time for selecting the optimal number of hidden nodes is counted.

V Summary

In order to stress the originality of the so-called ELM, the ELM authors made the following statement in [9]: “It should be noted that the input weights (linking the input layer to the first hidden layer) and hidden layer biases need to be adjusted in all these previous theoretical research works as well as in almost all practical learning algorithms of feedforward neural networks.” (Incidentally, the very concept of regarding these as hidden layer biases is wrong; in fact they are input layer biases.) However, the fact is that in the PIL, all three methods of setting the input weights have shown that the weight parameters do not need to be adjusted further. Also, the number of hidden neurons (including the bias neuron) is set to N and does not need to be adjusted either. It is thus clearly shown that the ELM authors’ statement is a false claim. Furthermore, during the 2005 International Conference on Intelligent Computing (ICIC 2005), the first author of the PIL papers introduced the PIL work to the first author of the ELM papers. The ELM authors already knew of the PIL work, yet they still wrote such a statement in their paper. They not only excluded the PIL papers from the reference list of the article [9], but also denied the originality of the PIL.

From the discussions in this manuscript, we can see that the ELM is the same as PIL0 in its learning scheme; the ELM not only has nothing new in the learning scheme, but is also riddled with incorrect statements and false claims in theory. To avoid misleading junior researchers around the world, our suggestions to the ELM authors are as follows:

  1. Acknowledge the originality of the PIL fast learning scheme and clarify that the ELM is simply a VEST of the PIL algorithm for SHLN.

  2. Remedy the incorrect statements and false claims in the ELM papers.

Acknowledgment

We greatly appreciate those researchers who provided many useful suggestions and comments on this letter. Without their help, we would still not know that the PIL has been renamed as ELM. We especially thank Prof. Philip C. L. Chen, our co-author of the PIL, who gave a lot of suggestions on this work, including on some key issues.

References

  • [1] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
  • [2] Bishop, C.M.: Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1), 108–116 (1995)
  • [3] comp.ai.neural-nets FAQ, part 3: “How many hidden units should I use?”. ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu (1997–2002)
  • [4] Guo, P., Chen, P., Sun, Y.: An exact supervised learning for a three-layer supervised neural network. In: Proceedings of the International Conference on Neural Information Processing (ICONIP’95), pp. 1041–1044. Beijing, China (1995)
  • [5] Guo, P., Lyu, M.: Pseudoinverse learning algorithm for feedforward neural networks. In: N.E. Mastorakis (ed.) Advances in Neural Networks and Applications, pp. 321–326. Puerto De La Cruz, Tenerife, Canary Islands, Spain (2001)
  • [6] Guo, P., Lyu, M.: A pseudoinverse learning algorithm for feedforward neural networks with stacked generalization applications to software reliability growth data. Neurocomputing 56(1), 101–121 (2004). (Online in 2003)
  • [7] Guo, P., Lyu, M., Chen, P.: Regularization parameter estimation for feedforward neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B 33(1), 35–44 (2003)
  • [8] Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: A new learning scheme of feedforward neural networks. In: Proc. of IEEE Int. Joint Conf. on Neural Networks, vol. 2, pp. 985–990 (2004)
  • [9] Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: Theory and applications. Neurocomputing 70, 489–501 (2006)
  • [10] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • [11] Li, M., Wang, D.: Insights into randomized algorithms for neural networks: Practical issues and common pitfalls. Information Sciences pp. 170–178 (2017)
  • [12] Rumelhart, D., McClelland, J.: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, pp. 318–362. MIT Press (1986)
  • [13] Wang, L.P., Wan, C.R.: Comments on ‘the extreme learning machine’. IEEE Transactions on Neural Networks 19(8), 1494–1495 (2008)
  • [14] Wessels, L., Barnard, E.: Avoiding false local minima by proper initialization of connections. IEEE Transactions on Neural Networks pp. 899–905 (1992)