1 Introduction
In recent years, a fast algorithm called Extreme Learning Machine (ELM) was proposed by Huang et al. (2004, 2006b). It is used to train Single Layer Feedforward Networks (SLFN), as shown in Fig. 1, where part of the network parameters (the input weights and hidden biases) are randomly generated, and the remaining ones (the output weights) are found using labeled data and a closed-form solution.
Due to its simplicity and speed, ELM gained popularity and has been applied to a wide range of tasks, such as computer vision and time series analysis
(Huang et al., 2015), achieving good performance, often better than classifiers such as Support Vector Machines (SVMs)
(Huang et al., 2015). Variants of ELM were also proposed, making it possible to deal with sequential data (where new samples arrive over time) using Online Sequential ELM (OSELM) (Liang et al., 2006), or to increase the SLFN hidden node number using Incremental ELM (IELM) (Huang et al., 2006a).
In ELM and OSELM, the number of nodes in the hidden layer needs to be well chosen to avoid overfitting and underfitting (Deng et al., 2009). To overcome this problem, an ELM-based algorithm using ridge regression was proposed by Deng et al. (2009). Although it achieves good results, the resulting network is dense and might suffer from memory and computing limitations. Martínez-Martínez et al. (2011) extended the work of Deng et al. (2009), proposing an algorithm named Regularized ELM (RELM), which can select an architecture based on the problem. By using the Elastic Net theory, RELM can prune some hidden nodes when dealing with a one-dimensional output.
The aforementioned methods were defined considering only one output node, with optimization problems minimizing vector norms, but they can be adapted to multidimensional tasks by considering each output separately. To deal with these tasks, the Generalized Regularized ELM (GRELM) was proposed by Inaba et al. (2018), which generalizes RELM by using matrix norms in its objective function, replacing the ℓ2 and ℓ1 vector norms by the Frobenius and ℓ2,1 matrix norms, respectively.
One common characteristic of these methods is the use of the ℓ2 or Frobenius norm of the prediction error in each objective function. As pointed out by Zhang & Luo (2015), these approaches have some drawbacks when the application suffers from outliers, which is common in real-world tasks, since the model can be biased towards them.
To deal with outliers, the Outlier Robust ELM (ORELM) was proposed by Zhang & Luo (2015), which considers the ℓ1 norm of the prediction error in its objective function, achieving better results than other algorithms in the presence of outliers in regression tasks. However, ORELM was also defined only for one-dimensional outputs.
In this paper, we generalize ORELM to multi-target regression problems. We use the same model considered in GRELM, replacing the Frobenius norm of the prediction error by its ℓ2,1 norm, which can be interpreted as an extension of the ℓ1 vector norm. When considering outputs with only one dimension and using only the ridge penalty, our method reduces to ORELM. We use a three-block Alternating Direction Method of Multipliers (ADMM) (Chen et al., 2016), a simple but powerful algorithm, to solve our optimization problem. We also propose an incremental version of this method, which can increase the number of nodes efficiently if the desired error is not achieved.
Our methods were tested on 15 public real-world datasets, which were contaminated with outliers to verify their robustness, using the average relative root mean square error (aRRMSE) as the metric. We compared our methods with similar ELM-based algorithms.
2 Related Work
In this section, we review the Extreme Learning Machine and some of its variants, in particular regularized and incremental ones.
2.1 ELM
An extremely fast algorithm to train an SLFN with Ñ hidden nodes was proposed in Huang et al. (2004). This algorithm was called ELM and considers a dataset with N distinct labeled samples (x_j, t_j), where x_j ∈ ℝⁿ and t_j ∈ ℝ. The SLFN estimation of t_j is modeled as

(1) o_j = Σ_{i=1}^{Ñ} βᵢ g(aᵢ · x_j + bᵢ), j = 1, …, N

where aᵢ ∈ ℝⁿ is the weight vector connecting the i-th hidden node and the input nodes, β = [β₁, …, β_Ñ]ᵀ is the weight vector connecting the hidden nodes and the output node, bᵢ is the bias of the i-th hidden node, and g(·) is the activation function of the SLFN.
Assuming that an SLFN with Ñ nodes can approximate all N samples with zero error, i.e., Σ_{j=1}^{N} |o_j − t_j| = 0, we can rewrite Eq. 1 as:

(2) Hβ = t

where H is the hidden layer output matrix

(3) H = [ g(a₁ · x₁ + b₁) ⋯ g(a_Ñ · x₁ + b_Ñ) ; ⋮ ⋱ ⋮ ; g(a₁ · x_N + b₁) ⋯ g(a_Ñ · x_N + b_Ñ) ]

where H ∈ ℝ^{N×Ñ}, β = [β₁, …, β_Ñ]ᵀ, and t = [t₁, …, t_N]ᵀ.
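The three ELM steps (random hidden parameters, computation of H, closed-form solution for β) can be sketched in a few lines of numpy; the data, sizes, and variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: N samples with n input features (sizes illustrative).
N, n, n_hidden = 200, 3, 25
X = rng.uniform(-1.0, 1.0, size=(N, n))
t = np.sin(X.sum(axis=1))

# Step 1: randomly generate the input weights a_i and biases b_i.
A = rng.uniform(-1.0, 1.0, size=(n, n_hidden))
b = rng.uniform(-1.0, 1.0, size=n_hidden)

# Step 2: compute the hidden layer output matrix H (Eq. 3),
# here with a sigmoid activation function g.
H = 1.0 / (1.0 + np.exp(-(X @ A + b)))

# Step 3: solve H beta = t in the least-squares sense using the
# Moore-Penrose pseudoinverse of H.
beta = np.linalg.pinv(H) @ t

train_rmse = np.linalg.norm(H @ beta - t) / np.sqrt(N)
```

Since only the output weights are fitted, training reduces to a single linear least-squares solve, which is what makes ELM fast.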
2.2 Regularized ELM
Although ELM has shown good results in several applications, the right choice of the number of hidden nodes must be made in order to obtain a good performance, avoiding overfitting and underfitting. Bartlett (1998) showed that models whose parameters have smaller norms are capable of achieving better generalization. On this note, Deng et al. (2009) introduced RELM for SLFNs with sigmoid additive nodes, where an ℓ2 norm of β is added to the ELM optimization problem (Eq. 4). Huang et al. (2012) extended this work to different types of hidden nodes and activation functions, as well as kernels.
Other types of regularization in the ELM optimization problem were considered by Martínez-Martínez et al. (2011). Then, RELM can be described in a generalized way by

(6) min_β (1/2)‖t − Hβ‖₂² + λ( α‖β‖₁ + ((1 − α)/2)‖β‖₂² )

where λ ≥ 0 and α ∈ [0, 1] are regularization parameters.

The optimization problem of Eq. 6 uses the Elastic Net penalty, which is a trade-off between ridge regression (α = 0), where only the ℓ2 norm is considered, and the lasso penalty (α = 1), where only the ℓ1 norm is considered. Since it uses the ℓ1 norm, the Elastic Net method is also capable of reducing the network size, pruning output weights while maintaining generalization.
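For the ridge special case (α = 0), the solution has the closed form β = (HᵀH + λI)⁻¹Hᵀt; a minimal numpy sketch (helper name hypothetical):

```python
import numpy as np

def relm_ridge(H, t, lam):
    """Ridge special case (alpha = 0) of Eq. 6: minimizes
    0.5*||t - H beta||_2^2 + 0.5*lam*||beta||_2^2, whose closed-form
    solution is beta = (H^T H + lam*I)^{-1} H^T t."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ t)

rng = np.random.default_rng(1)
H = rng.standard_normal((100, 30))
t = rng.standard_normal(100)

beta = relm_ridge(H, t, lam=1.0)
beta_big = relm_ridge(H, t, lam=100.0)  # stronger shrinkage of the weights
```

Increasing the regularization parameter shrinks the norm of the output weights, which is the mechanism behind the improved generalization mentioned above. The ℓ1 part of the Elastic Net (α > 0) has no closed form and requires an iterative solver.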
2.3 Outlier Robust ELM
Recently, the ℓ1 norm has been applied to improve the performance of methods in the presence of outliers (Zhang & Luo, 2015). This can be achieved in ELM by using it in the loss function, i.e.,

(8) ‖t − Hβ‖₁

instead of the usual ℓ2 norm.
Supported by this observation, Zhang & Luo (2015) proposed the ORELM algorithm, which finds the solution of the following optimization problem:

(9) min_β ‖t − Hβ‖₁ + (1/C)‖β‖₂²

where C > 0 is a regularization parameter.
This optimization problem can be solved by using the Augmented Lagrange Multiplier (ALM) method, as suggested by Zhang & Luo (2015).
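A minimal sketch of such an ℓ1-loss solver, using a scaled-form ADMM on the split e = t − Hβ (this is an illustration of the idea, not the exact ALM implementation of Zhang & Luo; all names and constants are illustrative):

```python
import numpy as np

def soft(x, kappa):
    # elementwise soft-thresholding, the proximal operator of the l1 norm
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def l1_elm(H, t, C=1e3, rho=1.0, iters=100):
    """ADMM for min_beta ||t - H beta||_1 + (1/C)*||beta||_2^2,
    splitting the residual as e = t - H beta (scaled dual u)."""
    N, n_hidden = H.shape
    e = np.zeros(N)
    u = np.zeros(N)
    # the beta-update solves a ridge system; its factor is precomputed
    G = np.linalg.inv(H.T @ H + (2.0 / (C * rho)) * np.eye(n_hidden)) @ H.T
    for _ in range(iters):
        beta = G @ (t - e + u)
        r = t - H @ beta
        e = soft(r + u, 1.0 / rho)   # e-update: prox of the l1 loss
        u = u + r - e                # dual update
    return beta

# Toy data with 10% gross outliers in the targets.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 5))
beta_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
t = H @ beta_true
t[:20] += 50.0

beta_l1 = l1_elm(H, t)
beta_l2 = np.linalg.pinv(H) @ t   # ordinary least squares, for comparison
```

On this toy problem, the ℓ1 loss absorbs the gross outliers into the residual variable e, while the least-squares solution is biased towards them.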
2.4 Generalized Regularized ELM
A limitation of ELM, RELM and ORELM is that they are defined only for one-dimensional outputs. Although we can consider each output separately in multidimensional tasks, such solutions are not capable of capturing relations between the multiple outputs. This also means that a hidden node can be pruned if and only if its weights are pruned in all outputs simultaneously, which can be difficult.
To capture those relations, matrix norms can be used in the ELM optimization problem. Considering this, the GRELM method was proposed by Inaba et al. (2018). This method considers an optimization problem similar to Eq. 6, using matrices and replacing the ℓ2 and ℓ1 norms by the Frobenius and ℓ2,1 norms, respectively.
Considering a dataset with multidimensional outputs, with N distinct samples (x_j, t_j), where x_j ∈ ℝⁿ and t_j ∈ ℝᵐ, we can extend Eq. 2:

(10) HB = T

where T = [t₁, …, t_N]ᵀ ∈ ℝ^{N×m} and B is the weight matrix connecting the SLFN hidden and output layers, i.e., B ∈ ℝ^{Ñ×m}.
(11) min_{B,W} (1/2)‖T − HB‖_F² + λα‖W‖_{2,1} + (λ(1 − α)/2)‖B‖_F²
subject to B − W = 0

where λ ≥ 0 and α ∈ [0, 1] are regularization parameters, ‖A‖_F is the Frobenius norm of a matrix A with p rows and q columns, where aᵢ is the i-th row of A, defined as

(12) ‖A‖_F = sqrt( tr(AᵀA) ) = sqrt( Σ_{i=1}^{p} ‖aᵢ‖₂² )

where tr(AᵀA) is the trace of AᵀA, and ‖A‖_{2,1} is the ℓ2,1 norm of A, defined as

(13) ‖A‖_{2,1} = Σ_{i=1}^{p} ‖aᵢ‖₂
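As a quick check of these two definitions, both norms can be computed directly (helper names illustrative):

```python
import numpy as np

def frobenius_norm(A):
    # Eq. 12: sqrt(trace(A^T A)) = sqrt(sum of squared entries)
    return np.sqrt(np.trace(A.T @ A))

def l21_norm(A):
    # Eq. 13: sum of the l2 norms of the rows of A
    return np.linalg.norm(A, axis=1).sum()

# Row norms are 5, 0, and 13; the zero row illustrates why penalizing
# the l2,1 norm promotes row sparsity (i.e., pruned hidden nodes).
A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
```

Here the ℓ2,1 norm is 5 + 0 + 13 = 18, while the Frobenius norm is sqrt(194): the ℓ2,1 norm acts like an ℓ1 norm over whole rows, which is what couples the outputs of a hidden node together.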
2.5 Incremental ELM
Due to its simplicity and low computational cost, the ELM algorithm became very popular after it was proposed. However, it has two main challenges (Feng et al., 2009): reducing its complexity to deal with large datasets and models with many nodes; and choosing the optimal hidden node number (usually a trial-and-error method is used).
To deal with those problems, Huang et al. (2006b) proposed an algorithm named IELM to train an SLFN adding one node at a time, reducing the memory cost of training (since in each iteration it only works with one node). Considering a dataset with one-dimensional targets, the algorithm starts with zero nodes and considers that the residual error is initially equal to the targets: e = t.

When the Ñ-th node is added, its input weights a_Ñ and bias b_Ñ are randomly generated, and then its output for every training sample is calculated as h_Ñ = [g(a_Ñ · x₁ + b_Ñ), …, g(a_Ñ · x_N + b_Ñ)]ᵀ. Finally, the output weight of the new node is calculated using

(14) β_Ñ = (eᵀ h_Ñ) / (h_Ñᵀ h_Ñ)

and the residual error is updated using

(15) e ← e − β_Ñ h_Ñ
Following this algorithm, it is possible to achieve good approximations of functions, with good generalization performance (Huang et al., 2006b). Furthermore, by using an expected learning accuracy as a stopping criterion, it is possible to find a good number of hidden nodes without using the trial and error approach.
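Eqs. 14-15 can be sketched as a short numpy loop (data, sizes, and the RMSE stopping threshold below are illustrative):

```python
import numpy as np

def ielm(X, t, max_nodes=100, target_rmse=1e-3, rng=None):
    """Adds one random sigmoid node at a time (Eqs. 14-15), stopping
    when the residual RMSE is small enough or max_nodes is reached."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, n = X.shape
    e = t.astype(float).copy()       # residual starts equal to the targets
    nodes = []
    for _ in range(max_nodes):
        a = rng.uniform(-1.0, 1.0, n)
        b = rng.uniform(-1.0, 1.0)
        h = 1.0 / (1.0 + np.exp(-(X @ a + b)))  # new node output, all samples
        beta = (e @ h) / (h @ h)                # Eq. 14
        e = e - beta * h                        # Eq. 15
        nodes.append((a, b, beta))
        if np.sqrt(e @ e / N) < target_rmse:
            break
    return nodes, e

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(150, 2))
t = np.sin(X[:, 0]) + X[:, 1] ** 2
nodes, residual = ielm(X, t)
```

Since each β_Ñ is the least-squares coefficient of the residual on the new node's output, the residual norm never increases, which is what makes the stopping criterion on the residual meaningful.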
One disadvantage of the IELM algorithm is that only one hidden node can be added at a time. An incremental method capable of adding a group of hidden nodes at the same time to an SLFN was proposed by Feng et al. (2009). This method is called Error Minimization ELM (EMELM) and takes advantage of the Schur complement (Petersen & Pedersen, 2007) to update the generalized inverse of the hidden layer output matrix H_k, which is updated in each k-th iteration.
Assuming we already have an SLFN, with one output and n_k hidden nodes, trained using the ELM algorithm, we used the matrix H_k to calculate the output weights β_k. When δn_k new nodes are added to the network, its new hidden output matrix can be written as:

(16) H_{k+1} = [H_k, δH_k]

where

(17) δH_k ∈ ℝ^{N×δn_k} contains the outputs of the new hidden nodes for all N training samples.
According to Feng et al. (2009), it is possible to update the output weights β_{k+1} as:

(18) β_{k+1} = [ U_{k+1} ; D_{k+1} ] t

where D_{k+1} and U_{k+1} are given by

(19) D_{k+1} = ( (I − H_k H_k†) δH_k )†

where I is an N × N identity matrix, and

(20) U_{k+1} = H_k† (I − δH_k D_{k+1})

respectively.
Note that if H_k† is stored, we can save some computations when adding the new nodes in the (k+1)-th iteration, reducing the algorithm's computational time.
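The update of Eqs. 18-20 can be verified numerically against retraining from scratch (random toy matrices, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n0, dn = 120, 10, 5
H0 = rng.standard_normal((N, n0))   # hidden output matrix of the current SLFN
dH = rng.standard_normal((N, dn))   # outputs of the newly added nodes
t = rng.standard_normal(N)

H0_pinv = np.linalg.pinv(H0)        # assumed stored from the previous step

# Block update of the output weights (Eqs. 18-20):
I = np.eye(N)
D = np.linalg.pinv((I - H0 @ H0_pinv) @ dH)   # Eq. 19
U = H0_pinv @ (I - dH @ D)                    # Eq. 20
beta_new = np.vstack([U, D]) @ t              # Eq. 18

# Reference: retraining from scratch on the enlarged matrix [H0, dH].
beta_full = np.linalg.pinv(np.hstack([H0, dH])) @ t
```

When the new columns (after projection away from the column space of H0) have full rank, the block update reproduces the full pseudoinverse solution exactly, while only requiring a pseudoinverse of a much smaller matrix.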
2.6 Incremental Regularized ELM
The IELM and EMELM algorithms are capable of increasing the hidden node number of an SLFN trained using the original ELM algorithm. However, these methods have problems in applications where the initial hidden layer output matrix is rank-deficient, and the accuracy of their computations may be compromised (Xu et al., 2016). Also, they inherit the problems of ELM and sometimes cannot achieve the expected testing accuracy.
The overfitting problem of ELM can be solved by using the structural risk minimization principle, as in the RELM method (Deng et al., 2009; Martínez-Martínez et al., 2011; Huang et al., 2012). However, the hidden node number of an SLFN trained with the original RELM method is a hyperparameter and cannot be increased efficiently.
To solve this problem, the Incremental Regularized ELM (IRELM) was proposed (Xu et al., 2016), an algorithm capable of increasing the hidden node number of an SLFN trained using the RELM algorithm (where the considered structural risk is the ℓ2 norm, i.e., Eq. 6 with α = 0). IRELM considers that, by adding one node, the new hidden output matrix is given by H_{k+1} = [H_k, h_{k+1}], where h_{k+1} is the vector of outputs of the new node for all training samples, and the new output weights of the SLFN can be calculated as

(21) β_{k+1} = (H_{k+1}ᵀ H_{k+1} + λI)⁻¹ H_{k+1}ᵀ t
We can rewrite A_{k+1} = H_{k+1}ᵀ H_{k+1} + λI and H_{k+1}ᵀ t in block form as

(23) A_{k+1} = [ A_k  H_kᵀ h_{k+1} ; h_{k+1}ᵀ H_k  h_{k+1}ᵀ h_{k+1} + λ ]

and

(24) H_{k+1}ᵀ t = [ H_kᵀ t ; h_{k+1}ᵀ t ]

respectively.
By using Eq. 22, we can find the new network output weights (β_{k+1}) efficiently, using known information, which is usually faster than training a new and larger SLFN. According to Zhang & Luo (2015), methods that consider the ℓ2 norm of the error can suffer in the presence of outliers, which is the case of IRELM.
3 Proposed Method
In this section, we present our proposed methods. We first generalize ORELM to multi-target regression problems, presenting its ADMM update rules. This method is named Generalized Outlier Robust ELM (GORELM). We also present an incremental version of this generalization.
3.1 GORELM
Multi-target regression (MTR) refers to prediction problems with multiple real-valued outputs. Such problems occur in different fields, including ecological modeling, economy, energy, data mining, and medical imaging (Ghosn & Bengio, 1997; Džeroski et al., 2000; Kocev et al., 2009; Wang et al., 2015; Zhen et al., 2016; Spyromitros-Xioufis et al., 2016).
A natural way to treat MTR problems is to obtain a model for each output separately. However, one of the main MTR challenges is modeling the relationships between outputs, besides the nonlinear input-output mapping (Zhang & Yeung, 2012; Zhen et al., 2017).
In the age of big data, MTR problems are becoming more and more common in the massive amount of accumulated data. However, other problems arise with this immense volume of data. Outliers are becoming more frequent due to different causes, such as instrumentation or human error (Zhang & Luo, 2015; Hodge & Austin, 2004). Although modeling the complex relationships between outputs and the input-output mapping in MTR problems is well studied, little has been done regarding outliers in MTR problems.
The use of the ℓ1 norm to achieve robustness to outliers in regression problems is a common practice in the literature (Zhang & Luo, 2015; Xu et al., 2013; Chen et al., 2017). The idea of GORELM is to extend ORELM to MTR problems. Nevertheless, instead of treating each output separately, possible output relationships are considered in GORELM through its regularization. Therefore, GORELM is especially suitable for problems where output outliers may occur in a structured way, i.e., where all outputs of an outlier are simultaneously affected.
Thus, our approach is an extension of ORELM where matrix norms are considered instead of vector ones. We replace the ℓ1 and ℓ2 norms by the ℓ2,1 and Frobenius norms, respectively. We can also view our approach as an extension of GRELM (Eq. 11), replacing the Frobenius norm of the error by its ℓ2,1 norm. The following optimization problem is proposed

(25) min_B ‖T − HB‖_{2,1} + λα‖B‖_{2,1} + (λ(1 − α)/2)‖B‖_F²
The optimization problem of GORELM (Eq. 25) is equivalent to the following constrained problem

(26) min_{B,W,E} ‖E‖_{2,1} + λα‖W‖_{2,1} + (λ(1 − α)/2)‖B‖_F²
subject to T − HB = E
           B = W
where the objective function is separable on B, W, and E. Similar to GRELM, we use ADMM to solve the optimization problem. On each iteration of ADMM, the algorithm performs an alternated minimization of the augmented Lagrangian over B, W, and E.
Note that we can rewrite Eq. 26 as

(27) min_{B,W,E} (λ(1 − α)/2)‖B‖_F² + λα‖W‖_{2,1} + ‖E‖_{2,1}
subject to A₁B + A₂W + A₃E = C

where A₁ = [Hᵀ, I]ᵀ, A₂ = [0, −I]ᵀ, A₃ = [I, 0]ᵀ, and C = [Tᵀ, 0]ᵀ. According to Chen et al. (2016), the optimization problem established in Eq. 27 can be solved using the 3-block ADMM algorithm.
The augmented Lagrangian of Eq. 27 is

(28) L_ρ(B, W, E, Λ) = (λ(1 − α)/2)‖B‖_F² + λα‖W‖_{2,1} + ‖E‖_{2,1} + tr( Λᵀ(A₁B + A₂W + A₃E − C) ) + (ρ/2)‖A₁B + A₂W + A₃E − C‖_F²

where Λ is the Lagrange multiplier and ρ > 0 is a regularization parameter.
Using the scaled dual variable U = (1/ρ)Λ, we can rewrite the augmented Lagrangian as

(29) L_ρ(B, W, E, U) = (λ(1 − α)/2)‖B‖_F² + λα‖W‖_{2,1} + ‖E‖_{2,1} + (ρ/2)‖A₁B + A₂W + A₃E − C + U‖_F² − (ρ/2)‖U‖_F²
Since the Frobenius norm is separable, and by the equivalence of Eq. 26 and Eq. 27, the augmented Lagrangian (Eq. 29) of GORELM can be written as

(30) L_ρ(B, W, E, U) = (λ(1 − α)/2)‖B‖_F² + λα‖W‖_{2,1} + ‖E‖_{2,1} + (ρ/2)‖T − HB − E + U₁‖_F² + (ρ/2)‖B − W + U₂‖_F² − (ρ/2)( ‖U₁‖_F² + ‖U₂‖_F² )

where U = [U₁ᵀ, U₂ᵀ]ᵀ, with U₁ ∈ ℝ^{N×m} and U₂ ∈ ℝ^{Ñ×m}.
At iteration k, ADMM consists of alternated updates of the three blocks B, W, and E, followed by the dual update.

For W, we have the following subproblem

(34) W^{k+1} = argmin_W λα‖W‖_{2,1} + (ρ/2)‖B^{k+1} − W + U₂^k‖_F²

The optimization problem of Eq. 34 is identical to the one established in the update rule of the Inaba et al. (2018) method, shown in Algorithm 1. Therefore, the solution of Eq. 34 is given by
(35) W^{k+1} = Shrink_{λα/ρ}( B^{k+1} + U₂^k )

where Shrink_κ(·) is an operator that applies the block soft-thresholding operator (Boyd, 2010) in each row of its argument, which is defined as

(36) Shrink_κ(a) = max( 0, 1 − κ/‖a‖₂ ) a

with a ∈ ℝᵐ and κ > 0. Note that Shrink_κ shrinks its argument to 0 if ‖a‖₂ ≤ κ and moves it by κ units towards the origin otherwise.
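The row-wise operator of Eq. 36 can be sketched as (function name illustrative):

```python
import numpy as np

def shrink_rows(A, kappa):
    """Applies Eq. 36 to each row a_i of A: rows with ||a_i||_2 <= kappa
    are set to zero; the others are moved kappa units towards the origin."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - kappa / np.maximum(norms, np.finfo(float).tiny))
    return scale * A

A = np.array([[3.0, 4.0],    # row norm 5.0 -> scaled by 1 - 1/5
              [0.3, 0.4]])   # row norm 0.5 <= kappa -> set to zero
S = shrink_rows(A, 1.0)
```

Zeroing whole rows is what prunes entire hidden nodes at once, rather than individual output weights.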

For E, we have the following subproblem

(37) E^{k+1} = argmin_E ‖E‖_{2,1} + (ρ/2)‖T − HB^{k+1} − E + U₁^k‖_F²

which has the same structure as Eq. 34. Thus, we have the following solution

(38) E^{k+1} = Shrink_{1/ρ}( T − HB^{k+1} + U₁^k )
Finally, the Lagrange multiplier is updated by

(39) U₁^{k+1} = U₁^k + T − HB^{k+1} − E^{k+1}

(40) U₂^{k+1} = U₂^k + B^{k+1} − W^{k+1}
Algorithm 2 summarizes the GORELM method.
3.2 IGORELM
In some applications, an initial SLFN may achieve an undesired error when trained using any ELM-based algorithm. This can mean that there is still room for improvement in the model generalization. As discussed in Section 2.5, one of the approaches that can improve the performance of an SLFN is to increase its number of hidden nodes.
To train this larger model, we can simply apply the same algorithm to a new, untrained SLFN with more nodes. However, this procedure is not efficient if we already have a trained model, since we would discard all learned information. Thus, if the initial model performance is not sufficient, an approach that increases the number of hidden nodes while taking advantage of the previous knowledge is desirable.
In the literature, there are some ELM-based algorithms, such as IELM (Huang et al., 2006b), EMELM (Feng et al., 2009) and IRELM (Xu et al., 2016), that are capable of adding more nodes to an existing SLFN trained using an ELM algorithm, taking advantage of known information to find the new model in less time. However, these methods usually suffer in the presence of outliers.
In this paper, we also propose the Incremental Generalized Outlier Robust ELM (IGORELM). As the name suggests, it is an incremental version of GORELM, which can add a new group of hidden nodes to an existing SLFN at each iteration and is robust to outliers. This algorithm solves almost the same optimization problem as GORELM (Eq. 25), taking advantage of a known model to train a larger one.
Since it uses the Elastic Net theory, GORELM can prune hidden nodes. However, its optimization problem considers a trade-off between pruning nodes and the impact of this action on the model error. This implies that if no nodes were pruned, there is still room for improvement, and we can use IGORELM to increase the model performance.
IGORELM increases the network size until some stopping criterion is achieved. Every time a larger model is trained, previous knowledge is used as the starting point for the ADMM algorithm used to solve the GORELM method (Algorithm 2).
The starting point of each IGORELM iteration is obtained by increasing the dimensions of the matrices directly related to the number of nodes (e.g., the output weight matrix B), using new values (e.g., zero-valued matrices) for the new elements. Then, random weights and biases of the new nodes are generated. The dimension of H also increases, which implies that a larger matrix inversion must be done, which may increase the training time of each iteration.
When adding the k-th batch of nodes in an SLFN trained with IGORELM, we can update the inverse of the matrix A_{k+1} = H_{k+1}ᵀ H_{k+1} + cI (where c > 0 collects the regularization constants of the B update) efficiently using its Schur complement (Petersen & Pedersen, 2007), if A_k⁻¹ is stored:

(41) A_{k+1} = [ A_k  B_k ; B_kᵀ  D_k ]

(42) A_{k+1}⁻¹ = [ A_k⁻¹ + A_k⁻¹ B_k S_k⁻¹ B_kᵀ A_k⁻¹   −A_k⁻¹ B_k S_k⁻¹ ; −S_k⁻¹ B_kᵀ A_k⁻¹   S_k⁻¹ ]

where

(43) B_k = H_kᵀ δH_k

(44) D_k = δH_kᵀ δH_k + cI

(45) S_k = D_k − B_kᵀ A_k⁻¹ B_k

and δH_k is the hidden-layer output matrix of the newly added nodes.
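The Schur-complement update of a symmetric block matrix inverse can be sketched generically as follows; the matrices below are illustrative stand-ins for the blocks used by IGORELM, with an illustrative ridge constant lam:

```python
import numpy as np

def block_inverse(A_inv, B, D):
    """Inverse of the symmetric block matrix [[A, B], [B^T, D]] via the
    Schur complement S = D - B^T A^{-1} B, reusing a stored A^{-1}."""
    S_inv = np.linalg.inv(D - B.T @ A_inv @ B)
    TL = A_inv + A_inv @ B @ S_inv @ B.T @ A_inv
    TR = -A_inv @ B @ S_inv
    return np.block([[TL, TR], [TR.T, S_inv]])

rng = np.random.default_rng(3)
H_old = rng.standard_normal((50, 8))
H_new = rng.standard_normal((50, 3))  # columns of the newly added nodes
lam = 0.5                             # illustrative ridge constant

A_inv = np.linalg.inv(H_old.T @ H_old + lam * np.eye(8))  # stored inverse
B = H_old.T @ H_new
D = H_new.T @ H_new + lam * np.eye(3)

M_inv = block_inverse(A_inv, B, D)

# Reference: inverting the enlarged matrix from scratch.
H = np.hstack([H_old, H_new])
M_inv_direct = np.linalg.inv(H.T @ H + lam * np.eye(11))
```

Only the small Schur complement is inverted at each step, which is what makes the incremental update cheaper than inverting the enlarged matrix directly.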
Then, by adjusting the variables whose dimensions depend on the number of nodes and updating the stored inverse, the same algorithm of GORELM is used and its solution is found by using its update rules.
As the stopping criterion of the IGORELM algorithm, we can choose hyperparameters such as the maximum total number of hidden nodes, an expected value for the model performance (e.g., a metric threshold), or the ratio of pruned nodes. Algorithm 3 summarizes the IGORELM method.
4 Experiments
To evaluate the proposed GORELM and IGORELM methods, we selected 15 public real-world datasets (Table 1) for multi-target regression (MTR) from the Mulan Java library (Tsoumakas et al., 2010), excluding the ones with missing data. Each dataset is randomly divided into two parts: a training subset with 2/3 of the samples and a testing subset with the remaining samples.
Dataset  #Training Data  #Test Data  #Attributes  #Targets 

andro  33  16  30  6 
atp1d  225  112  411  6 
atp7d  197  99  411  6 
edm  103  51  16  2 
enb  512  256  8  2 
jura  239  120  15  3 
oes10  269  134  298  16 
oes97  223  111  263  16 
rf1  6003  3002  64  8 
rf2  5119  2560  576  8 
scm1d  6535  3268  280  16 
scm20d  5977  2989  61  16 
scpf  95  48  23  3 
slump  69  34  7  3 
wq  707  353  16  4 
We consider that a sample is an outlier based on boxplot analysis, as defined by Tukey (1977). Considering Q1 and Q3 as the first and third quartiles, and IQR = Q3 − Q1 as the interquartile range of the boxplot of an attribute, a sample is considered an outlier if it is located in one of the following two intervals: (−∞, Q1 − 1.5 IQR) or (Q3 + 1.5 IQR, +∞). To evaluate the outlier robustness of our methods, we contaminated the MTR datasets using the following procedure: a subset of training samples is randomly selected, whose size is defined according to the outlier ratio; then each attribute of the corresponding targets of this subset is replaced by random values from those intervals. We tested our algorithms using outlier ratios of 20% and 40%, along with the uncontaminated datasets. This procedure was applied only to the training subset.
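This contamination procedure can be sketched as follows; the distribution used to draw values beyond the whiskers is an assumption, since the excerpt does not specify it:

```python
import numpy as np

def contaminate(T_train, ratio, rng=None):
    """Replaces the targets of a random subset of samples (of size
    ratio * N) with values beyond the boxplot whiskers
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] of each target."""
    if rng is None:
        rng = np.random.default_rng(0)
    T = T_train.copy()
    N, m = T.shape
    idx = rng.choice(N, size=int(round(ratio * N)), replace=False)
    q1, q3 = np.percentile(T_train, [25, 75], axis=0)
    iqr = q3 - q1
    for j in range(m):
        lo = q1[j] - 1.5 * iqr[j]
        hi = q3[j] + 1.5 * iqr[j]
        side = rng.random(idx.size) < 0.5              # random side per sample
        mag = rng.uniform(0.5, 2.0, idx.size) * max(iqr[j], 1.0)
        T[idx, j] = np.where(side, lo - mag, hi + mag)
    return T, idx

rng = np.random.default_rng(0)
T = rng.standard_normal((100, 3))
T_out, idx = contaminate(T, 0.2, rng=rng)
```

Note that all targets of a selected sample are replaced, so the injected outliers are structured in the sense discussed in Section 3.1.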
The inputs and targets were normalized to a fixed interval using the respective minimum and maximum values of each attribute on the training dataset. The same normalization factors were then applied to the testing dataset.
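A sketch of this normalization, assuming the target interval is [0, 1] (the exact interval is not stated in this excerpt):

```python
import numpy as np

def minmax_fit(train):
    # per-attribute minima and maxima, computed on the training set only
    return train.min(axis=0), train.max(axis=0)

def minmax_apply(data, lo, hi):
    # maps the training range to [0, 1]; test values outside the training
    # range will fall outside [0, 1], which is the expected behavior when
    # reusing the training factors
    return (data - lo) / (hi - lo)

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [2.5, 15.0]])
lo, hi = minmax_fit(X_train)
X_train_n = minmax_apply(X_train, lo, hi)
X_test_n = minmax_apply(np.array([[10.0, 15.0]]), lo, hi)
```

Fitting the factors on the training subset only avoids leaking information from the test subset into the model.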
Our tests were conducted on an Ubuntu 16.04 Linux computer with an Intel® Core™ i5-2500 CPU and 8 GB of RAM, using MATLAB 9.2.
4.1 Regression with outliers
To compare GORELM with RELM, ORELM and GRELM in MTR tasks, we set for each method an initial number of hidden nodes equal to 1000 for all datasets. When considering GRELM and GORELM, we fixed some of the hyperparameters and obtained the remaining regularization parameters using cross-validation in the training subset. We tested 21 candidate values for one regularization parameter (the same values were used for its counterpart) and 5 values for the other. The same candidate values were considered for the corresponding parameters of RELM and ORELM.
In this paper, we consider experiments designed to reduce the impact of the initialization parameters (random weights), since our methods are non-deterministic. An experiment is defined as the following procedure: the weights between the input and hidden layers and the random biases are generated according to a uniform distribution; the training subset is randomly permuted; a sigmoidal activation function is used in the hidden layer; SLFNs are trained using the algorithms and the parameters obtained in cross-validation, and we obtain the average relative root mean square error (aRRMSE) metric (since aRRMSE measures error, a smaller value means better prediction) and the time for the training and testing subsets. Considering t^{(i)}, with i = 1, …, m, as the i-th of the m targets of the problem, aRRMSE is defined as the mean of each target's relative root mean square error (RRMSE):
(46) aRRMSE = (1/m) Σ_{i=1}^{m} sqrt( Σ_j ( t_j^{(i)} − ŷ_j^{(i)} )² / Σ_j ( t_j^{(i)} − t̄^{(i)} )² )

where ŷ_j^{(i)} is the output of the SLFN with respect to t_j^{(i)}, and t̄^{(i)} is the mean of the i-th target.
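The aRRMSE of Eq. 46 can be computed as follows; whether the target means t̄ are taken over the training or testing subset is an assumption here (the code defaults to the means of the evaluated targets):

```python
import numpy as np

def arrmse(Y_true, Y_pred, target_means=None):
    """Eq. 46: mean over the m targets of
    RRMSE_i = sqrt(sum_j (t_ij - yhat_ij)^2 / sum_j (t_ij - tbar_i)^2).
    target_means defaults to the column means of Y_true."""
    if target_means is None:
        target_means = Y_true.mean(axis=0)
    num = ((Y_true - Y_pred) ** 2).sum(axis=0)
    den = ((Y_true - target_means) ** 2).sum(axis=0)
    return float(np.mean(np.sqrt(num / den)))

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 4))
mean_pred = np.tile(Y.mean(axis=0), (50, 1))
```

A perfect predictor scores 0, while always predicting the target means scores 1, so values below 1 indicate a model better than the constant-mean baseline.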
The experiments were run 100 times, using 20 iterations as the stopping criterion for the ORELM, GRELM and GORELM algorithms. The average (and standard deviation) values of aRRMSE obtained in these experiments are shown in Tables 2, 3 and 4, for the uncontaminated datasets and for outlier ratios of 20% and 40%, respectively. The best results are highlighted in boldface. When considering the uncontaminated datasets, similar results were obtained with all training methods. In this case, as shown in Table 2, GORELM achieved better metrics than the other algorithms in some datasets. Since the main objective of GORELM is to make an SLFN robust to outliers, these results might imply that the datasets have some noise in their samples.
When 20% of each training subset is contaminated with outliers, RELM and GRELM achieve worse results than the robust methods. This is expected, since they use the ℓ2 (or Frobenius) norm, which is not suitable for dealing with outliers. The best results in this case were obtained by the robust techniques, with each method winning in approximately half of the datasets. When considering 40% of outliers, similar results were obtained, with GORELM achieving a better aRRMSE in 10 of the 15 datasets.
It can be noted that the aRRMSE values of GORELM on the test sets remained similar in almost all datasets in the presence of outliers, when compared to the values obtained by the method when trained on the uncontaminated datasets.
In the presence of outliers, ORELM and GORELM achieved better results, as expected. Since its model considers relations between different targets (it uses matrix norms), GORELM achieved better results than ORELM in most cases, showing that it can be a proper and robust technique for MTR tasks with outliers inserted according to the described methodology.
Table 5 shows the time, in seconds, spent to train the SLFNs with the respective algorithms. Our simulations show that the GORELM training stage was slightly slower than ORELM and GRELM in most cases, which was expected, since more optimization steps are needed in each iteration. Since all networks have similar architecture sizes, the time spent to test a set of data is almost the same.
Table 6 summarizes the parameters obtained in the cross-validation for the tested methods. It should be noted that, in almost all datasets, the selected value of α is zero, implying that the algorithms were not capable of pruning hidden nodes without impacting the model performance, maintaining the initial number of nodes, as shown in Table 7. This also suggests that better results may be obtained by increasing the number of neurons.
[Table 2: training and testing aRRMSE of RELM, ORELM, GRELM and GORELM for each dataset (andro, atp1d, …); the numeric entries are not recoverable from this excerpt.]