1 Introduction
Practical applications of machine learning can be problematic in the sense that developers and practitioners often do not fully trust their own predictions. A fundamental reason for this mistrust is that Mean Squared Error (MSE) and other error measures averaged over a dataset are commonly used to evaluate the performance of a method or to compare different methods. Averaged error measures are unfit for business processes where each particular sample is important, as it represents a customer or another existing entity [2]. On the other hand, applied Machine Learning models might skip some data samples because they are only a part of a bigger process structure, and uncertain data might be given to human experts to be handled [22].
The trust problem can be solved by computing a sample-specific confidence value [32]. Predictions with high confidence (and enough trust in them) are then used, while data samples with uncertain predictions are passed to the next analytical stage. The Machine Learning model works as a filter, solving "easy cases" automatically with confident predictions and reducing the amount of data remaining to be analyzed [3].
Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ be a dataset where outputs $y_i \in \mathbb{R}$ are independently drawn from a normal distribution conditioned on inputs $\mathbf{x}_i \in \mathbb{R}^d$:

$y_i \sim \mathcal{N}\big(f(\mathbf{x}_i),\ \sigma^2(\mathbf{x}_i)\big)$ (1)

This dataset has heteroscedastic noise because the variance $\sigma^2(\mathbf{x}_i)$ is not constant. A common homoscedasticity assumption simplifies formula (1) to $y_i \sim \mathcal{N}(f(\mathbf{x}_i), \sigma^2)$, but removes the ability to separate confident predictions from uncertain ones.
The heteroscedasticity of outputs is a reasonable assumption because applied Machine Learning problems often have stochastic outputs. Such outputs do not have a single correct value for a given input. The variance of random noise in the outputs may be assumed equal because the noise is independent of the inputs, but the same assumption cannot be made about the variance of the stochastic outputs, because they certainly depend on the inputs.
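To make this setting concrete, a heteroscedastic dataset in the sense of formula (1) can be generated as a minimal sketch; the specific functions `f` and `sigma` below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D example of formula (1): the conditional mean f(x)
# and the noise standard deviation sigma(x) both depend on the input.
def f(x):
    return np.sin(2 * np.pi * x)          # true projection function

def sigma(x):
    return 0.05 + 0.3 * x                 # input-dependent noise level

N = 1000
x = rng.uniform(0.0, 1.0, size=N)
y = f(x) + rng.normal(0.0, sigma(x))      # y_i ~ N(f(x_i), sigma^2(x_i))

# Under homoscedasticity, sigma(x) would collapse to one constant,
# and every sample would get the same prediction interval width.
```

Samples with a larger input get a wider noise band here, which is exactly the structure a per-sample confidence estimate has to capture.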
This work focuses on prediction intervals specifically for Extreme Learning Machines (ELM) [21, 25]. ELM is a fast nonlinear model with universal approximation ability [18]. It has a feedforward neural network structure, but with randomly fixed hidden layer weights, so only the linear output layer needs to be trained. With a large hidden layer and L2 regularization [41], ELM exhibits stable predictions [29] that are not affected by a particular initialization of the random hidden layer weights. It is an excellent Machine Learning tool for solving applied problems [4, 40], with a simple formulation, little to no hyperparameters, performance at the state-of-the-art level [17, 38, 47], and scalability to Big Data [1, 39].

The idea of the method is to use an ELM to predict an output $\hat{y}(\mathbf{x})$, and a second ELM to estimate its conditional variance $\sigma^2(\mathbf{x})$. Furthermore, a variance analysis is done on the predictions of the second ELM. It provides upper and lower boundaries for the predicted variance. These boundaries describe the model uncertainty for samples with little similar training data available, and make the methodology uniformly applicable to different problems.
The rest of the paper is organized as follows. The following section describes the state of the art in prediction interval estimation, and how the proposed solution differs from the rest. Section 2 describes Extreme Learning Machines and the proposed methodology. Section 3 analyses the method's performance on small artificial and real-world datasets. Section 4 presents the results on a huge real-world dataset, and describes the runtime requirements compared to the original ELM. Section 5 summarizes the findings.
1.1 State of the Art
Prediction with uncertainty is a well-known task. Probabilistic methods naturally formulate a solution. Prediction intervals are available in Bayesian formulations of ELM [12, 8], including per-sample PI [36], though the applicability is limited due to the quadratic computational cost in the number of data samples.
A fuzzy nonlinear regression approach [15] exists for problems having fuzzy inputs or outputs. It applies random-weight neural networks with non-iterative training similar to ELM, but formulates the solution in terms of fuzzy set theory [5]. Such a native fuzzy approach allows for a detailed investigation of the effects of uncertainty on the learning of a method [43, 44], and has important practical applications [6] for fuzzy data problems.
Without runtime limitations, good results are achieved with model-independent methods [33] based on clustering of input data and resampling. Clustering of inputs and repetitive model retraining during resampling both scale poorly with data size, and would limit the performance of an ELM otherwise capable of processing billions of data samples [1].
A specific case [42] of the model-independent approach limited to linear models (with an arbitrary solution algorithm and hyperparameters) provides good results for heteroscedastic datasets ([42], supplementary materials), and suits the ELM output layer solution as well. The method applies to any amount of training data, and benefits from huge datasets by producing more independent models in its ensemble part. Unfortunately, it does not output prediction intervals directly.
The scope of this paper is constrained to fast ways of computing prediction intervals of outputs, tailored specifically for the Extreme Learning Machine. The proposed solution works especially well in conjunction with ELM, reusing some heavy computational parts as shown in the next section. A fast runtime is one of the key features of ELM, making it valuable for practical applications and Big Data processing. Another key feature of ELM is the approximation of complex unknown functions, and the proposed method approximates prediction intervals of model outputs in a similar fashion, without probabilistic or fuzzy set notations.
2 Methodology
This section starts by introducing the Extreme Learning Machine. It continues with the prediction intervals idea, and its implementation suitable for ELM. The section concludes with a formal description of an algorithm.
2.1 Extreme Learning Machine
The Extreme Learning Machine [20] model is formulated as a feedforward neural network with a single hidden layer. It has $d$ input and $L$ hidden neurons. The solution is given for one output neuron; in the case of many output neurons, each one has an independent solution. The hidden layer weights $\mathbf{W} \in \mathbb{R}^{d \times L}$ are initialized with random noise and fixed. Often an extra input neuron with the constant value $1$ is added to function as a bias.

Hidden layer neurons apply a nonlinear transformation function $\phi$ to their output. Typical functions are the sigmoid or the hyperbolic tangent, but this function may be omitted to add linear neurons. For $N$ input data samples gathered in a matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, the hidden layer output matrix $\mathbf{H} \in \mathbb{R}^{N \times L}$ is:

$h_{ij} = \phi\left(\sum_{k=1}^{d} x_{ik} w_{kj}\right)$ (2)

where the function $\phi$ is applied element-wise. In matrix notation, the formula simplifies to $\mathbf{H} = \phi(\mathbf{X}\mathbf{W})$.
The output layer of ELM is a linear regression problem $\mathbf{H}\boldsymbol{\beta} = \mathbf{y}$ that is overdetermined in real cases with more data samples than hidden neurons ($N > L$). The output weights $\boldsymbol{\beta}$ are given by an ordinary least squares solution $\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{y}$, computed with the Moore-Penrose pseudoinverse [35] of the matrix $\mathbf{H}$.

Random initialization may decrease the performance of a naive ELM. This problem is completely solved by including L2 regularization in the output layer solution. The linear regression problem becomes:

$\boldsymbol{\beta} = \left(\mathbf{H}^{T}\mathbf{H} + \alpha\mathbf{I}\right)^{-1}\mathbf{H}^{T}\mathbf{y}$ (3)

where $\alpha$ is the L2 regularization parameter, optimized by validation. With L2 regularization and a large number of hidden neurons, ELM performance becomes stable and unaffected by a particular random initialization of $\mathbf{W}$ [19].
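A minimal numerical sketch of the ELM training and prediction steps of equations (2)-(3), with illustrative names and sizes of our choosing (the paper's experiments use a dedicated toolbox instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, L=100, alpha=1e-3):
    """Sketch of ELM training: random fixed hidden layer, ridge output layer."""
    d = X.shape[1]
    W = rng.normal(size=(d, L))            # random hidden weights, fixed
    b = rng.normal(size=L)                 # random biases (the constant input)
    H = np.tanh(X @ W + b)                 # hidden layer output, eq. (2)
    # Regularized least squares, eq. (3): beta = (H^T H + alpha I)^{-1} H^T y
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(L), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy regression problem: a noisy sine on 500 one-dimensional samples.
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=500)
W, b, beta = elm_train(X, y)
yh = elm_predict(X, W, b, beta)
```

Only `beta` is learned from data; `W` and `b` stay at their random initialization, which is what makes the training a single linear solve.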
2.2 Prediction Intervals
Assume a stochastic output $y$ whose distribution is conditioned on the inputs as in equation (1). A model prediction $\hat{y}(\mathbf{x})$ estimates only the mean value of the output, and ignores its stochastic nature.
Prediction intervals (PI) offer a simple way of describing the uncertainty of the output by estimating boundaries on its value, such that the true output $y$ lies between those boundaries with a given probability $c$. For normally distributed outputs (1), the prediction intervals at the confidence level $c$ can be modelled by

$\hat{y}(\mathbf{x}) \pm z_{(1+c)/2}\,\sigma(\mathbf{x})$ (4)

where $z_p = \Phi^{-1}(p)$ is the inverse cumulative distribution function of the standard normal distribution, i.e. $\Phi(z_p) = p$.

The maximum likelihood estimator for the variance of a homoscedastic output is given by the Mean Squared Error [7]. However, it provides uniform prediction intervals that fit poorly to practical applications of Machine Learning.
The estimation of variance in linear regression is a well-researched topic, with a plethora of theoretical [37] and experimental [33] results available. The variance of heteroscedastic model predictions can be computed with the Bienaymé formula [26, 23] from the variance of the model weights $\boldsymbol{\beta}$. However, the variance of the predicted outputs corresponds to confidence intervals and does not describe the range of possible true outputs $y$.

The relation between the heteroscedastic prediction intervals and other methods is illustrated in Figure 1.
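Equation (4) reduces to a standard normal quantile computation once a per-sample standard deviation is available; a small sketch using only the Python standard library (the function name is ours):

```python
from statistics import NormalDist

def prediction_interval(y_hat, sigma_hat, confidence=0.95):
    """Sketch of equation (4): a symmetric PI around one predicted mean."""
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)  # inverse CDF Phi^{-1}
    return y_hat - z * sigma_hat, y_hat + z * sigma_hat

lo, hi = prediction_interval(2.0, 0.5, confidence=0.95)
# z at the 95% level is about 1.96, so the interval is roughly (1.02, 2.98)
```

The quantile $(1+c)/2$ appears because the tail probability $1-c$ is split equally between the two sides of a symmetric interval.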
2.3 Prediction Intervals for Extreme Learning Machines
The idea of this paper is to estimate the variance of heteroscedastic outputs using a second ELM model. The model predictions $\hat{y}(\mathbf{x}_i)$ are computed by the first ELM, then the squared residuals $r_i^2 = (y_i - \hat{y}(\mathbf{x}_i))^2$ are used as training outputs for the second ELM that learns to predict the conditional variance of the outputs.
However, ELM predictions can be inaccurate, and their quality must be taken into account. For that reason, the variances of the predictions of the first ELM and the second ELM are added to the predicted squared residuals to bound the true variance of the outputs $\sigma^2(\mathbf{x})$:

$\hat{\sigma}^2(\mathbf{x}) = \hat{r}^2(\mathbf{x}) + \mathrm{Var}[\hat{y}(\mathbf{x})] + \mathrm{Var}[\hat{r}^2(\mathbf{x})]$ (5)

In addition to directly estimating the input-dependent variance $\sigma^2(\mathbf{x})$, this expression has the desired property of giving a larger variance for models with an insufficient amount of training data. With an excessive amount of training data $N \to \infty$, the variances of the predicted residuals and the predicted outputs decrease to zero and the variance of the true outputs is given by its ELM estimation: $\hat{\sigma}^2(\mathbf{x}) = \hat{r}^2(\mathbf{x})$. A similar approach to the prediction intervals exists in feedforward neural networks [31]; however, it is valid only for the case $N \to \infty$.
The output layer of ELM is a linear regression. The Bienaymé formula [26, 23] provides the variance of outputs in a linear regression, and in ELM:

$\mathrm{Var}[\hat{y}(\mathbf{x})] = \mathbf{h}(\mathbf{x})^{T}\,\mathrm{Cov}(\boldsymbol{\beta})\,\mathbf{h}(\mathbf{x})$ (6)

where $\mathbf{h}(\mathbf{x})$ is the hidden layer output of an ELM for an input sample $\mathbf{x}$.
There is a plethora of methods for estimating the covariance $\mathrm{Cov}(\boldsymbol{\beta})$ of normally distributed linear system weights. The method of choice is the weighted Jackknife estimator [45]. It is unbiased, robust against heteroscedastic noise [37, 16, 13, 10], as fast as an ELM, and scales well with the data size. Another good method for variance estimation is the Wild Bootstrap [10] with nice theoretical properties, but it is slower, as the bootstrap part requires several repetitions to converge.
2.4 Weighted Jackknife for Big Data
A summary of the weighted Jackknife method is presented below. Its inputs are the ELM hidden layer outputs $\mathbf{H}$ and the residuals $\mathbf{r} = \mathbf{y} - \mathbf{H}\boldsymbol{\beta}$.

$\mathbf{C} = \left(\mathbf{H}^{T}\mathbf{H} + \alpha\mathbf{I}\right)^{-1}$ (7)

$d_i = \mathbf{h}_i^{T}\,\mathbf{C}\,\mathbf{h}_i$ (8)

$\tilde{\mathbf{h}}_i = \frac{r_i}{1 - d_i}\,\mathbf{h}_i$ (9)

$\mathbf{S} = \tilde{\mathbf{H}}^{T}\tilde{\mathbf{H}}$ (10)

$\mathrm{Cov}(\boldsymbol{\beta}) = \mathbf{C}\,\mathbf{S}\,\mathbf{C}$ (11)

The method uses three auxiliary matrices: $\mathbf{C}$, $\tilde{\mathbf{H}}$ and $\mathbf{S}$. Equation (9) creates a weighted data matrix $\tilde{\mathbf{H}}$ by scaling every row $\mathbf{h}_i$ of the original data $\mathbf{H}$; its denominator includes a dot product between the two vectors $\mathbf{h}_i$ and $\mathbf{C}\mathbf{h}_i$.

Weighted Jackknife works well together with ELM and Big Data. First, the auxiliary matrix $\mathbf{C}$ in (7) is the inverse of the matrix $\mathbf{H}^{T}\mathbf{H} + \alpha\mathbf{I}$ already computed in the ELM solution (3).
Second, Big Data applications with a huge number of samples are often limited by memory size, especially if the matrix computations are run on GPUs with a very limited memory pool. Weighted Jackknife avoids this limitation through batch computations. Let the data matrix $\mathbf{H}$ be split in $K$ equal parts $\mathbf{H}_1, \ldots, \mathbf{H}_K$ with $N/K$ samples each. Then the auxiliary matrix $\tilde{\mathbf{H}}$ can be computed in the corresponding parts $\tilde{\mathbf{H}}_k$, and the auxiliary matrix $\mathbf{S}$ becomes a summation over all the parts: $\mathbf{S} = \sum_{k=1}^{K} \tilde{\mathbf{H}}_k^{T}\tilde{\mathbf{H}}_k$. The size of the matrices $\mathbf{C}$ and $\mathbf{S}$ does not depend on the number of samples $N$, and the weighting (9) may be done in-place without consuming additional memory.

Having only one data part in memory at a time reduces the total memory requirements by a factor of $K$. A large enough $K$ allows a single workstation to process billions of samples with weighted Jackknife, the same way as presented for ELM in [1]. The practical value of $K$ is limited by the minimum size of a single batch, because small data batches cannot fully utilize the CPU/GPU computational potential [1].
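Under one plausible reading of equations (7)-(11), the batched computation can be sketched as follows; the variable names, sizes, and the exact leverage-based weighting are our assumptions rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, K = 6000, 50, 6                      # samples, hidden neurons, batches

H = rng.normal(size=(N, L))                # stand-in hidden layer outputs
r = rng.normal(size=N)                     # stand-in residuals y - y_hat
alpha = 1e-3

# Eq. (7): C reuses the matrix already inverted in the ELM solution (3).
C = np.linalg.inv(H.T @ H + alpha * np.eye(L))

S = np.zeros((L, L))                       # L x L accumulator, eq. (10)
for Hk, rk in zip(np.array_split(H, K), np.array_split(r, K)):
    d = np.einsum('ij,jk,ik->i', Hk, C, Hk)      # leverages h_i^T C h_i, eq. (8)
    Hw = (rk / (1.0 - d))[:, None] * Hk          # weighted rows, eq. (9)
    S += Hw.T @ Hw                               # batch-wise summation
cov_beta = C @ S @ C                             # eq. (11)
```

Only one `Hk` block is needed in memory per iteration, while `C` and `S` stay at a fixed `L × L` size regardless of `N`, which is the point of the batching argument above.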
2.5 ELM Prediction Intervals Algorithm
Prediction intervals are computed in two stages. The first stage uses training data to learn the two necessary ELM models, and to estimate the covariances of the output weights in these models:

1. Train an ELM model on the training data
2. Predict outputs $\hat{\mathbf{y}}$ for the training data
3. Use weighted Jackknife to estimate the covariance of its output weights
4. Compute the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$ for the training data
5. Train another ELM model to predict the squared residuals
6. Use weighted Jackknife to estimate the covariance of its output weights

The training data and auxiliary matrices can be discarded at this point.
The second stage uses the previously trained models to predict the test outputs, their squared residuals and all the variances. Then the prediction intervals are estimated with equation (4).

1. Compute the hidden layer outputs for the test inputs using the two ELM models
2. Predict the test outputs $\hat{y}(\mathbf{x})$
3. Compute the variance of the predicted outputs $\mathrm{Var}[\hat{y}(\mathbf{x})]$ with equation (6)
4. Predict the squared residuals $\hat{r}^2(\mathbf{x})$
5. Compute the variance of the predicted squared residuals $\mathrm{Var}[\hat{r}^2(\mathbf{x})]$
6. Compute the prediction intervals for a desired confidence level $c$:

$\hat{y}(\mathbf{x}) \pm z_{(1+c)/2}\sqrt{\hat{r}^2(\mathbf{x}) + \mathrm{Var}[\hat{y}(\mathbf{x})] + \mathrm{Var}[\hat{r}^2(\mathbf{x})]}$ (12)
The two models can have different optimal numbers of neurons, which should be validated. Using L2 regularization prevents numerical instabilities. Note that the predicted squared residuals might have negative values; these are replaced by zero.
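The two stages above can be condensed into a short sketch; this is an illustrative implementation under our own naming, with a shared random hidden layer for both models and synthetic data, not the authors' toolbox code:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Shared random hidden layer for both ELM models (an illustrative choice).
L = 50
W = rng.normal(size=(1, L))
b = rng.normal(size=L)

def hidden(X):
    return np.tanh(X @ W + b)

def train_elm(H, y, alpha=1e-3):
    # Regularized output layer, eq. (3); C is reused by the jackknife.
    C = np.linalg.inv(H.T @ H + alpha * np.eye(H.shape[1]))
    return C @ (H.T @ y), C

def jackknife_cov(H, resid, C):
    # Weighted-jackknife-style covariance of the output weights.
    d = np.einsum('ij,jk,ik->i', H, C, H)         # leverages
    Hw = (resid / (1.0 - d))[:, None] * H
    return C @ (Hw.T @ Hw) @ C

# Stage 1: heteroscedastic toy data, then the two models.
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + (0.05 + 0.2 * np.abs(X[:, 0])) * rng.normal(size=2000)

H = hidden(X)
beta1, C1 = train_elm(H, y)                       # first ELM: predicts y
r = y - H @ beta1                                 # residuals
cov1 = jackknife_cov(H, r, C1)
beta2, C2 = train_elm(H, r ** 2)                  # second ELM: predicts r^2
cov2 = jackknife_cov(H, r ** 2 - H @ beta2, C2)

# Stage 2: intervals on new inputs (here the training inputs, for brevity).
Ht = hidden(X)
y_hat = Ht @ beta1
r2_hat = np.maximum(Ht @ beta2, 0.0)              # negative estimates -> zero
var_y = np.einsum('ij,jk,ik->i', Ht, cov1, Ht)
var_r2 = np.einsum('ij,jk,ik->i', Ht, cov2, Ht)
s = np.sqrt(np.maximum(r2_hat + var_y + var_r2, 0.0))  # as in eq. (5)
z = NormalDist().inv_cdf(0.975)                   # 95% confidence level
lower, upper = y_hat - z * s, y_hat + z * s       # as in eq. (12)
```

The interval half-width `z * s` varies per sample here, unlike an MSE-based interval that would be constant over the whole dataset.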
3 Experimental Results
3.1 Artificial Dataset
An artificial dataset with heteroscedastic noise is shown in Figure 2. Additional tests are done on homoscedastic versions of the same dataset, with the same projection function but an input-independent normally distributed noise. All experiments used ELMs with one linear and 10 hyperbolic tangent hidden neurons, in both the output model and the residual model.
Figure 3 shows the computed PI on the heteroscedastic artificial dataset at the 95% confidence level. The figure also presents the standard deviation of the predicted residuals at 95% confidence, to show how it is affected by the amount of training data. As the amount of training data increases, the PI are given more precisely by the predicted residuals and depend less on their variance (Figure 3, right).

Similar results are obtained for the datasets with homoscedastic noise, presented in Figure 4. A larger variance of the outputs makes the prediction task harder, leading to larger errors in the predicted residuals (Figure 4, upper left). At the same time the variance of the predicted residuals increases (Figure 4, shaded area), and the true PI rarely go beyond their estimated boundaries. A smaller variance of the noise leads to more precise PI, which still cover the true PI most of the time.
In the extreme case of a training set with only 30 samples (which is not enough for learning the correct shape of the true projection function), the predicted squared residuals become unreliable. However, including their variance in the predictions compensates for the model uncertainty (see Figure 5). This sometimes leads to an overestimation of the true PI, but it is a desired property that prevents an uncertain model from producing falsely confident predictions.
3.2 Comparison on Real-World Datasets
ELM Prediction Intervals are compared with four other methods presented in [24] on four real datasets. Details of the datasets are given in Table 1. The paper uses two common metrics: the Prediction Interval Coverage Probability (PICP), which is the percentage of test samples whose outputs lie between the PI, and the Normalized Mean Prediction Interval Width (NMPIW), which is the average width of the PI on a test set divided by the range of the test targets. PICP shows what percentage of targets actually lie within the PI, and it should correspond to the target coverage. NMPIW shows how optimal the PI are for the given task, compared to a naive approach of simply taking the full range of targets as an interval. Ideal PI have a small NMPIW with a PICP equal to the target coverage.
Dataset                              Samples  Features  Reference
Concrete compressive strength        1030     8         [46]
Plasma beta-carotene                 315      12        [30]
Powerplant: steam pressure           200      5         [14]
Powerplant: main steam temperature   200      5         [14]
The two measures PICP and NMPIW are interdependent, as increasing the PI width also increases the coverage. The comparison work [24] proposed a combined measure to replace PICP and NMPIW, but it is subjective due to two arbitrary hyperparameters. This paper instead presents PICP and NMPIW on the same plot.
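Both metrics are straightforward to compute from a set of intervals; a small sketch with function names of our choosing:

```python
import numpy as np

def picp(y, lower, upper):
    """Prediction Interval Coverage Probability, in percent."""
    return np.mean((y >= lower) & (y <= upper)) * 100.0

def nmpiw(y, lower, upper):
    """Normalized Mean Prediction Interval Width, in percent of target range."""
    return np.mean(upper - lower) / (y.max() - y.min()) * 100.0

# Toy check: every target covered, each interval 1.0 wide over a range of 3.0.
y = np.array([0.0, 1.0, 2.0, 3.0])
lo = y - 0.5
hi = y + 0.5
```

The interdependence is visible here: widening `lo`/`hi` can only raise PICP, at the direct cost of a larger NMPIW.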
The ELM PI method proposed in this paper is compared to four other methods of computing PI for neural networks. The Delta method [9] linearizes a neural network model around a set of parameters, then applies asymptotic theory to construct the PI. An extension of the Delta method to heteroscedastic noise is available [11], although it is still limited due to the linearization. Bayesian learning of neural network weights allows for a direct derivation of the variance of particular predicted values [27], but at a very high computational cost. The Bootstrap method is directly applicable to any machine learning method including neural networks, although caution should be taken in selecting the bootstrap parameters to make the method resilient to heteroscedastic noise [10]. Finally, the Lower Upper Bound Estimation (LUBE) method proposed in [24] uses two additional outputs in a neural network to predict the lower and upper PI, training the network with a custom cost function that includes both PICP and NMPIW.
The experimental setup uses an L1-regularized ELM model [28] for automatic model structure selection on the relatively small datasets, implemented in the HP-ELM toolbox [1]. The datasets are randomly split into 70% training and 30% test samples; median results over 30 initializations are reported. Numerical experimental results are given in Table 2; comparison numbers for the other methods are available in the corresponding paper [24]. Runtime is reported for a 1.4 GHz dual-core laptop.
Dataset                              PICP (%)  NMPIW (%)  Runtime (ms)
Concrete compressive strength        91.59     34.01      92
Plasma beta-carotene                 92.63     40.66      36
Powerplant: steam pressure           93.33     39.29      27
Powerplant: main steam temperature   88.33     18.38      35
The performance of the methods is shown as points in NMPIW/PICP coordinates, presented in Figure 6. An ideal method would be at the left edge of the dashed line (low NMPIW with precise PICP). As shown in the figure, the ELM PI method performs better on the Steam pressure dataset, a little worse on the Plasma beta-carotene dataset, and about average on the other two.
A further analysis shows possible reasons for the good performance on Steam pressure, and the poor one on Plasma beta-carotene. The analysis compares against uniform PI using the same ELM predictions for a dataset. Such PI estimate homoscedastic noise correctly, but cannot learn heteroscedastic noise. Let a uniform PI grow starting from zero; as it grows, both the coverage and the interval width increase, generating many pairs of {NMPIW, PICP} points. These points are then connected by a line that represents the homoscedastic PI performance boundary. This boundary and the ELM PI for the two datasets in question are shown in Figure 7.

Obviously, useful heteroscedastic PI must be above this boundary, but in practice they may end up below it due to poorer parameter estimation. Indeed, heteroscedastic PI need an interval width per sample, while homoscedastic PI only have one interval width per dataset, which is easier to estimate precisely. As seen from Figure 7, this is the situation for the ELM PI on the Plasma beta-carotene dataset, where the uniform PI perform better. On Steam pressure, however, the heteroscedastic PI perform better than the uniform ones, as they have a higher coverage with the same average width. Another possible reason for the difference in performance is that the Plasma beta-carotene dataset has homoscedastic noise, while the Steam pressure dataset actually has heteroscedastic noise (or heteroscedastic stochastic outputs), so heteroscedastic PI provide the most benefit when computed on the latter dataset.
4 Minimizing False Positives on a Large Real Dataset
This experiment uses PI to minimize the amount of false positive predictions on a large classification task. Note that the proposed PI methodology applies equally well to regression, and monotonic classification tasks are handled even better using purposely developed implementations of ELM [48].
A 4,000,000-sample dataset of pixel colors for skin/non-skin classification is created from the Face/Skin Images dataset [34]. The inputs are the colors of the target pixel and its neighbors, and the outputs are +1 for skin pixels and -1 for non-skin ones. The dataset uses photos of various people under different lighting conditions, without any preprocessing. The true skin masks are created manually and are highly accurate. Half of the dataset is used for training, and the other half for test.
The applied ELM model uses 147 linear + 200 sigmoid neurons. The predictions of ELM are real values, which are turned into classes by taking their sign. Due to the simple model and input features (which are not tailored for image processing), the performance is average, at about 87% accuracy. The goal of the experiment is to check whether the per-sample PI can be used to significantly improve the accuracy at a cost of coverage, compared to per-dataset PI computed from the MSE.
To trade coverage for precision, a threshold $\tau$ is introduced. ELM predictions with an absolute value less than $\tau$ are ignored. A value of $\tau$ corresponding to the desired coverage percentage is found by scalar optimization methods. For per-sample PI, the threshold $\tau$ is multiplied by the estimated standard deviation of the corresponding prediction.
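The scalar search for the threshold can be sketched, for example, as a simple bisection on the empirical coverage; the function name, the bisection itself, and the standard-normal stand-in scores are our illustrative assumptions, not the paper's optimizer:

```python
import numpy as np

def threshold_for_coverage(scores, target, iters=60):
    """Find tau so that roughly `target` fraction of scores exceed it."""
    lo, hi = 0.0, float(np.max(scores))
    for _ in range(iters):                 # simple scalar bisection
        mid = 0.5 * (lo + hi)
        if np.mean(scores >= mid) > target:
            lo = mid                       # too many samples kept: raise tau
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
yh = rng.normal(size=100000)               # stand-in for real-valued ELM outputs
# For per-sample PI, the scores would be |y_hat| scaled by the per-sample
# standard deviation; here the deviation is constant for simplicity.
tau = threshold_for_coverage(np.abs(yh), target=0.10)
kept = np.abs(yh) >= tau
```

Keeping only the 10% most confident predictions this way is the mechanism behind the coverage/precision trade-off reported below.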
The results are shown in Figure 8. Here, an ELM model with a total of 347 hidden neurons is trained on a dataset with two million samples. The per-sample PI improve the true positive rate slightly. More importantly, they reach almost zero false positives at 3% coverage, and exactly zero at 1%. Contrary to the proposed method, uniform PI computed from the MSE cannot achieve zero false positives. Although one percent of coverage seems very little, it represents 20,000 test samples for that dataset, and it is a surprising achievement for a simple ELM model that is not optimized for false positive reduction like in custom applications [2]. A specifically designed model, or an ensemble of multiple models, could achieve zero false positives with a larger coverage: a significant result for the practical use of ELM, and of Machine Learning algorithms in general.
4.1 Runtime Analysis
The runtime of the per-sample PI is examined on the pixel classification dataset explained above. The experiments are run on a desktop machine with a 4-core Intel Skylake CPU, using the efficient ELM toolbox from [1]. With 2,000,000 training samples and 347 hidden neurons, training an ELM takes 12 seconds (for either of the two models). Computing the covariance matrices with the weighted Jackknife method takes 25 seconds each, or only about twice as long as training an ELM itself. Test predictions take 8 seconds to compute, and the test per-sample PI take 32 seconds. In total, the prediction intervals increase the ELM runtime by a constant factor of about 5.
The runtime on the real-world datasets is not directly comparable with the other methods, as they are run on different machines, but it is of the same order of magnitude as the Bootstrap, an order of magnitude faster than the Delta or Bayesian methods, and an order of magnitude slower than the LUBE method. Replacing the L1-regularized ELM with a standard ELM reduces the runtime to the level of the LUBE method; however, it degrades the results on small datasets with a few hundred samples. Extremely large datasets that do not need regularization benefit from the faster run speed.
5 Conclusion
The paper proposed a method of computing per-sample prediction intervals for Extreme Learning Machines. It successfully estimates the variance of heteroscedastic stochastic outputs, using only ELM models and the weighted Jackknife method. The proposed framework also works well for homoscedastic outputs, making the method applicable on a general level. ELM PI is comparable to other methods of computing PI in neural networks on small datasets, while keeping very fast runtimes and scalability for Big Data.
On a real dataset, the method was shown to allow for a better precision and a lower false positive rate. Heteroscedastic PI perform similarly to uniform PI from the Mean Squared Error on 50%-70% of the dataset samples, but they make a huge difference on the most confidently predicted 1%-10% of samples. For these samples, the proposed PI achieve a zero false positive rate even with a basic ELM model, which is an extremely useful feature in many practical applications. The runtime is comparable to the runtime of an ELM itself, which makes the method feasible for large datasets of Big Data problems.
ELM PI can easily be extended to non-symmetric PI by using two ELM models in the second stage, predicting the upper and lower boundaries separately. An ensemble of ELMs may increase the coverage of zero false positive predictions. These extensions will be examined and evaluated in future works on this topic.
References
[1] (2015-07) High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications. IEEE Access 3, pp. 1011–1025. ISSN 2169-3536.
[2] (2014-03) A Two-Stage Methodology Using KNN and False-Positive Minimizing ELM for Nominal Data Classification. Cognitive Computation 6 (3), pp. 432–445. ISSN 1866-9956.
[3] (2015-05) Arbitrary Category Classification of Websites Based on Image Content. IEEE Computational Intelligence Magazine 10 (2), pp. 30–41. ISSN 1556-603X.
[4] (2015-07) MD-ELM: Originally Mislabeled Samples Detection using OP-ELM Model. Neurocomputing 159, pp. 242–250. ISSN 0925-2312.
[5] (1982) Linear regression analysis with fuzzy model. IEEE Transactions on Systems, Man, and Cybernetics 12 (6), pp. 903–907.
[6] (2017-02) Fuzziness based semi-supervised learning approach for intrusion detection system. Information Sciences 378, pp. 484–497. ISSN 0020-0255.
[7] (2006) Pattern Recognition and Machine Learning. Information Science and Statistics, Vol. 4, Springer Science+Business Media, Singapore. ISBN 978-0-387-31073-2.
[8] (2016) Variational Bayesian extreme learning machine. Neural Computing and Applications 27 (1), pp. 185–196. ISSN 1433-3058.
[9] (1996-01) Confidence interval prediction for neural network models. IEEE Transactions on Neural Networks 7 (1), pp. 229–232. ISSN 1045-9227.
[10] (2008-09) The wild bootstrap, tamed at last. Journal of Econometrics 146 (1), pp. 162–169. ISSN 0304-4076.
[11] (2003-03) Backpropagation of pseudo-errors: neural networks that are adaptive to heterogeneous noise. IEEE Transactions on Neural Networks 14 (2), pp. 253–262. ISSN 1045-9227.
[12] (2011-03) BELM: Bayesian Extreme Learning Machine. IEEE Transactions on Neural Networks 22 (3), pp. 505–509. ISSN 1045-9227.
[13] (2005-04) Bootstrapping heteroskedastic regression models: wild bootstrap vs. pairs bootstrap. 2nd CSDA Special Issue on Computational Econometrics 49 (2), pp. 361–376. ISSN 0167-9473.
[14] (1974) Identification of a power plant from normal operating records. Automatic Control Theory and Applications 2 (3), pp. 63–67.
[15] (2016-10) Fuzzy Nonlinear Regression Analysis Using a Random Weight Network. Information Sciences 364 (C), pp. 222–240. ISSN 0020-0255.
[16] (1998-03) A robust approach to reference interval estimation and evaluation. Clinical Chemistry 44 (3), pp. 622–631.
[17] (2015-05) Local Receptive Fields Based Extreme Learning Machine. IEEE Computational Intelligence Magazine 10 (2), pp. 18–29. ISSN 1556-603X.
[18] (2006-07) Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks 17 (4), pp. 879–892. ISSN 1045-9227.
[19] (2012-04) Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (2), pp. 513–529. ISSN 1941-0492.
[20] (2006-12) Extreme learning machine: Theory and applications. Neurocomputing 70 (1–3), pp. 489–501. ISSN 0925-2312.
[21] (25-29 July 2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 985–990.
[22] (3-4 Dec. 2011) Methodology for Behavioral-based Malware Analysis and Detection Using Random Projections and K-Nearest Neighbors Classifiers. In 2011 Seventh International Conference on Computational Intelligence and Security, pp. 1016–1023.
[23] (2002) Applied Multivariate Statistical Analysis. Vol. 5, Prentice Hall, Upper Saddle River, NJ.
[24] (2011-03) Lower Upper Bound Estimation Method for Construction of Neural Network-Based Prediction Intervals. IEEE Transactions on Neural Networks 22 (3), pp. 337–346. ISSN 1045-9227.
[25] (2016) Advances in extreme learning machines (ELM2014). Neurocomputing 174, Part A, pp. 1–3. ISSN 0925-2312.
[26] (1955) Probability Theory; Foundations, Random Sequences. D. Van Nostrand Company, New York.
[27] (1992-09) The Evidence Framework Applied to Classification Networks. Neural Computation 4 (5), pp. 720–736. ISSN 0899-7667.
[28] (2010-01) OP-ELM: Optimally-Pruned Extreme Learning Machine. IEEE Transactions on Neural Networks 21 (1), pp. 158–162.
[29] (2011-09) TROP-ELM: A double-regularized ELM using LARS and Tikhonov regularization. Neurocomputing 74 (16), pp. 2413–2421. ISSN 0925-2312.
[30] (1989-09) Determinants of Plasma Levels of Beta-Carotene and Retinol. American Journal of Epidemiology 130 (3), pp. 511–521.
[31] (1995) Learning Local Error Bars for Nonlinear Regression. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), pp. 489–496.
[32] (2014-10) Input dependent prediction intervals for supervised regression. Intelligent Data Analysis 18 (5), pp. 873–887. ISSN 1088-467X.
[33] (2015) Prediction intervals in supervised learning for model evaluation and discrimination. Applied Intelligence 42 (4), pp. 790–804. ISSN 1573-7497.
[34] (2005-01) Skin segmentation using color pixel classification: analysis and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (1), pp. 148–154. ISSN 0162-8828.
[35] (1972) Generalized inverse of a matrix and its applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, Berkeley, CA, pp. 601–620.
[36] (2015-01) Confidence-weighted extreme learning machine for regression problems. Neurocomputing 148, pp. 544–550. ISSN 0925-2312.
[37] (1987) Heteroscedasticity-Robustness of Jackknife Variance Estimators in Linear Models. The Annals of Statistics 15 (4), pp. 1563–1579. ISSN 0090-5364.
[38] (2016-01) Extreme learning machine for missing data using multiple imputations. Neurocomputing 174, Part A, pp. 220–231. ISSN 0925-2312.
[39] (2015-01) Efficient Skin Segmentation via Neural Networks: HP-ELM and BD-SOM. INNS Conference on Big Data 2015, San Francisco, CA, USA, 8-10 August 2015, 53, pp. 400–409. ISSN 1877-0509.
[40] (2016) Brain MRI morphological patterns extraction tool based on Extreme Learning Machine and majority vote classification. Neurocomputing 174, Part A, pp. 344–351. ISSN 0925-2312.
[41] (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 5, pp. 1035–1038.
[42] (2017-09) Stable prediction in high-dimensional linear models. Statistics and Computing 27 (5), pp. 1401–1412. ISSN 1573-1375.
[43] (2015-10) A Study on Relationship Between Generalization Abilities and Fuzziness of Base Classifiers in Ensemble Learning. IEEE Transactions on Fuzzy Systems 23 (5), pp. 1638–1654. ISSN 1063-6706.
[44] (2017) Non-iterative Deep Learning: Incorporating Restricted Boltzmann Machine Into Multilayer Random Weight Neural Networks. IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99), pp. 1–10. ISSN 2168-2216.
[45] (1986-12) Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis. The Annals of Statistics 14 (4), pp. 1261–1295. ISSN 0090-5364.
[46] (1998) Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research 28 (12), pp. 1797–1808.
[47] (2017-04) An Efficient Method for Traffic Sign Recognition Based on Extreme Learning Machine. IEEE Transactions on Cybernetics 47 (4), pp. 920–933. ISSN 2168-2267.
[48] (2017) Monotonic classification extreme learning machine. Neurocomputing 225, pp. 205–213. ISSN 0925-2312.