Ensembles of Randomized NNs for Pattern-based Time Series Forecasting

by Grzegorz Dudek et al.

In this work, we propose an ensemble forecasting approach based on randomized neural networks. Improved randomized learning enhances the fitting abilities of individual learners by generating network parameters in accordance with the features of the data and target function. A pattern-based representation of time series makes the proposed approach suitable for forecasting time series with multiple seasonality. We propose six strategies for controlling the diversity of ensemble members. Case studies conducted on four real-world forecasting problems verified the effectiveness and superior performance of the proposed ensemble forecasting approach. It outperformed both statistical models and state-of-the-art machine learning models in terms of forecasting accuracy. The proposed approach has several advantages: fast and easy training, simple architecture, ease of implementation, high accuracy, and the ability to deal with nonstationarity and multiple seasonality in time series.





1 Introduction

Time series (TS) forecasting plays an important role in many fields, including commerce, industry, public administration, politics, health and medicine [1]. TS expressing different phenomena and processes may include nonlinear trends, multiple seasonal cycles of different lengths, and random fluctuations. This makes the relationship between predictors and output variables very complex and places high demands on forecasting models. Over the years, many sophisticated forecasting models have been proposed, including statistical, machine learning (ML) and hybrid solutions.

Among ML models, neural networks (NNs) are the most commonly used. There are a huge number of forecasting models based on different NN architectures [2]. In addition to classic NNs, such as the multilayer perceptron (MLP), radial basis function NN, generalized regression NN (GRNN), and self-organizing maps, many models based on deep learning (DL) have been developed recently. These are composed of combinations of basic structures, such as MLPs, convolutional NNs, and recurrent NNs (RNNs) [4], [5]. Their success can be largely attributed to increased model complexity and the ability to perform representation learning and cross-learning.

An effective way to increase the performance of a predictive model is ensembling. Ensemble methods combine multiple learning algorithms to produce a common response, hopefully improving accuracy and stability compared to a single learner. The challenge in ensemble learning, which to a large extent determines its success, is achieving a good tradeoff between the performance and the diversity of the ensemble members [6]. The more accurate and the more diverse the individual learners are, the better the ensemble performance. Depending on the base model type, the diversity of learners can be achieved using different strategies. For example, the winning submission to the M4 Makridakis forecasting competition was a hybrid model combining exponential smoothing (ETS) and long short-term memory (LSTM) [7], which uses three sources of diversity: a stochastic training process, a bagging-like procedure, and training ensemble members with different initial parameters. In another state-of-the-art DL forecasting model, N-BEATS [8], diversity is achieved by training each ensemble member with a different random initialization and a different random sequence of batches.

Randomization-based NNs are especially suitable for ensembling as they are highly unstable and extremely fast to train [9]. In randomized NNs, the hidden node parameters are selected randomly and the output weights are tuned using a simple and fast least-squares method. This avoids the difficulties associated with gradient-based learning, such as slow convergence, local minima, and model uncertainty caused by the initial parameters. Ensemble methods based on randomized NNs have been presented in several recent papers. In [10], an ensemble of decorrelated random vector functional link (RVFL) networks was proposed, using negative correlation learning to control the trade-off among the bias, variance and covariance in ensemble learning. A selective ensemble of randomization-based NNs using the successive projections algorithm (SPA) was put forward in [11]; SPA improves diversity by selecting uncorrelated members. In [12], to enhance the generalization capacity of the ensemble, a framework was proposed for building an ensemble model by selecting appropriate representatives from a number of randomization-based models (RVFL or stochastic configuration networks).

In the forecasting domain, ensembles of randomization-based NNs are widely used. Examples include: [13], where ensemble members, which are extreme learning machines, learn on TS decomposed by wavelet transform; [14], where a hybrid incremental learning approach is presented for ensemble forecasting, which is composed of discrete wavelet transform, empirical mode decomposition and RVFL networks as base learners; [15], where a new bagging ensemble approach based on NN with random weights for online data stream regression is proposed; [16], where a data-driven evolutionary ensemble forecasting model is proposed using two optimization objectives (error and diversity) and RVFL as a base learner.

Motivated by the good performance of the randomized NN based ensemble forecasting models mentioned above, and new improvements in randomized learning [17] as well as pattern-based TS representation suitable for TS with multiple seasonality [3], this work contributes to the development of forecasting models in the following ways:

  1. A new ensemble forecasting model based on randomized NNs is presented. Enhanced randomized learning improves the fitting abilities of individual learners. To deal with multiple seasonality and nonstationarity, the model applies pattern representation of TS.

  2. Six strategies for generating ensemble diversity are proposed. Each strategy governs diversity in a different way.

  3. An experimental study using four real-world datasets demonstrates the superior performance of the proposed ensemble forecasting approach when compared to statistical and machine learning forecasting models.

The remaining sections of this work are organized as follows. Section 2 presents the base forecasting model. Section 3 describes strategies for generating ensemble diversity. The performance of the proposed ensemble forecasting model is evaluated in Section 4. Concluding remarks are given in Section 5.

2 Forecasting Model

Fig. 1 shows the base forecasting model, which was designed for forecasting TS with multiple seasonality [18] and is used as an ensemble member. It is composed of an encoder, a decoder and a randomized feedforward NN (FNN).

Figure 1: Block diagram of the proposed forecasting model.

The encoder transforms the original TS into unified input and output patterns of its seasonal cycles. To create input patterns, the TS expressing multiple seasonality is divided into seasonal sequences of the shortest length. Let these sequences be expressed by vectors e_i = [E_{i,1}, ..., E_{i,n}], where n is the seasonal sequence length and i is the sequence number. These sequences are encoded in input patterns x_i as follows:

x_i = (e_i - ē_i) / d_i    (1)

where ē_i is the mean value of sequence e_i, and d_i is a measure of sequence dispersion.

Note that the x-patterns are normalized versions of the centered vectors e_i. All x-patterns, representing successive seasonal sequences, have a zero mean, the same variance and the same unity length. However, they differ in shape. Thus, the original seasonal sequences, which have different mean values and dispersions, are unified (see Fig. 2 in [18]).
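This encoding can be sketched in a few lines. Here, the dispersion measure is assumed to be the Euclidean norm of the centered sequence, which yields the unity-length property noted above; the function name is illustrative.

```python
import numpy as np

def encode_x(seq):
    """Encode a seasonal sequence into an x-pattern: subtract the
    sequence mean, then divide by the dispersion (assumed here to be
    the Euclidean norm of the centered sequence)."""
    seq = np.asarray(seq, dtype=float)
    mean = seq.mean()
    centered = seq - mean
    disp = np.linalg.norm(centered)
    return centered / disp, mean, disp

# an illustrative 4-point "seasonal cycle"
x, mean, disp = encode_x([210.0, 245.0, 190.0, 235.0])
# x has zero mean and unit length, as stated in the text
```

The coding variables (mean, dispersion) are returned alongside the pattern because the decoder needs them to reconstruct real-valued forecasts.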

The output patterns y_i represent the forecasted sequences e_{i+τ} = [E_{i+τ,1}, ..., E_{i+τ,n}], where τ is a forecast horizon. The y-patterns are determined as follows:

y_i = (e_{i+τ} - ē_i) / d_i    (2)

where ē_i and d_i are the same as in (1).

Note that in (2), for the i-th output pattern, we use the same coding variables, ē_i and d_i, as for the i-th input pattern. This is because the coding variables for the forecasted sequence, ē_{i+τ} and d_{i+τ}, are unknown for the future period.

The decoder transforms a forecasted output pattern into a TS seasonal cycle. The output pattern ŷ predicted by the randomized FNN is decoded using the coding variables of the input query pattern x, using transformed equation (2):

ê = ŷ d + ē    (3)

where ê is the forecasted seasonal sequence, ŷ is the forecasted output pattern, and ē and d are the coding variables determined from the TS sequence encoded in query pattern x.
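The decoding step is then the inverse transformation, applied with the coding variables of the query sequence (a sketch; function and variable names are illustrative).

```python
import numpy as np

def decode_y(y_pattern, mean, disp):
    """Reconstruct a forecasted seasonal sequence from a predicted
    y-pattern, using the mean and dispersion of the query sequence."""
    return np.asarray(y_pattern, dtype=float) * disp + mean

# a y-pattern predicted by the network, decoded with the query's
# coding variables (values here are illustrative)
seq = decode_y([0.5, -0.25, 0.1], mean=220.0, disp=40.0)
# seq is [240.0, 210.0, 224.0]
```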

The randomized FNN is composed of n inputs, one hidden layer with m nonlinear nodes, and n outputs. Logistic sigmoid activation functions are employed for the hidden nodes. The training set is composed of the corresponding input and output patterns: Φ = {(x_i, y_i)}, i = 1, ..., N. The randomized learning algorithm consists of three steps (we use an improved version of the randomized learning algorithm proposed in [17]):

  1. Generate hidden node parameters. The weights are selected randomly, a_{j,t} ~ U(-u, u), and the biases are calculated from:

     b_j = -a_j^T x*_j    (4)

     where a_j = [a_{j,1}, ..., a_{j,n}]^T; j = 1, ..., m; and x*_j is one of the training x-patterns, selected for the j-th hidden node at random.

  2. Calculate the hidden layer output matrix H.

  3. Calculate the output weights:

     β = H⁺ Y    (5)

     where β is a matrix of output weights, Y is a matrix of target output patterns, and H⁺ is the Moore-Penrose generalized inverse of matrix H.

In the first step, the weights are selected randomly from the symmetrical interval [-u, u]. The interval bounds, ±u, decide the steepness of the sigmoids. To make the bounds interpretable, let us express them by the sigmoid slope angle [17]: u = 4 tan(α_max), where α_max is the upper bound for the slope angles. Hyperparameter α_max, as well as the number of hidden nodes m, controls the bias-variance tradeoff of the model. Both hyperparameters should be adjusted to the target function complexity.

The biases calculated according to (4) ensure the introduction of the steepest fragments of the sigmoid into the input hypercube [17]. These fragments are most useful for modeling target function fluctuations. Thus, there are no saturated sigmoids and wasted nodes in the network.
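The three training steps can be sketched as follows. This is a sketch, not the authors' implementation: the interval bound u is treated as a free parameter rather than derived from the slope angle, and the bias formula follows the description above (bias placing the sigmoid's steepest fragment on a randomly chosen training pattern).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_rand_fnn(X, Y, m, u):
    """Randomized FNN training sketch.
    X: (N, n) input patterns, Y: (N, out) target patterns,
    m: number of hidden nodes, u: bound of the uniform weight interval."""
    N, n = X.shape
    A = rng.uniform(-u, u, size=(m, n))       # step 1: random hidden weights
    x_star = X[rng.integers(0, N, size=m)]    # one random x-pattern per node
    b = -np.sum(A * x_star, axis=1)           # biases b_j = -a_j . x*_j
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))  # step 2: hidden output matrix
    beta = np.linalg.pinv(H) @ Y              # step 3: Moore-Penrose solution
    return A, b, beta

def predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
    return H @ beta
```

Because the only fitted parameters are the output weights β, training reduces to a single least-squares solve, which is what makes repeated training of many ensemble members cheap.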

3 Ensembling

An ensemble is composed of M individual learners. Each ensemble member learns from the training set using one of the strategies for generating diversity described below. The ensemble prediction ȳ is an average of the individual member predictions ŷ_m:

ȳ = (1/M) Σ_{m=1}^{M} ŷ_m    (6)
As an ensemble diversity measure, we define the average standard deviation of the forecasts produced by the individual learners:

Div = (1 / (n|Ψ|)) Σ_{i∈Ψ} Σ_{t=1}^{n} sqrt( (1/M) Σ_{m=1}^{M} (ŷ_{m,i,t} - ȳ_{i,t})² )    (7)

where Ψ is a test set, ŷ_{m,i,t} is the forecast of the t-th element of the i-th seasonal sequence produced by the m-th learner, and ȳ_{i,t} is the average of the forecasts produced by the M learners.
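Assuming the member forecasts are arranged in an array of shape (members, test sequences, horizon), the ensemble average and the diversity measure described above can be sketched as:

```python
import numpy as np

def ensemble_forecast(member_forecasts):
    """Average the member forecasts (axis 0 indexes the M learners)."""
    return member_forecasts.mean(axis=0)

def ensemble_diversity(member_forecasts):
    """Average, over all forecasted points, of the standard deviation
    of the forecasts produced by the individual learners."""
    return member_forecasts.std(axis=0).mean()

# M=3 learners, 2 test sequences, horizon 4 (illustrative values)
f = np.random.default_rng(1).normal(size=(3, 2, 4))
avg = ensemble_forecast(f)   # shape (2, 4)
div = ensemble_diversity(f)  # scalar diversity
```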

An ensemble diversity is generated using one of the following six strategies:

  1. The first strategy generates diversity by using different parameters of the hidden nodes. For each learner, new weights are randomly selected taking α_max as the upper bound for the sigmoid slope angles. Then, the biases are calculated from (4) and the output weights from (5). The diversity level is controlled by α_max: for larger α_max we get steeper sigmoids and higher diversity.

  2. The second strategy controls diversity by training the individual learners on different subsets of the training set. For each ensemble member, a random sample from the training set is selected without replacement; the sample size is controlled by a diversity parameter. Each learner has the same hidden node parameters, and only its output weights are tuned to the training subset.

  3. The third strategy controls diversity by training the individual learners on different subsets of features. For each ensemble member, the features are randomly sampled without replacement; the sample size is controlled by a diversity parameter. The ensemble members share the hidden node parameters; their output weights are tuned to the training set.

  4. The fourth strategy is based on hidden node pruning. The learners are created from an initial architecture with a surplus of hidden nodes. For each learner, a subset of nodes is randomly selected and the remaining nodes are pruned. The output weights are determined anew for each learner. The number of selected nodes controls the diversity level.

  5. The fifth strategy is based on hidden weight pruning. The learners are created from the same initial architecture. For each learner, a fraction of the hidden node weights is randomly selected and set to zero. The output weights are determined anew for each learner. The pruned fraction controls the diversity level.

  6. The sixth strategy generates diversity by noising the training data. For each learner, the training patterns are perturbed by Gaussian noise with zero mean; the standard deviation of the noise, σ, is the diversity parameter. Each learner has the same hidden node parameters; its output weights are tuned to the noised training data.
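As an illustration, the data-noising strategy can be sketched as follows. This is a sketch: whether both input and output patterns are perturbed is an assumption here, and the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def noised_training_sets(X, Y, n_members, sigma):
    """Yield one perturbed copy of the training patterns per ensemble
    member, adding zero-mean Gaussian noise with standard deviation
    sigma. Each member retrains only its output weights on its copy."""
    for _ in range(n_members):
        yield (X + rng.normal(0.0, sigma, X.shape),
               Y + rng.normal(0.0, sigma, Y.shape))
```

Larger sigma means more perturbed copies and therefore more diverse members, mirroring the role of the diversity parameter described above.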

4 Experiments and Results

We evaluate the performance of the proposed ensembles on four real-world forecasting problems. These concern forecasting electricity demand for four European countries: Poland (PL), Great Britain (GB), France (FR) and Germany (DE) (data was collected from www.entsoe.eu). The hourly electrical load TS express three seasonalities: yearly, weekly and daily (see Fig. 2 in [18]). The data covers the 4 years from 2012 to 2015. Atypical days such as public holidays were excluded from the data (between 10 and 20 days a year). The forecast horizon is one day, i.e. 24 hours. We forecast the daily load profile for each day of 2015. For each forecasted day, a new training set is created, which includes historical pairs of corresponding input and output patterns representing the same days of the week as the query pattern and forecasted pattern. The number of ensemble members was M = 100.

In the first experiment, to assess the impact of the hyperparameters on the forecasting accuracy, we train the ensemble members over grids of the number of hidden nodes m and the slope angle bound α_max. Fig. 2 shows the mean absolute percentage errors (MAPE) depending on these hyperparameters; the optimal values differed across datasets. Based on these results, we select m and α_max for each ensemble variant and dataset, except for strategy 1, where we control diversity by changing α_max, and strategy 4, where we control diversity by changing the number of pruned nodes; in this latter case, a larger initial number of nodes is used.

From Fig. 3, we can assess the impact of the hyperparameters on the diversity (7). As expected, greater numbers of hidden nodes and steeper sigmoids (higher α_max) cause an increase in diversity.

Figure 2: MAPE depending on hyperparameters for .
Figure 3: Diversity of depending on hyperparameters.

Fig. 4 shows MAPE and ensemble diversity depending on the diversity parameters for all ensemble variants and datasets. From this figure, we can conclude that:

  • For strategy 1, diversity increases with α_max. The optimal α_max is similar for all datasets.

  • For strategy 2, diversity decreases as the training sample grows. For small training samples, the diversity as well as MAPE are very high. The MAPE curves have an irregular character, so it is difficult to select the optimal value, and the error levels are much higher.

  • For strategy 3, diversity decreases with the number of selected features. MAPE reaches its minimum at a dataset-specific fraction of the features.

  • For strategy 4, diversity changes only slightly with the number of selected nodes. MAPE reaches its minimum at a dataset-specific number of selected nodes.

  • For strategy 5, diversity has its maximum when 10% of the hidden weights are set to 0. MAPE increases with the pruned fraction, having its lowest values with no weight pruning. Thus, any weight pruning makes the error higher. Increasing the hidden node number to 80 did not improve the results.

  • For strategy 6, we observe low sensitivity of both diversity and MAPE to the noise level σ. For smaller σ, the error only slightly decreases as σ increases; then, for higher σ, the error starts to increase.

Figure 4: MAPE (solid lines) and ensemble diversity (dashed lines) depending on the diversity parameters.

Table 1 shows the quality metrics of the forecasts for the optimal values of the diversity parameters: MAPE, median of APE, root mean square error (RMSE), mean percentage error (MPE), and standard deviation of percentage error (Std(PE)) as a measure of the forecast dispersion. For comparison, the results for a single base model are shown. In this case, for each forecasting task (each day of 2015), the hyperparameters m and α_max were optimized using grid search and 5-fold cross-validation [18]. The results for the single base model shown in Table 1 are averaged over 100 independent training sessions; Table 1 also reports the ensemble of these 100 runs (calculated from (6)). This ensemble differs from strategy 1 only in that its hyperparameters were optimized for each forecasting task, whereas for strategy 1 we fixed m and α_max for all forecasting tasks. Distributions of APE are shown in Fig. 5.

PL MAPE 1.14 1.32 1.13 1.16 1.39 1.34 1.32 1.24
Median(APE) 0.79 0.94 0.78 0.81 0.98 0.94 0.93 0.89
RMSE 304 342 299 313 380 351 358 333
MPE 0.31 0.43 0.31 0.25 0.36 0.45 0.40 0.40
Std(PE) 1.67 1.88 1.64 1.72 2.07 1.94 1.94 1.80
GB MAPE 2.51 2.74 2.55 2.49 2.92 2.65 2.61 2.52
Median(APE) 1.76 2.01 1.82 1.77 2.09 1.95 1.88 1.80
RMSE 1151 1205 1160 1131 1325 1167 1187 1147
MPE -0.53 -0.83 -0.53 -0.53 -0.55 -0.65 -0.61 -0.61
Std(PE) 3.48 3.64 3.51 3.42 4.02 3.56 3.57 3.44
FR MAPE 1.57 1.80 1.59 1.61 1.90 1.74 1.67 1.60
Median(APE) 1.04 1.24 1.07 1.06 1.33 1.24 1.15 1.07
RMSE 1378 1505 1398 1402 1597 1459 1422 1385
MPE -0.28 -0.26 -0.25 -0.31 -0.33 -0.27 -0.42 -0.42
Std(PE) 2.53 2.80 2.58 2.57 2.92 2.72 2.60 2.53
DE MAPE 1.18 1.31 1.19 1.20 1.45 1.33 1.38 1.29
Median(APE) 0.81 0.91 0.80 0.82 0.98 0.96 0.96 0.91
RMSE 1077 1155 1111 1097 1423 1159 1281 1206
MPE 0.11 0.03 0.10 0.09 0.02 0.17 0.14 0.14
Std(PE) 1.89 2.01 1.96 1.92 2.44 2.04 2.22 2.11
Table 1: Forecasting results.
Figure 5: Distribution of APE.
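The point-forecast metrics reported in Table 1 can be computed as follows (a sketch; percentage errors are taken relative to the actual values, which matches the stated sign convention for MPE):

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """MAPE, median APE, RMSE, MPE and Std(PE), as reported in Table 1."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pe = 100.0 * (actual - predicted) / actual   # percentage error
    ape = np.abs(pe)
    return {
        "MAPE": ape.mean(),
        "Median(APE)": np.median(ape),
        "RMSE": np.sqrt(np.mean((actual - predicted) ** 2)),
        "MPE": pe.mean(),      # positive -> underprediction
        "Std(PE)": pe.std(),
    }
```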

To confirm that the differences in accuracy between the ensemble variants are statistically significant, we performed a one-sided Giacomini-White test for conditional predictive ability [19]. This is a pairwise test that compares the forecasts produced by different models. We used a Python implementation of the multivariate variant of the Giacomini-White test from [20, 21]. Fig. 6 shows the results of the test as heat maps of the obtained p-values. The closer a p-value is to zero, the more significantly the forecasts produced by the model on the x-axis outperform the forecasts produced by the model on the y-axis. The black color indicates p-values larger than 0.10.

Figure 6: Results of the Giacomini-White test for the proposed ensembles.

From the results presented in Table 1 and Figs. 5 and 6, we can conclude that the most accurate ensembles are those based on strategies 1, 3 and 4. For PL and DE they significantly outperformed all other models, including the single base model and its 100-run ensemble, although for GB and FR the latter can compete with them in terms of accuracy. Note that the APE distributions for strategies 1, 3 and 4, shown in a simplified form in Fig. 5, are very similar. The worst ensemble solution was strategy 5, based on weight pruning; it was beaten by all other models for all datasets. Strategies 2 and 6 were only slightly better in terms of accuracy.

The MPE values shown in Table 1 allow us to assess the bias of the forecasts. A positive value of MPE indicates underprediction, while a negative value indicates overprediction. As can be seen from Table 1, for PL and DE all the models underpredicted, whilst for GB and FR they overpredicted. The forecasts produced by the three best ensemble models were less biased than those produced by the single base model and its 100-run ensemble.

In the next experiment, we compare the performance of one of the best ensemble solutions with that of other models based on classical statistical methods and ML methods. The baseline models used in our comparative studies are outlined below (see [22] for details). Their hyperparameters were selected on the training set in grid search procedures.

  • naive – naive model,

  • arima – autoregressive integrated moving average model,

  • ets – exponential smoothing model,

  • prophet – a modular additive regression model with nonlinear trend and seasonal components [23],

  • mlp – perceptron with a single hidden layer and sigmoid nonlinearities,

  • svm – linear epsilon-insensitive support vector machine (ε-SVM) [24],

  • anfis – adaptive neuro-fuzzy inference system,

  • lstm – long short-term memory,

  • fnm – fuzzy neighborhood model,

  • n-we – Nadaraya–Watson estimator,

  • grnn – general regression NN.

Table 2 shows MAPE for the proposed ensemble and the baseline models. Note that for each dataset, the proposed ensemble returned the lowest MAPE. To confirm the statistical significance of these findings, we performed a Giacomini-White test. The test results, depicted in Fig. 7, clearly show that the proposed ensemble outperforms all the other models in terms of accuracy. Only in two cases out of 44 were the baseline models close to it in terms of accuracy, i.e. N-WE for PL data and SVM for GB data.

ensemble naive arima ets prophet mlp svm anfis lstm fnm n-we grnn
PL 1.14 2.96 2.31 2.14 2.63 1.39 1.32 1.64 1.57 1.21 1.19 1.22
GB 2.51 4.80 3.50 3.19 4.00 2.84 2.54 2.80 2.92 3.02 3.12 3.01
FR 1.57 5.53 3.00 2.79 4.71 1.93 1.63 2.12 1.81 1.84 1.86 1.81
DE 1.18 3.13 2.31 2.10 3.23 1.58 1.38 2.48 1.57 1.30 1.29 1.30
Table 2: MAPE for and baseline models.
Figure 7: Results of the Giacomini-White test for the proposed and baseline models.

5 Conclusion

Challenging forecasting problems, such as forecasting nonstationary TS with multiple seasonality, need sophisticated models. In this work, to deal with multiple seasonal periods, we employ a pattern representation of the TS, which simplifies the relationship between input and output data and makes the problem easier to solve using simple regression models such as randomized NNs.

The randomized NNs used here have three distinct advantages: a simple single-hidden-layer architecture, extremely fast learning (no gradient-based learning, no problems with vanishing and exploding gradients), and ease of implementation (no complex optimization algorithms, no additional mechanisms such as dilation, attention, or residual connections). The base model produces a vector output, which means it is able to forecast the whole time series sequence at once.

We propose an ensemble of these base models with six different methods of managing diversity. Among these, the most promising ones turned out to be: learning the ensemble members using different parameters of the hidden nodes (strategy 1), training the individual learners on different subsets of features (strategy 3), and hidden node pruning (strategy 4). These three methods brought comparable results in an experimental study involving four real-world forecasting problems (short-term load forecasting). Strategy 1 has the additional advantage of having only two hyperparameters to tune, i.e. the number of hidden nodes and the interval bounds for the random weights. In the comparative study, the proposed ensemble significantly outperformed both classical statistical methods and machine learning methods.

In our further work, we plan to introduce an attention mechanism into our randomization-based forecasting models to select training data, and to develop probabilistic forecasting models based on them.


  • [1] Palit, A. K., Popovic, D.: Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications. Springer-Verlag, London 2005.
  • [2] Benidis, K., Rangapuram, S.S., Flunkert, V., Wang, B., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas, D., Stella, L., Callot L. and Januschowski, T.: Neural forecasting: Introduction and literature overview. arXiv:2004.10240 (2020)
  • [3] Dudek, G.: Neural networks for pattern-based short-term load forecasting: A comparative study. Neurocomputing 205, 64–74 (2016)
  • [4] Torres, J.F., Hadjout, D. Sebaa, A., Martínez-Álvarez, F., Troncoso, A.: Deep Learning for Time Series Forecasting: A Survey. Big Data 9(1), 3–21 (2021)
  • [5] Hewamalage, H., Bergmeir, C., Bandara, K.: Recurrent Neural Networks for Time Series Forecasting: Current Status and Future Directions. International Journal of Forecasting 37(1), 388–427 (2021)
  • [6] Reeve, H.W.J., Brown, G.: Diversity and degrees of freedom in regression ensembles. Neurocomputing 298, 55–68 (2018)
  • [7] Smyl, S.: A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36(1), 75–85 (2020)
  • [8] Oreshkin, B.N., Carpov, D., Chapados, N. and Bengio, Y.: N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. 8th International Conference on Learning Representations, ICLR, (2020).
  • [9] Ren, Y., Zhang, L., Suganthan, P.N.: Ensemble Classification and Regression – Recent Developments, Applications and Future Directions. IEEE Computational Intelligence Magazine, 11(1), 41–53 (2016)
  • [10] Alhamdoosh, M., Wang, D.: Fast decorrelated neural network ensembles with random weights. Information Sciences 264, 104–117 (2014)
  • [11] Mesquita, D.P.P., Gomes, J.P.P, Rodrigues, L.R., Oliveira, S.A.F., Galvão, R.K.H.: Building selective ensembles of Randomization Based Neural Networks with the successive projections algorithm. Applied Soft Computing 70, 1135–1145 (2018)
  • [12] Huang, C., Li, M., Wang, D.: Stochastic configuration network ensembles with selective base models. Neural Networks, 264, 106–118 (2021)
  • [13] Li, S., Goel, L., Wang, P.: An ensemble approach for short-term load forecasting by extreme learning machine. Applied Energy 170, 22–29 (2016)
  • [14] Qiu, X., Suganthan, P.N., Amaratunga, G.A.J.: Ensemble incremental learning Random Vector Functional Link network for short-term electric load forecasting. Knowledge-Based Systems 145, 182–196 (2018)
  • [15] de Almeida, R., Goh, Y.M., Monfared, R. et al. An ensemble based on neural networks with random weights for online data stream regression. Soft Computing 24, 9835–9855 (2020)
  • [16] Hu, Y., Qu, B., Wang, J., Liang, J., Wang, Y., Yu, K., Li, Y., Qiao, K.: Short-term load forecasting using multimodal evolutionary algorithm and random vector functional link network based ensemble learning. Applied Energy 285, 116415 (2021)
  • [17] Dudek, G.: Generating random parameters in feedforward neural networks with random hidden nodes: Drawbacks of the standard method and how to improve it. Neural Information Processing, ICONIP 2020, Springer CCIS 1333, 598–606 (2020)
  • [18] Dudek G.: Randomized Neural Networks for Forecasting Time Series with Multiple Seasonality. 16th International Work-Conference on Artificial Neural Networks IWANN 2021 (in print).
  • [19] Giacomini, R., White, H.: Tests of conditional predictive ability. Econometrica 74(6), 1545–1578 (2006)
  • [20] Epftoolbox library - https://github.com/jeslago/epftoolbox, epftoolbox documentation - https://epftoolbox.readthedocs.io
  • [21] Lago, J. Marcjasz, G., De Schutter, B., Weron, R.: Forecasting day-ahead electricity prices: A review of state-of-the-art algorithms, best practices and an open-access benchmark. Applied Energy 293, 116983 (2021)
  • [22] Dudek, G., Pełka, P.: Pattern Similarity-based Machine Learning Methods for Mid-term Load Forecasting: A Comparative Study. Applied Soft Computing 104, 107223 (2021)
  • [23] Taylor, S.J., Letham, B.: Forecasting at Scale. The American Statistician 72(1), 37–45 (2018)
  • [24] Pełka P.: Pattern-based Forecasting of Monthly Electricity Demand using Support Vector Machine. International Joint Conference on Neural Networks IJCNN 2021 (in print).