I. Introduction
Electricity demand forecasting is an essential tool in all sectors of the electric power industry. Midterm load forecasting (MTLF), which involves forecasting the daily peak load for future months as well as monthly electricity demand, is necessary for power system operation and planning in such areas as maintenance scheduling, fuel reserve planning, hydrothermal coordination, planning of electricity import and export, and security assessment. In deregulated power systems, MTLF is a basis for the negotiation of forward contracts. Forecast accuracy translates directly into financial performance for energy companies and energy market participants: the financial impact can be measured in millions of dollars for every point of forecasting accuracy gained. All the above reasons justify interest in new forecasting tools for MTLF.
In this work, we focus on monthly electricity demand forecasting. Monthly electricity demand time series exhibit a nonlinear trend, yearly cycles and a random component. The trend depends on the rate of economic development in a country, while the seasonal cycles are related to the climate, weather factors and the variability of seasons [1]. Factors disrupting electricity demand over a midterm horizon include unpredictable economic events and political decisions [2].
MTLF methods can be divided into statistical/econometric models and machine learning/computational intelligence models [3]. Typical examples of the former are ARIMA, ETS, and linear regression. ARIMA and ETS can deal with seasonal time series, but linear regression requires additional operations such as decomposition or extension of the model with periodic components [4]. Statistical models have undoubted advantages: they are relatively simple, robust, efficient, and automatic, so they can be used by non-expert users.
The flexibility of machine learning (ML) models has increased researchers' interest in them as MTLF tools [5]. Of these, neural networks (NNs) are the most explored because of their attractive features, such as learning capability, the universal approximation property, nonlinear modeling, massive parallelism and the ease of specifying a loss function to align it better with forecasting goals. Some examples of using different NN architectures for MTLF are: [6], where an NN learns on historical demand and weather factors; [7], where a Kohonen NN was used; [8], where NNs were supported by fuzzy logic; [9], where a generalized regression NN was used; [10], where NNs, linear regression and AdaBoost were used; [11], where weighted evolving fuzzy NNs were combined; and [12], where recurrent NNs were used. Among other machine learning MTLF models, the following can be mentioned: support vector machines [13] and pattern similarity-based models [14].
Recent trends in ML, such as deep learning and especially deep recurrent NNs (RNNs), are very attractive for time series forecasting [15]. RNNs, whose connections between nodes form a directed graph along a temporal sequence, are able to exhibit temporal dynamic behavior, using their internal state (memory) to process sequences of inputs. Recent works have reported that RNNs such as the LSTM provide high forecasting accuracy and outperform most of the traditional statistical and ML methods, including ARIMA, support vector machines and shallow NNs [16]. There are many examples of the application of LSTMs to load forecasting: [17], [18], [19].
Many new ideas in the field of deep learning have been successfully applied to time series forecasting. For example, in [20] a bidirectional LSTM is proposed for short-term scheduling in power markets. This solution has two benefits: long-range memory and bidirectional processing. It takes advantage of deep architectures, which are able to build up progressively higher-level representations of data by piling up RNN layers on top of each other. A residual recurrent highway network for learning deep sequence predictions was proposed in [21]. It contains highways within the temporal structure of the network for unimpeded information propagation, thus alleviating the vanishing gradient problem. Hierarchical structure learning is posed as a residual learning framework to prevent performance degradation problems. Another example of using a new deep learning solution for time series forecasting is the N-BEATS model proposed in [22]. Its architecture is based on backward and forward residual links and a deep stack of fully-connected layers. N-BEATS has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train.
In addition to point forecasting, deep learning also enables probabilistic forecasting. A model proposed in [23], DeepAR, produces accurate probabilistic forecasts by training an autoregressive RNN on a large number of related time series. This model makes probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all subranges in the prediction horizon. Another solution for probabilistic time series forecasting was proposed in [24]. It combines state space models with deep learning. By parametrizing a per-time-series linear state space model with a jointly-learned RNN, the method retains the desired properties of state space models, such as data efficiency and interpretability, while making use of the ability to learn complex patterns from raw data offered by deep learning approaches.
To improve forecasting performance, RNNs have also been combined with other methods such as ETS. Such a model won the M4 forecasting competition in 2018 [25]. This competition utilized 100,000 real-life time series and incorporated all major forecasting methods, including those based on AI and ML, as well as traditional statistical ones [26]. The winning model, developed by Slawek Smyl [27], is a hybrid approach utilizing both statistical and ML features. It combined ETS with an advanced LSTM supported by such mechanisms as dilation, residual connections and attention [28], [29], [30]. It produced the most accurate forecasts as well as the most precise prediction intervals. According to sMAPE, it was substantially more accurate than the combination benchmark of the competition, which is a huge improvement. For monthly data (48,000 time series), it outperformed all 60 other submissions, achieving the highest accuracy according to each of the three performance measures.
In this work, we propose a forecasting model for MTLF based on the winning submission to the M4 competition for monthly data. It combines ETS, LSTM and ensembling. ETS enables the model to capture the main components of each individual time series, such as seasonality and level, while LSTM allows nonlinear trends and cross-learning. A common learning procedure for LSTM and ETS, with a penalized pinball loss, leads to simultaneous optimization of data representation and forecasting performance. Ensembling at three levels is a powerful regularization method which reduces the model variance.
The rest of the work is organized as follows. Section 2 describes the proposed forecasting model: its architecture, features, components and implementation details as well as data flow and processing. Section 3 describes the experimental framework used to evaluate the performance of the proposed model. Finally, Section 4 concludes the work.
II. Forecasting Model
The proposed model is based on the winning submission to the M4 forecasting competition 2018 for monthly data and point forecasts [27]. It is a hybrid and hierarchical forecasting model in which ETS and advanced LSTM are mixed into a common framework. The model architecture, its specific features and its components are described below.
II-A. Framework and Features
The proposed forecasting model is shown in Fig. 1. It is composed of:

ETS – a Holt-Winters-type multiplicative seasonal model. It is used for extracting two components from each time series: level and seasonality. ETS loads the set of time series, calculates the level and seasonal components individually for each series, and returns the sets of levels and seasonal components.

Preprocessing – the level and seasonal components are used for deseasonalization and adaptive normalization of the time series. The inputs to the preprocessing module are the set of time series and the sets of level and seasonal components. The preprocessed data are divided into input and output training data and returned as the training set.

RD-LSTM – a residual dilated LSTM composed of four layers. Due to its recurrent nature, this model is capable of learning long-term dependencies in sequential data. RD-LSTM learns in cross-learning mode on the training set and returns the forecasts for all time series.

Postprocessing – the forecasts of the deseasonalized and normalized time series are reseasonalized and renormalized. The inputs to the postprocessing module are the forecasts and the sets of level and seasonal components. The output is the set of final forecasts for each time series.

Ensembling – the forecasts produced by the individual models are averaged. This further enhances the robustness of the method, mitigating model and parameter uncertainty. The ensembling module receives the sets of forecasts produced by the individual models, aggregates them, and returns the set of forecasts for all time series.

Stochastic gradient descent (SGD) – the parameters of both ETS and RD-LSTM are updated by the same overall optimization procedure, SGD, with the overarching goal of minimizing forecasting errors.
The proposed model has a hierarchical structure, i.e. the data are exploited in a hierarchical manner. Both local and global time series features are extracted. The global features are learned by RD-LSTM across many time series. The specific features of each individual time series are extracted by ETS. Thus, each series has a partially unique and partially shared model.
Note the hybrid structure of the model, where statistical modeling is combined concurrently with ML algorithms. The model combines ETS, advanced LSTM and ensembling. ETS is focused on each individual series and enables the model to capture its main components such as seasonality and level. These components are used for time series preprocessing, normalization and deseasonalization.
An advanced LSTM-based RNN allows nonlinear trends and cross-learning. This is an extended, multilayer version of LSTM with residual dilated LSTM blocks. The dilated recurrent skip connections and the spatial shortcut path from lower layers, applied in this solution, allow the model to better capture long-term seasonal relationships and ensure more efficient training. The RD-LSTM model is trained on many time series (cross-learning). Cross-learning is necessary to train deep NNs, which have many parameters. Moreover, it enables the method to capture the shared features and components of the time series.
ETS and RD-LSTM are optimized simultaneously, i.e. the ETS parameters and the RD-LSTM weights are optimized by SGD at the same time. The same overall learning procedure optimizes the model, including data preprocessing. So, the learning process includes representation learning: searching for the most suitable representations of the input and output data (individually for each time series), which ensures the most accurate forecasts. It is worth noting the dynamic character of the training set, which is related to representation learning. The training set is updated in each epoch of RD-LSTM learning. This is because SGD updates the ETS parameters in each epoch, and therefore the level and seasonal components used for preprocessing are updated as well.
Ensembling is seen as a much more powerful regularization technique than more popular alternatives, e.g. dropout or L2norm penalty [22]. In our case, ensembling combines individual forecasts at three levels: stage of training level, data subset level and model level. This reduces the variance related to the stochastic nature of SGD, and also related to data and parameter uncertainty.
II-B. Exponential Smoothing
The complex nature of a time series, e.g. nonstationarity, nonlinear trend and seasonal variations, makes forecasting difficult and puts high demands on the models. A typical approach in this case is to simplify the forecasting problem by deseasonalization, detrending or decomposition. A time series is usually decomposed into seasonal, trend and stochastic components. The components, expressing less complexity than the original time series, can be modeled independently using simpler models. The most popular methods of decomposition are [31]: additive decomposition, multiplicative decomposition, X-11, SEATS, and STL. Although very useful, this approach has a drawback: it separates the preprocessing from the forecasting, which results in a final solution that is not optimal. Some classic statistical models, such as ETS, employ a better way: the forecasting model has a built-in mechanism to deal with seasonality, so the final model uses the optimal decomposition of the time series.
In our approach, we use ETS as the preprocessing tool. ETS extracts two components from the time series: level (smoothed value) and seasonality. Then we use these components to normalize and deseasonalize the original time series. The preprocessed time series are forecasted by RD-LSTM. ETS and RD-LSTM are optimized simultaneously using SGD, so the resulting forecasting model, including data preprocessing, is optimized as a whole. This distinctive feature of the proposed approach needs to be emphasized.
The ETS model used in this study was inspired by the Holt-Winters multiplicative seasonal model. However, it has been simplified by the removal of the linear trend component. This is because trend forecasting is the task of RD-LSTM, which is able to produce a nonlinear trend, more valuable in our case. The updating formulas for the ETS model with a seasonal cycle length of twelve (suitable for monthly data) are as follows [27]:
l_t = α y_t / s_t + (1 - α) l_{t-1}
s_{t+12} = β y_t / l_t + (1 - β) s_t    (1)
where y_t is the time series value at timepoint t, l_t and s_t are the level and seasonal components, respectively, and α, β ∈ [0, 1] are smoothing coefficients.
The level equation shows a weighted average between the seasonally adjusted observation y_t / s_t and the level for time t - 1. The seasonal equation expresses the seasonal component for time t + 12 as a weighted average between a new estimate of the seasonal component, y_t / l_t, and the past estimate s_t. Fig. 2 depicts an example of the monthly electricity demand time series and its level and seasonal components obtained from (1).
The ETS model parameters, i.e. twelve initial seasonal components and two smoothing coefficients for each time series, were adjusted together with the RD-LSTM weights by SGD. Knowing these parameters allows the level and seasonal components to be calculated, which are then used for preprocessing: deseasonalization and normalization.
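To make the recursion in (1) concrete, the sketch below iterates the two updating formulas over a series. It is a minimal illustration, not the paper's implementation: the initial level value used here is an assumption (in the paper, the initial seasonal components and smoothing coefficients are adjusted by SGD), and the seasonal cycle length is taken from the length of the initial seasonal list.

```python
def ets_decompose(y, alpha, beta, init_seasonals):
    """Simplified multiplicative Holt-Winters recursion in the spirit of (1):
    no trend term; the seasonal cycle length equals len(init_seasonals).

    Returns the level series l and the extended seasonal series s,
    where s[t] is the seasonal component for time t."""
    s = list(init_seasonals)
    l = []
    for t, y_t in enumerate(y):
        prev_level = l[-1] if l else y_t / s[t]  # assumed initialization
        l_t = alpha * y_t / s[t] + (1 - alpha) * prev_level
        l.append(l_t)
        # seasonal component for time t + cycle length:
        s.append(beta * y_t / l_t + (1 - beta) * s[t])
    return l, s
```

On a purely multiplicative series (constant level times a repeating seasonal pattern), the recursion keeps the level constant and reproduces the pattern.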
II-C. Pre- and Postprocessing
The level and seasonal components are calculated for all points of each series and are then used for deseasonalization and adaptive normalization during on-the-fly preprocessing. This is the most crucial element of the forecasting procedure as it determines its performance. The time series is preprocessed in each training epoch using the updated values of the level and seasonal components. These updated values are calculated from (1), where the ETS parameters are increasingly fine-tuned in each epoch by SGD.
The time series is preprocessed using rolling windows: an input window and an output window. Both windows have a length of twelve, which is equal to the length of both the seasonal cycle and the forecast horizon. The input window contains twelve consecutive elements of the time series, which after preprocessing will be the RD-LSTM inputs. The corresponding output window contains the next twelve consecutive elements, which after preprocessing will be the RD-LSTM outputs. The time series fragments inside both windows are normalized by dividing them by the last value of the level in the input window, and then divided further by the relevant seasonal components. As a result of this operation, we obtain positive input and output values close to one. Finally, to limit the destructive impact of outliers on the forecasts, a squashing function, log(.), is applied. The resulting preprocessing can be expressed as follows:
x_t = log(y_t / (l* s_t))    (2)
where x_t is the preprocessed t-th element of the time series, l* is the last value of the level in the input window, and s_t is the t-th seasonal component.
Note that the normalization is adaptive and local: the "normalizer" follows the series values. This allows us to include the current features of the series (l* and s_t) in the input and output variables.
The preprocessed elements of the time series contained in the successive pairs of input and output windows are represented by pairs of input and output vectors (x-vectors), one pair per window position.
These vectors are included in the training subset for the given time series. The training subsets for all time series are combined to form the training set, which is used for RD-LSTM cross-learning. Note the dynamic character of the training set: it is updated in each epoch because the level and seasonal components in (2) are updated.
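As an illustration, the rolling-window preprocessing can be sketched as below. This is a simplified reading, with the window step assumed to be one time point and the window length w a parameter (twelve in the paper); the level and seasonal series would come from ETS and change in every epoch.

```python
import math

def make_training_pairs(y, level, seasonal, w=12):
    """Build (input, output) x-vector pairs per (2):
    x_t = log(y_t / (l_star * s_t)), where l_star is the last level
    value inside the current input window."""
    pairs = []
    for start in range(len(y) - 2 * w + 1):
        l_star = level[start + w - 1]  # last level in the input window

        def x(t):
            return math.log(y[t] / (l_star * seasonal[t]))

        x_in = [x(t) for t in range(start, start + w)]
        x_out = [x(t) for t in range(start + w, start + 2 * w)]
        pairs.append((x_in, x_out))
    return pairs
```

On a purely multiplicative series, all preprocessed values reduce to log(1) = 0, confirming that level and seasonality have been filtered out.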
Fig. 3 shows the input and output vectors for the time series shown in Fig. 2. The upper panel shows the corresponding input and output x-vectors representing the first two seasonal cycles of the time series. The lower panel shows the input and output x-vectors representing the last two seasonal cycles. It can be seen from this figure that the x-vectors express patterns of the time series fragments after filtering out both level and seasonality. This pattern representation of time series has been used successfully in earlier studies concerning ML forecasting models, especially similarity-based models [32]. Different definitions of time series patterns can be found in [33]. But those definitions are fixed, while in this work we use dynamic patterns which change during learning (compare the patterns in the first and last training epochs in Fig. 3).
RD-LSTM operates on the preprocessed time series values. In the postprocessing step, the forecasts generated by RD-LSTM need to be "unwound" in the following way:
ŷ_t = exp(x̂_t) l* s_t    (3)
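A minimal sketch of this inverse transformation, under the same assumptions as the preprocessing sketch (l* is the last level value in the input window and s_t the seasonal component):

```python
import math

def postprocess(x_hat, l_star, seasonal):
    """Invert the preprocessing per (3): y_hat_t = exp(x_hat_t) * l_star * s_t."""
    return [math.exp(x) * l_star * s for x, s in zip(x_hat, seasonal)]
```

A forecast of all zeros in the preprocessed domain maps back to the level times the seasonal pattern.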
II-D. Residual Dilated LSTM
LSTM is a special kind of RNN capable of learning long-term dependencies in sequential data [34]. A common LSTM block is composed of a memory cell, which can maintain its state over time, and three nonlinear "regulators", called gates, which control the flow of information inside the block. A typical LSTM block is shown in Fig. 4. In this diagram, h_t and c_t denote the hidden state and the cell state at time step t, respectively. The cell state contains information learned from the previous time steps. Information can be added to or removed from the cell state using the gates: the input gate (i), the forget gate (f) and the output gate (o). At each time step t, the block uses the past state of the network, i.e. c_{t-1} and h_{t-1}, and the input x_t to compute the output h_t and the updated cell state c_t. The hidden and cell states are recurrently connected back to the block input. All of the gates are controlled by the hidden state of the past cycle and the input x-vector. Most modern studies incorporate many of the improvements that have been made to the LSTM architecture since its original formulation [35].
Detailed mathematical expressions describing the LSTM block are given below. The cell state at time step t is:
c_t^1 = f_t^1 ⊙ c_{t-1}^1 + i_t^1 ⊙ g_t^1    (4)
where the operator ⊙ denotes the Hadamard product (elementwise product), g_t^1 is the candidate cell state, and the superscript 1 refers to the first layer of the RD-LSTM network, where we use the standard LSTM block.
The hidden state at time step t is given by:

h_t^1 = o_t^1 ⊙ tanh(c_t^1)    (5)
The formulas for the gates and the candidate cell state are as follows:

i_t^1 = σ(W_i x_t + V_i h_{t-1}^1 + b_i)    (6)

f_t^1 = σ(W_f x_t + V_f h_{t-1}^1 + b_f)    (7)

o_t^1 = σ(W_o x_t + V_o h_{t-1}^1 + b_o)    (8)

g_t^1 = tanh(W_g x_t + V_g h_{t-1}^1 + b_g)    (9)
where W, V and b are the input weights, recurrent weights and biases, respectively, and σ is the sigmoid activation function of the gates.
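The step defined by (4)-(9) can be sketched for scalar states as follows. This is an illustrative toy, with scalar weights in plain dicts rather than the weight matrices of the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, V, b):
    """One LSTM step with scalar states, following (4)-(9).
    W, V, b are dicts keyed by 'i', 'f', 'o', 'g'."""
    i = sigmoid(W['i'] * x + V['i'] * h_prev + b['i'])    # input gate, (6)
    f = sigmoid(W['f'] * x + V['f'] * h_prev + b['f'])    # forget gate, (7)
    o = sigmoid(W['o'] * x + V['o'] * h_prev + b['o'])    # output gate, (8)
    g = math.tanh(W['g'] * x + V['g'] * h_prev + b['g'])  # candidate cell state, (9)
    c = f * c_prev + i * g                                # cell state, (4)
    h = o * math.tanh(c)                                  # hidden state, (5)
    return h, c
```

Saturating the gates shows their roles: with the forget gate driven to 0 and the input and output gates to 1, the old cell state is wiped and the new cell state is just the candidate value.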
In our study, we also use a dilated residual version of LSTM (RD-LSTM). The dilated RNN architecture was proposed in [28] as a solution to tackle three major challenges of RNNs when learning on long sequences: complex dependencies, vanishing and exploding gradients, and efficient parallelization. It is characterized by multiresolution dilated recurrent skip connections. Moreover, it reduces the number of parameters needed and enhances training efficiency significantly in tasks involving very long-term dependencies. A dilated LSTM block receives as input states not the most recent ones, h_{t-1} and c_{t-1}, but earlier states, h_{t-d} and c_{t-d}, where d is a dilation. So, to compute the current states of the LSTM block, the most recent states are skipped. Usually, multiple dilated recurrent layers are stacked with hierarchical dilations to construct a system which learns the temporal dependencies of different scales at different layers. In [28], it was shown that this solution can reliably improve the ability of recurrent models to learn long-term dependencies in problems from different domains. Dilated LSTM seems particularly useful for seasonal time series, where the relationships between the series elements have a cyclical character. This character can be incorporated into the model by dilations related to the seasonality.
A residual version of LSTM was proposed in [29]. A standard memory cell, which learns long-term dependencies of sequential data, provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. The residual LSTM provides an additional spatial shortcut path from lower layers for efficient training of deep LSTM architectures. To avoid a conflict between spatial- and temporal-domain gradient flows, the residual LSTM separates the spatial shortcut path from the temporal one. This gives greater flexibility to deal with vanishing or exploding gradients.
Fig. 5 depicts the residual dilated LSTM block used in this study, which we denote by RD-LSTM. In this figure, h_t^{p-1} is a shortcut path from the (p-1)-th layer that is added to the updated cell state c_t^p processed by the tanh function. Our implementation of the residual LSTM is a simplified version of the original one: the peephole connections are removed, as are the linear transformations of the shortcut path h_t^{p-1} and the transformed cell state tanh(c_t^p). These transformations are not necessary because in our case the dimensions of h_t^{p-1} and tanh(c_t^p) match that of h_t^p. The mathematical expressions describing the RD-LSTM block are as follows:
i_t^p = σ(W_i^p h_t^{p-1} + V_i^p h_{t-d}^p + b_i^p)    (10)

f_t^p = σ(W_f^p h_t^{p-1} + V_f^p h_{t-d}^p + b_f^p)    (11)

o_t^p = σ(W_o^p h_t^{p-1} + V_o^p h_{t-d}^p + b_o^p)    (12)

g_t^p = tanh(W_g^p h_t^{p-1} + V_g^p h_{t-d}^p + b_g^p)    (13)

c_t^p = f_t^p ⊙ c_{t-d}^p + i_t^p ⊙ g_t^p    (14)

h_t^p = o_t^p ⊙ tanh(c_t^p) + h_t^{p-1}    (15)
where the superscript p indicates the layer number (from 2 to 4 in our case, see below) and d is a dilation (3, 6 or 12 in our case, see below).
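A scalar toy of one RD-LSTM layer is sketched below. Two things are assumptions of the sketch: states at negative times are taken as zero, and the residual shortcut from the layer below is added after the output gate (one plausible reading of the block description above); the paper's vector-valued states and learned weight matrices are replaced by plain scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rdlstm_layer(inputs, d, W, V, b):
    """Run a residual dilated LSTM layer (scalar states) over a sequence.
    Recurrent connections reach back d steps (the dilation); the input
    from the layer below is added as a residual shortcut to the output."""
    h_hist = [0.0] * d  # states at negative times assumed zero
    c_hist = [0.0] * d
    out = []
    for t, x in enumerate(inputs):
        h_d, c_d = h_hist[t], c_hist[t]  # states from time t - d
        i = sigmoid(W['i'] * x + V['i'] * h_d + b['i'])
        f = sigmoid(W['f'] * x + V['f'] * h_d + b['f'])
        o = sigmoid(W['o'] * x + V['o'] * h_d + b['o'])
        g = math.tanh(W['g'] * x + V['g'] * h_d + b['g'])
        c = f * c_d + i * g
        h = o * math.tanh(c) + x  # residual shortcut from the layer below
        h_hist.append(h)
        c_hist.append(c)
        out.append(h)
    return out
```

With the gates saturated so that the cell contributes nothing, the layer reduces to the identity, which is exactly the passthrough behavior the residual shortcut is meant to guarantee.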
The proposed RD-LSTM architecture, which is the result of painstaking experimentation on the M4 monthly data (48,000 time series), is depicted in Fig. 6. It is composed of four recurrent layers and a linear unit (LU). The first layer consists of the standard LSTM block shown in Fig. 4. The subsequent three layers consist of RD-LSTM blocks (Fig. 5) with increasing dilations of 3, 6 and 12. The last element is the linear unit, which transforms the output of the last layer, h_t^4, into the forecast of the output x-vector:
x̂_t = W_{LU} h_t^4 + b_{LU}    (16)
The learnable parameters of RD-LSTM are the input weights W, the recurrent weights V, and the biases b. They are tuned in cross-learning mode simultaneously with the ETS parameters using SGD. The length of the cell and hidden states, the same for all layers, was selected on the training set (see Section III for details).
Note that an input is not a scalar but a vector representing a sequence of the time series of length 12, i.e. one seasonal period. This exposes the RD-LSTM directly to the immediate history of the series. An output is a vector representing the whole forecasted sequence of length 12.
II-E. Ensembling
The forecasts generated by the RDLSTM model are ensembled at three levels:

Stage of training level. Averaging the forecasts produced in the most recent training epochs.

Data subset level. Averaging the forecasts from a pool of forecasting models which learn on subsets of the training set.

Model level. Averaging the forecasts from independent runs of the pool of models created at the data subset level.
The idea behind averaging the forecasts produced in a few of the most recent training epochs (the first level of ensembling) is that averaging calms down the noisy SGD optimization process. SGD uses minibatches of training samples to estimate the actual gradients. The resulting search process operates on approximated gradients, which makes the trajectory noisy. Averaging the forecasts obtained in the most recent epochs, when the algorithm converges around a local minimum, may reduce the effect of the stochastic search and produce more accurate forecasts.
At the data subset level, the forecasts from K models which learn on subsets of the training set are averaged. The training set is composed of subsets containing the training samples for the individual time series. To create the training subsets, the set of time series is first split randomly into K subsets of similar size. The k-th training subset contains the training samples for all time series excluding those in the k-th subset of series. Each of the K models learns on its own training subset and generates forecasts for the time series. Then the forecasts for each time series produced by the pool of models are averaged.
The last level of ensembling simply averages the forecasts for each time series generated in independent runs of a pool of models. In each run, the training subsets are created anew.
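The joint averaging over the pool of K models and R runs can be sketched as follows; this is a direct implementation of the simple average, with the forecast vectors assumed to come from the trained models:

```python
def ensemble_average(forecasts):
    """Average y-vectors over R runs x K models.
    forecasts[r][k] is the forecast vector of the k-th model in the r-th run."""
    R = len(forecasts)
    K = len(forecasts[0])
    H = len(forecasts[0][0])  # forecast horizon
    return [sum(forecasts[r][k][h] for r in range(R) for k in range(K)) / (K * R)
            for h in range(H)]
```
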
Note that the diversity of learners, which is a key property which governs the ensemble performance [36], has various sources in our proposed approach. They include (i) data uncertainty: learning on minibatches, learning on different subset of the training set, and (ii) parameter uncertainty: learning using different initial values of the model parameters in each run.
The last two ensembling levels are depicted in Fig. 7. In this figure, K = 4, so four training subsets are created in each run. For each time series, the forecasts from all models and runs are averaged. This is shown as a joint operation for levels 2 and 3 in the figure, and can be expressed as:
ŷ^{avg} = (1 / (K R)) Σ_{r=1}^{R} Σ_{k=1}^{K} ŷ_{k,r}    (17)
where K is the size of the pool of models, R is the number of runs, and ŷ_{k,r} is the forecasted y-vector generated by the k-th model in the r-th run.
In this work, we use a simple average for ensembling, but other functions, such as median, mode, or trimmed mean could be applied [36]. As shown in [37], a simple average of forecasts often outperforms more complicated weighting schemes.
Note that the models created at the data subset level in all runs can be trained simultaneously. Thus, the proposed forecasting system is well suited to parallel implementation.
II-F. Loss Function
As a loss function, we used the pinball loss operating on the normalized and deseasonalized RD-LSTM outputs and actuals:
L_t = (x_t - x̂_t) τ          if x_t ≥ x̂_t
L_t = (x̂_t - x_t)(1 - τ)    if x_t < x̂_t    (18)
where x_t is the actual value, x̂_t is the forecast, and τ ∈ (0, 1) controls the function asymmetry.
When τ = 0.5, the loss function is symmetrical and penalizes positive and negative deviations equally. When the model tends to have a positive or negative bias, we can reduce the bias by introducing τ smaller or larger than 0.5, respectively. Thus, the asymmetric pinball function, penalizing positive and negative deviations differently, allows the method to deal with bias. It is worth noting that pinball loss is commonly employed in quantile regression and probabilistic forecasting [38].
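A direct sketch of (18), averaged over a forecast horizon:

```python
def pinball(actual, pred, tau):
    """Mean pinball loss per (18): under-forecasts (actual >= predicted) are
    weighted by tau, over-forecasts by (1 - tau)."""
    return sum((a - p) * tau if a >= p else (p - a) * (1 - tau)
               for a, p in zip(actual, pred)) / len(actual)
```

With tau = 0.5, both deviation signs cost the same; tau > 0.5 penalizes under-forecasting more, pushing the forecasts upward, which is how the asymmetry counteracts a systematic bias.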
The experience gained during the M4 competition shows that the smoothness of the level time series influences the forecasting accuracy substantially. It turned out that when the input to the RD-LSTM was smooth, the RD-LSTM focused on predicting the trend instead of overfitting to spurious, seasonality-related patterns. A smooth level curve also means that the seasonal components have absorbed the seasonality properly. To deal with the wiggliness of the level curve, a penalized loss function was introduced as follows:

Calculate the logarithms of the quotients of the neighboring points of the level time series: d_t = log(l_{t+1} / l_t),

Calculate the differences of the above: e_t = d_{t+1} - d_t,

Square and average them for each series.
The resulting penalized loss function for a given time series takes the form:
L = (1/n) Σ_{t=1}^{n} L_t + λ · avg_t(e_t²)    (19)
where n is the number of forecasts, e_t are the differences defined above, and λ is a regularization parameter which determines how strongly the wiggliness of the level curve is penalized.
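The wiggliness penalty from the three steps above can be sketched as:

```python
import math

def level_wiggliness(level):
    """Average squared second difference of the log-level series:
    d_t = log(l_{t+1} / l_t), e_t = d_{t+1} - d_t, penalty = mean(e_t^2)."""
    d = [math.log(level[t + 1] / level[t]) for t in range(len(level) - 1)]
    e = [d[t + 1] - d[t] for t in range(len(d) - 1)]
    return sum(v * v for v in e) / len(e)
```

An exponentially growing level is perfectly smooth under this metric (zero penalty), while an oscillating level is penalized heavily.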
The level wiggliness penalty affected the performance of the method significantly and contributed greatly to winning the M4 competition [27].
III. Experimental Study
This section presents the results of applying the proposed forecasting model to monthly electricity demand forecasting for 35 European countries. The data were obtained from the ENTSO-E repository (www.entsoe.eu). The time series have different lengths: 24 years (11 countries), 17 years (6 countries), 12 years (4 countries), 8 years (2 countries), and 5 years (12 countries). The last year of data is 2014. The time series are presented in Fig. 8. As can be seen from this figure, the monthly electricity demand time series exhibit different levels, nonlinear trends, strong annual cycles and variable variances. The shapes of the yearly cycles change over time.
The proposed model, as well as the comparative models, was applied to forecast the twelve monthly demands for the last year of data, 2014. The data from the previous years were used for hyperparameter selection and learning.
The model hyperparameters were selected on the training data. A typical procedure was to learn the model on the time series fragments up to 2012 and validate it on 2013 (during the M4 competition, a very strong correlation between the errors for the test period and the last period of the training data was observed). The selected hyperparameters were used to construct the model for 2014. They were as follows:

number of epochs: ,

learning rate: ,

length of the cell and hidden states: ,

asymmetry parameter in pinball loss: ,

regularization parameter: ,

ensembling parameters: , , .
The model was implemented in C++ relying on the DyNet library [39]. It was compiled in Visual Studio 2017 (Windows 10) and run in parallel on an 8core CPU (AMD Ryzen 7 1700, 3.0 GHz, 32 GB RAM).
III-A. Comparative Models
The proposed model was compared with state-of-the-art models based on ML as well as classical statistical models such as ARIMA and ETS. All ML models except LSTM use a pattern representation of the time series [14]. Patterns, which express preprocessed repetitive sequences in a time series, ensure input and output data unification through trend filtering and variance equalization. Consequently, the relationship between input and output data is simplified, and the transformed forecasting problem can be solved using simple models.
The comparative models are described in detail in [14] and outlined below.

NNw – nearest neighbor weighted regression model. It estimates a vector-valued regression function as an average of the output patterns in a varying neighborhood of a query pattern. A weighting function allows the model to take into account the similarity between the query pattern and its nearest neighbors. The model hyperparameters are the input pattern length and the number of nearest neighbors. A linear weighting function was used.

FNM – fuzzy neighborhood model. In this case, the regression model is similar to NNw, except that it aggregates not only the nearest neighbors of the query pattern but all training patterns. The weighting function takes the form of a membership function (a Gaussian-type function) which assigns to each training pattern a degree of membership to the query pattern neighborhood. The model hyperparameters are the input pattern length and the membership function width.

NWE – Nadaraya–Watson estimator. It is a kernel regression model belonging to the same category of nonparametric models as NNw and FNM. NWE estimates the regression function as a locally weighted average, using a kernel function (Gaussian) as a weighting function. The model hyperparameters are the input pattern length and the kernel bandwidth parameters.

GRNN – general regression NN model. This is a four-layer NN with Gaussian nodes centered at the training patterns. The node outputs express similarities between the query pattern and the training patterns. These outputs are treated as the weights of the training patterns in the regression model. The model hyperparameters are the input pattern length and the bandwidth parameter for the nodes.

MLP – multilayer perceptron [40] with a single hidden layer and sigmoidal neurons. It is trained using the Levenberg–Marquardt method with Bayesian regularization to prevent overfitting. The MLP hyperparameters are the input pattern length and the number of hidden nodes. We use the Matlab R2018a implementation of MLP (function feedforwardnet from the Neural Network Toolbox).
ANFIS – adaptive neuro-fuzzy inference system [41]. The initial membership function parameters in the premise parts of the rules are determined using fuzzy c-means clustering. A hybrid learning method is applied for ANFIS training, which combines least-squares estimation of the consequent parameters with backpropagation gradient descent for the premise parameters. The ANFIS hyperparameters are the input pattern length and the number of rules. The Matlab R2018a implementation of ANFIS was used (function anfis from the Fuzzy Logic Toolbox).
NNw+ETS, FNM+ETS, NWE+ETS, GRNN+ETS, MLP+ETS, ANFIS+ETS – variants of the above models with output patterns encoded using variables describing the current features of the time series, namely the mean value of the series in the seasonal cycle and its dispersion. To postprocess the forecasted output pattern, these coding variables are predicted for the next period using ETS. We use the R implementation of ETS (see below).
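The "+ETS" postprocessing step can be sketched as follows (illustrative Python; simple exponential smoothing stands in for the full ETS model used in the paper, and the function names are assumptions):

```python
def ses_forecast(series, alpha=0.5):
    """One-step simple exponential smoothing forecast: a minimal
    stand-in for the ETS model that predicts the coding variables."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def decode_with_ets(pattern, means_history, disps_history):
    """Decode a forecasted output pattern into demand values using the
    coding variables (cycle mean and dispersion) predicted for the
    next period. Hypothetical sketch of the '+ETS' variant."""
    mean_next = ses_forecast(means_history)
    disp_next = ses_forecast(disps_history)
    return [p * disp_next + mean_next for p in pattern]
```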

LSTM – long short-term memory network, where the responses are the training sequences with values shifted by one time step (a sequence-to-sequence regression LSTM network). For multiple time steps, after one step was predicted, the LSTM state was updated and the previous prediction was used as the input for the next time step. LSTM was optimized using the Adam (adaptive moment estimation) optimizer. The length of the hidden state was the only hyperparameter to be tuned; the other hyperparameters remained at their default values. The experiments were carried out using the Matlab R2018a implementation of LSTM (function trainNetwork from the Neural Network Toolbox).
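The recursive multi-step procedure used with the LSTM can be sketched generically (illustrative Python; `one_step_model` stands in for the trained network and is an assumption):

```python
def recursive_forecast(history, one_step_model, horizon, window):
    """Recursive multi-step forecasting: predict one step ahead,
    append the prediction to the input sequence, and repeat until the
    full horizon is covered."""
    buf = list(history)
    out = []
    for _ in range(horizon):
        y_hat = one_step_model(buf[-window:])  # one-step-ahead forecast
        out.append(y_hat)
        buf.append(y_hat)                      # feed prediction back in
    return out
```

Because each step feeds on the previous prediction, errors can accumulate over the horizon, which is one reason the hidden state length needs careful tuning.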
ARIMA – ARIMA model implemented in the function auto.arima in the R environment (package forecast). This function performs automatic ARIMA modeling, combining unit root tests, minimization of the corrected Akaike information criterion (AICc) and maximum likelihood estimation to obtain the optimal ARIMA model [31].

ETS – exponential smoothing state space model [43] implemented in the function ets (R package forecast). This implementation covers many types of ETS models depending on how the seasonal, trend and error components are taken into account: they can be expressed additively or multiplicatively, and the trend can be damped or not. As in the case of auto.arima, ets returns the optimal model, estimating its parameters using AICc.
The NNw, FNM, NWE and GRNN models are also known as pattern similarity-based forecasting models (PSFMs) because the forecast is constructed by aggregating the training output patterns using the similarity between the query pattern and the training input patterns [14]. All the model hyperparameters mentioned above were selected on the training set in grid search procedures. The NN-based models, i.e. MLP, ANFIS and LSTM, return different results for the same data due to the stochastic nature of their learning processes. In this study, 100 independent learning sessions were performed for these models and the final errors were calculated as averages over the 100 trials.
III-B Results
Table I shows the forecasting results for the proposed and comparative models: the median of the absolute percentage error (APE), the mean APE (MAPE), the interquartile range (IQR) of APE as a measure of forecast dispersion, and the root mean square error (RMSE). These values are averaged over the 35 countries. The lowest errors are achieved by the "+ETS" PSFMs, where the coding variables are predicted using ETS. The errors for ETS+RD-LSTM are slightly higher, but lower than for all the other models.
Model         Median APE   MAPE   IQR    RMSE
kNNw              2.89     4.99   3.85   368.79
FNM               2.88     4.88   4.26   354.33
NWE               2.84     5.00   3.97   352.01
GRNN              2.87     5.01   4.02   350.61
kNNw+ETS          2.71     4.47   3.52   327.94
FNM+ETS           2.64     4.40   3.46   321.98
NWE+ETS           2.68     4.37   3.36   320.51
GRNN+ETS          2.64     4.38   3.51   324.91
MLP               2.97     5.27   3.84   378.81
MLP+ETS           3.11     4.80   4.12   358.07
ANFIS             3.56     6.18   4.87   488.75
ANFIS+ETS         3.54     6.32   4.26   464.29
LSTM              3.73     6.11   4.50   431.83
ARIMA             3.32     5.65   5.24   463.07
ETS               3.50     5.05   4.80   374.52
ETS+RD-LSTM       2.74     4.48   3.55   347.24
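The error measures reported in Table I can be reproduced from a series of actual and forecasted values as follows (illustrative Python; the linear-interpolation quantile rule is an assumption, as the paper does not specify one):

```python
import math

def error_metrics(actual, predicted):
    """Compute the measures of Table I: median APE, MAPE,
    APE interquartile range (IQR), and RMSE."""
    ape = sorted(abs((a - p) / a) * 100 for a, p in zip(actual, predicted))

    def quantile(xs, q):
        # Linear-interpolation quantile on sorted data.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    rmse = math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))
    return {"median_ape": quantile(ape, 0.5),
            "mape": sum(ape) / len(ape),
            "iqr": quantile(ape, 0.75) - quantile(ape, 0.25),
            "rmse": rmse}
```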
More detailed results are shown in Figs. 9 and 10. Fig. 9 depicts MAPE for each country. As can be seen from this figure, ETS+RD-LSTM is one of the most accurate models in most cases. Fig. 10 shows MAPE for each month of the forecasted period. Note the lower errors for months 8–10 and the higher errors for months 1–4 and 12. ETS+RD-LSTM achieved better results than most of the comparative models for months 6–12. For months 1 and 3 it achieved the highest errors compared to the other models.
Rankings of the models are shown in Fig. 11. These are based on the average ranks of the models in the rankings for the individual countries. The left panel shows the ranking based on MAPE and the right panel shows the ranking based on RMSE. Note the high position of ETS+RD-LSTM: it is in first position in the MAPE ranking and third position in the RMSE ranking. Note that in the latter case the difference between the first three positions is very small.
Examples of forecasts produced by the models for six countries are depicted in Fig. 12. For PL data, most models, including ETS+RD-LSTM, keep the error at a low level, which should be considered a very good result. Similar results were achieved for ES, IT and DE; in these cases, ETS+RD-LSTM was the most accurate model for ES and the third most accurate for IT. For GB, the forecasts are underestimated. This results from the fact that demand went up unexpectedly in 2014, despite the downward trend observed in the preceding period from 2010 to 2013. The reverse situation for FR caused a slight overestimation of the forecasts. For GB data, ETS+RD-LSTM was the least accurate model, while for FR data it was one of the most accurate models.
Table II shows basic descriptive statistics of the percentage errors (PEs), and Fig. 13 depicts the empirical probability density functions of PEs for the selected models. The PE distributions are similar to normal, but tests for the assessment of normality (the Jarque–Bera test and the Lilliefors test) do not confirm this. In all cases, the forecasts are overestimated, having a positive PE mean. Note that ETS+RD-LSTM is one of the least biased models. Positive values of skewness indicate right-skewed PE distributions, and high kurtosis values indicate leptokurtic distributions, where the probability mass is concentrated around the mean.
Model         Mean   Median    Std   Skewness   Kurtosis
kNNw          1.87    1.08    10.43    5.14      44.56
FNM           2.03    1.22     9.34    4.16      35.14
NWE           1.91    1.18    10.82    5.41      48.94
GRNN          1.87    1.16    10.60    5.34      48.11
kNNw+ETS      1.25    0.20     9.00    4.40      37.30
FNM+ETS       1.26    0.11     8.80    4.75      41.71
NWE+ETS       1.26    0.17     8.68    4.63      40.75
GRNN+ETS      1.26    0.11     8.61    4.42      38.38
MLP           1.37    0.68    11.88    7.52     109.64
MLP+ETS       1.71    1.03     7.32    1.55      11.83
ANFIS         2.51    1.43    11.37    4.35      34.93
ANFIS+ETS     1.30    0.40    12.65    0.96      39.37
LSTM          3.12    1.81     9.49    2.86      22.21
ARIMA         2.35    1.03    13.62    9.01     119.20
ETS           1.04    0.31     7.97    1.89      13.52
ETS+RD-LSTM   1.11    0.27    10.07    6.37      63.61
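The descriptive statistics of Table II can be computed as follows (illustrative Python; the PE sign convention, with overestimation giving positive PE, and the population-style moment estimators are assumptions):

```python
import math

def pe_statistics(actual, predicted):
    """Descriptive statistics of percentage errors (PEs): mean, median,
    standard deviation, skewness and (non-excess) kurtosis.
    PE is taken as (forecast - actual) / actual, so overestimated
    forecasts yield a positive mean (assumed convention)."""
    pe = [(p - a) / a * 100 for a, p in zip(actual, predicted)]
    n = len(pe)
    mean = sum(pe) / n
    s = sorted(pe)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    var = sum((x - mean) ** 2 for x in pe) / n
    std = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in pe) / (n * std ** 3)
    kurt = sum((x - mean) ** 4 for x in pe) / (n * var ** 2)
    return {"mean": mean, "median": median, "std": std,
            "skewness": skew, "kurtosis": kurt}
```

A normal distribution has skewness 0 and kurtosis 3, so the large positive values in Table II confirm the right-skewed, leptokurtic shape noted above.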
IV Conclusion
In this work, we proposed and empirically validated a new architecture for midterm load forecasting. It was inspired by the winning submission to the M4 2018 forecasting competition, which combines ETS, an advanced LSTM and ensembling. The model has a hierarchical structure composed of a global part learned across many time series (the LSTM weights) and a time-series-specific part (ETS smoothing coefficients and initial seasonal components). Using stochastic gradient descent (SGD), it simultaneously learns a mapping from input vectors to output vectors and the time series representations. ETS-inspired formulas extract the main components of the individual time series, which are used for deseasonalization and normalization. The preprocessed time series are forecasted using a residual dilated LSTM. Due to the introduction of dilated recurrent skip connections and a spatial shortcut path from lower layers, the LSTM is better able to capture long-term seasonal relationships and is trained more efficiently. To deal with forecast bias, we used a pinball loss function with a parameter controlling its asymmetry. In addition, we penalized the loss function to prevent overfitting. Three-level ensembling ensures powerful regularization, reducing the model variance, which has its sources in the stochastic nature of SGD as well as in data and parameter uncertainty.
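The bias-controlling loss mentioned above is the standard pinball (quantile) loss; a minimal sketch (illustrative Python, without the regularization penalty used in the paper):

```python
def pinball_loss(actual, predicted, tau=0.5):
    """Asymmetric pinball (quantile) loss; tau controls the asymmetry.
    tau = 0.5 reduces to a scaled absolute error, while tau < 0.5
    penalizes over-prediction more, which can counteract a positive
    forecast bias."""
    total = 0.0
    for a, p in zip(actual, predicted):
        diff = a - p
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(actual)
```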
We applied the proposed model to monthly electricity demand forecasting for 35 European countries. The results demonstrated its state-of-the-art performance and competitiveness with classical models such as ARIMA and ETS, as well as with state-of-the-art ML models.
References
 [1] F. Apadula, A. Bassini, A. Elli, and S. Scapin, “Relationships between meteorological variables and monthly electricity demand,“ Appl Energy, vol. 98, pp. 346–356, 2012.
 [2] E. Dogan, “Are shocks to electricity consumption transitory or permanent? Subnational evidence from Turkey,“ Utilities Policy, vol. 41 pp. 77–84, 2016.
 [3] L. Suganthi and A. A. Samuel, “Energy models for demand forecasting — A review,“ Renew Sust Energy Rev, vol. 16(2), pp. 1223–1240, 2002.
 [4] E. H. Barakat, “Modeling of nonstationary timeseries data. Part II. Dynamic periodic trends,“ Electr Power Energy Systems, vol. 23, pp. 63–68, 2001.
 [5] E. GonzálezRomera, M.A. JaramilloMorán, and D. CarmonaFernández, “Monthly electric energy demand forecasting with neural networks and Fourier series,“ Energy Conversion and Management, vol. 49, pp. 3135–3142, 2008.
 [6] J. F. Chen, S. K. Lo, and Q. H. Do, "Forecasting monthly electricity demands: An application of neural networks trained by heuristic algorithms," Information, vol. 8(1), 31, 2017.
 [7] M. Gavrilas, I. Ciutea, and C. Tanasa, "Medium-term load forecasting with artificial neural network models," Proc. Conf. International Conference and Exhibition on Electricity Distribution, vol. 6, 2001.
 [8] E. Doveh, P. Feigin, and L. Hyams, “Experience with FNN models for medium term power demand predictions,“ IEEE Trans. Power Syst., vol. 14(2), pp. 538–546, 1999.
 [9] P. Pełka and G. Dudek, "Medium-term electric energy demand forecasting using generalized regression neural network," Proc. Conf. Information Systems Architecture and Technology ISAT 2018, Advances in Intelligent Systems and Computing, vol. 853, pp. 218–227, Springer, Cham, 2019.
 [10] T. Ahmad and H. Chen, "Potential of three variant machine-learning models for forecasting district level medium-term and long-term energy demand in smart grid environment," Energy, vol. 160, pp. 1008–1020, 2018.
 [11] P. C. Chang, C. Y. Fan, and J. J. Lin, “Monthly electricity demand forecasting based on a weighted evolving fuzzy neural network approach,“ Electrical Power and Energy Systems, vol. 33, pp. 17–27, 2011.
 [12] S. Baek, “Midterm Load Pattern Forecasting With Recurrent Artificial Neural Network,“ in IEEE Access, vol. 7, pp. 172830–172838, 2019.
 [13] W. Zhao, F. Wang, and D. Niu, “The application of support vector machine in load forecasting,“ Journal of Computers, vol. 7(7), pp. 1615–1622, 2012.
 [14] G. Dudek and P. Pełka, "Pattern Similarity-based Machine Learning Methods for Mid-term Load Forecasting: A Comparative Study," arXiv:2003.01475, 2020.
 [15] H. Hewamalage, C. Bergmeir, and K. Bandara, “Recurrent neural networks for time series forecasting: Current status and future directions,“ arXiv:1909.00590v3, 2019.
 [16] K. Yan, X. Wang, Y. Du, N. Jin, H. Huang, and H. Zhou, "Multi-Step Short-Term Power Consumption Forecasting with a Hybrid Deep Learning Strategy," Energies, vol. 11(11), 3089, 2018.
 [17] J. Bedi and D. Toshniwal, “Empirical mode decomposition based deep learning for electricity demand forecasting,“ IEEE Access, vol. 6, pp. 49144–49156, 2018.
 [18] H. Zheng, J. Yuan, and L. Chen, "Short-Term Load Forecasting Using EMD-LSTM Neural Networks with a XGBoost Algorithm for Feature Importance Evaluation," Energies, vol. 10(8), 1168, 2017.
 [19] A. Narayan and K. W. Hipel, "Long short term memory networks for short-term electric load forecasting," Proc. Conf. IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, pp. 2573–2578, 2017.
 [20] J. Toubeau, J. Bottieau, F. Vallée, and Z. De Grève, "Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets," IEEE Transactions on Power Systems, vol. 34(2), pp. 1203–1215, 2019.
 [21] T. Zia, and S. Razzaq, “Residual recurrent highway networks for learning deep sequence prediction models,“ Journal of Grid Computing, Jun 2018.
 [22] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, “NBEATS: Neural basis expansion analysis for interpretable time series forecasting,“ arXiv:1905.10437v4, 2020.
 [23] D. Salinas, V. Flunkert, and J. Gasthaus, “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks,“ arXiv:1704.04110v3, 2019.
 [24] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, “Deep state space models for time series forecasting“ In NeurIPS, vol. 31, pp. 7785–7794, 2018.
 [25] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 Competition: Results, findings, conclusion and way forward," International Journal of Forecasting, vol. 34(4), pp. 802–808, 2018.
 [26] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 Competition: 100,000 time series and 61 forecasting methods," International Journal of Forecasting, vol. 36(1), pp. 54–74, 2020.
 [27] S. Smyl, “A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting,“ International Journal of Forecasting, vol. 36(1), pp. 75–85, 2020.
 [28] S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, et al., "Dilated recurrent neural networks," arXiv:1710.02224, 2017.
 [29] J. Kim, M. El-Khamy, and J. Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," arXiv:1701.03360, 2017.
 [30] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. W. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," In IJCAI-17, pp. 2627–2633, 2017.
 [31] R. J. Hyndman and G. Athanasopoulos, Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2 (2018) Accessed on 7 March 2020.
 [32] G. Dudek, "Pattern Similarity-based Methods for Short-term Load Forecasting – Part 1: Principles," Applied Soft Computing, vol. 37, pp. 277–287, 2015.
 [33] G. Dudek, "Pattern Similarity-based Methods for Short-term Load Forecasting – Part 2: Models," Applied Soft Computing, vol. 36, pp. 422–441, 2015.
 [34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9(8), pp. 1735–1780, 1997.
 [35] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber, “LSTM: A Search Space Odyssey,“ in IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
 [36] F. Petropoulos, R. J. Hyndman, and C. Bergmeir, "Exploring the sources of uncertainty: Why does bagging for time series forecasting work?," European Journal of Operational Research, vol. 268(2), pp. 545–554, 2018.
 [37] F. Chan and L. L. Pauwels, “Some theoretical results on forecast combinations,“ International Journal of Forecasting, vol. 34(1), pp. 64–74, 2018.
 [38] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola, "Nonparametric quantile estimation," Journal of Machine Learning Research, vol. 7, pp. 1231–1264, 2006.
 [39] G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, et al., "DyNet: The dynamic neural network toolkit," arXiv:1701.03980, 2017.
 [40] P. Pełka and G. Dudek, "Pattern-based forecasting monthly electricity demand using multilayer perceptron," Proc. Conf. Artificial Intelligence and Soft Computing ICAISC 2019, Lecture Notes in Artificial Intelligence, vol. 11508, pp. 663–672, Springer, Cham, 2019.
 [41] P. Pełka and G. Dudek, "Neuro-Fuzzy System for Medium-term Electric Energy Demand Forecasting," Proc. Conf. Information Systems Architecture and Technology ISAT 2017, Advances in Intelligent Systems and Computing, vol. 655, pp. 38–47, Springer, Cham, 2018.
 [42] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “Statistical and machine learning forecasting methods: Concerns and ways forward,“ PLoS ONE, vol. 13(3), 2018.
 [43] R. J. Hyndman, A. B. Koehler, J. K. Ord, R. D. Snyder, Forecasting with exponential smoothing: The state space approach, Springer, 2008.