1 Introduction
Mid-term electrical load forecasting (MTLF) is an essential tool for power system operation and planning. It concerns forecasting monthly electricity demand as well as daily peak load for the months ahead; the forecast horizon usually ranges from a week to a year. Mid-term load forecasts are necessary for maintenance scheduling, fuel reserve planning, hydro-thermal coordination, electrical energy import/export planning and security assessment. In deregulated power systems, MTLF is the basis for negotiating forward contracts. Forecast accuracy translates directly into the financial performance of energy companies and other energy market participants.
This work focuses on monthly electricity demand forecasting. A time series of monthly electricity demand usually exhibits a trend and annual cycles. The trend depends on the dynamics of a country's economic development. Seasonal cycles are related to the climate, weather factors and the variability of seasons. Factors that disturb the time series include political decisions, unpredictable economic events, structural breaks [13] and transitory effects from external variables [3].
Methods of MTLF can be roughly classified into the conditional modeling approach and the autonomous modeling approach [21]. The conditional modeling approach focuses on economic analysis and long-term planning and forecasting of energy policy. Socioeconomic conditions that affect energy demand in a given region, as well as population migrations, are taken into account. Economic growth is described by economic indicators, which serve as additional inputs to the forecasting model. Such indicators include [21], [20]: gross national product, consumer price index, exchange rates and average wage. In addition, variables describing the power system and network infrastructure are introduced as inputs, such as the number and length of transmission lines and the number of high-voltage stations. Economic variables have the greatest impact on the trend, while weather variables, due to their seasonal nature, mainly affect the periodic behavior of the monthly electricity demand [23]. Examples of the conditional modeling approach can be found in [27], [8] and [28]. In [27] a knowledge-based expert system dedicated to a fast-developing utility is proposed. It identifies forecasting algorithms and the key variables, electrical and non-electrical, that affect demand forecasts. A set of decision rules relating these variables is then obtained and stored in the knowledge base. Afterwards, the model that reflects the typical system behavior most accurately is selected to produce the load forecast. Multiple linear regression and ARIMA models for monthly peak load forecasting are proposed in [8]. Inputs of the models are historical series of electric peak load, weather variables and economic variables such as the consumer price index and an industrial index. In [28] a heuristic model was proposed which approximates the relationship between the actual load and four sets of historical data: population, gross national product, consumer price index and temperature. The impact of the reserve margin and the load diversity factor is also considered to obtain the final forecast.
In the autonomous modeling approach, the input variables include only historical loads and weather factors. This approach is more appropriate for stable economies without rapid changes affecting electricity demand. The selection of weather variables, such as [21] atmospheric temperature, humidity, insolation time and wind speed, depends on the local climate and weather conditions [14]. The autonomous models described in [19] use ARIMA and neural networks (NNs) for forecasting monthly peak loads. Input variables include load profiles, weather factors (temperature and humidity) and a time index. The model described in [14] uses historical demand and atmospheric temperatures as input variables; variables expressing the seasons are also introduced. In many cases, autonomous models are simplified, using only historical load data as input. Such an approach was used in [23], where the trend of a series of monthly loads was forecasted only on the basis of loads from the previous twelve months, and the seasonal component was modeled by Fourier series. In [9] a fuzzy neural network model was used based only on weather variables (atmospheric pressure, temperature, wind speed, etc.), without taking historical loads into account as input variables.
Another categorization of MTLF methods is based on the forecasting models, which can be classical statistical/econometric models or computational intelligence/machine learning models [36]. Typical examples of the former are ARIMA, linear regression (LR) and exponential smoothing (ETS). Implementation of seasonal cycles in LR models requires additional operations, such as decomposition of the series into individual month series. For nonstationary time series with an irregular periodic trend, an LR model extended with periodic components implemented by sine functions of different frequencies has been proposed in [5]. Another example of using LR for forecasting power system loads can be found in [2]. The model uses strong daily and yearly correlations to forecast daily load profiles over a period of several weeks to several years; forecast results are corrected by annual load increases. In [8] LR and ARIMA models were compared in the task of forecasting monthly peak loads up to 12 months ahead. The models use the same set of input variables, including historical peak load data, weather and economic data. In experimental studies, ARIMA proved to be about twice as accurate as LR.
The limited adaptability of classical MTLF methods and problems with modeling nonlinear relationships have caused an increase of interest in artificial intelligence and machine learning methods [23]. The most explored machine learning models in MTLF are neural networks, owing to their attractive features such as nonlinear modeling, learning capabilities, the universal approximation property and massive parallelism. In [22] two separate NNs are used for forecasting the trend of the monthly load time series and the seasonal fluctuations. In [11] an NN predicts the future monthly loads on the basis of historical loads and weather variables. To improve its learning capability, the NN is trained using a gravitational search algorithm and a cuckoo optimization algorithm. An example of using a Kohonen NN for MTLF can be found in [20]. The authors built 12 forecasting networks, one for each month of the year. Input vectors contained historical loads and microeconomic indicators. The NNs proposed in [14] are supported by fuzzy logic; seasonal variables are defined in the form of trapezoidal indicators of the seasons. The authors train a set of NNs using regularization techniques to prevent overfitting and aggregate their responses, which results in more accurate forecasts. Other examples of machine learning models for MTLF are: [9], where a weighted evolving fuzzy NN was used for monthly electricity demand forecasting; [1], where NNs, LR and AdaBoost were used for energy forecasting; [38], where a support vector machine was used; and
[6], where a long short-term memory (LSTM) network model was used.
Many of the MTLF methods mentioned above need decomposition of the load time series to deal with a trend and seasonal variations. A typical approach is to decompose the time series into trend, seasonal and stochastic components. The components, exhibiting less complexity than the original time series, can be modeled independently using simpler models. One popular tool for decomposition is the STL (seasonal and trend decomposition using Loess) filtering procedure based on a locally weighted polynomial smoother [37]. In [12] STL decomposition was used for monthly data of total electric energy consumption in different developed and developing countries. The time series were forecasted using ARIMA or ETS, and a bootstrap aggregating method was used to improve the model accuracy. Another method of decomposition is the wavelet transform, which splits the load time series into subseries in the wavelet domain. A low-frequency component called an approximation expresses the trend, while high-frequency components called details express cyclical variations [7]. As an alternative to wavelet decomposition, a Fourier transform can be used, which decomposes a function of time into its constituent frequencies. For example, in [23] the load time series was split into two components: one describing the trend and the other the fluctuations. The fluctuation series was then expressed by several Fourier series of different frequencies. Yet another method of load time series decomposition is empirical mode decomposition, which breaks the time series down into so-called intrinsic mode functions. Such decomposition was used in [33], where a deep belief network was used to model each of the extracted intrinsic mode functions.
As an alternative to the classical and state-of-the-art methods for MTLF, in this work we describe pattern similarity-based forecasting methods (PSFMs). Similarity-based learning is a practical learning framework generalizing the minimal-distance methods. It is the basis of many machine learning and pattern recognition methods used for classification, clustering and regression. The repeated, similar-shaped cycles observed in seasonal time series encourage us to apply these methods to forecasting as well. To do so, we first define patterns expressing preprocessed repetitive sequences in a time series. Pattern representation ensures the unification of input and output data through trend filtering and variance equalization; consequently, no decomposition of the time series is needed. Owing to the pattern representation, the relationship between input and output data is simplified, and the forecasting problem can be solved using simple models. We consider four such models: the nearest neighbor model, the fuzzy neighborhood model, the kernel regression model and the general regression neural network. A regression function is constructed by aggregating output patterns with weights dependent on the similarity between input patterns. The principle of operation is thus very transparent, which is a big advantage of these models compared to other forecasting models that are often black boxes. Other advantages are: a small number of parameters to adjust, which implies a fast optimization procedure; good generalization ability, which can be controlled by the model parameters; the ability to work on the newest data without retraining; robustness to missing input variables; and generating a vector as an output. Unlike state-of-the-art machine learning methods, such as neural networks and deep learning, PSFMs do not suffer from an excessive tuning and training burden.
The PSFMs belong to the group of autonomous modeling approaches, where only historical load data are used as inputs. Not including other inputs, such as weather and economic factors, can raise objections, but we have to remember that these exogenous variables are usually not available and have to be forecasted themselves. This is a major challenge in all MTLF approaches with exogenous inputs. Many researchers use statistical measures such as correlation coefficients, personal experience and intuition to assess the validity, effectiveness and contribution of such exogenous variables to energy and load forecasting [21]. This can lead to low-accuracy forecasts of weather and economic variables and, consequently, to large load forecast errors.
This paper is organized as follows. Section 2 describes monthly electricity demand time series and their representation using patterns. The idea and framework of PSFM as well as the four forecasting models are presented in Section 3. Section 4 describes the experimental framework used to evaluate the performance of the proposed and comparative models. Finally, Section 5 presents our conclusions.
2 Time Series and their Representation
A monthly electricity demand time series exhibits a trend, annual cycles and a random component. Fig. 1 shows an example of such a time series. In this figure we can observe a nonlinear trend, a jump at the end of the fifth cycle, and an annual pattern that changes over the years. The dispersion of the annual cycles also changes significantly over time.
One of the main issues in building forecasting models is how the time series should be represented to obtain the highest performance of the model. Input and output variables should be defined on the basis of the original time series, and these definitions significantly affect the results. In this work we use input and output variables in the form of patterns of fragments of the monthly load time series. By a pattern we mean a vector whose components are calculated using some function of the actual time series elements. For multiple seasonal time series, this function can filter out the trend and seasonal fluctuations and thus simplify the data and the relationships between them [17]. A forecasting model working on patterns can therefore be less complex.
An input pattern x_i = [x_{i,1}, x_{i,2}, ..., x_{i,n}] of length n is a vector of predictors representing a sequence X_i of n successive time series elements (monthly electricity demands) preceding the forecasted period. The function transforming time series elements into patterns depends on the time series character, such as seasonalities, variance and trend. Some definitions of the function mapping the original time series into patterns are:
(1) x_{i,t} = E_{i,t}

(2) x_{i,t} = E_{i,t} - \bar{E}_i

(3) x_{i,t} = E_{i,t} / \bar{E}_i

(4) x_{i,t} = (E_{i,t} - \bar{E}_i) / D_i

where t = 1, 2, ..., n, \bar{E}_i is the mean of sequence X_i, and D_i = \sqrt{\sum_{t=1}^{n} (E_{i,t} - \bar{E}_i)^2} is a measure of its dispersion.
Definition (1) just copies sequence X_i into the x-pattern without transformation. The pattern components defined using (2) are the differences between the demand of a given month and the average demand of sequence X_i. The quotient of these two quantities is expressed by (3). An x-pattern defined by (4) is a normalized version of the vector composed of the elements of X_i. Thus, the original time series sequences, having different means and dispersions (see Fig. 1(b)), are unified: after normalization they are represented by x-patterns which all have zero mean, the same variance and unity length.
In Fig. 2 one-year sequences are shown together with their x-patterns defined using (2)-(4). Note that all x-pattern definitions bring the means of all patterns to the same value (0 or 1), and definition (4) additionally brings the variances of all patterns to the same value. Thus, the trend is filtered out and the patterns differ only in shape.
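The pattern definitions (1)-(4) can be sketched in a few lines of Python (an illustrative implementation; the function name and interface are ours, not part of the original study):

```python
import math

def x_pattern(seq, definition=4):
    """Map a sequence of monthly demands onto an x-pattern.

    definition 1: plain copy of the sequence
    definition 2: deviations from the sequence mean
    definition 3: quotients of demand and sequence mean
    definition 4: normalized deviations (zero mean, unit length)
    """
    mean = sum(seq) / len(seq)
    if definition == 2:
        return [e - mean for e in seq]
    if definition == 3:
        return [e / mean for e in seq]
    if definition == 4:
        disp = math.sqrt(sum((e - mean) ** 2 for e in seq))
        return [(e - mean) / disp for e in seq]
    return list(seq)
```

For example, any yearly sequence mapped with definition 4 yields a pattern with zero mean and unity length, regardless of the level and dispersion of the original demands.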
An output pattern y_i = [y_{i,1}, y_{i,2}, ..., y_{i,m}] represents a forecasted sequence Y_i of length m, where \tau is the forecast horizon in months. The output patterns are defined similarly to the input patterns:
(5) y_{i,t} = E_{i+\tau,t}

(6) y_{i,t} = E_{i+\tau,t} - E^*_i

(7) y_{i,t} = E_{i+\tau,t} / E^*_i

(8) y_{i,t} = (E_{i+\tau,t} - E^*_i) / D^*_i

where t = 1, 2, ..., m, and E^*_i and D^*_i are coding variables.
The coding variables, E^*_i and D^*_i, are the mean and dispersion of the forecasted sequence Y_i, respectively. Both are known for the historical time series, so the training y-patterns can be prepared using them and the model can be trained. After learning, for a new query x-pattern, the model produces the forecasted y-pattern, \hat{y}. The forecasts of the monthly electricity demands in the forecasted period are calculated from the forecasted y-pattern using the transformed equations (this is called decoding). For example, when y-patterns are defined using (8), the decoding equation is of the form:
(9) \hat{E}_{i+\tau,t} = \hat{y}_{i,t} D^*_i + E^*_i
In this equation the coding variables are not known, because they are the mean and dispersion of the future sequence Y_i, which is just being forecasted. In this case the coding variables are predicted from their historical values. In the experimental part of this work the coding variables are predicted using ARIMA and exponential smoothing (ETS).
To avoid forecasting the coding variables, we can use another approach. Instead of using the mean and dispersion of the forecasted sequence Y_i as coding variables, we introduce in (6)-(8) the mean and dispersion of the preceding sequence X_i, i.e. E^*_i = \bar{E}_i, D^*_i = D_i. Although this approach does not guarantee that all y-patterns have the same mean value, it unifies the output data taking into account the current process variability, expressed by the mean \bar{E}_i and dispersion D_i. When the model returns the forecasted y-pattern, the forecasts of the monthly demands are calculated from (9) using the known coding variables of the historical sequence X_i.
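A minimal sketch of this encoding/decoding scheme in Python (illustrative code; the function names are ours), with the coding variables taken from the most recent known sequence:

```python
import math

def coding_vars(prev_seq):
    # mean and dispersion of the preceding (known) sequence X_i
    mean = sum(prev_seq) / len(prev_seq)
    disp = math.sqrt(sum((e - mean) ** 2 for e in prev_seq))
    return mean, disp

def encode_y(seq, mean, disp):
    # pattern components as normalized deviations from the coding mean
    return [(e - mean) / disp for e in seq]

def decode_y(y_hat, mean, disp):
    # inverse transformation: restore the current level and dispersion
    return [y * disp + mean for y in y_hat]
```

Encoding a sequence and decoding it with the same coding variables restores the original demands exactly, which is why the decoding step only needs coding variables that are already known.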
A training set for learning the forecasting models is composed of pairs of corresponding x- and y-patterns: \Phi = \{(x_i, y_i) : i = 1, ..., N\}. The model learns the mapping from x-patterns to y-patterns. It generates a forecast of a y-pattern, \hat{y}, for a query pattern x. The forecasts of the monthly electricity demands in the forecasted period are calculated from the forecasted y-pattern using the transformed equations (5)-(8).
Note that when using the pattern approach, the forecasting model works on patterns expressing the shapes of the time series sequences. In the first step of this approach, the trend and dispersion (and, for time series with multiple seasonal cycles, also additional seasonal variation) are filtered out. Then the model forecasts the unified data, i.e. y-patterns on the basis of x-patterns. Finally, in the decoding step, the current trend and dispersion are introduced into the forecasted y-pattern to obtain the forecasted monthly demands.
3 Pattern Similarity-based Forecasting Models
3.1 The Framework of the PSFM
Similarity-based methods are very popular in the field of machine learning and pattern recognition [15], [10]. They estimate the class label or the function value for a query sample based on the similarities between this sample and a set of training samples, using some similarity function defined for any pair of samples. The proposed forecasting methods can be classified as memory-based approximation methods using analogies between preprocessed sequences of the time series (patterns). It is assumed that the future behavior of a time series can be deduced from its behavior in similar conditions in the past. This assumption can be expressed in the pattern representation context as follows [17]: if the query pattern x is similar to pattern x_i from the history, then the forecasted pattern \hat{y} will be similar to pattern y_i (paired with x_i). This assumption allows us to predict the y-pattern on the basis of the known patterns x, x_i and y_i. Usually we select several similar x-patterns and aggregate the y-patterns paired with them to get the forecasted y-pattern.
The above assumption underlying PSFMs should be verified for a given time series. We analyze the relationship between the similarities of x-patterns and the similarities of the y-patterns paired with them. To do so, we define two random variables: the similarity between x_i and x_j, and the similarity between y_i and y_j, where i \neq j. Instead of a similarity measure, we can use a distance measure between patterns for the random variables. All pairs of these random variables form the sample. To show the stochastic interdependence of the random variables, the null hypothesis is formulated: the observed differences in the numbers of occurrences of the sample elements in the specified categories of the random variables are caused by the random nature of the sample. This hypothesis is verified using the chi-squared test based on a contingency table showing the joint empirical distribution of the random variables. A high value of the \chi^2 statistic, above the critical value, rejects the null hypothesis in favor of the alternative hypothesis, which justifies the use of PSFMs.
In this study, the similarity between patterns, which are real-valued vectors, is measured using Euclidean distance. Other measures are also possible, such as Pearson's correlation coefficient, the Tanimoto coefficient, other Minkowski distances or the dot product for normalized vectors.
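The verification procedure can be sketched as follows (a simplified stand-alone implementation of the contingency-table statistic; the binning details may differ from those used in the study):

```python
import random

def chi2_stat(dx, dy, bins=5):
    """Chi-squared statistic for the contingency table of two distance
    samples, each split into `bins` roughly equinumerous categories."""
    def categories(vals):
        order = sorted(vals)
        cuts = [order[int(len(vals) * q / bins)] for q in range(1, bins)]
        return [sum(v > c for c in cuts) for v in vals]
    cx, cy = categories(dx), categories(dy)
    n = len(dx)
    table = [[0] * bins for _ in range(bins)]
    for a, b in zip(cx, cy):
        table[a][b] += 1
    rows = [sum(r) for r in table]
    cols = [sum(table[i][j] for i in range(bins)) for j in range(bins)]
    stat = 0.0
    for i in range(bins):
        for j in range(bins):
            expected = rows[i] * cols[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

For five categories per variable (16 degrees of freedom) the statistic can be compared against the 0.05 critical value of 26.30; strongly dependent distance samples produce values far above it.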
The pattern similarity-based forecasting procedure is depicted in Fig. 3 and can be summarized in the following steps [17]:

1. Map the original time series sequences into x- and y-patterns.

2. Select the training x-patterns similar to the query pattern x.

3. Aggregate the y-patterns paired with the similar x-patterns to get the forecasted pattern \hat{y}.

4. Decode pattern \hat{y} to get the forecasted time series sequence.
During the aggregation of y-patterns (step 3), we use weights that depend on the similarity between the query pattern x and the training x-patterns. In this case, the regression model mapping x-patterns into y-patterns is of the form:
(10) m(x) = \sum_{i=1}^{N} w(x, x_i) y_i

where w(x, x_i) is a weighting function.
Note that m is a vector-valued function returning the whole y-pattern. It is a nonlinear function if w maps nonlinearly. Different definitions of w are presented below, where the individual PSFMs are specified.
3.2 Nearest Neighbor Models
The simplest PSFM is the k-nearest neighbor (k-NN) regression model. It estimates m(x) as the average of the y-patterns in a varying neighborhood of query pattern x. The neighborhood is defined as the set of k nearest neighbors of x in the training set \Phi. The regression function is as follows:
(11) m(x) = \frac{1}{k} \sum_{i \in \Omega_k(x)} y_i

where \Omega_k(x) is the set of indices of the k nearest neighbors of x in \Phi, and w(x, x_i) = 1/k for each i \in \Omega_k(x) and 0 otherwise.
Note that in this model the weights of the selected y-patterns are all equal to 1/k. Function (11) is a step function. The number of nearest neighbors k controls the smoothness of the estimator. For k = 1 the regression function is exactly fitted to the training points; increasing k causes an increase in bias and a decrease in variance of the estimator. To get rid of the jumps of the regression function and make it smoother, we can introduce a weighting function that gives greater weights to closer neighbors and lower weights to distant ones. Weighting functions are usually dependent on the distance between a query pattern and its nearest neighbors. They are monotonically decreasing, reach their maximum value at zero distance, and their minimum (nonnegative) value for the k-th nearest neighbor. Some weighting function propositions can be found in [4]. In this work we use the weighting function of the form [16]:
(12) w(x, x_l) = \frac{v(x_l)}{\sum_{j=1}^{k} v(x_j)}

(13) v(x_l) = \gamma - (\gamma - 1) \left( \frac{d(x, x_l) - d(x, x_1)}{d(x, x_k) - d(x, x_1)} \right)^{\alpha}

where x_l is the l-th nearest neighbor of x in \Phi, d(x, x_l) is the Euclidean distance between x and its l-th nearest neighbor in \Phi, \gamma \geq 1 is a parameter deciding about the differentiation of weights, and \alpha > 0 is a parameter deciding about the convexity of the weighting function.
The weighting function (13) is shown in Fig. 4. The values of v lie in the interval [1, \gamma]. Thus, for large \gamma the weights are the most diverse, and for \gamma = 1 they are all equal; in the latter case we get model (11). For \alpha = 1 the weighting function is linear, for \alpha < 1 we get a convex function and for \alpha > 1 a concave one. The three model parameters, k, \gamma and \alpha, allow us to control its features flexibly.
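A compact sketch of the weighted k-NN forecast (illustrative Python; the weights follow the rank-normalized scheme described above, falling from the nearest to the k-th neighbor):

```python
import math

def knn_forecast(x_query, X, Y, k=5, gamma=2.0, alpha=1.0):
    """Weighted k-NN estimate of the y-pattern for a query x-pattern."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    ranked = sorted(range(len(X)), key=lambda i: dist(x_query, X[i]))[:k]
    d = [dist(x_query, X[i]) for i in ranked]
    span = d[-1] - d[0]
    # weights fall from gamma (nearest neighbor) to 1 (k-th neighbor);
    # gamma = 1 recovers the plain k-NN average
    v = [gamma - (gamma - 1) * (((di - d[0]) / span) ** alpha if span else 0.0)
         for di in d]
    s = sum(v)
    w = [vi / s for vi in v]
    return [sum(w[j] * Y[i][t] for j, i in enumerate(ranked))
            for t in range(len(Y[0]))]
```

With gamma = 1 the function reduces to the unweighted average of the k nearest y-patterns; larger gamma shifts the estimate toward the closest neighbors.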
3.3 Fuzzy Neighborhood Model
In the k-NN model the regression surface is built using k training patterns. The fuzzy neighborhood model (FNM) takes into account all training patterns when constructing the regression surface [32]. In this case, not only k patterns belong to the neighborhood of the query pattern, but all training patterns belong to it, each with a different membership degree. The membership function is dependent on the distance between the query pattern x and the training pattern x_i as follows:
(14) \mu_i(x) = \exp\left( -\left( \frac{d(x, x_i)}{\sigma} \right)^{\alpha} \right)

where \sigma and \alpha are parameters deciding about the membership function shape (see Fig. 5).
The weighting function in FNM is of the form:
(15) w(x, x_i) = \frac{\mu_i(x)}{\sum_{j=1}^{N} \mu_j(x)}
Membership function (14) is a Gaussian-type function. Other types are also possible, e.g. a Cauchy-type function with a fatter tail, which gives greater weights to more distant patterns. The model parameters, \sigma and \alpha, shape the membership function and thus control the properties of the estimator. For wider membership functions, the model tends to increase bias and decrease variance.
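The FNM estimate admits an equally short sketch (illustrative Python; a Gaussian-type membership with exponent alpha = 2 is assumed here):

```python
import math

def fnm_forecast(x_query, X, Y, sigma=1.0, alpha=2.0):
    """Fuzzy neighborhood estimate: all training y-patterns contribute,
    weighted by the membership degree of their x-patterns in the
    neighborhood of the query pattern."""
    def membership(xi):
        d = math.sqrt(sum((u - v) ** 2 for u, v in zip(x_query, xi)))
        return math.exp(-((d / sigma) ** alpha))
    mu = [membership(xi) for xi in X]
    s = sum(mu)
    w = [m / s for m in mu]
    return [sum(w[i] * Y[i][t] for i in range(len(Y)))
            for t in range(len(Y[0]))]
```

Widening sigma flattens the memberships and pulls the estimate toward the global mean of the y-patterns (higher bias, lower variance).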
3.4 Kernel Regression Model
Kernel regression belongs to the same category of nonparametric methods as k-NN and FNM. It models a nonlinear relationship between a pair of random variables. The Nadaraya-Watson estimator (NWE) is the most popular representative of this group. NWE estimates the regression function as a locally weighted average, using in (10) a kernel as the weighting function [24]:
(16) w(x, x_i) = \frac{K_h(x - x_i)}{\sum_{j=1}^{N} K_h(x - x_j)}
A kernel function is centered at the data point x_i and gives the highest value when the distance between this point and the query point is zero. The kernel function falls with the distance at a speed dependent on the smoothing parameter (or bandwidth) h.
When the input variable is multidimensional, the kernel has a product form. In such a case, for a normal kernel, which is often used in practice, the weighting function is defined as [16], [18]:
(17) w(x, x_i) = \frac{\prod_{t=1}^{n} \exp\left( -\frac{(x_t - x_{i,t})^2}{2 h_t^2} \right)}{\sum_{j=1}^{N} \prod_{t=1}^{n} \exp\left( -\frac{(x_t - x_{j,t})^2}{2 h_t^2} \right)}

where h_t is the bandwidth for the t-th dimension.
In NWE the bandwidths have a prominent effect on the estimator shape, whereas the choice of kernel is clearly less important. Note that in (17) we define the bandwidths individually for each dimension. This gives a more flexible estimator, which allows us to control the influence of each input on the resulting fitted surface. The bandwidths decide about the bias-variance tradeoff of the estimator: for too low values the estimator is undersmoothed, while for too large values it is oversmoothed.
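A sketch of the NWE with a product normal kernel and per-dimension bandwidths (illustrative Python, consistent with the weighting described above):

```python
import math

def nwe_forecast(x_query, X, Y, h):
    """Nadaraya-Watson estimate with a separate bandwidth h[t] for
    each input dimension t."""
    def kernel(xi):
        return math.exp(-0.5 * sum(((q - v) / ht) ** 2
                                   for q, v, ht in zip(x_query, xi, h)))
    k = [kernel(xi) for xi in X]
    s = sum(k)
    w = [ki / s for ki in k]
    return [sum(w[i] * Y[i][t] for i in range(len(Y)))
            for t in range(len(Y[0]))]
```

Shrinking one h[t] makes the estimate more sensitive to differences in that input dimension; the normalizing constant of the kernel cancels out in the weights and can therefore be omitted.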
3.5 General Regression Neural Network Model
The General Regression Neural Network (GRNN) was proposed in [35] as a variant of the radial basis function NN. It is a memory-based network where each neuron corresponds to one training x-pattern. GRNN provides a smooth approximation of the target function even with sparse data in a multidimensional space. Due to its one-pass learning, it learns very fast compared to other NNs. Other advantages of GRNN are its easy tuning and highly parallel structure.
As shown in Fig. 6, GRNN is composed of four layers: input, pattern (radial basis) layer, summation layer and output layer. The pattern layer transforms the inputs nonlinearly using Gaussian activation functions of the form:
(18) G_i(x) = \exp\left( -\frac{\|x - x_i\|^2}{\sigma_i^2} \right)

where \|\cdot\| is the Euclidean norm and \sigma_i is the bandwidth for the i-th pattern.
The Gaussian functions are centered at the training patterns x_i. The i-th neuron output expresses the similarity between the query pattern and the i-th training pattern. This output is treated as the weight of the i-th y-pattern. So the pattern layer maps the n-dimensional input space into an N-dimensional space of similarity. The weighting function implemented in GRNN is defined as [29]:
(19) w(x, x_i) = \frac{G_i(x)}{\sum_{j=1}^{N} G_j(x)}
The performance of GRNN is related to the bandwidths governing the smoothness of the regression function (10). Note that in the GRNN model each neuron has its own bandwidth \sigma_i, which allows us to control the weight of each y-pattern individually. As in the case of the other PSFMs presented above, the selection of the optimal bandwidth values is a key problem in GRNN.
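In this setting the GRNN forward pass reduces to a few lines (illustrative Python; the summation and output layers are folded into the normalized weighted average):

```python
import math

def grnn_forecast(x_query, X, Y, sigmas):
    """GRNN output for a query x-pattern: Gaussian pattern-layer
    activations normalized into weights for the training y-patterns."""
    def activation(xi, sigma):
        d2 = sum((q - v) ** 2 for q, v in zip(x_query, xi))
        return math.exp(-d2 / sigma ** 2)
    act = [activation(xi, s) for xi, s in zip(X, sigmas)]
    s = sum(act)
    w = [a / s for a in act]
    return [sum(w[i] * Y[i][t] for i in range(len(Y)))
            for t in range(len(Y[0]))]
```

With equal bandwidths for all neurons this coincides with the fuzzy neighborhood estimate with a Gaussian membership; individual sigmas let each stored pattern act over its own radius.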
4 Simulation Studies
In this section we compare PSFMs on the mid-term load forecasting problem using real-world data: monthly electricity demand time series for 35 European countries. The data are taken from the publicly available ENTSO-E repository (www.entsoe.eu). The longest time series cover the period from 1991 to 2014 (11 of the 35 countries). The others are shorter: 17 years (6 countries), 12 years (4 countries), 8 years (2 countries) and 5 years (12 countries). The goal is to predict the twelve monthly demands for 2014 using historical data. The 35 time series are shown in Fig. 7. As can be seen from this figure, the time series have different levels, trends, variations and yearly shapes.
4.1 Verification of the PSFM assumption
In Section 3.1 the assumption underlying the PSFMs was stated. To confirm this assumption, the null hypothesis was formulated. It is verified for each country using the chi-squared test based on a contingency table showing the joint empirical distribution of the random variables, which are the Euclidean distances between patterns: d(x_i, x_j) and d(y_i, y_j). In this analysis we use patterns defined by (4) and (8).
In the contingency tables, both random variables are divided into five categories containing roughly the same number of observations. Fig. 8 shows the chi-squared statistics for the analyzed countries. The critical value, shown by a dashed line, is 26.30 for the five categories adopted for each random variable and a significance level of 0.05. As can be seen from this figure, all values are higher than or very close to the critical value. This allows us to reject the null hypothesis and justifies the use of PSFMs. The strongest relationships between the random variables are observed for Italy, Germany, Belgium and Luxembourg.
4.2 Comparative Models
As comparative models we use classical statistical forecasting models such as ARIMA and ETS, as well as neural and neuro-fuzzy models:

– ARIMA model implemented in the function auto.arima in the R environment (package forecast). This function implements automatic ARIMA modeling, which combines unit root tests, minimization of the Akaike information criterion (AICc) and maximum likelihood estimation to obtain the optimal ARIMA model [25].

– exponential smoothing state space model [26] implemented in the function ets (R package forecast). This implementation includes many types of ETS models, depending on how the seasonal, trend and error components are taken into account. They can be expressed additively or multiplicatively, and the trend can be damped or not. As in the case of auto.arima, ets returns the optimal model, estimating the model parameters using AICc [25].

– multilayer perceptron (MLP) described in [31]. This model is designed for MTLF. It learns from patterns defined by (4) and (8). It predicts one component of the y-pattern on the basis of the x-pattern; thus a separate MLP is trained for each component. When all components of the y-pattern have been predicted by the set of MLPs, the forecasts of the demands are calculated using (9). The network has one hidden layer with sigmoidal neurons and learns using the Levenberg-Marquardt method with Bayesian regularization to prevent overfitting. The MLP hyperparameters which were adjusted are the number of hidden nodes and the length of the input patterns. We use the Matlab R2018a implementation of MLP (function feedforwardnet from the Neural Network Toolbox).
– adaptive neuro-fuzzy inference system (ANFIS) proposed for MTLF in [30]. It works on patterns (4) and (8). This is a multi-input, single-output model for the prediction of one y-pattern component, so separate models are built and trained for all components. The ANFIS architecture is functionally equivalent to a Sugeno-type fuzzy rule base. The initial membership function parameters in the premise parts of the rules are determined using fuzzy c-means clustering. A hybrid learning method is applied for ANFIS training, which combines least squares for the consequent parameters and backpropagation gradient descent for the premise parameters. The ANFIS hyperparameters which were selected are the number of rules and the length of the input patterns. The Matlab R2018a implementation of ANFIS was used (function anfis from the Fuzzy Logic Toolbox).
– long short-term memory (LSTM) network, where the responses are the training sequences with values shifted by one time step (a sequence-to-sequence regression LSTM network). For multiple time steps, after one step was predicted, the network state was updated and the previous prediction was used as input to produce the prediction for the next time step. LSTM was optimized using the Adam (adaptive moment estimation) optimizer. The number of hidden nodes was the only hyperparameter to be tuned; the other hyperparameters remained at their default values. The experiments were carried out using the Matlab R2018a implementation of LSTM (function trainNetwork from the Neural Network Toolbox).
4.3 Parameter settings and model variants
Taking into account our earlier experiences with PSFMs reported in [17], [16], [30], [31], [32] and [18], we use pattern definitions (4) and (8). In most cases they provided the best accuracy of the models.
One of the main hyperparameters for all proposed PSFMs, as well as for MLP and ANFIS, was the length n of the x-patterns. Although the natural choice for this hyperparameter is the seasonal cycle length, i.e. 12, we tested the models for n in the range from 3 to 24, and finally selected the optimal value of n for each model and each time series.
The k-NN model was used in two variants. The first one assigns the same weight, 1/k, to all neighbors. The second one uses the weighting function defined by (12) and (13) with \alpha = 1 (linear weighting function); this variant is marked as k-NNw. The key hyperparameter in both variants, besides the x-pattern length, is the number of nearest neighbors k. It was searched in the range from 1 to 50.
In FNM the width of the membership function (14) is controlled by a single hyperparameter, calculated from:
(20)
where the reference value in (20) is the median of pairwise distances between x-patterns in the training set, and the multiplier was searched over a grid.
Determining the width on the basis of the median distance calibrates this parameter to the data.
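A minimal sketch of this data-driven calibration, assuming the width is a searched multiple of the median pairwise distance (the exact form of (20) is not reproduced here):

```python
import numpy as np
from itertools import combinations

def width_candidates(X, multipliers):
    """Candidate membership-function widths scaled to the data: each
    candidate is a multiplier times the median of pairwise distances
    between x-patterns (a sketch of the idea behind (20))."""
    dists = [np.linalg.norm(a - b) for a, b in combinations(X, 2)]
    d_med = np.median(dists)
    return [float(m * d_med) for m in multipliers]

X = np.array([[0.0], [1.0], [2.0]])
# pairwise distances are 1, 1, 2, so the median is 1
print(width_candidates(X, [0.5, 1.0, 2.0]))  # [0.5, 1.0, 2.0]
```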
The bandwidth parameters in NWE were searched around starting values determined using the Scott rule [34] proposed for normal product density estimators. The searched bandwidths are generated by scaling these starting values according to:
(22)
where the scaling factor takes values from a grid.
Note that the multidimensional optimization problem of searching the bandwidths was replaced by a simple one-dimensional problem of searching the scaling factor.
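A sketch of this bandwidth search, assuming Scott's rule in its common form h_j = sigma_j * N^(-1/(d+4)) for the starting values and a single scaling factor for the grid (the exact form of (22) may differ):

```python
import numpy as np

def scott_bandwidths(X):
    """Scott-rule starting bandwidths for a normal product kernel:
    h_j = sigma_j * N**(-1/(d+4)) for each input dimension j."""
    n, d = X.shape
    return X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))

def bandwidth_grid(h0, factors):
    """One-dimensional search: scale the whole starting-bandwidth
    vector by a single factor (a sketch of the idea behind (22))."""
    return [c * h0 for c in factors]

X = np.random.default_rng(0).normal(size=(50, 12))  # 50 x-patterns of length 12
h0 = scott_bandwidths(X)
grid = bandwidth_grid(h0, np.linspace(0.5, 1.5, 5))
print(len(grid), grid[0].shape)  # 5 candidate bandwidth vectors of length 12
```

Only the scalar factor is tuned; the relative bandwidths across the 12 input dimensions stay fixed at their Scott-rule proportions.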
For GRNN we assume in this study the same bandwidth for all neurons. The bandwidth was searched according to (20). For MLP the number of hidden neurons was selected from the range from 1 to 10. The number of rules in the ANFIS model was selected from the range from 2 to 13. The number of hidden nodes in the LSTM model was selected from a predefined set. For ARIMA and ETS models we use the default parameter settings implemented in functions auto.arima and ets, respectively.
The optimal values of the hyperparameters for each model and for each of the 35 time series were selected in a grid search procedure using cross-validation.
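The selection procedure can be sketched generically; `cv_error` is a hypothetical callable returning the validation error of a model configured with a given hyperparameter value on one fold:

```python
def grid_search(candidates, folds, cv_error):
    """Select the hyperparameter value minimizing the mean
    cross-validation error over all folds."""
    scores = {p: sum(cv_error(p, f) for f in folds) / len(folds)
              for p in candidates}
    return min(scores, key=scores.get)

# toy example: the error surface is minimized at param = 3
folds = [0, 1, 2]
print(grid_search(range(1, 6), folds, lambda p, f: (p - 3) ** 2 + f))  # 3
```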
Taking into account the y-pattern encoding variants described in Section 2, three variants of each PSFM, as well as of the MLP and ANFIS models, are considered:

1. The basic variant, where the coding variables for y-patterns are the mean and dispersion of the historical sequence. This variant enables us to calculate the forecast of the monthly loads from (9) without additionally forecasting the coding variables.

2. The variant where the coding variables are the mean and dispersion of the forecasted sequence. They are both forecasted independently using the ARIMA model on the basis of their historical values. The symbols of the models in this variant are extended by "+ARIMA", e.g. "kNN+ARIMA", "ANFIS+ARIMA".

3. As in variant 2, the coding variables are the mean and dispersion of the forecasted sequence, but in this case they are forecasted using ETS. The symbols of the models in this variant are extended by "+ETS", e.g. "kNN+ETS", "ANFIS+ETS".
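The encoding and decoding mechanics underlying the three variants can be sketched as follows, assuming a simple mean-and-dispersion normalization (the exact definitions (4), (8) and (9) are not reproduced here):

```python
import numpy as np

def encode(seq, mean, disp):
    """y-pattern: an annual sequence normalized by the coding variables."""
    return (np.asarray(seq) - mean) / disp

def decode(y_pattern, mean, disp):
    """Inverse transform (the role of (9)): rebuild monthly loads from a
    forecasted y-pattern and the coding variables. In the basic variant
    mean/disp come from history; in the +ARIMA/+ETS variants they are
    themselves forecasted."""
    return np.asarray(y_pattern) * disp + mean

seq = np.array([100.0, 120.0, 110.0])
m, s = seq.mean(), seq.std()
y = encode(seq, m, s)
print(np.allclose(decode(y, m, s), seq))  # True
```

The decoding step makes clear why the stability of the coding variables matters: any error in the forecasted mean or dispersion propagates directly into the reconstructed monthly loads.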
4.4 Results and analysis
The PSFMs are deterministic models which return the same results for the same data. The NN-based models, i.e. MLP, ANFIS and LSTM, return different results for the same data due to the stochastic nature of their learning processes. In this study these models were trained 100 times and the final errors were calculated as averages over the 100 independent trials.
Fig. 9 shows the forecasting errors on the test sets (mean absolute percentage error, MAPE) for each country. The rankings of the models are shown in Fig. 10: the ranking on the left is based on the median of the absolute percentage error (APE) and the ranking on the right is based on the average ranks of the models in the rankings for individual countries. Table 1 summarizes the accuracy of the models, showing the median of APE, MAPE, the interquartile range (IQR) of APE and the root mean square error (RMSE), each averaged over all countries.
Model  Median APE  MAPE  IQR  RMSE 

kNN  3.11  5.19  4.13  385.68 
kNNw  2.89  4.99  4.06  368.79 
FNM  2.88  4.88  4.43  354.33 
NWE  2.84  5.00  4.14  352.01 
GRNN  2.87  5.01  4.30  350.61 
kNN+ARIMA  2.88  4.71  4.21  352.42 
kNNw+ARIMA  2.89  4.65  4.02  346.58 
FNM+ARIMA  2.87  4.61  3.83  341.41 
NWE+ARIMA  2.85  4.59  3.74  340.26 
GRNN+ARIMA  2.81  4.60  3.77  345.46 
kNN+ETS  2.72  4.58  3.55  333.27 
kNNw+ETS  2.71  4.47  3.43  327.94 
FNM+ETS  2.64  4.40  3.34  321.98 
NWE+ETS  2.68  4.37  3.20  320.51 
GRNN+ETS  2.64  4.38  3.35  324.91 
MLP  2.97  5.27  3.89  378.81 
MLP+ARIMA  3.12  4.83  4.16  362.03 
MLP+ETS  3.11  4.80  4.12  358.07 
ANFIS  3.56  6.18  4.91  488.75 
ANFIS+ARIMA  3.66  6.05  5.40  473.80 
ANFIS+ETS  3.54  6.32  4.57  464.29 
LSTM  3.73  6.11  4.46  431.83 
ARIMA  3.32  5.65  5.27  463.07 
ETS  3.50  5.05  4.17  374.52 
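The four measures reported in Table 1 can be computed as follows (a straightforward sketch; the actual demand values are assumed non-zero for the percentage errors):

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Error measures used in Table 1: median APE, MAPE,
    interquartile range of APE, and RMSE."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ape = 100 * np.abs((actual - predicted) / actual)  # absolute % errors
    return {
        "median_APE": np.median(ape),
        "MAPE": ape.mean(),
        "IQR": np.percentile(ape, 75) - np.percentile(ape, 25),
        "RMSE": np.sqrt(((actual - predicted) ** 2).mean()),
    }

# toy check: APEs are 10%, 5% and 0%, so MAPE is 5.0
e = forecast_errors([100, 200, 400], [110, 190, 400])
print(round(e["MAPE"], 2))  # 5.0
```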
As can be seen from Fig. 10 and Table 1, the most accurate models are the PSFMs with ETS for forecasting the coding variables. There is no significant difference in errors between them. PSFMs in the basic variants and in the +ARIMA variants were ranked lower in both rankings than the PSFMs+ETS. The largest errors among PSFMs were observed for the simple kNN model with equal weights. The comparative models were ranked as the least accurate. Among them the best one turned out to be MLP, and the worst ones were the ANFIS-based models and LSTM.
Fig. 9 allows us to compare the variants of PSFMs in more detail. The basic variants use the coding variables determined from history, i.e. the mean and dispersion of the historical sequence. In such a case, the stability of the relationships between the means and dispersions of consecutive sequences is very important. When the trend is falling, the y-patterns are encoded with the mean value of the previous sequence, which is higher than the mean of the current one. For a forecasted y-pattern the same, higher value of the mean coding variable is expected. But when the time series, instead of continuing to fall, starts to rise, the relationship between the means of consecutive sequences observed in the history is no longer valid. This results in a wrong forecast which continues the falling trend. A similar problem arises when a rising trend starts to fall in its final part. This problem is observed for many countries, e.g. BA, BE, IT, DK, IE and others (see the increased errors for PSFMs in the basic variants for these countries in Fig. 9). To prevent this situation, the coding variables can be forecasted for the forecasted sequence; we use ARIMA and ETS for this. In this case, the final accuracy depends on the accuracies of three models: the PSFM which predicts the y-pattern, ARIMA or ETS which predicts the mean demand of the sequence, and ARIMA or ETS which predicts the dispersion of the sequence. When the coding variables are predicted with low accuracy, the final error can be higher than in the case of the basic PSFMs (see the graphs for AT, BG and ES in Fig. 9). Note the enormous errors for ME. They are caused by the very irregular character of the time series for ME and an abnormal value of demand for March and April 2013, which is about twice the value typical for these months. No model managed to handle this time series satisfactorily.
Fig. 11 depicts examples of forecasts generated by the models for four countries. For PL the PSFMs produce the most accurate forecasts. Among the PSFMs, the basic variant is slightly better than the +ARIMA and +ETS variants. Note that for PL the classical models, ARIMA and ETS, give outlier forecasts: ARIMA produces overestimated forecasts while ETS produces underestimated ones. ARIMA also gives the most inaccurate forecasts for DE and FR. For GB the forecasts generated by all models are underestimated. This results from the fact that demand in 2014 went up unexpectedly despite the downward trend observed from 2010 to 2013. The opposite situation for FR caused a slight overestimation of the forecasts. Note that for MLP and ANFIS jumps in the forecasted curve are observed. They result from the fact that these models predict only one component of the y-pattern, and to forecast all components we use independent models, so the relationships between the components are ignored. In the case of PSFMs, which generate a multi-output response, these relationships are kept because the forecasted annual cycle is formed by weighted averaging of the shapes of the historical annual cycles.
5 Conclusion
In this work we explored pattern similarity-based models for midterm load forecasting. The key component of these models is the representation of the time series by patterns of its sequences. We defined input and output patterns which unify input and output data. Patterns carry information about the shapes of the annual cycles, filtering out the trend and normalizing the data. The pattern representation of the time series simplifies the relationships between data. As a result, the forecasting model can be simpler and faster to learn.
The PSFMs belong to the class of lazy learning regression models, where the regression function is built by aggregation of the output patterns with weights dependent on the similarity between input patterns. A big advantage of PSFMs is the small number of parameters to adjust. One of them is the input pattern length and another one is the bandwidth parameter deciding about the shape of the weighting function. The simplest PSFM is the kNN model with equal weights for all nearest neighbors of the query pattern. More sophisticated models using weighting functions, such as kNNw, FNM, NWE and GRNN, produce more accurate forecasts. In the variants used in this work, they all have only one hyperparameter controlling the bandwidth of the weighting function and hence the bias-variance tradeoff of the model. Note that a small number of parameters increases the generalization ability and facilitates the model optimization. Another advantage of PSFMs is that there is no need to retrain the model when new data arrive: new data can be immediately added to the training set and are available for the model to produce the forecast. Note that a PSFM generates a vector as an output. Its size, as well as the size of the input pattern, does not affect the number of parameters and the training complexity, as is the case for MLP and ANFIS. The advantages of PSFMs also include robustness to incomplete input information: the models can work using x-patterns with missing components.
The simulation study has shown the high accuracy of PSFMs compared to the classical models, such as ARIMA and ETS, as well as the MLP, ANFIS and LSTM models. The best performance was achieved by the combined models, PSFMs with ETS for forecasting the coding variables. But the basic PSFMs also outperform the comparative models, generating more accurate forecasts while being much simpler and easier to optimize.
The directions of further research on PSFMs are: the introduction of additional input variables to the models, ensembles of the models, the introduction of confidence degrees of the training data to the models, and probabilistic forecasting.
References
 [1] T. Ahmad, H. Chen, Potential of three variant machine-learning models for forecasting district level medium-term and long-term energy demand in smart grid environment, Energy 160 (2018) 1008-1020.
 [2] H.M. Al-Hamadi, S.A. Soliman, Long-term/mid-term electric load forecasting based on short-term correlation and annual growth, Electric Power Syst. Res. 74 (2005) 353-361.
 [3] F. Apadula, A. Bassini, A. Elli, S. Scapin, Relationships between meteorological variables and monthly electricity demand, Appl. Energy 98 (2012) 346-356.
 [4] C.G. Atkeson, A.W. Moore, S. Schaal, Locally weighted learning, Artificial Intelligence Review 11 (1997) 75-113.
 [5] E.H. Barakat, Modeling of nonstationary time-series data. Part II. Dynamic periodic trends, Electr. Power Energy Systems 23 (2001) 63-68.
 [6] J. Bedi, D. Toshniwal, Empirical mode decomposition based deep learning for electricity demand forecasting, IEEE Access 6 (2018) 49144-49156.
 [7] D. Benaouda, F. Murtagh, J.L. Starck, O. Renaud, Wavelet-based nonlinear multiscale decomposition model for electricity load forecasting, Neurocomputing 70(1-3) (2006) 139-154.
 [8] P. Bunnoon, K. Chalermyanont, C. Limsakul, Mid term load forecasting of the country using statistical methodology: Case study in Thailand, International Conference on Signal Processing Systems (2009) 924-928.
 [9] Chang Pei-Chann, Fan Chin-Yuan, Lin Jyun-Jie, Monthly electricity demand forecasting based on a weighted evolving fuzzy neural network approach, Electrical Power and Energy Systems 33 (2011) 17-27.
 [10] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: Concepts and algorithms, Journal of Machine Learning Research 10 (2009) 747-776.
 [11] J.F. Chen, S.K. Lo, Q.H. Do, Forecasting monthly electricity demands: An application of neural networks trained by heuristic algorithms, Information 8(1) (2017) 31.
 [12] E.M. de Oliveira, F.L.C. Oliveira, Forecasting mid-long term electric energy consumption through bagging ARIMA and exponential smoothing methods, Energy 144 (2018) 776-788.
 [13] E. Dogan, Are shocks to electricity consumption transitory or permanent? Sub-national evidence from Turkey, Utilities Policy 41 (2016) 77-84.
 [14] E. Doveh, P. Feigin, L. Hyams, Experience with FNN models for medium term power demand predictions, IEEE Trans. Power Syst. 14(2) (1999) 538-546.
 [15] W. Duch, Similarity-based methods: A general framework for classification, approximation and association, Control and Cybernetics 29(4) (2000) 937-968.
 [16] G. Dudek, Pattern similarity-based methods for short-term load forecasting – Part 2: Models, Applied Soft Computing 36 (2015) 422-441.
 [17] G. Dudek, Pattern similarity-based methods for short-term load forecasting – Part 1: Principles, Applied Soft Computing 37 (2015) 277-287.
 [18] G. Dudek, P. Pełka, Medium-term electric energy demand forecasting using Nadaraya-Watson estimator, Proc. Conf. on Electric Power Engineering EPE'17 (2017) 1-6.
 [19] M.M. Elkateb, K. Solaiman, Y. Al-Turki, A comparative study of medium-weather-dependent load forecasting using enhanced artificial/fuzzy neural network and statistical techniques, Neurocomputing 23 (1998) 3-13.
 [20] M. Gavrilas, I. Ciutea, C. Tanasa, Medium-term load forecasting with artificial neural network models, IEEE Conf. Elec. Dist. Pub. 6 (2001).
 [21] M. Ghiassi, D.K. Zimbra, H. Saidane, Medium term system load forecasting with a dynamic artificial neural network model, Electric Power Systems Research 76 (2006) 302-316.
 [22] E. González-Romera, M.A. Jaramillo-Morán, D. Carmona-Fernández, Monthly electric energy demand forecasting based on trend extraction, IEEE Trans. Power Syst. 21(4) (2006) 1935-1946.
 [23] E. González-Romera, M.A. Jaramillo-Morán, D. Carmona-Fernández, Monthly electric energy demand forecasting with neural networks and Fourier series, Energy Conversion and Management 49 (2008) 3135-3142.
 [24] W.K. Härdle, M. Müller, S. Sperlich, A. Werwatz, Nonparametric and Semiparametric Models, Springer, 2004.
 [25] R.J. Hyndman, G. Athanasopoulos, Forecasting: Principles and Practice, 2nd edition, OTexts, Melbourne, Australia, OTexts.com/fpp2 (2018). Accessed on 4 October 2019.
 [26] R.J. Hyndman, A.B. Koehler, J.K. Ord, R.D. Snyder, Forecasting with Exponential Smoothing: The State Space Approach, Springer, 2008.
 [27] M.S. Kandil, S.M. El-Debeiky, N.E. Hasanien, Long-term load forecasting for fast developing utility using a knowledge-based expert system, IEEE Trans. Power Syst. 17(2) (2002) 491-496.
 [28] N.A. Mohammed, Modelling of unsuppressed electrical demand forecasting in Iraq for long term, Energy 162 (2018) 354-363.
 [29] P. Pełka, G. Dudek, Medium-term electric energy demand forecasting using generalized regression neural network, Proc. Conf. Information Systems Architecture and Technology ISAT 2018, AISC 853, Springer, Cham (2018) 218-227.
 [30] P. Pełka, G. Dudek, Neuro-fuzzy system for medium-term electric energy demand forecasting, Proc. Conf. Information Systems Architecture and Technology ISAT 2017, AISC 655, Springer, Cham (2018) 38-47.
 [31] P. Pełka, G. Dudek, Pattern-based forecasting monthly electricity demand using multilayer perceptron, Proc. Conf. Artificial Intelligence and Soft Computing ICAISC 2019, LNAI 11508, Springer, Cham (2019) 663-672.
 [32] P. Pełka, G. Dudek, Prediction of monthly electric energy consumption using pattern-based fuzzy nearest neighbour regression, Proc. Conf. Computational Methods in Engineering Science CMES'17, ITM Web Conf. 15 (2017).
 [33] X. Qiu, Y. Ren, P.N. Suganthan, G.A.J. Amaratunga, Empirical mode decomposition based ensemble deep learning for load demand time series forecasting, Appl. Soft Comp. 54 (2017) 246-255.
 [34] D.W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley, 1992.
 [35] D.F. Specht, A general regression neural network, IEEE Trans. Neural Networks 2(6) (1991) 568-576.
 [36] L. Suganthi, A.A. Samuel, Energy models for demand forecasting — A review, Renew. Sust. Energy Rev. 16(2) (2012) 1223-1240.
 [37] M. Theodosiou, Forecasting monthly and quarterly time series using STL decomposition, Int. J. Forecast. 27(4) (2011) 1178-1195.
 [38] W. Zhao, F. Wang, D. Niu, The application of support vector machine in load forecasting, Journal of Computers 7(7) (2012) 1615-1622.