I Introduction
Forecasting power demand is crucial to the development of modern power systems. Stable and efficient management, scheduling, and dispatch in power systems rely heavily on precise forecasts of future loads over various time horizons. In particular, short-term load forecasting (STLF) focuses on forecasting loads from several minutes up to one week into the future [1]. Reliable STLF helps utilities and energy providers deal with the challenges posed by the higher penetration of renewable energy and by electricity markets with increasingly complex pricing strategies in future smart grids.
Various STLF methods have been proposed over the years. Models used for STLF include linear or nonparametric regression [2, 3], support vector regression (SVR) [4, 1, 5], and fuzzy-logic approaches [6], among others. Reviews and evaluations of existing methods can be found in [7, 8, 9, 10]. Building STLF systems with artificial neural networks (ANN) has long been one of the mainstream solutions to this task. As early as 2001, a review paper by Hippert et al. surveyed and examined a collection of papers published between 1991 and 1999, and concluded that most of the proposed models were over-parameterized and that the results they offered were not convincing enough [11]. In addition to the fact that the size of a neural network grows rapidly with the number of input variables, hidden nodes, or hidden layers, other criticisms mainly focus on the “overfitting” issue of neural networks [1]. Nevertheless, different types and variants of neural networks have been proposed and applied to STLF, such as radial basis function (RBF) neural networks [12], wavelet neural networks [13, 14], and extreme learning machines (ELM) [15], to name a few.

Recent developments in neural networks, especially deep neural networks, have had a great impact in fields such as computer vision, natural language processing, and speech recognition [16]. Instead of sticking with fixed shallow network structures that take hand-designed features as inputs, researchers are now able to integrate their understanding of different tasks into the network structures. Building blocks such as convolutional neural networks (CNN) [17] and long short-term memory (LSTM) [18] have allowed deep neural networks to be highly flexible and effective. Various techniques have also been proposed so that neural networks with many layers can be trained effectively without vanishing gradients or severe overfitting. Applying deep neural networks to short-term load forecasting is a relatively new topic. Researchers have used restricted Boltzmann machines (RBM) and feed-forward neural networks with multiple layers to forecast demand-side loads and natural gas loads [19, 20]. However, these models become increasingly hard to train as the number of layers increases, so the number of hidden layers is often small (e.g., 2 to 5 layers), which limits the performance of the models.

In this work, we aim to extend existing ANN structures for STLF by adopting state-of-the-art deep neural network structures and implementation techniques. Instead of stacking multiple hidden layers between the input and the output, we learn from the residual network structure proposed in [21] and propose a novel end-to-end neural network model capable of forecasting the loads of the next 24 hours. An ensemble strategy that combines multiple individual networks is also proposed. Further, we extend the model to probabilistic load forecasting by adopting Monte Carlo (MC) dropout (for a comprehensive review of probabilistic electric load forecasting, the reader is referred to [22, 23]). The contributions of this work are threefold. First, a fully end-to-end model based on deep residual networks for STLF is proposed. The proposed model does not require external feature extraction or feature selection algorithms, and only raw data of loads, temperature, and information that is readily available are used as inputs. The results show that the forecasting performance can be greatly enhanced by improving the structure of the neural networks and adopting the ensemble strategy, and that the proposed model has good generalization capability across datasets. To the best of our knowledge, this is the first work that uses deep residual networks for the task of STLF. Second, the building blocks of the proposed model can easily be adapted to existing neural-network-based models to improve forecasting accuracy (e.g., adding residual networks on top of 24-hour forecasts). Third, a new formulation of probabilistic STLF for an ensemble of neural networks is proposed.
The remainder of the paper is organized as follows. In Section II, we formulate the proposed model based on deep residual networks. The ensemble strategy, the MC dropout method, and the implementation details are also provided. In Section III, the results of STLF by the proposed model are presented. We also discuss the performance of the proposed model and compare it with existing methods. Section IV concludes this paper and suggests future work. The source code for the STLF model proposed in this paper is available at https://github.com/yalickj/loadforecastingresnet.
II Short-Term Load Forecasting Based on Deep Residual Networks
In this paper, we propose a day-ahead load forecasting model based on deep residual networks. We first formulate the low-level basic structure, where the inputs of the model are processed by several fully-connected layers to produce preliminary forecasts of 24 hours. The preliminary forecasts are then passed through a deep residual network. After presenting the structure of the deep residual network, some modifications are made to further enhance its learning capability. An ensemble strategy is designed to enhance the generalization capability of the proposed model. The formulation of MC dropout for probabilistic forecasting is also provided.
II-A Model Input and the Basic Structure for Load Forecasting of One Hour
We use the model with the basic structure to give preliminary forecasts of the 24 hours of the next day. Specifically, the inputs used to forecast the load for the $h$-th hour of the next day, $L_h$, are listed in Table I. The values for loads and temperatures are normalized by dividing by the maximum value of the training dataset. The selected inputs allow us to capture both short-term closeness and long-term trends in the load and temperature time series [24]. More specifically, we expect that $L_h^{\text{month}}$, $L_h^{\text{week}}$, $T_h^{\text{month}}$ and $T_h^{\text{week}}$ can help the model identify long-term trends in the time series (the days with the same day-of-week index as the next day are selected as they are more likely to have similar load characteristics [13]), while $L_h^{\text{day}}$ and $T_h^{\text{day}}$ are able to provide short-term closeness and characteristics. The input $L_h^{\text{hour}}$ feeds the loads of the most recent 24 hours to the model. Forecast loads are used to replace the values in $L_h^{\text{hour}}$ that are not available at the time of forecasting, which also helps associate the forecasts of the whole day. Note that the sizes of the above-mentioned inputs can be adjusted flexibly. In addition, one-hot codes for season^1, weekday/weekend distinction, and holiday/non-holiday^2 distinction are added to help the model capture the periodic and unusual temporal characteristics of the load time series.

^1 In this paper, the ranges for Spring, Summer, Autumn, and Winter are March 8th to June 7th, June 8th to September 7th, September 8th to December 7th, and December 8th to March 7th, respectively.
^2 In this paper, we consider three major public holidays, namely Christmas Eve, Thanksgiving Day, and Independence Day, as the activities involved in these holidays have great impacts on the loads. The remaining holidays are treated as non-holidays for simplicity.
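The normalization and the one-hot calendar codes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the order of the entries inside each one-hot code, and the way holidays are passed in (a set of dates) are hypothetical and are not taken from the authors' implementation. The season boundaries are the ones stated in this paper.

```python
from datetime import date


def season_index(d: date) -> int:
    """Return 0, 1, 2, 3 for Spring, Summer, Autumn, Winter (ranges as stated in the paper)."""
    md = (d.month, d.day)
    if (3, 8) <= md <= (6, 7):
        return 0
    if (6, 8) <= md <= (9, 7):
        return 1
    if (9, 8) <= md <= (12, 7):
        return 2
    return 3  # December 8th to March 7th


def one_hot(index: int, size: int) -> list:
    """One-hot vector of the given size with a 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def calendar_codes(d: date, holidays: set) -> list:
    """Concatenate season (4), weekday/weekend (2) and holiday (2) codes.

    Assumed ordering: [weekday, weekend] and [non-holiday, holiday].
    """
    season = one_hot(season_index(d), 4)
    weekend = one_hot(1 if d.weekday() >= 5 else 0, 2)
    holiday = one_hot(1 if d in holidays else 0, 2)
    return season + weekend + holiday


def normalize(values, train_max):
    """Divide by the maximum value observed in the training dataset."""
    return [v / train_max for v in values]
```

For example, `calendar_codes(date(1990, 7, 4), {date(1990, 7, 4)})` yields a summer, weekday, holiday code of length 8.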
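Table I
Input | Size | Description of the Inputs
$L_h^{\text{month}}$ | 6 | Loads of the $h$-th hour of the days that are 4, 8, 12, 16, 20, and 24 weeks prior to the next day
$L_h^{\text{week}}$ | 4 | Loads of the $h$-th hour of the days that are 1, 2, 3, and 4 weeks prior to the next day
$L_h^{\text{day}}$ | 7 | Loads of the $h$-th hour of every day of the week prior to the next day
$L_h^{\text{hour}}$ | 24 | Loads of the most recent 24 hours prior to the $h$-th hour of the next day
$T_h^{\text{month}}$ | 6 | Temperature values of the same time points as $L_h^{\text{month}}$
$T_h^{\text{week}}$ | 4 | Temperature values of the same time points as $L_h^{\text{week}}$
$T_h^{\text{day}}$ | 7 | Temperature values of the same time points as $L_h^{\text{day}}$
$T_h$ | 1 | Temperature of the $h$-th hour of the next day
$S$ | 4 | One-hot code for season
$W$ | 2 | One-hot code for weekday/weekend distinction
$H$ | 2 | One-hot code for holiday/non-holiday distinction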
The structure of the neural network model for load forecasting of one hour is illustrated in Fig. 1. For $L_h^{\text{month}}$, $L_h^{\text{week}}$, $L_h^{\text{day}}$, $T_h^{\text{month}}$, $T_h^{\text{week}}$, and $T_h^{\text{day}}$, we first concatenate the pairs $[L_h^{\text{day}}, T_h^{\text{day}}]$, $[L_h^{\text{week}}, T_h^{\text{week}}]$, and $[L_h^{\text{month}}, T_h^{\text{month}}]$, and connect them with three separate fully-connected layers. The three fully-connected layers are then concatenated and connected with another fully-connected layer denoted as $\mathrm{FC}_2$. For $L_h^{\text{hour}}$, we forward pass it through two fully-connected layers, the second of which is denoted as $\mathrm{FC}_1$. $S$ and $W$ are concatenated to produce two fully-connected layers, one used as part of the input of $\mathrm{FC}_1$, the other used as part of the input of $\mathrm{FC}_2$. $H$ is also connected to $\mathrm{FC}_2$. In order to produce the output $L_h$, we concatenate $\mathrm{FC}_1$, $\mathrm{FC}_2$, and $T_h$, and connect them with a fully-connected layer. This layer is then connected to $L_h$ with another fully-connected layer. All fully-connected layers except the output layer use scaled exponential linear units (SELU) as the activation function.
The adoption of the rectified linear unit (ReLU) has greatly improved the performance of deep neural networks [25]. Specifically, ReLU has the form

$$\mathrm{ReLU}(y_i) = \max(0, y_i), \tag{1}$$

where $y_i$ is the linear activation of the $i$-th node of a layer. A problem with ReLU is that if a unit cannot be activated by any input in the dataset, the gradient-based optimization algorithm is unable to update the weights of the unit, so the unit will never be activated again. In addition, the network becomes very hard to train if a large proportion of the hidden units produce constant zero gradients [26]. This problem can be solved by adding a slope to the negative half axis of ReLU. With this simple modification to the formulation of ReLU on the negative half axis, we get PReLU [27]. The activations of a layer with PReLU as the activation function are obtained by

$$\mathrm{PReLU}(y_i) = \begin{cases} y_i, & y_i > 0 \\ a_i y_i, & y_i \le 0 \end{cases} \tag{2}$$

where $a_i$ is the coefficient controlling the slope of $\mathrm{PReLU}(y_i)$ when $y_i \le 0$. A further modification to ReLU that induces self-normalizing properties is provided in [28], where the activation function of SELU is given by

$$\mathrm{SELU}(y_i) = \lambda \begin{cases} y_i, & y_i > 0 \\ \alpha e^{y_i} - \alpha, & y_i \le 0 \end{cases} \tag{3}$$

where $\lambda$ and $\alpha$ are two tunable parameters. It is shown in [28] that if we have $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$, the outputs of the layers in a fully-connected neural network approach the standard normal distribution when the inputs follow the standard normal distribution. This helps the network avoid vanishing and exploding gradients.
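The three activation functions discussed above are short enough to write out directly. The following is a minimal scalar sketch in plain Python; the function and constant names are illustrative choices, and the SELU constants are the rounded values from [28].

```python
import math

# Rounded SELU constants from the self-normalizing networks paper [28].
SELU_LAMBDA = 1.0507
SELU_ALPHA = 1.6733


def relu(y: float) -> float:
    """ReLU: zero on the negative half axis, identity otherwise."""
    return max(0.0, y)


def prelu(y: float, a: float) -> float:
    """PReLU: slope `a` on the negative half axis, identity otherwise."""
    return y if y > 0 else a * y


def selu(y: float, lam: float = SELU_LAMBDA, alpha: float = SELU_ALPHA) -> float:
    """SELU: scaled identity for y > 0, scaled exponential curve for y <= 0."""
    return lam * (y if y > 0 else alpha * math.exp(y) - alpha)
```

Unlike ReLU, both PReLU and SELU keep a non-zero gradient for negative inputs, and SELU saturates at $-\lambda\alpha$ for very negative inputs.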
As previously mentioned, in order to associate the forecasts of the 24 hours of the next day, the corresponding values within $L_h^{\text{hour}}$ are replaced by $L_{h'}$ for $h' < h$. Instead of simply copying the values, we maintain the neural network connections underneath them. Thus, the gradients of subsequent hours can be propagated backward through time. This helps the model adjust the forecast value of each hour given the inputs and the forecast values of the other hours.
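The replacement of unavailable loads by earlier forecasts can be illustrated with a small inference-time sketch. It only shows how the 24-hour input window is assembled; it does not reproduce the gradient connections between hours that the paper maintains during training. The callable `forecast_one_hour` is a hypothetical stand-in for the basic-structure model of one hour.

```python
def forecast_day(last_24_loads, forecast_one_hour):
    """Forecast 24 hours, feeding earlier forecasts back into the recent-load window.

    last_24_loads: the 24 most recent known loads, oldest first.
    forecast_one_hour: hypothetical callable (hour_index, window) -> forecast load,
        where `window` always holds the 24 most recent values (known or forecast).
    """
    history = list(last_24_loads)
    forecasts = []
    for h in range(24):
        # Hours of the next day before h are not observed yet,
        # so their forecasts take the place of the missing loads.
        window = (history + forecasts)[-24:]
        forecasts.append(forecast_one_hour(h, window))
    return forecasts
```

For the first hour the window contains only observed loads; for the last hour it contains one observed load and 23 forecasts.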
We then concatenate $[L_1, \ldots, L_{24}]$ as $L$, which directly becomes the output of the model with the basic structure. Next, we proceed to formulate the deep residual network and add it on top of $L$. The output of the deep residual network is denoted as $\hat{y}$ and has the same size as $L$.
II-B The Deep Residual Network Structure for Day-Ahead Load Forecasting
In [21], an innovative way of constructing deep neural networks for image recognition is proposed. In this paper, the residual block in Fig. 2 is used to build the deep neural network structure. In the residual block, instead of learning a mapping from $x$ to $\mathcal{H}(x)$, a mapping from $x$ to $\mathcal{F}(x, \Theta)$ is learned, where $\Theta$ is a set of weights (and biases) associated with the residual block. Thus, the overall representation of the residual block becomes

$$\mathcal{H}(x) = x + \mathcal{F}(x, \Theta). \tag{4}$$

A deep residual network can be easily constructed by stacking a number of residual blocks. We illustrate in Fig. 3 the structure of the deep residual network (ResNet) used for the proposed model. More specifically, if $K$ residual blocks are stacked, the forward propagation of such a structure can be represented by

$$x_K = x_0 + \sum_{i=1}^{K} \mathcal{F}(x_{i-1}, \Theta_i), \tag{5}$$

where $x_0$ is the input of the residual network, $x_K$ the output of the residual network, and $\Theta_i = \{\theta_{i,j} \mid 1 \le j \le J\}$ the set of weights associated with the $i$-th residual block, $J$ being the number of layers within the block. The back propagation of the overall loss of the neural network to $x_0$ can then be calculated as

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_K}\frac{\partial x_K}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_K}\left(1 + \frac{\partial}{\partial x_0}\sum_{i=1}^{K} \mathcal{F}(x_{i-1}, \Theta_i)\right), \tag{6}$$

where $\mathcal{L}$ is the overall loss of the neural network. The “1” in the equation indicates that the gradients at the output of the network can be directly back-propagated to the input of the network, so that the vanishing of gradients (which is often observed when the gradients at the output have to go through many layers before reaching the input) is much less likely to occur [29]. As a matter of fact, this equation can also be applied to any pair $(x_i, x_j)$ ($0 \le i < j \le K$), where $x_i$ and $x_j$ are the outputs of the $i$-th residual block (or the input of the network when $i = 0$) and the $j$-th residual block, respectively.
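The forward pass of stacked residual blocks can be sketched in plain Python as follows. This is a simplified illustration, not the authors' Keras implementation: biases are omitted, each block is assumed to consist of one SELU hidden layer followed by a linear layer, and the weight layout (lists of rows) is a hypothetical choice.

```python
import math


def selu(y, lam=1.0507, alpha=1.6733):
    """SELU activation with the rounded constants from [28]."""
    return lam * (y if y > 0 else alpha * math.exp(y) - alpha)


def residual_block(x, w1, w2):
    """Return x + F(x), where F is a SELU hidden layer followed by a linear layer.

    w1: one row of len(x) weights per hidden node.
    w2: one row of len(w1) weights per output node, so F(x) has the size of x.
    """
    hidden = [selu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    fx = [sum(w * hj for w, hj in zip(row, hidden)) for row in w2]
    return [xi + fi for xi, fi in zip(x, fx)]


def resnet_forward(x0, blocks):
    """Apply a list of (w1, w2) residual blocks in sequence."""
    x = list(x0)
    for w1, w2 in blocks:
        x = residual_block(x, w1, w2)
    return x
```

With all weights set to zero every block reduces to the identity, which is the shortcut path responsible for the “1” in the gradient expression.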
In addition to the stacked residual blocks, extra shortcut connections can be added to the deep residual network, as introduced in [30]. Concretely, two levels of extra shortcut connections are added to the network. The lower-level shortcut connection bypasses several adjacent residual blocks, while the higher-level shortcut connection is made between the input and the output. If more than one shortcut connection reaches a residual block or the output of the network, the values from the connections are averaged. Note that after adding the extra shortcut connections, the formulations of the forward propagation of responses and the back propagation of gradients are slightly different, but the characteristics of the network that we care about remain unchanged.
We can further improve the learning ability of ResNet by modifying its structure. Inspired by the convolutional network structures proposed in [31, 32], we propose the modified deep residual network (ResNetPlus), whose structure is shown in Fig. 4. First, we add a series of side residual blocks to the model (the residual blocks on the right). Unlike the implementation in [32], the input of the side residual blocks is the output of the first residual block on the main path (except for the first side residual block, whose input is the input of the network). The output of each main residual block is averaged with the output of the side residual block in the same layer (indicated by the blue dots on the right). Similar to the densely connected network in [31], the outputs of those blue dots are connected to all main residual blocks in subsequent layers. Starting from the second layer, the input of each main residual block is obtained by averaging all connections from the blue dots on the right together with the connection from the input of the network (indicated by the blue dots on the main path). It is expected that the additional side residual blocks and the dense shortcut connections can improve the representation capability and the efficiency of error backpropagation of the network. Later in this paper, we will compare the performance of the basic structure, the basic structure connected with ResNet, and the basic structure connected with ResNetPlus.
II-C The Ensemble Strategy of Multiple Models
It is widely acknowledged in the field of machine learning that an ensemble of multiple models has higher generalization capability [16] than individual models. In [33], an analysis of neural network ensembles for STLF of office buildings is provided. The results show that an ensemble of neural networks reduces the variance of performance. The ensemble strategy used in this paper is illustrated in Fig. 5. More specifically, the ensemble strategy consists of two stages.

The first stage of the strategy takes several snapshots during the training of a single model. In [34], the authors show that setting cyclic learning rate schedules for the stochastic gradient descent (SGD) optimizer greatly improves the performance of existing deep neural network models. In this paper, as we use Adam (short for adaptive moment estimation [35]) as the optimizer, the learning rate for each iteration is decided adaptively. Thus, we do not set any learning rate schedule ourselves. This scheme is similar to the NoCycle snapshot ensemble method discussed in [34]; that is, we take several snapshots of the same model during its training process (e.g., the 4 snapshots along the training process of the model with initial parameters $W_0^{(1)}$). As indicated in Fig. 5, the snapshots are taken after an appropriate number of epochs, so that the losses of the snapshots are at a similar level.
In the second stage, we further ensemble a number of models that are trained independently. This is done by simply re-initializing the parameters of the model (e.g., $W_0^{(1)}$ to $W_0^{(5)}$ are 5 sets of initial parameters sampled from the same distribution used for initializing the model), which is one of the standard practices for obtaining good ensemble models [36]. The numbers of snapshots and retrained models are hyperparameters, which means they can be tuned using the validation dataset. After we obtain all the snapshot models, we average the outputs of the models to produce the final forecast.
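The final averaging step of the ensemble is simple to express in code. The sketch below assumes that every snapshot model has already produced a forecast vector of the same length; the function name is illustrative.

```python
def ensemble_forecast(snapshot_forecasts):
    """Average the forecasts of all snapshot models, hour by hour.

    snapshot_forecasts: list of equally long forecast vectors,
        one per snapshot of each independently trained model.
    """
    n_models = len(snapshot_forecasts)
    return [sum(values) / n_models for values in zip(*snapshot_forecasts)]
```

With, say, 5 independently trained models and 3 snapshots each, `snapshot_forecasts` would contain 15 vectors of 24 hourly values.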
II-D Probabilistic Forecasting Based on Monte Carlo Dropout
If we look at the deep residual network (either ResNet or ResNetPlus) as an ensemble of relatively shallow networks, the increased width and number of connections in the network can provide more shallow networks to form the ensemble model [32]. It is expected that the relatively shallow networks themselves can partially capture the nature of the load forecasting task, and that multiple shallow networks with the same input can give varied outputs. This indicates that the proposed model has the potential to be used for probabilistic load forecasting.
Probabilistic forecasting of time series can be achieved by capturing the uncertainty within the models [37]. From the point of view of Bayesian probability theory, the predictive probability of a Bayesian neural network can be obtained with

$$p(y^* \mid x^*) = \int_{W} p\left(y^* \mid f^{W}(x^*)\right) p(W \mid X, Y)\, dW, \tag{7}$$

where $X$ and $Y$ are the observations we use to train $f^{W}(\cdot)$, a neural network with parameters $W$. The intractable posterior distribution $p(W \mid X, Y)$ is often approximated by various inference methods [37]. In this paper, we use MC dropout [38] to obtain the probabilistic forecasting uncertainty, which is easy and computationally efficient to implement. Specifically, dropout refers to the technique of randomly dropping out hidden units in a neural network during training [39], and a parameter $p$ is used to control the probability that any hidden neuron is dropped out. If we apply dropout stochastically $M$ times at test time and collect the outputs of the network, we can approximate the first term of the forecasting uncertainty, which is

$$\begin{aligned} \mathrm{Var}(y^* \mid x^*) &= \mathrm{Var}\left[\mathbb{E}(y^* \mid W, x^*)\right] + \mathbb{E}\left[\mathrm{Var}(y^* \mid W, x^*)\right] \\ &= \mathrm{Var}\left(f^{W}(x^*)\right) + \sigma^2 \\ &\approx \frac{1}{M}\sum_{m=1}^{M}\left(\hat{y}^*_{(m)} - \bar{\hat{y}}^*\right)^2 + \sigma^2, \end{aligned} \tag{8}$$

where $\hat{y}^*_{(m)}$ is the $m$-th output we obtain, $\bar{\hat{y}}^*$ is the mean of all $M$ outputs, and $\mathbb{E}$ denotes the expectation operator. The second term, $\sigma^2$, measures the inherent noise of the data-generating process. According to [37], $\sigma^2$ can be estimated using an independent validation dataset. We denote the validation dataset with $X' = \{x'_1, \ldots, x'_V\}$, $Y' = \{y'_1, \ldots, y'_V\}$, and estimate $\sigma^2$ by

$$\sigma^2 = \frac{\beta}{V}\sum_{v=1}^{V}\left(y'_v - f^{\hat{W}}(x'_v)\right)^2, \tag{9}$$

where $f^{\hat{W}}(\cdot)$ is the model trained on the training dataset and $\beta$ is a parameter to be estimated, also using the validation dataset.

We need to extend the above estimation procedure to an ensemble of models. Concretely, for an ensemble of neural network models of the same structure, we estimate the first term of (8) with a single model of the same structure trained with dropout. The parameter $\beta$ in (9) is also estimated with this model. More specifically, we find the $\beta$ that provides the best 90% and 95% interval forecasts on the validation dataset. $\sigma^2$ is estimated by replacing $f^{\hat{W}}(\cdot)$ in (9) with the ensemble model, $f^{*}(\cdot)$. Note that the estimation of $\sigma^2$ is specific to each hour of the day.

After obtaining the forecasting uncertainty for each forecast, we can calculate the $\alpha$-level interval with the point forecast, $f^{*}(x^*)$, and its corresponding quantiles to obtain probabilistic forecasting results.
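The uncertainty estimate described above can be sketched as three small functions: the sample variance of the MC dropout outputs plus a noise term, the validation-based noise estimate scaled by a tunable factor, and a symmetric interval around the point forecast. This is a hedged illustration only: the function names are hypothetical, and the interval assumes that quantiles are taken from a normal distribution (consistent with the z-scores reported later in the paper).

```python
import math


def forecast_variance(mc_outputs, noise_var):
    """Variance of M stochastic (dropout) outputs plus the inherent noise term."""
    m = len(mc_outputs)
    mean = sum(mc_outputs) / m
    model_var = sum((y - mean) ** 2 for y in mc_outputs) / m
    return model_var + noise_var


def noise_variance(y_valid, y_pred, beta):
    """Scaled mean squared error on a validation set; `beta` is tuned on validation data."""
    v = len(y_valid)
    return beta * sum((y - p) ** 2 for y, p in zip(y_valid, y_pred)) / v


def prediction_interval(point_forecast, variance, z):
    """Symmetric interval: point forecast +/- z standard deviations (normal assumption)."""
    half_width = z * math.sqrt(variance)
    return point_forecast - half_width, point_forecast + half_width
```

Since the noise term is specific to each hour of the day, `noise_variance` would be called once per hour with the validation samples of that hour.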
II-E Model Design and Implementation Details
The proposed model consists of the neural network structure for load forecasting of one hour (referred to as the basic structure), the deep residual network (referred to as ResNet) for improving the forecasts of 24 hours, and the modified deep residual network (referred to as ResNetPlus). The configurations of the models are elaborated as follows.
II-E1 The model with the basic structure
The graphic representation of the model with the basic structure is shown in Fig. 1. Each fully-connected layer for $[L_h^{\text{day}}, T_h^{\text{day}}]$, $[L_h^{\text{week}}, T_h^{\text{week}}]$, $[L_h^{\text{month}}, T_h^{\text{month}}]$, and $L_h^{\text{hour}}$ has 10 hidden nodes, while the fully-connected layers for $[S, W]$ have 5 hidden nodes. $\mathrm{FC}_1$, $\mathrm{FC}_2$, and the fully-connected layer before $L_h$ have 10 hidden nodes. All layers except the output layer use SELU as the activation function.
II-E2 The deep residual network (ResNet)
ResNet is added to the neural network with the basic structure. Each residual block has a hidden layer with 20 hidden nodes and SELU as the activation function. The size of the outputs of the blocks is 24, which is the same as that of the inputs. A total of 30 residual blocks are stacked, forming a 60-layer deep residual network. The second level of shortcut connections is made every 5 residual blocks. The shortcut path of the highest level connects the input and the output of the network.
II-E3 The modified deep residual network (ResNetPlus)
The structure of ResNetPlus follows the structure shown in Fig. 4. The hyperparameters inside the residual blocks are the same as those of ResNet.
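In order to properly train the models, the loss of the model, $\mathcal{L}$, is formulated as the sum of two terms:

$$\mathcal{L} = \mathcal{L}_E + \mathcal{L}_R, \tag{10}$$

where $\mathcal{L}_E$ measures the error of the forecasts, and $\mathcal{L}_R$ is an out-of-range penalty term used to accelerate the training process. Specifically, $\mathcal{L}_E$ is defined as

$$\mathcal{L}_E = \frac{1}{NH}\sum_{i=1}^{N}\sum_{h=1}^{H}\frac{\left|\hat{y}_{(i,h)} - y_{(i,h)}\right|}{y_{(i,h)}}, \tag{11}$$

where $\hat{y}_{(i,h)}$ and $y_{(i,h)}$ are the output of the model and the actual normalized load for the $h$-th hour of the $i$-th day, respectively, $N$ the number of data samples, and $H$ the number of hourly loads within a day (i.e., $H = 24$ in this case). This error measure, widely known as the mean absolute percentage error (MAPE), is also used to evaluate the forecast results of the models. The second term, $\mathcal{L}_R$, is calculated as

$$\mathcal{L}_R = \frac{1}{2N}\sum_{i=1}^{N}\left[\max\left(0,\ \max_h \hat{y}_{(i,h)} - \max_h y_{(i,h)}\right) + \max\left(0,\ \min_h y_{(i,h)} - \min_h \hat{y}_{(i,h)}\right)\right]. \tag{12}$$

This term penalizes the model when the forecast daily load curves are out of the range of the actual load curves, thus accelerating the beginning stage of the training process. When a model is able to produce forecasts with relatively high accuracy, this term serves to emphasize the cost of overestimating the peaks and the valleys of the load curves.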
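The two-term loss can be sketched in plain Python as follows. The MAPE term follows its standard definition. The out-of-range penalty is written here in one form that matches the verbal description above (a cost whenever the forecast daily peak exceeds the actual peak or the forecast daily valley falls below the actual valley); the exact scaling of that term is an assumption of this sketch, and all function names are illustrative.

```python
def mape_loss(y_hat, y):
    """Mean absolute percentage error over N days of H hourly loads (as a fraction)."""
    n, h = len(y), len(y[0])
    total = sum(
        abs(p - a) / a
        for day_hat, day in zip(y_hat, y)
        for p, a in zip(day_hat, day)
    )
    return total / (n * h)


def out_of_range_penalty(y_hat, y):
    """Penalize forecast daily curves that leave the [min, max] range of the actual curves.

    The 1 / (2N) scaling is an assumption of this sketch.
    """
    n = len(y)
    total = 0.0
    for day_hat, day in zip(y_hat, y):
        total += max(0.0, max(day_hat) - max(day))  # forecast peak too high
        total += max(0.0, min(day) - min(day_hat))  # forecast valley too low
    return total / (2 * n)


def total_loss(y_hat, y):
    """Sum of the error term and the out-of-range penalty term."""
    return mape_loss(y_hat, y) + out_of_range_penalty(y_hat, y)
```

A forecast that stays inside the range of the actual daily curve incurs no penalty, so the second term fades as the model becomes accurate.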
All the models are trained using the Adam optimizer with default parameters as suggested in [35]. The models are implemented using Keras 2.0.2 with TensorFlow 1.0.1 as the backend in a Python 3.5 environment [40, 41]. A laptop with an Intel Core i7-5500U CPU is used to train the models. Training the ResNetPlus model with three years of data for 700 epochs takes approximately 1.5 hours. When 5 individual models are trained, the total training time is less than 8 hours.
III Results and Discussion
In this section, we use the North-American Utility dataset^3 and the ISO-NE dataset^4 to verify the effectiveness of the proposed model. As we use actual temperature as the input, we further modify the temperature values to evaluate the performance of the proposed model. Results of probabilistic forecasting on the North-American Utility dataset and the GEFCom2014 dataset [42] are also provided.

^3 Available at https://class.ee.washington.edu/555/elsharkawi.
^4 Available at https://www.isone.com/isoexpress/web/reports/loadanddemand.
III-A Performance of the Proposed Model on the North-American Utility Dataset
The first test case uses the North-American Utility dataset. This dataset contains load and temperature data at one-hour resolution for a North-American utility. The dataset covers the time range between January 1st, 1985 and October 12th, 1992. The data of the two-year period prior to October 12th, 1992 is used as the test set, and the data prior to the test set is used for training the model. More specifically, two starting dates, namely January 1st, 1986 and January 1st, 1988, are used for the training sets. As the latter starting date is used in experiments in the literature, we tune the hyperparameters using the last 10% of the training set with this starting date^5. The model trained with the training set containing 2 years of extra data has the same hyperparameters.

^5 For this dataset, 4 snapshots are taken between 1200 and 1350 epochs for 8 individual models. For the basic structure, all layers except the input and the output layers are shared for the 24 hours (sharing weights for 24 hours is only implemented in this test case). The ResNetPlus model has 30 layers on the main path.
Before reporting the performance of the ensemble model obtained by combining multiple individual models, we first look at the performance of the three models mentioned in Section II. The test losses of the three models are shown in Fig. 6 (the models are trained with the training set starting on January 1st, 1988). In order to yield credible results, we train each model 5 times and average the losses to obtain the solid lines in the figure. The coloured areas indicate the range between one standard deviation above and below the average losses. It is observed in the figure that ResNet is able to improve the performance of the model, and a further reduction in loss can be achieved when ResNetPlus is implemented. Note that the results reported in the remainder of this paper are all obtained with the ensemble model. For simplicity, the ensemble model with the basic structure connected with ResNetPlus is referred to as “the ResNetPlus model” hereinafter.
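Table II: Test MAPEs of the proposed model and existing models on the North-American Utility dataset
Model | Actual temperature | Modified temperature
WT-NN [43] | |
WT-NN [44] | |
ESN [45] | |
SSA-SVR [1] | |
WT-ELM-MABC [46] | |
CLPSO-MA-SVR [47] | |
WT-ELM-LM [48] | |
Proposed model | |
Proposed model (2 extra years) | |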
We compare the results of the proposed ResNetPlus model with those of the existing models proposed in [1, 43, 44, 45, 46, 47, 48], as shown in Table II. In order to estimate the performance of the models when forecast temperature is used, we also add Gaussian noise with mean 0 °F and standard deviation 1 °F to the temperature input and report the MAPE in this case. It is seen in the table that the proposed model outperforms existing models that depend heavily on external feature extraction, feature selection, or hyperparameter optimization techniques. The proposed model also has a smaller increase of MAPE when modified temperature is applied. In addition, the test loss can be further reduced when more data is added to the training set.
III-B Performance of the Proposed Model on the ISO-NE Dataset
The second task of the paper is to examine the generalization capability of the proposed model. To this end, we use the majority of the hyperparameters of ResNetPlus tuned with the North-American Utility dataset to train load forecasting models for the ISO-NE dataset (the time range of the dataset is between March 2003 and December 2014). Here, the ResNetPlus structure has 10 layers on the main path.

The first test case is to predict the daily loads of the year 2006 in the ISO-NE dataset. For the proposed ResNetPlus model, the training period is from June 2003 to December 2005^6 (we reduce the size of $L_h^{\text{month}}$ and $T_h^{\text{month}}$ to 3 so that more training samples can be used, and the rest of the hyperparameters are unchanged). In comparison, the similar day-based wavelet neural network (SIWNN) model in [13] is trained with data from 2003 to 2005, while the models proposed in [49] and [46] use data from March 2003 to December 2005 (both models use past loads up to 200 hours prior to the hour to be predicted). The MAPEs for each month are listed in Table III. The MAPEs for the 12 months in 2006 are not explicitly reported in [49]. It is seen in the table that the proposed ResNetPlus model has the lowest overall MAPE for the year 2006. For some months, however, the WT-ELM-MABC model proposed in [46] produces better results. Nevertheless, as most of the hyperparameters are not tuned on the ISO-NE dataset, we can conclude that the proposed model has good generalization capability across different datasets.

^6 The training dataset is used to determine how the snapshots are taken for the ensemble model for the ISO-NE dataset. For each implementation, 5 individual models are trained, and the snapshots are taken at 600, 650, and 700 epochs.

We further test the generalization capability of the proposed ResNetPlus model on data of the years 2010 and 2011. The same model as for the year 2006 is used for this test case, and historical data from 2004 to 2009 is used to train the model. In Table IV, we report the performance of the proposed model and compare it with the models in [50, 12, 49]. The results show that the proposed ResNetPlus model outperforms existing models with respect to the overall MAPE for the two years, and an improvement of 8.9% is achieved for the year 2011. Note that all the existing models are specifically tuned on the ISO-NE dataset for the period from 2004 to 2009, while the design of the proposed ResNetPlus model is directly implemented without any tuning.
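Table III: MAPEs for each month of the year 2006 on the ISO-NE dataset
Month | SIWNN [13] | WT-ELM-PLSR [49] | WT-ELM-MABC [46] | Proposed model
Jan | | | |
Feb | | | |
Mar | | | |
Apr | | | |
May | | | |
Jun | | | |
Jul | | | |
Aug | | | |
Sep | | | |
Oct | | | |
Nov | | | |
Dec | | | |
Average | | | |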
Table IV: MAPEs for the years 2010 and 2011 on the ISO-NE dataset
Model | 2010 | 2011
RBFN-ErrCorr original [50] | |
RBFN-ErrCorr modified [12] | |
WT-ELM-PLSR [49] | |
Proposed model | |
As we use actual temperature values for the input of the proposed model (except for the “modified temperature” case of the North-American Utility dataset), the results obtained so far provide an estimated upper bound of the performance of the model. Thus, we need to further analyze how the proposed model would perform when forecast temperature data is used, and whether the ensemble model is more robust to noise in forecast weather. We follow the way of modifying temperature values introduced in [43], and consider three cases of temperature modification:
Case 1: add Gaussian noise with mean 0 °F and standard deviation 1 °F to the original temperature values before normalization.
Case 2: add Gaussian noise with mean 0 °F, and change the standard deviation of case 1 to 2 °F.
Case 3: add Gaussian noise with mean 0 °F, and change the standard deviation of case 1 to 3 °F.
For all three cases, we repeat the trials 5 times and calculate the means and standard deviations of increased MAPE compared with the case where actual temperature data is used.
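The three temperature modification cases and the repeated-trial statistics can be sketched as below. The function names are hypothetical, the random seed handling is only for reproducibility of the sketch, and the use of the sample standard deviation is an assumption (the paper does not state which estimator is used).

```python
import random
import statistics


def add_temperature_noise(temps_f, std_f, seed=None):
    """Add zero-mean Gaussian noise (in degrees Fahrenheit) before normalization."""
    rng = random.Random(seed)
    return [t + rng.gauss(0.0, std_f) for t in temps_f]


def summarize_increase(mape_increases):
    """Mean and sample standard deviation of the MAPE increase over repeated trials."""
    return statistics.mean(mape_increases), statistics.stdev(mape_increases)


# Standard deviations for cases 1 to 3, as listed above.
CASE_STD_F = {1: 1.0, 2: 2.0, 3: 3.0}
```

Each trial would call `add_temperature_noise` with a different seed, re-run the trained model on the noisy temperatures, and record the MAPE increase relative to the actual-temperature run.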
The increases in test MAPE for the year 2006 with modified temperature values are shown in Fig. 7. We compare the performance of the proposed ResNetPlus model (which is an ensemble of 15 single snapshot models) with that of a single snapshot model trained for 700 epochs. As can be seen in the figure, the ensemble strategy greatly reduces the increase of MAPE, especially for case 1, where the increase of MAPE is 0.0168. As the smallest increase of MAPE reported for case 1 in [1] is 0.04, it is reasonable to conclude that the proposed model is robust against the uncertainty of temperature for case 1 (as we use a different dataset here, the results are not directly comparable). It is also observed that the ensemble strategy is able to reduce the standard deviation over multiple trials. This also indicates the higher generalization capability of the proposed model with the ensemble strategy.
III-C Probabilistic Forecasting for the Ensemble Model
Table V: Empirical coverages of the proposed model with MC dropout
z-score | Expected Coverage | Empirical Coverage
1.000 | |
1.280 | |
1.645 | |
1.960 | |
We first use the North-American Utility dataset to demonstrate probabilistic STLF by MC dropout. The last year of the dataset is used as the test set and the previous year is used for validation. Dropout with probability $p$ is added to the previously implemented ensemble model^7, except for the input layer and the output layer (values of $p$ ranging from 0.05 to 0.2 produce similar results, in line with the results reported in [38]). The first term in (8) and $\beta$ are estimated by a single model trained for 500 epochs, and the estimated value of $\beta$ is 0.79.

^7 The model implemented here uses ResNet instead of ResNetPlus, and the information of season, weekday/weekend distinction, and holiday/non-holiday distinction is not used. In addition, the activation function used for the residual blocks is ReLU.

The empirical coverages produced by the proposed model with respect to different z-scores are listed in Table V, and an illustration of the 95% prediction intervals for two weeks in 1992 is provided in Fig. 8. The results show that the proposed model with MC dropout is able to give satisfactory empirical coverages for different intervals.
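The relation between a z-score and its expected coverage, and the computation of an empirical coverage, can be sketched as follows. The expected coverage is the probability mass of a normal distribution within plus or minus z standard deviations; the function names are illustrative and the interval bounds are assumed to be given per forecast.

```python
import math


def expected_coverage(z):
    """Two-sided coverage of a normal distribution within +/- z standard deviations."""
    return math.erf(z / math.sqrt(2.0))


def empirical_coverage(actuals, lowers, uppers):
    """Fraction of actual loads that fall inside their prediction intervals."""
    inside = sum(1 for y, lo, up in zip(actuals, lowers, uppers) if lo <= y <= up)
    return inside / len(actuals)
```

For instance, z-scores of 1.645 and 1.960 correspond to expected coverages of about 90% and 95%, which are the interval levels used when tuning on the validation dataset.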
Table VI
Model  Pinball  Winkler (50%)  Winkler (90%)
Lasso [51]
Ind [23]
QRA [23]
Proposed model
In order to quantify the performance of the probabilistic STLF by MC dropout, we adopt the pinball loss and Winkler score mentioned in [23] and use them to assess the proposed method in terms of coverage rate and interval width. Specifically, the pinball loss is averaged over all quantiles and hours in the prediction range, and the Winkler scores are averaged over all the hours of the year in the test set. We implement the ResNetPlus model on the GEFCom2014 dataset and compare the results with those reported in [23, 51]. For this model, five individual models are trained with a dropout rate of 0.1 and 6 snapshots are taken from 100 epochs to 350 epochs. is set to 100 for MC dropout and the first term in (7) is estimated by a single model trained with 100 epochs. The estimated value of is 0.77. Following the setting in [23], the load and temperature data from 2006 to 2009 are used to train the proposed model, the data of the year 2010 are used for validation, and the test results are obtained using data of the year 2011. The temperature values used for the input of the model are calculated as the mean of the temperature values of all 25 weather stations in the dataset.
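The two scores can be sketched with their standard definitions. The pinball loss penalizes a predicted quantile asymmetrically around the actual value, and the Winkler score of a (1 − α) interval is its width plus a penalty of 2/α times the miss distance when the actual value falls outside. Deriving quantiles and intervals from a Gaussian with a predicted mean and standard deviation, the quantile grid 0.01 to 0.99, and all function names are assumptions made for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used to turn levels into z-values


def pinball_loss(y, q_pred, tau):
    """Pinball loss of predicted quantile q_pred at level tau for actual value y."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def winkler_score(y, lower, upper, alpha):
    """Winkler score of a central (1 - alpha) interval [lower, upper]."""
    width = upper - lower
    if y < lower:
        return width + 2.0 * (lower - y) / alpha
    if y > upper:
        return width + 2.0 * (y - upper) / alpha
    return width


def mean_pinball(actuals, means, stds, taus=None):
    """Average pinball loss over all quantile levels and all hours."""
    taus = taus or [i / 100.0 for i in range(1, 100)]
    total = 0.0
    for y, m, s in zip(actuals, means, stds):
        for tau in taus:
            total += pinball_loss(y, m + s * _STD_NORMAL.inv_cdf(tau), tau)
    return total / (len(actuals) * len(taus))


def mean_winkler(actuals, means, stds, alpha):
    """Average Winkler score of Gaussian (1 - alpha) intervals over all hours."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    scores = [
        winkler_score(y, m - z * s, m + z * s, alpha)
        for y, m, s in zip(actuals, means, stds)
    ]
    return sum(scores) / len(scores)
```

With this convention, `alpha=0.5` corresponds to the 50% Winkler score and `alpha=0.1` to the 90% score; lower values are better for both metrics.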
In Table VI, we present the values of pinball loss and Winkler scores for the proposed model and the models in [23, 51] for the year of 2011 in the GEFCom2014 dataset. The Lasso method in [51] serves as a benchmark for methods that build regression models on the input data, and the quantile regression averaging (QRA) method in [23] builds quantile regression models on sister point forecasts (the row of Ind stands for the performance of a single model). It can be seen in Table VI that the proposed ResNetPlus model is able to provide improved probabilistic forecasting results compared with existing methods in terms of the pinball loss and two Winkler scores. As we obtain the probabilistic forecasting results by sampling the trained neural networks with MC dropout, we can conclude that the proposed model is good at capturing the uncertainty of the task of STLF.
IV Conclusion and Future Work
We have proposed an STLF model based on deep residual networks in this paper. The low-level neural network with the basic structure, the ResNetPlus structure, and the two-stage ensemble strategy enable the proposed model to have high accuracy as well as satisfactory generalization capability. Two widely acknowledged public datasets are used to verify the effectiveness of the proposed model with various test cases. Comparisons with existing models have shown that the proposed model is superior in both forecasting accuracy and robustness to temperature variation. We have also shown that the proposed model can be directly used for probabilistic forecasting when MC dropout is adopted.
A number of paths for further work are attractive. As we have only scratched the surface of state-of-the-art deep neural networks, we may incorporate more building blocks of deep neural networks (e.g., CNN or LSTM) into the model to enhance its performance. In addition, we will further investigate the implementation of deep neural networks for probabilistic STLF and make further comparisons with existing methods.
References
[1] E. Ceperic, V. Ceperic, and A. Baric, “A strategy for short-term load forecasting by support vector regression machines,” IEEE Trans. Power Syst., vol. 28, no. 4, pp. 4356–4364, Nov. 2013.
[2] K. B. Song, Y. S. Baek, D. H. Hong, and G. Jang, “Short-term load forecasting for the holidays using fuzzy linear regression method,” IEEE Trans. Power Syst., vol. 20, no. 1, pp. 96–101, Feb. 2005.
[3] W. Charytoniuk, M. S. Chen, and P. V. Olinda, “Nonparametric regression based short-term load forecasting,” IEEE Trans. Power Syst., vol. 13, no. 3, pp. 725–730, Aug. 1998.
[4] E. E. Elattar, J. Goulermas, and Q. H. Wu, “Electric load forecasting based on locally weighted support vector regression,” IEEE Trans. Syst., Man, Cybern., Part C, vol. 40, no. 4, pp. 438–447, Feb. 2010.
[5] J. W. Taylor, “Short-term electricity demand forecasting using double seasonal exponential smoothing,” Journal of the Operational Research Society, vol. 54, no. 8, pp. 799–805, Jul. 2003.
[6] M. Rejc and M. Pantos, “Short-term transmission-loss forecast for the Slovenian transmission power system based on a fuzzy-logic decision approach,” IEEE Trans. Power Syst., vol. 26, no. 3, pp. 1511–1521, Jan. 2011.
[7] E. A. Feinberg and D. Genethliou, “Load forecasting,” in Applied Mathematics for Restructured Electric Power Systems, J. H. Chow, F. F. Wu, and J. Momoh, Eds. Springer, 2005, pp. 269–285.
[8] J. W. Taylor, L. M. D. Menezes, and P. E. Mcsharry, “A comparison of univariate methods for forecasting electricity demand up to a day ahead,” International Journal of Forecasting, vol. 22, no. 1, pp. 1–16, Jan./Mar. 2006.
[9] H. Hahn, S. Meyer-Nieberg, and S. Pickl, “Electric load forecasting methods: Tools for decision making,” European Journal of Operational Research, vol. 199, no. 3, pp. 902–907, Dec. 2009.
[10] Y. Wang, Q. Chen, T. Hong, and C. Kang, “Review of smart meter data analytics: Applications, methodologies, and challenges,” IEEE Transactions on Smart Grid, 2018.
[11] H. S. Hippert, C. E. Pedreira, and R. C. Souza, “Neural networks for short-term load forecasting: A review and evaluation,” IEEE Trans. Power Syst., vol. 16, no. 1, pp. 44–55, Feb. 2001.
[12] C. Cecati, J. Kolbusz, P. Różycki, P. Siano, and B. M. Wilamowski, “A novel RBF training algorithm for short-term electric load forecasting and comparative studies,” IEEE Trans. Ind. Electron., vol. 62, no. 10, pp. 6519–6529, Apr. 2015.
[13] Y. Chen et al., “Short-term load forecasting: similar day-based wavelet neural networks,” IEEE Trans. Power Syst., vol. 25, no. 1, pp. 322–330, Jan. 2010.
[14] Y. Zhao, P. B. Luh, C. Bomgardner, and G. H. Beerel, “Short-term load forecasting: Multilevel wavelet neural networks with holiday corrections,” in Power & Energy Society General Meeting 2009, 2009, pp. 1–7.
[15] R. Zhang, Z. Y. Dong, Y. Xu, K. Meng, and K. P. Wong, “Short-term load forecasting of Australian national electricity market by an ensemble model of extreme learning machine,” IET Generation, Transmission & Distribution, vol. 7, no. 4, pp. 391–397, Jun. 2013.
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[19] S. Ryu, J. Noh, and H. Kim, “Deep neural network based demand side short term load forecasting,” Energies, vol. 10, no. 1, p. 3, 2016.
[20] G. Merkel, R. J. Povinelli, and R. H. Brown, “Deep neural network regression for short-term load forecasting of natural gas,” in 37th Annual International Symposium on Forecasting, 2017.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[22] T. Hong and S. Fan, “Probabilistic electric load forecasting: A tutorial review,” International Journal of Forecasting, vol. 32, no. 3, pp. 914–938, Jul./Sep. 2016.
[23] B. Liu, J. Nowotarski, T. Hong, and R. Weron, “Probabilistic load forecasting via quantile regression averaging on sister forecasts,” IEEE Transactions on Smart Grid, vol. 8, no. 2, pp. 730–737, Jun. 2017.
[24] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, “DNN-based prediction model for spatio-temporal data,” in Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2016, p. 92.
[25] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8609–8613.
[26] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 2013, p. 3.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[28] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 972–981.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[30] K. Zhang, M. Sun, X. Han, X. Yuan, L. Guo, and T. Liu, “Residual networks of residual networks: Multilevel residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2017, doi: 10.1109/TCSVT.2017.2654543, to be published.
[31] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[32] L. Zhao, J. Wang, X. Li, Z. Tu, and W. Zeng, “On the connection of deep fusion to ensembling,” arXiv preprint arXiv:1611.07718, 2016.
[33] M. De Felice and X. Yao, “Short-term load forecasting with neural network ensembles: A comparative study [application notes],” IEEE Comput. Intell. Mag., vol. 6, no. 3, pp. 47–56, Jul. 2011.
[34] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger, “Snapshot ensembles: Train 1, get M for free,” arXiv preprint arXiv:1704.00109, 2017.
[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[36] A. R. Webb, Statistical Pattern Recognition. Chichester, UK: John Wiley & Sons, 2003.
[37] L. Zhu and N. Laptev, “Deep and confident prediction for time series at Uber,” arXiv preprint arXiv:1709.01907, 2017.
[38] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1050–1059.
[39] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, Jan. 2014.
[40] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
[41] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, vol. 16, 2016, pp. 265–283.
[42] T. Hong, P. Pinson, S. Fan, H. Zareipour, A. Troccoli, and R. J. Hyndman, “Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond,” International Journal of Forecasting, vol. 32, no. 3, pp. 896–913, 2016.
[43] A. R. Reis and A. A. Da Silva, “Feature extraction via multiresolution analysis for short-term load forecasting,” IEEE Trans. Power Syst., vol. 20, no. 1, pp. 189–198, Jan. 2005.
[44] N. Amjady and F. Keynia, “Short-term load forecasting of power systems by combination of wavelet transform and neuro-evolutionary algorithm,” Energy, vol. 34, no. 1, pp. 46–57, Jan. 2009.
[45] A. Deihimi and H. Showkati, “Application of echo state networks in short-term electric load forecasting,” Energy, vol. 39, no. 1, pp. 327–340, Mar. 2012.
[46] S. Li, P. Wang, and L. Goel, “Short-term load forecasting by wavelet transform and evolutionary extreme learning machine,” Electric Power Systems Research, vol. 122, pp. 96–103, May 2015.
[47] Z. Hu, Y. Bao, and T. Xiong, “Comprehensive learning particle swarm optimization based memetic algorithm for model selection in short-term load forecasting using support vector regression,” Applied Soft Computing, vol. 25, pp. 15–25, 2014.
[48] S. Li, P. Wang, and L. Goel, “A novel wavelet-based ensemble method for short-term load forecasting with hybrid neural networks and feature selection,” IEEE Transactions on Power Systems, vol. 31, no. 3, pp. 1788–1798, Jun. 2016.
[49] S. Li, L. Goel, and P. Wang, “An ensemble approach for short-term load forecasting by extreme learning machine,” Applied Energy, vol. 170, pp. 22–29, May 2016.
[50] H. Yu, P. D. Reiner, T. Xie, T. Bartczak, and B. M. Wilamowski, “An incremental design of radial basis function networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1793–1803, Feb. 2014.
[51] F. Ziel and B. Liu, “Lasso estimation for GEFCom2014 probabilistic electric load forecasting,” International Journal of Forecasting, vol. 32, no. 3, pp. 1029–1037, 2016.