I Introduction
With the further integration of electric and heat energy systems and the increased use of smart meters at the building level, the analysis of heat load time series plays an increasing role in local smart energy grids.
These time series can be utilized, for example, in load forecasts for scheduling problems or to identify and reduce energy waste.
One important part of energy monitoring is the detection of anomalies, such as unusual consumption, system failures, or measurement errors.
For large-scale monitoring infrastructures, in our case 40 university campus buildings with multiple measured energy forms, manual anomaly detection is hardly feasible on a daily basis.
This makes an automated anomaly detection necessary.
The literature provides various methodologies for anomaly detection in building energy time series, see [7] for an extensive review.
Several works present building level anomaly detection methods based on probabilistic load forecasts.
The main idea is to use a probabilistic model to issue a predictive distribution over the expected load and classify an observed value as anomalous if this value is unlikely under the predicted distribution.
The authors of [12] apply a dynamic regression model to predict the load one hour ahead and select an anomaly threshold from observed percentage errors.
In [2], the load is predicted with an autoregressive neural network and anomaly thresholds are derived from a constant Gaussian error.
The authors of [1] first predict the day-ahead load with a generalized additive model (GAM) based on exogenous variables and then fit an autoregressive conditional heteroscedasticity (ARCH) model to the forecast residuals to obtain a conditional distribution over the forecast errors.
However, these works do not address the specific challenges associated with anomaly detection in heat load time series at the building level: a considerable number of observed values are zero or close to zero; seasonalities and exogenous variables, such as the temperature, have a significant effect on the load; and, since single buildings are considered, random effects, such as user behavior, have a large influence on the observed load. Furthermore, the model must be flexible enough to represent a diverse set of buildings. Finally, the assumption of a standard homoscedastic or heteroscedastic Gaussian error distribution is too simplistic for reliable anomaly detection.
Our proposed anomaly detection methodology consists of three steps. First, we create an ensemble of diverse point forecasting models based on different regression techniques and training set sizes. Second, the forecasts of these models serve as inputs for probabilistic forecast combination techniques to obtain a predictive distribution conditional on the ensemble predictions. Third, these distributions are used to classify the observed load values as normal or anomalous based on a selected interval. This also allows tuning the aggressiveness of the anomaly detection by varying the size of the non-anomaly interval.
The efficacy of our method relies on the quality of the predictive distributions produced by the selected forecasting model. The classic approach to the model selection problem is to pick a single model ex ante based on validation set performance. This can be computationally expensive, especially when one wants to forecast many time series, and requires large data sets to reliably estimate the out-of-sample accuracy. The latter point is especially true for building-level energy time series, as they are usually characterized by a low signal-to-noise ratio and show daily, weekly, and yearly seasonality. Furthermore, the assumption that a single best model for a time series exists at all is questionable, and different models might perform well in different regions of the input space. This motivates our forecast combination approach. While the combination of point forecasts is a well-researched topic, probabilistic forecast combination is considered a current frontier in energy forecasting
[8]. To our knowledge, this work is the first to propose an anomaly detection methodology based on probabilistic load forecast combination techniques. We empirically test our approach on the heat load time series of two real university campus buildings. To quantify the anomaly detection performance, we insert artificial anomalies into the time series. We find that a carefully designed probabilistic forecast combination model based on generalized additive models for location, scale, and shape (GAMLSS) [17] improves over simple averaging approaches and individual models not only in terms of statistical forecast error measures but also leads to significant improvements in anomaly detection performance.
The remainder of the paper is structured as follows. In the next section we formalize the problem and introduce the necessary notation. We then present our ensemble of point forecasting models and the considered probabilistic forecast combination approaches. The setup and results of our empirical study are explained in Section III. We conclude in Section IV.
II Methodology
II-A Problem definition
The anomaly detection should answer the question whether an observed load value $y_t$ at time step $t$ is an anomaly, given the vector of previous observations $\mathbf{y}_{1:t-1}$ and a vector of explanatory variables $\mathbf{x}_t$. The core idea of forecast-based anomaly detection is to classify observed values based on the probability of observing them under a distribution issued by a probabilistic regression model. Given the predictive distribution $F_t$, we can compute the probability that the random variable $Y_t$ lies in the interval $(a, b]$ as $P(a < Y_t \le b) = F_t(b) - F_t(a)$, where $F_t$ is the cumulative distribution function (CDF) of the distribution. Based on this, we define an anomaly detection model that, given a predictive CDF $\hat{F}_t$, classifies $y_t$ as an anomaly if

$\hat{F}_t(y_t) < \alpha_l \quad \text{or} \quad \hat{F}_t(y_t) > \alpha_u \qquad (1)$

holds true, with $\alpha_l$ and $\alpha_u$ being the lower and upper threshold levels of a plausibility interval. To this end, we first need to estimate the CDF of the conditional distribution of $Y_t$. If the distribution belongs to a parametric family, the CDF is usually available in closed form. Alternatively, the CDF could be approximated by a set of predicted quantiles.
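As a minimal sketch of the classification rule (1), the following snippet flags an observation whose predictive-CDF value falls outside the plausibility interval; the Gaussian predictive distribution and its parameters are purely illustrative assumptions:

```python
from statistics import NormalDist

def is_anomaly(y, cdf, alpha_lower=0.01, alpha_upper=0.99):
    """Classify y as anomalous if its value under the predictive CDF
    falls outside the plausibility interval [alpha_lower, alpha_upper]."""
    p = cdf(y)
    return p < alpha_lower or p > alpha_upper

# Hypothetical Gaussian predictive distribution: mean 100 kW, sd 10 kW.
cdf = NormalDist(mu=100.0, sigma=10.0).cdf
print(is_anomaly(98.0, cdf))   # near the mean: not anomalous
print(is_anomaly(140.0, cdf))  # four sigma above the mean: anomalous
```

Varying `alpha_lower` and `alpha_upper` corresponds to tuning the aggressiveness of the detection via the size of the non-anomaly interval.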
In the case of forecast combination, we take a two-step approach to estimate the predictive CDF. We use a set of $K$ point forecasting models from which we obtain the predictions $\hat{y}_t^{(1)}, \dots, \hat{y}_t^{(K)}$. We then use this vector of predictions as input to a second model to obtain the predictive distribution.
We present the deterministic and probabilistic models in the following.
II-B Deterministic forecast ensemble
We arrange the deterministic forecast ensemble from nine forecasts. These are yielded by three different forecasting methods, each trained with a training window of 60, 90, and 365 days. The effectiveness of combining forecasts from different training window sizes is shown in [9], where it is applied to electricity price forecasting. We selected the forecasting methods based on their ability to flexibly model load time series of a diverse group of buildings.
To obtain the forecast time series, we retrain the models each day to predict the hourly load of the complete next day, with the training window consisting of the directly preceding days, i.e., we perform day-ahead forecasting with a rolling training window.
For each model, we consider one hyperparameter as the tuning parameter. We separately employ a grid search to determine the tuning parameter value that minimizes the mean absolute error (MAE) in a rolling-window cross-validation.
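The rolling day-ahead scheme can be sketched as follows; the Lasso model here is only a stand-in for any ensemble member, and the data, window size, and tuning value are synthetic assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def day_ahead_rolling_forecast(X, y, window_days, alpha, hours_per_day=24):
    """Retrain each day on the preceding `window_days` days and predict
    the next day's 24 hourly loads (day-ahead rolling-window forecasting).
    `alpha` is the tuning parameter selected beforehand via grid search."""
    w = window_days * hours_per_day
    preds = []
    for start in range(w, len(y) - hours_per_day + 1, hours_per_day):
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[start - w:start], y[start - w:start])
        preds.append(model.predict(X[start:start + hours_per_day]))
    return np.concatenate(preds)

# Toy usage on synthetic data: 70 days of hourly samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(70 * 24, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=70 * 24)
yhat = day_ahead_rolling_forecast(X, y, window_days=60, alpha=0.01)
print(yhat.shape)  # forecasts for the 10 days after the first 60
```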
We present the three employed deterministic forecast methods in the following.
II-B1 Lasso linear regression (Lasso)
The first method is a linear regression model with Lasso regularization
[21]. It is commonly applied in the forecasting literature, e.g. to electric load forecasting in [23]. Lasso regression performs automatic feature selection and parameter shrinkage, which allows us to flexibly model all building time series without exante assumptions about feature relevancy. We model the load as
$y_t = \beta_0 + \sum_{j=1}^{p} \beta_j x_{j,t} + \varepsilon_t \qquad (2)$
where $\beta_0$ is an intercept coefficient, $\beta_1, \dots, \beta_p$ are the regression coefficients, $x_{j,t}$ the features described below, $y_t$ the observed load, $T_t$ the temperature, and $\varepsilon_t$ the model error, each at time step $t$. We define the heating degree hour as $H_t = \max(T_{\text{ref}} - T_t, 0)$ with a fixed reference temperature $T_{\text{ref}}$, which linearizes the nonlinear temperature effect. Superscripts avg and peak denote the average and the peak value of the respective day. Two binary variables indicate working days and working hours, which we define as from 9:00 to 17:00. We consider the Lasso regularization parameter $\lambda$ as the tuning parameter.

II-B2 Gradient boosting regression (GBR)
The second point forecasting method is gradient-boosted regression trees. This method combines an ensemble of weak estimators, in this case decision trees, to form a strong estimator. The tree structure allows data-driven modeling of feature interactions
[5]. [22] apply GBR to electricity load forecasting. We consider a feature vector (3) whose components include integer values for the hour of the day, the day of the week, and the week of the year.
We use 300 estimators and a fixed learning rate, and minimize the least-squares loss in training. We allow the maximum tree depth to be between 3 and 6 and consider it the tuning parameter.
II-B3 Generalized additive model (GAM)
The third method is based on a GAM, which allows flexible modeling of nonlinear feature effects with penalized B-spline functions [20]. GAMs are applied to electric load forecasting, for example, in [4].
The model reads as
(4) 
where the smooth functions are penalized B-spline functions, defined with 10 B-splines for some effects and 24 B-splines for others; an additional smooth term, defined with 5 B-splines, is only included in the 365-day model. The day of the week is expressed as a set of six dummy variables. Three of the smooth functions are constrained to be monotonically increasing and one to be monotonically decreasing. The different numbers of splines mitigate overfitting of feature effects. The smoothing parameter is set equal for each function and considered the tuning parameter.

The mean forecast is the hourly mean of the nine ensemble forecasts.
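The hourly ensemble mean (and the ensemble standard deviation used as an uncertainty proxy below) can be computed directly from the stacked member forecasts; all values here are hypothetical:

```python
import numpy as np

# Hypothetical hourly forecasts of the nine ensemble members over four
# time steps (3 methods x 3 training windows; values in kW).
member_forecasts = np.array([
    [100., 110., 95., 60.],
    [ 98., 108., 97., 62.],
    [102., 112., 93., 58.],
    [ 99., 111., 96., 61.],
    [101., 109., 94., 59.],
    [100., 110., 95., 60.],
    [ 97., 107., 98., 63.],
    [103., 113., 92., 57.],
    [100., 110., 95., 60.],
])

mean_forecast = member_forecasts.mean(axis=0)  # hourly ensemble mean
ensemble_std = member_forecasts.std(axis=0)    # uncertainty proxy used later
print(mean_forecast)
```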
II-C Probabilistic forecast combination
We compare four different probabilistic forecast combination approaches.
II-C1 Ensemble average (EA)
This model assumes a Gaussian distribution centered at the ensemble mean forecast $\bar{y}_t$ with a constant variance $\hat{\sigma}^2$ equal to the variance of the in-sample residuals of this mean forecast, i.e., the predictive distribution would read $\mathcal{N}(\bar{y}_t, \hat{\sigma}^2)$. However, since the heat load is nonnegative, such a standard Gaussian distribution would assign nonzero probabilities to physically impossible values below zero. Therefore, we employ a mixture distribution in which the probability mass for negative values is reassigned to a point mass at $y = 0$. We denote this mixture distribution as $\mathcal{N}_0$. Note that this is different from a zero-truncated distribution, where the probability mass is assigned proportionally to all values $y \ge 0$.

II-C2 Ensemble average and ensemble variance (EAEV)
The ensemble variance $s_t^2$ can be interpreted as a proxy for the uncertainty of the mean prediction and can be used to model a distribution with conditional variance. Hence, the predictive distribution of this model is given by the zero-censored Gaussian $\mathcal{N}_0(\bar{y}_t, s_t^2)$. However, we expect the predictive distribution of this model to be underdispersed, i.e., the predicted variance is likely too small on average.
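The CDF of the zero-censored Gaussian used by EA and EAEV can be sketched as follows; the parameter values are illustrative:

```python
from statistics import NormalDist

def censored_normal_cdf(y, mu, sigma):
    """CDF of a Gaussian whose probability mass below zero has been
    reassigned to a point mass at y = 0. For y >= 0 it coincides with
    the uncensored Gaussian CDF; the jump at zero carries the mass
    P(Y < 0) of the original distribution."""
    if y < 0:
        return 0.0
    return NormalDist(mu, sigma).cdf(y)

# EA uses a constant residual variance; EAEV the hourly ensemble variance.
print(censored_normal_cdf(-1.0, 5.0, 10.0))  # no mass below zero
print(censored_normal_cdf(0.0, 5.0, 10.0))   # point mass at zero, about 0.31
```

In contrast, a zero-truncated Gaussian would rescale the whole density by $1 / (1 - \Phi(-\mu/\sigma))$ instead of placing a point mass at zero.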
II-C3 GAMLSS-based forecast combination (GAMLSS)
GAMLSS [17] are univariate distributional regression models where all parameters of the assumed distribution for the response can be modeled as additive functions of the explanatory variables.
The GAMLSS framework also allows fitting censored regression models.
Such models are appropriate when the target variable is naturally censored above and/or below a certain value, see e.g. [13] for an application to probabilistic wind speed and precipitation forecasting.
We assume that the heat demand follows a t-distribution with a point mass at zero. The t-distribution accommodates the low signal-to-noise ratio of the time series, given its ability to model heavy tails when the degrees-of-freedom parameter $\nu$ is small. The mean is modeled as a linear combination of the ensemble forecasts

$\mu_t = \beta_0 + \sum_{k=1}^{K} \beta_k \hat{y}_t^{(k)} \qquad (5)$
and the log-standard deviation is modeled as a nonlinear function of the ensemble standard deviation

$\log \sigma_t = f(s_t) \qquad (6)$

where $s_t$ is the standard deviation of the ensemble and $f$ is a penalized B-spline function with 20 equidistant knots that is constrained to be monotonically increasing. The degrees of freedom $\nu$ are estimated as a constant. Parameters are estimated via maximum likelihood using a 0-censored t-likelihood.
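The 0-censored t-likelihood can be illustrated with the following sketch; the paper fits this model with the R package gamlss, whereas here the log-sigma link is simplified to a linear function of the ensemble standard deviation and all data are synthetic:

```python
import numpy as np
from scipy.stats import t as student_t

def censored_t_nll(params, y, X_mu, s):
    """Negative log-likelihood of a zero-censored t-distribution: the mean
    is a linear combination of the ensemble forecasts and log-sigma a
    function of the ensemble standard deviation s (simplified here to a
    linear function for illustration)."""
    beta = params[:X_mu.shape[1]]
    g0, g1, log_nu = params[X_mu.shape[1]:]
    mu = X_mu @ beta
    sigma = np.exp(g0 + g1 * s)
    nu = np.exp(log_nu)
    z = (y - mu) / sigma
    ll = np.where(
        y > 0,
        student_t.logpdf(z, df=nu) - np.log(sigma),  # density part for y > 0
        student_t.logcdf(-mu / sigma, df=nu),        # point mass: P(latent <= 0)
    )
    return -ll.sum()

# Toy evaluation on synthetic data (two hypothetical ensemble forecasts).
rng = np.random.default_rng(2)
X_mu = rng.normal(50.0, 5.0, size=(100, 2))
s = rng.uniform(1.0, 3.0, 100)
y = np.maximum(X_mu.mean(axis=1) + rng.standard_t(5, 100), 0.0)
params = np.array([0.5, 0.5, 0.0, 0.1, np.log(5.0)])
nll = censored_t_nll(params, y, X_mu, s)
print(nll)
```

Minimizing this objective (e.g., with `scipy.optimize.minimize`) would mimic the maximum likelihood estimation described above.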
II-C4 Quantile Regression Averaging (QRA)
Quantile regression [10] offers a flexible approach to probabilistic forecasting, as it allows estimating the quantiles of a CDF without making parametric assumptions about the underlying distribution.
Hence, we can approximate the CDF of a conditional predictive distribution by a vector of predicted quantiles.
QRA [15] is a forecast combination technique that builds on this idea.
The predictive quantile for level $\alpha$ at time $t$ is given by a linear combination of the ensemble forecasts

$\hat{q}_t(\alpha) = \beta_{0,\alpha} + \sum_{k=1}^{K} \beta_{k,\alpha} \hat{y}_t^{(k)} \qquad (7)$

Note that one has to estimate a separate model for each quantile, i.e., we estimate 99 models, one for each $\alpha \in \{0.01, 0.02, \dots, 0.99\}$. We sort the estimated quantiles to ensure monotonicity and clip all negative values. A predictive CDF
can then be obtained by interpolating these quantiles. QRA has been applied to the problem of electrical load forecasting in [11].

We additionally test two benchmark models that do not rely on forecast combination.
II-C5 Naive
This model is a naive baseline and uses the observed demand of the previous day as the mean forecast. The constant variance $\hat{\sigma}^2$ is estimated from the observed residuals of this mean forecast. The predictive distribution reads $\mathcal{N}_0(y_{t-24}, \hat{\sigma}^2)$, where $\mathcal{N}_0$ denotes the Gaussian with its negative probability mass reassigned to a point mass at zero.
II-C6 Gradient Boosted Quantile Regression Trees (GBQRT)
As we observed the GBR model with a 365 day training window to be one of the best point forecasting models, we chose this model as a strong benchmark for the forecast combination approaches. The model uses the same features and hyperparameters as the point forecasting model but minimizes the quantile loss instead of the squared error. As for the QRA model, negative values are clipped and the quantiles are sorted in ascending order.
All probabilistic models are re-estimated daily with a rolling window of 365 days.
TABLE I: RMSE and MAE of the point forecasts and the ensemble mean forecast in the test period

                      Building A        Building B
Days   Model          RMSE    MAE       RMSE    MAE
60     Lasso          55.8    35.7      17.5    12.1
60     GBR            56.5    32.0      17.6    12.2
60     GAM            59.5    37.8      19.8    13.7
90     Lasso          52.7    32.9      18.2    12.5
90     GBR            60.9    31.8      17.5    12.2
90     GAM            58.3    36.3      20.2    14.1
365    Lasso          53.0    34.5      19.2    13.3
365    GBR            49.3    30.4      18.7    13.0
365    GAM            57.3    39.9      21.2    14.5
       Mean forecast  48.6    29.7      16.7    11.5
II-D Benchmarking metrics
We benchmark point forecasts with the mean absolute error $\text{MAE} = \frac{1}{N} \sum_{t=1}^{N} |y_t - \tilde{y}_t|$ and the root mean squared error $\text{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (y_t - \tilde{y}_t)^2}$, where $N$ is the number of samples in the test set and $\tilde{y}_t$ a point forecast clipped at 0.
To compare the performance of the probabilistic forecasts we use the continuous ranked probability score (CRPS), which is defined as

$\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(z) - \mathbb{1}\{y \le z\} \right)^2 \, \mathrm{d}z \qquad (8)$
Closed-form solutions of (8) are available for some distributions, such as the Gaussian.
Alternatively, the CRPS can be estimated from samples as

$\widehat{\text{CRPS}}(F, y) = \frac{1}{n} \sum_{i=1}^{n} |X_i - y| - \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} |X_i - X_j|,$

where $X_1, \dots, X_n$ is a set of independent samples from the predictive distribution [6].
We draw samples from each predictive distribution to estimate the CRPS and report the average value over the test period.
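The sample-based CRPS estimator above is straightforward to implement; the degenerate forecast in the usage line is a sanity check, since for a point forecast the CRPS reduces to the absolute error:

```python
import numpy as np

def crps_samples(samples, y):
    """Sample-based CRPS estimate: mean |X_i - y| minus half the mean
    pairwise distance |X_i - X_j| over independent draws X_i from the
    predictive distribution."""
    x = np.asarray(samples, dtype=float)
    term1 = np.abs(x - y).mean()
    term2 = 0.5 * np.abs(x[:, None] - x[None, :]).mean()
    return term1 - term2

# Sanity check: for a degenerate (point) forecast the CRPS is |x - y|.
print(crps_samples([3.0, 3.0, 3.0], 5.0))  # 2.0
```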
We benchmark the anomaly detection by the true positive rate $\text{TPR} = \text{TP} / (\text{TP} + \text{FN})$ and the false positive rate $\text{FPR} = \text{FP} / (\text{FP} + \text{TN})$, where TP, FP, FN, and TN are the numbers of true positive, false positive, false negative, and true negative classifications, respectively.
III Empirical evaluation
III-A Data & software
We apply our methodology to the hourly load data of two buildings on the university campus.
The temperature data was obtained from a nearby weather observation station of Deutscher Wetterdienst [3].
We use the data of the year 2017 as the initial training set for the deterministic models.
2018 is the validation set for the deterministic models and the initial training set for the probabilistic models.
2019 is the test set for point forecasts, probabilistic forecasts, and the anomaly detection.
In each year we only consider the heating period, which we define as between September 1st and May 31st.
We do not consider time steps in which an input value or an observed value is missing.
The test set contains 5383 samples for building A and 5340 samples for building B.
In the test period, buildings A and B have an average load of 394.8 kW and 121.9 kW, respectively.
We use the Python packages scikit-learn [16], pyGAM [19], and statsmodels [18], as well as the R package gamlss [17].
III-B Artificial anomaly generation
To quantify the anomaly detection performance, we replace 5% of the observed load values in the test period with artificial anomalies. We create anomalies as deviations of ±20% from the observed load, but at least 20% of the average observed load; otherwise, the artificial deviations for very low load values would resemble the usual random fluctuations in the data. If a deviation results in a negative value, we replace it with a deviation in the positive direction. We run the anomaly detection 30 times; for each run, the artificial anomalies are placed at different random positions, and the average performance is reported. The forecast models are trained, validated, and tested without artificial anomalies.
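This generation procedure can be sketched as follows; the constant toy load series in the usage line is an assumption for illustration only:

```python
import numpy as np

def insert_anomalies(y, frac=0.05, rel_dev=0.2, seed=0):
    """Replace `frac` of the values with artificial anomalies: a +/- rel_dev
    deviation from the observed load, but at least rel_dev times the average
    load; deviations that would produce a negative value are flipped to the
    positive direction."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float).copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    min_dev = rel_dev * y.mean()  # floor based on the average load
    for i in idx:
        dev = max(rel_dev * y[i], min_dev) * rng.choice([-1.0, 1.0])
        if y[i] + dev < 0:
            dev = abs(dev)  # flip to the positive direction
        y[i] += dev
    labels = np.zeros(len(y), dtype=bool)
    labels[idx] = True
    return y, labels

# Toy run on a constant 100 kW load series.
y_anom, labels = insert_anomalies(np.full(1000, 100.0))
print(labels.sum())  # 50 anomalies (5% of 1000)
```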
TABLE II: Mean test set CRPS of the probabilistic forecasts

Building   EA      EAEV    GAMLSS   QRA     Naive   GBQRT
A          25.51   22.00   20.34    21.20   62.95   21.36
B          8.55    8.88    7.99     8.54    21.47   9.48
III-C Forecast results
Table I presents the MAE and the RMSE of the point forecasts and the ensemble mean forecast in the test period. Notably, different point forecasts of the ensemble perform best for the two buildings, and even for a single building the ranking changes with the considered error measure. The mean forecast, on the other hand, performs best for both buildings, irrespective of the chosen error measure. This supports our motivation for employing forecast combination and is in line with the literature [14].
Table II shows the mean test set CRPS score for the probabilistic forecasts.
The GAMLSS model performs best, followed by the QRA model. For building A GBQRT performs better than the simple forecast combination models.
The probability integral transform (PIT) histogram in Figure 1 shows the relative number of observed values in equally sized intervals of the predictions’ quantile levels and hence provides information about the calibration of the forecasts [6]. The GAMLSS and QRA models show the best calibration. The PIT values in the upper quantile levels suggest that the GAMLSS forecast distributions are overdispersed in their upper half. The other forecasts show worse calibration, with the EAEV and GBQRT forecast distributions being underdispersed and the naive forecast being overdispersed.
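The PIT values behind such a histogram are simply the predictive CDFs evaluated at the realized observations; in this synthetic sketch the observations are drawn from the forecast distribution itself, so the histogram is approximately uniform:

```python
import numpy as np
from statistics import NormalDist

# PIT: evaluate each predictive CDF at its realized observation. For a
# perfectly calibrated forecast the PIT values are uniform on [0, 1], so
# each histogram bin holds roughly the same share of observations.
rng = np.random.default_rng(4)
mu, sigma = 100.0, 10.0
obs = rng.normal(mu, sigma, 5000)  # drawn from the forecast distribution
pit = np.array([NormalDist(mu, sigma).cdf(o) for o in obs])
hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
print(hist / len(pit))  # each entry close to 0.1
```

Systematic humps in the middle bins would indicate overdispersion, humps at the edges underdispersion.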
III-D Anomaly detection results
In the following, when presenting results by threshold quantile levels, we choose symmetric thresholds so that $\alpha_l = 1 - \alpha_u$.
The receiver operating characteristic (ROC) curve in Figure 1(a) shows the TPR vs. the FPR resulting from different anomaly thresholds for each model.
The best anomaly detection maximizes the area under the ROC curve.
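The ROC points can be obtained by sweeping symmetric thresholds over the PIT values; the toy PIT values and labels below are assumptions for illustration:

```python
import numpy as np

def roc_points(pit, labels, alphas):
    """TPR/FPR pairs obtained by sweeping symmetric anomaly thresholds:
    a time step is flagged if its PIT value lies outside [a, 1 - a]."""
    pts = []
    for a in alphas:
        flagged = (pit < a) | (pit > 1.0 - a)
        tp = np.sum(flagged & labels)
        fp = np.sum(flagged & ~labels)
        fn = np.sum(~flagged & labels)
        tn = np.sum(~flagged & ~labels)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return pts

# Toy example: three normal observations and two inserted anomalies whose
# PIT values fall into the distribution tails.
pit = np.array([0.5, 0.4, 0.6, 0.999, 0.001])
labels = np.array([False, False, False, True, True])
pts = roc_points(pit, labels, alphas=[0.01, 0.45])
print(pts)
```

Integrating over a dense grid of `alphas` yields the area under the ROC curve used to compare the models.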
The GAMLSS model outperforms the other models and offers interpretable threshold levels.
With good calibration, the quantile level offers an ex-ante interpretable threshold, as the FPR should be close to the nominal level $\alpha_l + (1 - \alpha_u)$.
EAEV performs better than its CRPS suggests, but due to the inadequate calibration, the quantile levels offer only ex-post interpretability and the threshold levels cannot be generalized across buildings.
All models perform worse on building B, see Figure 1(b). The GAMLSS model still performs best.
For both buildings, all models based on forecast combination, except EAEV for building B, perform better than the benchmark models.
In summary, a decreased CRPS does not strictly lead to increased anomaly detection performance. A reason might be that over- and underdispersion reduce the interpretability of the quantile levels but are less important when thresholds are chosen ex post based on the TPR-FPR ratio.
Nevertheless, GAMLSS, with the lowest average CRPS, offers the best anomaly detection for both buildings.
Figure 2(a) shows an example of the detection of artificial anomalies.
The load time series has no recognizable periodicity and the presented anomaly values would be plausible in other time steps.
Nevertheless, most of the anomalies are detected.
Figure 2(b) shows an example where we know the original data to contain an anomaly. It is correctly detected.
Considering that a similarly steep decline to a similar value is part of a plausible load pattern on the next day, this anomaly cannot be trivially identified at the time of measurement.
This confirms that the proposed methodology delivers valuable results in an actual use case.
IV Conclusion and Outlook
This work presented a novel methodology for anomaly detection in heat load time series based on probabilistic forecast combination techniques.
An empirical study showed that probabilistic forecast combination based anomaly detection outperforms even a strong benchmark.
Of all considered models, the GAMLSSbased forecast combination model yielded the best probabilistic forecast and anomaly detection performance.
Further research could focus on improving the composition and accuracy of the point forecast ensemble, as well as refining the GAMLSS model, e.g., by considering other types of distributions.
The empirical evaluation could be extended to more buildings or other smart meter data sets.
References

[1] (2014) Statistical anomaly detection in mean and variation of energy consumption. In 2014 22nd International Conference on Pattern Recognition, pp. 3570–3575.
[2] (2014) Real-time detection of anomalous power consumption. Renewable and Sustainable Energy Reviews 33, pp. 400–411.
[3] Deutscher Wetterdienst Climate Data Center. https://www.dwd.de/EN/climate_environment/cdc/cdc_node_en.html
[4] (2012) Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems 27 (1), pp. 134–141.
[5] (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378.
[6] (2014) Probabilistic forecasting. Annual Review of Statistics and Its Application 1 (1), pp. 125–151.
[7] (2021) Artificial intelligence based anomaly detection of energy consumption in buildings: a review, current trends and new perspectives. Applied Energy 287, 116601.
[8] (2020) Energy forecasting: a review and outlook. IEEE Open Access Journal of Power and Energy.
[9] (2019) A note on averaging day-ahead electricity price forecasts across calibration windows. IEEE Transactions on Sustainable Energy 10 (1), pp. 321–323.
[10] (1978) Regression quantiles. Econometrica 46 (1), pp. 33–50.
[11] (2017) Probabilistic load forecasting via quantile regression averaging on sister forecasts. IEEE Transactions on Smart Grid 8 (2), pp. 730–737.
[12] (2018) Real-time anomaly detection for very short-term load forecasting. Journal of Modern Power Systems and Clean Energy 6 (2), pp. 235–243.
[13] (2014) Extending extended logistic regression: extended versus separate versus ordered versus censored. Monthly Weather Review 142 (8), pp. 3003–3014.
[14] (2016) Improving short term load forecast accuracy via combining sister forecasts. Energy 98, pp. 40–49.
[15] (2015) Computing electricity spot price prediction intervals using quantile regression and forecast averaging. Computational Statistics 30 (3), pp. 791–803.
[16] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[17] (2005) Generalized additive models for location, scale and shape. Applied Statistics 54 (3), pp. 507–554.
[18] (2010) Statsmodels: econometric and statistical modeling with Python. In 9th Python in Science Conference.
[19] (2018) pyGAM: generalized additive models in Python.
[20] (1990) Generalized additive models. Vol. 43, Wiley Online Library.
[21] (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
[22] (2019) Comparison of three short-term load forecast models in Southern California. Energy 189, 116358.
[23] (2016) Lasso estimation for GEFCom2014 probabilistic electric load forecasting. International Journal of Forecasting 32 (3), pp. 1029–1037.