Probabilistic Forecast Combination for Anomaly Detection in Building Heat Load Time Series

by Mario Beykirch, et al.

We consider the problem of automated anomaly detection for building-level heat load time series. An anomaly detection model must be applicable to a diverse group of buildings and provide robust results on heat load time series with low signal-to-noise ratios, several seasonalities, and significant exogenous effects. We propose to employ a probabilistic forecast combination approach based on an ensemble of deterministic forecasts in an anomaly detection scheme that classifies observed values based on their probability under a predictive distribution. We show empirically that forecast-based anomaly detection provides improved accuracy when employing a forecast combination approach.




I Introduction

With the further integration of electric and heat energy systems and the increased use of smart meters at the building level, the analysis of heat load time series plays an increasingly important role in local smart energy grids. These time series can be utilized, for example, in load forecasts for scheduling problems or to identify and reduce energy waste. One important part of energy monitoring is the detection of anomalies, such as unusual consumption, system failures, or measurement errors. For large-scale monitoring infrastructures, in our case 40 university campus buildings with multiple measured energy forms, manual anomaly detection is hardly feasible on a daily basis. This makes automated anomaly detection necessary.
The literature provides various methodologies for anomaly detection in building energy time series; see [7] for an extensive review. Several works present building-level anomaly detection methods based on probabilistic load forecasts. The main idea is to use a probabilistic model to issue a predictive distribution over the expected load and classify an observed value as anomalous if this value is unlikely under the predicted distribution. The authors of [12] apply a dynamic regression model to predict the load one hour ahead and select an anomaly threshold from observed percentage errors. In [2], the load is predicted with an autoregressive neural network and anomaly thresholds are derived from a constant Gaussian error. The authors of [1] first predict the day-ahead load with a generalized additive model (GAM) based on exogenous variables and then fit an autoregressive conditional heteroscedasticity (ARCH) model on the forecast residuals to obtain a conditional distribution over the forecast errors.

However, these works do not address the specific challenges associated with anomaly detection in heat load time series at the building level: a considerable number of observed values are zero or close to zero; seasonalities and exogenous variables, such as the temperature, have a significant effect on the load; and, since single buildings are considered, random effects, such as user behavior, have a large influence on the observed load. Furthermore, the model must be flexible enough to represent a diverse set of buildings. Finally, the assumption of a standard homoscedastic or heteroscedastic Gaussian error distribution is too simplistic for reliable anomaly detection.
Our proposed anomaly detection methodology consists of three steps. First, we create an ensemble of diverse point forecasting models based on different regression techniques and training set sizes. The forecasts of these models then serve as inputs for probabilistic forecast combination techniques, which yield a predictive distribution conditional on the ensemble predictions. Finally, these distributions are used to classify the observed load values into normal and anomalous based on a selected plausibility interval. This also allows tuning the aggressiveness of the anomaly detection by varying the size of the non-anomaly interval.

The efficacy of our method relies on the quality of the predictive distributions produced by the selected forecasting model. The classic approach to the model selection problem is to pick a single model ex ante based on validation set performance. This can be computationally expensive, especially when one wants to forecast many time series, and requires large data sets to reliably estimate the out-of-sample accuracy. The latter point is especially true for building-level energy time series, as they are usually characterized by a low signal-to-noise ratio and show daily, weekly, and yearly seasonality. Furthermore, the assumption that a single best model for a time series exists at all is questionable, and different models might perform well in different regions of the input space. This motivates our forecast combination approach. While the combination of point forecasts is a well-researched topic, probabilistic forecast combination is considered a current frontier in energy forecasting [8]. To our knowledge, this work is the first to propose an anomaly detection methodology based on probabilistic load forecast combination techniques.
We empirically test our approach on the heat load time series of two real university campus buildings. To quantify the anomaly detection performance, we insert artificial anomalies into the time series. We find that a carefully designed probabilistic forecast combination model based on generalized additive models for location, scale, and shape (GAMLSS) [17] improves over simple averaging approaches and individual models not only in terms of the statistical forecast error measures but also leads to significant improvements in anomaly detection performance.
The remainder of the paper is structured as follows. In the next section we formalize the problem and introduce the necessary notation. We then present our ensemble of point forecasting models and the considered probabilistic forecast combination approaches. The setup and results of our empirical study are explained in Section III. We conclude in Section IV.

II Methodology

II-A Problem definition

The anomaly detection should answer the question whether an observed load value $y_t$ at time step $t$ is an anomaly, given the vector of previous observations $(y_1, \dots, y_{t-1})$ and a vector of explanatory variables $\mathbf{x}_t$.

The core idea of forecast-based anomaly detection is to classify observed values based on the probability of observing them under a distribution issued by a probabilistic regression model. Given the predictive distribution of the random variable $Y_t$ with cumulative distribution function (CDF) $F_t$, we can compute the probability that $Y_t$ lies in the interval $[a, b]$ as

$$P(a \le Y_t \le b) = F_t(b) - F_t(a).$$

Based on this, we define an anomaly detection model that, given a predictive CDF $\hat{F}_t$, classifies $y_t$ as an anomaly if

$$\hat{F}_t(y_t) < q_l \quad \text{or} \quad \hat{F}_t(y_t) > q_u$$

holds true, with $q_l$ and $q_u$ being the lower and upper threshold levels of a plausibility interval. To this end, we first need to estimate the CDF of the conditional distribution of $Y_t$. If this distribution belongs to a parametric family, $\hat{F}_t$ is usually available in closed form. Alternatively, $\hat{F}_t$ can be approximated by a set of predicted quantiles.

In the case of forecast combination, we take a two-step approach to estimate $\hat{F}_t$. We use a set of $m$ point forecasting models from which we obtain the predictions $\hat{y}^{(1)}_t, \dots, \hat{y}^{(m)}_t$. We then use the vector of these predictions in a second model to obtain $\hat{F}_t$.
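The classification rule above can be sketched in a few lines; the Gaussian predictive distribution and the threshold levels here are illustrative stand-ins, not the paper's fitted models:

```python
from scipy.stats import norm

def is_anomaly(y_obs, cdf, q_lower=0.025, q_upper=0.975):
    """Flag y_obs if its probability level under the predictive CDF
    falls outside the plausibility interval [q_lower, q_upper]."""
    p = cdf(y_obs)
    return bool(p < q_lower or p > q_upper)

# Hypothetical predictive distribution for one hour: N(400, 30^2) kW.
cdf = norm(loc=400.0, scale=30.0).cdf
print(is_anomaly(395.0, cdf))  # well inside the interval -> False
print(is_anomaly(520.0, cdf))  # deep in the upper tail   -> True
```

Varying `q_lower` and `q_upper` is exactly the knob that tunes the aggressiveness of the detection.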

We present the deterministic and probabilistic models in the following.

II-B Deterministic forecast ensemble

We arrange the deterministic forecast ensemble from nine forecasts. These are produced by three different forecasting methods, each of which is trained with a 60, a 90, and a 365 day rolling training window. The effectiveness of combining forecasts obtained with different training window sizes is shown in [9], where it is applied to electricity price forecasting. We selected the forecasting methods based on their ability to flexibly model load time series of a diverse group of buildings.
To obtain the forecast time series we re-train the models each day to predict the hourly load of the complete next day, with the training window consisting of the directly preceding days, i.e. we perform day-ahead forecasting with a rolling training window.
For each model, we consider one hyper-parameter as tuning parameter. We separately employ a grid search to determine the tuning parameter value that minimizes the mean absolute error (MAE) in a rolling window cross-validation.
We present the three employed deterministic forecast methods in the following.

II-B1 Lasso linear regression (Lasso)

The first method is a linear regression model with Lasso regularization [21]. It is commonly applied in the forecasting literature, e.g. to electric load forecasting in [23]. Lasso regression performs automatic feature selection and parameter shrinkage, which allows us to flexibly model all building time series without ex-ante assumptions about feature relevancy. We model the load at time step $t$ as a linear combination of an intercept, lagged load values, temperature-based features, and calendar dummies, plus an additive model error. We define the heating degree hour as $\mathrm{HDH}_t = \max(T_0 - T_t, 0)$, where $T_t$ is the temperature and $T_0$ a reference temperature; this linearizes the non-linear temperature effect. Superscripts avg and peak denote the average and the peak value of the respective day. Binary variables represent working days and working hours, the latter defined as 9:00 to 17:00. We consider the Lasso regularization parameter as the tuning parameter.
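As a sketch of the rolling-window setup, the following fits a Lasso model on a 60-day window of synthetic hourly data and predicts the complete next day. The feature set, the reference temperature of 15 °C, and the regularization strength are assumptions for this example, not the paper's exact specification:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_hours = 24 * 120                           # synthetic hourly heat load
t = np.arange(n_hours)
temp = 5 + 8 * np.sin(2 * np.pi * t / (24 * 365))
load = np.maximum(200 + 30 * np.sin(2 * np.pi * t / 24)
                  + 10 * np.maximum(15 - temp, 0)
                  + rng.normal(0, 5, n_hours), 0)

def features(i):
    # Illustrative features: daily/weekly lags, heating degree hour, dummy.
    hdh = max(15.0 - temp[i], 0.0)           # assumed reference temp 15 °C
    workhour = 1.0 if 9 <= i % 24 < 17 else 0.0
    return [load[i - 24], load[i - 168], hdh, workhour]

window = 60 * 24                             # 60-day rolling training window
day = 100                                    # index of the forecast day
train = np.arange(day * 24 - window, day * 24)
test = np.arange(day * 24, (day + 1) * 24)   # the complete next day

model = Lasso(alpha=0.1)                     # alpha is the tuning parameter
model.fit(np.array([features(i) for i in train]), load[train])
forecast = model.predict(np.array([features(i) for i in test]))
print(forecast.shape)
```

In the full scheme, this fit-and-predict step is repeated each day with the window shifted forward.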

II-B2 Gradient boosting regression (GBR)

The second point forecasting method is gradient-boosted regression trees. This method combines an ensemble of weak estimators, in this case decision trees, to form a strong estimator. The tree structure allows data-driven modeling of feature interactions [5]. The authors of [22] apply GBR to electricity load forecasting. We consider a feature vector that includes, among others, three integer-valued features representing the hour of the day, the day of the week, and the week of the year, respectively.

We use 300 estimators with a fixed learning rate and minimize the least-squares loss in training. We allow the maximum tree depth to be between 3 and 6 and consider it the tuning parameter.

II-B3 Generalized additive model (GAM)

The third method is based on a GAM, which allows flexible modeling of non-linear feature effects with penalized B-spline functions [20]. GAMs are applied to electric load forecasting, for example, in [4].

The model is additive in penalized B-spline functions of the features, most of them defined with 10 B-splines and one with 24 B-splines. One additional effect, defined with 5 B-splines, is only included in the 365-day model. The day of the week is expressed as a set of six dummy variables. Several of the smooth functions are constrained to be monotonically increasing and one to be monotonically decreasing. The differing numbers of splines mitigate over-fitting of the feature effects. The smoothing parameter is set equal for each function and considered the tuning parameter.
The mean forecast is the hourly mean of the nine ensemble forecasts.

II-C Probabilistic forecast combination

We compare four different probabilistic forecast combination approaches.

II-C1 Ensemble average (EA)

This model assumes a Gaussian distribution centered at the ensemble mean forecast $\bar{y}_t$, with a constant variance $\hat{\sigma}^2$ equal to the variance of the in-sample residuals of this mean forecast, i.e. the predictive distribution would read $\mathcal{N}(\bar{y}_t, \hat{\sigma}^2)$. However, since the load is non-negative, such a standard Gaussian distribution would assign non-zero probabilities to physically impossible values below zero. Therefore, we employ a mixture distribution in which the probability mass for negative values is reassigned to a point mass at $0$; we denote this mixture distribution as $\mathcal{N}_0(\bar{y}_t, \hat{\sigma}^2)$. Note that this is different from a zero-truncated distribution, where the probability mass is redistributed proportionally over all non-negative values.
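The CDF of this zero point-mass mixture can be written directly: it is zero below zero, jumps at zero by the Gaussian mass below zero, and coincides with the plain Gaussian CDF above zero. A minimal sketch with arbitrary parameters:

```python
import numpy as np
from scipy.stats import norm

def cdf_zero_mass(y, mu, sigma):
    """CDF of a Gaussian whose mass below zero is moved to a point
    mass at 0: zero for y < 0, the plain Gaussian CDF for y >= 0,
    so the CDF jumps by P(N(mu, sigma^2) < 0) at y = 0."""
    y = np.asarray(y, dtype=float)
    return np.where(y < 0, 0.0, norm.cdf(y, loc=mu, scale=sigma))

mu, sigma = 2.0, 3.0
print(cdf_zero_mass(-1.0, mu, sigma))  # 0.0: no mass below zero
print(cdf_zero_mass(0.0, mu, sigma))   # size of the point mass at zero
```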

II-C2 Ensemble average and ensemble variance (EA-EV)

The ensemble variance can be interpreted as a proxy for the uncertainty of the mean prediction and can be used to model a distribution with conditional variance. Hence, the predictive distribution of this model is the zero point-mass mixture introduced for the EA model, but with the ensemble variance at time $t$ in place of the constant residual variance. However, we expect the predictive distribution of this model to be under-dispersed, i.e. the predicted variance is likely too small on average.

II-C3 GAMLSS-based forecast combination (GAMLSS)

GAMLSS [17] are univariate distributional regression models in which all parameters of the assumed distribution for the response can be modeled as additive functions of the explanatory variables. The GAMLSS framework also allows fitting censored regression models. Such models are appropriate when the target variable is naturally censored above and/or below a certain value; see e.g. [13] for an application to probabilistic wind speed and precipitation forecasting.

We assume that the heat demand follows a t-distribution with a point mass at zero. The t-distribution accommodates the low signal-to-noise ratio of the time series, given its ability to model heavy tails when the degrees-of-freedom parameter $\nu$ is small. The mean is modeled as a linear combination of the ensemble forecasts,

$$\mu_t = \beta_0 + \sum_{i=1}^{9} \beta_i\, \hat{y}^{(i)}_t,$$

and the log-standard deviation as a nonlinear function of the ensemble standard deviation,

$$\log \sigma_t = \gamma_0 + f(s_t),$$

where $s_t$ is the standard deviation of the ensemble forecasts and $f$ is a penalized B-spline function with 20 equidistant knots that is constrained to be monotonically increasing. The degrees of freedom $\nu$ are estimated as a constant. All parameters are estimated via maximum likelihood using a 0-censored t-likelihood.
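A deliberately simplified stand-in for the censored-t fit (a constant log-standard deviation and degrees of freedom instead of the spline term, and a generic optimizer instead of the gamlss fitting algorithm) might look as follows; the ensemble forecasts and observations are synthetic:

```python
import numpy as np
from scipy.stats import t as tdist
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, m = 500, 3
ens = rng.normal(50, 10, (n, m))             # synthetic ensemble forecasts
y = np.maximum(ens.mean(axis=1) + 5 * rng.standard_t(5, n), 0)

def nll(params):
    beta0, *beta = params[:m + 1]
    log_sigma, log_nu = params[m + 1], params[m + 2]
    mu = beta0 + ens @ np.array(beta)
    sigma, nu = np.exp(log_sigma), np.exp(log_nu) + 2.0  # keep nu > 2
    z = (y - mu) / sigma
    # Censored t log-likelihood: density for y > 0, CDF at 0 for y == 0.
    ll = np.where(y > 0,
                  tdist.logpdf(z, df=nu) - np.log(sigma),
                  tdist.logcdf(-mu / sigma, df=nu))
    return -ll.sum()

x0 = np.zeros(m + 3)
x0[1:m + 1] = 1.0 / m                        # start from the ensemble mean
res = minimize(nll, x0, method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-4, "fatol": 1e-4})
print(res.x[:m + 1])                         # fitted combination weights
```

The actual model in the paper is fitted with the R package gamlss, which also estimates the spline term for the scale.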

(a) Building A
(b) Building B
Fig. 1: PIT histograms of each probabilistic forecast in the test period. The points indicate the bin values and are positioned at the bin centers, with all bins having equal width. The dashed line represents optimal calibration.

II-C4 Quantile Regression Averaging (QRA)

Quantile regression [10] offers a flexible approach to probabilistic forecasting, as it allows estimating the $q$-quantile of a CDF without making parametric assumptions about the underlying distribution. Hence, we can approximate the CDF of a conditional predictive distribution by a vector of predicted quantiles.

QRA [15] is a forecast combination technique that builds on this idea. The predictive quantile for level $q$ at time $t$ is given by a linear combination of the ensemble forecasts,

$$\hat{F}^{-1}_t(q) = \beta_{q,0} + \sum_{i=1}^{9} \beta_{q,i}\, \hat{y}^{(i)}_t.$$

Note that one has to estimate a separate model for each quantile, i.e. we estimate 99 models, one for each $q \in \{0.01, 0.02, \dots, 0.99\}$. We sort the estimated quantiles to ensure monotonicity and clip all negative values. A predictive CDF can then be obtained by interpolating the estimated quantiles. QRA has been applied to the problem of electrical load forecasting in [11].
We additionally test two benchmark models that do not rely on forecast combination.

II-C5 Naive

This model is a naive baseline and uses the observed demand of the previous day, i.e. the load 24 hours earlier, as the mean forecast. The constant variance is estimated from the observed residuals of this mean forecast. The predictive distribution is again a Gaussian centered at this naive forecast, with the probability mass for negative values reassigned to a point mass at 0, as for the EA model.

II-C6 Gradient Boosted Quantile Regression Trees (GBQRT)

As we observed the GBR model with a 365 day training window to be one of the best point forecasting models, we chose this model as a strong benchmark for the forecast combination approaches. The model uses the same features and hyper-parameters as the point forecasting model but minimizes the quantile loss instead of the squared error. As for the QRA model, negative values are clipped and the quantiles are sorted in ascending order.
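The quantile-loss variant can be sketched by switching the scikit-learn loss; again, three levels and synthetic data stand in for the full setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))                # synthetic feature matrix
y = 100 + 10 * X[:, 0] + rng.normal(0, 5, 1000)

# One model per quantile level; three levels stand in for all 99.
preds = []
for q in (0.05, 0.5, 0.95):
    gbqrt = GradientBoostingRegressor(loss="quantile", alpha=q,
                                      n_estimators=100, random_state=0)
    preds.append(gbqrt.fit(X, y).predict(X))

# Sort to enforce monotonicity and clip negatives, as in the paper.
quantiles = np.clip(np.sort(np.column_stack(preds), axis=1), 0, None)
print(quantiles.shape)
```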

All probabilistic models are re-estimated daily with a rolling window of 365 days.

Window  Model          Building A       Building B
                       RMSE    MAE      RMSE    MAE
60      Lasso          55.8    35.7     17.5    12.1
60      GBR            56.5    32.0     17.6    12.2
60      GAM            59.5    37.8     19.8    13.7
90      Lasso          52.7    32.9     18.2    12.5
90      GBR            60.9    31.8     17.5    12.2
90      GAM            58.3    36.3     20.2    14.1
365     Lasso          53.0    34.5     19.2    13.3
365     GBR            49.3    30.4     18.7    13.0
365     GAM            57.3    39.9     21.2    14.5
        Mean forecast  48.6    29.7     16.7    11.5
TABLE I: RMSE and MAE of each point forecast in the test set. The lowest value in each column is marked in bold. The three best ensemble forecasts, per building and error measure, are marked by a superscript.

II-D Benchmarking metrics

We benchmark point forecasts with the mean absolute error $\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} |y_t - \tilde{y}_t|$ and the root mean squared error $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (y_t - \tilde{y}_t)^2}$, where $N$ is the number of samples in the test set and $\tilde{y}_t$ a point forecast clipped at 0.
To compare the performance of the probabilistic forecasts, we use the continuous ranked probability score (CRPS), which is defined as

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \ge y\} \right)^2 \, \mathrm{d}x.$$

Closed-form solutions are available for some distributions, such as the Gaussian. Alternatively, the CRPS can be estimated from samples as

$$\widehat{\mathrm{CRPS}}(F, y) = \frac{1}{S} \sum_{i=1}^{S} |X_i - y| - \frac{1}{2 S^2} \sum_{i=1}^{S} \sum_{j=1}^{S} |X_i - X_j|,$$

where $X_1, \dots, X_S$ is a set of independent samples from the predictive distribution [6]. We draw a fixed number of samples for each observation to estimate the CRPS and report the average value over the test period.
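Both routes to the CRPS can be sketched and cross-checked against each other; the sample count and the Gaussian example distribution are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

def crps_sample(samples, y):
    """Sample estimator: E|X - y| - 0.5 * E|X - X'| for independent
    draws X, X' from the predictive distribution."""
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - y))
    term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return term1 - term2

def crps_gaussian(mu, sigma, y):
    """Closed form for a Gaussian predictive distribution."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

rng = np.random.default_rng(5)
samples = rng.normal(10.0, 2.0, 2000)
est = crps_sample(samples, 11.0)
exact = crps_gaussian(10.0, 2.0, 11.0)
print(round(est, 3), round(exact, 3))
```

With enough samples, the two values agree closely; the sample route is what makes the zero point-mass mixtures scoreable.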
We benchmark the anomaly detection by the true positive rate $\mathrm{TPR} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$ and the false positive rate $\mathrm{FPR} = \mathrm{FP} / (\mathrm{FP} + \mathrm{TN})$, where TP, FP, FN, and TN are the numbers of true positive, false positive, false negative, and true negative classifications, respectively.

(a) Building A
(b) Building B
Fig. 2: ROC curves for each anomaly detection model, with each point representing the result for one quantile level used as the threshold. Square and triangle markers highlight two selected threshold levels, and dashed lines represent fixed quantile levels. The results are averages over 30 runs.

III Empirical evaluation

III-A Data & software

We apply our methodology to the hourly load data of two buildings on the university campus. The temperature data was obtained from a nearby weather observation station of Deutscher Wetterdienst [3]. We use the data of the year 2017 as the initial training set for the deterministic models. 2018 is the validation set for the deterministic models and the initial training set for the probabilistic models. 2019 is the test set for point forecasts, probabilistic forecasts, and the anomaly detection. In each year, we only consider the heating period, which we define as September 1st to May 31st. We do not consider time steps in which an input value or an observed value is missing. The test set contains 5383 samples for building A and 5340 samples for building B. In the test period, buildings A and B have an average load of 394.8 kW and 121.9 kW, respectively.
We use the Python packages scikit-learn [16], pyGAM [19], and statsmodels [18] as well as the R package gamlss [17].

III-B Artificial anomaly generation

To quantify the anomaly detection performance, we replace 5% of the observed load values in the test period with artificial anomalies. We create anomalies as deviations of 20% from the observed load, but with a magnitude of at least 20% of the average observed load; otherwise, the artificial deviations for very low load values would resemble the usual random fluctuations in the data. If a deviation results in a negative value, we replace it with a deviation in the positive direction. We run the anomaly detection 30 times; for each run, the artificial anomalies are placed at different random positions, and the average performance is reported. The forecast models are trained, validated, and tested without artificial anomalies.
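The injection procedure might be sketched as follows; the load series is synthetic and the helper name is ours:

```python
import numpy as np

def inject_anomalies(load, frac=0.05, rel=0.20, seed=0):
    """Replace a fraction of values with deviations of +/-20 % of the
    observed value, but at least 20 % of the average load; negative
    results are flipped to positive deviations."""
    rng = np.random.default_rng(seed)
    y = load.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    floor = rel * y.mean()                    # minimum deviation size
    dev = np.maximum(rel * y[idx], floor)
    sign = rng.choice([-1.0, 1.0], size=len(idx))
    anomalous = y[idx] + sign * dev
    # If a negative deviation undershoots zero, deviate upwards instead.
    anomalous = np.where(anomalous < 0, y[idx] + dev, anomalous)
    y[idx] = anomalous
    return y, idx

load = np.abs(np.random.default_rng(6).normal(120, 40, 5000))
altered, positions = inject_anomalies(load)
print(len(positions))  # 5 % of 5000 values replaced
```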

            EA      EA-EV   GAMLSS  QRA     Naive   GBQRT
Building A  25.51   22.00   20.34   21.20   62.95   21.36
Building B  8.55    8.88    7.99    8.54    21.47   9.48
TABLE II: Average test set CRPS values, with the lowest per building in bold.

III-C Forecast results

Table I presents the RMSE and MAE of the point forecasts and the ensemble mean forecast in the test period. Notably, different point forecasts of the ensemble perform best for each building and, depending on which error measure is considered, their ranking even differs for a single building. The mean forecast, on the other hand, performs best for each building, irrespective of the chosen error measure. This supports our motivation for employing forecast combination and is in line with the literature [14].

Table II shows the mean test set CRPS of the probabilistic forecasts. The GAMLSS model performs best, followed by the QRA model. For building A, GBQRT performs better than the simple forecast combination models.

The probability integral transform (PIT) histograms in Figure 1 show the relative number of observed values in equally sized intervals of the predictions’ quantile levels and hence provide information about the calibration of the forecasts [6]. The GAMLSS and QRA models show the best calibration. The calibration in the upper quantile levels suggests that the distributions of the GAMLSS forecast are over-dispersed in the upper half. The other forecasts show worse calibration, with the EA-EV and GBQRT forecast distributions being under-dispersed and the naive forecast being over-dispersed.

(a) Real load with inserted artificial anomalies
(b) Real load containing an anomaly
Fig. 3: Examples of the anomaly detection when using the GAMLSS model. a) The blue line depicts the observed load from real measurements and the orange line the load after inserting artificial anomalies; these are depicted as green dots if detected and as red dots if not. The interval between the anomaly thresholds is depicted in gray. b) The blue line depicts the observed load from real measurements and the gray area depicts the interval between the anomaly thresholds.

III-D Anomaly detection results

In the following, when presenting results by threshold quantile levels, we choose the lower and upper threshold levels symmetrically, so that each tail of the plausibility interval excludes the same probability mass.

The receiver operating characteristic (ROC) curves in Figure 2(a) show the TPR vs. the FPR resulting from different anomaly thresholds for each model. The best anomaly detection maximizes the area under the ROC curve. The GAMLSS model outperforms the other models and offers interpretable threshold levels: with good calibration, a quantile-level threshold is interpretable ex ante, as the FPR should be close to the probability mass excluded by the plausibility interval. EA-EV performs better than its CRPS suggests, but due to the inadequate calibration, the quantile levels offer only ex-post interpretability, and the threshold levels cannot be generalized over different buildings. All models perform worse on building B, see Figure 2(b). The GAMLSS model still performs best. For both buildings, all models based on forecast combination, except EA-EV for building B, perform better than the benchmark models. In summary, a decreased CRPS does not strictly lead to an increased anomaly detection performance. A reason might be that over- and under-dispersion reduce the interpretability of the quantile levels, but matter less when thresholds are chosen ex post based on the TPR-FPR trade-off. Nevertheless, GAMLSS, with the lowest average CRPS, offers the best anomaly detection for both buildings.
Figure 3(a) shows an example of the detection of artificial anomalies. The load time series has no recognizable periodicity, and the presented anomaly values would be plausible at other time steps. Nevertheless, most of the anomalies are detected. Figure 3(b) shows an example where we know the original data to contain an anomaly. It is correctly detected. Considering that a similarly steep decline to a similar value is part of a plausible load pattern on the next day, this anomaly cannot be trivially identified at the time of measurement. This confirms that the proposed methodology offers valuable results in an actual use case.

IV Conclusion and Outlook

This work presented a novel methodology for anomaly detection in heat load time series based on probabilistic forecast combination techniques. An empirical study showed that probabilistic forecast combination based anomaly detection outperforms even a strong benchmark. Of all considered models, the GAMLSS-based forecast combination model yielded the best probabilistic forecast and anomaly detection performance.
Further research could focus on improving the composition and accuracy of the point forecast ensemble, as well as refining the GAMLSS model, e.g. by considering other types of distributions. The empirical evaluation could be extended to more buildings or other smart meter data sets.


  • [1] B. Chen, M. Sinn, J. Ploennigs, and A. Schumann (2014) Statistical anomaly detection in mean and variation of energy consumption. In 2014 22nd International Conference on Pattern Recognition, pp. 3570–3575. Cited by: §I.
  • [2] J. Chou and A. S. Telaga (2014) Real-time detection of anomalous power consumption. Renewable and Sustainable Energy Reviews 33, pp. 400–411. External Links: ISSN 1364-0321 Cited by: §I.
  • [3] Deutscher Wetterdienst, Climate Data Center. Cited by: §III-A.
  • [4] S. Fan and R. J. Hyndman (2012) Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems 27 (1), pp. 134–141. Cited by: §II-B3.
  • [5] J. H. Friedman (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378. Note: Nonlinear Methods and Data Mining External Links: ISSN 0167-9473 Cited by: §II-B2.
  • [6] T. Gneiting and M. Katzfuss (2014) Probabilistic forecasting. Annual Review of Statistics and Its Application 1 (1), pp. 125–151. Cited by: §II-D, §III-C.
  • [7] Y. Himeur, K. Ghanem, A. Alsalemi, F. Bensaali, and A. Amira (2021) Artificial intelligence based anomaly detection of energy consumption in buildings: a review, current trends and new perspectives. Applied Energy 287, pp. 116601. External Links: ISSN 0306-2619 Cited by: §I.
  • [8] T. Hong, P. Pinson, Y. Wang, R. Weron, D. Yang, and H. Zareipour (2020) Energy forecasting: a review and outlook. IEEE Open Access Journal of Power and Energy. Cited by: §I.
  • [9] K. Hubicka, G. Marcjasz, and R. Weron (2019) A note on averaging day-ahead electricity price forecasts across calibration windows. IEEE Transactions on Sustainable Energy 10 (1), pp. 321–323. Cited by: §II-B.
  • [10] R. Koenker and G. Bassett (1978) Regression quantiles. Econometrica 46 (1), pp. 33–50. External Links: ISSN 00129682, 14680262 Cited by: §II-C4.
  • [11] B. Liu, J. Nowotarski, T. Hong, and R. Weron (2017) Probabilistic load forecasting via quantile regression averaging on sister forecasts. IEEE Transactions on Smart Grid 8 (2), pp. 730–737. Cited by: §II-C4.
  • [12] J. Luo, T. Hong, and M. Yue (2018) Real-time anomaly detection for very short-term load forecasting. Journal of Modern Power Systems and Clean Energy 6 (2), pp. 235–243. Cited by: §I.
  • [13] J. W. Messner, G. J. Mayr, D. S. Wilks, and A. Zeileis (2014) Extending extended logistic regression: extended versus separate versus ordered versus censored. Monthly Weather Review 142 (8), pp. 3003–3014. Cited by: §II-C3.
  • [14] J. Nowotarski, B. Liu, R. Weron, and T. Hong (2016) Improving short term load forecast accuracy via combining sister forecasts. Energy 98, pp. 40–49. External Links: ISSN 0360-5442 Cited by: §III-C.
  • [15] J. Nowotarski and R. Weron (2015) Computing electricity spot price prediction intervals using quantile regression and forecast averaging. Computational Statistics 30 (3), pp. 791–803. Cited by: §II-C4.
  • [16] F. Pedregosa et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §III-A.
  • [17] R. A. Rigby and D. M. Stasinopoulos (2005) Generalized additive models for location, scale and shape. Applied Statistics 54 (3), pp. 507–554. Cited by: §I, §II-C3, §III-A.
  • [18] S. Seabold and J. Perktold (2010) Statsmodels: econometric and statistical modeling with python. In 9th Python in Science Conference, Cited by: §III-A.
  • [19] D. Servén and C. Brummitt (2018-03) pyGAM: generalized additive models in python. Cited by: §III-A.
  • [20] R. J. Tibshirani (1990) Generalized additive models. Vol. 43, Wiley Online Library. Cited by: §II-B3.
  • [21] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §II-B1.
  • [22] N. Zhang, Z. Li, X. Zou, and S. M. Quiring (2019) Comparison of three short-term load forecast models in southern california. Energy 189, pp. 116358. External Links: ISSN 0360-5442 Cited by: §II-B2.
  • [23] F. Ziel and B. Liu (2016) Lasso estimation for gefcom2014 probabilistic electric load forecasting. International Journal of Forecasting 32 (3), pp. 1029–1037. External Links: ISSN 0169-2070 Cited by: §II-B1.