Multivariate Anomaly Detection based on Prediction Intervals Constructed using Deep Learning

It has been shown that deep learning models can, under certain circumstances, outperform traditional statistical methods at forecasting. Furthermore, various techniques have been developed for quantifying forecast uncertainty (prediction intervals). In this paper, we utilize prediction intervals constructed with the aid of artificial neural networks to detect anomalies in the multivariate setting. Challenges with existing deep learning-based anomaly detection approaches include (i) large sets of parameters that may be computationally intensive to tune, (ii) returning too many false positives, rendering the techniques impractical for use, and (iii) requiring labeled datasets for training, which are often not available in real life. Our approach overcomes these challenges. We benchmark our approach against well-established, oft-preferred statistical models. We focus on three deep learning architectures, namely, cascaded neural networks, reservoir computing and long short-term memory recurrent neural networks. We find that deep learning outperforms, or is at the very least competitive with, the statistical benchmarks.





1 Introduction


One classical work defines an outlier as an observation that "deviates so significantly from other observations as to arouse suspicion that a different mechanism generated it." Anomalies, also known as outliers, arise for several reasons, such as intentional acts of malice, collapse or failure of various systems, or fraudulent activity. Detecting these anomalies in a timely manner is essential in various decision-making processes. Anomaly detection is not to be confused with novelty detection, which is focused on identifying new patterns in the data. However, the techniques used for anomaly detection are often used for novelty detection and vice versa.

Car20 draw a comparison between the two fields and detail recent advances in the methods applicable to each field.

Anomaly detection, when framed as a classification problem, has been modelled using One-Class Support Vector Machines as far back as the works of Schol99. Dealing with a highly imbalanced class distribution (where normal observations far exceed anomalous data points), Schol99 detect novel patterns by examining the local density distribution of normal data.


Mal15 use stacked LSTM networks for anomaly detection in time series. They train the network on non-anomalous data and use it as a predictor over several time steps. The resulting prediction errors are modelled as a multivariate Gaussian distribution, which is then used to assess the likelihood of anomalous behaviour. Later, Mal16 examine anomaly detection for mechanical devices in the presence of exogenous effects. An LSTM-based Encoder-Decoder scheme is employed that learns to reconstruct time-series behaviour under normalcy and thereafter uses the reconstruction error to detect anomalies. Sch17 employ a deep convolutional generative adversarial network in an unsupervised manner to learn a manifold of normal anatomical variability and apply it to the task of anomaly detection to aid disease detection and treatment monitoring.

More recently, the research has focused on time-series anomalies that manifest over a period of time rather than at single time points. These are known as range-based or collective anomalies. Nguyen18 consider these and use prediction errors from multiple time-steps for detecting collective anomalies.

Others, such as Zhu17 use Bayesian Neural Networks (BNNs, Khera18) and the Monte Carlo (MC) dropout framework to construct prediction intervals, inspired by Gal16a as well as Gal16b.

Several probability-based methods have also been successfully applied to the detection of anomalies Agg13. If said statistical methods have been applied successfully, why then consider deep learning for anomaly detection at all? We provide three reasons. Firstly, the boundary between normal and anomalous behaviour is at times not precisely defined, or it evolves continually, i.e. the distribution of the data changes dynamically as new instances are introduced to the time series. In such instances, statistical algorithms may underperform Uth17 unless, for instance, they are retrained continuously to mitigate the changes in the data distributions. Secondly, deep learning algorithms have been shown to scale well in other applications Bri16, and we hypothesise that this scalability attribute may extend to cases that require large-scale anomaly detection. Finally, in the most recent M4 competition, it is shown that deep learning produces better prediction intervals than statistical methods Makri20a. This superior prediction interval result is corroborated by the works of Smyl20 in the univariate case as well as Mat1 in the multivariate setting, for instance. Our hypothesis is that better prediction intervals can be used to build better anomaly detection machinery.

1.1 Related Work

There have been immense advances made in applying deep learning methods in fields such as forecasting Makri20a; Makri20b, but there is a relative scarcity of deep learning approaches for anomaly detection and its sub-disciplines Kwon19; Thu20.

Considering anomaly detection as a supervised learning problem requires a dataset where each data instance is labeled. With the anomaly detection problem phrased as such, there are general steps usually followed in the literature. Although not identical in their methodologies, even the most recent advances are not dissimilar to the approaches first adopted by Mal15, Chau15 and Mal16, for instance. Generally, methodologies incorporate the following steps:

  1. use a deep learning algorithm or architecture (e.g. LSTM, encoder-decoder) for generating forecasts, retain the errors as they are used later for anomaly detection;

  2. assume the resultant errors follow a Gaussian (or other) distribution;

  3. learn a threshold by maximizing the F-score, precision, recall, or some combination thereof on the validation set.
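In outline, steps 2 and 3 of this generic recipe can be sketched as follows. The function name, the grid of candidate thresholds (multiples of the error standard deviation), and the toy data are illustrative assumptions, not a specific method from the works reviewed above.

```python
import numpy as np

def gaussian_error_threshold(errors_val, labels_val, betas):
    """Steps 2-3 of the generic recipe: fit a Gaussian to the validation
    prediction errors, then pick the threshold (a multiple beta of the
    standard deviation) that maximizes the F1-score on labeled data."""
    mu, sigma = errors_val.mean(), errors_val.std()
    best_beta, best_f1 = None, -1.0
    for beta in betas:
        flagged = np.abs(errors_val - mu) > beta * sigma
        tp = np.sum(flagged & labels_val)
        fp = np.sum(flagged & ~labels_val)
        fn = np.sum(~flagged & labels_val)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_beta, best_f1 = beta, f1
    return mu, sigma, best_beta
```

Note that this sketch presumes labeled validation data, which is precisely the requirement our approach later dispenses with.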

There are also some comprehensive reviews, including, for instance, the works of Ade16, who survey deep learning-based methods for fraud detection, Kwon19 who focus on techniques and applications related to cyber-intrusion detection, and as recent as Thu20 who focus on the challenge of detecting anomalies in the domain of big data.

All of these reviews, and the research highlighted above, touch on some critical points related to drawbacks. We summarize these drawbacks in Section 1.2 below.

To avoid ambiguity, all the techniques described in this section shall further be referred to as "reviewed techniques."

1.2 Motivation

All the reviewed techniques, as well as their classical statistical counterparts, have similar drawbacks. Classical techniques such as statistical and probabilistic models are typically suitable only for univariate datasets Makri20a; Smyl20. For the more realistic multivariate case, Li16, for instance, propose a method for analyzing multivariate datasets by first transforming the high-dimensional data to a uni-dimensional space using Fuzzy C-Means clustering. They then detect anomalies with a Hidden Markov Model. However, Li16's approach does not fully utilize the additional information provided by the data's correlations.

Another issue with the reviewed techniques is their reliance on datasets that contain the ground truth, i.e. where the anomalies have been labeled. These labels are sometimes also referred to as golden labels. There are difficulties associated with gathering this kind of data in real life, such as the cost of labeling and human error in the labeling process. Even once the reviewed models are trained and parameterized on such data, if available, the dynamic nature of most real-life scenarios means the tuned model might not generalize to new types of anomalies unless retraining is performed continuously Uth17.

1.3 Contribution

  • We introduce a dynamic threshold-setting approach, which learns and adapts the applicable threshold as new information becomes available.

  • No golden labels are required for learning the thresholds or, subsequently, for detecting the anomalies.

  • We consolidate two streams of research by testing our assertions on both outlier and step-change anomalies, whereas previous researchers mostly focused on either one of these.

  • We critically analyze forecast machinery, construct competitive prediction intervals using computational means where no statistical methodology is readily applicable, then use all of this information to detect anomalies.

2 Datasets

Our datasets come from a diverse array of economic sectors, with categories ranging from climate to space aviation. The most important aspect when selecting our datasets is ensuring they differ distinctly in size, attributes, and the ratio of normal to anomalous observations. This data sourcing methodology allows us to forecast over varied horizons and mitigates biases in model performance. All the datasets are also freely available and have been used extensively in research related to ours and other fields.

2.1 Air Quality Dataset

We use the Beijing air quality dataset (AQD, uci14) from UCI. It contains measurements of hourly weather and pollution level for five years spanning January 1, 2010 to December 31, 2014 at the US embassy in Beijing, China. Meteorological data from the Beijing Capital International Airport are also included. The attribute of interest is the level of pollution in the air (pm2.5 concentration), while Table 1 details the complete feature list.

pm2.5 pm2.5 concentration
DEWP dew point
TEMP temperature
PRES pressure
cbwd combined wind direction
Iws cumulated wind speed
Is cumulated hours of snow
Ir cumulated hours of rain
Table 1: Air Quality Data Description.

2.2 Power Consumption Dataset

The second dataset is UCI’s Household Power Consumption (HPC, uci12), a multivariate time series dataset detailing a single household’s per-minute electricity consumption spanning December 2006 to November 2010.

With the date and time fields collapsed into a unique identifier, the remaining attributes are described in Table 2.

global_active_power total active power consumed (kilowatts)
global_reactive_power total reactive power consumed (kilowatts)
global_intensity minute-averaged current intensity (amperes)
sub_metering_1 active energy for kitchen (watt-hours)
sub_metering_2 active energy for laundry (watt-hours)
sub_metering_3 active energy for climate control systems (watt-hours)
Table 2: Power Consumption Data Description.

A new attribute is created: sub_metering_4 = (1000/60) × global_active_power − ∑_{k=1}^{3} sub_metering_k. It represents the per-minute active energy consumption not already captured by sub_meterings 1, 2 and 3.
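As a sketch, the derived attribute can be computed with pandas as below; the two rows are toy values, and the 1000/60 factor converts kilowatts into watt-hours per minute so the units line up with the sub-metering columns.

```python
import pandas as pd

# Toy rows with the UCI column names; global_active_power is in
# kilowatts, sub-meterings in watt-hours, so kW * 1000 / 60 gives
# watt-hours per minute.
df = pd.DataFrame({
    "global_active_power": [4.216, 5.360],
    "sub_metering_1": [0.0, 0.0],
    "sub_metering_2": [1.0, 1.0],
    "sub_metering_3": [17.0, 16.0],
})
df["sub_metering_4"] = (
    df["global_active_power"] * 1000 / 60
    - df[["sub_metering_1", "sub_metering_2", "sub_metering_3"]].sum(axis=1)
)
```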

2.3 Bike Sharing Dataset

Next, we consider the UCI Bike Sharing Dataset (BSD, uci13). It catalogues for each hour the number of bikes rented by customers of a bike-sharing service in Washington DC for the 2011 and 2012 calendar years. It includes features such as whether a particular day is a national holiday and the weather and seasonal information. Besides the date-time and index attributes that we collapse into a unique identifier, the dataset has 17 attributes in total. We select only a subset for our analysis, as described in Table 3.

holiday boolean whether day is holiday or not
weekday day of the week
weathersit 1 if clear, few clouds, partly cloudy
           2 if cloudy, mist, broken clouds, few clouds
           3 if light snow or rain, thunderstorm, scattered clouds
           4 if heavy rain, ice pellets, thunderstorm, snow, mist, fog
hum humidity
temp normalized temperature in degrees Celsius
casual count of casual users
registered count of registered users
cnt count of total rental bikes, sum of casual and registered
Table 3: Bike Sharing Data Description.

2.4 Bearings Dataset

We also use NASA’s Bearings Dataset (BDS, nasa1). It contains sensor readings on multiple bearings that are operated until failure under a constant load causing degradation. These are one second vibration signal snapshots recorded at ten minute intervals per file, where each file contains sensor data points per bearing. It contains golden labels and has been considered extensively by other researchers in their experimentation; see, for example, Pro17.

2.5 Webscope Benchmark Dataset

Lastly, we also evaluate our approach using the Yahoo Webscope Benchmark dataset (YWB, yahoo1), which is considered a good data source in the research community by, for instance, Thi17. It comprises large datasets with golden labels and is helpful for our purposes as it contains anomalies based on both outliers and change-points. The first benchmark, A1, contains real site traffic to Yahoo and its related products. The other three, A2, A3, and A4, are synthetic. All contain three months' worth of data points, though they vary in length, capture frequency, seasonality patterns and the total number of labeled anomalies.

Because we synthesize anomalies for AQD, BSD and HPC, we only use A1 from YWB. Using the other synthetic benchmarks in YWB may not add further insights to our analysis.

3 Methodology

Our approach utilizes three fundamental steps. We:

  1. forecast using deep learning and probability-based models,

  2. derive prediction intervals using computational means and built-in density distributions, respectively, and

  3. use said prediction intervals to detect anomalies.

Each of the fundamental steps, and intermediate ones, are outlined below.

3.1 Forecasting

3.1.1 AQD

We frame a supervised learning problem where, given the weather conditions and pollution measurements for the current and prior hours, we forecast the pollution at the next hour.

3.1.2 HPC

Given power consumption for 4 weeks, we forecast it for the week ahead. The observations are converted from per-minute to daily aggregates, as illustrated in Figure 1. Seven values comprise a single forecast, one value for each day in the standard week (beginning on a Sunday and ending Saturday). We evaluate each forecast in this aggregated setting as well as with each time step separately.

Figure 1: Separate Attribute Plots of the Power Consumption Data Spanning Four Years.

There are only 205 samples available for training, which may pose a problem for parameter tuning. We therefore disregard the standard week and instead use the prior days to forecast the coming days. This flattening, using overlapping windows, substantially increases the number of training samples.

In contrast, the testing framework remains unchanged, i.e. we use the prior four standard weeks to forecast the daily power consumption for the subsequent standard week.
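The flattening into overlapping windows described above can be sketched as follows; the helper name and the window sizes in the usage are illustrative, not the exact values used for HPC.

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Frame a 1-D series as overlapping supervised samples: n_in past
    steps as input, the following n_out steps as the target."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        Y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(Y)
```

For example, a series of ten daily values with a four-day input window and a two-day target yields five overlapping samples rather than one non-overlapping pair.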

3.1.3 BSD

Given all available information for BSD for the past 3 months, we forecast the total rental bikes for the coming week.

3.1.4 BDS

Given the sensor readings for the past 10 hours, we forecast the readings for the next hour. The mechanical degradation in the bearings occurs gradually over time, so we conduct our analysis using only data points occurring every 10 minutes instead of per-second observations. In total, we have observations across each of the four bearings.

To easily detect the point where the bearing vibration readings begin oscillating dramatically, we transform the signal from the time- to the frequency domain using a Fourier transform.
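A minimal sketch of this time-to-frequency transformation, using a synthetic one-second snapshot in place of the bearing signal (the sampling rate and the 997 Hz tone are assumptions for illustration, not properties of BDS):

```python
import numpy as np

fs = 20_000                          # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)        # a one-second snapshot
rng = np.random.default_rng(0)
# Synthetic stand-in for a bearing vibration signal: a 997 Hz tone
# buried in noise.
signal = np.sin(2 * np.pi * 997 * t) + 0.1 * rng.standard_normal(t.size)

# Real FFT: amplitude spectrum and the matching frequency bins.
amps = np.abs(np.fft.rfft(signal)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
peak = freqs[np.argmax(amps[1:]) + 1]   # skip the DC bin
```

With a one-second window the bins are spaced 1 Hz apart, so a growing amplitude at characteristic frequencies is easy to track as the bearing degrades.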

As depicted in Figure 2, there is an increase in the frequency amplitude for readings progressing to the complete degradation and subsequent failure of each bearing.

Figure 2: Bearing Sensor Test Frequency Data.

3.1.5 YWB

Lastly, for the YWB, we utilize the last five days' worth of observations to forecast traffic to the Yahoo entities for the next hour.

Note that all of the above forecast problems can be reframed using different window sizes. We state our chosen window sizes here for reproducibility; they were chosen empirically by optimizing accuracy on the training set.

For each dataset, we use the relevant forecast framework as described above and subsequently construct prediction intervals with all three deep learning algorithms and all three statistical algorithms.

To a great extent, the treatment of each dataset is not dissimilar. As such, the narrative that follows might overlap. For ease of exposition, all methods are the same unless explicitly stated.

3.2 Preprocessing

We undertake the following steps in preparing the time series data for forecasting; here we only highlight the techniques most used across the datasets. NA values are removed. NA values are distinct from missing (blank) values; the latter, present in AQD and HPC for instance, are imputed using the iterative imputer from the Scikit-learn module sci11, which models each feature with missing values as a function of other features in a round-robin fashion. We set the imputer to parameterize only on past values for each iteration, thus avoiding information leakage from the future testing set.
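As an illustrative sketch of this imputation step (with a toy matrix, not the actual weather features):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix standing in for two correlated features; the second column
# is roughly twice the first, with one blank value to fill in.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Round-robin imputation: each feature with missing values is modelled
# as a function of the others (BayesianRidge by default).
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```

In our setting the imputer is fit only on observations preceding the test period, as noted above.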

We label encode categorical values. Label encoding is chosen over one-hot encoding for instances where there are many categories in total (more than five), as the former reduces memory consumption. Furthermore, we normalize the datasets.

All the datasets are composed of time series with different seasonality patterns. In such instances, ANNs tend to struggle with seasonality (Clav15; Smyl20). A standard remedy is to deseasonalize the series first and remove the trend using a statistical procedure like the Seasonal and Trend decomposition using Loess (STL, Cleve90). For a more detailed treatment of the pre-processing stage, see Mat1.

3.3 Algorithms

For benchmarking, the statistical methods used are Multiple Linear Regression (MLR), Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX, Han88) and Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX, Aru16).

The deep learning architectures used are the Cascaded Neural Networks based on cascade-correlation (CCN), first published by Fahlman90, a variant of the Recurrent Neural Network (RNN, Will89) known as long short-term memory (LSTM, Hoch97), and one class of reservoir computing, Echo State Networks (ESN, Jaeger01).

For aiding in the construction of prediction intervals, we employ Monte Carlo Dropout Gal16b and the quantile bootstrap Dav13, also known as the reverse percentile bootstrap Hest14. Implementation details follow in Sections 3.7 and 3.8, respectively.

3.4 Measures of Accuracy

For the point forecasts we use the Root Mean Squared Error (RMSE), which reports the error in the same units as the forecast variable itself,

RMSE = \sqrt{ \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2 }, (1)

where there are n forecasts, y_t is the out-of-sample observation and \hat{y}_t the forecast at time t.

For the prediction intervals we use the Mean Interval Score (MIS, Gneit07), averaged over all out-of-sample observations.

MIS = \frac{1}{n} \sum_{t=1}^{n} \left[ (U_t - L_t) + \frac{2}{\alpha}(L_t - y_t)\,\mathbb{1}\{y_t < L_t\} + \frac{2}{\alpha}(y_t - U_t)\,\mathbb{1}\{y_t > U_t\} \right],

where U_t (L_t) is the upper (lower) bound of the prediction interval at time t, \alpha is the significance level and \mathbb{1} is the indicator function.

The MIS adds a penalty at the points where future observations fall outside the specified bounds. The width of the interval at each point is added to this penalty (if any), so wide intervals are also penalized. Finally, this sum over the individual points is averaged.

As a supplementary metric, we employ the Coverage Score (CS), which indicates the percentage of observations that fall within the prediction interval,

CS_\alpha = \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}\{ L_t \le y_t \le U_t \}, (2)

where the subscript denotes explicit dependence on the significance level \alpha.
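Both metrics are straightforward to implement; the following sketch assumes NumPy arrays of observations and interval bounds:

```python
import numpy as np

def mean_interval_score(y, lower, upper, alpha):
    """Mean Interval Score (Gneiting-style): interval width plus a
    2/alpha penalty proportional to how far observations fall outside
    the bounds, averaged over all points."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - y) * (y < lower)
    above = (2.0 / alpha) * (y - upper) * (y > upper)
    return float(np.mean(width + below + above))

def coverage_score(y, lower, upper):
    """Fraction of observations falling inside the interval."""
    return float(np.mean((y >= lower) & (y <= upper)))
```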

For evaluating the performance when it comes to anomaly detection, we employ precision, recall and the F-score. We also use the score of Buda17, which rates the early detection capability of an algorithm on a scale from worst to best. An increase (decrease) in this measure may be interpreted as a technique detecting an anomaly earlier (later), on average, within the detection window. The score for an algorithm, for an anomaly that falls within a given window of a dataset, is given by


where denotes the set of anomalies detected by algorithm in dataset , is the time taken for the initial detection of anomaly in by algorithm , and idx is the index of that timestamp, and are the beginning and end indices for window , respectively. We use the indices of the timestamp instead of the time difference as the datasets use different time steps between observations.

3.5 Analysis

3.5.1 AQD

We configure the LSTM with neurons in the first and only hidden recurrent layer, and one neuron in the output layer for predicting pollution. The model is trained for epochs with a batch size of . The input shape is time steps with features.

The ESN is initialized with input units, output unit, and internal network units in the dynamic reservoir. We use a hour wash-out interval.

The CCN is initialized with neurons in the input layer and one neuron in the output layer. The cascading is capped at layers.

The structure of the deep learning architectures for AQD is given in Figure 3 as an illustrative example. The CCN starts off with one hidden cell that is fully connected. As additional cells are added (if required), they are connected to the input and output nodes, as well as each of the preceding hidden nodes. The ESN's reservoir is composed of recurrent cells that are sparsely connected. The deep learning architecture configurations for the other datasets are detailed below.

Figure 3: Flowchart Diagrams Depicting our Deep Learning Model Architecture Configurations for AQD.

3.5.2 HPC

Our model is developed with one hidden LSTM recurrent layer with 200 neurons. The second layer is fully connected with 200 neurons. Lastly, the direct forecast is produced by an output layer that outputs a vector with seven entries, as previously described. Our model is trained for 65 epochs with a batch size of 18, determined experimentally to be optimal.

The ESN uses a day wash-out interval. The CCN has seven neurons in the output layer, with cascading capped at layers.

The first three calendar years are used as training data and the final year for model evaluation purposes.

3.5.3 BSD

The LSTM is initialized with 100 neurons in the sole hidden recurrent layer, followed by an output layer that directly forecasts a vector with seven days worth of forecasts. We fit the model for epochs with a batch size of .

The ESN uses a seven day wash-out interval. The CCN has neurons in the output layer, with cascading capped at layers.

3.5.4 BDS

We instantiate the LSTM with 100 neurons in the input layer, one hidden recurrent layer with 50 neurons, and an output layer comprised of 4 neurons. Our model is fit for 100 epochs with a batch size of 32.

The ESN is initialized with input units, output units, and internal network units in the dynamic reservoir. We use a hour wash-out interval.

The CCN is composed of neurons in the input layer and neurons in the output layer. The cascading is capped at layers.

3.5.5 YWB

The LSTM has 5 hidden layers with 4, 16, 48, 16, and 4 neurons, respectively. Our model is fit for 45 epochs with a batch size of 24.

The ESN uses a day wash-out interval. The CCN has neurons in the output layer, with cascading capped at layers.

3.5.6 Overall

In terms of AQD, we use holdout cross-validation, fitting each of the models on the first four years of data and testing on the remaining year. For all the other datasets, we use walk-forward validation for optimizing the hyper-parameters. For instance, with HPC, our models forecast a week ahead. After that, the actual observations from that week are presented to the models and added to the set of observations for the current iteration of the walk-forward validation process, before we forecast the subsequent week.

As an additional method to mitigate computational cost, the models are evaluated statically, i.e. they are trained on the historical observations and then iteratively forecast each step of the walk-forward validation.

For each architecture, we use the Mean Absolute Error (MAE, Sam10) loss function and the efficient Adam version of stochastic gradient descent King14. Each configuration was selected over various others based on empirical studies of test RMSE reduction and of configuration behaviour over time.

The input variables to VARMAX and SARIMAX are declared as exogenous.

After each model is fit, we forecast, replace the seasonality and trend components, invert the scaling, and compute an error score for each. The results follow in Section 4.1.

3.6 Evaluation

The LSTM is implemented using the LSTM and Dense classes of the Keras API v2.1.5 for Python 3.7, with the TensorFlow v1.15.0 framework as backend. The code implementation for the CCN is written from scratch in Python 3.7, and the ESN is implemented using PyTorch v0.3 (PAs17) for Python 3.7. The naïve and benchmark methods are implemented using the Statsmodels API sea10 for Python 3.7. All models are trained on an Intel Core i7-9750H processor with an Nvidia GeForce GTX 1650 GPU.

3.7 Recurrent Layer Dropout

We compare the forecast accuracy at four increasing dropout rates to the standard LSTM fit above (i.e. with no dropout). Each dropout experiment is repeated 35 times to mitigate the stochastic nature of the deep learning algorithms. Table 4, and the box and whisker plot in Figure 4, display the distributions of results for each configuration in the case of AQD as an illustrative example.

Dropout Rate
count 35 35 35 35 35
mean 20.7746 21.8681 22.5086 23.7820 25.4886
std 0.0000 0.1283 0.1793 0.2436 0.2812
min 20.7746 21.5665 22.2587 23.4037 24.8913
25% 20.7746 21.8193 22.4969 23.5335 25.2295
50% 20.7746 21.8543 22.5990 23.7627 25.4671
75% 20.7746 21.9773 22.7260 23.8939 25.6549
max 20.7746 22.0783 22.9699 24.3575 25.8718
Table 4: Dropout RMSE Score Distributions.
Figure 4: Dropout RMSE Score Distributions.

From Table 4, we note that the standard implementation (LSTM with no dropout) outperforms all the configurations with dropout. The lowest dropout rate produces the tightest output distribution (a desirable property for constructing prediction intervals) while achieving an RMSE competitive with the standard implementation. Because we drop out at such a low probability, this configuration retains a lot more information than the others. We thus use this configuration for prediction interval construction.

We run the chosen configuration repeatedly with dropout active, forecast, and use the distribution of the forecast values to construct our prediction intervals. To compute the prediction interval, we use the percentiles at \left( \frac{\alpha}{2},\, 1 - \frac{\alpha}{2} \right).
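The procedure can be sketched generically as follows; `stochastic_predict` stands in for a forward pass with dropout kept active at inference time (e.g. calling a Keras model with `training=True`), and the function name and run count are illustrative assumptions.

```python
import numpy as np

def mc_dropout_interval(stochastic_predict, x, alpha, n_runs=100):
    """Run a model with dropout active n_runs times and take the
    (alpha/2, 1 - alpha/2) percentiles of the resulting forecasts as
    the prediction interval."""
    draws = np.array([stochastic_predict(x) for _ in range(n_runs)])
    lower = np.percentile(draws, 100 * alpha / 2, axis=0)
    upper = np.percentile(draws, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```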

3.8 Bootstrap

We resample residuals in the following manner:

  1. Once the model is fit, retain the fitted values and the residuals over the forecast horizon.

  2. For each tuple of (multivariate) explanatory variables and its fitted value, add a randomly resampled residual to the fitted value, thus creating a synthetic response variable; the residual is drawn at random from the retained list for every tuple.

  3. Using the explanatory variables, forecast; then refit the model on the synthetic response variables, again retaining the residuals and forecasts.

  4. Repeat steps 2 and 3 many times.

We use the distribution of the forecast values to construct the prediction intervals using percentiles in the same manner described in Section 3.7 above. As in the case of the recurrent layer dropout, this technique is advantageous as it retains the explanatory variables’ information.
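A sketch of the resampling loop follows, with a simple linear model standing in for the network; the model choice, the helper name and the n_boot default are illustrative assumptions.

```python
import numpy as np

def bootstrap_interval(x, y, x_new, alpha, n_boot=500, seed=0):
    """Residual (quantile) bootstrap: resample residuals onto fitted
    values, refit on the synthetic responses, forecast, and take
    percentiles of the resulting forecast distribution."""
    rng = np.random.default_rng(seed)
    coef = np.polyfit(x, y, 1)              # stand-in model fit
    fitted = np.polyval(coef, x)
    resid = y - fitted
    forecasts = []
    for _ in range(n_boot):
        # Synthetic responses: fitted values plus resampled residuals.
        y_syn = fitted + rng.choice(resid, size=resid.size, replace=True)
        coef_b = np.polyfit(x, y_syn, 1)    # refit on synthetic data
        forecasts.append(np.polyval(coef_b, x_new))
    lo, hi = np.percentile(forecasts, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```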

3.9 Anomaly Detection

Anomalies are labeled in YWB. BDS does not contain any golden labels. We synthesize anomalies by superimposing them on the other three datasets, as previously mentioned in Section 2.5 above. This step is undertaken in a stochastic manner, where a typical composition is given in Table 5.

# Metrics
AQD changepoint
HPC collective
BSD contextual
BDS changepoint
contextual & collective
Table 5: Anomaly Data Structure Summary.

If a new observation falls within our constructed prediction interval, it is classified as normal. If it falls outside the interval, we:

  1. determine by which magnitude it falls outside the bounds. This measure is deducible using a variant of the MIS defined in Section 3.4, i.e. the Interval Score (IS). If this measure is larger than that of the previous observation that fell outside the bounds, then we

  2. employ Chebyshev's inequality Dix68, which guarantees a useful property for a wide class of probability distributions: no more than a certain fraction of values can lie more than a certain distance from the mean. A suitable threshold can be tuned to avoid raising many false alerts. In our case, the inequality guarantees that at least 99% of the values must lie within ten standard deviations of the mean.

If an out-of-sample observation falls outside the prediction interval bounds and also falls beyond ten times the standard deviation of the observations (our dynamic threshold), it is classified as anomalous.

Essentially, the anomaly detection procedure may be summarized in Algorithm 1, and the entire structure of our deep learning models and data flow process is presented graphically in Figure 5.

  if L_t ≤ y_t ≤ U_t then
      y_t is normal
  else if IS(y_t) > IS(y_s) and |y_t − μ| > 10σ then
      y_t is anomalous {where y_s is the last anomalous observation}
  end if
Algorithm 1 Anomaly Detection
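A simplified Python sketch of this rule follows; the interval-score comparison of step 1 is omitted for brevity, and the "suspect" label for out-of-interval points below the threshold is an illustrative addition, not part of Algorithm 1.

```python
def classify(y_t, lower, upper, mu, sigma, k=10):
    """Inside the prediction interval: normal. Outside the interval and
    beyond k standard deviations of the mean (Chebyshev guarantees at
    least 1 - 1/k**2 of values lie inside): anomalous."""
    if lower <= y_t <= upper:
        return "normal"
    if abs(y_t - mu) > k * sigma:
        return "anomaly"
    return "suspect"
```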
Figure 5: Flowchart Diagram Depicting our Deep Learning Model Architectures.

4 Results

4.1 Forecasting

A persistence model Coi13 provides baseline RMSE scores for AQD, HPC, BSD, BDS and YWB. The scores from the models under consideration, truncated to four decimals, are shown in Table 6.

                          AQD       HPC        BSD       BDS      YWB
Statistical Benchmarks
  MLR                  24.8791   415.8602   43.8581   1.5001   278.6595
  SARIMAX              22.9377   377.8571   34.3472   1.3712   251.8902
  VARMAX               23.1093   401.3319   39.2279   1.4879   266.0144
Deep Learning
  LSTM                 20.1342   337.9452   31.8951   0.9588   213.8532
  ESN                  21.2357   349.1773   31.0451   1.0057   209.3517
  CCN                  22.9876   392.5459   35.6843   1.1421   227.1096

Table 6: Model Error Scores.

4.2 Prediction Intervals

For ease of interpretation, we modify the MIS to introduce a scaling factor, such that the best performing model has a scaled Mean Interval Score (sMIS) equal to 1, and the rest are larger than 1. The worst performing model thus has the largest sMIS value. For each significance level considered we have

sMIS_A = \frac{ MIS_A }{ \min_{B \in \mathcal{A}} MIS_B },

where \mathcal{A} comprises all algorithms considered (both statistical and deep learning). We further highlight the top two performers at each significance level.
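The scaling is straightforward to compute; an illustrative sketch with invented MIS values, not those of the tables below:

```python
# Scale each model's MIS by the best (smallest) MIS so the best model
# scores exactly 1.0; the values here are illustrative only.
mis = {"MLR": 31.2, "SARIMAX": 28.7, "LSTM": 26.6}
best = min(mis.values())
smis = {name: value / best for name, value in mis.items()}
```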

                      sMIS                          CS
                0.1      0.05     0.01       0.1      0.05     0.01
Statistical Benchmarks
  MLR         1.1704   1.2105   1.2323     0.9031   0.9171   0.9305
  SARIMAX    *1.0778   1.2093   1.2167    *0.9524  *0.9668  *0.9797
  VARMAX      1.4141   1.2662   1.0881     0.9449   0.9621   0.9761
Deep Learning
  LSTM       *1.0000   1.1437  *1.0000    *0.9620  *0.9810  *0.9824
  ESN         1.1859  *1.1365   1.1367     0.9179   0.9308   0.9435
  CCN         1.2103  *1.0000  *1.0655     0.9094   0.9285   0.9633
(* marks the top two performers per column)

Table 7: Prediction Interval Performance for Air Quality Data.

                            sMIS                         CS
                     0.1      0.05     0.01      0.1      0.05     0.01
Statistical Benchmarks
  MLR              1.2131   1.1688   1.2072   0.9062   0.9169   0.9165
  SARIMAX          1.0364   1.2052   1.2708  *0.9493   0.9501  *0.9553
  VARMAX           1.1236   1.1796   1.1676   0.9075   0.9107   0.9194
Deep Learning
  LSTM             1.1727   1.0872   1.1662   0.9370  *0.9519   0.9422
  ESN             *1.0000  *1.0000  *1.0000  *0.9587  *0.9683  *0.9772
  CCN             *1.0186  *1.0388  *1.0409   0.9468   0.9472   0.9495
(* denotes a top-two performer at that significance level.)

Table 8: Prediction Interval Performance for Bike Sharing Data.

                            sMIS                         CS
                     0.1      0.05     0.01      0.1      0.05     0.01
Statistical Benchmarks
  MLR              1.2262   1.2123   1.1686   0.9081   0.9117   0.9144
  SARIMAX          1.1081  *1.0181   1.1012  *0.9413   0.9514   0.9760
  VARMAX           1.1764   1.1565   1.1148   0.9178   0.9262   0.9322
Deep Learning
  LSTM            *1.0449   1.1005  *1.0211  *0.9507  *0.9594  *0.9767
  ESN              1.0910  *1.0000  *1.0000   0.9372  *0.9516  *0.9895
  CCN             *1.0000   1.1452   1.0708   0.9157   0.9299   0.9571
(* denotes a top-two performer at that significance level.)

Table 9: Prediction Interval Performance for Household Power Consumption Data.

                            sMIS                         CS
                     0.1      0.05     0.01      0.1      0.05     0.01
Statistical Benchmarks
  MLR              1.3562   1.2020   1.1549   0.9106   0.9134   0.9287
  SARIMAX         *1.0000   1.1254   1.1832   0.8957   0.9009   0.9071
  VARMAX           1.1764   1.1565   1.2149   0.9075   0.9161   0.9193
Deep Learning
  LSTM            *1.0541   1.1045  *1.0299  *0.9401  *0.9412  *0.9491
  ESN              1.0910  *1.0000  *1.0000   0.9202  *0.9346  *0.9398
  CCN              1.1081  *1.0181   1.1012  *0.9213   0.9277   0.9301
(* denotes a top-two performer at that significance level.)

Table 10: Prediction Interval Performance for Bearings Data.

                            sMIS                         CS
                     0.1      0.05     0.01      0.1      0.05     0.01
Statistical Benchmarks
  MLR              1.3054   1.1120   1.1499   0.8878   0.8923   0.8971
  SARIMAX          1.1081  *1.0169   1.1012  *0.9293   0.9313   0.9360
  VARMAX           1.2262   1.2123   1.1686   0.8981   0.9108   0.9142
Deep Learning
  LSTM             1.0812  *1.0000  *1.0000   0.9271  *0.9360  *0.9494
  ESN             *1.0546   1.1050  *1.0233  *0.9307  *0.9394  *0.9464
  CCN             *1.0000   1.1452   1.0708   0.9157   0.9299   0.9309
(* denotes a top-two performer at that significance level.)

Table 11: Prediction Interval Performance for Webscope Benchmark Data.

Tables 7 to 11 detail the prediction interval results. We note that at least one deep learning algorithm always ranks among the top two performers. The interpretation is that the deep learning algorithms outperform, or at the very least are competitive with, the traditional statistical methods. This outperformance is also evidenced by the granular aggregated daily RMSE for HPC, as displayed in Figure 6. Figure 6 supplements Table 9 by giving scores at a daily level, which are consistent with the overall scores detailed in the latter. Figure 7 illustrates the point forecasts and prediction intervals for ESN (the best performer for BSD) over the final hours of our forecast horizon. Our results are consistent with the findings of Makri20a, who noted that deep learning models produce better prediction intervals, albeit in the univariate case.

ESN is the best performing architecture for BSD at all significance levels considered. This result is consistent with the RMSE scores in Table 6.

Figure 6: Overall Daily RMSE for Household Power Consumption Dataset.
Figure 7: Point Forecasts and Prediction Intervals for Bike Sharing Dataset.

4.3 Anomaly Detection

Figures 8 and 9 detail the score distributions for all the models applied to each dataset. The F1-score varies by dataset, but the results indicate that LSTM is the best performer, followed in every instance by either ESN or CCN; MLR is the worst performer across all datasets. Since the F1-score is the harmonic average of Precision and Recall, with 1 the perfect score and 0 the worst, we note that the deep learning models are better at learning the dynamic threshold.

With the score in Figure 9 too, the statistical methods are outperformed by the deep learning models. This result follows because better calibration of the dynamic threshold leads to earlier detection of anomalies.
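As a reminder, the harmonic mean of Precision and Recall underlying the F1-score can be computed as (a minimal sketch; this standalone helper is ours, not the paper's evaluation code):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 1 is perfect, 0 is worst."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 is dragged toward the weaker of the two components, so a model cannot score well by inflating recall at the expense of precision (or vice versa).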

(a) F1-score for AQD.
(b) F1-score for BSD.
(c) F1-score for HPC.
(d) F1-score for BDS.
(e) F1-score for YWB.
Figure 8: F1-score distribution for each dataset and all models.
(a) AQD.
(b) BSD.
(c) HPC.
(d) BDS.
(e) YWB.
Figure 9: Score distribution for each dataset and all models.

5 Conclusion

Our point of departure is that deep learning models have been shown to outperform traditional statistical methods at forecasting, and to produce superior uncertainty quantification. In this paper, we produce point forecasts, construct their associated prediction intervals by computational means where no built-in probabilistic constructs exist, and use said prediction intervals to detect anomalies in several multivariate datasets.

With the methodology presented, we note that better point forecast accuracy leads to the construction of better prediction intervals, which in turn leads to the main result presented: better prediction intervals yield more accurate and faster anomaly detection. The improvement results from better parameterized models producing more accurate forecasts; these models are then used in the interval construction process and subsequently in multivariate anomaly detection.

The goal was not to prove that deep learning is far superior to statistical methods in this regard, but to show that, when proper methodology is followed, it is at worst competitive. This is an important result, as it opens the way for the wide-scale adoption of deep learning in other applications.

The computational cost of the deep learning algorithms is not discussed in this paper, and this could be an avenue for further exploration. The algorithmic time complexity is known for each model, with ESN being the fastest deep learning architecture; even so, it is still not as fast as any of the statistical methods. Future work can consider techniques for reducing the computational cost of the deep learning architectures so that, in this aspect too, they become competitive with their statistical counterparts.

Finding techniques to address this last issue in particular will enable researchers to study multivariate time series in more detail and at a larger scale. There is a need to conduct our kind of analysis on multivariate time series at the scale of, say, the M4 competition Makri20a.

Conflict of Interest

The authors declare that they have no conflict of interest.