1 Introduction
Hawk80 defines an outlier as an observation that "deviates so significantly from other observations as to arouse suspicion that a different mechanism generated it." Anomalies, also known as outliers, arise for several reasons, such as intentional acts of malice, the collapse or failure of various systems, or fraudulent activity. Detecting these anomalies in a timely manner is essential to various decision-making processes. Anomaly detection is not to be confused with novelty detection, which focuses on identifying new patterns in the data; however, the techniques used for anomaly detection are often used for novelty detection and vice versa.
Car20 draw a comparison between the two fields and detail recent advances in the methods applicable to each. Anomaly detection, when framed as a classification problem, has been modelled using One-Class Support Vector Machines as far back as the works of Schol99. Dealing with a highly imbalanced class distribution (where normal observations far exceed anomalous data points), Schol99 detect novel patterns by looking at the local density distribution of normal data. Mal15 use stacked LSTM networks for anomaly detection in time series. They train the network on non-anomalous data and use it as a predictor over several time steps. The resulting prediction errors are modelled as a multivariate Gaussian distribution, which is then used to assess the likelihood of anomalous behaviour. Later, Mal16 examine anomaly detection for mechanical devices in the presence of exogenous effects. An LSTM-based encoder-decoder scheme is employed that learns to reconstruct time-series behaviour under normalcy, and thereafter uses the reconstruction error to detect anomalies. Sch17 employ a deep convolutional generative adversarial network in an unsupervised manner to learn a manifold of normal anatomical variability and apply it to the task of anomaly detection to aid disease detection and treatment monitoring.
More recently, research has focused on time-series anomalies that manifest over a period of time rather than at single time points. These are known as range-based or collective anomalies. Nguyen18 consider these and use prediction errors from multiple time steps to detect collective anomalies.
Others, such as Zhu17, use Bayesian Neural Networks (BNNs, Khera18) and the Monte Carlo (MC) dropout framework to construct prediction intervals, inspired by Gal16a as well as Gal16b.
Several probability-based methods have also been successfully applied to the detection of anomalies Agg13. If such statistical methods have been applied successfully, why then consider deep learning for anomaly detection at all? We provide three reasons. Firstly, the boundary between normal and anomalous behaviour is at times not precisely defined, or it evolves continually, i.e. the distribution of the data changes dynamically as new instances are introduced to the time series. In such instances, statistical algorithms may underperform Uth17 unless, for instance, they are retrained continuously to mitigate the changes in the data distributions. Secondly, deep learning algorithms have been shown to scale well in other applications Bri16, and we hypothesise that this scalability may extend to cases that require large-scale anomaly detection. Finally, in the most recent M4 competition, it is shown that deep learning produces better prediction intervals than statistical methods Makri20a. This superior prediction interval result is corroborated by the works of Smyl20 in the univariate case as well as Mat1 in the multivariate setting, for instance. Our hypothesis is that better prediction intervals can be used to build better anomaly detection machinery.
1.1 Related Work
There have been immense advances in applying deep learning methods in fields such as forecasting Makri20a; Makri20b, but there is a relative scarcity of deep learning approaches for anomaly detection and its sub-disciplines Kwon19; Thu20.
Considering anomaly detection as a supervised learning problem requires a dataset where each data instance is labeled. With the anomaly detection problem phrased as such, there are general steps usually followed in the literature. Although not identical in their methodologies, even the most recent advances are not dissimilar to approaches first adopted by
Mal15, Chau15 and Mal16, for instance. Generally, methodologies incorporate the following steps:
use a deep learning algorithm or architecture (e.g. LSTM, encoder-decoder) to generate forecasts, retaining the errors as they are used later for anomaly detection;

assume the resultant errors follow a Gaussian (or other) distribution;

learn a threshold by maximizing either the score, precision, recall or some combination thereof on the validation set.
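The generic pipeline above can be sketched as follows. This is a minimal illustration rather than the exact pipeline of any of the cited works; the univariate Gaussian error model and the F1-style grid search over thresholds are representative choices.

```python
import numpy as np
from scipy import stats

def fit_error_model(residuals):
    """Fit a Gaussian to the prediction errors obtained on normal data."""
    return stats.norm(loc=residuals.mean(), scale=residuals.std(ddof=1))

def tune_threshold(val_errors, val_labels, error_model, n_grid=200):
    """Choose a log-likelihood threshold maximizing F1 on a labeled validation set.
    Observations with low likelihood under the error model are flagged anomalous."""
    log_lik = error_model.logpdf(val_errors)
    best_f1, best_tau = -1.0, None
    for tau in np.linspace(log_lik.min(), log_lik.max(), n_grid):
        pred = (log_lik < tau).astype(int)  # low likelihood -> anomaly
        tp = np.sum((pred == 1) & (val_labels == 1))
        fp = np.sum((pred == 1) & (val_labels == 0))
        fn = np.sum((pred == 0) & (val_labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_f1, best_tau = f1, tau
    return best_tau, best_f1
```

A precision- or recall-oriented objective can be substituted for F1 in the search without changing the structure of the procedure.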
There are also some comprehensive reviews, including, for instance, the works of Ade16, who survey deep learning-based methods for fraud detection, Kwon19, who focus on techniques and applications related to cyber-intrusion detection, and, as recently as Thu20, the challenge of detecting anomalies in the domain of big data.
All of these reviews, and the research highlighted above, touch on some critical points related to drawbacks. We summarize these drawbacks in Section 1.2 below.
To avoid ambiguity, all the techniques described in this section are hereafter referred to as "reviewed techniques".
1.2 Motivation
All the reviewed techniques, as well as their classical statistical counterparts, have similar drawbacks. Classical techniques such as statistical and probabilistic models are typically suitable only for univariate datasets Makri20a; Smyl20. For the more realistic multivariate case, Li16, for instance, propose a method for analyzing multivariate datasets by first transforming the high-dimensional data to a unidimensional space using Fuzzy C-Means clustering. They then detect anomalies with a Hidden Markov Model. However, Li16's approach does not fully utilize the additional information provided by the correlation in the data. Another issue with the reviewed techniques is their reliance on datasets that contain the ground truth, where the anomalies have been labeled. These labels are sometimes also referred to as golden labels. There are difficulties associated with gathering this kind of data in real life, such as the cost of labeling or human error in the labeling process. Even once the reviewed models are trained and parameterized on such data, if available, the dynamic nature of most real-life scenarios means the tuned model might not generalize to new types of anomalies unless retraining is performed continuously Uth17.
1.3 Contribution
Our contributions are as follows:
We introduce a dynamic threshold-setting approach, which learns and adapts the applicable threshold as new information becomes available.

No golden labels are required for learning the thresholds or, subsequently, for detecting the anomalies.

We consolidate two streams of research by testing our assertions on both outlier and step-change anomalies, whereas previous researchers mostly focused on one or the other.

We critically analyze forecast machinery, construct competitive prediction intervals using computational means where no statistical methodology is readily applicable, and then use all of this information to detect anomalies.
2 Datasets
Our datasets come from a diverse array of economic sectors. We use data from various categories ranging from climate to space aviation. The most important consideration when selecting our datasets is ensuring they are distinctly different in terms of size, attributes, and the ratio of normal to anomalous observations. This data sourcing methodology allows us to forecast over varied horizons and mitigates biases in model performance. All the datasets are freely available and have been used extensively in research related to ours and in other fields.
2.1 Air Quality Dataset
We use the Beijing air quality dataset (AQD, uci14) from UCI. It contains measurements of hourly weather and pollution level for five years spanning January 1, 2010 to December 31, 2014 at the US embassy in Beijing, China. Meteorological data from the Beijing Capital International Airport are also included. The attribute of interest is the level of pollution in the air (pm2.5 concentration), while Table 1 details the complete feature list.
Table 1: AQD feature list

Feature  Description
pm2.5  pm2.5 concentration
DEWP  dew point
TEMP  temperature
PRES  pressure
cbwd  combined wind direction
Iws  cumulated wind speed
Is  cumulated hours of snow
Ir  cumulated hours of rain
2.2 Power Consumption Dataset
The second dataset is UCI's Household Power Consumption (HPC, uci12), a multivariate time series dataset detailing a single household's per-minute electricity consumption spanning December 2006 to November 2010.
With the date and time fields collapsed into a unique identifier, the remaining attributes are described in Table 2.
Table 2: HPC attribute descriptions

Feature  Description
global_active_power  total active power consumed ()
global_reactive_power  total reactive power consumed ()
global_intensity
sub_metering_1  active energy for kitchen ()
sub_metering_2  active energy for laundry ()
sub_metering_3
A new attribute is created: sub_metering_4 = (1000/60) × global_active_power − ∑_{k=1}^{3} sub_metering_k. It represents the per-minute active energy consumption not already captured by sub_meterings 1, 2 and 3.
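Assuming the lower-case column names used in the table above, the derived attribute can be computed as follows; the 1000/60 factor converts kilowatts of global active power into watt-hours of energy per minute.

```python
import pandas as pd

def add_sub_metering_4(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the residual per-minute active energy not covered by the three
    sub-meters. Assumes global_active_power is in kilowatts and the
    sub_metering columns are in watt-hours of energy per minute."""
    out = df.copy()
    out["sub_metering_4"] = (
        out["global_active_power"] * 1000.0 / 60.0
        - out[["sub_metering_1", "sub_metering_2", "sub_metering_3"]].sum(axis=1)
    )
    return out
```

Working on a copy keeps the raw dataset intact for the later preprocessing steps.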
2.3 Bike Sharing Dataset
Next, we consider the UCI Bike Sharing Dataset (BSD, uci13). It catalogues, for each hour, the number of bikes rented by customers of a bike-sharing service in Washington DC for the 2011 and 2012 calendar years. It includes features such as whether a particular day is a national holiday, as well as weather and seasonal information. Besides the datetime and index attributes that we collapse into a unique identifier, the dataset has 17 attributes in total. We select only a subset for our analysis, as described in Table 3.
Table 3: BSD feature subset

Feature  Description
holiday  boolean indicating whether the day is a holiday
weekday  day of the week
weathersit
hum  humidity
temp  normalized temperature in degrees Celsius
casual  count of casual users
registered  count of registered users
cnt  count of total rental bikes, sum of casual and registered
2.4 Bearings Dataset
We also use NASA's Bearings Dataset (BDS, nasa1). It contains sensor readings from multiple bearings operated until failure under a constant load that causes degradation. The readings are one-second vibration signal snapshots recorded at ten-minute intervals, one per file, where each file contains sensor data points per bearing. The dataset contains golden labels and has been considered extensively by other researchers in their experimentation; see, for example, Pro17.
2.5 Webscope Benchmark Dataset
Lastly, we also evaluate our approach using the Yahoo Webscope Benchmark dataset (YWB, yahoo1), which is considered a good data source in the research community by, for instance, Thi17. It comprises large datasets with golden labels. It is helpful for our purposes as it contains anomalies based on both outliers and change points. The first benchmark, A1, consists of real site traffic to Yahoo and its related products. The other three, A2, A3, and A4, are synthetic. They all contain three months' worth of data points, though of varying lengths, capture frequencies, seasonality patterns and total numbers of labeled anomalies.
Because we synthesize anomalies for AQD, BSD and HPC, we only use A1 from YWB. Using the other synthetic benchmarks in YWB may not add further insights to our analysis.
3 Methodology
Our approach utilizes three fundamental steps. We:

forecast using deep learning and probabilitybased models,

derive prediction intervals using computational means and built-in density distributions, respectively, and

use said prediction intervals to detect anomalies.
Each of the fundamental steps, and intermediate ones, are outlined below.
3.1 Forecasting
3.1.1 AQD
We frame a supervised learning problem where, given the weather conditions and pollution measurements for the current and prior hours, we forecast the pollution at the next hour.
3.1.2 HPC
Given power consumption for 4 weeks, we forecast it for the week ahead. The observations are converted from perminute to daily aggregates, as illustrated in Figure 1. Seven values comprise a single forecast, one value for each day in the standard week (beginning on a Sunday and ending Saturday). We evaluate each forecast in this aggregated setting as well as with each time step separately.
Only 205 samples are available for training, and this may pose a problem for parameter tuning. We therefore disregard the standard week and use the prior days to forecast the coming days. This flattening and use of overlapping windows effectively increases the total number of training samples.
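The overlapping-window flattening described above can be sketched as follows; this is a generic illustration, with `n_in` and `n_out` standing in for the chosen input and forecast horizons.

```python
import numpy as np

def sliding_windows(series: np.ndarray, n_in: int, n_out: int):
    """Turn a time series into overlapping (input, target) samples.
    Each sample uses n_in past steps to predict the next n_out steps."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        y.append(series[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(y)
```

Because consecutive windows overlap by all but one step, a series of length `T` yields `T - n_in - n_out + 1` training samples rather than `T // (n_in + n_out)`.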
In contrast, the testing framework remains unchanged, i.e. we use the prior four standard weeks to forecast the daily power consumption for the subsequent standard week.
3.1.3 BSD
Given all available information for BSD for the past 3 months, we forecast the total rental bikes for the coming week.
3.1.4 BDS
Given the sensor readings for the past 10 hours, we forecast the readings for the next hour. The mechanical degradation in the bearings occurs gradually over time, so we conduct our analysis using only data points occurring every 10 minutes instead of persecond observations. In total, we have observations across each of the four bearings.
To easily detect the point where the bearing vibration readings begin oscillating dramatically, we transform the signal from the time to the frequency domain using a Fourier transform.
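A minimal sketch of this time-to-frequency transformation using NumPy's real FFT; the sampling rate shown is illustrative, not that of BDS.

```python
import numpy as np

def amplitude_spectrum(signal: np.ndarray, sample_rate: float):
    """Transform a vibration snapshot to the frequency domain and return
    (frequencies, amplitudes) of its one-sided spectrum."""
    n = len(signal)
    amplitudes = np.abs(np.fft.rfft(signal)) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    return freqs, amplitudes
```

A rise in amplitude across the spectrum of successive snapshots is then the signature of approaching bearing failure.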
As depicted in Figure 2, there is an increase in the frequency amplitude for readings progressing to the complete degradation and subsequent failure of each bearing.
3.1.5 YWB
Lastly, for the YWB, we utilize the last five days' worth of observations to forecast traffic to the Yahoo entities for the next hour.
Note that all the above forecast problems can be reframed with different window sizes. We state our chosen window sizes here for reproducibility; these windows were chosen via empirical studies, optimizing on training accuracy.
For each dataset, we use the relevant forecast framework as described above and subsequently construct prediction intervals with all three deep learning algorithms and all three statistical algorithms.
To a great extent, the treatment of each dataset is not dissimilar. As such, the narrative that follows might overlap. For ease of exposition, all methods are the same unless explicitly stated.
3.2 Preprocessing
We undertake the following steps in preparing the time series data for forecasting; here we only highlight the techniques most used across the datasets. NA values are removed. NA values are distinct from missing (blank) values; the latter, amounting to for AQD and for HPC, for instance, are imputed using the iterative imputer from the Scikit-learn module sci11, which models each feature with missing values as a function of other features in a round-robin fashion. We set the imputer to parameterize only on past values for each iteration, thus avoiding information leakage from the future testing set. We label-encode categorical values. Label encoding is chosen over one-hot encoding for instances where there are many categories in total (more than five), as the former reduces memory consumption. Furthermore, we normalize the datasets.
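A minimal sketch of the imputation step with Scikit-learn's IterativeImputer; fitting on the training portion only and then transforming later data approximates the "past values only" constraint described above. The data here is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic training block with ~5% of entries missing at random.
rng = np.random.default_rng(42)
train = rng.normal(size=(200, 4))
train[rng.random(train.shape) < 0.05] = np.nan

# Parameterize the imputer on the (past) training data only, so no
# information leaks backwards from the future testing set.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(train)
train_filled = imputer.transform(train)
```

The fitted imputer can later be applied to the test period with `imputer.transform(test)` without refitting.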
All the datasets are composed of time series with different seasonality patterns. In such instances, ANNs tend to struggle with seasonality (Clav15; Smyl20). A standard remedy is to deseasonalize the series first and remove the trend using a statistical procedure like the Seasonal and Trend decomposition using Loess (STL, Cleve90). For a more detailed treatment of the preprocessing stage, see Mat1.
3.3 Algorithms
For benchmarking, the statistical methods used are Multiple Linear Regression (MLR, Mat2005), Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX, Han88) and Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX, Aru16). The deep learning architectures used are Cascaded Neural Networks based on cascade-correlation (CCN), first published by Fahlman90, a variant of the Recurrent Neural Network (RNN, Will89) known as long short-term memory (LSTM, Hoch97), and one class of reservoir computing, Echo State Networks (ESN, Jaeger01).
3.4 Measures of Accuracy
For the point forecasts we use the Root Mean Squared Error (RMSE), which gives the error in the same units as the forecast variable itself:
RMSE = √( (1/n) ∑_{t=1}^{n} ( y_t − ŷ_t )² ),  (1)
where there are n forecasts, y_t is the out-of-sample observation and ŷ_t the forecast at time t.
For the prediction intervals we use the Mean Interval Score (MIS, Gneit07), averaged over all outofsample observations.
MIS = (1/n) ∑_{t=1}^{n} [ ( U_t − L_t ) + (2/α) ( L_t − y_t ) 𝟙{ y_t < L_t } + (2/α) ( y_t − U_t ) 𝟙{ y_t > U_t } ], where U_t (L_t) is the upper (lower) bound of the prediction interval at time t, α is the significance level and 𝟙 is the indicator function.
The MIS adds a penalty at the points where future observations are outside the specified bounds. The width of the interval at is added to the penalty, if any. As such, it also penalizes wide intervals. Finally, this sum at the individual points is averaged.
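The MIS can be computed directly from its definition; a small sketch:

```python
import numpy as np

def mean_interval_score(y, lower, upper, alpha):
    """Mean Interval Score (Gneiting & Raftery): interval width plus
    penalties for observations outside the (1 - alpha) prediction interval."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    width = upper - lower
    below = (2.0 / alpha) * (lower - y) * (y < lower)
    above = (2.0 / alpha) * (y - upper) * (y > upper)
    return np.mean(width + below + above)
```

Note that the 2/α factor makes violations increasingly expensive at smaller significance levels, which is what discourages intervals that are narrow but poorly calibrated.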
As a supplementary metric, we employ the Coverage Score (CS) which indicates the percentage of observations that fall within the prediction interval,
CS_α = (1/n) ∑_{t=1}^{n} 𝟙{ L_t ≤ y_t ≤ U_t }.  (2)
The subscript α denotes explicit dependence on the significance level.
For evaluating the performance when it comes to anomaly detection we employ precision, recall and the score. We also use the score Buda17, which scores the early detection capability of an algorithm in the range (worst) to (best). An increase (decrease) of say in the score measure may be interpreted as a technique detecting an anomaly on average of the time interval earlier (later). The score for an algorithm for anomaly that falls within window for dataset is given by
(3) 
where denotes the set of anomalies detected by algorithm in dataset , is the time taken for the initial detection of anomaly in by algorithm , and idx is the index of that timestamp, and are the beginning and end indices for window , respectively. We use the indices of the timestamp instead of the time difference as the datasets use different time steps between observations.
3.5 Analysis
3.5.1 AQD
We configure the LSTM with neurons in the first and only hidden recurrent layer, and neuron in the output layer for predicting pollution. The model is trained for epochs with a batch size of . The input shape is time steps with features.
The ESN is initialized with input units, output unit, and internal network units in the dynamic reservoir (). We use a hour washout interval.
The CCN is initialized with neurons in the input layer and one neuron in the output layer. The cascading is capped at layers.
The structures of the deep learning architectures for AQD are given in Figure 3 as an illustrative example. The CCN starts off with one hidden cell that is fully connected. As additional cells are added (if required), they are connected to the input and output nodes, as well as to each of the preceding hidden nodes. The ESN's reservoir is composed of recurrent cells that are sparsely connected. The deep learning architecture configurations for the other datasets are detailed below.
3.5.2 HPC
Our model is developed with one hidden LSTM recurrent layer with 200 neurons. The second layer is fully connected with 200 neurons. Lastly, the direct forecast is produced by an output layer that outputs a vector with seven entries, as previously described. Our model is trained for 65 epochs with a batch size of 18, determined experimentally to be optimal.
The ESN uses a day washout interval. The CCN has seven neurons in the output layer, with cascading capped at layers.
The first three calendar years are used as training data and the final year for model evaluation purposes.
3.5.3 BSD
The LSTM is initialized with 100 neurons in the sole hidden recurrent layer, followed by an output layer that directly forecasts a vector with seven days worth of forecasts. We fit the model for epochs with a batch size of .
The ESN uses a seven day washout interval. The CCN has neurons in the output layer, with cascading capped at layers.
3.5.4 BDS
We instantiate the LSTM with 100 neurons in the input layer, one hidden recurrent layer with 50 neurons, and an output layer comprised of 4 neurons. Our model is fit for 100 epochs with a batch size of 32.
The ESN is initialized with input units, output units, and internal network units in the dynamic reservoir. We use a hour washout interval.
The CCN is composed of neurons in the input layer and neurons in the output layer. The cascading is capped at layers.
3.5.5 YWB
The LSTM has 5 hidden layers with 4, 16, 48, 16, and 4 neurons, respectively. Our model is fit for 45 epochs with a batch size of 24.
The ESN uses a day washout interval. The CCN has neurons in the output layer, with cascading capped at layers.
3.5.6 Overall
In terms of AQD, we use holdout cross-validation, fitting each of the models on the first four years of data and testing on the remaining year. For all the other datasets, we use walk-forward validation for optimizing the hyperparameters. For instance, with HPC, our models forecast a week ahead. Thereafter, the actual observations for that week are presented to the models and included in the set of observations for the current iteration of the walk-forward validation process, before we forecast the subsequent week.
As an additional method to mitigate computational cost, the models are evaluated statically, i.e. they are trained on the historical observations and then iteratively forecast each step of the walkforward validation.
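A generic sketch of this static walk-forward evaluation; the `model_fit` and `model_predict` callables are placeholders, not the paper's models.

```python
import numpy as np

def walk_forward_forecasts(model_fit, model_predict, series, n_in, n_out):
    """Static walk-forward evaluation: fit once on the initial history,
    then iteratively forecast each block and reveal the true values."""
    history = list(series[:n_in])
    params = model_fit(np.array(history))  # trained once, not refit per step
    forecasts = []
    for start in range(n_in, len(series) - n_out + 1, n_out):
        forecasts.append(model_predict(params, np.array(history)))
        history.extend(series[start : start + n_out])  # reveal actuals
    return np.array(forecasts)
```

Fitting once on the historical observations, rather than refitting at every step, is what keeps the computational cost of the walk-forward procedure manageable.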
For each architecture, we use the Mean Absolute Error (MAE, Sam10) loss function and the efficient Adam version of stochastic gradient descent King14. Each configuration was selected over various others based on empirical studies of test RMSE reduction and configuration behaviour over time. The input variables to VARMAX and SARIMAX are declared as exogenous.
After each model is fit, we forecast, replace the seasonality and trend components, invert the scaling, and compute an error score for each. The results follow in Section 4.1.
3.6 Evaluation
The LSTM is implemented using the LSTM and Dense classes of the Keras API v2.1.5 (Cho15) for Python 3.7, with the TensorFlow v1.15.0 framework as backend (Abadi16). The code implementation for the CCN is written from scratch in Python 3.7, and the ESN is implemented using PyTorch v0.3 (PAs17) for Python 3.7. The naïve or benchmark methods are implemented using the Statsmodels API sea10 for Python 3.7. All models are trained on an Intel Core i7-9750H processor with an Nvidia GeForce GTX 1650 GPU.
3.7 Recurrent Layer Dropout
We compare the forecast accuracy at dropout rates of , , , and to the standard LSTM fit above (i.e. with no dropout). Each dropout experiment is repeated 35 times to mitigate the stochastic nature of the deep learning algorithms. Table 4, together with the box-and-whisker plot in Figure 4, displays the distributions of results for each configuration in the case of AQD as an illustrative example.
Table 4: Distribution of test RMSE for each dropout configuration (AQD)

Statistic  no dropout  increasing dropout rates →
count  35  35  35  35  35
mean  20.7746  21.8681  22.5086  23.7820  25.4886
std  0.0000  0.1283  0.1793  0.2436  0.2812
min  20.7746  21.5665  22.2587  23.4037  24.8913
25%  20.7746  21.8193  22.4969  23.5335  25.2295
50%  20.7746  21.8543  22.5990  23.7627  25.4671
75%  20.7746  21.9773  22.7260  23.8939  25.6549
max  20.7746  22.0783  22.9699  24.3575  25.8718
From Table 4, we note that the standard implementation (LSTM with no dropout) outperforms all the configurations with dropout. The lowest dropout rate yields the tightest output distribution (a desirable property for constructing prediction intervals) and a competitive RMSE compared to the standard implementation. Because we drop out at such a low probability, this configuration retains far more information than the others. We thus use this configuration for prediction interval construction.
We run the chosen configuration repeatedly, with this dropout probability active at prediction time. We forecast, and use the distribution of the resulting forecast values to construct our prediction intervals. To compute the prediction interval, we use the percentiles at (α/2, (1 − α) + α/2).
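A sketch of the interval construction from repeated stochastic forward passes; the `stochastic_forecast` callable stands in for an LSTM predicting with dropout left active.

```python
import numpy as np

def mc_dropout_interval(stochastic_forecast, n_runs, alpha):
    """Build a (1 - alpha) prediction interval from repeated stochastic
    forward passes (dropout left active at prediction time).
    `stochastic_forecast` is a zero-argument callable returning one forecast."""
    draws = np.array([stochastic_forecast() for _ in range(n_runs)])
    lower = np.percentile(draws, 100 * (alpha / 2), axis=0)
    upper = np.percentile(draws, 100 * ((1 - alpha) + alpha / 2), axis=0)
    return lower, upper
```

The same percentile rule is reused for the bootstrap intervals in Section 3.8, so the two interval-construction methods differ only in how the forecast distribution is generated.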
3.8 Bootstrap
We resample residuals in the following manner:

Once the model is fit, retain the fitted values and the residuals over the forecast horizon.

For each tuple of explanatory variables and fitted value, add a randomly resampled residual to the fitted value, thus creating a synthetic response variable; the residual is drawn randomly, with replacement, from the retained list for every tuple.

Using these synthetic response variables, refit the model and forecast, again retaining the residuals and forecasts.

Repeat the resampling and refitting steps the desired number of times.
We use the distribution of the forecast values to construct the prediction intervals, using percentiles in the same manner described in Section 3.7 above. As with the recurrent layer dropout, this technique is advantageous as it retains the information in the explanatory variables.
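The residual-resampling procedure can be sketched generically as follows; `fit` and `predict` are placeholder callables, not the paper's architectures.

```python
import numpy as np

def residual_bootstrap_intervals(fit, predict, X, y, X_future, n_boot, alpha, seed=0):
    """Residual-resampling bootstrap for prediction intervals.
    `fit(X, y)` returns model parameters; `predict(params, X)` returns forecasts."""
    rng = np.random.default_rng(seed)
    params = fit(X, y)
    fitted = predict(params, X)
    residuals = y - fitted
    forecasts = []
    for _ in range(n_boot):
        # Synthetic responses: fitted values plus resampled residuals.
        y_star = fitted + rng.choice(residuals, size=len(residuals), replace=True)
        params_star = fit(X, y_star)
        forecasts.append(predict(params_star, X_future))
    forecasts = np.array(forecasts)
    lower = np.percentile(forecasts, 100 * alpha / 2, axis=0)
    upper = np.percentile(forecasts, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```

Any model exposing fit and predict steps can be plugged in; a linear model serves as the simplest check of the mechanics.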
3.9 Anomaly Detection
Anomalies are labeled in YWB. BDS does not contain any golden labels. We synthesize anomalies by superimposing them on the other three datasets, as previously mentioned in Section 2.5 above. This step is undertaken in a stochastic manner; a typical composition is given in Table 5.
Table 5: Typical composition of synthesized and labeled anomalies

Dataset  Anomaly type
AQD  changepoint
HPC  collective
BSD  contextual
BDS  changepoint
YWB
If a new observation falls within our constructed prediction interval, it is classified as normal. If it falls outside the interval, we employ Chebyshev's inequality Dix68, which guarantees a useful property for a wide class of probability distributions: no more than a certain fraction of values can lie more than a certain distance from the mean. A suitable threshold can be tuned to avoid many false alerts. In our case, we utilize the inequality's guarantee that at least 99% of the values must lie within ten standard deviations of the mean.
If an out-of-sample observation falls outside the prediction interval bounds and falls beyond ten times the standard deviation of the observations (our dynamic threshold), it is classified as anomalous.
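A minimal sketch of the resulting two-stage rule: the prediction interval check first, then the Chebyshev-style ten-standard-deviation dynamic threshold.

```python
import numpy as np

def classify_point(y_t, lower_t, upper_t, history, k=10.0):
    """Flag y_t as anomalous only if it lies outside the prediction interval
    AND beyond k standard deviations of the historical observations
    (by Chebyshev's inequality, at most 1/k**2 of any distribution lies
    that far from the mean)."""
    if lower_t <= y_t <= upper_t:
        return False  # inside the prediction interval: normal
    mu, sigma = np.mean(history), np.std(history)
    return bool(abs(y_t - mu) > k * sigma)
```

Because `history` grows as new observations arrive, the threshold `k * sigma` adapts with the data, which is what makes the threshold dynamic.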
4 Results
4.1 Forecasting
A persistence model Coi13 gives an RMSE score of , , , , and respectively for AQD, HPC, BSD, BDS and YWB. Those from the models under consideration, truncated to four decimals, are shown in Table 6.
4.2 Prediction Intervals
For ease of interpretation, we modify the MIS to introduce a scaling factor such that the best performing model has a scaled Mean Interval Score (sMIS) equal to 1, and the rest are larger than 1. The worst performing model thus has the largest sMIS value. For each significance level considered we have
sMIS_m = MIS_m / min_{m′ ∈ M} MIS_{m′},  (4)
where M comprises all algorithms considered (both statistical and deep learning). We further highlight the top two performers at each significance level.
Tables 7 to 11 detail the prediction interval results. We note that at least one deep learning algorithm is always in the top two performers. The interpretation is that deep learning algorithms outperform the traditional statistical methods, or at the very least, are competitive. This outperformance is also evidenced by the granular result for aggregated daily RMSE for HPC, as displayed in Figure 6. Figure 6 is supplementary to Table 9 as it gives a granular score at a daily level, which is consistent with the overall score detailed in the latter. Figure 7 illustrates the point forecasts and prediction intervals for ESN (best performer for BSD) at the last hours of our forecast horizon. Our results are consistent with the findings of Makri20a, who noted that deep learning models produce better prediction intervals, albeit in the univariate case.
ESN is the best performing architecture for BSD at all the prediction intervals considered. This result is consistent with the RMSE scores in Table 6.
4.3 Anomaly Detection
Figures 8 and 9 detail the distributions of the two scores, respectively, for all the models applied to each dataset. We note the score varies for each dataset, but the results indicate LSTM is the best performer, followed in every instance by either ESN or CCN. The worst performer is MLR across all datasets. Since the score is the harmonic average of precision and recall, where 1 is the perfect score and 0 the worst, we note that the deep learning models are better at learning the dynamic threshold. Again, with the early-detection score, the statistical methods are outperformed by the deep learning models. This result follows as better dynamic threshold calibration leads to earlier detection of anomalies.
5 Conclusion
Our point of departure is that deep learning models have been shown to outperform traditional statistical methods at forecasting, and deep learning has been shown to produce superior uncertainty quantification. In this paper, we produce point forecasts, produce their associated prediction intervals using computational means where no builtin probabilistic constructs exist, and use said prediction intervals to detect anomalies on different multivariate datasets.
With the methodology presented, we note that better point forecast accuracy leads to the construction of better prediction intervals, which in turn leads to the main result presented: better prediction intervals lead to more accurate and faster anomaly detection. The improvement is a result of better-parameterized models producing more accurate forecasts; these models are in turn used in the interval construction process and, subsequently, in multivariate anomaly detection.
The goal was not to prove that deep learning is far superior to statistical methods in this regard, but to show that, when proper methodology is followed, it is competitive in the worst case. This is an important result, as it opens the way for the wide-scale adoption of deep learning in other applications.
The computational cost of the deep learning algorithms is not discussed in this paper, and this could be an avenue for further exploration. The algorithmic time complexity is known for each model, with ESN being the fastest deep learning architecture; even so, it is still not as fast as any of the statistical methods. Future work can consider techniques for reducing the computational cost of the deep learning architectures so that they become competitive with their statistical counterparts in this aspect too.
Finding techniques to address this last issue in particular, will enable researchers to study multivariate time series in more detail, and at a grander scale. There exists a need for conducting our kind of analysis on multivariate time series at the scale of, say, the M4 competition Makri20a.
Conflict of Interest
The authors declare that they have no conflict of interest.