A Framework for End-to-End Deep Learning-Based Anomaly Detection in Transportation Networks

11/20/2019 ∙ by Neema Davis, et al. ∙ Indian Institute Of Technology, Madras

We develop an end-to-end deep learning-based anomaly detection model for temporal data in transportation networks. The proposed EVT-LSTM model is derived from the popular LSTM (Long Short-Term Memory) network and adopts an objective function that is based on fundamental results from EVT (Extreme Value Theory). We compare the EVT-LSTM model with some established statistical, machine learning, and hybrid deep learning baselines. Experiments on seven diverse real-world data sets demonstrate the superior anomaly detection performance of our proposed model over the other models considered in the comparison study.




I Introduction

The increasing availability of large-scale traffic data sets provides an opportunity to explore them for knowledge discovery in ITS (Intelligent Transportation Systems). The avenues for exploration are numerous, ranging from uncovering traffic patterns [1], city dynamics [2], driving directions [3], discovering hot spots in a city [4], finding vacant taxis around a city [5], predicting taxi demand [6], taxi operation patterns [7], to detecting anomalies [8], among others.

Various verticals of ITS have long received active research attention. However, the recent emergence of deep learning techniques and their applicability in transportation systems has resulted in a heightened interest in this area [9]. Consequently, traditional machine learning models in many applications are now being replaced by deep learning techniques, which is reshaping the landscape of intelligent transport networks. Out of the several applications of ITS, the area of anomaly detection has benefited significantly from the application of deep learning-based techniques [10]. Anomaly detection aims to find those patterns which are not normally expected from the data. Typical observations from traffic data demonstrate strong spatio-temporal patterns, showing periodicity and strong correlations between adjacent observations. These patterns may vary depending on the time of the day, day of the week, season, or location. Occasional deviations from these patterns can be termed abnormal events. Information extracted from these anomalous events can provide useful guidelines to urban planners. For instance, abnormal traffic event detection can be utilized to help mitigate congestion, plan driving routes, and reduce the imbalance between taxi demand and supply.

Within the transportation domain, anomaly detection has been applied to abnormal trajectory detection [8], finding atypical regions [11], obstacle detection [12], congestion analysis [13], and irregularities in taxi passenger demand [14], among others. Anomaly detection also finds extensive use in a wide range of applications such as fraud detection for credit cards, insurance, or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance for enemy activities [15].

I-A Related Literature

Traditionally, anomaly detection has been performed using parametric and non-parametric statistical models, data clustering, rule-based systems, mixture models, and SVMs (Support Vector Machines), among others; for extensive surveys, the interested reader can refer to [15] and [16]. These traditional models often fail to capture the complex structures in the data. Additionally, as the volume of the data increases, traditional methods may experience difficulties in finding outliers at such a large scale. Hence, the performance of the aforementioned algorithms in detecting outliers might be sub-optimal for real-world sequences.

In recent years, deep learning-based anomaly detection algorithms have become increasingly popular, with applications in a diverse set of tasks [10]. Unsupervised anomaly detection using deep learning has mainly been hybrid in nature. First, the deep neural network learns the complex patterns of the data. Then, the hidden layer representations from this trained network are used as input to traditional anomaly detection algorithms. There are two popular categories of deep learning-based anomaly detection. The first category consists of methods that analyze the reconstruction errors in an auto-encoder trained over the normal data. A deficiency in the reconstruction of a test point indicates abnormality [17]. The second class of methods utilizes either an auto-encoder trained over the normal class to generate a low-dimensional embedding, or a neural network to generate predictions. To identify anomalies, one uses classical methods over the embedding or predictions, such as a parametric distribution assumption [17], an OC-SVM (One Class-SVM) [18], etc.

While the currently popular hybrid deep learning-based anomaly detection techniques have proven to be effective in multiple tasks, these neural networks are not customized for anomaly detection. Since the hybrid models extract features using a neural network and feed them to a separate anomaly detection method, the detection step cannot influence the representation learning in the hidden layers. A more advanced variant of this approach combines the encoding and detection steps using an appropriate objective function, which is used to train a single neural model that performs both procedures [19]. In related research [20], the authors use geometrical transformations to perform end-to-end deep learning-based anomaly detection using CNNs (Convolutional Neural Networks). In [21], an OC-SVM objective is implemented in a feed-forward neural network for deep anomaly detection.

The primary focus of the aforementioned literature is on anomaly detection in the context of image data sets. The anomaly detection techniques tailored for images need not necessarily perform well with time sequences. Therefore, in this study, we aim to develop an end-to-end anomaly detection model using the LSTM (Long Short-Term Memory) network [22], which is a neural network designed for sequential data. By gathering insights from EVT (Extreme Value Theory) [23], we design an end-to-end LSTM-based anomaly detection model. To the best of our knowledge, an LSTM-based end-to-end deep anomaly detection model for transportation data has not been explored in the literature. Further, our objective function and network weight updates are based on results from EVT. So far, Extreme Value Theory has not been employed in training a neural network model for performing anomaly detection. These features set our research apart from existing literature. (Footnote 1: A part of this work has been presented as a conference paper [24].)

I-B Our Contributions

We propose an end-to-end deep anomaly detection algorithm, and compare the model against several baseline models: (i) a parametric GARCH (Generalized Auto Regressive Conditional Heteroskedasticity) model, (ii) a non-parametric OC-SVM model, and (iii) hybrid LSTM anomaly detection models based on different detection rules. The detection rules used in the hybrid deep anomaly detection models are based on the Gaussian distribution, Tukey’s method, and EVT. The key findings obtained by comparing the traditional and deep learning-based models are outlined below.

  1. This study develops an end-to-end deep anomaly detection algorithm for temporal data based on an LSTM network and an objective function derived from EVT.

  2. Our proposed EVT-LSTM model outperforms several statistical, machine learning, and hybrid deep learning-based algorithms across seven diverse data sets.

  3. We highlight the necessity of a customized neural network model in a deep learning-based anomaly detection setting.

The rest of the paper is organized as follows. In Section II, we explain the traditional baseline models considered for anomaly detection in this study. The hybrid deep anomaly detection model, along with the three detection strategies, is explained in Section III. It is followed by Section IV, where we introduce our proposed EVT-LSTM model. The experimental settings are provided in Section V, and the results are outlined in Section VI. We conclude our work in Section VII.

II Traditional Anomaly Detection

In this section, we provide brief descriptions of two traditional anomaly detection models considered as baselines in our comparison study.

II-A GARCH Model

Parametric statistical models [25] represent one of the early works on outlier detection in time series. Several models were subsequently proposed in the literature for parametric anomaly detection, including ARMA (Auto Regressive Moving Average), ARIMA (Auto Regressive Integrated Moving Average), and EWMA (Exponentially Weighted Moving Average), to list a few. We assume that the normal data instances are located at the high probability regions of a stochastic model compared to the anomalies that have a low probability. A common practice followed here is to either assume a distribution for the anomalies [26] or fit a regression model to the data [27].

A regression-based anomaly detection technique involves two steps: (a) the regression model is used to model the data, and (b) the residuals, i.e., the part not explained by the regression model, are used to determine the anomaly scores. A popular choice for regression-based anomaly detection is the GARCH model [28], which is often applied to financial time-series. A GARCH process is often preferred over other regression models such as ARMA because it imposes a specific structure on the conditional variance of the process. The conditional variance is not assumed to be constant, which makes the model suitable for real-world series whose variability changes over time. Essentially, the GARCH process models the error variance of the time-series as an ARMA process. The AR part models the variance of the residuals and the MA portion models the variance of the process. The time series $\epsilon_t$ at each instance is given by:

$$\epsilon_t = \sigma_t w_t, \tag{1}$$

where $w_t$ is discrete white noise with zero mean and unit variance, and $\sigma_t^2$ is given by:

$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{r} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^2, \tag{2}$$

where $\alpha_i$ and $\beta_j$ are the parameters of the model. In other words, $\epsilon_t$ is a Generalized Auto Regressive Conditional Heteroskedastic model of order $r$ and $s$, denoted by GARCH($r$, $s$).
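As a rough illustration of the residual-based scoring described above, the following sketch runs a GARCH(1, 1) conditional-variance recursion over a residual series and scores each point by its magnitude relative to the current conditional standard deviation. The function name `garch11_scores`, the parameter values, and the toy residuals are assumptions made for illustration; they are not the fitted models of this study.

```python
import math

def garch11_scores(residuals, alpha0=0.1, alpha1=0.1, beta1=0.8):
    """Return |residual| / conditional std under a GARCH(1, 1) recursion.

    The parameters are assumed values for illustration; in practice they
    would be estimated (e.g. by maximum likelihood) on training data.
    """
    # Start from the unconditional variance alpha0 / (1 - alpha1 - beta1).
    var = alpha0 / (1.0 - alpha1 - beta1)
    scores = []
    for eps in residuals:
        scores.append(abs(eps) / math.sqrt(var))
        # sigma_t^2 = alpha0 + alpha1 * eps_{t-1}^2 + beta1 * sigma_{t-1}^2
        var = alpha0 + alpha1 * eps ** 2 + beta1 * var
    return scores

# Toy residuals with one large shock at index 4.
residuals = [0.2, -0.1, 0.3, -0.2, 4.0, 0.1, -0.3]
scores = garch11_scores(residuals)
most_anomalous = scores.index(max(scores))
```

In this sketch the score is computed before the variance is updated, so a shock is judged against the variance implied by the past only; a threshold on `scores` would then be tuned on a validation set.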

Parametric methods allow the model to be evaluated very rapidly for new instances and are suitable for large data sets; the model grows only with model complexity and not the data size. However, their applicability is limited because they impose a predetermined distribution on the data. These approaches are accurate only if the data fits the chosen distribution model. The non-parametric approach described below can overcome this disadvantage associated with parametric models.

II-B OC-SVM Model

Non-parametric methods such as SVMs [29] apply local kernel models rather than a single global distribution model to the data. Their popularity stems from the ability to combine speed and low complexity growth of parametric methods with the model flexibility of non-parametric methods. Kernel-based methods estimate the density distribution of the input space and identify outliers as lying in regions of low density.

Typically, the SVM model is given a set of training examples labeled as belonging to one of two classes. The model tries to divide the training sample points into two categories by creating a boundary while penalizing training samples that fall on the wrong side of the boundary. The SVM model can then make predictions by assigning points to either side of the boundary. For anomaly detection applications, the training examples are often limited. Therefore, SVMs are more popularly applied in a one-class setting here, where the SVM model is trained on data that has only one class, that is, the normal class. This is particularly useful in anomaly detection because by inferring the properties of the normal class, the examples that deviate from the normal class can be identified. The SVM model needs a kernel function that can map the original observations, which may not be linearly separable, into a higher-dimensional space in which they are separable. Commonly used kernel functions are linear, sigmoid, and Gaussian RBF (Radial Basis Function) [29, Chapter 2]. During the testing phase, if a test instance falls within the learned region, it is declared as normal, else it is deemed as anomalous.

The SVM model requires a kernel function, which has to be carefully tuned for obtaining good classification accuracy. Further, the anomaly detection is supervised in nature; it requires prior knowledge of the labels. On the other hand, the recently developed anomaly detection models based on neural networks can perform unsupervised anomaly detection, and hence, have lately seen more widespread use than the SVM model for anomaly detection.

III Hybrid Deep Anomaly Detection

The suitability of neural network models for anomaly detection originates from their unsupervised learning nature and the ability to learn highly complex non-linear sequences. When presented with normal non-anomalous data, the neural network can learn and capture the normal behavior of the system. Later, when the model encounters a data instance that deviates significantly from the rest of the set, it generates a high prediction error, suggesting anomalous behavior. This form of prediction-based anomaly detection is hybrid in nature as it requires the application of a set of detection rules on the errors obtained from the network. Often, the decision rules employed are traditional statistical or machine learning anomaly detection algorithms. Popular detection techniques involve thresholding the prediction errors [30], assuming an underlying parametric distribution on the prediction errors [17], or applying machine learning techniques such as an SVM model on the errors [31]. We now briefly describe the prediction model and detection rules considered in our study.

III-A Prediction Model

We use the LSTM network as the time-series prediction model. LSTM networks are state-of-the-art neural network models that are widely used in sequence learning applications [22]. For every data set, we feed a fixed number of the most recent values into the model, and the model outputs a fixed number of forecasts. These two window lengths are known as the look-back and the look-ahead, respectively. Dropout and early stopping are employed to avoid over-fitting.

Each data set is divided into a training set, a validation set, and a test set. The model learns from the training data and validates its performance on the hold-out validation data. The training set is assumed to be free of anomalies. This is a reasonable assumption in real-world scenarios where instances of normal behavior may be available in abundance, but instances of anomalous behavior are rare. The validation and test sets are mixtures of anomalous and non-anomalous data instances. The prediction model is trained on normal data without any anomalies, i.e., on the training data, so that it learns the normal behavior of the time-series. Once the model is trained, anomaly detection is performed on the test set, by using the prediction errors as anomaly indicators. In this paper, the prediction error is defined as the absolute difference between the input received at a given time step and the corresponding prediction that the model produced for that time step.
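The look-back and look-ahead windows and the absolute prediction error can be sketched as follows. This is a minimal illustration: the helper names `make_windows` and `absolute_errors` are hypothetical, and a naive persistence forecast stands in for the LSTM, which is not reproduced here.

```python
def make_windows(series, look_back, look_ahead):
    """Split a series into (input window, target window) pairs."""
    pairs = []
    last_start = len(series) - look_back - look_ahead
    for start in range(last_start + 1):
        inputs = series[start:start + look_back]
        targets = series[start + look_back:start + look_back + look_ahead]
        pairs.append((inputs, targets))
    return pairs

def absolute_errors(actual, predicted):
    """Prediction error as defined in the text: |actual - predicted|."""
    return [abs(a - p) for a, p in zip(actual, predicted)]

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pairs = make_windows(series, look_back=3, look_ahead=1)

# Hypothetical stand-in for the trained model: predict that the next
# value equals the last value of the look-back window.
actual = [targets[0] for _, targets in pairs]
predicted = [inputs[-1] for inputs, _ in pairs]
errors = absolute_errors(actual, predicted)
```

The list `errors` plays the role of the anomaly indicators to which the detection rules are applied.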

We consider three detection techniques by which the prediction errors can be used to set an anomaly threshold: (i) the Gaussian-based detection rule that makes assumptions about the parent distribution, (ii) the Tukey’s method based detection rule that does not make any assumptions on the distribution, and (iii) the EVT-based detection rule that makes an assumption about the tail of the distribution, but not about the parent distribution. If any prediction error value lies outside of the chosen threshold, then the corresponding input can be considered as a possible anomaly. The detection rules considered are as follows:

III-B Gaussian-based Detection [17]

One of the earliest and most popular works in the prediction-based anomaly detection setting [17] assumes that the prediction errors from the training set follow a Gaussian distribution. The prediction errors obtained from the LSTM model are fit to a Gaussian distribution. The mean, $\mu$, and variance, $\sigma^2$, of the Gaussian distribution are computed using MLE (Maximum Likelihood Estimation) [32]. The Log PDs (Probability Densities) of errors are calculated based on the parameters estimated and used as anomaly scores. A low value of Log PD indicates that the likelihood of an observation being an anomaly is high. A validation set containing both normal data and anomalies is used to set a threshold on the Log PD values. The threshold is chosen such that it can separate all the anomalies from normal observations while incurring as few false positives as possible. The threshold is then evaluated on a separate test set.
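A minimal sketch of this rule is given below. The toy error values and the Log PD threshold of -5.0 are assumptions chosen for illustration; in the procedure above, the threshold is selected on a validation set.

```python
import math

def fit_gaussian(errors):
    """Maximum-likelihood mean and variance (divides by n, not n - 1)."""
    n = len(errors)
    mu = sum(errors) / n
    var = sum((e - mu) ** 2 for e in errors) / n
    return mu, var

def log_pd(error, mu, var):
    """Log probability density of one error under N(mu, var)."""
    return (-0.5 * math.log(2.0 * math.pi * var)
            - (error - mu) ** 2 / (2.0 * var))

train_errors = [0.10, 0.12, 0.09, 0.11, 0.08, 0.10]
mu, var = fit_gaussian(train_errors)

threshold = -5.0  # hypothetical value; tuned on a validation set in practice
test_errors = [0.10, 0.50]
flags = [log_pd(e, mu, var) < threshold for e in test_errors]
```

Here an error close to the training mean receives a high Log PD and is kept, while an error far in the tail falls below the threshold and is flagged.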

III-C Tukey’s Method Based Detection [30]

Tukey’s method uses quartiles to define an anomaly threshold. It makes no distributional assumptions and does not depend on the knowledge of a mean or a standard deviation. In Tukey’s method, a possible outlier lies outside the threshold $[Q_1 - k\,|Q_3 - Q_1|,\ Q_3 + k\,|Q_3 - Q_1|]$, where $Q_1$ is the lower quartile or the 25th percentile, $Q_3$ is the upper quartile or the 75th percentile, and $k$ is a constant multiplier (conventionally 1.5). The metric $|Q_3 - Q_1|$ is known as the interquartile distance. The prediction errors obtained from the training, validation, and test sets are concatenated, and the quartiles and interquartile distances are calculated. The values lying outside this threshold are identified as potential outliers.
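The quartile-based rule can be sketched as below. The multiplier `k = 1.5` is the conventional choice for Tukey's fences and is an assumption here, as are the toy error values and the use of the inclusive quantile method from the Python standard library.

```python
import statistics

def tukey_fences(errors, k=1.5):
    """Return (lower, upper) fences Q1 - k * IQR and Q3 + k * IQR.

    k = 1.5 is the conventional multiplier and is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Concatenated prediction errors (toy values) with one extreme entry.
errors = [0.10, 0.12, 0.09, 0.11, 0.08, 0.10, 0.13, 0.95]
lower, upper = tukey_fences(errors)
outliers = [e for e in errors if e < lower or e > upper]
```

Because the fences depend only on quartiles, a single extreme error barely moves them, which is what allows the rule to work without distributional assumptions.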

III-D EVT-based Detection [23]


Let $X$ be a random variable and $F$ be its CDF (Cumulative Distribution Function). The tail of the distribution is given by $\bar{F}(x) = 1 - F(x) = P(X > x)$. The probability $\bar{F}(x)$ tends to zero for the extreme events in the system. A key result from EVT [33] suggests that the distribution of the extreme values is not highly sensitive to the parent data distribution. This result enables us to accurately compute probabilities without first estimating the underlying distribution. Under a weak condition, the extreme events have the same kind of distribution, regardless of the parent distributions, known as the EVD (Extreme Value Distribution):

$$G_{\gamma}(x) = \exp\left(-\left(1 + \gamma \frac{x}{\sigma}\right)^{-1/\gamma}\right), \quad 1 + \gamma \frac{x}{\sigma} > 0, \tag{3}$$

where $\sigma$ is the scale parameter, and $\gamma$ is the extreme value index of the distribution. Based on the value $\gamma$ takes, the tail distribution can be Fréchet ($\gamma > 0$), Gumbel ($\gamma = 0$), or Weibull ($\gamma < 0$). By fitting an EVD to the unknown input distribution tail, it is then possible to evaluate the probability of potential extreme events. In some recent work [23], the authors use results from EVT to detect anomalies in a uni-variate data stream, following the POTs (Peaks-Over-Thresholds) approach. Based on an initial threshold $t$, the POTs approach attempts to fit a GPD (Generalized Pareto Distribution) to the excesses, $X - t$. In other words, rather than fitting an EVD to the extreme values of $X$, the POTs approach fits a GPD to the excesses $X - t$. To compute the maximum likelihood estimates for the GPD, we follow the procedure outlined by [34]. Once the parameters are obtained, the threshold can be computed as:

$$z_q \simeq t + \frac{\hat{\sigma}}{\hat{\gamma}}\left(\left(\frac{q\,n}{N_t}\right)^{-\hat{\gamma}} - 1\right), \tag{4}$$

where $\hat{\gamma}$ and $\hat{\sigma}$ are the estimated parameters of the GPD, $q$ is some desired probability, $n$ is the total number of observations, and $N_t$ is the number of peaks, i.e., the number of $X_i$ s.t. $X_i > t$. The threshold $z_q$ is chosen such that $P(X > z_q) < q$; all the observations are compared against it, and those data instances with $X_i > z_q$ can be considered as plausible anomalies. The authors in [23] recommend choosing a small value for $q$ and setting $t$ as the 98% quantile, which we follow in our study. More details of this algorithm can be found in [23].
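The peaks-over-threshold computation can be sketched as follows. To keep the example short and dependency-free, the GPD parameters are estimated with the method of moments, which is a simpler stand-in for the maximum likelihood procedure used in this study and is only reasonable when the shape parameter is below 0.5. The function name `pot_threshold`, the default `q`, and the synthetic exponential data are assumptions for illustration.

```python
import math
import random

def pot_threshold(values, q=1e-3, init_quantile=0.98):
    """Estimate the threshold z_q with the peaks-over-threshold formula.

    GPD parameters come from the method of moments (a stand-in for MLE).
    Assumes at least two distinct values exceed the initial threshold.
    """
    ordered = sorted(values)
    n = len(ordered)
    t = ordered[int(init_quantile * (n - 1))]      # initial threshold
    excesses = [v - t for v in ordered if v > t]   # peaks over t
    n_t = len(excesses)
    mean = sum(excesses) / n_t
    var = sum((x - mean) ** 2 for x in excesses) / (n_t - 1)
    ratio = mean ** 2 / var
    gamma = 0.5 * (1.0 - ratio)         # shape (extreme value index)
    sigma = 0.5 * mean * (1.0 + ratio)  # scale
    r = q * n / n_t
    if abs(gamma) < 1e-9:
        # Limit of the threshold formula as gamma -> 0 (exponential tail).
        return t - sigma * math.log(r)
    return t + (sigma / gamma) * (r ** (-gamma) - 1.0)

rng = random.Random(0)
values = [rng.expovariate(1.0) for _ in range(5000)]
z_q = pot_threshold(values)
anomalies = [v for v in values if v > z_q]
```

Since `q * n / N_t` is below one in this setting, the computed `z_q` always lies above the initial threshold `t`, so only the most extreme observations are flagged.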


IV End-to-End Deep Anomaly Detection

In Section I-A, we highlighted the need for developing end-to-end deep learning-based anomaly detection models, especially for temporal data. An end-to-end deep anomaly detection technique involves modifying the objective function of a deep learning model such as an LSTM or a CNN. Modifications are introduced so that the models that were formerly learning patterns for forecasting will now learn to detect deviations from the normal behavior. Instead of first predicting using a neural network and then feeding the predictions to a separate post-processing technique, the outputs of an end-to-end deep anomaly detection model can be directly interpreted as anomaly scores. In [19], the authors combine a CNN with an SVDD (Support Vector Data Description) objective. The SVDD is a technique similar to the OC-SVM, where a hyper-sphere is used to separate the data instead of a hyper-plane.

Let $\phi(\cdot\,; \mathcal{W})$ be a neural network with $L$ layers and a set of weights $\mathcal{W} = \{W^1, \dots, W^L\}$. This network maps data from an input space $\mathcal{X}$ to an output space $\mathcal{F}$. That is, $\phi(x; \mathcal{W})$ is the network representation of $x \in \mathcal{X}$ given by the network $\phi$ with parameters $\mathcal{W}$. The One-Class Deep SVDD objective given in [19], for a CNN model with input $\{x_1, \dots, x_n\}$, is as follows:

$$\min_{\mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; \mathcal{W}) - c \right\|^2 + \frac{\lambda}{2}\sum_{\ell=1}^{L} \left\| W^{\ell} \right\|_F^2 \tag{5}$$

The first term in the quadratic loss objective function penalizes the distance between every network representation $\phi(x_i; \mathcal{W})$ and the center of the hyper-sphere $c$. The second term penalizes the network weights by employing a network weight decay regularizer with hyper-parameter $\lambda > 0$, where $\|\cdot\|_F$ denotes the Frobenius norm. In [19], $c$ was fixed as the mean of the network predictions that results from performing an initial forward pass on the training data samples. The experiments were conducted for MNIST and CIFAR-10 image data sets.
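To make the two terms of this objective concrete, the sketch below evaluates them on a toy set of network representations. The representations, the single 2x2 weight matrix, and the value of the weight decay hyper-parameter are hypothetical; no network is trained here.

```python
def deep_svdd_loss(representations, center, weights, lam=1e-3):
    """Mean squared distance to the center plus weight decay.

    representations: list of output vectors, one per input example
    weights: list of layers, each a matrix given as a list of rows
    """
    n = len(representations)
    dist = sum(
        sum((r - c) ** 2 for r, c in zip(rep, center))
        for rep in representations
    ) / n
    # Squared Frobenius norm = sum of squared entries of each matrix.
    decay = sum(w ** 2 for layer in weights for row in layer for w in row)
    return dist + 0.5 * lam * decay

reps = [[1.0, 2.0], [3.0, 4.0]]
# Center fixed as the mean representation from an initial forward pass.
center = [sum(col) / len(reps) for col in zip(*reps)]
weights = [[[1.0, 0.0], [0.0, 1.0]]]  # one hypothetical 2x2 layer
loss = deep_svdd_loss(reps, center, weights, lam=0.1)
```

The first term shrinks as the representations collapse toward the fixed center, which is the behavior that the discussion of temporal data below is concerned with.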

In order to develop a similar model for time-sequences, we implement the aforementioned objective function in an LSTM model. Interestingly, we find that while this quadratic loss objective function works satisfactorily for anomaly detection in images, it does not fare well for temporal data. When adopted in the LSTM network, we notice that Eqn. 5 minimizes the distance between the predictions and their initial mean by reducing the magnitude of the predictions, resulting in a large fraction of false positives. This behavior suggests that an objective function that directly minimizes the network predictions might not be a sensible choice for anomaly detection in temporal data. We recall that the success of hybrid deep learning-based anomaly detection algorithms was mainly attributed to an efficient threshold based on the prediction errors. Therefore, it is natural to explore an objective function that minimizes the prediction errors and not the actual predictions.

Further, in our recent work [24], after comparing different detection strategies for hybrid deep anomaly detection, we noticed the potential of a strategy based on extreme values. We found that an EVT-based detection rule performed better than other popular detection techniques. The superior performance of an EVT-based strategy in a deep learning setting encouraged us to integrate EVT into the objective function of the LSTM model, leading to an end-to-end deep anomaly detection model.

IV-A EVT-LSTM Model

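In our study, the inputs $\{x_1, \dots, x_n\}$ in $\mathcal{X}$ are mapped to the set of predictions $\{\hat{y}_1, \dots, \hat{y}_n\}$ in $\mathcal{F}$. Our EVT-LSTM model is based on the objective function given as follows:

$$\min_{\mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \left\| e_i - z_q \right\|^2 + \frac{\lambda}{2}\sum_{\ell=1}^{L} \left\| W^{\ell} \right\|_F^2 \tag{6}$$

Here, $e_i = |y_i - \hat{y}_i|$ is the absolute prediction error for the $i$-th example, and the network weights $\mathcal{W}$ and the weight decay hyper-parameter $\lambda$ are as in Eqn. 5. Instead of minimizing the distance between the network representations and the mean obtained after an initial forward pass as in Eqn. 5, we minimize the Euclidean distance between every absolute prediction error $e_i$ and a threshold $z_q$. The threshold $z_q$ is obtained from Eqn. 4, and is updated periodically during the training phase. This form of optimization is called an alternating minimization approach and has been used with similar objective functions in related literature [19, 21]. The objective functions in this related literature minimized a function of the predictions obtained from image data sets. On the other hand, our objective function (Eqn. 6) optimizes a function of the prediction errors. Our proposed algorithm is given in Algorithm 1.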

Input: Set of examples $\{x_1, \dots, x_n\}$
Output: Set of decision scores $\{s_1, \dots, s_n\}$
Initialization: Threshold $z_q = 0$
while convergence criteria unmet do
       Update weights $\mathcal{W}$ of the network using Eqn. 6
       for once in every 20 epochs do
             Calculate prediction errors, $e_i$
             Fit a GPD to excesses $e_i - t$ by using MLE and find $\hat{\gamma}$, $\hat{\sigma}$
             Update $z_q$ using Eqn. 4
       end for
end while
Compute decision score $s_i = e_i - z_q$ for each $x_i$
if $s_i > 0$ then
       $x_i$ is anomalous
else
       $x_i$ is non-anomalous
end if
Algorithm 1 The training process of the proposed EVT-LSTM model. The threshold $z_q$ is updated every 20 epochs.

The threshold $z_q$ is initialized to zero at the beginning of the experiment. During the training phase, the LSTM model tries to optimize the objective function given in Eqn. 6. The prediction errors on the training set are calculated every 20 epochs. The 98% empirical quantile of the errors is chosen to set an initial threshold $t$ in InitThreshold(). The excesses occurring above $t$ are fit to a GPD using MLE, and the parameters $\hat{\gamma}$ and $\hat{\sigma}$ are estimated. Then, using Eqn. 4, we calculate the new value for the threshold $z_q$. The objective function (Eqn. 6) is updated with this recent value of threshold obtained. The next 20 epochs use the modified objective function to train the model, after which the threshold $z_q$ is again calculated and updated. The training stops when either the convergence is achieved, or the maximum number of epochs is reached. Finally, on a test set, the decision scores are calculated and used for classifying instances as anomalous or non-anomalous.
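The alternating schedule described above can be sketched in a few lines. This is an illustration under strong simplifying assumptions: a one-parameter-pair linear predictor stands in for the LSTM, the weight decay term is omitted, plain full-batch gradient descent replaces the optimizer, and the GPD is fit by the method of moments rather than MLE. All names, data, and hyper-parameter values are hypothetical.

```python
import math
import random

def pot_threshold(errors, q=1e-3):
    """Peaks-over-threshold estimate of z_q (method-of-moments GPD fit)."""
    ordered = sorted(errors)
    n = len(ordered)
    t = ordered[int(0.98 * (n - 1))]            # 98% empirical quantile
    exc = [e - t for e in ordered if e > t]
    m = sum(exc) / len(exc)
    v = sum((x - m) ** 2 for x in exc) / (len(exc) - 1)
    gamma = 0.5 * (1.0 - m * m / v)
    sigma = 0.5 * m * (1.0 + m * m / v)
    r = q * n / len(exc)
    if abs(gamma) < 1e-9:
        return t - sigma * math.log(r)
    return t + sigma / gamma * (r ** (-gamma) - 1.0)

def train(xs, ys, epochs=60, update_every=20, lr=0.1):
    """Alternate gradient steps on mean((|e_i| - z_q)^2) with threshold updates."""
    w, b, z_q = 0.0, 0.0, 0.0       # threshold initialized to zero
    n = len(xs)
    for epoch in range(1, epochs + 1):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            e = y - (w * x + b)
            # d/d(prediction) of (|e| - z_q)^2 is -2 * (|e| - z_q) * sign(e)
            g = -2.0 * (abs(e) - z_q) * math.copysign(1.0, e)
            grad_w += g * x / n
            grad_b += g / n
        w -= lr * grad_w
        b -= lr * grad_b
        if epoch % update_every == 0:
            # Recompute errors on the training set and refresh the threshold.
            errors = [abs(y - (w * x + b)) for x, y in zip(xs, ys)]
            z_q = pot_threshold(errors)
    return w, b, z_q

rng = random.Random(0)
xs = [rng.random() for _ in range(1000)]
ys = [0.5 * x + 0.2 + rng.gauss(0.0, 0.05) for x in xs]
w, b, z_q = train(xs, ys)
```

With the threshold at zero, the first 20 epochs reduce to ordinary squared-error training; after each update, the weight steps are taken against the refreshed threshold, which is the alternating structure of Algorithm 1.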

V Experimental Settings

In this section, we discuss the data sets considered, evaluation metrics used, and the procedure for choosing parameters for each anomaly detection model.

V-A Description of Data Sets

We consider seven real-world data sets in our comparison study: three road traffic-based data sets, two taxi demand data sets, and two data sets from other application domains. The travel time, vehicle occupancy, and traffic speed data sets considered are real-time data, obtained from a traffic detector and collected by the Minnesota Department of Transportation. Discussions on these traffic data sets are available at the Numenta Anomaly Benchmark GitHub repository (https://github.com/numenta/NAB/tree/master/data). The NYC (New York City) taxi demand data set is publicly available at [35] and contains the trip details of government-run street hailing taxis. The Bengaluru taxi demand data set is acquired from a leading private Indian transportation company dealing with app-based taxi rental services. The ECG (electrocardiogram) data is obtained from [36] and has annotations from a cardiologist to indicate the unusual heartbeat patterns. Bitcoin historic prices are obtained from the coindeskr R package (https://cran.r-project.org/package=coindeskr).

Brief descriptions of the data sets used are given below.

  1. Vehicular Travel Time: The data set is obtained from a traffic sensor and has 2500 readings from July 10, 2015, to September 17, 2015, with eight marked anomalies.

  2. Vehicular Speed: The data set contains the average speed of all vehicles passing through the traffic detector. A total of 1128 readings for the period September 8, 2015 - September 17, 2015, is available. There are three marked unusual sub-sequences in the data set.

  3. Vehicle Occupancy: There are a total of 2382 readings indicating the percentage of the time, during a 30-second period, that the detector sensed a vehicle. The data is available for a period of 17 days, from September 1, 2015, to September 17, 2015, and has two marked anomalies.

  4. NYC (New York City) Taxi Demand [35]: The publicly available NYC data set contains the pick-up locations and time stamps of street hailing yellow taxi services from the period of January 1, 2016, to February 29, 2016. We pick three time-sequences (S1, S2, and S3) with clearly apparent anomalies from data aggregated over 15 minute time periods in 1 grids.

  5. Bengaluru Taxi Demand: This data set has GPS traces of passengers booking a taxi by logging into the service provider’s mobile application. Similar to the NYC data set, this data is also available for January and February 2016. We aggregate the data over 15 minute periods in 1 grids and pick three sequences with clearly visible anomalies.

  6. ECG (Electrocardiogram) [36]: There are a total of 18000 readings, with three unusual sub-sequences labeled as anomalies. The data set has a repeating pattern, with some variability in the period length.

  7. Bitcoin Prices: Historical bitcoin prices are available for the period from January 1, 2017, to May 27, 2019. The fraction of anomalies in this data set of 877 readings is observed to be 0.06%, most of them occurring around the beginning of the year 2018.

V-B Evaluation Metrics

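We consider three evaluation metrics for comparing our models: (i) Precision, $P$, (ii) Recall, $R$, and (iii) F1-score, $F1$, which is the harmonic mean of Precision and Recall. Min-max normalization is performed on every data set before modeling and evaluation.

  1. Precision, $P$: $P = \frac{TP}{TP + FP}$

  2. Recall, $R$: $R = \frac{TP}{TP + FN}$

  3. F1-score, $F1$: $F1 = \frac{2 \, P \, R}{P + R}$

Here, $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.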


True positives are the anomalous instances that have been correctly classified as anomalies by the model. Similarly, true negatives are the instances correctly identified as non-anomalous data. False positives are the non-anomalies incorrectly classified as anomalous, and false negatives are the anomalies incorrectly classified as non-anomalous. Since F1-score summarizes both Precision and Recall, we consider the model with the highest F1-score as the superior anomaly detection technique.
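The three metrics follow directly from the counts defined above. The sketch below computes them from boolean labels; the function name and the toy labels are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred):
    """y_true / y_pred: sequences of booleans, True = anomalous."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [True, True, False, False, True]
y_pred = [True, False, True, False, True]
p, r, f1 = precision_recall_f1(y_true, y_pred)
```

True negatives do not enter any of the three formulas, which is why these metrics remain informative when anomalies are rare.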

V-C Parameter Selection

In order to perform efficient anomaly detection, it is necessary to set appropriate hyper-parameters and anomaly thresholds for each model. The suitable set of parameters and thresholds vary with the use case considered. Below, we briefly discuss the procedures through which the parameters are shortlisted for each anomaly detection model.

| Data Sets | Model | Threshold |
|---|---|---|
| Vehicular Travel Time | ARIMA(1, 0, 3)-GARCH(1, 1) | |
| Vehicular Speed | ARIMA(0, 1, 4)-GARCH(1, 1) | |
| Vehicle Occupancy | ARIMA(0, 1, 1)-GARCH(1, 1) | |
| NYC Taxi (S1) | ARIMA(0, 1, 3)-GARCH(1, 1) | |
| NYC Taxi (S2) | ARIMA(3, 0, 4)-GARCH(1, 1) | |
| NYC Taxi (S3) | ARIMA(2, 1, 2)-GARCH(2, 2) | |
| Bengaluru Taxi (S1) | ARIMA(1, 0, 3)-GARCH(1, 2) | |
| Bengaluru Taxi (S2) | ARIMA(3, 1, 3)-GARCH(1, 1) | |
| Bengaluru Taxi (S3) | ARIMA(1, 0, 1)-GARCH(1, 2) | |
| Electrocardiogram | ARIMA(4, 1, 2)-GARCH(1, 1) | |
| Bitcoin Prices | ARIMA(2, 1, 2)-GARCH(1, 1) | |

Table I: Appropriate ARIMA($p$, $d$, $q$)-GARCH($r$, $s$) models obtained for each data set, by varying $p$, $q$ in the range [0, 5], $d$ in [0, 1], and $r$, $s$ in [1, 2]. The anomaly thresholds are obtained from a hold-out validation set, so that as few false positives as possible are incurred.

V-C1 GARCH Model

For every data set, time-sequences are generated based on the training data. For Bengaluru and NYC taxi demand data sets, the temporal aggregation is performed at sampling periods of 15 minutes. Then, by varying the $p$, $d$, and $q$ parameters of an ARIMA($p$, $d$, $q$) process between [1, 5], appropriate models are chosen for every time-sequence. The residuals obtained from fitting the ARIMA processes are then modeled as suitable GARCH($r$, $s$) processes. We find that suitable values for parameters $r$ and $s$ often lie in the range [1, 2]. Once appropriate models are developed, anomaly scores are obtained based on the deviation of the GARCH predictions from the actual values. An anomaly threshold is set based on the validation set and examined on a test set. The parameters of the fitted ARIMA-GARCH models, along with the anomaly thresholds, are given in Table I.

V-C2 OC-SVM Model

Appropriate kernel functions are crucial for satisfactory anomaly detection performance of SVMs, and the choices vary with the data sets considered. In our study, we consider Linear, RBF, Polynomial, and Sigmoid kernels. Another important parameter is the kernel coefficient for the RBF, Polynomial, and Sigmoid kernels. After varying this coefficient in the range [0.0001, 0.1], a value of 0.0001 is found to suit most of the data sets considered. For every use case, multiple SVM models are run on the training data, with different parameters chosen from the range of values considered. Then, suitable choices are made by observing the classification accuracy on a hold-out validation set. Finally, the best OC-SVM model obtained is used to detect anomalies on a test set. The shortlisted OC-SVM models are given in Table II.

Data Sets Kernel Setting
Vehicular Travel Time RBF(0.0001)
Vehicular Speed Poly(0.0001)
Vehicle Occupancy RBF(0.0001)
NYC Taxi
S1 RBF(0.0001)
Bengaluru Taxi
S1 RBF(0.0001)
Electrocardiogram Linear
Bitcoin Prices Sigmoid(0.0001)
Table II: The shortlisted OC-SVM models for the data sets considered. We consider Linear, Sigmoid, Polynomial, and RBF kernels, and vary the kernel coefficient between [0.0001, 0.1].
Data Sets LSTM Architecture
Vehicular Travel Time
1 Recurrent layer: {20}, Dropout: 0.2,
1 Dense layer: {1}, Learning rate: 0.01
Vehicular Speed
1 Recurrent layer: {60}, Dropout: 0.19,
1 Dense layer: {1}, Learning rate: 0.0001
Vehicle Occupancy
1 Recurrent layer: {50}, Dropout: 0.23,
1 Dense layer: {1}, Learning rate: 0.0001
NYC Taxi Demand
2 Recurrent layers: {50, 20}, Dropout: 0.4,
1 Dense layer:{24}, Learning rate: 0.0001
Bengaluru Taxi Demand
2 Recurrent layers: {20, 10}, Dropout: 0.25,
1 Dense layer:{24}, Learning rate: 0.0001
Electrocardiogram
2 Recurrent layers: {60, 30}, Dropout: 0.1,
1 Dense layer: {5}, Learning rate: 0.05
Bitcoin Prices
1 Recurrent layer: {10}, Dropout: 0.1,
1 Dense layer: {1}, Learning rate: 0.0001
Table III: The LSTM architectures for the data sets considered. The optimal set of hyper-parameters for each data set is chosen after running the TPE (Tree-structured Parzen Estimator) Bayesian Optimization algorithm.

V-C3 Hybrid LSTM Models

For a neural network model, hyper-parameters define the high-level features of the model, such as its complexity or its capacity to learn. The important hyper-parameters include the number of hidden recurrent layers, dropout values, learning rate, and the number of units in each layer. We use the TPE (Tree-structured Parzen Estimator) Bayesian Optimization [37] to select these hyper-parameters. The output layer is a fully connected dense layer with linear activation. The Adam optimizer [38] is used to minimize the Mean Squared Error objective function. All LSTM-based models are trained for 100 epochs with a batch size of 64.

The chosen set of parameters for each data set is given in Table III. We follow the same model settings as [39] for the ECG data set. For the traffic speed, travel time, vehicle occupancy, and bitcoin prices data sets, the limited availability of readings suggested look-back and look-ahead times of 1 each. We have over 10 million points for the New York and Bengaluru cities, allowing for a large look-back time. The considerable amount of data in these two cases allows the LSTM to learn better representations of the input data, aiding the anomaly detection process.
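The role of the look-back and look-ahead times can be made concrete with a small sketch that slices a sequence into input and target windows. The function name `make_windows` and the example speed readings are hypothetical; the sketch only shows how the two settings shape the training pairs.

```python
def make_windows(series, look_back, look_ahead):
    """Split a sequence into (input, target) pairs for supervised training."""
    if look_back < 1 or look_ahead < 1:
        raise ValueError("look_back and look_ahead must be positive")
    pairs = []
    last_start = len(series) - look_back - look_ahead
    for start in range(last_start + 1):
        split = start + look_back
        # Inputs are the look_back past values, targets the next look_ahead.
        pairs.append((series[start:split], series[split:split + look_ahead]))
    return pairs

# Hypothetical speed readings.
speeds = [52, 50, 47, 45, 49]
short = make_windows(speeds, look_back=1, look_ahead=1)
longer = make_windows(speeds, look_back=3, look_ahead=2)
print(len(short), len(longer))
```

With only five readings, a look-back of 3 and a look-ahead of 2 leave a single training pair, which illustrates why short data sets favor look-back and look-ahead times of 1.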

The false positive regulators are the parameters that impact the performance of the detection algorithms. The false positive regulator for the Gaussian-based detection rule is chosen for each time-sequence such that the F1-score on the validation errors is maximized. The thresholds for Tukey’s method are obtained directly from the entire set of prediction errors, based on a simple quantile calculation. For both the hybrid and the end-to-end EVT-LSTM deep learning models, we follow similar procedures to set the parameters of the EVT rule. As mentioned earlier, an initial threshold has to be chosen for the EVT-based detection, typically the 98% quantile. The false positive regulator for the EVT-based anomaly detection, a probability, is set from an initialization data stream. We set the initial threshold using the same initialization stream that is used for setting this probability. The initialization stream contains the prediction errors from the training and validation sets. The probability is chosen so that the EVT-based anomaly detection picks up all the anomalies from the initialization stream. The chosen values for the false positive regulators of the hybrid LSTM-based techniques are given in Table IV.
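A rough sketch of an EVT-based threshold, in the spirit of the peaks-over-threshold approach of [23], is given below. It takes the initial threshold at the 98% quantile of the prediction errors, fits a GPD to the excesses, and extrapolates to a target probability `q`. Two points are assumptions made for brevity: the GPD parameters are estimated with the method of moments instead of the maximum likelihood procedure of [34], and the values of `q` and the synthetic exponential errors are purely illustrative.

```python
import math
import random
import statistics

def empirical_quantile(values, level):
    """Order statistic at the given level (no interpolation)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]

def fit_gpd_moments(excesses):
    """Method-of-moments GPD estimates (a stand-in for maximum likelihood)."""
    mean = statistics.fmean(excesses)
    var = statistics.variance(excesses)
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)
    return shape, scale

def evt_threshold(errors, q, initial_level=0.98):
    """Threshold exceeded with probability roughly q under the fitted tail."""
    t = empirical_quantile(errors, initial_level)
    excesses = [e - t for e in errors if e > t]
    if len(excesses) < 2:
        raise ValueError("not enough excesses to fit the tail")
    shape, scale = fit_gpd_moments(excesses)
    r = q * len(errors) / len(excesses)
    if abs(shape) < 1e-9:
        # Limiting case of the GPD: an exponential tail.
        return t - scale * math.log(r)
    return t + (scale / shape) * (r ** (-shape) - 1.0)

random.seed(0)
errors = [random.expovariate(1.0) for _ in range(5000)]  # synthetic errors
z = evt_threshold(errors, q=1e-4)  # q is a hypothetical choice
print(z > empirical_quantile(errors, 0.98))
```

Lowering `q` pushes the threshold further into the tail, which is how this probability regulates false positives.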

Data Sets Hybrid LSTM Models
Gaussian () Tukey () EVT ()
Vehicular Travel Time -20 572.9
Vehicular Speed -18 24.4
Vehicle Occupancy -23 12.9
NYC Taxi
S1 -19 12.1
S2 -17 12.8
S3 -15 10.5
Bengaluru Taxi
S1 -25 33.5
S2 -18 27.1
S3 -25 14.0
Electrocardiogram -23 0.1
Bitcoin Prices -17 12961.8
Table IV: The chosen false positive regulator values for the LSTM-based hybrid anomaly detection models. While the thresholds for both the Gaussian and Tukey’s method based models vary significantly with each data set considered, the probability values for EVT-based detection are found to remain within [, ].


The hyper-parameters and false positive regulators chosen for hybrid LSTM models are used for the EVT-LSTM model as well. We follow the guidelines in [19] while setting the hyper-parameter for the network weight regularizer. The threshold is updated every 20 epochs. The values chosen for hybrid deep learning models seem to suit end-to-end deep learning models, for most of the scenarios considered. An exception was the Bengaluru Taxi Demand data set, where a different value turned out to be suitable. Nevertheless, the best choices for the probability remained in [, ].

Data Sets P-values
Vehicular Travel Time 0.005
Vehicular Speed 0.005
Vehicle Occupancy 0.370
NYC Taxi
S1 0.805
S2 0.056
S3 0.147
Bengaluru Taxi
S1 0.570
S2 0.180
S3 0.006
Electrocardiogram 0.002
Bitcoin Prices 0.051
Table V: P-values obtained from the A-D statistical test. The null hypothesis is rejected when the p-value lies below 0.001. In all the data sets considered, the null hypothesis that the tails of the prediction errors follow a GPD is not rejected.

Data Sets Anomaly Detection Models
Vehicular Travel Time 0.01 0.04 0.07 0.21 0.36 0.36
Vehicular Speed 0.18 0.56 0.79 0.74 0.79 0.79
Vehicle Occupancy 1.0 0.33 0.5 1.0 1.0 1.0
NYC Taxi
S1 0.002 0.03 0.25 1.0 1.0 1.0
S2 0.005 0.16 0.14 0.33 1.0 1.0
S3 0.007 0.6 0.66 0.86 0.86 0.86
Bengaluru Taxi
S1 0.03 0.29 0.47 0.57 1.0 1.0
S2 0.002 0.12 0.08 0.5 0.5 0.66
S3 0.04 0.44 0.26 0.54 0.62 0.72
Electrocardiogram 0.1 0.22 0.49 0.32 0.37 0.28
Bitcoin Prices 0.52 0.31 0.19 0.83 0.83 0.84
Table VI: The anomaly detection performance of various models considered in the study, across diverse data sets, based on F1-score. The proposed end-to-end EVT-LSTM deep anomaly detection model is observed to perform better compared to the statistical, machine learning and hybrid deep learning techniques considered.

VI Results

In this section, we analyze whether the tails of the prediction error distribution follow a GPD, and present results from the numerical tests performed.

VI-A Statistical Tests

We conduct a statistical test known as the A-D (Anderson-Darling) test [40] for checking the compliance of the tail distribution to a GPD. The A-D test can be used to assess whether a sample of the data comes from a specific probability distribution. This test makes use of the specific distribution while calculating the critical values. The test statistic measures the distance between the hypothesized distribution and the empirical CDF of the data. Based on the test statistic and the p-values obtained, the null hypothesis that the data follow a specified distribution is either rejected or not rejected. The A-D test is a modification of the K-S (Kolmogorov-Smirnov) test [41] and gives more weight to the tails than the K-S test does. The A-D test is conducted on the excesses, i.e., the prediction errors lying above the empirical threshold. The p-values obtained from this statistical test are given in Table V. We reject the null hypothesis for a data set if the corresponding p-value lies below 0.001. For all the data sets under study, the A-D test does not reject the null hypothesis, which is consistent with the tail distributions of the prediction errors following a GPD.
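As an illustration of how the A-D statistic weighs a sample against a hypothesized GPD, the sketch below evaluates the standard A-D formula with a GPD CDF. The sample and the parameter values are synthetic and hypothetical, and only the statistic is computed; turning it into a p-value requires critical values for the GPD, which are not reproduced here.

```python
import math

def gpd_cdf(x, shape, scale):
    """CDF of the generalized Pareto distribution for x >= 0."""
    if abs(shape) < 1e-9:
        return 1.0 - math.exp(-x / scale)
    base = 1.0 + shape * x / scale
    if base <= 0.0:
        # Beyond the finite upper end point that exists when shape < 0.
        return 1.0
    return 1.0 - base ** (-1.0 / shape)

def anderson_darling(sample, cdf):
    """A-D statistic: A2 = -n - (1/n) * sum((2i - 1) * (ln u_i + ln(1 - u_(n+1-i))))."""
    xs = sorted(sample)
    n = len(xs)
    tiny = 1e-12
    # Clip to avoid log(0) at the extremes.
    u = [min(max(cdf(x), tiny), 1.0 - tiny) for x in xs]
    total = 0.0
    for i in range(n):
        total += (2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
    return -n - total / n

# Synthetic excesses placed at the quantiles of a unit-scale exponential tail,
# which is the GPD with shape 0 and scale 1.
n = 200
sample = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
a2_good = anderson_darling(sample, lambda x: gpd_cdf(x, 0.0, 1.0))
a2_bad = anderson_darling(sample, lambda x: gpd_cdf(x, 0.0, 3.0))
print(a2_good < a2_bad)
```

A small statistic indicates that the hypothesized GPD is close to the empirical distribution of the excesses, while a badly specified scale produces a much larger value.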

VI-B Numerical Results

The anomaly detection performance based on the F1-score metric, of various models across different data sets, is provided in Table VI. Based on the results from the table, we can draw the following inferences:

  • The poor performance of the parametric GARCH models suggests that assuming a particular distribution on the prediction errors can critically affect anomaly detection accuracy.

  • Deep learning-based anomaly detection algorithms exhibit superior detection accuracy over statistical and machine learning-based algorithms across seven diverse data sets.

  • Out of the two classes of deep learning-based anomaly detection models considered, an end-to-end detection algorithm outperforms hybrid detection models on a broad variety of data sets.

When the parametric GARCH model is employed for anomaly detection, we observe that the model has a sufficiently high Recall, but very low Precision. The threshold chosen based on the validation set classifies a large number of non-anomalies as anomalous on the test set. Thus, the overall anomaly detection performance is affected by the presence of several false positives, resulting in a low F1-score. Exceptions to this behavior are observed with the vehicle occupancy data set and, to an extent, with the bitcoin prices data. The magnitude of the anomalies is much higher than that of the non-anomalies in these data sets, which appears to be the reason behind this exception.
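The effect of false positives on the F1-score can be seen in a short worked example. The label vectors below are hypothetical: a detector that finds both true anomalies among 100 points but also flags 18 normal points reaches perfect Recall and still scores poorly on F1.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute Precision, Recall, and F1 for binary anomaly labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical test set: 2 anomalies, 98 normal points.
true_labels = [True] * 2 + [False] * 98
# The detector flags both anomalies and 18 normal points.
predicted_labels = [True] * 20 + [False] * 80
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
print(precision, recall, round(f1, 3))
```

Here Precision is 2/20 and Recall is 1, so the F1-score is 2/11, roughly 0.18.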

The OC-SVM model achieves a higher detection accuracy than the statistical GARCH model but does not fare well compared to the deep learning variants. It also exhibits high Recall and poor Precision values. On the other hand, a single value of the kernel coefficient (0.0001) proved to be a satisfactory fit for all the data sets considered.

On comparing hybrid and end-to-end deep anomaly detection models, we see that the proposed end-to-end EVT-LSTM model shows superior detection accuracy. The anomaly detection requires no post-processing tools, and the performance is at least as good as that of the hybrid models for the majority of data sets considered. This observation suggests that a deep learning model customized for anomaly detection can provide better accuracy than running traditional algorithms on a deep learning model developed for forecasting. The only exception is observed in the ECG data set, which can be attributed to the anomaly labeling scheme followed. The labeling scheme employed in this data set marks an entire period of the ECG signal as anomalous if any point in that period is an anomaly. In other words, we deal with collective anomalies in this data set. The fraction of anomalies is, hence, higher in the ECG data set than in the other data sets, which have point anomalies. Thus, the anomalies cover a broad spectrum above the upper quartile of prediction errors for ECG data. Since Tukey’s method thresholds the raw prediction errors based on the upper quartile, it results in good anomaly detection for the ECG data set. This finding suggests that a simple threshold based on the magnitude of prediction errors might be sufficient when the fraction of anomalies in the data set is relatively high. Generally, Tukey’s method can detect most of the anomalies but results in a large number of false positives, similar to the GARCH and OC-SVM models. This behavior is not desirable in an anomaly detection setting.
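The collective labeling scheme described for the ECG data can be sketched as follows. The function name `label_periods`, the period length, and the point labels are hypothetical; the sketch only shows how marking whole periods inflates the fraction of anomalous points.

```python
def label_periods(point_labels, period):
    """Mark every point of a period anomalous if any point in it is."""
    collective = []
    for start in range(0, len(point_labels), period):
        chunk = point_labels[start:start + period]
        flag = any(chunk)
        # Every point in the period inherits the period-level label.
        collective.extend([flag] * len(chunk))
    return collective

# Hypothetical signal: 3 periods of 4 points, with 2 anomalous points.
points = [False, False, True, False,
          False, False, False, False,
          False, True, False, False]
periods = label_periods(points, period=4)
print(sum(points) / len(points), sum(periods) / len(periods))
```

Two anomalous points out of twelve become eight once entire periods are marked, which is the kind of increase in anomaly fraction that favors a simple magnitude-based threshold.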

An important observation concerns the variability in the false positive regulator values of the various methods. Recalling the results from Table IV, we find high variability in the false positive regulator values of the Gaussian and Tukey detection rules. The choices for these thresholds vary significantly with the data set considered. While the Gaussian regulator varied between -25 and -15, the Tukey threshold was found to take values between 0.11 and 12961.8. The strong dependence of the anomaly thresholds on the time-sequence considered limits the applicability of such detection rules. On the other hand, the only free parameter for EVT-based detection, the probability, does not appear to have a significant dependence on the data set. This false positive regulator was found to stay within the range [, ]. A false positive parameter with low dependency on the data sets is highly preferred in real-world settings, thereby strengthening the case for a detection algorithm based on EVT.
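The scale dependence of a quantile-based threshold can be illustrated with a short sketch. It uses the conventional Tukey upper fence, Q3 + k(Q3 - Q1) with k = 1.5, which is an assumption here since the exact quantile rule used for Table IV is not restated in this section; the two error samples are hypothetical.

```python
import statistics

def tukey_upper_fence(errors, k=1.5):
    """Conventional Tukey upper fence: Q3 + k * (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return q3 + k * (q3 - q1)

# Hypothetical prediction errors on a small-valued signal.
small_scale = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.30]
# The same errors expressed on a scale 40000 times larger.
large_scale = [e * 40000 for e in small_scale]

fence_small = tukey_upper_fence(small_scale)
fence_large = tukey_upper_fence(large_scale)
print(fence_small, fence_large)
```

The fence is expressed in the units of the prediction errors, so it scales with the magnitude of the data, whereas a tail probability is unit-free.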

In summary, considering data sets from various verticals of ITS, we found that an end-to-end deep learning-based anomaly detection algorithm holds great potential in detecting abnormal traffic instances. Our proposed EVT-LSTM model accurately detected anomalous traffic speed, vehicle occupancy, travel time, and taxi demand instances, in addition to data sets from medical and financial domains.

VII Concluding Remarks

Detection of anomalies is a crucial part of ITS (Intelligent Transportation Systems), as it can provide useful recommendations to urban planners and taxi aggregators, among others. In this study, we developed an end-to-end deep learning-based anomaly detection model for temporal data in transportation networks.

The proposed EVT-LSTM model incorporates concepts from EVT (Extreme Value Theory) into the objective function of an LSTM (Long Short-Term Memory) deep learning model. The output network representations from our proposed model can be directly utilized for anomaly detection, a clear advantage over the currently popular hybrid deep learning-based detection models that require separate post-processing tools.

Our proposed model was compared against traditional statistical, machine learning, and deep learning-based anomaly detection models. When evaluated across seven diverse data sets, the EVT-LSTM model exhibited superior anomaly detection performance against these established baseline models. The proposed model was able to detect true positives faithfully while incurring as few false positives as possible. We found strong evidence to suggest that a deep learning model customized for anomaly detection can provide better detection accuracy than the hybrid deep anomaly detection techniques.

There are numerous avenues that merit future attention. To validate the performance of the proposed algorithm further, new data sets can be introduced. While our algorithm employs an objective function based on EVT, it would be useful to explore other objective functions, to enhance the detection accuracy.


  • [1] M. Lippi, M. Bertini, and P. Frasconi, “Collective traffic forecasting,” in Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2010, pp. 259–273.
  • [2] Y. Zheng, Y. Liu, J. Yuan, and X. Xie, “Urban computing with taxicabs,” in Proceedings of the International Conference on Ubiquitous Computing.   ACM, 2011, pp. 89–98.
  • [3] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang, “T-drive: Driving directions based on taxi trajectories,” in Proceedings of the SIGSPATIAL International Conference on Advances in Geographic Information Systems.   ACM, 2010, pp. 99–108.
  • [4] H.-w. Chang, Y.-c. Tai, and J. Y.-j. Hsu, “Context-aware taxi demand hotspots prediction,” International Journal of Business Intelligence and Data Mining, vol. 5, no. 1, pp. 3–18, 2010.
  • [5] S. Phithakkitnukoon, M. Veloso, C. Bento, A. Biderman, and C. Ratti, “Taxi-aware map: Identifying and predicting vacant taxis in the city,” in Proceedings of the International Joint Conference on Ambient Intelligence.   Springer, 2010, pp. 86–95.
  • [6] N. Davis, G. Raina, and K. Jagannathan, “Taxi demand forecasting: A hedge-based tessellation strategy for improved accuracy,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 11, pp. 3686–3697, 2018.
  • [7] B. Li, D. Zhang, L. Sun, C. Chen, S. Li, G. Qi, and Q. Yang, “Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset,” in Proceedings of the International Conference on Pervasive Computing and Communications Workshops.   IEEE, 2011, pp. 63–68.
  • [8] C. Chen, D. Zhang, P. S. Castro, N. Li, L. Sun, S. Li, and Z. Wang, “iBOAT: Isolation-based online anomalous trajectory detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 806–818, 2013.
  • [9] Y. Wang, D. Zhang, Y. Liu, B. Dai, and L. H. Lee, “Enhancing transportation systems via deep learning: A survey,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 1, pp. 144–163, 2019.
  • [10] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407, 2019.
  • [11] X. Kong, X. Song, F. Xia, H. Guo, J. Wang, and A. Tolba, “LoTAD: Long-term traffic anomaly detection based on crowdsourced bus trajectory data,” World Wide Web, vol. 21, no. 3, pp. 825–847, 2018.
  • [12] A. Dairi, F. Harrou, Y. Sun, and M. Senouci, “Obstacle detection for intelligent transportation systems using deep stacked autoencoder and k-nearest neighbor scheme,” IEEE Sensors Journal, vol. 18, no. 12, pp. 5122–5132, 2018.
  • [13] I. Markou, F. Rodrigues, and F. C. Pereira, “Use of taxi-trip data in analysis of demand patterns for detection and explanation of anomalies,” Transportation Research Record, vol. 2643, no. 1, pp. 129–138, 2017.
  • [14] M. Wittmann, M. Kollek, and M. Lienkamp, “Event-driven anomalies in spatiotemporal taxi passenger demand,” in Proceedings of the International Conference on Intelligent Transportation Systems.   IEEE, 2018, pp. 979–984.
  • [15] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 15–73, 2009.
  • [16] V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
  • [17] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff, “LSTM-based encoder-decoder for multi-sensor anomaly detection,” arXiv preprint arXiv:1607.00148, 2016.
  • [18] P. Oza and V. M. Patel, “One-class convolutional neural network,” IEEE Signal Processing Letters, vol. 26, no. 2, pp. 277–281, 2018.
  • [19] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 4390–4399.
  • [20] I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” in Advances in Neural Information Processing Systems, 2018, pp. 9758–9769.
  • [21] R. Chalapathy, A. K. Menon, and S. Chawla, “Anomaly detection using one-class neural networks,” arXiv preprint arXiv:1802.06360, 2018.
  • [22] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” in Proceedings of the International Conference on Artificial Neural Networks.   IET, 1999, pp. 850–855.
  • [23] A. Siffer, P.-A. Fouque, A. Termier, and C. Largouet, “Anomaly detection in streams with extreme value theory,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining.   ACM, 2017, pp. 1067–1075.
  • [24] N. Davis, G. Raina, and K. Jagannathan, “LSTM-based anomaly detection: Detection rules from extreme value theory,” in Proceedings of the EPIA Conference on Artificial Intelligence.   Springer, 2019, pp. 572–583.
  • [25] A. J. Fox, “Outliers in time series,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 34, no. 3, pp. 350–363, 1972.
  • [26] E. Eskin, “Anomaly detection over noisy data using learned probability distributions,” in Proceedings of the International Conference on Machine Learning.   Morgan Kaufmann Publishers Inc., 2000, pp. 255–262.
  • [27] D. Chen, X. Shao, B. Hu, and Q. Su, “Simultaneous wavelength selection and outlier detection in multivariate regression of near-infrared spectra,” Analytical Sciences, vol. 21, no. 2, pp. 161–166, 2005.
  • [28] R. Engle, “GARCH 101: The use of ARCH/GARCH models in applied econometrics,” Journal of Economic Perspectives, vol. 15, no. 4, pp. 157–168, 2001.
  • [29] B. Schölkopf, A. J. Smola, F. Bach et al., Learning with kernels: support vector machines, regularization, optimization, and beyond.   MIT press, 2002.
  • [30] C. Wang, K. Viswanathan, L. Choudur, V. Talwar, W. Satterfield, and K. Schwan, “Statistical techniques for online anomaly detection in data centers,” in Proceedings of the International Symposium on Integrated Network Management and Workshops.   IEEE, 2011, pp. 385–392.
  • [31] T. Ergen, A. H. Mirza, and S. S. Kozat, “Unsupervised and semi-supervised anomaly detection with LSTM neural networks,” arXiv preprint arXiv:1710.09207, 2017.
  • [32] I. J. Myung, “Tutorial on maximum likelihood estimation,” Journal of Mathematical Psychology, vol. 47, no. 1, pp. 90–100, 2003.
  • [33] L. De Haan and A. Ferreira, Extreme value theory: An introduction.   Springer Science & Business Media, 2007.
  • [34] S. D. Grimshaw, “Computing maximum likelihood estimates for the generalized pareto distribution,” Technometrics, vol. 35, no. 2, pp. 185–191, 1993.
  • [35] N. Y. C. Taxi & Limousine Commission, “TLC trip record data,” 2016, https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, accessed 2019-10-01.
  • [36] E. Keogh, J. Lin, and A. Fu, “HOT SAX: Efficiently finding the most unusual time series subsequence,” in Proceedings of the International Conference on Data Mining.   IEEE, 2005, pp. 1–8.
  • [37] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
  • [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [39] A. Singh, “Anomaly detection for temporal data using long short-term memory,” Master’s thesis, KTH Royal Institute of Technology, 2017.
  • [40] M. A. Stephens, “EDF statistics for goodness of fit and some comparisons,” Journal of the American Statistical Association, vol. 69, no. 347, pp. 730–737, 1974.
  • [41] F. J. Massey Jr, “The Kolmogorov-Smirnov test for goodness of fit,” Journal of the American Statistical Association, vol. 46, no. 253, pp. 68–78, 1951.