I. Introduction
The increasing availability of large-scale traffic data sets provides an opportunity to explore them for knowledge discovery in ITS (Intelligent Transportation Systems). The avenues for exploration are numerous, ranging from uncovering traffic patterns [1], city dynamics [2], driving directions [3], discovering hot spots in a city [4], finding vacant taxis around a city [5], predicting taxi demand [6], and taxi operation patterns [7], to detecting anomalies [8], among others.
Various verticals of ITS have always received active research attention. However, the recent emergence of deep learning techniques and their applicability to transportation systems has resulted in heightened interest in this area [9]. Consequently, traditional machine learning models in many applications are now being replaced by deep learning techniques, which is reshaping the landscape of intelligent transport networks. Among the several applications of ITS, the area of anomaly detection has benefited significantly from the application of deep learning-based techniques [10]. Anomaly detection aims to find patterns in data that deviate from expected normal behavior. Typical observations from traffic data demonstrate strong spatiotemporal patterns, showing periodicity and strong correlations between adjacent observations. These patterns may vary depending on the time of the day, day of the week, season, or location. Occasional deviations from these patterns can be termed abnormal events. Information extracted from these anomalous events can provide useful guidelines to urban planners. For instance, abnormal traffic event detection can be utilized to help mitigate congestion, plan driving routes, and reduce the imbalance between taxi demand and supply.
Within the transportation domain, anomaly detection has been applied to abnormal trajectory detection [8], finding atypical regions [11], obstacle detection [12], congestion analysis [13], and irregularities in taxi passenger demand [14], among others. Anomaly detection also finds extensive use in a wide range of applications such as fraud detection for credit cards, insurance, or health care, intrusion detection for cybersecurity, fault detection in safety-critical systems, and military surveillance for enemy activities [15].
I-A Related Literature
Traditionally, anomaly detection has been performed using parametric and non-parametric statistical models, data clustering, rule-based systems, mixture models, and SVMs (Support Vector Machines), among others; for extensive surveys, the interested reader can refer to [15] and [16]. These traditional models often fail to capture the complex structures in the data. Additionally, as the volume of the data increases, traditional methods may experience difficulties in finding outliers at such a large scale. Hence, the performance of the aforementioned algorithms in detecting outliers might be suboptimal for real-world sequences.
In recent years, deep learning-based anomaly detection algorithms have become increasingly popular, with applications in a diverse set of tasks [10]. Unsupervised anomaly detection using deep learning has mainly been hybrid in nature: first, the deep neural network learns the complex patterns of the data; then, the hidden layer representations from this trained network are used as input to traditional anomaly detection algorithms. There are two popular categories of deep learning-based anomaly detection. The first category consists of methods that analyze the reconstruction errors of an autoencoder trained over the normal data; a deficiency in the reconstruction of a test point indicates abnormality [17]. The second class of methods utilizes either an autoencoder trained over the normal class to generate a low-dimensional embedding, or a neural network to generate predictions. To identify anomalies, one uses classical methods over the embedding or predictions, such as a parametric distribution assumption [17], an OCSVM (One-Class SVM) [18], etc. While the currently popular hybrid deep learning-based anomaly detection techniques have proven to be effective in multiple tasks, these neural networks are not customized for anomaly detection. Since the hybrid models extract features using a neural network and feed them to a separate anomaly detection method, they fail to influence the representation learning in the hidden layers. A more advanced variant of this approach combines the encoding and detection steps using an appropriate objective function, which is used to train a single neural model that performs both procedures [19]. In other related research [20], the authors use geometric transformations to perform end-to-end deep learning-based anomaly detection using CNNs (Convolutional Neural Networks). In [21], an OCSVM objective is implemented in a feed-forward neural network for deep anomaly detection.
The primary focus of the aforementioned literature is on anomaly detection for image data sets. Anomaly detection techniques tailored for images need not necessarily perform well on time sequences. Therefore, in this study, we aim to develop an end-to-end anomaly detection model based on the LSTM (Long Short-Term Memory) network [22], a neural network designed for sequential data. By gathering insights from EVT (Extreme Value Theory) [23], we design an end-to-end LSTM-based anomaly detection model. To the best of our knowledge, an LSTM-based end-to-end deep anomaly detection model for transportation data has not been explored in the literature. Further, our objective function and network weight updates are based on results from EVT. So far, Extreme Value Theory has not been employed in training a neural network model for anomaly detection. These features set our research apart from the existing literature. (A part of this work has been presented as a conference paper [24].)
I-B Our Contributions
We propose an end-to-end deep anomaly detection algorithm and compare it against several baseline models: (i) the parametric GARCH (Generalized Auto Regressive Conditional Heteroskedasticity) model, (ii) the non-parametric OCSVM model, and (iii) hybrid LSTM anomaly detection models based on different detection rules. The detection rules used in the hybrid deep anomaly detection models are based on the Gaussian distribution, Tukey’s method, and EVT. The key findings obtained by comparing the traditional and deep learning-based models are outlined below.

This study develops an end-to-end deep anomaly detection algorithm for temporal data based on an LSTM network and an objective function derived from EVT.

Our proposed EVT-LSTM model outperforms several statistical, machine learning, and hybrid deep learning-based algorithms across seven diverse data sets.

We highlight the necessity of a customized neural network model in a deep learning-based anomaly detection setting.
The rest of the paper is organized as follows. In Section II, we explain the traditional baseline models considered for anomaly detection in this study. The hybrid deep anomaly detection model, along with the three detection strategies, is explained in Section III. It is followed by Section IV, where we introduce our proposed EVT-LSTM model. The experimental settings are provided in Section V, and the results are outlined in Section VI. We conclude our work in Section VII.
II. Traditional Anomaly Detection
In this section, we provide brief descriptions of two traditional anomaly detection models considered as baselines in our comparison study.
II-A GARCH Model
Parametric statistical models [25] represent one of the early approaches to outlier detection in time series. Several models were subsequently proposed in the literature for parametric anomaly detection, including ARMA (Auto Regressive Moving Average), ARIMA (Auto Regressive Integrated Moving Average), and EWMA (Exponentially Weighted Moving Average), to list a few [15]. The underlying assumption is that normal data instances lie in the high-probability regions of a stochastic model, while anomalies have low probability under that model. A common practice is to either assume a distribution for the anomalies [26] or fit a regression model to the data [27]. A regression-based anomaly detection technique involves two steps: (a) a regression model is fit to the data, and (b) the residuals, i.e., the part not explained by the regression model, are used to determine the anomaly scores. A popular choice for regression-based anomaly detection is the GARCH model [28]
, which is often applied to financial time series. A GARCH process is often preferred over other regression models such as ARMA because it imposes a specific structure on the conditional variance of the process. The variance is not assumed to be constant, making the series non-stationary in nature and rendering the model suitable for real-world scenarios. Essentially, the GARCH process models the error variance of the time series as an ARMA process: the AR part models the variance of the residuals and the MA portion models the variance of the process. The time series $x_t$ at each instance $t$ is given by:

$$x_t = \sigma_t \epsilon_t, \qquad (1)$$

where $\epsilon_t$ is discrete white noise with zero mean and unit variance, and the conditional variance $\sigma_t^2$ is given by:

$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{r} \alpha_i x_{t-i}^2 + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^2, \qquad (2)$$

where $\alpha_i$ and $\beta_j$ are the parameters of the model. In other words, $\{x_t\}$ is a Generalized Auto Regressive Conditional Heteroskedastic process of order $r$ and $s$, denoted by GARCH($r$, $s$).
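As an illustration of Eqns. (1)–(2), the GARCH(1, 1) volatility recursion can be computed directly and used for residual-based anomaly scoring. The parameter values, the simulated series, the injected anomaly, and the score cutoff below are arbitrary choices for this sketch, not fitted estimates.

```python
import numpy as np

def garch_volatility(x, alpha0, alpha1, beta1):
    """Conditional volatility of a GARCH(1, 1) process (Eqn. 2)."""
    sigma2 = np.empty_like(x)
    sigma2[0] = alpha0 / (1.0 - alpha1 - beta1)  # unconditional variance
    for t in range(1, len(x)):
        sigma2[t] = alpha0 + alpha1 * x[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return np.sqrt(sigma2)

rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal(500)   # stand-in for a pre-processed series
x[250] = 2.0                         # injected anomaly
sigma = garch_volatility(x, alpha0=0.01, alpha1=0.1, beta1=0.8)
scores = np.abs(x) / sigma           # standardized residuals as anomaly scores
anomalies = np.flatnonzero(scores > 5.0)
```

Observations whose standardized residual exceeds the cutoff are flagged; in practice the threshold would be set on a validation set, as described later.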
Parametric methods allow the model to be evaluated very rapidly for new instances and are suitable for large data sets; the model grows only with model complexity and not with the data size. However, they limit their applicability by enforcing a predetermined distribution on the data. These approaches are accurate only if the data fit the chosen distribution model. The non-parametric approach described below can overcome this disadvantage of parametric models.
II-B OCSVM Model
Non-parametric methods such as SVMs [29] apply local kernel models rather than a single global distribution model to the data. Their popularity stems from their ability to combine the speed and low complexity growth of parametric methods with the model flexibility of non-parametric methods. Kernel-based methods estimate the density of the input space and identify outliers as points lying in regions of low density.
Typically, the SVM model is given a set of training examples labeled as belonging to one of two classes. The model tries to divide the training sample points into two categories by creating a boundary, while penalizing training samples that fall on the wrong side of it. The SVM model can then make predictions by assigning points to either side of the boundary. For anomaly detection applications, labeled anomalous training examples are often limited. Therefore, SVMs are more popularly applied here in a one-class setting, where the SVM model is trained on data containing only one class: the normal class. This is particularly useful in anomaly detection because, by inferring the properties of the normal class, the examples that deviate from it can be identified. The SVM model needs a kernel function that can map the original nonlinear observations into a higher-dimensional space in which they are separable. Commonly used kernel functions are linear, sigmoid, Gaussian, and RBF (Radial Basis Function) [29, Chapter 2]. During the testing phase, if a test instance falls within the learned region, it is declared normal; otherwise, it is deemed anomalous. The SVM model requires a kernel function, which has to be carefully tuned to obtain good classification accuracy. Further, this form of anomaly detection is supervised in nature: it requires prior knowledge of the labels. On the other hand, recently developed anomaly detection models based on neural networks can perform unsupervised anomaly detection, and hence have lately seen widespread use over the SVM model.
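The one-class training and testing procedure described above can be sketched with scikit-learn's `OneClassSVM` (assuming scikit-learn is available); the data, kernel coefficient, and `nu` value below are illustrative, not the tuned settings reported later in the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, size=(200, 2))           # normal class only
test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)),  # normal-looking points
                  [[8.0, 8.0]]])                      # an obvious outlier

# nu bounds the fraction of training points treated as outliers;
# gamma is the RBF kernel coefficient.
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(train)
pred = clf.predict(test)  # +1 = inside learned region (normal), -1 = anomalous
```

Points falling outside the learned high-density region receive the label -1, mirroring the test-phase rule described above.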
III. Hybrid Deep Anomaly Detection
The suitability of neural network models for anomaly detection originates from their unsupervised learning nature and their ability to learn highly complex nonlinear sequences. When presented with normal, non-anomalous data, the neural network can learn and capture the normal behavior of the system. Later, when the model encounters a data instance that deviates significantly from the rest of the set, it generates a high prediction error, suggesting anomalous behavior. This form of prediction-based anomaly detection is hybrid in nature, as it requires the application of a set of detection rules on the errors obtained from the network. Often, the decision rules employed are traditional statistical or machine learning anomaly detection algorithms. Popular detection techniques involve thresholding the prediction errors [30], assuming an underlying parametric distribution on the prediction errors [17], or applying machine learning techniques such as an SVM model on the errors [31]. We now briefly describe the prediction model and detection rules considered in our study.
III-A Prediction Model
We use the LSTM network as the time-series prediction model. LSTMs are state-of-the-art neural network models widely used in sequence learning applications [22]. We feed the most recent values of every data set into the model, and the model outputs a number of forecasts; the number of recent values fed and the number of forecasts produced are known as the lookback and lookahead, respectively. Dropout and early stopping are employed to avoid overfitting.
Each data set is divided into a training set, a validation set, and a test set. The model learns from the training data and validates its performance on the held-out validation data. The training set is assumed to be free of anomalies. This is a reasonable assumption in real-world scenarios, where instances of normal behavior may be available in abundance, but instances of anomalous behavior are rare. The validation and test sets are mixtures of anomalous and non-anomalous data instances. The prediction model is trained on the anomaly-free training data so that it learns the normal behavior of the time series. Once the model is trained, anomaly detection is performed on the test set by using the prediction errors as anomaly indicators. In this paper, the prediction error is defined as the absolute difference between the input received at time $t$ and its corresponding prediction from the model.
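The lookback/lookahead construction described above can be sketched as follows; the series and window sizes are made up for illustration.

```python
import numpy as np

def make_windows(series, lookback, lookahead):
    """Slice a 1-D series into (input window, target window) pairs.

    X[i] holds `lookback` consecutive values; y[i] holds the next
    `lookahead` values, which the prediction model learns to forecast.
    """
    X, y = [], []
    for i in range(len(series) - lookback - lookahead + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + lookahead])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)
X, y = make_windows(series, lookback=3, lookahead=1)
# X[0] = [0, 1, 2] predicts y[0] = [3]; the absolute differences between
# the model forecasts and y later serve as the anomaly indicators.
```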
We consider three detection techniques by which the prediction errors can be used to set an anomaly threshold: (i) the Gaussian-based detection rule, which makes assumptions about the parent distribution; (ii) the detection rule based on Tukey’s method, which makes no distributional assumptions; and (iii) the EVT-based detection rule, which makes assumptions about the tail of the distribution, but not about the parent distribution. If any prediction error value lies outside the chosen threshold, then the corresponding input can be considered a possible anomaly. The detection rules considered are as follows:
III-B Gaussian-based Detection [17]
One of the earliest and most popular works in the prediction-based anomaly detection setting [17] assumes that the prediction errors from the training set follow a Gaussian distribution. The prediction errors obtained from the LSTM model are fit to a Gaussian distribution. The mean, $\mu$, and variance, $\sigma^2$, of the Gaussian distribution are computed using MLE (Maximum Likelihood Estimation) [32]. The Log PDs (Probability Densities) of the errors are calculated based on the estimated parameters and used as anomaly scores. A low Log PD value indicates that the likelihood of an observation being an anomaly is high. A validation set containing both normal data and anomalies is used to set a threshold on the Log PD values. The threshold is chosen such that it separates all the anomalies from normal observations while incurring as few false positives as possible. The threshold is then evaluated on a separate test set.
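A minimal sketch of this rule, assuming the prediction errors are already in hand; the simulated errors and the Log PD cutoff below are illustrative, whereas the paper sets the cutoff on a validation set.

```python
import numpy as np

rng = np.random.default_rng(1)
train_errors = np.abs(rng.normal(0.0, 0.1, size=1000))  # errors on normal data
test_errors = np.array([0.05, 0.12, 0.95])              # last value is anomalous

# MLE for a Gaussian: sample mean and sample variance of the training errors.
mu = train_errors.mean()
var = train_errors.var()

def log_pd(e):
    """Log probability density of errors under the fitted Gaussian."""
    return -0.5 * np.log(2 * np.pi * var) - (e - mu) ** 2 / (2 * var)

threshold = -10.0                      # illustrative; normally validation-tuned
flags = log_pd(test_errors) < threshold  # True = flagged as anomalous
```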
III-C Tukey’s Method Based Detection [30]
Tukey’s method uses quartiles to define an anomaly threshold. It makes no distributional assumptions and does not depend on knowledge of a mean or a standard deviation. In Tukey’s method, a possible outlier lies outside the interval $[Q_1 - 1.5\,\mathrm{IQD},\; Q_3 + 1.5\,\mathrm{IQD}]$, where $Q_1$ is the lower quartile or the 25th percentile, and $Q_3$ is the upper quartile or the 75th percentile. The metric $\mathrm{IQD} = Q_3 - Q_1$ is known as the interquartile distance. The prediction errors obtained from the training, validation, and test sets are concatenated, and the quartiles and interquartile distances are calculated. The values lying outside the interval are identified as potential outliers.
III-D EVT-based Detection [23]
Let $X$ be a random variable and $F$ be its CDF (Cumulative Distribution Function). The tail of the distribution is given by $\bar{F}(x) = P(X > x) = 1 - F(x)$. This probability tends to zero for the extreme events in the system. A key result from EVT [33] suggests that the distribution of the extreme values is not highly sensitive to the parent data distribution. This result enables us to accurately compute probabilities without first estimating the underlying distribution. Under a weak condition, the extreme events have the same kind of distribution, regardless of the parent distribution, known as the EVD (Extreme Value Distribution):

$$G_{\gamma,\sigma}(x) = \exp\left(-\left(1 + \frac{\gamma x}{\sigma}\right)^{-1/\gamma}\right), \quad 1 + \frac{\gamma x}{\sigma} > 0, \qquad (3)$$

where $\sigma$ is the scale parameter, and $\gamma$ is the extreme value index of the distribution. Based on the value $\gamma$ takes, the tail distribution can be Fréchet ($\gamma > 0$), Gumbel ($\gamma = 0$), or Weibull ($\gamma < 0$). By fitting an EVD to the tail of the unknown input distribution, it is then possible to evaluate the probability of potential extreme events. In some recent work [23], the authors use results from EVT to detect anomalies in a univariate data stream, following the POT (Peaks-Over-Threshold) approach. Based on an initial threshold $t$, the POT approach attempts to fit a GPD (Generalized Pareto Distribution) to the excesses, $X - t$. In other words, rather than fitting an EVD to the extreme values of $X$, the POT approach fits a GPD to the excesses $X - t$. To compute the maximum likelihood estimates for the GPD, we follow the procedure outlined in [34]. Once the parameters are obtained, the threshold can be computed as:
$$z_q \simeq t + \frac{\hat{\sigma}}{\hat{\gamma}}\left(\left(\frac{qn}{N_t}\right)^{-\hat{\gamma}} - 1\right), \qquad (4)$$

where $\hat{\sigma}$ and $\hat{\gamma}$ are the estimated parameters of the GPD, $q$ is some desired probability, $n$ is the total number of observations, and $N_t$ is the number of peaks, i.e., the number of $X_i$ s.t. $X_i > t$. The threshold is applied to all the observations, and those data instances lying beyond $z_q$ can be considered plausible anomalies. The authors in [23] recommend choosing a small value for the probability $q$ and setting $t$ as the 98% quantile, which we follow in our study. More details of this algorithm can be found in [23].
IV. End-to-End Deep Anomaly Detection
In Section I-A, we highlighted the need for developing end-to-end deep learning-based anomaly detection models, especially for temporal data. An end-to-end deep anomaly detection technique involves modifying the objective function of a deep learning model such as an LSTM or a CNN. Modifications are introduced so that models that formerly learned patterns for forecasting now learn to detect deviations from normal behavior. Instead of first predicting with a neural network and then feeding the predictions to a separate post-processing technique, the outputs of an end-to-end deep anomaly detection model can be directly interpreted as anomaly scores. In [19], the authors combine a CNN with an SVDD (Support Vector Data Description) objective. The SVDD is a technique similar to the OCSVM, where a hypersphere is used to separate the data instead of a hyperplane.
Let $\phi(\cdot; W)$ be a neural network with $L$ layers and a set of weights $W = \{W^1, \ldots, W^L\}$. This network maps data from an input space $\mathcal{X}$ to an output space $\mathcal{Z}$. That is, $\phi(x; W) \in \mathcal{Z}$ is the network representation of $x \in \mathcal{X}$ given by the network with parameters $W$. The One-Class Deep SVDD objective given in [19], for a CNN model with input $\{x_1, \ldots, x_n\}$, is as follows:

$$\min_{W} \; \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^l \right\|_F^2 \qquad (5)$$

The first term in the quadratic loss objective function penalizes the distance between every network representation $\phi(x_i; W)$ and the center of the hypersphere $c$. The second term penalizes the network weights by employing a network weight decay regularizer with hyperparameter $\lambda$, where $\|\cdot\|_F$ denotes the Frobenius norm. In [19], the center $c$ was fixed as the mean of the network predictions resulting from an initial forward pass on the training data samples. The experiments were conducted on the MNIST and CIFAR-10 image data sets.
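The quadratic loss in Eqn. 5 can be illustrated numerically. The one-layer linear map below is only a stand-in for the CNN of [19], and all values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # toy inputs
W = rng.normal(size=(4, 2))    # weights of a one-layer linear "network"

phi = X @ W                    # network representations
c = phi.mean(axis=0)           # center: mean of an initial forward pass

lam = 1e-3
# First term: mean squared distance of representations from the center c.
# Second term: Frobenius-norm weight decay with hyperparameter lambda.
svdd_loss = (np.mean(np.sum((phi - c) ** 2, axis=1))
             + 0.5 * lam * np.sum(W ** 2))
```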
In order to develop a similar model for time sequences, we implement the aforementioned objective function in an LSTM model. Interestingly, we find that while this quadratic loss objective function works satisfactorily for anomaly detection in images, it does not fare well for temporal data. When adopted in the LSTM network, we notice that Eqn. 5 minimizes the distance between the predictions and their initial mean by reducing the magnitude of the predictions, resulting in a large fraction of false positives. This behavior suggests that an objective function that directly minimizes the network predictions might not be a sensible choice for anomaly detection in temporal data. We recall that the success of hybrid deep learning-based anomaly detection algorithms was mainly attributed to an efficient threshold based on the prediction errors. Therefore, it is natural to explore an objective function that minimizes the prediction errors and not the actual predictions.
Further, in our recent work [24], after comparing different detection strategies for hybrid deep anomaly detection, we noticed the potential of a strategy based on extreme values. We found that an EVT-based detection rule performed better than other popular detection techniques. The superior performance of an EVT-based strategy in a deep learning setting encouraged us to integrate EVT into the objective function of the LSTM model, leading to an end-to-end deep anomaly detection model.
IV-A EVT-LSTM Model
In our study, the inputs in the space $\mathcal{X}$ are mapped to predictions in the space $\mathcal{Z}$. Our EVT-LSTM model is based on the objective function given as follows:

$$\min_{W} \; \frac{1}{n}\sum_{i=1}^{n} \left( \left| y_i - \phi(x_i; W) \right| - z_q \right)^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^l \right\|_F^2 \qquad (6)$$

where $y_i$ denotes the observed value corresponding to the input $x_i$. Here, instead of minimizing the distance between the network representations and the mean obtained after an initial forward pass as in Eqn. 5, we minimize the Euclidean distance between every absolute prediction error $|y_i - \phi(x_i; W)|$ and a threshold $z_q$. The threshold $z_q$ is obtained from Eqn. 4 and is updated periodically during the training phase. This form of optimization is called an alternating minimization approach and has been used with similar objective functions in related literature [19, 21]. The objective functions in these related works minimized a function of the predictions obtained from image data sets. Our objective function (Eqn. 6), on the other hand, optimizes a function of the prediction errors. Our proposed algorithm is given in Algorithm 1.
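The loss in Eqn. 6 can be written down directly; the error values, threshold, layer weights, and λ below are placeholders, not values used in our experiments.

```python
import numpy as np

def evt_lstm_loss(abs_errors, z_q, weights, lam=1e-3):
    """Objective of Eqn. 6: squared distance of each absolute prediction
    error from the EVT threshold z_q, plus Frobenius weight decay."""
    data_term = np.mean((abs_errors - z_q) ** 2)
    reg_term = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + reg_term

abs_errors = np.array([0.02, 0.05, 0.90])  # last error far exceeds threshold
z_q = 0.1                                  # threshold from Eqn. 4
weights = [np.ones((2, 2))]                # toy layer weights
loss = evt_lstm_loss(abs_errors, z_q, weights)
```

In training, the gradient of this loss drives the network to keep normal-data errors close to the current threshold, while the threshold itself is refreshed periodically from the error distribution.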
The threshold $z_q$ is initialized to zero at the beginning of the experiment. During the training phase, the LSTM model tries to optimize the objective function given in Eqn. 6. The prediction errors on the training set are calculated after every fixed number of epochs. The 98% empirical quantile of the errors is chosen to set an initial threshold $t$ in InitThreshold(). The excesses occurring above $t$ are fit to a GPD using MLE, and the parameters $\hat{\sigma}$ and $\hat{\gamma}$ are estimated. Then, using Eqn. 4, we calculate the new value for the threshold $z_q$. The objective function (Eqn. 6) is updated with this recent threshold value. The subsequent epochs use the modified objective function to train the model, after which the threshold $z_q$ is again calculated and updated. The training stops either when convergence is achieved or when the maximum number of epochs is reached. Finally, on a test set, the decision scores are calculated and used to classify instances as anomalous or non-anomalous.
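The InitThreshold() and Eqn. 4 steps above can be sketched with scipy's generalized Pareto fit (assuming scipy is available); the simulated errors and the choice of q are illustrative.

```python
import numpy as np
from scipy.stats import genpareto

def evt_threshold(errors, q=1e-3, quantile=0.98):
    """Peaks-over-threshold estimate of z_q (Eqn. 4)."""
    t = np.quantile(errors, quantile)      # initial threshold (98% quantile)
    excesses = errors[errors > t] - t      # peaks over t
    # Fit a GPD to the excesses; scipy's shape c plays the role of gamma
    # and scale the role of sigma. Location is fixed at 0 for excesses.
    gamma, _, sigma = genpareto.fit(excesses, floc=0.0)
    n, n_t = len(errors), len(excesses)
    if abs(gamma) < 1e-6:                  # Gumbel-type limiting case
        return t + sigma * np.log(n_t / (q * n))
    return t + (sigma / gamma) * ((q * n / n_t) ** (-gamma) - 1.0)

rng = np.random.default_rng(0)
errors = rng.exponential(scale=0.1, size=5000)  # stand-in prediction errors
zq = evt_threshold(errors, q=1e-3)
# Observations whose prediction error exceeds zq are flagged as anomalous.
```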
V. Experimental Settings
In this section, we discuss the data sets considered, evaluation metrics used, and the procedure for choosing parameters for each anomaly detection model.
V-A Description of Data Sets
We consider seven real-world data sets in our comparison study: three road traffic-based data sets, two taxi demand data sets, and two data sets from other application domains. The travel time, vehicle occupancy, and traffic speed data sets are real-time data obtained from a traffic detector and collected by the Minnesota Department of Transportation. Discussions on these traffic data sets are available at the Numenta Anomaly Benchmark GitHub repository (https://github.com/numenta/NAB/tree/master/data). The NYC (New York City) taxi demand data set is publicly available at [35] and contains the trip details of government-run street hailing taxis. The Bengaluru taxi demand data set is acquired from a leading private Indian transportation company dealing with app-based taxi rental services. The ECG (electrocardiogram) data is obtained from [36] and has annotations from a cardiologist indicating unusual heartbeat patterns. Bitcoin historic prices are obtained from the coindeskr package in R (https://cran.r-project.org/package=coindeskr).
Brief descriptions of the data sets used are given below.

Vehicular Travel Time: The data set is obtained from a traffic sensor and has 2500 readings from July 10, 2015, to September 17, 2015, with eight marked anomalies.

Vehicular Speed: The data set contains the average speed of all vehicles passing through the traffic detector. A total of 1128 readings is available for the period September 8, 2015 to September 17, 2015. There are three marked unusual subsequences in the data set.

Vehicle Occupancy: There are a total of 2382 readings indicating the percentage of the time, during a 30second period, that the detector sensed a vehicle. The data is available for a period of 17 days, from September 1, 2015, to September 17, 2015, and has two marked anomalies.

NYC (New York City) Taxi Demand [35]: The publicly available NYC data set contains the pickup locations and time stamps of street hailing yellow taxi services for the period January 1, 2016 to February 29, 2016. We pick three time sequences (S1, S2, and S3) with clearly apparent anomalies from data aggregated over 15-minute time periods in uniform spatial grids.

Bengaluru Taxi Demand: This data set has GPS traces of passengers booking a taxi by logging into the service provider’s mobile application. Similar to the NYC data set, this data is available for January and February 2016. We aggregate the data over 15-minute periods in uniform spatial grids and pick three sequences with clearly visible anomalies.

ECG (Electrocardiogram) [36]: There are a total of 18000 readings, with three unusual subsequences labeled as anomalies. The data set has a repeating pattern, with some variability in the period length.

Bitcoin Prices: Historical bitcoin prices are available for the period from January 1, 2017 to May 27, 2019. The fraction of anomalies in this data set of 877 readings is observed to be 0.06%, with most of them occurring around the beginning of 2018.
V-B Evaluation Metrics
We consider three evaluation metrics for comparing our models: (i) Precision, $P$, (ii) Recall, $R$, and (iii) F1-score, $F_1$, which is the harmonic mean of Precision and Recall. Min-max normalization is performed on every data set before modeling and evaluation.

Precision, $P$:
$$P = \frac{TP}{TP + FP} \qquad (7)$$

Recall, $R$:
$$R = \frac{TP}{TP + FN} \qquad (8)$$

F1-score, $F_1$:
$$F_1 = \frac{2PR}{P + R} \qquad (9)$$
True positives (TP) are the anomalous instances correctly classified as anomalies by the model. Similarly, true negatives are the instances correctly identified as non-anomalous data. False positives (FP) are the non-anomalies incorrectly classified as anomalous, and false negatives (FN) are the anomalies the model fails to identify. Since the F1-score summarizes both Precision and Recall, we consider the model with the highest F1-score as the superior anomaly detection technique.
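The metrics in Eqns. 7–9 can be computed directly from binary labels; the label vectors below are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute Precision, Recall, and F1-score for binary anomaly labels
    (1 = anomaly, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)  # p = 2/3, r = 2/3, f1 = 2/3
```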
V-C Parameter Selection
In order to perform efficient anomaly detection, it is necessary to set appropriate hyperparameters and anomaly thresholds for each model. The suitable set of parameters and thresholds vary with the use case considered. Below, we briefly discuss the procedures through which the parameters are shortlisted for each anomaly detection model.
| Data Sets | Model | Threshold |
| Vehicular Travel Time | | 0.016 |
| Vehicular Speed | | 0.036 |
| Vehicle Occupancy | | 0.433 |
| NYC Taxi Demand S1 | | 0.009 |
| NYC Taxi Demand S2 | | 0.047 |
| NYC Taxi Demand S3 | | 0.051 |
| Bengaluru Taxi Demand S1 | | 0.064 |
| Bengaluru Taxi Demand S2 | | 0.003 |
| Bengaluru Taxi Demand S3 | | 0.060 |
| Electrocardiogram | | |
| Bitcoin Prices | | 0.025 |
V-C1 GARCH Model
For every data set, time sequences are generated based on the training data. For the Bengaluru and NYC taxi demand data sets, the temporal aggregation is performed at sampling periods of 15 minutes. Then, by varying the $p$, $d$, and $q$ parameters of an ARIMA($p$, $d$, $q$) process between [1, 5], appropriate models are chosen for every time sequence. The residuals obtained from fitting the ARIMA processes are then modeled as suitable GARCH($r$, $s$) processes. We find that suitable values for the parameters $r$ and $s$ often lie in the range [1, 2]. Once appropriate models are developed, anomaly scores are obtained based on the deviation of the GARCH predictions from the actual values. An anomaly threshold is set based on the validation set and examined on a test set. The parameters of the fitted ARIMA-GARCH models, along with the anomaly thresholds, are given in Table I.
V-C2 OCSVM Model
Appropriate kernel functions are crucial for satisfactory anomaly detection performance of SVMs, and the choices vary with the data sets considered. In our study, we consider Linear, RBF, Polynomial, and Sigmoid kernels. Another important parameter is the kernel coefficient for the RBF, Polynomial, and Sigmoid kernels. After varying this coefficient in the range [0.0001, 0.1], a value of 0.0001 is found to suit most of the data sets considered. For every use case, multiple SVM models are run on the training data, with different parameters chosen from the ranges of values considered. Then, suitable choices are made by observing the classification accuracy on a held-out validation set. Finally, the best OCSVM model obtained is used to detect anomalies on a test set. The shortlisted OCSVM models are given in Table II.
| Data Sets | Kernel Setting |
| Vehicular Travel Time | RBF(0.0001) |
| Vehicular Speed | Poly(0.0001) |
| Vehicle Occupancy | RBF(0.0001) |
| NYC Taxi Demand S1, S2, S3 | RBF(0.0001) |
| Bengaluru Taxi Demand S1, S2, S3 | RBF(0.0001) |
| Electrocardiogram | Linear |
| Bitcoin Prices | Sigmoid(0.0001) |
| Data Sets | LSTM Architecture |
| Vehicular Travel Time | |
| Vehicular Speed | |
| Vehicle Occupancy | |
| NYC Taxi Demand | |
| Bengaluru Taxi Demand | |
| Electrocardiogram | |
| Bitcoin Prices | |
V-C3 Hybrid LSTM Models
For a neural network model, hyperparameters define the high-level features of the model, such as its complexity or capacity to learn. The important hyperparameters include the number of hidden recurrent layers, the dropout values, the learning rate, and the number of units in each layer. We use TPE (Tree-structured Parzen Estimator) Bayesian Optimization [37] to select these hyperparameters. The output layer is a fully connected dense layer with linear activation. The Adam optimizer [38] is used to minimize the Mean Squared Error objective function. All LSTM-based models ran for 100 epochs with a batch size of 64.
The chosen set of parameters for each data set is given in Table III. We follow the same model settings as [39] for the ECG data set. For the traffic speed, travel time, vehicle occupancy, and bitcoin prices data sets, the limited availability of readings suggested lookback and lookahead times of 1 each. We have over 10 million points for the New York and Bengaluru cities, allowing for a large lookback time. The considerable amount of data in these two cases allows the LSTM to learn better representations of the input data, aiding the anomaly detection process.
The false positive regulators are the parameters that impact the performance of the detection algorithms. The false positive regulator for the Gaussian-based detection rule is chosen for each time sequence such that the F1-score on the validation errors is maximized. The thresholds for Tukey’s method are obtained directly from the entire set of prediction errors, based on a simple quantile calculation. For both the hybrid and the end-to-end EVT-LSTM deep learning models, we follow similar procedures to set the parameters for the EVT rule. As mentioned earlier, an initial threshold $t$ has to be chosen for the EVT-based detection, typically the 98% quantile. The false positive regulator for the EVT-based anomaly detection, the probability $q$, is set from an initialization data stream, the same stream that is used for setting $t$. The initialization stream contains the prediction errors from the training and validation sets. The probability $q$ is chosen so that the EVT-based anomaly detection picks up all the anomalies from the initialization stream. The chosen values for the false positive regulators of the hybrid LSTM-based techniques are given in Table IV.
Data Sets                 Hybrid LSTM Models
                          Gaussian   Tukey     EVT
Vehicular Travel Time     20         572.9
Vehicular Speed           18         24.4
Vehicle Occupancy         23         12.9
New York Taxi Demand
  S1                      19         12.1
  S2                      17         12.8
  S3                      15         10.5
Bengaluru Taxi Demand
  S1                      25         33.5
  S2                      18         27.1
  S3                      25         14.0
Electrocardiogram         23         0.1
Bitcoin Prices            17         12961.8
V-C4 EVT-LSTM Model
The hyperparameters and false positive regulators chosen for the hybrid LSTM models are used for the EVT-LSTM model as well. We follow the guidelines in [19] while setting the hyperparameter for the network weight regularizer. The threshold is updated every 20 epochs. The values chosen for the hybrid deep learning models suit the end-to-end deep learning models in most of the scenarios considered. An exception was the Bengaluru Taxi Demand data set, where a different value of the probability was required. Nevertheless, the best choices for the probability remained within a narrow range.
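The EVT-based threshold setting can be sketched via the standard peaks-over-threshold recipe, following the approach of Siffer et al. [23]. The fitting details below (scipy's maximum likelihood GPD fit) are an assumption for illustration, since the papers cited use Grimshaw's procedure [34]:

```python
import numpy as np
from scipy.stats import genpareto

def evt_threshold(errors, q=1e-4, init_quantile=0.98):
    # Peaks-over-threshold: fit a GPD to the excesses above an initial
    # empirical threshold t, then invert the tail estimate to obtain
    # z_q such that P(error > z_q) is approximately q.
    t = np.quantile(errors, init_quantile)
    excesses = errors[errors > t] - t
    gamma, _, sigma = genpareto.fit(excesses, floc=0.0)
    n, n_t = len(errors), len(excesses)
    if abs(gamma) < 1e-8:
        # Exponential-tail limit of the GPD.
        return t - sigma * np.log(q * n / n_t)
    return t + (sigma / gamma) * ((q * n / n_t) ** (-gamma) - 1.0)
```

Observations exceeding z_q are flagged as anomalous; lowering the probability q pushes the threshold further into the tail and reduces false positives.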
Data Sets                 P-values
Vehicular Travel Time     0.005
Vehicular Speed           0.005
Vehicle Occupancy         0.370
New York Taxi Demand
  S1                      0.805
  S2                      0.056
  S3                      0.147
Bengaluru Taxi Demand
  S1                      0.570
  S2                      0.180
  S3                      0.006
Electrocardiogram         0.002
Bitcoin Prices            0.051
P-values obtained from the AD statistical test. The null hypothesis is rejected when the p-value lies below 0.001. For all the data sets considered, the null hypothesis that the tails of the prediction errors follow a GPD cannot be rejected.
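For reference, the AD statistic behind such p-values can be computed as sketched below. The GPD fit via scipy maximum likelihood is an assumption (the test in the paper may use a different fitting procedure), and the p-values themselves require simulated critical values, which this sketch omits:

```python
import numpy as np
from scipy.stats import genpareto

def ad_statistic_gpd(excesses):
    # Fit a GPD to the tail excesses, then compute the Anderson-Darling
    # statistic of the sample against the fitted distribution; smaller
    # values indicate a closer match between the empirical and
    # hypothesized tails.
    c, _, scale = genpareto.fit(excesses, floc=0.0)
    z = np.sort(genpareto.cdf(excesses, c, loc=0.0, scale=scale))
    z = np.clip(z, 1e-12, 1.0 - 1e-12)  # guard the logarithms
    n = len(z)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1.0 - z[::-1])))
```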
Data Sets                 Anomaly Detection Models
                          GARCH   OCSVM   LSTM-Gaussian   LSTM-Tukey   LSTM-EVT   EVT-LSTM
Vehicular Travel Time     0.01    0.04    0.07            0.21         0.36       0.36
Vehicular Speed           0.18    0.56    0.79            0.74         0.79       0.79
Vehicle Occupancy         1.0     0.33    0.5             1.0          1.0        1.0
New York Taxi Demand
  S1                      0.002   0.03    0.25            1.0          1.0        1.0
  S2                      0.005   0.16    0.14            0.33         1.0        1.0
  S3                      0.007   0.6     0.66            0.86         0.86       0.86
Bengaluru Taxi Demand
  S1                      0.03    0.29    0.47            0.57         1.0        1.0
  S2                      0.002   0.12    0.08            0.5          0.5        0.66
  S3                      0.04    0.44    0.26            0.54         0.62       0.72
Electrocardiogram         0.1     0.22    0.49            0.32         0.37       0.28
Bitcoin Prices            0.52    0.31    0.19            0.83         0.83       0.84
VI Results
In this section, we analyze whether the tails of the prediction error distribution follow a GPD, and present results from the numerical tests performed.
VI-A Statistical Tests
We conduct a statistical test known as the AD (Anderson-Darling) test [40] to check the compliance of the tail distribution with a GPD. The AD test can be used to assess whether a sample of data comes from a specific probability distribution. This test makes use of the specific distribution while calculating the critical values. The test statistic measures the distance between the hypothesized distribution and the empirical CDF of the data. Based on the test statistic and the p-values obtained, the null hypothesis that the data follow a specified distribution can or cannot be rejected. The AD test is a modification of the KS (Kolmogorov-Smirnov) test [41] and gives more weight to the tails than does the KS test. The AD test is conducted on the excesses, i.e., the prediction errors lying above the empirical threshold. The p-values obtained from this statistical test are given in Table V. We reject the null hypothesis for a data set if the corresponding p-value lies below 0.001. For all the data sets under study, statistical evidence from the AD test suggests that the tail distributions of the prediction errors follow a GPD.
VI-B Numerical Results
The anomaly detection performance of the various models across the different data sets, based on the F1-score metric, is provided in Table VI. From the results in the table, we can draw the following inferences:

The poor performance of the parametric GARCH models suggests that assuming a particular distribution for the prediction errors can critically affect anomaly detection accuracy.

Deep learning-based anomaly detection algorithms exhibit superior detection accuracy over statistical and machine learning-based algorithms across seven diverse data sets.

Of the two classes of deep learning-based anomaly detection models considered, the end-to-end detection algorithm outperforms the hybrid detection models on a broad variety of data sets.
When the parametric GARCH model is employed for anomaly detection, we observe that the model has sufficiently high Recall, but very low Precision. The threshold chosen based on the validation set classifies a large number of non-anomalies as anomalous on the test set. Thus, the overall anomaly detection performance is degraded by the presence of several false positives, resulting in a low F1-score. Exceptions to this behavior are observed with the vehicle occupancy data set and, to an extent, with the bitcoin prices data. The magnitude of the anomalies is much higher than that of the non-anomalies in these data sets, which appears to be the reason behind this exception.
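The Precision/Recall trade-off described above is summarized by the F1-score. The minimal computation below, over boolean per-time-step labels (an assumed evaluation granularity), shows how false positives drag the F1-score down even at perfect Recall:

```python
import numpy as np

def precision_recall_f1(pred, truth):
    # pred, truth: boolean arrays marking anomalous time steps.
    tp = np.sum(pred & truth)    # true positives
    fp = np.sum(pred & ~truth)   # false positives
    fn = np.sum(~pred & truth)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A detector that flags every true anomaly (Recall 1.0) but also raises two false alarms per true detection already drops to an F1-score of 0.5, mirroring the behavior described for GARCH.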
The OCSVM model achieves higher detection accuracy than the statistical GARCH model but does not fare well against the deep learning variants. It likewise exhibits high Recall and poor Precision. On the positive side, a single value of the kernel coefficient (0.0001) proved to be a satisfactory fit for all the data sets considered.
On comparing the hybrid and end-to-end deep anomaly detection models, we see that the proposed end-to-end EVT-LSTM model shows superior detection accuracy. Its anomaly detection requires no post-processing tools, and its performance is at least as good as that of the hybrid models for the majority of the data sets considered. This observation suggests that a deep learning model customized for anomaly detection can provide better accuracy than running traditional detection algorithms on top of a deep learning model developed for forecasting. The only exception is observed in the ECG data set, which can be attributed to the anomaly labeling scheme followed. The labeling scheme employed in this data set marks an entire period of the ECG signal as anomalous if any point in that period is an anomaly. In other words, we deal with collective anomalies in this data set. The fraction of anomalies is, hence, higher in the ECG data set than in the other data sets, which have point anomalies. Thus, the anomalies cover a broad spectrum above the upper quartile of the prediction errors for the ECG data. Since Tukey's method thresholds the raw prediction errors based on the upper quartile, it yields good anomaly detection for the ECG data set. This finding suggests that a simple threshold based on the magnitude of the prediction errors may be sufficient when the fraction of anomalies in the data set is relatively high. In general, Tukey's method can detect most of the anomalies but produces a large number of false positives, similar to the GARCH and OCSVM models. This behavior is not desirable in an anomaly detection setting.
An important observation concerns the variability in the false positive regulator values of the various methods. Recalling the results from Table IV, we find high variability in the regulator values of the Gaussian and Tukey detection rules: the chosen thresholds vary significantly with the data set considered. While the Gaussian regulator varied within [15, 25], the Tukey threshold took values between [0.11, 12961.8]. The strong dependence of these anomaly thresholds on the time-sequence considered limits the applicability of such detection rules. On the other hand, the only free parameter for the EVT-based detection, the tail probability, does not appear to have a significant dependence on the data set and stayed within a narrow range. A false positive parameter with low dependency on the data sets is highly preferred in real-world settings, thereby strengthening the case for a detection algorithm based on EVT.
In summary, considering data sets from various verticals of ITS, we found that an end-to-end deep learning-based anomaly detection algorithm holds great potential for detecting abnormal traffic instances. Our proposed EVT-LSTM model accurately detected anomalous traffic speed, vehicle occupancy, travel time, and taxi demand instances, in addition to data sets from the medical and financial domains.
VII Concluding Remarks
Detection of anomalies is a crucial part of ITS (Intelligent Transportation Systems), as it can provide useful recommendations to urban planners and taxi aggregators, among others. In this study, we developed an end-to-end deep learning-based anomaly detection model for temporal data in transportation networks.
The proposed EVT-LSTM model incorporates concepts from EVT (Extreme Value Theory) into the objective function of an LSTM (Long Short-Term Memory) deep learning model. The output network representations from our proposed model can be utilized directly for anomaly detection, a clear advantage over the currently popular hybrid deep learning-based detection models, which require separate post-processing tools.
Our proposed model was compared against traditional statistical, machine learning, and deep learning-based anomaly detection models. When evaluated across seven diverse data sets, the EVT-LSTM model exhibited superior anomaly detection performance against these established baselines. The proposed model was able to detect true positives faithfully while incurring as few false positives as possible. We found strong evidence that a deep learning model customized for anomaly detection can provide better detection accuracy than hybrid deep anomaly detection techniques.
There are numerous avenues that merit future attention. To further validate the performance of the proposed algorithm, new data sets can be introduced. While our algorithm employs an objective function based on EVT, it would be useful to explore other objective functions to enhance the detection accuracy.
References
 [1] M. Lippi, M. Bertini, and P. Frasconi, “Collective traffic forecasting,” in Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 259–273.
 [2] Y. Zheng, Y. Liu, J. Yuan, and X. Xie, “Urban computing with taxicabs,” in Proceedings of the International Conference on Ubiquitous Computing. ACM, 2011, pp. 89–98.
 [3] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang, “T-drive: Driving directions based on taxi trajectories,” in Proceedings of the SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2010, pp. 99–108.
 [4] H.-w. Chang, Y.-c. Tai, and J. Y.-j. Hsu, “Context-aware taxi demand hotspots prediction,” International Journal of Business Intelligence and Data Mining, vol. 5, no. 1, pp. 3–18, 2010.
 [5] S. Phithakkitnukoon, M. Veloso, C. Bento, A. Biderman, and C. Ratti, “Taxiaware map: Identifying and predicting vacant taxis in the city,” in Proceedings of the International Joint Conference on Ambient Intelligence. Springer, 2010, pp. 86–95.
 [6] N. Davis, G. Raina, and K. Jagannathan, “Taxi demand forecasting: A hedgebased tessellation strategy for improved accuracy,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 11, pp. 3686–3697, 2018.
 [7] B. Li, D. Zhang, L. Sun, C. Chen, S. Li, G. Qi, and Q. Yang, “Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset,” in Proceedings of the International Conference on Pervasive Computing and Communications Workshops. IEEE, 2011, pp. 63–68.
 [8] C. Chen, D. Zhang, P. S. Castro, N. Li, L. Sun, S. Li, and Z. Wang, “iBOAT: Isolation-based online anomalous trajectory detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 806–818, 2013.
 [9] Y. Wang, D. Zhang, Y. Liu, B. Dai, and L. H. Lee, “Enhancing transportation systems via deep learning: A survey,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 1, pp. 144–163, 2019.
 [10] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407, 2019.
 [11] X. Kong, X. Song, F. Xia, H. Guo, J. Wang, and A. Tolba, “LoTAD: Long-term traffic anomaly detection based on crowdsourced bus trajectory data,” World Wide Web, vol. 21, no. 3, pp. 825–847, 2018.
 [12] A. Dairi, F. Harrou, Y. Sun, and M. Senouci, “Obstacle detection for intelligent transportation systems using deep stacked autoencoder and k-nearest neighbor scheme,” IEEE Sensors Journal, vol. 18, no. 12, pp. 5122–5132, 2018.
 [13] I. Markou, F. Rodrigues, and F. C. Pereira, “Use of taxi-trip data in analysis of demand patterns for detection and explanation of anomalies,” Transportation Research Record, vol. 2643, no. 1, pp. 129–138, 2017.
 [14] M. Wittmann, M. Kollek, and M. Lienkamp, “Event-driven anomalies in spatiotemporal taxi passenger demand,” in Proceedings of the International Conference on Intelligent Transportation Systems. IEEE, 2018, pp. 979–984.
 [15] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 15–73, 2009.
 [16] V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
 [17] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff, “LSTM-based encoder-decoder for multi-sensor anomaly detection,” arXiv preprint arXiv:1607.00148, 2016.
 [18] P. Oza and V. M. Patel, “One-class convolutional neural network,” IEEE Signal Processing Letters, vol. 26, no. 2, pp. 277–281, 2018.
 [19] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 4390–4399.
 [20] I. Golan and R. ElYaniv, “Deep anomaly detection using geometric transformations,” in Advances in Neural Information Processing Systems, 2018, pp. 9758–9769.
 [21] R. Chalapathy, A. K. Menon, and S. Chawla, “Anomaly detection using one-class neural networks,” arXiv preprint arXiv:1802.06360, 2018.
 [22] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” in Proceedings of the International Conference on Artificial Neural Networks. IET, 1999, pp. 850–855.
 [23] A. Siffer, P.-A. Fouque, A. Termier, and C. Largouet, “Anomaly detection in streams with extreme value theory,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 1067–1075.
 [24] N. Davis, G. Raina, and K. Jagannathan, “LSTM-based anomaly detection: Detection rules from extreme value theory,” in Proceedings of the EPIA Conference on Artificial Intelligence. Springer, 2019, pp. 572–583.
 [25] A. J. Fox, “Outliers in time series,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 34, no. 3, pp. 350–363, 1972.
 [26] E. Eskin, “Anomaly detection over noisy data using learned probability distributions,” in Proceedings of the International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 2000, pp. 255–262.
 [27] D. Chen, X. Shao, B. Hu, and Q. Su, “Simultaneous wavelength selection and outlier detection in multivariate regression of near-infrared spectra,” Analytical Sciences, vol. 21, no. 2, pp. 161–166, 2005.
 [28] R. Engle, “GARCH 101: The use of ARCH/GARCH models in applied econometrics,” Journal of Economic Perspectives, vol. 15, no. 4, pp. 157–168, 2001.
 [29] B. Schölkopf, A. J. Smola, F. Bach et al., Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 [30] C. Wang, K. Viswanathan, L. Choudur, V. Talwar, W. Satterfield, and K. Schwan, “Statistical techniques for online anomaly detection in data centers,” in Proceedings of the International Symposium on Integrated Network Management and Workshops. IEEE, 2011, pp. 385–392.
 [31] T. Ergen, A. H. Mirza, and S. S. Kozat, “Unsupervised and semi-supervised anomaly detection with LSTM neural networks,” arXiv preprint arXiv:1710.09207, 2017.
 [32] I. J. Myung, “Tutorial on maximum likelihood estimation,” Journal of Mathematical Psychology, vol. 47, no. 1, pp. 90–100, 2003.
 [33] L. De Haan and A. Ferreira, Extreme value theory: An introduction. Springer Science & Business Media, 2007.
 [34] S. D. Grimshaw, “Computing maximum likelihood estimates for the generalized pareto distribution,” Technometrics, vol. 35, no. 2, pp. 185–191, 1993.
 [35] N. Y. C. Taxi & Limousine Commission, “TLC trip record data,” 2016, https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, accessed 2019-10-01.
 [36] E. Keogh, J. Lin, and A. Fu, “HOT SAX: Efficiently finding the most unusual time series subsequence,” in Proceedings of the International Conference on Data Mining. IEEE, 2005, pp. 1–8.
 [37] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyperparameter optimization,” in Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
 [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [39] A. Singh, “Anomaly detection for temporal data using long short-term memory,” Master’s thesis, KTH Royal Institute of Technology, 2017.
 [40] M. A. Stephens, “EDF statistics for goodness of fit and some comparisons,” Journal of the American Statistical Association, vol. 69, no. 347, pp. 730–737, 1974.
 [41] F. J. Massey Jr, “The Kolmogorov-Smirnov test for goodness of fit,” Journal of the American Statistical Association, vol. 46, no. 253, pp. 68–78, 1951.