I. Introduction
The wide availability of spatial data acquisition devices, from specialized remote sensors to standard GPS-equipped smartphones, has resulted in the creation of several location-based applications. Although the object to be localized can vary (a vehicle, an animal, a person, etc.), in general, such trajectory data can be understood as a series of ordered points that characterize the object's motion [zheng2015trajectory].
In the context of trajectory data, anomaly detection is a critical task for monitoring spatial events and recognizing unexpected behaviors [meng2019overview]. One could define an anomaly, or outlier¹, as a data point which significantly differs from the overall observed data [aggarwal2015outlier]. It is worth emphasizing that, as opposed to standard single-point regression, in such a scenario a data example is a full trajectory or at least a segment of it.

¹In this work we use the terms outlier and anomaly interchangeably.

Meng et al. [meng2019overview]
propose a traditional anomaly detection taxonomy that includes methods based on classification, clustering, distance, density, and statistics. We pursue the latter, which consists of a model-based procedure that aims to explain the available data, mostly within a probabilistic density estimation framework. Anomalies are then detected by measuring how well the model fits a given new data point. Such an approach does not require labeled data, as it is a form of unsupervised learning. However, probabilistic density estimation approaches for trajectory anomaly detection usually consider simple distributions, such as a multivariate Gaussian
[hazel2000multivariate]. Even if a more flexible Gaussian mixture model (GMM) is used, such as in
[basharat2008learning, li2016anomaly], it is not straightforward to determine the number of components in the mixture.

In this work we tackle the task of trajectory data analysis by pursuing an approach based on normalizing flows (NFs, [rezende2015variational]), a general framework for estimating complex probability densities. In summary, an NF transforms an initial simple density through a sequence of invertible transformations to better explain the observed data. Following recent works, such as [rezende2015variational], each transformation, i.e., a flow, is parametrized by (possibly deep) neural networks. One of the main advantages of such a flow-based approach is the availability of the exact model log-likelihood, which is used as an objective function to jointly optimize all the model parameters. Furthermore, we can compute exact log-likelihood values for new data points, which we then apply as a coherent anomaly score.
NF-based anomaly detection approaches have been recently proposed [yamaguchi2019adaflow, iwata2019supervised]. In contrast to those works, we aim to evaluate NF models on trajectory data, which is inherently sequential. Thus, we include in our evaluations the so-called masked autoregressive flow (MAF), an autoregressive flow framework that directly models the conditional distributions of the input variables [papamakarios2017masked].
Trajectory data can be high dimensional due to the presence of several measured points within a single trajectory. Besides, distinct data examples can have different lengths, which cannot be straightforwardly compared. We propose a methodology that tackles both issues by considering fixed-size segments of the available trajectories. An NF generative model is then used to estimate the probability density of such segments. It is expected that segments belonging to trajectories considered normal correspond to higher model likelihoods than segments belonging to trajectories considered anomalous. In the test step, a single anomaly score for a new trajectory is computed from its segments using an aggregation function. Moreover, since we choose a trajectory data representation that incorporates timestamps, the time domain is considered in the modeling. We name our approach aggregated anomaly detection with normalizing flows (GRADINGS). We emphasize that, to the best of our knowledge, our work is the first evaluation of NF-based models in the task of anomaly detection in trajectory data.
We evaluate the proposed GRADINGS approach using real-world trajectories available in the Microsoft GeoLife data set [zheng2009mining, zheng2008, zheng2010]. The obtained experimental results indicate the feasibility of our solution. Our framework, especially the variant that considers the MAF model, achieves better anomaly detection results in comparison to standard techniques, such as the GMM and the local outlier factor (LOF, [breunig2000lof]) method.
II. Problem statement and data representation
The problem of anomaly detection can be loosely described as the task of finding data patterns that differ from what is expected and considered normal [agrawal2015survey]. In this work, such unexpected patterns are related to trajectories sufficiently different from the previously seen data, which is assumed to be mostly normal. We consider that trajectories can differ in terms of the spatial segments that compose them and/or the time period in which they occur.
In the following, we establish the adopted data representation and the main theoretical aspects of the unsupervised anomaly detection task.
II-A. Trajectory representation
Broadly speaking, a trajectory consists of a sequence of GPS points (i.e., latitude, longitude and timestamp) generated by a moving object on a monitoring system. Below we formally define it.
Definition II.1 (Trajectory)

A trajectory $T$ with size $n$ is defined as a finite ordered sequence

$$T = \langle p_1, p_2, \ldots, p_n \rangle, \tag{1}$$

where $p_i = (x_i, y_i, t_i)$ is a location point, such that $x_i$, $y_i$ and $t_i$ are respectively the $i$-th latitude, the $i$-th longitude, and the $i$-th timestamp of the trajectory $T$. Furthermore, we have $t_i < t_{i+1}$ for all $i \in \{1, \ldots, n-1\}$, which ensures a temporally ordered sequence of points.
II-B. Problem statement
Given a set of trajectories, the goal of a trajectory anomaly detection model is to find trajectories that are significantly different from the majority, which is considered to be normal. In other words, let $\mathcal{T} = \{T_1, T_2, \ldots, T_N\}$ be a set of trajectories of moving objects in a GPS monitoring system. The task of trajectory anomaly detection is to create a model from the available trajectories to evaluate the anomaly degree of any given trajectory $T$.
In this work, we follow an unsupervised anomaly detection procedure, which does not require the trajectories to be previously labeled as normal or anomalous. We detail such an approach as follows.
II-C. Unsupervised anomaly detection
Unsupervised anomaly detection approaches (or anomaly detection over noisy data [eskin2000]) make two assumptions about the data. The first one is that the data set contains a large number of normal elements and relatively few anomalies. The second assumption is that the abnormal data is generated by a different probability distribution [leung2005].

After the training step, whether a sample $x$ is normal or abnormal can be determined by a decision system as follows:
$$\text{label}(x) = \begin{cases} \text{abnormal}, & \text{if } s(x) > \tau, \\ \text{normal}, & \text{otherwise}, \end{cases} \tag{2}$$

where $\tau$ is a predefined threshold and $s(x)$ is an anomaly score. The threshold is a value that separates abnormal from normal data samples. In the context of supervised and semi-supervised anomaly detection, this value is usually chosen by using a validation set that contains known anomalous samples [schmidt2019, neema2019]. After that, metrics such as accuracy and the $F_1$ score are computed to judge the quality of the models.
Alternatively, and more common in the unsupervised learning setup, we can judge the model quality without choosing a single value for the threshold $\tau$. This can be done by finding the receiver operating characteristic (ROC) curve, which indicates the relation between the false positive rate and the true positive rate as the threshold is varied. A practical metric to summarize the information provided by the ROC curve is the area under the curve (AUC or AUROC) [ling2003].
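As an illustration, this threshold-free evaluation can be sketched in a few lines, assuming scikit-learn is available; the labels and scores below are toy values, not results from this work.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Binary labels (1 = anomaly) and anomaly scores (higher = more anomalous).
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.2, 0.4, 0.9, 0.35])

# The ROC curve sweeps the threshold tau over all observed score values.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUROC summarizes the whole curve in a single threshold-free number.
auroc = roc_auc_score(y_true, scores)
print(auroc)  # → 0.875
```

Here the AUROC of 0.875 reflects that one anomalous sample (score 0.35) is ranked below one normal sample (score 0.4).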
III. Classical anomaly detection techniques
Anomaly detection algorithms can be classified into several groups, e.g., based on distance, probability, reconstruction, and information theory [pimentel2014]. In this section, we describe two of the most well-known techniques used in anomaly detection problems.

III-A. Anomaly detection using the LOF algorithm
Local outlier factor (LOF, [breunig2000lof]) is an unsupervised distance-based anomaly detection algorithm. The anomaly score in LOF is computed by comparing the local density of a sample to those of its surrounding neighborhood. The local density is inversely correlated with the average distance from the point to its neighbors.
Let $X = \{x_1, \ldots, x_N\}$ be a set of data points. The set of $k$ nearest neighbors of $x \in X$ is denoted by $N_k(x)$ and defined as $N_k(x) = \text{kNN}(x, X \setminus \{x\}, k)$, where $\text{kNN}(\cdot)$ is the result of a $k$ nearest neighbor query [roussopoulos1995, Papadias2009].

Then, we define the $k$-distance of a sample $x$ as $d_k(x) = \max_{o \in N_k(x)} d(x, o)$, where $d(\cdot, \cdot)$ is the Euclidean distance.

We use the above to define the reachability distance of $x$ with respect to another sample $o$ as $rd_k(x, o) = \max\{d_k(o), d(x, o)\}$.

The local reachability density of $x$ with respect to $N_k(x)$ is then denoted by $lrd_k(x)$ and defined as

$$lrd_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} rd_k(x, o) \right)^{-1}.$$

Using all the previous definitions, the LOF anomaly score can be finally formalized as the average ratio between the local reachability densities of the neighbors of $x$ and that of $x$ itself:

$$\text{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{lrd_k(o)}{lrd_k(x)}. \tag{3}$$
Note that the above score measures the local density deviation of a given data point with respect to its neighbors.
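This score is implemented in scikit-learn's `LocalOutlierFactor`; a minimal sketch on toy 2-D points (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four points in a tight cluster plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [5.0, 5.0]])

lof = LocalOutlierFactor(n_neighbors=3)
lof.fit_predict(X)

# negative_outlier_factor_ stores -LOF(x): the lower, the more anomalous.
scores = -lof.negative_outlier_factor_
print(int(scores.argmax()))  # → 4 (the isolated point)
```

Cluster points obtain LOF values close to 1, since their local density matches that of their neighbors, while the isolated point obtains a much larger score.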
III-B. Gaussian mixture model for anomaly detection
One way of computing the anomaly score in Eq. (2) is to use a probability density estimator. This approach first trains the density estimator and then uses the negative log-likelihood of each test point as its anomaly score, i.e.,

$$s(x) = -\log \hat{p}(x). \tag{4}$$
In such a context, the Gaussian mixture model (GMM) is a common choice. A GMM uses a linear combination of Gaussian density functions to approximate an unknown probability distribution. The parameters of each component are usually adjusted using the Expectation-Maximization (EM, [dempster1977]) algorithm.

Consider a data set $X = \{x_1, \ldots, x_N\}$, where $x_i \in \mathbb{R}^d$. We assume that the points from $X$ are generated in an i.i.d. fashion from an underlying density $p(x)$. Furthermore, suppose that $p(x)$ is defined as a finite mixture model with $K$ components:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{5}$$
where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a multivariate Gaussian density with mean vector $\mu_k$ and covariance matrix $\Sigma_k$, and the $\pi_k$ are the mixture weights, which are restricted to be nonnegative and to sum up to 1, i.e., $\sum_{k=1}^{K} \pi_k = 1$. The mixture weight $\pi_k$ represents the probability that a randomly selected data point was generated by the $k$-th component. After the optimization of the GMM parameters via the EM algorithm, Eq. (5) can be directly applied to compute an anomaly score for new data points.

It is worth noting that flow-based generative models constitute a flexible alternative to density estimation with standard techniques such as the GMM. In the next section we detail the flow-based models used in this work.
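As a concrete sketch of the GMM scoring route of Eq. (4), using scikit-learn (an assumption; any EM implementation would do) on synthetic two-cluster data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# "Normal" data: two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
               rng.normal(3.0, 0.1, size=(200, 2))])

# Fit a 2-component GMM via the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

def anomaly_score(point):
    # Negative log-likelihood under the mixture: higher = more anomalous.
    return -gmm.score_samples(np.atleast_2d(point))[0]

# A point far from both clusters scores much higher than an in-cluster one.
print(anomaly_score([10.0, 10.0]) > anomaly_score([0.0, 0.0]))  # → True
```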
IV. Probability density estimation via normalizing flows
NF models are powerful tools for estimating complicated probability densities [zhisheng2019, kingma2018glow]. Two merits of these models are exact inference and exact log-likelihood evaluation [kingma2018glow]. The latter is especially valuable in the context of anomaly detection.
Let $x \in \mathbb{R}^d$ be a random vector with unknown distribution $p(x)$. In the most general flow-based model, the generative process is defined as [rezende2015]

$$z \sim p(z), \tag{6}$$
$$x = f(z), \tag{7}$$

where $z$ is a latent (unobserved) variable and $p(z)$ is a simple and known distribution, e.g., a multivariate Gaussian. The function $f$, called a bijection, is an invertible function such that $z = f^{-1}(x)$. If the transformation $f$ is considered to be a composition of $K$ successive mappings, $f = f_K \circ \cdots \circ f_2 \circ f_1$, and we apply the change of variables rule, the log-likelihood of the random variable $x$ can be written as [rezende2015]

$$\log p(x) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|, \tag{8}$$

where $z_0 = f^{-1}(x)$, $z_k = f_k(z_{k-1})$, and $z_K = x$.
The usual training criterion of flow-based generative models is simply the negative log-likelihood over the training set $X = \{x_i\}_{i=1}^{N}$:

$$\mathcal{L}(X) = -\sum_{i=1}^{N} \log p(x_i). \tag{9}$$
We summarize the evaluated NF models as follows.
IV-A. Real NVP
Real-valued non-volume preserving (RealNVP, [dinh2017density]) is a type of NF that uses a bijection called a coupling layer, which transforms only some input dimensions via functions that depend on the untransformed dimensions. If $1:d'$ denotes the sequential indexes of the $d' < d$ untransformed dimensions, the components of the layer output $y$ are given by

$$y_{1:d'} = z_{1:d'}, \tag{10}$$
$$y_{d'+1:d} = z_{d'+1:d} \odot \exp\left(s(z_{1:d'})\right) + t(z_{1:d'}), \tag{11}$$

where $s(\cdot)$ and $t(\cdot)$ respectively represent scale and translation functions parametrized by neural networks, and $\odot$ is the element-wise product operator. The elements in each flow are permuted to different orders, allowing all of the inputs to have a chance of being altered.
The Jacobian matrix of the above described transformation can be calculated as

$$\frac{\partial y}{\partial z} = \begin{bmatrix} \mathbb{I}_{d'} & 0 \\ \dfrac{\partial y_{d'+1:d}}{\partial z_{1:d'}} & \operatorname{diag}\left(\exp\left(s(z_{1:d'})\right)\right) \end{bmatrix}, \tag{12}$$

where $\mathbb{I}_{d'}$ is the $d'$-order identity matrix, $0$ is a zero matrix, and $\operatorname{diag}(\exp(s(z_{1:d'})))$ is a diagonal matrix whose elements are given by the vector $\exp(s(z_{1:d'}))$. The Jacobian matrix in Eq. (12) is triangular; thus, its determinant is a simple product of the diagonal terms:

$$\det\left(\frac{\partial y}{\partial z}\right) = \exp\left(\sum_{j} s(z_{1:d'})_j\right). \tag{13}$$
Since the computation of the Jacobian determinant of the mentioned transformations does not involve calculating the inverse of the functions $s(\cdot)$ and $t(\cdot)$, such functions can be arbitrarily complex, usually deep neural networks [dinh2017density]. All the model parameters (i.e., the networks' weights) are jointly optimized by minimizing Eq. (9) via stochastic gradient descent methods.
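The mechanics of Eqs. (10)-(13) can be checked numerically with a single coupling layer. In this sketch the scale and translation networks are replaced by hand-written toy functions, which is legitimate precisely because $s$ and $t$ never need to be inverted:

```python
import numpy as np

d_split = 2  # number of untransformed dimensions (d' in the text)

def s(z1):  # toy stand-in for the scale network
    return 0.5 * np.tanh(z1)

def t(z1):  # toy stand-in for the translation network
    return z1 ** 2

def coupling_forward(z):
    z1, z2 = z[:d_split], z[d_split:]
    y = np.concatenate([z1, z2 * np.exp(s(z1)) + t(z1)])  # Eqs. (10)-(11)
    log_det = np.sum(s(z1))                               # log of Eq. (13)
    return y, log_det

def coupling_inverse(y):
    y1, y2 = y[:d_split], y[d_split:]
    return np.concatenate([y1, (y2 - t(y1)) * np.exp(-s(y1))])

z = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = coupling_forward(z)
print(np.allclose(coupling_inverse(y), z))  # → True: exact invertibility
```

Note that the log-determinant is obtained for free from the scale outputs, without ever forming the Jacobian matrix.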
IV-B. Masked autoregressive flow (MAF)
We can decompose any joint density $p(x)$ of high-dimensional data into a product of one-dimensional conditionals using the chain rule of probability:

$$p(x) = \prod_{i=1}^{d} p(x_i \mid x_{1:i-1}). \tag{14}$$
The masked autoregressive flow (MAF, [papamakarios2017masked, germain2015made]) uses the above autoregressive constraint to model the probability density, with conditionals parameterized as single Gaussians. Thus, the $i$-th conditional probability is given by

$$p(x_i \mid x_{1:i-1}) = \mathcal{N}\left(x_i \mid \mu_i, (\exp \alpha_i)^2\right), \quad \mu_i = f_{\mu_i}(x_{1:i-1}), \quad \alpha_i = f_{\alpha_i}(x_{1:i-1}), \tag{15}$$

where $f_{\mu_i}$ and $f_{\alpha_i}$ are two unconstrained scalar functions that compute the mean and log-standard deviation of the $i$-th conditional given all previous variables. The bijective transformation of MAF generates each $x_i$ conditioned on the past dimensions $x_{1:i-1}$:

$$x_i = z_i \exp(\alpha_i) + \mu_i. \tag{16}$$
As a consequence of the autoregressive nature of this transformation, the dimension $x_i$ of the resulting variable depends only on the dimensions $z_{1:i}$ of the input variable $z$. Thus, the Jacobian matrix of this transformation is triangular [kingma2016] and its determinant is equal to the product of its diagonal terms:

$$\left| \det \frac{\partial f}{\partial z} \right| = \exp\left( \sum_{i=1}^{d} \alpha_i \right). \tag{17}$$
As in the RealNVP, the functions $f_{\mu_i}$ and $f_{\alpha_i}$ can be arbitrarily complex. In the MAF model, these functions are implemented by an efficient feedforward network called the Masked Autoencoder for Distribution Estimation (MADE, [germain2015made]), which takes $x$ as input and outputs the means and log-standard deviations for all dimensions in a single network pass.

The nature of MAF transformations allows more flexible generalizations when compared to the RealNVP model. As one can see, if for the first $d'$ dimensions we fix $\mu_i = 0$ and $\alpha_i = 0$ and apply the MAF transformations to the remaining dimensions, the MAF structure becomes equivalent to the RealNVP. Hence, the coupling layer of the RealNVP can be seen as a special case of the MAF transformation [papamakarios2017masked].
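A small numerical sketch of how Eqs. (14)-(17) combine into an exact log-likelihood. The MADE conditioners are replaced here by hand-written autoregressive toy functions of the preceding dimensions (an illustrative assumption; a real MAF learns them):

```python
import numpy as np

def mu_fn(x_prev):     # toy conditioner for the mean
    return 0.1 * np.sum(x_prev)

def alpha_fn(x_prev):  # toy conditioner for the log-standard deviation
    return 0.05 * np.sum(x_prev)

def maf_log_prob(x):
    """Exact log p(x): invert Eq. (16) dimension by dimension and add
    the log-determinant term from Eq. (17)."""
    z = np.empty_like(x)
    log_det = 0.0
    for i in range(len(x)):
        mu, alpha = mu_fn(x[:i]), alpha_fn(x[:i])
        z[i] = (x[i] - mu) * np.exp(-alpha)  # inverse of Eq. (16)
        log_det -= alpha                     # contributes -sum_i alpha_i
    # Base density: standard multivariate Gaussian on z.
    log_base = -0.5 * np.sum(z ** 2) - 0.5 * len(x) * np.log(2 * np.pi)
    return log_base + log_det

x = np.array([0.5, -0.3, 1.1])
print(np.isfinite(maf_log_prob(x)))  # → True
```

The same value can be obtained by summing the Gaussian conditional log-densities of Eq. (15) directly, which confirms the consistency of the two views.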
V. Proposed Methodology
We can compute an anomaly score for a sequential data sample either directly or by first computing scores for local subsections and then aggregating them. These subsections are called pattern fragments, segments, sliding windows, motifs, or $n$-grams [gupta2014]. In Definition V.1 we present a formal description of these objects.

Definition V.1 (Trajectory segment)
Given a trajectory $T$ with length $n$, the $i$-th segment of $T$ with a user-defined length $m \le n$ is a finite ordered sequence of location points, denoted by

$$s_i^{(m)} = \langle p_i, p_{i+1}, \ldots, p_{i+m-1} \rangle, \tag{18}$$

where $1 \le i \le n - m + 1$ and each $p_j$ is a location point of $T$.
Segment-based techniques usually perform better when compared to direct detection methods [gupta2014]. Furthermore, they enable handling large sequences with different lengths. In the following we detail our proposal, named aggregated anomaly detection with normalizing flows (GRADINGS), which consists of three main steps.
In the first step, GRADINGS transforms the set of trajectories into a set of trajectory segments. Thus, given a set of trajectories $\mathcal{T} = \{T_1, \ldots, T_N\}$, the transformed set is defined by

$$X = \left\{ \phi\left(s_i^{(m)}\right) \;\middle|\; s_i^{(m)} \text{ is a segment of } T_j,\ 1 \le j \le N \right\}, \tag{19}$$

where $\phi(\cdot)$ is a function that flattens a segment into a $d$-dimensional row vector by concatenating the features of its $m$ location points.
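This first step can be sketched as a sliding window plus flattening; the point layout (latitude, longitude, encoded time) and the function names are illustrative:

```python
import numpy as np

def segments(trajectory, m):
    """All contiguous segments of length m; there are n - m + 1 of them."""
    n = len(trajectory)
    return [trajectory[i:i + m] for i in range(n - m + 1)]

def flatten(segment):
    """Concatenate the point features into a single row vector."""
    return np.concatenate([np.asarray(p, dtype=float) for p in segment])

# Toy trajectory: (latitude, longitude, encoded time) per point.
traj = [(39.90, 116.40, 0.10), (39.91, 116.41, 0.20),
        (39.92, 116.42, 0.30), (39.93, 116.43, 0.40)]

segs = segments(traj, m=3)
print(len(segs), flatten(segs[0]).shape)  # → 2 (9,)
```

Every flattened segment then becomes one fixed-size training example for the density estimator, regardless of the original trajectory length.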
The second step consists of estimating the density $p(x)$ from the available trajectory segments. This step is performed using one of the NF generative models described in Section IV. At this point, GRADINGS is able to compute the anomaly degree of any trajectory segment $s$, denoted by $a(s)$:

$$a(s) = -\log p\left(\phi(s)\right). \tag{20}$$
In the last step, we aggregate the anomaly scores of the segments to compute a single anomaly score for the trajectory. More specifically, given a trajectory $T$, its anomaly score, denoted by $A(T)$, can be computed using an aggregation function $g(\cdot)$ that combines the anomaly degrees of the segments of $T$, i.e.,

$$A(T) = g\left(\left\{ a(s) \mid s \text{ is a segment of } T \right\}\right). \tag{21}$$
Possible choices for the aggregation function include the median and the average.
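The aggregation step of Eq. (21) is a one-liner either way; the sketch below (function name is illustrative) also shows why the median is the more robust choice when a single segment score is spuriously high:

```python
import statistics

def trajectory_score(segment_scores, agg="median"):
    """Combine per-segment anomaly degrees into one trajectory score."""
    if agg == "median":
        return statistics.median(segment_scores)
    if agg == "average":
        return statistics.fmean(segment_scores)
    raise ValueError(f"unknown aggregation: {agg}")

scores = [1.2, 1.1, 9.7, 1.3]  # one spuriously high segment score
print(trajectory_score(scores, "median"))  # → 1.25
print(trajectory_score(scores, "average"))
```

The single outlying segment barely moves the median but inflates the average, which anticipates the behavior observed in the experiments below.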
VI. Experiments
To assess the performance of the proposed methodology, we conduct experiments on real-world data comparing GRADINGS, using either the RealNVP or the MAF estimator, against standard LOF and GMM anomaly detectors.
VI-A. Data set description
We consider version 1.3 of the Microsoft GeoLife data set [zheng2009mining, zheng2008, zheng2010], comprised of real trajectory data measured from 182 users over a period of five years (from April 2007 to August 2012), which is equivalent to 17,621 trajectories. For 73 users, the transportation mode is labeled, such as driving, taking a bus, riding a bike, and walking. Each trajectory represents a complete trip from a departure to an arrival location.
In our experimental setting, we use a subset of the data that consists of the trajectories located in Beijing, China, made by car (126 trajectories) or bus (365 trajectories). We define two different scenarios. In the first one, called CAR→BUS, we use the car trajectories as in-distribution data (i.e., as "normal" patterns) and the bus trajectories as out-of-distribution data (i.e., as "anomalies"). In the second one, called BUS→CAR, we switch the roles: the bus trajectories act as in-distribution data and the car trajectories are seen as out-of-distribution samples. For each scenario we use segments with three different lengths (numbers of location points), accounting for a total of six data sets, each containing segments of both car and bus trajectories.
The timestamp information of the trajectory data is first converted to the hour of the week (e.g., Tuesday, 12:30 is equal to 36.5 if we consider Monday as the start of the week) and then encoded into two variables using sine and cosine projections of the corresponding angle over the 168 hours of the week. This encoding ensures that similar periodic times are close in the input space, even in different weeks (e.g., Sunday, 23:59 is close to Monday, 00:00).
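A sketch of this cyclic encoding (the 168-hour week constant and the helper name are the only assumptions beyond the text):

```python
import math

HOURS_PER_WEEK = 168.0

def encode_hour_of_week(h):
    """Map an hour-of-week h in [0, 168) to a point on the unit circle."""
    angle = 2.0 * math.pi * h / HOURS_PER_WEEK
    return math.sin(angle), math.cos(angle)

sun_2359 = encode_hour_of_week(6 * 24 + 23 + 59 / 60)  # Sunday, 23:59
mon_0000 = encode_hour_of_week(0.0)                    # Monday, 00:00
print(math.dist(sun_2359, mon_0000) < 0.01)  # → True: week wrap-around
```

On the raw hour-of-week axis these two times are almost 168 apart; on the circle they are nearly coincident, which is exactly the property the model needs.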
VI-B. Results and discussion
We report results for individual segment scores and for full trajectory scores. In the latter case, we consider both the average and the median as score aggregation functions (see Eq. (21)).
We train all the models on the normal data and then apply them to unseen normal samples as well as abnormal samples. The normal data have been partitioned into two folds: the first one, with 80% of the data, is used for training, while the remaining 20% is grouped with the abnormal data to compute the evaluation metrics.
For the MAF and RealNVP models we use stacked flows of neural networks as bijective functions, with the MADE structure in the case of the MAF model. Each network has two hidden layers. Both models were trained for 300 epochs. For the GMM model, hyperparameter tuning is performed via grid search with cross-validation on the training data. The number of neighbors used by the LOF algorithm was determined using the heuristic presented in [breunig2000lof].

The ROC curves and the corresponding AUROC values are presented in Figs. 1 and 2. In addition, we present in Table I the false positive rate obtained when we fix a true positive rate of 80%, named the FPR80 metric.
In all evaluated scenario-variant pairs, the NF-based solutions performed better in terms of AUROC. In most of them, the GRADINGS framework with the MAF model was the best. In terms of FPR80, the MAF also achieved better results in 16 out of 18 evaluations, with the RealNVP being slightly better in the others. It is important to highlight that the use of a segment aggregation strategy considerably increased the performance in terms of AUROC in all experiments. In particular, models with the median as the aggregation function achieved the best results in terms of both AUROC and FPR80.
In terms of the segment length, when using the median as the aggregation function, we can see that the detection error decreases as the segment length grows. On the other hand, when using the average as the aggregation function, the performance degrades as the segment length increases. Since the average score is more sensitive to outliers, we hypothesize that increasing the segment length may cause more outlying points to appear within a single segment. The results that consider only the individual segments do not show any specific behavior with respect to the segment length.
In summary, the obtained results indicate the importance of both main ingredients of the proposed GRADINGS framework: (i) the NF-based density estimation; and (ii) the aggregation of the individual segment degrees into a single trajectory anomaly score. Furthermore, we have also verified that, in general, the combination of the autoregressive MAF model, the median aggregation function, and a larger segment length (e.g., 30) offers the best performance.

TABLE I: FPR80 results (lower is better). Within each variant, the rows correspond to the three evaluated segment lengths, in increasing order.

Scenario | Variant | MAF   | RealNVP | GMM   | LOF
---------|---------|-------|---------|-------|------
CAR→BUS  | segment | 0.423 | 0.643   | 0.698 | 0.719
         |         | 0.498 | 0.640   | 0.653 | 0.688
         |         | 0.608 | 0.652   | 0.699 | 0.727
         | average | 0.342 | 0.335   | 0.376 | 0.465
         |         | 0.272 | 0.435   | 0.500 | 0.550
         |         | 0.361 | 0.577   | 0.556 | 0.622
         | median  | 0.245 | 0.375   | 0.308 | 0.481
         |         | 0.247 | 0.335   | 0.353 | 0.419
         |         | 0.201 | 0.361   | 0.315 | 0.462
BUS→CAR  | segment | 0.603 | 0.592   | 0.597 | 0.684
         |         | 0.510 | 0.633   | 0.682 | 0.692
         |         | 0.489 | 0.517   | 0.631 | 0.689
         | average | 0.252 | 0.310   | 0.482 | 0.712
         |         | 0.529 | 0.601   | 0.635 | 0.704
         |         | 0.311 | 0.555   | 0.622 | 0.732
         | median  | 0.226 | 0.330   | 0.761 | 0.771
         |         | 0.190 | 0.294   | 0.744 | 0.819
         |         | 0.055 | 0.328   | 0.564 | 0.747
VII. Conclusion and Further Work
Anomaly detection is a challenging task with important practical applications. In the context of trajectory data, GPS measurements are widely available. However, high-dimensional patterns and the lack of labeled data hinder the application of standard techniques.
In this work we have proposed GRADINGS, an unsupervised density estimation methodology that leverages flexible normalizing flows, more specifically the RealNVP and the MAF structures. GRADINGS aggregates the analytical log-likelihood values of trajectory segments into a single robust anomaly score, which enables the use of trajectories with distinct lengths. The empirical results obtained using real-world data showed promising performance compared to the LOF and GMM baselines, especially when considering the autoregressive MAF-based version.
The present research outcome encourages us to pursue additional NF approaches for trajectory anomaly detection. For instance, future work shall evaluate the use of convolution-based flows, such as the so-called Glow [kingma2018glow], which can handle data with multiple-channel representations. Models with more complex invertible transformations, such as those recently proposed in [huang2018neural, oliva2018transformation, durkan2019neural], are also worthy subjects of future investigation.