Log In Sign Up

Anomaly Detection in Trajectory Data with Normalizing Flows

The task of detecting anomalous data patterns is as important in practical applications as challenging. In the context of spatial data, recognition of unexpected trajectories brings additional difficulties, such as high dimensionality and varying pattern lengths. We aim to tackle such a problem from a probability density estimation point of view, since it provides an unsupervised procedure to identify out of distribution samples. More specifically, we pursue an approach based on normalizing flows, a recent framework that enables complex density estimation from data with neural networks. Our proposal computes exact model likelihood values, an important feature of normalizing flows, for each segment of the trajectory. Then, we aggregate the segments' likelihoods into a single coherent trajectory anomaly score. Such a strategy enables handling possibly large sequences with different lengths. We evaluate our methodology, named aggregated anomaly detection with normalizing flows (GRADINGS), using real world trajectory data and compare it with more traditional anomaly detection techniques. The promising results obtained in the performed computational experiments indicate the feasibility of the GRADINGS, specially the variant that considers autoregressive normalizing flows.


page 1

page 2

page 3

page 4


Supervised Anomaly Detection based on Deep Autoregressive Density Estimators

We propose a supervised anomaly detection method based on neural density...

Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time Series

Anomaly detection is a widely studied task for a broad variety of data t...

Flow-based Self-supervised Density Estimation for Anomalous Sound Detection

To develop a machine sound monitoring system, a method for detecting ano...

Anomaly Detection with Density Estimation

We leverage recent breakthroughs in neural density estimation to propose...

AND: Autoregressive Novelty Detectors

We propose an unsupervised model for novelty detection. The subject is t...

Collaborative Anomaly Detection

In recommendation systems, items are likely to be exposed to various use...

Video Anomaly Detection by Estimating Likelihood of Representations

Video anomaly detection is a challenging task not only because it involv...

I Introduction

The wide availability of spatial data acquisition devices, from specialized remote sensors to standard GPS equipped smartphones, has resulted in the creation of several location-based applications. Although the object to be localized can vary (a vehicle, an animal, a person, etc.), in general, such trajectory data can be understood as a series of ordered points that characterize the object motion [zheng2015trajectory].

In the context of trajectory data, anomaly detection is a task critical to monitor spatial events and enable recognition of unexpected behaviors [meng2019overview]

. One could define an anomaly, or outlier

111In this work we use the terms outlier and anomaly interchangeably., as a data point which significantly differs from the overall observed data [aggarwal2015outlier]. It is worth emphasizing that, as opposed to single point standard regression, in such a scenario a data example is a full trajectory or at least a segment of it.

Meng et al. [meng2019overview]

propose a traditional anomaly detection taxonomy that includes methods based on classification, clustering, distance, density and statistics. We pursue the latter, which consists in a model-based procedure that aims to explain the available data, mostly within a probabilistic density estimation framework. Anomalies are then detected by measuring how much the model fits a given new data point. Such an approach does not require labeled data, as it is a form of unsupervised learning. However, probabilistic density estimation approaches for trajectory anomaly detection usually consider simple distributions, such as a multivariate Gaussian


. Even if a more flexible Gaussian mixture model (GMM) is used, such as in

[basharat2008learning, li2016anomaly], it is not straightforward to determine the number of components in the mixture.

In this work we tackle the task of trajectory data analysis by pursuing an approach based on normalizing flows (NFs, [rezende2015variational]), a general framework for estimating complex probabilistic densities. In summary, a NF transforms an initial simple density by a sequence of invertible transformations to better explain the observed data. Following recent works, such as [rezende2015variational], each transformation, i.e. a flow, is parametrized by (possibly deep) neural networks. One of the main advantages of such flow-based approach is the available exact model log-likelihood, which is used as an objective function to jointly optimize all the model parameters. Furthermore, we can compute exact log-likelihood values for new data points, which we will then apply as a coherent anomaly score.

NF-based anomaly detection approaches have been recently proposed [yamaguchi2019adaflow, iwata2019supervised]. In contrast to those works, we aim to evaluate NF models with trajectory data, which is inherently sequential. Thus, we include in our evaluations the so-called masked autoregressive flow (MAF), an autoregressive flow framework that directly models the conditional distributions of the input variables [papamakarios2017masked].

Trajectory data can be high dimensional due to the presence of several measured points within a single trajectory. Besides, distinct data examples can have different lengths, which cannot be straightforwardly compared. We propose a methodology that tackles both issues by considering fixed-size segments of the available trajectories. A NF generative model is than used to estimate the probability density of such segments. It is expected that segments which belong to trajectories considered normal correspond to higher model likelihoods than segments that belong to trajectories considered anomalous. In the test step, a single anomaly score for a new trajectory is computed from its segments using an aggregation function. Moreover, since we choose a trajectory data representation that incorporates timestamps, the time domain is considered in the modeling. We name our approach aggregated anomaly detection with normalizing flows (GRADINGS). We emphasize that, to the best of our knowledge, our work is the first evaluation of NF-based models in the task of anomaly detection in trajectory data.

We evaluate the proposed GRADINGS approach using real-world trajectories available in the Microsoft GeoLife data set [zheng2009mining, zheng2008, zheng2010]. The obtained experimental results indicate the feasibility of our solution. Our framework, specially the variant that considers the MAF model, achieves better anomaly detection results in comparison to standard techniques, such as the GMM and the local outlier factor (LOF, [breunig2000lof]) method.

Ii Problem statement and data representation

The problem of anomaly detection can be vaguely described as the task of finding data patterns that differ from what is expected and is considered normal [agrawal2015survey]. In this work, such unexpected patterns are related to trajectories sufficiently different from the previously seen data, which is assumed to be mostly normal. We consider that trajectories can differ in terms of spatial segments that comprise them and/or the time period they occur.

As follows we establish the adopted data representation and the main theoretical aspects of the unsupervised anomaly detection task.

Ii-a Trajectory representation

Broadly speaking, a trajectory consists of a sequence of GPS points (i.e., latitude, longitude and timestamp) generated by a moving object on a monitoring system. Below we formally define it.

Definition II.1 (Trajectory)

A trajectory with size is defined as a finite ordered sequence


where is a location point, such that are respectively the -th latitude, -th longitude, and -th timestamp of the trajectory . Furthermore, we have , for all , which ensures a temporal ordered sequence of points.

Ii-B Problem statement

Given a set of trajectories, the goal of a trajectory anomaly detection model is to find trajectories that are significantly different from the majority, considered to be normal. In other words, let  be a set of trajectories of moving objects in a GPS monitoring system. The task of trajectory anomaly detection is to create a model from the available trajectories to evaluate the anomaly degree of any given trajectory .

In this work, we follow an unsupervised anomaly detection procedure, which does not require the trajectories to be previously labeled as normal or anomalous. We detail such an approach as follows.

Ii-C Unsupervised anomaly detection

Unsupervised anomaly detection approaches (or anomaly detection over noisy data [eskin2000]

) makes two assumptions over the data. The first one is that the dataset contains a large number of normal elements and relatively few anomalies. The second assumption is that the abnormal data is generated by a different probability distribution 


After the training step, the determination if a sample is normal or abnormal can be made by a decision system as follows:


where is a predefined threshold and is an anomaly score. The threshold is a value that separates abnormal from normal data samples. In the context of supervised and semi-supervised anomaly detection, this value is usually chosen by using a validation set that contains known anomalous samples [schmidt2019, neema2019]. After that, metrics such as accuracy and -score are computed to judge the quality of the models.

Alternatively, and more common to the unsupervised learning setup, we can judge the model quality without the choice of a single value for the threshold . This can be done by finding the receiver operating characteristic (ROC) curve, which indicates the relation between the false positive rate and the true positive rate as the threshold is changed. A practical metric to summarize the information provided by the ROC curve is the area under the curve (AUC or AUROC) [ling2003].

Iii Classical anomaly detection techniques

Anomaly detection algorithms can be classified in several groups based on distance, probability, reconstruction, and information theory 

[pimentel2014]. In this section, we describe two of most know techniques used in anomaly detection problems.

Iii-a Anomaly detection using the LOF algorithm

Local outlier factor (LOF, [breunig2000lof]) is an unsupervised distance-based anomaly detection algorithm. The anomaly score in LOF is computed by comparing the local density of a sample to the surrounding neighborhood. The local density is inversely correlated with the average distance from the point to its neighborhood.

Let be a set of data points. The set of -nearest neighbors of is denoted by and defined as , where is the result of a -nearest neighbor query [roussopoulos1995, Papadias2009].

Then, we define the -distance neighborhood  of a sample as where is the Euclidean distance.

We use the above to define the reachability distance of with respect to another sample as .

The local reachability density of with respect to is then denoted by and defined as

Using all the previous definitions, the LOF anomaly score can be finally formalized as the average ratio of local reachability densities with respect to and its -neighborhood:


Note that the above score measures the local density deviation of a given data point with respect to its neighbors.

Iii-B Gaussian mixture model for anomaly detection

One way of computing an anomaly score in Eq. (2) is to use a probability density estimator. This approach first trains the density estimator and then uses the negative log-likelihood of each testing data as an anomaly score, i.e.,


In such a context, the Gaussian mixture model (GMM) is a common choice. A GMM uses a linear combination of Gaussian density functions to approximate an unknown probability distribution. The parameters of each the component are usually adjusted using the Expectation Maximization (EM, 

[dempster1977]) algorithm.

Consider a data set , where . We assume that the points from are generated in an i.i.d. fashion from an underlying density . Furthermore, suppose that is defined as a finite mixture model with components:



is a multivariate Gaussian density with mean vector

and covariance matrix ; are the mixture weights, which are restricted to be non-negative and sum up to 1, i.e., . The mixture weights represent the probability that a randomly selected data point was generated by the component . After the optimization of the GMM parameters via the EM algorithm, Eq. (5) can be directly applied as an anomaly score for new data points.

It is worth noting that flow-based generative models constitute a flexible alternative to density estimation with standard techniques such as the GMM. In the next section we detail the flow-based models used in this work.

Iv Probability density estimation via normalizing flows

NF models are powerful tools for estimating complicated probability densities [zhisheng2019, kingma2018glow]. Two merits of these models are the exact inference and log-likelihood evaluation [kingma2018glow]. The latter is specially valuable in the context of anomaly detection.

Let be a random vector with unknown distribution . In the most general flow-based model, the generative process is defined as [rezende2015]


where is a latent (unobserved) variable and is a simple and known distribution, e.g., a multivariate Gaussian. The function , called bijective, is an invertible function such that . If the transformation is considered to be a composition of

successive mappings and we apply the change of variables rule, the log-likelihood of the random variable

can be written as [rezende2015]


where , and .

The usual training criterion of flow-based generative models is simply the negative log-likelyhood over the training set :


We summarize the evaluated NF models as follows.

Iv-a Real NVP

Real-valued non-volume preserving (Real-NVP, [dinh2017density]) is a type of NF that uses a bijection called coupling layer that transforms only some input dimensions via functions that depend on the untransformed dimensions. If denotes the sequential indexes of the untransformed dimensions, the components of the layer output are given by


where respectively represent scale and translation functions parametrized by neural networks, and is the element-wise product operator. The elements in each flow are permuted to different orders, allowing all of the inputs to have a chance to be altered.

The Jacobian matrix of the above described transformations can be calculated using


where is the

-order identity matrix,

is a zero matrix and

is a diagonal matrix whose elements are equal to the vector . The Jacobian matrix in Eq. (12) is triangular, thus, its determinant is a simple product of the diagonal terms:


Since the computation of the Jacobian determinant of the mentioned transformations does not involve calculating the inverse of the functions and , such functions can be arbitrarily complex, usually a deep neural network [dinh2017density]. All the model parameters (i.e., the networks’ weights) are jointly optimized via maximization of the Eq. (9

) via stochastic gradient descent methods.

Iv-B Masked autoregressive flow (MAF)

We can decompose any joint density

of high-dimensional data into a product of one-dimensional conditionals using the chain rule of probabilities:


The Masked Autoregressive Flow (MAF, [papamakarios2017masked, germain2015made]) uses the above autoregressive constraint to model the probability density whose conditionals are parameterized as single Gaussians. Thus, the -th conditional probability is given by



are two unconstrained scalar functions that compute the mean and log-standard deviation of the

-th conditional given all previous variables. The bijective transformation of MAF generates each conditioned on the past dimensions ,


As a consequence of the autoregressive nature of this transformation, the dimension of the resulting variable depends only on the dimensions of the input variable . Thus, the Jacobian matrix of this transformation is triangular [kingma2016] and its determinant is equal to the product of its diagonal terms:


As in the RealNVP, the functions and

can be arbitrarily complex. In the MAF model, these functions are implemented by an efficient feedforward network called Masked Autoencoder for Distribution Estimation (MADE, 

[germain2015made]) that takes as input and outputs the means and log-standard deviations for all dimensions in a single network pass.

The nature of MAF transformations allows more flexible generalizations when compared to the RealNVP model. As one can see, if for the first dimensions we fix and apply the MAF transformations into the other dimensions, the MAF structure becomes equivalent to the RealNVP. Besides, we can see the coupling layer of the RealNVP as a special case of the MAF transformation [papamakarios2017masked].

V Proposed Methodology

We can compute an anomaly score for a sequential data type sample either directly or by first computing scores for local subsections and then aggregating them. These subsections are called pattern fragments, segments, sliding windows, motifs, or n-grams 

[gupta2014]. In Definition V.1 we present a formal description of these objects.

Definition V.1 (trajectory segment)

Given a trajectory with length , the segment  of with a user-defined length is a finite ordered sequence of location points, denoted by


where and .

Segment-based techniques usually perform better when compared to direct detection methods [gupta2014]. Furthermore, they enable handling large sequences with different lengths. As follows we detail our proposal, named aggregated anomaly detection with normalizing flows (GRADINGS), which consists in three main steps.

In the first step, GRADINGS transforms the set of trajectories into a set of trajectory segments. Thus, given a set of trajectories , the transformed set is defined by


where , , and is a function that flattens a -segment into a -dimensional row vector, where , i.e.,

The second step consists in estimating the distribution from the available trajectory segments. This step is performed by using one of the NF generative models described in Section IV. At this point, the GRADINGS is able to compute the anomaly degree for any trajectory segment, denoted by :


In the last step, we aggregate the anomaly scores of the segments to compute a single anomaly score for the trajectory. More specifically, given a trajectory , its anomaly score, denoted by , can be computed using an aggregation function that combines the anomaly degree of each segment in the trajectory , i.e.,


Possible choices for the aggregation function includes the median or the average.

Vi Experiments

To assess the performance of the proposed methodology, we conduct experiments comparing GRADINGS when using either Real NVP or MAF estimators against standard LOF and GMM anomaly detectors with real world data.

Vi-a Data set description

We consider the version 1.3 of the Microsoft GeoLife data set [zheng2009mining, zheng2008, zheng2010], comprised of real trajectory data measured from 182 users over a period of five years (from April 2007 to August 2012), which is equivalent to trajectories. For 73 users, the transportation mode is labeled, such as driving, taking a bus, riding a bike and walking. Each trajectory represents a complete trip from departure to arrival location.

In our experimental setting, we use a subset of the data that consists of the trajectories located in Beijing, China, made using car (126 trajectories) or bus (365 trajectories). We define two different scenarios. In the first one, called CAR BUS, we use the car trajectories as in-distribution data (i.e., as “normal” patterns) and the bus trajectories as out-of-distribution data (i.e., as “anomalies”). In the second one, called BUS CAR scenario we switch the roles: the bus trajectories act as in-distribution data and the car trajectories are seen as out-of-distribution samples. For each scenarios we use segments with length correspondent to , , and location points, accounting a total of data sets. All of these data sets have segments of car trajectories and segments of bus trajectories.

The timestamp information of the trajectory data is firstly converted to the hour of the week (e.g. Tuesday, 12:30, is equal to 36.5 if we consider the Monday as the start of the week) and then encoded into two variables using . This encoding ensures that similar periodic times are close in the input space, even in different weeks (e.g. Sunday, 23:59 is close to Monday, 00:00).

Vi-B Results and discussion

We report results for individual segments scores and full trajectories scores. In the latter, we consider both the average and the median as score aggregation functions (see Eq. (21)).

We train all the models on the normal data and then apply them to unseen normal samples as well as abnormal data samples. The normal data have been partitioned into two folds, the first one with 80% of the data for the training, and the other 20% is grouped with the abnormal data to compute the evaluation metrics.

For the MAF and RealNVP models we use flows of neural networks as bijective functions, with the MADE structure in the case of the MAF model. Each network has two hidden hidden layers, each one with neurons. Both models were trained for 300 epochs. A grid search with -fold cross-validation is used to perform the hyper-parameter tuning using the training data for the GMM model. The

value of the LOF algorithm was determined using the heuristic presented in 


The ROC curves and the correspondent AUROC values are presented in Figs. 1 and 2. In addition, we present in Table I the the false positive rate obtained when we fix a true positive rate of , named the FPR80 metric.

In all evaluated pair scenario-variant the NF-based solutions performed better in terms of AUCROC. In most of them, the GRADINGS framework with the MAF model was the best. In terms of FPR80, the MAF also achieved better results in 16 out of 18 evaluations, with the RealNVP being slightly better in the others. It is important to highlight that the use of a segment aggregation strategy considerably increased the performance concerning the AUROC in all experiments. Particularly, models with the median as the aggregation function achieved the best results in terms of AUROC and FPR80.

In terms of the segment length, when using the median as the aggregation function, we can see that the performance is inversely proportional to the size of the segment. On the other hand, using the average as the aggregation function, the performance decreases as the segment size increases. Since the average score is more sensitive to outliers, we hypothesize that the increase of the segment size may cause more outliers to appear in the same pattern. The results that consider only the individual segments do not show any specific behavior with respect to the segment length.

In summary, the obtained results indicate the importance of both main ingredients of the proposed GRADINGS framework: (i) the NF-based density estimation; and (ii) the aggregation of the individual segments degrees into a single trajectory anomaly score. Furthermore, we have also verified that, in general, the combination of the autoregressive MAF model, the median aggregation function and a larger (e.g. 30) segment length representation offers the best performance.

Scenario Variant Length MAF RealNVP GMM LOF


segment 0.423 0.643 0.698 0.719
0.498 0.640 0.653 0.688
0.608 0.652 0.699 0.727

average 0.342 0.335 0.376 0.465
0.272 0.435 0.500 0.550
0.361 0.577 0.556 0.622

median 0.245 0.375 0.308 0.481
0.247 0.335 0.353 0.419
0.201 0.361 0.315 0.462


segment 0.603 0.592 0.597 0.684
0.510 0.633 0.682 0.692
0.489 0.517 0.631 0.689

average 0.252 0.310 0.482 0.712
0.529 0.601 0.635 0.704
0.311 0.555 0.622 0.732

median 0.226 0.330 0.761 0.771
0.190 0.294 0.744 0.819
0.055 0.328 0.564 0.747

TABLE I: False positive rates obtained when we fix a true positive rate of (FPR80) for all experimental scenarios.
Fig. 1: Anomaly detection results for CAR BUS scenario. ROC curves and respective AUROC values for segments (left column) and for trajectories, using average (middle column), and median (right column). The rows represent the segment lengths – (a, b, c), (d, e, f), and (g, h, i). The dashed line indicates a completely random detector.
Fig. 2: Anomaly detection results for BUS CAR scenario. ROC curves and respective AUROC values for segments (left column) and for trajectories, using average (middle column), and median (right column). The rows represent the segment lengths – (a, b, c), (d, e, f), and (g, h, i). The dashed line indicates a completely random detector.

Vii Conclusion and Further Work

Anomaly detection is a challenging task with important practical applications. In the context of trajectory data, GPS measurements are usually widely available. However, the high dimensional patterns and the lack of labeled data hinder the application of standard techniques.

In this work we have proposed GRADINGS, an unsupervised density estimation methodology that includes flexible normalizing flows, more specifically the Real NVP and the MAF structures. GRADINGS aggregates the analytical log-likelihood values of trajectory segments into a single robust anomaly score, which enables the use of trajectories with distinct lengths. The empirical results obtained using real world data showed promising performance compared to the LOF and GMM baselines, specially when considering the autoregressive MAF-based version.

The present research outcome encourages us to pursue additional NF approaches for trajectory anomaly detection. For instance, future work shall evaluate the use of convolution-based flows, such as the so-called Glow [kingma2018glow], which can handle data with multiple channel representation. Models with more complex invertible transformations, such as the recently proposed [huang2018neural, oliva2018transformation, durkan2019neural], are also worthy subjects of future investigations.