Timely detection of traffic anomalies is one of the prerequisites of an Intelligent Transportation System (ITS). If not detected in time, anomalies may create cascading effects leading to traffic chaos. Typical examples of traffic anomalies are lane-driving violation, over-speeding, collision, and red-light violation. Anomaly detection using video object trajectories with deep learning has not yet been explored much. In this paper, we propose a color gradient approach for representing vehicular trajectories extracted from videos. These trajectories are then used for classification and anomaly detection at traffic junctions using a hybrid CNN-VAE architecture.
Trajectories are among the most commonly used features for video-guided scene understanding. A trajectory is time series data with object locations indexed in temporal order. Classifying trajectories using neural networks is not trivial due to variation in data length. The key to successful time series classification lies in finding an effective representation of the data. Neural network-based classifiers need fixed-size inputs. CNN, Long Short Term Memory (LSTM), and Recurrent Neural Network (RNN) models have been used for time series classification [30, 13, 10]. However, time series data can be of varying length. Therefore, classification of varying-length data can only be applied after preprocessing, e.g., converting the data into fixed-length form either by padding or by subsampling. If the trajectory length variance is large, preprocessing is mandatory.
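The padding/subsampling preprocessing described above can be sketched as follows. This is a minimal illustration in numpy; the helper name, the even-spacing subsampling, and the repeat-last-point padding strategy are our own assumptions, not details from the paper.

```python
import numpy as np

def to_fixed_length(traj, target_len):
    """Convert a trajectory of shape (L, 2) to shape (target_len, 2)
    by subsampling (if too long) or padding (if too short)."""
    traj = np.asarray(traj, dtype=float)
    L = len(traj)
    if L >= target_len:
        # Subsample: pick target_len indices spread evenly over the trajectory.
        idx = np.linspace(0, L - 1, target_len).round().astype(int)
        return traj[idx]
    # Pad: repeat the final position until the target length is reached.
    pad = np.repeat(traj[-1:], target_len - L, axis=0)
    return np.vstack([traj, pad])
```

Either branch discards or duplicates temporal detail, which is exactly the quantization-like information loss the color gradient representation is designed to avoid.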
Video anomaly detection at traffic junctions is highly challenging due to its contextual nature. For example, when a signal turns green at a traffic junction, only a few paths or directions are allowed for vehicle movement. Any motion that violates the allowed directions is assumed to be an anomaly, even though such motion can be normal in a different context.
I-A Related Work
Neural networks have been used for classification of time-series data. Other models such as Deep Belief Networks (DBN) have been used for human activity detection. On the other hand, CNNs are primarily used in image classification [16, 31], activity recognition in videos, speech recognition, etc.
Long Short Term Memory networks (LSTMs) are a special kind of Recurrent Neural Network (RNN) that can handle sequential/time series data. The authors of [22, 8] have proposed recurrent networks connecting LSTMs to CNNs to perform action recognition and video classification, respectively. Donahue et al. have tested the learned models for activity recognition, image description, and video description. Another work has achieved state-of-the-art performance in video classification by connecting CNNs and LSTMs under a hybrid deep learning framework. A Sequential Deep Trajectory Descriptor (DTD) has been used for action recognition from video sequences. Deep Neural Network (DNN)-based trajectory classification has been applied to Global Positioning System (GPS) trajectories. Dense feature trajectories have been utilized for action recognition in videos. An LSTM-based work uses fixed-size features to classify trajectories of surrounding vehicles at four-way intersections based on LIDAR (LIght Detection And Ranging), GPS, and inertial measurement unit (IMU) measurements.
Dense trajectories extracted using neural networks have also been used for action recognition in videos, including classifying a person walking, running, or jumping [27, 3]. These methods cannot handle multiple actions present in a scene. However, in a real-life scenario, multiple objects can interact, resulting in more than one action within the scene. Training neural networks for action recognition can be challenging in the presence of multiple activities. However, object trajectories extracted using traditional methods [2, 21, 1] can be used for learning motion patterns with DNNs, as the networks can automatically extract features from trajectories. The trained/learned model can then be used in classification and action recognition applications.
In this work, we encode video trajectories using a high-level representation, named color gradient, that embeds spatio-temporal information of the objects-in-motion. The high-level representation is then used for trajectory classification and anomaly detection using a hybrid CNN-VAE architecture.
I-B Motivation and Contributions
Since accurate classification is the key to detecting anomalies, a classifier that can handle time series data with length variations has been preferred. Typical neural network-based methods need a fixed input size. Therefore, varying-length trajectories cannot be used directly in such classifiers. Conventional methods convert the varying-length time series data into fixed-size data by sampling. This is similar to quantization, which leads to information loss. The question is: why can't a trajectory represented as an image be given as an input to a classifier? However, trajectories representing the movement of more than one object between two locations may look visually similar when projected onto 2D space. Such representations fail to preserve temporal relations between successive points of a trajectory. Encoding time information in the form of a color gradient (red to violet) reveals that similar patterns produce similar color gradients, as depicted in Fig. 1(a). Similarly, trajectories with possible anomalies exhibit different spatio-temporal characteristics, as depicted in Fig. 1(b). This has motivated us to propose the following:
A high-level representation of object trajectories using color gradient that encodes spatio-temporal information of trajectories of varying length.
A semi-supervised labeling technique based on modified Dirichlet Process Mixture Model (mDPMM)  clustering to identify the trajectory classes.
A method using t-Distributed Stochastic Neighbor Embedding (t-SNE)  to eliminate anomalous trajectories in the training data.
Detection of traffic anomalies using a hybrid CNN-VAE architecture.
First, we discuss the background of the terms and concepts used in this work. A scene represents the view captured using a static camera. We use observation or data to represent a trajectory. A cluster is a collection of trajectories with similar characteristics. A class is a set of trajectories having some selected common characteristics; here, a class typically represents a unique path in a scene. A model is a representation of a real-world phenomenon; here, a model represents the weight parameters of the trained neural networks. We assume a model can represent a scene. Reconstruction loss (of the CNN-VAE architecture) represents a measure of deviation from the input. A typical anomaly represents a deviation from the normal path. Some anomalies are known a priori. For example, when a signal turns green at a traffic junction, only a few of the paths are allowed for vehicle motion. Any motion that conflicts with or intersects the allowed paths is considered a known anomaly. However, some anomalies may not be present in the training data. We refer to them as unknown anomalies.
Object trajectories are obtained using [1, 24]. A trajectory T can be represented as T = {(x_t, y_t) : t = 1, ..., L} as in (1), where (x_t, y_t) represents the position of the moving object at time t and L is its length. A class can comprise, for example, trajectories in the same lane or trajectories following the same route.
Traffic anomalies can be classified into two types; known and unknown. Known anomalies correspond to trajectories that may be allowed in different contexts. On the contrary, unknown anomalies correspond to trajectories that are not present in the training data. In order to detect both types of anomalies, it is important to learn the normal trajectory patterns or classes. The overall anomaly detection framework is presented in Fig. 2.
II-A1 Modified DPMM Guided Clustering
When raw trajectories are obtained from tracking algorithms, they need to be clustered to identify different patterns. In earlier work, we proposed a modified DPMM (mDPMM) to group pixels having similar characteristics. Here, we use mDPMM to group trajectories to learn the motion patterns. The model is expressed using (2)-(5).
Let X_i be a random variable representing the i-th trajectory and z_i the latent variable representing its cluster label. z_i takes one of the values in {1, ..., K}, where K is the number of clusters. The mixing proportion π is a vector of length K representing the probabilities of z_i being 1, ..., K. θ_k is the parameter of cluster k and F(θ_k) denotes the distribution defined by θ_k. α is the concentration parameter of the Dirichlet distribution, and its value decides the number of clusters formed; ρ is referred to as the concentration radius. Trajectory clustering is done by taking X_i as f_i = (s_i, e_i, ρL_i), where s_i represents the start position, e_i the end position, and L_i the duration/length of the trajectory.
Using the inference method given in earlier work, clustering of trajectories can be done. These clusters can typically be grouped into two types. The first type contains a large number of trajectories and represents prominent patterns in the scene. The second type contains fewer trajectories; such clusters can correspond either to less frequently occurring patterns or to anomalies.
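The feature construction above, and a crude stand-in for the clustering step, can be sketched as follows. The actual mDPMM inference is not reproduced here; `greedy_cluster` is an illustrative distance-threshold heuristic of our own, used only to show how the (start, end, ρ·length) features separate motion patterns.

```python
import numpy as np

def trajectory_feature(traj, rho=1.0):
    """Feature vector (start_x, start_y, end_x, end_y, rho * length)."""
    traj = np.asarray(traj, dtype=float)
    return np.concatenate([traj[0], traj[-1], [rho * len(traj)]])

def greedy_cluster(features, radius):
    """Assign each feature to the nearest existing cluster centre within
    `radius`, else open a new cluster (a crude stand-in for mDPMM inference)."""
    centres, labels = [], []
    for f in features:
        if centres:
            d = np.linalg.norm(np.asarray(centres) - f, axis=1)
            k = int(d.argmin())
            if d[k] < radius:
                labels.append(k)
                continue
        centres.append(f)
        labels.append(len(centres) - 1)
    return labels
```

Trajectories sharing start point, end point, and duration fall into one cluster; outliers open small singleton clusters, mirroring the "prominent" versus "rare" cluster types discussed above.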
II-A2 Gradient Conversion of the Trajectories
A trajectory in time series form is mapped into a color gradient by varying the hue linearly with time within an image frame. These gradient frames become inputs to the CNN and VAE.
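The conversion can be sketched as below. This is a minimal illustration: the red-to-violet hue range (mapped here to hue 0.0-0.8 in HSV) and the 120x120 frame follow the description in the text, but the exact mapping and any line interpolation used by the authors are not specified, so this rendering is an assumption.

```python
import colorsys
import numpy as np

def trajectory_to_gradient(traj, size=120):
    """Draw a trajectory into a size x size RGB image. The hue of each
    point varies linearly from red (start) towards violet (end).
    Coordinates are assumed to be already scaled to [0, size)."""
    img = np.zeros((size, size, 3), dtype=np.float32)
    L = len(traj)
    for t, (x, y) in enumerate(traj):
        hue = 0.8 * t / max(L - 1, 1)   # 0.0 = red ... 0.8 = violet
        img[int(y), int(x)] = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return img
```

Because the hue encodes the time index, two trajectories covering the same pixels in opposite directions produce visibly different gradients, which is what makes opposite-direction anomalies separable in this representation.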
II-A3 Anomaly Elimination in Training Data using t-SNE
t-SNE is a machine learning algorithm for visualizing high-dimensional data in a low-dimensional space. We use it for visualizing the latent features of a trained VAE in two dimensions. Trajectories belonging to the same class typically lie in close proximity in the visualization plane, whereas trajectories that lie far away from a class are inspected again for manual anomaly checking.
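A sketch of this step, assuming scikit-learn's t-SNE implementation; the distance-from-centroid flag in `flag_outliers` is our own heuristic for surfacing candidates for manual checking, not a rule stated in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_latents(latents, seed=0):
    """Project VAE latent features to 2-D with t-SNE for visual inspection."""
    return TSNE(n_components=2, perplexity=5, init="random",
                random_state=seed).fit_transform(np.asarray(latents))

def flag_outliers(emb, labels, k=3.0):
    """Flag points lying farther than (mean + k * std) class-centroid
    distance within their own class; these are candidates for manual
    anomaly checking, not automatic rejections."""
    emb, labels = np.asarray(emb), np.asarray(labels)
    flags = np.zeros(len(emb), dtype=bool)
    for c in np.unique(labels):
        m = labels == c
        d = np.linalg.norm(emb[m] - emb[m].mean(axis=0), axis=1)
        flags[m] = d > d.mean() + k * d.std()
    return flags
```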
II-B Trajectory Annotation
Suppose a set of trajectories captured from a traffic junction or road is given. Each trajectory must belong to one of a defined set of paths (classes). Applying mDPMM helps to identify prominent patterns in these trajectories. Like any unsupervised method, a clustering algorithm can only identify possible patterns from the trajectory data. Though prominent patterns can correspond to normal trajectories, clusters with few trajectories can represent either a rare pattern or an anomaly. This necessitates an additional annotation process to identify allowed classes. Clustering reduces the load of manual labeling, as an initial grouping is done through mDPMM. The annotator can identify rare patterns through visual observation of the scene and separate the anomalous trajectories to finalize the allowed classes. This process is called class annotation.
Further refinements are possible within a class. Two trajectories with similar endpoints and duration may follow different paths, of which only one may be normal. This may not always be detectable through visual observation. Therefore, t-SNE has been used to visualize the distribution of trajectories within the classes. This helps to remove noise (anomalies) from the training set being prepared for the VAE.
II-C Training CNN and VAE Framework
A CNN classifier typically consists of repeated occurrences of cascaded convolution, activation, and pooling layers followed by fully connected layers. The architecture used in this work is depicted in Fig. 3(a). During the training stage, a cost/loss function representing the cross-entropy between the expected and predicted class is minimized using the Adam optimizer.
We use a variational autoencoder (VAE) to detect unknown anomalies. It consists of encoding and decoding stages. The input to the encoder is x; the output is a hidden/latent feature z = f_φ(x), where φ represents the weights and biases of the encoder network. The decoder takes the latent feature z and regenerates x̂ = g_θ(z), where θ represents the weights and biases of the decoder. The loss function (L) for a trajectory is given in (8) in terms of the log likelihood given in (6) and the Kullback-Leibler Divergence (KLD) given in (7). The Adam optimizer minimizes the average loss function during training. Once trained, the VAE can detect anomalies using the average reconstruction loss.
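A numpy sketch of this per-trajectory loss, combining a reconstruction term with the analytic KLD between the diagonal-Gaussian posterior and a standard normal prior. Squared error is used here as a stand-in for the negative log likelihood term; the paper's exact likelihood in (6) may differ.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Per-trajectory VAE loss: squared reconstruction error plus the
    closed-form KL divergence KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    recon = float(np.sum((np.asarray(x) - np.asarray(x_hat)) ** 2))
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    kld = float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))
    return recon + kld
```

The KLD term vanishes exactly when mu = 0 and log_var = 0, i.e., when the posterior matches the prior, so the loss reduces to pure reconstruction error in that case.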
II-D Anomaly Detection
Classification is performed on the trained CNN using test trajectories represented in gradient form to obtain a class c. Let θ_r be the threshold of the reconstruction loss for normal classes on the trained VAE; θ_r is derived using the variance of the loss values on the training trajectories. A trajectory is considered anomalous when its reconstruction loss exceeds θ_r or when c ∉ A_s, where A_s is the set of allowed trajectory classes for a particular signal s.
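The combined decision rule is simple enough to state directly in code (the function and argument names are illustrative):

```python
def is_anomalous(recon_loss, predicted_class, theta, allowed_classes):
    """Anomalous if the VAE reconstruction loss exceeds the threshold theta
    (unknown anomaly) or the CNN-predicted class is not among the classes
    allowed for the current signal phase (known/contextual anomaly)."""
    return recon_loss > theta or predicted_class not in allowed_classes
```

For example, with theta = 0.5 and allowed_classes = {"south_to_north"}, a trajectory classified as "north_to_south" is flagged even if it reconstructs perfectly.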
However, a classifier is needed for anomaly detection to handle conflicting trajectories. In a typical traffic junction, a set of flows may be allowed at a given time. For example, in the QMUL dataset (Fig. 4), any two flows, e.g., south-to-north on the left side and north-to-south on the right side, may be allowed at a given time. Any other movement can be termed anomalous, even though individually such movements may be allowed at a different time. The VAE cannot detect such known anomalies. Therefore, the CNN helps to detect these conflicting anomalies. It also helps to identify the anomalous path.
III Experimental Results
We have used TensorFlow and OpenCV for developing the classification and anomaly detection framework. We have used three datasets, namely T15, QMUL, and a junction video dataset (referred to as 4WAY). A context tracker has been used for creating trajectories from the QMUL dataset, and Temporal Unknown Incremental Clustering (TUIC) has been used for obtaining 4WAY trajectories. Inputs to the CNN-VAE are resized to 120×120×3. CNN training has been run for 50 epochs on the T15 and 4WAY dataset videos. A batch size of 20 has been used for the QMUL dataset. The VAE for T15 has been trained for 500 epochs using a batch size of 20; the VAE for the QMUL dataset has been trained for 500 epochs with a batch size of 10.
III-A Experiments on Trajectory Clustering and Annotation
The annotation of unlabeled trajectories using mDPMM is shown in Fig. 4. Trajectory details are presented in Table I. Since the T15 dataset readily comes with class annotations, unsupervised clustering has not been applied to this dataset. Fig. 5 presents the t-SNE guided refinement.
III-B Experiments on Classification and Comparisons
Trajectories of the T15, QMUL, and 4WAY datasets have been used for classification. Classification results are shown in Fig. 6 and summarized in Table II. It can be observed that the proposed method performs accurate classification across all datasets. We have randomly selected a fraction of the trajectories for training and used the rest for testing. Our proposed classification method has been compared with state-of-the-art classification methods such as HAR-CNN, LSTM, and LSTM+CNN, typically used for time series data classification. We have converted the input trajectories to fixed-size samples by downsampling or upsampling, depending on their length. The comparative results are shown in Table II. The results reveal that our proposed method performs better than the existing approaches across all datasets. Classification without the color gradient degrades slightly, though it still performs better than most of the existing work.
III-C Experiments on Anomaly Detection
The T15 dataset has been used to evaluate the anomaly detection framework. Reconstructions using the VAE are depicted in Fig. 8. Four kinds of anomalous trajectories are used in our experiments: (i) trajectories terminating abruptly; (ii) speed variation as compared to normal trajectories of the same class; (iii) trajectories of objects traveling in the opposite direction of normal traffic; (iv) trajectories corresponding to vehicles violating lane driving. Since the T15 dataset does not contain type (iii) anomalous trajectories, we have created a few such trajectories by performing the gradient conversion in reverse order. We have used two times the converged loss value as the threshold for detecting anomalies, based on an empirical study of anomalous and normal trajectories as shown in Fig. 8(a). Anomaly detection results are shown in Fig. 8(b). We have used 69 randomly selected normal trajectories that were not used in training, along with 31 identified and synthetically created anomalous trajectories. We have created synthetic trajectories for the lane-change anomalies and, for each class, for the opposite-direction driving anomalies. The comparisons of trajectory projections on the image plane using the VAE under different conditions are presented in Table III. We are able to detect anomalies with an accuracy of 87.3% when t-SNE is used. This also reveals that, without the gradient representation, anomaly detection accuracy drops significantly (49.2%).
III-D Comparison of Anomaly Detections
Since the video trajectory-based anomaly detection method using DNNs proposed in this paper is the first of its kind, we could not find benchmark datasets for comparison with neural network-based anomaly detection. Hence, we have performed a high-level comparison with two state-of-the-art anomaly detection techniques using the input reconstruction property. The first uses sparse combination learning for learning normal behavior, while the second learns the model from spatio-temporal video segments using an autoencoder. Several experiments have been conducted on the QMUL dataset. Training videos have been created by splitting the original traffic video into segments starting from a particular frame number and eliminating anomalous segments from the scene. Testing has been conducted using the video segment prior to that frame number. We have trained our proposed architecture using trajectories obtained with the help of the context tracker. Training for the sparse-reconstruction method has been done using the same configuration as reported in their work, while testing has been conducted with a fixed error threshold. For training the autoencoder model, we have used a sequence length of 10, a batch size of 4, and 200 epochs.
Both methods report several false positives on the QMUL dataset. Moreover, these methods cannot detect contextual anomalies. A deeper analysis reveals that the false positives are mainly due to unseen characteristics present in a scene with heterogeneous data, making it difficult to learn all spatio-temporal features. Such methods can work only when the video duration is long enough to cover all types of object motions possible within a scene. They can also be difficult to train, as separating normal video segments from anomalous ones is very challenging when anomalies are present throughout the video. As our method is trajectory-based, individual trajectories can be characterized as normal or abnormal rather than declaring an entire video segment normal or anomalous. Moreover, training a deep neural network using video frames can be time consuming; in contrast, our method condenses a trajectory into a single video frame. In a nutshell, we combine the advantages of conventional trajectory extraction methods with the feature extraction capabilities of deep neural networks to achieve fast classification. Table IV summarizes the comparative results.
| Parameters | Proposed method | Sparse reconstruction | Spatio-temporal autoencoder |
| --- | --- | --- | --- |
| False alarm rate | Low | High | High |
| Unknown anomaly detection | Yes | Yes | Yes |
| Contextual anomaly detection | Yes | No | No |
| Detection time | Once trajectory is available | Per frame | Per sequence length |
III-E Discussions and Limitations
The key to accurate anomaly detection lies in training the model with normal trajectories. Apart from mDPMM-based clustering, t-SNE visualization plays an important role in eliminating anomalous trajectories. A classifier is needed to detect known anomalies such as traffic rule violations by vehicles. While unknown anomalies are detected using the VAE, the CNN classifier helps to identify known anomalies and to localize the path of unknown anomalies. The loss values in terms of KLD and likelihood are justified, as they represent the distance of trajectories from the allowed class distributions. A small offset from the converged loss can be a good estimate of the threshold. The CNN classifier achieves higher accuracy as compared to other methods.
Some of the limitations of the proposed method are: (i) the method is tracking-dependent, though improved tracking can mitigate this issue; (ii) a large number of training samples must be available to learn the allowed paths at a traffic junction.
The key idea behind this work is to represent time-varying visual data in color gradient form in order to train DNN-based systems to encode temporal features. The method allows results of traditional object tracking to be combined with neural network-based methods, exploiting the advantages of both. Experiments show that the proposed color gradient feature with a CNN performs better than existing classifiers. We are also able to detect several types of trajectory anomalies using the proposed architecture, and it performs better than some of the existing reconstruction-based anomaly detection methods. We plan to extend this work to develop a real-time anomaly detection system for traffic intersections using online trajectories, capable of detecting the discussed anomalies as well as others such as over-speeding. We also plan to explore this method for time series data analysis in other domains.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for this research.
-  S. H. Bae and K. J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2017.
-  Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
-  A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, March 2001.
-  A. Bulling, U. Blanke, and B. Schiele. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR), 46(3):33, 2014.
-  Y. S. Chong and Y. H. Tay. Abnormal event detection in videos using spatiotemporal autoencoder. In ISNN, 2017.
-  L. Deng, J. Li, J. T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, et al. Recent advances in deep learning for speech research at microsoft. In ICASSP, 2013.
-  T. B. Dinh, N. Vo, and G. Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In CVPR, 2011.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
-  Y. Endo, H. Toda, K. Nishida, and J. Ikedo. Classifying spatial trajectories using representation learning. International Journal of Data Science and Analytics, 2(3):107–117, Dec 2016.
-  N. Y. Hammerla, S. Halloran, and T. Ploetz. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint arXiv:1604.08880, 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2013.
-  A. Khosroshahi, E. Ohn-Bar, and M. M. Trivedi. Surround vehicles trajectory analysis with recurrent neural networks. In ITSC, 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In DMKD, 2003.
-  C. C. Loy, T. Xiang, and S. Gong. From local temporal correlation to global anomaly detection. In ECCV, 2008.
-  C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In ICCV, 2013.
-  L. V. D. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. In CVPR, 2015.
-  J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
-  T. Plötz, N. Y. Hammerla, and P. Olivier. Feature learning for activity recognition in ubiquitous computing. In IJCAI, 2011.
-  K. K. Santhosh, D. P. Dogra, and P. P. Roy. Temporal unknown incremental clustering model for analysis of traffic surveillance videos. IEEE Transactions on Intelligent Transportation Systems, pages 1–12, 2018.
-  Y. Shi, Y. Tian, Y. Wang, and T. Huang. Sequential deep trajectory descriptor for action recognition with three-stream cnn. IEEE Transactions on Multimedia, 19(7):1510–1520, 2017.
-  T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
-  H. Wang, A. Kläser, C. Schmid, and C. L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
-  Z. Wu, X. Wang, Y. G. Jiang, H. Ye, and X. Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM MULTIMEDIA, 2015.
-  H. Xu, Y. Zhou, W. Lin, and H. Zha. Unsupervised trajectory clustering via adaptive multi-kernel-based shrinkage. In ICCV, 2015.
-  J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, 2015.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
-  B. Zhou, X. Tang, and X. Wang. Measuring crowd collectiveness. In CVPR, 2013.