
Are pre-trained CNNs good feature extractors for anomaly detection in surveillance videos?

Recently, several techniques have been explored to detect unusual behaviour in surveillance videos. Nevertheless, few studies leverage features from pre-trained CNNs and none of them presents a comparison of the features generated by different models. Motivated by this gap, we compare features extracted by four state-of-the-art image classification networks as a way of describing patches from security video frames. We carry out experiments on the Ped1 and Ped2 datasets and analyze the usage of different feature normalization techniques. Our results indicate that choosing the appropriate normalization is crucial to improve the anomaly detection performance when working with CNN features. Also, on the Ped2 dataset our approach was able to obtain results comparable to those of several state-of-the-art methods. Lastly, as our method only considers the appearance of each frame, we believe that it can be combined with approaches that focus on motion patterns to further improve performance.


I. Introduction

Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy, or position of Itaú-Unibanco, FAPESP, or CNPq.

Nowadays, security cameras are widely employed to monitor public spaces such as malls, squares, and universities. Yet, those surveillance cameras may be ineffective, mainly because each video feed needs a person constantly watching it. Keeping track of the events in security videos is very hard for humans, especially for two reasons: 1) a single person is responsible for monitoring several cameras simultaneously [1]; 2) it is hard for people to maintain an acceptable level of attention when watching this kind of video [2]. Due to these issues, security footage is of little help in preventing dangerous situations and ends up being used mostly for investigation purposes, after something has already happened.

The ineffectiveness of surveillance systems motivated the machine vision community to work on automated systems to detect unusual behaviour in security videos and, consequently, we have seen outstanding advances in this area. Over the last few years, many methods have been proposed to detect anomalies in videos. Among the approaches employed to tackle this issue are: time series decomposition of optical flow [3], optical-flow features [4, 5], dictionary learning [6], auto-encoders [7], and GANs (generative adversarial networks) [8]. Despite all the progress obtained over the last few years, automatically detecting unusual events in videos remains an open research area and several approaches are yet to be explored.

One kind of approach that, to the best of our knowledge, has not yet been broadly investigated in anomaly detection scenarios is transfer learning. This technique consists of leveraging knowledge obtained from solving a particular problem to solve a different one (in a related domain). With regard to CNNs, most transfer learning applications use pre-trained networks as feature extractors or as a starting point for training. In recent years, such usage of pre-trained networks has been considered very effective by various studies [9], even when the original and target domains are considerably different [10]. Still, such performance is not guaranteed for every scenario, as pointed out by some other studies [11, 12].

Motivated by the aforementioned results and the lack of investigation regarding the usage of pre-trained CNNs to detect unusual behaviour in security videos, in this paper we devote our efforts to evaluating the application of several state-of-the-art image classification CNNs as feature extractors for surveillance footage. We compare features generated by four networks (VGG-16 [13], ResNet-50 [14], Xception [15] and DenseNet-121 [16]) as a way to describe the appearance of frame regions in security videos. Despite neglecting the motion part of anomaly detection in videos, our experiments (conducted on the Ped1 and Ped2 UCSD datasets [17, 18]) show that those features are able to achieve competitive results.

II. Related Work

Fig. 1: Experimental setup diagram.

Recently, the usage of CNN features has been explored to detect unusual activities in surveillance footage. In [19], the authors describe image regions using AlexNet [20] and then track variations in such descriptions in order to detect anomalies. With this tracking, they are able to use an image-based CNN to find both motion and appearance anomalies.

In [21], a pre-trained C3D model [22] was employed for feature extraction. Such a model was originally designed for action recognition in videos, thus it learns spatio-temporal representations, which are very useful to deal with both motion and appearance abnormalities. Nevertheless, in [21] the pre-trained model is used in a classification setup, therefore it has access to instances of both normal and anomalous behaviour during training.

Although there are some studies where pre-trained networks were employed to detect anomalies in security videos, they do not compare several models in order to determine which one is best suited for the application. For this reason, in this paper we design an experimental setup and present several experimental results, employing different network architectures and data normalization procedures, in order to shed some light on this issue. Instead of attempting to outperform state-of-the-art methods, we aim at a better understanding of how off-the-shelf CNN features behave for the application of video surveillance, providing a guideline for future research on how to choose the appropriate model for this task.

III. Experimental Setup

In order to measure how well the features extracted from pre-trained image classification CNNs perform when detecting anomalies in surveillance videos, we employed the experimental setup depicted in Figure 1. This experimental setup has three steps.

In the first step of our setup, we start by converting each video frame to a fixed resolution. Next, we take image regions (patches) using a stride of 16 pixels and perform a forward pass of each of these patches through the convolutional part of a CNN pre-trained on the ImageNet dataset [23]. Each image patch ends up being represented by n features, where the value of n depends on the network: it is equal to the number of filters (neurons) in the last convolutional layer of the network. At this point, one can notice that using the convolutional part of the network, instead of the entire network, is rather convenient in our framework. This is because when using the entire network we have to provide input images of the same size as those of the original dataset (e.g., 224×224 pixels), while with the convolutional part of the network we do not have such a constraint. In this part of our setup, we investigated features generated by the following state-of-the-art image classification models (trained on ImageNet): VGG-16 [13], ResNet-50 [14], Xception [15] and DenseNet-121 [16]. Please refer to Section III-B for more details on each model and for the number of features generated by each one of them.

In the second step, we prepared the data obtained in the first step in order to use it with an anomaly detection method. To do so, we started by normalizing the data; in this part we tested the z-score, 0-1, L1 and L2 normalization methods (see Section III-A for a description of these techniques). Given that we employed the nearest neighbour technique to detect anomalies, this normalization step is fundamental to the performance of the system. After normalizing the data, we used the Incremental PCA (IPCA) algorithm [24] to reduce the data dimensionality to 50 and 100 dimensions. This particular method was chosen for two reasons: 1) it can be used to reduce the number of features and, consequently, speed up the computation of nearest neighbours; 2) it does not need to load the entire dataset into RAM in order to compute its transformation, so it can deal with large datasets such as those used in this paper.
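A sketch of this second step with Scikit-learn, assuming the step-one descriptors are stacked row-wise in NumPy arrays (the batch size for the incremental fit is arbitrary):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

def prepare(train_feats, test_feats, n_dims=50, batch=4096):
    """z-score normalize (statistics from the training set only), then IPCA."""
    scaler = StandardScaler().fit(train_feats)  # swap in MinMaxScaler or
    train = scaler.transform(train_feats)       # Normalizer(norm="l1"/"l2")
    test = scaler.transform(test_feats)         # to test the other schemes

    # IncrementalPCA fits in minibatches, which is what keeps RAM usage bounded
    # when the feature matrix is streamed from disk instead of held in memory.
    ipca = IncrementalPCA(n_components=n_dims, batch_size=batch)
    ipca.fit(train)
    return ipca.transform(train), ipca.transform(test)
```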

Lastly, in the third step of our experimental setup, we train a 1-NN model using the datasets obtained in step two (with 50 or 100 features, depending on the experiment). Then, we find the distance of each test set instance to its nearest neighbour in the training set. Those distances are used as anomaly scores; hence, the higher the distance, the more an instance (image region) is considered to be anomalous. It is important to notice that we train only one model to work with descriptions from the entire image. Also, aiming to speed up our anomaly detection framework, we used the approximate nearest neighbour method from [25]. Next, we obtain the anomaly score of a frame simply by taking the maximum score among its patches. Based on those frame scores, we can compute the AUC (Area Under the ROC Curve) and EER (Equal Error Rate) for frame-level classification.
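The third step can be sketched as follows. Scikit-learn's exact 1-NN search stands in for FLANN's approximate search for brevity, and `patch_frame_ids` (the frame index of each test patch) is bookkeeping assumed to come from the extraction step; the EER computation uses a common approximation from the ROC curve.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score, roc_curve

def frame_scores(train, test, patch_frame_ids, n_frames):
    """Anomaly score of a frame = max 1-NN distance among its patches."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    dist, _ = nn.kneighbors(test)  # distance to the nearest training patch
    scores = np.zeros(n_frames)
    for d, f in zip(dist.ravel(), patch_frame_ids):
        scores[f] = max(scores[f], d)
    return scores

def auc_eer(scores, labels):
    """Frame-level AUC and EER; labels are 1 for anomalous frames."""
    fpr, tpr, _ = roc_curve(labels, scores)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]  # point where FPR = FNR
    return roc_auc_score(labels, scores), eer
```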

Our experiments were carried out on the UCSD video anomaly detection datasets: Ped1 and Ped2 [17, 18]. More information about the two datasets is presented in Section III-C.

III-A Data Normalization

In our experiments we used the following four data normalization methods (all four are summarized in a short NumPy sketch after this list):

  • z-score: outputs features centered at the origin (i.e., with zero mean) and with unit standard deviation by computing:

    $x'_{i,f} = \frac{x_{i,f} - \mu_f}{\sigma_f}$,

    where $x_{i,f}$ is the value of feature $f$ for instance $i$, and the mean $\mu_f$ and standard deviation $\sigma_f$ of feature $f$ are obtained using the entire training set;

  • 0-1: outputs features in the $[0, 1]$ interval. To do so, it uses the following formula:

    $x'_{i,f} = \frac{x_{i,f} - \min_f}{\max_f - \min_f}$,

    where $x_{i,f}$ is the value of feature $f$ for instance $i$, and $\max_f$ and $\min_f$ of feature $f$ are obtained from the training set;

  • L1: in this method, all instances (dataset rows) are scaled so that their L1 norm is equal to one, as follows:

    $x'_{i,f} = \frac{x_{i,f}}{\sum_j |x_{i,j}|}$,

    where $x_i$ is an instance and $|x_{i,f}|$ is the absolute value of the $f$-th feature of $x_i$;

  • L2: similar to the previous one, but using the L2 norm instead, as in the equation below:

    $x'_{i,f} = \frac{x_{i,f}}{\sqrt{\sum_j x_{i,j}^2}}$.
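The four methods can be condensed into a few lines of NumPy; note that z-score and 0-1 operate per feature (column) with statistics taken from the training set, while L1 and L2 operate per instance (row):

```python
import numpy as np

def normalize(train, data, method):
    """Normalize `data` with statistics estimated on `train` where applicable."""
    if method == "z-score":
        return (data - train.mean(axis=0)) / train.std(axis=0)
    if method == "0-1":
        lo, hi = train.min(axis=0), train.max(axis=0)
        return (data - lo) / (hi - lo)
    if method == "L1":  # each row rescaled to unit L1 norm
        return data / np.abs(data).sum(axis=1, keepdims=True)
    if method == "L2":  # each row rescaled to unit L2 norm
        return data / np.linalg.norm(data, axis=1, keepdims=True)
    raise ValueError("unknown method: %s" % method)
```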

III-B CNN Models

Below, we quickly present some of the main characteristics of the four CNNs used as feature extractors in our experiments.

VGG-16 [13] is composed of 16 layers, 13 of them convolutional and the remaining 3 dense layers. All convolutional layers use 3×3 kernels and ReLU as the activation function. Also, those 13 layers are divided into 5 groups, and each group has a max-pooling layer at its end.

ResNet-50 [14] uses an alternative layer configuration, called residual units. Residual units, such as the one depicted in Figure 2, allow deeper models to be learned by using shortcut connections to avoid vanishing/exploding gradients during training. This network has 49 convolutional layers and only one fully connected layer. All but one of its convolutional layers are organized into 16 residual units like the one in Figure 2.

Fig. 2: Residual unit employed by ResNet-50 [14].
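As a rough Keras sketch of such a bottleneck unit (the exact unit in Figure 2 may differ in details; an identity shortcut is assumed here, i.e., the input already has 4×filters channels):

```python
from keras.layers import Activation, Add, BatchNormalization, Conv2D

def residual_unit(x, filters):
    """Bottleneck unit: 1x1 reduce -> 3x3 -> 1x1 expand, plus a shortcut."""
    shortcut = x  # the shortcut lets gradients bypass the convolutions
    y = Activation("relu")(BatchNormalization()(Conv2D(filters, 1)(x)))
    y = Activation("relu")(BatchNormalization()(
        Conv2D(filters, 3, padding="same")(y)))
    y = BatchNormalization()(Conv2D(4 * filters, 1)(y))
    return Activation("relu")(Add()([shortcut, y]))
```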

Xception [15] is inspired by Inception V3 [26] and built by replacing the inception modules with depthwise separable convolutions. This type of convolution is performed by first applying spatial convolutions (e.g., with 3×3 kernels) to each tensor channel separately and then applying a 1×1 convolution across all channels; this process is illustrated in Figure 3. By using this type of operation, the Xception model was able to outperform Inception V3 on image classification using the same number of trainable parameters.

Fig. 3: Depthwise separable convolution layer used by Xception [15].
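The parameter saving of this factorization is easy to verify in Keras; a small sketch with arbitrary channel counts:

```python
from keras.layers import Conv2D, Input, SeparableConv2D
from keras.models import Model

inp = Input((32, 32, 128))
regular = Model(inp, Conv2D(128, 3, padding="same")(inp))
separable = Model(inp, SeparableConv2D(128, 3, padding="same")(inp))

# Regular 3x3 conv: 3*3*128*128 weights + 128 biases = 147,584 parameters.
# Separable: depthwise 3*3*128 + pointwise 128*128 + 128 biases = 17,664.
print(regular.count_params(), separable.count_params())
```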

DenseNet-121 [16] is an architecture similar to ResNets, because it also uses shortcut/skip connections. Nevertheless, instead of employing this type of connection only in some specific parts of the network, it uses shortcut connections to connect every layer to all of its subsequent layers. The authors claim this approach avoids vanishing/exploding gradients and allows the training of even deeper networks when compared to ResNets.

As previously mentioned, those models have different numbers of trainable parameters and generate descriptors of different sizes. In Table I, we present the number of trainable parameters in the convolutional part (the feature extractor) and the number of features generated by each CNN used in our experiments.

| feature extractor | # of trainable parameters | # of features |
|---|---|---|
| VGG-16 [13] | 14,714,688 | 512 |
| ResNet-50 [14] | 23,534,592 | 2048 |
| Xception [15] | 20,806,952 | 2048 |
| DenseNet-121 [16] | 6,953,856 | 1024 |

TABLE I: Number of trainable parameters and output features for each one of the network architectures used in our experiments.
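The "# of features" column can be checked directly against the Keras implementations; a quick sketch (weights=None skips the ImageNet download and uses random weights, which is enough to inspect shapes):

```python
from keras.applications import DenseNet121, ResNet50, VGG16, Xception

for net in (VGG16, ResNet50, Xception, DenseNet121):
    model = net(weights=None, include_top=False, pooling="avg")
    # Prints the descriptor length: 512, 2048, 2048 and 1024, respectively.
    print(net.__name__, model.output_shape[-1])
```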

III-C Datasets

Both datasets employed in our experiments were obtained from stationary surveillance cameras on the UCSD (University of California, San Diego) campus [17, 18]. This means that all of their videos are real (they were not staged). Each dataset contains videos from a single camera, and only the test sets contain anomalous events. Some specifications and sample images from both datasets are presented in Table II and Figure 4, respectively.

| characteristic | Ped1 | Ped2 |
|---|---|---|
| training videos | 34 | 16 |
| test videos | 36 | 12 |
| frame resolution | 238×158 | 360×240 |
| fps | 10 | 10 |
| frames per video | 200 | varies |

TABLE II: Video specifications for Ped1 and Ped2.
Fig. 4: Frames from Ped1 (first row) and Ped2 (second row), with anomalies manually highlighted by red boxes.

III-D Reproducibility Remarks

In our experiments, we used the Keras [27] library, and its pre-trained models, for feature extraction. In order to normalize our features and compute IPCA, we used the Scikit-learn [28] library. Lastly, for the approximate nearest neighbour distance computation, we used the FLANN [29] library. To facilitate the reproduction of our results, our source code is publicly available at https://github.com/tiagosn/cnn_features_anomaly_detection.

IV. Results and Discussion

As stated in Section III, we trained nearest neighbour models using features extracted from four pre-trained CNNs. In those tests, before training our nearest neighbour models, we first normalized our features and then used IPCA to reduce the number of dimensions. Aiming to obtain the best results out of our features, we experimented with several configurations of both the normalization technique and the number of dimensions used after the IPCA transformation. The results of such experiments are presented in Table III.

| feature extractor | IPCA dimensions | normalization method | Ped1 AUC | Ped1 EER | Ped2 AUC | Ped2 EER |
|---|---|---|---|---|---|---|
| VGG-16 [13] | 50 | 0-1 | 63.98% | 40.85% | 63.81% | 42.90% |
| VGG-16 [13] | 50 | z-score | 63.35% | 42.02% | 64.98% | 40.05% |
| VGG-16 [13] | 50 | L1 | 59.01% | 43.28% | 63.73% | 39.95% |
| VGG-16 [13] | 50 | L2 | 59.16% | 43.15% | 63.37% | 40.13% |
| VGG-16 [13] | 100 | 0-1 | 64.06% | 41.02% | 63.42% | 40.29% |
| VGG-16 [13] | 100 | z-score | 63.78% | 41.13% | 64.97% | 38.53% |
| VGG-16 [13] | 100 | L1 | 63.62% | 40.40% | 65.49% | 38.40% |
| VGG-16 [13] | 100 | L2 | 61.02% | 42.72% | 62.84% | 40.70% |
| ResNet-50 [14] | 50 | 0-1 | 59.10% | 44.94% | 62.05% | 42.27% |
| ResNet-50 [14] | 50 | z-score | 60.95% | 43.17% | 79.59% | 28.88% |
| ResNet-50 [14] | 50 | L1 | 55.26% | 45.73% | 45.59% | 51.93% |
| ResNet-50 [14] | 50 | L2 | 63.60% | 41.25% | 71.05% | 34.53% |
| ResNet-50 [14] | 100 | 0-1 | 59.65% | 44.66% | 69.02% | 37.08% |
| ResNet-50 [14] | 100 | z-score | 61.98% | 42.21% | 83.90% | 23.33% |
| ResNet-50 [14] | 100 | L1 | 55.48% | 45.93% | 47.49% | 51.46% |
| ResNet-50 [14] | 100 | L2 | 62.67% | 42.19% | 70.08% | 34.89% |
| Xception [15] | 50 | 0-1 | 59.95% | 44.18% | 86.61% | 21.82% |
| Xception [15] | 50 | z-score | 59.16% | 44.23% | 87.33% | 21.66% |
| Xception [15] | 50 | L1 | 51.59% | 48.91% | 63.72% | 41.32% |
| Xception [15] | 50 | L2 | 60.02% | 44.21% | 80.78% | 25.39% |
| Xception [15] | 100 | 0-1 | 60.68% | 43.75% | 87.94% | 19.55% |
| Xception [15] | 100 | z-score | 59.61% | 43.59% | 88.93% | 20.02% |
| Xception [15] | 100 | L1 | 51.99% | 48.71% | 65.21% | 40.35% |
| Xception [15] | 100 | L2 | 58.83% | 43.56% | 82.03% | 24.58% |
| DenseNet-121 [16] | 50 | 0-1 | 63.65% | 41.02% | 82.91% | 23.58% |
| DenseNet-121 [16] | 50 | z-score | 62.04% | 41.83% | 72.57% | 34.81% |
| DenseNet-121 [16] | 50 | L1 | 63.00% | 40.60% | 73.73% | 31.98% |
| DenseNet-121 [16] | 50 | L2 | 63.15% | 41.34% | 72.79% | 32.87% |
| DenseNet-121 [16] | 100 | 0-1 | 63.16% | 41.88% | 84.61% | 23.06% |
| DenseNet-121 [16] | 100 | z-score | 62.57% | 41.44% | 83.07% | 24.31% |
| DenseNet-121 [16] | 100 | L1 | 62.73% | 41.40% | 78.09% | 28.16% |
| DenseNet-121 [16] | 100 | L2 | 62.71% | 42.13% | 78.05% | 27.07% |

TABLE III: Frame-level detection results on the Ped1 and Ped2 datasets using several pre-trained CNNs as feature extractors, various feature normalization techniques, and approximate nearest neighbour. Please keep in mind that the lower the EER, the better, and the higher the AUC, the better.

By looking at those results, it is possible to notice that the normalization method can greatly impact anomaly detection results. Also, it is possible to see that not all networks behave the same way with regard to normalization. With ResNet-50 and Xception features, we usually obtained the best results using the z-score normalization, while with DenseNet-121 the best results were obtained with the 0-1 normalization. Considering the number of dimensions used to train the model, in most cases going from 50 to 100 features has shown to be beneficial. Nevertheless, it is important to keep in mind that increasing the number of features makes the nearest neighbour inference slower.

Now, we compare the best results obtained in our tests against classic and state-of-the-art methods on the UCSD (Ped1 and Ped2) datasets. Such a comparison is presented in Table IV. Although the CNN features do not beat the state-of-the-art results, their performance is reasonable, especially considering that the feature extractor was not trained with any data from the target task (i.e., Ped1 and Ped2 frames) and that motion information was ignored, since we only described individual frames. In particular, for Ped2 the results are comparable to the ones obtained by some state-of-the-art methods, therefore they can be considered a good baseline for this dataset. In the case of Ped1, the results are only comparable to some classic methods. We believe that one of the main reasons for this lower performance is the fact that Ped1 has changes in perspective: objects change size (in number of pixels) according to the image region they are in. Such a characteristic can hamper the performance in a setup like ours, where a single model is trained to deal with image patches from the entire frame.

| method | Ped1 AUC | Ped1 EER | Ped2 AUC | Ped2 EER |
|---|---|---|---|---|
| LMH [5] | 63.40% | 38.90% | 58.10% | 45.80% |
| MPPCA [30] | 59.00% | 40.00% | 69.30% | 30.00% |
| Social force [31] | 67.50% | 31.00% | 55.60% | 42.00% |
| Sparse reconstruction [32] | - | 19.00% | - | - |
| LSA [33] | 92.70% | 16.00% | - | - |
| Sparse combination [34] | 91.80% | 15.00% | - | - |
| MDT [18] | 81.80% | 25.00% | 82.90% | 25.00% |
| LNND [35] | - | 27.90% | - | 23.70% |
| Motion influence map [36] | - | 24.10% | - | 9.80% |
| Composition pattern [6] | - | 21.00% | - | 20.00% |
| HOFM [4] | 71.50% | 33.30% | 89.90% | 19.00% |
| AMDN [7] | 92.10% | 16.00% | 90.80% | 17.00% |
| Flow decomposition [3] | - | - | - | 31.70% |
| Adversarial discriminator [8] | 96.80% | 7.00% | 95.50% | 11.00% |
| Plug-and-play CNN [19] | 95.70% | 8.00% | 88.40% | 18.00% |
| CNN features (best) | 64.06% | 40.40% | 88.93% | 19.55% |

TABLE IV: Comparison of frame-level anomaly detection on the Ped1 and Ped2 datasets. Please note that the lower the EER, the better, and the higher the AUC, the better.

V. Conclusion

In this paper, we investigated the usage of pre-trained image classification CNNs as feature extractors to tackle the detection of unusual events in videos. According to our experiments, such networks, when paired with suitable data normalization techniques, can be very useful to detect abnormal events in security videos. This can be noticed in the Ped2 experiments, where they were able to achieve results comparable to those of state-of-the-art techniques. Another important fact regarding our experiments is that we only explored those features to model appearance anomalies; hence, we neglected the motion part of the anomalies. That being said, we strongly believe that the method proposed in this work can be combined with methods that focus on motion anomalies (e.g., [4, 3]) to obtain better results.

VI. Future Work

The main points that we intend to investigate in future research are:

  • The performance of such features on RGB surveillance videos, like the ones of the Avenue dataset [34]. Such an analysis can help to better understand whether grayscale images hamper the description capabilities of CNNs trained on RGB domains;

  • Combining our method with methods that only tackle motion-related anomalies, such as the ones presented in [4, 3], to see if their fusion can further improve performance;

  • Testing the usage of other anomaly detection techniques, for instance Isolation Forest [37], with those CNN features.

Acknowledgment

This work was supported by FAPESP (grants #2015/04883-0 and #2016/16111-4), CNPq (grant #307973/2017-4), Itaú-Unibanco and partially supported by CEPID-CeMEAI (FAPESP grant #2013/07375-0).

References

  • [1] H. M. Dee and S. A. Velastin, “How close are we to solving the problem of automated visual surveillance?” Machine Vision and Applications, 2008.
  • [2] N. Haering, P. L. Venetianer, and A. Lipton, “The evolution of video surveillance: an overview,” Machine Vision and Applications, 2008.
  • [3] M. Ponti, T. S. Nazare, and J. Kittler, “Optical-flow features empirical mode decomposition for motion anomaly detection,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  • [4] R. V. H. M. Colque, C. Caetano, and W. R. Schwartz, “Histograms of optical flow orientation and magnitude to detect anomalous events in videos,” in Conference on Graphics, Patterns and Images (SIBGRAPI), 2015.
  • [5] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, “Robust real-time unusual event detection using multiple fixed-location monitors.” IEEE Trans. Pattern Anal. Mach. Intell., 2008.
  • [6] N. Li, X. Wu, D. Xu, H. Guo, and W. Feng, “Spatio-temporal context analysis within video volumes for anomalous-event detection and localization,” Neurocomputing, 2015.
  • [7] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe, “Learning deep representations of appearance and motion for anomalous event detection.” in BMVC, 2015.
  • [8] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe, “Training adversarial discriminators for cross-channel abnormal event detection in crowds,” CoRR, vol. abs/1706.07680, 2017.
  • [9] M. A. Ponti, L. S. Ribeiro, T. S. Nazare, T. Bui, and J. Collomosse, "Everything you wanted to know about deep learning for computer vision but were afraid to ask," in 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T), 2018.
  • [10] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, ser. CVPRW '14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 512–519.
  • [11] S. Dodge and L. Karam, "Understanding how image quality affects deep neural networks," in 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), 2016.
  • [12] T. S. Nazaré, G. B. P. da Costa, W. A. Contato, and M. Ponti, "Deep convolutional neural networks and noisy images," in Iberoamerican Congress on Pattern Recognition, 2017.
  • [13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [15] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, “Anomaly detection in crowded scenes,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.
  • [18] W. Li, V. Mahadevan, and N. Vasconcelos, “Anomaly detection and localization in crowded scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • [19] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe, “Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012.
  • [21] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” CoRR, vol. abs/1801.04264, 2018.
  • [22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
  • [23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
  • [24] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vision, vol. 77, no. 1-3, pp. 125–141, May 2008.
  • [25] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, 2014.
  • [26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [27] F. Chollet et al., “Keras,” https://keras.io, 2015.
  • [28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: http://scikit-learn.org
  • [29] M. Muja and D. G. Lowe, "FLANN: Fast Library for Approximate Nearest Neighbors."
  • [30] J. Kim and K. Grauman, “Observe locally, infer globally: A space-time mrf for detecting abnormal activities with incremental updates,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2009.
  • [31] R. Mehran, A. Oyama, and M. Shah, “Abnormal crowd behavior detection using social force model.” in CVPR, 2009.
  • [32] Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in CVPR 2011, 2011.
  • [33] V. Saligrama and Z. Chen, “Video anomaly detection based on local statistical aggregates.” in CVPR.   IEEE Computer Society, 2012.
  • [34] C. Lu, J. Shi, and J. Jia, “Abnormal event detection at 150 fps in matlab,” in Proceedings of the 2013 IEEE International Conference on Computer Vision, ser. ICCV ’13, 2013.
  • [35] X. Hu, S. Hu, X. Zhang, H. Zhang, and L. Luo, “Anomaly detection based on local nearest neighbor distance descriptor in crowded scenes,” The Scientific World Journal, vol. 2014, 2014.
  • [36] D.-g. Lee, S. Member, H.-i. Suk, and S.-k. Park, “Motion Influence Map for Unusual Human Activity Detection and Localization in Crowded Scenes,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 8215, 2015.
  • [37] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation-based anomaly detection,” ACM Trans. Knowl. Discov. Data, 2012.