Any-Shot Sequential Anomaly Detection in Surveillance Videos

04/05/2020 ∙ by Keval Doshi, et al. ∙ University of South Florida

Anomaly detection in surveillance videos has been recently gaining attention. Even though the performance of state-of-the-art methods on publicly available data sets has been competitive, they demand a massive amount of training data. Also, they lack a concrete approach for continuously updating the trained model once new data is available. Furthermore, online decision making is an important but mostly neglected factor in this domain. Motivated by these research gaps, we propose an online anomaly detection method for surveillance videos using transfer learning and any-shot learning, which in turn significantly reduces the training complexity and provides a mechanism that can detect anomalies using only a few labeled nominal examples. Our proposed algorithm leverages the feature extraction power of neural network-based models for transfer learning and the any-shot learning capability of statistical detection methods.




1 Introduction

The rapid advancement of closed-circuit television (CCTV) cameras and their underlying infrastructure has led to a vast number of surveillance cameras being deployed globally, estimated to exceed 1 billion by the end of 2021 [videosurveillance]. Considering the massive amount of video generated in real time, manual video analysis by human operators becomes inefficient, expensive, and nearly impossible, which in turn creates a great demand for automated and intelligent methods for an efficient video surveillance system. An important task in video surveillance is anomaly detection, which refers to the identification of events that do not conform to the expected behavior [chandola2009anomaly].

A vast majority of recent video anomaly detection methods depend directly on deep neural network architectures [sultani2018real]. It is well known that such deep models are data-hungry. As a result, they require many labeled nominal frames and long hours of training to produce acceptable results on a new data set. Moreover, most of them are not suitable for online detection of anomalies, as they need knowledge of future video frames for appropriate normalization of the detection score.

Motivated by the aforementioned domain challenges and research gaps, we propose a hybrid use of neural networks and statistical nearest neighbor (NN) decision approach for finding video anomalies with limited training in an online fashion. In summary, our contributions in this paper are as follows:

  • We significantly reduce the training complexity by leveraging transfer learning while simultaneously outperforming the current state-of-the-art algorithms.

  • We propose a novel framework for statistical any-shot sequential anomaly detection which is capable of learning continuously and from few samples.

  • Extensive evaluation on publicly available data sets shows that our proposed framework can transition effectively between few-shot and many-shot learning.

2 Related Works

Anomaly detection methods for video surveillance can be broadly classified into two categories: traditional and deep learning-based methods. Traditional methods [saligrama2012video, dalal2005histograms, zhao2011online, mo2013adaptive] extract hand-crafted motion and appearance features, such as histogram of optical flow [chaudhry2009histograms, colque2016histograms] and histogram of oriented gradients [dalal2005histograms], to detect spatiotemporal anomalies [saligrama2012video]. The recent literature is dominated by neural network-based methods [hasan2016learning, hinami2017joint, liu2018future, luo2017revisit, ravanbakhsh2019training, sabokrou2018adversarially, xu2015learning] due to their superior performance [xu2015learning]. For instance, in [luo2017remembering], Convolutional Neural Networks (CNNs) and Convolutional Long Short-Term Memory (CLSTM) networks are used to learn appearance and motion features. More recently, Generative Adversarial Networks (GANs) have been used to generate internal scene representations based on a given frame and its optical flow, detecting deviations of the GAN output from the nominal data [ravanbakhsh2017abnormal, liu2018future]. However, there is also a significant debate on the shortcomings of neural network-based methods in terms of the interpretability, analyzability, and reliability of their decisions [jiang2018trust]. For example, [papernot2018deep, sitawarin2019defending] propose using a nearest neighbor-based approach together with deep neural network structures to achieve robustness, interpretability of the decisions made by the model, and defense against adversarial attacks. Also, deep neural networks for visual recognition typically require a large number of labeled examples for training [krizhevsky2012imagenet], which might not be available for all possible behaviors/patterns. Hence, researchers have recently begun to address the challenge of few-shot learning [koch2015siamese, sung2018learning, snell2017prototypical, vinyals2016matching]. One line of few-shot learning methods is based on the idea of transfer learning, i.e., using a pre-trained model learned from one domain in another domain [pan2009survey, yosinski2014transferable, kornblith2019better].

3 Proposed Method

Figure 1: Proposed few-shot learning framework. At each time $t$, the neural network-based feature extraction module provides motion (optical flow), location (center coordinates and area of bounding box), and appearance (class probabilities) features to the statistical anomaly detection module, which automatically sets its decision threshold to satisfy a false alarm constraint and makes online decisions.

An anomaly is construed as an unusual event which does not conform to the learned nominal patterns. However, in practical implementations, it is generally unrealistic to assume the availability of sufficient training data for all possible nominal patterns/events. Thus, a practical framework should be able to perform any-shot learning of nominal events. This presents a novel challenge to the current approaches mentioned in Section 2, as their decision mechanisms depend extensively on Deep Neural Networks (DNNs). DNNs typically require a large amount of training data, with a sufficient number of samples for each type of nominal event, or run the risk of catastrophic forgetting [kirkpatrick2017overcoming]. Also, in general, the type of anomaly that the detector might encounter is broad and unknown while training the algorithm. For example, an anomalous event can be justified on the basis of appearance (a person carrying a gun), motion (two people fighting), or location (a person walking on the roadway). To account for all such cases, we create a feature vector $f_t^i$ for each object $i$ in the frame at time $t$, where $f_t^i$ is given by $f_t^i = [w_1 f_{\text{motion}},\, w_2 f_{\text{location}},\, w_3 f_{\text{appearance}}]$. The weights $w_1, w_2, w_3$ are used to adjust the relative importance of each feature category.

3.1 Transfer Learning

Most existing works propose training specialized, data-hungry deep learning models from scratch; however, this limits their applicability to cases where abundant data is available. Also, the training time required for such models grows exponentially with the size of the training data, making them impractical to deploy in scenarios where the model needs to learn continuously. Hence, we propose to leverage transfer learning to extract meaningful features from video.

Object Detection: To obtain location and appearance features, we propose to detect objects using a pre-trained real-time object detection system such as You Only Look Once (YOLO) [redmon2016you]. YOLO offers higher frames-per-second (fps) processing while providing better accuracy compared to other state-of-the-art models such as SSD and ResNet. For online anomaly detection, speed is a critical factor, and hence we currently prefer YOLOv3 in our implementations. For each object detected in the frame at time $t$, we get a bounding box (location) along with the class probabilities (appearance). Instead of simply using the entire bounding box, we monitor the center of the box and its area to obtain the location features. In a test video, objects diverging from the nominal paths and/or belonging to previously unseen classes will help us detect anomalies, as explained in Section 3.2.
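For concreteness, the location and appearance features for one detection can be sketched as follows; the corner-format `(x1, y1, x2, y2)` box convention is an assumption about the detector wrapper, not something mandated by the paper:

```python
import numpy as np

def location_appearance_features(box, class_probs):
    """Location (box center and area) and appearance (class probabilities)
    features for one detected object. `box` is assumed to be in
    (x1, y1, x2, y2) corner format, as YOLO-style wrappers commonly return."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0   # bounding-box center
    area = (x2 - x1) * (y2 - y1)                 # bounding-box area
    return np.array([cx, cy, area]), np.asarray(class_probs)

loc, app = location_appearance_features((10, 20, 50, 100), [0.9, 0.05, 0.05])
```

Only the three location scalars and the class-probability vector are kept per object, rather than the full box or image crop.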

Optical Flow: Apart from spatial information, temporal information is also a critical aspect of videos. Hence, we propose to monitor the contextual motion of different objects in a frame using a pre-trained optical flow model such as Flownet 2 [ilg2017flownet]. We hypothesize that any kind of motion anomaly would alter the probability distribution of the optical flow for the frame. Hence, we extract the mean, variance, and the higher-order statistics skewness and kurtosis, which represent the asymmetry and sharpness of the probability distribution.
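A minimal sketch of these flow statistics, assuming the flow field arrives as an H×W×2 array of horizontal/vertical displacements (the common output layout for FlowNet-style models):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def flow_statistics(flow):
    """Summarize a dense optical-flow field by the first four moments of
    its per-pixel magnitude distribution: mean, variance, skewness, kurtosis."""
    mag = np.linalg.norm(flow, axis=-1).ravel()  # per-pixel flow magnitude
    return np.array([mag.mean(), mag.var(), skew(mag), kurtosis(mag)])

# toy 4x4 flow field with 2 channels (horizontal/vertical displacement)
rng = np.random.default_rng(0)
stats = flow_statistics(rng.normal(size=(4, 4, 2)))
```

Whether the moments are taken over flow magnitudes (as here) or over each component separately is an implementation choice the text leaves open.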

Combining the motion, location, and appearance features, for each object detected in a frame we construct a feature vector $f_t^i = [w_1\,\text{Mean},\, w_1\,\text{Variance},\, w_1\,\text{Skewness},\, w_1\,\text{Kurtosis},\, w_2 C_x,\, w_2 C_y,\, w_2\,\text{Area},\, w_3 p_1, \ldots, w_3 p_C]$ as shown in Fig. 1, where Mean, Variance, Skewness, and Kurtosis are extracted from the optical flow; $C_x$, $C_y$, and Area denote the coordinates of the center of the bounding box and the area of the bounding box (Section 3.1); and $p_1, \ldots, p_C$ are the class probabilities for the detected object (Section 3.1). Hence, at any given time $t$, with $C$ denoting the number of possible classes, the dimensionality of the feature vector is given by $m = C + 7$.
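Putting the three feature groups together is a simple concatenation; the weights below are placeholders, since the paper does not specify their values:

```python
import numpy as np

def feature_vector(flow_stats, location, class_probs, w=(1.0, 1.0, 1.0)):
    """Concatenate motion (4 optical-flow statistics), location (center
    x, y and box area), and appearance (C class probabilities) into one
    (C + 7)-dimensional feature vector, scaled by per-category weights."""
    w1, w2, w3 = w
    return np.concatenate([w1 * np.asarray(flow_stats, dtype=float),
                           w2 * np.asarray(location, dtype=float),
                           w3 * np.asarray(class_probs, dtype=float)])

# 4 flow statistics + 3 location features + 3 class probabilities -> 10-dim
f = feature_vector([0.1, 0.2, 0.0, 3.0], [30.0, 60.0, 3200.0], [0.9, 0.05, 0.05])
```

With C = 3 classes the result is a 10-dimensional vector, matching the C + 7 dimensionality stated above.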

3.2 Any-Shot Sequential Anomaly Detection

Anomaly detection in streaming video fits well into the sequential change detection framework [basseville1993detection], as we can safely assume that any anomalous event persists for an unknown but significant period of time. The eventual goal is to detect anomalies with minimal detection delay while satisfying a desired false alarm rate. Traditional parametric change detection algorithms, which require probabilistic models, cannot be implemented directly here, as no prior knowledge about the anomalous events is available. Moreover, it is unrealistic to assume that the available training data includes a sufficient number of frames for every possible nominal event. For example, while monitoring a street, the number of frames available for a car would be much larger than for a truck.

Training: In the $N$-shot video setting, given a set of $N$ nominal frames, we leverage our transfer learning module to extract the training set $\mathcal{X} = \{f_1, \ldots, f_M\}$, where $M$ is the number of detected objects and each $f_j$ is an $m$-dimensional feature vector. Assuming that the training data does not include any anomalies, $f_1, \ldots, f_M$ correspond to points in the nominal data space, distributed according to an unknown complex probability distribution. To determine the nominal data patterns in a nonparametric way, we use the $k$-nearest-neighbor (kNN) Euclidean distance to capture the interactions between the nominal data points, due to its essential traits such as analyzability, interpretability, and computational efficiency, which deep learning-based models sorely lack. Given the informativeness of the extracted motion, location, and appearance features, anomalous instances are expected to lie further away from the nominal training (support) set, which leads to statistically higher kNN distances for the anomalous instances in the test (query) set with respect to the nominal data points. The training procedure of our detector is as follows:

  1. Partition the training set $\mathcal{X}$ into two sets $\mathcal{X}_1$ and $\mathcal{X}_2$ such that $\mathcal{X}_1 \cup \mathcal{X}_2 = \mathcal{X}$.

  2. Then, for each feature vector $f_j$ in $\mathcal{X}_1$, we compute the kNN distance $d_j$ with respect to the points in $\mathcal{X}_2$.

  3. For a significance level $\alpha$, the $(1-\alpha)$th percentile $d_\alpha$ of the kNN distances $\{d_j\}$ is used as a baseline statistic for computing the anomaly evidence of test instances.
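The three training steps above can be sketched in a few lines; here `k = 4` and `alpha = 0.05` are illustrative choices, not values fixed by the paper:

```python
import numpy as np

def fit_baseline(features, k=4, alpha=0.05, seed=0):
    """Training sketch: split the nominal feature vectors into X1 and X2,
    compute each X1 point's Euclidean distance to its k-th nearest
    neighbor in X2, and keep the (1 - alpha)th percentile as d_alpha."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    half = len(features) // 2
    X1, X2 = features[idx[:half]], features[idx[half:]]
    # pairwise Euclidean distances between X1 and X2
    dists = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    knn_dist = np.sort(dists, axis=1)[:, k - 1]     # k-th NN distance
    d_alpha = np.percentile(knn_dist, 100 * (1 - alpha))
    return X2, d_alpha

X2, d_alpha = fit_baseline(np.random.default_rng(1).normal(size=(200, 10)))
```

At test time only $\mathcal{X}_2$ and the scalar $d_\alpha$ need to be retained.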

Testing: During the testing phase, for each object $i$ detected at time $t$, the sequential anomaly detection algorithm constructs the feature vector $f_t^i$ and computes the kNN distance $d_t^i$ with respect to the training instances in $\mathcal{X}_2$. Then, the instantaneous frame-level anomaly evidence $\delta_t$ is computed as

$$\delta_t = \max_i \,(d_t^i)^m - d_\alpha^m. \tag{1}$$
Finally, following a CUSUM-like procedure [basseville1993detection], we update the running decision statistic $s_t$ as

$$s_t = \max\{s_{t-1} + \delta_t,\ 0\}, \qquad s_0 = 0. \tag{2}$$
We decide that there exists an anomaly in the video if the decision statistic $s_t$ exceeds the threshold $h$. After the anomaly decision, to determine the anomalous frames, we find the frame at which $s_t$ started to grow, say $\tau_{\text{start}}$, and also determine the frame at which $s_t$ stops increasing and keeps decreasing for a certain number of consecutive frames, say $\tau_{\text{end}}$. Finally, we label the frames between $\tau_{\text{start}}$ and $\tau_{\text{end}}$ as anomalous, and continue testing for new anomalies starting with frame $\tau_{\text{end}} + 1$ by resetting $s_{\tau_{\text{end}}} = 0$.
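The test-time recursion can be sketched as follows; here the anomaly evidence is taken as the gap between the $m$-th powers of the per-frame maximum kNN distance and the baseline $d_\alpha$ (our reading of the evidence statistic), and `h` is the decision threshold:

```python
def cusum_detect(knn_dists, d_alpha, m, h):
    """CUSUM-like sequential test: accumulate per-frame anomaly evidence
    delta_t = d_t**m - d_alpha**m and alarm once the running statistic
    s_t crosses the threshold h. Returns the alarm frame index, or None."""
    s = 0.0
    for t, d in enumerate(knn_dists):   # d: max kNN distance in frame t
        delta = d ** m - d_alpha ** m   # instantaneous anomaly evidence
        s = max(s + delta, 0.0)         # CUSUM update, floored at zero
        if s >= h:
            return t                    # anomaly declared at frame t
    return None
```

On a toy stream where nominal frames have kNN distance 1.0 (below the baseline 1.2) and anomalous frames 2.0, the statistic stays at zero through the nominal prefix and alarms shortly after the anomaly begins, illustrating the short detection delay the CUSUM recursion is designed for.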

Existing works consider the decision threshold as a design parameter; however, for a practical anomaly detection algorithm, a clear procedure for selecting it is necessary. In [fap], we provide an asymptotic ($h \to \infty$) upper bound on the false alarm rate:

$$\mathrm{FAR} \le e^{-\omega_0 h}, \tag{3}$$
where $\omega_0$ is given by

$$\omega_0 = \theta - \frac{1}{v_m \phi^m}\,\mathcal{W}_{-1}\!\left(-\theta v_m \phi^m\, e^{-\theta v_m \phi^m}\right). \tag{4}$$
In (4), $\mathcal{W}_{-1}$ is the Lambert-W function (lower branch), $v_m$ is the constant for the $m$-dimensional Lebesgue measure (i.e., $v_m \phi^m$ is the $m$-dimensional volume of the hyperball with radius $\phi$), and $\phi$ is the upper bound for the kNN distance $d_t^i$. Although the expression for $\omega_0$ looks complicated, all the terms in (4) can be easily computed. Particularly, $v_m$ is directly given by the dimensionality $m$, $\theta$ comes from the training phase, $\phi$ is also found in training, and finally there is a built-in Lambert-W function in popular programming languages such as Python and Matlab. Hence, given the training data, $\omega_0$ can be easily computed, and the threshold $h$ can be chosen to asymptotically achieve the desired false alarm period $\beta$ as follows:

$$h = \frac{\log \beta}{\omega_0}. \tag{5}$$
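The threshold selection is easy to script. In the sketch below, the closed-form expression for $\omega_0$ is our plausible reading of the description (Lambert-W lower branch, the $m$-dimensional hyperball volume constant $v_m$, and training-phase quantities $\theta$ and $\phi$), not a verbatim formula from the paper; `beta` is the desired false alarm period in frames:

```python
import math
import numpy as np
from scipy.special import lambertw

def choose_threshold(beta, theta, phi, m):
    """Compute omega0 and the decision threshold h = log(beta) / omega0
    that asymptotically achieves a false alarm period of beta frames,
    via FAR <= exp(-omega0 * h). The omega0 formula below is an assumed
    reconstruction, not the paper's verbatim expression."""
    v_m = math.pi ** (m / 2) / math.gamma(m / 2 + 1)   # unit hyperball volume
    a = theta * v_m * phi ** m                          # dimensionless product
    w = lambertw(-a * np.exp(-a), k=-1).real            # lower Lambert-W branch
    omega0 = theta - w / (v_m * phi ** m)
    return omega0, np.log(beta) / omega0

omega0, h = choose_threshold(beta=np.e ** 4, theta=1.0, phi=1.0, m=2)
```

SciPy's `scipy.special.lambertw` exposes the branch index `k` directly, so the lower branch needed here is a one-liner.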
4 Experiments

Most recent works evaluate their performance on three publicly available benchmark data sets, namely the UCSD pedestrian data set, the ShanghaiTech campus data set, and the CUHK Avenue data set. Even though each data set has its own set of challenges, all of them share a common definition of nominal behavior. This makes them susceptible to trivial algorithmic designs which can achieve competitive results, as there is a very obvious shift between the nominal and anomalous distributions. Hence, to make the problem more challenging and to test the any-shot learning capabilities of different state-of-the-art algorithms, we also test on a modified version of the UCSD data set. For performance evaluation, following the existing works [cong2011sparse, ionescu2019object, liu2018future], we use the frame-level Area under the Curve (AuC) metric.

Any-shot learning: As compared to the original UCSD data set, where a person riding a bike is considered anomalous, in this case we assume that it is nominal behavior with very few training samples. Our goal here is to compare the any-shot learning capability of the proposed and the state-of-the-art algorithms, and to see how well they adapt to new patterns. In this case, in addition to the available training data, we also train on a few samples of a person riding a bike. In Fig. 2, it is seen that the proposed algorithm clearly outperforms the state-of-the-art algorithms [ionescu2019object, liu2018future] in terms of any-shot learning performance. It is important to note that for video applications, 10 shots correspond to less than a second of video in real time.

Figure 2: Comparison of the proposed and state-of-the-art algorithms Liu et al. [liu2018future] and Ionescu et al. [ionescu2019object] in terms of any-shot learning. The proposed algorithm is able to transition well between few-shot and many-shot learning.

Benchmark Datasets: To show the competitive performance of the proposed algorithm with large training data, we compare our results on the entire data sets against a wide range of methods in Table 1. We should note here that our reported results on all the data sets are based on online decision making without seeing future video frames. A common technique used by several recent works [liu2018future, ionescu2019object] is to normalize the computed statistic for each test video independently, which is not suitable for online detection.

Methodology                        | CUHK Avenue | UCSD Ped 2 | ShanghaiTech
Conv-AE [hasan2016learning]        | 80.0        | 85.0       | 60.9
ConvLSTM-AE [luo2017remembering]   | 77.0        | 88.1       | -
Stacked RNN [luo2017revisit]       | 81.7        | 92.2       | 68.0
GANs [ravanbakhsh2018plug]         | -           | 88.4       | -
Liu et al. [liu2018future]         | 85.1        | 95.4       | 72.8
Sultani et al. [sultani2018real]   | -           | -          | 71.5
Ours                               | 86.4        | 97.8       | 71.62
Table 1: AuC result comparison on three datasets.

5 Conclusion

For video anomaly detection, we presented an online anomaly detection algorithm which consists of a transfer learning-based feature extraction module and a statistical decision making module. The first module efficiently minimizes the training complexity and extracts motion, location, and appearance features. The second module is a sequential anomaly detector which enables a clear procedure for selecting the decision threshold through asymptotic performance analysis. Through experiments on publicly available data, we showed that the proposed detector significantly outperforms state-of-the-art algorithms in terms of any-shot learning of new nominal patterns.