A Modular and Unified Framework for Detecting and Localizing Video Anomalies

03/21/2021 ∙ Keval Doshi, et al. ∙ University of South Florida

Anomaly detection in videos has been attracting an increasing amount of attention. Despite the competitive performance of recent methods on benchmark datasets, they typically lack desirable features such as modularity, cross-domain adaptivity, interpretability, and real-time anomalous event detection. Furthermore, current state-of-the-art approaches are evaluated using the standard instance-based detection metric by considering video frames as independent instances, which is not ideal for video anomaly detection. Motivated by these research gaps, we propose a modular and unified approach to the online video anomaly detection and localization problem, called MOVAD, which consists of a novel transfer learning based plug-and-play architecture, a sequential anomaly detector, a mathematical framework for selecting the detection threshold, and a suitable performance metric for real-time anomalous event detection in videos. Extensive performance evaluations on benchmark datasets show that the proposed framework significantly outperforms the current state-of-the-art approaches.

1 Introduction

With the increasing demand for security, growing storage and processing capabilities, and decreasing cost of electronics, surveillance cameras have been widely deployed [50]. Due to the exponential increase in the number of CCTV cameras, the amount of video generated far surpasses our ability to analyze it manually. Automated detection of anomalies in video is challenging since the definition of "anomaly" is ambiguous: any event that does not conform to "normal" behaviors can be considered an anomaly. For example, a person riding a bike is usually nominal behavior; however, it may be considered anomalous if it occurs in a restricted space.

Specifically, due to the important role video anomaly detection plays in ensuring safety, security, and sometimes the prevention of potential catastrophes, a major required functionality of a video anomaly detection system is real-time decision making. While there is a lot of prior work on anomaly detection in surveillance videos, it mainly focuses on the offline localization of anomalies in video frames following an instance-based binary hypothesis testing approach, ignoring the online (i.e., real-time) detection of anomalous events. For example, most existing works, e.g., [13, 21, 50], employ a video normalization technique that requires an entire video segment for computation. They also typically depend on the assumption that there is an anomaly in the video segment. In practice, this assumption either will not hold for short video segments (on the order of minutes) or will cause long delays in detecting anomalous events for sufficiently long video segments (on the order of days).

The automated video surveillance literature lacks a clear distinction between online anomalous event detection and offline anomalous frame localization [21, 13, 33, 31, 28]. While the commonly used frame-level AUC (area under the ROC curve), which is borrowed from instance-based binary hypothesis testing, might be a suitable metric for localizing anomalies in video frames, it ignores the temporal nature of videos and fails to capture the dynamics of detection results; e.g., a detector that detects only a late portion of an anomalous event and alarms the user after a long delay can achieve the same frame-level AUC as a detector that quickly detects the anomalous event and alarms the user in a timely manner but misses some anomalous frames afterwards. While minimizing the delay in detecting an anomalous event is critical [27], it is also necessary to control the false alarm rate. Hence, a video anomaly detector should aim to judiciously raise alarms in a timely manner.

For practical implementations, it is unrealistic to assume the availability of training data sufficient to encompass all possible nominal events/behaviors. Thus, a practical framework should also be able to perform few-shot adaptation to new nominal scenarios over time. This presents a novel challenge to the current approaches discussed in Section 2, as their decision functions heavily depend on Deep Neural Networks (DNNs) [7]. DNNs typically require a large amount of training data to learn a new nominal pattern, or risk catastrophic forgetting with incremental updates [15].

Another limitation of existing methods is their lack of interpretability due to the inclination towards end-to-end deep learning based models, leading to a semantic gap between the visual features and the real interpretation of events [30]. While such models perform well on some benchmark datasets, i.e., they are easily able to detect a certain category of anomalies, they cannot adequately generalize to other types of anomalies. For example, [30, 38, 28] propose pose estimation based frameworks, and hence are only able to detect human-related anomalies. Moreover, there is no straightforward way to modify such methods to target a different class of anomaly since they are based on intricately designed neural networks.

Our goal in this paper is to present a more systematic framework for video anomaly detection and localization, and tackle practical challenges such as few-shot adaptation, which is largely unexplored in the existing literature. In summary, our contributions in this paper are as follows:

  • We present a systematic unified framework for online event detection and offline frame localization for video anomalies, and propose a new performance metric for online event detection.

  • We propose a modular transfer learning based anomaly detection architecture which can be easily modified to target specific anomaly categories and can easily adapt to new scenarios using a few samples (cross-domain adaptivity).

  • We introduce a statistical technique for the selection of detection threshold to satisfy a desired false alarm rate.

2 Related Works

There is a fast-growing body of research investigating anomaly detection in videos. A key component of computer vision problems is the extraction of meaningful features. In video surveillance, the extracted features should be capable of capturing the difference between nominal and anomalous events within a video [7]. While some methods use supervised learning to train on both nominal and anomalous events [20, 17], the majority of existing research concentrates on semi-supervised learning due to the limited availability of annotated anomalous instances. Early anomaly detection methods used handcrafted approaches which extract different types of motion information in the form of histograms of oriented gradients (HOGs) [2, 4] and optical flow. Another category is sparse coding-based methods [49], which learn a dictionary of nominal sparse events and attempt to detect anomalies based on the reconstructability of the video from the dictionary atoms. For example, [29] uses sparse reconstruction to learn joint trajectory representations of multiple objects. These approaches, while computationally inexpensive, often fail to capture complex anomalous patterns. The recent literature, however, has been dominated by Convolutional Neural Network (CNN) based methods [10, 11, 24, 36, 40, 47, 28, 33, 13] due to their significantly superior detection performance. Recently, transfer learning based object detection methods have also been frequently used [6, 7, 13, 9] to learn appearance features. The neural network-based methods can be broadly divided into reconstruction-based methods [10, 35, 3, 13] and prediction-based methods [21, 23, 39]. However, these CNNs require a significant amount of training to adapt to a new scenario. Hence, few-shot learning has recently been gaining attention in the computer vision literature [16, 44, 41, 45, 22, 23]. However, no significant progress has been made yet in few-shot scene adaptation for video surveillance. Hence, in this work, we primarily compare our few-shot adaptation performance with [23], which proposes a meta-learning algorithm for cross-domain adaptivity.

3 Proposed Method

Figure 1: Proposed MOVAD framework. At each time $t$, the neural network-based feature extraction module provides location (center coordinates and area of bounding box), appearance (class probabilities), global motion (optical flow), and local motion (pose estimation) features to the statistical anomaly detection module, which computes the $k$NN distance for anomaly evidence using a fully connected neural network and sequentially decides on anomalous events using an RNN. In human pose estimation, the single person pose estimation (SPPE) output is converted to multi-person pose features.

3.1 Motivation

In the recent anomaly detection literature, most of the proposed methods consist of training a deep neural network on available nominal samples. However, such an approach has several shortcomings. First, the applicability of such a method is limited to a few scenarios where there is a drastic change in the appearance or motion of an object. In [6], it is shown that modifying the benchmark datasets results in a significant drop in the performance of state-of-the-art algorithms. Second, to the best of our knowledge, there is no existing method that can be easily modified or extended to a new category of anomalies. For example, even recent algorithms such as [48, 21, 32] cannot detect (or be modified to detect) anomalies pertaining to changes in human poses. Third, because of the extensive use of end-to-end learning in recent algorithms, the models lack interpretability. While there are certain supervised methods, e.g., [42], which are capable of recognizing the type of anomaly, they depend on the availability of anomalous data. Finally, existing methods also lack a clear procedure for incorporating new knowledge, and would likely necessitate significant changes to the existing architecture.

Motivated by these shortcomings, we propose a modular framework, called Modular Online Video Anomaly Detector (MOVAD), consisting of deep learning-based feature extraction and statistical anomaly detection, as shown in Fig. 1. In particular, transfer learning based convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are used to extract informative features, followed by a novel $k$NN-based neural network and an RNN-based sequential anomaly detector. The choice of separating the feature extraction module and the decision module also enables theoretical performance analysis and a closed-form expression for the detection threshold. In the following sections, we discuss our framework in detail.

3.2 Transfer Learning-Based Feature Extraction

In general, the end-to-end training of DNNs for video anomaly detection necessitates focusing on a particular aspect in which anomalies may occur, such as object appearance, motion, or pose, and extracting only those features. However, even in the same scene, anomalous events may manifest in different aspects. Hence, advanced video anomaly detectors should utilize features from multiple aspects together. For instance, biological vision systems extract different features in the visual cortex, such as appearance, global motion, and local motion [46]. To this end, we propose a flexible feature extraction module that can work with various modalities, which enables a plug-and-play modular architecture. This means that although appearance, global motion, and local motion features are considered in this paper, the proposed framework can be easily modified to add new feature extractors or remove existing ones. Furthermore, entirely retraining a video anomaly detector for a new scene/domain is typically not necessary since most domains share the same feature types (appearance, global motion, local motion, etc.). As a result, to significantly reduce the training computational complexity, a transfer learning approach is utilized in the proposed framework. We next explain the considered feature extractors, which work in parallel as shown in Fig. 1.
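Before detailing them, the following is a minimal sketch of how such a plug-and-play composition could be organized; the class and method names are illustrative assumptions, not the paper's released code.

```python
# Illustrative plug-and-play feature extraction interface (names assumed).
from abc import ABC, abstractmethod
import numpy as np

class FeatureExtractor(ABC):
    """One modality: appearance, global motion, local motion, ..."""
    @abstractmethod
    def extract(self, frame: np.ndarray, detections: np.ndarray) -> np.ndarray:
        """Return a feature matrix with one row per detected object."""

class ModularExtractor:
    """Runs all registered modality extractors and concatenates features."""
    def __init__(self):
        self.modules = []

    def register(self, module: FeatureExtractor):
        # Plugging in a new modality (or omitting one) requires no
        # change to the rest of the pipeline.
        self.modules.append(module)

    def extract(self, frame, detections):
        feats = [m.extract(frame, detections) for m in self.modules]
        return np.concatenate(feats, axis=1)  # one row per object
```

Retargeting the detector to a new anomaly category then amounts to registering or removing a modality, leaving the downstream statistical detector untouched.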

Object Appearance: A pre-trained object detection system is used to detect objects and extract appearance and spatial features. Since we do not assume any prior knowledge about the type of anomalies, and hence by extension the object classes, we use a model trained on the MS-COCO dataset. For online anomaly detection, real-time operation is a critical factor; hence, we currently prefer the You Only Look Once (YOLO) [37] algorithm, specifically YOLOv4, in our implementation. It should be noted that the choice of the object detector is not critical for the proposed framework and can be adjusted according to the application. Using the object detector, we extract the bounding box (location) as well as the class probabilities (appearance) for each object detected in a given frame. Instead of directly using the bounding box coordinates, we compute the center and area of the box and leverage them as our spatial features. During testing, any object belonging to a previously unseen class and/or deviating from the known nominal paths contributes to an anomalous event alarm.
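As an illustration of this spatial feature computation, the sketch below turns detections into center/area features and appends the class probabilities; the (x1, y1, x2, y2) box layout is an assumption about the detector output, not a fixed interface of the paper.

```python
import numpy as np

def appearance_features(boxes: np.ndarray, class_probs: np.ndarray) -> np.ndarray:
    """boxes: (N, 4) [x1, y1, x2, y2] per detected object (assumed layout);
    class_probs: (N, C) class probabilities from the object detector.
    Returns per-object [center_x, center_y, area, p_1, ..., p_C]."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    spatial = np.stack([cx, cy, area], axis=1)   # location features
    return np.concatenate([spatial, class_probs], axis=1)
```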

Global Motion: Apart from spatial and appearance features, capturing the motion of different objects is also critical for detecting anomalies in videos. Hence, to monitor the contextual motion of different objects, we propose using a pre-trained optical flow model such as FlowNet 2 [12]. We hypothesize that objects with an unusually high/low optical flow intensity would exhibit anomalous behavior. Thus, the mean and variance of the optical flow for each detected object are used as our global motion features.
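A minimal sketch of these statistics is given below, assuming a dense (H, W, 2) flow field such as the FlowNet 2 output; the exact pooling used in the paper may differ.

```python
import numpy as np

def global_motion_features(flow: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) dense optical flow; boxes: (N, 4) [x1, y1, x2, y2].
    Returns the mean and variance of the flow magnitude in each box."""
    mag = np.linalg.norm(flow, axis=2)  # (H, W) per-pixel flow intensity
    feats = []
    for x1, y1, x2, y2 in boxes.astype(int):
        patch = mag[y1:y2, x1:x2]
        feats.append([patch.mean(), patch.var()])
    return np.asarray(feats)
```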

Local Motion: To study the social behavior in a video, it is important to monitor human motion closely. For inanimate objects like cars, trucks, and bikes, monitoring the optical flow is sufficient to judge whether they exhibit some sort of anomalous behavior. However, with regard to humans, we also need to monitor their poses to determine whether an action is anomalous. Hence, a pre-trained multi-person pose estimator such as AlphaPose [8] is used to extract skeletal trajectories.

3.3 Statistical Anomaly Detection

Anomaly Evidence: Given the various extracted features, the next step in the proposed framework is to compute an anomaly evidence score for each video frame in an online fashion. Due to its favorable characteristics, such as interpretability and theoretical tractability, we use the $k$-nearest-neighbor ($k$NN) distance as anomaly evidence. For a feature vector $f$ representing each object in a frame, our objective is to compute its Euclidean distance $d_k(f)$ to the $k$th nearest feature vector in the nominal training set. Since the exact $k$NN distance computation becomes expensive with increasing training size, for scalability, we propose training a fully connected neural network with parameters $\theta$, which takes $f$ as the input and gives an accurate approximation $\hat{d}_k(f;\theta)$ to $d_k(f)$. The objective function for training the $k$NN neural network is given by

$$\min_{\theta} \; \frac{1}{M} \sum_{i=1}^{M} \left( \hat{d}_k(f_i;\theta) - d_k(f_i) \right)^2 + \lambda \lVert \theta \rVert_2^2, \tag{1}$$

where $M$ is the number of feature vectors in the training set and $\lambda \lVert \theta \rVert_2^2$ is the regularization term. The number of neighbors $k$ determines a trade-off between sensitivity to anomalies and robustness to nominal outliers. While smaller $k$ values make the system more sensitive to real anomalies, they may also make it more vulnerable to nominal outliers. However, the choice of $k$ is not critical for the detection performance since the proposed sequential detection module does not decide directly on individual anomaly evidences. As shown next, through the internal memory of the RNN structure, it accumulates the evidences to detect anomalous events, and hence does not typically raise an alarm due to a single outlying frame.
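A sketch of the $k$NN regression network is given below, with the architecture from Sec. 3.3 (three hidden layers of 20 neurons): the training targets are exact $k$NN distances precomputed once over the nominal set, the optimizer settings are our assumptions, and the regularization term of Eq. (1) is realized as weight decay.

```python
import torch
import torch.nn as nn

class KNNRegressor(nn.Module):
    """Fully connected approximator for the kNN distance d_k(f)."""
    def __init__(self, in_dim: int = 18, hidden: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f):
        return self.net(f).squeeze(-1)

def train_regressor(features, knn_dists, epochs=200, lam=1e-4):
    """features: (M, 18) nominal feature vectors; knn_dists: (M,) exact
    kNN distances computed once with a brute-force neighbor search."""
    model = KNNRegressor(features.shape[1])
    # weight_decay plays the role of the L2 regularization term in Eq. (1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=lam)
    x = torch.as_tensor(features, dtype=torch.float32)
    y = torch.as_tensor(knn_dists, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model
```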

Online Anomaly Detection: To accommodate the temporal continuity of video data and detect anomalous events in an online fashion, a sequential statistical decision making method based on an RNN is proposed. The anomaly evidence scores (i.e., $k$NN distances) from streaming video frames provide an informative time series which typically takes large values when an anomalous event starts. However, to avoid false alarms due to outlying large evidences from nominal frames, the proposed framework does not decide using individual evidences, but instead utilizes the temporal information inherent in the evidence time series (i.e., an anomalous event consists of a number of successive anomalous frames). Specifically, it takes the streaming $k$NN distances $\{d_t\}$ as input and updates an internal state, which is then passed through a ReLU activation function to yield the decision statistic $s_t$. The time series is obtained by taking the largest $k$NN distance among the objects in each frame, i.e., $d_t = \max_i \hat{d}_k(f_t^i)$. The output neuron of the RNN compares $s_t$ with a threshold $h$ to raise an alarm if $s_t \ge h$, or continue with the next frame otherwise. Note that the RNN structure can be expanded to accept multiple time series (in addition to $k$NN distances) and to have deeper layers if desired. While $k$NN distances are available for the nominal class, there are no such scores for the anomaly class to train the RNN in the considered semi-supervised setup. Hence, synthetic $k$NN distances are generated uniformly in the interval $[\phi_\alpha, d_{\max}]$, where $\alpha$ is a statistical significance level, $\phi_\alpha$ is the $(1-\alpha)$th percentile of the nominal $k$NN distances in the training set, and $d_{\max}$ is the maximum nominal distance in the training set.

To circumvent training with synthetic data and to obtain a closed-form expression for the threshold $h$, we also propose a simplified decision rule. Motivated by the resemblance of the memory (internal state) and ReLU operations of the RNN to the minimax optimum sequential change detection algorithm CUSUM [1], we consider fixing the RNN weights to obtain the simplified decision statistic $s_t = \max\{s_{t-1} + \hat{d}_t, 0\}$, $s_0 = 0$. In this update rule, the weights of the internal state and the input are set to one, where the input $\hat{d}_t$ is the $k$NN distance normalized using the dimensionality $m$ of the feature vectors $f$. In our experiments, the simplified detector gave very similar results to the general RNN detector. With the weights set to one, there is no need to train the RNN, and the simplified decision statistic lends itself to theoretical analysis to derive a closed-form expression for the threshold $h$, as explained next.
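Before stating the result, note that the simplified statistic itself is only a few lines; the sketch below assumes the normalized distances $\hat{d}_t$ and the threshold $h$ from Eq. (2) are already available.

```python
def sequential_detect(d_hat_stream, h):
    """Simplified CUSUM-like rule: s_t = max(s_{t-1} + d_hat_t, 0).
    d_hat_stream yields normalized kNN distances (one per frame);
    returns the first frame index where s_t >= h, or None."""
    s = 0.0
    for t, d_hat in enumerate(d_hat_stream):
        s = max(s + d_hat, 0.0)  # accumulate evidence, reset at zero
        if s >= h:
            return t             # raise alarm at frame t
    return None
```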

Theorem 1

As the training size grows ($M \to \infty$), the false alarm rate of the proposed simplified detector based on $s_t$ is upper bounded by $e^{-\omega_0 h}$, and the threshold can be set as

$$h = \frac{-\log \beta}{\omega_0} \tag{2}$$

to asymptotically satisfy a desired false alarm constraint $\mathrm{FAR} \le \beta$. The constant $\omega_0$ is computed from the training data and given by

(3)

where $\mathcal{W}(\cdot)$ is the Lambert-W function, $v_m$ is the constant for the $m$-dimensional Lebesgue measure (i.e., $v_m r^m$ is the $m$-dimensional volume of the hyperball with radius $r$), and $\tau$ is the upper bound for $\hat{d}_t$.

Proof. See the supplementary file.

Although the expression for $\omega_0$ looks complicated, all the terms in Eq. (3) can be easily computed. Particularly, $v_m$ is directly given by the number of features $m$, $\phi_\alpha$ comes from the training phase, $\tau$ is also found in training, and finally there is a built-in Lambert-W function in popular programming languages such as Python and Matlab. Hence, given the training data, $\omega_0$ can be easily computed, and the threshold $h$ can be chosen using Eq. (2) to asymptotically achieve the desired false alarm rate $\beta$.
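As a sketch, once $\omega_0$ has been obtained from the training data via Eq. (3), the threshold selection of Eq. (2) is a one-liner:

```python
import numpy as np
# scipy.special.lambertw provides the Lambert-W function needed in Eq. (3).

def select_threshold(omega0: float, beta: float) -> float:
    """Eq. (2): h = -log(beta) / omega0, so that the asymptotic false
    alarm rate satisfies FAR <= beta."""
    return -np.log(beta) / omega0

# e.g., h = select_threshold(omega0, beta=1e-4) targets at most one
# false alarm per 10,000 frames on average (asymptotically).
```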

The decision threshold $h$ is a key parameter that is common to all existing anomaly detection algorithms, and yet it is often overlooked. Since an alarm is raised when the test statistic crosses the threshold, choosing an appropriate threshold is critical for controlling the number of false alarms and minimizing the need for human involvement. In a practical setting, without a clear procedure for selecting the decision threshold, an exhaustive empirical process is needed to calibrate the threshold for an acceptable false alarm rate.

New Performance Metric for Online Detection: Low detection delay is a crucial requirement in most video-related applications such as autonomous driving [19] and automated video surveillance. However, the detection delay, i.e., the time an algorithm requires to detect an anomalous event, is largely unexplored in the field of video anomaly detection. The popular performance metric in the video anomaly detection literature, AUC, cannot effectively evaluate the performance of online anomaly detection algorithms [18]. Hence, we present a new performance metric called APD (Average Precision as a function of Delay), which is based on the average detection delay and the precision. The proposed delay metric is given by

$$\mathrm{APD} = \int_0^1 P(\delta) \, \mathrm{d}\delta, \tag{4}$$

where $\delta$ denotes the normalized average detection delay and $P(\delta)$ denotes the precision. The average detection delay is normalized by the largest possible delay, either defined by a performance requirement or by the length of natural cuts in the video stream, such as the video segments in the benchmark datasets (see Sec. 4.1).
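Under this reading of Eq. (4), APD can be estimated as the area under the precision vs. normalized-delay curve traced by sweeping the detection threshold $h$; the sketch below makes this explicit.

```python
import numpy as np

def apd(delays, precisions):
    """delays: normalized average detection delays in [0, 1], one per
    threshold setting; precisions: the corresponding precisions P(delta).
    Returns the trapezoidal estimate of the area under the curve."""
    order = np.argsort(delays)
    d = np.asarray(delays, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    # trapezoidal rule over the sorted operating points
    return float(np.sum((d[1:] - d[:-1]) * (p[1:] + p[:-1]) / 2.0))
```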

Offline Localization: Once an anomalous event is detected, the detection instant is marked as the starting point, and the decision statistic continues to be updated as usual to determine the end point. When the decision statistic drops for a number of consecutive frames (five frames worked well in our experiments), the beginning of the drop window is marked as the end point. Finally, the frames between the start and end points are labeled as anomalous.
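A sketch of this start/end marking rule is given below; the five-frame drop window follows the text, while the function and variable names are our own.

```python
def localize(stats, alarm_t, drop_window=5):
    """stats: per-frame decision statistic s_t; alarm_t: alarm frame.
    Returns (start, end): start is the detection instant, end is the
    beginning of the first run of drop_window consecutive drops."""
    drops = 0
    for t in range(alarm_t + 1, len(stats)):
        drops = drops + 1 if stats[t] < stats[t - 1] else 0
        if drops == drop_window:
            return alarm_t, t - drop_window + 1
    return alarm_t, len(stats) - 1  # statistic never settled: label to end
```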

Implementation Details: In our implementation, the number of neighbors $k$ is fixed; as indicated in Section 3.3, the detection performance is not sensitive to the choice of $k$. The detection performance is instead controlled by the decision threshold $h$, which can be set mathematically following Eq. (2). For the $k$NN regression network, we use a fully connected deep neural network with 3 hidden layers consisting of 20 neurons each. We empirically chose the simplest network that gave a sufficiently low prediction error. The feature vector is 18-dimensional for each detected object and consists of 15 class probabilities (appearance), the mean and variance of the optical flow in the bounding box (global motion), and the prediction error of the pose if the object is a human (local motion). Global and local motion features are normalized to [0, 1] using the min and max values from the training data.

4 Experiments

In this section, we first briefly discuss the benchmark datasets and the evaluation metrics. Then, we provide a detailed comparison between the proposed algorithm and the state-of-the-art algorithms in terms of online detection and offline localization. We also evaluate our few-shot adaptation performance.

4.1 Datasets

We consider four publicly available benchmark datasets, namely the CUHK Avenue dataset, the UCSD pedestrian dataset, the ShanghaiTech campus dataset, and the UR fall dataset.

UCSD Ped 2: The UCSD pedestrian dataset is one of the most widely used video anomaly detection datasets. Due to the low resolution of the UCSD Ped 1 videos, we only consider the UCSD Ped 2 dataset. The Ped 2 dataset consists of 16 training videos and 12 test videos. The anomalous events are caused by vehicles such as bicycles, skateboards, and wheelchairs. Although it is widely used as a benchmark, most of its anomalies are obvious and can be easily detected from a single frame.

CUHK Avenue: Another popular dataset is the CUHK Avenue dataset, which consists of short video clips taken from a single outdoor surveillance camera looking at the side of a building with a pedestrian walkway in front of it. It contains 16 training and 21 test videos with a frame resolution of 360 × 640.

ShanghaiTech: The ShanghaiTech dataset is one of the largest and most challenging datasets available for anomaly detection in videos. It consists of 330 training and 107 test videos from 13 different scenes, which sets it apart from the other available datasets. The resolution of each video frame is 480 × 856.

UR Fall: While the UR fall dataset is not commonly used for video anomaly detection, it has recently been proposed for testing the generalization capability of anomaly detection algorithms [23]. This dataset contains 70 depth videos collected with a Microsoft Kinect camera in a nursing home, and the anomalies consist of a person falling in a closed room.

4.2 Results

Online Detection: Since the proposed online detection formulation is event-based rather than frame-based, it considers an anomaly as a single event irrespective of the duration over which it occurs. In this setup, we present our results only on the ShanghaiTech dataset, as the UCSD and CUHK Avenue datasets have fewer than 50 anomalous events, which is not enough for a reliable average performance comparison. A common technique used by several recent works [21, 13, 30, 32], including on the ShanghaiTech dataset, is to normalize the computed statistic for each test video independently. However, this methodology cannot be implemented in an online (real-time) system as it requires prior knowledge of the minimum and maximum values the statistic might take. Moreover, many recent methods [13, 23, 31] do not have their implementation details/code publicly available, while others are end-to-end [31, 33, 38] and cannot be implemented to work in an online fashion. Hence, we compare our method with the online versions of [21, 30, 26]. As shown in Fig. 2, our proposed algorithm achieves better performance than the other algorithms in terms of quick detection and high precision in alarms. This result is also summarized in Table 1 in terms of the APD values.

Figure 2: Comparison of the proposed and the state-of-the-art algorithms Liu et al. [21] and Morais et al. [30] in terms of online detection capability. The proposed algorithm has a significantly higher precision for any given detection delay.
Online Detection
Methodology APD
Liu et al. [21] 0.504
Morais et al. [30] 0.324
Luo et al. [26] 0.447
Ours 0.705
Table 1: Online detection comparison in terms of the proposed APD metric on the ShanghaiTech dataset. Higher APD value represents a better online anomaly detection performance.
Figure 3: Threshold selected according to Eq. (2) satisfies the desired lower bound on false alarm period (i.e., upper bound on false alarm rate) even in the non-asymptotic regime with the finite sample size of the CUHK Avenue dataset.

Threshold Selection: We next evaluate the non-asymptotic use of the asymptotic threshold expression given in Eq. (2). As shown in Fig. 3, even with the limited data size of the CUHK Avenue dataset, the derived expression satisfies the desired upper bound on the false alarm rate, which corresponds to a lower bound on the false alarm period (inverse rate) in the figure.

Offline Localization
Methodology CUHK Avenue UCSD Ped 2 ShanghaiTech
MPPCA [14] - 69.3 -
Del et al. [5] 78.3 - -
Conv-AE [10] 80.0 85.0 60.9
ConvLSTM-AE[25] 77.0 88.1 -
Growing Neural Gas [43] - 93.5 -
Stacked RNN[24] 81.7 92.2 68.0
Deep Generic [11] - 92.2 -
GANs [34] - 88.4 -
Future Frame [21] 85.1 95.4 72.8
Skeletal Trajectory [30] - - 73.4
Multi-timescale Prediction [38] 82.85 - 76.03
Memory-guided Normality [32] 88.5 97.0 70.5
Ours 88.7 97.2 73.62
Table 2: Offline anomaly localization comparison in terms of frame-level AUC on three datasets.
Target Methods 1-shot (K=1) 5-shot (K=5) 10-shot (K=10)
UCSD Ped 2 Pre-trained (ShanghaiTech) 81.95 81.95 81.95
Pre-trained (UCF Crime) 62.53 62.53 62.53
r-GAN (ShanghaiTech) 91.19 91.8 92.8
r-GAN (UCF Crime) 83.08 86.41 90.21
Ours 93.19 95.91 96.01
CUHK Avenue Pre-trained (ShanghaiTech) 71.43 71.43 71.43
Pre-trained (UCF Crime) 71.43 71.43 71.43
r-GAN (ShanghaiTech) 76.58 77.1 78.79
r-GAN (UCF Crime) 72.62 74.68 79.02
Ours 80.18 80.21 80.68
UR Fall Pre-trained (ShanghaiTech) 64.08 64.08 64.08
Pre-trained (UCF Crime) 50.87 50.87 50.87
r-GAN (ShanghaiTech) 75.51 78.7 83.24
r-GAN (UCF Crime) 74.59 79.08 81.85
Ours 86.11 88.7 91.28
Table 3: Few-shot scene adaptation comparison of the proposed and the state-of-the-art [23] algorithms in terms of frame-level AUC. The proposed algorithm is able to quickly adapt to new scenarios.

Offline Localization: To show the offline localization capability of our algorithm, we also compare our algorithm to a wide range of state-of-the-art methods, as shown in Table 2, using the frame-level AUC criterion. The pixel-level criterion, which focuses on the spatial localization of anomalies, can be made equivalent to the frame-level criterion through simple post-processing techniques [33]. Hence, for offline anomaly localization, we consider frame-level AUC criterion. While [13] recently showed significant gains over the other algorithms, their methodology of computing the average AUC over an entire dataset gave them an unfair advantage. Specifically, as opposed to determining the AUC on the concatenated videos, first the AUC for each video segment was computed and then those AUC values were averaged. As shown in Table 2, our proposed algorithm outperforms the existing algorithms on the UCSD Ped 2 and CUHK Avenue datasets, and performs competitively on the ShanghaiTech dataset. The multi-timescale framework [38] is the only one that outperforms ours on the ShanghaiTech dataset since the anomalies are mostly caused by previously unseen human poses and [38] extensively monitors them using a past-future trajectory prediction based framework. However, this causes their performance to severely degrade on the CUHK Avenue dataset, and similar to [30], they cannot work on the UCSD dataset.

Few-Shot Scene Adaptation: Our goal here is to compare the few-shot scene adaptation capability of the proposed algorithm and see how well it can generalize to new scenarios. In this case, we only use a few scenes from a specific scenario to adapt. However, few-shot scene adaptation is mostly unexplored and to the best of our knowledge only [23] discusses it. Hence, following the experimental setup defined in [23], we use K-shots to adapt to a new scenario, where 1-shot is a sequence of 10 frames. From [23], we use the following baselines for comparison.

Pre-trained: This baseline learns the model from videos available during training, then directly applies the model in testing without any adaptation.

r-GAN: We also compare with the few-shot scene-adaptive anomaly detection model proposed in [23], which uses a meta-learning framework. It employs a GAN-based architecture similar to [21] and the MAML algorithm for meta-learning.

As compared to the pre-trained and r-GAN models, which need considerable training on either the ShanghaiTech or UCF Crime [42] dataset, our transfer learning based algorithm (pre-trained on generic datasets such as MS-COCO) is able to leverage the optical flow model, which requires minimal computation, to establish a baseline and adapt the decision parameters to a new scene. Due to the lack of available training data, we are unable to use the local motion and appearance features meaningfully; hence, our features depend only on the optical flow statistics. However, as shown in Table 3, we are still able to outperform the compared methods in terms of the frame-level AUC.

Figure 4: The proposed model is able to interpret the cause of the anomaly correctly.
Figure 5: The advantage of sequential anomaly detection over a single-shot detector. It is seen that a sequential detector can significantly reduce the number of false alarms.

4.3 Ablation Study

ShanghaiTech
Module AUC
Object Detection 0.594
Optical Flow 0.703
Pose Estimation 0.652
Table 4: Performance of each module in terms of the frame-level AUC on the ShanghaiTech dataset.

In Table 4, we present the results for each module of the proposed MOVAD framework on the ShanghaiTech dataset. While it is clear that optical flow is the major contributor among the modules in this dataset, each module serves a specific purpose. Although several recent works perform closely to the proposed framework on this dataset, a distinguishing advantage of MOVAD is its interpretability. By leveraging the statistical nature of our decision making module, it is possible to determine the cause of an increase in the decision statistic. In Fig. 4, we present sample scenarios from the CUHK Avenue and UCSD datasets, in which the proposed detector is able to evaluate the statistics from each module and justify the cause of the anomaly. However, since no ground truth is available in terms of the description of the anomaly, we were unable to quantitatively evaluate the interpretability performance of MOVAD.

Impact of Sequential Detection: To emphasize the significance of the proposed sequential detection method, we compare against a nonsequential version of our algorithm obtained by applying a threshold to the instantaneous anomaly evidence (Sec. 3.3), which is similar to the approach employed by many recent works [21, 42, 13]. As shown in Fig. 5, the proposed sequential statistic handles noisy evidence by integrating recent evidence over time. On the other hand, the instantaneous anomaly evidence is more prone to false alarms since it decides using only the noisy evidence available at the current time. Specifically, without sequential detection, the APD presented in Table 1 for the proposed framework drops to 0.673.

5 Conclusion and Discussions

For video anomaly detection, we presented a modular framework called MOVAD, which consists of an interpretable transfer learning based feature extractor and a novel $k$NN-RNN based sequential anomaly detector. Mathematical analysis was provided for the false alarm rate and threshold selection. Following the timely detection requirement in practical settings, MOVAD first detects anomalous events in an online fashion, and then deals with localizing the anomalous video frames. Online detection of anomalous events is largely overlooked in the video anomaly detection literature; thus, a new performance metric was also introduced to compare algorithms in terms of online anomaly detection in videos. Through extensive testing on benchmark datasets, we show that MOVAD significantly outperforms the state-of-the-art methods for online detection while performing competitively for offline localization.

While the proposed method is able to capture anomalies in various video aspects, such as object appearance and motion, it is currently not optimized for specific anomaly types. For instance, it is not able to detect unexpected human poses since the optical flow does not change significantly in such cases (see Supplementary). For future work, we plan to focus on continual and self-supervised learning for MOVAD.

References

  • [1] M. Basseville and I. V. Nikiforov (1993) Detection of abrupt changes: theory and application. Vol. 104, Prentice Hall, Englewood Cliffs. Cited by: §3.3.
  • [2] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal (2009) Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1932–1939. Cited by: §2.
  • [3] Y. S. Chong and Y. H. Tay (2017) Abnormal event detection in videos using spatiotemporal autoencoder. In International Symposium on Neural Networks, pp. 189–196. Cited by: §2.
  • [4] R. V. H. M. Colque, C. Caetano, M. T. L. de Andrade, and W. R. Schwartz (2016) Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos. IEEE Transactions on Circuits and Systems for Video Technology 27 (3), pp. 673–682. Cited by: §2.
  • [5] A. Del Giorno, J. A. Bagnell, and M. Hebert (2016) A discriminative framework for anomaly detection in large videos. In European Conference on Computer Vision, pp. 334–349. Cited by: Table 2.
  • [6] K. Doshi and Y. Yilmaz (2020) Any-shot sequential anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 934–935. Cited by: §2, §3.1.
  • [7] K. Doshi and Y. Yilmaz (2020) Continual learning for anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 254–255. Cited by: §1, §2.
  • [8] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) RMPE: regional multi-person pose estimation. In ICCV, Cited by: §3.2.
  • [9] M. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah (2020) A scene-agnostic framework with adversarial training for abnormal event detection in video. arXiv preprint arXiv:2008.12328. Cited by: §2.
  • [10] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742. Cited by: §2, Table 2.
  • [11] R. Hinami, T. Mei, and S. Satoh (2017) Joint detection and recounting of abnormal events by learning deep generic knowledge. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3619–3627. Cited by: §2, Table 2.
  • [12] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470. Cited by: §3.2.
  • [13] R. T. Ionescu, F. S. Khan, M. Georgescu, and L. Shao (2019) Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7842–7851. Cited by: §1, §1, §2, §4.2, §4.2, §4.3.
  • [14] J. Kim and K. Grauman (2009) Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2928. Cited by: Table 2.
  • [15] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1.
  • [16] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §2.
  • [17] F. Landi, C. G. Snoek, and R. Cucchiara (2019) Anomaly locality in video surveillance. arXiv preprint arXiv:1901.10364. Cited by: §2.
  • [18] A. Lavin and S. Ahmad (2015) Evaluating real-time anomaly detection algorithms–the numenta anomaly benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 38–44. Cited by: §3.3.
  • [19] S. Lin, Y. Zhang, C. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars (2018) The architectural implications of autonomous driving: constraints and acceleration. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 751–766. Cited by: §3.3.
  • [20] K. Liu and H. Ma (2019) Exploring background-bias for anomaly detection in surveillance videos. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1490–1499. Cited by: §2.
  • [21] W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6536–6545. Cited by: §1, §1, §2, §3.1, Figure 2, §4.2, §4.2, §4.3, Table 1, Table 2.
  • [22] V. Lomonaco and D. Maltoni (2017) Core50: a new dataset and benchmark for continuous object recognition. arXiv preprint arXiv:1705.03550. Cited by: §2.
  • [23] Y. Lu, F. Yu, M. K. K. Reddy, and Y. Wang (2020) Few-shot scene-adaptive anomaly detection. arXiv preprint arXiv:2007.07843. Cited by: §2, §4.1, §4.2, §4.2, §4.2, Table 3.
  • [24] W. Luo, W. Liu, and S. Gao (2017) A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, pp. 341–349. Cited by: §2, Table 2.
  • [25] W. Luo, W. Liu, and S. Gao (2017) Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439–444. Cited by: Table 2.
  • [26] W. Luo, W. Liu, D. Lian, J. Tang, L. Duan, X. Peng, and S. Gao (2019) Video anomaly detection with sparse coding inspired deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.2, Table 1.
  • [27] H. Mao, X. Yang, and W. J. Dally (2019) A delay metric for video object detection: what average precision fails to tell. In Proceedings of the IEEE International Conference on Computer Vision, pp. 573–582. Cited by: §1.
  • [28] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan (2020) Graph embedded pose clustering for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10539–10547. Cited by: §1, §1, §2.
  • [29] X. Mo, V. Monga, R. Bala, and Z. Fan (2013) Adaptive sparse representations for video anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology 24 (4), pp. 631–645. Cited by: §2.
  • [30] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh (2019) Learning regularity in skeleton trajectories for anomaly detection in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11996–12004. Cited by: §1, Figure 2, §4.2, §4.2, Table 1, Table 2.
  • [31] G. Pang, C. Yan, C. Shen, A. v. d. Hengel, and X. Bai (2020) Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12173–12182. Cited by: §1, §4.2.
  • [32] H. Park, J. Noh, and B. Ham (2020) Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14372–14381. Cited by: §3.1, §4.2, Table 2.
  • [33] B. Ramachandra and M. Jones (2020) Street scene: a new dataset and evaluation protocol for video anomaly detection. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2569–2578. Cited by: §1, §2, §4.2, §4.2.
  • [34] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe (2018) Plug-and-play cnn for crowd motion analysis: an application in abnormal event detection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1689–1698. Cited by: Table 2.
  • [35] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe (2017) Abnormal event detection in videos using generative adversarial nets. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 1577–1581. Cited by: §2.
  • [36] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe (2019) Training adversarial discriminators for cross-channel abnormal event detection in crowds. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1896–1904. Cited by: §2.
  • [37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §3.2.
  • [38] R. Rodrigues, N. Bhargava, R. Velmurugan, and S. Chaudhuri (2020) Multi-timescale trajectory prediction for abnormal human activity detection. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2626–2634. Cited by: §1, §4.2, §4.2, Table 2.
  • [39] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: §2.
  • [40] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli (2018) Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3379–3388. Cited by: §2.
  • [41] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §2.
  • [42] W. Sultani, C. Chen, and M. Shah (2018) Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6479–6488. Cited by: §3.1, §4.2, §4.3.
  • [43] Q. Sun, H. Liu, and T. Harada (2017) Online growing neural gas for anomaly detection in changing surveillance scenes. Pattern Recognition 64, pp. 187–201. Cited by: Table 2.
  • [44] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §2.
  • [45] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §2.
  • [46] Visual system. https://en.wikipedia.org/wiki/Visual_system. Cited by: §3.2.
  • [47] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe (2015) Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553. Cited by: §2.
  • [48] M. Z. Zaheer, J. Lee, M. Astrid, and S. Lee (2020) Old is gold: redefining the adversarially learned one-class classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14183–14193. Cited by: §3.1.
  • [49] B. Zhao, L. Fei-Fei, and E. P. Xing (2011) Online detection of unusual events in videos via dynamic sparse coding. In CVPR 2011, pp. 3313–3320. Cited by: §2.
  • [50] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh (2019) Anomalynet: an anomaly detection network for video surveillance. IEEE Transactions on Information Forensics and Security 14 (10), pp. 2537–2550. Cited by: §1, §1.