We consider the problem of anomaly detection in surveillance videos. Given a video, the goal is to identify frames where abnormal events happen. This is a very challenging problem since the definition of “anomaly” is ambiguous – any event that does not conform to “normal” behaviours can be considered as an anomaly. As a result, we cannot solve this problem via a standard classification framework since it is impossible to collect training data that cover all possible abnormal events. Existing literature usually addresses this problem by training a model using only normal data to learn a generic distribution for normal behaviours. During testing, the model classifies a sample as anomalous based on its distance to the learned distribution.
Much prior work on anomaly detection (e.g. [hasan2016learning, masci2011stacked, sabokrou2016video, chalapathy2017robust, sabokrou2018adversarially, abati2019latent, gong2019memorizing]) uses frame reconstruction. These approaches learn a model to reconstruct the normal training data and use the reconstruction error to identify anomalies. Alternatively, [liu2018future, nguyen2019anomaly, lu2019future, luo2017remembering, medel2016anomaly] use future frame prediction for anomaly detection. These methods learn a model that takes a sequence of consecutive frames as input and predicts the next frame. The difference between the predicted frame and the actual frame at the next time step indicates the probability of an anomaly.
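To make the scoring step concrete: a common recipe in these prediction-based methods is to convert the per-frame prediction error into a normalized anomaly score via PSNR. The sketch below (in NumPy; the function names are our own, and the per-video min-max normalization is one common convention, not the only one) inverts the normalized PSNR so that poorly predicted frames score high:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and an actual frame."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def anomaly_scores(preds, targets):
    """Map per-frame PSNR to a [0, 1] anomaly score: lower PSNR
    (worse prediction) -> higher anomaly score."""
    p = np.array([psnr(x, y) for x, y in zip(preds, targets)])
    # Normalize PSNR over the video, as is common in prediction-based methods.
    s = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return 1.0 - s
```

A frame is then flagged as anomalous when its score exceeds a chosen threshold; sweeping that threshold yields the ROC curve used for evaluation.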
However, existing anomaly detection approaches share common limitations. They implicitly assume that the model (frame reconstruction, or future frame prediction) learned from the training videos can be directly used in unseen test videos. This is a reasonable assumption only if training and testing videos are from the same scene (e.g. captured by the same camera). In the experiment section, we will demonstrate that if we learn an anomaly detection model from videos captured in one scene and directly test the model in a completely different scene, the performance drops. Of course, one possible way of alleviating this problem is to train the anomaly detection model using videos collected from diverse scenes; the learned model will then likely generalize to videos from new scenes. However, this approach is also not ideal. In order to generalize well to diverse scenes, the model requires a large capacity. In many real-world applications, the anomaly detection system is often deployed on edge devices with limited computing power. As a result, even if we can train a huge model that generalizes well to different scenes, we may not be able to deploy it.
Our work is motivated by the following key observation. In real-world anomaly detection applications, we usually only need to consider one particular scene for testing since the surveillance cameras are normally installed at fixed locations. As long as a model works well in this particular scene, it does not matter at all whether the same model works on images from other scenes. In other words, we would like to have a model specifically adapted to the scene where the model is deployed. In this paper, we propose a novel problem called the few-shot scene-adaptive anomaly detection illustrated in Fig. 1. During training, we assume that we have access to videos collected from multiple scenes. During testing, the model is given a few frames in a video from a new target scene. Note that the learning algorithm does not see any images from the target scene during training. Our goal is to produce an anomaly detection model specifically adapted to this target scene using these few frames. We believe this new problem setting is closer to real-world applications. If we have a reliable solution to this problem, we only need a few frames from a target camera to produce an anomaly detection model that is specifically adapted to this camera. In this paper, we propose a meta-learning based approach to this problem. During training, we learn a model that can quickly adapt to a new scene by using only a few frames from it. This is accomplished by learning from a set of tasks, where each task mimics the few-shot scene-adaptive anomaly detection scenario using videos from an available scene.
This paper makes several contributions. First, we introduce a new problem called few-shot scene-adaptive anomaly detection, which is closer to the real-world deployment of anomaly detection systems. Second, we propose a novel meta-learning based approach for solving this problem. We demonstrate that our proposed approach significantly outperforms alternative methods on several benchmark datasets.
2 Related Work
Anomaly Detection in Videos
: Recent research in anomaly detection for surveillance videos can be categorized into reconstruction-based and prediction-based methods. Reconstruction-based methods train a deep learning model to reconstruct the frames in a video and use the reconstruction error to differentiate normal and abnormal events. Examples of reconstruction models include convolutional auto-encoders [masci2011stacked, hasan2016learning, sabokrou2016video, chalapathy2017robust, gong2019memorizing], latent autoregressive models [abati2019latent], deep adversarial training [sabokrou2018adversarially], etc. Prediction-based methods define anomalies as anything that does not conform to the prediction of a deep learning model. Sequential models like Convolutional LSTM (ConvLSTM) [xingjian2015convolutional] have been widely used for future frame prediction and applied to the task of anomaly detection [luo2017remembering, medel2016anomaly]. Popular generative models like generative adversarial networks (GANs) [goodfellow2014generative] and variational autoencoders (VAEs) [kingma2013auto] are also applied in prediction-based anomaly detection. Liu et al. [liu2018future] propose a conditional GAN based model with low-level optical flow [dosovitskiy2015flownet] features. Lu et al. [lu2019future] incorporate a sequential model into generative networks (VAEs) and propose a convolutional VRNN model. Moreover, [gong2019memorizing] apply an optical-flow prediction constraint to a reconstruction-based model.
Few-Shot and Meta Learning: To mimic the fast and flexible learning ability of humans, few-shot learning aims at adapting quickly to a new task with only a few training samples [lake2015human]. In particular, meta learning (also known as learning to learn) has been shown to be an effective solution to the few-shot learning problem. The research in meta-learning can be categorized into three common approaches: metric-based [koch2015siamese, vinyals2016matching, sung2018learning], model-based [santoro2016meta, munkhdalai2017meta] and optimization-based approaches [ravi2016optimization, finn2017model]. Metric-based approaches typically apply Siamese [koch2015siamese], matching [vinyals2016matching], relation [sung2018learning] or prototypical networks [snell2017prototypical] for learning a metric or distance function over data points. Model-based approaches are devised for fast learning from the model architecture perspective [santoro2016meta, munkhdalai2017meta], where rapid parameter updating during training steps is usually achieved by the architecture itself. Lastly, optimization-based approaches modify the optimization algorithm for quick adaptation [ravi2016optimization, finn2017model]. These methods can quickly adapt to a new task through the meta-update scheme among multiple tasks during parameter optimization. However, most of the approaches above are designed for simple tasks like image classification. In our proposed work, we follow a similar optimization-based meta-learning approach proposed in [finn2017model] and apply it to the much more challenging task of anomaly detection. To the best of our knowledge, we are the first to cast anomaly detection as meta-learning from multiple scenes.
3 Problem Setup
We first briefly summarize the standard anomaly detection framework. Then we describe our problem setup of few-shot scene-adaptive anomaly detection.
Anomaly Detection: The anomaly detection framework can be roughly categorized into reconstruction-based and prediction-based methods. For reconstruction-based methods, given an image $x$, the model generates a reconstructed image $\hat{x}$. For prediction-based methods, given $t$ consecutive frames $I_1, I_2, \dots, I_t$ in a video, the goal is to learn a model $f_\theta$ with parameters $\theta$ that takes these frames as its input and predicts the next frame at time $t+1$. We use $\hat{I}_{t+1}$ to denote the predicted frame at time $t+1$. Anomaly detection is determined by the difference between the predicted/reconstructed frame and the actual frame. If this difference is larger than a threshold, the frame is considered an anomaly.
During training, the goal is to learn the future frame prediction/reconstruction model from a collection of normal videos. Note that the training data only contain normal videos since it is usually difficult to collect training data with abnormal events for real-world applications.
Few-Shot Scene-Adaptive Anomaly Detection: The standard anomaly detection framework described above has limitations that make it difficult to apply in real-world scenarios. It implicitly assumes that the model (either reconstruction-based or prediction-based) learned from the training videos generalizes well to test videos. In practical applications, it is unrealistic to collect training videos from the target scene where the system will be deployed. In most cases, training and test videos come from different scenes. The anomaly detection model can easily overfit to the particular training scene and fail to generalize to a different scene during testing. We will empirically demonstrate this in the experiment section.
In this paper, we introduce a new problem setup that is closer to real-world applications. This setup is motivated by two crucial observations. First of all, in most anomaly detection applications, the test images come from a particular scene captured by the same camera. In this case, we only need the learned model to perform well on this particular scene. Second, although it is unrealistic to collect a large number of videos from the target scene, it is reasonable to assume that we will have access to a small number of images from the target scene. For example, when a surveillance camera is installed, there is often a calibration process. We can easily collect a few images from the target environment during this calibration process.
Motivated by these observations, we propose a problem setup called few-shot scene-adaptive anomaly detection. During training, we have access to videos collected from different scenes. During testing, the videos will come from a target scene that never appears during training. Our model will learn to adapt to this target scene from only a few initial frames. The adapted model is expected to work well in the target scene.
4 Our Approach: MAML for Scene-Adaptive Anomaly Detection
We propose to learn few-shot scene-adaptive anomaly detection models using a meta-learning framework, in particular, the MAML algorithm [finn2017model] for meta-learning. Figure 2 shows an overview of the proposed approach. The meta-learning framework consists of a meta-training phase and a meta-testing phase. During meta-training, we have access to videos collected from multiple scenes. The goal of meta-training is learning to quickly adapt to a new scene based on a few frames from it. During this phase, the model is trained from a large number of few-shot scene-adaptive anomaly detection tasks constructed using the videos available in meta-training, where each task corresponds to a particular scene. In each task, our method learns to adapt a pre-trained future frame prediction model using a few frames from the corresponding scene. The learning procedure (meta-learner) is designed in a way such that the adapted model will work well on other frames from the same scene. Through this meta-training process, the model will learn to effectively perform few-shot adaptation for a new scene. During meta-testing, given a few frames from a new target scene, the meta-learner is used to adapt a pre-trained model to this scene. Afterwards, the adapted model is expected to work well on other frames from this target scene.
Our proposed meta-learning framework can be used in conjunction with any anomaly detection model as the backbone architecture. We first introduce the meta-learning approach for scene-adaptive anomaly detection in a general way that is independent of the particular choice of backbone architecture. We then describe the details of the backbone architectures used in this paper.
Our goal of few-shot scene-adaptive anomaly detection is to learn a model that can quickly adapt to a new scene using only a few examples from this scene. To accomplish this, the model is trained during a meta-training phase using a set of tasks where it learns to quickly adapt to a new task using only a few samples from the task. The key to applying meta-learning for our application is how to construct these tasks for the meta-training. Intuitively, we should construct these tasks so that they mimic the situation during testing.
Tasks in Meta-learning:
We construct the tasks for meta-training as follows.
(1) Let us consider a future frame prediction model that maps $t$ observed frames $(I_1, I_2, \dots, I_t)$ to the predicted frame $\hat{I}_{t+1}$. We have access to $M$ scenes during meta-training, denoted as $S_1, S_2, \dots, S_M$. For a given scene $S_i$, we can construct a corresponding task $\mathcal{T}_i = (D_i^{tr}, D_i^{val})$, where $D_i^{tr}$ and $D_i^{val}$ are the training and the validation sets in the task $\mathcal{T}_i$. We first split videos from $S_i$ into many overlapping consecutive segments of length $t+1$. Let us consider a segment $(I_{j+1}, I_{j+2}, \dots, I_{j+t+1})$. We then take the first $t$ frames as the input $x$ and the last frame as the output $y$, i.e. $x = (I_{j+1}, \dots, I_{j+t})$ and $y = I_{j+t+1}$. This forms an input/output pair $(x, y)$. The future frame prediction model can be equivalently written as $\hat{y} = f_\theta(x)$. For the training set $D_i^{tr}$, we randomly sample $K$ input/output pairs from $S_i$ to learn the future frame prediction model, i.e. $D_i^{tr} = \{(x_1, y_1), \dots, (x_K, y_K)\}$. Note that, to match the testing scheme, we make sure that all the samples in $D_i^{tr}$ come from the same video. We also randomly sample $K$ input/output pairs (excluding those in $D_i^{tr}$) to form the validation data $D_i^{val}$.
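The segment-and-sample construction above can be sketched as follows. This is our own simplification in NumPy (a video is any indexable sequence of frames, and a single random permutation splits the pairs into disjoint train/validation sets):

```python
import numpy as np

def make_pairs(video, t):
    """Split a video (a sequence of frames) into overlapping segments of
    length t + 1; the first t frames become the input x, the last frame
    becomes the output y."""
    pairs = []
    for i in range(len(video) - t):
        x = video[i : i + t]
        y = video[i + t]
        pairs.append((x, y))
    return pairs

def make_task(video, t, k, rng):
    """Build one meta-training task: k input/output pairs for the training
    set and k disjoint pairs for the validation set, all from one video."""
    pairs = make_pairs(video, t)
    idx = rng.permutation(len(pairs))
    d_tr = [pairs[i] for i in idx[:k]]
    d_val = [pairs[i] for i in idx[k : 2 * k]]
    return d_tr, d_val
```

Sampling both sets from the same video mirrors the meta-testing scheme, where adaptation frames and evaluation frames come from a single camera.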
(2) Similarly, for reconstruction-based models, we construct the task $\mathcal{T}_i$ using individual frames. Since the ground-truth label for each image is the image itself, we randomly sample $K$ images from one video as $D_i^{tr}$ and sample $K$ other images from the same video as $D_i^{val}$.
Meta-Training: Let us consider a pre-trained anomaly detection model $f_\theta$ with parameters $\theta$. Following MAML [finn2017model], we adapt to a task $\mathcal{T}_i$ by defining a loss function on the training set $D_i^{tr}$ of this task and using one gradient update to change the parameters from $\theta$ to $\theta_i'$:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta; D_i^{tr}), \qquad \mathcal{L}_{\mathcal{T}_i}(f_\theta; D_i^{tr}) = \sum_{(x_j, y_j) \in D_i^{tr}} L(f_\theta(x_j), y_j),$$
where $\alpha$ is the step size. Here $L(\cdot, \cdot)$ measures the difference between the predicted frame $f_\theta(x_j)$ and the actual future frame $y_j$. We define $L$ by combining the least absolute deviation ($L_1$ loss) [pollard1991asymptotics], the multi-scale structural similarity measurement ($L_{ssm}$ loss) [wang2003multiscale] and the gradient difference ($L_{gdl}$ loss) [mathieu2015deep]:
$$L(f_\theta(x_j), y_j) = \lambda_1 L_1 + \lambda_2 L_{ssm} + \lambda_3 L_{gdl},$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are coefficients that weight the different terms of the loss function.
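As a rough sketch, the least-absolute-deviation and gradient-difference terms of the loss can be written in NumPy as below. The multi-scale SSIM term is omitted for brevity, and the coefficient values are illustrative only:

```python
import numpy as np

def l1_loss(pred, target):
    """Least absolute deviation between prediction and target."""
    return np.mean(np.abs(pred - target))

def gradient_difference_loss(pred, target):
    """Penalize mismatch between the image gradients of the prediction
    and of the target (finite differences along height and width)."""
    def grads(img):
        return np.abs(np.diff(img, axis=0)), np.abs(np.diff(img, axis=1))
    ph, pw = grads(pred)
    th, tw = grads(target)
    return np.mean(np.abs(ph - th)) + np.mean(np.abs(pw - tw))

def frame_loss(pred, target, lam1=1.0, lam3=1.0):
    # The full loss also includes a multi-scale SSIM term (lambda_2),
    # omitted here; lam1/lam3 are illustrative coefficients.
    return lam1 * l1_loss(pred, target) + lam3 * gradient_difference_loss(pred, target)
```

The gradient-difference term encourages sharp predictions: a blurred prediction can have a small $L_1$ error while still losing the edge structure that this term penalizes.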
The updated parameters $\theta_i'$ are specifically adapted to the task $\mathcal{T}_i$. Intuitively, we would like $f_{\theta_i'}$ to perform well on the validation set of this task. We measure the performance of $f_{\theta_i'}$ on $D_i^{val}$ as:
$$\mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}; D_i^{val}) = \sum_{(x_j, y_j) \in D_i^{val}} L(f_{\theta_i'}(x_j), y_j).$$
The goal of meta-training is to learn the initial model parameters $\theta$, so that the scene-adapted parameters $\theta_i'$ obtained via the inner update minimize the validation loss across all tasks. Formally, the objective of meta-learning is defined as:
$$\min_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}; D_i^{val}).$$
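The meta-training loop can be illustrated with a toy first-order approximation (FOMAML) on linear regression tasks. This sketch is our own simplification, not the paper's actual model: it drops MAML's second-order term, but it shows the structure of the inner update on $D^{tr}$ followed by an outer update driven by the validation loss evaluated at the adapted parameters:

```python
import numpy as np

def inner_update(theta, x, y, alpha):
    """One gradient step on the task's training set (theta -> theta')."""
    grad = 2 * x.T @ (x @ theta - y) / len(y)
    return theta - alpha * grad

def task_loss(theta, x, y):
    """Mean squared error of a linear model, standing in for L."""
    return np.mean((x @ theta - y) ** 2)

def meta_train(tasks, theta, alpha=0.05, beta=0.01, epochs=200):
    """First-order MAML: adapt on D_tr, then move theta along the gradient
    of the validation loss at the adapted parameters theta_i'."""
    for _ in range(epochs):
        meta_grad = np.zeros_like(theta)
        for (x_tr, y_tr, x_val, y_val) in tasks:
            theta_i = inner_update(theta, x_tr, y_tr, alpha)
            meta_grad += 2 * x_val.T @ (x_val @ theta_i - y_val) / len(y_val)
        theta = theta - beta * meta_grad / len(tasks)
    return theta
```

In the full method each "task" is one scene's $(D_i^{tr}, D_i^{val})$ pair and the linear model is replaced by the backbone network, with the outer update back-propagating through the inner step.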
Meta-Testing: After meta-training, we obtain the learned model parameters $\theta$. During meta-testing, we are given a new target scene $S_{new}$. We simply apply the same inner update to obtain the adapted parameters $\theta'$ based on $K$ examples from $S_{new}$. Then we apply $f_{\theta'}$ on the remaining frames in $S_{new}$ to measure the performance. We use the first several frames of one video in $S_{new}$ for adaptation and the remaining frames for testing. This is similar to real-world settings, where only the first several frames from a new camera are available for adaptation.
Backbone Architecture: Our scene-adaptive anomaly detection framework is general. In theory, we can use any anomaly detection network as the backbone architecture. In this paper, we propose a future frame prediction based backbone architecture similar to [liu2018future]. Following [liu2018future], we build our model based on a conditional GAN. One limitation of [liu2018future] is that it requires additional low-level features (i.e. optical flow) and is not trained end-to-end. To capture the spatio-temporal information of the videos, we propose to combine generative models with sequential modelling. Specifically, we build a model using ConvLSTM and adversarial training. This model consists of a generator and a discriminator. To build the generator, we apply a U-Net [ronneberger2015u] to predict the future frame and pass the prediction to a ConvLSTM module [xingjian2015convolutional] to retain the information of the previous steps. The generator and discriminator are trained adversarially. We call our model r-GAN. Since the backbone architecture is not the main focus of the paper, we skip the details and refer readers to the supplementary material for the detailed architecture of this backbone. In the experiment section, we will demonstrate that our backbone architecture outperforms [liu2018future] even though we do not use optical flow.
We have also experimented with other variants of the backbone architecture. For example, we have tried using the ConvLSTM module in the latent space of an autoencoder; we call this variant r-GAN*. Another variant uses a variational autoencoder instead of a GAN; we call this variant r-VAE. Readers are referred to the supplementary material for the details of these variants. In the experiments, we will show that r-GAN achieves the best performance among these variants, so we use r-GAN as the backbone architecture in the meta-learning framework.
5 Experiments
In this section, we first introduce the datasets and experimental setup in Sec. 5.1. We then describe the baseline approaches used for comparison in Sec. 5.2. Lastly, we present our experimental results and an ablation study in Sec. 5.3.
5.1 Datasets and Setup
Datasets: This paper addresses a new problem. In particular, the problem setup requires training videos from multiple scenes and test videos from different scenes. There are no existing datasets that we can directly use for this problem setup. Instead, we repurpose several available datasets.
Shanghai Tech [luo2017revisit]: This dataset contains 437 videos collected from 13 scenes. The training videos only contain normal events, while the test videos may contain anomalies. In the standard split in [luo2017revisit], both training and test sets contain videos from these 13 scenes. This split does not fit our problem setup where test scenes should be distinct from those in training. In our experiment, we propose a new train/test split more suitable for our problem. We also perform cross-dataset testing where we use the original Shanghai Tech dataset during meta-training and other datasets for meta-testing.
UCF crime [Sultani_2018_CVPR]
: This dataset contains normal and crime videos collected from a large number of real-world surveillance cameras, where each video comes from a different scene. Since this dataset does not come with frame-level ground-truth annotations, we cannot use it for testing, as we have no ground truth with which to calculate the evaluation metrics. Therefore, we only use the 950 normal videos from this dataset for meta-training, then test the model on other datasets. This dataset is much more challenging than Shanghai Tech when used for meta-training, since its scenes are diverse and very dissimilar to our test sets. Our insight is that if our model can adapt to a target dataset after meta-training on UCF Crime, it can likely be meta-trained on similar real-world surveillance videos in practical deployments.
UCSD Pedestrian 1 [mahadevan2010anomaly], UCSD Pedestrian 2 (Ped 2) [mahadevan2010anomaly], and CUHK Avenue [lu2013abnormal]: Each of these datasets contains videos of a single scene captured at different times. They contain 36, 12 and 21 test videos, respectively, including a total of 99 abnormal events such as moving bicycles, vehicles, people throwing things, wandering and running. We use the model trained on the Shanghai Tech or UCF Crime datasets and test on these datasets.
UR fall [kwolek2014human]: This dataset contains 70 depth videos collected with a Microsoft Kinect camera in a nursing home. Each frame is represented as a 1-channel grayscale image capturing the depth information. In our case, we convert each frame to an RGB image by duplicating the grayscale value among 3 color channels for every pixel. This dataset is originally collected for research in fall detection. We follow previous work in [nogas2018deepfall] which considers a person falling as the anomaly. Again, we use this dataset for testing. Since this dataset is drastically different from other anomaly detection datasets, good performance on this dataset will be very strong evidence of the generalization power of our approach.
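The grayscale-to-RGB conversion used for the UR Fall frames is a simple channel duplication; in NumPy:

```python
import numpy as np

def depth_to_rgb(frame):
    """Replicate a 1-channel depth/grayscale frame across 3 channels so
    it matches the RGB input expected by the model."""
    return np.repeat(frame[..., np.newaxis], 3, axis=-1)
```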
Evaluation Metrics: Following prior work [liu2018future, luo2017remembering, mahadevan2010anomaly], we evaluate the performance using the area under the ROC curve (AUC). The ROC curve is obtained by varying the threshold for the anomaly score for each frame-wise prediction.
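Frame-level AUC is typically computed with a library routine such as scikit-learn's `roc_auc_score`; for illustration, here is an equivalent dependency-free NumPy version via the Mann-Whitney U statistic (the O(n²) pairwise comparison is fine at frame-level evaluation sizes):

```python
import numpy as np

def frame_auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    anomalous frame scores higher than a randomly chosen normal frame
    (ties count half), which equals the AUC swept over all thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```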
5.2 Baseline Approaches
To the best of our knowledge, this is the first work on the scene-adaptive anomaly detection problem. Therefore, there is no prior work that we can directly compare with. Nevertheless, we define the following baselines for comparison.
Pre-trained: This baseline learns the model from videos available during training, then directly applies the model in testing without any adaptation.
Fine-tuned: This baseline first learns a pre-trained model. Then it adapts to the target scene using the standard fine-tuning technique on the few frames from the target scene.
5.3 Experimental Results
Sanity Check on Backbone Architecture: We first perform an experiment as a sanity check to show that our proposed backbone architecture is comparable to the state-of-the-art. Note that this sanity check uses the standard training/test setup (training set and testing set are provided by the original datasets), and our model can be directly compared with other existing methods. Table 1 shows the comparisons among our proposed architecture (r-GAN), its variants (r-GAN* and r-VAE), and other methods when using the standard anomaly detection training/test setup on several anomaly detection datasets. Table 3 shows the comparison on the fall detection dataset. We can see that our backbone architecture r-GAN outperforms its variants and the existing state-of-the-art methods on almost all the datasets. As a result, we use r-GAN as our backbone architecture to test our few-shot scene-adaptive anomaly detection algorithm in this paper.
Results on Shanghai Tech: In this experiment, we use Shanghai Tech for both training and testing. In the train/test split used in [liu2018future], both training and test sets contain videos from the same set of 13 scenes. This split does not fit our problem. Instead, we propose a split where the training set contains videos of 6 scenes from the original training set, and the test set contains videos of the remaining 7 scenes from the original test set. This will allow us to demonstrate the generalization ability of the proposed meta-learning approach. Table 3 shows the average AUC score over our test split of this dataset (7 scenes). Our model outperforms the two baselines.
Cross-dataset Testing: To demonstrate the generalization power of our approach, we also perform cross-dataset testing. In this experiment, we use either Shanghai Tech (the original training set) or UCF Crime for meta-training, then use the other datasets (UCSD Ped 1, UCSD Ped 2, CUHK Avenue and UR Fall) for meta-testing. We present our cross-dataset testing results in Table 4. Compared with Table 3, the improvement of our approach over the baselines in Table 4 is even more significant. It is particularly exciting that our model can successfully adapt to the UR Fall dataset, considering that this dataset contains depth images and scenes drastically different from those used during meta-training.
Ablation Study: In this study, we show the effect of the batch size (i.e. the number of sampled scenes per epoch) during the meta-training process. For this study, we train r-GAN on the Shanghai Tech dataset and test on Ped 1, Ped 2 and CUHK Avenue. We experiment with sampling either one or five tasks in each epoch during meta-training. Table 5 shows the comparison. Overall, our approach with a batch size of one performs better than simple fine-tuning, but not as well as a batch size of five. One explanation is that by having access to multiple scenes in one epoch, the model is less likely to overfit to any specific scene.
Qualitative Results: Figure 5 shows qualitative examples of detected anomalies. We visualize the anomaly scores on the frames of a video, comparing our method with the baselines in one graph for different numbers of adaptation frames and different datasets.
6 Conclusion
We have introduced a new problem called few-shot scene-adaptive anomaly detection. Given a few frames captured from a new scene, our goal is to produce an anomaly detection model specifically adapted to this scene. We believe this new problem setup is closer to the real-world deployment of anomaly detection systems. We have developed a meta-learning based approach to this problem. During meta-training, we have access to videos from multiple scenes. We use these videos to construct a collection of tasks, where each task is a few-shot scene-adaptive anomaly detection task. Our model learns to effectively adapt to a new task with only a few frames from the corresponding scene. Experimental results show that our proposed approach significantly outperforms other alternative methods.
Acknowledgement: This work was supported by the NSERC and UMGF funding. We thank NVIDIA for donating some of the GPUs used in this work.