Anomaly detection is an essential problem in video surveillance. Due to the massive amount of video data available from surveillance cameras, it is time-consuming and inefficient to have human observers watch surveillance videos and report any anomalies. Ideally, we want an automatic system that can report abnormal events. Anomaly detection is challenging since the definition of “anomaly” is broad and ambiguous: anything that deviates from expected behaviours can be considered an “anomaly”. It is infeasible to collect labeled training data that cover all possible anomalies. As a result, recent work in anomaly detection has focused on unsupervised approaches that do not require human labels.
Some recent work (e.g. [hasan2016learning, lu2013abnormal, luo2017remembering, sabokrou2018adversarially]) in anomaly detection uses the idea of frame reconstruction. These methods build models that learn to reconstruct the normal (or regular) frames observed during training. During testing, any irregular (abnormal) event will lead to a large reconstruction error, so a higher reconstruction error indicates a possible abnormal event in the frame. Previous work [an2015variational, sabokrou2018adversarially, liu2018future] has applied variants of generative models such as the variational autoencoder (VAE) [kingma2013auto] or the generative adversarial network (GAN) [goodfellow2014generative] to model the distribution of normal behaviours. To build a real-time anomaly detection system, Liu et al. [liu2018future] propose a future frame prediction framework for anomaly detection. Given several observed frames, their method learns a GAN-based model to predict the future frame. An anomaly then corresponds to a large difference between the predicted future frame and the actual future frame. One limitation of [liu2018future] is that it directly concatenates the observed frames as the input to the GAN model. As a result, the model does not directly represent the temporal information in a video. Although [liu2018future] uses optical flow features, which capture some temporal information at the feature level, the optical flow information is only used as a constraint during training and is not used during testing.
In this paper, we follow the future frame prediction framework in [liu2018future] and propose a new approach that better captures the temporal information in a video for anomaly detection. We propose to combine sequential models (in particular, ConvLSTM) with generative models (in particular, VAE) to build a model that can be trained end-to-end. Although sequential generative models have been previously proposed for speech recognition and music generation [mogren2016c, chung2015recurrent], they have not been applied in anomaly detection. An example of our proposed video anomaly detection system can be seen in Fig 1. Given several consecutive frames, our model learns to predict the next future frame. For normal frames, our method is able to predict the next frame reasonably well. When there is an anomaly in the future frame, the prediction is often distorted and blurry. By comparing the predicted future frame with the actual future frame, our system can detect suspicious behaviours or events in a video frame (in this case, a man throwing his bag up and down).
In this paper, we make the following contributions. We propose a sequential generative model for video anomaly detection using the future frame prediction framework. We combine ConvLSTM with VAE to better capture the temporal relationship among frames in a video. Our experimental results demonstrate that the proposed model outperforms existing state-of-the-art approaches, even without using optical flow features.
2 Related Work
In this section, we review several lines of prior research related to our work.
Anomaly Detection with Hand-crafted Features: Early work in video anomaly detection uses hand-crafted features. [tung2011goal, wu2010chaotic] use trajectory features to represent normal behaviours. However, these methods cannot be applied to crowded scenes. To address this limitation, low-level features such as histograms of oriented gradients and histograms of oriented flow [dalal2005histograms, dalal2006human], developed for human detection, have also been applied. [zhao2011online, lu2013abnormal, cong2011sparse] represent each scene by a dictionary of temporal and spatial information. These approaches have low performance because the dictionary is not guaranteed to cover all normal events and therefore cannot classify anomalies reliably. Statistical models have also been proposed. For example, [kim2009observe] proposes an approach based on a mixture of probabilistic PCA (MPPCA) with optical flow patterns. A Gaussian mixture model [mahadevan2010anomaly] has also been applied for anomaly detection.
Anomaly Detection with Deep Learning: In order to address the limitation of hand-crafted features in anomaly detection, there has been recent work that explores the use of deep learning approaches. Many of these methods learn a deep model to reconstruct a frame and use the reconstruction error for anomaly detection. Inspired by [masci2011stacked], Hasan et al. [hasan2016learning] apply a convolutional autoencoder for reconstructing normal frames. Some follow-up works [sabokrou2016video, chalapathy2017robust] propose to build a more robust version. Xu et al. [xu2017detecting] use stacked de-noising autoencoders [vincent2008extracting] and optical flow to capture both appearance and motion information.
Some work considers using a future frame prediction approach for anomaly detection. Medel et al. [medel2016anomaly] apply ConvLSTM as a backbone network and build a future prediction model for anomaly detection. Luo et al. [luo2017remembering] combine autoencoder and ConvLSTM to reconstruct the output of ConvLSTM to the original image size. Because the inner structure of ConvLSTM is entirely deterministic, these predictive modeling methods cannot predict highly structured moving objects, which results in inaccurate predictions of anomalies.
Generative models, such as VAE [kingma2013auto] and GAN [goodfellow2014generative], have been applied for the purpose of learning the distribution of regular frames. Sabokrou et al. [sabokrou2018adversarially] propose a one-class classifier using conditional adversarial networks [isola2017image]. Xie et al. [xie2012image] use a GAN-based image inpainting method to detect and localize abnormal objects. Liu et al. [liu2018future] propose a GAN-based future frame prediction network with an optical flow network [dosovitskiy2015flownet]. An et al. [an2015variational] apply VAE to build an anomaly detection system, but the method is not evaluated on real-world datasets.
Sequential Generative Models: There has been some work on incorporating sequential information in generative models. Chung et al. [chung2015recurrent] argue that latent random variables can play crucial roles in the dynamics of an RNN. By combining VAE and RNN, they are able to model sequences with significant improvement over a plain RNN. However, this model has only been used on simple tasks such as speech generation or handwriting generation. [mogren2016c] proposes a sequential generative model using adversarial training on RNN, and argues that with the supervision of a discriminator, the proposed generative model can be trained to be very expressive with high flexibility on continuous sequences such as music. However, the potential of this model on computer vision tasks has not yet been explored.
3.1 Variational Autoencoder
Variational autoencoder (VAE) [kingma2013auto] has been shown to be effective in reconstructing complex distributions for non-sequential data. Given an input $x$, VAE applies an encoder (also known as inference model) $q_\phi(z \mid x)$ to generate the latent variable $z$ that captures the variation in $x$. It uses a decoder $p_\theta(x \mid z)$ to approximate the observation given the latent variable. The inference model represents the approximate posterior $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ using the mean $\mu$ and variance $\sigma^2$ calculated by a neural network, where $\mu$ and $\sigma^2$ are outputs of some neural networks that take $x$ as the input. A prior $p(z)$ is chosen to be a simple Gaussian distribution. With the constraints of distribution on latent variables, the complete objective function can be described as below:
$$\mathcal{L}(x; \theta, \phi) = -\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big],$$
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence [hershey2007approximating] between the prior and the posterior.
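As a minimal sketch of the two ingredients of a VAE objective of this kind, the snippet below draws a latent sample with the reparameterization trick and evaluates the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The function names and the log-variance parameterization are illustrative assumptions, not taken from the paper.

```python
import math
import random


def sample_latent(mu, log_var, rng=random):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var are equal-length lists (hypothetical encoder outputs).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

The KL term is zero exactly when the posterior equals the prior (zero mean, unit variance) and grows as the posterior moves away from it.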
3.2 Variational Recurrent Neural Network
VAE is a generative model. It cannot directly be used to model sequential data. For the problem of anomaly detection, our data are inherently sequential since we need to consider the information in several consecutive frames in order to predict the next frame. Variational Recurrent Neural Network (VRNN)[chung2015recurrent] is an extension of vanilla VAE. It combines VAE with a recurrent neural network in order to model sequential data. Since this approach shares the same inspiration with our Conv-VRNN approach, we will explain the technical details in the next section.
Following [liu2018future], we approach the anomaly detection problem using the future frame prediction framework. The goal is to build a model that takes several frames in a video as the input and predicts the future frame. The predicted future frame is then compared with the actual future frame. If their difference is significant, we consider it to be an anomaly. The main difference from [liu2018future] is that our proposed approach combines a recurrent network with a generative model. As a result, our approach can better capture temporal information in the video.
Our problem formulation is as follows. Given a sequence of frames $x_1, x_2, \dots, x_T$, we aim at predicting the next frame $x_{T+1}$. Note that $T$ is a constant which we define as 4 in our case. We use $\hat{x}_{T+1}$ to denote the predicted frame at time $T+1$. During training, we learn a model that minimizes the difference between the predicted and actual future frames, i.e. the difference between $\hat{x}_{T+1}$ and $x_{T+1}$. During testing, if this difference is too large, we consider $x_{T+1}$ to be an anomaly.
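The formulation above can be illustrated with a short sketch that slices a video into training pairs of four input frames and one target frame. The helper name `sliding_windows` and the stride of one frame are assumptions made for illustration; the source does not state how windows are sampled.

```python
def sliding_windows(frames, num_inputs=4):
    """Yield (input_frames, target_frame) pairs from a list of frames.

    Each pair holds `num_inputs` consecutive frames and the frame that
    immediately follows them. A stride of one frame is assumed.
    """
    for start in range(len(frames) - num_inputs):
        yield frames[start:start + num_inputs], frames[start + num_inputs]
```

For a video with six frames this produces two pairs; a video with four or fewer frames produces none, since no target frame is available.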
In this section, we first introduce our model Conv-VRNN (Sec. 4.1) for future frame prediction. Our model combines VAE and a ConvLSTM module. We then describe how to use the proposed model to detect anomaly during testing (Sec. 4.2).
4.1 Conv-VRNN for Future Frame Prediction
To extend VAE to model image sequences for anomaly detection, we use the idea of Variational Recurrent Neural Network (VRNN) [chung2015recurrent] and build a Conv-VRNN model for future frame prediction. An overview of our proposed model is shown in Figure 2. Let $x_t$ be the input image at time $t$, where $H \times W$ is the spatial dimension of the image. We define $h_t$ to be the hidden state of a ConvLSTM at time step $t$. Note that we choose the spatial dimension of $h_t$ to match the image size. Our method consists of four components at each time step $t$:
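Prior Distribution in VAE: This module takes the hidden state $h_{t-1}$ from the previous time step as the input. It then generates a distribution on the latent variable $z_t$ in VAE. We first extract a feature from $h_{t-1}$. Since $h_{t-1}$ has the same spatial dimension as the image, we use a CNN for this step. We denote this feature as $\varphi_h(h_{t-1}) \in \mathbb{R}^{d \times d \times c}$, where $d \times d$ and $c$ correspond to the spatial dimension and the channel dimension of the CNN feature map. Here $d$ and $c$ are fixed hyperparameters. We then apply two different fully connected layers on $\varphi_h(h_{t-1})$ to produce two vectors corresponding to the mean and the variance of a Gaussian distribution in VAE, denoted by $\mu_{0,t}$ and $\sigma^2_{0,t}$. In our implementation, the dimension of $\mu_{0,t}$ and $\sigma^2_{0,t}$ is set to be 20, i.e. $\mu_{0,t}, \sigma^2_{0,t} \in \mathbb{R}^{20}$. We then use $\mu_{0,t}$ and $\sigma^2_{0,t}$ to define a Gaussian distribution for the prior distribution on the latent variable in VAE as follows:
$$p(z_t \mid h_{t-1}) = \mathcal{N}\big(\mu_{0,t}, \mathrm{diag}(\sigma^2_{0,t})\big),$$
where $\mathrm{diag}(\cdot)$ creates a diagonal matrix from a vector and $p(z_t \mid h_{t-1})$ represents the prior distribution on the latent variable.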
Encoder: This module takes the hidden state $h_{t-1}$ of the previous time step and the frame $x_t$ at the current time as the input. It then produces a vector of the latent variable $z_t$ in VAE. We first concatenate $h_{t-1}$ and $x_t$ along their channel dimensions, then apply a CNN to extract a feature map. Again, we apply two different fully connected layers on this feature map to produce $\mu_{z,t}$ and $\sigma^2_{z,t}$. Similarly, the dimension of $\mu_{z,t}$ and $\sigma^2_{z,t}$ is set to be 20. We then define the posterior of the latent variable in VAE as:
$$q(z_t \mid x_t, h_{t-1}) = \mathcal{N}\big(\mu_{z,t}, \mathrm{diag}(\sigma^2_{z,t})\big).$$
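Recurrence: To capture the temporal information among frames in a video, we use a ConvLSTM to represent the recurrent relationship among frames. From the current input image $x_t$, we apply a CNN to extract a feature map which we denote as $\varphi_x(x_t)$. To match the dimension of this feature, we also resize the latent variable $z_t$ (recall $z_t \in \mathbb{R}^{20}$) as follows. We first use fully connected layers to map $z_t$ to a high-dimensional space, then reshape the result to a 3D tensor whose spatial dimension matches that of $\varphi_x(x_t)$. We use $\varphi_z(z_t)$ to denote this reshaped tensor. We concatenate the input feature $\varphi_x(x_t)$ with $\varphi_z(z_t)$ along the channel dimension and use it as the input to ConvLSTM at time $t$:
$$h_t = \mathrm{ConvLSTM}\big([\varphi_x(x_t), \varphi_z(z_t)],\; h_{t-1}\big),$$
where $[\cdot, \cdot]$ denotes concatenation along the channel dimension.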
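The reshape-and-concatenate step in the recurrence module can be sketched with plain nested lists. This is only a shape-level illustration: the helper names, the row-major layout and the toy sizes are assumptions, and the fully connected layers and the ConvLSTM itself are not modelled.

```python
def reshape_latent(vec, d, c):
    """Reshape a flat list of length d*d*c into a d x d x c nested list."""
    if len(vec) != d * d * c:
        raise ValueError("vector length must equal d * d * c")
    return [[[vec[(i * d + j) * c + k] for k in range(c)]
             for j in range(d)]
            for i in range(d)]


def concat_channels(a, b):
    """Concatenate two H x W x C nested lists along the channel axis."""
    return [[pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```

With a 2 x 2 x 2 image feature and a latent vector reshaped to 2 x 2 x 1, the concatenated tensor has shape 2 x 2 x 3; only the channel count grows.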
Decoder: This module takes the resized latent variable $\varphi_z(z_t)$ as its input and produces a predicted frame for the next time step. Note that the dimensions of $\varphi_z(z_t)$ match those of the feature $\varphi_h(h_{t-1})$ extracted from the previous hidden state. We concatenate $\varphi_z(z_t)$ and $\varphi_h(h_{t-1})$ along the channel dimension. The result is used as the input of this decoder module. The decoder is implemented as a deconvolutional neural network that generates the predicted frame $\hat{x}_{t+1}$.
Model Learning: For learning the parameters of Conv-VRNN, we combine the least absolute deviation ($\ell_1$ loss) [pollard1991asymptotics], the multi-scale structural similarity measurement (msssim loss) [wang2003multiscale] and the gradient difference (gdl loss) [mathieu2015deep] to define a loss that measures the quality of the predicted frame. These three loss functions can be described as follows:
(1) The $\ell_1$ loss between the ground truth and the prediction is the sum of the absolute differences between corresponding pixels of the two images.
(2) We use multi-scale SSIM to represent the structural difference. MSSSIM is a multi-scale version of SSIM, which performs better on video sequences.
(3) Gradient difference is widely used for measuring the quality of a prediction. The gradient difference loss compares the intensity differences between neighbouring pixels in the prediction with those in the ground truth.
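As a rough illustration of items (1) and (3), the sketch below computes an $\ell_1$ loss and a gradient difference loss on small grayscale images stored as nested lists. The exponent of 1 in the gradient term and the unnormalized sums are assumptions; the paper's exact definitions are not recoverable from the text, and MS-SSIM is omitted because it needs a full multi-scale filtering pipeline.

```python
def l1_loss(pred, target):
    """Sum of absolute pixel differences between two H x W images."""
    return sum(abs(p - t)
               for row_p, row_t in zip(pred, target)
               for p, t in zip(row_p, row_t))


def gradient_difference_loss(pred, target):
    """Sum of | |grad(target)| - |grad(pred)| | over both image axes."""
    height, width = len(pred), len(pred[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            if i > 0:  # difference to the pixel above
                total += abs(abs(target[i][j] - target[i - 1][j])
                             - abs(pred[i][j] - pred[i - 1][j]))
            if j > 0:  # difference to the pixel on the left
                total += abs(abs(target[i][j] - target[i][j - 1])
                             - abs(pred[i][j] - pred[i][j - 1]))
    return total
```

A prediction that differs from the ground truth only by a constant brightness offset has a large $\ell_1$ loss but zero gradient difference loss, which is why the two terms complement each other.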
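Overall, given the predicted frame $\hat{x}$ and the ground-truth $x$, the complete prediction loss $\mathcal{L}_{pred}(\hat{x}, x)$ combines the $\ell_1$, msssim and gdl losses.
The complete objective function combines this prediction loss with the KL divergence between the posterior and the prior on the latent variable.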
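Assuming, as in VRNN [chung2015recurrent], that the objective includes a KL term between the encoder posterior and the learned prior at each time step, that term has a closed form for diagonal Gaussians. The sketch below evaluates it; the function name and argument layout are illustrative, and how the term is weighted against the prediction loss is not specified here.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ).

    All arguments are equal-length lists; variances must be positive.
    """
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )
```

When the posterior and the prior coincide the term vanishes, so it only penalizes latent codes that the prior, computed from the previous hidden state, could not have anticipated.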
4.2 Anomaly Detection
Given an input sequence of frames during testing, we use our model to predict the next frame in the future. This predicted future frame is compared with the ground-truth future frame by calculating the prediction loss $\mathcal{L}_{pred}$ (see Eq. 5). As in [liu2018future], after calculating the overall spatial loss for every frame of a testing video, we normalize the losses to get a score $S(t)$ in the range $[0, 1]$ for each frame in the video by:
$$S(t) = \frac{\mathcal{L}_{pred}(t) - \min_{t} \mathcal{L}_{pred}(t)}{\max_{t} \mathcal{L}_{pred}(t) - \min_{t} \mathcal{L}_{pred}(t)}.$$
We then use $S(t)$ as the score indicating how likely a particular frame is an anomaly.
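A per-video normalization of this kind can be sketched as a min-max rescaling of the frame losses. The function name is hypothetical, and returning all zeros for a video whose losses are constant is an assumption made to avoid division by zero.

```python
def normalize_scores(losses):
    """Min-max normalize per-frame losses of one video into [0, 1]."""
    lowest, highest = min(losses), max(losses)
    if highest == lowest:
        # Assumption: a video with constant loss has no frame that stands out.
        return [0.0 for _ in losses]
    return [(value - lowest) / (highest - lowest) for value in losses]
```

Because the rescaling is done per video, the frame with the largest loss in each test video always receives a score of 1, regardless of the absolute loss values.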
In this section, we first discuss our experimental setup in Sec. 5.1. Then we present both quantitative and qualitative results in Sec. 5.2. We also perform extensive ablation studies in Sec. 5.3 to analyze our proposed approach.
5.1 Experimental Setup
We evaluate our method on three benchmark datasets. (1) UCSD Pedestrian 1 (Ped 1) dataset [mahadevan2010anomaly]: this dataset contains 34 training videos and 36 testing videos. In the training videos, only pedestrians appear in the frames. The test videos include 40 abnormal events, such as moving bicycles and vehicles. (2) UCSD Pedestrian 2 (Ped 2) dataset [mahadevan2010anomaly]: this dataset considers the same set of anomalies as the UCSD Ped 1 dataset. It consists of 16 training videos and 12 testing videos with 12 abnormal events. (3) CUHK Avenue (Avenue) dataset [lu2013abnormal]: this dataset consists of 16 training videos and 21 testing videos. It contains 47 abnormal events such as throwing things, wandering, and running. Figure 3 shows some example frames from these datasets.
Following prior work [liu2018future, luo2017revisit, mahadevan2010anomaly], we evaluate our methods using the area under the ROC curve (AUC). The ROC curve is obtained by varying the threshold on the anomaly score. A higher AUC value represents a more accurate anomaly detection system. To ensure comparability between different methods, we calculate AUC from the frame-level predictions, as done by existing methods.
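Frame-level AUC can be computed without tracing the ROC curve explicitly, using the equivalent rank statistic: the probability that a randomly chosen abnormal frame receives a higher score than a randomly chosen normal frame. The sketch below is a simple quadratic-time illustration with a hypothetical function name; it assumes labels of 1 for abnormal and 0 for normal frames and is not the evaluation code used in the paper.

```python
def frame_level_auc(scores, labels):
    """Area under the ROC curve from per-frame scores and 0/1 labels.

    Counts, over all (abnormal, normal) frame pairs, how often the
    abnormal frame has the higher score; ties count as one half.
    """
    abnormal = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    if not abnormal or not normal:
        raise ValueError("both normal and abnormal frames are required")
    wins = 0.0
    for a in abnormal:
        for n in normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(abnormal) * len(normal))
```

A value of 1.0 means every abnormal frame is scored above every normal frame, while 0.5 corresponds to scores that carry no information about the labels.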
5.2 Experimental Results
|Method||Ped 1||Ped 2||Avenue|
|Del et al. [del2016discriminative]||N/A||N/A||78.3%|
|Stacked RNN [luo2017revisit]||N/A||92.2%||81.7%|
|Liu et al. [liu2018future]||83.1%||95.4%||84.9%|
[Figure panels, left to right: GT (normal), prediction (normal), GT (abnormal), prediction (abnormal)]
Table 1 shows the results of our proposed method compared with existing state-of-the-art approaches. To be consistent with [liu2018future], we have set the number of input frames to $T = 4$. In other words, our model takes 4 consecutive frames as the input and predicts the future frame at the next time step. It then compares the prediction with the actual frame at the next time step to decide whether this frame is an anomaly. We can see that Conv-VRNN outperforms existing methods on all three datasets.
Figure 4 shows some qualitative examples of future frame prediction. We can see that for a normal frame, the predicted future frame tends to be close to the actual future frame. For an abnormal frame, the predicted future frame tends to be blurry or distorted compared with the actual future frame. Figure 5 shows an example of a detected anomaly by visualizing the anomaly score on different frames in a video.
5.3 Ablation Study
We perform additional ablation studies to gain further insight into our proposed method.
5.3.1 Conv-VAE vs Conv-VRNN
In order to analyze the effect of incorporating temporal information, we implement a variant of our model without RNN. We call this variant Conv-VAE. Conv-VAE uses the encoder module to encode a latent variable and uses the decoder module for prediction. We have experimented with Conv-VAE that takes either one input frame or four frames to predict the next frame. The results are shown in Table 2. We can see that Conv-VRNN outperforms Conv-VAE. This demonstrates the importance of capturing the temporal information using RNN for anomaly detection.
|Ped 1||Ped 2||Avenue|
5.3.2 Analysis on Losses
As mentioned in Sec. 4, we apply three different losses for prediction. The impact of each loss is shown in Table 3. We choose three combinations of objective functions for evaluation: constraint only on intensity ($\ell_1$), constraint on intensity and structure ($\ell_1$ + msssim), and constraint on intensity, structure and gradient ($\ell_1$ + msssim + gdl). The results demonstrate that the appearance information is better captured by the model with more constraints.
5.3.3 Sequential Model vs Optical Flow
Our Conv-VRNN uses an RNN module to capture the temporal information in a video. An alternative way of capturing temporal information is to use optical flow features. We have implemented a Conv-VAE model with such a constraint. Following [liu2018future], we apply the pretrained Flownet [dosovitskiy2015flownet] to estimate the optical flow, and use the loss returned by Flownet as a motion constraint on the network at training time only. Table 4 and Figure 6 show that although adding optical flow to our implementation of Conv-VAE improves the performance compared with Conv-VAE applied to raw frames only, our proposed Conv-VRNN approach still performs better even though it does not use optical flow features. This demonstrates that it is more effective to design the generative model to directly capture the temporal information instead of relying on low-level optical flow features.
|Method||Ped 1||Ped 2||Avenue|
|Conv-VAE (w/o optical flow)||80.15%||88.13%||80.92%|
|Conv-VAE (with optical flow)||81.36%||89.52%||82.23%|
In this paper, we have proposed a sequential generative network for anomaly detection based on convolutional VRNN using the future frame prediction framework. By combining a ConvLSTM module with VAE, our approach can effectively capture the temporal information crucial for future frame prediction. On three benchmark datasets, our proposed approach outperforms existing state-of-the-art methods.
Acknowledgement: This work was supported by the NSERC and UMGF funding programs. We thank NVIDIA for donating some of the GPUs used in this work.