Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection

07/29/2021 · Xinyang Feng et al. · Columbia University · University of Connecticut

Detecting abnormal activities in real-world surveillance videos is an important yet challenging task, as the prior knowledge about video anomalies is usually limited or unavailable. Although many approaches have been developed to resolve this problem, few of them can capture the normal spatio-temporal patterns effectively and efficiently. Moreover, existing works seldom explicitly consider the local consistency at the frame level and the global coherence of temporal dynamics in video sequences. To this end, we propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection. Specifically, we first present a convolutional transformer to perform future frame prediction. It contains three key components, i.e., a convolutional encoder to capture the spatial information of the input video clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame. Next, a dual discriminator based adversarial training procedure, which jointly considers an image discriminator that maintains the local consistency at the frame level and a video discriminator that enforces the global coherence of temporal dynamics, is employed to enhance the future frame prediction. Finally, the prediction error is used to identify abnormal video frames. Thorough empirical studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus, demonstrate the effectiveness of the proposed adversarial spatio-temporal modeling framework.


1. Introduction

With the rapid growth of video surveillance data, there is an increasing demand to automatically detect abnormal video sequences in the context of large-scale normal (regular) video data. Although a substantial amount of research effort has been devoted to this problem (Mahadevan et al., 2010; Li et al., 2014; Lu et al., 2013; Hasan et al., 2016; Xu et al., 2017; Liu et al., 2018; Tang et al., 2020; Park et al., 2020; Chang et al., 2020), video anomaly detection, which aims to identify the activities that do not conform to regular patterns in a video sequence, is still a challenging task. This is because real-world abnormal video activities can be extremely diverse, while the prior knowledge about these anomalies is usually limited or even unavailable.

With the assumption that a model can only generalize to data from the same distribution as the training set, abnormal activities in the test set will manifest as deviance from regular patterns. A common approach to resolve this problem is to learn a model that can capture regular patterns in the normal video clips during the training stage, and check whether there exists any irregular pattern that diverges from regular patterns in the test video clips. Within this framework, it is crucial to not only represent the regular appearances but also capture the normal spatio-temporal dynamics to differentiate abnormal activities from normal activities in a video sequence. This serves as an important motivation for our proposed methods.

Early studies have used handcrafted features to represent video patterns (Mahadevan et al., 2010; Li et al., 2014; Lu et al., 2013; Song and Tao, 2010). For instance, Li et al. (2014) introduced mixtures of dynamic textures and defined outliers under this model as anomalies. These approaches, however, are usually not optimal for video anomaly detection since the features are extracted based upon a different objective.

Recently, deep neural networks have become prevalent in video anomaly detection, showing superior performance over handcrafted feature based methods. For instance, Hasan et al. (2016) developed a convolutional autoencoder (Conv-AE) to model the spatio-temporal patterns in a video sequence simultaneously with a 2D CNN. The temporal dynamics, however, are not explicitly considered. To better cope with the spatio-temporal information in a video sequence, the convolutional long short-term memory (LSTM) autoencoder (ConvLSTM-AE) (Shi et al., 2015; Luo et al., 2017a) was proposed to model the spatial patterns with fully convolutional networks and encode the temporal dynamics using convolutional LSTM (ConvLSTM). ConvLSTM, however, suffers from computational and interpretation issues. A powerful alternative for sequence modeling is the self-attention mechanism (Vaswani et al., 2017). It has demonstrated superior performance and efficiency in many different tasks, e.g., sequence-to-sequence machine translation (Vaswani et al., 2017), time series prediction (Qin et al., 2017), autoregressive model based image generation (Parmar et al., 2018), and GAN-based image synthesis (Zhang et al., 2019a). However, it has seldom been employed to capture regular spatio-temporal patterns in surveillance videos.

More recently, adversarial learning has shown impressive progress on video anomaly detection. For instance, Ravanbakhsh et al. (2017) developed a GAN based anomaly detection approach following the conditional GAN framework (Isola et al., 2017). Liu et al. (2018) proposed an anomaly detection approach based on future frame prediction. Tang et al. (2020) extended this framework by adding a reconstruction task. The generative models in these two works were based on U-Net (Ronneberger et al., 2015). Similar to Conv-AE, the temporal dynamics in the video clip were not explicitly encoded, and the temporal coherence was enforced by a loss term on the optical flow. Moreover, the potential discriminative information, in the form of local consistency at the frame level and global coherence of temporal dynamics in video sequences, was not fully considered in previous works.

In this paper, to better capture the regular spatio-temporal patterns and exploit the potential discriminative information at the frame level and in video sequences, we propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection. We first present a convolutional transformer to perform future frame prediction. The convolutional transformer is essentially an encoder-decoder framework consisting of three key components, i.e., a convolutional encoder to capture the spatial patterns of the input video clip, a novel temporal self-attention module adapted for video temporal modeling that can explicitly encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame. Because of the temporal self-attention module, the convolutional transformer can capture the underlying temporal dynamics efficiently and effectively. Next, in order to maintain the local consistency of the predicted frame and the global coherence conditioned on the previous frames, we adapt the dual discriminator GAN to video frames and employ an adversarial training procedure to further enhance the prediction performance. Finally, the prediction error is adopted to identify abnormal video frames. Thorough empirical studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue, and ShanghaiTech Campus, demonstrate the effectiveness of the proposed framework and techniques.

2. Related Work

The proposed Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) is closely related to deep learning based video anomaly detection and the self-attention mechanism (Vaswani et al., 2017).

Note that we focus our discussion on methods based on unsupervised settings, which generalize efficiently without the time-consuming and error-prone process of manual labeling. We are aware that there are numerous works on weakly supervised or supervised video anomaly detection; e.g., Sultani et al. (2018) proposed a deep multiple instance ranking framework using video-level labels and achieved better performance than the convolutional autoencoder (Conv-AE) based method (Hasan et al., 2016), but it employs both normal and abnormal video clips for training, which differs from our setting.

Deep neural network based video anomaly detection methods demonstrate superior performance over traditional methods based on handcrafted features. Hasan et al. (2016) developed the Conv-AE method to simultaneously learn the spatio-temporal patterns in a video with 2D convolutional neural networks by concatenating the video frames in the channel dimension. The temporal information is mixed with the spatial information in the first convolutional layer and thus not explicitly encoded. Xu et al. (2017) proposed the appearance and motion DeepNet (AMDN) to learn video feature representations, which however still requires a decoupled one-class SVM classifier applied on the learned representations to generate anomaly scores. Gong et al. (2019) proposed a memory-augmented autoencoder (MemAE) that uses a memory module to constrain the reconstruction.

More recently, adversarial learning has demonstrated flexibility and impressive performance in multiple video anomaly detection studies. A generative adversarial network (GAN) based anomaly detection approach (Ravanbakhsh et al., 2017) was developed following the cGAN framework for image-to-image translation (Isola et al., 2017). Specifically, it employs images and optical flow as source and target domains (and vice versa) and trains cross-channel generation through adversarial learning; the reconstruction error is used to compute the anomaly score. The only temporal constraint is imposed by the optical flow calculation. Liu et al. (2018) proposed an anomaly detection approach based on future frame prediction in a GAN framework with U-Net (Ronneberger et al., 2015). Similar to Conv-AE, the temporal information is not explicitly encoded, and the temporal coherence between neighboring frames is enforced by a loss term on the optical flow. Tang et al. (2020) extended the future frame prediction framework by adding a reconstruction task. One way to alleviate the temporal encoding issue in video spatio-temporal modeling is to use convolutional LSTM autoencoder (ConvLSTM-AE) based methods (Shi et al., 2015; Chong and Tay, 2017; Luo et al., 2017a; Zhang et al., 2019b), where the spatial and temporal patterns are encoded with fully convolutional networks and convolutional LSTM, respectively. Despite its popularity, ConvLSTM suffers from issues such as large memory consumption, and its complex gating operations add to the computational cost and complicate the information flow, making interpretation difficult.

A more effective and efficient alternative for sequence modeling is the self-attention mechanism (Vaswani et al., 2017), which relates different positions of a single sequence to compute a representation of that sequence; the keys, values, and queries come from the same set of features. Related applications include autoregressive model based image generation (Parmar et al., 2018) and GAN-based image synthesis (Zhang et al., 2019a).

In this work, we introduce the convolutional transformer by extending the self-attention mechanism to video sequence modeling and develop a novel self-attention module specialized for spatio-temporal modeling in video sequences. Compared to existing approaches for video anomaly detection, the proposed convolutional transformer has the advantage of explicitly and efficiently encoding the temporal information in a sequence of feature maps, where the computation of attention can be fully parallelized via matrix multiplications. Based on the convolutional transformer, a dual discriminator generative adversarial network (D2GAN) approach is developed to further enhance the future frame prediction by enforcing local consistency of the predicted frame and global coherence conditioned on the previous frames. Note that the proposed D2GAN differs from existing works on dual discriminator based GANs, which have been applied to different scenarios (Nguyen et al., 2017; Xu et al., 2019; Yu et al., 2018; Dong et al., 2020).

Figure 1. The architecture of the proposed CT-D2GAN framework. (Upper panel) The convolutional transformer generator consists of a convolutional encoder, a temporal self-attention module, and a convolutional decoder. Multi-head self-attention is applied on the feature maps extracted from the convolutional encoder: the feature maps are transformed to multi-head feature maps via a convolutional operation; within each head, a global average pooling (GAP) operation generates a spatial feature vector by aggregating over the spatial dimensions, which is concatenated with the positional encoding (PE) vector; the similarity between the query and memory feature vectors is then computed, and the attention weights are obtained by normalizing across time steps using a softmax; the attended feature map is a weighted average of the feature maps at different time steps; the final attended map is the concatenation over all the heads; the final integrated map is a weighted average of the query and the attended feature maps according to a spatial selective gate (SSG), and is decoded to the predicted future frame with the convolutional decoder. (Lower panels) The image discriminator (left) and video discriminator (right) used in our dual discriminator GAN framework.

3. CT-D2GAN

In this section, we first introduce the problem formulation and the input to our framework. Then, we present the motivation and technical details of the proposed CT-D2GAN framework, including the convolutional transformer, the dual discriminator GAN, the overall loss function, and lastly the regularity score calculation. An overview of the framework is illustrated in Figure 1.

In CT-D2GAN, a convolutional transformer is employed to generate the future frame prediction based on past frames, while an image discriminator and a video discriminator are used to maintain the local consistency and global coherence, respectively.

3.1. Problem Statement

Given an input representation of a video clip of length $T$, i.e., $\{X_1, \dots, X_T\}$ with $X_t \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, we aim to predict the $(T{+}1)$-th frame as $\hat{X}_{T+1}$ and identify abnormal activities based upon the prediction error, i.e., $e = \|\hat{X}_{T+1} - X_{T+1}\|$, where $X_{T+1}$ is the ground truth future frame.

3.2. Input

As appearance and motion are two key characteristics of video data, it is common to explicitly incorporate optical flow together with the still images to describe a video sequence (Simonyan and Zisserman, 2014); e.g., optical flow has been employed to represent video sequences in the cGAN framework (Ravanbakhsh et al., 2017) and used as a motion constraint (Liu et al., 2018).

In this work, we stack each image with pre-computed optical flow maps (Brox et al., 2004; Ilg et al., 2017) in the channel dimension as inputs, similar to Simonyan and Zisserman (2014) for video action recognition and Ravanbakhsh et al. (2017) for video anomaly detection. The optical flow maps consist of a horizontal component, a vertical component, and a magnitude component. Note that the optical flow map is computed from the previous and current images and thus does not contain future frame information. Therefore, the input can be given as $X_t = [I_t, O_t]$, where $I_t$ is the image and $O_t$ the optical flow map at time $t$, and we use $T = 5$ consecutive frames as inputs, similar to Liu et al. (2018).
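As an illustration, the snippet below sketches how one such input clip could be assembled, assuming the frames and pre-computed flow maps are already available as arrays; the 256×256 resolution and single-channel images are assumptions for the example rather than values taken from the paper.

```python
import numpy as np

def build_clip(frames, flows):
    """Stack each image with its optical flow map along the channel axis.

    frames: list of T arrays of shape (H, W, C_img), ordered in time.
    flows:  list of T arrays of shape (H, W, 3) holding the horizontal,
            vertical, and magnitude components, where flows[i] is computed
            from frames[i-1] -> frames[i] (no future information is used).
    Returns an array of shape (T, H, W, C_img + 3).
    """
    clip = [np.concatenate([img, flo], axis=-1) for img, flo in zip(frames, flows)]
    return np.stack(clip, axis=0)

# Toy example with T = 5 random placeholder frames.
T, H, W = 5, 256, 256
frames = [np.random.rand(H, W, 1).astype(np.float32) for _ in range(T)]
flows = [np.random.rand(H, W, 3).astype(np.float32) for _ in range(T)]
clip = build_clip(frames, flows)  # shape (5, 256, 256, 4)
```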

3.3. Convolutional Transformer

The convolutional transformer is developed to obtain a future frame prediction based on past frames. It consists of three key components: a convolutional encoder to encode the spatial information, a temporal self-attention module to capture the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame.

3.3.1. Convolutional Encoder

The convolutional encoder (Long et al., 2015) is employed to extract spatial features from each frame of the video. Each frame is first resized to a fixed resolution and then fed into the convolutional encoder, which consists of 5 convolutional blocks following a common CNN structure, with all convolutional kernels sharing the same spatial size. For brevity, we denote a convolutional layer (with its stride and number of filters) as Conv, a batch normalization layer as BN, a scaled exponential linear unit (Klambauer et al., 2017) as SELU, and a dropout operation as dropout. The structure of the convolutional encoder is: [Conv-SELU-BN]-[Conv-SELU-BN-Conv-SELU]-[Conv-SELU-BN-Conv-SELU]-[Conv-SELU-BN-dropout-Conv-SELU]-[Conv-SELU-BN-dropout-Conv-SELU], where each [·] represents a convolutional block.

At the $l$-th convolutional block, $F_t^{(l)} \in \mathbb{R}^{H_l \times W_l \times C_l}$ denotes the input feature maps to the self-attention module, with $H_l$, $W_l$, and $C_l$ the height, width, and number of channels, respectively. The temporal dynamics among the spatial feature maps of different time steps are then encoded with the temporal self-attention module.
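A minimal PyTorch sketch of such an encoder is given below; since the kernel sizes, strides, and filter counts are not fully specified above, the hyperparameters here (3×3 kernels, stride-2 downsampling from the second block on, filter counts 32-256) are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of the 5-block convolutional encoder (assumed hyperparameters)."""

    def __init__(self, in_ch=4, drop=0.3):
        super().__init__()
        c = [32, 64, 128, 256, 256]  # assumed filter counts per block
        self.blocks = nn.ModuleList([
            # block 1: keeps full resolution (assumption)
            nn.Sequential(nn.Conv2d(in_ch, c[0], 3, stride=1, padding=1),
                          nn.SELU(), nn.BatchNorm2d(c[0])),
            # blocks 2-3: Conv-SELU-BN-Conv-SELU with stride-2 downsampling
            nn.Sequential(nn.Conv2d(c[0], c[1], 3, stride=2, padding=1), nn.SELU(),
                          nn.BatchNorm2d(c[1]),
                          nn.Conv2d(c[1], c[1], 3, padding=1), nn.SELU()),
            nn.Sequential(nn.Conv2d(c[1], c[2], 3, stride=2, padding=1), nn.SELU(),
                          nn.BatchNorm2d(c[2]),
                          nn.Conv2d(c[2], c[2], 3, padding=1), nn.SELU()),
            # blocks 4-5: add dropout as in the listed structure
            nn.Sequential(nn.Conv2d(c[2], c[3], 3, stride=2, padding=1), nn.SELU(),
                          nn.BatchNorm2d(c[3]), nn.Dropout2d(drop),
                          nn.Conv2d(c[3], c[3], 3, padding=1), nn.SELU()),
            nn.Sequential(nn.Conv2d(c[3], c[4], 3, stride=2, padding=1), nn.SELU(),
                          nn.BatchNorm2d(c[4]), nn.Dropout2d(drop),
                          nn.Conv2d(c[4], c[4], 3, padding=1), nn.SELU()),
        ])

    def forward(self, x):
        # return the feature maps of every block, used later by the temporal
        # self-attention module and the decoder's skip connections
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats
```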

3.3.2. Temporal Self-attention Module

To explicitly encode the temporal information in the video sequence, we extend the self-attention mechanism of the transformer model (Vaswani et al., 2017) and develop a novel temporal self-attention module to capture the temporal dynamics of the multi-scale spatial feature maps generated by the convolutional encoder. The following applies to all encoder levels, so we omit the layer index $l$ for clarity. An illustration of the multi-head temporal self-attention module is shown in the upper panel of Figure 1.

Spatial Feature Vector. We first use global average pooling (GAP) to extract a feature vector $f_t$ from the feature map $F_t$ produced by the convolutional encoder. The feature vector $f_T$ at the current time step will be used as part of the query, and each historical feature vector $f_t$, $t = 1, \dots, T$, will be used as part of the key to index the spatial feature maps.

Positional Encoding. Different from sequence models such as LSTM, self-attention does not model sequential information inherently; therefore it is necessary to incorporate temporal positional information into the model. We generate a positional encoding vector $p_t$ following (Vaswani et al., 2017):

$$p_t(2i) = \sin\!\big(t / 10000^{2i/d}\big), \qquad p_t(2i+1) = \cos\!\big(t / 10000^{2i/d}\big), \tag{1}$$

where $d$ denotes the dimension of $p_t$, $t$ denotes the temporal position, and $i$ denotes the index of the dimension. Empirically, we fix $d$ in our study.
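A small helper implementing this standard sinusoidal encoding is shown below; the value of $d$ used in the example is only a placeholder, since the paper's own choice is not given above.

```python
import math
import torch

def positional_encoding(T, d):
    """Sinusoidal positional encoding (Vaswani et al., 2017), one row per time step."""
    pe = torch.zeros(T, d)
    position = torch.arange(T, dtype=torch.float32).unsqueeze(1)            # (T, 1)
    div_term = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d))                        # (d/2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = positional_encoding(T=5, d=16)   # d = 16 is an assumed placeholder value
```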

Temporal Self-Attention. We concatenate the positional encoding vector with the spatial feature vector at each time step and use the concatenated vectors as the queries and keys, and the feature maps as the values, in the setting of the self-attention mechanism. For the query frame at the current time $T$, the concatenated feature vector $q_T = [f_T, p_T]$ is used as the query and compared to the concatenated feature vector of each frame from the input video clip, i.e., the memory $k_t = [f_t, p_t]$, using cosine similarity:

$$s_t = \frac{q_T \cdot k_t}{\|q_T\|\,\|k_t\|}. \tag{2}$$

Based on the similarity between $q_T$ and $k_t$, we generate the normalized attention weights across the temporal dimension using a softmax function:

$$a_t = \frac{\exp(\lambda\, s_t)}{\sum_{t'=1}^{T} \exp(\lambda\, s_{t'})}, \tag{3}$$

where a positive temperature variable $\lambda$ is introduced to sharpen the level of focus in the softmax function and is automatically learned in the model through a single hidden densely-connected layer with the query as the input.

The final attended feature maps are a weighted sum of all feature maps using the attention weights in Eq. (3):

$$\tilde{F} = \sum_{t=1}^{T} a_t\, F_t. \tag{4}$$

Multi-head Temporal Self-Attention. Multi-head self-attention (Vaswani et al., 2017) enables the model to jointly attend to information from different representation subspaces at different positions. We adapt it to spatio-temporal modeling by first mapping the spatial feature maps to $N_h$ groups, each using 32 convolutional kernels. For each group of feature maps, we then perform the single-head self-attention described in the previous subsection and generate the attended feature maps for head $h$ as $\tilde{F}^{(h)}$:

$$\tilde{F}^{(h)} = \sum_{t=1}^{T} a_t^{(h)}\, F_t^{(h)}, \tag{5}$$

where $F_t^{(h)}$ is the transformed feature map at frame $t$ for head $h$, and $a_t^{(h)}$ is the corresponding attention weight. The final multi-head attended feature map is the concatenation of the attended feature maps from all the heads along the channel dimension:

$$\tilde{F} = \big[\tilde{F}^{(1)}, \tilde{F}^{(2)}, \dots, \tilde{F}^{(N_h)}\big]. \tag{6}$$

In this way, the final attended feature maps not only integrate spatial information from the convolutional encoder, but also capture temporal information through the multi-head temporal self-attention mechanism.
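The single-head computation of Eqs. (2)-(4) can be sketched in PyTorch roughly as follows; the temperature sub-network width and the positional-encoding dimension are assumptions, and a multi-head version would simply apply this module to each group of transformed feature maps before concatenating the results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelfAttention(nn.Module):
    """Single-head sketch of the temporal self-attention in Eqs. (2)-(4)."""

    def __init__(self, channels, pe_dim=16, hidden=32):
        super().__init__()
        # single-hidden-layer network predicting a positive temperature from the query
        self.temp_net = nn.Sequential(nn.Linear(channels + pe_dim, hidden), nn.SELU(),
                                      nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, feats, pe):
        """feats: (B, T, C, H, W) feature maps; pe: (T, pe_dim) positional encodings.
        The last time step is treated as the query frame."""
        B, T, C, H, W = feats.shape
        vec = feats.mean(dim=(3, 4))                                        # GAP -> (B, T, C)
        keys = torch.cat([vec, pe.unsqueeze(0).expand(B, -1, -1)], dim=-1)  # concat PE
        query = keys[:, -1]                                                 # current frame
        sim = F.cosine_similarity(query.unsqueeze(1), keys, dim=-1)         # Eq. (2), (B, T)
        tau = self.temp_net(query)                                          # learned temperature
        attn = F.softmax(tau * sim, dim=1)                                  # Eq. (3), (B, T)
        attended = (attn.view(B, T, 1, 1, 1) * feats).sum(dim=1)            # Eq. (4)
        return attended, attn

# usage: attended, w = TemporalSelfAttention(channels=64)(feats, positional_encoding(5, 16))
```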

Spatial Selective Gate. The aforementioned module extends the self-attention mechanism to the temporal modeling of 2D image feature maps; however, it comes with a loss of fine-grained spatial resolution due to the GAP operation. To compensate for this, we introduce the spatial selective gate (SSG), a spatial attention mechanism that integrates the current and historical information. The attended feature maps $\tilde{F}$ from the temporal self-attention module and the transformed feature maps of the current query, denoted $F'_T$ (the concatenation of the $F_T^{(h)}$ over heads), are concatenated, on which we learn a spatial selective gate $G$ using a sub-network with structure Conv-BN-SELU-Conv-BN-SELU-Conv-BN-SELU-Conv-Conv-Sigmoid. The final output is a pixel-wise weighted average of the attended maps and the current query's multi-head transformed feature maps, according to $G$:

$$\hat{F} = G \odot \tilde{F} + (1 - G) \odot F'_T, \tag{7}$$

where $\odot$ denotes element-wise multiplication.

We add an SSG at each level of the temporal self-attention module. As the spatial dimensions are larger at shallow layers and we want to include contextual information while preserving the spatial resolution, we use dilated convolutions (Yu and Koltun, 2016) with different dilation factors at the 4 convolutional blocks in the sub-network; specifically, across the 4 levels, from shallower to deeper, the dilation factors are (1,2,4,1), (1,2,2,1), (1,1,2,1), and (1,1,1,1). Note that the SSG is computationally more efficient than directly forwarding the concatenated feature maps to the convolutional decoder.
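A minimal sketch of the SSG is given below; the hidden width follows no stated value and the four dilated Conv-BN-SELU blocks approximate the listed structure, so treat these choices as assumptions.

```python
import torch
import torch.nn as nn

class SpatialSelectiveGate(nn.Module):
    """Sketch of the SSG that blends attended maps with the query maps (Eq. (7))."""

    def __init__(self, channels, dilations=(1, 2, 4, 1), hidden=32):
        super().__init__()
        layers, in_ch = [], 2 * channels        # attended + query maps concatenated
        for d in dilations:                     # dilated Conv-BN-SELU blocks
            layers += [nn.Conv2d(in_ch, hidden, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(hidden), nn.SELU()]
            in_ch = hidden
        # final convolutions produce a one-channel gate, squashed by a sigmoid
        layers += [nn.Conv2d(hidden, hidden, 3, padding=1),
                   nn.Conv2d(hidden, 1, 1), nn.Sigmoid()]
        self.gate_net = nn.Sequential(*layers)

    def forward(self, attended, query):
        g = self.gate_net(torch.cat([attended, query], dim=1))   # (B, 1, H, W)
        return g * attended + (1.0 - g) * query                  # pixel-wise blend
```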

3.3.3. Convolutional Decoder

The outputs of the temporal self-attention module are fed into the convolutional decoder. The convolutional decoder predicts the video frame using 4 transposed convolutional layers with stride 2, applied to the feature maps in the reverse order of the convolutional encoder. The full-resolution feature maps then go through one convolutional layer with 32 filters and one convolutional layer that maps to the same number of channels as the input. In order to predict finer details, we utilize skip connections (Ronneberger et al., 2015) to connect the spatio-temporally integrated maps at each level of the convolutional encoder to the corresponding level of the convolutional decoder, which allows the model to further fine-tune the predicted frames.
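The decoder can be sketched as follows, with channel counts mirroring the encoder sketch above (and therefore assumptions, not the authors' exact configuration); the skip inputs are the spatio-temporally integrated maps produced by the self-attention module and SSG at each level.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Sketch of the convolutional decoder with skip connections."""

    def __init__(self, enc_channels=(32, 64, 128, 256, 256), out_ch=4):
        super().__init__()
        c = enc_channels
        self.up4 = self._up(c[4], c[3])           # deepest level, H/16 -> H/8
        self.up3 = self._up(2 * c[3], c[2])       # skip concatenated, H/8 -> H/4
        self.up2 = self._up(2 * c[2], c[1])       # H/4 -> H/2
        self.up1 = self._up(2 * c[1], c[0])       # H/2 -> H
        self.head = nn.Sequential(nn.Conv2d(2 * c[0], 32, 3, padding=1), nn.SELU(),
                                  nn.Conv2d(32, out_ch, 3, padding=1))

    @staticmethod
    def _up(cin, cout):
        # stride-2 transposed convolution that exactly doubles the resolution
        return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                             nn.SELU(), nn.BatchNorm2d(cout))

    def forward(self, skips):
        """skips: integrated feature maps per encoder level, shallowest first (length 5)."""
        s1, s2, s3, s4, s5 = skips
        x = self.up4(s5)
        x = self.up3(torch.cat([x, s4], dim=1))
        x = self.up2(torch.cat([x, s3], dim=1))
        x = self.up1(torch.cat([x, s2], dim=1))
        return self.head(torch.cat([x, s1], dim=1))
```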

3.4. Dual Discriminator GAN

We propose a dual discriminator GAN that uses both an image discriminator and a video discriminator to further enhance the future frame prediction of the convolutional transformer via adversarial training. The image discriminator critiques whether the current frame is generated or real on the basis of that single frame alone, assessing the local consistency. The video discriminator critiques the prediction conditioned on the past frames, assessing the global coherence. Specifically, we stack the past frames with the current generated or real frame in the temporal dimension, so the video discriminator is essentially a video classifier. The idea of combining a local and a global (contextual) discriminator is similar to adversarial image inpainting (Yu et al., 2018) but is used in a totally different context.

The network structures of the two discriminators are kept the same except that we use 2D operations in the image discriminator and the corresponding 3D operations in the video discriminator. We use the PatchGAN architecture as described in (Isola et al., 2017) and apply spectral normalization (Miyato et al., 2018) in each convolutional layer. In the 3D version, the stride and kernel size in the temporal dimension are set to 1 and 2, respectively.
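The two discriminators might be built as below; the widths and depth are assumed, while the spectral normalization, the PatchGAN-style patch outputs, and the temporal stride/kernel of the 3D version follow the description above.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def patch_discriminator(in_ch, dims=2, widths=(64, 128, 256)):
    """PatchGAN-style discriminator with spectral normalization.

    dims=2 gives the image discriminator, dims=3 the video discriminator
    (the latter convolves over the stacked past + current frames)."""
    def conv(cin, cout):
        if dims == 2:
            return spectral_norm(nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1))
        # 3D version: temporal stride 1 and temporal kernel size 2, as in the text
        return spectral_norm(nn.Conv3d(cin, cout, kernel_size=(2, 4, 4),
                                       stride=(1, 2, 2), padding=(0, 1, 1)))
    layers, c = [], in_ch
    for w in widths:
        layers += [conv(c, w), nn.LeakyReLU(0.2)]
        c = w
    layers.append(conv(c, 1))   # patch-wise critic scores (no sigmoid under WGAN-GP)
    return nn.Sequential(*layers)

image_D = patch_discriminator(in_ch=4, dims=2)   # judges the single predicted/real frame
video_D = patch_discriminator(in_ch=4, dims=3)   # judges the clip ending with that frame
```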

The method in Liu et al. (2018) is similar to using the image discriminator only. Different from the video discriminator in Tulyakov et al. (2018), which is applied to an entire synthetic video clip, our proposed video discriminator is conditioned on the real past frames.

Dataset Total # frames/clips Training # frames/clips Testing # frames/clips Anomaly Types
UCSD Ped2 4,560/28 2,550/16 2,010/12 biker, skater, vehicle
CUHK Avenue 30,652/37 15,328/16 15,324/21 running, loitering, object throwing
ShanghaiTech 315,307/437 274,516/330 40,791/107 biker, skater, vehicle, sudden motion
Table 1. Details of the video anomaly detection datasets.

3.5. Loss

For the adversarial training, we use the Wasserstein GAN with gradient penalty (WGAN-GP) setting (Arjovsky et al., 2017; Gulrajani et al., 2017). The generator is the mapping $G: \{X_1, \dots, X_T\} \rightarrow \hat{X}_{T+1}$. For the discriminators, $D_v$ and $D_i$ are the video and image discriminators, respectively. The GAN loss is:

$$\mathcal{L}_{\mathrm{GAN}} = \sum_{D \in \{D_v,\, D_i\}} \Big( \mathbb{E}\big[D(X_{T+1})\big] - \mathbb{E}\big[D(\hat{X}_{T+1})\big] - \lambda_{gp}\, \mathbb{E}\big[(\|\nabla_{\bar{X}} D(\bar{X})\|_2 - 1)^2\big] \Big), \tag{8}$$

where $\hat{X}_{T+1} = G(X_1, \dots, X_T)$ is the predicted frame, $\bar{X}$ is sampled uniformly along straight lines between real and predicted frames, and the video discriminator $D_v$ is additionally conditioned on the past frames. Each discriminator maximizes this objective while the generator minimizes its adversarial terms. The penalty coefficient $\lambda_{gp}$ is fixed as 10 in all our experiments.

In addition, we consider the pixel-wise loss of the prediction. Therefore, the total loss for the generator is:

$$\mathcal{L} = -\mathbb{E}\big[D_i(\hat{X}_{T+1})\big] - \mathbb{E}\big[D_v(\hat{X}_{T+1})\big] + \lambda_{p}\, \big\|\hat{X}_{T+1} - X_{T+1}\big\|_2^2, \tag{9}$$

where $\lambda_p$ weights the pixel-wise term.

We trained our models on each dataset separately by minimizing the losses above using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0002 and a batch size of 5.
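A simplified sketch of the resulting discriminator update (identical in form for $D_i$ and $D_v$) is shown below; the generator step, not shown, would add the pixel-wise prediction loss of Eq. (9) to its adversarial terms. The interpolation-based gradient penalty follows Gulrajani et al. (2017).

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """WGAN-GP gradient penalty for one discriminator (Gulrajani et al., 2017)."""
    eps = torch.rand([real.size(0)] + [1] * (real.dim() - 1), device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(interp).sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

def discriminator_step(D, opt_D, real, fake):
    """One Wasserstein critic update: push scores of fakes down and of reals up."""
    opt_D.zero_grad()
    loss = D(fake.detach()).mean() - D(real).mean() \
           + gradient_penalty(D, real, fake.detach())
    loss.backward()
    opt_D.step()
    return loss.item()
```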

3.6. Regularity Score

A regularity score $s(t)$ based on the prediction error $e(t)$ is calculated for each video frame:

$$s(t) = 1 - \frac{e(t) - \min_{t'} e(t')}{\max_{t'} e(t') - \min_{t'} e(t')}. \tag{10}$$

In Hasan et al. (2016), $e(t)$ is the frame-wise reconstruction error. In Liu et al. (2018), $e(t)$ is equivalently the negative frame-wise prediction PSNR (Peak Signal-to-Noise Ratio). In this study, we use a setting similar to the two methods above, with the frame-wise prediction error as $e(t)$.
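Concretely, given the per-frame prediction errors of a test video, the regularity score of Eq. (10) amounts to a min-max normalization (a small constant is added here only to guard against a zero denominator):

```python
import numpy as np

def regularity_score(errors, eps=1e-8):
    """Map frame-wise prediction errors to regularity scores in [0, 1];
    higher scores indicate more regular (normal) frames, as in Eq. (10)."""
    e = np.asarray(errors, dtype=np.float64)
    return 1.0 - (e - e.min()) / (e.max() - e.min() + eps)

# frames with larger prediction error receive lower regularity scores
scores = regularity_score([0.10, 0.12, 0.55, 0.60, 0.11])
```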

4. Experiments

In this section, we first introduce the three public datasets used in our experiments, which follow the same setup as other similar unsupervised video anomaly detection studies. Then, we report the video anomaly detection performance and compare it with other methods. Finally, we perform ablation studies to demonstrate the contribution of each component and interpret the results of the proposed CT-D2GAN.

4.1. Datasets

We evaluate our framework on three widely used public video anomaly detection datasets, i.e., the UCSD Ped2 dataset (Li et al., 2014) (http://www.svcl.ucsd.edu/projects/anomaly/dataset.html), the CUHK Avenue dataset (Lu et al., 2013) (http://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html), and the ShanghaiTech Campus (SH-Tech) dataset (Luo et al., 2017b) (https://github.com/StevenLiuWen/sRNN_TSC_Anomaly_Detection##shanghaitechcampus-anomaly-detection-dataset). We describe the dataset-specific characteristics and their effects on video anomaly detection performance below; some details can be found in Table 1.

4.1.1. UCSD Ped2.

UCSD Ped2 includes pedestrians and vehicles moving largely parallel to the camera plane.

4.1.2. CUHK Avenue.

CUHK Avenue includes pedestrians and objects both moving parallel to or toward/away from the camera. Slight camera motion is present in the dataset. Some of the anomalies are staged actions.

4.1.3. ShanghaiTech.

Different from the other datasets, the ShanghaiTech dataset is a multi-scene dataset (13 scenes) that includes pedestrians, vehicles, and sudden motions; the proportion of each scene in the training and test sets can differ.

Figure 2. Examples of video anomaly detection. The blue lines in the line graphs delineate frame-level regularity scores. The green and red shaded segments in the line graphs indicate the ground truth normal and abnormal video segments, respectively. The frames in the green boxes are regular frames from the regular video segments; the frames in the red boxes are abnormal frames from abnormal video segments. The abnormal objects are annotated.

4.2. Evaluation

The model was trained and evaluated on a system with an NVIDIA GeForce 1080 Ti GPU and implemented in PyTorch. To measure the effectiveness of our proposed CT-D2GAN framework for video anomaly detection, we report the area under the receiver operating characteristic (ROC) curve, i.e., the AUC. Specifically, the AUC is calculated by comparing the frame-level regularity scores with the frame-level ground truth labels.
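For reference, the frame-level AUC can be computed with scikit-learn as below, treating one minus the regularity score as the anomaly score (the toy scores and labels are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(regularity_scores, labels):
    """Frame-level AUC; labels are 1 for abnormal frames and 0 for normal frames.
    Since a high regularity score indicates a normal frame, (1 - regularity)
    serves as the anomaly score passed to the ROC computation."""
    anomaly_scores = 1.0 - np.asarray(regularity_scores, dtype=np.float64)
    return roc_auc_score(np.asarray(labels), anomaly_scores)

auc = frame_level_auc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])   # toy example -> 1.0
```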

Method UCSD Ped2 CUHK SH-Tech
MPPCA+SF (2010) 61.3 - -
MDT (2010; 2014) 82.9 - -
Conv-AE (2016)  85.0  80.0  60.9
3D Conv (2017) 91.2 80.9 -
Stacked RNN (2017b) 92.2 81.7 68.0
ConvLSTM-AE (2017a) 88.1 77.0 -
memAE (2019) 94.1 83.3 71.2
memNormality (2020) 97.0 88.5 70.5
ClusterAE (2020) 96.5 86.0 73.3
AbnormalGAN (2017) 93.5 - -
Frame prediction (2018) 95.4 85.1 72.8
Pred+Recon (2020) 96.3 85.1 73.0
CT-D2GAN 97.2 85.9 77.7

Evaluated in (Liu et al., 2018);
-: Not evaluated in the study.
Methods are ordered by publication year. The best performance on each dataset is highlighted in boldface.

Table 2. Frame-level video anomaly detection performance (AUC).

4.3. Video Anomaly Detection

To demonstrate the effectiveness of our proposed CT-D2GAN framework for video anomaly detection, we compare it against 12 different baseline methods. Among those, MPPCA (mixture of probabilistic principal component analyzers) + SF (social force) (Mahadevan et al., 2010), MDT (mixture of dynamic textures) (Mahadevan et al., 2010; Li et al., 2014) are handcrafted feature based methods; Conv-AE (Hasan et al., 2016), 3D Conv (Zhao et al., 2017), Stacked RNN (Luo et al., 2017b), and ConvLSTM-AE (Luo et al., 2017a) are encoder-decoder based approaches; MemAE (Gong et al., 2019), MemNormality (Park et al., 2020) and ClusterAE (Chang et al., 2020) are recent encoder-decoder based methods enhanced with memory module or clustering; AbnormalGAN (Ravanbakhsh et al., 2017), Frame prediction (Liu et al., 2018), and Pred+Recon (Tang et al., 2020) are methods based on adversarial training.

Table 2 shows the frame-level video anomaly detection performance (AUC) of the various approaches. We observe that encoder-decoder based approaches in general outperform handcrafted feature based methods. This is because the handcrafted features are usually extracted based upon a different objective and thus can be sub-optimal. Within the encoder-decoder based approaches, ConvLSTM-AE outperforms Conv-AE since it can better capture temporal information. We also notice that adversarial training based methods perform better than most baseline methods. Finally, our proposed CT-D2GAN framework achieves the best performance on UCSD Ped2 and SH-Tech, and performance close to the best on CUHK Avenue (Park et al., 2020). This is because our proposed model can not only capture the spatio-temporal patterns explicitly and effectively through the convolutional transformer but also leverage the dual discriminator GAN based adversarial training to maintain local consistency at the frame level and global coherence in video sequences. Recent memory or clustering enhanced methods (Park et al., 2020; Chang et al., 2020; Gong et al., 2019) show good performance; they are orthogonal to our proposed framework and could be integrated with it in future work to further improve performance. Examples of video anomaly detection results overlaid on the abnormal activity ground truth of all three datasets are shown in Figure 2, along with example video frames from the regular and abnormal video segments.

Due to the multi-scene nature of the SH-Tech dataset, we also analyzed the most ample single scene, which constitutes 25% (83/330 clips) of the training set and 32% (34/107 clips) of the test set; the AUC is 87.5, which is much better than on the overall dataset and reaches a level similar to the other single-scene datasets. This could imply that generalizing to less ample scenes is still a challenging task given an unbalanced training set.

Thanks to the convolutional transformer architecture and optimizations including spatial selective gate, our model is computationally efficient. At inference time, our model runs at 45 FPS on one NVIDIA GeForce 1080 Ti GPU.

Ablation setting AUC
Conv Transformer 94.2
Conv Transformer + image discriminator 95.7
Conv Transformer + video discriminator 96.9
U-Net + dual discriminator 95.5
CT-D2GAN 97.2
Table 3. Video anomaly detection performance under different ablation settings on UCSD Ped2 dataset.

4.4. Ablation Studies

To understand how each component contributes to the anomaly detection task, we conducted ablation studies with different settings: (1) the convolutional transformer only, without adversarial training (Conv Transformer), (2) Conv Transformer with the image discriminator only, (3) Conv Transformer with the video discriminator only, and (4) a U-Net based generator (as utilized in image-to-image translation (Isola et al., 2017) and video anomaly detection (Liu et al., 2018)) with the dual discriminator, and compare them with our full CT-D2GAN model. The performance comparison can be found in Table 3. We observe that adversarial training enhances the anomaly detection performance, with either the image discriminator or the video discriminator. The video discriminator alone achieves performance close to that of the dual discriminator, but we observed that the loss decreased faster when it was combined with the image discriminator. Using the image discriminator alone was not as effective, and the loss was less stable. Finally, CT-D2GAN achieves superior performance to U-Net with the dual discriminator, suggesting that the convolutional transformer better captures the spatio-temporal dynamics and thus yields more accurate detection.

4.5. Interpretation

We illustrate an example of a predicted future frame and compare it with the previous frame and the ground truth future frame in Figure 3. The prediction is poor for the anomaly (red box), while the model is able to capture the temporal dynamics by predicting the future behavior in the normal part of the image (green box).

Figure 3. An example showing the future frame prediction in the normal part of the image (green box, pedestrian in this case), where the model captures the dynamics of the behavior, and in the abnormal part of the image (red box, bicycle in this case), where there is a large prediction error. From left to right: the last frame $X_T$ in the input video clip, the predicted future frame $\hat{X}_{T+1}$, and the ground truth future frame $X_{T+1}$.

Self-attention weights under perturbation. It is not straightforward to directly interpret the temporal self-attention weight vector, as temporal self-attention is applied to an abstract representation of the video. Therefore, to further investigate the effectiveness of the temporal self-attention, we perturb two frames of the video and run inference on this perturbed video segment. For one frame (Figure 4, red), we added random Gaussian noise with zero mean and 0.1 standard deviation to the image to simulate deterioration in video quality; for another frame (Figure 4, purple), we scaled the optical flow maps by 0.9 to simulate frame rate distortion. We plot the temporal attention weights for the frame right after the two perturbed frames in Figure 4. The weights assigned to the perturbed frames are clearly lower than the others, implying less contribution to the attended map. This suggests that the self-attention module can adaptively select relevant feature maps and is robust to input noise.

Figure 4. Temporal self-attention weights in perturbed video clip.

5. Conclusions

In this paper, we developed Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection. The convolutional transformer, which consists of three components, i.e., a convolutional encoder to capture the spatial patterns of the input video clip, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features, was employed to perform future frame prediction. A dual discriminator based adversarial training approach was used to maintain the local consistency of the predicted frame and the global coherence conditioned on the previous frames. Thorough experiments on three widely used video anomaly detection datasets demonstrate that our proposed CT-D2GAN is able to detect abnormal frames with superior performance.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML). PMLR, 214–223.
  • Brox et al. (2004) Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. 2004. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision (ECCV). Springer, 25–36.
  • Chang et al. (2020) Yunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. 2020. Clustering Driven Deep Autoencoder for Video Anomaly Detection. In European Conference on Computer Vision (ECCV). Springer, 329–345.
  • Chong and Tay (2017) Yong Shean Chong and Yong Haur Tay. 2017. Abnormal event detection in videos using spatiotemporal autoencoder. In International Symposium on Neural Networks (ISNN). Springer, 189–196.
  • Dong et al. (2020) Fei Dong, Yu Zhang, and Xiushan Nie. 2020. Dual Discriminator Generative Adversarial Network for Video Anomaly Detection. IEEE Access 8 (2020), 88170–88176.
  • Gong et al. (2019) Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 2019. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In IEEE International Conference on Computer Vision (ICCV). IEEE, 1705–1714.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS). 5767–5777.
  • Hasan et al. (2016) Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 2016. Learning temporal regularity in video sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 733–742.
  • Ilg et al. (2017) Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2462–2470.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 5967–5976.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2015).
  • Klambauer et al. (2017) Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NIPS). 971–980.
  • Li et al. (2014) Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. 2014. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1 (2014), 18–32.
  • Liu et al. (2018) Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future Frame Prediction for Anomaly Detection – A New Baseline. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 6536–6545.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3431–3440.
  • Lu et al. (2013) Cewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal event detection at 150 FPS in MATLAB. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2720–2727.
  • Luo et al. (2017a) Weixin Luo, Wen Liu, and Shenghua Gao. 2017a. Remembering history with convolutional LSTM for anomaly detection. In IEEE International Conference on Multimedia and Expo (ICME). IEEE, 439–444.
  • Luo et al. (2017b) Weixin Luo, Wen Liu, and Shenghua Gao. 2017b. A revisit of sparse coding based anomaly detection in stacked RNN framework. IEEE International Conference on Computer Vision (ICCV) 1, 2 (2017), 3.
  • Mahadevan et al. (2010) Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1975–1981.
  • Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations (ICLR).
  • Nguyen et al. (2017) Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. 2017. Dual discriminator generative adversarial nets. In Advances in neural information processing systems (NIPS). 2670–2680.
  • Park et al. (2020) Hyunjong Park, Jongyoun Noh, and Bumsub Ham. 2020. Learning Memory-guided Normality for Anomaly Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 14372–14381.
  • Parmar et al. (2018) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In International Conference on Machine Learning (ICML). PMLR, 4052–4061.
  • Qin et al. (2017) Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. 2017. A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction. In International Joint Conference on Artificial Intelligence (IJCAI). 2627–2633.
  • Ravanbakhsh et al. (2017) Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe. 2017. Abnormal Event Detection in Videos using Generative Adversarial Nets. IEEE International Conference on Image Processing (ICIP) (2017), 1577–1581.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 234–241.
  • Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NIPS). 802–810.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS). 568–576.
  • Song and Tao (2010) Dongjin Song and Dacheng Tao. 2010. Biologically Inspired Feature Manifold for Scene Classification. IEEE Transactions on Image Processing 19, 1 (2010), 174–184.
  • Sultani et al. (2018) Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-World Anomaly Detection in Surveillance Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 6479–6488.
  • Tang et al. (2020) Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. 2020. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters 129 (2020), 123–130.
  • Tulyakov et al. (2018) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2018. MoCoGAN: Decomposing Motion and Content for Video Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1526–1535.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS). 6000–6010.
  • Xu et al. (2017) Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. 2017. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding 156 (2017), 117–127.
  • Xu et al. (2019) Han Xu, Pengwei Liang, Wei Yu, Junjun Jiang, and Jiayi Ma. 2019. Learning a Generative Model for Fusing Infrared and Visible Images via Conditional Generative Adversarial Network with Dual Discriminators.. In International Joint Conference on Artificial Intelligence (IJCAI). 3954–3960.
  • Yu and Koltun (2016) Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. International Conference on Learning Representations (ICLR) (2016).
  • Yu et al. (2018) Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. 2018. Generative Image Inpainting With Contextual Attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 5505–5514.
  • Zhang et al. (2019b) Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V. Chawla. 2019b. A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data. In Association for the Advancement of Artificial Intelligence (AAAI). AAAI, 1409–1416.
  • Zhang et al. (2019a) Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019a. Self-attention generative adversarial networks. In International Conference on Machine Learning (ICML). PMLR, 7354–7363.
  • Zhao et al. (2017) Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. 2017. Spatio-Temporal AutoEncoder for Video Anomaly Detection. In ACM International Conference on Multimedia (ACM MM). ACM, 1933–1941.