FastAno: Fast Anomaly Detection via Spatio-temporal Patch Transformation

06/16/2021 ∙ by Chaewon Park, et al. ∙ Yonsei University

Video anomaly detection has gained significant attention due to the increasing need for automatic monitoring of surveillance videos. In particular, the prediction-based approach is one of the most studied methods: anomalies are detected by predicting frames that include abnormal events in the test set after learning from the normal frames of the training set. However, many prediction networks are computationally expensive owing to their use of pre-trained optical flow networks, or fail to detect abnormal situations because their strong generative ability lets them predict even the anomalies. To address these shortcomings, we propose spatial rotation transformation (SRT) and temporal mixing transformation (TMT), which generate irregular patch cuboids within normal frame cuboids in order to enhance the learning of normal features. Additionally, the proposed patch transformation is used only during the training phase, allowing our model to detect abnormal frames at high speed during inference. Our model is evaluated on three anomaly detection benchmarks, achieving competitive accuracy and surpassing all previous works in terms of speed.




1 Introduction

Video anomaly detection refers to the task of recognizing unusual events in videos. It has gained attention with the widespread deployment of video surveillance systems. Surveillance cameras are widely used for public safety, but human monitoring capacity cannot keep pace with the volume of footage. Since abnormal events rarely happen in the real world compared to normal events, automatic anomaly detection systems are in high demand to reduce the monitoring burden. The task remains challenging, however, because suitable datasets are difficult to obtain owing to the imbalance of events, and because the definition of an abnormal event varies with the context of each video.
One of the challenging factors of anomaly detection is the data imbalance problem: abnormal scenes are far harder to capture than normal scenes because of their scarcity in the real world. Datasets with an equal number of both types of scenes are therefore hard to obtain, and consequently only normal videos are provided as training data [3]. This is known as the unsupervised approach to anomaly detection and is used by most previous works [11, 29, 33, 34]. An unsupervised network must learn the representative features of the normal training set and flag frames with outlying features as abnormal. Autoencoder (AE) [12]-based methods [1, 45, 18] have proven successful for this task. Frame-predicting AEs [18, 37] and frame-reconstructing AEs [28, 11] have been proposed under the assumption that anomalies unseen during training cannot be predicted or reconstructed by a model trained only on normal frames. However, these methods overlook a drawback of AEs: owing to the strong generalizing capacity of convolutional neural networks (CNNs) [10], an AE may generate anomalies as clearly as normal events. To mitigate this, Gong et al. [10] and Park et al. [30] proposed memory-based methods that use only the most essential features of normal frames for generation. However, memory-based methods are inefficient for videos with various scenes because their performance depends heavily on the number of memory items; many items are required to read and update the patterns of various scenes, which slows down detection.
Another critical issue for video anomaly detection is detection speed. The main purpose of anomaly detection is to flag abnormal events or emergencies immediately, and slow models defeat this purpose. In previous studies, the following factors are observed to slow down detection: heavy pre-trained networks such as optical flow estimators [18, 33, 34, 43], object detectors [8, 13], and pre-trained feature extractors [32, 36]. These modules are complex and computationally expensive.

Therefore, we take the detection speed into account and employ a patch transformation method that is used only during training. We implement this approach by artificially generating abnormal patches, applying transformations to patches randomly selected from the training dataset. We adopt spatial rotation transformation (SRT) and temporal mixing transformation (TMT) to generate a patch anomaly at a random location within a stacked frame cuboid. Given this anomaly-included frame cuboid, our AE is trained, despite the employed transformation, to predict the upcoming normal frame. The purpose of SRT is to generate an abnormal appearance and encourage the model to learn spatially invariant features of normal events. For instance, when a dataset defines walking pedestrians as normal and everything else as abnormal, giving the model a sequence containing a rotated person (e.g., upside-down, lying flat) and forcing it to generate a normally standing person teaches it the normal patterns of pedestrians. TMT, which shuffles the selected patch cuboid along the temporal axis to create abnormal motion, is intended to enhance the learning of temporally invariant features of normal events. Given a set of frames where an irregular motion takes place in a small area, the model has to learn how to rearrange the shuffled sequence in the right order to correctly predict the upcoming frame.
To the best of our knowledge, our framework performs faster than previous methods [18, 10, 30, 33, 13, 32] because it contains no additional modules or pre-trained networks. Furthermore, the proposed patch transformation does not reduce the detection speed because it is detached during inference. Likewise, we designed all components of our method with detection speed in mind, in an effort to make it suitable for real-world anomaly detection.
We summarize our contributions as follows:

  • We apply a patch anomaly generation phase to the training data to enforce normal pattern learning, especially in terms of appearance and motion.

  • The proposed patch generation approach can be used in conjunction with any backbone network during the training phase.

  • Our model performs at very high speed while achieving competitive performance on three benchmark datasets without any pre-trained models (e.g., optical flow networks, object detectors, or feature extractors).

Figure 2: The overview of our framework. During the training phase, SRT and TMT are employed to transform the input frame cuboid. The AE is trained to generate a succeeding frame that mimics the normal frame. During the testing phase, frames are fed into the AE and the corresponding output is generated. The normality score is used to discriminate abnormal frames. The output shown in this figure is overlaid with a difference map for better understanding. The values in brackets indicate [channel, temporal, height, width] of the feature maps and (depth, height, width) of the kernels, in order.

2 Related work

2.1 AE-based approach

Frame predicting and reconstructing AEs have been proposed under the assumption that models trained only on normal data are not capable of predicting or reconstructing abnormal frames, because these are unseen during training. Some studies [18, 37, 30, 22] trained AEs that predict a single future frame from several successive input frames. Additionally, many effective reconstructing AEs [28, 45, 30, 5] have been proposed. Cho et al. [5] proposed a two-path AE in which two encoders model appearance and motion features. Focusing on the fact that abnormal events occur in small regions, patch-based AEs [45, 41, 28, 7] have been proposed. However, AEs have been observed to generalize so well, mainly due to the capacity of CNNs, that they generate even abnormal events clearly, which leads to missing anomalies during detection. To alleviate this drawback, Gong et al. [10] and Park et al. [30] suggested networks that employ memory modules to read and update memory items. These methods showcased outstanding performance on several benchmarks but are observed to be ineffective for large datasets due to the limitation of memory size. Furthermore, some works [18, 33, 34, 32] have used optical flow to estimate motion features because temporal pattern information is crucial in anomaly detection.

2.2 Transformation-based approach

Many image transformation methods, such as augmentations, have been proposed to improve recognition performance and robustness under varying conditions and limited training data. These techniques were first applied to image recognition and later extended to video recognition. At the image level, Gidaris et al. [9] suggested unsupervised learning for image classification by predicting the rotation applied to the input. Krizhevsky et al. [17] used rotation, flipping, cropping, and color jittering to enhance the learning of spatially invariant features. Furthermore, DeVries et al. [6] devised CutOut, a method that deletes a box at a random location to prevent the model from focusing only on the most discriminative regions. Zhang et al. [46] proposed MixUp, which blends two training samples in both image and label space. Yun et al. [44] put forth a combination of CutOut and MixUp, called CutMix, which creates a new sample by deleting a square region from one image and replacing it with a patch from another. At the video level, augmentation techniques have been extended to the temporal axis. Ji et al. [14] proposed time warping and time masking, which randomly skip or adjust temporal frames.
Several studies have used the techniques above for video anomaly detection, based on the assumption that applying transformations to the input forces the network to embed critical information better. Zaheer et al. [45] suggested a pseudo-anomaly module that creates an artificial anomaly patch by blending two arbitrary patches from normal frames; they reconstructed both normal and abnormal patches and trained a discriminator to predict the source of the reconstructed output. Hasan et al. [11] and Zhao et al. [48] sampled the training data by skipping a fixed number of frames along the temporal axis. Joshi et al. [15] generated abnormal frames from normal ones by cropping an object detected with a semantic segmentation model and pasting it into another region of the frame to create an abnormal appearance. Wang et al. [40] applied random cropping, flipping, color distortion, rotation, and grayscale conversion to the entire frame. In contrast to these methods, our network embeds normal patterns by training on frames containing anomaly-like patches: we transform the input frames along the spatial or temporal axis to generate abnormal regions within the training data.

3 Proposed approach

This section presents an explicit description of our model formation. Our model consists of two main phases: (1) the patch anomaly generation phase and (2) the prediction phase.

3.1 Overall architecture

Fig. 2 presents the overview of our framework. During the training phase, we first load adjacent frames to make a frame cuboid. We then apply our patch anomaly generation to the frame cuboid, which is forwarded to the AE. Our AE extracts spatial and temporal patterns of the input and generates a future frame. During inference, patch anomaly generation is not employed; raw frame cuboids are fed as input to the AE. The difference between the output of the AE and the ground truth frame is used as a score to judge normality.

3.2 Patch anomaly generation phase

Abnormal events in videos fall into two broad categories: (1) anomalies of appearance (e.g., pedestrians on a vehicle road) and (2) anomalies of motion (e.g., an illegal U-turn or fighting in public). Hence, it is important to learn both the appearance and motion features of normal situations in order to detect anomalies of both kinds.
The patch anomaly generation phase takes place before feeding the frames to the generator. We load n successive frames, resize each of them, and concatenate them along the temporal axis to form a 4D frame cuboid. After that, we select a patch cuboid P from a random location within the frame cuboid to apply the transformation. Since anomalies usually occur in the foreground, we exclude a margin of 12.5 percent of the frame height from the top and bottom of the selection area. We heuristically find that these marginal regions are generally background and thus rarely contain moving objects. By limiting the range, P is more likely to capture foreground than background, encouraging the model to concentrate on moving objects. We then apply SRT or TMT to P to form a transformed patch cuboid P'; only one of the two is applied, chosen randomly for every input.
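The location sampling above can be sketched as follows; this is a minimal illustration in which the frame size, patch size, and the exact margin handling are assumptions rather than the authors' code:

```python
import random

def sample_patch_location(frame_h, frame_w, patch_h, patch_w, margin_ratio=0.125):
    """Pick a top-left corner for the patch cuboid, excluding a vertical
    margin (heuristically background) from the top and bottom of the frame.
    All names and defaults here are illustrative."""
    margin = int(frame_h * margin_ratio)
    y_min, y_max = margin, frame_h - margin - patch_h
    x_min, x_max = 0, frame_w - patch_w
    if y_max < y_min or x_max < x_min:
        raise ValueError("patch does not fit inside the allowed region")
    y = random.randint(y_min, y_max)   # row of the top-left corner
    x = random.randint(x_min, x_max)   # column of the top-left corner
    return y, x
```

The same (y, x) location is reused for every frame in the cuboid, so the transformed region stays spatially aligned across time.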
For SRT, each patch is rotated in a random direction among 0°, 90°, 180°, and 270°, following the approach of [9]. By forwarding these transformed frame cuboids to the frame generator, our network is encouraged to focus on the abnormal region and recognize the spatial features of normal appearances. Suppose a network is being trained on a dataset of people walking on a road. When it is given a frame cuboid with an upside-down person created by a 180° rotation among all the other normal pedestrians and is trained to predict the next normal scene, the network learns the spatial features of a normal person, such as that the head and the feet are generally placed at the top and bottom, respectively. Our SRT is formulated as follows:

P'_t = R_{θ_t}( I_t[x : x + w, y : y + h] ),    θ_t ∈ {0°, 90°, 180°, 270°},   (1)

where R_{θ_t} represents the rotation function for the patch within the pixel range [x, x + w] along the width axis and [y, y + h] along the height axis of input frame I_t. θ_t denotes the randomly set direction for the t-th frame, where t is the index of the input frame in the range [1, n]. Furthermore, w and h represent the fixed width and height of the patch, respectively. The final transformed patch cuboid P' is generated by concatenating the transformed patches P'_t along the temporal axis.
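As a concrete illustration of SRT, the following sketch rotates the selected patch of each frame by an independently drawn multiple of 90 degrees; the array layout and function names are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def srt(cuboid, y, x, patch_size, rng):
    """Spatial rotation transformation (illustrative sketch).
    `cuboid` has shape (n, H, W): n stacked grayscale frames.
    Each frame's patch gets its own random rotation, so both the
    appearance and the apparent motion of the region become irregular."""
    out = cuboid.copy()
    p = patch_size
    for t in range(cuboid.shape[0]):
        k = int(rng.integers(0, 4))  # rotate by k * 90 degrees
        out[t, y:y + p, x:x + p] = np.rot90(cuboid[t, y:y + p, x:x + p], k)
    return out
```

Reading the patch from the untouched `cuboid` and writing into the copy `out` avoids aliasing between the rotated view and its destination.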
TMT shuffles the sequence of the patch cuboid along the temporal axis with the intention of generating abnormal movement. The network needs to detect the awkward motion and restore the normal sequence before predicting the next frame, in order to reduce the loss and generate a frame as similar as possible to the ground truth. For example, when the patch sequence is reversed and a backward-walking person appears within a frame where only forward-walking people are annotated as normal, the model must infer the correct sequence of the abnormal person from the learned features to predict the correct trajectory. Our TMT function is as follows:

P'_t = C( I_{s(t)}[x : x + w, y : y + h] ),   (2)

where C denotes a function that copies the patch located in the pixel range [x, x + w] along the width axis and [y, y + h] along the height axis of input frame I_{s(t)} and pastes it into the t-th frame, and s represents the shuffled sequence of patch indices (e.g., the sequence (3, 1, 5, 2, 4) when n is 5). As with SRT, the final P' is the stack of the transformed patches P'_t.
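Analogously, TMT can be sketched by permuting the patch region along the temporal axis while leaving the surrounding pixels untouched; again the array layout and names are illustrative assumptions:

```python
import numpy as np

def tmt(cuboid, y, x, patch_size, rng):
    """Temporal mixing transformation (illustrative sketch).
    `cuboid` has shape (n, H, W). The selected patch region is shuffled
    along the temporal axis, producing locally irregular motion while
    the rest of each frame stays in its original order."""
    out = cuboid.copy()
    p = patch_size
    order = rng.permutation(cuboid.shape[0])  # shuffled temporal indices
    out[:, y:y + p, x:x + p] = cuboid[order, y:y + p, x:x + p]
    return out
```

Because the permutation is applied only to the patch region, the remainder of the cuboid keeps its normal temporal structure, matching the description of a small-area motion anomaly.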

Our patch anomaly generation phase is computationally cheaper than other methods that embed spatio-temporal feature extraction in the network, such as storing and updating memory items [10, 30] or estimating optical flow with pre-trained networks [18, 33, 43, 2]. It therefore boosts feature learning at low cost. Furthermore, this phase is not used during inference, meaning that it does not affect the detection speed at all. Thus, our model is low in complexity and computational cost (see Section 4.3).

(a) SRT
(b) TMT
Figure 5: Visualization of (a) SRT and (b) TMT. The frames in the upper rows are the original frame cuboid, with the regions marked in color indicating the locations of the selected patch cuboid. The frames in the lower rows are the transformed results.

3.3 AE architecture

The AE in our network aims to learn prototypical features of normal events and produce an output frame based on those features. Its main task is to predict the frame that follows the input frame cuboid. Therefore, it must learn temporal as well as spatial features to generate the frame with fine quality. The architecture of our model follows that of U-Net [35], in which skip connections between the encoder and the decoder boost generation ability by preventing vanishing gradients and preserving information symmetry. The encoder consists of a stack of three-layer blocks that reduce the resolution of the feature map. We employ 3D convolutions [38] to embed temporal learning in our model. Specifically, the first block consists of one convolutional layer and one activation layer, while the second and last blocks are identical in structure: convolutional, batch normalization, and activation layers. The same kernel size is used in all convolutional layers. The decoder also consists of a stack of three-layer blocks and is symmetrical to the encoder, except that the convolutional layers are replaced by deconvolutional layers to upscale the feature map. In addition, we use LeakyReLU activation [25] for the encoder and ReLU activation [27] for the decoder.
The architecture of our AE is thus very simple compared to previous studies, especially methods that employ pre-trained feature extractors [24, 36]. Since running time generally depends on the simplicity of the model architecture, our AE is designed with speed in mind.
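A rough PyTorch sketch of such an AE is given below; the channel widths, kernel sizes, and the final temporal-reduction head are assumptions for illustration, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class FastAnoLikeAE(nn.Module):
    """Minimal sketch of the described AE: a three-block 3D-convolutional
    encoder/decoder with U-Net-style skip connections. Hyperparameters
    here are placeholders, not the paper's values."""

    def __init__(self, in_ch=1, n_frames=5):
        super().__init__()
        s = (1, 2, 2)   # halve only the spatial resolution; keep time intact
        self.enc1 = nn.Sequential(nn.Conv3d(in_ch, 32, 3, s, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv3d(32, 64, 3, s, 1),
                                  nn.BatchNorm3d(64), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv3d(64, 128, 3, s, 1),
                                  nn.BatchNorm3d(128), nn.LeakyReLU(0.2))
        op = (0, 1, 1)  # output padding that undoes the spatial stride
        self.dec3 = nn.Sequential(nn.ConvTranspose3d(128, 64, 3, s, 1, op),
                                  nn.BatchNorm3d(64), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose3d(128, 32, 3, s, 1, op),
                                  nn.BatchNorm3d(32), nn.ReLU())
        self.dec1 = nn.ConvTranspose3d(64, in_ch, 3, s, 1, op)
        # collapse the temporal axis so a single future frame is predicted
        self.head = nn.Conv3d(in_ch, in_ch, (n_frames, 3, 3), padding=(0, 1, 1))

    def forward(self, x):                             # x: (B, C, n, H, W)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))    # skip connection
        d1 = self.dec1(torch.cat([d2, e1], dim=1))    # skip connection
        return self.head(d1)                          # (B, C, 1, H, W)
```

The stride (1, 2, 2) preserves the temporal length through the encoder so that skip connections concatenate cleanly; only the final head reduces the five input frames to one predicted frame.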

3.4 Objective function and normality score

Prediction loss. Our model is trained to minimize the prediction loss. We use the L2 distance (Eq. (3)) and the structural similarity index (SSIM) [39] loss (Eq. (4)) to measure the difference between the generated frame Î and the ground truth frame I. The L2 distance and SSIM capture the difference between frames at the pixel level and their similarity at the feature level, respectively. The functions are as follows:

L_2(I, Î) = ‖I − Î‖_2^2,   (3)

L_SSIM(I, Î) = 1 − ( (2 μ_I μ_Î + C_1)(2 σ_IÎ + C_2) ) / ( (μ_I² + μ_Î² + C_1)(σ_I² + σ_Î² + C_2) ),   (4)

where μ and σ² denote the average and variance of each frame, respectively, σ_IÎ represents the covariance, and C_1 and C_2 denote constants that stabilize the division. Following the work of Zhao et al. [47], we exploit a weighted combination of the two loss functions in our objective, as shown in Eq. (5):

L = λ_2 L_2 + λ_SSIM L_SSIM,   (5)

where λ_2 and λ_SSIM are the weights controlling the contributions of L_2 and L_SSIM, respectively. Consequently, our model is urged to generate outputs that resemble the ground truth frames at both the pixel and feature levels.
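The combined objective can be sketched as follows; note that this toy version computes a single global SSIM over the whole frame for brevity, whereas SSIM is normally computed over local windows, and the weight values are placeholders:

```python
import numpy as np

def prediction_loss(pred, gt, lam_l2=1.0, lam_ssim=1.0, c1=1e-4, c2=9e-4):
    """Weighted combination of the L2 distance and an SSIM loss between
    the predicted frame and the ground truth (illustrative sketch)."""
    l2 = np.mean((pred - gt) ** 2)                      # pixel-level term
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return lam_l2 * l2 + lam_ssim * (1.0 - ssim)        # SSIM of 1 gives zero loss
```

For identical frames the SSIM term reaches 1 and the L2 term vanishes, so the loss is zero; any deviation raises both terms.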

Method FPS Prediction-based CUHK Avenue [21] ShanghaiTech [24] UCSD Ped2 [26]

w/ pre-training

StackRNN [24] 50 81.7 68.0 92.2
FFP [18] 133 85.1 72.8 95.4
AD [34] 2 - - 95.5
AMC [29] 86.9 - 96.2
MemAE [10] 42 83.3 71.2 94.1
DummyAno [13] 11 90.4 84.9 97.8
AutoregressiveAE [1] - 72.5 95.4
AnoPCN [42] 10 86.2 73.6 96.8
GCLNC [49] 150 - 84.1 92.8
GMVAE [7] 120 83.4 - 92.2
VECVAD [43] 5 89.6 74.8 97.3
FewShotGAN [23] 85.8 77.9 96.2
AMmem [2] 86.6 73.7 96.6
MTL [8] 21 92.8 90.2 99.8

w/o pre-training

Matlab [20] 150 80.9 - -
ConvAE [11] 70.2 60.9 90.0
HybridAE [28] 82.8 - 84.3
CVRNN [22] 85.8 - 96.1
IntegradAE [37] 30 83.7 71.5 96.2
MNAD [30] 78 88.5 70.5 97.0
MNAD [30] 56 82.8 69.8 90.2
CDDAE [4] 32 86.0 73.3 96.5
OG [45] - - 98.1
Baseline 195 83.2 72.1 95.7
Ours 195 85.3 72.2 96.3
Table 1: Frame-level AUC scores (%) of the state-of-the-art methods versus our architecture trained with the patch anomaly generation phase. Pre-training includes any additional pre-trained models such as optical flow networks, object detectors, or feature extractors. The FPS values are based on the figures reported in each paper, except for the marked entries, which denote FPS computed in our re-implementation on the same device and environment as our model for a fair comparison. The top two results are marked with bold and underline.

Frame-level anomaly detection.

When detecting anomalies in the testing phase, we adopt the peak signal-to-noise ratio (PSNR) as the score for estimating the abnormality of the evaluation set. We compute this value between the predicted frame Î_t at time step t and the ground truth frame I_t:

PSNR(I_t, Î_t) = 10 log_10( [max(Î_t)]² / ( (1/N) Σ_i (I_{t,i} − Î_{t,i})² ) ),   (6)

where N denotes the number of pixels in the frame. Our model fails to generate Î_t accurately when I_t contains abnormal events, resulting in a low PSNR, and vice versa. Following the works of [18, 10, 24], we define the final normality score S(t) by normalizing PSNR(I_t, Î_t) of each video clip to the range [0, 1]:

S(t) = ( PSNR(I_t, Î_t) − min_t PSNR(I_t, Î_t) ) / ( max_t PSNR(I_t, Î_t) − min_t PSNR(I_t, Î_t) ).   (7)

Therefore, our model discriminates between normal and abnormal frames using the normality score of Eq. (7).
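A minimal sketch of this scoring, under the simplifying assumption that pixel intensities lie in [0, 1] so the peak value is 1:

```python
import numpy as np

def psnr(gt, pred):
    """PSNR between a ground-truth frame and its prediction,
    assuming pixel intensities in [0, 1]."""
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def normality_scores(psnrs):
    """Min-max normalize the per-clip PSNR values into [0, 1].
    Low scores indicate poorly predicted (likely abnormal) frames."""
    p = np.asarray(psnrs, dtype=float)
    return (p - p.min()) / (p.max() - p.min())
```

A frame is then flagged as abnormal when its normalized score falls below a chosen threshold.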

4 Experiments

This section presents experimental results on three datasets, with quantitative and qualitative evaluations and detailed discussion. We also show the effectiveness of our patch anomaly generation phase for video anomaly detection.

4.1 Implementation details

We implement all of our experiments in PyTorch [31], using a single Nvidia GeForce RTX 3090. Our model is trained using the Adam optimizer [16] with a learning rate of 0.0002, and a cosine annealing scheduler [19] is used to reduce the learning rate to 0.0001. We train our model for 20 epochs on the Avenue [21] and Ped2 [26] datasets and for five epochs on the ShanghaiTech dataset [24]. The number of input frames is empirically set to 5. We load frames in gray scale, resize them, and normalize the pixel intensities. In addition, we add random Gaussian noise to the training input, with the mean set to 0 and the standard deviation chosen randomly between 0 and 0.03. Furthermore, we set the patch width and height to 60 pixels. The batch size is 4 during training. The optimal weights for the loss function in Eq. (5) are determined empirically.

We adopt the area under the curve (AUC) of the receiver operating characteristic (ROC), obtained from the frame-level scores and the ground truth labels, as the evaluation metric. This metric is used in most studies [28, 45] on video anomaly detection. The baseline model, mentioned throughout the following sections, denotes our model without the patch anomaly generation phase. Since the first five frames of each clip cannot be predicted, they are ignored in the evaluation, following [18, 30, 37].
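The frame-level AUC can be computed directly from the rank statistics of the scores; the following dependency-free sketch is equivalent to the ROC-AUC, with ties counted as half:

```python
def frame_level_auc(scores, labels):
    """Frame-level AUC sketch: the probability that a randomly chosen
    abnormal frame receives a lower normality score than a randomly
    chosen normal frame. Labels follow the convention used here:
    1 = normal, 0 = abnormal."""
    normal = [s for s, y in zip(scores, labels) if y == 1]
    abnormal = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a < n) + 0.5 * (a == n) for a in abnormal for n in normal)
    return wins / (len(normal) * len(abnormal))
```

In practice a library routine (e.g., scikit-learn's `roc_auc_score`) would be used; this pairwise form merely makes the metric's meaning explicit.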

4.2 Datasets

CUHK Avenue [21]. This dataset captures an avenue on a campus. It consists of 16 training and 21 testing clips. The training clips contain only normal events, while the testing clips contain a total of 47 abnormal events such as running, loitering, and throwing objects. The frames have a resolution of 640 × 360 pixels, all in RGB. The size of people is inconsistent due to the camera angle. Furthermore, the camera is fixed most of the time, but brief shaking occurs in the evaluation set.

UCSD Ped2 [26]. The UCSD Ped2 dataset [26] is acquired from a pedestrian walkway by a fixed camera at a long distance. The training and testing sets consist of 16 and 12 clips, respectively. Anomalies in the testing clips are non-pedestrian objects, for instance, bikes, cars, and skateboards. The frames are in gray scale with a resolution of 360 × 240 pixels.

ShanghaiTech Campus [24]. The ShanghaiTech Campus dataset [24] is acquired from 13 different scenes, each with training and testing sequences, unlike the two aforementioned datasets, which follow the single-scene formulation. There are a total of 330 training videos and 107 testing videos, where non-pedestrian objects (e.g., cars, bicycles) and aggressive motions (e.g., brawling, chasing) are annotated as anomalies. Each frame is captured at 856 × 480 RGB pixels. This dataset is among the largest for video anomaly detection.

4.3 Experimental results

Impact of patch anomaly generation phase. Table 2 shows the impact of our patch anomaly generation, estimated on Avenue [21] and Ped2 [26]. The results cover five conditions: (1) using only TMT, (2) using only SRT, (3) randomly applying TMT or SRT, but with all patches rotated as a chunk in the same direction for SRT (represented as SRT* in Table 2), (4) randomly applying TMT or SRT with varying directions for each patch, and (5) applying both TMT and SRT to the selected patch cuboid. From the results, SRT contributes more than TMT to the detection performance. This is because our SRT rotates each patch randomly in varying directions, generating anomalies in motion as well as in appearance.

Method        Avenue [21]  ST [24]  Ped2 [26]
Baseline      83.2         72.1     95.7
TMT only      83.0         72.2     95.1
SRT only      85.0         72.4     96.0
TMT or SRT*   84.5         72.1     95.2
TMT or SRT    85.3         72.2     96.3
TMT + SRT     84.6         72.2     96.2

Table 2: We demonstrate the impact of our patch anomaly generation through ablation studies on CUHK Avenue [21], ShanghaiTech (ST) [24], and Ped2 [26]. We present the frame-level AUC (%) of experiments on five variations: using only TMT, using only SRT, randomly selecting between TMT and single-directional SRT (indicated as *), randomly selecting between TMT and SRT, and using both TMT and SRT.

Performance comparison with existing works. We compare the frame-level AUC of our model with those of non-prediction-based methods [11, 24, 33, 36, 34, 28, 10, 13, 1, 30, 43, 45, 8] and prediction-based methods [18, 37, 29, 30] (see Table 1). Our method achieves competitive performance on the three datasets at a very high frame rate. Among the prediction-based methods, we exceed IntegradAE [37] on all datasets and show superior results especially on the Ped2 dataset [26]. Note that our model performs on par with other models without any additional modules, whereas several other prediction-based models [18, 37, 29] employ pre-trained optical flow networks to estimate motion features. Among the non-prediction-based networks, Georgescu et al. [8] achieved superior performance by combining self-supervised learning with a pre-trained object detector.

Furthermore, we conduct a score gap comparison, inspired by Liu et al. [18], to present the discriminating capacity of our model. Fig. 6 shows that our model achieves higher gaps than FFP [18], a prediction network boosted with optical flow loss and generative learning, and MNAD [30], a prediction method that reads and updates memory items from a memory module. This demonstrates the effectiveness of our patch anomaly generation phase, as the score distributions of normal and abnormal frames are pushed significantly far apart.

Figure 6: Following the work of Liu et al. [18], we compare our work with FFP [18] and MNAD [30] by calculating the score gap between normal frames and abnormal frames on CUHK Avenue [21] and UCSD Ped2 [26]. The gap is obtained by averaging the scores of normal frames and those of abnormal frames and subtracting the two values. A higher gap represents a higher capacity for discriminating normal and abnormal frames.
Figure 7: Results of ablation studies on patch size.

Running time. Our model runs at 195 frames per second (FPS). This rate is computed on the UCSD Ped2 [26] test set with a single Nvidia GeForce RTX 3090 GPU, by averaging the total time consumed in both frame generation and anomaly scoring. To our knowledge, this is far faster than any previous work. We show a comparison with other networks in Table 1: we re-implemented the networks whose official code is publicly available, on the same device and environment used for our network (their FPS is marked in the table), and followed the figures reported in each paper for methods without public code. Note that our work is nearly 30% faster than the second fastest ones [49, 20]. Moreover, we computed the number of trainable parameters as further evidence of the 195 FPS: 2.15 M for our model, versus 15.65 M for MNAD [30] at 67 FPS and 14.53 M for FFP [18] at 25 FPS. Our network is remarkably cheaper in computation than the compared methods.
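Throughput measurement of this kind can be sketched as below, where `process_frame` stands for one full inference step (frame generation plus anomaly scoring) and is purely illustrative; warm-up iterations are excluded so one-time setup costs do not skew the average:

```python
import time

def measure_fps(process_frame, frames, warmup=10):
    """Average end-to-end throughput sketch: run a few warm-up steps,
    then time the remaining frames and divide into frames per second."""
    for f in frames[:warmup]:          # warm-up, excluded from timing
        process_frame(f)
    start = time.perf_counter()
    for f in frames[warmup:]:
        process_frame(f)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```

When timing GPU inference, the step function should also synchronize the device before the timer is read, otherwise asynchronous kernel launches understate the true latency.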

Ablation studies on patch size. Fig. 7 shows the results of ablation experiments conducted on the Avenue dataset [21] to observe the effect of patch size, which determines the smallest unit the AE is forced to focus on. In these experiments, only the size of the patch is varied while the frame resolution remains fixed, so a smaller patch captures a comparably small region and a larger patch captures a large one. Our network shows the lowest accuracy with the largest patch size, which covers more than 10 percent of the frame. When the patch is considerably large, the model focuses on larger movements rather than smaller ones; since abnormal events usually occur in small regions, lower performance is observed in this case.

(a) CUHK Avenue
(b) ShanghaiTech
(c) UCSD Ped2
Figure 11: Score plots from evaluation. The red and blue lines denote the normality score and the ground truth labels, respectively; labels are 0 when frames are abnormal. (a) is obtained from Clips 4, 5, and 6 of Avenue [21]: running, throwing a bag, and moving in the wrong direction are well detected. (b) is obtained from Clip 1 of ShanghaiTech [24]: chasing and running are detected as anomalies. (c) is obtained from Clips 1, 2, 3, and 4 of Ped2 [26]: the captured anomalies within these clips are bicycles and a car.
(a) UCSD Ped2
(b) CUHK Avenue
(c) ShanghaiTech
Figure 15: Examples of predicted frames and difference maps compared to our baseline. Best viewed in color.

Qualitative results. We demonstrate the frame-level detection performance of our model in Fig. 11. It can be observed that the normality score rapidly decreases when anomalies appear in the frames, and increases immediately once the abnormal objects disappear.
Furthermore, the object-level detection capacity is shown in Fig. 15, where we present examples of predicted frames and the corresponding difference maps, comparing each sample with those of our baseline model. In the Ped2 [26] example, the annotated anomaly is a bicycle, an unseen appearance. In Avenue [21] and ShanghaiTech [24], the annotated anomalies relate to motion: a man throwing a bag and a running person. The outputs generated by our model trained with the patch anomaly generation phase are significantly blurrier than those of the baseline, validating the effectiveness of our transformation phase. Note that our model nearly erased the bag and the person in the Avenue [21] and ShanghaiTech [24] examples, which shows that it does not simply reproduce abnormal objects by copying from the inputs, as the baseline does. Moreover, on the ShanghaiTech dataset [24], the difference map of our model highlights a larger region than that of the baseline. We observe that our model did not accept the motion in the input: it attempted to predict the trajectory of the runner as learned during training, whereas the baseline generated a moderate copy of the input based on the given trajectory.

5 Conclusion

In this paper, we proposed an unsupervised prediction network for video anomaly detection with a patch anomaly generation phase. We designed a light-weight AE model to learn the common spatio-temporal features of normal frames. The proposed method generated transformed frame cuboids as inputs, by applying SRT or TMT to a random patch cuboid within the frame cuboid. Our model was encouraged to pay attention to the appearance and motion patterns of normal scenes. In addition, we discussed the impact of the patch anomaly generation by conducting ablation studies. Furthermore, the proposed method achieved competitive performance on three benchmark datasets and performed at a very high speed, which is as important as the detection capacity in anomaly detection.


  • [1] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara (2019) Latent space autoregression for novelty detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [2] R. Cai, H. Zhang, W. Liu, S. Gao, and Z. Hao (2021) Appearance-motion memory consistency network for video anomaly detection.
  • [3] V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM Computing Surveys 41 (3).
  • [4] Y. Chang, Z. Tu, W. Xie, and J. Yuan (2020) Clustering driven deep autoencoder for video anomaly detection. In European Conference on Computer Vision, pp. 329–345.
  • [5] M. Cho, T. Kim, I. Kim, and S. Lee (2020) Unsupervised video anomaly detection via normalizing flows with implicit latent features. arXiv preprint arXiv:2010.07524.
  • [6] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • [7] Y. Fan, G. Wen, D. Li, S. Qiu, M. D. Levine, and F. Xiao (2020) Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder. Computer Vision and Image Understanding 195, pp. 102920.
  • [8] M. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah (2020) Anomaly detection in video via self-supervised and multi-task learning. arXiv preprint arXiv:2011.07491.
  • [9] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
  • [10] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705–1714.
  • [11] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 733–742.
  • [12] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §1.
  • [13] R. T. Ionescu, F. S. Khan, M. Georgescu, and L. Shao (2019) Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7842–7851. Cited by: §1, §1, Table 1, §4.3.
  • [14] J. Ji, K. Cao, and J. C. Niebles (2019) Learning temporal action proposals with fewer labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7073–7082. Cited by: §2.2.
  • [15] A. Joshi and V. P. Namboodiri (2019) Unsupervised synthesis of anomalies in videos: transforming the normal. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.2.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §2.2.
  • [18] W. Liu, W. Luo, D. Lian, and S. Gao (2018-06) Future frame prediction for anomaly detection – a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.1, §3.2, §3.4, Table 1, Figure 6, §4.1, §4.3, §4.3.
  • [19] I. Loshchilov and F. Hutter (2016)

    Sgdr: stochastic gradient descent with warm restarts

    arXiv preprint arXiv:1608.03983. Cited by: §4.1.
  • [20] C. Lu, J. Shi, and J. Jia (2013-12) Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: Table 1, §4.3.
  • [21] C. Lu, J. Shi, and J. Jia (2013) Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on computer vision, pp. 2720–2727. Cited by: Table 1, Figure 11, Figure 6, §4.1, §4.2, §4.3, §4.3, §4.3, Table 2.
  • [22] Y. Lu, K. M. Kumar, S. shahabeddin Nabavi, and Y. Wang (2019) Future frame prediction using convolutional vrnn for anomaly detection. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–8. Cited by: §2.1, Table 1.
  • [23] Y. Lu, F. Yu, M. K. K. Reddy, and Y. Wang (2020) Few-shot scene-adaptive anomaly detection. In European Conference on Computer Vision, pp. 125–141. Cited by: Table 1.
  • [24] W. Luo, W. Liu, and S. Gao (2017) A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, pp. 341–349. Cited by: §3.3, §3.4, Table 1, Figure 11, §4.1, §4.2, §4.3, §4.3, Table 2.
  • [25] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §3.3.
  • [26] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos (2010) Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1975–1981. Cited by: Table 1, Figure 11, Figure 6, §4.1, §4.2, §4.3, §4.3, §4.3, §4.3, Table 2.
  • [27] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Icml, Cited by: §3.3.
  • [28] T. N. Nguyen and J. Meunier (2019) Hybrid deep network for anomaly detection. arXiv preprint arXiv:1908.06347. Cited by: §1, §2.1, Table 1, §4.1, §4.3.
  • [29] T. Nguyen and J. Meunier (2019-10) Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, Table 1, §4.3.
  • [30] H. Park, J. Noh, and B. Ham (2020) Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14372–14381. Cited by: §1, §1, §2.1, §3.2, Table 1, Figure 6, §4.1, §4.3, §4.3.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • [32] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe (2018) Plug-and-play cnn for crowd motion analysis: an application in abnormal event detection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1689–1698. Cited by: §1, §1, §2.1.
  • [33] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe (2017) Abnormal event detection in videos using generative adversarial nets. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 1577–1581. Cited by: §1, §1, §2.1, §3.2, §4.3.
  • [34] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe (2019) Training adversarial discriminators for cross-channel abnormal event detection in crowds. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1896–1904. Cited by: §1, §2.1, Table 1, §4.3.
  • [35] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.3.
  • [36] W. Sultani, C. Chen, and M. Shah (2018) Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6479–6488. Cited by: §1, §3.3, §4.3.
  • [37] Y. Tang, L. Zhao, S. Zhang, C. Gong, G. Li, and J. Yang (2020) Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters 129, pp. 123–130. Cited by: §1, §2.1, Table 1, §4.1, §4.3.
  • [38] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §3.3.
  • [39] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §3.4.
  • [40] Z. Wang, Y. Zou, and Z. Zhang (2020) Cluster attention contrast for video anomaly detection. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2463–2471. Cited by: §2.2.
  • [41] D. Xu, Y. Yan, E. Ricci, and N. Sebe (2017) Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding 156, pp. 117–127. Cited by: §2.1.
  • [42] M. Ye, X. Peng, W. Gan, W. Wu, and Y. Qiao (2019) Anopcn: video anomaly detection via deep predictive coding network. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1805–1813. Cited by: Table 1.
  • [43] G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. Kloft (2020) Cloze test helps: effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 583–591. Cited by: §1, §3.2, Table 1, §4.3.
  • [44] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019)

    Cutmix: regularization strategy to train strong classifiers with localizable features

    In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032. Cited by: §2.2.
  • [45] M. Z. Zaheer, J. Lee, M. Astrid, and S. Lee (2020) Old is gold: redefining the adversarially learned one-class classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14183–14193. Cited by: §1, §2.1, §2.2, Table 1, §4.1, §4.3.
  • [46] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §2.2.
  • [47] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2016) Loss functions for image restoration with neural networks. IEEE Transactions on computational imaging 3 (1), pp. 47–57. Cited by: §3.4.
  • [48] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X. Hua (2017) Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1933–1941. Cited by: §2.2.
  • [49] J. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li (2019) Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1237–1246. Cited by: Table 1, §4.3.