
Anomaly detection using prediction error with Spatio-Temporal Convolutional LSTM

by   Hanh Thi Minh Tran, et al.

In this paper, we propose a novel method for video anomaly detection motivated by an existing architecture for sequence-to-sequence prediction and reconstruction using a spatio-temporal convolutional Long Short-Term Memory (convLSTM). As in previous work on anomaly detection, anomalies arise as spatially localised failures in reconstruction or prediction. In experiments with five benchmark datasets, we show that using prediction gives superior performance to using reconstruction. We also compare performance with different length input/output sequences. Overall, our results using prediction are comparable with the state of the art on the benchmark datasets.



I Introduction

Automatically detecting abnormal events in video has been widely studied in recent years due to its broad range of applications, including wide-area surveillance and health monitoring. This problem is different from event detection where the event is clearly defined, since an anomaly is by definition unknown in advance and may arise from unfamiliar activities or activities in unfamiliar contexts.

The standard approach to anomaly detection has been to learn spatio-temporal models of normal activity using hand-crafted features [1, 2, 3, 4, 5, 6] or deep feature representations [7, 8]. An abnormality is detected when spatio-temporal patterns are observed that do not conform to the model of normality. Many different low-level features using dense optical flow (e.g., histograms [5], MHOF [3]) and other patterns of spatio-temporal gradient [4] have been used in the past. A model of normality is learned using these features extracted from training data and then used to determine numerical abnormality scores in test data. The model may be of several different kinds, including probabilistic models (e.g., mixture of probabilistic PCA [1], mixture of dynamic textures [2]), domain-based models (e.g., one-class SVM [6]), sparse coding [3] and Sparse Combination Learning (SCL) [4]. All of these methods have been used for anomaly detection and localization within the image frame.

Recently, deep learning architectures have been successfully applied in many computer vision tasks, including video anomaly detection. A key advantage of deep learning methods is that they can learn feature representations directly from training data without prior definition. For example, this can be done in an unsupervised manner using autoencoders [8, 7, 9]. A stacked de-noising autoencoder can be used to learn appearance and motion features for anomaly detection [8]. A winner-take-all sparsity constraint imposed within the autoencoder has been shown to produce flow features that are more discriminative for a one-class SVM [7], which is trained separately on the compressed representations learnt by the autoencoder.

Figure 1: The encoding-decoding structure used for future prediction or reconstruction with video volumes of frames.
Figure 2: The regularity score of a video sequence from the CUHK Avenue dataset [4]. The score decreases when an anomaly (a running man) appears in the scene.

End-to-end deep learning approaches have also been proposed for anomaly detection [10, 11, 12, 14, 15]. A convolutional autoencoder (convolutional AE) can be used to learn a model of normality from video, and the reconstruction error [10, 11, 12] or prediction error [14, 15] then provides a local measure for anomaly detection. A Generative Adversarial Network (GAN) can be employed to model the distribution of normal samples on some datasets [11, 14], by jointly optimising a generator with a discriminator that competes to distinguish real normal samples from generated ones. Motion dynamics can be learnt using a multi-channel approach that fuses appearance and motion information together with a cross-channel task that forces the generator to transform raw-pixel data into motion information and vice versa [11], or using FlowNet combined with U-net for single frame prediction [14]. The combination of a convolutional autoencoder and U-net has also been used to build a two-stream network with a shared encoder, in which one decoder reconstructs a single frame and the other translates an image to optical flow [15]. Another spatio-temporal autoencoder has been proposed for video anomaly detection [16]. The results show that applying 3D convolution in the encoder and 3D deconvolution in the decoder enhances the capability to extract motion patterns over the temporal dimension.

A memory module has been incorporated into the AE to address these problems [17, 18]. The encoder takes a normal video frame as input and extracts feature maps. The encoded features are then used to retrieve prototypical normal patterns from the memory items and to update the memory. The feature maps and aggregated memory items are then fed into the decoder to reconstruct the input video frame or predict the next frame. The global memory is read and written using matching probabilities between incoming encoded features and memory items, computed with cosine similarity and the softmax function. Since normal patterns in the training and testing sets may differ, the memory items are updated at both training and testing time, with a predefined threshold used to prevent updates on anomalous patterns [18]. However, it is impossible to find an optimal threshold to distinguish between normal and abnormal patterns under various scenarios. Meta-learning is introduced into a Dynamic Prototype Unit (DPU) to learn prototypes that encode normal dynamics and to enable fast adaptation to a new scene with only a few training frames [19]. As in previous work [18], the DPU takes as input the encoded feature maps, which are the outputs of the encoder part of a U-net, to generate a pool of dynamic prototypes. However, it is trained in a fully differentiable attention manner, in which the attention mapping functions are implemented as fully connected layers and updated by gradient descent. After the AE backbone is trained using only the frame prediction loss, the DPU module is trained in a meta-training phase using frame pairs sampled from videos of diverse scenes. In the testing phase, in order to adapt the model to a new scene, the first few frames of the sequence in this scene are used to construct K-shot input-output frame pairs. The results show that the DPU is more memory-efficient than the memory module in previous work [17, 18].

Another approach to learning regular spatio-temporal patterns is to use a convolutional LSTM [13, 20]. The motivation is that reconstruction over a longer duration using the memory of the LSTM should capture more complex flow patterns. A convolutional network is used to encode each frame, and the encoded tensors are fed to convolutional LSTMs to memorize changes in appearance, which correspond to motion information [20]. Two Deconvolutional Networks (DeconvNet) are used: one for reconstructing past frames and identifying whether an anomaly occurs, and one for reconstructing the current frame. Thus the reconstruction error is an indicator of the change in appearance or motion. The temporal unit in [13, 20] is applied at the final spatial stage, which encodes high-level representations. Interleaving RNNs between spatial convolution layers has recently been shown to improve performance on precipitation nowcasting [21]. Such a model can learn temporal information on hierarchical spatial representations from low level to high level. In our work, we adopt the same architecture, except that we retain convolutional LSTMs instead of the more complex trajGRU RNN [21]. Our results show a level of performance comparable to the state of the art on benchmark datasets, with fewer model parameters than state of the art models. Moreover, using prediction gives better performance than reconstruction. Finally, performance varies as expected with different prediction windows.

II Architecture

Figure 1 illustrates the encoding-decoding structure for future prediction or reconstruction, motivated by earlier work [21] and adapted for anomaly detection. At each time step, the network takes a video volume of consecutive video frames and generates an output volume of the same size, either predicting the future frames or reconstructing the input in reverse order.

II-A Encoding-decoding model

The structure consists of two networks, an encoding network and a decoding network (Fig. 1). The encoder contains three convolutional layers, each followed by a leaky ReLU [22] with negative slope equal to 0.2. For down-sampling, all three convolutional layers use strided convolution, which allows the network to learn its own spatial down-sampling. Similarly, three deconvolution layers are used in the decoder so that it learns its own spatial up-sampling. The goal of temporal encoding is to capture and compress changes due to motion in the input sequence into encoding hidden states that allow the decoder to reconstruct the input or predict the future. Spatio-temporal LSTM cells [23] are employed as the temporal encoder/decoder. At each time step, the convolutional LSTM (convLSTM) module receives as input a new video frame after projection into the spatial feature space. This is used together with the memory content and output of the previous step to compute new memory activations. Interleaving multiple convLSTMs between convolutional layers helps the model learn spatio-temporal dynamic information at different levels. The high-level states capture global spatio-temporal representations, while the lower-level states retain the detail of local spatio-temporal representations. After the last frame is read, the decoding LSTMs take the corresponding states from the encoder as their initial states and output an estimate of the target sequence (Fig. 1). Combining the low-level states with the up-sampled outputs as the initial states and inputs of the decoding LSTMs helps to aggregate low-level information into the up-scaling data stream. Therefore, the output contains details of both background and objects (Fig. 3).
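For reference, the convLSTM update described above (new input combined with the previous memory content and output) can be written as follows. This is a sketch of the cell of [23] in its peephole-free form: [23] additionally includes Hadamard peephole terms on the cell state, and whether they are used in the present model is not stated. The symbols $X_t$ (input feature map), $H_t$ (output), $C_t$ (memory content), $W$ and $b$ are notation assumed here; $*$ denotes convolution and $\circ$ the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```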

II-B Input data layer

Method AUC/EER (%)
UCSDPed1 UCSDPed2 CUHK Avenue Subway Entrance Subway Exit
Conv-WTA[7] - -
AMDN[8] - -
GAN [11] - - - -
Conv-AE [10]
Past-Current-LSTM [20]
FlowNet-Unet-GAN [14]
Two-streams AE [15] -
MemAE [17] -
LMN [18] -
MPD* [19]
MPD [19]
Ours (prediction)
Table I: Performance comparison with the state of the art.

The input to the model is a video volume consisting of consecutive frames. Each frame is extracted from raw video, converted to a gray-scale image and resized to a fixed resolution. The pixel values are scaled to a fixed range. We stack the frames into video volumes and use them as the input to the encoder. Following [10], we generate more video sequences by concatenating frames with skipping strides of 1, 2 and 3, thereby simulating faster motion patterns. Although speed can be important in anomaly detection, we still carry out this augmentation to minimise over-fitting and to allow a fair comparison with [10, 13]. Unlike [10], we do not stack precomputed optical flow into our input volume, in the expectation that the network can learn the necessary patterns of motion.
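The stride-based augmentation can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function names, the sliding of the window by one frame, and the scaling of pixels to [0, 1] are assumptions made here.

```python
import numpy as np


def scale_pixels(frames_uint8):
    """Scale 8-bit gray-scale frames to floats (assumed range: [0, 1])."""
    return frames_uint8.astype(np.float32) / 255.0


def make_volumes(frames, length, strides=(1, 2, 3)):
    """Cut a (num_frames, H, W) array into volumes of `length` frames.

    For a skipping stride s, a volume starting at frame t contains
    frames t, t + s, ..., t + (length - 1) * s.
    """
    volumes = []
    num_frames = frames.shape[0]
    for s in strides:
        span = (length - 1) * s + 1  # raw frames covered by one volume
        # Assumption: the window slides by one frame at a time.
        for start in range(num_frames - span + 1):
            volumes.append(frames[start:start + span:s])
    if not volumes:
        raise ValueError("video is too short for the requested volume length")
    return np.stack(volumes)
```

With a 10-frame clip and volumes of 4 frames, strides 1, 2 and 3 contribute 7, 4 and 1 volumes respectively.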

III Training

The weights and biases of each layer are learned by minimizing a regularized least squares error (Eq. 1), computed between the predicted frame sequence (or the reconstructed frame sequence) output by the model and the target sequence. The first term of Eq. 1 is the prediction error (or the reconstruction error) and the second term regularizes the weights. A hyper-parameter balances the importance of the two terms.
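A regularized least squares objective of the kind described above can be written, for illustration, in the following generic form. The symbols are notation assumed here: $\hat{Y}$ is the sequence output by the model, $Y$ the target sequence, $W$ the model weights and $\lambda$ the balancing hyper-parameter. Any normalisation constant in front of the first term is omitted, so this is a sketch of the structure rather than the exact expression.

```latex
\mathcal{L}(W) = \left\lVert \hat{Y} - Y \right\rVert_2^2 + \lambda \left\lVert W \right\rVert_2^2
```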

The weights in each convolutional layer are initialized from a zero-mean Gaussian distribution with standard deviation calculated from the number of input channels and the spatial filter size of the layer [24]. This is a robust initialization method that particularly considers the rectifier nonlinearities. We initialize the weights for convLSTM using a zero-mean Gaussian distribution with a fixed standard deviation of 0.01. The biases for all layers are initialized to zero. The input-to-hidden and hidden-to-hidden convolutional filters in the convLSTM cell are the same size.
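The initialization can be sketched as follows. The standard deviation sqrt(2 / (k * k * c_in)) is the basic form given in [24] for square k x k filters with c_in input channels; whether the authors also apply the leaky-ReLU correction of [24] is not stated, so it is left out. Function names and the (out, in, k, k) weight layout are assumptions made for illustration.

```python
import numpy as np


def he_std(in_channels, kernel_size):
    """Standard deviation sqrt(2 / fan_in) for a square kernel, as in [24]."""
    fan_in = in_channels * kernel_size * kernel_size
    return float(np.sqrt(2.0 / fan_in))


def init_conv_layer(out_channels, in_channels, kernel_size, rng):
    """Zero-mean Gaussian weights with the std above, and zero biases."""
    shape = (out_channels, in_channels, kernel_size, kernel_size)
    weights = rng.normal(0.0, he_std(in_channels, kernel_size), size=shape)
    biases = np.zeros(out_channels)
    return weights, biases


def init_convlstm_filter(shape, rng, std=0.01):
    """Zero-mean Gaussian weights with a fixed standard deviation of 0.01."""
    return rng.normal(0.0, std, size=shape)


rng = np.random.default_rng(0)
w, b = init_conv_layer(out_channels=64, in_channels=32, kernel_size=3, rng=rng)
```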

III-A Anomalous event detection

The Adam [25] method is used to optimize the error in Eq. 1, with a fixed batch size, the two Adam momentum parameters and weight decay [26]. We train a network separately on each dataset so that the model learns the specific normal patterns. An event may be normal in one dataset but abnormal in another. For example, people going towards the turnstile to enter the subway station is normal in the Subway Entrance dataset but abnormal in the Subway Exit dataset. Training starts from a fixed initial learning rate and stops after 80 epochs, after which the model is used for anomaly detection.

IV Regularity score for anomaly detection

Once the model is trained, the prediction error between each output frame and the corresponding target frame in the video sequence is computed, and the errors of all frames are summed to form the prediction error for a volume (Eq. 2).

The prediction error is then normalized to compute a regularity score for each testing volume, following [10] (Eq. 3). The normalization constants in Eq. 3 are calculated over the prediction errors of all volumes in the same test video. If the regularity score is less than a threshold, the corresponding test volume is labelled abnormal.
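The scoring step can be sketched as below. Two assumptions are made here and are not confirmed by the text: the per-frame error is taken to be the sum of squared pixel differences, and the normalization is a min-max scaling over the volumes of one test video, in the spirit of [10]. Function names are hypothetical.

```python
import numpy as np


def volume_error(predicted, target):
    """Prediction error of one volume: per-frame errors summed over all frames.

    Assumption: the per-frame error is the sum of squared pixel differences.
    """
    per_frame = np.sum((predicted - target) ** 2, axis=(1, 2))
    return float(per_frame.sum())


def regularity_scores(errors):
    """Map the volume errors of one test video to scores in [0, 1].

    Assumption: min-max normalization, so the volume with the smallest
    error scores 1 and the volume with the largest error scores 0.
    """
    e = np.asarray(errors, dtype=np.float64)
    e_min, e_max = e.min(), e.max()
    if e_max == e_min:
        return np.ones_like(e)  # all volumes are equally regular
    return 1.0 - (e - e_min) / (e_max - e_min)


def is_abnormal(scores, threshold):
    """A volume is flagged when its regularity score is below the threshold."""
    return np.asarray(scores) < threshold
```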

We also use the same architecture for reconstruction in our experiments. Instead of using the next frames as the target sequence, we use the input sequence in reverse order as the target. Replacing the target sequence in Eq. 2, we obtain the reconstruction error and use it for anomaly detection with the reconstruction model.
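The two training targets differ only in how the target sequence is taken from the video, as the following sketch shows. The function name and the `mode` argument are hypothetical, and the input and target volumes are assumed to have the same number of frames, as stated in Section II.

```python
import numpy as np


def make_pair(frames, start, length, mode):
    """Return an (input, target) pair of volumes with `length` frames each.

    mode="prediction":     target is the next `length` frames.
    mode="reconstruction": target is the input sequence in reverse order.
    """
    x = frames[start:start + length]
    if mode == "prediction":
        y = frames[start + length:start + 2 * length]
    elif mode == "reconstruction":
        y = x[::-1]
    else:
        raise ValueError("mode must be 'prediction' or 'reconstruction'")
    if x.shape[0] != length or y.shape[0] != length:
        raise ValueError("not enough frames for the requested pair")
    return x, y
```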

V Experiments

Our method is evaluated both quantitatively and qualitatively. We modify and use Caffe [27] for all our experiments. Code and trained models are available at

Figure 3: Prediction and reconstruction of the third frame of the sequence (middle), compared to the target frame (left); per-pixel error accumulated over the frames, shown as a blue-green-red colour map (right). Ground truth anomalies are shown as rectangles. Panels show reconstruction and prediction for UCSDPed2 (biker), UCSDPed1 (car) and CUHK Avenue (running). Best viewed in color.

V-A Datasets

Our models are trained on five of the most commonly used datasets for anomaly detection: UCSD (UCSDPed1 and UCSDPed2) [2], CUHK Avenue [4] and Subway (Entrance and Exit) [5]. The UCSD and CUHK datasets have separate training videos which contain mostly normal events. The first 12 minutes of Subway Entrance and the first 5 minutes of Subway Exit are used for training.

V-B Anomalous event detection

Two performance metrics are employed for evaluation and comparison with state of the art results: Equal Error Rate (EER) and Area Under the ROC Curve (AUC). The regularity score of each volume determines whether it is normal or abnormal. We follow the intuition that testing video volumes containing normal events generate high regularity scores (Eq. 3), since they are similar to the training data. A testing video sequence containing an anomaly gives a lower score. By setting different thresholds on the regularity score, volumes are classified into those that contain an anomaly and those that do not. These predictions are compared with the ground truth to give the ROC curve (TPR versus FPR) as the threshold varies, from which the EER and AUC are computed. Good performance corresponds to a low EER and a high AUC.
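The evaluation procedure can be sketched as follows. This is an illustrative NumPy version, not the authors' evaluation code: function names are hypothetical, labels use 1 for anomalous and 0 for normal volumes, and the EER is approximated by the ROC point where FPR and FNR are closest rather than by interpolation.

```python
import numpy as np


def roc_from_regularity(scores, labels):
    """ROC points obtained by sweeping a threshold on the regularity score.

    A volume is flagged as anomalous when its score is below the threshold.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=int)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both normal and anomalous volumes")
    # Ascending thresholds: from flagging nothing to flagging everything.
    thresholds = np.concatenate((np.unique(scores), [np.inf]))
    tpr = np.array([((scores < t) & (labels == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((scores < t) & (labels == 0)).sum() / n_neg for t in thresholds])
    return fpr, tpr


def auc_and_eer(scores, labels):
    fpr, tpr = roc_from_regularity(scores, labels)
    # Trapezoidal area under the ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # Approximate EER: the point where FPR and FNR = 1 - TPR are closest.
    fnr = 1.0 - tpr
    idx = int(np.argmin(np.abs(fpr - fnr)))
    eer = float((fpr[idx] + fnr[idx]) / 2.0)
    return auc, eer
```

For perfectly separated scores, such as regularity 0.9 and 0.8 for normal volumes and 0.2 and 0.1 for anomalous ones, this gives an AUC of 1 and an EER of 0.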

Table I shows that the model trained for prediction performs comparably to state of the art results. Performance on UCSDPed1 is relatively poor, whilst for CUHK Avenue the AUC is better than that of most methods, the exceptions being FlowNet-Unet-GAN [14], MemAE [17], LMN [18] and MPD [19]. However, MemAE [17], LMN [18] and MPD [19] have more parameters than our models, as shown in Table III.

Table II: Comparison of AUC/EER (%) with different models and different numbers of frames in the input sequence and the target sequence.

Table II shows the results when different models are used. In the table, “Reconstruction” denotes a model trained to reconstruct a sequence of 5 frames and “Prediction” denotes models trained to predict future frames. The model trained for future prediction gives better results than the reconstruction model. This may be because prediction will always tend to draw back to normality, whereas reconstruction has already seen the anomalous sequence it must reproduce. A qualitative comparison between reconstruction and prediction is shown in Figure 3.

Methods Parameters (M) FPS
Conv-AE [10]
ST-AE [13]
STAE-3D [16]
MemAE [17]
LMN [18]
Table III: Comparison of model complexity and testing speed.

Table III compares the number of model parameters of our method with those of different end-to-end trainable models in the state of the art. We achieve 75 fps for anomaly detection with a GeForce GTX TITAN X, faster than other state of the art methods with the same setting [18].

As can be seen in Fig. 3, the future prediction of a biker is worse than that of a pedestrian. Because the model is trained mostly on video sequences containing pedestrians, the predicted biker looks similar to a pedestrian. Here the prediction error is significantly larger than the reconstruction error.

VI Conclusion

We have adapted a state of the art predictive encoder-decoder deep network to detect abnormal events in video. We evaluate detection performance using both sequence prediction and reconstruction, and show that prediction gives superior performance on anomaly detection. For the prediction model, we obtain performance competitive with state of the art methods on five standard datasets. Finally, we evaluate performance across different prediction windows, encompassing varying levels of motion complexity. Our future work includes investigating the fusion of gray-scale images and optical flow at the input.


  • [1] Kim, Jaechul, and Kristen Grauman. "Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates." In 2009 IEEE conference on computer vision and pattern recognition, pp. 2921-2928. IEEE, 2009.
  • [2] Mahadevan, Vijay, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. "Anomaly detection in crowded scenes." In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 1975-1981. IEEE, 2010.
  • [3] Cong, Yang, Junsong Yuan, and Ji Liu. "Sparse reconstruction cost for abnormal event detection." In CVPR 2011, pp. 3449-3456. IEEE, 2011.
  • [4] Lu, Cewu, Jianping Shi, and Jiaya Jia. "Abnormal event detection at 150 fps in matlab." In Proceedings of the IEEE international conference on computer vision, pp. 2720-2727. 2013.
  • [5] Adam, Amit, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. "Robust real-time unusual event detection using multiple fixed-location monitors." IEEE transactions on pattern analysis and machine intelligence 30, no. 3 (2008): 555-560.
  • [6] Wang, Siqi, En Zhu, Jianping Yin, and Fatih Porikli. "Anomaly detection in crowded scenes by SL-HOF descriptor and foreground classification." In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3398-3403. IEEE, 2016.
  • [7] Tran, Hanh TM, and David Hogg. "Anomaly detection using a convolutional winner-take-all autoencoder." In Proceedings of the British Machine Vision Conference 2017. British Machine Vision Association, 2017.
  • [8] Xu, Dan, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. "Learning deep representations of appearance and motion for anomalous event detection." arXiv preprint arXiv:1510.01553 (2015).
  • [9] Tran, Thi Minh Hanh. "Anomaly Detection in Video." PhD diss., University of Leeds, 2018.
  • [10] Hasan, Mahmudul, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, and Larry S. Davis. "Learning temporal regularity in video sequences." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733-742. 2016.
  • [11] Ravanbakhsh, Mahdyar, Enver Sangineto, Moin Nabi, and Nicu Sebe. "Training adversarial discriminators for cross-channel abnormal event detection in crowds." In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1896-1904. IEEE, 2019.
  • [12] Zhao, Bin, Li Fei-Fei, and Eric P. Xing. "Online detection of unusual events in videos via dynamic sparse coding." In CVPR 2011, pp. 3313-3320. IEEE, 2011.
  • [13] Chong, Yong Shean, and Yong Haur Tay. "Abnormal event detection in videos using spatiotemporal autoencoder." In International symposium on neural networks, pp. 189-196. Springer, Cham, 2017.
  • [14] Liu, Wen, Weixin Luo, Dongze Lian, and Shenghua Gao. "Future frame prediction for anomaly detection–a new baseline." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536-6545. 2018.
  • [15] Nguyen, Trong-Nguyen, and Jean Meunier. "Anomaly detection in video sequence with appearance-motion correspondence." In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1273-1283. 2019.
  • [16] Zhao, Yiru, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. "Spatio-temporal autoencoder for video anomaly detection." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1933-1941. 2017.
  • [17] Gong, Dong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705-1714. 2019.
  • [18] Park, Hyunjong, Jongyoun Noh, and Bumsub Ham. "Learning memory-guided normality for anomaly detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14372-14381. 2020.
  • [19] Lv, Hui, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. "Learning normal dynamics in videos with meta prototype network." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15425-15434. 2021.
  • [20] Luo, Weixin, Wen Liu, and Shenghua Gao. "Remembering history with convolutional lstm for anomaly detection." In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439-444. IEEE, 2017.
  • [21] Shi, Xingjian, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. "Deep learning for precipitation nowcasting: A benchmark and a new model." Advances in neural information processing systems 30 (2017).
  • [22] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." In Proc. icml, vol. 30, no. 1, p. 3. 2013.
  • [23] Shi, Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. "Convolutional LSTM network: A machine learning approach for precipitation nowcasting." Advances in neural information processing systems 28 (2015).
  • [24] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." In Proceedings of the IEEE international conference on computer vision, pp. 1026-1034. 2015.
  • [25] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
  • [26] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems 25 (2012).
  • [27] Jia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. "Caffe: Convolutional architecture for fast feature embedding." In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675-678. 2014.