Decoupled Appearance and Motion Learning for Efficient Anomaly Detection in Surveillance Video

11/10/2020 ∙ by Bo Li, et al. ∙ Ghent University

Automating the analysis of surveillance video footage is of great interest when urban environments or industrial sites are monitored by a large number of cameras. As anomalies are often context-specific, it is hard to predefine events of interest and collect labelled training data. A purely unsupervised approach for automated anomaly detection is much more suitable. For every camera, a separate algorithm could then be deployed that learns over time a baseline model of appearance and motion related features of the objects within the camera viewport. Anything that deviates from this baseline is flagged as an anomaly for further analysis downstream. We propose a new neural network architecture that learns the normal behavior in a purely unsupervised fashion. In contrast to previous work, we use latent code predictions as our anomaly metric. We show that this outperforms reconstruction-based and frame prediction-based methods on different benchmark datasets both in terms of accuracy and robustness against changing lighting and weather conditions. By decoupling an appearance and a motion model, our model can also process 16 to 45 times more frames per second than related approaches which makes our model suitable for deploying on the camera itself or on other edge devices.




1 Introduction

Rising concerns for public security and safety have increased the number of surveillance cameras installed in our streets and public places [1, 15, 10, 3, 22]. Human operators in a control room continuously inspect these video streams on a multi-screen video wall, looking for abnormal events that may mandate further inspection. As human operators can only process a few video streams at the same time, part of the surveillance workflow must be automated when the number of cameras grows. By automating the detection of anomalous events, human operators can focus on the appropriate response to these events, e.g. by requesting a police intervention.

Since anomalies are context-specific [25], each video stream requires a tailored anomaly detection algorithm. A running person for example is considered an anomaly in a busy shopping street but it might be normal in a train station as people are often in a hurry to catch the train. In this work we introduce a new neural network architecture that is able to recognize anomalous events in a surveillance camera stream using only unsupervised training. We propose to decouple the learning of appearance and motion information which are the key factors for determining anomalies in a surveillance video. We first train an autoencoder by reconstructing individual frames to capture high level appearance features such as the location, shape and size of an object. Such features however can not guarantee the detection of anomalies that are caused by motion related features such as speed or trajectory. We therefore add a second component that further exploits the spatiotemporal information of the frequently seen events by predicting the latent code for a future frame using the stacked latent codes of the previous frames as the input. The underlying assumption is that the anomalous events are rare occasions and will not be modeled accurately by the networks. The predicted latent code of anomalous frames will hence deviate significantly from the observed latent codes.

Our approach is easy to implement and achieves state-of-the-art performance on benchmark datasets. We however do not only focus on detection accuracy but also address several other obstacles for real-world deployment. Our model is much more efficient than related approaches, which could make it possible to evaluate our model at the network edge, on or nearby the surveillance camera itself, as opposed to streaming all data to a central point for analysis. Inference at the edge is also a more privacy friendly paradigm since a human operator will not see the camera data unless his intervention is needed. Lastly, our experimental results indicate that detection performance based on prediction of latent codes is more robust against changing weather and lighting conditions.

The remainder of this paper is organized as follows: In section 2 we give an overview of related anomaly detection methods. In section 3 we introduce our approach and we experimentally validate it on different benchmark datasets in section 4. In section 5 we show that our approach is more robust against different distortions. We conduct an ablation study in section 6 to analyze the role of different components of the model. We conclude in section 7 and give a few pointers for future research directions.

2 Related work

Deep learning is currently the state-of-the-art method for many computer vision related tasks [24] and is also the technique behind the state-of-the-art anomaly detection methods for video surveillance type data. We can differentiate three different approaches to do anomaly detection with deep learning: reconstruction based methods, prediction based methods and methods that use characteristics of the latent code to detect anomalies.

2.1 Reconstruction based methods

The most common approach is to build models that reconstruct their input. These models are based on an autoencoder architecture that contains a bottleneck for encoding high level features, creating a compressed representation of the input data. These compressed representations are then used to reconstruct the input data. The assumption here is that the reconstruction works fine for inputs that are similar to the data that was seen during training but that anomalous inputs can not be modelled accurately by the learned features, resulting in a poor reconstruction. The reconstruction error is then used as a metric to detect anomalies.

For our use case of anomaly detection in video surveillance, it is not enough to only model spatial information by processing individual frames; we also need to consider the temporal information to detect anomalies that are caused by motion, such as high speed or irregular movement. Different approaches have been explored to incorporate this information into the model. [6] exploit the spatiotemporal information by reconstructing multiple stacked frames using an autoencoder. [28] apply a denoising autoencoder to reconstruct frames. They use optical flow maps to describe the motion information. Another option is to use Recurrent Neural Networks (RNN) or Long Short-Term Memory networks (LSTM) [24] to model the time dimension [4, 17]. Finally, there is also work that explores the use of 3D convolutional networks (C3D) to model spatiotemporal representations. [26] show that C3D can encapsulate information related to shapes and motions in a video sequence better than a 2D based model, thus boosting the anomaly detection accuracy.

2.2 Prediction based methods

Instead of reconstructing the input, it is also possible to predict future frames. This requires a better understanding of temporal information. [15] use a generative adversarial network (GAN) [5] that takes stacked frames and optical flow features as input. Anomalies are detected at test time by measuring the difference between the predicted and observed future frame. Any deviation from the expected frame is considered as an anomaly. For more specialized applications we can also use domain specific features. [22] for example deal with human-related anomaly detection by reconstructing and predicting the decomposed global body movement and local body posture from the human skeleton movement.

2.3 Latent code based methods

Both lines of previous work generate expected frames and detect anomalies by measuring the difference in pixel space with the actual input frame. However, it is well known that pixel-wise similarity measurements do not necessarily correspond with human understanding of images [13, 21] and are often very sensitive to minor changes in brightness or color. On the other hand, the high level features extracted by a neural network are shown to be less sensitive to these distortions. There are some very recent approaches that extract high level features with an autoencoder and then use a classifier such as a one-class SVM on the extracted features [2, 10] to detect anomalies. The assumption is that the classifier will place the anomalies outside of the learned manifold. The work that is most similar to our approach is the Latent Space Autoregression model from [1]. They use features from a deep 3D convolutional autoencoder combined with an autoregressive network to model the probability distribution underlying the latent representation. They combine the reconstruction error and the likelihood of the latent code to identify the anomalies. Our approach is similar in that we also extract features and work in a high level latent code. However, our approach uses a 2D autoencoder to extract features, which greatly reduces the number of parameters and the computational cost. We use a less complicated feed forward model to do the prediction, which does not make any assumptions on the distribution family of the latent code. It allows us to directly use the Mean Squared Error (MSE) between the predicted latent code and the latent code that is extracted from the encoder as the anomaly score. Finally, we explicitly predict the latent code of a future frame instead of relying on the reconstruction of the current frame to capture the motion information.

Figure 1: Overview of our approach. We use the same encoder to extract a latent code for each frame of the input sequence; the prediction target is the frame that is 6 timesteps into the future. In the training phase, these latent codes are used to (a) predict latent codes for future frames with the motion model and (b) reconstruct current frames with the decoder. In the inference phase, we only use the encoder and the motion model. (c) Conv3D layers are used in the motion model to learn spatiotemporal information. The figure is best viewed in color.

3 Architecture

In this paper we propose a decoupled architecture to learn the spatiotemporal information which is important for determining anomalies in surveillance videos. We first train an autoencoder to reconstruct individual input frames and aim to represent the appearance information such as shape, location and outlook of an object in the latent codes. Then to further stress the appearance information for frequently seen events and the dynamical aspects in a video, we stack these extracted latent codes from the encoder for a sequence of frames and use that as the input for a second network to predict the latent code for a future frame. The model is assumed to be only able to predict the latent codes for frequently seen events with high accuracy. The difference between the predicted and the observed latent codes is then used as the anomaly metric. The following sections explain these in detail.

3.1 Learning appearance features

To learn high level features, we use a U-Net [23] type autoencoder that is trained to reconstruct individual input frames. To force the model to focus on the foreground, we subtract a background frame from each input frame. This background frame is calculated as a frame with per-pixel mean RGB values over all training data. The encoder learns to extract latent codes from a single frame and the decoder learns to reconstruct the input based on the extracted features. The original U-Net architecture has shortcut connections between encoder and decoder. To avoid the trivial solution of copying feature maps from the encoder to the decoder and to improve the regularization power, we instead add a shortcut connection between the previous frame and the current frame. In other words, the feature maps that are calculated using the previous frame are concatenated with the feature maps of the current frame in the upsampling path for reconstructing the current frame. The detailed architecture is shown in Figure 1 (a) and (b).
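The background subtraction step can be illustrated with a small sketch. It assumes frames are flattened lists of pixel values; the function names `mean_background` and `subtract_background` are hypothetical and not taken from the paper's code.

```python
def mean_background(frames):
    """Per-pixel mean over a list of equally sized frames (flat lists of floats)."""
    if not frames:
        raise ValueError("need at least one training frame")
    n_pixels = len(frames[0])
    sums = [0.0] * n_pixels
    for frame in frames:
        if len(frame) != n_pixels:
            raise ValueError("all frames must have the same size")
        for i, value in enumerate(frame):
            sums[i] += value
    return [s / len(frames) for s in sums]


def subtract_background(frame, background):
    """Remove the static background so that mostly foreground remains."""
    return [p - b for p, b in zip(frame, background)]


# Toy example: three 4-pixel "frames"; only the last pixel changes (a moving object).
train_frames = [
    [10.0, 20.0, 30.0, 0.0],
    [10.0, 20.0, 30.0, 0.0],
    [10.0, 20.0, 30.0, 90.0],
]
background = mean_background(train_frames)
foreground = subtract_background(train_frames[2], background)
print(background)  # static pixels keep their value, the changing pixel is averaged
print(foreground)  # static pixels become zero, the object stands out
```

In practice the same operation would be applied per colour channel on image arrays; the flat-list representation is only chosen to keep the example dependency-free.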

3.2 Learning motion features

Spatiotemporal information is important for detecting anomalies in videos. However, the features that are extracted by encoding a single frame as described above can only focus on the spatial information such as shape, location and size of an object and cannot guarantee the detection of the motion-related anomalies. We thus include a second component that can attend to both the spatial and the temporal dimension to further learn the dynamical aspects of a video sequence. Previous work considered learning the temporal information either by predicting optical flows using a pretrained FlowNet [6, 15] or by predicting future frames in pixel space [1, 15]. This however has three drawbacks. First, optical flow estimation is computationally expensive, even when evaluating a single frame on a GPU machine [8]. Secondly, we need to consider the interaction between appearance and motion information. For example, a vehicle driving with very high speed is usually an anomaly except when it is an ambulance. Independently encoding the appearance and motion information using a pretrained optical flow model cannot take this into account. Finally, the pixel-wise Mean Square Error (MSE) objective function for predicting future frames often generates blurry frames [21, 29]. Instead, we decide to predict the latent code of a future frame through a small motion learning model as shown in Fig. 1 (c).

To predict the latent code of a future frame, we extract latent codes for the previous input frames. The latent codes extracted by the encoder for each of the past frames are then concatenated along the temporal dimension and used as the input for the motion model, which predicts the latent code of the future frame. We use 3D convolutional layers in the motion model since these can attend to both motion and appearance whereas a 2D convolution layer is only able to work in the spatial direction [26]. Each convolutional block includes a 3D convolutional layer with kernel size 3x3x3, stride 2 on the temporal dimension and stride 1 on the feature dimension. This is followed by a batch normalization layer and a leaky ReLU activation layer. We use three convolutional blocks in the motion model.
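To see what three stride-2 blocks do to the temporal dimension, the following sketch applies the standard convolution output-size formula. The padding of 1 is an assumption (the text does not state it), and the helper names are hypothetical. Under that assumption, both the 8-frame and the 6-frame inputs used in the experiments collapse to a single temporal step, which matches a model that predicts one latent code.

```python
def conv_output_length(length, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (standard floor formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def temporal_lengths(num_frames, num_blocks=3):
    """Temporal length after each of the stacked stride-2 blocks."""
    lengths = [num_frames]
    for _ in range(num_blocks):
        lengths.append(conv_output_length(lengths[-1]))
    return lengths


print(temporal_lengths(8))  # [8, 4, 2, 1]
print(temporal_lengths(6))  # [6, 3, 2, 1]
```

With a different padding choice the intermediate lengths change, so these numbers illustrate the mechanism and do not describe the exact layer shapes of the published model.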

3.3 Training

Our proposed framework consists of two parts: video frame reconstruction and latent code prediction. The objective function to train both components end-to-end therefore combines three terms.

The first term measures the pixel-wise reconstruction loss, averaged over the total number of pixels per frame and over the reconstructed frames of an input sequence. We can only reconstruct all but the first frame of an input sequence since the reconstruction of one frame requires the features from its previous frame. The second term is the MSE between the predicted and the observed latent code, averaged over the number of elements in the latent code. The last term is an L2 regularization term whose weight is kept at 0.001. The model can be trained end-to-end but to bootstrap the encoder with useful features, we first focus on training the autoencoder for reconstruction until the training loss for reconstruction converges. Then we focus on finetuning the weights of the motion model to use the motion information better.
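For reference, an objective with the three terms described above could be written as follows. This is a sketch under assumptions: the symbols $N$ (frames per input sequence), $P$ (pixels per frame), $M$ (elements in the latent code), $T$ (prediction horizon), $W$ (network weights), the term weights $\lambda_r$ and $\lambda_p$, and the exact normalization are chosen here for illustration. Only the regularization weight $\lambda = 0.001$ is taken from the text.

```latex
\mathcal{L} \;=\;
\lambda_r \,\frac{1}{(N-1)\,P} \sum_{t=2}^{N} \lVert x_t - \hat{x}_t \rVert_2^2
\;+\; \lambda_p \,\frac{1}{M} \lVert z_{N+T} - \hat{z}_{N+T} \rVert_2^2
\;+\; \lambda \,\lVert W \rVert_2^2
```

Here $x_t$ and $\hat{x}_t$ denote an input frame and its reconstruction, and $z_{N+T}$ and $\hat{z}_{N+T}$ the observed and predicted latent code of the future frame. The sum starts at $t = 2$ because the first frame has no previous frame to draw features from. Under this notation, the two training phases would correspond to emphasizing $\lambda_r$ first and $\lambda_p$ afterwards.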

3.4 Inference

At inference time, we discard the decoder and use the difference between predicted and actual latent code as the metric to determine whether a frame is an anomaly or not. The underlying assumption is that the model can predict the latent code for the normal frames with high accuracy but is not able to do so for anomalous frames. To measure the distance between latent codes, we apply two different metrics: Mean Squared Error (MSE) Eq. 2a and cosine-distance Eq. 2b.
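The two distance measures can be sketched as below. The definitions are the standard ones (mean of squared element-wise differences, and one minus the cosine similarity) and are assumed to correspond to Eq. 2a and Eq. 2b; latent codes are represented as flattened lists, and the function names are hypothetical.

```python
import math


def latent_mse(predicted, observed):
    """Mean squared error over all elements of two flattened latent codes."""
    if len(predicted) != len(observed):
        raise ValueError("latent codes must have the same number of elements")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def latent_cosine_distance(predicted, observed, eps=1e-12):
    """One minus the cosine similarity of two flattened latent codes."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    return 1.0 - dot / max(norm_p * norm_o, eps)


z_observed = [1.0, 0.0, 2.0, 0.0]
z_good = [1.0, 0.0, 2.0, 0.0]  # accurate prediction (normal frame)
z_bad = [0.0, 2.0, 0.0, 1.0]   # poor prediction (anomalous frame)

print(latent_mse(z_good, z_observed), latent_cosine_distance(z_good, z_observed))
print(latent_mse(z_bad, z_observed), latent_cosine_distance(z_bad, z_observed))
```

The cosine distance ignores the overall magnitude of the latent code and only compares direction, whereas the MSE is sensitive to both.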


After calculating the anomaly score for each frame, following [21], we normalize the score for each frameset to the range of [0,1] using Eq. 3. A frame whose anomaly score is higher than a threshold is considered an anomaly. Depending on the dataset, a frameset contains all frames of a video or just the frames in a sliding window for long videos.
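A minimal sketch of this normalization and thresholding step is shown below. It assumes that Eq. 3 is a min-max normalization over the frameset, which is the usual way to map scores to [0,1]. The threshold value of 0.5 and the scores are made up for illustration.

```python
def normalize_scores(scores):
    """Min-max normalize the anomaly scores of one frameset to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All frames have the same score; avoid dividing by zero.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def flag_anomalies(scores, threshold):
    """Mark frames whose normalized score exceeds the threshold."""
    return [s > threshold for s in normalize_scores(scores)]


raw_scores = [0.02, 0.03, 0.02, 0.10, 0.12, 0.03]  # made-up per-frame scores
normalized = normalize_scores(raw_scores)
flags = flag_anomalies(raw_scores, threshold=0.5)
print(normalized)
print(flags)
```

Because the normalization is relative to the frameset, a video without any anomaly still has a frame with score 1. This is one reason the choice of frameset (whole video or sliding window) matters.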


4 Experiments

In this section, we compare our methods to state-of-the-art approaches on public benchmarks. We do not only focus on detection accuracy but also compare the computational cost of the models since this is often the bottleneck that limits the performance in the real world. In addition, we also evaluate the proposed anomaly detector on multiple distorted environments and show that our method is more robust against changes in lighting and weather that are common in the real world.

4.1 Experimental Setup

To evaluate the effectiveness of our proposed methods, we use the same datasets as [15], including the UCSD Pedestrian dataset [20], the CUHK Avenue dataset [16] and the ShanghaiTech dataset [15]. The details of these datasets are described by [15]. As is common, we report the Area Under Curve (AUC) score as the accuracy metric.
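For readers unfamiliar with the metric, the frame-level AUC can be computed directly from per-frame scores and ground-truth labels. The sketch below uses the pairwise (Mann-Whitney) formulation of the ROC AUC. The scores and labels are made up, and the quadratic loop is chosen for clarity, not speed.

```python
def frame_level_auc(scores, labels):
    """AUC as the probability that an anomalous frame outscores a normal one.

    Ties count as half a win (the Mann-Whitney U formulation of the ROC AUC).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one normal and one anomalous frame")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


frame_scores = [0.1, 0.2, 0.8, 0.7, 0.3, 0.9]  # made-up normalized anomaly scores
frame_labels = [0, 0, 1, 0, 1, 1]              # 1 = anomalous frame (ground truth)
auc = frame_level_auc(frame_scores, frame_labels)
print(auc)
```

An AUC of 1.0 means every anomalous frame scores higher than every normal frame, and 0.5 corresponds to random ranking. The metric does not depend on a particular threshold.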

The experimental settings are shown in Table 1. We use 8 input frames for UCSDPed1 and UCSDPed2 datasets and 6 input frames for the Avenue and ShanghaiTech datasets. More frames are needed to encode the motion and appearance of the much smaller objects in the UCSD Pedestrian datasets than in the Avenue and ShanghaiTech datasets. The resolution of the frames is reduced using bilinear interpolation, keeping the original aspect ratio. As for the architecture, we use five encoder blocks for the UCSDPed1 dataset and four encoder blocks for the other datasets. This is done because the scenes in the UCSDPed1 dataset include more objects and we need to increase the model’s capacity in order to encode the appearance- and motion-related features.

                 UCSDPed1   UCSDPed2   Avenue     ShanghaiTech
Input Size       128x192    128x192    128x224    128x224
Num Input        8          8          6          6
Encoder Block    5          4          4          4
Table 1: Design choices for each evaluation dataset.

We train the model for 50 epochs in an end-to-end fashion with an initial learning rate that decays by a factor of 0.1 every 20 epochs. The ShanghaiTech dataset contains data from multiple cameras. We trained individual models per camera and observed no significant performance difference with a model that is trained on data from all cameras. We train the model with the Adam optimizer [12] for all our experiments. The code will be released on website.

For the evaluation, we use feature-wise MSE (Eq. 2a) to calculate the anomaly score for the Avenue dataset and feature-wise cosine distance (Eq. 2b) for all other datasets. We normalize the anomaly score in the UCSDPed1 and UCSDPed2 datasets with Eq. 3 using all the frames in a test video. For the ShanghaiTech dataset, we use the same sliding-window approach as [1].

4.2 Results

Table 2 compares our results with those of other unsupervised deep learning based methods for anomaly detection. Our approach achieves the highest frame-level AUC score on the UCSD Ped1, Avenue and ShanghaiTech datasets and is competitive on UCSD Ped2, where FFP-MC [15] and LatentAuto [1] reach 95.4 against 95.1 for our method. The decoupled mechanism and the combined learning of appearance and motion information improve the training process and allow the extracted and predicted latent codes to be more representative of the frequently seen events, which improves the anomaly detection accuracy. We further investigate the role of decoupling, the use of Conv3D layers in the motion model as well as different anomaly metrics in section 6.

Method           UCSD Ped1    UCSD Ped2    Avenue       ShanghaiTech
MDT [19]         81.8         82.9         -            -
ConvAE [6]       81.0         90.0         70.2         -
ConvLSTM [17]    75.5         88.1         77.0         -
Unmasking [11]   68.4         82.2         80.6         -
Hinami [7]       -            92.2         -            -
StackRNN [18]    -            92.2         81.7         -
FFP-MC [15]      83.1         95.4         84.9         72.8
LatentAuto [1]   -            95.4         -            72.5
ours             84.7 ± 0.3   95.1 ± 0.3   88.8 ± 0.3   74.2 ± 0.1
Table 2: Frame-level AUC score with 95% confidence interval (4 runs) on UCSDPed1, UCSDPed2, Avenue and ShanghaiTech datasets. We outperform most of the existing approaches on all the datasets.

The initial goal of our approach was to develop an architecture that is more efficient than previous approaches. We benchmark our method against the approaches that have publicly available code on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz with a GeForce GTX 1080 Ti. We reimplemented the methods described by [6] in Tensorflow to allow for a fair comparison (the benchmark code is also available at website). The results are displayed in Table 3. Our method processes more frames per second than the other approaches on every dataset. Compared to the Latent-Auto approach [1] that also detects anomalies using latent code, we can process 16 to 45 times more frames per second. The main reason our method is more efficient is that the model independently encodes appearance and motion information. We extract the appearance information using a relatively efficient 2D convolutional network and process the combined features using a small 3D convolutional network, whereas other approaches build the entire network around 3D convolutions, making it much more expensive. Because we extract latent codes from individual frames, each frame only needs to be processed once. We can save the most recent latent codes that are needed in the motion model and re-use them for the next predictions. In contrast, models that use 3D convolutions process each frame once for every position it occupies in the input stack, each time predicting a different frame, which is much more computationally expensive. Since we predict future latent codes and use the prediction error in latent space as our anomaly metric, we do not need the decoder part at inference, again reducing the computational cost. Also, compared to other models that use anomaly scoring metrics based on latent codes, we do not impose any distribution constraint on the latent code, giving the model the freedom to fit the data as best as possible.
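The re-use of latent codes described above can be sketched with a fixed-length buffer. The class `LatentCache` and the stand-in `toy_encoder` are hypothetical names for illustration; the real encoder is the 2D convolutional network, and the window length of 6 is one of the settings from Table 1.

```python
from collections import deque


class LatentCache:
    """Keeps the most recent latent codes so each frame is encoded only once."""

    def __init__(self, encoder, num_inputs):
        self.encoder = encoder
        self.codes = deque(maxlen=num_inputs)  # oldest code is dropped automatically

    def push(self, frame):
        """Encode one new frame; return the stacked codes once the window is full."""
        self.codes.append(self.encoder(frame))
        if len(self.codes) < self.codes.maxlen:
            return None
        return list(self.codes)


calls = {"count": 0}


def toy_encoder(frame):
    # Stand-in for the 2D encoder; it only counts how often it is invoked.
    calls["count"] += 1
    return ("code", frame)


cache = LatentCache(toy_encoder, num_inputs=6)
windows = [w for w in (cache.push(f) for f in range(20)) if w is not None]
print(len(windows), calls["count"])  # 15 windows from 20 encoder calls
```

Encoding every window from scratch would need 15 × 6 = 90 encoder passes for the same 20 frames, which is the redundancy that stacking raw frames into a 3D network incurs.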

               ConvAE [6]   FFP+MC [15]   LatentAuto [1]   Ours
UCSDPed1       75           63            2                81
UCSDPed2       75           63            2                90
Avenue         71           48            5                77
ShanghaiTech   71           48            5                77
Table 3: FPS for different methods. Our method is more efficient than the existing approaches.

A disadvantage of working with latent codes is that it is harder to interpret the model. It is however possible to also use the decoder at inference time to generate predicted frames and to measure the pixel wise reconstruction metrics. This allows us to localize the part of the frame that contains the anomaly. We show these results in Figure 2. The red boxes in each frame are the regions that have the prediction error larger than a threshold. The green boxes show the ground-truth annotations for the Avenue and ShanghaiTech dataset (the UCSD pedestrian dataset only has frame-level labels). These results empirically confirm that our model can detect motion- and appearance-related anomalies, such as the skater, cyclist, car, running and gymnastics events (first four columns). The last two columns of Figure 2 show false positives, frames that were labeled as normal but that were flagged as anomalies by our model. Two of these show noise or camera movements that were not seen during training. It also shows that the model is more likely to incorrectly flag objects as anomalies if they are closer to the camera.

Figure 2: True positive (first four columns) and false positive (last two columns) detections of our framework. Examples are selected from the UCSD pedestrian dataset (first row), Avenue dataset (second row) and ShanghaiTech dataset (last row). # indicates the testing video index. The red boxes are the regions that have the highest pixel-wise prediction error and the green boxes are the ground truth bounding boxes for the anomalous event. We can successfully detect the motion- and appearance-related anomalies and tend to incorrectly flag objects as anomalies if they are closer to the camera. The figure is best viewed in color.

5 Robustness of the model

The datasets we used in the previous section are commonly used datasets that allow us to compare anomaly detection techniques quantitatively. They however all contain relatively clean data, recorded at similar times during the day and under clear weather conditions. These datasets are therefore not necessarily representative of real world surveillance footage where external factors such as weather and time of day will severely influence the performance of the model. We argue that in addition to their anomaly detection performance and computational cost, we should also compare the robustness and generalization of the models to these external factors. In this section we investigate two types of robustness and show that by working with latent code anomaly metrics we are more robust than other approaches that use pixel-wise metrics.

5.1 Modelling long term temporal information

In section 4.2, we showed that our approach is by design much more efficient than existing techniques. To reduce the computational cost even further we could reduce the number of times we activate the model. In surveillance video, anomalies are typically in view of the camera for multiple seconds. It should be enough to process only a few frames of this window to detect the anomaly. If instead of running our model every 40 ms, we run it every 200 ms, then this obviously results in a lower total computational cost but this also makes the task much harder for the network since we now need to predict five times further into the future. In this way, the model is forced to encode longer term temporal information and the prediction task is more challenging since the future frame will differ substantially from the previous frames.

To explore this trade-off, we subsample the video sequence and only keep every n-th frame in our training and test data. Figure 3 shows how sensitive the latent code metric (red line) is compared to the pixel-wise metric (green line). Both techniques follow the same trend but the latent code metric consistently performs better than the pixel-wise metric, especially when the gap between input frames becomes large. This illustrates the power of latent codes and their capability of modelling longer term temporal information.
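The subsampling itself is a simple stride over the frame sequence. The sketch below uses integer frame indices as stand-ins for frames, and the helper names are hypothetical. The 40 ms and 200 ms intervals are the ones mentioned in the text.

```python
def subsample(frames, step):
    """Keep every `step`-th frame, starting with the first one."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return frames[::step]


def effective_interval_ms(frame_interval_ms, step):
    """Time between two kept frames."""
    return frame_interval_ms * step


frames = list(range(25))  # one second of video at 25 fps (40 ms per frame)
kept = subsample(frames, step=5)
print(kept)                               # [0, 5, 10, 15, 20]
print(effective_interval_ms(40, step=5))  # 200
```

With a step of 5 the model is invoked a fifth as often, at the price of a prediction target that lies five original frames further away.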

Figure 3: The anomaly detection accuracy (frame-level AUC score with 95% confidence interval) when we have low fps input. z-mse is calculated using Eq. 2a on the latent code and p-mse is the MSE between the predicted frames and actual frames. The figure is best viewed in color.

5.2 Generalization to other lighting conditions

The performance of anomaly detection in surveillance video is impacted severely by factors such as varying illumination, multiple weather conditions, on- and off-peak traffic profiles, degradation of the camera and so on. Therefore, in this section, we investigate the robustness of our proposed method to these distortions. We train a model using original frames from the Avenue dataset and then analyze the performance on distorted test set frames. We adjust brightness, blur the image and add rain to the test frames using the Automold toolkit. Figure 4 shows some examples of the distorted frames using different levels of rain and brightness.

Figure 4: The distorted frames using Avenue dataset. (a) original frame, (b) and (c) have heavy rain with brightness degree 0.5 and 0.7 respectively and (d),(e) and (f) show torrential rain with the brightness 0.6, 0.8 and 1.0 respectively. The figure is best viewed in color.

Figure 5 shows the frame-level anomaly detection accuracy for different distortion levels. The X-axis shows the relative brightness compared to the original frame. The different curves show the performance of using Mean Squared Error (MSE) in pixel space (p) and in latent space (z) as our anomaly metrics for different levels of rain added to the image. As expected the model performance drops when the brightness decreases, but our model is consistently more robust than the baseline model that uses pixel-wise metrics. Adding rain to the test frames also reduces the detection performance but again, our feature-wise latent code MSE performs significantly better as an anomaly scoring metric than the pixel-wise MSE. These results verify that our proposed method, which uses feature-wise MSE in the latent code to identify anomalies, is more robust to different outdoor situations than pixel-wise MSE measurements.

Figure 5: The averaged anomaly detection accuracy (frame-level AUC score) on the augmented frames where z means the latent code feature-wise MSE and p means the prediction pixel-wise MSE. The latent code anomalous score measurement is more robust on unseen weather conditions compared to other approaches. The figure is best viewed in color.

6 Ablation study

In section 3 we introduced our model together with some design choices. In this section we look more closely at three of these and investigate how they contribute to our results.

6.1 Reconstruction vs prediction

To understand the impact of the anomaly detection metrics on the detection accuracy, we report the frame level AUC score using pixel-wise reconstruction error, pixel-wise prediction error and feature-wise latent code error in Table 4. Adding the motion model highly improves the anomaly detection accuracy for all the datasets since it encodes spatiotemporal information better. Compared to the performance using pixel-wise prediction error, the use of latent code prediction error tends to be better and more stable since it is more robust to the noise in the image as indicated by section 5. To qualitatively understand how the model differentiates between normal and abnormal frames, we conduct an experiment on the moving-mnist dataset.

We train the same model as shown in Fig. 1 using the video sequences that are created by letting randomly selected digits 4 and 7 from the training set of MNIST [14] move horizontally or vertically with a speed of 2 following [18] (see Fig. 6 (a)). This model is then tested on video sequences that include all types of digits from the test set of MNIST dataset and two new shapes (circle and square) that are moving also horizontally or vertically with a speed of 2 or speed of 4. The input, reconstruction, prediction and prediction error are shown in Fig. 6 (b) and (c).

Fig. 6 (b) shows the model output for input objects that move with a speed of 2. The model can make good reconstructions and predictions for the already seen digits 4 and 7 but tends to predict the unseen digits to be one of the already seen digits. For example, it predicts the circle and square to be similar to 4 and digit 8 to be similar to 7. The difference between the reconstruction and prediction indicates that the model cannot make a good prediction of the latent code for the unseen digits and this allows us to detect the appearance related anomalies. For the objects that are moving with a higher speed as shown in Fig. 6 (c), the model produces a prediction that falls behind the actual input. This is because the designed motion model can further exploit the speed mismatch of the objects during training and testing and is thus able to detect the motion related anomalies.

Figure 6: Experimental results on moving-mnist dataset. (a) training digits 4 and 7 are moving horizontally or vertically randomly with speed 2. (b) and (c) show the testing digits and shapes that are moving in a similar fashion but with speed 2 (b) and speed 4 (c). From left to right, the columns in (b) and (c) are input, reconstruction, prediction and prediction error. The model cannot accurately reconstruct and predict for both appearance-related (unseen digits) and motion-related (unseen moving speed) anomalies.

Motion model   Anomaly metric         Ped1         Ped2         Avenue       ShanghaiTech
Conv3D         Reconstruction error   80.6 ± 0.2   91.3 ± 0.5   83.7 ± 0.3   55.8 ± 0.2
Conv3D         Prediction error       82.2 ± 0.4   90.9 ± 0.5   89.2 ± 0.2   72.7 ± 0.2
Conv3D         Latent code error      84.7 ± 0.3   95.1 ± 0.3   88.8 ± 0.3   74.2 ± 0.1
ConvLSTM       Reconstruction error   74.8 ± 0.4   87.5 ± 0.1   83.1 ± 0.2   52.6 ± 0.4
ConvLSTM       Prediction error       81.2 ± 0.4   90.3 ± 0.8   89.0 ± 0.3   70.2 ± 0.2
ConvLSTM       Latent code error      82.8 ± 0.6   94.1 ± 0.1   88.9 ± 0.3   71.3 ± 0.4
Table 4: Abnormal event detection results (in %) with 95% confidence intervals. The reconstruction error and prediction error are pixel-wise MSE, and the latent code error is calculated feature-wise. Using the latent code prediction error as the anomaly score and Conv3D layers in the motion model gives the best performance on Ped1, Ped2 and ShanghaiTech; on Avenue the pixel-wise prediction error is marginally higher and the two motion models perform on par.
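
The three scores compared in Table 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the source only says that the latent code error is "calculated feature-wise", so the example assumes a mean squared error over the latent features, and the names `mse`, `flatten` and `anomaly_scores` as well as the toy values are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def flatten(frame):
    """Turn a frame given as a list of rows into one flat list of pixel values."""
    return [value for row in frame for value in row]

def anomaly_scores(frame, reconstruction, predicted_frame, latent, predicted_latent):
    """Compute the two pixel-wise errors and the (assumed MSE) latent code error."""
    return {
        "reconstruction_error": mse(flatten(frame), flatten(reconstruction)),
        "prediction_error": mse(flatten(frame), flatten(predicted_frame)),
        "latent_code_error": mse(latent, predicted_latent),
    }

# Toy values: a perfect reconstruction, a slightly wrong pixel prediction,
# and a latent prediction that misses one feature.
frame = [[0.0, 1.0], [1.0, 0.0]]
scores = anomaly_scores(
    frame=frame,
    reconstruction=[[0.0, 1.0], [1.0, 0.0]],
    predicted_frame=[[0.0, 0.5], [1.0, 0.0]],
    latent=[1.0, 2.0],
    predicted_latent=[1.0, 0.0],
)
print(scores)
```

The pixel-wise errors compare full frames, whereas the latent code error compares only the much smaller feature vectors, so it does not require decoding a frame at test time in this sketch.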

6.2 Design of the motion model

The motion model is an important factor in the design of our architecture since it is required to encode the typical appearance and motion information of the frequently seen events. To study the impact of its design on anomaly detection performance, we also replaced the Conv3D layers in the motion model with ConvLSTM layers. The anomaly detection accuracy using the pixel-wise reconstruction error, the pixel-wise prediction error and the feature-wise latent code error on the different datasets is shown in Table 4. We achieved better or similar performance on all datasets using Conv3D layers in the motion model. One possible reason is that the Conv3D layers fit the data better and thus extract more representative features. In addition, we observed that the model with ConvLSTM layers produces delayed detections: it fails to detect the beginning of anomalous events and reports more false alarms after they end, due to its slow response.

7 Conclusion and future work

In this paper we introduced a novel architecture that is able to detect anomalies in real-world surveillance footage using only unsupervised training. The model consists of two parts: the first part extracts appearance features from individual frames, and the second part uses these features to predict the latent code of a future frame. In contrast to previous work, our model uses a prediction in latent space as the metric to detect anomalies. We showed that it outperforms other techniques that use reconstruction or pixel-based prediction metrics. Because of the decoupled appearance and motion feature learning, our model is also much more efficient than related approaches. Where other techniques use expensive 3D convolutions to analyze a stack of frames, we process each frame individually and then combine the information with a much smaller 3D convolutional model. This allows us to process 16 to 45 times more frames with the same computational budget. Finally, we showed that using latent space features makes the model more robust against distortions such as changing lighting or weather conditions.
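
The decoupling summarized above can be sketched as a streaming loop in which every frame is encoded once and only the cached latent codes are passed to the motion model. Both components below are hypothetical stand-ins chosen to keep the example runnable: `encode` averages each row instead of running a convolutional encoder, and `predict_latent` extrapolates linearly from the last two codes instead of applying 3D convolutions over the window.

```python
from collections import deque

def encode(frame):
    """Hypothetical appearance encoder: one mean value per row as the latent code."""
    return [sum(row) / len(row) for row in frame]

def predict_latent(history):
    """Hypothetical motion model: linear extrapolation from the last two latent codes."""
    prev, last = history[-2], history[-1]
    return [2 * l - p for p, l in zip(prev, last)]

def score_stream(frames, window=4):
    """Yield (frame_index, latent prediction error); each frame is encoded exactly once."""
    history = deque(maxlen=window)  # cache of the most recent latent codes
    for index, frame in enumerate(frames):
        latent = encode(frame)  # per-frame work, never repeated for later windows
        if len(history) >= 2:
            predicted = predict_latent(history)
            error = sum((a - b) ** 2 for a, b in zip(latent, predicted)) / len(latent)
            yield index, error
        history.append(latent)

# Brightness rises steadily and then jumps, which the toy predictor cannot anticipate.
frames = [[[v] * 2 for _ in range(2)] for v in (0.0, 1.0, 2.0, 3.0, 10.0)]
results = list(score_stream(frames))
print(results)  # [(2, 0.0), (3, 0.0), (4, 36.0)]
```

Because the latent codes are cached, the per-frame cost is one encoder pass plus one small predictor call, regardless of how many frames the window spans.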

Anomaly detection in real world surveillance data is a very challenging topic with many useful applications. For future work, we argue that more research is needed to deal with changing environments, weather and lighting conditions as well as with camera degradation.


This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme, and from imec under the CityFlows AAA programme.


  • [1] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara (2019) Latent space autoregression for novelty detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2.3, §3.2, §4.1, §4.2, Table 2, Table 3.
  • [2] S. Bouindour, H. Snoussi, M. M. Hittawe, N. Tazi, and T. Wang (2019) An on-line and adaptive method for detecting abnormal events in videos using spatio-temporal convnet. Applied Sciences 9 (4). ISSN 2076-3417. Cited by: §2.3.
  • [3] V. Chandola, A. Banerjee, and V. Kumar (2009-07) Anomaly detection: a survey. ACM Comput. Surv. 41 (3), pp. 15:1–15:58. ISSN 0360-0300. Cited by: §1.
  • [4] Y. S. Chong and Y. H. Tay (2017) Abnormal event detection in videos using spatiotemporal autoencoder. In Advances in Neural Networks - ISNN 2017, F. Cong, A. Leung, and Q. Wei (Eds.), Cham, pp. 189–196. ISBN 978-3-319-59081-3. Cited by: §2.1.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §2.2.
  • [6] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 733–742. Cited by: §2.1, §3.2, §4.2, Table 2, Table 3.
  • [7] R. Hinami, T. Mei, and S. Satoh (2017) Joint detection and recounting of abnormal events by learning deep generic knowledge. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3619–3627. Cited by: Table 2.
  • [8] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017-07) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.2.
  • [9] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. Cited by: §3.2.
  • [10] R. T. Ionescu, F. S. Khan, M. Georgescu, and L. Shao (2019-06) Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.3.
  • [11] R. T. Ionescu, S. Smeureanu, B. Alexe, and M. Popescu (2017) Unmasking the abnormal events in video. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2914–2922. Cited by: Table 2.
  • [12] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Cited by: §4.1.
  • [13] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 1558–1566. Cited by: §2.3.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998-11) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. ISSN 1558-2256. Cited by: §6.1.
  • [15] W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection – a new baseline. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.2, §3.2, §4.1, Table 2, Table 3.
  • [16] C. Lu, J. Shi, and J. Jia (2013) Abnormal event detection at 150 fps in matlab. 2013 IEEE International Conference on Computer Vision, pp. 2720–2727. Cited by: §4.1.
  • [17] W. Luo, W. Liu, and S. Gao (2017-07) Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439–444. Cited by: §2.1, Table 2.
  • [18] W. Luo, W. Liu, and S. Gao (2017-10) A revisit of sparse coding based anomaly detection in stacked rnn framework. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 341–349. Cited by: Table 2, §6.1.
  • [19] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos (2010-06) Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1975–1981. Cited by: Table 2.
  • [20] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos (2010) Anomaly detection in crowded scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1975–1981. Cited by: §4.1.
  • [21] M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. CoRR abs/1511.05440. Cited by: §2.3, §3.2, §3.4.
  • [22] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh (2019) Learning regularity in skeleton trajectories for anomaly detection in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11996–12004. Cited by: §1, §2.2.
  • [23] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: available on arXiv:1505.04597 [cs.CV]. Cited by: §3.1.
  • [24] J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117. ISSN 0893-6080. Cited by: §2.1, §2.
  • [25] X. Song, M. Wu, C. Jermaine, S. Ranka, et al. (2007) Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 19 (5), pp. 631–645. Cited by: §1.
  • [26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, Washington, DC, USA, pp. 4489–4497. ISBN 978-1-4673-8391-2. Cited by: §2.1, §3.2.
  • [27] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 1096–1103. ISBN 978-1-60558-205-4. Cited by: §2.1.
  • [28] D. Xu, Y. Yan, E. Ricci, and N. Sebe (2017) Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding 156, pp. 117–127. Note: Image and Video Understanding in Big Data. ISSN 1077-3142. Cited by: §2.1.
  • [29] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §2.3, §3.2.