Attentioned Convolutional LSTM Inpainting Network for Anomaly Detection in Videos

by Itamar Ben-Ari, et al.

We propose a semi-supervised model for detecting anomalies in videos, inspired by the Video Pixel Network [Kalchbrenner et al., 2016]. The VPN is a probabilistic generative model based on a deep neural network that estimates the discrete joint distribution of raw pixels in video frames. Our model extends the Convolutional-LSTM video encoder part of the VPN with a novel convolution-based attention mechanism. We also modify the Pixel-CNN decoder part of the VPN into a frame-inpainting task, where a partially masked version of the frame to predict is given as input. The frame reconstruction error is used as an anomaly indicator. We test our model on a modified version of the Moving MNIST dataset [Srivastava et al., 2015] and show it to be effective in detecting anomalies in videos. This approach could be a component in applications requiring visual common sense.



1 Introduction

Real-time anomaly detection in videos has significant value across many domains, such as robot patrolling [Chakravarty et al., 2007] and visual inspection of manufacturing processes. The task remains challenging due to the complexity and variability of the data and the high computational cost in an edge-device setting. Current approaches range from supervised models based on Convolutional Neural Network (CNN) architectures [Sabokrou et al., 2018], through long-term temporal dynamic models such as Recurrent Neural Networks (RNNs) [Radford et al., 2018], to unsupervised models for video feature learning [Zhang et al., 2016, Pham et al., 2011, Zhao et al., 2011].

In this paper we propose an encoder-decoder network whose input is a sequence of frames with the last frame partially masked and whose output is a reconstruction of that frame. We use the reconstruction error as an indicator of an anomaly in the sequence: we assume the model will reconstruct a masked pixel with a typical value, in accordance with that pixel's spatial and temporal context, so a pixel containing an unexpected, out-of-context value will be poorly reconstructed and indicate an anomaly. Our model is inspired by the Video Pixel Network (VPN) [Kalchbrenner et al., 2016], with two main differences:

1. We add a convolution-based attention mechanism in which the filter weights are dynamic and input-dependent. This mechanism exploits the local structure of images better than a standard global weighting attention mechanism.

2. A partially masked version of the frame to predict is given as input to the model (see figure 1(b)). This eliminates the need for masked convolutions and lets the predicted distributions of all pixels be computed in parallel. The model can also use information from the unmasked neighborhood of a predicted pixel, which makes the prediction task tractable.

With these modifications, our model is able to find anomalies in videos in an unsupervised manner and in real time.

2 Related work

2.1 Anomaly Detection

Unsupervised and semi-supervised video anomaly detection models can be classified into three main categories:

1. Representation learning for reconstruction: encoder-decoder methods that transform the input into a hidden representation and then try to reconstruct it. Anomalies show up as poorly reconstructed deviations from the source. Principal Component Analysis (PCA) and Auto-Encoders (AEs) are examples of such models.

2. Predictive modeling: the sequence of frames is viewed as a time series and the model's task is to predict the distribution of the next frame's pixel values. Anomalies show up as pixels with low likelihood values. Auto-regressive models and Convolutional-LSTMs are examples of such models.

3. Generative models: e.g., Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs), which can compute a measure of frame abnormality.
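The reconstruction-based category (1) can be illustrated with the simplest model it names, PCA: fit a low-dimensional subspace on normal data, then score new samples by how badly the subspace reconstructs them. This is a minimal sketch, not the paper's method; the function name and shapes are ours.

```python
import numpy as np

def pca_anomaly_scores(train, test, n_components=2):
    """Score samples by PCA reconstruction error (category 1 above).

    train, test: (n_samples, n_features) arrays of flattened frames.
    Rows of `test` far from the training subspace get high scores.
    """
    mean = train.mean(axis=0)
    # Principal axes from the SVD of the centred training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:n_components]                   # (k, n_features)
    coords = (test - mean) @ basis.T            # project onto the subspace
    recon = coords @ basis + mean               # reconstruct
    return np.linalg.norm(test - recon, axis=1) # per-sample error
```

A frame lying in the normal subspace reconstructs almost exactly (score near zero), while an off-subspace frame keeps a large residual.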

2.2 Video Pixel Network

The VPN is a frame-predictive model shown to give state-of-the-art results on the Moving MNIST and robotic pushing datasets [Finn et al., 2016]. The VPN architecture consists of two parts: a resolution-preserving CNN encoder and a Pixel-CNN decoder [van den Oord et al., 2016]. The CNN encoder output is aggregated over time by a Convolutional-LSTM in order to capture temporal dependencies. The Pixel-CNN decoder uses masked convolutions to model space and color dependencies in the predicted frame, by allowing information to flow from previously predicted pixels to the pixel currently being predicted. The last layer of the Pixel-CNN decoder is a softmax over 256 intensity values for each color channel of each pixel.
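The masked convolutions mentioned above can be sketched as a kernel mask that zeroes out the centre pixel and everything after it in raster order (the standard PixelCNN construction; the helper name is ours):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """k x k PixelCNN-style convolution mask (top-bottom, left-right order).

    Type 'A' also blocks the centre position, so a pixel never sees its
    own label; type 'B' (used in deeper layers) allows the centre.
    """
    m = np.zeros((k, k), dtype=np.float32)
    c = k // 2
    m[:c, :] = 1.0      # all rows strictly above the centre
    m[c, :c] = 1.0      # pixels left of the centre in the centre row
    if mask_type == "B":
        m[c, c] = 1.0
    return m
```

Multiplying a convolution kernel elementwise by this mask before applying it is what restricts the information flow to previously predicted pixels.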

3 Our model

3.1 Convolutional based attention mechanism

The relevant context window for frame prediction may vary in size and in the distribution of frame importance. An attention mechanism is a popular tool for overcoming the memory limitations of recurrent models and focusing on the relevant parts of a context window. Since current attention mechanisms do not leverage the local structure of images, we propose using a convolution with input-dependent filter weights to obtain an attention-like mechanism [Shen et al., 2017]. We use a small meta-network to output context-sensitive convolution filters, which are then applied to a tensor of concatenated Convolutional-LSTM outputs (representing the context window). The Convolutional-LSTM and convolutional attention output tensors preserve the spatial dimensions and local structure of the video frames. This allows us to concatenate the partially masked frame as an additional channel of the attention output tensor and forward it to the inpainting network for reconstruction (see figure 1(a)).
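A toy version of this idea can be written in a few lines: a "meta-network" (here just a fixed random linear map, purely for illustration; the paper's meta-network architecture is not specified) turns a summary of the context into a k x k filter, which is then convolved over a feature map. Different contexts therefore yield different filters.

```python
import numpy as np

def dynamic_filter(context, feat, k=3):
    """Input-dependent convolution, assumed form (illustrative only).

    context: any array summarising the context window.
    feat:    (H, W) single-channel feature map.
    Returns the filtered map (zero-padded, same size) and the filter.
    """
    rng = np.random.default_rng(0)
    # Stand-in meta-network: a fixed linear map from context to weights.
    meta_w = rng.normal(size=(k * k, context.size)) * 0.1
    filt = (meta_w @ context.ravel()).reshape(k, k)  # context-dependent filter
    pad = k // 2
    padded = np.pad(feat, pad)
    out = np.zeros_like(feat, dtype=float)
    for i in range(feat.shape[0]):
        for j in range(feat.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * filt)
    return out, filt
```

In a real model the meta-network would be learned end-to-end, and the convolution applied to the concatenated Convolutional-LSTM outputs rather than a single-channel map.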


3.2 Convolutions with masked frames for image inpainting

In the VPN model the frame to be predicted is given as input during the training phase. The Pixel-CNN decoder uses masked convolutions to ensure a predicted pixel does not "see" its label (i.e., its true value). A masked convolution only uses information from pixels preceding the predicted pixel (in top-bottom, left-right pixel order), enabling the network to model some of the spatial dependencies in the predicted frame. At inference time the pixels are predicted sequentially. In our anomaly detection reconstruction approach the frame to be predicted is also given as input, but partially masked, blocking the flow of information from a label to a masked pixel. The spatial dependencies of a pixel are modeled using information from the non-masked pixels in its neighborhood. We use a grid mask with random shifts in which 95% of the frame's pixels are masked (see figure 1(b)). This way the model learns the general structure of the frame and must rely on temporal dependencies. At inference time the same procedure is applied, so the pixels are predicted in parallel, resulting in real-time detection.
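A shifted grid mask of this kind can be generated as follows. The exact pattern in the paper is not specified, so this sketch assumes one visible pixel per stride x stride cell, which masks 1 - 1/stride^2 of the frame (~94% for stride=4, close to the paper's 95%):

```python
import numpy as np

def grid_mask(h, w, stride=4, rng=None):
    """Grid mask with a random shift (assumed form of the paper's mask).

    Keeps one visible pixel per stride x stride cell; the random shift
    moves the whole grid so the network cannot memorise fixed positions.
    Returns a boolean (h, w) array where True = masked.
    """
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(0, stride, size=2)  # random shift of the grid
    mask = np.ones((h, w), dtype=bool)
    mask[dy::stride, dx::stride] = False      # unmask the grid points
    return mask
```

Because the visible pixels form a regular lattice, every masked pixel has an unmasked neighbour within a couple of pixels, which is what makes local spatial inpainting feasible.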

3.3 Loss function as an anomaly measure

We use the negative log-likelihood of the pixel values under the network's predicted distribution as a loss function. The average pixel negative log-likelihood is used as a global score for frame abnormality, where we assume the pixels are independently distributed given the unmasked pixels and the context window frames. The loss is defined as:

$$\mathcal{L}(\theta) = -\sum_{i}\sum_{c} \log p_\theta\left(x_{t,i,c} \mid \tilde{x}_t, x_{<t}\right)$$

where $x_t$ are the pixels of the frame to reconstruct at time $t$, $\tilde{x}_t$ is the masked frame, $x_{<t}$ are all the frames prior to the $t$-th frame, $\theta$ are the network parameters, $x_{t,i,c}$ is the value of channel $c$ of pixel $i$ of frame $x_t$, and $p_\theta$ is the predicted distribution for that value.

In the training phase we train our network only on anomaly free videos. This way the network learns to predict a distribution for pixel values showing normal behavior, and will give low probability predictions for abnormal values in inference time. We use the log-likelihood as an anomaly measure where low likelihood pixel values indicate higher chance for these pixels to show an anomaly.
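Given the softmax over 256 intensity values per pixel described in section 2.2, the per-frame anomaly score is just the mean per-pixel negative log-likelihood. A minimal grayscale sketch (array shapes are our assumption):

```python
import numpy as np

def frame_anomaly_score(logits, frame):
    """Average per-pixel negative log-likelihood under the softmax output.

    logits: (H, W, 256) raw network outputs per pixel (grayscale sketch);
    frame:  (H, W) integer ground-truth pixel values in [0, 255].
    High scores mean the frame is poorly explained -> likely anomalous.
    """
    # Numerically stable log-softmax over the 256 intensity bins.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = frame.shape
    # Pick out the log-probability assigned to each true pixel value.
    nll = -log_p[np.arange(h)[:, None], np.arange(w)[None, :], frame]
    return nll.mean()
```

With all-zero logits (a uniform distribution) every pixel contributes log 256 nats, which is the score of a maximally uninformative model; confident, correct predictions drive the score toward zero.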

4 Experiments

We evaluate each contribution proposed in this paper, the convolution-based attention and the masked-frame reconstruction, on a modified version of the Moving MNIST dataset [Srivastava et al., 2015]. We show that our model can learn both the temporal and spatial aspects of the movies and automatically detect anomalies without explicit supervision. We compare against two baselines, the original VPN model and a Conv-LSTM network (which detects abnormal frames based on the reconstruction error [Medel and Savakis, 2016]), together with two variants of our model: the first omits the masked frame from the input and the second does not use attention.

Dataset - Moving MNIST is a common dataset consisting of two digits moving independently in a frame (potentially overlapping) with constant velocity. It consists of sequences of 20 frames of size 64x64. The training sequences are generated on the fly by sampling MNIST digits and generating trajectories with randomly sampled velocity and angle. The training set was downloaded from [Srivastava et al., 2015] and consists of 10,000 sequences. Our test set consists of both normal and corrupted sequences. To generate a corrupted sequence, we replace the last frame with the first frame and paint a "corruption" of black pixels on a digit (see figure 1). These corruptions cover two dimensions: temporal (changing the frame order) and spatial (painting the black square).
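The two corruptions can be reproduced in a few lines. This sketch paints the black square at a random location rather than specifically on a digit, which is a simplification of the procedure above:

```python
import numpy as np

def corrupt_sequence(seq, square=8, rng=None):
    """Apply the two test-set corruptions described above (sketch).

    seq: (T, H, W) uint8 video. Temporal corruption: the last frame is
    replaced by a copy of the first. Spatial corruption: a black
    square x square patch is painted at a random location in that frame.
    """
    rng = rng or np.random.default_rng()
    out = seq.copy()
    out[-1] = out[0]                         # temporal corruption
    h, w = out.shape[1:]
    y = rng.integers(0, h - square)
    x = rng.integers(0, w - square)
    out[-1, y:y + square, x:x + square] = 0  # spatial corruption
    return out
```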

Evaluation Metric - We use the Equal Error Rate (EER), the error rate at the operating point where the false-positive rate equals the false-negative rate, a standard metric in abnormal event detection.
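The EER can be computed by sweeping a threshold over the anomaly scores. This is a minimal sketch (on a finite sample the two rates may never be exactly equal, so it returns the error at the closest point):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER sketch: sweep thresholds over the anomaly scores and return
    the error rate where false-positive and false-negative rates are
    closest (they coincide there in the continuous limit).

    scores: higher = more anomalous. labels: 1 = anomalous, 0 = normal.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = labels == 1, labels == 0
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        pred = scores >= t                      # flagged as anomalies
        fpr = (pred & neg).sum() / neg.sum()    # normals falsely flagged
        fnr = (~pred & pos).sum() / pos.sum()   # anomalies missed
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

Perfectly separated scores give an EER of 0; scores carrying no information about the labels give an EER around 0.5.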

Results - The table below shows the EER for the different models tested. Our model outperforms both baseline models (VPN and Conv-LSTM) and the partial variants of our model, showing the importance of each contribution. Replacing the last frame with the first tests the models' ability to detect temporal anomalies in the sequence; for such anomalies the attention mechanism improves the model's ability to capture the abnormal frame-to-frame changes. The black-square corruption tests the model's ability to capture spatial dependencies; our masked-frame approach captures the dependencies between a masked pixel and its unmasked neighborhood, resulting in the reconstruction of the original values of the blackened pixels, i.e., predicting low probability for the zero values.

[Figure: (a) Network topology; (b) Masked frame]
Figure 1: Example of a corrupted Moving MNIST sequence. On the left are three uncorrupted frames from the beginning of the sequence. On the right are a black-square corruption (top) and a temporal corruption (bottom), along with their prediction loss maps (for both the corrupted frames and their uncorrupted versions). Bright values in the loss maps correspond to high cross-entropy loss values, indicating an anomaly.


Model                                        EER [%]
VPN                                          82.1
Our model                                    87.0
Our model w/o the masked frame input         84.6
Our model w/o the conv-attention mechanism   85.7