Learning Memory-guided Normality for Anomaly Detection

03/30/2020 ∙ by Hyunjong Park, et al.

We address the problem of anomaly detection, that is, detecting anomalous events in a video sequence. Anomaly detection methods based on convolutional neural networks (CNNs) typically leverage proxy tasks, such as reconstructing input video frames, to learn models describing normality without seeing anomalous samples at training time, and quantify the extent of abnormalities using the reconstruction error at test time. The main drawbacks of these approaches are that they do not consider the diversity of normal patterns explicitly, and the powerful representation capacity of CNNs allows them to reconstruct abnormal video frames. To address this problem, we present an unsupervised learning approach to anomaly detection that considers the diversity of normal patterns explicitly, while lessening the representation capacity of CNNs. To this end, we propose to use a memory module with a new update scheme where items in the memory record prototypical patterns of normal data. We also present novel feature compactness and separateness losses to train the memory, boosting the discriminative power of both memory items and deeply learned features from normal data. Experimental results on standard benchmarks demonstrate the effectiveness and efficiency of our approach, which outperforms the state of the art.




1 Introduction

The problem of detecting abnormal events in video sequences, e.g., vehicles on sidewalks, has attracted significant attention over the last decade, and it is particularly important for surveillance and fault detection systems. It is extremely challenging for a number of reasons: First, anomalous events are determined differently according to circumstances. Namely, the same activity could be normal or abnormal (e.g., holding a knife in the kitchen or in the park). Manually annotating anomalous events is in this context labor intensive. Second, collecting anomalous datasets requires a lot of effort, as anomalous events rarely happen in real-life situations. Anomaly detection is thus typically deemed to be an unsupervised learning problem, aiming at learning a model describing normality without anomalous samples. At test time, events and activities not described by the model are then considered as anomalies.

There are many attempts to model normality in video sequences using unsupervised learning approaches. At training time, given normal video frames as inputs, they typically extract feature representations and try to reconstruct the inputs again. The video frames with large reconstruction errors are then treated as anomalies at test time. This assumes that abnormal samples are not reconstructed well, as the models have never seen them during training. Recent methods based on convolutional neural networks (CNNs) exploit an autoencoder (AE) [1, 17]. The powerful representation capacity of CNNs allows them to extract better feature representations. The CNN features from abnormal frames, on the other hand, are likely to be reconstructed by combining those of normal ones [22, 8]. In this case, abnormal frames have low reconstruction errors, which often occurs when most regions of an abnormal frame are normal (e.g., pedestrians in a park). In order to lessen the capacity of CNNs, a video prediction framework [22] is introduced that minimizes the difference between a predicted future frame and its ground truth. The drawback of these methods [1, 17, 22] is that they do not detect anomalies directly [35]. They instead leverage proxy tasks for anomaly detection, e.g., reconstructing input frames [1, 17] or predicting future frames [22], to extract general feature representations rather than normal patterns. To overcome this problem, Deep SVDD [35] exploits the one-class classification objective to map normal data into a hypersphere. Specifically, it minimizes the volume of the hypersphere such that normal samples are mapped closely to the center of the sphere. Although a single center of the sphere represents a universal characteristic of normal data, this does not consider various patterns of normal samples.

We present in this paper an unsupervised learning approach to anomaly detection in video sequences considering the diversity of normal patterns. We assume that a single prototypical feature is not enough to represent various patterns of normal data. That is, multiple prototypes (i.e., modes or centroids of features) exist in the feature space of normal video frames (Fig. 1). To implement this idea, we propose a memory module for anomaly detection, where individual items in the memory correspond to prototypical features of normal patterns. We represent video frames using the prototypical features in the memory items, lessening the capacity of CNNs. To reduce the intra-class variations of CNN features, we propose a feature compactness loss, mapping the features of a normal video frame to the nearest item in the memory and encouraging them to be close. Simply alternating between updating memory items and extracting CNN features yields a degenerate solution, where all items are similar and thus all features are mapped closely in the embedding space. To address this problem, we propose a feature separateness loss. It minimizes the distance between each feature and its nearest item, while maximizing the discrepancy between the feature and the second nearest one, separating individual items in the memory and enhancing the discriminative power of the features and memory items. We also introduce an update strategy to prevent the memory from recording features of anomalous samples at test time. To this end, we propose a weighted regular score measuring how many anomalies exist within a video frame, such that the items are updated only when the frame is determined to be normal. Experimental results on standard benchmarks, including UCSD Ped2 [21], CUHK Avenue [24] and ShanghaiTech [25], demonstrate the effectiveness and efficiency of our approach, outperforming the state of the art.

The main contributions of this paper can be summarized as follows:

  • We propose to use multiple prototypes to represent the diverse patterns of normal video frames for unsupervised anomaly detection. To this end, we introduce a memory module recording prototypical patterns of normal data on the items in the memory.

  • We propose feature compactness and separateness losses to train the memory, ensuring the diversity and discriminative power of the memory items. We also present a new update scheme of the memory, when both normal and abnormal samples exist at test time.

  • We achieve a new state of the art on standard benchmarks for unsupervised anomaly detection in video sequences. We also provide an extensive experimental analysis with ablation studies.

Our code and models are available online: https://cvlab.yonsei.ac.kr/projects/MNAD.

2 Related work

Anomaly detection.

Many works formulate anomaly detection as an unsupervised learning problem, where anomalous data are not available at training time. They typically adopt reconstructive or discriminative approaches to learn models describing normality. Reconstructive models encode normal patterns using representation learning methods such as an AE [48, 36], sparse dictionary learning [6, 49, 24], and a generative model [43]. Discriminative models characterize the statistical distributions of normal samples and obtain decision boundaries around the normal instances, e.g., using Markov random field (MRF) [15], a mixture of dynamic textures (MDT) [28], Gaussian regression [4], and one-class classification [39, 27, 14]. These approaches, however, often fail to capture the complex distributions of high-dimensional data such as images and videos.


Figure 2: Overview of our framework for reconstructing a video frame. Our model mainly consists of three parts: an encoder, a memory module, and a decoder. The encoder extracts a query map $\mathbf{q}_t$ of size $H \times W \times C$ from an input video frame $\mathbf{I}_t$ at time $t$. The memory module performs reading and updating items $\mathbf{p}_m$ of size $1 \times 1 \times C$ using queries $\mathbf{q}_t^k$ of size $1 \times 1 \times C$, where the numbers of items and queries are $M$ and $K$, respectively, and $K = H \times W$. The query map $\mathbf{q}_t$ is concatenated with the aggregated (i.e., read) items $\hat{\mathbf{p}}_t$. The decoder then inputs them to reconstruct the video frame $\hat{\mathbf{I}}_t$. For the prediction task, we input four successive video frames to predict the fifth one. (Best viewed in color.)

CNNs have allowed remarkable advances in anomaly detection over the last decade. Many anomaly detection methods leverage reconstructive models [9, 25, 5, 33] exploiting feature representations from, e.g., a convolutional AE (Conv-AE) [9], a 3D Conv-AE [50], a recurrent neural network (RNN) [29, 25, 26], and a generative adversarial network (GAN) [33]. Although CNN-based methods outperform classical approaches by large margins, they even reconstruct anomalous samples with a combination of normal ones, mainly due to the representation capacity of CNNs. This problem can be alleviated by using predictive or discriminative models [22, 35]. The work of [22] assumes that anomalous frames in video sequences are unpredictable, and trains a network for predicting future frames rather than the input itself [22]. It achieves a remarkable performance gain over reconstructive models, but at the cost of runtime for estimating optical flow between video frames. It also requires ground-truth optical flow to train a sub-network for computing flow fields. Deep SVDD [35] leverages CNNs as mapping functions that transform normal data into the center of the hypersphere, while forcing anomalous samples to fall outside the sphere, using the one-class classification objective. Our method also lessens the representation capacity of CNNs, but in a different way. We reconstruct or predict a video frame with a combination of items in the memory, rather than using CNN features directly from an encoder, while considering various patterns of normal data. In the case of future frame prediction, our model does not require computing optical flow, and thus it is much faster than the current method [22]. Deep-Cascade [37] detects various normal patches using cascaded deep networks. In contrast, our method leverages memory items to record the normal patterns explicitly, even in test sequences. Concurrent to our method, Gong et al. introduce a memory-augmented autoencoder (MemAE) for anomaly detection [8]. It also uses CNN features, but from a 3D Conv-AE, to retrieve relevant memory items that record normal patterns, where the items are updated during training only. Unlike this approach, our model better records diverse and discriminative normal patterns by separating memory items explicitly using feature compactness and separateness losses, which enables using a small number of items (10 vs. 2,000 for MemAE). We also update the memory at test time, while discriminating anomalies simultaneously, suggesting that our model also memorizes normal patterns of test data.

Memory networks.

There are a number of attempts to capture long-term dependencies in sequential data. Long short-term memory (LSTM) [11] addresses this problem using local memory cells, where hidden states of the network partially record information from the past. The memorization performance is, however, limited, as the size of the cell is typically small and the knowledge in the hidden state is compressed. To overcome this limitation, memory networks [45] have recently been introduced. They use a global memory that can be read and written to, and perform memorization tasks better than classical approaches. The memory networks, however, require layer-wise supervision to learn models, making it hard to train them using standard backpropagation. More recent works use continuous memory representations [40] or key-value pairs [30] to read/write memories, allowing the memory networks to be trained end-to-end. Several works adopt memory networks for computer vision tasks including visual question answering [19, 7], one-shot learning [38, 13, 2], image generation [51], and video summarization [20]. Our work also exploits a memory module, but for anomaly detection and with a different memory updating strategy. We record various patterns of normal data to individual items in the memory, and consider each item as a prototypical feature.

3 Approach

We show in Fig. 2 an overview of our framework. We reconstruct input frames or predict future ones for unsupervised anomaly detection. Following [22], we input four successive video frames to predict the fifth one for the prediction task. As the prediction can be considered as a reconstruction of the future frame using previous ones, we use almost the same network architecture with the same losses for both tasks. We describe hereafter our approach for the reconstruction task in detail.
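As an illustration of the prediction setup described above, the following minimal sketch builds (input, target) pairs from a frame sequence, where four successive frames form the input and the fifth is the target. The function name `prediction_samples` and the use of plain Python lists as stand-ins for frames are assumptions made for illustration only.

```python
def prediction_samples(frames, context=4):
    """Pair every run of `context` successive frames with the frame that follows.

    frames: a sequence of frames (any objects; integers are used below as stand-ins).
    Returns a list of (inputs, target) tuples.
    """
    samples = []
    for start in range(len(frames) - context):
        inputs = list(frames[start:start + context])  # four successive frames
        target = frames[start + context]              # the fifth frame
        samples.append((inputs, target))
    return samples


# Toy sequence of six "frames", labeled by their time index.
samples = prediction_samples([0, 1, 2, 3, 4, 5])
```

For the reconstruction task the same loop degenerates to pairing each frame with itself, which is why a single architecture can serve both tasks.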

Our model mainly consists of three components: an encoder, a memory module, and a decoder. The encoder inputs a normal video frame and extracts query features. The features are then used to retrieve prototypical normal patterns in the memory items and to update the memory. We feed the query features and memory items aggregated (i.e., read) to the decoder for reconstructing the input video frame. We train our model using reconstruction, feature compactness, and feature separateness losses end-to-end. At test time, we use a weighted regular score in order to prevent the memory from being updated by abnormal video frames. We compute the discrepancies between the input frame and its reconstruction and the distances between the query feature and the nearest item in the memory to quantify the extent of abnormalities in a video frame.

Figure 3: Illustration of reading and updating the memory. To read items in the memory, we compute matching probabilities $w_t^{k,m}$ in (1) between the query $\mathbf{q}_t^k$ and the items ($\mathbf{p}_1, \dots, \mathbf{p}_M$), and apply a weighted average of the items with the probabilities to obtain the feature $\hat{\mathbf{p}}_t^k$. To update the items, we compute another set of matching probabilities $v_t^{k,m}$ in (4) between the item $\mathbf{p}_m$ and the queries ($\mathbf{q}_t^1, \dots, \mathbf{q}_t^K$). We then compute a weighted average of the queries in the set $U_t^m$ with the corresponding probabilities, and add it to the initial item $\mathbf{p}_m$ in (3). c: cosine similarities; s: a softmax function; w: a weighted average; n: max normalization; $U_t^m$: a set of indices for the $m$-th memory item. See text for details. (Best viewed in color.)

3.1 Network architecture

3.1.1 Encoder and decoder

We exploit the U-Net architecture [34], widely used for the tasks of reconstruction and future frame prediction [22], to extract feature representations from input video frames and to reconstruct the frames from the features. Differently, we remove the last batch normalization and ReLU layers [18] in the encoder, as the ReLU cuts off negative values, restricting diverse feature representations. We instead add an L2 normalization layer to make the features have a common scale. Skip connections in the U-Net architecture may not be able to extract useful features from the video frames especially for the reconstruction task, and our model may learn to copy the inputs for the reconstruction. We thus remove the skip connections for the reconstruction task, while retaining them for predicting future frames. We denote by $\mathbf{I}_t$ and $\mathbf{q}_t$ a video frame and a corresponding feature (i.e., a query) from the encoder at time $t$, respectively. The encoder inputs the video frame $\mathbf{I}_t$ and gives the query map $\mathbf{q}_t$ of size $H \times W \times C$, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively. We denote by $\mathbf{q}_t^k \in \mathbb{R}^C$ ($k = 1, \dots, K$), where $K = H \times W$, individual queries of size $1 \times 1 \times C$ in the query map $\mathbf{q}_t$. The queries are then fed to the memory module to read the items in the memory or to update the items, such that they record prototypical normal patterns. The memory module is described in detail in the following section. The decoder takes the queries and retrieved memory items as input and reconstructs the video frame $\hat{\mathbf{I}}_t$.
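To make the shape bookkeeping above concrete, here is a minimal sketch of how an encoder output of size H × W × C can be split into K = H × W queries and L2-normalized per query. It uses nested Python lists in place of tensors; the helper names `flatten_queries` and `l2_normalize` and the small `eps` guard are assumptions for illustration, not the authors' implementation.

```python
import math


def flatten_queries(query_map):
    """Turn an H x W x C nested list into K = H * W queries of length C."""
    return [list(q) for row in query_map for q in row]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


# Toy query map with H = 1, W = 2, C = 2. Negative values survive because
# no ReLU is applied before the normalization.
query_map = [[[3.0, 4.0], [-1.0, 0.0]]]
queries = [l2_normalize(q) for q in flatten_queries(query_map)]
```

Putting every query on the unit sphere is also what makes an inner product between a query and a unit-norm item equal to their cosine similarity.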

3.1.2 Memory

The memory module contains $M$ items recording various prototypical patterns of normal data. We denote by $\mathbf{p}_m \in \mathbb{R}^C$ ($m = 1, \dots, M$) the items in the memory. The memory performs reading and updating of the items (Fig. 3).


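Read.

To read the items, we compute the cosine similarity between each query $\mathbf{q}_t^k$ and all memory items $\mathbf{p}_m$, resulting in a 2-dimensional correlation map of size $M \times K$. We then apply a softmax function along the vertical direction (i.e., over the items), and obtain matching probabilities $w_t^{k,m}$ as follows:

$$ w_t^{k,m} = \frac{\exp\left(\mathbf{p}_m^\top \mathbf{q}_t^k\right)}{\sum_{m'=1}^{M} \exp\left(\mathbf{p}_{m'}^\top \mathbf{q}_t^k\right)}. \tag{1} $$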


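For each query $\mathbf{q}_t^k$, we read the memory by a weighted average of the items $\mathbf{p}_{m'}$ with the corresponding weights $w_t^{k,m'}$, and obtain the feature $\hat{\mathbf{p}}_t^k \in \mathbb{R}^C$ as follows:

$$ \hat{\mathbf{p}}_t^k = \sum_{m'=1}^{M} w_t^{k,m'} \mathbf{p}_{m'}. \tag{2} $$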


Using all items instead of the closest item allows our model to understand diverse normal patterns, taking into account the overall normal characteristics. That is, we represent the query $\mathbf{q}_t^k$ with a combination of the items $\mathbf{p}_m$ in the memory. We apply the reading operator to individual queries, and obtain a transformed feature map $\hat{\mathbf{p}}_t \in \mathbb{R}^{H \times W \times C}$ (i.e., aggregated items). We concatenate it with the query map $\mathbf{q}_t$ along the channel dimension, and input them to the decoder. This enables the decoder to reconstruct the input frame using normal patterns in the items, lessening the representation capacity of CNNs, while understanding the normality.
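The read operation described above can be sketched in a few lines of plain Python: a softmax over the similarities between one query and all items gives the matching probabilities, and the read feature is the resulting weighted average of the items. This is a minimal sketch under the assumption that queries and items are already L2-normalized, so the inner product stands in for the cosine similarity; the names `read_memory`, `softmax`, and `dot` are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(queries, items):
    """For each query, return (weights over items, weighted average of items).

    queries: K vectors of length C; items: M vectors of length C.
    """
    results = []
    for q in queries:
        # Similarity of this query to every item, then softmax across items.
        weights = softmax([dot(p, q) for p in items])
        # Aggregated feature: a convex combination of all items.
        aggregated = [
            sum(w * p[c] for w, p in zip(weights, items)) for c in range(len(q))
        ]
        results.append((weights, aggregated))
    return results


# Toy example: two orthogonal items and one query aligned with the first item.
items = [[1.0, 0.0], [0.0, 1.0]]
weights, aggregated = read_memory([[1.0, 0.0]], items)[0]
```

Because the weights are a softmax, the read feature always lies in the convex hull of the memory items, which is one way to see how the memory limits what the decoder can reproduce.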


Update.

For each memory item, we select all queries that declare the item to be their nearest one, using the matching probabilities in (1). Note that multiple queries can be assigned to a single item in the memory. See, for example, Fig. 5 in Sec. 4.3. We denote by $U_t^m$ the set of indices of the queries assigned to the $m$-th item in the memory. We update the item using only the queries indexed by the set $U_t^m$ as follows:

$$ \mathbf{p}_m \leftarrow f\Big(\mathbf{p}_m + \sum_{k \in U_t^m} v_t'^{k,m} \mathbf{q}_t^k\Big), \tag{3} $$


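where $f(\cdot)$ is the L2 normalization. By using a weighted average of the queries, rather than summing them up, we can concentrate more on the queries near the item. To this end, we compute matching probabilities $v_t^{k,m}$ similar to (1) but by applying the softmax function to the correlation map of size $M \times K$ along the horizontal direction (i.e., over the queries) as

$$ v_t^{k,m} = \frac{\exp\left(\mathbf{p}_m^\top \mathbf{q}_t^k\right)}{\sum_{k'=1}^{K} \exp\left(\mathbf{p}_m^\top \mathbf{q}_t^{k'}\right)}, \tag{4} $$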


and renormalize it to consider the queries indexed by the set $U_t^m$ as follows:

$$ v_t'^{k,m} = \frac{v_t^{k,m}}{\max_{k' \in U_t^m} v_t^{k',m}}. \tag{5} $$
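The update step can be sketched as follows: each query is assigned to its nearest item, a softmax over the queries gives per-item weights, the weights are max-normalized within the assigned set, and the item is moved toward the weighted queries and rescaled to unit norm. This is a minimal sketch with plain lists; the function name `update_memory` and the choice to leave an item untouched when no query is assigned to it are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def update_memory(queries, items):
    """Return updated items; each item absorbs only the queries nearest to it."""
    sims = [[dot(p, q) for p in items] for q in queries]  # K x M similarities
    # Nearest item per query. Softmax over items is monotonic, so the argmax of
    # the similarities equals the argmax of the matching probabilities.
    nearest = [max(range(len(items)), key=lambda m: row[m]) for row in sims]
    updated = []
    for m, item in enumerate(items):
        assigned = [k for k, n in enumerate(nearest) if n == m]
        if not assigned:
            updated.append(list(item))  # assumption: unassigned items stay as-is
            continue
        # Softmax over all K queries for this item.
        column = [sims[k][m] for k in range(len(queries))]
        top = max(column)
        exps = [math.exp(s - top) for s in column]
        total = sum(exps)
        v = [e / total for e in exps]
        # Max-renormalization restricted to the assigned queries.
        v_max = max(v[k] for k in assigned)
        new_item = list(item)
        for k in assigned:
            for c in range(len(item)):
                new_item[c] += (v[k] / v_max) * queries[k][c]
        updated.append(l2_normalize(new_item))
    return updated


items = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.8, 0.6], [0.0, 1.0]]
new_items = update_memory(queries, items)
```

In this toy case the first item rotates toward the first query while the second item, already equal to its only assigned query, keeps its direction.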


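We update memory items recording prototypical features at both training and test time, since normal patterns in training and test sets may be different and they could vary with various factors, e.g., illumination and occlusion. As both normal and abnormal frames are available at test time, we propose to use a weighted regular score to prevent the memory items from recording patterns in the abnormal frames. Given a video frame $\mathbf{I}_t$, we use the weighted reconstruction error between $\mathbf{I}_t$ and $\hat{\mathbf{I}}_t$ as the regular score $\mathcal{E}_t$:

$$ \mathcal{E}_t = \sum_{i,j} W_{ij}(\hat{\mathbf{I}}_t, \mathbf{I}_t) \left\| \hat{\mathbf{I}}_t^{ij} - \mathbf{I}_t^{ij} \right\|_2, \tag{6} $$

where the weight function is

$$ W_{ij}(\hat{\mathbf{I}}_t, \mathbf{I}_t) = \frac{1 - \exp\left(-\left\| \hat{\mathbf{I}}_t^{ij} - \mathbf{I}_t^{ij} \right\|_2\right)}{\sum_{i,j} \left[ 1 - \exp\left(-\left\| \hat{\mathbf{I}}_t^{ij} - \mathbf{I}_t^{ij} \right\|_2\right) \right]}, \tag{7} $$

and $i$ and $j$ are spatial indices. When the score $\mathcal{E}_t$ is higher than a threshold $\gamma$, we regard the frame $\mathbf{I}_t$ as an abnormal sample, and do not use it for updating memory items. Note that we use this score only when updating the memory. The weight function allows us to focus more on the regions of large reconstruction errors, as abnormal activities typically appear within small parts of the video frame.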
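A small sketch may help show why a weighted error reacts to localized anomalies more strongly than a plain mean. It assumes the per-pixel weight takes the form 1 - exp(-error), normalized over all pixels, and treats single-channel frames as 2-D lists; the names `regular_score` and `should_update_memory` and the threshold value used below are hypothetical.

```python
import math


def regular_score(recon, frame):
    """Weighted reconstruction error for one single-channel frame.

    recon, frame: 2-D lists of per-pixel intensities with the same shape.
    """
    errors = [abs(r - x) for rr, xr in zip(recon, frame) for r, x in zip(rr, xr)]
    # Assumed weight form: larger error gives larger (unnormalized) weight.
    raw = [1.0 - math.exp(-e) for e in errors]
    total = sum(raw)
    if total == 0.0:
        return 0.0  # perfect reconstruction
    return sum((w / total) * e for w, e in zip(raw, errors))


def should_update_memory(recon, frame, threshold):
    """Use the frame for memory updates only if its score does not exceed the threshold."""
    return regular_score(recon, frame) <= threshold


frame = [[0.0, 0.0], [0.0, 0.0]]
clean = [[0.0, 0.0], [0.0, 0.0]]
spiky = [[0.0, 0.0], [0.0, 2.0]]  # error concentrated in one pixel
score_clean = regular_score(clean, frame)
score_spiky = regular_score(spiky, frame)
```

In the toy example a single bad pixel out of four yields a score of 2.0, whereas an unweighted mean of the same errors would be 0.5, so small anomalous regions are not averaged away.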

3.2 Training loss

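We exploit the video frames as a supervisory signal to discriminate normal and abnormal samples. To train our model, we use reconstruction, feature compactness, and feature separateness losses ($\mathcal{L}_{rec}$, $\mathcal{L}_{compact}$, and $\mathcal{L}_{separate}$, respectively), balanced by the parameters $\lambda_c$ and $\lambda_s$ as follows:

$$ \mathcal{L} = \mathcal{L}_{rec} + \lambda_c \mathcal{L}_{compact} + \lambda_s \mathcal{L}_{separate}. \tag{8} $$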

Reconstruction loss.

The reconstruction loss makes the video frame reconstructed from the decoder similar to its ground truth by penalizing the intensity differences. Specifically, we minimize the L2 distance between the decoder output $\hat{\mathbf{I}}_t$ and the ground truth $\mathbf{I}_t$:

$$ \mathcal{L}_{rec} = \sum_{t}^{T} \left\| \hat{\mathbf{I}}_t - \mathbf{I}_t \right\|_2, \tag{9} $$

where we denote by $T$ the total length of a video sequence. We set the first time step to 1 and 5 for reconstruction and prediction tasks, respectively.

Feature compactness loss.

The feature compactness loss encourages the queries to be close to the nearest item in the memory, reducing intra-class variations. It penalizes the discrepancies between them in terms of the L2 norm as:

$$ \mathcal{L}_{compact} = \sum_{t}^{T} \sum_{k}^{K} \left\| \mathbf{q}_t^k - \mathbf{p}_p \right\|_2, \tag{10} $$

where $p$ is the index of the nearest item for the query $\mathbf{q}_t^k$, defined as

$$ p = \operatorname*{argmax}_{m \in \{1, \dots, M\}} w_t^{k,m}. \tag{11} $$

Note that the feature compactness loss and the center loss [44] are similar, as the memory item $\mathbf{p}_p$ corresponds to the center of deep features in the center loss. They are different in that the item in (10) is retrieved from the memory, and it is updated without any supervisory signals, while the cluster center in the center loss is computed directly using the features learned from ground-truth class labels. Note also that our method can be considered as an unsupervised learning of joint clustering and feature representations. In this task, degenerate solutions are likely to be obtained [44, 47]. As will be seen in our experiments, training our model using the feature compactness loss only makes all items similar, and thus all queries are mapped closely in the embedding space, losing the capability of recording diverse normal patterns.

Feature separateness loss.

Similar queries should be allocated to the same item in order to reduce the number of items and the memory size. The feature compactness loss in (10) makes all queries and memory items close to each other, as we extract the features (i.e., queries) and update the items alternately, so that all items become similar. The items in the memory, however, should be far enough apart from each other to consider various patterns of normal data. To prevent this problem while obtaining compact feature representations, we propose a feature separateness loss, defined with a margin of $\alpha$ as follows:

$$ \mathcal{L}_{separate} = \sum_{t}^{T} \sum_{k}^{K} \left[ \left\| \mathbf{q}_t^k - \mathbf{p}_p \right\|_2 - \left\| \mathbf{q}_t^k - \mathbf{p}_n \right\|_2 + \alpha \right]_+, \tag{12} $$

where we set the query $\mathbf{q}_t^k$, its nearest item $\mathbf{p}_p$, and the second nearest item $\mathbf{p}_n$ as an anchor, a positive sample, and a hard negative sample, respectively. We denote by $n$ the index of the second nearest item for the query $\mathbf{q}_t^k$:

$$ n = \operatorname*{argmax}_{m \in \{1, \dots, M\},\, m \neq p} w_t^{k,m}. \tag{13} $$

Note that this is different from the typical use of the triplet loss that requires ground-truth positive and negative samples for the anchor. Our loss encourages the query and the second nearest item to be distant, while keeping the query and the nearest one close. This has the effect of placing the items far apart. As a result, the feature separateness loss allows us to update the item nearest to the query while discarding the influence of the second nearest item, separating all items in the memory and enhancing the discriminative power.
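The two memory losses can be sketched for the queries of a single frame as below: the compactness term is the distance to the nearest item, and the separateness term is a triplet-style hinge between the nearest and second nearest items. This is a minimal sketch, not the training code; the function names, the margin of 1.0 in the example, and the balancing weights passed to `total_loss` are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compact_and_separate(queries, items, margin):
    """Sum the compactness and separateness terms over the queries of one frame."""
    compact, separate = 0.0, 0.0
    for q in queries:
        # Rank items by similarity to the query; softmax preserves this order.
        order = sorted(range(len(items)), key=lambda m: dot(items[m], q), reverse=True)
        positive, negative = items[order[0]], items[order[1]]  # nearest, second nearest
        d_pos, d_neg = l2_distance(q, positive), l2_distance(q, negative)
        compact += d_pos
        separate += max(d_pos - d_neg + margin, 0.0)  # hinge with margin
    return compact, separate


def total_loss(rec, compact, separate, weight_compact, weight_separate):
    """Weighted sum of the three training losses."""
    return rec + weight_compact * compact + weight_separate * separate


items = [[1.0, 0.0], [0.0, 1.0]]
compact, separate = compact_and_separate([[0.6, 0.8]], items, margin=1.0)
```

The hinge is zero only once the second nearest item is at least `margin` farther away than the nearest one, which is what pushes neighboring items apart during training.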

3.3 Abnormality score

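We quantify the extent of normalities or abnormalities in a video frame at test time. We assume that the queries obtained from a normal video frame are similar to the memory items, as they record prototypical patterns of normal data. We compute the L2 distance between each query and the nearest item as follows:

$$ D(\mathbf{q}_t, \mathbf{p}) = \frac{1}{K} \sum_{k}^{K} \left\| \mathbf{q}_t^k - \mathbf{p}_p \right\|_2, \tag{14} $$

where $\mathbf{p}_p$ is the item nearest to the query $\mathbf{q}_t^k$.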


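We also exploit the memory items implicitly to compute the abnormality score. We measure how well the video frame is reconstructed using the memory items. This assumes that anomalous patterns in the video frame are not reconstructed by the memory items. Following [22], we compute the PSNR between the input video frame and its reconstruction:

$$ P(\hat{\mathbf{I}}_t, \mathbf{I}_t) = 10 \log_{10} \frac{\max(\hat{\mathbf{I}}_t)}{\left\| \hat{\mathbf{I}}_t - \mathbf{I}_t \right\|_2^2 / N}, \tag{15} $$

where $N$ is the number of pixels in the video frame. When the frame $\mathbf{I}_t$ is abnormal, we obtain a low value of PSNR and vice versa. Following [22, 8, 25], we normalize each error in (14) and (15) in the range of [0, 1] by a min-max normalization [22]. We define the final abnormality score $S_t$ for each video frame as the sum of two metrics, balanced by the parameter $\lambda$, as follows:

$$ S_t = \lambda \left( 1 - g\left(P(\hat{\mathbf{I}}_t, \mathbf{I}_t)\right) \right) + (1 - \lambda)\, g\left(D(\mathbf{q}_t, \mathbf{p})\right), \tag{16} $$

where we denote by $g(\cdot)$ the min-max normalization [22] over whole video frames, e.g.,

$$ g\left(D(\mathbf{q}_t, \mathbf{p})\right) = \frac{D(\mathbf{q}_t, \mathbf{p}) - \min_t D(\mathbf{q}_t, \mathbf{p})}{\max_t D(\mathbf{q}_t, \mathbf{p}) - \min_t D(\mathbf{q}_t, \mathbf{p})}. \tag{17} $$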
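The scoring pipeline above can be sketched as three small functions: a PSNR helper, a min-max normalization over all frames of a video, and the balanced combination of the two normalized cues. The PSNR helper assumes the form 10 log10(peak / MSE) with the peak taken as the largest reconstructed value, non-negative intensities, and a non-zero error; the function names and the balance value 0.5 in the example are illustrative assumptions.

```python
import math


def psnr(recon, frame):
    """PSNR between two flattened frames (assumes non-negative values, non-zero error)."""
    mse = sum((r - x) ** 2 for r, x in zip(recon, frame)) / len(frame)
    return 10.0 * math.log10(max(recon) / mse)


def min_max(values):
    """Rescale a list of per-frame values to [0, 1] over the whole video."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def abnormality_scores(psnrs, distances, balance):
    """Combine normalized PSNR (inverted) and normalized query-item distance."""
    norm_psnr, norm_dist = min_max(psnrs), min_max(distances)
    return [
        balance * (1.0 - p) + (1.0 - balance) * d
        for p, d in zip(norm_psnr, norm_dist)
    ]


# Three frames: the second has the lowest PSNR and the largest distance.
scores = abnormality_scores([30.0, 20.0, 25.0], [0.1, 0.5, 0.3], balance=0.5)
example_psnr = psnr([1.0, 0.5], [1.0, 0.0])
```

Because both cues are normalized per video, the resulting scores are relative within that video: the frame with the worst reconstruction and the largest distance gets a score of 1, and the best one gets 0.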


4 Experiments

4.1 Implementation details


We evaluate our method on three benchmark datasets and compare the performance with the state of the art. 1) The UCSD Ped2 dataset [21] contains 16 training and 12 test videos with 12 irregular events, including riding a bike and driving a vehicle. 2) The CUHK Avenue dataset [24] consists of 16 training and 21 test videos with 47 abnormal events such as running and throwing stuff. 3) The ShanghaiTech dataset [25] contains 330 training and 107 test videos of 13 scenes. It is the largest dataset among existing benchmarks for anomaly detection.


We resize each video frame to the size of 256 × 256 and normalize it to the range of [-1, 1]. We set the height $H$ and the width $W$ of the query feature map, and the numbers of feature channels $C$ and memory items $M$ to 32, 32, 512 and 10, respectively. We use the Adam optimizer [16] with a batch size of 4 for 60, 60, and 10 epochs on UCSD Ped2 [21], CUHK Avenue [24], and ShanghaiTech [25], respectively. We set initial learning rates to 2e-5 and 2e-4, respectively, for reconstruction and prediction tasks, and decay them using a cosine annealing method [23]. For the reconstruction task, we use a grid search on the test split of UCSD Ped1 [21] to set the hyper-parameters: the loss weights $\lambda_c$ and $\lambda_s$, the score balance $\lambda$, and the update threshold $\gamma$. We use different parameters for the prediction task, similarly chosen using a grid search. All models are trained end-to-end using PyTorch [32], taking about 1, 15 and 36 hours for UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, with an Nvidia GTX TITAN Xp.
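Two of the preprocessing and optimization details above lend themselves to a short worked example: mapping pixel intensities to [-1, 1] and decaying a learning rate with cosine annealing. The sketch assumes 8-bit input intensities, a decay to a minimum learning rate of zero, and a schedule without warm restarts; the step count of 100 is arbitrary and the function names are illustrative.

```python
import math


def to_unit_range(pixel):
    """Map an 8-bit intensity in [0, 255] to [-1, 1] (assumed input format)."""
    return pixel / 127.5 - 1.0


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Reconstruction-task initial learning rate from the text, decayed over 100 steps.
start_lr = cosine_lr(0, 100, 2e-5)
middle_lr = cosine_lr(50, 100, 2e-5)
final_lr = cosine_lr(100, 100, 2e-5)
```

The schedule starts at the initial rate, passes through half of it at the midpoint, and reaches the minimum at the last step.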

Task    Method                 Ped2 [21]   Avenue [24]   Shanghai [25]
-       MPPCA [15]             69.3        -             -
-       MPPC+SFA [15]          61.3        -             -
-       MDT [28]               82.9        -             -
-       AMDN [46]              90.8        -             -
-       Unmasking [41]         82.2        80.6          -
-       MT-FRCN [10]           92.2        -             -
-       AMC [31]               96.2        86.9          -
Recon.  ConvAE [9]             85.0        80.0          60.9
Recon.  TSC [25]               91.0        80.6          67.9
Recon.  StackRNN [25]          92.2        81.7          68.0
Recon.  AbnormalGAN [33]       93.5        -             -
Recon.  MemAE w/o Mem. [8]     91.7        81.0          69.7
Recon.  MemAE w/ Mem. [8]      94.1        83.3          71.2
Recon.  Ours-R w/o Mem.        86.4        80.6          65.8
Recon.  Ours-R w/ Mem.         90.2        82.8          69.8
Pred.   Frame-Pred [22]        95.4        85.1          72.8
Pred.   Ours-P w/o Mem.        94.3        84.5          66.8
Pred.   Ours-P w/ Mem.         97.0        88.5          70.5

Table 1: Quantitative comparison with the state of the art for anomaly detection. We measure the average AUC (%) on UCSD Ped2 [21], CUHK Avenue [24], and ShanghaiTech [25].

4.2 Results

Comparison with the state of the art.

We compare in Table 1 our models with the state of the art for anomaly detection on UCSD Ped2 [21], CUHK Avenue [24], and ShanghaiTech [25]. Following the experimental protocol in [22, 8, 25], we measure the average area under curve (AUC) by computing the area under the receiver operating characteristic (ROC) curve with varying threshold values for abnormality scores. We report the AUC performance of our models using memory modules for the tasks of frame reconstruction and future frame prediction. For comparison, we also provide the performance without the memory module. The suffixes ‘-R’ and ‘-P’ indicate the reconstruction and prediction tasks, respectively.
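For readers unfamiliar with the metric, the area under the ROC curve obtained by sweeping a threshold over the abnormality scores equals the probability that a randomly chosen abnormal frame scores higher than a randomly chosen normal one. The following minimal sketch computes it by direct pair counting; the function name `roc_auc` and the toy scores are illustrative, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Frame-level AUC via pair counting; ties count as one half.

    scores: abnormality score per frame; labels: 1 = abnormal, 0 = normal.
    """
    abnormal = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in abnormal:
        for n in normal:
            if a > n:
                wins += 1.0   # abnormal frame correctly ranked above normal frame
            elif a == n:
                wins += 0.5   # tie
    return wins / (len(abnormal) * len(normal))


# Four frames: one abnormal frame (0.35) is ranked below one normal frame (0.4).
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Since the measure depends only on the ranking of frames, it is unaffected by any monotonic rescaling of the abnormality scores.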

From the table, we observe three things: (1) Our model with the prediction task (Ours-P w/ Mem.) gives the best results on UCSD Ped2 and CUHK Avenue, achieving the average AUC of 97.0% and 88.5%, respectively. This demonstrates the effectiveness of our approach to exploiting a memory module for anomaly detection. Although our method is outperformed by Frame-Pred [22] on ShanghaiTech, Frame-Pred uses additional modules for estimating optical flow, which requires more network parameters and ground-truth flow fields. Moreover, Frame-Pred leverages an adversarial learning framework, which takes considerable effort to train. In contrast, our model uses a simple AE for extracting features and predicting the future frame, and thus it is much faster than Frame-Pred (67 fps vs. 25 fps). This suggests that our model offers a good compromise in terms of AUC and runtime; (2) Our model with the reconstruction task (Ours-R w/ Mem.) shows competitive performance compared to other reconstructive methods on UCSD Ped2, and outperforms them on other datasets, except MemAE [8]. Note that MemAE exploits 3D convolutions with 2,000 memory items of size 256. In contrast, our model uses 2D convolutions and requires 10 items of size 512. It is thus computationally much cheaper than MemAE: 67 fps for our model vs. 45 fps for MemAE; (3) Our memory module boosts the AUC performance significantly regardless of the tasks on all datasets. For example, the AUC gains are 2.7%, 4.0%, and 3.7% on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, for the prediction task. This indicates that the memory module is generic and it can be added to other anomaly detection methods.

Figure 4: Qualitative results for future frame prediction on (top to bottom) UCSD Ped2 [21], CUHK Avenue [24], and ShanghaiTech [25]: input frames (left); prediction error (middle); abnormal regions (right). We can see that our model localizes the regions of abnormal events. Best viewed in color.

With an Nvidia GTX TITAN Xp, our current implementation takes on average 0.015 seconds to determine abnormality for an image of size 256 × 256 on UCSD Ped2 [21]. Namely, we achieve 67 fps for anomaly detection, which is much faster than other state-of-the-art methods based on CNNs, e.g., 20 fps for Unmasking [41], 50 fps for StackRNN [25], 25 fps for Frame-Pred [22], and 45 fps for MemAE [8] with the same setting as ours.

Qualitative results.

We show in Fig. 4 qualitative results of our model for future frame prediction on UCSD Ped2 [21], CUHK Avenue [24], and ShanghaiTech [25]. It shows input frames, prediction error, and abnormal regions overlaid on the frame. For visualizing the anomalies, we compute pixel-wise abnormality scores similar to (16). We then mark the regions whose abnormality scores are larger than the average value within the frame. We can see that 1) normal regions are predicted well, while abnormal regions are not, and 2) abnormal events, such as the appearance of a vehicle, jumping, and fighting on UCSD Ped2, CUHK Avenue, and ShanghaiTech, respectively, are highlighted.

4.3 Discussions

Ablation study.

We show an ablation analysis on different components of our models in Table 2. We report the AUC performance for the variants of our models for reconstruction and prediction tasks on UCSD Ped2 [21]. As the AUC performance of both tasks shows a similar trend, we describe the results for the frame reconstruction in detail.

We train the baseline model in the first row with the reconstruction loss, and use PSNR only to compute abnormality scores. From the second row, we can see that our model with the memory module gives better results. The third row shows that the AUC performance even drops when the feature compactness loss is additionally used, as the memory items are not discriminative. The last row demonstrates that the feature separateness loss boosts the performance drastically. It provides an AUC gain of 3.8%, which is quite significant. The last four rows indicate that 1) feature compactness and separateness losses are complementary, 2) updating the memory items with normal frames only at test time largely boosts the AUC performance, and 3) our abnormality score $S_t$, using both PSNR and memory items, quantifies the extent of anomalies better than the one based on PSNR only.

Task Memory Ped2 [21]
Recon. - - - - 86.4
Pred. - - - - 94.3
Table 2: Quantitative comparison for variants of our model. We measure the average AUC (%) on UCSD Ped2 [21].

Memory items.

We visualize in Fig. 5 matching probabilities in (1) from the model trained with/without the feature separateness loss for the reconstruction task on UCSD Ped2 [21]. We observe that each query is highly activated on a few items with the separateness loss, demonstrating that the items and queries are highly discriminative, allowing sparse access to the memory. This also indicates that abnormal samples are not likely to be reconstructed with a combination of memory items.

Figure 5: Visualization of matching probabilities in (1) learned with (left) and without (right) the feature separateness loss (blue: low, yellow: high). We randomly select 10 query features for the purpose of visualization. Best viewed in color.

Figure 6: t-SNE [42] visualization for query features and memory items. We randomly sample 10K query features, learned with (left) and without (right) the feature separateness loss, from UCSD Ped2 [21]. The features and memory items are shown in points and stars, respectively. The points with the same color are mapped to the same item. The feature separateness loss enables separating the items, recording the diverse prototypes of normal data. Best viewed in color.

Feature distribution.

We visualize in Fig. 6 the distribution of query features for the reconstruction task, randomly chosen from UCSD Ped2 [21], learned with and without the feature separateness loss. We can see that our model trained without the separateness loss loses the discriminability of memory items, and thus all features are mapped closely in the embedding space. The separateness loss separates the individual items in the memory, suggesting that it enhances the discriminative power of query features and memory items significantly. We can also see that our model gives compact feature representations.

Reconstruction with motion cues.

Following [8], we use multiple frames for the reconstruction task. Specifically, we input sixteen successive video frames to reconstruct the ninth one. This achieves an AUC of 91.0% on UCSD Ped2, providing an AUC gain of 0.8% but requiring more network parameters (4MB).
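The input/target pairing described above can be sketched as a sliding window over frame indices. The function name and the stride of one frame are assumptions for illustration; only the window length of sixteen and the ninth-frame target come from the text.

```python
def reconstruction_windows(num_frames, window=16, target_offset=8):
    """Return (input_indices, target_index) pairs for a sliding window.

    target_offset=8 selects the ninth frame of each 16-frame window
    (indices are 0-based). A stride of 1 is assumed.
    """
    pairs = []
    for start in range(num_frames - window + 1):
        inputs = list(range(start, start + window))
        pairs.append((inputs, start + target_offset))
    return pairs

# A 20-frame clip yields 5 windows; the first covers frames 0..15 and targets frame 8.
pairs = reconstruction_windows(20)
```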

5 Conclusion

We have introduced an unsupervised learning approach to anomaly detection in video sequences that exploits multiple prototypes to consider the various patterns of normal data. To this end, we have proposed using a memory module whose items record the prototypical patterns. We have shown that training the memory using feature compactness and separateness losses separates the items, enabling sparse access to the memory. We have also presented a new memory update scheme for the case where both normal and abnormal samples exist, which boosts the performance of anomaly detection significantly. Extensive experimental evaluations on standard benchmarks demonstrate that our model outperforms the state of the art.


This research was partly supported by Samsung Electronics Company, Ltd., Device Solutions under Grant, Deep Learning based Anomaly Detection, 2018–2020, and R&D program for Advanced Integrated-intelligence for Identification (AIID) through the National Research Foundation of Korea (NRF) funded by Ministry of Science and ICT (NRF-2018M3E3A1057289).


  • [1] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy layer-wise training of deep networks. In NIPS, Cited by: §1.
  • [2] Q. Cai, Y. Pan, T. Yao, C. Yan, and T. Mei (2018) Memory matching networks for one-shot image recognition. In CVPR, Cited by: §2.
  • [3] R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §2.
  • [4] K. Cheng, Y. Chen, and W. Fang (2015) Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression. In CVPR, Cited by: §2.
  • [5] Y. S. Chong and Y. H. Tay (2017) Abnormal event detection in videos using spatiotemporal autoencoder. In ISNN, Cited by: §2.
  • [6] Y. Cong, J. Yuan, and J. Liu (2011) Sparse reconstruction cost for abnormal event detection. In CVPR, Cited by: §2.
  • [7] C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang (2019) Heterogeneous memory enhanced multimodal attention model for video question answering. In CVPR, Cited by: §2.
  • [8] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In ICCV, Cited by: §1, §2, §3.3, §4.2, §4.2, §4.2, §4.3, Table 1.
  • [9] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In CVPR, Cited by: §2, Table 1.
  • [10] R. Hinami, T. Mei, and S. Satoh (2017) Joint detection and recounting of abnormal events by learning deep generic knowledge. In ICCV, Cited by: Table 1.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: §2.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §3.1.1.
  • [13] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio (2017) Learning to remember rare events. In ICLR, Cited by: §2.
  • [14] V. Kaltsa, A. Briassouli, I. Kompatsiaris, L. J. Hadjileontiadis, and M. G. Strintzis (2015) Swarm intelligence for detecting interesting events in crowded environments. IEEE TIP. Cited by: §2.
  • [15] J. Kim and K. Grauman (2009) Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates. In CVPR, Cited by: §2, Table 1.
  • [16] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • [17] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §3.1.1.
  • [19] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2016) Ask me anything: dynamic memory networks for natural language processing. In ICML, Cited by: §2.
  • [20] S. Lee, J. Sung, Y. Yu, and G. Kim (2018) A memory network approach for story-based temporal summarization of 360 videos. In CVPR, Cited by: §2.
  • [21] W. Li, V. Mahadevan, and N. Vasconcelos (2013) Anomaly detection and localization in crowded scenes. IEEE TPAMI. Cited by: §1, Figure 4, Figure 6, §4.1, §4.1, §4.2, §4.2, §4.2, §4.3, §4.3, §4.3, Table 1, Table 2.
  • [22] W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection–a new baseline. In CVPR, Cited by: §1, §2, §3.1.1, §3.3, §3, §4.2, §4.2, §4.2, Table 1.
  • [23] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.1.
  • [24] C. Lu, J. Shi, and J. Jia (2013) Abnormal event detection at 150 FPS in MATLAB. In ICCV, Cited by: Figure 1, §1, §2, Figure 4, §4.1, §4.1, §4.2, §4.2, Table 1.
  • [25] W. Luo, W. Liu, and S. Gao (2017) A revisit of sparse coding based anomaly detection in stacked RNN framework. In ICCV, Cited by: §1, §2, §3.3, Figure 4, §4.1, §4.1, §4.2, §4.2, §4.2, Table 1.
  • [26] W. Luo, W. Liu, and S. Gao (2017) Remembering history with convolutional LSTM for anomaly detection. In ICME, Cited by: §2.
  • [27] J. Ma and S. Perkins (2003) Time-series novelty detection using one-class support vector machines. In IJCNN, Cited by: §2.
  • [28] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos (2010) Anomaly detection in crowded scenes. In CVPR, Cited by: §2, Table 1.
  • [29] J. R. Medel and A. Savakis (2016) Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390. Cited by: §2.
  • [30] A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In EMNLP, Cited by: §2.
  • [31] T. Nguyen and J. Meunier (2019) Anomaly detection in video sequence with appearance-motion correspondence. In ICCV, Cited by: Table 1.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: §4.1.
  • [33] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe (2017) Abnormal event detection in videos using generative adversarial nets. In ICIP, Cited by: §2, Table 1.
  • [34] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §3.1.1.
  • [35] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In ICML, Cited by: §1, §2.
  • [36] M. Sabokrou, M. Fathy, M. Hoseini, and R. Klette (2015) Real-time anomaly detection and localization in crowded scenes. In CVPRW, Cited by: §2.
  • [37] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette (2017) Deep-Cascade: cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE TIP. Cited by: §2.
  • [38] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In ICML, Cited by: §2.
  • [39] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural computation. Cited by: §2.
  • [40] S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In NIPS, Cited by: §2.
  • [41] R. Tudor Ionescu, S. Smeureanu, B. Alexe, and M. Popescu (2017) Unmasking the abnormal events in video. In ICCV, Cited by: §4.2, Table 1.
  • [42] L. Van Der Maaten (2014) Accelerating t-SNE using tree-based algorithms. JMLR. Cited by: Figure 6.
  • [43] N. Vaswani, A. K. Roy-Chowdhury, and R. Chellappa (2005) “Shape Activity”: a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE TIP. Cited by: §2.
  • [44] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, Cited by: §3.2.
  • [45] J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. In ICLR, Cited by: §2.
  • [46] D. Xu, Y. Yan, E. Ricci, and N. Sebe (2017) Detecting anomalous events in videos by learning deep representations of appearance and motion. CVIU. Cited by: Table 1.
  • [47] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In CVPR, Cited by: §3.2.
  • [48] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang (2016) Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717. Cited by: §2.
  • [49] B. Zhao, L. Fei-Fei, and E. P. Xing (2011) Online detection of unusual events in videos via dynamic sparse coding. In CVPR, Cited by: §2.
  • [50] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X. Hua (2017) Spatio-temporal autoencoder for video anomaly detection. In ACM MM, Cited by: §2.
  • [51] M. Zhu, P. Pan, W. Chen, and Y. Yang (2019) DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In CVPR, Cited by: §2.