Object-centric Auto-encoders and Dummy Anomalies for Abnormal Event Detection in Video

by Radu Tudor Ionescu, et al.

Abnormal event detection in video is a challenging vision problem. Most existing approaches formulate abnormal event detection as an outlier detection task, due to the scarcity of anomalous data during training. Because of the lack of prior information regarding abnormal events, these methods are not fully-equipped to differentiate between normal and abnormal events. In this work, we formalize abnormal event detection as a one-versus-rest binary classification problem. Our contribution is two-fold. First, we introduce an unsupervised feature learning framework based on object-centric convolutional auto-encoders to encode both motion and appearance information. Second, we propose a supervised classification approach based on clustering the training samples into normality clusters. A one-versus-rest abnormal event classifier is then employed to separate each normality cluster from the rest. For the purpose of training the classifier, the other clusters act as dummy anomalies. During inference, an object is labeled as abnormal if the highest classification score assigned by the one-versus-rest classifiers is negative. Comprehensive experiments are performed on four benchmarks: Avenue, ShanghaiTech, UCSD and UMN. Our approach provides superior results on all four data sets. On the large-scale ShanghaiTech data set, our method provides an absolute gain of 12.1 et al., CVPR 2018].





1 Introduction

Figure 1: Our anomaly detection framework based on training convolutional auto-encoders on top of object detections. In the training phase (represented with dashed lines), the concatenated motion and appearance latent representations are clustered and a one-versus-rest classifier is trained to discriminate between the formed clusters. In the inference phase, we label a test sample as abnormal if the highest classification score is negative, i.e. the sample is not attributed to any class. Best viewed in color.

Abnormal event detection in video is a challenging task in computer vision, since it is hard to define abnormal events independent of context. For example, a truck driving by on the street is considered a perfectly normal event. However, if the truck drives through a pedestrian area, then it is regarded as an abnormal event. Another example that illustrates the importance of context is a scenario in which two people are fighting in a boxing ring (normal event) versus fighting on the street (abnormal event). In addition to the reliance on context, abnormal events rarely occur and are generally dominated by more familiar (normal) events. Therefore, it is difficult to obtain a sufficiently representative set of anomalies, making it hard to employ traditional supervised learning methods.

Most existing anomaly detection approaches [2, 5, 14, 17, 22, 24, 25, 35, 37] are based on outlier detection and learn a model of normality from training videos containing only familiar events. At test time, events are labeled as abnormal if they deviate from the normality model. Unlike these approaches, we address abnormal event detection by formulating the task as a multi-class classification problem instead of an outlier detection problem. Since the training data contains only normal events, we first apply k-means clustering in order to find clusters representing various types of normality (see Figure 1). Next, we train one binary classifier per cluster, following the one-versus-rest scheme, in order to separate each normality cluster from the others. During training, normality clusters are treated as different categories, so that, for each classifier, the samples in the other clusters serve as synthetically generated abnormal training data. During inference, the highest classification score corresponding to a given test sample represents the normality score of the respective sample. If the score is negative, the sample is labeled as abnormal (since it does not belong to any normality class). To our knowledge, we are the first to treat the abnormal event detection task as a discriminative multi-class classification problem.

In general, existing abnormal event detection frameworks extract features at a local level [7, 9, 14, 21, 22, 23, 24, 30, 31, 36], global (frame) level [20, 25, 26, 27, 32], or both [5, 6, 11]. All these approaches extract features without explicitly taking into account the objects of interest. In this paper, we propose an object-centric approach by applying a fast yet powerful single-shot detector (SSD) [18] on each frame, and learning deep unsupervised features using convolutional auto-encoders (CAE) on top of the detected objects, as shown in Figure 1. This enables us to explicitly focus only on the objects present in the scene. In addition, it allows us to accurately localize the anomalies in each frame. Although auto-encoders have been used before for abnormal event detection [11, 30, 35], to our knowledge, we are the first to train object-centric auto-encoders.

In summary, the novelty of our paper is two-fold. First, we train object-centric convolutional auto-encoders for both motion and appearance. Second, we propose a supervised learning approach by formulating the abnormal event detection task as a multi-class problem. We conduct experiments on the Avenue [22], the ShanghaiTech [23], the UCSD [24] and the UMN [25] data sets, and compare our approach with the state-of-the-art abnormal event detection methods [6, 7, 9, 11, 12, 13, 14, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 32, 34, 35, 36]. The empirical results clearly show that our approach achieves superior performance compared to the state-of-the-art methods on all data sets. Furthermore, on the Avenue and the ShanghaiTech data sets, our approach provides considerable absolute gains of and , respectively, over the top state-of-the-art method [20].

We organize the paper as follows. We present related work on abnormal event detection in Section 2. We describe our approach in Section 3. We present the abnormal event detection experiments in Section 4. We draw our final conclusions in Section 5.

2 Related Work

Abnormal event detection is commonly formalized as an outlier detection task [2, 5, 6, 9, 14, 17, 22, 24, 25, 28, 34, 35, 36, 37], in which the general approach is to learn a model of normality from training data and label the detected outliers as abnormal events. Several abnormal event detection approaches [5, 6, 9, 22, 28] are based on learning a dictionary of atoms representing normal events, and on labeling the events not represented in the dictionary as abnormal. Recent abnormal event detection approaches have employed locality sensitive hashing filters [36] or deep learning [11, 12, 20, 23, 26, 27, 30, 32, 34, 35] to achieve better results. Smeureanu et al. [32] employed a one-class Support Vector Machine (SVM) model based on deep features provided by convolutional neural networks (CNN) pre-trained on the ILSVRC benchmark [29], while Ravanbakhsh et al. [26] combined pre-trained CNN models with low-level optical-flow maps. Luo et al. [23] proposed a Temporally-coherent Sparse Coding approach, which can be mapped to a stacked Recurrent Neural Network that facilitates parameter optimization and accelerates anomaly prediction. The approach presented in [27] is based on training Generative Adversarial Nets (GAN) using normal frames and corresponding optical-flow images in order to learn an internal representation of the scene normality. The test data is compared with both the appearance and the motion representations reconstructed by the GAN, and abnormal areas are detected by computing local differences. Liu et al. [20] proposed a method for abnormal event detection based on a deep future frame prediction framework. The approach uses the difference between a predicted future frame and the ground-truth frame to detect abnormal events. For a better detection rate, the authors added a temporal constraint based on optical flow, along with the spatial constraints.

Similar to our own approach, which learns features in an unsupervised fashion, there are a few works that have employed unsupervised steps for abnormal event detection [9, 11, 28, 30, 34, 35]. In [9], the authors presented a method that constructs a model of familiar events from training data. The model is incrementally updated in an unsupervised manner as new patterns are observed in the test data. Ren et al. [28] used an unsupervised approach, spectral clustering, to construct a dictionary of atoms, each representing a single type of normal behavior. Interestingly, some recent works do not require any kind of training data in order to detect abnormal events [7, 13, 21].

More closely related to our work are methods that employ features learned with auto-encoders [11, 30, 34, 35] or extracted from the classification branch of Fast R-CNN [12]. In order to learn deep feature representations in an unsupervised manner, Xu et al. [34, 35] employed Stacked Denoising Auto-Encoders on multi-scale patches. In the end, they used one-class SVM classifiers to detect abnormal events. Hasan et al. [11] proposed two auto-encoders, one that is learned on conventional handcrafted spatio-temporal local features, and another one that is learned end-to-end using a fully convolutional feed-forward architecture. On the other hand, Sabokrou et al. [30] combined 3D deep auto-encoders and 3D convolutional neural networks into a cascaded framework.

Differences of our approach. Different from these recent related works [11, 30, 34, 35], we propose to train auto-encoders on object detections provided by a state-of-the-art detector [18]. The most similar work to ours is that of Hinami et al. [12]. They also proposed an object-centric approach, but our detection, feature extraction and training stages are different. While Hinami et al. [12] used geodesic [16] and moving object proposals [10], we employ a single-shot detector [18] based on Feature Pyramid Networks (FPN). In the feature extraction stage, Hinami et al. [12] fine-tuned the classification branch of the Fast R-CNN model on multiple visual tasks to exploit semantic information that is useful for detecting and recounting abnormal events. In contrast, we learn unsupervised deep features with convolutional auto-encoders. Also differing from Hinami et al. [12] and all other works, we formalize the abnormal event detection task as a multi-class problem and propose to train a one-versus-rest SVM on top of k-means clusters. A similar approach was adopted by Caron et al. [4] in order to train deep generic visual features in an unsupervised manner.

3 Method

Motivation. Since the training data contains only normal events, supervised learning methods that require both positive (normal) and negative (abnormal) samples cannot be directly applied for the abnormal event detection task. However, we believe that including any form of supervision is an important step towards obtaining better performance in practice. Motivated by this intuition, we conceive a framework that incorporates two approaches for including supervision. The first approach consists of employing a single-shot object detector [18], which is trained in a supervised fashion, in order to obtain object detections that are subsequently used throughout the rest of the processing pipeline. The second approach consists of training supervised one-versus-rest classifiers on artificially-generated classes representing different kinds of normality. The classes are generated by previously clustering the training samples. Our entire framework is composed of four sequential stages that are described in detail below. These are the object detection stage, the feature learning stage, the model training stage, and the inference stage.

Object detection. We propose to detect objects using a single-shot object detector based on FPN [18], which offers an optimal trade-off between accuracy and speed. This object detector is specifically chosen because it can accurately detect smaller objects, due to the FPN architecture, and it can process about frames per second on a GPU. These advantages are of utmost importance for developing a practical abnormal event detection framework. The object detector is applied on a frame by frame basis in order to obtain a set of bounding boxes for the objects in each frame . We use the bounding boxes to crop the objects. The resulting images are converted to grayscale. Next, the images are directly passed to the feature learning stage, in order to learn object-centric appearance features. At the same time, we use the images containing objects in order to compute gradients representing motion. For this step, we additionally consider the images cropped from a previous and a subsequent frame. As illustrated in Figure 1, we choose the frames at index and , with respect to the current frame . Since the temporal distance between the frames is not significant, we do not need to track the objects. Instead, we simply consider the bounding boxes determined at frame in order to crop the objects at frames and . For each object, we obtain two image gradients, one representing the change in motion from frame to frame and one representing the change in motion from frame to frame . Finally, the image gradients are also passed to the feature learning stage, in order to learn object-centric motion features.
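The cropping step described above can be illustrated with a minimal Python sketch. It assumes frames are plain 2D grayscale arrays stored as lists of rows, that a bounding box is given as (top, left, bottom, right) with exclusive end coordinates, and that the temporal offset is a free parameter named `delta`; all of these names are hypothetical and not taken from the original implementation. The computation of the image gradients themselves is not shown.

```python
def crop(frame, box):
    """Crop a 2D grayscale frame (list of rows) with an end-exclusive box."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


def crop_triplet(frames, t, box, delta):
    """Reuse the box detected at frame t on frames t - delta, t and t + delta.

    No tracking is performed: the same coordinates are applied to all three
    frames, which is reasonable only when delta is small.
    """
    if t - delta < 0 or t + delta >= len(frames):
        raise IndexError("temporal neighbours fall outside the video")
    return tuple(crop(frames[i], box) for i in (t - delta, t, t + delta))


# Toy video: 5 frames of 4x4 pixels, pixel value encodes (frame, row, col).
frames = [[[f * 100 + r * 10 + c for c in range(4)] for r in range(4)]
          for f in range(5)]
previous, current, following = crop_triplet(frames, t=2, box=(1, 1, 3, 3), delta=2)
```

The three crops share the same size by construction, so they can be fed to the appearance and motion branches without any resizing logic specific to the temporal neighbours.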

Figure 2: Normal and abnormal objects (left) and gradients (right) with reconstructions provided by the appearance (left) and the motion (right) convolutional auto-encoders. The samples are selected from the Avenue [22], the ShanghaiTech [23], the UCSD Ped2 [24] and the UMN [25] test videos, and are not seen during training the auto-encoders.

Feature learning. In order to obtain a feature vector for each object detection, we train three convolutional auto-encoders. One auto-encoder takes as input cropped images containing objects, and it inherently learns latent appearance features. The other two auto-encoders take as input the gradients that capture how the object moved before and after the detection moment, respectively. These auto-encoders learn latent motion features. All three auto-encoders are based on the same lightweight architecture, which is composed of an encoder with three convolutional and max-pooling blocks, and a decoder with three upsampling and convolutional blocks and an additional convolutional layer for the final output. For each CAE, the size of the input is , and the size of the output is the same. All convolutional layers are based on filters. Each convolutional layer, except the very last one, is followed by ReLU activations. The first two convolutional layers of the encoder contain filters each, while the third convolutional layer contains filters. The max-pooling layers of the encoder are based on filters with stride . The resulting latent feature representation of each CAE is composed of activation maps of size . In the decoder, each resize layer upsamples the input activations by a factor of two, using the nearest neighbor approach. The first convolutional layer in the decoder contains filters. The following two convolutional layers of the decoder contain filters each. The fourth (and last) convolutional layer of the decoder contains a single filter of size . The main purpose of the last convolutional layer is to reduce the output depth from to . The auto-encoders are trained with the Adam optimizer [15] using the pixel-wise mean squared error as loss function:

$$\mathcal{L}(I, O) = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} (I_{ij} - O_{ij})^2,$$

where $I$ and $O$ are the input and the output images, each of size $h \times w$ pixels (in our case, ).
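As a small illustration of the reconstruction loss, the following Python sketch computes the pixel-wise mean squared error between an input image and its reconstruction. Images are assumed to be 2D lists of numbers, and the function name `pixelwise_mse` is a hypothetical choice; the actual training code relies on TensorFlow rather than on this routine.

```python
def pixelwise_mse(inp, out):
    """Mean of squared per-pixel differences between two equally sized images."""
    h, w = len(inp), len(inp[0])
    if len(out) != h or any(len(row) != w for row in inp + out):
        raise ValueError("input and output must have the same h x w size")
    total = sum((a - b) ** 2
                for row_in, row_out in zip(inp, out)
                for a, b in zip(row_in, row_out))
    return total / (h * w)


perfect = pixelwise_mse([[0, 1], [2, 3]], [[0, 1], [2, 3]])
imperfect = pixelwise_mse([[0, 0], [0, 0]], [[1, 1], [1, 3]])
```

A perfect reconstruction gives a loss of zero, while larger per-pixel deviations are penalized quadratically.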

The auto-encoders learn to represent objects detected in the training video containing only normal behavior. When we provide as input objects with abnormal behavior, the reconstruction error of the auto-encoders is expected to be higher. Furthermore, the latent features should represent known (normal) objects in a different and better way than unknown (abnormal) objects. Some input-output CAE pairs selected from the test videos in each data set considered in the evaluation are shown in Figure 2. We notice that the auto-encoders generally provide better reconstructions for normal objects, confirming our intuition. The final feature vector for each object detection sample is a concatenation of the latent appearance features and the latent motion features. Since the latent activation maps of each CAE are , the final feature vectors have dimensions.
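The way the final feature dimension follows from the architecture can be sketched with simple shape arithmetic. The Python snippet below assumes that each max-pooling layer halves the spatial resolution (mirroring the factor-of-two upsampling in the decoder) and that the latent maps of the three auto-encoders are flattened and concatenated. The concrete numbers used in the example (a 32-pixel input side, 8 latent channels) are purely hypothetical placeholders and are not the values used in the experiments.

```python
def latent_side(input_side, num_pools, pool_stride=2):
    """Spatial side of the latent maps after num_pools strided poolings."""
    side = input_side
    for _ in range(num_pools):
        side //= pool_stride
    return side


def feature_dim(input_side, num_pools, latent_channels, num_encoders=3):
    """Length of the concatenated, flattened latent codes of all encoders."""
    side = latent_side(input_side, num_pools)
    return num_encoders * side * side * latent_channels


# Hypothetical configuration, for illustration only.
side = latent_side(input_side=32, num_pools=3)
dim = feature_dim(input_side=32, num_pools=3, latent_channels=8)
```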

Model training. We propose a novel training approach by formalizing the abnormal event detection task as a multi-class classification problem. The proposed approach aims to compensate for the lack of truly abnormal training samples, by constructing a context in which a subset of normal training samples can play the role of dummy abnormal samples with respect to another subset of normal training samples. This is achieved by clustering the normal training samples into $k$ clusters using k-means. We consider that each cluster represents a certain kind of normality, different from the other clusters. From the perspective of a given cluster $i$, the samples belonging to the other clusters (from the set $\{1, 2, \ldots, k\} \setminus i$) can be viewed as (dummy) abnormal samples. Therefore, we can train a binary classifier $g_i$, in our case an SVM, to separate the positively-labeled data points in cluster $i$ from the negatively-labeled data points in clusters $\{1, 2, \ldots, k\} \setminus i$, as follows:

$$g_i(x) = \sum_{j=1}^{m} w_j \cdot x_j + b,$$

where $x \in \mathbb{R}^m$ is a test sample that must be classified either as normal or abnormal, $w$ is the vector of weights and $b$ is the bias term. We note that the negative samples can actually be considered as more closely related to the samples in cluster $i$ than truly abnormal samples. Hence, the discrimination task is more difficult, and it can help the SVM to select better support vectors. For each cluster $i$, we train an independent binary classifier $g_i$. The final classification score for one data sample is the highest score among the scores returned by the $k$ classifiers. In other words, the classification score for one data sample is selected according to the one-versus-rest scheme, commonly used when binary classifiers are employed for solving multi-class problems.
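The construction of dummy anomalies can be made concrete with a short Python sketch. It assumes that cluster centroids are already available (for instance from k-means) and only shows how each normal sample receives one label per one-versus-rest problem: positive for its own cluster, negative for every other cluster. The function names and the toy data are hypothetical, and the SVM training itself is omitted.

```python
def nearest_centroid(x, centroids):
    """Index of the centroid closest to x in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - c) ** 2 for a, c in zip(x, centroids[i])))


def one_vs_rest_labels(assignments, k):
    """For each cluster i, label its own samples +1 and all others -1.

    The -1 samples are normal data too; they only act as dummy anomalies
    from the point of view of cluster i.
    """
    return [[1 if a == i else -1 for a in assignments] for i in range(k)]


# Toy 2D features and three hypothetical centroids.
samples = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 10)]
centroids = [(0.5, 0), (10.5, 10), (0, 10)]
assignments = [nearest_centroid(x, centroids) for x in samples]
labels = one_vs_rest_labels(assignments, k=len(centroids))
```

Each row of `labels` defines the training targets of one binary classifier, so every sample is a positive example exactly once and a dummy anomaly in all remaining problems.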

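Inference. In the inference phase, each test sample $x$ is classified by the $k$ binary SVM models. The highest classification score is used (with a change of sign) as the abnormality score $s$ for the respective test sample $x$:

$$s(x) = -\max_{i \in \{1, 2, \ldots, k\}} g_i(x),$$

where $g_i(x)$ is the score returned by the binary classifier trained for cluster $i$ and $k$ is the number of clusters.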

By putting together the scores of the objects cropped from a given frame, we obtain a pixel-level anomaly prediction map for the respective frame. If the bounding boxes of two objects overlap, we keep the maximum abnormality score for the overlapping region. To obtain frame-level predictions, we consider the highest score in the prediction map as the anomaly score of the respective frame. We then apply a Gaussian filter to temporally smooth the final frame-level anomaly scores.
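The inference steps can be sketched in plain Python as follows. The sketch assumes linear classifiers given as (weights, bias) pairs, end-exclusive bounding boxes of the form (top, left, bottom, right), a background value of negative infinity for pixels not covered by any detection, and a truncated Gaussian kernel that is renormalized at the borders of the video. These choices, as well as all function names, are assumptions made for illustration and are not specified in the text.

```python
import math


def linear_score(x, w, b):
    """Score of one linear classifier for sample x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def abnormality_score(x, models):
    """Negated highest one-versus-rest score; positive means abnormal."""
    return -max(linear_score(x, w, b) for w, b in models)


def prediction_map(height, width, detections, background=float("-inf")):
    """Pixel-level map; overlapping boxes keep the maximum score."""
    grid = [[background] * width for _ in range(height)]
    for (top, left, bottom, right), score in detections:
        for r in range(max(top, 0), min(bottom, height)):
            for c in range(max(left, 0), min(right, width)):
                grid[r][c] = max(grid[r][c], score)
    return grid


def frame_score(grid):
    """Frame-level score: the highest value in the prediction map."""
    return max(max(row) for row in grid)


def gaussian_smooth(scores, sigma):
    """Temporal smoothing with a truncated, border-renormalized Gaussian."""
    radius = max(1, int(round(3 * sigma)))
    offsets = list(range(-radius, radius + 1))
    kernel = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    smoothed = []
    for t in range(len(scores)):
        num = den = 0.0
        for d, weight in zip(offsets, kernel):
            if 0 <= t + d < len(scores):
                num += weight * scores[t + d]
                den += weight
        smoothed.append(num / den)
    return smoothed


models = [((1, 0), -1), ((0, 1), -1)]  # two hypothetical linear classifiers
grid = prediction_map(3, 3, [((0, 0, 2, 2), 0.5), ((1, 1, 3, 3), 0.9)])
```

With these toy classifiers, a sample accepted by one of them gets a negative abnormality score, whereas a sample rejected by all of them gets a positive one.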

4 Experiments

4.1 Data Sets

We consider four data sets for the abnormal event detection experiments.

Avenue. The Avenue data set [22] is composed of training videos and test videos. In total, the Avenue data set contains frames for training and frames for testing. The resolution of each frame is pixels. The locations of anomalies are annotated in ground-truth pixel-level masks for each frame in the test videos.

ShanghaiTech. The ShanghaiTech Campus data set [23] is one of the largest data sets for anomaly detection in video. Unlike other data sets, it contains different scenes with various lighting conditions and camera angles. There are training videos and test videos. The test set contains a total of abnormal events annotated at the pixel-level. There are frames in the whole data set. The resolution of each video frame is pixels.

UCSD. The UCSD Pedestrian data set [24] is composed of two subsets, namely Ped1 and Ped2. Following Hinami et al. [12], we exclude Ped1 from the evaluation, because it has a significantly lower frame resolution of . Another problem with Ped1 is that some recent works report results only on a subset of videos [26, 27, 34], while others [13, 24, 20, 21] report results on all test videos. We thus consider only UCSD Ped2, which contains training and test videos. The resolution of each frame is pixels. There are frames for training and for testing. The videos illustrate various crowded scenes, and anomalies include bicycles, vehicles, skateboarders and wheelchairs crossing pedestrian areas.

UMN. The UMN Unusual Crowd Activity data set [25] is composed of three different crowded scenes of various lengths. The first scene contains frames, the second scene contains frames, and the third scene contains frames. The resolution of each frame is pixels. In the normal settings people walk around in the scene, and the abnormal behavior is defined as people running in all directions.

4.2 Evaluation

As evaluation metrics, we opt for the ROC curve and the corresponding area under the curve (AUC), computed with respect to ground-truth frame-level annotations. We use the same frame-level AUC definition as in previous works [6, 7, 13, 22, 20, 24, 34]. At the frame-level, a frame is considered a correct detection if it contains at least one abnormal pixel. Before the evaluation, we smooth the pixel-level detection maps with the same filter used by [7, 13, 22], in order to obtain the final abnormality maps.
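For readers who want to reproduce the metric, the frame-level AUC can be computed without building the ROC curve explicitly, since the area under the ROC curve equals the probability that a randomly chosen abnormal frame receives a higher score than a randomly chosen normal frame (ties counted as one half). The Python sketch below uses this pairwise formulation; the function name and the convention that label 1 marks an abnormal frame are assumptions.

```python
def frame_level_auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both normal and abnormal frames are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


auc = frame_level_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The pairwise version is quadratic in the number of frames, which is acceptable for a single video but would normally be replaced by a rank-based computation on large test sets.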

4.3 Parameter and Implementation Details

In the object detection stage, we employ a single-shot detector based on FPN [18] that is pre-trained on the COCO data set [19]. The detector is downloaded from the TensorFlow detection model zoo. For the training set, we keep the detections with a confidence level higher than , and for the test set, we keep those with a confidence level higher than . The convolutional auto-encoders used in the feature learning stage are implemented in TensorFlow [1]. We train the auto-encoders for epochs with the learning rate set to , and for another epochs with the learning rate set to . We use mini-batches of samples. We train independent auto-encoders for each of the four data sets considered in the evaluation. To cluster the training samples, we employ the k-means implementation from VLFeat [33] based on the original Lloyd algorithm [8]. We use k-means++ [3] initialization. We repeat the clustering times and choose the partitioning with the minimum energy. In all the experiments, we set the number of k-means clusters to . We set the regularization parameter of the linear SVM (implemented in VLFeat [33]) to .
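The clustering protocol (k-means++ initialization, Lloyd iterations, several restarts, lowest-energy partition kept) can be sketched in pure Python. This is an illustrative stand-in for the VLFeat implementation, not a reproduction of it; the function names, the `seed` and `restarts` parameters and the toy data are hypothetical, and the energy is taken to be the sum of squared distances to the assigned centers.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: sample centers proportionally to squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        threshold, acc = rng.random() * sum(d2), 0.0
        for p, d in zip(points, d2):
            acc += d
            if acc >= threshold:
                centers.append(p)
                break
        else:
            centers.append(points[-1])
    return centers


def assign_all(points, centers):
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def lloyd(points, centers, max_iter=100):
    """Alternate assignment and mean update until the centers stop moving."""
    for _ in range(max_iter):
        assign = assign_all(points, centers)
        updated = []
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == i]
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[i])
        if updated == centers:
            break
        centers = updated
    assign = assign_all(points, centers)
    energy = sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))
    return centers, assign, energy


def best_of_restarts(points, k, restarts, seed=0):
    """Run k-means several times and keep the minimum-energy partition."""
    rng = random.Random(seed)
    runs = [lloyd(points, kmeans_pp_init(points, k, rng)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assign, energy = best_of_restarts(points, k=2, restarts=5)
```

Restarts matter because Lloyd's algorithm only reaches a local minimum of the energy; keeping the best of several runs reduces the dependence on a single random initialization.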

4.4 Results

Method  Avenue  ShanghaiTech  UCSD Ped2  UMN
Kim et al. [14] - - -
Mehran et al. [25] - -
Mahadevan et al. [24] - - -
Cong et al. [6] - - -
Saligrama et al. [31] - - -
Lu et al. [22] - - -
Dutta et al. [9] - - -
Xu et al. [34, 35] - - -
Hasan et al. [11] -
Del Giorno et al. [7] - -
Zhang et al. [36] - -
Smeureanu et al. [32] - -
Ionescu et al. [13] -
Luo et al. [23] -
Hinami et al. [12] - - -
Ravanbakhsh et al. [27] - -
Sabokrou et al. [30] - - -
Ravanbakhsh et al. [26] - -
Liu et al. [20] -
Liu et al. [21] -
Table 1: Abnormal event detection results (in ) in terms of frame-level AUC on the Avenue [22], the ShanghaiTech [23], the UCSD Ped2 [24] and the UMN [25] data sets. Our framework is compared with several state-of-the-art approaches [6, 7, 9, 11, 12, 13, 14, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 32, 34, 35, 36], which are listed in temporal order.

We compare our approach with several state-of-the-art approaches [6, 7, 9, 11, 12, 13, 14, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 32, 34, 35, 36] on the Avenue, the ShanghaiTech, the UCSD Ped2 and the UMN data sets. The corresponding frame-level AUC scores are presented in Table 1.

Avenue. On the Avenue data set, we are able to surpass the results reported in all previous works. Compared to the most recent works [13, 20, 21, 23, 32], our framework brings an improvement of more than in terms of frame-level AUC. With a frame-level AUC of , our approach is the only method that surpasses the threshold of on the Avenue data set.

We note that Hinami et al. [12] argued that the Avenue test set contains five videos (, , , and ) with static abnormal objects that are not properly annotated. Hence, they only evaluated their approach on a subset (Avenue17) that excludes these five videos. We also compare our performance with that reported in [12], being sure to exclude the same five test videos for a fair comparison. Our frame-level AUC score on the Avenue17 subset is , which is almost better than the frame-level AUC of reported by Hinami et al. [12]. It is worth noting that our approach yields better performance on the Avenue17 subset than on the full Avenue data set, indicating that the five removed test videos are indeed more difficult than those left in Avenue17. As Hinami et al. [12] observed, the removed videos contain abnormal objects that are not properly annotated, hence methods are prone to reach higher false positive rates on these five test videos.

Figure 3: Frame-level anomaly detection scores between and (on the horizontal axis) provided by our approach, for various test videos selected from the Avenue [22], the ShanghaiTech [23], the UCSD Ped2 [24] and the UMN [25] data sets. Ground-truth abnormal events are represented in cyan and our scores are illustrated in red. Best viewed in color.
Figure 4: True positive (left) versus false positive (right) detections of our framework. Examples are selected from the Avenue [22] (first row), the ShanghaiTech [23] (second row), the UCSD Ped2 [24] (third row) and the UMN [25] (fourth row) data sets. Best viewed in color.

Figure 3 (a) depicts the frame-level anomaly scores produced by our approach against the ground-truth labels on test video of the Avenue data set. We notice that our scores correlate well with the ground-truth labels. There are four abnormal events in this video and we can easily identify three of them, without including any false positive detections. We also show some examples of true positive and false positive detections in Figure 4 (top row). The true positive abnormal events are (from left to right) a person running, a person walking in the wrong direction, a person picking up an object and a person throwing an object. The first false positive detection represents two people that are detected in the same bounding box by the object detector. The other false positive detection is a person walking in the wrong direction that is labeled as abnormal too soon.

ShanghaiTech. Since ShanghaiTech is the newest data set for abnormal event detection, there are only a few recent approaches reporting results on this data set [20, 23]. Besides these, Luo et al. [23] additionally evaluated a previously published method [11] when they introduced the data set. On the ShanghaiTech data set, the state-of-the-art performance of is reported by Liu et al. [20]. We outperform their approach by a large margin of . With a frame-level AUC of , our approach is the only one to surpass the threshold on ShanghaiTech.

In Figure 3 (b), we display our frame-level anomaly scores against the ground-truth labels on a ShanghaiTech test video with three abnormal events. On this video, we can clearly observe a strong correlation between our anomaly scores and the ground-truth labels. Some localization results from different scenes in the ShanghaiTech data set are illustrated in the second row of Figure 4. The true positive abnormal events detected by our framework are (from left to right) two bikers in a pedestrian area, a person robbing another person, a person jumping and two people fighting. The false positive abnormal events are triggered because, in each case, there are two people in the same bounding box and our system labels the unusual appearance and motion generated by the two objects as abnormal.

UCSD Ped2. While older approaches [14, 25] report frame-level AUC scores under , most approaches proposed in the last three years [11, 12, 20, 21, 23, 26, 27, 35, 36] reach frame-level AUC scores between and on UCSD Ped2. For instance, the frameworks based on auto-encoders [11, 34, 35] attain results of around . Liu et al. [20] recently outperformed the previous works, reporting a frame-level AUC of . We further surpass their state-of-the-art result, reaching the top frame-level AUC of on UCSD Ped2. Our score is above the score reported by Liu et al. [20], above the second-best score reported by Ravanbakhsh et al. [27], and more than higher than the scores reported by other frameworks based on auto-encoders [11, 34, 35].

As for the other data sets, we compare our frame-level anomaly scores against the ground-truth labels on a test video from UCSD Ped2 in Figure 3 (c). On this particular video, our frame-level AUC is above , indicating that our approach can precisely detect the abnormal event. Furthermore, the qualitative results presented in the third row of Figure 4 show that our approach can also localize the abnormal events from UCSD Ped2. The true positive abnormal events are (from left to right) a biker in a pedestrian area, two bikers in a pedestrian area, two bikers and a skater in a pedestrian area and a biker and a skater in a pedestrian area. As for ShanghaiTech, the false positive abnormal detections are caused by two people in the same bounding box.

UMN. It is worth noting that UMN seems to be the easiest abnormal event detection data set, since most works report frame-level AUC scores above , with some works [9, 27, 30] even surpassing . The top score of is reported by Sabokrou et al. [30], and we reach the same performance on the UMN data set. We note that the second scene seems to be slightly more difficult than the other two scenes, since our frame-level AUC score on this scene is , while the frame-level AUC scores on the other scenes are and , respectively. For this reason, we choose to illustrate the frame-level anomaly scores against the ground-truth labels for the second scene from UMN in Figure 3 (d). Overall, our anomaly scores correlate well with the ground-truth labels, but there are some normal frames with high abnormality scores just before the third abnormal event in the scene.

In the fourth row of Figure 4, we present some localization results provided by our framework. The true positive examples represent people running around in all directions, while the false positive detections are triggered by two people in the same bounding box and a person bending down to pick up an object. We note that the false positive examples are selected from the second scene, as it was impossible to find false positive detections in the other scenes.

4.5 Discussion

Figure 5: Frame-level AUC scores on ShanghaiTech obtained by selecting values for the number of clusters from the set .
Figure 6: Frame-level AUC scores on ShanghaiTech obtained by selecting values for the SVM regularization parameter from the set .

While the results presented in Table 1 show that our approach can outperform the state-of-the-art methods on four evaluation sets, we also aim to address questions about the robustness of our features and parameter choices, and to discuss the running time of our framework.

Parameter selection. We present results with various parameter choices on the largest and most difficult evaluation set, namely ShanghaiTech. We first vary the number of clusters by selecting values in the set . The corresponding frame-level AUC scores, presented in Figure 5, indicate that the number of clusters does not play a significant role in our multi-class classification framework, since the AUC variations are lower than . With only one exception (for ), our results are always higher than . We also vary the regularization parameter of the SVM, by considering values in the set . The corresponding frame-level AUC scores, presented in Figure 6, show that the performance variation is lower than , and that the frame-level AUC scores are always higher than . We believe that this happens because the classes are linearly separable, since they are generated by clustering the samples with k-means into disjoint clusters. Overall, we conclude that our large improvement () over the state-of-the-art approach [20] cannot be explained by a convenient choice of parameters.
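To make the role of the two parameters concrete, the following Python sketch clusters toy samples into k normality clusters, trains one linear classifier per cluster against the rest (the other clusters acting as dummy anomalies), and flags a sample as abnormal when its highest one-versus-rest score is negative. It is a simplified illustration under stated assumptions, not our implementation: the 2D toy points stand in for the CAE features, k-means uses a deterministic farthest-first initialization, the linear SVM is trained by plain subgradient descent on the regularized hinge loss, and all function names and the parameters `lam`, `lr` and `epochs` are hypothetical. The weight `lam` plays the role of an inverse regularization parameter.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Lloyd's k-means with farthest-first initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return labels


def train_linear_svm(points, targets, lam=0.01, lr=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n = len(points)
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for x, y in zip(points, targets):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # sample inside the margin or misclassified
                grad_w = [g - y * xi / n for g, xi in zip(grad_w, x)]
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_rest(points, k):
    """One classifier per normality cluster; the rest are dummy anomalies."""
    labels = kmeans(points, k)
    return [train_linear_svm(points, [1 if l == j else -1 for l in labels])
            for j in range(k)]


def abnormality(x, classifiers):
    """Negated highest one-versus-rest score; positive means abnormal."""
    return -max(sum(wi * xi for wi, xi in zip(w, x)) + b
                for w, b in classifiers)


# Toy "normal" training data: three tight clusters (illustrative values).
centers = [(10.0, 0.0), (-5.0, 8.66), (-5.0, -8.66)]
offsets = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
train = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

classifiers = train_one_vs_rest(train, k=3)
print(abnormality((10.0, 0.0), classifiers) > 0)  # sample inside a cluster
print(abnormality((0.0, 0.0), classifiers) > 0)   # sample claimed by no cluster
```

Sweeping `k` and `lam` in this sketch mirrors the kind of sensitivity analysis reported in Figures 5 and 6, although the toy data says nothing about the actual behaviour on ShanghaiTech.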

Method Score
Frame-level CAE features + one-class SVM (baseline)
Pre-trained SSD features + one-versus-rest SVM
CAE appearance features + one-versus-rest SVM
CAE motion features + one-versus-rest SVM
Combined CAE features + one-class SVM
Combined CAE features + one-versus-rest SVM
Table 2: Frame-level AUC scores (in %) on ShanghaiTech [23] obtained by removing various components from our framework versus a baseline based on frame-level features and one-class SVM.

Ablation results. In Table 2, we present feature ablation results, as well as results for a one-class SVM based on our full object-centric feature set, on the ShanghaiTech data set. When we remove the object detector and train auto-encoders at the frame level, we obtain a frame-level AUC of , which demonstrates the importance of extracting object-centric features. We note that the frame-level auto-encoders have an additional convolutional layer and that their input resolution is increased to . When we replace the object-centric CAE features with pre-trained SSD features (extracted right before the SSD class predictor), the frame-level AUC is only , which shows the importance of learning features with auto-encoders. When we remove either the appearance or the motion object-centric CAE features from our model, the results drop by less than . This shows that both appearance and motion features are relevant for the abnormal event detection task. When we replace our multi-class approach based on k-means and one-versus-rest SVM with a one-class SVM, the performance drops by . This indicates that formalizing the abnormal event detection task as a multi-class problem is indeed useful.

Running time. The single-shot object detector [18] requires about milliseconds to process a single frame. Hence, it can run at about frames per second (FPS). With a reasonable average of objects per frame, our feature extraction and inference stages require about milliseconds per frame. Thus, these stages alone can process about frames per second. However, the entire pipeline requires about milliseconds to infer the anomaly scores for a single frame, which translates to FPS. We note that more than of the processing time is spent detecting objects on a frame-by-frame basis. The running time can therefore be improved by replacing the current object detector with a faster one. We note that all running times were measured on an Nvidia Titan Xp GPU with 12 GB of RAM.
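The relation between per-frame latency and throughput used above is simple arithmetic, sketched below in Python. The timings are hypothetical placeholders chosen only to make the computation easy to follow; they are not the measured values. The sketch also assumes that the detector and the feature extraction and inference stages run sequentially, so that their latencies add up.

```python
def fps(ms_per_frame):
    """Frames per second for a given latency in milliseconds per frame."""
    return 1000.0 / ms_per_frame


# Hypothetical latencies in milliseconds (placeholders, not measurements).
detector_ms = 80.0   # object detection on one frame
features_ms = 20.0   # feature extraction and inference for all objects

# Assumption: the stages run one after the other, so latencies add up.
pipeline_ms = detector_ms + features_ms
detector_share = detector_ms / pipeline_ms

print(fps(detector_ms))    # throughput of the detector alone
print(fps(features_ms))    # throughput of the remaining stages alone
print(fps(pipeline_ms))    # throughput of the whole pipeline
print(detector_share)      # fraction of the time spent on detection
```

Because the pipeline throughput is bounded by the slowest stage, a faster detector is the most direct way to raise the overall FPS under this assumption.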

5 Conclusion and Future Work

We proposed a novel framework for abnormal event detection in video that is based on training object-centric convolutional auto-encoders and on formalizing abnormal event detection as a multi-class problem. The empirical results obtained on four data sets indicate that our approach outperforms a series of state-of-the-art approaches [6, 7, 9, 11, 12, 13, 14, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 32, 34, 35, 36]. In future work, we aim to improve our framework by segmenting and tracking objects.


References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of OSDI, pages 265–283, 2016.
  • [2] B. Antic and B. Ommer. Video parsing for abnormality detection. In Proceedings of ICCV, pages 2415–2422, 2011.
  • [3] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In Proceedings of SODA, pages 1027–1035, 2007.
  • [4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep Clustering for Unsupervised Learning of Visual Features. In Proceedings of ECCV, volume 11218, pages 139–156, 2018.
  • [5] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang. Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression. In Proceedings of CVPR, pages 2909–2917, 2015.
  • [6] Y. Cong, J. Yuan, and J. Liu. Sparse reconstruction cost for abnormal event detection. In Proceedings of CVPR, pages 3449–3456, 2011.
  • [7] A. Del Giorno, J. Bagnell, and M. Hebert. A Discriminative Framework for Anomaly Detection in Large Videos. In Proceedings of ECCV, pages 334–349, 2016.
  • [8] Q. Du, V. Faber, and M. Gunzburger. Centroidal Voronoi Tessellations: Applications and Algorithms. SIAM Review, 41(4):637–676, 1999.
  • [9] J. K. Dutta and B. Banerjee. Online Detection of Abnormal Events Using Incremental Coding Length. In Proceedings of AAAI, pages 3755–3761, 2015.
  • [10] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik. Learning to Segment Moving Objects in Videos. In Proceedings of CVPR, pages 4083–4090, 2015.
  • [11] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis. Learning temporal regularity in video sequences. In Proceedings of CVPR, pages 733–742, 2016.
  • [12] R. Hinami, T. Mei, and S. Satoh. Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge. In Proceedings of ICCV, pages 3639–3647, 2017.
  • [13] R. T. Ionescu, S. Smeureanu, B. Alexe, and M. Popescu. Unmasking the abnormal events in video. In Proceedings of ICCV, pages 2895–2903, 2017.
  • [14] J. Kim and K. Grauman. Observe locally, infer globally: A space-time MRF for detecting abnormal activities with incremental updates. In Proceedings of CVPR, pages 2921–2928, 2009.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015.
  • [16] P. Krähenbühl and V. Koltun. Geodesic Object Proposals. In Proceedings of ECCV, volume 8693, pages 725–739, 2014.
  • [17] W. Li, V. Mahadevan, and N. Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2014.
  • [18] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of CVPR, pages 2117–2125, 2017.
  • [19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In Proceedings of ECCV, pages 740–755, 2014.
  • [20] W. Liu, W. Luo, D. Lian, and S. Gao. Future Frame Prediction for Anomaly Detection – A New Baseline. In Proceedings of CVPR, pages 6536–6545, 2018.
  • [21] Y. Liu, C.-L. Li, and B. Póczos. Classifier Two-Sample Test for Video Anomaly Detections. In Proceedings of BMVC, 2018.
  • [22] C. Lu, J. Shi, and J. Jia. Abnormal Event Detection at 150 FPS in MATLAB. In Proceedings of ICCV, pages 2720–2727, 2013.
  • [23] W. Luo, W. Liu, and S. Gao. A Revisit of Sparse Coding Based Anomaly Detection in Stacked RNN Framework. In Proceedings of ICCV, pages 341–349, 2017.
  • [24] V. Mahadevan, W.-X. Li, V. Bhalodia, and N. Vasconcelos. Anomaly Detection in Crowded Scenes. In Proceedings of CVPR, pages 1975–1981, 2010.
  • [25] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In Proceedings of CVPR, pages 935–942, 2009.
  • [26] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe. Plug-and-Play CNN for Crowd Motion Analysis: An Application in Abnormal Event Detection. In Proceedings of WACV, pages 1689–1698, 2018.
  • [27] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe. Abnormal Event Detection in Videos using Generative Adversarial Nets. In Proceedings of ICIP, pages 1577–1581, 2017.
  • [28] H. Ren, W. Liu, S. I. Olsen, S. Escalera, and T. B. Moeslund. Unsupervised Behavior-Specific Dictionary Learning for Abnormal Event Detection. In Proceedings of BMVC, pages 28.1–28.13, 2015.
  • [29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [30] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette. Deep-cascade: Cascading 3D Deep Neural Networks for Fast Anomaly Detection and Localization in Crowded Scenes. IEEE Transactions on Image Processing, 26(4):1992–2004, 2017.
  • [31] V. Saligrama and Z. Chen. Video anomaly detection based on local statistical aggregates. In Proceedings of CVPR, pages 2112–2119, 2012.
  • [32] S. Smeureanu, R. T. Ionescu, M. Popescu, and B. Alexe. Deep Appearance Features for Abnormal Behavior Detection in Video. In Proceedings of ICIAP, volume 10485, pages 779–789, 2017.
  • [33] A. Vedaldi and B. Fulkerson. VLFeat: An Open and Portable Library of Computer Vision Algorithms. http://www.vlfeat.org/, 2008.
  • [34] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe. Learning Deep Representations of Appearance and Motion for Anomalous Event Detection. In Proceedings of BMVC, pages 8.1–8.12, 2015.
  • [35] D. Xu, Y. Yan, E. Ricci, and N. Sebe. Detecting Anomalous Events in Videos by Learning Deep Representations of Appearance and Motion. Computer Vision and Image Understanding, 156:117–127, 2017.
  • [36] Y. Zhang, H. Lu, L. Zhang, X. Ruan, and S. Sakai. Video anomaly detection based on locality sensitive hashing filters. Pattern Recognition, 59:302–311, 2016.
  • [37] B. Zhao, L. Fei-Fei, and E. P. Xing. Online Detection of Unusual Events in Videos via Dynamic Sparse Coding. In Proceedings of CVPR, pages 3313–3320, 2011.