Anomalies are patterns that deviate from what is expected. In surveillance videos, anomalies consist of illegal behaviors or dangerous situations that could threaten public safety. Automatic systems able to spot and report such events are therefore of strategic importance in real-world applications. For this reason, we aim to detect anomalies such as assaults, robberies, and burglaries in videos. Recently, Sultani et al. [1] proposed a large dataset and a MIL-based solution for this challenging computer vision task, with the aim of bridging the gap between the recording capability of surveillance cameras and the limited number of human monitors. Whereas previous work on the topic only considered full-frame videos [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], we propose to exploit the inherent locality of anomalies and study whether the use of spatiotemporal tubes [13] helps anomaly detection.
We draw inspiration from recent results on the importance of locality in other computer vision challenges, e.g., [14, 15, 16]. Mettes et al. show that even a single spatial point provides valuable information for action localization [14] and that it is possible to localize instances of actions leveraging only video-level labels and pseudo-annotations [15]. Chéron et al. [16] study the impact of different levels of supervision for action localization with a single, flexible model. In their recent work, Hinami et al. [17] run a detector at each frame and compute the anomaly score related to visual concepts such as actions, objects, and attributes. They exploit an environment-specific model to identify unusual features that are likely to explain the abnormal pattern. Although their work employs locality, they only consider single frames instead of action tubes and do not study the benefit given by the detection. Like [1], we also address anomaly detection as a regression problem and propose a model composed of a video encoder followed by a fully trainable regression network. Rather than relying on full-frame video segments, we integrate our model with a tube extraction module that lets the analysis focus on a particular set of spatiotemporal coordinates.
The first and chief contribution of this paper is a new approach to anomaly detection, based on action tubes instead of full frames. As a second contribution, we propose a new trainable model for anomaly detection designed to deal with different locations in the same video segment. Third, to show the potential of our approach, we enrich 100 surveillance videos from UCF-Crime [1] with spatiotemporal annotations for unusual events: UCFCrime2Local is the first dataset for anomaly detection with bounding box supervision in both its training and test sets. Finally, our experiments demonstrate the importance of locality, the robustness of our model to different kinds of errors, and its reliability when adopted to provide weak annotations on new videos. Before detailing our experiments, we discuss our model and the creation of UCFCrime2Local.
2 Locality in Anomaly Detection
Given an input clip, we aim to determine whether the observed scene is normal or anomalous. Equivalently, we want our model to output the probability that an unusual event is taking place in the input video. The output of our model is a continuous number between 0 and 1, so we cast anomaly detection as a regression problem, like [1]. Additionally, we want to focus on the precise location where the anomaly occurs: we do so by including a novel tube extraction module in our architecture. With this approach, we can change the granularity of our analysis from full-frame videos to spatiotemporal tubes. As depicted in Fig. 1, our model consists of three main components: a tube extraction module, a video encoder, and a regression network.
2.1 Tube Extraction
Given an input video and a set of coordinates, our tube extraction module produces a spatiotemporal volume as output. We implement tube extraction as a composition of crop and resize functions applied to each frame of the input video. Note that if the coordinates match the full frame, we do not focus on a precise location. We use this particular setup to evaluate our method on full-frame videos and provide a baseline in our experiments.
Details. Given an input video composed of 16 frames with resolution $H \times W$ and RGB channels, thus having shape $16 \times H \times W \times 3$, we expect a set of 4 coordinates, identifying a bounding box, for each frame. We crop the portion of the video corresponding to the provided set of locations and then resize each frame to $H \times W$. Thus, the input and output shapes of this block are equal.
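The crop-and-resize composition described above can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: the function name `extract_tube`, the `(x1, y1, x2, y2)` box layout with exclusive upper bounds, and the nearest-neighbour resize are all assumptions made for the example.

```python
import numpy as np

def extract_tube(video, boxes, out_size):
    """Crop each frame to its bounding box, then resize to a fixed resolution.

    video:    array of shape (T, H, W, C)
    boxes:    integer array of shape (T, 4), one (x1, y1, x2, y2) per frame,
              with x2 and y2 exclusive (assumed layout)
    out_size: (out_h, out_w) resolution of every output frame
    """
    out_h, out_w = out_size
    tube = np.empty((video.shape[0], out_h, out_w, video.shape[3]),
                    dtype=video.dtype)
    for t, (x1, y1, x2, y2) in enumerate(boxes):
        crop = video[t, y1:y2, x1:x2]
        # Nearest-neighbour resize: map every output pixel to a source pixel.
        rows = np.arange(out_h) * crop.shape[0] // out_h
        cols = np.arange(out_w) * crop.shape[1] // out_w
        tube[t] = crop[rows[:, None], cols[None, :]]
    return tube
```

When every box covers the full frame and `out_size` equals the input resolution, the function returns the input unchanged, which corresponds to the full-frame baseline setup.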
2.2 Video Encoder
Given an input tube, we want to obtain a visual representation that encodes action information. Different from [17, 19, 20], who use a 2D image-based architecture, we take advantage of the temporal dynamics of the video. Sultani et al. [1] use C3D [21], a 3D ConvNet, to extract features from a starting video segment. Recently, Carreira and Zisserman [18] proposed to inflate the 2D filters of a convolutional network to 3D kernels (I3D) to take advantage of the spatiotemporal nature of the video. By pre-training on ImageNet [22] and Kinetics [23], their model achieves state-of-the-art results for action recognition. We adopt I3D to encode information from our input video. We also follow the well-known two-stream approach [24] to combine appearance and motion information, which was successfully applied earlier to other computer vision tasks such as action localization [25] and actor and action segmentation [26].
We extract features from the inception block before the last max-pooling layer, then we apply average pooling along the temporal dimension, similar to prior work. The result of this operation is a feature cube. We apply the same procedure to optical flow, which is computed using the algorithm described by Farnebäck [27]. We concatenate the two volumes, corresponding to RGB and flow, after the average pooling layer.
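The pooling and fusion step can be illustrated with a short sketch. The function name `pool_and_fuse` and the `(T', H', W', C)` feature layout are assumptions for the example; the actual feature shapes produced by the encoder are not stated here, so the usage below relies on small placeholder shapes.

```python
import numpy as np

def pool_and_fuse(rgb_feat, flow_feat):
    """Average each stream over time, then concatenate along channels.

    rgb_feat, flow_feat: feature volumes of shape (T', H', W', C),
    an assumed layout with the temporal axis first.
    Returns a feature cube of shape (H', W', 2 * C).
    """
    rgb = rgb_feat.mean(axis=0)    # temporal average pooling, RGB stream
    flow = flow_feat.mean(axis=0)  # temporal average pooling, flow stream
    return np.concatenate([rgb, flow], axis=-1)

# Placeholder shapes, chosen only to make the example small.
rgb_demo = np.ones((2, 7, 7, 8))
flow_demo = np.zeros((2, 7, 7, 8))
fused_demo = pool_and_fuse(rgb_demo, flow_demo)
```

Concatenating after pooling keeps the spatial layout of both streams aligned, so a later convolution can still reason over spatial positions.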
2.3 Regression Network
Given an input volume encoding the information about appearance and motion, our regression network outputs the anomaly score relative to the starting video. Since the score ranges between 0 and 1, we can interpret it as the probability that an unusual event is taking place in the investigated tube. Given an input video $v$, its anomaly score $s(v)$ must comply with the following:
$$
\begin{cases}
0 \le s(v) < \tau & \text{if } v \text{ is normal} \\
\tau \le s(v) \le 1 & \text{if } v \text{ is anomalous}
\end{cases}
\tag{1}
$$
where the threshold $\tau$ drives the binary classification into normal and anomalous videos. Ideally, anomalous segments will score close to 1, while regular videos will map into values approaching 0.
In our model, we perform regression with a convolutional layer, followed by a stack of fully-connected layers. The convolution helps to preserve the dependencies over the feature maps along the spatial dimension while drastically reducing the number of parameters needed before the first fully-connected layer.
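Our training sample consists of a 16-frame video segment $v$, a set of coordinates describing a spatiotemporal tube, denoted as $c$, and a ground truth label $y$. We employ the mean squared error as objective function:
$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( s(v_i, c_i) - y_i \right)^2
\tag{2}
$$
where $s(v_i, c_i)$ is the anomaly score relative to the action tube extracted from the input video $v_i$, $y_i \in \{0, 1\}$ is a binary label, and $N$ is the number of samples in the mini-batch.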
We train our model using the SGD optimizer with a learning rate of 0.001 and Nesterov momentum of 0.9. We randomly select 5 video segments as a mini-batch, and we train for 10 epochs. We compute the loss for each mini-batch as shown in Eq. 2 and back-propagate the gradients through the regression network.
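As an illustration of the training objective and the optimizer settings, the sketch below computes a mean squared error over a mini-batch and performs one SGD update with Nesterov momentum. The helper names are hypothetical, and the update rule follows one common formulation of Nesterov momentum (the one used by several deep learning frameworks); the exact variant used in the paper is not specified.

```python
import numpy as np

def mse_loss(scores, labels):
    """Mean squared error between predicted anomaly scores and binary labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((scores - labels) ** 2))

def nesterov_sgd_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD update with Nesterov momentum (assumed formulation).

    velocity <- momentum * velocity + grad
    param    <- param - lr * (grad + momentum * velocity)
    """
    velocity = momentum * velocity + grad
    param = param - lr * (grad + momentum * velocity)
    return param, velocity

# Toy mini-batch: two predicted scores against their binary labels.
loss_demo = mse_loss([0.9, 0.2], [1, 0])
param_demo, velocity_demo = nesterov_sgd_step(1.0, 2.0, 0.0)
```

The defaults mirror the learning rate of 0.001 and momentum of 0.9 reported above; the same function works element-wise on NumPy arrays of parameters.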
3 Our Dataset
To the best of our knowledge, none of the existing anomaly detection datasets [1, 7, 29, 30, 31] provides spatiotemporal annotations for unusual events in its training set. To overcome the lack of labeled data, we enrich a portion of the recently proposed UCF-Crime [1] with spatiotemporal annotations. We start by selecting six of the 13 anomalous categories that are present in UCF-Crime, with particular attention to human-based anomalies: Arrest, Assault, Burglary, Robbery, Stealing, and Vandalism. We then select 100 videos belonging to the designated categories, resulting in more than an hour of video sequences. Finally, we annotate bounding boxes for anomalous events. Although we do not use them in our experiments, action class labels for tasks such as action recognition or localization are available. After the annotation process, we add 200 negative clips from UCF-Crime, leveraging the fact that normal samples do not require further annotation. We split the dataset into training and test sets, composed of 210 and 90 videos respectively. Table 1 reports some statistics about the videos in UCFCrime2Local. Our annotations are publicly available at http://imagelab.ing.unimore.it/UCFCrime2Local.
Since the videos mainly contain human-based anomalies, one way to speed up the annotation would be to run a person detector on each frame and then merge the boxes along the temporal dimension. Although this approach is appealing, it suffers from several problems that make it inconvenient. First of all, it could not be generalized to annotate non-human events, such as road accidents or explosions. Second, not all anomalies are single-person events: how could we capture the complexity of the interactions occurring in a fight scene? What about a robber threatening a victim? Moreover, while the availability of off-the-shelf detectors makes this choice appealing, a failure in the detection phase could seriously compromise the whole annotation process. For all of these reasons, we decide to annotate bounding boxes by hand. We choose Vatic [32] for its intuitive user interface, which makes the annotation relatively fast.
Annotation Policy. We are aware that annotating spatiotemporal tubes is a delicate process that can be subject to many types of errors. For this reason, we adopt an annotation policy to enforce consistency in our work. An abnormal event comprises the main characters directly performing the action and any secondary players that are involved. For instance, one or more police officers are the main actors in an arrest scene, while we include the captured person as a secondary actor. Concerning the temporal dimension, multiple anomalies of the same type can appear throughout an entire video: an anomaly begins when all the actors are visible, and it ends when they leave the scene. Regarding the spatial dimension, a single frame can contain only one bounding box, which should be large enough to include all the information about the anomalous action. For instance, in a stealing situation, the stolen object should be included in the bounding box, as it provides relevant information about the action. Fig. 2 reports more examples from our annotations.
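Table 1: Statistics of the videos in UCFCrime2Local. Values in parentheses refer to the training set.
| | Anomalous | Normal |
| Number of videos | 100 (69) | 200 (141) |
| Total length (min) | 66.3 | 112.1 |
| Average length (sec) | 39.8 | 33.6 |
| Min/Max length (sec) | 4.6/135.9 | 6.8/59.8 |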
4 Experiments
4.1 Full Frames vs. Action Tubes
In our first experiment, we test whether the use of locality can ease anomaly detection. Thanks to our flexible model and our annotations, we distinguish two settings:
Video Segment. We train our model on full frames without the information about the locality. In this setup, we can think of our tube extraction module as performing identity mapping from its input to the output. With a similar approach, we test this network on whole-frame videos.
Oracle Tube. During training, the coordinates given to the tube extractor are our ground-truth spatial annotations, hence the name Oracle Tube. We do the same at test time, challenging the network to discriminate between ordinary action tubes and atypical ones.
Results. In Table 2, we report the results for this experiment. Following prior work, we use the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) to evaluate our method. The Oracle Tube setup, which uses locality, performs better than the Video Segment approach. This considerable improvement demonstrates that locality does ease anomaly detection.
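For reference, the area under the ROC curve can be computed directly from scores and binary labels as the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name `roc_auc` and the toy values are illustrative and do not correspond to any reported result.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the fraction of (anomalous, normal) pairs in which the
    anomalous sample receives the higher score; ties count as 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy example: three of the four (anomalous, normal) pairs are ranked correctly.
auc_demo = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a test set of this size; a rank-based computation gives the same value more efficiently.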
4.2 Robustness to Localization Errors
In our second experiment, we progressively relax the “oracle” assumption during the testing phase of our Oracle Tube network. We aim to test the robustness of our model to localization errors, exploring at the same time the spatial extent of anomalies. We do that by perturbing the ground-truth coordinates supplied to our tube extraction module at test time with different kinds of errors. First, we shrink each side of the bounding boxes by different factors. Second, we progressively enlarge the bounding boxes until they incorporate the full frame. Finally, we keep the original box dimensions while translating the center from its ideal position, applying translations of different magnitudes towards the top-left corner of the frame.
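The three kinds of perturbation can be sketched with two small helpers: scaling a box about its center (factors below 1 shrink it, factors above 1 enlarge it until it saturates at the frame borders) and shifting it towards the top-left corner. The helper names, the `(x1, y1, x2, y2)` box layout, the clipping behaviour at the frame borders, and the numeric values in the usage are assumptions for illustration only.

```python
def scale_box(box, factor, frame_w, frame_h):
    """Scale each side of a box about its center and clip it to the frame."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * factor / 2.0
    half_h = (y2 - y1) * factor / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(frame_w), cx + half_w), min(float(frame_h), cy + half_h))

def translate_box(box, shift):
    """Shift a box towards the top-left corner, keeping its dimensions.

    The shift is limited so that the box never leaves the frame
    (an assumption; other border policies are possible).
    """
    x1, y1, x2, y2 = box
    dx = min(shift, x1)
    dy = min(shift, y1)
    return (x1 - dx, y1 - dy, x2 - dx, y2 - dy)

# Hypothetical values on a 100 x 100 frame.
shrunk_demo = scale_box((40, 40, 80, 80), 0.5, 100, 100)
grown_demo = scale_box((40, 40, 80, 80), 10, 100, 100)
shifted_demo = translate_box((40, 40, 80, 80), 10)
```

A sufficiently large scaling factor makes the box coincide with the full frame, which connects this perturbation to the full-frame baseline.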
Results. Table 3 shows that a network trained with information about the locality performs better than the full-frame baseline even when dealing with localization errors at test time. In particular, we find that for boxes somewhat larger than the ones we provided, results are very close to the “ideal” case, or even better. We conclude that adding some contextual information to action tubes helps anomaly detection.
4.3 Independence and Reliability
Finally, we want to test what occurs when we pick multiple action tubes from a video. Ideally, spatiotemporal volumes containing anomalies will score higher than regular ones. In this experiment, we let our model predict the anomaly scores for different tubes from each video segment in our test set. In this way, we obtain multiple anomaly scores for a single input clip, each of them related to a particular tube. We then apply different aggregation functions to determine the final score and report the results in Table 4.
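The aggregation step can be sketched as a single function that reduces the per-tube scores of a clip to one value, either by taking the maximum or by averaging the k highest scores. The function name `aggregate` and the toy scores are illustrative assumptions.

```python
import numpy as np

def aggregate(tube_scores, k=None):
    """Reduce per-tube anomaly scores of one clip to a single clip score.

    k=None returns the maximum score; otherwise the mean of the
    k highest scores (or of all scores if fewer than k are given).
    """
    ranked = np.sort(np.asarray(tube_scores, dtype=float))[::-1]
    if k is None:
        return float(ranked[0])
    return float(ranked[:k].mean())

# Toy scores for four tubes of the same clip.
max_demo = aggregate([0.1, 0.9, 0.5, 0.7])
top2_demo = aggregate([0.1, 0.9, 0.5, 0.7], k=2)
```

Taking the maximum rewards a single highly anomalous tube, while a top-k average is less sensitive to one spurious high score.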
Weakly-supervised Approach. The procedure described above is a way to provide weak annotations for unseen videos, as it yields a set of spatiotemporal tubes and their corresponding anomaly scores. A simple method to test the reliability of our model is to use these proposals as weak annotations for training a weakly-supervised network. For this purpose, we collect 100 new anomalous videos from UCF-Crime [1] belonging to the six categories described in Sec. 3. We employ our model to get spatiotemporal proposals in the following way: we keep the highest score for each input clip along with its action tube, then we set the target label according to the rule:
$$
y =
\begin{cases}
1 & \text{if } s \ge \theta \\
0 & \text{otherwise}
\end{cases}
\tag{3}
$$
where $s$ is the highest anomaly score for the clip and $\theta$ is a threshold for our weakly-supervised proposals. In this experiment, we set $\theta$ empirically. We repeat the same process for negative samples, fixing the label $y = 0$. We employ this new set of videos, coordinates, and labels to train our model from scratch. In Fig. 3, we compare our weakly-supervised extension to our previous settings trained with full supervision.
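The proposal step can be sketched as follows, assuming that the rule assigns a positive label when the best tube score reaches the threshold and a negative label otherwise. The function name `weak_annotation`, the tube representation, and the threshold value in the usage are hypothetical; the threshold used in the experiment is not reproduced here.

```python
import numpy as np

def weak_annotation(tube_scores, tubes, theta):
    """Keep the highest-scoring tube of a clip and derive its weak label.

    Returns (best_tube, label), where label is 1 if the best score
    reaches the threshold theta and 0 otherwise (assumed rule).
    """
    best = int(np.argmax(tube_scores))
    label = 1 if tube_scores[best] >= theta else 0
    return tubes[best], label

# Toy clip with three candidate tubes and a hypothetical threshold of 0.5.
tube_demo, label_demo = weak_annotation([0.2, 0.8, 0.5], ["a", "b", "c"], 0.5)
```

For negative clips the label would simply be fixed to 0, independently of the predicted score.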
Results. Our weakly-supervised approach achieves better results than our strongly-supervised network. This is not due solely to the proposals made by the latter, but also to the strong supervision that, though indirectly, influenced the training of our weakly-supervised extension. In this way, we show that proposals made over the new videos are valuable spatiotemporal annotations.
Table 4: Aggregation functions compared: Max, Top-10 avg, Top-20 avg, Top-30 avg, Top-45 avg.
5 Conclusion
In this paper, we explore the importance of locality in anomaly detection. To that end, we enrich a portion of an existing dataset with spatiotemporal annotations, creating the first dataset for anomaly detection with bounding box supervision in both its training and test sets. We propose a new anomaly detection model for dealing with different spatiotemporal volumes in a single video. Experimental results show that: (1) locality helps anomaly detection; (2) our method is robust to different kinds of localization errors at test time; and (3) it can provide spatiotemporal proposals over a potentially large set of unseen videos. These proposals are shown to be valuable for training a new weakly-supervised network. We hope future work will follow our exploration and consider locality when dealing with anomaly detection.
- [1] Waqas Sultani, Chen Chen, and Mubarak Shah, “Real-world anomaly detection in surveillance videos”, in CVPR, 2018.
- [2] Jaechul Kim and Kristen Grauman, “Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates”, in CVPR, 2009.
- [3] Ramin Mehran, Alexis Oyama, and Mubarak Shah, “Abnormal crowd behavior detection using social force model”, in CVPR, 2009.
- [4] Chang Liu, Guijin Wang, Wenxin Ning, Xinggang Lin, Liang Li, and Zhou Liu, “Anomaly detection in surveillance video using motion direction statistics”, in ICIP, 2010.
- [5] Xinyi Cui, Qingshan Liu, Mingchen Gao, and Dimitris N Metaxas, “Abnormal detection using interaction energy potentials”, in CVPR, 2011.
- [6] Bin Zhao, Li Fei-Fei, and Eric P Xing, “Online detection of unusual events in videos via dynamic sparse coding”, in CVPR, 2011.
- [7] Cewu Lu, Jianping Shi, and Jiaya Jia, “Abnormal event detection at 150 fps in matlab”, in ICCV, 2013.
- [8] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe, “Learning deep representations of appearance and motion for anomalous event detection”, in BMVC, 2015.
- [9] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis, “Learning temporal regularity in video sequences”, in CVPR, 2016.
- [10] Allison Del Giorno, J Andrew Bagnell, and Martial Hebert, “A discriminative framework for anomaly detection in large videos”, in ECCV, 2016.
- [11] Ziping Zhu, Jingjing Wang, and Nenghai Yu, “Anomaly detection via 3d-hof and fast double sparse representation”, in ICIP, 2016.
- [12] Weixin Luo, Wen Liu, and Shenghua Gao, “A revisit of sparse coding based anomaly detection in stacked rnn framework”, in ICCV, 2017.
- [13] Mihir Jain, Jan Van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees GM Snoek, “Action localization with tubelets from motion”, in CVPR, 2014.
- [14] Pascal Mettes, Jan C van Gemert, and Cees GM Snoek, “Spot on: Action localization from pointly-supervised proposals”, in ECCV, 2016.
- [15] Pascal Mettes, Cees GM Snoek, and Shih-Fu Chang, “Localizing actions from video labels and pseudo-annotations”, in BMVC, 2017.
- [16] Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, and Cordelia Schmid, “A flexible model for training action localization with varying levels of supervision”, in NIPS, 2018.
- [17] Ryota Hinami, Tao Mei, and Shin’ichi Satoh, “Joint detection and recounting of abnormal events by learning deep generic knowledge”, in ICCV, 2017.
- [18] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset”, in CVPR, 2017.
- [19] Jefferson Ryan Medel and Andreas Savakis, “Anomaly detection in video using predictive convolutional long short-term memory networks”, arXiv preprint arXiv:1612.00390, 2016.
- [20] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao, “Future frame prediction for anomaly detection–a new baseline”, arXiv preprint arXiv:1712.09867, 2017.
- [21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks”, in ICCV, 2015.
- [22] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge”, International Journal of Computer Vision, 2015.
- [23] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., “The kinetics human action video dataset”, arXiv preprint arXiv:1705.06950, 2017.
- [24] Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos”, in NIPS, 2014.
- [25] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid, “Action tubelet detector for spatio-temporal action localization”, in ICCV, 2017.
- [26] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek, “Actor and action video segmentation from a sentence”, in CVPR, 2018.
- [27] Gunnar Farnebäck, “Two-frame motion estimation based on polynomial expansion”, in Scandinavian Conference on Image Analysis, 2003.
- [28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting”, The Journal of Machine Learning Research, 2014.
- [29] “Unusual crowd activity dataset of university of minnesota”, http://mha.cs.umn.edu/.
- [30] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz, “Robust real-time unusual event detection using multiple fixed-location monitors”, TPAMI, 2008.
- [31] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos, “Anomaly detection in crowded scenes”, in CVPR, 2010.
- [32] Carl Vondrick, Donald Patterson, and Deva Ramanan, “Efficiently scaling up crowdsourced video annotation”, International Journal of Computer Vision, 2013.