Accurate perception of accident scenarios is a key challenge for advanced driver assistance systems (ADAS). Is an accident going to happen? Who will be involved? What type of accident is it? These critical questions demand detection, localization, and classification of on-road anomalies for proper reaction and event data recording. We propose accident perception as a When-Where-What pipeline: When the anomalous event starts and ends, Where the anomalous regions are in each video frame, and What the anomaly type is. This pipeline can be mapped to two computer vision tasks: video anomaly detection (VAD) and video action recognition (VAR). VAD predicts per-frame anomaly scores to answer the When question, and computes per-pixel or per-object anomaly scores as an intermediate step to implicitly answer the Where question. VAR classifies video type to answer the What question.
Training deep learning-based methods for VAD and VAR has been made possible by large-scale labeled datasets. There are video datasets available for surveillance applications of VAD, including CUHK, ShanghaiTech Campus, and UCF-Crime, and for human activity recognition for VAR, including Sports-1M and Kinetics. For traffic anomalies, recent first-person video datasets such as StreetAccident and A3D have annotations of anomaly start and end times, while DADA provides human attention maps from video spectator eye-gaze. However, no large-scale dataset and benchmark yet covers the full When-Where-What pipeline.
This paper introduces Detection of Traffic Anomaly (DoTA), a large-scale benchmark dataset for traffic VAD and VAR. DoTA contains 4,677 videos with 18 anomaly categories and multiple anomaly participants in different driving scenarios. DoTA provides rich annotation for each anomaly: type (category), temporal annotation, and anomalous object bounding box tracklets. Taking advantage of this large-scale dataset with rich anomalous object annotations, we propose a novel VAD evaluation metric called Spatio-temporal Area Under Curve (STAUC). STAUC is motivated by the popular frame-level Area Under Curve (AUC). While AUC uses a per-frame anomaly score, which is usually averaged from a pixel-level or object-level score map, STAUC takes the score map itself and computes how much of it overlaps with the annotated anomalous region. This overlap ratio is used as a weighting factor for true positive predictions. STAUC thus has AUC as its upper bound.
We benchmark existing VAD baselines and state-of-the-art methods on DoTA using both AUC and STAUC. We also propose a simple-but-effective ensemble method that improves the performance of any single approach, offering a new direction to explore. Extensive experiments show the importance of using this new metric in VAD research. To further complete the pipeline, we also benchmark recent VAR methods such as R(2+1)D  and SlowFast  on DoTA. Experiments show that applying generalized VAR methods to traffic anomaly understanding is far from perfect, motivating more research in this area.
This paper offers three contributions. First, we introduce DoTA, a large-scale ego-centric traffic video dataset to support VAD and VAR; to the best of our knowledge this is the largest traffic video anomaly dataset and the first containing detailed temporal, spatial, and categorical annotations. Second, we identify problems with the commonly-used AUC metric and propose a new spatio-temporal evaluation metric (STAUC) to address them. We benchmark state-of-the-art VAD methods with both AUC and STAUC and show the effectiveness of our new metric. Finally, we provide benchmarks of state-of-the-art VAR algorithms on DoTA, which we hope will encourage further research to manage challenging ego-centric traffic video scenarios.
2 Related Work
Existing Video Anomaly Detection (VAD) datasets are generally from surveillance cameras. For example, UCSD Ped1/Ped2, CUHK Avenue, and ShanghaiTech were collected from campus surveillance cameras and include anomalies like prohibited objects and abnormal movements, while UCF-Crime includes accidents, robbery, and theft. Anomaly detection in egocentric traffic videos has very recently attracted attention. Chan et al. propose the StreetAccident dataset of on-road accidents with 620 video clips collected from dash cameras. The last ten frames of each clip are annotated as anomalous. Yao et al. propose the A3D dataset containing 1,500 anomalous videos in which abnormal events are annotated with the start and end times. Fang et al. introduce the DADA dataset for driver attention prediction in accidents, while Herzig et al. extract a collision dataset with 803 videos from BDD100K. In contrast, our DoTA dataset is much larger (nearly 5,000 videos) and, more importantly, contains richer annotations that support the whole When-Where-What anomaly analysis pipeline.
Existing VAD models mainly focus on the When problem but are also implicitly related to Where. Hasan et al. propose a convolutional Auto-Encoder (ConvAE) to model the normality of video frames by reconstructing stacked input frames. Convolutional LSTM Auto-Encoders (ConvLSTMAE) are used in [30, 6, 29] to capture regular visual and motion patterns. Luo et al. propose a stacked RNN for temporally-coherent sparse coding (TSC-sRNN). Liu et al. detect anomalies by looking for differences between predicted future frames and actual observations. Gong et al. propose a MemAE network to query pre-saved memory units for reconstruction, while Wang et al. design generalized one-class sub-spaces for discriminative regularity modeling. Other work has recently studied object-centric approaches, including that of Ionescu et al. and a method that models human skeleton regularity with local-global autoencoders and computes per-object anomaly scores. VAD in egocentric traffic scenarios is a new and challenging problem due to dynamic foreground and background, perspective projection, and complicated scenes. The most related work to ours is TAD, which predicts future object bounding boxes from past time steps with RNN encoder-decoders, where the standard deviation of predictions serves as the anomaly score. We benchmark state-of-the-art VAD methods and their variants on the DoTA dataset.
Action Recognition methods address the What problem to classify traffic anomalies. Two-stream networks and temporal segment networks (TSN) leverage RGB and optical flow data. Tran et al. first proposed 3D convolutional networks (C3D) for spatiotemporal modeling, followed by an inflated model. Recent work substitutes 3D convolution with 2D and 1D convolution blocks (R(2+1)D) to improve effectiveness and efficiency. Feichtenhofer et al. propose the SlowFast model to extract video features from low and high frame rate streams. Online action detection in untrimmed, streaming videos is addressed by De Geest et al., while Gao et al. propose a reinforced encoder-decoder (RED) to tackle action prediction and online action recognition. Shou et al. model temporal consistency with a generative adversarial network (GAN). Xu et al. propose a temporal recurrent network (TRN) leveraging future prediction to aid online action detection. Gao et al. use reinforcement learning to detect the start time of actions. We benchmark VAR methods on the DoTA dataset and discuss online action detection in the supplement.
3 The Detection of Traffic Anomaly (DoTA) Dataset
We introduce DoTA, the first publicly-available When-Where-What pipeline dataset with temporal, spatial, and categorical annotations. (The dataset will be made publicly available upon publication.) To build DoTA, we collected more than 6,000 video clips from YouTube channels and selected diverse dash camera accident videos from different countries under different weather and lighting conditions. We excluded videos in which the accident was not visible or the camera fell off the windshield, resulting in 4,677 videos. We extracted frames at 10 fps, below the original frame rate, for annotations and experiments in this paper. Table 1 compares DoTA with other ego-centric traffic anomaly datasets.
|Dataset||# videos||# frames||Annotations|
|DADA||2,000||648,476 (30fps)||temporal, spatial (eye-gaze)|
|DoTA||4,677||731,932 (10fps)||temporal, spatial (tracklets), categories|
We annotated the dataset using a custom tool based on Scalabel (https://scalabel.ai/). Labeling traffic anomalies is subjective, especially for properties like start and end times. To produce high-quality annotations, each video was labeled by three annotators; the temporal and spatial annotations were merged by taking the average, and the categorical annotations by taking the mode, to minimize individual biases. Our 12 human annotators had different levels of driving experience.
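The merging rule can be illustrated with a short sketch. The function name `merge_annotations`, the frame-index representation of start and end times, and the rounding to the nearest frame are assumptions made for illustration; the annotation tool's actual logic is not described in this paper.

```python
from statistics import mean, mode

def merge_annotations(start_frames, end_frames, categories):
    """Merge the labels that several annotators gave to one video.

    Temporal labels are averaged (and rounded to a frame index, which is an
    assumption of this sketch); the categorical label is the mode.
    """
    return {
        "start": round(mean(start_frames)),
        "end": round(mean(end_frames)),
        "category": mode(categories),
    }

# Three hypothetical annotators labeling the same video.
merged = merge_annotations([48, 50, 52], [90, 95, 100], ["OC", "OC", "TC"])
print(merged)
```

Bounding boxes could be averaged coordinate-wise in the same way as the temporal labels.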
|ID||Short||Anomaly category|
|1||ST||Collision with another vehicle which starts, stops, or is stationary|
|2||AH||Collision with another vehicle moving ahead or waiting|
|3||LA||Collision with another vehicle moving laterally in the same direction|
|4||OC||Collision with another oncoming vehicle|
|5||TC||Collision with another vehicle which turns into or crosses a road|
|6||VP||Collision between vehicle and pedestrian|
|7||VO||Collision with an obstacle in the roadway|
|8||OO||Out-of-control and leaving the roadway to the left or right|
|9||UK||Unknown|
Temporal Annotations. Each DoTA video is annotated with anomaly start and end times, which separate it into three temporal partitions: the precursor, which is normal video preceding the anomaly; the anomaly window; and the post-anomaly, which is normal activity following the anomaly. Duration distributions are shown in Fig. 2(a). Since early detection is essential for on-road anomalies [4, 35], we asked the annotators to estimate the anomaly start as the time when the anomaly became inevitable. The anomaly end was meant to be the time when all anomalous objects are out of the field of view or are stationary. Our annotation differs from prior schemes in which a frame is marked as the anomaly start if half of the anomaly participant appears in the camera view; such a start time can be too early because anomaly participants often appear for a while before they start to behave abnormally. Our annotation is also distinct from schemes in which the anomaly start is marked when a crash happens, which does not support early detection.
Spatial Annotations. DoTA is the first traffic anomaly dataset to provide detailed spatio-temporal annotation of anomalous objects. Each anomaly participant is assigned a unique track ID, and its bounding box is labeled from anomaly start to anomaly end or until the object is out of view. We consider seven common traffic participant categories: person, car, truck, bus, motorcycle, bicycle, and rider, following the BDD100K style. Statistics of object categories and per-video anomalous object numbers are shown in Fig. 2(c) and 2(d). DADA also provides spatial annotations by capturing video observers’ eye-gaze for driver attention studies. However, they have shown that eye-gaze does not always coincide with the anomalous region, and that gaze can have a 1 to 2 second delay from anomaly start. Thus our tracklets provide improved annotation for spatio-temporal anomaly detection studies.
Anomaly Categories. Each DoTA video is assigned one of the 9 categories listed in Table 2, as defined in prior work. We have observed that videos of the same anomaly category taken from different viewpoints are visually distinct, as shown in Fig. 2. Therefore we split each category into ego-involved and non-ego (marked with *), resulting in 18 categories total. Sometimes the category can be ambiguous, particularly when one anomaly is followed by another. For example, an oncoming out-of-control (OO*) vehicle might result in an oncoming collision (OC) with the ego vehicle. In such cases, we annotate the anomaly category as the dominant one in the video, i.e., the one that lasts longer during the anomaly period. The distribution of videos of each category is shown in Fig. 2(b).
4 Video Anomaly Detection (VAD) Methods
We benchmark both unsupervised and supervised VAD. Unsupervised VAD is divided into frame-level and object-centric methods according to different input and output types. Supervised VAD is similar to temporal action detection but outputs a binary label indicating anomaly or no-anomaly.
4.1 Frame-level Unsupervised VAD
Frame-level unsupervised VAD methods detect anomalies by either reconstructing past frames or predicting future frames and computing the reconstruction or prediction error. We benchmark three methods and their variants in this paper.
ConvAE is a spatio-temporal autoencoder model which encodes temporally stacked images with 2D convolutional encoders and decodes with deconvolutional layers to reconstruct the input (Fig. 3(a)). The per-pixel reconstruction error forms an anomaly score map $\Delta_I$, and the mean squared error (MSE) is computed as a frame-level anomaly score,
$$\mathrm{MSE} = \frac{1}{|M|}\sum_{i \in M} \Delta_I(i), \qquad \Delta_I(i) = \left(I(i) - \hat{I}(i)\right)^2, \tag{1}$$
where $I$ and $\hat{I}$ are the ground truth and reconstructed/predicted frames, $M$ represents all frame pixels, and $\Delta_I$ is also called the anomaly score map. To further compare the effectiveness of image and motion features, we implement ConvAE(gray) and ConvAE(flow) to reconstruct the grayscale image and the dense optical flow, respectively. The input to ConvAE(flow) is a stack of historical flow maps acquired from pre-trained FlowNet2.
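The frame-level score described above can be sketched in NumPy. This is a minimal illustration, not the benchmarked implementation: the function names are hypothetical, and averaging over a color-channel axis is an assumption made so that the map is always height by width.

```python
import numpy as np

def anomaly_score_map(frame, reconstruction):
    """Per-pixel squared error between a frame and its reconstruction."""
    diff = frame.astype(np.float64) - reconstruction.astype(np.float64)
    squared = diff ** 2
    # Assumption: collapse a trailing channel axis so the map is H x W.
    return squared.mean(axis=-1) if squared.ndim == 3 else squared

def frame_score(score_map):
    """Frame-level anomaly score: mean of the score map over all pixels."""
    return float(score_map.mean())

frame = np.zeros((4, 4))
recon = np.zeros((4, 4))
recon[1, 2] = 2.0  # one badly reconstructed pixel
score_map = anomaly_score_map(frame, recon)
print(frame_score(score_map))  # 4 / 16 = 0.25
```

The example also shows why the averaged score carries no spatial information: moving the badly reconstructed pixel elsewhere leaves the frame score unchanged.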
ConvLSTMAE  is similar to ConvAE but models spatial and temporal features separately. A 2D CNN encoder first captures spatial information from each frame, then a multi-layer ConvLSTM recurrently encodes temporal features. Another 2D CNN decoder then reconstructs input video clips (Fig. 3(b)). We also implemented ConvLSTMAE(gray) and ConvLSTMAE(flow).
AnoPred is a frame-level VAD method taking four continuous previous RGB frames as input and applying UNet to predict a future RGB frame (Fig. 3(c)). AnoPred boosts prediction accuracy with a multi-task loss incorporating image intensity, optical flow, gradient, and adversarial losses. AnoPred was proposed for surveillance cameras. However, traffic videos are much more dynamic, making future frame prediction difficult. Therefore we also benchmark a variant of AnoPred that focuses on the video foreground. We use Mask-RCNN pre-trained on Cityscapes to acquire object instance masks for each frame, and apply instance masks to input and target images, resulting in an AnoPred+Mask method that only predicts foreground objects and ignores noisy backgrounds such as trees and billboards. In contrast to [14, 6], AnoPred uses the Peak Signal to Noise Ratio (PSNR) as the anomaly score, with better results.
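For reference, the sketch below computes PSNR from the mean squared error using the standard textbook definition. The `peak` parameter (the maximum possible intensity) and the function name are assumptions of this sketch; the exact normalization used by AnoPred is not restated in this section.

```python
import numpy as np

def psnr(frame, prediction, peak=1.0):
    """Standard PSNR in dB; `peak` is the assumed maximum pixel intensity."""
    diff = frame.astype(np.float64) - prediction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(peak ** 2 / mse))

target = np.zeros((2, 2))
good_prediction = np.full((2, 2), 0.1)
bad_prediction = np.full((2, 2), 0.5)
print(psnr(target, good_prediction), psnr(target, bad_prediction))
```

A higher PSNR means a better prediction, so a frame is more likely anomalous when its PSNR is low.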
4.2 Object-centric Unsupervised VAD
TAD  models normal bounding box trajectories in traffic scenes with a multi-stream RNN encoder-decoder  (Fig. 3(d)) to encode past trajectories and ego motion and to predict future object bounding boxes. Prediction results are collected; prediction consistency instead of accuracy is used to compute per-object anomaly scores. Per-object scores are averaged to form a per-frame score.
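The idea of scoring consistency rather than accuracy can be sketched as follows. This is an illustration under stated assumptions: boxes are given as `[cx, cy, w, h]`, the per-coordinate standard deviations are averaged into one number, and frames without objects score zero. The function names are hypothetical, and TAD's actual aggregation may differ.

```python
import numpy as np

def consistency_score(predictions):
    """Anomaly score of one object at one time step.

    predictions: array of shape (n_predictions, 4) holding boxes
    [cx, cy, w, h] that were predicted for the same time step from
    different past time steps. Inconsistent predictions give a high score.
    """
    preds = np.asarray(predictions, dtype=np.float64)
    # Assumption: average the standard deviations of the four coordinates.
    return float(preds.std(axis=0).mean())

def frame_score(per_object_scores):
    """Per-frame score: average of the per-object scores."""
    if len(per_object_scores) == 0:
        return 0.0  # assumption: frames without objects score zero
    return float(np.mean(per_object_scores))

stable = consistency_score([[10, 10, 4, 4], [10, 10, 4, 4]])
unstable = consistency_score([[0, 0, 1, 1], [2, 0, 1, 1]])
print(stable, unstable, frame_score([stable, unstable]))
```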
Ionescu et al. propose to treat object normality as multi-modal and use k-means to find the normality clusters in hidden space. Liu et al. use margin learning (ML) to enforce large distances between normal and abnormal features. We combine these ideas and propose TAD+ML, as shown in Fig. 3(e). We adopt k-means to cluster encoder hidden features. Each cluster is considered one normality, i.e., one type of normal motion, so that each training sample is initialized with a cluster ID as its normality label. We then use a center loss to enforce a tight distribution of samples from the same normality and to make samples from different normalities distinguishable. Center loss is more efficient than triplet loss in large batch training. Fig. 3(e) shows an example of visualized hidden features after ML. Note that we removed the ego motion branch in TAD+ML for simplicity as it does not affect results.
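A minimal sketch of the two ingredients, assuming cluster centers are already available (for example from k-means): nearest-center assignment of normality labels and the commonly used form of center loss, half the mean squared distance to the assigned center. The function names and the exact loss scaling are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def assign_normality(features, centers):
    """Label each hidden feature with the index of its nearest cluster center."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def center_loss(features, labels, centers):
    """Half the mean squared distance between features and their centers."""
    diffs = features - centers[labels]
    return float(0.5 * np.mean(np.sum(diffs ** 2, axis=1)))

features = np.array([[0.0, 0.0], [2.0, 0.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])  # hypothetical k-means centers
labels = assign_normality(features, centers)
print(labels, center_loss(features, labels, centers))
```

Minimizing this loss pulls each sample toward the center of its normality, which tightens the clusters.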
Frame-level VAD methods focus on appearance while object-centric methods focus more on object motion. We are not aware of any method combining the two. Appearance-only methods may fail with drastic variance in lighting conditions, and motion-only methods may fail when trajectory prediction is imperfect. In this paper, we combine AnoPred+Mask and TAD+ML into an ensemble method. We trained each method independently and fused their output anomaly scores by average pooling. We have observed that such a late fusion is better than fusing hidden features in an early stage and training the two models together, since their hidden features are scaled differently: AnoPred+Mask encodes one feature per frame, while TAD+ML has one feature per object.
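Late fusion by averaging can be sketched in a few lines. The min-max normalization applied before averaging is an assumption of this sketch (it puts both score sequences on a common scale) and can be switched off; the function names are hypothetical.

```python
import numpy as np

def min_max(scores):
    """Rescale a score sequence to [0, 1]; constant sequences map to zeros."""
    s = np.asarray(scores, dtype=np.float64)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def late_fusion(frame_scores_a, frame_scores_b, normalize=True):
    """Average two per-frame anomaly score sequences of equal length."""
    a = np.asarray(frame_scores_a, dtype=np.float64)
    b = np.asarray(frame_scores_b, dtype=np.float64)
    if normalize:  # assumption: normalize each method's scores first
        a, b = min_max(a), min_max(b)
    return 0.5 * (a + b)

fused = late_fusion([0.0, 1.0], [0.0, 10.0])
raw = late_fusion([0.0, 1.0], [0.0, 10.0], normalize=False)
print(fused, raw)
```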
4.3 Supervised VAD as Online Action Detection
VAD can also be interpreted as binary action detection with normal and abnormal classes. We benchmark multiple video action detection methods on DoTA to provide insight in supervised VAD. We use an ImageNet pre-trained ResNet50 model to collect frame features and train different classifiers: 1) FC, a three-layer fully-connected network for image classification; 2) LSTM, a one-layer LSTM classifier for sequential image classification; and 3) Encoder-Decoder, an LSTM model with an encoder classifying current frames and a decoder predicting future classes. We also train the temporal recurrent network (TRN)  which is built upon encoder-decoder except predictions are fed back to the encoder to improve performance.
5 A New Evaluation Metric
5.1 Critique of Current VAD Evaluation
Most VAD methods compute an anomaly score for each frame by averaging scores over all pixels or objects. The current evaluation method plots receiver operating characteristic (ROC) curves using temporally concatenated scores and computes an area under curve (AUC) metric. AUC measures how well a VAD method answers the When question but ignores Where, since the averaged anomaly score lacks spatial information. We argue AUC is insufficient to fully evaluate VAD performance. In computing AUC, a true positive is a prediction where the model predicts a high anomaly score for a positive frame. Fig. 4 shows two positive frames and their corresponding score maps computed by the four benchmarked VAD methods. Although the maps are different, the anomaly scores averaged from these maps are similar, meaning they are treated similarly in AUC evaluation. This results in similar AUCs among all methods, which leads to a conclusion that all perform similarly. However, AnoPred (Fig. 4(b)) predicts high scores for trees and other noise. AnoPred+Mask and TAD+ML (Fig. 4(c) and 4(d)) predict high scores for unrelated vehicles. Ensemble (Fig. 4(e)) alleviates these problems but still has high anomaly scores outside the labeled anomalous regions. Note that score maps of TAD+ML and Ensemble are pseudo-maps introduced in Section 5.2. Although these methods yield similar AUCs, VAD methods should be distinguished by their abilities to localize anomalous regions. Anomalous region localization is essential because it improves reaction to anomalies, e.g., collision avoidance, and aids in model explanation, e.g., a model predicts a car-to-car collision because it finds anomalous cars, not trees or noise. This motivates a new spatio-temporal metric to better address both When and Where questions.
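For concreteness, frame-level AUC can be computed directly from its pairwise interpretation: the fraction of (positive frame, negative frame) pairs in which the positive frame receives the higher score, with ties counted as one half. The sketch below is a simple quadratic-time version with a hypothetical function name, and it assumes both classes are present.

```python
import numpy as np

def frame_auc(frame_scores, labels):
    """ROC AUC of per-frame anomaly scores against binary frame labels."""
    scores = np.asarray(frame_scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Scores concatenated over all test frames; 1 marks an anomalous frame.
print(frame_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 3 of 4 pairs ordered correctly
```

Only the scalar per-frame scores enter this computation, which is why two methods with very different score maps can obtain the same AUC.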
5.2 The Spatial-Temporal Area Under Curve (STAUC) Metric
First, calculate the true anomalous region rate (TARR) for each positive frame $t$,
$$\mathrm{TARR}_t = \frac{\sum_{i \in m_t} \Delta_I(i)}{\sum_{i \in M_t} \Delta_I(i)}, \tag{2}$$
where $\Delta_I$ is the anomaly score map from Eq. (1), $M_t$ represents all frame pixels, and $m_t$ is the annotated anomalous frame region (i.e., the union of all annotated bounding boxes). $\mathrm{TARR}_t$ is a scalar describing how much of the anomaly score is located within the true anomalous region. TARR is inspired by anomaly segmentation tasks where the overlap between prediction and annotation is computed. Next, calculate the spatio-temporal true positive rate (STPR),
$$\mathrm{STPR} = \frac{\sum_{t \in TP} \mathrm{TARR}_t}{|P|}, \tag{3}$$
where $TP$ represents all true positive predictions and $P$ represents all ground truth positive frames. STPR is a weighted TPR where each true positive is weighted by its TARR. We then use STPR and FPR to plot a spatio-temporal ROC (STROC) curve and calculate the STAUC. Note that $\mathrm{STAUC} \le \mathrm{AUC}$ and the two are equal in the best case where $\mathrm{TARR}_t = 1$ for every $t \in TP$.
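The metric can be sketched end to end in NumPy. This is an illustrative implementation under assumptions, not the official evaluation code: thresholds are swept over the unique frame scores, the curve is integrated with the trapezoidal rule, both classes are assumed to be present, and the function names are hypothetical.

```python
import numpy as np

def tarr(score_map, region_mask):
    """Share of the total anomaly score that lies inside the annotated region."""
    total = score_map.sum()
    return float(score_map[region_mask].sum() / total) if total > 0 else 0.0

def stauc(frame_scores, labels, tarrs):
    """Area under the (FPR, STPR) curve; `tarrs` is ignored on negative frames."""
    scores = np.asarray(frame_scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    weights = np.where(labels, np.asarray(tarrs, dtype=np.float64), 0.0)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    fpr, stpr = [0.0], [0.0]
    for thr in np.unique(scores)[::-1]:  # thresholds from high to low
        predicted = scores >= thr
        fpr.append((predicted & ~labels).sum() / n_neg)
        stpr.append(weights[predicted].sum() / n_pos)  # TARR-weighted TPR
    fpr, stpr = np.array(fpr), np.array(stpr)
    return float(np.sum(np.diff(fpr) * (stpr[1:] + stpr[:-1]) / 2.0))

mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(tarr(np.ones((4, 4)), mask))  # uniform map: 4 of 16 pixels

scores, labels = [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]
print(stauc(scores, labels, [0, 0, 1.0, 1.0]))  # perfect localization
print(stauc(scores, labels, [0, 0, 0.5, 0.5]))  # half the score mass in region
```

With every TARR equal to one, the weighted curve coincides with the ordinary ROC curve, so the first STAUC value equals the AUC of the same scores; lowering the TARRs can only shrink the area.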
Object-centric VAD [44, 20, 31] computes per-object anomaly scores $s_o$ instead of an anomaly score map $\Delta_I$. To generalize the STAUC metric to object-centric methods, we first create pseudo-anomaly score maps per Fig. 4(d). Each object has a 2D Gaussian distribution centered in its bounding box. The score of pixel $i$ is then computed as the sum of the scores calculated from all boxes it occupies,
$$\Delta_O(i) = \sum_{o:\, i \in o} s_o \exp\left(-\frac{(x_i - x_o)^2}{2 w_o^2} - \frac{(y_i - y_o)^2}{2 h_o^2}\right), \tag{4}$$
where $x_i$ and $y_i$ are the coordinates of pixel $i$ and $[x_o, y_o, w_o, h_o]$ is the center location, width, and height of object bounding box $o$. For the Ensemble method, we take the average of $\Delta_I$ and $\Delta_O$ as the anomaly score map in Fig. 4(e). This map is used like $\Delta_I$ in Eq. (2) to compute TARR and STAUC.
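A pseudo-map of this kind can be sketched as follows. The sketch assumes that the Gaussian's standard deviations equal the box width and height and that each object contributes only inside its own box; both choices, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def pseudo_score_map(height, width, boxes, scores):
    """Build a pixel-level map from per-object anomaly scores.

    boxes: iterable of (cx, cy, w, h) in pixels; scores: one score per box.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    out = np.zeros((height, width))
    for (cx, cy, w, h), s in zip(boxes, scores):
        inside = (np.abs(xs - cx) <= w / 2) & (np.abs(ys - cy) <= h / 2)
        # Assumption: Gaussian spread is set by the box width and height.
        gauss = np.exp(-((xs - cx) ** 2) / (2 * w ** 2)
                       - ((ys - cy) ** 2) / (2 * h ** 2))
        out += np.where(inside, s * gauss, 0.0)  # overlapping boxes add up
    return out

single = pseudo_score_map(5, 5, [(2, 2, 2, 2)], [1.0])
double = pseudo_score_map(5, 5, [(2, 2, 2, 2), (2, 2, 2, 2)], [1.0, 1.0])
print(single[2, 2], single[0, 0], double[2, 2])
```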
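$\mathrm{TARR}_t$ is not robust to the anomalous region size $|m_t|$, where $m_t$ is the annotated anomalous region and $M_t$ is the set of all pixels of frame $t$. When $|m_t| \ll |M_t|$, $\mathrm{TARR}_t$ could be small even though all anomaly scores are high in $m_t$. We thus propose selecting the top $k\%$ of pixels with the largest anomaly scores as candidates, and computing $\mathrm{TARR}_t$ from these candidates instead of all pixels. Selecting a constant $k$ can be arbitrary. An extremely small $k$ may result in a biased candidate set dominated by false or true detections such that $\mathrm{TARR}_t = 0$ or $\mathrm{TARR}_t = 1$. To address this issue, we compute an adaptive $k$ for each frame based on the size of its annotated anomalous region as given by
$$k_t = \frac{|m_t|}{|M_t|} \times 100. \tag{5}$$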
The adaptive $k$ varies widely across DoTA frames, with extreme cases where the anomalous object is very small (far away) or large (nearby).
A critical consideration for any new metric is its robustness to hyperparameters. We have tested STAUC with different values of $k$ for different VAD methods (Fig. 5(a)). STAUC slightly decreases with increasing $k$ but stabilizes when $k$ is large, indicating that STAUC is robust. Fig. 5(b) shows that STROC curves with different $k$ are close, and their upper bound is the traditional ROC. The adaptive $k$ is selected for our benchmarks based on each frame’s annotation and its corresponding mid-range STAUC value.
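The top-k variant can be sketched as below. The adaptive rule used here, the percentage of frame pixels covered by the annotated region, is an assumption consistent with the description above, and the function names are hypothetical.

```python
import numpy as np

def adaptive_k(region_mask):
    """Assumed adaptive k: percentage of pixels inside the annotated region."""
    return 100.0 * region_mask.sum() / region_mask.size

def topk_tarr(score_map, region_mask, k_percent):
    """TARR computed only over the k% highest-scoring pixels."""
    n = max(1, int(round(score_map.size * k_percent / 100.0)))
    flat_scores = score_map.ravel()
    flat_mask = region_mask.ravel()
    top = np.argsort(flat_scores)[::-1][:n]  # indices of candidate pixels
    total = flat_scores[top].sum()
    if total <= 0:
        return 0.0
    return float(flat_scores[top][flat_mask[top]].sum() / total)

# A small region with high scores on a weak but non-zero background.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
score_map = np.full((4, 4), 0.1)
score_map[mask] = 1.0

k = adaptive_k(mask)
print(k, topk_tarr(score_map, mask, k), topk_tarr(score_map, mask, 100.0))
```

In this toy frame the plain TARR (k = 100) is pulled down by background scores even though every high score lies inside the region, while the adaptive top-k version reports perfect localization.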
6 Experiments
We benchmarked VAD and VAR with the When-Where-What pipeline. We randomly partitioned DoTA into 3,275 training and 1,402 test videos and use these splits for both tasks. Unsupervised VAD models must be trained only with normal data, so we extract precursor frames from each video for training. Supervised VAD and VAR models are trained using all training data.
6.1 Task 1: Video Anomaly Detection (VAD)
Implementation Details. We trained all ConvAE and ConvLSTMAE variants using AdaGrad with batch size 24. AnoPred, TAD, and their variants are trained per the original papers. TAD+ML uses k-means and a center loss [20, 25]. To train supervised methods, we first extract image features using ImageNet pre-trained ResNet50, then train each model with batch size 16. All models are trained on NVIDIA TITAN XP GPUs. To fairly compare frame- and object-based methods, we ignore videos with unknown category or without objects, resulting in 1,305 test videos.
Overall Results. The top four rows of Table 3 show performance of ConvAE and ConvLSTMAE with grayscale or optical flow inputs. Generally, using optical flow achieves better AUC, indicating motion is an informative feature for this task. However, all baselines achieve low STAUC, meaning that they cannot localize anomalous regions well. AnoPred+Mask (64.8 AUC, 42.1 STAUC) has a lower AUC than AnoPred but a higher STAUC. By applying instance masks, the model focuses on foreground objects to avoid computing high scores for background, resulting in slightly lower AUC but much higher STAUC. This supports our hypothesis that higher AUC does not imply a better VAD model, while STAUC reveals its ability to localize anomalous regions. TAD outperforms AnoPred on both metrics by specifically focusing on object motion and location, both of which are important indicators of traffic anomalies. The margin learning (ML) module further improves TAD by a small margin. Our Ensemble method achieves the best AUC and STAUC among unsupervised methods, indicating that combining frame-level appearance and object-centric motion features is a direction worth investigating in future VAD research.
State-of-the-art supervised methods such as TRN achieve higher AUC than unsupervised methods. These methods focus on temporal modeling and simplify spatial modeling by using pre-trained features. We believe that exploring spatial modeling could further boost the performance of supervised methods. However, since these models directly predict an anomaly score for each frame rather than computing an anomaly score map, it is not straightforward to compute STAUC for them. Other ways such as a soft attention or class activation map might help model explainability in the future [23, 46].
|Method||Type||Input||AUC||STAUC|
|ConvAE (gray)||Unsupervised||Gray||64.3||7.4|
|ConvAE (flow)||Unsupervised||Flow||66.3||7.9|
|ConvLSTMAE (gray)||Unsupervised||Gray||53.8||12.7|
|ConvLSTMAE (flow)||Unsupervised||Flow||62.5||12.2|
|AnoPred + Mask||Unsupervised||Masked RGB||64.8||42.1|
|TAD||Unsupervised||Box + Flow||69.2||43.3|
|TAD + ML [20, 25]||Unsupervised||Box + Flow||69.7||43.7|
|Ensemble||Unsupervised||RGB + Box + Flow||73.0||48.5|
Per-class Results. Table 4 shows per-class results of unsupervised methods: AnoPred, AnoPred+Mask, TAD+ML, and Ensemble. First, STAUC (unlike AUC) distinguishes performance by anomaly type, offering guidance as researchers seek to improve their methods. For example, Ensemble has comparable AUCs on OC and VP anomalies but significantly different STAUCs, showing that anomalous region localization is harder on VP. Similar trends exist for the AH*, LA*, VP*, and VO* columns. Second, frame-level and object-centric methods complement each other in VAD, as shown by the Ensemble method’s highest AUC and STAUC values in most columns. Third, localizing anomalous regions in non-ego anomalies is more difficult, as STAUCs on ego-involved anomalies are generally higher. One reason is that ego-involved anomalies have better dashcam visibility and larger anomalous regions, making them easier to detect. Table 4 also shows the difficulties of detecting different categories, with AH*, VP, VP*, VO*, and LA* especially challenging for all methods. We observed that pedestrians in VP and VP* videos become occluded or disappear quickly after an anomaly happens, making it hard to detect the full anomaly event. AH* has a similar issue since sometimes the vehicle ahead is largely occluded by the vehicle it impacts. VO* is a rarer case in which a vehicle hits obstacles such as bumpers or traffic cones, which are typically not detected and are sometimes occluded by the anomalous vehicle. Vehicles involved in LA* usually move towards each other slowly until they collide and stop, making the anomaly subtle and thus hard to distinguish.
[Table 4: individual anomaly class AUC and STAUC of AnoPred, AnoPred+Mask, TAD+ML, and Ensemble.]
Qualitative Results. Fig. 7 shows per-frame anomaly scores and TARRs of three methods on a video where they all achieve high AUCs. AnoPred+Mask has low TARR throughout the video, indicating that it fails to localize anomalous regions correctly. TAD+ML computes high anomaly scores but low TARR in the left example due to inaccurate trajectory prediction for the left car. In the right image, it finds one of the anomalous cars but also marks an unrelated car by mistake. Ensemble combines the benefits of both, with anomaly scores for the 20th to 30th anomaly frames always higher than those of normal frames. It computes high TARR during the 10th to 20th anomaly frames, as shown in the left score map. The right map shows a failure case combining the failures of AnoPred+Mask and TAD+ML. Although these methods achieve high AUC, their spatial localization is limited, as their TARRs indicate. More qualitative results are shown in our supplement.
6.2 Task 2: Video Action Recognition (VAR)
The goal of VAR is to assign each video clip to one anomaly category. We benchmark seven VAR methods on DoTA: C3D, I3D, R3D, MC3, R(2+1)D, TSN, and SlowFast. The previous training/test split is used. Unknown UK(*) anomalies are ignored, yielding 3,216 training and 1,369 test videos. We trained all models with SGD, learning rate 0.01, and batch size 16 on NVIDIA TITAN XP GPUs. Models are initialized with Sports-1M (C3D) or Kinetics (rest) pre-trained weights; random horizontal flipping with probability 0.5 is used for data augmentation. For evaluation, we randomly select ten clips from each test video following prior work, except for TSN, which uses 25 frames per video.
Table 5 lists the backbone network of each model and its per-class accuracy. Although the newer methods R(2+1)D and SlowFast achieve higher average accuracy, all candidates suffer from low accuracy on DoTA, indicating that traffic anomaly classification is challenging. First, distant anomalies and occluded objects have low visibility and are thus hard to classify. For example, VO(*) are hard to classify due to low visibility and diverse obstacle types per Section 6.1. AH* and OC* are also difficult since the front or oncoming vehicles are often occluded. Second, some anomalies are visually similar to others. For example, ST(*) are rare and look similar to AH(*) or LA(*) (Fig. 2) since the only difference is whether the collided vehicle is starting, stopping, or stationary. Third, anomaly category is usually determined by the frames around the anomaly start time, while the later frames do not reveal this category clearly. We have observed an accuracy improvement when testing models only on the first half of each clip. Additional benchmarks are available in our supplement.
7 Conclusion and Future Work
This paper investigated a When-Where-What pipeline for traffic anomaly detection. We introduced a large-scale dataset containing temporal, spatial, and categorical annotations and benchmarked state-of-the-art VAD and VAR methods. We proposed a new spatial-temporal area under curve (STAUC) metric to better evaluate VAD performance. Experiments showed that STAUC distinguishes VAD methods better than AUC but that traffic video anomaly detection and classification problems are far from solved. DoTA offers the community new data for further VAD and VAR research and can also be used to study important object (visual saliency) detection, online detection of traffic anomalies, and validation and verification of autonomous driving efforts.
This research has been supported by the National Science Foundation under awards CNS 1544844 and CAREER IIS-1253549, and by the IU Office of the Vice Provost for Research, the IU College of Arts and Sciences, and the IU Luddy School of Informatics, Computing, and Engineering through the Emerging Areas of Research Project “Learning: Brains, Machines, and Children.” We also thank Derek Lukacs and RedBrickAI (https://www.redbrickai.com/) for supporting our data annotation work. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressly or implied, of the U.S. Government, or any sponsor.
9 Additional DoTA Dataset Examples
Fig. 8 shows one sampled frame sequence for each anomaly category in our DoTA dataset. Each row shows five frames sampled from one DoTA video: two frames from the normal precursor, two frames from the anomaly window (marked by a red boundary), and one frame from the post-anomaly period. The annotated bounding boxes of anomalous objects are shown as shaded rectangles, and each object keeps a consistent color across frames. Anomaly category abbreviations are listed to the left, where “*” indicates non-ego anomalies.
This figure illustrates that some samples from different categories look similar. For example, ST (row 1) is similar to both AH (row 3) and OC (row 7), except that in ST the front car is stationary. The AH* sample is similar to the OC* sample since it is difficult to distinguish front and rear vehicle views. The VP sample is close to the TC sample due to the similarity between a pedestrian and a rider. Moreover, some non-ego anomalies can have low visibility due to their distance from the camera, such as the VP* and OO* examples in Fig. 8. VO and VO* are anomalies where vehicles collide with unexpected/auxiliary obstacles such as dropped cargo and traffic cones. Note that VO and OO are two anomaly categories for which typically no bounding box label is provided; by definition, VO and OO do not involve traffic participants.
10 Additional VAD (Task 1) Results
We present more qualitative results of the AnoPred+Mask, TAD+ML, and Ensemble methods in this supplementary section. Fig. 9(a) shows an ego-involved ahead collision (AH). In the left example, AnoPred+Mask mistakenly computes a high anomaly score in the early frames because its prediction of the left car is inaccurate, as shown in the score map. TAD+ML computes a low anomaly score for this frame, and the Ensemble method benefits from it. The right example shows TAD+ML correctly computing a high score for the car ahead but also a high score for the bus on the right. Here the Ensemble benefits from AnoPred+Mask, so that it focuses more attention on the car ahead than on the bus. Fig. 9(b) shows a failure case where all methods perform poorly in detecting a non-ego turning/crossing accident (TC*). The left example shows that all methods compute high anomaly scores for normal frames, in which the silver car had to brake before turning right to avoid the black car that was turning into its lane. This example reveals that the tested unsupervised methods raise false alarms for near-incident events. The right example shows that TAD+ML misses one of the anomalous cars, which is captured by AnoPred+Mask. This may be caused by the failure of object tracking in collision scenarios.
11 Additional VAR (Task 2) Results
In the main paper, we benchmarked several state-of-the-art video action recognition (VAR) models on the DoTA dataset. Fig. 10 shows the confusion matrices of R(2+1)D and SlowFast, two of the best models evaluated in our experiments. Complementing Table 5 in the paper, the confusion matrices show which categories are most often confused, which helps us understand the challenging scenarios in the DoTA dataset. We make three observations from Fig. 10. First, both models have similar confusion matrices, indicating that they perform similarly on the DoTA dataset. Second, some categories are confused with other specific categories due to their similarities. Among all categories, TC, TC*, OC and OO* are the four classes with which many other categories are confused. One reason is that there are a large number of samples for these categories in DoTA. Another reason is the similarity among categories. For example, OO* is usually an out-of-control vehicle swerving on the road and finally leaving the roadway. Other non-ego anomalies, while having their own features, often produce similar irregular motions and are therefore confused with OO*. Third, ego-involved categories are usually not confused with non-ego categories. This indicates that although per-class recognition is difficult, current methods can reliably distinguish ego-involved from non-ego anomalies.
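To make the confusion analysis concrete, the following is a minimal Python sketch of how a confusion matrix, the most confused category pairs, and the ego versus non-ego separation could be computed from per-video predictions. It is an illustration under stated assumptions, not the evaluation code behind Fig. 10: the label ordering (eight ego-involved categories followed by their non-ego counterparts), the function names, and the toy predictions are all hypothetical.

```python
import numpy as np

# Assumed label order (hypothetical): eight ego-involved categories followed by
# their non-ego ("*") counterparts, 16 classes in total.
EGO = ["ST", "AH", "LA", "OC", "TC", "VP", "VO", "OO"]
LABELS = EGO + [name + "*" for name in EGO]


def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with ground-truth classes as rows and predictions as columns."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def ego_split_accuracy(cm, num_ego):
    """Fraction of videos predicted on the correct side of the ego / non-ego split."""
    same_side = cm[:num_ego, :num_ego].sum() + cm[num_ego:, num_ego:].sum()
    return float(same_side / cm.sum())


def most_confused_pairs(cm, labels, k=3):
    """Up to k (true, predicted, count) triples with the largest off-diagonal counts."""
    off = cm.copy()
    np.fill_diagonal(off, 0)
    top = np.argsort(off, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(top, off.shape)
    return [(labels[r], labels[c], int(off[r, c]))
            for r, c in zip(rows, cols) if off[r, c] > 0]


# Toy predictions for illustration only (not DoTA results).
y_true = [0, 0, 1, 8, 8, 9]
y_pred = [0, 1, 1, 8, 9, 0]
cm = confusion_matrix(y_true, y_pred, len(LABELS))
print(ego_split_accuracy(cm, len(EGO)))
print(most_confused_pairs(cm, LABELS))
```

A high ego-split accuracy combined with large off-diagonal counts inside each block would correspond to the third observation: the ego/non-ego distinction is learned even when the fine-grained category is not.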
12 Task 3: Online Action Detection
We provide benchmarks of several state-of-the-art online video action detection methods on the DoTA dataset. Online action detection recognizes the anomaly type by observing only the current and past frames, making it suitable for autonomous driving applications. Because it does not observe the whole video sequence, it is considered a more difficult task than traditional VAR. We use the same four online methods that were used in supervised VAD: FC, LSTM, Encoder-decoder and TRN. The only difference is that the classifiers are designed to predict one of the 16 anomaly categories. We use the same training configurations to train these models. Table 6 shows the per-class average precision (AP) and the mean average precision (mAP).
Quantitative Results. We observe that although TRN, a state-of-the-art method, achieves the highest mAP, all methods suffer from low precision on DoTA. Similar to what we observed in the paper’s VAD and VAR experiments, online action detection is also difficult for ST, ST*, VP, VP*, VO and VO*. AH* and OC* are also difficult due to the highly occluded front of a typical oncoming vehicle. We also observe that ego-involved anomalies are easier to recognize than non-ego anomalies due to their higher visibility.
Qualitative Results. Fig. 11 shows some examples of TRN results on our DoTA dataset. The bar plots show the classification confidence for each frame. Cyan bars represent anomalous frames while gray bars represent background (normal) frames. We make the following observations from this experiment: 1) Transition frames between normal and abnormal events are hard to classify. For example, class confidences are low at the frames where the color changes, i.e., the anomaly start and end frames; 2) Frames after an anomaly begins can also be hard to detect. For example, confidence decreases significantly at around the 40th frame of the first example and the 60th frame of the third example; 3) Visually similar anomalies and gentle anomalies are hard to detect. In the bottom failure case, the confidence of the ground-truth anomaly class LA* is always low. Its frames are classified as either background (normal) or AH*, because this relatively gentle LA* collision is visually similar to a typical AH* anomaly.
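For readers unfamiliar with the metric, the following is a minimal Python sketch of how per-class AP and mAP could be computed from per-frame classification confidences. It is an assumption-laden illustration, not the evaluation code behind Table 6: the function names are hypothetical, AP is taken as the mean of the precision values at the rank of each positive frame, and whether a background class is included in the average is left to the caller.

```python
import numpy as np


def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive frame."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if labels.sum() == 0:
        return float("nan")  # class absent from the evaluation set
    order = np.argsort(-scores, kind="stable")  # highest confidence first
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision[hits].mean())


def mean_average_precision(frame_scores, frame_labels, num_classes):
    """frame_scores: (T, C) confidences; frame_labels: (T,) integer class ids."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    frame_labels = np.asarray(frame_labels)
    aps = np.array([
        average_precision(frame_scores[:, c], frame_labels == c)
        for c in range(num_classes)
    ])
    return aps, float(np.nanmean(aps))  # classes with no positives are skipped


# Toy example with two classes and four frames (not DoTA results).
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [0, 1, 1, 0]
aps, m_ap = mean_average_precision(toy_scores, toy_labels, num_classes=2)
print(aps, m_ap)
```

Because AP is computed over ranked frames, a class whose confidence collapses on transition frames or shortly after the anomaly begins is penalized even if its peak confidence is high.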