Unsupervised Traffic Accident Detection in First-Person Videos

03/02/2019 ∙ by Yu Yao, et al. ∙ University of Michigan ∙ Indiana University

Recognizing abnormal events such as traffic violations and accidents in natural driving scenes is essential for successful autonomous and advanced driver assistance systems. However, most work on video anomaly detection suffers from one of two crucial drawbacks. First, it assumes cameras are fixed and videos have a static background, which is reasonable for surveillance applications but not for vehicle-mounted cameras. Second, it poses the problem as one-class classification, which relies on arduous human annotation and only recognizes categories of anomalies that have been explicitly trained. In this paper, we propose an unsupervised approach for traffic accident detection in first-person videos. Our major novelty is to detect anomalies by predicting the future locations of traffic participants and then monitoring the prediction accuracy and consistency metrics with three different strategies. To evaluate our approach, we introduce a new dataset of diverse traffic accidents, AnAn Accident Detection (A3D), and also test on another publicly available dataset. Experimental results show that our approach outperforms the state-of-the-art.




I Introduction

Autonomous driving has the potential to transform the world as we know it, revolutionizing transportation by making it faster, safer, cheaper, and less labor intensive. A key challenge is building autonomous systems that can accurately perceive and safely react to the huge diversity in situations that are encountered on real-world roadways. The problem is that driving situations obey a long-tailed distribution, such that a very small number of common situations makes up the vast majority of what a driver encounters, and a virtually infinite number of rare scenarios (animals running into the roadway, cars driving on the wrong side of the street, etc.) make up the rest. While each of these individual scenarios is rare, they can and do happen. In fact, the chance that one of them will occur on any given day is actually quite high.

Existing work in computer vision has applied deep learning-based visual classification to recognize actions in the video collected by dashboard-mounted cameras [1, 2]. The long-tailed distribution of driving events means that unusual events may occur so infrequently that it may be impossible to collect training data for them, or to even anticipate that they might occur [3]. In fact, some studies indicate that driverless cars would need to be tested for billions of miles before enough of these rare situations occur to even accurately measure system safety [4], much less to collect sufficient training data to make them work well.

Fig. 1: Overview of our proposed approach. For each time $t$, we monitor the accuracy and consistency of all traffic participants’ predicted bounding boxes from previous frames and calculate the scene’s anomaly score.

An alternative approach is to avoid modeling all possible driving scenarios, but instead to train models that recognize “normal,” safe roadway conditions, and then signal an anomaly when events that do not fit the model are observed. Unlike the fully-supervised classification-based work, this unsupervised approach would not be able to identify exactly what anomaly has occurred, but it may still provide sufficient information for the driving system to recognize an unsafe situation and take evasive action. This paper proposes a novel approach that learns a deep neural network model to predict the future locations of objects such as cars, bikes, pedestrians, etc., in the field of view of a dashboard-mounted camera on a moving ego-vehicle. These models can be easily learned from massive collections of dashboard-mounted video of normal driving, and no manual labeling is required. We then compare those predicted locations to the actual locations observed in the next few video frames. We hypothesize that anomalous roadway events can be detected by looking for major deviations between the predicted and actual locations, because unexpected roadway events (such as cars striking other objects) result in sudden unexpected changes in an object’s speed or position.

Perhaps the closest related work to ours is Liu et al. [3], who also detect anomalous events in video. Their technique tries to predict entire future RGB frames and then looks for deviations between those and observed RGB frames. But while their approach can work well for static cameras, accurately predicting whole frames is extremely difficult when cameras are rapidly moving as in the driving scenario. We side-step this difficult problem by detecting objects and predicting their trajectories, as opposed to trying to predict whole frames. To model the moving camera, we explicitly predict the future odometry of the ego-vehicle; this also allows us to detect significant deviations of the predicted and real ego-motion, which can be used to classify if the ego-vehicle is involved in the accident or is just an observer. We evaluate our technique in extensive experiments on three datasets, including a new labeled dataset of some 1,500 video traffic accidents from dashboard cameras that we collected from YouTube. We find that our method significantly outperforms a number of baselines, including the published state-of-the-art in anomaly detection.

II Related Work

Trajectory Prediction. Extensive research has investigated trajectory prediction, often posed as a sequence-to-sequence generation problem. Alahi et al. [5] introduce a Social-LSTM for pedestrian trajectories and their interactions. The proposed social pooling method is further improved by Gupta et al. [6] to capture global context in a Generative Adversarial Network (GAN). Social pooling is also applied to vehicle trajectory prediction in Deo et al. [7] with multi-modal maneuver conditions. Other work [8, 9] captures scene context information using attention mechanisms to assist trajectory prediction. Lee et al. [10] incorporate Recurrent Neural Networks (RNNs) with conditional variational autoencoders (CVAEs) to generate multimodal predictions and choose the best by ranking scores.

While the above methods are designed for third-person views from static cameras, recent work has considered vision in first-person (egocentric) videos that capture the natural field of view of the person or agent (e.g., vehicle) wearing the camera to study the camera wearer’s actions [11, 12], trajectories [13], interactions [14, 15], etc. Bhattacharyya et al. [16] predict future locations of pedestrians from vehicle-mounted cameras, modeling observation uncertainties with a Bayesian LSTM network. Yagi et al. [17] incorporate different kinds of cues into a convolution-deconvolution (Conv1D) network to predict pedestrians’ future locations. Yao et al. [18] extend this work to autonomous driving scenarios by proposing a multi-stream RNN Encoder-Decoder (RNN-ED) architecture with both past vehicle locations and image features as inputs for anticipating vehicle locations.

Video Anomaly Detection. Video anomaly detection has received considerable attention in computer vision and robotics [19]. Previous work mainly focuses on video surveillance scenarios, typically using an unsupervised learning method on the reconstruction of normal training data. For example, Hasan et al. [20] propose a 3D convolutional Auto-Encoder (Conv-AE) to model non-anomalous frames. To take advantage of temporal information, [21, 22] use a Convolutional LSTM Auto-Encoder (ConvLSTM-AE) to capture regular visual and motion patterns simultaneously. Luo et al. [23] propose a special framework of sRNN, called temporally-coherent sparse coding (TSC), to preserve the similarities between frames within normal and abnormal events. Liu et al. [3] detect anomalies by looking for differences between a predicted future frame and the actual frame. However, in dynamic autonomous driving scenarios, it is hard to reconstruct either the current or future RGB frames due to the ego-car’s intense motion. It is even harder to detect abnormal events. This paper proposes detecting accidents on roads by using the difference between predicted and actual trajectories of other vehicles. Our method not only eliminates the computational cost of reconstructing full RGB frames, but also localizes potential anomaly participants.

Prior work has also detected anomalies such as moving violations and car collisions on roads. Chan et al. [1] introduce a dataset of crowd-sourced dashcam videos and a dynamic-spatial-attention RNN model for accident detection. Herzig et al. [2] propose a Spatio-Temporal Action Graph (STAG) network to model the latent graph structure of spatial and temporal relations between objects. These methods are based on supervised learning that requires arduous human annotations and makes the unrealistic assumption that all abnormal patterns have been observed in the training data. This paper considers the challenging but practical problem of predicting accidents with unsupervised learning. To evaluate our approach, we introduce a new dataset with traffic accidents involving objects such as cars and pedestrians.

III Unsupervised Traffic Accident Detection in First-Person Videos

Autonomous vehicles must monitor the roadway ahead for signs of unexpected activity that may require evasive action. A natural way to detect these anomalies is to look for unexpected or rare movements in the first-person perspective of a front-facing, dashboard-mounted camera on a moving ego-vehicle. Prior work [3] proposes monitoring for unexpected scenarios by using past video frames to predict the current video frame, and then comparing it to the observed frame and looking for major differences. However, this does not work well for moving cameras on vehicles, where the perceived optical motion in the frame is induced by both moving objects and camera ego-motion. More importantly, anomaly detection systems do not need to accurately predict all information in the frame, since anomalies are unlikely to involve peripheral objects such as houses or billboards by the roadside. This paper thus assumes that an anomaly may exist if an object’s real-world observed trajectory deviates from the predicted trajectory. For example, when a vehicle should move through an intersection but instead suddenly stops, a collision may have occurred.

Following Liu et al. [3], our model is trained with a large-scale dataset of normal, non-anomalous driving videos. This allows the model to learn normal patterns of object and ego motions, then recognize deviations without the need to explicitly train the model with examples of every possible anomaly. This video data is easy to obtain and does not require hand labeling. Considering the influence of ego-motion on perceived object location, we incorporate a future ego-motion prediction module [18] as an additional input. At test time, we use the model to predict the current locations of objects based on the last few frames of data and determine if an abnormal event has happened based on the three anomaly detection strategies described in Section III-B.

Fig. 2: Overview of the future object localization model.

III-A Future Object Localization

III-A1 Bounding Box Prediction

Following [18], we denote an observed object’s bounding box at time $t$ as $X_t = [c_t^x, c_t^y, w_t, h_t]$, where $(c_t^x, c_t^y)$ is the location of the center of the box and $w_t$ and $h_t$ are its width and height in pixels, respectively. We denote the object’s future bounding box trajectory for the $\delta$ frames after time $t$ to be $\mathbf{Y}_t = \{Y_{t+1}, \dots, Y_{t+\delta}\}$, where each $Y_t$ is a bounding box parameterized by center, width, and height. Given the image evidence $O_t$ observed at time $t$, a visible object’s location $X_t$, and its corresponding historical information $H_{t-1}$, our future object localization model predicts $\mathbf{Y}_t$. This model is inspired by the multi-stream RNN encoder-decoder framework of Yao et al. [18], but with a completely different network structure. For each frame, [18] receives and re-processes the previous 10 frames before making a decision, whereas our model only needs to process the current information, making it much faster at inference time. Our model is shown in Figure 2. Two encoders based on gated recurrent units (GRUs) receive an object’s current bounding box and pixel-level spatiotemporal features as inputs, where spatiotemporal features are extracted by a region-of-interest pooling (RoIPool) operation using bilinear interpolation from precomputed optical flow fields.

III-A2 Ego-Motion Cue

Ego-motion information of the moving camera has been shown necessary for accurate future object localization [18, 16]. Let $E_t = \{\phi_t, x_t, z_t\}$ be the ego-vehicle’s pose at time $t$, where $\phi_t$ is the yaw angle and $x_t$ and $z_t$ are the positions along the ground plane with respect to the vehicle’s starting position in the first video frame. We predict the ego-vehicle’s odometry by using another RNN encoder-decoder module to encode the ego-position change vector $E_t - E_{t-1}$ and decode the future ego-position changes over the next $\delta$ frames, $\hat{\mathbf{E}} = \{\hat{E}_{t+1} - E_t, \dots, \hat{E}_{t+\delta} - E_t\}$. We use the change in ego-position to eliminate accumulated odometry errors. The output is then combined with the hidden state of the future object localization decoder to form the input into the next time step.
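As a small illustration of the ego-position change vectors described above, the sketch below converts a sequence of absolute poses into frame-to-frame changes. The function name `ego_position_changes` and the tuple layout (yaw, x, z) are assumptions made for this example, and yaw wrap-around is not handled.

```python
from typing import List, Tuple

Pose = Tuple[float, float, float]  # (yaw, x, z); layout assumed for this sketch


def ego_position_changes(poses: List[Pose]) -> List[Pose]:
    """Return frame-to-frame pose changes for a list of absolute poses.

    Working with differences instead of absolute poses means that drift
    accumulated since the first frame does not enter the encoder input.
    """
    return [
        (cur[0] - prev[0], cur[1] - prev[1], cur[2] - prev[2])
        for prev, cur in zip(poses, poses[1:])
    ]


if __name__ == "__main__":
    track = [(0.0, 0.0, 0.0), (0.5, 1.0, 2.0), (1.0, 1.5, 4.0)]
    print(ego_position_changes(track))
```

A sequence of n poses yields n - 1 change vectors, one per pair of consecutive frames.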

III-A3 Missed Objects

We build a list of trackers per [24] to record the current bounding box, the predictions, and the tracker age of each object. At each time step, we update the observed trackers and initialize a new tracker when a new object is detected. For objects that are missed (i.e., suddenly disappear), we use their previously predicted bounding boxes as their estimated current location and run future object localization with RoIPool features from those predicted boxes per Algorithm 1. This missed object mechanism is essential in our prediction-based anomaly detection method to eliminate the impact of failed object detection or tracking in any given frame. For example, if an object with a normal motion pattern is missed for several frames, the future object localization (FOL) model is still expected to give reasonable predictions except for some accumulated deviations. On the other hand, if an anomalous object is missed during tracking [24], FOL-Track will make a prediction using its previously predicted bounding box, whose region can be totally displaced and can result in inaccurate predictions. In this case, some false alarms and false negatives can be eliminated by using the metrics presented in Section III-B3.

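Input: Observed bounding boxes X_t^i where i ∈ C, observed image evidence O_t, trackers of all objects Trks with track IDs D
Output: Updated trackers Trks
A is the maximum age of a tracker
for i ∈ C do  // update observed trackers
    if i ∉ D then
        initialize Trks[i]
    else
        Trks[i].X_t = X_t^i
        Trks[i].Ŷ_t = FOL(X_t^i, O_t)
    end if
end for
for j ∈ D − C do  // update missed trackers
    if Trks[j].age > A then
        remove Trks[j] from Trks
    else
        Trks[j].X_t = box predicted for time t in Trks[j].Ŷ_{t−1}
        Trks[j].Ŷ_t = FOL(Trks[j].X_t, O_t)
    end if
end for
Algorithm 1 FOL-Track Algorithm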
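The tracker update of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: `Tracker`, `fol_track`, and the `fol` callable (a stand-in for the trained future object localization model, here taking only a box) are hypothetical names, and the rule that a tracker's age grows by one per missed frame and resets on re-observation is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h) in pixels


@dataclass
class Tracker:
    box: Box                                        # observed or estimated current box
    preds: List[Box] = field(default_factory=list)  # predicted future boxes
    age: int = 0                                    # frames since last observation


def fol_track(observed: Dict[int, Box],
              trackers: Dict[int, Tracker],
              fol: Callable[[Box], List[Box]],
              max_age: int) -> Dict[int, Tracker]:
    """Run one time step of the tracker update (hypothetical sketch)."""
    # Update trackers of objects observed in this frame.
    for tid, box in observed.items():
        if tid not in trackers:
            trackers[tid] = Tracker(box=box)  # new object: initialize only
        else:
            trk = trackers[tid]
            trk.box = box
            trk.preds = fol(box)
            trk.age = 0                       # assumption: reset on observation
    # Update trackers of objects missed in this frame.
    for tid in [t for t in trackers if t not in observed]:
        trk = trackers[tid]
        if trk.age > max_age:
            del trackers[tid]                 # missed for too long: remove
            continue
        if trk.preds:
            trk.box = trk.preds[0]            # previous one-step-ahead prediction
            trk.preds = fol(trk.box)
        trk.age += 1                          # assumption: age counts missed frames
    return trackers
```

Because a missed object's box is replaced by its own earlier prediction, errors can accumulate over consecutive missed frames, which is why the maximum age bounds how long a tracker may run without an observation.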
Fig. 3: Overview of our unsupervised traffic accident detection methods. The three brackets correspond to: (1) Predicted bounding box accuracy method (pink); (2) Predicted box mask accuracy method (green); (3) Predicted bounding box consistency method (purple). All methods use multiple previous FOL outputs to compute anomaly scores.

III-B Traffic Accident Detection

In this section, we propose three different strategies for traffic accident detection by monitoring the prediction accuracy and consistency of objects’ future locations. The key idea is that object trajectories and locations in non-anomalous events can be precisely predicted, while deviations from predicted behaviors suggest an anomaly.

III-B1 Predicted Bounding Boxes - Accuracy

One simple method for recognizing abnormal events is to directly measure the similarity between predicted object bounding boxes and their corresponding observations. The future object localization (FOL) model predicts bounding boxes of the next $\delta$ future frames, i.e., at each time $t$ each object has $\delta$ bounding boxes predicted from time $t-\delta$ to $t-1$, respectively. We first average the positions of the $\delta$ bounding boxes, then compute intersection over union (IoU) between the averaged bounding box and the observed box location, where higher IoU means greater agreement between the two boxes. We average computed IoU values over all observed objects and then compute an aggregate anomaly score $L_{bbox}$,

$$L_{bbox} = 1 - \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}\left(\frac{1}{\delta}\sum_{j=1}^{\delta}\hat{X}^{i}_{t,t-j},\; X^{i}_{t}\right), \tag{1}$$

where $N$ is the total number of observed objects, and $\hat{X}^{i}_{t,t-j}$ is the predicted bounding box from time $t-j$ of object $i$ at time $t$. This method relies upon accurate object tracking to match the predicted and observed bounding boxes.
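The box-accuracy strategy can be sketched as below, under stated assumptions: boxes are in (center x, center y, width, height) format, the score is taken to be one minus the aggregated IoU, and the function names (`iou`, `mean_box`, `bbox_accuracy_score`) are illustrative. The `reduce="min"` option corresponds to using the minimum IoU over objects instead of the average.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def mean_box(boxes: Sequence[Box]) -> Box:
    """Component-wise average of several predicted boxes."""
    n = len(boxes)
    return (sum(b[0] for b in boxes) / n, sum(b[1] for b in boxes) / n,
            sum(b[2] for b in boxes) / n, sum(b[3] for b in boxes) / n)


def bbox_accuracy_score(preds: List[Sequence[Box]],
                        observed: List[Box],
                        reduce: str = "avg") -> float:
    """Anomaly score from predicted vs. observed boxes for one frame.

    preds[i] holds the boxes predicted for the current frame of object i
    from several earlier frames; observed[i] is its observed box.
    """
    ious = [iou(mean_box(p), o) for p, o in zip(preds, observed)]
    if not ious:
        return 0.0  # assumption: no tracked objects means no evidence of anomaly
    agg = min(ious) if reduce == "min" else sum(ious) / len(ious)
    return 1.0 - agg
```

With the minimum instead of the mean, a single badly predicted object dominates the frame score, which makes it possible to point at the most anomalous object but also makes the score sensitive to one failed prediction.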

III-B2 Predicted Box Mask - Accuracy

Although tracking algorithms such as Deep-SORT [24] offer reasonable accuracy, it is still possible to lose or mis-track objects. We found that inaccurate tracking particularly happens in severe traffic accidents because of the twist and distortion of object appearances. Moreover, severe ego-motion also results in inaccurate tracking due to sudden changes in object locations. This increases the number of false negatives of the metric proposed above, which simply ignores objects that are not successfully tracked in a given frame. To solve this problem, we first convert all areas within the predicted bounding boxes to binary masks, with areas inside the boxes having value 1 and backgrounds having 0, and do the same with the observed boxes. We then calculate an anomaly score $L_{mask}$ from the IoU between these two binary masks,

$$I(m,n) = \begin{cases} 1, & \text{if pixel } (m,n) \text{ lies within a box } X_u \text{ for some } u, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

$$L_{mask} = 1 - \mathrm{IoU}\left(\frac{1}{\delta}\sum_{j=1}^{\delta}\hat{P}_{t,t-j},\; P_t\right), \tag{3}$$

where $I(m,n)$ is pixel $(m,n)$ on mask $P$, $X_u$ is the $u$-th bounding box, $\hat{P}_{t,t-j}$ is the predicted mask from time $t-j$, and $P_t$ is the observed mask at $t$. In other words, while the metric in the last section compares bounding boxes on an object-by-object basis, this metric simply compares the bounding boxes of all objects simultaneously. The main idea is that accurate prediction results will still have a relatively large IoU compared to the ground truth observation.
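A minimal sketch of the mask-based comparison follows. It is a simplification under assumptions: boxes are integer corner coordinates (x1, y1, x2, y2) with exclusive upper bounds, a single set of predicted boxes is compared with the observed boxes, the score is one minus the mask IoU, and masks are stored as sets of pixel coordinates rather than image arrays. The names `boxes_to_mask` and `mask_accuracy_score` are hypothetical.

```python
from typing import Iterable, Set, Tuple

CornerBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2), upper bounds exclusive


def boxes_to_mask(boxes: Iterable[CornerBox], width: int, height: int) -> Set[Tuple[int, int]]:
    """Return the set of pixels covered by at least one box, clipped to the image."""
    pixels = set()
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                pixels.add((x, y))
    return pixels


def mask_accuracy_score(pred_boxes: Iterable[CornerBox],
                        obs_boxes: Iterable[CornerBox],
                        width: int, height: int) -> float:
    """Anomaly score as one minus the IoU of predicted and observed box masks."""
    pred = boxes_to_mask(pred_boxes, width, height)
    obs = boxes_to_mask(obs_boxes, width, height)
    union = pred | obs
    if not union:
        return 0.0  # assumption: empty scene gives no evidence of anomaly
    return 1.0 - len(pred & obs) / len(union)
```

Because all boxes are merged into one mask per frame, no track identities are needed, so a lost or swapped track ID does not by itself change the score.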

| Dataset | # videos | # training frames | # testing frames | # anomaly events | Typical participants |
|---|---|---|---|---|---|
| UCSD Ped1/Ped2 [25]* | 98 | 9,350 | 9,210 | 77 | bike, pedestrian, cart, skateboard |
| CUHK Avenue [26]* | 37 | 15,328 | 15,324 | 47 | bike, pedestrian |
| UCF-Crime [27]* | 1,900 | 1,610 videos | 290 videos | 1,900 | car, pedestrian, animal |
| ShanghaiTech [23]* | 437 | 274,515 | 42,883 | 130 | bike, pedestrian |
| Street Accidents (SA) [1]† | 994 | 82,900 | 16,500 | 165 | car, truck, bike |
| A3D† | 1,500 | 79,991 (HEV-I) | 128,175 | 1,500 | car, truck, bike, pedestrian, animal |

TABLE I: Comparison of publicly available datasets for video anomaly detection. *Surveillance videos. †Egocentric videos. (Training frames are all from normal videos, while some test videos contain anomalies.)

III-B3 Predicted Bounding Boxes - Consistency

The above methods rely on accurate detection of objects in concurrent frames to compute anomaly scores. However, the detection of anomaly participants is not always accurate due to changes in appearance and mutual occlusions. We hypothesize that visual and motion features about an anomaly do not only appear once it happens, but usually are accompanied by a salient pre-event. We thus propose another strategy to detect anomalies by computing consistency of future object localization outputs from several previous frames while eliminating the effect of inaccurate detection and tracking.

As discussed in Section III-B1, our model has $\delta$ predicted bounding boxes for each object in video frame $t$. We compute the standard deviation (STD) between all $\delta$ predicted bounding boxes to measure their similarity,

$$L_{pred} = \frac{1}{N}\sum_{i=1}^{N}\max_{\{c^x,\, c^y,\, w,\, h\}} \mathrm{STD}\left(\hat{X}^{i}_{t,t-j}\right), \tag{4}$$

where $N$ is the number of objects and $\hat{X}^{i}_{t,t-j}$, $j = 1, \dots, \delta$, are the boxes predicted for object $i$ at time $t$ from time $t-j$. We compute the maximum STD over the four components of the bounding boxes since different anomalies may be indicated by different effects on the bounding box, e.g., suddenly stopped cross traffic may only have large STD along the horizontal axis. A low STD suggests the object is following normal movement patterns and thus the predictions are stable, while a high STD suggests abnormal motion. For all three methods, we follow [3] to normalize computed anomaly scores for evaluation.
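The consistency strategy can be illustrated with the short sketch below. It assumes boxes in (center x, center y, width, height) format and uses the population standard deviation; whether the original work uses the population or sample form is not stated here, so that choice, like the function name `consistency_score`, is an assumption.

```python
from statistics import pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h)


def consistency_score(preds: List[Sequence[Box]], reduce: str = "avg") -> float:
    """Anomaly score from the spread of predictions made for the same frame.

    preds[i] holds the boxes predicted for the current frame of object i
    from several earlier frames. For each object, the largest standard
    deviation over the four box components is kept; objects are then
    aggregated by averaging (reduce="avg") or by taking the maximum.
    """
    per_object = [
        max(pstdev([box[k] for box in boxes]) for k in range(4))
        for boxes in preds
    ]
    if not per_object:
        return 0.0  # assumption: no objects means no evidence of anomaly
    return max(per_object) if reduce == "max" else sum(per_object) / len(per_object)
```

Only predictions enter this score, so it can still be evaluated for a frame in which the detector or tracker has lost the object.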

IV Experiments

To evaluate our method on realistic traffic scenarios, we introduce a new dataset AnAn Accident Detection (A3D) of on-road abnormal event videos compiled as 1500 video clips from a YouTube channel [28] of dashboard cameras from different cars in East Asia. Each video contains an abnormal traffic event at different temporal locations. We labeled each video with anomaly start and end times under the consensus of three human annotators. The annotators were instructed to label the anomalies based on common sense, with the start time defined to be the point at which the accident is inevitable and the end time the point when all participants recover a normal moving condition or fully stop.

We compare our A3D dataset with existing video anomaly detection datasets in Table I. A3D includes a total of 128,175 frames (clips range from 23 to 208 frames) at 10 frames per second and is clustered into 18 types of traffic accidents, each labeled with a brief description. A3D includes driving scenarios with different weather conditions (e.g., sunny, rainy, snowy, etc.), places (e.g., urban, countryside, etc.), and participant types (e.g., cars, motorcycles, pedestrians, animals, etc.). In addition to start and end times, each traffic anomaly is labeled with a binary value indicating whether the ego-vehicle is involved, to provide a better understanding of the event. Note that this could especially benefit the first-person vision community. For example, rear-end collisions are the most difficult to detect with traditional anomaly detection methods. About of accidents in the dataset involve the ego-vehicle, and the others are observed by moving cars from a third-person perspective.

Since A3D does not contain normal videos, we use the publicly available Honda Egocentric View Intersection (HEV-I) [18] dataset to train our model. HEV-I was designed for future object localization and consists of 230 on-road videos at intersections in the San Francisco Bay Area. Each video is 10-60 seconds in length. Since HEV-I and A3D were collected in different places with different kinds of cameras, there is no overlap between the training and testing datasets. Following prior work [18], we produce object bounding boxes using Mask-RCNN [29] pre-trained on the COCO dataset and find tracking IDs using Deep-SORT [24].

IV-A Implementation Details

We implemented our model in PyTorch [30] and performed experiments on a system with an Nvidia Titan Xp (Pascal) GPU (footnote 1). We use ORB-SLAM 2.0 [31] for ego odometry calculation and compute optical flow using FlowNet 2.0 [32]. We use a 5×5 RoIPool operator to produce the final flattened feature vector. The gated recurrent unit (GRU) [33] is our basic RNN cell. GRU hidden state sizes for future object localization and the ego-motion prediction model were set to 512 and 128, respectively. To learn network parameters, we use the RMSprop [34] optimizer with default parameters, learning rate , and no weight decay. Our models were optimized in an end-to-end manner, and the training process was terminated after 100 epochs using a batch size of 32. The best model was selected according to its performance in future object localization.

Footnote 1: The code and dataset will be made publicly available upon publication.

IV-B Evaluation Metrics

For evaluation, we follow the literature of video anomaly detection [25] and compute frame-level Receiver Operation Characteristic (ROC) curves and Area Under the Curve (AUC). A higher AUC value indicates better performance.
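For readers unfamiliar with the metric, frame-level AUC can be computed directly from per-frame anomaly scores and binary labels. The sketch below uses the pairwise (rank) interpretation of AUC: the probability that a randomly chosen anomalous frame receives a higher score than a randomly chosen normal frame, with ties counted as one half. The function name `roc_auc` is illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve from per-frame scores and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomalous frames
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal frames
    if not pos or not neg:
        raise ValueError("AUC needs at least one frame of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A score that ranks frames at random gives an AUC near 0.5, which is a useful reference point when reading AUC values reported as percentages.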

IV-C Baselines

K-Nearest Neighbor Distance. We segment each video into a bag of short video chunks of 16 frames. Each chunk is labeled as either normal or anomalous based on the annotation of its 8th frame. We then feed each chunk into an I3D [35] network pre-trained on the Kinetics dataset, and extract the outputs of the last fully connected layer as its feature representation. All videos in the HEV-I dataset are used as normal data. The normalized distance of each test video chunk to the centroid of its K nearest normal (K-NN) video chunks is computed as the anomaly score. We show results for K = 1 and K = 5 in this paper.
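The scoring step of this baseline can be sketched as follows, assuming chunk features have already been extracted as plain numeric vectors. The Euclidean metric, the function name `knn_centroid_distance`, and the omission of the final normalization step are assumptions of this example.

```python
import math
from typing import List, Sequence


def knn_centroid_distance(query: Sequence[float],
                          normal_feats: List[Sequence[float]],
                          k: int) -> float:
    """Distance from a test feature to the centroid of its k nearest normal features."""
    # Sort normal features by distance to the query and keep the k closest.
    nearest = sorted(normal_feats, key=lambda f: math.dist(query, f))[:k]
    # Centroid is the component-wise mean of those neighbors.
    centroid = [sum(col) / len(nearest) for col in zip(*nearest)]
    return math.dist(query, centroid)


if __name__ == "__main__":
    normal = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
    print(knn_centroid_distance((1.0, 0.0), normal, k=2))
```

With k = 1 the score reduces to the distance to the single nearest normal chunk.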

Conv-AE [20]. We reimplement the Conv-AE model for unsupervised video anomaly detection by following [20]. The input images are encoded by 3 convolutional layers and 2 pooling layers, and then decoded by 3 deconvolutional layers and 2 upsampling layers for reconstruction. Anomaly score computation is from [20]. The model is trained on a mixture of the SA (Table I) and the HEV-I dataset for 20 epochs and the best model is selected.

State-of-the-art [3]. The future frame prediction network with Generative Adversarial Network (GAN) achieved the state-of-the-art results for video anomaly detection. This work detects abnormal events by leveraging the difference between a predicted future frame and its ground truth. To fairly compare with our method, we used the publicly available code by the authors of [3] and finetuned on the same dataset as Conv-AE. Training is terminated after 100,000 iterations and the best model is selected.

| Methods | A3D | A3D (w/o Ego) | SA [1] |
|---|---|---|---|
| K-NN (K = 1) | 48.0 | 51.3 | 48.2 |
| K-NN (K = 5) | 47.8 | 51.2 | 48.1 |
| Conv-AE [20] | 49.5 | 49.9 | 50.4 |
| State-of-the-art [3] | 46.1 | 50.7 | 50.4 |
| FOL-AvgIoU | 49.7 | 57.0 | 53.4 |
| FOL-MinIoU | 48.4 | 56.0 | 52.6 |
| FOL-Mask | 54.1 | 54.9 | 54.8 |
| FOL-AvgSTD (pred only) | 59.3 | 60.2 | 55.8 |
| FOL-MaxSTD (pred only) | 60.1 | 59.8 | 55.6 |

TABLE II: Experimental results (AUC, %) on the A3D and SA datasets.

IV-D Results on A3D Dataset

Quantitative Results. We evaluated baselines, a state-of-the-art method, and our proposed method on the A3D dataset. As shown in the first column of Table II, our method outperforms the K-NN baselines as well as Conv-AE and the state-of-the-art. As a comparative study, we evaluate performance of our future object localization (FOL) methods with the three metrics presented in Section III-B. FOL-AvgIoU uses the metric in Eq. 1, while FOL-MinIoU is a variation where we evaluate the minimum IoU over all observed objects instead of computing the average, resulting in not only anomaly detection but also anomalous object localization. However, FOL-MinIoU can perform worse since it is not robust to outliers such as failed prediction of a normal object, which is more frequent in videos with a large number of objects.

FOL-Mask uses the metrics in Eq. 3 and significantly outperforms the above two methods. This method does not rely on accurate tracking, so it handles cases including mis-tracked objects. However, it may mis-label a frame as an anomaly if object detection loses some normal objects. Our best methods use the prediction-only metric defined in Eq. (4) which has two variations FOL-AvgSTD and FOL-MaxSTD. Similar to the IoU based methods, FOL-MaxSTD finds the most anomalous object in the frame. By using only prediction, our method is free from unreliable object detection and tracking when an anomaly happens, including the false negatives (in IoU based methods) and the false positives (in Mask based methods) caused by losing objects. However, this method can fail in cases where predicting future locations of an object is difficult, e.g., an object with low resolution, intense ego-motion, or multiple object occlusions due to heavy traffic.

We also evaluated the methods by removing videos where ego cars (A3D w/o Ego in Table II) are involved in anomalies to show how ego motion influences anomaly detection. As shown in the first and the second columns of Table II, FOL-AvgIoU and FOL-MinIoU are impacted by ego-motion while the other methods are relatively robust. This further shows that it is necessary to reduce dependency on accurate object detection and tracking when anomalies occur.

Qualitative Results. Fig. 4 shows two sample results of our best method and the published state-of-the-art on the A3D dataset. For example, in the upper one, predictions of all observed traffic participants are accurate and consistent at the beginning. The ego car is hit at around the -th frame by the white car on its left, causing inaccurate and unstable predictions thus generating high anomaly scores. After the crash, the ego car stops and the predictions recover, as presented in the last two images. Fig. 6 shows a failure case where our method makes false alarms at the beginning due to inconsistent prediction of the very left car occluded by trees. This is because our model takes all objects into consideration equally rather than focusing on important objects. False negatives show that our method is not able to detect an accident if participants are totally occluded (e.g. the bike) or the motion pattern is accidentally normal from a particular viewpoint (e.g. the middle car).

IV-E Results on the SA Dataset

We also compared the performance of our model and baselines on the Street Accidents (SA) [1] dataset of on-road accidents in Taiwan. This dataset was collected from dashboard cameras with 720p resolution from the driver’s point-of-view. Note that we use SA only for testing, and still train on the HEV-I dataset. We follow prior work [1] and report evaluation results with 165 test videos containing different anomalies. The right-most column in Table II shows the results of different methods on SA. In general, our best method outperforms all baselines and the published state-of-the-art. The SA testing dataset is much smaller than A3D, and we have informally observed that it is biased towards anomalies involving bikes. It also contains videos collected from cyclist head cameras, which have irregular camera angles and large vibrations. Figure 5 shows an example of anomaly detection in the SA dataset.

V Conclusion

This paper proposes an unsupervised deep learning framework for traffic accident detection from egocentric videos. A key challenge is rapid motion of the ego-car, which means visual reconstruction of either current or future RGB frames from regular training data is difficult. Instead, we predict traffic participant trajectories as well as their future locations, and monitor the anticipation accuracy and consistency as a signal that an anomaly may have occurred. We introduce a new dataset consisting of a variety of real-world accidents on roads. We evaluate our method on two traffic accident detection datasets. Experiments show that our model significantly outperforms published baselines.

Fig. 4: Two examples of our best method and a state-of-the-art method on the A3D dataset.
Fig. 5: An example of our best method and a state-of-the-art method on the SA dataset [1].
Fig. 6: A failure case of our method on the A3D dataset with false alarms and false negatives.