One of the biggest challenges faced by autonomous vehicles is accurately anticipating accidents and taking the necessary actions to avoid them. These accidents include collisions of vehicles with one another and with animals, pedestrians, and road signs. Accurate prediction of accidents ahead of time is essential to avoid critical casualties.
The tasks of anticipation and detection fall into two separate categories. Accident detection is related to action recognition, where the network has the complete temporal context available at test time. Accident anticipation is more challenging because it requires predicting a future event from limited temporal information, as all practical systems are causal. The anticipation problem is evaluated by assessing how early the network predicts the event (in our case, an accident). Thus, accident anticipation should be tackled differently from the detection/recognition problem.
Recognizing abnormal activity such as traffic accidents in a real-life scenario poses a challenge for researchers because a wide variety of vehicles interact with one another to cause accidents.
If a major portion of the training data consists of accidents caused by only one kind of vehicle-to-vehicle interaction, the network might not generalize well to other kinds of accidents. For example, if the majority of accidents in the training data consist of cars colliding with motorbikes, the network will perform poorly on accidents caused by a car hitting another car. It is thus important to develop an anticipation neural network that models the relationship between the appearance features of different objects in a frame so that the network can generalize well to a wide variety of inter-object interactions.
Previous work aims at anticipating accidents by giving attention to individual objects in a frame without taking into account their interactions with one another. Zeng et al. use the spatial and appearance-wise non-linear interaction between an agent and its environment to assess future risk. In this work, we primarily aim at defining a relationship between the appearance features of objects alone that can be used generally.
We propose a novel method for accident anticipation that takes into account object-to-object interactions in a given frame of a video sequence. Our Feature Aggregation (FA) block strengthens every query object feature by adding to it a weighted sum of all other object features in the frame. The weighted sum represents global context specific to the query object, whereas the attention weights are defined by appearance relationships between different objects in a single frame. Moreover, we use the sequence modelling power of Recurrent Neural Networks for accident anticipation. The refined object features, along with the full frame features, are input to a Long Short-Term Memory (LSTM) network that returns an anticipation probability value for every frame. In this work, we use the FA block in the spatial domain only and use an LSTM to capture long-range temporal dependency between frames.
The rest of the paper is structured as follows. Research related to our work is discussed in Section II. In Section III, we describe the individual components of our method in detail. Implementation details and experimental results are given in Section IV. Finally, Section V concludes the paper.
II Related Work
II-A Video Classification
Recurrent Neural Networks (RNNs) have been used extensively for sequence and video classification. Yang et al. model high-dimensional sequential data with RNN architectures for video classification. Chen et al. explore video classification by aggregating frame-level feature representations to generate video-level predictions using variants of Recurrent Neural Networks and several fusion strategies. Ye et al. classify videos using a two-stream Convolutional Neural Network that uses both static frames and optical flow. For large-scale video classification, extracting optical flow features is computationally expensive. Tang et al. propose a motion hallucination network in which optical flow features are imagined from appearance features. Their method halves the computational cost of two-stream video classification. Miech et al. propose a two-stream architecture that models audio and visual features in a video sequence. Long et al. propose a local feature integration framework based on attention clusters for efficient video classification.
II-B Accident Detection and Anticipation
Extensive research has been carried out in the domain of accident detection. Chan et al. use an RNN with dynamic attention to anticipate accidents in videos. Yao et al. use an unsupervised approach involving future object localization and an RNN encoder-decoder network to detect accidents. The network is trained on normal samples only, and accidents are detected at test time. Herzig et al. classify input segments into accident or non-accident segments using a Spatio-Temporal Action Graph (STAG) network. Their supervised method refines the input features using non-local blocks to capture the spatial relation between objects and the temporal relation between frames in a video sequence. Singh et al. use an unsupervised approach involving a denoising autoencoder to extract deep representations from normal CCTV videos at training time. Xu et al. propose a Temporal Recurrent Network (TRN) that models temporal context over a long period, which helps in anticipating future actions in the video sequence. Zeng et al. adopt an agent-centric approach to measure the riskiness of each region with respect to the agent. Suzuki et al. propose a novel adaptive loss for early anticipation of accidents.
III Accident Anticipation in Dashcam Videos
Accident anticipation systems predict the occurrence of an accident as early as possible at test time. During training, the network is given a set of frame sequences and labels defined as

$$\mathcal{D} = \left\{ \left( x_1^{n}, \dots, x_{T_n}^{n} \right),\; y^{n},\; \tau^{n} \right\}_{n=1}^{N}$$

Here, $x_t^{n}$ is the $t$-th frame in the $n$-th video, $N$ is the total number of videos, $T_n$ denotes the number of frames in the $n$-th video, $y^{n}$ represents the one-hot encoded video-level label indicating the presence of an accident, and $\tau^{n}$ is the frame index at which the accident starts. For normal videos, $\tau^{n}$ is set to a value indicating that no accident occurs. At test time, the network is given one frame at a time and predicts the occurrence of accidents as early as possible in the video sequence. Specifically, the network tries to anticipate an accident correctly at the $t$-th frame with $t < \tau$ such that $\tau - t$ is maximized. Note that in anticipation, unlike detection, only partial observations up to time $t$ are available to the network at test time.
We first describe the preliminaries of a recurrent neural network that is a constituent of our method. Afterwards, we describe our method in detail.
III-A Recurrent Neural Networks
Recurrent Neural Networks are considered a powerful tool for sequence modelling. We use a Long Short-Term Memory (LSTM) network for accident anticipation in video sequences. LSTMs allow global aggregation of features over time by introducing memory into the system. In Fig. 2, the data flow of an LSTM is shown, where $h_t$ represents the hidden state generated at the current time step $t$, $c_t$ is the cell state that captures long-range temporal dependency, and $z_t$ is the input to the LSTM. There are three different gates in an LSTM block: input gate $i_t$, forget gate $f_t$, and output gate $o_t$. These gates filter the input so that the model learns useful information in the sequence by letting it pass through the LSTM, while the forget gate blocks the rest. The data flow for the LSTM is given as

$$
\begin{aligned}
i_t &= \sigma\left(W_i z_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f z_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o z_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c z_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
$$

In the equations above, $\odot$ is an element-wise product, and $\sigma$ and $\tanh$ are the sigmoid function and the hyperbolic tangent function, respectively. $W$, $U$, and $b$ are all learnable parameters of the LSTM.
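The gate computations above can be illustrated with a minimal sketch of one LSTM time step in plain Python. The function names (`sigmoid`, `affine`, `lstm_step`) and the `params` layout are hypothetical choices made for illustration; the method itself uses a standard LSTM layer, not this code.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, b, x, h):
    # Computes W x + U h + b, with matrices stored as lists of rows.
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps 'i', 'f', 'o' (gates) and 'g' (candidate cell state)
    to a tuple (W, U, b). This layout is an assumption for the sketch.
    """
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(*params["g"], x, h_prev)]  # candidate
    # New cell state: keep part of the old memory, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step, which makes the role of the forget gate easy to see.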
III-B Feature Extraction
We begin by detecting objects in individual frames of a video using Faster-RCNN. The number of objects in a frame is limited to $J$. We then extract $D$-dimensional features for the objects present in the frames using a pre-trained VGG. Similar features are extracted for the whole frame as well. Thus, the object features and full frame features for the $t$-th frame are given as

$$\left\{ a_t^{j} \right\}_{j=1}^{J},\quad a_t^{j} \in \mathbb{R}^{D}, \qquad a_t^{F} \in \mathbb{R}^{D}$$
III-C Feature Aggregation Block
In this subsection, we describe the FA block, which globally aggregates features over a frame. In order to detect accidents, it is important that the network understands the global context surrounding an object (a vehicle in our case) in a given frame. The main purpose of this block is to model object interactions in the neural network. The FA block has two components.
III-C1 Appearance Comparison
This part of the FA block computes the appearance relationship between objects in a given frame. We use $a_t^{i}$ for the query object and $a_t^{j}$ to represent all possible objects in a given frame, including the query object $a_t^{i}$. The object features $a_t^{i}$ and $a_t^{j}$ are passed through fully connected layers with parameters $(W_\theta, b_\theta)$ and $(W_\phi, b_\phi)$, respectively.

$$W_\theta a_t^{i} + b_\theta, \qquad W_\phi a_t^{j} + b_\phi$$

where $W_\theta$, $W_\phi$, $b_\theta$, and $b_\phi$ are all learnable parameters of fully connected layers. In order to capture how object relations evolve over time, we add the hidden representation $h_{t-1}$ to the output of the fully connected layers.

$$\theta(a_t^{i}) = \tanh\left(W_\theta a_t^{i} + b_\theta + W_h h_{t-1}\right), \qquad \phi(a_t^{j}) = \tanh\left(W_\phi a_t^{j} + b_\phi + W_h h_{t-1}\right)$$

where $W_h$ is a learnable parameter of a fully connected layer, $\tanh$ is the hyperbolic tangent function, and $\theta(a_t^{i})$ and $\phi(a_t^{j})$ denote learnable transformations of the object features $a_t^{i}$ and $a_t^{j}$, respectively. By transforming the objects' features, we can effectively compare object appearance in a subspace.

We use dot product similarity as the appearance relation function between the transformed object features in order to obtain the unnormalized attention weights $e_t^{ij}$.

$$e_t^{ij} = \theta(a_t^{i})^{\top} \phi(a_t^{j})$$

where $\top$ represents the transpose. The unnormalized attention weights are normalized by a softmax function so that the attention weights related to query object $i$ sum to 1, i.e.,

$$\alpha_t^{ij} = \frac{\exp\left(e_t^{ij}\right)}{\sum_{k} \exp\left(e_t^{ik}\right)}, \qquad \sum_{j} \alpha_t^{ij} = 1$$

$\alpha_t^{ij}$ is an attention weight showing the importance of object $j$'s feature with respect to object $i$. The attention weights are computed for all objects in a frame and packed together in a matrix $\boldsymbol{\alpha}_t$.
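As a minimal sketch of the appearance comparison step, the following plain-Python code computes dot-product scores between already-transformed object features and normalizes each row with a softmax. The names `softmax` and `attention_matrix` are hypothetical, and the inputs are assumed to be the outputs of the two learnable transformations rather than raw object features.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(query_feats, key_feats):
    """Row i holds the attention weights of all objects j for query object i.

    query_feats / key_feats: lists of transformed feature vectors
    (one per object), assumed to have the same dimensionality.
    """
    weights = []
    for q in query_feats:
        # Dot-product similarity between query object and every object.
        scores = [sum(a * b for a, b in zip(q, k)) for k in key_feats]
        weights.append(softmax(scores))
    return weights

# Three toy objects with 2-dimensional transformed features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_matrix(feats, feats)
```

Each row of `A` sums to 1, so the weights for a query object can be read as a distribution over the objects in the frame.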
III-C2 Feature Refinement
The FA block strengthens the feature of each query object by adding to it a weighted sum of all objects present in the frame. The weights indicate the appearance relation between the objects. First, a learnable transformation of object $j$ is obtained by passing its feature $a_t^{j}$ through a linear fully connected layer with parameters $W_g$ and $b_g$.

$$g(a_t^{j}) = W_g a_t^{j} + b_g$$

Afterwards, this representation is multiplied with the corresponding attention weights $\alpha_t^{ij}$, following the soft attention mechanism.

$$m_t^{i} = \sum_{j=1}^{J} \alpha_t^{ij}\, g(a_t^{j})$$

where $J$ is the total number of objects in a given frame. The weighted summation of objects represents a global context for the query object $i$. The features of query object $i$ at time $t$, represented as $a_t^{i}$, are refined by adding $m_t^{i}$ to produce $\hat{a}_t^{i}$.

$$\hat{a}_t^{i} = a_t^{i} + m_t^{i}$$
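The refinement step can be sketched as a residual update: each object feature receives the attention-weighted sum of the linearly transformed features of all objects. The function name `refine_features` is hypothetical, and the sketch assumes the transformed features have the same dimensionality as the original ones so that they can be added.

```python
def refine_features(features, transformed, weights):
    """Add a weighted sum of transformed object features to each query object.

    features:    original object features, one vector per object
    transformed: linearly transformed object features (same shape, assumed)
    weights:     weights[i][j] = attention weight of object j for query i
    """
    refined = []
    for i, x_i in enumerate(features):
        # Global context for query object i: weighted sum over all objects.
        context = [
            sum(weights[i][j] * transformed[j][d] for j in range(len(transformed)))
            for d in range(len(x_i))
        ]
        # Residual connection: original feature plus its global context.
        refined.append([x + c for x, c in zip(x_i, context)])
    return refined

features = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
refined = refine_features(features, features, uniform)
```

With uniform weights, every object receives the same context vector, so the refined features differ only through the original residual term.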
The FA block is similar to the non-local block of Wang et al. but is significantly altered for accident anticipation. Non-local blocks modify the features of each query position in a feature map by aggregating a weighted sum from all positions. These positions refer to every location in a 2D feature map. In our method, we focus on strengthening object features instead of the features of each individual pixel location. Moreover, we design our FA block such that, in addition to capturing the pairwise relationship between objects, it also learns the evolution of objects over time. This is done by adding a term that includes the hidden state of the LSTM.
III-D Traffic Accident Anticipation
An overview of the accident anticipation model is given in Fig. 4. The refined object features from the FA block are aggregated to form a per-frame descriptor $\hat{a}_t$. This vector captures information about object interactions in a given frame. In order to have a better understanding of the scene, these features are combined with the corresponding full frame features $a_t^{F}$.

$$z_t = \left[\hat{a}_t \,;\, a_t^{F}\right]$$

where $;$ indicates concatenation. The resulting features are input to an LSTM, which outputs a hidden state $h_t$. The hidden state is projected into probability values for the two classes, i.e., accident and non-accident, by a fully connected layer with parameters $W_p$ and $b_p$. The output of the fully connected layer is normalized by a softmax activation function.

$$p_t = \mathrm{softmax}\left(W_p h_t + b_p\right)$$
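A minimal sketch of the two operations around the LSTM is given below: concatenating the per-frame object descriptor with the full frame features, and projecting a hidden state to two class probabilities. The helper names `build_lstm_input` and `classify_hidden_state` are hypothetical, and the weights in the usage lines are toy values rather than trained parameters.

```python
import math

def build_lstm_input(object_descriptor, frame_features):
    # Concatenate the aggregated object descriptor with full frame features.
    return list(object_descriptor) + list(frame_features)

def classify_hidden_state(h, W, b):
    """Project a hidden state to class probabilities with a softmax.

    W has one row per class; here two classes are assumed
    (non-accident, accident).
    """
    logits = [sum(w * h_k for w, h_k in zip(W_row, h)) + b_c
              for W_row, b_c in zip(W, b)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = build_lstm_input([1.0, 2.0], [3.0])
probs = classify_hidden_state([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, math.log(3.0)])
```

In the usage lines the hidden state is zero, so the probabilities are determined by the biases alone.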
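During training, each frame is assigned the video-level label $y$, which is a one-hot encoded vector. For accident videos, the loss function is the exponential cross entropy, which gives more importance to frames that are closer to the accident and hence produces larger anticipation probability values for such frames. For non-accident videos, the loss function is the simple cross entropy. The per-frame losses are summed over the entire video sequence, averaged, and then backpropagated.
The loss is given as

$$\mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \ell_t, \qquad \ell_t = \begin{cases} -e^{-\max(0,\,\tau - t)}\,\log p_t^{\mathrm{acc}}, & \text{accident video} \\ -\log\left(1 - p_t^{\mathrm{acc}}\right), & \text{non-accident video} \end{cases}$$

where $p_t^{\mathrm{acc}}$ is the predicted accident probability at frame $t$, $\tau$ is the frame at which the accident starts, and $T$ is the number of frames in the video.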
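The following sketch illustrates one common form of this loss, assuming the exponential weighting used by Chan et al., with the decay measured in frames. The function name `anticipation_loss`, the `eps` clamp, and the decay scale are assumptions made for illustration and may differ from the trained model.

```python
import math

def anticipation_loss(probs, accident_frame=None, eps=1e-12):
    """Average per-frame loss for one video.

    probs[t]       : predicted accident probability at frame t (0-indexed)
    accident_frame : frame index where the accident starts, or None for a
                     non-accident video (hypothetical convention)
    """
    total = 0.0
    for t, p in enumerate(probs):
        if accident_frame is None:
            # Plain cross entropy for the non-accident class.
            total += -math.log(max(1.0 - p, eps))
        else:
            # Frames close to the accident get a weight near 1,
            # early frames get an exponentially smaller weight.
            weight = math.exp(-max(0, accident_frame - t))
            total += -weight * math.log(max(p, eps))
    return total / len(probs)

positive = anticipation_loss([0.5, 0.5], accident_frame=1)
negative = anticipation_loss([0.5, 0.5])
```

Because early frames carry small weights, low probabilities far from the accident are penalized only mildly, while low probabilities near the accident dominate the loss.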
III-E Architectural Details
All fully connected layers with parameters , , , , and
compute non-linear transformations of their inputs where, , , , , and the non-linearity is . Fully connected layers with parameters , and linearly transform the input where , and . The number of layers in the recurrent neural network is fixed to 1. The final fully connected layer with parameters and
uses softmax activation function.
We first describe the Street Accident (SA) dataset used in the experiments. Then, we explain the implementation details and the evaluation metrics. Finally, we show that our method anticipates accidents earlier than state-of-the-art approaches on the SA dataset.
The Street Accident (SA) dataset contains videos captured across six cities in Taiwan. The videos were recorded at a frame rate of 20 frames per second. The frames extracted from these videos have a spatial resolution of 1280 × 720. Each video is 5 seconds long and contains 100 frames; in accident videos, the accident occurs in the last 10 frames. Street Accident is a complex dataset captured under different lighting conditions and involves a wide variety of accidents. The SA dataset contains 620 positive videos (with an accident) and 1130 negative videos (without an accident). Following the experimental procedure of Chan et al., we use 1284 videos (455 positive and 829 negative) for training and 466 videos (165 positive and 301 negative) for testing.
IV-B Implementation Details
We implemented our method in TensorFlow and performed experiments on a system with a single Nvidia GeForce 1080 GPU with 8 GB of memory. We used the appearance features provided with the SA dataset. The objects were extracted using Faster-RCNN. VGG-16 was used to extract features from full frames and from the objects present in a frame, which were first resized to a resolution of 224 × 224. The features were extracted from a fully connected layer of VGG with a dimension of 4096. These features were passed through a linear embedding to reduce their dimensionality to 256 before being given as input to our network. We used an LSTM with a hidden state size of 512 and a dropout of 0.5. Considering the training time and memory limit, we limited the number of objects in a frame to 9. The parameters of the network were initialized randomly from a normal distribution with a mean of 0 and a standard deviation of 0.01. The model was trained with a learning rate of 0.0001 using the Adam optimizer and a batch size of 10. Training was performed for 40 epochs on the SA dataset.
IV-C Evaluation Metrics
We use mean Average Precision (mAP) and Average Time-to-Accident (ATTA) as our evaluation metrics.
IV-C1 Average Precision
For every frame, our network returns a softmax probability value indicating the risk of a future accident. If the value is above a threshold and the video is an accident video, the prediction is a True Positive (TP); if the value is below the threshold, it is a False Negative (FN). Similarly, for a non-accident video, a value below the threshold is a True Negative (TN), and a value above the threshold is a False Positive (FP). These values are obtained for all the frames in all the video sequences. Precision ($P$) and Recall ($R$) are computed as follows

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

|Method|mAP (%)|ATTA (s)|
|VGG + full frame feature|37.3|3.21|

After finding precision and recall at different threshold values, we compute the mean Average Precision, which is generally defined as the area under the precision-recall curve.

$$\mathrm{mAP} = \int_{0}^{1} p(r)\, dr$$

where $p(r)$ is precision as a function of recall $r$.
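As an illustration of this metric, the sketch below computes precision and recall at several thresholds from per-frame scores and approximates the area under the precision-recall curve with a step integration. The function names, the convention of returning a precision of 1.0 when nothing is predicted positive, and the choice of step integration are assumptions for this example, not the evaluation code used in the experiments.

```python
def precision_recall(scores, labels, threshold):
    """scores: per-frame accident probabilities; labels: 1 for frames of
    accident videos, 0 for frames of non-accident videos."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scores, labels, thresholds):
    # Keep the best precision observed at each recall level.
    best = {}
    for th in thresholds:
        p, r = precision_recall(scores, labels, th)
        best[r] = max(p, best.get(r, 0.0))
    # Step integration of precision over recall.
    area, prev_r = 0.0, 0.0
    for r in sorted(best):
        area += (r - prev_r) * best[r]
        prev_r = r
    return area

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels, [0.85, 0.5, 0.2, 0.0])
```

In this toy example the two positive frames are ranked first and third, so the area is (1 + 2/3) / 2.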
IV-C2 Average Time to Accident
At every threshold, we find the first frame $t$ in every positive video at which the accident probability exceeds the threshold. If the accident starts at frame $\tau$, then $\tau - t$ is the Time-to-Accident (TTA). We average the TTAs over all positive videos to get a single TTA at a given threshold. After computing the TTA at different thresholds, we average them to obtain the Average Time-to-Accident (ATTA). A higher ATTA value means earlier anticipation of accidents.
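The procedure above can be sketched as follows. The function names are hypothetical, and two details are assumptions made for illustration: a positive video whose probability never exceeds the threshold before the accident contributes a TTA of 0, and frames are converted to seconds using the 20 frames-per-second rate of the SA dataset.

```python
def time_to_accident(probs, accident_frame, threshold):
    """Frames between the first alarm and the accident start (0 if no alarm)."""
    for t, p in enumerate(probs):
        if t >= accident_frame:
            break
        if p > threshold:
            return accident_frame - t
    return 0  # assumed convention: no alarm before the accident

def average_tta(videos, thresholds, fps=20.0):
    """videos: list of (per-frame probabilities, accident start frame)
    for positive videos. Returns ATTA in seconds."""
    per_threshold = []
    for th in thresholds:
        ttas = [time_to_accident(p, a, th) for p, a in videos]
        per_threshold.append(sum(ttas) / len(ttas))
    return sum(per_threshold) / len(per_threshold) / fps

videos = [
    ([0.1, 0.6, 0.9, 0.9], 3),   # alarm raised early
    ([0.1, 0.2, 0.3, 0.95], 3),  # no alarm before the accident
]
atta_frames = average_tta(videos, [0.5, 0.8], fps=1.0)
```

Lower thresholds raise alarms earlier and therefore increase TTA, which is why ATTA should be read together with precision.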
IV-D Quantitative Results
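We first describe the following five variants of our FA block.
FA-1: The fully connected layers with parameters $\theta$ and $\phi$ are removed from the FA block.
FA-2: The softmax function is replaced by multiplication with $1/J$, where $J$ is the number of objects in a frame.
FA-3: The $\tanh$ activation function is replaced by ReLU.
FA-4: Instead of using dot product similarity as the relation function, we use the relation network module proposed by Santoro et al. to find the attention weights. It is given as

$$e_t^{ij} = W_r \left[\theta(a_t^{i}) \,;\, \phi(a_t^{j})\right] + b_r$$

where $\theta(a_t^{i})$ and $\phi(a_t^{j})$ are the transformed features of objects $i$ and $j$, and $[\cdot\,;\cdot]$ is a concatenation operation. $W_r$ and $b_r$ are learnable parameters of a fully connected layer that projects the concatenated vector to a scalar value.
FA-final: This is our final network with fully connected layers $\theta$ and $\phi$, dot product similarity, softmax, and the $\tanh$ activation function.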
The quantitative results are given in Table I. The experimental results show that our method anticipates accidents earlier than several recent methods. L-RA and L-RAI are two variants of  as stated in their work. It is interesting to note that our feature aggregation method predicts accidents earlier than dynamic parameter prediction  and the adaptive loss based method of Suzuki et al., even though our mAP value is lower than that of these methods. This is because our work primarily focuses on achieving a higher ATTA value for practical driving applications, which must predict most accidents early in order to avoid casualties. The adaptive loss strategy of Suzuki et al. is aimed at early anticipation of accidents, but we observe that our method anticipates accidents earlier with a simple exponential cross entropy function. Results for ,  and  are taken from  as the evaluation protocol is the same. Fig. 5 shows the comparison between our method and different state-of-the-art approaches. It can be seen that FA-final has the highest ATTA value with a mean average precision (mAP) that is comparable with other approaches.
Fig. 7 shows precision-recall curves for the different variants of the FA block. The graph indicates that FA-final has the highest area under the curve, whereas FA-4 gives the lowest. This shows that using dot product similarity in an embedding space as the appearance relation function gives better results than the concatenation operation. The other three variants have curves nearly identical to that of FA-final.
IV-E Qualitative Results
We show qualitative results of accident anticipation in Figs. 6, 8, and 9. The results show that our method is able to differentiate between negative and positive videos. For a negative video, as seen in Fig. 8, the anticipation probability does not exceed the threshold, indicating no accident. Fig. 9 shows a false alarm that was raised for a negative video. It can be attributed to the fact that objects were too close to each other in the video sequence.
This paper presents a novel Feature Aggregation (FA) block for the anticipation of road accidents. The FA block refines each object's features by using the appearance relation between different objects in a given frame. We showed that using the FA block along with an LSTM provides complementary information from both the spatial and temporal domains of a video sequence. The quantitative and qualitative results on the challenging Street Accident (SA) dataset show that our method anticipates accidents earlier than state-of-the-art methods. As future work, we plan to incorporate other relation information between objects in a scene for accident anticipation.
-  R. Herzig, E. Levi, H. Xu, H. Gao, E. Brosh, X. Wang, A. Globerson, and T. Darrell, “Spatio-temporal action graph networks,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
-  K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. Carlos Niebles, and M. Sun, “Agent-centric risk assessment: Accident anticipation and risky region localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2222–2230.
-  F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, “Anticipating accidents in dashcam videos,” in Asian Conference on Computer Vision. Springer, 2016, pp. 136–153.
-  T. Suzuki, H. Kataoka, Y. Aoki, and Y. Satoh, “Anticipating traffic accidents with adaptive loss and large-scale incident db,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  Y. Yang, D. Krompass, and V. Tresp, “Tensor-train recurrent neural networks for video classification,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 3891–3900.
-  S. Chen, X. Wang, Y. Tang, X. Chen, Z. Wu, and Y.-G. Jiang, “Aggregating frame-level features for large-scale video classification,” arXiv preprint arXiv:1707.00803, 2017.
-  H. Ye, Z. Wu, R.-W. Zhao, X. Wang, Y.-G. Jiang, and X. Xue, “Evaluating two-stream cnn for video classification,” in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015, pp. 435–442.
-  Y. Tang, L. Ma, and L. Zhou, “Hallucinating optical flow features for video classification,” arXiv preprint arXiv:1905.11799, 2019.
-  A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” arXiv preprint arXiv:1706.06905, 2017.
-  X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen, “Attention clusters: Purely attention based local feature integration for video classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7834–7843.
-  Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, “Unsupervised traffic accident detection in first-person videos,” arXiv preprint arXiv:1903.00618, 2019.
-  X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
-  D. Singh and C. K. Mohan, “Deep spatio-temporal representation for detection of road accidents using stacked autoencoder,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 3, pp. 879–887, 2018.
-  M. Xu, M. Gao, Y.-T. Chen, L. S. Davis, and D. J. Crandall, “Temporal recurrent networks for online action detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5532–5541.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 1943–1955, 2015.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
-  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971.
-  G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with r* cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1080–1088.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Advances in neural information processing systems, 2017, pp. 4967–4976.