Spatio-Temporal Scene-Graph Embedding for Autonomous Vehicle Collision Prediction

In autonomous vehicles (AVs), early warning systems rely on collision prediction to ensure occupant safety. However, state-of-the-art methods using deep convolutional networks either fail at modeling collisions or are too expensive/slow, making them less suitable for deployment on AV edge hardware. To address these limitations, we propose sg2vec, a spatio-temporal scene-graph embedding methodology that uses Graph Neural Network (GNN) and Long Short-Term Memory (LSTM) layers to predict future collisions via visual scene perception. We demonstrate that sg2vec predicts collisions 8.11% more accurately and 39.07% earlier than the state-of-the-art method on synthesized datasets, and 29.47% more accurately on a challenging real-world collision dataset. We also show that sg2vec is better than the state-of-the-art at transferring knowledge from synthetic datasets to real-world driving datasets. Finally, we demonstrate that sg2vec performs inference 9.3x faster with an 88.0% smaller model, 32.4% less power, and 92.8% less energy than the state-of-the-art method on the industry-standard Nvidia DRIVE PX 2 platform, making it more suitable for implementation on the edge.




I Introduction

The synergy of Artificial Intelligence (AI) and the Internet of Things (IoT) has accelerated the advancement of Autonomous Vehicle (AV) technologies, which is expected to revolutionize transportation by reducing traffic and improving road safety [33, 6]. However, recent reports of AV crashes suggest that there are still significant limitations. For example, multiple fatal Tesla Autopilot crashes can primarily be attributed to perception system failures [23, 24]. Additionally, the infamous fatal collision between an Uber self-driving vehicle and a pedestrian can be attributed to perception and prediction failures by the AV [22]. These accidents (among others) have eroded public trust in AVs, and roughly half of the public or more have expressed mistrust in AVs [10]. Current statistics indicate that perception and prediction errors were factors in over 40% of driver-related crashes between conventional vehicles [19]. However, a significant number of reported AV collisions are also the result of these errors [29, 36]. Thus, in this paper, we aim to improve the safety and acceptance of AVs by incorporating scene-graphs into the perception pipeline of collision prediction systems to improve scene understanding.

Over the past several years, automotive manufacturers have begun equipping consumer vehicles with statistics-based collision avoidance systems based on calculated Single Behavior Threat Metrics (SBTMs) such as Time to Collision (TTC), Time to React (TTR), etc. [15, 9]. However, these methods lack robustness since they make significant assumptions about the behavior of vehicles on the road. A very limiting assumption they make is that vehicles do not diverge from their current trajectories [9]. SBTMs can also fail in specific scenarios. For example, TTC can fail when following a vehicle at the same velocity within a very short distance [9]. As a result, these methods are less capable of generalizing and can perform poorly in complex road scenarios. Moreover, to reduce false positives, these systems are designed to respond at the last possible moment [31]. Under such circumstances, the AV control system can fail to take timely corrective actions [25] if the system fails to predict a collision or estimates the TTC inaccurately.

More effective collision prediction methods using Deep Learning (DL) have also been proposed in the literature. However, these approaches can be limited because they do not explicitly capture the relationships between the objects in a traffic scene. Understanding these relationships could be critical as it is suggested that a human’s ability to understand complex scenarios and identify potential risks relies on cognitive mechanisms for representing structure and reasoning about inter-object relationships [5]. These models also require large datasets that are often costly or unsafe to generate. Synthetic datasets are typically used to augment the limited real-world data to train the models in such cases [11]. However, these trained models must then be able to transfer the knowledge gained from synthetic datasets to real-world driving scenarios. Furthermore, DL models contain millions of parameters and require IoT edge devices with significant computational power and memory to run efficiently. Likewise, hosting these models on the cloud is infeasible because it requires persistent low-latency internet connections.

In summary, the key research challenges associated with autonomous vehicle collision prediction are:

  1. Capturing complex relationships between road participants.

  2. Detecting future collisions early enough such that the AV can take corrective actions.

  3. Generalizing to a wide range of traffic scenarios.

  4. Developing algorithms that can run efficiently on automotive IoT edge devices.

Fig. 1: How sg2vec predicts collisions using scene-graphs. Each node’s color indicates its attention score (importance to the collision likelihood) from orange (high) to green (low).

In this work, we propose the use of scene-graphs to represent road scenes and model inter-object relationships, as we hypothesize that this will improve perception and scene understanding. Recently, several works have shown that graph-based methods that capture and model complex relationships between entities can improve performance at high-level tasks such as behavior classification [20, 18] and semantic segmentation [16]. Scene-graphs are used in many domains to abstract the locations and attributes of objects in a scene and model inter-object relationships [18, 20, 21]. In our prior work [40], we demonstrated that a scene-graph sequence classification approach can better assess the subjective risk of driving clips compared to a conventional CNN+LSTM-based approach; our approach could also better generalize and transfer knowledge across datasets. This paper extends the approach presented in [40] to collision prediction by enabling the prediction of future road states via changes to the temporal modeling components of the architecture and changes in the problem formulation. The scene-graph representation we propose represents traffic objects as nodes and the relationships between them as edges. The novelty of our scene-graph representation lies in our graph construction technique that is specifically designed for higher-level AV decisions such as collision prediction. One of our key contributions is showing that a graph-based intermediate representation can be very effective and efficient for higher-level AV tasks directly related to driving decisions.

Our proposed methodology for collision prediction, sg2vec, is shown in Figure 1. It combines the scene-graph representation with a graph-embedding architecture to generate a sequence of scene-graph embeddings for the sequence of visual inputs perceived by an AV. The graph embedding technique we use is based on the core MR-GCN framework [20] adapted for the collision prediction problem. The sequence of graph embeddings is then input to a Long Short-Term Memory (LSTM) network to make the final prediction on the possibility of a future collision. To the best of our knowledge, our work is the first to propose using scene-graphs for early collision prediction.

Our paper makes the following key research contributions:

  1. We demonstrate that our sg2vec collision prediction methodology significantly outperforms the current state of the art on simulated lane-change datasets and a very challenging real-world collision dataset containing a wide range of driving actions, collision types, and weather/road conditions.

  2. We demonstrate that sg2vec can transfer knowledge gained from simulated data to real-world driving data more effectively than the state-of-the-art method.

  3. We show that sg2vec performs faster inference and requires less power than the state-of-the-art method on the industry-standard Nvidia DRIVE PX 2 autonomous driving hardware platform, used in all 2016-2018 Tesla models for their Autopilot system [1].

II Related Work

II-A Early Collision Prediction

Since collision prediction is key to the safety of AVs, a wide range of solutions have been proposed by academia and industry. As mentioned earlier, current consumer vehicles use statistics-based SBTMs for collision prediction but can perform poorly in complex situations [15, 9] or react too late to avoid collisions [25, 31]. Expanding on these approaches, companies like Mobileye and Nvidia have proposed more comprehensive mathematical models for ensuring AV safety, namely Responsibility-Sensitive Safety (RSS) [30] and Nvidia Safety Force Field [26], respectively. However, these models are heavily rule-based and can thus be fragile in complex situations with high uncertainty. Additionally, computing future trajectory constraints with RSS is non-trivial and can require vehicle-specific calibration [12].

Model-based probabilistic and deep learning approaches for collision prediction have also been proposed. For example, [2] proposes a model-based probabilistic technique that uses the roadway geometry, ego trajectory, and position/velocity of road objects to predict future object positions. However, this model is highly conservative and is likely to have a high false-positive rate. Similarly, [35] and [41] use model-based approaches but require significant domain knowledge about the driving scene, such as road geometry information as well as accurate vehicle position and velocity information. [34] proposes a deep learning collision prediction approach. Still, due to its use of pre-processed trajectory data captured from cameras overlooking a highway, it is not ego-centric and cannot be practically used for on-vehicle collision prediction. In a different approach, [32] proposes a Deep Predictive Model (DPM) that uses a Bayesian Convolutional LSTM for collision risk assessment, where image data, vehicle telemetry data, and driving inputs are all factors in the risk assessment decision. However, this approach was only evaluated on simulated street scenes containing two vehicles and no other dynamic objects. Thus, DPM’s performance may suffer when evaluated on more complex road scenarios.

In contrast to these existing works, we propose sg2vec, which captures structural and relational information of a road scene in a scene-graph representation and computes a spatio-temporal embedding to predict collisions. Additionally, we perform experiments that were not done in many prior works, such as evaluating each model’s capability to transfer knowledge, efficiency on AV hardware, performance on a complex real-world crash dataset, and ability to predict collisions early. We primarily compare our methodology with the DPM as it is the state-of-the-art data-driven collision prediction framework for AVs that considers both spatial and temporal factors. Although the DPM uses multiple modalities for sensing, the results in [32] show that, using only the image sensing modality, it achieves an accuracy of 81.95%, which is just 0.24% lower than with all modalities. In this work, we compare our proposed sg2vec methodology and the DPM on image-only datasets, which is fair because the DPM’s performance does not vary much with the inclusion of other modalities.

II-B AV Scene-Graphs and Optimization Techniques

Several works have proposed graph-based methods for scene understanding. For example, [20] proposed a multi-relational graph convolutional network (MR-GCN) that uses both spatial and temporal information to classify vehicle driving behavior. Similarly, in [18], an Ego-Thing and Ego-Stuff graph are used to model and classify the ego vehicle’s interactions with moving and stationary objects, respectively. In our prior work, we demonstrated that a scene-graph sequence embedding approach assesses driving risk better than the state-of-the-art CNN-LSTM approach [40]. In [40], we utilized an architecture consisting of MR-GCN layers for spatial modeling and an LSTM with attention for temporal modeling; however, this architecture was only capable of performing binary sequence-level classification over a complete video clip. Thus, although our prior architecture could accurately assess the subjective risk of complete driving sequences, it was not capable of predicting the future state of a scene.

Current autonomous driving systems consume a substantial amount of power (up to 500 Watts for the Nvidia DRIVE AGX Pegasus), demanding more robust cooling and power delivery mechanisms. Thus, many have tried to optimize AV tasks for efficiency without sacrificing performance. Existing approaches have proposed methods for jointly optimizing power consumption and latency for localization [3], perception [4], and control [14]. However, to the best of our knowledge, no work has explored this optimization for AV safety systems, such as collision prediction systems.

Fig. 2: An illustration of our scene-graph extraction process.
Fig. 3: An illustration of sg2vec’s architecture.

III Scene-Graph Embedding Methodology

In sg2vec, we formulate the problem of collision prediction as a time-series classification problem where the goal is to predict if a collision will occur in the near future. Our goal is to accurately model the spatio-temporal function $f$, where

$$Y_n = f(\{I_1, \ldots, I_{n-1}, I_n\}), \quad Y_n \in \{0, 1\}, \quad n \geq 2,$$

where $Y_n = 1$ implies a collision in the near future and $Y_n = 0$ otherwise. Here the variable $I_n$ denotes the image captured by the on-board camera at time $n$. The interval between each frame varies with the camera sampling rate.

sg2vec consists of two parts (Figure 3): (i) the scene-graph extraction, and (ii) collision prediction through spatio-temporal embedding, described in Section III-A and Section III-B, respectively.

III-A Scene-Graph Extraction

The first step of our methodology is the extraction of scene-graphs for the images of a driving scene. The extraction pipeline forms the scene-graph for an image as in [38, 37] by first detecting the objects in the image and then identifying their relations based on their attributes. The difference from prior works lies in the construction of a scene-graph that is designed for higher-level AV decisions. We propose extracting a minimal set of relations such as directional relations and proximity relations. From our design space exploration, we found that adding many relation edges to the scene-graph adds noise and impacts convergence, while using too few relation types reduces our model’s expressivity. The best approach we found across applications involves constructing mostly ego-centric relations for a moderate range of relation types. Figure 2 shows an example of the graph extraction process.

We denote the extracted scene-graph for the $n$-th frame $I_n$ by $G_n = \{O_n, A_n\}$. Each scene-graph $G_n$ is a directed, heterogeneous multi-graph, where $O_n$ denotes the nodes and $A_n$ is the adjacency matrix of the graph $G_n$. As shown in Fig. 2, nodes represent the identified objects such as lanes, roads, traffic signs, vehicles, pedestrians, etc., in a traffic scene. The adjacency matrix $A_n$ indicates the pair-wise relations between each object in $O_n$. The extraction pipeline first identifies the objects by using Mask R-CNN [13]. Then, it generates an inverse perspective mapping (also known as a “birds-eye view” projection) of the image to estimate the locations of objects relative to the ego car, which are used to construct the pair-wise relations between objects in $A_n$. For each camera angle, we calibrate the birds-eye view projection settings using known fixed distances, such as the lane length and width, as defined by the highway code. This enables us to estimate longitudinal and lateral distances accurately in the projection. For datasets captured by a single vehicle, this step only needs to be performed once. However, for datasets with a wide range of camera angles such as the 620-dash dataset introduced later in the paper, this process needs to be performed once per vehicle. With a human operator, we found that this calibration step takes approximately 1 minute per camera angle on average.

The extraction pipeline identifies three kinds of pair-wise relations: proximity relations (e.g., Visible, Near, Very_Near, etc.), directional relations (e.g., Front_Left, Rear_Right, etc.), and belonging relations (e.g., car_1 isIn left_lane). Two objects are assigned a proximity relation from {Near_Collision (4 ft.), Super_Near (7 ft.), Very_Near (10 ft.), Near (16 ft.), Visible (25 ft.)} provided the objects are physically separated by a distance that is within that relation’s threshold. A directional relation from {Front_Left, Left_Front, Left_Rear, Rear_Left, Rear_Right, Right_Rear, Right_Front, Front_Right} is assigned to a pair of objects, in this case between the ego-car and another car in the view, based on their relative orientation and only if they are within the Near threshold distance from one another. Additionally, the isIn relation identifies which vehicles are on which lanes (see Fig. 2). We use each vehicle’s horizontal displacement relative to the ego vehicle to assign vehicles to either the Left Lane, Middle Lane, or Right Lane using the known lane width. Our abstraction only considers three-lane areas, and, as such, we map vehicles in all left lanes and all right lanes to the same Left Lane node and Right Lane node, respectively. If a vehicle overlaps two lanes (i.e., during a lane change), it is mapped to both lanes.
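The proximity and isIn rules can be illustrated with a minimal Python sketch. The distance thresholds come from the text; everything else is an assumption made for illustration: the function names, the 12 ft lane width, the 6 ft vehicle width, the choice to return only the tightest matching proximity label, and the convention that the ego vehicle sits at the center of the Middle Lane. Directional relations are left out because the text does not give their angular boundaries.

```python
import math

# Proximity thresholds in feet, taken from the text, ordered tightest first.
PROXIMITY_THRESHOLDS_FT = [
    ("Near_Collision", 4.0),
    ("Super_Near", 7.0),
    ("Very_Near", 10.0),
    ("Near", 16.0),
    ("Visible", 25.0),
]

LANE_WIDTH_FT = 12.0  # assumption: illustrative lane width, not from the paper


def proximity_relation(ego_xy, other_xy):
    """Return the tightest proximity label whose threshold covers the distance.

    Assumption: only one label is assigned per pair of objects.
    """
    distance = math.dist(ego_xy, other_xy)
    for label, threshold in PROXIMITY_THRESHOLDS_FT:
        if distance <= threshold:
            return label
    return None  # farther than Visible: no proximity edge


def lane_relations(lateral_offset_ft, vehicle_width_ft=6.0):
    """Map a vehicle to Left/Middle/Right Lane nodes from its lateral offset.

    Assumption: the ego vehicle is centered in the Middle Lane (offset 0) and
    negative offsets are to the left. A vehicle that straddles a lane boundary
    is mapped to both lanes, as described in the text.
    """
    half_lane = LANE_WIDTH_FT / 2.0
    left_edge = lateral_offset_ft - vehicle_width_ft / 2.0
    right_edge = lateral_offset_ft + vehicle_width_ft / 2.0
    lanes = []
    if left_edge < -half_lane:
        lanes.append("Left Lane")
    if right_edge > -half_lane and left_edge < half_lane:
        lanes.append("Middle Lane")
    if right_edge > half_lane:
        lanes.append("Right Lane")
    return lanes
```

Because every lane beyond the adjacent one collapses into the same node, a vehicle two lanes to the right still maps to Right Lane in this sketch.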

III-B Collision Prediction

As shown in Figure 3, in our collision prediction methodology, each image $I_n$ is first converted into a scene-graph $G_n$ with the pipeline mentioned in Section III-A. Each node $v \in O_n$ is initialized by a one-hot vector (embedding), denoted by $h_v^{(0)}$. Then, the MR-GCN [28] layers are used to update these embeddings via the edges in the adjacency matrix $A_n$. Specifically, the $l$-th MR-GCN layer computes the node embedding for each node $v$, denoted as $h_v^{(l)}$, as follows:

$$h_v^{(l)} = \Phi_0 \cdot h_v^{(l-1)} + \sum_{r \in A_n} \sum_{u \in N_r(v)} \frac{1}{|N_r(v)|} \Phi_r \cdot h_u^{(l-1)},$$

where $N_r(v)$ denotes the set of neighbors of node $v$ with respect to the relation $r$, $\Phi_r$ is a trainable relation-specific transformation for relation $r$, and $\Phi_0$ is the self-connection for each node $v$ that accounts for the influence of $h_v^{(l-1)}$ on $h_v^{(l)}$ [28]. After multiple MR-GCN layers, the output of each layer is concatenated to produce the final embedding for each node, denoted by $H_v = \mathrm{CONCAT}(\{h_v^{(l)}\} \mid l = 0, 1, \ldots, L)$, where $L$ is the index of the last layer.
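As a rough illustration of the layer update described above, the following pure-Python sketch applies a self-connection transform to each node and adds, for every relation, the averaged and transformed embeddings of that node's neighbors. The function names, the dictionary-based graph layout, and the mean normalization over neighbors are assumptions borrowed from the standard relational GCN formulation; the activation function and dropout are omitted, and this is not the authors' implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mrgcn_layer(h_prev, edges_by_relation, phi_self, phi_by_relation):
    """One relational graph-convolution update (illustrative sketch).

    h_prev: dict node -> embedding (list of floats) from the previous layer.
    edges_by_relation: dict relation -> list of (source, target) edges;
        messages flow from source to target.
    phi_self: weight matrix for the self-connection.
    phi_by_relation: dict relation -> relation-specific weight matrix.
    """
    # Start with the self-connection term for every node.
    h_next = {v: matvec(phi_self, h) for v, h in h_prev.items()}
    for relation, edges in edges_by_relation.items():
        # Group incoming neighbors of each node under this relation.
        neighbors = {}
        for source, target in edges:
            neighbors.setdefault(target, []).append(source)
        for v, sources in neighbors.items():
            norm = 1.0 / len(sources)  # assumption: mean over neighbors
            for u in sources:
                message = matvec(phi_by_relation[relation], h_prev[u])
                h_next[v] = [a + norm * m for a, m in zip(h_next[v], message)]
    return h_next


# Toy usage: two nodes with one-hot embeddings and a single "Near" edge.
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = {"ego": [1.0, 0.0], "car_1": [0.0, 1.0]}
h1 = mrgcn_layer(h0, {"Near": [("car_1", "ego")]}, identity, {"Near": identity})
```

In this toy graph only the ego node receives a message, so its embedding mixes in car_1's one-hot vector while car_1 keeps its own. Concatenating `h0` and `h1` per node would give the multi-layer embedding that the text describes.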

The final embeddings for scene-graph $G_n$, denoted by $X_n^{prop}$, are then passed through a self-attention graph pooling (SAGPooling) layer that filters out irrelevant nodes from the graph, creating the pooled set of node embeddings $X_n^{pool}$ and their edges $A_n^{pool}$. In this layer, we use a graph convolution layer to predict the score $\alpha = \mathrm{SCORE}(X_n^{prop}, A_n)$ and then use $\alpha$ to perform top-k filtering to filter out the irrelevant nodes in the scene-graph [17].
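The top-k filtering step can be sketched as follows, assuming the per-node attention scores have already been predicted by the scoring layer. The function name, the edge-triple layout, and the rule of keeping `ceil(ratio * N)` nodes are illustrative assumptions; a full SAGPooling layer also rescales the surviving embeddings by their scores, which is omitted here.

```python
import math


def top_k_pool(node_scores, edges, ratio=0.25):
    """Keep the highest-scoring nodes and the edges that connect them.

    node_scores: dict node -> attention score predicted for that node.
    edges: list of (source, relation, target) triples.
    ratio: fraction of nodes to keep (at least one node is always kept).
    """
    keep_count = max(1, math.ceil(ratio * len(node_scores)))
    ranked = sorted(node_scores, key=node_scores.get, reverse=True)
    kept = set(ranked[:keep_count])
    # An edge survives only if both of its endpoints survive.
    pooled_edges = [(s, r, t) for s, r, t in edges if s in kept and t in kept]
    return kept, pooled_edges
```

With four nodes and a ratio of 0.25, only the single highest-scoring node remains, which shows how aggressively this layer can prune a small scene-graph.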

Then, for each scene-graph $G_n$, the corresponding $X_n^{pool}$ is passed through the graph readout layer that condenses the node embeddings (using operations such as sum, mean, max, etc.) to a single graph embedding $h_{G_n}$. Then, this spatial embedding is passed to the temporal model (LSTM) to generate a spatio-temporal embedding $z_n$ as follows:

$$z_n, s_n = \mathrm{LSTM}(h_{G_n}, s_{n-1}).$$

The hidden state $s_n$ of the LSTM is updated for each timestamp $n$. Lastly, each spatio-temporal embedding $z_n$ is passed through a Multi-Layer Perceptron (MLP) that outputs each class’s confidence value. The two outputs of the MLP are compared, and $Y_n$ is set to the index of the class with the greater confidence value (0 for no-collision or 1 for collision). During training, we calculate the cross-entropy loss between each set of non-binarized outputs $\hat{Y}_n$ and the corresponding labels for backpropagation.
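The readout and the final classification step can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the readout is a sum (one of the options the text lists), the LSTM step between readout and MLP is omitted, and the two logits stand in for the MLP outputs for a single frame.

```python
import math


def add_readout(node_embeddings):
    """Sum node embeddings element-wise into one graph embedding."""
    return [sum(values) for values in zip(*node_embeddings)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def predict_and_loss(logits, label):
    """Return (predicted class, cross-entropy loss) for one frame.

    logits: the two classifier outputs, ordered [no-collision, collision].
    label: 0 for no-collision, 1 for collision.
    """
    log_probs = log_softmax(logits)
    # The prediction is the index of the class with the greater confidence.
    predicted = max(range(len(log_probs)), key=log_probs.__getitem__)
    # Cross-entropy on the non-binarized outputs: negative log-probability
    # of the true class.
    loss = -log_probs[label]
    return predicted, loss
```

Computing the loss on the log-probabilities rather than on the binarized prediction keeps the training signal differentiable, which is why the text distinguishes the two.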

IV Experimental Results

This section provides extensive experimental results to demonstrate sg2vec’s performance, efficiency, and transferability compared to the state-of-the-art collision prediction model, DPM [32]. For sg2vec, we used 2 MR-GCN layers, each of size 64, one SAGPooling layer with a pooling ratio of 0.25, one add-readout layer, one LSTM layer with hidden size 20, one MLP layer with an output of size 2, and a LogSoftmax to generate the final confidence value for each class. For the DPM, we followed the architecture used in [32], which uses one 64x64x5 Convolutional LSTM (ConvLSTM) layer, one 32x32x5 ConvLSTM layer, one 16x16x5 ConvLSTM layer, one MLP layer with output size 64, one MLP layer with output size 2, and a Softmax to generate the final confidence value. For both models, we used a dropout of 0.1 and ReLU activation. The learning rates were 0.00005 for sg2vec and 0.0001 for DPM. We ran the experiments shown in Sections IV-B and IV-C on a Windows PC with an AMD Ryzen Threadripper 1950X processor, 16 GB RAM, and an Nvidia GeForce RTX 2080 Super GPU.

Fig. 4: Examples of driving scenes from our a) synthetic datasets, b) typical real-world dataset, and c) complex real-world dataset. In a), all driving scenes occur on highways with the same camera position and clearly defined road markings; lighting and weather are dynamically simulated in CARLA. In b) driving scenes occur on multiple types of clearly marked roads but lighting, camera angle, and weather are consistent across scenes. c) contains a much broader range of camera angles as well as more diverse weather and lighting conditions, including rain, snow, and night-time driving; it also contains a large number of clips on unpaved or unmarked roadways, as shown.

IV-A Dataset Preparation

We prepared three types of datasets for our experiments: (i) synthesized datasets, (ii) a typical real-world driving dataset, and (iii) a complex real-world driving dataset. Examples from each dataset are shown in Figure 4. Our synthetic datasets focus on the highway lane change scenario as it is a common AV task. To evaluate the transferability of each model from synthetic datasets to real-world driving, we prepared a typical real-world dataset containing lane-change driving clips. Finally, we prepared the complex real-world driving dataset to evaluate each model’s performance on a challenging dataset containing a broad spectrum of collision types, road conditions, and vehicle maneuvers. All datasets were collected at a 1280x720 resolution, and each clip spans 1-5 seconds.

IV-A1 Synthetic Datasets

To synthesize the datasets, we developed a tool using CARLA [11], an open-source driving simulator, and CARLA Scenario Runner to generate lane change video clips with/without collisions. We generated a wide range of simulated lane changes with different numbers of cars, pedestrians, weather and lighting conditions, etc. We also customized each vehicle’s driving behavior, such as their intended speed, probability of ignoring traffic lights, or the chance of avoiding collisions with other vehicles. We generated two synthetic datasets: a 271-syn dataset and a 1043-syn dataset, containing 271 and 1,043 video clips, respectively. These datasets have no-collision:collision label distributions of 6.12:1 and 7.91:1, respectively. In addition, we sub-sampled the 1043-syn dataset to create 306-syn: a balanced dataset that has a 1:1 distribution. Our synthetic scene-graph datasets and our source code are open-source and available online.

IV-A2 Typical Real-World Driving Dataset

This dataset, denoted as 571-honda, is a subset of the Honda Driving Dataset (HDD) [27] containing 571 lane-change video clips from real-world driving with a no-collision:collision label distribution of 7.21:1. The HDD was recorded on the same vehicle during mostly safe driving in the California Bay Area.

IV-A3 Complex Real-World Driving Dataset

Our complex real-world driving dataset, denoted as 620-dash, contains very challenging real-world collision scenarios drawn from the Detection of Traffic Anomaly dataset [39]. This dataset contains a wide range of drivers, car models, driving maneuvers, weather/road conditions, and collision types, as recorded by on-board dashboard cameras. Since the original dataset contains only collision clips, we prepared 620-dash by splitting each clip in the original dataset into two parts: (i) the beginning of the clip until 1 second before the collision, and (ii) from 1 second before the collision until the end of the collision. We then labeled part (i) as ‘no-collision’ and part (ii) as ‘collision.’ The 620-dash dataset contains 315 collision video clips and 342 non-collision driving clips.
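The clip-splitting rule can be written as a small helper. This is a hedged sketch: the function name, the frame-index representation, and the `fps` and `collision_frame` parameters are hypothetical, and part (ii) is assumed to run to the last frame of the clip.

```python
def split_collision_clip(num_frames, collision_frame, fps):
    """Split one collision clip into no-collision and collision frame ranges.

    num_frames: total number of frames in the clip.
    collision_frame: index of the frame where the collision starts.
    fps: frames per second of the clip.
    Returns two (start, end) ranges, where end is exclusive.
    """
    # The split point lies one second (fps frames) before the collision.
    split_at = max(0, collision_frame - int(round(fps * 1.0)))
    no_collision = (0, split_at)        # part (i): start until 1 s before
    collision = (split_at, num_frames)  # part (ii): 1 s before until the end
    return no_collision, collision
```

If the collision happens within the first second of a clip, part (i) is empty in this sketch and the clip would contribute only a collision sample.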

IV-A4 Labeling and Pre-Processing

We labeled the synthetic datasets and the 571-honda dataset using human annotators. The final label assigned to a clip is the average of the labels assigned by the human annotators, rounded to either 0 (no collision) or 1 (collision/near collision). Each frame in a video clip is given a label identical to the entire clip’s label to train the model to identify the preconditions of a future collision.

For sg2vec, all the datasets were pre-processed using the scene-graph extraction pipeline mentioned in Section III-A to construct the scene-graphs for each video clip. For a given sequence, sg2vec can leverage the full history of prior frames for each new prediction. For the DPM, the datasets were pre-processed to match the input format used in its original implementation [32]. Thus, the DPM uses 64x64 grayscale versions of the clips in the datasets turned into sets of sub-sequences $J$ for a clip of length $l$, defined as follows:

$$J = \{(I_i, I_{i+1}, I_{i+2}, I_{i+3}, I_{i+4}) \mid i \in [1, l-4]\}.$$

Since DPM only uses five prior frames to make each prediction, we also present results for sg2vec using the same length of history, denoted as sg2vec (5-frames) in the results.
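The five-frame history can be illustrated with a simple sliding window over a clip. This is a minimal sketch with a hypothetical function name; it assumes that the windows overlap and advance one frame at a time, and the frames may be any objects (indices are used in the test for brevity).

```python
def five_frame_windows(frames, window=5):
    """Return every run of `window` consecutive frames, in order."""
    return [frames[i:i + window] for i in range(len(frames) - window + 1)]
```

A clip shorter than the window yields no sub-sequences at all, whereas a model that consumes the full frame history can still produce a prediction for such a clip.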

IV-B Collision Prediction Performance

We evaluated sg2vec and the DPM using classification accuracy, area under the ROC curve (AUC) [7], and Matthews Correlation Coefficient (MCC) [8]. MCC is considered a balanced measure of performance for binary classification even on datasets with significant class imbalances. The MCC score outputs a value between -1.0 and 1.0, where 1.0 corresponds to a perfect classifier, 0.0 to a random classifier, and -1.0 to an always incorrect classifier. Although class re-weighting helps compensate for the dataset imbalance during training, classification accuracy is typically less reliable for imbalanced datasets, so the primary metric we use to compare the models is MCC. We used stratified 5-fold cross-validation to produce the final results shown in Table I and Figure 5.
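To make the MCC metric concrete, the sketch below computes it from the four confusion-matrix counts using the standard definition. The function name and the convention of returning 0.0 when the denominator vanishes are assumptions for illustration, not details taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator
```

For instance, a classifier that always predicts no-collision on a set with 90 safe clips and 10 collision clips reaches 90% accuracy, yet `mcc(0, 90, 0, 10)` is 0.0, which is why MCC is the more informative metric on imbalanced data.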

Dataset    Model              Accuracy  AUC     MCC
271-syn    sg2vec (5-frames)  0.8979    0.9541  0.5362
271-syn    sg2vec             0.8812    0.9457  0.5145
271-syn    DPM                0.8733    0.8939  0.2160
306-syn    sg2vec (5-frames)  0.7946    0.8653  0.5790
306-syn    sg2vec             0.8372    0.9091  0.6812
306-syn    DPM                0.6846    0.6881  0.3677
1043-syn   sg2vec (5-frames)  0.9142    0.9623  0.5323
1043-syn   sg2vec             0.9095    0.9477  0.5385
1043-syn   DPM                0.8834    0.9175  0.2912
620-dash   sg2vec (5-frames)  0.6534    0.7113  0.3053
620-dash   sg2vec             0.7007    0.7857  0.4017
620-dash   DPM                0.4890    0.4717  -0.0366
TABLE I: Classification accuracy, AUC, and MCC for sg2vec (Ours) and DPM.

IV-B1 Synthetic Datasets

The performance of sg2vec and the DPM on our synthetic datasets is shown in Table I. We find that our sg2vec achieves higher accuracy, AUC, and MCC on every dataset, even when only using five prior frames as input. In addition to predicting collisions more accurately, sg2vec also infers 5.5x faster than the DPM on average. We attribute this to the differences in model complexity between our sg2vec architecture and the much larger DPM model. Interestingly, sg2vec (5-frames) achieves slightly better accuracy and AUC than sg2vec on the imbalanced datasets and slightly lower overall performance on the balanced datasets. This is likely because the large number of safe lane changes in the imbalanced datasets adds noise during training and makes the full-history version of the model perform slightly worse. However, the full model can learn long-tail patterns for collision scenarios and performs better on the balanced datasets.

The DPM achieves relatively high accuracy and AUC on the imbalanced 271-syn and 1043-syn datasets, but suffers significantly on the balanced 306-syn dataset. This drop indicates that the DPM could not identify the minority class (collision) well and tended to over-predict the majority class (no-collision). In terms of MCC, the DPM scores higher on the 306-syn dataset than on the other datasets. This is because the 306-syn dataset has a balanced class distribution, which could enable the DPM to improve its prediction accuracy on the collision class.

In contrast, the sg2vec methodology performs well on both balanced and imbalanced synthetic datasets with an average MCC of 0.5860, an average accuracy of 87.97%, and an average AUC of 0.9369. Since MCC is scaled from -1.0 to 1.0 (a range of 2.0), normalizing the difference in average MCC by this range shows that sg2vec achieves a 14.72% higher average MCC score than the DPM model.
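The 14.72% figure can be reproduced with a short calculation. This sketch assumes that the improvement is the difference between the two average MCC scores divided by the width of the MCC range; the DPM values come from Table I and the sg2vec average from the text, while the variable names are illustrative.

```python
# MCC values for the DPM on the three synthetic datasets, from Table I.
dpm_mcc = [0.2160, 0.3677, 0.2912]
sg2vec_avg_mcc = 0.5860  # average reported in the text

dpm_avg_mcc = sum(dpm_mcc) / len(dpm_mcc)
mcc_range = 1.0 - (-1.0)  # MCC spans [-1.0, 1.0]
improvement_pct = 100.0 * (sg2vec_avg_mcc - dpm_avg_mcc) / mcc_range

print(f"DPM average MCC: {dpm_avg_mcc:.4f}")                    # 0.2916
print(f"Range-normalized improvement: {improvement_pct:.2f}%")  # 14.72%
```

Dividing by the range rather than by the DPM's own score keeps the comparison meaningful even when the baseline MCC is close to zero or negative.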

The results from our sg2vec ablation study are shown in Table III and support our hypothesis that both spatial modeling with MR-GCN and temporal modeling with LSTM are core to sg2vec’s collision prediction performance. However, the MR-GCN appears to be slightly more critical to performance than the LSTM. Interestingly, the choice of pooling layer (no pooling, Top-K pooling, or SAGPooling) does not seem to significantly affect performance at this task as long as an LSTM is used; when no LSTM is used, SAGPooling presents a clear performance improvement.

IV-B2 Complex Real-World Dataset

The performance of both models drops significantly on the highly complex real-world 620-dash dataset. This drop is expected, as the dataset contains a wide range of driving actions, road environments, and collision scenarios, which makes the problem considerably harder. We took several steps to address this performance drop. First, we improved the birds-eye view (BEV) calibration on this dataset relative to the other datasets. Since the varying camera angles and road conditions in this dataset prevent us from calibrating sg2vec’s BEV projection in a single step, we created custom BEV calibrations for each clip in the dataset, which improved performance somewhat. However, as shown in Figure 4c, a significant part of the dataset consists of driving clips on roads without any discernible lane markings, such as snowy, unpaved, or unmarked roadways. These factors make it challenging to correlate known fixed distances (i.e., the width and length of lane markings) with the projections of these clips.

To further improve performance on this particular dataset, we performed extensive architecture and hyperparameter tuning. We found that, with one MRGCN layer of size 64, one LSTM layer with hidden size 100, no SAGPooling layer, and a high learning rate and batch size, we achieved significantly better performance than with the model architecture described earlier (2 MRGCN layers of size 64, one LSTM layer with hidden size 20, and a SAGPooling layer with a keep ratio of 0.5). We believe this indicates that the temporal features of each clip in this dataset are more closely related to collision likelihood than the spatial features in each clip. As a result, the additional spatial modeling components were likely causing overfitting and skewing the spatial embedding output. With a simpler spatial model (1 MRGCN layer and no SAGPooling), the spatial embeddings remained more general. This change, combined with a larger LSTM layer, enabled the model to capture more temporal features when modeling each clip and to generalize better to the testing set. Model performance on this dataset and similar datasets could likely be improved by acquiring more consistent data via higher-resolution cameras with fixed camera angles and by using more accurate BEV projection approaches. However, as collisions are rare events, there are few to no datasets containing real-world collisions that meet these requirements.

Despite these limitations, sg2vec outperforms the DPM model by a significant margin, achieving 21.17% higher accuracy, 31.40% higher AUC, and a 21.92% higher MCC score. Since the DPM achieves a negative MCC score, its performance on this dataset is worse than that of a random classifier (MCC of 0.0). Consistent with the synthetic dataset results, sg2vec using all frames performs better on the balanced 620-dash dataset than sg2vec (5-frames). Overall, these results show that, on very challenging and complex real-world driving scenarios, sg2vec can perform much better than the current state-of-the-art.
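For reference, the two architectures compared above can be written side by side. This is only an illustrative sketch: the `Sg2vecConfig` class, its field names, and the `describe` helper are hypothetical and are not taken from the sg2vec code base. The learning rate and batch size are left out because their values are not stated in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sg2vecConfig:
    """Hypothetical container for the hyperparameters named in the text."""
    mrgcn_layers: int                # number of MRGCN layers
    mrgcn_size: int                  # size of each MRGCN layer
    lstm_hidden: int                 # hidden size of the single LSTM layer
    sag_keep_ratio: Optional[float]  # SAGPooling keep ratio; None means no pooling


# Architecture described earlier in the paper.
BASELINE = Sg2vecConfig(mrgcn_layers=2, mrgcn_size=64, lstm_hidden=20, sag_keep_ratio=0.5)

# Architecture that performed better on 620-dash after tuning.
TUNED_620_DASH = Sg2vecConfig(mrgcn_layers=1, mrgcn_size=64, lstm_hidden=100, sag_keep_ratio=None)


def describe(cfg: Sg2vecConfig) -> str:
    """Summarize a configuration as spatial model -> pooling -> temporal model."""
    if cfg.sag_keep_ratio is None:
        pooling = "no pooling"
    else:
        pooling = f"SAGPooling(ratio={cfg.sag_keep_ratio})"
    return f"{cfg.mrgcn_layers}x MRGCN({cfg.mrgcn_size}) -> {pooling} -> LSTM({cfg.lstm_hidden})"


print(describe(BASELINE))
print(describe(TUNED_620_DASH))
```

The tuned configuration shrinks the spatial side (one MRGCN layer, no pooling) and enlarges the temporal side (LSTM hidden size 100 instead of 20), which matches the interpretation that temporal features matter more on this dataset.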

IV-B3 Time of Prediction

Since collision prediction is a time-sensitive problem, we evaluated our methodology and the DPM on their average time-of-prediction (ATP) for video clips containing collisions. To calculate the ATP, we recorded the first frame index in each collision clip when the model correctly predicts that a collision would occur. We then averaged these indices and compared them with the average collision video clip length. Essentially, ATP gives an estimate of how early each model can predict a future collision. These results are shown in Table II. On the 1043-syn dataset, sg2vec achieves 0.1725 for the ratio of the ATP and the average sequence length while the DPM achieves a ratio of 0.2382, indicating that sg2vec predicts future collisions 39.07% earlier than the DPM on average. In the context of real-world collision prediction, the average sequence in the 1043-syn dataset represents 1.867 seconds of data. Thus, our methodology predicted collisions 122.7 milliseconds earlier than DPM on average. This extra time can be critical for ensuring that the AV avoids an impending collision.
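The ATP calculation described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the paper: the function names are hypothetical, frame indices are assumed to be 0-based, and clips in which the model never predicts a collision are assumed to be skipped when averaging, since the text does not say how such clips are handled. The last lines convert the 1043-syn ratios from Table II into a lead time using the stated average sequence duration of 1.867 seconds.

```python
from statistics import mean


def first_detection_index(frame_preds):
    """Index of the first frame at which a collision is predicted, or None."""
    for idx, pred in enumerate(frame_preds):
        if pred == 1:
            return idx
    return None


def atp_summary(collision_clips):
    """Return (ATP, average clip length, ATP / average clip length)."""
    firsts = [first_detection_index(clip) for clip in collision_clips]
    detected = [f for f in firsts if f is not None]  # assumption: skip missed clips
    atp = mean(detected)
    avg_len = mean(len(clip) for clip in collision_clips)
    return atp, avg_len, atp / avg_len


# Toy per-frame predictions (1 = collision predicted) for two collision clips.
toy_clips = [[0, 0, 1, 1], [0, 1, 1, 1, 1, 1]]
print(atp_summary(toy_clips))

# Converting the Table II ratios on 1043-syn into a lead time.
ratio_sg2vec, ratio_dpm = 0.1725, 0.2382
avg_clip_seconds = 1.867
lead_ms = (ratio_dpm - ratio_sg2vec) * avg_clip_seconds * 1000.0
print(f"lead time: {lead_ms:.1f} ms")
```

A lower ratio means the first correct prediction occurs earlier in the clip, so the difference between the two ratios, scaled by the clip duration, gives the extra warning time.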

| Dataset | Model | ATP (frames) | Avg. Seq. Len. (frames) | Ratio |
| --- | --- | --- | --- | --- |
| 271-syn | sg2vec (Ours) | 10.004 | 33.920 | 0.2949 |
| 271-syn | DPM | 17.399 | 32.899 | 0.5289 |
| 1043-syn | sg2vec (Ours) | 6.442 | 37.343 | 0.1725 |
| 1043-syn | DPM | 9.018 | 37.856 | 0.2382 |

TABLE II: Average time of prediction (ATP) for collisions.
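
| Experiment | Spatial Model | Graph Pooling | Temporal Model | Acc. | MCC |
| --- | --- | --- | --- | --- | --- |
| Ablation Study | MLP | none | none | 0.7605 | 0.2612 |
| Ablation Study | MLP | none | LSTM | 0.7660 | 0.2874 |
| Ablation Study | MRGCN | none | none | 0.8605 | 0.4792 |
| Ablation Study | MRGCN | none | LSTM | 0.8931 | 0.5561 |
| Graph Attn. and Pooling | MRGCN | Top-K | none | 0.8288 | 0.3458 |
| Graph Attn. and Pooling | MRGCN | SAGPool | none | 0.8738 | 0.5032 |
| Graph Attn. and Pooling | MRGCN | Top-K | LSTM | 0.9014 | 0.5565 |
| Graph Attn. and Pooling | MRGCN | SAGPool | LSTM | 0.9076 | 0.5407 |

TABLE III: sg2vec ablation study on the 1043-syn dataset.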

IV-C Transferability From Synthetic to Real-World Datasets

Collision prediction models trained on simulated datasets must be transferable to real-world driving, which can differ significantly from simulation. To evaluate each model’s ability to transfer knowledge, we trained each model on a synthetic dataset and then tested it on the 571-honda dataset. No additional domain adaptation was performed. We did not evaluate transferability to the 620-dash dataset because it contains a wide range of highly dynamic driving maneuvers that were not present in our synthesized datasets. As such, evaluating transferability between our synthesized datasets and the 620-dash dataset would yield poor performance and would not provide insight. Figure 5 compares the accuracy and MCC of both models on each training dataset and on the 571-honda dataset after transfer.

Fig. 5: Performance after transferring the models trained on synthetic 271-syn and 1043-syn datasets to the real-world 571-honda dataset.

We observe that the sg2vec model achieves a significantly higher MCC score than the DPM model after the transfer, suggesting that our methodology can better transfer knowledge from a synthetic to a real-world dataset than the state-of-the-art DPM model. The drop in MCC values observed for both models when transferred to the 571-honda dataset can be attributed to the characteristic differences between the simulated and real-world datasets; the 571-honda dataset contains a more heterogeneous set of road environments, lighting conditions, driving styles, etc., so a drop in performance after the transfer is expected. We also note that the MCC score for the sg2vec model trained on the 271-syn dataset drops more than that of the model trained on the 1043-syn dataset after the transfer, likely due to the smaller training dataset size. Regarding accuracy, the sg2vec model trained on 1043-syn achieves 4.37% higher accuracy, while the model trained on 271-syn achieves 1.47% lower accuracy, than the DPM model trained on the same dataset. The DPM’s similar accuracy after transfer likely results from the class imbalance in the 571-honda dataset. Overall, we hypothesize that sg2vec’s use of an intermediate representation (i.e., scene-graphs) inherently improves its ability to generalize and thus results in an improved ability to transfer knowledge compared to CNN-based deep learning approaches.
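The difference between accuracy and MCC under class imbalance can be illustrated with a small calculation. The confusion-matrix counts below are hypothetical and are not results from this paper; they only show why a model can reach a similar accuracy while having a much lower MCC. Returning 0.0 when the MCC denominator is zero is a common convention and is assumed here.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all clips classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 10 collision clips, 90 non-collision clips.
always_negative = dict(tp=0, tn=90, fp=0, fn=10)  # never predicts a collision
informative = dict(tp=7, tn=85, fp=5, fn=3)       # detects most collisions

for name, cm in [("always negative", always_negative), ("informative", informative)]:
    print(f"{name}: acc={accuracy(**cm):.2f}, mcc={mcc(**cm):.2f}")
```

In this toy case the classifier that never predicts a collision still reaches 90% accuracy but has an MCC of 0, the same as a random classifier, whereas the informative classifier has only slightly higher accuracy and a clearly positive MCC.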

IV-D Evaluation on Industry-Standard AV Hardware

To demonstrate that sg2vec is implementable on industry-standard AV hardware, we measured its inference time (milliseconds), model size (kilobytes), power consumption (watts), and energy consumption per frame (millijoules) on the industry-standard Nvidia DRIVE PX 2 platform, which was used by Tesla for their Autopilot system from 2016 to 2018 [1]. Our hardware setup is shown in Figure 6. For the inference time, we evaluated the average inference time (AIT) in milliseconds taken by each algorithm to process each frame. We recorded power usage metrics using a power meter connected to the power supply of the PX 2. To ensure that the reported numbers reflected only each model’s power consumption and not that of background processes, we subtracted the hardware’s idle power consumption from the averages recorded during each test. For a fair comparison, we captured the metrics for the core algorithm only (i.e., the sg2vec and DPM models), excluding the contribution from data loading and pre-processing. Both models were run with a batch size of 1 to emulate a real-world data stream in which images are processed as they are received. For comparison, we also show the AIT on a PC for the two models.
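The measurement procedure described above can be sketched in a few lines. This is an assumption-laden illustration and not the benchmarking harness used in the paper: `dummy_model`, the frame data, and the power-meter readings are all hypothetical placeholders. For a model running on a GPU, the device would also need to be synchronized before each clock reading, which this CPU-only sketch does not cover.

```python
import time
from statistics import mean


def average_inference_time_ms(model, frames):
    """Mean wall-clock time per frame in ms, one frame at a time (batch size 1)."""
    elapsed_ms = []
    for frame in frames:
        start = time.perf_counter()
        model(frame)  # only the core model call is timed, not data loading
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(elapsed_ms)


def net_power_w(load_samples_w, idle_w):
    """Average power under load minus the idle power of the hardware."""
    return mean(load_samples_w) - idle_w


def dummy_model(frame):
    """Hypothetical stand-in for a collision prediction model."""
    return sum(frame) > 0


frames = [[0.0] * 64 for _ in range(100)]  # placeholder inputs
ait_ms = average_inference_time_ms(dummy_model, frames)
power_w = net_power_w([12.9, 13.0, 13.1], idle_w=10.0)  # made-up meter readings
print(f"AIT: {ait_ms:.4f} ms, net power: {power_w:.2f} W")
```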

Our results are shown in Table IV. sg2vec performs inference 9.3x faster than the DPM on the PX 2 with an 88.0% smaller model and 32.4% less power, making it more practical for real-world deployment. Our model also uses 92.8% less energy to process each frame, which can be beneficial for electric vehicles with limited battery capacity. With an AIT of 0.4828 ms, sg2vec can theoretically process up to 2,071 frames/second (fps). In contrast, with an AIT of 4.535 ms, the DPM can only process up to 220 fps. In the context of real-world collision prediction, this means that sg2vec could easily support multiple 60 fps camera inputs from the AV, while the DPM would struggle to support more than three.

Fig. 6: Our experimental setup for evaluating sg2vec and DPM on the industry-standard Nvidia DRIVE PX 2 hardware.

| Model | PC AIT (ms) | PX2 AIT (ms) | Size (KB) | Power (W) | Energy/frame (mJ) |
| --- | --- | --- | --- | --- | --- |
| sg2vec | 0.2549 | 0.4828 | 331 | 2.99 | 1.44 |
| DPM | 1.393 | 4.535 | 2,764 | 4.42 | 20.0 |

TABLE IV: Performance evaluation of inference on 271-syn on the Nvidia DRIVE PX 2.
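Several of the figures quoted in this section follow directly from the Table IV values. The short script below recomputes them as a sanity check; the helper names are hypothetical, and the only inputs are the numbers in the table plus the 60 fps camera rate mentioned in the text.

```python
TABLE_IV = {
    "sg2vec": {"px2_ait_ms": 0.4828, "size_kb": 331, "power_w": 2.99, "energy_mj": 1.44},
    "DPM": {"px2_ait_ms": 4.535, "size_kb": 2764, "power_w": 4.42, "energy_mj": 20.0},
}


def max_fps(ait_ms):
    """Theoretical throughput if frames are processed back to back."""
    return 1000.0 / ait_ms


def energy_per_frame_mj(power_w, ait_ms):
    """Energy per frame: watts times milliseconds gives millijoules."""
    return power_w * ait_ms


def reduction_pct(ours, baseline):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)


ours, base = TABLE_IV["sg2vec"], TABLE_IV["DPM"]

for name, row in TABLE_IV.items():
    fps = max_fps(row["px2_ait_ms"])
    cameras = int(fps // 60)  # number of 60 fps streams that fit
    energy = energy_per_frame_mj(row["power_w"], row["px2_ait_ms"])
    print(f"{name}: {fps:.0f} fps, {cameras} cameras at 60 fps, {energy:.2f} mJ/frame")

print(f"model size reduction: {reduction_pct(ours['size_kb'], base['size_kb']):.1f}%")
print(f"power reduction: {reduction_pct(ours['power_w'], base['power_w']):.1f}%")
print(f"energy reduction: {reduction_pct(ours['energy_mj'], base['energy_mj']):.1f}%")
```

The energy per frame is the product of average power and average inference time, so the model that is both faster and lower-power gains on both factors at once.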

V Conclusion

With our experiments, we demonstrated that our scene-graph embedding methodology for collision prediction, sg2vec, outperforms the state-of-the-art method, DPM, in terms of average MCC (0.5055 vs. 0.2096), average inference time (0.255 ms vs. 1.39 ms), and average time of prediction (39.07% sooner than DPM). Additionally, we demonstrated that sg2vec could transfer knowledge from synthetic datasets to real-world driving datasets more effectively than the DPM, achieving an average transfer MCC of 0.327 vs. 0.060. Finally, we showed that our methodology performs faster inference than the DPM (0.4828 ms vs. 4.535 ms) with a smaller model size (331 KB vs. 2,764 KB) and reduced power consumption (2.99 W vs. 4.42 W) on the industry-standard Nvidia DRIVE PX 2 autonomous driving platform. In the context of real-world collision prediction, these results indicate that sg2vec is a more practical choice for AV safety and could significantly improve consumer trust in AVs. Few works have explored graph-based solutions for other complex AV challenges such as localization, path planning, and control. These are open research problems that we reserve for future work.

References


  • [1] (2016-10) All new Teslas are equipped with NVIDIA’s new Drive PX 2 AI platform for self-driving - Electrek. Note: [Online; accessed 9. Nov. 2020] Cited by: item 3, §IV-D.
  • [2] M. Althoff, O. Stursberg, and M. Buss (2009) Model-based probabilistic collision detection in autonomous driving. IEEE Transactions on Intelligent Transportation Systems 10 (2), pp. 299–310. Cited by: §II-A.
  • [3] B. Asgari, R. Hadidi, N. S. Ghaleshahi, and H. Kim (2020) PISCES: power-aware implementation of slam by customizing efficient sparse algebra. 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §II-B.
  • [4] S. Baidya, Y. Ku, H. Zhao, J. Zhao, and S. Dey (2020) Vehicular and edge computing for emerging connected and autonomous vehicle applications. 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §II-B.
  • [5] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §I.
  • [6] T. Bijlsma, A. Buriachevskyi, A. Frigerio, Y. Fu, K. Goossens, A. O. Örs, P. J. van der Perk, A. Terechko, and B. Vermeulen (2020) A distributed safety mechanism using middleware and hypervisors for autonomous vehicles. 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1175–1180. Cited by: §I.
  • [7] A. P. Bradley (1997) The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition 30 (7), pp. 1145–1159. Cited by: §IV-B.
  • [8] D. Chicco and G. Jurman (2020) The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics 21 (1), pp. 6. Cited by: §IV-B.
  • [9] J. Dahl, G. R. de Campos, C. Olsson, and J. Fredriksson (2018) Collision avoidance: a literature review on threat-assessment techniques. IEEE Transactions on Intelligent Vehicles 4 (1), pp. 101–113. Cited by: §I, §II-A.
  • [10] Deloitte Development LLC (2020) Global Automotive Consumer Study - North America. Technical report Deloitte Development LLC. Cited by: §I.
  • [11] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: §I, §IV-A1.
  • [12] B. Gassmann, F. Pasch, F. Oboril, and K. Scholl (2020) Integration of formal safety models on system level using the example of responsibility sensitive safety and carla driving simulator. In International Conference on Computer Safety, Reliability, and Security, pp. 358–369. Cited by: §II-A.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §III-A.
  • [14] C. Huang, S. Xu, Z. Wang, S. Lan, W. Li, and Q. Zhu (2020) Opportunistic intermittent control with safety guarantees for autonomous systems. 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §II-B.
  • [15] H. L. D. Institute (2012) Volvo collision avoidance features: initial results. Highway Loss Data Institute Bulletin 29 (5). Cited by: §I, §II-A.
  • [16] L. Kunze, T. Bruls, T. Suleymanov, and P. Newman (2018) Reading between the lanes: road layout reconstruction from partially segmented scenes. 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 401–408. Cited by: §I.
  • [17] J. Lee, I. Lee, and J. Kang (2019) Self-attention graph pooling. arXiv preprint arXiv:1904.08082. Cited by: §III-B.
  • [18] C. Li, Y. Meng, S. H. Chan, and Y. Chen (2020) Learning 3d-aware egocentric spatial-temporal interaction via graph convolutional networks. 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 8418–8424. Cited by: §I, §II-B.
  • [19] A. S. Mueller, J. B. Cicchino, and D. S. Zuby (2020) What humanlike errors do autonomous vehicles need to avoid to maximize safety?. Journal of Safety Research. Cited by: §I.
  • [20] S. Mylavarapu, M. Sandhu, P. Vijayan, K. M. Krishna, B. Ravindran, and A. Namboodiri (2020) Towards accurate vehicle behaviour classification with multi-relational graph convolutional networks. arXiv preprint arXiv:2002.00786. Cited by: §I, §I, §II-B.
  • [21] S. Mylavarapu, M. Sandhu, P. Vijayan, K. M. Krishna, B. Ravindran, and A. Namboodiri (2020) Understanding dynamic scenes using graph convolution networks. arXiv preprint arXiv:2005.04437. Cited by: §I.
  • [22] National Transportation Safety Board (2019) Collision between vehicle controlled by developmental automated driving system and pedestrian. Technical report Technical Report NTSB/HAR-19/03, National Transportation Safety Board. Cited by: §I.
  • [23] National Transportation Safety Board (2020) Collision Between a Sport Utility Vehicle Operating With Partial Driving Automation and a Crash Attenuator. Technical report Technical Report NTSB/HAR-20/01, National Transportation Safety Board. Cited by: §I.
  • [24] National Transportation Safety Board (2020) Collision Between Car Operating with Partial Driving Automation and Truck-Tractor Semitrailer. Technical report Technical Report NTSB/HAB-20/01, National Transportation Safety Board. Cited by: §I.
  • [25] J. Nilsson, A. C. Ödblom, and J. Fredriksson (2015) Worst-case analysis of automotive collision avoidance systems. IEEE Transactions on Vehicular Technology 65 (4), pp. 1899–1911. Cited by: §I, §II-A.
  • [26] D. Nistér, H. Lee, J. Ng, and Y. Wang (2019) The safety force field. NVIDIA White Paper. Cited by: §II-A.
  • [27] V. Ramanishka, Y. Chen, T. Misu, and K. Saenko (2018) Toward driving scene understanding: a dataset for learning driver behavior and causal reasoning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7699–7707. Cited by: §IV-A2.
  • [28] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. European Semantic Web Conference, pp. 593–607. Cited by: §III-B.
  • [29] B. Schoettle and M. Sivak (2015) A preliminary analysis of real-world crashes involving self-driving vehicles. Technical report Technical Report UMTRI-2015-34, University of Michigan Transportation Research Institute. Cited by: §I.
  • [30] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2017) On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374. Cited by: §II-A.
  • [31] S. Sontges, M. Koschi, and M. Althoff (2018) Worst-case analysis of the time-to-react using reachable sets. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1891–1897. Cited by: §I, §II-A.
  • [32] M. Strickland, G. Fainekos, and H. B. Amor (2018) Deep predictive models for collision risk assessment in autonomous driving. 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §II-A, §II-A, §IV-A4, §IV.
  • [33] S. vom Dorff, B. Böddeker, M. Kneissl, and M. Fränzle (2020) A fail-safe architecture for automated driving. 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 828–833. Cited by: §I.
  • [34] X. Wang, J. Liu, T. Qiu, C. Mu, C. Chen, and P. Zhou (2020) A real-time collision prediction mechanism with deep learning for intelligent transportation system. IEEE transactions on vehicular technology 69 (9), pp. 9497–9508. Cited by: §II-A.
  • [35] Y. Wang, Z. Liu, Z. Zuo, Z. Li, L. Wang, and X. Luo (2019) Trajectory planning and safety assessment of autonomous vehicles based on motion prediction and model predictive control. IEEE Transactions on Vehicular Technology 68 (9), pp. 8546–8556. Cited by: §II-A.
  • [36] C. Xu, Z. Ding, C. Wang, and Z. Li (2019) Statistical analysis of the patterns and characteristics of connected and autonomous vehicle involved crashes. Journal of safety research 71, pp. 41–47. Cited by: §I.
  • [37] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5410–5419. Cited by: §III-A.
  • [38] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. Proceedings of the European conference on computer vision (ECCV), pp. 670–685. Cited by: §III-A.
  • [39] Y. Yao, X. Wang, M. Xu, Z. Pu, E. Atkins, and D. Crandall (2020) When, where, and what? a new dataset for anomaly detection in driving videos. arXiv preprint arXiv:2004.03044. Cited by: §IV-A3.
  • [40] S. Yu, A. V. Malawade, D. Muthirayan, P. P. Khargonekar, and M. A. Al Faruque (2021) Scene-graph augmented data-driven risk assessment of autonomous vehicle decisions. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I, §II-B.
  • [41] L. Zhang, W. Xiao, Z. Zhang, and D. Meng (2020) Surrounding vehicles motion prediction for risk assessment and motion planning of autonomous vehicle in highway scenarios. IEEE Access 8, pp. 209356–209376. Cited by: §II-A.