roadscene2vec: A Tool for Extracting and Embedding Road Scene-Graphs

Recently, road scene-graph representations used in conjunction with graph learning techniques have been shown to outperform state-of-the-art deep learning techniques in tasks including action classification, risk assessment, and collision prediction. To enable the exploration of applications of road scene-graph representations, we introduce roadscene2vec: an open-source tool for extracting and embedding road scene-graphs. The goal of roadscene2vec is to enable research into the applications and capabilities of road scene-graphs by providing tools for generating scene-graphs, graph learning models to generate spatio-temporal scene-graph embeddings, and tools for visualizing and analyzing scene-graph-based methodologies. The capabilities of roadscene2vec include (i) customized scene-graph generation from either video clips or data from the CARLA simulator, (ii) multiple configurable spatio-temporal graph embedding models and baseline CNN-based models, (iii) built-in functionality for using graph and sequence embeddings for risk assessment and collision prediction applications, (iv) tools for evaluating transfer learning, and (v) utilities for visualizing scene-graphs and analyzing the explainability of graph learning models. We demonstrate the utility of roadscene2vec for these use cases with experimental results and qualitative evaluations for both graph learning models and CNN-based models. roadscene2vec is available at




1 Introduction

Autonomous Vehicles (AVs) are expected to revolutionize personal mobility, logistics, and road safety litman2017autonomous. However, recent accidents involving Tesla Autopilot and Uber's self-driving cars indicate that the development of safe and robust AVs remains a difficult challenge NTSB2019uber; NTSB2018; NTSB2019. Current statistics indicate that perception and prediction errors were factors in over 40% of driver-related crashes between conventional vehicles mueller2020humanlike, leading both researchers and industry leaders to race to address these problems via advanced AV perception systems. Until recently, most AV perception architectures relied entirely on deep learning techniques, centered around Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs) yurtsever2019risky; bojarski2016end; tao2021stereo; xiao2020attention, or model-based methods, which use known road geometry information and vehicle trajectory models to estimate the state of the road scene sontges2018worst; nister2019safety. Although these approaches have been successful in typical use cases, they are limited in their ability to obtain a higher-level human-like understanding of complex road scenarios as they cannot explicitly capture inter-object relationships or the overall structure of the road scene.

Research has suggested that humans rely on cognitive mechanisms for identifying the structure of a scene and reasoning about inter-object relations when performing complex tasks and identifying risk battaglia2018relational. As such, capturing and identifying the complex relationships between road objects is key to designing an effective human-like AV perception system. To address the limitations of these existing AV perception methods, several groups have proposed using a variant of knowledge graphs known as scene-graphs to model the state of the road and capture the relationships between objects yu2021scene; mylavarapu2020towards; li2019learning. A scene-graph representation encodes rich semantic information of an image or observed scene, essentially providing an abstraction of objects and their complex relationships as illustrated in Figure 1. Each of these related works proposes a different form of scene-graph representation, but all demonstrated significant performance improvements over conventional perception methods. In li2019learning, the authors propose a 3D-aware egocentric spatio-temporal interaction framework that uses both an Ego-Thing graph and an Ego-Stuff graph, which encode how the ego vehicle interacts with both moving and stationary objects in a scene, respectively. In mylavarapu2020towards, the authors propose a pipeline using a multi-relational graph convolutional network (MR-GCN) for classifying the driving behaviors of traffic participants. The MR-GCN is constructed by combining spatial and temporal information, including relational information between moving objects and landmark objects. In our prior work yu2021scene, we demonstrated that a spatio-temporal scene-graph embedding can be used to identify the subjective risk of driving maneuvers significantly more effectively than the state-of-the-art deep learning method. In addition, our method is able to better transfer knowledge and is more explainable.

Figure 1: How camera data can be used to construct a road scene-graph representation.

Although a wide range of scene-graph-based AV perception approaches have been proposed, each method was developed from scratch, requiring significant time and resource investment by each research group. Although tools exist to perform preprocessing and graph learning (e.g., PyTorch and PyTorch Geometric), to the best of our knowledge there exists no tool for systematically converting road scenes into scene-graphs in this field. As a result, each research group must start developing their scene-graph construction methodology from the ground up, wasting time and effort that could be better spent using the resultant scene-graph representations to solve more complex research problems. To address this problem, we propose roadscene2vec: a tool for systematically extracting and embedding road scene-graphs. roadscene2vec enables researchers to quickly and easily extract scene-graphs from camera data, evaluate different graph construction methodologies, and use several different graph and machine learning algorithms to generate spatio-temporal graph embeddings for a wide range of AV tasks. We envision roadscene2vec serving the following use cases:

  • Converting image-based datasets as well as datasets generated by the CARLA simulator dosovitskiy2017carla into scene-graphs.

  • Enabling the exploration of different scene-graph construction methodologies for a given application via a flexible, reconfigurable, and user-friendly scene-graph extraction framework.

  • Allowing researchers to explore various spatio-temporal graph embedding methods, supporting customized algorithms for further design exploration.

  • Providing a set of baselines drawn from state-of-the-art works used for different AV applications (CNN and CNN-LSTM based algorithms).

  • Providing scene-graph visualization utilities to enhance design space exploration for graph construction.

We target camera data as opposed to lidar, radar, or other sensor types since images are the richest and most detailed modality, providing high-resolution details about the scene as well as color information. This information can be used to better identify the context of the scene and the relations between participants. Adding other modalities is unlikely to contribute much more information to the scene-graph; it would mainly improve the robustness of the system and the precision of the graph. Moreover, current state-of-the-art AV perception architectures utilizing sensor fusion still have shortcomings fang2021invisible. Furthermore, the vast majority of publicly available AV datasets primarily contain image data.

1.1 Novel Contributions

Our novel contributions for this research community are:

  1. We present roadscene2vec: a flexible scene-graph construction and embedding framework that allows researchers to experiment with different graph extraction formulations to find the best one for their problem.

  2. We provide an end-to-end graph learning framework for modeling the scene-graph representations. Our framework enables automated experimentation and metrics logging over a wide range of graph learning AV applications. We also provide templates to facilitate users defining their own models and problems.

  3. We provide many visualization tools and utilities for inspecting and understanding the scene-graphs including attention maps, color coding by classes or relation type, birds-eye view projection, embedding projection, etc. This enables users to interpret their results easily without having to design their own visualizer.

  4. We provide state-of-the-art CNN-based models drawn from recent AV papers for cross-comparison with graph-learning based techniques.

1.2 Paper Organization

The rest of our paper is laid out as follows. In Section 2 we discuss related works. In Section 3 we introduce the core functionality of our tool and its methodology. In Section 4 we provide usage examples. In Section 6 we demonstrate the practical real-world value of our tool by evaluating it on several common use cases. Finally, in Section 7 we present our conclusions.

2 Related Work

In this section, we begin by describing some general AV design philosophies. We then discuss graph-based approaches used in scene understanding. Lastly, we briefly discuss the existing tools and libraries.

2.1 AV Design Methodologies

The two common design approaches for AV systems are (i) end-to-end deep learning architectures yurtsever2019survey and (ii) modular architectures. Modular approaches are implemented as a pipeline of separate components for performing each sub-task of the AV (e.g., perception, localization, planning, control), while end-to-end approaches generate actuator outputs (e.g., steering, brake, accelerator) directly from their sensory inputs bojarski2016end. One advantage of a modular design approach is the division of a task into an easier-to-solve set of sub-tasks that have been addressed in other fields such as robotics, computer vision, and vehicle dynamics, from which prior knowledge can be leveraged. However, one disadvantage of such an approach is the complexity of implementing, running, and validating the complete pipeline. End-to-end approaches can achieve good performance with a smaller network size and lower implementation costs because they perform feature extraction from sensor inputs implicitly through the network's hidden layers bojarski2016end. However, the authors in chen2015deepdriving point out that the needed level of supervision is too weak for the end-to-end model to learn critical control information (e.g., from image to steering angle), so it can fail to handle complicated driving maneuvers or be insufficiently robust to disturbances.

A third approach called the direct perception approach was first proposed by DeepDriving chen2015deepdriving. In this approach, a set of affordance indicators, such as the distance to lane markings and other cars in the current and adjacent lanes, are extracted from an image and serve as an intermediate representation (IR) for generating the final control output. They show that the use of this IR is effective for simple driving tasks such as lane following as well as enabling better generalization to real-world environments. Similarly, bansal2018chauffeurnet uses a collection of filtered images as the IR. They state that the IR used in their approach allows the training to be conducted on either real or simulated data, facilitating testing and validation in simulations before testing on a real car. Moreover, they show that it is easier to synthesize perturbations to the driving trajectory in the IR than at the raw sensor inputs themselves, enabling them to produce non-expert behaviors such as off-road driving and collisions. The authors in yurtsever2019risky use Mask-RCNN he2017mask to color the vehicles in each input image, producing a form of IR. In contrast to the works mentioned above, roadscene2vec utilizes a scene-graph IR that encodes the spatial and semantic relations between all the traffic participants in a frame. This form of representation is similar to a knowledge graph with the key distinction that scene-graphs explicitly encode knowledge about a visual scene.

2.2 Graph-based Driving Scene Understanding

In the literature, several works have applied graph-based formulations for driving scene understanding. In li2019learning, the authors propose a 3D-aware egocentric spatio-temporal interaction framework that uses both an Ego-Thing graph and an Ego-Stuff graph, which encode how the ego vehicle interacts with both moving and stationary objects in a scene, respectively. In mylavarapu2020towards, the authors propose a pipeline using a multi-relational graph convolutional network (MR-GCN) for classifying the driving behaviors of traffic participants. The MR-GCN is constructed by combining spatial and temporal information, including relational information between moving objects and landmark objects. In tian2020road, the authors propose extracting road scene-graphs in a manner that includes pose information for the purpose of scene layout reconstruction. A similar approach was also proposed in kunze2018reading. The authors of liu2021real propose using a probabilistic graph approach for explainable traffic collision inference. In our prior work, we demonstrated that a scene-graph representation used with an MRGCN leads to state-of-the-art performance at assessing the subjective risk of driving maneuvers yu2021scene. In our tool, we implement examples of multi-relational graph learning models (MRGCN and MRGIN) as well as model skeletons to enable users to easily evaluate other graph learning model formulations.

2.3 Graph Extraction and Graph Learning Libraries

Other libraries for extracting scene-graphs from input images have been proposed. yang2018graph proposed the Graph R-CNN model, which extracts scene-graphs by identifying the set of individual objects in the image before identifying the spatial relations between the objects. With this process, Graph R-CNN is able to extract the spatial features of the scene in the form of a scene-graph. tang2020sgbenchmark provides a benchmark for evaluating several kinds of scene-graph generation models on image datasets. The scene-graph representations extracted by these tools are then used for semantic understanding and labeling tasks, such as image captioning and visual question answering. Although these tools and models are successful at these tasks, they do not incorporate specific domain knowledge relevant to the AV problem space. Autonomous driving is a highly complex problem on its own, so AV algorithms must utilize domain knowledge including driving rules, road layout and markings, as well as light and sign information. Furthermore, AV algorithms must account for temporal factors; the aforementioned tools operate on individual images and thus do not account for these safety-critical temporal factors.

Regarding graph learning tools and libraries, several tools such as GraphGYM you2020design, DGL wang2019deep, and OGB hu2020open exist for quickly and easily evaluating several graph learning models on problems including node/graph classification and regression. However, none of these pre-existing tools enable scene-graph generation; they can only be used with existing graph data. Our proposed tool is the only tool which enables both the extraction and learning of AV-specific scene-graphs.

3 Roadscene2vec Architecture

This section introduces roadscene2vec's architecture, features, and intended workflow. roadscene2vec is implemented as a Python library that integrates several external packages, including PyTorch, PyTorch Geometric, Detectron2, and CARLA. roadscene2vec consists of four key modules: (i) data generation (data.gen) and preprocessing (data.proc), (ii) scene-graph extraction (scene_graph), (iii) model training and evaluation (learning), and (iv) visualization (util). We detail each module in the following subsections.

Figure 2: Workflow for using roadscene2vec to preprocess a dataset; extract scene-graphs from the dataset; and select, train, and evaluate a model on the dataset.

3.1 Dataset Generation Tools (data.gen)

The data.gen module in roadscene2vec allows researchers to synthesize driving data for their research. To successfully handle complex and long-tail driving scenarios, deep learning approaches typically train their models on large datasets that contain a wide range of "corner cases." However, generating such datasets is expensive and time-consuming in the real world dosovitskiy2017carla. Thus, most researchers instead use synthesized datasets containing plenty of these corner cases to evaluate their research ideas.

For this purpose, roadscene2vec integrates the open-source driving simulator, CARLA dosovitskiy2017carla, which allows users to generate driving data by controlling a vehicle (either in manual mode or autopilot mode) in simulated driving scenarios. On top of that, roadscene2vec also integrates the CARLA Scenario Runner which contains a set of atomic controllers that enable users to automate the execution of complex driving maneuvers.

In roadscene2vec, data.gen produces each driving clip in CARLA’s simulated world by (i) selecting one autonomous car randomly, (ii) switching its mode to manual mode, and (iii) using the Scenario Runner to command the vehicle to change lanes. In addition, the data generation tool in roadscene2vec manipulates the various presets in CARLA to specify the number of cars, pedestrians, weather and lighting conditions, etc., for making the generated driving data more diverse. Moreover, through the APIs provided by the Traffic Manager (TM) of the CARLA simulator, the tool can customize the driving characteristics of every autonomous vehicle in the simulated world, such as the intended speed considering the current speed limit, the chance of ignoring the traffic lights, or the chance of neglecting collisions with other vehicles. Overall, the tool allows users to simulate a wide range of very realistic urban driving environments and generate synthesized datasets suitable for training and testing a model.

Using the CARLA Python API and the CARLA Scenario Runner, we implemented a tool in the data.gen module for extracting the road scene’s state information as well as the corresponding ego-centric camera images directly from the CARLA simulator for use in roadscene2vec. For each frame in a driving sequence, we store the attributes of the objects in the scene as a Python dictionary. These attributes include object type, location, rotation, lane assignment, acceleration, velocity, and light status. For static objects such as traffic lights and signage, we store the type of object, its location, and light state (light color) or sign value (e.g., speed limit). We refer to the datasets in this format as CARLA datasets. In addition, our tool supports using image-based datasets, such as the camera data extracted from CARLA or the Honda Driving Dataset Ramanishka_behavior_CVPR_2018 used in our experiments. The code provided in our data.gen module can be modified to support other driving actions, such as turning, accelerating, braking, and overtaking.
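As a rough illustration of the per-frame record described above, a single frame might be stored along the following lines. The key names, identifiers, and values are assumptions made for illustration only; the text does not specify the exact schema written by data.gen. The helper count_by_type is likewise hypothetical.

```python
# Hypothetical per-frame record. Key names and values are illustrative
# assumptions, not the actual schema written by data.gen.
frame = {
    "frame_id": 42,
    "ego": {
        "id": "ego",
        "type": "car",
        "location": (12.4, -3.1, 0.0),
        "rotation": (0.0, 90.0, 0.0),
        "lane": "middle",
        "velocity": (0.0, 8.3, 0.0),
        "acceleration": (0.0, 0.2, 0.0),
        "light_status": "none",
    },
    "actors": [
        {
            "id": "car_1",
            "type": "car",
            "location": (15.0, -3.1, 0.0),
            "rotation": (0.0, 90.0, 0.0),
            "lane": "left",
            "velocity": (0.0, 7.9, 0.0),
            "acceleration": (0.0, 0.0, 0.0),
            "light_status": "brake",
        },
    ],
    # Static objects carry a type, a location, and a light state or sign value.
    "static_objects": [
        {"id": "light_7", "type": "light", "location": (40.0, 0.0, 5.0), "light_state": "red"},
        {"id": "sign_2", "type": "sign", "location": (35.0, 4.0, 2.0), "sign_value": "speed_limit_30"},
    ],
}


def count_by_type(frame_record):
    """Count how many objects of each type appear in one frame record."""
    counts = {}
    objects = [frame_record["ego"]] + frame_record["actors"] + frame_record["static_objects"]
    for obj in objects:
        counts[obj["type"]] = counts.get(obj["type"], 0) + 1
    return counts


print(count_by_type(frame))
```

Keeping every object as a flat dictionary of attributes makes it straightforward to turn each entry into a scene-graph node later on.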

Under the data.gen module, roadscene2vec also provides an annotation tool for quickly and easily labeling both CARLA datasets and image datasets. The annotator offers a graphical user interface (GUI) that enables users to view, label, exclude, or trim specific driving sequences. Our annotator enables users to assign one label for each sequence and supports averaging multiple independent labelers' decisions. Our annotator's GUI is shown in Figure 3. In addition to the annotation tool, we also provide dataset utilities such as train-test splitting, k-fold cross-validation, and downsampling as part of the trainers in the learning.util module.
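One simple way to combine several labelers' decisions for a sequence is sketched below. The function name average_label and the optional threshold are assumptions for illustration; the text states only that averaging across labelers is supported, not how the averaged value is used.

```python
def average_label(scores, threshold=None):
    """Average the labels given by independent labelers for one sequence.

    If a threshold is supplied (an assumed convention), the mean is
    binarized: 1 when the mean reaches the threshold, otherwise 0.
    """
    if not scores:
        raise ValueError("at least one label is required")
    mean = sum(scores) / len(scores)
    if threshold is None:
        return mean
    return int(mean >= threshold)


# Three labelers rate one lane-change clip as risky (1) or safe (0).
print(average_label([1, 0, 1]))
print(average_label([1, 0, 1], threshold=0.5))
```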

Figure 3: The user interface of the annotator tool, used to label, filter, and trim datasets.

3.2 Data Preprocessing (data.proc)

The data storage and preprocessing functions are implemented through the data.proc module of roadscene2vec. To use a new dataset with roadscene2vec, it must first have the correct directory structure defined in our repository. Next, the input dataset can go through one of the two workflows shown in Figure 2: (i) the dataset is preprocessed into a "RawImageDataset" to be used with CNNs and other image processing models directly, or (ii) the dataset is sent to the corresponding scene-graph extractor to generate scene-graph representations of every frame in the dataset (discussed in Section 3.3). The preprocessing step is necessary for the conventional deep-learning models as the input images often need to be resized, reshaped, or sub-sampled before being trained with models to meet memory and space constraints. After preprocessing, the RawImageDataset object stores the sets of driving video clips as image sequences, the labels associated with the video clips, and metadata (such as sequence name/action type). For each image in each clip in the dataset, the image preprocessor loads the image using OpenCV, resizes and recolors the image according to the configuration settings, and stores the image as a PyTorch Tensor. The resulting RawImageDataset object is then serialized and stored as a pickle (.pkl) file.
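The shape of this preprocessing step can be sketched with the standard library alone. This is not the data.proc implementation: nested lists and a nearest-neighbor resize stand in for the OpenCV and PyTorch operations, a plain dictionary stands in for the RawImageDataset object, and the function names are hypothetical.

```python
import pickle


def resize_nearest(image, new_h, new_w):
    """Nearest-neighbor resize of an image stored as a list of pixel rows."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def preprocess_clips(clips, labels, size):
    """Resize every frame of every clip and bundle clips, labels, and metadata."""
    new_h, new_w = size
    return {
        "sequences": {
            name: [resize_nearest(img, new_h, new_w) for img in frames]
            for name, frames in clips.items()
        },
        "labels": dict(labels),
        "meta": {name: {"num_frames": len(frames)} for name, frames in clips.items()},
    }


# One clip with a single 4x4 single-channel frame whose pixel value is row*4+col.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
dataset = preprocess_clips({"clip_0": [frame]}, {"clip_0": 1}, size=(2, 2))

# Serialize to bytes; writing these bytes to disk would give a .pkl file.
blob = pickle.dumps(dataset)
restored = pickle.loads(blob)
print(restored["sequences"]["clip_0"][0])
```

Serializing the whole preprocessed dataset once means later training runs can skip image decoding and resizing entirely.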

3.3 Road Scene-Graph Extraction (roadscene2vec.scene_graph.extraction)

Here, we describe how an input dataset is converted into a "SceneGraphDataset" object via our scene-graph extraction framework. We first describe how the entities and relations in the scene-graph are defined and configured before discussing the specific steps needed to extract scene-graphs from both CARLA and image-based datasets.

3.3.1 Entity and Relation Extraction

Parameter | Description
actor_names | The list of object types. The default list is based on the actor types defined by the CARLA simulator.
relation_names | The list of all implemented relation types.
car_names / moto_names / bicycle_names / etc. | Object names defined in the CARLA simulator. These lists are used to cross-reference the object type for a given CARLA vehicle name.
directional_thresholds | Defines the set of enabled directional relations and their thresholds in degrees.
directional_relation_list | Defines the pairs of object types for which directional relations will be extracted.
proximity_thresholds | Defines the set of enabled distance relations and their thresholds in feet.
proximity_relation_list | Defines the pairs of object types for which proximity relations will be extracted.
lane_threshold | Represents 50% of the width of a lane in feet. If an object is more than this distance from the ego car's center, it is considered to be in the left or right lane.
Table 1: Scene graph configuration options and their descriptions. Each of these parameters can be reconfigured by the user to produce custom scene-graphs.

A list of roadscene2vec's user-configurable scene-graph extraction settings is shown in Table 1. In our formulation, each "actor" (object) in the scene-graph is assigned a type from the set {car, motorcycle, bicycle, pedestrian, lane, light, sign}, matching those defined by CARLA. Users can reconfigure the set of object types to support other dataset types, applications, or ontologies.

The default relation extraction pipeline we implement identifies three kinds of pair-wise relations: proximity relations (e.g., Visible, Near, Very_Near), directional relations (e.g., Front_Left, Rear_Right), and belonging relations (e.g., car_1 isIn left_lane). Two objects are assigned a proximity relation from {Near_Collision (4 ft.), Super_Near (7 ft.), Very_Near (10 ft.), Near (16 ft.), Visible (25 ft.)} provided the objects are physically separated by a distance that is within that relation's threshold. A directional relation from {Front_Left, Left_Front, Left_Rear, Rear_Left, Rear_Right, Right_Rear, Right_Front, Front_Right} is assigned to a pair of objects, in this case the ego-car and another car in the view, based on their relative orientation and only if they are within the Near threshold distance of one another. Additionally, the isIn relation identifies which vehicles are in which lanes (see Fig. 1). We use each vehicle's horizontal displacement relative to the ego vehicle to assign vehicles to either the Left Lane, Middle Lane, or Right Lane using the known lane width. Our current abstraction only considers three-lane areas, and, as such, we map vehicles in all left lanes and all right lanes to the same Left Lane node and Right Lane node, respectively. If a vehicle overlaps two lanes (e.g., during a lane change), it is mapped to both lanes.
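The proximity and lane rules above can be sketched in a few lines. The thresholds are the defaults quoted in the text; everything else is an assumption for illustration: the function names, the choice to return only the tightest matching proximity relation, the inclusive comparison, and the use of a vehicle half-width to decide when a vehicle overlaps two lanes.

```python
# Default proximity thresholds in feet, as listed in the text.
PROXIMITY_THRESHOLDS = [
    ("Near_Collision", 4.0),
    ("Super_Near", 7.0),
    ("Very_Near", 10.0),
    ("Near", 16.0),
    ("Visible", 25.0),
]


def proximity_relation(distance_ft):
    """Return the tightest relation whose threshold covers the distance, or None.

    Returning only the tightest match is an assumption of this sketch.
    """
    for name, threshold in PROXIMITY_THRESHOLDS:
        if distance_ft <= threshold:
            return name
    return None


def lanes_for_vehicle(lateral_offset_ft, half_width_ft, lane_threshold_ft):
    """Map a vehicle to lane nodes from its lateral offset to the ego car.

    Negative offsets are to the left of the ego car. lane_threshold_ft is
    half a lane width, so the ego lane spans [-threshold, +threshold].
    All lanes further left (or right) collapse to a single node.
    """
    left_edge = lateral_offset_ft - half_width_ft
    right_edge = lateral_offset_ft + half_width_ft
    lanes = []
    if left_edge < -lane_threshold_ft:
        lanes.append("Left Lane")
    if right_edge > -lane_threshold_ft and left_edge < lane_threshold_ft:
        lanes.append("Middle Lane")
    if right_edge > lane_threshold_ft:
        lanes.append("Right Lane")
    return lanes


print(proximity_relation(12.0))
print(lanes_for_vehicle(5.0, 3.0, 6.0))
```

A vehicle centered 5 ft to the right with a 3 ft half-width straddles the boundary at 6 ft, so it is attached to both the Middle Lane and Right Lane nodes, matching the lane-change case described above.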

The set of possible entity types, relation types, relation thresholds, and valid object pairs is defined in the scene_graph_config file. These settings are entirely user re-configurable, enabling broad design space exploration of different graph extraction methodologies. After graph extraction is completed, the set of all scene-graph sequences, metadata, and labels are saved as a SceneGraphDataset.

3.3.2 Carla Scene-Graph Extraction

Since the CARLA datasets contain a dictionary with a list of objects and their attributes, we directly use this dictionary to initialize the nodes in the scene-graph. Each node is assigned its type label from the set of actor_names and its corresponding attributes (e.g., position, angle, velocity, current lane, light status, etc.) for relation extraction. Once all nodes are added to the scene-graph, we extract relations between each pair of objects in the scene.
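A minimal sketch of this two-stage process, nodes first and pairwise relations second, is shown below. The record layout, the function name build_scene_graph, and the single "near" relation are assumptions chosen to keep the example short; the actual extractor applies the full configured relation set.

```python
import itertools
import math


def build_scene_graph(objects, near_threshold=16.0):
    """Build (nodes, edges) from a list of object records.

    Each record is assumed to have "id", "type", and a 2D "location".
    Edges are (source_id, relation, target_id) triples.
    """
    # Stage 1: one node per object, carrying its type label.
    nodes = {obj["id"]: {"type": obj["type"]} for obj in objects}

    # Stage 2: examine every unordered pair of objects once.
    edges = []
    for a, b in itertools.combinations(objects, 2):
        if math.dist(a["location"], b["location"]) <= near_threshold:
            edges.append((a["id"], "near", b["id"]))
    return nodes, edges


objects = [
    {"id": "ego", "type": "car", "location": (0.0, 0.0)},
    {"id": "car_1", "type": "car", "location": (10.0, 0.0)},
    {"id": "car_2", "type": "car", "location": (100.0, 0.0)},
]
nodes, edges = build_scene_graph(objects)
print(nodes)
print(edges)
```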

3.3.3 Image Scene-Graph Extraction

To extract scene-graphs from image-based datasets, the set of objects in a scene and their attributes must be extracted from each image. We use Mask-RCNN he2017mask to extract the set of objects in the image as well as their bounding boxes. Next, we compute the inverse-perspective mapping transformation of the image, yielding a top-down ’birds-eye view’ (BEV) projection of the scene. By generating this projection and projecting the bounding box coordinates from the original image into the birds-eye view, we can estimate the position of each vehicle relative to the ego-vehicle with reasonably high fidelity. This position information, along with the object class information, is used to construct the scene-graphs. However, the BEV projection needs to be re-calibrated for each dataset, as typically, each dataset uses a different camera angle and camera configuration. To facilitate this calibration step, we provide a BEV calibration utility in scene_graph.extraction.bev. This utility provides an interactive way for the user to select the road area and calibrate the BEV projection for a new dataset with a single step.
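The projection of a detected object into the birds-eye view can be sketched as applying a 3x3 homography to a point taken from its bounding box. Two assumptions are made here: the homography H is already known (in practice it would come from the calibration step), and the bottom-center of the bounding box is used as the point where the object meets the road. The function names are hypothetical.

```python
def bbox_ground_point(bbox):
    """Bottom-center of (x_min, y_min, x_max, y_max) in image coordinates."""
    x_min, _y_min, x_max, y_max = bbox
    return (x_min + x_max) / 2.0, y_max


def apply_homography(H, x, y):
    """Project image point (x, y) through a 3x3 homography (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return u / w, v / w


# A toy homography that only translates; a calibrated matrix would also
# encode the perspective change from the camera view to the top-down view.
H = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
x, y = bbox_ground_point((100, 50, 200, 150))
print(apply_homography(H, x, y))
```

Because the matrix depends on camera height and angle, the same detection can land in very different BEV positions across datasets, which is why recalibration per dataset is required.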

3.3.4 Scene-Graph Visualization

Our scene-graph visualization tool, located in the roadscene2vec.util module, consists of a GUI that simultaneously displays an input image side by side with its corresponding scene-graph, as is shown in Figure 8. This tool enables researchers to experiment with a wide range of relation types and distance thresholds and quickly optimize their scene-graph extraction settings for their specific application or dataset.

3.4 Scene-Graph Embedding (roadscene2vec.learning)

The learning module contains our framework for splitting datasets as well as training, testing, and scoring models at various tasks. It also contains our graph learning models as well as the baseline deep learning models. The model submodule contains the model definitions while the util submodule contains the training, evaluation, and scoring functions. The training code supports implementing k-fold cross-validation, a user-definable train:test split, and downsampling and class weighting to correct dataset imbalances. The model specification, training hyperparameters, and dataset configuration settings are loaded from the learning_config file, which is user-modifiable. Next, we introduce the models available in roadscene2vec.
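Two of the dataset utilities mentioned above, k-fold splitting and class weighting, can be illustrated as follows. This is a generic sketch rather than the learning.util code: the function names are hypothetical, the folds are contiguous and unshuffled, and the weights use the common inverse-frequency formula n_samples / (n_classes * class_count).

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def class_weights(labels):
    """Inverse-frequency class weights, so rare classes count for more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Imbalanced toy labels: three safe clips (0) and one risky clip (1).
print(class_weights([0, 0, 0, 1]))
print(list(k_fold_indices(5, 2)))
```

Weighting the loss by these values is an alternative to downsampling the majority class when risky sequences are rare.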

3.4.1 Graph Learning Models (roadscene2vec.learning.model)

Figure 4: Graph learning model configuration options provided in roadscene2vec.

The graph learning models we provide in roadscene2vec enable various configurations of both spatial modeling and temporal modeling components as shown in Figure 4. The spatial modeling components that can be configured include (i) graph convolution layers, (ii) graph pooling and graph attention layers, and (iii) graph readout operations. The temporal modeling components that can be configured include (i) temporal modeling layers and (ii) temporal attention layers. Our experiments use MRGCN and MRGIN models that are identical in structure and differ only in the type of spatial modeling used. Next, we discuss these components in more detail.

Spatial Modeling (Spatial_Model)

We provide two multi-relational graph convolution implementations based on (i) graph convolutional networks (GCNs) kipf2016semi and (ii) graph isomorphism networks (GINs) xu2018powerful. These layers propagate node embeddings across edges via graph convolutions, resulting in a new set of node embeddings. The two implementations differ with regard to how data is propagated through successive graph convolutions. Graph pooling is used to filter the set of node embeddings in the graph to only those most useful for the task. We enable two types of graph pooling layers extended for multi-relational use cases: Self-Attention Graph Pooling (SAGPool) lee2019self and Top-K Pooling (TopkPool) gao2019graph. After pooling, a global readout operation is used to collect the set of pooled node embeddings into a unified graph embedding. We implement max, mean, and add readout operations.
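The pooling and readout stages can be illustrated without any graph library. The sketch below is a simplification: the per-node scores are passed in directly, whereas SAGPool and TopkPool learn them, and no score-based gating of the kept embeddings is applied. The function names are hypothetical.

```python
import math


def top_k_pool(node_embeddings, scores, ratio=0.5):
    """Keep the highest-scoring fraction of nodes, preserving node order."""
    k = max(1, math.ceil(ratio * len(node_embeddings)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [node_embeddings[i] for i in keep]


def readout(node_embeddings, mode="mean"):
    """Collapse a set of node embeddings into one graph embedding."""
    columns = list(zip(*node_embeddings))
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "add":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown readout mode: {mode}")


nodes = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
pooled = top_k_pool(nodes, scores=[0.1, 0.9, 0.5], ratio=0.5)
print(pooled)
print(readout(pooled, mode="mean"))
```

Whatever the number of objects in a frame, the readout yields a vector of fixed length, which is what allows a sequence of scene-graphs to be fed to a temporal model.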

Temporal Modeling (Temporal_Model)

The temporal model we implement uses Long Short-Term Memory (LSTM) layers to convert the sequence of scene-graph embeddings to either (i) one spatio-temporal embedding (for sequence classification tasks) or (ii) a sequence of spatio-temporal embeddings (for graph classification/prediction tasks). For graph classification/collision prediction tasks, the output from an LSTM layer for each input scene-graph embedding is collected as a sequence of spatio-temporal scene-graph embeddings that is then sent to an MLP layer to produce the final set of model outputs. For sequence classification tasks, a temporal readout operation is applied to this sequence of LSTM outputs to compute a single spatio-temporal sequence embedding by (i) extracting only the last hidden state of the LSTM (LSTM-last), (ii) taking the sum over the sequence, or (iii) using a temporal attention layer (LSTM-attn) to compute an attention-weighted sum of the different elements of the sequence as described in bahdanau2014neural.
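The three temporal readout options can be compared on a toy sequence of per-frame LSTM outputs. This is a simplified sketch: the attention scores come from a dot product with a fixed query vector, which is an assumption standing in for the learned scoring function of bahdanau2014neural, and all function names are hypothetical.

```python
import math


def lstm_last(outputs):
    """Use only the final time step as the sequence embedding."""
    return list(outputs[-1])


def temporal_sum(outputs):
    """Sum the outputs over all time steps."""
    return [sum(col) for col in zip(*outputs)]


def temporal_attention(outputs, query):
    """Attention-weighted sum over time steps, using dot-product scores."""
    scores = [sum(q * h for q, h in zip(query, out)) for out in outputs]
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(outputs[0])
    pooled = [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]
    return pooled, weights


outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # one vector per frame
print(lstm_last(outputs))
print(temporal_sum(outputs))
print(temporal_attention(outputs, query=[1.0, 1.0]))
```

The attention weights also indicate which frames contributed most to the sequence embedding, which is useful when inspecting why a clip was classified as risky.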

3.4.2 Baseline Models (roadscene2vec.learning.model)

In addition to the graph learning models that are core to roadscene2vec, we also provide a set of baseline deep learning models for quickly and easily comparing to typical image-processing approaches. These baselines include (i) a ResNet-50 he2016deep CNN classifier and (ii) a CNN+LSTM classifier yurtsever2019risky. The motivation for using these baselines stems from their prevalence in AV image processing tasks, such as risk assessment yurtsever2019risky. Users can easily use other graph/deep-learning models with our framework as long as their model follows the same, typical PyTorch model structure.

3.4.3 Performance Evaluation and Hyperparameter Optimization

To enable live monitoring of training runs and in-depth analysis of the effects of different hyperparameter settings on performance, we integrate our library with Weights and Biases (W&B). W&B is a free, publicly available tool for tracking experiments, visualizing performance, identifying hyperparameter importance, and organizing results. We believe this integration will enable researchers to identify trends in the data and optimize model performance more quickly.

4 Usage Examples

In this section, we describe some of roadscene2vec’s use cases. First, Section 4.1 exhibits a fundamental use case in which an image frame is converted into a scene-graph and then into a fixed-length embedding. Next, the use cases of roadscene2vec for two risk-based autonomous driving applications (subjective risk assessment and collision prediction) are described in Section 4.2 and Section 4.3, respectively. In Section 4.4, we discuss how roadscene2vec can be used for performing and evaluating transfer learning. Finally, in Section 4.5, we show how roadscene2vec can be used to analyze the explainability of the graph learning models.

4.1 Use Case 1: Converting an Ego-Centric Observation Into a Scene-Graph

Our high-level algorithm for converting an input image into a scene-graph is shown in Algorithm 1. Let us walk through a typical workflow for converting an image dataset into a set of scene-graph embeddings. First, the image is preprocessed by the preprocessor to set the dataset format and image sizing. Next, the extractor extracts the scene-graph from the image. These scene-graphs can then be visualized using the visualizer tool we provide. The following script streamlines the execution of this use case:

    > python examples/

These scripts take configuration information directly from the data_config and scene_graph_config files in the config module. The config files indicate which type of dataset is being used (CARLA or image-based) as well as the location and extraction settings for the dataset. The scene_graph_config file also allows the reconfiguration of the relation extraction settings as shown in Table 1. The choice of relation extraction settings changes the scene-graph structure, which can change how the graph learning model processes the data.

    Input: A sequence of images I from a driving video clip.
    Output: A sequence of scene graphs G for I.
    def IMG2GRAPH(I_t):
        O_t ← Obj_Detection(I_t)
        A_t ← Attr_Extraction(O_t)
        G_t ← Graph_Extraction(A_t)
        return G_t
    G ← { }
    for I_t in I do
        G ← G ∪ {IMG2GRAPH(I_t)}
    end for
    return G
Algorithm 1 Use Case 1 - Extracting a sequence of scene-graphs from a driving clip.

4.2 Use Case 2: Subjective Risk Assessment

Figure 5: The architecture of our configurable scene-graph based AV perception model. Our two pre-implemented temporal modeling pipelines for specific AV tasks are shown (sequence classification and graph prediction). However, users can remove or replace these model components for performing other AV tasks such as graph classification or scenario classification.

In prior AV research, attempts to improve vehicle safety have involved modeling either the objective risk or the subjective risk of driving scenes grayson2003risk; fuller2005towards; bao2019personalized. The objective risk is defined as the objective probability of an accident occurring and is typically determined by statistical analysis grayson2003risk. In contrast, subjective risk refers to the driver’s own perceived risk and is an output of the driver’s cognitive process fuller2005towards; bao2019personalized. Since subjective risk accounts for the human behavior perspective and its critical role in anticipating risks bao2020personalized; bao2019personalized; fuller2005towards, it has the potential to assess contextual risk better than objective methods and thus better assure passenger safety. Further, studies such as trankle1990risk; grayson2003risk provide direct evidence that a driver’s subjective risk assessment is inversely related to the risk of traffic accidents. Within this context, AVs must be able to understand driving scenes and quantify the subjective risk of driving decisions.

Given this motivation, we show that the graph learning models available in roadscene2vec can be used to convert these extracted scene-graphs into spatio-temporal scene-graph embeddings for the task of subjective risk assessment, as was done in our prior work yu2021scene.

4.2.1 Problem Formulation

In our prior work yu2021scene, and here, we make the same assumption used in yurtsever2019risky that the set of driving sequences can be partitioned into two jointly exhaustive and mutually exclusive subsets: risky and safe. We denote the sequence of images of length $T$ by $I = \{I_1, I_2, \ldots, I_T\}$. We assume the existence of a spatio-temporal function $f$ that outputs whether a sequence of driving actions is safe or risky via a risk label $Y$, as given in Equation 1.

$$Y = f(I) = f(\{I_1, I_2, \ldots, I_T\}), \quad Y \in \{\text{risky}, \text{safe}\} \tag{1}$$

Overall, the goal of the model is to learn to approximate the function $f$. Our algorithmic implementation of this use case is shown in Algorithm 2.

    Input: A sequence of images I from a driving video clip.
    Output: Risk assessment Y.
    def SEQ2VEC(G):
        Z ← { }
        for G_t in G do
            Z ← Z ∪ {Spatial_Model(G_t)}
        end for
        z ← Temporal_Model(Z)
        Y ← Activation(MLP(z))
        if Y falls in the risky class then
            return risky
        else if Y falls in the safe class then
            return safe
    def RISK_ASSESS(I):
        G ← EXTRACT_SEQ(I)
        Y ← SEQ2VEC(G)
        return Y
Algorithm 2 Use Case 2 - Scene-graph embedding for risk assessment

4.2.2 Training

To achieve this goal, we train the graph learning model using the extracted sequences of scene-graphs as inputs and the subjective risk labels given by human annotators for each sequence. As such, the problem becomes a simple sequence classification problem, where the goal is to classify a given sequence of images as “risky” or “safe”. The configuration settings for training the model are available in the learning_config file in the config module. The following command can be used to train the model for risk assessment:

    > python examples/

4.3 Use Case 3: Collision Prediction

Figure 6: Demonstration of collision prediction using scene-graphs. Each node’s color indicates its attention score (importance to the collision likelihood) from orange (high) to green (low).

In our third use case, we demonstrate how roadscene2vec can be used to study approaches for predicting future vehicle collisions. In contrast to Use Case 2, which is a sequence classification problem, collision prediction has safety-critical time constraints and uses the history of prior scene-graphs to make predictions about the state of future graphs. Current statistics indicate that perception and prediction errors were factors in over 40% of driver-related crashes between conventional vehicles mueller2020humanlike. However, a significant number of reported AV collisions are also the result of these errors schoettle2015preliminary; xu2019statistical. With this motivation, we show that scene-graphs can be used to represent road scenes and model inter-object relationships to improve perception and scene understanding. An example of our methodology is shown in Figure 6.

4.3.1 Problem Formulation

We formulate the problem of collision prediction as a time-series classification problem where the goal is to predict if a collision will occur in the near future. Our goal is to accurately model the spatio-temporal function $f$, defined as

$$Y_t = f(\{I_1, \ldots, I_{t-1}, I_t\}), \quad Y_t \in \{0, 1\},$$

where $Y_t = 1$ implies a collision in the near future and $Y_t = 0$ otherwise. Here the variable $I_t$ denotes the image captured by the on-board camera at time $t$. The interval between each frame varies with the camera sampling rate. Our implementation of Use Case 3 is shown in Algorithm 3.

    Input: A sequence of images I from a driving video clip.
    Output: Sequence of collision likelihood predictions Y.
    def GRAPH2VEC(G_t, h, c):
        z_t ← Spatial_Model(G_t)
        p_t ← Temporal_Model(z_t, h, c)
        Y_t ← Activation(MLP(p_t))
        if Y_t indicates a collision then
            return 1
        else if Y_t indicates no collision then
            return 0
    G ← EXTRACT_SEQ(I)
    h ← [0, 0, …, 0], c ← [0, 0, …, 0]
    Y ← { }
    for G_t in G do
        Y ← Y ∪ {GRAPH2VEC(G_t, h, c)}
    end for
    return Y
Algorithm 3 Use Case 3 - Scene-graph embedding for collision prediction

4.3.2 Training

To train a model for this application, we adjust the model to produce one output per graph instead of one output per sequence. For the application of collision prediction, we also assign each frame in a video clip a label identical to the entire clip’s label to train the model to identify the preconditions of a future collision and predict it as early as possible. The following command can be used to train the model for collision prediction:

    > python examples/
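The per-frame labeling scheme described above, in which every frame inherits the label of its clip, can be sketched as follows. The function name and the dictionary layout are hypothetical and chosen only for illustration; they do not reflect roadscene2vec's dataset classes.

```python
from typing import Dict, List


def frame_labels_from_clip_labels(clip_lengths: Dict[str, int],
                                  clip_labels: Dict[str, int]) -> Dict[str, List[int]]:
    """Give every frame of a clip the same label as the clip itself.

    Hypothetical helper: `clip_lengths` maps a clip id to its number of
    frames, `clip_labels` maps a clip id to 1 (collision) or 0 (no collision).
    """
    return {clip: [clip_labels[clip]] * n_frames
            for clip, n_frames in clip_lengths.items()}


# Made-up clips: one that ends in a collision and one that does not.
lengths = {"clip_a": 4, "clip_b": 3}
labels = {"clip_a": 1, "clip_b": 0}
print(frame_labels_from_clip_labels(lengths, labels))
# {'clip_a': [1, 1, 1, 1], 'clip_b': [0, 0, 0]}
```

Because even the earliest frames of a collision clip carry the positive label, the model is pushed to recognize preconditions of a collision rather than only the final frames.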

4.4 Use Case 4: Transfer Learning

Models trained on simulated datasets must be able to transfer their knowledge to real-world driving scenarios, which can differ significantly from simulations. One key advantage of using scene-graphs is that they are a form of Intermediate Representation (IR), meaning that they provide a higher level of abstraction compared to image data alone. This abstraction means that scene-graphs are generally better able to transfer knowledge across datasets and domains, such as from simulated data to real-world driving data. Since this is a key benefit of using a graph-based approach and is a critical use case for validating AV safety, roadscene2vec supports running transfer learning experiments between any two datasets. To implement this use case, we use the original dataset to train the model and use the user-specified transfer dataset to test the model. No additional domain adaptation is performed. The workflow for Use Case 4 is shown in Algorithm 4. The following script runs an example of transfer learning:

    > python examples/
    Input: Source dataset D_s, transfer dataset D_t, model M, and training epochs E.
    Output: Transfer learning result R.
    def TRAIN(D_s, M, E):
        for e in E do
            L ← Loss_Function(M, D_s)
            M ← Update_Model(M, L)
        end for
        return M
    def EVALUATE(D_t, M):
        R ← Score(M, D_t)
        return R
    M ← TRAIN(D_s, M, E)
    R ← EVALUATE(D_t, M)
    return R
Algorithm 4 Use Case 4 - Transfer learning evaluation

4.5 Use Case 5: Explainability Analysis

Explainability refers to the ability of a model to communicate the factors that influenced its decision-making process for a given input, particularly those that might lead the model to make incorrect decisions adadi2018peeking; knyazev2019understanding. Since deep-learning models are typically black-boxes, they are difficult to diagnose and adjust when failures occur. Thus, models which can better explain their decision-making process are easier to verify, debug, and make safer. Our library enables users to analyze the explainability of different model architectures by visualizing the node attention scores of a graph learning model for a given input. The workflow of this use case is shown in Algorithm 5. First, using a pre-trained graph learning model, we run inference on a dataset and record the model’s spatial and temporal attention scores for each sequence to a CSV file. Then, we visualize the node attention scores for each scene-graph and color code the nodes according to their attention score. For a given graph, the nodes with higher attention scores had a more significant impact on the decision made by the model.

    Input: A sequence of images I from a driving video clip, trained model M.
    Output: Risk assessment result Y, node attention scores α and temporal attention score β for each graph in G.
    def SEQ2VEC_ATTN(G):
        Z ← { }, α ← { }
        for G_t in G do
            z_t, α_t ← Spatial_Model(G_t)   // α_t from SAGPool layer
            Z ← Z ∪ {z_t}, α ← α ∪ {α_t}
        end for
        z, β ← Temporal_Model(Z)   // β from LSTM-attn layer
        Y ← Activation(MLP(z))
        if Y falls in the risky class then
            return risky, α, β
        else if Y falls in the safe class then
            return safe, α, β
    G ← EXTRACT_SEQ(I)
    Y, α, β ← SEQ2VEC_ATTN(G)
    return Y, α, β
Algorithm 5 Use Case 5 - Explainability analysis of scene-graph risk assessment

5 Experiments

In this section, we present results from running each use case presented in Section 4 as well as details on the datasets and metrics used to evaluate each model.

5.1 Dataset Preparation

For experiments, we prepared two types of driving datasets: (i) synthesized lane-changing datasets (271-syn and 1043-syn), and (ii) typical real-world driving datasets (571-honda and 1361-honda). We labeled all of the datasets using our annotator tool as described in Section 3.1. More details on the datasets as well as the labeling process can be found in yu2021scene. We randomly split each dataset into a training set and a testing set by the ratio 7:3 such that the split is stratified, i.e., the proportion of risky to safe lane change clips in the training and testing sets is the same. The models are first trained on the training set before being evaluated on the testing set. The final score of a model on a dataset is computed by averaging over the testing set scores for five different stratified train-test splits.
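A stratified 7:3 split of the kind described above can be sketched with the Python standard library as follows. This is an illustrative sketch, not the splitting code used for the experiments; the function name `stratified_split` and its parameters are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable],
                     train_ratio: float = 0.7,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split clip indices so each class keeps the same train/test proportion.

    Hypothetical helper: returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(train_ratio * len(indices))   # per-class 7:3 cut point
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Made-up dataset: 10 risky clips (label 1) and 30 safe clips (label 0).
labels = [1] * 10 + [0] * 30
for seed in range(5):  # five different stratified splits, one per seed
    train_idx, test_idx = stratified_split(labels, 0.7, seed)
    risky_train = sum(labels[i] for i in train_idx) / len(train_idx)
    risky_test = sum(labels[i] for i in test_idx) / len(test_idx)
    print(seed, len(train_idx), len(test_idx), risky_train, risky_test)
```

In this example every split contains 28 training clips and 12 testing clips, and the risky fraction is 0.25 in both, which is the property the stratification is meant to guarantee.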

5.2 Model Configuration

In our experiments, we use two graph learning architectures denoted MRGCN and MRGIN. Both models consist of the following structure: two graph convolution layers of size 64, one SAGPooling layer with 0.5 pooling ratio, one add readout layer, and one problem-specific temporal model as defined in Figure 5. The two architectures only differ in the way successive graph convolutions are processed, as discussed in Section 3.4.1. As for the baselines, we evaluate the ResNet-50 CNN classifier and the CNN+LSTM classifier in our experiments. All models were evaluated using 5-fold cross-validation with the average test performance over the five folds presented as the final result.

5.3 Use Case 1 Evaluation: Scene-Graph Extraction

In Figure 7, we show an example where two scene graphs are extracted from the same input image with different relation extraction settings. The graph at the bottom contains relations between all pairs of vehicles in the scene; for each pair of vehicles, if the two vehicles are within some distance threshold, the distance and direction relations are constructed. The graph at the top left is similar, but it only contains relations between the ego vehicle and each other vehicle. This figure shows one example of the ways that our tool enables flexible graph construction for different applications. A demonstration of our visualizer tool is shown in Figure 8. As shown, our visualizer allows the user to inspect how objects detected in the input image translate to the objects and relations in the scene-graph.
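To illustrate how the two relation extraction settings differ, the following sketch builds distance relations either between all vehicle pairs or only between the ego vehicle and each other vehicle. It is a simplification under stated assumptions: the function name, the 2D positions, the single "near" relation label, and the threshold value are all placeholders, and direction relations are omitted.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]


def extract_distance_relations(positions: Dict[str, Tuple[float, float]],
                               threshold: float,
                               ego_only: bool = False,
                               ego: str = "ego") -> List[Relation]:
    """Add a 'near' relation for every considered pair within `threshold`.

    Hypothetical helper: `ego_only=True` keeps only pairs involving the ego
    vehicle; `ego_only=False` considers all pairs of vehicles.
    """
    relations = []
    for a, b in combinations(sorted(positions), 2):
        if ego_only and ego not in (a, b):
            continue
        if math.dist(positions[a], positions[b]) <= threshold:
            relations.append((a, "near", b))
    return relations


# Made-up bird's-eye-view positions in meters.
scene = {"ego": (0.0, 0.0), "car_1": (3.0, 4.0), "car_2": (4.0, 4.0), "car_3": (50.0, 0.0)}
print(extract_distance_relations(scene, threshold=10.0))
# [('car_1', 'near', 'car_2'), ('car_1', 'near', 'ego'), ('car_2', 'near', 'ego')]
print(extract_distance_relations(scene, threshold=10.0, ego_only=True))
# [('car_1', 'near', 'ego'), ('car_2', 'near', 'ego')]
```

The all-pairs setting yields a denser graph that also captures interactions among other vehicles, while the ego-only setting produces a star-shaped graph centered on the ego vehicle.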

Figure 7: Demonstration of Scene-Graph extraction with two different relation extraction settings. Zoom in for details.
Figure 8: A demonstration of our scene-graph visualization tool that enables the user to inspect: (i) an original input image, (ii) the object detection results, (iii) the birds-eye view projection of the image, and (iv) the resultant scene-graph.

5.4 Use Case 2 Evaluation: Subjective Risk Assessment

Here, we demonstrate how roadscene2vec can be used to train and evaluate several models for the subjective risk assessment use case. We used classification accuracy and the Area Under the Curve (AUC) bradley1997use of the Receiver Operating Characteristic (ROC) to score the models. AUC, sometimes referred to as a balanced accuracy measure sokolova2009systematic, measures the probability that a binary classifier ranks a randomly chosen positive sample higher than a randomly chosen negative sample. It is therefore a more balanced measure than raw accuracy, especially on imbalanced datasets (i.e., 271-syn, 1043-syn, 571-honda).
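The ranking interpretation of AUC can be made concrete with a small sketch that counts, over all positive-negative pairs, how often the positive sample receives the higher score (ties count as one half). The function name `pairwise_auc` is hypothetical; this quadratic-time version is meant for illustration rather than for large evaluations.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up risk scores: label 1 = risky, label 0 = safe.
print(pairwise_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

Because the measure depends only on how positives are ranked relative to negatives, it is unaffected by the ratio of risky to safe clips in the dataset.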

Table 2 shows a comparison between MRGCN, MRGIN, ResNet-50, and CNN+LSTM yurtsever2019risky models for driving scene risk assessment. The results show that the MRGCN-based approach consistently outperforms the other models across all the datasets in terms of both classification accuracy and AUC. We found that the performance difference between the scene-graph-based approaches and the CNN-based approaches increased when the training datasets were smaller, indicating that the graph-based methods could likely learn a good representation with less data.

Metric    Dataset     MRGCN   MRGIN   ResNet-50  CNN+LSTM yurtsever2019risky
Accuracy  271-syn     0.9320  0.8561  0.6938     0.8033
Accuracy  1043-syn    0.9580  0.8784  0.9053     0.7742
Accuracy  571-honda   0.8710  0.8310  0.7689     0.6041
Accuracy  1361-honda  0.8655  0.7245  0.6839     0.7158
AUC       271-syn     0.9620  0.9437  0.7371     0.8394
AUC       1043-syn    0.9780  0.9591  0.9616     0.8221
AUC       571-honda   0.9105  0.8903  0.8343     0.6670
AUC       1361-honda  0.9124  0.8164  0.7340     0.7560
Table 2: Risk assessment results for MRGCN, MRGIN, ResNet-50, and CNN+LSTM.

5.5 Use Case 3 Evaluation: Collision Prediction

Next, we evaluated the models in roadscene2vec on collision prediction using classification accuracy, AUC, and the Matthews Correlation Coefficient (MCC) chicco2020advantages. MCC is considered a balanced performance measure for binary classification, even on datasets with significant class imbalances. MCC takes a value between -1.0 and 1.0, where 1.0 corresponds to a perfect classifier, 0.0 to a random classifier, and -1.0 to an always incorrect classifier. The results from our evaluation are shown in Table 3.
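For reference, MCC can be computed from the four entries of a binary confusion matrix, as in the following sketch. The function name `mcc` is hypothetical, and returning 0.0 when the denominator vanishes is a common convention that we adopt here as an assumption.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=5, tn=5, fp=0, fn=0))  # 1.0  (perfect classifier)
print(mcc(tp=5, tn=5, fp=5, fn=5))  # 0.0  (no better than chance)
print(mcc(tp=0, tn=0, fp=5, fn=5))  # -1.0 (always incorrect)
```

The three calls reproduce the three reference points given in the text: 1.0, 0.0, and -1.0.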

Once again, MRGCN outperforms the other models on the synthetic datasets. However, on the 571-honda dataset, the ResNet-50 model outperforms MRGCN across all metrics. Upon deeper inspection of the results, we found that the ResNet-50 model had a higher FNR than the MRGCN and a lower FPR than the MRGCN, suggesting that the ResNet-50 model is less sensitive than the MRGCN. Given that collision prediction is a safety-critical application, this behavior may not necessarily be desirable; however, decision boundary tuning could be used to fine-tune the sensitivity for the final application’s requirements.
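The effect of decision boundary tuning on the false positive rate (FPR) and false negative rate (FNR) can be illustrated with a short sketch. The function name `fpr_fnr` and the scores below are made up for this example and are not taken from our experiments.

```python
from typing import Sequence, Tuple


def fpr_fnr(scores: Sequence[float], labels: Sequence[int],
            threshold: float) -> Tuple[float, float]:
    """Return (FPR, FNR) when predicting a collision for scores >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Made-up collision likelihoods: label 1 = collision, label 0 = no collision.
scores = [0.9, 0.45, 0.55, 0.2]
labels = [1, 1, 0, 0]
print(fpr_fnr(scores, labels, 0.6))  # (0.0, 0.5)  conservative: misses a collision
print(fpr_fnr(scores, labels, 0.5))  # (0.5, 0.5)
print(fpr_fnr(scores, labels, 0.4))  # (0.5, 0.0)  sensitive: no missed collisions
```

Lowering the threshold trades additional false alarms for fewer missed collisions, which is often the preferable direction for a safety-critical application.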

On both Use Case 2 and 3, MRGIN underperforms MRGCN, likely because MRGCN is a more general framework while MRGIN is designed to perform well at graph topology analysis problems, such as graph isomorphism testing. MRGIN may outperform MRGCN on different problem formulations or graph construction formulations if they play to these strengths of MRGIN.

Metric    Dataset    MRGCN   MRGIN   ResNet-50  CNN+LSTM yurtsever2019risky
Accuracy  271-syn    0.8812  0.8028  0.7039     0.7184
Accuracy  1043-syn   0.9095  0.7803  0.8080     0.8029
Accuracy  571-honda  0.6922  0.7230  0.7340     0.5606
AUC       271-syn    0.9457  0.8724  0.7564     0.7607
AUC       1043-syn   0.9477  0.8826  0.9026     0.8493
AUC       571-honda  0.7775  0.7844  0.7802     0.5871
MCC       271-syn    0.5145  0.3046  0.3320     0.1474
MCC       1043-syn   0.5385  0.2852  0.4602     0.2436
MCC       571-honda  0.2142  0.1908  0.3547     0.1347
Table 3: Collision prediction accuracy, AUC, and MCC for different models in roadscene2vec.

5.6 Use Case 4 Evaluation: Transfer Learning

Here, we demonstrate how roadscene2vec can be used to evaluate each model’s ability to transfer the knowledge learned from simulated datasets to real-world datasets. As part of this use case, roadscene2vec uses the model weights and parameters learned from training on the simulated dataset (271-syn or 1043-syn in this case) directly for testing on the real-world driving dataset (571-honda) with no domain adaptation steps. We show the results of this evaluation for the MRGCN, ResNet-50, and CNN+LSTM models in Table 4.

As expected, the performance of all models degrades when tested on the 571-honda dataset. However, as Table 4 shows, the accuracy of the MRGCN drops by only 3.5 and 6.5 percentage points when the model is trained on 271-syn and 1043-syn, respectively, while the CNN+LSTM’s accuracy drops by 27.9 and 17.3 percentage points, respectively. Furthermore, the MRGCN achieves a higher accuracy score than the CNN+LSTM when transferring from the smaller 271-syn dataset, once again indicating that scene-graph models can better model the problem even when trained on smaller amounts of data. The ResNet-50 model performs worst and classifies most of the sequences as risky, resulting in an accuracy score nearly equivalent to the proportion of risky sequences in the 571-honda dataset (17.25%). These results suggest that the scene-graph models can transfer knowledge more effectively than the CNN-based models.

Experiment             Model                        Original Acc.  Transfer Acc.
271-syn to 571-honda   ResNet-50                    0.7039         0.1899 (-0.514)
271-syn to 571-honda   CNN+LSTM yurtsever2019risky  0.8033         0.5244 (-0.279)
271-syn to 571-honda   MRGCN                        0.9040         0.8690 (-0.035)
1043-syn to 571-honda  ResNet-50                    0.8080         0.1725 (-0.636)
1043-syn to 571-honda  CNN+LSTM yurtsever2019risky  0.7742         0.6010 (-0.173)
1043-syn to 571-honda  MRGCN                        0.9520         0.8870 (-0.065)
Table 4: The results of comparing transferability between MRGCN, ResNet-50, and CNN+LSTM yurtsever2019risky. In this experiment, we trained each model on both the 271-syn dataset and the 1043-syn dataset. We then evaluated the accuracy of each trained model on both the original dataset and the 571-honda dataset without any domain adaptation.
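The accuracy drops reported in parentheses in Table 4 are simply the original accuracy minus the transfer accuracy. The short sketch below recomputes them from the table values; the variable and function names are hypothetical and used only for this worked example.

```python
from typing import List, Tuple

# (source dataset, model, original accuracy, transfer accuracy) from Table 4.
TABLE_4: List[Tuple[str, str, float, float]] = [
    ("271-syn", "ResNet-50", 0.7039, 0.1899),
    ("271-syn", "CNN+LSTM", 0.8033, 0.5244),
    ("271-syn", "MRGCN", 0.9040, 0.8690),
    ("1043-syn", "ResNet-50", 0.8080, 0.1725),
    ("1043-syn", "CNN+LSTM", 0.7742, 0.6010),
    ("1043-syn", "MRGCN", 0.9520, 0.8870),
]


def accuracy_drop(original: float, transfer: float) -> float:
    """Absolute accuracy lost when moving from the source to the transfer dataset."""
    return original - transfer


for source, model, original, transfer in TABLE_4:
    drop = accuracy_drop(original, transfer)
    print(f"{source:>8} -> 571-honda  {model:<9}  drop = {drop:.4f}")
```

For example, the MRGCN trained on 271-syn loses 0.9040 - 0.8690 = 0.035 in accuracy, i.e., 3.5 percentage points, whereas the CNN+LSTM trained on the same dataset loses 0.8033 - 0.5244 = 0.2789.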

5.7 Use Case 5 Evaluation: Explainability Analysis

To demonstrate roadscene2vec’s tools for evaluating explainability, we run our included explainability analysis tool on our MRGCN model trained for risk assessment on the 271-syn dataset. The result from analyzing one of the sequences from the dataset is shown in Figure 9. As shown, the attention scores are highest on the nodes which present the highest degree of risk. Additionally, the graph with the highest attention score for the other vehicle is also the graph corresponding to the collision with the other vehicle.

Figure 9: A demonstration of how Use Case 5 enables explainability analysis. For this driving sequence, it can be clearly seen how the node attention scores shift to give higher weight to the approaching vehicle as its distance to the ego car reduces.

6 Discussion

6.1 Practicality

Although roadscene2vec is intended to be a tool that benefits the research community, its practicality and carryover to real-world applications are equally important. As shown with Use Case 4, roadscene2vec enables researchers to directly evaluate the ability of models trained on synthetic data to transfer their knowledge to real-world driving scenes. Many research papers overlook this critical problem, leading to a disconnect between simulated trials and real-world performance. Our tool better enables the study of this crucial problem area and allows researchers to analyze the real-world practicality of various graph-based methodologies. Furthermore, we show that roadscene2vec is directly compatible with both the real-world Honda driving dataset Ramanishka_behavior_CVPR_2018 and the popular open-source driving simulator CARLA dosovitskiy2017carla, making our tool useful for a wide range of potential AV applications.

6.2 Limitations and Future Work

Although roadscene2vec provides a suite of tools for training and evaluating both scene-graph-based and CNN-based models, its capabilities have some limitations. For example, roadscene2vec currently supports only two input formats: ground-truth data from the CARLA simulator and image data from a forward-facing camera. It does not currently support radar, lidar, or multi-camera data. We selected image data and CARLA data as the primary input modalities because these data types are currently the ones most used by AV researchers. Although radar and lidar data are useful and well-studied in specific applications such as localization and sensor fusion, most AV research papers exploring perception and control methodologies use camera-based inputs. This limitation can be overcome by implementing preprocessors for extracting (or fusing) scene-graphs from these different modalities, which we leave as future work. Furthermore, our tool implements only a few common types of perception algorithms and use cases. However, it is designed to be modular and re-configurable to support custom models and problem formulations. We expect that researchers will design custom architectures and models for the various well-studied problems in the AV domain, and we provide instructions in our repository for integrating custom models with roadscene2vec’s workflow. Thus, we leave the study of other AV applications and model architectures as future work. We also welcome outside contributions to our open-source tool to further improve its utility for the research community.

7 Conclusion

It is clear from current research as well as the examples shown in this paper that scene-graph representations of road scenes can be beneficial for a wide range of AV applications. In this paper, we introduced and demonstrated our tool for exploring and studying the applications of road scene-graphs, named roadscene2vec. We showed that our re-configurable graph-construction methodology enables the study of different graph layouts for various problems. We also demonstrated performance evaluations for conventional CNN architectures and graph-based models for two common AV perception use cases: risk assessment and collision prediction. Furthermore, we showed how our tool facilitates studying the transferability and explainability of graph-based AV models for both synthetic and real-world data. We believe our open-source tool fills a significant gap in the research community and will enable deeper study of the applicability and practicality of graph-based solutions for AV problems.


This work was partially supported by the National Science Foundation (NSF) under award CMMI-1739503 and by Graduate Assistance in Areas of National Need (GAANN) under award P200A180052. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agency.