Graph and Recurrent Neural Network-based Vehicle Trajectory Prediction For Highway Driving

Integrating trajectory prediction into the decision-making and planning modules of modular autonomous driving systems is expected to improve the safety and efficiency of self-driving vehicles. However, predicting a vehicle's future trajectory is challenging because it is affected by the social interactive behaviors of neighboring vehicles, and the number of neighboring vehicles varies across situations. This work proposes a GNN-RNN-based Encoder-Decoder network for interaction-aware trajectory prediction, where vehicles' dynamics features are extracted from their historical tracks using an RNN, and the inter-vehicular interaction is represented by a directed graph and encoded using a GNN. The parallelism of the GNN implies the proposed method's potential to predict multi-vehicular trajectories simultaneously. Evaluation on a dataset extracted from NGSIM US-101 shows that the proposed model is able to predict a target vehicle's trajectory in situations with a variable number of surrounding vehicles.


I Introduction

Autonomous driving is expected to improve the safety and efficiency of our daily transportation thanks to technological advancements in both algorithms and hardware. While a typical autonomous driving system consists of four modules (perception, decision-making, planning, and control), researchers have recently argued that autonomous vehicles will be safer if they can precisely predict the future locations of their surrounding vehicles [li2019grip]. To this end, many trajectory prediction methods have been proposed, which fall into three categories: physics-based [ammoun2009real], maneuver-based [hermes2009long, laugier2011probabilistic], and interaction-aware methods [deo2018convolutional, li2019grip, mo2020recog, zhao2020tnt, mo2021heterogeneous]. More about this taxonomy can be found in [lefevre2014survey]. However, trajectory prediction is challenging because driving is a complex interactive behavior [xing2021toward], where the motion of a vehicle is affected not only by its driving style but also by its surrounding vehicles, and the number of surrounding vehicles varies across traffic situations.

Thanks to the availability of many real-world driving datasets [ushighway101, interactiondataset] and the success of neural networks, data-driven interaction-aware methods have dominated the field of trajectory prediction in recent years. Most of these methods jointly consider temporal and spatial features [deo2018convolutional, zhao2019multi, mo2020recog]. Convolutional social pooling (CS-LSTM) [deo2018convolutional] applies a long short-term memory network (LSTM) [hochreiter1997long] to individual vehicles’ past tracks to extract their dynamics, then aligns these dynamics into a target-centered occupancy grid to represent the spatial interaction. A CNN is then used to extract the interaction feature from the grid. The performance of CS-LSTM can be affected by the size of the occupancy grid: it ignores a vehicle that is aggressively approaching the target vehicle but is still outside the grid. The authors of [mo2020interaction] proposed to consider the eight closest surrounding vehicles, which have the most impact on the target vehicle’s behavior, rather than many vehicles in an occupancy grid. However, requiring exactly eight neighboring vehicles prevents their model from being applied to situations where the number of surrounding vehicles varies.

Representing inter-vehicular interaction as a graph and applying graph neural network algorithms to model the interaction has attracted great interest in the past two years [diehl2019graph, li2019grip, zhao2020tnt, mo2020recog]. The authors of [diehl2019graph] showed, as a proof of concept, that modeling a traffic scene as a graph to utilize the power of GNNs increases prediction quality on a more interactive highway dataset. They used only current information in their model and suggested integrating recurrent neural networks with GNNs in future work. GRIP [li2019grip] designed several graph convolutional blocks to extract the interaction feature, which is then fed to an LSTM-based Encoder-Decoder to predict future trajectories. GRIP treats all the nodes equally when predicting a target vehicle’s trajectory, which fails to emphasize the effects of the target vehicle’s own dynamics. In addition, GRIP cannot accommodate state-of-the-art GNNs for interaction modeling and therefore cannot take advantage of advances in GNNs such as attention mechanisms. ReCoG [mo2020recog] modeled the relationships among vehicles and infrastructure as a heterogeneous graph and adopted state-of-the-art GNNs to extract the interaction feature. ReCoG focused on single-vehicle trajectory prediction for urban driving, where the road structure affects vehicles’ trajectories significantly.

Inspired by [diehl2019graph], this work improves the CNN-LSTM-based trajectory prediction method proposed in [mo2020interaction] by integrating RNNs and GNNs to handle situations with a varying number of surrounding vehicles, and investigates the potential of graph modeling for multi-vehicular trajectory prediction. The proposed model uses RNNs to extract the dynamics features of all vehicles, then applies a GNN on a star-like directed graph to summarize the inter-vehicular interaction. In this graph, a node corresponding to a vehicle contains its sequential feature, and an edge from one node to another implies that the latter’s behavior is affected by the former. Finally, an RNN decoder is applied to the combination of the target vehicle’s dynamics feature and its interaction feature for single-vehicle trajectory prediction.

Fig. 1: Illustration of the proposed model in this study. RNNs with shared weights are used to encode the dynamics features of vehicles individually. A GNN-based interaction encoder is applied to these dynamics features, which are contained in the corresponding nodes of a directed interaction graph, to summarize the inter-vehicular interaction feature. Finally, an LSTM decoder predicts the trajectory by jointly considering the target vehicle’s dynamics and interaction features.

The main contributions of this work can be summarized as follows:

  • A Graph-based interaction-aware trajectory prediction method is proposed.

  • Ablative studies are conducted to show the necessity to jointly consider individual dynamics and interaction features.

  • The potential of the proposed method to be applied to multi-vehicular trajectory prediction is investigated.

The rest of this paper is organized as follows. Sec. II details the proposed method. Sec. III describes the experimental settings. Sec. IV evaluates the proposed model on the single trajectory prediction task and investigates its potential for multi-vehicular trajectory prediction. Finally, Sec. V concludes this paper and points out future directions.

II Method

This section formulates the trajectory prediction problem and proposes a two-channel Encoder-Decoder structure for it, which consists of a history encoder, an interaction encoder, and a future decoder.

II-A Problem Formulation

This work aims to predict the future trajectory of a target vehicle driving on a highway given the historical trajectories of the target vehicle and its up-to-eight surrounding vehicles. As shown in Fig. 1, this task considers two kinds of vehicles: the target vehicle and its neighboring vehicles.

The neighboring vehicles considered are the target vehicle’s preceding (#1) and following (#2) vehicles, its nearest neighbors in the adjacent lanes in terms of longitudinal distance (#3 and #4), and their preceding (#5 and #7) and following (#6 and #8) vehicles.
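The neighbor numbering above can be sketched in a few lines of Python. This is only an illustrative sketch, not the released pre-processing code: the `Vehicle` class, its field names, and the convention that #3 lies in the lane with the smaller index and #4 in the lane with the larger index are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Vehicle:
    vid: int    # vehicle identifier (hypothetical field name)
    lane: int   # lane index; adjacent lanes are assumed to differ by 1
    y: float    # longitudinal position along the road


def preceding(v: Vehicle, others: List[Vehicle]) -> Optional[Vehicle]:
    """Closest vehicle ahead of v in the same lane, or None."""
    ahead = [o for o in others if o.lane == v.lane and o.y > v.y]
    return min(ahead, key=lambda o: o.y - v.y, default=None)


def following(v: Vehicle, others: List[Vehicle]) -> Optional[Vehicle]:
    """Closest vehicle behind v in the same lane, or None."""
    behind = [o for o in others if o.lane == v.lane and o.y < v.y]
    return min(behind, key=lambda o: v.y - o.y, default=None)


def nearest_in_lane(target: Vehicle, lane: int,
                    others: List[Vehicle]) -> Optional[Vehicle]:
    """Vehicle in `lane` with the smallest longitudinal distance to target."""
    in_lane = [o for o in others if o.lane == lane]
    return min(in_lane, key=lambda o: abs(o.y - target.y), default=None)


def select_neighbors(target: Vehicle,
                     vehicles: List[Vehicle]) -> Dict[int, Vehicle]:
    """Map slot numbers 1..8 to the neighbors that exist in the scene."""
    others = [v for v in vehicles if v.vid != target.vid]
    slots = {1: preceding(target, others), 2: following(target, others)}
    # (side slot, lane, slot of its preceding vehicle, slot of its following vehicle)
    layout = ((3, target.lane - 1, 5, 6), (4, target.lane + 1, 7, 8))
    for side_slot, lane, pre_slot, fol_slot in layout:
        side = nearest_in_lane(target, lane, others)
        slots[side_slot] = side
        slots[pre_slot] = preceding(side, others) if side else None
        slots[fol_slot] = following(side, others) if side else None
    # Missing neighbors are simply dropped, so the result has 0 to 8 entries.
    return {k: v for k, v in sorted(slots.items()) if v is not None}
```

Because empty slots are dropped rather than padded, the number of returned neighbors varies with the traffic situation, which is the case the graph-based interaction encoder is meant to handle.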

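The input to the model ($X_t$) is a set of historical trajectories of all considered vehicles, including the target vehicle:

$$X_t = [H_t^0, H_t^1, \dots, H_t^n] \tag{1}$$

where $H_t^i = [(x_{t-T_h+1}^i, y_{t-T_h+1}^i), \dots, (x_t^i, y_t^i)]$ represents the sequence of historical trajectory of vehicle $i$ at time $t$, and $T_h$ is the traceback horizon. Without loss of generality, this work numbers the target vehicle as $0$ and the neighboring vehicles from $1$ to $n$.

The output is the predicted future trajectory of the target vehicle at time $t$:

$$\hat{F}_t = [(\hat{x}_{t+1}^0, \hat{y}_{t+1}^0), \dots, (\hat{x}_{t+T_f}^0, \hat{y}_{t+T_f}^0)] \tag{2}$$

where $T_f$ is the prediction horizon.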

II-B Model Structure

To solve the single trajectory prediction problem, this work proposes a GNN-RNN based model, which is designed under the Encoder-Decoder structure and consists of two encoders (history encoder, interaction encoder) and one decoder (future decoder). The history encoder, implemented with an RNN, extracts an individual vehicle’s dynamics from its historical trajectory. The interaction encoder uses a GNN to summarize interaction features among a variable number of vehicles. Then the future decoder uses another RNN to roll out the future trajectory of the target vehicle. Details of these main parts of the proposed model are described below.

II-B1 History Encoder

The history encoder is shared across all vehicles to encode individual dynamics from their own historical trajectories. Eq. 3 shows that the encoder is applied to the historical tracks of all vehicles in parallel:

$$R_t^i = \mathrm{RNN}_{hist}(\mathrm{Emb}(H_t^i)), \quad i = 0, 1, \dots, n \tag{3}$$

where $H_t^i$ is the historical track of vehicle $i$, $\mathrm{Emb}(\cdot)$ is a linear transformation embedding the low-dimensional xy-coordinates into a high-dimensional vector space, $\mathrm{RNN}_{hist}$ is a shared RNN applied to the embedded historical tracks of all vehicles, and $R_t^i$ is the dynamics feature of vehicle $i$ at time $t$.

II-B2 Interaction Encoder

Considering that driving is an interactive activity and that the influence of one car on another generally differs from the influence in the opposite direction, this method models the inter-vehicular interaction as a directed graph, where each node represents a vehicle and contains the vehicle’s sequential feature.

Definition 1 (Directed Graph)

A graph can be represented by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes, and $\mathcal{E}$ is the set of edges. If the edge from node $i$ to node $j$ is different from the edge from node $j$ to node $i$, the graph is a directed graph.

Since this work models the interaction among vehicles as a graph, the structure of the graph significantly affects the performance and efficiency of the method [diehl2019graph]. If the graph contains only self-connections, its performance should be similar to that of a simple model working on the target vehicle’s historical track only. If, on the other hand, the graph contains all connections (every node is connected to all other nodes), it includes redundant connections, whose number increases quadratically with the number of nodes. This work considers up-to-eight neighboring vehicles and constructs the interaction graph as a star-like graph.

Graph Construction. Without loss of generality, this work sets the target vehicle as $v_0$, and all the neighboring vehicles as $v_1, \dots, v_n$. Then the edge set of the star-like graph with self-loop is constructed:

$$\mathcal{E} = \{e_{i0} \mid i = 0, 1, \dots, n\} \tag{4}$$

where $e_{ij}$ means that there is a directed edge from node $i$ to node $j$, that is, node $i$ is the neighbor of node $j$ and node $i$’s behavior will affect node $j$’s behavior. An example of the star-like directed graph with self-loop can be found in Fig. 1.
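As a minimal sketch of the graph construction, the star-like edge set can be written out as an edge list. The function names are hypothetical, the two-row `[sources, destinations]` layout is an assumption about the storage format, and the sketch assumes that the only self-loop is the one on the target node.

```python
from typing import List


def star_edge_index(num_neighbors: int) -> List[List[int]]:
    """Edges i -> 0 for the target's self-loop (i = 0) and each neighbor (i >= 1).

    Returns [sources, destinations]; node 0 is the target vehicle.
    """
    if num_neighbors < 0:
        raise ValueError("num_neighbors must be non-negative")
    sources = list(range(num_neighbors + 1))
    destinations = [0] * (num_neighbors + 1)
    return [sources, destinations]


def fully_connected_edge_count(num_nodes: int) -> int:
    """Directed edges between all distinct node pairs, for comparison."""
    return num_nodes * (num_nodes - 1)
```

With eight neighbors the star-like graph has 9 edges, whereas connecting all 9 nodes pairwise would need 72 directed edges, which illustrates the quadratic growth mentioned above.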

Nodes in the constructed graph contain the corresponding vehicles’ sequential features, and directed edges represent their directed effects on others. The graph is then processed by a graph neural network to model the interaction feature, as shown in Eq. 5:

$$G_t = \mathrm{GNN}_{inter}(R_t, \mathcal{E}_t) \tag{5}$$

where $R_t = [R_t^0, \dots, R_t^n]$ collects the dynamics features stored in the nodes, $\mathcal{E}_t$ represents the graph structure at time $t$, $\mathrm{GNN}_{inter}$ is the interaction encoder implemented with a 2-layer GNN, and $G_t$ contains the interaction features of all vehicles at time $t$.

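II-B3 Future Decoder

The future trajectory is predicted from the target vehicle’s dynamics feature and interaction feature using another RNN:

$$\hat{F}_t = \mathrm{RNN}_{fut}([R_t^0, G_t^0]) \tag{6}$$

where $\mathrm{RNN}_{fut}$ is the future decoder implemented with an RNN and $[R_t^0, G_t^0]$ is the concatenation of the target vehicle’s dynamics feature $R_t^0$ and its interaction feature $G_t^0$.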

The model also uses additional fully connected layers, which are not shown in the equations. Further details can be found in Sec. III-C and the released code.

III Experimental Setup

The experimental setup covers data pre-processing, model implementation, and metric settings.

III-A Dataset

This work uses vehicle trajectories extracted from the publicly available NGSIM US-101 [ushighway101] dataset, collected from 7:50 a.m. to 8:35 a.m. on June 15, 2005, for training and validation. The study area is a 640-meter segment of U.S. Highway 101, consisting of five main lanes, one auxiliary lane, and on-ramp and off-ramp lanes. The vehicle trajectory data are recorded at 10 Hz using eight synchronized digital video cameras mounted on top of a 36-story building. This work selects roughly balanced data so that the lane-keeping trajectories do not dominate the dataset.

III-B Data Pre-processing

This work first selects target vehicles and then selects data pieces from their trajectories.

III-B1 Target Vehicles Selection

A vehicle is selected as a target vehicle if it meets the following conditions:

  • It has not been driven in lane 7 (on-ramp) or lane 8 (off-ramp).

  • It changed lanes only once during the recording time.

  • Its recorded track is at least 1,000 feet in length.

  • Its lane-change maneuver happened within the range from 300 to 1,900 feet in the study area.

  • Its lane-change maneuver was pronounced, i.e., the maximum lateral displacement before and after the lane change is greater than 10 feet.

This step finally selects 124 (out of 1,993) vehicles from the 07:50am-08:05am segment, 106 (out of 1,533) vehicles from the 08:05am-08:20am segment, and 68 (out of 1,298) vehicles from the 08:20am-08:35am segment.
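The five selection conditions can be sketched as a single filter. This is a hedged illustration rather than the authors' script: the `Track` class and its field names are hypothetical, track length is read as the longitudinal extent of the track, and the lateral displacement is read as the spread of the lateral position over the whole track.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Track:
    lanes: List[int]       # lane id per frame (7 = on-ramp, 8 = off-ramp)
    local_x: List[float]   # lateral position per frame, in feet
    local_y: List[float]   # longitudinal position per frame, in feet


def lane_change_frames(track: Track) -> List[int]:
    """Frame indices at which the lane id differs from the previous frame."""
    return [k for k in range(1, len(track.lanes))
            if track.lanes[k] != track.lanes[k - 1]]


def is_target_candidate(track: Track) -> bool:
    # Condition 1: never driven in the on-ramp or off-ramp lanes.
    if any(lane in (7, 8) for lane in track.lanes):
        return False
    # Condition 2: exactly one lane change during the recording.
    changes = lane_change_frames(track)
    if len(changes) != 1:
        return False
    # Condition 3: track covers at least 1,000 feet (assumed: longitudinal extent).
    if track.local_y[-1] - track.local_y[0] < 1000.0:
        return False
    # Condition 4: lane change happens between 300 and 1,900 feet.
    if not 300.0 <= track.local_y[changes[0]] <= 1900.0:
        return False
    # Condition 5: lateral displacement greater than 10 feet (assumed: max - min).
    return max(track.local_x) - min(track.local_x) > 10.0
```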

III-B2 Data Selection

For a target vehicle, the 260 frames from 13 seconds (130 frames) before the lane change to 13 seconds (130 frames) after the lane change are considered as candidates for the current frame (time $t$ in Eq. 1). A data piece is then stored in the dataset if the following conditions are all satisfied:

  • The target vehicle has a 3-second historical trajectory and a 5-second future trajectory.

  • All neighboring vehicles have a 3-second historical trajectory.

This step selects 63,176 data pieces in total, with 23,803 from the 07:50am-08:05am segment, 24,559 from the 08:05am-08:20am segment, and 14,814 from the 08:20am-08:35am segment.
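The frame arithmetic behind this step can be sketched as follows, assuming the 10 Hz frame rate of the raw data. The function names and the representation of each vehicle by its first and last recorded frame are hypothetical; whether the current frame counts as part of the 3-second history is also an assumption.

```python
from typing import Iterable, Tuple

FPS = 10                  # raw NGSIM sampling rate in frames per second
HIST_FRAMES = 3 * FPS     # 3-second history
FUT_FRAMES = 5 * FPS      # 5-second future
HALF_WINDOW = 13 * FPS    # 13 seconds on each side of the lane change

Span = Tuple[int, int]    # (first recorded frame, last recorded frame)


def candidate_frames(lane_change_frame: int) -> range:
    """The 260 candidate current frames around a lane change."""
    return range(lane_change_frame - HALF_WINDOW,
                 lane_change_frame + HALF_WINDOW)


def is_valid_sample(t: int, target_span: Span,
                    neighbor_spans: Iterable[Span]) -> bool:
    """Check the history and future requirements for current frame t."""
    first, last = target_span
    # Target needs 3 s of history before t and 5 s of future after t.
    if t - HIST_FRAMES < first or t + FUT_FRAMES > last:
        return False
    # Every neighbor needs 3 s of history up to and including t.
    return all(n_first <= t - HIST_FRAMES and t <= n_last
               for n_first, n_last in neighbor_spans)
```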

Translation. A stationary frame of reference with its origin fixed at the target vehicle’s current position is used for each data piece.

Down-sampling. The raw data in NGSIM US-101 is recorded with a sampling rate of 10 Hz. This work down-samples the historical tracks by a factor of 2 and the future trajectories by a factor of 5.

Edge indexes. The edge set representing the graph structure is constructed as described in SubSec. II-B2.

Data format. A data piece with three parts is stored in the dataset:

$$D_t = (X_t, \mathcal{E}_t, F_t) \tag{7}$$

where $X_t$ is the historical tracks of all vehicles, $\mathcal{E}_t$ is the edge set containing the structure of the interaction graph, and $F_t$ is the target vehicle’s ground truth future trajectory.
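The translation and down-sampling steps can be illustrated with plain Python lists of (x, y) points. This is a sketch under stated assumptions: the function names are hypothetical, and which samples survive the down-sampling (here, the current position for the history and every fifth point for the future) is a guess, since the text only gives the factors.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def translate(track: List[Point], origin: Point) -> List[Point]:
    """Shift a track so that `origin` (the target's current position) is (0, 0)."""
    ox, oy = origin
    return [(x - ox, y - oy) for x, y in track]


def downsample_history(track: List[Point], factor: int = 2) -> List[Point]:
    """Keep every `factor`-th point counting back from the most recent one."""
    return track[::-1][::factor][::-1]


def downsample_future(track: List[Point], factor: int = 5) -> List[Point]:
    """Keep every `factor`-th future point (0.5 s, 1.0 s, ... at 10 Hz)."""
    return track[factor - 1::factor]
```

Under these assumptions a 3-second history of 30 frames becomes 15 points, and a 5-second future of 50 frames becomes 10 points.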

After the above processing, this work randomly selects 10,000 data pieces from the whole dataset as the validation set and uses the rest of the dataset for training.

III-C Implementation Details

All the models in this work are implemented with PyTorch [NEURIPS2019_9015] except the GNN layers, which are implemented with PyTorch Geometric [Fey/Lenssen/2019]. The history encoder is implemented using a one-layer Gated Recurrent Unit (GRU) [chung2014empirical] with a 32-dimensional hidden state, and the future decoder is implemented using a two-layer LSTM with a 64-dimensional hidden state. The interaction encoder is implemented with two Graph Attention Network (GAT) [velivckovic2017graph] layers, which adopt a concatenated three-head attention mechanism to stabilize the training process. This work uses LeakyReLU with a 0.1 negative slope as the only activation function.

The proposed model is trained for 50 epochs to minimize the same loss function as described in [mo2020interaction] using Adam [kingma2014adam] with a learning rate of 0.001. The full implementation of the proposed model can be found in the released code.
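To give an intuition for how an attention-based interaction encoder copes with a variable number of neighbors, the sketch below computes a single attention head at the target node of the star-like graph in plain Python. It is not the released implementation, which relies on PyTorch Geometric with two layers and three concatenated heads. The function names are hypothetical, the node features are assumed to be already linearly projected, and the 0.2 negative slope inside the attention score follows the original GAT formulation rather than anything stated in this paper.

```python
import math
from typing import List, Sequence, Tuple


def leaky_relu(x: float, negative_slope: float = 0.2) -> float:
    return x if x >= 0 else negative_slope * x


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def attend_to_target(features: List[List[float]],
                     a_src: Sequence[float],
                     a_dst: Sequence[float],
                     negative_slope: float = 0.2
                     ) -> Tuple[List[float], List[float]]:
    """Single-head GAT-style aggregation at node 0 of a star-like graph.

    features[0] is the target node (self-loop), features[1:] are neighbors.
    Returns the attention weights and the aggregated feature of the target.
    """
    target = features[0]
    # Unnormalized score of each incoming edge j -> 0 (including 0 -> 0).
    scores = [leaky_relu(dot(a_src, h) + dot(a_dst, target), negative_slope)
              for h in features]
    # Numerically stable softmax over however many incoming edges exist.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the source features.
    aggregated = [sum(w * h[d] for w, h in zip(weights, features))
                  for d in range(len(target))]
    return weights, aggregated
```

Because the softmax is taken over whatever edges are present, the same parameters apply to a target with zero neighbors or with eight.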

III-D Metrics

This work uses the root-mean-square error (RMSE), in meters, of the predicted trajectories against the ground truth future trajectories to evaluate different models. RMSE is calculated for each predictive time step within 5 seconds in the future. Previous works [mo2020interaction, deo2018convolutional, jeon2020scale] also adopt this metric.

$$\mathrm{RMSE}_t = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left[(\hat{x}_t^k - x_t^k)^2 + (\hat{y}_t^k - y_t^k)^2\right]} \tag{8}$$

where $N$ is the size of the test set, and $(\hat{x}_t^k, \hat{y}_t^k)$ and $(x_t^k, y_t^k)$ are the predicted position of the target vehicle in data piece $k$ at time $t$ and the corresponding ground truth, respectively.
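A per-step RMSE of this form can be computed as in the following sketch. The function name and the nested-list layout (`predictions[k][t]` is the (x, y) position of data piece k at prediction step t) are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rmse_per_step(predictions: Sequence[Sequence[Point]],
                  ground_truths: Sequence[Sequence[Point]]) -> List[float]:
    """RMSE over all data pieces, computed separately for each time step."""
    n = len(predictions)
    steps = len(predictions[0])
    result = []
    for t in range(steps):
        squared = sum(
            (predictions[k][t][0] - ground_truths[k][t][0]) ** 2
            + (predictions[k][t][1] - ground_truths[k][t][1]) ** 2
            for k in range(n)
        )
        result.append(math.sqrt(squared / n))
    return result
```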

IV Results and Discussion

This section compares the proposed two-channel model with its ablations and previous works on the single trajectory prediction task, followed by an investigation of its potential for multiple trajectory prediction.

IV-A Single Trajectory Prediction (STP)

The following methods are implemented for comparison:

  • Dynamics-only: this is the one-channel ablation of the proposed model considering the target vehicle’s dynamics feature only for prediction.

  • Interaction-only: this is the other one-channel ablation using the interaction feature extracted by the GNN only.

  • Two-channel: this is the proposed two-channel model.

The above implementations are trained and validated using the same dataset.

Results reported in some related works are also listed in Tab. I. However, this work focuses on comparing results between the proposed method and its ablations, considering that different works are using different training and validation datasets.

Fig. 2: Box plots of the RMSE of implemented models. R is the dynamics-only model, G the interaction-only model, and GR the proposed two-channel model.
No.  Method                              1 sec   2 sec   3 sec   4 sec   5 sec
1    Dynamics-only (Ours)                0.74    1.86    3.30    5.07    7.11
2    Interaction-only (Ours)             0.67    1.03    1.34    1.74    2.46
3    Two-channel (Ours)                  0.68    0.99    1.21    1.53    2.14
4    CS-LSTM [deo2018convolutional]      0.61    1.27    2.09    3.10    4.37
5    GRIP [li2019grip]                   0.37    0.86    1.45    2.21    3.16
6    CNN-LSTM [mo2020interaction]        0.64    0.96    1.22    1.53    2.09
TABLE I: Prediction performance comparison (RMSE in meters at each prediction horizon)

Tab. I compares different models. It shows that:

  • Interaction-aware methods (2,3,4,5,6) outperform the dynamics-only method (1). This demonstrates the necessity of modeling interactions for trajectory prediction as stated in previous works [deo2018convolutional, mo2020interaction].

  • The proposed two-channel model outperforms its interaction-only ablation at horizons of 2 seconds and beyond (at 1 second the two are nearly identical, 0.68 vs. 0.67). This shows that the target vehicle’s dynamics feature should be emphasized in some way for trajectory prediction. This work sets an additional channel for it.

  • The proposed method matches the CNN-LSTM method while additionally handling a variable number of surrounding agents and offering the potential for multi-trajectory prediction.

  • The proposed method outperforms GRIP and CS-LSTM in longer-term prediction (3-5 sec). However, for short-term prediction, GRIP shows better performance, possibly because GRIP uses the whole dataset from NGSIM, where the lane-keeping trajectories are dominant and less challenging for trajectory prediction.
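The comparisons above can be quantified directly from the values in Tab. I. The short sketch below computes the relative RMSE reduction of the two-channel model with respect to each ablation; the dictionary and function names are hypothetical, and the numbers are copied from the table.

```python
from typing import List

# RMSE in meters at 1, 2, 3, 4, and 5 seconds, copied from Tab. I.
TABLE_I = {
    "Dynamics-only": [0.74, 1.86, 3.30, 5.07, 7.11],
    "Interaction-only": [0.67, 1.03, 1.34, 1.74, 2.46],
    "Two-channel": [0.68, 0.99, 1.21, 1.53, 2.14],
}


def relative_reduction(baseline: List[float], model: List[float]) -> List[float]:
    """Percentage by which `model` lowers the RMSE of `baseline` per horizon."""
    return [100.0 * (b - m) / b for b, m in zip(baseline, model)]


vs_dynamics = relative_reduction(TABLE_I["Dynamics-only"], TABLE_I["Two-channel"])
vs_interaction = relative_reduction(TABLE_I["Interaction-only"], TABLE_I["Two-channel"])
```

At the 5-second horizon this gives a reduction of about 69.9% relative to the dynamics-only model and about 13.0% relative to the interaction-only model, while the value at 1 second relative to the interaction-only model is slightly negative.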

Fig. 2 shows box plots of the RMSE of the models implemented in this study over a 5-second prediction horizon, where the red boxes are the results of the dynamics-only model (R), the green boxes the results of the interaction-only model (G), and the blue boxes the results of the proposed two-channel model (GR). The triangle in each box represents its mean value. Outliers are omitted for clarity. In addition to Tab. I, Fig. 2 shows that the predictions of the interaction-aware methods (G & GR) are more stable (shorter interquartile range (IQR)) than those of the dynamics-only model (R), and that the proposed two-channel model produces the shortest IQR. Please note that the mean value shown in Fig. 2 is calculated using Eq. 9:

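$$\mathrm{Mean}_t = \frac{1}{N}\sum_{k=1}^{N}\sqrt{(\hat{x}_t^k - x_t^k)^2 + (\hat{y}_t^k - y_t^k)^2} \tag{9}$$

i.e., the average over the $N$ data pieces of the Euclidean distance between the predicted position $(\hat{x}_t^k, \hat{y}_t^k)$ and the ground truth $(x_t^k, y_t^k)$, which is slightly different from the results in Tab. I.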
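One way to see why the two numbers differ, assuming that Eq. 9 averages the per-sample Euclidean errors while Tab. I reports the root of the mean squared error, is the following short derivation. The symbol $d_k$ is introduced here only for illustration.

```latex
% d_k: Euclidean error of data piece k at a fixed prediction step (notation assumed)
\[
  d_k = \sqrt{(\hat{x}^k - x^k)^2 + (\hat{y}^k - y^k)^2}, \qquad k = 1, \dots, N
\]
% Cauchy-Schwarz applied to (d_1, ..., d_N) and (1, ..., 1):
\[
  \Big(\sum_{k=1}^{N} d_k\Big)^2 \le N \sum_{k=1}^{N} d_k^2
  \quad\Longrightarrow\quad
  \frac{1}{N}\sum_{k=1}^{N} d_k \;\le\; \sqrt{\frac{1}{N}\sum_{k=1}^{N} d_k^2}
\]
```

Under that assumption, the mean error is never larger than the RMSE, with equality only when all per-sample errors are identical.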

Fig. 3 visualizes prediction results from the validation set in situations with different numbers of surrounding vehicles. It shows that the proposed model can predict whether the target vehicle is going to keep or change its lane in the next 5 seconds regardless of how many surrounding vehicles are in sight.

Fig. 3: Visualized STP predictions. Squares are the considered vehicles (target vehicle in blue and neighboring vehicles in gray). Gray lines are the vehicles’ historical tracks over the last 3 seconds. The green line is the ground truth (GT) future trajectory of the target vehicle. The blue line is the prediction of the proposed two-channel model (GR). All the vehicles move from left to right.

Even though this work focuses on single trajectory prediction, the proposed model has the potential to be applied to multi-vehicular trajectory prediction since the interaction encoder implemented with GNN processes all nodes simultaneously, see Eq. 5. The following section briefly formulates the problem of multi-vehicular trajectory prediction (MTP) and shows the proposed method’s performance on MTP.

IV-B Multiple Trajectory Prediction (MTP)

From the ego vehicle’s point of view, MTP aims to predict the future trajectories of up-to-eight target vehicles based on the historical tracks of a larger set of vehicles. In this formulation, the considered vehicles are separated into three categories: one ego vehicle, up-to-eight target vehicles, and some other surrounding vehicles. The MTP problem is formulated similarly to Sec. II-A, and the target vehicles are selected in the same way as the neighboring vehicles in Sec. II-A. Please note that this part only investigates the proposed method’s potential in the multi-agent setting, and the only difference from STP is the input and output data.

The input to the model is the historical trajectories of all considered vehicles,

$$X_t = [H_t^0, H_t^1, \dots, H_t^n, \dots, H_t^m] \tag{10}$$

where $H_t^0$ is the ego vehicle’s historical track and $n$ is the number of target vehicles. MTP simultaneously predicts the $n$ target vehicles’ future trajectories, numbered from $1$ to $n$, based on the historical trajectories of all $m+1$ vehicles.

The output is then the predicted future trajectories of the target vehicles:

$$\hat{F}_t = [\hat{F}_t^1, \hat{F}_t^2, \dots, \hat{F}_t^n] \tag{11}$$

where $\hat{F}_t^i$ represents the sequence of future trajectory of vehicle $i$ at time $t$.

The dataset used here is pre-processed from the 08:05am-08:20am segment of NGSIM US-101. The sizes of the training and validation datasets are 533,564 and 133,392, respectively.

No.  Method                    1 sec   2 sec   3 sec   4 sec   5 sec
1    Two-channel (Ours)        0.54    1.12    1.80    2.63    3.67
2    GRIP(ALL) [li2019grip]    0.64    1.13    1.80    2.62    3.60
TABLE II: MTP performance comparison (RMSE in meters at each prediction horizon)

Tab. II compares the proposed method with a previous work GRIP [li2019grip] on the MTP task. It shows that the proposed model, when applied to multi-vehicular trajectory prediction, matches the previous work in terms of RMSE.

Fig. 4: Visualized MTP predictions. The blue square is the ego vehicle and the gray squares represent the rest of the considered vehicles. Only the future trajectories of target vehicles are plotted. Green lines are the ground truth and dashed blue lines are the predicted future trajectories. All the vehicles move from left to right.

Fig. 4 visualizes the prediction results of the proposed model on the MTP task. It can be seen that the proposed method can predict the multiple trajectories longitudinally, while it fails to predict the lane-change maneuver in the next 5 seconds. This can be explained by the imbalance of the MTP dataset: the majority of the future trajectories in the dataset are lane-keeping, and it is hard to obtain a roughly balanced dataset for MTP.

V Conclusions

This work proposes a GNN-RNN-based method for trajectory prediction that models the inter-vehicular interaction among a variable number of vehicles. An RNN is used to capture the dynamics features of vehicles, and a GNN is adopted to summarize the interaction feature. Another RNN serves as the decoder, which jointly considers the dynamics and interaction features for prediction. This work finds that both the target vehicle’s individual dynamics feature and its interaction with other vehicles affect the prediction accuracy. The proposed method matches state-of-the-art methods on the NGSIM dataset in terms of RMSE.

This work can be improved to handle multi-vehicular trajectory prediction properly, which is necessary for the downstream decision-making module of autonomous driving. It can also be extended to consider the multi-modality of driving behaviors.

Acknowledgment

This work was supported in part by A*STAR Grant (No. 1922500046), Singapore, the Alibaba Group through Alibaba Innovative Research (AIR) Program and Alibaba-NTU Singapore Joint Research Institute (JRI) (No. AN-GC-2020-012), A*STAR AME Young Individual Research Grant (No. A2084c0156), and the SUG-NAP Grant, Nanyang Technological University, Singapore.

References