Interaction-Aware Trajectory Prediction of Connected Vehicles using CNN-LSTM Networks

by   Xiaoyu Mo, et al.
Nanyang Technological University

Predicting the future trajectories of surrounding vehicles in congested traffic is one of the basic abilities of an autonomous vehicle. In congestion, a vehicle's future movement is the result of its interaction with surrounding vehicles. A vehicle in congestion may have many neighbors within a relatively short distance, while only a small subset of those neighbors strongly affects its future trajectory. In this work, an interaction-aware method is proposed that predicts the future trajectory of an ego vehicle considering its interaction with eight surrounding vehicles. The dynamics of the vehicles are encoded by LSTMs with shared weights, and the interaction is extracted with a simple CNN. The proposed model is trained and tested on trajectories extracted from the publicly accessible NGSIM US-101 dataset. Quantitative experimental results show that the proposed model outperforms previous models in terms of root-mean-square error (RMSE). Visualization of the results shows that the model can predict a lane-change trajectory before the vehicle makes an obvious lateral movement to initiate the lane change.




I Introduction

Human drivers constantly maintain a rough estimate of their surrounding vehicles' future movements, especially in congested traffic, and keep adjusting their next movement according to personal driving targets and the environment. It is therefore important that an autonomous vehicle be able to predict the future trajectories of its surrounding vehicles when sharing the road with human drivers. With the rich information provided by connected vehicles, this prediction is expected to be more precise than human drivers' estimates, since human drivers can only roughly perceive the positions of vehicles in sight. Connected vehicles, based on reliable vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) wireless communication, are designed to improve the efficiency, response, and reliability of human driving and autonomous vehicles while enhancing traffic safety and mobility. Connected-vehicle technology enables real-time information sharing among surrounding vehicles and the traffic management center [biswas2006vehicle, talebpour2016influence].

Even with enhanced information availability provided by connected vehicles, predicting a vehicle’s future trajectory is challenging because it is affected by many factors, for example, the driver’s orientation [schwarting2019social, xing2019personalized, xing2020energy, xing2020personalized], different driving scenarios, and the interaction among vehicles [deo2018convolutional].

According to the survey [lefevre2014survey], methods for trajectory prediction can be separated into three categories by their degree of abstraction. Physics-based methods, which have the lowest level of abstraction, predict the short-term future trajectory of a vehicle based on its kinematic and dynamic properties [ammoun2009real]. Maneuver-based methods [hermes2009long, aoude2010threat, laugier2011probabilistic, althoff2009model], which take the intention (maneuver) of a driver into consideration, can predict longer-term trajectories than physics-based methods. Interaction-aware approaches [deo2018convolutional, ju2019interaction, zhao2019multi] exploit the fact that the future trajectory of a vehicle is influenced by its surroundings and model this interaction for trajectory prediction. This property enables interaction-aware approaches to predict long-term trajectories more precisely than the other methods.

Fig. 1: Proposed model. LSTMs with shared weights encode the time-series information of each vehicle individually. A CNN-based interaction extractor is then applied to the LSTM encodings, which are placed into a grid according to direction. Finally, an LSTM decoder takes the concatenated interaction and ego dynamics as input and outputs the predicted trajectory of the ego.

Among interaction-aware methods for trajectory prediction, data-driven methods are prevalent because of the availability of traffic datasets [ushighway101, usi80freeway, wang2019apolloscape] and the promising success of neural networks [krizhevsky2012imagenet]. Many of these methods are inspired by Social LSTM [alahi2016social], which uses long short-term memory networks (LSTMs) [hochreiter1997long] to encode the dynamics of agents and models interaction by sharing information among agents within a pre-defined distance. Social LSTM does not exploit the spatial structure of nearby agents and ignores vehicles at longer distances. Convolutional social pooling [deo2018convolutional] considers the spatial structure by aligning the LSTM-encoded dynamics of neighboring vehicles into an ego-centered grid according to their local positions and applies a convolutional neural network (CNN) to extract the interaction among neighboring vehicles, excluding the ego. The final interaction is a combination of the neighbors' interaction and the dynamics of the ego. In congested traffic, this model may redundantly consider vehicles that fall inside the grid but are invisible to the driver. Multi-agent tensor fusion (MATF) [zhao2019multi] takes a pixel-level scene context and the past trajectories of interacting agents as input and predicts the trajectories of all agents in the scene; the spatial structure is retained by spatially aligning individual dynamics to the pixel-level context. SCALE-Net [jeon2020scale] uses an edge-enhanced graph convolutional network (EGCN) [gong2019exploiting] and LSTMs to model inter-vehicle interaction and predicts the trajectories of multiple vehicles. Interaction-aware Kalman Neural Networks (IaKNN) [ju2019interaction] use an encoder-decoder structure to extract interaction-aware accelerations from rich environment observations, which include sequences of accelerations, widths, lengths, relative distances, and computed repulsive interaction forces of the agents in the system.

Inspired by convolutional social pooling [deo2018convolutional] for vehicle trajectory prediction, a method with a similar structure is proposed here; the proposed model handles the interaction differently. As shown in Fig. 1, assuming that all eight neighbors exist and are perceptible to the ego in congestion, the proposed model uses two channels to extract the past dynamics of the ego and its interaction with the eight neighboring vehicles separately. The LSTM-encoded dynamics of all nine vehicles, including the ego, are aligned into an ego-centered grid according to their directions relative to the ego rather than their relative positions. The grid preserves the surrounding vehicles' spatial structure because of the coordinate-system sharing described in sub-Sec. II-A. A 2-layer CNN is designed to extract the interaction feature hierarchically. Finally, the encoded ego dynamics and interaction are concatenated and fed to an LSTM decoder to predict the future trajectory of the ego.
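As a concrete illustration, the two-channel architecture of Fig. 1 can be sketched in PyTorch roughly as follows. The layer sizes follow the implementation details in Sec. IV-B; the class and variable names, the 2×2 kernel size, the vehicle ordering inside the grid, and the way the decoder consumes the context at every step are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Minimal sketch of the two-channel CNN-LSTM model (assumed layout)."""

    def __init__(self, emb=16, enc_h=32, dec_h=64):
        super().__init__()
        self.emb = nn.Linear(2, emb)                      # shared xy embedding
        self.enc = nn.LSTM(emb, enc_h, batch_first=True)  # shared LSTM encoder
        self.fc_ego = nn.Linear(enc_h, 32)                # ego channel
        self.conv = nn.Sequential(                        # interaction channel
            nn.Conv2d(enc_h, 64, 2), nn.LeakyReLU(),      # 3x3 grid -> 2x2 corners
            nn.Conv2d(64, 128, 2), nn.LeakyReLU(),        # 2x2 corners -> 1x1 global
        )
        self.fc_int = nn.Linear(128, 64)
        self.dec = nn.LSTM(32 + 64, dec_h, batch_first=True)
        self.out = nn.Linear(dec_h, 2)                    # xy position per step

    def forward(self, hist, t_f=25):
        # hist: (B, 9, T_h, 2) histories in the shared ego frame;
        # the ego (vehicle No. 5) is assumed to sit at index 4.
        B, V, T, _ = hist.shape
        _, (h, _) = self.enc(self.emb(hist.reshape(B * V, T, 2)))
        h = h[-1].reshape(B, V, -1)                       # (B, 9, enc_h)
        ego = self.fc_ego(h[:, 4])                        # ego dynamics
        grid = h.transpose(1, 2).reshape(B, -1, 3, 3)     # 3x3 direction grid
        inter = self.fc_int(self.conv(grid).flatten(1))   # interaction feature
        ctx = torch.cat([ego, inter], dim=1)              # concatenated context
        seq = ctx.unsqueeze(1).repeat(1, t_f, 1)          # fed at each decode step
        y, _ = self.dec(seq)
        return self.out(y)                                # (B, t_f, 2)
```

With a 3-second history at some sampling interval and a 5-second horizon, a forward pass on a batch of two samples with 15 history steps returns a `(2, 25, 2)` trajectory tensor under these assumptions.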

The proposed model is trained and tested on a dataset extracted from the publicly available NGSIM US-101 dataset [ushighway101]. Evaluation results show that it is capable of predicting vehicle trajectories and outperforms state-of-the-art methods. The major contributions of this work are summarized below:

  • An interaction-aware trajectory prediction model is proposed.

  • A method to select roughly balanced lane-keeping and lane-changing scenarios from the raw NGSIM dataset is implemented.

  • Ablative studies are conducted on the extracted dataset to show the effectiveness of the proposed model.

The remainder of this paper is organized as follows. Sec. II formulates the problem and introduces the model in detail. Sec. III describes how the data is processed. In Sec. IV, the proposed model is evaluated and compared with other models. Finally, this work is concluded in Sec. V.

II Methodology

II-A Problem formulation

In this work, the task is to predict the trajectory of an ego vehicle based on the history trajectories of its surrounding vehicles and its own. As shown in Fig. 1, eight surrounding vehicles and one ego vehicle (No. 5) are considered. For each vehicle, its history is represented by a sequence of xy-coordinates over the past 3 seconds. Past tracks of the ego and its surrounding vehicles are shown as triangles in red and gray, respectively. The ground truth of the future trajectory, which we want to predict as precisely as possible, is shown in green. It is a 5-second trajectory represented by a sequence of points.

The surrounding vehicles considered are the ego's preceding (No. 6) and following (No. 4) vehicles, its nearest neighbors in the adjacent lanes (No. 2 and No. 8) in terms of longitudinal distance, and their preceding (No. 3 and No. 9) and following (No. 1 and No. 7) vehicles. All history trajectories are aligned into a grid according to directions: as in Fig. 1, the history track of vehicle $i$ is allocated to cell $i$ of the grid.

The input to the model is the history trajectories:

$$X = [X_1, X_2, \dots, X_9], \tag{1}$$

where

$$X_i = [(x_i^{t-T_h+1}, y_i^{t-T_h+1}), \dots, (x_i^{t}, y_i^{t})] \tag{2}$$

is the track history of vehicle $i$ at time $t$. Each history is represented by xy-coordinates with time interval $\Delta t$.

The output is the future trajectory of the ego vehicle:

$$Y = [(x_5^{t+1}, y_5^{t+1}), \dots, (x_5^{t+T_f}, y_5^{t+T_f})]. \tag{3}$$

The ego's future trajectory is represented by xy-coordinates with time interval $\Delta t$.

It is worth noting that all trajectories share the same coordinate system, whose origin is fixed at the position of the ego at time $t$. With this setting, the trajectories of the surrounding vehicles imply their relative positions with respect to the ego vehicle.
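This coordinate-system sharing amounts to a single translation of every track by the ego's current position. A minimal NumPy sketch, where the array layout (nine vehicles stacked along the first axis, ego at index 4) is our assumption:

```python
import numpy as np

def to_ego_frame(histories, ego_index=4):
    """Shift all history tracks into a shared frame whose origin is the
    ego's position at the current time step.

    histories: array of shape (9, T_h, 2) -- xy tracks of the 9 vehicles;
               the ego (vehicle No. 5) is assumed to sit at index 4.
    Returns an array of the same shape expressed relative to the ego's
    last observed position, so each surrounding track implicitly encodes
    its relative position with respect to the ego.
    """
    origin = histories[ego_index, -1]   # ego position at time t
    return histories - origin           # broadcast over all tracks and steps
```

After this step the ego's last point is exactly the origin, which is what lets the direction-based grid in Sec. II-B retain spatial structure.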

II-B Model structure

The proposed model consists of two channels, one for the ego and the other for the interaction between the ego and its surrounding vehicles. The individual and interaction information extracted from these two channels is then concatenated and taken as input by an LSTM decoder to predict the future trajectory of the ego vehicle, as shown in Fig. 1.

II-B1 LSTM encoder

An LSTM encoder is used for each vehicle to capture its individual past sequential features. All vehicles share the same LSTM encoder. Eq. 4 shows the LSTM encoder applied to the history of the ego, $X_5$:

$$E = f_{ego}(\mathrm{LSTM}(\mathrm{Emb}(X_5))), \tag{4}$$

where $\mathrm{Emb}(\cdot)$ is a shared function embedding xy-coordinates into a higher-dimensional space, $\mathrm{LSTM}(\cdot)$ is the shared LSTM encoder used in the proposed model, and $f_{ego}$ is a fully connected layer for the ego. In this channel, the ego's history track is embedded before being sent to the LSTM. The LSTM-encoded feature is finally processed by the fully connected layer to give the final representation of the ego's dynamics, $E$.

II-B2 Interaction extractor

The individual sequential features extracted with the LSTMs should be jointly analyzed in order to capture the interdependence among vehicles. Social pooling [alahi2016social] addresses this issue by sharing information between spatially nearby LSTMs through a social pooling layer at each time step. In this setting, all agents within a certain distance are considered equally, without exploiting the spatial structure. Convolutional social pooling [deo2018convolutional] defines an ego-centered grid, which is populated with the individual dynamics of surrounding vehicles according to their locations relative to the ego; a 2-layer convolutional network is then used to extract the interaction among surrounding vehicles, taking the spatial information into account. If the grid is populated densely, it includes many vehicles that cannot be perceived by the ego's on-board sensors and have no direct influence on the ego. In this work, the eight surrounding vehicles are placed into an ego-centered grid according to their directions, rather than positions, relative to the ego. A 2-layer CNN is then applied to the grid to extract the interaction among these vehicles, without introducing the many vehicles that have negligible impact on the ego in congestion. The proposed grid appears to discard the detailed relative positions of the surrounding vehicles; however, as stated at the end of sub-Sec. II-A, this spatial structure is inherently encoded by the LSTMs because of coordinate-system sharing.

In the first convolutional layer, 2×2 kernels are used to extract the interaction at the four corners of the 3×3 grid: upper left, upper right, lower left, and lower right. Each corner covers four vehicles, including the ego. The second convolutional layer also uses 2×2 kernels to obtain the interaction among all vehicles by combining the interaction from the four corners. The combined interaction is finally processed by a fully connected layer to give the final representation of the inter-vehicle interaction.
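To see why a 2×2 kernel with no padding yields exactly the four corner interactions (the 2×2 kernel size is our inference from each corner covering four of the nine grid cells), here is a plain NumPy sketch of one "valid" convolution step; the weights below are random stand-ins, not trained parameters:

```python
import numpy as np

def corner_conv(grid, weights):
    """One 'valid' 2x2 convolution over a square feature grid.

    grid:    (H, W, C_in) feature map, e.g. the 3x3 grid of LSTM encodings.
    weights: (2, 2, C_in, C_out) kernel.
    Returns a (H-1, W-1, C_out) map: on a 3x3 input this is a 2x2 output,
    one cell per corner (each corner mixing the ego with three neighbors);
    a second pass collapses the four corners into a single 1x1 feature.
    """
    H, W, C_in = grid.shape
    C_out = weights.shape[-1]
    out = np.zeros((H - 1, W - 1, C_out))
    for i in range(H - 1):
        for j in range(W - 1):
            patch = grid[i:i + 2, j:j + 2, :]          # 2x2 neighborhood
            out[i, j] = np.tensordot(patch, weights, axes=3)
    return out
```

Applying `corner_conv` twice reproduces the shape pipeline of the proposed extractor: 3×3 → 2×2 (corners) → 1×1 (global interaction).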

The proposed interaction extractor can be described by Eq. 5 and Eq. 6. In Eq. 5, the history tracks of all vehicles are encoded and then aligned ($f_{align}$) into a grid as shown in Fig. 1:

$$G = f_{align}(\mathrm{LSTM}(\mathrm{Emb}(X_1)), \dots, \mathrm{LSTM}(\mathrm{Emb}(X_9))). \tag{5}$$

The grid $G$ is then taken as input by the proposed 2-layer CNN ($f_{CNN}$), and a fully connected layer $f_{int}$ summarizes the final representation of the interaction, $I$:

$$I = f_{int}(f_{CNN}(G)). \tag{6}$$
II-B3 LSTM decoder

Finally, an LSTM decoder ($f_{dec}$) is used to generate the predicted future trajectory of the ego vehicle. It takes as input the concatenation of the interaction $I$ and the ego's individual dynamics $E$ extracted by the previous modules:

$$\hat{Y} = f_{dec}([E; I]). \tag{7}$$
III Data processing

III-A Dataset

The data used for training and testing is extracted from the raw trajectories in NGSIM US-101 [ushighway101], which consists of vehicle trajectories on a segment of U.S. Highway 101 approximately 2,100 feet in length. The trajectories were collected at 10 Hz between 7:50 a.m. and 8:35 a.m. on June 15, 2005. The study area includes five main lanes (lanes 1 to 5), one auxiliary lane (lane 6), one on-ramp lane (lane 7), and one off-ramp lane (lane 8). The 298 vehicles that changed lanes exactly once within the study area are selected as ego vehicles in this work. Trajectory segments before and after the lane change are then selected as data pieces, so that the resulting dataset includes roughly balanced lane-keeping and lane-changing scenarios.

III-B Data selection

The data used in this work is selected in two steps.

III-B1 Lane-change vehicle selection

Vehicles satisfying the following conditions are selected as ego vehicles:

  • It has only been driving in lanes 1, 2, 3, and 4.

  • Its lane ID has changed only once throughout the study area.

  • The length of its trajectory is longer than 1,000 feet.

  • Its longitudinal position, when changing lane, is within the range from 300 feet to 1,900 feet.

  • The maximum lateral divergence of its trajectory from 6 seconds before lane change to 6 seconds after lane change is greater than 10 feet.
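The five conditions above can be checked mechanically per vehicle. The sketch below is our illustration, not the authors' code: the argument layout (per-frame lane IDs and positions in feet, plus the frame index of the lane change) is hypothetical, and "length of its trajectory" is approximated by longitudinal displacement.

```python
def is_lane_change_ego(lane_ids, long_pos, lat_pos, lc_index, dt=0.1):
    """Check the ego-selection conditions of Sec. III-B1 (illustrative sketch).

    lane_ids: per-frame lane IDs; long_pos / lat_pos: per-frame longitudinal
    and lateral positions in feet; lc_index: frame index of the lane change;
    dt: frame interval in seconds (0.1 s matches the 10 Hz NGSIM recording).
    """
    if not set(lane_ids) <= {1, 2, 3, 4}:            # only lanes 1-4
        return False
    changes = sum(1 for a, b in zip(lane_ids, lane_ids[1:]) if a != b)
    if changes != 1:                                 # exactly one lane change
        return False
    if long_pos[-1] - long_pos[0] <= 1000:           # trajectory > 1,000 ft
        return False
    if not (300 <= long_pos[lc_index] <= 1900):      # change within 300-1,900 ft
        return False
    w = int(6.0 / dt)                                # +/- 6 s window
    window = lat_pos[max(0, lc_index - w): lc_index + w + 1]
    return max(window) - min(window) > 10            # lateral divergence > 10 ft
```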

III-B2 Data piece selection

For each vehicle selected in the first step, 260 frames, from 13 seconds before the lane change to 13 seconds after it, are considered as candidates for the current frame (time $t$ in Eq. 1), which is the boundary between the history and future trajectories. A data piece is selected if it meets the following conditions:

  • At time $t$, the ego has all 8 surrounding vehicles.

  • The ego has a complete future trajectory with a duration of 5 seconds.

  • Each vehicle has a complete history with a duration of 3 seconds.
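These three conditions can be expressed as a simple predicate over a candidate frame. The sketch below is ours, with a hypothetical function name and data layout (lists of xy points per vehicle); the 0.1-second frame interval matches the 10 Hz NGSIM recording.

```python
def is_valid_piece(ego_hist, nbr_hists, ego_fut, dt=0.1):
    """Check the data-piece conditions of Sec. III-B2 (illustrative sketch).

    ego_hist / each entry of nbr_hists: list of xy points (3 s history);
    ego_fut: list of xy points (5 s future); dt: frame interval in seconds.
    """
    steps_hist = int(3.0 / dt)                  # frames in a 3 s history
    steps_fut = int(5.0 / dt)                   # frames in a 5 s future
    if len(nbr_hists) != 8:                     # all 8 neighbors present
        return False
    if len(ego_fut) < steps_fut:                # complete 5 s future
        return False
    tracks = [ego_hist] + list(nbr_hists)       # ego + neighbors
    return all(len(tr) >= steps_hist for tr in tracks)  # complete 3 s histories
```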

With the above two steps, 48,150 data pieces are selected in total. The dataset is randomly split into two non-overlapping segments: 33,705 (70%) for training and 14,445 (30%) for testing.

IV Results and discussion

IV-A Metric

The root-mean-square error (RMSE), in meters, of the predicted trajectories with respect to the ground-truth future trajectories is used to evaluate the prediction accuracy of the different models. RMSE is calculated for each prediction time step within a horizon of 5 seconds. The same metric was used in previous works [deo2018convolutional, zhao2019multi, jeon2020scale]:

$$\mathrm{RMSE}^{k} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left[(\hat{x}_n^{k} - x_n^{k})^2 + (\hat{y}_n^{k} - y_n^{k})^2\right]}, \tag{8}$$

where $N$ is the size of the test set, $(\hat{x}_n^{k}, \hat{y}_n^{k})$ is the predicted position of the ego in data piece $n$ at prediction step $k$, and $(x_n^{k}, y_n^{k})$ is the corresponding ground truth.
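The per-step RMSE described above is a one-liner over stacked test trajectories; a minimal NumPy sketch, assuming predictions and ground truth are stored as `(N, T_f, 2)` arrays:

```python
import numpy as np

def rmse_per_step(pred, gt):
    """RMSE at each prediction step, in the trajectories' units (meters).

    pred, gt: arrays of shape (N, T_f, 2) -- N test samples, T_f future
    steps, xy positions. Returns an array of length T_f, one RMSE value
    per prediction horizon step (e.g. every step up to 5 s).
    """
    sq_dist = np.sum((pred - gt) ** 2, axis=-1)   # (N, T_f) squared distances
    return np.sqrt(np.mean(sq_dist, axis=0))      # average over test set, then root
```

For example, if every predicted point is displaced by (3, 4) meters from the ground truth, the RMSE at every step is 5 meters.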

IV-B Implementation details

The model is implemented using PyTorch [NEURIPS2019_9015]. Spatial coordinates are embedded into a 16-dimensional space before being used to encode the individual past dynamics. The hidden-state dimensions of the LSTM encoder and decoder are 32 and 64, respectively. The 2-layer CNN uses a fixed 2×2 kernel size but a varying number of channels: the first layer has 32 in-channels and 64 out-channels, while the second layer has 64 in-channels and 128 out-channels. No padding is used in this model. The fully connected layer for the ego has 32 in-features and 32 out-features; the fully connected layer for the interaction has 128 in-features and 64 out-features. All layers use the same leaky-ReLU activation function with a small negative slope. The Adam optimizer [kingma2014adam] is used to train the model to minimize the weighted mean-squared-error (MSE) loss function defined in Eq. 9 on the extracted dataset.


where and are lateral positions of the predicted trajectory and the ground truth, and

are the corresponding longitudinal positions. More weight is given to lateral error considering the fact that lateral variance of a surrounding vehicle is much more important than its longitudinal position. And trajectories of vehicles driving on freeways almost always have much larger longitudinal movement than lateral movement.

IV-C Results

To demonstrate the advantages of the proposed approach, five models are implemented:

  • Vanilla LSTM (V-LSTM): this model uses a single LSTM to encode the history trajectory of the ego alone, without considering the interaction between the ego and its surrounding vehicles.

  • FC-LSTM: a variant of CNN-LSTM as described in sub-Sec. II-B, except that it uses a fully connected layer rather than a convolutional neural network to extract the interaction.

  • CNN-31-LSTM: this model uses a single convolutional layer together with a fully connected layer to replace the 2-layer CNN in Eq. 6.

  • Interaction-only: this model discards the individual dynamics of the ego in Eq. 4 and predicts the ego's trajectory based only on the interaction in Eq. 6.

  • CNN-LSTM: the model proposed in this work.

The above models are all trained and tested on the same dataset extracted from the raw NGSIM US-101 trajectories, with a batch size of 8 for 20 epochs. In addition, the results of previous works [deo2018convolutional, zhao2019multi, jeon2020scale] are shown. It is worth noting that, although the same metric is used across the implemented models and the previous ones, the training and testing datasets differ from one work to another.

Methods                          Prediction horizon (Metric: RMSE in meters)
                                 1 sec    2 sec    3 sec    4 sec    5 sec
CS-LSTM [deo2018convolutional]   0.61     1.27     2.09     3.10     4.37
SCALE-Net [jeon2020scale]        0.459    1.156    1.973    2.911    -
MATF GAN [zhao2019multi]         0.66     1.34     2.08     2.97     4.13
V-LSTM                           0.7393   1.7887   3.1321   4.8683   6.9017
FC-LSTM                          0.6570   1.0567   1.4399   1.9374   2.6296
CNN-31-LSTM                      0.6221   1.0314   1.4055   1.8324   2.5349
Interaction-only                 0.7260   1.0193   1.3183   1.7247   2.4101
CNN-LSTM                         0.6214   0.9760   1.2751   1.6237   2.2720
TABLE I: Trajectory prediction results of different models

Table I shows the results of the different models. The proposed interaction-aware models outperform previous models in long-term trajectory prediction (2-5 seconds) in terms of RMSE in meters. This shows that the dynamics of the selected 8 surrounding vehicles contain enough information to model the interaction.

Fig. 2: Before lane change. Trajectory prediction in driving scenarios before lane change, where the axes represent the lateral and longitudinal positions of vehicles in meters. Trajectories are distinguished by color as in the legend. Ego hist: history track of the ego; Nbrs hist: history tracks of surrounding vehicles; GT fut: ground-truth future trajectory of the ego; CNN-LSTM: predicted trajectory of the proposed model; V-LSTM: predicted trajectory of the implemented V-LSTM; FC-LSTM: predicted trajectory of the implemented FC-LSTM.
Fig. 3: During lane change. Trajectory prediction in driving scenarios during lane change. The same legend is used as in Fig. 2.
Fig. 4: After lane change. Trajectory prediction in driving scenarios after lane change. The same legend is used as in Fig. 2.

The vanilla LSTM model has the poorest performance compared with all the listed interaction-aware trajectory prediction models. This indicates that modeling the interaction among vehicles is useful for trajectory prediction, even though the interaction can be modeled in different ways. This result is consistent with previous works [lee2017desire, deo2018convolutional, zhao2019multi, jeon2020scale].

We note that CNN-31-LSTM outperforms FC-LSTM, which suggests that a CNN-based interaction extractor is more effective than a fully-connected one.

We also note that the proposed CNN-LSTM, which uses 2 convolutional layers to extract the interaction hierarchically, outperforms its variant CNN-31-LSTM, which uses a single convolutional layer. This indicates that the interaction is better modeled from local to global.

The proposed CNN-LSTM also outperforms its variant Interaction-only, in which only the interaction in Eq. 6 is used by the LSTM decoder. This shows that it is necessary to emphasize the ego's dynamics with a separate encoding channel rather than only including them in the interaction.

IV-D Results visualization

The quantitative results in sub-Sec. IV-C show that the proposed model outperforms previous works and its variants in terms of RMSE in meters. In this section, prediction results are visualized to study the performance of the implemented models in different driving scenarios, as shown in Fig. 2, Fig. 3, and Fig. 4. In each figure, the first row shows three different driving scenarios, where all trajectories are shown in the same coordinate system. Since all history trajectories share the same time interval, vehicle speeds can be roughly inferred from the density of the trajectory points. The second row shows the prediction results of the different models.

Fig. 2 shows driving scenarios before lane change. It is clear that the vanilla LSTM model can hardly notice the ego's intention to change lane, while the interaction-aware models all predict trajectories reflecting the lane-change intention. For example, the middle column of Fig. 2 shows a scenario where changing to the left lane is a reasonable option: in the current lane, the preceding vehicle is slowing down; in the right lane, the neighboring vehicles are much slower than the ego, so there is no reason to change to the right lane; in the left lane, the preceding vehicles are faster than the ego and the following vehicle is slower.

Fig. 3 and Fig. 4 show driving scenarios during and after lane change. The vanilla LSTM model cannot determine whether the lane-change maneuver has been completed. The interaction-aware models, however, can tell whether the ego is driving in the center of the target lane (lane change completed) or between two lanes (lane change in progress) from its lateral distance to the surrounding vehicles.

V Conclusion

In this work, an interaction-aware vehicle trajectory prediction method based on integrated CNNs and LSTMs is proposed for connected vehicles. In this model, LSTM encoders with shared weights extract the time-series information of individual vehicles; a CNN is then applied to extract the interaction among neighboring vehicles, whose sequential features are placed into a grid according to their directions relative to the ego. Finally, the extracted interaction and the ego's individual dynamics are concatenated and sent to an LSTM decoder to predict the future trajectory of the ego vehicle. Quantitative results show that the proposed model outperforms existing works in terms of RMSE, and ablative studies on the selected dataset demonstrate the rationality of the proposed model.

One limitation of the proposed model is that it assumes that all eight surrounding vehicles exist and have a 3-second history. Future work could remove this limitation by making the model adaptable to varying numbers of surrounding vehicles. Another direction is to extend the model to multi-trajectory prediction.


This work was supported by the SUG-NAP Grant (No. M4082268.050) of Nanyang Technological University, Singapore.