Human drivers must continually observe the behavior of nearby vehicles in order to plan future motions that pass safely and efficiently through complex traffic. Similarly, autonomous vehicles should collect the movement information of nearby objects and then decide which maneuver minimizes risk and maximizes efficiency. Detecting such information has become possible with the development of onboard sensors and vehicle-to-infrastructure (V2I) communication [43, 13]. Recent works focus on how to use this information to plan motions for autonomous vehicles [15, 8]. One of the main aspects of motion planning is predicting other traffic participants’ future trajectories [33, 37], from which autonomous vehicles can infer the situations they might encounter [25, 4].
In the past decades, inspired by vehicle evolution models and statistics-based models [20, 22], research on vehicle trajectory prediction has developed steadily, with kinematic models [5, 17, 36] and the Kalman filter being widely studied. These traditional models are computationally efficient, which makes them especially suitable for real-time applications with limited hardware resources. However, when the spatial-temporal correlations among vehicles are ignored, the long-term predictions (e.g., longer than one second) of these traditional methods become unreliable. To extend these traditional methods to account for the spatial-temporal interactions among vehicles, combination methods were proposed that take advantage of both traditional methods and advanced models (e.g., machine-learning-based models). For example, Ju et al. proposed a model that combines a Kalman filter, kinematic models, and a neural network to capture the interactive effects among vehicles. Since machine-learning-based models can model the interactions and learn nonlinear trajectory evolution from real-world data, these combination methods have shown better performance than the traditional approaches.
So far, most machine-learning-based methods have assumed that vehicle trajectory prediction can be decomposed into two steps: first predict the maneuver of a target vehicle, and then generate a maneuver-based trajectory. For example, Deo et al. used a Gaussian process regression model to recognize the maneuver being performed by a target vehicle and applied the Monte Carlo method to predict future trajectories. Other maneuver estimation models include support vector machines [31, 3], [38] and multi-layer perceptrons [23, 39]. However, most of these approaches depend heavily on handcrafted features, which are designed to model interactions and physical constraints among vehicles under certain scenarios. Therefore, their performance degrades in unexpected traffic scenarios.
As a branch of machine learning, deep learning can extract features automatically by learning from abundant data, which overcomes the shortcomings of handcrafted features. In particular, after the success of long short-term memory (LSTM) networks in capturing complex temporal dependencies [29, 1], many works have applied LSTM to vehicle trajectory prediction. For example, Zyner et al. proposed an LSTM-based model to predict the direction that a driver is likely to take at an intersection. Xin et al. proposed a dual LSTM-based model to estimate driver intentions and predict future trajectories. These approaches take the past trajectory of a target vehicle as input and can achieve high trajectory prediction accuracy. However, they ignore the impact of nearby vehicles on the target vehicle.
Later on, inspired by the interactive nature of driving, researchers began to incorporate the effects of nearby vehicles into deep-learning-based models. For example, Deo et al. proposed a maneuver LSTM (M-LSTM), which takes the past trajectories of a target vehicle and its nearby vehicles as inputs. However, this method simply aggregates all trajectories and ignores the different effects that individual nearby vehicles have on the target vehicle. To improve the M-LSTM, Deo et al. proposed a network named CS-LSTM, in which a social tensor and a convolutional social pooling mechanism are introduced to model and capture the spatial interactions, respectively. Similarly, Zhao et al. proposed a multi-agent tensor fusion (MATF) network, which introduces a spatial tensor to represent spatial relations among vehicles. However, since the social tensor in CS-LSTM and the spatial tensor in MATF only retain the spatial relationships at the last timestamp of the past trajectories, both models ignore the spatial-temporal dependencies. To address this issue, Dai et al. proposed a spatio-temporal LSTM to consider the dynamic effects of the six closest vehicles on the target vehicle. Similarly, Hou et al. proposed a structural-LSTM network to consider the dynamic effects of five nearby vehicles on the target vehicle.
Although the above deep-learning-based approaches have made great progress in improving the accuracy of vehicle trajectory prediction, two limitations remain. First, the spatial relations of vehicles are essentially non-Euclidean, so it is difficult to explain the physical meaning of the features when LSTM is used to model the spatial-temporal interactions. These LSTM-based approaches are therefore neither efficient nor intuitive in modeling the spatial-temporal interactions. Second, models that take LSTM as the backbone network require intensive computation, and most of them predict the future trajectory of only one target vehicle at a time. Thus, the computation time increases sharply when the future trajectories of all neighbor vehicles must be predicted, which makes these models unsuitable for the real-time decision-making of autonomous vehicles.
Therefore, to overcome these two limitations, this paper proposes an efficient and fast network called the graph-based spatial-temporal convolutional network (GSTCN), which can simultaneously predict the future trajectory distributions of all neighbor vehicles. Inspired by the fact that a graph convolutional network (GCN) can capture the spatial dependencies in a traffic network with a faster computation speed and a higher efficiency than LSTM-based methods [6, 16, 14, 28, 26, 30], we design a spatial graph convolutional module to learn the spatial dependencies among vehicles. To distinguish the respective effects of neighbor vehicles on a vehicle, we propose a weighted adjacency matrix, which is embedded into the spatial graph convolutional module. Moreover, to capture the correlations between the features in the prediction and past time horizons, we design a temporal dependency extractor (TDE) based on a convolutional neural network (CNN) that operates in the temporal dimension. In this way, features in the past time horizon are mapped into the prediction time horizon for the analysis of vehicle trajectory evolution. In addition, the spatial-temporal features are fed into an encoder-decoder based on gated recurrent units (GRU) to generate future trajectory distributions. The main contributions of our work are as follows.
(1) The backbones of our GSTCN are GCN and CNN, so it has a smaller model size and a faster inference speed than those of LSTM-based models, making the real-time prediction of all nearby vehicles’ trajectories possible.
(2) A weighted adjacency matrix is proposed to describe the intensity of mutual influence between two vehicles, and the ablation study shows that the network using it outperforms the network using an unweighted adjacency matrix.
(3) The GSTCN generates probability distributions over the future trajectories, which describe the stochastic behaviors of human drivers better than models predicting deterministic trajectories, especially over a long prediction horizon.
The rest of this paper is organized as follows. The problem description is given in Section II. We present a detailed description of the proposed network in Section III. The experimental results and analysis are given in Section IV. Finally, conclusions and possible future works are drawn in Section V.
II Problem Description of Vehicle Trajectory Prediction
Vehicle trajectory prediction is an important auxiliary function of autonomous vehicles: it helps them assess potential risks and plan appropriate trajectories in advance. Following prior work, this paper formulates vehicle trajectory prediction as estimating the future trajectory distributions given past trajectories. The difference is that we predict future positions for all neighbor vehicles simultaneously, which provides autonomous vehicles with more detailed information about future situations.
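To formulate this problem, we first introduce some notation. Vehicles’ positions over a past time horizon $T_{obs}$ are denoted as:
$$X = [p_1, p_2, \dots, p_{T_{obs}}],$$
where $p_t = [x_t^1, y_t^1, \dots, x_t^N, y_t^N]$ are the coordinates at time $t$, and $N$ is the number of vehicles. As shown in Fig. 1, we assume that the autonomous vehicle can observe the motions of vehicles within 100 meters longitudinally and two adjacent lanes laterally, and can collect their past trajectories at a certain frequency. The trajectory distributions in the future time horizon $T_{pred}$ are denoted as:
$$Y = [\hat{p}_{T_{obs}+1}, \hat{p}_{T_{obs}+2}, \dots, \hat{p}_{T_{obs}+T_{pred}}].$$
We follow the assumption in prior work that the coordinates of vehicle $i$ at time $t$ follow a bivariate Gaussian distribution, $(x_t^i, y_t^i) \sim \mathcal{N}(\mu_t^i, \sigma_t^i, \rho_t^i)$, where $\mu_t^i$ is the mean, $\sigma_t^i$ is the standard deviation, and $\rho_t^i$ is the correlation.
Thus, the trajectory prediction problem can be summarized as follows. Given all neighbor vehicles’ positions $X$ over a past time horizon, the aim is to predict their trajectory distributions $Y$ in the future time horizon.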
III Graph-based Spatial-Temporal Convolutional Network
To solve the trajectory prediction problem, a key challenge is to figure out how vehicles affect each other, i.e., spatial-temporal dependencies among vehicles. Moreover, since predicted trajectories are time series, another challenge lies in how to tackle the sequence generation task. To address the two challenges, we propose a graph-based spatial-temporal convolutional network (GSTCN) for vehicle trajectory prediction. The overall architecture of our proposed network is shown in Fig. 2, in which the spatial graph convolutional module and the TDE are used to capture the spatial-temporal dependencies, and the trajectory prediction module is applied to predict future trajectories. We introduce each component in the following subsections.
III-A Spatial graph convolutional module
III-A1 Generation of spatial-temporal graph using trajectories
Inspired by the topological structure of the graph, we model the interactions among vehicles as a spatial-temporal graph. The spatial-temporal graph is defined as $G = \{G_t \mid t = 1, \dots, T_{obs}\}$, where $G_t$ is the spatial graph representing the spatial relations of vehicles at time $t$ and $T_{obs}$ is the length of the past time horizon. Supposing that there are $N$ vehicles in a scene, we define the spatial graph as $G_t = (V_t, E_t)$, where $V_t = \{v_t^i \mid i = 1, \dots, N\}$ is the set of all vertices. Each vertex $v_t^i$ represents an individual vehicle in the scene, and the attribute of $v_t^i$ is the coordinate $(x_t^i, y_t^i)$. $E_t$ is the set of all edges, and each edge $e_t^{ij}$ represents the mutual effects between vehicles $i$ and $j$.
Generally, a vehicle has different effects on other vehicles. For example, the sudden deceleration of a vehicle causes nearby vehicles to slow down or change lanes, but has little influence on vehicles far away from it. Therefore, to distinguish the intensity of interactions between two vehicles, each edge in the spatial graph should be assigned its own weight. In this work, we introduce $A_t$ as a weighted adjacency matrix whose entries represent how strong the interactions between two vehicles can be. Considering that two vehicles at a closer distance have stronger effects on each other, we use the reciprocal of the distance as the weight between two vehicles, so that closer vehicles have higher weights. Hence, the entries $a_t^{ij}$ of $A_t$ can be written as
$$a_t^{ij} = \begin{cases} 1 / d_t^{ij}, & i \neq j, \\ 0, & \text{otherwise}, \end{cases}$$
where $d_t^{ij}$ denotes the distance between vehicles $i$ and $j$ at time $t$. A toy example of generating a spatial graph is given in Fig. 3, in which each vertex in the graph corresponds to a vehicle and the reciprocals of the distances are the weights in the weighted adjacency matrix $A_t$. At each time $t$ in the past time horizon, we can construct a spatial graph $G_t$. By stacking all $G_t$, we get the spatial-temporal graph $G$, which is the input of our network. In the spatial-temporal graph, we define $V$ as the stacked tensor of all $V_t$ in the past time horizon, and its feature dimension $D$ is set to 2 to represent the two-dimensional coordinate $(x_t^i, y_t^i)$. Similarly, the adjacency matrix of $G$ is denoted as $A$, which is the stack of all $A_t$.
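The reciprocal-distance weighting can be illustrated with a minimal Python sketch. The function name `weighted_adjacency`, the zero diagonal, and the zero weight for coincident vehicles are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def weighted_adjacency(positions):
    """Build a reciprocal-distance adjacency matrix for one timestamp.

    positions: list of (x, y) tuples, one per vehicle.
    Returns an N x N list of lists. The diagonal and any pair of
    coincident vehicles get weight 0 (assumption: avoids division by zero).
    """
    n = len(positions)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(positions[i], positions[j])
            if d > 0.0:
                adj[i][j] = 1.0 / d  # closer vehicles get larger weights
    return adj


# Toy scene with three vehicles (coordinates in meters, chosen arbitrarily).
scene = [(0.0, 0.0), (3.0, 4.0), (0.0, 10.0)]
A_t = weighted_adjacency(scene)
```

In this toy scene the first two vehicles are 5 m apart, so their mutual weight is 0.2, while the first and third vehicles are 10 m apart and receive the smaller weight 0.1.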
III-A2 Spatial graph convolution
The spatial-temporal graph contains raw information about the dependencies among vehicles, so well-designed networks are needed to extract these dependencies from the graph. In the spatial dimension, a vehicle’s future trajectory is highly dependent on the motions of its nearby vehicles. To capture the spatial dependencies, existing works [11, 46] use the number of zeros in a grid to represent distances between vehicles, which is inefficient. Since a GCN operates directly on the vertices of a graph and has proven effective at capturing the spatial dependencies between one vertex and its neighbors, we introduce the spatial graph convolution in this work.
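Extending the standard two-dimensional convolution, the graph convolution operation is expressed as:
$$f(V^{(l)}, A) = \sigma\left(D^{-\frac{1}{2}} \hat{A} D^{-\frac{1}{2}} V^{(l)} W^{(l)}\right),$$
where $V^{(l)}$ denotes the feature matrix of vertices in layer $l$, $\sigma$ is an activation function, $\hat{A} = A + I$, $I$ is the identity matrix, $D$ is the diagonal node degree matrix of $\hat{A}$, and $W^{(l)}$ is the parameter matrix of layer $l$. The aim of computing $D^{-\frac{1}{2}} \hat{A} D^{-\frac{1}{2}}$ is to normalize the adjacency matrix, which can speed up the learning process of the GCN.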
As shown in Fig. 4, the graph convolution operations apply a weighted sum to the features of a target vehicle and its surrounding vehicles, and then pass the results to the next layer. The feature of the target vehicle is considered because the state of the target vehicle also impacts its future motions. Note that the shapes of features are the same before and after the graph convolution.
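A single graph convolution step of this kind can be sketched in plain Python as follows. The sketch assumes the common symmetric normalization of the adjacency matrix with added self-loops and uses ReLU as the activation, which the text does not specify; the helper names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # > 0 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def graph_conv(features, adj, weight):
    """One layer: ReLU(norm(A) @ features @ weight). ReLU is an assumption."""
    mixed = matmul(matmul(normalize_adjacency(adj), features), weight)
    return [[max(0.0, v) for v in row] for row in mixed]


adj = [[0.0, 0.5], [0.5, 0.0]]       # two vehicles 2 m apart
features = [[1.0, 2.0], [3.0, 4.0]]  # one feature row per vehicle
weight = [[1.0, 0.0], [0.0, 1.0]]    # identity weights keep the result easy to check
out = graph_conv(features, adj, weight)
```

With identity weights, each output row is a weighted average of the vehicle's own features and its neighbor's features, and the output has the same shape as the input.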
III-B Temporal dependency extractor
In the temporal dimension, the future motions of a vehicle are highly dependent on its own past trajectory. For instance, a vehicle that is changing lanes is most likely to continue this maneuver in the next few seconds. In addition, the effects of the surrounding vehicles on the target vehicle are time-varying. To capture the temporal dependencies, most existing works [11, 2] rely on LSTM. However, LSTM leads to low training efficiency and slow computation speed. Inspired by prior work, we design a CNN-based temporal dependency extractor (TDE) to extract the temporal features. Since this module depends on convolution operations, it has fewer parameters and a faster inference speed than LSTM.
The procedure of this module is as follows. After the neighboring information of each vertex has been captured by the spatial graph convolution operations, we obtain a three-dimensional tensor that contains the extracted spatial features. We then transpose the dimensions of this tensor so that each channel of the transposed tensor contains the spatial features of all vehicles at the corresponding timestamp. The TDE takes the transposed tensor as input to learn the evolving tendency of each vehicle from its own past motions and the dynamic interactions of nearby vehicles, so its number of input channels equals the length of the past time horizon. As shown in Fig. 5(a), a filter consists of multiple kernels, and each kernel is designed to learn the interactions among vehicles at one past timestamp by convolving over one past temporal feature map. We set the number of kernels in a filter equal to the length of the past time horizon, so a filter can integrate the past temporal information into one feature map by element-wise addition of its kernels’ output feature maps. In addition, we set the number of filters in the TDE equal to the length of the prediction time horizon. In this way, the TDE extracts temporal dependencies by learning the mapping between the input features and the future temporal features. The output of the TDE is a tensor with one channel per timestamp in the prediction time horizon, as shown in Fig. 5(b).
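The channel mapping performed by one TDE layer can be sketched in plain Python. The sketch assumes 3x3 kernels with zero padding, which the text does not state, and the function names `conv2d_same` and `tde_layer` are hypothetical.

```python
def conv2d_same(fmap, kernel):
    """3x3 cross-correlation with zero padding; output keeps the input shape."""
    rows, cols = len(fmap), len(fmap[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += fmap[rr][cc] * kernel[dr + 1][dc + 1]
            out[r][c] = acc
    return out


def tde_layer(past_maps, filters):
    """Map T_obs past feature maps to T_pred future feature maps.

    past_maps: list of T_obs feature maps (one per past timestamp).
    filters:   list of T_pred filters, each a list of T_obs 3x3 kernels.
    """
    rows, cols = len(past_maps[0]), len(past_maps[0][0])
    outputs = []
    for kernels in filters:
        if len(kernels) != len(past_maps):
            raise ValueError("each filter needs one kernel per past timestamp")
        per_kernel = [conv2d_same(m, k) for m, k in zip(past_maps, kernels)]
        # Element-wise addition fuses all past timestamps into one map.
        outputs.append([[sum(p[r][c] for p in per_kernel) for c in range(cols)]
                        for r in range(rows)])
    return outputs


def center_kernel(value):
    """Kernel that only scales the center pixel (keeps the toy example checkable)."""
    return [[0.0, 0.0, 0.0], [0.0, value, 0.0], [0.0, 0.0, 0.0]]


# T_obs = 3 past maps filled with 1s, 2s and 3s; T_pred = 2 filters.
past_maps = [[[float(t)] * 2 for _ in range(2)] for t in (1, 2, 3)]
filters = [[center_kernel(1.0)] * 3, [center_kernel(0.5)] * 3]
future_maps = tde_layer(past_maps, filters)
```

The number of output maps equals the number of filters, so choosing as many filters as prediction timestamps turns the past-horizon channels into prediction-horizon channels.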
III-C Trajectory prediction module
After capturing the spatial-temporal features, we generate the future trajectory distributions, which is a typical sequence generation task. Inspired by the excellent performance of the gated recurrent unit (GRU) network on sequence-based tasks and its low computation cost, we design the trajectory prediction module as a GRU-based encoder-decoder network. In our model, the spatial graph convolutional module learns the spatial dependencies among vehicles from the inputs, and the temporal dependency extractor then extracts the temporal dependencies. Although these two modules run consecutively, the spatial and temporal dependencies are not extracted at the same time, which weakens the correlation between the spatial and temporal dependencies contained in the extracted features. Therefore, in this module, the encoder GRU is designed to strengthen the correlation between spatial and temporal dependencies, and the decoder GRU generates the probability distributions of future trajectories. The predicted coordinate $(\hat{x}_t^i, \hat{y}_t^i)$ of vehicle $i$ at time $t$ is given by $(\hat{x}_t^i, \hat{y}_t^i) \sim \mathcal{N}(\hat{\mu}_t^i, \hat{\sigma}_t^i, \hat{\rho}_t^i)$, where $\hat{\mu}_t^i$, $\hat{\sigma}_t^i$, and $\hat{\rho}_t^i$ are the predicted mean, standard deviation, and correlation. Note that the encoder GRU and the decoder GRU share their weights across vehicles, which preserves the generalization of the model even if the number of neighbor vehicles varies.
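Drawing a predicted coordinate from such a bivariate Gaussian can be sketched as below. This is a generic sampling routine written for illustration, not the authors' code; the function name and argument order are assumptions.

```python
import math
import random


def sample_position(mu_x, mu_y, sigma_x, sigma_y, rho, rng):
    """Draw one (x, y) sample from a bivariate Gaussian.

    Two independent standard normal draws are mixed so that the
    resulting x and y have correlation rho (requires |rho| <= 1).
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu_x + sigma_x * z1
    y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


rng = random.Random(0)
# With zero standard deviation the sample collapses to the mean.
certain = sample_position(1.0, 2.0, 0.0, 0.0, 0.0, rng)
# With rho = 1 and equal scales, x and y move together.
x, y = sample_position(0.0, 0.0, 2.0, 2.0, 1.0, rng)
```

Repeating the draw several times per timestamp yields a set of candidate trajectories, which is how a sampled evaluation of a probabilistic predictor is typically carried out.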
IV Experimental Evaluation
The model is trained on two public vehicle trajectory datasets, I-80 and US-101 from NGSIM, in which trajectories are recorded at 10 Hz under real freeway scenarios. Both datasets contain 45 minutes of vehicle trajectories covering mild, moderate, and heavy traffic scenarios. These rich scenarios make the datasets suitable for evaluating the robustness and effectiveness of the proposed network. The studied freeway sections of I-80 and US-101 are shown in Fig. 6, where the white arrows indicate driving directions.
To make a fair comparison, we follow the training strategy of prior work: the raw data are downsampled to 5 Hz, and the trajectories are split into segments of 8 seconds, in which the first 3 seconds of each segment are taken as the past time horizon and the remaining 5 seconds are treated as the prediction time horizon. We finally obtain 13,218 trajectory segments, which are randomly split into training, validation, and testing sets.
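The segmentation arithmetic can be made concrete with a short Python sketch: at 5 Hz, an 8-second segment holds 40 frames, of which 15 form the past horizon and 25 the prediction horizon. The use of non-overlapping windows and the function name `split_segments` are assumptions; the text does not say how consecutive segments are positioned.

```python
RAW_HZ, TARGET_HZ = 10, 5   # recording and downsampled frequencies
PAST_S, FUTURE_S = 3, 5     # past and prediction horizons in seconds


def split_segments(track):
    """Downsample a 10 Hz track to 5 Hz and cut it into (past, future) pairs.

    track: list of (x, y) positions sampled at RAW_HZ.
    Windows do not overlap (assumption made for this sketch).
    """
    frames = track[::RAW_HZ // TARGET_HZ]            # keep every 2nd frame
    past_len = PAST_S * TARGET_HZ                     # 15 frames
    seg_len = (PAST_S + FUTURE_S) * TARGET_HZ         # 40 frames
    segments = []
    for start in range(0, len(frames) - seg_len + 1, seg_len):
        seg = frames[start:start + seg_len]
        segments.append((seg[:past_len], seg[past_len:]))
    return segments


# A synthetic 20-second track at 10 Hz (200 frames) yields two segments.
track = [(float(i), 0.0) for i in range(200)]
segments = split_segments(track)
```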
IV-B Evaluation metrics
To evaluate performance quantitatively, we use the root mean square error (RMSE) between the predicted trajectory and the ground truth. The RMSE is calculated as
$$\mathrm{RMSE}_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[(\hat{x}_t^i - x_t^i)^2 + (\hat{y}_t^i - y_t^i)^2\right]},$$
where $(\hat{x}_t^i, \hat{y}_t^i)$ and $(x_t^i, y_t^i)$ are the predicted and ground-truth coordinates of vehicle $i$ at time $t$, and $n$ is the number of evaluated vehicles. We report the RMSE values for different prediction time horizons (from 1 to 5 seconds). We note that when evaluating predicted probability distributions, previous works [11, 46] only calculate the RMSE between the mean values and the ground truth and ignore the predicted standard deviations and correlations. Thus, to make a more comprehensive evaluation, we report the lowest error over 5 random samplings, similar to prior work.
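A minimal sketch of this metric, together with the lowest-error-over-samples rule, is given below. Taking the minimum over whole sampled predictions (rather than per vehicle) is an assumption, since the text does not state the granularity; the function names are hypothetical.

```python
import math


def rmse(pred, truth):
    """RMSE between predicted and ground-truth (x, y) positions at one horizon."""
    sq_errors = [(px - tx) ** 2 + (py - ty) ** 2
                 for (px, py), (tx, ty) in zip(pred, truth)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def best_of_samples(samples, truth):
    """Lowest RMSE among several sampled predictions (e.g. 5 samplings)."""
    return min(rmse(sample, truth) for sample in samples)


truth = [(0.0, 0.0), (0.0, 0.0)]
sample_a = [(3.0, 4.0), (3.0, 4.0)]   # both vehicles off by 5 m
sample_b = [(3.0, 4.0), (0.0, 0.0)]   # only one vehicle off by 5 m
error_a = rmse(sample_a, truth)
best = best_of_samples([sample_a, sample_b], truth)
```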
IV-C Implementation details
We implement the proposed model in PyTorch. Several implementation details are given as follows.
IV-C1 Data preprocessing
In a real scenario, imperfect sensors may produce missing or abnormal data, so the raw data must be preprocessed: abnormal data are removed and missing data are inferred. We introduce a data preprocessing module for these tasks. With this module, abnormal data can be detected and removed using an existing anomaly detection method, and missing data can be inferred by cubic Hermite interpolation.
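Filling gaps with cubic Hermite interpolation can be sketched as follows for one coordinate of a uniformly sampled track. The tangents are estimated by finite differences over the neighboring known points, which is one common choice and an assumption here; the text does not say how tangents are obtained, and the function names are hypothetical.

```python
def hermite(p0, p1, m0, m1, s, h):
    """Cubic Hermite value at fraction s in [0, 1] of an interval of width h."""
    s2, s3 = s * s, s * s * s
    return ((2 * s3 - 3 * s2 + 1) * p0 + (s3 - 2 * s2 + s) * h * m0
            + (-2 * s3 + 3 * s2) * p1 + (s3 - s2) * h * m1)


def fill_missing(values):
    """Replace None entries that lie between known samples.

    values: list of floats or None, sampled at a uniform rate.
    Leading or trailing None entries are left untouched.
    """
    known = [i for i, v in enumerate(values) if v is not None]

    def slope(a, b):
        return (values[b] - values[a]) / (b - a)

    # Finite-difference tangents: central inside, one-sided at the ends.
    tangent = {}
    for pos, i in enumerate(known):
        lo = known[max(pos - 1, 0)]
        hi = known[min(pos + 1, len(known) - 1)]
        tangent[i] = slope(lo, hi) if hi != lo else 0.0

    filled = list(values)
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            s = (i - a) / (b - a)
            filled[i] = hermite(values[a], values[b],
                                tangent[a], tangent[b], s, b - a)
    return filled


# Longitudinal positions of a vehicle moving at constant speed, with gaps.
observed = [0.0, None, 4.0, 6.0, None, None, 12.0]
filled = fill_missing(observed)
```

For this constant-speed example the interpolation recovers the straight line through the known points, so the gaps are filled with 2, 8 and 10.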
IV-C2 Scene size
We set the autonomous vehicle’s horizon of sight as follows. The autonomous vehicles can observe the motions of vehicles within the range of 100 meters longitudinally and two adjacent lanes laterally.
IV-C3 Input embedding
We use a 32-channel convolutional layer with kernel size to increase the dimensions of spatial coordinates, which can improve the learning ability of the network .
IV-C4 Temporal dependency extractor
IV-C5 Trajectory prediction module
Both the encoder and the decoder are one-layer GRUs. A linear layer ensures that the outputs of the decoder have the expected shape, and dropout (with probability 0.5) is applied to prevent overfitting.
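IV-C6 Training loss
Since the outputs of our network are probability distributions, we train the network by minimizing the negative log-likelihood loss:
$$L = -\sum_{t} \sum_{i} \log P\left(x_t^i, y_t^i \mid \hat{\mu}_t^i, \hat{\sigma}_t^i, \hat{\rho}_t^i\right),$$
where the sums run over the timestamps $t$ of the prediction time horizon and the vehicles $i$ in the scene, and $P(\cdot)$ denotes the likelihood of the ground truth position $(x_t^i, y_t^i)$ under the predicted probability distribution with mean $\hat{\mu}_t^i$, standard deviation $\hat{\sigma}_t^i$, and correlation $\hat{\rho}_t^i$.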
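For a single ground-truth position, the negative log-likelihood under a bivariate Gaussian can be computed as in the sketch below. It uses the standard closed-form density and is a generic illustration; the function name and the scalar (non-batched) form are assumptions.

```python
import math


def bivariate_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-likelihood of (x, y) under a bivariate Gaussian.

    Requires sigma_x > 0, sigma_y > 0 and |rho| < 1.
    """
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    z = dx * dx + dy * dy - 2.0 * rho * dx * dy
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y
                        * math.sqrt(one_minus_rho2))
    return log_norm + z / (2.0 * one_minus_rho2)


# Ground truth exactly at the predicted mean of a unit, uncorrelated Gaussian.
at_mean = bivariate_nll(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
# Ground truth one standard deviation away along x.
off_mean = bivariate_nll(1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
```

Summing this quantity over all predicted timestamps and vehicles gives a loss of the form described above: it penalizes both a mean that is far from the ground truth and an overly wide or overly narrow predicted spread.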
IV-C7 Training process
The GSTCN is trained on the NVIDIA GTX1080Ti GPU. The batch size is set to be 128, and we train the model for 250 epochs using Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.1. The learning rate is multiplied by 0.1 every 80 epochs to speed up the convergence of the loss.
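The step decay described here can be written out as a small Python sketch. Counting epochs from zero is an assumption; in PyTorch, a schedule of this form is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)`.

```python
def learning_rate(epoch, base_lr=0.1, gamma=0.1, step=80):
    """Learning rate at a given epoch: multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Learning rate at the start of each stage of a 250-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 80, 160, 240)}
```

Over 250 epochs the rate therefore takes four values, roughly 0.1, 0.01, 0.001 and 0.0001, with the last stage covering only the final 10 epochs.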
IV-C8 Model configuration
We choose appropriate hyperparameters to improve the performance of our network. We first test how the numbers of layers in the spatial graph convolutional module and the TDE affect the performance. As shown in Fig. 7, the best model has one layer in the spatial graph convolutional module and five layers in the TDE. Then, we fix the numbers of layers and vary the number of hidden units in the GRU. As shown in Table I, the model has the lowest RMSE values when the number of hidden units is set to 32.
IV-D Ablation study
In this subsection, we conduct several ablative studies to verify the effectiveness of our scheme.
IV-D1 Effectiveness of different modules
Our full model mainly consists of three modules. To verify the effectiveness of each module on vehicle trajectory prediction, we train three variants of our model with different modules: GSTCN without GCN, GSTCN without TDE and GSTCN without GRU. As shown in Table II, the removal of any module in GSTCN will cause an increase in RMSE, which indicates the effectiveness of each module. When removing GCN or TDE from our full model, the features sent to the GRU lack the extracted spatial dependencies or temporal dependencies. As a result, the GRU fails to integrate the correlation between the spatial dependencies and temporal dependencies and cannot reasonably generate future trajectories. If the GRU is removed, the weak correlation between the spatial and temporal dependencies cannot be strengthened. Only after both the spatial dependencies and temporal dependencies are sent to the trajectory prediction module, can our model achieve the best performance, which indicates that the GCN and TDE are complementary in our full model, and the GRU encoder-decoder can strengthen the correlation between spatial dependencies and temporal dependencies from the extracted features and generate more accurate future trajectories.
| Horizon (s) | W/O GCN | W/O TDE | W/O GRU |
IV-D2 Influence of different locations
The proposed network can simultaneously predict all neighbor vehicles’ future trajectories based on their past trajectories. Therefore, it is necessary to analyze the prediction errors of vehicles at different locations. As shown in Fig. 8, we report the RMSE values for vehicles located in the middle, front, and rear of the scene.
We note that vehicles located in the middle of the scene have the lowest RMSE values for all prediction horizons, which is consistent with the intuition that vehicles located in the middle have more neighboring information as inputs than vehicles at the edge. An interesting phenomenon is that the prediction errors of vehicles located in the rear of the scene are always lower than those of the vehicles in the front of the scene. This is because drivers are more likely to be affected by the vehicles in front of them according to the driving experience. However, when predicting the trajectories of the vehicles in the front of the scene, we can only utilize the motion information of the vehicles behind them. Thus, the predictions of the GSTCN are consistent with intuition.
IV-D3 GSTCN with different weighted adjacency matrices
The weights in the adjacency matrix measure the intensities of interactions among vehicles, so they should be chosen appropriately. In Section III-A, we introduced the reciprocal of the distance between two vehicles as the weight, based on prior knowledge. This means that the closer two vehicles are, the stronger their mutual effects. To validate that this prior knowledge has a positive influence on the performance of our model, we introduce two other ways to set the weights: (1) we directly use the distance between two vehicles as the weight; (2) similar to existing works [26, 21], we set all elements in the adjacency matrix to one as a baseline.
The RMSE values of the different weighted adjacency matrices are listed in Table III. The model with the weighted adjacency matrix defined by the reciprocal of the distance outperforms the others. Therefore, reasonable use of prior knowledge can help improve the performance of the model. Note that all weighted adjacency matrices outperform the baseline, which demonstrates the effectiveness of weighted adjacency matrices. An interesting result is that, although using the distances directly as weights is counterintuitive, the resulting prediction errors are still lower than those of the baseline, which is partly due to the reverse learning ability of deep-learning-based models.
We evaluate the performance of our proposed method by comparing it with the following baselines:
Constant Velocity (denoted as CV): This baseline uses a constant velocity Kalman filter to predict the deterministic trajectory of one vehicle. The effects of neighbor vehicles on the target vehicle are ignored.
Vanilla LSTM (denoted as V-LSTM): This baseline is a simple LSTM encoder-decoder that takes the past trajectory of the target vehicle as input and generates a deterministic future trajectory of the target vehicle. The effects of surrounding vehicles on the target vehicle are ignored.
CS-LSTM with maneuvers (denoted as CS-LSTM-M): This LSTM-based model applies convolutional social pooling layers to tackle the spatial interactions and predicts the multi-modal trajectory distributions of the target vehicle based on maneuvers.
CS-LSTM: This baseline is the same as CS-LSTM-M except that it generates a unimodal prediction distribution.
MATF: This baseline takes the past trajectories of all vehicles and the scene image of the predicted area as inputs and uses LSTM to predict deterministic trajectories of all vehicles in a scene.
Graph-based interaction-aware trajectory prediction model (denoted as GRIP-ALL): This baseline uses a graph to model the interactions among vehicles and uses LSTM to predict deterministic trajectories of all vehicles in a scene.
GRIP: This baseline is the same as GRIP-ALL except that it only predicts the trajectory of one vehicle in the central location of the scene.
IV-F Quantitative analysis for prediction results
In this subsection, we compare the prediction errors, model sizes, and inference speeds of our method with those of the above baselines. Since some methods can predict trajectories of all neighbor vehicles simultaneously, while others only predict the trajectory of one vehicle each time, we compare the results respectively for the sake of fairness.
IV-F1 Performance when predicting one vehicle
In order to compare with the methods that only predict one vehicle’s future trajectory each time, we report the RMSE values of the vehicle located in the middle of the scene, as shown in the column of “GSTCN-ONE” in Table IV. It shows that the GSTCN-ONE achieves the lowest prediction errors for almost all prediction horizons, demonstrating the powerful ability of our GSTCN in capturing the spatial-temporal dependencies and inferring the future trajectories.
In Table IV, we can see that for the prediction horizon of one second, the previous state-of-the-art model GRIP performs better than the GSTCN. However, the deterministic trajectories generated by the GRIP fail to describe the stochastic behaviors of human drivers, so our GSTCN outperforms the GRIP over long prediction horizons, and its average RMSE is 7.45% lower than that of the GRIP. Both the GSTCN and the CS-LSTM predict probability distributions over the future trajectories, but the GSTCN outperforms the CS-LSTM for all prediction horizons, which demonstrates that the GCN is more effective than LSTM at capturing the spatial dependencies. In addition, we note that almost all deep-learning-based approaches perform better than the traditional and machine-learning-based methods (CV and C-VGMM+VIM). Among the deep-learning-based methods, those that use the information of surrounding vehicles outperform the one that does not (i.e., V-LSTM). Therefore, it is crucial to consider the neighboring information when predicting trajectories.
IV-F2 Performance when predicting all vehicles
Among all baselines, only the MATF and the GRIP-ALL can simultaneously predict trajectories of all vehicles. Therefore, this paper compares the performance of the two methods with our GSTCN in the case of predicting trajectories of all vehicles. As shown in Table V, our network improves the prediction accuracies for all prediction horizons. For example, the GSTCN achieves 22.4% average accuracy improvement compared with the GRIP-ALL. In addition, although our model does not take the scene images as the additional inputs, it still outperforms the MATF that does. This is because MATF only considers the spatial relationships at one timestamp but ignores the spatial-temporal dependencies, which demonstrates that capturing the spatial-temporal dependencies is far more important than processing the scene images.
IV-F3 Comparison of model size and inference speed
Model size and inference speed are two important performance factors to decide whether an algorithm can be deployed to autonomous vehicles. The small model size guarantees that the method can work well with limited hardware resources. The fast inference speed can guarantee that autonomous vehicles have enough time to make decisions based on predictions. Since only the MATF and the GRIP-ALL can simultaneously predict trajectories of all vehicles among all baselines, we compare model sizes and inference speeds of them, as listed in Table VI.
The GRIP-ALL was previously the smallest model, with 496.3K parameters. The model size of the GSTCN is only about one tenth of that of the GRIP-ALL. To make a fair comparison, the inference speeds of these models are tested on the same NVIDIA GTX 1080Ti GPU, and we compare the average time each model needs to predict the trajectory of one vehicle. All three models simultaneously predict the trajectories of 120 vehicles each time. Our GSTCN spends an average of 0.044 ms to predict one vehicle’s trajectory, which is about 7.3 times faster than the previously fastest model (GRIP-ALL). We achieve these improvements because the backbones of the GSTCN are GCN and CNN, which overcome the limitations induced by the recurrent architecture of LSTM.
| Model | Parameter count (multiple of GSTCN) | Average inference time for one vehicle in ms (multiple of GSTCN) |
| MATF | 15.3M (313) | 0.370 (8.4) |
| GRIP-ALL | 496.3K (10) | 0.322 (7.3) |
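The speed-up factors in parentheses follow directly from the reported per-vehicle times, as the short Python check below illustrates. The variable names are hypothetical; the per-batch figure simply multiplies the reported per-vehicle average by the 120 vehicles predicted at once.

```python
gstcn_ms = 0.044                                   # reported GSTCN time per vehicle
baselines_ms = {"MATF": 0.370, "GRIP-ALL": 0.322}  # reported baseline times

# How many times longer each baseline takes per vehicle than the GSTCN.
slowdown = {name: round(ms / gstcn_ms, 1) for name, ms in baselines_ms.items()}

# Time implied for one batch of 120 vehicles at the reported average.
batch_ms = gstcn_ms * 120
```

The ratios come out at about 8.4 for MATF and 7.3 for GRIP-ALL, matching the table, and the implied time for a full scene of 120 vehicles is roughly 5.3 ms.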
IV-G Qualitative analysis for GSTCN
In this subsection, we qualitatively analyze the prediction performance of our GSTCN by visualizing several representative predicted trajectories under mild, moderate, and heavy traffic scenarios. All the results are sampled from the I-80 and US-101 datasets. As shown in Fig. 9, our model observes the past trajectories of all vehicles in the scene for 3 seconds, illustrated as dashed lines. Then the probability distributions over all trajectories for the next 5 seconds are predicted, which are shown as color densities. The solid lines represent the ground truth trajectories. Overall, the predicted trajectory distributions can capture the pattern of ground truth trajectories well.
As shown in Fig. 9(a) and (b), when traffic is mild, vehicles have few interactions and tend to drive at high speed, and the predicted distributions indicate that our network has learned this pattern. In the moderate traffic scenario, vehicles are more likely to change lanes to maximize their speeds. For example, as shown in Fig. 9(c), since the front vehicle on Lane 3 is relatively slow, the vehicle behind it changes lanes. The GSTCN captures this kind of spatial-temporal dependency and successfully predicts the future distributions. In heavy traffic, the motions of vehicles become more complicated. For example, as shown in Fig. 9(d), vehicles on Lane 1 drive at relatively high speeds, while vehicles on other lanes move slowly. Our GSTCN still successfully predicts the future distributions of all vehicles, which demonstrates the robustness of our network.
IV-H Robustness to imperfect data
In the previous experiments, since the NGSIM datasets have been preprocessed by the data provider, we assumed that the trajectory data received by the autonomous vehicle are perfect. In practical applications, however, such perfection is generally impossible to achieve. Therefore, in this subsection we discuss the robustness of the proposed model when the data are imperfect.
|Prediction horizon (s)||Case I (increase over perfect)||Case II (increase over perfect)||Perfect|
|1||0.48 (+0.04)||0.48 (+0.04)||0.44|
|2||0.90 (+0.07)||0.93 (+0.10)||0.83|
|3||1.44 (+0.11)||1.55 (+0.22)||1.33|
|4||2.14 (+0.13)||2.37 (+0.36)||2.01|
|5||3.12 (+0.14)||3.53 (+0.55)||2.98|
|Average||1.62 (+0.10)||1.77 (+0.25)||1.52|
IV-H1 Case I: partially missing
In the general case, the data collected by the sensors have some missing points. To simulate this, we randomly select half of the sequences in the testing set and randomly delete 20% of the data points from each of them. Before these imperfect data are fed into our model, we use cubic Hermite interpolation to infer the values of the missing points.
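The Case I procedure can be sketched for a single coordinate sequence as follows. This is a minimal illustration rather than the preprocessing code used in the experiments: the text does not state how the tangents of the Hermite interpolant are obtained, so finite differences over the remaining samples are assumed here, the first and last samples are always kept so that no extrapolation is needed, and the function names, sampling rate, and synthetic trajectory are hypothetical.

```python
import random


def drop_points(values, fraction, rng):
    """Replace a random `fraction` of the samples with None (endpoints are kept)."""
    interior = range(1, len(values) - 1)        # assumption: endpoints never deleted
    n_drop = min(len(interior), round(fraction * len(values)))
    missing = set(rng.sample(interior, n_drop))
    return [None if i in missing else v for i, v in enumerate(values)]


def hermite_fill(times, values):
    """Fill None entries by piecewise cubic Hermite interpolation."""
    known = [i for i, v in enumerate(values) if v is not None]
    if len(known) < 2 or known[0] != 0 or known[-1] != len(values) - 1:
        raise ValueError("first and last samples must be known")
    t = [times[i] for i in known]
    p = [values[i] for i in known]
    last = len(t) - 1
    # Tangents at the known samples: central differences inside,
    # one-sided differences at both ends (an assumed choice).
    m = [(p[min(k + 1, last)] - p[max(k - 1, 0)])
         / (t[min(k + 1, last)] - t[max(k - 1, 0)]) for k in range(len(t))]
    filled = list(values)
    seg = 0                                     # index of the known sample to the left
    for i, v in enumerate(values):
        if v is not None:
            continue
        while t[seg + 1] < times[i]:
            seg += 1
        h = t[seg + 1] - t[seg]
        s = (times[i] - t[seg]) / h             # normalized position in [0, 1]
        h00 = 2 * s**3 - 3 * s**2 + 1           # cubic Hermite basis functions
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        filled[i] = (h00 * p[seg] + h10 * h * m[seg]
                     + h01 * p[seg + 1] + h11 * h * m[seg + 1])
    return filled


rng = random.Random(0)
times = [0.2 * k for k in range(16)]            # hypothetical 3 s history at 5 Hz
xs = [30.0 * tk + 0.5 * tk**2 for tk in times]  # synthetic longitudinal positions
damaged = drop_points(xs, 0.2, rng)             # delete 20% of the points
repaired = hermite_fill(times, damaged)
print("max reconstruction error:", max(abs(a - b) for a, b in zip(xs, repaired)))
```

With these tangents the interpolant reproduces a constant-velocity segment exactly, so the reconstruction error comes only from curvature in the true trajectory.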
IV-H2 Case II: totally missing
In an extreme case, a vehicle is totally undetected during the past time horizon, so it is impossible to infer reasonable data for the model. To simulate this case, we randomly select a vehicle from the input and completely discard its trajectory data.
As shown in Table VII, we compare the RMSE of our model with imperfect and perfect data. The RMSE values for imperfect data are greater than those for perfect data, but the increase is acceptable, which demonstrates the robustness of our model and its potential for practical applications. In addition, when one nearby vehicle is totally undetected during the past time horizon, the RMSE increments over the long prediction horizons (3-5 s) are much larger than those in Case I. This indicates that failing to detect one vehicle completely has a greater impact on prediction accuracy than partially missing data for several vehicles.
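The increments in Table VII can be recomputed directly from the RMSE columns. The snippet below is an illustrative check only; the values are copied from the table and the helper names are hypothetical.

```python
# RMSE values copied from Table VII, for prediction horizons of 1 to 5 s.
horizons_s = [1, 2, 3, 4, 5]
perfect = [0.44, 0.83, 1.33, 2.01, 2.98]
case_1 = [0.48, 0.90, 1.44, 2.14, 3.12]   # Case I: partially missing
case_2 = [0.48, 0.93, 1.55, 2.37, 3.53]   # Case II: totally missing


def increments(imperfect, reference):
    """RMSE increase of the imperfect-data run over the perfect-data run."""
    return [round(a - b, 2) for a, b in zip(imperfect, reference)]


def mean(xs):
    return sum(xs) / len(xs)


inc_1 = increments(case_1, perfect)
inc_2 = increments(case_2, perfect)
for h, d1, d2 in zip(horizons_s, inc_1, inc_2):
    print(f"{h} s: Case I +{d1:.2f}, Case II +{d2:.2f}")
print(f"average RMSE: Case I {mean(case_1):.2f}, "
      f"Case II {mean(case_2):.2f}, perfect {mean(perfect):.2f}")
```

The two cases start from the same increment at 1 s and then diverge: at 5 s the increase is 0.55 in Case II against 0.14 in Case I.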
V Conclusions and Future Work
In this paper, we have presented a graph-based spatial-temporal convolutional network (GSTCN) that can predict the future trajectory distributions of all vehicles in a scene simultaneously. In our method, a weighted adjacency matrix is proposed to distinguish the different effects of nearby vehicles on the target vehicle. Based on this adjacency matrix, a spatial graph convolutional module learns the spatial dependencies among vehicles. The experimental results show that our GSTCN outperforms the main previous methods, and its small model size and fast inference speed indicate that it has the potential to be deployed on autonomous vehicles.
In the future, our proposed network can be extended in several directions. Firstly, in addition to the past trajectories of surrounding vehicles, autonomous vehicles can obtain many other supplementary data (e.g., visual scene images, high-resolution maps, and vehicle-to-vehicle communication information). How to utilize these supplementary data to further improve performance is therefore worth studying. Secondly, nearby vehicles in different directions affect the target vehicle slightly differently, even at the same distance. Hence, we can study how to account for direction when constructing the weighted adjacency matrix. Finally, the aim of trajectory prediction is to enable autonomous vehicles to make optimal decisions. Thus, better motion planning algorithms based on the predictions of our GSTCN can be studied to improve traffic efficiency and safety.