Developing autonomous vehicles that can drive on public roads along with human drivers, pedestrians, cyclists, and other road users is a challenging task. Researchers have been attempting to solve this problem for many years, from the days of ALVINN Pomerleau (1989) and the DARPA Urban Challenge Montemerlo et al. (2008); Urmson et al. (2008) to exploring a variety of approaches in recent years, including end-to-end learning Chen et al. (2015) and traditional engineering stacks Levinson et al. (2011). In order to drive both safely and comfortably in the real world, one of the most important and difficult tasks for self-driving vehicles (SDVs) is to predict the future behaviors of the surrounding road users.
There has been significant research dedicated to predicting the future states of traffic actors. One common line of research attempts to independently predict each actor’s future trajectory from the scene Altché and de La Fortelle (2017); Cui et al. (2019); Deo and Trivedi (2018a); Kim et al. (2017); Luo et al. (2018); Xie et al. (2017). However, a limitation of many of these approaches is that they fail to capture the interactions among actors. For example, in the case shown in Figure 1, the future trajectory of the blue vehicle and the future trajectory of the pedestrian depend on one another. In this case, there are two possible outcomes of the interaction: the pedestrian yields to the vehicle, or, more likely, the vehicle yields to the pedestrian. Predicting the marginal future trajectory of the vehicle or the pedestrian will ignore the possible modes of interaction between the two actors, and will end up capturing an inaccurate distribution over the future trajectories.
Recently, there has been an increasing amount of work on modeling interaction in multi-agent systems with neural networks Alahi et al. (2016); Battaglia et al. (2018); Deo and Trivedi (2018b); Hoshen (2017); Kipf et al. (2018); Rhinehart et al. (2019); Sukhbaatar et al. (2016); Sun et al. (2019); Tacchetti et al. (2019); Casas et al. (2019). For example, in CommNet Sukhbaatar et al. (2016), communication protocols (interactions) between smart agents are learned in conjunction with the final prediction outcomes through a Graph Neural Network (GNN), where the communication is learned implicitly. Similarly, in SocialLSTM Alahi et al. (2016), interaction between pedestrians is captured by the social pooling operation on the hidden states of the LSTMs. As opposed to modeling interaction implicitly, Neural Relational Inference Kipf et al. (2018) models the interactions in dynamical systems as latent edge types of an interaction graph, which are learned in an unsupervised manner.
Another related approach in the autonomous driving domain is IntentNet Casas et al. (2018). In this work, the model learns discrete actions, such as “keep lane” and “left lane change” using supervision. One limitation of predicting actions instead of interactions is that it is unnatural to pose constraints or priors on a pair of actor actions, but much easier to do so with interactions. An example of such a prior is illustrated in Figure 1, where we believe that if the pedestrian goes first, then the vehicle will yield, and vice versa. By introducing the concept of pair-wise interaction, we are able to capture the fact that in this interaction pair, it is unlikely that both the vehicle and the pedestrian will go at the same time and it is instead more likely that one actor will yield to the other. Importantly, by using the future observations of the scenario to categorize interaction types during training, we can learn these pair-wise interactions, instead of independent agent-wise actions, explicitly in a supervised manner.
In this paper, we propose a supervised learning framework for the joint prediction of interactions and trajectories. Specifically, we model interactions as intermediate discrete variables that capture the long-term relative intents of the actors, such as whether one actor will yield to another. In order to learn the interaction types from labeled examples, we introduce a labeling function which uses simple heuristics to programmatically generate the labels from the future trajectories. This enables us to build a large dataset of vehicle interactions without relying on human experts for manual labeling. In addition to improving the accuracy of trajectory predictions, we show that explicitly modeling the interaction types helps capture the modes of vehicles’ future behaviors in an explainable manner. Our approach is empirically verified by experiments conducted on a large-scale dataset collected by real-world autonomous vehicles.
2 Problem Formulation
We are mainly interested in predicting the future dynamics of multi-agent systems consisting of vehicles. Our goal is to predict the future trajectories of vehicles in traffic scenes, given their observed states and some additional features describing the traffic conditions around them. In order to jointly model the dynamics of all agents in the system in a structured way, we introduce an auxiliary task of learning discrete interaction types between agents.
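We first define the state $s_i^t$ to be the 2D position and velocity of agent $i$ at time $t$, and let $s_i^{t_1:t_2}$ denote the sequence of agent $i$’s states from $t_1$ to $t_2$. For compactness, we further define $S^{t_1:t_2}$ to be the state sequences of all agents in the scene. Then, the trajectory prediction task is to predict the future states $S^{t+1:t+T}$ of all agents given observations of their past states $S^{t-T_{\text{obs}}:t}$, where $T$ is the prediction horizon and $T_{\text{obs}}$ is the length of the observed history.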
Next, we assume that there exist discrete types that summarize the modes of interaction between each pair of agents. Under this assumption, we introduce a secondary task of learning the interaction types from labeled examples. In traffic scenarios, it is often difficult to capture interactions based solely on the agents’ dynamics, and additional contextual features about the agents in the scene can be very informative. Let $a_i$ and $b_{ij}$ be agent-wise and pair-wise features that describe an individual agent $i$ (capturing basic traffic context) and the relationship between a pair of agents $(i, j)$ (capturing their relative dynamics via a compact set of high-level features), respectively. Then, the interaction prediction task is to predict the interaction label $y_{ij}$ for each ordered pair of agents $(i, j)$, given the agent-wise and pair-wise features along with the observed dynamics of the pair.
Since our primary goal is still trajectory prediction, we combine the predicted interaction types with information about the agents’ past states and provide these as inputs to the trajectory prediction module. Explicitly capturing these interaction types guides the trajectory prediction module on how to aggregate information from the other agents $j \neq i$ when predicting the future behavior of agent $i$, which ultimately leads to more accurate trajectories.
We tackle the trajectory prediction problem by jointly learning to predict both interaction types and future trajectories. The key insight is that by learning interaction types and future dynamics jointly, a model can learn to make better and more explainable predictions.
Labeling Function. Supervised learning of interaction types requires labeled examples. Instead of obtaining interaction labels from human experts, we use simple heuristics to programmatically generate the labels, similar in philosophy to Zhan et al. (2019). We extract an interaction label for each ordered pair of agents $(i, j)$ in the scene at each timestep $t$. The label is determined by the future trajectories of the two agents. Given the trajectories, the labeling function outputs: a) IGNORING if the trajectories do not intersect, b) GOING if the trajectories intersect and $i$ arrives at the intersection point before $j$, and c) YIELDING if the trajectories intersect and $i$ arrives at the intersection point after $j$.
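The labeling rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the polyline crossing test, the choice of the first crossing along agent i's path, the 0.5 s default spacing, and the tie-breaking rule are all assumptions made for illustration.

```python
from enum import Enum


class Interaction(Enum):
    IGNORING = 0
    GOING = 1
    YIELDING = 2


def _segment_intersection(p1, p2, q1, q2):
    """Return (s, u) in [0, 1]^2 with p1 + s*(p2-p1) == q1 + u*(q2-q1), else None."""
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    sx, sy = q2[0] - q1[0], q2[1] - q1[1]
    denom = rx * sy - ry * sx
    if abs(denom) < 1e-12:
        return None  # assumption: parallel or degenerate segments never cross
    qpx, qpy = q1[0] - p1[0], q1[1] - p1[1]
    s = (qpx * sy - qpy * sx) / denom
    u = (qpx * ry - qpy * rx) / denom
    if 0.0 <= s <= 1.0 and 0.0 <= u <= 1.0:
        return s, u
    return None


def label_interaction(traj_i, traj_j, dt=0.5):
    """Label the ordered pair (i, j) from future waypoints spaced dt seconds apart."""
    best = None  # (arrival time of i, arrival time of j) at the first crossing on i's path
    for a in range(len(traj_i) - 1):
        for b in range(len(traj_j) - 1):
            hit = _segment_intersection(traj_i[a], traj_i[a + 1], traj_j[b], traj_j[b + 1])
            if hit is None:
                continue
            t_i = (a + hit[0]) * dt  # linear interpolation of arrival time
            t_j = (b + hit[1]) * dt
            if best is None or t_i < best[0]:
                best = (t_i, t_j)
    if best is None:
        return Interaction.IGNORING
    # assumption: exact ties are labeled YIELDING
    return Interaction.GOING if best[0] < best[1] else Interaction.YIELDING
```

Because the label is defined on ordered pairs, swapping the arguments turns a GOING label into YIELDING whenever the paths cross and the arrival times differ.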
Graph Representation of Agent States. It is undesirable to use a global coordinate system in trajectory prediction tasks because of the high variability of the input and output coordinates Becker et al. (2019). Instead, we transform the trajectory waypoints of each agent to their individual reference frame. The current position of an agent at time $t$ is set to be the origin of the agent’s reference frame, and the coordinates of past and future trajectory points are offset by the agent’s current position and rotated by the current heading of the agent in a global frame.
When the coordinates of the past and current states of each agent are transformed from the global frame to that agent’s own coordinate frame, information about the relative positions and velocities among agents is lost. However, a model needs to know the relative configuration of the agents in order to reason about their interaction. To preserve the relationships among agents, we represent the configuration of agents at a given timestep as a state graph, in which a node represents the state of an individual agent in its reference frame, and a directed edge represents the relative state of the destination node in the source node’s reference frame.
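As a rough illustration of the frame change and the edge attributes described above, the following Python sketch translates and rotates waypoints into an agent's frame and expresses a destination agent's state in the source agent's frame. The function names are hypothetical, headings are assumed to be in radians with x pointing forward and y to the left, and the velocity on an edge is assumed to be rotated only (not differenced against the source velocity); the paper does not specify these details.

```python
import math


def to_agent_frame(points, origin, heading):
    """Offset points by the agent's position, then rotate by minus its heading."""
    c, s = math.cos(heading), math.sin(heading)
    out = []
    for x, y in points:
        dx, dy = x - origin[0], y - origin[1]
        out.append((c * dx + s * dy, -s * dx + c * dy))
    return out


def relative_state(src_pos, src_heading, dst_pos, dst_vel):
    """Edge attribute: destination position and velocity in the source agent's frame."""
    (px, py), = to_agent_frame([dst_pos], src_pos, src_heading)
    # assumption: velocities are direction vectors, so they are rotated but not translated
    (vx, vy), = to_agent_frame([dst_vel], (0.0, 0.0), src_heading)
    return px, py, vx, vy
```

For an agent at (10, 5) heading along the global +y axis, a point at (10, 7) lies two meters straight ahead, so it maps to approximately (2, 0) in the agent's frame.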
Modeling Interaction with Graph Network. We use a variant of Graph Network (GN) layers Battaglia et al. (2018) to process the state graphs and model the interactions between agents. Our GN consists of two components: an edge model, which combines the representations of each edge and its terminal nodes to output an updated edge representation, and a node model, which aggregates the representations of the edges incident to each node and outputs an updated node representation. We model different types of interaction using a separate learnable function for each type in the edge model (see Appendix for details). Given the predicted scores of the interaction types between a pair of actors, the edge model computes the sum of the outputs of these type-specific functions, weighted by the scores.
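The two components can be sketched in a few lines of plain Python. This is only a schematic under stated assumptions: features are plain lists, the per-type functions are arbitrary callables standing in for learned MLPs, inputs are combined by concatenation, and incident edges are aggregated by summation (the aggregation operator is not specified in the text). All names are hypothetical.

```python
def typed_edge_update(edge, src, dst, scores, edge_fns):
    """Score-weighted sum of the per-type edge functions applied to (edge, src, dst)."""
    inputs = edge + src + dst  # concatenate the edge and terminal-node features
    outputs = [fn(inputs) for fn in edge_fns]  # one output vector per interaction type
    dim = len(outputs[0])
    return [sum(w * out[d] for w, out in zip(scores, outputs)) for d in range(dim)]


def node_update(node, incoming_edges, node_fn, edge_dim):
    """Aggregate incident edge representations (assumed: sum), then apply the node function."""
    agg = [sum(e[d] for e in incoming_edges) for d in range(edge_dim)]
    return node_fn(node + agg)
```

With one-hot scores, the weighted sum reduces to selecting the function of the active interaction type; with soft scores, the edge representation blends the outputs of all types, which keeps the layer differentiable with respect to the predicted scores.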
Figure 2 shows the overall schematic of our joint prediction model. Our model consists of three components: 1) a trajectory encoder network, 2) an interaction prediction network, and 3) a trajectory decoder network. First, the encoder network encodes the observed past states of each agent $i$ into a hidden state $h_i$. Then, the interaction prediction network takes in the encoded states of all agents $\{h_i\}$, along with the agent-wise features $\{a_i\}$ and pair-wise features $\{b_{ij}\}$, and predicts the interaction type scores $\hat{y}_{ij}$ for every ordered pair of agents $(i, j)$, using a stack of two vanilla (untyped) GN layers. Finally, given the hidden states $\{h_i\}$, interaction type scores $\{\hat{y}_{ij}\}$, and the initial states $\{s_i^t\}$, the decoder network rolls out the future states of the agents $\{\hat{s}_i^{t+1:t+T}\}$. This module aggregates information from all actors using a stack of two typed GN layers, which employ an MLP for each learned function in the edge model.
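The loss function we use is the combination of a classification loss over the edges for predicting the discrete interaction types (edge loss) and a regression loss over the nodes for predicting the continuous future trajectories (node loss). Our complete loss function is given by
$$\mathcal{L} = \lambda \sum_{(i,j)} \mathcal{L}_{\mathrm{CE}}\left(\hat{y}_{ij}, y_{ij}\right) + \sum_{i} \mathcal{L}_{\mathrm{MSE}}\left(\hat{s}_i^{t+1:t+T}, s_i^{t+1:t+T}\right),$$
where $\mathcal{L}_{\mathrm{CE}}$ is the Cross-Entropy loss, $\mathcal{L}_{\mathrm{MSE}}$ is the Mean-Squared-Error loss, $y_{ij}$ is the interaction label of the ordered pair $(i, j)$, $s_i^{t+1:t+T}$ are the future states of agent $i$, hats denote predictions, and $\lambda$ weights the edge loss. For the supervised interaction model, we set $\lambda > 0$, and for the unsupervised interaction model, we set $\lambda = 0$.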
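To make the combined objective concrete, here is a small pure-Python sketch of an edge classification loss plus a node regression loss. It rests on several assumptions not stated in the text: the edge scores are treated as unnormalized logits, both terms are averaged rather than summed, and a single scalar named `edge_weight` multiplies the edge loss, so that setting it to zero corresponds to the unsupervised variant.

```python
import math


def cross_entropy(scores, label):
    """Negative log-likelihood of `label` under softmax(scores)."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]


def mse(pred, target):
    """Mean squared error over all waypoint coordinates of one trajectory."""
    diffs = [(p - t) ** 2 for pw, tw in zip(pred, target) for p, t in zip(pw, tw)]
    return sum(diffs) / len(diffs)


def joint_loss(edge_scores, edge_labels, node_preds, node_targets, edge_weight):
    """Weighted edge (classification) loss plus node (regression) loss."""
    edge_loss = sum(cross_entropy(s, y) for s, y in zip(edge_scores, edge_labels))
    edge_loss /= len(edge_labels)
    node_loss = sum(mse(p, t) for p, t in zip(node_preds, node_targets))
    node_loss /= len(node_targets)
    return edge_weight * edge_loss + node_loss
```

With `edge_weight` set to zero, the interaction scores receive gradient only through the trajectory loss, so the edge types become latent variables whose meaning is not tied to the heuristic labels.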
We present experiments on real-world traffic data collected by autonomous vehicles that operated in numerous locations across North America. The dataset contains trajectories of 68,878 unique vehicles in various traffic scenarios. The vehicles were tracked continuously with a sampling frequency of 2 Hz (0.5 s interval). We sample trajectories in sliding time windows of 10 seconds, and use the first 5 seconds as inputs and the last 5 seconds as prediction targets. Using the extracted trajectories, we run our labeling function to obtain the labels for every pair of agents that are less than 100 meters apart from each other. To evaluate the performance of our models, we report the mean displacement error (DPE), cross-track error (CTE), and along-track error (ATE) between the estimated and ground-truth trajectories. We present an ablation study to analyze the capability of our proposed method to capture interactions between agents. The quantitative results are summarized in Table 1.
|Method|Mean DPE (displacement)|Mean ATE (along-track)|Mean CTE (cross-track)|
|---|---|---|---|
|Baseline, no interaction|2.051|1.818|0.558|
|Graph, untyped, yielding/going edges only|1.725|1.511|0.512|
|Graph, untyped, all edges|1.713|1.491|0.523|
|Graph, oracle, yielding/going edges only|1.709|1.435|0.519|
|Graph, oracle, all edges|1.638|1.435|0.489|
|Graph, joint, supervised interaction|1.611|1.397|0.500|
|Graph, joint, unsupervised interaction|1.579|1.378|0.477|
|Trajectory prediction with map and scene context Djuric et al. (2018)|1.643|1.533|0.334|
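
For readers unfamiliar with the three error metrics reported in Table 1, the sketch below shows one common way to compute them. The exact definitions used in the experiments are not given in the text, so this is an assumption: the displacement error is the Euclidean distance between predicted and ground-truth waypoints, and the along-track and cross-track errors are the absolute components of that error parallel and perpendicular to the ground-truth direction of travel, estimated by finite differences. The function name is hypothetical.

```python
import math


def track_errors(pred, truth):
    """Mean (displacement, along-track, cross-track) error; needs >= 2 waypoints."""
    dpe = ate = cte = 0.0
    n = len(truth)
    for k in range(n):
        ex, ey = pred[k][0] - truth[k][0], pred[k][1] - truth[k][1]
        # ground-truth direction of travel at waypoint k (backward difference)
        a, b = (k - 1, k) if k > 0 else (0, 1)
        hx, hy = truth[b][0] - truth[a][0], truth[b][1] - truth[a][1]
        norm = math.hypot(hx, hy)
        if norm == 0.0:
            hx, hy = 1.0, 0.0  # assumption: fall back to the x axis for a stationary actor
        else:
            hx, hy = hx / norm, hy / norm
        dpe += math.hypot(ex, ey)
        ate += abs(ex * hx + ey * hy)   # component along the direction of travel
        cte += abs(-ex * hy + ey * hx)  # component perpendicular to it
    return dpe / n, ate / n, cte / n
```

Under this decomposition, along-track error mostly reflects timing and speed mistakes (for example, predicting that a vehicle goes when it actually yields), while cross-track error mostly reflects lateral mistakes such as choosing the wrong lane or drifting off the lane center.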
Our baseline model is an RNN encoder and decoder, which treats trajectories independently without modeling interactions between agents. In addition, we introduce two variants of our joint model. The first variant (untyped) has a single edge function to learn interaction without differentiating between types. The second variant (oracle) is modified to use ground-truth interaction types, instead of its own predictions, to predict trajectories. We also modify each variant to exclude the edges with IGNORING labels from the graph in order to see if these edges can be ignored as the name suggests.
We first demonstrate the power of graph networks to model interaction by comparing our joint model and its variants against the baseline. We observe that all of the graph models outperform the baseline on all three metrics. This suggests that the motion of vehicles is highly interdependent, and graph models can effectively capture their interactions. Next, we showcase the effect of interaction labels on trajectory prediction. The typed (oracle) variants outperform all of the untyped variants in displacement and along-track error, which suggests that our graph model benefits from the discrete modeling of interaction types. Furthermore, we can see that the typed model benefits from having information shared along the IGNORING edges.
Finally, we present the results of our full-fledged joint prediction model. Even without rich map context, our model performs comparably to Djuric et al. (2018), particularly on along-track error, which captures the temporal accuracy of the predicted trajectories. Notably, another version of the joint model trained without supervision on interaction labels (simply by zeroing out the interaction classification loss) achieves better performance than the supervised model. This implies that the heuristics used in our labeling function are not optimal, and could be improved for better trajectory prediction. Nonetheless, we observe via simulation experiments that the supervised model predicts trajectories that are consistent with the meanings of the interaction labels (see Appendix and supplementary video for details). This interpretability helps provide key insights into the model’s behavior, which is a crucial step towards building safe prediction systems for autonomous vehicles.
In this paper, we propose a graph-based model for multi-agent trajectory and interaction prediction, which explicitly models discrete interaction types using programmatically generated weak labels and typed edge models. The main advantages of our approach are: i) we can gain a boost in performance without additional labeling costs when compared to the baseline, and ii) our model can effectively capture the multi-modal behavior of interacting agents while learning semantically meaningful interaction modes.
References
- Alahi et al. (2016) Social LSTM: human trajectory prediction in crowded spaces.
- Altché and de La Fortelle (2017) An LSTM network for highway trajectory prediction. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 353–359.
- Battaglia et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv.
- Becker et al. (2019) RED: a simple but effective baseline predictor for the TrajNet benchmark. In Computer Vision – ECCV 2018 Workshops.
- Casas et al. (2019) Spatially-aware graph neural networks for relational behavior forecasting from sensor data. arXiv preprint arXiv:1910.08233.
- Casas et al. (2018) IntentNet: learning to predict intention from raw sensor data. In Conference on Robot Learning, pp. 947–956.
- Chen et al. (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730.
- Cui et al. (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 2090–2096.
- Deo and Trivedi (2018a) Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1179–1184.
- Deo and Trivedi (2018b) Convolutional social pooling for vehicle trajectory prediction. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1549–15498.
- Djuric et al. (2018) Motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819.
- Hoshen (2017) VAIN: attentional multi-agent predictive modeling. In Neural Information Processing Systems.
- Kim et al. (2017) Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 399–404.
- Kipf et al. (2018) Neural relational inference for interacting systems. In Proceedings of the 35th International Conference on Machine Learning.
- Levinson et al. (2011) Towards fully autonomous driving: systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 163–168.
- Luo et al. (2018) Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577.
- Montemerlo et al. (2008) Junior: the Stanford entry in the urban challenge. Journal of Field Robotics 25 (9), pp. 569–597.
- Pomerleau (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pp. 305–313.
- Rhinehart et al. (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. arXiv preprint arXiv:1905.01296.
- Sukhbaatar et al. (2016) Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252.
- Sun et al. (2019) Predicting the present and future states of multi-agent systems from partially-observed visual data. In International Conference on Learning Representations.
- Tacchetti et al. (2019) Relational forward models for multi-agent learning. In International Conference on Learning Representations.
- Urmson et al. (2008) Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics 25 (8), pp. 425–466.
- Xie et al. (2017) Vehicle trajectory prediction by integrating physics- and maneuver-based approaches using interactive multiple models. IEEE Transactions on Industrial Electronics 65 (7), pp. 5999–6008.
- Zhan et al. (2019) Generating multi-agent trajectories using programmatic weak supervision. In International Conference on Learning Representations.
6.1 Qualitative Evaluation on Simulated Data
In order to understand the influence of the edge types on the final trajectory predictions generated by our model, we simulated several very simple dynamic interactions between two actors. We generated simulated historical states for the actors and then injected fixed values for the edge scores to analyze how the predicted trajectories change as a function of the injected edge types. We visualized the predicted trajectories from the baseline model, oracle model with all edges, supervised joint model, and unsupervised joint model while injecting several different combinations of interaction types. We compiled the results of these simulations into a short video, which is available at https://www.youtube.com/watch?v=n5RNRDdoPoQ.
In these simulations, we find that all of the graph models learn to rely heavily on the edge type in order to predict the future trajectories of the two actors. We can effectively control how the interaction plays out simply by injecting different edge types (e.g., yielding/going vs. going/yielding). The interaction modes of the oracle and supervised models correspond directly to the labeled categories that we provide. Interestingly, the interaction modes learned by the unsupervised model encode a similar set of categories, but seem to capture the leading/following relationship separately from the going/yielding relationship.
6.2 Qualitative Evaluation on Real Data
Next, we looked at some examples of real-world scenes and qualitatively evaluated our model’s performance on these cases. Table 2 illustrates some examples of trajectory predictions for actors driving in three-way and four-way intersections. The trajectories are predicted from time $t$ to $t + T$, where $T$ is the prediction horizon. The different colors indicate trajectory predictions for different actors. Each dot shows a single predicted waypoint (at 0.5-second intervals), and the more transparent the dot is, the further away it is in the future (i.e., the further its timestamp is from the current time, $t$).
We observe several interesting patterns in these examples. First, note that the map in the figures is purely for illustration; we do not provide map information directly to the model. Nevertheless, the graph models are able to learn the lane directions and drivable surfaces to some degree by observing the histories of the other vehicles. Second, we notice that the model that uses only the YIELDING/GOING edges (second column) is substantially worse at capturing lane-following behavior than all other models (see rows (a), (b), (c), and (e) for examples), suggesting that the IGNORING edges are useful for transmitting implicit map information from actor to actor. Third, if we compare the supervised and unsupervised models (last two columns), we observe that the unsupervised model is slightly worse at predicting lane-following behavior than the supervised model (see rows (c), (d), and (e) for examples). We also notice that the supervised model appears to predict fewer conflicts between trajectories than the unsupervised model (an example can be seen in row (b), where the supervised model clearly predicts the yellow actor to yield to the blue actor, but the unsupervised model predicts that both will go at the same time). Lastly, we see that in case (e), the red merging actor may be equally likely to turn left or right, and because the current model is uni-modal (i.e., it only predicts the single most likely future trajectory), it is not able to model such discrete modes. This suggests two directions for future work: (1) incorporating map elements (traffic signals, traffic signs, lane segments, sidewalks) as nodes in the graph; (2) adding multi-modality to the model.
|ground truth|graph, oracle, yielding/going only|graph, oracle, all edges|graph, joint, supervised|graph, joint, unsupervised|
6.3 Typed Graph Network Architecture
Here, we describe the variant of Graph Networks Battaglia et al. (2018) used in our model architecture. A Graph Network (GN) layer propagates information between the nodes and edges to output a new graph with updated representations for each node and each edge. Following the notation in Battaglia et al. (2018), a graph is defined by nodes $\{v_i\}$ and directed edges $\{(e_k, r_k, s_k)\}$, where $v_i$ and $e_k$ are node and edge attributes, and $r_k$ and $s_k$ are the indices of the receiver and sender nodes of edge $k$. We extend the original formulation by defining edges with discrete types and an update function for the typed edges. Assuming $C$ distinct edge types, let $\tau_k = (\tau_k^1, \dots, \tau_k^C)$ be the one-hot encoding or the scores of the types for edge $k$. Then, the typed edge update function outputs the updated edge attribute $e'_k = \sum_{c=1}^{C} \tau_k^c \, \phi^e_c(e_k, v_{r_k}, v_{s_k})$, where $\phi^e_c$ is a learnable function for each edge type $c$. Additional details can be found in Figure 3.
6.4 Additional Quantitative Comparison
In Figure 4, we further compare our supervised joint model to the baseline from Djuric et al. (2018) by measuring the trajectory errors at different time horizons (1 second, 3 seconds, 5 seconds). The results indicate that our approach is worse than the baseline in terms of cross-track error, which is expected because Djuric et al. (2018) provide a rasterized bird’s eye view of the map as an input to the model, and we do not have the same map and scene context. However, we also see that our method is better than the baseline in terms of along-track error, which highlights the value of explicitly capturing temporal interactions such as going and yielding for the trajectory prediction problem.