Multi-agent trajectory prediction is critical in many real-world applications, such as autonomous driving, mobile robot navigation and other areas where a group of entities interact with each other, giving rise to complicated behavior patterns at the level of both individuals and the multi-agent system as a whole. Since usually only the trajectories of individual entities are available without any knowledge of the underlying interaction patterns, and there are usually multiple possible modalities for each agent, it is challenging to model such dynamics and forecast their future behaviors.
A number of existing works have tried to provide a systematic solution to multi-agent interaction modeling. Related techniques include, but are not limited to, social pooling layers, attention mechanisms [32, 13, 10, 31], and message passing over fully-connected graphs [6, 28]. These techniques can be summarized as implicit interaction modeling by information aggregation. Another line of research explicitly performs inference over the structure of the latent interaction graph, which allows for a static discrete structure with multiple interaction types. Our proposed approach falls into this category, but with significant extensions and performance enhancements over existing methods.
The most related work is NRI, in which the interaction graph is static during training and the nodes are homogeneous. This is sufficient for systems with fixed interaction patterns that involve a single type of agent. However, in many real-world applications the interactions are time-varying, sometimes with abrupt changes (e.g. basketball players). There may also be heterogeneous types of agents (e.g. cars, pedestrians, cyclists) involved in the system, which NRI does not distinguish explicitly. Moreover, NRI outputs only a single Gaussian distribution, which cannot capture the multi-modality of the future. In this work, therefore, we address the problems of 1) extracting the underlying interaction patterns with a graph structure that handles different types of agents in a unified way and evolves over time, 2) predicting future trajectories based on the history information and the extracted interaction graph, such that the model captures the dynamics of interaction graph evolution, and 3) capturing the multi-modality of future trajectories.
The main contributions of this paper are summarized as follows:
We propose a generic trajectory forecasting framework with explicit interaction modeling via a latent graph among multiple heterogeneous, interactive agents. The framework can incorporate both trajectory information and context information (e.g. scene images, point cloud density maps).
We propose a dynamic mechanism to evolve the underlying interaction graph adaptively along time, which captures the dynamics of interaction patterns among multiple agents. We also introduce a double-stage training pipeline which not only improves training efficiency and accelerates convergence, but also enhances model performance in terms of prediction error.
The proposed framework is designed to capture the uncertainty and multi-modality of future trajectories in nature, which is more informed than single-modal prediction and thus more beneficial to potential downstream tasks such as decision and planning. The proposed graph evolution mechanism can enhance multi-modality.
We validate the proposed framework on multiple trajectory forecasting benchmarks in different areas, where it achieves state-of-the-art performance. Detailed experimental results and analysis are provided.
2 Related Work
2.1 Trajectory and Behavior Prediction
The problem of trajectory prediction has been considered as modeling behaviors among a group of interactive agents. Earlier work introduced social forces to model the attractive and repulsive motion of humans with respect to their neighbors. Other learning-based approaches were also proposed, such as hidden Markov models [33, 18], dynamic Bayesian networks, and inverse reinforcement learning [30, 37]. In recent years, the concept has been extended to better model social behavior with supplemental cues such as motion patterns [39, 36] and group attributes [21, 35]. Such social models have motivated the recent data-driven methods in [1, 14, 5, 34, 7, 38, 15, 9, 2, 4, 40, 19, 25, 26, 29, 23, 16]. They encode the motion history of individual entities using the recurrent operation of neural networks. However, these methods struggle to find acceptable future motions in heterogeneous environments, partly due to their heuristic feature pooling for interaction modeling.
2.2 Interaction Modeling and Graph Networks
Interaction modeling and relational reasoning have been widely studied in various fields. Recently, deep neural networks applied to graph structures have been employed to formulate a connection between interactive agents [32, 19, 13, 17]. These methods introduce nodes to represent interactive agents and edges to express their interactions with each other. They directly learn the evolving dynamics of node attributes (agents' states) and/or edge attributes (relations between agents) by constructing spatio-temporal graphs. However, these models have no explicit knowledge about the underlying interaction patterns. Some existing works (e.g. NRI) have taken a step towards explicit relational reasoning by inferring a latent interaction graph. However, it is nontrivial for NRI to deal with heterogeneous agents, context information and systems with time-varying interactions. In this work, we present an effective solution to handle the aforementioned issues.
3 Problem Formulation
The objective of this work is to forecast future trajectories for multiple heterogeneous, interactive agents based on historical state information and/or context information. Without loss of generality, we assume there are $N$ heterogeneous agents in the scene, which belong to $M$ categories (e.g. cars, cyclists, pedestrians). The number of agents may vary in different cases. We denote a set of trajectories covering the historical and forecasting horizons ($T_h$ and $T_f$) as $X_{1:T} = \{x^i_{1:T},\ T = T_h + T_f,\ i = 1, \ldots, N\}$, where $x^i_t$ is the 2D coordinate of agent $i$ at time $t$, in the world space or image pixel space in the scope of this paper. We also denote a sequence of historical context information (images or tensors) as $C_{1:T_h}$ for dynamic scenes, or fixed context information $C$ for static scenes. For simplicity, we use $C$ when referring to the context information in the equations. The future information is accessible during the training stage. We aim to estimate the conditional distribution $p(X_{T_h+1:T_h+T_f} \mid X_{1:T_h}, C_{1:T_h})$ for dynamic scenes or $p(X_{T_h+1:T_h+T_f} \mid X_{1:T_h}, C)$ for static scenes. The predicted distribution is desired to be multi-modal to represent uncertainty.
4.1 Framework Overview
An illustrative graphical model is shown in Fig. 1 to demonstrate the essential procedures of the prediction framework. Instead of end-to-end training in a single pipeline, our training process contains two consecutive stages:
Static interaction graph learning: An encoder is trained to extract interaction patterns from the observed trajectories and context information and generate a distribution of static latent interaction graphs. A decoder is trained to recurrently generate multi-modal distributions of future states.
Dynamic interaction graph learning: The well-trained encoder and decoder during the first stage are utilized as an initialization, which are finetuned together with the training of a recurrent network which captures the dynamics of interaction graph evolution. The recurrent unit can be treated as a highly flexible integration which takes past graphs into consideration.
The number of agents can be flexible in different cases without changing the model complexity, due to the function sharing and the property of permutation invariance of graph representation.
4.2 Static Interaction Graph
In this stage, the goal is to simultaneously learn an encoder that extracts the underlying interaction patterns as a distribution of latent graphs from the historical information, and a decoder that outputs a sequence of multi-modal distributions of future states based on the encoded interaction graph and historical information. We introduce the details of the encoding / decoding processes in the following.
4.2.1 Observation Graph
A fully-connected graph without self-loops is constructed to represent the observed information with node/edge attributes, which is called the observation graph. Assume that there are $N$ heterogeneous agents in the scene, which belong to $M$ categories. Then the observation graph consists of $N$ agent nodes and one context node. Agent nodes are bidirectionally connected to each other, and the context node only has outgoing edges to each agent node. We denote an observation graph as $\mathcal{G}_{obs} = \{\mathcal{V}_{obs}, \mathcal{E}_{obs}\}$, where $\mathcal{V}_{obs} = \{v_i,\ i \in \{1, \ldots, N\}\} \cup \{v_c\}$ and $\mathcal{E}_{obs} = \{e_{ij},\ i, j \in \{1, \ldots, N\},\ i \neq j\} \cup \{e_{ic},\ i \in \{1, \ldots, N\}\}$. $v_i$, $v_c$ and $e_{ij}$, $e_{ic}$ denote the agent node attribute, the context node attribute and the agent-agent and context-agent edge attributes, respectively. Each agent node has two types of attributes: a self-attribute and a social-attribute. The former only contains the node's own state information, while the latter only contains other nodes' state information. The calculations of node/edge attributes are given by
where $\alpha_{ij}$ are learnable attention coefficients, $f_a(\cdot)$ and $f_c(\cdot)$ are the agent and context node embedding functions, and $f_e(\cdot)$, $f_{ec}(\cdot)$ and $f_v(\cdot)$ are the agent-agent edge, agent-context edge, and agent node update functions, respectively. Different types of nodes (agents) use different embedding functions. Note that the attributes of the context node are never updated and the edge attributes only serve as intermediates for the update of agent node attributes. These functions are implemented by deep networks with proper architectures, which are multi-layer perceptrons (MLPs) in our experiments. At this stage, we obtain a complete set of node/edge attributes which include the information of direct (first-order) interaction. Higher-order interactions can be modeled by multiple loops of equations (4)-(5), in which the social node attributes and edge attributes are updated in turn.
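The aggregation step described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the attention coefficients are obtained by a softmax over per-edge scores so that they sum to one, and that the social attribute is the resulting weighted sum of incoming edge attributes. The names `softmax`, `social_attribute` and the score values are hypothetical and are not taken from the paper; the learned MLP update functions are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def social_attribute(edge_attrs, scores):
    """Attention-weighted sum of incoming edge attributes for one agent node.

    edge_attrs: list of equal-length vectors, one per incoming agent-agent edge.
    scores: unnormalized attention scores (hypothetical; learned in practice).
    """
    alphas = softmax(scores)  # assumed normalization: coefficients sum to 1
    dim = len(edge_attrs[0])
    return [sum(a * e[d] for a, e in zip(alphas, edge_attrs)) for d in range(dim)]

# Three neighbours with 2-D edge attributes and equal scores: a plain average.
edges = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
agg = social_attribute(edges, [0.0, 0.0, 0.0])
```

With equal scores the aggregation reduces to the mean of the edge attributes; unequal scores let some neighbours dominate the social attribute.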
4.2.2 Interaction Graph
The interaction graph has no node/edge attributes; instead, it represents interaction patterns with a distribution of edge types for each edge. We set a hyperparameter $L$ to denote the number of possible edge types (interaction types) between pairwise agent nodes to model agent-agent interactions. Also, there is another edge type that is shared between the context node and all agent nodes to model agent-context interactions. Note that “no edge” can also be treated as a special edge type, which implies that there is no message passing along such edges. More formally, the interaction graph is a discrete probability distribution $q(\mathcal{G} \mid X_{1:T_h}, C_{1:T_h})$ or $q(\mathcal{G} \mid X_{1:T_h}, C)$, where $\mathcal{G} = \{z_{ij},\ i, j \in \{1, \ldots, N\}\} \cup \{z_{ic},\ i \in \{1, \ldots, N\}\}$ is a set of discrete random variables to indicate pairwise interaction types.
The goal of the encoding process is to infer a latent interaction graph from the observation graph, which is essentially a multi-class edge classification task. We employ a softmax function with a continuous approximation of the discrete distribution on the last updated edge attributes to obtain the probability of each edge type, which is given by
$$q(z_{ij} \mid X_{1:T_h}, C) = \mathrm{Softmax}\big((e_{ij} + g) / \tau\big),$$
where $g$ is a vector of i.i.d. samples drawn from the $\mathrm{Gumbel}(0, 1)$ distribution and $\tau$ is the softmax temperature, which controls the sample smoothness. We also use the reparameterization trick to obtain gradients for backpropagation. The edge type between agent nodes and the context node is hard-coded with probability one. For simplicity, we summarize all the operations in the observation graph and the encoding process as $q(z \mid X_{1:T_h}, C) = f_{enc}(X_{1:T_h}, C)$.
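As an illustration of the continuous relaxation described above, the following is a minimal sketch of Gumbel-softmax sampling for a single edge, written with the Python standard library only. The function names and the example logits are hypothetical; in the framework the logits would be the last updated edge attributes produced by the encoder.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + g) / temperature)."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
edge_logits = [2.0, 0.5, -1.0]  # hypothetical scores for 3 edge types
soft = gumbel_softmax(edge_logits, temperature=0.5, rng=rng)

# Same noise, different temperatures: a lower temperature gives a sharper sample.
sharp = gumbel_softmax(edge_logits, 0.1, random.Random(1))
smooth = gumbel_softmax(edge_logits, 5.0, random.Random(1))
```

The output is a probability vector over edge types. As the temperature approaches zero the sample approaches a one-hot vector, while larger temperatures yield smoother samples.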
Since in many real-world applications the state of agents has long-term dependence, a recurrent decoding process is applied to the interaction graph and observation graph to approximate the distribution of future trajectories $p(X_{T_h+1:T_h+T_f} \mid \mathcal{G}, X_{1:T_h}, C)$. The output at each time step is $K$ Gaussian distributions and their corresponding weights. The detailed operations in the decoding process consist of two stages: a burn-in stage ($1 \leq t \leq T_h$) and a prediction stage ($T_h + 1 \leq t \leq T_h + T_f$). In the last operation, equation (16), samples are drawn from the categorical distribution defined by the component weights.
In these operations, $\tilde{h}^i_t$ is the hidden state of $\mathrm{GRU}^i$ at time $t$, and $w^{i,k}_t$ is the weight of the $k$th Gaussian distribution at time step $t$ for agent $i$. $\tilde{f}^l_e(\cdot)$ is the edge update function of edge type $l$, $f^k_{weight}(\cdot)$ is a mapping function to get the weight of the $k$th Gaussian distribution, and $f^k_{out}(\cdot)$ is a mapping function to get the mean of the $k$th Gaussian component. Note that equation (12) requires the predicted state $\hat{x}^j_t$, whereas the previous decoder step only provides its distribution. We therefore first sample a Gaussian component from the categorical distribution given by the weights. If the $k$th component is drawn, we use its mean as $\hat{x}^j_t$, which is the most likely posterior trajectory point in this situation. The nodes (agents) of the same type share the same GRU decoder. During the burn-in stage, the ground-truth states are used as input, while during the prediction stage the state prediction hypotheses are used iteratively as the input at the next time step. For simplicity, the whole decoding process is summarized as $p(X_{T_h+1:T_h+T_f} \mid \mathcal{G}, X_{1:T_h}, C) = f_{dec}(\mathcal{G}, X_{1:T_h}, C)$.
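The sampling step at the end of each decoding iteration can be sketched as follows. This is a simplified illustration under stated assumptions: the mixture weights and component means are given as plain Python lists, and the helper names `sample_component` and `next_input` are hypothetical. The GRU update and the mapping functions that would produce the weights and means are omitted.

```python
import random

def sample_component(weights, rng):
    """Sample an index k from a categorical distribution with the given weights."""
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for k, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return k
    return len(weights) - 1  # guard against floating-point round-off

def next_input(weights, means, rng):
    """Pick one Gaussian component and return its mean as the next-step input."""
    k = sample_component(weights, rng)
    return k, means[k]

rng = random.Random(0)
weights = [0.7, 0.2, 0.1]                      # mixture weights for one agent
means = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # 2D means of the K = 3 components
k, x_hat = next_input(weights, means, rng)
```

Because the component is sampled rather than chosen greedily, repeated rollouts can follow different components and therefore different modes.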
4.3 Dynamic Interaction Graph
In many applications, the interaction patterns computed from the past time steps are unlikely to remain static in the future. Instead, they evolve dynamically throughout the future time steps. A single static interaction graph is not sufficient to model such situations, especially those with abrupt changes. Moreover, many interaction systems are multi-modal by nature, and different modalities are likely to correspond to different interaction patterns, so a single interaction pattern is not appropriate for predicting all the modalities. Therefore, we introduce an effective dynamic mechanism to evolve the interaction graph.
The encoding process is repeated once per re-encoding gap, a fixed number of time steps, to obtain the latent interaction graph based on the latest observation graph. Since the new interaction graph also depends on the previous ones, we need to consider their effects as well. Therefore, a recurrent unit (GRU) is utilized to maintain and propagate the history information, as well as to adjust the prior interaction graphs. More formally, let $\beta$ be the re-encoding index starting from 0, $\mathcal{G}_\beta$ the interaction graph obtained from the static encoding process, $\mathcal{G}'_\beta$ the adjusted interaction graph with time dependence, and $H_\beta$ the hidden state of the GRU. The GRU combines $\mathcal{G}_\beta$ with its hidden state to produce $\mathcal{G}'_\beta$.
After obtaining $\mathcal{G}'_\beta$, the decoding process is applied to get the states of the following time steps, up to the next re-encoding.
The decoding and re-encoding processes are iterated to obtain the distribution of future trajectories.
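The alternation between re-encoding and decoding can be sketched as a simple control loop. The sketch below is only a structural illustration: `encode`, `evolve` and `decode` are hypothetical callables standing in for the static encoder, the graph-evolving GRU and the recurrent decoder, and the sliding observation window is an assumption about how the "latest observation graph" is formed. The toy stand-ins operate on 1D positions and a scalar "graph" purely so that the loop can be executed.

```python
def rollout(history, horizon, gap, encode, evolve, decode):
    """Alternate re-encoding and decoding until `horizon` future steps exist.

    encode(window) -> graph                     : static encoding of the latest window
    evolve(graph, hidden) -> (graph', hidden')  : recurrent adjustment (a GRU in the text)
    decode(graph', window, n) -> list           : the next n predicted states
    """
    window = list(history)
    hidden = None
    future, graphs = [], []
    while len(future) < horizon:
        graph = encode(window)
        graph, hidden = evolve(graph, hidden)
        graphs.append(graph)
        n = min(gap, horizon - len(future))
        steps = decode(graph, window, n)
        future.extend(steps)
        # Slide the observation window forward over the new predictions.
        window = (window + steps)[-len(history):]
    return future, graphs

# Toy stand-ins (not the paper's networks): the "graph" is just the last velocity.
def encode(window):
    return window[-1] - window[-2]

def evolve(graph, hidden):
    new = graph if hidden is None else 0.5 * graph + 0.5 * hidden
    return new, new

def decode(graph, window, n):
    return [window[-1] + graph * (i + 1) for i in range(n)]

future, graphs = rollout([0.0, 1.0, 2.0], horizon=10, gap=5,
                         encode=encode, evolve=evolve, decode=decode)
```

With a horizon of 10 steps and a re-encoding gap of 5, the graph is encoded twice, once at the start of prediction and once after the first five predicted steps.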
4.4 Diverse Trajectory Generation
Due to the uncertainty of human intention and interaction outcomes, the prediction model should capture the multi-modality of human behaviors and generate diverse prediction hypotheses that represent various possible behavior patterns. Therefore, in our decoding process, instead of outputting a deterministic trajectory at every step, we output several Gaussian distributions and their corresponding weights, indicating that there are several possible modalities for the next-step output. We choose only a single Gaussian distribution as the next-step output. The choice is based on the weights, which give the probability of each modality at the next step. This is slightly different from a traditional mixture density network, since we set a fixed variance and use a slightly modified loss function instead of the traditional negative log-likelihood in our training process, as shown in Section 4.5.1.
However, directly training such a model tends to collapse to a single mode. Therefore, we introduce an effective mechanism to mitigate the mode collapse issue and encourage diverse trajectory generation. At each decoding step, we sample one Gaussian distribution from the candidate Gaussian distributions and use it for iterative decoding. Using different Gaussian distributions and locations leads to different trajectories afterwards, which enables our model to generate multiple trajectories. Therefore, in our training process, we run our model multiple times to generate a set of possible trajectories for each agent in every scenario, and we back-propagate only through the trajectory with the minimal loss, since it is the most likely to be in the same mode as the ground truth. The other predicted trajectories may have a much higher loss, but this does not necessarily mean that they are wrong; they may still represent plausible modalities. Back-propagating their loss is therefore not appropriate.
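The selection of the minimal-loss trajectory can be sketched as follows. As a simplifying assumption, a sum of squared errors stands in for the modified loss of Section 4.5.1, and the candidate trajectories are given directly instead of being produced by repeated model runs. The function names are hypothetical.

```python
def squared_error(pred, truth):
    """Sum of squared distances between two trajectories of 2D points."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, truth))

def best_of_many(candidates, truth):
    """Return (index, loss) of the sampled trajectory closest to the ground truth."""
    losses = [squared_error(c, truth) for c in candidates]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]

truth = [(1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (0.0, 2.0)],  # turns left: plausible, but a different mode
    [(1.0, 0.1), (2.0, 0.1)],  # goes straight: same mode as the ground truth
]
idx, loss = best_of_many(candidates, truth)
```

Only the selected candidate would contribute to the gradient, so hypotheses that follow other plausible modes are not penalized for deviating from the single observed future.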
4.5 Loss Function and Training
4.5.1 Loss Function
In our training process, we are trying to maximize the conditional posterior likelihood. Our loss function is defined as follows:
where $q(\cdot)$ denotes the encoding and re-encoding operations, which return a factorized distribution of the edge types $z_{ij}$, and $p(\cdot)$ denotes a certain Gaussian distribution.
In our experiments, we first train the encoding/decoding functions using a static interaction graph. Then, when training the dynamic interaction graph, we use the encoding/decoding functions trained in the first stage to initialize the parameters of the corresponding modules. This step is reasonable since the encoding/decoding functions play the same role in both training processes and their optima are expected to be close. In contrast, training dynamic graphs directly leads to a longer convergence time and is likely to get trapped in bad local optima due to the large number of learnable parameters. The double-stage scheme can therefore accelerate the whole training process and help avoid some bad local optima.
5 Experiments
5.1 Datasets
In this paper, we used three benchmark datasets: the Honda 3D Dataset (H3D), the NBA SportVU Dataset (NBA), and the Stanford Drone Dataset (SDD). H3D is a large-scale full-surround 3D multi-object detection and tracking dataset, which provides point cloud information and trajectory annotations for heterogeneous traffic participants (e.g. cars, trucks, cyclists and pedestrians). The NBA dataset was collected by the NBA with the SportVU tracking system and contains the trajectory information of all ten players and the ball in real games. SDD contains a set of top-down-view images and the corresponding trajectories of the involved entities, collected in multiple scenarios on a university campus full of interactive pedestrians, cyclists and vehicles.
5.2 Evaluation Metrics and Baselines
We evaluate the model performance in terms of average displacement error (ADE), defined as the average distance between the predicted trajectories and the ground truth over all the involved entities within the prediction horizon, as well as final displacement error (FDE), defined as the deviated distance at the last predicted time step. For the H3D and NBA datasets, we predicted the future 10 time steps (4.0s) based on the historical 5 time steps (2.0s). For the SDD dataset, we predicted the future 12 time steps (4.8s) based on the historical 8 time steps (3.2s). We compared the performance of our proposed approach with the following baseline approaches: Constant Velocity Model (CVM), Probabilistic LSTM (P-LSTM), Social LSTM (S-LSTM), Social GAN (S-GAN), Social Attention (S-ATT), DESIRE, Gated-RN, Trajectron++ and NRI. Please refer to the reference papers for more details.
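The two metrics can be computed as in the following minimal sketch. It assumes Euclidean distance in 2D, trajectories stored as nested lists indexed by agent and then by time step, and an FDE that is averaged over agents; the function name and the toy data are hypothetical.

```python
import math

def displacement_errors(pred, truth):
    """Return (ADE, FDE) for trajectories shaped [agent][time] -> (x, y)."""
    dists = [[math.dist(p, t) for p, t in zip(pa, ta)]
             for pa, ta in zip(pred, truth)]
    n_points = sum(len(row) for row in dists)
    ade = sum(sum(row) for row in dists) / n_points   # mean over agents and steps
    fde = sum(row[-1] for row in dists) / len(dists)  # mean over agents, last step
    return ade, fde

truth = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]]
pred = [[(0.0, 0.0), (1.0, 3.0)], [(0.0, 1.0), (0.0, 1.0)]]
ade, fde = displacement_errors(pred, truth)
```

In this toy case the per-point distances are 0 and 3 for the first agent and 1 and 0 for the second, so the ADE is 1.0 and the FDE is 1.5.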
5.3 Implementation Details
A batch size of 32 was used and the models were trained for up to 10 epochs during the static graph learning stage and up to 50 epochs during the dynamic graph learning stage with early stopping. We used Adam optimizer with an initial learning rate of 0.001. The models were trained on a single TITAN X GPU. We used a split of 65%, 10%, 25% as training, validation and testing data.
Table 1: ADE / FDE (meters) on the H3D dataset. Upper part: baseline methods. Lower part: ablative model settings.
| 1.0s | 0.18 / 0.26 | 0.29 / 0.45 | 0.26 / 0.41 | 0.27 / 0.37 | 0.18 / 0.32 | 0.21 / 0.34 | 0.24 / 0.30 |
| 2.0s | 0.34 / 0.60 | 0.53 / 0.96 | 0.49 / 0.92 | 0.45 / 0.77 | 0.32 / 0.64 | 0.33 / 0.62 | 0.32 / 0.60 |
| 3.0s | 0.52 / 1.03 | 0.87 / 1.62 | 0.72 / 1.53 | 0.68 / 1.29 | 0.49 / 1.03 | 0.46 / 0.93 | 0.48 / 0.94 |
| 4.0s | 0.74 / 1.54 | 1.21 / 2.56 | 1.01 / 2.32 | 0.94 / 1.91 | 0.69 / 1.56 | 0.71 / 1.63 | 0.73 / 1.56 |
| Time | Static Graph (same node type) | Static Graph | Re-encoding w/o GRU | Dynamic Graph (single stage) | Dynamic Graph (double stage) |
| 1.0s | 0.28 / 0.37 | 0.27 / 0.35 | 0.25 / 0.32 | 0.24 / 0.31 | 0.23 / 0.29 |
| 2.0s | 0.40 / 0.58 | 0.38 / 0.55 | 0.35 / 0.50 | 0.33 / 0.46 | 0.31 / 0.44 |
| 3.0s | 0.51 / 0.80 | 0.48 / 0.76 | 0.44 / 0.70 | 0.40 / 0.60 | 0.39 / 0.58 |
| 4.0s | 0.64 / 1.21 | 0.61 / 1.14 | 0.57 / 1.07 | 0.50 / 0.90 | 0.48 / 0.86 |
5.4 Quantitative Analysis
We provide quantitative analysis for each dataset in the following.
H3D Dataset: The comparison of results is shown in Table 1, where the unit of reported ADE and FDE is meters in the world coordinates. Note that we included cars, trucks, cyclists and pedestrians in the experiments. The CVM performs the best in short-term prediction (1.0s), which is reasonable since the change of velocity can be ignored during a short interval. Learning-based models, in contrast, may sacrifice a little short-term performance for better long-term prediction. Another potential reason is that learning-based models may capture some subtle patterns from data, which complicates short-term behaviors. All the other baseline methods consider the relations and interactions among agents. The S-LSTM uses social pooling layers to fuse the information of different agents. The S-ATT employs spatial attention mechanisms, while the S-GAN is a generative model which learns the data distribution. The Gated-RN and Trajectron++ both leverage spatio-temporal information for relational reasoning, which leads to smaller prediction error. The NRI infers a latent interaction graph and learns the dynamics of agents, which achieves similar performance to Trajectron++. Our proposed method achieves the best performance, which demonstrates the advantages of explicit interaction modeling via evolving interaction graphs. The 4.0s ADE/FDE are significantly reduced by 30.4%/44.9% compared to the best baseline approach (Gated-RN).
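The reported reductions are relative improvements over the baseline error. As a worked check, the snippet below recomputes them from the 4.0s entries of Table 1, assuming that 0.69 / 1.56 is the Gated-RN entry and 0.48 / 0.86 is the double-stage dynamic graph entry (an inference from matching the stated percentages, since the baseline column headers are not available here).

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# 4.0s ADE / FDE on H3D: assumed Gated-RN entry vs. double-stage dynamic graph.
ade_gain = relative_reduction(0.69, 0.48)
fde_gain = relative_reduction(1.56, 0.86)
```

Rounded to one decimal place, these evaluate to 30.4 and 44.9, which agrees with the percentages quoted above.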
NBA Dataset: The comparison of results is shown in Table 2, where the unit of reported ADE and FDE is meters in the world coordinates. Note that we included both players and the basketball in the experiments. Since basketball players are highly interactive and their behaviors often change suddenly in reaction to other players, the CVM performs much worse than learning-based baselines. The P-LSTM has better performance than CVM since it learns from data to predict future trajectories based on each agent's history information independently. The other baselines all consider the relations and interactions among agents with different strategies, such as soft attention mechanisms, social pooling layers, and graph-based representation. Owing to the dynamic interaction modeling by the evolving interaction graph, our method achieves significantly better performance than the state of the art, reducing the 4.0s ADE/FDE by 40.5%/42.2% with respect to the best baseline (NRI).
Table 2: ADE / FDE (meters) on the NBA dataset. Upper part: baseline methods. Lower part: ablative model settings.
| 1.0s | 1.47 / 2.72 | 1.40 / 2.32 | 1.28 / 2.00 | 1.36 / 2.00 | 1.20 / 1.84 | 1.12 / 1.60 | 1.04 / 1.52 |
| 2.0s | 2.72 / 4.12 | 2.48 / 3.88 | 2.16 / 3.44 | 2.24 / 3.76 | 2.08 / 3.36 | 1.76 / 2.96 | 1.68 / 2.80 |
| 3.0s | 4.01 / 6.44 | 3.56 / 6.04 | 2.96 / 4.96 | 3.12 / 5.36 | 2.80 / 4.80 | 2.48 / 4.24 | 2.32 / 4.00 |
| 4.0s | 5.40 / 9.04 | 4.88 / 7.72 | 3.76 / 6.64 | 4.00 / 7.12 | 3.60 / 6.24 | 3.12 / 5.60 | 2.96 / 5.12 |
| Time | Static Graph (same node type) | Static Graph | Re-encoding w/o GRU | Dynamic Graph (single stage) | Dynamic Graph (double stage) |
| 1.0s | 1.04 / 1.60 | 0.88 / 1.36 | 0.96 / 1.36 | 0.72 / 1.12 | 0.56 / 0.80 |
| 2.0s | 1.84 / 2.88 | 1.44 / 2.40 | 1.44 / 2.32 | 1.04 / 1.76 | 0.80 / 1.20 |
| 3.0s | 2.24 / 3.76 | 2.00 / 3.44 | 2.00 / 3.36 | 1.52 / 2.80 | 1.20 / 1.92 |
| 4.0s | 2.80 / 4.88 | 2.56 / 4.56 | 2.48 / 4.08 | 2.16 / 3.76 | 1.76 / 3.04 |
Table 3: ADE / FDE (pixels) on the SDD dataset at a prediction horizon of 4.8s.
| 33.2 / 56.4 | 28.7 / 44.4 | 33.3 / 55.9 | 35.4 / 57.6 | 27.0 / 43.9 | 24.3 / 40.1 | 25.6 / 43.7 | 15.3 / 27.9 |
SDD Dataset: The comparison of results is shown in Table 3, where the unit of reported ADE and FDE is pixels in the image coordinates. Note that we included all types of agents in the experiments, although most of them are pedestrians. Our proposed method achieves the best performance. The 4.8s ADE/FDE are reduced by 37.0%/30.4% compared to the best baseline approach (Trajectron++).
Analysis on Edge Types and Re-encoding Gap: We also provide a comparison of ADE/FDE (in meters) and test running time on the NBA dataset to demonstrate the effect of different numbers of edge types and re-encoding gaps. Fig. 3(a) shows that as the number of edge types increases, the prediction error first decreases to a minimum and then increases. This implies that too many edge types may lead to overfitting, since some edge types may capture subtle patterns in the data that reduce generalization ability. Cross-validation is therefore needed to determine the number of edge types. Fig. 3(b) shows that the prediction error increases consistently as the re-encoding gap increases, which implies that more frequent re-identification of the underlying interaction pattern indeed helps when the pattern evolves over time. However, we need to trade off between the prediction error and the test running time if online prediction is required. The variance of ADE / FDE in both figures is small, which implies that the model performance is stable with random initialization and various settings in multiple experiments.
5.5 Qualitative Analysis
We qualitatively evaluate the prediction hypotheses of typical testing cases on the H3D and NBA datasets in Fig. 2.
H3D Dataset: Fig. 2(a) and Fig. 2(b) show two random samples from the H3D results, which indicate that our framework can generate accurate trajectories. More specifically, in Fig. 2(a), the blue prediction hypothesis at the bottom right shows an abrupt change at the fifth step. This is because the interaction graph evolved at this step (the re-encoding gap was set to 5 in this case). Moreover, the heatmap shows multiple possible trajectories starting from this point, which correspond to multiple possible modalities. These results show that the evolving interaction graph can reinforce the multi-modal property of our model, since different samples of trajectories at the previous steps lead to different directions of graph evolution, which significantly influences the subsequent prediction. Fig. 2(b) shows a roundabout scenario. Intuitively, each car may leave the roundabout at any exit. Our model successfully captures the modalities of exiting the roundabout and staying in it. Moreover, when exiting the roundabout, the cars are predicted to exit on their right in most cases, which shows that the modalities predicted by our model are not arbitrary, but plausible and reasonable.
NBA Dataset: Fig. 2(c) and Fig. 2(d) show two random samples from our results. First, we can tell that in such cases the ball follows a player most of the time, which implies that the predicted results represent plausible situations. Second, most prediction hypotheses are very close to the ground truth, and even the predictions that differ from the ground truth represent plausible behavior. Third, the heatmaps show that our model can successfully predict most reasonable future trajectories and their multi-modal distributions. More specifically, in Fig. 2(c), the player of the green team in the middle moves forward quickly during the historical steps, while our model successfully predicts that the player will suddenly stop, since he is surrounded by many opponents and he is not carrying the ball. In Fig. 2(d), our model shows three pairs of players from different teams competing against each other for chances. The defending team is closer to the basket, and the player carrying the ball is running quickly towards the basket while two opponents try to defend him. Such a case is very common in basketball games. In general, our model not only achieves high accuracy but also understands and predicts most moving, stopping, offensive and defensive behaviors in basketball games.
5.6 Ablative Analysis
We conducted ablative analysis on the H3D and NBA datasets to demonstrate the effectiveness of heterogeneous node types, dynamic interaction graph and two-stage graph learning. The best ADE / FDE of each model setting are shown in the lower parts of Table 1 and Table 2. We first introduce the five ablative model settings and provide a detailed analysis afterwards.
Static Graph (same agent node type): This is the simplest model setting, where only a single interaction graph is extracted based on the history information. The same node embedding function is shared among all the nodes.
Static Graph: This setting is similar to the last one, except that different node embedding functions are applied to different types of agent nodes.
Re-encoding w/o GRU: The interaction graph is re-encoded once per re-encoding gap, using only the static encoding process without recurrent units.
Dynamic Graph (single stage): This is our whole model, where the encoding, decoding functions and the graph evolving GRU are all trained from scratch.
Dynamic Graph (double stage): This is our whole model with double stage interaction graph learning, where the encoding, decoding functions obtained from the first stage are employed as an initialization in the second stage.
Static Graph (same agent node type) vs. Static Graph: We show the effectiveness of distinguishing agent node types. According to the prediction results in Table 1 and Table 2, utilizing distinct agent-node embedding functions for different agent types achieves consistently smaller ADE/FDE than a universal embedding function. The reason is that different types of agents have distinct behavior patterns or feasibility constraints. For example, the trajectories of on-road vehicles are restricted by roadways, traffic rules and physical constraints, while the restrictions on pedestrian behaviors are much fewer. Moreover, since vehicles usually have to yield to pedestrians at intersections, it is helpful to indicate agent types explicitly in the model. The 4.0s ADE/FDE are reduced by 4.7%/5.8% on the H3D dataset and 8.6%/6.6% on the NBA dataset.
Static Graph vs. Re-encoding w/o GRU: The two settings achieve very similar performance, which is reasonable since they share the same data information and model architecture with an identical number of parameters. Although the re-encoding process is applied during the prediction, it cannot capture the dynamics of graph evolution, so the improvement of model performance is quite limited.
Dynamic Graph (single stage) vs. Dynamic Graph (double stage): We show the effectiveness and necessity of double-stage dynamic graph learning. The double-stage training scheme leads to remarkable improvement in terms of ADE/FDE on both datasets. During the first training stage, the encoding/decoding functions are trained to a local optimum at which they are able to extract a proper static interaction graph. According to empirical findings, the encoding/decoding functions are sufficiently good as an initialization for the second training stage after several epochs of training. During the second training stage, the encoding/decoding functions are initialized from the first stage and finetuned, along with the training of the graph evolution GRU. This leads to faster convergence and better performance, since it may help avoid some bad local optima at which the loss function may get stuck if all the components are randomly initialized. With the same hyperparameters, the single-stage/double-stage training took about 25/14 epochs to reach their smallest validation loss on the NBA dataset and 41/26 epochs on the H3D dataset. Compared to single-stage training, the 4.0s ADE/FDE of double-stage training are reduced by 18.5%/19.2% on the NBA dataset and 9.4%/12.2% on the H3D dataset.
In this paper, we present a generic trajectory forecasting framework with explicit interaction modeling among multiple heterogeneous, interactive agents using a graph representation. Multiple types of context information (e.g. static / dynamic, scene images / point cloud density maps) can be incorporated in the framework together with the trajectory information. In order to capture the underlying dynamics of the evolution of interaction patterns, we propose a dynamic mechanism to evolve the interaction graph, which is trained in two consecutive stages. The double-stage training mechanism can speed up convergence as well as enhance prediction performance. The method is able to capture the multi-modality of future behaviors. The proposed framework is validated on multiple trajectory forecasting benchmarks for different applications and achieves state-of-the-art performance in terms of prediction accuracy. In future work, we will extend the framework adaptively to handle prediction tasks that involve a time-varying number of agents.
-  Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 961–971 (2016)
-  Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449 (2019)
-  Choi, C., Dariush, B.: Looking to relations for future trajectory forecast. arXiv preprint arXiv:1905.08855 (2019)
-  Deo, N., Rangesh, A., Trivedi, M.M.: How would surround vehicles move? a unified framework for maneuver classification and motion prediction. IEEE Transactions on Intelligent Vehicles 3(2), 129–140 (2018)
-  Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social gan: Socially acceptable trajectories with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2255–2264 (2018)
-  Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., Kanai, R.: Permutation-equivariant neural networks applied to dynamics prediction. arXiv preprint arXiv:1612.04530 (2016)
-  Hasan, I., Setti, F., Tsesmelis, T., Del Bue, A., Galasso, F., Cristani, M.: Mx-lstm: mixing tracklets and vislets to jointly forecast trajectories and head poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6067–6076 (2018)
-  Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Physical review E 51(5), 4282 (1995)
-  Hong, J., Sapp, B., Philbin, J.: Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8454–8462 (2019)
-  Hoshen, Y.: Vain: Attentional multi-agent predictive modeling. In: Advances in Neural Information Processing Systems. pp. 2701–2711 (2017)
-  Kasper, D., Weidl, G., Dang, T., Breuel, G., Tamke, A., Wedel, A., Rosenstiel, W.: Object-oriented bayesian networks for detection of lane change maneuvers. IEEE Intelligent Transportation Systems Magazine 4(3), 19–31 (2012)
-  Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687 (2018)
-  Kosaraju, V., Sadeghian, A., Martín-Martín, R., Reid, I., Rezatofighi, H., Savarese, S.: Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In: Advances in Neural Information Processing Systems. pp. 137–146 (2019)
-  Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H., Chandraker, M.: Desire: Distant future prediction in dynamic scenes with interacting agents. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 336–345 (2017)
-  Li, J., Ma, H., Tomizuka, M.: Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning. In: 2019 IEEE International Conference on Robotics and Automation (ICRA). IEEE (2019)
-  Li, J., Ma, H., Zhan, W., Tomizuka, M.: Conditional generative neural system for probabilistic trajectory prediction. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (2019)
-  Li, J., Ma, H., Zhang, Z., Tomizuka, M.: Social-wagdat: Interaction-aware trajectory prediction via wasserstein graph double-attention network. arXiv preprint arXiv:2002.06241 (2020)
-  Li, J., Zhan, W., Hu, Y., Tomizuka, M.: Generic tracking and probabilistic prediction framework and its application in autonomous driving. IEEE Transactions on Intelligent Transportation Systems (2019)
-  Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., Manocha, D.: Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 6120–6127 (2019)
-  Maddison, C.J., Mnih, A., Teh, Y.W.: The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712 (2016)
-  Moussaïd, M., Perozo, N., Garnier, S., Helbing, D., Theraulaz, G.: The walking behaviour of pedestrian social groups and its impact on crowd dynamics. PloS one 5(4) (2010)
-  Patil, A., Malla, S., Gang, H., Chen, Y.T.: The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In: International Conference on Robotics and Automation (2019)
-  Rhinehart, N., McAllister, R., Kitani, K., Levine, S.: Precog: Prediction conditioned on goals in visual multi-agent settings. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2821–2830 (2019)
-  Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: Human trajectory understanding in crowded scenes. In: European conference on computer vision. pp. 549–565. Springer (2016)
-  Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, H., Savarese, S.: Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1349–1358 (2019)
-  Sadeghian, A., Legros, F., Voisin, M., Vesel, R., Alahi, A., Savarese, S.: Car-net: Clairvoyant attentive recurrent network. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 151–167 (2018)
-  Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control. arXiv preprint arXiv:2001.03093 (2020)
-  Santoro, A., Raposo, D., Barrett, D.G., Malinowski, M., Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural network module for relational reasoning. In: Advances in neural information processing systems. pp. 4967–4976 (2017)
-  Su, S., Peng, C., Shi, J., Choi, C.: Potential field: Interpretable and unified representation for trajectory prediction. arXiv preprint arXiv:1911.07414 (2019)
-  Sun, L., Zhan, W., Tomizuka, M.: Probabilistic prediction of interactive driving behavior via hierarchical inverse reinforcement learning. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC). pp. 2111–2117. IEEE (2018)
-  Van Steenkiste, S., Chang, M., Greff, K., Schmidhuber, J.: Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353 (2018)
-  Vemula, A., Muelling, K., Oh, J.: Social attention: Modeling attention in human crowds. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 1–7. IEEE (2018)
-  Wang, W., Xi, J., Zhao, D.: Learning and inferring a driver’s braking action in car-following scenarios. IEEE Transactions on Vehicular Technology 67(5), 3887–3899 (2018)
-  Xu, Y., Piao, Z., Gao, S.: Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5275–5284 (2018)
-  Yamaguchi, K., Berg, A.C., Ortiz, L.E., Berg, T.L.: Who are you with and where are you going? In: CVPR 2011. pp. 1345–1352. IEEE (2011)
-  Yi, S., Li, H., Wang, X.: Understanding pedestrian behaviors from stationary crowd groups. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3488–3496 (2015)
-  Zhan, W., Sun, L., Hu, Y., Li, J., Tomizuka, M.: Towards a fatality-aware benchmark of probabilistic reaction prediction in highly interactive driving scenarios. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC). pp. 3274–3280. IEEE (2018)
-  Zhang, P., Ouyang, W., Zhang, P., Xue, J., Zheng, N.: Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12085–12094 (2019)
-  Zhang, Y., Qin, L., Yao, H., Huang, Q.: Abnormal crowd behavior detection based on social attribute-aware force model. In: 2012 19th IEEE International Conference on Image Processing. pp. 2689–2692. IEEE (2012)
-  Zhao, T., Xu, Y., Monfort, M., Choi, W., Baker, C., Zhao, Y., Wang, Y., Wu, Y.N.: Multi-agent tensor fusion for contextual trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12126–12134 (2019)
7 Illustrative Diagram of the Decoding Process
We provide an illustrative diagram of the decoding process in Fig. 4. Without loss of generality, we demonstrate the decoding process for only one node in a five-node observation graph. Fig. 4(a) shows the observation graph; we choose the node on the right as an example. Fig. 4(b) shows the process of using MLPs to process a specific edge, together with the probability of the edge belonging to a certain edge type. The processed edges are shown in red. Fig. 4(c) shows the sum over every incoming edge attribute of this node. We then input the result into the decoding GRU, which outputs several Gaussian distributions and their corresponding weights. We sample one specific Gaussian distribution based on the weights and use the mean of the sampled Gaussian distribution as the output state at this step. This output state is used as the input into the next decoding step (if it is not the burn-in step). We iterate the decoding process until the desired prediction horizon is reached.
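The decoding procedure described above can be sketched in a few lines of NumPy. This is only an illustration under several stated assumptions, not the actual implementation: the weights are random stand-ins for trained parameters, each edge function is a single ELU layer rather than a full MLP, the sampled mean is used directly as the next state, only the chosen target node is rolled out while the other nodes are held fixed, and the burn-in steps are omitted. All names and sizes (`aggregate_incoming`, `decode_step`, `N`, `D`, `H`, `K_TYPES`, `K_MIX`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: nodes, state dim, hidden size, edge types, mixture components.
N, D, H, K_TYPES, K_MIX = 5, 2, 8, 2, 3


def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Random stand-ins for trained parameters (assumption, for illustration only).
W_edge = rng.normal(scale=0.1, size=(K_TYPES, 2 * D, H))  # one edge function per type
W_g = rng.normal(scale=0.1, size=(3, H, H))               # GRU input weights
U_g = rng.normal(scale=0.1, size=(3, H, H))               # GRU recurrent weights
W_mu = rng.normal(scale=0.1, size=(H, K_MIX * D))          # mixture means
W_pi = rng.normal(scale=0.1, size=(H, K_MIX))              # mixture weights (logits)


def aggregate_incoming(states, edge_probs, target):
    """Sum edge-type-weighted edge attributes over all edges into `target`."""
    msg = np.zeros(H)
    for src in range(len(states)):
        if src == target:
            continue
        pair = np.concatenate([states[src], states[target]])
        for k in range(K_TYPES):
            msg += edge_probs[src, target, k] * elu(pair @ W_edge[k])
    return msg


def gru_cell(x, h):
    """Minimal GRU cell with update gate z, reset gate r and candidate n."""
    z = sigmoid(x @ W_g[0] + h @ U_g[0])
    r = sigmoid(x @ W_g[1] + h @ U_g[1])
    n = np.tanh(x @ W_g[2] + (r * h) @ U_g[2])
    return (1.0 - z) * n + z * h


def decode_step(states, edge_probs, hidden, target):
    """One decoding step: aggregate edges, update GRU, sample one Gaussian."""
    hidden = gru_cell(aggregate_incoming(states, edge_probs, target), hidden)
    means = (hidden @ W_mu).reshape(K_MIX, D)
    logits = hidden @ W_pi
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    comp = rng.choice(K_MIX, p=weights)  # sample a component by its weight
    return means[comp], hidden, weights  # the mean is the output state


states = rng.normal(size=(N, D))
# Edge-type probabilities; each edge's probabilities sum to one.
edge_probs = rng.dirichlet(np.ones(K_TYPES), size=(N, N))
hidden = np.zeros(H)
target, horizon = 4, 3

trajectory = []
for _ in range(horizon):
    next_state, hidden, weights = decode_step(states, edge_probs, hidden, target)
    states[target] = next_state  # feed the output back as the next input
    trajectory.append(next_state)
trajectory = np.stack(trajectory)
```

The loop at the end mirrors the iterative rollout: the output state of each step replaces the node's current state before the next step, until the prediction horizon is reached.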
8 Additional Framework Details
The multi-layer perceptron (MLP) is a frequently used building block in our model. Every MLP used in our model is a three-layer MLP with ELU as the activation function. More specifically, the hidden size of the node MLPs was 256 and that of the edge MLPs was 512. We also employed GRU units in our decoding and re-encoding processes. The hidden size of both GRUs was 256.
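For concreteness, a three-layer ELU MLP with the hidden sizes stated above might look like the following NumPy sketch. Several details are assumptions rather than facts from the text: "three-layer" is read as three linear layers with two hidden layers of the stated size, ELU is applied after the hidden layers only, and the input and output dimensions (64, 512 and 256 below) are placeholders. The helper names `init_mlp`, `mlp_forward` and `count_params` are hypothetical.

```python
import numpy as np


def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Three linear layers: in -> hidden -> hidden -> out (assumed layout)."""
    dims = [in_dim, hidden_dim, hidden_dim, out_dim]
    return [
        (rng.normal(scale=1.0 / np.sqrt(a), size=(a, b)), np.zeros(b))
        for a, b in zip(dims[:-1], dims[1:])
    ]


def mlp_forward(params, x):
    """Apply the layers, with ELU after every layer except the last."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = elu(x)
    return x


def count_params(params):
    return sum(W.size + b.size for W, b in params)


rng = np.random.default_rng(0)
node_mlp = init_mlp(in_dim=64, hidden_dim=256, out_dim=256, rng=rng)   # hidden size 256
edge_mlp = init_mlp(in_dim=512, hidden_dim=512, out_dim=256, rng=rng)  # hidden size 512

node_out = mlp_forward(node_mlp, rng.normal(size=(5, 64)))
edge_out = mlp_forward(edge_mlp, rng.normal(size=(20, 512)))
```

Under these placeholder dimensions, `count_params` gives a quick way to compare the sizes of the node and edge MLPs; the actual parameter counts depend on the true input and output dimensions.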