Architecture Details for CVPR 19 paper: Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
Accurate prediction of others' trajectories is essential for autonomous driving. Trajectory prediction is challenging because it requires reasoning about agents' past movements, social interactions among varying numbers and kinds of agents, constraints from the scene context, and the stochasticity of human behavior. Our approach models these interactions and constraints jointly within a novel Multi-Agent Tensor Fusion (MATF) network. Specifically, the model encodes multiple agents' past trajectories and the scene context into a Multi-Agent Tensor, then applies convolutional fusion to capture multiagent interactions while retaining the spatial structure of agents and the scene context. The model decodes recurrently to multiple agents' future trajectories, using adversarial loss to learn stochastic predictions. Experiments on both highway driving and pedestrian crowd datasets show that the model achieves state-of-the-art prediction accuracy.
Human drivers continually anticipate the behavior of nearby vehicles and pedestrians in order to plan safe and comfortable interactive motions that avoid conflict with others. Autonomous vehicles (AVs) must likewise predict the trajectories of others in order to proactively plan for future interactions before they occur, rather than reactively respond to unanticipated outcomes after they occur, which can lead to unsafe behaviors such as sudden hard braking, or failure to execute maneuvers in dense traffic. Fundamentally, trajectory prediction allows autonomous vehicles to reason about the possible future situations they will encounter, to evaluate the risk of a given plan relative to these predicted situations, and to select a plan which minimizes that risk. This adds a layer of interpretability to the system that is critical for debugging and verification.
Trajectory prediction is challenging because agents’ motions are stochastic, and dependent on their goals, social interactions with other agents, and the scene context. Predictions must generalize to new situations, where the number and configuration of other agents are not fixed in advance. Encoding this information is difficult for neural-network-based approaches, because standard NN architectures prefer fixed input, output, and parameter dimensions, while for the prediction task these dimensions vary. Previous work has addressed this issue using either agent-centric or spatial-centric encodings. Agent-centric encodings apply aggregation functions on multiple agents’ feature vectors, while spatial-centric approaches operate directly on top-down representations of the scene.
We propose a novel Multi-Agent Tensor Fusion (MATF) encoder-decoder architecture, which combines the strengths of agent- and spatial-centric approaches within a flexible network that can be trained in an end-to-end fashion to represent all the relevant information about the social and scene context. The Multi-Agent Tensor representation, illustrated in Fig. 1, spatially aligns an encoding of the scene with encodings of the past trajectory of every agent in the scene, which maintains the spatial relationships between agents and scene features. Next, a fused Multi-Agent Tensor encoding is formed via a fully convolutional mapping (see Fig. 2), which naturally learns to capture the spatial locality of interactions between multiple agents and the environment, as in agent-centric approaches, and preserves the spatial layout of all agents within the fused Multi-Agent Tensor in a spatial-centric manner.
Our model decodes the comprehensive social and contextual information encoded by the fused Multi-Agent Tensor into predictions of the trajectories of all agents in the scene simultaneously. Real-world behavior is not deterministic – agents can perform multiple maneuvers from the same context (e.g. follow lane or change lane), and the same maneuver can vary in execution in terms of velocity and orientation profile. We use conditional generative adversarial training [12, 23] to capture this uncertainty over predicted trajectories, representing the distribution over trajectories with a finite set of samples.
We conduct experiments on both driving datasets and pedestrian crowd datasets. Experimental results are reported on the publicly available NGSIM driving dataset, the Stanford Drone pedestrian crowd dataset, the ETH-UCY crowd datasets [21, 27], and a private, recently collected Massachusetts driving dataset. Quantitative and qualitative ablative experiments are conducted to show the contribution of each part of the model, and quantitative comparisons with recent approaches show that the proposed approach achieves state-of-the-art accuracy in both highway driving and pedestrian trajectory prediction.
Traditional methods for predicting or classifying trajectories model various kinds of interactions and constraints by hand-crafted features or cost functions [3, 5, 6, 8, 15, 22, 32]. Early methods based on inverse optimal control also use hand-crafted cost features, and learn linear weighting functions to rationalize trajectories which are assumed to be generated by optimal control. Recent data-driven approaches based on deep networks [1, 4, 9, 10, 13, 19, 20, 24, 28, 29, 31] outperform traditional approaches. Most of this work focuses either on modeling constraints from the scene context or on modeling social interactions among multiple agents [1, 9, 10, 13, 31]; a smaller fraction of work considers both aspects [4, 20, 28].
Agent-centric NN-based approaches integrate information from multiple agents by applying aggregation functions on multiple agents’ feature vectors output from recurrent units. Social LSTM runs max pooling over state vectors of nearby agents within a predefined distance range, but does not model social interaction with far-away agents. Social GAN contributes a new pooling mechanism over all the agents involved in a scene globally, and uses adversarial training to learn a stochastic, generative model of human behavior. Although these kinds of max pooling aggregation functions handle varying numbers of agents well, permutation invariant functions may discard information when input agents lose their uniqueness. In contrast, Social Attention and Sophie address the heterogeneity of social interaction among different agents by attention mechanisms [2, 30] and spatial-temporal graphs. Attention mechanisms encode which other agents are most important to focus on when predicting the trajectory of a given agent. However, attention-based approaches are very sensitive to the number of agents included: predicting $n$ agents has $O(n^2)$ computational complexity. In contrast, our approach captures multiagent interactions while maintaining $O(n)$ computational complexity.
The agent-centric approaches discussed above do not make use of spatial relationships among agents directly. As an alternative, spatial-centric approaches retain the spatial structure of agents and the scene context throughout their representations. Convolutional Social Pooling partially retains the spatial structure of agents’ locations by forming a social tensor which is similar to our Multi-Agent Tensor representation, but much of this spatial information is later aggregated by several bottleneck layers. This approach does not encode the scene context, and only a single agent’s trajectory can be predicted with each forward pass, which is potentially too slow for real-time trajectory prediction of multiple agents. Chauffeur Net proposes a novel method to retain the spatial structure of agents and the scene by directly operating on the spatial feature map of agents and the scene context. In this approach, agents are represented as bounding boxes and do not have independent recurrent encoding units. In contrast, our model encodes multiple agents’ feature vectors via recurrent units while simultaneously retaining the spatial structure of agents and the scene throughout the reasoning process.
Many data-driven approaches learn to predict deterministic future trajectories of agents by minimizing reconstruction loss [1, 29]. However, human behavior is inherently stochastic. Recent approaches address this by predicting a distribution over future trajectories, either by combining Variational Auto-Encoders and Inverse Optimal Control, or with conditional Generative Adversarial Nets [13, 28]. GAIL-GRU uses generative adversarial imitation learning to learn a stochastic policy that reproduces human expert driving behavior. R2P2 proposes a novel cost function to encourage enhancement in both precision and diversity of the learned predictive distribution. Other approaches predict a set of possible trajectories, instead of a single deterministic trajectory, by conditioning on possible maneuver classes [9, 10].
In this section, we describe the Multi-Agent Tensor Fusion (MATF) encoder, and the decoder architecture for trajectory prediction. The network is shown in Fig. 2. The network takes as input 1) the past trajectories of multiple dynamic interacting agents, and 2) a scene containing a static context, which is represented from an overhead perspective and can either be a segmented image containing all static objects, or a bird’s-eye view raw image. The network outputs the predicted future trajectories of all agents in the scene.
There are two parallel encoding streams in the MATF architecture. One encodes the past trajectories of each individual agent independently using single agent LSTM encoders, and another encodes the static scene context image with a CNN. Each LSTM encoder shares the same set of parameters, so the architecture is invariant to the number of agents in the scene. The outputs of the LSTM encoders are 1-D agent state vectors without temporal structure. The output of the scene context encoder CNN is a scaled feature map retaining the spatial structure of the bird’s-eye view static scene context image.
Next, the two encoding streams are concatenated spatially into a Multi-Agent Tensor. Agent encodings are placed into one bird’s-eye view spatial tensor, which is initialized to 0 and is of the same shape (width and height) as the encoded scene image. The dimension axis of the encodings fits into the channel axis of the tensor as shown in Fig. 1. The agent encodings are placed into the spatial tensor with respect to their positions at the last time step of their past trajectories. This tensor is then concatenated with the encoded scene image in the channel dimension to get a combined tensor. If multiple agents are placed into the same cell in the tensor due to discretization, element-wise max pooling is performed.
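The placement and concatenation steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_multi_agent_tensor`, the nested-list layout (height x width x channels), the channel order (scene first, then agents), and the premise that agent positions have already been discretized to grid cells are all hypothetical.

```python
from typing import List, Sequence, Tuple


def build_multi_agent_tensor(
    agent_vectors: Sequence[Sequence[float]],
    agent_cells: Sequence[Tuple[int, int]],
    scene_features: List[List[List[float]]],
) -> List[List[List[float]]]:
    """Return an H x W x (C_scene + C_agent) grid stored as nested lists.

    agent_vectors[i] is the encoded state of agent i, agent_cells[i] is the
    (row, col) cell of its last observed position, and scene_features is the
    H x W x C_scene encoded scene map (assumed layout).
    """
    height, width = len(scene_features), len(scene_features[0])
    agent_dim = len(agent_vectors[0])

    # Spatial tensor for agents, initialized to 0.
    agent_grid = [[[0.0] * agent_dim for _ in range(width)] for _ in range(height)]
    occupied = set()
    for vec, (row, col) in zip(agent_vectors, agent_cells):
        if (row, col) in occupied:
            # Several agents fall into one cell: element-wise max pooling.
            agent_grid[row][col] = [
                max(a, b) for a, b in zip(agent_grid[row][col], vec)
            ]
        else:
            agent_grid[row][col] = list(vec)
            occupied.add((row, col))

    # Concatenate scene channels and agent channels cell by cell.
    return [
        [list(scene_features[r][c]) + agent_grid[r][c] for c in range(width)]
        for r in range(height)
    ]


# Two agents share cell (0, 1); a third sits alone in cell (1, 2).
scene = [[[1.0] for _ in range(3)] for _ in range(2)]
tensor = build_multi_agent_tensor(
    agent_vectors=[[0.5, -1.0], [0.2, 3.0], [-0.4, 0.1]],
    agent_cells=[(0, 1), (0, 1), (1, 2)],
    scene_features=scene,
)
```

Tracking occupied cells separately keeps an agent's negative features from being clipped against the zero initialization; pooling only happens when two real encodings collide.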
The Multi-Agent Tensor is fed into fully convolutional layers, which learn to represent interactions among multiple agents and between agents and the scene context, while retaining spatial locality, to produce a fused Multi-Agent Tensor. Specifically, these layers operate at multiple spatial resolution scale levels by adopting U-Net-like architectures to model interaction at different spatial scales. The output feature map of this fused model has exactly the same width and height as the encoded scene image, to retain the spatial structure of the encoding.
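To illustrate why a convolutional mapping keeps the spatial layout while mixing information between neighboring cells, the sketch below applies a single zero-padded 3x3 convolution to a one-channel grid. It is not the U-Net-like fusion network itself; the function name `conv3x3_same` and the all-ones kernel are illustrative assumptions.

```python
from typing import List


def conv3x3_same(grid: List[List[float]], kernel: List[List[float]]) -> List[List[float]]:
    """Zero-padded 3x3 cross-correlation; output has the same height and width."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Cells outside the grid count as zero (padding).
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += kernel[dr + 1][dc + 1] * grid[rr][cc]
            out[r][c] = acc
    return out


# A single occupied cell at (1, 1) influences only its 3x3 neighborhood.
grid = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
ones = [[1.0] * 3 for _ in range(3)]
fused = conv3x3_same(grid, ones)
```

Stacking such layers, or operating at coarser resolutions as a U-Net-like design does, widens the neighborhood over which agents can influence one another.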
To decode each agent’s predicted trajectory, agent-specific representations with fused interaction features for each agent are sliced out according to their coordinates from the fused Multi-Agent Tensor output (Fig. 2). These agent-specific representations are then added as a residual to the original encoded agent vectors to form final agent encoding vectors, which encode all the information from the past trajectories of the agents themselves, the static scene context, and the interaction features among multiple agents. In this way, our approach allows each agent to get a different social and contextual embedding focused on itself. Importantly, the model gets these embeddings for multiple agents using shared feature extractors instead of operating $n$ times for $n$ agents.
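The slicing and residual step can be sketched as below. The function name `final_agent_encodings` and the nested-list layout are hypothetical, and the sketch assumes the fused tensor has as many channels as an encoded agent vector so that the residual addition is well defined.

```python
from typing import List, Sequence, Tuple


def final_agent_encodings(
    fused: List[List[List[float]]],
    agent_cells: Sequence[Tuple[int, int]],
    agent_vectors: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Slice each agent's cell from the fused H x W x C tensor and add it
    as a residual to that agent's original encoded vector."""
    finals = []
    for (row, col), vec in zip(agent_cells, agent_vectors):
        sliced = fused[row][col]
        if len(sliced) != len(vec):
            raise ValueError("fused channels must match the agent vector size")
        finals.append([s + v for s, v in zip(sliced, vec)])
    return finals


# One fused forward pass serves every agent; only the lookup is per agent.
fused_tensor = [[[0.1, 0.2], [1.0, 2.0]]]  # 1 x 2 grid, 2 channels
encodings = final_agent_encodings(fused_tensor, [(0, 1)], [[0.5, -1.0]])
```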
Finally, for each agent in the scene, its final vector is decoded to future trajectory prediction by LSTM decoders. Similar to the encoders for each agent, parameters are shared to guarantee that the network can generalize well when the number of agents in the scene varies.
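The whole architecture is fully differentiable and can be trained end-to-end to minimize the reconstruction loss between the predicted future trajectories $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_n\}$ and the observed ground-truth future trajectories $Y = \{y_1, \dots, y_n\}$:

$$\mathcal{L}_{L^p}(\hat{Y}, Y) = \sum_{i=1}^{n} \sum_{t} \lVert \hat{y}_i^t - y_i^t \rVert_p,$$

where $\hat{y}_i^t$ and $y_i^t$ are the predicted and ground-truth positions of agent $i$ at future timestep $t$, and $p \in \{1, 2\}$ indicates that we can use either the $L^1$ or $L^2$ distance between two positions for reconstruction error.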
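As a concrete reading of the reconstruction loss, the sketch below sums the $L^1$ or $L^2$ distance between predicted and ground-truth positions over all agents and future timesteps. Summing rather than averaging, the function name `reconstruction_loss`, and the representation of trajectories as lists of (x, y) tuples are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def reconstruction_loss(pred: List[Trajectory], truth: List[Trajectory], p: int = 2) -> float:
    """Sum of per-position L1 (p=1) or L2 (p=2) distances over agents and timesteps."""
    if p not in (1, 2):
        raise ValueError("p must be 1 or 2")
    total = 0.0
    for pred_traj, true_traj in zip(pred, truth):
        for (px, py), (tx, ty) in zip(pred_traj, true_traj):
            dx, dy = px - tx, py - ty
            total += (abs(dx) + abs(dy)) if p == 1 else math.hypot(dx, dy)
    return total


# One agent, two future steps; only the second step is off, by (3, 4).
predicted = [[(0.0, 0.0), (3.0, 4.0)]]
observed = [[(0.0, 0.0), (0.0, 0.0)]]
loss_l2 = reconstruction_loss(predicted, observed, p=2)  # Euclidean distance
loss_l1 = reconstruction_loss(predicted, observed, p=1)  # Manhattan distance
```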
We use conditional generative adversarial training [12, 23] to learn a stochastic generative model that captures the multimodal uncertainty of our predictions. GANs consist of two networks, a generator $G$ and a discriminator $D$, competing against each other. $G$ learns the distribution of the data and generates samples, while $D$ learns to distinguish the feasibility or infeasibility of the generated samples. These networks are simultaneously trained in a two-player min-max game framework.
In our setting, we use a conditional $G$ to generate future trajectories of multiple agents, conditioning on all the agents’ past trajectories, the static scene context, and random noise input to create stochastic outputs. Simultaneously, we use $D$ to distinguish whether the generated trajectories are real (ground truth) or fake (generated). Both $G$ and $D$ share exactly the same architecture in their encoding parts with the deterministic model presented in Section 3.1, to reason about static scene context and interaction among multiple agents spatially. Both $G$ and $D$ are initialized with parameters from the trained deterministic model introduced in previous subsections. Detailed architectures and losses are described below.
Generator (G) observes the past trajectories of all the agents in a given scene, and the static scene context. It jointly outputs the predicted future trajectories by decoding the final agent vectors described in Section 3.2, each concatenated with a Gaussian white noise vector $z$. The architecture is exactly the same as presented in previous subsections, except that in the deterministic model, the final encoding for a given agent is concatenated with a constant vector in place of $z$ to decode into its future trajectory, while in $G$, $z$ is sampled from a Gaussian distribution.
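The difference between the deterministic and stochastic decoder inputs can be sketched as follows. The helper name `decoder_input`, the unit-variance Gaussian, and in particular the choice of an all-zero vector for the deterministic case are assumptions for illustration; the text above only states that the deterministic model does not sample its vector.

```python
import random
from typing import List, Optional, Sequence


def decoder_input(
    final_vec: Sequence[float],
    noise_dim: int,
    stochastic: bool,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Concatenate a final agent encoding with a noise vector z."""
    rng = rng or random.Random()
    if stochastic:
        # Generator: z is drawn from a Gaussian (unit variance assumed here).
        z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    else:
        # Deterministic model: a fixed vector (zeros are an assumption).
        z = [0.0] * noise_dim
    return list(final_vec) + z


deterministic_in = decoder_input([1.0, 2.0], noise_dim=3, stochastic=False)
sampled_in = decoder_input([1.0, 2.0], noise_dim=3, stochastic=True, rng=random.Random(0))
```

Keeping the input size identical in both cases is what lets the generator be initialized from the trained deterministic model.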
Discriminator (D) observes the ground truth past trajectories of all the agents in a given static scene context, combined either with all generated future trajectories or all ground truth future trajectories. It outputs a label for the future trajectory of each agent in the scene, indicating whether that trajectory is fake or real. $D$ shares nearly the same architecture as presented in previous subsections, except for the following differences: (1) its single agent LSTM encoders take in past and future trajectories as input instead of just past trajectories; (2) as a classifier, it does not use an LSTM to decode the final agent vector to a future trajectory. Instead, final agent encodings are fed into fully connected layers to be classified as real or fake.
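Losses. The adversarial loss for a given scene is:

$$\mathcal{L}_{GAN}(G, D) = \sum_{i \in S} \left[ \log D\left(y_i^{\text{real}}\right) + \log\left(1 - D\left(\hat{y}_i^{\text{fake}}\right)\right) \right],$$

where $S$ is the set of agents in a given scene, $y_i^{\text{real}}$ and $\hat{y}_i^{\text{fake}}$ denote ground truth (real) and generated (fake) trajectories, respectively, and $G$ denotes the generative MATF network which we are optimizing. The conditioning of $D$ on past trajectories and scene context is left implicit.
To train the MATF GAN, we use the following losses:

$$W^* = \arg\min_{W} \max_{D} \; \mathcal{L}_{GAN}(G, D) + \lambda \, \mathcal{L}_{L^p}(G),$$

where $\mathcal{L}_{GAN}$ is the adversarial loss, $\mathcal{L}_{L^p}$ is the reconstruction loss, $W$ is the set of parameters of the model, and $\lambda$ weights the contribution of reconstruction loss versus adversarial loss.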
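A minimal numeric sketch of how the two training signals could be combined is given below. It assumes the standard GAN log-loss with discriminator outputs interpreted as probabilities of being real, and averages over agents; the function names and the small `eps` guard against `log(0)` are illustrative choices rather than details from the paper.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0); an implementation convenience


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Loss minimized by D: the negative of the adversarial value."""
    real_term = sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake: Sequence[float], reconstruction: float, lam: float) -> float:
    """Loss minimized by G: adversarial term plus lam times reconstruction loss."""
    adversarial = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return adversarial + lam * reconstruction


# An undecided discriminator (0.5 everywhere) and a reconstruction loss of 2.0.
d_loss = discriminator_loss([0.5], [0.5])
g_loss = generator_loss([0.5], reconstruction=2.0, lam=0.5)
```

A larger `lam` pulls samples toward the ground truth, while a smaller one lets the adversarial term favor diversity.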
In the Experiments and Results sections, we evaluate our model on both driving datasets and pedestrian crowd datasets [21, 27, 25]. We construct different baseline variants of our models for ablative studies, and compare with state-of-the-art alternative methods quantitatively [1, 8, 9, 13, 15, 19, 20, 28]. Qualitative results are also presented for further analysis.
We use the publicly available NGSIM dataset, a recently collected Massachusetts driving dataset, the publicly available ETH-UCY datasets [21, 27], and the publicly available Stanford Drone dataset for training and evaluation.
NGSIM. A driving dataset consisting of trajectories of real freeway traffic over a time span of 45 minutes. Data were recorded by fixed bird’s-eye view cameras placed over a 640-meter span of US101. Trajectories of all the vehicles traveling through the area within these 45 minutes are annotated. The dataset covers various traffic conditions (mild, moderate and congested), and contains around 6k vehicles in total.
ETH-UCY. A collection of relatively small benchmark pedestrian crowd datasets. There are 5 datasets with 4 different scenes, including 1.5k pedestrian trajectories in total. We use the same cross-validation training-test splits and metrics as reported in previous work [13, 28].
Stanford Drone. A large-scale pedestrian crowd dataset consisting of 20 unique scenes in which pedestrians, bicyclists, skateboarders, carts, cars, and buses navigate on a university campus. Raw, static scene context images are provided from bird’s-eye view, and coordinates of multiple agents’ trajectories are provided in pixels. These scenes contain rich human-human interactions, often taking place within high density crowds, and diverse physical landmarks such as buildings and roundabouts that must be avoided. We use the standard test set for quantitative evaluation. Some scenes from the standard training set are not used for our training process, but left out for qualitative evaluation instead.
We construct a set of baseline variants of our model for ablative studies.
LSTM: A simple deterministic LSTM encoder-decoder. It shares exactly the same architecture as the single-agent LSTM encoders and decoders introduced in Section 3 for fair comparison.
Single Agent Scene: This deterministic model shares exactly the same architecture as introduced in Section 3, except that it only takes in the history of one agent together with the scene representation, and outputs only that agent's predicted trajectory each time, so the model reasons about scene-agent interaction, but is completely unaware of multi-agent interaction.
Multi Agent: This deterministic model has the same details as the model described in Section 3, except that the scene representation is not provided as input. The model only reasons about multi-agent interactions, in the absence of scene context information.
Multi Agent Scene: The deterministic model introduced in Section 3.
GAN: The stochastic model introduced in Section 3.3. Similar to Social GAN, we sample $k$ times and report the best trajectory in the L2 sense for fair comparison with stochastic models. The value of $k$ differs between Section 5.1 and Section 5.2, where it follows the setting adopted by prior work.
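The best-of-$k$ evaluation used for the stochastic model can be sketched as follows. Interpreting "best in the L2 sense" as the lowest average Euclidean displacement from the ground truth is an assumption, as are the names `mean_l2_error` and `best_of_k`.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]


def mean_l2_error(pred: Trajectory, truth: Trajectory) -> float:
    """Average Euclidean distance between corresponding positions."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def best_of_k(samples: List[Trajectory], truth: Trajectory) -> Trajectory:
    """Return the sampled trajectory closest to the ground truth."""
    return min(samples, key=lambda sample: mean_l2_error(sample, truth))


# Two samples for one agent (k = 2 here purely for illustration).
ground_truth = [(0.0, 0.0), (2.0, 0.5)]
samples = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (2.0, 0.0)],
]
best = best_of_k(samples, ground_truth)
```

Because the ground truth is used to pick the sample, this protocol measures whether the predicted distribution covers the true outcome, not the quality of a single guess.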
See Supplementary Materials for implementation details.
NGSIM Dataset. We adopt the same experimental setting as prior work on this benchmark and directly report the results presented there: we split the trajectories into segments of 8s, and all agents appearing in the 640-meter span are considered in the reasoning and prediction process. We use 3s of trajectory history and a 5s prediction horizon. LSTMs operate at 0.2s per timestep. Following that work, we report the Root Mean Square Error (RMSE) in meters with respect to each timestep $t$ within the prediction horizon:

$$\mathrm{RMSE}(t) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert \hat{y}_i^t - y_i^t \rVert_2^2},$$

where $n$ is the total number of agents in the validation set, $\hat{y}_i^t$ denotes the predicted coordinate of the $i$-th car in the dataset at future timestep $t$, and $y_i^t$ the ground-truth coordinate at $t$.
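For reference, a per-timestep RMSE of this kind can be computed as in the sketch below. The function name `rmse_per_timestep` and the list-of-tuples trajectory format are hypothetical, and every agent is assumed to have a prediction at every timestep.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def rmse_per_timestep(preds: List[Trajectory], truths: List[Trajectory]) -> List[float]:
    """Root mean square displacement error at each future timestep,
    averaged over agents."""
    n_agents = len(preds)
    horizon = len(preds[0])
    errors = []
    for t in range(horizon):
        squared = 0.0
        for pred, truth in zip(preds, truths):
            dx = pred[t][0] - truth[t][0]
            dy = pred[t][1] - truth[t][1]
            squared += dx * dx + dy * dy
        errors.append(math.sqrt(squared / n_agents))
    return errors


# Two agents, one future step: one is off by (3, 4) m, the other is exact.
rmse = rmse_per_timestep(
    preds=[[(3.0, 4.0)], [(0.0, 0.0)]],
    truths=[[(0.0, 0.0)], [(0.0, 0.0)]],
)
```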
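Table 1. RMSE in meters on NGSIM at each prediction horizon.

| Model | 1s | 2s | 3s | 4s | 5s |
|---|---|---|---|---|---|
| C-VGMM + VIM | 0.66 | 1.56 | 2.75 | 4.24 | 5.99 |
| MATF Multi Agent | 0.67 | 1.51 | 2.51 | 3.71 | 5.12 |
| Social Conv | 0.61 | 1.27 | 2.09 | 3.10 | 4.37 |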
Quantitative results are shown in Table 1. Our deterministic model MATF Multi Agent outperforms the state-of-the-art deterministic model C-VGMM + VIM, a recent vehicle interaction approach based on variational Gaussian mixture models with Markov random fields. We include a comparison with GAIL-GRU; however, note that this model has access to the future ground-truth trajectories of other agents when predicting a given agent, while MATF and other models do not, so these results are not fully comparable. We compare our stochastic model, MATF GAN, with Social Conv, an approach that captures the distribution over future trajectories by representing maneuvers. MATF GAN performs at the state-of-the-art level, with particularly improved performance at longer prediction horizons (3-5s). Note that Social Conv has access to auxiliary supervision from maneuver labels, while MATF does not require this information. Multi Agent Scene does not outperform Multi Agent on NGSIM, because lanes in the NGSIM dataset are quite straight, and little agent-scene interaction is observed.
Massachusetts Driving Dataset. NGSIM contains rich vehicle-vehicle interactions, but the recorded highway span is quite straight, so minimal agent-scene interaction is observed. We therefore also analyze a private, large-scale Massachusetts dataset gathered during highway driving, which covers a several-mile stretch of highway with curved lanes, highway exits, and entrances, and so has more complex static scene contexts than NGSIM. Ablative studies are conducted for this dataset to show our model’s ability to model agent-scene and agent-agent interactions, respectively. We report the Mean Absolute Error (MAE) in meters with respect to each timestep $t$ within the prediction horizon:

$$\mathrm{MAE}(t) = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{y}_i^t - y_i^t \rvert,$$

where $n$ is the total number of agents in the validation set, and $\hat{y}_i^t$ and $y_i^t$ are the predicted and ground-truth coordinates of the $i$-th agent at future timestep $t$.
Qualitative results are shown in Fig. 3, and quantitative ablative results are shown in Fig. 4. The quantitative results show that both Single Agent Scene and Multi Agent outperform the LSTM baseline, and that Multi Agent Scene consistently outperforms both. Comparing Multi Agent with Single Agent Scene, the former performs better at short-term trajectory prediction, while the latter performs better at long-term prediction.
From these studies, we conclude that our MATF model successfully models agent-agent and agent-scene interaction. More specifically, the scene fusion model learns constraints from the scene context, and the multi-agent model learns multi-agent interaction.
ETH-UCY Dataset. We adopt the same experimental setting, splits and error measures as Social GAN: we split the trajectories into segments of 8s, with 3.2s of trajectory history and a 4.8s prediction horizon. LSTMs operate at 0.4s per timestep. We use a leave-one-out approach, training on 4 sets and testing on the remaining set. As in Social GAN, we report the Average Displacement Error (ADE) over all timesteps within the prediction horizon and the Final Displacement Error (FDE), both in pixels:

$$\mathrm{ADE} = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} \lVert \hat{y}_i^t - y_i^t \rVert_2, \qquad \mathrm{FDE} = \frac{1}{n} \sum_{i=1}^{n} \lVert \hat{y}_i^T - y_i^T \rVert_2,$$

where $n$ is the total number of agents in the validation set, $\hat{y}_i^t$ and $y_i^t$ denote the predicted and ground-truth coordinates of the $i$-th agent in the dataset at future timestep $t$, and $T$ denotes the final future timestep. Table 2 shows our results. MATF performs the best both in deterministic and stochastic settings.
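Both displacement metrics follow directly from their definitions, as in the sketch below. The function name `ade_fde` and the list-of-tuples format are illustrative assumptions; units are whatever the input coordinates use.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def ade_fde(preds: List[Trajectory], truths: List[Trajectory]) -> Tuple[float, float]:
    """Return (ADE, FDE): mean Euclidean error over all agents and timesteps,
    and mean Euclidean error at the final timestep."""
    n_agents = len(preds)
    horizon = len(preds[0])
    total, final = 0.0, 0.0
    for pred, truth in zip(preds, truths):
        dists = [math.dist(p, t) for p, t in zip(pred, truth)]
        total += sum(dists)
        final += dists[-1]
    return total / (n_agents * horizon), final / n_agents


# One agent, two future steps; the final step is off by (3, 4).
ade, fde = ade_fde(
    preds=[[(0.0, 0.0), (3.0, 4.0)]],
    truths=[[(0.0, 0.0), (0.0, 0.0)]],
)
```

FDE is sensitive only to where a trajectory ends, so a model can rank differently under the two metrics.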
Stanford Drone Dataset. We adopt the same experimental setting as prior work on this dataset and directly report the results presented there: we split the trajectories into segments of 8s, and all agents appearing in the scene are considered in the reasoning and prediction process. We use 3.2s of trajectory history and a 4.8s prediction horizon. LSTMs operate at 0.4s per timestep. Following that work, we report ADE and FDE.
Fig. 5 shows qualitative ablative results using deterministic models; only the full MATF Multi Agent Scene model captures the range of behaviors in the data. Quantitative results for deterministic and stochastic models are shown in Table 3. MATF Multi Agent Scene outperforms other deterministic models in ADE, and MATF GAN performs close to the state-of-the-art level. Among the deterministic models, Social LSTM achieves the best performance in FDE. Among the stochastic models, Desire gains strength from using Variational Auto-Encoders and Inverse Optimal Control to generate and rank trajectories; Sophie performs the best with its strong attention-based social and physical reasoning modules. However, the computational complexity of these approaches is higher than that of other approaches, due to the iterative process of IOC and to attention mechanisms whose cost is quadratic in the number of agents, respectively. In contrast, our model is more efficient in computational complexity with our shared convolution operations.
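The timing above translates into fixed step counts: at 0.4s per timestep, 3.2s of history is 8 steps and a 4.8s horizon is 12 steps, so an 8s segment holds 20 steps. A small sketch of the split follows; the function name `split_segment` is hypothetical, and reading the NGSIM setting of 0.2s as a per-timestep interval in the same way is an assumption.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_segment(
    segment: Sequence[T], history_s: float, horizon_s: float, step_s: float
) -> Tuple[List[T], List[T]]:
    """Split one trajectory segment into observed history and future target."""
    n_hist = round(history_s / step_s)
    n_pred = round(horizon_s / step_s)
    if len(segment) < n_hist + n_pred:
        raise ValueError("segment is shorter than history plus horizon")
    return list(segment[:n_hist]), list(segment[n_hist:n_hist + n_pred])


# Pedestrian setting: 8 observed steps, 12 predicted steps.
ped_hist, ped_future = split_segment(list(range(20)), 3.2, 4.8, 0.4)
# Driving setting (3s history, 5s horizon, 0.2s steps): 15 and 25 steps.
car_hist, car_future = split_segment(list(range(40)), 3.0, 5.0, 0.2)
```

Rounding the ratio avoids off-by-one errors from floating-point division such as 4.8 / 0.4.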
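Table 3. ADE and FDE in pixels on the Stanford Drone dataset.

| Type | Model | ADE | FDE |
|---|---|---|---|
| Deterministic | Social Force | 36.38 | 58.14 |
| Deterministic | Social LSTM | 31.19 | 56.97 |
| Deterministic | MATF Multi Agent | 30.75 | 65.90 |
| Deterministic | MATF Multi Agent Scene | 27.82 | 59.31 |
| Stochastic | Social GAN | 27.25 | 41.44 |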
We also analyze the factors influencing performance in our model, particularly the impact of the spatial resolution of the Multi-Agent Tensor. Table 4 shows a U-shaped performance curve, due to underfitting at low resolution and overfitting at high resolution, and the best-performing resolution is the setting we report.
Table 4. Performance as a function of the spatial grid resolution of the Multi-Agent Tensor.
We proposed an architecture for trajectory prediction which models scene context constraints and social interaction while retaining the spatial structure of multiple agents and the scene, unlike the purely agent-centric approaches more commonly used in the literature. Our motivation was that scene context constraints and social interaction patterns are invariant to the absolute coordinates where they take place; these patterns only depend on the relative positions among agents and scenes. Convolutional layers are suited to modeling these kinds of position-invariant spatial interactions by sharing parameters across agents and space, while recent approaches like Social Pooling [1, 13] or attention mechanisms cannot explicitly reason about spatial relationships among agents and cannot reason about these relationships at multiple spatial scales. Our Multi-Agent Tensor Fusion architecture models this naturally. To the best of our knowledge, MATF is the first approach which fuses information from a static scene context with multiple dynamic agent states, while retaining their spatial structure throughout the reasoning process to bridge the gap between agent-centric and spatial-centric trajectory prediction paradigms.
We applied our model to two different trajectory prediction tasks to demonstrate its flexibility and capacity to learn different types of behaviors, agent types, and scenarios from data. In the vehicle prediction domain, our model achieved state-of-the-art results at long-range prediction of vehicle trajectories in the NGSIM dataset. Our adversarially trained stochastic prediction model performed best relative to the maneuver-based approach of Social Conv, suggesting that a representation of the distribution over maneuvers was necessary, whether explicit as in Social Conv or implicit as in our work. Our ablative studies on a Massachusetts driving dataset showed that representations of both the scene and multiagent interactions were necessary for accurate trajectory prediction in more complex scene contexts than NGSIM (greater lane curvature, more entrances and exits, etc.).
Our application to a state-of-the-art pedestrian dataset demonstrated comparable performance with previously published results. Although some recent models achieved greater accuracy than ours [28, 20], all used dramatically different architectures; it is interesting to find that a novel spatial-centric architecture can also achieve a high standard of performance. Future work should examine the factors that influence performance, and the advantages and disadvantages of different architectures.
In future work, we plan to integrate unsupervised learning of structured maneuver representations into our framework. This will increase the interpretability of our model predictions, while enabling our model to better capture multimodal structure in the distribution over agent-scene and agent-agent interactions.
Social trajectory prediction is a complex task, which depends on the ability to extract structure from the scene and the history of agents’ joint motions. Our central goal here has been to combine the strengths of agent- and spatial-centric approaches to this problem. Beyond achieving more accurate multi-agent trajectory predictions, our belief is that the work of engineering better models will continue to yield further insights into the structure of human interaction.
This work was mainly conducted at ISEE, Inc. with the support of the ISEE team and ISEE data platform. This work was supported in part by NSFC-61625201, 61527804.
Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5308–5317, 2016.
Social attention: Modeling attention in human crowds. In Proceedings of the International Conference on Robotics and Automation (ICRA), May 2018.