Leveraging Smooth Attention Prior for Multi-Agent Trajectory Prediction

Modeling multi-agent interactions is important for forecasting other agents' behaviors and trajectories. At any given time, to forecast a reasonable future trajectory, each agent needs to attend to the interactions with only a small group of the most relevant agents instead of unnecessarily attending to all the other agents. However, existing attention-modeling works ignore that human attention in driving does not change rapidly, and may introduce fluctuating attention across time steps. In this paper, we formulate an attention model for multi-agent interactions based on a total variation temporal smoothness prior and propose a trajectory prediction architecture that leverages the knowledge of these attended interactions. We demonstrate how the total variation attention prior, along with new sequence prediction loss terms, leads to smoother attention and more sample-efficient learning of multi-agent trajectory prediction, and show its advantages in terms of prediction accuracy by comparing it with state-of-the-art approaches on both synthetic and naturalistic driving data. We demonstrate the performance of our algorithm for trajectory prediction on the INTERACTION dataset on our website.




I Introduction

To navigate safely and efficiently in dense and complex traffic scenarios crowded with vehicles and pedestrians, it is crucial for autonomous vehicles and mobile robots to be socially compliant. This requires interacting with other agents and making decisions based not only on the observation of the environment, but also on the behaviors of other agents. For example, you might decide to slow down if you see a vehicle aggressively overtaking another vehicle in your rear-view mirror. Hence, socially compliant navigation requires accurately forecasting the future trajectories of the surrounding agents [lefevre2014survey, schwarting2019social]. However, this trajectory prediction problem can be challenging in complex traffic scenarios, as many agents might be interacting with each other.

Taking into account the interactions with all other agents when predicting an agent's trajectory can be computationally intractable: such an approach may require too many training trajectories, far beyond the size of existing datasets [chang2019argoverse, houston2020one]. While part of this problem is remedied by parameter sharing between agents [bhattacharyya2018multi], an important observation is that humans can effectively coordinate with each other in such complex driving scenarios without reacting to all other agents at the same time [scalf2013competition]. People attend only to a limited set of factors at any given moment [treisman1969modelsofattention]. In the driving context, these factors are mostly confined to a subset of agents and objects that immediately affect or are affected by them. For example, when driving in a lane, the driver often pays attention only to the vehicles in neighboring lanes with the potential to join the lane. Such limited attention largely reduces the complexity of the model and the amount of training data needed [vemula2018social].

Many existing works learn attention models that are purely based on the goal of trajectory prediction [vemula2018social, gupta2018social]. However, prior knowledge can substantially improve multi-agent trajectory prediction [casas2020importance]. Attention priors such as the neighboring prior, that agents only pay attention to nearby agents [alahi2016social], and the sparsity prior, that attention should only be paid to a limited number of agents [zhang2018attention], are helpful for trajectory prediction in complex traffic scenes. However, the neighboring prior is often easily violated, since there are situations where not just nearby agents affect the ego agent's behavior, e.g., vehicles need to pay attention to an ambulance and make way for it even when it is far away. The sparsity prior can also fail when agents attend to a limited but incorrect set of agents. In this paper, we provide a new attention prior that is motivated by cognitive science and is widely applicable to different driving scenarios.

We notice that changes in a traffic scenario are usually not fast enough to require rapid changes in attention, and the cognitive science literature suggests that humans' social attention does not change rapidly [swettenham1998frequency]. Thus, we propose a temporal smoothness prior on the attention model. Specifically, we train an attention model with a constraint that the attention distribution over consecutive time steps does not change rapidly. Using this attention regularization for smoothness, we design a network architecture to simultaneously model interactions among agents and learn the attention of each agent to all other agents, where we impose the smooth attention prior as a total variation loss on the attention. We then perform trajectory prediction for each agent with a recurrent neural network that takes the history of states and the predicted attention as input.

Our contributions in this paper are three-fold:

  • We propose a smoothness prior on attention modeling by constraining its rate of change. This prior is based on evidence for smooth attention from cognitive science and the commonly smooth nature of traffic scenarios.

  • We develop an architecture that simultaneously learns an attention model for interactions and a sequential prediction model for predicting future states. The architecture is optimized by a sequence likelihood loss to minimize prediction error and a total variation loss to impose the smoothness attention prior.

  • We conduct experiments on synthetic and real multi-agent trajectory prediction datasets. Our results show the proposed approach consistently achieves the lowest prediction errors and increases sample-efficiency compared to the state-of-the-art trajectory prediction methods.

II Related Work

Trajectory Prediction. Multi-agent trajectory prediction aims to predict the future trajectories of a group of interacting agents given their observation history. Helbing et al. introduced the social force model, which includes attractive and repulsive forces between agents, to model the behavior of pedestrians [helbing1995social]. More recently, different sequential prediction models have been proposed for trajectory prediction, including Gaussian processes [kim2011gaussian], hidden Markov models, dynamic Bayesian networks, inverse reinforcement learning, and recurrent neural networks (RNNs) [vemula2018social, alahi2016social, gupta2018social, chandra2020forecasting, huang2020diversitygan, pokle2019deep]. However, these methods do not employ a separate module to explicitly reason about interactions between the agents, but instead implicitly model the interactions as part of the trajectory prediction problem.

Multi-Agent Interactions. Modeling interactions in multi-agent systems is important not only for autonomous driving, but for robotics more generally. Many works handle interactions by learning and modeling the other agents in the environment [basu2019active, kwon2020when, xie2020learning, zhu2020multi, sadigh2016information, sadigh2016planning, sadigh2018planning]. Communication protocols and conventions have been developed to enable effective interactions between agents [sukhbaatar2016learning, shih2021critical]. Furthermore, graph structures have been adopted to model interactions [kipf2018neural, bohmer2020deep]. In driving, interactions between agents usually have common structures, e.g., agents in the same neighborhood are more likely to interact with each other. Therefore, driving prediction techniques have an opportunity to leverage these structures for better performance.

Trajectory Prediction via Interaction Modeling. Modeling the interactions between agents is important for multi-agent trajectory prediction, since the trajectory of each agent depends not only on its internal state but also on its interactions with other agents. Social pooling [alahi2016social, gupta2018social] and graph neural networks [casas2020implicit, salzmann2020trajectron++] have been adopted to simultaneously model interactions and conduct trajectory prediction. However, not all interactions between agents are necessary for trajectory prediction, and hence attention models are employed to focus only on the important agents. Social attention [vemula2018social] designs an attention module on a graph structure to simultaneously predict attention and model interactions. EvolveGraph [li2020evolvegraph] adaptively evolves latent interaction graphs to enable dynamic relational reasoning. Multi-head attention social pooling models the high-order interactions between vehicles [messaoud2020attention].

While these works attempt to focus the attention on the correct agents in various ways, they lack a natural model of human attention, and do not explicitly use the properties of driving. We hypothesize that additional priors can enable more effective trajectory prediction in complex multi-agent scenarios. We aim to fill this research gap by incorporating a smooth attention prior in trajectory prediction.

Attention Regularization. Attention mechanisms aim to re-distribute the focus of the network over inputs such as agents [vemula2018social] or image parts [bahdanau2015neural]. Early works on attention focus on the attention structure and implicitly learn the attention with the task loss [bahdanau2015neural, luong2015effective]. However, implicit attention learning may result in low-quality or even meaningless attention, such as overly dense values [luong2015effective]. Instead, regularization techniques such as sparsity encourage the network to focus on a small number of inputs [martins2020sparse]. A consistency regularizer enables coherent attention at different levels of the network [zhou2019discriminative]. Sparse and structured neural attention utilizes the structural properties of the input [niculae2017regularized]. Diversity regularization maximizes the attention difference between different heads of the network [li-etal-2018-multi-head]. However, most of these attention regularizers are designed for general attention modeling across various applications, not specifically for multi-agent trajectory prediction in complex traffic scenarios, where accurately modeling humans' attention becomes crucial. In this work, backed by evidence from cognitive science, we use a total variation attention regularizer to model realistic driving behavior.

III Method

We are interested in the multi-agent trajectory prediction problem in complex driving scenes. More formally, observing the states of all agents from time step $1$ to $T_{obs}$, we aim to predict their future states from time step $T_{obs}+1$ to $T$.

As discussed in Section I, trajectory prediction for each agent in the scene requires modeling the interactions between agents, which we achieve by modeling the attention of each agent to the most relevant agents. To this end, we build our architecture based on the RNN mixture model of [vemula2018social]. The main difference is that we propose a new smoothness attention prior on the attention model, which is implemented as a total variation prior on learning the attention.

Fig. 1: The architecture of the proposed method. We show the prediction process of one node (node 1) in a scene with multiple agents over two consecutive time steps. The solid green edges indicate the interactions between different agents, with the states of the connected nodes as the input. The dashed green arrows show the RNN for the solid green edges. The solid blue edges indicate the interactions between consecutive steps, with the states of consecutive steps as the input. The dashed blue arrows show the RNN for the solid blue edges. The solid red arrows indicate the attention module. The solid black arrows show the input and output of the RNN that predicts the location of node 1.

Architecture. Fig. 1 shows our network architecture at two consecutive time steps for one agent. To predict the future trajectory of each agent, we incorporate information about the past and about the interactions between the agent and the other agents present in the scene. We model the interactions between agents with a directed complete graph with self-loops, where the nodes represent the agents, the edges between different agents model the interactions between agents, and the self-loops model the interactions between consecutive time steps of each agent. For the edges between different agents, we adopt a neural network with the states of the edge's tail and head nodes as the input, which embeds the interaction between these two nodes. Specifically, given the state $s_i^t$ of agent $i$ at time $t$, we embed the interaction between each pair of agents $(i, j)$ as

$e_{ij}^t = \phi_e(\mathrm{concat}(s_i^t, s_j^t)),$

where 'concat' denotes the vector concatenation of the two states and $\phi_e$ is the edge embedding network. The self-loops model the interaction of states between consecutive time steps. We adopt a neural network $\phi_\ell$ that embeds the temporal state change of each agent. Specifically, at each time step $t$, given the current state $s_i^t$ and the previous state $s_i^{t-1}$ of agent $i$, we embed the temporal change as

$e_{ii}^t = \phi_\ell(\mathrm{concat}(s_i^{t-1}, s_i^t)).$


Modeling the interaction at each time step independently may lose information about the temporal evolution of interactions. Thus, we adopt two RNNs, $\mathrm{RNN}_e$ and $\mathrm{RNN}_\ell$, to model the temporal evolution of the edges and self-loops in the graph, where the computation at time step $t$ can be represented as

$(o_{ij}^t, h_{ij}^t) = \mathrm{RNN}_e(e_{ij}^t, h_{ij}^{t-1}), \qquad (o_{ii}^t, h_{ii}^t) = \mathrm{RNN}_\ell(e_{ii}^t, h_{ii}^{t-1}).$

Here, $o$ indicates the output of the RNN, and $h$ indicates the hidden state of the RNN, i.e., the internal state that carries information from one step to the next. $o_{ij}^t$ embeds the interaction information between each pair of agents, and $o_{ii}^t$ embeds the temporal shift information of each agent.
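As a concrete illustration of these embeddings, the sketch below passes concatenated states through a one-layer embedding network before they would be fed to the RNNs. The network sizes, weights, and names (`embed`, `W`, `b`) are illustrative stand-ins, not the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W, b):
    # one-layer fully-connected embedding with ReLU
    return np.maximum(W @ x + b, 0.0)

d_state, d_embed = 2, 4
W = rng.standard_normal((d_embed, 2 * d_state))
b = np.zeros(d_embed)

s_i, s_j = np.array([0.0, 1.0]), np.array([1.0, 0.0])

# edge between agents i and j: embed the concatenated states of both agents
e_ij = embed(np.concatenate([s_i, s_j]), W, b)

# self-loop of agent i: embed the states of two consecutive time steps
s_i_prev = np.array([0.0, 0.5])
e_ii = embed(np.concatenate([s_i_prev, s_i]), W, b)
```

In the full model, `e_ij` and `e_ii` would be fed to the edge and self-loop RNNs at each time step.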

As we discussed in Section I, each agent should focus on the interactions with the most relevant agents. For each agent $i$, we compute its attention on the other agents as the similarity between its self-loop feature $o_{ii}^t$ and its interaction features $o_{ij}^t$ with the other agents $j$, where $j \neq i$:

$\tilde{a}_{ij}^t = \langle f_a(o_{ii}^t), f_a(o_{ij}^t) \rangle.$

Here, we use $f_a$ to refer to a learnable attention module and use the inner product as the similarity measurement. We apply the following softmax normalization to make the attention from one agent to all other agents a categorical distribution:

$a_{ij}^t = \frac{\exp(\tilde{a}_{ij}^t)}{\sum_{k \neq i} \exp(\tilde{a}_{ik}^t)}.$
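A minimal numpy sketch of this attention step, assuming an inner-product similarity between the ego agent's self-loop feature and its interaction features followed by a softmax; the function name and feature values are illustrative:

```python
import numpy as np

def attention_weights(self_feat, interaction_feats):
    """Categorical attention of one agent over the others.

    self_feat:         (d,)   self-loop feature of the ego agent
    interaction_feats: (n, d) interaction features w.r.t. the n other agents
    """
    scores = interaction_feats @ self_feat   # inner-product similarity
    scores -= scores.max()                   # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()           # normalize to a distribution

# toy example: the ego feature is most similar to agent 0's interaction feature
self_feat = np.array([1.0, 0.0])
feats = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
w = attention_weights(self_feat, feats)      # w sums to 1, peaks at agent 0
```

The softmax guarantees a valid categorical distribution, so the attention can be used directly to weight the interaction features.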
With the interaction features and the attention, we use another RNN, $\mathrm{RNN}_p$, to predict the future trajectory of each agent. $\mathrm{RNN}_p$ uses the attended interactions between the agent and the other agents, and between consecutive steps of the agent. Similar to $\mathrm{RNN}_e$ and $\mathrm{RNN}_\ell$, we also embed the input state into a hidden space. At time step $t$, the prediction of the state of agent $i$ can be formalized as follows:

$(o_i^t, h_i^t) = \mathrm{RNN}_p\big(\mathrm{concat}\big(\phi_s(s_i^t), \textstyle\sum_{j \neq i} a_{ij}^t o_{ij}^t, \; o_{ii}^t\big), h_i^{t-1}\big),$

where $\phi_s$ is the embedding network of the agent state. We predict the current-step state by a prediction network $\phi_p$ based on the output of $\mathrm{RNN}_p$, which is represented by a mean and covariance:

$(\mu_i^t, \Sigma_i^t) = \phi_p(o_i^{t-1}).$
Using the above architecture, the trajectory prediction of each agent depends on the historical sequence of states and the attended interactions with other agents. Next, we introduce the loss function to train this network.

Likelihood and Variance Losses.

We define a loss function that maximizes the likelihood of the observed states:

$\mathcal{L}_{\mathrm{pred}} = -\sum_{t} \sum_{i} \log p\big(s_i^t \mid \mu_i^t, \Sigma_i^t\big),$

where $p(s_i^t \mid \mu_i^t, \Sigma_i^t)$ is the probability of $s_i^t$ under a Gaussian distribution with mean $\mu_i^t$ and covariance matrix $\Sigma_i^t$. Minimizing only the likelihood loss can sometimes introduce large variance in the predictions, and can cause unstable predictions at test time. We therefore place a prior on the variance via an additional loss:

$\mathcal{L}_{\mathrm{var}} = \sum_{t} \sum_{i} \mathbb{1}\big[\|\Sigma_i^t\| > \epsilon\big] \, \|\Sigma_i^t\|,$

where $\mathbb{1}[\cdot]$ is the indicator function, and the threshold $\epsilon$ ensures that we do not penalize small variances.
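The two losses can be sketched as follows for a single predicted state, assuming a standard multivariate Gaussian log-likelihood and a variance penalty that only activates above the threshold; the function names, threshold value, and states are illustrative:

```python
import numpy as np

def gaussian_nll(x, mu, cov):
    """Negative log-likelihood of state x under N(mu, cov)."""
    d = len(x)
    diff = x - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (diff @ inv @ diff + logdet + d * np.log(2 * np.pi))

def variance_loss(cov, threshold=1.0):
    """Penalize predicted variances, but only those above the threshold."""
    var = np.diag(cov)
    return np.sum((var - threshold) * (var > threshold))

x   = np.array([0.0, 0.0])           # observed state
mu  = np.array([0.1, -0.1])          # predicted mean
cov = np.diag([2.0, 0.5])            # predicted covariance
nll   = gaussian_nll(x, mu, cov)
vloss = variance_loss(cov, threshold=1.0)  # only the 2.0 variance is penalized
```

Summing these terms over agents and time steps gives the one-step training objective described above.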

The above losses, originally used in [vemula2018social], only penalize the one-step error and under-estimate the compounding error over long horizons. At test time, a sequence of states after the observed time steps is predicted, where errors accumulate over time steps and introduce a large compounding error. To address this problem, in the training phase we simulate the prediction process of the test phase as follows:

$\bar{s}_i^{t} \sim \mathcal{N}\big(\bar{\mu}_i^{t}, \bar{\Sigma}_i^{t}\big) \;\text{ if } t > T_{obs}, \qquad \bar{s}_i^{t} = s_i^{t} \;\text{ otherwise.}$

To differentiate from the notation above for one-step prediction, we use the bar notation to indicate that a variable is derived using the multi-step prediction process. $\bar{s}_i^t$ is a sample from a Gaussian distribution parameterized by $\bar{\mu}_i^t$ and $\bar{\Sigma}_i^t$ if $t > T_{obs}$, or otherwise it is the ground-truth state. We similarly apply the Gaussian likelihood loss and the variance loss on this simulated test-time prediction; these are defined on the full prediction sequence, in contrast to the one-step losses in Eqns. (8) and (9). With the above loss functions, we can simultaneously penalize the short-term and long-term prediction errors.
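The simulated test-time prediction process can be sketched as below, with a simple stand-in one-step predictor in place of the learned network; the predictor, horizon, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(states, t_obs, predict):
    """Simulate the test-time prediction process during training.

    states:  (T, d) ground-truth states
    t_obs:   number of observed steps; after t_obs, the input comes from
             the model's own sampled prediction, so errors can compound
    predict: maps a state to (mean, cov) of the next state
    """
    preds, prev = [], states[0]
    for t in range(1, len(states)):
        mu, cov = predict(prev)
        preds.append(mu)
        sample = rng.multivariate_normal(mu, cov)
        prev = states[t] if t < t_obs else sample  # ground truth vs. own sample
    return np.array(preds)

# stand-in one-step model: shift the state by +1 with small Gaussian noise
predict = lambda s: (s + 1.0, 0.01 * np.eye(2))
gt = np.stack([np.full(2, float(t)) for t in range(6)])
preds = rollout(gt, t_obs=3, predict=predict)
```

Applying the likelihood and variance losses to `preds` penalizes the whole rolled-out sequence, not just single-step errors.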

The attention module, which is used as an input to the RNN that predicts the future states, is implicitly learned by the above losses without any explicit constraints. However, proper regularization can be helpful in training an attention model and is effective in reducing the sample complexity. Therefore, we further introduce a regularization on the attention to learn a more reasonable and informative attention module.

Attention Modeling. In this section, we impose the smooth attention prior that human attention does not change frequently over time. Our hypothesis is based on two factors: First, the cognitive science literature suggests that although human attention can change rapidly when running freely without specific instructions, deliberate movement of attention is significantly slower because of an internal limit on the speed of volitional commands [wolfe2000attention]. This suggests that human social attention does not change frequently in driving, as it falls into the category of deliberate movement. Second, in driving scenarios, the most relevant vehicles to pay attention to are often the ones that can immediately affect the ego agent or the ones that are affected by the ego agent. This group of agents often does not change rapidly, since the reward function typically consists of terms related to the distance to the goal and the proximity to other agents, which mostly change continuously over time. Hence, we impose a smoothness constraint on the attention, defined as a vectorial total variation penalty:

$\mathcal{L}_{\mathrm{tv}} = \sum_{t} \big\| a_i^{t+1} - a_i^t \big\|_2,$

where $a_i^t = (a_{ij}^t)_{j \neq i}$ is the attention distribution of agent $i$ at time $t$. This total variation loss [rudin1992nonlinear, bresson2008fast] encourages piecewise-smooth signals with few transitions, and transitions that happen at the same point for multiple elements of $a_i^t$. We similarly define the total variation loss on the multi-step attention $\bar{a}_i^t$.
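A minimal sketch of this penalty on one agent's attention sequence; the toy attention matrices are illustrative:

```python
import numpy as np

def tv_loss(attn):
    """Vectorial total variation of attention over time.

    attn: (T, n) attention distribution over n other agents at each of T steps.
    Sums the l2 norms of consecutive-step differences, which favors attention
    that is piecewise constant in time.
    """
    diffs = attn[1:] - attn[:-1]
    return np.linalg.norm(diffs, axis=1).sum()

smooth  = np.array([[1.0, 0.0]] * 4)              # constant attention
jittery = np.array([[1.0, 0.0], [0.0, 1.0]] * 2)  # flips every step
```

Here `tv_loss(smooth)` is zero while `tv_loss(jittery)` is large, so minimizing the penalty steers the model toward the smoother attention sequence.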

Overall Loss. We aggregate all the losses for the network:

$\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda_v \mathcal{L}_{\mathrm{var}} + \lambda_s \mathcal{L}_{\mathrm{tv}} + \bar{\mathcal{L}}_{\mathrm{pred}} + \lambda_v \bar{\mathcal{L}}_{\mathrm{var}} + \lambda_s \bar{\mathcal{L}}_{\mathrm{tv}},$

where $\lambda_v$ and $\lambda_s$ are trade-offs for the variance losses and the smoothness losses. The one-step losses $\mathcal{L}_{\mathrm{pred}}$, $\mathcal{L}_{\mathrm{var}}$, and $\mathcal{L}_{\mathrm{tv}}$ are computed on the one-step predictions as shown in Eqns. (5) and (7); the sequence losses $\bar{\mathcal{L}}_{\mathrm{pred}}$, $\bar{\mathcal{L}}_{\mathrm{var}}$, and $\bar{\mathcal{L}}_{\mathrm{tv}}$ are computed on the sequence predictions as shown in Eqn. (10).

IV Experiments

We first introduce the experimental setting, including the datasets and implementation details. We then present our results for trajectory (state) prediction.

IV-A Datasets

We perform tests on two synthetic datasets, Double Merge and Halting Car, for which we implement the scenarios in the CARLO simulator [cao2020reinforcement]. The synthetic experiments aim to show that the attention learned with our smoothness prior focuses on the correct agents, in scenarios that commonly occur on the road and where the true attention is clear. We also conduct experiments on the real naturalistic driving INTERACTION dataset [interactiondataset].

Double Merge. We define a common driving scenario, illustrated in Fig. 2 (left): two vehicles in two lanes (green and brown) want to change to each other's lane, which may cause a collision. We add other vehicles (white) that do not influence these two vehicles to make the scene more complex, requiring the two main vehicles to attend to each other instead of to these other vehicles. We show sample trajectories, where the dot sizes represent progression in time.

We generate a trajectory dataset using a hand-designed ground-truth policy for each vehicle. The main vehicle whose initial location is behind the other main vehicle (e.g., the brown car in the minor case of Fig. 2) waits until the other (e.g., the green car) finishes its lane change, and then starts its own lane change. The other vehicles drive straight at a constant speed in the outer two lanes. Hence, the policy creates two symmetric situations where either of the two main vehicles is initialized in front (the minor and major cases in Fig. 2).

We aim to show that the regularization reduces the sample complexity and improves performance, especially for rare events where data is limited. For this, we limit the data for one of these two driving scenarios. Specifically, we treat the case where the green vehicle starts behind the brown vehicle as the major case with a significant amount of data, and the other scenario as the minor case with only limited data. We collect trajectories from the major case for training. For the minor case, we vary the size of the dataset by collecting different fractions of the number of major-case trajectories. We further hold out part of the training set for validation. For testing, we independently sample trajectories for each case and compute the prediction error separately on the test set for each situation.

Fig. 2: The left figures illustrate sample trajectories from the ground-truth policies for the major and minor cases of the Double Merge scenario. In the minor case, the left green vehicle is ahead of the right brown vehicle at initialization, and vice versa in the major case. The right figures show the Average Displacement Error (ADE) and the Final Displacement Error (FDE). We show the p-values that measure the statistical significance of the difference between the results of Ours and S-Attn.
Fig. 3: The test error under a varying number of training trajectories. The x-axes show the number of trajectories for the minor/major cases. Our method is consistently better, with a larger margin under smaller datasets.

Halting Car. In this scenario, illustrated in Fig. 5, two main vehicles are initialized in the same lane, with one vehicle (the leader, brown) in front of the other (the follower, green). We create two scenarios: 'Go', where the leader vehicle drives at a constant speed, and 'Stop', where the leader vehicle suddenly stops and then accelerates back. The follower vehicle aims to follow the leader vehicle safely, and thus needs to react to the leader potentially slowing down. We also add other cars to increase complexity.

For the ground-truth policies, in the 'Stop' case we let the leader vehicle slow down to a stop and then accelerate. The follower vehicle slows down or accelerates depending on its distance to the leader vehicle. In the 'Go' case, both the leader and the follower vehicles drive at a constant speed. In both cases, the other vehicles drive straight at a constant speed in the outer lanes. The scenario aims to simulate a near-accident driving situation.

We simulate rare events by using either setting ('Stop'/'Go') as the major case and the other as the minor case. We collect trajectories for the major case for training, and a smaller fraction of that number for the minor case. The training/validation split and the number of testing trajectories are the same as in the Double Merge scenario.

INTERACTION. The INTERACTION dataset [interactiondataset] contains naturalistic motions of various traffic participants in 3 categories of highly interactive driving scenarios, including Merging, Intersection, and Roundabout, from 12 locations. The trajectories of all the agents at a location are recorded for prediction over a certain time horizon. For each sequence, we use the earlier steps as the observed steps and predict the subsequent steps. For each dataset, we split all sequences into training, validation, and test sets. The total recorded time is about 1000 minutes, which demonstrates the performance of our method on a large-scale dataset.

IV-B Implementation Details

We use a one-layer LSTM [hochreiter1997long] for all RNN networks and a one-layer fully-connected network for all other networks in our architecture. We use the Adam optimizer [kingma2014adam] and tune the hyper-parameters (the learning rate, the variance threshold, and the trade-off parameters for the variance and smoothness losses) by cross-validation on the validation set; the selected values work well for all of our datasets. We use a fixed batch size and train the model on the synthetic datasets for 200 epochs (only about 10 iterations per epoch) and on the INTERACTION dataset for 10 epochs. We run each experiment 10 times and report the mean and standard deviation.

Fig. 4: The top figures show the predicted trajectories. We show the trajectory of S-Attn in blue and that of our approach in orange. The darker the color at a step, the higher the attention paid to the correct vehicle (the green vehicle and the brown vehicle should pay attention to each other). The bottom plots show the attention at each step. The shaded area highlights the time period that requires high attention, i.e., when the vehicle behind is close to the center lane and needs to wait for the other vehicle ahead to pass first.

IV-C Results for Double Merge

Prediction Error. We report the Average Displacement Error (ADE, the mean error over the predicted steps) and the Final Displacement Error (FDE, the error at the final step). We compare our method with (i) Social Attention (S-Attn) [vemula2018social], which learns attention but does not have the sequence loss and the smoothness regularization, and (ii) average attention (denoted by 'Average'), which distributes the attention to all vehicles equally. We use the correct attention (denoted by 'Correct') as the oracle: the two main vehicles fully attend to each other over the whole sequence and ignore the other vehicles. As shown in Fig. 2, in both the major and the minor cases, our approach consistently outperforms S-Attn with the same attention model, which demonstrates the importance of using the sequence loss. Ours outperforms Ours-NonSmooth and Ours-Average and performs comparably to Ours-Correct, which indicates that the smooth attention regularization helps learn a more accurate attention. We report the p-values (two-sample t-test) in Fig. 2 for the statistical significance of the difference between our method and S-Attn. We observe statistically significant performance gains in both ADE and FDE for all the ratios in both the major and minor cases.
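For reference, ADE and FDE for a single trajectory can be computed as in the sketch below; the function name and toy trajectories are illustrative:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for one trajectory.

    pred, gt: (T, 2) predicted and ground-truth positions over T steps.
    """
    dists = np.linalg.norm(pred - gt, axis=1)  # per-step Euclidean error
    return dists.mean(), dists[-1]

gt   = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
ade, fde = ade_fde(pred, gt)  # ade = (0 + 1 + 2) / 3 = 1.0, fde = 2.0
```

For a dataset, these per-trajectory errors would be averaged over all agents and test sequences.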

We further investigate the test error with different numbers of training trajectories, where we keep the ratio of the minor case to the major case fixed and change the total number of training trajectories. We test three settings with decreasing numbers of minor/major-case trajectories. As shown in Fig. 3, Ours outperforms the other methods by a larger margin especially with less data, i.e., the margin between Ours and Ours-NonSmooth grows as the dataset shrinks. This demonstrates the efficacy of the smoothness regularizer in reducing sample complexity, which is crucial for driving in rare events, e.g., near-accident scenarios [cao2020reinforcement].

Fig. 5: The left figures illustrate sample trajectories from the ground-truth policies for the major and minor cases of the Halting Car scenario. In the 'Stop' case, the brown vehicle slows down to a stop, whereas it drives at a constant speed in the 'Go' case. The right figures show the Average Displacement Error (ADE) and the Final Displacement Error (FDE) on the test set for these two cases. We show the p-values that measure the statistical significance of the difference between the results of Ours and S-Attn.

Trajectory and Attention. We show the trajectories and the attention for S-Attn and our approach in both the minor and major cases in Fig. 4. We show the true trajectories with gray lines and the predictions with dots. The sizes of the dots indicate progression in time, and the color intensities indicate attention to the correct vehicle. This intensity is larger for our method (orange) than for social attention (blue). In the bar plots in Fig. 4, we show the predicted attention of the green vehicle to the brown vehicle (and vice versa) in both cases for our method (orange) and social attention (blue). The shaded regions highlight the merging period. Our approach shows a clear change in attention and correctly estimates the true attention in both the major and minor cases, compared to S-Attn.

IV-D Results for Halting Car

We use either 'Stop' or 'Go' as the major case and report ADE and FDE in both cases. As shown in Fig. 5, in both the 'Stop'-major and the 'Go'-major cases, the results show a similar trend to the Double Merge scenario, which further demonstrates the efficacy of the smooth attention regularization and the sequence loss. The p-values in Fig. 5 demonstrate statistically significant performance gains in both ADE and FDE in both cases.

Fig. 6: The left figures illustrate different types of scenarios in the INTERACTION dataset. The right figures show Average Displacement Error (ADE) and Final Displacement Error (FDE) for different scenario types.

IV-E Results for INTERACTION

Fig. 6 shows the results on the INTERACTION dataset. We compute the test error for each instance and report the average for each scenario type. We observe that our method outperforms all the other methods, which demonstrates that the proposed smoothness regularization and the sequence loss improve sequence prediction in realistic scenarios.

V Conclusion

Summary. We propose a new approach for multi-agent trajectory prediction in complex driving scenarios. We propose a smooth attention prior motivated by cognitive evidence and realistic driving behavior. We design an RNN mixture architecture to model the interactions between agents and learn an attention module, with a loss that consists of both one-step and sequence terms to penalize both short-term and long-term prediction errors, and a total variation loss on the attention model to impose the smoothness prior. Experimental results show that the proposed approach outperforms the benchmarks in both synthetic and real driving scenarios.

Limitations and Future Work. Our main focus is on how the attention prior affects the attention model, not on the architecture for learning interactions. We use a complete directed graph to model the interactions between all agents, which may not scale to a large number of agents. Also, we only consider learning a model for one scenario at a time, rather than for multiple scenarios. In the future, we plan to model interactions with sparse and/or dynamic graphs, and to use scene maps as an input to handle multiple scenarios. More scalable architectures may also enable experiments on larger-scale real driving datasets where rare events naturally occur, e.g., Argoverse [chang2019argoverse].


Acknowledgments. We would like to acknowledge support from NSF Awards 1953032 and 2125511, as well as the Toyota Research Institute (TRI).