## I Introduction

Behavior prediction is a core component of real-world systems involving human-robot interaction. This task is particularly challenging due to the high degree of uncertainty in the future—the intent of human actors is unobserved, and multiple interacting agents may continually influence one another.

We are particularly interested in the high-impact application of autonomous vehicles (AVs), in which a robot may wish to pose behavior prediction queries of the form “If I take action $X$, what will agent $B$ do?”, as shown in Figure 1. We assert that this type of conditional inference is important and fundamental for making planning decisions in an interactive driving environment. In this paper, we focus on probabilistic models of future behavior that can condition on possible future action sequences (i.e., trajectories) of other agents. We call this task conditional behavior prediction (CBP).

In the literature, there is a family of behavior prediction models for which the conditioning capability comes naturally: those that employ step-wise, iterative sampling (“roll-outs”) for multiple agents in a scene, e.g. [sociallstm, tang_multifuture, precog_Rhinehart_2019_ICCV, schmerling2018multimodal]. In such models, it is possible to control the action sequences for a subset of agents, so that the roll-out of others will take them into account. While flexible, these sample-based models have significant disadvantages for real-world applications: sample-based inference is risky to employ in a safety-critical system, iterative errors can compound [ross2011dagger], it is difficult to control sample diversity [rhinehart2018r2p2, kitani_diverse_forecasting_dpps], and attempting to jointly model all agents is often intractable computationally. Some past work has focused on tightly-coupled robot-human interaction in limited driving game environments: [schmerling2018multimodal] iteratively conditions on generated human actions in a CVAE framework; [sadigh2016planning] formulates the interaction problem as a 2-player game with human reward learned via inverse reinforcement learning.

As an alternative to sample-based models, there is a long line of work on single-shot, passive behavior prediction [chai2019multipath, DESIRE, neural_motion_planner_zeng2019, casas2018intentnet, SocialGAN, gao2020vectornet, chang2019argoverse]. These models are compelling due to tractability and practical parametric output distributions, and have become the popular choice in AV systems and associated benchmarks [chang2019argoverse]. However, these models ignore the fact that the AV ego-agent will take actions in the future, which may cause a critical reaction by another agent. Using such models makes decision-making challenging: because the models do not condition on any explicit ego actions, they must implicitly account for all possible ego-actions (or ignore interactions altogether). In practice, interaction modeling has been handled via aggregating neighboring agents’ observed states via max-pooling, transformer layers, or graph neural network architectures [sociallstm, mercat2020multi, casas2020spagnn, mangalam2020_journey_pecnet].

In this paper, we propose a single-shot, conditional behavior prediction model. Our CBP model is a powerful, end-to-end trained deep neural network, which takes into account static and dynamic scene elements: road lanes, agent state histories (vehicle, pedestrian and cyclist), traffic light information, etc. From these inputs, we predict a diverse set of future outcomes, represented as Gaussian Mixture distributions, where each mixture component corresponds to a future state sequence (i.e., a trajectory with uncertainty). We train these models to be capable of conditional inference by selectively adding future trajectory information for some agents as additional inputs to the model. We use large datasets of logged driving data and train models via maximum likelihood to output either conditional or passive (marginal) predictions for any subset of agents in a scene. The recently proposed WIMP model [wimp2020] is also a single-shot conditional inference model; ours differs in that we condition on generic trajectories for any subset of agents in a fully probabilistic framework.

Interactivity is a key concept for this problem, and a key contribution of this paper is to formalize it and obtain a simple and practical interactivity score. Equipped with a probabilistic model for conditional future distributions, we quantify the degree of influence one agent has on another as the KL-divergence between (a) the agent’s future distribution conditioned on the other’s future and (b) its marginal distribution. We then take an expectation over all possible conditioned futures for the other agent to get a final interactivity score. This results in a simple, agent-symmetric computation in the form of mutual information between the two agents’ futures. In contrast, past work has hand-designed models of surprise or discomfort for motion planning [pandey2010framework, scandolo2011anthropomorphic, sisbot2007human, refaat2019agent]. Entropy and mutual information have been previously used in AV applications as a measure of uncertainty to predict collisions [michelmore2018evaluating].

In real-world driving, the interactivity score can be used to anticipate driver interactions. When processing data offline, we demonstrate the use of the interactivity score for mining interactive scenarios that are potentially unsafe, since the target agent’s expectations are being violated. Furthermore, we demonstrate the benefits of the interactivity score for prioritizing agents for behavior prediction and planning. In contrast to previous work that built a special-purpose model trained directly for the task of prioritization [refaat2019agent], which was derived from an implicitly-defined side-channel output of a blackbox planner, we provide a formulation that is independent of a specific planner definition and consequently, more generally applicable.

Our contributions can be summarized as follows: (1) We provide a novel, principled information-theoretic definition of interactivity, which applies to any multi-agent interaction application, (2) we develop a first-of-its-kind, single-shot, deep neural network for probabilistic conditional behavior prediction and (3) we show our interactivity score improves state-of-the-art model performance in several settings.

## II Defining Agent Interactivity

We define an agent trajectory as a fixed-length, time-discretized sequence of agent states up to a finite time horizon. All quantities in this work consider a pair of agents $A$ and $B$. Without loss of generality, we consider $A$ to be the query agent whose plan for the future can potentially affect $B$, the target agent. The future trajectories of $A$ and $B$ are random variables $S_A$ and $S_B$. The marginal probability of a particular realization $s_B$ of agent $B$’s trajectory is given by $p(S_B = s_B)$, also indicated by the shorthand $p(s_B)$. The conditional distribution of agent $B$’s future trajectory given a realization $s_A$ of agent $A$’s trajectory is given by $p(S_B = s_B \mid S_A = s_A)$, indicated by the shorthand $p(s_B \mid s_A)$.

Even in highly interactive scenarios, agents may behave as expected by other agents and not exert any influence on one another. We define a surprising interaction as one in which the target agent experiences a change in their behavior due to the query agent’s observed trajectory. When we have access to ground-truth future trajectories, we can quantify interactions by estimating the change in log likelihood of the target’s ground-truth future $s_B$:

$$\Delta LL := \log p(s_B \mid s_A) - \log p(s_B). \tag{1}$$

A large change in the log-likelihood indicates a situation in which the likelihood of the target agent’s trajectory changes significantly as a result of the query agent’s action. If the target’s trajectory becomes more likely given the query agent’s trajectory $s_A$, then $\Delta LL$ will be positive. If it becomes less likely, then $\Delta LL$ will be negative. If there is no change, then $\Delta LL$ will be zero.

A query agent $A$ may need to estimate the impact of a planned future trajectory $s_A$ on the target agent $B$. Since we don’t have access to the ground-truth future for the target agent, we can quantify the potential for a surprising interaction by estimating the shift in the distribution of the target agent’s trajectory. More specifically, we use the KL-divergence between the conditional and marginal distributions for the target’s predicted future trajectory to quantify the degree of influence exerted on $B$ by a trajectory $s_A$:

$$D_{KL}\left[\, p(S_B \mid S_A = s_A) \,\|\, p(S_B) \,\right] = \int_{s_B} p(s_B \mid s_A) \log \frac{p(s_B \mid s_A)}{p(s_B)} \, ds_B. \tag{2}$$

For example, in Fig. 1, if the query agent decides to change lanes in front of the target agent, the target agent will have to slow down. In this case, the KL-divergence will reflect a significant change in the target agent’s expected behavior as a result of the query agent’s planned lane change.

In the absence of a particular plan for the query or target agent, we can consider the set of all possible actions for the query agent and compute the expectation of the degree of influence over all those possible actions. This expectation is defined as the *mutual information* between the two agents’ future trajectories $S_A$ and $S_B$, and is computed as:

$$I(S_A; S_B) = \int_{s_A} p(s_A) \, D_{KL}\left[\, p(S_B \mid S_A = s_A) \,\|\, p(S_B) \,\right] ds_A. \tag{3}$$

Mutual information expresses the dependence between two random variables. It is non-negative, $I(S_A; S_B) \geq 0$, and symmetric, $I(S_A; S_B) = I(S_B; S_A)$ [shannon1948mathematical]. We use this quantity as the interactivity score between agents $A$ and $B$. For example, if the target agent is driving closely behind the query agent, we expect their interactivity score to be high because the target agent is likely to respond immediately to any actions, such as deceleration or acceleration, from the query agent.
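To make the relationship between the KL-divergence and the mutual information concrete, the following minimal sketch evaluates both on a small discrete joint distribution. The joint tables, the two-outcome discretization of each agent's future, and all function names are illustrative assumptions, not part of the model described in this paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mutual_information(joint):
    """I(S_A; S_B) as the expectation over s_A of KL(p(S_B | s_A) || p(S_B)).

    joint[a][b] is a hypothetical joint probability table over a few
    discretized futures of the query agent A (rows) and target agent B (columns).
    """
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        if p_a[a] == 0.0:
            continue
        cond_b = [v / p_a[a] for v in row]  # p(S_B | S_A = a)
        total += p_a[a] * kl_divergence(cond_b, p_b)
    return total

def transpose(table):
    """Swap the roles of the two agents in the joint table."""
    return [list(col) for col in zip(*table)]

# Assumed toy tables: independent agents vs. strongly coupled agents.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.45, 0.05], [0.05, 0.45]]

mi_independent = mutual_information(independent)
mi_coupled = mutual_information(coupled)
mi_coupled_swapped = mutual_information(transpose(coupled))
```

In this toy setting the independent table gives a score of zero, the coupled table gives a positive score, and swapping the roles of the two agents leaves the score unchanged, matching the non-negativity and symmetry properties stated above.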

## III Method

In the previous section, we developed a measure of interactivity between a pair of agents. In this section, we discuss training a conditional behavior prediction model that can estimate the distributions $p(S_B \mid S_A = s_A)$ and $p(S_B)$. We discuss the internals of this model and losses. We then discuss the process for computing the interactivity score by sampling from the predicted distributions.

Let $x$ denote observations from the scene, including past trajectories of all agents, and context information such as lane semantics. Let $t$ denote a discrete time step, and let $s_t$ denote the state of an agent at time $t$. The realization of the future trajectory $s = [s_1, \ldots, s_T]$ is a sequence of states for $t \in \{1, \ldots, T\}$, a fixed horizon.

A CBP model predicts $p(S_B = s_B \mid S_A = s_A, x)$, the distribution of future trajectories for $B$ conditioned on $s_A$.
The CBP model receives as input a realization of the future trajectory of the query agent, $s_A$, which we refer to as agent $A$’s *plan*, or the *conditional query*.
Following the approach of MultiPath [chai2019multipath], the model predicts a set of $K$ trajectories for agent $B$, $\{\mu^k\}_{k=1}^{K}$, where each trajectory is a sequence of states $\mu^k = [\mu^k_1, \ldots, \mu^k_T]$, capturing potentially-different intents for agent $B$. The model predicts uncertainty over the intents as a softmax distribution $\pi(\mu^k \mid x, s_A)$. The model also predicts Gaussian uncertainty over the positions of the trajectory waypoints as:

$$\phi(s^B_t \mid \mu^k, x, s_A) = \mathcal{N}\left(s^B_t \mid \mu^k_t(x, s_A), \Sigma^k_t(x, s_A)\right). \tag{4}$$

This yields the full conditional distribution $p(S_B = s_B \mid x, s_A)$ as a Gaussian Mixture Model (GMM) with mixture weights fixed over all time steps of the same trajectory:

$$p(S_B = s_B \mid x, s_A) = \sum_{k=1}^{K} \pi(\mu^k \mid x, s_A) \prod_{t=1}^{T} \phi(s^B_t \mid \mu^k, x, s_A). \tag{5}$$

The Gaussian parameters $\mu^k_t$ and $\Sigma^k_t$ are directly predicted by a deep neural network (DNN). The softmax distribution is computed as $\pi(\mu^k \mid x, s_A) = \frac{\exp f_k(x, s_A)}{\sum_{i} \exp f_i(x, s_A)}$, where $f_k(x, s_A)$ are logit values also output by the DNN.
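As an illustration of how a trajectory likelihood under such a mixture can be evaluated, the sketch below computes the log-probability of a trajectory from mode logits and per-waypoint Gaussians. It assumes 2-D positions with diagonal covariances and hand-written toy numbers; the function names and values are hypothetical and do not come from the trained model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of mode logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def diag_gaussian_logpdf(point, mean, std):
    """Log density of a Gaussian with diagonal covariance (an assumption here)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((p - m) / s) ** 2
        for p, m, s in zip(point, mean, std)
    )

def trajectory_log_prob(traj, logits, means, stds):
    """log p(s) = logsumexp_k [ log pi_k + sum_t log N(s_t | mu_t^k, Sigma_t^k) ]."""
    log_pi = log_softmax(logits)
    per_mode = []
    for k in range(len(logits)):
        log_phi = sum(
            diag_gaussian_logpdf(s_t, mu_t, sd_t)
            for s_t, mu_t, sd_t in zip(traj, means[k], stds[k])
        )
        per_mode.append(log_pi[k] + log_phi)
    top = max(per_mode)
    return top + math.log(sum(math.exp(v - top) for v in per_mode))

# Hypothetical two-mode prediction over T = 3 waypoints (x, y in arbitrary units).
means = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # mode 0: keep going
    [(0.5, 0.0), (0.8, 0.0), (1.0, 0.0)],  # mode 1: slow down
]
stds = [[(0.5, 0.5)] * 3, [(0.5, 0.5)] * 3]
logits = [0.0, 1.0]

lp_on_mode = trajectory_log_prob(means[1], logits, means, stds)
lp_far = trajectory_log_prob([(9.0, 9.0)] * 3, logits, means, stds)
```

Working in log space with a log-sum-exp over modes avoids underflow when the product over time steps becomes very small.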

The computation of the interactivity score also requires the estimation of marginal distributions, $p(S_B \mid x)$, which are not conditioned on any future plan for $A$. We train a single model which can produce both marginal and conditional predictions, in order to have comparable quantities without any uncertainty due to model variance. Marginal predictions, $p(S_B \mid x)$, are provided by turning off inputs from the conditional query encoder in the model. We adopt the shorthands $\pi(\mu^k \mid x)$ for the mixture weights and $\phi(s^B_t \mid \mu^k, x)$ for the per-step Gaussians to describe this operation, which gives us the marginal distribution as

$$p(S_B = s_B \mid x) = \sum_{k=1}^{K} \pi(\mu^k \mid x) \prod_{t=1}^{T} \phi(s^B_t \mid \mu^k, x). \tag{6}$$

Given the conditional and marginal predictions of the CBP model, we can now compute the mutual information. Directly computing the mutual information between the future states of two agents via Eq. (3) is intractable between the GMM distributions (Eq. (6)). We estimate the outer expectation via importance sampling. Rather than sampling samples from the marginal distribution, we will use the most likely 6 modes of the marginal distribution’s GMM in Eq. (6) as in standard motion forecasting metrics [chang2019argoverse], with :

(7) |

where the conditional and marginal probabilities are evaluated via Eqs. (5) and (6), respectively. The use of other, more efficient approaches for estimating KL divergence is left to future work [hershey2007approximating].
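One way such a mode-based estimate could be organized is sketched below. This is an assumption-laden illustration rather than the estimator used in this work: it keeps the most likely query modes, renormalizes their weights, and approximates each inner KL-divergence by a weighted sum of log-likelihood ratios at the target's conditional modes. The callables and the toy probability tables are hypothetical stand-ins for a trained CBP model.

```python
import math

def estimate_interactivity(query_modes, target_modes_given, log_p_cond, log_p_marg, top_k=6):
    """Approximate E_{s_A}[ KL(p(S_B | s_A) || p(S_B)) ] from predicted modes.

    query_modes: list of (weight, s_A) pairs from the query agent's marginal.
    target_modes_given(s_A): list of (weight, s_B) conditional modes for the target.
    log_p_cond(s_B, s_A), log_p_marg(s_B): log-likelihood evaluators.
    Renormalizing the kept query weights is an assumption of this sketch.
    """
    top = sorted(query_modes, key=lambda m: m[0], reverse=True)[:top_k]
    norm = sum(w for w, _ in top)
    score = 0.0
    for w_a, s_a in top:
        kl = sum(
            w_b * (log_p_cond(s_b, s_a) - log_p_marg(s_b))
            for w_b, s_b in target_modes_given(s_a)
        )
        score += (w_a / norm) * kl
    return score

# Toy stand-in for a model: discrete "futures" with assumed probabilities.
COND = {"brake": {"slow": 0.9, "cruise": 0.1}, "go": {"slow": 0.1, "cruise": 0.9}}
MARG = {"slow": 0.5, "cruise": 0.5}
query_modes = [(0.5, "brake"), (0.5, "go")]

score = estimate_interactivity(
    query_modes,
    target_modes_given=lambda s_a: [(p, s_b) for s_b, p in COND[s_a].items()],
    log_p_cond=lambda s_b, s_a: math.log(COND[s_a][s_b]),
    log_p_marg=lambda s_b: math.log(MARG[s_b]),
)
```

Because every query mode is scored independently, the per-query evaluations can be batched, which is what makes this style of estimate cheap at inference time.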

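To train the model for conditional prediction, we set the conditional query/plan input to agent $A$’s ground-truth future trajectory from the training dataset. We learn to predict the distribution parameters $f_k$, $\mu^k_t$, and $\Sigma^k_t$ via supervised learning with the negative log-likelihood loss, which for a single training example is

$$\mathcal{L} = -\log \pi(\mu^{\hat{k}} \mid x, s_A) - \sum_{t=1}^{T} \log \mathcal{N}\left(s^B_t \mid \mu^{\hat{k}}_t(x, s_A), \Sigma^{\hat{k}}_t(x, s_A)\right), \tag{8}$$

where $\hat{k}$ is the index of the mode of the distribution that has the closest endpoint to the given ground-truth trajectory, $\hat{k} = \arg\min_k \lVert \mu^k_T - s^B_T \rVert_2$.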
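The mode-selection step and the resulting loss can be sketched as follows for a single training example. The sketch assumes 2-D waypoints with diagonal covariances and uses made-up numbers; the function names are hypothetical and the code is not the training implementation used here.

```python
import math

def closest_endpoint_mode(mode_means, gt_traj):
    """Index of the mode whose final waypoint is closest to the ground-truth endpoint."""
    gx, gy = gt_traj[-1]
    return min(
        range(len(mode_means)),
        key=lambda k: (mode_means[k][-1][0] - gx) ** 2 + (mode_means[k][-1][1] - gy) ** 2,
    )

def nll_loss(logits, mode_means, mode_stds, gt_traj):
    """Negative log-likelihood of the ground truth under the selected mode.

    Adds the classification term -log pi_k_hat and the Gaussian regression term
    for each waypoint (diagonal covariance is an assumption of this sketch).
    """
    k_hat = closest_endpoint_mode(mode_means, gt_traj)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    loss = -(logits[k_hat] - log_z)
    for point, mean, std in zip(gt_traj, mode_means[k_hat], mode_stds[k_hat]):
        for p, mu, s in zip(point, mean, std):
            loss += 0.5 * math.log(2.0 * math.pi) + math.log(s) + 0.5 * ((p - mu) / s) ** 2
    return loss

# Hypothetical two-mode prediction and ground truth over T = 2 waypoints.
mode_means = [
    [(1.0, 0.0), (2.0, 0.0)],
    [(0.5, 0.0), (0.6, 0.0)],
]
mode_stds = [[(1.0, 1.0)] * 2, [(1.0, 1.0)] * 2]
gt = [(0.9, 0.1), (1.9, 0.1)]

k_hat = closest_endpoint_mode(mode_means, gt)
loss_confident = nll_loss([2.0, 0.0], mode_means, mode_stds, gt)
loss_wrong = nll_loss([0.0, 2.0], mode_means, mode_stds, gt)
```

Only the selected mode receives a regression gradient in this formulation, while the logits of all modes are updated through the softmax term.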

Above, we describe how to produce predictions for a single agent $B$. However, for increased efficiency, our model produces predictions for multiple agents in parallel. To encourage the model to maintain the fundamental physical property that agents cannot occupy the same future location in space-time, we include an additional loss function:

(9) |

where and are the modes and probabilities of the future trajectory distributions for agents and .

## IV Experiments

### IV-A Data

We collected a large, in-house dataset of real-world driving from urban and suburban environments, using a vehicle equipped with an industry-grade sensor and perception stack, which provides us with tracked objects. In total, the training set has 1.9 billion vehicle agents that we learn to model, from 19 million unique scenarios, comprising 18 years of continuous driving data. The models receive 2 seconds of history and predict 15 seconds of future behavior for all agents in the scene, including the AV. The states of the agents are recorded at 5 Hz. Features describing the past states of the agents include position, velocity vector, acceleration vector, orientation, and angular velocity. There are also binary attributes indicating whether the vehicle is signaling to turn left or right, and whether it is parked. The lane markings and boundaries are represented by 500 points sampled around the current location of each predicted vehicle to balance memory requirements.

At training time, we select one agent uniformly at random from the vehicles in the scene to be the query agent. For 95% of the samples, the query agent’s future ground-truth is fed to the model as the conditional query input. For the other 5%, no conditional query is provided, leading to marginal behavior prediction, with the split chosen through cross-validation.
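The training-time query selection described above can be sketched as follows. The function name, the agent identifiers, and the use of `None` to represent a missing conditional query are assumptions made for illustration only.

```python
import random

def sample_training_query(agent_ids, futures, p_conditional=0.95, rng=random):
    """Pick a query agent uniformly at random and decide whether to reveal its future.

    Returns (query_id, conditional_query). The conditional query is the agent's
    ground-truth future with probability p_conditional and None otherwise, in
    which case the example trains the marginal prediction.
    """
    query_id = rng.choice(agent_ids)
    if rng.random() < p_conditional:
        return query_id, futures[query_id]
    return query_id, None

# Hypothetical scene with three vehicles and one-waypoint futures.
rng = random.Random(0)
futures = {"car_1": [(1.0, 0.0)], "car_2": [(0.0, 1.0)], "car_3": [(2.0, 2.0)]}
draws = [sample_training_query(list(futures), futures, rng=rng) for _ in range(10000)]
conditional_fraction = sum(query is not None for _, query in draws) / len(draws)
```

Mixing both cases in one training stream is what lets a single set of weights serve conditional and marginal queries.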

For every scene, the model predicts future behaviors for up to the 20 closest vehicles. Other agents (vehicles, pedestrians, and cyclists) are still used in the agent state feature encoder, but the model doesn’t predict futures for them. We observed that prediction performance beyond 20 agents degrades rapidly due to sensor limitations.

### IV-B Model Architecture

The architecture is composed of an input encoder stage, a trajectory decoder stage, and a GNN-based trajectory refinement stage, shown in Fig. 2. The encoder stage is composed of a road lane encoder which uses an architecture similar to VectorNet [gao2020vectornet], and a track history encoder which uses a 64-dimensional LSTM applied to 5 time steps of past state observations comprising 1 second of history. The outputs of these two encoders are concatenated and passed into a decoder which outputs a sequence of points via predicted polynomial coefficients for trajectory modes [chai2019multipath]. In our experiments, we use a tenth-degree polynomial. The resulting trajectories are further refined using a GNN [Battaglia2018GNNs, casas2020spagnn]. The GNN uses an attention-based aggregation function that combines relative agent positions as edge features to form messages passed to each node [vaswani2017attention]. We apply one message update, which passes trajectory information between neighboring agents, and then re-apply trajectory decoding. This process can refine the agents’ trajectory distributions with awareness of their neighbors’ distributions. Further details of this state-of-the-art model architecture are currently under anonymous review.

## V Results

### V-A Metrics

Given a labeled example $(x, s_A, s_B)$, the weighted Average Distance Error (wADE) over the most likely 6 modes of the conditional prediction of agent $B$’s future trajectory given the query agent $A$’s future trajectory is:

$$\text{wADE}(B \mid A) = \sum_{k=1}^{6} \pi(\mu^k \mid x, s_A) \, \frac{1}{T} \sum_{t=1}^{T} \left\lVert s^B_t - \mu^k_t(x, s_A) \right\rVert_2, \tag{10}$$

where $\mu^k_t(x, s_A)$ is the $k$th mode for the predicted position of agent $B$ at time $t$ with its respective probability of $\pi(\mu^k \mid x, s_A)$. Likewise, we can compute the marginal metric $\text{wADE}(B)$ using $\mu^k_t(x)$ and $\pi(\mu^k \mid x)$. Computing their difference, $\Delta\text{wADE}(B) = \text{wADE}(B) - \text{wADE}(B \mid A)$, quantifies the reduction in $B$’s error due to conditioning on $A$. Another established metric for behavior prediction is the minimum Average Distance Error (minADE), defined for conditional models over the most likely 6 modes as:

$$\text{minADE}(B \mid A) = \min_{k \in \{1, \ldots, 6\}} \frac{1}{T} \sum_{t=1}^{T} \left\lVert s^B_t - \mu^k_t(x, s_A) \right\rVert_2. \tag{11}$$

To obtain a low minADE value, the model needs to accurately predict the ground-truth future as one of its predicted intents. On the other hand, the wADE metric is more suitable for evaluating multi-modal distributions and can reflect shifts in distribution of intent probabilities. Therefore, we use wADE as the main metric in the following results. Furthermore, $\Delta\text{wADE}$ is closely related to the definition of $\Delta LL$ in Eq. (1), but since it is weighted by the distance error, it is less sensitive to prediction errors for nearly-stationary vehicles.
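The difference between the two metrics is easiest to see on a small example. The sketch below computes both for hand-written modes; the function names and numbers are illustrative assumptions, and the mode probabilities are used as given, without renormalization over the kept modes.

```python
import math

def average_distance(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth waypoints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def top_modes(modes, probs, k=6):
    """Keep the k most likely modes as (probability, trajectory) pairs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(probs[i], modes[i]) for i in order]

def wade(modes, probs, gt, k=6):
    """Probability-weighted average distance error over the top-k modes."""
    return sum(p * average_distance(m, gt) for p, m in top_modes(modes, probs, k))

def min_ade(modes, probs, gt, k=6):
    """Smallest average distance error among the top-k modes."""
    return min(average_distance(m, gt) for _, m in top_modes(modes, probs, k))

# Hypothetical example: mode 0 matches the ground truth, mode 1 does not.
gt = [(1.0, 0.0), (2.0, 0.0)]
modes = [[(1.0, 0.0), (2.0, 0.0)], [(1.0, 3.0), (2.0, 4.0)]]

wade_marginal = wade(modes, [0.5, 0.5], gt)      # 0.5 * 0 + 0.5 * 3.5
wade_conditional = wade(modes, [0.9, 0.1], gt)   # 0.9 * 0 + 0.1 * 3.5
delta_wade = wade_marginal - wade_conditional
min_ade_marginal = min_ade(modes, [0.5, 0.5], gt)
```

Here the conditional prediction only shifts probability toward the correct mode, so minADE is unchanged while wADE drops, which is the kind of distribution shift wADE is meant to capture.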

### V-B Conditional Behavior Prediction

Comparing accuracy between marginal and conditional predictions from the trained model shows a 10% improvement for conditional predictions, as seen in Table I. This is clear confirmation that our model is leveraging future information to improve predictive power, as expected. The early fusion conditional encoder receives the conditional query at an earlier stage in the model, whereas the late fusion setup feeds the query to the GNN only at the final prediction stage. As the results show, the early-fusion variant significantly outperforms late fusion.

### V-C Evaluation on Argoverse

Our model is competitive with state of the art on the popular Argoverse benchmark dataset [chang2019argoverse]. On the validation dataset, we achieve a minADE of 0.7488, which is near state of the art in recent work: 0.71 by Liang et al. [liang2020learning], 0.728 by TNT [zhao2020tnt], and 0.75 by WIMP [wimp2020]. By conditioning on the sensor vehicle, the CBP model reduces minADE by 0.0079 (about 1.1%) to 0.7409, consistent with our more exhaustive studies on the internal dataset.

Method | wADE | minADE |
---|---|---|
Non-conditional | 3.486 ± 0.0017 | 1.207 ± 0.00062 |
Early fusion (encoder) | 3.142 ± 0.0016 | 1.170 ± 0.00061 |
Late fusion (GNN) | 3.469 ± 0.0017 | 1.209 ± 0.00063 |
Early and late fusion | 3.160 ± 0.0016 | 1.172 ± 0.00067 |

Table I: Comparison of CBP models on an evaluation dataset containing over 8 million agent pairs. Metrics are computed and averaged over all (query agent, target agent) pairs possible in every scene. The mean error is computed only over predictions for the target agent and does not include predictions for the query agent. Each entry is reported as the mean ± the standard error of the mean.

### V-D Distribution of Interactivity Scores

Figure 3 shows the histogram of interactivity scores between all agent pairs in the evaluation dataset. Interactions are rare in most datasets, so the interactivity score may be a good tool to automatically mine a dataset for interactive examples.

### V-E Interactivity Score Predicts Surprise

The mutual information score allows us to discover scenarios with a potential for surprising interactions, where the ground-truth future of the query agent causes a target agent to change its behavior. Using the ground-truth future trajectories of agents $A$ and $B$, we can quantify how query agent $A$ affected target agent $B$ in reality by comparing the prediction error between the conditional and marginal (non-conditional) models. A large, positive $\Delta\text{wADE}$ (the reduction in wADE obtained by conditioning) indicates that providing the query agent’s future significantly improves the prediction accuracy for the target agent.

Figure 4a shows that there is a strong correlation between high values of mutual information and high values of $\Delta\text{wADE}$. In other words, agents with high interactivity scores are more likely to exert influence on one another. Note that the interactivity score does not use any future information, while $\Delta\text{wADE}$ does.

Also, in percentiles with high mutual information, there is a high occurrence of examples where the conditional prediction errors are much lower than marginal prediction errors. These are scenes where the behavior of the query agent has significantly affected the target agent. Such examples are not present in the lower mutual information percentiles.

On the other hand, Figure 4b shows a decrease in average $\Delta\text{wADE}$ for the percentiles with the highest mutual information. Upon inspection of a portion of scenes in the top percentiles, we observe many examples where the agents in a pair are positioned very close to each other and can exert influence on one another; however, since they are almost stationary and $\Delta\text{wADE}$ is sensitive to distance, the impact of influence on $\Delta\text{wADE}$ is small. We also observe that a high KL divergence for the target agent given the ground-truth query trajectory strongly correlates with high values of $\Delta\text{wADE}$. Given the future trajectory of the query agent, we can predict surprising interactions even more accurately than without future trajectories for either agent.

Figure 7 shows two examples of pairs of interacting agents discovered in the evaluation set by filtering by high mutual information and high $\Delta\text{wADE}$. In the first example, one vehicle yields to another in a turn. While in the marginal prediction there is a high probability for the target agent to cross the intersection, the conditional prediction shows the target agent yielding. In the second example, the target agent slows down behind a query agent which is braking.

### V-F Selecting Salient Agents

This section demonstrates using the interactivity score to predict which vehicles are salient for planning for the autonomous vehicle. We predict the trajectory of the AV both in the original scene, and in a modified scene where some agents have been removed. We show agents with high interactivity with the AV are more likely to affect its behavior, compared to agents that are just closer to it. In the dataset, we typically have 10 to 32 cars in a scene, but in practice, very few of these cars are actually relevant for planning for the AV, so they could potentially be excluded from high-fidelity behavior predictions on-board the vehicle.

In the first experiment, we compute the mutual information between the autonomous vehicle and every other agent. We choose the top $k$ agents with the largest mutual information values. Then, we remove all others from the scene, and use only the top $k$ agents’ states to predict the trajectory of the AV. We compare this approach to selecting the top $k$ agents closest in distance to the AV in the scene. This is a common heuristic used for identifying relevant vehicles in the scene.
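The two selection rules being compared can be sketched in a few lines. The agent names, scores, and distances below are invented for illustration and do not come from the evaluation data.

```python
def top_k_by_interactivity(mutual_info, k):
    """Agent ids with the k largest interactivity scores relative to the AV."""
    return sorted(mutual_info, key=mutual_info.get, reverse=True)[:k]

def top_k_by_distance(distance, k):
    """Agent ids of the k agents closest to the AV (the baseline heuristic)."""
    return sorted(distance, key=distance.get)[:k]

# Hypothetical per-agent interactivity scores and distances to the AV.
mutual_info = {"lead_car": 0.80, "follower": 0.55, "adjacent_lane": 0.05, "far_ahead": 0.20}
distance = {"lead_car": 12.0, "follower": 9.0, "adjacent_lane": 4.0, "far_ahead": 40.0}

keep_by_mi = top_k_by_interactivity(mutual_info, k=2)
keep_by_distance = top_k_by_distance(distance, k=2)
```

In this made-up scene the distance heuristic keeps a nearby vehicle in an adjacent lane, while the interactivity ranking keeps the lead car that the AV is more likely to react to.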

Figure 4(a) shows that mutual information can identify more relevant agents for planning up to . For larger numbers of agents, the distance heuristic outperforms mutual information as an agent selection mechanism. In practice, for agent prioritization onboard an AV, mutual information could be combined with other heuristics, such as distance.

In the second experiment, we do not remove the pruned agents from the scene, but remove them from the set of agents whose behaviors are to be predicted by the model. In this case, the pruned agents are visible to the model as scene context. Figure 4(b) compares using interactivity score vs. a distance heuristic in this task. As the results show, moving less interactive agents to scene context actually improves predictions for the autonomous vehicle, as long as at least the 3 most interactive agents are kept in the prediction set.

One potential explanation for this result is that reducing the prediction set of the model provides an attention mechanism for the prediction of the autonomous vehicle that emphasizes the potential future trajectories of certain agents over others. In particular, the message-passing mechanism in the GNN can focus only on the relevant neighbors for the AV.

Figure 6 visualizes pruning agents by mutual information vs. pruning by distance in the same scene. We see that the mutual information selects vehicles that are behind and ahead of the AV in the same lane, in addition to a few vehicles further ahead in neighboring lanes. The distance metric, on the other hand, selects vehicles that are multiple lanes away and are not likely to interact with the AV.

### V-G Challenges and Future Work

Fig. 8 shows an example where our metrics have selected a pair of vehicles slowing down in parallel lanes at an intersection. These agents are reacting to a change in traffic light state, rather than to one another. The CBP model cannot differentiate between correlation and causation of two agents’ trajectories. Before using a trajectory $s_A$ as a query, one can compute the marginal likelihood of the query, $p(s_A)$, to determine whether $s_A$ is a likely query for which the model can accurately provide counterfactual predictions.

The interactivity score can be evaluated very efficiently by pre-computing the embedding of the roadgraph, which is the most expensive part of the architecture in practice, and batching the different queries to evaluate them in parallel. We could also consider using our interactivity score as a reward signal in cooperative multi-agent reinforcement learning, similar to the notion of influence introduced in [jaques2019social].