I Introduction
Self-driving is one of the most exciting challenges of contemporary artificial intelligence because of its potential to revolutionize transportation. While there has been incredible progress in machine learning and robotics in recent years, many challenges remain on the way to full autonomy. One of the most critical is that self-driving vehicles will need to share the space with human drivers, who can perform a very diverse set of maneuvers, including compromising behaviors. In particular, these maneuvers and behaviors are highly determined by the interactions with neighboring drivers
[1, 2, 3]. Understanding human intention is a very difficult task. In order to predict other drivers' future behavior, it is necessary to perceive their past motion, analyze the interplay with other agents, and process the information available from the scene such as the lane graph. Therefore, for autonomous vehicles to coexist with human drivers on the roads, they need to be able to emulate human behaviors [4, 5, 6]. For this reason, the problem of detection and long-term future behavior forecasting of other vehicles in realistic environments is at the core of safe motion planning.
In the past few years, many deep learning approaches have been developed to detect objects from LiDAR point clouds
[7, 8, 9, 10]. They mostly differ in the input representation and the architecture. Voxels [9, 11], bird's eye view [8, 12, 13] or range view [7, 14, 15] representations are typically employed. Recent work [16, 5, 17] showed how to utilize convolutional neural networks (CNNs) to produce future trajectories of actors given 3D object detections as input. However, this approach cannot recover from mistakes at the detection stage. A recent seminal work [11] proposed to jointly perform object detection and motion forecasting within the same neural network. This was further extended to also reason about actor intentions [4]. However, all these approaches ignore the social interactions between the agents.

Interactions are very common in real-world driving. Illustrative examples are the negotiations between drivers that take place at 4-way stop intersections, yielding situations like unprotected left turns, and even the very simple car-following behavior. Modeling these interactions helps significantly to reduce the uncertainty in predicting future behavior. Towards this goal, we propose an efficient probabilistic model that leverages recent advances in graph neural networks (GNNs) [18] to capture the interactions between vehicles. Our model is fully differentiable, enabling the joint optimization of the detection and behavior forecasting tasks, which mitigates the propagation of early errors.
We showcase the power of our approach by showing significant improvements over the state-of-the-art across all detection, motion forecasting, and interaction metrics on both ATG4D [8], a large-scale dataset with over a million frames that we collected, as well as the newly released nuScenes [19] dataset. In the remainder of the paper, we first give an overview of the related work, then present our model, and finally discuss our experimental setup, exhibiting both quantitative and qualitative results.
II Related Work
In this section, we first review recent advances in object detection from point clouds. We then discuss motion estimation approaches as well as methods for agent interaction modeling. Finally, we review joint perception and behavior forecasting methods.
Object Detection from Point Clouds
A popular approach is to use 3D convolutional networks that operate over voxel grids [9, 20, 11]. However, the sparsity of point clouds makes the computation redundant. Front-view representations [7, 14, 15] exploiting the range information of a LiDAR sensor have also been explored with success, although they lose the original metric space and have to handle large variations in object size caused by the projection. Another option is to handle point clouds directly (without voxelization) [21, 22, 23, 24]. Unfortunately, all the above methods suffer from either limited performance or heavy computation [25]. Recently, bird's-eye-view detectors that exploit 2D convolutions over the ground plane [8, 10, 12, 13] have shown superior performance in terms of speed and accuracy.
Motion Forecasting
DESIRE [26] proposed a recurrent variational autoencoder to generate trajectories from ground-truth past trajectories and images. R2P2 [27] proposed a flow-based generative model to learn object dynamics. However, both DESIRE and R2P2 are not ideal for time-critical applications due to the expensive sampling needed to cover all possible outcomes. SIMP [28] parametrizes the output space as insertion areas where a vehicle could go, predicting an estimated time of arrival and a spatial offset. [16], [5] and [17] create bird's eye view rasters from the lane graph of the scene and perception results produced by a separate system to predict future trajectories of road users. Unfortunately, the actor-centric rasterization they employ poses a challenge for real-time applications. Another limitation of these methods is that the perception and motion forecasting modules are learned separately.
Interaction Modeling
[29] proposed to couple game theory and deep learning to model the social aspect of pedestrian behavior. Several methods [30, 31, 32, 33, 34, 35, 36, 37] have exploited a variety of social pooling layers to include relational reasoning in convolutional and recurrent neural networks. Graph neural networks (GNNs) [38, 39, 40] have recently been shown to be very effective. NRI [40] models the interplay of components by using GNNs to explicitly infer interactions while simultaneously learning the dynamics. CAR-Net [41] models agent-scene interactions by coupling two specialized attention mechanisms. However, all the aforementioned methods still assume perfect perception when facing the behavior forecasting task.

Joint Perception and Behavior Forecasting
FaF [11] unified object detection and short-term motion forecasting from LiDAR. IntentNet [4] modified FaF's architecture, replacing 3D with 2D convolutions and adding the prediction of high-level intentions for each agent by exploiting rich semantic information from HD maps. This was further extended to also predict a cost map for ego-vehicle motion planning [6]. However, although these models perform future behavior prediction of vehicles in urban scenes, they do not model multi-agent interactions explicitly. Here we show how to leverage the success of joint detection and prediction while reasoning about the interactions between agents.
III Object Detection from LiDAR and HD Maps
In this section, we first discuss our input parametrization to exploit 3D LiDAR and HD maps. We then explain the first stage of our model, i.e., the backbone network for object detection (top row in Fig. 2).
We voxelize the 3D LiDAR point cloud similarly to [8], with the difference that we leverage the ground height information available in our HD maps to obtain our voxelized LiDAR [10]. Compared to a sensor-relative height, this reduces the variance in the Z coordinates of vehicles, since these always lie on the ground, allowing our model to learn height priors. In order to exploit motion information, we follow [11] and leverage multiple LiDAR sweeps by projecting the past sweeps to the coordinate frame of the current sweep, taking into account the ego-motion. Following [4], we stack the height and time dimensions into the channel dimension of our tensor in order to exploit 2D convolutions. This provides us with a bird's eye view (BEV) 3D occupancy tensor of dimensions $\frac{H}{\Delta_H} T \times \frac{L}{\Delta_L} \times \frac{W}{\Delta_W}$, where L=140, W=80 and H=5 meters are the longitudinal, transversal and normal physical dimensions of the scene we employ in the ATG4D dataset, $\Delta_L$, $\Delta_W$ and $\Delta_H$ (in meters/pixel) are the voxel sizes in the corresponding directions, and T=10 is the number of LiDAR sweeps we employ in both datasets. We reduce the region of interest to 50 by 50 meters in nuScenes due to the limited range of its 32-beam LiDAR sensor as well as of the annotations.

Following [4], our input raster map contains information regarding roads, lanes, intersections, crossings, traffic signs and traffic lights¹. In such a representation, different semantics are encoded in separate channels to ease the learning of the CNN and avoid predefining orderings in the raster. For instance, yellow markers denoting the barrier between opposing traffic are rasterized in a different channel than white markers. In total, this representation consists of 17 binary channels.

¹ We use an image-based CNN to estimate the state of the traffic light.
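As an illustration, the voxelization described above can be sketched as follows. This is a minimal NumPy version under our own naming; the function name, argument layout and default voxel sizes are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def bev_occupancy(sweeps, x_range=(-40.0, 100.0), y_range=(-40.0, 40.0),
                  z_range=(0.0, 5.0), voxel=(0.2, 0.2, 1.0)):
    """Rasterize T past LiDAR sweeps into a BEV occupancy tensor.

    Each sweep is an (N, 3) array of (x, y, z) points, already projected to
    the current frame with the ground height subtracted. Height and time are
    stacked into the channel dimension so 2D convolutions can be used.
    Ranges match the L=140, W=80, H=5 m scene; voxel sizes are illustrative.
    """
    nx = int((x_range[1] - x_range[0]) / voxel[0])
    ny = int((y_range[1] - y_range[0]) / voxel[1])
    nz = int((z_range[1] - z_range[0]) / voxel[2])
    grid = np.zeros((len(sweeps) * nz, ny, nx), dtype=np.float32)
    for t, pts in enumerate(sweeps):
        ix = np.floor((pts[:, 0] - x_range[0]) / voxel[0]).astype(int)
        iy = np.floor((pts[:, 1] - y_range[0]) / voxel[1]).astype(int)
        iz = np.floor((pts[:, 2] - z_range[0]) / voxel[2]).astype(int)
        ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny) & (iz >= 0) & (iz < nz)
        # mark occupied voxels; time t and height iz share the channel axis
        grid[t * nz + iz[ok], iy[ok], ix[ok]] = 1.0
    return grid
```

Occupancy is binary per voxel, so the channel count is simply the number of height slices times the number of sweeps.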
We build our object detection network on top of PIXOR [8], a lightweight state-of-the-art object detector. In particular, we extend the single-branch network of PIXOR to a two-stream network such that one stream processes LiDAR point clouds and the other processes HD maps (see Fig. 2). We modify PIXOR's backbone by first reducing the number of layers in the first 4 residual blocks from (3, 6, 6, 4) to (2, 2, 3, 6) in order to save computation. LiDAR point clouds are fed to this condensed backbone. To process the high-definition map, we replicate this backbone but halve the number of filters at each layer for efficiency purposes. After extracting features from the LiDAR and HD map streams, we concatenate them along the channel dimension. The concatenated features are then fused by a convolutional header network. Two convolutional layers are then used to output a confidence score and a bounding box for each anchor location, which are further reduced to the final set of candidates by applying non-maximum suppression (NMS). These modifications allow us to create a high-performing detector that is also very fast.
IV Relational Behavior Forecasting
The second stage of our model provides a probabilistic formulation for predicting the future states of the detected vehicles by exploiting the interactions between different actors. We denote the state of the $i$-th actor at time $t$ as $s^t_i$. The state includes a future trajectory composed of 2D waypoints $(x^t_i, y^t_i)$ and heading angles $\theta^t_i$. Let $\mathcal{X}$ be the scene input composed of LiDAR and the HD map. The number of detected actors in a scene is denoted as $N$ and the number of future time steps to be predicted as $T$. Note that the number of actors varies from scene to scene, and our relational model is general and works for any cardinality. As the number of vehicles in the scene is not large (typically less than a hundred), we use a fully connected directed graph to let the model figure out the importance of the interplay for each pair of actors in a bidirectional fashion. Bidirectionality is important as the relationships are asymmetric, e.g., a vehicle following the vehicle in front.
To design our probabilistic relational behavior forecasting approach, we take inspiration from the Gaussian Markov random field (Gaussian MRF) and design a novel type of graph neural network for this task. In the following, we first describe the Gaussian MRF and its canonical inference algorithm, Gaussian belief propagation (GaBP). We then briefly introduce graph neural networks (GNNs) and dive into our Spatially-Aware Graph Neural Networks (SpAGNN).
IV-A Gaussian MRFs and Gaussian Belief Propagation
We now introduce the Gaussian MRF and its inference algorithm, Gaussian belief propagation, in our problem context. Conditioned on the observed input and detection output, we assume the future states can be predicted independently for different future time steps; exploring temporal dependency is left as future work. Therefore, from now on, we drop the time subscript for simplicity. In a Gaussian MRF, the joint probability is assumed to be a multivariate Gaussian distribution, i.e., $p(Y \mid \mathcal{X}) = \mathcal{N}(Y; \mu, \Sigma)$, where $Y$ is the concatenation of all per-actor states $y_i$, and $\mu$ and $\Sigma$ are the model parameters. Based on the interaction graph, we can decompose the joint probability as follows:

$$p(Y \mid \mathcal{X}) \;\propto\; \prod_{i} \phi_i(y_i) \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j) \tag{1}$$
where the unary and pairwise potentials are

$$\phi_i(y_i) = \exp\!\left(-\tfrac{1}{2}\, y_i^{\top} A_i\, y_i + b_i^{\top} y_i\right), \qquad \psi_{ij}(y_i, y_j) = \exp\!\left(-\, y_i^{\top} S_{ij}\, y_j\right). \tag{2}$$

Note that $A_i$, $b_i$ and $S_{ij}$ depend on the input $\mathcal{X}$. Their specific functional forms can be designed according to the application. It is straightforward to show that the unary potentials follow a Gaussian distribution, i.e., $\phi_i(y_i) \propto \mathcal{N}\!\left(y_i;\, A_i^{-1} b_i,\, A_i^{-1}\right)$.
To compute the marginal distributions $p(y_i \mid \mathcal{X})$, Gaussian belief propagation (GaBP) [42] is often adopted for exact inference. In particular, denoting the mean and precision (inverse covariance) matrix of the message from node $i$ to node $j$ as $\mu_{i \to j}$ and $P_{i \to j}$, one can derive the following iterative update equations based on the belief propagation algorithm and the Gaussian integral:

$$
\begin{aligned}
P_{i \setminus j} &= A_i + \sum_{k \in N(i) \setminus j} P_{k \to i}, \qquad
h_{i \setminus j} = b_i + \sum_{k \in N(i) \setminus j} P_{k \to i}\, \mu_{k \to i}, \\
P_{i \to j} &= -\, S_{ij}^{\top} P_{i \setminus j}^{-1} S_{ij}, \qquad\;\;\,
\mu_{i \to j} = -\, P_{i \to j}^{-1} S_{ij}^{\top} P_{i \setminus j}^{-1} h_{i \setminus j},
\end{aligned} \tag{3}
$$

where $N(i)$ is the neighborhood of node $i$ and $N(i) \setminus j$ is the same set without node $j$. Once the message passing converges, one can compute the exact marginal mean and precision as

$$P_i = A_i + \sum_{k \in N(i)} P_{k \to i}, \qquad \mu_i = P_i^{-1} h_i, \tag{4}$$

where $h_i = b_i + \sum_{k \in N(i)} P_{k \to i}\, \mu_{k \to i}$.
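The GaBP updates of Eq. (3) and (4) can be made concrete with a small scalar NumPy implementation in information form (the function name and the chain example are ours, not from the paper). On tree-structured precision matrices it recovers the exact marginals:

```python
import numpy as np

def gabp(A, b, iters=20):
    """Scalar Gaussian belief propagation, information form.

    The model is p(x) ∝ exp(-0.5 x'Ax + b'x); messages carry a precision
    and a mean per directed edge. Exact on tree-structured A.
    """
    n = len(b)
    P = np.zeros((n, n))   # P[i, j]: precision of message i -> j
    M = np.zeros((n, n))   # M[i, j]: mean of message i -> j
    nbrs = [[j for j in range(n) if j != i and A[i, j] != 0] for i in range(n)]
    for _ in range(iters):
        P_new, M_new = P.copy(), M.copy()
        for i in range(n):
            for j in nbrs[i]:
                # cavity terms: everything reaching i except the message from j
                p_cav = A[i, i] + sum(P[k, i] for k in nbrs[i] if k != j)
                h_cav = b[i] + sum(P[k, i] * M[k, i] for k in nbrs[i] if k != j)
                P_new[i, j] = -A[i, j] ** 2 / p_cav   # Eq. (3), scalar case
                M_new[i, j] = h_cav / A[i, j]
        P, M = P_new, M_new
    # marginal precision and mean per node, Eq. (4)
    prec = np.array([A[i, i] + sum(P[k, i] for k in nbrs[i]) for i in range(n)])
    mean = np.array([(b[i] + sum(P[k, i] * M[k, i] for k in nbrs[i])) / prec[i]
                     for i in range(n)])
    return mean, prec
```

On a 3-node chain (a tree), the returned means and variances match the dense solution $A^{-1}b$ and $\mathrm{diag}(A^{-1})$ exactly.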
IV-B Spatially-Aware Graph Neural Networks (SpAGNN)
Although the Gaussian MRF is a powerful model, it has important limitations in our scenario. First, some of our states (i.e., the heading angle) cannot be represented as a Gaussian random variable due to their bounded support between $-\pi$ and $\pi$. Second, for non-Gaussian data, the integral in the belief propagation update is generally intractable. However, the Gaussian MRF and the GaBP algorithm give us great inspiration in designing our approach. In the following, we first briefly review graph neural networks (GNNs) and then describe our novel formulation.

Graph Neural Networks (GNNs) [18] are powerful models for processing graph-structured data because (1) the model size does not depend on the input graph size (interaction graphs in our case have varying sizes) and (2) they have high capacity to learn good representations both at the node and graph level. Given an input graph and node states, a GNN unrolls a finite-step message passing algorithm over the graph to update the node states. In particular, for each edge, one first computes a message in parallel via a shared message function, which is a neural network taking the states of the two terminal nodes as input. Then, each node aggregates the incoming messages from its local neighborhood using an aggregation operator, e.g., summation. Finally, each node updates its own state based on its previous state and the aggregated message using another neural network. This message passing is repeated a finite number of times for practical reasons.
In our context, we consider each actor to be a node in the interaction graph. If we view the node state as the mean and precision matrix of the marginal Gaussian as in Gaussian MRFs, what GaBP does is very similar to what a GNN does. Specifically, computing and updating messages as in Eq. (3, 4) can be regarded as particular instantiations of graph neural networks. Therefore, one can generalize the message passing of GaBP using a GNN based on the universal approximation capacity of neural networks. Note that not all instantiations of GNNs will guarantee convergence to the true marginals as GaBP does in Gaussian MRFs. Nonetheless, GNNs can be trained using backpropagation and can effectively handle non-Gaussian data thanks to their high capacity. Motivated by the similarity between GaBP and GNNs, we design SpAGNN as described below.
Node State
The node state of our SpAGNN consists of two parts that are updated iteratively: a hidden state and an output state. For the $i$-th node, we construct the initial hidden state $h^{(0)}_i$ by extracting the region of interest (RoI) feature map from the detection backbone network for the $i$-th detection. In particular, we first apply the recently proposed Rotated RoI Align [43], an improved variant of the previously proposed RoI pooling [44] and RoI align [45], which extracts fixed-size spatial feature maps for bounding boxes with arbitrary shapes and rotations. We then apply a 4-layer downsampling convolutional network followed by max pooling to reduce the 2D feature map to a 1D feature vector per actor (see Fig. 2).

Inspired by GaBP, the output state at each message passing step consists of the statistics of the marginal distributions. Specifically, we assume the marginals of each waypoint and heading angle follow a Gaussian and a Von Mises distribution, respectively:

$$p(x_i, y_i) = \mathcal{N}\!\left(\mu_i, \Sigma_i\right), \qquad p(\theta_i) = \mathrm{VonMises}\!\left(\theta_i;\, \eta_i, \kappa_i\right) = \frac{\exp\!\left(\kappa_i \cos(\theta_i - \eta_i)\right)}{2\pi I_0(\kappa_i)}, \tag{5}$$

where $\mu_i = (\mu_x, \mu_y)$ and $\Sigma_i$ is parameterized by the standard deviations $\sigma_x, \sigma_y$ and the correlation $\rho$. Therefore, the output state $o_i$ predicted by our model is the concatenation of the parameters of both distributions, $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, $\rho$, $\eta$ and $\kappa$, per waypoint. The goal is to gradually improve the output states in the GNN as the message passing algorithm goes on. Note that we evaluate the likelihood using the local coordinate system centered at each actor and oriented such that the x axis is aligned with the heading direction, as shown in Fig. 2. This makes the learning task easier compared to using a global anchor coordinate system as in [11, 4], as shown in [46]. To initialize the output state $o^{(0)}_i$, we use a 2-layer MLP which takes the max-pooled RoI features as input and directly predicts the output state, independently per actor.
Message Passing
The node states are iteratively updated by a message passing process. For any directed edge $(i \to j)$, at propagation step $t$, we compute the message as

$$m^{(t)}_{i \to j} = \mathcal{E}\!\left(h^{(t)}_i,\; h^{(t)}_j,\; \mathcal{T}_{j \leftarrow i}\!\left(o^{(t)}_i\right)\right), \tag{6}$$
where $\mathcal{E}$ is a 3-layer MLP and $\mathcal{T}_{j \leftarrow i}$ is the transformation from the coordinate system of detected box $i$ to that of box $j$. Note that we rotate the output state of each neighbor of node $j$ such that it is relative to the local coordinate system of $j$. By doing so, the model is aware of the spatial relationship between two actors, which eases the learning, taking into account that it is extremely hard to extract such information from local, RoI-pooled features. We show the advantages of projecting the output state of node $i$ to the local coordinate system of node $j$ when computing the message in the ablation study in Table III. After computing the messages on all edges, we aggregate the messages going to node $j$ as follows:
$$a^{(t)}_j = \mathcal{A}\!\left(\left\{ m^{(t)}_{i \to j} \;:\; i \in N(j) \right\}\right) \tag{7}$$
We use an ordering-invariant, feature-wise operator along the neighborhood dimension as the aggregation function $\mathcal{A}$.
State Update
Once we compute the aggregated message $a^{(t)}_j$, we can update the node state:

$$h^{(t+1)}_j = \mathcal{U}\!\left(h^{(t)}_j,\; a^{(t)}_j\right), \qquad o^{(t+1)}_j = \mathcal{O}\!\left(h^{(t+1)}_j\right), \tag{8}$$

where $\mathcal{U}$ is a GRU cell and $\mathcal{O}$ is a 2-layer MLP.
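A simplified NumPy sketch of one round of this message passing (Eq. 6–8) is shown below. The tiny random-weight MLPs, the max aggregation and the plain feed-forward update stand in for the learned 3-layer MLP, the ordering-invariant operator and the GRU cell of the paper, so all names and shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # Random-weight stand-in for a learned MLP (ReLU between layers).
    Ws = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for W in Ws[:-1]:
            x = np.maximum(x @ W, 0.0)
        return x @ Ws[-1]
    return forward

def spagnn_step(h, o, rel_pose, msg_net, upd_net):
    """One message passing round over a fully connected directed graph.

    h: (n, d) hidden states; o: (n, p) output states; rel_pose[i, j] is the
    pose of actor i expressed in actor j's local frame (the spatial
    awareness of Eq. 6). Aggregation is a feature-wise max (Eq. 7) and the
    update is a simple MLP here (a GRU cell in the paper, Eq. 8).
    """
    n, d = h.shape
    msgs = np.full((n, n, d), -np.inf)
    for i in range(n):
        for j in range(n):
            if i != j:
                # message on edge i -> j, Eq. (6)
                feat = np.concatenate([h[i], h[j], rel_pose[i, j], o[i]])
                msgs[i, j] = msg_net(feat)
    agg = msgs.max(axis=0)                             # Eq. (7), per receiver
    return upd_net(np.concatenate([h, agg], axis=1))   # Eq. (8), new hidden state

# example shapes: 3 actors, 8-dim hidden, 4-dim output state, 3-dim relative pose
h = rng.normal(size=(3, 8))
o = rng.normal(size=(3, 4))
rel = rng.normal(size=(3, 3, 3))
msg_net = mlp([8 + 8 + 3 + 4, 16, 8])
upd_net = mlp([8 + 8, 8])
h_next = spagnn_step(h, o, rel, msg_net, upd_net)
```

In the full model this step would be followed by re-decoding the output state from the new hidden state and repeating for a fixed number of rounds.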
The above message passing process is unrolled for $K$ steps, where $K$ is a hyperparameter. The final prediction of the model is the output state after the last step, $o^{(K)}$. Note that the design of the message passing algorithm in Eq. (6, 7, 8) can be regarded as a generalization of the one in Eq. (3, 4) due to the universal approximation capacity of neural networks.

V End-to-end Learning
Our full model (including detection and relational prediction) is trained jointly end-to-end through backpropagation. In particular, we minimize a multi-task objective containing a binary cross-entropy loss for the classification branch of the detection network (background vs. vehicle), a regression loss to fit the detection bounding boxes, and a negative log-likelihood term for the probabilistic trajectory prediction. We apply hard negative mining to our classification loss: we select all positive examples from the ground truth and 3 times as many negative examples from the rest of the anchors. Regarding box fitting, we apply a smooth L1 loss [44] to each of the 5 parameters of the bounding boxes anchored to a positive example. The negative log-likelihood (NLL) is as follows:
$$
\begin{aligned}
\mathcal{L}_{\mathrm{NLL}} = \sum_{i, t} \Big[ \; & \tfrac{1}{2} \left( \mathbf{x}^t_i - \mu^t_i \right)^{\top} \left( \Sigma^t_i \right)^{-1} \left( \mathbf{x}^t_i - \mu^t_i \right) + \tfrac{1}{2} \log \det \Sigma^t_i + \log 2\pi \\
+ \; & \log\!\left( 2\pi I_0(\kappa^t_i) \right) - \kappa^t_i \cos\!\left( \theta^t_i - \eta^t_i \right) \Big],
\end{aligned}
$$

where the first line corresponds to the NLL of a 2D Gaussian distribution and the second line to the NLL of a Von Mises distribution, $I_0$ being the modified Bessel function of order 0. For the GNN message passing, we use backpropagation through time to pass the gradient to the detection backbone network.
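For reference, the two NLL terms can be sketched per waypoint as follows (a minimal NumPy version under our own function names, with the covariance parameterized by $\sigma_x$, $\sigma_y$ and $\rho$ as in the output state):

```python
import numpy as np

def gaussian2d_nll(x, y, mux, muy, sx, sy, rho):
    # NLL of a bivariate Gaussian parameterized by the means, standard
    # deviations and correlation coefficient rho.
    dx, dy = (x - mux) / sx, (y - muy) / sy
    one_m = 1.0 - rho ** 2
    quad = (dx ** 2 - 2.0 * rho * dx * dy + dy ** 2) / (2.0 * one_m)
    return quad + np.log(2.0 * np.pi * sx * sy * np.sqrt(one_m))

def von_mises_nll(theta, eta, kappa):
    # NLL of a Von Mises distribution over heading angles; np.i0 is the
    # modified Bessel function of order 0.
    return np.log(2.0 * np.pi * np.i0(kappa)) - kappa * np.cos(theta - eta)
```

Summing these two terms over actors and future time steps gives the trajectory part of the multi-task objective.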
VI Experimental Evaluation
In this section, we first explain the datasets and metrics that we use for evaluation. Next, we compare our model against state-of-the-art detection and motion forecasting algorithms. We then perform an ablation study to understand what contributes the most to the performance gain of our SpAGNN. Finally, we show some qualitative results. We defer the implementation details of our method and the baselines to the appendix (VIII-B and VIII-C).
Datasets
We report results on two datasets: ATG4D and the recently released nuScenes [19] dataset. This allows us to test the effectiveness of our approach on two vehicle platforms with different LiDAR sensors driving in different cities.
We collected the ATG4D dataset by driving a fleet of self-driving cars over several cities in North America with a 64-beam, roof-mounted LiDAR. It contains over 1 million frames collected from 5,500 different scenarios, which are sequences of 250 frames captured at 10 Hz. Our labels are very precise tracks of 3D bounding boxes up to a maximum distance of 100 meters. The nuScenes dataset consists of 1,000 snippets of 20 seconds each, collected in Boston or Singapore. Its 32-beam LiDAR captures a sparser point cloud than the one in ATG4D. Despite the high sensor capture frequency of 20 Hz, only keyframes at 2 Hz are annotated, thus limiting the number of frames available as supervision by an order of magnitude. nuScenes also provides HD maps of the two cities.
Metrics
We evaluate detection using the standard precision-recall (PR) curves and the associated mean average precision (mAP) metric. Following previous works, we ignore vehicles without any LiDAR point in the current sweep during evaluation. To showcase the ability to capture social interactions, we use the cumulative collision rate over time, defined as the percentage of predicted trajectories that overlap in space-time. A model that identifies interactions properly should achieve a lower collision rate, since our dataset does not contain any colliding examples. To benchmark the forecasting ability, we use the centroid L2 error as well as the absolute heading error at several future horizons. It is worth noting that the prediction metrics depend on the operating point chosen for the detector (its confidence score threshold), since these metrics can only be computed on true positive detections, i.e., those that get matched to ground-truth labels. To make the comparison fair for models with different detection PR curves, the motion forecasting and social compliance metrics are computed at a common recall point.
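As an illustration of the forecasting metrics, the centroid L2 error at a given horizon can be computed as follows (a sketch under our own naming; waypoints are assumed to be sampled every `dt` seconds starting at `dt`, over matched true-positive detections only):

```python
import numpy as np

def centroid_l2_errors(pred, gt, horizons, dt=0.5):
    """Mean centroid L2 error at the requested future horizons (seconds).

    pred, gt: (N, T, 2) predicted and ground-truth waypoints for the N
    true-positive detections; waypoint k corresponds to time (k + 1) * dt.
    """
    l2 = np.linalg.norm(pred - gt, axis=-1)          # (N, T) per-waypoint error
    return {h: float(l2[:, int(round(h / dt)) - 1].mean()) for h in horizons}
```

The heading error is computed analogously over the angular differences, and the collision rate is accumulated over the whole horizon rather than read off at a single time step.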
TABLE I: Results on ATG4D.

| Model | Col. (‰) 0-1s | Col. (‰) 0-3s | L2 x,y (cm) 0s | 1s | 3s | Heading err (deg) 0s | 1s | 3s |
|---|---|---|---|---|---|---|---|---|
| D+T+S-LSTM [30] | 1.43 | 16.31 | 22 | 147 | 607 | 4.06 | 5.14 | 8.07 |
| D+T+CSP [34] | 1.64 | 20.78 | 22 | 95 | 282 | 4.06 | 4.70 | 6.20 |
| D+T+CAR-Net [41] | 0.28 | 12.30 | 22 | 46 | 149 | 4.06 | 4.87 | 6.14 |
| FaF [11] | 1.12 | 17.41 | 30 | 54 | 183 | 4.71 | 4.98 | 6.43 |
| IntentNet [4] | 0.28 | 7.03 | 26 | 45 | 146 | 4.21 | 4.40 | 5.64 |
| NMP [6] | 0.05 | 3.06 | 23 | 36 | 114 | 4.10 | 4.24 | 5.09 |
| E2E S-LSTM [30] | 0.06 | 1.14 | 22 | 36 | 106 | 4.97 | 4.85 | 5.61 |
| E2E CSP [34] | 0.06 | 4.47 | 23 | 38 | 114 | 4.82 | 5.04 | 5.84 |
| E2E CAR-Net [41] | 0.07 | 1.15 | 22 | 35 | 105 | 4.44 | 4.41 | 5.12 |
| SpAGNN (Ours) | 0.03 | 0.42 | 22 | 33 | 96 | 3.92 | 3.89 | 4.55 |

Legend: D = Detector (PIXOR), T = Tracker (UKF + Hungarian), S-LSTM = Social-LSTM, CSP = Convolutional Social Pooling, E2E = End-to-End, Col. = Collision rate.
TABLE II: Results on nuScenes.

| Model | Col. (‰) 0-1s | Col. (‰) 0-3s | L2 x,y (cm) 0s | 1s | 3s | Heading err (deg) 0s | 1s | 3s |
|---|---|---|---|---|---|---|---|---|
| E2E S-LSTM [30] | 0.84 | 9.64 | 24 | 71 | 185 | 3.08 | 3.59 | 4.63 |
| E2E CSP [34] | 0.41 | 5.77 | 24 | 70 | 174 | 3.14 | 3.51 | 4.64 |
| E2E CAR-Net [41] | 0.36 | 4.90 | 23 | 61 | 158 | 2.84 | 3.07 | 4.06 |
| SpAGNN (Ours) | 0.25 | 2.22 | 22 | 58 | 145 | 2.99 | 3.12 | 3.96 |

Legend: S-LSTM = Social-LSTM, CSP = Convolutional Social Pooling, E2E = End-to-End, Col. = Collision rate.
Comparison Against the State-of-the-Art
We benchmark our method against a variety of baselines, which can be classified into three groups. (i) Methods that use past trajectories as their main motion cue: Social-LSTM [30], Convolutional Social Pooling [34] and CAR-Net [41]. These approaches assume the past tracks of every object are given; thus, we employ our object detector and a vehicle tracker consisting of an Interacting Multiple Model [47] with an Unscented Kalman Filter [48] and Hungarian matching to extract past trajectories. (ii) Previously proposed joint detection and motion forecasting approaches: FaF [11], IntentNet [4] and NMP [6]. (iii) End-to-end (E2E) trainable extensions of the methods in the first group, where we replace their past trajectory encoders with the actor features coming from our backbone network.

The detection precision-recall (PR) curve at IoU 0.7 and the collision rate accumulated into the future, for all possible recall points, are shown in Fig. 3 for the ATG4D dataset. Tables I and II show the cumulative collision rate, the centroid L2 error and the absolute heading error at different future horizons for the ATG4D and nuScenes datasets, respectively. The results reveal a clear improvement in detection and interaction understanding, as well as a solid gain in long-term motion forecasting. Our model substantially improves both the collision rate and the centroid error on both datasets. We obtain the lowest heading error on ATG4D, while we get results that are on par with the best baseline on nuScenes. Note that we perform this comparison at 80% recall at IoU 0.5 in ATG4D because some of the baselines do not reach higher recalls. However, we conduct the ablation studies for our model at 95% recall in the next sections, which is closer to the demand that self-driving cars should meet. In nuScenes we use 60% recall because all models exhibit worse PR curves, most likely due to the limited dataset size, the sparser LiDAR and the absence of ground-height information.
TABLE III: Ablation study.

| Decoder | R | B | T | C (‰) 0-3s | Centroid @3s L2 (cm) | NLL | H | Heading @3s err (deg) | NLL | H |
|---|---|---|---|---|---|---|---|---|---|---|
| MLP | ✗ | ✗ | ✗ | 9.79 | 127 | 2.56 | 0.47 | 5.68 | 3.32 | 0.15 |
| MLP | ✓ | ✗ | ✗ | 2.23 | 118 | 1.50 | 0.47 | 4.97 | 7.01 | 1.49 |
| GNN | ✓ | ✗ | ✗ | 2.91 | 122 | 1.73 | 0.72 | 5.09 | 6.65 | 2.37 |
| GNN | ✓ | G | ✗ | 2.14 | 116 | 1.42 | 0.50 | 5.14 | 7.31 | 1.17 |
| GNN | ✓ | R | ✗ | 1.32 | 109 | 1.14 | 0.39 | 4.77 | 7.12 | 1.97 |
| GNN | ✓ | R | R | 0.78 | 105 | 1.08 | 0.24 | 4.75 | 6.99 | 1.87 |

Legend: R = Rotated RoI (RRoI), B = bounding box, T = future trajectory, C = Collision rate, H = entropy.
Ablation Study
We first study the peractor feature extraction mechanism of our model by comparing simple feature indexing versus Rotated RoI Align. We do this study in a model that does not contemplate interactions for better isolation. In particular, we omit our
SpAGNN from Fig. 2 and use the initial trajectories predicted by as the final output. As shown in the first 2 rows of Table III, Rotated RoI Align clearly outperforms simple feature indexing. Our RRoI pooled features provide better information about the surroundings of the target vehicle since they contain features in a region spanning by 25 meters whereas the feature indexing variant consists of just accessing the feature map at the anchor location associated to the detection.Graph Neural Network architectures
We evaluate several GNN architectures with different levels of spatial awareness to demonstrate the effectiveness of our SpAGNN. The second and third rows of Table III show that adding a standard GNN without spatial awareness does not improve performance. From the third to the fourth row, we observe an improvement from including the detection bounding boxes in the global (G) coordinate frame as part of the state at every message passing iteration of the GNN. Then, we make these bounding boxes relative (R) to the actor receiving the message, which gives us a boost across most metrics. Finally, we add the parameters of the predicted probability distributions over the future trajectories to the message passing algorithm to recover our SpAGNN. Interestingly, the models become more certain, i.e., lower entropy (H), as we add better spatial-awareness mechanisms.

Qualitative Results
Fig. 4 shows the outputs of the baselines and our SpAGNN in a crowded scene in the ATG4D dataset. More visualizations will be added to our video submission. The waypoint trajectories output by IntentNet and NMP are shown in blue (they do not model uncertainty). The bivariate Gaussian distributions output by the rest of the models are shown in a colormap that encodes time: from blue at 0 seconds to pink at 3 seconds into the future. The principal axes of the ellipses correspond to the square roots of the two eigenvalues of the predicted covariance matrix. The ground-truth trajectories are drawn in gray. This example illustrates a failure common to all the baselines, which predict that a pair of vehicles are going to collide, thus failing to model the agent-agent interactions. Note that the relational baselines and SpAGNN also predict the future states of the vehicle that performed the data collection (see the epicenter of the LiDAR point cloud), since it plays an important role in the social interactions.

VII Conclusion
In this paper we tackled the problem of joint detection and relational behavior forecasting. Unlike existing approaches, we have proposed a single model that can reason jointly about these tasks. We have designed a novel spatially-aware graph neural network to produce socially coherent, probabilistic estimates of future trajectories. Our approach resulted in significant improvements over the state-of-the-art on the challenging ATG4D and nuScenes autonomous driving datasets. We plan to extend our model to generate multiple future outcomes of the scene, to use other sensory inputs such as images and radar, and to reason about other types of agents such as pedestrians and cyclists.
References
 [1] G. J. Wilde, “Social interaction patterns in driver behavior: An introductory review,” Human factors, 1976.
 [2] J. McNabb, M. Kuzel, and R. Gray, “I’ll show you the way: Risky driver behavior when “following a friend”,” Frontiers in psychology, 2017.
 [3] T. Connolly and L. Åberg, “Some contagion models of speeding,” Accident Analysis & Prevention, 1993.
 [4] S. Casas, W. Luo, and R. Urtasun, “Intentnet: Learning to predict intention from raw sensor data,” in Conference on Robot Learning, 2018.
 [5] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” arXiv preprint arXiv:1812.03079, 2018.
 [6] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “Endtoend interpretable neural motion planner,” in Proceedings of the IEEE CVPR, 2019.
 [7] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
 [8] B. Yang, W. Luo, and R. Urtasun, “Pixor: Realtime 3d object detection from point clouds,” in Proceedings of the IEEE CVPR, 2018.
 [9] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, “Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks,” in 2017 ICRA, 2017.
 [10] B. Yang, M. Liang, and R. Urtasun, “Hdnet: Exploiting hd maps for 3d object detection,” in Conference on Robot Learning, 2018, pp. 146–155.
 [11] W. Luo, B. Yang, and R. Urtasun, “Fast and furious: Real time endtoend 3d detection, tracking and motion forecasting with a single convolutional net,” in Proceedings of the IEEE CVPR, 2018.
 [12] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparsetodense 3d object detector for point cloud,” arXiv preprint arXiv:1907.10471, 2019.
 [13] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multitask multisensor fusion for 3d object detection,” in Proceedings of the IEEE CVPR, 2019.
 [14] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multiview 3d object detection network for autonomous driving,” in Proceedings of the IEEE CVPR, 2017.
 [15] G. P. Meyer, A. Laddha, E. Kee, C. VallespiGonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” arXiv preprint arXiv:1903.08701, 2019.
 [16] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F.C. Chou, T.H. Lin, and J. Schneider, “Motion prediction of traffic actors for autonomous driving using deep convolutional networks,” arXiv preprint arXiv:1808.05819, 2018.
 [17] H. Cui, V. Radosavljevic, F.C. Chou, T.H. Lin, T. Nguyen, T.K. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” arXiv preprint arXiv:1809.10732, 2018.
 [18] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, 2008.
 [19] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
 [20] B. Li, “3d fully convolutional network for vehicle detection in point cloud,” in 2017 IEEE/RSJ IROS, 2017.
 [21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE CVPR, 2017.
 [22] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural networks for rgbd semantic segmentation,” in Proceedings of the IEEE ICCV, 2017.
 [23] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in NeurIPS, 2017.
 [24] S. Wang, S. Suo, W.-C. Ma, A. Pokrovsky, and R. Urtasun, “Deep parametric continuous convolutional neural networks,” in Proceedings of the IEEE CVPR, 2018.
 [25] M. Simon, S. Milz, K. Amende, and H. Gross, “Complex-yolo: Real-time 3d object detection on point clouds,” arXiv preprint arXiv:1803.06199, 2018.
 [26] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proceedings of the IEEE CVPR, 2017.
 [27] N. Rhinehart, K. M. Kitani, and P. Vernaza, “R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 772–788.
 [28] Y. Hu, W. Zhan, and M. Tomizuka, “Probabilistic prediction of vehicle semantic intention and motion,” in 2018 IEEE Intelligent Vehicles Symposium (IV), 2018.
 [29] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani, “Forecasting interactive dynamics of pedestrians with fictitious play,” in Proceedings of the IEEE CVPR, 2017.
 [30] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE CVPR, 2016.
 [31] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in NeurIPS, 2017.
 [32] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid, “Actor-centric relation network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
 [33] C. Sun, A. Shrivastava, C. Vondrick, R. Sukthankar, K. Murphy, and C. Schmid, “Relational action forecasting,” in Proceedings of the IEEE CVPR, 2019.
 [34] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1468–1476.
 [35] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “Precog: Prediction conditioned on goals.”
 [36] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in Proceedings of the IEEE CVPR, 2018.
 [37] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, “Sophie: An attentive gan for predicting paths compliant to social and physical constraints,” in Proceedings of the IEEE CVPR, 2019.
 [38] R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun, and S. Fidler, “Situation recognition with graph neural networks,” in Proceedings of the IEEE ICCV, 2017.
 [39] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference, 2018.
 [40] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference for interacting systems,” arXiv preprint arXiv:1802.04687, 2018.
 [41] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese, “Carnet: Clairvoyant attentive recurrent network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 151–167.
 [42] Y. Weiss and W. T. Freeman, “Correctness of belief propagation in gaussian graphical models of arbitrary topology,” in NeurIPS, 2000.
 [43] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11, 2018.
 [44] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE ICCV, 2015.
 [45] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE ICCV, 2017.
 [46] F.-C. Chou, T.-H. Lin, H. Cui, V. Radosavljevic, T. Nguyen, T.-K. Huang, M. Niedoba, J. Schneider, and N. Djuric, “Predicting Motion of Vulnerable Road Users using High-Definition Maps and Efficient ConvNets,” arXiv e-prints, Jun 2019.
 [47] A. F. Genovese, “The interacting multiple model algorithm for accurate state estimation of maneuvering targets,” Johns Hopkins APL technical digest, vol. 22, no. 4, pp. 614–623, 2001.
 [48] E. A. Wan and R. Van Der Merwe, “The unscented kalman filter for nonlinear estimation,” in Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373). IEEE, 2000, pp. 153–158.
 [49] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
 [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
VIII Appendix
VIII-A Algorithm
The full inference algorithm is described in Alg. 1. We write the algorithm in a non-vectorized manner for the sake of readability, although our implementation is fully vectorized, since inference time is critical for autonomy tasks onboard a self-driving car. The proposed algorithm runs every 0.1 seconds (whenever a new LiDAR sweep is gathered). To avoid cluttering the notation, Alg. 1 omits the outer loop that would reflect this.
The reader might wonder why the future time steps do not appear explicitly in Alg. 1. We note that the hidden state of each agent at a given message-passing iteration acts as a summarization of appearance, motion and interaction features from that actor and its neighbors. The output states for all future time steps are predicted in a feed-forward fashion using an MLP, as indicated in line 14 of Alg. 1. We did not observe any gain from a recurrent trajectory decoder that explicitly reasons about temporal dependencies in the future trajectory, and therefore adopted this simpler formulation.
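As a concrete sketch of this feed-forward decoding, the snippet below maps a per-actor hidden state to the statistics of all future waypoints in a single MLP pass, with no recurrence over time. The hidden size, the number of per-step statistics, and the randomly initialized weights are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 30    # 3 s horizon at 10 Hz
H = 512   # assumed hidden-state size
OUT = 5   # assumed per-step output statistics (e.g., means and covariance terms)

# Randomly initialized stand-ins for the learned 2-layer MLP weights.
W1 = rng.standard_normal((H, 256)) * 0.01
b1 = np.zeros(256)
W2 = rng.standard_normal((256, T * OUT)) * 0.01
b2 = np.zeros(T * OUT)

def decode_trajectory(h):
    """Predict the statistics of all future time steps at once."""
    z = np.maximum(h @ W1 + b1, 0.0)      # ReLU hidden layer
    return (z @ W2 + b2).reshape(T, OUT)  # one row of statistics per future step

traj = decode_trajectory(rng.standard_normal(H))
```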
VIII-B Implementation details
In this section we describe our implementation, including details about the network architecture of the different components as well as training.
Detection network
Our LiDAR backbone uses 2, 2, 3, and 6 layers in its 4 residual blocks. The convolutions in the residual blocks of our LiDAR backbone have 32, 64, 128 and 256 filters with strides of 1, 2, 2, 2, respectively. The backbone that processes the high-definition maps uses 2, 2, 3, and 3 layers in its 4 residual blocks. The convolutions in the residual blocks of our map backbone have 16, 32, 64 and 128 filters with strides of 1, 2, 2, 2, respectively. For both backbones, the final feature map is a multi-resolution concatenation of the outputs of each residual block, as explained in
[8]. This gives us features downsampled 4x with respect to the input. The header network consists of 4 convolution layers with 256 filters per layer. We use GroupNorm [49] because of our small batch size (number of scenarios) per GPU. Because we want our detector to have a very high recall, we move away from detecting only cars with at least 1 LiDAR point on their surface in the current frame, as was done in previous works [11, 4, 6]. Instead, we require that a car have either 1 LiDAR point in the current sweep or 2 points across 2 different previous sweeps, so that the model has enough information to infer its current location from motion.
Finally, we reduce the set of candidate bounding box proposals by first taking the top 200 anchors ranked by confidence score and then applying NMS with an IoU threshold of 0.1 to obtain our final detections.
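This post-processing step can be sketched as follows. For brevity the sketch uses axis-aligned IoU, whereas the actual detector operates on rotated boxes; the function names are ours.

```python
import numpy as np

def iou(a, b):
    # Axis-aligned IoU for boxes [x1, y1, x2, y2]; the real pipeline uses
    # rotated boxes, so this is a simplification for illustration.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def top_k_nms(boxes, scores, k=200, iou_thr=0.1):
    """Keep the top-k proposals by confidence, then greedily suppress any
    box that overlaps an already-kept box by more than iou_thr."""
    order = np.argsort(scores)[::-1][:k]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(int(i))
    return keep
```

With the low 0.1 threshold from the text, near-duplicate proposals are suppressed aggressively, which suits a detector tuned for high recall upstream.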
Per-actor feature extraction
We use a CUDA kernel for fast Rotated Region of Interest Align (RRoI Align [43]). Around each target vehicle we use a 41 m by 25 m region of interest aligned with the detection bounding box heading: 31 m in front of the car, 10 m behind, and 12.5 m to each side. We use an output resolution of 1 m/pixel in our RRoI Align operator. The resulting (256, 41, 25)-dimensional per-actor tensor is then downsampled by a factor of 8 by a 3-layer CNN that increases the number of channels to 512. We then apply max-pooling over the remaining spatial dimensions to obtain the initial hidden state. Finally, this is processed by a 2-layer MLP to produce the initial output state.
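The geometry of this per-actor region of interest can be sketched as below. The corner layout follows the dimensions given in the text; the function name is a hypothetical helper, not part of the RRoI Align implementation.

```python
import numpy as np

def actor_roi_corners(cx, cy, heading):
    """Corners of the 41 m x 25 m per-actor RoI: 31 m in front of the
    detection center, 10 m behind, 12.5 m to each side, rotated to align
    with the detection heading (x forward, y left in the box frame)."""
    local = np.array([[31.0, 12.5], [31.0, -12.5],
                      [-10.0, -12.5], [-10.0, 12.5]])  # box-frame corners
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s], [s, c]])                    # rotation to world frame
    return local @ R.T + np.array([cx, cy])
```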
Relational Behavior Forecasting
We implement a fully-vectorized version of our SpAGNN with a fixed number of propagation steps. The parameters of the edge, aggregate, update and output functions are shared across all propagation steps, since we did not observe any improvement from using separate ones; this highlights the refinement nature of our probabilistic trajectory prediction process. Our edge function is a 3-layer MLP that takes as input the hidden states of the two terminal nodes of each edge at the previous propagation step, as well as the projected output states from the previous iteration, processed by a 2-layer MLP. We use feature-wise max-pooling as our aggregate function in order to be more robust to changes in the graph topology, implemented with an efficient CUDA scatter-max kernel that receives the incoming messages from neighboring nodes and outputs the aggregated message. To update the hidden states we use a GRU cell. Finally, a 2-layer MLP outputs the statistics of our multivariate Gaussian and Von Mises distributions.
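The feature-wise max aggregation can be illustrated with a small numpy stand-in for the CUDA scatter-max kernel:

```python
import numpy as np

def scatter_max(messages, dst, num_nodes):
    """Aggregate incoming edge messages per destination node with a
    feature-wise max (numpy stand-in for the CUDA scatter-max kernel).

    messages: (E, F) array, one message per directed edge.
    dst: length-E list of destination node indices.
    """
    out = np.full((num_nodes, messages.shape[1]), -np.inf)
    for m, d in zip(messages, dst):
        out[d] = np.maximum(out[d], m)
    out[np.isinf(out)] = 0.0  # nodes without incoming edges get a zero message
    return out
```

Taking the max independently per feature channel means the aggregated message is invariant to the number and ordering of neighbors, which is why it is robust to changes in graph topology.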
Scheduled sampling
Our end-to-end learnable model first detects the vehicles in the scene and then forecasts their motion for the next 3 seconds. Recall that the state updates in our SpAGNN depend on the incoming messages from neighboring nodes/vehicles. Thus, we add scheduled sampling [45] during training in order to mitigate the distribution mismatch between ground-truth bounding boxes and detected bounding boxes. More precisely, we start training the second stage of our model by feeding only the ground-truth boxes, since at that point the detections are not yet reliable and the large number of false positives and false negatives would mislead the learning of the GNN parameters. As detection improves, we increase the probability of replacing the ground-truth bounding boxes with detections. In practice, we start with probability 1.0 of using ground-truth bounding boxes, lower it to 0.7 after 10,000 training iterations, and finally to 0.3 after 20,000. This improves results because we avoid confusing the GNN with false positive and false negative detections while the detector is still in an early learning phase.
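The schedule described above amounts to a simple step function over training iterations (the function name is ours):

```python
def gt_box_probability(iteration):
    """Probability of feeding ground-truth boxes (rather than detections)
    to the SpAGNN stage: 1.0 initially, 0.7 after 10k iterations,
    0.3 after 20k, per the schedule in the text."""
    if iteration < 10_000:
        return 1.0
    if iteration < 20_000:
        return 0.7
    return 0.3
```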
Optimizer
We use the Adam [50] optimizer with a base learning rate that is linearly scaled with the batch size. In our experiments we use a batch size of 3 per GPU on 4 GPUs, for a total batch size of 12.
VIII-C Baseline details
Tracking-based baselines
To keep the same architectures that were proposed in Social-LSTM [30], Convolutional Social Pooling [34], and CAR-Net [41], past trajectories are needed. However, our proposed model is tracking-free and takes only the sensor and map data as input. Thus, we implement an Interacting Multiple Model [47] with an Unscented Kalman Filter [48] and Hungarian matching to extract past trajectories. These past trajectories are used to obtain our results for D+T+SLSTM, D+T+CSP and D+T+CARNet in Table I. In particular, we feed up to 1 second of past trajectory. The following tracker settings delivered the best performance: filter out detections with a confidence score lower than 0.5 (this still delivers a detection recall higher than 95% on ATG4D), wait 9 cycles before discarding a track completely, birth a track immediately if the confidence score of a detection is higher than 0.9, and otherwise require a minimum of 3 consecutive IoU-based matches to birth a track. We tried training both with past ground-truth trajectories and with tracking results. Training directly on tracking results delivered slightly better numbers, which are the ones reported in Table I.
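The track lifecycle rules above can be sketched as follows; the class and field names are illustrative assumptions, not part of the tracker implementation.

```python
class Track:
    """Track management per the settings in the text: birth immediately for
    detection scores above 0.9, otherwise confirm after 3 consecutive
    matches; discard after 9 consecutive unmatched cycles."""

    def __init__(self, score):
        self.confirmed = score > 0.9  # high-confidence immediate birth
        self.hits = 1                 # the birthing detection counts as a match
        self.misses = 0

    def update(self, matched):
        """Advance one tracker cycle; returns True if the track stays alive."""
        if matched:
            self.hits += 1
            self.misses = 0
            if self.hits >= 3:        # 3 consecutive IoU-based matches
                self.confirmed = True
        else:
            self.hits = 0
            self.misses += 1
        return self.misses < 9        # wait 9 cycles before discarding
```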
End-to-end adapted baselines
The tracking-based baselines do not achieve good performance, partly due to their hard dependency on tracking quality. Therefore, we seek to benchmark the previously proposed interaction operators and trajectory decoders against SpAGNN in a better-isolated experiment. To do so, we keep our backbone network and feature extraction, and replace the second stage of our implementation with the interaction operators and trajectory decoders proposed in Social-LSTM [30], Convolutional Social Pooling [34], and CAR-Net [41]. However, because these methods use past trajectories and we do not, we need to adapt their trajectory encoders. For E2E SLSTM, instead of unrolling a Social-LSTM all the way from the past trajectory to the future one, we feed the per-actor features extracted by our backbone network as input to a Social-LSTM that unrolls from the current time into the future. For E2E CSP, the LSTM encoder is removed from Convolutional Social Pooling and the social tensor is directly initialized with the per-actor features extracted by our backbone network. For E2E CARNet, we use our backbone with RRoI pooling to replace its feature extractor module, and use the per-actor features as the past motion context.
Joint Perception and Prediction baselines
VIII-D Additional qualitative results
First, we present additional qualitative results on the ATG4D dataset in Fig. 5. We can see that SpAGNN produces very accurate detections and motion forecasts in a wide variety of scenarios in terms of actor density, lane-graph topologies, interactions and high-level actions (e.g., an illegal u-turn or cars pulling out into non-mapped driveways). The last row contains examples of our main failure mode: our model predicts a plausible trajectory in a multimodal situation, but in the ground truth another mode was executed. Finally, we present additional qualitative results on the nuScenes dataset in Fig. 6, showing that our method generalizes to other datasets. However, as the last row (failure modes) shows, the detections are somewhat less reliable, mainly due to the 32-beam LiDAR sensor in contrast to the 64-beam one used in ATG4D.
Detections are shown as blue bounding boxes. Probabilistic motion forecasts are shown as ellipses (corresponding to one standard deviation of a bivariate Gaussian), where variations in color indicate different future time horizons (from 0 seconds in blue to 3 seconds in pink). Ground-truth boxes and future waypoints are displayed in gray. A dashed gray box means the object is occluded. The last row shows failure modes.