Convolutions for Spatial Interaction Modeling

04/15/2021 ∙ by Zhaoen Su, et al. ∙ 7

In many different fields interactions between objects play a critical role in determining their behavior. Graph neural networks (GNNs) have emerged as a powerful tool for modeling interactions, although often at the cost of adding considerable complexity and latency. In this paper, we consider the problem of spatial interaction modeling in the context of predicting the motion of actors around autonomous vehicles, and investigate alternative approaches to GNNs. We revisit convolutions and show that they can demonstrate comparable performance to graph networks in modeling spatial interactions with lower latency, thus providing an effective and efficient alternative in time-critical systems. Moreover, we propose a novel interaction loss to further improve the interaction modeling of the considered methods.




1 Introduction

Interactions or relations between objects are critical for understanding both the individual behaviors and collective properties of many systems. Conceptually, these interactions can be modeled with graph structures that comprise a set of objects (nodes) and their relationships (edges). By applying deep learning techniques, graph neural networks (GNNs) have demonstrated great expressive power in modeling interactions in various fields, including physical science [3, 12, 33, 36], social science [17, 24], knowledge graphs [16], and other research areas [20, 27, 30, 40].

Some of these interacting systems involve non-spatial relations, such as the semantic relations in social networks; others strongly depend on geometries, such as the Euclidean distance and relative directions between objects, which we refer to as spatial interactions in this work. One problem where spatial interaction is critical is motion forecasting, a key task in computer vision, in robotics in general, and in autonomous driving (AD) in particular. Specifically, anticipating the future movements of an object requires understanding not only the object's past dynamics, but also its interactions with other objects and its environment. These interactions strongly depend on relative spatial features between objects, such as their relative location, orientation, velocity, etc.

GNNs have achieved success in modeling spatial interaction [4, 13, 23, 34, 35, 37]. Features of individual objects are typically encoded into attributes of graph nodes, and the graph edges are built by passing node attributes and the relative geometries of the node pair through a mapping function. GNNs follow a message passing scheme, where each node aggregates features of its neighboring nodes to compute its new node attributes. These approaches have two characteristics: (1) the relative spatial features in the graph edges are essential in the interaction modeling, and they usually need to be handcrafted; (2) even a single iteration of GNN may be slower than convolutional neural networks (CNNs), as seen in the experimental section, which makes GNNs less suitable for applications in fields such as AD where fast inference is safety-critical.

Alternatively, data structures for convolutional operations are in common grid forms, such as voxelization in 3D, rasterization in 2D bird’s-eye view (BEV), or CNN feature maps. In these grid structures, spatial relations between actors are intrinsically represented in the Euclidean space. Thus, they theoretically allow spatial relations between objects to be learned by CNNs with sufficiently large receptive fields [11]. In other words, CNNs have the potential to model spatial interactions. However, even though deep CNN backbones with large receptive fields are widely utilized in many trajectory forecasting models, recent research has shown that adding a GNN after the CNN backbone can still improve interaction modeling [4, 34, 35]. This suggests the CNN backbones often do not fulfill their theoretical potential in modeling spatial interaction.

In this work, we consider spatial interaction modeling through convolutions and compare them to GNNs within the context of motion forecasting for AD. A key determinant of future motion for other drivers is the avoidance of collisions, which represents a critical interaction that we model explicitly. Collisions can be approximated as geometric overlapping, which provides unambiguous definitions for interaction metrics. We evaluate the methods on large-scale real-world AD data to draw general conclusions. Our contributions are summarized below:

  • we identify three components to facilitate modeling spatial interaction with convolutions: (1) large actor-centric interaction region, (2) aggregation of per-actor feature maps using convolutions, and (3) projecting feature maps into the actor’s frame of reference;

  • we perform empirical studies to compare interaction modeling using convolutions and graphs, and find that (1) convolutions can perform similarly to or better than GNNs; (2) adding the convolutions can considerably improve interaction modeling even when a GNN is used; and (3) adding a GNN demonstrates only minor additional gain when the convolutional approach is already used; and

  • we study the effect of a novel interaction loss.

2 Related Work

2.1 Motion forecasting

There exists a significant body of work on forecasting the motion of traffic actors. An input to the forecasting models can be a sequence of past actor states such as positions, headings, or velocities [1, 7, 8, 10, 13, 19, 34], where motion forecasting is performed in the actor’s frame of reference, or a sequence of raw sensor data such as LiDAR or radar returns [5, 28], where joint object detection and motion forecasting are performed in an autonomous vehicle’s (AV) frame of reference. While the latter approach may accelerate inference and joint learning by sharing common CNN features among all actors, these single-stage models could benefit from actor-centric features. Two-stage models [4, 9] address this issue by using a first stage to detect the actors and extract features, and then adding a second stage in the frame of reference of detected actors. The two stages are then learned jointly in an end-to-end fashion. The interaction modeling study in this paper adopts a two-stage architecture. Note that the designs used in the study, including rotated region of interest (RROI) [29] and actor-centric design [4, 9, 10], have been developed and applied in previous research in a context different from interaction modeling. However, our empirical study demonstrates that utilizing these ideas allows convolutions to effectively model spatial interaction as well.

2.2 Interaction modeling

GNNs have recently been applied to explicitly express interactions in motion forecasting. NRI [23] models the interaction between actors by using GNNs to infer interactions while simultaneously learning dynamics. VectorNet [13] and CAR-Net [35] model actor-context interactions. Closely related to our work, SpaGNN [4] is also a two-stage detection-and-forecasting model that builds a graph for vehicles in the second stage to model vehicle-vehicle interaction. The GNN models used for comparison in the study of this paper follow the same design.

Beyond graph models, grid-based spatial relations have been explored using social pooling approaches [1, 7, 15], where pooling is used to capture the impact of surrounding actors in the recurrent architecture. In social-LSTM [1, 15], the LSTM cell receives pooled spatial hidden states from the LSTM cells of neighbors that are embedded into a grid. Besides the parameter-free pooling, convolutional layers have also been explored [7]. By contrast, our proposal is fully convolutional. Moreover, these approaches pool the spatial context of interacting actors while excluding the actor itself, so the actor-context interaction is not directly modeled in the process.

2.3 Interaction metrics

It is interesting to note that while various techniques have been developed to model spatial interaction, most prior work reports motion forecasting displacement errors. As shown in this study, reducing displacement errors does not necessarily indicate improvement in interaction modeling for a motion forecasting task. An alternative metric that can more explicitly indicate the level of interaction modeling is to measure whether vehicle motion forecasts incorrectly predict overlap with other vehicles [4, 34]. In this work, we also propose vehicle-obstacle overlap rate within motion forecasts as another measure for interaction modeling.

Figure 1: Three model architectures in a scene illustrated with three vehicle actors and one obstacle (denoted by the white spot). All models share the same first-stage design shown from left to middle: input is a BEV raster image comprising past and current point clouds and a semantic map in the ADV frame. Through a CNN feature extractor we obtain a 4× downsampled feature map in the ADV frame. (a) Single-stage baseline: Object detection and trajectory forecasting are performed at a pixel-level. (b) Adding the proposed Interaction Convolutional Module (ICM). For each actor we define an interaction region (IR) in the actor frame that is used to crop an area from the feature map. Through the weight-sharing interactive CNN (ICNN) a feature vector is aggregated for each actor and then utilized to predict a future trajectory in its frame. (c) Adding GNN into the architecture shown in (b).

3 Methodology

In this section we formulate the motion forecasting problem that we consider in our current work, followed by a discussion of two approaches to interaction modeling: implicitly through convolutions and explicitly through graphs. Fig. 1 illustrates the architectures of the considered end-to-end models that jointly solve tasks of object detection and motion forecasting, taking BEV representation of the sensor data as an input and outputting both object detections and their future trajectories. We emphasize that we purposefully choose a commonly used input representation, neural network design, and loss functions in order to focus on understanding the interaction modeling aspect of these approaches. Moreover, to simplify the analysis we limit our attention to vehicle actors.

3.1 Problem formulation

Given input data comprising the past and current information of interacting actors and the environment, a model outputs the current and future states of the actors. As mentioned previously, our study considers raw sensor data as an input to the model. Following the joint detection and forecasting architecture [5, 9], we encode the sensor data by voxelizing and stacking a sequence of current and past LiDAR point clouds around the ADV in BEV representation, and by rasterizing a semantic map that provides an additional environmental prior; together these form the model input. The detection of each actor at the current time is parameterized by a bounding box $(x, y, \cos\theta, \sin\theta, w, l)$, denoting the $x$ and $y$ coordinates of the actor’s centroid, the cosine and sine of its heading angle $\theta$, and the width and length of the box, respectively. Assuming rigid SE2 transformations, future trajectories can be represented as a sequence of tuples $(x_t, y_t, \cos\theta_t, \sin\theta_t)$ over the future time steps $t$ [38].

3.2 Feature extraction and losses

As illustrated in Fig. 1a, the first stage of the joint model detects objects and extracts features. From the input BEV raster, a 4× downsampled feature map is extracted by a deep CNN that follows a common design [26, 25]. It consists of three operations: (1) a convolutional block (ConvB) comprising a 3×3 convolution, batch normalization, and optionally a ReLU; (2) a ResNet v2 block (ResB); and (3) upsampling using bilinear interpolation. Features are processed at multiple scales to provide larger receptive fields for capturing wider context and past motion of the actors (see Supplementary Material for detailed network design).

Following the computation of the BEV feature map, classification and regression are performed on the 1D feature vector of each grid cell. Through a fully-connected (FC) layer and a softmax function, we obtain the likelihood that a vehicle actor has its center located in the cell. We use focal loss [25] to address the foreground/background imbalance. Through a separate FC layer, the network simultaneously regresses the detection bounding boxes. The centroid and heading are relative to the cell center and the ADV heading, respectively. The first-stage detection loss then combines the focal classification loss, evaluated over all grid cells, with a smooth-L1 regression loss on the bounding-box parameters, evaluated over the vehicle foreground grid cells. The classification target equals 1 for foreground cells and 0 otherwise, the transition value of the smooth-L1 loss is set to 0.1, and the regression is supervised with the associated ground-truth targets.
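The two loss terms described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the focal-loss hyperparameters `gamma` and `alpha`, the unweighted sum of the two terms, and the dictionary layout of `cells` are hypothetical choices, and the smooth-L1 form with a transition value `beta` follows the common convention rather than a formula given in the text.

```python
import math

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one grid cell.

    p      -- predicted foreground probability, strictly between 0 and 1
    target -- 1 for a vehicle foreground cell, 0 otherwise
    gamma, alpha -- assumed values, not taken from the paper
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(diff, beta=0.1):
    """Smooth-L1 loss: quadratic below the transition value beta, linear above."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def detection_loss(cells):
    """Sum focal loss over all cells and smooth-L1 over foreground cells only.

    cells -- list of dicts with keys "p", "target", "box", "box_gt"
             (hypothetical layout chosen for this sketch)
    """
    classification = sum(focal_loss(c["p"], c["target"]) for c in cells)
    regression = sum(
        smooth_l1(pred - gt)
        for c in cells if c["target"] == 1
        for pred, gt in zip(c["box"], c["box_gt"])
    )
    return classification + regression
```

Background cells contribute only to the classification term, which is where the focal weighting helps with the foreground/background imbalance.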

In addition to the detection loss, end-to-end models also optimize a prediction loss that is applied only to the future waypoints of the actors. Moreover, we model the multimodality of the predictions [9] by classifying three modes for each actor (i.e., turning left, turning right, or going straight), where a separate trajectory is regressed for each mode along with its probability [6]. The focal loss is used for the mode classification, where the target is equal to 1 for the mode closest to the observed trajectory and 0 otherwise. In addition, the trajectory regression loss is applied only to the trajectory mode that is closest to the observed trajectory. The prediction loss thus combines the mode classification loss with the regression loss of the mode closest to the ground truth.

In the single-stage model, future centroids and headings are relative to the cell center and the ADV heading, respectively (see Fig. 1a), while they are in the actor frames in the two-stage models (see Fig. 1b-c). The detection and prediction losses can then be optimized together in joint training.
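The selection of the mode closest to the observed trajectory can be illustrated as follows. This is a minimal sketch: the text does not specify how closeness is measured, so the average point-wise Euclidean distance used here is an assumption, and the function names are hypothetical.

```python
import math

def average_displacement(traj_a, traj_b):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    return sum(math.dist(a, b) for a, b in zip(traj_a, traj_b)) / len(traj_a)

def closest_mode(mode_trajs, observed):
    """Index of the predicted mode closest to the observed trajectory."""
    errors = [average_displacement(t, observed) for t in mode_trajs]
    return min(range(len(errors)), key=errors.__getitem__)

def mode_targets(mode_trajs, observed):
    """Classification targets: 1 for the closest mode, 0 for the others.

    The returned index also selects the single mode that receives
    the trajectory regression loss.
    """
    best = closest_mode(mode_trajs, observed)
    return best, [1 if m == best else 0 for m in range(len(mode_trajs))]
```

For example, with one left-turning, one straight, and one right-turning prediction and a straight observed trajectory, only the straight mode gets target 1 and only its waypoints are regressed.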

For single-stage models, the detection and prediction losses are both optimized in the first stage (Fig. 1a). On the other hand, when the first stage serves as a part of the two-stage architecture (Fig. 1b-c), the detection loss is optimized as a part of the first-stage output while the prediction loss is optimized in the second stage, discussed in the remainder of this section.

3.3 Interaction using convolutions implicitly

In the previous section we discussed the first-stage feature extraction, which computes per-actor grid features that are then used as an input to the second-stage models to predict future motion. In this section we discuss how to compute per-actor features that better capture interactions:

  • To capture the relationship between nearby actors and the actor whose future trajectories are predicted (called the actor of interest), the input of the forecasting module can be a region of the feature map covering the interacting actors and objects, instead of the single feature pixel. For the traffic use-case specifically, this interaction region (IR) should cover the area containing the objects that the actor needs to pay attention to. The results presented later show that a large region ahead of the actor provides good context to model interaction.

  • To effectively propagate non-local information of the interacting actors to the actor of interest, we can use an interactive CNN (ICNN) consisting of a few downsampling convolutional layers that eventually condense an IR comprising the actor of interest itself, its surrounding actors, and the environment, into a feature vector as the final feature for this actor.

  • To overcome the rotational variance of convolutions, instead of cropping the IR features in the coordinate frame of the original BEV grid whose orientation is determined by the ADV, we can define the IR in the frame of the actor of interest (referred to as the actor frame), in which the output trajectories are also defined. This technique is commonly referred to as RROI [29]. Our results confirm the importance of rotational invariance in modeling interactions.
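The actor-frame crop described in the list above can be sketched with plain Python on a single-channel feature map. This is an illustration only: the function names, the zero padding outside the map, the cell-center sampling, and the use of feature-map cells as the length unit are assumptions made for the sketch, not details from the paper.

```python
import math

def bilinear(fmap, x, y):
    """Sample fmap[row][col] at continuous (x=col, y=row); zero outside the map."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                value += (1 - abs(x - xx)) * (1 - abs(y - yy)) * fmap[yy][xx]
    return value

def crop_actor_frame(fmap, cx, cy, heading, size, out, front_ratio=5 / 6):
    """Crop a square interaction region of side `size` into an out x out grid.

    The region is aligned with the actor heading. `front_ratio` is the share
    of the region lying ahead of the actor (5/6 matches a 5:1 front-to-back
    ratio). crop[i][j]: i runs from behind to ahead, j runs laterally.
    """
    c, s = math.cos(heading), math.sin(heading)
    step = size / out
    u_min = -(1.0 - front_ratio) * size   # longitudinal start (behind the actor)
    v_min = -size / 2.0                   # lateral start
    crop = []
    for i in range(out):
        u = u_min + (i + 0.5) * step
        row = []
        for j in range(out):
            v = v_min + (j + 0.5) * step
            # Rotate the actor-frame offset (u, v) into the feature-map frame.
            x = cx + u * c - v * s
            y = cy + u * s + v * c
            row.append(bilinear(fmap, x, y))
        crop.append(row)
    return crop
```

Because the sampling grid rotates with the actor, an object directly ahead of the actor lands in the same crop cells regardless of the actor's heading in the ADV frame, which is the rotational-variance point made in the last bullet.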

As mentioned, the actor-centric feature map and the RROI techniques have been utilized in a number of applications [4, 9, 10, 29], where they were found to lower displacement errors in trajectory forecasting tasks. In this paper we demonstrate that, by combining these ideas, convolutions are effective in modeling spatial interactions as well. Moreover, as shown in the experimental section, by varying the parameters of these ingredients one can control the level of interaction modeling, providing further evidence that spatial interactions can be effectively captured by convolutions.

The implementation of these three components is illustrated in the dashed box in Fig. 1b, which we refer to as the interaction convolutional module (ICM). For each actor we define a square IR around it, which is then used to crop actor-centric features from the global feature map using bilinear interpolation. We vary the size, orientation, and the position of the actor in the IR to study their effects on the performance of interaction modeling (e.g., in the extreme case where the IR has no area, the cropped feature is just the feature pixel on the feature map). Note that we choose a square IR in order to simplify the discussion, and we refer to the length of the side of the square as the IR size in the following discussion. Similarly, the ICNN module always consists of six ConvBs and one ResB to gradually reduce the cropped feature map to a 1-D feature vector (e.g., for a suitable crop size, setting the strides of the last five ConvBs to 2 yields a 1-D vector; see Supplementary Material for detailed discussion on crop sizes and ICNN design). The final multimodal classification and future trajectory regression in the actor frame are obtained from this 1-D vector via a single FC layer, one for each task.
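The reduction of a crop to a single feature vector through strided convolutions comes down to simple output-size arithmetic, sketched below. The crop size of 32 cells, the padding of 1, and the assumption that the ResB preserves spatial size are hypothetical values chosen for illustration; only the 3×3 kernel and the stride of 2 in the last five of six ConvBs come from the text.

```python
def conv_out(n, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution applied to an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1

def icnn_sizes(crop, strides=(1, 2, 2, 2, 2, 2)):
    """Spatial sizes after each of six ConvBs (last five with stride 2)."""
    sizes = [crop]
    for stride in strides:
        sizes.append(conv_out(sizes[-1], stride=stride))
    return sizes

print(icnn_sizes(32))  # [32, 32, 16, 8, 4, 2, 1]
```

Under these assumptions each stride-2 block halves the spatial size, so five of them reduce the crop by a factor of 32 and leave one feature vector per actor.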

3.4 Interaction using graphs explicitly

The purely convolutional approach described in the previous section provides implicit interaction modeling. To explicitly account for interactions, a common approach is the use of GNNs, discussed in this section. As there exist many variants, we choose one of the more general approaches, the message passing neural network [14, 39], which has also been adapted to the motion forecasting problem [4].

Indicated by the dashed box in Fig. 1c, a fully connected graph comprises all of the actors (represented as nodes), with bi-directional edges between every pair of actors. The feature attribute of each node is initialized from the final feature vector of the corresponding actor computed in the previous section. All multi-layer perceptrons (MLPs) in this GNN have two layers. At each iteration, the message passed along a directed edge is computed from the concatenation of the attributes of the two nodes it connects and their relative geometric feature. Unlike the implicit convolutional approach in the previous section, where the relative spatial relations of actors are intrinsically represented within the crop, a graph representation requires spatial relationships to be provided explicitly. The relative geometric feature consists of the coordinates and heading of one actor expressed in the frame of the other.

All of the messages sent to a graph node are aggregated by a max-pooling operation. The node attribute is then updated with a Gated Recurrent Unit (GRU) [4, 14, 39], whose hidden state is the current node attribute and whose input is the aggregated message. In general, the update can be iterated multiple times. Finally, multimodal classification and future trajectories for the actor are computed from the final node attribute, as discussed in Section 3.3.
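One message-passing iteration of this kind can be sketched in plain Python. This is a structural illustration only: the learned two-layer MLPs are replaced by a plain concatenation, the GRU is replaced by a caller-supplied `update` function, and the convention that the relative pose is that of the sending actor in the receiving actor's frame is an assumption made here.

```python
import math

def relative_pose(actor_i, actor_j):
    """Pose of actor j in the frame of actor i; actors are (x, y, heading)."""
    xi, yi, hi = actor_i
    xj, yj, hj = actor_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(hi), math.sin(hi)
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    # Wrap the heading difference into (-pi, pi].
    dh = math.atan2(math.sin(hj - hi), math.cos(hj - hi))
    return (x_rel, y_rel, dh)

def message(h_sender, h_receiver, rel):
    """Stand-in for the learned message MLP: concatenate its inputs."""
    return list(h_sender) + list(h_receiver) + list(rel)

def aggregate(messages):
    """Element-wise max-pooling over all messages sent to one node."""
    return [max(values) for values in zip(*messages)]

def message_passing_step(poses, attrs, update):
    """One iteration on a fully connected graph.

    update(h, m) stands in for the GRU: h is the current node attribute
    (hidden state) and m is the aggregated message (input).
    """
    new_attrs = []
    for i in range(len(poses)):
        msgs = [
            message(attrs[j], attrs[i], relative_pose(poses[i], poses[j]))
            for j in range(len(poses)) if j != i
        ]
        new_attrs.append(update(attrs[i], aggregate(msgs)))
    return new_attrs
```

The loop over all ordered pairs makes the quadratic cost of a fully connected graph explicit, and `relative_pose` is the handcrafted geometric input that the convolutional approach does not need.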

Figure 2: Schematic of interaction loss. The actor is approximated with 3 costing circles, with illustrated minimal distances (black) to a static obstacle (grey) and the resulting gradients (red).

3.5 Interaction loss

In this section we introduce a novel interaction loss to improve the interaction awareness of the model. It directly penalizes predicted trajectories of an actor that overlap with static traffic objects (defined as objects whose speed is below a fixed threshold). Traffic objects comprise objects that a vehicle should avoid, including vehicles, cyclists, pedestrians, construction fences, etc. At each prediction horizon, the predicted actor is approximated with inscribed costing circles, as illustrated in Fig. 2. The loss is then computed over all actors, non-moving obstacles, prediction time horizons, and costing circles. For each such combination it depends on the radius of the costing circle (determined by the size of the ground-truth bounding box) and on the signed minimum distance between the costing circle center and the obstacle bounding box at that time horizon. The distance is negative when the center is inside the obstacle’s bounding box.

Note that the loss only considers overlaps between predicted trajectories and the ground-truth bounding boxes of static obstacles. Moving actors may have multimodal trajectory distributions, and it can be unclear when an overlap between the trajectories of two moving actors should be penalized by the loss. When the costing circles overlap with an obstacle bounding box, the interaction loss back-propagates gradients only through the predicted centroid and heading. The loss is added to the prediction loss, where it is applied to a single predicted trajectory mode, and optimized jointly in the end-to-end training.
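A loss of this kind can be sketched as follows for a single actor. The exact functional form is not recoverable from the text, so the hinge penalty `max(0, r - d)`, the even spacing of circle centers along the actor's length, the radius of half the box width, and all function names are assumptions of this sketch.

```python
import math

def signed_distance_to_box(px, py, box):
    """Signed distance from a point to an oriented box (negative inside).

    box = (cx, cy, heading, width, length), with length along the heading.
    """
    cx, cy, heading, width, length = box
    c, s = math.cos(heading), math.sin(heading)
    dx, dy = px - cx, py - cy
    lx = c * dx + s * dy        # point in the box frame
    ly = -s * dx + c * dy
    qx = abs(lx) - length / 2
    qy = abs(ly) - width / 2
    outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
    inside = min(max(qx, qy), 0.0)
    return outside + inside

def costing_circles(x, y, heading, width, length, n=3):
    """Radius and centers of n circles inscribed along the actor's length."""
    r = width / 2
    span = max(length - 2 * r, 0.0)
    c, s = math.cos(heading), math.sin(heading)
    centers = []
    for k in range(n):
        offset = -span / 2 + (span * k / (n - 1) if n > 1 else span / 2)
        centers.append((x + offset * c, y + offset * s))
    return r, centers

def interaction_loss(pred_traj, size, obstacles, n=3):
    """Hinge penalty summed over horizons, static obstacles, and circles.

    pred_traj -- list of predicted (x, y, heading), one per horizon
    size      -- (width, length) of the actor's bounding box
    obstacles -- list of static obstacle boxes
    """
    width, length = size
    loss = 0.0
    for x, y, heading in pred_traj:
        r, centers = costing_circles(x, y, heading, width, length, n)
        for box in obstacles:
            for px, py in centers:
                d = signed_distance_to_box(px, py, box)
                loss += max(0.0, r - d)  # zero once the circle clears the box
    return loss
```

With this form the penalty depends only on the predicted centroid and heading, since the box size enters as a constant, which matches the statement that gradients flow only through those two quantities.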

4 Experiments

Figure 3: Effects of the ICM components: interaction region (IR), interaction CNN (ICNN), and actor frame (AF). Extractor is the single-stage model; +IR+ICNN represents the two-stage model that defines IRs in the ADV frame; +ICM (i.e., +IR+ICNN+AF) is the proposed two-stage model that defines IRs in the actor frame. All IRs have a fixed 5:1 front-to-back ratio, unless specified otherwise. Inset: models with a fixed IR size of 60m and varied front-to-back ratios.

Input and output. The considered area is of size m, centered at the self-driving vehicle and discretized as a grid into which the LiDAR sweep information is encoded. The input contains LiDAR sweeps collected at 0.1s interval, as well as a semantic HD map from the current timestamp. The models detect the vehicle actors at the current time step and forecast their trajectories at future time horizons . Non-maximum suppression (NMS) [31] with Intersection over Union (IoU) threshold set at 0.1 is applied in order to eliminate duplicate detections.
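The duplicate-removal step can be illustrated with a greedy NMS sketch. For brevity it uses axis-aligned boxes, which is a simplification assumed here (vehicle detections in BEV are generally rotated boxes); the 0.1 IoU threshold is the value stated above.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: keep the highest-scoring boxes, drop overlapping duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

A threshold as low as 0.1 is reasonable for vehicles in BEV because two real vehicles should barely overlap, so almost any overlap between detections indicates a duplicate.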

Metrics. The studies are focused on prediction accuracy and interaction performance. IoU threshold for object detection matching is set at . We observe that the detection performance changes little in all of the models reported in the paper, with average precision at . Furthermore, we ensure equal numbers of trajectories are considered in the metrics by adjusting the detection probability threshold at a fixed recall of [4]. Each actor has 3 predicted trajectory modes, and we assign the trajectory of the most probable mode to the actor in the following metric computation. We use average displacement error (DE) at s to measure the prediction accuracy.

To quantify the interaction performance of the models we consider two overlap metrics in our experiments (additional results are provided in Supplementary Material):

  • Actor-actor overlap rate is the percentage of predicted trajectories of detected actors overlapping with predicted trajectories of other detected actors.

  • Actor-static overlap rate is the percentage of predicted trajectories of detected actors overlapping with ground-truth static traffic objects.

An actor overlap is defined as an intersection-over-obstacle-polygon of more than at any point of the 4s trajectory, set to this value to eliminate false positive overlaps due to small noise in the labeled bounding boxes.
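The actor-static overlap rate defined above can be sketched as follows. Axis-aligned boxes are used instead of general polygons as a simplifying assumption, and `threshold` is left as a required argument because its value is not given here; the function names are hypothetical.

```python
def intersection_over_obstacle(actor, obstacle):
    """Intersection area divided by the obstacle area (axis-aligned boxes)."""
    iw = min(actor[2], obstacle[2]) - max(actor[0], obstacle[0])
    ih = min(actor[3], obstacle[3]) - max(actor[1], obstacle[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    obstacle_area = (obstacle[2] - obstacle[0]) * (obstacle[3] - obstacle[1])
    return iw * ih / obstacle_area

def trajectory_overlaps(actor_boxes, obstacle, threshold):
    """True if the actor overlaps the obstacle at any point of the trajectory."""
    return any(
        intersection_over_obstacle(box, obstacle) > threshold
        for box in actor_boxes
    )

def actor_static_overlap_rate(trajectories, obstacles, threshold):
    """Percentage of predicted trajectories overlapping any static obstacle.

    trajectories -- list of trajectories, each a list of actor boxes
                    (x_min, y_min, x_max, y_max), one per time step
    """
    hits = sum(
        1 for traj in trajectories
        if any(trajectory_overlaps(traj, obs, threshold) for obs in obstacles)
    )
    return 100.0 * hits / len(trajectories)
```

Normalizing by the obstacle area rather than the union makes the measure insensitive to the size of the predicted actor, and a small positive threshold absorbs label noise as described above.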

Data. We conducted an evaluation on an in-house data set, containing scenes of s each and collected across several cities in North America with high-quality Hz annotations. No ground-truth overlaps are observed in the data. To mitigate metric variance due to rare events, (1) a large split of scenes is left out for validation; (2) the validation key frames in each scene have a temporal spacing of s to avoid counting the same overlaps multiple times; and (3) the training and validation sets are split geographically to prevent models from memorizing the same static obstacles and environment. Note that, as the goal of this work is to understand the relative performance of the considered approaches for interaction modeling, we limited the experiments to the in-house data. Using this larger data set, as opposed to popular open-sourced data sets that are significantly smaller, yields more statistically significant results and supports more general conclusions.


The models were implemented in PyTorch [32] and trained end-to-end with 16 GPUs, with a batch size of 2 per GPU. Training without the GNN module is completed in about hours. We use the Adam optimizer [21] with a learning rate of , decayed to at and at of the training iterations.

4.1 Results

Interaction using convolutions. The performance of the single-stage model (Fig. 1a) that contains only the feature extractor is shown in Fig. 3 (Extractor, black). The +IR+ICNN (green) curve shows the performance of the two-stage model without rotating the interaction region for each actor into the actor frame. In particular, starting from the 1D per-actor feature map vectors (0 m), we increase the IR size to m. By cropping larger feature map regions that contain more interacting actors and surrounding context, displacement error and forecasted overlap rates decrease.

Figure 4: Performance from adding a GNN on top of the ICM (Fig. 1c) for different interaction region sizes. +GNN (only attributes) encodes only node attributes in the graph edges; +GNN (only relative) encodes only relative locations and orientations in the graph edges.
Figure 5: Comparison of ICM and GNN (which includes ICM). +GNN (no edges) is identical to +GNN except the graph edges are cut off. +IL represents models trained with the additional interaction loss. Note that comparing +ICM at large IR (interaction is modeled by ICM) against +GNN at small IR (interaction is modeled by GNN) shows that a pure ICM can outperform a pure GNN in modeling interactions.

We then rotate IRs to match the estimated actor orientation instead of using the common ADV frame (+ICM, blue). For zero IR size (i.e., a cropped feature is still the 1D feature map vector), we observe that DE drops significantly compared to the model using the ADV frame with zero IR size (green). This has been explained previously as a benefit of a standardized output representation [9]. Although defining the IR in the actor frame reduces rotational variance, the zero-size IR covers no interacting actors and we thus observe little change in the actor overlap rates. As the IR is increased in size, both DE and the interaction metrics improve dramatically. Crop sizes beyond m show no further improvement, likely because the majority of interacting actors and obstacles are already included within the m region.

In all of the IRs above we have fixed the front-to-back ratio to 5:1, meaning that, e.g., an IR of size 60 m includes 50 m ahead of and 10 m behind the actor. In the Fig. 3 inset we fix the total size at 60 m and vary the front-to-back ratio (blue). As the vast majority of actors are moving forward, we can see that placing more of the IR ahead of the actor improves interaction modeling. It is interesting to note the divergence between DE and overlap rates: once the front-to-back ratio is above 1:1, the overlap rates continue to drop marginally, while the DE improvement stops. Even for the actor-centered IR (inset, green), not rotating the IR to match the actor orientation yields worse DE and overlap rates, which further confirms the importance of removing rotational variance for interaction modeling using convolutions. From Fig. 3, we observe that cropping an actor-frame defined region of the feature map and then applying convolutions improves forecasting and interaction modeling considerably. The strong dependence of overlap rates on IR size provides evidence that convolutions are effectively capturing interactions once other actors are inside the IR.

Figure 6: Overlapping predicted trajectory examples in baseline (top) and ICM (bottom). Red: overlapped obstacles; Blue: forecasts of the actors of interest; Grey: forecasts of other actors; Green: labels. Trajectory visualization is downsampled to Hz for clarity.

Interaction using graphs. As illustrated in Fig. 1c, for these experiments we add a GNN on top of the ICM. Note that, as discussed earlier, setting the IR size to 0 m deactivates the ICM while retaining the benefit of reduced rotational variance. For zero IR size (Fig. 4, +GNN, red) we see that the GNN indeed improves DE and overlap rates significantly as compared to the models without designated interaction modeling capability in Fig. 3 (+ICM, 0 m). Notably, even when a GNN is utilized, we observe that ICM can still provide additional performance improvements as we gradually increase ICM’s interaction modeling by expanding the IR size. We also examine the benefit of the hand-crafted relative geometries in the graph edges. When the IR is small (i.e., ICM is limited), keeping only the node attributes (blue) or only the relative geometries (green) significantly damages the graph modeling. For large IR sizes, the difference between the three graph models becomes minor, suggesting that with larger feature crops the ICM has effectively compensated for missing GNN features.

The GNNs in the models above are single-iteration. We also evaluated the effect of increasing the GNN iterations to . An additional iteration reduces DE and overlap rates further by a small amount when the IR size is small, which could be explained by the well-known bottleneck phenomenon of GNNs [2] and the fact that the graph is fully connected. This improvement is negligible for all but the smallest IRs, and no further exploration of additional iterations is provided below.

Convolutions vs. graphs for interaction. In Fig. 5 we compare the implicit ICM (blue) and explicit GNN (red) approaches. With zero IR size, where ICM is effectively off, the gain of adding a GNN is significant; however, as the IR grows, we observe that the performance gap steadily narrows. In other words, while turning on ICM (by increasing IR size) can further improve the performance of GNN models, adding a graph to an ICM with a sufficiently large IR provides only minor improvements. To understand the gaps between +ICM and +GNN with large IR sizes, we study a graph-less model (+GNN (no edges), black) created by removing graph edges in +GNN. For large IRs, the graph-less model matches the performance of +GNN, which suggests that the explicit interaction graph of the GNN contributes little to the performance. Thus, the gaps between +ICM and +GNN for larger IR sizes are mainly due to the extra network capacity of the GNN module. Lastly, comparing +ICM at large IR (i.e., interaction is modeled by ICM) against +GNN at small IR (i.e., interaction is modeled by GNN) shows that a pure ICM can outperform a pure GNN in modeling interactions.

Interaction loss. We can also see that adding the interaction loss (Eq. 8) reduces the overlap between actors’ predicted trajectories for both interaction modeling approaches (green and magenta in Fig. 5). The improvement is significant for smaller IRs, which may be due to the fact that the smaller IRs do not provide enough information to model the interactions effectively, benefiting more from this added supervision. Interestingly, the interaction loss does not affect DE results except for ICM models at small IR where the interaction modeling is limited.

Maneuver-specific qualitative results. In Fig. 6 we present a comparison of the baseline ICM model with 0 m size (that has no designated interaction modeling) and the ICM model with m size on three typical maneuvers observed in interacting scenarios: adaptive cruise control (ACC), turn, and nudging. We note that the 0 m model in all cases incorrectly predicts overlapping trajectories. In the ACC case the ICM model correctly predicts that the vehicle would decelerate and queue after others, while in the turn case it outputs a trajectory that follows the lane and avoids overlapping with the vehicles after the turn. In the nudging case the vehicle motion starts with considerable curvature; the forecast correctly reduces the curvature and straightens the trajectory to avoid the parked cars. We also examined the results of GNN on these maneuvers, and observed no significant difference between ICM and GNN outputs.

Inference time. The baseline model that includes the feature extractor and other parts such as input pre-processing and output post-processing takes ms per frame. Next we measure the additional time costs of adding the ICM and GNN modules to the baseline model, shown in Table 1. The ICM of zero IR size adds an additional 5.2 ms, which includes processing of a 1-D feature vector and computation of the final output. ICM with non-zero IR size uses convolutions and bilinear feature cropping, operations that have been optimized in current GPU software and hardware. As a result, even the largest 80m ICM is only a few milliseconds slower than the 0m ICM (8.1 ms vs. 5.2 ms). Lastly, the GNN itself takes 46.9 ms, multiple times slower than the slowest ICM. This is consistent with earlier results showing that GNN inference may be inefficient, resulting in higher latency [22]. Coupled with the earlier results showing that modeling interaction using convolutions can give competitive performance compared to GNNs, we see that the convolutional approach represents an efficient and practical alternative to GNNs.

Module   IR size (m)   Inference (ms)
ICM      0             5.2
ICM      80            8.1
GNN      -             46.9
Table 1: Inference times of modules (tested on Nvidia Titan RTX)
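
As a quick arithmetic check on Table 1, the following Python sketch computes the relative cost of the modules from the reported times. The dictionary keys and variable names are illustrative labels chosen here, not names used by the authors.

```python
# Module inference times in milliseconds, copied from Table 1.
# The string labels are illustrative, not identifiers from the paper.
module_ms = {"ICM (0m)": 5.2, "ICM (80m)": 8.1, "GNN": 46.9}

# Extra cost of growing the ICM interaction region from 0m to 80m.
icm_growth_ms = module_ms["ICM (80m)"] - module_ms["ICM (0m)"]

# How many times slower the GNN is than the slowest ICM configuration.
gnn_vs_slowest_icm = module_ms["GNN"] / module_ms["ICM (80m)"]

print(f"ICM 0m -> 80m adds {icm_growth_ms:.1f} ms")
print(f"GNN is {gnn_vs_slowest_icm:.1f}x slower than ICM (80m)")
```

Under these numbers the largest ICM costs roughly 3 ms more than the smallest, while the GNN is close to six times slower than the largest ICM.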

5 Conclusion

We considered convolutional and graph neural networks for the task of spatial interaction modeling. We compared and contrasted these two approaches, providing empirical evidence that under certain conditions convolutional networks reach comparable performance to the state-of-the-art GNNs that have recently become popular in the literature, thus allowing similar motion forecasting accuracy and interaction modeling while maintaining reduced latency and complexity of the model. We analyzed common components of the interaction approaches, leading to a better understanding of how each benefits the final performance of the system. Moreover, we introduced a novel interaction-aware loss and showed its impact on the considered approaches. Our work presents a basis for wider use of convolutional layers for the task of spatial interaction, providing evidence that the gap between convolutional models and more complex and computationally expensive GNN models may not be as large as previously suspected.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social lstm: human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–971. Cited by: §2.1, §2.2.
  • [2] U. Alon and E. Yahav (2021) On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205. Cited by: §4.1.
  • [3] P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. arXiv preprint arXiv:1612.00222. Cited by: §1.
  • [4] S. Casas, C. Gulino, R. Liao, and R. Urtasun (2019) Spatially-aware graph neural networks for relational behavior forecasting from sensor data. arXiv preprint arXiv:1910.08233. Cited by: §1, §1, §2.1, §2.2, §2.3, §3.3, §3.4, §3.4, §4.
  • [5] S. Casas, W. Luo, and R. Urtasun (2018) Intentnet: learning to predict intention from raw sensor data. In Conference on Robot Learning, pp. 947–956. Cited by: §2.1, §3.1.
  • [6] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 2090–2096. Cited by: §3.2.
  • [7] N. Deo and M. M. Trivedi (2018) Convolutional social pooling for vehicle trajectory prediction. CoRR abs/1805.06771. External Links: Link, 1805.06771 Cited by: §2.1, §2.2.
  • [8] F. Diehl, T. Brunner, M. Truong-Le, and A. C. Knoll (2019) Graph neural networks for modelling traffic participant interaction. CoRR abs/1903.01254. External Links: Link, 1903.01254 Cited by: §2.1.
  • [9] N. Djuric, H. Cui, Z. Su, S. Wu, H. Wang, F. Chou, L. S. Martin, S. Feng, R. Hu, Y. Xu, et al. (2020) MultiNet: multiclass multistage multimodal motion prediction. arXiv preprint arXiv:2006.02000. Cited by: §2.1, §3.1, §3.2, §3.3, §4.1.
  • [10] N. Djuric, V. Radosavljevic, et al. (2018) Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819. Cited by: §2.1, §3.3.
  • [11] F. Engelmann, T. Kontogianni, and B. Leibe (2020) Dilated point convolutions: on the receptive field size of point convolutions on 3d point clouds. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9463–9469. Cited by: §1.
  • [12] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Advances in neural information processing systems, pp. 6530–6539. Cited by: §1.
  • [13] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid (2020) VectorNet: encoding hd maps and agent dynamics from vectorized representation. arXiv preprint arXiv:2005.04259. Cited by: §1, §2.1, §2.2.
  • [14] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. Cited by: §3.4, §3.4.
  • [15] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. CoRR abs/1803.10892. External Links: Link, 1803.10892 Cited by: §2.2.
  • [16] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto (2017) Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. arXiv preprint arXiv:1706.05674. Cited by: §1.
  • [17] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.2.
  • [19] B. Ivanovic and M. Pavone (2018) Modeling multimodal dynamic spatiotemporal graphs. CoRR abs/1810.05993. External Links: Link, 1810.05993 Cited by: §2.1.
  • [20] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song (2017) Learning combinatorial optimization algorithms over graphs. In Advances in neural information processing systems, pp. 6348–6358. Cited by: §1.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [22] K. Kiningham, C. Re, and P. Levis (2020) GRIP: a graph neural network accelerator architecture. arXiv preprint arXiv:2007.13828. Cited by: §4.1.
  • [23] T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. Zemel (2018) Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688–2697. Cited by: §1, §2.2.
  • [24] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1.
  • [25] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal loss for dense object detection. In ICCV, Cited by: §3.2, §3.2.
  • [26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144. Cited by: §3.2.
  • [27] S. Löwe, D. Madras, R. Zemel, and M. Welling (2020) Amortized causal discovery: learning to infer causal graphs from time-series data. arXiv preprint arXiv:2006.10833. Cited by: §1.
  • [28] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proc. of the IEEE CVPR, pp. 3569–3577. Cited by: §2.1.
  • [29] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue (2017) Arbitrary-oriented scene text detection via rotation proposals. CoRR abs/1703.01086. External Links: Link, 1703.01086 Cited by: §2.1, 3rd item, §3.3.
  • [30] E. A. Meirom, H. Maron, S. Mannor, and G. Chechik (2020) How to stop epidemics: controlling graph dynamics with reinforcement learning and graph neural networks. arXiv preprint arXiv:2010.05313. Cited by: §1.
  • [31] A. Neubeck and L. Van Gool (2006) Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 3, pp. 850–855. Cited by: §4.
  • [32] A. Paszke, S. Gross, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §4.
  • [33] S. R. Qasim, J. Kieseler, Y. Iiyama, and M. Pierini (2019-07) Learning representations of irregular particle-detector geometry with distance-weighted graph networks. The European Physical Journal C 79 (7). External Links: ISSN 1434-6052, Link, Document Cited by: §1.
  • [34] N. Rhinehart, R. McAllister, K. M. Kitani, and S. Levine (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. CoRR abs/1905.01296. External Links: Link, 1905.01296 Cited by: §1, §1, §2.1, §2.3.
  • [35] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese (2017) CAR-net: clairvoyant attentive recurrent network. CoRR abs/1711.10061. External Links: Link, 1711.10061 Cited by: §1, §1, §2.2.
  • [36] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia (2018) Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242. Cited by: §1.
  • [37] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2017) Modeling relational data with graph convolutional networks. Cited by: §1.
  • [38] Z. Su, C. Wang, H. Cui, N. Djuric, C. Vallespi-Gonzalez, and D. Bradley (2020) Temporally-continuous probabilistic prediction using polynomial trajectory parameterization. arXiv preprint arXiv:2011.00399. Cited by: §3.1.
  • [39] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §3.4, §3.4.
  • [40] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan (2020) Inductive representation learning on temporal graphs. Cited by: §1.

Appendix A The extractor network

In Fig. S1 we provide the detailed design of the CNN feature extractor used in all of the models in the current study (see a high-level overview in Fig. 1). We note that the multi-scale design (as indicated by , , , , and , where the numbers represent the down-sampling scales relative to the input size) and cross-scale blocks (see Fig. S2) already encourage a large receptive field of the resulting network. Nevertheless, the empirical studies presented in the main paper show that such a single-stage CNN architecture still models the spatial interaction less effectively. By adding either the shallow ICNN or the GNN module in the second stage, the interaction modeling performance is significantly improved.

Figure S1: Multi-scale network design of the feature extractor. As illustrated in Fig. 1, the inputs are voxelized LiDAR point clouds and rasterized map, while the network output is a feature map of size , where and are the grid width and length of the input BEV representation, respectively. The green boxes represent tensors, each labeled with its number of channels and its down-sampled scale relative to the input size. The operations connecting two tensors are ConvBs, except for the specified up-sampling, element-wise addition, and the cross-scale block. The cross-scale block, detailed in Fig. S2, is repeated 3 times.
Figure S2: Detailed architecture of the cross-scale block

Appendix B The ICNN network

Figure S3: Various considered designs for ICNN, using a 32×32 input as an example. The green boxes represent tensors of grid size equal to and channels, where is equal to 256 in all designs. The dotted lines denote shortcut connections. The results presented in this work were based on the design shown as (a).

For the IRs equal to 80m, 60m, 40m, 20m, and 5m, we set the grid sizes of the feature map crops to 64, 48, 32, 16 and 4, respectively. Zero-valued padding is utilized in the convolutional layers when necessary.
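
The IR and grid-size pairs listed above all share the same ratio of 1.25 m per grid cell. The short Python check below makes this explicit; the constant `METERS_PER_CELL` and the helper `crop_grid_size` are hypothetical names, and the 1.25 m value is inferred from the listed pairs rather than stated by the authors.

```python
# (IR in meters -> crop grid size) pairs as listed in the text.
IR_TO_GRID = {80: 64, 60: 48, 40: 32, 20: 16, 5: 4}

# Inferred from the pairs above; an assumption, not a value given in the paper.
METERS_PER_CELL = 1.25


def crop_grid_size(ir_m: float) -> int:
    """Return the crop grid size implied by an IR under the inferred resolution."""
    return round(ir_m / METERS_PER_CELL)


# Every listed pair is consistent with the inferred resolution.
for ir_m, grid in IR_TO_GRID.items():
    assert crop_grid_size(ir_m) == grid
```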

We did not extensively investigate network design for the ICNN module. Several straightforward options (see Fig. S3) that stacked ConvB and ResB blocks in series were evaluated empirically. These options set the strides of the last few ConvB blocks to 2 so that the input feature map crop was down-sampled gradually to after being processed by the ICNN. We observed that the model performance was not sensitive to the changes in these ICNN variants.
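
To illustrate how stride-2 blocks shrink a feature map crop, the sketch below applies the standard convolution output-size formula repeatedly. The 3×3 kernel with padding 1 and the number of stride-2 blocks are assumptions made for illustration; this section does not specify the ConvB kernel size or how many blocks use stride 2.

```python
def conv_out_size(n: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def downsample_trace(n: int, num_stride2_blocks: int) -> list:
    """Grid sizes after each of a series of stride-2 blocks (hypothetical count)."""
    sizes = [n]
    for _ in range(num_stride2_blocks):
        n = conv_out_size(n, stride=2)
        sizes.append(n)
    return sizes


# A 32x32 crop passed through three assumed stride-2 blocks.
print(downsample_trace(32, 3))  # [32, 16, 8, 4]
```

With these assumed settings each stride-2 block halves the grid size, so larger crops need more such blocks to reach the same final resolution.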

Appendix C Additional details on training setup

Each training sequential example comprises 10 past and current sweeps (s, s, …, s), and 41 current and future timestamps for ground-truth supervision (s, s, …, s). The frame at current timestamp is referred to as the key frame. Each scene on the in-house data set is 25s long, producing at most 200 complete sequential examples. We trained all of the models with decimated key frames in the training split once (i.e., every sequential example whose key frame is at , s, s, , is used once during model training).

Appendix D Additional results focusing on overlaps with non-vehicle actors

The actor-static overlap rate in the main paper considers overlaps of forecasted trajectories with both vehicle and static non-vehicle traffic objects. In this section, we provide additional results focusing on overlap with static non-vehicle traffic objects. Here the overlap rate is defined as the percentage of forecasted trajectories of detected actors that overlap with ground-truth static non-vehicle traffic objects. The three panels in Fig. S4 correspond to Figs. 3-5 in the main paper.
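
The overlap-rate definition above can be sketched in a few lines of Python. This is a simplified illustration, not the evaluation code used in the paper: actors are reduced to waypoint centers and static objects to axis-aligned boxes, both of which are assumptions, and all function and type names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_in_box(p: Point, box: Box) -> bool:
    """True if the point lies inside or on the boundary of the box."""
    x, y = p
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def trajectory_overlaps(traj: List[Point], objects: List[Box]) -> bool:
    """A trajectory overlaps if any waypoint falls inside any static object."""
    return any(point_in_box(p, b) for p in traj for b in objects)


def overlap_rate(trajs: List[List[Point]], objects: List[Box]) -> float:
    """Percentage of forecasted trajectories that overlap a static object."""
    if not trajs:
        return 0.0
    hits = sum(trajectory_overlaps(t, objects) for t in trajs)
    return 100.0 * hits / len(trajs)


# Two forecasted trajectories and one static object; only the first overlaps.
trajs = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [(0.0, 5.0), (1.0, 5.0)]]
static_objects = [(1.5, -0.5, 2.5, 0.5)]
print(overlap_rate(trajs, static_objects))  # 50.0
```

A faithful implementation would test the actors' oriented bounding boxes against object polygons at each timestamp; the point-in-box test is used here only to keep the sketch short.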

Because the feature map input cropped by the IR covers features of both vehicle and non-vehicle traffic objects in the ICM approach, it is not surprising that ICM effectively improves this interaction metric too. It is, however, interesting to note that even though GNN does not build nodes for the non-vehicle traffic objects in the graph, it also lowers this overlap rate by , as seen by comparing +ICM (0m) to +GNN (0m). The reduction is attributed to the fact that by avoiding overlaps with vehicles (after adding the GNN), the overlaps with some of the non-vehicle objects near those vehicles are also avoided. Another factor may be the proximity effect of CNNs, as the pixel features of the vehicle actors might comprise information about their nearby non-vehicle objects. The improvement of GNN on the overlap avoidance with non-vehicle objects (), however, is considerably lower than that with vehicle actors ( as shown in the main paper by comparing +ICM (0m) to +GNN (0m) in Fig. 5 right), which is reasonable as the GNN does not model the interactions with these non-vehicle objects directly.

Figure S4: Overlap rate of forecasted actor trajectories overlapping with static non-vehicle traffic objects.