Path-Aware Graph Attention for HD Maps in Motion Prediction

by Fang Da, et al.

The success of motion prediction for autonomous driving relies on the integration of information from HD maps. As maps are naturally graph-structured, research on graph neural networks (GNNs) for encoding HD maps has been burgeoning in recent years. However, unlike many other applications where GNNs have been straightforwardly deployed, HD maps are heterogeneous graphs where vertices (lanes) are connected by edges (lane-lane interaction relationships) of various natures, and most graph-based models are not designed to understand the variety of edge types, which provide crucial cues for predicting how the agents would travel the lanes. To overcome this challenge, we propose Path-Aware Graph Attention, a novel attention architecture that infers the attention between two vertices by parsing the sequence of edges forming the paths that connect them. Our analysis illustrates how the proposed attention mechanism can facilitate learning in a didactic problem where existing graph networks like GCN struggle. By improving map encoding, the proposed model surpasses the previous state of the art on the Argoverse Motion Forecasting dataset, and won first place in the 2021 Argoverse Motion Forecasting Competition.




I Introduction

In autonomous driving, HD maps are an essential source of information for the robot, since they capture the semantic structure of the road and thereby provide driving guidance both as a legal requirement and as a distribution prior. As the same structural information is understood and respected by human drivers sharing the road with the robot, the map structure plays a crucial part in regulating the prediction of motion of other agents. The robot must be intimately aware of the lanes and traffic controls around it and how each can affect itself before it can safely navigate them.

The HD maps built for autonomous driving are typically structured as a graph of map elements, in particular lanes, which are connected to one another in various formations representing admissible ways of traversing the terrain. For example, two lanes positioned consecutively in space and connected with a “sequential” link permit driving along them successively, while two lanes side-by-side connected with a “lateral” link enable a “lane change” maneuver while driving on one to reach the other. This highlights that the HD map lane graph is a heterogeneous graph, one with edges of different natures that correspond to different semantics. Classic graph-based machine learning models, widely adopted in applications like analysis of social networks [1] and protein-protein interaction [2], often ignore the variety of edge types and focus on processing features on vertices, assuming the relation between any two vertices is a simple binary “connected” or “not connected” relation. This is clearly inadequate for encoding lane graphs: mistaking a lateral pair of lanes for a sequential one could cause the robot to cut through lanes illegally.

There are techniques available to handle heterogeneity in edges. For example, one could turn each edge into a vertex with new edges to the original incident vertices, thus transforming the heterogeneous graph into a larger but homogeneous bipartite graph. Similarly, one could directly equip edges with features to capture the same semantic structure. However, understanding this structure is not easy for the robot: the complexity of traffic interaction comes not from the range of interaction types between two adjacent lanes, but from the complexity of propagating them to nonadjacent but still interacting lanes, i.e. from the vast array of ways these interaction types combine across lanes. This is because in reality there is an enormous variety of combinations of lane formations in urban areas; the mere vocabulary of a small set of link types (e.g. sequential and lateral) struggles to describe more nuanced patterns such as merges/forks and double turns (Fig. 1). In order to make map building tractable and consistent, HD map builders usually have to stick to a small but well-defined set of link types nonetheless. As a result, these more advanced patterns manifest as combinatoric structures beyond 1-ring neighborhoods, and the task of dissecting these nonlocal interactions is left to the map encoder model. Furthermore, the behavior of human drivers often demonstrates adaptations specific to high-order combinations of these basic patterns, making modeling the full gamut of lane interactions extremely challenging.

To illustrate the complexity of lane configuration patterns, Fig. 1 gives a few examples that could potentially be ambiguous when described by the lane connectivity graph. In the example on the left, a left turn leads into a multi-lane road. Depending on the speed, road curvature and traffic, many human drivers would choose to enter the middle or even the rightmost lane, despite it being illegal in many jurisdictions. This driving behavior can be explained by hallucinating a lane that does not exist in the map, or it can be interpreted as entering the neighboring lane of the successor, thus involving a nonlocal interaction between two lanes with a path of length 2 in between. The example in the middle contains a long and shallow merge. During this merge the relation between the two joining lanes transitions from being neighbors to being nearly coincidental, creating a dilemma for the labeling of lane neighbor relation. Furthermore, as the red vehicle in the figure approaches the end of the merge, a left lane change would enter not the immediate left neighbor lane which is the merge counterpart, but the next lane over to the left. From the perspective of the lane graph, this would appear to be an interaction between the red vehicle’s lane and its left neighbor’s left neighbor, again a path of length 2. The example on the right is even more chaotic with several lanes clustered tightly in the middle of the intersection, and lanes further apart topologically could be interacting.

Fig. 1: Ambiguous road configuration examples from the maps of Argoverse, with custom rendering to highlight the start (dark-colored) and end (light-colored) of the lanes. (Left) A turn into a multi-lane road. (Middle) A shallow merge. (Right) An intersection with several nearby lanes.

Based on these observations, we believe an effective encoder model for the HD map lane graph must have the ability to understand the interaction between nonadjacent lane pairs from how they come to be linked in between, or in other words, the ability to infer vertex-vertex attention from the paths, or the sequences of edges, between them. Motivated by this insight, the main contribution of this paper is a novel attention architecture called Path-Aware Graph Attention (PAGA). As its name suggests, the attention gate between two vertices is learned from the sequence of edge features, especially the edge types, along the paths connecting the two vertices, rather than from the vertex features themselves.

II Related Work

II-A Motion Prediction

Motion prediction has become a central focus in autonomous driving research over the past few years, and given the importance of HD maps in urban driving, much research effort has been devoted to effective encoding of the HD maps. Motion prediction techniques can be broadly categorized according to the two paradigms of representing spatial features, for encoding maps and processing interaction with other agents: rasterized and vectorized. We briefly review the literature along these two categories below, with an emphasis on map encoding.


Techniques in the rasterized category interpret the agent’s surroundings as images, typically in the bird’s-eye view (BEV). Casting the task of semantically understanding the agent’s surroundings as a computer vision problem, these techniques can leverage the vast arsenal of image-based technologies such as convolutional neural networks. Some authors argue that a rasterized representation also has the benefit of simplicity in capturing the agent’s spatial context such as the maps [3]. A prime example of this thread of research is ChauffeurNet [4], which uses an RNN to synthesize predicted trajectories. ChauffeurNet renders the map, the navigation information and the other objects in a rectangular area in BEV centered in front of the agent of interest. The roadmap is rendered as an RGB image containing lane centerlines, curbs etc. MultiPath [3] relies on anchor classification and offset regression based on these context features for producing predictions. A number of works adopt the same representation for encoding maps, combining it with vectorized representations for agent motion and interaction: [5] design a multi-hypothesis FC prediction head on top of the convolutional context features and the vectorized agent state features, CoverNet [6] dynamically generates anchor trajectories and classifies them with similar features, and Multiple Futures Prediction and Multi-Agent Tensor Fusion [8] use RNNs to encode and decode agent motion, incorporating the context features at different stages.


Techniques in the vectorized category, on the other hand, focus on the topology of the map by treating it as a graph. Compared to raster images, a graph representation is much more compact and thus enjoys potentially better efficiency. Depending on the connectivity definition, a graph representation could also easily describe complicated traffic semantics such as the case where two lanes are spatially close but prevented from interacting with each other by a median strip. Notable works in this category include VectorNet [9] which represents the map as well as the agent motion as a two-level hierarchy of complete graphs and employs a self-supervising auxiliary task of completing masked-out vertices, LaneGCN [10] which encodes the map as a heterogeneous directed graph and trains a GCN on it parameterized by edge types, and the follow-up LaneRCNN [11] which builds an agent graph to model interaction on top of the map graph. Our proposed method falls in this category as well.

II-B Graph Neural Networks and Attention

As data in many applications naturally manifest as graphs, deep learning models for understanding the structure of graphs have long been under active investigation.
[12] introduce a contraction mapping operator called Graph Neural Network (GNN) to study the steady state under message passing between neighbors. This can be viewed as an infinite-horizon recurrent network; later works [13] make explicit use of RNN for a similar structure. In order to overcome the difficulty of irregular neighborhood topology in applying feedforward neural networks to graphs, PATCHY-SAN [14] proposes a graph normalization procedure to adapt neighborhoods of varying sizes to a fixed convolutional layer, and GraphSAGE [15] outlines a versatile framework that samples the neighborhood for a number of vertices, and uses pooling or LSTM to aggregate them. [16] propose convolutional kernels with the graph spectrum as the basis, and [17] show that under approximation with truncated Chebyshev polynomials, it is equivalent to the local neighborhood convolution operator, called Graph Convolutional Network (GCN). MPNN [18] summarizes the various existing graph-based network structures, including NN4G [19] and GCN, in a general framework based on the notion of message passing. Graph convolution has seen wide applications in many domains, ranging from recommendation systems [20] and social networks [21] to protein-protein interaction [2] and jet physics [22].

Graph attention

Popularized in natural language processing [23], the attention mechanism provides a way to selectively strengthen or weaken the influence of various elements in the context based on their features, and its natural adaptation to graphs, GAT [24], brings an extra layer of expressive power over graph convolution. Building on top of GAT, GAAN [25] adds another layer of gates to adjust the weight of different attention heads to accommodate the variability of neighborhood structures. To model long range interactions beyond the local neighborhood in GAT, SPAGAN [26] computes attention between the vertex of interest and a distant vertex from the features along the shortest path between the two. This can be seen as an approximation of our proposed PAGA: as interaction becomes less direct and thus less relevant with longer path lengths, the shortest path is a zeroth-order approximation to the collection of all paths, from which the path-aware graph attention is computed.

III Method

III-A Definitions

Graphs and models on a graph

Given a directed graph $G = (V, E)$ with vertices $V$ and edges $E$, we aim to find network designs that facilitate extracting vertex output features $y_i$ from input features $x_i$ that capture the topological information of the graph as much as possible. We leave the choice of computation for $f$, the per-vertex transform applied to the features before aggregation, unspecified here as it is orthogonal to the discussion on attention. In a graph convolution framework it is common to use a fully-connected layer applied to $x_j$ from the previous layer for this purpose, but other functions can be used too [18].

This definition works without modification for heterogeneous graphs, which are just graphs whose vertices and edges are heterogeneous, or in other words, are of different types. The semantic information held in the vertex and edge types is commonly represented as vertex and edge features; while most graph-based models directly operate on vertex features, the majority of them are agnostic of the variability of edge types and edge features, and only consider the binary connectedness relationship between pairs of vertices, with potentially a scalar edge weight capturing some notion of interaction intensity. It takes deliberately designed network architectures to respect the topology of the graph when processing the semantics of heterogeneous edges.


For a vertex of interest $v_i$, an attention mechanism selectively focuses the “attention” on another vertex $v_j$ (that is not necessarily directly connected to $v_i$). The attention value $a_{ij}$ modulates how much the features on $v_j$ contribute to the results on $v_i$:

$$ y_i = \sum_{j} a_{ij} \, f(x_j) \tag{1} $$

Here $x_j$ and $y_i$ denote the input and output features on $v_j$ and $v_i$, and $f$ is the per-vertex feature transform. The choice of $a_{ij}$, in particular what features it depends on, characterizes the attention mechanism. For example, in GCN [17] $a_{ij}$ is simply an entry of the (renormalized) Laplacian matrix $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ calculated from the degree matrix $\tilde{D}$ and the adjacency matrix $\tilde{A}$ (with a self-edge added for each vertex), while in GAT [24] $a_{ij}$ is a function of $x_i$ and $x_j$ for $v_j \in N(v_i)$, where $N(v_i)$ is the 1-ring neighborhood of $v_i$. As $a_{ij}$ typically has a very local support to allow for efficient implementation, it is common to use multiple layers of such attention structure to enlarge the receptive field.
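To make the GCN case concrete, the following is a minimal sketch of the renormalized propagation matrix on a tiny chain graph. The function name `gcn_attention` is hypothetical, and the sketch assumes an undirected, untyped graph stored as a dense matrix, which is a simplification for illustration only.

```python
import math


def gcn_attention(num_vertices, undirected_edges):
    """Dense a[i][j] of the renormalized GCN propagation matrix.

    Assumes an undirected, untyped graph: edge types are simply ignored.
    """
    # A~ = A + I: adjacency matrix with a self-edge added for each vertex.
    adj = [[1.0 if i == j else 0.0 for j in range(num_vertices)]
           for i in range(num_vertices)]
    for u, v in undirected_edges:
        adj[u][v] = adj[v][u] = 1.0
    # D~: degree of each vertex, counting the self-edge.
    deg = [sum(row) for row in adj]
    # a_ij = A~_ij / sqrt(D~_ii * D~_jj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(num_vertices)]
            for i in range(num_vertices)]


# Three lanes in a chain: 0 - 1 - 2.
a = gcn_attention(3, [(0, 1), (1, 2)])
```

In this example `a[0][2]` is zero, so vertex 0 can only be influenced by vertex 2 after stacking a second layer, and every weight depends on vertex degrees alone rather than on what kind of edge connects the vertices.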


In the case of heterogeneous graphs, the variable types of edges carry essential information about the nature of the relations between the vertices in question. For a vertex pair $(v_i, v_j)$, an (edge-) path from $v_j$ to $v_i$ of length $l$ is given by $p = (e_1, \dots, e_l)$ satisfying $t(e_k) = s(e_{k+1})$ for $k = 1, \dots, l-1$, and $s(e_1) = v_j$, $t(e_l) = v_i$, where $s$ and $t$ are the operators that return the source and target vertex of an edge respectively. There may be a number of paths of a given length connecting $v_j$ and $v_i$, and they collectively describe the pathways for information to flow between the two vertices through the graph. In the context of encoding HD maps, such paths may capture how the traffic in one lane may move into the other, and the sequence of edge types along the path is a complete descriptor of how the traffic interaction could proceed.

III-B Path-Aware Graph Attention

Fig. 2: Illustration of the computation on edge sequences in PAGA. (A) A heterogeneous graph with two types of edges (depicted as red and green). (B) The path connecting one vertex on the top-left corner to another on the bottom-right corner goes through a green edge followed by a red edge. The edge features of this sequence are encoded by LSTM to produce the attention between the two vertices. (C) For a different vertex pair, there are three paths connecting the two vertices, all contributing to their attention value.

The proposed model, Path-Aware Graph Attention, bases the attention function $a_{ij}$ on the edge features along the paths connecting $v_j$ to $v_i$:

$$ a_{ij} = \sum_{l=1}^{L} \sum_{p \in P_{ij}^{l}} g_l\left(z_{e_1}, \dots, z_{e_l}\right), \quad p = (e_1, \dots, e_l) $$

where $P_{ij}^{l}$ is the set of all paths of length $l$ connecting $v_j$ to $v_i$, $z_e$ is the feature of edge $e$, and $g_l$ is a learnable feature extractor function, such as a neural network, that produces the attention values from the sequence of edge features along a path of length $l$. $L$ is a hyper-parameter controlling the maximum path length to be considered: above a certain length the interaction represented by the paths is simply too indirect to be relevant, and including them would not improve the result while still incurring more computation cost (since $|P_{ij}^{l}|$ generally grows with $l$). This attenuation effect with $l$ should arise naturally in the learned functions $g_l$, but it can also be modeled explicitly by scaling $g_l$ with a factor that decays in $l$.
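The aggregation described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: contributions of multiple paths are assumed to be summed, features are scalars, the paths are given as precomputed edge-type sequences, and the hand-written lookup `toy_g` stands in for the learned extractor. All names (`paga_attention`, `paga_layer`, `toy_g`) are hypothetical.

```python
def paga_attention(edge_type_paths, g, max_len):
    """Sum g over all paths between one vertex pair, up to length max_len."""
    return sum(g(seq) for seq in edge_type_paths if 1 <= len(seq) <= max_len)


def paga_layer(x, paths, g, max_len, f=lambda value: value):
    """y_i = sum_j a_ij * f(x_j), with a_ij computed from edge-type sequences.

    x:     dict vertex -> scalar feature
    paths: dict (i, j) -> list of edge-type tuples, one per path linking j and i
    g:     stand-in for the learned extractor (an LSTM in the paper)
    """
    y = {i: 0.0 for i in x}
    for (i, j), seqs in paths.items():
        y[i] += paga_attention(seqs, g, max_len) * f(x[j])
    return y


# Hypothetical hand-written g: only a double left-neighbor hop gets attention.
TOY_G = {("left", "left"): 1.0}


def toy_g(seq):
    return TOY_G.get(tuple(seq), 0.0)


# Shallow-merge example: B is A's left neighbor, C is B's left neighbor.
x = {"A": 1.0, "B": 10.0, "C": 100.0}
paths = {("A", "B"): [("left",)], ("A", "C"): [("left", "left")]}
y = paga_layer(x, paths, toy_g, max_len=2)
```

With this toy extractor, lane A attends fully to its second-order neighbor C and not at all to B, and restricting `max_len` to 1 removes that interaction entirely.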

We emphasize that this formulation of attention enables modes of vertex interaction that cannot be easily achieved by the traditional GCN or GAT approaches. This is elaborated further in Sections III-C and IV-A. Fig. 3 compares the structure of a few related frameworks to PAGA.

Fig. 3: Comparison of how $y_i$ is obtained from the neighborhood of $v_i$ in GraphSAGE (the LSTM variant), GCN, GAT and PAGA. GraphSAGE uses LSTM to aggregate the (sampled) vertex neighborhood, while the other three can be interpreted as attention with different computation.

III-B1 Choices for Edge Sequence Feature Extractor

The capability of the path-aware graph attention mechanism hinges upon the choice of $g_l$, the function that processes the type features of individual edges along a path to determine an overall gating coefficient that filters the influence of $v_j$ on $v_i$ via this path. Any pooling operator (such as max-pool or avg-pool) could serve to aggregate features along the path, but the permutation-invariance of pooling operators means they would not capture the semantics of the ordering of edges along a path. This would be undesirable e.g. at lane forks, where the neighbor of the subsequent lane is not necessarily the same as the subsequent lane of the neighbor. Therefore, a permutation-sensitive operator, such as a recurrent neural network (in practice we use LSTM), is potentially better. When $l$ is small, a fully connected layer (whose input width is proportional to $l$) could work as well.
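The difference between permutation-invariant and permutation-sensitive extractors can be seen on a two-edge path. The sketch below assumes a hypothetical two-type vocabulary (`"succ"` for sequential links, `"left"` for lateral links) with one-hot edge features; it uses summation as the pooling operator and concatenation as the simplest order-preserving alternative.

```python
EDGE_TYPES = ("succ", "left")  # hypothetical two-type vocabulary


def one_hot(edge_type):
    return [1.0 if edge_type == t else 0.0 for t in EDGE_TYPES]


def sum_pool(seq):
    """Permutation-invariant: adds edge features, discarding their order."""
    return [sum(column) for column in zip(*(one_hot(t) for t in seq))]


def concatenate(seq):
    """Permutation-sensitive: keeps one slot per position along the path."""
    return [value for t in seq for value in one_hot(t)]


neighbor_of_successor = ("succ", "left")
successor_of_neighbor = ("left", "succ")
```

Here `sum_pool` maps both paths to the same vector, so no downstream layer can tell them apart, whereas `concatenate` yields distinct inputs. The concatenated width grows with the path length, which is why it is only practical for short paths.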

III-B2 Efficient Implementation of Path-based Feature Extraction

The number of possible paths from a given vertex up to length $L$ does grow exponentially with $L$, but in practical applications on HD maps where the graph is highly sparse, the resulting support (the $L$-ring neighborhood) is still a very small subset of $V$; taking advantage of the sparsity is therefore a necessity for computational efficiency. Sparsity can be realized by storing the vertex adjacency matrix in the compressed column/row storage formats, i.e. as a list of tuples, and structuring the graph operations around this format.
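As an illustration of working directly on a sparse edge list, the sketch below enumerates all edge-type sequences of paths leaving one vertex, up to a maximum length. It is a plain breadth-wise expansion, not the authors' implementation; the function name `enumerate_paths`, the `(source, target, edge_type)` tuple layout and the lane names are assumptions made for the example.

```python
from collections import defaultdict


def enumerate_paths(edges, source, max_len):
    """Map each reachable vertex to the edge-type sequences of all paths
    from `source` to it with length 1..max_len.

    edges: list of (src, dst, edge_type) tuples (a sparse edge list).
    """
    out_edges = defaultdict(list)
    for src, dst, edge_type in edges:
        out_edges[src].append((dst, edge_type))

    paths = defaultdict(list)
    frontier = [(source, ())]  # (current vertex, edge types so far)
    for _ in range(max_len):
        next_frontier = []
        for vertex, types in frontier:
            for dst, edge_type in out_edges[vertex]:
                seq = types + (edge_type,)
                paths[dst].append(seq)
                next_frontier.append((dst, seq))
        frontier = next_frontier
    return dict(paths)


# Lane A has successor S and left neighbor M; both lead to lane N.
edges = [("A", "S", "succ"), ("S", "N", "left"),
         ("A", "M", "left"), ("M", "N", "succ")]
paths_from_a = enumerate_paths(edges, "A", max_len=2)
```

Only vertices that are actually reachable appear in the result, so the cost scales with the size of the local neighborhood rather than with the number of vertex pairs in the map.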

III-C The Intuition Behind PAGA

In essence, PAGA learns attention by trying to capture the ways that information could flow through the graph, because it is via this flow that vertices interact with and influence one another. Consider the middle example in Fig. 1 again, and let’s call the two merging lanes A and B respectively, with A being the red vehicle’s lane. If the red vehicle intends to lane change to the left, it would have to negotiate with the traffic in its “neighbor’s neighbor” lane, i.e. the lane to the left of B (let’s call this lane C). Therefore the relationship between A and its immediate neighbor B and second-order neighbor C is unusual: in this case the second-order neighbor behaves like a regular neighbor lane where one would check for lane change safety, while the immediate neighbor behaves more like the current lane where one would look for leading vehicles and obstacles. This poses a challenge for the map encoder model which is responsible for discovering these high order connections from the graph structure. Each path connecting two vertices corresponds to a way the two lanes could come into interaction, and as witnessed by this shallow merge example, if vertex A interacts with vertex B which in turn interacts with vertex C, the interaction between A and C cannot be treated recursively, as it is not necessarily an attenuated version of the interaction between A and B or that between B and C: for example, for evaluating lane change safety from lane A, we should put 100% attention on C, and 0% attention on B. This is the intuition that motivates us to develop a path-aware mechanism for computing attention. As investigated in Section IV-A, a simple toy problem with a graph exactly like this turns out to be difficult to existing techniques like GCN, but not to PAGA.

IV Experiments

IV-A Didactic Problem: Learning a Skip Interaction

In the same spirit as the successful practice of decomposing large convolutional kernels into multiple layers of small (3x3) ones in computer vision, most graph convolution methods do not consider vertices beyond the immediate 1-ring neighborhood, and rely on stacking layers to push the receptive field to cover long range interactions. However, we argue that such decomposition may limit the network’s expressive power for long range interactions. As discussed above, modeling complex high order connectivity in certain situations requires the ability to encode an interaction in a non-recursive way. In this section we describe a simple problem constructed to illustrate this phenomenon.

Consider a graph consisting of three vertices $v_1, v_2, v_3$, and two edges connecting $v_1$ with $v_2$ and $v_2$ with $v_3$ (Fig. 4 Left). Define a function $x$ as the “feature” on the vertices, and another function $y$ as the “label”. The task is simply learning the functional that maps $x$ to $y$, given a prescribed construction of $y$. To reflect the situation in the middle example of Fig. 1, we define $y$ as $y_1 = x_3$, $y_2 = x_2$, $y_3 = x_3$. This is equivalent to the following attention per (1):

$$ a = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

In other words, we would like to learn a model that focuses its attention completely on vertex $v_3$ for vertex $v_1$, while for vertices $v_2$ and $v_3$ it simply focuses on themselves. Note that the attention from $v_1$ to $v_3$ has to go through the path via $v_2$ topologically, but $v_2$ must not be affected.
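The target of this didactic problem can be written down directly, which is useful for checking a learned model against it. The sketch below generates examples and applies the prescribed attention; it assumes standard normal feature values (the sampling distribution is not stated in the text), 0-based indexing of the three vertices, an identity per-vertex transform, and hypothetical helper names `make_example` and `apply_attention`.

```python
import random

# Row i lists how much each vertex j contributes to the label on vertex i.
TARGET_ATTENTION = [
    [0.0, 0.0, 1.0],  # first vertex attends only to the third
    [0.0, 1.0, 0.0],  # second vertex attends only to itself
    [0.0, 0.0, 1.0],  # third vertex attends only to itself
]


def make_example(rng):
    """One (feature, label) pair; the first feature is fixed to zero."""
    x = [0.0, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = [x[2], x[1], x[2]]
    return x, y


def apply_attention(a, x):
    """y_i = sum_j a_ij * x_j with an identity per-vertex transform."""
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(a))]


rng = random.Random(0)
dataset = [make_example(rng) for _ in range(5000)]
train, evaluation = dataset[:4500], dataset[4500:]
```

A 1-ring method has to route the first vertex's dependence on the third through the second vertex, while the label on the second vertex must stay independent of the third; the matrix above shows that a direct skip weight expresses this without any such coupling.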

We generate a dataset of 4500 examples with random values on $x_2$ and $x_3$ (we set $x_1$ to zero, which has no effect), and train simple implementations of GCN and PAGA, both with hidden state size 1 and with no nonlinearities. Removing the nonlinearity helps reduce the randomness across runs, and the result should still be meaningful. The networks are trained on the MSE loss on $y$ for 50 epochs with the Adam optimizer at a learning rate of 0.01, and the resulting models are evaluated on another 500 examples.

Fig. 4: Left: Graph used in the didactic problem Skip Interaction. Right: Loss convergence, over 100 trials.

Fig. 4 Right shows the convergence results of the two models on this problem over 100 trials. GCN invariably fails to converge to zero loss, with the final training loss being 0.05 on average and the evaluation MSE near 0.1. PAGA’s training loss and evaluation MSE both converge to near zero.

IV-B Experimental Evaluation on Argoverse

We then evaluate PAGA on a large scale motion prediction benchmark, the Argoverse Motion Forecasting dataset [27]. The dataset comes with detailed HD maps in the vectorized format, covering two urban areas in Pittsburgh and Miami. The Argoverse dataset consists of over 200k five-second-long training scenarios labeled at 10Hz, and the task is predicting the motion of one specially marked object, called the agent, over the last 3 seconds of the scenario, given its motion and environment in the first 2 seconds. Prediction results are evaluated on standard trajectory distance metrics such as ADE (average displacement error, i.e. the average of the Euclidean distances between the predicted trajectory and the ground truth points) and FDE (final displacement error, i.e. the Euclidean distance at the last point), and multi-modal predictions are encouraged by the minADE and minFDE metrics (ADE and FDE of the best one out of $K$ guesses output by the predictor). To promote the predictor’s awareness of the uncertainty in its output modes, Argoverse recently introduced the brier-minADE and brier-minFDE metrics, which add a penalty term on top of minADE and minFDE based on the probability of the best guess output by the predictor.
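The metrics above can be sketched as follows for a single scenario. Two points are assumptions rather than statements from the text: the best guess is taken to be the one with the smallest endpoint error (also for minADE), and the Brier penalty is taken to be $(1 - p)^2$, with $p$ the probability assigned to that best guess. The function names are hypothetical.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) of one predicted trajectory of (x, y) points."""
    dists = [math.dist(p, q) for p, q in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


def min_metrics(preds, probs, truth):
    """minADE, minFDE and brier-minFDE over K guesses for one scenario."""
    errors = [displacement_errors(pred, truth) for pred in preds]
    # Assumption: the best guess is the one with the smallest endpoint error.
    best = min(range(len(preds)), key=lambda k: errors[k][1])
    ade, fde = errors[best]
    # Assumption: Brier-style penalty on the probability of the best guess.
    penalty = (1.0 - probs[best]) ** 2
    return {"minADE": ade, "minFDE": fde, "brier-minFDE": fde + penalty}


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # ends 1 m off
    [(0.0, 0.0), (1.0, 3.0), (2.0, 4.0)],  # ends 4 m off
]
metrics = min_metrics(preds, probs=[0.5, 0.5], truth=truth)
```

In this example the first guess is the best one, and because it only received probability 0.5 the penalty adds 0.25 to its endpoint error, so a predictor is rewarded for being confident in the mode that turns out to be closest.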

In a motion prediction model, our proposed Path-Aware Graph Attention network works as a module that parses the HD maps to provide guidance to the agent trajectory decoder. We plug our PAGA module into the framework of LaneGCN [10] as a drop-in replacement of its map processing network based on graph convolution. The other components of LaneGCN, such as the four-way attention between the agents and the map, and the trajectory encoder/decoder, do not need any modification to work with the PAGA map encoding net.

Due to the different structure of PAGA from graph convolution, some hyper-parameters we use in the experiments on Argoverse are not found in LaneGCN or are modified, as detailed below. The attention gates are implemented as a standard multi-headed attention with 8 heads. We use a maximum path length $L$ of 2, based on the intuition that most of the motivating examples (Fig. 1) involve path-dependent interactions in the 2-ring. To reduce the branching factor, we only retain 3 out of the original 6 scales when connecting sequential lane vertices; the lost receptive field can be easily recovered by the built-in nonlocal interaction of PAGA. We use 32 channels in the edge features (see Table VI), and we augment the edge features with the raw features (position and direction) of incident vertices. The channel widths of the vertex features and of the LSTM hidden states are fixed hyper-parameters as well. Training is run on Nvidia V100 GPUs (16GB), on servers with Intel Xeon(R) Platinum 8163 CPUs with 336GB RAM.

Table I summarizes the performance of our model compared with some state-of-the-art methods, with the lower half of the table focusing on vectorized representations. Our entry into the 2021 Argoverse Motion Forecasting Competition ranked first by the official metric, brier-FDE, and surpassed the other vectorized approaches by a large margin (Table II).

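| Model | minFDE (K=6) | minADE (K=6) | MR (K=6) | minFDE (K=1) | minADE (K=1) |
|---|---|---|---|---|---|
| MultiPath [3] (as reported by [28]) | 1.68 | 0.80 | 14% | - | - |
| [29] | 1.12 | 0.73 | - | - | - |
| TPCN [30] | 1.15 | 0.73 | 11% | 2.95 | 1.34 |
| LaneGCN [10] | 1.09 | 0.71 | 11% | 2.97 | 1.35 |
| LaneRCNN [11] | 1.19 | 0.77 | 8% | 2.85 | 1.33 |
| VectorNet [9] | - | - | - | 3.67 | 1.66 |
| TNT [28] | 1.29 | 0.73 | 9% | - | - |
| Ours | 1.02 | 0.69 | | 2.87 | 1.31 |

TABLE I: Comparison to state of the art prediction techniques on the Argoverse validation set.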

| Model | brier-FDE | FDE | ADE |
|---|---|---|---|
| Baseline (NN) [27] | 3.98 | 3.29 | 1.71 |
| LaneGCN [10] | 2.06 | 1.36 | 0.87 |
| LaneRCNN [11] | 2.15 | 1.45 | 0.90 |
| VectorNet [9] | - | - | - |
| TNT [28] | - | 1.54 | 0.94 |
| DM (2nd) | 1.77 | 1.14 | 0.81 |
| poly (3rd) | 1.79 | 1.21 | 0.79 |
| HIKVISION (4th) | 1.82 | 1.19 | 0.82 |
| Ours (1st) | 1.76 | 1.21 | 0.80 |

TABLE II: Experimental results on the Argoverse test set (2021 Competition final standing).

IV-C Ablation Experiments

We report ablation experiments that are evaluated on the Argoverse validation set. For these experiments, we train and evaluate on a decimated dataset created by retaining one example for every ten. This drastically reduces the training time allowing us to run multiple experiments for each setup and report error bars; we empirically find that the performance on the decimated dataset correlates well with that on the full Argoverse dataset.

Component ablation

First of all, we would like to evaluate the contribution of the map encoding components in the whole prediction pipeline, which provides performance lower bounds for the other ablation studies. We experiment with removing either the entire Map net, or the attention mechanism alone within the Map net, and summarize the results in Table III. Removing the “Map net” component is done by deleting the entire PAGA-based map encoder network, whose output is fused with agent encoder outputs in the LaneGCN framework. Removing “Attention” is done by manually overwriting all attention gates to zero after they are computed. It can be seen that attention provides a small but statistically significant improvement in performance.

| Map net | Attention | ADE | FDE |
|---|---|---|---|
| no | no | 0.8234 ± 0.0047 | 1.3443 ± 0.0144 |
| yes | no | 0.7697 ± 0.0012 | 1.1962 ± 0.0037 |
| yes | yes | 0.7611 ± 0.0011 | 1.1833 ± 0.0038 |

TABLE III: Component ablation study.

Path attention features

The attention feature extractor function $g_l$ takes as input a sequence of edge features, which include both the edge type and the spatial features (position and direction) of the incident vertices of the edge. Table IV shows that eliminating both together degrades the prediction performance substantially (equivalent to eliminating all attention), but keeping either one alone shows a much smaller performance drop, suggesting that they may contain redundant information. This can be understood intuitively, as lane connectivity and lane spatial relationship both describe the same semantic relationship as perceived by the drivers.

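| Edge type | Spatial features | ADE | FDE |
|---|---|---|---|
| no | no | 0.7711 ± 0.0015 | 1.2106 ± 0.0029 |
| | | 0.7617 ± 0.0001 | 1.1942 ± 0.0028 |
| | | 0.7621 ± 0.0021 | 1.1746 ± 0.0035 |
| yes | yes | 0.7611 ± 0.0011 | 1.1833 ± 0.0038 |

TABLE IV: Ablation of edge features in the attention feature extractor $g_l$.

Attention feature extractor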

For the feature extractor $g_l$, permutation-sensitive functions such as LSTM or concatenation perform better than symmetric functions such as summation, as shown in Table V.

| Form of $g_l$ | ADE | FDE |
|---|---|---|
| LSTM | 0.7611 ± 0.0011 | 1.1833 ± 0.0038 |
| summation | 0.7652 ± 0.0030 | 1.1924 ± 0.0075 |
| concatenation | 0.7639 ± 0.0015 | 1.1934 ± 0.0018 |

TABLE V: Effects of the choice of function $g_l$.

Attention feature capacity

Eliminating the multi-headed attention (using a single attention head) degrades performance as expected. Since the number of channels in edge features puts a limit on the capacity to encode interaction types needed to capture complicated combinatoric patterns, constraining it (to a single channel) has an even larger impact on prediction accuracy. See Table VI.

| Edge feature channels | Attention heads | ADE | FDE |
|---|---|---|---|
| 32 | 1 | 0.7649 ± 0.0019 | 1.1954 ± 0.0040 |
| 1 | 8 | 0.7694 ± 0.0008 | 1.2096 ± 0.0033 |
| 32 | 8 | 0.7611 ± 0.0011 | 1.1833 ± 0.0038 |

TABLE VI: Ablation on attention feature capacity.

V Conclusion

PAGA is developed with motivation from complex real world road configurations and the traffic interaction therein, and we evaluated its effectiveness in understanding nonlocal interactions on heterogeneous graphs, with both a didactic problem that highlights the structure of the attention, and a large scale motion prediction dataset with HD maps. We hope to continue exploring its applications, especially problems involving modeling interaction through paths with rich semantic context.

The computational efficiency of PAGA requires further investigation. As path lengths increase, the number of paths grows exponentially, at a rate set by the graph’s branching factor. When the function $g_l$ is permutation-invariant, paths with the same source and target vertices could potentially be pooled to reduce computation cost, but the number of vertices in the larger neighborhood itself may grow exponentially anyway. We will be looking for ways to tackle this complexity in future work.


We would like to thank Weixin Li, Runlin He and Chenxu Luo for their comments and perspectives.