Unlimited Neighborhood Interaction for Heterogeneous Trajectory Prediction

07/31/2021
by   Fang Zheng, et al.

Understanding complex social interactions among agents is a key challenge for trajectory prediction. Most existing methods consider interactions between pairs of traffic agents or within a local area, whereas real interactions are unlimited: they can involve an uncertain number of agents and non-local areas simultaneously. Besides, existing methods focus only on homogeneous trajectory prediction, namely prediction among agents of the same category, and neglect people's diverse reaction patterns toward traffic agents of different categories. To address these problems, we propose a simple yet effective Unlimited Neighborhood Interaction Network (UNIN), which predicts trajectories of heterogeneous agents in multiple categories. Specifically, the proposed unlimited neighborhood interaction module generates the fused features of all agents involved in an interaction simultaneously, and adapts to any number of agents and any range of interaction area. Meanwhile, a hierarchical graph attention module is proposed to obtain category-to-category interaction and agent-to-agent interaction. Finally, parameters of a Gaussian Mixture Model are estimated for generating the future trajectories. Extensive experimental results on benchmark datasets demonstrate a significant performance improvement of our method over the state-of-the-art methods.


1 Introduction

Figure 1: Hierarchical Graph Attention & Unlimited Neighborhood Interaction. Different marked shapes are used to distinguish the agent and category. (A) Hierarchical Attention. The agents marked by the same color belong to the same category. (1) Category Attention. Every category interacts with each other and itself. Attention of one category to all categories (including the attention to itself) is transferred to all agents of the category. (2) Agent Attention. We compute one’s attention to the rest of the agents in the whole scenario, and the attention is directed. (B) Unlimited Neighborhood. We consider the interaction as among a collective, rather than between two agents or in a small area. The behavior of any agent may influence a group of agents around the whole scenario.

The challenges hampering prediction accuracy largely stem from the complex interactions among agents [Alahi et al., 2016, Gupta et al., 2018, Zhang et al., 2019a]. Recent advances in this regard [Alahi et al., 2016, Liang et al., 2020a, Liang et al., 2020b, Bi et al., 2019, Li et al., 2019] mainly fall into two types: Graph-based methods [Mohamed et al., 2020, Haddad and Lam, 2020, Yu et al., 2020] build a spatial graph at each time step and aggregate the features from adjacent nodes; RNN-based methods [Chandra et al., 2020, Zhang et al., 2019a, Bisagno et al., 2018] model each agent’s trajectory with Recurrent Neural Networks (RNNs) and pool hidden states within a surrounding area.

However, these methods suffer from limitations. Graph-based methods [Mohamed et al., 2020] only exploit pairwise relation between the nodes, while other nodes are mixed and relayed. In contrast, the interaction in real-world traffic is much more complex than previously assumed, such as multilateral relations (relation among three or more agents). Namely, these methods are limited by inflexible numbers of interaction agents.

Moreover, RNN-based methods [Chandra et al., 2020, Zhang et al., 2019a, Bisagno et al., 2018] merely consider local relations within an agent’s manually defined surrounding area. As a result, potential interaction participants outside such a “surrounding area” are simply discarded. Namely, these methods are limited by a hand-crafted rule for selecting interacting agents.

To solve these problems, we propose the Unlimited Neighborhood on a heterogeneous graph to predict the future trajectories of agents in multiple categories (e.g., pedestrians, bikes, cars, etc.), as shown in Figure 1. Unlimited Neighborhood means that interactions are not limited by the number of agents or the range of the area. Namely, any agent in a scenario could be involved in an interaction, as illustrated in Figure 2. In addition, many related works [Gupta et al., 2018, Yu et al., 2020] focus on homogeneous agents (e.g., pedestrians), while a real traffic scenario usually involves heterogeneous agents (i.e., agents in diverse categories). Because movement patterns (e.g., velocity, front and rear distance, and the response to an interaction) differ across categories, trajectory prediction for heterogeneous agents is more challenging than for homogeneous agents.

Specifically, we present a simple yet effective Unlimited Neighborhood Interaction Network for heterogeneous trajectory prediction, which models the hierarchical attention and fuses all agents involved in interaction to predict the future trajectories for all agents with different categories simultaneously. Then, regarding the agents as nodes and the agents with the same category as a category node, we can construct a spatio-temporal-category graph combining spatial, temporal and category information together. The hierarchical graph attention module acquires the category-category attention and then the agent-agent attention on the constructed graph. Note that the edges in the constructed graph are directed. Namely the edges are represented as a weighted asymmetric adjacency matrix to measure the interactions.

Once the hierarchical interactions are obtained, an unlimited neighborhood interaction module is employed to capture the global information of all agents involved in the same interaction by an asymmetric convolutional network. Based on the global information and the hierarchical attention, the final interaction is obtained and fed into a Graph Convolutional Network (GCN) [Mohamed et al., 2020], which is followed by a Temporal Convolutional Network (TCN) [Bai et al., 2018], to estimate the parameters of a Gaussian Mixture Model (GMM) [Reynolds, 2009].

Experimental results on multiple benchmark datasets demonstrate significant performance improvement of our method over the state-of-the-art methods. The visualization shows our method can learn the interaction among heterogeneous agents well. The code will be published upon acceptance.

In summary, the key contributions of this paper include:

  • We propose to model the interaction among heterogeneous agents to improve the trajectory prediction;

  • We present an Unlimited Neighborhood Interaction for modeling the interaction among the agents involved in the same interaction simultaneously;

  • We present a Hierarchical Graph Attention module for enhancing the agent-to-agent interaction based on category-to-category interaction.

Figure 2: Comparison between Unlimited Neighborhood Interaction and Pairwise Neighborhood Interaction (e.g., GCN). Different agents are enclosed in differently colored boxes, corresponding to solid circles in the same color. The hollow circle denotes an interaction. As can be seen, an interaction involves a group of agents in our method. On the right side, we show the predicted trajectories with or without our Unlimited Neighborhood.

2 Related Works

Figure 3: Framework of our UNIN. The trajectories are reformed as spatio-temporal and category inputs, and a spatio-temporal-category graph (STC graph) is composed. Hierarchical Attention learns directed category attention representing the category interactions, and directed agent attention representing the agent interactions from the STC graph. Collective interactions are captured by subsequent Unlimited Neighborhood Interaction with the asymmetric attention matrix, and then fed into a spatio-temporal graph convolutional network and temporal convolutional network to estimate the parameters of the Gaussian Mixture Model, from which the future trajectories are predicted.

Trajectory prediction mainly involves homogeneous and heterogeneous trajectory prediction in real scenarios. Homogeneous trajectory prediction predicts future trajectories of agents in the same category (e.g., only pedestrians). In contrast, heterogeneous trajectory prediction predicts future trajectories of agents in different categories (e.g., pedestrians, cars and bikes).

2.1 Homogeneous Trajectory Prediction

Prior to the prevalence of deep learning, classical methods [Tay and Laugier, 2008, Treuille et al., 2006, Wang et al., 2007] were common, including Social Force models [Helbing and Molnar, 1995], Gaussian Process regression models [Tay and Laugier, 2008], dynamic Bayesian models [Wang et al., 2008] and hidden Markov models [Surana and Srivastava, 2014], which are limited by hard-to-design hand-crafted features.

Thanks to the representational power of deep neural networks, trajectory prediction has recently been dominated by deep learning based methods, such as Recurrent Neural Networks (RNNs) [Alahi et al., 2016], Generative Adversarial Networks (GANs) [Gupta et al., 2018], Graph Convolutional Networks (GCNs) [Mohamed et al., 2020] and Transformers [Yu et al., 2020]. S-LSTM [Alahi et al., 2016] aggregates the interaction information through a pooling mechanism. S-GAN [Gupta et al., 2018] predicts multiple socially acceptable trajectories using GANs. Later works measure the influence of interaction by an attention mechanism. S-BiGAT [Kosaraju et al., 2019] uses Graph Attention Networks [Veličković et al., 2018] to model the interactions between pedestrians. STAR [Yu et al., 2020] separately models spatial interaction and temporal continuity through a Transformer [Vaswani et al., 2017] architecture on a graph.

Since physical constraints in the scenario and human states are the predominant factors in trajectory prediction [Sadeghian et al., 2019] under certain circumstances, extensive studies focus on the role of physical information recently [Casas et al., 2020, Tao et al., 2020, Liang et al., 2020b]. Sophie [Sadeghian et al., 2019] leverages both physical and social information to predict pedestrian trajectory. Notably, CVM [Hasan et al., 2018] takes pedestrians’ velocity and direction into account rather than semantic environment. ECTP [Mangalam et al., 2020b] infers trajectory endpoints first as additional information to assist pedestrians’ planning path. Different from pedestrian trajectory prediction, vehicle trajectory prediction methods can take advantage of more sensors and semantic environments, such as 3D point cloud and lane line [Liang et al., 2020c, Xie et al., 2017].

2.2 Heterogeneous Trajectory Prediction

Heterogeneous traffic agents such as pedestrians and vehicles follow different social conventions, and thus homogeneous trajectory prediction methods cannot simultaneously model the interactions of all agents of different categories in the same scene and predict accurately.

Heterogeneous trajectory prediction in real traffic scenes has gradually attracted more research interest. JPKT [Bi et al., 2019] treats vehicles as rigid particles, where non-particle objects are subject to kinematics, and models vehicles and pedestrians with separate LSTMs. DATF [Park et al., 2020] models agent-to-agent and agent-to-scene interactions and proposes a new approach to estimate the trajectory distribution. In brief, these methods focus on the different behavior patterns of heterogeneous traffic agents and on the influence of the semantic environment.

While previous works ignore the interaction at the categorical granularity and unlimited interactions among agents, we model the interaction among all agents. In our method, physical constraints are implicitly learned through observed trajectories without environmental semantics as prior.

2.3 Graph Neural Network

Graph neural network (GNN) [Scarselli et al., 2008] extends the neural network to process data without a natural order. A GNN learns a state vector embedding containing information about every node and its corresponding neighbors. In order to gather information from neighbor nodes and their edges and enrich the representation of GNNs, extensive works [Kipf and Welling, 2016, Hamilton et al., 2017, Li et al., 2016, Veličković et al., 2018] study more complex graph structures. GCN [Kipf and Welling, 2016] and GraphSAGE [Hamilton et al., 2017] use spectral and spatial convolutional aggregation respectively: spectral convolution works in the Fourier domain through the eigendecomposition of the graph Laplacian, while spatial convolution operates on adjacent neighbor nodes in the spatial domain. GGNN [Li et al., 2016] proposes a gated graph neural network to improve long-term information propagation. GAT [Veličković et al., 2018] introduces the attention mechanism to acquire the hidden state of a node by attending to its neighbor nodes. Highway GCN [Rahimi and Baldwin, 2018] leverages skip connections to avoid introducing more noise from stacking [Li et al., 2018] network layers.

Previous trajectory prediction methods, e.g., those based on GCN and GAT, lack a clear and proper distinction between heterogeneous nodes and homogeneous nodes, while our method takes large-scale heterogeneous graphs into account. In addition, most existing graph neural networks group heterogeneous nodes into a subgraph, which suffers from data imbalance and ineffective global information aggregation. In contrast, we utilize hierarchical graph attention to aggregate the information of large-scale heterogeneous nodes.

3 Our Method

In this section, we introduce our proposed UNIN, which aims to model interactions of heterogeneous traffic agents under the guidance of unlimited neighborhood interaction. Given a sequence of video frames of traffic scenarios over observed time steps $t \in \{1, \dots, T_{obs}\}$, there are $C$ categories with $N$ agents. The goal of trajectory prediction is to predict the location of each traffic agent within a future time horizon $T_{pred}$. A traffic agent $i$ of category $c$ is denoted as $\{p_i^{t,c}\}$, where $p_i^{t,c} = (x_i^{t,c}, y_i^{t,c})$ is the location coordinate of traffic agent $i$ at time step $t$.

As discussed, the interactions in previous works are only considered between two traffic agents or in a local area, while an unlimited number of other agents may be simultaneously involved in an interaction regardless of their category. Additionally, most of the existing works only focus on homogeneous trajectory prediction, but the heterogeneous trajectory, which is spontaneous in real traffic scenarios, is under-explored. To mitigate these limitations, we propose the Unlimited Neighborhood Interaction to capture the impact that all agents experience at the same time, and a hierarchical attention module to model heterogeneous interactions among traffic agents of different categories.

The overall framework of UNIN is illustrated in Figure 3. To aggregate information of agents involved in the same interaction, an interaction graph is built first to gather global interaction information. Subsequently, the Hierarchical Attention Module is used to obtain the category-category interaction and agent-agent interaction based on the global interaction information. Next, we introduce the Unlimited Neighborhood Module directly modeling interactions by pooling features among unlimited neighborhood agents. Finally, a heterogeneous graph convolution network and a temporal convolutional network are used to predict the parameters of a Gaussian Mixture Model for trajectory prediction.

3.1 Heterogeneous Graph Construction

There are agents in multiple categories in heterogeneous trajectory prediction, and thus we build a spatio-temporal-category graph to model them altogether, as shown in Figure 3, where every agent is regarded as a node and the interactions among agents are regarded as edges. To enhance the representations of category-category interaction, we also regard all agents with the same category as a category node:

$$G = (V, E), \quad V = \{v_i^{t,c}\} \cup \{V_c^{t}\}, \quad E = \{e_i^{t}\} \cup \{e_{ij}^{t}\} \cup \{E_{cc'}^{t}\} \cup \{E_{ci}^{t}\} \tag{1}$$

where $i$, $t$ and $c$ represent the index of node, time step and category, respectively. $v_i^{t,c}$ represents node $i$ with category $c$ at time step $t$. $e_i^{t}$ is a temporal edge connecting node $v_i^{t,c}$ and $v_i^{t+1,c}$. $e_{ij}^{t}$ is a spatial edge connecting node $v_i^{t,c}$ and $v_j^{t,c'}$. $V_c^{t}$ is the category node with category $c$ at time step $t$, generated by the concatenation of all agents with category $c$ at time step $t$. $E_{cc'}^{t}$ is the spatial category edge connecting category nodes $V_c^{t}$ and $V_{c'}^{t}$ at time step $t$. $E_{ci}^{t}$ is the spatial category-agent edge connecting category node $V_c^{t}$ and each spatial node $v_i^{t,c}$ belonging to category $c$.

The built spatio-temporal-category graph includes not only the information of each agent, but also the information of each category. Therefore, we can leverage this graph to build category-to-category and agent-to-agent interaction.

Note that the temporal category edge between the category nodes of consecutive time steps is not built, because we empirically find that it does not benefit performance in practice. We speculate that people tend to behave similarly without interaction in traffic scenes when the number of samples grows to a certain magnitude.
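The graph construction described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the function name `build_stc_graph`, the input layout `tracks`, and the edge-type labels are hypothetical and not taken from the authors' code; spatial edges are assumed to connect every ordered pair of agents present at the same time step, and category edges include self-loops because Figure 1 states that every category interacts with each other and itself.

```python
from collections import defaultdict
from itertools import permutations


def build_stc_graph(tracks):
    """Build a toy spatio-temporal-category graph.

    tracks: {agent_id: (category, {time_step: (x, y)})}  (hypothetical layout)
    Returns (agent_nodes, category_nodes, edges).
    """
    agent_nodes = set()                 # (agent_id, t)
    category_nodes = defaultdict(list)  # (category, t) -> member agent ids
    for agent_id, (category, positions) in tracks.items():
        for t in positions:
            agent_nodes.add((agent_id, t))
            category_nodes[(category, t)].append(agent_id)

    edges = {"temporal": [], "spatial": [], "category": [], "category_agent": []}

    # Temporal edges: the same agent at consecutive time steps.
    for agent_id, t in sorted(agent_nodes):
        if (agent_id, t + 1) in agent_nodes:
            edges["temporal"].append(((agent_id, t), (agent_id, t + 1)))

    # Spatial edges: directed, between every ordered pair of agents at one time step.
    agents_by_time = defaultdict(list)
    for agent_id, t in sorted(agent_nodes):
        agents_by_time[t].append(agent_id)
    for t, ids in agents_by_time.items():
        for i, j in permutations(ids, 2):
            edges["spatial"].append(((i, t), (j, t)))

    # Spatial category edges: every category to every category, itself included.
    categories_by_time = defaultdict(list)
    for category, t in sorted(category_nodes):
        categories_by_time[t].append(category)
    for t, cats in categories_by_time.items():
        for c1 in cats:
            for c2 in cats:
                edges["category"].append(((c1, t), (c2, t)))

    # Category-agent edges: a category node to each of its member agents.
    for (category, t), members in sorted(category_nodes.items()):
        for agent_id in members:
            edges["category_agent"].append(((category, t), (agent_id, t)))

    # Temporal category edges are deliberately not built, as noted in the text.
    return agent_nodes, dict(category_nodes), edges


tracks = {
    "ped1": ("pedestrian", {0: (0.0, 0.0), 1: (0.5, 0.0)}),
    "ped2": ("pedestrian", {0: (1.0, 1.0), 1: (1.0, 1.5)}),
    "car1": ("car", {0: (5.0, 0.0), 1: (4.0, 0.0)}),
}
agent_nodes, category_nodes, edges = build_stc_graph(tracks)
```

With three agents observed over two time steps, the sketch yields six agent nodes and four category nodes, which makes it easy to check the edge counts by hand.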

3.2 Hierarchical Graph Attention

The interaction among agents is an essential factor for trajectory prediction. In particular, heterogeneous interaction is more complex than homogeneous interaction because of the diverse object categories [Niu et al., 2017]. In traffic scenarios, traffic agents (pedestrians, drivers, bikers, etc.) tend to react differently according to the categories of the agents they encounter because of differences in social habits and experiences. Hence, the interaction between categories (i.e., category-category interaction) is also an important factor affecting agents’ trajectories.

In order to model the interaction among agents of multiple categories, we propose a Hierarchical Graph Attention module. It models the category-category interaction first, based on which the agent-agent interaction is modeled.

Category-Category Interaction. To build the interaction among categories, we obtain the category features of each category first on our built spatio-temporal-category graph, based on which the category-wise interaction weights are obtained through pooling operation.

Because the number of agents differs across scenarios, we employ a padding operation to align them to the same size. Then, the embedding of each category is obtained by a linear projection, i.e.,

$$h_c^{t} = \phi\left(V_c^{t}, W_h\right) \tag{2}$$

where $\phi(\cdot, \cdot)$ denotes linear projection, $V_c^{t}$ is the category node with category $c$ at time step $t$, $h_c^{t}$ is the embedding of category $c$ at time step $t$, and $W_h$ is the learnable weight of the linear projection.

After acquiring the embeddings of each category, the embeddings of any two categories are concatenated to obtain fused embeddings. Subsequently, the category-category attention scores are generated by the graph attention mechanism [Veličković et al., 2018], as follows:

(3)

where $\boldsymbol{\alpha}_{cc'}^{t}$ is the attention score vector of category $c$ to category $c'$ at time step $t$, $\mathbf{a}_c$ denotes a learnable attention weight vector of category $c$ used to adjust the weights among categories, and $\sigma(\cdot)$ denotes a non-linear activation function.

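The attention score vector $\boldsymbol{\alpha}_{cc'}^{t}$ of category $c$ to category $c'$ at time step $t$ measures the interaction of one category to other categories. The category-category interaction aims to assist agent-agent interaction, and thus we only acquire an importance factor by a pooling operation for each attention score vector. We employ max pooling to choose the largest value in $\boldsymbol{\alpha}_{cc'}^{t}$ as the importance factor $f_{cc'}^{t}$, i.e.,

$$f_{cc'}^{t} = \max\left(\boldsymbol{\alpha}_{cc'}^{t}\right) \tag{4}$$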

After acquiring the importance factor between any two categories, the final category-category interaction is obtained by normalizing all the importance factors:

(5)

The weights of the spatial category edges represent the category-category interaction, and thus we assign the obtained interaction values to these edges.
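The category-category pipeline above (embed, concatenate, score, max-pool, normalize) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the element-wise product between the attention weight vector and the fused embedding, the LeakyReLU activation, and the softmax used for normalization are all assumptions made here, since the text only specifies a learnable weight vector, a non-linear activation, max pooling and a normalization step. All names and numbers are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    """Non-linear activation (assumed; the text does not name the function)."""
    return x if x > 0 else slope * x


def category_attention(embeddings, attn_weights):
    """Directed category-to-category attention.

    embeddings:   {category: list of floats}           (category embeddings)
    attn_weights: {category: list of floats, 2x long}  (one weight vector per category)
    Returns {c: {c2: weight}} where each row sums to 1.
    """
    importance = {}
    for c, h_c in embeddings.items():
        importance[c] = {}
        for c2, h_c2 in embeddings.items():
            fused = h_c + h_c2  # concatenation of the two embeddings
            # Attention score vector (element-wise weighting is an assumption).
            scores = [leaky_relu(a * f) for a, f in zip(attn_weights[c], fused)]
            importance[c][c2] = max(scores)  # max pooling -> importance factor

    attention = {}
    for c, row in importance.items():
        peak = max(row.values())  # subtract the peak for numerical stability
        exps = {c2: math.exp(v - peak) for c2, v in row.items()}
        total = sum(exps.values())
        attention[c] = {c2: e / total for c2, e in exps.items()}  # softmax (assumed)
    return attention


embeddings = {"pedestrian": [0.2, -0.1], "car": [1.0, 0.5]}
attn_weights = {"pedestrian": [1.0, 1.0, 0.5, 0.5], "car": [0.3, 0.3, 2.0, 2.0]}
attention = category_attention(embeddings, attn_weights)
```

Because each category has its own weight vector, the resulting matrix is asymmetric: in this toy example the attention of pedestrians to cars differs from the attention of cars to pedestrians, which matches the directed edges described in the text.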

Agent-Agent Interaction. Some related works [Mohamed et al., 2020] indicate that the relative distance between agents is essential in some special scenarios. Therefore, we obtain the agent-agent interaction by combining a learning-based method and a distance-based method.

The distance-based method initializes each spatial edge with the relative distance between the corresponding agents. Then, the normalized interaction matrix is obtained by the Laplace Transform [Masuda and Rocha, 2017] as follows:

(6)

where $p_i^{t}$ is the location coordinates of agent $i$ at time step $t$, and $D^{t}$ is the diagonal node degree matrix of the distance-based adjacency matrix at time step $t$.
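A distance-based adjacency matrix with degree normalization can be sketched as below. Two assumptions are made and should be read as such: edge weights are taken as the inverse of the Euclidean distance (so closer agents receive larger weights), and the normalization is the symmetric form $D^{-1/2} A D^{-1/2}$ commonly used with graph convolutions. The exact form used in the paper is not recoverable from the text here.

```python
import math


def normalized_distance_adjacency(positions, eps=1e-6):
    """Degree-normalized adjacency built from pairwise distances.

    positions: list of (x, y) coordinates, one per agent, at one time step.
    Assumes inverse-distance weights and symmetric normalization.
    """
    n = len(positions)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                distance = math.dist(positions[i], positions[j])
                adjacency[i][j] = 1.0 / (distance + eps)  # eps avoids division by zero

    # Diagonal degree matrix, stored as a vector of row sums.
    degree = [sum(row) for row in adjacency]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [
        [inv_sqrt[i] * adjacency[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]


positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
normalized = normalized_distance_adjacency(positions)
```

In this example agent 0 is closer to agent 1 than to agent 2, so the normalized weight between agents 0 and 1 is the larger of the two.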

For the learning-based method, we need to fuse the features of all agents. Fortunately, the learned attention score vector shown in Equation 3 already includes the required information, and thus we directly employ it to obtain the agent-agent interaction, i.e.,

(7)

where the operator $\cdot$ denotes the dot-product operation.

| Models | Argoverse ADE | Argoverse FDE | nuScenes ADE | nuScenes FDE | Avg ADE | Avg FDE | Apolloscape WADE | Apolloscape WFDE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S-LSTM [Alahi et al., 2016] | 1.385 | 2.567 | 1.390 | 2.676 | 1.388 | 2.622 | 1.89 | 3.40 |
| DESIRE* [Lee et al., 2017] | 0.896 | 1.453 | 1.079 | 1.844 | 0.988 | 1.649 | - | - |
| R2P2-MA [Rhinehart et al., 2018] | 1.108 | 1.771 | 1.179 | 2.194 | 1.144 | 1.983 | - | - |
| CAM [Park et al., 2020] | 1.131 | 2.504 | 1.124 | 2.318 | 1.128 | 2.411 | - | - |
| MFP [Tang and Salakhutdinov, 2020] | 1.399 | 2.684 | 1.301 | 2.740 | 1.350 | 2.712 | - | - |
| MATFD [Zhao et al., 2019] | 1.344 | 2.484 | 1.261 | 2.538 | 1.303 | 2.511 | - | - |
| MATFG* [Zhao et al., 2019] | 1.261 | 2.313 | 1.053 | 2.126 | 1.157 | 2.220 | - | - |
| STGCNN [Mohamed et al., 2020] | 1.305 | 2.344 | 1.274 | 2.198 | 1.289 | 2.371 | - | - |
| StarNet [Zhu et al., 2019] | - | - | - | - | - | - | 1.343 | 2.498 |
| TPNet [Fang et al., 2020] | - | - | - | - | - | - | 1.281 | 1.910 |
| UNIN (Ours) | 0.792 | 1.256 | 1.049 | 1.521 | 0.921 | 1.388 | 1.094 | 1.545 |

Table 1: Comparison with other methods on the Argoverse, nuScenes and Apolloscape datasets in ADE and FDE metrics (the lower the better). All methods observe 2 seconds and predict the next 3 seconds of trajectories. Note that the Apolloscape dataset uses weighted ADE and FDE metrics, i.e., the weights of vehicles, pedestrians and cyclists are assigned as 0.20, 0.58 and 0.22, respectively. The methods marked by “*” use additional scene context. Our UNIN significantly outperforms the state-of-the-art works.
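The weighted metric mentioned in the caption of Table 1 can be illustrated with a short sketch. The category weights come from the caption; the function name and the per-category error values are made up purely for illustration and are not results from the paper.

```python
# Category weights stated in the caption of Table 1.
APOLLOSCAPE_WEIGHTS = {"vehicle": 0.20, "pedestrian": 0.58, "cyclist": 0.22}


def weighted_error(per_category_error, weights=APOLLOSCAPE_WEIGHTS):
    """Weighted sum of per-category displacement errors (WADE or WFDE)."""
    return sum(weights[category] * per_category_error[category] for category in weights)


# Hypothetical per-category ADE values, for illustration only.
example_ade = {"vehicle": 2.0, "pedestrian": 1.0, "cyclist": 1.5}
example_wade = weighted_error(example_ade)
```

Pedestrians carry the largest weight, so an error reduction on pedestrian trajectories moves the weighted score more than the same reduction on vehicles or cyclists.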
| Models | SDD ADE | SDD FDE |
| --- | --- | --- |
| S-LSTM [Alahi et al., 2016] | 31.2 | 57 |
| MATF [Zhao et al., 2019] | 22.6 | 33.5 |
| DESIRE [Lee et al., 2017] | 19.3 | 34.1 |
| NRI [Kipf et al., 2018] | 25.6 | 40.3 |
| S-GAN [Gupta et al., 2018] | 27.3 | 41.4 |
| SoPhie [Sadeghian et al., 2019] | 16.3 | 29.4 |
| Trajectron++ [Salzmann et al., 2020] | 19.3 | 32.7 |
| STGCNN [Mohamed et al., 2020] | 20.6 | 33.1 |
| SimAug* [Liang et al., 2020a] | 15.7 | 30.2 |
| STGAT [Kosaraju et al., 2019] | 18.8 | 31.3 |
| Ours | 15.9 | 26.3 |

Table 2: Comparison with previous approaches on the SDD benchmark dataset, which mainly contains pedestrian trajectories. The performance is evaluated in ADE/FDE metrics (the lower the better). The approach marked by “*” uses additional simulation data.

3.3 Unlimited Neighborhood Interaction

In a real traffic scenario, interactions differ with the uncertain number of agents involved, i.e., an agent could respond differently as the number of interacting agents varies. However, the existing graph attention mechanism [Wang et al., 2019] only computes the interaction between pairs of agents, because each inner product operates on only two vectors at a time. Hence, the learned agent-agent interaction cannot adaptively capture the interaction among an uncertain number of agents.

To mitigate this, we propose the Unlimited Neighborhood Interaction module to capture the information of all agents involved in the same interaction simultaneously. Note that all agents involved in an interaction are called the “unlimited neighborhood”, regardless of the number of agents. In particular, we employ an asymmetric convolution to obtain and aggregate the global interaction information on the agent-agent interaction, i.e.,

(8)

where $\sigma(\cdot)$ is the non-linear activation function, and we use a padding operation to keep the output size the same as the input size.

The asymmetric convolution is computed repeatedly and thus the global spatial interaction information can be aggregated, which means that all agents involved in an interaction are considered, regardless of the number of the agents.
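The effect of repeating an asymmetric convolution can be sketched on a toy interaction matrix. The sketch assumes that "asymmetric convolution" means the sum of a 1 x k convolution along rows and a k x 1 convolution along columns, with zero padding and a ReLU activation; these choices, the kernel values and all function names are illustrative assumptions rather than the authors' exact operator.

```python
def conv_rows(matrix, kernel):
    """1 x k convolution along each row, zero-padded to keep the size."""
    k = len(kernel)
    pad = k // 2
    n_cols = len(matrix[0])
    output = []
    for row in matrix:
        padded = [0.0] * pad + list(row) + [0.0] * (k - 1 - pad)
        output.append(
            [sum(kernel[m] * padded[j + m] for m in range(k)) for j in range(n_cols)]
        )
    return output


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def asymmetric_conv(matrix, row_kernel, col_kernel):
    """Row-wise plus column-wise convolution followed by ReLU (assumed form)."""
    by_rows = conv_rows(matrix, row_kernel)
    by_cols = transpose(conv_rows(transpose(matrix), col_kernel))
    return [
        [max(0.0, r + c) for r, c in zip(row_r, row_c)]
        for row_r, row_c in zip(by_rows, by_cols)
    ]


def unlimited_neighborhood(matrix, row_kernel, col_kernel, num_layers):
    """Apply the asymmetric convolution repeatedly to widen the receptive field."""
    for _ in range(num_layers):
        matrix = asymmetric_conv(matrix, row_kernel, col_kernel)
    return matrix


# A 5-agent interaction matrix with a single non-zero entry.
grid = [[0.0] * 5 for _ in range(5)]
grid[0][0] = 1.0
kernel = [1.0, 1.0, 1.0]
after_one = unlimited_neighborhood(grid, kernel, kernel, 1)
after_four = unlimited_neighborhood(grid, kernel, kernel, 4)
```

After one layer the single entry has only reached its immediate neighbors along its row and column; after four layers it has propagated to the far end of the row, which illustrates how repeated application lets distant entries of the interaction matrix influence each other.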

The final interaction is obtained through fusing unlimited neighborhood and category-category interaction:

(9)

| Variants | Argoverse | nuScenes | AVG | SDD |
| --- | --- | --- | --- | --- |
| UNIN w/o HGA | 0.834/1.415 | 1.509/1.729 | 1.172/1.572 | 18.0/27.1 |
| UNIN w/o UNI | 1.113/1.689 | 1.616/1.862 | 1.365/1.776 | 21.5/32.2 |
| UNIN w/o GMM | 0.679/1.320 | 1.335/1.991 | 1.007/1.656 | 16.4/25.8 |
| UNIN (Ours) | 0.792/1.256 | 1.049/1.521 | 0.921/1.388 | 15.9/26.3 |

Table 3: Ablation study of each component, reported as ADE/FDE. UNIN (Ours) combines all components.

| UNIConv Size | 1 | 2 | 3 | 5 | 10 |
| --- | --- | --- | --- | --- | --- |
| ADE | 1.179 | 0.921 | 0.998 | 1.247 | 2.691 |
| FDE | 1.632 | 1.388 | 1.323 | 1.766 | 3.515 |

Table 4: Ablation study of kernel size for Unlimited Neighborhood convolution.

3.4 Trajectory Prediction

After obtaining the final interaction, we regard it as the adjacency matrix of the spatio-temporal-category graph and feed it into a GCN, which is followed by a TCN to estimate the parameters of the Gaussian Mixture Model. A residual connection is used in the GCN, i.e.,

(10)

where $\sigma(\cdot)$ is a non-linear activation function, $l$ is the index of the GCN layers, $V^{(l)}$ represents the nodes of the graph at layer $l$, and $F$ is the output features of the TCN. Thus we acquire the collective interaction information from both the spatial and the temporal information.
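A graph convolution layer with a residual connection can be sketched as follows. The sketch assumes the common form "activation of (adjacency x features x weight) plus the input features" with a ReLU activation and a square weight matrix so that the residual sum is well defined; the function names and toy values are hypothetical, and the exact layer used in the paper may differ.

```python
def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer_residual(adjacency, features, weight):
    """One GCN layer: ReLU(A @ H @ W) + H (assumed residual form)."""
    aggregated = matmul(matmul(adjacency, features), weight)
    return [
        [max(0.0, value) + skip for value, skip in zip(agg_row, feat_row)]
        for agg_row, feat_row in zip(aggregated, features)
    ]


# Directed (asymmetric) interaction matrix for two agents.
adjacency = [[0.7, 0.3], [0.0, 1.0]]
features = [[1.0, 2.0], [3.0, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the arithmetic easy to follow
updated = gcn_layer_residual(adjacency, features, weight)
```

The asymmetric adjacency means agent 0 aggregates from agent 1 but not the other way around, mirroring the directed edges of the constructed graph.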

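Loss Function. Since the traffic agents of different categories have their own unique movement patterns, e.g., a certain velocity range and front and rear distances to other objects, we assume that the trajectory coordinates of traffic agents follow a Gaussian Mixture Model [Dong and Zhou, 2017]. Hence, our model is trained by minimizing the negative log-likelihood loss as follows:

$$L = -\sum_{t=T_{obs}+1}^{T_{obs}+T_{pred}} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(p_i^{t} \mid \mu_k^{t}, \sigma_k^{t}, \rho_k^{t}\right) \right) \tag{11}$$

where $p_i^{t}$ is the ground-truth location of agent $i$ at time step $t$, $T_{obs}$ and $T_{pred}$ are the observed and predicted horizons, $\mu_k^{t}$ is the mean, $\sigma_k^{t}$ is the standard deviation, $\rho_k^{t}$ is the correlation coefficient, and $\pi_k$ is the weight factor of the $k$-th of the $K$ Gaussian distributions.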
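The negative log-likelihood of a trajectory under a mixture of bivariate Gaussians can be sketched as below. The density uses the standard bivariate normal form with a correlation coefficient; the data layout, the function names and the small `eps` added before the logarithm are illustrative assumptions, not details from the paper.

```python
import math


def bivariate_gaussian_pdf(point, mean, std, rho):
    """Density of a 2D Gaussian with per-axis std and correlation rho."""
    dx = (point[0] - mean[0]) / std[0]
    dy = (point[1] - mean[1]) / std[1]
    one_minus_rho2 = 1.0 - rho * rho
    z = dx * dx + dy * dy - 2.0 * rho * dx * dy
    normalizer = 2.0 * math.pi * std[0] * std[1] * math.sqrt(one_minus_rho2)
    return math.exp(-z / (2.0 * one_minus_rho2)) / normalizer


def gmm_nll(targets, mixtures, eps=1e-12):
    """Negative log-likelihood of one trajectory.

    targets:  list of (x, y), one per predicted time step.
    mixtures: per time step, a list of (weight, mean, std, rho) components.
    """
    loss = 0.0
    for point, components in zip(targets, mixtures):
        likelihood = sum(
            weight * bivariate_gaussian_pdf(point, mean, std, rho)
            for weight, mean, std, rho in components
        )
        loss -= math.log(likelihood + eps)  # eps guards against log(0)
    return loss


target = [(0.0, 0.0)]
close_mixture = [[(1.0, (0.0, 0.0), (1.0, 1.0), 0.0)]]
far_mixture = [[(1.0, (3.0, 3.0), (1.0, 1.0), 0.0)]]
loss_close = gmm_nll(target, close_mixture)
loss_far = gmm_nll(target, far_mixture)
```

For a single standard Gaussian centered on the target, the loss reduces to the logarithm of 2π; moving the mean away from the target increases the loss, which is the behavior the training objective relies on.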

4 Experiments

Datasets. Some datasets focus on homogeneous trajectories and contain few traffic scenes, e.g., ETH [Pellegrini et al., 2009] and UCY [Lerner et al., 2007], which only label pedestrian trajectories within three scenes. However, there are often diverse categories in real scenarios, and thus we train and evaluate our model on more complex datasets, including the Stanford Drone Dataset (SDD) [Robicquet et al., 2016], nuScenes [Caesar et al., 2020], Argoverse [Chang et al., 2019] and Apolloscape [Ma et al., 2019], which are widely used in heterogeneous trajectory prediction with diverse categories and rich traffic scenes. The SDD consists of eight unique scenes on a university campus, more than 100 static scenes, traffic agents of multiple categories, and a large number of interactions. nuScenes, Argoverse and Apolloscape are large-scale trajectory datasets for urban streets with dense traffic in highly complicated situations. Besides, their trajectories are collected through an in-vehicle camera, so they cover more varied scenarios.

We follow existing works, observing 3.2 seconds of trajectories and predicting the next 4.8 seconds in the Stanford Drone Dataset, and observing 2 seconds and predicting the next 3 seconds in the nuScenes, Argoverse and Apolloscape datasets. Previous homogeneous methods [Kosaraju et al., 2019, Yu et al., 2020] train and evaluate networks on each individual scene separately, resulting in poor generalization across scenes. In contrast, we train and evaluate our model on all scenarios together for each dataset.

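Evaluation Metrics. We follow existing works [Yu et al., 2020] and employ two common metrics to evaluate the performance: Average Displacement Error (ADE) and Final Displacement Error (FDE), which are defined as follows:

$$\mathrm{ADE} = \frac{\sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{obs}+T_{pred}} \left\| \hat{p}_i^{t} - p_i^{t} \right\|_2}{N \cdot T_{pred}}, \qquad \mathrm{FDE} = \frac{\sum_{i=1}^{N} \left\| \hat{p}_i^{T_{obs}+T_{pred}} - p_i^{T_{obs}+T_{pred}} \right\|_2}{N} \tag{12}$$

where $\hat{p}_i^{t}$ and $p_i^{t}$ are the predicted and ground-truth positions of agent $i$ at time step $t$, $N$ is the number of agents, and $T_{obs}$ and $T_{pred}$ are the observed and predicted horizons. ADE measures the average L2 distance between the ground truth and our predicted future positions over all time steps, while FDE measures the L2 distance between our predicted final destination and the true final destination.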
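The two metrics can be computed with a few lines of Python. This is a minimal sketch that follows the verbal definitions above; the function name and the nested-list data layout are choices made here for illustration.

```python
import math


def ade_fde(predicted, ground_truth):
    """Average and final displacement errors.

    predicted, ground_truth: one list per agent, each a list of (x, y)
    positions over the predicted time steps.
    """
    total_error = 0.0
    final_error = 0.0
    num_steps = 0
    for pred_traj, true_traj in zip(predicted, ground_truth):
        errors = [math.dist(p, g) for p, g in zip(pred_traj, true_traj)]
        total_error += sum(errors)   # accumulates over all agents and time steps
        num_steps += len(errors)
        final_error += errors[-1]    # error at the last predicted time step
    return total_error / num_steps, final_error / len(predicted)


ground_truth = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]
predicted = [[(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]]
ade, fde = ade_fde(predicted, ground_truth)
```

In this toy case the per-step errors are 0, 1 and 2, so the ADE is their mean and the FDE is the last one; a prediction that drifts over time is therefore penalized more heavily by FDE than by ADE.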

4.1 Implementation Details

In the Hierarchical Attention Module, the embedding dimension of one category is set to 8 and the output size after padding is equal to the largest number of nodes in the scenario. In the Unlimited Neighborhood Module, the kernel size of the convolution (UNIConv) is fixed at 3 (see the supplementary material for more experiments with several different convolution filters to aggregate the information simultaneously). We train our model with SGD; the learning rate is set to 0.005 and decays by a constant factor after a fixed number of epochs. The weight factor of the GMM loss is acquired from the Hierarchical Attention Module and the approximate ratio of the categories in the scenes. We train our model on an RTX 2080 Ti GPU for up to 50 epochs, and we use a dataset split of 60%, 20% and 20% for training, validation and testing, respectively. The complete code will be published upon acceptance.

Figure 4: Visualization of predicted trajectory distribution. Each line represents the ground-truth trajectory of an agent. The colored dots represent our predicted trajectory distribution, and different colors represent different densities of our predicted distribution, where yellow represents the most likely trajectory distribution. (a) shows the overall trajectories in the whole scene at all of the time instants. (b) shows that we successfully predict a turning agent. (c) shows that we successfully predict two agents going in parallel to the same direction. (d) shows that we successfully predict two agents separated by another interacting and avoiding each other. (e) shows that we successfully predict the possible trajectory of a group of agents after collective interaction. All results are randomly sampled from the nuScenes dataset.

4.2 Quantitative Evaluation

Table 1 and Table 2 show the comparison of our method against state-of-the-art approaches, including Social LSTM [Alahi et al., 2016], Social GAN [Gupta et al., 2018], STGAT [Kosaraju et al., 2019], Social STGCNN [Mohamed et al., 2020], Trajectron++ [Salzmann et al., 2020], NRI [Kipf et al., 2018], SoPhie [Sadeghian et al., 2019], MATF [Zhao et al., 2019], DESIRE [Lee et al., 2017], SimAug [Liang et al., 2020a], R2P2-MA [Rhinehart et al., 2018], CAM [Park et al., 2020], MFP [Tang and Salakhutdinov, 2020], StarNet [Zhu et al., 2019] and TPNet [Fang et al., 2020]. Overall, our method significantly outperforms the compared methods across the datasets according to the tables. In particular, our UNIN surpasses DESIRE (the second best) in both average ADE (0.921 vs. 0.988) and average FDE (1.388 vs. 1.649) over Argoverse and nuScenes, and it obtains the lowest WADE and WFDE on Apolloscape. Meanwhile, on the SDD dataset our method lowers the FDE from 29.4 (SoPhie, the best competing method in FDE) to 26.3. The underlying reason is that our method can model the collective interaction among the agents involved in the same interaction simultaneously. Meanwhile, the Hierarchical Attention enhances the agent-agent interaction based on category-category interaction.

nuScenes, Argoverse and Apolloscape. Our UNIN outperforms all the competing methods on the three datasets. nuScenes, Argoverse and Apolloscape are multi-category mixed datasets with a majority of vehicles. Compared with RNN-based methods such as S-LSTM [Alahi et al., 2016], our method reduces the average ADE/FDE over Argoverse and nuScenes from 1.388/2.622 to 0.921/1.388. We speculate that this is because S-LSTM employs a pooling mechanism to aggregate local agents’ states and does not take long-range interaction into account. In addition, our method also outperforms graph-based methods, e.g., S-STGCNN [Mohamed et al., 2020], whose average ADE/FDE is 1.289/2.371. We speculate that S-STGCNN takes long-range interaction into account but only models interactions between pairs of agents. Interestingly, our method also outperforms the methods that employ scene context, such as DESIRE [Lee et al., 2017] and MATFG [Zhao et al., 2019]. Both of them employ an LSTM to model each agent and fuse the interaction within a local area, while our model considers the unlimited neighborhood, which is not limited by the number of agents or the range of interaction. Thus, our method can capture more global and local detail information to improve the accuracy of future trajectories.

Stanford Drone Dataset. The Stanford Drone Dataset (SDD) is a multi-category mixed dataset that includes pedestrians, bicyclists, skateboarders, carts, cars and buses, with pedestrians forming the majority. Our method outperforms methods that model interaction in a local area, such as S-LSTM [Alahi et al., 2016] and S-GAN [Gupta et al., 2018], on average in both ADE and FDE. We speculate that this is because they employ a pooling mechanism to aggregate the interaction states of local agents, whereas our method employs an unlimited interaction that can capture flexible interactions. In addition, our method is better on average than graph-based methods such as S-STGCNN [Mohamed et al., 2020]. Our method is slightly outperformed by SimAug [Liang et al., 2020a] in the ADE metric, possibly because SimAug uses extra 3D simulation data for training, which leads to more robust representations. We also evaluate the data efficiency and generalization ability of our model; please refer to the supplemental material for details.
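The comparisons above are stated in terms of ADE (average displacement error) and FDE (final displacement error). As a minimal sketch, assuming the standard definitions of the two metrics for a single predicted trajectory (the evaluation protocol actually used for the tables, such as the number of samples drawn per agent, is not specified in this section), they can be computed as follows. The function names and the toy trajectory are hypothetical and only serve as illustration.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one trajectory.

    pred and truth are equal-length lists of (x, y) positions,
    one entry per predicted time step.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between prediction and ground truth at each step.
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # mean error over all time steps
    fde = dists[-1]                # error at the final time step only
    return ade, fde


def relative_improvement(baseline, ours):
    """Fractional error reduction of `ours` over `baseline` (lower error is better)."""
    return (baseline - ours) / baseline


# Toy example with made-up coordinates (not data from the paper).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)  # 1.0 2.0
```

Because FDE looks only at the last step, a method can improve FDE without improving ADE by the same amount, which is why the two metrics are reported separately.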

4.3 Qualitative Evaluation

We further study the ability of our method to model interactions among large numbers of traffic agents of multiple categories. As discussed previously, real traffic scenarios often contain interactions that involve many agents at uncertain distances from each other, and agents often adopt different strategies when interacting with different categories of traffic participants. We illustrate some qualitative results in Figure 4. Overall, our predicted trajectory distributions are in line with the ground-truth trajectories. Result (a) shows long trajectories from the first time instant to the last, which demonstrates the prediction accuracy achieved by our method. Result (b) shows a single traffic agent that is turning. As expected, our model captures the agent's tendency to turn. Result (c) shows that our method successfully predicts the trajectories of two agents moving in parallel in the same direction, which suggests that our method is not over-fitting. In (d), two non-adjacent agents interact with each other rather than with the agent closest to them. Our method leverages the UNI to capture this long-range interaction, successfully predicting both the interaction between the relatively distant agents and their subsequent trajectories. Result (e) shows a collective interaction involving a group of agents that belong to different categories. Our method successfully predicts their possible trajectories under this complex interaction. Our predicted trajectory distributions also show that agents of different categories react differently when interacting with a specific agent, which demonstrates the effectiveness of our HGA. We also visualize the relation between category attention and agent attention in the supplemental material.

4.4 Ablation Study

We study the contribution of each component of our model, as shown in Table 3. In addition, we vary the kernel size of the Unlimited Neighborhood Interaction to find the empirically optimal value, as shown in Table 4.

Contribution of Each Component. As illustrated in Table 3, we evaluate three variants of our method: (1) UNIN w/o HGA, in which the category-to-category attention is removed and only the agent-to-agent interaction is kept; (2) UNIN w/o UNI, in which the unlimited neighborhood interaction is removed; and (3) UNIN w/o GMM, in which the Gaussian Mixture Model is replaced by a bivariate Gaussian distribution. According to the results, removing any component leads to a large performance drop. In particular, on the Argoverse and nuScenes datasets, UNIN w/o HGA shows a performance reduction in both ADE and FDE, reflecting the effectiveness of the hierarchical attention. UNIN w/o UNI shows a performance degradation in both ADE and FDE, which validates the contribution of the unlimited neighborhood interaction. UNIN w/o GMM also shows a performance degradation in both ADE and FDE, which indicates that the GMM is more suitable for heterogeneous trajectory prediction.

Moreover, on the SDD dataset, UNIN w/o HGA shows a performance reduction in both ADE and FDE, which confirms that the hierarchical attention is beneficial to heterogeneous interaction. UNIN w/o UNI shows a performance degradation in both ADE and FDE, demonstrating the effectiveness of our unlimited neighborhood interaction. UNIN w/o GMM likewise shows a performance degradation in both ADE and FDE, which shows that the GMM serves its intended purpose.
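To illustrate why replacing the Gaussian Mixture Model with a single bivariate Gaussian can hurt, the sketch below compares the log-likelihood that each assigns to a future position when the future is bimodal (for example, an agent may turn left or right). This is only an illustration under assumed parameters: the parameterization (means, standard deviations, correlation, mixture weights), the function names and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def bivariate_gaussian_logpdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of a bivariate Gaussian with correlation rho at (x, y)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y) + 0.5 * math.log(one_minus_rho2)
    return -0.5 * quad - log_norm


def gmm_logpdf(x, y, components):
    """Log-density of a 2D Gaussian mixture at (x, y).

    components is a list of (weight, mu_x, mu_y, sigma_x, sigma_y, rho)
    tuples whose weights sum to 1.
    """
    terms = [math.log(w) + bivariate_gaussian_logpdf(x, y, *params)
             for w, *params in components]
    # Log-sum-exp for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Hypothetical bimodal future: two equally likely modes at x = -2 and x = +2.
two_modes = [
    (0.5, -2.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 2.0, 0.0, 0.5, 0.5, 0.0),
]
# A single Gaussian covering both modes must be centred between them,
# with a much larger spread along x (moment-matched variance 4 + 0.25).
single_mode = (0.0, 0.0, math.sqrt(4.25), 0.5, 0.0)

point = (2.0, 0.0)  # ground truth falls on one of the modes
print(gmm_logpdf(*point, two_modes))
print(bivariate_gaussian_logpdf(*point, *single_mode))
```

In this toy setting the mixture assigns a higher log-likelihood to the observed position than the single Gaussian, because the single Gaussian spreads its mass over the empty region between the two modes.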

Optimal Kernel Size. Table 4 reports the optimal kernel size of the Unlimited Neighborhood Interaction convolution separately for the ADE metric and for the FDE metric. According to the table, a larger kernel size does not help: convolutions with kernel sizes 2 and 3 are the best-performing settings for capturing the relations among a group of agents.

5 Conclusion

To capture interaction information among a varying number of agents at uncertain distances, we present an Unlimited Neighborhood Interaction Network that predicts the trajectories of agents in multiple categories. An Unlimited Neighborhood Interaction module generates the interaction among all of the agents involved in an interaction simultaneously. A Hierarchical Graph Attention module is designed to acquire the category-to-category interaction and the agent-to-agent interaction, where the former is used to enhance the representation of the latter. Extensive quantitative evaluations show that our method achieves state-of-the-art performance, even outperforming methods that leverage additional scene context. Qualitative evaluations illustrate the advantage of our method when predicting heterogeneous trajectories in dense and complex traffic scenarios.

References

  • [Al-Molegi et al., 2016] Al-Molegi, A., Jabreel, M., and Ghaleb, B. (2016). Stf-rnn: Space time features-based recurrent neural network for predicting people next location. In SSCI, pages 1–7.
  • [Al-Molegi et al., 2018] Al-Molegi, A., Jabreel, M., and Martínez-Ballesté, A. (2018). Move, attend and predict: An attention-based neural model for people’s movement prediction. PRL, 112:34–40.
  • [Alahi et al., 2016] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., and Savarese, S. (2016). Social lstm: Human trajectory prediction in crowded spaces. In CVPR, pages 961–971.
  • [Amirian et al., 2019] Amirian, J., Hayet, J.-B., and Pettré, J. (2019). Social ways: Learning multi-modal distributions of pedestrian trajectories with gans. In CVPRW, pages 0–0.
  • [Bai et al., 2018] Bai, S., Kolter, J. Z., and Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
  • [Bartoli et al., 2018] Bartoli, F., Lisanti, G., Ballan, L., and Del Bimbo, A. (2018). Context-aware trajectory prediction. In ICPR, pages 1941–1946.
  • [Bhattacharyya et al., 2018] Bhattacharyya, A., Fritz, M., and Schiele, B. (2018). Long-term on-board prediction of people in traffic scenes under uncertainty. In CVPR, pages 4194–4202.
  • [Bi et al., 2019] Bi, H., Fang, Z., Mao, T., Wang, Z., and Deng, Z. (2019). Joint prediction for kinematic trajectories in vehicle-pedestrian-mixed scenes. In ICCV, pages 10383–10392.
  • [Bisagno et al., 2018] Bisagno, N., Zhang, B., and Conci, N. (2018). Group lstm: Group trajectory prediction in crowded scenarios. In ECCVW, pages 0–0.
  • [Caesar et al., 2020] Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631.
  • [Casas et al., 2020] Casas, S., Gulino, C., Suo, S., Luo, K., Liao, R., and Urtasun, R. (2020). Implicit latent variable model for scene-consistent motion forecasting. In ECCV.
  • [Chandra et al., 2019a] Chandra, R., Bhattacharya, U., Bera, A., and Manocha, D. (2019a). Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In CVPR, pages 8483–8492.
  • [Chandra et al., 2019b] Chandra, R., Bhattacharya, U., Roncal, C., Bera, A., and Manocha, D. (2019b). Robusttp: End-to-end trajectory prediction for heterogeneous road-agents in dense traffic with noisy sensor inputs. In CSCS, pages 1–9.
  • [Chandra et al., 2020] Chandra, R., Guan, T., Panuganti, S., Mittal, T., Bhattacharya, U., Bera, A., and Manocha, D. (2020). Forecasting trajectory and behavior of road-agents using spectral clustering in graph-lstms. RAL, 5(3):4882–4890.
  • [Chang et al., 2019] Chang, M.-F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al. (2019). Argoverse: 3d tracking and forecasting with rich maps. In CVPR, pages 8748–8757.
  • [Cho et al., 2019] Cho, K., Ha, T., Lee, G., and Oh, S. (2019). Deep predictive autonomous driving using multi-agent joint trajectory prediction and traffic rules. In IROS, pages 2076–2081.
  • [Choi and Dariush, 2019a] Choi, C. and Dariush, B. (2019a). Learning to infer relations for future trajectory forecast. In CVPRW, pages 0–0.
  • [Choi and Dariush, 2019b] Choi, C. and Dariush, B. (2019b). Looking to relations for future trajectory forecast. In ICCV, pages 921–930.
  • [Desrosiers and Karypis, 2011] Desrosiers, C. and Karypis, G. (2011). A comprehensive survey of neighborhood-based recommendation methods. RSH, pages 107–144.
  • [Dong and Zhou, 2017] Dong, W. and Zhou, M. (2017). Gaussian classifier-based evolutionary strategy for multimodal optimization. NNLS, 25(6):1200–1216.
  • [Fang et al., 2020] Fang, L., Jiang, Q., Shi, J., and Zhou, B. (2020). Tpnet: Trajectory proposal network for motion prediction. In CVPR, pages 6797–6806.
  • [Fernando et al., 2018a] Fernando, T., Denman, S., Sridharan, S., and Fookes, C. (2018a). Gd-gan: Generative adversarial networks for trajectory prediction and group detection in crowds. In ACCV, pages 314–330.
  • [Fernando et al., 2018b] Fernando, T., Denman, S., Sridharan, S., and Fookes, C. (2018b). Soft+ hardwired attention: An lstm framework for human trajectory prediction and abnormal event detection. Neural networks, 108:466–478.
  • [Fernando et al., 2018c] Fernando, T., Denman, S., Sridharan, S., and Fookes, C. (2018c). Tracking by prediction: A deep generative model for mutli-person localisation and tracking. In WACV, pages 1122–1132.
  • [Gupta et al., 2018] Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., and Alahi, A. (2018). Social gan: Socially acceptable trajectories with generative adversarial networks. In CVPR, pages 2255–2264.
  • [Haddad and Lam, 2020] Haddad, S. and Lam, S. (2020). Self-growing spatial graph networks for pedestrian trajectory prediction. In WACV, pages 1140–1148.
  • [Hamilton et al., 2017] Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large graphs. In NeurIPS, pages 1024–1034.
  • [Hasan et al., 2018] Hasan, I., Setti, F., Tsesmelis, T., Del Bue, A., Cristani, M., and Galasso, F. (2018). “Seeing is believing”: Pedestrian trajectory forecasting using visual frustum of attention. In WACV, pages 1178–1185.
  • [Helbing and Molnar, 1995] Helbing, D. and Molnar, P. (1995). Social force model for pedestrian dynamics. Physical review E, 51(5):4282.
  • [Huang et al., 2019] Huang, Y., Bi, H., Li, Z., Mao, T., and Wang, Z. (2019). Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In ICCV, pages 6272–6281.
  • [Hudnell et al., 2019] Hudnell, M., Price, T., and Frahm, J.-M. (2019). Robust aleatoric modeling for future vehicle localization. In CVPRW, pages 0–0.
  • [Huynh and Alaghband, 2019] Huynh, M. and Alaghband, G. (2019). Trajectory prediction by coupling scene-lstm with human movement lstm. In ISVC, pages 244–259.
  • [Kim et al., 2019] Kim, D., Liu, M., Lee, S., and Kamat, V. R. (2019). Trajectory prediction of mobile construction resources toward pro-active struck-by hazard detection. In IAARC.
  • [Kipf et al., 2018] Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. (2018). Neural relational inference for interacting systems. In ICML, pages 2688–2697. PMLR.
  • [Kipf and Welling, 2016] Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. In ICLR.
  • [Kosaraju et al., 2019] Kosaraju, V., Sadeghian, A., Martín-Martín, R., Reid, I., Rezatofighi, H., and Savarese, S. (2019). Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In NeurIPS, pages 137–146.
  • [Kothari and Alahi, 2019] Kothari, A. A. P. and Alahi, A. (2019). Human trajectory prediction using adversarial loss. In Proc. 19th Swiss Transp. Res. Conf.
  • [Lee et al., 2017] Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H., and Chandraker, M. (2017). Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, pages 336–345.
  • [Lerner et al., 2007] Lerner, A., Chrysanthou, Y., and Lischinski, D. (2007). Crowds by example. In CGF, pages 655–664.
  • [Li et al., 2019] Li, J., Ma, H., and Tomizuka, M. (2019). Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning. In ICRA, pages 6658–6664.
  • [Li et al., 2020] Li, J., Yang, F., Tomizuka, M., and Choi, C. (2020). Evolvegraph: Multi-agent trajectory prediction with dynamic relational reasoning. In NeurIPS.
  • [Li et al., 2018] Li, Q., Han, Z., and Wu, X.-M. (2018). Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, volume 32.
  • [Li, 2019] Li, Y. (2019). Which way are you going? imitative decision learning for path forecasting in dynamic scenes. In CVPR, pages 294–303.
  • [Li et al., 2016] Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. (2016). Gated graph sequence neural networks. In ICLR.
  • [Liang et al., 2020a] Liang, J., Jiang, L., and Hauptmann, A. (2020a). Simaug: Learning robust representations from simulation for trajectory prediction. In ECCV, pages 275–292.
  • [Liang et al., 2020b] Liang, J., Jiang, L., Murphy, K., Yu, T., and Hauptmann, A. (2020b). The garden of forking paths: Towards multi-future trajectory prediction. In CVPR, pages 10508–10518.
  • [Liang et al., 2019] Liang, J., Jiang, L., Niebles, J. C., Hauptmann, A. G., and Fei-Fei, L. (2019). Peeking into the future: Predicting future person activities and locations in videos. In CVPR, pages 5725–5734.
  • [Liang et al., 2020c] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., and Urtasun, R. (2020c). Learning lane graph representations for motion forecasting. In ECCV, pages 541–556.
  • [Liao et al., 2018] Liao, B., Zhang, J., Wu, C., McIlwraith, D., Chen, T., Yang, S., Guo, Y., and Wu, F. (2018). Deep sequence learning with auxiliary information for traffic prediction. In SIGKDD, pages 537–546.
  • [Lisotto et al., 2019] Lisotto, M., Coscia, P., and Ballan, L. (2019). Social and scene-aware trajectory prediction in crowded spaces. In ICCVW, pages 0–0.
  • [Luber et al., 2010] Luber, M., Stork, J. A., Tipaldi, G. D., and Arras, K. O. (2010). People tracking with human motion predictions from social forces. In ICRA, pages 464–469.
  • [Ma et al., 2017] Ma, W.-C., Huang, D.-A., Lee, N., and Kitani, K. M. (2017). Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, pages 774–782.
  • [Ma et al., 2019] Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., and Manocha, D. (2019). Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. In AAAI, volume 33, pages 6120–6127.
  • [Makansi et al., 2019] Makansi, O., Ilg, E., Cicek, O., and Brox, T. (2019). Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In CVPR, pages 7144–7153.
  • [Mangalam et al., 2020a] Mangalam, K., Adeli, E., Lee, K.-H., Gaidon, A., and Niebles, J. C. (2020a). Disentangling human dynamics for pedestrian locomotion forecasting with noisy supervision. In CVPR, pages 2784–2793.
  • [Mangalam et al., 2020b] Mangalam, K., Girase, H., Agarwal, S., Lee, K.-H., Adeli, E., Malik, J., and Gaidon, A. (2020b). It is not the journey but the destination: Endpoint conditioned trajectory prediction. In ECCV, pages 759–776.
  • [Manh and Alaghband, 2018] Manh, H. and Alaghband, G. (2018). Scene-lstm: A model for human trajectory prediction. In ISVC.
  • [Masuda and Rocha, 2017] Masuda, N. and Rocha, L. E. C. (2017). A gillespie algorithm for non-markovian stochastic processes: Laplace transform approach. Siam Review, 60(1):95–115.
  • [Minoura et al., 2019] Minoura, H., Hirakawa, T., Yamashita, T., and Fujiyoshi, H. (2019). Path predictions using object attributes and semantic environment. In VISAPP, pages 19–26.
  • [Mohajerin and Rohani, 2019] Mohajerin, N. and Rohani, M. (2019). Multi-step prediction of occupancy grid maps with recurrent neural networks. In CVPR, pages 10600–10608.
  • [Mohamed et al., 2020] Mohamed, A., Qian, K., Elhoseiny, M., and Claudel, C. (2020). Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In CVPR, pages 14424–14432.
  • [Niu et al., 2017] Niu, Z., Zhou, M., Wang, L., Gao, X., and Hua, G. (2017). Hierarchical multimodal lstm for dense visual-semantic embedding. In ICCV, pages 1899–1907.
  • [Park et al., 2020] Park, S. H., Lee, G., Seo, J., Bhat, M., Kang, M., Francis, J., Jadhav, A., Liang, P. P., and Morency, L.-P. (2020). Diverse and admissible trajectory forecasting through multimodal context understanding. In ECCV, pages 282–298.
  • [Pellegrini et al., 2009] Pellegrini, S., Ess, A., Schindler, K., and Van Gool, L. (2009). You’ll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, pages 261–268.
  • [Rahimi and Baldwin, 2018] Rahimi, A., Cohn, T., and Baldwin, T. (2018). Semi-supervised user geolocation via graph convolutional networks. In ACL.
  • [Rasmussen et al., 1999] Rasmussen, C. E. et al. (1999). The infinite gaussian mixture model. In NIPS, volume 12, pages 554–560.
  • [Rasouli et al., 2019] Rasouli, A., Kotseruba, I., Kunic, T., and Tsotsos, J. K. (2019). Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In ICCV, pages 6262–6271.
  • [Reynolds, 2009] Reynolds, D. A. (2009). Gaussian mixture models. Encyclopedia of biometrics, 741:659–663.
  • [Rhinehart et al., 2018] Rhinehart, N., Kitani, K. M., and Vernaza, P. (2018). R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, pages 772–788.
  • [Ridel et al., 2020] Ridel, D., Deo, N., Wolf, D., and Trivedi, M. (2020). Scene compliant trajectory forecast with agent-centric spatio-temporal grids. RAL, 5(2):2816–2823.
  • [Robicquet et al., 2016] Robicquet, A., Sadeghian, A., Alahi, A., and Savarese, S. (2016). Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, pages 549–565.
  • [Rong and Bhanu, 2005] Rong, W. and Bhanu, B. (2005). Learning models for predicting recognition performance. In ICCV.
  • [Sadeghian et al., 2019] Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, H., and Savarese, S. (2019). Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In CVPR, pages 1349–1358.
  • [Salzmann et al., 2020] Salzmann, T., Ivanovic, B., Chakravarty, P., and Pavone, M. (2020). Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. arXiv preprint arXiv:2001.03093.
  • [Scarselli et al., 2008] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2008). The graph neural network model. neural networks, 20(1):61–80.
  • [Schöller et al., 2019] Schöller, C., Aravantinos, V., Lay, F., and Knoll, A. (2019). The simpler the better: Constant velocity for pedestrian motion prediction.
  • [Shi et al., 2019] Shi, X., Shao, X., Guo, Z., Wu, G., Zhang, H., and Shibasaki, R. (2019). Pedestrian trajectory prediction in extremely crowded scenarios. Sensors, 19(5):1223.
  • [Srikanth et al., 2019] Srikanth, S., Ansari, J. A., Ram, R. K., Sharma, S., Murthy, J. K., and Krishna, K. M. (2019). Infer: Intermediate representations for future prediction. In IROS, pages 942–949.
  • [Sun et al., 2020] Sun, J., Jiang, Q., and Lu, C. (2020). Recursive social behavior graph for trajectory prediction. In CVPR, pages 660–669.
  • [Surana and Srivastava, 2014] Surana, A. and Srivastava, K. (2014). Bayesian nonparametric inverse reinforcement learning for switched markov decision processes. In ICMLA, pages 47–54.
  • [Tang and Salakhutdinov, 2020] Tang, Y. C. and Salakhutdinov, R. (2020). Multiple futures prediction. In NeurIPS.
  • [Tao et al., 2020] Tao, C., Jiang, Q., Duan, L., and Luo, P. (2020). Dynamic and static context-aware lstm for multi-agent motion prediction. In ECCV, pages 547–563.
  • [Tay and Laugier, 2008] Tay, M. K. C. and Laugier, C. (2008). Modelling smooth paths using gaussian processes. In FSR, pages 381–390.
  • [Thiede and Brahma, 2019] Thiede, L. A. and Brahma, P. P. (2019). Analyzing the variety loss in the context of probabilistic trajectory prediction. In CVPR, pages 9954–9963.
  • [Treuille et al., 2006] Treuille, A., Cooper, S., and Popović, Z. (2006). Continuum crowds. TOG, 25(3):1160–1168.
  • [Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In NIPS.
  • [Veličković et al., 2018] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In ICLR.
  • [Velickovic et al., 2018] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2018). Graph attention networks. stat, 1050:4.
  • [Wang et al., 2007] Wang, J. M., Fleet, D. J., and Hertzmann, A. (2007). Gaussian process dynamical models for human motion. PAMI, 30(2):283–298.
  • [Wang et al., 2019] Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., and Yu, P. S. (2019). Heterogeneous graph attention network. In WWW, pages 2022–2032.
  • [Wang et al., 2008] Wang, X., Ma, X., and Grimson, W. E. L. (2008). Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. PAMI, 31(3):539–555.
  • [Xie et al., 2017] Xie, G., Gao, H., Qian, L., Huang, B., Li, K., and Wang, J. (2017). Vehicle trajectory prediction by integrating physics-and maneuver-based approaches using interactive multiple models. Industrial Electronics, 65(7):5999–6008.
  • [Xu et al., 2018] Xu, Y., Piao, Z., and Gao, S. (2018). Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In CVPR, pages 5275–5284.
  • [Xue et al., 2017] Xue, H., Huynh, D. Q., and Reynolds, M. (2017). Bi-prediction: pedestrian trajectory prediction based on bidirectional lstm classification. In DICTA, pages 1–8.
  • [Xue et al., 2019] Xue, H., Huynh, D. Q., and Reynolds, M. (2019). Pedestrian trajectory prediction using a social pyramid. In ICAC, pages 439–453.
  • [Yagi et al., 2018] Yagi, T., Mangalam, K., Yonetani, R., and Sato, Y. (2018). Future person localization in first-person videos. In CVPR, pages 7593–7602.
  • [Yan et al., 2018] Yan, S., Xiong, Y., and Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, pages 7444–7452.
  • [Yu et al., 2020] Yu, C., Ma, X., Ren, J., Zhao, H., and Yi, S. (2020). Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In ECCV, pages 507–523.
  • [Zhang et al., 2019a] Zhang, P., Ouyang, W., Zhang, P., Xue, J., and Zheng, N. (2019a). Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In CVPR, pages 12085–12094.
  • [Zhang et al., 2019b] Zhang, W., Sun, L., Wang, X., Huang, Z., and Li, B. (2019b). Seabig: A deep learning-based method for location prediction in pedestrian semantic trajectories. Access, 7:109054–109062.
  • [Zhao et al., 2019] Zhao, T., Xu, Y., Monfort, M., Choi, W., Baker, C., Zhao, Y., Wang, Y., and Wu, Y. N. (2019). Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, pages 12126–12134.
  • [Zhu et al., 2019] Zhu, Y., Qian, D., Ren, D., and Xia, H. (2019). Starnet: Pedestrian trajectory prediction using deep neural network in star topology. In IROS, pages 8075–8080. IEEE.
