Adaptive Trajectory Prediction via Transferable GNN

03/09/2022
by   Yi Xu, et al.
Northeastern University

Pedestrian trajectory prediction is an essential component in a wide range of AI applications such as autonomous driving and robotics. Existing methods usually assume that the training and testing motions follow the same pattern while ignoring potential distribution differences (e.g., shopping mall vs. street), which results in an inevitable performance decrease. To address this issue, we propose a novel Transferable Graph Neural Network (T-GNN) framework, which jointly conducts trajectory prediction and domain alignment in a unified framework. Specifically, a domain-invariant GNN is proposed to explore structural motion knowledge while reducing domain-specific knowledge. Moreover, an attention-based adaptive knowledge learning module is proposed to explore fine-grained individual-level feature representations for knowledge transfer. In this way, disparities across different trajectory domains are better alleviated. We design more challenging yet practical trajectory prediction experiments, and the experimental results verify the superior performance of our proposed model. To the best of our knowledge, our work is the first to fill the gap in benchmarks and techniques for practical pedestrian trajectory prediction across different domains.


1 Introduction

Figure 1: An example that reveals the limitation of the original learning strategy. These two frames are extracted from two different scenes, and there is a huge difference between the trajectories.
Metric   ETH     HOTEL   UNIV     ZARA1   ZARA2   E-D      S-D
NoS      70      301     947      602     921     877      383.63
NoP      181     1053    24334    2253    5833    24153    10073.07
AN       2.586   3.498   25.696   3.743   6.333   23.11    9.78
AV       0.437   0.178   0.205    0.369   0.206   0.259    0.11
AA       0.131   0.06    0.035    0.039   0.026   0.105    0.04

Table 1: Statistics of five different scenes: ETH, HOTEL, UNIV, ZARA1, and ZARA2. NoS denotes the number of sequences to be predicted, NoP the number of pedestrians, AN the average number of pedestrians per sequence, AV the average velocity of pedestrians per sequence, and AA the average acceleration of pedestrians per sequence. E-D represents the Extreme Deviation (max − min) across the five scenes and S-D the Standard Deviation.

Trajectory prediction aims to predict the future trajectory, seconds to even a minute ahead, from a given trajectory history. It plays an indispensable role in a large number of real-world applications such as autonomous driving, robotics, navigation, and video surveillance. In the self-driving scenario, accurate pedestrian trajectory prediction is essential for planning [Bai2015Intention, ma2020optimal], decision making [Yuanfu2018PORCA], environmental perception [Talbot2020Robot, Obo2020Intelligent], person identification [Luber2010People], and anomaly detection [Musleh2010Identifying, 2004Pedestrian]. Trajectory prediction is a challenging task. For instance, strangers tend to walk alone to avoid collisions, while friends tend to walk as a group [Moussaid2010TheWalking]. In addition, pedestrians can interact with surrounding objects or other pedestrians, and such interactions are too complex and subtle to quantify. To capture such interactions, Social-LSTM [alahi2016social] designs a pooling layer to pass interaction information among pedestrians, and then applies a long short-term memory (LSTM) network to predict future trajectories. Following this pattern, many methods [liang2019peeking, zhang2019sr, hu2020collaborative, xu2020cf, zhu2021simultaneous] have been proposed to share information via different mechanisms, e.g., attention mechanisms or similarity measures. Instead of predicting one deterministic future trajectory, generative adversarial network-based (GAN) [fernando2018gd, gupta2018social, li2019conditional, sadeghian2019sophie, dendorfer2021mg] and encoder-decoder-based methods [Mangalam2020It, cheng2020exploring, salzmann2020trajectronplusplus, xu2021tra2tra, chen2021personalized, chen2021human, shafiee2021introvert] have been proposed to generate multiple feasible trajectories.

However, these existing methods usually focus on learning a generic motion pattern while ignoring the potential distribution differences between training and testing samples. We argue that this learning strategy has limitations, as Fig. 1 illustrates. The trajectories of walking pedestrians clearly differ across trajectory domains: the trajectory in the left figure is stable, while the trajectory in the right figure is much more tortuous. The original strategy learns these two samples together, which unintentionally introduces domain bias and disparities into the model.

To quantitatively and objectively evaluate the potential domain gaps, Tab. 1 gives five numerical metrics for five commonly used scenes. The number of pedestrians in UNIV is much larger than that in ETH, and the differences among the five trajectory domains are significant. As for pedestrian moving patterns, pedestrians in ETH have the largest average velocity, nearly three times that in HOTEL, and also the largest average acceleration, nearly five times that in ZARA2. The E-D and S-D values further reveal the large differences among the five trajectory domains. This situation is common in practical applications. For example, in vision applications, cameras located in different cities or corners can lead to a significant distribution gap. Similar issues arise in robot navigation and autonomous driving, since the environments are constantly changing.
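Per-scene statistics such as AV and AA in Tab. 1 can be computed from raw tracks with a short sketch like the following. The function name, data layout, and the 0.4 s sampling interval (ETH/UCY are annotated at 2.5 Hz) are our assumptions for illustration.

```python
import numpy as np

def domain_statistics(trajs, dt=0.4):
    """Per-scene motion statistics used to expose domain gaps.

    trajs: list of arrays, each of shape (T, 2), holding one
    pedestrian's (x, y) track; dt is the assumed sampling interval.
    """
    vels, accs = [], []
    for p in trajs:
        v = np.diff(p, axis=0) / dt            # per-step velocity vectors
        vels.append(np.linalg.norm(v, axis=1).mean())
        a = np.diff(v, axis=0) / dt            # per-step acceleration vectors
        accs.append(np.linalg.norm(a, axis=1).mean())
    return {"NoP": len(trajs),
            "AV": float(np.mean(vels)),
            "AA": float(np.mean(accs))}
```

Applying this per scene and then taking max − min and the standard deviation across scenes yields the E-D and S-D columns.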

To further demonstrate this challenge, we apply three state-of-the-art methods, Social-STGCNN [mohamed2020social], SGCN [shi2021sgcn], and Tra2Tra [xu2021tra2tra], to show the performance drop across trajectory domains. Taking ETH as an example, these models are trained on the validation set of ETH and evaluated on the standard testing set of ETH. Note that there are no overlapping trajectory samples between the training and testing sets, but their distributions can be regarded as consistent. We refer to this evaluation protocol as the “consistent setting” and to the performance under it as “updated ADE” and “updated FDE”. Fig. 2 shows the updated ADE/FDE as well as the original ADE/FDE reported in their papers. The performance drops are significant, which further reveals the domain-bias problem in the original leave-one-out setting.

Figure 2: Performance comparison of three state-of-the-art methods under the original leave-one-out setting and the consistent setting. The performance drops of all three models are significant.

Domain adaptation is a subcategory of transfer learning that aims to address the domain shift issue. The basic idea is to minimize the distance between the distributions of the source and target domains via some distance measure, such as maximum mean discrepancy (MMD) [Ni2013Subspace, Long2015Learning], correlation alignment distance (CORAL) [Sun2016Deep, Zhuo2017Deep], or an adversarial loss [Ganin2014Unsupervised, 2020Unsupervised]. In these methods, the feature dimension of one sample is fixed in both source and target domains. In contrast, a “sample” in our task is a combination of multiple trajectories of different pedestrians, which exhibits not only global domain shift but also internal correlations. Directly using a generic feature representation of one “sample” therefore loses crucial individual-level fine-grained features. Consequently, the most popular domain adaptation approaches are not applicable here.

In this work, we delve into the trajectory domain shift problem and propose a transferable graph neural network with adaptive knowledge learning. Specifically, we propose a novel attention-based adaptive knowledge learning module for trajectory-to-trajectory domain adaptation. Moreover, a novel trajectory graph neural network is proposed, which extracts compact features of pedestrians that enhance domain-invariant knowledge learning. The contributions of our work are summarized as follows:

  • We delve into the domain shift problem across different trajectory domains and propose a unified T-GNN method for jointly predicting future trajectories and adaptively learning domain-invariant knowledge.

  • We propose a specially designed graph neural network for extracting compact spatial-temporal feature representations. We also develop an effective attention-based adaptive knowledge learning module to explore fine-grained individual-level transferable representations for domain adaptation.

  • We introduce a brand new setting for pedestrian trajectory prediction problem, which is meaningful in real practice. We set up strong baselines for pedestrian trajectory prediction under this domain-shift setting.

  • Experiments on five trajectory domains verify the consistent and superior performance of our method.

As it is natural to use a graph-based model to represent the topology of social networks, recent methods [mohamed2020social, ivanovic2019trajectron, sun2020recursive, li2021spatial, shi2021sgcn, wang2021graphtcn] employ graph neural networks as their backbones. Different from these methods, the graph neural network we employ is simple yet specially designed not only to extract effective spatial-temporal features but also to be suitable for domain-invariant knowledge learning.

2 Related Works

Figure 3: Flowchart of our T-GNN model. Given the source and target trajectories, we first construct the corresponding successive graphs G_S and G_T, and then GCN layers are applied to extract feature representations F_S and F_T from these graphs. Following this, F_S and F_T are forwarded through the attention-based adaptive knowledge learning module to learn transferable features f_S and f_T for aligning the source and target trajectory domains. Afterwards, only F_S from the source trajectory domain is used for future trajectory prediction via the temporal prediction module. Finally, our T-GNN model jointly minimizes the prediction loss and the alignment loss.

2.1 Forecasting Pedestrian Trajectory

Forecasting pedestrian trajectory aims to predict the future locations of a target person based on his/her past locations and surroundings. Early research attempts to use mathematical models [crowsourcing_1] to make predictions, such as Gaussian Processes [Keat2007Modelling, Ellis2009Modelling] and Markov Decision Processes [Makris2002Spatial, Kitani2012Activity]. Recently, a large number of deep learning methods have been proposed for this prediction problem. In Social-LSTM [alahi2016social], pedestrians are modeled with Recurrent Neural Networks (RNNs), and the hidden states of pedestrians are integrated via a designed pooling layer, where human-human interaction features are shared. To improve the quality of the extracted interaction features, many recent works [vemula2018social, zhang2019sr, liang2019peeking, bisagno2018group, hu2020collaborative, zheng2021unlimited] follow this idea to pass information among pedestrians, and different effective message passing approaches have been proposed. Taking into account the uncertainty of pedestrian motion, some studies [sadeghian2019sophie, li2019conditional, vaswani2017attention, fernando2018gd, amirian2019social, kosaraju2019social, dendorfer2021mg] utilize Generative Adversarial Networks (GANs) to make multiple plausible predictions for each person. In addition, different encoder-decoder structures [Mangalam2020It, cheng2020exploring, sun2021three] have been applied to this task, which are more flexible for embedding different context features.

The Transformer structure [vaswani2017attention] has achieved remarkable performance in the Natural Language Processing field [Devlin2018BERT]. Motivated by this design, some studies [giuliari2020transformer, yu2020spatio, yuan2021agent] adapt it to the trajectory prediction task and improve prediction precision. Over the past two years, several works [tran2021goal, zhao2021you, mangalam2021goals] have explored goal-driven trajectory prediction, which estimates the endpoints of trajectories. In addition, some interesting perspectives have been introduced into this task, e.g., the long-tail situation [makansi2021exposing], energy-based models [pang2021trajectory], interpretable forecasting models [kothari2021interpretable], active learning [xu2021robust], and counterfactual analysis [chen2021human]. Different from recent work [liang2020simaug] that studies the problem of predicting future trajectories in unseen cameras with only 3D simulation data, our work is carried out under a more general and practical trajectory prediction setting, which has broader implications.

2.2 Graph-Involved Forecasting Models

Thanks to their powerful representation ability in non-Euclidean space, Graph Neural Networks (GNNs) have recently been widely applied to the trajectory prediction task [velivckovic2017graph, yan2018spatial, wu2020comprehensive, jain2016structural, wang2019inductive]. The basic idea is to treat pedestrians as nodes in a graph while measuring their interactions via graph edges. Recent works have utilized different variants of graph neural networks, e.g., edge-feature aggregation [Rosmann2017Online, sun2020recursive], spatial-temporal feature extraction [mohamed2020social, ivanovic2019trajectron], adapted graph structures [VectorNet2020, zhu2019starnet, mohamed2020social, shi2021sgcn], and graph attention methods [kosaraju2019social]. Our work also applies a graph model for feature representation extraction. Different from the above methods, our model is specially designed for effective spatial-temporal feature representation learning as well as trajectory domain-invariant knowledge learning.

2.3 Domain Adaptation

Recently, the domain adaptation (DA) problem has attracted considerable attention, motivating a large number of approaches [Yang2018Learning, Ding2018Graph] to resolve the domain shift problem. Generally speaking, DA can be divided into two main categories: semi-supervised DA and unsupervised DA. The difference lies in the accessibility of target labels in the training phase. In semi-supervised DA [Hal2010Co, Saenko2010Adapting, He2020Classification], only a small number of labeled target samples are accessible, while in unsupervised DA [2015Geodesic, Luo2018Deep, Jim2020A, cai2021graph], the target domain is totally unlabeled, which is much more challenging. In our work, we deal with the unsupervised DA problem. The majority of existing unsupervised DA methods project the source and target samples into a shared feature space and then align their feature distributions by minimizing some distance measure, such as MMD [Ni2013Subspace, Long2015Learning] or CORAL [Sun2016Deep, Zhuo2017Deep], or by using an adversarial loss [Ganin2014Unsupervised, 2020Unsupervised] to make their distributions indistinguishable. As discussed above, these methods cannot be directly applied in our work. We address this problem by introducing a novel attention-based adaptive knowledge learning module.

3 Our Method

The framework of our T-GNN model is illustrated in Fig. 3 and includes three main components: 1) a graph neural network that extracts effective spatial-temporal features of pedestrians from both source and target trajectory domains, 2) an attention-based adaptive knowledge learning module that explores domain-invariant individual-level representations for transfer learning, and 3) a temporal prediction module for future trajectory prediction.

3.1 Problem Definition

Given one pedestrian i's observed trajectory X_i = {(x_i^t, y_i^t) | t = 1, ..., T_obs}, we aim to predict the future trajectory Y_i = {(x_i^t, y_i^t) | t = T_obs + 1, ..., T_pred}, where (x_i^t, y_i^t) denote the coordinates at time step t. Considering all the pedestrians in the scene, the goal is to predict the trajectories of all pedestrians simultaneously by a model f with parameters W. Formally,

    Ŷ = f(X; W),    (1)

where Ŷ = {Ŷ_1, ..., Ŷ_N} is the set of predicted future trajectories of all pedestrians, N denotes the number of pedestrians, and W represents the collection of learnable parameters in the model.

3.2 Spatial-Temporal Feature Representations

Different from traditional time series prediction, it is more challenging to predict future trajectories because of the implicit human-human interactions and strong temporal correlations. Extracting comprehensive spatial-temporal feature representations of observed pedestrian trajectories is the key to accurate predictions. In our work, considering the data structure of trajectories, a graph neural network is first employed to extract spatial-temporal feature representations.

Before constructing the graph, the coordinates of all pedestrians are first decentralized as

    r_i^t = (x_i^t − x_i^{T_obs}, y_i^t − y_i^{T_obs}),  i = 1, ..., N,    (2)

where N is the number of pedestrians in the scene and (x_i^{T_obs}, y_i^{T_obs}) represents the coordinates of pedestrian i at the last observed frame T_obs. This decentralization operation eliminates the effects of scene size differences and is also applied in recent works [zhu2019starnet, xu2021tra2tra]. We refer to r_i^t as the “relative coordinates” for the following graph construction.
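The decentralization step above amounts to shifting each trajectory so its last observed position is the origin. A minimal sketch (function name and array layout are our assumptions):

```python
import numpy as np

def decentralize(coords):
    """Relative ("decentralized") coordinates as described above.

    coords: array of shape (N, T_obs, 2) with absolute (x, y)
    positions. Each trajectory is shifted so the last observed frame
    sits at the origin, removing scene-size effects.
    """
    last = coords[:, -1:, :]       # (N, 1, 2): position at frame T_obs
    return coords - last           # broadcast subtraction per pedestrian
```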

We define the graph G^t = (V^t, E^t) at time step t, where V^t = {v_i^t | i = 1, ..., N} is the vertex set of pedestrians, E^t is the edge set that indicates the relationship between two pedestrians, and H^t ∈ R^{N×D} is the feature matrix associated with the pedestrians (D is the feature dimension). The topological structure of the graph is represented by the adjacency matrix A^t ∈ R^{N×N}. In our case, the value a_{ij}^t in the adjacency matrix is initialized as the distance between pedestrians i and j. The value h_i^t in the feature matrix is defined as

    h_i^t = σ(W_e r_i^t),    (3)

where W_e are learnable projection parameters and σ(·) is a non-linear activation function.
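Initializing the adjacency matrix from pairwise pedestrian distances, as described above, can be sketched as follows (a vectorized illustration with assumed shapes, not the paper's exact implementation):

```python
import numpy as np

def init_adjacency(rel):
    """Distance-based adjacency matrix A^t for one frame.

    rel: array (N, 2) of relative coordinates of N pedestrians.
    Returns the (N, N) matrix of pairwise Euclidean distances used
    to initialize the graph's topology.
    """
    diff = rel[:, None, :] - rel[None, :, :]   # (N, N, 2) pairwise offsets
    return np.linalg.norm(diff, axis=-1)       # (N, N) distances
```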

With the adjacency matrix A^t, the graph attention layer from [velivckovic2017graph] is adopted to measure the relative importance of spatial relations between pedestrians and to update the adjacency matrix. The graph attention coefficients are calculated as

    α_{ij}^t = softmax_j( LeakyReLU( w^⊤ [W a_i^t ‖ W a_j^t] ) ),    (4)

where a_i^t is the i-th column vector in A^t, W and w are learnable parameters, ‖ represents concatenation along the row dimension, and LeakyReLU is a non-linear activation function with negative-input slope 0.2. The same hyper-parameters as in [velivckovic2017graph] are used here; see that work for details.

A linear combination is then computed based on the obtained attention coefficients:

    â_i^t = Σ_j α_{ij}^t W a_j^t.    (5)

With each column vector â_i^t concatenated together, we obtain the updated adjacency matrix Â^t, which contains the global spatial features of pedestrians at time step t. Then, GCN layers [kipf2016semi] are applied to extract spatial-temporal features. Similar to [mohamed2020social], we first add the identity matrix I to Â^t:

    Ã^t = Â^t + I.    (6)

Then, we stack Ã^t from time step 1 to T_obs as Ã = {Ã^1, ..., Ã^{T_obs}} and likewise stack the vertex feature matrices of the l-th layer from time step 1 to T_obs as H^{(l)}, where T_obs represents the observation length. Additionally, the stack of node degree matrices Λ is correspondingly calculated from Ã. Finally, the output of the (l+1)-th layer is calculated as

    H^{(l+1)} = σ( Λ^{-1/2} Ã Λ^{-1/2} H^{(l)} W^{(l)} ),    (7)

where W^{(l)} are the learnable parameters of the l-th layer.

In our case, three cascaded GCN layers (l = 1, 2, 3) are employed to extract the spatial-temporal feature representations of observed trajectories. Both source and target trajectories are constructed as graphs accordingly and then fed into the parameter-shared GCN layers for feature extraction. For simplicity, we denote the final feature representations of the source trajectory domain as F_S and of the target trajectory domain as F_T, where the numbers of pedestrians N_S and N_T from the source and target domains generally differ.
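One symmetrically normalized GCN layer in the spirit of Eqs. (6)-(7) can be sketched as below. The shapes and the ReLU choice of σ are our assumptions; the paper applies this per time step and stacks the results over the T_obs observed frames.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU( D^{-1/2} (A + I) D^{-1/2} H W ).

    A: (N, N) attention-updated adjacency for one frame,
    H: (N, D) node features, W: (D, D_out) learnable weights.
    Self-loops are added first, then symmetric degree normalization.
    """
    A_hat = A + np.eye(A.shape[0])             # add identity (Eq. 6)
    d = A_hat.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```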

3.3 Attention-Based Adaptive Learning

Given the misalignment of feature representations between the source and target trajectory domains, we introduce an individual-wise attention-based adaptive knowledge learning module for transfer learning. Different from conventional domain adaptation settings, the feature space of a trajectory sample is not fixed, as the numbers of pedestrians differ between the source and target trajectory domains. To address this misalignment, the module refines the features and concentrates on the most relevant feature space.

For individual-wise attention, we first reformat the final feature representations F_S and F_T as

    F_S = {F_S^1, ..., F_S^{N_S}},  F_T = {F_T^1, ..., F_T^{N_T}},    (8)

where F_S^i and F_T^j correspond to the feature maps of one pedestrian from the source and target trajectory domains, respectively. Then we reshape each feature map into a feature vector f of size 1 × K, where K = T_obs × D.

Although the feature vector keeps the spatial-temporal information of one pedestrian, it does not tell us how representative one pedestrian's feature vector is within its trajectory domain. Therefore, an attention module is introduced to learn the relative relevance between feature vectors and their trajectory domain. The attention scores are calculated as

    s_i = tanh( w_a^⊤ f_i + b_a ),    (9)

where w_a and b_a are learnable parameters. Then the final feature representations f_S and f_T of the source and target trajectory domains are calculated as

    f_S = Σ_{i=1}^{N_S} softmax(s_i) f_i,  f_T = Σ_{j=1}^{N_T} softmax(s_j) f_j.    (10)

These two context vectors f_S and f_T correspond to the refined individual-level representations of the source and target trajectory domains. A similarity loss for distribution alignment is accordingly introduced as

    L_align = dist( f_S, f_T ).    (11)

There are multiple choices for the distance function dist(·, ·), such as the L2 distance, the MMD loss [Ni2013Subspace, Long2015Learning], the CORAL loss [Sun2016Deep, Zhuo2017Deep], and the adversarial loss [Ganin2014Unsupervised, 2020Unsupervised]. We explore these four alignment measures in Sec. 4, and the results indicate that the L2 distance is the most appropriate. Thus, we have

    L_align = || f_S − f_T ||_2.    (12)
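The attention pooling and L2 alignment described above can be sketched as follows. The parameter shapes (a weight vector and scalar bias) are our assumptions; the key point is that each domain is pooled into one context vector, so differing pedestrian counts are handled naturally.

```python
import numpy as np

def align_loss(F_s, F_t, w, b):
    """Attention-based adaptive alignment (a sketch of Eqs. 9-12).

    F_s: (N_s, K) and F_t: (N_t, K) per-pedestrian feature vectors
    from the source and target domains (K = T_obs * D).
    w: (K,) and b: scalar are assumed learnable attention parameters.
    """
    def pool(F):
        s = np.tanh(F @ w + b)            # attention scores (Eq. 9)
        a = np.exp(s) / np.exp(s).sum()   # softmax weights
        return a @ F                      # context vector (Eq. 10)
    # L2 distance between domain-level context vectors (Eq. 12)
    return np.linalg.norm(pool(F_s) - pool(F_t))
```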

3.4 Temporal Prediction Module

Instead of making predictions frame by frame, TCN [GCRNSM2018Shaojie] layers are employed to predict future trajectories from the spatial-temporal feature representations of the source trajectory domain. This prediction strategy alleviates the error accumulation problem of sequential RNN predictions, and also avoids gradient vanishing and high computational costs [hochreiter1997long, chung2014empirical]. Recent works [mohamed2020social, shi2021sgcn] also employ this strategy.

Given the feature representation F_S, we pass it through TCN layers along the time dimension to obtain the corresponding future trajectories. Formally, for the q-th TCN layer, we have

    P^{(q+1)} = σ( TCN( P^{(q)}; W^{(q)} ) ),    (13)

where W^{(q)} are the learnable parameters of the q-th TCN layer and P^{(q+1)} represents its prediction output over the horizon of length T_pred − T_obs. In our case, three cascaded TCN layers (q = 1, 2, 3) are employed, starting from P^{(1)} = F_S, to obtain the final output, which we refer to as P.

As in prior work, the predicted pedestrian coordinates (x̂_i^t, ŷ_i^t) are assumed to follow a bi-variate Gaussian distribution N(μ_i^t, σ_i^t, ρ_i^t), where μ_i^t is the mean, σ_i^t is the standard deviation, and ρ_i^t is the correlation coefficient. These parameters are obtained by passing P through one linear layer:

    ( μ_i^t, σ_i^t, ρ_i^t ) = W_o p_i^t + b_o,    (14)

where W_o and b_o are the learnable parameters of this linear layer and p_i^t is the feature of pedestrian i at time step t in P.
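Mapping the raw 5-dimensional linear-layer output to a valid bi-variate Gaussian can be sketched as below. The exp/tanh activations are a common convention we assume (exp keeps σ positive, tanh keeps ρ in (−1, 1)); the text does not spell out the exact choice.

```python
import numpy as np

def gaussian_params(out):
    """Split raw outputs of Eq. (14) into (mu, sigma, rho).

    out: array (..., 5) of raw linear-layer outputs per pedestrian
    and time step. Returns mean (..., 2), standard deviation (..., 2),
    and correlation coefficient (..., 1).
    """
    mu = out[..., 0:2]                 # unconstrained mean
    sigma = np.exp(out[..., 2:4])      # positivity via exp (assumed)
    rho = np.tanh(out[..., 4:5])       # rho in (-1, 1) via tanh (assumed)
    return mu, sigma, rho
```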

3.5 Objective Function

The overall objective function consists of two terms: the prediction loss L_pred for predicting future trajectories and the alignment loss L_align for aligning the distributions of the source and target trajectory domains. The prediction loss is the negative log-likelihood

    L_pred = − Σ_{i=1}^{N} Σ_{t=T_obs+1}^{T_pred} log P( (x_i^t, y_i^t) | μ_i^t, σ_i^t, ρ_i^t ).    (15)

Note that only samples from the source trajectory domain participate in the prediction phase. The whole model is trained by jointly minimizing the prediction loss L_pred and the alignment loss L_align:

    L = L_pred + λ L_align,    (16)

where λ is a hyper-parameter balancing the two terms.
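The two loss terms can be sketched as follows: the standard bi-variate Gaussian negative log-likelihood for one ground-truth point, and the weighted sum of Eq. (16). Function names are ours; the density formula is the standard one.

```python
import numpy as np

def bivariate_nll(y, mu, sigma, rho):
    """Negative log-likelihood of one point under a bi-variate Gaussian.

    y, mu: arrays (2,); sigma: array (2,) of positive std devs;
    rho: scalar correlation in (-1, 1). Summing this over pedestrians
    and prediction steps gives the prediction loss of Eq. (15).
    """
    dx = (y[0] - mu[0]) / sigma[0]
    dy = (y[1] - mu[1]) / sigma[1]
    z = dx**2 + dy**2 - 2.0 * rho * dx * dy
    log_p = (-z / (2.0 * (1.0 - rho**2))
             - np.log(2.0 * np.pi * sigma[0] * sigma[1]
                      * np.sqrt(1.0 - rho**2)))
    return -log_p

def total_loss(l_pred, l_align, lam=1.0):
    """Joint objective of Eq. (16): prediction plus lambda * alignment."""
    return l_pred + lam * l_align
```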

4 Experiments

Method Year Performance (ADE)  (Source2Target) Ave
A2B A2C A2D A2E B2A B2C B2D B2E C2A C2B C2D C2E D2A D2B D2C D2E E2A E2B E2C E2D
Social-STGCNN [mohamed2020social] 2020 1.83 1.58 1.30 1.31 3.02 1.38 2.63 1.58 1.16 0.70 0.82 0.54 1.04 1.05 0.73 0.47 0.98 1.09 0.74 0.50 1.22
PECNet [Mangalam2020It] 2020 1.97 1.68 1.24 1.35 3.11 1.35 2.69 1.62 1.39 0.82 0.93 0.57 1.10 1.17 0.92 0.52 1.01 1.25 0.83 0.61 1.31
RSBG [sun2020recursive] 2020 2.21 1.59 1.48 1.42 3.18 1.49 2.72 1.73 1.23 0.87 1.04 0.60 1.19 1.21 0.80 0.49 1.09 1.37 1.03 0.78 1.38
Tra2Tra [xu2021tra2tra] 2021 1.72 1.58 1.27 1.37 3.32 1.36 2.67 1.58 1.16 0.70 0.85 0.60 1.09 1.07 0.81 0.52 1.03 1.10 0.75 0.52 1.25
SGCN [shi2021sgcn] 2021 1.68 1.54 1.26 1.28 3.22 1.38 2.62 1.58 1.14 0.70 0.82 0.52 1.05 0.97 0.80 0.48 0.97 1.08 0.75 0.51 1.22
T-GNN (Ours) - 1.13 1.25 0.94 1.03 2.54 1.08 2.25 1.41 0.97 0.54 0.61 0.23 0.88 0.78 0.59 0.32 0.87 0.72 0.65 0.34 0.96
Table 2: ADE results of our T-GNN model in comparison with existing state-of-the-art baselines on 20 tasks. “2” represents from source domain to target domain. A, B, C, D, and E denote ETH, HOTEL, UNIV, ZARA1, and ZARA2, respectively.
Method Year Performance (FDE)  (Source2Target) Ave
A2B A2C A2D A2E B2A B2C B2D B2E C2A C2B C2D C2E D2A D2B D2C D2E E2A E2B E2C E2D
Social-STGCNN [mohamed2020social] 2020 3.24 2.86 2.53 2.43 5.16 2.51 4.86 2.88 2.30 1.34 1.74 1.10 2.21 1.99 1.41 0.88 2.10 2.05 1.47 1.01 2.30
PECNet [Mangalam2020It] 2020 3.33 2.83 2.53 2.45 5.23 2.48 4.90 2.86 2.22 1.32 1.68 1.12 2.20 2.05 1.52 0.88 2.10 1.84 1.45 0.98 2.29
RSBG [sun2020recursive] 2020 3.42 2.96 2.75 2.50 5.28 2.59 5.19 3.10 2.36 1.55 1.99 1.37 2.28 2.22 1.77 0.97 2.19 2.29 1.81 1.34 2.50
Tra2Tra [xu2021tra2tra] 2021 3.29 2.88 2.66 2.45 5.22 2.50 4.89 2.90 2.29 1.33 1.78 1.09 2.26 2.12 1.63 0.92 2.18 2.06 1.52 1.17 2.34
SGCN [shi2021sgcn] 2021 3.22 2.81 2.52 2.40 5.18 2.47 4.83 2.85 2.24 1.32 1.71 1.03 2.23 1.90 1.48 0.97 2.10 1.95 1.52 0.99 2.29
T-GNN (Ours) - 2.18 2.25 1.78 1.84 4.15 1.82 4.04 2.53 1.91 1.12 1.30 0.87 1.92 1.46 1.25 0.65 1.86 1.45 1.28 0.72 1.82
Table 3: FDE results of our T-GNN model in comparison with existing state-of-the-art baselines on 20 tasks. “2” represents from source domain to target domain. A, B, C, D, and E denote ETH, HOTEL, UNIV, ZARA1, and ZARA2, respectively.
Method Average Performance
ADE/FDE
T-GNN+MMD [Long2015Learning] 1.11/2.11
T-GNN+CORAL [Zhuo2017Deep] 1.07/2.01
T-GNN+GFK [2015Geodesic] 1.15/2.08
T-GNN+UDA [2020Unsupervised] 1.07/2.09
T-GNN (Ours) 0.96/1.82
Table 4: Average performance on 20 tasks of our T-GNN model in comparison with other four commonly used DA approaches.
Value of λ
ADE 1.19 1.05 0.96 1.16 1.31
FDE 2.16 2.02 1.82 2.07 2.45
Table 5: Average performance on 20 tasks of our T-GNN model with five different values of the hyper-parameter λ.

In this section, we first present our proposed new setting and evaluation protocol, then carry out extensive evaluations of our proposed model under this setting in comparison with existing methods and different domain adaptation strategies.

Datasets. Experiments are conducted on two real-world datasets: ETH [Pellegrini2009You] and UCY [Lerner2010Crowds] as these two public datasets are widely used in this task. ETH consists of two scenes named ETH and HOTEL, and UCY consists of three scenes named UNIV, ZARA1, and ZARA2.

Experimental Settings. We introduce a more practical setting that treats each scene as one trajectory domain: the model is trained on only one domain and tested on each of the other four domains. Given five trajectory domains, we have a total of 20 trajectory prediction tasks: A→B/C/D/E, B→A/C/D/E, C→A/B/D/E, D→A/B/C/E, and E→A/B/C/D, where A, B, C, D, and E denote ETH, HOTEL, UNIV, ZARA1, and ZARA2, respectively. This setting is challenging because of the domain gap between training and testing domains.

Evaluation Protocol. To ensure a fair comparison under the new setting, existing baselines are trained with one source trajectory domain as well as the validation set of the target trajectory domain. Taking A→B as an example, existing baselines are trained with the training set of A and the validation set of B, then evaluated on the testing set of B. Our proposed model uses the training set of A as the source trajectory domain and the validation set of B as the target trajectory domain, and is then evaluated on the testing set of B. Note that the validation and testing sets are independent of each other, with no overlapping samples. In the training phase, our proposed model only has access to the observed trajectories from the validation set.

Baselines. The following five state-of-the-art methods are compared with our proposed method under the new setting and evaluation protocol: Social-STGCNN [mohamed2020social], PECNet [Mangalam2020It], RSBG [sun2020recursive], SGCN [shi2021sgcn], and Tra2Tra [xu2021tra2tra]. We also compare against four widely used DA approaches: T-GNN+MMD, using the multi-kernel maximum mean discrepancy loss [Long2015Learning] as the alignment loss; T-GNN+CORAL, using the CORAL loss [Sun2016Deep]; T-GNN+GFK, using the kernel-based domain adaptation strategy of [2015Geodesic]; and T-GNN+UDA, an unsupervised domain adaptive graph convolutional network using the adversarial loss [2020Unsupervised].

Evaluation Metrics. The following two metrics are used for performance evaluation, where N is the total number of pedestrians in the target trajectory domain, (x̂_i^t, ŷ_i^t) are the predicted coordinates, and (x_i^t, y_i^t) are the ground-truth coordinates.

  • Average Displacement Error (ADE):

    ADE = (1 / (N (T_pred − T_obs))) Σ_{i=1}^{N} Σ_{t=T_obs+1}^{T_pred} || (x̂_i^t, ŷ_i^t) − (x_i^t, y_i^t) ||_2    (17)

  • Final Displacement Error (FDE):

    FDE = (1 / N) Σ_{i=1}^{N} || (x̂_i^{T_pred}, ŷ_i^{T_pred}) − (x_i^{T_pred}, y_i^{T_pred}) ||_2    (18)
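Both metrics reduce to L2 errors over the prediction horizon and can be computed in a few lines (array layout is our assumption):

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE and FDE for a batch of predicted trajectories.

    pred, gt: arrays (N, T_pred, 2) of predicted and ground-truth
    coordinates over the prediction horizon. ADE averages the L2
    error over all pedestrians and time steps; FDE uses only the
    final time step.
    """
    err = np.linalg.norm(pred - gt, axis=-1)   # (N, T_pred) L2 errors
    return err.mean(), err[:, -1].mean()
```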

Implementation Details. As in previous trajectory prediction baselines, 8 frames are observed and the next 12 frames are predicted. In our experiments, the number of GCN layers is 3, the number of TCN layers is 3, and the feature dimension D is set to 64. In the training phase, the batch size is set to 16 and λ is set to 1. The whole model is trained for 200 epochs with the Adam optimizer [diederik2015adam]. The initial learning rate is 0.001 and is changed to 0.0005 after 100 epochs. In the inference phase, 20 predicted trajectories are sampled and the best among them is used for evaluation.

| Variants | ID | A2B | B2C | C2D | D2E | E2A |
| --- | --- | --- | --- | --- | --- | --- |
| T-GNN w/o GAL | 1 | 1.51/2.34 | 1.17/1.90 | 0.69/1.42 | 0.39/0.71 | 0.90/1.98 |
| T-GNN w/o AAL w/ AP | 2 | 1.78/2.85 | 1.23/2.02 | 0.77/1.53 | 0.42/0.79 | 0.96/2.03 |
| T-GNN w/o AAL w/ LL | 3 | 1.81/2.91 | 1.25/2.03 | 0.76/1.48 | 0.43/0.79 | 0.94/2.01 |
| Social-STGCNN- [mohamed2020social] | 4 | 2.18/3.68 | 2.30/3.21 | 1.59/2.54 | 1.23/1.72 | 1.73/2.98 |
| SGCN- [shi2021sgcn] | 5 | 2.03/3.53 | 2.35/3.22 | 1.68/2.71 | 1.12/1.59 | 1.81/3.02 |
| T-GNN- | 6 | 2.12/3.58 | 2.28/3.21 | 1.73/2.76 | 1.19/1.58 | 1.74/2.95 |
| Social-STGCNN [mohamed2020social] | 7 | 1.83/3.24 | 1.38/2.51 | 0.82/1.74 | 0.47/0.88 | 0.98/2.10 |
| SGCN [shi2021sgcn] | 8 | 1.68/3.24 | 1.38/2.47 | 0.82/1.71 | 0.48/0.97 | 0.97/2.10 |
| T-GNN- | 9 | 1.89/3.25 | 1.35/2.48 | 0.88/1.93 | 0.53/0.97 | 0.98/2.16 |
| T-GNN (Ours) | 10 | 1.13/2.18 | 1.08/1.82 | 0.61/1.30 | 0.32/0.65 | 0.87/1.86 |

Table 6: Ablation study of each component and the adaptive learning module (values are ADE/FDE).

4.1 Quantitative Analysis

Tabs. 2 and 3 show the evaluation results on the 20 tasks in comparison with five existing baselines. Tab. 4 shows the average performance over all 20 tasks in comparison with four existing DA approaches.

T-GNN vs. Other Baselines. In general, our proposed T-GNN model consistently outperforms the other five baselines on every task. Overall, our T-GNN model improves over the Social-STGCNN and SGCN models on the ADE metric, and over the PECNet and SGCN models on the FDE metric. This validates that our T-GNN model is able to learn transferable knowledge from the source to the target trajectory domain and to alleviate the domain gap. As mentioned in Sec. 4, these baselines have access to the whole validation set of the target domain while our model only has access to the observed trajectories from the validation set. The results indicate that directly training with mixed data from different trajectory domains is worse than our domain-invariant knowledge learning approach. In addition, for tasks D2E and E2D, all models have relatively smaller ADE and FDE. One possible reason is that domain D (ZARA1) and domain E (ZARA2) have similar backgrounds and surroundings, in which pedestrians may have similar moving patterns. This phenomenon further illustrates the importance of considering the domain-shift problem.

T-GNN vs. Other DA Approaches. In general, our T-GNN model, which uses a distance-based alignment loss, achieves the best average performance. This indicates that such a distance is more appropriate as a similarity measure in the trajectory prediction task. One intuitive reason is that in trajectory prediction, high-dimensional feature representations may still preserve spatial-level information. (The performance on all 20 tasks and the implementation details of the T-GNN+UDA model, which uses an adversarial loss, are provided in the supplementary material.)

4.2 Ablation Study

We first study the performance of different values of the trade-off hyper-parameter in the objective function, then study the contribution of each proposed component, and finally investigate the functionality of our proposed adaptive learning module.

Performance Study of the Trade-off Hyper-parameter. This hyper-parameter balances the two terms in Eq. 16. Setting it too small effectively disables the alignment loss, while setting it too large over-emphasizes alignment. We test different values to find the most suitable one. Tab. 5 shows the average performance on the 20 tasks of our T-GNN model with five different values; our T-GNN model achieves the best performance when the hyper-parameter is set to 1, the value used in our implementation.
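The hyper-parameter sweep can be sketched as a simple grid search. This is a purely hypothetical sketch: `sweep_lambda`, the candidate values, and the toy ADE curve are illustrative stand-ins, not our training code; the toy curve is merely shaped so that its minimum sits at 1, mirroring the trend in Tab. 5.

```python
import numpy as np

def sweep_lambda(train_fn, candidates=(0.1, 0.5, 1.0, 2.0, 5.0)):
    """train_fn(lam) -> validation ADE; returns the lambda with the
    lowest validation ADE together with the full score table."""
    scores = {lam: train_fn(lam) for lam in candidates}
    return min(scores, key=scores.get), scores

# Toy stand-in for a full training run: validation ADE minimized at lam = 1.
toy_ade = lambda lam: np.log(lam) ** 2 + 0.96

best, scores = sweep_lambda(toy_ade)
print(best)  # 1.0
```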

Contributions of Each Component. We evaluate the following three variants of our T-GNN model on 5 selected tasks; the results are reported in Tab. 6. (1) T-GNN w/o GAL removes the graph attention component defined in Eqs. 4 and 5, so the adjacency matrix is no longer updated. (2) T-GNN w/o AAL w/ AP replaces the attention-based adaptive learning module with one average pooling layer: the source and target feature representations are reshaped and passed through an average pooling layer that operates along the "sample" dimension to obtain the aligned representations. (3) T-GNN w/o AAL w/ LL replaces the attention-based adaptive learning module with one trainable linear layer. We observe that removing the graph attention component reduces performance. In addition, replacing our proposed attention-based adaptive learning module with either an average pooling layer or a trainable linear layer also reduces performance, which indicates the effectiveness of our proposed adaptive learning module for exploring individual-level domain-invariant knowledge.

Effectiveness of Adaptive Learning. We carry out further experiments to study the effectiveness of the adaptive learning module in our T-GNN model. We remove the attention-based adaptive learning module presented in Sec. 3.3 and disregard the alignment loss defined in Eq. 12, so the model is trained only on the source domain and evaluated on one novel target domain; we refer to this source-only variant as T-GNN- (variant 6 in Tab. 6). For further comparison, two graph-based baselines, Social-STGCNN [mohamed2020social] and SGCN [shi2021sgcn], are also trained without using the validation set; we refer to these as Social-STGCNN- and SGCN-. In addition, we directly train our model with mixed samples from both domains, without the domain-invariant adaptive learning module; we refer to this mixed-data variant as T-GNN- (variant 9 in Tab. 6). The results are shown in Tab. 6.

Comparing variants 4, 5, and 6, the results indicate that the backbone of our T-GNN model is competitive with the two graph-based backbones, which validates that T-GNN can extract effective spatial-temporal features of observed trajectories. Comparing variants 7, 8, and 9, all three achieve competitive performance since their training data is exactly the same. In addition, variants 7, 8, and 9 outperform variants 4, 5, and 6 respectively, because they all have access to the validation set of the target trajectory domain. The results of variants 7, 8, 9, and 10 validate that our proposed domain-invariant transfer learning approach is superior to directly training with mixed data from different trajectory domains.

5 Conclusion

In this paper, we delve into the domain-shift challenge in pedestrian trajectory prediction. Specifically, we propose a more realistic, practical, yet challenging trajectory prediction setting. We then propose a unified model that combines a Transferable Graph Neural Network for future trajectory prediction with a domain-invariant knowledge learning approach. Extensive experiments demonstrate the superiority of our T-GNN model in both future trajectory prediction and trajectory domain-shift alleviation. Our work is the first to study this problem and fills the gap in benchmarks and techniques for practical pedestrian trajectory prediction across different domains.

Supplementary Material

6 Overview

In the supplementary material, we provide experimental details and more evaluation results including visualizations. We also provide our insights and discussions at the end.

7 Experiments

7.1 Dataset Details

Our work uses two commonly used datasets covering five scenes: ETH [Pellegrini2009You] (http://www.vision.ee.ethz.ch/en/datasets/) and UCY [Lerner2010Crowds] (https://graphics.cs.ucy.ac.cy/research/downloads/crowd-data). The ETH dataset consists of two scenes, ETH and HOTEL; the UCY dataset consists of three scenes, UNIV, ZARA1, and ZARA2. Each scene contains multiple walking pedestrians with different complex walking motions. We show examples of each scene in Fig. 4, with one red dot per person. Basic information on the two datasets is given in Tab. 7; additional statistics of the five scenes are provided in the main body.

(a) ETH
(b) HOTEL
(c) UNIV
(d) ZARA1
(e) ZARA2
Figure 4: One frame example from five different scenes. All five scenes are from outdoor top-down view where multiple pedestrians walking in different motions (each person is denoted with one red dot). It is obvious that UNIV is much more crowded than other four scenes. In addition, ZARA1 and ZARA2 share almost the same background.
| Dataset | Year | Location | Target | Sensors | Description | Duration and tracks | Annotations | Sampling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ETH | 2009 | Outdoor | People | Camera / top-down view | Two scenes | 25 min, 650 tracks | Positions, velocities, groups, maps | @2.5Hz |
| UCY | 2007 | Outdoor | People | Camera / top-down view | Three scenes | 16.5 min, over 700 tracks | Positions, gaze directions | — |

Table 7: Basic information on ETH and UCY.

7.2 Implementation Details

We compare our proposed model with 5 trajectory prediction baselines in total: Social-STGCNN [mohamed2020social], PECNet [Mangalam2020It], RSBG [sun2020recursive], SGCN [shi2021sgcn], and Tra2Tra [xu2021tra2tra]. We implemented these baselines with their released code: Social-STGCNN (https://github.com/abduallahmohamed/Social-STGCNN), PECNet (https://github.com/HarshayuGirase/Human-Path-Prediction), and SGCN (https://github.com/shuaishiliu/SGCN). We tried our best to reproduce the code of RSBG, and the authors shared the code of Tra2Tra with us. We also employ 4 domain adaptation approaches: T-GNN+MMD [Long2015Learning], T-GNN+CORAL [Sun2016Deep], T-GNN+GFK [2015Geodesic], and T-GNN+UDA [2020Unsupervised]. We implemented these approaches based on the following code: MMD (https://github.com/easezyc/deep-transfer-learning), CORAL (https://github.com/VisionLearningGroup/CORAL), GFK (https://github.com/jindongwang/transferlearning/tree/master/code/traditional/GFK), and UDA (https://github.com/GRAND-Lab/UDAGCN).

7.3 Performance Study of the Adjacency Matrix Initialization

As mentioned in the main body, the value of $a_{ij}^t$ in the adjacency matrix is initialized as the distance between pedestrians $i$ and $j$,

    $a_{ij}^t = \left\| \mathbf{r}_i^t - \mathbf{r}_j^t \right\|_2$    (19)

where $\|\cdot\|_2$ is the $\ell_2$ distance and $\mathbf{r}_i^t$ denotes the "relative coordinates" of pedestrian $i$ at time step $t$.
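Under this definition, the initialization reduces to a pairwise-distance computation; a minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def init_adjacency(rel_coords):
    """rel_coords: (N, 2) relative coordinates of N pedestrians at one
    time step. Initializes a_ij as the pairwise L2 distance (Eq. 19)."""
    diff = rel_coords[:, None, :] - rel_coords[None, :, :]  # (N, N, 2) pairwise offsets
    return np.linalg.norm(diff, axis=-1)                    # (N, N) distance matrix

pts = np.array([[0.0, 0.0], [3.0, 4.0]])
A = init_adjacency(pts)
print(A[0, 1])  # 5.0 (the classic 3-4-5 triangle)
```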

Since other definitions of $a_{ij}^t$ are possible, we investigate and analyze several alternative definitions (Eqs. 20-22). Among these different functions, the key principle we follow is that $a_{ij}^t$ should be a function of the relative coordinates of pedestrians $i$ and $j$. Average ADE and FDE results are shown in Tab. 8.

    (20)
    (21)
    (22)

where the small positive constants in Eqs. 20-22 ensure numerical stability; in practice the pairwise distance is rarely exactly zero, but we keep the constants nonetheless.

| Variants | ADE | FDE |
| --- | --- | --- |
| Alternative 1 | 1.03 | 1.99 |
| Alternative 2 | 1.17 | 2.10 |
| Alternative 3 | 1.09 | 1.99 |
| Alternative 4 | 1.14 | 2.07 |
| Alternative 5 | 1.08 | 1.92 |
| Eq. 19 (ours) | 0.96 | 1.82 |

Table 8: Average performance over all 20 tasks on the ADE/FDE metrics with different initializations of the adjacency matrix.

As Tab. 8 shows, the best performance comes from the initialization in Eq. 19. Among the alternatives, the second-best ADE (1.03) and the second-best FDE (1.92) are achieved by two different functions.

7.4 Results of Other DA Approaches

| Metric | Method | A2B | A2C | A2D | A2E | B2A | B2C | B2D | B2E | C2A | C2B | C2D | C2E | D2A | D2B | D2C | D2E | E2A | E2B | E2C | E2D | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADE | T-GNN+MMD [Long2015Learning] | 1.53 | 1.39 | 1.14 | 1.19 | 2.99 | 1.18 | 2.39 | 1.49 | 1.08 | 0.62 | 0.71 | 0.42 | 1.02 | 0.89 | 0.68 | 0.38 | 0.89 | 0.99 | 0.74 | 0.41 | 1.11 |
| ADE | T-GNN+CORAL [Zhuo2017Deep] | 1.43 | 1.35 | 1.09 | 1.12 | 2.87 | 1.12 | 2.31 | 1.46 | 1.03 | 0.58 | 0.68 | 0.46 | 0.99 | 0.85 | 0.66 | 0.40 | 0.86 | 0.96 | 0.67 | 0.41 | 1.07 |
| ADE | T-GNN+GFK [2015Geodesic] | 1.69 | 1.52 | 1.20 | 1.24 | 3.01 | 1.19 | 2.52 | 1.55 | 1.11 | 0.68 | 0.69 | 0.50 | 0.96 | 0.89 | 0.71 | 0.43 | 0.89 | 1.01 | 0.75 | 0.42 | 1.15 |
| ADE | T-GNN+UDA [2020Unsupervised] | 1.41 | 1.32 | 0.98 | 1.23 | 2.92 | 1.20 | 2.43 | 1.42 | 1.12 | 0.64 | 0.62 | 0.48 | 0.91 | 0.81 | 0.69 | 0.35 | 0.91 | 0.98 | 0.70 | 0.39 | 1.07 |
| ADE | T-GNN (Ours) | 1.13 | 1.25 | 0.94 | 1.03 | 2.54 | 1.08 | 2.25 | 1.41 | 0.97 | 0.54 | 0.61 | 0.23 | 0.88 | 0.78 | 0.59 | 0.32 | 0.87 | 0.72 | 0.65 | 0.34 | 0.96 |
| FDE | T-GNN+MMD [Long2015Learning] | 2.63 | 2.65 | 1.98 | 2.24 | 4.86 | 2.15 | 4.63 | 2.69 | 2.16 | 1.25 | 1.52 | 0.99 | 2.20 | 1.88 | 1.39 | 0.75 | 2.03 | 1.84 | 1.46 | 0.82 | 2.11 |
| FDE | T-GNN+CORAL [Zhuo2017Deep] | 2.44 | 2.52 | 1.82 | 2.16 | 4.59 | 1.89 | 4.48 | 2.68 | 2.09 | 1.20 | 1.47 | 0.97 | 2.09 | 1.83 | 1.33 | 0.75 | 2.01 | 1.79 | 1.38 | 0.79 | 2.01 |
| FDE | T-GNN+GFK [2015Geodesic] | 2.67 | 2.66 | 2.03 | 2.21 | 4.74 | 2.12 | 4.88 | 2.68 | 2.19 | 1.23 | 1.34 | 1.01 | 1.96 | 1.77 | 1.30 | 0.76 | 2.03 | 1.83 | 1.43 | 0.78 | 2.08 |
| FDE | T-GNN+UDA [2020Unsupervised] | 2.59 | 2.61 | 1.94 | 2.25 | 4.81 | 2.13 | 4.85 | 2.63 | 2.19 | 1.29 | 1.42 | 1.03 | 2.03 | 1.75 | 1.37 | 0.73 | 2.08 | 1.80 | 1.45 | 0.76 | 2.09 |
| FDE | T-GNN (Ours) | 2.18 | 2.25 | 1.78 | 1.84 | 4.15 | 1.82 | 4.04 | 2.53 | 1.91 | 1.12 | 1.30 | 0.87 | 1.92 | 1.46 | 1.25 | 0.65 | 1.86 | 1.45 | 1.28 | 0.72 | 1.82 |

Table 9: ADE/FDE results of our T-GNN model in comparison with existing domain adaptation approaches on all 20 tasks. "X2Y" denotes transfer from source trajectory domain X to target trajectory domain Y. A, B, C, D, and E denote ETH, HOTEL, UNIV, ZARA1, and ZARA2, respectively.

Tab. 9 shows the evaluation results on all 20 tasks in comparison with the other domain adaptation approaches. In the T-GNN+UDA model, the adversarial loss is measured by an extra domain classifier, for which we employ one fully-connected linear layer. Specifically, this kind of model needs to minimize the adversarial loss with respect to the parameters of the domain classifier while maximizing it with respect to the parameters of the trajectory predictor. We therefore use a gradient reversal layer [ganin2015unsupervised] for the min-max optimization, which unifies the training procedure into a single step. It can be observed that our proposed adaptive learning module outperforms these domain adaptation approaches, which suggests that our designed alignment loss is more appropriate for adapting fine-grained individual-level features in the trajectory prediction task.
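The gradient reversal mechanism can be sketched conceptually as follows. This is a framework-free sketch of the idea from [ganin2015unsupervised], not our training code; in practice it would be implemented as a custom autograd function in a deep learning framework so that the reversal happens automatically during backpropagation.

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer sketch: identity in the forward pass,
    gradient scaled by -lam in the backward pass, so the feature
    extractor is pushed to fool the domain classifier while the
    classifier itself is trained normally, all in one optimization step."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                        # identity on the forward pass

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed (negated) gradient

grl = GradReverse(lam=1.0)
x = np.array([1.0, 2.0])
print(grl.forward(x), grl.backward(np.array([0.5, -0.5])))
```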

7.5 t-SNE Visualization

(a) BD w/o AAL
(b) CE w/o AAL
(c) DE w/o AAL
(d) ED w/o AAL
(e) BD w/ AAL
(f) CE w/ AAL
(g) DE w/ AAL
(h) ED w/ AAL
Figure 5: Visualization results of the source and target feature representations using t-SNE. Blue and red dots denote the source and target feature representations, respectively. "w/o AAL" denotes that the attention-based adaptive learning module is disregarded (corresponding to the variant T-GNN-). "w/ AAL" denotes features extracted from our full T-GNN model.

In this section, we visualize the feature representations of the source and target trajectory domains with the t-SNE approach [2008Visualizing] on 4 tasks. Fig. 5 shows the visualization examples, where blue and red denote the source and target trajectory features, respectively. The first row shows features obtained without the attention-based adaptive learning module, denoted "w/o AAL" (corresponding to the variant T-GNN- in the main body). The second row shows features obtained with the attention-based adaptive learning module, denoted "w/ AAL". Each dot represents the feature of one pedestrian. Different from conventional t-SNE visualizations applied in classification tasks, there is no specific "label" for each dot in our task, so the cluster structure may not be clear.
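The visualization pipeline can be sketched as follows (assuming scikit-learn is available; the random features are stand-ins for the actual source and target representations, and the scatter-plot step is omitted):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stack both domains' features and embed them jointly with t-SNE,
# then split the embedding back to color by domain (blue = source,
# red = target in our figures).
rng = np.random.default_rng(0)
src_feats = rng.normal(0.0, 1.0, size=(50, 64))   # stand-in source features
tgt_feats = rng.normal(0.8, 1.0, size=(50, 64))   # stand-in target features

feats = np.vstack([src_feats, tgt_feats])         # (100, 64) joint feature set
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)

src_emb, tgt_emb = emb[:50], emb[50:]             # per-domain 2-D embeddings
print(emb.shape)  # (100, 2)
```

Embedding both domains jointly (rather than running t-SNE per domain) is what makes the inter-domain distances in the plot meaningful.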

For tasks BD and CE, we observe that the features get closer with our adaptive learning module, which validates that the module is able to alleviate the disparities across different trajectory domains. The effect on task BD is less pronounced, and the corresponding quantitative results of task BD are also worse than those of the other tasks (ADE: 2.25, FDE: 4.04). For tasks DE and ED, since D (ZARA1) and E (ZARA2) have similar scenes, we observe from these two pairs of figures that (1) features from the source and target domains overlap more, and (2) the features become closer. This is consistent with their quantitative results (DE: 0.32/0.65, ED: 0.34/0.72) and further validates the effectiveness of our proposed adaptive learning module.

8 Discussion

Comparison with general domain adaptation methods. In this paper we delve into the domain-shift challenge in pedestrian trajectory prediction. In image- and video-related classification tasks, domain adaptation (DA) is a hot topic that aims to enable models to generalize to novel datasets with different sample distributions. In this study, we expose the challenging domain-shift issue in future trajectory prediction. We usually consider a trajectory as two parts: observation and prediction. This differs from conventional DA tasks, where data comes in the form of sample-label pairs; strictly speaking, in trajectory prediction the prediction part is not exactly the "label" of the observation part. This essential difference brings in another interesting finding worth exploring. In Fig. 5, the cluster structure is not clear because each trajectory has no category, which means there exists distribution overlap between different trajectory domains. This kind of "overlap" may be the reason for the variance across different tasks. If the "overlap" problem and the domain-shift problem can be well addressed simultaneously, trajectory prediction will become more practical and promising. On the other hand, the observation and prediction parts of one trajectory are fully consistent with each other, so the two parts may be able to swap roles and supervise each other. We hope this perspective will inspire the research community to consider the trajectory prediction problem together with the domain-shift issue.

References