Trajectory prediction is a fundamental task for autonomous applications, such as autonomous vehicles, socially compliant robots, and agents in simulators, that must navigate in a shared environment. To respond to the environment in a timely and precise manner, agents need to predict the future paths of their neighbors efficiently and accurately. Although recent works [liang2019peeking, sadeghian2019sophie, huang2019stgat] have greatly improved the modeling of complex social interactions between agents to generate accurate future paths, trajectory prediction remains a challenging task, and the deployment of prediction models in real-world applications is mostly restricted by their high computation cost. For example, some small robots are equipped with only limited computing devices that cannot afford the high inference cost of existing solutions.
In particular, trajectory prediction is typically modeled in two dimensions, i.e., the temporal dimension and the spatial dimension. The temporal dimension models the historical movement dynamics for each agent. Most of the state-of-the-art approaches [alahi2016social, gupta2018social, liang2019peeking, huang2019stgat] adopt recurrent neural networks (RNNs), e.g., Long Short-Term Memory (LSTM) [hochreiter1997long] networks, to capture such sequence dynamics since RNNs are designed for sequence modeling. However, the RNN-based models are subject to the following two limitations. First, in terms of effectiveness, training RNN models is difficult due to the gradient vanishing and exploding problem [razvan13rnn], and although RNNs are theoretically more expressive in modeling sequential data, such expressivity is largely absent in practice [bai18empirical], as supported by recent findings that feed-forward networks, e.g., Convolutional Neural Networks (CNNs), can achieve comparable or even better results than their RNN counterparts on benchmark sequence prediction tasks such as language modeling [yann17language] and machine translation [jonas17conv]. Second, in terms of efficiency, both the training and inference speed of RNN models are notoriously slow compared with their feed-forward counterparts. This is because each hidden state of an RNN depends on the previous inputs and hidden states. Therefore, the prediction of RNNs is produced sequentially and is not parallelizable.
The spatial dimension models the interaction between the agent and its neighbors. Three approaches have been proposed to capture the spatial interaction, including the pooling-based [alahi2016social, gupta2018social], distance-based [liang2019peeking] and attention-based [sadeghian2019sophie, fernando2018soft, vemula2018social, huang2019stgat] ones. The pooling-based approaches adopt the grid-based pooling [alahi2016social] or symmetric function [gupta2018social] to aggregate the hidden states from neighbors, and the distance-based approaches encode the geometric relation between the agents with an LSTM encoder. Attention-based approaches instead dynamically generate the importance of neighbors using soft attention, which is more effective in modeling complex social interactions. However, existing attention-based approaches are overly dependent on attention and neglect the geometric distance between agents compared with the pooling-based and distance-based approaches.
To address the above-discussed limitations in effectiveness and efficiency, we propose a novel CNN-based spatial-temporal graph network (STGNN), i.e., GraphTCN, to capture the spatial and temporal interaction for trajectory prediction. In the temporal dimension, in contrast to RNN-based methods, we adopt a modified gated convolutional network (TCN) to capture the temporal dynamics for each agent. The gated highway mechanism introduced to CNNs dynamically regulates the information flow by focusing on more salient features, and the feed-forward nature of CNNs makes the model more tractable in training and parallelizable for much higher efficiency both in training and inference. In the spatial dimension, we propose an edge graph attention neural network (EGAT) for each time instant to better capture the spatial interaction between the agents. Specifically, nodes in the graph represent agents, and edges between two agents denote their geometric relation. EGAT then learns the adjacency matrix, i.e., the spatial interaction, of the graph adaptively. Together, the spatial and temporal modules of GraphTCN support more effective and efficient modeling of the interactions between agents within each time step and across time steps for every agent. Our main contributions can be summarized as follows:
We propose an edge graph attention neural network (EGAT) to better capture the spatial interaction adaptively with a self-attention mechanism.
We propose to model the spatial-temporal interactions with a gated convolutional network (TCN), which proves to be more efficient and effective.
Our spatial-temporal framework achieves noticeably better performance compared with state-of-the-art approaches. Specifically, we reduce the average displacement error by 20.9% and the final displacement error by 31.3%, and achieve up to 5.37x wall-clock time speedup over existing solutions.
We organize this paper as follows: in Section 2, we introduce the background and discuss related works in detail. Our GraphTCN framework and the implementation details are introduced in Section 3. Then in Section 4, results of GraphTCN measured in both accuracy and efficiency are compared with state-of-the-art approaches. Finally, Section 5 concludes the paper.
2 Related Work
2.1 Human-Human Interactions
Research on crowd interaction models can be traced back to the Social Force model [helbing1995social], which adopts nonlinearly coupled Langevin equations to represent the attractive and repulsive forces for human movement in crowded scenarios. Similar hand-crafted approaches have attempted to model crowd interactions with continuum dynamics [treuille2006continuum], a discrete choice framework [antonini2006discrete], Gaussian processes [wang2007gaussian], and a Bayesian model [wang2008unsupervised], and have proved successful in crowd simulation [hou2014social, saboia2012crowd], crowd behavior detection [mehran2009abnormal], and trajectory prediction [yamaguchi2011you].
However, these approaches model social behavior based only on psychological or physical realization, which alone is insufficient to capture complex crowd interaction. Recent works have investigated deep learning techniques to capture the interaction between the agent and its neighbors. Social LSTM [alahi2016social] introduces the social pooling layer to aggregate the social hidden state within the local neighborhood of the agent. Social GAN [gupta2018social] employs a symmetric function to summarize the global crowd interactions, which is realized efficiently by pooling the context only once. Different from these pooling-based methods, attention-based approaches [sadeghian2019sophie, vemula2018social, fernando2018soft] differentiate the importance of neighbors by soft attention. The attention-based schemes provide better crowd understanding since they assign adaptive importance between pedestrians. Similar to the attention approach, graph attention networks (GAT) learn the social interaction by aggregating neighborhood features adaptively with the adjacency matrix. The more recent work STGAT [huang2019stgat] adopts GAT directly on the LSTM hidden states to capture the spatial interaction between pedestrians; however, it depends fully on attention and ignores the distance features of the agent.
To better capture the distance feature, we model the pedestrian interactions with a novel graph neural network EGAT, which proposes to learn the adjacency matrix of the graph. Specifically, the distance feature is used to learn an adaptive adjacency matrix for the most salient interaction information, which is then integrated into the graph convolution.
2.2 Sequence Prediction
Sequence prediction refers to the problem of predicting the future sequence using historical sequence information. There are two prevailing methods for sequence prediction, i.e., the pattern-based method and the planning-based method. Pattern-based methods summarize sequence behavior for the generation of the sequence, while planning-based methods, such as [kuderer2012feature, lee2016predicting, rehder2018pedestrian], make sequence predictions by learning a probability distribution. Recently, pattern-based methods have become mainstream for many sequence prediction tasks, e.g., speech recognition [oord2016wavenet, chorowski2014end, graves2014towards], activity recognition [donahue2015long, ibrahim2016hierarchical], and natural language processing [cho2014learning, sutskever2014sequence, gehring2017convolutional]. In particular, trajectory prediction can be formulated as a sequence prediction task, which uses historical movement patterns of the agent to generate the future path in the sequence. Most trajectory prediction methods adopt recurrent neural networks (RNNs), e.g., Long Short-Term Memory (LSTM) networks [hochreiter1997long], to capture the temporal movement in the sequence, since RNNs are designed for sequence modeling. However, RNN-based models suffer from gradient vanishing and exploding during training and from overfocusing on more recent inputs during prediction, especially for long input sequences.
To overcome these problems, many sequence prediction works [oord2016wavenet, wu2019graph] instead adopt convolutional neural networks (CNNs) and have achieved great success. The convolutional networks can better capture long-term dependency and greatly improve prediction efficiency. The superiority of CNN-based methods can be largely attributed to the convolutional operation, which is independent of preceding time-steps and thus can process in parallel. The recent work [nikhil2018convolutional] proposes a compact CNN model to capture the temporal information and an MLP layer to generate the future sequence simultaneously; their results confirm that the CNN-based model can yield competitive performance in trajectory prediction. However, it fails to model the spatial interaction between pedestrians.
In this work, we propose to capture the spatial interaction with EGAT and introduce gated convolutional networks to capture the temporal dynamics for each pedestrian. Specifically, our CNN adopts the highway network architecture [srivastava2015highway] to dynamically regulate the information flow, and skip connections [he2016deep] to facilitate both representation learning and training.
2.3 Spatial-temporal Graph Networks for Trajectory Prediction
Recently, many studies have attempted to adopt spatial-temporal graph neural networks (STGNNs) for the sequence prediction task, such as action recognition [yan2018spatial, si2019attention], taxi demand prediction [yao2018deep], and traffic prediction [yao2018modeling]. Specifically, the sequence can be formulated as a sequence of graphs of nodes and edges, where nodes correspond to the agents and edges to their interactions. The sequence can thus be effectively modeled with the spatial-temporal graph network.
In trajectory prediction, the prediction task can be modeled in two dimensions, i.e., the spatial dimension and the temporal dimension. Specifically, the spatial dimension models the interaction between the agent and its neighbors, and the temporal dimension models the historical trajectory for each agent. Therefore, in STGNNs, each node in the graph represents one pedestrian in a scene, and each edge between two nodes captures the interaction between the two corresponding pedestrians. For example, Social Attention [vemula2018social] models each node with the location of the agent and each edge with the distance between pedestrians, where the spatial relation is modeled with an attention module and the temporal relation with RNNs. Similarly, [wang2019pedestrian] constructs the STGNN with an Edge RNN and a Node RNN based on the location; STGAT [huang2019stgat] uses GAT to capture the spatial interaction by assigning different importance to neighbors and adopts extra LSTMs to capture the temporal information of each agent. The major limitation of these methods is the difficulty of capturing the spatial interaction along the temporal dimension. Notably, the future path of the agent depends not only on its current position but also on its neighbors’ positions. However, the details of such spatial interaction may be lost when RNN-based models aggregate the node features along the temporal dimension.
In contrast to the RNN-based methods, Graph WaveNet [wu2019graph] has demonstrated the capability of CNNs on the temporal modeling of the long sequence, which achieves superior performance on traffic datasets. In this paper, we propose an enhanced temporal convolutional network to integrate both the temporal dynamics of the agent and its social information, capturing the spatial and temporal correlation of the interactions.
The goal of trajectory prediction is to jointly predict the future paths of all agents that are present in the scene. Naturally, the future path of an agent depends on its historical trajectory, i.e., the temporal interaction, and also is influenced by the trajectories of neighboring agents, i.e., the spatial interaction. Consequently, the trajectory prediction model is supposed to take both features into consideration when modeling the spatial and temporal interactions for the prediction.
3.1 Problem Formulation
Formally, trajectory prediction can be formulated as follows. We assume there are $N$ pedestrians observed in a scene of length $T_{obs}$. The position of a single pedestrian $i \in \{1, \dots, N\}$ at time step $t \in \{1, \dots, T_{obs}\}$ is denoted as $p_i^t = (x_i^t, y_i^t)$. Thereby, the observed positions of pedestrian $i$ can be represented as $P_i = p_i^1, p_i^2, \dots, p_i^{T_{obs}}$. The goal of trajectory prediction is then to predict all the future positions $\hat{p}_i^t$ ($t \in \{T_{obs}+1, \dots, T_{pred}\}$) simultaneously.
3.2 Overall Framework
As illustrated in Figure 1, GraphTCN comprises three key modules: the edge graph attention (EGAT) module, the gated temporal convolutional (TCN) module, and the decoder. First, EGAT captures the spatial interaction between pedestrians at each time step. We feed only the absolute trajectories into the EGAT module since the spatial interaction should only be influenced by geometric distances among pedestrians. Then, for each time step, we embed the relative positions of each pedestrian into a fixed-length hidden space, i.e., the temporal embedding, which represents the temporal dynamics of the pedestrian, e.g., gait, speed, and acceleration. The EGAT embedding and the temporal embedding are subsequently concatenated at each time step as the input for the TCN module. The TCN module is a feed-forward one-dimensional convolutional network [oord2016wavenet] with residual and skip connections [he2016deep]. The residual connection facilitates gradient backpropagation for more stable training, and the skip connection helps forward the intermediate features to the decoder module for better prediction. Finally, the decoder module produces the future trajectories of all pedestrians simultaneously. More details of our framework are elaborated in the following sections.
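The distinction between absolute and relative positions used above can be illustrated with a minimal sketch. It assumes 2D world coordinates and defines the relative position as the displacement from the previous time step; the zero padding of the first step and the function names `to_relative` and `to_absolute` are assumptions of this sketch, not details taken from the paper.

```python
def to_relative(track):
    """Convert absolute (x, y) positions into per-step displacements.

    The first step has no predecessor, so its displacement is set to (0, 0);
    this padding choice is an assumption of the sketch.
    """
    rel = [(0.0, 0.0)]
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        rel.append((x1 - x0, y1 - y0))
    return rel


def to_absolute(start, rel):
    """Invert to_relative by accumulating displacements from a start point."""
    x, y = start
    out = []
    for dx, dy in rel:
        x, y = x + dx, y + dy
        out.append((x, y))
    return out


track = [(1.0, 2.0), (1.5, 2.0), (2.5, 3.0)]
rel = to_relative(track)
restored = to_absolute(track[0], rel)
```

Because the decoder is described as predicting relative positions, an accumulation step like `to_absolute` would be needed to recover world coordinates for evaluation.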
3.3 EGAT Module for Spatial Interaction
The EGAT module is designed to encode the spatial interaction between pedestrians. Formally, pedestrians within the same time step can be formulated as an undirected graph $G = (V, E)$, where the node $v_i \in V$ corresponds to the $i$-th pedestrian, and the weighted edge $e_{ij} \in E$ represents the human-human interaction between pedestrians $i$ and $j$. The adjacency matrix of $G$ thus represents the spatial relationships between pedestrians. Prior works [huang2019stgat] have shown that the graph attention network (GAT) [velivckovic2017graph] is effective in capturing the influence of neighbors by adaptively learning the adjacency matrix. In this work, instead of merely learning the adjacency matrix, EGAT integrates the geometric relation of the pedestrians as well. To achieve this, we adopt a doubly stochastic adjacency matrix (DSM) as the input adjacency matrix for the attention graph network.
DSMs have several useful properties, e.g., they are symmetric and positive semi-definite with a largest eigenvalue of one, which helps stabilize the graph convolution process [li19explore] used to capture the spatial interaction. Further, we note that a pedestrian is more likely to be influenced by its own historical trajectory and the neighboring pedestrians. Therefore, before normalizing into a DSM, we first construct a preliminary symmetric adjacency matrix to capture the raw spatial relation between pedestrians, by computing the geometric distance between pedestrians and introducing a self-connection for each pedestrian:
where $d_{ij}^{t}$ is the Euclidean distance between pedestrian $i$ and pedestrian $j$ at time step $t$. Then, the DSM can be produced as follows:
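The display equations for this construction are not reproduced in this text, so the following is only a sketch under stated assumptions. The raw weight `1 / (1 + d)` is a hypothetical choice that gives a self-connection weight of one and decays with distance; the normalization is one standard way to obtain a symmetric, positive semi-definite, doubly stochastic matrix (row-normalize, then multiply by the column-scaled transpose), and may differ from the exact formula used by the authors. The function names are illustrative.

```python
import math


def raw_adjacency(positions):
    """Symmetric distance-based adjacency with self-connections.

    The weight 1 / (1 + d) is a hypothetical choice: it equals 1 on the
    diagonal (d = 0, the self-connection) and decays with distance.
    """
    n = len(positions)
    return [[1.0 / (1.0 + math.dist(positions[i], positions[j]))
             for j in range(n)] for i in range(n)]


def doubly_stochastic(adj):
    """Row-normalise to R, then form E = R * diag(colsum(R))^-1 * R^T."""
    n = len(adj)
    rows = [[a / sum(row) for a in row] for row in adj]
    col_sums = [sum(rows[v][k] for v in range(n)) for k in range(n)]
    return [[sum(rows[i][k] * rows[j][k] / col_sums[k] for k in range(n))
             for j in range(n)] for i in range(n)]


positions = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
dsm = doubly_stochastic(raw_adjacency(positions))
```

Each row of `dsm` sums to one, and by symmetry so does each column, which is the property the text relies on to stabilize the subsequent graph convolution.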
The edge features of the DSM adjacency matrix are then exploited to guide the attention operation in graph attention layers [velivckovic2017graph]. Specifically, intermediate node features are obtained by the following embedding and aggregating functions:
where is an embedding function, is the embedding weight, and is the LeakyReLU activation function. Then, we have the node feature for pedestrian , where is the number of the node feature. The new node feature embeddings aggregate raw node features based on the DSM and are fed into graph attentional layers [velivckovic2017graph] to capture the spatial interaction:
where gives the importance weight of the neighbor to the pedestrian dynamically calculated via the self-attention mechanism. The graph attentional layer can learn a self-adaptive adjacency matrix that captures the relative importance of different nodes. To stabilize the self-attention process [velivckovic2017graph, wu2019graph], multi-head attention is adopted:
where is the LeakyReLU activation, is the concatenate function and indexes the -th attention head. We then obtain the final node features of , where captures the aggregated spatial interaction between pedestrian and its neighbors at each time step.
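The attention step described above follows the general pattern of graph attention layers, which can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes node features are already embedded, omits the learned linear projection and the output non-linearity, attends over all nodes, and uses hand-picked attention vectors in place of learned parameters. All names are illustrative.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with the negative slope of 0.2 used in the paper."""
    return x if x > 0 else slope * x


def attention_weights(feats, attn_vec):
    """alpha[i][j]: softmax over j of LeakyReLU(a . [h_i || h_j])."""
    alpha = []
    for h_i in feats:
        scores = [leaky_relu(sum(a * v for a, v in zip(attn_vec, h_i + h_j)))
                  for h_j in feats]
        peak = max(scores)                      # subtract max for stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alpha.append([e / total for e in exps])
    return alpha


def attend(feats, attn_vec):
    """Aggregate node features with the attention weights of one head."""
    alpha = attention_weights(feats, attn_vec)
    dim = len(feats[0])
    return [[sum(alpha[i][j] * feats[j][d] for j in range(len(feats)))
             for d in range(dim)] for i in range(len(feats))]


def multi_head(feats, attn_vecs):
    """Concatenate the outputs of several independent attention heads."""
    heads = [attend(feats, a) for a in attn_vecs]
    return [sum((head[i] for head in heads), []) for i in range(len(feats))]


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_vecs = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]  # placeholder weights
out = multi_head(feats, attn_vecs)
```

With two heads over two-dimensional features, each pedestrian ends up with a four-dimensional vector, which mirrors how concatenating heads widens the final node feature.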
3.4 TCN-based Spatial and Temporal Interaction Representation
The movement pattern of a pedestrian is significantly influenced by the historical trajectory and the moving patterns of neighboring pedestrians. Inspired by [oord2016wavenet], we propose to capture the spatial and temporal interaction between pedestrians using a modified temporal convolution network (TCN) as illustrated in Figure 2. Specifically, the TCN module adopts the causal convolution, i.e., the convolution for one-dimensional input, and by stacking causal convolution layers, the final output of the TCN can be obtained, which captures both the spatial and temporal interactions.
The network can be regarded as a short-term and long-term encoder, where lower convolution layers focus on local short-term interactions, while in higher layers, long-term interactions are captured with a larger receptive field. For example, if the kernel size of the TCN is $k$, the receptive field size in the $l$-th layer is $(k-1) \cdot l + 1$, which increases linearly with the layer index. Therefore, the top layer of the TCN captures interactions within a longer time span. Since the order of the input is important in the sequence prediction task, we adopt left padding of size $k-1$ instead of symmetric padding for the causal convolution, where each convolution output convolves over the input at the corresponding time step and the inputs of the preceding steps. In this way, the output size of each causal convolution remains the same as the input.
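The left-padding scheme and the linear growth of the receptive field can be checked with a small single-channel sketch. It assumes a scalar sequence, a fixed kernel, and no dilation; the function names are illustrative and not taken from the paper.

```python
def causal_conv1d(seq, kernel):
    """1D causal convolution: output[t] depends on seq[t-k+1 .. t] only."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)   # left padding of size k - 1
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(seq))]


def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked, non-dilated causal convolutions."""
    return (kernel_size - 1) * num_layers + 1


seq = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(seq, [1.0, 1.0, 1.0])   # running sum over 3 steps

# An impulse at t = 0 spreads over exactly `receptive_field` steps.
impulse = [1.0] + [0.0] * 7
two_layers = causal_conv1d(causal_conv1d(impulse, [1.0, 1.0, 1.0]),
                           [1.0, 1.0, 1.0])
span = sum(1 for v in two_layers if v != 0.0)
```

Here `out` has the same length as `seq`, and `span` matches `receptive_field(3, 2)`, which is the behaviour the paragraph above describes.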
To fuse the spatial and temporal interaction across time steps, we first concatenate the spatial embedding obtained by the EGAT module and the temporal context embedding obtained by Equation 8 as the input of the TCN module:
where is the position of pedestrian at -th time step, is the embedding function, is the concatenate operation, and , . Then, each causal convolution with kernel size convolves for the spatial and temporal interactions together.
The gating function has demonstrated great power for capturing temporal information in [oord2016wavenet, dauphin2017language]. It takes advantage of two non-linear functions to control the bypass signals. Therefore, we adopt a similar gated activation unit to dynamically regulate the information flow, formulated as:
where = , is the tanh activation function, denotes the sigmoid function, denotes the element-wise multiplication, and are the learnable 1D-convolution parameters, respectively. Then, the final output of the TCN module is obtained by concatenating across the time dimension, denoted as . Thereby, the embedding vector captures the spatial-temporal interaction between the -th pedestrian and its neighbors. We note that the TCN can handle much longer input sequences with the dilated convolution [oord2016wavenet], which is more efficient than RNN-based methods.
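A gated activation unit of the kind described above multiplies a tanh-activated filter branch by a sigmoid-activated gate branch, element-wise. The sketch below assumes a single scalar channel, two fixed kernels standing in for the learnable 1D-convolution parameters, and no residual or skip connections; names are illustrative.

```python
import math


def causal_conv1d(seq, kernel):
    """1D causal convolution with left padding of size k - 1."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(seq))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_unit(seq, filter_kernel, gate_kernel):
    """tanh(filter conv) multiplied element-wise by sigmoid(gate conv)."""
    filt = causal_conv1d(seq, filter_kernel)
    gate = causal_conv1d(seq, gate_kernel)
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filt, gate)]


seq = [0.5, -1.0, 2.0, 0.0, 1.5]
hidden = gated_unit(seq, filter_kernel=[0.3, -0.2, 0.8],
                    gate_kernel=[0.1, 0.4, -0.5])   # placeholder weights
```

The sigmoid branch lies in (0, 1) and scales how much of the tanh branch passes through at each time step, which is the sense in which the gate regulates the information flow.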
3.5 Future Trajectory Prediction
In real-world applications, given the historical trajectory, there are multiple plausible ways of the future movement. We therefore also model such uncertainty of the final movement in our decoder module for the trajectory prediction. Following the multimodal strategy widely adopted [gupta2018social, liang2019peeking, huang2019stgat], the decoder module produces multiple socially acceptable trajectories by introducing random noises as part of the input besides the spatial-temporal embedding . The predicted relative positions for all the pedestrians from the decoder is then:
where , $n$ is the number of plausible trajectories generated, is a multi-layer perceptron with LeakyReLU non-linearity, is the random noise vector sampled from , and is the perceptron weight.
We adopt the variety loss as the loss function for training, which computes the minimum ADE loss of the $n$ plausible trajectories:
where is the ground truth, and is the predicted trajectories. Although this loss function may lead to a diluted probability density function [thiede2019analyzing], we empirically find that it facilitates better predictions of multiple future trajectories.
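The variety loss described above can be sketched for a single pedestrian as the minimum ADE over the sampled trajectories. The sketch assumes trajectories are lists of 2D points of equal length; batching over pedestrians and gradient computation are omitted, and the function names are illustrative.

```python
import math


def ade(pred, truth):
    """Average Euclidean distance between predicted and true positions."""
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(truth)


def variety_loss(samples, truth):
    """Minimum ADE over the n sampled trajectories."""
    return min(ade(sample, truth) for sample in samples)


truth = [(0.0, 0.0), (1.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0)],   # ADE 1.0
    [(0.0, 0.0), (1.0, 0.5)],   # ADE 0.25
]
loss = variety_loss(samples, truth)
```

Only the closest sample contributes to the loss, so the remaining samples are free to cover other plausible futures.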
Following conventions [alahi2016social, gupta2018social, vemula2018social, liang2019peeking], we evaluate our GraphTCN on two trajectory prediction benchmark datasets, i.e., ETH [pellegrini2010improving] and UCY [lerner2007crowds], and compare the performance of GraphTCN with state-of-the-art approaches.
The annotated trajectories in the ETH and UCY datasets are provided as world coordinates. In these datasets, pedestrians exhibit complex behaviors, including nonlinear trajectories, moving from different directions, walking together, walking unpredictably, avoiding collisions, standing, etc. The datasets comprise five unique outdoor environments that are recorded from a fixed top view. ETH and Hotel belong to the ETH dataset, while the UCY dataset consists of UNIV, ZARA1, and ZARA2. The crowd density of a single scene differs across environments, varying from 0 to 51 pedestrians per frame. All videos are recorded at 25 frames per second (FPS), and the pedestrian trajectories are extracted at 2.5 FPS.
4.0.2 Implementation Details
We trained GraphTCN with the Adam optimizer using a learning rate of 0.0003 for 50 epochs. The embedding size of is set to 16. The EGAT module is a two-layer model with attention heads and node features in the first and second graph attention layers, respectively. The final node feature of EGAT has a dimension of 16, and has a dimension of 32. comprises 3 layers. All LeakyReLU activations have a negative slope coefficient of 0.2.
4.0.3 Evaluation Metrics
Following reporting conventions [alahi2016social, gupta2018social, liang2019peeking], the evaluation metrics adopted include Average Displacement Error (ADE) and Final Displacement Error (FDE). ADE is defined in Equation 12, which is the average Euclidean distance between the predicted trajectory and the ground truth over all prediction time steps, and FDE is the Euclidean distance between the predicted position and the ground truth position at the final prediction time step. The model is trained with the leave-one-out policy, and the results are reported accordingly. The predictions are produced for the next 4.8 seconds (i.e., 12 time steps) based on 3.2 seconds (i.e., 8 time steps) of observations.
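The two metrics and the leave-one-out protocol can be sketched as follows. The sketch assumes a single trajectory given as a list of 2D points, and it interprets leave-one-out as training on four of the five scenes and testing on the held-out one; the scene names come from the dataset description, while the function names are illustrative.

```python
import math


def ade(pred, truth):
    """Average Euclidean distance over all prediction time steps."""
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """Euclidean distance at the final prediction time step."""
    return math.dist(pred[-1], truth[-1])


def leave_one_out(scenes):
    """Yield (train_scenes, test_scene) pairs, holding out one scene each."""
    for held_out in scenes:
        yield [s for s in scenes if s != held_out], held_out


pred = [(0.0, 0.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.0, 0.0)]
scores = (ade(pred, truth), fde(pred, truth))

scenes = ["ETH", "HOTEL", "UNIV", "ZARA1", "ZARA2"]
splits = list(leave_one_out(scenes))
```

In this toy case the ADE is 2.5 and the FDE is 5.0, since the prediction is exact at the first step and 5 units off at the last.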
We compare our framework with the following baseline approaches: Linear is a linear regression model that predicts the next coordinates according to the previous input point. LSTM adopts the vanilla LSTM encoder-decoder model to predict the sequence of every single pedestrian. Social LSTM [alahi2016social] builds on top of LSTM and introduces a social pooling layer to capture the spatial interaction between pedestrians. We further compare GraphTCN with three state-of-the-art methods: Social GAN [gupta2018social] improves over Social LSTM with a socially aware GAN to generate multiple plausible trajectories. Social Attention [vemula2018social] adopts an RNN mixture model for STGNNs to capture the spatial interaction and temporal dynamics. STGAT [huang2019stgat] also adopts GAT to model the spatial information, and adopts an LSTM to capture the temporal interaction.
4.1 Quantitative Results
4.1.1 Overall Results
We compare GraphTCN with state-of-the-art baselines in Table 1. The results show that GraphTCN achieves noticeably better performance compared with existing approaches on these benchmark datasets. Specifically, our GraphTCN achieves an average ADE of 0.43 and an average FDE of 0.83. Compared with STGAT, GraphTCN reduces ADE by 20.9% and FDE by 31.3% on average. These results show that our model significantly outperforms previous models in terms of prediction accuracy, especially on the more complex and crowded datasets ZARA1 and ZARA2.
4.1.2 Speed Comparison
We compare the inference speed of GraphTCN with baseline models, including Social GAN [gupta2018social], Social Attention [vemula2018social], and STGAT [huang2019stgat]. Table 2 reports the model inference time in wall-clock seconds and the speedup factor compared with Social Attention [vemula2018social] on the same dataset. As the results show, GraphTCN achieves much faster inference than these baseline approaches. In particular, GraphTCN achieves an inference time of 0.81 seconds, which is 1.32x and 5.37x faster than Social GAN and the most similar approach, STGAT, respectively.
4.2 Qualitative Evaluation
We also investigate the prediction results of our GraphTCN by visualizing and comparing the predicted trajectories with the best-performing approach STGAT in Figure 3. We choose three unique scenarios in which the complex interactions take place. The complex interactions include pedestrian standing, pedestrian merging, pedestrian following, pedestrian avoiding, etc.
From Figure 3 (a), we can observe that GraphTCN achieves better performance on stationary pedestrians. Specifically, the trajectories generated by GraphTCN follow the same direction as the ground truth, while the predictions from STGAT deviate noticeably from the path. Figure 3 (b) shows that STGAT may fail to make satisfactory predictions when the pedestrians are from different groups, while GraphTCN gives better predictions in scenarios where one pedestrian meets another group. Figure 3 (c) demonstrates that GraphTCN can successfully produce predictions that avoid future collisions when pedestrians merge into the same direction from an angle. These qualitative results further confirm that our GraphTCN produces better trajectory predictions, which are socially plausible for both stationary pedestrians and moving crowds in complex real-world scenarios.
We also present successfully predicted trajectories plotted in real-world meters for three different scenarios in Figure 4. A more challenging scenario is the avoiding case in Figure 4, where pedestrian 8 moves only a very short distance, pedestrians 6 and 7 are almost stationary, pedestrian 5 moves alone, and two groups of pedestrians (1, 2 and 3, 4) try to avoid a collision. As we can observe from the results, our GraphTCN generates plausible short trajectories for pedestrians 6, 7, and 8, and pedestrian 5 is not affected by other pedestrians. Further, pedestrians 1, 2 and 3, 4 move in groups with collision-free future paths. In the following case in Figure 4, pedestrians 3 and 4 move together as one group, and our GraphTCN captures their group movement pattern and makes accurate group trajectory predictions. Even in the more complex deviating case in Figure 4, where more pedestrians are present, our GraphTCN produces socially acceptable predictions when pedestrians depart in opposite directions (pedestrian 3 and pedestrians 1, 2, 7) or converge towards the same direction (pedestrian 8 and pedestrians 4, 5, 6).
Figure 5 shows three failure cases of our GraphTCN. In the overshooting case in Figure 5, we notice that although our model generates predictions that share the same direction as the ground truth, the predicted trajectories overshoot the final points. The reason might be that our model predicts all the trajectories simultaneously, which makes it difficult to produce low-velocity predictions for high-velocity historical trajectories. The linear case in Figure 5 shows that our model may produce linear future trajectories when the past trajectories are close to linear. In the unpredicted case in Figure 5, the predicted paths of pedestrians 2, 3, and 4 fail. This is because the observed trajectories are relatively short compared with their future paths, and the pedestrians exhibit some unpredictable behavior, which makes the task challenging by nature.
In this paper, we propose GraphTCN for trajectory prediction, which captures the spatial and temporal interaction between pedestrians effectively by adopting EGAT to model their spatial interactions and TCN to model both the spatial and temporal interactions. The proposed GraphTCN is based entirely on feed-forward networks, which makes it more tractable during training and, more importantly, achieves significantly better prediction accuracy and higher inference speed compared with existing solutions. The advantages of GraphTCN are more evident in the scenario of long trajectory prediction. Experimental results confirm that our GraphTCN outperforms state-of-the-art approaches on all the adopted benchmark datasets.