Crowd trajectory prediction is of fundamental importance to both the computer vision [1, 16, 53, 21, 22] and robotics [34, 33] communities. This task is challenging because 1) human-human interactions are multi-modal and extremely hard to capture, e.g., strangers avoid close contact with others, while friends tend to walk in groups; 2) the complex temporal prediction is coupled with the spatial human-human interaction, e.g., humans condition their motions on both the history and the future motions of their neighbors.
Fig. 2. (a) Crowd Motion Modeling. (b) STAR Overview.
Classic models capture human-human interaction with handcrafted energy functions [19, 18, 34], which require significant feature-engineering effort and normally fail to model crowd interactions in crowded spaces. With the recent advances in deep neural networks, Recurrent Neural Networks (RNNs) have been extensively applied to trajectory prediction and have demonstrated promising performance [1, 16, 53, 21, 22]. RNN-based methods capture pedestrian motion with a latent state and model human-human interaction by merging the latent states of spatially proximal pedestrians. Social pooling [1, 16] treats pedestrians in a neighborhood area equally and merges their latent states with a pooling mechanism. Attention mechanisms [22, 53, 21] relax this assumption and weigh pedestrians according to a learned function, which encodes the unequal importance of neighboring pedestrians for trajectory prediction. However, existing predictors share two limitations: 1) the attention mechanisms used are still simple and fail to fully model human-human interaction; 2) RNNs normally have difficulty modeling complex temporal dependencies.
Recently, Transformer networks have made ground-breaking progress in Natural Language Processing (NLP) [43, 10, 26, 52, 50]. Transformers discard the sequential nature of language sequences and model temporal dependencies with only the powerful self-attention mechanism. The major benefit of the Transformer architecture is that self-attention significantly improves temporal modeling compared to RNNs, especially for long-horizon sequences. Nevertheless, Transformer-based models are restricted to plain data sequences and are hard to generalize to more structured data, e.g., graph sequences.
In this paper, we introduce the Spatio-Temporal grAph tRansformer (STAR) framework, a novel framework for spatio-temporal trajectory prediction based purely on the self-attention mechanism. We believe that learning the temporal, spatial and spatio-temporal attention is the key to accurate crowd trajectory prediction, and Transformers provide a neat and efficient solution to this task. STAR captures human-human interaction with a novel spatial graph Transformer. In particular, we introduce TGConv, a Transformer-based graph convolution mechanism. TGConv improves attention-based graph convolution with the self-attention mechanism of Transformers and can capture more complex social interactions. We model pedestrian motions with separate temporal Transformers, which capture temporal dependencies better than RNNs. STAR extracts spatio-temporal interactions among pedestrians by interleaving the spatial and temporal Transformers, a simple yet effective strategy. Besides, as Transformers treat a sequence as a bag of words, they normally have problems modeling time-series data where strong temporal consistency is enforced. We introduce an additional read-writable graph memory module that continuously smooths the embeddings during prediction. An overview of STAR is given in Fig. 2.(b).
We experiment on 5 commonly used real-world pedestrian trajectory prediction datasets. We show that STAR outperforms the state-of-the-art (SOTA) trajectory predictors on 4 out of 5 datasets and achieves comparable results on the remaining one. We conduct extensive ablation studies to provide a detailed understanding of each proposed component.
We summarize our contributions as follows: 1) we provide a new insight for trajectory prediction: spatio-temporal attention modeling is the most critical component for trajectory prediction; 2) we introduce TGConv, a novel Transformer-based graph convolution for attention-based graph feature learning; 3) we extend Transformer networks to graph-structured data sequences, in particular crowd trajectory data sequences, and introduce STAR, a framework for spatio-temporal graph-based crowd trajectory prediction that achieves SOTA results.
2.1 Self-Attention and Transformer Networks
The core idea of the Transformer is to replace recurrence entirely with a multi-head self-attention mechanism. For embeddings $\{h_t\}_{t=1}^{T}$, the self-attention of Transformers first learns a query matrix $Q$, a key matrix $K$ and a corresponding value matrix $V$ of all embeddings from $t = 1$ to $T$. It computes the attention by

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

where $d_k$ is the dimension of each query. The $1/\sqrt{d_k}$ factor implements the scaled dot product for numerical stability of the attentions. By computing self-attention between embeddings across different time steps, the self-attention mechanism learns temporal dependencies over long horizons, in contrast to RNNs, which remember the history with a single vector of limited memory. Besides, decoupling attention into query, key and value tuples allows the self-attention mechanism to capture more complex temporal dependencies.
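The scaled dot-product attention of Eq. 1 can be sketched in a few lines of NumPy (the single-sequence setting and array shapes are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. 1: Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (T, T) pairwise logits
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                    # weighted sum of values
```

When all queries are zero, the softmax is uniform and each output row reduces to the mean of the value vectors, which makes the weighting easy to sanity-check.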
The multi-head attention mechanism learns to combine multiple hypotheses when computing attention. It allows the model to jointly attend to information from different representations at different positions. With $k$ heads, we have

$$\mathrm{MultiHead}(Q, K, V) = f_{fc}\left(\mathrm{Attn}_1 \,\|\, \cdots \,\|\, \mathrm{Attn}_k\right)$$

where $f_{fc}$ is a fully connected layer merging the outputs from the $k$ heads, $\|$ denotes concatenation, and $\mathrm{Attn}_i$ denotes the self-attention of the $i$-th head. Additional positional encoding is used to add positional information to the Transformer embeddings. Finally, the Transformer outputs the updated embeddings through a fully connected layer with two skip connections.
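The head-wise computation and the merging layer can be sketched as follows (the projection matrices and shapes are illustrative assumptions):

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (Eq. 1)."""
    d_k = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_k)
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(h, W_q, W_k, W_v, W_fc):
    """Run one attention per head, concatenate, merge with a linear layer f_fc."""
    heads = [attention(h @ wq, h @ wk, h @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_fc   # f_fc merges the k heads
```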
However, one major limitation of current Transformer-based models is that they only apply to non-structured data sequences, e.g., word sequences. STAR extends Transformers to more structured data sequences, as a first step graph sequences, and applies them to trajectory prediction.
2.2 Related Works
2.2.1 Graph Neural Networks
Graph Neural Networks (GNNs) are powerful deep learning architectures for graph-structured data. Graph convolutions [27, 24, 9, 15, 47] have demonstrated significant improvements on graph machine learning tasks, e.g., modeling physical systems [4, 28], drug prediction and social recommendation systems. In particular, Graph Attention Networks (GAT) implement efficient weighted message passing between nodes and have achieved state-of-the-art results across multiple domains. From the sequence prediction perspective, temporal graph RNNs allow learning spatio-temporal relationships in graph sequences [8, 17]. STAR improves GAT with TGConv, a Transformer-boosted attention mechanism, and tackles spatio-temporal graph modeling with a Transformer architecture.
2.2.2 Sequence Prediction
RNNs and their variants, e.g., LSTM and GRU, have achieved great success in sequence prediction tasks, e.g., speech recognition [46, 39], robot localization [14, 36] and robot decision making [23, 37]. RNNs have also been successfully applied to model the temporal motion patterns of pedestrians [1, 16, 21, 53, 22]. RNN-based predictors make predictions with a Seq2Seq structure. Additional structures, e.g., social pooling [1, 16], attention mechanisms [48, 45, 22] and graph neural networks [21, 53], are used to improve trajectory prediction with social interaction modeling.
Transformer networks have dominated Natural Language Processing in recent years [43, 10, 26, 52, 50]. Transformer models completely discard recurrence and focus on attention across time steps. This architecture allows long-term dependency modeling and large-batch parallel training. The Transformer architecture has also been applied with success to other domains, e.g., stock prediction and robot decision making. STAR is a generalization of the Transformer to graph sequences. We demonstrate it on the challenging crowd trajectory prediction task, where we consider crowd interaction as a graph. STAR is a general framework and could be applied to other graph sequence prediction tasks, e.g., event prediction in social networks and physical system modeling. We leave this for future study.
2.2.3 Crowd Interaction Modeling
As pioneering work, Social Force models [19, 32] have proven effective in various applications, e.g., crowd analysis and robotics. They assume that pedestrians are driven by virtual forces for goal navigation and collision avoidance. Social Force models work well for interaction modeling but perform poorly on trajectory prediction. Geometry-based methods, e.g., ORCA and PORCA, consider the geometry of the agent and convert interaction modeling into an optimization problem. One major limitation of classic approaches is that they rely on hand-crafted features, which are non-trivial to tune and hard to generalize.
Deep-learning-based models achieve automatic feature engineering by learning the model directly from data. Behavior CNNs capture crowd interaction with CNNs. Social pooling [1, 16] further encodes the proximal pedestrian states with a pooling mechanism that approximates crowd interaction. Recent works consider the crowd as a graph and merge the information of spatially proximal pedestrians with attention mechanisms [48, 45, 22]. Attention mechanisms model the unequal importance of pedestrians, in contrast to pooling methods. Graph neural networks have also been applied to crowd modeling [21, 53]; explicit message passing allows the network to model more complex social behaviors.
In this section, we introduce the proposed spatio-temporal graph Transformer based trajectory prediction framework, STAR. We believe attention is the most important factor for effective and efficient trajectory prediction.
STAR decomposes the spatio-temporal attention modeling into temporal modeling and spatial modeling. For temporal modeling, STAR considers each pedestrian independently and applies a standard temporal Transformer network to extract the temporal dependencies. The temporal Transformer provides a better temporal dependency modeling protocol compared to RNNs. For spatial modeling, we introduce TGConv, a Transformer-based message passing graph convolution mechanism. TGConv improves the state-of-the-art graph convolution methods with a better attention mechanism and gives a better model for complex spatial interactions. We construct two encoder modules, each including a pair of spatial and temporal Transformers, and stack them to extract spatio-temporal interactions.
3.2 Problem Setup
We are interested in the problem of predicting the future trajectories from time step $T_{obs}+1$ to $T_{pred}$ of all $N$ pedestrians involved in a scene, given the observed history during time steps $1$ to $T_{obs}$. At each time step $t$, we have a set of pedestrian positions $p_t = \{p_t^i\}_{i=1}^{N}$, where $p_t^i$ denotes the position of pedestrian $i$ in a top-down-view map. We assume that each pedestrian pair $(i, j)$ with distance less than $d$ has an undirected edge $e_t^{ij}$. This leads to an interaction graph at each time step $t$: $G_t = (V_t, E_t)$, where $V_t = \{p_t^i\}_{i=1}^{N}$ and $E_t = \{e_t^{ij}\}$. For each node $i$ at time $t$, we define its neighbor set as $Nb(i, t)$, where for each node $j \in Nb(i, t)$, $\|p_t^i - p_t^j\|_2 < d$.
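The per-time-step graph construction can be sketched as follows (a NumPy illustration; the dictionary representation of edges and neighbor sets is an assumption for exposition):

```python
import numpy as np

def interaction_graph(positions, d):
    """Build the interaction graph for one time step: an undirected edge joins
    every pedestrian pair closer than distance d; also return neighbor sets."""
    n = positions.shape[0]
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    edges = {(i, j) for i in range(n) for j in range(i + 1, n) if dists[i, j] < d}
    neighbors = {i: {j for j in range(n) if j != i and dists[i, j] < d}
                 for i in range(n)}
    return edges, neighbors
```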
Fig. 3. (a) Temporal Transformer. (b) Spatial Transformer.
3.3 Temporal Transformer
The temporal Transformer block in STAR takes a set of pedestrian trajectory embeddings as input and outputs a set of updated embeddings with temporal dependencies, considering each pedestrian independently.
The structure of a temporal Transformer block is given in Fig. 3.(a). The self-attention block first learns the query matrix $Q^i$, key matrix $K^i$ and value matrix $V^i$ given the inputs. For the $i$-th pedestrian, we have

$$Q^i = f_Q(h^i), \quad K^i = f_K(h^i), \quad V^i = f_V(h^i)$$

where $f_Q$, $f_K$ and $f_V$ are the corresponding query, key and value functions shared by all pedestrians $\{1, \dots, N\}$. We can parallelize the computation over all pedestrians, benefiting from GPU acceleration.
We compute the attention for each single pedestrian separately, following Eq. 1. Similarly, the multi-head attention ($k$ heads) for pedestrian $i$ is

$$\mathrm{MultiHead}(Q^i, K^i, V^i) = f_{fc}\left(\mathrm{Attn}_1(Q^i, K^i, V^i) \,\|\, \cdots \,\|\, \mathrm{Attn}_k(Q^i, K^i, V^i)\right)$$

where $f_{fc}$ is a fully connected layer that merges the $k$ heads and $\mathrm{Attn}_j$ indexes the $j$-th head. The final embedding is generated by two skip connections and a final fully connected layer, as shown in Fig. 3.(a).
The temporal Transformer is a simple generalization of Transformer networks to a set of data sequences. We demonstrate in our experiments that the Transformer-based architecture provides better temporal modeling.
3.4 Spatial Transformer
The spatial Transformer block extracts the spatial interaction among pedestrians. We propose a novel Transformer based graph convolution, TGConv, for message passing on a graph.
Our key observation is that the self-attention mechanism can be regarded as message passing on an undirected fully connected graph. For a feature vector $h^i$ of the feature set $\{h^i\}_{i=1}^{N}$, we can represent its corresponding query vector as $q^i = f_Q(h^i)$, key vector as $k^i = f_K(h^i)$ and value vector as $v^i = f_V(h^i)$. We define the message from node $j$ to node $i$ in the fully connected graph as

$$m^{j \to i} = \langle q^i, k^j \rangle$$

and the attention function (Eq. 1) can be rewritten as

$$\mathrm{Attn}(i) = \mathrm{softmax}\left(\left[\frac{m^{1 \to i}}{\sqrt{d_k}}, \dots, \frac{m^{N \to i}}{\sqrt{d_k}}\right]\right)\left[v^1, \dots, v^N\right]^{\top}$$
Built upon the above insight, we introduce Transformer-based Graph Convolution (TGConv). TGConv is essentially an attention-based graph convolution mechanism, similar to GATConv, but with a stronger attention mechanism powered by Transformers. Consider an arbitrary graph $G = (V, E)$, where $V$ is the node set and $E$ the edge set. Assume each node $i \in V$ is associated with an embedding $h^i$ and a neighbor set $Nb(i)$. The graph convolution operation for node $i$ is written as

$$\mathrm{Attn}(i) = \mathrm{softmax}\left(\left[\frac{m^{j \to i}}{\sqrt{d_k}}\right]_{j \in Nb(i) \cup \{i\}}\right)\left[v^j\right]_{j \in Nb(i) \cup \{i\}}^{\top}$$
$$\bar{h}^i = \mathrm{Attn}(i) + h^i$$
$$\hat{h}^i = f_{out}(\bar{h}^i) + \bar{h}^i$$

where $f_{out}$ is the output function, in our case a fully connected layer, and $\hat{h}^i$ is the updated embedding of node $i$ by TGConv. We summarize the TGConv function for node $i$ as $\hat{h}^i = \mathrm{TGConv}(h^i)$. In a Transformer structure, we would normally apply layer normalization after each skip connection in the above equations; we omit it for notational clarity.
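A minimal NumPy sketch of the TGConv operation follows, with placeholder callables standing in for the learned query, key, value and output layers, and layer normalization omitted as in the text (all function names here are illustrative, not the paper's API):

```python
import numpy as np

def tgconv(h, neighbors, f_q, f_k, f_v, f_out):
    """Attention restricted to Nb(i) ∪ {i}, followed by two skip connections."""
    d_k = f_q(h[0]).shape[-1]
    out = np.empty_like(h)
    for i in range(h.shape[0]):
        idx = sorted(neighbors[i] | {i})          # node i also attends to itself
        m = np.array([f_q(h[i]) @ f_k(h[j]) for j in idx]) / np.sqrt(d_k)
        m = m - m.max()
        w = np.exp(m)
        w = w / w.sum()                           # softmax over the neighborhood
        att = w @ np.stack([f_v(h[j]) for j in idx])
        h_bar = att + h[i]                        # first skip connection
        out[i] = f_out(h_bar) + h_bar             # output layer + second skip
    return out
```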
The spatial Transformer, as shown in Fig. 3.(b), can be easily implemented with TGConv. A TGConv with shared weights is applied to each graph separately. We believe TGConv is general and can be applied to other tasks, e.g., drug prediction and social recommendation systems. We leave this for future study.
3.5 Spatio-Temporal Graph Transformer
In this section, we introduce the Spatio-Temporal grAph tRansformer (STAR) framework for pedestrian trajectory prediction.
The temporal Transformer models the motion dynamics of each pedestrian separately but fails to incorporate spatial interactions; the spatial Transformer tackles crowd interaction with TGConv but is hard to generalize to temporal sequences. One major challenge of pedestrian prediction is modeling the coupled spatio-temporal interaction: the spatial and temporal dynamics of pedestrians are tightly dependent on each other. For example, when deciding her next action, a pedestrian first predicts the future motions of her neighbors and chooses an action that avoids collisions with others over a time interval.
STAR addresses the coupled spatio-temporal modeling by interleaving the spatial and temporal Transformers in a single framework. Fig. 4 shows the network structure of STAR. STAR has two encoder modules and a simple decoder module. The input to the network is the pedestrian position sequences from time step $1$ to $t$, where the pedestrian positions at time step $t$ are denoted by $p_t = \{p_t^i\}_{i=1}^{N}$. In the first encoder, we embed the positions with two separate fully connected layers and pass the embeddings to the spatial Transformer and the temporal Transformer to extract independent spatial and temporal information from the pedestrian history. The spatial and temporal features are then merged by a fully connected layer, which gives a set of new features with spatio-temporal encodings. To further model spatio-temporal interaction in the feature space, we post-process the features with the second encoder module. In encoder 2, the spatial Transformer models spatial interaction with temporal information, and the temporal Transformer enhances the resulting spatial embeddings with temporal attention. STAR predicts the pedestrian positions at time $t+1$ with a simple fully connected layer, taking the embeddings from the second temporal Transformer as input. We construct $G_{t+1}$ by connecting the nodes with distance smaller than $d$ according to the predicted positions. The prediction is appended to the history for the next-step prediction.
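The data flow of one prediction step can be sketched as below, with every sub-module passed in as a placeholder callable (all names here are illustrative assumptions, not the paper's API):

```python
import numpy as np

def star_step(x, embed_t, embed_s, temporal_tf1, spatial_tf1,
              fuse, spatial_tf2, temporal_tf2, predict):
    """One STAR prediction step: encoder 1 extracts independent temporal and
    spatial features, a fully connected layer fuses them, encoder 2 interleaves
    spatial and temporal attention, and a linear head predicts t+1."""
    h_temporal = temporal_tf1(embed_t(x))  # encoder 1: independent temporal features
    h_spatial = spatial_tf1(embed_s(x))    # encoder 1: independent spatial features
    h = fuse(h_temporal, h_spatial)        # fully connected merge
    h = spatial_tf2(h)                     # encoder 2: spatial attention on fused features
    h = temporal_tf2(h)                    # encoder 2: temporal attention on spatial output
    return predict(h)                      # positions at t+1, appended to history
```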
The STAR architecture significantly improves the spatio-temporal modeling ability compared to naively combining spatial and temporal Transformers.
3.6 External Graph Memory
Although Transformer networks improve long-horizon sequence modeling with the self-attention mechanism, they can have difficulty handling continuous time-series data that requires strong temporal consistency. Temporal consistency, however, is a strict requirement for trajectory prediction, because pedestrian positions normally do not change sharply over a short period.
We introduce a simple external graph memory to tackle this dilemma. A graph memory $M = \{M^i\}_{i=1}^{N}$ is read-writable and learnable, where $M^i$ has the same size as $h^i$ and memorizes the embeddings of pedestrian $i$. At time step $t$, in encoder 1, the temporal Transformer first reads the past graph embeddings from memory with a read function and concatenates them with the current graph embedding $h_t$. This allows the temporal Transformer to condition the current embeddings on the previous embeddings for a consistent prediction. In encoder 2, we write the output of the temporal Transformer to the graph memory with a write function, which performs smoothing over the time-series data. At every time step, the embeddings are thus updated with information from the previous steps, which gives temporally smoother embeddings and a more consistent trajectory.
Many functional forms could be adopted for the read and write operations. In this paper, we consider only a very simple strategy:

$$\mathrm{write}(h_t, M): M \leftarrow h_t, \qquad \mathrm{read}(M): \hat{h}_t \leftarrow M$$

that is, we directly replace the memory with the embeddings and copy the memory to generate the output. This simple strategy works well in practice. More complicated functional forms of the read and write operations could be considered, e.g., fully connected layers or RNNs. We leave this for future study.
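The simple replace-and-copy strategy above can be sketched as a small class (a NumPy illustration; the class and attribute names are assumptions, not the paper's code):

```python
import numpy as np

class GraphMemory:
    """Simplest read/write strategy: writing replaces the memory with the
    embeddings; reading returns a copy of the memory."""
    def __init__(self, n_pedestrians, dim):
        self.M = np.zeros((n_pedestrians, dim))
    def write(self, h):
        self.M = np.array(h, copy=True)   # M <- h
    def read(self):
        return self.M.copy()              # h_hat <- M
```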
In this section, we first report our results on five pedestrian trajectory datasets that serve as the major benchmark for trajectory prediction: the ETH (ETH and HOTEL) and UCY (ZARA1, ZARA2 and UNIV) datasets. We compare STAR to 9 trajectory predictors, including the SOTA model SR-LSTM. We follow the leave-one-out cross-validation evaluation strategy commonly adopted by previous works. We also perform extensive ablation studies to understand the effect of each proposed component and to provide deeper insights for model design in the trajectory prediction task.
As a brief conclusion, we show that: 1) STAR outperforms the SOTA model on 4 out of 5 datasets and has comparable performance on the other; 2) the spatial Transformer improves crowd interaction modeling compared to existing graph convolution methods; 3) the temporal Transformer generally improves over the LSTM; 4) the graph memory gives smoother temporal predictions and better performance.
4.1 Experiment Setup
We follow the same data preprocessing strategy as SR-LSTM. The origin of all inputs is shifted to the last observation frame, and random rotation is adopted for data augmentation.
Average Displacement Error (ADE): the mean squared error (MSE) over all estimated positions in the predicted trajectory and the ground-truth trajectory.
Final Displacement Error (FDE): the distance between the predicted final destination and the ground-truth final destination.
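The two metrics can be computed as follows. This sketch uses the mean Euclidean displacement convention common on this benchmark (an assumption; the ADE definition above mentions MSE, and some works report squared error instead):

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all time steps."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the last time step."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))
```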
We take 8 frames (3.2 s) as the observation sequence and 12 frames (4.8 s) as the target sequence for prediction, for a fair comparison with all existing works. In addition, we mainly compare our results with 7 deterministic methods, which predict a single position of a target at each time point instead of a distribution.
4.2 Implementation Details
Coordinates are first encoded into vectors of size 32 by a fully connected layer followed by ReLU activation. A dropout ratio of 0.1 is applied when processing the input data. All Transformer layers accept inputs with feature size 32. Both the spatial Transformer and the temporal Transformer consist of two basic encoding layers with 8 heads. We performed a hyper-parameter search over the learning rate, from 0.0001 to 0.004 with an interval of 0.0001, on a smaller network and chose the best-performing learning rate (0.0015) to train all the other models. As a result, we train the network with the Adam optimizer, a learning rate of 0.0015 and a batch size of 4 for 300 epochs. Each batch contains around 128 pedestrians in different time windows, indicated by an attention mask, to accelerate training and inference.
We compare STAR with a wide range of baselines, including: 1) LR: a simple temporal linear regressor; 2) LSTM: a vanilla temporal LSTM; 3) S-LSTM: each pedestrian is modeled with an LSTM, and the hidden state is pooled with neighbors at each time step; 4) Social Attention: it models the crowd as a spatio-temporal graph and uses two LSTMs to capture spatial and temporal dynamics; 5) CIDNN: a modularized approach for spatio-temporal crowd trajectory prediction with LSTMs; 6) SGAN: a stochastic trajectory predictor with GANs; 7) SoPhie: one of the SOTA stochastic trajectory predictors with LSTMs; 8) TrafficPredict: an LSTM-based motion predictor for heterogeneous traffic agents (note that TrafficPredict reports isometrically normalized results; we scale them back for a consistent comparison); 9) SR-LSTM: the SOTA trajectory predictor with a motion gate and pair-wise attention to refine the hidden state encoded by an LSTM to obtain social interactions.
4.4 Quantitative Results and Analyses
We compare STAR with the state-of-the-art approaches mentioned in Section 4.3. Note that the stochastic methods [16, 40] are also presented for reference but are not directly comparable to our approach. The stochastic methods sample 20 times and report the best-performing sample, which is infeasible for many online applications, e.g., autonomous driving, because measuring the prediction error requires hindsight knowledge of the ground truth.
The main results are presented in Table 1. We observe that STAR outperforms SOTA models on 4 out of 5 datasets and achieves comparable performance on the other one. STAR also achieves the best average performance over the 5 datasets. Besides, STAR significantly outperforms the stochastic methods, even though they use 20 samples and select the best one compared to the ground truth. This suggests that STAR, as a deterministic model, successfully captures the underlying stochastic nature of the complex crowd dynamics.
One interesting finding is that the simple linear model LR significantly outperforms many deep learning approaches, including the SOTA model SR-LSTM, on the HOTEL scene, which mostly contains straight-line trajectories and is relatively less crowded. This indicates that the complex models might overfit to complex scenes like UNIV. Another example is that STAR significantly outperforms SR-LSTM on ETH and HOTEL but is only comparable to SR-LSTM on UNIV, where the crowd density is high. This can potentially be explained by the fact that SR-LSTM has a well-designed gated structure for message passing on the graph but a relatively weak temporal model, a single LSTM. The design of SR-LSTM potentially improves spatial modeling but might also lead to overfitting. In contrast, our approach performs well in both simple and complex scenes. We further demonstrate this in Sect. 4.5 with visualized results.
4.5 Qualitative Results and Analyses
STAR is able to predict temporally consistent trajectories. In Fig. 5.(a), STAR successfully captures the intention and velocity of a single pedestrian, where no social interaction exists.
STAR successfully extracts the social interaction of the crowd. We visualize the attention values of the second spatial Transformer in Fig. 7. We notice that pedestrians are paying high attention to themselves and the neighbors who might potentially collide with them, e.g., Fig. 7.(c) and (d); less attention is paid to spatially far away pedestrians and pedestrians without conflict of intentions, e.g., Fig. 7.(a) and (b).
STAR is able to capture the spatio-temporal interaction of the crowd. In Fig. 5.(b), we can see that the prediction for a pedestrian considers the future motions of their neighbors. In addition, STAR balances spatial and temporal modeling better than SR-LSTM. SR-LSTM potentially overfits on spatial modeling and often predicts curves even when pedestrians are walking straight. This corresponds to our finding in the quantitative analyses that deep predictors overfit to complex datasets. STAR alleviates this issue with its spatio-temporal Transformer structure.
Auxiliary information is required for more accurate trajectory prediction. Although STAR achieves SOTA results, the prediction can still occasionally be inaccurate, e.g., Fig. 5.(d). The pedestrian takes a sharp turn, which makes it impossible to predict the future trajectory purely from the history of locations. In future work, additional information, e.g., the environment setup or a map, should be used to provide extra cues for prediction.
4.6 Ablation Studies
We conduct extensive ablation studies on all 5 datasets to understand the influence of each STAR component. The results are presented in Table 2.
The temporal Transformer improves the temporal modeling of pedestrian dynamics compared to RNNs. In (4) and (5), we remove the graph memory and fix the spatial encoder of STAR. The temporal prediction ability of these two models then depends only on their temporal encoders: an LSTM for (4) and the temporal Transformer for (5). We observe that the model with temporal Transformer encoding outperforms the LSTM in overall performance, which suggests that Transformers provide better temporal modeling than RNNs.
TGConv outperforms other graph convolution methods for crowd motion modeling. In (1), (2), (3) and (7), we change the spatial encoders and compare the spatial Transformer with TGConv (7) against GCN, GATConv and multi-head additive graph convolution. We observe that TGConv, in the crowd modeling scenario, achieves a higher performance gain than the two alternative attention-based graph convolutions.
Interleaving spatial and temporal Transformers better extracts spatio-temporal correlations. In (6) and (7), we observe that the two-encoder structure proposed in the STAR framework (7) generally outperforms the single-encoder structure (6). This empirical performance gain suggests that interleaving the spatial and temporal Transformers extracts more complex spatio-temporal interactions of pedestrians.
The graph memory gives smoother temporal embeddings and improves performance. In (5) and (7), we verify the embedding smoothing ability of the graph memory module, where (5) is the STAR variant without graph memory. We first notice that the graph memory improves the performance of STAR on all datasets. In addition, on ZARA1, where the spatial interaction is simple and temporally consistent prediction matters most, the graph memory improves (5) to (7) by the largest margin. From this empirical evidence, we conclude that the embedding smoothing of the graph memory improves the overall temporal modeling of STAR.
We have introduced STAR, a framework for spatio-temporal crowd trajectory prediction using only attention mechanisms. STAR consists of two encoder modules composed of spatial and temporal Transformers. We have also introduced TGConv, a novel and powerful Transformer-based graph convolution mechanism. STAR achieves state-of-the-art performance on 4 out of 5 commonly used pedestrian trajectory prediction datasets.
STAR makes predictions only from past trajectories, which may fail on unpredictable sharp turns. Additional information, e.g., the environment configuration, could be incorporated into the framework to address this issue.
The STAR framework and TGConv are not limited to trajectory prediction; they can be applied to other graph learning tasks. We leave this for future study.
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 961–971 (2016)
-  Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
-  Battaglia, P., Pascanu, R., Lai, M., Rezende, D.J., et al.: Interaction networks for learning about objects, relations and physics. In: Advances in neural information processing systems. pp. 4502–4510 (2016)
-  Chen, B., Barzilay, R., Jaakkola, T.: Path-augmented graph transformer network (05 2019). https://doi.org/10.26434/chemrxiv.8214422
-  Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pp. 1724–1734 (2014)
-  Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
-  Cui, Z., Henrickson, K., Ke, R., Wang, Y.: Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems (2019)
Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems. pp. 3844–3852 (2016)
-  Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
-  Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., Yin, D.: Graph neural networks for social recommendation. In: The World Wide Web Conference. pp. 417–426 (2019)
-  Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 538–547 (2019)
-  Ferrer, G., Garrell, A., Sanfeliu, A.: Robot companion: A social-force based approach with human awareness-navigation in crowded environments. In: International Conference on Intelligent Robots and Systems (2013)
-  Förster, A., Graves, A., Schmidhuber, J.: Rnn-based learning of compact maps for efficient robot localization. In: 15th European Symposium on Artificial Neural Networks, ESANN (2007)
-  Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1263–1272. JMLR. org (2017)
-  Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social gan: Socially acceptable trajectories with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2255–2264 (2018)
-  Hajiramezanali, E., Hasanzadeh, A., Narayanan, K., Duffield, N., Zhou, M., Qian, X.: Variational graph recurrent neural networks. In: Advances in Neural Information Processing Systems. pp. 10700–10710 (2019)
-  Helbing, D., Buzna, L., Johansson, A., Werner, T.: Self-organized pedestrian crowd dynamics: Experiments, simulations, and design solutions. Transportation science 39(1), 1–24 (2005)
-  Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Physical review E 51(5), 4282 (1995)
-  Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
-  Huang, Y., Bi, H., Li, Z., Mao, T., Wang, Z.: Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6272–6281 (2019)
-  Ivanovic, B., Pavone, M.: The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2375–2384 (2019)
-  Karkus, P., Ma, X., Hsu, D., Kaelbling, L.P., Lee, W.S., Lozano-Pérez, T.: Differentiable algorithm networks for composable robot learning. arXiv preprint arXiv:1905.11602 (2019)
-  Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
-  Kuderer, M., Kretzschmar, H., Sprunk, C., Burgard, W.: Feature-based prediction of trajectories for socially compliant navigation. In: Robotics: science and systems (2012)
-  Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
-  Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015)
-  Li, Y., Wu, J., Tedrake, R., Tenenbaum, J.B., Torralba, A.: Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566 (2018)
-  Lim, B., Arik, S.O., Loeff, N., Pfister, T.: Temporal fusion transformers for interpretable multi-horizon time series forecasting. arXiv preprint arXiv:1912.09363 (2019)
-  Liu, J., Lin, H., Liu, X., Xu, B., Ren, Y., Diao, Y., Yang, L.: Transformer-based capsule network for stock movement prediction. In: Proceedings of the First Workshop on Financial Technology and Natural Language Processing. pp. 66–73 (2019)
-  Liu, K., Sun, X., Jia, L., Ma, J., Xing, H., Wu, J., Gao, H., Sun, Y., Boulnois, F., Fan, J.: Chemi-net: a molecular graph convolutional network for accurate drug property prediction. International journal of molecular sciences 20(14), 3389 (2019)
-  Löhner, R.: On the modeling of pedestrian motion. Applied Mathematical Modelling 34(2), 366–382 (2010)
-  Luo, Y., Cai, P.: Gamma: A general agent motion prediction model for autonomous driving. arXiv preprint arXiv:1906.01566 (2019)
-  Luo, Y., Cai, P., Bera, A., Hsu, D., Lee, W.S., Manocha, D.: Porca: Modeling and planning for autonomous driving among many pedestrians. IEEE Robotics and Automation Letters (2018)
-  Ma, X., Gao, X., Chen, G.: Beep: A bayesian perspective early stage event prediction model for online social networks. In: 2017 IEEE International Conference on Data Mining (ICDM). pp. 973–978. IEEE (2017)
-  Ma, X., Karkus, P., Hsu, D., Lee, W.S.: Particle filter recurrent neural networks. arXiv preprint arXiv:1905.12885 (2019)
-  Ma, X., Karkus, P., Hsu, D., Lee, W.S., Ye, N.: Discriminative particle filter reinforcement learning for complex partial observations. arXiv preprint arXiv:2002.09884 (2020)
-  Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., Manocha, D.: Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. Proceedings of the AAAI Conference on Artificial Intelligence 33, 6120–6127 (07 2019). https://doi.org/10.1609/aaai.v33i01.33016120
-  Miao, Y., Gowayyed, M., Metze, F.: Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). pp. 167–174. IEEE (2015)
-  Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, H., Savarese, S.: Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
-  Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp. 3104–3112 (2014)
-  Van Den Berg, J., Guy, S.J., Lin, M., Manocha, D.: Reciprocal n-body collision avoidance. In: Robotics research, pp. 3–19. Springer (2011)
-  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
-  Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
-  Vemula, A., Muelling, K., Oh, J.: Social attention: Modeling attention in human crowds. In: 2018 IEEE international Conference on Robotics and Automation (ICRA). pp. 1–7. IEEE (2018)
-  Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., Stolcke, A.: The Microsoft 2017 conversational speech recognition system. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5934–5938 (2018)
-  Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018)
-  Xu, Y., Piao, Z., Gao, S.: Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5275–5284 (2018)
-  Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: Xlnet: Generalized autoregressive pretraining for language understanding. In: Advances in neural information processing systems. pp. 5754–5764 (2019)
-  Yi, S., Li, H., Wang, X.: Pedestrian behavior understanding and prediction with deep neural networks. In: European Conference on Computer Vision. pp. 263–279. Springer (2016)
-  Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13(3), 55–75 (2018)
-  Zhang, P., Ouyang, W., Zhang, P., Xue, J., Zheng, N.: Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12085–12094 (2019)
Additional Attention Visualization
Ablation Trajectory Prediction Visualizations
[Figure: predicted trajectories for the ablation variants — (a) GAT + STAR; (b) MHA + STAR; (c) STAR + LSTM; (d) STAR without Graph Memory; (e) Simplified STAR without Encoder 2]