1 Introduction
Crowd trajectory prediction is of fundamental importance to both the computer vision [1, 16, 53, 21, 22] and robotics [34, 33] communities. This task is challenging because 1) human-human interactions are multimodal and extremely hard to capture, e.g., strangers avoid intimate contact with others, while companions tend to walk in groups [53]; 2) the complex temporal prediction is coupled with the spatial human-human interaction, e.g., humans condition their motions on both the history and the future motion of their neighbors [21].

[Figure: (a) Crowd Motion Modeling; (b) STAR Overview]
Classic models capture human-human interaction by hand-crafted energy functions [19, 18, 34], which require significant feature-engineering effort and often fail to model crowd interactions in crowded spaces [21]. With the recent advances in deep neural networks, Recurrent Neural Networks (RNNs) have been extensively applied to trajectory prediction and have demonstrated promising performance [1, 16, 53, 21, 22]. RNN-based methods capture pedestrian motion with a latent state and model human-human interaction by merging the latent states of spatially proximal pedestrians. Social pooling [1, 16] treats pedestrians in a neighborhood equally and merges their latent states with a pooling mechanism. Attention mechanisms [22, 53, 21] relax this assumption and weigh pedestrians according to a learned function, which encodes the unequal importance of neighboring pedestrians for trajectory prediction. However, existing predictors share two limitations: 1) the attention mechanisms used are still simple and fail to fully model human-human interaction; 2) RNNs often have difficulty modeling complex temporal dependencies [43].

Recently, Transformer networks have made groundbreaking progress in the Natural Language Processing (NLP) domain [43, 10, 26, 52, 50]. Transformers discard the sequential nature of language and model temporal dependencies with only the powerful self-attention mechanism. The major benefit of the Transformer architecture is that self-attention significantly improves temporal modeling, especially for long-horizon sequences, compared to RNNs [43]. Nevertheless, Transformer-based models are restricted to plain data sequences, and it is hard to generalize them to more structured data, e.g., graph sequences.

In this paper, we introduce the Spatio-Temporal grAph tRansformer (STAR) framework, a novel framework for spatio-temporal trajectory prediction based purely on the self-attention mechanism. We believe that learning the temporal, spatial, and spatio-temporal attentions is the key to accurate crowd trajectory prediction, and Transformers provide a neat and efficient solution to this task. STAR captures human-human interaction with a novel spatial graph Transformer. In particular, we introduce TGConv, a Transformer-based graph convolution mechanism. TGConv improves attention-based graph convolution [44] with the self-attention mechanism of Transformers and can capture more complex social interactions. We model pedestrian motions with separate temporal Transformers, which capture temporal dependencies better than RNNs. STAR extracts spatio-temporal interaction among pedestrians by interleaving spatial and temporal Transformers, a simple yet effective strategy. Besides, as Transformers treat a sequence as a bag of words, they often have trouble modeling time-series data where strong temporal consistency is required [29]. We introduce an additional read-writable graph memory module that continuously smooths the embeddings during prediction. An overview of STAR is given in Fig. 2(b).
We experimented on 5 commonly used real-world pedestrian trajectory prediction datasets. We show that STAR outperforms the state-of-the-art (SOTA) trajectory predictors on 4 out of 5 datasets and achieves comparable results on the remaining one. We conduct extensive ablation studies to provide a detailed understanding of each proposed component.
We summarize our contributions as follows: 1) we provide a new insight for trajectory prediction: spatio-temporal attention modeling is the most critical component for trajectory prediction; 2) we introduce TGConv, a novel Transformer-based graph convolution for attention-based graph feature learning; 3) we extend Transformer networks to graph-structured data sequences, in particular crowd trajectory sequences, and introduce STAR, a framework for spatio-temporal graph-based crowd trajectory prediction that achieves SOTA results.
2 Background
2.1 Self-Attention and Transformer Networks
Transformer networks have achieved great success in the NLP domain, in tasks such as machine translation, sentiment analysis, and text generation [10]. Transformer networks follow the famous encoder-decoder structure widely used in RNN seq2seq models [3, 6].

The core idea of the Transformer is to replace recurrence entirely with a multi-head self-attention mechanism. For embeddings \{h_t\}_{t=1}^T, the self-attention of Transformers first learns a query matrix Q, a key matrix K, and a corresponding value matrix V of all embeddings from time step 1 to T. It computes the attention by
\mathrm{Att}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \qquad (1)
where d_k is the dimension of each query. The 1/\sqrt{d_k} term implements the scaled dot product, which stabilizes the attention numerically. By computing self-attention between embeddings across different time steps, the self-attention mechanism is able to learn temporal dependencies over a long time horizon, in contrast to RNNs, which remember the history with a single vector of limited memory. Besides, decoupling attention into query, key, and value tuples allows the self-attention mechanism to capture more complex temporal dependencies.
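For concreteness, the scaled dot-product attention of Eq. 1 can be sketched in a few lines of NumPy. This is a minimal sketch: the sequence length, embedding size, and random projection matrices are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Eq. 1: Att(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) pairwise attention logits
    return softmax(scores, axis=-1) @ V  # (T, d) weighted sum of values

# Toy check: T = 4 time steps, d = 8 embedding dimensions.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(h @ Wq, h @ Wk, h @ Wv)  # (4, 8) updated embeddings
```

Each row of the softmax output sums to one, so every updated embedding is a convex combination of the value vectors across all time steps.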
The multi-head attention mechanism learns to combine multiple hypotheses when computing attentions. It allows the model to jointly attend to information from different representations at different positions. With k heads, we have
h' = f_{fc}\big(\mathrm{Att}_1, \mathrm{Att}_2, \ldots, \mathrm{Att}_k\big), \qquad \mathrm{Att}_i = \mathrm{Att}(Q_i, K_i, V_i) \qquad (2)
where f_{fc} is a fully connected layer merging the output from the k heads and \mathrm{Att}_i denotes the self-attention of the i-th head. Additional positional encoding is used to add positional information to the Transformer embeddings. Finally, the Transformer outputs the updated embeddings through a fully connected layer with two skip connections.
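Eq. 2 can be sketched by running k independent attention heads and merging their concatenated outputs with one fully connected layer. Here a single weight matrix `Wo` stands in for f_{fc}; all names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(h, params, Wo):
    # Eq. 2: run k independent attention heads, then merge with a
    # fully connected layer f_fc (here a single weight matrix Wo).
    heads = []
    for Wq, Wk, Wv in params:                # one (Wq, Wk, Wv) tuple per head
        Q, K, V = h @ Wq, h @ Wk, h @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        heads.append(softmax(scores) @ V)    # Att_i: output of head i
    return np.concatenate(heads, axis=-1) @ Wo  # f_fc merges the k heads

rng = np.random.default_rng(1)
T, d, k, d_h = 5, 16, 4, 4                   # 4 heads of size d/k = 4
h = rng.normal(size=(T, d))
params = [tuple(rng.normal(size=(d, d_h)) for _ in range(3)) for _ in range(k)]
Wo = rng.normal(size=(k * d_h, d))
out = multi_head_attention(h, params, Wo)    # (T, d) updated embeddings
```

Each head attends with its own projections, so different heads can specialize in different temporal patterns before the merge.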
However, one major limitation of current Transformer-based models is that they only apply to non-structured data sequences, e.g., word sequences. STAR extends Transformers to more structured data sequences, as a first step graph sequences, and applies them to trajectory prediction.
2.2 Related Works
2.2.1 Graph Neural Networks
Graph Neural Networks (GNNs) are powerful deep learning architectures for graph-structured data. Graph convolutions [27, 24, 9, 15, 47] have demonstrated significant improvements on graph machine learning tasks, e.g., modeling physical systems [4, 28], drug prediction [31], and social recommendation systems [11]. In particular, Graph Attention Networks (GAT) [44] implement efficient weighted message passing between nodes and achieve state-of-the-art results across multiple domains. From the sequence prediction perspective, temporal graph RNNs allow learning spatio-temporal relationships in graph sequences [8, 17]. STAR improves GAT with TGConv, a Transformer-boosted attention mechanism, and tackles spatio-temporal graph modeling with a Transformer architecture.

2.2.2 Sequence Prediction
RNNs and their variants, e.g., LSTM [20] and GRU [7], have achieved great success in sequence prediction tasks, e.g., speech recognition [46, 39], robot localization [14, 36], and robot decision making [23, 37]. RNNs have also been successfully applied to model the temporal motion patterns of pedestrians [1, 16, 21, 53, 22]. RNN-based predictors make predictions with a Seq2Seq structure [41]. Additional structures, e.g., social pooling [1, 16], attention mechanisms [48, 45, 22], and graph neural networks [21, 53], are used to improve trajectory prediction with social interaction modeling.
Transformer networks have dominated the Natural Language Processing domain in recent years [43, 10, 26, 52, 50]. Transformer models completely discard recurrence and focus on attention across time steps. This architecture allows long-term dependency modeling and large-batch parallel training. The Transformer architecture has also been applied successfully to other domains, e.g., stock prediction [30] and robot decision making [12]. STAR is a generalization of the Transformer to graph sequences. We demonstrate it on the challenging crowd trajectory prediction task, where we consider crowd interaction as a graph. STAR is a general framework and could be applied to other graph sequence prediction tasks, e.g., event prediction in social networks [35] and physical system modeling [28]. We leave this for future study.
2.2.3 Crowd Interaction Modeling
As pioneering work, Social Force models [19, 32] have proven effective in various applications, e.g., crowd analysis [18] and robotics [13]. They assume pedestrians are driven by virtual forces for goal navigation and collision avoidance. Social Force models work well for interaction modeling but perform poorly on trajectory prediction [25]. Geometry-based methods, e.g., ORCA [42] and PORCA [34], consider the geometry of the agents and convert interaction modeling into an optimization problem. One major limitation of these classic approaches is that they rely on hand-crafted features, which are non-trivial to tune and hard to generalize.
Deep learning based models achieve automatic feature engineering by learning the model directly from data. Behavior CNNs [51] capture crowd interaction with CNNs. Social pooling [1, 16] further encodes proximal pedestrian states by a pooling mechanism that approximates crowd interaction. Recent works consider the crowd as a graph and merge information from spatially proximal pedestrians with attention mechanisms [48, 45, 22]. Compared to pooling methods, attention mechanisms model pedestrians with unequal importance. Graph neural networks have also been applied to crowd modeling [21, 53]; explicit message passing allows the network to model more complex social behaviors.
3 Method
3.1 Overview
In this section, we introduce STAR, the proposed spatio-temporal graph Transformer based trajectory prediction framework. We believe attention is the most important factor for effective and efficient trajectory prediction.
STAR decomposes spatio-temporal attention modeling into temporal modeling and spatial modeling. For temporal modeling, STAR considers each pedestrian independently and applies a standard temporal Transformer network to extract temporal dependencies. The temporal Transformer provides a better temporal dependency modeling protocol than RNNs. For spatial modeling, we introduce TGConv, a Transformer-based message-passing graph convolution mechanism. TGConv improves on state-of-the-art graph convolution methods with a better attention mechanism and gives a better model of complex spatial interactions. We construct two encoder modules, each containing a pair of spatial and temporal Transformers, and stack them to extract spatio-temporal interactions.
3.2 Problem Setup
We are interested in predicting the future trajectories during time steps T_{obs}+1 to T_{pred} of all N pedestrians involved in a scene, given the observed history during time steps 1 to T_{obs}. At each time step t, we have a set of pedestrian positions \{p_t^i = (x_t^i, y_t^i)\}_{i=1}^N, where p_t^i denotes the position of pedestrian i in a top-down view map. We assume that each pedestrian pair (i, j) with distance less than a threshold d has an undirected edge (i, j). This leads to an interaction graph at each time step t: G_t = (V_t, E_t), where V_t = \{i\}_{i=1}^N and E_t = \{(i, j) : \|p_t^i - p_t^j\|_2 < d\}. For each node i at time t, we define its neighbor set as N_t(i) = \{j : (i, j) \in E_t\}.
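The interaction graph G_t can be built directly from the positions at each frame; a minimal NumPy sketch, where the function name and the toy positions are our own illustrative choices:

```python
import numpy as np

def interaction_graph(positions, d):
    # Build G_t = (V_t, E_t): undirected edge (i, j) whenever pedestrians
    # i and j are closer than the threshold d. positions: (N, 2) array.
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # (N, N) pairwise distances
    adj = (dist < d) & ~np.eye(n, dtype=bool)        # exclude self-loops
    edges = {(i, j) for i in range(n)
             for j in range(i + 1, n) if adj[i, j]}
    neighbors = {i: set(np.flatnonzero(adj[i])) for i in range(n)}
    return edges, neighbors

# Three pedestrians: the first two are close, the third is far away.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
edges, nbrs = interaction_graph(pos, d=2.0)
# edges == {(0, 1)}; nbrs[2] is empty
```

The same routine is reused at prediction time to construct G_{t+1} from the predicted positions.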
[Fig. 3: (a) Temporal Transformer; (b) Spatial Transformer]
3.3 Temporal Transformer
The temporal Transformer block in STAR takes a set of pedestrian trajectory embeddings as input and outputs a set of updated embeddings with temporal dependencies, considering each pedestrian independently.
The structure of a temporal Transformer block is given in Fig. 3(a). The self-attention block first learns the query matrix Q^i, key matrix K^i, and value matrix V^i given the inputs. For the i-th pedestrian, we have
Q^i = f_Q(h^i), \qquad K^i = f_K(h^i), \qquad V^i = f_V(h^i) \qquad (3)
where f_Q, f_K, and f_V are the corresponding query, key, and value functions shared by all pedestrians. The computation can be parallelized over all pedestrians, benefiting from GPU acceleration.
We compute the attention for each pedestrian separately, following Eq. 1. Similarly, the multi-head attention (k heads) for pedestrian i is represented as
Q_j^i = f_Q^j(h^i), \qquad K_j^i = f_K^j(h^i), \qquad V_j^i = f_V^j(h^i) \qquad (4)

\mathrm{Att}_j^i = \mathrm{Att}(Q_j^i, K_j^i, V_j^i) \qquad (5)

\hat{h}^i = f_{fc}\big(\mathrm{Concat}(\mathrm{Att}_1^i, \ldots, \mathrm{Att}_k^i)\big) \qquad (6)
where f_{fc} is a fully connected layer that merges the k heads and j indexes the j-th head. The final embedding is generated by two skip connections and a final fully connected layer, as shown in Fig. 3(a).
The temporal Transformer is a simple generalization of Transformer networks to a set of data sequences. We demonstrate in our experiments that the Transformer-based architecture provides better temporal modeling.
3.4 Spatial Transformer
The spatial Transformer block extracts the spatial interactions among pedestrians. We propose a novel Transformer-based graph convolution, TGConv, for message passing on a graph.
Our key observation is that the self-attention mechanism can be regarded as message passing on an undirected fully connected graph. For a feature vector h^i in a feature set \{h^i\}_{i=1}^n, we can represent its corresponding query vector as q^i = f_Q(h^i), key vector as k^i = f_K(h^i), and value vector as v^i = f_V(h^i). We define the message from node j to node i in the fully connected graph as
m^{j \to i} = (q^i)^\top k^j \qquad (7)
and the attention function (Eq. 1) can be rewritten as
\mathrm{Att}(h^i) = \mathrm{Softmax}\left(\frac{\big[m^{1 \to i}, \ldots, m^{n \to i}\big]}{\sqrt{d_k}}\right)\big[v^1, \ldots, v^n\big]^\top \qquad (8)
Built upon this insight, we introduce Transformer-based Graph Convolution (TGConv). TGConv is essentially an attention-based graph convolution mechanism, similar to GATConv [44], but with a better attention mechanism powered by Transformers. Consider an arbitrary graph G = (V, E), where V is the node set and E is the edge set. Assume each node i is associated with an embedding h^i and a neighbor set N(i). The graph convolution operation for node i is written as
\hat{h}^i = h^i + \mathrm{Softmax}\left(\frac{\big[m^{i \to i}, \{m^{j \to i}\}_{j \in N(i)}\big]}{\sqrt{d_k}}\right)\big[v^i, \{v^j\}_{j \in N(i)}\big]^\top \qquad (9)

h'^i = \hat{h}^i + f_{out}(\hat{h}^i) \qquad (10)
where f_{out} is the output function, in our case a fully connected layer, and h'^i is the updated embedding of node i produced by TGConv. We summarize the TGConv operation for node i as h'^i = \mathrm{TGConv}(h^i). In a Transformer structure, we would normally apply layer normalization [2] after each skip connection in the above equations; we omit it from the equations for clean notation.
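As reconstructed in Eqs. 7-10, TGConv restricts each node's attention to itself and its neighbor set. A minimal loop-based NumPy sketch follows; the weight matrices, scaling, and toy graph are illustrative assumptions, and layer normalization is omitted as in the equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tgconv(H, neighbors, Wq, Wk, Wv, Wout):
    # TGConv (Eqs. 7-10): per-node attention over the node itself and its
    # neighbor set N(i), with two skip connections.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_k = Q.shape[-1]
    H_out = np.empty_like(H)
    for i in range(len(H)):
        idx = [i] + sorted(neighbors[i])           # self plus neighbors N(i)
        m = np.array([Q[i] @ K[j] for j in idx])   # messages m^{j->i} = q_i . k_j
        att = softmax(m / np.sqrt(d_k)) @ V[idx]   # Eq. 9 attention term
        h_hat = H[i] + att                         # Eq. 9 skip connection
        H_out[i] = h_hat + h_hat @ Wout            # Eq. 10: f_out with skip
    return H_out

rng = np.random.default_rng(2)
N, d = 4, 8
H = rng.normal(size=(N, d))
neighbors = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}  # node 3 is isolated
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
H_new = tgconv(H, neighbors, *Ws)
```

Note that an isolated node (here node 3) attends only to itself, so its update reduces to the two skip connections around its own value vector.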
The spatial Transformer, as shown in Fig. 3(b), can be easily implemented with TGConv. A TGConv with shared weights is applied to each graph separately. We believe TGConv is general and can be applied to other tasks, e.g., drug prediction [31] and social recommendation systems [11]. We leave this for future study.
3.5 Spatio-Temporal Graph Transformer
In this section, we introduce the Spatio-Temporal grAph tRansformer (STAR) framework for pedestrian trajectory prediction.
The temporal Transformer can model the motion dynamics of each pedestrian separately but fails to incorporate spatial interactions; the spatial Transformer tackles crowd interaction with TGConv but is hard to generalize to temporal sequences. One major challenge of pedestrian prediction is modeling the coupled spatio-temporal interaction: the spatial and temporal dynamics of a pedestrian are tightly dependent on each other. For example, when a pedestrian decides her next action, she first predicts the future motions of her neighbors and chooses an action that avoids collision with them within a time interval.
STAR addresses the coupled spatio-temporal modeling by interleaving the spatial and temporal Transformers in a single framework. Fig. 4 shows the network structure of STAR. STAR has two encoder modules and a simple decoder module. The input to the network is the set of pedestrian position sequences from time step 1 to t, where the pedestrian positions at time step t are denoted by \{p_t^i\}_{i=1}^N. In the first encoder, we embed the positions with two separate fully connected layers and pass the embeddings to the spatial Transformer and the temporal Transformer, to extract independent spatial and temporal information from the pedestrian history. The spatial and temporal features are then merged by a fully connected layer, which gives a set of new features with spatio-temporal encodings. To further model spatio-temporal interaction in the feature space, we post-process the features with a second encoder module. In encoder 2, the spatial Transformer models spatial interaction with temporal information, and the temporal Transformer enhances the output spatial embeddings with temporal attention. STAR predicts the pedestrian positions at time t+1 using a simple fully connected layer, with the embeddings from the second temporal Transformer as input. We construct G_{t+1} by connecting the nodes with distance smaller than d according to the predicted positions. The prediction is appended to the history for the next-step prediction.
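The interleaving described above can be condensed into a structural sketch of one prediction step. Everything here is an illustrative skeleton under our own assumptions: the `transformer_stub` placeholder stands in for the real spatial/temporal Transformer blocks, and the weight shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
W_emb_s, W_emb_t, W_fuse, W_pred = (rng.normal(size=s) * 0.1 for s in
                                    [(2, d), (2, d), (2 * d, d), (d, 2)])

def transformer_stub(X):
    # Stand-in for a spatial or temporal Transformer block: any map that
    # preserves the (T, N, d) shape. A real block would apply (TG)Conv or
    # temporal self-attention here.
    return np.tanh(X)

def star_step(obs):
    # obs: (T_obs, N, 2) observed positions; returns (N, 2) next positions.
    s = transformer_stub(obs @ W_emb_s)           # encoder 1, spatial branch
    t = transformer_stub(obs @ W_emb_t)           # encoder 1, temporal branch
    f = np.concatenate([s, t], axis=-1) @ W_fuse  # fuse spatio-temporal features
    f = transformer_stub(transformer_stub(f))     # encoder 2: spatial, temporal
    return f[-1] @ W_pred                         # decode last-step embeddings

obs = rng.normal(size=(8, 3, 2))                  # 8 frames, 3 pedestrians
pred = star_step(obs)                             # (3, 2) predicted positions
# Autoregressive rollout: append the prediction to the history and repeat.
obs = np.concatenate([obs, pred[None]], axis=0)
```

The point of the sketch is the data flow: two parallel branches in encoder 1, a fusion layer, a second interleaved encoder, and a one-layer decoder feeding an autoregressive loop.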
The STAR architecture significantly improves the spatio-temporal modeling ability compared to naively combining spatial and temporal Transformers.
3.6 External Graph Memory
Although Transformer networks improve long-horizon sequence modeling through self-attention, they can have difficulty handling continuous time-series data that require strong temporal consistency [29]. Temporal consistency, however, is a strict requirement for trajectory prediction, because pedestrian positions normally do not change sharply over a short period.
We introduce a simple external graph memory to tackle this dilemma. The graph memory M = \{M^1, \ldots, M^T\} is read-writable and learnable, where M^t has the same size as V_t and memorizes the embeddings of the pedestrians. At time step t, in encoder 1, the temporal Transformer first reads the past graph embeddings from memory with a function f_{read} and concatenates them with the current graph embeddings. This allows the temporal Transformer to condition the current embeddings on the previous embeddings for a consistent prediction. In encoder 2, we write the output of the temporal Transformer to the graph memory with a function f_{write}, which performs smoothing over the time-series data. At every step, the embeddings are updated with the information accumulated in the memory, which gives temporally smoother embeddings and a more consistent trajectory.
Many functional forms could be adopted for f_{read} and f_{write}. In this paper, we consider only a very simple strategy:
f_{read}(M^{t-1}) = M^{t-1} \qquad (11)

M^t = f_{write}(\hat{h}^t) = \hat{h}^t \qquad (12)
that is, we directly replace the memory with the embeddings and copy the memory to generate the output. This simple strategy works well in practice. More complicated functional forms of f_{read} and f_{write} could be considered, e.g., fully connected layers or RNNs. We leave this for future study.
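The replace-and-copy strategy of Eqs. 11-12 is a few lines of code. A minimal sketch: the class name and the concatenation axis used when conditioning encoder 1 on the past are illustrative assumptions.

```python
import numpy as np

class GraphMemory:
    # Minimal read/write strategy of Eqs. 11-12: write replaces the stored
    # embeddings outright; read returns a copy of what was stored.
    def __init__(self):
        self.M = None

    def write(self, H):          # Eq. 12: M^t <- embeddings (direct replacement)
        self.M = H.copy()

    def read(self):              # Eq. 11: output is a copy of the memory
        return None if self.M is None else self.M.copy()

mem = GraphMemory()
h_prev = np.zeros((3, 32))       # embeddings of 3 pedestrians at step t-1
mem.write(h_prev)
h_cur = np.ones((3, 32))
# Encoder 1 conditions on the past: concatenate the read memory with the
# current embeddings (axis choice is an assumption of this sketch).
h_in = np.concatenate([mem.read(), h_cur], axis=0)
```

Because `read` returns a copy, downstream layers cannot mutate the stored memory in place, which keeps the write-then-read cycle well defined across time steps.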
4 Experiments
In this section, we first report our results on five pedestrian trajectory datasets that serve as the major benchmark for trajectory prediction: the ETH (ETH and HOTEL) and UCY (ZARA1, ZARA2, and UNIV) datasets. We compare STAR to 9 trajectory predictors, including the SOTA model, SR-LSTM [53]. We follow the leave-one-out cross-validation evaluation strategy commonly adopted by previous works. We also perform extensive ablation studies to understand the effect of each proposed component and to provide deeper insights for model design in the trajectory prediction task.
As a brief conclusion, we show that: 1) STAR outperforms the SOTA model on 4 out of 5 datasets and has comparable performance on the remaining one; 2) the spatial Transformer improves crowd interaction modeling compared to existing graph convolution methods; 3) the temporal Transformer generally improves over the LSTM; 4) the graph memory gives smoother temporal predictions and better performance.
4.1 Experiment Setup
We follow the same data preprocessing strategy as SR-LSTM [53]. The origin of all inputs is shifted to the last observation frame. Random rotation is adopted for data augmentation.

Average Displacement Error (ADE): the mean squared error (MSE) over all estimated positions in the predicted trajectory relative to the ground-truth trajectory.

Final Displacement Error (FDE): the distance between the predicted final destination and the ground-truth final destination.
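The two metrics above are straightforward to implement; the common benchmark convention computes mean Euclidean distances. A minimal sketch with our own function names and a toy trajectory:

```python
import numpy as np

def ade(pred, gt):
    # Average Displacement Error: mean L2 distance over all predicted steps.
    # pred, gt: (T_pred, 2) arrays of positions for one pedestrian.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    # Final Displacement Error: L2 distance at the last predicted step.
    return np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()

# Toy example: 12 predicted frames offset from ground truth by (0.3, 0.4),
# a constant displacement of norm 0.5.
gt = np.stack([np.linspace(0, 11, 12), np.zeros(12)], axis=-1)
pred = gt + np.array([0.3, 0.4])
# ade(pred, gt) == fde(pred, gt) == 0.5
```

In the benchmark both metrics are averaged over all pedestrians and test sequences.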
We take 8 frames (3.2 s) as the observation sequence and 12 frames (4.8 s) as the target sequence for prediction, for a fair comparison with all existing works. In addition, we mainly compare our results with 7 deterministic methods, in which only one unique position per target at each time point is predicted instead of a distribution.
4.2 Implementation Details
Input coordinates are first encoded into a vector of size 32 by a fully connected layer followed by ReLU activation. A dropout ratio of 0.1 is applied when processing the input data. All Transformer layers accept input with feature size 32. Both the spatial Transformer and the temporal Transformer consist of two basic encoding layers with 8 heads. We performed a hyperparameter search over the learning rate, from 0.0001 to 0.004 with an interval of 0.0001, on a smaller network and chose the best-performing learning rate (0.0015) to train all other models. As a result, we train the network using the Adam optimizer with a learning rate of 0.0015 and batch size 4 for 300 epochs. Each batch contains around 128 pedestrians in different time windows, indicated by an attention mask, to accelerate training and inference.
4.3 Baselines
We compare STAR with a wide range of baselines: 1) LR: a simple temporal linear regressor; 2) LSTM: a vanilla temporal LSTM; 3) S-LSTM [1]: each pedestrian is modeled with an LSTM, and the hidden states are pooled with neighbors at each time step; 4) Social Attention [45]: models the crowd as a spatio-temporal graph and uses two LSTMs to capture spatial and temporal dynamics; 5) CIDNN [49]: a modularized approach to spatio-temporal crowd trajectory prediction with LSTMs; 6) S-GAN [16]: a stochastic trajectory predictor with GANs; 7) SoPhie [40]: one of the SOTA stochastic trajectory predictors with LSTMs; 8) TrafficPredict [38]: an LSTM-based motion predictor for heterogeneous traffic agents (TrafficPredict in [38] reports isometrically normalized results; we scale them back for a consistent comparison); 9) SR-LSTM [53]: the SOTA trajectory predictor, with a motion gate and pair-wise attention to refine the LSTM-encoded hidden states for social interactions.
4.4 Quantitative Results and Analyses
We compare STAR with the state-of-the-art approaches mentioned in Section 4.3. Note that the stochastic methods [16, 40] are presented for reference but are not directly comparable to our approach. Each stochastic method samples 20 trajectories and reports the best-performing sample, which is infeasible for many online applications, e.g., autonomous driving, because measuring the prediction error requires hindsight knowledge of the ground truth.
The main results are presented in Table 1. We observe that STAR outperforms the SOTA models on 4 out of 5 datasets and achieves comparable performance on the remaining one. STAR also achieves the best average performance over the 5 datasets. Besides, STAR significantly outperforms the stochastic methods, even though they use 20 samples and select the best one against the ground truth. This suggests that STAR, as a deterministic model, successfully captures the underlying stochastic nature of complex crowd dynamics.
One interesting finding is that the simple LR model significantly outperforms many deep learning approaches, including the SOTA model SR-LSTM, on the HOTEL scene, which mostly contains straight-line trajectories and is relatively less crowded. This indicates that complex models might overfit to complex scenes like UNIV. Another example is that STAR significantly outperforms SR-LSTM on ETH and HOTEL but is only comparable to SR-LSTM on UNIV, where the crowd density is high. A potential explanation is that SR-LSTM has a well-designed gated structure for message passing on the graph but a relatively weak temporal model, a single LSTM. The design of SR-LSTM potentially improves spatial modeling but might also lead to overfitting. In contrast, our approach performs well in both simple and complex scenes. We further demonstrate this in Sect. 4.5 with visualized results.
Table 1: Performance (ADE/FDE)

| Deterministic | ETH | HOTEL | ZARA1 | ZARA2 | UNIV | AVERAGE |
|---|---|---|---|---|---|---|
| LR | 1.33/2.94 | 0.39/0.72 | 0.62/1.21 | 0.77/1.48 | 0.82/1.59 | 0.79/1.59 |
| LSTM | 1.13/2.39 | 0.69/1.47 | 0.64/1.43 | 0.54/1.21 | 0.73/1.60 | 0.75/1.62 |
| S-LSTM [1] | 0.77/1.60 | 0.38/0.80 | 0.51/1.19 | 0.39/0.89 | 0.58/1.28 | 0.53/1.15 |
| CIDNN [49] | 1.25/2.32 | 1.31/1.86 | 0.90/1.28 | 0.50/1.04 | 0.51/1.07 | 0.89/1.73 |
| Social Attention [45] | 1.39/2.39 | 2.51/2.91 | 1.25/2.54 | 1.01/2.17 | 0.88/1.75 | 1.41/2.35 |
| TrafficPredict [38] | 5.46/9.73 | 2.55/3.57 | 4.32/8.00 | 3.76/7.20 | 3.31/6.37 | 3.88/6.97 |
| SR-LSTM [53] | 0.63/1.25 | 0.37/0.74 | 0.41/0.90 | 0.32/0.70 | 0.51/1.10 | 0.45/0.94 |
| STAR | 0.56/1.11 | 0.26/0.50 | 0.40/0.89 | 0.31/0.71 | 0.52/1.13 | 0.41/0.87 |

| Stochastic | ETH | HOTEL | ZARA1 | ZARA2 | UNIV | AVERAGE |
|---|---|---|---|---|---|---|
| S-GAN [16] | 0.81/1.52 | 0.72/1.61 | 0.34/0.69 | 0.42/0.84 | 0.60/1.26 | 0.58/1.18 |
| SoPhie [40] | 0.70/1.43 | 0.76/1.67 | 0.30/0.63 | 0.38/0.78 | 0.54/1.24 | 0.54/1.15 |
4.5 Qualitative Results and Analyses
We present our qualitative results in Fig. 5 and Fig. 7. We focus on the visualization of STAR and compare it with the SOTA model, SR-LSTM.

STAR is able to predict temporally consistent trajectories. In Fig. 5(a), STAR successfully captures the intention and velocity of a single pedestrian where no social interaction exists.

STAR successfully extracts the social interactions of the crowd. We visualize the attention values of the second spatial Transformer in Fig. 7. We notice that pedestrians pay high attention to themselves and to neighbors who might potentially collide with them, e.g., Fig. 7(c) and (d); less attention is paid to spatially distant pedestrians and to pedestrians without conflicting intentions, e.g., Fig. 7(a) and (b).

STAR is able to capture the spatio-temporal interaction of the crowd. In Fig. 5(b), we can see that the prediction for each pedestrian considers the future motions of its neighbors. In addition, STAR balances spatial and temporal modeling better than SR-LSTM. SR-LSTM potentially overfits on spatial modeling and often tends to predict curves even when pedestrians are walking straight. This corresponds to our finding in the quantitative analysis that deep predictors overfit to complex datasets. STAR better alleviates this issue with its spatio-temporal Transformer structure.

Auxiliary information is required for more accurate trajectory prediction. Although STAR achieves SOTA results, predictions can still occasionally be inaccurate, e.g., Fig. 5(d): the pedestrian takes a sharp turn, which makes the future trajectory impossible to predict purely from the history of locations. In future work, additional information, e.g., the environment setup or a map, could provide extra cues for prediction.
4.6 Ablation Studies
We conduct extensive ablation studies on all 5 datasets to understand the influence of each STAR component. The results are presented in Table 2.
Table 2: Components and Performance (MAD/FAD)

| | SP | TP | GM | ETH | HOTEL | ZARA1 | ZARA2 | UNIV | AVERAGE |
|---|---|---|---|---|---|---|---|---|---|
| (1) | GCN | STAR | ✓ | 3.06/5.57 | 0.99/1.80 | 2.49/4.58 | 1.37/2.52 | 1.38/2.47 | 1.86/3.34 |
| (2) | GAT | STAR | ✓ | 0.64/1.25 | 0.34/0.72 | 0.47/1.09 | 0.37/0.86 | 0.55/1.19 | 0.48/1.02 |
| (3) | MHA | STAR | ✓ | 0.58/1.15 | 0.25/0.48 | 0.50/0.98 | 0.35/0.76 | 0.60/1.24 | 0.56/0.92 |
| (4) | STAR | LSTM | | 0.66/1.29 | 0.34/0.68 | 0.45/0.96 | 0.34/0.74 | 0.60/1.29 | 0.48/0.99 |
| (5) | STAR | STAR | | 0.60/1.18 | 0.28/0.60 | 0.53/1.13 | 0.36/0.76 | 0.57/1.20 | 0.47/0.97 |
| (6) | VSTAR | VSTAR | ✓ | 0.61/1.18 | 0.29/0.56 | 0.48/1.00 | 0.36/0.76 | 0.58/1.24 | 0.46/0.95 |
| (7) | STAR | STAR | ✓ | 0.56/1.11 | 0.26/0.50 | 0.40/0.89 | 0.31/0.71 | 0.52/1.13 | 0.41/0.87 |

The temporal Transformer improves the temporal modeling of pedestrian dynamics compared to RNNs. In (4) and (5), we remove the graph memory and fix the spatial encoder to STAR. The temporal prediction ability of these two models then depends only on their temporal encoders: LSTM for (4) and the temporal Transformer for (5). We observe that the model with temporal Transformer encoding outperforms LSTM in overall performance, which suggests that Transformers provide better temporal modeling than RNNs.

TGConv outperforms the other graph convolution methods on crowd motion modeling. In (1), (2), (3), and (7), we vary the spatial encoder and compare the spatial Transformer with TGConv (7) against GCN [24], GATConv [44], and the multi-head additive graph convolution (MHA) [5]. We observe that TGConv, in the crowd modeling scenario, achieves a higher performance gain than the other two attention-based graph convolutions.

Interleaving the spatial and temporal Transformers better extracts spatio-temporal correlations. In (6) and (7), we observe that the two-encoder structure proposed in the STAR framework (7) generally outperforms the single-encoder structure (6). This empirical performance gain suggests that interleaving the spatial and temporal Transformers extracts more complex spatio-temporal interactions among pedestrians.

Graph memory gives smoother temporal embeddings and improves performance. In (5) and (7), we verify the embedding-smoothing ability of the graph memory module, where (5) is the STAR variant without GM. We first notice that graph memory improves the performance of STAR on all datasets. In addition, we notice that on ZARA1, where the spatial interaction is simple and temporally consistent prediction matters most, graph memory improves (5) to (7) by the largest margin. From this empirical evidence, we conclude that the embedding smoothing of the graph memory improves the overall temporal modeling of STAR.
5 Conclusion
We have introduced STAR, a framework for spatio-temporal crowd trajectory prediction based only on attention mechanisms. STAR consists of two encoder modules composed of spatial Transformers and temporal Transformers. We have also introduced TGConv, a novel and powerful Transformer-based graph convolution mechanism. STAR achieves state-of-the-art performance on 4 out of 5 commonly used pedestrian trajectory prediction datasets.
STAR makes predictions using only past trajectories, which may fail on unpredictable sharp turns. Additional information, e.g., the environment configuration, could be incorporated into the framework to address this issue.
The STAR framework and TGConv are not limited to trajectory prediction; they can be applied to other graph learning tasks. We leave this for future study.
References

[1]
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., FeiFei, L., Savarese, S.: Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 961–971 (2016)
 [2] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
 [3] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
 [4] Battaglia, P., Pascanu, R., Lai, M., Rezende, D.J., et al.: Interaction networks for learning about objects, relations and physics. In: Advances in neural information processing systems. pp. 4502–4510 (2016)
 [5] Chen, B., Barzilay, R., Jaakkola, T.: Pathaugmented graph transformer network (05 2019). https://doi.org/10.26434/chemrxiv.8214422
 [6] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pp. 1724–1734 (2014)
 [7] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
 [8] Cui, Z., Henrickson, K., Ke, R., Wang, Y.: Traffic graph convolutional recurrent neural network: A deep learning framework for networkscale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems (2019)

 [9] Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems. pp. 3844–3852 (2016)
 [10] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
 [11] Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., Yin, D.: Graph neural networks for social recommendation. In: The World Wide Web Conference. pp. 417–426 (2019)
 [12] Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 538–547 (2019)
 [13] Ferrer, G., Garrell, A., Sanfeliu, A.: Robot companion: A social-force based approach with human awareness navigation in crowded environments. In: International Conference on Intelligent Robots and Systems (2013)
 [14] Förster, A., Graves, A., Schmidhuber, J.: RNN-based learning of compact maps for efficient robot localization. In: 15th European Symposium on Artificial Neural Networks, ESANN (2007)
 [15] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 1263–1272. JMLR.org (2017)
 [16] Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social gan: Socially acceptable trajectories with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2255–2264 (2018)
 [17] Hajiramezanali, E., Hasanzadeh, A., Narayanan, K., Duffield, N., Zhou, M., Qian, X.: Variational graph recurrent neural networks. In: Advances in Neural Information Processing Systems. pp. 10700–10710 (2019)
 [18] Helbing, D., Buzna, L., Johansson, A., Werner, T.: Self-organized pedestrian crowd dynamics: Experiments, simulations, and design solutions. Transportation science 39(1), 1–24 (2005)
 [19] Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Physical review E 51(5), 4282 (1995)

 [20] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
 [21] Huang, Y., Bi, H., Li, Z., Mao, T., Wang, Z.: Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6272–6281 (2019)
 [22] Ivanovic, B., Pavone, M.: The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2375–2384 (2019)
 [23] Karkus, P., Ma, X., Hsu, D., Kaelbling, L.P., Lee, W.S., LozanoPérez, T.: Differentiable algorithm networks for composable robot learning. arXiv preprint arXiv:1905.11602 (2019)
 [24] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
 [25] Kuderer, M., Kretzschmar, H., Sprunk, C., Burgard, W.: Feature-based prediction of trajectories for socially compliant navigation. In: Robotics: science and systems (2012)
 [26] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
 [27] Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015)
 [28] Li, Y., Wu, J., Tedrake, R., Tenenbaum, J.B., Torralba, A.: Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566 (2018)
 [29] Lim, B., Arik, S.O., Loeff, N., Pfister, T.: Temporal fusion transformers for interpretable multi-horizon time series forecasting. arXiv preprint arXiv:1912.09363 (2019)
 [30] Liu, J., Lin, H., Liu, X., Xu, B., Ren, Y., Diao, Y., Yang, L.: Transformer-based capsule network for stock movement prediction. In: Proceedings of the First Workshop on Financial Technology and Natural Language Processing. pp. 66–73 (2019)
 [31] Liu, K., Sun, X., Jia, L., Ma, J., Xing, H., Wu, J., Gao, H., Sun, Y., Boulnois, F., Fan, J.: Cheminet: a molecular graph convolutional network for accurate drug property prediction. International journal of molecular sciences 20(14), 3389 (2019)
 [32] Löhner, R.: On the modeling of pedestrian motion. Applied Mathematical Modelling 34(2), 366–382 (2010)
 [33] Luo, Y., Cai, P.: Gamma: A general agent motion prediction model for autonomous driving. arXiv preprint arXiv:1906.01566 (2019)
 [34] Luo, Y., Cai, P., Bera, A., Hsu, D., Lee, W.S., Manocha, D.: Porca: Modeling and planning for autonomous driving among many pedestrians. IEEE Robotics and Automation Letters (2018)
 [35] Ma, X., Gao, X., Chen, G.: Beep: A bayesian perspective early stage event prediction model for online social networks. In: 2017 IEEE International Conference on Data Mining (ICDM). pp. 973–978. IEEE (2017)
 [36] Ma, X., Karkus, P., Hsu, D., Lee, W.S.: Particle filter recurrent neural networks. arXiv preprint arXiv:1905.12885 (2019)
 [37] Ma, X., Karkus, P., Hsu, D., Lee, W.S., Ye, N.: Discriminative particle filter reinforcement learning for complex partial observations. arXiv preprint arXiv:2002.09884 (2020)

 [38] Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., Manocha, D.: Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. Proceedings of the AAAI Conference on Artificial Intelligence 33, 6120–6127 (2019). https://doi.org/10.1609/aaai.v33i01.33016120
 [39] Miao, Y., Gowayyed, M., Metze, F.: Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). pp. 167–174. IEEE (2015)
 [40] Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, H., Savarese, S.: Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
 [41] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp. 3104–3112 (2014)
 [42] Van Den Berg, J., Guy, S.J., Lin, M., Manocha, D.: Reciprocal n-body collision avoidance. In: Robotics research, pp. 3–19. Springer (2011)
 [43] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
 [44] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
 [45] Vemula, A., Muelling, K., Oh, J.: Social attention: Modeling attention in human crowds. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 1–7. IEEE (2018)
 [46] Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., Stolcke, A.: The Microsoft 2017 conversational speech recognition system. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5934–5938 (2018)
 [47] Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018)
 [48] Xu, Y., Piao, Z., Gao, S.: Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5275–5284 (2018)
 [49] Xu, Y., Piao, Z., Gao, S.: Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
 [50] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: Xlnet: Generalized autoregressive pretraining for language understanding. In: Advances in neural information processing systems. pp. 5754–5764 (2019)
 [51] Yi, S., Li, H., Wang, X.: Pedestrian behavior understanding and prediction with deep neural networks. In: European Conference on Computer Vision. pp. 263–279. Springer (2016)
 [52] Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13(3), 55–75 (2018)
 [53] Zhang, P., Ouyang, W., Zhang, P., Xue, J., Zheng, N.: Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12085–12094 (2019)
Additional Attention Visualization
Ablation Trajectory Prediction Visualizations
(Figure: qualitative ablation trajectory predictions. Panels: (a) GAT + STAR, (b) MHA + STAR, (c) STAR + LSTM, (d) STAR without Graph Memory, (e) Simplified STAR without Encoder 2, (f) STAR.)