Spatial-Temporal Self-Attention Network for Flow Prediction

12/13/2019 ∙ by Haoxing Lin, et al. ∙ 0

Flow prediction (e.g., crowd flow, traffic flow) with spatial-temporal features is increasingly investigated in the AI research field. It is very challenging due to the complicated spatial dependencies between different locations and the dynamic temporal dependencies among different time intervals. Although both kinds of dependencies are measured, existing methods suffer from two problems. First, the temporal dependencies are measured either uniformly or with a bias against long-term dependencies, which overlooks the distinctive impacts of short-term and long-term temporal dependencies. Second, the existing methods capture spatial and temporal dependencies independently, which wrongly assumes that the correlations between these dependencies are weak and ignores the complicated mutual influences between them. To address these issues, we propose a Spatial-Temporal Self-Attention Network (ST-SAN). As the path length for attending to long-term dependencies is shorter in the self-attention mechanism, the vanishing of long-term temporal dependencies is prevented. In addition, since our model relies solely on attention mechanisms, the spatial and temporal dependencies can be measured simultaneously. Experimental results on real-world data demonstrate that, in comparison with state-of-the-art methods, our model reduces the root mean square error by 9% in inflow prediction and 4% in outflow prediction on Taxi-NYC data, which is very significant compared to previous improvements.





Flow prediction, as one of the most crucial problems in today’s smart city research, has drawn increasing attention in the AI research field. With growing populations, effective prediction of flow (e.g., crowd flow, traffic flow) becomes more and more critical for first-tier cities. Practically, the performance of various applications, such as intelligent service allocation and dynamic traffic management, benefits from higher accuracy in crowd flow and traffic flow prediction [WuT16]. In addition, a growing amount of available data has been driving AI research on flow prediction.

Specifically, flow refers to the number of people or vehicles arriving in (inflow) or departing from (outflow) the observed regions at each time interval. The goal of flow prediction is to predict the flow at future times by deriving spatial-temporal patterns from historical data. Before the era of deep learning, flow prediction relied heavily on methods from the time series analysis community. Traditional statistical methods such as Auto-Regressive Integrated Moving Average (ARIMA), Kalman filtering, and Vector Auto-Regressive (VAR) models are widely employed in flow prediction [chandra_2009, Li2012, Moreira-Matias, Shekhar]. Although they are straightforward and easy to deploy, the inability of traditional methods to measure complicated spatial dependencies limits their performance.

Recently, deep learning-based methods have shown significant advantages in modeling both spatial and temporal dependencies in flow prediction [Zhang:2017:DSR:3298239.3298479]. However, the existing methods still measure long-term and short-term temporal dependencies incompletely. Besides, they ignore the complicated correlation between the spatial and temporal dependencies because they capture them independently. To be specific, these problems result from the fundamental structures employed by the current methods. Generally, their structures can be categorized as (1) deep residual convolutional networks [ZHANG_TKDE] and (2) convolutional recurrent networks [stdn]. Although both consider spatial and temporal dependencies, each kind of network has structural problems that intrinsically limit its performance.

For the deep residual convolutional methods, the spatial dependencies of different time intervals are independently measured by multiple deep residual convolutional neural networks [Kaiming_He_2015]. Without any recurrent structures, they try to handle the temporal dependencies by applying deeper and more nested residual networks. However, as the convolutional results of different time intervals are measured uniformly, this kind of structure overlooks the distinctive impacts of short-term and long-term temporal dependencies.

For methods that employ a convolutional recurrent structure, recurrent networks such as LSTM [doi:10.1162/neco.1997.9.8.1735] are applied to the convolutional results of different time intervals. However, as the long-term temporal dependencies vanish rapidly when passing through the recurrent networks, they are overwhelmed by the short-term temporal dependencies, which leads to an incomplete measurement of temporal dependencies. Moreover, the computation of the recurrent structure is very inefficient [NIPS2017_7181], which deters convolutional recurrent networks from further improving their performance with deeper and more nested structures.

Additionally, both structures handle the spatial and temporal dependencies asynchronously, which relies on the false assumption that the correlations between the two factors are weak. This assumption ignores the fact that the spatial and temporal dependencies have complicated mutual influences, which are critical for flow prediction in complex situations.

To overcome these challenges, we propose a Spatial-Temporal Self-Attention Network (ST-SAN), which adopts an innovative spatial-temporal self-attention mechanism. Because the path length for attending to long-term dependencies is shorter in the self-attention mechanism, our model avoids the vanishing of long-term temporal dependencies. Besides, since it is based solely on attention mechanisms, ST-SAN captures all dependencies simultaneously and is thus more effective, as the spatial and temporal dependencies can interrelate with each other. Moreover, without any recurrent or deep convolutional structures, ST-SAN is computationally efficient.

The contributions of our work can be summarized as follows:

  • A spatial-temporal self-attention mechanism is developed to handle sophisticated and dynamic spatial and temporal dependencies simultaneously. To the best of our knowledge, the proposed mechanism is the first method that can measure both dependencies synchronously.

  • Our model prevents the vanishing of long-term temporal dependencies with the self-attention mechanism, which can attend to both short-term and long-term dependencies through equal-length paths.

  • A Spatial-Temporal Self-Attention Network is proposed, which is computationally efficient because it eschews recurrent and deep convolutional structures. To the best of our knowledge, ST-SAN is the first deep-learning-based flow prediction method that uses neither of these two structures.

  • We evaluate our model on three real-world, large-scale datasets and demonstrate its significant advantages over state-of-the-art baselines.

Related Work

Deep Learning for Flow Prediction

Recently, various works based on deep learning have achieved significant improvements in flow prediction. First, LSTM-based [doi:10.1162/neco.1997.9.8.1735] methods demonstrated excellent performance in capturing temporal dependencies when predicting spatial-temporal flow [DBLP:cui_ke_wang]. Then, convolutional structures were investigated for capturing spatial dependencies in flow prediction tasks [Zhang:2016:DPM:2996913.2997016]. After the deep residual convolutional network was proposed [Kaiming_He_2015], several works based on the deep residual structure achieved significant improvements in capturing spatial-temporal dependencies in flow prediction [zhang_zheng_qi]. Lately, after Convolutional LSTM achieved tremendous success in processing spatial-temporal information [NIPS2015_5955], several studies employed such a convolutional recurrent structure to learn spatial and temporal dependencies and further improved the performance of flow prediction [ke_zheng_yang, Zhou:2018:PMC:3159652.3159682, a98d8116a2684b17bdabc50c1e1713b3, stdn]. However, these works fail to comprehensively measure the temporal dependencies and also overlook the complicated correlations between spatial and temporal dependencies.


Recently, self-attention has drawn an enormous amount of attention in natural language processing (NLP). Transformer [NIPS2017_7181], a framework built entirely on self-attention, has been widely adopted in many state-of-the-art pre-trained language models [devlin_2018, radford2019language, xlnet].

The self-attention mechanism has three advantages over traditional convolutional and recurrent structures. First, distant elements of a sequence can affect each other’s output without passing through recurrent steps or convolution layers. Second, it can learn long-term dependencies effectively. Third, its layer outputs can be calculated in parallel, which is much faster than the sequential computation of an RNN [NIPS2017_7181]. However, we observe that directly applying Transformer to flow prediction does not result in the expected improvement. A possible reason is that it was initially designed for modeling dependencies among a sequence of words and inherently lacks any consideration of spatial information.

Notations and Problem Formulation

As shown in Figure 1, the spatial area is divided into a grid map with N grids in total. Each grid represents a node (region) in the spatial map. T stands for the number of all available time intervals, which are equally divided from the whole period. In each time interval, each node holds w types of flows (e.g., inflow and outflow), whose volumes are determined from the historical records of object trajectories. Specifically, taking inflow and outflow as an example, when an object (e.g., a person or a vehicle) was in one node at one time interval and appeared in a different node at a later time interval, it contributed one unit of volume to the outflow of the first node and one to the inflow of the second. The overall volumes of inflow and outflow of a node are recorded for every time interval t. At the same time, the transitions between nodes are extracted, both those arriving in a node from another node and those departing from it to another node. Notice that, since the transitions may span multiple time intervals, we discard those with a duration longer than a threshold m, as they have less effect on flow prediction in the next time interval. After obtaining the historical flow and transition data of length T along the time axis, we assemble them into a flow tensor and a transition tensor.
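The aggregation described above can be illustrated with a minimal sketch. It assumes trips are given as (origin node, start interval, destination node, end interval) tuples; the function name, the argument names, and the choice to index a transition by its arrival interval are hypothetical and not taken from the paper.

```python
import numpy as np

def flows_from_trips(trips, num_nodes, num_intervals, max_span=2):
    """Aggregate trips into inflow, outflow, and node-to-node transition counts.

    trips: iterable of (origin, t_start, dest, t_end) tuples (assumed layout).
    max_span: plays the role of the threshold m; transitions that last longer
              are discarded. Indexing transitions by arrival interval is an
              assumption made for this sketch.
    """
    inflow = np.zeros((num_intervals, num_nodes), dtype=np.int64)
    outflow = np.zeros((num_intervals, num_nodes), dtype=np.int64)
    transitions = np.zeros((num_intervals, num_nodes, num_nodes), dtype=np.int64)
    for origin, t_start, dest, t_end in trips:
        outflow[t_start, origin] += 1   # the object leaves the origin node
        inflow[t_end, dest] += 1        # and arrives in the destination node
        if t_end - t_start <= max_span:
            transitions[t_end, origin, dest] += 1
    return inflow, outflow, transitions
```

Stacking the inflow and outflow arrays along a last axis would give a flow tensor with w = 2 channels per node and time interval.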

Problem Statement Given the historical flow and transition data as inputs, the task of the prediction problem is to learn a function, parameterized by learnable parameters, that maps the inputs to the predicted values of all nodes at the next time interval.

Model Architecture

Figure 2 shows the architecture of ST-SAN, which consists of two streams of self-attention networks: Stream-T and Stream-F. Each of them contains a stack of convolutional layers, an encoder, and a decoder. Stream-T is trained independently to capture the features of transitions before it is merged with Stream-F by a masked fusion mechanism. The details of each component are described in the following subsections.

Figure 1: Map segmentation and the transitions between nodes
Figure 2: Model architecture. PE: positional encoding

Encoder and Decoder

We employ the encoder-decoder architecture used in most competitive neural sequence transduction models [NIPS2017_7181]. Here, the encoder maps an input sequence of historical flow or transition data to a sequence of continuous representations Z. Given Z and the current flow or transition data, the decoder then generates an output y as the prediction for the next time interval.

The encoder contains a stack of N = 4 identical layers, whose sub-layers include a spatial-temporal multi-head self-attention mechanism and a position-wise fully connected feed-forward network. We also employ a residual connection [Kaiming_He_2015] and layer normalization [ba2016layer] around each of the two sub-layers. To be specific, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where Sublayer(x) is the function implemented by the sub-layer itself. The dimension of the outputs produced by all sub-layers is set to 64 in order to facilitate the residual connections.
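As a rough illustration of the residual connection followed by layer normalization, here is a minimal NumPy sketch. It normalizes over the last (feature) axis and leaves out the learnable gain and bias of layer normalization for brevity; the function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (learnable gain and bias omitted in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Residual connection around `sublayer`, followed by layer normalization."""
    return layer_norm(x + sublayer(x))
```

Because the sub-layer output is added to its input, both must share the same feature dimension, which is why a single output dimension is used for all sub-layers.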

The decoder also consists of a stack of N = 4 identical layers. Besides the two sub-layers found in each encoder layer, an additional sub-layer is inserted to perform spatial-temporal multi-head attention over the output of the encoder stack. As in the encoder, residual connections followed by layer normalization are applied around each sub-layer.

Spatial-Temporal Self-Attention

Compared to the ordinary self-attention mechanism adopted in language models, the feature space of the spatial-temporal self-attention mechanism has two more axes, which hold the domain of the spatial map. As the computation of self-attention can be parallelized [NIPS2017_7181], the enlarged feature space does not result in longer training time.

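In spatial-temporal self-attention, the scaled dot-product attention [NIPS2017_7181] is used as the attention kernel (Figure 3 (a)):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$

The inputs consist of queries, keys and values, $Q, K, V \in \mathbb{R}^{a \times b \times h \times d}$, where $a \times b$ is the size of the spatial maps and $h$, $d$ stand for the sequence length and the feature dimension. The transpose of K is performed over the last two axes, so that $K^{\top} \in \mathbb{R}^{a \times b \times d \times h}$. The matrix multiplication between Q and $K^{\top}$ is likewise over the last two axes. Multi-head attention is then constructed upon the scaled dot-product attention:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_u)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$

where $W_i^{Q}, W_i^{K}, W_i^{V}, W^{O}$ are the learned projection parameter matrices and $u$ is the number of attention heads. In this work, we employ u = 8 parallel attention layers, or heads. As the concepts of scaled dot-product attention and multi-head attention have been widely adopted in AI research, we omit their comprehensive descriptions here and refer readers to [NIPS2017_7181].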
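To make the axis handling concrete, the following NumPy sketch applies scaled dot-product attention to tensors with two leading spatial axes. The shape convention (a, b, h, d) and the function names are assumptions made for illustration; the sketch covers a single head and omits the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def st_scaled_dot_product_attention(q, k, v):
    """q, k, v: arrays of shape (a, b, h, d); attention runs over the h axis.

    The transpose and both matrix products act on the last two axes only,
    so every spatial position (a, b) is processed in parallel.
    """
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)   # (a, b, h, h)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights                        # (a, b, h, d), (a, b, h, h)
```

Since NumPy's `@` operator broadcasts over leading axes, the two extra spatial axes require no explicit loop.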

Local Convolution and Area of Interest

Before the spatial-temporal data are passed into the spatial-temporal self-attention mechanism, they go through a stack of convolutional neural networks (CNN) with 3 layers (Figure 2). The w types of flows are projected to a representation space with dimension 64, and the spatial dependencies are further interrelated via the CNN stack. Previous works have shown that, when predicting the flow of a node, focusing on local dependencies is more helpful than measuring the whole spatial map [zhang_zheng_qi]. Therefore, we also adopt the idea of local convolution, which focuses on an area of interest (AoI) of size a × b surrounding the target node. Specifically, the historical flow input is sampled from the AoIs in the historical spatial-temporal data. Similarly, when generating the historical transition input, only the transitions between the target node and the other nodes in its AoI are sampled. In this work, we set a = b = 7.
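A small sketch of how an AoI could be cut out of one flow map is given below. The function name is hypothetical, and zero-padding at the map border is an assumption of this sketch; the text does not say how nodes near the border are handled.

```python
import numpy as np

def extract_aoi(flow_map, row, col, size=7):
    """Return the (size, size, w) window of `flow_map` centered on (row, col).

    flow_map: array of shape (H, W, w) for a single time interval.
    Cells outside the map are filled with zeros (an assumption of this sketch).
    """
    half = size // 2
    padded = np.pad(flow_map, ((half, half), (half, half), (0, 0)))
    return padded[row:row + size, col:col + size, :]
```

With size = 7 the target node sits at index (3, 3) of the returned window.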

In each layer of the CNN stack, every time slice of the flow or transition input is convolved with a set of filters, and the p-th filter of the convolution kernel produces the p-th output channel. All filters together constitute a joint kernel, which is applied slice-wise to the input. The final output of the CNN stack is the projected spatial-temporal representation of the input data. We employ same-padding in each convolutional layer to maintain the tensor shape.

Periodic Shifting and Sliding-Window Sampling

Previous work [stdn] demonstrated that the flows in periodic windows have strong similarities. As shown in Figure 4, the same periods of different days are more similar to each other than to the preceding periods of the same day. Besides, the pattern of flow shifts periodically. For example, the peak hours of traffic flow may vary from 16:30 to 18:00 on different days. Thus, we adopt sliding-window sampling to generate the flow and transition inputs from the historical data. Specifically, each input is the concatenation of spatial matrices from the same periods of the previous 7 days and the previous two time intervals of the current day (area with red boundary in Figure 4). Then, the data in the time interval right before the future time is used as the current data fed into the decoder stack, while the rest is used as the input of the encoder stack.
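The sampling scheme can be sketched as a function that returns the indices of the time intervals used for one prediction. The width of the window around the same period on earlier days is not stated in the text, so `half_window` is a hypothetical parameter, as are the function and argument names.

```python
def sample_indices(target, intervals_per_day, num_days=7, half_window=1, num_recent=2):
    """Return (encoder_indices, decoder_index) for predicting interval `target`.

    For each of the previous `num_days` days, a window of 2 * half_window + 1
    intervals centered on the same time of day is taken (window width assumed).
    The `num_recent` intervals right before `target` are appended; the last of
    them is fed to the decoder and everything else goes to the encoder.
    """
    indices = []
    for day in range(num_days, 0, -1):
        center = target - day * intervals_per_day
        indices.extend(range(center - half_window, center + half_window + 1))
    indices.extend(range(target - num_recent, target))
    if min(indices) < 0:
        raise ValueError("not enough history before the target interval")
    return indices[:-1], indices[-1]
```

For example, with 48 intervals per day and a target index of 400, the encoder receives three intervals from each of the seven previous days plus interval 398, and the decoder receives interval 399.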

Figure 3: (a) The scaled dot-product attention in the spatial-temporal self-attention mechanism. (b) Masked fusion mechanism. σ is the sigmoid function.

Positional Encoding

Positional encoding is employed because the positional information is lost without recurrent structures. Here, to encode the non-consecutive positional information, we add learned positional encodings to the output of the convolution stack. First, we represent the time information of each time interval as a one-hot style vector of length 7 + g, where g is the number of time intervals in one day. We use the first seven elements of the vector to represent the day of the week and the last g elements to represent the index of the time interval in that day. The positional encoding (PE) of each time interval is then computed from this vector with learned parameters and the sigmoid function. The whole positional encoding matrix is formed in this way and summed with the output of the convolution stack before it is fed into the encoder and decoder stacks. The positional encoding is broadcast to the shape of that output before the addition.
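The time vector described above can be built as follows. The exact form of the learned encoding is not recoverable from the text, so the second function, a single learned linear map followed by a sigmoid, is only an assumed form; all names are hypothetical.

```python
import numpy as np

def time_vector(day_of_week, interval_index, g):
    """Vector of length 7 + g: the first 7 slots mark the day of the week,
    the last g slots mark the index of the time interval within the day."""
    vec = np.zeros(7 + g)
    vec[day_of_week] = 1.0
    vec[7 + interval_index] = 1.0
    return vec

def positional_encoding(vec, weight, bias):
    """Assumed form: sigmoid(vec @ weight + bias), with learned weight and bias."""
    return 1.0 / (1.0 + np.exp(-(vec @ weight + bias)))
```

With 30-minute intervals (g = 48), the interval starting at 02:30 on the third day of the week would be encoded as `time_vector(2, 5, 48)`.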

Figure 4: Temporal similarity. The darker an interval, the stronger its similarity to the time to predict.

2-Stream Structure

Previous works demonstrate that transitions between nodes have significant impacts on flow prediction [a98d8116a2684b17bdabc50c1e1713b3]. Therefore, ST-SAN is designed as a 2-stream framework with two spatial-temporal self-attention networks (Stream-T and Stream-F) that measure transition and flow independently. We first train Stream-T to predict the transitions in the AoI. Here, the output of Stream-T is computed from the output of its decoder stack with learned parameters.

Then, the trainable parameters of Stream-T are locked, and Stream-T is merged with Stream-F by a masked fusion mechanism to form ST-SAN for further training. The independent training is necessary since we observe that Stream-T is trained ambiguously if only the loss between the output and the true flow is calculated. Hence, independent training sets a more definite target for Stream-T, which enhances the measurement of transition. The experimental results also show the advantages of employing independent training.

Masked Fusion Mechanism

A masked fusion mechanism is proposed to merge the two streams and generate the final output. As shown in Figure 3 (b), the outputs of Stream-T and Stream-F are fed into a stack of 2 hybrid convolutional layers, each of which has its own convolutional kernels and learned biases. In each layer, the sigmoid function converts the transition features to a weight mask. The mask is then applied to the convolutional result of the flow features to intensify the influence of more closely related nodes. To be specific, if two nodes have many transitions between them, their connection and mutual influence should be stronger. Here, padding is not employed in the CNN layers.
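One hybrid layer of this kind might look like the sketch below. It assumes that the mask is the sigmoid of a convolution over the transition features and that it multiplies the convolved flow features element-wise; the 3 × 3 kernel size used in the usage note and all function names are hypothetical, since the original layer equation is not recoverable here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, kernel, bias):
    """2-D convolution without padding.

    x: (H, W, C), kernel: (k, k, C, F), bias: (F,) -> (H - k + 1, W - k + 1, F)
    """
    k = kernel.shape[0]
    windows = sliding_window_view(x, (k, k), axis=(0, 1))  # (H-k+1, W-k+1, C, k, k)
    return np.einsum("hwcij,ijcf->hwf", windows, kernel) + bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_fusion_layer(flow_feat, trans_feat, k_flow, b_flow, k_trans, b_trans):
    """Assumed form: conv(flow features) weighted by sigmoid(conv(transition features))."""
    mask = sigmoid(conv2d_valid(trans_feat, k_trans, b_trans))
    return conv2d_valid(flow_feat, k_flow, b_flow) * mask
```

Without padding, a 7 × 7 AoI shrinks to 5 × 5 after one 3 × 3 layer and to 3 × 3 after a second one, so the fused features stay small before flattening.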

After the output of the last hybrid layer is flattened, the final output is computed from the flattened output.

The predicted outputs of all nodes constitute the predicted values of the whole spatial map (grid map).


We use the MSE loss function for both the training of Stream-T and the training of the unified ST-SAN, computed against the ground truths of the AoI transitions and of the flows, respectively.

Model       Taxi-NYC inflow   Taxi-NYC outflow  Bike-NYC inflow   Bike-NYC outflow  Mobile M user number
            RMSE     MAE      RMSE     MAE      RMSE     MAE      RMSE     MAE      RMSE     MAE
HA          90.19    50.10    109.36   65.91    30.25    20.35    29.63    19.96    421.39   273.18
ARIMA       33.54    18.62    40.70    23.61    17.14    10.83    18.03    11.28    194.92   150.95
VAR         48.04    23.21    128.67   29.84    27.37    14.29    27.67    15.09    254.37   157.71
MLP         27.13    16.91    32.93    20.80    25.77    32.57    15.92    19.85    130.01   106.44
LSTM        24.35    15.07    30.41    19.18    24.79    32.06    15.61    20.62    111.70   93.80
GRU         24.37    15.17    30.25    19.14    24.62    31.37    15.22    19.77    114.23   93.89
ConvLSTM    22.25    14.13    27.39    17.38    9.71     7.07     11.09    7.78     85.97    67.12
ST-ResNet   20.34    12.90    25.54    16.21    9.32     6.79     10.45    7.33     74.30    55.03
DMVST-Net   18.99    12.24    24.07    15.39    8.95     6.52     9.75     6.84     68.09    50.50
STDN        17.91    11.37    23.47    14.89    8.58     6.25     9.44     6.62     62.59    43.22
ST-SAN      16.39    10.63    22.94    13.48    7.82     5.68     9.02     6.17     57.13    40.20
Table 1: Comparisons with ten baselines on Taxi-NYC, Bike-NYC, and Mobile M in flow prediction.



We evaluate our model on three real-world datasets: Taxi-NYC, Bike-NYC, and Mobile M. Their details are shown in Table 2.

  • Taxi-NYC and Bike-NYC: Taxi-NYC and Bike-NYC both contain 60 days of trip records. Each record includes the locations and times of the start and the end of a trip. We use the first 40 days as training data and the remaining 20 days as testing data.

  • Mobile M: Mobile M includes 158,742,004 service records that contain the approximate locations of mobile phone users during the service periods. The whole 92-day dataset is split into 60 days for training and 32 days for testing.

Evaluation Metric & Baselines

We measure the performance of different methods by two widely adopted metrics: (1) Root Mean Square Error (RMSE); (2) Mean Absolute Error (MAE).
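For reference, both metrics follow their standard definitions and can be computed as in this short NumPy sketch (function names are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all entries."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over all entries."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))
```

RMSE penalizes large individual errors more strongly than MAE, so RMSE is never smaller than MAE on the same predictions.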

Datasets        Taxi-NYC               Bike-NYC               Mobile M
Grid map size
Time interval   30 mins                30 mins                15 mins
Time span       1/1/2016 - 2/29/2016   8/1/2016 - 9/29/2016   10/1/2018 - 12/31/2018
Total records   22,437,649             9,194,087              158,742,004
Table 2: Details of the evaluated datasets


  • HA: Historical average.

  • ARIMA: Auto-regressive integrated moving average model.

  • VAR: Vector auto-regressive model.

  • MLP: Multi-layer perceptron.

  • LSTM: Long Short-Term Memory network.

  • GRU: Gated Recurrent Unit network.

  • ConvLSTM: Convolutional LSTM [NIPS2015_5955].

  • ST-ResNet: Spatial-Temporal Residual Convolutional Network [Zhang:2017:DSR:3298239.3298479].

  • DMVST-Net: Deep Multi-View Spatial-Temporal Network [DBLP:journals/corr/abs-1802-08714].

  • STDN: Spatial-Temporal Dynamic Network [stdn].


The grid sizes of Taxi-NYC, Bike-NYC, and Mobile M are , , and respectively. The length of the time interval is set to 30 minutes for Taxi-NYC and Bike-NYC and to 15 minutes for Mobile M, so the number of time intervals per day is 48 and 96, respectively. We randomly select 20% of the training data for validation and use the rest for training. We use Min-Max normalization to scale all traffic flow data to [0, 1] and convert the predictions back during evaluation. We also filter out all regions with a real flow volume of less than ten in the evaluation, which is a common criterion in flow prediction research [Zhang:2017:DSR:3298239.3298479].
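The normalization and the evaluation filter can be sketched as follows. The helper names are hypothetical, and taking the minimum and maximum from the training data only is an assumption of this sketch.

```python
import numpy as np

def min_max_fit(train):
    """Return the (min, max) used for scaling; taken from training data (assumed)."""
    return float(np.min(train)), float(np.max(train))

def normalize(x, lo, hi):
    """Scale values to [0, 1]."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def denormalize(x, lo, hi):
    """Map scaled values back to flow volumes before computing metrics."""
    return np.asarray(x, dtype=float) * (hi - lo) + lo

def evaluation_mask(y_true, min_volume=10):
    """Keep only entries whose real flow volume is at least `min_volume`."""
    return np.asarray(y_true) >= min_volume
```

Metrics are then computed on `y_true[mask]` and `denormalize(pred, lo, hi)[mask]`, so low-volume regions do not affect the reported errors.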


In Taxi-NYC and Bike-NYC, w = 2 types of flow, inflow and outflow, are processed. In Mobile M, only the user number of each area is considered (w = 1). We set the threshold m = 2 to filter out long-span transitions. The stack of convolutional layers contains 3 CNN layers, each of which includes 64 filters. We set the dimension of the feed-forward layer to 128 and the number of attention heads to 8. The dropout rate is 0.1, and the epsilon offset in layer normalization is 1e-6.


We used the Adam optimizer [adam] with $\beta_1$ = 0.9 and $\beta_2$ = 0.98. We adopted a warm-up schedule to adjust the learning rate, with 4000 warm-up steps.
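A common warm-up schedule is the one used for the Transformer [NIPS2017_7181], which increases the learning rate linearly during warm-up and then decays it with the inverse square root of the step number. The sketch below assumes that this schedule is the one meant here, with the model dimension of 64 and 4000 warm-up steps; the function name is hypothetical.

```python
def warmup_learning_rate(step, d_model=64, warmup_steps=4000):
    """Transformer-style schedule (assumed): linear warm-up, then step ** -0.5 decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under these assumptions the learning rate peaks at step 4000, where both arguments of `min` coincide.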


We evaluated our method and ten baselines on all three datasets and report the average results of each method over ten runs. Table 1 shows the RMSE and MAE results.

Noticeably, the traditional statistical time-series prediction methods (HA, ARIMA, and VAR) are significantly less effective. This exposes the weakness of methods that exclusively consider the relation between historical statistical values and ignore the complicated spatial-temporal dependencies. MLP barely learns more than a linear mapping from historical data to the predicted results, so the spatial-temporal dependencies are insufficiently measured. LSTM and GRU achieve non-trivial improvements over MLP and the traditional time-series methods thanks to their effectiveness in modeling temporal dependencies. Nonetheless, without a sophisticated mechanism to integrate spatial dependencies, their performance fails to improve further.

The other deep-learning-based methods show their advantage in capturing complicated spatial-temporal dependencies. As shown in the comparison results, ST-SAN outperforms the other deep learning frameworks. Although ST-ResNet employs deep residual networks to capture spatial-temporal dependencies, the convolutional results are merged linearly, which overlooks the distinctive impacts of short-term and long-term temporal dependencies. ConvLSTM, DMVST-Net, and STDN show a remarkable capability of modeling both the spatial and temporal dependencies. However, the LSTM they employ limits their efficiency in reaching long-term temporal dependencies. Besides, the independent modeling of spatial and temporal dependencies also limits their capacity to capture complicated spatial-temporal correlations. ST-SAN shows significant improvement compared to previous deep learning methods. In detail, taking the prediction on Taxi-NYC data as an example, the RMSE is reduced by 9% for inflow prediction and 4% for outflow prediction.

Model Variants

Evaluation on the Effectiveness of Spatial-Temporal Self-Attention Mechanism

In this section, we empirically demonstrate the effectiveness of the spatial-temporal self-attention mechanism. There are three variants of the self-attention networks:

Variants      inflow (RMSE/MAE)   outflow (RMSE/MAE)
SAN           22.38/13.98         28.17/16.44
ST-SAN-S      19.38/12.98         24.97/16.44
ST-SAN-D      16.73/10.91         23.34/14.07
ST-SAN-D IT   16.39/10.63         22.94/13.48
Table 3: Evaluation of variants of ST-SAN on Taxi-NYC.

  • SAN: Original self-attention network. The spatial maps are embedded into vectors by fully connected layers. Except for the input and output layers, SAN is identical to the Transformer.

  • ST-SAN-S: Single-stream ST-SAN employing spatial-temporal self-attention network.

  • ST-SAN-D: Dual-stream (2-stream) ST-SAN without independent training on Stream-T.

As shown in Table 3, ST-SAN-D outperforms SAN and ST-SAN-S on both RMSE and MAE. SAN performs poorly because it merely employs the structure of Transformer and ignores the complicated spatial dependencies. ST-SAN-S applies the spatial-temporal self-attention mechanism, but it lacks the transition information between nodes, so the influences of other nodes are measured uniformly and their dynamic dependencies are overlooked.

Evaluation on Effectiveness of Independent Training

To demonstrate the effectiveness of independent training on Stream-T, we evaluate the performance of 2 variants:

  • ST-SAN-D

  • ST-SAN-D IT: ST-SAN with independent training on Stream-T.

The results in Table 3 show that ST-SAN-D IT achieves a reasonable improvement over the other variants. As mentioned above, if ST-SAN is only trained toward predicting flow, the target of Stream-T is ambiguous. Independent training of Stream-T reduces this ambiguity, leading to more accurate modeling of the connectivity between nodes. Consequently, the final training on the flow prediction task benefits from the pre-training.

Conclusion and Future Work

In this work, we present the spatial-temporal self-attention network. We introduce a spatial-temporal self-attention mechanism that simultaneously captures spatial and temporal dependencies while measuring long-term dependencies more efficiently. In addition, we propose an independent training scheme to enhance the network’s ability to measure the connectivity of nodes. Experimental results demonstrate the significant improvement achieved by ST-SAN. In future work, we will focus on improving the performance of outflow prediction. During the experiments, we observed that ST-SAN achieved a much smaller improvement on outflow prediction than on inflow prediction. Finding the reason is one of the main tasks of our future work.