I Introduction
Passenger flow data of an Urban Rapid Transit (URT) system is typical spatiotemporal data. Predicting the passenger flow of a URT system has significant commercial value, e.g., automatically warning operators in advance of system failures so as to reduce economic losses. Based on the length of the prediction horizon, the task can be classified into short-term prediction (several hours ahead) and long-term prediction (several days ahead). Currently, the complicated spatial and temporal correlation structure hinders accurate prediction and further analysis [bahadori2014fast].
The goal of spatiotemporal analysis is to capture various implicit spatial and temporal dependencies. In our URT case, the first aspect of spatial dependency is the Law of Geography: the passenger flow of a station is usually affected by its spatially adjacent neighbors. Fig. 1(a) plots the correlation structure of passenger flow among stations along a specific URT line on a specific day, where neighboring stations are strongly correlated. The second aspect is contextual similarity: two stations sharing similar functions (business center, residential area, school, etc.) are more likely to have similar passenger flows. For temporal dependency, the future is often correlated with historical observations on two temporal scales, namely weekly correlation and daily correlation, as shown in Fig. 1(b, c). Weekly correlation means that the passenger profile of today is correlated with that of the same day in previous weeks; daily correlation means that the passenger profile of today is related to the pattern of the previous day.
Existing spatiotemporal forecasting methods are mainly based on geostatistical models with regularization techniques, where different locations sharing similar features display common spatial correlations: Zhao et al. used a structured regularization norm to achieve the relatedness of locations sharing the same keyword pool in social media analysis [zhao2015multi]; Zhang et al. utilized a similar norm to encourage all locations to select common features in citywide passenger flow prediction [zhong2017spatiotemporal].
To capture temporal patterns, many traditional time-series models have been developed for traffic prediction, such as Holt-Winters forecasting [tikunov2007traffic] and the Auto-Regressive Integrated Moving Average (ARIMA) [williams2003modeling], which incorporate regularization terms. For example, temporal smoothness is often enforced by penalizing the difference between two consecutive timestamps [zhang2017spatiotemporal]. However, the linear parametric form with few lags imposes strong assumptions on the temporal correlation, which may lead to underfitting. Most importantly, traditional ARIMA cannot represent the cross-dependency between the spatial and temporal dimensions, which is severely overlooked in most of the above research.
Another way to deal with spatiotemporal prediction is the application of neural networks. The Convolutional Neural Network (CNN) is a common way to capture spatial correlation by using a station's neighbors to predict its future behavior. The Recurrent Neural Network (RNN) and its variants (e.g., Long Short-Term Memory (LSTM) networks) are capable of modeling complex temporal correlation. Built upon CNN and LSTM, H. Yao et al. [yao2018deep] developed a spatiotemporal network to predict taxi demand in different regions, adding a further layer to consider semantic similarity of locations, meaning that locations sharing a similar functionality may have similar demand patterns. Recently, X. Geng et al. [geng2019spatiotemporal] incorporated a Multi-Graph Convolution Network to capture not only the Euclidean correlation among spatially adjacent regions but also non-Euclidean adjacency, including functional similarity and connectivity, followed by a Gated Recurrent Neural Network (GRNN) to capture temporal dependency. However, training these deep learning models not only requires costly computational resources but also demands a very large sample size.
In this research, we aim to develop a computationally efficient and robust way to deal with the spatiotemporal prediction problem. To achieve this, we propose to use the tensor representation of the spatiotemporal data, which has proven to be an efficient representation due to its capacity to capture interdependencies along multiple dimensions. In particular, there are two conventional ways to conduct spatiotemporal prediction with tensors, which we categorize as 2-step tensor prediction and 1-step tensor prediction in the following.
2-Step Tensor Decomposition and Time-Series Modeling for Spatiotemporal Prediction
These methods combine tensor decomposition with traditional time-series prediction models [dunlavy2011temporal, guo2017novel, ren2017efficient], and are suitable for long-term prediction.
Tensor decomposition can be considered a high-dimensional generalization of matrix singular value decomposition [kolda2009tensor]. Two specific forms are usually adopted in tensor analysis: CANDECOMP/PARAFAC (CP), which decomposes a tensor as a sum of rank-one tensors, and Tucker decomposition, which decomposes a tensor into a core tensor multiplied by a mode matrix along each mode. The 2-step tensor prediction first conducts tensor decomposition to obtain the temporal mode matrix along the time dimension, and then applies a time-series model to the temporal mode. Holt-Winters forecasting [dunlavy2011temporal], the Auto-Regressive Integrated Moving Average (ARIMA) and Support Vector Regression (SVR) [ren2017efficient] have all been exploited. However, existing 2-step tensor prediction methods are not capable of capturing all the weekly and daily patterns mentioned above. Take ARIMA or AR models for example: if the weekly pattern is desired, the time lag should be set large enough to include at least the past few weeks, which highly complicates the model. To address this, we propose to first reshape the daily profile into a matrix (in the form of day-of-week by week) and then apply a 2-Dimensional Auto-Regressive Moving Average (2D-ARMA) model to capture both the daily and weekly patterns. We name it the "2-step 2D-ARMA" tensor prediction. Finally, updating the prediction in real time when new data arrive is another challenge. To this end, we also propose a Lean Dynamic Updating method.
1-Step Tensor Prediction Based on Tensor Completion
This approach is based on tensor completion [tan2016short, ran2016tensor, luan2019prediction], and is suitable for short-term prediction (a prediction horizon of several hours ahead). Tensor completion was originally designed for imputing randomly missing data in a tensor. H. Tan et al. were the first to use tensor completion for traffic volume prediction [tan2016short, chen2019bayesian], treating future data as missing data to be estimated.
For tensor completion, Low-Rank Tensor Completion (LRTC) methods have prevailed. One of the most common techniques is adding a nuclear-norm penalty on the tensor's rank [kressner2014low, yokota2016smooth, liu2012tensor]; Q. Shi et al. imposed a sparsity-inducing norm on the CP weight vector [shi2017tensor]; Q. Zhao et al. proposed a CP-based Bayesian hierarchical probabilistic model, assuming that all mode matrices are generated from a higher-level latent distribution, with a sparsity-inducing prior to achieve low rank [zhao2015bayesian].
However, 1-step tensor prediction based on LRTC is prone to oversimplifying the model, which results in a loss of prediction accuracy. To solve this problem, we propose to first cluster the spatiotemporal data and then conduct LRTC within the data of each cluster. To this end, a tensor clustering method is also studied.
In this paper, we focus on improving both the 1-step and 2-step state-of-the-art tensor prediction methods, and provide practical and effective techniques to improve their prediction performance. In summary, this paper makes the following contributions:
• We improve both the 2-step and 1-step tensor prediction for URT passenger flow prediction: for the 2-step tensor prediction, we propose a 2-step 2D-ARMA tensor prediction model; for the 1-step tensor prediction, we improve LRTC by conducting it together with a tensor clustering algorithm.
• For both methods, we also propose how to apply them dynamically online; in particular, we propose a Lean Dynamic Updating method for tensor decomposition to update the previous prediction in real time.
II Preliminaries
We first review the preliminaries and background on tensor decomposition and tensor completion methods.
II-A Notations and Operations
Throughout this exposition, scalars are denoted by italic letters, e.g., $x$; vectors by boldface lowercase letters, e.g., $\mathbf{a}$; matrices by boldface uppercase letters, e.g., $\mathbf{A}$; and high-dimensional data (tensors) by boldface script capitals, e.g., $\boldsymbol{\mathcal{X}}$.
II-B Tensor Decomposition and Completion
We introduce here the basics of CP decomposition, Tucker decomposition and tensor completion.
CP Decomposition
A tensor is represented as the weighted summation of a set of rank-one tensors:

$$\boldsymbol{\mathcal{X}} = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r^{(1)} \circ \mathbf{a}_r^{(2)} \circ \cdots \circ \mathbf{a}_r^{(D)} \qquad (1)$$

where each $\mathbf{a}_r^{(d)}$ is a unit vector and $\circ$ is the outer product. $\mathbf{A}^{(d)} = [\mathbf{a}_1^{(d)}, \ldots, \mathbf{a}_R^{(d)}]$ is the mode matrix of dimension $d$, $R$ is the rank of the CP decomposition, and $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_R]^{T}$ is the score vector.
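To make Eq. (1) concrete, the following minimal NumPy sketch reconstructs a tensor from given CP factors; all names and shapes are illustrative, not part of our implementation:

```python
import numpy as np

def cp_reconstruct(weights, factors):
    """Eq. (1): X = sum_r lambda_r a_r^(1) o a_r^(2) o ... o a_r^(D)."""
    shape = tuple(A.shape[0] for A in factors)
    X = np.zeros(shape)
    for r, w in enumerate(weights):
        rank_one = factors[0][:, r]
        for A in factors[1:]:
            rank_one = np.multiply.outer(rank_one, A[:, r])  # outer product
        X += w * rank_one
    return X

# toy example: a rank-2 CP model of a 4 x 5 x 6 tensor
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, 2)) for n in (4, 5, 6)]
X = cp_reconstruct(np.array([1.0, 0.5]), factors)
```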
Bayesian Low-Rank Tensor Decomposition
Consider $\boldsymbol{\mathcal{Y}}$ as a noisy observation of tensor $\boldsymbol{\mathcal{X}}$, i.e., $\boldsymbol{\mathcal{Y}} = \boldsymbol{\mathcal{X}} + \boldsymbol{\mathcal{E}}$, where the noise is assumed to be i.i.d. Gaussian, $\mathcal{E}_{i_1 \cdots i_D} \sim \mathcal{N}(0, \tau^{-1})$. $\boldsymbol{\mathcal{X}}$ is generated by the CP model, with the weights absorbed into the mode matrices. The mode factor matrix $\mathbf{A}^{(d)}$ can be denoted by its row-wise vectors $\mathbf{a}_{i_d}^{(d)}$ or column-wise vectors $\mathbf{a}_{\cdot r}^{(d)}$.
The generative model based on the Bayesian probabilistic structure [zhao2015bayesian] is shown in Fig. 2 and is specified as follows.
The precisions $\boldsymbol{\lambda}$ and $\tau$ are generated by Gamma priors:

$$p(\boldsymbol{\lambda}) = \prod_{r=1}^{R} \mathrm{Ga}(\lambda_r \mid c_0, d_0), \qquad p(\tau) = \mathrm{Ga}(\tau \mid a_0, b_0) \qquad (2)$$

Each mode matrix $\mathbf{A}^{(d)}$ (given $\boldsymbol{\lambda}$) is generated by:

$$p(\mathbf{A}^{(d)} \mid \boldsymbol{\lambda}) = \prod_{i_d=1}^{I_d} \mathcal{N}\big(\mathbf{a}_{i_d}^{(d)} \mid \mathbf{0}, \boldsymbol{\Lambda}^{-1}\big), \qquad \boldsymbol{\Lambda} = \mathrm{diag}(\boldsymbol{\lambda}) \qquad (3)$$

The observed entries $\boldsymbol{\mathcal{Y}}_{\Omega}$ (given $\{\mathbf{A}^{(d)}\}$ and $\tau$) are generated by:

$$p(\boldsymbol{\mathcal{Y}}_{\Omega} \mid \{\mathbf{A}^{(d)}\}, \tau) = \prod_{(i_1,\ldots,i_D)\in\Omega} \mathcal{N}\big(\mathcal{Y}_{i_1 \cdots i_D} \mid \langle \mathbf{a}_{i_1}^{(1)}, \ldots, \mathbf{a}_{i_D}^{(D)} \rangle, \tau^{-1}\big) \qquad (4)$$
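A sketch of ancestral sampling from Eqs. (2)-(4) follows; the Gamma hyperparameter values and tensor sizes are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
dims, R = (4, 5, 6), 3
a0, b0, c0, d0 = 1.0, 1.0, 1.0, 1.0    # illustrative Gamma hyperparameters

# Eq. (2): noise precision tau and rank-component precisions lambda
tau = rng.gamma(a0, 1.0 / b0)          # scale = 1 / rate
lam = rng.gamma(c0, 1.0 / d0, size=R)

# Eq. (3): every row of each mode matrix ~ N(0, diag(lam)^-1)
factors = [rng.normal(0.0, 1.0 / np.sqrt(lam), size=(I, R)) for I in dims]

# CP tensor with the weights absorbed into the mode matrices
X = np.einsum('ir,jr,kr->ijk', *factors)

# Eq. (4): observations = CP tensor + i.i.d. N(0, tau^-1) noise
Y = X + rng.normal(0.0, 1.0 / np.sqrt(tau), size=dims)
```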
Tucker Decomposition
It is commonly regarded as a higher-order PCA, decomposing a tensor into a core tensor multiplied by a matrix along each mode, i.e.,

$$\boldsymbol{\mathcal{X}} = \boldsymbol{\mathcal{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_D \mathbf{A}^{(D)} \qquad (5)$$

where $\boldsymbol{\mathcal{G}} \in \mathbb{R}^{r_1 \times \cdots \times r_D}$ is the core tensor, $\mathbf{A}^{(d)} \in \mathbb{R}^{I_d \times r_d}$ are the mode matrices, and $(r_1, \ldots, r_D)$ is the rank of the Tucker decomposition.
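A NumPy sketch of Eq. (5) via successive mode-n products is given below (names and sizes are illustrative):

```python
import numpy as np

def mode_n_product(G, A, n):
    """Tensor-times-matrix along mode n: (G x_n A)."""
    Gn = np.moveaxis(G, n, 0)
    out = np.tensordot(A, Gn, axes=(1, 0))
    return np.moveaxis(out, 0, n)

def tucker_reconstruct(G, factors):
    """Eq. (5): X = G x_1 A^(1) x_2 A^(2) ... x_D A^(D)."""
    X = G
    for n, A in enumerate(factors):
        X = mode_n_product(X, A, n)
    return X

# toy example: a Tucker rank-(2, 3, 2) core mapped to a 4 x 5 x 6 tensor
rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 2))
factors = [rng.standard_normal((I, r)) for I, r in zip((4, 5, 6), (2, 3, 2))]
X = tucker_reconstruct(G, factors)
```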
Tensor Completion
It is usually designed for random missing data imputation, such as randomly missing pixels in image data [zhao2015bayesian]. The basic tensor completion is formulated as follows:

$$\min_{\boldsymbol{\mathcal{X}}} \ \|\boldsymbol{\mathcal{X}}\|_{*} \quad \text{s.t.} \quad \boldsymbol{\mathcal{X}}_{\Omega} = \boldsymbol{\mathcal{T}}_{\Omega} \qquad (6)$$

where $\boldsymbol{\mathcal{T}}$ is the incomplete input tensor, $\boldsymbol{\mathcal{X}}$ is the completed output tensor, $\|\cdot\|_{*}$ is the nuclear norm used to achieve low rank, and $\Omega$ is the sampling set denoting the indices of the observed elements of the tensor.
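The following sketch sets up the ingredients of Eq. (6): the sampling set $\Omega$ and the sum of nuclear norms of the mode unfoldings, a common convex surrogate for the tensor nuclear norm in the LRTC literature cited above [liu2012tensor]. It is an illustration of the formulation, not a solver:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding of a tensor into a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def sum_nuclear_norm(X):
    """Sum of nuclear norms of all mode unfoldings (surrogate for ||X||_*)."""
    return sum(np.linalg.norm(unfold(X, d), ord='nuc') for d in range(X.ndim))

rng = np.random.default_rng(0)
T_in = rng.standard_normal((4, 5, 6))      # incomplete input tensor T
Omega = rng.random(T_in.shape) < 0.7       # ~70% of entries observed

# any feasible completion X must agree with T on the sampling set Omega
X = np.where(Omega, T_in, 0.0)             # trivial zero-filled initialization
assert np.allclose(X[Omega], T_in[Omega])
```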
III Proposed Tensor Prediction Framework
In this section, we propose a spatiotemporal prediction framework based on the tensor decomposition and tensor completion methods in Section II. For short-term prediction, the effective prediction horizon is 2 hours ahead, which is the response time needed by the URT company to take corresponding actions when an abnormal passenger flow pattern happens. In the case of URT passenger flow, the data can be represented as a tensor $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{L \times D \times T}$: $L$ is the location dimension, standing for the 120 stations in our dataset; $D$ is the day dimension, covering the time scope we are looking at (here 1 Jan 2017 to 28 Feb 2017); and $T$ is the time-of-day dimension, with 247 five-minute-interval sensing points per day. Thus our data is $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{120 \times 59 \times 247}$.
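As an illustration, such a tensor can be assembled from pre-indexed smart-card records as in the following sketch, where `records` is a hypothetical iterable of (station, day, slot, flow) tuples with zero-based indices:

```python
import numpy as np

L, D, T = 120, 59, 247
X = np.zeros((L, D, T))

# hypothetical records: (station index, day index, 5-minute slot, flow count)
records = [(0, 0, 0, 5.0), (3, 1, 10, 2.0)]
for station, day, slot, flow in records:
    X[station, day, slot] += flow
```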
Our prediction problem can be formulated as:

$$\hat{\boldsymbol{\mathcal{X}}}_{t+1:t+h} = f\big(\boldsymbol{\mathcal{X}}_{1:t}\big) \qquad (7)$$

where $\boldsymbol{\mathcal{X}}_{1:t}$ denotes the tensor data observed up to time $t$ and $h$ is the prediction horizon.
III-A 2-Step Tensor Long-Term Prediction: Combining Tensor Decomposition and 2D-ARMA
2-step tensor prediction has been popularly used in the past few years, and the mechanism behind it (shown in Fig. 3(a)) is designed as follows (a pipeline sketch follows the list):
1) Formulate the data in tensor form, in our case $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{120 \times 59 \times 247}$. Decompose it, and among the decomposed components find the temporal mode matrix $\mathbf{A}^{(2)}$ along the day dimension.
2) Use a traditional time-series model to predict the temporal mode factor of the incoming time step, $\hat{\mathbf{a}}^{(2)}$.
3) Substitute it back into the decomposition structure to reconstruct the predicted tensor $\hat{\boldsymbol{\mathcal{X}}}$.
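The sketch below illustrates this generic pipeline with an off-the-shelf CP decomposition (tensorly) and a 1D ARIMA forecaster (statsmodels) as the time-series model; our 2D-ARMA variant replaces the forecasting step. The rank, ARIMA order, and stand-in data are illustrative assumptions:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from statsmodels.tsa.arima.model import ARIMA

# stand-in for the real 120-station passenger flow tensor
X = np.random.default_rng(0).random((120, 59, 247))

R = 50                                    # illustrative CP rank
weights, factors = parafac(tl.tensor(X), rank=R)
A_loc, A_day, A_time = factors            # station, day, time-of-day modes

# step 2: forecast the next day's row of the temporal (day) mode matrix
a_next = np.zeros((1, R))
for r in range(R):
    fitted = ARIMA(A_day[:, r], order=(8, 0, 0)).fit()   # lag 8, illustrative
    a_next[0, r] = fitted.forecast(steps=1)[0]

# step 3: substitute the predicted day factor back and reconstruct
X_next = tl.cp_to_tensor((weights, [A_loc, a_next, A_time]))  # (120, 1, 247)
```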
Options for the time-series model include Holt-Winters, ARIMA, etc. However, as mentioned before, they are unable to capture the temporal patterns shown in Fig. 1(c), especially the weekly pattern. This is because, to include the same day of the past 2 weeks, at least 14 daily time lags should be involved in the model, which leads to excessive model complexity.
Therefore, we propose the adoption of the 2D-ARMA model for the prediction step (shown in Fig. 3(b)), with the following additional steps (a reshaping sketch follows the list):
1) For the rank-$r$ column vector $\mathbf{a}_r^{(2)}$ of the temporal mode matrix $\mathbf{A}^{(2)}$, reshape it into a 2D matrix $\mathbf{W}_r$, with each row representing one particular day of the week and each column representing one week.
2) Train the 2D-ARMA model using $\mathbf{W}_r$.
3) After prediction, vectorize the matrix back into a vector, and combine all $R$ rank vectors to obtain $\hat{\mathbf{A}}^{(2)}$.
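A minimal sketch of the reshaping in steps 1) and 3), using an illustrative 56-day (8-week) factor; in practice the 59-day series would be truncated or padded to whole weeks:

```python
import numpy as np

# illustrative 56-day temporal factor for rank r
a_r = np.arange(56.0)

W_r = a_r.reshape(-1, 7).T        # (7 days-of-week) x (8 weeks)

# ... the 2D-ARMA prediction of step 2) operates on W_r ...

a_r_back = W_r.T.reshape(-1)      # step 3): map the matrix back to a vector
assert np.allclose(a_r, a_r_back)
```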
The matrix reshaped from the rank-$r$ vector of $\mathbf{A}^{(2)}$ is represented as a 2D random field $\{x_{i,j}\}$. The 2D-ARMA model is defined for this matrix by the following equation:

$$x_{i,j} = \sum_{\substack{(k,l)=(0,0)\\(k,l)\neq(0,0)}}^{(p_1, p_2)} \phi_{k,l} \, x_{i-k, j-l} + \sum_{(k,l)=(0,0)}^{(q_1, q_2)} \theta_{k,l} \, \varepsilon_{i-k, j-l} \qquad (8)$$

where $\varepsilon_{i,j}$ is stationary white noise with variance $\sigma^2$, the coefficients $\phi_{k,l}$ and $\theta_{k,l}$ are the parameters of the model, and $(p_1, p_2)$ and $(q_1, q_2)$ are the time lags of the AR and MA parts along the two dimensions, respectively. The parameter estimation is explained in [zielinski2010two]. The 2-step 2D-ARMA tensor prediction is summarized in Algorithm 1.
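To illustrate Eq. (8), a sketch of one-step prediction using only the AR part follows (the MA terms are omitted for brevity, and all names are illustrative):

```python
import numpy as np

def ar2d_predict(W, phi):
    """One-step prediction with the AR part of Eq. (8):
    x_hat[i, j] = sum over (k, l) != (0, 0) of phi[k, l] * W[i-k, j-l].

    W   : day-of-week x week field, filled up to (but excluding) entry (i, j)
    phi : AR coefficients of shape (p1 + 1, p2 + 1); phi[0, 0] is unused
    """
    i, j = W.shape[0] - 1, W.shape[1] - 1    # newest (day-of-week, week) cell
    x_hat = 0.0
    for k in range(phi.shape[0]):
        for l in range(phi.shape[1]):
            if (k, l) != (0, 0) and i - k >= 0 and j - l >= 0:
                x_hat += phi[k, l] * W[i - k, j - l]
    return x_hat

# with p1 = p2 = 2 the sum runs over 8 lagged neighbors, as used in Section IV
```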
Another challenge is to instantly and efficiently update the prediction result when new data arrive, for which little research has yielded solutions. The problem is specified in Fig. 4(a). After we obtain the whole predicted passenger flow for tomorrow, by the time we reach, say, 10:00 AM tomorrow, the new data up to that point will have arrived. How to dynamically update our original prediction and yield a more accurate prediction for the rest of the day, especially for the next two hours, then becomes a problem. To address this, we propose a lean dynamic tensor decomposition updating method that alleviates unnecessary workload. The method is demonstrated in Fig. 4(b).
For the long-term prediction result of the coming day, we propose the following procedure (a code sketch follows):
1) Decompose the old long-term prediction result into its weight vector and factor matrices, with factor 0, $\mathbf{A}^{(0)}$, as the component that needs to be updated when new data arrive. An illustration is shown in Fig. 4(b).
2) Assuming the first 30% of the new data have arrived, splice the 30% new data and the remaining 70% long-term prediction together, and update $\mathbf{A}^{(0)}$ according to Eq. (9) [kolda2009tensor].
3) Finally, use the updated $\mathbf{A}^{(0)}$ and the original remaining mode matrices to reconstruct the tensor.

$$\mathbf{A}^{(0)} = \mathbf{X}_{(0)} \Big[ \big( \mathbf{A}^{(2)} \odot \mathbf{A}^{(1)} \big)^{T} \Big]^{\dagger} \qquad (9)$$

where $\mathbf{X}_{(0)}$ is the mode-0 unfolding along the $L$ dimension, $\odot$ is the Khatri-Rao product, and $\dagger$ denotes the pseudoinverse.
In practice, we can combine the weight vector and factor matrix 0 together and update them when new data arrive.
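A minimal NumPy sketch of the least-squares update in Eq. (9), written for a generic 3-way tensor with mode 0 as the station dimension (the unfolding convention follows [kolda2009tensor]; variable names are illustrative):

```python
import numpy as np

def khatri_rao(A, B):
    """Khatri-Rao (column-wise Kronecker) product of (I, R) and (J, R)."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def update_factor0(X_spliced, A1, A2):
    """Eq. (9): A0 = X_(0) [ (A2 kr A1)^T ]^+.

    X_spliced : L x D x T tensor splicing the 30% new data with the
                remaining 70% long-term prediction
    A1, A2    : the (unchanged) factor matrices of the other two modes
    """
    I0 = X_spliced.shape[0]
    X0 = X_spliced.reshape((I0, -1), order='F')   # mode-0 unfolding along L
    KR = khatri_rao(A2, A1)                       # (I2 * I1, R)
    return X0 @ np.linalg.pinv(KR.T)              # least-squares update (I0, R)
```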
III-B 1-Step Tensor Short-Term Prediction: Tensor Completion
To adopt the tensor completion framework for prediction, we treat the historical data as the observation set (with observation indicator 1) and the future horizon to be predicted as missing data (with observation indicator 0).
One thing to note is that tensor completion is not designed for long-term prediction. In particular, we define two concepts here (as shown in Fig. 4(a)): open dimension and closed dimension. An open dimension keeps increasing as data arrive, e.g., the day dimension $D$; a closed dimension has a fixed maximum length, e.g., the station dimension $L$ and the time-of-day dimension $T$. Tensor completion can be used for missing data imputation along the closed dimensions ($L$ and $T$), but not along the open dimension ($D$). Hence it can only predict the short term.
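For instance, the observation indicator for 1-step prediction can be constructed as in the following sketch, where the cut-off time stamp is an illustrative assumption:

```python
import numpy as np

L, D, T = 120, 59, 247
t_now = 120   # illustrative: current 5-minute slot on the last (current) day

Omega = np.ones((L, D, T), dtype=bool)
Omega[:, -1, t_now:] = False   # rest of today along the closed dimension T
# completion imputes entries where Omega is False; the open day dimension D
# is never extended, which is why only short-term horizons can be predicted
```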
For the tensor completion method, in particular, we follow the Bayesian Low-Rank Tensor Completion (LRTC) framework proposed by Zhao et al. [zhao2015bayesian].
Denote $\Theta = \{\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(D)}, \boldsymbol{\lambda}, \tau\}$ from Eqs. (2)-(4). After calculating the joint distribution and the posterior distribution $p(\Theta \mid \boldsymbol{\mathcal{Y}}_{\Omega})$, the missing data $\boldsymbol{\mathcal{Y}}_{\bar{\Omega}}$ can be estimated by:

$$p(\boldsymbol{\mathcal{Y}}_{\bar{\Omega}} \mid \boldsymbol{\mathcal{Y}}_{\Omega}) = \int p(\boldsymbol{\mathcal{Y}}_{\bar{\Omega}} \mid \Theta) \, p(\Theta \mid \boldsymbol{\mathcal{Y}}_{\Omega}) \, d\Theta \qquad (10)$$
However, this method still has drawbacks, as noted in [yuan2018high]. In particular, when the tensor data violate the innate low-rank structure, as in our case where the spatial information of the URT tensor data varies from station to station, the low-rank assumption along the spatial dimension does not hold. Consequently, it is far from enough to assume a low-rank spatial prior, which most likely oversimplifies the original data structure.
To solve this problem, before the Bayesian LRTC we propose to apply tensor clustering [yu2019coupled, sun2019dynamic] to classify the tensor samples into several classes, with highly similar samples within the same cluster. The tensor clustering method is shown in Fig. 5, with the following steps (a code sketch follows):
1) Conduct tensor decomposition to obtain the location mode matrix.
2) Implement Principal Component Analysis (PCA) to further reduce the dimension.
3) Cluster based on a particular clustering method, such as K-means or a hierarchical method.
Thus the URT stations can be divided into several clusters, and Bayesian LRTC can be conducted within each cluster, since the homogeneity within a cluster guarantees its performance.
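A minimal sketch of this clustering pipeline with scikit-learn; the PCA dimension, cluster count, and stand-in location mode matrix are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# stand-in for the (120, R) location mode matrix from the CP decomposition
A_loc = np.random.default_rng(0).random((120, 50))

Z = PCA(n_components=5).fit_transform(A_loc)         # illustrative dimension
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
# Bayesian LRTC is then run on the sub-tensor of each cluster, X[labels == k]
```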
IV Experiments
As discussed in Section III, our URT passenger flow tensor data is $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{120 \times 59 \times 247}$, representing 120 stations over 59 days, with 247 sampling points per day. We set the first 50 days as known historical data and the last 9 days as the data to be predicted.
TABLE I: Long-term prediction RES of the 1D ARIMA baseline vs. the proposed 2D ARMA tensor prediction

| Randomly Selected Station Code | 1D ARIMA Tensor Prediction | 2D ARMA Tensor Prediction | Relative Improvement (%) |
|---|---|---|---|
| 51 | 0.1500 | 0.1066 | 28.90 |
| 56 | 0.1043 | 0.0759 | 27.21 |
| 38 | 0.0849 | 0.0638 | 24.91 |
| 87 | 0.1400 | 0.1087 | 22.39 |
| 54 | 0.0939 | 0.0739 | 21.33 |
| 84 | 0.1605 | 0.1271 | 20.81 |
| 11 | 0.0962 | 0.0767 | 20.29 |
| 16 | 0.0871 | 0.0703 | 19.30 |
| 37 | 0.0982 | 0.0811 | 17.38 |
| 65 | 0.1247 | 0.1060 | 15.00 |
IV-A Proposed 2-Step Tensor Prediction Result
For the tensor $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{120 \times 59 \times 247}$, we conduct CP decomposition. The rank is chosen as 50 by cross-validation, achieving both satisfactory reconstruction and simplicity. For each rank component, a 2D-ARMA model is constructed and used to predict the coming 9-day-ahead passenger flow. The results of the first 3 days of long-term prediction are shown in Fig. 6(a). In our proposed method, to capture the weekly pattern of the past 2 weeks and the daily pattern of the past 2 days, we set the lags to 2 along each dimension, i.e., $(p_1, p_2) = (2, 2)$, which introduces 8 time-lag components into the model. For a fair comparison, the baseline is chosen as the traditional "Tensor Decomposition + 1D ARIMA" model with the same model complexity, i.e., a time lag of 8. As shown in Table I, our proposed model achieves almost 20% improvement, evaluated by the relative residual (RES). The overall improvement benefits from the fact that more strongly correlated temporal patterns are considered in our model. Note that the scale of improvement varies: some stations may also have strong temporal correlation at time lags 3, 4, etc., which our model ignores.
TABLE II: RES over [t, t+5] of the long-term prediction vs. the prediction updated after 30% new data

| [t, t+5] | Long-Term Prediction | Updated after 30% New Data | Relative Improvement (%) |
|---|---|---|---|
| t=75 | 0.9221 | 0.8436 | 8.520 |
| t=80 | 0.5833 | 0.3920 | 32.79 |
| t=85 | 0.7136 | 0.6258 | 12.29 |
| t=90 | 1.2381 | 1.2042 | 2.735 |
| t=95 | 1.2543 | 1.1387 | 9.215 |
| t=100 | 1.0423 | 0.9729 | 6.659 |
| t=105 | 0.1153 | 0.0743 | 35.57 |
| t=110 | 0.3452 | 0.3379 | 2.113 |
| t=115 | 0.4022 | 0.3163 | 21.36 |
| t=120 | 0.2292 | 0.3084 | -34.54 |
| t=125 | 0.6245 | 0.6865 | -9.936 |
When the first 30% of the new data (in Fig. 6(a), up to about time stamp $t=74$) have arrived, the prediction for the remaining 70% of that day (highlighted in the blue block in Fig. 6(a), from time stamp $t=75$ to $t=247$) needs to be updated instantly. Using the proposed lean dynamic updating, the remaining 70% is recalculated as shown in Fig. 6(b). The updated prediction is significantly improved, with a smaller distance to the real values, especially around the local peak time. To further check the improvement over the remaining 70%, the RES is calculated for every 25 minutes (i.e., 5 time stamps), starting from $t=75$. According to Table II, after involving the first 30% of new data, there is an obvious improvement of around 20% over the following 3 hours. After that, the relative improvement based on short-term updating becomes less effective, and the long-term prediction is still preferred.
IV-B Proposed Bayesian LRTC Result
For the Bayesian LRTC, some stations from mixed clusters are randomly picked, and the data from around 10 AM of the last day onward are to be predicted. The prediction result is shown in Fig. 6(c). The Bayesian LRTC reduces the RES by 29% for station 17 in Fig. 6(c), with some other stations achieving around 10% to 30% lower residual, as shown in Table III. This short-term performance of tensor completion is quite satisfactory compared with the proposed Lean Dynamic Updating. However, as Table III shows, the improvement is not universal: stations 84, 55, 51 and 87 have worse predictions. This is because these four stations have quite unique and distinct passenger flow patterns, so the low-rank assumption no longer holds.
TABLE III: RES of the 2-step prediction updated with 30% new data vs. Bayesian LRTC (stations from mixed clusters)

| Station Code (Mixed Cluster) | Updated after 30% New Data | Bayesian LRTC | Relative Improvement (%) |
|---|---|---|---|
| 17 | 0.1165 | 0.0821 | 29.59 |
| 56 | 0.0885 | 0.0649 | 26.66 |
| 54 | 0.0833 | 0.0704 | 15.38 |
| 38 | 0.1775 | 0.1521 | 14.31 |
| 35 | 0.1020 | 0.0909 | 10.89 |
| 11 | 0.1416 | 0.1326 | 6.38 |
| 84 | 0.0632 | 0.0691 | -9.38 |
| 55 | 0.1240 | 0.1366 | -10.17 |
| 51 | 0.1090 | 0.1279 | -17.38 |
| 87 | 0.1205 | 0.1609 | -33.60 |
This can be solved by conducting Bayesian LRTC within a single station cluster. We use hierarchical (agglomerative) clustering, where the distance is defined as the group-average Euclidean distance. The clustering result reflects the two types of spatial dependency observed in our data: the Law of Geography and contextual similarity. In other words, if two stations are geographically close or functionally similar, they fall into the same cluster. For example, Fig. 7 shows the land-use information of three selected stations (Stations 73, 80 and 84) in one cluster. Though Stations 73 and 84 are far from each other, they share the same land-use pattern (dominated by red, with mixed use in grey and brown). Though Stations 84 and 80 are quite different in land use (Station 84 is dominated by red, Station 80 by green), they are geographically close (since the stations are indexed consecutively, two stations with close codes are physically close).
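A sketch of this agglomerative step with SciPy; the cluster count and stand-in station features (the PCA-reduced features from Section III-B) are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

Z = np.random.default_rng(0).random((120, 5))   # stand-in station features

link = linkage(Z, method='average', metric='euclidean')  # group-average linkage
labels = fcluster(link, t=4, criterion='maxclust')       # illustrative count
```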
By selecting stations within this cluster, we compare the Bayesian LRTC within one cluster against the 2-step prediction updated with 30% new data; the result is shown in Table IV. All stations' predictions are improved, the majority by more than 20%. Most importantly, even for Stations 84 and 87, which appear in both Table III and Table IV, conducting Bayesian LRTC within one cluster improves the prediction by 30% to 40%.
TABLE IV: RES of the 2-step prediction updated with 30% new data vs. Bayesian LRTC within one cluster (stations from the same cluster)

| Station Code (Same Cluster) | Updated after 30% New Data | Bayesian LRTC within a Cluster | Relative Improvement (%) |
|---|---|---|---|
| 82 | 0.1149 | 0.0673 | 41.43 |
| 77 | 0.1775 | 0.1133 | 36.18 |
| 79 | 0.2112 | 0.1349 | 36.10 |
| 71 | 0.1109 | 0.0766 | 30.89 |
| 84 | 0.0632 | 0.0489 | 22.59 |
| 75 | 0.1593 | 0.1288 | 19.15 |
| 74 | 0.1642 | 0.1409 | 14.19 |
| 73 | 0.1773 | 0.1567 | 11.61 |
| 80 | 0.1187 | 0.1087 | 8.41 |
| 87 | 0.1205 | 0.1146 | 4.92 |
V Conclusion
In this paper, we focused on both long-term and short-term URT passenger flow prediction.
Our proposed 2-step tensor prediction, based on tensor decomposition and a time-series model, can handle both long-term and short-term prediction. In particular, the "CP Decomposition + 2D-ARMA" model achieves satisfactory long-term prediction, and the lean dynamic updating method can update the short-term prediction after receiving new data.
Our proposed 1-step tensor prediction, based on Bayesian Low-Rank Tensor Completion, can only predict the short term, but with better performance than the lean dynamic updating. To address the innate drawback of the low-rank assumption, which results in oversimplification, a tensor clustering technique is first applied, and tensor completion is then conducted within each cluster.
Acknowledgment
This research is supported by Hong Kong MTR Co., and by grants RGC GRF 16203917, GRF 16201718 and NSFC 71931006. We also give special thanks to Dr. Qibin Zhao for sharing the scripts of their methods.