Traffic prediction, i.e., forecasting the future traffic distribution in a city from its historical traffic records, plays an important role in multiple domains. It is crucial for building intelligent transportation systems that support urban planning and development. Knowing the future traffic distribution, we can prepare alternative traffic plans for congestion before it occurs.
Traffic prediction has been studied for years. In the early days, time series models, such as the autoregressive integrated moving average (ARIMA), were widely used [16, 8, 12, 9]. These models focus on extracting temporal patterns of changing traffic in one area. Later on, the spatial correlation between different areas was investigated by integrating traffic data from multiple areas to make predictions. Moreover, additional data, e.g., weather and date, was integrated into prediction models [14, 23, 15]. Recently, the introduction of neural networks has sharply improved prediction accuracy. Neural models can effectively capture spatial-temporal features from traffic data and fuse additional features from other data sources by using various techniques, e.g., convolutional neural networks (CNN), long short-term memory (LSTM), and attention mechanisms. Our work shows that there is still room for improvement by exploiting position information and feature diversity.
Position information. Most previous works focus on extracting features from the traffic data itself and neglect position information. Specifically, the position of a piece of traffic data is unknown to the model and does not influence the model parameters, i.e., different positions share the same model. This is because these methods apply image processing techniques, e.g., CNNs with kernels of shared parameters, which aim to find features common to the whole image. Unlike in image processing, the pattern differences between positions cannot be neglected in traffic prediction, and it is difficult to capture position features from the data alone. Fig. 1 shows a simple case that uses one model for two positions. Suppose that the model predicts the future value from the two most recent data points. If the position information is unknown to the model, the model gives the same predicted value for two positions whose recent data points coincide, even if their true future values differ. Thus, position information is an intrinsic property that does not depend on traffic data. Some recent works use additional information, which mitigates the problem, but they cannot fundamentally overcome the difficulty because they do not explicitly take position information into account.
Feature diversity. Recent neural network models have considered multi-source data and various spatial-temporal dependencies between future and historical traffic distributions. However, they neglect feature diversity, i.e., they use only one kind of feature. Specifically, these models make predictions based on features extracted by a single network. Neither spatial nor temporal patterns can be clearly described by one kind of feature, since these patterns may span multiple abstraction levels, different frequencies, etc. This is similar to signal representation: a set of basis functions of the same frequency cannot represent a signal composed of a number of different frequencies.
To address the two issues above, we propose a position-aware network model (PAN). The framework of PAN borrows the idea of video frame prediction. Specifically, PAN treats historical traffic distributions as a series of images. Based on these images, PAN generates a new image as the predicted traffic distribution. This framework allows PAN to extract spatial-temporal features of all positions at the same time, which helps it learn comprehensive features.
Under the framework, we design a position embedding mechanism based on representation learning to capture the position information. Our position embedding is inspired by word position embedding in natural language processing (NLP), which distinguishes the different semantics of one word in different positions of a sentence. All positions are embedded into a vector space. The embedding vectors represent the features of each position and are learned together with the prediction model. Different from the usage in NLP, we use position embedding not only in the input layer but also in convolutional layers to modify the representation of input features. By utilizing position information, PAN can provide different prediction patterns for different positions with less information.
We employ position-aware inception blocks to model spatial-temporal dependencies and extract diverse features. The network structure is based on the inception network. In each position-aware inception block, we design multiple convolutional layers with different kernel sizes and depths. Different convolutional layers can extract different kinds of dependencies between input features and output features. Meanwhile, position embedding is added to the convolutional layers. These position-aware features of different kinds make PAN more expressive.
Our contributions are summarized as follows:
- We propose a new position-aware network that captures position information and promotes feature diversity, two particular properties of traffic prediction that are neglected by most previous studies.
- We design a position embedding mechanism that learns position features jointly with the training of the prediction model. Embedding vectors are fused into multiple layers to build different patterns for different positions.
- We employ position-aware inception blocks, which use different convolutional sub-networks with position embedding in parallel to promote feature diversity.
- We evaluate our model on two real-world traffic datasets. The results show that our model requires less data and outperforms other state-of-the-art methods.
The rest of this paper is organized as follows. We first summarize recent studies of traffic prediction in Section 2. Then, we formulate the problem of traffic prediction in Section 3. Detailed design of the PAN model is presented in Section 4. We evaluate the performance of PAN on two real-world datasets in Section 5. Finally, we draw the conclusion in Section 6.
2 Related Work
In recent years, traffic prediction has gained increasing attention in the machine learning and data mining communities. Numerous studies on how to properly utilize traffic-related data and accurately forecast the future traffic flow distribution have been proposed and have successively achieved state-of-the-art performance.
Early works [16, 8, 12, 9, 14, 23, 15, 6, 28, 2, 20] utilized time series methods based on statistical learning and classic machine learning. Shekhar et al. proposed using the autoregressive integrated moving average (ARIMA) for traffic prediction. Li et al. improved ARIMA to forecast the spatial-temporal variation of passengers in a hotspot. Moreira-Matias et al. aggregated streaming information and ensembled three time series forecasting techniques to produce a prediction. Lippi et al. proposed using a support vector regression model combined with a seasonal kernel to measure the similarity between time series examples. Some studies [14, 23, 15] extended the prediction with external data, such as venue types, weather conditions, and event information. Moreover, some methods also embedded spatial information into the models and obtained promising results [6, 28, 2, 20]. However, these methods require the data to satisfy certain assumptions or need careful feature engineering. Thus, they usually cannot model highly complex data and perform poorly in practice.
Recently, with the rapid development of deep learning, deep learning based methods have surpassed traditional time series methods in many respects. In some studies, the traffic distribution of the entire city is treated as an image. For example, a CNN can be directly applied to images of traffic speed for speed prediction. To capture more complex features and increase the depth of neural networks, residual networks have been applied to traffic flow prediction [27, 26]. Although residual networks perform well in image processing, applying them to traffic prediction requires considering the characteristics of the problem. Some other works use traffic data of neighboring areas to predict the future traffic of the central area. Most of these works employ both convolutional neural networks (CNN) and long short-term memory networks (LSTM) to capture spatial and temporal dependencies, respectively. For example, Yao et al. apply an LSTM to integrate the outputs of CNNs. Zhou et al. employ a convolutional LSTM, which uses convolution operations in LSTM units, to predict passenger demand. Yao et al. consider the dynamic similarity between locations and propose an attention mechanism for LSTM-connected CNNs. Using neighboring traffic data reduces the interference of low-correlation data on prediction results. Meanwhile, we note that additional information, e.g., date, weather, and traffic flow direction, is widely used in most recent deep learning based models [26, 25, 24]. Some studies focus on traffic flow on roads or traffic at separate nodes, which can be modeled as graphs. In this scenario, graph neural networks (GNN) are used. For example, Wang et al. apply a GNN to model spatial features based on in-cell and inter-cell data traffic. Guo et al. add an attention mechanism to a GNN to capture dynamic spatial-temporal correlations for road traffic flow prediction. This is a different kind of traffic prediction task. Thus, we do not discuss GNN-based models in this paper.
3 Problem Formulation
Many vehicles record and report their states while moving, e.g., trip starting or ending, moving or stopping, etc. Suppose there are $K$ kinds of states, each reported with geographic coordinates. To depict the overall traffic situation, we partition the whole city into $I \times J$ small square cells and divide time into timeslots. Then, we define the traffic distribution at timeslot $t$ as $X_t \in \mathbb{R}^{I \times J \times K}$, where $x_t^{i,j,k}$ is the number of occurrences of state $k$ in cell $(i, j)$ at timeslot $t$. Accordingly, a temporal sequence of $T$ past timeslots forms the spatial-temporal demand distribution. With this distribution, the traffic prediction problem can be formulated as finding a model $f$ that minimizes the prediction error under an error metric $\mathcal{E}$, i.e.,
$$f^{*} = \arg\min_{f} \mathcal{E}\big(f(X_{t-T+1}, \dots, X_{t}),\ X_{t+1}\big)$$
where the input of $f$ is the $T$-length temporal sequence of recent traffic distributions $X_{t-T+1}, \dots, X_{t}$. Moreover, the prediction model predicts the whole future distribution $X_{t+1}$ simultaneously.
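The gridding step described above can be illustrated with a short sketch. It assumes each report is a tuple of a timestamp in seconds, a latitude, a longitude, and a state index; the function name, the record layout, and the rectangular bounding box are illustrative assumptions, not details given in the paper.

```python
def traffic_distribution(records, bounds, n_rows, n_cols, n_states,
                         n_slots, slot_seconds):
    """Count state reports per (timeslot, cell, state).

    records: iterable of (timestamp_seconds, lat, lon, state_index) tuples
             (hypothetical layout chosen for this sketch).
    bounds:  (lat_min, lat_max, lon_min, lon_max) of the gridded area.
    Returns a nested list X with X[t][i][j][k] = number of reports of
    state k in cell (i, j) during timeslot t.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    X = [[[[0] * n_states for _ in range(n_cols)]
          for _ in range(n_rows)] for _ in range(n_slots)]
    for ts, lat, lon, state in records:
        t = int(ts // slot_seconds)
        # Skip reports outside the time window or the gridded area.
        if not (0 <= t < n_slots and lat_min <= lat < lat_max
                and lon_min <= lon < lon_max):
            continue
        # Map coordinates to a cell index; min() guards against rounding.
        i = min(n_rows - 1, int((lat - lat_min) / (lat_max - lat_min) * n_rows))
        j = min(n_cols - 1, int((lon - lon_min) / (lon_max - lon_min) * n_cols))
        X[t][i][j][state] += 1
    return X
```

Each `X[t]` then plays the role of one traffic distribution, and a model would consume a window of consecutive `X[t]` to predict the next one.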
4 Position-Aware Network
backbone network. The input integrates the historical traffic distribution and the position embedding features. The position-aware spatial-temporal inception (PASTI) blocks are combined with residual connections. In each PASTI, multiple position-aware convolutional modules (PAC) of different kinds are integrated in parallel to extract different kinds of features. To adapt to the traffic prediction problem, we add position embedding layers (PE) followed by convolutional layers in each PAC. Moreover, we use dropout instead of batch normalization to avoid overfitting, since normalization keeps the relative values of features but destroys their absolute values, which are important for prediction tasks.
4.1 Input Constructing
Based on common sense, traffic series exhibit three obvious correlations: recent, daily, and weekly. The correlated traffic distributions are the most helpful for predicting the future distribution. Recent correlation indicates that the future traffic depends on recent traffic, i.e., the future traffic distribution is similar to the distributions of recent hours. We define the sequence of recent traffic distributions as
$$X^{r}_{t} = [X_{t-l_r+1}, X_{t-l_r+2}, \dots, X_{t}]$$
where the hyper-parameter $l_r$ is the number of recent timeslots. The length of $X^{r}_{t}$ is $l_r$.
Daily correlation indicates that the future traffic is related to the traffic at the same time in past days, i.e., today's traffic distribution is similar to yesterday's. We define the sequence of traffic distributions in recent days as
$$X^{d}_{t} = [X_{t+1-l_d T_d}, X_{t+1-(l_d-1) T_d}, \dots, X_{t+1-T_d}]$$
where the hyper-parameter $l_d$ is the number of recent days and the constant $T_d$ is the number of timeslots in one day. Since the periods of recent days that have similar traffic distributions may deviate, we extend $X^{d}_{t}$ by adding timeslots adjacent to these daily timeslots.
Weekly correlation indicates that the future traffic is related to the traffic at the same time in past weeks, i.e., today's traffic distribution is similar to that of the same day in past weeks. We define the sequence of traffic distributions in recent weeks as
$$X^{w}_{t} = [X_{t+1-l_w T_w}, X_{t+1-(l_w-1) T_w}, \dots, X_{t+1-T_w}]$$
where the hyper-parameter $l_w$ is the number of recent weeks and the constant $T_w$ is the number of timeslots in one week. The length of $X^{w}_{t}$ is $l_w$.
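We concatenate all distributions in the three sequences along the last dimension as the input of our model at timeslot $t$:
$$X^{in}_{t} = \mathrm{Concat}(X^{r}_{t}, X^{d}_{t}, X^{w}_{t})$$
where $X^{r}_{t}$, $X^{d}_{t}$, and $X^{w}_{t}$ denote the recent, daily, and weekly sequences, respectively.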
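The input construction can be sketched as plain index selection followed by channel-wise concatenation. The sketch assumes that the model at timeslot `t` predicts timeslot `t + 1` and that daily and weekly frames are taken at exactly the same time of day. It omits the extension of the daily sequence with additional timeslots, and all function and parameter names are illustrative.

```python
def build_input(frames, t, l_r, l_d, l_w, slots_per_day):
    """Select recent, daily, and weekly frames and concatenate them.

    frames[s] is a rows x cols grid whose cells are lists of K state counts.
    Returns (model_input, indices), where model_input has the same grid
    shape and (l_r + l_d + l_w) * K channels per cell.
    """
    slots_per_week = 7 * slots_per_day
    target = t + 1  # assumed prediction target
    recent = [target - i for i in range(l_r, 0, -1)]
    daily = [target - i * slots_per_day for i in range(l_d, 0, -1)]
    weekly = [target - i * slots_per_week for i in range(l_w, 0, -1)]
    indices = recent + daily + weekly
    if min(indices) < 0:
        raise ValueError("not enough history for the requested lengths")
    rows, cols = len(frames[0]), len(frames[0][0])
    # Concatenate the selected frames along the last (channel) dimension.
    model_input = [[[v for s in indices for v in frames[s][i][j]]
                    for j in range(cols)] for i in range(rows)]
    return model_input, indices
```

With 30-minute timeslots, `slots_per_day` would be 48, so one weekly step reaches back 336 timeslots.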
4.2 Position Embedding
Inspired by word position embedding in natural language processing, we employ representation learning to generate feature vectors of positions. Positions are embedded into a vector space. Each embedding vector represents the information of its corresponding position and is learned together with the other parts of PAN. Since PAN processes the entire distribution at the same time, the embedding vectors of all positions in a grid of $I \times J$ cells can be denoted as $E \in \mathbb{R}^{I \times J \times d}$, where $d$ is the length of one embedding vector. Then, we build a position embedding layer that fuses the input features and the position information:
$$\mathrm{PE}(X) = X + E_{C}$$
where the embedding vectors $E_{C} \in \mathbb{R}^{I \times J \times C}$ have the same shape as $X$ and $C$ is the number of channels of $X$. Here, we follow the fusing approach in BERT. We have tried other fusing approaches such as multiplication and concatenation; however, sum fusing achieves the best results.
We use PEs not only at the input but also in multiple parts of PAN. Since features in different parts have different meanings, these PEs should not share a common $E$. Thus, we give each PE independent parameters $E$.
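A minimal sketch of sum fusing follows, assuming one embedding vector per grid cell with as many entries as the input has channels. The class name, the random initialisation range, and the seed handling are illustrative assumptions; in a real model the embedding would be a trainable parameter updated by backpropagation.

```python
import random


class PositionEmbedding:
    """Holds one embedding vector per cell and adds it to the features."""

    def __init__(self, rows, cols, channels, seed=0):
        rng = random.Random(seed)
        # E[i][j] is the embedding vector of cell (i, j). Here it is only
        # randomly initialised; training would update these values.
        self.E = [[[rng.uniform(-0.1, 0.1) for _ in range(channels)]
                   for _ in range(cols)] for _ in range(rows)]

    def __call__(self, X):
        # Sum fusing: the output has the same shape as X.
        return [[[x + e for x, e in zip(X[i][j], self.E[i][j])]
                 for j in range(len(X[0]))] for i in range(len(X))]


# Two layers get independent embeddings, so they do not share parameters.
pe_input = PositionEmbedding(rows=2, cols=2, channels=3, seed=1)
pe_inner = PositionEmbedding(rows=2, cols=2, channels=3, seed=2)
```

Because the embedding differs per cell, two cells with identical input features no longer look identical to the layers that follow.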
4.3 Position-Aware Convolutional Modules
To capture spatial and temporal features for different positions, we build position-aware convolutional modules (PAC). A PAC is a stack of PAC units. A PAC unit (PACu) is composed of three components: position embedding, 2D convolution, and ReLU activation. Given input features $X$, PACu transforms $X$ by the following formulation:
$$\mathrm{PACu}_{k,f}(X) = \mathrm{ReLU}\big(\mathrm{Conv}_{k,f}(\mathrm{PE}(X))\big)$$
where PACu has two hyper-parameters, $k$ and $f$, respectively indicating the kernel size and the number of filters (output channels) of the convolutional layer in PACu. $\mathrm{Conv}_{k,f}$ is a convolution with kernels $W$ and bias $b$. Meanwhile, $W$ and $b$ are learned parameters in PACu. $\mathrm{ReLU}$ is an element-wise function and does not change the shape of the input features.
Then, we define PAC as
$$\mathrm{PAC}_{n,k,f}(X) = \underbrace{\mathrm{PACu}_{k,f}(\cdots\mathrm{PACu}_{k,f}(X)\cdots)}_{n \text{ units}}$$
where a $\mathrm{PAC}$ has three hyper-parameters: $n$ is the number of PACu in the PAC, and $k$ and $f$ are defined as in PACu. The depth and width of a $\mathrm{PAC}$ can be adjusted by setting $n$ and $f$. This lets PAN capture different kinds of features.
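The following sketch shows one way a PAC unit and a PAC stack could be computed, assuming the order position embedding, then convolution, then ReLU, with zero padding so the grid size is preserved and an odd kernel size. The helper names and the nested-list tensor layout are illustrative, and the loops favour readability over speed.

```python
def conv2d_same(X, W, b):
    """2D convolution with zero padding (odd kernel size assumed).

    X: rows x cols x c_in, W: k x k x c_in x c_out, b: list of c_out biases.
    """
    rows, cols, k = len(X), len(X[0]), len(W)
    c_in, c_out, pad = len(W[0][0]), len(b), len(W) // 2
    out = [[list(b) for _ in range(cols)] for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < rows and 0 <= jj < cols:
                        for c in range(c_in):
                            x = X[ii][jj][c]
                            for o in range(c_out):
                                out[i][j][o] += x * W[di][dj][c][o]
    return out


def pac_unit(X, E, W, b):
    """One PAC unit: add position embedding E, convolve, apply ReLU."""
    fused = [[[x + e for x, e in zip(X[i][j], E[i][j])]
              for j in range(len(X[0]))] for i in range(len(X))]
    conv = conv2d_same(fused, W, b)
    return [[[max(0.0, v) for v in cell] for cell in row] for row in conv]


def pac(X, units):
    """A PAC: a stack of units, each with its own (E, W, b) parameters."""
    for E, W, b in units:
        X = pac_unit(X, E, W, b)
    return X
```

Even with a constant input and a shared kernel, the output of `pac_unit` varies across cells because each cell receives its own embedding before the convolution.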
4.4 Position-Aware Spatial-Temporal Inception Blocks
To increase the diversity of features, we compose multiple PACs into a position-aware spatial-temporal inception block (PASTI). In this paper, we select three kinds of PACs: , , and . And the numbers of the three kinds are , , and , respectively. Then, we define PASTI as
where $C$ is the number of channels of the input features $X$. $\mathrm{Concat}$ concatenates all outputs of the PACs along the last dimension. $\mathrm{Dropout}$ is used during training to address the overfitting problem of deep neural networks. $\odot$ is the element-wise product. $M$ is a random binary matrix in which a zero indicates dropping a neural unit. The conventional approach to address overfitting in convolutional layers is batch normalization. However, we find that batch normalization of features decreases the performance of PAN.
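Deep neural networks are good at capturing features but are difficult to train. Thus, we apply residual connections in PASTIs as follows:
$$X^{(l+1)} = X^{(l)} + \mathrm{PASTI}(X^{(l)})$$
where $X^{(l)}$ denotes the input features of the $l$-th PASTI.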
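A rough sketch of how such a block could be wired follows: parallel branches, concatenation along the channel dimension, dropout with a random binary mask, and a residual connection. The `project` step that maps the concatenated channels back to the input channel count is an assumption made so that the residual sum is well-defined, as is the inference-time scaling by the keep probability. The branches are passed in as plain callables standing in for PACs.

```python
import random


def concat_channels(outputs):
    """Concatenate several rows x cols x c tensors along the last dimension."""
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[[v for out in outputs for v in out[i][j]]
             for j in range(cols)] for i in range(rows)]


def dropout(X, rate, rng, training):
    if not training:
        # Assumed inference rule (classic dropout): scale by keep probability.
        return [[[v * (1.0 - rate) for v in cell] for cell in row] for row in X]
    # Training: multiply element-wise by a random binary mask; 0 drops a unit.
    return [[[v if rng.random() >= rate else 0.0 for v in cell]
             for cell in row] for row in X]


def pasti_block(X, branches, project, rate, rng, training=True):
    """Residual inception-style block built from parallel branches.

    branches: callables standing in for PACs of different kinds.
    project:  callable mapping the concatenated features back to the
              channel count of X (hypothetical, e.g. a 1x1 convolution).
    """
    Z = concat_channels([branch(X) for branch in branches])
    Z = dropout(Z, rate, rng, training)
    Y = project(Z)
    # Residual connection: add the block output to its input.
    return [[[x + y for x, y in zip(X[i][j], Y[i][j])]
             for j in range(len(X[0]))] for i in range(len(X))]
```

Stacking several such blocks amounts to calling `pasti_block` repeatedly on its own output, each time with a separate set of branch parameters.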
4.5 Loss Function and Training
To obtain the final prediction results, we employ a after the last PASTI as shown in Fig. 2. The output of PAN is defined as
Usually, the mean absolute percentage error (MAPE) and the root mean square error (RMSE) are used to measure the accuracy of prediction results. MAPE gives higher weights to errors whose true values are small, while RMSE gives higher weights to larger errors. Thus, we combine the two error measurements in our loss:
where $\|\cdot\|_1$ and $\|\cdot\|_2$ are the 1-norm and 2-norm of a matrix, respectively.
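One plausible way to combine a squared-error term with a relative-error term is sketched below. The exact form and weighting used by the authors are not recoverable from the text, so the weight `lam`, the stabiliser `eps`, and the averaging over elements are all assumptions of this sketch.

```python
def combined_loss(y_true, y_pred, lam=1.0, eps=1e-6):
    """Squared-error term plus a weighted relative-error term.

    y_true, y_pred: flat lists of ground-truth and predicted values.
    lam: hypothetical weight of the relative (MAPE-like) term.
    eps: hypothetical stabiliser that avoids division by zero.
    """
    n = len(y_true)
    # Squared 2-norm of the error, averaged (RMSE-like, favours large errors).
    squared = sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / n
    # 1-norm of the relative error, averaged (MAPE-like, favours small targets).
    relative = sum(abs(p - y) / (abs(y) + eps)
                   for y, p in zip(y_true, y_pred)) / n
    return squared + lam * relative
```

Raising `lam` shifts the emphasis toward cells with low traffic, while `lam = 0` reduces the loss to a plain mean squared error.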
5.1.1 Dataset description
We evaluate our model against the baselines on two public real-world datasets from New York City: BikeNYC and TaxiNYC. For comparison, we select the same datasets used in previous work and adopt the same settings. The two datasets contain trips of renting bikes or taking taxis. There are two states of a trip: start and end. Each trip records the time and the coordinates of its start and end. The details of the two datasets are shown in Table 1. The whole city is partitioned into regions with the size of around . Each day is split into 48 timeslots and the length of each timeslot is 30 minutes. Both datasets contain trip data of 60 days. In this paper, we use the first 40 days as the training set and the last 20 days as the test set.
|Starting time||Ending time||# trips||# states||Area size|
| Dataset | Training set | Test set | Time interval |
|---|---|---|---|
| BikeNYC | 2016.07.01 - 2016.08.09 | 2016.08.09 - 2016.08.29 | 30 min |
| TaxiNYC | 2015.01.01 - 2015.02.10 | 2015.02.11 - 2015.03.01 | 30 min |
5.1.2 Evaluation Metric
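We use the root mean square error (RMSE) and the mean absolute percentage error (MAPE), the two most common metrics, to compare our model with the baselines. For each kind of state, we compute the errors separately. Given the prediction results $\hat{y}_n$ and the corresponding ground truth $y_n$, $n = 1, \dots, N$, collected over all cells from timeslot $t_1$ to timeslot $t_2$, the two metrics are defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}(\hat{y}_n - y_n)^2}, \qquad \mathrm{MAPE} = \frac{1}{N}\sum_{n=1}^{N}\frac{|\hat{y}_n - y_n|}{y_n}.$$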
In the two datasets, there are two volumes to predict: the number of trips starting (Start) and the number of trips ending (End). Meanwhile, we follow the filtering settings in [24, 25]. Samples with volume values less than 10 are filtered out, since low traffic is of little interest in real-world applications. Moreover, predictions on low traffic usually have small RMSE and can make MAPE ill-defined (a small value as the denominator).
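The two metrics and the filtering rule can be sketched as follows. The function name and the flat-list input format are illustrative; the threshold of 10 comes from the filtering setting described above.

```python
import math


def rmse_mape(y_true, y_pred, min_volume=10):
    """RMSE and MAPE over samples whose true volume is at least min_volume."""
    kept = [(y, p) for y, p in zip(y_true, y_pred) if y >= min_volume]
    if not kept:
        raise ValueError("no samples left after filtering")
    n = len(kept)
    rmse = math.sqrt(sum((p - y) ** 2 for y, p in kept) / n)
    mape = sum(abs(p - y) / y for y, p in kept) / n
    return rmse, mape


# The first sample (true volume 5) is filtered out before scoring.
rmse, mape = rmse_mape([5, 10, 20], [0, 12, 16])
```

Filtering before scoring also keeps the MAPE denominator away from zero.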
5.1.3 Baselines
We compare PAN with 11 baselines, including both traditional approaches and recent deep neural network models. The baselines are historical average (HA), autoregressive integrated moving average (ARIMA), ridge regression (Ridge), LinUOTD, XGBoost, multilayer perceptron (MLP), convolutional LSTM (ConvLSTM), DeepSD, deep spatio-temporal residual networks (ST-ResNet), deep multi-view spatial-temporal network (DMVST-Net), and spatial-temporal dynamic network (STDN).
5.1.4 Hyperparameter Settings
To construct the input series, we set the lengths of the recent, daily, and weekly series as , , and . Before training, we normalize the traffic data to by Min-Max normalization. The prediction results are transformed back to the original scale for evaluation. In PAN, the number of PASTIs is set to 10. In each PASTI, the numbers of the three PACs are respectively set as , , and . And their numbers of filters are set as , , and . The number of filters in other convolutional layers is set to 256. The dropout rate is set to 0.5. In training, the batch size is set to 32 and the learning rate is set to 0.00001. Moreover, we use the same hyperparameter settings for training on the two datasets.
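The normalization and its inverse can be sketched as below. The target range is not recoverable from the text, so this sketch assumes the range [0, 1]; the class and method names are illustrative.

```python
class MinMax:
    """Min-Max scaling to [0, 1] (assumed range) and its inverse."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        if self.hi == self.lo:
            raise ValueError("cannot scale constant data")
        return self

    def transform(self, values):
        return [(v - self.lo) / (self.hi - self.lo) for v in values]

    def inverse_transform(self, values):
        # Map model outputs back to the original traffic volumes.
        return [v * (self.hi - self.lo) + self.lo for v in values]


# Fit on training volumes only, then reuse the same statistics at test time.
scaler = MinMax().fit([0.0, 50.0, 100.0])
```

Predictions would be passed through `inverse_transform` before RMSE and MAPE are computed, so that the reported errors are in trip counts.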
5.2.1 Prediction Performance
We compare PAN with all baselines using the RMSE and MAPE metrics. As we use the same datasets and evaluation strategy, we directly reuse the results of several baselines from the literature.
Evaluation results on TaxiNYC are shown in Table 2. PAN significantly outperforms all baselines on TaxiNYC. In particular, PAN dramatically improves the RMSE of End prediction. Meanwhile, the MAPE improvement of End prediction is small. This indicates that it is difficult to balance MAPE and RMSE. Considering that MAPE gives high weights to small ground-truth values and RMSE gives small weights to small errors, we infer that PAN works better at predicting high traffic volumes than low traffic volumes. From the results of all baselines, we can see a large performance gap between traditional time series models and recent neural network models. However, the performance differences among neural network models are small. Comparing the results of all models, we find that it is more difficult to improve MAPE than RMSE. Moreover, recent neural network models introduce additional information such as date, weather, and volumes of traffic flow from one area to another, whereas our model only uses traffic distribution information.
Evaluation results on BikeNYC are shown in Table 3. PAN significantly outperforms all baselines on BikeNYC except for the MAPE of End prediction. This is the price of the low RMSE of End prediction. The results of Start prediction show the opposite: a small improvement in RMSE and a large improvement in MAPE. Comparing all results on BikeNYC and TaxiNYC, the RMSEs on TaxiNYC are higher than those on BikeNYC. This means that the values of Start and End traffic in TaxiNYC are much larger. Thus, we can infer that PAN performs better for high traffic than for low traffic. All models show the same phenomenon, since all MAPEs on TaxiNYC are better than those on BikeNYC.
5.2.2 Model Analysis
In this section, we study the influence of the position-aware mechanism and diverse features. We design two simplified versions of PAN:
PAN_NoPAC: Use traditional convolutional layers instead of PAC.
PAN_OnePAC: Use only one kind of PAC in each PASTI.
We evaluate them together with PAN. The results are shown in Fig. 5. We can see that the performance of both simplified versions decreases on both datasets. This indicates the effectiveness of position information and diverse features.
6 Conclusion
In this paper, we propose a position-aware network model for traffic prediction. We employ a position embedding mechanism to extract the intrinsic information of positions. With position embedding, we design position-aware spatial-temporal inception blocks to capture different kinds of features. Position information enables the model to behave differently for different positions. Various kinds of features extend the expressiveness of the model. Experimental results on two public real-world datasets show that our model achieves remarkable improvements over the baselines without using additional information.
References
[1] Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. pp. 785–794 (2016)
[2] Deng, D., Shahabi, C., Demiryurek, U., Zhu, L., Yu, R., Liu, Y.: Latent space model for road networks to predict time-varying traffic. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1525–1534. ACM (2016)
[3] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)
[4] Guo, S., Lin, Y., Feng, N., Song, C., Wan, H.: Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, USA (2019)
[5] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 770–778 (2016)
[6] Idé, T., Sugiyama, M.: Trajectory regression on road networks. In: Twenty-Fifth AAAI Conference on Artificial Intelligence (2011)
[7] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
[8] Li, X., Pan, G., Wu, Z., Qi, G., Li, S., Zhang, D., Zhang, W., Wang, Z.: Prediction of urban human mobility using large-scale taxi traces and its applications. Frontiers of Computer Science 6(1), 111–121 (2012)
[9] Lippi, M., Bertini, M., Frasconi, P.: Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems 14(2), 871–882 (2013)
[10] Ma, X., Dai, Z., He, Z., Ma, J., Wang, Y., Wang, Y.: Learning traffic as images: A deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17(4), 818 (2017)
[11] Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (2016)
[12] Moreira-Matias, L., Gama, J., Ferreira, M., Mendes-Moreira, J., Damas, L.: Predicting taxi–passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems 14(3), 1393–1402 (2013)
[13] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel. pp. 807–814 (2010)
[14] Pan, B., Demiryurek, U., Shahabi, C.: Utilizing real-world transportation data for accurate traffic prediction. In: 2012 IEEE 12th International Conference on Data Mining. pp. 595–604. IEEE (2012)
[15] Rong, L., Cheng, H., Wang, J.: Taxi call prediction for online taxicab platforms. In: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data. pp. 214–224. Springer (2017)
[16] Shekhar, S., Williams, B.M.: Adaptive seasonal time series models for forecasting short-term traffic flow. Transportation Research Record 2024(1), 116–125 (2007)
[17] Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., Woo, W.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. pp. 802–810 (2015)
[18] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
[19] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pp. 4278–4284 (2017)
[20] Tong, Y., Chen, Y., Zhou, Z., Chen, L., Wang, J., Yang, Q., Ye, J., Lv, W.: The simpler the better: a unified approach to predicting original taxi demands based on large-scale online platforms. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1653–1662. ACM (2017)
[21] Wang, D., Cao, W., Li, J., Ye, J.: DeepSD: Supply-demand prediction for online car-hailing services using deep neural networks. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017. pp. 243–254 (2017)
[22] Wang, X., Zhou, Z., Yang, Z., Liu, Y., Peng, C.: Spatio-temporal analysis and prediction of cellular traffic in metropolis. In: 25th IEEE International Conference on Network Protocols, ICNP 2017, Toronto, ON, Canada, October 10-13, 2017. pp. 1–10 (2017)
[23] Wu, F., Wang, H., Li, Z.: Interpreting traffic dynamics using ubiquitous urban data. In: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. p. 69. ACM (2016)
[24] Yao, H., Tang, X., Wei, H., Zheng, G., Li, Z.: Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, USA (2019)
[25] Yao, H., Wu, F., Ke, J., Tang, X., Jia, Y., Lu, S., Gong, P., Ye, J., Li, Z.: Deep multi-view spatial-temporal network for taxi demand prediction. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. pp. 2588–2595 (2018)
[26] Zhang, J., Zheng, Y., Qi, D.: Deep spatio-temporal residual networks for citywide crowd flows prediction. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pp. 1655–1661 (2017)
[27] Zhang, J., Zheng, Y., Qi, D., Li, R., Yi, X.: DNN-based prediction model for spatio-temporal data. In: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2016, Burlingame, California, USA, October 31 - November 3, 2016. pp. 92:1–92:4 (2016)
[28] Zheng, J., Ni, L.M.: Time-dependent trajectory regression on road networks via multi-task learning. In: Twenty-Seventh AAAI Conference on Artificial Intelligence (2013)
[29] Zheng, Y., Capra, L., Wolfson, O., Yang, H.: Urban computing: Concepts, methodologies, and applications. ACM TIST 5(3), 38:1–38:55 (2014)
[30] Zhou, X., Shen, Y., Zhu, Y., Huang, L.: Predicting multi-step citywide passenger demands using attention-based neural networks. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018. pp. 736–744 (2018)