Real-time traffic state prediction is an essential component of traffic control and management in an urban road network. The ability to predict future traffic states (e.g., flow, speed) can help improve traffic conditions, fleet organization, utilization rates, and social welfare [ke2017short, yao2018deep]. Essentially, traffic prediction is a time-series problem, performed based on changes in historical demand. A representative time-series prediction tool is the recurrent neural network (RNN), along with its diverse variants [liu2019deeppf, fu2016using]. Apart from the temporal dimension, correlation in the spatial dimension is also extensively incorporated by many works: regions that are close to each other or share similar land-use structures may exhibit homogeneous demand patterns. Techniques widely applied in computer vision, such as the convolutional neural network (CNN) [yao2018deep, zhang2017deep] and the emerging graph-based networks [geng2019spatiotemporal, li2017diffusion, Pan:2019:UTP:3292500.3330884, yu2018spatio], are often adopted. Furthermore, some works introduce multi-source data to account for external influencing factors, such as weather conditions and neighboring points of interest [koesdwiady2016improving, liao2018deep].
2 Data Description and Problem Definition
The organizer provides industrial-scale, real-world data for three entire cities (Berlin, Istanbul, and Moscow) over a one-year period [pmlr-v123-kreil20a]. The organizer divides each city into a grid, where each pixel represents a fixed-size region. The competition dataset comprises dynamic data (e.g., traffic speed) and static data (e.g., the number of entertainment amenities). The dynamic dataset contains 288 frames per day, each frame representing information aggregated over five minutes into nine channels, including traffic volume, speed, and an aggregated incident-level channel (a higher value indicates a more severe incident) in each ordinal direction (e.g., NW, NE, SW, SE). The static dataset describes the locations of road junctions and points of interest, such as food and drink, shopping, parking, and transit. The volume, speed, and incident-level data are scaled with a min-max scaler, and missing values are represented by 0.
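As a rough sketch of how one day of dynamic data might be laid out and normalized (the grid size `H` x `W` here is a placeholder, since the pixel dimensions are not specified in this section):

```python
import numpy as np

def min_max_scale(x, eps=1e-9):
    """Scale an array into [0, 1]; in the dataset, missing values are 0."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + eps)

# One day of dynamic data: 288 five-minute frames over an H x W grid with
# nine channels (volume/speed per ordinal direction plus incident level).
H, W = 16, 16  # placeholder grid size
day = np.random.rand(288, H, W, 9).astype(np.float32)
scaled = min_max_scale(day)
assert 0.0 <= scaled.min() and scaled.max() <= 1.0
```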
Problem: This challenge is a multi-task learning problem: use the given historical data to predict traffic volume, speed, and incident level in each direction. Pixel-wise mean squared error (MSE) is used to evaluate and rank the submitted predictions.
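As a minimal illustration of the metric (the shapes below are toy values, not the competition's):

```python
import numpy as np

def pixelwise_mse(pred, target):
    """Squared error averaged over all frames, pixels, and channels."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))

# Toy example: 3 predicted frames of a 2x2 grid with 9 channels.
pred = np.zeros((3, 2, 2, 9))
target = np.full((3, 2, 2, 9), 0.1)
print(pixelwise_mse(pred, target))  # ~1e-2
```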
In this section, we introduce the main technical details of our solution along with some experimental results.
3.1.1 Choice of Model Architecture
In the 2019 Traffic4cast Challenge, U-NET [ronneberger2015u] was adopted as the basic model. This year, we introduce HR-NET [wang2020deep] (see Figure 1) to the competition; HR-NET is an advanced network architecture for image segmentation that has demonstrated extraordinary performance in many tasks.
Table 1 compares the two basic models. In general, HR-NET outperforms U-NET; therefore, HR-NET is adopted as the backbone architecture in our final solution, and most experimental results in the following sections are based on HR-NET.
3.1.2 Choice of Hidden Layer Activation Function
The commonly used ReLU activation function is not optimal for many tasks, so we experimented with some alternatives. Given limited memory, we focused on in-place activation functions only. Table 2 lists the hidden activation functions we tested. ELU surprisingly outperformed every other activation function we tried; therefore, we used ELU as the hidden-layer activation function in most places in the final solution.
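For reference, ELU differs from ReLU only on negative inputs, where it is smooth and bounded below (in PyTorch, the in-place variant is `nn.ELU(inplace=True)`, a drop-in replacement for `nn.ReLU(inplace=True)`); a plain-Python sketch:

```python
import math

def relu(x):
    return max(0.0, x)

def elu(x, alpha=1.0):
    # Positive inputs pass through unchanged; negative inputs saturate
    # smoothly toward -alpha instead of being clipped to zero.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(relu(-2.0), round(elu(-2.0), 4))  # 0.0 -0.8647
```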
Essentially, this task is a time-series prediction problem. For one-dimensional time-series prediction, a typical solution is to transform the problem into a supervised learning problem through feature engineering. In addition to the given spatiotemporal channels and nine fixed spatial channels as inputs, we also incorporated and evaluated extra features (e.g., periodic features and holiday features). In summary, considering the decrease in MSE loss, we introduce three different types of features in this section.
3.2.1 Periodic features
The similarity between the traffic states of two different days can be attributed to the periodic characteristics of traffic states, which typically repeat every 24 hours. In total, we use daily average statistics of the predicted day. Since the training set and test set are split based on time, we cannot obtain a full day of observations during testing. To bridge this gap between training and testing, we randomly sample 1-4 periods (12-48 time steps) within a day and use them to estimate the average daily statistics.
| | Berlin | Istanbul | Moscow |
|---|---|---|---|
| Without Periodic Features | 1.3040e-3 | 0.9173e-3 | 1.3807e-3 |
| With Periodic Features | 1.2980e-3 | 0.9165e-3 | 1.3764e-3 |
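A sketch of the sampling scheme described above (we assume each period consists of 12 contiguous time steps; contiguity is our assumption, not stated in the text):

```python
import numpy as np

def estimate_daily_average(day_frames, rng):
    """Estimate per-pixel daily averages from a partial day.

    At test time a full day is unavailable, so we sample 1-4 periods of
    12 time steps each (12-48 steps total) and average over them.
    """
    T = day_frames.shape[0]              # e.g. 288 five-minute frames
    n_periods = int(rng.integers(1, 5))  # 1-4 periods
    steps = set()
    for _ in range(n_periods):
        start = int(rng.integers(0, T - 12 + 1))
        steps.update(range(start, start + 12))
    return day_frames[sorted(steps)].mean(axis=0)

rng = np.random.default_rng(0)
day = rng.random((288, 4, 4, 9))
avg = estimate_daily_average(day, rng)
assert avg.shape == (4, 4, 9)
```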
3.2.2 Time, Weekday and Holiday Features
Time, weekday, and holiday features are definitely useful in traffic flow prediction tasks. We used a two-dimensional vector to represent the time of day by projecting it onto a unit circle, a one-hot vector to represent the weekday, and a Boolean value to represent the holiday. Holiday information was obtained from www.officeholidays.com. Table 4 shows the strong validity of the holiday features, which is in line with our expectations.
| | Berlin | Istanbul | Moscow |
|---|---|---|---|
| Without Holiday Features | 1.2980e-3 | 0.9165e-3 | 1.3764e-3 |
| With Holiday Features | 1.2937e-3 | 0.9100e-3 | 1.3705e-3 |
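The time, weekday, and holiday encodings described above can be sketched as follows (feature ordering is our choice for illustration):

```python
import math

def time_features(minute_of_day, weekday, is_holiday):
    """Time-of-day on the unit circle, one-hot weekday, holiday flag.

    minute_of_day: 0..1439; weekday: 0 (Monday) .. 6 (Sunday).
    """
    angle = 2.0 * math.pi * minute_of_day / 1440.0
    time_vec = [math.sin(angle), math.cos(angle)]  # 23:59 sits next to 00:00
    weekday_onehot = [1.0 if i == weekday else 0.0 for i in range(7)]
    return time_vec + weekday_onehot + [1.0 if is_holiday else 0.0]

feats = time_features(minute_of_day=720, weekday=4, is_holiday=False)
assert len(feats) == 2 + 7 + 1
```

The unit-circle projection matters because a raw scalar would place 23:59 and 00:00 at opposite ends of the feature range, even though traffic conditions at those times are nearly identical.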
3.2.3 Geo-Embedding features
Locality is a common assumption in image segmentation and classification tasks: an object should be identified identically irrespective of its position in the image. This assumption is not valid for spatiotemporal data, as each pixel represents a region in the physical world with its own inherent attributes. To adapt semantic segmentation models to spatiotemporal data, it is necessary to develop a technique for learning the inherent attributes of each pixel. In our previous studies [liu2020building], we used an embedding technique to generate regional 'personalized' temporal information and fed it into the convolutional neural network. In this competition, we further propose geo-embedding to learn the inherent attributes of each location (i.e., pixel). We concatenate a learnable tensor to each input and optimize it jointly with the model. For the implementation, we use nn.Embedding in PyTorch, assigning a different ID to each pixel, since nn.Embedding offers options to constrain the norm of its parameters. Table 5 lists the effect of geo-embedding features, and their contribution can be observed.
| | Berlin | Istanbul | Moscow |
|---|---|---|---|
| Without Embedding Features | 1.2919e-3 | - | - |
| With Embedding Features | 1.2913e-3 | - | - |
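A NumPy sketch of the geo-embedding idea (in the actual solution the table is a learnable `nn.Embedding` indexed by pixel ID and trained end-to-end; the dimensions below are placeholders):

```python
import numpy as np

H, W, E, C = 4, 4, 8, 9                 # placeholder grid/embedding sizes
rng = np.random.default_rng(0)

# One E-dimensional vector per pixel ID (frozen here; in the real model
# these rows are nn.Embedding weights optimized jointly with the network).
geo_table = rng.normal(size=(H * W, E))
pixel_ids = np.arange(H * W).reshape(H, W)
geo_maps = geo_table[pixel_ids]         # (H, W, E) per-pixel lookup

frame = rng.random((H, W, C))           # one nine-channel input frame
augmented = np.concatenate([frame, geo_maps], axis=-1)
assert augmented.shape == (H, W, C + E)
```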
Due to device constraints, most of our models were trained on 2080Ti GPUs with a limited mini-batch size. We typically trained each model for 15 epochs with an initial learning rate of 0.01 and a linear learning-rate decay; training one model typically took 16-20 hours. We also included SyncBatchNorm to stabilize and speed up training. In this section, we introduce some training options that potentially contribute to model performance.
3.3.1 Choice of Optimizer
SGD and Adam [diederik2015] are commonly used to optimize models. In most conditions, SGD is slower but theoretically guaranteed to converge, while Adam is slightly faster but may not converge. Recently, other self-adaptive optimizers have also become popular, e.g., LAMB [You2020Large]. It should be noted that we did not tune the learning rate for SGD; hence it is possible, though unlikely, that with a carefully designed initial learning rate, SGD could match the self-adaptive optimizers in the same training time. We compare several optimizers in Table 6; LAMB appears to be the best optimizer for this task.
3.3.2 Learning Rate Warm-up
Learning-rate warm-up has been widely used in NLP tasks. Table 7 lists the results of our examination of the warm-up learning rate. With the training epochs and learning rate fixed, using warm-up produces far better results.
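A sketch of one common warm-up shape, a linear ramp into the linear decay mentioned earlier (the warm-up length here is a hypothetical parameter, not a value reported in this paper):

```python
def warmup_then_decay(step, warmup_steps, total_steps, base_lr=0.01):
    """Linearly ramp the LR up over warmup_steps, then decay it to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - frac)

lrs = [warmup_then_decay(s, warmup_steps=5, total_steps=20) for s in range(20)]
assert lrs[4] == 0.01                    # peak at the end of warm-up
assert lrs[0] < lrs[4] and lrs[-1] < lrs[4]
```

Starting from a small learning rate gives normalization statistics and adaptive-optimizer moments time to settle before full-size updates begin, which is the usual rationale for warm-up.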
3.3.3 Inclusion of Validation Set into Training Process
After two weeks of experiments and submissions, we found that the offline validation MSE and the online MSE were consistent, even though we greedily selected the best model parameters based on validation MSE. This observation hinted that we could include the validation data in the training process. In the final solution, half of the models are trained on both the training set and the validation set.
In this paper, we conducted systematic research on large-scale spatiotemporal traffic prediction. The model structure is designed to accommodate spatiotemporal prediction based on an image segmentation model. Targeting the spatiotemporal properties of traffic data, we proposed several new methods, including geo-embedding, and explored the details and tricks of model training. It should be noted that one assumption of the current dynamic traffic state prediction model is that no significant external influences exist. However, in reality, such as when large events are held, the model's predictive performance is drastically reduced. For future research, we will investigate effective traffic prediction under strong external influences.
is pursuing a Ph.D. degree in Forestry and Natural Resources Department, Purdue University, West Lafayette, USA. He has gained a wealth of experience in the theory of interdisciplinary applications of machine learning techniques and has won many championships in AI competitions organized by leading international AI conferences or research institutes, including the championship of JDD (2019), championship of IJCAI-Adversarial AI Challenge (2019), and championship of KDD Cup (2020).
is pursuing a Ph.D. degree in Transportation Engineering in the School of Transportation at Southeast University, Nanjing, China. He has rich experience in artificial intelligence applications, covering diverse fields from spectral classification for LAMOST to local climate zone classification for Sentinel-1, and from short video recommendation for TikTok users to travel mode recommendation for travelers. He has won three championships of Alibaba’s Tianchi Algorithm Competition, championship of IJCAI-Adversarial AI Challenge (2019), and championship of KDD Cup (2020).