I Introduction
Traffic prediction is one of the central tasks in building intelligent transportation management systems in metropolitan areas. Traffic congestion causes delays and costs millions to economies worldwide, and the problem is worst in urban centres. Predicting traffic flow provides stakeholders with tools for modelling and decision support.
Early approaches to traffic flow prediction were statistical techniques, including support vector machines (SVMs) [1] and the autoregressive integrated moving average (ARIMA) model [2]. These statistical approaches have been demonstrated to be effective as regression techniques for time series data. However, they do not address the spatiotemporal relationships of transportation networks and cannot be applied to a large-scale road network. Recently, machine learning technologies [3, 4, 5, 6, 7] have been actively applied, given that traffic prediction is essentially the estimation of future states from big data. The spatiotemporal features of traffic have been of great interest to researchers. Understanding the spatial evolution of traffic for the entire road network, rather than for a small part of it, is necessary at both the offline planning and online traffic management stages. Convolutional neural networks (CNNs) have been successful in dealing with spatial features of road networks [3, 4]. Besides, recurrent neural networks (RNNs) with long short-term memory (LSTM) [4, 5] and gated recurrent units (GRUs) [6] have been incorporated, treating traffic flow prediction as time series forecasting. A novel approach has been proposed in [7] that converts traffic speed into images, where the traffic speed data of each road segment at each time step are expressed in the third dimension. A CNN is used to capture spatiotemporal features in the images. This method has been demonstrated to outperform other state-of-the-art methods. It differs from approaches that simply treat the time dimension of the traffic flow as a channel of image data and therefore ignore the temporal features of traffic flow [6]. However, this work was demonstrated on rectangular subnetworks of a metropolitan road network, which have a relatively simple topology.
The goal of this study is to devise a traffic speed prediction method for complex road networks. We are dealing with traffic data gathered in the metropolitan area of Santander, Spain, whose road network is depicted in Fig. 1. The red lines denote road segments where induction loop detectors are installed to measure vehicle speeds. The speed sensors are sparsely located in a complex network. Adjacency in the spatiotemporal image does not necessarily mean adjacency in the road network. In this case, some spatial features would be disregarded by a CNN because it uses the max pooling operation to construct higher-order features locally. Therefore, we utilize a capsule network (CapsNet) [8, 9] that replaces the pooling operation with dynamic routing, which enables the network to take into account important spatial hierarchies between simple and complex objects. We propose a CapsNet architecture designed to be suitable for traffic speed prediction and demonstrate its effectiveness by comparing it with the CNN-based method in [7]. To the best of our knowledge, this is the first application of the CapsNet to a time series forecasting problem.
The rest of this paper is organized as follows. Section II first addresses the method of converting traffic data into images. After that, an existing CNN-based approach to traffic speed prediction is introduced. We then present the proposed CapsNet architecture designed for traffic speed prediction. The methods and results of the performance evaluation with a real dataset are given in Section III. Finally, Section IV presents a summary and conclusions.
II Traffic Speed Prediction
II-A Traffic Speed Data as an Image
Each induction loop sensor records the time history of traffic speed on a different road segment. In order to consider the spatiotemporal relationship, the traffic data are converted to an image with two axes representing time and space. As a result, we have an image as an $N \times S$ matrix, where $N$ and $S$ denote the number of time steps and the number of sensors, respectively. The matrix is then represented as:

$$M = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1S} \\ x_{21} & x_{22} & \cdots & x_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NS} \end{bmatrix} \qquad (1)$$

where $x_{ij}$ denotes the traffic speed at the $i$-th time step in the $j$-th road segment. For example, Fig. 2 depicts the spatiotemporal image representation of traffic speed data.
Suppose we have traffic speed data from $S$ sensors and we are going to predict the traffic speed $n$ time steps ahead based on data from the previous $p$ time steps. Given the overall time history of traffic speed, a $p \times S$ strip of the matrix $M$ becomes an input, and the data in the next $n$ time steps act as labels in training the neural networks. The output is an array of size $nS$, which can be obtained by reshaping the corresponding $n \times S$ strip of $M$ into an array as:

$$y = \begin{bmatrix} x_{t+1,1} & \cdots & x_{t+1,S} & x_{t+2,1} & \cdots & x_{t+n,S} \end{bmatrix}^{\mathsf{T}} \qquad (2)$$

where $t$ denotes the last time step of the input strip.
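As an illustration, the slicing of the spatiotemporal matrix into input strips and label vectors can be sketched as follows. This is a minimal NumPy sketch; the function name, array sizes, and window lengths are our own illustrative choices, not from the paper:

```python
import numpy as np

def make_dataset(speed, p, n):
    """Slice an (N, S) spatiotemporal speed matrix into (p, S) input
    strips and flattened n*S label vectors, as in (1) and (2)."""
    N, S = speed.shape
    inputs, labels = [], []
    # each window uses p past steps as input and the next n steps as label
    for t in range(N - p - n + 1):
        inputs.append(speed[t:t + p])                      # (p, S) image strip
        labels.append(speed[t + p:t + p + n].reshape(-1))  # (n*S,) label vector
    return np.stack(inputs), np.stack(labels)

# toy example: 100 time steps, 20 sensors, 10-step history, 1-step prediction
speed = np.random.rand(100, 20)
X, y = make_dataset(speed, p=10, n=1)
print(X.shape, y.shape)  # (90, 10, 20) (90, 20)
```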
II-B CNN for Traffic Speed Prediction
The CNN has been demonstrated to be significantly effective in understanding images by using max pooling and successive convolutional layers that reduce the spatial size of the data flowing through the network. These procedures increase the field of view of high-level layers and allow them to capture high-order features of the input image.
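As a concrete illustration of this pooling, the following minimal NumPy sketch (our own, not the paper's implementation) shows a 2×2 max pooling filter with stride 2 halving each spatial dimension and keeping only one of every four activations:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map.
    Keeps the maximum of each 2x2 region, i.e. 25% of the activations."""
    H, W = x.shape
    # reshape into 2x2 blocks and take the max over each block
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
print(max_pool_2x2(x))
# [[4. 8.]
#  [9. 7.]]
```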
TABLE I: Parameters of the CNN

| Layer           | Parameter   | Activation |
|-----------------|-------------|------------|
| Convolution1    | (256, 3, 3) | ReLU       |
| Pooling1        | (2, 2)      | -          |
| Convolution2    | (128, 3, 3) | ReLU       |
| Pooling2        | (2, 2)      | -          |
| Convolution3    | (64, 3, 3)  | ReLU       |
| Pooling3        | (2, 2)      | -          |
| Flattening      | -           | -          |
| Fully-connected | -           | -          |
TABLE II: Parameters of the proposed CapsNet

| Layer        | Parameter                   | Activation |
|--------------|-----------------------------|------------|
| Convolution1 | (32, 3, 3)                  | ReLU       |
| Convolution2 | (32, 3, 3)                  | ReLU       |
| PrimaryCaps  | (128, 3, 3), capsule size 8 | ReLU       |
| TrafficCaps  | Capsule size 16             | -          |
We use the CNN architecture proposed in [7] as the baseline. It consists of three pairs of a convolutional layer and a pooling layer, followed by a flattening operation and a fully-connected layer. Fig. 3 depicts the architecture of the CNN for traffic speed prediction. The three convolutional layers have 256, 128, and 64 channels, respectively, each with a kernel of size 3×3. Each convolutional layer involves a rectified linear unit (ReLU) activation function to give nonlinearity to the network.
Pooling layers have filters of size 2×2 applied with a stride of 2. This downsamples every depth slice in the input by a factor of 2 and reduces the redundancy of the representation by removing 75% of the activations. The output of each max pooling filter is determined by taking the maximum over the 4 numbers in a 2×2 region. The output of the last pooling layer is transformed into a vector by the flattening operation; this vector contains the final, highest-level features of the input traffic history. Lastly, the flattened output goes through a fully-connected layer to provide the prediction. The output of the fully-connected layer has the same dimension as the label vector in (2). The parameters of the CNN are presented in Table I.

II-C Proposed CapsNet Architecture
CNNs have worked surprisingly well in various applications. Nonetheless, max pooling in CNNs loses valuable information by picking only the neuron with the highest activation. The CapsNet has been proposed in [8, 9] to address this drawback of CNNs. A capsule is a group of neurons that encodes the probability of detection of a feature as the length of its output vector. Each layer in a CapsNet contains many capsules that represent different properties of the same object. One of the main characteristics of capsules is that they have vector forms and their activations provide vector outputs, whereas conventional artificial neurons go through scalar operations. More importantly, the CapsNet is trained by an algorithm called dynamic routing, proposed in [9]. Dynamic routing is executed between two successive capsule layers to update the weights that determine how the low-level capsules send their input to the high-level capsules that agree with that input. In other words, the weights are determined based on the dot product of a low-level capsule and a high-level capsule, where the dot product captures the similarity of the two vectors. Each weighted sum of the low-level capsules is then passed through the squash function, which forces the length to be no more than 1 while preserving the direction of the vector. Unlike CNNs, the CapsNet does not throw away information that is most likely relevant to the task at hand, such as relative relationships between spatiotemporal traffic features.

In the proposed architecture, as depicted in Fig. 4, the first two convolutional layers convert the spatiotemporal traffic image into the activities of local feature detectors, which are used as inputs to the third layer. The third layer, called PrimaryCaps, is another convolutional layer that has 128 channels with a 3×3 kernel. All the convolution operations are performed with a stride of 1 and zero padding, involving a ReLU nonlinearity. Each capsule in the PrimaryCaps layer is an 8-dimensional vector, and the capsules in a cuboid share their weights with each other. The final layer, called TrafficCaps, has a 16-dimensional capsule per road segment. Dynamic routing is performed between PrimaryCaps and TrafficCaps with 3 iterations. The dynamic routing algorithm captures the relationship between all the capsules in the PrimaryCaps layer and each capsule representing a road segment. In this way, any distant local feature can contribute to characterizing the capsules in the TrafficCaps layer. Here we consider the length of each 16-dimensional capsule vector in the TrafficCaps layer as the traffic speed on the corresponding road segment. The parameters of the proposed CapsNet are given in Table II.
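The squash function and the routing-by-agreement update described above can be sketched as follows. This is a simplified NumPy sketch of dynamic routing in the spirit of [9]; the capsule counts and dimensions are illustrative, not those of the actual network:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vector length into [0, 1) while preserving direction."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: (num_low, num_high, dim) prediction vectors from low-level
    capsules; returns (num_high, dim) high-level capsule outputs."""
    num_low, num_high, _ = u_hat.shape
    b = np.zeros((num_low, num_high))  # routing logits
    for _ in range(iterations):
        # coupling coefficients: softmax over high-level capsules
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum of low-level capsules
        v = squash(s)                           # (num_high, dim) capsule outputs
        b += (u_hat * v[None]).sum(axis=-1)     # agreement: dot-product update
    return v

u_hat = np.random.randn(32, 5, 16)  # 32 low-level capsules, 5 high-level, 16-D
v = dynamic_routing(u_hat)
# capsule lengths are at most 1; in the TrafficCaps layer they would be
# interpreted as (scaled) traffic speeds
lengths = np.linalg.norm(v, axis=-1)
print(lengths.shape)  # (5,)
```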
III Performance Validation with Real Data
We use traffic speed data measured every 15 minutes on 50 road segments in central Santander for the year 2016. The dataset is from the case studies of the SETA EU project [10]. Excluding days when the sensors did not work, the spatiotemporal traffic dataset is a matrix of size 33054 × S, where S denotes the number of road segments. Each sparsely missing measurement is imputed with the average of the measurements taken at the same time on the other days. We use the traffic data from January to September as a training set and the remaining data from October to December as an evaluation set. As an example, the average speed and its 1-year variation on one road segment are presented in Fig. 5. Note that each road segment has different statistics and no topological information about the road network is given. Understanding and predicting the spatiotemporal relationship of traffic between different road segments in different time slots is the duty of the neural networks. The CNN and CapsNet described in Sections II-B and II-C, respectively, performed the following four prediction tasks:

Task 1: 15-min prediction with 150-min traffic history on 20 road segments ($p = 10$, $n = 1$, $S = 20$)

Task 2: 30-min prediction with 150-min traffic history on 20 road segments ($p = 10$, $n = 2$, $S = 20$)

Task 3: 15-min prediction with 210-min traffic history on 50 road segments ($p = 14$, $n = 1$, $S = 50$)

Task 4: 30-min prediction with 210-min traffic history on 50 road segments ($p = 14$, $n = 2$, $S = 50$)
The traffic prediction tasks are performed on two sets of road segments, as depicted in Fig. 6. The 20 road segments used in Task 1 and Task 2 are marked in red, and the other 30 road segments, used in Task 3 and Task 4 together with the red segments, are marked in blue. Note that traffic data from adjacent road segments are not always located close to each other in the spatiotemporal image. We verify the methods on larger spatiotemporal images in Task 3 and Task 4, where the neural networks are required to capture spatiotemporal features scattered over a larger region.
In our TensorFlow implementation, each network employs the mean squared error (MSE) as a loss function, and we use the Adam optimizer [11] with an exponentially decaying learning rate to minimize the sum of the MSE. We scale the traffic speed data into the range [0, 1] before feeding them into the neural networks. The prediction results can be compared with the true values in the form of images. Fig. 7 depicts the image representation of the true traffic speed and the predictions by the CapsNet and the CNN. Traffic speed data at 3 different time periods are drawn, where the images in the same column represent the traffic data at the same time period. It is observed that the deep learning methods provide results that look as if a smoothing filter had been applied to the true traffic images.
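The [0, 1] scaling mentioned above can be sketched as follows. This is a minimal illustration with toy values; fitting the scaling range on the training set only is standard practice and an assumption on our part, not a detail stated in the paper:

```python
import numpy as np

def fit_minmax(train):
    """Return scale/unscale functions fitted on the training speeds."""
    lo, hi = float(train.min()), float(train.max())
    def scale(x):
        return (x - lo) / (hi - lo)      # map [lo, hi] -> [0, 1]
    def unscale(x):
        return x * (hi - lo) + lo        # map [0, 1] -> km/h
    return scale, unscale

train = np.array([0.0, 30.0, 60.0, 120.0])  # toy speeds in km/h
scale, unscale = fit_minmax(train)
print(scale(train))   # 0, 0.25, 0.5, 1
print(unscale(scale(train)))  # back to km/h
```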
The images shown in Fig. 7 are just snapshots of the result. Since the evaluation set contains a large amount of data, statistical performance metrics are required to assess the overall performance of the networks. The mean relative error (MRE) is one of the most common metrics for quantifying the accuracy of prediction models in general. However, an error on a small speed value may result in a large MRE and vice versa. Thus, we further employ the mean absolute error (MAE) and root mean squared error (RMSE) as more intuitive metrics for assessing the speed prediction performance. The three performance metrics are defined as:
$$\mathrm{MRE} = \frac{1}{K} \sum_{i=1}^{K} \frac{|\hat{x}_i - x_i|}{x_i} \times 100\% \qquad (3)$$

$$\mathrm{MAE} = \frac{1}{K} \sum_{i=1}^{K} |\hat{x}_i - x_i| \qquad (4)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{i=1}^{K} (\hat{x}_i - x_i)^2} \qquad (5)$$

where $\hat{x}_i$ and $x_i$ denote the $i$-th speed prediction and its true value, respectively. Here, $K$ represents the number of speed data points in the evaluation set.
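These three metrics translate directly into code; the following is a minimal NumPy sketch with toy values:

```python
import numpy as np

def mre(pred, true):
    """Mean relative error, in percent, as in (3)."""
    return np.mean(np.abs(pred - true) / true) * 100.0

def mae(pred, true):
    """Mean absolute error, in km/h here, as in (4)."""
    return np.mean(np.abs(pred - true))

def rmse(pred, true):
    """Root mean squared error, in km/h here, as in (5)."""
    return np.sqrt(np.mean((pred - true) ** 2))

true = np.array([50.0, 40.0, 60.0])   # toy true speeds
pred = np.array([48.0, 44.0, 57.0])   # toy predictions
print(round(mae(pred, true), 2))      # 3.0
print(round(rmse(pred, true), 2))     # 3.11
```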


TABLE III: Prediction performance of the CNN and the CapsNet on the four tasks (unit: % for MRE, km/h for MAE and RMSE).
The performance of the CNN and the CapsNet has been assessed with their best settings. Both networks show their best performance with a common starting learning rate of 0.0005 and an exponential decay rate of 0.9999. The resultant performance of the neural networks on the four tasks with the two sets of road segments is presented in Table III. The MRE does not provide consistent results. On the other hand, the MAE and RMSE increase as the input and output sizes increase from Task 1 to Task 4. The CapsNet shows better (smaller) MAE and RMSE than the CNN in all the tasks in both cases. The performance difference is larger in Task 3 and Task 4, where the size of the input image is larger. The CapsNet provided a 6.58% smaller MAE in Task 1 and Task 2 and a 10.2% smaller MAE in Task 3 and Task 4. We conclude that the CapsNet is better at capturing the relationships between distant spatiotemporal features, as expected. On average, the CapsNet provides 8.24% and 13.1% improvements in MAE and RMSE, respectively, compared with the CNN.
A drawback of the CapsNet is that it takes longer to train. In our experiment on Task 1, the CapsNet is about 30 times slower than the CNN. The computation time difference becomes more severe for tasks with larger output sizes. The number of trainable parameters in the CapsNet grows substantially from Task 1 to Task 4, whereas that in the CNN grows only modestly. Given increased input and output sizes, the routing algorithm requires a significant increase in the number of trainable parameters because it deals with full-scale image features by testing all the combinations between the multidimensional vectors, called capsules. On the other hand, the number of trainable parameters increases only modestly in the CNN, owing to the pooling operation.
IV Conclusion
This paper presents a capsule network framework that captures the spatiotemporal features of traffic speed and provides short-term traffic speed predictions. The vehicular traffic speed measured by magnetic loop detectors is represented as images that are fed into the developed capsule network. Traffic speed predictions by the proposed CapsNet architecture are compared with those by a CNN-based method. Experiments performed on 1 year of data measured on 50 road segments in Santander city demonstrate that the proposed CapsNet provides more accurate speed predictions than the CNN. The performance difference is larger in experiments with a larger dataset. This result implies that the CapsNet is better at learning spatiotemporal features, with a 13.1% improvement in RMSE with respect to the CNN.
Acknowledgment
The authors appreciate the support of the SETA project funded by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 688082.
References
 [1] C.-H. Wu, C.-C. Wei, D.-C. Su, M.-H. Chang, and J.-M. Ho, “Travel time prediction with support vector regression,” IEEE Transactions on Intelligent Transportation Systems, vol. 5, no. 4, pp. 276–281, 2004.
 [2] B. M. Williams and L. A. Hoel, “Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results,” Journal of Transportation Engineering, vol. 129, no. 6, pp. 664–672, 2003.
 [3] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: A deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2015.
 [4] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” AAAI Conference on Artificial Intelligence, pp. 1655–1661, 2017.
 [5] X. Ma, H. Yu, Y. Wang, and Y. Wang, “Large-scale transportation network congestion evolution prediction using deep learning theory,” PLoS ONE, vol. 10, no. 3, pp. 1–17, 2015.
 [6] Y. Wu, H. Tan, L. Qin, B. Ran, and Z. Jiang, “A hybrid deep learning based traffic flow prediction method and its understanding,” Transportation Research Part C: Emerging Technologies, vol. 90, pp. 166–180, 2018.
 [7] X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang, and Y. Wang, “Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction,” Sensors, vol. 17, no. 4, p. 818, 2017.
 [8] G. E. Hinton, S. Sabour, and N. Frosst, “Matrix capsules with EM routing,” International Conference on Learning Representations, 2018.
 [9] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” Conference on Neural Information Processing Systems, pp. 3856–3866, 2017.
 [10] SETA EU Project, A ubiquitous data and service ecosystem for better metropolitan mobility, Horizon 2020 Programme, 2016. [Online]. Available: http://setamobility.weebly.com/
 [11] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.