1 Introduction
An intelligent transportation system (ITS) plays a fundamental role in the development of smart cities [19]. Traffic forecasting has therefore become an important task for ITSs, since it facilitates the planning and development of traffic management and control systems [18]. The need for well-conceived management and control systems is growing dramatically because the steadily rising number of cars often causes the traffic infrastructure of cities to collapse, especially during rush hours. Furthermore, commuters lose a lot of time in traffic jams, and the emissions of their cars pollute the environment [19]. For that reason, many algorithms have been developed and applied for traffic forecasting.
In [12], those algorithms are subdivided into classical time-series and machine learning (ML) approaches. A selection of classical time-series and ML techniques, including artificial neural networks (ANNs), was applied to the California Freeway Performance Measurement System (PeMS) dataset. The comparison concludes that the classical time-series approach seasonal autoregressive integrated moving average (SARIMA) with a Kalman filter outperforms all other techniques regarding the measured error. Nevertheless, its training time was about 8 times and its prediction duration more than 12 times longer than that of the ANN. Unfortunately, only a shallow ANN was analyzed. Furthermore, the forecast horizon was constrained to 15 min.
In [15], SARIMA is compared to an ANN with two hidden layers. The prediction horizon varies from 1 h to 24 h, the input data is restricted to working days (7:00 am to 7:00 pm) and the weather condition is incorporated. The ANN outperformed the SARIMA model regarding the error value. More complex ANN models as well as training and testing time are left unconsidered.
In [13], stacked autoencoders (SAE) are applied to the PeMS dataset, restricted to working days, with prediction horizons ranging from 15 to 60 min. A grid search was performed to find well-performing SAE architectures. The SAE models had the lowest error values compared to a variety of ML approaches. A comparison to other deep learning (DL) models is lacking.
Besides, in [18] a bidirectional long short-term memory (DBL) model is introduced and compared to other models such as long short-term memory (LSTM) and SAE. Its efficiency is demonstrated on the PeMS dataset. The DBL was able to outperform all considered models regarding the error values. The complexity of the presented model as well as the training and testing duration are omitted.
In [17], an LSTM model with one hidden layer is proposed. The same prediction horizons as in [13] were chosen. The experiments were performed on the PeMS dataset of 2014 for 30 selected observation stations. The analyses showed that the LSTM achieved the smallest error rate compared to other ML models. However, the training and testing time are neglected.
Another LSTM-based approach with two hidden layers, called DeepTrend, is presented in [6]. The extraction layer, a fully connected layer, learns the time-variant trend, and the prediction layer, an LSTM layer, performs the forecast based on the extracted trend and the calculated residual series. The introduced model was compared to other ML and classical time-series approaches. DeepTrend achieved the lowest error value on the PeMS dataset, which was restricted to 16 weeks of 2016 and 50 stations in district 4. Regrettably, the training and testing time was not considered in the comparison.
In [20] a model is introduced based on gated recurrent units (GRU) which incorporates weather and traffic data. The model was tested on different datasets and compared with a variety of ML models. Considering the loss and accuracy, the proposed model achieved the best values. The prediction accuracy increased with the data fusion of weather conditions and traffic flow. Unfortunately, the training and testing duration is omitted.
Moreover, in [9] a deep belief network (DBN) is proposed. The DBN is configured by random search over a constrained search space. The DBN is applied to two datasets, including the PeMS dataset. Different scenarios are considered, e.g. grouped input data as well as a variety of forecast horizons. The DBN showed superior results regarding the error value in all experiments.
In [16], a DL model is proposed whose hyperparameters and structure were tuned by random search. The search space was heavily constrained and the model configurations were selected by a Monte Carlo algorithm. Different data preprocessing approaches were considered and utilized for comparing a linear, a shallow and a DL model. The best results were achieved by the DL model with median filter preprocessing and L1 regularization. Regrettably, the model complexity as well as the training and inference duration are not examined.
Additionally, in [7] autoregressive integrated moving average (ARIMA) as well as LSTM and GRU neural networks (NNs) are evaluated. The models were applied to a part of the PeMS dataset. The prediction results for the LSTM and GRU NNs were reported as similar and both better than ARIMA. The error values achieved by the GRU NN were lower than those stated for the LSTM NN. The architectural details of the recurrent NNs (RNNs) were omitted. Moreover, the training and inference duration were neglected.
In [19], a model based on AutoEncoder and LSTM NNs is proposed. The investigations were performed on the PeMS dataset. Each network is first trained separately, whereby the representation learned by the AutoEncoder NN is concatenated with the measured traffic data and used for the training of the LSTM NN. Finally, both NNs are trained together to fine-tune the weights. The introduced model achieved the lowest error rates compared to several ML approaches for various forecast horizons. Unfortunately, the training and inference duration are left unconsidered.
Besides, in [4] a model based on a generative adversarial network (GAN) and a stacked LSTM is introduced. The PeMS dataset of 2013 (district 5) was examined and synthetically extended by the developed GAN. Both the real and the generated traffic flow data were used for training the LSTM. A comparison of the LSTM trained on both synthetic and real data with the LSTM trained only on real data, for different numbers of historical time steps, leads to the conclusion that the smallest error values are achieved by the proposed model. The training and inference duration are omitted.
In this paper we report on various RNNs utilized to forecast the traffic flow in the city of Hagen. After training the proposed models with traffic data covering 6 months, the RNNs can be used to predict the traffic flow at each sensor location simultaneously for the configured forecast horizon.
Thus, the trained RNNs can be considered as structure-independent models of Hagen's street network, so that we neither need handcrafted street graphs that map the underlying streets nor have to make complicated assumptions about the drivers' behavior. Instead, we provide the current and historical traffic data of sensors at crossings to the RNNs to perform the traffic forecasts.
Our contribution is manifold: 1) Handling missing data and preprocessing; 2) Comparison of various models; 3) Relation between GRU and LSTM cells.
The paper is organized as follows: Section 2 is dedicated to the data description and preprocessing. In Section 3 we introduce the utilized RNNs as well as the implementation details. In Section 4 the results are presented and discussed thoroughly. Finally, we close with a conclusion and an outlook on future works.
2 Data analysis and preprocessing
All measurements are sent to and stored in a central traffic control computer, which is also able to control the traffic lights. The recorded data contains measurements from 129 inductive sensors at 12 intersections, whereas other works often focused on freeway data [6, 12]. The provided measurements were stored every minute in the unit vehicles per hour, which indicates how many vehicles passed that sensor on average in the preceding hour. Most often, the traffic control computer uses steps of 60 to get an integer result when the value is divided by 60 minutes. This results in inaccurate measurements and a high fluctuation of the data over time. The measured values from 6 am to 10 pm of every weekday were analysed and used for forecasting. It turned out that some of the sensors never stored data or stored no values other than 0 over a longer period. These sensors were unusable for forecasting traffic flow. There were also sensors which did not provide measurements every minute, so that the number of available values per minute fluctuated strongly between 56 and 129 during the recording period.
In the end, 108 usable series of measurements remained, from sensors that provided data over all or most of the given time range. Because there were still many missing values in the time series, this work considers how missing data should be handled before it is processed by the RNNs. The following three approaches were considered: 1) Missing values were marked as missing with a placeholder value. This set will be considered as raw data. 2) Missing values were repaired by inserting values averaged over all identical weekdays at the same record time. Linear interpolation was used if this was not possible. This set will be considered as repaired data. 3) The data was aggregated per driving direction (usually 3 to 4 directions per intersection), so that an average value was calculated for all lanes in the same direction, based on the repaired data. This set will be considered as aggregated data. The datasets were divided so that 87% was used for training, 3% for validation and 10% for testing. This segmentation ensured that the training data as well as the test data also contain weekends and holidays, because the amount of traffic on the streets differs on these days.
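The repair strategy of approach 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `repair_series`, the use of `None` for missing values, and the handling of gaps at the series boundaries (copying the nearest observed value) are assumptions, and the interpolation deliberately ignores day boundaries.

```python
from statistics import mean


def repair_series(values, weekdays, minutes):
    """Fill None entries of one sensor series.

    First choice: mean of all observed values with the same weekday and
    the same minute of the day. Fallback: linear interpolation between the
    nearest filled neighbours (boundary gaps copy the nearest value, which
    is an assumption of this sketch).
    """
    # Collect observed values per (weekday, minute-of-day) slot.
    slots = {}
    for v, wd, m in zip(values, weekdays, minutes):
        if v is not None:
            slots.setdefault((wd, m), []).append(v)

    # Pass 1: fill from the slot averages where a reference exists.
    repaired = list(values)
    for i, v in enumerate(values):
        key = (weekdays[i], minutes[i])
        if v is None and key in slots:
            repaired[i] = mean(slots[key])

    # Pass 2: linear interpolation for everything that is still missing.
    known = [i for i, v in enumerate(repaired) if v is not None]
    for i in range(len(repaired)):
        if repaired[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is not None and right is not None:
            w = (i - left) / (right - left)
            repaired[i] = repaired[left] + w * (repaired[right] - repaired[left])
        elif left is not None:
            repaired[i] = repaired[left]
        elif right is not None:
            repaired[i] = repaired[right]
    return repaired


# Two Mondays (weekday 0); minute 1 is missing once, minute 3 was never observed.
values = [120.0, None, 180.0, 100.0, 140.0, None, 200.0]
weekdays = [0, 0, 0, 0, 0, 0, 0]
minutes = [0, 1, 2, 0, 1, 3, 4]
repaired = repair_series(values, weekdays, minutes)
print(repaired)  # [120.0, 140.0, 180.0, 100.0, 140.0, 170.0, 200.0]
```

The first gap is filled from the other Monday's value at the same minute, while the second gap has no reference and falls back to interpolation between its neighbours.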
Due to the location of the sensors inside the city centre, we assumed that multiple intersections were passed by the same vehicles, resulting in relations between the measurements of different sensors. Sometimes, multiple sensors are on the same lane at an intersection, which differs from the PeMS dataset, where in general only one is present [12]. Obviously, it could be beneficial to process all data at the same time so that the relation between the series is used for the forecast. This assumption was checked by calculating correlation matrices illustrating the correlation between the time series of the different sensors. The result is presented in Fig. 1 and shows that there is a distinct correlation between many sensors. The matrix of the repaired data is omitted because it showed similar results to the raw data.
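A correlation matrix of this kind can be computed as sketched below. The text does not state which correlation coefficient was used; the Pearson coefficient is assumed here, and the three short series are made-up illustration data, not measurements from Hagen.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long series (assumed coefficient)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant series has zero spread and would need special handling.
    return cov / (std_x * std_y)


def correlation_matrix(series):
    """Pairwise correlations between all sensor series."""
    return [[pearson(a, b) for b in series] for a in series]


# Hypothetical sensors: b follows a closely, c runs opposite to a.
sensor_a = [100, 200, 300, 400]
sensor_b = [110, 190, 310, 390]
sensor_c = [400, 300, 200, 100]
matrix = correlation_matrix([sensor_a, sensor_b, sensor_c])
for row in matrix:
    print([round(value, 3) for value in row])
```

Entries close to 1 or -1 indicate sensors whose traffic rises and falls together or in opposition, which is the kind of relation a model processing all sensors at once can exploit.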
This work also analysed multiple step sizes for forecasting. For example, when a step size of 15 min. was used, the given data was resampled over 15 min. and each interval was represented by one averaged value before it was used to train the NNs.
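The resampling step amounts to averaging consecutive blocks of one-minute values. The sketch below assumes a complete series without missing values and drops an incomplete trailing block; both choices are assumptions, as is the function name `resample_mean`.

```python
def resample_mean(values, step):
    """Average consecutive blocks of `step` one-minute values.

    An incomplete block at the end is dropped (assumption of this sketch).
    """
    n_blocks = len(values) // step
    return [
        sum(values[i * step:(i + 1) * step]) / step
        for i in range(n_blocks)
    ]


# Ten one-minute readings in vehicles per hour, quantized in steps of 60.
per_minute = [60, 120, 0, 60, 180, 120, 60, 60, 0, 120]
print(resample_mean(per_minute, 5))  # [84.0, 72.0]

# One recorded day from 6 am to 10 pm has 960 minutes.
print(len(resample_mean([0] * 960, 15)))  # 64
```

The example also shows why averaging helps with the coarse quantization: the resampled values are no longer restricted to multiples of 60.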
3 RNN Architectures and Implementation Details
This work analysed the accuracy of forecasting traffic flow with RNNs by using different types and architectures. All resulting models were implemented using TensorFlow 1.12 [1] with CUDA support and the Scientific Computing Tools for Python (SciPy) on Python 3. The models were evaluated on a mid-level gaming graphics card (Nvidia GTX 1060) and on professional computing GPUs (Nvidia P100 and V100) in Google Cloud virtual machines.
3.1 Models
Based on well-proven models for time series prediction, three architectures were selected for traffic flow prediction [5, 3, 8]. Each architecture was instantiated as a model with LSTM and with GRU layers as recurrent component, which resulted in six models. Every model was designed so that it takes sequences of measurements of each sensor as input and generates sequences of predictions as output for every sensor at the same time. Input and output do not necessarily have the same length, so that different combinations of input and output length could be analysed. In the following, a recurrent layer can be either an LSTM or a GRU layer. Refer to the relevant documentation for further details, especially on how the data is processed internally [1].
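To illustrate how input and output sequences of different lengths can be cut from the multivariate data, the following sketch slides a window over the time axis. The stride of one step and the name `make_windows` are assumptions; the text does not describe the exact extraction procedure.

```python
def make_windows(series, input_len, pred_len):
    """Cut (input, target) pairs from a multivariate series.

    `series` is a list of time steps, each holding one value per sensor.
    A stride of one time step is assumed.
    """
    pairs = []
    last_start = len(series) - input_len - pred_len
    for start in range(last_start + 1):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + pred_len]
        pairs.append((x, y))
    return pairs


# Hypothetical toy data: 10 time steps, 2 sensors.
toy_series = [[t, 100 + t] for t in range(10)]
pairs = make_windows(toy_series, input_len=4, pred_len=2)
print(len(pairs))   # 5
print(pairs[0][1])  # [[4, 104], [5, 105]]
```

Each pair holds `input_len` past time steps for all sensors and the `pred_len` steps that follow, so the two lengths can be varied independently.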
A Convolutional Recurrent Neural Network (CRNN) is a combination of a convolutional neural network and an RNN. The CRNN architecture used here consists of two convolutional layers that extract relevant features from the measured traffic data by convolving the input over a single spatial dimension. The most relevant features are selected by a subsequent maximum pooling layer. The data, which is still a sequence at this point, is then flattened to a large vector by a flatten layer. The vector is repeated by a repeat layer according to the prediction length. The repeat layer is followed by two recurrent layers which generate sequences of predicted data based on the repeated input. Because the output of the last recurrent layer is still an internal representation, two subsequent dense layers convert it into sequences for each sensor. To avoid overfitting, a dropout layer after the pooling layer and a spatial dropout layer after each recurrent layer were used.
The EncoderDecoder architecture uses an encoder to generate an internal representation of the data. The decoder transforms the internal representation back into the original format or into a new desired format. The latter makes it possible to have different lengths for input and output sequences. In addition to the encoder and decoder, there are hidden layers for the actual forecasting. The developed architecture uses a recurrent layer to transform the input sequences from each sensor into an internal static representation of the input. This internal state is repeated according to the prediction length by a repeat layer and fed into a second recurrent layer for forecasting. The internal prediction is then decoded by two dense layers into the output format, so that it consists of predicted time series for every sensor. A dropout layer was inserted after the first recurrent layer to reduce overfitting. Because the first recurrent layer is configured not to produce sequences, a non-sequential dropout layer is used.
The VectorOutput architecture generates a vector with predicted time series for every sensor at its output without repeating any internal states. This architecture uses two recurrent layers at its input to produce predicted time series directly. The first recurrent layer takes sequences as input and also produces sequences for each sensor at its output, which are forwarded to the second recurrent layer. The second layer computes a static output from all input sequences. A dense layer processes this output into a large vector containing the concatenated time series with the predicted traffic flow for each sensor. A reshape layer then transforms this vector into the correct dimensions, i.e. into one time series per sensor. A dropout layer was inserted after the second recurrent layer, which also does not produce sequences.
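The data flow through the VectorOutput architecture can be followed by tracing the tensor shape after each layer. The sketch below only tracks shapes for a single sample and builds no network; the number of recurrent units (128) and the feature count (108 sensors plus weekday and timestamp) are hypothetical values chosen for illustration.

```python
def vector_output_shapes(input_len, n_features, n_sensors, pred_len, units):
    """Return (layer, output shape) pairs for one sample, batch axis omitted."""
    return [
        ("input", (input_len, n_features)),
        # First recurrent layer keeps the time axis (returns sequences).
        ("recurrent 1", (input_len, units)),
        # Second recurrent layer returns only a static output.
        ("recurrent 2", (units,)),
        ("dropout", (units,)),
        # Dense layer emits all predictions as one concatenated vector.
        ("dense", (pred_len * n_sensors,)),
        # Reshape layer restores one time series per sensor.
        ("reshape", (pred_len, n_sensors)),
    ]


shapes = vector_output_shapes(
    input_len=50,    # input length of the fine-tuning combination
    n_features=110,  # assumption: 108 sensors + weekday + timestamp
    n_sensors=108,
    pred_len=5,
    units=128,       # hypothetical layer width
)
for layer, shape in shapes:
    print(f"{layer:12s} {shape}")
```

With these values the dense layer has 540 outputs, which the reshape layer arranges into 5 predicted steps for each of the 108 sensors.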
3.2 Training and evaluation
The recorded time series were fed into the models together with the corresponding weekdays (0 to 6) and timestamps (0 to 959). Each time series was resampled according to the current step size where applicable, before all values were scaled using a RobustScaler [14] to avoid exploding and vanishing gradients [2]. The training data was used in sequential order to extract sequences of measurements to train all models, which resulted in more accurate forecasts due to the natural sequential order of the data [11]. All models were fine-tuned by grid search regarding the available hyperparameters to fit a specific combination of input length, step size, prediction length and data set. Every model utilized the mean squared error loss function together with the Adam optimizer [10]. Furthermore, L2 regularisation was used to reduce overfitting.
To evaluate the best model, the best forecasting length and the necessary preprocessing of the data, every model was trained and tested with all three data sets, step sizes of 1, 5, 15, 30 and 60 min. as well as prediction lengths of 1, 5 and 20 steps. This resulted in 270 training and evaluation phases. Training always ran for a fixed number of 300 epochs, each comprising 200 steps, i.e. input sequences of all sensors. All models were fine-tuned for the following combination: input length: 50, step size: 5 min., prediction length: 5, data set: repaired data.
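The number of 270 training and evaluation phases follows from the full combination of models, datasets, step sizes and prediction lengths. The short sketch below enumerates this grid; the string labels are illustrative names only.

```python
from itertools import product

architectures = ["CRNN", "EncoderDecoder", "VectorOutput"]
cells = ["LSTM", "GRU"]
datasets = ["raw", "repaired", "aggregated"]
step_sizes_min = [1, 5, 15, 30, 60]
prediction_lengths = [1, 5, 20]

# Every model (architecture x cell type) is run on every data combination.
runs = list(product(architectures, cells, datasets,
                    step_sizes_min, prediction_lengths))
print(len(runs))  # 270

# Each run uses a fixed budget of 300 epochs with 200 steps per epoch.
training_steps_per_run = 300 * 200
print(training_steps_per_run)  # 60000
```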
4 Results and Discussion
This section outlines the results of forecasting the traffic flow with the proposed RNNs. It starts with an overview of the results achieved by the developed architectures and is finalized by implications from the results.
4.1 Model evaluation
All models were trained and tested with the combinations listed in Sec. 3.2. During the test phase, the root mean square error (RMSE) and the coefficient of determination (R²) over all prediction steps were calculated [14]. The results of every model were compared for each combination of step size and prediction length and are shown in Tables 1, 2 and 3.
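For reference, both metrics can be written down in a few lines using their standard definitions. How the values were aggregated over sensors and prediction steps is not specified in the text; the sketch below simply assumes flat lists of measured and predicted values, and the numbers are made up.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return sqrt(squared / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical traffic flow values in vehicles per hour.
measured = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
error = rmse(measured, predicted)
score = r_squared(measured, predicted)
print(error)            # 10.0
print(round(score, 3))  # 0.992
```

A lower RMSE and an R² closer to 1 both indicate a better forecast; R² additionally relates the error to the variance of the measured values.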
Every table contains the results for one of the three tested prediction lengths, separated by dataset. For every architecture as well as for each step size the RMSE and R² are stated. The best RMSE for each combination is highlighted in bold. The results in Table 1 show a big difference in model performance between the step sizes of 1 and 5 min. A detailed analysis with a CRNN-LSTM model showed, especially for a step size of 1 min., that the measured traffic fluctuates strongly while the models predicted the average traffic flow, caused by a too low complexity of the used models. The high fluctuation decreased significantly when the measured traffic flow was averaged over 5 min. With this input data the model already performed significantly better, even if steep peaks are still not predicted satisfactorily. The results in the table also show that the forecasting error for the raw data becomes lowest starting with a step size of 15 min. It should be noted that in case of a sensor defect, this dataset cannot be used for forecasting at this sensor. The aggregated data, in contrast, would still allow a meaningful forecast for a specific direction. Furthermore, RMSE and R² change as expected: the RMSE decreases and R² increases with increasing step size. Table 2 shows the results for a prediction length of five steps. The results are comparable to the results in Tab. 1 except for some outliers. All results for a prediction length of 20 steps are listed in Tab. 3. The results show no significant difference compared to the results in Tables 1 and 2, especially for larger step sizes where the data is averaged over a longer period of time. Moreover, the results for the CRNN architecture are closer to those scored by the other architectures than for the shorter prediction lengths. It should be noted that for a step size of 60 min. and a prediction length of 20 steps there was not enough data to use an input length of 200 values as for all other step sizes. It was shortened to 100 values to be able to test a step size of 60 min.
The VectorOutput model with GRU performed best regarding RMSE and R². The results show that the RMSE for the different prediction lengths did not increase by more than 5 between 1 and 20 forecasting steps for step sizes larger than one min. Besides, the results show that averaging strongly fluctuating sensor data improves the forecasting quality dramatically. The comparison of the results obtained with the different datasets showed that the RMSE with the repaired data was often worse than the one calculated with raw data, and that the aggregated data leads to similar results as the raw data.
Derived from the given results, tagging missing measurements from sensors is sufficient for traffic flow prediction. Additionally, forecasts of multiple steps achieved very similar error values. The given sensor data needs to be averaged to reduce the aforementioned drawbacks of the steps of 60. Hence, it is beneficial to use a step size of 30 min., the necessary time between two adjustments of traffic lights in the city of Hagen. Figure 2 shows the model results of forecasting 20 steps with a step size of 30 min. using raw data.
Both VectorOutput models in particular show only a small deviation between the lowest and highest error. The RMSE sometimes decreases with further prediction steps. All models were trained and tested with input sequences containing 200 aggregated measurements for each sensor. The number of weights of each model as well as the training and testing duration are listed in Tab. 4.
Model                 Time to train [mm:ss]   Time to forecast [ms]   Number of weights
CRNN-LSTM             12:52                   4.2                     5,135,616
CRNN-GRU              11:34                   3.9                     3,907,440
EncoderDecoder-LSTM   19:53                   3.7                     447,012
EncoderDecoder-GRU    18:51                   3.6                     352,836
VectorOutput-LSTM     53:10                   3.8                     2,341,872
VectorOutput-GRU      49:03                   3.7                     1,990,224
Timings were measured on the professional compute card. The table shows that the CRNN models have the highest number of weights and were trained fastest. The EncoderDecoder models required roughly 1.5 to 1.6 times the training time (e.g. 19:53 versus 12:52 for the LSTM variants) although they have a smaller number of weights. Compared to the CRNN, the training time of the VectorOutput models, which have about half the number of weights, was about four times longer. Obviously, not the number of weights itself but their distribution is decisive for the compute time. This is because the CRNN models had around 95% of their weights inside the recurrent layers, where CUDA-optimized versions were used. In contrast, the VectorOutput models had only 60% and 53% of their weights inside recurrent layers.
Due to the lower complexity of the GRU cells, the GRU version of each architecture was always faster than the LSTM version. The inference duration of each model was nearly the same (about 4 ms). The GRU variants were always a little faster than the corresponding LSTM ones.
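The lower complexity of GRU cells can be made concrete by counting trainable weights. The sketch below uses the textbook formulation, in which an LSTM layer has four gate blocks and a GRU layer three, each with an input kernel, a recurrent kernel and one bias vector; some GRU implementations use a second bias vector, which is not modelled here. The layer sizes are hypothetical.

```python
def recurrent_weights(input_dim, units, gate_blocks):
    """Weights of one recurrent layer in the textbook formulation."""
    # Per gate block: input kernel + recurrent kernel + bias vector.
    per_block = units * (input_dim + units) + units
    return gate_blocks * per_block


def lstm_weights(input_dim, units):
    # Input, forget and output gate plus the cell candidate.
    return recurrent_weights(input_dim, units, gate_blocks=4)


def gru_weights(input_dim, units):
    # Update and reset gate plus the candidate state.
    return recurrent_weights(input_dim, units, gate_blocks=3)


# Hypothetical layer: 110 input features, 128 units.
n_lstm = lstm_weights(110, 128)
n_gru = gru_weights(110, 128)
print(n_lstm)          # 122368
print(n_gru)           # 91776
print(n_gru / n_lstm)  # 0.75
```

Under this formulation a GRU layer needs three quarters of the weights of an LSTM layer of the same size, which fits the observation that the GRU variants in Tab. 4 are smaller and somewhat faster than their LSTM counterparts.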
4.2 Implications
We showed that good results in forecasting traffic flow can be achieved with multivariate regression for 108 sensors. The best model showed a total RMSE of 31 in the presence of peak loads of more than 2000 at some of the sensors. Further analysis of the local RMSE at each sensor showed that some of them had a relatively high error compared to others. This means that the traffic flow at those sensors is harder to predict than at others. This could be related to irregular loads on these streets, but further analysis is needed.
It should also be noted that every model was optimized for only one input combination (refer to Section 3.2) and was not fine-tuned for the combination determined as the best choice. Further fine-tuning of the models' hyperparameters could result in lower forecasting errors. Furthermore, the statistical stability of training NNs with randomly initialized weights was not considered during the evaluation of the models. This means that every model was trained and tested only once due to the long computing times. An analysis of training the same model 20 times was made with the VectorOutput model with the combination mentioned as best. The results showed that 50% of every prediction step of every run had an RMSE deviation smaller than 3 and that the median RMSE of all forecasts between step 1 and 20 showed an increase of 5.5.
5 Conclusion and Future Works
In this paper we analyzed forecasting the traffic flow in the city of Hagen with RNNs. The data was generated by inductive loop sensors distributed around the city centre. Moreover, the data was preprocessed to incorporate spatial dependencies and handle missing data. We developed three RNN architectures and evaluated them as instances with LSTM and GRU cells on different input data and forecasting horizons. We showed that forecasting the traffic flow inside cities can be done satisfactorily by RNNs, even for large forecast horizons. Our best model, the VectorOutput architecture with GRU cells, predicted the traffic 10 hours into the future by considering measurements every 30 minutes. The forecasts at 108 sensors were performed simultaneously with a total error of 35 at peak loads of more than 2000 at some sensors. Besides, we showed that computing all sensors at the same time is beneficial because the measurements are correlated between sensors inside the city. We found that it is sufficient to use the sensors' raw data and to tag missing measurements inside the data set.
Because the available data was limited, the model will be trained and tested with more data to improve generalization. Furthermore, the hyperparameters will be fine-tuned to fit the determined combination of step size, prediction length and data set. Additionally, we will take statistical deviations caused by weight initialization into account during model evaluation. During the evaluation phase it emerged that some of the sensors are harder to predict than others. Further studies are necessary to identify the root cause and to determine how these effects could be mitigated. In addition, we will implement the developed model in our smart mobility app called STREAM (Smart TRaffic using Edge And social CoMputing) with the objective of a better, more balanced utilization of the street network.
References
[1] Abadi, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), software available from tensorflow.org
[2] Bengio, Y., Frasconi, P., Simard, P.: The problem of learning long-term dependencies in recurrent networks. In: IEEE International Conference on Neural Networks. IEEE (1993). https://doi.org/10.1109/icnn.1993.298725
[3] Brownlee, J.: How to develop LSTM models for multi-step time series forecasting of household power consumption (Oct 2018), https://machinelearningmastery.com
[4] Chen, Y., et al.: Traffic Flow Prediction with Parallel Data. IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC pp. 614–619 (2018). https://doi.org/10.1109/ITSC.2018.8569276
[5] Cirstea, R.G., et al.: Correlated time series forecasting using deep neural networks: A summary of results
[6] Dai, X., et al.: DeepTrend: A Deep Hierarchical Neural Network for Traffic Flow Prediction pp. 394–399 (2017), http://arxiv.org/abs/1707.03213
[7] Fu, R., Zhang, Z., Li, L.: Using LSTM and GRU neural network methods for traffic flow prediction. Proceedings, 31st Youth Academic Annual Conference of Chinese Association of Automation pp. 324–328 (2017). https://doi.org/10.1109/YAC.2016.7804912
[8] Harmon, M., Klabjan, D.: Dynamic prediction length for time series with sequence to sequence networks (2018)
[9] Huang, W., et al.: Deep architecture for traffic flow prediction: Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15(5), 2191–2201 (2014). https://doi.org/10.1109/TITS.2014.2311123
[10] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations (2015)
[11] LeCun, Y.A., et al.: Efficient BackProp. In: Lecture Notes in Computer Science, pp. 9–48. Springer Berlin Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_3
[12] Lippi, M., Bertini, M., Frasconi, P.: Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems 14(2), 871–882 (2013). https://doi.org/10.1109/TITS.2013.2247040
[13] Lv, Y., et al.: Traffic Flow Prediction with Big Data: A Deep Learning Approach. IEEE Transactions on Intelligent Transportation Systems 16(2), 865–873 (2015). https://doi.org/10.1109/TITS.2014.2345663
[14] Pedregosa, F., et al.: Scikit-learn: Machine learning in Python (2019), https://scikit-learn.org/stable
[15] Peng, H., et al.: Forecasting traffic flow: Short term, long term, and when it rains. In: Lecture Notes in Computer Science (2018). https://doi.org/10.1007/978-3-319-94301-5_5
[16] Polson, N.G., Sokolov, V.O.: Deep learning for short-term traffic flow prediction. Transportation Research Part C: Emerging Technologies 79, 1–17 (2017). https://doi.org/10.1016/j.trc.2017.02.024
[17] Tian, Y., Pan, L.: Predicting short-term traffic flow by long short-term memory recurrent neural network. International Conference on Smart City pp. 153–158 (2015). https://doi.org/10.1109/SmartCity.2015.63
[18] Wang, J., Hu, F., Li, L.: Deep Bidirectional Long Short-Term Memory Model for Short-Term Traffic Flow Prediction. In: Neural Information Processing. pp. 306–316. Springer International Publishing, Cham (2017)
[19] Wei, W., Wu, H., Ma, H.: An autoencoder and LSTM-based traffic flow prediction method 19(13), 1–16 (2019). https://doi.org/10.3390/s19132946
[20] Zhang, D., Kabuka, M.R.: Combining weather condition data to predict traffic flow: A GRU-based deep learning approach. IET Intelligent Transport Systems 12(7), 578–585 (2018). https://doi.org/10.1049/iet-its.2017.0313