
Traffic flow prediction using Deep Sedenion Networks

by Alabi Bojesomo, et al.

In this paper, we present our solution to the Traffic4cast 2020 traffic prediction challenge. In this competition, participants are to predict future traffic parameters (speed and volume) in three different cities: Berlin, Istanbul and Moscow. The information provided includes nine channels, where the first eight represent the speed and volume for four different directions of traffic (NE, NW, SE and SW), while the last channel is used to indicate the presence of traffic incidents. The expected output should have the first 8 channels of the input at six future time intervals (5, 10, 15, 30, 45, and 60 min), while one hour of past traffic data, in 5-minute intervals, is provided as input. We solve the problem using a novel sedenion U-Net neural network. Sedenion networks provide the means for efficient encoding of correlated multimodal datasets. We use 12 of the 15 sedenion imaginary parts for the dynamic inputs, and the real sedenion component is used for the static input. The sedenion output of the network is used to represent the multimodal traffic predictions. The proposed system achieved a validation MSE of 1.33e-3 and a test MSE of 1.31e-3.





1 Introduction

In this contribution, we briefly summarize the methodology and experiments in tackling the Traffic4cast (Traffic map movie forecasting) challenge 2020 [11]. The aim of the challenge is to predict future traffic flow volume and speed using high-resolution city maps (see Fig. 1). Given hourly traffic data, we need to build a model to predict volume and speed at 6 time intervals (i.e., 5 min, 10 min, 15 min, 30 min, 45 min and 1 hr) within the next hour [11].

Figure 1: Average traffic situation, averaged over time span and channel

The pixel-level prediction of this task can be posed as a segmentation problem and hence, a U-NET [8] based architecture was used. This year's competition includes both dynamic and static information, which are treated as independent modalities. Multi-modal datasets can be handled with a variety of approaches, including early and late fusion. In late fusion, separate feature-extracting networks are used on the individual modalities, and the extracted features are fused before the classification layer. In early fusion, on the other hand, the modalities are fused within the feature extraction network [1]. To account for the multi-modal nature of the dataset, a sedenion-based convolutional layer [12, 9] was used in the proposed U-NET model. The sedenion convolution layers are packed in blocks similar to ResNet (see Fig. 5) [3]. The proposed model is highly parameter-efficient due to the use of sedenion-based convolutional layers. The static and the dynamic information can be used as components of a 16-component sedenion representation. Having a single static frame and 12 dynamic frames makes it convenient to use the static input as the real (first) component, while the dynamic inputs occupy the following 12 components (see Fig. 2b).

Figure 2: Sedenion representation. (a) Sedenion input feature map division into 16 groups. (b) How the sedenion representation was formed in the first sedenion layer and how the outputs of the network were extracted from the last sedenion layer of the network.

The spatio-temporal nature of the problem can equally be viewed as a multi-task problem, due to the need to forecast more than one time step in the future. With a traditional convolutional network, the task may be seen as a single-task segmentation problem; in the case of this challenge, however, there are 8 different channels at 6 different time steps. This gives rise to multi-task segmentation, not only in terms of channels but also in terms of time steps (groups of channels). Hypercomplex-number-based networks have previously been used for multi-task problems [12] with promising results. Hypercomplex networks work not only when the number of components exactly matches the number of channels or available groups, but also when the number of available channels is less than the number of expected components of the hypercomplex representation. An example of this can be found in the work of Parcollet et al. [6], where a quaternion autoencoder was used for color image reconstruction using just 3 out of the 4 quaternion components in the input stage as well as the output stage.

2 Sedenion Network

A sedenion can be defined as a 16-dimensional algebraic structure (see Fig. 2a), i.e., $s = s_0 e_0 + s_1 e_1 + \dots + s_{15} e_{15}$, where $s_0$ is the real part and $s_1, \dots, s_{15}$ are the imaginary parts with $e_i^2 = -1$ for $i = 1, \dots, 15$. Sedenion-based convolution can be represented as the multiplication of two sedenion numbers [12]. With the weight $W$ and the input $x$ both being sedenions, the multiplication $W \otimes x$ expands into a $16 \times 16$ weight matrix whose component indices and signs follow the sedenion multiplication table; each output component is obtained from the corresponding row of that matrix. Each of the 16 components of the sedenion weight is thus re-used 16 times, leading to parameter efficiency. Moreover, the parameter reduction does not come at the expense of representation ability, because all input feature maps to the sedenion convolution take part in the expression for each of the 16 sedenion outputs. To represent the inputs as a sedenion structure, the input feature map is divided into 16 groups, where each group represents a single component of the sedenion feature map.
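The component-index and sign pattern of the sedenion weight matrix comes from the sedenion multiplication table, which can be generated mechanically via the Cayley-Dickson construction. The sketch below is illustrative only: it follows one common sign convention, which may differ from the convention used in the paper, and `conj`, `cd_mult`, and `basis` are hypothetical helper names.

```python
# Sketch: generating the 16x16 sedenion multiplication table via the
# Cayley-Dickson construction (one common sign convention; conventions vary).

def conj(a):
    """Cayley-Dickson conjugate: recursively negate the imaginary half."""
    if len(a) == 1:
        return a[:]
    h = len(a) // 2
    return conj(a[:h]) + [-x for x in a[h:]]

def cd_mult(a, b):
    """Recursive Cayley-Dickson product of two equal-length component lists."""
    if len(a) == 1:
        return [a[0] * b[0]]
    h = len(a) // 2
    a1, a2 = a[:h], a[h:]
    b1, b2 = b[:h], b[h:]
    # (a1, a2)(b1, b2) = (a1 b1 - conj(b2) a2,  b2 a1 + a2 conj(b1))
    left = [x - y for x, y in zip(cd_mult(a1, b1), cd_mult(conj(b2), a2))]
    right = [x + y for x, y in zip(cd_mult(b2, a1), cd_mult(a2, conj(b1)))]
    return left + right

def basis(i, n=16):
    e = [0] * n
    e[i] = 1
    return e

# Entry (i, j) holds (k, sign) such that e_i * e_j = sign * e_k.
table = [[None] * 16 for _ in range(16)]
for i in range(16):
    for j in range(16):
        prod = cd_mult(basis(i), basis(j))
        k = next(idx for idx, v in enumerate(prod) if v != 0)
        table[i][j] = (k, prod[k])
```

Since every row of the table is a signed permutation of the 16 component indices, each of the 16 weight components appears in all 16 output expressions, which is exactly the 16-fold parameter reuse described above.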


Sedenion-based concatenation is equally different from the standard (real) concatenation block. Following the definition of a sedenion as $s = s_0 e_0 + \dots + s_{15} e_{15}$, the concatenation of two sedenions $s$ and $t$ can be defined as a sedenion formed by concatenating each of the components of the contributing sedenions, i.e., $s \oplus t = \sum_{i=0}^{15} (s_i \,\|\, t_i)\, e_i$ (see Fig. 3), where $\|$ represents the standard concatenation of $s_i$ and $t_i$.

Figure 3: Sedenion concatenation showing how individual components get concatenated.
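The component-wise concatenation of Fig. 3 can be sketched in a few lines of NumPy, assuming channels-first feature maps whose channel axis is divided into 16 equal groups (a simplified stand-in for the actual layer; `sedenion_concat` is a hypothetical helper name):

```python
import numpy as np

def sedenion_concat(x, y):
    """Component-wise concatenation of two sedenion feature maps.

    x, y: arrays of shape (C, H, W) whose channel axis is split into
    16 equal groups, one group per sedenion component.
    """
    xs = np.split(x, 16, axis=0)   # 16 components of x
    ys = np.split(y, 16, axis=0)   # 16 components of y
    # Concatenate matching components, then stitch the 16 results back together.
    return np.concatenate([np.concatenate([xi, yi], axis=0)
                           for xi, yi in zip(xs, ys)], axis=0)

x = np.random.rand(32, 4, 4)   # 16 components x 2 channels each
y = np.random.rand(16, 4, 4)   # 16 components x 1 channel each
z = sedenion_concat(x, y)      # 16 components x 3 channels each
```

Note that the result interleaves channels from both inputs within each component group, unlike standard concatenation, which would stack all of `x` before all of `y`.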

For a network with sedenion convolutions to be trainable, we assume that the components are independent during initialization, following the works of Wu et al. [12], Gaudet and Maida [2], and Trabelsi et al. [10]. However, we use standard batch normalization [4] instead of the computationally and memory expensive hypercomplex batch normalization, which involves LU decomposition [12, 2, 10].

3 Methods

3.1 Input representation

The input to the model consists of 2 types of information sources, i.e., static and dynamic, supporting the use of the proposed multi-modal representation. The static input accounts for city-based variations, as the model was trained using data from all three cities. The static input goes through a vector learning block and is used as the real component of the aggregated sedenion input. The dynamic input consists of 12 frames (see Fig. 6) accounting for the 12 previous time frames, and was used as 12 imaginary components of the sedenion. The remaining 3 imaginary components were set to zero (see Fig. 2b).
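A minimal sketch of this packing, assuming channels-first NumPy arrays and the 9-channel static features produced by the vector learning block (`pack_sedenion_input` is a hypothetical helper, not from the paper's code):

```python
import numpy as np

def pack_sedenion_input(static_feat, dynamic_frames):
    """Arrange the inputs as 16 sedenion components along the channel axis.

    static_feat:    (9, H, W)     -- static map after the vector learning block
    dynamic_frames: (12, 9, H, W) -- the 12 past 5-minute frames
    Component 0 (the real part) holds the static input, components 1-12
    hold the dynamic frames, and components 13-15 are zero-padded.
    """
    _, h, w = static_feat.shape
    zeros = np.zeros((3, 9, h, w), dtype=static_feat.dtype)
    comps = np.concatenate([static_feat[None], dynamic_frames, zeros], axis=0)
    return comps.reshape(16 * 9, h, w)   # flatten to a (144, H, W) input map

static = np.ones((9, 4, 4))
dynamic = np.full((12, 9, 4, 4), 2.0)
x = pack_sedenion_input(static, dynamic)   # (144, 4, 4)
```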

3.2 Output

The 6 expected time frames of the prediction come from the last layer of the proposed model, which likewise produces a 16-component output (the sedenion convolution output). The actual output comes from hand-picking the positions of the expected time frames in the output sedenion (see Fig. 4). The network can easily be adapted to accommodate up to 12 output frames for 1-hour predictions without any further adjustment to the output layer.
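The hand-picking step amounts to simple component indexing. The exact component positions used by the model are read off Fig. 4, so the indices below are illustrative assumptions only:

```python
import numpy as np

def extract_predictions(sedenion_out, frame_components):
    """Hand-pick the 6 predicted frames from the 16-component output.

    sedenion_out: (16 * 8, H, W) -- last-layer output, 8 channels per component
    frame_components: which 6 of the 16 components hold the predicted frames
    (illustrative; the actual positions follow the model figure).
    """
    comps = sedenion_out.reshape(16, 8, *sedenion_out.shape[1:])
    return comps[list(frame_components)]   # (6, 8, H, W)

out = np.arange(16 * 8 * 2 * 2, dtype=float).reshape(128, 2, 2)
preds = extract_predictions(out, [1, 2, 3, 4, 5, 6])
```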

3.3 Model

The U-NET model structure is shown in Fig. 4. In this model, all convolutional blocks are sedenion convolutions except for the ones in the learnVectorBlock, which brings the static input (7 channels) to the same channel dimension as the dynamic input (9 channels).

Figure 4: Overall model architecture.

The model blocks consist of various functions with their respective sub-blocks, shown in Fig. 5, as follows:

  • learnVectorBlock: This block serves as a feature extractor in case a component has a different number of channels. It is used to extract 9-channel features from the 7-channel static input before concatenation with the dynamic input in the aggregated sedenion input of the network.

  • encoderBlock: This is a ResNet block, which serves as the main encoder block for the U-NET model. Spatial pooling is done by strided convolution at the end of each encoder group.

  • codeVectorBlock: This is a sedenion convolution preceded by batch normalization and ReLU.

  • decoderBlock: This block has two inputs: one comes from the decoding end, while the other comes from a skip connection originating from the encoder output.

Figure 5: Model blocks
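The channel-matching role of the learnVectorBlock can be illustrated as a pointwise (1x1) channel projection. The actual block in Fig. 5 also contains normalization and activation layers, so this is only a sketch of the 7-to-9-channel mapping, with hypothetical names:

```python
import numpy as np

def learn_vector_block(x, weight, bias):
    """Pointwise (1x1) channel projection: 7 static channels -> 9 channels,
    matching the dynamic input's channel count. A 1x1 convolution over a
    (C, H, W) map is equivalent to a matrix multiply along the channel axis.
    """
    c, h, w = x.shape
    out = weight @ x.reshape(c, h * w) + bias[:, None]
    return out.reshape(weight.shape[0], h, w)

rng = np.random.default_rng(0)
static = rng.random((7, 4, 4))
w = rng.random((9, 7))
b = rng.random(9)
features = learn_vector_block(static, w, b)   # (9, 4, 4)
```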

4 Results

In this competition, there are 362 daily data files for each target city. 181 files were used for training, 18 for validation, and 163 for hold-out test-set evaluation. The evaluation metric was the mean squared error. A single day's training file contains 288 (24 hours x 60 minutes / 5-minute interval) time frames of traffic map information. A sliding window was applied to extract 12-frame inputs and 6-frame outputs (see Fig. 6).


Figure 6: Data sequence
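The sliding-window extraction can be sketched as follows. Under the 5-minute frame spacing, the offsets (0, 1, 2, 5, 8, 11) into the 12 frames that follow each input window correspond to the 5, 10, 15, 30, 45 and 60 minute horizons; `sliding_windows` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def sliding_windows(day, n_in=12, out_offsets=(0, 1, 2, 5, 8, 11)):
    """Slice a day's 288 frames into (12-frame input, 6-frame target) pairs.

    `out_offsets` picks the 5/10/15/30/45/60-minute horizons out of the
    12 frames following each input window.
    """
    pairs = []
    for t in range(len(day) - n_in - 12 + 1):
        x = day[t : t + n_in]                 # 12 input frames
        future = day[t + n_in : t + n_in + 12]
        y = future[list(out_offsets)]         # 6 target frames
        pairs.append((x, y))
    return pairs

day = np.arange(288)[:, None]   # 288 stand-in frames for one day
pairs = sliding_windows(day)
```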

The U-NET model described in Fig. 4 was implemented in PyTorch [7]. The mean squared error (MSE) was used as the loss function, with the Adam optimizer [5]. The learning rate was initially set to 1e-4 and was manually reduced to 1e-6 when performance plateaued on the validation set.

The model was trained on a machine running two GeForce RTX 2080 Ti GPUs. Resource limitations and the high resolution of the data constrained the number of experiments, which, in turn, limited the competitiveness of the simulation results. The proposed model's performance is summarized in Table 1.

Parameters Validation MSE Test MSE
628,592 1.33893e-03 1.30845e-03
Table 1: Model evaluation results

5 Conclusion and Future Work

We presented the use of sedenion convolution in a U-Net based model for the traffic forecasting challenge, which yielded competitive results, i.e., an MSE of 1.30845e-03 on the challenge test set with just 628,592 trainable parameters. The model was trained with limited resources, showing promising performance to be further investigated in future studies. Analysis of the training data shows that there exists time-based variation in the data set, which would support the use of actual time-of-day information during training. The code with the implementation of the proposed approach is available online.


  • [1] N. Audebert, B. Le Saux, and S. Lefèvre (2018) Beyond RGB: very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing 140, pp. 20–32. Cited by: §1.
  • [2] C. Gaudet and A. Maida (2017) Deep quaternion networks. arXiv preprint arXiv:1712.04604. Cited by: §2.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1.
  • [4] J. Hoffmann, S. Schmitt, S. Osindero, K. Simonyan, and E. Elsen (2020) AlgebraNets. arXiv preprint arXiv:2006.07360, pp. 1–15. Cited by: §2.
  • [5] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.
  • [6] T. Parcollet, M. Morchid, and G. Linarès (2019) Quaternion convolutional neural networks for heterogeneous image processing. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8514–8518. Cited by: §1.
  • [7] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §4.
  • [8] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. Cited by: §1.
  • [9] L. S. Saoud and H. Al-Marzouqi (2020) Metacognitive sedenion-valued neural network and its learning algorithm. IEEE Access 8, pp. 144823–144838. Cited by: §1.
  • [10] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal (2018) Deep complex networks. arXiv preprint arXiv:1705.09792. Cited by: §2.
  • [11] Traffic map movie forecasting 2020. Cited by: §1.
  • [12] J. Wu, L. Xu, F. Wu, Y. Kong, L. Senhadji, and H. Shu (2020) Deep octonion networks. Neurocomputing 397, pp. 179–191. Cited by: §1, §2.