Solving Traffic4Cast Competition with U-Net and Temporal Domain Adaptation

In this technical report, we present our solution to the Traffic4Cast 2021 Core Challenge, in which participants were asked to develop algorithms for predicting a traffic state 60 minutes ahead, based on the information from the previous hour, in 4 different cities. In contrast to the previously held competitions, this year's challenge focuses on the temporal domain shift in traffic due to the COVID-19 pandemic. Following the past success of U-Net, we utilize it for predicting future traffic maps. Additionally, we explore the usage of pre-trained encoders such as DenseNet and EfficientNet and employ multiple domain adaptation techniques to fight the domain shift. Our solution ranked third in the final competition. The code is publicly available.


Code repository: 3rd Place Solution of the Traffic4Cast Core Challenge @ NeurIPS 2021

1 Introduction

The Traffic4Cast Competition has been organized over the last three years by the Institute of Advanced Research in Artificial Intelligence (IARAI) and the data provider HERE Technologies. It challenges its participants to design algorithms that predict future traffic states based on prior traffic information. Specifically, the Traffic4Cast 2019 Challenge Kreil et al. (2020) focuses on forecasting the traffic state 15 minutes ahead based on the dynamic information from the previous hour, while the prediction window is increased to up to an hour in the Traffic4Cast 2020 Challenge Kopp et al. (2021).

This year’s edition of the competition, Traffic4Cast 2021, has introduced two competition tracks:

  • Core Challenge, in which participants are tasked to fight a temporal domain shift that appears due to the COVID-19 pandemic.

  • Extended Challenge, in which models are tested over 2 entirely unseen cities, thus introducing a spatiotemporal domain shift in the traffic data.

Our work focuses on the Core Challenge, for which the competition organizers have provided 6 months of training data from 2019. They have tested the submissions on 100 one-hour time slots from 2020 in four diverse cities: Melbourne, Berlin, Istanbul, and Chicago. For each one-hour time slot, the algorithms have to predict traffic conditions for six time steps into the future: 5, 10, 15, 30, 45, and 60 minutes. The main difficulty of this challenge is the temporal domain shift caused by the COVID-19 pandemic in 2020, when mobility restrictions introduced to slow the spread of the virus reduced the traffic flow volume. For participants, this means that the train and test distributions differ significantly.

Our contributions are as follows:

  • Inspired by the success of U-Net Ronneberger et al. (2015) in the previous competitions Choi (2020); Xu et al. (2020); Choi (2019); Martin et al. (2019); Liu et al. (2019), we build and train four independent U-Net models, one for each of the core cities, which serve as a simple yet strong baseline that turned out to be hard to beat.

  • In an attempt to further improve the score, we have conducted experiments employing DenseNet Huang et al. (2017) and EfficientNet Tan and Le (2019), convolutional networks pre-trained on ImageNet Deng et al. (2009), as part of the encoder in U-Net Ronneberger et al. (2015).

  • We have further tried a set of domain adaptation techniques, such as pseudo-labeling Jaiswal et al. (2019) and heuristic map pre- and post-processing, to fight the temporal domain shift between the training and test data.

  • Finally, we have used an ensembling method to combine the networks’ predictions, utilizing the proposed domain adaptation technique, to create the best scoring submission and have ranked third in the competition.

2 Data Preprocessing

The data were collected from 4 cities: Berlin, Istanbul, Melbourne, and Chicago. The competition organizers represent each city as a grid map of $H \times W$ pixels, where each cell aggregates the GPS measurements of a small square region over 5-minute time bins. Then, the dynamic traffic state at each 5-minute time bin is encoded into a grid map with 8 channels that represent traffic volume and speed in four directions (NW, NE, SW, SE). The "pixel values" of volume and speed are scaled to values from 0 to 255 with a min-max scaler and then rounded to integers.
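The value encoding described above can be illustrated with a minimal sketch. The function name, the optional `lo`/`hi` bounds, and the example speeds are hypothetical; the organizers' actual scaling bounds are not given here.

```python
def minmax_to_uint8(values, lo=None, hi=None):
    """Scale raw measurements to integers in [0, 255] with a min-max scaler."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        # Degenerate range: map everything to 0 to avoid dividing by zero.
        return [0 for _ in values]
    encoded = []
    for v in values:
        clipped = min(max(v, lo), hi)  # keep values inside the scaler range
        encoded.append(int(round(255 * (clipped - lo) / (hi - lo))))
    return encoded


# Hypothetical raw speeds for four cells of one direction channel.
raw_speeds = [0.0, 30.0, 90.0, 120.0]
encoded_speeds = minmax_to_uint8(raw_speeds)
```

The rounding step makes the encoding lossy, which is one reason predictions are compared against the encoded maps rather than the raw measurements.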

The resulting input and output shapes are $(T_{in}, H, W, C)$ and $(T_{out}, H, W, C)$, where $T_{in} = 12$, $T_{out} = 6$, $C = 8$, and $H$ and $W$ are the height and width of the city grid, although $H$ and $W$ might differ between models when we pad the input image to match the working resolution of a specific model. The input represents the dynamic traffic state map over a 60-minute window, and the output includes the time steps 5, 10, 15, 30, 45, and 60 minutes after. We further reshape the input and output into three-dimensional tensors to efficiently use image-to-image models, so the resulting shapes are $(T_{in} \cdot C, H, W)$ for the input and $(T_{out} \cdot C, H, W)$ for the output, respectively.
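A small sketch of this reshaping on nested lists, assuming the frames arrive as (time, height, width, channel) and that channels are stacked time-major. Both the layout and the stacking order are assumptions of the sketch, and the function names are hypothetical.

```python
def frames_to_channels(frames):
    """Reshape a (T, H, W, C) nested list into (T * C, H, W).

    Output plane t * C + c holds channel c of frame t (time-major stacking).
    """
    height, width = len(frames[0]), len(frames[0][0])
    num_channels = len(frames[0][0][0])
    stacked = []
    for frame in frames:
        for c in range(num_channels):
            stacked.append(
                [[frame[h][w][c] for w in range(width)] for h in range(height)]
            )
    return stacked


def channels_to_frames(stacked, num_channels):
    """Inverse of frames_to_channels: (T * C, H, W) back to (T, H, W, C)."""
    num_frames = len(stacked) // num_channels
    height, width = len(stacked[0]), len(stacked[0][0])
    return [
        [
            [
                [stacked[t * num_channels + c][h][w] for c in range(num_channels)]
                for w in range(width)
            ]
            for h in range(height)
        ]
        for t in range(num_frames)
    ]


# Tiny example: 2 frames, 2 x 3 pixels, 2 channels; values encode their index.
frames = [
    [[[t * 1000 + h * 100 + w * 10 + c for c in range(2)] for w in range(3)]
     for h in range(2)]
    for t in range(2)
]
stacked = frames_to_channels(frames)
```

With one hour of 5-minute frames and 8 channels, the same stacking yields 96 input planes, and 48 output planes for the six predicted time steps.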

As we only have 6 months of labeled data from 2019, we further randomly split the data into 4 folds, fixing one of the folds as a validation set.
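The split can be sketched as follows. The unit of splitting (whole days), the identifier format, and the seed are assumptions, since the text only states that the data are split randomly into 4 folds with one fold fixed for validation.

```python
import random


def split_into_folds(sample_ids, num_folds=4, seed=0):
    """Shuffle the ids once with a fixed seed and deal them into folds."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::num_folds] for i in range(num_folds)]


# Hypothetical identifiers for roughly 6 months of daily files.
day_ids = [f"2019-day-{i:03d}" for i in range(180)]
folds = split_into_folds(day_ids)

val_ids = folds[0]  # fixed validation fold
train_ids = [d for fold in folds[1:] for d in fold]
```

Fixing the seed keeps the validation fold identical across models, so the validation scores of different architectures stay comparable.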

Figure 1: Generated masks of the road map from train and test data, for Berlin, Chicago, Istanbul, and Melbourne.

Static roadmap generation

The predicted traffic maps are sparse, meaning that there are pixels that never encode any traffic, since there is simply no road in that region. While we expect the trained networks to learn a large part of the road network, we find that post-processing the network predictions by multiplying them with a two-dimensional binary mask of the road graph yields better results. For this 2D mask, we experiment with the low-resolution static maps provided by the competition organizers, alongside masks generated from the training data in the following way: we set a pixel value to $1$ if the traffic volume or speed value has been greater than zero at least once among the training samples, and to $0$ otherwise. We find that adding pixel values from the 2020 test data further boosts the performance. The obtained roadmaps are shown in Figure 1.
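The mask generation and the multiplicative post-processing can be sketched on nested lists as follows. The function names are hypothetical, and each sample is assumed to be a single (H, W, C) frame.

```python
def build_road_mask(samples):
    """Return an H x W binary mask: 1 where any channel of any sample is > 0."""
    mask = None
    for frame in samples:
        if mask is None:
            mask = [[0] * len(frame[0]) for _ in frame]
        for h, row in enumerate(frame):
            for w, pixel in enumerate(row):
                if any(value > 0 for value in pixel):
                    mask[h][w] = 1
    return mask


def apply_mask(prediction, mask):
    """Zero out every channel of the pixels that lie outside the road mask."""
    return [
        [[value * mask[h][w] for value in pixel] for w, pixel in enumerate(row)]
        for h, row in enumerate(prediction)
    ]


# Two tiny 2 x 2 frames with 2 channels each.
frame_a = [[[0, 0], [3, 0]], [[0, 0], [0, 0]]]
frame_b = [[[0, 0], [0, 0]], [[0, 7], [0, 0]]]
road_mask = build_road_mask([frame_a, frame_b])

prediction = [[[5, 5], [5, 5]], [[5, 5], [5, 5]]]
masked_prediction = apply_mask(prediction, road_mask)
```

Extending the mask with the 2020 test inputs amounts to passing those frames to `build_road_mask` together with the training frames.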

3 Methods

3.1 U-Net

As a strong baseline solution, we utilize a vanilla U-Net network Ronneberger et al. (2015) with 5 downsampling blocks and batch normalization Ioffe and Szegedy (2015) following each convolutional block with ReLU activation Nair and Hinton (2010). The U-Net network takes an image with $T_{in} \cdot C = 96$ channels and outputs an image with $T_{out} \cdot C = 48$ channels. We pad the input maps to a larger spatial resolution to better handle the downsampling.
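A minimal sketch of how a suitable padded size can be derived, assuming that each of the 5 downsampling blocks halves the resolution, so that both sides must be multiples of 2^5 = 32. The 495 × 436 grid used in the example is the size of the public Traffic4cast city grids and is an assumption here, as are the helper names.

```python
def padded_size(size, num_downsamplings=5):
    """Smallest multiple of 2 ** num_downsamplings that is >= size."""
    factor = 2 ** num_downsamplings
    return -(-size // factor) * factor  # ceiling division, then scale back


def symmetric_padding(size, num_downsamplings=5):
    """Split the required padding into (before, after) amounts."""
    total = padded_size(size, num_downsamplings) - size
    return total // 2, total - total // 2


# Assumed grid size of one city map.
height, width = 495, 436
padded_hw = (padded_size(height), padded_size(width))
pads = (symmetric_padding(height), symmetric_padding(width))
```

After inference, the same padding amounts are cropped from the prediction so that the output matches the original grid.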

Baseline training details

We train an independent model for each city, with the mean squared error (MSE) chosen as the loss function to match the evaluation metric. The PyTorch framework Paszke et al. (2019) is used for all of our experiments. We employ Adam Kingma and Ba (2014) as the optimizer with a batch size of 8. We also employ a warm-up technique, linearly increasing the learning rate to its target value over the first 1k optimization steps. All our training experiments are conducted with mixed precision Micikevicius et al. (2018) in order to decrease training time.
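The linear warm-up can be sketched as a plain function of the optimization step. The start and target learning rates below are hypothetical placeholders; only the 1k-step warm-up length comes from the text.

```python
def warmup_lr(step, target_lr, start_lr=0.0, warmup_steps=1000):
    """Learning rate after `step` optimization steps with linear warm-up."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Hypothetical target learning rate, used only to illustrate the schedule.
TARGET_LR = 1e-3
schedule = [warmup_lr(step, TARGET_LR) for step in (0, 250, 500, 1000, 5000)]
```

In a PyTorch training loop the same function could be used to set the learning rate of each optimizer parameter group before every step.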


The U-Net performance on the validation set of the traffic data from 2019 per city is reported in Table 1. We furthermore evaluate on the organizer’s test set and report the scores from the leaderboard for the post-processing masks discussed in Section 2. Table 2 demonstrates that our proposed roadmap generation method yields the best result on the competition leaderboard. This post-processing technique is later used by default for all submissions.

It is worth noting that these 4 standalone U-Net models with the roadmap post-processing have been sufficient to rank third in this year’s Core Challenge, meaning that carefully tuned U-Net models provide a simple, yet strong solution for the traffic forecasting task that is robust to the temporal domain shift.

Mean Squared Error
Model #Params Melbourne Berlin Istanbul Chicago
U-Net 31M
EfficientNet-B5 U-Net 30M
DenseNet201 U-Net 31M -
Table 1: Models’ performance over the fixed validation set from 2019 data. Mean Squared Error, the lower the better.
Model MSE
U-Net + static city mask from competition organizers
U-Net + generated static city mask from 2019 train data
U-Net + generated static city mask from 2019 train data and 2020 test data 49.69488
Table 2: U-Net performance over the test set from 2020 data. Mean Squared Error, the lower the better.

3.2 Pre-trained Encoders

Inspired by the success of employing convolutional networks pre-trained on ImageNet Deng et al. (2009) as encoders in U-Net Ronneberger et al. (2015) for the task of semantic segmentation Iglovikov and Shvets (2018), we try two different convolutional networks: DenseNet201 Huang et al. (2017) and EfficientNet-B5 Tan and Le (2019).


EfficientNet Tan and Le (2019) has been widely used in many vision tasks, such as detection Tan et al. (2020), segmentation Huynh and Boutry (2020), and others, and has proved to be powerful. Thus, we employ EfficientNet as the encoder part, and keep the decoder from the original U-Net architecture, changing the number of input channels in the decoder side to match the feature maps produced by the EfficientNet. We employ the EfficientNet-B5, pre-trained on ImageNet, so the resulting network achieves almost the same number of parameters as the baseline U-Net, as shown in Table 1. The depth of the U-Net is increased by 1 to match the number of feature maps obtained from the EfficientNet backbone.


Following the insight from the winner of the last two competitions Choi (2019, 2020), we utilize densely connected convolutional layers by employing DenseNet201 Huang et al. (2017) pre-trained on ImageNet. As with EfficientNet, the decoder part is borrowed from the U-Net; however, since we omit the last dense block of DenseNet to keep the model sizes consistent, the resulting network's depth is not increased compared to the baseline U-Net.

Additional training details

The EfficientNet and DenseNet U-Nets are trained with the same parameters as the baseline U-Net, with two differences: we increase the batch size from 8 to 16 to stabilize the training process, and we increase the padding of the input images, so the traffic maps have a larger spatial resolution.

Model MSE
EfficientNet-B5 U-Net + generated static city mask from 2019 and 2020 data
DenseNet201 U-Net + generated static city mask from 2019 and 2020 data
U-Net + generated static city mask from 2019 and 2020 data 49.69488
Table 3: Comparison of the baseline U-Net and pre-trained encoder U-Nets over the test set from 2020 data. Mean Squared Error, the lower the better.


The performance of the two networks over the 2019 validation set can be seen in Table 1. The EfficientNet-B5 U-Net achieves a smaller MSE than the baseline U-Net in all the core cities, and the DenseNet201 U-Net performs even better than the EfficientNet-B5 U-Net. Unfortunately, the training process of the DenseNet201 U-Net on Istanbul suffered from numerical instabilities, so we do not use the DenseNet201 U-Net for this city thereafter.

However, although these U-Net models with pre-trained encoders show better results during validation on the 2019 data, they were not able to beat the baseline U-Net on the competition leaderboard (Table 3). (As there is no DenseNet U-Net model trained for Istanbul, we use predictions from the Istanbul baseline U-Net model.) This observation guided our further research towards bridging the gap between the 2019 training and 2020 testing data, which is described in Section 3.3.

3.3 Domain Adaptation

Domain adaptation is a common approach to the problem in which a model is trained on one domain and tested on a different but related target domain. Since the 2020 test data in the Core Challenge differ significantly from the training data because of the temporal domain shift, we study two domain adaptation strategies applicable to the Traffic4Cast 2021 Core Challenge.


One of the studied domain adaptation approaches is pseudo-labeling Jaiswal et al. (2019). The idea is the following: a model predicts labels for the unlabeled, out-of-domain samples, thus generating a new labeled data set, which is later used for training the model. This procedure may be repeated for several rounds. When we applied it to the baseline U-Net and the 2020 test data, however, the performance of our models on the competition leaderboard degraded. We attribute this behavior to the fact that for the 2020 test data, the temporal domain shift is also present in the output maps.
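The pseudo-labeling loop can be written generically as below. This is a sketch of the general technique rather than the training code used in the competition; `fit` and `predict` are hypothetical callables, and the toy model is a least-squares slope through the origin.

```python
def pseudo_label_rounds(fit, predict, labeled, unlabeled_inputs, rounds=1):
    """Train, label the unlabeled inputs with the model, then retrain."""
    model = fit(labeled)
    for _ in range(rounds):
        pseudo_labeled = [(x, predict(model, x)) for x in unlabeled_inputs]
        model = fit(labeled + pseudo_labeled)
    return model


def fit_slope(pairs):
    """Toy model: least-squares slope of y = a * x through the origin."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)


def predict_slope(slope, x):
    return slope * x


labeled_pairs = [(1.0, 2.0), (2.0, 4.0)]
slope = pseudo_label_rounds(fit_slope, predict_slope, labeled_pairs, [3.0], rounds=2)
```

In this toy case the pseudo-labels only repeat what the model already predicts, so the slope does not change: the procedure adds no information about how the targets themselves shift in the new domain.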

Heuristically bridging the 2019 and 2020 data

Instead of modifying the model itself, we focus on making the 2020 test data closer to the 2019 data on which the above-mentioned models have been trained. We look for a deterministic, invertible transformation that brings a 2020 traffic map closer to the 2019 maps, use the transformed maps to generate the model predictions, and apply the inverse transformation to the predictions to return to the 2020 test data domain.

Let us denote the traffic map at a certain time step $t$ as $X_t \in \mathbb{R}^{H \times W \times C}$, with height $H$, width $W$, and $C$ channels. Then, for all the $N_{2019}$ time steps from 2019 and the $N_{2020}$ time steps from 2020 we calculate the mean traffic map per year the following way:

$$\bar{X}^{2019} = \frac{1}{N_{2019}} \sum_{t=1}^{N_{2019}} X_t^{2019}, \qquad \bar{X}^{2020} = \frac{1}{N_{2020}} \sum_{t=1}^{N_{2020}} X_t^{2020},$$

where the summation symbol denotes element-wise sum. With the mean traffic maps $\bar{X}^{2019}$ and $\bar{X}^{2020}$ for 2019 training and 2020 test data, we then calculate the per-pixel and per-channel relation $R$ of the mean traffic map $\bar{X}^{2019}$ to the mean traffic map $\bar{X}^{2020}$:

$$R = \bar{X}^{2019} \oslash \bar{X}^{2020},$$

where $\oslash$ denotes element-wise division. In case the element of $\bar{X}^{2020}$ is $0$, we set the corresponding element in $R$ to a fixed default value. Furthermore, based on the fact that the traffic volume could not have increased in 2020, we change the outlier elements in $R$ which are less than $1$ to $1$.

We then multiply the input traffic maps from 2020 test data by $R$, so the traffic maps' distribution becomes closer to the 2019 training data, run the model inference, and then multiply the predicted traffic maps by $1/R$ to bring the distribution back to the 2020 test data. (The notation $1/R$ denotes that for each element $r$ in $R$, the resulting corresponding element is $1/r$; the case $r = 0$ is handled separately.)
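The whole procedure can be sketched on flattened maps (one list entry per pixel and channel). The function names, the default ratio of 1.0 for pixels that are empty in 2020, and the stand-in `model` callable are assumptions of the sketch; the lower bound of 1.0 follows from the argument that traffic could not have increased in 2020.

```python
def mean_map(maps):
    """Element-wise mean of equally sized, flattened traffic maps."""
    count = len(maps)
    return [sum(values) / count for values in zip(*maps)]


def ratio_map(mean_2019, mean_2020, default=1.0, lower_bound=1.0):
    """Per-element ratio of the 2019 mean to the 2020 mean, clipped from below."""
    ratios = []
    for m19, m20 in zip(mean_2019, mean_2020):
        # `default` for pixels that are empty in 2020 is an assumption.
        ratio = default if m20 == 0 else m19 / m20
        ratios.append(max(ratio, lower_bound))
    return ratios


def predict_with_tda(model, inputs_2020, ratios):
    """Scale inputs towards 2019, run the model, and scale predictions back."""
    shifted = [value * r for value, r in zip(inputs_2020, ratios)]
    prediction = model(shifted)
    # Every ratio is >= lower_bound > 0 by default, so the division is safe.
    return [value / r for value, r in zip(prediction, ratios)]


maps_2019 = [[10.0, 4.0, 0.0], [30.0, 4.0, 0.0]]
maps_2020 = [[10.0, 8.0, 0.0]]
ratios = ratio_map(mean_map(maps_2019), mean_map(maps_2020))

# An identity function stands in for the trained network.
restored = predict_with_tda(lambda x: x, [5.0, 8.0, 0.0], ratios)
```

With real data, the same ratio map would be applied to each input frame before inference and its inverse to each predicted frame, since all frames share the same pixel and channel layout.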

This Temporal Domain Adaptation (TDA) is later used in our best submission on the competition leaderboard, as described in Section 3.4.

3.4 Ensembling

By the end of the competition we had multiple models and domain adaptation strategies, so we experimented with ensembling them. As discussed in Section 2, we apply our generated static city mask as the post-processing function to all predictions. We experimented with two aggregation functions, mean and median ensembling, and chose mean ensembling, as it performed slightly better.
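The two aggregation functions can be sketched as an element-wise reduction over the flattened predictions of the individual models. The function name and the toy values are hypothetical.

```python
from statistics import mean, median


def ensemble(predictions, reduce=mean):
    """Combine per-model predictions element-wise with `reduce`."""
    return [reduce(values) for values in zip(*predictions)]


# Toy predictions of three models for the same three pixels.
model_predictions = [
    [10.0, 0.0, 30.0],
    [20.0, 0.0, 30.0],
    [60.0, 3.0, 30.0],
]
mean_ensemble = ensemble(model_predictions, reduce=mean)
median_ensemble = ensemble(model_predictions, reduce=median)
```

The median ignores a single outlying model, whereas the mean lets every model contribute to every pixel, so the two can differ noticeably on sparse maps. The static city mask is applied to the combined prediction afterwards.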

Our three final ensembling submissions are the following:

  • The mean ensemble of the predictions from the Baseline U-Net, EfficientNet-B5 U-Net, and DenseNet201 U-Net.

  • The mean ensemble of the predictions from the Baseline U-Net, EfficientNet-B5 U-Net, and DenseNet201 U-Net with the Temporal Domain Adaptation (TDA) from Section 3.3.

  • The two submissions above combined.

We report the obtained errors on the competition leaderboard in Table 4. The ensemble of the Baseline U-Net, EfficientNet-B5 U-Net, and DenseNet201 U-Net models outperforms the standalone Baseline U-Net models; what is more, combining this ensemble with the predictions generated by the same models under the Temporal Domain Adaptation strategy decreases the error further, which makes it our best submission.

Ensemble MSE
Baseline U-Net, EfficientNet-B5 U-Net, DenseNet201 U-Net with TDA
Baseline U-Net, EfficientNet-B5 U-Net, DenseNet201 U-Net
The two of the above combined
Table 4: Ensembled models performance over the test set from 2020 data. Mean Squared Error, the lower the better.

4 Conclusion

Our Traffic4Cast 2021 Core Challenge solution shows yet again that U-Net Ronneberger et al. (2015) is a strong baseline for the traffic prediction task. We have further studied convolutional encoders pre-trained on ImageNet Deng et al. (2009) as a way to improve the baseline U-Net model; however, the obtained models were less robust to the temporal domain shift. To counter the latter, we have applied an ensembling method together with the introduced Temporal Domain Adaptation, which aims to bridge the gap between the 2019 and 2020 data. The method ranked third on the final leaderboard.


References

  • [1] S. Choi (2019) Traffic map prediction using unet based deep convolutional neural network. arXiv:1912.05288.
  • [2] S. Choi (2020) Utilizing unet for the future traffic map prediction task traffic4cast challenge 2020. arXiv:2012.00125.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
  • [5] L. D. Huynh and N. Boutry (2020) A u-net++ with pre-trained efficientnet backbone for segmentation of diseases and artifacts in endoscopy images and videos. In EndoCV@ISBI.
  • [6] V. Iglovikov and A. Shvets (2018) TernausNet: u-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv:1801.05746.
  • [7] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456.
  • [8] A. K. Jaiswal, I. Panshin, D. Shulkin, N. Aneja, and S. Abramov (2019) Semi-supervised learning for cancer detection of lymph node metastases. arXiv:1906.09587.
  • [9] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [10] M. Kopp, D. Kreil, M. Neun, D. Jonietz, H. Martin, P. Herruzo, A. Gruca, A. Soleymani, F. Wu, Y. Liu, et al. (2021) Traffic4cast at neurips 2020 - yet more on the unreasonable effectiveness of gridded geo-spatial processes. In NeurIPS 2020 Competition and Demonstration Track, pp. 325–343.
  • [11] D. P. Kreil, M. K. Kopp, D. Jonietz, M. Neun, A. Gruca, P. Herruzo, H. Martin, A. Soleymani, and S. Hochreiter (2020) The surprising efficiency of framing geo-spatial time series forecasting as a video prediction task–insights from the iarai traffic4cast competition at neurips 2019. In NeurIPS 2019 Competition and Demonstration Track, pp. 232–241.
  • [12] Y. Liu, F. Wu, B. Yu, Z. Liu, and J. Ye (2019) Building effective large-scale traffic state prediction system: traffic4cast challenge solution. arXiv:1911.05699.
  • [13] H. Martin, Y. Hong, D. Bucher, C. Rupprecht, and R. Buffat (2019) Traffic4cast-traffic map movie forecasting–team mie-lab. arXiv preprint arXiv:1910.13824.
  • [14] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu (2018) Mixed precision training. arXiv:1710.03740.
  • [15] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML.
  • [16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
  • [17] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
  • [18] M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
  • [19] M. Tan, R. Pang, and Q. V. Le (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790.
  • [20] J. Xu, J. Zhang, Z. Yao, and Y. Wang (2020) Towards good practices of u-net for traffic forecasting. arXiv:2012.02598.