CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition

by Dogancan Temel, et al.
Georgia Institute of Technology

In this paper, we investigate the robustness of traffic sign recognition algorithms under challenging conditions. Existing datasets are limited in terms of their size and challenging condition coverage, which motivated us to generate the Challenging Unreal and Real Environments for Traffic Sign Recognition (CURE-TSR) dataset. It includes more than two million traffic sign images that are based on real-world and simulator data. We benchmark the performance of existing solutions in real-world scenarios and analyze the performance variation with respect to challenging conditions. We show that challenging conditions can decrease the performance of baseline methods significantly, especially if these challenging conditions result in loss or misplacement of spatial information. We also investigate the effect of data augmentation and show that utilization of simulator data along with real-world data enhances the average recognition performance in real-world scenarios. The dataset is publicly available at


1 Introduction

Autonomous vehicles are transforming existing transportation systems. As we step up the ladder of autonomy, more critical functions are performed by algorithms, which demands more robustness. To follow traffic rules, robust sign recognition systems are essential unless we have prior information about traffic sign types and locations. It is a common practice to test the robustness of these systems with traffic datasets Grigorescu2003; Timofte2009; Timofte2014; Belaroussi2010; Larsson2011; Stallkamp2011; Stallkamp2012; Houben2013; Mogelmose2012; Zhu2016. However, the majority of these datasets are limited in terms of challenging environmental conditions. There is usually no metadata corresponding to challenging conditions or levels in these datasets, which are also limited in terms of dataset size. Moreover, the relationship between challenging conditions and algorithmic performance is not analyzed in these studies. Lu et al. Lu2017 investigated traffic sign detection performance with respect to challenging adversarial examples and showed that adversarial perturbations are effective only in specific situations. Das et al. Das2017 showed the vulnerabilities of existing systems and suggested JPEG compression to eliminate adversarial effects. Even though both of these studies analyze algorithmic performance variation with respect to specific challenging situations, adversarial examples are inherently different from realistic challenging scenarios.

In this paper, we investigate the traffic sign recognition performance of commonly used methods under realistic challenging conditions. To eliminate the shortcomings of existing datasets, we introduce the Challenging Unreal and Real Environments for Traffic Sign Recognition (CURE-TSR) dataset. The contributions of this paper are fourfold.

  • We introduce the most comprehensive publicly-available traffic sign recognition dataset with controlled challenging conditions.

  • We provide a detailed analysis of the benchmarked algorithms in terms of their recognition performance under challenging conditions. Based on this analysis, we identify the vulnerabilities of algorithms with respect to challenging conditions, which should give insights into the use of such models under certain conditions.

  • We provide images that originate from captured sequences as well as synthesized sequences, which would lead to a better understanding of similarities/differences between real-world and simulator data in terms of algorithmic performance. This understanding can be utilized to generate more realistic datasets and minimize the need for real-world data collection that requires significant resources.

  • We use diverse data augmentation methods and show that utilization of limited simulator data along with real-world data can enhance the recognition performance. This observation shows that simulated environments can enhance the performance of data-driven methods in real-world scenarios even when there is a difference between target and source domains.

2 Dataset

Timofte et al. Timofte2014 introduced the Belgium traffic sign classification (BelgiumTSC) dataset, whose images were acquired with a van that had 8 roof-mounted cameras. The acquisition vehicle cruised the streets of Belgium and images were captured every meter. A subset of these images was selected and traffic signs were cropped to obtain the BelgiumTSC dataset. Stallkamp et al. Stallkamp2011; Stallkamp2012 introduced the German traffic sign recognition benchmark (GTSRB) dataset, which was acquired during daytime in Germany. Each traffic sign instance in the dataset is adjusted to have images. BelgiumTSC and GTSRB datasets are limited in terms of challenging environmental conditions and they do not include metadata related to the type of challenging conditions or their levels. Because of limited control over the data acquisition setup, it is not possible to perform controlled experiments with these datasets. The total number of annotated signs including BelgiumTSC and GTSRB datasets is around , which may not be sufficient to test the robustness of recognition algorithms comprehensively. To compensate for these shortcomings in the literature, we introduce the CURE-TSR dataset. The main characteristics of the BelgiumTSC, GTSRB, and CURE-TSR datasets are summarized in Table 1.

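[Table 1 layout not recoverable. Surviving header fragments: "Number of ..." (three columns, one of them "Number of sign types"), "Origin of the videos". Surviving entries: "7,095", "11x10 to ...", "133,000", "51,840", "43", "15x15 to ...", "Prosilica GC ... color camera", "3x7 to ...", "Captured in Belgium and ...", "Generated in Unreal Engine 4".]
Table 1: Main characteristics of BelgiumTSC, GTSRB, and CURE-TSR datasets.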

Traffic sign images in the CURE-TSR dataset were cropped from the CURE-TSD dataset curetsd_dataset, which includes around million real-world and simulator images with more than million traffic sign instances. Real-world images were obtained from the BelgiumTS video sequences and simulated images were generated with the Unreal Engine 4 game development tool. In Fig. 1, we show a sample real-world image and a simulator image. In the rest of this paper, we refer to simulator-generated images as unreal images and real-world images as real images. As observed in the sample images, both real and unreal images are usually from urban environments. While deciding on the types of traffic signs to be included in real and unreal sequences, we focused on two main criteria. First, not every sign type can be reasonably located in unreal sequences. Second, there is a limited number of signs common to the real sequences and the package utilized in the generation of unreal sequences. Based on these selection criteria, we narrowed down the number of traffic sign types to 14, as shown in Fig. 2. Sign types include speed limit, goods vehicles, no overtaking, no stopping, no parking, stop, bicycle, hump, no left, no right, priority to, no entry, yield, and parking.

(a) Real-world (real) image (b) Simulator (unreal) image
Figure 1: Real and unreal environments.
Sign labels (left to right): speed limit, goods vehicles, no overtaking, no stopping, no parking, stop, bicycle, hump, no left, no right, priority to, no entry, yield, parking
Figure 2: Traffic signs in real (1st row) and unreal (2nd row) environments.

Unreal and real sequences were processed with the visual effects software Adobe After Effects to simulate challenging conditions, which include rain, snow, haze, shadow, darkness, brightness, blurriness, dirtiness, colorlessness, and sensor and codec errors. The key component in this study is not the number of traffic signs but the number of challenging conditions and the context of each traffic sign in a virtual dataset and its corresponding real dataset. If one considers a traffic sign in a challenging condition as a distinct configuration, then we end up with 182 (14x13) distinct configurations in real sequences and 168 (14x12) distinct configurations in virtual sequences. In Fig. 3, we show sample stop sign images under challenging conditions in both real and unreal environments. We included codec error as an edge case to test the limits of benchmarked methods. Recognizing traffic signs with codec errors can be challenging even for human subjects because of significant misalignment. If a sign is totally misaligned, it cannot be recognized at the cropped location, but if a residual of the sign remains there, recognition can still be possible. Codec-related errors can be critical in various applications including but not limited to remote driving. Overall, there are 5 challenge levels for each challenge category, which are shown in Appendix A.
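The configuration counts quoted above can be reproduced with a short enumeration. This is only a sketch: the condition labels are taken from Fig. 3, counting the challenge-free case as one condition, and the choice of haze as the condition missing from unreal sequences is an assumption based on the remark in Sec. 3.3 that unreal augmentation images cover every challenge type except haze.

```python
from itertools import product

# The 14 sign types listed in Sec. 2.
SIGN_TYPES = [
    "speed limit", "goods vehicles", "no overtaking", "no stopping",
    "no parking", "stop", "bicycle", "hump", "no left", "no right",
    "priority to", "no entry", "yield", "parking",
]

# Condition labels as they appear in Fig. 3: challenge-free plus 12 challenges.
REAL_CONDITIONS = [
    "no challenge", "decolorization", "lens blur", "codec error",
    "darkening", "dirty lens", "exposure", "gaussian blur", "noise",
    "rain", "shadow", "snow", "haze",
]

# Assumption: haze is the condition not available for unreal sequences.
UNREAL_CONDITIONS = [c for c in REAL_CONDITIONS if c != "haze"]

# A configuration is one (sign type, condition) pair.
real_configs = list(product(SIGN_TYPES, REAL_CONDITIONS))
unreal_configs = list(product(SIGN_TYPES, UNREAL_CONDITIONS))

print(len(real_configs), len(unreal_configs))
```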

Column labels (left to right): No Challenge, Decolorization, Lens Blur, Codec Error, Darkening, Dirty Lens, Exposure, Gaussian Blur, Noise, Rain, Shadow, Snow, Haze
Figure 3: Stop signs under challenging conditions in real (1st row) and unreal (2nd row) environments.

3 Experiments

3.1 Baseline Methods, Dataset, and Performance Metric

In the German traffic sign recognition benchmark (GTSRB) Stallkamp2011, histogram of oriented gradients (HoG) features were utilized to report the baseline results. In the Belgium traffic sign classification (BelgiumTSC) benchmark, cropped traffic sign images were converted into grayscale and rescaled to patches, which were included in the baseline. Moreover, HoG features were also used as a baseline method. The benchmark authors classified traffic sign images with methods including support vector machines (SVMs). Similar to the GTSRB and BelgiumTSC benchmarks, we use rescaled grayscale and color images as well as HoG features as baselines. In the final classification stage, we utilize one-vs-all SVMs with radial basis kernels and softmax classifiers. In addition to the aforementioned techniques, we also use a shallow convolutional neural network, which consists of two convolutional layers followed by two fully connected layers and a softmax classifier. We preprocessed images using normalization, mean subtraction, and division by standard deviation.
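As a rough illustration of the preprocessing and the softmax stage described above, the sketch below standardizes one flattened patch (mean subtraction followed by division by the standard deviation) and converts raw class scores into probabilities. It is not the authors' implementation: the first normalization step is left out because its type is not stated here, per-image statistics are an assumption (dataset-level statistics would be an equally plausible reading), and the function names are hypothetical.

```python
import math

def standardize(pixels):
    """Subtract the mean and divide by the standard deviation of one patch."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant patch carries no contrast; map it to all zeros.
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

patch = standardize([0.0, 2.0, 4.0])
probabilities = softmax([1.0, 3.0, 0.5])
predicted_class = probabilities.index(max(probabilities))
```

After standardization the patch has zero mean and unit variance, which is one reason a global brightness shift such as darkening has limited effect on the classifiers.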

Traffic sign images originate from video sequences, which are split into approximately training set and test set. Video sequences were split one sign at a time, starting from the least common sign. Once video sequences were assigned to training or testing sets, splitting continued from the remaining sequences until all the sequences were assigned. In the first experiment set, we utilize traffic sign images in the training stage obtained from challenge-free real training sequences. In the testing, we utilize images from each challenge category and level, which adds up to images ( images challenge types levels). As the performance metric, we utilize classification accuracy, which corresponds to the percentage of traffic signs that are correctly classified.
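One possible reading of the sequence-level split and the accuracy metric is sketched below. The function names, the sequence ids, and the `train_fraction` value of 0.5 are hypothetical placeholders, not the split ratio used for the dataset, and the tie-breaking and rounding rules are assumptions.

```python
from collections import Counter

def split_sequences(sequence_signs, train_fraction):
    """Assign each video sequence to "train" or "test", one sign type at a time.

    sequence_signs maps a sequence id to the set of sign types it contains.
    """
    # Count how many sequences contain each sign type.
    counts = Counter(sign for signs in sequence_signs.values() for sign in signs)
    assignment = {}
    # Visit sign types from least to most common (ties broken by name).
    for sign, _ in sorted(counts.items(), key=lambda item: (item[1], item[0])):
        # Only sequences that contain this sign and are still unassigned.
        pending = sorted(
            seq for seq, signs in sequence_signs.items()
            if sign in signs and seq not in assignment
        )
        n_train = round(train_fraction * len(pending))
        for index, seq in enumerate(pending):
            assignment[seq] = "train" if index < n_train else "test"
    return assignment

def accuracy_percent(predicted, actual):
    """Percentage of traffic signs that are correctly classified."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

sequences = {
    "seq_a": {"stop"},
    "seq_b": {"stop", "yield"},
    "seq_c": {"yield"},
    "seq_d": {"yield"},
    "seq_e": {"yield"},
    "seq_f": {"yield"},
}
split = split_sequences(sequences, train_fraction=0.5)
score = accuracy_percent(["stop", "yield", "stop", "stop"],
                         ["stop", "yield", "yield", "stop"])
```

Splitting at the sequence level keeps all crops of the same physical sign on one side of the split, so near-duplicate frames cannot leak from training into testing.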

3.2 Experiment 1: Recognition in Real Environments under Challenging Conditions

We analyze the accuracy of baseline methods with respect to challenge levels for each challenge type and report the results in Fig. 4. Severe decolorization (Fig. 4(a)) leads to at least decrease in accuracy for color-based and HoG-based methods. However, intensity-based methods show consistent performance over different challenge levels since they use no color information. Among all the challenges, codec error is the category that degrades the classification accuracy most, even with challenge level as shown in Fig. 4(c). We can observe that there is at least decrease for each method after challenge level and at least decrease after challenge level . Lens blur (Fig. 4(b)), exposure (Fig. 4(f)), and Gaussian blur (Fig. 4(g)) result in significant performance decrease under severe challenging conditions, at least for each baseline method. However, classification accuracy decreases more linearly in these categories compared to codec error because of its steep decrease in level . In the darkening category (Fig. 4(d)), classification accuracy is consistent for all the methods. The normalization operation in the preprocessing step makes all methods less sensitive to the darkening challenge. When the challenge level becomes more severe, the performance of baseline methods degrades by a few percent at most.

In the dirty lens category (Fig. 4(e)), new dirty lens images were overlaid on entire images to increase the challenge level. The new dirt patterns do not necessarily occlude traffic signs. Therefore, the performance of baseline methods does not always change when the challenge level increases. In the noise category (Fig. 4(h)), HoG and CNN show a more linear performance decrease compared to intensity-based and color-based methods. In the rain category (Fig. 4(i)), particle models are all around the scene, which results in significant occlusion even in level challenge. Therefore, degradation while going from challenge-free to level challenge is steeper than any further relative changes for the color-based method, the HoG-based method, and CNN. In the shadow category (Fig. 4(j)), vertical shadow lines are all over the images. We observe slight degradation as the challenge level increases because areas under shadow become less visible. In the case of the snow challenge (Fig. 4(k)), all methods converge to a similar classification accuracy under severe snow. In the haze category (Fig. 4(l)), the performance of intensity-based, color-based, and CNN methods is relatively consistent, whereas the decrease in HoG-based models follows a more linear behavior. Color image-based classifiers and CNN are less sensitive to the haze challenge compared to other methods. The haze challenge was generated as a combination of a radial gradient operator with partial opacity, a smoothing operator, an exposure operator, a brightness operator, and a contrast operator. Moreover, the location of the operator was adjusted manually per frame to simulate a sense of depth. Because of the complexity of the haze model, it is less intuitive to explain the behavior of baseline methods. However, the higher tolerance of the CNN model with respect to the haze challenge can be explained by its capability to directly learn spatial patterns from visual representations.

(a) Decolorization (b) Lens Blur (c) Codec Error (d) Darkening (e) Dirty Lens (f) Exposure (g) Gaussian Blur (h) Noise (i) Rain (j) Shadow (k) Snow (l) Haze
Figure 4: Performance versus challenge levels.

3.3 Experiment 2: Recognition in Real Environments under Challenging Conditions with Data Augmentation

We investigate the role of data augmentation methods in traffic sign recognition under challenging conditions. Augmented data include flipped real images, real challenge images, and unreal challenge images. To augment with flipped images, real traffic sign images were randomly flipped horizontally, vertically, or both horizontally and vertically. To augment with real challenge images, we selected the 20 traffic sign images with maximum area (highest resolution samples) for each traffic sign in the training set. Then, we obtained corresponding images for each challenge type and level. It should be noted that the augmented data are challenge versions of the challenge-free data, which are already utilized in the training. We utilized the same source images because challenging conditions in the original video sequences were synthesized globally over entire videos and it is not possible to apply the same challenge generation framework directly over new traffic sign images. Overall, in both augmentation experiments, the training set includes images ( images challenge types traffic signs) and (complete challenge-free training set) real images. The test set is the same as in Experiment 1 for all three data augmentation methods.
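The flipping-based augmentation can be sketched on images stored as nested lists of pixel rows. The helper names are hypothetical, and drawing the three flip modes with equal probability is an assumption; the text only states that the flip type was chosen randomly.

```python
import random

def flip_horizontal(image):
    """Mirror an image left to right (reverse every row)."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror an image top to bottom (reverse the row order)."""
    return [row[:] for row in image[::-1]]

def random_flip(image, rng):
    """Flip horizontally, vertically, or both, chosen uniformly at random."""
    mode = rng.choice(["horizontal", "vertical", "both"])
    if mode == "horizontal":
        return flip_horizontal(image)
    if mode == "vertical":
        return flip_vertical(image)
    return flip_vertical(flip_horizontal(image))

image = [[1, 2],
         [3, 4]]
augmented = random_flip(image, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.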

Data augmentation based on flipping degrades the recognition performance by . This decrease can be attributed mainly to the asymmetric characteristics of traffic sign images. We explain this performance decrease in more detail in Appendix B. Data augmentation with real challenge images slightly decreases the average performance by . Even though novel challenging conditions are added in the augmentation stage, the original images are already included in the training. Therefore, this augmentation method does not lead to any performance enhancement in the tested scenarios. Similar to real challenge images, we obtained unreal challenge images by selecting the traffic signs with maximum area for each traffic sign, challenge type, and challenge level. Instead of using solely 20 distinct images for each sign, we utilize 220 distinct unreal images (one distinct image for each sign and challenge category) in the data augmentation, which enriches the training set with new angle, contrast, and lighting configurations. Results of unreal image-based data augmentation are summarized in Table 2. Each entry in the table other than the last row and the last column was obtained by calculating the performance change for a baseline method over all the challenge levels for a specific challenge type. Entries in the last row were calculated by averaging the performance change of each baseline method over all challenge types. Finally, entries in the last column were calculated by averaging the performance change over all baseline methods for each challenge type.
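The maximum-area selection used to pick augmentation images can be sketched as a group-and-sort step. The record fields (`sign`, `challenge`, `level`, `width`, `height`) and the function name are hypothetical; how crop metadata is actually stored is not described here.

```python
from collections import defaultdict

def select_largest(samples, k=1):
    """Keep the k largest-area crops per (sign, challenge, level) group."""
    groups = defaultdict(list)
    for sample in samples:
        key = (sample["sign"], sample["challenge"], sample["level"])
        groups[key].append(sample)
    selected = {}
    for key, items in groups.items():
        # Larger area means a higher-resolution view of the sign.
        ranked = sorted(items, key=lambda s: s["width"] * s["height"], reverse=True)
        selected[key] = ranked[:k]
    return selected

samples = [
    {"sign": "stop", "challenge": "rain", "level": 1, "width": 20, "height": 22},
    {"sign": "stop", "challenge": "rain", "level": 1, "width": 64, "height": 60},
    {"sign": "stop", "challenge": "snow", "level": 1, "width": 30, "height": 31},
    {"sign": "yield", "challenge": "rain", "level": 1, "width": 12, "height": 11},
]
largest = select_largest(samples, k=1)
```

Grouping by sign type alone and calling the function with `k=20` would correspond to the selection described for real challenge images.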

Challenge type    Intensity         Color             HoG               CNN      Average
                  Softmax  SVM      Softmax  SVM      Softmax  SVM
Decolorization    +2.86    +3.32    +1.46    -0.53    +1.43    -0.01    +3.23    +1.68
Lens Blur         +3.98    +2.71    +4.45    +6.60    +3.34    +1.81    -1.78    +3.02
Codec Error       +0.47    -1.21    +1.51    -0.82    -1.55    -1.61    +2.40    -0.12
Darkening         +2.83    +2.98    +2.87    +1.44    +1.68    +0.44    +2.58    +2.12
Dirty Lens        +3.14    +2.86    +2.68    +1.63    +2.00    +0.62    +3.11    +2.29
Exposure          +2.54    +1.77    +1.34    +1.97    -0.66    -2.23    +0.54    +0.75
Gaussian Blur     +5.89    +3.98    +4.24    +7.06    +2.03    +1.77    +2.78    +3.97
Noise             +1.62    +1.58    +1.89    +0.58    +1.41    -0.90    +2.25    +1.21
Rain              +2.30    +1.28    +4.73    +2.75    +5.48    +2.34    +0.69    +2.80
Shadow            +2.95    +3.38    +3.27    +1.62    +1.73    +0.64    +3.01    +2.37
Snow              +3.19    +2.81    +2.09    +0.48    +2.63    +0.92    +4.34    +2.35
Haze              +3.28    +3.22    +3.22    +1.41    +2.26    -1.35    +3.51    +2.22
All (average)     +2.92    +2.39    +2.81    +2.02    +1.81    +0.20    +2.22    -
Table 2: Classification accuracy change (%) when additional unreal images are used in training.

We test 7 baseline methods over 12 challenge types and report the performance change of each baseline method for each challenge type. Out of 84 result categories (7 baseline methods × 12 challenge types), classification performance increases in 73 of them. On average, classification performance increases for all challenge types other than a slight decrease in codec error. Moreover, average classification performance increases for each baseline method, which is a slight increase for HoG-SVM (0.20%) and more for other methods (at least 1.81%). Additional unreal images in the training set were obtained from all the challenge types except the haze category. However, classification accuracy in the haze category increases by at least 1.41% for all the baseline methods other than HoG-SVM. The performance enhancement in haze can be understood by analyzing the computational model of haze and its perceptual similarity to other challenges. The haze model includes a smoothing operator, an exposure filter, a brightness operator, and a contrast operator. The exposure filter is used in the exposure (overexposure) model and the smoothing operator is utilized in the blur models. Moreover, perceptually, we can observe similarities between haze and blur challenges in terms of smoothness and similarities between haze and exposure in terms of washed-out details. Therefore, perceptually and computationally similar challenges in the training stage can affect the performance of each other in the testing stage.
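The summary statistics discussed above can be recomputed directly from the entries of Table 2. The sketch below transcribes the table and derives the per-challenge averages (last column), the per-method averages (last row), and the number of entries with a positive change. Because the table entries are themselves rounded to two decimals, the recomputed averages are only expected to match the reported ones to within about 0.01.

```python
METHODS = ["Intensity-Softmax", "Intensity-SVM", "Color-Softmax",
           "Color-SVM", "HoG-Softmax", "HoG-SVM", "CNN"]

# Accuracy change (%) per challenge type, in the column order of METHODS.
TABLE2 = {
    "Decolorization": [2.86, 3.32, 1.46, -0.53, 1.43, -0.01, 3.23],
    "Lens Blur":      [3.98, 2.71, 4.45, 6.60, 3.34, 1.81, -1.78],
    "Codec Error":    [0.47, -1.21, 1.51, -0.82, -1.55, -1.61, 2.40],
    "Darkening":      [2.83, 2.98, 2.87, 1.44, 1.68, 0.44, 2.58],
    "Dirty Lens":     [3.14, 2.86, 2.68, 1.63, 2.00, 0.62, 3.11],
    "Exposure":       [2.54, 1.77, 1.34, 1.97, -0.66, -2.23, 0.54],
    "Gaussian Blur":  [5.89, 3.98, 4.24, 7.06, 2.03, 1.77, 2.78],
    "Noise":          [1.62, 1.58, 1.89, 0.58, 1.41, -0.90, 2.25],
    "Rain":           [2.30, 1.28, 4.73, 2.75, 5.48, 2.34, 0.69],
    "Shadow":         [2.95, 3.38, 3.27, 1.62, 1.73, 0.64, 3.01],
    "Snow":           [3.19, 2.81, 2.09, 0.48, 2.63, 0.92, 4.34],
    "Haze":           [3.28, 3.22, 3.22, 1.41, 2.26, -1.35, 3.51],
}

# Averages as printed in the last column and last row of Table 2.
REPORTED_ROW_AVG = {
    "Decolorization": 1.68, "Lens Blur": 3.02, "Codec Error": -0.12,
    "Darkening": 2.12, "Dirty Lens": 2.29, "Exposure": 0.75,
    "Gaussian Blur": 3.97, "Noise": 1.21, "Rain": 2.80,
    "Shadow": 2.37, "Snow": 2.35, "Haze": 2.22,
}
REPORTED_COL_AVG = [2.92, 2.39, 2.81, 2.02, 1.81, 0.20, 2.22]

# Last column: average over the baseline methods for each challenge type.
row_avg = {name: sum(vals) / len(vals) for name, vals in TABLE2.items()}
# Last row: average over the challenge types for each baseline method.
col_avg = [sum(vals[i] for vals in TABLE2.values()) / len(TABLE2)
           for i in range(len(METHODS))]

n_entries = sum(len(vals) for vals in TABLE2.values())
n_improved = sum(1 for vals in TABLE2.values() for v in vals if v > 0)

print(n_entries, n_improved)
for method, value in zip(METHODS, col_avg):
    print(f"{method}: {value:+.2f}")
```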

4 Conclusion

We introduced the CURE-TSR dataset, which is the most comprehensive traffic sign recognition dataset in the literature that includes controlled challenging conditions. We provided a benchmark of commonly used methods on the CURE-TSR dataset and reported that challenging conditions lead to severe performance degradation for all baseline methods. We have shown that lens blur, exposure, Gaussian blur, and codec error degrade recognition performance more significantly compared to other challenge types because these challenge categories directly result in losing or misplacing structural information. We also investigated the effect of data augmentation and showed that flipping or simply adding challenging conditions to training data does not necessarily enhance recognition performance. However, experimental results showed that utilization of diverse simulator data with challenging conditions can enhance the average recognition performance in real-world scenarios.


References

  • (2) C. Grigorescu and N. Petkov. Distance sets for shape filters and shape recognition. IEEE Trans. Image Proces., 12(10):1274–1286, Oct 2003.
  • (3) R. Timofte, K. Zimmermann, and L. V. Gool. Multi-view traffic sign detection, recognition, and 3D localisation. In WACV, pages 1–8, Dec 2009.
  • (4) R. Timofte, K. Zimmermann, and L. Van Gool. Multi-view traffic sign detection, recognition, and 3D localisation. Mach. Vis. App., 25(3):633–647, 2014.
  • (5) R. Belaroussi, P. Foucher, J. P. Tarel, B. Soheilian, P. Charbonnier, and N. Paparoditis. Road sign detection in images: A case study. In Proc. ICPR, pages 484–488, Aug 2010.
  • (6) F. Larsson and M. Felsberg. Using Fourier descriptors and spatial models for traffic sign recognition. In Proc. SCIA, SCIA’11, pages 238–249, Berlin, Heidelberg, 2011. Springer-Verlag.
  • (7) J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In Proc. IJCNN, pages 1453–1460, July 2011.
  • (8) J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man versus computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012. Selected Papers from IJCNN 2011.
  • (9) S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In Proc. IJCNN, pages 1–8, Aug 2013.
  • (10) A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Trans. Intell. Transp. Syst., 13(4):1484–1497, Dec 2012.
  • (11) Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu. Traffic-sign detection and classification in the wild. In Proc. IEEE CVPR, pages 2110–2118, June 2016.
  • (12) J. Lu, H. Sibai, E. Fabry, and D. Forsyth. No need to worry about adversarial examples in object detection in autonomous vehicles. In arXiv:1707.03501, 2017.
  • (13) N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, L. Chen, M. E. Kounavis, and D. H. Chau. Keeping the bad guys out: Protecting and vaccinating deep learning with jpeg compression. In arXiv:1705.02900, 2017.
  • (14) R. Timofte, K. Zimmermann, and L. V. Gool. Belgium traffic sign dataset, 2009.
  • (15) J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. German traffic sign recognition and detection benchmarks, 2013.
  • (16) CURE-TSR: Challenging unreal and real environments for traffic sign recognition, 2017.
  • (17) CURE-TSD: Challenging unreal and real environments for traffic sign detection, 2017.

Appendix A: Visualization of Challenge Levels and Types

Figure 5: Challenging scene examples from each challenge type and level in CURE-TSD curetsd_dataset and CURE-TSR curetsr_dataset datasets.

To visualize scenes with realistic challenging condition types and levels, we cropped surrounding environments with traffic signs as shown in Fig. 5. Each row corresponds to a challenging condition and each column corresponds to a certain level of the challenging condition. Compared to other existing datasets, the CURE-TSR dataset contains more diverse challenging conditions and levels, which enables a comprehensive platform to test the robustness of recognition algorithms under challenging conditions.

Appendix B: Data Augmentation

We retrained the benchmark algorithms, listed in Sec. 3.1, by augmenting the initial training images with their flipped versions. Challenge-free real images flipped vertically, horizontally, or both vertically and horizontally were used for data augmentation. Translation was not utilized in the data augmentation because the recognition dataset is based on cropped images that do not include background information. Flipping-based data augmentation degrades the average recognition accuracy by more than mainly because of the asymmetric characteristics of traffic sign images. For instance, consider the no stopping and no parking signs from Fig. 2. The former sign is symmetric along its horizontal axis while the latter sign is asymmetric. Augmenting the training data with horizontally flipped versions of the asymmetric no parking sign can lead to learning a visual representation similar to the no stopping sign, which is different from the intended class and can degrade the overall recognition accuracy.

We visualize two softmax RGB trained models, one of which was trained without data augmentation while the other was trained with data augmentation (flipped real images), in Figs. 6 and 7, respectively. Consider the case of the learned no parking model ( row, column, green highlight). Perceptually, the data-augmented model has two diagonal lines crossing each other as opposed to the single diagonal in the model learned without data augmentation. However, perpendicularly crossing lines are a characteristic of the learned no stopping model ( row, column, yellow highlight), which would result in misclassification. The increase in the misclassification rate between these two signs caused by flipping-based data augmentation can be seen in the confusion matrices in Fig. 6(b) and Fig. 7(b) (highlighted in yellow: class types 4 and 5), in which darker colors correspond to more misclassification.
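A toy example can make the effect of flipping on the learned template concrete. In the sketch below, a single diagonal stroke stands in for the bar of the no parking sign, and the average of the training images is used as a crude stand-in for a learned template. This is an illustrative assumption and not how the softmax weights in Figs. 6 and 7 were obtained.

```python
def diagonal_stroke(size):
    """Toy sign: a single diagonal bar on an empty square patch."""
    return [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]

def flip_horizontal(image):
    """Mirror an image left to right."""
    return [row[::-1] for row in image]

def mean_template(images):
    """Pixel-wise average of equally sized images."""
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / len(images) for c in range(cols)]
            for r in range(rows)]

def support(template):
    """Pixel coordinates where the template is non-zero."""
    return {(r, c) for r, row in enumerate(template)
            for c, value in enumerate(row) if value > 0}

size = 5
original = diagonal_stroke(size)
plain_template = mean_template([original])
augmented_template = mean_template([original, flip_horizontal(original)])

# Without flipping, only the main diagonal is active; with flipping, the
# anti-diagonal becomes active too, so the template turns into a cross.
main_diagonal = {(i, i) for i in range(size)}
anti_diagonal = {(i, size - 1 - i) for i in range(size)}
```

The augmented template contains both diagonals, which is the crossing pattern that characterizes the no stopping sign and explains the added confusion between the two classes.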

(a) Softmax model (b) Confusion matrix
Figure 6: RGB softmax model visualization and averaged confusion matrix without data augmentation.
(a) Softmax model (b) Confusion matrix
Figure 7: RGB softmax model visualization and averaged confusion matrix with data augmentation.