Semantic Label Reduction Techniques for Autonomous Driving

02/11/2019 ∙ by Qadeer Khan, et al.

Semantic segmentation maps can be used as input to models for maneuvering the controls of a car. However, not all labels may be necessary for making the control decision. One would expect that certain labels such as road lanes or sidewalks would be more critical in comparison with labels for vegetation or buildings which may not have a direct influence on the car's driving decision. In this appendix, we evaluate and quantify how sensitive and important the different semantic labels are for controlling the car. Labels that do not influence the driving decision are remapped to other classes, thereby simplifying the task by reducing to only labels critical for driving of the vehicle.




I Introduction

In the context of autonomous driving, semantic segmentation models are widely used for the perception of the driving environment [1, 2] as well as for control of the ego-vehicle [3, 4]. Semantic maps offer several advantages over raw RGB data, as described below [4]:

  • Figure 1 shows how two weather conditions have different RGB inputs but the same semantic pixel labels. Hence, if the correct semantic representation is used as input to predict the correct steering commands, the model does not need to learn for each and every weather condition.

  • The semantic labels can precisely localize the pixels of important road landmarks such as traffic lights and signs. The status/information contained on these can then be read off to take appropriate planning and control decisions.

  • A high proportion of the pixels have the same label as their neighbours. This redundancy can be utilized to reduce the dimensionality of the semantic scene. Hence, the number of parameters required to train the control module can also be reduced.
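The redundancy mentioned in the last point can be made concrete with a small sketch. The function below is only an illustration and not part of the original experiments: it measures the fraction of adjacent pixel pairs that share a label, using a hypothetical toy label map and a hypothetical helper name `neighbour_agreement`.

```python
import numpy as np

def neighbour_agreement(label_map: np.ndarray) -> float:
    """Fraction of horizontally or vertically adjacent pixel pairs sharing a label."""
    same_h = label_map[:, 1:] == label_map[:, :-1]  # left/right neighbour pairs
    same_v = label_map[1:, :] == label_map[:-1, :]  # up/down neighbour pairs
    return float((same_h.sum() + same_v.sum()) / (same_h.size + same_v.size))

# Hypothetical 4x6 label map: class 0 in the upper half, class 1 in the lower half.
toy_map = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
])
agreement = neighbour_agreement(toy_map)
```

Only the pairs that straddle the boundary between the two regions disagree, so a map made of large uniform regions scores close to 1.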

Fig. 1: This figure shows an example of RGB images representing 2 different weather scenarios but with the same semantic representation.

Depending on the purpose of the semantic maps, not all of their labels may be necessary. For example, in predicting the steering angle of the ego-vehicle, certain labels such as road lines and sidewalks would be more important for the driving decision than labels for vegetation or buildings, which do not have a direct influence on the car’s controls. In this paper, we evaluate how sensitive and important the different semantic labels are for controlling the driving behaviour. Labels that do not influence the driving decision are remapped to other classes. We use the following methods to identify the importance of the semantic labels:

  • Grad-CAM (Section III)

  • Semantic label removal (Section IV)

Note that all data collection and experiments are done using the CARLA simulator [5], which provides semantic labels for 13 classes. These classes correspond to roads, sidewalks, road lines, fences, vehicles, pedestrians, other objects, vegetation, poles, traffic signs, walls, buildings, and a none class for objects that do not fall into any of the prior labels. We aim to control only the steering angle of the car while keeping the throttle fixed. The steering command ranges between -1 and 1. The steering angle in degrees corresponding to these values depends on the vehicle being used. In our case, the default vehicle is used, for which the maximum steering angle is 70°. Videos referenced in the subsequent sections can be found at the following link:
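As a minimal sketch of how a normalized steering command relates to an angle, the snippet below assumes a linear mapping between the normalized range [-1, 1] and the maximum angle of 70 degrees. The linearity and the function name `steering_to_degrees` are assumptions made for illustration only.

```python
MAX_STEERING_ANGLE_DEG = 70.0  # maximum angle of the default vehicle, as stated in the text

def steering_to_degrees(steering: float,
                        max_angle_deg: float = MAX_STEERING_ANGLE_DEG) -> float:
    """Map a normalized steering command to degrees (linear mapping is an assumption)."""
    if not -1.0 <= steering <= 1.0:
        raise ValueError("normalized steering must lie in [-1, 1]")
    return steering * max_angle_deg

half_lock = steering_to_degrees(0.5)   # half of the maximum angle
full_lock = steering_to_degrees(-1.0)  # maximum angle in the opposite direction
```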

II Related Work

Semantic Sensitivity. The authors of [6] instituted a hierarchical structure for segregating the various semantic classes based on their relative importance by assigning them weights proportionate to their significance. Classes of differing importance were placed on different levels of the hierarchy. A special “Importance Aware Loss” was introduced that stressed the need for correctly segmenting the more important semantic classes in comparison to less important ones.

III Grad-CAM

Gradient-weighted class activation mapping (Grad-CAM) is a technique that provides a visual explanation of how a model classifies images [7, 8]. This is done by tracking the flow of gradients to localize areas in the original input image that are important to the model for making the correct classification. Our task of predicting the steering command is one of regression rather than classification. Nevertheless, by using the same Grad-CAM technique of tracking the flow of gradients, we can ascertain regions in the input RGB image, and ultimately the corresponding semantic labels, that are fundamental for the model to produce the correct steering command. We train a simple end-to-end model, whose architecture is described in Figure 2. The model takes an RGB input (channels = 3); its remaining parameters, namely the filter counts of the three convolutional layers and the number of neurons in the fully connected layer, are the ones named in Figure 2.

Fig. 2: The architecture for training a model controlling the steering command of a car using RGB/segmentation images. In the first layer, channels = 3 (for RGB input) or 13 (for segmentation input). The remaining parameters are the number of filters of the convolution operations represented by layers 1, 4, and 7, respectively, and the number of neurons in the fully connected layer 10. All convolutional layers have a kernel size of 5 and a stride of 1. All max-pooling layers have a kernel size of 2 and a stride of 2.
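The kernel sizes and strides above determine how the spatial resolution shrinks through the network. The following sketch traces that arithmetic under two assumptions that the text does not state: the convolutions use no padding, and the input is a square image of 128 pixels per side. The helper names are hypothetical.

```python
def conv_out(size: int, kernel: int = 5, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a convolution (zero padding is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output side length of a max-pooling layer."""
    return (size - kernel) // stride + 1

def feature_map_size(size: int, num_blocks: int = 3) -> int:
    """Apply the convolution + max-pooling pair once per convolutional block."""
    for _ in range(num_blocks):
        size = pool_out(conv_out(size))
    return size

# Hypothetical 128x128 input: 128 -> 124 -> 62 -> 58 -> 29 -> 25 -> 12
side = feature_map_size(128)
```

The number of inputs to the fully connected layer would then be the filter count of the last convolution multiplied by `side * side`.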

We track the backward gradient flow from layer 7 of the model back to the input image to home in on regions that are of interest for driving. Figure 3 shows these regions overlaid on the original image as a heat map, where the color of the heat map corresponds to the intensity of relevance. Dark red represents high importance, whereas light blue corresponds to regions of low relevance. It can be observed that the important regions for the driving decision correspond to those with semantic labels of fences (row 1), road lines (rows 2 and 4), vehicles (row 3), the sidewalk boundary with the road (row 5), and the road itself (row 6). It is also interesting to note from the sample images in row 2 that the model seems to be skipping the shadows and basing its decisions only on the road lines. For certain samples of continuous driving sessions, video1 demonstrates regions in the input RGB image which the model uses for decision making.

Fig. 3: Some sample images and affiliated heat maps. While taking the appropriate steering decision, the model seems to be looking into regions for fences (row 1), road lines (row 2 and 4), vehicles (row 3), sidewalk boundary (row 5), and the road (row 6) itself. In row 2, the model is skipping shadows and only focusing on the visible road lines for decision making. Dark red represents regions of high importance, while light blue color represents portions of low relevance for the model for decision making.
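The core combination step of standard Grad-CAM [8] can be sketched as follows. This is a generic illustration rather than the exact procedure used here: it assumes the feature maps of a convolutional layer and the gradients of the predicted steering value with respect to them have already been obtained from some framework, and the function name `grad_cam_map` is hypothetical.

```python
import numpy as np

def grad_cam_map(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Combine feature maps (K, H, W) with the output's gradients w.r.t. them (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # one importance weight per feature map
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over maps -> (H, W)
    cam = np.maximum(cam, 0.0)                        # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam            # scale to [0, 1] for display

# Toy example with two 2x2 feature maps and constant gradients of +1 and -1.
activations = np.zeros((2, 2, 2))
activations[0, 0, 0] = 2.0
activations[1, 1, 1] = 4.0
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heat = grad_cam_map(activations, gradients)
```

The resulting map would still need to be upsampled to the input resolution before being overlaid as a heat map. For a signed regression target such as steering, replacing the rectification with an absolute value is a possible design choice, since both positive and negative influences may matter.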

IV Semantic Label Removal Techniques

In this approach, we train an end-to-end model with the semantic labels as input. The model input has 13 channels, corresponding to the 13 semantic labels. The architecture of this model is described in Figure 2, with channels = 13 and the remaining parameters (the convolutional filter counts and the number of fully connected neurons) as named there.

Once this model is trained, we evaluate the sensitivity of each semantic label by feeding zeros to its corresponding channel and recording the change in error. The error is the mean squared error (MSE) between the steering angle predicted by the model and the ground truth. Zeroing out a channel effectively eliminates that label from the input. Removal of semantic labels which are critical towards making the driving decision would result in an increased error. Figure 4 shows the increase in error by removal of each of the semantic labels.
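The channel-zeroing procedure can be sketched as below. The trained network is replaced by a hypothetical toy stand-in with three channels instead of 13, so only the evaluation loop reflects the described method; `channel_sensitivity`, `toy_model`, and the random data are assumptions for illustration.

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

def channel_sensitivity(model, inputs: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Increase in MSE when each input channel is zeroed; inputs have shape (N, C, H, W)."""
    baseline = mse(model(inputs), targets)
    increases = []
    for c in range(inputs.shape[1]):
        ablated = inputs.copy()
        ablated[:, c] = 0.0  # remove label c from every sample
        increases.append(mse(model(ablated), targets) - baseline)
    return np.array(increases)

# Toy stand-in for a trained model: a weighted sum of per-channel means.
rng = np.random.default_rng(0)
inputs = rng.random((8, 3, 4, 4))
channel_weights = np.array([2.0, 0.0, 0.5])

def toy_model(x: np.ndarray) -> np.ndarray:
    return (x.mean(axis=(2, 3)) * channel_weights).sum(axis=1)

targets = toy_model(inputs)                      # ground truth the toy model fits exactly
increases = channel_sensitivity(toy_model, inputs, targets)
ranking = np.argsort(-increases)                 # most sensitive channel first
```

Sorting the increases in descending order gives the kind of ranking that a bar plot of per-label error would display.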

Fig. 4: Increase in error due to removal of each of the 13 semantic labels by zeroing out the channel corresponding to that label. The bar plot is arranged in descending order of error. The same order also provides an indication of the relative importance of that label towards making a driving decision.

The bar plot is arranged in descending order of error and hence of label importance. From this method we observe that the error increases dramatically when the channels corresponding to sidewalks and road lines are removed. Next in order of importance are the road and fence labels. This method also reaffirms our observation from the Grad-CAM approach in Section III that road lines and sidewalks are of utmost importance for the control model to execute the correct steering command.

V Label Remapping

To predict the steering angle of the car, we train the model with a modular approach instead of an end-to-end learning approach, for reasons given in [4]. This approach is shown in Figure 5. The purpose of the perception module is to use the images captured by an RGB camera to extract semantic features of the scene. These extracted features are then fed to the control module which aims to produce the correct steering command.

Fig. 5:

The perception module is trained as an encoder-decoder architecture, without any skip connections. The encoder sub-module first embeds the raw image into a lower dimensional latent vector. The decoder sub-module reconstructs the semantic scene from this latent vector. If the low-dimensional latent vector contains all the necessary information to reconstruct the semantic scene to a reasonable degree of accuracy, then we directly feed it as an input to the control module instead of the semantic labels. The architecture is same as that used by 


However, as demonstrated earlier, not all the semantic labels may be important for the driving strategy. Reconciling the conclusions from the 2 methods presented in Section III and Section IV, we train a new perception module by remapping the semantic labels as described in Table I. Figure 6 shows some visualizations of the semantic maps resulting from this remapping.

Label     | Roads | Sidewalks | RoadLines | Fences | Vehicles | Pedestrian | Other
Mapped to | Roads | Sidewalks | RoadLines | Fences | Vehicles | Other      | Other

Label     | Vegetation | Poles  | Traffic Sign | Wall  | Building | None
Mapped to | Other      | Fences | Fences       | Other | Other    | None

TABLE I: The first and third rows of the table list the 13 semantic labels. The second and fourth rows show which labels they are remapped to. Note that the color of the text of each label corresponds to the color of its semantic representation displayed in Figure 6.

Fig. 6: This figure shows 5 examples of how remapping influences the visual appearance of the semantic map. The 1st row contains the raw RGB images, and the 2nd row contains the corresponding semantic maps with all 13 labels. The 3rd row contains the semantic maps with labels remapped in accordance with Table I. Note from the images in the 3rd row that the labels for a pedestrian (5th column) and vegetation (6th column) are remapped to the other class. Similarly, labels for poles are remapped to the fence class (1st and 2nd columns).
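The remapping of Table I can be applied to an integer label map with a lookup table, as in the sketch below. The class indices follow the order in which the labels appear in Table I; this ordering is an assumption for illustration and need not match the identifiers used by the simulator.

```python
import numpy as np

# Label order as listed in Table I (the index assignment is hypothetical).
LABELS = ["Roads", "Sidewalks", "RoadLines", "Fences", "Vehicles", "Pedestrian",
          "Other", "Vegetation", "Poles", "TrafficSign", "Wall", "Building", "None"]

# Remapping taken from Table I.
REMAP = {
    "Roads": "Roads", "Sidewalks": "Sidewalks", "RoadLines": "RoadLines",
    "Fences": "Fences", "Vehicles": "Vehicles", "Pedestrian": "Other",
    "Other": "Other", "Vegetation": "Other", "Poles": "Fences",
    "TrafficSign": "Fences", "Wall": "Other", "Building": "Other", "None": "None",
}

# LOOKUP[old_index] gives the index of the class that old_index is mapped to.
LOOKUP = np.array([LABELS.index(REMAP[name]) for name in LABELS])

def remap_labels(label_map: np.ndarray) -> np.ndarray:
    """Remap an integer label map of any shape via array indexing."""
    return LOOKUP[label_map]

# Pedestrian, vegetation, pole, and traffic-sign pixels in a tiny 2x2 map.
remapped = remap_labels(np.array([[5, 7], [8, 9]]))
num_remaining = len(set(LOOKUP.tolist()))
```

With this mapping, seven distinct classes remain after remapping.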

Table II gives a comparison of the training, validation, and test performance of the 2 models trained on the normal and the remapped segmentation labels. It can be observed that the remapping of labels does not cause any degradation in performance. In fact, a comparison of video2 (perception trained with all labels) and video3 (perception trained with remapped labels) shows that the remapped perception module produces a more stable segmentation reconstruction.

Segmentation Model   | Training Error | Validation Error | Test Error
With all labels      | 5.68           | 9.59             | 9.15
With remapped labels | 4.64           | 9.20             | 9.11
TABLE II: The first row is the error for the control model whose perception module is trained with all labels, while the second row is for the perception module trained with the new remapped labels. The values in the table are the mean squared error (MSE) between the actual and the steering command predicted by the models represented in the order of .
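To put the differences in Table II into perspective, the relative change of each error can be computed directly from the tabulated values. The snippet is a plain arithmetic illustration; the dictionary layout and the helper name are hypothetical.

```python
# (error with all labels, error with remapped labels), copied from Table II
errors = {
    "training": (5.68, 4.64),
    "validation": (9.59, 9.20),
    "test": (9.15, 9.11),
}

def relative_change(all_labels: float, remapped: float) -> float:
    """Relative change of the remapped error with respect to the all-labels error."""
    return (remapped - all_labels) / all_labels

changes = {split: relative_change(*pair) for split, pair in errors.items()}
percent = {split: round(100 * value, 1) for split, value in changes.items()}
```

All three changes are negative, that is, the error with remapped labels is lower in every split: by roughly 18.3% on the training set, 4.1% on the validation set, and 0.4% on the test set.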

V-A Discussion

The manner in which the remapping is done and how the sensitivity is calculated are open to debate. For example, instead of just zeroing out the label for road lines, it might have been worthwhile to see what would happen if we camouflaged it into the road. This could be done by assigning the road line pixels to the road label in the road channel. There could be a multitude of such possibilities to consider, and their number may grow exponentially as we refine the segmentation map by increasing the number of semantic labels. Finding the important semantic labels is in itself a separate research topic. As explained in Section II, the authors of [6] have tried to address this problem in the context of autonomous driving.
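The camouflaging alternative mentioned above could look like the following sketch, which operates on a one-hot encoded segmentation map. The channel indices and the function name are hypothetical, and this variant was not evaluated in the experiments.

```python
import numpy as np

ROAD, ROAD_LINE = 0, 2  # hypothetical channel indices

def camouflage_road_lines(one_hot: np.ndarray) -> np.ndarray:
    """Return a copy of a (C, H, W) one-hot map where road-line pixels become road pixels."""
    out = one_hot.copy()
    out[ROAD] = np.maximum(out[ROAD], out[ROAD_LINE])  # mark road-line pixels as road
    out[ROAD_LINE] = 0                                 # clear the road-line channel
    return out

# Toy map with 3 channels and a single row of 3 pixels: road, road line, road.
one_hot = np.zeros((3, 1, 3), dtype=np.uint8)
one_hot[ROAD, 0, [0, 2]] = 1
one_hot[ROAD_LINE, 0, 1] = 1
camouflaged = camouflage_road_lines(one_hot)
```

Unlike plain zeroing, which leaves the former road-line pixels with no active channel, this variant keeps exactly one active channel per pixel.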

VI Conclusion

From the results of the experiments, we observed that training a perception module with the less important semantic labels remapped to other classes did not lead to any degradation in model performance. The car could still be controlled under this approach. A positive consequence is that the number of semantic classes to be labeled can be reduced, thereby possibly curtailing the human labeling effort.


  • [1] A. Meyer, N. O. Salscheider, P. F. Orzechowski, and C. Stiller, “Deep Semantic Lane Segmentation for Mapless Driving,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
  • [2] W. Wang and Z. Pan, “DSNet for Real-Time Driving Scene Semantic Segmentation,” arXiv preprint arXiv:1812.07049, 2018.
  • [3] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving Policy Transfer via Modularity and Abstraction,” in Conference on Robot Learning (CoRL), 2018.
  • [4] P. Wenzel, Q. Khan, D. Cremers, and L. Leal-Taixé, “Modular Vehicle Control for Transferring Semantic Information Between Weather Conditions Using GANs,” in Conference on Robot Learning (CoRL), 2018.
  • [5] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An Open Urban Driving Simulator,” in Conference on Robot Learning (CoRL), 2017.
  • [6] B.-k. Chen, C. Gong, and J. Yang, “Importance-Aware Semantic Segmentation for Autonomous Driving System,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017.
  • [7] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
  • [8] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization,” in IEEE International Conference on Computer Vision (ICCV), 2017.