In the context of autonomous driving, semantic segmentation models are widely used for perception of the driving environment [1, 2] as well as for control of the ego-vehicle [3, 4]. Semantic maps offer several advantages over raw RGB data, as described below:
- Figure 1 shows how two weather conditions produce different RGB inputs but the same semantic pixel labels. Hence, if the correct semantic representation is used as input to predict the steering commands, the model does not need to be trained separately for every weather condition.
- The semantic labels can precisely localize the pixels of important road landmarks such as traffic lights and signs. The status or information shown on these can then be read off to make appropriate planning and control decisions.
- A high proportion of the pixels have the same label as their neighbours. This redundancy can be used to reduce the dimensionality of the semantic scene. Hence, the number of parameters in the control module can also be reduced.
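The redundancy argument in the last point can be checked numerically. The following is a minimal sketch, assuming a label map stored as a 2-D NumPy integer array; the function names `neighbour_agreement` and `downsample_labels` are illustrative and do not come from the paper. It measures the fraction of horizontally adjacent pixel pairs that share a label, and reduces resolution by strided subsampling, which never creates a label that was not already in the map.

```python
import numpy as np

def neighbour_agreement(labels: np.ndarray) -> float:
    """Fraction of horizontally adjacent pixel pairs with identical labels."""
    same = labels[:, 1:] == labels[:, :-1]
    return float(same.mean())

def downsample_labels(labels: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th pixel along both axes (nearest-neighbour style)."""
    return labels[::factor, ::factor]

# Toy 4x8 label map: left half is class 0, right half is class 1.
toy = np.zeros((4, 8), dtype=np.int64)
toy[:, 4:] = 1
agreement = neighbour_agreement(toy)   # only the pairs at the class boundary differ
small = downsample_labels(toy, 2)      # 2x4 map with the same two classes
```

A high agreement value suggests the map can be subsampled with little loss of information, although thin structures such as road lines may disappear at large factors.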
Depending on the purpose of the semantic maps, not all of their labels may be necessary. For example, when predicting the steering angle of the ego-vehicle, labels such as road lines and sidewalks are likely to matter more for the driving decision than labels for vegetation or buildings, which have no direct influence on the car’s controls. In this paper, we evaluate how sensitive the driving behaviour is to the different semantic labels and how important each label is for controlling it. Labels that do not influence the driving decision are remapped to other classes. We use the following methods to identify the importance of the semantic labels:
Note that all data collection and experiments are done using the CARLA simulator [5], which provides semantic labels for 13 classes. These classes correspond to roads, sidewalks, road lines, fences, vehicles, pedestrians, other objects, vegetation, poles, traffic signs, walls, buildings, and a none class for objects that do not fall into any of the prior labels. We aim to control only the steering angle of the car while keeping the throttle fixed. The steering command ranges between -1 and 1. The steering angle in degrees corresponding to these values depends on the vehicle being used. In our case, the default vehicle is used, for which the maximum steering angle is 70 degrees. Videos referenced in the subsequent sections can be found at the following link: https://www.youtube.com/playlist?list=PLKWxSGEZd0AcvsC2N5trDhPbbeLdGLfi3.
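To make the relation between the normalised steering command and the physical angle concrete, here is a minimal sketch. It assumes the command lies in [-1, 1] and that the mapping to degrees is linear. Both are assumptions for illustration, and `steer_to_degrees` is a hypothetical helper rather than part of the CARLA API.

```python
MAX_STEER_DEG = 70.0  # maximum steering angle of the default vehicle, as stated in the text

def steer_to_degrees(steer: float, max_deg: float = MAX_STEER_DEG) -> float:
    """Map a normalised steering command to degrees (assumed linear mapping)."""
    clipped = max(-1.0, min(1.0, steer))  # keep the command inside [-1, 1]
    return clipped * max_deg

half_turn = steer_to_degrees(0.5)     # half of the maximum angle
saturated = steer_to_degrees(-2.0)    # out-of-range commands are clipped
```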
II Related Work
Semantic Sensitivity. The authors of [6] introduced a hierarchical structure that separates the semantic classes by their relative importance, assigning each class a weight proportional to its significance. Classes of differing importance were placed on different levels of the hierarchy. A special “Importance Aware Loss” was introduced that emphasizes correct segmentation of the more important semantic classes over the less important ones.
Gradient-weighted class activation mapping (Grad-CAM) is a technique that provides a visual explanation of how a model classifies images [7, 8]. It does so by tracking the flow of gradients to localize the areas of the input image that matter most to the model for making the correct classification. Our task of predicting the steering command is one of regression rather than classification. Nevertheless, by using the same Grad-CAM technique of tracking the flow of gradients, we can ascertain the regions in the input RGB image, and ultimately the corresponding semantic labels, that are fundamental for the model to produce the correct steering command. We train a simple end-to-end model, whose architecture is described in Figure 2. The model has the following parameters: channels = 3, , , , .
We track the backward gradient flow from layer 7 of the model all the way back to the input image, to home in on regions that are of interest for driving. Figure 3 shows these regions overlaid on the original image as a heat map, where the color of the heat map corresponds to the intensity of relevance. Dark red represents high importance, whereas light blue corresponds to regions of low relevance. It can be observed that the important regions for the driving decision correspond to those with semantic labels of fences (row 1), road lines (rows 2 and 4), vehicles (row 3), the sidewalk boundary with the road (row 5), and the road itself (row 6). It is also interesting to note from the sample images in row 2 that the model seems to ignore the shadows and bases its decisions only on the road lines. For certain samples of continuous driving sessions, video1 demonstrates the regions in the input RGB image that the model uses for decision making.
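As an illustration of the gradient weighting involved, the sketch below applies the standard Grad-CAM combination from [8] to a scalar regression output. It assumes that the activations of a convolutional layer and the gradients of the predicted steering value with respect to those activations have already been extracted from the network. The function name `grad_cam` and the toy arrays are illustrative, and this is not necessarily the exact procedure used to produce Figure 3.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM map from (K, H, W) activations and d(output)/d(activations)."""
    weights = gradients.mean(axis=(1, 2))              # one weight per feature map
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over K -> (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU keeps positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1] for display

# Toy example with two 2x2 feature maps.
acts = np.zeros((2, 2, 2))
acts[0, 0, 0] = 2.0
acts[1, 1, 1] = 4.0
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heat = grad_cam(acts, grads)
```

In a regression setting, regions that push the steering value down can be as relevant as those that push it up, so one may also want to inspect the map before the ReLU or use its absolute value.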
IV Semantic Label Removal Techniques
In this approach, we train an end-to-end model with the semantic labels as input. The model input has 13 channels, corresponding to the 13 semantic labels. The architecture of this model is described in Figure 2 with the following parameters, channels = 13, , , , .
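A 13-channel input of this kind can be obtained by one-hot encoding the label map. The following is a minimal sketch, assuming the labels arrive as an (H, W) integer array with values from 0 to 12; the helper name `one_hot` and the channel ordering are assumptions for illustration.

```python
import numpy as np

NUM_CLASSES = 13  # number of semantic classes stated in the text

def one_hot(labels: np.ndarray, num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Convert an (H, W) integer label map into a (num_classes, H, W) float array."""
    classes = np.arange(num_classes)[:, None, None]
    return (classes == labels[None, :, :]).astype(np.float32)

label_map = np.array([[0, 12], [3, 3]])
channels = one_hot(label_map)   # exactly one channel is 1 at every pixel
```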
Once this model is trained, we evaluate the sensitivity of each semantic label by feeding zeros to its corresponding channel and recording the change in error. The error is the mean squared error (MSE) between the steering angle predicted by the model and the ground truth. Zeroing out a channel effectively eliminates that label from the input. Removal of semantic labels which are critical towards making the driving decision would result in an increased error. Figure 4 shows the increase in error by removal of each of the semantic labels.
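The zeroing procedure can be sketched as follows. The `predict` argument stands in for the trained model and is assumed to map a batch of shape (N, C, H, W) to N steering values. The toy model, the random data, and the function names are illustrative only and are not taken from the paper.

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

def channel_sensitivity(predict, inputs: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Return the MSE increase caused by zeroing each input channel in turn."""
    baseline = mse(predict(inputs), targets)
    increases = []
    for c in range(inputs.shape[1]):
        ablated = inputs.copy()
        ablated[:, c] = 0.0   # remove label c from the input
        increases.append(mse(predict(ablated), targets) - baseline)
    return np.array(increases)

# Toy stand-in for the trained model: steering depends only on channel 2.
def toy_predict(x: np.ndarray) -> np.ndarray:
    return x[:, 2].mean(axis=(1, 2))

rng = np.random.default_rng(0)
x = rng.random((8, 13, 4, 4))
y = toy_predict(x)
delta = channel_sensitivity(toy_predict, x, y)
ranking = np.argsort(-delta)  # most sensitive channel first
```

Sorting the increases in descending order gives the kind of ranking shown in a bar plot of label importance.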
The bar plot is arranged in descending order of error and hence of label importance. With this method we observe that the error increases dramatically when we remove the channels corresponding to sidewalks and road lines. Next in order of importance are the road and fence labels. This method also reaffirms our observation from the Grad-CAM approach in Section III that road lines and sidewalks are of utmost importance for the control model to execute the correct steering command.
V Label Remapping
To predict the steering angle of the car, we train the model with a modular approach instead of an end-to-end learning approach, for reasons given in . This approach is shown in Figure 5. The purpose of the perception module is to use the images captured by an RGB camera to extract semantic features of the scene. These extracted features are then fed to the control module which aims to produce the correct steering command.
However, as demonstrated earlier, not all the semantic labels may be important for the driving strategy. Reconciling the conclusions from the two methods presented in Section III and Section IV, we train a new perception module by remapping the semantic labels as described in Table I. Figure 6 shows some visualizations of the semantic maps resulting from this remapping.
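Remapping of this kind can be implemented with a lookup table over class indices. The sketch below is illustrative only: the class indices and the entries of `REMAP` are hypothetical placeholders, and the mapping actually used is the one given in Table I.

```python
import numpy as np

# Hypothetical remapping (source class -> target class); not the mapping from Table I.
REMAP = {7: 0, 11: 0}

def build_lut(remap: dict, num_classes: int = 13) -> np.ndarray:
    """Identity lookup table with the remapped classes overwritten."""
    lut = np.arange(num_classes)
    for src, dst in remap.items():
        lut[src] = dst
    return lut

def remap_labels(labels: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Apply the lookup table to every pixel of an integer label map."""
    return lut[labels]

lut = build_lut(REMAP)
remapped = remap_labels(np.array([[7, 1], [11, 3]]), lut)
```

Classes that are absent from the dictionary keep their original index, so the ground-truth maps for the perception module can be regenerated without touching the important labels.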
Table II compares the training, validation, and test performance of the two models trained on the normal and the remapped segmentation labels. It can be observed that remapping the labels does not cause any degradation in performance. In fact, a comparison of video2 (perception trained with all labels) and video3 (perception trained with remapped labels) shows that the remapped perception module produces a more stable segmentation reconstruction.
| Segmentation Model | Training Error | Validation Error | Test Error |
|---|---|---|---|
| With all labels | 5.68 | 9.59 | 9.15 |
| With remapped labels | 4.64 | 9.20 | 9.11 |
The manner in which the remapping is done, or in which the sensitivity is calculated, is open to debate. For example, instead of just zeroing out the label for road lines, it might have been worthwhile to see what happens if we camouflage it into the road. This could be done by also marking the road line pixels as road in the road channel. There is a multitude of such possibilities to consider, and their number may grow exponentially as we refine the segmentation map by increasing the number of semantic labels. Finding the important semantic labels is in itself a separate research topic. As explained in Section II, the authors of [6] have tried to address this problem in the context of autonomous driving.
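The camouflage alternative can be sketched on a one-hot input of shape (C, H, W). The channel indices `ROAD` and `ROAD_LINE` below are hypothetical, since the channel ordering is not specified here, and the function illustrates the idea rather than an experiment reported in this paper.

```python
import numpy as np

ROAD, ROAD_LINE = 0, 2  # hypothetical channel indices

def camouflage(one_hot: np.ndarray, src: int = ROAD_LINE, dst: int = ROAD) -> np.ndarray:
    """Merge channel `src` into channel `dst`, then clear `src`."""
    out = one_hot.copy()
    out[dst] = np.maximum(out[dst], out[src])  # road-line pixels now count as road
    out[src] = 0.0
    return out

scene = np.zeros((13, 1, 2), dtype=np.float32)
scene[ROAD_LINE, 0, 0] = 1.0   # one road-line pixel
scene[ROAD, 0, 1] = 1.0        # one road pixel
merged = camouflage(scene)
```

Unlike plain zeroing, every pixel keeps exactly one active channel after the merge, so the input stays a valid one-hot map.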
From the results of the experiments, we observed that training a perception module with the less important semantic labels remapped to other classes did not lead to any degradation in model performance. The car could still be controlled under this approach. A positive consequence is that the number of semantic classes to be labeled can be reduced, thereby possibly curtailing the human annotation effort.
- [1] A. Meyer, N. O. Salscheider, P. F. Orzechowski, and C. Stiller, “Deep Semantic Lane Segmentation for Mapless Driving,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
- [2] W. Wang and Z. Pan, “DSNet for Real-Time Driving Scene Semantic Segmentation,” arXiv preprint arXiv:1812.07049, 2018.
- [3] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving Policy Transfer via Modularity and Abstraction,” in Conference on Robot Learning (CoRL), 2018.
- [4] P. Wenzel, Q. Khan, D. Cremers, and L. Leal-Taixé, “Modular Vehicle Control for Transferring Semantic Information Between Weather Conditions Using GANs,” in Conference on Robot Learning (CoRL), 2018.
- [5] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An Open Urban Driving Simulator,” in Conference on Robot Learning (CoRL), 2017.
- [6] B.-k. Chen, C. Gong, and J. Yang, “Importance-Aware Semantic Segmentation for Autonomous Driving System,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017.
- [7] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
- [8] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization,” in IEEE International Conference on Computer Vision (ICCV), 2017.