As part of autonomous car driving systems, semantic segmentation is an essential component to obtain a full understanding of the car's environment. One difficulty that occurs while training neural networks for this purpose is class imbalance of training data. Consequently, a neural network trained on unbalanced data in combination with maximum a-posteriori classification may easily ignore classes that are rare in terms of their frequency in the dataset. However, these classes are often of highest interest. We approach such potential misclassifications by weighting the posterior class probabilities with the inverse prior class probabilities, which in our case are the inverse frequencies of the corresponding classes in the training dataset. More precisely, we adopt a localized method by computing the priors pixel-wise such that the impact can be analyzed at pixel level as well. In our experiments, we train one network from scratch using a proprietary dataset containing 20,000 annotated frames of video sequences recorded from street scenes. The evaluation on our test set shows an increase of average recall with regard to instances of pedestrians and info signs by 25% and 23.4%, respectively. In addition, we significantly reduce the non-detection rate for instances of the same classes by 61% and 38%.
A common issue with “real world” datasets is the imbalance of observed object classes. Class imbalance in datasets can have a detrimental effect on the classification performance of neural networks (NNs) trained on such datasets, see also Buda17. Methods for overcoming class imbalance can be divided into two main categories Small17. The first category comprises sampling-based methods that operate directly on a dataset with the aim of balancing its class distribution. Oversampling and undersampling strategies have been proposed in Small17; Buda17; More16. In their basic versions, the dataset is balanced by increasing the number of instances from “minority” classes and by decreasing the number of instances from “majority” classes, respectively. A more advanced method called SMOTE Bowyer11 combines these two approaches, resulting in an additional boost in classification performance.
The second category comprises algorithm-based methods Small17; Buda17; Lopez13. They make use of cost-based training and decision thresholding. The idea behind these strategies is to assign different costs to classification mistakes for different classes. Accordingly, one possibility is to minimize the misclassification cost instead of the standard loss function Kukar98 during training. This would, however, bias the softmax probability output of the NN. The other possibility is to make class predictions cost-sensitive during the inference phase, after the network is fully trained, by moving the output threshold towards inexpensive classes, see zhou06.
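The cost-sensitive prediction at inference described above can be illustrated with a small sketch. It is not the method of zhou06; it only shows the general idea of choosing the class with the smallest expected misclassification cost. The function name, the cost matrices and the probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

def min_expected_cost_class(posterior, cost):
    """Pick the class with the smallest expected misclassification cost.

    posterior: shape (C,), class probabilities for one sample.
    cost: shape (C, C), cost[k, y] is the cost of predicting k when the
          true class is y (hypothetical values, for illustration only).
    """
    # Entry k of the product is sum_y cost[k, y] * posterior[y].
    expected_cost = cost @ posterior
    return int(np.argmin(expected_cost))

posterior = np.array([0.7, 0.3])      # class 0 plays the role of the majority class
symmetric = np.array([[0.0, 1.0],
                      [1.0, 0.0]])    # every mistake costs the same
asymmetric = np.array([[0.0, 5.0],    # overlooking class 1 is five times as costly
                       [1.0, 0.0]])

pred_symmetric = min_expected_cost_class(posterior, symmetric)
pred_asymmetric = min_expected_cost_class(posterior, asymmetric)
print(pred_symmetric, pred_asymmetric)
```

With symmetric costs the more probable class is chosen; raising the cost of overlooking the rare class moves the decision towards that class without retraining.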
In this work we deal with a proprietary semantic segmentation dataset of the Volkswagen Group and convolutional neural networks (CNNs), more precisely a Full Resolution Residual Network (FRRN) Pohlen16, trained on this dataset. In contrast to the publicly available Cityscapes dataset cityscapes16, our dataset also contains scenes from highways and country roads. Consequently, classes like humans and traffic signs are underrepresented. Using a CNN trained for this task, the semantic segmentation as a pixel-wise classification is obtained by the maximum a-posteriori probability (MAP), i.e., by applying the $\arg\max$ function to the pixel-wise softmax output, see figure 1. The CNN as a statistical model aims at minimizing the chance of a misclassification, which in decision theory is known as the Bayes rule. Another mathematically natural approach from decision theory is the Maximum Likelihood (ML) rule. While the MAP / Bayes rule incorporates a prior belief about the semantic classes, the maximum likelihood rule decides only by means of the observed features and chooses the most typical class for the given pattern. From now on, we use the abbreviations Bayes and ML to refer to these decision rules.
Our approach of using the ML decision rule is motivated by work that studies the influence of risk factors on heart diseases Fahrmeir-1996; kleinbaum78. Given a person’s features, a decision function is computed to determine whether a patient suffers from a heart disease or not. While the total number of falsely diagnosed patients is increased when using ML instead of Bayes, the number of falsely diagnosed patients who are actually ill is significantly reduced. This follows from the rare occurrence of patients with a heart disease, which the Bayes rule assumes as a prior belief. The ML rule determines the disease independently of this assumption.
Class balancing is not widely applied to CNNs for semantic segmentation. For instance, class balancing is taken into consideration in Fully Convolutional Networks (FCNs) Long16; however, the authors decide not to take action since the training data is only moderately unbalanced. Furthermore, in SegNet Badrinarayanan15 median frequency balancing is used, i.e., the weight assigned to each class is the corresponding inverse class frequency multiplied by the median of all class frequencies over the whole dataset. It has been shown empirically that using the computed weights in the loss function during training results in a sharp increase in class average accuracy and a slight increase in mean intersection over union, whereas the global accuracy decreases Badrinarayanan15.
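Median frequency balancing as summarized above can be written in a few lines. The sketch follows the description in this paragraph (median of all class frequencies divided by the frequency of the class) and is not taken from the SegNet code; the function name and the example frequencies are assumptions.

```python
import numpy as np

def median_frequency_weights(class_freq):
    """Weight per class: median of all class frequencies times the inverse class frequency."""
    class_freq = np.asarray(class_freq, dtype=float)
    return np.median(class_freq) / class_freq

# Hypothetical relative frequencies of three classes.
weights = median_frequency_weights([0.6, 0.3, 0.1])
print(weights)
```

The class with median frequency receives weight 1, more frequent classes are down-weighted and rarer classes are up-weighted.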
With our dataset we pursue the approach to cover a wide variety of everyday street scenes in a preferably unbiased fashion. Thus, we do not want to change the training procedure, i.e., we neither change the loss function used for training nor balance the dataset with respect to the classes. However, we cannot ignore the issue of class imbalance and propose to approach this problem by applying decision rules.
In this work, we analyze the impact of applying the ML instead of the Bayes decision rule for CNNs for semantic segmentation. In other words, the dataset and the training procedure remain unchanged and the decision rules are only interchanged at inference. In contrast to rottmann18, which deals with false positive predictions, our main focus is to reduce false negative predictions. The remainder of this work is structured as follows: In section 2 we explain decision rules in general and in section 3 we employ them in combination with neural networks. Next, we implement ML and evaluate its performance, in particular in comparison with Bayes, in section 4. For our experiments, we train one FRRN from scratch on our dataset called “DS20k” containing annotated images of traffic scenes in Europe. The DS20k dataset is highly unbalanced, especially with respect to the classes “person” and “info sign” that are significantly underrepresented compared to the remaining classes. This setting is very different from the setting in the Cityscapes dataset cityscapes16 where these two classes are naturally boosted due to all recorded images showing urban street scenes. Moreover, we apply ML at pixel level in order to handle class imbalance in a position-specific manner.
Discriminant analysis is a multivariate statistical analysis task. Given a population consisting of two or more pre-defined clusters, in which each element of the population belongs to exactly one cluster, one wishes to classify observed data into these distinct groups. Therefore, the objective of discriminant analysis is to find a function that discriminates objects of the population based on observable features, and to predict the object’s class affiliation from it.
Let $\Omega$ be a population consisting of $N \geq 2$ disjoint subsets $\Omega_1, \ldots, \Omega_N$. For each element $\omega \in \Omega$ we assume there exists one feature vector $x(\omega) \in \mathcal{X}$, where $\mathcal{X}$ denotes the feature space. Let $X$ and $Y$ be random variables for feature vector $x \in \mathcal{X}$ and class affiliation $y \in \{1, \ldots, N\}$, respectively. Then, we define a decision rule

$$d: \mathcal{X} \to \{1, \ldots, N\}, \quad x \mapsto d(x)$$

to be a function which assigns an element from the feature space to one class. We say $\hat{y} = d(x)$ is the predicted class for feature vector $x$. Furthermore, we describe the a-priori probability of a random object to belong to class $y$ as

$$p(y) = P(Y = y),$$

the a-posteriori probability of an object to belong to class $y$ given feature $x$ as

$$p(y \mid x) = P(Y = y \mid X = x),$$

and the conditional likelihood of an object of class $y$ having feature $x$ as

$$p(x \mid y) = P(X = x \mid Y = y).$$

Usually, none of these probabilities are known and they all need to be estimated. Assuming that this is already accomplished, we define the following decision rules:

The Bayes decision rule maps feature vectors to the class that gives the largest a-posteriori probability. Thus, the decision rule is defined by:

$$d_{\mathrm{Bayes}}(x) = \arg\max_{y \in \{1, \ldots, N\}} p(y \mid x).$$

On the contrary, the Maximum Likelihood decision rule maps feature vectors to the class with the largest conditional likelihood. Thus, the decision rule is defined by:

$$d_{\mathrm{ML}}(x) = \arg\max_{y \in \{1, \ldots, N\}} p(x \mid y) = \arg\max_{y \in \{1, \ldots, N\}} \frac{p(y \mid x)}{p(y)},$$

where the second equality follows from Bayes' theorem, since $p(x \mid y) = p(y \mid x) \, p(x) / p(y)$ and $p(x)$ does not depend on $y$.

In the latter, the class affiliation $y$ is an unknown parameter that needs to be estimated using the principle of maximum likelihood. The decision rule aims at finding the class for which the features are most typical (according to the observed features in the training set), independent of the a-priori probability of the particular class. The difference between these two decision rules lies in the adjustment with the prior class probabilities $p(y)$. Obviously, both decision rules are equal when the prior class distribution is balanced, i.e., $p(y) = 1/N$ for all $y \in \{1, \ldots, N\}$.
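The two decision rules can be compared on a toy example. This is a minimal sketch, assuming that estimated posterior and prior probabilities are already available as arrays; the function names and the numbers are illustrative assumptions. It uses the fact that the conditional likelihood is proportional to the posterior divided by the prior.

```python
import numpy as np

def bayes_rule(posterior):
    """Class with the largest a-posteriori probability."""
    return int(np.argmax(posterior))

def ml_rule(posterior, prior):
    """Class with the largest conditional likelihood.

    By Bayes' theorem the likelihood is proportional to posterior / prior,
    so dividing by the prior is enough to find the maximizing class.
    """
    return int(np.argmax(np.asarray(posterior) / np.asarray(prior)))

prior = np.array([0.95, 0.05])       # hypothetical: class 1 is rare
posterior = np.array([0.80, 0.20])   # hypothetical posterior for one sample

print(bayes_rule(posterior))                           # frequent class
print(ml_rule(posterior, prior))                       # rare class
print(ml_rule(posterior, np.array([0.5, 0.5])))        # balanced prior: same as Bayes
```

With the unbalanced prior the two rules disagree on this sample, while a balanced prior makes them coincide.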
Let $x \in \mathbb{R}^{h \times w}$ be a (gray-scaled) input image with resolution $h \times w$. In analogy with the previous section, the pixel-wise classification in semantic segmentation is then performed by the function $\hat{y}_{ij}(x) = \arg\max_{y} \hat{p}_{ij}(y \mid x)$ for the estimated class probabilities $\hat{p}_{ij}(y \mid x)$ for classes $y \in \{1, \ldots, N\}$, where $(i, j)$ corresponds to the pixel position in the input image and the values for $\hat{p}_{ij}(y \mid x)$ are obtained from the softmax output of a segmentation network. This procedure maximizes the overall probability for a correct class estimation which is equivalent to the Bayes rule in decision theory. As stated in Fahrmeir-1996, this decision rule is optimal for the symmetric cost function

$$c(\hat{y}, y) = \begin{cases} 0, & \hat{y} = y, \\ 1, & \hat{y} \neq y, \end{cases}$$

with $\hat{y}$ being the predicted class while $y$ being the target class. This function implies an equal class weighting, also weighting every confusion of two classes, i.e., each type of misclassification, equally. In contrast to that, the ML rule is optimal for the inverse proportional cost function

$$c(\hat{y}, y) = \begin{cases} 0, & \hat{y} = y, \\ 1 / p(y), & \hat{y} \neq y, \end{cases}$$

which increases the cost of a misclassification if the a-priori class probability $p(y)$ (in the following called prior) is low. Consequently, we need to first determine the class distribution of our dataset in order to apply ML instead of Bayes.
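The optimality claim for the ML rule can be checked in a few lines. The following is a sketch under the assumption that the inverse proportional cost function assigns the cost $1/p(y)$ to every misclassification of a sample with target class $y$ and the cost $0$ to a correct prediction; $\hat{y}$, $y$, $p(y)$, $p(y \mid x)$ and $p(x \mid y)$ denote predicted class, target class, prior, posterior and conditional likelihood.

```latex
\begin{align*}
\mathbb{E}\left[ c(\hat{y}, Y) \mid X = x \right]
  &= \sum_{y \neq \hat{y}} \frac{1}{p(y)} \, p(y \mid x)
   = \underbrace{\sum_{y} \frac{p(y \mid x)}{p(y)}}_{\text{independent of } \hat{y}}
     - \frac{p(\hat{y} \mid x)}{p(\hat{y})} , \\
\operatorname*{arg\,min}_{\hat{y}} \, \mathbb{E}\left[ c(\hat{y}, Y) \mid X = x \right]
  &= \operatorname*{arg\,max}_{\hat{y}} \, \frac{p(\hat{y} \mid x)}{p(\hat{y})}
   = \operatorname*{arg\,max}_{\hat{y}} \, \frac{p(x \mid \hat{y})}{p(x)}
   = \operatorname*{arg\,max}_{\hat{y}} \, p(x \mid \hat{y}) .
\end{align*}
```

Replacing the weights $1/p(y)$ by the constant $1$ gives the symmetric cost function, and the same argument then yields $\arg\max_{\hat{y}} p(\hat{y} \mid x)$, i.e., the Bayes rule.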
A statistical analysis of our dataset reveals an unbalanced class distribution in the training set that differs significantly from a uniform distribution, cf. figure 3.
For instance, the total number of pixels in DS20k belonging to the class PERSON is about two orders of magnitude smaller than the number of pixels belonging to the class DRIVEABLE. The confusion of these two classes would lead to possibly fatal situations and should be avoided, especially in the domain of near field perception. For the evaluation of the different decision rules, we compute priors and apply the decision rules at pixel level. Note that the segmentation network used in our experiments is a fully convolutional neural network which, on an input of infinite size, conserves the translation invariance of convolutional layers. For fully convolutional networks with a receptive field which is small compared to the dimensions of the image, an averaging of pixel-wise priors along the orbits of the translation group (adequately coarsened by pooling) would be adequate Cohen16. However, the receptive field of our network for a single output pixel covers a large part of the input image. Thus, almost all output pixels are affected by boundary effects, which enables them to guess their approximate location and justifies our decision for pixel-wise priors. For further discussion we refer to the appendix.
The priors are essential for the implementation of decision rules. We approximate them using the training set since our network is trained on these unbalanced data. Figure 3 shows the class distribution on full image level in the entire dataset. As there are image regions where it is more likely that a certain class appears, we are interested in the pixel-wise class distribution of the training dataset. From figure 5 we conclude that during training no pedestrians are seen in the upper part of the image. The network will thus be biased towards not predicting a person in that area, which might be wrong, e.g., when the street is ascending.
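A pixel-wise estimate of the priors and its use in the ML rule can be sketched as follows. This is an illustration under assumptions, not the code of our analysis tool: the priors are taken as relative class frequencies per pixel position over the training label maps, all images are assumed to have the same size, and the smoothing constant `eps` is a hypothetical choice that avoids division by zero for classes never seen at a position.

```python
import numpy as np

def pixelwise_priors(label_maps, num_classes, eps=1e-6):
    """Relative class frequency per pixel position.

    label_maps: integer array of shape (num_images, H, W).
    Returns an array of shape (num_classes, H, W) that sums to 1 over classes.
    """
    label_maps = np.asarray(label_maps)
    counts = np.stack(
        [(label_maps == c).sum(axis=0) for c in range(num_classes)]
    ).astype(float)
    counts += eps  # hypothetical smoothing, keeps every prior strictly positive
    return counts / counts.sum(axis=0, keepdims=True)

def ml_segmentation(softmax_output, priors):
    """Pixel-wise ML rule: arg max of softmax output divided by the local prior."""
    return np.argmax(softmax_output / priors, axis=0)

# Four tiny "training images" of size 1 x 2 with classes 0 (frequent) and 1 (rare).
label_maps = np.array([[[0, 0]], [[0, 0]], [[0, 1]], [[1, 1]]])
priors = pixelwise_priors(label_maps, num_classes=2)

# Hypothetical softmax output of shape (classes, H, W).
softmax_output = np.array([[[0.6, 0.6]],
                           [[0.4, 0.4]]])
bayes_mask = np.argmax(softmax_output, axis=0)
ml_mask = ml_segmentation(softmax_output, priors)
print(bayes_mask, ml_mask)
```

Both pixels receive the same softmax output, but the rare class is predicted by ML only at the position where its local prior is low. Very small priors amplify even small softmax values strongly, so the choice of `eps` matters in practice.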
We evaluate the performance of Bayes and ML on annotated test images from DS20k that were not used during training. The network we use in our experiments is a slightly modified version of the full-resolution residual network (FRRN) Pohlen16 which is a combination of SegNet Badrinarayanan15 and ResNet HeZRS15 and performs well with respect to recognition and localization. Unlike the original version of the network, we use dropout regularization JMLR14 instead of batch normalization. Furthermore, we modify the number of channels per layer.
We implement our FRRN with TensorFlow tensorflow2015-whitepaper and train the network by minimizing the negative log-likelihood loss function using the Adam optimizer Ruder16. Besides that, we train the network on annotated training images (resolution pixels) of traffic scenes with different semantic classes. We use a batch size of , resulting in a training time of day and hours for epochs with NVIDIA GeForce GTX Tis.
Before we start a quantitative analysis on the impact of ML we visualize the segmentations obtained by both decision rules in order to get a basic understanding of the differences, see figure 7.
At first glance we observe that there are no significant differences between both segmentations. Both decision rules produce the same output for most of the pixels, which indicates that the network is well-trained and predicts with high confidence. In particular, the predictions for frequent classes, such as road, nature, sky and buildings, reinforce this impression because they make up a large portion of the image. Therefore, the pixel positions where the decision rules produce different predictions are of special interest for us.
On closer inspection we observe that the Bayes and ML predictions differ remarkably often at the transition from one class to another. While Bayes prefers to predict the more frequent class at these pixel positions, ML does the opposite and prefers the less frequent class. Thus, ML enlarges the size of “minority” class objects compared to Bayes for the purpose of decreasing the risk of missing any pixels belonging to rare classes. Moreover, we observe that ML produces many (in terms of class affiliation) isolated false positive pixels which are boosted by the priors. For instance, in figure 7 ML frequently classifies scattered pixels as NONDRIVEABLE (pavement, traffic island,…) in the upper part of the image. Since the class NONDRIVEABLE has an extremely small a-priori probability in the upper part of the image, we see that even small posterior probabilities can result in the ratio of posterior to prior being dominant and thus in this class being predicted when using the ML decision rule. In order to reduce uncertainty, we employ Monte-Carlo dropout, see Gal:2016:DBA:3045390.3045502; DBLP:journals/corr/KendallBC15, at inference by computing different predictions under dropout and averaging the output probabilities.
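The averaging step of Monte-Carlo dropout can be sketched independently of the network. The snippet below is a toy illustration and not our FRRN: `toy_forward` is a hypothetical stand-in for one forward pass with dropout kept active, and the weight matrix, the keep probability and the number of samples are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mc_dropout_predict(stochastic_forward, x, num_samples, rng):
    """Average the softmax outputs of several forward passes with active dropout."""
    probs = [stochastic_forward(x, rng) for _ in range(num_samples)]
    return np.mean(probs, axis=0)

# Hypothetical linear "network" with 3 input features and 2 classes.
W = np.array([[2.0, -1.0],
              [0.5, 1.5],
              [-1.0, 0.5]])

def toy_forward(x, rng, keep_prob=0.5):
    mask = rng.random(x.shape) < keep_prob   # a fresh dropout mask per call
    dropped = x * mask / keep_prob           # inverted dropout scaling
    return softmax(dropped @ W)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.5])
mean_probs = mc_dropout_predict(toy_forward, x, num_samples=20, rng=rng)
print(mean_probs)
```

The averaged probabilities can then be passed to either decision rule in place of a single softmax output.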
Due to the occurrence of very local misclassifications, we add two post-processing steps: First, we discard all connected components of one class that contain fewer pixels than a fixed minimum size in our statistical computations. Second, we treat connected components of the same class that are separated by fewer pixels than a fixed maximum gap as one connected component.
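The two post-processing steps can be approximated with standard connected-component labeling. This is a sketch under assumptions, not our exact procedure: it uses `scipy.ndimage`, the parameters `min_size` and `merge_radius` are hypothetical, and the merging is approximated by dilating the class mask, so that components whose gap is at most about `2 * merge_radius` pixels end up with the same label.

```python
import numpy as np
from scipy import ndimage

def postprocess_components(segmentation, class_id, min_size, merge_radius):
    """Label components of one class, merging close ones and dropping small ones.

    Returns an integer array in which 0 marks discarded or other-class pixels and
    positive values identify the remaining (merged) components.
    """
    mask = segmentation == class_id
    # Dilate so that nearby components touch; skip for radius 0, because
    # binary_dilation repeats until convergence when iterations < 1.
    grown = ndimage.binary_dilation(mask, iterations=merge_radius) if merge_radius > 0 else mask
    grown_labels, _ = ndimage.label(grown)
    labels = np.where(mask, grown_labels, 0)  # keep labels only on original pixels
    sizes = np.bincount(labels.ravel())       # pixels per label, index 0 is background
    keep = sizes >= min_size
    keep[0] = False
    labels[~keep[labels]] = 0
    return labels

seg = np.zeros((5, 7), dtype=int)
seg[1, 0:2] = 1   # two pixels
seg[1, 3:5] = 1   # two pixels, separated from the first pair by a one-pixel gap
seg[4, 6] = 1     # isolated single pixel

merged = postprocess_components(seg, class_id=1, min_size=3, merge_radius=1)
unmerged = postprocess_components(seg, class_id=1, min_size=3, merge_radius=0)
print(merged)
```

With merging, the two nearby pairs count as one component of four pixels and survive the size filter, while the isolated pixel is discarded; without merging, every component is too small.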
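Let $n_{kl}$ be the number of pixels of class $k$ predicted to belong to class $l$. For the evaluation, we compute three different performance measures that are common in semantic segmentation:

$$\mathrm{precision}_k = \frac{n_{kk}}{\sum_{l} n_{lk}}, \qquad \mathrm{recall}_k = \frac{n_{kk}}{\sum_{l} n_{kl}}, \qquad \mathrm{IoU}_k = \frac{n_{kk}}{\sum_{l} n_{kl} + \sum_{l} n_{lk} - n_{kk}}.$$

Intersection over Union
We compare the segmentation performance of Bayes and ML, first in terms of mean intersection over union (mIoU) which is the average value over all classes for the intersection over union (IoU), i.e., $\mathrm{mIoU} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{IoU}_k$ with $N$ denoting the number of classes.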
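The IoU and mIoU can be computed from a confusion matrix whose entry in row k and column l counts the pixels of class k predicted as class l. The following is a minimal NumPy sketch with hypothetical function names and a toy example; classes that appear neither in the ground truth nor in the prediction are excluded from the mean.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Entry [k, l] counts pixels of ground truth class k predicted as class l."""
    idx = gt.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_per_class(n):
    """IoU_k = n_kk / (row sum k + column sum k - n_kk); NaN if the class is absent."""
    tp = np.diag(n).astype(float)
    union = n.sum(axis=1) + n.sum(axis=0) - tp
    return np.divide(tp, union, out=np.full_like(tp, np.nan), where=union > 0)

gt = np.array([[0, 0],
               [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])

n = confusion_matrix(gt, pred, num_classes=2)
iou = iou_per_class(n)
miou = np.nanmean(iou)
print(n, iou, miou)
```

In the toy example one pixel of class 0 is predicted as class 1, which lowers the IoU of both classes.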
Figure 9 shows that Bayes is superior to ML regarding the overall performance. In nearly all images the mIoU for Bayes is higher than for ML. Further averaging of the mIoU over all test images in figure 9 confirms this gap. This finding is not unexpected since Bayes maximizes the overall probability of correct class predictions. We also compare the IoUs for the rare classes PERSON and INFO SIGN (short hand: INFO). They show an even larger superiority of Bayes. In the subsequent paragraphs we show that this is indeed caused by an overproduction of false positives when using ML.
Since ML produces many additional false positives for rare classes, we would hope in this case to obtain fewer false negatives with ML compared to Bayes. By comparing the total number of connected components in the differently predicted segmentations and in the corresponding ground truth segmentation, see figure 11, we immediately notice a significant impact of ML: the number of connected components in ML segmentations exceeds the number of connected components in ground truth (GT) segmentations for both the PERSON and INFO classes. In contrast to this, Bayes overlooks a significant number of components.
Consequently, as we expect whole instances to be false positive in ML segmentations but also to recognize more of the rare objects compared to Bayes, we now analyze segment-wise precision and recall in more detail. For this purpose, we define that a (selection of) connected component(s) predicted by some decision rule is a correct object prediction, if there is a ground truth connected component of the same class with non-empty intersection.
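A segment-wise evaluation along these lines can be sketched with connected-component labeling. The snippet reflects one possible reading of the definition above and is an assumption, not our evaluation code: for every predicted component, precision is the fraction of its pixels that lie in the ground truth of the same class, and for every ground truth component, recall is the fraction of its pixels that are predicted as that class. A value of 0 marks an entirely false or an entirely overlooked segment.

```python
import numpy as np
from scipy import ndimage

def segmentwise_scores(gt_mask, pred_mask):
    """Per-segment precision (predicted components) and recall (ground truth components).

    gt_mask, pred_mask: boolean arrays marking the pixels of one class.
    """
    gt_labels, num_gt = ndimage.label(gt_mask)
    pred_labels, num_pred = ndimage.label(pred_mask)
    # Fraction of each predicted component covered by ground truth pixels.
    precision = np.array([gt_mask[pred_labels == j].mean() for j in range(1, num_pred + 1)])
    # Fraction of each ground truth component covered by predicted pixels.
    recall = np.array([pred_mask[gt_labels == i].mean() for i in range(1, num_gt + 1)])
    return precision, recall

gt_mask = np.array([[1, 1, 1, 0, 0, 1, 1, 0]], dtype=bool)
pred_mask = np.array([[0, 1, 1, 1, 0, 0, 0, 1]], dtype=bool)

precision, recall = segmentwise_scores(gt_mask, pred_mask)
print(precision, recall)
```

In the toy example the second predicted segment is a false positive (precision 0) and the second ground truth segment is not detected (recall 0).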
The empirical cumulative distribution functions (CDFs) of class PERSON for precision and recall can be found in figure 13. Let $F$ and $G$ be two CDFs. Then $F$ is stochastically dominated to first order by $G$ Pflug-et-al-1996, written $F \preceq G$, if

$$F(t) \geq G(t) \quad \text{for all } t \in \mathbb{R}.$$

In the following, we denote the CDFs of the Bayes decision rule regarding precision and recall by $F^{B}_{p}$ and $F^{B}_{r}$, respectively. Analogously, $F^{ML}_{p}$ and $F^{ML}_{r}$ refer to the ML decision rule.
As expected, we observe a clear advantage of Bayes in terms of precision since $F^{ML}_{p} \preceq F^{B}_{p}$. For any precision value $t \in [0, 1]$, in particular for low precision values, the frequency that one instance’s precision is below $t$ is significantly lower with Bayes than with ML. Hence, on average, Bayes predicts PERSON segments with better precision than ML.
In terms of recall, we observe the opposite behavior: $F^{B}_{r} \preceq F^{ML}_{r}$, i.e., ML is superior to Bayes in this metric. However, for both decision rules the frequency of non-detected segments, i.e., $F^{B}_{r}(0)$ and $F^{ML}_{r}(0)$, is quite high.
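First-order stochastic dominance between two empirical CDFs can be checked numerically on a grid. The sketch below uses made-up recall values per segment, not our results; the function names are hypothetical and the convention is that a CDF is dominated if it lies on or above the other one everywhere on the grid.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Fraction of samples that are less than or equal to each grid value."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, grid, side="right") / samples.size

def is_dominated_by(f_values, g_values):
    """True if F(t) >= G(t) on the whole grid, i.e., F is first-order dominated by G."""
    return bool(np.all(f_values >= g_values))

grid = np.linspace(0.0, 1.0, 11)
recall_bayes = [0.0, 0.0, 0.45, 0.85]   # hypothetical per-segment recall values
recall_ml = [0.0, 0.55, 0.75, 0.95]     # hypothetical per-segment recall values

cdf_bayes = empirical_cdf(recall_bayes, grid)
cdf_ml = empirical_cdf(recall_ml, grid)

print(is_dominated_by(cdf_bayes, cdf_ml))   # Bayes CDF dominated by ML CDF
print(cdf_bayes[0], cdf_ml[0])              # share of non-detected segments (recall 0)
```

The value of each CDF at 0 is the share of segments with zero recall, i.e., the non-detection rate in this toy example.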
Qualitatively, we observe very similar results for the INFO class (as with PERSON), see figure 13. In our studies of precision and recall we also observe that the findings from figure 7 hold statistically, i.e., we observe for rare classes that all predicted Bayes segments lie entirely inside of predicted ML segments. A graphical illustration of this is given in figure 14.
The benefit from applying ML instead of Bayes lies mainly with the reduction of non-detected ground truth objects. Therefore, it is reasonable to analyze the quantity of the latter, especially in relation to the amount of predicted false positive segments. Additionally, we analyze the false and non-detection frequencies in relation to the size of the predicted segment and the actual ground truth segment, respectively.
Figure 16 depicts the number of false positive PERSON segments depending on the size of the instance in the segmentation for ML and Bayes. In the left panel, there is a noticeable decrease in the frequency of false positive ML segments if the segment size increases, i.e., larger predicted segments are less likely to be entirely incorrect. For Bayes segments, the same tendency holds, even though not as strictly as for ML. Moreover, we see for every segment size bin that the amount of false positive ML segments considerably exceeds the amount of false positive Bayes segments. The right-hand panel of figure 16 shows the amount of Bayes false positives relative to the amount of ML false positives for different component sizes. For increasing component size there is also an increase in the relative amount of Bayes false positives.
Analogously, figure 16 shows the number of entirely non-detected objects depending on their size. We observe a similar behavior for Bayes and ML as for the false-detections, also with respect to the object size. For the non-detection of the class PERSON we find a clear advantage in favor of ML, independent of the object size. Bayes overlooks roughly twice as many objects as ML does. This result indicates an uncertainty of the network in finding the rare class PERSON which can be alleviated by using ML. Figure 18 visualizes the non-detection at pixel and at object level.
For the class INFO we observe an analogous behavior. We refer to the appendix for the analysis and corresponding figures.
In this work, we conducted an in-depth comparison of the ML and Bayes decision rules for a semantic segmentation network trained on an unbalanced dataset showing street scenes. In our tests we observe that ML is able to detect the rare classes PERSON and INFO more frequently than Bayes. Indeed, the pixels that Bayes and ML classify differently indicate that a less frequent class might be overlooked. We have seen that ML detects significantly more instances of rare classes in comparison to Bayes, but to the detriment of producing even substantially more false-detections, which makes ML not reliable for always predicting rare classes correctly. Apart from this, it is important to emphasize that ML only post-processes the softmax output of a neural network. This can be done simultaneously while applying the usual Bayes rule. In the end, we obtain two prediction masks and the additional ML mask is produced computationally nearly for free. What remains is to develop methods to draw plausible conclusions in order to combine both segmentations. Furthermore, the ML prediction can serve as an uncertainty mask revealing labeling mistakes of training data or indicating new unlabeled images of high prediction uncertainty which then can be annotated and included in the training process in the manner of active learning. We make the source code of our analysis tool publicly available on GitHub: https://github.com/robin-chan/decision-rules.
This work is funded in part by Volkswagen Group Research.
Investigating the number of false-detections as shown in the left panel of figure 20, we notice a staircase-like decrease of false positive ML and Bayes segments for increasing segment size. Moreover, in general there are also more false-detections with ML than with Bayes, which can be studied comprehensively in terms of the ratio of false-detections, see right panel of figure 20. On average, Bayes produces only a fraction of the false positives that ML produces. A slight increase of this fraction can be observed for larger component sizes.
The left panel of figure 20 shows that fewer info signs are entirely overlooked with ML than with Bayes. For both rules the frequency decreases for larger objects. Nevertheless, most objects of the considered class are rather small, which noticeably affects the frequency of non-detections. From the ratio in the right panel of figure 20, we conclude that Bayes fails to detect approximately every second info sign, while ML fails to detect approximately every third. Compared to the performance for class PERSON, ML does not outperform Bayes on detecting info signs as greatly as on detecting persons, but still provides a remarkable performance gain. A visualization of this benefit of ML is given in figure 22.
In section 3, we introduced a method that uses pixel-wise “local” priors in order to handle the class imbalance depending on the location in the image. However, one might argue that a fully convolutional segmentation network is translation invariant (up to some stride caused by pooling) and the choice of priors should take this into account. We justify our preference over non-localized “global” priors with the network’s large receptive field. To this end, we illustrate the impact regarding the non-detection of a PERSON instance when using local priors in comparison to a global prior, see figure 24, in particular when the local priors are lower than the global prior.
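For class $y$, let $p(y)$ be the global prior and let $p_{ij}(y)$ be the local prior at pixel position $(i, j)$. Then, we denote by

$$I^{+}_{y} = \{ (i, j) : p(y) \leq p_{ij}(y) \} \quad \text{and} \quad I^{-}_{y} = \{ (i, j) : p(y) > p_{ij}(y) \}$$

the set of pixel positions where the global prior is lower than or equal to the local prior and the set of pixel positions where the global prior is higher than the local prior, respectively.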
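The two sets of pixel positions can be obtained directly from a map of local priors. The sketch below assumes that all training images have the same size, in which case the global prior of a class equals the mean of its local priors over all pixel positions; the function name and the example values are hypothetical.

```python
import numpy as np

def prior_comparison_masks(local_prior):
    """Split pixel positions by comparing the global prior with the local prior.

    local_prior: array of shape (H, W) with the local priors of one class.
    Returns the global prior, the mask where global <= local, and the mask
    where global > local.
    """
    global_prior = float(local_prior.mean())  # assumption: equally sized images
    global_le_local = global_prior <= local_prior
    return global_prior, global_le_local, ~global_le_local

# Hypothetical local priors of one class on a 2 x 2 image.
local_prior = np.array([[0.0, 0.125],
                        [0.25, 0.625]])

global_prior, global_le_local, global_gt_local = prior_comparison_masks(local_prior)
print(global_prior)
print(global_le_local)
```

At positions in the second mask, dividing by the local prior boosts the class more strongly than dividing by the global prior would.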
For our test, we place one PERSON instance in the image, see figure 24 left image. Since in this region of the image the network rather expects to see an info sign and the local priors of class PERSON and INFO are of a similar magnitude, we cropped the image such that the PERSON instance is located in a region where the global prior exceeds the local prior in order to provoke a misclassification by using global priors, see figure 24 center and right image.
Figure 25 shows the segmentations produced by using the different priors. We observe that the person is entirely overlooked and predicted to be an info sign by using the global prior, while the person is nearly fully detected by using the local priors. Although this situation is artificially created, it illustrates the importance of using priors and, in particular, the positive effect of localized priors for images outside of the network’s learned experience.