A Network Structure to Explicitly Reduce Confusion Errors in Semantic Segmentation

08/01/2018 ∙ by Qichuan Geng, et al. ∙ Beihang University Baidu, Inc. 2

Confusing classes, which are ubiquitous in the real world, often degrade performance in many vision-related applications such as object detection, classification, and segmentation. Confusion errors are not only caused by similar visual patterns but are also amplified by various factors during the training of our designed models, such as reduced feature resolution in the encoding process or imbalanced data distributions. A large number of deep-learning-based network structures have been proposed in recent years to deal with these individual factors and improve network performance. However, to our knowledge, no existing work in semantic image segmentation is designed to tackle confusion errors explicitly. In this paper, we present a novel and general network structure that reduces confusion errors in a more direct manner, and we apply the network to semantic segmentation. Our network structure makes two major contributions: 1) We ensemble subnets with heterogeneous output spaces based on the discriminative confusing groups. Training each subnet distinguishes the confusing classes within its group without affecting unrelated classes outside the group. 2) We propose an improved cross-entropy loss function that maximizes the probability assigned to the correct class and, at the same time, penalizes the probabilities assigned to the confusing classes. Our network structure is general and can be easily adapted to other networks to further reduce confusion errors. Without any changes to the feature encoder or post-processing steps, our experiments demonstrate consistent and significant improvements over different baseline models on the Cityscapes and PASCAL VOC datasets (e.g., 3.05




1 Introduction

It is easy to find confusing classes that share similar visual patterns in the real world. For instance, in a street view, road and sidewalk can have similar colors. In an action video clip, hand clapping and boxing can share common motion patterns [27]. In general, it may be impossible to avoid confusion errors completely, even for human beings. However, confusion errors can be propagated and magnified throughout the training of models designed for various vision-related tasks. In this work, we focus mainly on the semantic segmentation task based on deep learning techniques. The approach could easily be extended to similar tasks such as object detection and image classification.

In semantic segmentation, a large number of network structures have been proposed recently to deal with individual factors that can generate confusion errors. These factors include imbalanced data distributions and reduced resolution in the feature encoding process. Re-sampling and re-weighting (i.e., cost-sensitive) strategies [4, 16, 21, 19, 2] are commonly applied to deal with imbalanced class distributions. However, performance is not always improved, because of negative factors such as the risk of over-fitting and the increased dataset complexity when more minority classes are added. For instance, in the Cityscapes dataset, over 90% of the annotations come from the six majority classes, including road, building, and vegetation. Less than 10% of the annotations come from the remaining 13 minority classes.

In order to tackle the reduced resolution problem caused by the feature encoder, a large number of deep networks have been proposed based on the Fully Convolutional Neural Network (FCN) [37, 28]. Image pyramids could be fed into the same model, with the feature maps fused together at the end [15, 13, 32, 26, 8, 6]. Multiple levels of decoders could be added to restore the details of feature maps [28, 30, 34, 1, 25, 33, 20, 41]. Novel convolution layers and pooling layers could be applied to capture multi-scale context information [6, 7, 9, 46]. Current trends in semantic segmentation use large kernels (e.g., in [31]) or multiple atrous convolutions with different rates (e.g., (6, 12, 18) in [7, 9]) to capture much richer context information. However, it remains unclear how many different rates or kernel sizes should be selected and whether the selected ones are optimal to cover a large range of object sizes. In street views, object sizes can differ significantly: a passing truck could be as tall as the image, while a traffic light may cover only dozens of pixels. As a result, even the large kernel mentioned in [31] may not be big enough to cover an object like a truck and capture enough context information.

In this paper, we propose a novel network structure that aims to reduce semantic confusion errors explicitly. Compared with existing methods, our network structure is able to deal with all the factors in a more direct manner. Moreover, this structure is general and can be easily integrated into existing networks to further improve performance. Specifically, our proposed network structure makes the following two contributions.

  • We propose a method to build and ensemble multiple subnets with heterogeneous output spaces. These subnets are built based on the discriminative confusing groups inferred from the normalized confusion matrix. Each subnet is aimed to enlarge the distances among the confusing classes within each confusing group without affecting unrelated classes outside the group.

  • We propose an improved multi-class cross-entropy loss that considers both correct and incorrect labels. By adding a new term for the incorrect labels, both false negatives and false positives that are often caused by confusing classes are penalized directly. A re-weighting based on the confusion matrix is also applied for the new loss to further strengthen the penalization.

In Section 2, we introduce the related work in semantic image segmentation. Section 3 describes our network structure and new loss function in detail, together with an analysis of the loss function based on information theory. In Section 4, we provide a set of experiments on the Cityscapes and PASCAL VOC datasets. We conclude in Section 5.

2 Related Work

As mentioned in Section 1, confusion can be magnified by various factors during network training. We divide these factors into two categories: imbalanced data distribution and reduced feature resolution. In this section, we mainly describe related work that deals with these two categories.

Imbalanced Data Distribution. One method is to over-sample the minority classes and/or under-sample the majority classes. As this strategy changes the data distribution, over-sampling may result in over-fitting and under-sampling may remove potentially valuable information. SMOTE and its variants [4, 16, 21] have been proposed to avoid over-fitting by generating new non-replicated examples. Another direction, the re-weighting method, imposes additional penalties on the minority classes without changing the data distribution. For instance, inverse frequency and median frequency re-weighting [5, 29, 44, 45, 13] have been applied in semantic segmentation works. In [40], online hard example mining (OHEM) is proposed to automatically select hard examples for training region-based ConvNet detectors. Huang et al. [19] formulate a new quintuplet sampling method and the triple-header loss for large-scale imbalanced classification. A loss max-pooling layer [2] defines a new loss function that takes the highest loss from a pixel-level weighting function. This loss function obtains performance gains over many minority classes in the Cityscapes dataset (the performance on one minority class, “truck”, degrades for unknown reasons).
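The re-weighting idea mentioned above can be illustrated with a minimal sketch. It assumes that class frequencies are computed as each class's share of all labeled pixels, which is a simplification of the schemes used in the cited works, and the pixel counts are hypothetical.

```python
from statistics import median

def class_frequencies(pixel_counts):
    """Share of all labeled pixels that belongs to each class."""
    total = sum(pixel_counts.values())
    return {name: count / total for name, count in pixel_counts.items()}

def inverse_frequency_weights(pixel_counts):
    """Weight each class by the inverse of its pixel frequency."""
    freqs = class_frequencies(pixel_counts)
    return {name: 1.0 / freq for name, freq in freqs.items()}

def median_frequency_weights(pixel_counts):
    """Weight each class by (median frequency) / (class frequency)."""
    freqs = class_frequencies(pixel_counts)
    med = median(freqs.values())
    return {name: med / freq for name, freq in freqs.items()}

# Hypothetical pixel counts, for illustration only.
counts = {"road": 6000, "building": 3000, "truck": 1000}
weights = median_frequency_weights(counts)
inverse_weights = inverse_frequency_weights(counts)
```

With these counts the rare class ("truck") receives the largest weight and the dominant class ("road") a weight below one, without any change to the data distribution itself.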

Reduced Feature Resolution. During feature encoding, the resolutions of feature maps are gradually reduced in order to capture long-range information that is less sensitive to transformations of the input image. However, details of the context are gradually compressed or lost in this encoding process. Long et al. proposed the Fully Convolutional Network (FCN) for semantic segmentation in [28], which converts the fully-connected layers into convolutional layers in order to generate a spatial label map directly. Based on this structure, a number of deep networks have been proposed. There are several main directions for improving the FCN. 1) Decoders could be added to gradually restore the context details. DeconvNet and SegNet [30, 1] apply the inverse pooling layers to build glass-like networks to upsample feature maps. 2) The dilated convolution, also called atrous convolution [6], could be used to generate feature maps with higher resolutions during the encoding process. Due to limited GPU memory and other reasons, we still need to downsample the feature maps (typically ) in many networks. 3) Recent works focus on capturing multi-scale context information. In [31], Peng et al. integrate global convolutional networks (GCN) into different levels of feature maps and apply deconvolution operations to restore high-resolution label maps. The large kernel (e.g., ) used in the GCN enlarges the valid receptive field significantly. Chen et al. proposed the atrous spatial pyramid pooling (ASPP) module in [7], which arranges atrous convolution operations in parallel with different atrous rates to obtain multi-scale context information. This module is further combined with a simple decoder module in [9]. 4) Other context modules (e.g., Conditional Random Fields (CRF)) could also be used as a post-processing step or jointly trained with deep networks [23, 47, 26, 36].
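To make the effect of the atrous rate concrete, the sketch below computes the effective extent of a dilated kernel and applies a one-dimensional atrous filter. The kernel size of 3 is an assumption made for illustration, and the 1-D setting is a simplification of the 2-D layers used in the cited networks.

```python
def effective_kernel_size(kernel_size, rate):
    """Extent covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

def atrous_conv1d(signal, kernel, rate):
    """Valid 1-D atrous (dilated) correlation, without padding."""
    span = effective_kernel_size(len(kernel), rate)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Each kernel tap reads the input `rate` positions after the previous one.
        outputs.append(sum(w * signal[start + j * rate] for j, w in enumerate(kernel)))
    return outputs

# Effective extents of an assumed 3-tap kernel at the rates (6, 12, 18).
extents = [effective_kernel_size(3, rate) for rate in (6, 12, 18)]

# A rate-2 difference filter on a short ramp signal.
response = atrous_conv1d([1, 2, 3, 4, 5], [1, 0, -1], rate=2)
```

The extent grows linearly with the rate while the number of weights stays fixed, which is why several rates are combined to cover objects of very different sizes.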

Batch Normalization. The batch normalization layer is commonly used in semantic segmentation. As the importance of this layer has been recognized recently, we briefly introduce the work in this area, although it is not directly connected with our contributions. Batch normalization parameters have been added to the ASPP module and found to be important during training [7, 9]. The strategy is to compute batch normalization parameters with a larger batch size and smaller feature maps (e.g., downsampling rate) and then freeze the parameters when using larger feature maps (e.g., downsampling rate). In [3], in-place activated batch normalization (INPLACE-ABN) has been proposed to reduce training memory so that the batch size can be increased and the batch normalization statistics become more accurate. This novel batch normalization layer boosts the performance of the ResNet-38 model from 78.08% to 79.40% without other modifications to the network. Currently, INPLACE-ABN ranks first in the Cityscapes benchmark.

In this section, we have described advanced deep networks that have greatly boosted performance in semantic image segmentation. So far, to our knowledge, no existing work in this area has explored the reduction of confusion errors explicitly as proposed in this paper.

3 Our Approach

The overall network structure is shown in Figure 1. Our subnets and loss layers can be easily integrated into most existing network structures. We first separate the original network into two parts. The main part is used as the feature encoder. One or several convolutional layers and the related batch normalization or activation layers are used as our first subnet (i.e., subnet 0 in Figure 1). The number of remaining subnets () is determined by the number of confusing groups. This number is generally not large, such as three for the Cityscapes dataset, which contains 19 classes. Moreover, for a very complex dataset, we can use a threshold to reduce the number so that the selected subnets focus on the more confusing groups. Each subnet is trained separately for its confusing group. After training, the heterogeneous output scores are transformed and fused together to obtain the final probabilities or scores. Note that the structure of each subnet can be adjusted for a specific confusing group. For example, we could use more or fewer convolution layers, different atrous rates, and concatenated feature maps from different ResNet blocks.

Figure 1: Overview of our proposed network structure that contains subnets.

3.1 Discriminative Confusing Groups

From the perspective of ensemble learning, we can view both the fusion of subnets in our network and the ASPP module in [7, 9] as ensemble classifiers. Each subnet can be considered a classifier in our network, and each atrous convolution operation followed by batch normalization in the ASPP module can also be viewed as a classifier. The major difference is that the ASPP module takes all the classes into consideration, although it uses different atrous rates. It remains unclear whether the individual classifiers are diverse enough and how many component classifiers should be included in the ensemble.

In Figure 2, we compare the normalized confusion matrices computed from pre-trained ResNet-101 and ResNet-38 models for the Cityscapes dataset. The two matrices share a very similar pattern. For example, the “wall” class is strongly related to the “building” class in both networks, although the misclassification rates are 18% and 10%, respectively. Given this pattern, we can divide all the classes into a few discriminative confusing groups such that the inter-group confusion errors are very small and can be neglected. As a result, the number of discriminative confusing groups might be a practical and appropriate ensemble size (ensemble cardinality).
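One way such groups could be inferred is sketched below. This is an illustrative procedure, not necessarily the one used in the experiments: it assumes that rows of the confusion matrix correspond to ground-truth classes, links two classes when the normalized misclassification rate in either direction reaches a hypothetical threshold, and reports the connected components. All matrix values are made up, apart from the 18% wall-to-building rate quoted above.

```python
def normalize_rows(confusion):
    """Divide each row (ground-truth class) by its total count."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

def confusing_groups(confusion, names, threshold=0.05):
    """Group classes linked by a misclassification rate >= threshold."""
    rates = normalize_rows(confusion)
    n = len(names)
    adjacent = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and rates[i][j] >= threshold:
                adjacent[i].add(j)
                adjacent[j].add(i)
    groups, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacent[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(component) > 1:  # isolated classes form no confusing group
            groups.append(sorted(names[k] for k in component))
    return groups

# Hypothetical counts; rows are ground truth, columns are predictions.
names = ["building", "wall", "car", "truck", "sky"]
confusion = [
    [90, 8, 1, 1, 0],
    [18, 80, 1, 1, 0],
    [1, 0, 92, 7, 0],
    [1, 1, 10, 88, 0],
    [1, 0, 0, 0, 99],
]
groups = confusing_groups(confusion, names, threshold=0.05)
```

Raising the threshold merges fewer classes and yields fewer groups, which corresponds to the option of limiting the number of subnets on a complex dataset.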

Figure 2: Two confusion matrices computed based on the pre-trained models of ResNet-101 (right) and ResNet-38 (left). The matrices share a very similar pattern that indicates the independent confusing groups.

3.2 Improved Cross-Entropy Loss

In information theory, the cross-entropy between the ground truth distribution $p$ (i.e., the one-hot label in classification) and the estimated distribution $q$ is given by

$$H(p, q) = -\sum_{i=1}^{K} p_i \log q_i \qquad (1)$$

where $K$ is the number of classes and $q_i = f_i(x)$ is the output of the softmax classifier for class $i$; $f$ represents the network and $x$ is the input image. This term can also be interpreted as the loss associated with the probability assigned to the correct class, without considering the relation between the correct class and the remaining classes, especially the confusing classes. In order to reduce the confusion with another, incorrect class, intuitively, we also need to reduce the probability assigned to the incorrect class. As a result, we can formulate a new loss given by


In the above equation, we treat the correct class and the remaining classes equally, so the confusing classes are still not taken into consideration. Hence, we weight the new loss using a weight matrix () that can be computed from the normalized confusion matrix. Equation 2 is then converted into


where and is used to balance the losses between correct classes and incorrect classes. The derivative of the loss function 3 with respect to is


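The following sketch illustrates the idea of this section under stated assumptions. It takes one plausible form of the weighted loss, namely the usual cross-entropy term for the correct class plus a balanced, weighted penalty of the form $-\log(1 - q_i)$ for every incorrect class $i$. The function names, the balancing coefficient `lam`, and the layout of `weights` (row: true class, column: incorrect class) are illustrative choices and are not taken from the equations of the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def improved_cross_entropy(logits, target, weights, lam=1.0, eps=1e-12):
    """Cross-entropy plus a weighted penalty on probabilities of incorrect classes.

    Assumed form: -log q[target] - lam * sum_{i != target} w[target][i] * log(1 - q[i]).
    """
    q = softmax(logits)
    correct_term = -math.log(q[target] + eps)
    incorrect_term = -sum(
        weights[target][i] * math.log(1.0 - q[i] + eps)
        for i in range(len(q))
        if i != target
    )
    return correct_term + lam * incorrect_term

# Hypothetical logits and weights; class 1 is the class most confused with class 0.
logits = [2.0, 1.0, 0.1]
uniform = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
confusion_aware = [[0.0, 2.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

plain_ce = improved_cross_entropy(logits, 0, uniform, lam=0.0)
equal_penalty = improved_cross_entropy(logits, 0, uniform, lam=1.0)
weighted_penalty = improved_cross_entropy(logits, 0, confusion_aware, lam=1.0)
```

With `lam=0` the sketch reduces to the standard cross-entropy, and a larger weight on a confusing class increases the penalty for probability mass assigned to it.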
3.3 Fusion of Heterogeneous Output Spaces

As the subnets have heterogeneous output spaces (denoted as source output spaces), we first need to transform them into the target output space that contains all the labels; we can then apply traditional ensemble methods such as the sum rule or product rule [22]. There are a number of ways to perform the transformation, such as the regression model [38, 39] and the neural network [11]. In [39], the similarity preserving principle states that, for every pair of classes in the source output space, there is a similarity indicator so that , where is the transformation function from the source output space to the target output space.

Each subnet in our network has the classes that belong to the corresponding confusing group and one “others” class that includes all the classes outside the group. As the source output spaces have certain overlapping areas, we do not need to apply the regression model in [39, 38]. Instead, we apply a straightforward method, as illustrated in Figure 3. For the confusing classes in the group, there is a one-to-one mapping between the source output space and the target output space. For the “others” class, we build a one-to-many mapping from the source to all the remaining classes outside the group in the target output space. It is not difficult to prove that this simple transformation satisfies the similarity preserving principle.
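The mapping described above can be sketched as follows. The sketch assumes that the one-to-many mapping copies the “others” score to every class outside the group and that fusion uses the sum rule; both are illustrative choices, and the class names and probabilities are hypothetical.

```python
def to_target_space(source_probs, source_classes, target_classes, others="others"):
    """Map subnet scores to the full label set.

    Group classes map one-to-one; every class outside the group
    receives the score of the "others" class (one-to-many).
    """
    lookup = dict(zip(source_classes, source_probs))
    return [lookup.get(name, lookup[others]) for name in target_classes]

def fuse_sum_rule(score_lists):
    """Sum rule: add the per-class scores of all classifiers."""
    return [sum(scores) for scores in zip(*score_lists)]

def argmax_class(scores, classes):
    """Return the class name with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best]

# Hypothetical per-pixel outputs.
target_classes = ["road", "car", "truck", "person"]
subnet0_probs = [0.10, 0.40, 0.45, 0.05]   # subnet 0 covers all classes
group_classes = ["others", "car", "truck"]  # subnet for the (car, truck) group
group_probs = [0.10, 0.70, 0.20]

mapped = to_target_space(group_probs, group_classes, target_classes)
fused = fuse_sum_rule([subnet0_probs, mapped])
baseline_label = argmax_class(subnet0_probs, target_classes)
fused_label = argmax_class(fused, target_classes)
```

In this example subnet 0 alone narrowly prefers "truck", while the group subnet separates the two vehicle classes more confidently, so the fused prediction becomes "car".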

Figure 3: The transformation from source output space to target output space that satisfies similarity preserving principle in [39].

It is also possible to design another network to learn the transformation, which we leave as future work. In the experiments, we demonstrate that, even with our simple transformation and fusion, we can still achieve consistent improvements over the baseline model.

4 Experiments

We evaluate our network structure on the Cityscapes dataset [12] and the extended PASCAL VOC dataset [14] using the official MXNet tool [10]. Performance is evaluated based on the mean intersection-over-union (mIoU). During training, we use standard SGD [24] with momentum 0.9 and weight decay 0.0005. The initial learning rate is 0.002 and is updated with a linear schedule. Data augmentation, such as mean subtraction, random cropping, and random left-right flipping, is applied during training.
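For reference, the mIoU metric can be computed from a confusion matrix as in the sketch below. It assumes that rows are ground-truth classes and columns are predictions, and it skips classes that never occur; the 2×2 matrix is a hypothetical example.

```python
def per_class_iou(confusion):
    """IoU per class: TP / (TP + FP + FN), from a square confusion matrix."""
    n = len(confusion)
    ious = []
    for c in range(n):
        true_positive = confusion[c][c]
        false_negative = sum(confusion[c]) - true_positive
        false_positive = sum(confusion[r][c] for r in range(n)) - true_positive
        denominator = true_positive + false_positive + false_negative
        ious.append(true_positive / denominator if denominator else float("nan"))
    return ious

def mean_iou(confusion):
    """Average the IoU over classes that appear in the labels or predictions."""
    valid = [iou for iou in per_class_iou(confusion) if iou == iou]  # drop NaN
    return sum(valid) / len(valid)

# Hypothetical pixel counts for a two-class problem.
example = [
    [8, 2],
    [1, 9],
]
ious = per_class_iou(example)
miou = mean_iou(example)
```

Because each class contributes equally to the mean regardless of its pixel count, errors on small confusing classes affect mIoU as much as errors on large ones.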

4.1 Cityscapes

Cityscapes contains 24,998 street views collected in 50 cities. 5,000 images with resolution are finely annotated and the remaining 19,998 are coarsely annotated. The 5,000 finely annotated images are further divided into train, validation, and test sets that have 2,975, 500, and 1,525 images, respectively. 19 semantic object classes are used for evaluation.

Two baseline models, ResNet-101 and ResNet-38, are selected for the experiments on the Cityscapes dataset. The last 1000-way classification layer of the original ResNet-101 is removed. The feature stride is reduced from 32 to 8 for the semantic segmentation task by changing the convolution strides for block 3 and block 4. ResNet-101 is pre-trained on ImageNet and fine-tuned on Cityscapes for 100 epochs. In [43], an mIoU of 73.63% is reported on the Cityscapes validation set using ResNet-101. Our ResNet-101 obtains 74.70%, which is 1.07 percentage points higher. The ResNet-38 baseline model is the original released model in [43], with an mIoU of 78.08%.

We partition all the classes into three confusing groups and build three additional subnets. Table 1 gives us the details of these confusing groups. The “others” class in each subnet contains all the remaining classes that are unrelated to the confusing classes within the corresponding group. Figure 4 shows the structure of all the subnets. All the parameters in the feature encoder are fixed during the training of subnets.

subnet   | nClass | Class Name
subnet 0 | 19     | all the 19 classes
subnet 1 | 7      | others1, building, wall, fence, pole, traffic light, traffic sign
subnet 2 | 5      | others2, car, truck, bus, train
subnet 3 | 5      | others3, person, rider, motorcycle, bicycle
Table 1: Class names of three confusing groups on Cityscapes dataset.
Figure 4: The structure and shape information of four subnets used for Cityscapes dataset.

Improvement on ResNet-101. Table 3 shows the experiments using the subnets and the improved cross-entropy loss. Without using our new loss, we improve the mIoU by 0.85%. The improvement from our fusion method mainly concentrates on the confusing classes. The average gains on the three confusing groups are 0.45%, 1.57%, and 1.70%, respectively. In order to evaluate the new loss on its own, we use only subnet 0 and remove subnets 1 to 3. We could improve mIoU to 76.21% with . When the three subnets and the new loss are combined, we obtain an mIoU of 77.75%, which is a 3.05% improvement over the baseline. Figure 8 presents examples of visual results for some confusing classes (e.g., wall, pole, rider, truck, and fence).

Per-class Performance on ResNet-101. In Figure 5, we show the per-class performance of our approach versus the ResNet-101 baseline. The IoU values of 18 classes are improved compared with the baseline. Certain confusing classes, such as sidewalk, wall, fence, rider, truck, bus, and motorcycle, have IoU gains of over 3.5%. This result shows that our subnets and improved cross-entropy loss are effective in reducing confusion errors. In Figure 6, we further compare our per-class IoU gains with the per-class IoU gains from LMP [2]. Our IoU gains are larger than the gains from LMP for most classes. One possible reason is that LMP is mainly designed to reduce imbalanced data problems, whereas confusion errors can come from other factors as well. This indicates that it could be more beneficial and effective to handle confusion errors explicitly.

Figure 5: Comparison of per-class performance between our approach and the baseline.
Figure 6: Comparison of per-class performance gain between our approach and the LMP [2].

Comparison with Other Approaches.

Here we present a comparison between our approach and other approaches. In order to give a comprehensive comparison that includes baseline models, parameters, and mIoU gains, we mainly choose approaches that have been reported in published articles in recent years. We roughly partition these approaches into three categories based on the problems that the authors claim to resolve, i.e., “I” for imbalanced data, “F” for feature extraction (e.g., multi-scale context information), and “B” for improvements to batch normalization. Table 2 shows the comparison. As data augmentation has been applied in all the approaches in the table, this option is not shown.

Although the models in the table aim to resolve different problems, our improved mIoU and mIoU gain are comparable with those of existing approaches proposed in recent years. Notice that I-ABN [3] currently ranks first in the Cityscapes benchmark with an mIoU of 82.0%. If we only consider the modified batch normalization layer, the improved mIoU reported in that paper (77.58%, with a ResNeXt 101 baseline) is close to our improved mIoU (77.75%, with a ResNet 101 baseline).

Based on this table, we find that, in order to obtain the optimal segmentation performance on Cityscapes, it is necessary to integrate multiple techniques together, such as ASPP, I-ABN, and LMP. This tells us the network structure should be general and flexible to fit into other different structures. Our approach is general, which can be easily combined with many existing works to further boost performance.

category | model          | OS | baseline    | mIoU (baseline) | mIoU (improved) | mIoU gain
I        | LMP [2]        | 8  | DeepLabV2   | 73.63           | 75.06           | 1.43
I        | FCRNs [42]     | 8  | ResNet 101  | 68.58           | 71.16           | 2.58
I        | FCRNs [42]     | 8  | ResNet 152  | 69.69           | 71.51           | 1.82
F        | DeepLabV2 [6]  | 16 | ResNet 101  | 66.6            | 71.0            | 4.4
F        | DeepLabV2 [6]  | 16 | ResNet 101  | 66.6            | 71.4            | 4.8
F        | GCN [31]       | 8  | ResNet 152  | -               | 76.9            | -
F        | GCN [31]       | 8  | ResNet 152  | -               | 77.4            | -
F        | ResNet38 [43]  | 8  | ResNet 38   | -               | 77.86           | -
B        | I-ABN 101 [3]  | 8  | ResNeXt 101 | 74.42           | 77.58           | 3.16
B        | I-ABN 152 [3]  | 8  | I-ABN 101   | 77.58           | 78.49           | 0.91
         | Ours           | 8  | ResNet 101  | 74.70           | 77.75           | 3.05
Table 2: Comprehensive comparison with other approaches on the validation set of Cityscapes. The first column gives the category of the problem that the authors claim to resolve. OS means output stride, and “-” means that the number is not reported in the article. The ASPP, CRF, and MS (multi-scale input images used in testing) indicator columns are not shown.

Naturally, we would like to apply our approach to the existing state-of-the-art algorithms listed in Table 2. However, after careful examination, we found that the released versions of these algorithms are not sufficient to allow modifications; usually only the testing models are released. As a result, we choose to evaluate our approach on the released ResNet-38 model (its mIoU is 78.08%, which is slightly higher than the 77.86% reported in the paper). Without using our new loss, we improve the mIoU by 0.99%. With the new loss and subnet 0, we improve the mIoU to 79.38%, which is a 1.30% improvement over the released ResNet-38 model.

subnet | 0     | 0 - 3 | 0      | 0 - 3
loss   | CE    | CE    | New CE | New CE
mIoU   | 74.70 | 75.55 | 76.21  | 77.75
Table 3: Improvement on ResNet-101.

subnet | 0     | 0 - 3 | 0
loss   | CE    | CE    | New CE
mIoU   | 78.08 | 79.07 | 79.38
Table 4: Improvement on ResNet-38.

4.2 PASCAL VOC 2012

PASCAL VOC 2012 has 1,464 images for training, 1,449 images for validation, and 1,456 images for testing. 21 object classes including the “background” class are annotated. We also use the Semantic Boundaries dataset [17] as the auxiliary dataset, resulting in 10,582 images for training.

ResNet-101 is selected for the experiments on the PASCAL VOC dataset. The structure of ResNet-101 is the same as the one used for the evaluation on the Cityscapes dataset. Similarly, ResNet-101 is pre-trained on ImageNet and fine-tuned on the VOC dataset for 80 epochs. In [43], an mIoU of 75.35% is reported with ResNet-101 on the PASCAL VOC validation set. Our ResNet-101 obtains 75.43%, which is slightly higher.

Based on the confusion matrix shown in the left of Figure 7, we find that a number of classes are confused with the “background” class. Therefore, only one subnet is added for this confusing group (i.e., others, background, chair, dining table, potted plant, and sofa). The structure of our network is shown in the right of Figure 7.

Figure 7: (left) The confusion matrix for PASCAL VOC dataset. (right) The structure and shape information of our subnets used for PASCAL VOC dataset.

Improvement on ResNet-101. Table 5 shows the experiments using the subnets and improved cross-entropy loss. mIoU is increased to 75.51% when subnet 1 is used. mIoU is further increased to 76.91% when the improved cross-entropy loss is applied. Some visual results are shown in Figure 9.

Figure 8: Examples of semantic segmentation results on Cityscapes (cropped for visualization purpose). For every row, we list the cropped input image, ground truth label, label estimated from the baseline, label estimated from baseline + subnets, and label estimated from baseline + subnets + improved loss. Confusing classes for each column are (from left to right): (building, pole), (building, pole, fence), (rider, person), (truck, bus), (truck, building), and (fence, building).
Figure 9: Examples of semantic segmentation results on PASCAL VOC dataset. For every row, we list input image, ground truth label, error map for baseline, error map for baseline + subnets, error map for baseline + improved loss. Error maps in the last three rows represent the label inconsistencies, in which black color means labels are different and gray color means labels are the same.
subnet | 0     | 0 and 1 | 0 and 1
loss   | CE    | CE      | New CE
mIoU   | 75.43 | 76.51   | 76.91
Table 5: Improvement on PASCAL VOC dataset using ResNet-101.

5 Conclusion

In this paper, we present a novel network structure to reduce semantic confusion errors that can come from different factors. While most existing works are designed to deal with individual factors, our approach handles confusion errors in a more direct way. Our approach consists of two major components: 1) an ensemble of subnets with heterogeneous outputs, built from discriminative confusing groups estimated from the normalized confusion matrix; 2) an improved cross-entropy loss with a new term that penalizes both false negatives and false positives, which are often caused by confusing classes. Our experiments show that both components are effective and improve segmentation performance over different baseline models and on datasets of different complexities. More importantly, our approach is general and flexible, and it can easily fit into most existing network structures.