It is easy to find confusing classes that share similar visual patterns in the real world. For instance, in a street view, road and sidewalk can have very similar color appearances. In an action video clip, hand clapping and boxing can share common motion patterns. In general, it may be impossible to completely avoid confusion errors, even for human beings. However, confusion errors can be propagated and magnified throughout the training of models designed for various vision-related tasks. In this work, we mainly focus on the semantic segmentation task based on deep learning techniques; the extension to similar tasks such as object detection and image classification is straightforward.
In semantic segmentation, a large number of network structures have been proposed recently to deal with individual factors that can generate confusion errors, including imbalanced data distributions and reduced resolution in the feature encoding process. Re-sampling and re-weighting (i.e., cost-sensitive) strategies [4, 16, 21, 19, 2] are commonly applied to deal with imbalanced class distributions. However, performance is not always improved, due to negative factors such as the risk of over-fitting and the increased dataset complexity when more minority classes are added. For instance, in the Cityscapes dataset, over 90% of the annotations come from the six majority classes, including road, building, and vegetation; less than 10% come from the remaining 13 minority classes.
To tackle the reduced-resolution problem caused by the feature encoder, a large number of deep networks have been proposed based on the Fully Convolutional Network (FCN) [37, 28]. Image pyramids can be fed into the same model, with the feature maps fused at the end [15, 13, 32, 26, 8, 6]. Multiple levels of decoders can be added to restore the details of the feature maps [28, 30, 34, 1, 25, 33, 20, 41]. Novel convolution and pooling layers can be applied to capture multi-scale context information [6, 7, 9, 46]. Current trends in semantic segmentation use large kernels (e.g., in ) or multiple atrous convolutions with different rates (e.g., (6, 12, 18) in [7, 9]) to capture much richer context information. However, it remains unclear how many different rates or kernel sizes should be selected and whether the selected ones are optimal to cover a large range of object sizes. In street views, object sizes can differ significantly, e.g., a passing truck may span the full height of the image while a traffic light may cover only dozens of pixels. As a result, even the large kernel mentioned in  may not be big enough to cover an object such as a truck and capture enough context information.
In this paper, we propose a novel network structure aimed at reducing semantic confusion errors explicitly. Compared with existing methods, our network structure deals with all of these factors in a more direct manner. Moreover, the structure is general and can be easily integrated into existing networks to further improve performance. Specifically, our proposed network structure makes the following two contributions.
We propose a method to build and ensemble multiple subnets with heterogeneous output spaces. These subnets are built based on the discriminative confusing groups inferred from the normalized confusion matrix. Each subnet is aimed to enlarge the distances among the confusing classes within each confusing group without affecting unrelated classes outside the group.
We propose an improved multi-class cross-entropy loss that considers both correct and incorrect labels. By adding a new term for the incorrect labels, both false negatives and false positives that are often caused by confusing classes are penalized directly. A re-weighting based on the confusion matrix is also applied for the new loss to further strengthen the penalization.
In Section 2, we introduce related work in semantic image segmentation. Section 3 describes our network structure and new loss function in detail, together with an analysis of the loss function based on information theory. In Section 4, we provide a set of experiments on the Cityscapes and PASCAL VOC datasets. Section 5 concludes the paper.
2 Related Work
As mentioned in Section 1, confusion can be magnified by various factors during network training. We divide these factors into two categories: imbalanced data distribution and reduced feature resolution. In this section, we mainly describe related work dealing with these two categories.
Imbalanced Data Distribution. One method is to over-sample the minority classes and/or under-sample the majority classes. As this strategy changes the data distribution, over-sampling may cause over-fitting while under-sampling may discard valuable information. SMOTE and its variants [4, 16, 21] have been proposed to avoid over-fitting by generating new, non-replicated examples. Another direction, re-weighting, imposes additional penalties on the minority classes without changing the data distribution. For instance, inverse frequency and median frequency re-weighting [5, 29, 44, 45, 13] have been applied in semantic segmentation works. In , online hard example mining (OHEM) is proposed to automatically select hard examples for training region-based ConvNet detectors. Huang et al.
formulate a new quintuplet sampling method and a triple-header loss for large-scale imbalanced classification. A loss max-pooling layer defines a new loss function that takes the highest losses from a pixel-level weighting function. This loss function obtains performance gains on many minority classes in the Cityscapes dataset (although the performance on one minority class, "truck", degrades for unknown reasons).
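As a concrete sketch of the median-frequency re-weighting mentioned above, class weights can be computed from per-class pixel counts so that minority classes receive larger penalties. The counts and the function name below are illustrative, not taken from any of the cited works:

```python
import numpy as np

def median_frequency_weights(pixel_counts):
    """Median-frequency balancing: w_c = median(freq) / freq_c,
    where freq_c is the fraction of pixels labeled with class c.
    Minority classes (low frequency) receive weights > 1."""
    freq = pixel_counts / pixel_counts.sum()
    return np.median(freq) / freq

# Toy counts: one dominant majority class and two minority classes.
counts = np.array([900_000.0, 80_000.0, 20_000.0])
w = median_frequency_weights(counts)
assert abs(w[1] - 1.0) < 1e-9   # the median-frequency class gets weight 1
assert w[2] > w[1] > w[0]       # rarer classes get larger weights
```

The resulting weights would multiply the per-pixel loss for each class during training.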
Reduced Feature Resolution. During feature encoding, the resolutions of feature maps are gradually reduced in order to capture long-range information that is less sensitive to input image transformations. However, context details are gradually compressed or lost in this process. Long et al. proposed the Fully Convolutional Network (FCN) for semantic segmentation in , converting fully-connected layers into convolutional layers in order to generate spatial label maps directly. Based on this structure, a number of deep networks have been proposed, mainly along four directions. 1) Decoders can be added to gradually restore context details: DeconvNet and SegNet [30, 1] apply inverse pooling layers to build hourglass-like networks that upsample feature maps. 2) Dilated convolution, also called atrous convolution , can be used to generate feature maps with higher resolutions during encoding; due to limited GPU memory and other reasons, the feature maps still need to be downsampled in many networks. 3) Recent works focus on capturing multi-scale context information. In , Peng et al. integrate global convolutional networks (GCN) into different levels of feature maps and apply deconvolution operations to restore high-resolution label maps; the large kernel used in the GCN enlarges the valid receptive field significantly. Chen et al. proposed the atrous spatial pyramid pooling (ASPP) module in , which arranges atrous convolution operations in parallel with different atrous rates to obtain multi-scale context information; this module is further combined with a simple decoder module in . 4) Other context modules (e.g., Conditional Random Fields (CRF)) can also be used as a post-processing step or jointly trained with deep networks [23, 47, 26, 36].
Batch Normalization. The batch normalization layer is commonly used in semantic segmentation. As the importance of this layer has been discovered only recently, we briefly introduce work in this area, although it is not directly connected with our contributions. Batch normalization parameters have been added to the ASPP module and found to be important during training [7, 9]. The strategy is to compute the batch normalization statistics with a larger batch size on smaller feature maps, and then freeze these parameters when training with larger feature maps. In , an in-place activated batch normalization (INPLACE-ABN) has been proposed to reduce training memory so that the batch size can be increased and the batch normalization statistics become more accurate. This novel batch normalization layer boosts the performance of the ResNet-38 model from 78.08% to 79.40% without any other modification of the network. Currently, INPLACE-ABN ranks first in the Cityscapes benchmark.
In this section, we described advanced deep networks that have greatly boosted performance in semantic image segmentation. To our knowledge, none of the existing works in this area has explicitly explored the reduction of confusion errors as proposed in this paper.
3 Our Approach
The overall network structure is shown in Figure 1. Our subnets and loss layers can be easily integrated into most existing network structures. We first separate the original network into two parts. The main part is used as the feature encoder. One or several convolutional layers, with the related batch normalization or activation layers, are used as our first subnet (i.e., subnet 0 in Figure 1). The number of remaining subnets is determined by the number of confusing groups. This number is generally small, e.g., three for the Cityscapes dataset, which contains 19 classes. Moreover, for a very complex dataset, we can use a threshold to reduce the number so that the selected subnets focus on the most confusing groups. Each subnet is trained separately for each confusing group. After training, the heterogeneous output scores are transformed and fused together to obtain the final probabilities or scores. Note that the structure of each subnet can be adjusted for a specific confusing group; for example, we could use more or fewer convolution layers, different atrous rates, or concatenated feature maps from different ResNet blocks.
3.1 Discriminative Confusing Groups
Our subnets can be viewed as ensemble classifiers. Each subnet can be considered a classifier in our network, and each atrous convolution operation followed by batch normalization in the ASPP module can also be viewed as a classifier. The major difference is that the ASPP module takes all classes into consideration even though it uses different atrous rates, and it remains unclear whether such individual classifiers are diverse enough and how many component classifiers should be included in the ensemble.
In Figure 2, we compare the normalized confusion matrices computed from pre-trained ResNet-101 and ResNet-38 models on the Cityscapes dataset. The computed matrices share a very similar pattern; for example, the "wall" class is strongly confused with the "building" class in both networks, although the misclassification rates are 18% and 10%, respectively. Given this pattern, we can divide all the classes into a few discriminative confusing groups in which the inter-group confusion errors are very small and can be neglected. As a result, the number of discriminative confusing groups might be a practical and appropriate ensemble size (or ensemble cardinality).
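As an illustration, grouping classes from a normalized confusion matrix can be sketched as follows. The threshold value and the connected-component criterion are our assumptions; the paper does not specify an exact grouping rule:

```python
import numpy as np

def confusing_groups(conf, thresh=0.05):
    """Partition classes into confusing groups given a normalized
    confusion matrix `conf` (rows sum to 1). Two classes are linked
    when either direction of confusion exceeds `thresh`; groups are
    the connected components of this symmetric relation."""
    n = conf.shape[0]
    adj = (conf > thresh) | (conf.T > thresh)
    np.fill_diagonal(adj, False)
    groups, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                       # BFS over the confusion graph
            c = stack.pop()
            if c in comp:
                continue
            comp.add(c)
            stack.extend(np.flatnonzero(adj[c]).tolist())
        seen |= comp
        groups.append(sorted(comp))
    # Singleton components are classes with no strong confusion;
    # they would stay in subnet 0 rather than get their own subnet.
    return [g for g in groups if len(g) > 1]

# Toy 4-class example: classes 0/1 confuse each other, as do 2/3.
conf = np.array([[0.90, 0.10, 0.00, 0.00],
                 [0.18, 0.82, 0.00, 0.00],
                 [0.00, 0.00, 0.85, 0.15],
                 [0.00, 0.00, 0.12, 0.88]])
assert confusing_groups(conf) == [[0, 1], [2, 3]]
```

Raising the threshold yields fewer, tighter groups, which matches the thresholding idea mentioned for complex datasets.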
3.2 Improved Cross-Entropy Loss
In information theory, the cross-entropy between the ground truth distribution y (i.e., the one-hot label in classification) and the estimated distribution p is given by

H(y, p) = -\sum_{i=1}^{C} y_i \log p_i,    (1)

where C is the number of classes and p = f(x) is the output of the softmax classifier; f represents the network and x is the input image. This term can be interpreted as the loss associated with the probability assigned to the correct class, without considering the relation between the correct class and the remaining classes, especially the confusing classes. In order to reduce the confusion with an incorrect class, intuitively, we also need to reduce the probability assigned to that incorrect class. As a result, we can formulate a new loss given by

L = -\sum_{i=1}^{C} y_i \log p_i - \sum_{i=1}^{C} (1 - y_i) \log(1 - p_i).    (2)
In Equation (2), we treat the correct class and the remaining classes equally, so confusing classes are still not taken into consideration. Hence, we weight the new loss using a weight matrix W that can be computed from the normalized confusion matrix. Equation (2) is then converted into

L = -\sum_{i=1}^{C} y_i \log p_i - \lambda \sum_{i=1}^{C} w_i (1 - y_i) \log(1 - p_i),    (3)

where w_i is the weight derived from W for class i and \lambda is used to balance the losses between correct and incorrect classes. The derivative of the loss function (3) with respect to p_i is

\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i} + \lambda \frac{w_i (1 - y_i)}{1 - p_i}.
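The loss described above (the usual correct-class term plus a weighted penalty on the probability assigned to incorrect classes) can be sketched as follows. The weight vector and λ value are illustrative, and the functional form is our reconstruction rather than a verbatim copy of the paper's equation:

```python
import numpy as np

def improved_ce(p, y, w, lam=0.5, eps=1e-12):
    """Cross-entropy on the correct class plus a confusion-weighted
    penalty -lam * sum_i w_i * (1 - y_i) * log(1 - p_i) on the
    probability mass assigned to incorrect classes."""
    p = np.clip(p, eps, 1.0 - eps)
    correct = -np.sum(y * np.log(p))
    incorrect = -lam * np.sum(w * (1.0 - y) * np.log(1.0 - p))
    return correct + incorrect

# Same correct-class probability, but one prediction puts the leftover
# mass on a heavily weighted confusing class (index 1).
y = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.1])          # class 1 is the confusing one
p_confused = np.array([0.6, 0.4, 0.0])
p_spread = np.array([0.6, 0.0, 0.4])
assert improved_ce(p_confused, y, w) > improved_ce(p_spread, y, w)
```

The standard cross-entropy alone would score both predictions identically; the new term penalizes the prediction that concentrates its error on a confusing class.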
3.3 Fusion of Heterogeneous Output Spaces
As the subnets have heterogeneous output spaces (denoted as source output spaces), we first need to transform them into the target output space that contains all the labels; then we can apply traditional ensemble methods such as the sum rule or the product rule. There are a number of ways to perform this transformation, such as regression models [38, 39] and neural networks . In , the similarity preserving principle states that the similarity indicator for every pair of classes in the source output space is preserved by the transformation function from the source output space to the target output space.
Each subnet in our network covers the classes belonging to the corresponding confusing group plus one "others" class that includes all the classes outside the group. As the source output spaces overlap with the target output space, we do not need to apply the regression models in [39, 38]. Instead, we apply the straightforward method illustrated in Figure 3. For the confusing classes in the group, there is a one-to-one mapping between the source and target output spaces. For the "others" class, we build a one-to-many mapping from the source class to all the remaining classes outside the group in the target output space. It is not difficult to prove that this simple transformation satisfies the similarity preserving principle.
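The transformation and sum-rule fusion described above can be sketched as follows, assuming each subnet outputs its "others" probability first (matching the class order in Table 1) and spreading that mass uniformly over the out-of-group classes; the uniform split is our assumption for the one-to-many mapping:

```python
import numpy as np

def to_target_space(p_src, group, n_classes):
    """Map a subnet's output (["others"] + group classes) into the
    full target space: one-to-one for in-group classes, one-to-many
    (uniform split, an assumption) for the "others" class."""
    p_tgt = np.zeros(n_classes)
    p_tgt[group] = p_src[1:]
    outside = [c for c in range(n_classes) if c not in group]
    p_tgt[outside] = p_src[0] / len(outside)
    return p_tgt

# Two toy subnets over 5 target classes, fused with the sum rule.
n = 5
subnet_outputs = [
    (np.array([0.1, 0.7, 0.2]), [0, 1]),   # others1 + group {0, 1}
    (np.array([0.2, 0.5, 0.3]), [2, 3]),   # others2 + group {2, 3}
]
fused = sum(to_target_space(p, g, n) for p, g in subnet_outputs)
fused /= fused.sum()
assert abs(fused.sum() - 1.0) < 1e-9
assert fused.argmax() == 0              # class 0 keeps the highest score
```
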
It is also possible to design another network to learn the transformation, which we leave for future work. In the experiments, we demonstrate that, even with this simple transformation and fusion, we still achieve consistent improvements over the baseline models.
4 Experiments
We evaluate our network structure on the Cityscapes dataset  and the extended PASCAL VOC dataset  using the official MXNet tool . Performance is evaluated with the mean intersection-over-union (mIoU). During training, we use standard SGD  with momentum 0.9 and weight decay 0.0005. The initial learning rate is 0.002 and is updated with a linear schedule. Data augmentation operations such as mean subtraction, random cropping, and random left-right flipping are applied during training.
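The linear learning-rate schedule above can be sketched as follows (the step granularity is illustrative):

```python
def linear_lr(base_lr, step, total_steps):
    """Linearly decay the learning rate from base_lr down to 0."""
    return base_lr * (1.0 - step / float(total_steps))

# Starts at the initial rate of 0.002 and halves at the midpoint.
assert abs(linear_lr(0.002, 0, 1000) - 0.002) < 1e-12
assert abs(linear_lr(0.002, 500, 1000) - 0.001) < 1e-12
```
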
4.1 Cityscapes
Cityscapes contains 24,998 street views collected in 50 cities. 5,000 images are finely annotated and the remaining 19,998 are coarsely annotated. The 5,000 finely annotated images are further divided into train, validation, and test sets of 2,975, 500, and 1,525 images, respectively. 19 semantic object classes are used for evaluation.
Two baseline models, ResNet-101 and ResNet-38, are selected for the experiments on the Cityscapes dataset. The last 1000-way classification layer of the original ResNet-101 is removed, and the feature stride is reduced from 32 to 8 for the semantic segmentation task by changing the convolution strides in block 3 and block 4. ResNet-101 is pre-trained on ImageNet and fine-tuned on Cityscapes for 100 epochs. In , a mIoU of 73.63% is reported on the Cityscapes validation set using ResNet-101; our ResNet-101 obtains 74.70%, which is 1.07% higher. The ResNet-38 baseline is the originally released model in , with a mIoU of 78.08%.
We partition all the classes into three confusing groups and build three additional subnets. Table 1 gives the details of these confusing groups. The "others" class in each subnet contains all the remaining classes that are unrelated to the confusing classes within the corresponding group. Figure 4 shows the structure of all the subnets. All the parameters in the feature encoder are fixed during the training of the subnets.
| subnet | #classes | classes |
| --- | --- | --- |
| subnet 0 | 19 | all the 19 classes |
| subnet 1 | 7 | others1, building, wall, fence, pole, traffic light, traffic sign |
| subnet 2 | 5 | others2, car, truck, bus, train |
| subnet 3 | 5 | others3, person, rider, motorcycle, bicycle |
Improvement on ResNet-101. Table 4 shows the experiments using the subnets and the improved cross-entropy loss. Without the new loss, the subnets alone improve the mIoU by 0.85%; the improvements from our fusion method mainly come from the confusing classes, with average gains of 0.45%, 1.57%, and 1.70% on the three confusing groups, respectively. To evaluate the new loss in isolation, we use only subnet 0, removing subnets 1 to 3; this improves the mIoU to 76.21%. When the three subnets and the new loss are combined, we obtain a mIoU of 77.75%, a 3.05% improvement over the baseline. Figure 8 presents examples of visual results for some confusing classes (e.g., wall, pole, rider, truck, and fence).
Per-class Performance on ResNet-101. In Figure 5, we compare the per-class performance of our approach with the ResNet-101 baseline. The IoU values of 18 classes are improved over the baseline, and certain confusing classes, such as sidewalk, wall, fence, rider, truck, bus, and motorcycle, have IoU gains over 3.5%. This result shows that our subnets and improved cross-entropy loss are effective in reducing confusion errors. In Figure 6, we further compare our per-class IoU gains with those of LMP . Our IoU gains for most classes are larger than those from LMP. One possible reason is that LMP is mainly designed to address the imbalanced data problem, whereas confusion errors can come from other factors beyond the imbalanced data distribution. This indicates that handling confusion errors explicitly can be more beneficial and effective.
Comparison with Other Approaches.
Here we present a comparison between our approach and other approaches. In order to provide a comprehensive comparison that includes baseline models, parameters, and mIoU gains, we mainly choose approaches that have been reported in recently published articles. We roughly partition these approaches into three categories based on the problems the authors claim to resolve: "I" for imbalanced data, "F" for feature extraction (e.g., multi-scale context information), and "B" for improvements on batch normalization. Table 2 shows the comparison. As data augmentation is applied in all of the approaches, this option is not shown in the table.
Although the models in the table aim to resolve different problems, our improved mIoU and mIoU gain are comparable with these recently proposed approaches. Notice that I-ABN  currently ranks first in the Cityscapes benchmark with a mIoU of 82.0%. If we only consider the modified batch normalization layer, the improved mIoU (77.58%) reported in that paper is close to our improved mIoU (77.75%), for the ResNeXt 101 and ResNet 101 baselines, respectively.
This table shows that, in order to obtain optimal segmentation performance on Cityscapes, it is necessary to integrate multiple techniques, such as ASPP, I-ABN, and LMP. This suggests that a network structure should be general and flexible enough to fit into other structures. Our approach has this property: it can be easily combined with many existing works to further boost performance.
| Type | Method | Stride | Baseline | Baseline mIoU | Improved mIoU | Gain |
| --- | --- | --- | --- | --- | --- | --- |
| I | FCRNs  | 8 | ResNet 101 | 68.58 | 71.16 | 2.58 |
| I | FCRNs  | 8 | ResNet 152 | 69.69 | 71.51 | 1.82 |
| F | DeepLabV2  | 16 | ResNet 101 | 66.6 | 71.0 | 4.4 |
| F | DeepLabV2  | 16 | ResNet 101 | 66.6 | 71.4 | 4.8 |
| F | GCN  | 8 | ResNet 152 | - | 76.9 | - |
| F | GCN  | 8 | ResNet 152 | - | 77.4 | - |
| F | ResNet38  | 8 | ResNet 38 | - | 77.86 | - |
| B | I-ABN 101  | 8 | ResNeXt 101 | 74.42 | 77.58 | 3.16 |
| B | I-ABN 152  | 8 | I-ABN 101 | 77.58 | 78.49 | 0.91 |
Naturally, we would like to apply our approach to the state-of-the-art algorithms listed in Table 2. However, after careful examination, the released versions of these algorithms are not sufficient to allow modification; usually only the testing models are released. As a result, we evaluate our approach on the released ResNet-38 model (mIoU 78.08%, slightly higher than the 77.86% reported in the paper). Without the new loss, we improve the mIoU by 0.99%. With the new loss and subnet 0, we improve the mIoU to 79.38%, a 1.30% improvement over the released ResNet-38 model.
| subnet | 0 | 0-3 | 0 | 0-3 |
| loss | CE | CE | New CE | New CE |
| subnet | 0 | 0-3 | 0 |
4.2 PASCAL VOC 2012
PASCAL VOC 2012 has 1,464 images for training, 1,449 images for validation, and 1,456 images for testing. 21 object classes including the “background” class are annotated. We also use the Semantic Boundaries dataset  as the auxiliary dataset, resulting in 10,582 images for training.
ResNet-101 is selected for the experiments on the PASCAL VOC dataset, with the same structure as the one used for the Cityscapes evaluation. Similarly, ResNet-101 is pre-trained on ImageNet and fine-tuned on the VOC dataset for 80 epochs. In , a mIoU of 75.35% is reported with ResNet-101 on the PASCAL VOC validation set; our ResNet-101 obtains 75.43%, which is slightly higher.
Based on the confusion matrix shown on the left of Figure 7, we find that a number of classes are confused with the "background" class. Therefore, only one subnet is added for this confusing group (i.e., others, background, chair, dining table, potted plant, and sofa). The structure of our network is shown on the right of Figure 7.
Improvement on ResNet-101. Table 5 shows the experiments using the subnet and the improved cross-entropy loss. The mIoU increases to 75.51% when subnet 1 is used, and further to 76.91% when the improved cross-entropy loss is applied. Some visual results are shown in Figure 9.
| subnet | 0 | 0 and 1 | 0 and 1 |
5 Conclusion
In this paper, we presented a novel network structure to reduce semantic confusion errors that can come from different factors. While most existing works are designed to deal with individual factors, our approach handles confusion errors in a more direct way. It consists of two major components: 1) an ensemble of subnets with heterogeneous output spaces built from discriminative confusing groups estimated from the normalized confusion matrix; 2) an improved cross-entropy loss with a new term that penalizes both the false negatives and false positives often caused by confusing classes. Our experiments show that both components are effective and improve segmentation performance over different baseline models and datasets of different complexities. More importantly, our approach is general and flexible, and can easily fit into most existing network structures.
References
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39(12), 2481–2495 (2017)
-  Bulo, S.R., Neuhold, G., Kontschieder, P.: Loss max-pooling for semantic image segmentation. In: CVPR (2017)
-  Bulò, S.R., Porzi, L., Kontschieder, P.: In-place activated batchnorm for memory-optimized training of dnns. arXiv preprint arXiv:1712.02616 (2017)
-  Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia conference on knowledge discovery and data mining. pp. 475–482. Springer (2009)
-  Caesar, H., Uijlings, J., Ferrari, V.: Joint calibration for semantic segmentation. arXiv preprint arXiv:1507.01581 (2015)
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99), 1–1 (2016)
-  Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
-  Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3640–3649 (2016)
-  Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
-  Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
-  Cissé, M., Al-Shedivat, M., Bengio, S.: Adios: Architectures deep in output space. In: International Conference on Machine Learning. pp. 2770–2779 (2016)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
-  Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2650–2658 (2015)
-  Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111(1), 98–136 (2015)
-  Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence 35(8), 1915–1929 (2013)
-  Han, H., Wang, W.Y., Mao, B.H.: Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing. pp. 878–887. Springer (2005)
-  Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 991–998. IEEE (2011)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
-  Huang, C., Li, Y., Change Loy, C., Tang, X.: Learning deep representation for imbalanced classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5375–5384 (2016)
-  Islam, M.A., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4877–4885. IEEE (2017)
-  Jeatrakul, P., Wong, K.W., Fung, C.C.: Classification of imbalanced data by combining the complementary neural network and smote algorithm. In: International Conference on Neural Information Processing. pp. 152–159. Springer (2010)
-  Kittler, J.: Combining classifiers: A theoretical framework. Pattern analysis and Applications 1(1), 18–27 (1998)
-  Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: Advances in neural information processing systems. pp. 109–117 (2011)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
-  Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv preprint arXiv:1611.06612 (2016)
-  Lin, G., Shen, C., Van Den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3194–3203 (2016)
-  Liu, W., Tsang, I.W., Müller, K.R.: An easy-to-hard learning paradigm for multiple classes and multiple labels. The Journal of Machine Learning Research 18(1), 3300–3337 (2017)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
-  Mostajabi, M., Yadollahpour, P., Shakhnarovich, G.: Feedforward semantic segmentation with zoom-out features. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3376–3385 (2015)
-  Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1520–1528 (2015)
-  Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve semantic segmentation by global convolutional network. arXiv preprint arXiv:1703.02719 (2017)
-  Pinheiro, P., Collobert, R.: Recurrent convolutional neural networks for scene labeling. In: International conference on machine learning. pp. 82–90 (2014)
-  Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. arXiv preprint (2017)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
-  Schwing, A.G., Urtasun, R.: Fully connected deep structured networks. arXiv preprint arXiv:1503.02351 (2015)
-  Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)
-  Shi, X., Liu, Q., Fan, W., Philip, S.Y., Zhu, R.: Transfer learning on heterogenous feature spaces via spectral transformation. In: Data Mining (ICDM), 2010 IEEE 10th International Conference on. pp. 1049–1054. IEEE (2010)
-  Shi, X., Liu, Q., Fan, W., Yang, Q., Yu, P.S.: Predictive modeling with heterogeneous sources. In: Proceedings of the 2010 SIAM International Conference on Data Mining. pp. 814–825. SIAM (2010)
-  Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 761–769 (2016)
-  Wojna, Z., Ferrari, V., Guadarrama, S., Silberman, N., Chen, L.C., Fathi, A., Uijlings, J.: The devil is in the decoder. arXiv preprint arXiv:1707.05847 (2017)
-  Wu, Z., Shen, C., Hengel, A.v.d.: High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339 (2016)
-  Wu, Z., Shen, C., Hengel, A.v.d.: Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080 (2016)
-  Xu, J., Schwing, A.G., Urtasun, R.: Tell me what you see and i will show you where it is. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3190–3197 (2014)
-  Xu, J., Schwing, A.G., Urtasun, R.: Learning to segment under various forms of weak supervision. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. pp. 3781–3790. IEEE (2015)
-  Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 2881–2890 (2017)
-  Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1529–1537 (2015)