The per-pixel cross-entropy loss (CEL) is widely used in structured output prediction tasks such as contour detection, semantic segmentation, and instance segmentation bertasius2015deepedge ; hwang2015pixel ; long2015fully ; he2017mask as a spatial extension of generic image recognition krizhevsky2012imagenet ; he2016deep . However, CEL has an obvious drawback: it is additive over pixels and assumes that per-pixel predictions are i.i.d. In the toy examples in Fig. 1 (top block), CEL yields the same overall error in either situation. Yet mistakes in the two scenarios should clearly incur different overall errors, computed from the structure of the entire patch. Structural reasoning is therefore highly desirable for structured output prediction tasks.
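The additive nature of CEL can be made concrete with a small sketch. It is an illustration under assumptions, not the setup of Fig. 1: the 6x6 patch, the confidence value 0.9, and the helper names `mean_cross_entropy` and `predict` are all hypothetical. Two predictions make four mistakes each, once as a single 2x2 hole and once as four isolated pixels, and receive the same loss.

```python
import math

def mean_cross_entropy(prob_true_label):
    """Average per-pixel cross-entropy, given the probability assigned to each pixel's true label."""
    values = [-math.log(p) for row in prob_true_label for p in row]
    return sum(values) / len(values)

def predict(shape, wrong_pixels, confident=0.9):
    """Give probability `confident` to the true label everywhere except at `wrong_pixels`."""
    height, width = shape
    return [[1.0 - confident if (r, c) in wrong_pixels else confident
             for c in range(width)] for r in range(height)]

# Two hypothetical error patterns with the same number of wrong pixels.
hole = {(2, 2), (2, 3), (3, 2), (3, 3)}       # one 2x2 hole inside the object
scattered = {(1, 1), (1, 4), (4, 1), (4, 4)}  # four isolated mistakes

loss_hole = mean_cross_entropy(predict((6, 6), hole))
loss_scattered = mean_cross_entropy(predict((6, 6), scattered))

# The loss only counts per-pixel errors, so the spatial layout does not matter.
print(math.isclose(loss_hole, loss_scattered))
```

Because the loss is a sum of independent per-pixel terms, any rearrangement of the same mistakes leaves it unchanged, which is the limitation that motivates structural reasoning.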
Various attempts have been made to incorporate structural reasoning into structured output prediction in a cooperative way. They fall into two main streams: bottom-up Conditional Random Fields (CRFs) krahenbuhl2011efficient ; zheng2015conditional , and top-down shape priors xie2016top ; gygli2017value or Generative Adversarial Networks (GANs) goodfellow2014generative ; luc2016semantic . (1) CRFs enforce label consistency between pixels and are commonly employed as a post-processing step krahenbuhl2011efficient ; chen2016deeplab , or as a plug-in module inside deep neural networks zheng2015conditional ; liu2015semantic that coordinates bottom-up information. Effective as they are, CRFs are usually sensitive to input appearance changes and need expensive iterative inference. (2) As an example of learning top-down shape priors, GANs have emerged as an alternative for enforcing structural regularity in the structured prediction space. Specifically, the discriminator network is trained to distinguish the predicted mask from the ground truth mask. Promising as they are, GANs suffer from inaccurate boundary localization as a consequence of generic shape modeling.
Before we dive into our proposed framework, let us examine the toy examples in Fig. 1 again. Top-down cooperative approaches prefer an additional loss (together with CEL) that penalizes abnormal structures, which are deemed undesirable, more heavily. Networks trained this way are thus aware of intra-category shape invariance and inter-category object co-occurrences. However, we notice that in real examples as in Fig. 1 (bottom block), complex and deformable shapes and confusing co-occurrences are the most common sources of mistakes in structured output prediction, especially when the visual cues are ambiguous. As a result, training with shape priors sometimes deteriorates the prediction, as shown in the bicycle example. We are thus inspired to tackle this problem from the opposite angle: top-down approaches should shift the focus to confusing co-occurring backgrounds or ambiguous boundaries of normal shapes, so that the structured output prediction network is pushed to learn harder.
We propose a new framework, which replaces CEL, for training structured prediction networks via an adversarial process, in which we train a structure analyzer to provide supervisory signals, the adversarial structure matching loss (ASML). By maximizing ASML, that is, by learning to exaggerate structural mistakes made by the structured prediction networks, the structure analyzer not only becomes aware of complex object shapes but also adaptively emphasizes confusing co-occurrences. As a result, training structured prediction networks by minimizing ASML reduces contextual confusion among co-occurring objects and improves boundary localization. To improve the stability of training, we append a structure regularizer to the structure analyzer to compose a structure autoencoder. By training the autoencoder to reconstruct ground truth, which contains complete structures, we ensure the filters in the structure analyzer form a good structure basis. We demonstrate that structured output prediction networks trained using ASML outperform their counterparts trained using CEL on the figure-ground segmentation task on the Weizmann horse dataset borenstein2002horse and the semantic segmentation task on the PASCAL VOC 2012 dataset everingham2010pascal with various base architectures, such as FCN long2015fully , U-Net ronneberger2015u , DeepLab chen2016deeplab , and PSPNet zhao2016pyramid . We further verify the effectiveness of ASML, particularly in resolving confusing context and improving boundary localization.
2 Related Work
Semantic Segmentation. The field of semantic segmentation has progressed quickly in the last few years since the introduction of fully convolutional networks long2015fully . Both deeper zhao2016pyramid ; li2017not and wider noh2015learning ; ronneberger2015u ; yu2015multi network architectures have been proposed and have dramatically boosted the performance on standard benchmarks like PASCAL VOC 2012 everingham2010pascal . For example, Yu et al. yu2015multi enabled fine-detailed segmentation results using dilated (i.e., enlarged kernel) convolutions, whereas Zhao et al. zhao2016pyramid exploited global context information through a pyramid pooling module. Though these methods yield impressive performance w.r.t. mIoU (mean intersection over union), they fail to capture the abundant structure information present in natural scenes, as shown in Fig. 1.
Structure Modeling. To overcome the aforementioned drawback, people have explored several ways to incorporate structure information krahenbuhl2011efficient ; chen2015learning ; zheng2015conditional ; liu2015semantic ; lin2016efficient ; bertasius2016convolutional ; xie2016top ; gygli2017value ; ke2018adaptive . For example, Chen et al. chen2016deeplab utilized denseCRF krahenbuhl2011efficient as post-processing to refine the final segmentation results. Zheng et al. zheng2015conditional and Liu et al. liu2015semantic further made the CRF module differentiable within the deep neural network. Besides, low-level cues, such as affinity shi2000normalized ; maire2016affinity ; liu2017learning ; bertasius2016convolutional and contour bertasius2016semantic ; chen2016semantic have also been leveraged to encode image structures. However, these methods either are sensitive to transient appearance traits or require expensive iterative inference.
3.1 Adversarial Structure Matching Loss
We consider semantic segmentation as an example of structured output prediction tasks, in which a segmentation network (segmenter) , which usually is a deep CNN, is trained to map an input image to a per-pixel label mask . We propose to train such a segmenter with another network, the structure analyzer. The structure analyzer extracts -dimensional multi-layer structure features from either ground truth masks, denoted as , or predictions, denoted as . We train the structure analyzer to maximize the distance between the structure features from the two inputs, so that it learns to exaggerate structural mistakes made by the segmenter. Conversely, we simultaneously train the segmenter to minimize the same distance. In other words, the segmenter and the structure analyzer play the following two-player minimax game with value function :
that is, we prefer the optimal segmenter to be the one that learns to predict the true structures so as to satisfy the structure analyzer. Note that the structure analyzer will bias its discriminative power towards similar but subtly different structures, as they occur more frequently through the course of training.
One might relate this framework to GANs goodfellow2014generative . A critical distinction is that GANs try to minimize the discrepancy between the data distributions of real and fake examples, and thus accept a set of solutions. In contrast, structured output prediction tasks require a specific one-to-one mapping of each pixel between ground truth masks and predictions. Therefore, the discrimination of structures should take place for every patch between corresponding masks, hence the name adversarial structure matching loss (ASML).
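To make the idea of matching structure features concrete, here is a minimal sketch in which a fixed, hand-crafted feature extractor stands in for the learned structure analyzer. Everything in it is an assumption for illustration: the finite-difference filters in `edge_features`, the mean squared distance in `structure_distance`, and the 6x6 masks. In the framework described above the analyzer is a trained network, and its distance measure is not reproduced here.

```python
def edge_features(mask):
    """Stand-in analyzer: horizontal and vertical finite differences of a 2D mask."""
    horizontal = [row[c + 1] - row[c] for row in mask for c in range(len(row) - 1)]
    vertical = [mask[r + 1][c] - mask[r][c]
                for r in range(len(mask) - 1) for c in range(len(mask[0]))]
    return horizontal + vertical

def structure_distance(pred, target, analyzer=edge_features):
    """Mean squared difference between the structure features of two masks."""
    feat_pred, feat_target = analyzer(pred), analyzer(target)
    return sum((a - b) ** 2 for a, b in zip(feat_pred, feat_target)) / len(feat_pred)

def square_mask(size=6, lo=1, hi=4):
    """Binary mask with a filled square as the object."""
    return [[1 if lo <= r <= hi and lo <= c <= hi else 0 for c in range(size)]
            for r in range(size)]

def flip(mask, pixels):
    """Return a copy of `mask` with the labels at `pixels` inverted."""
    return [[1 - v if (r, c) in pixels else v for c, v in enumerate(row)]
            for r, row in enumerate(mask)]

target = square_mask()
pred_hole = flip(target, {(2, 2), (2, 3), (3, 2), (3, 3)})     # one compact mistake
pred_corners = flip(target, {(1, 1), (1, 4), (4, 1), (4, 4)})  # four isolated mistakes

d_hole = structure_distance(pred_hole, target)
d_corners = structure_distance(pred_corners, target)
print(d_hole, d_corners)
```

Both predictions contain four wrong pixels, yet the feature distance separates them, because it depends on how the mistakes are arranged. A learned analyzer would additionally adapt its filters to whichever structural mistakes the segmenter currently makes.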
3.2 Global Optimality of and Convergence
We would like the segmenter to converge to a good mapping of given , if given enough capacity and training time. To simplify the dynamic of convergence, we consider both segmenter and structure analyzer as models with infinite capacity in a non-parametric setting.
For a fixed , if , then is infinitely large for an optimal .
If , there exists an index such that , where . Without loss of generality, we assume if and let and .
We consider a special case where on the -th dimension of the input is a linear mapping, i.e., . As is with infinite capacity, we know there exists such that
Note that as . Thus . ∎
In practice, parameters of are restricted within certain range under weight regularization so would not go to infinity.
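The intuition behind the unbounded value and the effect of restricting the weights can be sketched numerically. This is only an illustration under assumptions: the analyzer is reduced to a single linear weight acting on one pixel, the gap is measured with an absolute difference, and `analyzer_gap` and `clip_weight` are hypothetical helper names.

```python
def analyzer_gap(weight, pred_pixel, true_pixel):
    """Absolute feature gap under a linear analyzer f(m) = weight * m at one pixel."""
    return abs(weight * pred_pixel - weight * true_pixel)

def clip_weight(weight, bound):
    """Restrict a weight to the interval [-bound, bound]."""
    return max(-bound, min(bound, weight))

# If the prediction differs from the ground truth, the gap scales with the weight.
gaps = [analyzer_gap(w, pred_pixel=0.0, true_pixel=1.0) for w in (1.0, 10.0, 100.0)]
print(gaps)

# If the prediction matches the ground truth, no weight can create a gap.
print(analyzer_gap(1e6, pred_pixel=1.0, true_pixel=1.0))

# With a bounded weight, the gap stays bounded as well.
print(analyzer_gap(clip_weight(1e6, bound=5.0), pred_pixel=0.0, true_pixel=1.0))
```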
For an optimal , if and only if .
If , , for any . Hence .
If or , contradicts Proposition 1. Hence . ∎
If () is a Nash equilibrium of the system, then and
From Proposition 1, we proved if . From Corollary 1, we proved if and only if . Since for any and , the Nash equilibrium only exists when , or . ∎
From the proofs, we recognize the imbalance of power between the segmenter and the structure analyzer: the structure analyzer can arbitrarily enlarge the value function if the segmenter is not optimal. In practice, we should limit the training of structure analyzers or apply weight regularization to prevent exploding gradients. Therefore, we train the structure analyzer only once per iteration, with a learning rate that is equal to or less than the one for the segmenter. Another trick is to binarize the predictions (winner-take-all across channels for every pixel) before calculating ASML for the structure analyzer. In this way, structure analyzers will focus on learning to distinguish the structures instead of the confidence levels of predictions.
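The binarization trick can be sketched as a per-pixel argmax over channels followed by one-hot encoding. The nested-list layout `probs[channel][row][col]` and the function name `binarize` are assumptions made for this example.

```python
def binarize(probs):
    """Winner-take-all across channels: probs[channel][row][col] -> one-hot mask of the same shape."""
    num_classes = len(probs)
    height, width = len(probs[0]), len(probs[0][0])
    one_hot = [[[0] * width for _ in range(height)] for _ in range(num_classes)]
    for r in range(height):
        for col in range(width):
            # Pick the channel with the highest score at this pixel.
            winner = max(range(num_classes), key=lambda c: probs[c][r][col])
            one_hot[winner][r][col] = 1
    return one_hot

# Two hypothetical predictions with the same labels but different confidence.
hesitant = [[[0.6, 0.2]], [[0.4, 0.8]]]
confident = [[[0.99, 0.01]], [[0.01, 0.99]]]
print(binarize(hesitant) == binarize(confident))
```

After binarization the two predictions are indistinguishable, so an analyzer fed these inputs can only react to differences in structure, not to differences in confidence.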
3.3 Reconstructing as Structure Regularization
Although structure analyzers would theoretically discover any structural difference between predictions and ground truth, randomly initialized structure analyzers may miss certain structures in the early stage of training. For example, if filter responses for a sharp curve are initially very low, ASML for the sharp curve will be correspondingly small, resulting in inefficient learning. This problem emerges when training both segmenters and structure analyzers from scratch. To alleviate it, we propose a regularization method to stabilize the learning of structure analyzers.
One way to ensure that the filters in the structure analyzer form a good structure basis is through reconstructing ground truth, which contains complete structures. If filters in the structure analyzer fail to capture certain structures, the ground truth mask cannot be reconstructed. Hence, we append a structure regularizer on top of the structure analyzer to compose a structure autoencoder. We denote the structure regularizer , where denotes features from the structure analyzer, which are not necessarily the same set as features for ASML; hence the reconstruction mapping: . As a result, the final objective function is as follows
Note that the structure regularization loss is independent of .
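For orientation, the two ingredients of the training objective can be sketched as follows. This is a sketch in assumed notation, not the authors' formula: $S$ stands for the segmenter, $A$ for the structure analyzer, $R$ for the structure regularizer, $x$ for an input image, $y$ for its ground truth mask, $d$ for an unspecified distance between structure features, $\ell_{\mathrm{rec}}$ for an unspecified reconstruction loss, and $\lambda$ for a weighting factor.

```latex
\begin{align}
\text{adversarial term:}\quad
  & \min_{S}\,\max_{A}\; \mathbb{E}_{x,y}\Big[\, d\big(A(S(x)),\, A(y)\big) \Big] \\
\text{regularization term:}\quad
  & \min_{A,\,R}\; \lambda\, \mathbb{E}_{y}\Big[\, \ell_{\mathrm{rec}}\big(R(A(y)),\, y\big) \Big]
\end{align}
```

In this sketch the second term involves only ground truth masks, so it does not contain $S$ at all.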
We demonstrate the effectiveness of our proposed methods and compare the results on several popular semantic segmentation architectures trained using CEL or ASML. We first give an overview of the datasets, evaluation metrics, and implementation details used in these experiments. Then we present the main results and analyses on confusion and boundaries.
4.1 Experimental Setup
Tasks and datasets. We compare our proposed ASML against CEL on the Weizmann horse borenstein2002horse and PASCAL VOC 2012 everingham2010pascal datasets. The Weizmann horse is a relatively small dataset for figure-ground segmentation that contains side-view horse images, which are split into training and validation images. The VOC dataset is a well-known benchmark for generic image segmentation which includes object classes and a ‘background’ class, containing and images for training and validation, respectively.
Architectures. For all the structure autoencoders (i.e., structure analyzer and structure regularizer), we use U-Net ronneberger2015u architectures with either 7 conv layers for instance segmentation or 5 conv layers for semantic segmentation. We conduct experiments on different segmentation CNN architectures to demonstrate the effectiveness of our proposed method. On the horse dataset borenstein2002horse , we use U-Net ronneberger2015u (with 7 convolutional layers) as our base architecture. On the VOC everingham2010pascal dataset, we carry out experiments and thorough analyses over different architectures with a ResNet101 he2016deep backbone, including FCN long2015fully , DeepLab chen2016deeplab , and PSPNet zhao2016pyramid , which is a highly competitive segmentation model. Aside from the base architectures, neither extra weight parameters nor post-processing is required at inference time.
Implementation details on Weizmann horse. We use the poly learning rate policy where the current learning rate equals the base one multiplied by with max iterations as epochs. We set the base learning rate as with Adam optimizer for both and . Momentum and weight decay are set to and , respectively. We set the batch size as and use no data augmentation other than random mirroring. We set for structure regularization.
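The poly learning rate policy mentioned above is commonly written as the base rate times `(1 - iteration / max_iterations) ** power`. The sketch below assumes that common form. The default `power=0.9` and the base rate used in the example are hypothetical placeholders, not values taken from this text.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power=0.9):
    """Poly schedule (assumed common form): decay from base_lr at step 0 to zero at max_iterations."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# Hypothetical base rate and schedule length, for illustration only.
schedule = [poly_learning_rate(0.01, step, 100) for step in (0, 25, 50, 75, 100)]
print(schedule)
```

The schedule starts at the base rate, decreases monotonically, and reaches zero at the final iteration.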
|Loss||Horse mIoU (%)||VOC mIoU (%)|
|ASML w/o rec.||77.83||72.14|
|ASML w/o adv.||76.70||71.26|

|Base / Loss||mIoU (%)|
|FCN / CEL||68.91|
|FCN / ASML||72.14|
|DeepLab / CEL||77.54|
|DeepLab / ASML||78.05|
|PSPNet / CEL||80.12|
|PSPNet / GAN||80.74|
|PSPNet / ASML||81.43|
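The gains implied by the VOC mIoU table above can be tallied directly from its entries. The snippet only restates the numbers listed in the table and subtracts them; the dictionary layout is an assumption made for this example.

```python
# VOC mIoU (%) per base architecture and loss, copied from the table above.
voc_miou = {
    "FCN": {"CEL": 68.91, "ASML": 72.14},
    "DeepLab": {"CEL": 77.54, "ASML": 78.05},
    "PSPNet": {"CEL": 80.12, "GAN": 80.74, "ASML": 81.43},
}

# Absolute mIoU gain of ASML over CEL, in percentage points.
gains = {base: round(scores["ASML"] - scores["CEL"], 2) for base, scores in voc_miou.items()}
print(gains)

# Gap between ASML and the GAN baseline on PSPNet, in percentage points.
gan_gap = round(voc_miou["PSPNet"]["ASML"] - voc_miou["PSPNet"]["GAN"], 2)
print(gan_gap)
```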
Implementation details on VOC dataset. Our implementation follows the implementation details described in chen2017rethinking . We use the poly learning rate policy where the current learning rate equals the base one multiplied by . We set the base learning rate with SGD optimizer as for and for . The training iterations for all experiments on all datasets are K, while the performance can be further improved by increasing the iteration number. Momentum and weight decay are set to and , respectively. For data augmentation, we adopt random mirroring and random resizing between 0.5 and 2 for all datasets. We do not use random rotation or random Gaussian blur. We do not upscale the logits (prediction map) back to the input image resolution; instead, we follow the setting of chen2016deeplab by downsampling the ground-truth labels for training (). The crop size is set to and batch size is set to . We update BatchNorm parameters with for ImageNet-pretrained layers and for untrained layers. For ASML, we set for structure regularization.
4.2 Main Result
We evaluate both figure-ground and semantic segmentation tasks via mean pixel-wise intersection-over-union (denoted as mIoU) long2015fully . We first conduct an ablation study on both datasets to thoroughly analyze the effectiveness of using different layers of structure features in the structure analyzer. As shown in Table 2, using low- to mid-level features (from to ) of structure analyzers yields the highest performance ( and mIoU on Weizmann horse dataset and VOC dataset, respectively). We also report mIoU on the VOC dataset using different base architectures, as shown in Table 2. Our proposed method achieves consistent improvements across all three base architectures, boosting mIoU by with FCN, with DeepLab and with PSPNet. ASML also yields higher mIoU than GAN training (together with CEL) on the VOC dataset. We demonstrate some visual results in Fig. 4.
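As a reminder of how the metric is computed, the following sketch derives mIoU from a confusion matrix on a tiny example. The masks and the helper names `confusion_matrix` and `mean_iou` are hypothetical. Benchmark evaluation code typically also handles ignore labels, which are omitted here.

```python
def confusion_matrix(pred, target, num_classes):
    """matrix[t][p] counts pixels with true class t that were predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            matrix[t][p] += 1
    return matrix

def mean_iou(matrix):
    """Average of per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    ious = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        false_neg = sum(matrix[k]) - true_pos
        false_pos = sum(row[k] for row in matrix) - true_pos
        denom = true_pos + false_neg + false_pos
        if denom > 0:
            ious.append(true_pos / denom)
    return sum(ious) / len(ious)

target = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
pred = [[0, 0, 0, 1],
        [0, 0, 1, 1]]

# Class 0: IoU = 4 / 5, class 1: IoU = 3 / 4, so mIoU = 0.775.
print(mean_iou(confusion_matrix(pred, target, num_classes=2)))
```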
4.3 Confusing Context Improvement
We next demonstrate the robustness of our proposed method under confusing context. We first calculate the confusion matrix of pixel accuracy on the PASCAL VOC 2012 everingham2010pascal validation set. We identify ‘background’ as the biggest confuser for most of the categories and hence summarize the percentage of confusion in Fig. 3 (i.e., the ‘background’ column from the confusion matrix). ASML reduces the overall confusion caused by ‘background’ from to on FCN and from to on PSPNet with relative error reduction. Large improvements come from resolving confusion of ‘chair’, ‘plant’, ‘sofa’, and ‘tv’.
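Reading the ‘background’ column of a confusion matrix can be sketched as below. The row normalization (share of each true class's pixels predicted as background), the 3-class matrix, and the numbers passed to `relative_reduction` are assumptions for illustration, not the values behind Fig. 3.

```python
def background_confusion(matrix, background=0):
    """Percentage of each foreground class's pixels that were predicted as background.

    matrix[t][p] counts pixels with true class t predicted as class p.
    """
    rates = {}
    for k, row in enumerate(matrix):
        if k != background and sum(row) > 0:
            rates[k] = 100.0 * row[background] / sum(row)
    return rates

def relative_reduction(before, after):
    """Relative error reduction in percent when an error rate drops from `before` to `after`."""
    return 100.0 * (before - after) / before

# Hypothetical 3-class confusion matrix; class 0 is the background.
matrix = [[90, 5, 5],
          [20, 70, 10],
          [10, 0, 30]]
rates = background_confusion(matrix)
print(rates)

# Hypothetical drop in confusion from 20% to 15%.
print(relative_reduction(20.0, 15.0))
```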
4.4 Boundary Localization Improvement
We argue that our proposed method is more sensitive to complex shapes of objects. We evaluate boundary localization using standard contour detection metrics amfm_pami2011 . The contour detection metrics compute the correspondences between predicted boundaries and ground-truth boundaries, and summarize the results with precision, recall, and f-measure. We compare the results with different loss functions (CEL, GAN, and ASML) on the VOC validation set. As shown in Table 3, ASML outperforms both CEL and GAN in most categories and overall. Note also that the boundaries of thin-structure objects, such as ‘bike’ and ‘chair’, are captured much better by ASML.
|Loss / Measure||aero||bike||bird||boat||bottle||bus||car||cat||chair||cow||table||dog||horse||mbike||person||plant||sheep||sofa||train||tv||mean|
|CEL / precision||89.37||82.62||86.87||69.82||69.29||78.59||73.84||80.36||55.80||85.85||39.06||75.58||83.18||74.63||74.45||65.86||84.72||40.99||68.66||60.30||71.99|
|GAN / precision||88.24||84.34||91.17||67.87||65.71||83.09||76.78||78.69||54.57||87.65||40.39||76.04||86.29||74.08||75.36||64.60||84.97||39.77||70.26||58.96||72.44|
|ASML / precision||90.92||89.50||89.60||72.22||70.45||81.91||80.26||79.82||58.94||85.35||51.44||73.22||87.67||76.39||79.71||65.83||86.46||44.52||73.59||69.18||75.35|
|CEL / recall||69.41||42.05||65.85||41.67||62.37||62.59||56.16||66.72||29.72||58.73||27.87||66.64||60.78||51.26||55.40||23.44||54.73||41.25||55.08||51.20||52.15|
|GAN / recall||67.49||40.26||62.95||37.43||58.89||61.12||55.01||63.24||28.60||56.07||27.74||65.58||59.39||49.45||53.44||22.34||53.12||41.73||53.92||47.22||50.25|
|ASML / recall||69.75||48.62||63.56||40.28||62.55||63.73||57.69||65.93||38.07||59.10||29.99||66.71||62.19||52.37||57.53||25.80||54.7||49.69||55.42||55.84||53.98|
|CEL / f-measure||78.13||55.73||74.91||52.19||65.65||69.68||63.80||72.91||38.78||69.75||32.52||70.83||70.24||60.78||63.53||34.58||66.50||41.12||61.12||55.38||59.91|
|GAN / f-measure||76.48||54.50||74.48||48.25||62.11||70.43||64.10||70.12||37.53||68.39||32.89||70.43||70.36||59.31||62.53||33.20||65.37||40.73||61.01||52.44||58.73|
|ASML / f-measure||78.94||63.01||74.37||51.71||66.27||71.68||67.13||72.21||46.26||69.84||37.89||69.82||72.77||62.14||66.83||37.07||67.04||46.96||63.22||61.80||62.35|
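The f-measure rows in the table above are consistent with the harmonic mean of the corresponding precision and recall entries, and the ‘mean’ column is consistent with a plain average over the 20 classes. The sketch below checks both on values copied from the table; small deviations in the last digit are expected because the table entries are rounded. The function name `f_measure` is a hypothetical helper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * precision * recall / (precision + recall)

# Per-class check using table entries (precision, recall) -> reported f-measure.
print(f_measure(89.37, 69.41))  # CEL / aero, reported 78.13
print(f_measure(82.62, 42.05))  # CEL / bike, reported 55.73
print(f_measure(89.50, 48.62))  # ASML / bike, reported 63.01

# The 'mean' column for CEL f-measure matches the average of the 20 class entries.
cel_f = [78.13, 55.73, 74.91, 52.19, 65.65, 69.68, 63.80, 72.91, 38.78, 69.75,
         32.52, 70.83, 70.24, 60.78, 63.53, 34.58, 66.50, 41.12, 61.12, 55.38]
cel_mean = sum(cel_f) / len(cel_f)
print(cel_mean)  # reported 59.91
```

On ‘bike’, recall rises from 42.05 to 48.62 and precision from 82.62 to 89.50 when moving from CEL to ASML, which is where the gain in f-measure from 55.73 to 63.01 comes from.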
- (1) Bertasius, G., Shi, J., Torresani, L.: Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In: CVPR. (2015)
- (2) Hwang, J.J., Liu, T.L.: Pixel-wise deep learning for contour detection. In: ICLR Workshop. (2015)
- (3) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
- (4) He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: CVPR. (2017)
- (5) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. (2012)
- (6) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
- (7) Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV (2010)
- (8) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. (2017)
- (9) Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: NIPS. (2011)
- (10) Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: ICCV. (2015)
- (11) Xie, S., Huang, X., Tu, Z.: Top-down learning for structured labeling with convolutional pseudoprior. In: ECCV. (2016)
- (12) Gygli, M., Norouzi, M., Angelova, A.: Deep value networks learn to evaluate and iteratively refine structured outputs. In: ICML. (2017)
- (13) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014)
- (14) Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. NIPS Workshop (2016)
- (15) Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
- (16) Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: CVPR. (2015)
- (17) Borenstein, E., Ullman, S.: Class-specific, top-down segmentation. In: ECCV. (2002)
- (18) Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. (2015)
- (19) Li, X., Liu, Z., Luo, P., Loy, C.C., Tang, X.: Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In: CVPR. (2017)
- (20) Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: CVPR. (2015)
- (21) Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR. (2016)
- (22) Chen, L.C., Schwing, A., Yuille, A., Urtasun, R.: Learning deep structured models. In: ICML. (2015)
- (23) Lin, G., Shen, C., van den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR. (2016)
- (24) Bertasius, G., Torresani, L., Yu, S.X., Shi, J.: Convolutional random walk networks for semantic image segmentation. In: CVPR. (2017)
- (25) Ke, T.W., Hwang, J.J., Liu, Z., Yu, S.X.: Adaptive affinity field for semantic segmentation. arXiv preprint arXiv:1803.10335 (2018)
- (26) Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI (2000)
- (27) Maire, M., Narihira, T., Yu, S.X.: Affinity cnn: Learning pixel-centric pairwise relations for figure/ground embedding. In: CVPR. (2016)
- (28) Liu, S., De Mello, S., Gu, J., Zhong, G., Yang, M.H., Kautz, J.: Learning affinity via spatial propagation networks. In: NIPS. (2017)
- (29) Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. In: CVPR. (2016)
- (30) Chen, L.C., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In: CVPR. (2016)
- (31) Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
- (32) Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. TPAMI (2011)