Semantic segmentation is one of the most fundamental and challenging computer vision problems. The goal of the task is to assign every pixel in an image a proper category label. With the breakthrough brought by deep learning, promising results have been achieved. Most of the cutting-edge methods belong to the family of fully convolutional networks, which treat semantic segmentation as a pixel-wise classification problem. To be specific, the network derives the label of a single pixel solely from the image patch within its receptive field. The pixel and the corresponding patch become an independent sample, and relations between neighboring samples are ignored, except that they are trained together in the same learning step. This simplification makes it hard to classify pixels in boundary areas, since those pixels may have distinct labels while their corresponding patches highly resemble each other, as shown in Fig. 1 (d). In addition, Fig. 1 (b) shows that noisy predictions can often be found in the middle of correct predictions, which could have been avoided if the model also took the surrounding predictions into consideration. Both failure cases can be alleviated if we have structural knowledge of the relationships between pixels.
Previous works have tried to exploit such structural knowledge with post-processing methods and modified loss functions [13, 24, 2]
. However, we find that these methods have various drawbacks. Most of them are manually designed based on heuristic priors, such as relations with raw pixel values or larger weights on boundary areas. These priors are insufficient to fully capture the structural information of an image and can only deal with simple failure cases. As the backbone network becomes stronger [3, 4, 5], these solutions become less effective and may even lead to inferior results. Besides, methods like [14, 13] choose to model image structure on the final outputs of class distribution, where the rich information of high-dimensional features has been lost. Post-processing methods like conditional random fields can also be time-consuming due to the dense modeling. As a consequence, most of the leading models [5, 27, 15] do not use these techniques.
Based on the analysis above, we propose to explicitly require the model to predict the similarity between pixels, which is the pair-wise pixel affinity information. To get a more efficient representation of image structure, we model a sparse version of affinity and term it dilated affinity. The new task is parallel to semantic prediction, and thus it directly forces the high-dimensional feature maps to capture structural information during the training stage. This can be viewed as a substitute for customized loss functions. Furthermore, we can use the dilated affinity information to refine the segmentation with a fast affinity propagation post-processing step. This extra step is inspired by the manual annotation process and can alleviate noisy predictions as well as vague edges.
There are several advantages of the proposed method. First, our method is designed to fully utilize structural information based on direct supervision with minimal manually designed priors. Second, the moderate scale of dilated affinity is crucial to our success, because dense modeling like CRF can lose locality and dramatically raise the computation cost. On the other hand, relations between adjacent pixels are overwhelmed with positive signals, leaving negligible information to explore. Dilated affinity avoids both problems. Third, the ground truth of affinity information can be directly derived from the semantic labels. This means that we do not ask for any extra information. Instead, we enhance the system based on what we have and try to exploit every detail of the existing information. Last but not least, the method is easy to implement in different frameworks and requires only minor extra computation.
We utilize the state-of-the-art DeepLabv3+ [5] as our backbone network. Notable improvements are observed on the PASCAL VOC 2012 [9] and Cityscapes [6] datasets. Sec. 2 introduces related works in three directions: fully convolutional networks, image structural information, and pixel affinity. Sec. 3 explains our method in detail and Sec. 4 shows experimental results on PASCAL VOC 2012 and Cityscapes. Finally, Sec. 5 concludes the paper.
2 Related works
Fully convolutional networks.
The fully convolutional network (FCN) [21] is one of the pioneering works that introduced deep learning into semantic segmentation and achieved impressive performance on benchmark datasets. Two important techniques were proposed and have been explored extensively afterward. First, networks originally designed for image recognition are adapted into a fully convolutional fashion and emit dense output directly. Later works found that dilated convolution [26] alleviates the loss of precision caused by excessive spatial reduction. In addition, increasing the receptive field [18, 28, 4] can give an extra performance boost. Second, a skip architecture was proposed to refine the segmentation results with multi-level features, and various substitutes [24, 10, 5, 27, 15] have been investigated since.
Image structural information.
Works focusing on image structural information have also been developed. Ronneberger et al. [24] assign higher weights to samples on edges. Ke et al. [13] customize the loss function to pull similar pixels together and push different ones away. Several post-processing methods refine the predictions by aggregating the outputs at the image level. Conditional random field (CRF) [14] is one of the earliest attempts in this direction, and many following methods try to enhance its capability, e.g., CRF as recurrent neural network [29], Markov random field [20] and spatial propagation [17]. However, these attempts usually require much additional cost in both time and memory, and fail to improve performance when the backbone methods are sufficiently strong.
Pixel affinity.
Pair-wise pixel affinity is a fundamental computer vision concept and has been widely used in deep learning scenarios. Maire et al. [22] utilize affinity relations in the spectral embedding field, while Liu et al. [17] construct a linear propagation module to learn a pair-wise similarity matrix. Recently, pixel affinity [19] and pixel link [7] model the problem as the task of telling whether two pixels belong to the same instance, and have shown effectiveness in various practical scenes. We draw on their experience and modify the current state-of-the-art methods to enable the model to tell whether two adjacent pixels belong to the same class rather than the same instance.
In this section, we first illustrate the concept of dilated affinity and explain several specific designs which allow it to fully cooperate with the original task of segmentation. Then we introduce the details of our network architecture and loss computation. In the end, we describe the affinity propagation post-processing which combines the coarse segmentation results and affinity information.
3.1 Dilated pixel affinity
Pair-wise pixel affinity describes the similarity between pixels and may have different mathematical definitions depending on the context of the problem. For semantic segmentation, we denote the affinity of two pixels with a binary signal and assign it a value of 1 when the two pixels belong to the same class, and 0 otherwise. As shown in Eq. (1), $l_i$ and $l_j$ are the semantic labels of pixels $i$ and $j$, and $a_{ij}$ is their affinity:

$$a_{ij} = \begin{cases} 1, & l_i = l_j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
When capturing the affinity of a pair of pixels, we only consider pixels within a restricted area, since distant pixels lose locality and the complexity of modeling every pair of pixels grows rapidly with the size of the feature map. On the other hand, immediately adjacent pixels are also discarded, as their signals are overwhelmed with positive values, leaving negligible information to explore. Thus, we sparsely sample pixels at reasonable distances in the same way as dilated convolution [26, 3] and term this sampling method dilated affinity, as shown in Fig. 1 (e).
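To be specific, for the pixel at position $(x, y)$ on the feature map, the network is required to predict its affinity with a group of pixels, which lie in the 8 directions at dilation rate $r$. We denote this single group of pixels as $N_r(x, y)$. Inspired by [19, 16], multiple groups of pixels with various dilation rates are taken into consideration. We denote the set of dilation rates as $R$ and the set of all targeted pixels as $N(x, y)$. Their relations are shown below.

$$N_r(x, y) = \{(x + r\,\delta_x,\; y + r\,\delta_y) \mid \delta_x, \delta_y \in \{-1, 0, 1\},\ (\delta_x, \delta_y) \neq (0, 0)\} \tag{2}$$

$$N(x, y) = \bigcup_{r \in R} N_r(x, y) \tag{3}$$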
The influence of different choices of dilation rates is explored in Sec. 4.
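Since the affinity ground truth is derived from the semantic labels alone, it can be computed with a few array shifts. The following is a minimal NumPy sketch, not the authors' implementation: the function name `dilated_affinity`, the `(8, H, W)` layout, the direction ordering and the `valid` mask for neighbors that fall outside the image are all assumptions made for illustration.

```python
import numpy as np

# Eight neighbor directions (dy, dx); each is scaled by the dilation rate.
# The ordering is an arbitrary choice for this sketch.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]


def dilated_affinity(labels, rate):
    """Binary affinity of every pixel with its 8 neighbors at distance `rate`.

    labels: (H, W) integer label map.
    Returns (affinity, valid), both of shape (8, H, W). affinity[k, y, x] is 1
    when pixel (y, x) and its neighbor in direction k share a label; valid
    marks neighbors that lie inside the image (an assumption of this sketch).
    """
    h, w = labels.shape
    affinity = np.zeros((8, h, w), dtype=np.uint8)
    valid = np.zeros((8, h, w), dtype=bool)
    for k, (dy, dx) in enumerate(DIRECTIONS):
        oy, ox = dy * rate, dx * rate
        # Pixels whose neighbor (y + oy, x + ox) is still inside the image.
        y0, y1 = max(0, -oy), min(h, h - oy)
        x0, x1 = max(0, -ox), min(w, w - ox)
        if y0 >= y1 or x0 >= x1:
            continue
        center = labels[y0:y1, x0:x1]
        neighbor = labels[y0 + oy:y1 + oy, x0 + ox:x1 + ox]
        affinity[k, y0:y1, x0:x1] = (center == neighbor)
        valid[k, y0:y1, x0:x1] = True
    return affinity, valid


# Toy label map: left half is class 0, right half is class 1.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]])
aff, valid = dilated_affinity(labels, rate=2)
```

In this toy example, the pixel at the top-left corner has negative affinity with the pixel two columns to its right (different class) and positive affinity with the pixel two rows below it (same class).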
3.2 Architecture and loss computation
We choose DeepLabv3+ [5] with ResNet-101 [12] as our baseline model. In the original DeepLabv3+, the feature map generated by ResNet-101 first goes through the Atrous Spatial Pyramid Pooling (ASPP) module. Then it is refined by a decoder, resulting in a 256-dimensional feature map four times smaller than the original input. Finally, a 1x1 convolutional layer reduces the number of channels to the number of classes, and bilinear interpolation upsamples the feature map to the size of the original input. The softmax cross-entropy function is adopted to compute the final loss.
To allow the network to predict dilated affinity, we add an extra 1x1 convolutional layer on top of the feature map generated by the decoder. The new branch is parallel to the original segmentation branch. This parallel design is more sufficient to capture affinity information than . Bilinear resizing is still used, but the softmax operation is replaced with the sigmoid operation. The overall architecture is shown in Fig. 2.
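The two parallel branches can be illustrated with a minimal NumPy sketch, in which a 1x1 convolution is written as a per-pixel linear layer. This is an illustration under stated assumptions rather than the actual network code: the dilation rates `(1, 2, 4)`, the 33x33 spatial size, the random weights, and the choice of 8 affinity channels per dilation rate are hypothetical, and the bilinear upsampling step is omitted.

```python
import numpy as np


def conv1x1(features, weight, bias):
    """1x1 convolution on a (C_in, H, W) map, i.e. a per-pixel linear layer."""
    return np.einsum('oc,chw->ohw', weight, features) + bias[:, None, None]


def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
num_classes = 21           # PASCAL VOC: 20 foreground classes + background
rates = (1, 2, 4)          # hypothetical set of dilation rates
feat = rng.standard_normal((256, 33, 33))   # stand-in for the decoder output

# Segmentation branch: one channel per class, softmax over classes.
w_seg = rng.standard_normal((num_classes, 256)) * 0.01
b_seg = np.zeros(num_classes)
seg_prob = softmax(conv1x1(feat, w_seg, b_seg))

# Affinity branch: assumed 8 channels (directions) per dilation rate,
# each an independent binary prediction, hence sigmoid instead of softmax.
w_aff = rng.standard_normal((8 * len(rates), 256)) * 0.01
b_aff = np.zeros(8 * len(rates))
aff_prob = sigmoid(conv1x1(feat, w_aff, b_aff))
```

Both heads read the same decoder feature map, so the affinity supervision shapes the features used by the segmentation branch without changing its inference path.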
Dilated affinity learning resembles one-stage object detection in that both are required to output extra information in addition to semantic predictions, and both face severe sample imbalance on the additional output. Dilated affinity with small rates still suffers from signal imbalance, with positive signals making up most of the samples. To alleviate this imbalance, we resort to focal loss [16] and to reweighting different samples.
The form of the focal loss function is shown in Eq. (4), where $y$ is the ground truth of dilated affinity, $p$ is the estimated probability of positive affinity, and $\gamma$ is the focusing parameter, which is set to 2 in our experiments.

$$FL(p, y) = -y\,(1 - p)^{\gamma} \log p - (1 - y)\,p^{\gamma} \log(1 - p) \tag{4}$$
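As a quick illustration of how the focusing parameter down-weights easy samples, the binary focal loss of Lin et al. can be sketched in NumPy as follows. The function name `binary_focal_loss` and the clipping constant `eps` are choices made for this sketch, and the class-balancing factor of the original focal loss paper is left out.

```python
import numpy as np


def binary_focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Element-wise binary focal loss.

    p: estimated probability of positive affinity, y: 0/1 ground truth.
    With gamma = 0 this reduces to the ordinary binary cross-entropy.
    """
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    pos = -y * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - y) * p ** gamma * np.log(1.0 - p)
    return pos + neg


p = np.array([0.9, 0.6, 0.1])    # easy, moderate and hard positive samples
y = np.array([1.0, 1.0, 1.0])
fl = binary_focal_loss(p, y)             # gamma = 2, as in the experiments
ce = binary_focal_loss(p, y, gamma=0.0)  # plain cross-entropy for comparison
```

For a positive sample the ratio `fl / ce` equals `(1 - p) ** gamma`, so the well-classified sample with `p = 0.9` keeps only 1% of its cross-entropy loss, while the hard sample with `p = 0.1` keeps 81%.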
As for the weighting scheme, a straightforward approach would be to reweight samples based on the inverse frequency of the two signals, positive and negative. However, it is more appropriate to reweight based on independent samples rather than subdivided signals, and for a pixel in the boundary area, its positive affinity signals are as valuable as the negative ones. Thus we opt for a different solution. For each dilation rate, we divide pixels into nine categories based on the number of positive affinity signals from their eight neighbors. These categories are denoted as $c_0$ to $c_8$. We calculate the proportion of $c_0$ to $c_8$ for each dilation rate, and use their inverse frequencies as weights instead. The distributions of pixels for different datasets and different dilation rates are shown in Fig. 3.
Directly using the inverse frequency based on positive neighbors may result in absurdly large weights for samples of the rarest categories, and experiments indicate that this aggressive weighting scheme can damage the learning of semantic segmentation. Thus we propose to use the square root of the inverse frequency instead. The final form of the affinity loss is shown below.
In Eq. (5), is the frequency of for dilation rate , and is the amount of positive signal in . As it suggests, the weight of pixels from is always equal to . The affinity loss is multiplied by a weight parameter before being added to the total loss, and its value is selected via cross-validation. We find that it is crucial to assign a value large enough to gain improvements from the joint training procedure. However, too large a value can damage the learning of semantic segmentation, although the accuracy of dilated affinity may keep rising.
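The neighbor-based weighting can be sketched in NumPy as follows. This covers only the two steps described in the text, namely grouping pixels by their number of positive affinity signals and taking the square root of the inverse frequency. It is an assumption-laden illustration: the function name, the per-image (rather than per-dataset) frequency estimate, and the absence of any further normalization are choices of this sketch and are not taken from Eq. (5).

```python
import numpy as np


def sqrt_inverse_frequency_weights(affinity):
    """Per-pixel weights from the number of positive affinity signals.

    affinity: (8, H, W) binary ground truth for one dilation rate.
    Returns (weights, freq): an (H, W) weight map and the frequency of each
    of the nine categories (0 to 8 positive neighbors).
    """
    positives = affinity.sum(axis=0).astype(np.int64)   # values in 0..8
    counts = np.bincount(positives.ravel(), minlength=9)
    freq = counts / counts.sum()
    category_weight = np.zeros(9)
    seen = freq > 0                                     # skip empty categories
    category_weight[seen] = np.sqrt(1.0 / freq[seen])
    return category_weight[positives], freq


# Toy ground truth: every pixel agrees with all 8 neighbors except one pixel
# that disagrees with all of them.
affinity = np.ones((8, 4, 4), dtype=np.uint8)
affinity[:, 0, 0] = 0
weights, freq = sqrt_inverse_frequency_weights(affinity)
```

Here the single pixel with zero positive neighbors has frequency 1/16 and receives weight 4, whereas the plain inverse frequency would give it weight 16; the square root keeps rare categories emphasized without letting them dominate the loss.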
3.3 Refine segmentation with affinity propagation
As discussed in Sec. 1, it is more robust to classify a pixel based not only on its image patch, but also on the neighboring pixels and their affinity information. Also, when inspecting the manual annotation process, we realize that humans do not classify every pixel separately, since this is both laborious and unnecessary. What they actually do is recognize the majority of the pixels and then refine the annotation by considering the similarity between them, especially on edges.
Inspired by this observation, we decide to refine the classification of a pixel by adding an additional factor, which is proportional to the predictions of nearby pixels and the corresponding affinity. To fully express our intention, we define a general form of the refinement below and discuss the specific design in detail.
In Eq. (6), is the class prediction of pixel , and is the refined prediction. Both and are vectors. is a weight parameter whose value is selected by cross-validation. For a pixel in , we simplify its affinity as , and its prediction of different categories as . is the normalization function, which will make sure that the sum of is equal to 1. The max value of is utilized to keep confident predictions consistent with original predictions. Furthermore, because is always positive and may introduce noise when its value is small, we change the sigmoid operation of to a steeper version during the post-processing, as shown in Eq. (7).
This design forces small affinity values to zero and decreases the difference between high affinity signals. Note that this refining process can be executed multiple times, like CRF [14], propagating the original classifications through the connections of positive affinity. Unlike the edge merge process [19], our affinity propagation is faster and more robust, since there is no need to output instance results. The performance for different numbers of iterations is shown in Sec. 4.
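To make the idea of propagation concrete, the sketch below implements one plausible update rule in NumPy. It should be read as an assumption, not as Eq. (6) or Eq. (7): each pixel adds the affinity-weighted predictions of its 8 dilated neighbors, scaled by a weight `lam` and by `1 - max(prob)` so that confident predictions change little, and the result is renormalized to sum to 1. The function names, the zero padding at the image border and the exact form of the update are all hypothetical, and the steeper sigmoid is not included.

```python
import numpy as np

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]


def shift(arr, dy, dx):
    """out[..., y, x] = arr[..., y + dy, x + dx], zero where out of bounds."""
    out = np.zeros_like(arr)
    h, w = arr.shape[-2:]
    y0, y1 = max(0, -dy), min(h, h - dy)
    x0, x1 = max(0, -dx), min(w, w - dx)
    if y0 < y1 and x0 < x1:
        out[..., y0:y1, x0:x1] = arr[..., y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    return out


def propagate(prob, affinity, rate, lam=1.0, iterations=1):
    """Refine class probabilities with affinity (assumed update rule).

    prob: (C, H, W) class probabilities; affinity: (8, H, W) values in [0, 1]
    for a single dilation rate, ordered as in DIRECTIONS.
    """
    for _ in range(iterations):
        message = np.zeros_like(prob)
        for k, (dy, dx) in enumerate(DIRECTIONS):
            neighbor_prob = shift(prob, dy * rate, dx * rate)
            message += affinity[k][None] * neighbor_prob
        # Confident pixels (large maximum probability) get a smaller update.
        confidence = prob.max(axis=0, keepdims=True)
        prob = prob + lam * (1.0 - confidence) * message
        prob = prob / prob.sum(axis=0, keepdims=True)   # renormalize
    return prob


# Toy case: a 3x3 region of class 0 with one noisy prediction in the center.
prob = np.empty((2, 3, 3))
prob[0], prob[1] = 0.9, 0.1
prob[:, 1, 1] = [0.4, 0.6]
affinity = np.ones((8, 3, 3))            # all pairs predicted "same class"
refined = propagate(prob, affinity, rate=1)
```

In the toy case, the isolated center pixel is pulled back to class 0 by its eight agreeing neighbors, while the confident surrounding pixels keep their labels. With zero affinity everywhere the update leaves the predictions unchanged.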
4.1 Experimental setup
We use ResNet-101 [12] as the backbone, which is only pretrained on ImageNet [8]. The parameters in the ASPP module and the decoder are randomly initialized. Biases of the affinity layers are initialized as $b = -\log((1 - \pi)/\pi)$ following [16], where $\pi$ is set to the frequency of positive signals for the corresponding dilation rate.
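The focal loss paper of Lin et al. initializes the bias of the final layer so that the initial sigmoid output equals a chosen prior probability. Assuming the same scheme is meant here, a minimal sketch looks as follows; the function names and the value `0.9` for the frequency of positive signals are hypothetical.

```python
import math


def prior_bias(pi):
    """Bias b such that sigmoid(b) equals the prior probability pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


pi = 0.9                 # hypothetical frequency of positive affinity signals
b = prior_bias(pi)
initial_prediction = sigmoid(b)   # equals pi up to floating-point error
```

With near-zero initial weights, every affinity output then starts close to the empirical frequency of positive signals, which keeps the abundant easy positives from dominating the loss in the first training steps.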
All experiments are conducted on 4 GTX 1080Ti GPUs with TensorFlow [1]. Furthermore, we implement cross-GPU batch normalization to alleviate the unstable statistics caused by the small batch size on each GPU [23, 25].
Dataset. We evaluate our method on two datasets, PASCAL VOC 2012 [9] and Cityscapes [6]. PASCAL VOC 2012 is a semantic segmentation benchmark with 20 foreground classes and 1 background class. We utilize the extra annotations provided by [11], resulting in 10,582, 1,449 and 1,456 images for training, validation and testing, respectively. Cityscapes is a large-scale semantic segmentation dataset consisting mostly of street scenes from various cities. We train the model on the 2,975 images in the training set and test it on the 500 images in the validation set.
Learning rate and training steps. For PASCAL VOC 2012, we train the model for iterations with the crop size of 513 and the batch size of 16. As for Cityscapes, the training steps is
, the crop size changes to 769, and the batch size decreases to 12. The output stride, which is the ratio of the spatial resolution of the input image to that of the final output, is set to 16 during training. The remaining settings, such as the learning rate schedule and data augmentation, are the same as in DeepLabv3+.
Evaluation metric. The evaluation metric is the mean intersection-over-union (mIoU) score. The output stride changes to 8 during evaluation.
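For reference, mIoU can be computed from a confusion matrix as in the NumPy sketch below. The function name, the `ignore_label` value of 255 for unlabeled pixels, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions of this sketch; benchmark servers may handle these cases differently.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean intersection-over-union over the classes that occur.

    pred, target: integer label maps of the same shape.
    Pixels whose target equals ignore_label are excluded (assumed convention).
    """
    mask = target != ignore_label
    # Encode each (target, pred) pair as a single index into the matrix.
    idx = num_classes * target[mask].astype(np.int64) + pred[mask].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)   # rows: target, cols: pred
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0                             # skip absent classes
    return (intersection[present] / union[present]).mean()


target = np.array([[0, 0],
                   [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mIoU is 7/12.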
4.2 Ablation study on PASCAL VOC 2012
We set to to compare different choices of weighting scheme and loss functions. The baseline uses equal weights and the cross-entropy loss function. The other three weighting schemes, i.e., inverse frequency based on signals (signal-reweight), inverse frequency based on neighbors (neighbor-reweight), and the square root of the inverse frequency based on neighbors (sqrt-reweight), all adopt the focal loss function.
The sqrt-reweight scheme, which uses the square root of the inverse frequency, has the best performance, as shown in the table below. The value of the affinity loss weight selected by cross-validation for each weighting scheme is also shown, and we can see that the best value for neighbor-reweight is much smaller than the others due to the absurdly large weights on the rarest categories. This indicates the necessity of using the sqrt-reweight scheme.
Fig. 4 shows the accuracy of dilated affinity with respect to different weighting schemes and dilation rates. The accuracies of affinity, especially those of to , is important for our affinity propagation process. For to , neighbor-reweight has the best performance, while for to , sqrt-reweight and baseline achieve a better performance.
We investigate dilated affinity of various dilation rates with the sqrt-reweight scheme. Following the discussion in Sec. 3, we explore dilation rates of three aspect ratios, i.e., 1:1, 1:2 and 2:1. The first column of the following table denotes the rates. In the brackets, a scalar like is short for , while a tuple like represents rates of as well as .
We use the results achieved by the dilated affinity of and to test our affinity propagation process. We also provide the outcome when we use the ground truth of dilated affinity to refine segmentation results as a supplement. is set to 6 according to cross-validation.
|Refine with predicted affinity||78.96%||79.03%||79.07%||79.10%||79.15%|
|Refine with ground truth||81.16%||81.58%||81.94%||82.22%||82.53%|
|Refine with predicted affinity||78.62%||78.91%||79.09%||79.17%||79.21%|
|Refine with ground truth||82.33%||82.94%||83.10%||83.37%||83.59%|
Experiments show that dilated affinity with small rates benefits the joint training procedure but is less effective in the affinity propagation stage, while dilated affinity with large rates is useful in the propagation stage but may interfere with the learning of semantic segmentation. A good choice is a tradeoff between these two factors.
4.3 Experiments on PASCAL VOC 2012
The best result of our method uses , , and the sqrt-reweight scheme. We compare it with other methods focusing on the utilization of structural information. The experimental settings of the different methods are consistent with the best case in their papers. The mIoU score of our implementation of DeepLabv3+ is 1.42% lower than the one reported in the original paper [5].
4.4 Experiments on Cityscapes
When evaluating the proposed algorithm on the Cityscapes dataset, we adopt the best setting on PASCAL VOC, which is , and 10 iterations of affinity propagation. We did not perform an exhaustive search over these hyperparameters, as this part is only meant to show the applicability of the method.
|DeepLabv3+ Dilated Affinity||78.70%|
Our proposed extra learning of dilated affinity information can consistently improve the current state-of-the-art method with minor extra cost. Dilated affinity can assist the original task of semantic segmentation from two aspects. First, joint training with dilated affinity helps the learning process of semantic segmentation. Second, segmentation results can be refined with a fast affinity propagation post-processing, which exploits the extra information generated by the network. Different choices of how to learn the extra information are fully explored in our paper and the learned hyperparameters can be extended to other datasets.
- [1] Martín Abadi, Ashish Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
- [2] Samuel Rota Bulò, Gerhard Neuhold, and Peter Kontschieder. Loss max-pooling for semantic image segmentation. In CVPR, pages 7082–7091, 2017.
- [3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.
- [4] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
- [5] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR, abs/1802.02611, 2018.
- [6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
- [7] Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. Pixellink: Detecting scene text via instance segmentation. In AAAI, 2018.
- [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
- [9] Mark Everingham, S. M. Ali Eslami, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
- [10] Golnaz Ghiasi and Charless C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
- [11] Bharath Hariharan, Pablo Arbelaez, Lubomir D. Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [13] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X. Yu. Adaptive affinity fields for semantic segmentation. In ECCV, 2018.
- [14] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
- [15] Di Lin, Yuanfeng Ji, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Multi-scale context intertwining for semantic segmentation. In ECCV, 2018.
- [16] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
- [17] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In NIPS, pages 1519–1529, 2017.
- [18] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. CoRR, abs/1506.04579, 2015.
- [19] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. Affinity derivation and graph merge for instance segmentation. In ECCV, 2018.
- [20] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Semantic image segmentation via deep parsing network. CoRR, abs/1509.02634, 2015.
- [21] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [22] Michael Maire, Takuya Narihira, and Stella X. Yu. Affinity CNN: learning pixel-centric pairwise relations for figure/ground embedding. In CVPR, pages 174–182, 2016.
- [23] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. In CVPR, pages 6181–6189, 2018.
- [24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- [25] Yuxin Wu and Kaiming He. Group normalization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pages 3–19, 2018.
- [26] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2016.
- [27] Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.
- [28] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
- [29] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.