Semantic segmentation of an image refers to the task of assigning each pixel a categorical label, e.g., motorcycle or person Zhao et al. (2017). Owing to the rapid development of deep learning, tremendous progress has been made on fully-annotated images. Notable examples include DeepLab Chen et al. (2018) and PSPNet Zhao et al. (2017). These methods assume that pixel-level labels are readily available. However, this assumption is over-optimistic, since pixel-level annotation is expensive in both cost and labor time. As a result, weakly-supervised semantic segmentation, which requires only weak labels such as bounding boxes Dai et al. (2015), scribbles Lin et al. (2016), points Bearman et al. (2016), and tags Long et al. (2015), has attracted increasing attention.
This paper focuses on weakly-supervised semantic segmentation with image-level tags. In this setting, the lack of structure information on the organization of image pixels prevents the network from learning from the labels directly, which makes the problem of weakly-supervised semantic segmentation ill-conditioned. Consequently, how to recover the cross-image structure information and the inner-image structure information becomes a pivotal issue. The cross-image structure information describes how to organize pixels for a specific semantic category across multiple images. For example, given this kind of information for the person category, we can cluster pixels (in multiple images) into local regions (e.g., heads and faces) that are discriminative for recognizing a person. Unlike cross-image structure information, the inner-image structure information tells us how to organize pixels (in a single image) based on their low-level features such as textures or colors. This kind of information always contains details about object boundaries.
In this paper, we propose the Dual-Feedback Network (DFN), a closed-loop system with two feedback chains, for weakly-supervised semantic segmentation. The architecture of DFN is shown in Figure 1. Our general idea is to adopt a divide-and-conquer strategy to recover both cross-image and inner-image structure information, with each feedback chain compensating for the lack of one type of structure information. The first chain updates the labels progressively to correct errors made by static pseudo-labels (see Figure 1(c)). Specifically, we generate initial seeds using Class Activation Mapping (CAM) Zhou et al. (2016) and use them as the pseudo-label to train the network in the first few epochs. After that, the network output is routed back to the initial seeds to amend their probability distributions. The first feedback chain repeats this procedure for several iterations until the seeds no longer change, whereas the second feedback chain uses a customized random walk to integrate inner-image structure information, characterized by a relationship matrix built upon superpixels, into network training (see Figure 1(d)). Our method demonstrates state-of-the-art performance on the PASCAL VOC 2012 segmentation dataset.
To summarize, our contributions are threefold:
We interpret weakly-supervised semantic segmentation as a closed-loop problem, and introduce two feedback chains to recover, respectively, the cross-image structure information and the inner-image structure information. We also demonstrate that the new network, i.e., DFN, can be trained end-to-end.
We construct a relationship matrix based on superpixels to recover inner-image structure information. This method is computationally and memory efficient, and robust to noisy labels.
Our method outperforms existing weakly-supervised semantic segmentation methods with image-level tags. In particular, the mean Intersection-Over-Union (mIoU) values of our method are 60.0% and 61.1% on the val and test sets, respectively.
2 Related Work
Compared with fully-supervised semantic segmentation, the main issue for weakly-supervised semantic segmentation is the lack of structure information that tells us how to organize the pixels. This problem becomes more serious if only image-level tags are available. Existing methods for weakly-supervised semantic segmentation with image-level tags can be divided into two categories.
The first category uses a specialized loss function or objective to recover the structure information. Pathak et al. (2014) initially suggest a multi-instance loss (MIL) function to recognize the most discriminative region for each category. Later, Pathak et al. (2015) add linear constraints to an iterative updating process to restrict the structure information of the output, leading to an improved segmentation result. On the other hand, Durand et al. (2017) propose a weakly-supervised learning transfer layer to discover complementary regions for the subcategories that belong to the same category, whereas Li et al. (2018) build a guided attention inference network and use an attention mining loss to constrain the output. However, methods of this kind restrict only a small portion of the pixels in the network output.
Different from the first category, the methods in the second category generate pseudo-labels to incorporate the structure information, and use the newly generated labels to constrain all of the pixels in the network output. For example, Kolesnikov and Lampert (2016) obtain the pseudo-label using CAM, Saleh et al. (2016) use the activations from the middle layers of VGG-16 Simonyan and Zisserman (2015) to generate candidate regions, whereas Hou et al. (2017) set the attention map of each image as its label. Wei et al. (2017b) propose a simple-to-complex mechanism to generate the pseudo-label iteratively. The authors then apply an erasing mechanism in their network structure and discover new regions to expand the truth area in the pseudo-label Wei et al. (2017a). Despite the substantial progress made by these methods, they always use a static pseudo-label, which is likely to accumulate errors incurred by inaccurate initial seeds or by a dense conditional random field (DenseCRF) model Krahenbuhl and Koltun (2011) that is sensitive to noisy labels.
In this paper, we introduce two feedback chains into an existing weakly-supervised semantic segmentation network to form a closed-loop system, which enables iterative extraction of robust structure information and dynamic correction of errors made by inaccurate seed localization. Specifically, the first feedback chain updates the seeds with the network output to recover the cross-image structure, whereas the second feedback chain incorporates robust inner-image structure information, characterized by a relationship matrix built upon superpixels, to further expand seed regions. It is worth noting that deep seeded region growing (DSRG) Huang et al. (2018) also uses dynamic seeds. However, it does not explicitly correct inaccurate seed localization, which constrains the upper bound of segmentation performance. On the other hand, Mining Common Object Features (MCOF) Wang et al. (2018) also suggests using superpixels to robustly rebuild inner-image structure information in the presence of noisy labels. Unlike MCOF, we use a simple relationship matrix to characterize superpixels, rather than an extra network that imposes a significantly higher computational and memory burden. Moreover, experiments demonstrate that our network outperforms both DSRG and MCOF by a large margin.
3 Our Approach
In this section, we elaborate on our proposed DFN, especially the way to recover robust inner-image structure information with the aid of the relationship matrix.
Similar to Kolesnikov and Lampert (2016); Wang et al. (2018); Huang et al. (2018), we use the DeepLab-LargeFOV network Chen et al. (2018) as the baseline network and build two feedback chains as shown in Figure 2. Specifically, we first generate initial seeds using CAM and the saliency detection method in Jiang et al. (2013). Then, in the second feedback chain, we integrate the robust inner-image structure information characterized by a relationship matrix built upon superpixels to produce the mixed seeds. Next, the baseline network is trained with the obtained mixed seeds under the seed loss function Huang et al. (2018) and the boundary loss function Kolesnikov and Lampert (2016). Using the network output, we update the seeds progressively along the first feedback chain. We repeat the above procedures until the seeds no longer change.
3.2 The First Feedback Chain
Although the seed cues localized by CAM or saliency maps are discriminative and precise, they are sparsely distributed and still contain tiny errors. To correct these errors, we adopt an updating mechanism to adjust seeds progressively. Specifically, in each iteration, we update the seed locations with the network output using the following simple formula:
where denotes seeds in the current iteration, denotes seeds from the last iteration, is the network output, and is a weighting factor that balances the update rate.
One should note that there are two key points behind Eq. (1). First, to preserve the boundary information and reduce the memory cost, the seeds are stored based on superpixels, rather than 41 × 41 pixel blocks as in Wei et al. (2017b) and Huang et al. (2018). Second, in the interval between two subsequent seed iterations, the network is trained independently with epochs. Moreover, we monitor the seed states in different iterations to determine whether to continue the update. Specifically, each superpixel in the seeds is characterized by a vector whose elements correspond to the probabilities of each category. Then, for one superpixel, we view it as unchanged if the sum of absolute differences between the elements of its vectors in two subsequent iterations is less than. We stop the seed update if of superpixels are unchanged. Figure 3 shows the evolution of seeds at different epochs. As can be seen, as the number of iterations increases, initial errors are corrected and the dynamic seeds get closer to the ground truth.
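The update and stopping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blend is taken to be a convex combination of the previous seeds and the network output, and the names `update_seeds`, `seeds_converged`, `rate`, `tol`, and `min_unchanged_ratio` are hypothetical placeholders, not the paper's notation or values. Each superpixel is represented as a list of per-category probabilities.

```python
from typing import List

Vector = List[float]


def update_seeds(prev_seeds: List[Vector], net_output: List[Vector], rate: float) -> List[Vector]:
    """Blend the previous seeds with the network output, superpixel by superpixel.

    ASSUMPTION: a convex combination (1 - rate) * prev + rate * output.
    `rate` stands in for the weighting factor that balances the update rate.
    """
    return [
        [(1.0 - rate) * s + rate * o for s, o in zip(seed_vec, out_vec)]
        for seed_vec, out_vec in zip(prev_seeds, net_output)
    ]


def seeds_converged(prev_seeds: List[Vector], new_seeds: List[Vector],
                    tol: float, min_unchanged_ratio: float) -> bool:
    """Return True when enough superpixels are unchanged between two iterations.

    A superpixel counts as unchanged if the sum of absolute differences
    between its two probability vectors is below `tol` (hypothetical name).
    """
    unchanged = sum(
        1
        for p, n in zip(prev_seeds, new_seeds)
        if sum(abs(a - b) for a, b in zip(p, n)) < tol
    )
    return unchanged >= min_unchanged_ratio * len(new_seeds)


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.5, 0.5]]
    out = [[0.0, 1.0], [0.5, 0.5]]
    new = update_seeds(prev, out, rate=0.5)
    print(new, seeds_converged(prev, new, tol=0.1, min_unchanged_ratio=0.5))
```

Under this assumed form, a rate close to one lets the seeds follow the network output almost entirely, while a rate close to zero keeps them near the initial seeds.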
3.3 The Second Feedback Chain
3.3.1 Robust Inner-Image Structure Information Recovery
Previous works use DenseCRF to build the relationships among pixels to recover the inner-image structure information. Unfortunately, DenseCRF relies heavily on the color and location of each pixel, which makes it sensitive to noisy pixels. To address this limitation, we recover inner-image structure information based on superpixels generated by Felzenszwalb’s method Felzenszwalb and Huttenlocher (2004) and refined by a Region Adjacency Graph Tremeau and Colantoni (2000) (see the first row of Figure 4 as an example).
The construction of relationship matrix consists of four steps. First, we generate Euclidean distance matrix for superpixels using zoom-out features Mostajabi et al. (2015). However, unlike Mostajabi et al. (2015), we only use the first two convolutional layers in VGG-16 to obtain low-level features. Nevertheless, owing to the diversity of objects under complex backgrounds in each image, the Euclidean distance with absolute values cannot precisely reflect the similarity between any two superpixels. Therefore, we then transform the Euclidean matrix to a simple similarity matrix (denoted ), as shown in Figure 4(e). Specifically, we pick the smallest values in each row of the Euclidean matrix and set their values to , indicating that they are mutually similar. We then set the values of remaining elements in each row of the Euclidean matrix to , suggesting that they are not similar. Next, we construct an adjacency matrix (denoted ) for superpixels to incorporate their location information. If two superpixels are neighboring, the corresponding element in the adjacency matrix is , otherwise (see Figure 4(f)). We finally obtain a relationship matrix (denoted ) with the following formula (see Figure 4(f)):
where denotes entry-wise product.
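The construction steps above can be illustrated with a small Python sketch. It makes several assumptions that are not taken from the paper: similar and neighboring entries are encoded as 1 and all others as 0, each superpixel is treated as its own neighbor, and the number of nearest entries kept per row is passed in as a hypothetical parameter `k`. The feature vectors here are toy values standing in for the low-level zoom-out features.

```python
import math
from typing import List

Matrix = List[List[float]]


def pairwise_distances(features: List[List[float]]) -> Matrix:
    """Euclidean distance between every pair of superpixel feature vectors."""
    return [[math.dist(a, b) for b in features] for a in features]


def similarity_matrix(dist: Matrix, k: int) -> Matrix:
    """Mark the k smallest distances in each row as similar (1.0), the rest as 0.0.

    ASSUMPTION: binary 1/0 encoding; `k` is a hypothetical parameter name.
    """
    sim = []
    for row in dist:
        nearest = set(sorted(range(len(row)), key=lambda j: row[j])[:k])
        sim.append([1.0 if j in nearest else 0.0 for j in range(len(row))])
    return sim


def relationship_matrix(sim: Matrix, adj: Matrix) -> Matrix:
    """Entry-wise product of the similarity matrix and the adjacency matrix."""
    return [[s * a for s, a in zip(s_row, a_row)] for s_row, a_row in zip(sim, adj)]


if __name__ == "__main__":
    # Three superpixels with one-dimensional toy features.
    features = [[0.0], [1.0], [5.0]]
    # Superpixels 0-1 and 1-2 touch; the diagonal is set to 1 by assumption.
    adjacency = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
    sim = similarity_matrix(pairwise_distances(features), k=2)
    print(relationship_matrix(sim, adjacency))
```

Because the nearest-neighbor selection is done per row, the resulting similarity matrix need not be symmetric; whether the original method symmetrizes or normalizes it is not specified here.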
3.3.2 Customized Random Walk
After recovering the inner-image structure information characterized by a relationship matrix built upon superpixels, the second feedback chain performs a customized random walk to incorporate both the inner-image structure information and the cross-image structure information into network training.
The input to our customized random walk includes the relationship matrix, the dynamic seeds, and the network output. However, the dynamic seeds and the network output must be preprocessed first. The reason is that each superpixel in the seeds or the network output is characterized by a vector whose elements correspond to the probabilities of each category. If the value of the maximum element in the vector (of a superpixel) is small, assigning this superpixel to its corresponding category carries high classification uncertainty. Moreover, superpixels with high classification uncertainty will mislead the direction of seed expansion in the random walk. Therefore, it is necessary to filter out these superpixels first. To this end, we perform image thresholding on both the dynamic seeds and the network output. Specifically, we use a threshold function (denoted ) coupled with two hyperparameters and to distinguish foreground and background in the seeds. Similarly, we use another threshold function (denoted ) coupled with hyperparameters and to distinguish foreground and background in the network output. For example, if the maximum element of a superpixel belonging to the foreground in the dynamic seeds is less than , we drop this superpixel from the following customized random walk.
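A minimal sketch of this filtering step is given below. It is an illustration only: the function name `filter_confident`, the parameter names `fg_threshold` and `bg_threshold`, the choice of index 0 for the background category, and the decision to zero out dropped superpixels are all assumptions made for the example rather than details from the paper.

```python
from typing import List


def filter_confident(probs: List[List[float]], fg_threshold: float,
                     bg_threshold: float, background_index: int = 0) -> List[List[float]]:
    """Keep only superpixels whose most likely category is confident enough.

    ASSUMPTIONS: a separate threshold is applied depending on whether the most
    likely category is background or foreground, and a dropped superpixel is
    replaced by an all-zero vector so it contributes nothing downstream.
    """
    filtered = []
    for vec in probs:
        best = max(range(len(vec)), key=lambda c: vec[c])
        threshold = bg_threshold if best == background_index else fg_threshold
        if vec[best] >= threshold:
            filtered.append(list(vec))
        else:
            filtered.append([0.0] * len(vec))
    return filtered


if __name__ == "__main__":
    seeds = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]
    print(filter_confident(seeds, fg_threshold=0.7, bg_threshold=0.5))
```

The same function could be applied once to the dynamic seeds and once to the network output, each with its own pair of thresholds.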
By using the processed seeds as the initial state and the relationship matrix as the transition probability matrix, the basic random walk Blum et al. (2018) generates candidate segmentation regions with “”, where “” denotes matrix multiplication. However, to remove false segmentation, we use the network output to guide the refinement of result from the basic random walk process. Therefore, the result of our one-step customized random walk is given by:
where denotes entry-wise product. We repeat the operation of “” times on the initial state to obtain the -step random walk result. For example, the result of two-step customized random walk is given by:
We present two examples of our customized random walk in Figure 5. By merging the robust inner-image structure information into the dynamic seeds, the obtained mixed seeds fit well with object boundaries and expand confident regions.
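The customized random walk can be sketched as follows, assuming each step multiplies the relationship matrix with the current state and then takes the entry-wise product with the network output. The function names are hypothetical, the relationship matrix is used as given without normalization (an assumption), and the seeds and network output are taken to be already thresholded.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix multiplication of an (n x m) and an (m x c) matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Entry-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def customized_random_walk(relation: Matrix, seeds: Matrix,
                           net_output: Matrix, steps: int) -> Matrix:
    """Propagate seeds over the relationship matrix, gated by the network output.

    ASSUMPTION: one step is (relation . state) followed by an entry-wise
    product with the network output, repeated `steps` times.
    """
    state = seeds
    for _ in range(steps):
        state = hadamard(matmul(relation, state), net_output)
    return state


if __name__ == "__main__":
    relation = [[1.0, 1.0], [0.0, 1.0]]      # 2 superpixels
    seeds = [[0.0, 0.0], [0.0, 1.0]]         # only superpixel 1 is seeded
    net_output = [[0.2, 0.8], [0.1, 0.9]]    # per-category probabilities
    print(customized_random_walk(relation, seeds, net_output, steps=2))
```

In this toy run, the unseeded superpixel 0 receives a label from its related neighbor, and the network output scales how much of that propagated label is kept.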
4.1 Experiment Setup
Dataset and Evaluation Metrics We evaluate the proposed DFN on the PASCAL VOC 2012 segmentation benchmark dataset Everingham et al. (2015), which contains 20 foreground object classes and one background class. The segmentation part of the VOC 2012 dataset is split into three parts: training (train, 1464 images), validation (val, 1449 images) and testing (test, 1456 images). As in Huang et al. (2018) and Wang et al. (2018), the training set is extended with additional images from Hariharan et al. (2011), resulting in an augmented set of 10582 images. Following common practice, we use the mean Intersection-Over-Union (mIoU) criterion to compare our method with other approaches on both the val and test sets. We report our results on the standard val set, for which the ground-truth segmentation masks are available. For the test set, we submit the results of our final best model to the official evaluation server.
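For reference, the mIoU criterion can be computed as in the sketch below. It is a simplified illustration, not the official evaluation code: labels are given as flat lists of class indices, pixels whose ground-truth label equals `ignore_label` are skipped (255 is assumed here as the void label), and classes that appear in neither the prediction nor the ground truth are left out of the mean.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int,
             ignore_label: int = 255) -> float:
    """Mean Intersection-over-Union over the classes that actually occur."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # void pixels do not count toward any class
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

In a benchmark setting the intersection and union counts would normally be accumulated over all images of the evaluation set before taking the per-class ratio.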
Implementation Details For all the experiments, we use the DeepLab-LargeFOV network Chen et al. (2018) as the baseline segmentation architecture, which is initialized from VGG-16 Simonyan and Zisserman (2015) pre-trained on ImageNet Deng et al. (2009). We use a mini-batch of images for SGD and initial learning rate of - which is decreased by a factor of every epochs. The momentum is , the dropout rate is and the total epochs of training is .
For the first feedback chain, we set to and start to update the seeds every epochs after the -th epoch. For the second feedback chain, we set to , to , to , and to . In addition, we perform two steps of customized random walk. In the test phase, as in Kolesnikov and Lampert (2016); Huang et al. (2018), the fully-connected CRF Krahenbuhl and Koltun (2011) and the multi-scale prediction Chen et al. (2018) are applied with their default parameters.
We use TensorFlow to implement our approach. The code will be released soon. The network can be trained on a single NVIDIA GTX 1080 Ti GPU in about hours.
4.2 Comparison with Previous Methods
Table 1 and Table 2 report the mIoU values of our method on PASCAL VOC 2012 val and test sets, respectively, against previous state-of-the-art methods, namely EM-Adapt Papandreou et al. (2015), STC Wei et al. (2017b), SEC Kolesnikov and Lampert (2016), MCOF Wang et al. (2018), and DSRG Huang et al. (2018). As can be seen, our method outperforms all compared methods in terms of mIoU value under the same experimental setting. Although MCOF Wang et al. (2018) uses an extra network to recover structure information with superpixels, our method achieves a performance gain of and on val and test sets, respectively. Meanwhile, the improvement is and respectively compared to DSRG Huang et al. (2018).
Computational and Memory Cost We summarize the number of parameters and the number of floating-point operations (FLOPS) in Table 3. Compared with DSRG and MCOF, our network only requires approximately of parameters and reduces at least FLOPS. This result suggests that our method is more computationally and memory efficient.
Different Supervision Types We compare our network with other methods under different types of supervision. These include FCN Long et al. (2015), DeepLab Chen et al. (2018), WSSL Papandreou et al. (2015), BoxSup Dai et al. (2015), RAWK Vernaza and Chandraker (2017), ScribbleSup Lin et al. (2016), and What’sPoint Bearman et al. (2016). As can be seen in Table 4, our network achieves performance comparable to other methods that require stronger supervision, e.g., WSSL, RAWK, or even the fully-supervised FCN. This result also suggests that there is a large performance gap between fully-supervised semantic segmentation and weakly-supervised semantic segmentation, especially for point supervision.
Qualitative Results The segmentation results shown in Figure 6 corroborate our quantitative evaluations. Notably, our method can generate precise segmentation results even for images containing complex backgrounds. However, our method is likely to fail when there are multiple small and dense objects on top of another larger object. Taking the last row of Figure 6 as an example, our method misclassifies most of the pixels of the table as background. One possible reason is that there are many plates and foods on the table, and the varied colors and textures of these small and dense objects make it hard for our method to generate a precise relationship matrix.
| Method | Params (M) | FLOPS (G) | mIoU |
|---|---|---|---|
| MCOF | about 40 | about 201 | 57.6 |
4.3 Ablation Studies
To validate the effects of different components, we perform ablation experiments under different settings and summarize the results in Table 5. Specifically, “baseline” denotes our baseline network without the two feedback chains, “baseline+F1” integrates only the first feedback chain into the baseline network, “baseline+F2” integrates only the second feedback chain into the baseline network, and “baseline+F1+F2” integrates both chains. We observe percent gain in mIoU using the first feedback chain. This suggests that our dynamic seeds update mechanism indeed reduces more errors than DSRG. Moreover, by comparing the mIoU value of “baseline+F2” with that of MCOF (), although both methods attempt to recover inner-image structure information based on superpixels, our relationship matrix achieves percent performance gain, with significantly less computational and memory burden.
| baseline + F1 | 88.0 | 57.8 | 80.0 |
| baseline + F2 | 88.2 | 58.4 | 80.3 |
| baseline + F1 + F2 | 89.3 | 60.0 | 81.4 |
4.4 Effects of Hyperparameters
We finally evaluate the performance of our network with respect to different hyperparameter settings and specify how their values are chosen. The hyperparameter in Eq. (1) represents the update rate in the first feedback chain. A large update rate prevents the dynamic seeds from supervising network training effectively, whereas a small update rate slows down the correction of errors incurred by inaccurate initial seeds. We go through possible values of the update rate in a large range and select the best-performing one. From Table 6, the optimal value of is .
In the second feedback chain, there are four hyperparameters, i.e., , , and . We set their values to , , and . These hyperparameters are tuned with a coarse-to-fine procedure. In the coarse module, we roughly determine a satisfactory range for each hyperparameter (for example, the range of is ). Then, in the fine-tuning module, we divide these parameters into two groups based on their correlations: (1) ; (2) . When we test values of one group of hyperparameters, the other group is set to default values, i.e., the mean value of the optimal range given by the coarse module. In each group, the hyperparameters are tuned with grid search (for example, is tuned at the range with an interval ). We finally pinpoint the specific value as the one that achieves the highest mIoU value (for example, the final value of is ). We also test the sensitivity of these hyperparameters. To this end, we evaluate the performance variation for one hyperparameter with the other three fixed. The results are shown in Figure 7. We can observe that, over a wide range, our network outperforms the baseline by a large margin. This means that our performance is not sensitive to these four hyperparameters. Moreover, the performance variation for and seems less than that for and . One possible reason is that the seeds are updated more slowly than the network output.
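The fine-tuning stage described above, in which one group of hyperparameters is searched on a grid while the other group stays at its defaults, can be sketched as follows. The function name, the placeholder hyperparameter names `a`, `b`, `c`, `d`, the candidate values, and the toy scoring function are all hypothetical; in practice `evaluate` would train and validate the network and return its mIoU.

```python
import itertools
from typing import Callable, Dict, List


def grouped_grid_search(evaluate: Callable[[Dict[str, float]], float],
                        groups: List[Dict[str, List[float]]],
                        defaults: Dict[str, float]) -> Dict[str, float]:
    """Tune each group with grid search while all other values stay at defaults.

    `groups` maps hyperparameter names to candidate values, one dict per group.
    Returns the best value found for every tuned hyperparameter.
    """
    best = dict(defaults)
    for group in groups:
        names = list(group)
        best_score, best_values = None, None
        for values in itertools.product(*(group[name] for name in names)):
            params = dict(defaults)          # the other group stays at defaults
            params.update(zip(names, values))
            score = evaluate(params)
            if best_score is None or score > best_score:
                best_score, best_values = score, values
        best.update(zip(names, best_values))
    return best


if __name__ == "__main__":
    # Toy objective with a known optimum, standing in for validation mIoU.
    target = {"a": 0.3, "b": 0.7, "c": 0.5, "d": 0.1}
    score = lambda p: -sum((p[k] - target[k]) ** 2 for k in target)
    groups = [{"a": [0.1, 0.3, 0.5], "b": [0.5, 0.7]},
              {"c": [0.4, 0.5], "d": [0.1, 0.2]}]
    defaults = {"a": 0.4, "b": 0.4, "c": 0.4, "d": 0.4}
    print(grouped_grid_search(score, groups, defaults))
```

Searching the two groups separately keeps the number of evaluations at the sum, rather than the product, of the two grid sizes.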
5 Conclusion and Future Work
In this paper, we designed a closed-loop network architecture, namely the Dual-Feedback Network, for weakly-supervised semantic segmentation with image-level tags by introducing two feedback chains. The proposed network can correct the errors made by inaccurate initial seed localization with a seed updating mechanism and increase the robustness to noisy labels by incorporating inner-image structure information from superpixels. Experiments on PASCAL VOC 2012 show that our network outperforms previous state-of-the-art methods under the same experimental conditions. Our method is also much more computationally and memory efficient. Moreover, it is robust to different parameter settings.
In the future, we will continue improving the performance of DFN. This is because our prediction results in some examples still suffer from non-smooth boundaries or inconsistent small holes. At the same time, we are also interested in applying generative adversarial nets Goodfellow et al. (2014) to recover the missing structure information.
- Bearman et al.  Amy Bearman, Olga Russakovsky, and Vittorio Ferrari, et al. What’s the point: Semantic segmentation with point supervision. ECCV, pages 549–565, 2016.
- Blum et al.  A. Blum, J. Hopcroft, and R. Kannan. Foundations of data science. [Online]. Available: https://www.cs.cornell.edu/jeh/book.pdf, 2018.
- Chen et al.  Liangchieh Chen, George Papandreou, and Iasonas Kokkinos, et al. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE T-PAMI, 40(4):834–848, 2018.
- Dai et al.  Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. ICCV, pages 1635–1643, 2015.
- Deng et al.  Jia Deng, Wei Dong, and Richard Socher, et al. Imagenet: A large-scale hierarchical image database. CVPR, pages 248–255, 2009.
- Durand et al.  Thibaut Durand, Taylor Mordan, and Nicolas Thome, et al. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. CVPR, pages 5957–5966, 2017.
- Everingham et al.  Mark Everingham, S M Eslami, and Luc Van Gool, et al. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
- Felzenszwalb and Huttenlocher  Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004.
- Goodfellow et al.  Ian J Goodfellow, Jean Pouget-Abadie, and Mehdi Mirza, et al. Generative adversarial nets. NeurIPS, pages 2672–2680, 2014.
- Hariharan et al.  Bharath Hariharan, Pablo Arbelaez, and Lubomir D Bourdev, et al. Semantic contours from inverse detectors. ICCV, pages 991–998, 2011.
- Hou et al.  Qibin Hou, Daniela Massiceti, and Puneet Kumar Dokania, et al. Bottom-up top-down cues for weakly-supervised semantic segmentation. CVPR Workshop on Energy Minimization Methods, pages 263–277, 2017.
- Huang et al.  Zilong Huang, Xinggang Wang, and Jiasi Wang, et al. Weakly-supervised semantic segmentation network with deep seeded region growing. CVPR, pages 7014–7023, 2018.
- Jiang et al.  Huaizu Jiang, Jingdong Wang, and Zejian Yuan, et al. Salient object detection: A discriminative regional feature integration approach. CVPR, pages 2083–2090, 2013.
- Kolesnikov and Lampert  Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. ECCV, pages 695–711, 2016.
- Krahenbuhl and Koltun  Philipp Krahenbuhl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. NeurIPS, pages 109–117, 2011.
- Li et al.  Kunpeng Li, Ziyan Wu, and Kuanchuan Peng, et al. Tell me where to look: Guided attention inference network. CVPR, pages 9215–9223, 2018.
- Lin et al.  Di Lin, Jifeng Dai, and Jiaya Jia, et al. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. CVPR, pages 3159–3167, 2016.
- Long et al.  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CVPR, pages 3431–3440, 2015.
- Mostajabi et al.  Mohammadreza Mostajabi, Payman Yadollahpour, and Gregory Shakhnarovich. Feedforward semantic segmentation with zoom-out features. CVPR, pages 3376–3385, 2015.
- Papandreou et al.  George Papandreou, Liangchieh Chen, and Kevin P Murphy, et al. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. arXiv, 2015.
- Pathak et al.  Deepak Pathak, Evan Shelhamer, and Jonathan Long, et al. Fully convolutional multi-class multiple instance learning. arXiv, 2014.
- Pathak et al.  Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. ICCV, pages 1796–1804, 2015.
- Saleh et al.  Fatemeh Sadat Saleh, Mohammad Sadegh Aliakbarian, and Mathieu Salzmann, et al. Built-in foreground/background prior for weakly-supervised semantic segmentation. ECCV, 9912:413–432, 2016.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
- Tremeau and Colantoni  Alain Tremeau and Philippe Colantoni. Regions adjacency graph applied to color image segmentation. IEEE T-IP, 9(4):735–744, 2000.
- Vernaza and Chandraker  Paul Vernaza and Manmohan Chandraker. Learning random-walk label propagation for weakly-supervised semantic segmentation. CVPR, pages 2953–2961, 2017.
- Wang et al.  Xiang Wang, Shaodi You, and Xi Li, et al. Weakly-supervised semantic segmentation by iteratively mining common object features. CVPR, pages 1354–1362, 2018.
- Wei et al. [2017a] Yunchao Wei, Jiashi Feng, and Xiaodan Liang, et al. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. CVPR, pages 6488–6496, 2017.
- Wei et al. [2017b] Yunchao Wei, Xiaodan Liang, and Yunpeng Chen, et al. Stc: A simple to complex framework for weakly-supervised semantic segmentation. IEEE T-PAMI, 39(11):2314–2320, 2017.
- Zhao et al.  Hengshuang Zhao, Jianping Shi, and Xiaojuan Qi, et al. Pyramid scene parsing network. CVPR, pages 6230–6239, 2017.
- Zhou et al.  Bolei Zhou, Aditya Khosla, and Agata Lapedriza, et al. Learning deep features for discriminative localization. CVPR, pages 2921–2929, 2016.