Scene parsing is of critical importance in mid- and high-level vision. Many machine vision applications build on detailed and accurate scene analysis and segmentation, e.g. outdoor driving systems [Ess et al.2009; Wan et al.2014] and image editing [Tsai et al.2016]. To support the recognition of objects in common scenes, datasets with careful and comprehensive labeling, such as PASCAL-Context [Mottaghi et al.2014] and ADE 20K [Zhou et al.2017], have been put forward to the community at great expense. Compared with image semantic segmentation, where many semantic regions are coarsely merged into a unified background class, pixel-wise annotation for scene parsing datasets requires much more effort because more fine-grained regions (e.g. wall and ground) need to be specified manually. Though current state-of-the-art methods such as PSPNet [Zhao et al.2017] and Deeplab [Chen et al.2018] have already obtained impressive performance on these datasets, the limitations of relying on densely annotated data will be magnified as dozens of new scenarios, e.g. autonomous driving in the wild and drone navigation, come over the horizon. It is both time consuming and expensive to carefully annotate a brand new dataset for each specific application.
Attempts have been made to alleviate the annotation burden by introducing multiple weakly supervised formats, e.g. image-level supervision [Pathak et al.2014], box supervision [Dai, He, and Sun2015] and scribble supervision [Lin et al.2016]. Meanwhile, as suggested in [Bearman et al.2016], annotating a single pixel for each instance is a natural scheme for human reference, and it greatly reduces the annotation cost. However, Bearman's work tackles the easier task of semantic segmentation and focuses more on analyzing the effectiveness of the point-based scheme itself, by analyzing the annotation quality and taking the interaction with annotators as an important factor in the loss function. We believe that the potential of point-based supervision has not been well explored, and that the more challenging task of point-based scene parsing has remained untouched.
Thus, this paper serves as an initial attempt to explore the possibility of point-guided weakly supervised scene parsing. Given only one semantically annotated point per instance, we propose a novel point-based distance metric learning method (PDML) to tackle this challenging task. PDML leverages the semantic relationship among the annotated points by encouraging the feature representations of intra- and inter-category points to stay consistent. Points within the same category are optimized to share more similar feature representations and, conversely, features of points from different categories are optimized to be more distinct. We implement this optimization with a distance metric loss, which works together with the point-wise cross-entropy loss to optimize the whole deep neural network. More importantly, unlike current weakly supervised methods whose solutions are constrained to a single image, we conduct distance metric learning across different training images, so that the limited human-annotated points can be fully exploited. Extensive experiments are performed on two challenging scene parsing benchmarks: PASCAL-Context and ADE 20K. We achieve an mIoU of 30.0% on PASCAL-Context while using only a small fraction of the annotated pixels, which compares favorably with the 39.6% obtained by the fully supervised scheme. On ADE 20K we achieve an mIoU of 19.6%, while SegNet [Badrinarayanan, Kendall, and Cipolla2017] produces roughly 21% under full annotation of this dataset.
In summary, the major contributions of this paper are as follows:
- We are the first to deal with the task of point-based weakly supervised scene parsing.
- We propose a novel deep metric learning method, PDML, which optimizes the intra- and inter-category embedding feature consistency among the annotated points.
- PDML is performed across different training images to fully exploit the limited annotations, in contrast to traditional intra-image methods.
- Our method has competitive performance both qualitatively and quantitatively on the PASCAL-Context and ADE 20K scene parsing datasets.
Weakly Supervised Semantic Segmentation
Dynamic prediction of foreground and background with an Expectation-Maximization algorithm has been proposed. [Kolesnikov and Lampert2016] introduce a new loss function built on the principles of seed, expand and constrain. [Wei et al.2017b] utilize a simple-to-complex framework to further improve the performance. Object region mining and localization are explored in [Wei et al.2016], [Wei et al.2017a], [Zhang et al.2018b] and [Hou et al.2018], and [Wei et al.2018] discuss the effect of dilated convolution in this task. However, the methods listed above all focus on object-based semantic segmentation, and no attempts have been made to deal with the more difficult task of scene parsing. Recently, [Zhang et al.2018a] tackle image-description guided scene parsing, but focus on parsing a scene image into a structured configuration.
Compared to image-level annotations, where no explicit location information is given, multiple attempts have been made to provide various forms of region-specific semantic supervision. Annotated bounding boxes are utilized in [Dai, He, and Sun2015] and [Papandreou et al.2015]. [Lin et al.2016] use scribbles as the supervision information and optimize with graphical models. Furthermore, [Bearman et al.2016] have a setup similar to ours but target semantic segmentation. They also focus more on analyzing the point supervision regime itself, comparing its annotation time, error rate and quality with other supervision regimes. Their method takes the confidence of annotators as a parameter in the loss function and transfers image-level annotations to objectness priors by utilizing another model pretrained on non-overlapping datasets. In comparison, we focus on the more difficult task of scene parsing. We do not use any additional data and concentrate on exploring cross-image semantic relations to boost the scene parsing performance.
Deep Metric Learning
Deep metric learning on embedding features has been explored in various tasks such as image query-and-retrieval [Oh Song et al.2016; Schroff, Kalenichenko, and Philbin2015] and verification [Ming et al.2017]. It has also been applied in semantic instance segmentation [Fathi et al.2017] and grouping [Kong and Fowlkes2018]. More recently, [Liu et al.2018] use metric learning as a fast and efficient way of training a semantic segmentation network.
Compared to the full, box and scribble supervision regimes, point-based supervision is the most natural for human reference and the easiest to obtain. However, as illustrated in Table 1, the number of annotated pixels in a single image is too small to train a neural network efficiently.
Considering the limited supervision information, and inspired by recent deep metric learning methods, e.g. [Liu et al.2018], we focus on exploring the relationship between feature representations of annotated pixels. While all current methods try to optimize embedding feature distances within a single image, we form triples from the embedding vectors of annotated pixels across images and optimize the feature consistency within the triples by distance metric learning. In each triple, two embedding vectors belong to the same category, and we call them a positive pair. The third vector is from a different category and forms a negative pair with one element of the positive pair. We minimize the distances between positive pairs and maximize those between negative pairs at the inter-image level. There are at least two reasons for doing so:
Objects and stuff that have similar feature representations before the classification module are more likely to be assigned to the same class. Conversely, embedding features from different categories, if distinct enough from each other, are easier for the classifier to distinguish.
Under the point-based regime, most annotated pixels in one image come from different categories, so a single image yields almost only negative pairs, and optimizing distances between negative pairs alone does not help training. Extending to the inter-image level yields a balanced number of positive and negative pairs.
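The second reason can be illustrated with a small counting sketch. The label lists below are hypothetical (one annotated point per instance, invented category names), and `count_pairs` is an illustrative helper rather than part of the method; it only shows that same-category pairs tend to appear once points are pooled over several images.

```python
from itertools import combinations


def count_pairs(labels):
    """Return (positive, negative) pair counts for a list of point labels."""
    pos = neg = 0
    for a, b in combinations(labels, 2):
        if a == b:
            pos += 1  # same category: positive pair
        else:
            neg += 1  # different categories: negative pair
    return pos, neg


# Hypothetical point labels, one per instance in each image.
image_a = ["sky", "road", "car", "tree"]
image_b = ["sky", "road", "building", "car"]

within_a = count_pairs(image_a)          # pairs inside a single image
pooled = count_pairs(image_a + image_b)  # pairs pooled over both images
print(within_a, pooled)
```

In this toy case the single image provides no positive pair at all, while pooling two images already provides three. How many images to pool per subgroup is left open here.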
Furthermore, unlike image-level weakly supervised methods, which rely heavily on saliency maps generated by specific pretrained models, and unlike scribble- and box-guided methods, which depend on iterative optimization, our method does not require any additional data and can be learned in an end-to-end manner.
We adopt ResNet101 [He et al.2016] pretrained on ImageNet as the backbone of our feature extractor. Atrous convolutions are adopted to increase the receptive field and reduce the degree of signal downsampling. Given a batch of input images, we first use the feature extractor to form deep embedding features. The features are then passed to two streams. The first stream is fed into the point-based distance metric learning module, where the embedding feature vectors across different images are optimized towards representation consistency. The second stream is fed into a fully convolutional classification module to generate the pixel-wise prediction. At the same time, online extension is performed to dynamically gather pixels with high classification confidence and a tight spatial relationship to the original annotated pixels, forming the extended label. The point-wise cross-entropy loss is calculated between the classification results and the two labels (the original point label and the extended label).
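Point supervision (PointSup) serves as the baseline of our method. Let $\mathcal{D}$ denote the set of training images with point-wise annotations. Then we have $\mathcal{D} = \{(I_i, Y_i)\}_{i=1}^{N}$, where $I_i$ is the $i$th image in the training set and $Y_i$ is the corresponding pseudo annotated mask, in which only several pixels are annotated with semantic labels. Let $c(f(\cdot;\theta_f);\theta_c)$ denote our segmentation module, where $f$, $\theta_f$, $c$ and $\theta_c$ refer to the feature extraction module and its parameters, and the classification module and its parameters, respectively. The objective function is
$$\min_{\theta_f,\theta_c} \sum_{i=1}^{N} \mathcal{L}_{pce}\big(c(f(I_i;\theta_f);\theta_c),\, Y_i\big).$$
Consider the class label set of the current image as $\mathcal{C}_i$, and let $p_u(k)$ be the conditional probability generated by the classification module for any label $k \in \mathcal{C}_i$ at any location $u$; then we get
$$\mathcal{L}_{pce} = -\frac{1}{n_i} \sum_{u \in Y_i} \log p_u\big(Y_i(u)\big),$$
where $n_i$ refers to the number of annotated pixels in $Y_i$ and the sum runs over those pixels.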
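As a minimal sketch of the point-wise cross-entropy described above, the following pure-Python function averages the negative log-likelihood over annotated pixels only. The `IGNORE` marker, the flat per-pixel list layout and the function names are assumptions made for illustration; the paper's actual implementation is not shown in the text.

```python
import math

IGNORE = -1  # hypothetical marker for pixels without a point annotation


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointwise_cross_entropy(logits, labels):
    """Average negative log-likelihood over annotated pixels only.

    logits: one list of class scores per pixel.
    labels: one class index per pixel, or IGNORE for unannotated pixels.
    """
    losses = [
        -math.log(softmax(pixel_scores)[label])
        for pixel_scores, label in zip(logits, labels)
        if label != IGNORE
    ]
    if not losses:  # no annotated pixel in this image
        return 0.0
    return sum(losses) / len(losses)


# Three pixels, three classes; only the first and last pixel are annotated.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, IGNORE, 1]
print(pointwise_cross_entropy(logits, labels))
```

Unannotated pixels contribute nothing to the loss, which is why so few supervised pixels per image make training with this term alone difficult.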
Distance Metric Learning
To keep the feature representations of annotated pixels consistent, we aim at minimizing distances between positive pairs and maximizing those between negative pairs. However, it is hard to find positive pairs within a single image, and optimizing only negative pairs does not help and makes the loss hard to converge. By extending to the inter-image level, we can obtain a balanced number of the two kinds of pairs, as illustrated in Figure 2. Let $\mathcal{S}$ be a subgroup of the training set $\mathcal{D}$, i.e. $\mathcal{S} \subseteq \mathcal{D}$. For each image $I_i$ in the subgroup, we define the embedding vector set $E_i = \{e_i^1, \dots, e_i^{n_i}\}$, where $e_i^j$ is the feature vector corresponding to the $j$th annotated pixel in image $I_i$. For three different feature vectors $e_a$, $e_p$, $e_n$ drawn from these sets, where $e_a$ shares the same category with $e_p$ and differs in category from $e_n$, we apply the loss
$$\mathcal{L}_{PDML} = \lambda_c \mathcal{L}_c + \lambda_t \mathcal{L}_t,$$
where $\mathcal{L}_c$, corresponding to the red dotted line in Figure 2(b) (i.e. $(e_a, e_p)$), can be expressed as
$$\mathcal{L}_c = \lVert e_a - e_p \rVert_2,$$
which aims at minimizing the L2-norm distance between same-category embedding vectors. $\mathcal{L}_t$, corresponding to the combination of one gray and one red dotted line in Figure 2(b) (i.e. $(e_a, e_p)$ and $(e_a, e_n)$), can be expressed as
$$\mathcal{L}_t = \max\big(0,\ \lVert e_a - e_p \rVert_2 - \lVert e_a - e_n \rVert_2 + m\big),$$
which aims at maximizing the gap between $d_p = \lVert e_a - e_p \rVert_2$ and $d_n = \lVert e_a - e_n \rVert_2$. Here $m$ is a constant margin, and the triple of embedding vectors is optimized only when the gap is within this value. Two hyper-parameters $\lambda_c$ and $\lambda_t$ are used to balance the effect of $\mathcal{L}_c$ and $\mathcal{L}_t$. We set $m = 20$, $\lambda_c = 0.8$ and $\lambda_t = 1$ in practice. The algorithm for optimizing dense PDML is given in Algorithm 1.
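The following is a minimal sketch of a distance metric loss of this kind, computed over annotated points pooled from several images. It assumes unsquared L2 distances, a hinge on the gap between the positive and negative distance, exhaustive triple enumeration and averaging over triples; the function and argument names (`pdml_loss`, `w_pos`, `w_gap`) are illustrative, and the default values follow the margin of 20 and the weights 0.8 and 1 reported in the analysis section.

```python
import math
from itertools import permutations


def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pdml_loss(points, margin=20.0, w_pos=0.8, w_gap=1.0):
    """points: list of (category, embedding) pooled over a group of images.

    Each ordered same-category pair (anchor, positive) is combined with every
    point of a different category (negative) to form one triple.
    """
    total, n_triples = 0.0, 0
    for (cat_a, emb_a), (cat_p, emb_p) in permutations(points, 2):
        if cat_a != cat_p:
            continue
        d_pos = l2(emb_a, emb_p)
        for cat_n, emb_n in points:
            if cat_n == cat_a:
                continue
            d_neg = l2(emb_a, emb_n)
            # Hinge: only triples whose gap is smaller than the margin count.
            gap_term = max(0.0, d_pos - d_neg + margin)
            total += w_pos * d_pos + w_gap * gap_term
            n_triples += 1
    return total / n_triples if n_triples else 0.0


# Hypothetical annotated points from two different images.
image_a = [("sky", [0.0, 0.0]), ("road", [30.0, 0.0])]
image_b = [("sky", [3.0, 4.0])]
loss = pdml_loss(image_a + image_b)
print(loss)
```

Here neither image alone contains a positive pair, so the loss would have no triples to work with; pooling the two images produces the two "sky" points that make triples possible. Exhaustive enumeration grows quickly with the number of points, so a real implementation would likely sample or vectorize the triples.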
To further improve the performance, we focus on gathering more labeled pixels during the training process. Previous works such as superpixels [Achanta et al.2012] and K-Means clustering [Ray and Turi1999] have explored gathering pixels by measuring the similarity of low-level features. However, both methods have obvious drawbacks under the point-based scene parsing regime because they gather many wrong pixels, and our experiments show that wrong pixels greatly degrade the performance of the network. A more detailed analysis can be found in the discussion section. To tackle this issue, we adopt a simple but accurate online extension method that collects more pixels with a low false positive rate to extend the annotation data. Taking $I_i$ as an example, assume the current weights of the feature extractor and classification module are $\theta_f$ and $\theta_c$; we then specify the new label candidate $Y_i^{s}$ from the pixel-wise classification score:
$$Y_i^{s}(u) = \begin{cases} \arg\max_{k} p_u(k), & \text{if } \max_{k} p_u(k) > \tau \text{ and } \arg\max_{k} p_u(k) \in \mathcal{C}_i, \\ \text{unlabeled}, & \text{otherwise,} \end{cases}$$
where $p_u(k)$ is the classification score of class $k$ at location $u$, $\mathcal{C}_i$ is the class label set of the image, and $\tau$ is a threshold to filter the pixels. In other words, for every location in the input image, if the maximum classification score of the pixel is greater than the threshold and the corresponding class is within the class label set of this image, the pixel is chosen for the extended label. From another perspective, we believe that pixels spatially close to the annotated pixels are more likely to be from the same category. Thus we extend each annotated pixel in $Y_i$ to a square centered on it to form the second label candidate $Y_i^{r}$. We then generate the final extended label $\hat{Y}_i$ by selecting only the candidates that appear in both schemes:
$$\hat{Y}_i = Y_i^{s} \cap Y_i^{r}.$$
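The two candidate schemes and their intersection can be sketched as follows. This is an illustrative pure-Python version on nested lists; the `IGNORE` marker, the `half_size` parameter (the square size is not given in the text) and the rule that a later square overwrites an earlier one where squares overlap are all assumptions. The threshold default of 0.7 is the value reported in the analysis of extension methods.

```python
IGNORE = -1  # hypothetical marker for unlabeled pixels


def score_candidates(probs, image_classes, threshold=0.7):
    """Keep a pixel if its top class score exceeds the threshold and the
    class belongs to the image's label set. probs[y][x] is a list of scores."""
    out = []
    for row in probs:
        out_row = []
        for p in row:
            best = max(range(len(p)), key=p.__getitem__)
            keep = p[best] > threshold and best in image_classes
            out_row.append(best if keep else IGNORE)
        out.append(out_row)
    return out


def region_candidates(point_label, half_size):
    """Grow every annotated pixel into a square centered on it."""
    h, w = len(point_label), len(point_label[0])
    out = [[IGNORE] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            c = point_label[y][x]
            if c == IGNORE:
                continue
            for yy in range(max(0, y - half_size), min(h, y + half_size + 1)):
                for xx in range(max(0, x - half_size), min(w, x + half_size + 1)):
                    out[yy][xx] = c  # assumption: later squares overwrite
    return out


def extend_label(probs, point_label, half_size=1, threshold=0.7):
    """Keep only pixels on which both candidate labels agree."""
    classes = {c for row in point_label for c in row if c != IGNORE}
    score = score_candidates(probs, classes, threshold)
    region = region_candidates(point_label, half_size)
    return [
        [s if s != IGNORE and s == r else IGNORE for s, r in zip(s_row, r_row)]
        for s_row, r_row in zip(score, region)
    ]


# A 1x3 image with two classes and one annotated point of class 0.
probs = [[[0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]]
points = [[0, IGNORE, IGNORE]]
print(extend_label(probs, points))
```

In the example the third pixel is confidently predicted as class 0 but lies outside the square around the annotated point, so it is not added. This is the conservative behaviour the intersection is meant to provide.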
Dataset and Evaluation Metrics
Our proposed model is trained and evaluated on two challenging scene parsing datasets: PASCAL-Context [Mottaghi et al.2014] and ADE 20K [Zhou et al.2017], as shown in Table 2. In PASCAL-Context, the 59 most frequent classes are used and the others are merged into a unified background class. In ADE 20K, we adopt the annotation of 150 meaningful classes. We generate one pixel annotation for each instance in each image. Performance is quantitatively evaluated by pixel-wise accuracy and mean region intersection over union (mIoU).
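For reference, the two evaluation metrics can be computed as in the sketch below. It works on flat lists of class indices and skips classes that appear in neither the prediction nor the ground truth when averaging IoU; that skipping convention and the function name are assumptions, since the text does not specify them.

```python
def pixel_accuracy_and_miou(pred, target, num_classes):
    """pred/target: non-empty flat lists of class indices of equal length."""
    correct = 0
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for p, t in zip(pred, target):
        if p == t:
            correct += 1
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Average IoU over classes that occur in the prediction or ground truth.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return correct / len(target), sum(ious) / len(ious)


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
acc, miou = pixel_accuracy_and_miou(pred, target, num_classes=3)
print(acc, miou)
```

Pixel accuracy is dominated by large regions, whereas mIoU weights every class equally, which is why the two numbers can diverge on datasets with many small classes.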
| Dataset | # Training | # Eval | Pixel/Image |
We utilize ResNet101 [He et al.2016], modified with atrous convolution, as the backbone of our feature extractor, initialized with weights pretrained on ImageNet. During training, we take a mini-batch of 16 images and randomly crop fixed-size patches from the original images. We use the SGD optimizer with momentum set to 0.9 and weight decay set to 0.0005. The initial base learning rate is set to 0.00025 for parameters in the feature extraction layers and to ten times that value for parameters in the classification module. Both learning rates are decayed over the course of training. All the experiments are conducted on two NVIDIA V100 GPUs. Our code will be made publicly available.
Quantitative and Qualitative Results
We evaluate different methods quantitatively using pixel accuracy and mIoU, which describe the precision of the prediction and the average performance over all classes, respectively. We run multiple experiments to determine the effects of the three parts of our proposed method: PointSup, PDML and online extension. To give a more comprehensive understanding of the effect of our method, we also train our network with fully annotated labels. The quantitative results are shown in Tables 3 and 6 (note that we omit the % of mIoU for simplicity). On PASCAL-Context, the full supervision setting yields 39.6% mIoU, and our method, with only a small fraction of the annotated pixels, obtains 30.0% mIoU. On ADE 20K, we achieve an mIoU of 19.6%, while SegNet [Badrinarayanan, Kendall, and Cipolla2017] achieves roughly 21% mIoU under full supervision. The qualitative results are shown in Figures 5 and 6. Our final method, combining point supervision, distance metric learning and online extension, gives the best subjective scene parsing quality.
| FullSup | PointSup | PDML | Online Ext. | mIoU | Pixel Acc |
| PASCAL-Context validation dataset |
| ADE 20K validation dataset |
Analysis of PDML
Recall that we apply the PDML loss to optimize the consistency of embedding vectors. We denote the distance between a positive pair as $d_p$ and that between a negative pair as $d_n$. The positive-pair term $\mathcal{L}_c$ aims at constraining $d_p$, and the triple term $\mathcal{L}_t$ aims at increasing the gap between $d_p$ and $d_n$. We argue that $\mathcal{L}_c$ is very important in the whole scheme, as shown in Figure 3. When only $\mathcal{L}_t$ is applied, the gap between $d_p$ and $d_n$ is optimized to be larger, but the absolute values of the distances increase greatly at the same time, which leads to a large performance drop. With the constraint of $\mathcal{L}_c$, the absolute values of the distances remain at a normal scale while the gap is still enlarged, making it easier for the classification module to distinguish the categories.
We also visualize the distribution of the number of pairs at each distance value during training to demonstrate the effectiveness of PDML. At the first epoch, the distributions of distances of positive and negative pairs are almost symmetrical. During training, an obvious peak shift can be observed: the peak of the distances between positive pairs moves towards the origin of the coordinate axis, and the peak of the distances between negative pairs moves in the opposite direction.
We have three hyper-parameters in the loss function: the margin $m$ and the loss weights $\lambda_c$ and $\lambda_t$. The margin $m$ is set for a mining purpose. A small margin limits the number of embedding vector triples being optimized and makes the features less effective for classification. A large margin brings in too many triples for optimization and makes the training loss hard to converge. A moderate margin value of 20 produces the best performance. The loss weights $\lambda_c$ and $\lambda_t$ balance the effect of the positive-pair term $\mathcal{L}_c$ and the triple term $\mathcal{L}_t$. We set $\lambda_t$ to 1 and adjust $\lambda_c$ correspondingly. A small $\lambda_c$, as in Figure 3, cannot constrain the absolute values of the positive and negative distances $d_p$ and $d_n$ to remain at a normal scale and leads to poor performance. A large $\lambda_c$ weakens the effect of maximizing the gap between $d_p$ and $d_n$ and makes it hard to distinguish embedding vectors from different categories. A moderate value of 0.8 achieves the best performance.
Analysis of Extension Method
We compare our online label extension method with other frequently used clustering methods.
Superpixel. Superpixels cluster pixels by taking low-level feature similarity into consideration. We set the number of superpixels in the range of 50 to 200 per image, depending on the number of annotated pixels. Each annotated pixel is extended to the superpixel covering it.
K-Means. We perform K-Means clustering on the feature representations of the input images. The annotated pixels are set as the initial cluster centers, and the maximum number of iterations is set to 300.
Score and Region. Recall that in the online extension part we use two methods to generate new label candidates. The first uses score thresholding, and we name this extension method score. The other simply extends every pixel in the current label to a square centered on the original pixel, and we name this method region.
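For the K-Means baseline described above, a minimal sketch of clustering with the annotated pixels as initial centers might look like this. The function name, the list-based feature layout and the early stop once assignments no longer change are assumptions for illustration; only the seeding from annotated pixels and the cap of 300 iterations come from the text.

```python
def seeded_kmeans(features, seed_indices, max_iter=300):
    """Cluster feature vectors, starting from the features at annotated pixels.

    features: one feature vector (list of floats) per pixel.
    seed_indices: indices of the annotated pixels used as initial centers.
    Returns, for each pixel, the position of its cluster in seed_indices.
    """
    centers = [list(features[i]) for i in seed_indices]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(f, centers[k])))
            for f in features
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for k in range(len(centers)):
            members = [f for f, a in zip(features, assignment) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Five pixels with 1-D features; pixels 0 and 3 carry point annotations.
features = [[0.0], [0.2], [0.1], [5.0], [5.2]]
clusters = seeded_kmeans(features, seed_indices=[0, 3])
print(clusters)
```

Each pixel would then inherit the label of the annotated pixel that seeded its cluster. Because every pixel is assigned to some cluster, this baseline labels the whole image, which is consistent with the observation that it gathers many wrong pixels.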
On the basis of the weights obtained by PDML, we implement the different extension methods; their results are shown in Table 5. The pixel-wise extension accuracy is of critical importance to the final performance. Superpixel and region have better extension accuracies and correspondingly better performance. We also test various thresholds for the score scheme. Our method, which combines score with a threshold of 0.7 and region, has an extension accuracy of 98.2% and the best testing performance.
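The text does not define extension accuracy explicitly. One plausible reading, sketched below as an assumption, is the fraction of newly labeled pixels whose assigned class matches the dense ground truth; the `IGNORE` marker and function name are illustrative.

```python
IGNORE = -1  # hypothetical marker for pixels left unlabeled by the extension


def extension_accuracy(extended, ground_truth, ignore=IGNORE):
    """Share of extended pixels whose label agrees with the ground truth.

    extended/ground_truth: flat lists of class indices of equal length.
    """
    labeled = [(e, g) for e, g in zip(extended, ground_truth) if e != ignore]
    if not labeled:  # nothing was extended
        return 0.0
    return sum(e == g for e, g in labeled) / len(labeled)


extended = [0, IGNORE, 1, 1]
ground_truth = [0, 2, 1, 0]
print(extension_accuracy(extended, ground_truth))
```

Under this reading, a method that labels few pixels but labels them correctly scores higher than one that labels many pixels with frequent mistakes.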
| Superpixel | K-Means | Score | Region | Extension Acc | mIoU | Pixel Acc |
This paper is the first to tackle the task of point-based semantic scene parsing. We propose a novel deep metric learning method that leverages the semantic relationship among the annotated points by encouraging the feature representations of intra- and inter-category points to stay consistent. Points within the same category are optimized to share more similar feature representations and, conversely, those of points from different categories are optimized to be more distinct. Unlike current weakly supervised methods whose solutions are constrained to a single image, our proposed method optimizes the embedding vectors across different images in the training dataset to obtain a sufficient and balanced set of embedding vector pairs. The whole model can be trained in an end-to-end manner. Our method has competitive performance both qualitatively and quantitatively on the PASCAL-Context and ADE 20K scene parsing datasets.
Acknowledgements. This work is supported in part by IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM AI Horizons Network.
- [Achanta et al.2012] Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S.; et al. 2012. Slic superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI 34(11).
- [Badrinarayanan, Kendall, and Cipolla2017] Badrinarayanan, V.; Kendall, A.; and Cipolla, R. 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI 39(12):2481–2495.
- [Bearman et al.2016] Bearman, A.; Russakovsky, O.; Ferrari, V.; and Fei-Fei, L. 2016. What’s the point: Semantic segmentation with point supervision. In ECCV, 549–565. Springer.
- [Chen et al.2018] Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2018. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40(4):834–848.
- [Dai, He, and Sun2015] Dai, J.; He, K.; and Sun, J. 2015. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE ICCV, 1635–1643.
- [Ess et al.2009] Ess, A.; Müller, T.; Grabner, H.; and Van Gool, L. J. 2009. Segmentation-based urban traffic scene understanding. In BMVC, volume 1, 2. Citeseer.
- [Fathi et al.2017] Fathi, A.; Wojna, Z.; Rathod, V.; Wang, P.; Song, H. O.; Guadarrama, S.; and Murphy, K. P. 2017. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE CVPR, 770–778.
- [Hou et al.2018] Hou, Q.; Jiang, P.-T.; Wei, Y.; and Cheng, M.-M. 2018. Self-erasing network for integral object attention. In NIPS.
- [Kolesnikov and Lampert2016] Kolesnikov, A., and Lampert, C. H. 2016. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 695–711. Springer.
- [Kong and Fowlkes2018] Kong, S., and Fowlkes, C. 2018. Recurrent pixel embedding for instance grouping. In IEEE CVPR, 9018–9028.
- [Lin et al.2016] Lin, D.; Dai, J.; Jia, J.; He, K.; and Sun, J. 2016. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In IEEE CVPR, 3159–3167.
- [Liu et al.2018] Liu, Y.; Jiang, P.-T.; Petrosyan, V.; Li, S.-J.; Bian, J.; Zhang, L.; and Cheng, M.-M. 2018. Del: Deep embedding learning for efficient image segmentation. In IJCAI, 864–870.
- [Ming et al.2017] Ming, Z.; Chazalon, J.; Luqman, M. M.; Visani, M.; and Burie, J.-C. 2017. Simple triplet loss based on intra/inter-class metric learning for face verification. In IEEE ICCVW, 1656–1664.
- [Mottaghi et al.2014] Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.-G.; Lee, S.-W.; Fidler, S.; Urtasun, R.; and Yuille, A. 2014. The role of context for object detection and semantic segmentation in the wild. In IEEE CVPR.
- [Oh Song et al.2016] Oh Song, H.; Xiang, Y.; Jegelka, S.; and Savarese, S. 2016. Deep metric learning via lifted structured feature embedding. In IEEE CVPR, 4004–4012.
- [Papandreou et al.2015] Papandreou, G.; Chen, L.-C.; Murphy, K.; and Yuille, A. L. 2015. Weakly-and semi-supervised learning of a dcnn for semantic image segmentation. arXiv preprint arXiv:1502.02734.
- [Pathak et al.2014] Pathak, D.; Shelhamer, E.; Long, J.; and Darrell, T. 2014. Fully convolutional multi-class multiple instance learning. arXiv preprint arXiv:1412.7144.
- [Ray and Turi1999] Ray, S., and Turi, R. H. 1999. Determination of number of clusters in k-means clustering and application in colour image segmentation. In Proceedings of the 4th international conference on advances in pattern recognition and digital techniques, 137–143.
- [Schroff, Kalenichenko, and Philbin2015] Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In IEEE CVPR, 815–823.
- [Tsai et al.2016] Tsai, Y.-H.; Shen, X.; Lin, Z.; Sunkavalli, K.; and Yang, M.-H. 2016. Sky is not the limit: semantic-aware sky replacement. ACM Trans. Graph. 35(4):149–1.
- [Wan et al.2014] Wan, J.; Wang, D.; Hoi, S. C. H.; Wu, P.; Zhu, J.; Zhang, Y.; and Li, J. 2014. Deep learning for content-based image retrieval: A comprehensive study. In ACM MM.
- [Wei et al.2016] Wei, Y.; Liang, X.; Chen, Y.; Jie, Z.; Xiao, Y.; Zhao, Y.; and Yan, S. 2016. Learning to segment with image-level annotations. Pattern Recognition 59:234–244.
- [Wei et al.2017a] Wei, Y.; Feng, J.; Liang, X.; Cheng, M.-M.; Zhao, Y.; and Yan, S. 2017a. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR, volume 1, 3.
- [Wei et al.2017b] Wei, Y.; Liang, X.; Chen, Y.; Shen, X.; Cheng, M.-M.; Feng, J.; Zhao, Y.; and Yan, S. 2017b. Stc: A simple to complex framework for weakly-supervised semantic segmentation. IEEE TPAMI 39(11):2314–2320.
- [Wei et al.2018] Wei, Y.; Xiao, H.; Shi, H.; Jie, Z.; Feng, J.; and Huang, T. S. 2018. Revisiting dilated convolution: A simple approach for weakly-and semi-supervised semantic segmentation. In IEEE CVPR, 7268–7277.
- [Zhang et al.2018a] Zhang, R.; Lin, L.; Wang, G.; Wang, M.; and Zuo, W. 2018a. Hierarchical scene parsing by weakly supervised learning with image descriptions. IEEE TPAMI (1):1–1.
- [Zhang et al.2018b] Zhang, X.; Wei, Y.; Feng, J.; Yang, Y.; and Huang, T. 2018b. Adversarial complementary learning for weakly supervised object localization. In IEEE CVPR.
- [Zhao et al.2017] Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In IEEE CVPR, 2881–2890.
- [Zhou et al.2017] Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; and Torralba, A. 2017. Scene parsing through ade20k dataset. In IEEE CVPR.