Weakly Supervised Semantic Segmentation Using Constrained Dominant Sets

by   Sinem Aslan, et al.
Università Ca' Foscari Venezia

The availability of large-scale data sets is an essential prerequisite for deep learning based semantic segmentation schemes. Since obtaining pixel-level labels is extremely expensive, supervising deep semantic segmentation networks using low-cost weak annotations has been an attractive research problem in recent years. In this work, we explore the potential of Constrained Dominant Sets (CDS) for generating multi-labeled full mask predictions to train a fully convolutional network (FCN) for semantic segmentation. Our experimental results show that using CDS yields higher-quality mask predictions compared to methods that have been adopted in the literature for the same purpose.





1 Introduction

Semantic segmentation is one of the most well-studied research problems in computer vision. The goal is pixel-level classification, i.e., labeling each pixel of an input image with the class of the object or region that covers it. Predicting the class of each pixel yields complete scene understanding, which is central to a wide range of computer vision applications, e.g. autonomous driving [9], human-computer interaction [17], earth observation [5], biomedical applications [29], and dietary assessment systems [4]. The impressive performance of Deep Convolutional Neural Networks (DCNNs) on image classification tasks has encouraged researchers to employ them for pixel-level classification as well. Leading methods on well-known benchmarks, e.g. PASCAL VOC 2012, train fully convolutional networks (FCNs) under the supervision of fully-annotated ground-truth masks. However, obtaining such precise fully-annotated masks is extremely expensive, which limits the availability of large-scale annotated training sets for deep learning architectures. To address this issue, recent works have explored supervising DCNN architectures for semantic segmentation with low-cost annotations such as image-level labels [13], point tags [6], bounding boxes [14, 10, 18], and scribbles [15, 28, 23, 25], which are weaker than pixel-level labels.

Creating weak annotations is much easier than creating full annotations, which helps to obtain large training sets for semantic segmentation. However, weak annotations are not as precise as full annotations, and their quality depends on decisions made by the annotators, which degrades their reliability. The literature therefore proposes different strategies for weakly-supervised semantic segmentation to deal with these issues. While some works [23, 25] design loss functions that take into account only the initially given weak annotations during training, the most common approach [10, 15, 14, 18, 28] is to supervise DCNN architectures with predicted full mask annotations obtained by post-processing the weak annotations.

Among these two strategies, we follow the second one and propose to generate full mask annotations from scribbles with an interactive segmentation technique that has proven extremely effective in a variety of computer vision problems, including image and video segmentation [31, 21]. For the same purpose, previous works have used a number of shallow interactive segmentation methods: variants of GrabCut [22] are used in [18, 14] to propagate bounding box annotations for supervising a convolutional network; [10] propagates bounding box annotations by iteratively alternating between generating full mask approximations and training the network; using a similar iterative scheme, [15] propagates scribble annotations over superpixels by optimizing the multi-label graph cuts model of [7]; and [28] proposes a random-walk based label propagation mechanism to propagate scribble annotations.

In this paper, we explore the potential of Constrained Dominant Sets (CDS) [32, 31, 2, 1, 30] for generating predicted full annotations to supervise a convolutional neural network for semantic segmentation. Representing images as edge-weighted graphs, the main idea of the constrained segmentation approach in [31] is to find the collection of dominant-set clusters on the graph that are constrained to contain the components of a given annotation. CDS has been applied to co-segmentation and interactive segmentation with bounding box or scribble modalities, and its superiority over state-of-the-art segmentation techniques such as Graph Cut, Lazy Snapping, Geodesic Segmentation, Random Walker, Transduction, Geodesic Graph Cut, and Constrained Random Walker is demonstrated in [31]. Motivated by the reported performance for single-cluster extraction (i.e. foreground extraction) in [31], we use CDS for multiple-cluster extraction driven by the multi-label scribbles of the PASCAL VOC 2012 dataset. Since our goal is mainly to assess the performance of CDS for full mask prediction in weakly-supervised semantic segmentation, we train a basic segmentation network, namely the Fully Convolutional Network (FCN-8s) of [16] based on the VGG16 architecture, and compare against other full mask prediction schemes in the literature that supervise the same type of deep learning architecture. Our experimental results on the standard PASCAL VOC 2012 dataset show the effectiveness of our approach compared to existing algorithms.

2 Constrained Dominant Sets

Dominant Set Framework.

In the dominant-set clustering framework [19, 21], an input image is represented as an undirected edge-weighted graph with no self-loops $G = (V, E, w)$, where $V = \{1, \dots, n\}$ is the set of vertices corresponding to image points (pixels or superpixels), $E \subseteq V \times V$ is the set of edges representing the neighborhood relations between vertices, and $w : E \to \mathbb{R}^{+}$ is the (positive) weight function representing the similarity between linked node pairs. The graph is represented by a symmetric affinity (or similarity) matrix $A = (a_{ij})$, where $a_{ij} = w(i, j)$ if $(i, j) \in E$ and $a_{ij} = 0$ otherwise.

Next, a weight $w_S(i)$, defined recursively as in Eq. 1, is assigned to each vertex $i$ of a non-empty subset $S \subseteq V$:

$$
w_S(i) =
\begin{cases}
1, & \text{if } |S| = 1,\\
\sum_{j \in S \setminus \{i\}} \phi_{S \setminus \{i\}}(j, i)\, w_{S \setminus \{i\}}(j), & \text{otherwise},
\end{cases}
\qquad (1)
$$

where $\phi_S(i, j) = a_{ij} - \frac{1}{|S|} \sum_{k \in S} a_{ik}$ denotes the (relative) similarity between nodes $j$ ($j \notin S$) and $i$, with respect to the average similarity between node $i$ and its neighbours in $S$.

A positive $w_S(i)$ indicates that adding $i$ to its neighbours in $S$ will increase the internal coherence of the set, while a negative value means the overall coherence decreases. Based on the aforementioned definitions, a non-empty subset of vertices $S \subseteq V$ such that $W(T) > 0$ for any non-empty $T \subseteq S$, where $W(S) = \sum_{i \in S} w_S(i)$, is said to be a dominant set if it is a maximally coherent data set, i.e. it satisfies the two basic properties of a cluster: internal coherence ($w_S(i) > 0$, for all $i \in S$) and external incoherence ($w_{S \cup \{i\}}(i) < 0$, for all $i \notin S$).

Consider the following linearly-constrained quadratic optimization problem:

$$
\begin{array}{ll}
\text{maximize} & f(\mathbf{x}) = \mathbf{x}^{\top} A \mathbf{x} \\
\text{subject to} & \mathbf{x} \in \Delta,
\end{array}
\qquad (2)
$$

where $\mathbf{x}^{\top}$ is the transpose of the vector $\mathbf{x}$ and $\Delta$ is the standard simplex of $\mathbb{R}^{n}$, defined as $\Delta = \{\mathbf{x} \in \mathbb{R}^{n} : x_i \geq 0 \ \forall i, \ \sum_{i=1}^{n} x_i = 1\}$. Under the assumption that the affinity matrix $A$ is symmetric, it is shown in [19] that if $S$ is a dominant set, then its weighted characteristic vector $\mathbf{x}^{S}$, defined as in Eq. 3, is a strict local solution of the standard quadratic program in Eq. 2:

$$
x_i^{S} =
\begin{cases}
\dfrac{w_S(i)}{W(S)}, & \text{if } i \in S,\\[4pt]
0, & \text{otherwise}.
\end{cases}
\qquad (3)
$$

Conversely, if $\mathbf{x}^{*}$ is a strict local solution to Eq. 2, then its support $\sigma(\mathbf{x}^{*}) = \{i \in V : x_i^{*} > 0\}$ is a dominant set of $A$. Thus, a dominant set can be found by localizing a solution of Eq. 2 with a continuous optimization technique and gathering the support of the solution found. Notice that the value of a component of the found $\mathbf{x}^{*}$ provides a measure of how strongly that component contributes to the cohesiveness of the cluster.
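To make the connection between dominant sets and local solutions of Eq. 2 concrete, the following sketch (an illustration on a hypothetical toy affinity matrix, not part of the paper's pipeline) runs plain discrete-time replicator dynamics to localize a solution and reads off its support:

```python
import numpy as np

def replicator_dynamics(A, x0=None, tol=1e-10, max_iter=10000):
    """Discrete-time replicator dynamics for maximizing x'Ax over the simplex."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n) if x0 is None else x0.copy()
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)   # multiplicative update; stays on the simplex
        if np.linalg.norm(x_new - x, 1) < tol:
            return x_new
        x = x_new
    return x

# Toy affinity: vertices {0, 1, 2} are mutually similar, vertex 3 is an outlier.
A = np.array([[0.0, 0.9, 0.8, 0.1],
              [0.9, 0.0, 0.9, 0.1],
              [0.8, 0.9, 0.0, 0.1],
              [0.1, 0.1, 0.1, 0.0]])
x = replicator_dynamics(A)
support = {i for i in range(4) if x[i] > 1e-6}
print(support)  # the dominant set: the cohesive triple {0, 1, 2}
```

The component values of the converged vector also rank how strongly each vertex contributes to the cluster's cohesiveness, as noted above.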

Constrained Dominant Set Framework.

In [32, 31] the notion of a constrained dominant set is introduced, which aims at finding a dominant set constrained to contain vertices from a given seed set $S \subseteq V$. Given the edge-weighted graph with affinity matrix $A$, a parameterized family of quadratic programs is defined as in Eq. 4 [31] for the set $S$ and a parameter $\alpha > 0$:

$$
\begin{array}{ll}
\text{maximize} & f_S^{\alpha}(\mathbf{x}) = \mathbf{x}^{\top} (A - \alpha \hat{I}_S) \mathbf{x} \\
\text{subject to} & \mathbf{x} \in \Delta,
\end{array}
\qquad (4)
$$

where $\hat{I}_S$ is the diagonal matrix whose elements are set to 1 if the corresponding vertices are not in $S$ and to 0 otherwise. It is theoretically proven, and empirically illustrated for interactive image segmentation [31], that if $S$ is the set of vertices selected by the user, then by setting $\alpha > \lambda_{\max}(A_{V \setminus S})$ it is guaranteed that all local solutions of (4) will have a support that necessarily contains at least one element of $S$. Here, $\lambda_{\max}(A_{V \setminus S})$ is the largest eigenvalue of the principal submatrix of $A$ indexed by the elements of $V \setminus S$.

In order to find constrained dominant sets by solving the quadratic optimization problem (4), [31] used Replicator Dynamics, developed and studied in evolutionary game theory [19]. In this work we use Infection and Immunization Dynamics (InImDyn) [20], which has proved to be a faster and equally accurate alternative.
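A minimal numerical sketch of the constrained program (4) follows, using plain replicator dynamics as a stand-in for InImDyn and the same hypothetical toy graph as before. The matrix is offset by the constant $\alpha$ (added to every entry), which keeps the payoffs nonnegative without changing the maximizers over the simplex, since it only adds a constant to the objective:

```python
import numpy as np

def replicator(M, x, iters=5000):
    # Discrete replicator dynamics; M must be nonnegative.
    for _ in range(iters):
        Mx = M @ x
        x = x * Mx / (x @ Mx)
    return x

# Toy affinity, zero diagonal; the seed set S = {3} plays the role of a scribble.
A = np.array([[0.0, 0.9, 0.8, 0.1],
              [0.9, 0.0, 0.9, 0.1],
              [0.8, 0.9, 0.0, 0.1],
              [0.1, 0.1, 0.1, 0.0]])
S = [3]
n = A.shape[0]
I_hat = np.diag([0.0 if i in S else 1.0 for i in range(n)])

# alpha must exceed the largest eigenvalue of the submatrix indexed by V \ S.
rest = [i for i in range(n) if i not in S]
alpha = np.linalg.eigvalsh(A[np.ix_(rest, rest)]).max() + 0.1

# Shift all entries by alpha so payoffs stay nonnegative; on the simplex this
# only adds the constant alpha to the objective of Eq. 4.
M = A - alpha * I_hat + alpha
x = replicator(M, np.full(n, 1.0 / n))
support = {i for i in range(n) if x[i] > 1e-6}
print(support)  # guaranteed to contain the seed vertex 3
```

Note that without the constraint (the plain program of Eq. 2 on this graph) the solution's support would be the cohesive triple {0, 1, 2}, excluding the seed; the $-\alpha \hat{I}_S$ penalty is what forces the seed into the support.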

3 Proposed approach

We propose to generate full mask predictions (to be used for supervising a semantic segmentation network) by post-processing weak annotations, i.e. scribble annotations, using CDS. Moreover, we propose to use CDS for multiclass clustering of pixels, i.e. semantic segmentation, whereas previously CDS has been used only for interactive foreground segmentation [31, 32].

3.1 Preprocessing step for CDS

Superpixel generation. A common approach in image segmentation is to use superpixels as input entities instead of raw pixels. A superpixel is a group of pixels with similar colors; using superpixels not only reduces computational complexity but also allows features to be computed on meaningful regions. Among the variety of available techniques, e.g. SLIC and the Oriented Watershed Transform (OWT), we use the method of Felzenszwalb and Huttenlocher [11], as done in [26], since it is fast and publicly available. The method of [11] has also been used in another weakly-supervised semantic segmentation framework [15] that experiments on the same dataset as us. The method of [11] is a graph-based segmentation scheme: a graph is constructed over the image such that each element to be segmented is a vertex, and the dissimilarity (e.g. color difference) between two vertices forms a weighted edge. Vertices (or subgraphs) are then merged according to the criterion in Eq. 5, where $e$ is the edge connecting two subgraphs $C_1$ and $C_2$, $w(e)$ is the weight of edge $e$, and $\mathrm{MST}(C)$ is the minimum spanning tree of $C$:

$$
w(e) \leq \min\big(\mathrm{Int}(C_1) + \tau(C_1),\ \mathrm{Int}(C_2) + \tau(C_2)\big),
\qquad \mathrm{Int}(C) = \max_{e \in \mathrm{MST}(C)} w(e).
\qquad (5)
$$

Here, $\tau(C) = k / |C|$ is a threshold function in which $k$ is chosen by the user: high values of $k$ yield a smaller number of (large) segments, and vice-versa. Another user-specified parameter is the smoothing factor (which we denote by $\sigma_{FH}$) of the Gaussian kernel used to smooth the image in a preprocessing step.
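The merge rule of Eq. 5 can be sketched Kruskal-style with a union-find structure; the following illustration on a hypothetical toy graph tracks $\mathrm{Int}(C)$ as the largest MST edge inside each component (edges are processed in ascending weight order, so the connecting edge is that maximum):

```python
# Minimal sketch of the Felzenszwalb-Huttenlocher merge rule on a toy
# weighted graph (vertices 0..5; edges stored as (weight, u, v)).
class DisjointSet:
    def __init__(self, n, k):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n   # Int(C): max MST edge weight inside C
        self.k = k                  # threshold parameter of tau(C) = k / |C|

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def try_merge(self, u, v, w):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        # Eq. 5: merge only if the connecting edge is no heavier than the
        # internal variation of either component plus its threshold.
        if w <= min(self.internal[ru] + self.k / self.size[ru],
                    self.internal[rv] + self.k / self.size[rv]):
            self.parent[rv] = ru
            self.size[ru] += self.size[rv]
            self.internal[ru] = max(self.internal[ru], self.internal[rv], w)
            return True
        return False

# Two tight clusters {0,1,2} and {3,4,5} joined by one heavy edge.
edges = sorted([(0.1, 0, 1), (0.2, 1, 2), (0.1, 3, 4), (0.2, 4, 5),
                (0.9, 2, 3)])
ds = DisjointSet(6, k=0.5)
for w, u, v in edges:
    ds.try_merge(u, v, w)
labels = [ds.find(i) for i in range(6)]
print(labels)  # two components survive: {0, 1, 2} and {3, 4, 5}
```

With $k = 0.5$ the heavy 0.9 edge fails the test ($0.2 + 0.5/3 \approx 0.37$), so the two clusters stay separate; a larger $k$ would merge them, matching the behaviour described above.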

Feature extraction. Once the superpixels are generated, a feature vector is computed for each superpixel. In the application of the CDS model to interactive image segmentation in [31], the medians of the colors of all pixels in the RGB, HSV, and L*a*b* color spaces and the Leung-Malik (LM) filter bank responses are concatenated to form the features. Differently from [31], we compute the same feature types as ScribbleSup [15], which experiments on the same dataset as us: color and texture histograms, denoted by $h_c(\cdot)$ and $h_t(\cdot)$ in Eq. 6. More specifically, $h_c(i)$ is a color histogram computed with 25 bins and $h_t(i)$ is a histogram of gradients at the horizontal and vertical orientations, with 10 bins per orientation, for superpixel $i$.
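A sketch of such per-superpixel features follows, keeping the bin counts from the text (25 color bins, 10 gradient bins per orientation); the value ranges and normalization are illustrative assumptions, as the exact choices of [15] may differ:

```python
import numpy as np

def superpixel_features(colors, grad_x, grad_y):
    """colors: (N,) values in [0, 1]; grad_x, grad_y: (N,) gradients in [-1, 1]."""
    h_c, _ = np.histogram(colors, bins=25, range=(0.0, 1.0))
    h_tx, _ = np.histogram(grad_x, bins=10, range=(-1.0, 1.0))
    h_ty, _ = np.histogram(grad_y, bins=10, range=(-1.0, 1.0))
    h_t = np.concatenate([h_tx, h_ty])   # 10 bins per orientation -> 20 dims
    # L1-normalize so superpixels of different sizes stay comparable.
    return h_c / max(h_c.sum(), 1), h_t / max(h_t.sum(), 1)

rng = np.random.default_rng(0)
h_c, h_t = superpixel_features(rng.random(200),
                               rng.normal(0, 0.3, 200).clip(-1, 1),
                               rng.normal(0, 0.3, 200).clip(-1, 1))
print(h_c.shape, h_t.shape)  # (25,) (20,)
```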

3.2 Application of CDS for full mask predictions

In order to generate full mask predictions using the CDS model, an input image is represented as a graph whose vertices are the superpixels of the image and whose edge weights reflect the similarity between the corresponding superpixels. We use scribbles as the given weak annotations in this work, and they serve as the constraint sets in the CDS formulation. Previously, CDS was applied to interactive foreground segmentation [31], where dominant-set clusters covering a set of given nodes of a single object class were extracted. In this work, our problem demands multiclass clustering of pixels. Hence, $S_l$ here represents the manually selected pixels of class $l$, where $l \in \{1, \dots, L\}$ and $L$ is the number of classes in the dataset, e.g. $L = 21$ (20 object classes plus background) for PASCAL VOC 2012.

Accordingly, for each class of scribbles present in a given image, we ignore the remaining classes and perform foreground segmentation, i.e. 2-class clustering of image pixels, as in [31] by computing its CDSs. Thus, for class $l$, the union of the extracted dominant sets that contain the set $S_l$ represents the segmented regions of the object of class $l$. We then repeat this process for every class that exists in the image using the corresponding scribbles. If a node, i.e. superpixel, is claimed by more than one class, we assign it to the class for which it has the highest value in the weighted characteristic vector found by solving the quadratic program in Eq. 4 with InImDyn (see Section 2).
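The conflict-resolution rule above reduces to an argmax over characteristic-vector weights; a small sketch (with hypothetical class names and weights) makes it explicit:

```python
# A superpixel claimed by several classes is assigned to the class whose CDS
# characteristic vector gives it the largest weight.
# `memberships` maps class label -> {superpixel id: characteristic weight}.
def resolve_labels(memberships):
    label, best = {}, {}
    for cls, vec in memberships.items():
        for sp, weight in vec.items():
            if sp not in best or weight > best[sp]:
                best[sp] = weight
                label[sp] = cls
    return label

memberships = {
    "cat":        {0: 0.40, 1: 0.35, 2: 0.25},
    "background": {2: 0.60, 3: 0.40},
}
print(resolve_labels(memberships))  # superpixel 2 goes to "background"
```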

Computation of the affinity matrix. Before computing the CDS clusters, the affinities (or similarities) between superpixels must be computed to construct the matrix $A$ in Eq. 4. In [31], dissimilarity measurements are transformed to the affinity space using the Gaussian kernel $a_{ij} = \mathbb{1}_{i \neq j} \exp\!\big(-\|f_i - f_j\|^2 / 2\sigma^2\big)$, where $f_i$ is the feature vector of superpixel $i$, $\sigma$ is the scale parameter of the Gaussian kernel, and $\mathbb{1}_{P} = 1$ if $P$ is true and 0 otherwise. Differently from [31], we use the Gaussian kernel in Eq. 6, where different scale values are used for the different feature types:

$$
a_{ij} = \exp\!\left(-\frac{\|h_c(i) - h_c(j)\|_2^2}{\delta_c^2} - \frac{\|h_t(i) - h_t(j)\|_2^2}{\delta_t^2}\right).
\qquad (6)
$$

The kernel in Eq. 6 is also adopted in [15], which experiments on the same dataset and uses the same feature types as us.
Using different color spaces. The quality of the generated superpixels directly affects the performance of the segmentation algorithm, and a number of segmentation works (including but not limited to [26, 3]) have emphasized that higher segmentation performance can be obtained by using different color transformations of the input image to deal with different scene and lighting conditions. Motivated by these studies [26, 3], we compute superpixels in a variety of color spaces with a range of invariance properties. Specifically, we use five color spaces, which were also used in [26] for determining high-quality object locations via segmentation as a selective-search strategy: Intensity (the grey-scale image), HSV, Lab, rgI (the normalized rg channels plus intensity), and H (the Hue channel of HSV). We generate superpixels and compute mask predictions with the CDS model for each color space of the input image, and then decide the final label of a pixel as the most frequently occurring class label, i.e. by majority voting. In addition to using different color spaces, we also vary the threshold parameter $k$ (in Eq. 5) to benefit from greater diversification, as recommended in [26].
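The per-pixel majority vote over the label maps produced in the different color spaces (and for the different $k$ values) can be sketched as follows, with hypothetical labels:

```python
from collections import Counter

# Majority vote over per-color-space label maps.
def majority_vote(label_maps):
    """label_maps: list of dicts mapping pixel id -> class label."""
    pixels = set().union(*label_maps)
    return {p: Counter(m[p] for m in label_maps if p in m).most_common(1)[0][0]
            for p in pixels}

maps = [{0: "cat", 1: "bg"}, {0: "cat", 1: "cat"}, {0: "bg", 1: "bg"}]
print(majority_vote(maps))  # pixel 0 -> 'cat' (2 votes), pixel 1 -> 'bg' (2 votes)
```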

4 Experiments

Dataset and evaluation. We trained the models on the augmented PASCAL VOC training set of 10582 images [12] and evaluated them on the 1449-image validation set, using the scribble annotations published in [15]. In what follows, accuracy is evaluated using pixel accuracy, mean accuracy, and mean Intersection over Union (mIoU) as in [16]: with $n_{ij}$ the number of pixels of class $i$ predicted to belong to class $j$, $n_{cl}$ the number of classes, and $t_i = \sum_j n_{ij}$ the total number of pixels of class $i$, these are $\sum_i n_{ii} / \sum_i t_i$, $(1/n_{cl}) \sum_i n_{ii} / t_i$, and $(1/n_{cl}) \sum_i n_{ii} / \big(t_i + \sum_j n_{ji} - n_{ii}\big)$, respectively.
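The three metrics of [16] can be computed directly from a confusion matrix; a small sketch on a hypothetical 2-class matrix:

```python
import numpy as np

# Metrics of [16] from a confusion matrix n, where n[i, j] counts pixels of
# class i predicted as class j.
def segmentation_metrics(n):
    n = np.asarray(n, dtype=float)
    t = n.sum(axis=1)                            # t_i: pixels of class i
    pixel_acc = np.diag(n).sum() / n.sum()       # sum_i n_ii / sum_i t_i
    mean_acc = np.mean(np.diag(n) / t)           # (1/n_cl) sum_i n_ii / t_i
    iou = np.diag(n) / (t + n.sum(axis=0) - np.diag(n))
    return pixel_acc, mean_acc, iou.mean()

n = [[50, 10],
     [ 5, 35]]
pa, ma, miou = segmentation_metrics(n)
print(round(pa, 3), round(ma, 3), round(miou, 3))
```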

Implementation details. We used the VGG16-based FCN-8s network [16] of the MatConvNet-FCN toolbox [27], initialized with the ImageNet-pretrained model VGG-VD-16 of [27]. We trained by SGD with momentum: following [16], we used momentum 0.9, a mini-batch size of 20 images, and the weight-decay and learning-rate values chosen as in [16]. With these hyperparameters we observed that the pixel accuracy converges on the validation set.

The performance of CDS is sensitive to the scale parameter $\sigma$ of the Gaussian kernel (see Section 3.2), and [31] reports three different results for different selections of $\sigma$: 1) CDSBestSigma, where the best $\sigma$ is selected separately for every image; 2) CDSSingleSigma, where a single $\sigma$ is searched in a fixed range, i.e. between 0.05 and 0.2; 3) CDSSelfTuning, where $\sigma^2$ is replaced by $\sigma_i \sigma_j$, with $\sigma_i$ the mean distance of sample $i$ to its K nearest neighbors and K fixed to 7. To decide the values of the $\delta_c$ and $\delta_t$ parameters (in Eq. 6) we followed the CDSBestSigma strategy of [31]. Additionally, in the graph structure we cut the edges between vertices corresponding to non-adjacent superpixels by setting the corresponding entries of the affinity matrix to zero, as done in [15], which yielded better segmentation maps. We then min-max normalized the matrix to the range $[0, 1]$ and symmetrized it.
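The affinity post-processing just described (cutting non-adjacent edges, min-max scaling, symmetrizing) can be sketched on a hypothetical 3-superpixel example:

```python
import numpy as np

# Zero out entries of non-adjacent superpixels, min-max scale to [0, 1],
# then symmetrize the affinity matrix.
def postprocess_affinity(A, adjacency):
    A = np.where(adjacency, A, 0.0)          # cut edges between non-neighbors
    lo, hi = A.min(), A.max()
    A = (A - lo) / (hi - lo) if hi > lo else A
    return (A + A.T) / 2.0                   # enforce symmetry

A = np.array([[0.0, 0.8, 0.3],
              [0.6, 0.0, 0.9],
              [0.3, 0.7, 0.0]])
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=bool)      # superpixels 0 and 2 not adjacent
B = postprocess_affinity(A, adj)
print(np.allclose(B, B.T))  # True
```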

Performance evaluation. We first assessed the quality of the predicted full annotations of the 10582 training images (denoted by PredSet, for Predicted Set, in Table 1) for the different color spaces, before training the network with them. Then, by training the network with the Predicted Sets, we report performance on the Test Set, i.e. the PASCAL VOC 2012 val set. In the superpixel generation of [11] we used a fixed smoothing factor $\sigma_{FH}$ (FH stands for Felzenszwalb and Huttenlocher [11]) in the experiments of Table 1. For each color space we performed majority voting (denoted by MV in Tables 1 and 2) over the maps obtained with different values of $k$ (in Eq. 5).

Table 1 shows that the choice of color space affects the quality of the predicted full annotations (PredSet); the highest-quality mask predictions in terms of mIoU are obtained with Intensity (66.51%). Majority voting over the maps obtained in all color spaces yields the highest-quality mask predictions for both CDS (73.28%) and GraphCut (63.51%). We then trained the network with the predicted sets of CDS-Intensity, CDS-MV, and GraphCut-MV, as well as with the published full mask annotations, and report their performance on the test set in Table 1. Training with CDS-MV significantly outperforms GraphCut (which was employed in [15]) and comes quite close to the performance of training with fully-annotated masks (59.2% vs. 61.6%).

Color space              mean IoU   Pixel Acc.   mean Acc.
PredSet-CDS-Intensity    66.51      89.05        75.95
PredSet-CDS-HSV          65.47      88.36        76.15
PredSet-CDS-Lab          64.70      88.13        75.29
PredSet-CDS-rgI          66.49      89.27        74.60
PredSet-CDS-H            57.16      85.12        68.21
PredSet-CDS-MV           73.28      91.47        82.05
PredSet-GraphCut-MV      63.51      86.48        81.83
TestSet-CDS-Intensity    57.41      89.01        70.56
TestSet-CDS-MV           59.20      89.59        73.05
TestSet-GraphCut-MV      52.25      85.80        72.43
TestSet-With Full Masks  61.60      90.27        78.95
Table 1: Quality of the obtained mask predictions (PredSet) and performance on the PASCAL VOC 2012 val set (TestSet) when the network is trained with them (MV: majority voting; GraphCut: implementation of GraphCut in our framework).

Comparison with other full-mask prediction methods. A large variety of interactive segmentation algorithms can be used for full mask prediction to train a semantic segmentation network. To be as fair as possible, we compare with the reported performance of methods evaluated under conditions similar to ours, i.e. methods that employ scribbles as weak annotations, train the network with a cross-entropy loss computed over all pixel predictions rather than only over the given weak annotations, and do not iterate between the shallow segmentation method and network training with the obtained mask predictions as in ScribbleSup [15]. We also ran the Graph Cut algorithm employed in ScribbleSup [15] in our framework, using the published code (mouse.cs.uwaterloo.ca/code/gco-v3.0.zip) referenced in [15], and report its performance. In fact, our approach can be considered the first iteration of such an iterative scheme, and it could be extended to further iterations by updating the initial scribble annotations with high-confidence network predictions.

Considering the above issues, we compare with methods whose test-set accuracy is reported when their mask predictions are used to train a segmentation network. Specifically, we refer to the performance of the popular methods GrabCut [22], NormalizedCut [23], and KernelCut [24] as reported in [23]. As mentioned in [24, 23], RGB (color) and XY (location) features are concatenated for each pixel in these algorithms, and the segmentation proposals they generate are used to train a VGG16-based DeepLab-Msc-largeFOV network [8]. It is reported in [8] that DeepLab-Msc-largeFOV, which employs atrous convolution and multi-scale prediction, outperforms FCN-8s by around 9% (71.6% vs. 62.2%) on the PASCAL VOC 2012 validation set when trained with full mask annotations, which gives those methods an advantage in this comparison. For a fairer comparison, we therefore also report the performance gap between weak and full mask training in Table 2. In Table 2, the results for full mask training (64.1%), GrabCut [22], NormalizedCut [23], and KernelCut [24] are taken from [23].

Method                        mIoU    Gap between full and weak supervision
With Full Masks [23]          64.1    -
GrabCut [22]                  55.5    8.6
NormalizedCut [23]            58.7    5.4
KernelCut [24]                59.8    4.3
With Full Masks (ours)        61.6    -
GraphCut-MV                   52.25   9.35
CDS-MV (fixed sigma_FH)       59.20   2.40
CDS-MV (best sigma_FH)        60.22   1.38
Table 2: Performance comparison on the PASCAL VOC 2012 val set.

For CDS, we train with mask predictions generated by two different selections of $\sigma_{FH}$: (i) the fixed $\sigma_{FH}$ (corresponding to PredSet-CDS-MV in Table 1); and (ii) a per-image best $\sigma_{FH}$, selected between two candidate values for each image. The segmentation performance on the val set given in Table 2 shows that we outperform the literature works with the per-image best $\sigma_{FH}$ (60.22%), and we are superior for both parameter selections in terms of the gap between full and weak supervision, i.e. we come within 2.4% and 1.38% of our full mask training performance (61.6%) for the fixed and per-image best $\sigma_{FH}$, respectively. Two example images from the generated set, i.e. PredSet, are presented in Figure 1, and Figure 2 shows test examples on the val set for the network trained with PredSet-CDS-MV. In both figures our results are the closest to the ground truth of the input images.

Figure 1: Generated mask predictions (Images for GrabCut [22], Normalized Cut [23], and KernelCut [24] are acquired from [23])
Figure 2: Testing on PASCAL VOC 2012 val set. (Images for GrabCut [22], Normalized Cut [23], and KernelCut [24] are acquired from [23])

5 Conclusions

In this paper we have proposed to apply the Constrained Dominant Sets (CDS) model, which has proved effective compared to state-of-the-art interactive segmentation algorithms, to propagate the weak scribble annotations of a given set of images into multi-labeled full mask predictions. The obtained mask predictions are then used to train a Fully Convolutional Network for semantic segmentation. While CDS had previously been applied to pixelwise binary classification, it had not been explored for semantic segmentation before, and this paper presents our work in this direction. Experimental results show that the proposed approach generates higher-quality full mask predictions than existing methods that have been adopted for weakly-supervised semantic segmentation in the literature.


  • [1] L. T. Alemu and M. Pelillo (2018) Multi-feature fusion for image retrieval using constrained dominant sets. CoRR abs/1808.05075. Cited by: §1.
  • [2] L. T. Alemu, M. Shah, and M. Pelillo (2019) Deep constrained dominant sets for person re-identification. CoRR abs/1904.11397. Cited by: §1.
  • [3] S. Aslan, G. Ciocca, and R. Schettini (2017) On comparing color spaces for food segmentation. In International Conference on Image Analysis and Processing, pp. 435–443. Cited by: §3.2.
  • [4] S. Aslan, G. Ciocca, and R. Schettini (2018) Semantic food segmentation for automatic dietary monitoring. In 2018 IEEE 8th International Conference on Consumer Electronics - Berlin (ICCE-Berlin) (ICCE-Berlin 2018), Cited by: §1.
  • [5] N. Audebert, B. Le Saux, and S. Lefèvre (2016) Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In Asian Conference on Computer Vision, pp. 180–196. Cited by: §1.
  • [6] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei (2016) What’s the point: semantic segmentation with point supervision. In European Conference on Computer Vision, pp. 549–565. Cited by: §1.
  • [7] Y. Boykov and V. Kolmogorov (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (9), pp. 1124–1137. Cited by: §1.
  • [8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations, Cited by: §4.
  • [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §1.
  • [10] J. Dai, K. He, and J. Sun (2015) Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 1635–1643. Cited by: §1, §1, §1.
  • [11] P. F. Felzenszwalb and D. P. Huttenlocher (2004) Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2), pp. 167–181. Cited by: §3.1, §4.
  • [12] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), pp. 991–998. Cited by: §4.
  • [13] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7014–7023. Cited by: §1.
  • [14] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele (2017) Simple does it: weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 876–885. Cited by: §1, §1, §1.
  • [15] D. Lin, J. Dai, J. Jia, K. He, and J. Sun (2016) Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159–3167. Cited by: §1, §1, §1, §3.1, §3.1, §3.2, §4, §4, §4, §4.
  • [16] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §1, §4, §4.
  • [17] M. Oberweger, P. Wohlhart, and V. Lepetit (2015) Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807. Cited by: §1.
  • [18] G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille (2015) Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §1, §1.
  • [19] M. Pavan and M. Pelillo (2007) Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1), pp. 167–172. Cited by: §2, §2, §2.
  • [20] S. Rota Bulò, M. Pelillo, and I. M. Bomze (2011) Graph-based quadratic optimization: a fast evolutionary approach. Computer Vision and Image Understanding 115 (7), pp. 984–995. Cited by: §2.
  • [21] S. Rota Bulò and M. Pelillo (2017) Dominant-set clustering: a review. European Journal of Operational Research 262 (1), pp. 1–13. Cited by: §1, §2.
  • [22] C. Rother, V. Kolmogorov, and A. Blake (2004) Grabcut: interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), Vol. 23, pp. 309–314. Cited by: §1, Figure 1, Figure 2, Table 2, §4.
  • [23] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers (2018) Normalized cut loss for weakly-supervised cnn segmentation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Cited by: §1, §1, Figure 1, Figure 2, Table 2, §4.
  • [24] M. Tang, D. Marin, I. B. Ayed, and Y. Boykov (2016) Normalized cut meets mrf. In European Conference on Computer Vision, pp. 748–765. Cited by: Figure 1, Figure 2, Table 2, §4.
  • [25] M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y. Boykov (2018) On regularized losses for weakly-supervised cnn segmentation. In European Conference on Computer Vision (ECCV), pp. 507–522. Cited by: §1, §1.
  • [26] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171. Cited by: §3.1, §3.2.
  • [27] A. Vedaldi and K. Lenc (2015) MatConvNet – convolutional neural networks for matlab. In Proceeding of the ACM International Conference on Multimedia, Cited by: §4.
  • [28] P. Vernaza and M. Chandraker (2017) Learning random-walk label propagation for weakly-supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7158–7166. Cited by: §1, §1, §1.
  • [29] J. Wang, J. D. MacKenzie, R. Ramachandran, and D. Z. Chen (2016) A deep learning approach for semantic segmentation in histology tissue images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 176–184. Cited by: §1.
  • [30] E. Zemene, L. T. Alemu, and M. Pelillo (2016) Constrained dominant sets for retrieval. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pp. 2568–2573. Cited by: §1.
  • [31] E. Zemene, L. T. Alemu, and M. Pelillo (2018) Dominant sets for “constrained” image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §2, §2, §2, §3.1, §3.2, §3.2, §3.2, §3, §4.
  • [32] E. Zemene and M. Pelillo (2016) Interactive image segmentation using constrained dominant sets. In European Conference on Computer Vision, pp. 278–294. Cited by: §1, §2, §3.