Semantic segmentation is one of the most well-studied research problems in computer vision. The goal is to achieve pixel-level classification, i.e., to label each pixel in a given input image with the class of the object or region that covers it. Predicting the class of each pixel yields complete scene understanding, which is the main problem of a wide range of computer vision applications, e.g. autonomous driving, human-computer interaction [17], earth observation [5], biomedical applications [29], dietary assessment systems [4], etc. The stunning performance of DCNNs (Deep Convolutional Neural Networks) at image classification tasks has encouraged researchers to employ them for pixel-level classification as well. Outstanding methods on well-known benchmarks, e.g. PASCAL VOC 2012, train fully convolutional networks (FCN) under the supervision of fully-annotated ground-truth masks. However, obtaining such precise fully-annotated masks is extremely expensive, and this limits the availability of large-scale annotated training sets for deep learning architectures. To address this issue, recent works have explored supervising DCNN architectures for semantic segmentation with low-cost annotations that are weaker than pixel-level labels, such as image-level labels, point tags [6], bounding boxes [14, 10, 18] and scribbles [15, 28, 23, 25].
Creating weak annotations is much easier than creating full annotations, which helps to obtain large training sets for semantic segmentation. However, these annotations are not as precise as full annotations, and their quality depends on the decisions made by the annotators, which degrades their reliability. Hence, the literature has proposed different strategies for weakly-supervised semantic segmentation to deal with these issues. While a number of works [23, 25] employ a specially designed cost function that takes into account only the initially given true weak annotations at the training stage, the most common approach [10, 15, 14, 18, 28] has been to supervise DCNN architectures with predicted full mask annotations, which are obtained by post-processing the weak annotations.
Among these two strategies, we follow the second one and propose to generate full mask annotations from scribbles with an interactive segmentation technique that has proven to be extremely effective in a variety of computer vision problems, including image and video segmentation [31, 21]. For the same purpose, previous works have used a number of shallow interactive segmentation methods, e.g. variants of GrabCut [22] are used in [18, 14] for propagating bounding box annotations to supervise a convolutional network. To propagate bounding box annotations, [10] proposed to perform iterative optimization between generating full mask approximations and training the network. Using a similar iterative scheme, [15] propagated scribble annotations to superpixels by optimizing the multi-label graph cuts model of [7]. [28] proposed a random-walk based label propagation mechanism to propagate scribble annotations.
In this paper, we aim to explore the potential of Constrained Dominant Sets (CDS) [32, 31, 2, 1, 30] for generating predicted full annotations to be used in the supervision of a convolutional neural network for semantic segmentation. Representing an image as an edge-weighted graph, the main idea of the constrained segmentation approach is to find the collection of dominant set clusters on the graph that are constrained to contain the components of a given annotation. The CDS approach has been applied to co-segmentation and interactive segmentation with bounding box or scribble modalities, and it has been reported to outperform state-of-the-art segmentation techniques such as Graph Cut, Lazy Snapping, Geodesic Segmentation, Random Walker, Transduction, Geodesic Graph Cut and Constrained Random Walker. Motivated by the reported performance for single cluster extraction (i.e. foreground extraction), we use CDS for multiple cluster extraction involving multi-label scribbles on the PASCAL VOC 2012 dataset. Since our goal is mainly to explore the performance of CDS in full mask prediction for weakly-supervised semantic segmentation, we train a basic segmentation network, namely the Fully Convolutional Network (FCN-8s) of [16] based on the VGG16 architecture, and compare our performance with other full mask prediction schemes in the literature that supervise the same type of deep learning architecture. Our experimental results on the standard PASCAL VOC 2012 dataset show the effectiveness of our approach compared to existing algorithms.
2 Constrained Dominant Sets
Dominant Set Framework.
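In the dominant-set clustering framework [19, 21], an input image is represented as an undirected edge-weighted graph with no self-loops $G = (V, E, w)$, where $V = \{1, \dots, n\}$ is the set of vertices that correspond to image points (pixels or superpixels), $E \subseteq V \times V$ is the set of edges that represent the neighborhood relations between vertices, and $w : E \to \mathbb{R}^{*}_{+}$ is the (positive) weight function that represents the similarity between linked node pairs. The graph is represented by a symmetric affinity (or similarity) matrix $A = (a_{ij})$, where $a_{ij} = w(i,j)$ if $(i,j) \in E$ and $a_{ij} = 0$ otherwise.
Next, a weight $w_S(i)$, which is (recursively) defined as in Eq. 1, is assigned to each vertex $i \in S$ of a non-empty subset $S \subseteq V$,
$$ w_S(i) = \begin{cases} 1, & \text{if } |S| = 1, \\ \sum_{j \in S \setminus \{i\}} \phi_{S \setminus \{i\}}(j,i)\, w_{S \setminus \{i\}}(j), & \text{otherwise,} \end{cases} \tag{1} $$
where $\phi_S(i,j)$ denotes the (relative) similarity between nodes $j$ ($j \notin S$) and $i$, with respect to the average similarity between node $i$ and its neighbours in $S$ (defined by $\phi_S(i,j) = a_{ij} - \frac{1}{|S|} \sum_{k \in S} a_{ik}$).
A positive $w_S(i)$ indicates that adding $i$ to its neighbours in $S$ increases the internal coherence of the set, while a negative value means that the overall coherence decreases. Based on these definitions, and writing $W(S) = \sum_{i \in S} w_S(i)$ for the total weight of $S$, a non-empty subset of vertices $S \subseteq V$ such that $W(T) > 0$ for any non-empty $T \subseteq S$ is said to be a dominant set if it is a maximally coherent data set, i.e. if it satisfies the two basic properties of a cluster: internal coherence ($w_S(i) > 0$, for all $i \in S$) and external incoherence ($w_{S \cup \{i\}}(i) < 0$, for all $i \notin S$).
Consider the following linearly-constrained quadratic optimization problem,
$$ \text{maximize } f(\mathbf{x}) = \mathbf{x}^{\top} A \mathbf{x} \quad \text{subject to } \mathbf{x} \in \Delta, \tag{2} $$
where $\mathbf{x}^{\top}$ is the transposition of the vector $\mathbf{x}$ and $\Delta$ is the standard simplex of $\mathbb{R}^n$, defined as $\Delta = \{ \mathbf{x} \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1 \text{ and } x_i \ge 0 \text{ for all } i = 1, \dots, n \}$. Under the assumption that the affinity matrix $A$ is symmetric, it is shown in [19] that if $S$ is a dominant set, then its weighted characteristic vector $\mathbf{x}^S$, defined as in Eq. 3, is a strict local solution of the Standard Quadratic Program in Eq. 2.
$$ x_i^S = \begin{cases} \dfrac{w_S(i)}{W(S)}, & \text{if } i \in S, \\ 0, & \text{otherwise.} \end{cases} \tag{3} $$
Conversely, if $\mathbf{x}^{*}$ is a strict local solution to Eq. 2, then its support $\sigma(\mathbf{x}^{*}) = \{ i \in V : x_i^{*} > 0 \}$ is a dominant set of $G$. Thus, a dominant set can be found by localizing a solution of Eq. 2 with a continuous optimization technique and gathering the support set of the found solution. Notice that the value of a component in the found $\mathbf{x}^{*}$ provides a measure of how strongly that component contributes to the cohesiveness of the cluster.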
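The link between the quadratic program and its support can be illustrated with a minimal sketch. It assumes a hypothetical 5-node affinity matrix (the values are made up for illustration) and uses plain replicator dynamics as the continuous optimization technique; the function names `replicator_dynamics` and `support` are illustrative, not taken from any published implementation.

```python
def replicator_dynamics(A, max_iter=10000, tol=1e-12):
    """Iterate x_i <- x_i * (Ax)_i / (x'Ax), starting from the simplex barycenter."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        xAx = sum(x[i] * Ax[i] for i in range(n))
        if xAx <= 0.0:  # no positive affinity left to exploit
            break
        new_x = [x[i] * Ax[i] / xAx for i in range(n)]
        change = sum(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x
        if change < tol:
            break
    return x


def support(x, eps=1e-6):
    """Indices whose component is numerically positive."""
    return [i for i, value in enumerate(x) if value > eps]


# Hypothetical toy graph: nodes 0-2 are tightly connected, nodes 3-4 are loosely attached.
A = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.85, 0.0, 0.1],
    [0.8, 0.85, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.2],
    [0.0, 0.1, 0.0, 0.2, 0.0],
]
x = replicator_dynamics(A)
dominant_set = support(x)
```

On this toy graph the mass concentrates on the three tightly connected nodes, and the surviving components of `x` indicate how strongly each of them contributes to the cluster.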
Constrained Dominant Set Framework.
In [32, 31] the notion of a constrained dominant set is introduced, which aims at finding a dominant set constrained to contain vertices from a given seed set $S \subseteq V$. Based on the edge-weighted graph definition with affinity matrix $A$, a parameterized family of quadratic programs is defined as in Eq. 4 for the set $S$ and a parameter $\alpha > 0$,
$$ \text{maximize } f_S^{\alpha}(\mathbf{x}) = \mathbf{x}^{\top} (A - \alpha \hat{I}_S)\, \mathbf{x} \quad \text{subject to } \mathbf{x} \in \Delta, \tag{4} $$
where $\hat{I}_S$ is the $n \times n$ diagonal matrix whose diagonal elements are set to 1 if the corresponding vertices are in $V \setminus S$ and to 0 otherwise. It is theoretically proven, and empirically illustrated for interactive image segmentation, that if $S$ is the set of vertices selected by the user, then setting $\alpha > \lambda_{\max}(A_{V \setminus S})$ guarantees that all local solutions of Eq. 4 have a support that necessarily contains at least one element of $S$. Here, $\lambda_{\max}(A_{V \setminus S})$ is the largest eigenvalue of the principal submatrix of $A$ indexed by the elements of $V \setminus S$.
Earlier work used Replicator Dynamics, which was developed and studied in evolutionary game theory, to solve such quadratic programs. In this work we use Infection and Immunization Dynamics (InImDyn) [20], which has proved to be a faster and equally accurate alternative.
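The constrained program can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the InImDyn solver used in this work: it relies on plain replicator dynamics, it replaces the exact eigenvalue by the largest row sum of the submatrix indexed by the non-seed vertices (an upper bound on the largest eigenvalue of a non-negative matrix, hence a valid but conservative choice of the penalty parameter), and it adds a constant to every matrix entry so that the iteration stays well defined. The toy affinities and the function name `constrained_replicator` are hypothetical.

```python
def constrained_replicator(A, seeds, max_iter=20000, tol=1e-12):
    """Maximize x'(A - alpha * I_hat_S)x over the simplex with replicator dynamics."""
    n = len(A)
    seed_set = set(seeds)
    rest = [i for i in range(n) if i not in seed_set]
    # Largest row sum of the non-seed submatrix bounds its largest eigenvalue,
    # so alpha chosen slightly above it satisfies the required inequality.
    alpha = max((sum(A[i][j] for j in rest) for i in rest), default=0.0) + 1e-3
    # Adding alpha to every entry changes the objective by a constant on the
    # simplex, so the maximizers are unchanged and the matrix stays non-negative.
    B = [[A[i][j] + alpha for j in range(n)] for i in range(n)]
    for i in rest:
        B[i][i] -= alpha  # penalty on the diagonal of the non-seed vertices
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Bx = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        xBx = sum(x[i] * Bx[i] for i in range(n))
        new_x = [x[i] * Bx[i] / xBx for i in range(n)]
        change = sum(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x
        if change < tol:
            break
    return x, alpha


# Same hypothetical toy graph as a 5-node example; vertex 3 plays the role of a scribble.
A = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.85, 0.0, 0.1],
    [0.8, 0.85, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.2],
    [0.0, 0.1, 0.0, 0.2, 0.0],
]
x, alpha = constrained_replicator(A, seeds=[3])
constrained_support = [i for i, value in enumerate(x) if value > 1e-6]
```

Although vertex 3 is only weakly connected, the penalty on the non-seed vertices forces the extracted support to include it, which is the behaviour the constraint is meant to enforce.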
3 Proposed approach
We propose to generate full mask predictions (to be used for supervising a semantic segmentation network) by post-processing weak annotations, i.e. scribble annotations, using CDS. Moreover, we propose to use CDS for multiclass clustering of pixels, i.e. semantic segmentation, while previously CDS has been used only for interactive foreground segmentation [31, 32].
3.1 Preprocessing step for CDS
Superpixel generation. A common approach in image segmentation has been to use superpixels instead of image pixels as input entities. A superpixel is a group of pixels with similar colors; using superpixels not only reduces the computational complexity but also allows features to be computed on meaningful regions. Among a variety of techniques, e.g. SLIC and Oriented Watershed Transform (OWT), we have preferred the method developed by Felzenszwalb and Huttenlocher [11], similarly to [26], which is a fast and publicly available algorithm. The method of Felzenszwalb and Huttenlocher [11] has also been used in another weakly-supervised semantic segmentation framework [15] that experiments on the same dataset as ours. The method proposed in [11] is a graph-based segmentation scheme in which a graph is constructed for an image such that each element to be segmented is a vertex of the graph and the dissimilarity, i.e. color difference, between two vertices constitutes a weighted edge. The vertices (or subgraphs) are progressively merged according to the merging criterion given in Eq. 5,
$$ w(e_{ij}) \le \min\left( \max_{e \in \mathrm{MST}(C_i)} w(e) + \tau(C_i),\; \max_{e \in \mathrm{MST}(C_j)} w(e) + \tau(C_j) \right), \tag{5} $$
where $e_{ij}$ is the edge between two subgraphs $C_i$ and $C_j$, $w(e)$ is the weight on edge $e$ and $\mathrm{MST}(C)$ is the minimum spanning tree of $C$.
Here, $\tau(C) = k / |C|$ is a threshold function in which $k$ is decided by the user, i.e. high values of $k$ yield a lower number of (large) segments, and vice versa. Another parameter given by the user is the smoothing factor (which we denote by $\sigma_{FH}$) of the Gaussian kernel that is used to smooth the image at the preprocessing step.
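The effect of the threshold parameter in the merging criterion can be seen in a small sketch. It assumes a hypothetical one-dimensional signal whose neighbouring samples are joined by edges weighted with their absolute difference, and it omits the Gaussian pre-smoothing step; `fh_segment` is an illustrative name, not the published implementation.

```python
def fh_segment(num_nodes, edges, k):
    """Greedy graph merging: join two components when the connecting edge is no
    heavier than the smaller of (internal difference + k / size) of the two."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # largest MST edge weight inside each component

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for w, u, v in sorted(edges):  # process edges by increasing weight
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if w <= min(internal[ru] + k / size[ru], internal[rv] + k / size[rv]):
            parent[rv] = ru
            size[ru] += size[rv]
            internal[ru] = w  # edges arrive sorted, so w is the new maximum
    return [find(u) for u in range(num_nodes)]


# Hypothetical signal with two flat-ish regions separated by a large jump.
signal = [0, 1, 0, 1, 10, 11, 10, 11]
edges = [(abs(signal[i] - signal[i + 1]), i, i + 1) for i in range(len(signal) - 1)]

labels_small_k = fh_segment(len(signal), edges, k=2)
labels_large_k = fh_segment(len(signal), edges, k=100)
```

With the small threshold the jump between the two regions is preserved, while the large threshold merges everything into a single segment, in line with the statement that high values yield fewer and larger segments.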
Feature extraction. Once the superpixels are generated on the image, a feature vector is computed for each superpixel. In the application of the CDS model to interactive image segmentation in [31], the median color of all pixels in the RGB, HSV and L*a*b* color spaces and the responses of the Leung-Malik (LM) Filter Bank are concatenated in the feature extraction process. Differently from [31], we compute the same feature types as ScribbleSup [15], which was evaluated on the same dataset as ours, namely the color and texture histograms denoted by $h_c$ and $h_t$ in Eq. 6. More specifically, for the superpixel $i$, $h_c(i)$ is a histogram computed on the color space using 25 bins and $h_t(i)$ is a histogram of gradients at the horizontal and vertical orientations, where 10 bins are used for each orientation.
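A minimal sketch of such histogram features is given below. Several details are assumptions made only for illustration: a single-channel image with values in [0, 1], forward-difference gradients, and histograms normalized to sum to one. The names `histogram` and `superpixel_features` are hypothetical.

```python
def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over [lo, hi] with `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp values on or outside the range
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def superpixel_features(image, labels, target):
    """Color histogram (25 bins) and gradient histograms (10 bins per orientation)
    for the superpixel whose label equals `target`."""
    h, w = len(image), len(image[0])
    colors, grad_x, grad_y = [], [], []
    for r in range(h):
        for c in range(w):
            if labels[r][c] != target:
                continue
            colors.append(image[r][c])
            # Forward differences; the border pixel is replicated.
            grad_x.append(image[r][min(c + 1, w - 1)] - image[r][c])
            grad_y.append(image[min(r + 1, h - 1)][c] - image[r][c])
    h_c = histogram(colors, 25, 0.0, 1.0)
    h_t = histogram(grad_x, 10, -1.0, 1.0) + histogram(grad_y, 10, -1.0, 1.0)
    return h_c, h_t


# Hypothetical 2x4 image split into two superpixels (label 0 on the left, 1 on the right).
image = [[0.1, 0.1, 0.9, 0.9],
         [0.1, 0.1, 0.9, 0.9]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
h_c, h_t = superpixel_features(image, labels, target=0)
```

The color histogram has 25 entries and the texture descriptor concatenates two 10-bin histograms, one per gradient orientation.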
3.2 Application of CDS for full mask predictions
In order to generate full mask predictions using the CDS model, an input image is represented as a graph where the vertices depict the superpixels of the image and the edge weights between vertices reflect the similarity between the corresponding superpixels. We use scribbles as the given weak annotations in this work, and they serve as constraints in the CDS implementation. Previously, CDS has been applied to interactive foreground segmentation [31], where dominant set clusters covering a set of given nodes for a single object class were explored. In this work, our problem requires multiclass clustering of pixels. Hence, here $S_c$ represents the manually selected pixels of the class $c$, where $c \in \{1, \dots, C\}$ and $C$ is the number of classes in the dataset, e.g. $C = 21$ (20 object classes plus background) for PASCAL VOC 2012.
Accordingly, for each class of scribbles that exists in a given image, we ignore the remaining classes in the image and perform foreground segmentation, i.e. 2-class clustering of image pixels, as in [31] by computing its constrained dominant sets. Thus, for the class $c$, the union of the extracted dominant sets, i.e. $D_1 \cup \dots \cup D_L$ if $L$ dominant sets are extracted which contain elements of the set $S_c$, represents the segmented regions of the objects in class $c$. We then repeat this process for every class that exists in the image using the corresponding scribble information. If a node, i.e. superpixel, is found in the extracted sets of more than one class, we assign it to the class with the highest value in its weighted characteristic vector, which is found by solving the quadratic program in Eq. 4 with InImDyn (see Section 2).
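The conflict-resolution rule can be sketched as follows, assuming that one weighted characteristic vector per class is already available (for instance, the component-wise maximum over the dominant sets extracted for that class). How superpixels claimed by no class are handled is not specified above; marking them with a hypothetical `unlabeled` value is an assumption of this sketch, as are the function name and the example values.

```python
def assign_labels(class_vectors, eps=1e-6, unlabeled=-1):
    """Give each superpixel the class whose characteristic vector has the
    largest positive component at that superpixel."""
    n = len(next(iter(class_vectors.values())))
    labels = [unlabeled] * n
    best = [0.0] * n
    for cls, x in class_vectors.items():
        for i, value in enumerate(x):
            if value > eps and value > best[i]:
                best[i] = value
                labels[i] = cls
    return labels


# Hypothetical result for an image with two scribbled classes (0 and 15)
# over four superpixels; superpixel 2 is claimed by both classes.
class_vectors = {
    0: [0.5, 0.3, 0.2, 0.0],
    15: [0.0, 0.0, 0.4, 0.6],
}
superpixel_labels = assign_labels(class_vectors)
```

Superpixel 2 is claimed by both classes and goes to class 15 because its component is larger there.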
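Computation of the Affinity matrix. Before computing the CDS clusters, the affinity (or similarity) between superpixels should be computed to construct the matrix $A$ in Eq. 4. In [31], dissimilarity measurements are transformed to affinity space by using the Gaussian kernel $A^{\sigma}_{ij} = \mathbb{1}_{i \neq j} \exp\left( -\|f_i - f_j\|^2 / 2\sigma^2 \right)$, where $f_i$ is the feature vector of the superpixel $i$, $\sigma$ is the scale parameter of the Gaussian kernel and $\mathbb{1}_P = 1$ if $P$ is true, 0 otherwise. Differently from [31], we use the Gaussian kernel in Eq. 6, where different $\sigma$ values are used for different feature types,
$$ a_{ij} = \mathbb{1}_{i \neq j} \exp\left( -\frac{\|h_c(i) - h_c(j)\|^2}{\sigma_c^2} - \frac{\|h_t(i) - h_t(j)\|^2}{\sigma_t^2} \right), \tag{6} $$
where $h_c(i)$ and $h_t(i)$ are the color and texture histograms of the superpixel $i$, and $\sigma_c$ and $\sigma_t$ are the corresponding scale parameters. The kernel in Eq. 6 is also adopted in [15], which experiments on the same dataset and uses the same feature types as ours.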
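A two-feature Gaussian kernel of this kind can be sketched as below. The exact scaling of the exponents and the parameter names `sigma_c` and `sigma_t` are assumptions of this sketch (one scale per feature type, zero diagonal), and the histograms are hypothetical toy values.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def affinity_matrix(color_hists, texture_hists, sigma_c, sigma_t):
    """Symmetric affinity with one Gaussian scale per feature type and zero diagonal."""
    n = len(color_hists)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            exponent = (sq_dist(color_hists[i], color_hists[j]) / sigma_c ** 2
                        + sq_dist(texture_hists[i], texture_hists[j]) / sigma_t ** 2)
            A[i][j] = A[j][i] = math.exp(-exponent)
    return A


# Hypothetical histograms: superpixels 0 and 1 look alike, superpixel 2 differs in color.
color_hists = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
texture_hists = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
A = affinity_matrix(color_hists, texture_hists, sigma_c=0.5, sigma_t=1.0)
```

Identical descriptors give the maximal affinity of 1, while a smaller color scale makes the kernel fall off faster for color differences than for texture differences.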
Using different color spaces. The quality of the generated superpixels directly affects the performance of the segmentation algorithm, and a number of segmentation works (including, but not limited to, [26, 3]) have emphasized that higher segmentation performance can be obtained by using different color transformations of the input image to deal with different scene and lighting conditions. Motivated by these studies [26, 3], we compute superpixels in a variety of color spaces with a range of invariance properties. Specifically, we use five color spaces that were also used in [26] for determining high-quality object locations by employing segmentation as a selective search strategy; they include Intensity (grey-scale image), $rgI$, which denotes the $rg$ channels of normalized $RGB$ plus intensity, and $H$, which denotes the Hue channel of $HSV$. We generate superpixels and compute mask predictions using the CDS model for each color space of the input image, and then decide the final label of a pixel as its most frequently occurring class label, i.e. by majority voting. In addition to using different color spaces, we also vary the threshold parameter of Eq. 5 to benefit from a larger set of diversifications, as recommended in [26].
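Pixel-wise majority voting over several predicted maps can be sketched as follows. The tie-breaking rule (the label seen first among the tied ones wins) is an assumption of this sketch, and the three tiny label maps are hypothetical.

```python
from collections import Counter


def majority_vote(label_maps):
    """Pick, for every pixel, the label predicted by most of the input maps."""
    height, width = len(label_maps[0]), len(label_maps[0][0])
    voted = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            votes = Counter(label_map[r][c] for label_map in label_maps)
            # most_common keeps first-seen order among equally frequent labels
            voted[r][c] = votes.most_common(1)[0][0]
    return voted


# Hypothetical 1x3 predictions obtained from three different color spaces.
maps = [
    [[0, 15, 15]],
    [[0, 0, 15]],
    [[8, 0, 15]],
]
final_map = majority_vote(maps)
```

The same routine applies unchanged when the set of maps also includes predictions obtained with different threshold values.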
Dataset and evaluation. We trained the models on the augmented PASCAL VOC training set of 10582 images [12] and evaluated them on the validation set of 1449 images. We used the scribble annotations published in [15]. In what follows, accuracy is evaluated using pixel accuracy ($\sum_i n_{ii} / \sum_i t_i$), mean accuracy ($\frac{1}{n_{cl}} \sum_i n_{ii} / t_i$) and mean Intersection over Union ($\frac{1}{n_{cl}} \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$) as in [16], where $n_{ij}$ is the number of pixels of class $i$ predicted to belong to class $j$, $n_{cl}$ is the number of different classes, and $t_i = \sum_j n_{ij}$ is the total number of pixels of class $i$.
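The three measures can be computed from a confusion matrix as in the sketch below. Averaging only over classes that occur in the ground truth is an assumption made here to avoid division by zero; the 2-class confusion matrix is a hypothetical example.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels of class i predicted as class j.
    Returns (pixel accuracy, mean accuracy, mean IoU)."""
    n_cl = len(conf)
    t = [sum(row) for row in conf]  # ground-truth pixels per class
    predicted = [sum(conf[i][j] for i in range(n_cl)) for j in range(n_cl)]
    diag = [conf[i][i] for i in range(n_cl)]
    present = [i for i in range(n_cl) if t[i] > 0]  # classes seen in the ground truth
    pixel_acc = sum(diag) / sum(t)
    mean_acc = sum(diag[i] / t[i] for i in present) / len(present)
    mean_iou = sum(diag[i] / (t[i] + predicted[i] - diag[i]) for i in present) / len(present)
    return pixel_acc, mean_acc, mean_iou


# Hypothetical confusion matrix for two classes with ten pixels each.
conf = [[8, 2],
        [1, 9]]
pixel_acc, mean_acc, mean_iou = segmentation_metrics(conf)
```

For this matrix the per-class IoU values are 8/11 and 9/12, so the mean IoU is lower than the pixel accuracy of 0.85, which is typical because IoU also penalizes false positives.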
We initialized the network with an ImageNet-pretrained model, i.e. VGG-VD-16. We trained with SGD with momentum and, similarly to [16], we used a momentum of 0.9, weight decay of , a mini-batch size of 20 images and a learning rate of . With these hyperparameters, we observed that the pixel accuracy converged on the validation set.
The performance of CDS is sensitive to the selection of the scale parameter $\sigma$ of the Gaussian kernel (see Section 3.2), and in [31] three different results are reported for different selections of $\sigma$: 1) CDSBestSigma, where the best $\sigma$ is selected separately for every image; 2) CDSSingleSigma, where a single $\sigma$ is selected by searching in a fixed range, i.e. between 0.05 and 0.2; 3) CDSSelfTuning, where $\sigma^2$ is replaced by $\sigma_i \sigma_j$, with $\sigma_i = \mathrm{mean}(\mathrm{KNN}(f_i))$, i.e. the mean of the K-nearest neighbors of the sample $f_i$, and $K$ fixed to 7. To decide the values of the two scale parameters in Eq. 6, we followed the CDSBestSigma strategy of [31]. Additionally, in the graph structure we cut the edges between vertices that correspond to non-adjacent superpixels by setting the corresponding entries of the affinity matrix to zero, as done in prior work, which provided better segmentation maps. We then min-max normalized the matrix and symmetrized it.
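The post-processing of the affinity matrix described above can be sketched as follows. The target range of the min-max normalization (here [0, 1]) and symmetrization by averaging the matrix with its transpose are assumptions of this sketch; `postprocess_affinity` and the toy inputs are hypothetical.

```python
def postprocess_affinity(A, adjacent):
    """Zero the affinities of non-adjacent superpixels, min-max normalize the
    result (assumed target range [0, 1]) and symmetrize it by averaging."""
    n = len(A)
    B = [[A[i][j] if adjacent[i][j] else 0.0 for j in range(n)] for i in range(n)]
    values = [v for row in B for v in row]
    lo, hi = min(values), max(values)
    if hi > lo:
        B = [[(v - lo) / (hi - lo) for v in row] for row in B]
    return [[0.5 * (B[i][j] + B[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical chain of three superpixels: 0-1 and 1-2 touch, 0-2 do not.
A = [[0.0, 0.4, 0.2],
     [0.4, 0.0, 0.8],
     [0.2, 0.8, 0.0]]
adjacent = [[False, True, False],
            [True, False, True],
            [False, True, False]]
A_post = postprocess_affinity(A, adjacent)
```

After this step only spatially neighbouring superpixels can reinforce each other, so a dominant set can grow from a scribble only through connected regions.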
Performance evaluation. We first explored the effect of different color spaces on the predicted full annotations of the 10582 training images (denoted by PredSet, for Predicted Set, in Table 1) before training the network with them. Then, by training the network with the Predicted Sets, we report performance on the Test Set, i.e. the PASCAL VOC 2012 val set. In the superpixel generation of [11] we used a smoothing factor of (FH stands for Felzenszwalb and Huttenlocher [11]) in the experiments of Table 1. For each color space we performed majority voting (denoted by MV in both Tables 1 and 2) over the maps obtained with (in Eq. 5).
Table 1 shows that using different color spaces affects the quality of the predicted full annotations (PredSet), and that among the individual color spaces the highest-quality mask predictions in terms of mIoU are obtained with Intensity (66.51%). Performing majority voting over the maps obtained in all color spaces provided the highest-quality mask predictions for both CDS (73.28%) and GraphCut (63.51%). We then trained the network with the predicted sets of CDS-Intensity, CDS-MV and GraphCut-MV and with the published full mask annotations, and present their performance on the test set in Table 1. Using CDS-MV in training significantly outperforms GraphCut (which was employed in [15]) and comes close to the performance of training with fully-annotated masks (59.2% vs. 61.6%).
| Color space | mean IoU | Pixel Acc. | mean Acc. |
|---|---|---|---|
| TestSet-With Full Masks | 61.60 | 90.27 | 78.95 |
Comparison with other full-mask prediction methods. A large variety of interactive segmentation algorithms can be used for full mask prediction to train a semantic segmentation network. To be as fair as possible, we compare with the reported performance of methods evaluated under conditions similar to ours, e.g. those that employ scribbles as weak annotations, train the network with a cross-entropy loss computed over all pixel predictions rather than only over the given weak annotations, and do not iterate between the shallow segmentation method and network training with the obtained mask predictions as ScribbleSup [15] does. On the other hand, we ran the Graph Cut algorithm employed in ScribbleSup [15] within our framework, using the published code (mouse.cs.uwaterloo.ca/code/gco-v3.0.zip) referred to in [15], and present its performance. In fact, our approach can be considered as the first iteration step of such an iterative scheme, and it can be extended to further iterations by updating the initial scribble annotations with the network predictions obtained with high confidence.
Considering the above issues, we compare with methods whose accuracy on the test set is reported when their mask predictions are used to train a segmentation network. Specifically, we refer to the performance of the popular methods GrabCut [22], NormalizedCut and KernelCut reported in [23]. It is mentioned in [24, 23] that, for each image pixel, RGB (color) and XY (location) features are concatenated to be used in these algorithms. The segmentation proposals generated by them are then used to train a VGG16-based DeepLab-Msc-largeFOV network [8]. It is reported in [8] that DeepLab-Msc-largeFOV, which employs atrous convolution and multiscale prediction, outperforms FCN-8s by around 9% (71.6% vs. 62.2%) on the PASCAL VOC 2012 validation set when trained with full mask annotations, which gives the compared methods an advantage. For this reason, we also present the performance gap between weak and full mask training in Table 2 to provide a fairer comparison. In Table 2, the results of full mask training (64.1%), GrabCut, NormalizedCut and KernelCut are taken from [23].
| Method | mIoU | Gap between full and weak supervision |
|---|---|---|
| With Full Masks (DeepLab-Msc-largeFOV, reported) | 64.1 | - |
| With Full Masks (FCN-8s, ours) | 61.6 | - |
For CDS, we train with mask predictions generated by two different selections of : (i) (corresponding to PredSet-CDS-MV in Table 1); and (ii) , where we selected the best among and for each image. The segmentation performance on the val set given in Table 2 shows that we outperform the literature works at (60.22%), and that we are superior for both parameter selections in terms of the performance gap between full and weak supervision, i.e. we come within 2.4% and 1.38% of the performance of our full mask training (61.6%) at the selections of and , respectively. Two example images from the generated set, i.e. PredSet, of are presented in Figure 1. Figure 2 shows examples from testing on the val set when the network is trained with PredSet-CDS-MV. Figures 1 and 2 show that our results are the closest to the ground truth of the input images.
In this paper we have proposed to apply the Constrained Dominant Set (CDS) model, which has been shown to be effective compared to state-of-the-art interactive segmentation algorithms, to propagate the weak scribble annotations of a given set of images and obtain multi-label full mask predictions for them. The resulting mask predictions are then used to train a Fully Convolutional Network for semantic segmentation. While CDS has previously been applied to pixelwise binary classification, it had not been explored for semantic segmentation, and this paper presents our work in this direction. Experimental results show that the proposed approach generates higher-quality full mask predictions than the existing methods that have been adopted for weakly-supervised semantic segmentation in the literature.
References
[1] Multi-feature fusion for image retrieval using constrained dominant sets. CoRR abs/1808.05075.
[2] (2019) Deep constrained dominant sets for person re-identification. CoRR abs/1904.11397.
[3] (2017) On comparing color spaces for food segmentation. In International Conference on Image Analysis and Processing, pp. 435–443.
[4] (2018) Semantic food segmentation for automatic dietary monitoring. In 2018 IEEE 8th International Conference on Consumer Electronics - Berlin (ICCE-Berlin 2018).
[5] (2016) Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In Asian Conference on Computer Vision, pp. 180–196.
[6] (2016) What’s the point: semantic segmentation with point supervision. In European Conference on Computer Vision, pp. 549–565.
[7] (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (9), pp. 1124–1137.
[8] (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations.
[9] The cityscapes dataset for semantic urban scene understanding. IEEE conference on computer vision and pattern recognition, pp. 3213–3223.
[10] (2015) Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 1635–1643.
[11] (2004) Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2), pp. 167–181.
[12] (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), pp. 991–998.
[13] (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7014–7023.
[14] (2017) Simple does it: weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 876–885.
[15] (2016) Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159–3167.
[16] (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[17] Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807.
[18] Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In The IEEE International Conference on Computer Vision (ICCV).
[19] (2007) Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1), pp. 167–172.
[20] (2011) Graph-based quadratic optimization: a fast evolutionary approach. Computer Vision and Image Understanding 115 (7), pp. 984–995.
[21] (2017) Dominant-set clustering: a review. European Journal of Operational Research 262 (1), pp. 1–13.
[22] (2004) Grabcut: interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), Vol. 23, pp. 309–314.
[23] (2018) Normalized cut loss for weakly-supervised cnn segmentation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City.
[24] (2016) Normalized cut meets mrf. In European Conference on Computer Vision, pp. 748–765.
[25] (2018) On regularized losses for weakly-supervised cnn segmentation. In European Conference on Computer Vision (ECCV), pp. 507–522.
[26] (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171.
[27] (2015) MatConvNet – convolutional neural networks for matlab. In Proceeding of the ACM International Conference on Multimedia.
[28] (2017) Learning random-walk label propagation for weakly-supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7158–7166.
[29] (2016) A deep learning approach for semantic segmentation in histology tissue images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 176–184.
[30] (2016) Constrained dominant sets for retrieval. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pp. 2568–2573.
[31] (2018) Dominant sets for “constrained” image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[32] (2016) Interactive image segmentation using constrained dominant sets. In European Conference on Computer Vision, pp. 278–294.