1 Introduction
Semantic segmentation is one of the most well-studied research problems in computer vision. The goal is pixel-level classification, i.e., to label each pixel in a given input image with the class of the object or region that covers it. Predicting the class of each pixel yields a complete scene understanding, which is the main requirement of a wide range of computer vision applications, e.g., autonomous driving
[9], human-computer interaction [17], earth observation [5], biomedical applications [29], dietary assessment systems [4], etc. The impressive performance of Deep Convolutional Neural Networks (DCNNs) on image classification tasks has encouraged researchers to employ them for pixel-level classification as well. The best-performing methods on well-known benchmarks, e.g., PASCAL VOC 2012, train fully convolutional networks (FCNs) under the supervision of fully-annotated ground-truth masks. However, obtaining such precise fully-annotated masks is extremely expensive, which limits the availability of large-scale annotated training sets for deep learning architectures. To address this issue, recent works have explored supervising DCNN architectures for semantic segmentation with low-cost annotations such as image-level labels
[13], point tags [6], bounding boxes [14, 10, 18] and scribbles [15, 28, 23, 25], which are weaker than pixel-level labels. Creating weak annotations is much easier than creating full annotations, which helps to obtain large training sets for semantic segmentation. However, such annotations are not as precise as full annotations, and their quality depends on the decisions made by the annotators, which degrades their reliability. Hence, the literature proposes different strategies for weakly-supervised semantic segmentation to deal with these issues. While a number of works [23, 25] employ a cost function that takes into account only the initially given true weak annotations at the training stage, another and the most common approach [10, 15, 14, 18, 28] has been to supervise DCNN architectures with predicted full mask annotations, obtained by post-processing the weak annotations.
Among these two strategies, we follow the second one and propose to generate full mask annotations from scribbles with an interactive segmentation technique, a family of methods that has proven extremely effective in a variety of computer vision problems including image and video segmentation [31, 21]. For the same purpose, previous works have used a number of shallow interactive segmentation methods; e.g., variants of GrabCut [22] are used in [18, 14] to propagate bounding box annotations for supervising a convolutional network. To propagate bounding box annotations, [10] proposed to iterate between generating full mask approximations and training the network. Using a similar iterative scheme, [15] propagated scribble annotations via superpixels by optimizing the multi-label graph cuts model of [7]. [28] proposed a random-walk based label propagation mechanism to propagate scribble annotations.
In this paper, we aim to explore the potential of Constrained Dominant Sets (CDS) [32, 31, 2, 1, 30] for generating predicted full annotations to be used in the supervision of a convolutional neural network for semantic segmentation. Representing images with an edge-weighted graph structure, the main idea of the constrained segmentation approach in [31] is to find the collection of dominant-set clusters on the graph that are constrained to contain the components of a given annotation. The CDS approach has been applied to co-segmentation and interactive segmentation using bounding box or scribble modalities, and its superiority over state-of-the-art segmentation techniques such as Graph Cut, Lazy Snapping, Geodesic Segmentation, Random Walker, Transduction, Geodesic Graph Cut and Constrained Random Walker was demonstrated in [31]. Motivated by the reported performance for single cluster extraction (i.e., foreground extraction) in [31], we use CDS for multiple cluster extraction involving multi-label scribbles on the PASCAL VOC 2012 dataset. Since our goal is mainly to explore the performance of CDS for full mask prediction in weakly-supervised semantic segmentation, we train a basic segmentation network, namely the Fully Convolutional Network (FCN-8s) of [16] based on the VGG-16 architecture, and compare our performance with other full mask prediction schemes in the literature that supervise the same type of deep learning architecture. Our experimental results on the standard PASCAL VOC 2012 dataset show the effectiveness of our approach compared to existing algorithms.
2 Constrained Dominant Sets
Dominant Set Framework.
In the dominant-set clustering framework [19, 21], an input image is represented as an undirected edge-weighted graph with no self-loops $G = (V, E, w)$, where $V = \{1, \dots, n\}$ is the set of vertices that correspond to image points (pixels or superpixels), $E \subseteq V \times V$ is the set of edges that represent the neighborhood relations between vertices, and $w : E \to \mathbb{R}_{+}$ is the (positive) weight function that represents the similarity between linked node pairs. The graph is represented by a symmetric affinity (or similarity) matrix $A = (a_{ij})$, where $a_{ij} = w(i, j)$ if $(i, j) \in E$ and $a_{ij} = 0$ otherwise.
Next, a weight $w_S(i)$, which is (recursively) defined as in Eq. 1, is assigned to each vertex $i$ of a nonempty subset $S \subseteq V$,

$w_S(i) = \begin{cases} 1, & \text{if } |S| = 1, \\ \sum_{j \in S \setminus \{i\}} \phi_{S \setminus \{i\}}(j, i)\, w_{S \setminus \{i\}}(j), & \text{otherwise,} \end{cases}$  (1)

where $\phi_S(i, j) = a_{ij} - \frac{1}{|S|} \sum_{k \in S} a_{ik}$ denotes the (relative) similarity between nodes $i$ and $j$, with respect to the average similarity between node $i$ and its neighbours in $S$. The total weight of $S$ is defined as $W(S) = \sum_{i \in S} w_S(i)$.

A positive $w_S(i)$ indicates that adding $i$ to its neighbours in $S$ will increase the internal coherence of the set, while a negative value indicates that the overall coherence decreases. Based on the aforementioned definitions, a nonempty subset of vertices $S \subseteq V$ such that $W(T) > 0$ for any nonempty $T \subseteq S$ is said to be a dominant set if it is a maximally coherent data set, i.e., it satisfies the two basic properties of a cluster: internal coherence ($w_S(i) > 0$ for all $i \in S$) and external incoherence ($w_{S \cup \{i\}}(i) < 0$ for all $i \notin S$).
Consider the following linearly-constrained quadratic optimization problem,

maximize $f(\mathbf{x}) = \mathbf{x}^\top A\, \mathbf{x}$
subject to $\mathbf{x} \in \Delta$  (2)

where $\mathbf{x}^\top$ is the transpose of the vector $\mathbf{x}$ and $\Delta$ is the standard simplex of $\mathbb{R}^n$, defined as $\Delta = \{\mathbf{x} \in \mathbb{R}^n : x_i \ge 0 \text{ for all } i,\ \sum_{i=1}^{n} x_i = 1\}$. Under the assumption that the affinity matrix $A$ is symmetric, it is shown in [19] that if $S$ is a dominant set, then its weighted characteristic vector $\mathbf{x}^S$, defined as in Eq. 3, is a strict local solution of the Standard Quadratic Program in Eq. 2,

$x_i^S = \begin{cases} \frac{w_S(i)}{W(S)}, & \text{if } i \in S, \\ 0, & \text{otherwise.} \end{cases}$  (3)

Conversely, if $\mathbf{x}^*$ is a strict local solution of Eq. 2, then its support $\sigma(\mathbf{x}^*) = \{i \in V : x_i^* > 0\}$ is a dominant set of $G$. Thus, a dominant set can be found by localizing a solution of Eq. 2 with a continuous optimization technique and gathering the support set of the found solution. Notice that the value of a component of the found solution provides a measure of how strongly that component contributes to the cohesiveness of the cluster.
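For concreteness, this extraction procedure can be sketched as follows: a local solution of Eq. 2 is located with discrete-time replicator dynamics, and the support of the converged vector gives a dominant set. This is an illustrative sketch rather than the paper's implementation; the toy affinity matrix, the barycenter starting point and the support cutoff are our own choices.

```python
import numpy as np

def replicator_dynamics(A, tol=1e-8, max_iter=10000):
    """Local solution of max x'Ax over the standard simplex (Eq. 2),
    via discrete-time replicator dynamics started from the barycenter."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)   # multiplicative payoff update
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

def dominant_set(A, cutoff=1e-5):
    """Support of the converged vector = the extracted dominant set."""
    x = replicator_dynamics(A)
    return np.flatnonzero(x > cutoff), x
```

On a toy affinity matrix with one tight cluster and a weakly connected rest, the dynamics concentrate all mass on the coherent cluster, so the returned support is exactly that cluster.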
Constrained Dominant Set Framework.
In [32, 31] the notion of a constrained dominant set is introduced, which aims at finding a dominant set constrained to contain vertices from a given seed set $S \subseteq V$. Based on the edge-weighted graph definition with affinity matrix $A$, a parameterized family of quadratic programs is defined as in Eq. 4 [31] for the set $S$ and a parameter $\alpha > 0$,

maximize $f_S^\alpha(\mathbf{x}) = \mathbf{x}^\top (A - \alpha \hat{I}_{V \setminus S})\, \mathbf{x}$
subject to $\mathbf{x} \in \Delta$  (4)

where $\hat{I}_{V \setminus S}$ is the diagonal matrix whose diagonal elements are set to 1 if the corresponding vertices are in $V \setminus S$ and to 0 otherwise. It is theoretically proven, and empirically illustrated for interactive image segmentation [31], that if $S$ is the set of vertices selected by the user, by setting $\alpha > \lambda_{\max}(A_{V \setminus S})$ it is guaranteed that all local solutions of (4) will have a support that necessarily contains at least one element of $S$. Here, $\lambda_{\max}(A_{V \setminus S})$ is the largest eigenvalue of the principal submatrix of $A$ indexed by the elements of $V \setminus S$.

In order to find constrained dominant sets by solving the quadratic optimization problem (4), [31] used Replicator Dynamics, developed and studied in evolutionary game theory [19]. In this work we use Infection and Immunization Dynamics (InImDyn) [20], which has proved to be a faster and equally accurate alternative.

3 Proposed approach
We propose to generate full mask predictions (to be used for supervising a semantic segmentation network) by post-processing weak annotations, i.e., scribble annotations, using CDS. Moreover, we propose to use CDS for multi-class clustering of pixels, i.e., semantic segmentation, whereas previously CDS has been used only for interactive foreground segmentation [31, 32].
3.1 Preprocessing step for CDS
Superpixel generation. A common approach in image segmentation is to use superpixels as input entities instead of image pixels. A superpixel is a group of pixels with similar colors, and using superpixels not only reduces computational complexity but also allows features to be computed on meaningful regions. Among a variety of techniques, e.g., SLIC and the Oriented Watershed Transform (OWT), we prefer the method of Felzenszwalb and Huttenlocher [11], as in [26], since it is a fast and publicly available algorithm. The method of Felzenszwalb and Huttenlocher [11] has also been used in another weakly-supervised semantic segmentation framework [15] experimenting on the same dataset as us. The method proposed in [11] is a graph-based segmentation scheme in which a graph is constructed over the image such that each element to be segmented is a vertex of the graph, and the dissimilarity, i.e., color difference, between two vertices constitutes a weighted edge. Vertices (or subgraphs) are merged iteratively according to the merging criterion given in Eq. 5, where $e(C_1, C_2)$ is the minimum-weight edge connecting two subgraphs $C_1$ and $C_2$, $w(e)$ is the weight of edge $e$, and $\mathrm{MST}(C)$ is the minimum spanning tree of $C$:

$w(e(C_1, C_2)) \le \min\!\Big(\max_{e \in \mathrm{MST}(C_1)} w(e) + \tau(C_1),\ \max_{e \in \mathrm{MST}(C_2)} w(e) + \tau(C_2)\Big)$  (5)

Here, $\tau(C) = k / |C|$ is a threshold function in which $k$ is decided by the user; high values of $k$ yield a smaller number of (large) segments, and vice-versa. Another user-chosen parameter is the smoothing factor (which we denote by $\sigma_{FH}$) of the Gaussian kernel that is used to smooth the image in the preprocessing step.
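To make the criterion concrete, the sketch below (our own illustration, not the authors' released code) evaluates the merge predicate of Eq. 5 for two components given their internal edges; the internal difference of a component is the largest edge weight in its minimum spanning tree, computed here with a small Kruskal routine.

```python
def mst_max_edge(edges):
    """Largest edge weight in the minimum spanning tree (Kruskal).
    `edges` is a list of (w, u, v) tuples covering one connected component."""
    nodes = {u for _, u, v in edges} | {v for _, u, v in edges}
    parent = {u: u for u in nodes}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    int_c = 0.0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            int_c = max(int_c, w)  # MST edges are added in increasing order
    return int_c

def should_merge(edges1, edges2, w_between, k):
    """Eq. 5: merge C1 and C2 if the cheapest edge between them does not
    exceed the internal difference of either component plus tau(C) = k/|C|."""
    def n_nodes(edges):
        return len({u for _, u, v in edges} | {v for _, u, v in edges})
    t1 = mst_max_edge(edges1) + k / n_nodes(edges1)
    t2 = mst_max_edge(edges2) + k / n_nodes(edges2)
    return w_between <= min(t1, t2)
```

As the sketch makes visible, raising $k$ loosens both thresholds, so more merges happen and fewer, larger segments survive.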
Feature extraction. Once the superpixels are generated, a feature vector is computed for each superpixel. In the application of the CDS model to interactive image segmentation in [31], the medians of the colors of all pixels in the RGB, HSV, and L*a*b* color spaces and the Leung-Malik (LM) Filter Bank responses are concatenated in the feature extraction process. Differently from [31], we compute the same feature types as ScribbleSup [15], which experimented on the same dataset as us: color and texture histograms, denoted by $h_c$ and $h_t$ in Eq. 6. More specifically, $h_c(x_i)$ is a color histogram computed with 25 bins and $h_t(x_i)$ is a histogram of gradients at the horizontal and vertical orientations, with 10 bins per orientation, for superpixel $x_i$.

3.2 Application of CDS for full mask predictions
In order to generate full mask predictions using the CDS model, an input image is represented as a graph whose vertices are the superpixels of the image and whose edge weights reflect the similarity between the corresponding superpixels. We use scribbles as the given weak annotations in this work, and they serve as the constraints in the CDS formulation. Previously, CDS was applied to interactive foreground segmentation [31], where dominant-set clusters covering a set of given nodes of a single object class were explored. In this work, our problem demands multi-class clustering of pixels. Hence, here $S_l$ represents the manually selected pixels of class $l$, where $l \in \{1, \dots, L\}$ and $L$ is the number of classes in the dataset, e.g., $L = 21$ for PASCAL VOC 2012.

Accordingly, for each class of scribbles present in a given image, we ignore the remaining classes in the image and perform foreground segmentation, i.e., two-class clustering of image pixels, as in [31] by computing its CDSs. Thus, for class $l$, the union of the extracted dominant sets, i.e., $D_1 \cup \dots \cup D_m$ if $m$ dominant sets containing the set $S_l$ are extracted, represents the segmented regions of objects of class $l$. We then repeat this process for every class present in the image using the corresponding scribbles. If a node, i.e., a superpixel, is found in more than one class, we assign it to the class for which it has the highest value in the weighted characteristic vector, which is found by solving the quadratic program in Eq. 4 with InImDyn (see Section 2).
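The per-class procedure above can be sketched as follows. For brevity the sketch solves program (4) with plain replicator dynamics rather than InImDyn, shifts the payoff matrix by a constant (which leaves the simplex dynamics unchanged) to keep payoffs non-negative, and resolves multi-class conflicts by the characteristic-vector value as described above; the seed sets, class labels and support cutoff are illustrative.

```python
import numpy as np

def replicator(B, n_iter=2000):
    """Replicator dynamics on payoff matrix B; the constant shift keeps
    payoffs non-negative without changing the dynamics' trajectories."""
    B = B - B.min()
    x = np.full(B.shape[0], 1.0 / B.shape[0])
    for _ in range(n_iter):
        x = x * (B @ x) / (x @ B @ x)
    return x

def cds_vector(A, seed):
    """Local solution of program (4) for the seed set `seed`."""
    out = np.ones(A.shape[0], dtype=bool)
    out[seed] = False
    # alpha must exceed the largest eigenvalue of A restricted to V \ S
    alpha = np.linalg.eigvalsh(A[np.ix_(out, out)]).max() + 1e-3
    return replicator(A - alpha * np.diag(out.astype(float)))

def multiclass_labels(A, seeds, cutoff=1e-5):
    """seeds: {class_label: list of seed superpixel indices}. A superpixel
    claimed by several classes goes to the one that gives it the largest
    characteristic-vector value."""
    xs = {l: cds_vector(A, s) for l, s in seeds.items()}
    labels = {}
    for i in range(A.shape[0]):
        best = max(xs, key=lambda l: xs[l][i])
        if xs[best][i] > cutoff:
            labels[i] = best
    return labels
```

On a toy affinity matrix with two equally tight clusters and one scribble seed in each, every superpixel is assigned to the class of its own cluster's seed.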
Computation of the affinity matrix. Before computing the CDS clusters, the affinities (or similarities) between superpixels must be computed to construct the matrix $A$ in Eq. 4. In [31], dissimilarity measurements are transformed into affinities using the Gaussian kernel $a_{ij} = \mathbb{1}_{\{i \neq j\}} \exp\big(-\| f_i - f_j \|^2 / 2\sigma^2\big)$, where $f_i$ is the feature vector of superpixel $i$, $\sigma$ is the scale parameter of the Gaussian kernel, and $\mathbb{1}_{\{P\}} = 1$ if $P$ is true and 0 otherwise. Differently from [31], we use the Gaussian kernel in Eq. 6, in which different scale values are used for the different feature types. The kernel in Eq. 6 is also adopted in [15], which experiments on the same dataset and uses the same feature types as us.

$a_{ij} = \exp\!\left(-\frac{\| h_c(x_i) - h_c(x_j) \|^2}{\sigma_c^2} - \frac{\| h_t(x_i) - h_t(x_j) \|^2}{\sigma_t^2}\right)$  (6)
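A sketch of this construction is given below. The exact quantization of the histograms is not detailed here, so the binning (a joint 25-bin color histogram and 10 bins per gradient orientation) is our assumption.

```python
import numpy as np

def color_histogram(pixel_values, bins=25):
    """25-bin color histogram of one superpixel (assumed quantization)."""
    hist, _ = np.histogram(pixel_values, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def texture_histogram(grad_x, grad_y, bins=10):
    """10 bins per orientation (horizontal and vertical gradients)."""
    hx, _ = np.histogram(grad_x, bins=bins, range=(-255, 255))
    hy, _ = np.histogram(grad_y, bins=bins, range=(-255, 255))
    h = np.concatenate([hx, hy]).astype(float)
    return h / max(h.sum(), 1)

def affinity(hc, ht, sigma_c, sigma_t):
    """Eq. 6: pairwise affinities from stacked color (hc) and texture (ht)
    histograms, one row per superpixel."""
    dc = ((hc[:, None, :] - hc[None, :, :]) ** 2).sum(-1)
    dt = ((ht[:, None, :] - ht[None, :, :]) ** 2).sum(-1)
    A = np.exp(-dc / sigma_c**2 - dt / sigma_t**2)
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A
```

The resulting matrix is symmetric with entries in $[0, 1]$ and a zero diagonal, as required by the graph definition in Section 2.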
Using different color spaces. The quality of the generated superpixels directly affects the performance of the segmentation algorithm, and a number of segmentation works (examples include but are not limited to [26, 3]) have emphasized that higher segmentation performance can be obtained by using different color transformations of the input image to deal with different scene and lighting conditions. Motivated by these studies [26, 3], we compute superpixels in a variety of color spaces with a range of invariance properties. Specifically, we use five color spaces, which were also used in [26] for determining high-quality object locations by employing segmentation as a selective search strategy, including Intensity (the greyscale image) and the remaining four color transformations adopted there. We generate superpixels and compute mask predictions with the CDS model for each color space of the input image, and then decide the final label of a pixel as its most frequently occurring class label, i.e., by majority voting. In addition to using different color spaces, we also vary the threshold parameter $k$ (in Eq. 5) to benefit from a larger diversification, as recommended in [26].
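The per-pixel majority voting over the label maps obtained from the different color spaces (and different $k$ values) can be sketched as follows; the `ignore` convention for unlabelled pixels is our own choice.

```python
import numpy as np

def majority_vote(label_maps, n_classes, ignore=-1):
    """Pixel-wise majority voting over per-color-space label maps.
    `label_maps` is a list of equally-shaped integer arrays; `ignore`
    marks pixels that a map leaves unlabelled."""
    stack = np.stack(label_maps)                   # (n_maps, H, W)
    votes = np.zeros((n_classes,) + stack.shape[1:], dtype=int)
    for c in range(n_classes):
        votes[c] = (stack == c).sum(axis=0)        # votes per class per pixel
    out = votes.argmax(axis=0)
    out[votes.sum(axis=0) == 0] = ignore           # no map labelled this pixel
    return out
```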
4 Experiments
Dataset and evaluation. We trained the models on the 10582-image augmented PASCAL VOC training set [12] and evaluated them on the 1449-image validation set. We used the scribble annotations published in [15]. In what follows, accuracy is evaluated using pixel accuracy $\big(\sum_i n_{ii} / \sum_i t_i\big)$, mean accuracy $\big((1/n_{cl}) \sum_i n_{ii} / t_i\big)$ and mean Intersection over Union $\big((1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})\big)$ as in [16], where $n_{ij}$ is the number of pixels of class $i$ predicted to belong to class $j$, $n_{cl}$ is the number of different classes, and $t_i = \sum_j n_{ij}$ is the total number of pixels of class $i$.
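These three metrics can be computed from a confusion matrix as follows, a straightforward sketch of the definitions above (classes absent from the ground truth are excluded from the means, our assumption):

```python
import numpy as np

def segmentation_metrics(pred, gt, n_cl):
    """Pixel accuracy, mean accuracy and mean IoU, following the
    definitions used in the FCN paper [16]."""
    n = np.zeros((n_cl, n_cl), dtype=np.int64)   # n[i, j]: gt class i predicted as j
    for i in range(n_cl):
        for j in range(n_cl):
            n[i, j] = np.sum((gt == i) & (pred == j))
    t = n.sum(axis=1)                            # t_i: pixels of each gt class
    present = t > 0
    pixel_acc = np.diag(n).sum() / t.sum()
    mean_acc = np.mean(np.diag(n)[present] / t[present])
    iou = np.diag(n)[present] / (
        t[present] + n.sum(axis=0)[present] - np.diag(n)[present])
    return pixel_acc, mean_acc, iou.mean()
```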
Implementation details. We used the VGG-16-based FCN-8s network [16] of the MatConvNet-FCN toolbox [27], which we initialized with an ImageNet pre-trained model, i.e., VGG-VD-16 in [27]. We trained by SGD with momentum; similar to [16], we used a momentum of 0.9, a mini-batch size of 20 images, and the weight decay and learning rate values of [16]. With these hyperparameters, we observed that the pixel accuracy converged on the validation set.
The performance of CDS is sensitive to the selection of the $\sigma$ parameter of the Gaussian kernel (see Section 3.2), and in [31] three different results are reported for different selections of $\sigma$: 1) CDS-BestSigma, where the best $\sigma$ is selected separately for every image; 2) CDS-SingleSigma, where a single $\sigma$ is found by searching in a fixed range, i.e., between 0.05 and 0.2; 3) CDS-SelfTuning, where $\sigma^2$ is replaced by $\sigma_i \sigma_j$, with $\sigma_i$ the mean distance of sample $i$ to its $K$ nearest neighbours and $K$ fixed to 7. To decide the values of the $\sigma_c$ and $\sigma_t$ parameters (in Eq. 6) we followed the CDS-BestSigma strategy of [31]. Additionally, in the graph structure we cut the edges between vertices that correspond to non-adjacent superpixels by setting the corresponding entries of the affinity matrix to zero, as done in [15], which provided better segmentation maps. We then min-max normalized the matrix to the range $[0, 1]$ and symmetrized it.
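The affinity post-processing described above (cutting edges between non-adjacent superpixels, min-max normalization, symmetrization) can be sketched as:

```python
import numpy as np

def postprocess_affinity(A, adjacency):
    """Keep only edges between adjacent superpixels, min-max normalize
    to [0, 1], and symmetrize; a sketch of the steps described above."""
    A = np.where(adjacency, A, 0.0)    # cut non-adjacent edges
    lo, hi = A.min(), A.max()
    if hi > lo:
        A = (A - lo) / (hi - lo)       # min-max scaling to [0, 1]
    A = (A + A.T) / 2.0                # enforce symmetry
    np.fill_diagonal(A, 0.0)           # no self-loops
    return A
```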
Performance evaluation. We first evaluated, for the different color spaces, the quality of the predicted full annotations of the 10582 images (denoted by PredSet for Predicted Set in Table 1), before training the network with them. Then, by training the network with the Predicted Sets, we report performance on the Test Set, i.e., the PASCAL VOC 2012 val set. In the superpixel generation of [11] we used the smoothing factor $\sigma_{FH}$ (FH stands for Felzenszwalb and Huttenlocher [11]) in the experiments of Table 1. For each color space we performed majority voting (denoted by MV in both Table 1 and Table 2) over the maps obtained with different values of $k$ (in Eq. 5).

We see in Table 1 that using different color spaces affects the quality of the predicted full annotations (PredSet), and the highest-quality mask predictions in terms of mIoU are obtained with Intensity (66.51%). Performing majority voting over the maps obtained in all color spaces provides the highest-quality mask predictions for both CDS (73.28%) and GraphCut (63.51%). We then trained the network with the predicted sets of CDS-Intensity, CDS-MV and GraphCut-MV as well as the published full mask annotations, and present their performance on the test set in Table 1. Using CDS-MV for training, we outperform GraphCut (which was employed in [15]) significantly, and we come close to the performance of fully-annotated mask training (59.2% vs. 61.6%).
Table 1
Color space  mean IoU  Pixel Acc.  mean Acc.

PredSet-CDS (Intensity)  66.51  89.05  75.95
PredSet-CDS  65.47  88.36  76.15
PredSet-CDS  64.70  88.13  75.29
PredSet-CDS  66.49  89.27  74.60
PredSet-CDS  57.16  85.12  68.21
PredSet-CDS-MV  73.28  91.47  82.05
PredSet-GraphCut-MV  63.51  86.48  81.83
TestSet-CDS-Intensity  57.41  89.01  70.56
TestSet-CDS-MV  59.20  89.59  73.05
TestSet-GraphCut-MV  52.25  85.80  72.43
TestSet-With Full Masks  61.60  90.27  78.95
Comparison with other full-mask prediction methods. There is a large variety of interactive segmentation algorithms that can be used for full mask prediction to train a semantic segmentation network. To be as fair as possible, we compare against the reported performances of methods evaluated under conditions similar to ours, i.e., methods that employ scribbles as weak annotations, train the network with a cross-entropy loss computed over all pixel predictions rather than only over the given weak annotations, and do not iterate between the shallow segmentation method and network training with the obtained mask predictions, as in ScribbleSup [15]. We also ran the Graph Cut algorithm employed in ScribbleSup [15] in our framework, using the published code referred to in [15] (mouse.cs.uwaterloo.ca/code/gco-v3.0.zip), and present its performance. In fact, our approach can be considered as the first iteration step of such an iterative scheme, and it can be extended to further iterations by updating the initial scribble annotations with high-confidence network predictions.

Considering the above issues, we compare with methods whose accuracy on the test set is reported when their mask predictions are used to train a segmentation network. Specifically, we refer to the performance results of the popular methods GrabCut [22], NormalizedCut [23], and KernelCut [24] reported in [23]. It is mentioned in [24, 23] that for each image pixel, RGB (color) and XY (location) features are concatenated for use in these algorithms. The segmentation proposals generated by them are then used to train a VGG-16-based DeepLab-Msc-largeFOV network [8]. It is reported in [8] that DeepLab-Msc-largeFOV, which employs atrous convolution and multi-scale prediction, outperforms FCN-8s by around 9% (71.6% vs. 62.2%) on the PASCAL VOC 2012 validation set when trained with full mask annotations, which gives those works an advantage in the comparison. For this reason, we also present the performance gap between weak and full mask training in Table 2, to provide a fairer comparison. In Table 2, the performance results of full mask training (64.1%), GrabCut [22], NormalizedCut [23], and KernelCut [24] are taken from [23].
Table 2
Method  mIoU  Gap between full and weak supervision

With Full Masks [23]  64.1  -
GrabCut [22]  55.5  8.6
NormalizedCut [23]  58.7  5.4
KernelCut [24]  59.8  4.3
With Full Masks (ours)  61.6  -
GraphCut-MV  52.25  9.35
CDS-MV, selection (i)  59.20  2.40
CDS-MV, selection (ii)  60.22  1.38
For CDS, we train with mask predictions generated by two different selections of $\sigma_{FH}$: (i) a fixed value (corresponding to PredSet-CDS in Table 1); and (ii) a per-image value, where we selected the better of the two candidate values for each image. It can be seen from the segmentation performances on the val set given in Table 2 that we outperform the literature works with selection (ii) (60.22%), and we are superior at both parameter selections in terms of the performance gap between full and weak supervision, i.e., we approach the performance of our full mask training (61.6%) to within 2.40% and 1.38% with selections (i) and (ii), respectively. Two example images from the generated set, i.e., PredSet, are presented in Figure 1. Figure 2 shows examples of testing on the val set when the network is trained with PredSet-CDS-MV. It can be seen in Figures 1 and 2 that our results are the closest to the ground truth of the input images.
5 Conclusions
In this paper we have proposed to apply the Constrained Dominant Sets (CDS) model, which has proved effective compared to state-of-the-art interactive segmentation algorithms, to propagate the weak scribble annotations of a given set of images and obtain multi-labeled full mask predictions for them. The obtained mask predictions are then used to train a Fully Convolutional Network for semantic segmentation. While CDS had been applied to the pixel-wise binary classification problem, it had not previously been explored for semantic segmentation, and this paper presents our work in this direction. Experimental results show that the proposed approach generates higher-quality full mask predictions than the existing methods that have been adopted for weakly-supervised semantic segmentation in the literature.
References

[1] (2018) Multi-feature fusion for image retrieval using constrained dominant sets. CoRR abs/1808.05075.
[2] (2019) Deep constrained dominant sets for person re-identification. CoRR abs/1904.11397.
[3] (2017) On comparing color spaces for food segmentation. In International Conference on Image Analysis and Processing, pp. 435–443.
[4] (2018) Semantic food segmentation for automatic dietary monitoring. In 2018 IEEE 8th International Conference on Consumer Electronics - Berlin (ICCE-Berlin).
[5] (2016) Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In Asian Conference on Computer Vision, pp. 180–196.
[6] (2016) What's the point: semantic segmentation with point supervision. In European Conference on Computer Vision, pp. 549–565.
[7] (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (9), pp. 1124–1137.
[8] (2015) Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations.
[9] (2016) The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
[10] (2015) BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 1635–1643.
[11] (2004) Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2), pp. 167–181.
[12] (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), pp. 991–998.
[13] (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7014–7023.
[14] (2017) Simple does it: weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 876–885.
[15] (2016) ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159–3167.
[16] (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[17] (2015) Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807.
[18] (2015) Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE International Conference on Computer Vision (ICCV).
[19] (2007) Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1), pp. 167–172.
[20] (2011) Graph-based quadratic optimization: a fast evolutionary approach. Computer Vision and Image Understanding 115 (7), pp. 984–995.
[21] (2017) Dominant-set clustering: a review. European Journal of Operational Research 262 (1), pp. 1–13.
[22] (2004) GrabCut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23, pp. 309–314.
[23] (2018) Normalized cut loss for weakly-supervised CNN segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City.
[24] (2016) Normalized cut meets MRF. In European Conference on Computer Vision, pp. 748–765.
[25] (2018) On regularized losses for weakly-supervised CNN segmentation. In European Conference on Computer Vision (ECCV), pp. 507–522.
[26] (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171.
[27] (2015) MatConvNet: convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia.
[28] (2017) Learning random-walk label propagation for weakly-supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7158–7166.
[29] (2016) A deep learning approach for semantic segmentation in histology tissue images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 176–184.
[30] (2016) Constrained dominant sets for retrieval. In 23rd International Conference on Pattern Recognition (ICPR 2016), Cancún, Mexico, December 4-8, 2016, pp. 2568–2573.
[31] (2018) Dominant sets for "constrained" image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[32] (2016) Interactive image segmentation using constrained dominant sets. In European Conference on Computer Vision, pp. 278–294.