Accurately segmenting a foreground object of interest from an image with convenient human interactions plays a central role in image and video editing. One widely used interaction is to annotate a bounding box around the foreground object. On one hand, this input bounding box provides the spatial location of the foreground. On the other hand, based on the image information within and outside this bounding box, we can obtain an initial estimate of the appearance models of the foreground and background, with which a binary labeling is finally performed to achieve a refined segmentation of the foreground and background [15, 17, 16, 18, 13, 10].
However, due to the complexity of the object boundary and appearance, most of the existing methods of this kind prefer the input bounding box to tightly enclose the foreground object. An example is shown in Fig. 1, where the widely used GrabCut [15] algorithm fails when the bounding box does not tightly cover the foreground object. The preference for a tight bounding box increases the burden of the human interaction, and moreover it prevents these algorithms from utilizing automatically generated bounding boxes, such as boxes from object proposals [2, 23, 22], which are usually not guaranteed to tightly cover the foreground object. In this paper, we focus on developing a new LooseCut algorithm that can accurately segment the foreground object with loosely bounded boxes.
A loosely bounded box may contain more background than a tightly bounded box. As a result, the initial appearance model of the foreground, estimated from the pixels within the bounding box, is highly inaccurate. This may substantially reduce the segmentation performance, as shown by the GrabCut result in Fig. 1. In this paper, we propose two strategies to address this problem. First, we explicitly emphasize the appearance difference between the foreground and background models. Second, we explicitly encourage consistent labeling of similar-appearance pixels, either adjacent or non-adjacent. These two strategies can help identify the background pixels within the bounding box, as shown in Fig. 2.
In this paper, we follow GrabCut by formulating the foreground/background segmentation as a binary labeling over an MRF built upon the image grid, and the appearances of the foreground and background are described by two Gaussian Mixture Models (GMMs). More specifically, we add a global similarity constraint and a label consistency term to the MRF energy to implement the two strategies mentioned above. Finally, we solve the proposed MRF model using an iterated max-flow algorithm. In the experiments, we evaluate the proposed LooseCut on three publicly available image datasets, and compare its performance against several state-of-the-art interactive image segmentation algorithms. We also show that LooseCut can be used for enhancing the performance of unsupervised video segmentation and image saliency detection.
2 Related Work
In recent years, interactive image segmentation based on input bounding boxes has drawn much attention in the computer vision and graphics communities, resulting in a number of effective algorithms [15, 17, 16, 18, 13, 10]. Starting from the classical GrabCut algorithm, many of these algorithms use graph cut models: the input image is modeled by a graph, and the foreground/background segmentation is then modeled by a binary graph cut that minimizes a pre-defined energy function. In GrabCut [15], initial appearance models of the foreground and background are estimated using the image information within and outside the bounding box. A binary MRF model is then applied to label each pixel as foreground or background, based on which the appearance models of the foreground and background are re-estimated. This process is repeated until convergence. As illustrated in Fig. 1, the performance of GrabCut is highly dependent on the initial estimation of the appearance models of the foreground and background, which might be very poor when the input bounding box does not tightly cover the foreground object. The LooseCut algorithm developed in this paper also follows the general procedure introduced in GrabCut, but introduces a new constraint and a new energy term to the MRF model to specifically handle loosely bounded boxes.
PinPoint [13] is another MRF-based algorithm for interactive image segmentation with a bounding box. It incorporates a topology prior derived from geometry properties of the bounding box and encourages the segmented foreground to be tightly enclosed by the bounding box. Therefore, its performance gets much worse with a loosely bounded box. Also using an MRF model, OneCut [17] was recently developed for interactive image segmentation. Its main contribution is to incorporate an MRF energy term that reflects the appearance overlap between the foreground and background histograms. As shown in the later experiments, the $L_1$-distance based appearance overlap used in OneCut is still insufficient to handle loosely bounded boxes. In [16], a pPBC algorithm is developed for interactive image segmentation using an efficient parametric pseudo-bound optimization strategy. However, in our experiments shown in Section 4, pPBC still cannot give satisfactory segmentation results when the input bounding box is loose.
Other than using the MRF model, MILCut [18] formulates interactive image segmentation as a multiple instance learning problem by generating positive bags along the sweeping lines within the bounding box. MILCut may not generate the desired positive bags along the sweeping lines for a loosely bounded box. Active contour [10] takes the input bounding box as an initial contour and iteratively deforms it toward the boundary of the foreground object. Due to its sensitivity to image noise, active contour usually requires the initial contour to be close to the underlying foreground object boundary.
3 Proposed Approach
In this section, we first briefly review the classical GrabCut algorithm and then explain the proposed LooseCut algorithm.
GrabCut [15] performs a binary labeling of each pixel using an MRF model. Let $X = \{x_i\}_{i=1}^{n}$ be the binary labels at each pixel $i$, where $x_i = 1$ if $i$ is in the foreground and $x_i = 0$ if $i$ is in the background, and let $\theta = (M_f, M_b)$ denote the appearance models, including the foreground GMM $M_f$ and the background GMM $M_b$. GrabCut seeks an optimal labeling that minimizes
$$E_{GC}(X, \theta) = \sum_{i} D(x_i, \theta) + \sum_{(i,j) \in \mathcal{N}} V(x_i, x_j), \qquad (1)$$
where $\mathcal{N}$ defines a pixel neighboring system, e.g., 4-neighbor or 8-neighbor connectivity. The unary term $D(x_i, \theta)$ measures the cost of labeling pixel $i$ as foreground or background based on the appearance models $\theta$. The pairwise term $V(x_i, x_j)$ enforces the smoothness of the labels by penalizing discontinuity among neighboring pixels with different labels. The max-flow algorithm is usually used for solving this MRF optimization problem. GrabCut takes the following steps to achieve the binary image segmentation with an input bounding box:
1. Estimate the initial appearance models $M_f$ and $M_b$ using the pixels inside and outside the bounding box, respectively.
2. Based on the current appearance models $\theta$, quantify the foreground and background likelihood of each pixel and use it to define the unary term $D(x_i, \theta)$. Then solve for the optimal labeling that minimizes Eq. (1).
3. Based on the obtained labeling $X$, refine $\theta$ and go back to Step 2. Repeat this process until convergence.
3.2 MRF Model for LooseCut
Following the MRF model used in GrabCut, the proposed LooseCut takes the following MRF energy function:
$$E(X, \theta) = E_{GC}(X, \theta) + \beta E_{LC}(X), \qquad (2)$$
where $E_{GC}$ is the GrabCut energy given in Eq. (1), and $E_{LC}$ is an energy term for encouraging label consistency, weighted by $\beta$. In minimizing Eq. (2), we enforce a global similarity constraint to better estimate and distinguish the foreground and background. In the following, we elaborate on the global similarity constraint and the label consistency term $E_{LC}$.
3.3 Global Similarity Constraint
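In this section, we define the proposed global similarity constraint. Let the foreground GMM $M_f$ have $K_f$ Gaussian components $M_f^i$ with means $\mu_f^i$, $i = 1, \dots, K_f$, and the background GMM $M_b$ have $K_b$ Gaussian components $M_b^j$ with means $\mu_b^j$, $j = 1, \dots, K_b$. For each Gaussian component $M_f^i$ in the foreground GMM $M_f$, we first find its nearest Gaussian component $M_b^{j(i)}$ in $M_b$ as
$$j(i) = \arg\min_{j \in \{1, \dots, K_b\}} \lVert \mu_f^i - \mu_b^j \rVert. \qquad (3)$$
With this, we can define the similarity between the Gaussian component $M_f^i$ and the entire background GMM $M_b$ as
$$S(M_f^i, M_b) = \frac{1}{\lVert \mu_f^i - \mu_b^{j(i)} \rVert}, \qquad (4)$$
which is the inverse of the mean difference between $M_f^i$ and its nearest Gaussian component in the background GMM. Then, we define the global similarity function as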
3.4 Label Consistency Term
To encourage the label consistency of similar-appearance pixels, either adjacent or non-adjacent, we first cluster all the image pixels using a recent superpixel algorithm [21] that preserves both feature and spatial consistency. Following a K-means-style procedure, this clustering algorithm partitions the image into a set of compact superpixels, and each resulting cluster is made up of one or more superpixels. An example is shown in Fig. 3, where the region color indicates the cluster: superpixels with the same color constitute a cluster.
Let $C_k$ denote the $k$-th cluster. Pixels belonging to $C_k$ should be encouraged to take the same label, as for the clusters shown in Fig. 3. To accomplish this, we set a cluster label $y_k$ (taking values 0 or 1) for each cluster $C_k$ and define the label-consistency energy term as
$$E_{LC}(X) = \sum_{k} \sum_{i \in C_k} \phi(x_i \neq y_k),$$
where $x_i$ is the label of pixel $i$ and $\phi(\cdot)$ is an indicator function taking 1 or 0 for a true or false argument. In the proposed algorithm, we solve for both the pixel labels and the cluster labels simultaneously in the MRF optimization.
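As a minimal sketch of this term, the label-consistency energy can be read as a count of the pixels whose label disagrees with the label of their cluster. The function name and the dictionary-based data layout below are illustrative assumptions, not part of the original method description.

```python
def label_consistency_energy(pixel_labels, clusters, cluster_labels):
    """Count pixels whose label differs from the label of their cluster.

    pixel_labels:   dict mapping pixel id -> 0 or 1
    clusters:       dict mapping cluster id -> list of pixel ids
    cluster_labels: dict mapping cluster id -> 0 or 1
    """
    return sum(
        1
        for k, members in clusters.items()
        for i in members
        if pixel_labels[i] != cluster_labels[k]
    )


# Toy example: pixel 2 disagrees with the label of its cluster "a".
pixel_labels = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}
clusters = {"a": [0, 1, 2], "b": [3, 4]}
cluster_labels = {"a": 1, "b": 0}
print(label_consistency_energy(pixel_labels, clusters, cluster_labels))  # 1
```

In the full energy this count is scaled by the weight of the label consistency term, so a larger weight makes splitting a cluster across foreground and background more expensive.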
In this section, we propose an algorithm to find the optimal binary labeling that minimizes the energy function defined in Eq. (2), subject to the global similarity constraint. Specifically, in each iteration, we first fix the labeling $X$ and optimize over the appearance models $\theta$ by enforcing the global similarity constraint on $\theta$. After that, we fix $\theta$ and find an optimal labeling $X$ that minimizes the energy in Eq. (2). These two optimization steps are repeated alternately until convergence or until a preset maximum number of iterations is reached. As an initialization, we use the input bounding box to define a binary labeling in iteration 0. In the following, we elaborate on these two optimization steps.
Fixing $X$ and Optimizing over $\theta$: With a fixed binary labeling $X$, we can estimate the appearance models $\theta$ using a standard EM-based clustering algorithm: all the pixels with label 1 are taken for computing the foreground GMM $M_f$, and all the pixels with label 0 are used for computing the background GMM $M_b$. The numbers of Gaussian components in the two GMMs are selected intentionally, since some background components are mixed into the foreground for the initial $X$ defined by a loosely bounded box. For the obtained $M_f$ and $M_b$, we examine whether the global similarity constraint is satisfied. If this constraint is satisfied, we take the resulting $\theta$ and continue to the next step of optimization. If this constraint is not satisfied, we further refine $M_f$ using the following algorithm:
1. Calculate the similarity between each Gaussian component of $M_f$ and $M_b$ by following Eq. (4), and identify the Gaussian components of $M_f$ with the largest similarity to $M_b$.
2. Among these components, if any one does not satisfy the similarity condition, delete it from $M_f$.
3. After all the deletions, use the remaining Gaussian components to construct an updated $M_f$.
This algorithm ensures that the updated $M_f$ and $M_b$ satisfy the global similarity constraint.
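The refinement steps above can be sketched as follows, working on component means only. Several details are assumptions made for illustration: the Euclidean distance between mean vectors, the parameter names `n` (how many of the most similar components are examined) and `delta` (the similarity threshold), and the reading of the condition as "similarity must not exceed `delta`". A full implementation would also carry the weights and covariances of the remaining components.

```python
import math


def component_similarity(mu_f, bg_means):
    """Inverse of the distance from a foreground mean to its nearest background mean."""
    d = min(math.dist(mu_f, mu_b) for mu_b in bg_means)
    return math.inf if d == 0 else 1.0 / d


def refine_foreground_means(fg_means, bg_means, n, delta):
    """Delete, among the n foreground components most similar to the background,
    those whose similarity exceeds delta (assumed form of the condition)."""
    sims = [component_similarity(mu, bg_means) for mu in fg_means]
    # Indices of foreground components, most background-like first.
    ranked = sorted(range(len(fg_means)), key=lambda i: sims[i], reverse=True)
    to_delete = {i for i in ranked[:n] if sims[i] > delta}
    return [mu for i, mu in enumerate(fg_means) if i not in to_delete]


fg = [(1.0, 0.0, 0.0), (50.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
bg = [(0.0, 0.0, 0.0)]
# The first foreground component sits right next to the background mean,
# so its similarity (1.0) exceeds the threshold and it is removed.
print(refine_foreground_means(fg, bg, n=1, delta=0.5))
```

With a looser threshold, such as `delta=2.0`, no component is deleted and the foreground model is returned unchanged.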
Fixing $\theta$ and Optimizing over $X$: Inspired by  and , we build an undirected graph with auxiliary nodes, as shown in Fig. 4, to find an optimal labeling $X$ that minimizes the energy in Eq. (2). In this graph, each pixel is represented by a node. For each pixel cluster, we construct an auxiliary node to represent it. Edges are constructed to link the auxiliary node and the nodes that represent the pixels in that cluster, with the edge weight $\beta$ as used in Eq. (2). An example of the constructed graph is shown in Fig. 4, where the pink nodes represent three pixels in the same cluster, which is represented by an auxiliary node. All the nodes in blue represent another cluster. With a fixed $\theta$, we use the max-flow algorithm on this graph to seek an optimal $X$ that minimizes the energy in Eq. (2).
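To illustrate how an auxiliary node pulls the pixels of one cluster toward a common label, here is a small self-contained sketch with a textbook Edmonds-Karp max-flow. It is not the solver used in the paper. The unary costs, node names, and helper names are made up for the example, and the pairwise smoothness edges between neighboring pixels are omitted for brevity.

```python
from collections import defaultdict, deque


def min_cut_source_side(capacity, s, t):
    """Edmonds-Karp max-flow; returns the nodes on the source side of a min cut."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0.0  # make sure the reverse edge exists

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if t not in parent:
            return set(parent)  # no augmenting path left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= flow
            residual[v][u] += flow


def segment_cluster(bg_cost, fg_cost, beta):
    """Label the pixels of a single cluster; 1 = foreground, 0 = background."""
    cap = defaultdict(dict)
    for p in bg_cost:
        cap["s"][p] = bg_cost[p]  # cut when p is labeled background
        cap[p]["t"] = fg_cost[p]  # cut when p is labeled foreground
        cap["aux"][p] = beta      # undirected link to the auxiliary node
        cap[p]["aux"] = beta
    side = min_cut_source_side(cap, "s", "t")
    return {p: int(p in side) for p in bg_cost}


bg_cost = {"p0": 5.0, "p1": 5.0, "p2": 1.0}
fg_cost = {"p0": 1.0, "p1": 1.0, "p2": 2.0}
print(segment_cluster(bg_cost, fg_cost, beta=0.5))  # p2 keeps its own preference
print(segment_cluster(bg_cost, fg_cost, beta=2.0))  # p2 follows its cluster
```

With a small weight, the weakly background pixel `p2` is labeled background; with a larger weight, cutting its link to the auxiliary node costs more than relabeling it, so all three pixels take the same label.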
In OneCut, a color histogram is first constructed for the input image and then one auxiliary node is constructed for each histogram bin. All the pixels are then quantized into these bins and the pixels in each bin are then linked to its corresponding auxiliary node. In this paper, we use superpixel-based clusters to define the auxiliary nodes.
The unary energy term in OneCut is different from the one in the proposed method and as a result, we define the edge weights involving the source and sink nodes differently from OneCut. OneCut follows the ballooning technique: The weight is set to 1 for the edges between and any pixels inside the bounding box, and 0 otherwise; Similarly, the weight is set to 0 for the edges between and any pixels in the bounding box, and otherwise. In the proposed algorithm, the weights of the edges that are incident from or reflect the unary term in Eq. (2), which is based on the appearance models .
With these two differences, OneCut seeks to minimize the $L_1$-distance based histogram overlap between the foreground and background. This is different from the goal of the proposed algorithm: we seek better label consistency of the pixels in the same cluster by using this graph structure. We will compare with OneCut in the later experiments. The full LooseCut algorithm is summarized in Algorithm 1.
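The overall alternation can be outlined as below. This is only a control-flow sketch under stated assumptions: the three steps are passed in as placeholder callables, convergence is taken to mean that the labels stop changing, and the toy stand-ins use 1-D intensities with a single mean per model instead of GMMs and max-flow.

```python
def alternate_until_convergence(init_labels, fit_models, enforce_constraint,
                                solve_labels, max_iters=10):
    """Alternate model estimation and labeling until the labels stop changing."""
    labels = list(init_labels)
    for _ in range(max_iters):
        fg_model, bg_model = fit_models(labels)            # fix labels, update models
        fg_model = enforce_constraint(fg_model, bg_model)  # refine the foreground model
        new_labels = solve_labels(fg_model, bg_model)      # fix models, update labels
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy stand-ins: each "model" is just the mean intensity of its pixels.
pixels = [0.9, 0.8, 0.2, 0.1, 0.15]


def fit_models(labels):
    fg = [v for v, x in zip(pixels, labels) if x == 1]
    bg = [v for v, x in zip(pixels, labels) if x == 0]
    return sum(fg) / len(fg), sum(bg) / len(bg)


def solve_labels(fg_mean, bg_mean):
    return [int(abs(v - fg_mean) < abs(v - bg_mean)) for v in pixels]


# The initial "loose box" wrongly includes the third pixel in the foreground.
result = alternate_until_convergence(
    [1, 1, 1, 0, 0], fit_models, lambda fg, bg: fg, solve_labels
)
print(result)  # [1, 1, 0, 0, 0]
```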
To evaluate the proposed LooseCut algorithm, we conduct experiments on three widely used image datasets: the GrabCut dataset [15], the Weizmann dataset [3, 5], and the iCoseg dataset [4]. We compare its performance against several state-of-the-art interactive image segmentation methods, including GrabCut [15], OneCut [17], MILCut [18], and pPBC [16]. We also conduct experiments to show the effectiveness of LooseCut in two applications: unsupervised video segmentation and image saliency detection.
Metrics: We use the Error Rate to evaluate an interactive image segmentation, computed as the percentage of misclassified pixels inside the bounding box. We also take the pixel-wise F-measure as an evaluation metric.
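A minimal sketch of the two metrics on binary masks stored as nested lists is given below. The box convention `(top, left, bottom, right)` with exclusive end indices is an assumption. The F-measure is taken here as the balanced harmonic mean of precision and recall, which is also an assumption because the exact formula is not reproduced in this text.

```python
def error_rate(pred, truth, box):
    """Percentage of misclassified pixels inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    total = wrong = 0
    for r in range(top, bottom):
        for c in range(left, right):
            total += 1
            wrong += int(pred[r][c] != truth[r][c])
    return 100.0 * wrong / total


def f_measure(pred, truth):
    """Balanced F-measure, 2PR / (P + R), over all pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    if tp == 0:
        return 0.0
    return 2.0 * tp / (2.0 * tp + fp + fn)


truth = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
print(error_rate(pred, truth, (0, 0, 4, 4)))  # 12.5
print(f_measure(pred, truth))                 # 0.75
```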
Parameter Settings: For the number of Gaussian components in GMMs, is set to and is set to . As discussed in Section 3.5, . To enforce the global similarity constraint, we delete component in . The number of clusters (auxiliary nodes in graph) is set to . For the LooseCut energy defined in Eq. (2), we consistently set . The unary term and binary term in Eq. (2) are the same as in  and RGB color features are used to construct the GMMs. We set in deleting the foreground GMM component to enforce the global similarity constraint. For all the comparison methods, we follow their default or recommended settings in their codes.
4.1 Interactive Image Segmentation
In this experiment, we construct bounding boxes with different looseness and examine the resulting segmentation. As illustrated in Fig. 5, we compute the fit box of the ground-truth foreground and slightly dilate it by 10 pixels along four directions, i.e., left, right, up, and down. We take it as the baseline bounding box with 0% looseness. We then keep dilating this bounding box uniformly along all four directions to generate a series of looser bounding boxes; a box with a looseness $L$ (in percentage) has an area that is $L$ larger than that of the baseline bounding box. A bounding box is cropped when any of its sides reaches the image boundary. An example is shown in Fig. 5.
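One way to realize this construction is to solve for the uniform margin $d$ that yields the requested area increase. For a baseline box of width $w$ and height $h$, the condition $(w + 2d)(h + 2d) = (1 + L)\,wh$ gives $d = \big(-(w + h) + \sqrt{(w + h)^2 + 4Lwh}\big)/4$. The sketch below follows this reading. The function name, the `(x0, y0, x1, y1)` box layout, and passing the looseness as a fraction are illustrative assumptions.

```python
import math


def dilate_box(box, looseness, width, height):
    """Dilate box = (x0, y0, x1, y1) uniformly so that its area grows by
    `looseness` (a fraction, e.g. 3.0 for 300%), then crop to the image."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    # Positive root of 4 d^2 + 2 (w + h) d - looseness * w * h = 0.
    d = (-(w + h) + math.sqrt((w + h) ** 2 + 4 * looseness * w * h)) / 4
    return (max(0, x0 - d), max(0, y0 - d), min(width, x1 + d), min(height, y1 + d))


# A 20x20 baseline box with 300% looseness becomes 40x40 (four times the area).
print(dilate_box((40, 40, 60, 60), 3.0, 100, 100))  # (30.0, 30.0, 70.0, 70.0)
# Near the image corner, the dilated box is cropped at the image boundary.
print(dilate_box((5, 5, 25, 25), 3.0, 100, 100))    # (0, 0, 35.0, 35.0)
```

Note that after cropping, the actual area increase is smaller than the nominal looseness.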
The GrabCut dataset [15] consists of 50 images. Nine of them contain multiple objects while the ground truth is only annotated on a single object, e.g., the ground truth labels only one person but there are two people in the loosely bounded box. Such images are not suitable for testing the performance change when we enlarge the box looseness. Therefore, we use the remaining 41 images in our experiments. From the Weizmann dataset [3, 5], we pick a subset of 45 images for testing, by discarding the images where the baseline bounding box already covers almost the full image and cannot be dilated to construct looser bounding boxes. For a similar reason, from the iCoseg dataset [4], we select a subset of 45 images for our experiment.
Experimental results on these three datasets are summarized in Fig. 6. In general, the segmentation performance degrades when the bounding-box looseness increases, for both the proposed LooseCut and all the comparison methods. However, LooseCut shows a slower performance degradation than the comparison methods. When the looseness is high, LooseCut shows a much higher F-measure and a much lower Error Rate than all the comparison methods. Since MILCut's code is not publicly available, we only report MILCut's F-measure and Error Rate values with the baseline bounding boxes on the GrabCut dataset and the Weizmann dataset, copied from the original paper. Table 1 reports the values of F-measure and Error Rate of segmentation with varying-looseness bounding boxes on the GrabCut dataset. Sample segmentation results, together with the input bounding boxes with different looseness, are shown in Fig. 7.
4.2 Unsupervised Video Segmentation
The goal of unsupervised video segmentation is to automatically segment the objects of interest from each video frame. The segmented objects can then be associated across frames to infer the motion and action of these objects. It is important for video analysis and semantic understanding. One popular approach for unsupervised video segmentation is to detect a set of object proposals, in the form of bounding boxes, from each frame and then extract the objects of interest from these proposals.
In practice, a detected proposal may only cover part of the object of interest, so we detect a set of object proposals and merge them together to construct a large mask, which has a better chance of covering the whole object. Clearly, this merged mask may only loosely bound the object of interest, and the object could be extracted by mask-based segmentation algorithms. Specifically, we apply a recent FusionEdgeBox algorithm to detect the top 10 object proposals in each video frame for the merged mask.
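A minimal sketch of the merging step, assuming that "merging" means taking the union of the proposal boxes (the text does not spell out the exact rule). The function name and the `(x0, y0, x1, y1)` box layout with exclusive end coordinates are illustrative.

```python
def merge_proposals(boxes, width, height):
    """Union of proposal boxes as a binary mask (rows of 0/1 values)."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


# Two overlapping 2x2 proposals in a 4x4 frame cover 7 pixels in total.
mask = merge_proposals([(0, 0, 2, 2), (1, 1, 3, 3)], 4, 4)
print(sum(map(sum, mask)))  # 7
```

The resulting mask plays the role of a loosely bounded input region for the segmentation algorithm.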
This experiment is conducted on a subset (21 videos, 657 frames) of the JHMDB video dataset [9]. Table 2 shows the unsupervised video segmentation performance, in terms of F-measure and Error Rate averaged over all the frames. We can see that the proposed LooseCut substantially outperforms GrabCut, OneCut and pPBC in this task. Sample video segmentation results are shown in Fig. 8.
4.3 Image Saliency Detection
Recently, GrabCut has been used to detect the salient area from an image. As illustrated in Fig. 9, a set of pre-defined bounding boxes are overlaid on the input image and, with each bounding box, GrabCut is applied for a foreground segmentation. The probabilistic saliency map is finally constructed by combining all the foreground segmentations. In this experiment, it is clear that many pre-defined bounding boxes are not tight.
In this experiment, out of 1000 images in the Salient Object dataset, we randomly select 100 images for testing. The 15 pre-defined masks are shown in Fig. 9. For quantitative evaluation, we follow prior work and binarize each resulting saliency map using an adaptive threshold (two times the mean saliency of the map). Table 3 reports the precision, recall and F-measure of saliency detection when using GrabCut, OneCut, pPBC, and LooseCut for foreground segmentation. We also include comparisons with two state-of-the-art saliency detection methods that do not use pre-defined masks, namely FT [1] and RC [7]. Sample saliency detection results are shown in Fig. 10.
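The combination and thresholding steps can be sketched as follows. Averaging the binary segmentations into a saliency map is an assumption (the text only says the segmentations are combined), as is the use of `>=` at the threshold. The factor of two times the mean saliency comes from the description above.

```python
def saliency_from_segmentations(segmentations):
    """Average a list of binary foreground masks into a saliency map in [0, 1]."""
    n = len(segmentations)
    rows, cols = len(segmentations[0]), len(segmentations[0][0])
    return [
        [sum(seg[r][c] for seg in segmentations) / n for c in range(cols)]
        for r in range(rows)
    ]


def binarize_adaptive(saliency):
    """Binarize a saliency map at two times its mean saliency."""
    values = [v for row in saliency for v in row]
    threshold = 2.0 * sum(values) / len(values)
    return [[int(v >= threshold) for v in row] for row in saliency]


segs = [
    [[1, 1, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 0]],
    [[1, 1, 0], [0, 0, 0]],
]
saliency = saliency_from_segmentations(segs)
print(binarize_adaptive(saliency))  # [[1, 1, 0], [0, 0, 0]]
```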
We can see that LooseCut outperforms GrabCut, OneCut and pPBC in this task. It also outperforms FT, which does not use bounding-box based segmentation. RC [7] achieves the best performance for saliency detection, because it combines more complex saliency cues than the segmentation-based approaches.
|Methods|GrabCut Dataset: F-measure|GrabCut Dataset: Error Rate (%)|Weizmann Dataset: F-measure|Weizmann Dataset: Error Rate (%)|iCoseg Dataset: F-measure|iCoseg Dataset: Error Rate (%)|
|LooseCut w/o proposed constraint & term|0.788|13.7|0.688|19.4|0.686|15.0|
|LooseCut w/o global similarity constraint|0.801|12.0|0.709|17.9|0.691|14.8|
|LooseCut w/o label consistency term|0.822|7.3|0.836|7.4|0.806|6.3|
4.4 Additional Results
In this section, we report additional results that justify the usefulness of the global similarity constraint and the label consistency term, the running time of the proposed algorithm and possible failure cases.
We run experiments on the three image segmentation datasets by removing the global similarity constraint and/or the label consistency term, together with their corresponding optimization steps in the proposed LooseCut algorithm. The quantitative performance is shown in Table 4. We can see that both the global similarity constraint and the label consistency term help improve the segmentation performance. The global similarity constraint improves the segmentation performance more significantly than the label consistency term does.
For the running time, we test LooseCut and all the comparison methods on a PC with an Intel 3.3GHz CPU and 4GB RAM, and compare their running times for different image sizes. In this experiment, OneCut has only one iteration, and the iterations of GrabCut and LooseCut are run until convergence or until a maximum of 10 iterations is reached. As shown in Table 5, for small images, the running times of the three algorithms are very close. For large images, LooseCut and OneCut take more time than GrabCut. In general, LooseCut still shows a reasonable running time. Our current LooseCut code is implemented in Matlab and C++, and it can be substantially optimized for speed.
Due to the proposed global similarity constraint and label consistency term, LooseCut may fail when the foreground and background show highly similar appearances, as shown in Fig. 11.
This paper proposed a new LooseCut algorithm for interactive image segmentation with a loosely bounded box as input. We introduced a global similarity constraint and a label consistency term into the MRF model, and developed an iterative algorithm to solve the new MRF model. Experiments on three image segmentation datasets showed the effectiveness of LooseCut against several state-of-the-art algorithms. We also showed that LooseCut can be used to enhance the important applications of unsupervised video segmentation and image saliency detection.
[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604, 2009.
[2] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80, 2010.
[3] S. Alpert, M. Galun, R. Basri, and A. Brandt. Image segmentation by probabilistic bottom-up aggregation and cue integration. In CVPR, pages 1–8, 2007.
[4] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. icoseg: Interactive co-segmentation with intelligent scribble guidance. In CVPR, pages 3169–3176, 2010.
[5] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, pages 109–122, 2002.
[6] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In ICCV, pages 105–112, 2001.
[7] M.-M. Cheng, N. Mitra, X. Huang, P. Torr, and S.-M. Hu. Global contrast based salient region detection. TPAMI, 37(3):569–582, 2015.
[8] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, pages 2141–2148, 2010.
[9] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, pages 3192–3199, 2013.
[10] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
[11] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
[12] Y. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, pages 1995–2002, 2011.
[13] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp. Image segmentation with a bounding box prior. In ICCV, pages 277–284, 2009.
[14] H. Li, F. Meng, and K. Ngan. Co-salient object detection from multiple images. TMM, 15(8):1896–1909, 2013.
[15] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314, 2004.
[16] M. Tang, I. Ayed, and Y. Boykov. Pseudo-bound optimization for binary energies. In ECCV, pages 691–707, 2014.
[17] M. Tang, L. Gorelick, O. Veksler, and Y. Boykov. Grabcut in one cut. In ICCV, pages 1769–1776, 2013.
[18] J. Wu, Y. Zhao, J.-Y. Zhu, S. Luo, and Z. Tu. Milcut: A sweeping line multiple instance learning paradigm for interactive image segmentation. In CVPR, pages 256–263, 2014.
[19] H. Yu, M. Xian, and X. Qi. Unsupervised co-segmentation based on a new global gmm constraint in mrf. In ICIP, pages 4412–4416, 2014.
[20] D. Zhang, O. Javed, and M. Shah. Video object co-segmentation by regulated maximum weight cliques. In ECCV, pages 551–566, 2014.
[21] Y. Zhou, L. Ju, and S. Wang. Multiscale superpixels and supervoxels based on hierarchical edge-weighted centroidal voronoi tessellation. In WACV, pages 1076–1083, 2015.
[22] Y. Zhou, H. Yu, and S. Wang. Feature sampling strategies for action recognition. CoRR, abs/1501.06993, 2015.
[23] C. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.