Binary image segmentation, the process of partitioning a digital image into foreground and background regions, is one of the most fundamental problems in computer vision and is widely used in applications such as image and video editing, object extraction and recognition, photo composition, and medical image analysis. Automatic image segmentation sometimes cannot produce satisfactory results because the foreground object is ambiguous when prior knowledge or a high-level understanding of the content is absent. Therefore, most existing binary image segmentation algorithms accept user-provided interactions that convey information about the object of interest, such as scribbles [Bai and Guillermo2007, Boykov and Mariepierre2001, Boykov and Kolmogorov2006, Protiere and Sapiro2007, Wang, Agrawala, and Cohen2007] or bounding boxes [Rother, Kolmogorov, and Blake2004, Lempitsky et al.2009, Cheng et al.2015]. These interactions supply high-level semantic knowledge, which in turn leads to the desired binary segmentation results.
In general, there are two basic requirements for a good interactive image segmentation algorithm: 1) given a certain user input, the algorithm should produce accurate segmentation results that reflect the user intent; 2) user interaction should not be too complicated. To these ends, we adopt “clicks” or “scribbles” on the desired foreground and background regions (see Figure 1) as effective user interactions, which can be easily controlled in most cases.
Many interactive segmentation algorithms have emerged in recent years. A popular framework is the total variation model [Unger et al.2008, Kwon, Li, and Wong2013, Shi, Pang, and Xu2016]. This model consists of a unary term that uses foreground and background colors inferred from the respective seed pixels, and a total variation term that localizes edges. While the total variation term can explicitly refine object boundaries, it does not use any color information as guidance, leading to unsatisfactory results in many cases. Another popular method is graph cut, first introduced by [Boykov and Mariepierre2001], with numerous variations (e.g., geodesic graph cut [Price, Morse, and Cohen2010]). However, graph cut may suffer from a shrink bias toward shorter paths [Price, Morse, and Cohen2010] because its pairwise term includes a summation over the boundary of the segmented regions. Geodesic graph cut seeks to solve this problem by using the geodesic distance instead of the Euclidean distance in the unary term, but computing geodesic distances over the whole image is time-consuming. In addition to the aforementioned approaches, some image segmentation methods operate on superpixels instead of pixels, because superpixels reduce the computational complexity and thus accelerate the algorithms [Wang, Ju, and Wang2011, Schick, Bauml, and Stiefelhagen2012, Papazoglou and Ferrari2013, Rantalankila, Kannala, and Rahtu2014, Wang, Shen, and Porikli2015, Khoreva et al.2017]. Although the use of superpixels reduces the computational cost, it may produce inaccurate boundaries. Therefore, superpixel-based segmentation methods often need post-processing to refine the boundaries [Schick, Bauml, and Stiefelhagen2012, Feng et al.2016].
In this work, we aim to design an interactive binary image segmentation method that achieves high-quality segmentation results. To this end, we formulate the task as a binary labeling problem via a Markov Random Field (MRF) with unary and pairwise terms. It has been shown that the fast bilateral solver (FBS) [Barron and Poole2016], a recently developed algorithm for edge-aware smoothing, can denoise while preserving object boundaries efficiently. Therefore, we take a bilateral affinity term as the pairwise term in the MRF framework in order to obtain bilateral-smooth results. Through an alternating direction strategy, we apply steepest gradient descent (SGD) and FBS together to effectively solve the target energy minimization problem. To sum up, our method includes the following major components and advantages:
- Geodesic distance is employed to compute the unary term, which separates regions with similar color appearance that belong to different labels. For efficiency, we compute the geodesic distance at the superpixel level, which dramatically reduces the computational burden.
- The bilateral affinity is used as the pairwise term, which produces segmentation results with good boundary sensitivity regardless of image complexity.
- The overall optimization problem is split into two subproblems through an alternating direction approach, which can be effectively solved via SGD and FBS, respectively.
Interactive image segmentation
A large body of work has been proposed for interactive segmentation of color images. Among these methods, graph cut [Boykov and Mariepierre2001] has been a very popular approach, in which an image is represented as a graph and a globally optimal segmentation based on MRF energy minimization is found by balancing region and boundary information. However, graph cut is often limited because it relies only on color information and can thus fail when the foreground and background color distributions overlap or are very complicated. In addition, graph cut may suffer from the aforementioned shrink bias problem [Price, Morse, and Cohen2010].
Another effective framework uses geodesic information. A geodesic image segmentation algorithm was proposed in [Criminisi, Sharp, and Blake2008], in which image segmentation is viewed as an approximate energy minimization problem in a conditional random field, and the segmentation is obtained by expanding outward from the seeds to selectively fill the desired region. Geodesic information is useful for selecting objects with complex boundaries, such as those with long and thin parts. However, since it only works from the interior of the selected object outwards and does not explicitly consider the object boundary, it may suffer from a bias that favors shorter paths back to the seeds. Another interactive segmentation method based on geodesic information is the geodesic graph cut introduced in [Price, Morse, and Cohen2010]. It combines geodesic distance-based region information with edge information in the graph cut optimization framework. It also introduces a spatially varying weighting scheme based on the local confidence of the geodesic component, which adjusts the relative weighting between the unary and pairwise terms. This confidence-based weighting of the geodesic distance component helps avoid the tendency of geodesic segmentation to degenerate to Euclidean distance maps when the foreground/background colors are indistinct.
Some interactive image segmentation methods operate at the superpixel level instead of the pixel level, since superpixels group pixels into perceptually meaningful regions and greatly reduce the computational complexity. [Schick, Bauml, and Stiefelhagen2012] improved foreground segmentation through superpixel-based post-processing. This method first converts the pixel-based segmentation into a probabilistic superpixel representation and then uses an MRF to refine the segmentation. In [Feng et al.2016], superpixels are used instead of pixels as graph nodes in a modified graph cut algorithm in order to ensure good responsiveness and efficiency of interactive segmentation.
Many studies on image segmentation impose regularity on the results by adding a pairwise term to the model [Chartrand and Staneva2008, Lam, Gao, and Liew2010, Zhu, Tai, and Chan2013, Zhang et al.2015]. The Rudin-Osher-Fatemi and total variation regularizations were used in [Chartrand and Staneva2008], which can preserve and smooth boundary edges. Euler’s elastica was applied in [Zhu, Tai, and Chan2013] as the regularization of the active contour segmentation model. Although the modified active contour model is able to preserve local boundaries as well as capture fine elongated structures of objects, it still tends to omit relatively small objects. In [Zhang et al.2015], an image segmentation method was proposed that adopts a sparse and low-rank based nonconvex regularization. This model can capture the global structure of the whole data, often leading to better segmentation results than the total variation based model. All of the aforementioned regularization methods are based only on pixel label similarity and do not take image color information into account. Bilateral affinity is another regularization that can be added to image segmentation models. It combines color and spatial information to locate the object boundary and to denoise, producing smooth segmentation results. The fast bilateral solver [Barron and Poole2016], a novel algorithm that can very efficiently compute the bilateral affinity term, has been applied in various computer vision tasks, such as stereo, depth super-resolution and colorization, to produce edge-aware smoothing results. We use the bilateral affinity regularization in the proposed algorithm.
The proposed method
The proposed method is based on minimizing the following energy functional within the MRF framework:
$$E(\mathbf{x}) = \sum_{i\in\mathcal{V}} \psi_u(x_i) + \lambda \sum_{(i,j)\in\mathcal{E}} \psi_p(x_i, x_j), \tag{1}$$
where $\mathcal{V}$ is the set of pixels of the given image, $\mathcal{E}$ is the set of pairs of neighboring pixels, $\lambda$ is a regularizing parameter, and $\mathbf{x} = (x_i)_{i\in\mathcal{V}}$ is a binary vector of labels defined by
$$x_i = \begin{cases} 1, & i \in \mathcal{F}, \\ 0, & i \in \mathcal{B}, \end{cases} \tag{2}$$
with $\mathcal{F}$ and $\mathcal{B}$ denoting the sets of pixels of the foreground and background of the image, respectively. The first and second terms on the right-hand side of (1) are often referred to as the “unary term” and the “pairwise term”, respectively. In the binary segmentation task, the unary term represents the cost of assigning label $x_i$ to pixel $i$, and is often formulated as follows:
How to choose this cost is a core issue of the binary segmentation problem. Two commonly used forms are (3), derived from the “Gaussian color model” for each region, and (4), derived from the “general color distribution”. The former is built from the image intensity at pixel $i$ and the means and variances of the foreground/background seed sets; the latter is built from the probability of the $i$-th pixel belonging to the foreground/background. The pairwise term represents the cost of assigning labels to a pixel pair $i$ and $j$, and is used to introduce additional smoothness constraints into the segmentation task, such as denoising and edge preservation.
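To make the roles of the unary and pairwise terms concrete, the following minimal sketch evaluates an energy of the form (1) on a toy chain of four pixels and finds its minimizer by exhaustive search. The cost table `unary`, the Potts-style pairwise cost (a weight paid whenever two neighbors disagree), and all numeric values are illustrative assumptions, not the terms used in this paper.

```python
from itertools import product

def mrf_energy(labels, unary, edges, weights, lam):
    """Sum of unary costs plus lam times the weighted label disagreements.

    unary[i][x] is the assumed cost of giving label x to pixel i.
    The Potts-style pairwise cost is an illustrative choice.
    """
    data = sum(unary[i][x] for i, x in enumerate(labels))
    smooth = sum(w for (i, j), w in zip(edges, weights) if labels[i] != labels[j])
    return data + lam * smooth

# Toy chain of four pixels; label 1 = foreground, 0 = background.
unary = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.3), (0.9, 0.1)]
edges = [(0, 1), (1, 2), (2, 3)]
weights = [1.0, 0.2, 1.0]  # the weak middle link marks a likely boundary

# Exhaustive search is only feasible for tiny examples like this one.
best = min(product((0, 1), repeat=4),
           key=lambda x: mrf_energy(x, unary, edges, weights, lam=0.5))
print(best, mrf_energy(best, unary, edges, weights, lam=0.5))
```

The minimizer places the label change on the weak link, which is the behavior an edge-aware pairwise term is meant to encourage.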
In our method, the unary term is computed from the geodesic distance, with superpixel-based acceleration, and aims at distinguishing similar colors in the foreground and background regions. The pairwise term is generated by introducing the bilateral affinity as a regularizer, which ensures edge preservation. Finally, to efficiently minimize (1), we adopt the alternating direction strategy to split the minimization problem into two subproblems, and then iteratively solve them by SGD and FBS until the solution converges.
Next we present a detailed description of the proposed method and its implementation techniques.
In the binary segmentation task, users need to provide some cues to distinguish the foreground and background regions. The unary term is then determined from these cues. Although (3) and (4) are widely used in many interactive segmentation methods, their drawbacks are also obvious. For example, they can fail to separate regions of similar color that lie on either side of the foreground/background boundary. To overcome this difficulty, we adopt the geodesic distance at the superpixel level to compute the unary term.
Let $\Omega_l$ be the set of seeds with user-annotated label $l \in \{F, B\}$, where $F$ and $B$ stand for the foreground and the background, respectively. The geodesic distance from a pixel $i$ to $\Omega_l$ is then defined by
$$D_l(i) = \min_{s \in \Omega_l} d(s, i), \tag{5}$$
where $d(s, i)$ denotes the geodesic distance between the two pixels $s$ and $i$, which can be computed by the well-known Dijkstra algorithm. However, computing geodesic distances at the pixel level is extremely time-consuming, which restricts its application in practice, especially when the image is very large. To reduce this computational burden, we use superpixels to approximate and accelerate the computation of geodesic distances.
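As a point of reference for the pixel-level computation described above, here is a minimal sketch of a multi-source Dijkstra search on a 4-connected pixel grid. The edge cost (absolute intensity difference between neighbors), the function name `geodesic_distance`, and the toy image are assumptions made for illustration; the paper does not specify these details here.

```python
import heapq

def geodesic_distance(image, seeds):
    """Distance from every pixel to the nearest seed, by multi-source Dijkstra.

    Assumed edge cost: absolute intensity difference between 4-neighbors.
    """
    h, w = len(image), len(image[0])
    dist = [[float("inf")] * w for _ in range(h)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + abs(image[nr][nc] - image[r][c])
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist

# Toy image: dark left half, bright right half.
image = [[0, 0, 5, 5],
         [0, 0, 5, 5]]
d_fg = geodesic_distance(image, seeds=[(0, 0)])  # foreground seed
d_bg = geodesic_distance(image, seeds=[(0, 3)])  # background seed
labels = [[1 if f < b else 0 for f, b in zip(rf, rb)]
          for rf, rb in zip(d_fg, d_bg)]
print(labels)
```

Every pixel is pushed onto the queue at least once, which is why the cost grows quickly with image size.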
A superpixel can be defined as a group of pixels that share similar characteristics; it is generally a color-based segmentation. Superpixels have become fundamental units in many important computer vision tasks. Our idea is described by the following steps: 1) generate a set of superpixels of the image, $\{S_k\}_{k=1}^{K}$, with centers $\{c_k\}_{k=1}^{K}$, by an existing method such as SLIC [Radhakrishna et al.2010] or HEWCVT [Zhou, Ju, and Wang2015]; 2) for $k = 1, \dots, K$, compute $D_l(c_k)$, the geodesic distance from the center $c_k$ of the superpixel $S_k$ to the seed set with label $l$ (foreground or background); 3) for every pixel $i \in S_k$, use $D_l(c_k)$ in place of the pixel-level distance when forming the unary term.
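The three steps above can be sketched as follows, assuming the superpixels have already been generated. The adjacency-graph formulation, the edge cost (difference of mean colors), and the names `superpixel_geodesic` and `broadcast_to_pixels` are hypothetical choices for illustration; a real pipeline would obtain `label_map` from SLIC or a similar method.

```python
import heapq

def superpixel_geodesic(mean_color, adjacency, seed_superpixels):
    """Dijkstra over the superpixel adjacency graph.

    Assumed edge cost: absolute difference of superpixel mean colors.
    """
    dist = {k: float("inf") for k in mean_color}
    heap = []
    for k in seed_superpixels:
        dist[k] = 0.0
        heapq.heappush(heap, (0.0, k))
    while heap:
        d, k = heapq.heappop(heap)
        if d > dist[k]:
            continue  # stale queue entry
        for n in adjacency[k]:
            nd = d + abs(mean_color[n] - mean_color[k])
            if nd < dist[n]:
                dist[n] = nd
                heapq.heappush(heap, (nd, n))
    return dist

def broadcast_to_pixels(label_map, sp_dist):
    """Step 3: every pixel inherits the distance of its superpixel."""
    return [[sp_dist[k] for k in row] for row in label_map]

# Toy 4x4 image split into four 2x2 superpixels (ids 0..3).
label_map = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [2, 2, 3, 3],
             [2, 2, 3, 3]]
mean_color = {0: 0.0, 1: 0.0, 2: 4.0, 3: 4.0}
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

sp_fg = superpixel_geodesic(mean_color, adjacency, seed_superpixels=[0])
sp_bg = superpixel_geodesic(mean_color, adjacency, seed_superpixels=[3])
pixel_fg = broadcast_to_pixels(label_map, sp_fg)
print(sp_fg, sp_bg)
```

The search now visits one node per superpixel instead of one per pixel, which is the source of the speed-up.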
The generation of superpixels is fast and dramatically relieves the computational burden of geodesic distances. However, there are also some drawbacks in practice. For example, superpixels often cross the edges between two objects of similar color, which directly affects the final foreground/background segmentation results. To overcome this drawback, we choose the pairwise term in (1) according to the bilateral affinity matrix $W$. Each element $W_{ij}$ of the bilateral affinity matrix reflects the affinity between pixels $i$ and $j$ in the reference image in the YUV colorspace:
$$W_{ij} = \exp\left(-\frac{\|(p_i^x, p_i^y) - (p_j^x, p_j^y)\|^2}{2\sigma_{xy}^2} - \frac{(p_i^l - p_j^l)^2}{2\sigma_l^2} - \frac{\|(p_i^u, p_i^v) - (p_j^u, p_j^v)\|^2}{2\sigma_{uv}^2}\right), \tag{6}$$
where $p_i$ is a pixel in the reference image with spatial position $(p_i^x, p_i^y)$ and color $(p_i^l, p_i^u, p_i^v)$ (for clarity, we denote $p_i = (p_i^x, p_i^y, p_i^l, p_i^u, p_i^v)$), and the parameters $\sigma_{xy}$, $\sigma_l$ and $\sigma_{uv}$ control the extent of the spatial, luma, and chroma support of the filter, respectively. The choice of (6) leads to edge preservation to some extent.
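A minimal sketch of a single bilateral affinity entry, written in the form used by the fast bilateral solver [Barron and Poole2016], is given below. The function name, the tuple layout `(x, y, l, u, v)`, and the parameter values are assumptions chosen for illustration.

```python
import math

def bilateral_affinity(p, q, sigma_xy, sigma_l, sigma_uv):
    """Affinity between two pixels given as (x, y, l, u, v) tuples."""
    d_xy = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2   # spatial term
    d_l = (p[2] - q[2]) ** 2                         # luma term
    d_uv = (p[3] - q[3]) ** 2 + (p[4] - q[4]) ** 2   # chroma term
    return math.exp(-d_xy / (2 * sigma_xy ** 2)
                    - d_l / (2 * sigma_l ** 2)
                    - d_uv / (2 * sigma_uv ** 2))

# Illustrative parameter values, not the settings used in the experiments.
params = dict(sigma_xy=8.0, sigma_l=4.0, sigma_uv=4.0)
p = (0, 0, 100, 128, 128)
same_side = (1, 0, 100, 128, 128)    # neighbor with the same color
across_edge = (1, 0, 200, 128, 128)  # neighbor across a strong luma edge

print(bilateral_affinity(p, same_side, **params))
print(bilateral_affinity(p, across_edge, **params))
```

Neighbors separated by a strong color edge receive an affinity close to zero, so the smoothness penalty does not act across the edge.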
Fast bilateral solver
The fast bilateral solver, proposed in [Barron and Poole2016], is a novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of domain-specific optimization algorithms. FBS minimizes the following functional:
$$\min_{\mathbf{x}} \; \frac{\lambda}{2} \sum_{i,j} \hat{W}_{ij} (x_i - x_j)^2 + \sum_i c_i (x_i - t_i)^2, \tag{7}$$
where $\mathbf{t}$ is an input target vectorized image, $\mathbf{c}$ is a confidence vectorized image, $\lambda$ is the pairwise term multiplier, and $\hat{W}$ is the bistochasticized version of the affinity matrix $W$ defined by (6), whose entries are the bilateral filter weights for the pixel pairs given a reference RGB image.
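Directly solving (7) is generally computationally expensive, especially when the image has high resolution. There are techniques for speeding up bilateral filtering. Two of them, the permutohedral lattice [Adam2006] and the bilateral grid [Bai and Guillermo2007], express bilateral filtering as a splat/blur/slice procedure: 1) pixel values are “splatted” onto a small set of vertices in a grid; 2) those values are “blurred” in bilateral space; 3) the filtered pixel values are produced via a “slice” (an interpolation) of the blurred vertex values. These approaches correspond to a compact and efficient factorization of $W$:
$$W = S^T \bar{B} S, \tag{8}$$
where the multiplication by $S$ is the “splat”, the multiplication by $\bar{B}$ is the “blur”, and the multiplication by $S^T$ is the “slice”. The factorization (8) allows the optimization problem (7) to be “splatted” and solved in the bilateral space. A bistochasticized version of $W$ can be obtained on the “simplified” bilateral grid [Barron et al.2015]:
$$\hat{W} = S^T D_m^{-1} D_n \bar{B} D_n D_m^{-1} S, \tag{9}$$
where $D_m$ and $D_n$ are two bistochastization matrices satisfying
$$S S^T = D_m, \qquad \hat{W}\mathbf{1} = \mathbf{1}.$$
With $\hat{W}$ bistochastic, problem (7) can be written in matrix form as
$$\min_{\mathbf{x}} \; \lambda\, \mathbf{x}^T (I - \hat{W})\, \mathbf{x} + (\mathbf{x} - \mathbf{t})^T \mathrm{diag}(\mathbf{c})\, (\mathbf{x} - \mathbf{t}). \tag{10}$$
By introducing the transformation $\mathbf{x} = S^T \mathbf{y}$, the optimization (10) in terms of pixels can be described as an optimization in terms of bilateral-space vertices:
$$\min_{\mathbf{y}} \; \frac{1}{2}\mathbf{y}^T A \mathbf{y} - \mathbf{b}^T \mathbf{y} + c, \tag{11}$$
where
$$A = \lambda (D_m - D_n \bar{B} D_n) + \mathrm{diag}(S\mathbf{c}), \qquad \mathbf{b} = S(\mathbf{c} \circ \mathbf{t}), \qquad c = \frac{1}{2} (\mathbf{c} \circ \mathbf{t})^T \mathbf{t},$$
and $\circ$ is the Hadamard product. Note that (11) is a quadratic optimization problem, and solving it is equivalent to solving the sparse linear system
$$A \mathbf{y} = \mathbf{b}.$$
A pixel-space solution of (11) can then be obtained by simply slicing $\mathbf{y}$:
$$\hat{\mathbf{x}} = S^T \mathbf{y} = S^T (A^{-1} \mathbf{b}).$$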
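The splat/blur/slice procedure described above can be illustrated on a 1D signal with a small bilateral grid. This is a simplified sketch of edge-aware filtering only, not of the solver: the nearest-vertex splat and slice, the [1, 2, 1] blur kernel, the cell sizes, and the function name `bilateral_grid_filter` are all assumptions.

```python
def bilateral_grid_filter(signal, cell_x=2, cell_v=0.5):
    """Edge-aware smoothing of a 1D signal via splat / blur / slice."""
    n = len(signal)
    vmin = min(signal)
    gx = (n - 1) // cell_x + 1                     # grid size along position
    gv = int((max(signal) - vmin) / cell_v) + 1    # grid size along value

    def vertex(i):
        # Nearest-vertex assignment of pixel i in (position, value) space.
        return i // cell_x, int((signal[i] - vmin) / cell_v)

    # Splat: accumulate value sums and weights at the grid vertices.
    val = [[0.0] * gv for _ in range(gx)]
    cnt = [[0.0] * gv for _ in range(gx)]
    for i, s in enumerate(signal):
        a, b = vertex(i)
        val[a][b] += s
        cnt[a][b] += 1.0

    # Blur: separable [1, 2, 1] kernel along position, then along value.
    def blur(grid):
        def at(g, a, b):
            return g[a][b] if 0 <= a < gx and 0 <= b < gv else 0.0
        pos = [[at(grid, a - 1, b) + 2 * grid[a][b] + at(grid, a + 1, b)
                for b in range(gv)] for a in range(gx)]
        return [[at(pos, a, b - 1) + 2 * pos[a][b] + at(pos, a, b + 1)
                 for b in range(gv)] for a in range(gx)]

    val, cnt = blur(val), blur(cnt)

    # Slice: read back the normalized blurred value at each pixel's vertex.
    return [val[a][b] / cnt[a][b] for a, b in map(vertex, range(n))]

# Noisy step edge: low values on the left, high values on the right.
signal = [0.0, 0.2, 0.0, 0.2, 2.0, 1.8, 2.0, 1.8]
smoothed = bilateral_grid_filter(signal)
print(smoothed)
```

Because the two sides of the step land in grid cells that are far apart along the value axis, the blur averages within each side but not across the edge.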
Solution by alternating direction
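The minimization of the energy functional (1) is equivalent to the split problem (12), in which the unary term and the pairwise term act on two coupled copies of the label vector.
The optimization problem (12) can be solved effectively by iterating the following alternating direction procedure:
- fixing the copy in the pairwise term, solve the minimization problem (13) for the copy in the unary term;
- fixing the copy in the unary term, solve the minimization problem (14) for the copy in the pairwise term.
The subproblems involve an additional regularizing parameter. The sub-problem (13) can be easily solved by SGD, and (14) has the same form as (7), so it can be solved efficiently by FBS. The whole segmentation algorithm is presented in Algorithm 1.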
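The alternating scheme can be sketched on a 1D toy problem. Everything specific here is an assumption made for illustration: the variable names `x` and `z`, the quadratic coupling with weight `beta`, the linear unary cost `f[i] * x[i]` with labels relaxed to [0, 1], and the Gauss-Seidel sweeps, which stand in for FBS as a generic solver of the quadratic subproblem.

```python
def alternating_minimization(f, w, lam=5.0, beta=1.0,
                             outer=10, gd_steps=20, sweeps=50):
    """Alternate between a unary subproblem in x and a smoothing subproblem in z.

    f[i]: assumed linear unary cost (negative values favor foreground).
    w[i]: assumed affinity between neighboring pixels i and i + 1.
    """
    n = len(f)
    x = [0.5] * n
    z = [0.5] * n
    step = 0.5 / beta
    for _ in range(outer):
        # x-step: projected gradient descent on
        # sum_i f_i x_i + (beta / 2) (x_i - z_i)^2 with x_i in [0, 1].
        for _ in range(gd_steps):
            x = [min(1.0, max(0.0, xi - step * (fi + beta * (xi - zi))))
                 for xi, fi, zi in zip(x, f, z)]
        # z-step: Gauss-Seidel sweeps on
        # lam * sum_i w_i (z_i - z_{i+1})^2 + beta * sum_i (z_i - x_i)^2.
        for _ in range(sweeps):
            for i in range(n):
                num, den = beta * x[i], beta
                if i > 0:
                    num += lam * w[i - 1] * z[i - 1]
                    den += lam * w[i - 1]
                if i < n - 1:
                    num += lam * w[i] * z[i + 1]
                    den += lam * w[i]
                z[i] = num / den
    return x, z

# Unary costs with one misleading pixel on each side of a weak link.
f = [-1.0, -1.0, 0.3, -1.0, 1.0, 1.0, -0.3, 1.0]
w = [1.0, 1.0, 1.0, 0.01, 1.0, 1.0, 1.0]
x, z = alternating_minimization(f, w)
labels = [1 if zi > 0.5 else 0 for zi in z]
print(labels)
```

Thresholding the unary costs alone would mislabel the two noisy pixels; the smoothing step corrects them while the weak link keeps the two halves separate.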
In this section, we first introduce the testing dataset and then compare the interactive binary segmentation performance of the proposed method with that of several state-of-the-art methods, based on commonly used evaluation criteria. In addition, we give an ablation study of the proposed method.
In this study, we test the proposed method on the VGG interactive image segmentation dataset, provided by the Visual Geometry Group, University of Oxford (http://www.robots.ox.ac.uk/~vgg/data/iseg) [Gulsha et al.2010]. This dataset contains 151 images with ground truth (GT) segmentations. In addition, the dataset provides a simple annotation for each image. In detail, there are 49 images from GrabCut, 99 images from PASCAL VOC’09 and 3 images from the alpha-matting dataset. Images from the GrabCut dataset contain complex shapes, but the foreground and background tend to have disjoint color distributions. The VOC dataset, on the other hand, has simpler shapes (e.g., car, bus) but more complex appearances, where the color distributions of the foreground and background overlap. The given annotations are quite simple; for some complicated images they do not offer sufficient semantic information, and we find that they lead to undesired segmentation results for all compared methods. We therefore add some extra clicks or scribbles to these images, producing a set of slightly modified annotations that is expected to improve the segmentation results.
Comparisons with other methods
We use several quality measures, namely the IoU, F-score, error rate, boundary precision and boundary recall, to evaluate the performance of the proposed method (PM) and compare it with other well-known binary segmentation methods: geodesic graph cut (GEO), graph cut (GC), the total variation model using the primal-dual method (TVPD), and the total variation model using the alternating direction method (TVAD). The two total variation methods use the same unary term as the proposed method, but with total variation as the pairwise term. For the proposed method, we use 1600 superpixels produced by SLIC to compute the geodesic distances (the effect of the number of superpixels on the segmentation performance is investigated in a later subsection). We also set and in the proposed method.
We first compare the segmentation methods on the VGG dataset with original annotations (OA). Figure 2 presents some examples for visual comparison, and Table 1 reports the average IoU, F-score and error rate of these methods over the whole dataset. Note that the higher the IoU and F-score are, the better the segmentation results. The IoU of PM is 0.623, which is higher than that of all other methods, whose IoU ranges from 0.476 to 0.613. As to the F-score, PM achieves 0.812, again clearly ahead of the other methods. The error rate of PM is 7.91%, which is lower than that of the other methods. To demonstrate better segmentation performance, we also compare these methods on the dataset with modified annotations (MA). Owing to the additional prior information provided by the foreground and background seeds, the segmentation performance improves for all methods. At the same time, PM still performs the best among all methods: its IoU now reaches 0.834 and its F-score 0.930, and its error rate is small. The average boundary precision and boundary recall of these methods are compared in Figure 3, which shows that PM outperforms all other compared methods in edge preservation, except in boundary precision on the dataset with original annotations.
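For readers who want to reproduce region-based scores of this kind, the sketch below computes IoU, an F-score and an error rate from flattened binary masks. The exact definitions used in the experiments are not spelled out in this section, so the formulas here (the balanced F-score and the fraction of mislabeled pixels) and the name `segmentation_scores` are assumptions.

```python
def segmentation_scores(pred, gt):
    """Region scores for flattened binary masks (1 = foreground)."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Assumed definition: harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    # Assumed definition: fraction of pixels with the wrong label.
    error_rate = (fp + fn) / len(gt)
    return {"iou": iou, "f_score": f_score, "error_rate": error_rate}

gt = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = segmentation_scores(pred, gt)
print(scores)
```

Averaging these per-image scores over a dataset gives summary numbers comparable in kind to those reported in Table 1.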
We now test how the number of superpixels and the choice of regularizers affect the performance of the proposed method.
Effect of the number of superpixels
We now vary the number of superpixels in the proposed method and compare the segmentation results. As shown in Figure 4, there are some vacancies around the object edges in the segmentations when the number of superpixels is set to 800. When the number of superpixels rises to 1600, all vacancies are filled and the segmentation results become much more cohesive. The number of superpixels is a trade-off: fewer superpixels significantly decrease the computational cost of geodesic distances but degrade the segmentation performance, while more superpixels lead to better segmentations but require more computation time. We therefore investigate the effect of the number of superpixels on the segmentation performance quantitatively. Figure 5 shows that the average IoU and F-score increase slowly as the number of superpixels grows, and that beyond 1600 superpixels the segmentation performance barely improves. Therefore, we suggest setting the number of superpixels to around 1600 to balance segmentation performance and computational efficiency.
Effect of different regularizers
In the proposed method, we use the bilateral affinity as a regularizer and compute it efficiently with the FBS module. We first test the segmentation performance in two cases: with the FBS module and without it. Figure 6 shows the segmentation results of these two cases with 1600 superpixels for two example images. This experiment is run on the dataset with modified annotations. It is easy to see that the FBS module makes the object edges much more continuous and smooth.
[Figure panels, from left to right: RGB, PM w/o FBS, PM.]
Many other regularizers exist for preserving edge information in segmentation. For comparison with FBS, we take the commonly used total variation (TV) module to show the influence of different regularizers on the segmentation results. Several examples are presented in Figure 7, which demonstrates better performance of the FBS module over the TV module to some extent.
We present an interactive binary segmentation method based on the MRF framework, which contains a unary term and a pairwise term. The unary term is constructed from superpixel-based geodesic distances, which helps reduce the sensitivity to seed placement. To reduce the computational burden, the geodesic distances from the center of each superpixel to the foreground and background seed sets are computed instead of those from each pixel to the sets. Furthermore, we use the bilateral affinity as a regularizer to generate the pairwise term, which denoises and preserves the edge information well. To solve the resulting energy minimization problem, we take the alternating direction strategy to split it into two subproblems, which can be effectively solved by SGD and FBS, respectively. Experimental results on the VGG interactive image segmentation dataset demonstrate that the proposed method obtains satisfactory segmentations for a variety of images and can outperform several state-of-the-art methods according to the comparisons in this paper.
- [Adam2006] Adam. 2006. Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11):1768–1783.
- [Bai and Guillermo2007] Bai, X., and Guillermo, S. 2007. A geodesic framework for fast interactive image and video segmentation and matting. In ICCV.
- [Barron and Poole2016] Barron, J. T., and Poole, B. 2016. The fast bilateral solver. ECCV 617–632.
- [Barron et al.2015] Barron, J. T.; Adams, A.; Shih, Y.; and Hernandez, C. 2015. Fast bilateral-space stereo for synthetic defocus. In CVPR.
- [Boykov and Kolmogorov2006] Boykov, Y., and Kolmogorov, V. 2006. Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11):1768–1783.
- [Boykov and Mariepierre2001] Boykov, Y., and Mariepierre, J. 2001. Interactive graph cuts for optimal boundary and region segmentation of objects in nd images. In ICCV.
- [Chartrand and Staneva2008] Chartrand, R., and Staneva, V. 2008. Nonconvex regularization for image segmentation. In International Conference on Image Processing, 334–337.
- [Cheng et al.2015] Cheng, M.; Prisacariu, V. A.; Zheng, S.; Torr, P. H. S.; and Rother, C. 2015. Densecut: densely connected crfs for realtime grabcut. Computer Graphics Forum 34(7):193–201.
- [Criminisi, Sharp, and Blake2008] Criminisi, A.; Sharp, T.; and Blake, A. 2008. Geos: Geodesic image segmentation. 5302:99–112.
- [Feng et al.2016] Feng, J.; Price, B. L.; Cohen, S.; and Chang, S. 2016. Interactive segmentation on rgbd images via cue selection. In CVPR, 156–164.
- [Gulsha et al.2010] Gulsha, V.; Rother, C.; Criminisi, A.; Blake, A.; and Zisserman, A. 2010. Geodesic star convexity for interactive image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [Khoreva et al.2017] Khoreva, A.; Benenson, R.; Hosang, J. H.; Hein, M.; and Schiele, B. 2017. Simple does it: Weakly supervised instance and semantic segmentation. 1665–1674.
- [Kwon, Li, and Wong2013] Kwon, T. J.; Li, J.; and Wong, A. 2013. Etvos: An enhanced total variation optimization segmentation approach for sar sea-ice image segmentation. IEEE Transactions on Geoscience and Remote Sensing 51(2):925–934.
- [Lam, Gao, and Liew2010] Lam, B. S. Y.; Gao, Y.; and Liew, A. W. 2010. General retinal vessel segmentation using regularization-based multiconcavity modeling. IEEE Transactions on Medical Imaging 29(7):1369–1381.
- [Lempitsky et al.2009] Lempitsky, V. S.; Kohli, P.; Rother, C.; and Sharp, T. 2009. Image segmentation with a bounding box prior. In ICCV, 277–284.
- [Papazoglou and Ferrari2013] Papazoglou, A., and Ferrari, V. 2013. Fast object segmentation in unconstrained video. 1777–1784.
- [Price, Morse, and Cohen2010] Price, B. L.; Morse, B. S.; and Cohen, S. D. 2010. Geodesic graph cut for interactive image segmentation. 3161–3168.
- [Protiere and Sapiro2007] Protiere, A., and Sapiro, G. 2007. Interactive image segmentation via adaptive weighted distances. IEEE Transactions on Image Processing 16(4):1046–1057.
- [Radhakrishna et al.2010] Radhakrishna, A.; Appu, S.; Kevin, S.; Aurelien, L.; Pascal, F.; and Sabine, S. 2010. Slic superpixels. EPFL Technical Report 149300.
- [Rantalankila, Kannala, and Rahtu2014] Rantalankila, P.; Kannala, J.; and Rahtu, E. 2014. Generating object segmentation proposals using global and local search. 2417–2424.
- [Rother, Kolmogorov, and Blake2004] Rother, C.; Kolmogorov, V.; and Blake, A. 2004. Grabcut: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH 23(3):309–314.
- [Schick, Bauml, and Stiefelhagen2012] Schick, A.; Bauml, M.; and Stiefelhagen, R. 2012. Improving foreground segmentations with probabilistic superpixel markov random fields. 27–31.
- [Shi, Pang, and Xu2016] Shi, B.; Pang, Z.; and Xu, J. 2016. Image segmentation based on the hybrid total variation model and the k-means clustering strategy. Inverse Problems and Imaging 10(3):807–828.
- [Unger et al.2008] Unger, M.; Pock, T.; Trobin, W.; Cremers, D.; and Bischof, H. 2008. Tvseg - interactive total variation based image segmentation. 1–10.
- [Wang, Agrawala, and Cohen2007] Wang, J.; Agrawala, M.; and Cohen, M. F. 2007. Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(3):9.
- [Wang, Ju, and Wang2011] Wang, J.; Ju, L.; and Wang, X. 2011. Image segmentation using local variation and edge-weighted centroidal voronoi tessellations. IEEE Transactions on Image Processing 20(11):3242–3256.
- [Wang, Shen, and Porikli2015] Wang, W.; Shen, J.; and Porikli, F. 2015. Saliency-aware geodesic video object segmentation. 3395–3402.
- [Zhang et al.2015] Zhang, X.; Xu, C.; Li, M.; and Sun, X. 2015. Sparse and low-rank coupling image segmentation model via nonconvex. International Journal of Pattern Recognition and Artificial Intelligence 29(2):1555004.
- [Zhou, Ju, and Wang2015] Zhou, Y.; Ju, L.; and Wang, S. 2015. Multiscale superpixels and supervoxels based on hierarchical edge-weighted centroidal voronoi tessellation. IEEE transaction of image processing 24(11):3834–3845.
- [Zhu, Tai, and Chan2013] Zhu, W.; Tai, X.; and Chan, T. F. 2013. Image segmentation using euler’s elastica as the regularization. Journal of Scientific Computing 57(2):414–438.