Automatic Trimap Generation for Image Matting

by Vikas Gupta, et al.
IIT Gandhinagar

Image matting is a longstanding problem in computational photography. Although it has been studied for more than two decades, developing an automatic matting algorithm that requires no human effort remains a challenge. Most state-of-the-art matting algorithms require human intervention in the form of a trimap or scribbles to generate the alpha matte from the input image. In this paper, we present a simple and efficient approach to automatically generate the trimap from the input image and free the whole matting process from the human in the loop. We use a learning based matting method to generate the matte from the automatically generated trimap. Experimental results demonstrate that our method produces good quality trimaps, which result in accurate matte estimation. We validate our results by replacing the automatically generated trimap with a manually created trimap while using the same image matting algorithm.




1 Introduction

Image matting is the process of accurately estimating the foreground object in images and videos. It is a very important technique in image and video editing applications, particularly in film production for creating visual effects. In image segmentation, we divide the image into foreground and background by labeling the pixels. Image segmentation generates a binary image, in which each pixel belongs either to the foreground or to the background. Image matting differs from image segmentation in that some pixels may belong to the foreground as well as the background; such pixels are called partial or mixed pixels. Image matting is concerned with determining the convex combination of foreground and background intensities for each partial pixel. Porter and Duff first proposed, in 1984, the problem of accurately separating a foreground object from the background in order to composite it with a new background and create a new, realistic-looking image [17]. A preliminary version of this paper has been published in [9].
Given an image $I$, the image matting problem is mathematically stated as given in equation 1.

$$I(x, y) = \alpha(x, y)\,F(x, y) + \bigl(1 - \alpha(x, y)\bigr)\,B(x, y) \qquad (1)$$

where $\alpha$ represents the matte and can take any value in $[0, 1]$, and $F$ and $B$ are the foreground and background pixel values respectively. If $\alpha = 1$ or $\alpha = 0$, then the pixel at location $(x, y)$ belongs to the definite foreground or the definite background respectively. Otherwise the pixel is called a partial or mixed pixel. In natural images, the majority of pixels belong either to the definite foreground or to the definite background region. However, in order to fully separate the foreground from the background in an image, accurate estimation of the alpha values for partial or mixed pixels is necessary. Note that in equation 1, if we consider a full color (RGB) image, there are 7 unknowns ($F$ and $B$ for each color channel, and $\alpha$) and three equations (one for each color channel). Thus image matting is a severely under-constrained problem. Such under-constrained problems can be solved by adding more information. This additional information is provided in the form of a trimap [7] or scribbles [21], i.e., by labeling some pixels as belonging to the definite foreground or definite background. In order to fully extract a meaningful foreground object, almost all matting techniques rely on user intervention, wherein the user segments the input image into three regions: definite foreground, definite background, and unknown region. This three-level map is called a trimap. The matting problem is thereby reduced to determining the values of $F$, $B$, and $\alpha$ for the pixels in the unknown region based on the available information from the definite foreground and definite background regions. Instead of carefully labeling the input image into three regions to generate a trimap, some recently proposed methods rely on the user to provide a few foreground and background scribbles as input to extract a matte. Such methods mark the majority of pixels as unknown region. Ideally, the trimap should consist of a very small unknown region around the foreground boundary, and it should contain only the partial or mixed pixels.
The smaller the unknown region (that is, the fewer the mixed pixels), the more accurate the matte will be. However, generating such an accurate trimap requires a lot of human effort and is often undesirable, particularly in the case of transparent objects. Thus, the accuracy of the trimap is one of the important factors affecting the performance of a matting algorithm [23]. So, while developing a matting algorithm there will always be a trade-off between the accuracy of the matte and the amount of user effort required. Levin et al. proposed the spectral matting algorithm [12], which automatically extracts the matte from the input image without any user intervention. However, the limitation of this method is that it generates erroneous results for images with highly textured backgrounds. Therefore, a user-specified trimap or scribbles are still needed to obtain a highly accurate matte. However, we can reduce the user effort of manually creating the trimap by generating an accurate trimap automatically.
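The compositing model of equation 1 can be illustrated in the forward direction with a minimal sketch. The function name `composite` and the sample colors below are hypothetical and chosen only for illustration; matting is the inverse problem of recovering alpha, foreground, and background from the blended color alone, which is why three equations with seven unknowns need the extra constraints of a trimap.

```python
def composite(alpha, fg, bg):
    """Blend one RGB pixel as alpha * F + (1 - alpha) * B, channel by channel."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Hypothetical colors used only for this illustration.
foreground = (200.0, 100.0, 0.0)
background = (0.0, 100.0, 200.0)

definite_fg = composite(1.0, foreground, background)  # alpha = 1: pure foreground
definite_bg = composite(0.0, foreground, background)  # alpha = 0: pure background
mixed = composite(0.5, foreground, background)        # partial (mixed) pixel
```

A segmentation would force every pixel to one of the first two cases; a matte also represents the third.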
In this paper, we propose a novel method to automatically generate a trimap from the given image. We use the saliency map of the image to generate the trimap. First, we over-segment the image using the SLIC superpixel algorithm [1]. Then we obtain the local features using the Oriented Texture Curves (OTC) feature descriptor [15] for each superpixel in the over-segmented image. These feature vectors are then clustered to obtain the background and foreground superpixels. Then we update the saliency map of the image and threshold it to obtain the binary map. This binary map is then eroded and dilated in order to obtain the desired trimap. The steps involved in the proposed method are depicted in Fig. 1. The main contributions of our paper are given below.

  1. We propose an automatic trimap generation framework for image matting to get rid of any human intervention.

  2. Instead of working on each pixel, we employ superpixels to over-segment the image and process a group of pixels together.

  3. We use image saliency and an appropriate local feature descriptor to identify the foreground and background superpixels which helps in automatic generation of trimap.

The rest of the paper is organized as follows. In section 2, we briefly survey the state-of-the-art matting algorithms as well as existing methods for automatic trimap generation. Section 3 gives the details of the proposed automatic trimap generation algorithm. In section 4, we show and discuss the results of image matting obtained using the trimap generated from our approach. Section 5 concludes the paper with some ideas for future improvement.

Figure 1: Proposed saliency based automatic trimap generation framework.

2 Related Work

In this section we review previous work relevant to ours. In particular, we discuss some of the recent state-of-the-art matting algorithms as well as existing methods for automatic trimap generation. Generally, matting algorithms are classified as sampling based approaches [18, 7, 21] and affinity based approaches [20, 8, 4, 26, 11, 12].

2.1 Sampling based approaches

The basic principle of these approaches is to use neighboring foreground and background pixels as samples to estimate the alpha values for the unknown pixels. Ruzon and Tomasi proposed a sampling based approach [18] for matting. In this approach, alpha values are measured along a manifold connecting the “frontiers” of each object’s color distribution. The unknown region is divided into subregions and a local window is defined in these subregions such that it covers the unknown region, and a local foreground and background region. The optimal alpha is the one that yields an intermediate distribution for which the observed color has maximum probability.

The Bayesian approach proposed by Chuang et al. also uses a probabilistic approach to solve the matting problem [7]. The main difference is that a continuously sliding window is used for selecting the neighborhood, which marches inward from the foreground and background regions. These foreground and background samples are used to build color distributions. The matting problem is formulated in a well-defined Bayesian framework and the maximum a posteriori (MAP) technique is used to solve for the matte.
The previous two methods assume that the unknown region is a narrow band around the foreground boundary and therefore they use local color models, so that there is an ample number of foreground and background pixels within a local window centered on any unknown pixel. This assumption fails if the trimap is not well defined and consists of only a few scribbles. In the case of a rough trimap, global sampling is used to tackle the sampling problem. Wang and Cohen proposed an iterative optimization based matting approach which computes Gaussian Mixture Models (GMMs) from the user-marked foreground and background pixels, then assigns each marked pixel to one Gaussian for further global sampling [21]. The sampling based approaches work well when the trimap is well defined.

2.2 Affinity based approaches

The affinity based approaches do not require explicit foreground and background color information to solve the matting problem. These methods utilize the local image statistics by defining various affinities between neighboring pixels to model the matte gradient across the image instead of directly estimating the alpha value at each pixel. Poisson matting estimates the matte gradient from the image using boundary information from a user-supplied trimap and then reconstructs the matte by solving the Poisson equation [20]. It is based on the assumption that the intensity change in the foreground and the background is smooth. Grady et al. employed the random walk algorithm to calculate the final alpha values based on affinity [8]. Given a trimap, for each pixel in the unknown region, its alpha value is set to be the probability that a random walker starting from this pixel location will reach a pixel in the foreground before striking a pixel in the background, when biased to avoid crossing the foreground boundary. The geodesic matting method measures the weighted geodesic distance from the user-provided scribbles to the pixels in the unknown region (outside of the scribbles) for labeling them as foreground or background pixels [4].
Zheng et al. proposed an interactive matting algorithm called FuzzyMatte, which is similar to geodesic matting [26]. Instead of computing the geodesic distance, this method computes the fuzzy connectedness between the unknown pixel and the known foreground and background pixels. The final alpha value is then calculated using the fuzzy connectedness. The disadvantage of this method is that the fuzzy connectedness is sensitive to image noise, which may lead to the misclassification of pixels in the unknown region. The closed-form matting approach explicitly derives a cost function from local smoothness assumptions on foreground and background colors and shows that in the resulting expression, it is possible to analytically eliminate the foreground and background colors to obtain a quadratic cost function in alpha [11]. This cost function can be minimized by solving a sparse linear system of equations, which yields the globally optimal alpha matte. The affinity used in this approach does not have any global parameters. Instead, it uses local estimates of means and variances, which leads to a significant improvement in performance as demonstrated in [11]. The spectral matting method uses spectral segmentation techniques to obtain a basis set of fuzzy matting components from the smallest eigenvectors of the matting Laplacian. These matting components are used as building blocks to easily construct semantically meaningful foreground mattes [12]. The practical application of this approach is limited as its memory consumption is very high.

2.3 Other approaches

The robust matting method combines color sampling and affinity in a single optimization process to get a more accurate and robust matting solution [22]. It samples the foreground and background colors for unknown pixels and determines the confidence of these samples. The high confidence samples are chosen to contribute to the matting energy function, which is minimized by a random walk. Zheng and Kambhamettu utilized semi-supervised learning to solve the digital matting problem, which results in a local learning based matting approach and a global learning based approach [25]. The local learning based matting approach trains a local alpha-color model for each pixel in the image based only on its neighboring pixels, which are considered to be most related, and is better suited to scribble based matting. The global learning based approach learns the global alpha-color model from some chosen labeled pixels closer to the unlabeled pixel, and is better suited to the case when a trimap is provided and the unknown region is narrow. We use this image matting algorithm to evaluate the effectiveness of the automatic trimaps generated in this work.

Most of the existing automatic trimap generation algorithms rely on the binary segmentation of the image to get the initial boundary of the foreground object [14, 5, 19, 6]. Singh et al. employed Canny edge detection followed by morphological operations (erosion and dilation) to yield the boundary of the foreground [19]. A corrected trimap is then obtained by applying a region growing algorithm to the unknown region of the image obtained by dilating the foreground boundary. Chang-Lin Hsieh et al. proposed an automatic trimap generation method, which generates an initial guess of the trimap from the binary segmented image [5]. They employed a dynamic brush width method to obtain a content adaptive trimap from the initial guess. Ahmad Al-Kabbany and Eric Dubois employed the Gestalt laws of grouping to generate the trimap automatically [3]. Cho Donghyeon et al. utilized depth information and adaptive analysis of the color distribution along the foreground boundary of light field images [6].

In our proposed method, instead of using binary segmentation, we use an over-segmentation algorithm [1] to get the superpixels of the image. We use these superpixels to roughly decide the foreground and background regions of the image. The detailed process of automatically generating the trimap is described in the next section.

3 Automatic Trimap Generation

In this section, we describe in detail our proposed framework for automatically generating the trimap from a given image. We assume that there is a single salient object present in the given scene. The complete framework is divided into three parts: over-segmentation and feature description, identification of background and foreground superpixels, and trimap generation and matting.

3.1 Over-segmentation and Invariant Feature Description

Consider an input image as shown in Fig. 2(a). First, we segment the image into superpixels using the algorithm given in [1]. The resulting over-segmented image is shown in Fig. 2(b). Since each superpixel contains distinct texture and color information, we compute the OTC features [15] for a fixed-size patch (see Fig. 3) in each superpixel. The OTC descriptor captures the texture of a patch along multiple orientations, while maintaining robustness to illumination changes, geometric distortions, and local contrast differences. It provides a 185-dimensional texture feature in eight different directions.
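As a rough illustration of the patch extraction step, the sketch below cuts a square patch around the centroid of one superpixel, given a label map such as the one an over-segmentation produces. Centering the patch on the centroid, the parameter `half`, and the function name `superpixel_patch` are assumptions made for this example; the text does not specify how the patch is positioned, and the OTC descriptor itself is not reproduced here.

```python
def superpixel_patch(image, labels, label, half):
    """Return the (2*half+1)-square patch centred on one superpixel's centroid.

    `image` and `labels` are equally sized lists of rows; the centre is
    clamped so that the patch always stays inside the image.
    """
    height, width = len(image), len(image[0])
    side = 2 * half + 1
    if side > height or side > width:
        raise ValueError("patch is larger than the image")
    coords = [(r, c) for r, row in enumerate(labels)
              for c, value in enumerate(row) if value == label]
    if not coords:
        raise ValueError("label not present in the label map")
    # Integer centroid of the superpixel (floor division keeps it deterministic).
    centre_r = sum(r for r, _ in coords) // len(coords)
    centre_c = sum(c for _, c in coords) // len(coords)
    centre_r = min(max(centre_r, half), height - 1 - half)
    centre_c = min(max(centre_c, half), width - 1 - half)
    return [row[centre_c - half:centre_c + half + 1]
            for row in image[centre_r - half:centre_r + half + 1]]
```

Each such patch would then be passed to the feature descriptor to obtain one feature vector per superpixel.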
We obtain the saliency maps $SM_1$, $SM_2$, and $SM_3$ of the input image using three different methods [10, 13, 24]. Each of these methods uses a different framework to obtain the saliency map. In [10], Huaizu Jiang et al. employed a supervised learning approach to integrate regional features such as the regional contrast, regional property, and regional backgroundness descriptors to form the master saliency map. In [13], the image is segmented to obtain a set of object candidates and then a fixation algorithm is used to rank different regions based on their saliency score. In [24], Rui Zhao et al. utilized global context and local context models to obtain a multi-context saliency model using deep convolutional neural networks (CNN). The saliency maps obtained from these three methods are then combined to get a single saliency map $SM$ (see Fig. 2(c)) as given in equation 2.

$$SM = w_1\,SM_1 + w_2\,SM_2 + w_3\,SM_3 \qquad (2)$$

where $w_1$, $w_2$, and $w_3$ are constants. We choose the same value for $w_1$, $w_2$, and $w_3$ in this work.
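The combination of the three saliency maps can be sketched as a per-pixel weighted sum. In this sketch the weight names and the default of equal weights summing to one are assumptions; the text only states that the same constant is used for all three maps. The function name `combine_saliency` is likewise hypothetical.

```python
def combine_saliency(maps, weights=None):
    """Per-pixel weighted sum of equally sized saliency maps (lists of rows)."""
    if weights is None:
        # Assumption: equal weights that sum to one.
        weights = [1.0 / len(maps)] * len(maps)
    if len(weights) != len(maps):
        raise ValueError("one weight per saliency map is required")
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, maps))
             for c in range(cols)]
            for r in range(rows)]
```

With equal weights the result is simply the pixel-wise average, so a region has to be salient under most of the detectors to remain salient in the combined map.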

3.2 Identification of Background and Foreground Superpixels

We use the saliency map SM to classify superpixels into salient and non-salient superpixels. For each superpixel, we obtain the median value in the saliency map. If this median value is greater than a saliency threshold, then that superpixel is classified as a salient superpixel. Otherwise, it is classified as a non-salient superpixel. Initially, we consider the salient superpixels as foreground superpixels and the non-salient superpixels as background superpixels. It may happen that some salient superpixels belong to the background and some non-salient superpixels belong to the foreground. To alleviate this problem, we cluster the OTC features of the superpixels classified as foreground into five different clusters using $k$-means clustering. Similarly, we cluster the OTC features of the superpixels classified as background into five different clusters using $k$-means clustering.
For each superpixel that was initially classified as foreground, we compute the Euclidean distance between that superpixel and the cluster centers of the background superpixels. If this distance is less than a distance threshold, then that superpixel is identified as a background superpixel. The same process is repeated for the superpixels that were initially classified as background, using the cluster centers estimated from the foreground superpixels, to identify more foreground superpixels. The separated foreground and background superpixels are shown in Fig. 2(d, e). Based on this information we modify the saliency map SM so that only the foreground region has salient values. Finally, we get the modified saliency map as shown in Fig. 2(f).
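The two-stage labeling described above can be sketched as follows, under some stated assumptions. The cluster centers are taken as given (the k-means step is not shown), the distance to "the cluster centers" is interpreted as the distance to the nearest center, and both thresholds are passed in as parameters. All function names are hypothetical.

```python
from math import dist
from statistics import median


def classify_superpixels(saliency, labels, threshold):
    """Mark a superpixel salient (True) if its median saliency exceeds the threshold."""
    values = {}
    for sal_row, lab_row in zip(saliency, labels):
        for s, lab in zip(sal_row, lab_row):
            values.setdefault(lab, []).append(s)
    return {lab: median(v) > threshold for lab, v in values.items()}


def nearest_center_distance(feature, centers):
    """Euclidean distance from a feature vector to its closest cluster center."""
    return min(dist(feature, c) for c in centers)


def reassign(features, is_foreground, fg_centers, bg_centers, threshold):
    """Flip superpixels whose feature lies within `threshold` of the opposite class."""
    updated = {}
    for lab, fg in is_foreground.items():
        opposite = bg_centers if fg else fg_centers
        close = nearest_center_distance(features[lab], opposite) < threshold
        updated[lab] = (not fg) if close else fg
    return updated
```

The first function gives the initial saliency-based split; the second corrects superpixels whose texture feature resembles the other class more than their own saliency suggests.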

Figure 2: Intermediate results: (a) Input image, (b) Over-segmented image, (c) Saliency map, (d) Foreground superpixels, (e) Background superpixels, (f) Modified saliency map, (g) Binarized saliency map, (h) Eroded saliency map, (i) Dilated saliency map, (j) Difference, (k) Trimap, (l) Estimated matte.

3.3 Trimap Generation and Matting

To generate the trimap, we need a binarized saliency map. The modified saliency map is binarized using Otsu’s thresholding method [16] as shown in Fig. 2(g). The binarized saliency map is then eroded and dilated to get the eroded map $S_e$ and the dilated map $S_d$ as shown in Fig. 2(h, i). We use disk structuring elements of radius 5 and 10 for the erosion and dilation respectively. The eroded map is subtracted from the dilated map to get the unknown region of the trimap as given in equation 3.

$$S_{\mathrm{diff}} = S_d - S_e \qquad (3)$$

The obtained difference map $S_{\mathrm{diff}}$ is multiplied with a constant $C$, where $0 < C < 1$ (see Fig. 2(j)). This scaled difference map is then added to the eroded saliency map $S_e$, which results in a trimap (TM) as shown in Fig. 2(k). This process is expressed in equation 4.

$$TM = S_e + C \cdot S_{\mathrm{diff}} \qquad (4)$$
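The binarization and morphology steps of this subsection can be sketched in plain Python. This is an illustrative sketch rather than the authors' MATLAB implementation: the function names are hypothetical, the Otsu routine works on integer gray levels, and pixels outside the image are treated as background during erosion and dilation. The default radii and the scale of 0.65 follow the values reported in the text.

```python
def otsu_threshold(pixels, levels=256):
    """Grey level t maximising Otsu's between-class variance (values <= t are class 0)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean_gap = sum0 / w0 - (sum_all - sum0) / w1
        variance = w0 * w1 * mean_gap * mean_gap
        if variance > best_var:
            best_t, best_var = t, variance
    return best_t


def morph(binary, radius, erode):
    """Erode (erode=True) or dilate a 0/1 map with a disk of the given radius."""
    h, w = len(binary), len(binary[0])
    offsets = [(dr, dc) for dr in range(-radius, radius + 1)
               for dc in range(-radius, radius + 1)
               if dr * dr + dc * dc <= radius * radius]

    def at(r, c):
        # Pixels outside the image count as background (an assumption).
        return binary[r][c] if 0 <= r < h and 0 <= c < w else 0

    combine = all if erode else any
    return [[int(combine(at(r + dr, c + dc) for dr, dc in offsets))
             for c in range(w)] for r in range(h)]


def make_trimap(binary, erode_radius=5, dilate_radius=10, scale=0.65):
    """Trimap = eroded + scale * (dilated - eroded): 1 foreground, scale unknown, 0 background."""
    eroded = morph(binary, erode_radius, erode=True)
    dilated = morph(binary, dilate_radius, erode=False)
    return [[e + scale * (d - e) for e, d in zip(e_row, d_row)]
            for e_row, d_row in zip(eroded, dilated)]
```

Because the eroded map is contained in the dilated map, the difference is nonzero only in a band around the object boundary, and that band becomes the unknown region of the trimap.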

Figure 3: Patch extraction from superpixels: a patch is extracted from each superpixel to obtain the OTC features.

We use the learning based matting technique [25] to obtain the alpha matte for the input image from the trimap generated by our proposed framework. The estimated alpha matte is depicted in Fig. 2(l).

4 Results and Discussion

In this section, we present and discuss the results obtained by our proposed framework. We test our proposed method on a number of images obtained from FT [2] and PASCAL-S [13] datasets. We compare the trimaps generated by the proposed framework with the manually created trimaps.

Our method works well for images where the background is natural, as can be seen in Fig. 4. The first column shows the input images, the second column depicts the manually created trimaps, and the third column shows the trimaps generated by our proposed approach. The mattes corresponding to these two sets of trimaps are shown in the fourth and fifth columns respectively. We employ the matting algorithm proposed in [25].

The first row of Fig. 4 shows the results for an image which consists of a foreground object (post office box) and a natural background. Here, we can notice that the automatically generated trimap is quite similar to the manually created trimap, thereby leading to accurate matte estimation. A similar observation can be made for the images shown in the second, fourth, fifth, and seventh rows. For the image shown in the third row, there is a small difference between the automatically generated trimap and the manually created trimap. Some part of the foreground is marked as unknown in the automatically generated trimap, whereas it is marked as definite foreground in the manually created trimap. However, the matting algorithm takes care of it and we get approximately similar mattes from both trimaps. In the sixth row, we can notice that the trimap obtained using the proposed approach marks the unknown region (foreground boundary) more accurately than the manually generated trimap.

The results illustrated in Fig. 4 demonstrate that the automatically generated trimap is as accurate as the manually created trimap for generating the mattes. To validate our claim, we compute the sum of squared differences (SSD) between the mattes generated using the two different trimaps, i.e., the trimap from the proposed approach and the manually created trimap. We observed that the SSD values for the images in the first to the seventh rows are very small. The proposed method has some limitations, which can be observed for images in which the background is synthetically generated. If there is an ambiguity between the foreground and background colors, then the proposed method might lead to some errors in the trimap.
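The SSD comparison between two mattes can be sketched as below. The function name `ssd` is hypothetical, and the sketch assumes both mattes have the same size and are not normalized by the number of pixels, since the text does not state a normalization.

```python
def ssd(matte_a, matte_b):
    """Sum of squared differences between two equally sized mattes (lists of rows)."""
    if len(matte_a) != len(matte_b) or any(
            len(ra) != len(rb) for ra, rb in zip(matte_a, matte_b)):
        raise ValueError("mattes must have the same shape")
    return sum((a - b) ** 2
               for ra, rb in zip(matte_a, matte_b)
               for a, b in zip(ra, rb))
```

A value close to zero indicates that the automatically generated trimap led to nearly the same matte as the manual one.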

We implemented this framework in MATLAB on a PC with an Intel i5-4460s 2.9 GHz processor and 12 GB RAM. For segmenting the image into superpixels, we set the number of superpixels in the range of 250 to 400. The saliency threshold is set relative to the highest salient value in the saliency map. The distance threshold is set equal to the mean of the distances between the OTC feature vectors of superpixels belonging to the foreground (or background) and the cluster centers of the background (or foreground) superpixels. The constant C is chosen to be equal to 0.65. Our proposed method takes only a few seconds to generate the trimap for any given image, thereby automating the entire image matting process.

Figure 4: (a) Input image, (b) Trimap (manually generated), (c) Trimap (using the proposed approach), (d) Matte obtained from (b) (using [25]), (e) Matte obtained from (c) (using [25]).

5 Conclusion

Image matting is an important process for accurately separating the foreground object from the background in image and video editing applications. This task is ill-posed and therefore poses a significant challenge for computational photography. From the literature review, we note that almost all matting algorithms require user intervention in the form of a trimap or scribbles as input. The performance of these algorithms depends on these user inputs. Also, manually generating a trimap consumes a lot of time. To alleviate this problem and make the whole matting process automatic, we have proposed a simple and efficient framework for automatically generating the trimap for a given input image. The experimental results demonstrate that the automatically generated trimaps are very close to the manually created trimaps, which results in accurate matte estimation.

We believe that the automation of the entire matting process will be adopted by researchers and practitioners soon. However, there could be images in which no distinct salient object is present. In such a scenario, generating the matte automatically is a challenge to be addressed in the future. We would like to extend the proposed approach to processing videos. An automatic matte generation module is highly desirable for a variety of computational photography tasks faced by students, researchers, artists, and compositors. We would like to make the approach robust enough to serve as a vital tool for studios generating augmented reality effects in movies. Another future challenge is to automatically extract mattes corresponding to multiple foreground objects from a background.


  • [1] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Susstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274–2282 (2012)
  • [2] Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: IEEE CVPR, 2009.
  • [3] Al-Kabbany, A., Dubois, E.: A novel framework for automatic trimap generation using the Gestalt laws of grouping. Proc. SPIE 9410, Visual Information Processing and Communication VI (2015)
  • [4] Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: IEEE ICCV, 2007.
  • [5] Hsieh, C.L., Lee, M.S.: Automatic trimap generation for digital image matting. In: IEEE Signal and Information Processing Association Annual Summit and Conference. pp. 1–5 (2013)
  • [6] Cho, D., Kim, S., Tai, Y.W., Kweon, I.S.: Automatic trimap generation and consistent matting for light-field images. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
  • [7] Chuang, Y.Y., Curless, B., Salesin, D.H., Szeliski, R.: A Bayesian approach to digital matting. In: IEEE CVPR, 2001.
  • [8] Grady, L., Schiwietz, T., Aharon, S., Westermann, R.: Random walks for interactive alpha-matting. In: Proceedings of VIIP, 2005.
  • [9] Gupta, V., Raman, S.: Automatic trimap generation for image matting. In: Signal and Information Processing (IConSIP), International Conference on. pp. 1–5. IEEE (2016)
  • [10] Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., Li, S.: Salient object detection: A discriminative regional feature integration approach. In: IEEE CVPR, 2013.
  • [11] Levin, A., Lischinski, D., Weiss, Y.: A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 228–242 (2008)
  • [12] Levin, A., Rav-Acha, A., Lischinski, D.: Spectral matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1699–1712 (2008)
  • [13] Li, Y., Hou, X., Koch, C., Rehg, J., Yuille, A.: The secrets of salient object segmentation. In: IEEE CVPR, 2014.
  • [14] Li, H., Ngan, K.N., Liu, Q.: FaceSeg: automatic face segmentation for real-time video. IEEE Transactions on Multimedia 11(1), 77–88 (2009)
  • [15] Margolin, R., Zelnik-Manor, L., Tal, A.: OTC: A novel local descriptor for scene classification. In: Springer ECCV, 2014.
  • [16] Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11(285-296), 23–27 (1975)
  • [17] Porter, T., Duff, T.: Compositing digital images. In: ACM Siggraph Computer Graphics. vol. 18, pp. 253–259 (1984)
  • [18] Ruzon, M.A., Tomasi, C.: Alpha estimation in natural images. In: IEEE CVPR, 2000.
  • [19] Singh, S., Jalal, A.S., Bhatanagar, C.: Automatic trimap and alpha-matte generation for digital image matting. In: IEEE International Conference on Contemporary Computing. pp. 202–208 (2013)
  • [20] Sun, J., Jia, J., Tang, C.K., Shum, H.Y.: Poisson matting. ACM Transactions on Graphics, 23(3), 315–321 (2004)
  • [21] Wang, J., Cohen, M.F.: An iterative optimization approach for unified image segmentation and matting. In: IEEE ICCV, 2005.
  • [22] Wang, J., Cohen, M.F.: Optimized color sampling for robust matting. In: IEEE CVPR, 2007.
  • [23] Wang, J., Cohen, M.F.: Image and video matting: A survey. Foundations and Trends in Computer Graphics and Vision 3(2), 97–175 (2007)
  • [24] Zhao, R., Ouyang, W., Li, H., Wang, X.: Saliency detection by multi-context deep learning. In: IEEE CVPR, 2015.
  • [25] Zheng, Y., Kambhamettu, C.: Learning based digital matting. In: IEEE ICCV, 2009.
  • [26] Zheng, Y., Kambhamettu, C., Yu, J., Bauer, T., Steiner, K.: FuzzyMatte: A computationally efficient scheme for interactive matting. In: IEEE CVPR, 2008.