Semantic Human Matting

09/05/2018, by Quan Chen, et al.

Human matting, the high quality extraction of humans from natural images, is crucial for a wide variety of applications. Since the matting problem is severely under-constrained, most previous methods require user interactions, taking user-designated trimaps or scribbles as constraints. This user-in-the-loop nature makes them difficult to apply to large scale data or time-sensitive scenarios. In this paper, instead of using explicit user input constraints, we employ implicit semantic constraints learned from data and propose an automatic human matting algorithm, Semantic Human Matting (SHM). SHM is the first algorithm that learns to jointly fit both semantic information and high quality details with deep networks. In practice, simultaneously learning both coarse semantics and fine details is challenging. We propose a novel fusion strategy which naturally gives a probabilistic estimation of the alpha matte. We also construct a very large dataset with high quality annotations consisting of 35,513 unique foregrounds to facilitate the learning and evaluation of human matting. Extensive experiments on this dataset and plenty of real images show that SHM achieves comparable results with state-of-the-art interactive matting methods.


1. Introduction

Human matting, which aims at extracting humans from natural images with high quality, has a wide variety of applications, such as mixed reality, smart creative composition, live streaming, film production, etc. For example, in an e-commerce website, smart creative composition provides personalized creative images to customers. This requires extracting fashion models from a huge number of original images and re-compositing them with new creative designs. In such a scenario, due to the huge volume of images to be processed and in pursuit of a better customer experience, it is critical to have an automatic high quality extraction method. Fig. 1 gives an example of smart creative composition with automatic human matting in a real-world e-commerce website.

Designing such an automatic method is not a trivial task. One may think of turning to either semantic segmentation or image matting techniques. However, neither of them can be used by itself to reach a satisfactory solution. On the one hand, semantic segmentation, which directly identifies the object category of each pixel, usually focuses on coarse semantics and is prone to blurring structural details. On the other hand, image matting, widely adopted for fine detail extraction, usually requires user interactions and therefore is not suitable in data-intensive or time-sensitive scenarios such as smart creative composition. More specifically, for an input image $I$, matting is formulated as a decomposition into foreground $F$, background $B$ and alpha matte $\alpha$ with a linear blend assumption:

$$I = \alpha F + (1 - \alpha) B \qquad (1)$$

where for color images there are 7 unknown variables per pixel but only 3 known variables, and thus this decomposition is severely under-constrained (Levin et al., 2008). Therefore, most matting algorithms (Levin et al., 2008; Chen et al., 2013; Aksoy et al., 2017; Xu et al., 2017) need to take user designated trimaps or scribbles as extra constraints.
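As a quick illustration of Eq. 1 and this counting of unknowns, the following minimal NumPy sketch composites a toy image; the array names are ours and purely illustrative.

```python
import numpy as np

# Toy per-pixel quantities for a 4x4 RGB image.
H, W = 4, 4
F = np.random.rand(H, W, 3)        # foreground colors: 3 unknowns per pixel
B = np.random.rand(H, W, 3)        # background colors: 3 unknowns per pixel
alpha = np.random.rand(H, W, 1)    # opacity in [0, 1]: 1 unknown per pixel

# Eq. 1: the observed image is a linear blend of foreground and background.
I = alpha * F + (1.0 - alpha) * B  # only these 3 channels per pixel are observed
```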


Figure 1. Semantic Human Matting (SHM) and its applications. SHM takes a natural image (top left) as input and outputs the corresponding alpha matte (bottom left). The predicted alpha matte can be applied to background editing (top right) and smart creative composition (bottom right).

In this paper, we propose a unified method, Semantic Human Matting (SHM), which integrates a semantic segmentation module with a deep learning based matting module to automatically extract the alpha matte of humans. The learned semantic information distinguishing foreground from background is employed as an implicit constraint for a deep matting network, which complements its capability of detail extraction. A straightforward way to implement such a method is to train these two modules separately and feed the segmentation results as trimaps into the matting network. However, this intuitive approach does not work well (Shen et al., 2016). The reason is that semantic segmentation aims at classifying each pixel and can roughly distinguish humans from the background, whereas the goal of matting is to assign each pixel a more fine-grained, floating-point opacity value of the foreground without determining the semantics. They are responsible for recovering coarse segmentations and fine details respectively, and therefore they need to be carefully handled in order to cooperate properly towards high quality human matting. Shen et al. (Shen et al., 2016) use a closed form matting (Levin et al., 2008) layer through which the semantic information can directly propagate and constitute the final result. But with deep learning based matting, the matting module is highly nonlinear and trained to focus on structural patterns of details, so the semantic information from the input is hardly retained. To combine the coarse semantics and fine matting details exquisitely, we propose a novel fusion strategy which naturally gives a probabilistic estimation of the alpha matte. It can be viewed as an adaptive ensemble of high and low level results on each pixel. Further, with this strategy, the whole network automatically apportions the final training error between the coarse and the fine components, and thus can be trained in an end-to-end fashion.

We also constructed a very large dataset with high quality annotations for the human matting task. Since annotating details is difficult and time-consuming, high quality datasets for human matting are valuable and scarce. The popular alphamatting.com dataset (Rhemann et al., 2009) has made significant contributions to matting research; unfortunately it only consists of 27 training images and 8 testing images. Shen et al. (Shen et al., 2016) created a dataset of 2,000 images, but it only contains portrait images. Besides, the groundtruth images of this dataset are generated with closed form matting (Levin et al., 2008) and KNN matting (Chen et al., 2013) and therefore can be potentially biased. Recently, Xu et al. (Xu et al., 2017) built a large high quality matting dataset with 202 distinct human foregrounds. To increase the volume and diversity of human matting data and thus benefit the learning and evaluation of human matting, we collected another 35,311 distinct human images with fine matte annotations. All human foregrounds are composited with different backgrounds and the final dataset includes 52,511 images for training and 1,400 images for testing. More details of this dataset are discussed in Section 3.

Extensive experiments are conducted on this dataset to empirically evaluate the effectiveness of our method. Under the commonly used matting performance metrics, our method achieves results comparable with the state-of-the-art interactive matting methods (Levin et al., 2008; Chen et al., 2013; Aksoy et al., 2017; Xu et al., 2017). Moreover, we demonstrate that our learned model generalizes to real images by evaluating it on plenty of natural human images crawled from the Internet.

To summarize, the main contributions of our work are three-fold:

1. To the best of our knowledge, SHM is the first automatic matting algorithm that learns to jointly fit both semantic information and high quality details with deep networks. Empirical studies show that SHM achieves comparable results with the state-of-the-art interactive matting methods.

2. A novel fusion strategy, which naturally gives a probabilistic estimation of the alpha matte, is proposed to make the entire network cooperate properly. It adaptively ensembles the coarse semantic and fine detail results at each pixel, which is crucial for enabling end-to-end training.

3. A large scale high quality human matting dataset is created. It contains 35,513 unique human images with corresponding alpha mattes. The dataset not only enables effective training of the deep network in SHM but also contributes with its volume and diversity to the human matting research.

2. Related works

In this section, we review the semantic segmentation and image matting methods that are most related to our work.

Since Long et al. (Long et al., 2015) used a Fully Convolutional Network (FCN) to densely predict pixel level labels and improved segmentation accuracy by a large margin, FCN has become the main framework for semantic segmentation, and various techniques have been proposed to improve its performance. Yu et al. (Yu and Koltun, 2015) propose dilated convolutions to increase the receptive field of the network without decreasing spatial resolution, which is demonstrated to be effective for pixel level prediction. Chen et al. (Chen et al., 2016) add fully connected CRFs on top of the network as post-processing to alleviate the "hole" phenomenon of FCN. In PSPNet (Zhao et al., 2017), a pyramid pooling module is proposed to acquire a global contextual prior. Peng et al. (Peng et al., 2017) state that using large convolutional kernels and a boundary refinement block can improve pixel level classification accuracy while maintaining precise localization capacity. With the above improvements, FCN based models trained on large scale segmentation datasets, such as VOC (Everingham et al., [n. d.]) and COCO (Lin et al., 2014), have achieved top performance in semantic segmentation. However, these models cannot be directly applied to semantic human matting for the following reasons. 1) The annotations of current segmentation datasets are relatively "coarse" and "hard" for the matting task: models trained on these datasets do not satisfy the accuracy requirements of pixel level localization and floating-point alpha values for matting. 2) Pixel level classification accuracy is the only consideration in network architecture and loss design for semantic segmentation, which makes the models prone to blurring the complex structural details that are crucial for matting performance.

In the past decades, researchers have developed a variety of general matting methods for natural images. Most methods predict the alpha matte through sampling (Chuang et al., 2001; Wang and Cohen, 2007; Gastal and Oliveira, 2010; He et al., 2011; Shahrian et al., 2013) or propagation (Sun et al., 2004; Grady et al., 2005; Levin et al., 2008; Chen et al., 2013; Aksoy et al., 2017) on color or low-level features. With the rise of deep learning in the computer vision community, several CNN based methods (Cho et al., 2016; Xu et al., 2017) have been proposed for general image matting. Cho et al. (Cho et al., 2016) design a convolutional neural network to reconstruct the alpha matte, taking the results of closed form matting (Levin et al., 2008) and KNN matting (Chen et al., 2013) along with the normalized RGB color image as inputs. Xu et al. (Xu et al., 2017) directly predict the alpha matte with a pure encoder-decoder network which takes the RGB image and a trimap as inputs, achieving state-of-the-art results. However, all the above general image matting methods need scribbles or trimaps obtained from user interactions as constraints, and so they cannot be applied in an automatic way.

Recently, several works (Shen et al., 2016; Zhu et al., 2017) have been proposed to build an automatic matting system. Shen et al. (Shen et al., 2016) combine closed form matting (Levin et al., 2008) with a CNN to automatically obtain the alpha mattes of portrait images and back-propagate the errors to the deep convolutional network. Zhu et al. (Zhu et al., 2017) follow a similar pipeline while designing a smaller network and a fast filter similar to the guided filter (He et al., 2010) for matting, in order to deploy the model on mobile phones. Although our method and the above two works all use CNNs to learn semantic information instead of relying on manual trimaps to automate the matting process, our method is quite different from theirs:

Figure 2. Composited images and corresponding alpha mattes in our dataset. The first three columns come from the general matting dataset created by Xu et al. (Xu et al., 2017) and the last three columns come from model images we collected from an e-commerce website. It is worth noting that the images shown here are all resized to the same height.

1) Both of the above methods use traditional methods as the matting module, which compute the alpha matte by solving the matting equation (Eq. 1) and may introduce artifacts when the color distributions of foreground and background overlap (Xu et al., 2017). We employ an FCN as the matting module so as to directly learn complex details in a wide context, which has been shown to be much more robust (Xu et al., 2017). 2) By solving the matting equation, these methods let the input constraints directly affect the final prediction and thus propagate back the errors. However, when a deep matting network is adopted, the cooperation of coarse semantics and fine details must be explicitly handled. Thus a novel fusion strategy is proposed, which enables end-to-end training of the entire network.

3. Human matting dataset

| Data Source | Train Set: Foreground | Train Set: Image | Test Set: Foreground | Test Set: Image |
|---|---|---|---|---|
| DIM (Xu et al., 2017) | 182 | 18,200 | 20 | 400 |
| Model | 34,311 | 34,311 | 1,000 | 1,000 |
| Total | 34,493 | 52,511 | 1,020 | 1,400 |

Table 1. Configuration of our human matting dataset.

As a task newly defined in this paper, semantic human matting first encounters the lack of data. To address this, we create a large scale high quality human matting dataset. The foregrounds in this dataset are humans with some accessories (e.g., cellphones, handbags), and each foreground is associated with a carefully annotated alpha matte. Following Xu et al. (Xu et al., 2017), the foregrounds are composited onto new backgrounds to create a human matting dataset with 52,511 images in total. Some sample images from our dataset are shown in Fig. 2.

In detail, the foregrounds and corresponding alpha matte images in our dataset comprise:

  • Fashion Model dataset. More than 188k fashion model images were collected from an e-commerce website, whose alpha mattes were annotated by sellers to commercial quality standards. Volunteers were recruited to carefully inspect and double-check the mattes and to remove those with even small flaws. It took almost 1,200 hours to select 35,311 images out of them. The low pass rate (18.88%) guarantees the high standard of the alpha mattes in our dataset.

  • Deep Image Matting (DIM) dataset (Xu et al., 2017). We also select all the images that contain only humans from the DIM dataset, resulting in 202 foregrounds.

The background images are from the COCO dataset and the Internet. We ensure that background images do not contain humans. The foregrounds are split into train/test sets; the configuration is shown in Table 1. Following (Xu et al., 2017), each foreground is composited with N backgrounds. For foregrounds from the Fashion Model dataset, due to their large number, N is set to 1 for both the training and testing sets. For foregrounds from the DIM dataset, N is set to 100 for the training set and 20 for the testing set, as in (Xu et al., 2017). All background images are randomly selected and unique.
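As a rough sketch of this compositing procedure (Eq. 1 applied to an annotated foreground, its matte and a randomly chosen background), the snippet below may help; the file paths, helper name and the background resizing step are our own assumptions rather than the exact pipeline used here.

```python
import numpy as np
from PIL import Image

def composite(fg_path: str, alpha_path: str, bg_path: str):
    """Blend an annotated foreground onto a new background using its alpha matte."""
    fg_img = Image.open(fg_path).convert("RGB")
    fg = np.asarray(fg_img, dtype=np.float32) / 255.0
    alpha = np.asarray(Image.open(alpha_path).convert("L"), dtype=np.float32) / 255.0
    bg_img = Image.open(bg_path).convert("RGB").resize(fg_img.size)
    bg = np.asarray(bg_img, dtype=np.float32) / 255.0

    a = alpha[..., None]                     # (H, W, 1) for broadcasting
    comp = a * fg + (1.0 - a) * bg           # Eq. 1
    return (comp * 255).astype(np.uint8), (alpha * 255).astype(np.uint8)
```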

| Datasets | Foreground | Image | Annotation |
|---|---|---|---|
| alpha matting (Rhemann et al., 2009) | 35 Objects | 35 | Manually |
| Shen et al. (Shen et al., 2016) | 2,000 Portraits | 2,000 | CF (Levin et al., 2008), KNN (Chen et al., 2013) |
| DIM (Xu et al., 2017) | 493 Objects | 49,300 | Manually |
| Our dataset | 35,513 Humans | 52,511 | Manually |

Table 2. The properties of the existing matting datasets.
Figure 3. Overview of our semantic human matting method. Given an input image, a T-Net, which is implemented as PSPNet-50, is used to predict the 3-channel trimap. The predicted trimap is then concatenated with the original image and fed into the M-Net to predict the raw alpha matte. Finally, both the predicted trimap and raw alpha matte are fed into the Fusion Module to generate the final alpha matte according to Eq. 4. The entire network is trained in an end-to-end fashion.

Table 2 compares the basic properties of existing matting datasets and ours. Compared with previous matting datasets, our dataset differs in the following aspects: 1) The existing matting datasets contain hundreds of foreground objects, while our dataset contains 35,513 different foregrounds, which is much larger than the others; 2) In order to deal with the human matting task, foregrounds containing human bodies are needed. However, the DIM (Xu et al., 2017) dataset only contains 202 human objects, and the dataset proposed by Shen et al. (Shen et al., 2016) consists of portraits, which are limited to heads and parts of shoulders. In contrast, our dataset has a larger diversity that covers the whole human body, i.e., head, arms, legs, etc., in various poses, which is essential for human matting; 3) Unlike the dataset of Shen et al. (Shen et al., 2016), which is annotated by Closed Form (Levin et al., 2008) and KNN (Chen et al., 2013) matting and therefore can be potentially biased, all 35,513 foreground objects in our dataset are manually annotated and carefully inspected, which guarantees high quality alpha mattes and ensures semantic integrity and uniqueness. The dataset not only enables effective training of the deep network in SHM but also contributes with its volume and diversity to human matting research.

4. Our method

Our SHM is designed to automatically pull the alpha matte of a specific semantic pattern: humans. Fig. 3 shows its pipeline. SHM takes an image (usually 3 channels representing RGB) as input and directly outputs a 1-channel alpha matte image of the same size as the input. Note that no auxiliary information (e.g., trimaps or scribbles) is required.

SHM aims to simultaneously capture both coarse semantic classification information and fine matting details. We design two subnetworks to separately handle these two tasks. The first, named T-Net, is responsible for pixel-wise classification among foreground, background and unknown regions; the second, named M-Net, takes the output of T-Net as a semantic hint and describes the details by generating a raw alpha matte image. The outputs of T-Net and M-Net are fused by a novel Fusion Module to generate the final alpha matte. The whole network is trained jointly in an end-to-end manner. We describe these submodules in detail in the following subsections.
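A rough PyTorch sketch of this forward pass is given below; it assumes t_net produces 3-channel trimap logits ordered (background, unknown, foreground), that m_net produces a 1-channel raw matte, and that the softmax probabilities (rather than raw logits) are concatenated with the image, none of which is specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SHM(nn.Module):
    """Wires T-Net, M-Net and the Fusion Module (Section 4.3) together."""

    def __init__(self, t_net: nn.Module, m_net: nn.Module):
        super().__init__()
        self.t_net = t_net
        self.m_net = m_net

    def forward(self, image):                          # image: (N, 3, H, W)
        trimap_logits = self.t_net(image)              # (N, 3, H, W)
        probs = F.softmax(trimap_logits, dim=1)        # background / unknown / foreground
        alpha_r = self.m_net(torch.cat([image, probs], dim=1))   # 6-channel M-Net input
        bg, unknown, fg = probs[:, 0:1], probs[:, 1:2], probs[:, 2:3]
        alpha_p = fg + unknown * alpha_r               # Fusion Module, Eq. 4
        return trimap_logits, alpha_r, alpha_p
```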

4.1. Trimap generation: T-Net

The T-Net plays the role of semantic segmentation in our task and roughly extracts the foreground region. Specifically, we follow the traditional trimap concept and define a 3-class segmentation: foreground, background and unknown region. Therefore, the output of T-Net is a 3-channel map indicating the probability that each pixel belongs to each of the 3 classes. In general, T-Net can be implemented as any state-of-the-art semantic segmentation network (Long et al., 2015; Yu and Koltun, 2015; Chen et al., 2016; Zhao et al., 2017; Peng et al., 2017). In this paper, we choose PSPNet-50 (Zhao et al., 2017) for its efficacy and efficiency.

4.2. Matting network: M-Net

Similar to the general matting task (Xu et al., 2017), the M-Net aims to capture detail information and generate the alpha matte. The M-Net takes the concatenation of the 3-channel image and the 3-channel segmentation result from T-Net as a 6-channel input. Note that this differs from DIM (Xu et al., 2017), which uses the 3-channel image plus a 1-channel trimap (with 1, 0.5, 0 indicating foreground, unknown region and background respectively) as a 4-channel input. We use a 6-channel input since it conveniently fits the output of T-Net, and we empirically find that 6-channel and 4-channel inputs have nearly equal performance.

As shown in Fig. 3, the M-Net is a deep convolutional encoder-decoder network. The encoder network has 13 convolutional layers and 4 max-pooling layers, while the decoder network has 6 convolutional layers and 4 unpooling layers. The hyper-parameters of the encoder network are the same as the convolutional layers of the VGG16 classification network, except that the "conv1" layer in VGG16 has 3 input channels whereas ours has 6. The structure of M-Net differs from DIM (Xu et al., 2017) in the following aspects: 1) M-Net has 6-channel instead of 4-channel inputs; 2) Batch Normalization is added after each convolutional layer to accelerate convergence; 3) the "conv6" and "deconv6" layers are removed, since these layers have a large number of parameters and are prone to overfitting.
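The following PyTorch sketch is one way to realize an M-Net under the constraints just listed (a 6-channel first convolution, 13 VGG16-style encoder convolutions with Batch Normalization, 4 max-pooling/unpooling pairs, 6 decoder convolutions); the decoder channel widths and the final sigmoid are assumptions on our part.

```python
import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch):
    """3x3 convolution followed by Batch Normalization and ReLU."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class MNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 13 VGG16-style conv layers, first conv adapted to 6 input channels.
        self.enc1 = nn.Sequential(conv_bn(6, 64), conv_bn(64, 64))
        self.enc2 = nn.Sequential(conv_bn(64, 128), conv_bn(128, 128))
        self.enc3 = nn.Sequential(conv_bn(128, 256), conv_bn(256, 256), conv_bn(256, 256))
        self.enc4 = nn.Sequential(conv_bn(256, 512), conv_bn(512, 512), conv_bn(512, 512))
        self.enc5 = nn.Sequential(conv_bn(512, 512), conv_bn(512, 512), conv_bn(512, 512))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        # Decoder: 6 conv layers interleaved with 4 unpooling layers.
        self.dec5 = conv_bn(512, 512)
        self.dec4 = conv_bn(512, 256)
        self.dec3 = conv_bn(256, 128)
        self.dec2 = conv_bn(128, 64)
        self.dec1 = conv_bn(64, 64)
        self.head = nn.Conv2d(64, 1, 3, padding=1)   # 1-channel raw alpha matte

    def forward(self, x):
        # Input spatial size is assumed divisible by 16 (e.g., 320x320).
        x = self.enc1(x); x, i1 = self.pool(x)
        x = self.enc2(x); x, i2 = self.pool(x)
        x = self.enc3(x); x, i3 = self.pool(x)
        x = self.enc4(x); x, i4 = self.pool(x)
        x = self.enc5(x)
        x = self.dec5(x)
        x = self.unpool(x, i4); x = self.dec4(x)
        x = self.unpool(x, i3); x = self.dec3(x)
        x = self.unpool(x, i2); x = self.dec2(x)
        x = self.unpool(x, i1); x = self.dec1(x)
        return torch.sigmoid(self.head(x))           # raw matte in [0, 1] (assumed)
```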

4.3. Fusion Module

The deep matting network takes the predicted trimap as input and directly computes the alpha matte. However, as shown in Fig. 3, it focuses on the unknown regions and recovers only structural and textural details; the semantic information of foreground and background is not retained well. In this section, we describe the fusion strategy in detail.

We use $F_s$, $B_s$ and $U_s$ to denote the foreground, background and unknown region channels predicted by T-Net before the softmax. The probability map of the foreground, $P_F$, can then be written as

$$P_F = \frac{\exp(F_s)}{\exp(F_s) + \exp(B_s) + \exp(U_s)} \qquad (2)$$

We can obtain $P_B$ and $P_U$ in the same way. It is obvious that $P_F + P_B + P_U = \mathbf{1}$, where $\mathbf{1}$ denotes an all-1 matrix with the same width and height as the input image. We use $\alpha_r$ to denote the output of M-Net.

Note that the predicted trimap gives the probability distribution of each pixel over the three categories: foreground, background and unknown region. When a pixel lies in the unknown region, which means that it is near the contour of a human and constitutes complex structural details such as hair, matting is required to accurately pull the alpha matte. In this case, we would like to use the result of the matting network, $\alpha_r$, as an accurate prediction. Otherwise, if a pixel lies outside the unknown region, the conditional probability of the pixel belonging to the foreground is an appropriate estimate of the matte, i.e., $P_F / (P_F + P_B)$. Considering that $P_U$ is the probability of each pixel belonging to the unknown region, a probabilistic estimation of the alpha matte for all pixels can be written as

$$\alpha_p = P_U\,\alpha_r + (1 - P_U)\,\frac{P_F}{P_F + P_B} \qquad (3)$$

where $\alpha_p$ denotes the output of the Fusion Module. As $P_F + P_B + P_U = \mathbf{1}$, we can rewrite Eq. 3 as

$$\alpha_p = P_F + P_U\,\alpha_r \qquad (4)$$

Intuitively, this formulation shows that the coarse semantic segmentation is refined by the matting result with details, and the refinement is controlled explicitly by the unknown region probability. When $P_U$ is close to 1, $P_F$ is close to 0, so $\alpha_p$ is approximated by $\alpha_r$; when $P_U$ is close to 0, $\alpha_p$ is approximated by $P_F$. Thus the fusion naturally combines the coarse semantics and fine details. Furthermore, training errors can be readily propagated through $\alpha_p$ to the corresponding components, enabling end-to-end training of the entire network.
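A minimal sketch of this Fusion Module as a standalone function, again assuming the T-Net channels are ordered (background, unknown, foreground):

```python
import torch
import torch.nn.functional as F

def fuse(trimap_logits: torch.Tensor, alpha_r: torch.Tensor) -> torch.Tensor:
    """Fuse T-Net logits (N, 3, H, W) with the raw M-Net matte alpha_r (N, 1, H, W)."""
    probs = F.softmax(trimap_logits, dim=1)            # Eq. 2 applied to all three channels
    p_b, p_u, p_f = probs[:, 0:1], probs[:, 1:2], probs[:, 2:3]
    return p_f + p_u * alpha_r                         # Eq. 4: alpha_p = P_F + P_U * alpha_r
```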

4.4. Loss

Following Xu et al. (Xu et al., 2017), we adopt the alpha prediction loss and the compositional loss. The alpha prediction loss $\mathcal{L}_\alpha$ is defined as the absolute difference between the groundtruth alpha $\alpha_g$ and the predicted alpha $\alpha_p$, and the compositional loss $\mathcal{L}_c$ is defined as the absolute difference between the groundtruth compositional image values $c_g$ and the predicted compositional image values $c_p$. The overall prediction loss for $\alpha_p$ at each pixel is

$$\mathcal{L}_p = \gamma\,\mathcal{L}_\alpha + (1 - \gamma)\,\mathcal{L}_c \qquad (5)$$

where $\gamma$ is set to 0.5 in our experiments. It is worth noting that, unlike Xu et al. (Xu et al., 2017), who only focus on unknown regions, in our automatic setting the prediction loss is summed over the entire image.

In addition, we note that the loss forms another decomposition problem of the groundtruth matte, which is again under-constrained. To obtain a stable solution, we introduce an extra constraint to keep the trimap meaningful: a classification loss $\mathcal{L}_t$ for the trimap over each pixel.

Finally, we get the total loss

$$\mathcal{L} = \mathcal{L}_p + \lambda\,\mathcal{L}_t \qquad (6)$$

where we keep $\lambda$ at a small value to impose the decomposition constraint, e.g., 0.01 throughout this paper.
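Under the definitions above, Eqs. 5 and 6 could be sketched as the following PyTorch loss; it assumes the groundtruth foreground and background images are available for the compositional term and that the trimap ground truth is a per-pixel class index map, which are our own tensor conventions.

```python
import torch.nn.functional as F

def shm_loss(alpha_p, alpha_g, fg, bg, image, trimap_logits, trimap_g,
             gamma=0.5, lam=0.01):
    # Alpha prediction loss L_alpha: mean absolute difference of the mattes,
    # computed over the entire image (not only unknown regions).
    l_alpha = (alpha_p - alpha_g).abs().mean()
    # Compositional loss L_c: re-composite with Eq. 1 using the predicted matte
    # and compare with the groundtruth composite (the input image).
    c_p = alpha_p * fg + (1.0 - alpha_p) * bg
    l_comp = (c_p - image).abs().mean()
    l_p = gamma * l_alpha + (1.0 - gamma) * l_comp     # Eq. 5, gamma = 0.5
    # Trimap classification loss L_t: pixel-wise cross entropy on T-Net logits,
    # with trimap_g a LongTensor of class indices (N, H, W).
    l_t = F.cross_entropy(trimap_logits, trimap_g)
    return l_p + lam * l_t                             # Eq. 6, lambda = 0.01
```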

4.5. Implementation Detail

The pre-training technique (Hinton et al., 2006) has been widely adopted in deep learning and has shown its effectiveness. We follow this common practice: we first pre-train the two sub-networks, T-Net and M-Net, separately and then finetune the entire network in an end-to-end way. Further, when pre-training the sub-networks, large amounts of extra data specific to the sub-tasks can also be employed to sufficiently train the models. Note that the dataset used for pre-training should not overlap with the test set.

T-Net pre-train

To train T-Net, we follow the common practice of generating the trimap ground truth by dilating the groundtruth alpha mattes. In the training phase, square patches are randomly cropped from input images and uniformly resized to 400×400. To avoid overfitting, these samples are also augmented by random rotation and horizontal flipping. As our T-Net makes use of PSPNet-50, which is based on ResNet-50 (He et al., 2016), we initialize the relevant layers with an off-the-shelf model trained on the ImageNet classification task and randomly initialize the remaining layers. The cross entropy loss for classification (i.e., $\mathcal{L}_t$ in Eq. 6) is employed.

M-Net pre-train

We follow the deep matting network training pipeline of (Xu et al., 2017) to pre-train M-Net. Again, the input of M-Net is a 3-channel image together with a 3-channel trimap generated by dilating and eroding the groundtruth alpha mattes. It is worth noting that we find it crucial for matting performance to augment the trimaps with different kernel sizes for dilation and erosion, since this makes the result more robust to various unknown region widths. For data augmentation, the input images are randomly cropped and resized to 320×320. The entire DIM (Xu et al., 2017) dataset is employed during M-Net pre-training, regardless of whether the images contain humans, since M-Net focuses on local patterns rather than global semantic meaning. The regression loss, the same as the $\mathcal{L}_p$ term in Eq. 6, is adopted.
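A sketch of trimap generation by dilating and eroding a groundtruth matte with a randomly sized kernel, as described above; the kernel size range and binarization thresholds are assumptions, and the resulting label map would still need to be converted to a 3-channel (one-hot) trimap before being concatenated with the image.

```python
import cv2
import numpy as np

def make_trimap(alpha: np.ndarray, low: int = 5, high: int = 30) -> np.ndarray:
    """alpha: uint8 matte in [0, 255]. Returns labels: 0 bg, 128 unknown, 255 fg."""
    k = np.ones((np.random.randint(low, high),) * 2, np.uint8)   # random kernel size
    dilated = cv2.dilate((alpha > 0).astype(np.uint8), k)        # expanded foreground
    eroded = cv2.erode((alpha == 255).astype(np.uint8), k)       # shrunken definite fg
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)           # start as unknown
    trimap[dilated == 0] = 0                                     # far from fg: background
    trimap[eroded == 1] = 255                                    # deep inside fg: foreground
    return trimap
```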

End-to-end training

End-to-end training is performed on the human matting dataset, with the model initialized by the pre-trained T-Net and M-Net. In the training stage, the input image is randomly cropped into 800×800 patches and fed into T-Net to obtain semantic predictions. Considering that M-Net needs to focus more on details and be trained with large diversity, augmentations are performed on the fly to randomly crop patches of different sizes (320×320, 480×480, 640×640 as in (Xu et al., 2017)) and resize them to 320×320. Horizontal flipping is also randomly applied with 0.5 probability. The total loss in Eq. 6 is used. For testing, the feed-forward pass is conducted on the entire image without augmentation. More specifically, when the longer edge of the input image exceeds 1500 pixels, we first scale it to 1500 due to the limitation of GPU memory. We then feed it to the network and finally rescale the predicted alpha matte to the size of the original input image for performance evaluation. Alternatively, we can perform testing on CPU for large images without losing resolution.
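The test-time handling of large images could look roughly like the sketch below; the bilinear interpolation mode and the convention that the model returns the fused matte last are assumptions.

```python
import torch
import torch.nn.functional as F

def predict_full_image(model, image, max_edge=1500):
    """image: (1, 3, H, W) tensor; returns a (1, 1, H, W) alpha matte."""
    _, _, h, w = image.shape
    scale = max_edge / max(h, w) if max(h, w) > max_edge else 1.0
    if scale < 1.0:
        # Downscale so the longer edge is at most max_edge pixels (GPU memory limit).
        image = F.interpolate(image, scale_factor=scale, mode="bilinear",
                              align_corners=False)
    with torch.no_grad():
        _, _, alpha = model(image)            # SHM-style model returning the fused matte last
    if scale < 1.0:
        # Rescale the predicted matte back to the original resolution for evaluation.
        alpha = F.interpolate(alpha, size=(h, w), mode="bilinear", align_corners=False)
    return alpha
```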

5. Experiments

5.1. Experimental Setup

We implement our method with the PyTorch (Paszke et al., 2017) framework. The T-Net and M-Net are first pre-trained and then fine-tuned end to end as described in Section 4.5. During the end-to-end training phase, we use Adam as the optimizer. The learning rate is set to and the batch size is 10.

Dataset

We evaluate our method on the human matting dataset, which contains 52,511 training images and 1,400 testing images as described in Section 3.

Measurement

Four metrics are used to evaluate the quality of the predicted alpha matte (Rhemann et al., 2009): SAD, MSE, Gradient error and Connectivity error. SAD and MSE are directly correlated with the training objective, while the Gradient and Connectivity errors were proposed by (Rhemann et al., 2009) to reflect the perceptual visual quality for a human observer. Specifically, we normalize both the predicted alpha matte and the groundtruth to [0, 1] when calculating all these metrics. Further, all metrics are calculated over entire images instead of only within unknown regions, and are averaged by the number of pixels.
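For concreteness, SAD and MSE under these conventions (mattes normalized to [0, 1], errors computed over the entire image and averaged by pixel count) could be computed as follows; the 8-bit input assumption is ours.

```python
import numpy as np

def sad_and_mse(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: uint8 alpha mattes in [0, 255] of identical shape."""
    pred = pred.astype(np.float64) / 255.0   # normalize to [0, 1]
    gt = gt.astype(np.float64) / 255.0
    diff = pred - gt
    sad = np.abs(diff).mean()                # absolute difference, averaged over all pixels
    mse = (diff ** 2).mean()                 # squared difference, averaged over all pixels
    return sad, mse
```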

Baselines

In order to evaluate the effectiveness of our proposed method, we compare it with the following state-of-the-art matting methods: Closed Form (CF) matting (Levin et al., 2008), KNN matting (Chen et al., 2013), DCNN matting (Cho et al., 2016), Information Flow Matting (IFM) (Aksoy et al., 2017) and Deep Image Matting (DIM) (Xu et al., 2017). Implementations provided by their authors are used, except for DIM, for which we implement the network with the same structure as M-Net but with 4 input channels for a fair comparison. Note that all these matting methods are interactive and need extra trimaps as input. For a fair comparison, we provide them with trimaps predicted by the well pre-trained T-Net. We denote these methods as PSP50+X, where X stands for each of the above methods.

To demonstrate the results of applying semantic segmentation to the matting problem, we also design the following baselines:

  • PSP50 Seg: a PSPNet-50 is used to extract humans via the predicted mask. The groundtruth mask used to train this network is obtained by binarizing the alpha matte with a threshold of 0.

  • PSP50 Reg: a PSPNet-50 is trained to predict the alpha matte as regression with L1 loss.

| Methods | SAD | MSE | Gradient | Connectivity |
|---|---|---|---|---|
| PSP50 Seg | 14.821 | 11.530 | 52.336 | 44.854 |
| PSP50 Reg | 10.098 | 5.430 | 15.441 | 65.217 |
| PSP50+CF (Levin et al., 2008) | 8.809 | 5.218 | 21.819 | 43.927 |
| PSP50+KNN (Chen et al., 2013) | 7.806 | 4.390 | 20.476 | 56.328 |
| PSP50+DCNN (Cho et al., 2016) | 8.378 | 4.756 | 20.801 | 50.574 |
| PSP50+IFM (Aksoy et al., 2017) | 7.576 | 4.275 | 19.762 | 52.470 |
| PSP50+DIM (Xu et al., 2017) | 6.140 | 3.834 | 19.414 | 41.884 |
| **Our Method** | **3.833** | **1.534** | **5.179** | **36.513** |

Table 3. The quantitative results on the human matting testing dataset. The best results are emphasized in bold.

5.2. Performance Comparison

In this section, we compare our method with the state-of-the-art matting methods (using generated trimaps) and the designed baselines on the human matting testing dataset. Trimaps are predicted by the pre-trained T-Net and provided to the interactive matting methods. The quantitative results are listed in Table 3.

The performance of binary segmentation and regression is poor. Since both complex structural details and the concept of a human are required in this task, the results show that it is hard to learn them simultaneously with a single FCN. Using the trimaps predicted by the same PSP50 network, DIM outperforms the other methods, such as CF, KNN, DCNN and IFM, owing to the strong capability of the deep matting network to model complex image context. Our method performs much better than all baselines. The key reason is that our method successfully coordinates the coarse semantics and fine details with a probabilistic fusion strategy, which enables better end-to-end training.

Several visual examples are shown in Fig. 4. Compared to the other methods (columns 2 to 4), our method not only obtains sharper details, such as hair, but also makes far fewer semantic errors, which may benefit from the end-to-end training.

Figure 4. The visual comparison results on the semantic human matting testing dataset. Columns, left to right: Image, PSP50 Reg, PSP50+IFM (Aksoy et al., 2017), PSP50+DIM (Xu et al., 2017), TrimapGT+IFM (Aksoy et al., 2017), TrimapGT+DIM (Xu et al., 2017), Our method, Alpha GT.

5.3. Automatic Method vs. Interactive Methods

| Methods | SAD | MSE | Gradient | Connectivity |
|---|---|---|---|---|
| TrimapGT+CF | 6.772 | 2.258 | 9.0390 | 34.248 |
| TrimapGT+KNN | 8.379 | 3.413 | 16.451 | 83.458 |
| TrimapGT+DCNN | 6.760 | 2.162 | 9.753 | 44.392 |
| TrimapGT+IFM | 5.933 | 1.798 | 8.290 | 54.257 |
| TrimapGT+DIM | 2.642 | 0.589 | 3.035 | 25.773 |
| Our Method | 3.833 | 1.534 | 5.179 | 36.513 |

Table 4. The quantitative results of our method and several state-of-the-art matting methods that need trimaps, on the semantic human matting testing dataset.

We compare our method with state-of-the-art interactive matting methods that take the groundtruth trimaps as inputs, generated by the same strategy used in the T-Net pre-training stage. We denote these baselines as TrimapGT+X, where X represents one of 5 state-of-the-art matting methods: CF (Levin et al., 2008), KNN (Chen et al., 2013), DCNN (Cho et al., 2016), IFM (Aksoy et al., 2017) and DIM (Xu et al., 2017). Table 4 shows the comparison. Our automatic method trained with the end-to-end strategy performs better than most interactive matting methods and is only slightly inferior to TrimapGT+DIM. Note that our automatic method only takes the original RGB images as input, while the interactive TrimapGT+X baselines take additional groundtruth trimaps as inputs. Our T-Net infers the human bodies and produces coarse predictions, which are then complemented with matting details by M-Net. Despite a slightly higher test error, our automatic method is visually comparable with DIM, the state-of-the-art interactive matting method, as shown in Fig. 4 (column "TrimapGT+DIM" vs. "Our method").

5.4. Evaluation and Analysis of Different Components

| Methods | SAD | MSE | Gradient | Connectivity |
|---|---|---|---|---|
| no end-to-end | 7.576 | 4.275 | 19.762 | 52.470 |
| no Fusion | 4.231 | 2.146 | 5.230 | 56.402 |
| no $\mathcal{L}_t$ | 4.536 | 2.278 | 5.424 | 52.546 |
| Our Method | 3.833 | 1.534 | 5.179 | 36.513 |

Table 5. Evaluation of different components.

The Effect of End-to-end Training

In order to evaluate the effectiveness of the end-to-end strategy, we compare our end-to-end trained model with one that uses only the pre-trained parameters (no end-to-end). The results are listed in Table 5. The network trained in an end-to-end manner performs better, which shows the effectiveness of end-to-end training.

The Evaluation of Fusion Module

To validate the importance of the proposed Fusion Module, we design a simple baseline that directly outputs the result of M-Net, i.e., $\alpha_p = \alpha_r$. It is trained with the same objective as Eq. 6. We compare the performance of our method with the Fusion Module against this baseline without the Fusion Module in Table 5. Our method with the Fusion Module achieves better performance. In particular, although the other metrics remain relatively small, the Connectivity error of the baseline becomes quite large, which can be attributed to blurring of the structural details when the whole alpha matte is predicted by M-Net alone. Thus the designed Fusion Module, which leverages both the coarse estimations from T-Net and the fine predictions from M-Net, is crucial for better performance.

The Effect of the $\mathcal{L}_t$ Constraint

In our implementation, we introduce a constraint for the trimap, i.e., the classification loss $\mathcal{L}_t$ in Eq. 6. We train a network with this constraint removed to investigate its effect, and denote the network trained in this way as no $\mathcal{L}_t$. The performance of this network is shown in Table 5. The network without $\mathcal{L}_t$ performs better than the one without end-to-end training, but worse than the proposed method. This constraint makes the trimap more meaningful and the decomposition in Eq. 4 more stable.


Figure 5. Intermediate results visualization on a real image. (a) an input image, (b) trimap predicted by T-Net, (c) raw alpha matte predicted by M-Net, (d) fusion result according to Eq. 4.
Figure 6. The visual comparison results on real images. Columns, left to right: Image, PSP50 Reg, PSP50+IFM (Aksoy et al., 2017), PSP50+DIM (Xu et al., 2017), Our method, Composition.

Visualization of Intermediate Results

To better understand the mechanism of SHM, we visualize the intermediate results on a real image in Fig. 5. Column (a) shows the original input image, column (b) shows the foreground (green), background (red) and unknown region (blue) predicted by T-Net, column (c) shows the alpha matte predicted by M-Net, and column (d) shows the fusion of (b) and (c) according to Eq. 4. We can see that T-Net segments a rough estimate of the main human body, and automatically distinguishes definite human edges, where the predicted unknown region is narrower, from structural details, where the predicted unknown region is wider. In addition, with the help of the coarse prediction provided by T-Net, M-Net can concentrate on the transitional regions between foreground and background and predict more structural details of the alpha matte. Finally, the Fusion Module combines the advantages of T-Net and M-Net to obtain a high quality alpha matte.

5.5. Applying to real images

Since the images in our dataset are composited from annotated foregrounds and random backgrounds, to investigate the ability of our model to generalize to real-world images, we apply our model and other methods to plenty of real images for a qualitative analysis. Several visual results are shown in Fig. 6. We find that our method performs well on real images even with complicated backgrounds. Note that the hair details of the woman in the first image of Fig. 6 are recovered nicely only by our method. Also, the fingers in the second image are blurred incorrectly by the other methods, whereas our method distinguishes them well. Composition examples of the foregrounds with new backgrounds, produced with the automatically predicted alpha mattes, are illustrated in the last column of Fig. 6. These compositions have high visual quality. More results can be found in the supplementary materials.

6. Conclusion

In this paper, we focus on the human matting problem, which is of great importance for a wide variety of applications. In order to simultaneously capture global semantic information and local details, we propose to cascade a trimap network and a matting network, together with a novel fusion module, to generate the alpha matte automatically. Furthermore, we create a large high quality human matting dataset. Benefiting from the model structure and dataset, our automatic human matting achieves results comparable with state-of-the-art interactive matting methods.

Acknowledgement

We thank Jian Xu for many helpful discussions and valuable suggestions, and Yangjian Chen, Xiaowei Li, Hui Chen, Yuqi Chen for their support on developing the image labeling tool, and Min Zhou for some comments that improved the manuscript.

References

  • Aksoy et al. (2017) Yagız Aksoy, Tunç Ozan Aydın, and Marc Pollefeys. 2017. Designing effective inter-pixel information flow for natural image matting. In Computer Vision and Pattern Recognition (CVPR).
  • Butler et al. (2012) Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. 2012. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision. Springer, 611–625.
  • Chen et al. (2016) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016).
  • Chen et al. (2013) Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. 2013. KNN matting. IEEE transactions on pattern analysis and machine intelligence 35, 9 (2013), 2175–2188.
  • Cho et al. (2016) Donghyeon Cho, Yu-Wing Tai, and Inso Kweon. 2016. Natural image matting using deep convolutional neural networks. In European Conference on Computer Vision. Springer, 626–643.
  • Chuang et al. (2001) Yung-Yu Chuang, Brian Curless, David H Salesin, and Richard Szeliski. 2001. A bayesian approach to digital matting. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Vol. 2. IEEE, II–II.
  • Everingham et al. ([n. d.]) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. [n. d.]. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • Gastal and Oliveira (2010) Eduardo SL Gastal and Manuel M Oliveira. 2010. Shared Sampling for Real-Time Alpha Matting. In Computer Graphics Forum, Vol. 29. Wiley Online Library, 575–584.
  • Grady et al. (2005) Leo Grady, Thomas Schiwietz, Shmuel Aharon, and Rüdiger Westermann. 2005. Random walks for interactive alpha-matting. In Proceedings of VIIP, Vol. 2005. 423–429.
  • Gupta et al. (2016) Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2315–2324.
  • He et al. (2011) Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun. 2011. A global sampling method for alpha matting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2049–2056.
  • He et al. (2010) Kaiming He, Jian Sun, and Xiaoou Tang. 2010. Guided image filtering. In European conference on computer vision. Springer, 1–14.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hinton et al. (2006) Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation 18, 7 (2006), 1527–1554.
  • Lee and Wu (2011) Philip Lee and Ying Wu. 2011. Nonlocal matting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2193–2200.
  • Levin et al. (2008) Anat Levin, Dani Lischinski, and Yair Weiss. 2008. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (2008), 228–242.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431–3440.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
  • Peng et al. (2017) Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. 2017. Large Kernel Matters–Improve Semantic Segmentation by Global Convolutional Network. arXiv preprint arXiv:1703.02719 (2017).
  • Rhemann et al. (2009) Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and Pamela Rott. 2009. A perceptually motivated online benchmark for image matting. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 1826–1833.
  • Ros et al. (2016) German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. 2016. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Shahrian et al. (2013) Ehsan Shahrian, Deepu Rajan, Brian Price, and Scott Cohen. 2013. Improving image matting using comprehensive sampling sets. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 636–643.
  • Shen et al. (2016) Xiaoyong Shen, Xin Tao, Hongyun Gao, Chao Zhou, and Jiaya Jia. 2016. Deep automatic portrait matting. In European Conference on Computer Vision. Springer, 92–107.
  • Sun et al. (2004) Jian Sun, Jiaya Jia, Chi-Keung Tang, and Heung-Yeung Shum. 2004. Poisson matting. In ACM Transactions on Graphics (ToG), Vol. 23. ACM, 315–321.
  • Wang and Cohen (2007) Jue Wang and Michael F Cohen. 2007. Optimized color sampling for robust matting. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 1–8.
  • Xu et al. (2017) Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. 2017. Deep image matting. In Computer Vision and Pattern Recognition (CVPR).
  • Yu and Koltun (2015) Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
  • Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2881–2890.
  • Zhu et al. (2017) Bingke Zhu, Yingying Chen, Jinqiao Wang, Si Liu, Bo Zhang, and Ming Tang. 2017. Fast Deep Matting for Portrait Animation on Mobile Phone. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 297–305.