Trimap-guided Feature Mining and Fusion Network for Natural Image Matting

Utilizing trimap guidance and fusing multi-level features are two important issues for trimap-based matting with pixel-level prediction. To utilize trimap guidance, most existing approaches simply concatenate the trimap and the image to feed a deep network, or apply an extra network to extract more trimap guidance, which forces a trade-off between efficiency and effectiveness. For feature fusion, most existing content-based methods focus only on local features and lack the guidance of a global feature with strong semantic information related to the interesting object. In this paper, we propose a trimap-guided feature mining and fusion network consisting of our trimap-guided non-background multi-scale pooling (TMP) module and global-local context-aware feature fusion (GLF) modules. Considering that the trimap provides strong semantic guidance, our TMP module focuses effective feature mining on interesting objects under the guidance of the trimap without extra parameters. Furthermore, our GLF modules use the global semantic information of interesting objects mined by our TMP module to guide an effective global-local context-aware multi-level feature fusion. In addition, we build a common interesting object matting (CIOM) dataset to advance high-quality image matting. Experimental results on the Composition-1k test set, the Alphamatting benchmark, and our CIOM test set demonstrate that our method outperforms state-of-the-art approaches. Code and models will be publicly available soon.


1 Introduction

The alpha matting task separates foreground objects from images by predicting an alpha matte, which represents the opacity of the foreground at each pixel. In a mathematical form, alpha matting defines the natural image $I$ as a convex combination of a foreground image $F$ and a background image $B$ at each pixel $i$, as shown below:

$I_i = \alpha_i F_i + (1 - \alpha_i) B_i \quad (1)$
Figure 1: The illustration of trimap-based matting. Columns from left to right: an input image, a trimap, and a ground-truth alpha matte. The black, white, and gray regions in the trimap denote the background, foreground, and unknown regions, respectively. The non-background area points out the interesting object.

where $\alpha_i$ is the value of the predicted alpha matte at pixel $i$. As explained in [18, 1], the problem is highly ill-posed. To reduce the solution space, many approaches [4, 31] utilize a trimap or scribbles as constraint information. Trimap-based approaches not only reduce the solution space but also identify which object should be treated as the foreground in a complex image, based on the guidance of the non-background area of the input trimap. As shown in Fig. 1, the input trimap points out which objects should be predicted as foreground, both for overlapped objects and for objects close to each other. Since salient but uninteresting objects, such as parts of the sheep, can also appear in the unknown regions of the trimap in Fig. 1, a good trimap-based approach should utilize the semantic guidance of the input trimap to predict the opacity of the interesting object, instead of simply predicting all salient objects in the unknown regions.
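For concreteness, Eq. 1 is also the rule used to synthesize matting training data from a foreground, a background, and an alpha matte. Below is a minimal NumPy sketch; the function and array names are ours, not from the paper.

```python
import numpy as np

def composite(fg, bg, alpha):
    """Composite a foreground onto a background following Eq. 1.

    fg, bg: float arrays of shape (H, W, 3) in [0, 1];
    alpha:  float array of shape (H, W) in [0, 1], the opacity of the foreground.
    """
    a = alpha[..., None]            # broadcast alpha over the color channels
    return a * fg + (1.0 - a) * bg
```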

Deep learning methods have achieved significant improvements in trimap-based matting in recent years. Most of them [33, 24, 13, 9, 22, 21] utilize trimap information by directly concatenating the input image and the input trimap to feed an encoder network. Some of them [1, 29, 23] learn or process trimap information with an extra network. Attention-based methods such as GCA [18] and HDMatt [34] propagate information flow between the different regions indicated by the trimap, depending on the similarity between key patches and query patches. However, these approaches neglect the strong high-level semantic cues for interesting objects provided by the non-background regions of the input trimap, and do not exploit them to mine high-level semantic information of interesting objects in an efficient way.

Fusing or aligning low-resolution high-level features and high-resolution low-level features is another important issue for image matting. Most approaches [33, 18, 9, 29] adopt static methods, which upsample high-level features by transposed convolutions or bilinear upsampling and then fuse them with low-level features by addition or a convolution layer. Advanced content-based methods have emerged in recent years: IndexNet [24], CARAFE [30], and AU [6] adopt content-based spatial dynamic upsampling by predicting content-aware upsampling kernels instead of using distance-based upsampling or static transposed convolutions. Considering that semitransparent parts of a given foreground object may have different appearances against different background scenes, spatially dynamic fusion kernels may work better than static convolution kernels. However, these content-based approaches fuse the high-level and low-level features depending only on local features, which may neglect the global context feature with high-level semantic information closely related to the interesting objects.

In this paper, we propose a trimap-guided feature mining and fusion network (TMFNet), which mines high-level semantic information of the interesting object under trimap guidance efficiently and fuses multi-level features with global-local context-aware spatial dynamic kernels effectively. The proposed TMFNet mainly consists of our trimap-guided non-background multi-scale pooling (TMP) module and global-local context-aware feature fusion (GLF) modules.

We propose the TMP module to mine semantic context information of interesting objects by utilizing the high-level semantic guidance of the input trimap without extra parameters. Global pooling and large-kernel pooling, such as the pyramid pooling module [35], are widely used in both image matting [9] and semantic segmentation [32] to capture semantic information from the global context. As shown in Fig. 1, trimap-based matting needs to separate the interesting objects pointed out by the non-background regions of the trimap instead of every object, which inspires us to aggregate weighted averages of high-level semantic features over the non-background area at different scales. Since image matting requires more spatially accurate prediction, we combine non-background average pooling and multi-scale pooling kernels with a stride of 1 to build our TMP module, which integrates the high-level semantic cues of the trimap into the semantic context information mining of interesting objects without extra parameters.

Different from previous content-based aligning or fusion methods [24, 6, 30], our GLF modules not only utilize local features but also introduce the global context feature mined by our TMP module to predict better dynamic fusion kernels efficiently and effectively. Since both the global context and local features are important for pixel-level prediction, the proper selection of the global feature improves our feature fusion significantly.

In addition, we build a common interesting object matting dataset to advance high-quality and high-resolution trimap-based image matting.

Our major contributions can be summarized as follows:

  • We propose a trimap-guided non-background multi-scale pooling (TMP) module to mine semantic information of interesting objects, which utilizes high-level semantic guidance in trimap without extra parameters.

  • We design a novel lightweight global-local context-aware feature fusion (GLF) module which introduces a global feature with high-level semantic information mined from our TMP module to promote the generation of fusion kernels efficiently and effectively.

  • We build a common interesting object matting (CIOM) dataset to advance high-quality and high-resolution trimap-based natural image matting.

  • Experimental results on Composition-1k [33] test set, Alphamatting [26] benchmark, and our CIOM test set demonstrate that our TMFNet outperforms the state-of-the-art approaches in natural image matting.

2 Related Work

Trimap-based image matting. In general, trimap-based image matting methods fall into three main categories: sampling-based methods, affinity-based methods, and learning-based methods. Sampling-based methods [4, 8, 11, 27] solve Eq. 1 by sampling colors from the background and foreground regions for each pixel in the unknown region. Propagation methods [3, 10, 16, 17] estimate alpha by propagating its values from known regions to unknown regions based on the color line model proposed by [16]. Benefiting from the development of powerful deep convolution networks and a large-scale matting dataset [33], most trimap-based approaches [33, 24, 13, 9, 22, 21] utilize the semantic cues of the input trimap by concatenating the image and the trimap to feed the network. Some approaches use an extra network to learn or extract semantic information from the input trimap. ADA [1] uses an extra decoder to learn an adapted trimap and propagates it with the output of the alpha decoder. SIM [29] uses an extra patch-based classifier to generate a semantic trimap, which requires their extended dataset with class labels of foreground objects. TIMINet [23] uses an extra network component to mine trimap information. Non-local matting approaches [18, 34] guide information flow from the image context to unknown pixels in the trimap using an attention mechanism.

Fusing or aligning high-level features and low-level features. CNN-based models usually provide high-level features at low resolution and low-level features at high resolution. Therefore, fusing or aligning high-level and low-level features is an important issue for deep matting. Existing approaches fall into two main categories: static methods and content-based methods. Most approaches [33, 9, 18, 21] adopt static methods, which upsample high-level features with a bilinear kernel or transposed convolution and then fuse the upsampled high-level features with low-level features by convolution or direct addition. Other approaches [24, 30, 6] adopt content-based spatial dynamic upsampling instead of static upsampling to improve feature fusion. IndexNet [24] and CARAFE [30] generate upsampling kernels from first-order features, while AU [6] uses second-order information to generate its upsampling kernels. All these methods use only local features for their dynamic or static fusion.

Figure 2: Our proposed TMP module, GLF module and framework of our TMFNet.

3 Our Baseline for Deep Alpha Matting

Baseline Structure. Our encoder is a ResNet-50 [12], like [29], with an output stride of 16. An image and a one-hot trimap are concatenated as a 6-channel input, like [18]. The input is fed to our encoder to generate different levels of features. Our baseline decoder first passes the output of the C5 stage of ResNet-50 to a pyramid pooling layer [35] followed by two convolution layers, which we denote as "ppm". The high-level output features of "ppm" are then fused with low-level features from the C2 and C1 stages and the 6-channel input through a bilinear upsampling, a concatenation, and a convolution layer with a Leaky ReLU [25], in sequence. We denote this fusing process in the baseline as "static fusion" in this paper. Finally, the fused features are fed to two convolution layers to predict the alpha matte. More details are reported in the supplemental materials.

Loss Function. We adopt the alpha loss and composition loss of DIM [33] as our base loss and add a Laplacian loss [13] to form the total loss:

$\mathcal{L}_{total} = \mathcal{L}_{\alpha} + \mathcal{L}_{comp} + \mathcal{L}_{lap} \quad (2)$
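For reference, a minimal PyTorch sketch of this loss under common conventions is given below; the equal weighting of the three terms, the Charbonnier-style penalty, and the pyramid construction are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def charbonnier(x, eps=1e-6):
    # Smooth L1-style penalty commonly used for the alpha and composition losses.
    return torch.sqrt(x * x + eps * eps)

def laplacian_pyramid(x, levels=5):
    # Build a simple Laplacian pyramid; the exact blur kernel used in [13] may differ.
    pyr, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2, ceil_mode=True)
        up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear', align_corners=False)
        pyr.append(cur - up)
        cur = down
    pyr.append(cur)
    return pyr

def matting_loss(alpha_pred, alpha_gt, fg, bg, image, unknown):
    """alpha_*: (B,1,H,W); fg/bg/image: (B,3,H,W); unknown: (B,1,H,W) in {0,1}."""
    w = unknown.sum().clamp(min=1.0)
    # Alpha loss, restricted to the unknown region of the trimap.
    l_alpha = (charbonnier(alpha_pred - alpha_gt) * unknown).sum() / w
    # Composition loss: re-composite with the predicted alpha following Eq. 1.
    comp = alpha_pred * fg + (1 - alpha_pred) * bg
    l_comp = (charbonnier(comp - image) * unknown).sum() / (3 * w)
    # Laplacian loss over a multi-scale pyramid of the alpha matte.
    l_lap = sum((p - g).abs().mean()
                for p, g in zip(laplacian_pyramid(alpha_pred), laplacian_pyramid(alpha_gt)))
    return l_alpha + l_comp + l_lap
```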

4 A New Image Matting Dataset

Focusing on high-quality interesting object matting, we collect 733 high-quality images of common interesting foreground objects without motion blur. We manually extract their alpha mattes and foreground images with Photoshop. We select 683 labeled alpha mattes with corresponding foregrounds as our training set, which are composited onto background images from COCO [20] during training. The other 50 labeled images are composited onto background images from ADE20k [36] to form 1000 test images according to the composition rules in [33]. The average number of pixels for Composition-1k [33] training and test images are and , respectively. The average number of pixels for our CIOM training and test images are and , respectively, which makes our dataset more suitable for higher-resolution and higher-quality matting. More details are in the supplemental materials.

Figure 3: The visual comparison results on Composition-1k [33] test set. From left to right, the original image, trimap, IndexNet [24], GCA [18], baseline, ours and ground-truth.

5 Proposed Methods

5.1 The TMP Module

To construct a powerful semantic representation of complex scenes for image segmentation, the pyramid pooling module [35] aggregates global context information and sub-region context with large-kernel pooling and fuses them. However, trimap-based image matting approaches need to focus on the interesting object pointed out by the trimap instead of every object in the scene. Since the input trimap contains high-level semantic guidance, simply passing it with the low-level image through the same network structure is not enough to extract this guidance. Observing that the non-background regions of the trimap are closely related to the interesting objects, it is reasonable to focus the semantic feature mining on the non-background regions. Considering that image matting needs a smoother representation, we extract a powerful and smooth semantic representation of interesting objects by fusing weighted averages of high-level features over the non-background regions, computed with large kernels of different scales and a stride of 1. In this way, we are able to integrate the high-level semantic guidance of the trimap into mining semantic information of interesting objects without extra network components.

With the above analysis, we introduce the trimap-guided non-background multi-scale pooling (TMP) module, which efficiently provides a powerful semantic representation of the interesting objects pointed out by the trimap. Our TMP takes a high-level feature map $X$ and a non-background weighted mask $M$ with the same spatial size as $X$ as inputs. To obtain $M$, we generate a non-background binary mask that is 0 in the background region of the trimap and 1 elsewhere, and then bilinearly resize it to the spatial size of $X$. As shown in Fig. 2, we reduce the channels of the input feature map with 4 parallel convolution layers to obtain 4 reduced feature maps, denoted as $X_r$. To focus feature mining on the non-background region, we build the non-background pooling unit as:

$\mathrm{NP}_k(X_r) = \dfrac{\mathrm{AvgPool}_k(X_r \odot M)}{\mathrm{AvgPool}_k(M) + \epsilon} \quad (3)$

where $\mathrm{AvgPool}_k$ is an average pooling layer with a kernel size of $k$ and a stride of 1, $\odot$ is the Hadamard product, and $X_r$, $M$, and $\epsilon$ are the reduced feature map, the non-background weighted mask, and a small constant, respectively. We then use 4 non-background pooling units with different kernel sizes to harvest the semantic information of interesting objects from the reduced feature maps $X_r$. Finally, the outputs of the non-background pooling units are concatenated with the high-level input feature and fused by two convolution layers to form the multi-scale context representation with high-level semantic information for the interesting object. We set the kernel sizes of the non-background pooling units to 31, 17, 11, and 5, corresponding to the bin sizes of 1, 2, 3, and 6 in "ppm" at the training input resolution, respectively, so our TMP has pooling kernel sizes similar to those of "ppm". Moreover, our TMP has the same parameter size as the "ppm" in the baseline.
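As a concrete illustration, below is a minimal PyTorch sketch of the TMP module, assuming the non-background pooling unit of Eq. 3 is a masked (weighted) average over each k x k window; the channel widths, normalization, and activation choices are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nb_pool(x, mask, k, eps=1e-6):
    # Non-background average pooling with stride 1: a masked (weighted) mean over each
    # k x k window, so background positions do not dilute the pooled features (Eq. 3).
    pad = k // 2
    return F.avg_pool2d(x * mask, k, stride=1, padding=pad) / \
           (F.avg_pool2d(mask, k, stride=1, padding=pad) + eps)

class TMP(nn.Module):
    """Sketch of trimap-guided non-background multi-scale pooling."""
    def __init__(self, in_ch=2048, red_ch=512, out_ch=512, kernels=(31, 17, 11, 5)):
        super().__init__()
        self.kernels = kernels
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, red_ch, 1, bias=False),
                          nn.BatchNorm2d(red_ch), nn.LeakyReLU(0.01))
            for _ in kernels])
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + red_ch * len(kernels), out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01))

    def forward(self, x, trimap):
        # Non-background mask M: 1 where the trimap is foreground or unknown;
        # the trimap is assumed here to be a single-channel map with 0 for background.
        mask = F.interpolate((trimap > 0).float(), size=x.shape[-2:],
                             mode='bilinear', align_corners=False)
        feats = [x] + [nb_pool(r(x), mask, k) for r, k in zip(self.reduce, self.kernels)]
        return self.fuse(torch.cat(feats, dim=1))
```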

5.2 The GLF Module

Existing approaches use static methods [9, 18] or content-based methods [30, 6] to upsample a high-level feature map, then concatenate it with the low-level one, and fuse them with a convolution layer. However, these methods focus only on local features.

Since both local details and global context are important for matting an interesting object, we construct our global-local context-aware feature fusion (GLF) module, which utilizes local features from the high-level and low-level feature maps together with a global feature carrying high-level semantic information.

As briefly shown in Fig. 2, our GLF module first uses pixel shuffle [28] to align the spatial sizes of the high-level feature $X_h$ and the low-level feature $X_l$, and then concatenates them. We then use a 1×1 convolution layer to distribute their information into N groups of channels in an internal feature map $X_d$. After that, N groups of kernels at each spatial position are generated from both the local features of $X_d$ and the global feature $g$. Channels in the same group share one kernel map to fuse spatial information. Finally, a 1×1 convolution fuses information from different groups. In this way, our GLF module fuses a high-level feature and a low-level feature under the guidance of a global feature efficiently. The exact mathematical description is as follows.

Given a low-level feature $X_l$, a high-level feature $X_h$, and a global feature $g$ as inputs, the GLF module first distributes the information of $X_l$ and $X_h$ along the channel dimension as:

$X_d = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(X_l, \mathrm{PS}(X_h))\big) \quad (4)$

where $X_d$, $\mathrm{Conv}_{1\times1}$, $\mathrm{Concat}$, and $\mathrm{PS}$ are the internal feature map, the 1×1 convolution, concatenation, and pixel shuffle [28], respectively. Then we generate N groups of kernels at each spatial position as:

$K = \mathrm{Conv}_{1\times1}\big(\sigma(X_d \oplus g)\big) \quad (5)$

where $\sigma$ is the Leaky ReLU [25] with a negative slope of 0.01 and $\oplus$ is the broadcast addition. We divide the kernels $K$ and the features $X_d$ into N groups, viewed as $K_n$ and $X_{d,n}$, respectively. Then a spatial fusion for each group (with the fusion kernel size shown in Fig. 2 for our GLF module) is implemented as:

$\tilde{X}_{n,c}(x, y) = \sum_{i}\sum_{j} K_{n}(x, y, i, j)\, X_{d,n,c}(x + i,\, y + j) \quad (6)$

where $n$ and $c$ are the indices of groups and of channels within each group, respectively, and $i$ and $j$ are offsets over the fusion kernel at each position. Finally, $\tilde{X}$ is reshaped back to the channel layout of $X_d$, and we use a 1×1 convolution to fuse information between groups and channels as:

$X_{out} = \mathrm{BN}\big(\mathrm{Conv}_{1\times1}(\tilde{X})\big) \quad (7)$

where $X_{out}$ is the final output of the GLF module and $\mathrm{BN}$ is batch normalization [14].

Since the high-level feature will be fused with 3 low-level features in sequence, selecting a proper global context feature is essential for our GLF module. It is trivial to obtain a high-level global context by applying global pooling to $X_h$, which is denoted as GLF(B) in Table 3. However, after $X_h$ is fused with $X_l$, its high-level semantic information decreases while local detail increases, which makes it improper to provide global context information for the next GLF module. Since our TMP module mines strong semantic information for the objects pointed out by the trimap, we apply a global average pooling to its output to generate the global feature $g$ for the GLF module, denoted as GLF in Table 2 and Table 3. We also compare with the global feature taken from the input of our TMP module, namely the output of the C5 stage of ResNet-50, which is denoted as GLF(C) in Table 3.
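Some design details of Eqs. 4-7 are not fully specified in the text above (for example, how the global feature is matched to the channel width of $X_d$, and the fusion kernel size), so the following PyTorch sketch is only one possible realization under those assumptions; class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLF(nn.Module):
    """Sketch of global-local context-aware fusion (one possible realization of Eqs. 4-7)."""
    def __init__(self, low_ch, high_ch, global_ch, mid_ch, out_ch, group_ch=16, k=3):
        super().__init__()
        assert mid_ch % group_ch == 0 and high_ch % 4 == 0
        self.groups, self.group_ch, self.k = mid_ch // group_ch, group_ch, k
        self.distribute = nn.Conv2d(low_ch + high_ch // 4, mid_ch, 1)  # Eq. 4
        self.proj_global = nn.Conv2d(global_ch, mid_ch, 1)             # match g to X_d channels (assumption)
        self.pred_kernel = nn.Conv2d(mid_ch, self.groups * k * k, 1)   # Eq. 5
        self.fuse = nn.Sequential(nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                                  nn.BatchNorm2d(out_ch))              # Eq. 7

    def forward(self, x_low, x_high, g):
        # Align spatial sizes with pixel shuffle, concatenate, and distribute into groups (Eq. 4).
        x = self.distribute(torch.cat([x_low, F.pixel_shuffle(x_high, 2)], dim=1))
        b, c, h, w = x.shape
        # Predict one k x k kernel per group and position from local + broadcast global context (Eq. 5).
        kernels = self.pred_kernel(F.leaky_relu(x + self.proj_global(g), 0.01))
        kernels = kernels.view(b, self.groups, 1, self.k * self.k, h, w)
        # Group-wise spatial fusion: weighted sum over each k x k neighborhood (Eq. 6).
        patches = F.unfold(x, self.k, padding=self.k // 2)
        patches = patches.view(b, self.groups, self.group_ch, self.k * self.k, h, w)
        fused = (kernels * patches).sum(dim=3).view(b, c, h, w)
        # Fuse information across groups and channels (Eq. 7).
        return self.fuse(fused)
```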

5.3 Framework of TMFNet

We replace the "ppm" and "static fusion" modules in the baseline network with our TMP module and GLF modules, respectively, to form our proposed network. As shown in Fig. 2, the fusion stages from left to right are called F1, F2, and F3. The arrows pointing to the GLF modules from the left, top, and bottom denote their respective inputs. For the GLF modules, we set the internal channel numbers to 256, 256, and 32 for stages F1, F2, and F3, respectively, and we use 16 channels per group in all fusion stages. The output channel numbers of the GLF modules are the same as in the baseline. Since the spatial size of the high-level feature input should be half of the low-level feature's in the GLF module, we place a bilinear upsampling layer with a ratio of 2 before the F1 stage. In this way, our proposed network uses 0.9M fewer parameters than our baseline thanks to our lightweight GLF modules.
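Putting the pieces together, the sketch below wires the TMP and GLF sketches above into a decoder following this description; the ResNet-50 feature widths (C1=64, C2=256, C5=2048) are standard, while the GLF output widths and the prediction head are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class TMFDecoder(nn.Module):
    """Illustrative decoder wiring that reuses the TMP and GLF sketches above."""
    def __init__(self):
        super().__init__()
        self.tmp = TMP(in_ch=2048, out_ch=512)
        # Internal channels 256/256/32 and 16 channels per group, as in Sec. 5.3.
        self.f1 = GLF(low_ch=256, high_ch=512, global_ch=512, mid_ch=256, out_ch=256)
        self.f2 = GLF(low_ch=64,  high_ch=256, global_ch=512, mid_ch=256, out_ch=64)
        self.f3 = GLF(low_ch=6,   high_ch=64,  global_ch=512, mid_ch=32,  out_ch=32)
        self.head = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.LeakyReLU(0.01),
                                  nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, x_in, c1, c2, c5, trimap):
        # x_in: 6-channel image+trimap input (stride 1); c1, c2, c5: encoder features
        # at strides 2, 4, and 16; trimap: single-channel map with 0 for background.
        h = self.tmp(c5, trimap)                      # semantic context at stride 16
        g = F.adaptive_avg_pool2d(h, 1)               # global feature for all GLF stages
        h = F.interpolate(h, scale_factor=2, mode='bilinear',
                          align_corners=False)        # bilinear x2 before F1
        h = self.f1(c2, h, g)                         # F1: fuse with C2 (stride 4)
        h = self.f2(c1, h, g)                         # F2: fuse with C1 (stride 2)
        h = self.f3(x_in, h, g)                       # F3: fuse with the raw input (stride 1)
        return self.head(h)                           # alpha prediction; output activation omitted
```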

Figure 4: The visual comparison results on high-resolution real-world images. From left to right, the original image, trimap, DIM [33], IndexNet [24], GCA [18] and ours.

6 Experiments

6.1 Experiment Settings

Our proposed method is evaluated on Composition-1k [33], Alphamatting [26] and our CIOM datasets with quantitative results.

Alphamatting [26] is an online real-world matting benchmark, which provides 27 training images with alpha mattes and 8 test images, each with 3 trimaps, for evaluation.

Composition-1k [33] provides 431 and 50 pairs of foreground images and alpha mattes for training and testing, respectively. The 1000 test images are generated by compositing each of the 50 test pairs onto 20 background images from VOC [7], and a corresponding trimap is provided for each test image. The backgrounds for training are from the COCO dataset [20].

Our CIOM provides 683 and 50 pairs of foreground images and alpha mattes for training and testing, respectively. The resolution of the test images is up to 4000×3867, which provides quantitative results for high-resolution and high-quality matting.

Evaluation Metrics. We evaluate the quantitative results using the Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Gradient error (Grad), and Connectivity error (Conn) metrics.
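A minimal sketch of the two simpler metrics, computed over the unknown region of the trimap, is shown below; the /1000 and ×1e3 scalings follow the usual reporting convention in the matting literature and are assumptions here, and Grad/Conn are omitted because they involve additional filtering and thresholding steps.

```python
import numpy as np

def sad(pred, gt, unknown):
    # Sum of absolute differences over the unknown region (boolean mask);
    # dividing by 1000 follows the common reporting convention (an assumption here).
    return np.abs(pred - gt)[unknown].sum() / 1000.0

def mse(pred, gt, unknown):
    # Mean squared error over the unknown region, commonly reported multiplied by 1e3.
    return np.square(pred - gt)[unknown].mean() * 1000.0
```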

Implementation Details. The baseline and proposed methods are trained for 200,000 iterations with a total batch size of 32 for the detailed ablation studies in Tables 2, 3, and 5, using 2 Tesla V100 GPUs. To compare with SOTA methods in Tables 1 and 4, the baseline and proposed framework are trained with a total batch size of 64 using 4 Tesla V100 GPUs. We use the Adam optimizer [15] with an initial learning rate of 0.01. The learning rate decay policy follows GCA [18]. For data augmentation, we follow the training protocol proposed in [18, 19], including random composition of two foreground images, random resizing with random interpolation, random affine transformation, and color jitter. Trimaps are generated by dilating and eroding the alpha mattes with random kernel sizes from 1 to 30 during training. Patches centered on an unknown region are cropped and composited with a random background image from COCO [20]. Models trained on the Composition-1k [33] training set or our CIOM training set follow the same settings. For testing, our proposed method runs inference on each image of our CIOM test set or the Composition-1k [33] test set as a whole, without scaling, on a single 32GB Tesla V100 GPU.
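A minimal sketch of this trimap generation step is shown below, assuming OpenCV morphology; the foreground/unknown thresholds and the output encoding (0/128/255) are illustrative.

```python
import cv2
import numpy as np

def random_trimap(alpha, max_kernel=30):
    """Generate a training trimap from an alpha matte by random erosion and dilation.

    alpha: float array (H, W) in [0, 1]. Returns a uint8 trimap with
    0 = background, 128 = unknown, 255 = foreground.
    """
    fg = (alpha > 0.999).astype(np.uint8)     # definite foreground (threshold is illustrative)
    nb = (alpha > 0.001).astype(np.uint8)     # foreground plus semi-transparent pixels
    k_erode = np.random.randint(1, max_kernel + 1)
    k_dilate = np.random.randint(1, max_kernel + 1)
    eroded = cv2.erode(fg, np.ones((k_erode, k_erode), np.uint8))
    dilated = cv2.dilate(nb, np.ones((k_dilate, k_dilate), np.uint8))
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    trimap[eroded == 1] = 255                 # interior stays foreground
    trimap[dilated == 0] = 0                  # far from the object stays background
    return trimap
```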

6.2 Comparison with Prior Work

We compare our method with other SOTA deep trimap-based image matting methods, including LFPNet [21], FBA [9], SIM [29], TIMINet [23], GCA [18], AU [6], ADA [1], IndexNet [24] and DIM [33].

Methods SAD MSE Grad Conn Params
DIM [33] 50.4 14.0 31.0 50.8 >130M
Index [24] 45.8 13.0 25.9 43.7 8.2M
ADA [1] 41.7 10.0 16.9 - -
GCA [18] 35.3 9.1 16.9 32.5 25.3M
AU [6] 32.2 8.2 16.4 29.3 8.1M
TIMI [23] 29.1 6.0 11.5 25.4 -
SIM [29] 28.0 5.8 10.8 24.8 70M
FBA [9] 25.8 5.2 10.6 20.8 34.7M
LFP* [21] 22.4 3.6 7.6 17.1 141M
Baseline 26.4 4.7 9.3 22.4 34.8M
Ours 23.0 4.0 7.5 18.7 33.9M
Ours† 22.1 3.6 6.7 17.6 33.9M
Table 1: Quantitative results on the Composition-1k test set. † denotes results with test-time augmentation. * denotes training or pre-training with extra matting data.
Methods SAD MSE Grad Conn
DIM [33] 39.7 6.7 13.4 35.4
Index [24] 32.5 5.2 11.4 28.0
GCA [18] 32.1 6.8 18.2 25.9
Baseline 27.0 3.0 7.9 20.9
Ours(TMP) 22.4 2.2 5.8 15.2
Ours(TMP+GLF) 20.2 1.8 4.8 13.6
Table 2: Quantitative results on our CIOM test set.
Figure 5: Visualization of our fusion kernel maps. Columns from left to right, initial fusion kernels, kernels predicted by a “LF” module, kernels predicted by a GLF module, respectively.

Composition-1k test set. Quantitative and visual results are reported in Table 1 and Fig. 3. The proposed model achieves 22.1 SAD on the Composition-1k test set, outperforming other SOTA methods without using extra data or annotation. The proposed model also improves significantly over our strong baseline while using fewer parameters. As shown in Fig. 3, our TMFNet focuses on the interesting objects better than the baseline and other SOTA methods [18, 24] under interference from salient background objects. More details are reported in the supplemental materials.

Alphamatting benchmark. Compared with other state-of-the-art methods such as LFPNet [21], SIM [29], ADA [1], GCA [18], and AU [6], our method performs better on both the SAD and MSE metrics, as shown in Table 4. Several visual results are shown in Fig. 6, and our method also has better visual quality on these real-world cases from the Alphamatting [26] benchmark.

Figure 6: The visual comparison results on Alphamatting benchmark. From left to right, the original image, trimap, AU [6], GCA [18], ADA [1], SIM [29] and ours.

High-resolution real-world images. Besides the real-world cases in the Alphamatting [26] benchmark, we also collect several high-resolution real-world images and draw trimaps for them. As shown in Fig. 4, our model predicts details better than the SOTA methods [33, 24, 18] on these high-resolution real-world test cases.

Our CIOM test set provides quantitative results for high-resolution and high-quality images with a resolution up to 4000×3867. As shown in Table 2, our proposed TMP and GLF modules bring 4.6 and 2.2 SAD improvements, respectively. We also compare our method with DIM [33], IndexNet [24], and GCA [18] trained on our CIOM training set. All tests are run on a 32GB Tesla V100 GPU. Since GCA [18] can only evaluate images with a resolution up to 3000 within 32GB of memory, we downsample images larger than 3000 for the testing of GCA. Our method achieves 20.2 SAD, which outperforms the other methods significantly. Visual comparisons of the proposed method, the baseline, and some of these methods are shown in Fig. 7. Our method outperforms the above methods on both quantitative and visual results on this high-resolution matting benchmark.

Methods SAD MSE Params
base loss:
Basic 28.1 5.8 34.8M
Basic+TP 27.1 5.1 34.8M
Basic+MP 27.9 5.4 34.8M
Basic+TMP 26.9 5.1 34.8M
Basic+TMP+LF 26.3 5.1 33.8M
Basic+TMP+GLF 24.9 4.8 33.9M
Basic+TMP+GLF(B) 26.2 5.0 33.9M
Basic+TMP+GLF(C) 26.7 5.0 34.1M
Comparison Methods
base loss:
Basic+ASPP[2] 27.9 5.3 41.2M
Basic+TMP+CARAFE[30] 28.6 5.7 35.6M
+laplacian loss:
Basic 27.4 5.2 34.8M
Basic+TMP 26.0 4.7 34.8M
Basic+TMP+GLF 24.0 4.1 33.9M
Table 3: Ablations and comparison on the Composition-1k test set.
Figure 7: The visual comparison results on our CIOM test set. From left to right, the original image, trimap, IndexNet [24], GCA [18], baseline, ours and ground-truth.
Methods SAD-O SAD-S SAD-L SAD-U MSE-O Grad-O
ADA [1] 12.1 10.9 11.1 14.4 12.8 12.3
AU [6] 12.5 11.4 9.8 16.3 14.6 11.3
GCA[18] 13.7 14.4 11.5 15.3 14.5 12.8
SIM[29] 5.8 6.3 5 6 6.3 6.2
LFP[21] 4.5 3.8 3.5 6.4 4.1 2.8
Ours 3.3 2.3 2.9 4.6 4 3.9
Table 4: Quantitative results of our method and several representative state-of-the-art methods on Alphamatting [26] benchmark. “S”, “L”, “U” denote three trimap sizes and scores denote average rank across 8 test samples. “O” denotes the overall average rank across “S”, “L” and “U”.

6.3 Ablation and Comparison

Our TMP module consists of trimap-guided non-background average pooling and multi-scale pooling kernels, which mine the high-level semantic context of the interesting objects pointed out by the trimap. We report the ablation study in Table 3. "Basic+TP" refers to replacing the adaptive average pooling in "ppm" with non-background adaptive average pooling, which brings a 1.0 SAD improvement by focusing feature mining on the non-background area without extra parameters. "Basic+MP" refers to replacing the adaptive average pooling with our multi-scale pooling with a stride of 1, whose smoother representation improves SAD by 0.2. Finally, "Basic+TMP" refers to replacing the "ppm" in the baseline with our TMP module, which improves SAD by 1.2 and 1.4 under the base loss and with the Laplacian loss, respectively. Besides the "ppm" in the baseline, we also compare our TMP with the ASPP [2] module, namely "Basic+ASPP" in Table 3; our TMP outperforms ASPP with fewer parameters.

The ablation study for our GLF module is also shown in Table 3. "LF" refers to a local-aware fusion module, namely a GLF module without the global context feature $g$. "Basic+TMP+LF" improves SAD by 0.6 over "Basic+TMP" and saves 1M parameters by replacing all "static fusion" modules with our local-aware fusion modules. For selecting the global context feature for our GLF module, we compare GLF, GLF(B), and GLF(C) in Table 3. The results show that GLF, which takes the global context from the output of our TMP module, outperforms the other choices significantly and brings a 1.4 SAD improvement with only 0.1M extra parameters. In total, "Basic+TMP+GLF" improves SAD by 2.0 while using 0.9M fewer parameters. In addition, we compare with an existing local-aware dynamic upsampling method, CARAFE [30], which only uses the high-level feature to predict upsampling kernels and shows a negative effect on the matting task, as shown in Table 3. These analyses show that both global and local features are important for feature fusion in matting, and that our GLF module performs global-local context-aware spatial fusion that improves natural image matting both efficiently and effectively. Ablation studies of the fusion stages are in Appendix A.

6.4 Visualization of Our Fusion Kernels

We visualize our predicted fusion kernel maps of the initial network, a trained local-aware fusion module “LF”, and a trained GLF module in Fig. 5. These kernel maps in the first row and the second row of Fig. 5 are generated from the input cases of crystal and dandelion shown in Fig. 3, respectively. Compared with the initial one, the trained local-aware fusion module “LF” learns the structure of interesting objects, and suppresses the interference of salient background objects to some degree. With the proper selection of a global context feature mined from our TMP module, our global-local context-aware fusion module (GLF) learns the structure of interesting objects better and its predicted fusion kernel maps have clearer boundaries. These visual results of predicted kernel maps show that the feature mining and the proper design of the fusion modules in our TMFNet are good for learning the structure of interesting objects in complex scenes.

7 Conclusion

In this paper, we observe that previous trimap-based matting methods lack an efficient way to integrate trimap guidance into semantic context feature mining for interesting objects, and that they also ignore the importance of a global context feature with high-level semantic information for feature fusion in matting. Based on these observations, we propose a trimap-guided feature mining and fusion network for natural image matting. Our TMP module mines a powerful semantic context representation for the interesting objects pointed out by the trimap without extra parameters, and our GLF modules use the high-level semantic global context from TMP to promote global-local context-aware feature fusion both efficiently and effectively. To advance high-resolution and high-quality matting, we build a large-scale high-resolution dataset for common interesting object matting. Finally, extensive experiments demonstrate that our method outperforms the state-of-the-art methods.

References

  • [1] Shaofan Cai, Xiaoshuai Zhang, Haoqiang Fan, Haibin Huang, Jiangyu Liu, Jiaming Liu, Jiaying Liu, Jue Wang, and Jian Sun. Disentangled image matting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8819–8828, 2019.
  • [2] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation, 2017.
  • [3] Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. Knn matting. IEEE transactions on pattern analysis and machine intelligence, 35(9):2175–2188, 2013.
  • [4] Yung-Yu Chuang, Brian Curless, David H Salesin, and Richard Szeliski. A bayesian approach to digital matting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 2, pages II–II. IEEE, 2001.
  • [5] MMCV Contributors. MMCV: OpenMMLab computer vision foundation. https://github.com/open-mmlab/mmcv, 2018.
  • [6] Yutong Dai, Hao Lu, and Chunhua Shen. Learning affinity-aware upsampling for deep image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6841–6850, 2021.
  • [7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [8] Xiaoxue Feng, Xiaohui Liang, and Zili Zhang. A cluster sampling method for image matting via sparse coding. In European Conference on Computer Vision, pages 204–219. Springer, 2016.
  • [9] Marco Forte and François Pitié. F, B, alpha matting. arXiv preprint arXiv:2003.07711, 2020.
  • [10] Leo Grady, Thomas Schiwietz, Shmuel Aharon, and Rüdiger Westermann. Random walks for interactive alpha-matting. In Proceedings of VIIP, volume 2005, pages 423–429, 2005.
  • [11] Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun. A global sampling method for alpha matting. In CVPR 2011, pages 2049–2056. IEEE, 2011.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] Qiqi Hou and Feng Liu. Context-aware image matting for simultaneous foreground and alpha estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4130–4139, 2019.
  • [14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
  • [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] Anat Levin, Dani Lischinski, and Yair Weiss. A closed-form solution to natural image matting. IEEE transactions on pattern analysis and machine intelligence, 30(2):228–242, 2007.
  • [17] Anat Levin, Alex Rav-Acha, and Dani Lischinski. Spectral matting. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  • [18] Yaoyi Li and Hongtao Lu. Natural image matting via guided contextual attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11450–11457, 2020.
  • [19] Yaoyi Li, Qingyao Xu, and Hongtao Lu. Hierarchical opacity propagation for image matting. arXiv preprint arXiv:2004.03249, 2020.
  • [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [21] Qinglin Liu, Haozhe Xie, Shengping Zhang, Bineng Zhong, and Rongrong Ji. Long-range feature propagating for natural image matting. In Proceedings of the 29th ACM International Conference on Multimedia, pages 526–534, 2021.
  • [22] Yuhao Liu, Jiake Xie, Yu Qiao, Yong Tang, and Xin Yang. Prior-induced information alignment for image matting. IEEE Transactions on Multimedia, 2021.
  • [23] Yuhao Liu, Jiake Xie, Xiao Shi, Yu Qiao, Yujie Huang, Yong Tang, and Xin Yang. Tripartite information mining and integration for image matting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7555–7564, 2021.
  • [24] Hao Lu, Yutong Dai, Chunhua Shen, and Songcen Xu. Indices matter: Learning to index for deep image matting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3266–3275, 2019.
  • [25] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3. Citeseer, 2013.

  • [26] Christoph Rhemann, Carsten Rother, Jue Wang, Margrit Gelautz, Pushmeet Kohli, and Pamela Rott. A perceptually motivated online benchmark for image matting. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1826–1833. IEEE, 2009.
  • [27] Mark A Ruzon and Carlo Tomasi. Alpha estimation in natural images. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), volume 1, pages 18–25. IEEE, 2000.
  • [28] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [29] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Semantic image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11120–11129, 2021.
  • [30] Jiaqi Wang, Kai Chen, Rui Xu, Ziwei Liu, Chen Change Loy, and Dahua Lin. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3007–3016, 2019.
  • [31] Jue Wang and Michael F Cohen. An iterative optimization approach for unified image segmentation and matting. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 936–943. IEEE, 2005.
  • [32] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
  • [33] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2970–2979, 2017.
  • [34] Haichao Yu, Ning Xu, Zilong Huang, Yuqian Zhou, and Humphrey Shi. High-resolution deep image matting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3217–3224, 2021.
  • [35] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [36] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.

Appendix A Ablation for GLF Module in Fusion Stages

As shown in Table 5 for the Composition-1k [33] test set, based on "Basic+TMP", we gradually replace "static fusion" with local-aware fusion modules in the decoder at stages F1, F2, and F3 in Fig. 2, and then the global context is gradually applied to these stages. When local-aware fusion modules are applied to all 3 fusion stages, the SAD improves from 26.9 to 26.3 and about 1M parameters are saved. When the global context is applied to all 3 fusion stages, the SAD improves from 26.3 to 24.9 with only about 0.1M extra parameters. In total, our GLF modules improve the SAD from 26.9 to 24.9 and save about 0.9M parameters.

Fusion SAD MSE Params
base loss:
Basic+TMP 26.9 5.1 34.8M
base loss:
Local feature only:
F1 26.6 5.1 33.9M
F1+F2 26.4 5.1 33.8M
F1+F2+F3 26.3 5.1 33.8M
base loss:
+Global context:
F1 26.2 5.0 33.8M
F1+F2 25.6 5.0 33.9M
F1+F2+F3 24.9 4.8 33.9M
Table 5: Ablation for GLF module in fusion stages.

Appendix B Computation Costs

We compare the computation costs of our TMFNet, the baseline model, GCA [18], and FBA [9] under the same input resolution in Table 6. The proposed TMFNet has lower computation costs than the baseline model and several SOTA methods [18, 9]. GFLOPs are computed following MMCV [5], covering every major operation in each model.
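For reference, complexity can be measured with MMCV's complexity utility (mmcv 1.x API). The snippet below is a sketch with an illustrative input shape and a stand-in module, not the paper's measurement script.

```python
import torch.nn as nn
from mmcv.cnn import get_model_complexity_info  # mmcv 1.x API

def report_complexity(model: nn.Module, input_shape=(6, 512, 512)):
    # input_shape excludes the batch dimension; 6 channels = RGB image + trimap.
    # The 512 x 512 resolution is only illustrative, not the paper's measurement setting.
    flops, params = get_model_complexity_info(model, input_shape,
                                              print_per_layer_stat=False, as_strings=True)
    print(f"FLOPs: {flops}, Params: {params}")

# Example with a stand-in module; the real measurement would use the full TMFNet.
report_complexity(nn.Sequential(nn.Conv2d(6, 32, 3, padding=1), nn.ReLU()))
```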

Methods GFLOPs Parameters
GCA [18] 5385 25.3M
FBA [9] 2741 34.7M
Baseline 1410 34.8M
Ours 1121 33.9M
Table 6: Computation costs of models.