Official repository for Natural Image Matting via Guided Contextual Attention
Over the last few years, deep learning based approaches have achieved outstanding improvements in natural image matting. Many of these methods can generate visually plausible alpha estimations, but typically yield blurry structures or textures in the semitransparent area. This is due to the local ambiguity of transparent objects. One possible solution is to leverage far-surrounding information to estimate the local opacity. Traditional affinity-based methods often suffer from high computational complexity and are not suitable for high-resolution alpha estimation. Inspired by affinity-based methods and the successes of contextual attention in inpainting, we develop a novel end-to-end approach for natural image matting with a guided contextual attention module, which is specifically designed for image matting. The guided contextual attention module directly propagates high-level opacity information globally based on the learned low-level affinity. The proposed method can mimic the information flow of affinity-based methods while simultaneously utilizing the rich features learned by deep neural networks. Experimental results on the Composition-1k testing set and the alphamatting.com benchmark dataset demonstrate that our method outperforms state-of-the-art approaches in natural image matting. Code and models are available at https://github.com/Yaoyi-Li/GCA-Matting.
Natural image matting is one of the important tasks in computer vision. It has a variety of applications in image or video editing, compositing and film post-production [Wang, Cohen, and others2008, Aksoy, Ozan Aydin, and Pollefeys2017, Lutz, Amplianitis, and Smolic2018, Xu et al.2017, Tang et al.2019]. Matting has received significant interest from the research community and has been extensively studied in the past decade. Alpha matting refers to the problem of separating a foreground object from the background and estimating the transitions between them. The result of image matting is a predicted alpha matte which represents the opacity of the foreground at each pixel.
Mathematically, the natural image is defined as a convex combination of the foreground image and the background image at each pixel:

\[
I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1], \tag{1}
\]

where \(\alpha_i\) is the alpha value at pixel \(i\) that denotes the opacity of the foreground object. If \(\alpha_i\) is neither 0 nor 1, the image at pixel \(i\) is mixed. Since the foreground color \(F_i\), background color \(B_i\) and the alpha value \(\alpha_i\) are all unknown, the alpha matting problem is ill-posed. Thus, most of the previous hand-crafted algorithms impose a strong inductive bias on the matting problem.
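The compositing equation can be sketched in a few lines of NumPy (a minimal illustration; the function and variable names are ours, not from the paper's code):

```python
import numpy as np

# Per-pixel convex combination I = alpha * F + (1 - alpha) * B.
def composite(fg, bg, alpha):
    """fg, bg: (H, W, 3) float arrays; alpha: (H, W) values in [0, 1]."""
    a = alpha[..., None]          # broadcast alpha over the color channels
    return a * fg + (1.0 - a) * bg

fg = np.ones((2, 2, 3))           # white foreground
bg = np.zeros((2, 2, 3))          # black background
alpha = np.full((2, 2), 0.25)     # uniformly semi-transparent region
img = composite(fg, bg, alpha)    # every pixel becomes 0.25
```

Matting is the inverse problem: given only `img`, recover `alpha` (and implicitly `fg`, `bg`), which is why additional priors are required.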
One of the basic ideas widely adopted in both affinity-based and sampling-based algorithms is to borrow information from image patches with a similar appearance. Affinity-based methods [Levin, Lischinski, and Weiss2008, Chen, Li, and Tang2013, Aksoy, Ozan Aydin, and Pollefeys2017] borrow opacity information from known patches that look similar to unknown ones. Sampling-based approaches [Wang and Cohen2007, Gastal and Oliveira2010, He et al.2011, Feng, Liang, and Zhang2016] borrow a pair of samples from the foreground and background to estimate the alpha value at each pixel in the unknown region based on some specific assumption. One obstacle for previous affinity-based and sampling-based methods is that they cannot handle trimaps that contain only background and unknown areas. This is because these methods have to make use of both foreground and background information to estimate the alpha matte.
Benefiting from the Adobe Image Matting dataset [Xu et al.2017], more learning-based image matting methods [Xu et al.2017, Lutz, Amplianitis, and Smolic2018, Lu et al.2019, Tang et al.2019] have emerged in recent years. Most learning-based approaches use a network prior as the inductive bias and predict alpha mattes directly. Moreover, SampleNet [Tang et al.2019] proposed to leverage deep inpainting methods to generate foreground and background pixels in the unknown region rather than selecting them from the image, providing a combination of the learning-based and sampling-based approaches.
In this paper, we propose a novel image matting method based on opacity propagation in a neural network. Information propagation has been widely adopted within the neural network framework in recent years, from natural language processing [Vaswani et al.2017, Yang et al.2019] and data mining [Kipf and Welling2016, Veličković et al.2017] to computer vision [Yu et al.2018, Wang et al.2018]. SampleNet Matting [Tang et al.2019] indirectly leveraged the contextual information for foreground and background inpainting. In contrast, our proposed method conducts information flow from the image context to unknown pixels directly. We devise a guided contextual attention module, which mimics affinity-based propagation in a fully convolutional network. In this module, the low-level image features are used as guidance, and alpha feature transmission is performed based on that guidance. We show an example of our guided contextual attention map in Figure 1 and give more details in the results section. In the guided contextual attention module, features from two distinct network branches are leveraged together. Information from both known and unknown patches is transmitted to feature patches in the unknown region with a similar appearance.
Our proposed method can be viewed from two different perspectives. On one hand, the guided contextual attention can be elucidated as an affinity-based method for alpha matte value transmission with a network prior. Unknown patches share high-level alpha features with each other under the guidance of similarity between low-level image features. On the other hand, the proposed approach can also be seen as a guided inpainting task. In this aspect, image matting is treated as an inpainting task on the alpha image under the guidance of the input image. The unknown region is analogous to the holes to be filled in image inpainting. Unlike inpainting methods, which borrow pixels from the background of the same image, image matting borrows alpha values from the known area of the alpha matte, under the guidance of the original RGB image, to fill in the unknown region.
In general, natural image matting methods can be classified into three categories: sampling-based methods, propagation methods and learning-based methods.
Sampling-based methods [Wang and Cohen2007, Gastal and Oliveira2010, He et al.2011, Feng, Liang, and Zhang2016] solve the combination equation (1) by sampling colors from the foreground and background regions for each pixel in the unknown region. The pairs of foreground and background samples are selected under different metrics and assumptions, and then the initial alpha matte value is calculated by the combination equation. Robust Matting [Wang and Cohen2007] selected samples along the boundaries with confidence, and the matting function was optimized by a Random Walk. Shared Matting [Gastal and Oliveira2010] selected the best pairs of samples for a set of neighboring pixels and removed much of the redundant computation. In Global Matting [He et al.2011], all samples available in the image were utilized to estimate the alpha matte; the sampling was achieved by a randomized patch match algorithm. More recently, CSC Matting [Feng, Liang, and Zhang2016] collected a set of more representative samples by sparse coding to avoid missing true sample pairs.
Propagation methods [Levin, Lischinski, and Weiss2008, Chen, Li, and Tang2013, Aksoy, Ozan Aydin, and Pollefeys2017], which are also known as affinity-based methods, estimate alpha mattes by propagating the alpha value from foreground and background to each pixel in the unknown area. The Closed-form Matting [Levin, Lischinski, and Weiss2008]
is one of the most prevalent algorithms among propagation-based methods. It solved the cost function under the constraint of local smoothness. KNN Matting [Chen, Li, and Tang2013] collected matching nonlocal neighborhoods globally by K nearest neighbors. Moreover, Information-flow Matting [Aksoy, Ozan Aydin, and Pollefeys2017] proposed a color-mixture flow which combined local and nonlocal color affinities with spatial smoothness.
Due to the tremendous success of deep convolutional neural networks, learning-based methods have achieved a dominant position in recent natural image matting [Cho, Tai, and Kweon2016, Xu et al.2017, Lutz, Amplianitis, and Smolic2018, Lu et al.2019, Tang et al.2019]. DCNN Matting [Cho, Tai, and Kweon2016] was the first method to introduce a deep neural network into the image matting task; it used the network to learn a combination of results from different previous methods. Deep Matting [Xu et al.2017] proposed a fully neural network model together with a large-scale dataset for learning-based matting, one of the most significant works in deep image matting. Following Deep Matting, AlphaGan [Lutz, Amplianitis, and Smolic2018] explored deep image matting within a generative adversarial framework. Subsequent works such as SampleNet Matting [Tang et al.2019] and IndexNet [Lu et al.2019], with different architectures, also yielded appealing alpha matte estimations.
Our proposed model uses the guided contextual attention module and a customized U-Net [Ronneberger, Fischer, and Brox2015] architecture to perform deep natural image matting. We first construct our customized U-Net baseline for matting, then introduce the proposed guided contextual attention (GCA) module.
The U-Net-like architecture [Ronneberger, Fischer, and Brox2015] is prevalent in recent matting work [Lutz, Amplianitis, and Smolic2018, Tang et al.2019, Lu et al.2019] as well as in image segmentation [Long, Shelhamer, and Darrell2015, Isola et al.2017] and image inpainting [Liu et al.2018]. Our baseline model shares almost the same network architecture as the guided contextual attention framework in Figure 2. The only difference is that the baseline model replaces the GCA blocks with identity layers and has no image feature block. The input to this baseline network is a cropped image patch and a 3-channel one-hot trimap, concatenated as a 6-channel input. The output is the corresponding estimated alpha matte. The baseline structure is built as an encoder-decoder network with stacked residual blocks [He et al.2016].
Since low-level features play a crucial role in retaining the detailed texture information in alpha mattes, the decoder in our customized baseline model combines encoder features just before upsampling blocks instead of after each upsampling block. This design avoids extra convolutions on the encoder features, which are supposed to provide lower-level features. We also use a two-layer shortcut block to align the channels of encoder features for feature fusion. Moreover, in contrast to the typical U-Net structure, which only combines different middle-level features, we directly forward the original input to the last convolutional layer through a shortcut block. These features do not share any computation with the stem; hence, this shortcut branch focuses only on detailed textures and gradients.
In addition to the widely used batch normalization [Ioffe and Szegedy2015], we introduce spectral normalization [Miyato et al.2018] to each convolutional layer to constrain the Lipschitz constant of the network and stabilize training, a practice prevalent in image generation tasks [Brock, Donahue, and Simonyan2019, Zhang et al.2019].
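Spectral normalization divides each weight matrix by an estimate of its largest singular value, typically obtained by power iteration, so that the layer's Lipschitz constant is bounded by 1. A minimal NumPy sketch of that idea (an illustration of the technique only, not the paper's implementation, which would normally use a framework's built-in utility):

```python
import numpy as np

# Estimate the largest singular value of w by power iteration, then
# divide w by it; the normalized matrix has spectral norm ~1.
def spectral_normalize(w, n_iter=50):
    u = np.random.default_rng(0).normal(size=w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v                 # estimated largest singular value
    return w / sigma

w = np.diag([3.0, 1.0])               # spectral norm 3
w_sn = spectral_normalize(w)          # spectral norm ~1 after normalization
```

In practice the singular vectors are cached across training steps so a single power iteration per step suffices.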
Our network leverages only one loss, the alpha prediction loss, defined as the absolute difference between the predicted and ground-truth alpha mattes averaged over the unknown area:

\[
\mathcal{L}_{\alpha} = \frac{1}{|\mathcal{U}|} \sum_{i \in \mathcal{U}} \left| \alpha_i^{p} - \alpha_i^{g} \right|,
\]

where \(\mathcal{U}\) indicates the region labeled as unknown in the trimap, and \(\alpha_i^{p}\) and \(\alpha_i^{g}\) denote the predicted and ground-truth alpha matte values at position \(i\).
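The alpha prediction loss can be sketched as follows, assuming a single-channel trimap in which the value 128 marks the unknown region (a common convention; the paper itself feeds a 3-channel one-hot trimap, and the helper name is ours):

```python
import numpy as np

# Mean absolute difference between predicted and ground-truth alpha,
# averaged over the unknown region only.
def alpha_prediction_loss(alpha_pred, alpha_gt, trimap):
    unknown = trimap == 128           # assumed single-channel convention
    diff = np.abs(alpha_pred - alpha_gt)
    return diff[unknown].mean()
```

Pixels in the known foreground/background regions contribute nothing, which focuses the supervision on the transition band where matting is actually hard.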
Several losses have been proposed in prior work on deep image matting, such as the compositional loss [Xu et al.2017], gradient loss [Tang et al.2019] and Gabor loss [Li et al.2019]. The compositional loss used in Deep Matting [Xu et al.2017] is the absolute difference between the original input image and the image composited from the ground-truth foreground, background and predicted alpha matte. The gradient loss calculates the averaged absolute difference between the gradient magnitudes of the predicted and ground-truth alpha mattes in the unknown region. The Gabor loss proposed in [Li et al.2019] substitutes the gradient operator with a bundle of Gabor filters, aiming at more comprehensive supervision of textures and gradients than the gradient loss.
We delve into these losses to reveal whether involving different losses can benefit alpha matte estimation in our baseline model, and provide an ablation study on the Composition-1k testing set [Xu et al.2017] in Table 1. As Table 1 shows, the compositional loss does not bring any notable difference in MSE or Gradient error, and both errors increase when we combine the gradient loss with the alpha prediction loss. Although adopting the Gabor loss reduces the Gradient error to some degree, it also slightly increases the MSE. Consequently, we only opt for the alpha prediction loss in our model.
Ablation study on data augmentation and different loss functions with baseline structure. The quantitative results are tested on Composition-1k testing set. Aug: data augmentation; Rec: alpha prediction loss; Comp: compositional loss; GradL: gradient loss; Gabor: Gabor loss.
Since the most widely used image matting dataset, proposed by [Xu et al.2017], contains only 431 foreground objects for training, we treat data augmentation as a necessity for our baseline model and introduce a sequence of augmentations.
Firstly, following the data augmentation in [Tang et al.2019], we randomly select two foreground object images with a probability of 0.5 and combine them to obtain a new foreground object as well as a new alpha image. Subsequently, the foreground object and alpha image are resized with a probability of 0.25, so that the network can see nearly the whole foreground image instead of a cropped snippet. Then, a random affine transformation is applied to the foreground image and the corresponding alpha image; we include random rotation, scaling, shearing as well as vertical and horizontal flipping in this transformation. Afterwards, trimaps are generated by a dilation and an erosion on the alpha images with a random number of pixels ranging from 5 to 29. With the trimap obtained, we randomly crop one patch from each foreground image and the corresponding alpha and trimap, with all cropped patches centered on an unknown region. The foreground images are then converted to HSV space, and different jitters are imposed on hue, saturation and value. Finally, we randomly select one background image from the MS COCO dataset [Lin et al.2014] for each foreground patch and composite them to obtain the input image.
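The trimap-generation step of this pipeline can be sketched as below; the helper names, the 4-neighborhood morphology, and the 0/128/255 label convention are our assumptions for illustration:

```python
import numpy as np

# Generate a trimap from an alpha matte by randomly eroding the confident
# foreground and dilating the uncertain band (one augmentation step).
def make_trimap(alpha, rng, low=5, high=29):
    k = int(rng.integers(low, high + 1))     # random radius in [low, high]
    fg = alpha >= 1.0                        # fully opaque pixels
    unknown = alpha > 0.0                    # any partial opacity
    fg_eroded = binary_shrink(fg, k)
    unknown_dilated = binary_grow(unknown, k)
    trimap = np.zeros_like(alpha, dtype=np.uint8)
    trimap[unknown_dilated] = 128            # unknown band
    trimap[fg_eroded] = 255                  # confident foreground
    return trimap

def binary_grow(mask, k):
    """k steps of 4-neighborhood dilation via shifted ORs."""
    out = mask.copy()
    for _ in range(k):
        m = out.copy()
        m[1:, :] |= out[:-1, :]; m[:-1, :] |= out[1:, :]
        m[:, 1:] |= out[:, :-1]; m[:, :-1] |= out[:, 1:]
        out = m
    return out

def binary_shrink(mask, k):
    """Erosion as dilation of the complement."""
    return ~binary_grow(~mask, k)
```

A real pipeline would use an image-processing library's morphology routines; the point here is only that the unknown band's width is randomized so the network sees trimaps of varying tightness.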
To demonstrate the effectiveness of data augmentation, we conduct an experiment with minimal data augmentation, in which only two necessary operations, image cropping and trimap dilation, are retained. Additional augmentations such as random image resizing and flipping, widely used in most previous deep image matting methods [Xu et al.2017, Lutz, Amplianitis, and Smolic2018, Tang et al.2019, Lu et al.2019], are not included. We treat this setting as no data augmentation. The experimental results are also listed in Table 1: even without additional augmentation, our baseline model already achieves performance comparable to Deep Matting.
The guided contextual attention module contains two kinds of components: an image feature extractor block for low-level image features, and one or more guided contextual attention blocks for information propagation.
Most affinity-based approaches share a basic inductive bias: local regions with almost identical appearance should have similar opacity. This inductive bias allows the alpha value to propagate from the known region of a trimap to the unknown region based on an affinity graph, which often yields impressive alpha matte predictions.
Motivated by this, we define two different feature flows in our framework (Figure 2
): alpha feature flow (blue arrows) and image feature flow (yellow arrows). Alpha features are generated from the 6-channel input, which is a concatenation of the original image and the trimap; the final alpha matte can be predicted directly from the alpha features. In contrast to the high-level alpha features, the low-level image features are generated only from the input image by a sequence of three stride-2 convolutional layers, and are analogous to the local color statistics in conventional affinity-based methods.
In other words, the alpha feature contains opacity information and low-level image feature contains appearance information. Given both opacity and appearance information, we can build an affinity graph and carry out opacity propagation as affinity-based methods. Specifically, we utilize the low-level image feature to guide the information flow on alpha features.
Inspired by the contextual attention for image inpainting proposed in [Yu et al.2018], we introduce our guided contextual attention block.
As shown in Figure 3, the guided contextual attention leverages both the image features and the alpha features. Firstly, the image features are divided into a known part and an unknown part, and patches are extracted from the whole image feature map. Each feature patch represents the appearance information at a specific position. We reshape the patches as convolutional kernels. To measure the correlation between an unknown-region patch \(I_i\) centered at position \(i\) and an image feature patch \(I_j\) centered at position \(j\), the similarity is defined as the normalized inner product:

\[
s_{i,j} =
\begin{cases}
\left\langle \dfrac{I_i}{\|I_i\|}, \dfrac{I_j}{\|I_j\|} \right\rangle, & i \neq j, \\[2mm]
-\lambda, & i = j,
\end{cases}
\]

where \(I_j\) is also an element of the image feature patch set \(\mathcal{P}\), i.e. \(I_i \in \mathcal{P}\). The constant \(\lambda\) is a punishment hyperparameter that avoids a large correlation between each unknown patch and itself. In implementation, this similarity is computed by a convolution between the unknown-region features and kernels reshaped from the image feature patches. Given the correlation, we carry out a scaled softmax along the \(j\) dimension to attain the guided attention score for each patch:

\[
a_{i,j} = \frac{\exp\!\big( w(I_j)\, s_{i,j} \big)}{\sum_{k} \exp\!\big( w(I_k)\, s_{i,k} \big)},
\]
in which \(w(\cdot)\) is a weight function and \(\mathcal{K}\) is the set of image feature patches from the known region. As distinct from the image inpainting task, the area of the unknown region in a trimap is not under control: in many input trimaps there is an overwhelming unknown region and scarcely any known pixels. Thus, it is typically not feasible to propagate opacity information only from the known region to the unknown part. In our guided contextual attention, we let the unknown part borrow features from both known and unknown patches. Different weights are assigned to known and unknown patches based on the area of each region, as specified by the weight function \(w(\cdot)\). If the area of the known region is larger, the known patches convey more accurate appearance information that exposes the difference between foreground and background, so we assign the known patches a larger weight. Whereas, if the unknown region has an overwhelming area, the known patches only provide some local appearance information, which may harm the opacity propagation; then a small weight is assigned to known patches.
Once we obtain the guided attention scores from the image features, we perform propagation on the alpha features based on the affinity graph defined by the guided attention. Analogously to the image features, patches are extracted from the alpha features and reshaped as filter kernels. The information propagation is implemented as a deconvolution between the guided attention scores and the reshaped alpha feature patches. This deconvolution yields a reconstruction of the alpha features in the unknown area, with the values of overlapping pixels averaged. Finally, we combine the input alpha features and the propagation result by an element-wise summation, which works as a residual connection and stabilizes training.
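The two steps above can be sketched with a toy NumPy model, simplified to 1x1 patches so that the convolution and deconvolution reduce to matrix products (the region weights, scale factor and function names are illustrative assumptions, and the self-correlation punishment is omitted for brevity):

```python
import numpy as np

# Toy guided contextual attention: similarity comes from low-level image
# features, propagation acts on high-level alpha features, and the result
# is added back residually. Positions are flattened to rows of (N, C).
def gca(image_feat, alpha_feat, unknown, w_known=0.5, w_unknown=0.5,
        scale=10.0):
    """image_feat, alpha_feat: (N, C) per-position features;
    unknown: (N,) bool mask of unknown positions."""
    g = image_feat / (np.linalg.norm(image_feat, axis=1, keepdims=True) + 1e-8)
    sim = g[unknown] @ g.T                    # normalized inner products
    w = np.where(unknown, w_unknown, w_known) # region-dependent weights
    logits = scale * sim * w                  # scaled, weighted scores
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)   # softmax along j
    out = alpha_feat.copy()
    out[unknown] += attn @ alpha_feat         # propagation + residual add
    return out
```

Note that only the rows at unknown positions change; known positions pass through untouched, mirroring the observation that there is no information flow in the known region.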
Most affinity-based matting methods yield a closed-form solution based on the graph Laplacian [Levin, Lischinski, and Weiss2008, Lee and Wu2011, Chen, Li, and Tang2013]. The closed-form solution can be seen as a fixed point of the propagation, i.e., the limit of infinitely many propagation iterations [Zhou et al.2004]. Motivated by this, we attach two guided contextual attention blocks to the encoder and decoder symmetrically in our stem, so that the model propagates more than once and takes full advantage of the opacity information flow.
When we compute the guided contextual attention on higher-resolution features, more detailed appearance information is attended to. On the other hand, the computational complexity of the attention block is \(\mathcal{O}(c\,h^2 w^2)\), where \(c\), \(h\) and \(w\) are the number of channels, height and width of the feature map respectively. Therefore, we append the two guided contextual attention blocks to a stage with lower-resolution feature maps.
The network of our GCA Matting is trained on the Adobe Image Matting dataset [Xu et al.2017] with a total batch size of 40. We perform optimization using the Adam optimizer [Kingma and Ba2014], and apply warmup and cosine decay [Loshchilov and Hutter2016, Goyal et al.2017, He et al.2019] to the learning rate.
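A warmup-plus-cosine-decay schedule can be sketched as below; the base rate, warmup length and total step count are placeholders, not the paper's settings:

```python
import math

# Linear warmup to base_lr, then cosine decay to zero over the
# remaining steps.
def lr_at(step, base_lr=1e-3, warmup=1000, total=100000):
    if step < warmup:
        return base_lr * step / warmup              # linear warmup
    t = (step - warmup) / (total - warmup)          # progress in [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

Warmup avoids large, noisy updates while batch statistics settle, and the cosine tail anneals the rate smoothly instead of in abrupt steps.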
In this section we report the evaluation results of our proposed model on two datasets, the Composition-1k testing set and the alphamatting.com dataset; both quantitative and qualitative results are shown. We evaluate the quantitative results under the Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Gradient error (Grad) and Connectivity error (Conn) metrics proposed by [Rhemann et al.2009].
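Two of the four metrics can be sketched directly (SAD is conventionally reported in units of thousands, and both are restricted to the unknown region; Grad and Conn require extra filtering and connectivity analysis, so they are omitted here):

```python
import numpy as np

# Sum of Absolute Differences over the unknown region, in thousands.
def sad(pred, gt, unknown):
    return np.abs(pred - gt)[unknown].sum() / 1000.0

# Mean Squared Error over the unknown region.
def mse(pred, gt, unknown):
    d = (pred - gt)[unknown]
    return (d ** 2).mean()
```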
The Composition-1k testing dataset proposed in [Xu et al.2017] contains 1000 testing images which are composed from 50 foreground objects and 1000 different background images from Pascal VOC dataset [Everingham et al.2015].
We compare our approach and the baseline model with three state-of-the-art deep image matting methods, Deep Matting [Xu et al.2017], IndexNet Matting [Lu et al.2019] and SampleNet Matting [Tang et al.2019], as well as three conventional hand-crafted algorithms: Learning Based Matting [Zheng and Kambhamettu2009], Closed-Form Matting [Levin, Lischinski, and Weiss2008] and KNN Matting [Chen, Li, and Tang2013]. The quantitative results are shown in Table 2. Our method outperforms all of the state-of-the-art approaches, and our baseline model also achieves better results than some of the top-performing methods. The effectiveness of the proposed guided contextual attention is validated by the results in Table 2.
Some qualitative results are given in Figure 4. The results of Deep Matting and IndexNet Matting are generated with the source code and pretrained models provided in [Lu et al.2019]. As displayed in Figure 4, our approach achieves better performance on different foreground objects, especially in semitransparent regions, and the advantages become more obvious with a larger unknown region. This performance profits from the information flow between feature patches with similar appearance.
Additionally, our proposed method can evaluate each image in the Composition-1k testing dataset as a whole on a single Nvidia GTX 1080 with 8GB memory. Since we feed each image into our network as a whole without scaling, the guided contextual attention blocks are applied to feature maps with a much higher resolution than in the training phase, which results in better performance on detailed textures.
The alphamatting.com benchmark dataset [Rhemann et al.2009] has eight different images. For each testing image, there are three corresponding trimaps, namely "small", "large" and "user". The methods on the benchmark are ranked by the averaged rank over 24 alpha matte estimations in terms of four different metrics. We evaluate our method on the alphamatting.com benchmark and show the scores in Table 3, alongside some of the top approaches on the benchmark for comparison.
As displayed in Table 3, GCA Matting ranks first under the Gradient Error metric on the benchmark. The evaluation results of our method under the "large" and "user" trimaps are much better than those of the other top approaches. Since image matting becomes more difficult as the trimap's unknown region grows, this indicates that our approach is more robust to changes in the area of the unknown region. Additionally, our approach has almost the same overall rank as SampleNet under the MSE metric. Generally, the proposed GCA Matting is one of the top-performing methods on this benchmark dataset.
We provide some visual examples in Figure 5, where the results of our method and some top algorithms on "Elephant" and "Plastic bag" are displayed to demonstrate the good performance of our approach. For example, in the test image "Plastic bag", most of the previous methods make mistakes at the iron wire; our method, however, learns from the contextual information in the surrounding background patches and predicts these pixels correctly.
We visualize the attention map learned in the guided contextual attention block by marking the pixel position with the largest attention score. Unlike the offset maps widely used in optical flow estimation [Dosovitskiy et al.2015, Hui, Tang, and Loy2018, Sun et al.2018] and image inpainting [Yu et al.2018], which indicate the relative displacement of each pixel, our attention map shows the absolute position of the corresponding pixel with the highest attention activation. From this attention map, we can easily identify where the opacity information is propagated from for each feature pixel. As we can see in Figure 1, there is no information flow in the known region, and feature patches in the unknown region tend to borrow information from patches with similar appearance. Figure 1 reveals where our GCA blocks physically attend in the input image. Since there is an adaptation convolutional layer before the patch extraction on image features in the guided contextual attention block, the attention maps from the two attention blocks are not identical. The weights of the known and unknown parts are shown in the top-left corner of each attention map.
From the attention map in Figure 1, we can easily recognize the car in the sieve. The light pink patches at the center of the sieve indicate that these features are propagated from the left part of the car, while the blue patches show features borrowed from the road on the right-hand side. These propagated features assist the identification of foreground and background in the ensuing convolutional layers.
In this paper, we propose to solve the image matting problem by opacity information propagation in an end-to-end neural network. To this end, a guided contextual attention module is introduced to imitate affinity-based propagation in a fully convolutional manner. In the proposed attention module, opacity information is transmitted between alpha features under the guidance of appearance information. The evaluation results on both the Composition-1k testing dataset and the alphamatting.com dataset show the superiority of our proposed method.
This work is supported by NSFC (No. 61772330, 61533012, 61876109), the advanced research project (No. 61403120201), the Shanghai authentication key Lab. (2017XCWZK01), and the interdisciplinary Program of Shanghai Jiao Tong University (YG2015MS43). We would also like to thank Versa for their help and support.
Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In CVPR.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.