Learning Affinity-Aware Upsampling for Deep Image Matting

11/29/2020 ∙ by Yutong Dai, et al. ∙ 0

We show that learning affinity in upsampling provides an effective and efficient approach to exploit pairwise interactions in deep networks. Second-order features are commonly used in dense prediction to build adjacent relations with a learnable module after upsampling, such as non-local blocks. Since upsampling is essential, learning affinity in upsampling can avoid additional propagation layers, offering the potential for building compact models. By looking at existing upsampling operators from a unified mathematical perspective, we generalize them into a second-order form and introduce Affinity-Aware Upsampling (A2U), where upsampling kernels are generated using a light-weight low-rank bilinear model and are conditioned on second-order features. Our upsampling operator can also be extended to downsampling. We discuss alternative implementations of A2U and verify their effectiveness on two detail-sensitive tasks: image reconstruction on a toy dataset, and a large-scale image matting task where affinity-based ideas constitute mainstream matting approaches. In particular, results on the Composition-1k matting dataset show that A2U achieves a 14% relative improvement in the SAD metric against a strong baseline with a negligible increase of parameters (<0.5%). Compared with the state-of-the-art matting network, we achieve 8% higher performance with only 40% model complexity.




1 Introduction

The similarity among positions, a.k.a. affinity, is commonly investigated in dense prediction tasks [19, 4, 8, 32, 17]. Compared with directly fitting ground truths using first-order features, modeling similarity among different positions can provide second-order information. There currently exist two solutions to learn affinity in deep networks: i) learning an affinity map before a non-deep backend and ii) defining a learnable affinity-based module to propagate information. We are interested in end-to-end affinity learning, because classic methods often build upon assumptions that generalize poorly when they are violated. Existing approaches typically propagate or model affinity after upsampling layers or before the last prediction layer. While affinity properties are modeled, they sometimes may not be effective for the downstream tasks. For instance, the work in [17] requires a feature encoding block besides the encoder-decoder architecture to learn affinity. The work in [4] needs more iterations to refine the feature maps according to their affinity at the last stage. As shown in Fig. 1, one plausible reason is that pairwise similarity is damaged during upsampling. In addition, it is inefficient to construct interactions between high-dimensional feature maps. We therefore pose the question: Can we model affinity earlier, in upsampling, in an effective and efficient manner?

Figure 1: Visualization of upsampled feature maps with various upsampling operators. From left to right: the input RGB image, and feature maps after the last upsampling using nearest neighbor interpolation, bilinear upsampling, and our proposed affinity-aware upsampling, respectively. Our method produces better details with clear connectivity.

Many widely used upsampling operators interpolate values following a fixed rule at different positions. For instance, although reference positions may change in bilinear upsampling, it always interpolates values based on relative spatial distances. Recently, the idea of learning to upsample has emerged [21, 22, 31]. A learnable module is often built to generate upsampling kernels conditioned on feature maps to enable dynamic, feature-dependent upsampling behaviors. Two representative operators are CARAFE [31] and IndexNet [22]. In our experiments, we find that CARAFE may not work well in low-level vision tasks where details need to be restored. IndexNet instead can recover details much better. We believe that one important reason is that IndexNet encodes, stores, and delivers spatial information prior to downsampling. However, its computation can be costly when the network goes deep. This motivates us to pursue not only flexible but also light-weight designs of the upsampling operator.

In this paper, we propose to incorporate affinity into upsampling and introduce a novel learnable upsampling operator, i.e., affinity-aware upsampling (A2U, abbreviated AU in the rest of the text). As we show later in Section 3, AU is a generalization of first-order upsampling operators: under some conditions, the first-order formulations in [31] and [21] can be viewed as special cases of our second-order one. In addition, by implementing AU in a low-rank bilinear formulation, we can achieve efficient upsampling with few extra parameters.

We demonstrate the effectiveness of AU on two detail-sensitive tasks: an image reconstruction task on a toy dataset with controllable background and a large-scale image matting task with subtle foregrounds. Image matting is a suitable task to justify the usefulness of affinity, because affinity-based matting approaches constitute one of the prominent matting paradigms in the literature. Top matting performance thus can suggest appropriate affinity modeling. We further discuss alternative design choices of AU and compare their similarities and differences. Compared with a strong image matting baseline on the Composition-1k matting dataset, AU exhibits a significant improvement (14%) with a negligible increase of parameters (<0.5%), offering a light-weight image matting architecture with state-of-the-art performance.

2 Related work

Upsampling Operators in Deep Networks. Upsampling is often necessary in dense prediction to recover spatial resolution. The most commonly used upsampling operators are bilinear interpolation and nearest neighbor interpolation. Since they are executed only based on spatial distances, they may be sub-optimal in detail-oriented tasks such as image matting, where distance-based similarity can be violated. Compared with distance-based upsampling, max-unpooling is feature-dependent and has been shown to benefit detail-oriented tasks [21, 22], but it must be paired with max-pooling. In the recent literature, learning-based upsampling operators [29, 20, 31, 22] have emerged. Pixel Shuffle (P.S.) [29] upsamples feature maps by reshaping. Deconvolution (Deconv) [20], an inverse version of convolution, learns the upsampling kernel via back-propagation. Both P.S. and Deconv are data-independent during inference, because the kernel is fixed once learned. By contrast, CARAFE [31] and IndexNet [21] learn the upsampling kernel dynamically conditioned on the data. They both introduce additional modules to learn upsampling kernels. Since the upsampling kernel is directly related to the feature maps, these upsampling operators are considered first-order.

Following the learning-based upsampling paradigm, we also intend to learn dynamic upsampling operators but to condition on second-order features to enable affinity-informed upsampling. We show that, compared with first-order upsampling, affinity-informed upsampling not only achieves better performance but also introduces a light-weight learning paradigm.

Deep Image Matting. Affinity dominates the majority of classic image matting approaches [16, 3, 6, 9]. The main assumption in propagation-based matting is that similar alpha values can be propagated from known positions to unknown positions, conditioned on affinity. This assumption, however, depends heavily on the color distribution. Such methods can perform well in cases with clear color contrast but more often fail in cases where the color distribution assumption is violated. Recently, deep learning has been found effective for addressing ill-posed image matting, and many deep matting methods have arisen [5, 32, 34, 30, 11, 21, 17, 2]. This field has progressed from a semi-deep stage [5, 32] to a fully-deep stage [34, 11, 21, 17, 2]. Here ‘semi-deep’ means that the matting part still relies on classic methods [16, 3] to function, while ‘fully-deep’ means that the entire network does not resort to any classic algorithms. Among fully-deep matting methods, DeepMatting [34] first applied the encoder-decoder architecture and reported improved results. Targeting this strong baseline, several deep matting methods were proposed. AlphaGAN matting [23] and IndexNet matting [21] explored adversarial learning and an index generating module to improve matting performance, respectively. In particular, the works in [11, 17, 2, 30] brought classic sampling-based and propagation-based ideas into deep networks to ease the difficulty of learning. Among them, GCA matting [17] first designed an affinity-based module and demonstrated the effectiveness of affinity in fully-deep matting. It treats alpha propagation as an independent module and adds it to different layers to refine the feature map, layer by layer.

Different from the idea of ‘generating then refining’, we propose to directly incorporate the propagation-based idea into upsampling for deep image matting. It not only benefits alpha propagation but also shows the potential for light-weight module design.

3 A Mathematical View of Upsampling

The work in [22] unifies upsampling from an indexing perspective. Here we provide an alternative mathematical view. To simplify exposition, we discuss the upsampling of a one-channel feature map. Without loss of generality, the one-channel case can be easily extended to multi-channel upsampling, because most upsampling operators execute per-channel upsampling. Given a one-channel local feature map $\mathbf{Z} \in \mathbb{R}^{k \times k}$ used to generate an upsampled feature point, it can be vectorized to $\mathbf{z} \in \mathbb{R}^{k^2 \times 1}$. Similarly, the vectorization of an upsampling kernel $\mathbf{W} \in \mathbb{R}^{k \times k}$ can be denoted by $\mathbf{w} \in \mathbb{R}^{k^2 \times 1}$. If $g(\mathbf{w}, \mathbf{z})$ defines the output of upsampling, most existing upsampling operations follow

$$g(\mathbf{w}, \mathbf{z}) = \mathbf{w}^T \mathbf{z}. \tag{1}$$

Note that $g(\mathbf{w}, \mathbf{z})$ indicates an upsampled point. In practice, multiple such points can be generated to form an upsampled feature map. $\mathbf{w}$ may be either shared or unshared among channels depending on the upsampling operator. Different operators define different $\mathbf{w}$'s. Further, even the same $\mathbf{w}$ can be applied to different $\mathbf{z}$'s. According to how the upsampling kernel is generated, we categorize the kernel into two types: the universal kernel and the customized kernel. The universal kernel is input-independent. It follows the same upsampling rule given any input. One example is deconvolution [20]. The customized kernel, however, is input-dependent. Based on what input is used to generate the kernel, the customized kernel can be further divided into distance-based and feature-based. We elaborate as follows.

Distance-based Upsampling. Distance-based upsampling is implemented according to spatial distances, such as nearest neighbor and bilinear interpolation. The difference between them is the number of positions taken into account. Under the definition of Eq. (1), the upsampling kernel is a function of the relative distance between points. Taking bilinear interpolation with $4$ reference points as an example, $\mathbf{w} = [w_1, w_2, w_3, w_4]^T$, where, given the coordinates of two reference points $(x_0, y_0)$ and $(x_1, y_1)$, $w_1 = \frac{(x_1 - x)(y_1 - y)}{(x_1 - x_0)(y_1 - y_0)}$; $(x, y)$ is the coordinates of the interpolated point; $w_2$, $w_3$, and $w_4$ can be derived similarly. In multi-channel cases, the same $\mathbf{w}$ is shared by all channels of the input.
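As a concrete illustration of the distance-based case, the following minimal Python sketch computes the four bilinear weights inside one cell and applies them as a weighted sum of the reference values, in the spirit of Eq. (1). The function names `bilinear_weights` and `upsample_point`, the ordering of the reference points, and the sample values are assumptions made purely for illustration.

```python
def bilinear_weights(x, y, x0, y0, x1, y1):
    """Bilinear weights for the point (x, y) inside the cell [x0, x1] x [y0, y1].

    Assumed ordering of reference points: (x0, y0), (x1, y0), (x0, y1), (x1, y1).
    """
    area = (x1 - x0) * (y1 - y0)
    w1 = (x1 - x) * (y1 - y) / area
    w2 = (x - x0) * (y1 - y) / area
    w3 = (x1 - x) * (y - y0) / area
    w4 = (x - x0) * (y - y0) / area
    return [w1, w2, w3, w4]


def upsample_point(w, z):
    """Weighted sum of the reference values z with the kernel w (inner product)."""
    return sum(wi * zi for wi, zi in zip(w, z))


# Interpolate at (0.25, 0.5) inside the unit cell.
w = bilinear_weights(0.25, 0.5, 0.0, 0.0, 1.0, 1.0)
z = [1.0, 2.0, 3.0, 4.0]  # values at the four reference points
value = upsample_point(w, z)
```

The weights depend only on coordinates, never on the feature values, which is what makes this kernel distance-based.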

Feature-based Upsampling. Feature-based upsampling operators are feature-dependent. They are developed in deep networks and include max-unpooling [1], CARAFE [31], and IndexNet [22]:

  1. Max-unpooling interpolates values following the indices returned from max-pooling. In a $2 \times 2$ region of the feature layer after upsampling, only one position recorded in the indices has a value, and the other three are filled with $0$. Since each position on the upsampled feature map is interpolated from a $1 \times 1$ point at the low-resolution layer, we can define $\mathbf{w}$ by a $1 \times 1$ vector $\mathbf{w} = [w]$, where $w \in \{0, 1\}$, and $\mathbf{z}$ is also the $1 \times 1$ point at the low-resolution layer. Note that only one $w$ can equal $1$ in a $2 \times 2$ region of the output feature map. In multi-channel cases, $\mathbf{w}$ and $\mathbf{z}$ are different in different channels conditioned on the operator.

  2. CARAFE learns an upsampling kernel $\mathbf{w} \in \mathbb{R}^{k^2 \times 1}$ ($k = 5$ in [31]) via a kernel generation module given a decoder feature map ready to upsample. It also conforms to Eq. (1), where $\mathbf{z} \in \mathbb{R}^{k^2 \times 1}$ is obtained from the low-resolution decoder feature map. The kernel size of $\mathbf{w}$ depends on the size of $\mathbf{z}$. In multi-channel cases, the same $\mathbf{w}$ is shared among channels.

  3. IndexNet also learns an upsampling kernel dynamically from features. The difference is that IndexNet learns from high-resolution encoder feature maps. Under the formulation of Eq. (1), the upsampling kernel follows a similar spirit to max-unpooling: $\mathbf{w} = [w]$, where $w \in [0, 1]$, because each position on the upsampled feature layer is interpolated from a corresponding point on the low-resolution map by multiplying by an interpolation weight $w$. But here $w \in [0, 1]$ instead of $w \in \{0, 1\}$.

Hence, distance-based and feature-based upsampling operators have a unified form $g(\mathbf{w}, \mathbf{z}) = \mathbf{w}^T \mathbf{z}$, while different operators correspond to different $\mathbf{w}$'s and $\mathbf{z}$'s, where $\mathbf{w}$ can be heuristically defined or dynamically generated. In particular, existing operators define/generate $\mathbf{w}$ according to distances or first-order features, while second-order information remains unexplored in upsampling.
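To make the feature-based view more tangible, here is a small Python sketch of max-pooling followed by max-unpooling on a one-channel map, written so that every upsampled value is a low-resolution value multiplied by a binary weight. The helper names `max_pool_2x2` and `max_unpool_2x2` and the sample input are hypothetical; this is an illustration of the idea, not code from any of the cited works.

```python
def max_pool_2x2(fmap):
    """2x2 max-pooling; returns pooled values and the (row, col) offset of each max."""
    pooled, indices = [], []
    for i in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(fmap[0]), 2):
            window = [(fmap[i + di][j + dj], (di, dj)) for di in (0, 1) for dj in (0, 1)]
            value, offset = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(offset)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(low, indices):
    """Each output value is a low-resolution value times a weight in {0, 1}."""
    h, w = len(low), len(low[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            for di in (0, 1):
                for dj in (0, 1):
                    weight = 1.0 if (di, dj) == indices[i][j] else 0.0
                    out[2 * i + di][2 * j + dj] = weight * low[i][j]
    return out


fmap = [[1, 5, 2, 0],
        [3, 4, 8, 6],
        [7, 1, 0, 2],
        [2, 9, 3, 4]]
pooled, indices = max_pool_2x2(fmap)
restored = max_unpool_2x2(pooled, indices)
```

Replacing the binary weight with a learned value in $[0, 1]$ gives the soft, IndexNet-like behavior described in the list above.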

4 Learning Affinity-Aware Upsampling

Here we explain how we exploit second-order information to formulate the affinity idea in upsampling using a bilinear model and how we apply a low-rank approximation to reduce computational complexity.

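General Formulation of Upsampling. Given a feature map $\mathcal{M} \in \mathbb{R}^{C \times H \times W}$ to be upsampled, the goal is to generate an upsampled feature map $\mathcal{M}' \in \mathbb{R}^{C \times rH \times rW}$, where $r$ is the upsampling ratio. For a position $(i', j')$ in $\mathcal{M}'$, the corresponding source position $(i, j)$ in $\mathcal{M}$ is derived by solving $i = \lfloor i'/r \rfloor$, $j = \lfloor j'/r \rfloor$. We aim to learn an upsampling kernel $\mathbf{w} \in \mathbb{R}^{k^2 \times 1}$ for each position in $\mathcal{M}'$. By applying the kernel to a channel of the local feature map $\mathcal{X} \in \mathbb{R}^{C \times k \times k}$ centered at position $l$ on $\mathcal{M}$, denoted by $\mathbf{X} \in \mathbb{R}^{k \times k}$, the corresponding upsampled feature point $m'_{l'}$ of the same channel at target position $l'$ can be obtained by $m'_{l'} = \mathbf{w}^T \mathbf{x}$ according to Eq. (1), where $\mathbf{x} \in \mathbb{R}^{k^2 \times 1}$ is the vectorization of $\mathbf{X}$.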
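The formulation above can be sketched for a single channel as follows. This is only one possible reading: the function name `upsample_with_kernels`, the storage of one flattened kernel per target position, and the zero padding at the borders are assumptions introduced for illustration.

```python
def upsample_with_kernels(fmap, kernels, r=2, k=3):
    """Upsample a one-channel map by applying one k x k kernel per target position."""
    h, w = len(fmap), len(fmap[0])
    pad = k // 2
    out = [[0.0] * (r * w) for _ in range(r * h)]
    for ti in range(r * h):
        for tj in range(r * w):
            si, sj = ti // r, tj // r      # source position: floor of target / r
            kernel = kernels[ti][tj]       # flattened kernel of length k * k
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = si + di - pad, sj + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:  # assumed zero padding
                        acc += kernel[di * k + dj] * fmap[ii][jj]
            out[ti][tj] = acc
    return out


fmap = [[1.0, 2.0], [3.0, 4.0]]
center_only = [0.0] * 9
center_only[4] = 1.0  # a kernel that keeps only the center of the 3x3 patch
kernels = [[center_only for _ in range(4)] for _ in range(4)]
upsampled = upsample_with_kernels(fmap, kernels)
```

With the center-only kernel at every position, the operator reduces to nearest neighbor upsampling; a learned operator would instead supply a different kernel at each target position.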

General Meaning of Affinity. Affinity is often used to indicate pairwise similarity and is considered a second-order feature. An affinity map can be constructed in different ways, such as using a Gaussian kernel. In self-attention, the affinity between the position $l$ and the enumeration of all possible positions $p$ at a feature map $\mathcal{M}$ is denoted by $\mathrm{softmax}_{\forall p}(\mathrm{sim}(\mathbf{m}_l, \mathbf{m}_p))$, where $\mathbf{m}_l$ and $\mathbf{m}_p$ represent two vectors at position $l$ and $p$, respectively, and $\mathrm{sim}(\mathbf{m}_l, \mathbf{m}_p)$ measures the similarity between $\mathbf{m}_l$ and $\mathbf{m}_p$ with the inner product $\mathbf{m}_l^T \mathbf{m}_p$.
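As a small illustration of this notion of affinity, the sketch below computes a softmax over inner-product similarities between one query vector and a set of candidate vectors. The function name `affinity` and the toy vectors are assumptions made for the example.

```python
import math


def affinity(query, keys):
    """Softmax over inner-product similarities between `query` and each key vector."""
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    shift = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Three position vectors; the first one is also used as the query.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aff = affinity(vectors[0], vectors)
```

Positions whose vectors align more closely with the query receive larger affinity, and the affinities over all positions sum to one.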

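Affinity-Aware Upsampling via Bilinear Modeling. Given a local feature map $\mathcal{X} \in \mathbb{R}^{C \times h_1 \times w_1}$, $\mathcal{X}$ has an equivalent matrix form $\mathbf{X} \in \mathbb{R}^{C \times N}$, where $N = h_1 \times w_1$. We aim to learn an upsampling kernel $\mathbf{w} \in \mathbb{R}^{s^2 \times 1}$ conditioned on $\mathbf{X}$. Previous learning-based upsampling operators [31, 21, 22] generate the value of the upsampling kernel following a linear model by $w = \sum_{i=1}^{C} \sum_{j=1}^{N} a_{ij} x_{ij}$, where $a_{ij}$ and $x_{ij}$ are the weight and the feature at the $i$-th channel and $j$-th position of $\mathbf{X}$, respectively. Note that $w \in \mathbb{R}^{1 \times 1}$. To encode second-order information, a natural generalization of the linear model above is bilinear modeling, where another feature matrix $\mathbf{Y} \in \mathbb{R}^{C \times M}$ transformed from the feature map $\mathcal{M}$ ($M = h_2 \times w_2$) is introduced to pair with $\mathbf{X}$ to model affinity. In what follows, $i$ and $j$ index positions and $k$ indexes channels. Given each $\mathbf{x}_i \in \mathbb{R}^{C \times 1}$ in $\mathbf{X}$, $\mathbf{y}_j \in \mathbb{R}^{C \times 1}$ in $\mathbf{Y}$, the bilinear weight $a_{ij}$ of the vector pair, and the embedding weights $q_k$ and $t_k$ for each channel of $\mathbf{x}_i$ and $\mathbf{y}_j$, we propose to generate each value of the upsampling kernel from embedded pairwise similarity, i.e.,

$$w = \sum_{i=1}^{N} \sum_{j=1}^{M} a_{ij}\, \varphi(\mathbf{x}_i)^T \phi(\mathbf{y}_j) = \sum_{i=1}^{N} \sum_{j=1}^{M} a_{ij} \sum_{k=1}^{C} q_k x_{ik}\, t_k y_{jk} = \sum_{k=1}^{C} \mathbf{x}_k^T \mathbf{F}_k \mathbf{y}_k, \tag{2}$$

where $\mathbf{x}_k \in \mathbb{R}^{N \times 1}$ and $\mathbf{y}_k \in \mathbb{R}^{M \times 1}$ are the $k$-th channel of $\mathbf{X}$ and $\mathbf{Y}$, respectively, $\mathbf{F}_k \in \mathbb{R}^{N \times M}$ is the affinity matrix for the $k$-th channel, $f^k_{ij} = a_{ij} q_k t_k$, and $\varphi$ and $\phi$ represent the embedding functions.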

Figure 2: Kernel generation of AU. Given a feature map of size $C \times H \times W$, an $s \times s$ upsampling kernel is generated at each spatial position conditioned on the feature map. The rank $d$ is $1$ here.

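Factorized Affinity-Aware Upsampling. Learning $\mathbf{F}_k$ can be expensive when $M$ and $N$ are large. Inspired by [12, 36], a low-rank bilinear method can be derived to reduce the computational complexity of Eq. (2). Specifically, $\mathbf{F}_k$ can be rewritten as $\mathbf{F}_k = \mathbf{U}_k \mathbf{V}_k^T$, where $\mathbf{U}_k \in \mathbb{R}^{N \times d}$ and $\mathbf{V}_k \in \mathbb{R}^{M \times d}$. $d$ represents the rank of $\mathbf{F}_k$ under the constraint of $d \leq \min(N, M)$. Eq. (2) therefore can be rewritten as

$$w = \sum_{k=1}^{C} \mathbf{x}_k^T \mathbf{U}_k \mathbf{V}_k^T \mathbf{y}_k = \sum_{k=1}^{C} \mathbf{1}^T \left( \mathbf{U}_k^T \mathbf{x}_k \circ \mathbf{V}_k^T \mathbf{y}_k \right) = \mathbf{1}^T \sum_{k=1}^{C} \left( \mathbf{U}_k^T \mathbf{x}_k \circ \mathbf{V}_k^T \mathbf{y}_k \right), \tag{3}$$

where $\mathbf{1} \in \mathbb{R}^{d}$ is a column vector of ones, and $\circ$ denotes the Hadamard product. Since we need to generate an $s \times s$ upsampling kernel, $\mathbf{1}$ in Eq. (3) can be replaced with $\mathbf{P} \in \mathbb{R}^{d \times s^2}$. Note that Eq. (3) is applied to each position of a feature map, so the inner product here can be implemented by convolution. The full upsampling kernel therefore can be generated by

$$\mathbf{w} = \mathbf{P}^T \sum_{k=1}^{C} \left( \mathbf{U}_k^T \mathbf{x}_k \circ \mathbf{V}_k^T \mathbf{y}_k \right) = \mathbf{P}^T \mathop{\mathrm{cat}}_{r=1}^{d} \Big( \sum_{k=1}^{C} \mathbf{u}_{kr}^T \mathbf{x}_k \cdot \mathbf{v}_{kr}^T \mathbf{y}_k \Big) = \mathrm{conv}\Big( \mathcal{P},\ \mathop{\mathrm{cat}}_{r=1}^{d} \Big( \sum_{k=1}^{C} \big[ \mathrm{gpconv}(\mathcal{U}_r, \mathcal{X}) \circ \mathrm{gpconv}(\mathcal{V}_r, \mathcal{Y}) \big]_k \Big) \Big), \tag{4}$$

where $\mathbf{u}_{kr} \in \mathbb{R}^{N \times 1}$ and $\mathbf{v}_{kr} \in \mathbb{R}^{M \times 1}$ are the $r$-th columns of $\mathbf{U}_k$ and $\mathbf{V}_k$, and $[\cdot]_k$ selects the $k$-th output channel. The convolution kernels $\mathcal{P}$, $\mathcal{U}$, and $\mathcal{V}$ are reshaped tensor versions of $\mathbf{P}$, $\mathbf{U}$, and $\mathbf{V}$, respectively. $\mathrm{conv}(\mathcal{K}, \mathcal{M})$ represents a convolution operation on the feature map $\mathcal{M}$ with the kernel $\mathcal{K}$; $\mathrm{gpconv}(\mathcal{K}, \mathcal{M})$ defines a group convolution operation ($C$ groups) with the same input. $\mathrm{cat}$ is the concatenation operator. This process is visualized in Fig. 2.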
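The low-rank factorization can be checked numerically with a short Python sketch. It compares a bilinear form computed with an explicit affinity matrix built as a product of two thin factors against the factorized form that projects both inputs, takes a Hadamard product, and sums. All names (`bilinear_full`, `bilinear_low_rank`, the factor matrices `U` and `V`, the sizes) are assumptions for this illustration, and only a single channel is shown.

```python
import random


def project(mat, vec):
    """Compute mat^T vec for a matrix of shape (n, d) and a vector of length n."""
    n, d = len(mat), len(mat[0])
    return [sum(mat[i][r] * vec[i] for i in range(n)) for r in range(d)]


def bilinear_full(x, U, V, y):
    """x^T F y with the affinity matrix F = U V^T formed explicitly."""
    n, m, d = len(U), len(V), len(U[0])
    F = [[sum(U[i][r] * V[j][r] for r in range(d)) for j in range(m)] for i in range(n)]
    return sum(x[i] * F[i][j] * y[j] for i in range(n) for j in range(m))


def bilinear_low_rank(x, U, V, y):
    """Sum of the Hadamard product of the two projections U^T x and V^T y."""
    ux, vy = project(U, x), project(V, y)
    return sum(a * b for a, b in zip(ux, vy))


rng = random.Random(0)
N, M, d = 9, 9, 2  # hypothetical sizes: 3x3 regions, rank 2
U = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(M)]
x = [rng.uniform(-1, 1) for _ in range(N)]
y = [rng.uniform(-1, 1) for _ in range(M)]

full = bilinear_full(x, U, V, y)
low_rank = bilinear_low_rank(x, U, V, y)
```

The factorized form needs $d(N + M)$ weights per channel instead of $NM$, which is where the parameter saving comes from when $d$ is small.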

Alternative Implementations. Eq. (4) is a generic formulation. In practice, many design choices can be discussed in implementation:

  1. The selection of $\mathbf{X}$ and $\mathbf{Y}$ can be either the same or different. In this paper, we only discuss self-similarity, i.e., $\mathbf{X} = \mathbf{Y}$;

  2. The rank $d$ can be chosen in the range $[1, \min(N, M)]$. For example, if $\mathbf{X}$ and $\mathbf{Y}$ are extracted in $n \times n$ regions, the range will be $[1, n^2]$. In our experiments, we set $d = 1$ to explore the most simplified and light-weight case.

  3. $\mathcal{U}$ and $\mathcal{V}$ can be considered two encoding functions. They can be shared, partly shared, or unshared among channels. We discuss two extreme cases in the experiments: ‘channel-shared’ (‘cs’) and ‘channel-wise’ (‘cw’).

  4. Eq. (4) adjusts the kernel size of $\mathbf{w}$ only using $\mathcal{P}$. Since the low-rank approximation has fewer parameters, fixed $\mathcal{P}$, $\mathcal{U}$, and $\mathcal{V}$ may not be sufficient to model all local variations. Inspired by CondConv [35], we attempt to generate $\mathcal{P}$, $\mathcal{U}$, and $\mathcal{V}$ dynamically conditioned on the input. We investigate three implementations: 1) static: none of them is input-dependent; 2) hybrid: only $\mathcal{P}$ is conditioned on the input; and 3) dynamic: $\mathcal{P}$, $\mathcal{U}$, and $\mathcal{V}$ are all conditioned on the input. The dynamic generation of $\mathcal{P}$, $\mathcal{U}$, or $\mathcal{V}$ is implemented using a global average pooling and a convolution layer.

  5. We implement stride-2 $\mathcal{U}$ and $\mathcal{V}$ in our experiments, so their outputs have half the spatial resolution of their input. To generate an upsampling kernel for every position at the full resolution, one can either use multiple sets of different weights for $\mathcal{U}$ and $\mathcal{V}$, or multiple sets of weights for $\mathcal{P}$, followed by a shuffling operation. We denote the former case as ‘pointwise’ (‘pw’). Further, as pointed out in [12], nonlinearity, e.g., tanh or relu, can be added after the encoding of $\mathcal{U}$ and $\mathcal{V}$. We verify a similar idea by adding normalization and nonlinearity in the experiments.

| Method | MNIST PSNR (↑) | MNIST SSIM (↑) | MNIST MSE (↓) | MNIST MAE (↓) | Fashion-MNIST PSNR (↑) | Fashion-MNIST SSIM (↑) | Fashion-MNIST MSE (↓) | Fashion-MNIST MAE (↓) |
|---|---|---|---|---|---|---|---|---|
| Conv/2-Nearest | 28.54 | 0.9874 | 0.0374 | 0.0148 | 25.58 | 0.9797 | 0.0527 | 0.0269 |
| Conv/2-Bilinear | 26.12 | 0.9783 | 0.0495 | 0.0205 | 23.68 | 0.9675 | 0.0656 | 0.0343 |
| Conv/2-Deconv [20] | 31.85 | 0.9942 | 0.0256 | 0.0089 | 27.42 | 0.9870 | 0.0426 | 0.0207 |
| P.S. [29] | 31.63 | 0.9939 | 0.0262 | 0.0099 | 27.33 | 0.9868 | 0.0431 | 0.0212 |
| MaxPool-MaxUnpool | 29.91 | 0.9916 | 0.0320 | 0.0133 | 28.31 | 0.9901 | 0.0385 | 0.0218 |
| MaxPool-CARAFE [31] | 28.72 | 0.9885 | 0.0367 | 0.0131 | 25.17 | 0.9773 | 0.0552 | 0.0266 |
| MaxPool-IndexNet† [21] | 45.51 | 0.9997 | 0.0053 | 0.0024 | 45.83 | 0.9998 | 0.0051 | 0.0033 |
| MaxPool-AU (Ours) | 47.63 | 0.9998 | 0.0042 | 0.0020 | 46.41 | 0.9999 | 0.0048 | 0.0031 |
| MaxPool-IndexNet‡ [21] | 47.13 | 0.9997 | 0.0044 | 0.0020 | 44.35 | 0.9998 | 0.0061 | 0.0036 |

Table 1: Reconstruction results on the MNIST dataset and the Fashion-MNIST dataset. † denotes the holistic index network, and ‡ represents the depthwise index network. Both index networks here apply the setting of ‘context+linear’ for a fair comparison.

Extension to Downsampling. Following [22], our method can also be extended to downsampling. Downsampling is paired with upsampling, so their kernels are generated from the same encoder feature. We use ‘d’ to indicate the use of paired downsampling in experiments. We share the same $\mathcal{U}$ and $\mathcal{V}$ in Eq. (4) in both downsampling and upsampling, but use different $\mathcal{P}$'s considering that they may have different kernel sizes. We denote the overall upsampling kernel by and the downsampling kernel by , where is the ratio of upsampling/downsampling. We set in our experiments.

5 Image Reconstruction and Analysis

Here we conduct a pilot image reconstruction experiment on a toy dataset to show the effectiveness of AU. Inspired by [22], we build sets of reconstruction experiments on the MNIST dataset [15] and Fashion-MNIST dataset [33]. The motivation is to verify whether exploiting second-order information in upsampling benefits recovering spatial information.

We denote to be a convolution layer with -channel output and filters (stride is unless stated), followed by BatchNorm and ReLU, and denote a downsampling operator with a ratio of , and denote an upsampling operator with a ratio of . We build the network architecture as: -------------. The same training strategies and evaluation metrics are used following [22]. Since training patches are relatively small (), upsampling kernel sizes for CARAFE and AU are both set to , and the encoding convolution kernels in IndexNet and AU are both set to . Other settings keep the default ones. We apply ‘static-pw-cw’ AU here because it is the same as Holistic IndexNet if letting convolution results of to be all ones. We hence add a sigmoid function after to generalize IndexNet. To avoid extra layers, we apply max-pooling to to obtain high-resolution layers when validating IndexNet and AU. Reconstruction results are presented in Table 1.

As shown in Table 1, upsampling operators informed by features (max-unpooling, CARAFE, IndexNet, and AU) outperform the operators guided by spatial distances (nearest, bilinear, and bicubic). Moreover, learning from high-resolution features matters for upsampling, and among such operators the learning-based ones (IndexNet, AU) achieve the best results. Further, it is worth noting that AU performs better than IndexNet with even fewer parameters. From these observations, we believe that in upsampling: 1) high-resolution features are beneficial for extracting spatial information, and 2) second-order features can help to recover more spatial details than first-order ones.

6 Experiments and Discussions

Here we evaluate AU on deep image matting. This task is suitable for assessing the quality of modeling pairwise relations.

Figure 3: Overview of our matting framework. The focus of this work is on the upsampling stages.

6.1 Network Architecture

Similar to [17], our baseline network adopts the first layers of the ResNet34 [10] as the encoder. The decoder consists of residual blocks and upsampling stages. The In-Place Activated BatchNorm [28] is applied to each layer except the last one to reduce GPU memory consumption. As shown in Fig. 3, the overall network follows the UNet architecture [27] with ‘skip’ connections. To apply AU to upsampling, we replace the upsampling operations in the decoder with AU modules. Specifically, we learn upsampling kernels from the skipped features. If AU is used in both upsampling and downsampling stages, we change all stride-2 convolution layers in the encoder to stride-1 and implement paired downsampling and upsampling operations, respectively, by learning upsampling/downsampling kernels from the modified stride-1 feature layer.

6.2 Datasets

We mainly conduct our experiments on the Adobe Image Matting dataset [34]. Its training set has 431 unique foreground objects and ground-truth alpha mattes. Instead of compositing each foreground with a fixed set of 100 background images chosen from MS COCO [18], we randomly choose the background images in each iteration and generate the composition images on-the-fly. The test set, termed Composition-1k, contains 50 unique foreground objects; each foreground is composited with 20 background images from the Pascal VOC dataset.

We also evaluate our method on the alphamatting.com benchmark [26]. This online benchmark has 8 unique testing images and 3 different trimaps for each image, providing 24 test cases.

Further, we report results on the recently proposed Distinctions-646 dataset [25]. It has 596 foreground objects in the training set and 50 foreground objects in the test set. We generate the training data and the test set following the same protocol as on the Adobe Image Matting dataset.

6.3 Implementation Details

Our implementation is based on PyTorch [24]. Here we describe training details on the Adobe Image Matting dataset. The network input concatenates the RGB image and its trimap. We mainly follow the data augmentation of [17]. Two foreground objects are first chosen with a given probability and are composited to generate a new foreground image and a new alpha matte. Next, they are resized to a fixed size with a given probability. Random affine transformations are then applied. Trimaps are randomly dilated from the ground truth alpha mattes with distances drawn from a fixed range, followed by random cropping. The background image is randomly chosen from the MS COCO dataset [18]. After imposing random jitters on the foreground object, the RGB image is finally generated by composition.
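The final composition step can be illustrated with the standard matting equation $I = \alpha F + (1 - \alpha) B$, applied per pixel and per color channel. The sketch below is a minimal Python version; the function name `composite_pixel` and the sample colors are hypothetical, and the augmentation steps described above are not reproduced.

```python
def composite_pixel(fg, bg, alpha):
    """Blend a foreground and a background RGB pixel with opacity alpha in [0, 1]."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


foreground = (1.0, 0.0, 0.0)  # pure red
background = (0.0, 0.0, 1.0)  # pure blue
pixel = composite_pixel(foreground, background, 0.25)
```

Matting is the inverse problem: given the composited image (and a trimap), estimate $\alpha$ at every pixel.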

The backbone is pretrained on ImageNet [14]. The Adam optimizer [13] is used. We use the same loss function as [34, 21], including the alpha prediction loss and the composition loss computed from the unknown regions indicated by trimaps. We train for a fixed number of epochs, each with a fixed number of iterations, and BN layers in the backbone are fixed. The learning rate is reduced by a constant factor at two later epochs. The training strategies on the Distinctions-646 dataset are the same except that we update the parameters for fewer epochs. We evaluate our results using Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Gradient (Grad), and Connectivity (Conn) [26]. We follow the evaluation code provided by [34].
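To clarify what the two simpler metrics measure, here is a minimal Python sketch of SAD and MSE restricted to the unknown region of a trimap. It is not the evaluation code of [34]: the function name `sad_mse_unknown`, the use of the value 128 to mark unknown pixels, and the absence of any rescaling of the reported numbers are all assumptions.

```python
def sad_mse_unknown(pred, gt, trimap, unknown=128):
    """SAD and MSE between predicted and ground-truth alpha over unknown pixels only."""
    diffs = [
        p - g
        for pred_row, gt_row, tri_row in zip(pred, gt, trimap)
        for p, g, t in zip(pred_row, gt_row, tri_row)
        if t == unknown  # assumed label for the unknown region
    ]
    sad = sum(abs(d) for d in diffs)
    mse = sum(d * d for d in diffs) / len(diffs) if diffs else 0.0
    return sad, mse


pred = [[0.0, 0.5], [1.0, 0.25]]
gt = [[0.0, 1.0], [1.0, 0.5]]
trimap = [[0, 128], [255, 128]]
sad, mse = sad_mse_unknown(pred, gt, trimap)
```

Gradient and Connectivity errors additionally depend on image gradients and connected regions and are not sketched here.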

6.4 The Adobe Image Matting Dataset

| Upsample | SAD | MSE | Grad | Conn | # Params |
|---|---|---|---|---|---|
| Nearest | 37.51 | 0.0096 | 19.07 | 35.72 | 8.05M |
| Bilinear | 37.31 | 0.0103 | 21.38 | 35.39 | 8.05M |
| CARAFE | 41.01 | 0.0118 | 21.39 | 39.01 | +0.26M |
| IndexNet | 34.28 | 0.0081 | 15.94 | 31.91 | +12.26M |
| AU (static-pw-cw) | 36.36 | 0.0099 | 21.03 | 34.40 | +0.10M |
| AU (static-cw) | 35.92 | 0.0098 | 20.06 | 33.68 | +26K |
| AU (hybrid-cw) | 34.76 | 0.0088 | 16.39 | 32.29 | +44K |
| AU (hybrid-cs) | 36.43 | 0.0098 | 21.24 | 34.11 | +19K |
| AU (dynamic-cw) | 36.66 | 0.0094 | 18.60 | 34.62 | +0.20M |
| AU (dynamic-cs) | 35.86 | 0.0095 | 17.13 | 33.71 | +20K |
| AU (dynamic-cs-d) | 33.13 | 0.0078 | 17.90 | 30.22 | +38K |
| AU (dynamic-cs-d)† | 32.15 | 0.0082 | 16.39 | 29.25 | +38K |

Table 2: Results of different upsampling operators on the Composition-1k test set with the same baseline model. † denotes additional normalization and nonlinearity after the encoding layers of $\mathcal{U}$ and $\mathcal{V}$. The best performance is in boldface.

Ablation Study on Alternative Implementations. Here we verify different implementations of AU on the Composition-1k test set and compare them with existing upsampling operators. Quantitative results are shown in Table 2. All the models are implemented with the same architecture but with different upsampling operators. The ‘nearest’ and ‘bilinear’ operators are our direct baselines. They achieve close performance with the same model capacity. For CARAFE, we use the default setting as in [31], i.e., $k_{up} = 5$ and $k_{encoder} = 3$. We observe that CARAFE has a negative effect on the performance. The idea behind CARAFE is to reassemble contextual information, which is not the focus of matting, where subtle details matter. However, it is interesting that CARAFE can still be useful for matting when it follows a light-weight MobileNetV2 backbone [22]. One possible explanation is that a better backbone (ResNet34) suppresses the advantages of context reassembling. We report results of IndexNet with the best-performance setting (‘depthwise+context+nonlinear’) in [21, 22]. The upsampling indices are learned from the skipped feature layers. IndexNet achieves a notable improvement, especially on the Grad metric. However, IndexNet significantly increases the number of parameters.

We further investigate different implementations of AU and another version with paired downsampling and upsampling. According to the results, the ‘static’ setting can only improve the SAD and Conn metrics. The position-wise and position-shared settings report comparable results, so we fix the position-shared setting in the following ‘hybrid’ and ‘dynamic’ experiments. We verify both channel-wise and channel-shared settings for ‘hybrid’ and ‘dynamic’ models. The ‘hybrid’ model achieves higher performance with the channel-wise design, while the ‘dynamic’ model performs better with the channel-shared design. All ‘hybrid’ and ‘dynamic’ models show improvements against baselines on all metrics, except the MSE and Grad metrics for the channel-shared ‘hybrid’ model. The last implementation, where channel-shared ‘dynamic’ downsampling is paired with upsampling, achieves the best performance (at least 14% relative improvements against the baseline) with a negligible increase of parameters (<0.5%).
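The relative improvements can be recomputed from the numbers in Table 2 with a few lines of Python. The sketch assumes that the ‘Nearest’ row is the baseline and that the last row of Table 2 (SAD 32.15) is the best model; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, value):
    """Relative error reduction in percent (lower error is better)."""
    return 100.0 * (baseline - value) / baseline


# Values copied from Table 2.
nearest = {"SAD": 37.51, "MSE": 0.0096, "Grad": 19.07, "Conn": 35.72}
best = {"SAD": 32.15, "MSE": 0.0082, "Grad": 16.39, "Conn": 29.25}

gains = {name: relative_improvement(nearest[name], best[name]) for name in nearest}

# Extra parameters (+38K) relative to the 8.05M baseline, in percent.
param_increase = 100.0 * 38e3 / 8.05e6
```

Under these assumptions every metric improves by more than 14% relative to the baseline, while the parameter count grows by less than 0.5%.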

Hence, while the dedicated design of upsampling operators matters, paired downsampling and upsampling seems more important, at least for image matting.

| Method | Kernel size | SAD | MSE | Grad | Conn |
|---|---|---|---|---|---|
| AU (hybrid-cw) | 1 | 37.74 | 0.0104 | 22.07 | 35.91 |
| AU (hybrid-cw) | 3 | 34.76 | 0.0088 | 16.39 | 32.29 |
| AU (hybrid-cw) | 5 | 35.99 | 0.0093 | 17.96 | 33.90 |
| AU (dynamic-cs) | 1 | 36.06 | 0.0098 | 17.25 | 33.95 |
| AU (dynamic-cs) | 3 | 35.86 | 0.0095 | 17.13 | 33.71 |
| AU (dynamic-cs) | 5 | 37.40 | 0.0096 | 18.28 | 35.50 |

Table 3: Ablation study of upsampling kernel size on the Composition-1k test set.

Ablation Study on Upsampling Kernel. Here we investigate the performance of our models with different upsampling kernel sizes. The encoding kernel size (the kernel size of $\mathcal{U}$ or $\mathcal{V}$) is kept fixed in all matting experiments unless stated. Under this setting, results in Table 3 show that an upsampling kernel size of $3$ performs the best. It is interesting to observe that a larger upsampling kernel does not imply better performance. We believe this is related to the encoding kernel size and the way we generate $\mathcal{P}$, $\mathcal{U}$, and $\mathcal{V}$. We use an upsampling kernel size of $3$ as our default setting.

Ablation Study on Normalization. In both [31] and [22], different normalization strategies are verified, and experiments show that normalization significantly affects the results. We thus justify the normalization choices in our AU module here. We conduct the experiments on the channel-wise ‘hybrid’ model and the channel-shared ‘dynamic’ model. Two normalization choices are considered: ‘softmax’ and ‘sigmoid+softmax’. It is clear that the latter normalization works better (Table 4). It may boil down to the nonlinearity introduced by the sigmoid function.
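The two normalization choices can be contrasted with a small Python sketch. It assumes, for illustration only, that normalization is applied across the entries of a single raw upsampling kernel; the function name `normalize_kernel` and the raw values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def normalize_kernel(raw, mode="softmax"):
    """Normalize raw kernel values with 'softmax' or 'sigmoid+softmax'."""
    if mode == "softmax":
        return softmax(raw)
    if mode == "sigmoid+softmax":
        return softmax([sigmoid(v) for v in raw])
    raise ValueError(f"unknown mode: {mode}")


raw_kernel = [4.0, 0.0, -4.0, 1.0]
plain = normalize_kernel(raw_kernel, "softmax")
squashed = normalize_kernel(raw_kernel, "sigmoid+softmax")
```

Because the sigmoid maps every raw value into $(0, 1)$, the ratio between the largest and smallest weight after ‘sigmoid+softmax’ is bounded by $e$, so the resulting kernel is much flatter than with plain softmax.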

| Method | Norm | SAD | MSE | Grad | Conn |
|---|---|---|---|---|---|
| AU (hybrid-cw) | softmax | 35.93 | 0.0092 | 17.13 | 33.87 |
| AU (hybrid-cw) | sigmoid+softmax | 34.76 | 0.0088 | 16.39 | 32.29 |
| AU (dynamic-cs) | softmax | 36.40 | 0.0100 | 17.67 | 34.33 |
| AU (dynamic-cs) | sigmoid+softmax | 35.86 | 0.0095 | 17.13 | 33.71 |

Table 4: Ablation study of normalization on the Composition-1k test set.

Comparison with State of the Art. Here we compare our models against other state-of-the-art methods on the Composition-1k test set. Results are shown in Table 5. We observe that our models outperform other methods on all the evaluation metrics with the minimum model capacity. Compared with the state-of-the-art method [17], our best model achieves 8% higher performance with only 40% model complexity. Our model is also memory-efficient, being able to infer high-resolution images on a single 1080Ti GPU without downsampling on the Composition-1k test set. Some qualitative results are shown in Fig. 4. Our results show improved detail delineation, such as the net structure and the filament.

| Method | SAD | MSE | Grad | Conn | # Params |
|---|---|---|---|---|---|
| Closed-Form [16] | 168.1 | 0.091 | 126.9 | 167.9 | - |
| KNN Matting [3] | 175.4 | 0.103 | 124.1 | 176.4 | - |
| Deep Matting [34] | 50.4 | 0.014 | 31.0 | 50.8 | M |
| IndexNet Matting [21] | 45.8 | 0.013 | 25.9 | 43.7 | 8.15M |
| AdaMatting [2] | 41.7 | 0.010 | 16.8 | - | - |
| Context-Aware [11] | 35.8 | 0.0082 | 17.3 | 33.2 | 107.5M |
| GCA Matting [17] | 35.28 | 0.0091 | 16.9 | 32.5 | 25.27M |
| AU (hybrid-cw) | 34.76 | 0.0088 | 16.39 | 32.29 | 8.09M |
| AU (dynamic-cs) | 35.86 | 0.0095 | 17.13 | 33.71 | 8.07M |
| AU (dynamic-cs-d) | 32.15 | 0.0082 | 16.39 | 29.25 | 8.09M |

Table 5: Benchmark results on the Composition-1k test set. The best performance is in boldface.

Figure 4: Qualitative results on the Composition-1k test set. The methods in comparison include Closed-Form Matting [16], KNN Matting [3], Deep Image Matting (DIM) [34], IndexNet Matting [21], GCA Matting [17], our baseline, and our method.

6.5 The Benchmark

Here we report results on the online benchmark [26]. We follow [17] to train our model with all the data in the Adobe matting dataset and then test it on the benchmark. As shown in Table 6, our method ranks first w.r.t. the gradient error among all published methods. We also achieve an overall ranking comparable to that of AdaMatting [2] under the SAD and MSE metrics, suggesting that our method is one of the top-performing methods on this benchmark.

Gradient Error: Average Rank (Overall S L U), Troll (S L U), Doll (S L U), Donkey (S L U), Elephant (S L U), Plant (S L U), Pineapple (S L U), Plastic bag (S L U), Net (S L U)
Ours 6.3 5.6 3.3 10.1 0.2 0.2 0.2 0.1 0.1 0.2 0.1 0.2 0.2 0.2 0.2 0.4 1.1 1.3 1.9 0.6 0.7 1.7 0.6 0.6 0.6 0.3 0.3 0.4
AdaMatting [2] 7.8 4.5 5.6 13.3 0.2 0.2 0.2 0.1 0.1 0.4 0.2 0.2 0.2 0.1 0.1 0.3 1.1 1.4 2.3 0.4 0.6 0.9 0.9 1 0.9 0.3 0.4 0.4
GCA Matting [17] 8 8.4 6.6 9.1 0.1 0.1 0.2 0.1 0.1 0.3 0.2 0.2 0.2 0.2 0.2 0.3 1.3 1.6 1.9 0.7 0.8 1.4 0.6 0.7 0.6 0.4 0.4 0.4
Context-aware Matting [11] 9.1 10.8 9.8 6.8 0.2 0.2 0.2 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.4 0.4 1.4 1.5 1.8 0.8 1.3 1 1.1 1.1 0.9 0.4 0.4 0.4
Table 6: Gradient errors on the test set. The top-4 methods are shown. The lowest errors are in boldface.
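As a reading aid for Table 6, the sketch below recomputes the overall average rank from the three per-trimap average ranks. It assumes that the "Overall" column is the plain mean of the S, L, and U columns (the small, large, and user trimaps of the benchmark), which is consistent with the tabulated values but is not stated in the text. The dictionary layout is hypothetical.

```python
# Per-trimap average ranks (S, L, U) copied from Table 6.
trimap_ranks = {
    "Ours": (5.6, 3.3, 10.1),
    "AdaMatting [2]": (4.5, 5.6, 13.3),
    "GCA Matting [17]": (8.4, 6.6, 9.1),
    "Context-aware Matting [11]": (10.8, 9.8, 6.8),
}

# Assumption: the overall rank is the mean of the three trimap ranks,
# rounded to one decimal place as in the table.
overall_rank = {
    name: round(sum(ranks) / len(ranks), 1)
    for name, ranks in trimap_ranks.items()
}

for name, rank in overall_rank.items():
    print(name, rank)
```

Under this assumption the recomputed values match the "Overall" column for all four methods, so a lower overall rank reflects a better average position across the three trimap types.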

6.6 The Distinction-646 Dataset

We also evaluate our method on the recent Distinction-646 test set. In Table 7, we report the results of the three models that perform best on the Composition-1k dataset and compare them with other benchmarking results provided by [25]. We have two observations: 1) our models show improved performance against the baseline, which further confirms the effectiveness of our A2U; 2) our models outperform the other reported benchmarking results by large margins on SAD, Grad, and Conn, setting a new state of the art on this dataset.

Method SAD MSE Grad Conn
Closed-Form [16] 105.73 0.023 91.76 114.55
KNN Matting [3] 116.68 0.025 103.15 121.45
Deep Matting [34] 47.56 0.009 43.29 55.90
Baseline-Nearest 25.03 0.0106 13.85 24.41
A2U (hybrid-cw) 24.08 0.0104 13.53 23.59
A2U (dynamic-cs) 24.55 0.0107 14.51 23.89
A2U (dynamic-cs-d) 23.20 0.0102 12.39 22.20
Table 7: Benchmark results on the Distinctions-646 test set. The best performance is in boldface.

6.7 Visualization of Upsampling Kernels

Here we visualize the learned upsampling kernel in a ‘hybrid’ model to showcase what is learned by the kernel. Two examples are illustrated in Fig. 5. We observe that, after learning, boundary details are highlighted, while flat regions are weakened.

Figure 5: Visualization of the upsampling kernel. The left is the randomly initialized kernel, and the right is the learned kernel.

7 Conclusion

Considering that affinity is widely exploited in dense prediction, we explore the feasibility of modeling such second-order information in upsampling to build compact models. We implement this idea with a low-rank bilinear formulation, based on a generalized mathematical view of upsampling. We show that, with a negligible increase in parameters, our method A2U achieves better performance on both image reconstruction and image matting tasks. We also investigate different design choices of A2U. Results on three image matting benchmarks all show that A2U achieves a significant relative improvement as well as state-of-the-art results. In particular, compared with the best-performing image matting network, our model achieves higher performance on the Composition-1k test set with a much smaller model capacity. For future work, we plan to extend A2U to other dense prediction tasks.


Appendix A Training Details of Image Reconstruction

The image reconstruction experiments are conducted on the MNIST dataset [15] and Fashion-MNIST dataset [33]. They both include training images and test images. During training, the input images are resized to , and loss is used. We use the SGD optimizer with an initial learning rate of . The learning rate is decreased by at the -th, -th, and -th epoch, respectively. We update the parameters for epochs in total with a batch size of . The evaluation metrics are Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity (SSIM), Mean Absolute Error (MAE) and root Mean Square Error (MSE).
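Three of the listed metrics have simple closed forms and can be sketched as below. This is a minimal illustration under stated assumptions, not the evaluation code used in the experiments: images are treated as flat lists of pixel values in [0, 1], the function names and sample values are hypothetical, and SSIM is omitted because it requires local windowed statistics.

```python
import math

def mae(pred, target):
    """Mean Absolute Error between two equally sized pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    """Mean squared error, used by the two metrics below."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rmse(pred, target):
    """Root of the mean squared error."""
    return math.sqrt(mse(pred, target))

def psnr(pred, target, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB; max_value is the assumed pixel range."""
    error = mse(pred, target)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)

# Hypothetical 4-pixel reconstruction and its ground truth.
target = [0.0, 0.5, 1.0, 0.25]
pred = [0.1, 0.5, 0.9, 0.25]

print(mae(pred, target), rmse(pred, target), psnr(pred, target))
```

Higher PSNR indicates a better reconstruction, whereas lower MAE and root mean square error are better.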

Appendix B Analysis of Complexity

Here we summarize the model complexity of different implementations of A2U in Table 8. We assume that the encoding kernel size is , the upsampling kernel size is , and the channel number of feature map is . Since is much larger than and , A2U generally has the complexity: .

Model Type # Params
static cw
static cs
hybrid cw
hybrid cs
dynamic cw
dynamic cs
Table 8: Analysis of the complexity of A2U. ‘cw’: channel-wise, ‘cs’: channel-shared.
Figure 6: Qualitative results on the alphamatting.com test set. The methods in comparison include AdaMatting [2], GCA Matting [17], Context-Aware Matting [11], and our method.

Appendix C Qualitative Results

We show additional qualitative results on the alphamatting.com benchmark [26] in Fig. 6. The four top-performing methods are visualized here. Since all these methods achieve good performance and their quantitative results on the benchmark are very close, it is difficult to see obvious differences in Fig. 6. It is worth noting, however, that our method produces better visual results on detailed structures, such as the gridding of the net and the leaves of the pineapple.

We also show qualitative results on the Distinction-646 test set [25] in Fig. 7. Since no implementation of other deep methods on this benchmark is publicly available, we only present the results of our baseline and our method here to show the relative improvements. According to Fig. 7, our method produces clearly better predictions on highly transparent objects such as the bubbles.

Figure 7: Qualitative results on the Distinction-646 test set. The methods in comparison include the baseline and our method.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §3.
  • [2] S. Cai, X. Zhang, H. Fan, H. Huang, J. Liu, J. Liu, J. Liu, J. Wang, and J. Sun (2019) Disentangled image matting. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 8819–8828. Cited by: Figure 6, §2, §6.5, Table 5, Table 6.
  • [3] Q. Chen, D. Li, and C. Tang (2013) KNN matting. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (9), pp. 2175–2188. Cited by: §2, Figure 4, Table 5, Table 7.
  • [4] X. Cheng, P. Wang, and R. Yang (2018) Depth estimation via affinity learned with convolutional spatial propagation network. In Proc. European Conference on Computer Vision (ECCV), pp. 103–119. Cited by: §1.
  • [5] D. Cho, Y. Tai, and I. Kweon (2016) Natural image matting using deep convolutional neural networks. In Proc. European Conference on Computer Vision (ECCV), pp. 626–643. Cited by: §2.
  • [6] Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski (2001) A bayesian approach to digital matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. II–II. Cited by: §2.
  • [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §6.2.
  • [8] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang (2019) Ssap: single-shot instance segmentation with affinity pyramid. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 642–651. Cited by: §1.
  • [9] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun (2011) A global sampling method for alpha matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2049–2056. Cited by: §2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §6.1.
  • [11] Q. Hou and F. Liu (2019) Context-aware image matting for simultaneous foreground and alpha estimation. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 4130–4139. Cited by: Figure 6, §2, Table 5, Table 6.
  • [12] J. Kim, K. On, W. Lim, J. Kim, J. Ha, and B. Zhang (2016) Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325. Cited by: item 5, §4.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. Cited by: §6.3.
  • [15] Y. LeCun (1998) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: Appendix A, §5.
  • [16] A. Levin, D. Lischinski, and Y. Weiss (2007) A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 228–242. Cited by: §2, Figure 4, Table 5, Table 7.
  • [17] Y. Li and H. Lu (2020) Natural image matting via guided contextual attention. In Proc. AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11450–11457. Cited by: Figure 6, §1, §2, Figure 4, §6.1, §6.3, §6.4, §6.5, Table 5, Table 6.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proc. European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §6.2, §6.3.
  • [19] S. Liu, S. De Mello, J. Gu, G. Zhong, M. Yang, and J. Kautz (2017) Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1520–1530. Cited by: §1.
  • [20] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §2, §3, Table 1.
  • [21] H. Lu, Y. Dai, C. Shen, and S. Xu (2019) Indices matter: learning to index for deep image matting. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 3266–3275. Cited by: §1, §1, §2, §2, Table 1, §4, Figure 4, §6.3, §6.4, Table 5.
  • [22] H. Lu, Y. Dai, C. Shen, and S. Xu (2020) Index networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2, §3, §3, §4, §4, §5, §5, §6.4, §6.4.
  • [23] S. Lutz, K. Amplianitis, and A. Smolic (2018) Alphagan: generative adversarial networks for natural image matting. In British Machine Vision Conference (BMVC). Cited by: §2.
  • [24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NIPS), pp. 8026–8037. Cited by: §6.3.
  • [25] Y. Qiao, Y. Liu, X. Yang, D. Zhou, M. Xu, Q. Zhang, and X. Wei (2020) Attention-guided hierarchical structure aggregation for image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13676–13685. Cited by: Appendix C, §6.2, §6.6.
  • [26] C. Rhemann, C. Rother, J. Wang, M. Gelautz, P. Kohli, and P. Rott (2009) A perceptually motivated online benchmark for image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1826–1833. Cited by: Appendix C, §6.2, §6.3, §6.5.
  • [27] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Cited by: §6.1.
  • [28] S. Rota Bulò, L. Porzi, and P. Kontschieder (2018) In-place activated batchnorm for memory-optimized training of dnns. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1.
  • [29] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1874–1883. Cited by: §2, Table 1.
  • [30] J. Tang, Y. Aksoy, C. Oztireli, M. Gross, and T. O. Aydin (2019) Learning-based sampling for natural image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3055–3063. Cited by: §2.
  • [31] J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, and D. Lin (2019) Carafe: content-aware reassembly of features. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 3007–3016. Cited by: §1, §1, §2, item 2, §3, Table 1, §4, §6.4, §6.4.
  • [32] Y. Wang, Y. Niu, P. Duan, J. Lin, and Y. Zheng (2018) Deep propagation based image matting.. In International Joint Conference on Artificial Intelligence, Vol. 3, pp. 999–1006. Cited by: §1, §2.
  • [33] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: Appendix A, §5.
  • [34] N. Xu, B. Price, S. Cohen, and T. Huang (2017) Deep image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2970–2979. Cited by: §2, Figure 4, §6.2, §6.3, Table 5, Table 7.
  • [35] B. Yang, G. Bender, Q. V. Le, and J. Ngiam (2019) Condconv: conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems (NIPS), pp. 1307–1318. Cited by: item 4.
  • [36] C. Yu, X. Zhao, Q. Zheng, P. Zhang, and X. You (2018) Hierarchical bilinear pooling for fine-grained visual recognition. In Proc. European Conference on Computer Vision (ECCV), pp. 574–589. Cited by: §4.