The similarity among positions, a.k.a. affinity, is commonly investigated in dense prediction tasks [19, 4, 8, 32, 17]. Compared with directly fitting ground truths using first-order features, modeling similarity among different positions provides second-order information. There currently exist two solutions for learning affinity in deep networks: i) learning an affinity map before a non-deep backend, and ii) defining a learnable affinity-based module to propagate information. We are interested in end-to-end affinity learning, because classic methods often build upon assumptions that render weak generalization in general cases. Existing approaches typically propagate or model affinity after upsampling layers or before the last prediction layer. While affinity properties are modeled, they may not be effective for the downstream task. For instance, the work in  requires a feature encoding block besides the encoder-decoder architecture to learn affinity, and the work in  needs more iterations to refine the feature maps according to their affinity at the last stage. As shown in Fig. 1, one plausible reason is that pairwise similarity is damaged during upsampling. In addition, it is inefficient to construct interactions between high-dimensional feature maps. We therefore pose the question: can we model affinity earlier, during upsampling, in an effective and efficient manner?
Many widely used upsampling operators interpolate values following a fixed rule at different positions. For instance, although reference positions may change in bilinear upsampling, it always interpolates values based on relative spatial distances. Recently, the idea of learning to upsample has emerged [21, 22, 31]: a learnable module is built to generate upsampling kernels conditioned on feature maps, enabling dynamic, feature-dependent upsampling behaviors. Two representative operators are CARAFE  and IndexNet . In our experiments, we find that CARAFE may not work well in low-level vision tasks where details need to be restored, whereas IndexNet can recover details much better. We believe one important reason is that IndexNet encodes, stores, and delivers spatial information prior to downsampling. But its computation can be costly when the network goes deep. This motivates us to pursue not only flexible but also light-weight designs of the upsampling operator.
In this paper, we propose to model affinity in upsampling and introduce a novel learnable upsampling operator, i.e., affinity-aware upsampling (AU). As we show later in Section 3, AU is a generalization of first-order upsampling operators: under certain conditions, the first-order formulations in  and  can be viewed as special cases of our second-order one. In addition, by implementing AU in a low-rank bilinear formulation, we achieve efficient upsampling with few extra parameters.
We demonstrate the effectiveness of AU on two detail-sensitive tasks: an image reconstruction task on a toy dataset with controllable backgrounds and a large-scale image matting task with subtle foregrounds. Image matting is a desirable task for justifying the usefulness of affinity, because affinity-based matting approaches constitute one of the prominent matting paradigms in the literature; top matting performance thus suggests appropriate affinity modeling. In particular, we further discuss alternative design choices of AU and compare their similarities and differences. Compared with a strong image matting baseline on the Composition-1k matting dataset, AU exhibits a significant relative improvement with a negligible increase of parameters, proffering a light-weight image matting architecture with state-of-the-art performance.
2 Related Work
Upsampling Operators in Deep Networks. Upsampling is often necessary in dense prediction to recover spatial resolution. The most commonly used upsampling operators are bilinear interpolation and nearest neighbor interpolation. Since they are executed based only on spatial distances, they may be sub-optimal in detail-oriented tasks such as image matting, where distance-based similarity can be violated. Compared with distance-based upsampling, max-unpooling is feature-dependent and has been shown to benefit detail-oriented tasks [21, 22], but it must be paired with max-pooling. In the recent literature, learning-based upsampling operators [29, 20, 31, 22] have emerged. Pixel Shuffle (P.S.)  upsamples feature maps by reshaping. Deconvolution (Deconv) , an inverse version of convolution, learns the upsampling kernel via back-propagation. Both P.S. and Deconv are data-independent during inference, because the kernel is fixed once learned. By contrast, CARAFE  and IndexNet  learn the upsampling kernel dynamically conditioned on the data; both introduce additional modules to learn upsampling kernels. Since the upsampling kernel is directly related to the feature maps, these upsampling operators are considered first-order.
Following the learning-based upsampling paradigm, we also intend to learn dynamic upsampling operators, but conditioned on second-order features to enable affinity-informed upsampling. We show that, compared with first-order upsampling, affinity-informed upsampling not only achieves better performance but also introduces a light-weight learning paradigm.
Image Matting. The main assumption in propagation-based matting is that similar alpha values can be propagated from known positions to unknown positions, conditioned on affinity. This assumption, however, highly depends on the color distribution: such methods can perform well on cases with clear color contrast but often fail where the color distribution assumption is violated. Recently, deep learning has been found effective for addressing ill-posed image matting, and many deep matting methods have arisen [5, 32, 34, 30, 11, 21, 17, 2]. This field has evolved from a semi-deep stage [5, 32] to a fully-deep stage [34, 11, 21, 17, 2]. Here 'semi-deep' means that the matting part still relies on classic methods [16, 3] to function, while 'fully-deep' means that the entire network does not resort to any classic algorithm. Among fully-deep matting, DeepMatting  first applied the encoder-decoder architecture and reported improved results. Targeting this strong baseline, several deep matting methods were proposed: AlphaGAN matting  and IndexNet matting  explored adversarial learning and an index-generating module, respectively, to improve matting performance. In particular, the works in [11, 17, 2, 30] incorporated classic sampling-based and propagation-based ideas into deep networks to ease the difficulty of learning. Therein, GCA matting  first designed an affinity-based module and demonstrated the effectiveness of affinity in fully-deep matting. It treats alpha propagation as an independent module and adds it to different layers to refine the feature maps, layer by layer.
Different from the idea of ‘generating then refining’, we propose to directly incorporate the propagation-based idea into upsampling for deep image matting. It not only benefits alpha propagation but also shows the potential for light-weight module design.
3 A Mathematical View of Upsampling
The work in  unifies upsampling from an indexing perspective. Here we provide an alternative mathematical view. To simplify exposition, we discuss the upsampling of a one-channel feature map. Without loss of generality, the one-channel case can be easily extended to multi-channel upsampling, because most upsampling operators execute per-channel upsampling. Given a one-channel local feature map $Z \in \mathbb{R}^{k \times k}$ used to generate an upsampled feature point, it can be vectorized to $\mathbf{z} \in \mathbb{R}^{k^2}$. Similarly, the vectorization of an upsampling kernel $W \in \mathbb{R}^{k \times k}$ can be denoted by $\mathbf{w} \in \mathbb{R}^{k^2}$. If $x'$ defines the output of upsampling, most existing upsampling operations follow

$$ x' = \mathbf{w}^{\mathsf{T}} \mathbf{z} \,. \tag{1} $$
Note that $x'$ indicates a single upsampled point; in practice, multiple such points are generated to form an upsampled feature map. $\mathbf{w}$ may be either shared or unshared among channels depending on the upsampling operator. Different operators define different $\mathbf{w}$'s. Further, even the same $\mathbf{w}$ can be applied to different $\mathbf{z}$'s. According to how the upsampling kernel is generated, we categorize kernels into two types: the universal kernel and the customized kernel. The universal kernel is input-independent: it follows the same upsampling rule given any input; one example is deconvolution . The customized kernel, however, is input-dependent. Based on what input is used to generate the kernel, the customized kernel can be further divided into distance-based and feature-based. We elaborate as follows.
Distance-based Upsampling. Distance-based upsampling is implemented according to spatial distances, such as nearest neighbor and bilinear interpolation; the difference between them is the number of positions taken into account. Under the definition of Eq. (1), the upsampling kernel is a function of the relative distances between points. Taking bilinear interpolation with $4$ reference points as an example, $\mathbf{w} = (w_{11}, w_{12}, w_{21}, w_{22})^{\mathsf{T}}$, where $w_{11} = \frac{(x_2 - x)(y_2 - y)}{(x_2 - x_1)(y_2 - y_1)}$ given the coordinates $(x_1, y_1)$ and $(x_2, y_2)$ of two reference points, and $(x, y)$ is the coordinate of the interpolated point; $w_{12}$, $w_{21}$, and $w_{22}$ can be derived similarly. In multi-channel cases, the same $\mathbf{w}$ is shared by all channels of the input.
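As a concrete instance of such a distance-based kernel, the following snippet (our illustration, not the paper's code) computes the four bilinear weights from relative distances and applies them as $x' = \mathbf{w}^{\mathsf{T}}\mathbf{z}$:

```python
import numpy as np

def bilinear_kernel(x, y, x1, y1, x2, y2):
    """Bilinear weights for 4 reference points (illustrative only)."""
    denom = (x2 - x1) * (y2 - y1)
    w11 = (x2 - x) * (y2 - y) / denom
    w12 = (x2 - x) * (y - y1) / denom
    w21 = (x - x1) * (y2 - y) / denom
    w22 = (x - x1) * (y - y1) / denom
    return np.array([w11, w12, w21, w22])

# Interpolating at (0.25, 0.75) between grid corners (0, 0) and (1, 1):
w = bilinear_kernel(0.25, 0.75, 0, 0, 1, 1)
z = np.array([10.0, 20.0, 30.0, 40.0])  # values at (x1,y1), (x1,y2), (x2,y1), (x2,y2)
print(w.sum(), w @ z)                    # weights sum to 1; x' = w^T z
```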
Max-unpooling interpolates values following the indices returned from max-pooling. In a $2 \times 2$ region of the feature layer after upsampling, only the one position recorded in the indices has a value; the other three are filled with $0$. Since each position on the upsampled feature map is interpolated from a single point at the low-resolution layer, we can define $\mathbf{w}$ by a one-element vector $\mathbf{w} = [w]$, where $\mathbf{z} = [z]$ is also the point at the low-resolution layer. Note that $w \in \{0, 1\}$, and only one $w$ can equal $1$ in a $2 \times 2$ region of the output feature map. In multi-channel cases, $\mathbf{w}$ and $\mathbf{z}$ are different in different channels conditioned on the operator.
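In PyTorch, the pooling/unpooling index pairing looks as follows; this is standard API usage rather than the paper's code:

```python
import torch
import torch.nn as nn

# Max-unpooling must be paired with the max-pooling that produced the indices.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.rand(1, 1, 4, 4)
y, idx = pool(x)          # y: 1x1x2x2, idx records the argmax positions
x_up = unpool(y, idx)     # 1x1x4x4: one value per 2x2 region, the rest zeros

print((x_up != 0).sum())  # at most one nonzero entry per 2x2 region
```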
CARAFE learns an upsampling kernel ($\mathbf{w}$ in Eq. (1)) via a kernel generation module given a decoder feature map ready to upsample. It also conforms to Eq. (1), where $\mathbf{z}$ is obtained from the low-resolution decoder feature map. The kernel size of $\mathbf{w}$ depends on the size of $\mathbf{z}$. In multi-channel cases, the same $\mathbf{w}$ is shared among channels.
IndexNet also learns an upsampling kernel dynamically from features; the difference is that IndexNet learns from high-resolution encoder feature maps. Under the formulation of Eq. (1), the upsampling kernel follows a similar spirit to max-unpooling: $\mathbf{w} = [w]$ and $\mathbf{z} = [z]$, because each position on the upsampled feature layer is interpolated from a corresponding point on the low-resolution map by multiplying by an interpolation weight $w$. But here $w \in [0, 1]$ instead of $w \in \{0, 1\}$.
Hence, distance-based and feature-based upsampling operators have the unified form $x' = \mathbf{w}^{\mathsf{T}} \mathbf{z}$, while different operators correspond to different $\mathbf{w}$'s and $\mathbf{z}$'s, where $\mathbf{w}$ can be heuristically defined or dynamically generated. In particular, existing operators define/generate $\mathbf{w}$ according to distances or first-order features, while second-order information remains unexplored in upsampling.
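As a tiny sanity check of this unified form, the following snippet (illustrative only) shows that a fixed one-hot kernel $\mathbf{w}$ reproduces nearest-neighbor interpolation under $x' = \mathbf{w}^{\mathsf{T}}\mathbf{z}$:

```python
import torch

z = torch.rand(9)                   # a vectorized 3x3 local window Z
w = torch.zeros(9)
w[4] = 1.0                          # one-hot kernel at the window center

# Nearest neighbor is recovered as a special case of x' = w^T z
print(torch.allclose(w @ z, z[4]))  # True
```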
4 Learning Affinity-Aware Upsampling
Here we explain how we exploit second-order information to formulate the affinity idea in upsampling using a bilinear model and how we apply a low-rank approximation to reduce computational complexity.
General Formulation of Upsampling. Given a feature map $\mathsf{X} \in \mathbb{R}^{C \times H \times W}$ to be upsampled, the goal is to generate an upsampled feature map $\mathsf{X}' \in \mathbb{R}^{C \times rH \times rW}$, where $r$ is the upsampling ratio. For a position $(i', j')$ in $\mathsf{X}'$, the corresponding source position $(i, j)$ in $\mathsf{X}$ is derived by solving $i = \lfloor i'/r \rfloor$, $j = \lfloor j'/r \rfloor$. We aim to learn an upsampling kernel $\mathbf{w}$ for each position in $\mathsf{X}'$. By applying the kernel to a channel of the local feature map centered at position $(i, j)$ on $\mathsf{X}$, denoted by $Z \in \mathbb{R}^{k \times k}$, the corresponding upsampled feature point of the same channel at target position $(i', j')$ can be obtained by $x' = \mathbf{w}^{\mathsf{T}} \mathbf{z}$ according to Eq. (1), where $\mathbf{z}$ is the vectorization of $Z$.
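The following sketch (our illustration, not the official implementation) applies per-position kernels to $k \times k$ local windows and rearranges the $r^2$ outputs per source position into the upsampled map; the kernel layout and normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def apply_kernels(x, kernels, r=2, k=3):
    """Apply per-position upsampling kernels (a sketch of Eq. (1)).

    x:       (B, C, H, W) feature map to upsample
    kernels: (B, r*r*k*k, H, W), one k*k kernel per output position,
             assumed already normalized (e.g., softmax over k*k)
    returns: (B, C, r*H, r*W)
    """
    B, C, H, W = x.shape
    # z: all k*k local windows, one per source position (i, j)
    z = F.unfold(x, k, padding=k // 2).view(B, C, k * k, H, W)
    w = kernels.view(B, 1, r * r, k * k, H, W)
    # x' = w^T z for each of the r*r target positions per source position
    out = (w * z.unsqueeze(2)).sum(dim=3)  # (B, C, r*r, H, W)
    # Pixel-shuffle-style rearrangement to full resolution
    out = out.view(B, C, r, r, H, W).permute(0, 1, 4, 2, 5, 3)
    return out.reshape(B, C, r * H, r * W)

x = torch.rand(1, 8, 16, 16)
kernels = torch.rand(1, 4, 9, 16, 16).softmax(dim=2).view(1, 36, 16, 16)
print(apply_kernels(x, kernels).shape)  # torch.Size([1, 8, 32, 32])
```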
General Meaning of Affinity. Affinity is often used to indicate pairwise similarity and is considered a second-order feature. An affinity map can be constructed in different ways, e.g., using a Gaussian kernel. In self-attention, the affinity between position $i$ and the enumeration of all possible positions $j$ on a feature map is denoted by $f(\mathbf{x}_i, \mathbf{x}_j)$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ represent the two vectors at positions $i$ and $j$, respectively, and $f$ measures the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ with the inner product $f(\mathbf{x}_i, \mathbf{x}_j) = \theta(\mathbf{x}_i)^{\mathsf{T}} \phi(\mathbf{x}_j)$.
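For concreteness, a minimal PyTorch sketch of this embedded inner-product affinity follows; treating $\theta$ and $\phi$ as $1 \times 1$ convolutions is our assumption for illustration:

```python
import torch
import torch.nn as nn

C, H, W = 16, 8, 8
theta = nn.Conv2d(C, C // 2, kernel_size=1)  # embedding theta (assumed 1x1 conv)
phi = nn.Conv2d(C, C // 2, kernel_size=1)    # embedding phi (assumed 1x1 conv)

x = torch.rand(1, C, H, W)
t = theta(x).flatten(2)                      # (1, C/2, HW)
p = phi(x).flatten(2)                        # (1, C/2, HW)

# f(x_i, x_j) = theta(x_i)^T phi(x_j): pairwise similarity over all positions
affinity = torch.einsum('bci,bcj->bij', t, p)  # (1, HW, HW)
print(affinity.shape)                           # torch.Size([1, 64, 64])
```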
Affinity-Aware Upsampling via Bilinear Modeling. Given a local feature map $\mathsf{Z} \in \mathbb{R}^{C \times k \times k}$, $\mathsf{Z}$ has an equivalent matrix form $\mathsf{X} \in \mathbb{R}^{C \times k^2}$, where $\mathsf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_{k^2}]$, $\mathbf{x}_i \in \mathbb{R}^{C}$. We aim to learn an upsampling kernel conditioned on $\mathsf{X}$. Previous learning-based upsampling operators [31, 21, 22] generate each value of the upsampling kernel following a linear model, $w = \sum_{c=1}^{C} \sum_{i=1}^{k^2} a_{ci}\, x_{ci}$, where $a_{ci}$ and $x_{ci}$ are the weight and the feature at the $c$-th channel and $i$-th position of $\mathsf{X}$, respectively. Note that $w \in \mathbb{R}$. To encode second-order information, a natural generalization of the linear model above is bilinear modeling, where another feature matrix $\mathsf{Y} \in \mathbb{R}^{C \times l^2}$ transformed from the feature map ($\mathsf{Y} = g(\mathsf{X})$) is introduced to pair with $\mathsf{X}$ to model affinity. Given each $\mathbf{x}_i$ in $\mathsf{X}$, each $\mathbf{y}_j$ in $\mathsf{Y}$, the bilinear weight $a_{ij}$ of the vector pair, and the embedding weights $\mathbf{p}$ and $\mathbf{q}$ for each channel of $\mathsf{X}$ and $\mathsf{Y}$, we propose to generate each value of the upsampling kernel from embedded pairwise similarity, i.e.,

$$ w = \sum_{c=1}^{C} \theta(\mathsf{X}^c)\, \mathsf{A}^c\, \phi(\mathsf{Y}^c)^{\mathsf{T}} \,, \tag{2} $$

where $\mathsf{X}^c$ and $\mathsf{Y}^c$ are the $c$-th channel of $\mathsf{X}$ and $\mathsf{Y}$, respectively, $\mathsf{A}^c \in \mathbb{R}^{k^2 \times l^2}$ is the affinity matrix for the $c$-th channel, $a_{ij} \in \mathsf{A}^c$, and $\theta$ and $\phi$ represent the embedding functions.
Factorized Affinity-Aware Upsampling. Learning $\mathsf{A}$ can be expensive when $k$ and $l$ are large. Inspired by [12, 36], a low-rank bilinear method can be derived to reduce the computational complexity of Eq. (2). Specifically, $\mathsf{A}$ can be rewritten as $\mathsf{A} = \mathsf{P} \mathsf{Q}^{\mathsf{T}}$, where $\mathsf{P} \in \mathbb{R}^{k^2 \times r}$, $\mathsf{Q} \in \mathbb{R}^{l^2 \times r}$, and $r$ represents the rank of $\mathsf{A}$ under the constraint $r \le \min(k^2, l^2)$. Eq. (2) therefore can be rewritten as

$$ w = \sum_{c=1}^{C} \mathbb{1}^{\mathsf{T}} \left( \mathsf{P}^{\mathsf{T}}\, \theta(\mathsf{X}^c)^{\mathsf{T}} \circ \mathsf{Q}^{\mathsf{T}}\, \phi(\mathsf{Y}^c)^{\mathsf{T}} \right) , \tag{3} $$

where $\mathbb{1} \in \mathbb{R}^{r}$ is a column vector of ones and $\circ$ denotes the Hadamard product. Since we need to generate a $k_u \times k_u$ upsampling kernel, the scalar $w$ in Eq. (3) can be replaced with a vector of $k_u^2$ kernel values. Note that Eq. (3) is applied to each position of a feature map, so the inner products here can be implemented by convolution. The full upsampling kernel therefore can be generated by

$$ \mathsf{W} = \left( \mathsf{K}_{\mathsf{P}} \circledast \theta(\mathsf{X}) \right) \circ \left( \mathsf{K}_{\mathsf{Q}} \circledast \phi(\mathsf{X}) \right) , \tag{4} $$

where $\theta(\mathsf{X}) = \mathsf{K}_{\theta} * \mathsf{X}$ and $\phi(\mathsf{X}) = \mathsf{K}_{\phi} * \mathsf{X}$. The convolution kernels $\mathsf{K}_{\theta}$ ($\mathsf{K}_{\phi}$) and $\mathsf{K}_{\mathsf{P}}$ ($\mathsf{K}_{\mathsf{Q}}$) are reshaped tensor versions of $\mathbf{p}$ ($\mathbf{q}$) and $\mathsf{P}$ ($\mathsf{Q}$), respectively. $*$ represents a convolution operation on the feature map with the corresponding kernel; $\circledast$ defines a group convolution operation with the same input, which in implementation operates on the concatenation $[\theta(\mathsf{X}), \phi(\mathsf{X})]$, where $[\cdot, \cdot]$ is the concatenation operator. This process is visualized in Fig. 2.
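To make the low-rank rewriting concrete, the following sketch (ours, illustrative) numerically checks the identity behind Eq. (3), $\mathbf{x}^{\mathsf{T}}(\mathsf{P}\mathsf{Q}^{\mathsf{T}})\mathbf{y} = \mathbb{1}^{\mathsf{T}}(\mathsf{P}^{\mathsf{T}}\mathbf{x} \circ \mathsf{Q}^{\mathsf{T}}\mathbf{y})$, for one embedded channel:

```python
import torch

k2, l2, r = 9, 9, 4                 # k*k, l*l flattened sizes and rank r
x = torch.rand(k2)                  # theta(X^c)^T: one embedded channel of X
y = torch.rand(l2)                  # phi(Y^c)^T: one embedded channel of Y
P = torch.rand(k2, r)
Q = torch.rand(l2, r)

full = x @ (P @ Q.T) @ y                   # bilinear form with A = PQ^T (Eq. (2))
low_rank = ((P.T @ x) * (Q.T @ y)).sum()   # 1^T (P^T x o Q^T y) (Eq. (3))
print(torch.allclose(full, low_rank))      # True
```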
Alternative Implementations. Eq. (4) is a generic formulation. In practice, many design choices can be discussed in implementation:
The selection of $\mathsf{X}$ and $\mathsf{Y}$ can be either the same or different. In this paper, we only discuss self-similarity, i.e., $\mathsf{Y} = \mathsf{X}$;
The rank $r$ can be chosen in the range $[1, \min(k^2, l^2)]$. For example, if $\mathsf{X}$ and $\mathsf{Y}$ are extracted from $k \times k$ regions, the range will be $[1, k^2]$. In our experiments, we set $r = 1$ to explore the most simplified and light-weight case;
$\theta$ and $\phi$ can be considered two encoding functions. They can be shared, partly shared, or unshared among channels. We discuss the two extreme cases in the experiments: 'channel-shared' ('cs') and 'channel-wise' ('cw');
Eq. (4) adjusts the size of the upsampling kernel only through $\mathsf{K}_{\mathsf{P}}$ ($\mathsf{K}_{\mathsf{Q}}$). Since the low-rank approximation has fewer parameters, fixed embedding and projection weights may not be sufficient to model all local variations. Inspired by CondConv , we attempt to generate these weights dynamically conditioned on the input. We investigate three implementations: 1) static: none of the weights is input-dependent; 2) hybrid: only the low-rank projection weights are conditioned on the input; and 3) dynamic: the embedding and projection weights are all conditioned on the input. The dynamic generation of a weight is implemented using global average pooling followed by a convolution layer;
We implement stride-2 $\theta$ and $\phi$ in our experiments, which output features at half of the input resolution. To generate an upsampling kernel of size $k_u \times k_u$ at full resolution, one can either use different sets of weights for $\mathsf{P}$ and $\mathsf{Q}$ at each output position or fewer sets of weights followed by a shuffling operation (pixel shuffle). We denote the former case as 'pointwise' ('pw'). Further, as pointed out in , nonlinearity, e.g., tanh or relu, can be added after the encodings $\theta(\mathsf{X})$ and $\phi(\mathsf{X})$. We verify a similar idea by adding normalization and nonlinearity in the experiments (a minimal sketch combining these pieces follows this list).
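Below is a minimal, static sketch of affinity-aware kernel generation in PyTorch. It is our illustration under stated assumptions, not the authors' exact design: self-similarity ($\mathsf{Y} = \mathsf{X}$), rank $r = 1$, channel-shared embeddings, ratio-2 upsampling with a pixel shuffle to lift kernels to full resolution, and the 'sigmoid+softmax' normalization discussed later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AUKernelGenerator(nn.Module):
    """A minimal, static sketch of affinity-aware kernel generation.

    Assumptions (not the paper's exact implementation): Y = X, r = 1,
    channel-shared embeddings, sigmoid+softmax normalization, and a
    pixel shuffle that lifts kernels to the upsampled resolution.
    """
    def __init__(self, c_in, k_enc=3, k_up=5, ratio=2):
        super().__init__()
        self.ratio = ratio
        n = ratio * ratio * k_up * k_up
        self.theta = nn.Conv2d(c_in, n, k_enc, padding=k_enc // 2)  # theta branch
        self.phi = nn.Conv2d(c_in, n, k_enc, padding=k_enc // 2)    # phi branch

    def forward(self, x):
        # Embedded pair combined by a Hadamard product (the second-order term)
        w = self.theta(x) * torch.sigmoid(self.phi(x))
        w = F.pixel_shuffle(w, self.ratio)  # -> (B, k_up*k_up, rH, rW)
        return F.softmax(w, dim=1)          # normalize each per-position kernel

gen = AUKernelGenerator(c_in=32)
kernels = gen(torch.rand(1, 32, 16, 16))
print(kernels.shape)  # torch.Size([1, 25, 32, 32]): one 5x5 kernel per output position
```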
[Table 1: image reconstruction results on MNIST and Fashion-MNIST; for each dataset, PSNR (higher is better), SSIM (higher is better), MSE (lower is better), and MAE (lower is better) are reported.]
Extension to Downsampling. Following , our method can also be extended to downsampling. Downsampling is paired with upsampling, so their kernels are generated from the same encoder feature. We use 'd' to indicate the use of paired downsampling in the experiments. We share the same $\theta$ and $\phi$ in Eq. (4) for both downsampling and upsampling, but use different projection weights, considering that the two kernels may have different sizes. We denote the upsampling kernel size by $k_u$ and the downsampling kernel size by $k_d$, where $r$ is the ratio of upsampling/downsampling ($r = 2$ in our experiments).
5 Image Reconstruction and Analysis
Here we conduct pilot image reconstruction experiments on toy datasets to show the effectiveness of AU. Inspired by , we build reconstruction experiments on the MNIST dataset  and the Fashion-MNIST dataset . The motivation is to verify whether exploiting second-order information in upsampling benefits the recovery of spatial information.
We denote by C a convolution layer (stride 1 unless stated) followed by BatchNorm and ReLU, by D a downsampling operator, and by U an upsampling operator. The network is built as an encoder-decoder by stacking C and D blocks followed by C and U blocks, with downsampling and upsampling stages in pairs. The same training strategies and evaluation metrics are used throughout (see Appendix A). Since the training patches are relatively small, the upsampling kernel sizes for CARAFE and AU are set to the same value, and so are the encoding convolution kernels in IndexNet and AU; other settings keep their defaults. We apply the 'static-pw-cw' AU here because it reduces to Holistic IndexNet if the output of one embedding branch is all ones; we hence add a sigmoid function after that branch so that AU generalizes IndexNet. To avoid extra layers, we apply max-pooling to obtain high-resolution layers when validating IndexNet and AU. Reconstruction results are presented in Table 1.
As shown in Table 1, upsampling operators informed by features (max-unpooling, CARAFE, IndexNet, and AU) outperform operators guided by spatial distances (nearest, bilinear, and bicubic). Moreover, learning from high-resolution features matters for upsampling; among such operators, the learning-based ones (IndexNet, AU) achieve the best results. Further, it is worth noting that AU performs better than IndexNet with even fewer parameters. From these observations, we believe that, in upsampling: 1) high-resolution features are beneficial for extracting spatial information, and 2) second-order features can help recover more spatial details than first-order ones.
6 Experiments and Discussions
Here we evaluate AU on deep image matting. This task is suitable for assessing the quality of modeling pairwise relations.
6.1 Network Architecture
Similar to , our baseline network adopts the first layers of ResNet34  as the encoder. The decoder consists of residual blocks and upsampling stages. The In-Place Activated BatchNorm  is applied to each layer except the last one to reduce GPU memory consumption. As shown in Fig. 3, the overall network follows the UNet architecture  with skip connections. To apply AU to upsampling, we replace the upsampling operations in the decoder with AU modules; specifically, we learn the upsampling kernels from the skipped features. If AU is used in both upsampling and downsampling stages, we change all 2-stride convolution layers in the encoder to 1-stride and implement paired downsampling and upsampling operations by learning the downsampling/upsampling kernels from the modified 1-stride feature layers.
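As an illustration of this encoder modification, the hedged snippet below switches the 2-stride layers of a torchvision ResNet34 stage to stride 1; `strides_to_one` is a hypothetical helper, and a paired learned downsampling operator is assumed to take over the resolution reduction:

```python
import torch.nn as nn
from torchvision.models import resnet34

def strides_to_one(module: nn.Module):
    """Set every 2-stride conv/pooling layer to stride 1 (illustrative only)."""
    for m in module.modules():
        if isinstance(m, (nn.Conv2d, nn.MaxPool2d)) and m.stride in (2, (2, 2)):
            m.stride = 1 if isinstance(m, nn.MaxPool2d) else (1, 1)

encoder = resnet34(weights=None)
strides_to_one(encoder.layer2)  # e.g., disable the built-in stride of one stage
```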
6.2 Datasets
We mainly conduct our experiments on the Adobe Image Matting dataset . Its training set has 431 unique foreground objects with ground-truth alpha mattes. Instead of compositing each foreground with fixed background images chosen from MS COCO, we randomly choose the background images in each iteration and generate the composited images on-the-fly. The test set, termed the Composition-1k, contains 50 unique foreground objects; each foreground is composited with 20 background images from the PASCAL VOC dataset.
We also evaluate our method on the alphamatting.com benchmark . This online benchmark has 8 unique test images and 3 different trimaps for each image, providing 24 test cases.
Further, we report results on the recently proposed Distinctions-646 dataset . It has 596 foreground objects in the training set and 50 foreground objects in the test set. We generate the training data and the test set following the same protocol as on the Adobe Image Matting dataset.
6.3 Implementation Details
Our implementation is based on PyTorch . Here we describe the training details on the Adobe Image Matting dataset. The 4-channel input concatenates the RGB image and its trimap. We mainly follow the data augmentation of : two foreground objects are first chosen with a certain probability and composited to generate a new foreground image and a new alpha matte; next, they are randomly resized; random affine transformations are then applied. Trimaps are randomly dilated from the ground-truth alpha mattes, followed by random cropping. The background image is randomly chosen from the MS COCO dataset . After imposing random jitters on the foreground object, the RGB image is finally generated by composition.
The backbone is pretrained on ImageNet , and the Adam optimizer  is used. We use the same loss function as [34, 21], including the alpha prediction loss and the composition loss computed over the unknown regions indicated by the trimaps. We update the parameters for a fixed number of epochs with a fixed batch size; each epoch has a fixed number of iterations, and the BN layers in the backbone are frozen. The learning rate is initialized to a fixed value and decayed twice at predefined epochs. The training strategies on the Distinctions-646 dataset are the same, except that we update the parameters for fewer epochs. We evaluate our results using the Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Gradient error (Grad), and Connectivity error (Conn) , following the evaluation code provided by .
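For reference, a sketch of the alpha-prediction and composition losses restricted to the unknown region follows, in the formulation popularized by Deep Image Matting; the `eps` value and the `trimap == 128` convention for unknown pixels are our assumptions:

```python
import torch

def matting_losses(alpha_pred, alpha_gt, fg, bg, image, trimap, eps=1e-6):
    """Alpha-prediction + composition losses over unknown regions (a sketch).

    All tensors are (B, *, H, W); trimap == 128 marks the unknown region.
    The Charbonnier-style sqrt follows Deep Image Matting; eps is assumed.
    """
    unknown = (trimap == 128).float()
    n = unknown.sum() + 1e-8

    # L_alpha: penalize alpha deviation only where the trimap is unknown
    l_alpha = (torch.sqrt((alpha_pred - alpha_gt) ** 2 + eps ** 2) * unknown).sum() / n

    # L_comp: penalize the re-composited RGB image over the same region
    comp = alpha_pred * fg + (1 - alpha_pred) * bg
    l_comp = (torch.sqrt((comp - image) ** 2 + eps ** 2) * unknown).sum() / (3 * n)

    return l_alpha + l_comp

B, H, W = 2, 8, 8
loss = matting_losses(torch.rand(B, 1, H, W), torch.rand(B, 1, H, W),
                      torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                      torch.rand(B, 3, H, W), torch.full((B, 1, H, W), 128.0))
print(loss)
```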
6.4 The Adobe Image Matting Dataset
Ablation Study on Alternative Implementations. Here we verify different implementations of AU on the Composition-1k test set and compare them with existing upsampling operators. Quantitative results are shown in Table 2. All models share the same architecture but use different upsampling operators. The 'nearest' and 'bilinear' operators are our direct baselines; they achieve close performance with the same model capacity. For CARAFE, we use its default settings as in . We observe that CARAFE has a negative effect on performance. The idea behind CARAFE is to reassemble contextual information, which is not the focus of matting, where subtle details matter. It is interesting, however, that CARAFE can still be useful for matting when it follows a light-weight MobileNetV2 backbone ; one possible explanation is that a better backbone (ResNet34) suppresses the advantages of context reassembling. We report results of IndexNet with its best-performing setting ('depthwise+context+nonlinear') in [21, 22]; the upsampling indices are learned from the skipped feature layers. IndexNet achieves a notable improvement, especially on the Grad metric, but it significantly increases the number of parameters.
We further investigate different implementations of AU and another version with paired downsampling and upsampling. According to the results, the 'static' setting only improves the SAD and Conn metrics. The position-wise and position-shared settings report comparable results, so we fix the position-shared setting in the following 'hybrid' and 'dynamic' experiments. We verify both channel-wise and channel-shared settings for the 'hybrid' and 'dynamic' models: the 'hybrid' achieves higher performance with the channel-wise design, while the 'dynamic' performs better with the channel-shared design. All 'hybrid' and 'dynamic' models show improvements over the baselines on all metrics, except the MSE and Grad metrics for the channel-shared 'hybrid' model. The last implementation, where channel-shared 'dynamic' downsampling is paired with upsampling, achieves the best performance, with clear relative improvements against the baseline and a negligible increase of parameters.
Hence, while the dedicated design of upsampling operators matters, paired downsampling and upsampling seems more important, at least for image matting.
Ablation Study on the Upsampling Kernel. Here we investigate the performance of our models with different upsampling kernel sizes. The encoding kernel size (the kernel size of $\mathsf{K}_{\theta}$ and $\mathsf{K}_{\phi}$) is fixed in all matting experiments unless otherwise stated. Under this setting, the results in Table 3 show that a small upsampling kernel performs the best; it is interesting to observe that a larger upsampling kernel does not imply better performance. We believe this is related to the encoding kernel size and the way we generate the kernel weights. We use the best-performing kernel size as our default setting.
Ablation Study on Normalization. In both  and , different normalization strategies are verified, and experiments show that normalization significantly affects the results. We thus justify the normalization choices in our AU module here. We conduct the experiments on the channel-wise 'hybrid' model and the channel-shared 'dynamic' model. Two normalization choices are considered: 'softmax' and 'sigmoid+softmax'. It is clear that the latter works better (Table 4), which may boil down to the nonlinearity introduced by the sigmoid function.
Comparison with the State of the Art. Here we compare our models against other state-of-the-art methods on the Composition-1k test set. Results are shown in Table 5. Our models outperform other methods on all the evaluation metrics with the minimum model capacity. Compared with the state-of-the-art method , our best model achieves higher performance with only a fraction of its model complexity. Our model is also memory-efficient: it can infer high-resolution images on a single 1080Ti GPU without downsampling on the Composition-1k test set. Qualitative results are shown in Fig. 4; our results exhibit improved detail delineation, e.g., on the net structure and the filament.
[Table 5 (excerpt): results on the Composition-1k test set; lower is better for all metrics.]
| Method | SAD | MSE | Grad | Conn | #Params |
| KNN Matting  | 175.4 | 0.103 | 124.1 | 176.4 | - |
| Deep Matting  | 50.4 | 0.014 | 31.0 | 50.8 | M |
| IndexNet Matting  | 45.8 | 0.013 | 25.9 | 43.7 | 8.15M |
| GCA Matting  | 35.28 | 0.0091 | 16.9 | 32.5 | 25.27M |
6.5 The alphamatting.com Benchmark
Here we report results on the online alphamatting.com benchmark . We follow  to train our model with all the data in the Adobe matting dataset and then test it on the benchmark. As shown in Table 6, our method ranks first w.r.t. the gradient error among all published methods. We also achieve overall rankings comparable to AdaMatting  under the SAD and MSE metrics, suggesting that our method is among the top-performing methods on this benchmark.
[Table 6 (excerpt): gradient error and average ranks on the alphamatting.com benchmark over the eight test images (Troll, Doll, Donkey, Elephant, Plant, Pineapple, Plastic bag, Net), each with three trimaps; rows for GCA Matting  and Context-aware Matting  are among the entries compared.]
6.6 The Distinctions-646 Dataset
We also evaluate our method on the recent Distinctions-646 test set. In Table 7, we report results of the three models that perform the best on the Composition-1k dataset and also compare with other benchmarking results provided by . We have two observations: 1) our models show improved performance against the baseline, which further confirms the effectiveness of AU; and 2) our models outperform other reported benchmarking results by large margins, setting a new state of the art on this dataset.
[Table 7 (excerpt): results on the Distinctions-646 test set; lower is better for all metrics.]
| Method | SAD | MSE | Grad | Conn |
| KNN Matting  | 116.68 | 0.025 | 103.15 | 121.45 |
| Deep Matting  | 47.56 | 0.009 | 43.29 | 55.90 |
6.7 Visualization of Upsampling Kernels
Here we visualize the learned upsampling kernel in a ‘hybrid’ model to showcase what is learned by the kernel. Two examples are illustrated in Fig. 5. We observe that, after learning, boundary details are highlighted, while flat regions are weakened.
7 Conclusion
Considering that affinity is widely exploited in dense prediction, we explore the feasibility of modeling such second-order information in upsampling to build compact models. We implement this idea with a low-rank bilinear formulation, based on a generalized mathematical view of upsampling. We show that, with a negligible increase of parameters, our method AU achieves better performance on both image reconstruction and image matting tasks. We also investigate different design choices of AU. Results on three image matting benchmarks all show that AU achieves significant relative improvements and state-of-the-art results. In particular, compared with the best-performing image matting network, our model achieves higher performance on the Composition-1k test set with only a fraction of its model capacity. For future work, we plan to extend AU to other dense prediction tasks.
Appendix A Training Details of Image Reconstruction
The image reconstruction experiments are implemented on the MNIST dataset  and the Fashion-MNIST dataset . They both include 60,000 training images and 10,000 test images. During training, the input images are resized to a fixed resolution, and a pixel-wise reconstruction loss is used. We use the SGD optimizer with a fixed batch size; the learning rate is decayed three times at predefined epochs during training. The evaluation metrics are the Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity (SSIM), Mean Absolute Error (MAE), and Mean Squared Error (MSE).
Appendix B Analysis of Complexity
Here we summarize the model complexity of different implementations of AU in Table 8. We assume that the encoding kernel size is $k_e$, the upsampling kernel size is $k_u$, and the channel number of the feature map is $C$. Since $C$ is much larger than $k_e$ and $k_u$, AU generally has a complexity linear in $C$.
Appendix C Qualitative Results
We show additional qualitative results on the alphamatting.com benchmark  in Fig. 6. Several top-performing methods are visualized. Since all these methods achieve good performance and their quantitative results on the benchmark are very close, it is difficult to observe obvious differences in Fig. 6. It is worth noting, however, that our method produces better visual results on detailed structures, such as the gridding of the net and the leaves of the pineapple.
We also show qualitative results on the Distinction-646 test set  in Fig. 7. Since no implementation of other deep methods on this benchmark is publicly available, we only present the results of our baseline and our method here to show the relative improvements. According to Fig. 7, our method produces clearly better predictions on highly transparent objects such as the bubbles.
References
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), pp. 2481–2495.
- (2019) Disentangled image matting. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 8819–8828.
- (2013) KNN matting. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(9), pp. 2175–2188.
- (2018) Depth estimation via affinity learned with convolutional spatial propagation network. In Proc. European Conference on Computer Vision (ECCV), pp. 103–119.
- (2016) Natural image matting using deep convolutional neural networks. In Proc. European Conference on Computer Vision (ECCV), pp. 626–643.
- (2001) A Bayesian approach to digital matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. II–II.
- (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), pp. 303–338.
- (2019) SSAP: single-shot instance segmentation with affinity pyramid. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 642–651.
- (2011) A global sampling method for alpha matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2049–2056.
- (2016) Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2019) Context-aware image matting for simultaneous foreground and alpha estimation. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 4130–4139.
- (2016) Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2017) ImageNet classification with deep convolutional neural networks. Communications of the ACM 60(6), pp. 84–90.
- (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- (2007) A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2), pp. 228–242.
- (2020) Natural image matting via guided contextual attention. In Proc. AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11450–11457.
- (2014) Microsoft COCO: common objects in context. In Proc. European Conference on Computer Vision (ECCV), pp. 740–755.
- (2017) Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1520–1530.
- (2015) Fully convolutional networks for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
- (2019) Indices matter: learning to index for deep image matting. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 3266–3275.
- (2020) Index networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2018) AlphaGAN: generative adversarial networks for natural image matting. In Proc. British Machine Vision Conference (BMVC).
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NIPS), pp. 8026–8037.
- (2020) Attention-guided hierarchical structure aggregation for image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13676–13685.
- (2009) A perceptually motivated online benchmark for image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1826–1833.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
- (2018) In-place activated BatchNorm for memory-optimized training of DNNs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1874–1883.
- (2019) Learning-based sampling for natural image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3055–3063.
- (2019) CARAFE: content-aware reassembly of features. In Proc. IEEE International Conference on Computer Vision (ICCV), pp. 3007–3016.
- (2018) Deep propagation based image matting. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), Vol. 3, pp. 999–1006.
- (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- (2017) Deep image matting. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2970–2979.
- (2019) CondConv: conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems (NIPS), pp. 1307–1318.
- (2018) Hierarchical bilinear pooling for fine-grained visual recognition. In Proc. European Conference on Computer Vision (ECCV), pp. 574–589.