Classical approaches to depth map upsampling mainly focus on designing the filter kernels and the set of neighbors (i.e., sampling locations). They use hand-crafted kernels and predefined neighbors without learning [16, 27, 48]. For example, the guided filter uses spatially-variant matting Laplacian kernels to encode local structures from the HR color image. These methods use regularly sampled neighbors for aggregating pixels, and do not handle inconsistent structures in the HR color and LR depth images, causing texture-copying artifacts. To address this problem, both HR color and LR depth images have been used to extract common structures [14, 15, 31, 32]. Recently, learning-based approaches using CNNs [20, 31, 32] have also become increasingly popular. The networks are trained on large quantities of data, capturing natural image priors and often outperforming traditional methods by large margins. These methods do not use a weighted averaging process. Instead, they combine nonlinear activations of spatially-invariant kernels learned by the networks; that is, they approximate spatially-variant kernels by mixing the activations of spatially-invariant ones nonlinearly (e.g., via the ReLU function).
In this paper, we propose to exploit spatially-variant kernels explicitly to encode the structural details from both HR color and LR depth images, as in classical approaches, but to learn the kernel weights in a supervised manner. We also learn the set of neighbors, building an adaptive and sparse neighborhood system for each pixel. This also allows sub-pixel information aggregation, which may be difficult to achieve by hand. To implement this idea, we propose a CNN architecture and its efficient implementation, called a deformable kernel network (DKN), for learning the sampling locations of the neighboring pixels and their corresponding kernel weights at every pixel. We also propose a fast version of DKN (FDKN), achieving roughly a 17-times speed-up over the plain DKN on an HR image while retaining its superior performance. We show that the weighted averaging process, even trained with the LR depth map only, without any guidance, and with sparsely sampled points, is sufficient to obtain a new state of the art (Fig. 1). Our code and models are available online: https://cvlab-yonsei.github.io/projects/DKN
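The weighted averaging process with per-pixel learned neighbors can be sketched as follows. This is a minimal NumPy illustration with hypothetical array shapes, not the paper's implementation: offsets are rounded to integer positions here for simplicity, whereas the actual model samples at fractional locations.

```python
import numpy as np

def sparse_weighted_average(depth_lr_up, offsets, weights):
    """Weighted average with per-pixel learned neighbors (hypothetical shapes).

    depth_lr_up : (H, W) bicubic-upsampled LR depth map.
    offsets     : (H, W, K, 2) per-pixel sampling offsets (dy, dx), rounded
                  to integers here; the model samples at fractional positions.
    weights     : (H, W, K) per-pixel kernel weights.
    """
    H, W = depth_lr_up.shape
    K = weights.shape[-1]
    out = np.zeros_like(depth_lr_up)
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for k in range(K):
                dy, dx = offsets[y, x, k]
                # Clamp the sampled location to the image boundary.
                yy = int(np.clip(y + round(float(dy)), 0, H - 1))
                xx = int(np.clip(x + round(float(dx)), 0, W - 1))
                acc += weights[y, x, k] * depth_lr_up[yy, xx]
            out[y, x] = acc
    return out
```

With zero offsets and unit weights this reduces to the identity, which makes it easy to sanity-check before plugging in learned quantities.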
Contributions. The main contributions of this paper can be summarized as follows:
We introduce a novel variant of the classical guided weighted averaging process for depth map upsampling and its implementation, the DKN, that computes the set of neighbors and their corresponding weights adaptively for individual pixels.
We propose a fast version of DKN (FDKN) that runs about 17 times faster than the DKN while retaining its superior performance.
We achieve a new state of the art, outperforming all existing methods we are aware of by a large margin, and clearly demonstrating the advantage of our approach to learning both kernel weights and sampling locations. We also provide an extensive experimental analysis to investigate the influence of all the components and parameters of our model.
2 Related work
Here we briefly describe the context of our approach, and review representative works related to ours.
Depth map upsampling. We categorize depth map upsampling into explicit/implicit weighted-average methods and learning-based ones. First, explicit weighted-average methods compute the output at each pixel by a weighted average of neighboring pixels in the LR depth image, where the weights are estimated from the HR color image [16, 27] to transfer fine-grained structures. The bilateral [27, 48] and guided filters are representative methods that have been successfully adapted to depth map upsampling. They use hand-crafted kernels to estimate the weights, which may transfer erroneous structures to the target image. Second, implicit weighted-average methods formulate depth map upsampling as an optimization problem and minimize an objective function that usually involves fidelity and regularization terms [9, 15, 14, 33, 39]. The fidelity term encourages the output to be close to the LR depth image, and the regularization term encourages the output to have a structure similar to that of the HR color image. Although, unlike explicit ones, implicit weighted-average methods exploit global structures in the HR color image, hand-crafted regularizers may not capture structural priors. Finally, learning-based methods can be further categorized into dictionary- and CNN-based approaches. Dictionary-based methods exploit the relationship between paired LR and HR depth patches, additionally coupled with the HR color image [10, 29, 52]. In CNN-based methods [20, 31, 32], an encoder-decoder architecture is typically used to learn features from the HR color and/or LR depth images, and the output is then regressed directly from the network. Other methods [40, 41] integrate variational optimization into CNNs by unrolling the optimization steps of the primal-dual algorithm, which requires two stages in training and a number of iterations in testing. Similar to implicit weighted-average methods, they use hand-crafted regularizers, which may not capture structural priors.
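As an illustration of the hand-crafted kernels used by explicit weighted-average methods, joint bilateral weights combine a spatial Gaussian with a range Gaussian computed on the HR guidance image. The sketch below is a generic single-pixel formulation with hypothetical parameter values, not the exact formulation of any cited method:

```python
import numpy as np

def joint_bilateral_weights(guide_patch, center_rgb, sigma_s=1.0, sigma_r=0.1):
    """Hand-crafted joint bilateral weights for one pixel (generic sketch).

    guide_patch : (k, k, 3) HR color neighborhood around the pixel.
    center_rgb  : (3,) HR color value at the pixel.
    """
    k = guide_patch.shape[0]
    r = k // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    # Spatial term: Gaussian on the (fixed, regular-grid) neighbor positions.
    spatial = np.exp(-(ys**2 + xs**2) / (2 * sigma_s**2))
    # Range term: Gaussian on color differences in the guidance image.
    diff2 = np.sum((guide_patch - center_rgb)**2, axis=-1)
    rng = np.exp(-diff2 / (2 * sigma_r**2))
    w = spatial * rng
    return w / w.sum()  # normalized weights summing to 1
```

Note the contrast with our approach: here both the kernel shape and the regular-grid neighbors are fixed by hand, whereas DKN learns both the weights and the sampling locations.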
Our method borrows from both explicit weighted-average methods and CNN-based ones. Unlike existing explicit weighted-average methods [16, 27], which use hand-crafted kernels and neighbors defined on a fixed regular grid, we leverage CNNs to learn the set of sparsely chosen neighbors and their corresponding weights adaptively. Our method differs from previous CNN-based ones [20, 31, 32] in that we learn sparse and spatially-variant kernels for each pixel to obtain upsampling results as a weighted average. The bucketing stretch in single image super-resolution [12, 42] can be seen as a non-learning-based approach to filter selection. It assigns a single filter by solving a least-squares problem for a set of similar patches (buckets). In contrast, our model learns different filters using CNNs even for similar RGB patches, since we learn them from a set of multi-modal images (i.e., pairs of RGB/D images).
Variants of the spatial transformer. Recent works introduce more flexible and effective CNN architectures. Jaderberg et al. propose a novel learnable module, the spatial transformer, that outputs the parameters of a desired spatial transformation (e.g., affine or thin-plate spline) given a feature map or an input image. The spatial transformer makes a standard CNN for classification invariant to a set of geometric transformations, but it has a limited capability of handling local transformations. Most similar to ours are the dynamic filter network and its variants (the adaptive convolution network and the kernel prediction networks [3, 37, 49]), where a set of local transformation parameters is generated adaptively conditioned on the input image. The main differences between our model and these works are two-fold. First, our network is more general in that it is not limited to learning spatially-variant kernels, but also learns the sampling locations of neighbors. This allows aggregating only sparse but highly related samples, enabling an efficient implementation in terms of speed and memory and achieving state-of-the-art results even when aggregating just 9 sparsely chosen samples. For comparison, the adaptive convolution and kernel prediction networks require many more samples [3, 49]. As will be seen in our experiments, learning the sampling locations of neighbors clearly boosts performance compared to learning kernel weights only. Second, like other guided image filtering approaches [15, 16, 27, 31, 32], our model is easily adapted to other tasks such as saliency map upsampling, cross-modality image restoration, texture removal, and semantic segmentation. We focus here on depth upsampling, but see the supplement for some examples. In contrast, the adaptive convolution network is specialized to video frame interpolation, and kernel prediction networks are applicable only to denoising Monte Carlo renderings [3, 49] or burst denoising. Our work is also related to the deformable convolutional network. The basic idea of deformable convolutions is to add offsets to the sampling locations defined on a regular grid in standard CNNs. The deformable convolutional network samples features directly at learned offsets, but shares the same weights across different sets of offsets as in standard CNNs. In contrast, we use spatially-variant weights for each sampling location.
| Feature extraction | Weight regression |
|---|---|
| Input (Receptive field) | Conv() |
| DownConv()-ReLU | Mean subtraction or |
| Conv()-BN-ReLU | L1 norm. (w/o Res.) |
Table 1: Network architecture details. “BN” and “Res.” denote batch normalization and a residual connection, respectively. “DownConv” denotes convolution with stride 2. The inputs of our network are a 3-channel HR color image and a 1-channel LR depth image. For the model without the residual connection, we use an L1 normalization layer (“L1 norm.”) instead of subtracting mean values for weight regression.
In this section, we briefly describe our approach, and present a concrete network architecture. We then describe a fast version of DKN.
Our network mainly consists of two parts (Fig. 2): We first learn spatially-variant kernel weights and spatial sampling offsets w.r.t. the regular grid. To this end, a two-stream CNN, where each sub-network has the same structure (but different parameters), uses the guidance (HR color) and target (LR depth) images to extract features that are used to estimate the kernel weights and the offsets. We then compute a weighted average using the learned kernel weights and the sampling locations computed from the offsets to obtain a residual image. Finally, the upsampling result is obtained by combining the residual image with the LR depth map. Note that we can train DKN without the residual connection by directly computing the upsampling result as a weighted average. Note also that we can train our model without the guidance of the HR color image; in this case, we use a single-stream CNN to extract features from the LR depth map only, in both training and testing. Our network is fully convolutional, does not require fixed-size input images, and is trained end-to-end.
Weight and offset learning. Direct supervisory information for the weights and offsets is typically not available. Instead, we learn these parameters by directly minimizing the discrepancy between the output of the network and a reference HR depth map. In particular, constraints on weight and offset regression (sigmoid and mean subtraction layers in Fig. 2) specify how the kernel weights and offsets behave and guide the learning process. For weight regression, we apply a sigmoid layer that makes all elements larger than 0 and smaller than 1. We then subtract the mean value from the output of the sigmoid layer so that the regressed weights resemble high-pass filters, with kernel weights adding to 0. For offset regression, we do not apply the sigmoid layer, since relative offsets (for horizontal and vertical positions) from locations on a regular grid can have negative values.
Residual connection. The main reason for using a residual connection is that the upsampling result is largely correlated with the LR depth map, and both share low-frequency content [17, 25, 32, 56]. Focusing on learning the residuals also accelerates training while achieving better performance. Note that contrary to [17, 25, 32, 56], we obtain the residuals by a weighted averaging process with the learned kernels, instead of regressing them directly from the network output. Empirically, the kernels learned with the residual connection have the same characteristics as the high-pass filters widely used to extract important structures (e.g., object boundaries) from images (see the supplemental material).
3.2 DKN architecture
We design a fully convolutional network to learn the kernel weights and the sampling offsets for individual pixels. We show in Table 1 the detailed description of the network structure.
Feature extraction. We adopt a similar architecture to prior work for feature extraction. It consists of 7 convolutional layers, two of which use convolutions with stride 2 (“DownConv” in Table 1), enlarging the receptive field with a small number of network parameters to estimate. We input the HR color and LR depth images to each of the sub-networks, resulting in feature maps with a large receptive field. The LR depth map is initially upsampled using bicubic interpolation. We use the ReLU as an activation function, and batch normalization for speeding up training and for regularization.
Weight regression. For each sub-network, we add a convolutional layer on top of the feature extraction layer, which gives a feature map used to regress the kernel weights. To estimate the weights, we apply a sigmoid layer to each feature map, and then combine the outputs by element-wise multiplication (see Fig. 2). We could use a softmax layer as in [3, 38, 49], but empirically find that it does not perform as well as the sigmoid layer: the softmax function encourages the estimated kernel to have only a few non-zero elements, which is not appropriate for estimating the weights of sparsely sampled pixels. The estimated kernels should be similar to high-pass filters, with kernel weights adding to 0. To this end, we subtract the mean value from the combined output. For our model without a residual connection, we instead apply L1 normalization to the output. Since the sigmoid layer makes all elements in the combined output larger than 0, applying L1 normalization forces the kernel weights to add to 1 as in (2).
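The two-stream weight regression described above can be sketched as follows; array shapes and names are hypothetical, and the raw feature maps stand in for the outputs of the two convolutional heads:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regress_weights(feat_color, feat_depth, residual=True):
    """Combine the two streams' raw weight maps, hypothetical shape (H, W, K).

    residual=True  : subtract the per-pixel mean -> weights sum to 0 (high-pass).
    residual=False : L1-normalize              -> weights sum to 1.
    """
    # Sigmoid maps each stream's output into (0, 1); the streams are then
    # combined by element-wise multiplication.
    w = sigmoid(feat_color) * sigmoid(feat_depth)
    if residual:
        return w - w.mean(axis=-1, keepdims=True)
    return w / w.sum(axis=-1, keepdims=True)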
Offset regression. Similar to the weight regression case, we add a convolutional layer on top of the feature extraction layer. The resulting two feature maps are combined by element-wise multiplication. The final output contains relative offsets (for horizontal and vertical positions) from locations on a regular grid. In our implementation, we use small kernels, but the output is computed by aggregating 9 samples sparsely chosen from a much larger neighborhood. The two main reasons behind the use of small-size kernels are as follows: (1) this enables an efficient implementation in terms of speed and memory; (2) the reliability of the samples is more important than the total number of samples aggregated. As will be seen in Sec. 4, our model outperforms the guided filter, which uses much larger kernels, by a large margin. A similar finding is noted in work on image matting, which shows that only high-confidence samples should be chosen when estimating foreground and background images. Note that offset regression is closely related to nonlocal means in that both select which pixels to aggregate instead of using immediate neighbors. Likewise, learning offsets is related to “self-supervised” correspondence models in stereo matching and optical flow estimation. For example, in the case of stereo matching, a model is trained to produce a flow field such that the right image is reconstructed from the left one according to that flow field. Our model computes correspondences for each pixel within the input images, and also learns the corresponding matching confidence (i.e., the kernel weights).
Weighted average. Given the learned kernels and sampling offsets, we compute the residuals as a weighted average. The sampling positions predicted by the network are irregular and typically fractional (Fig. 6(c)), so we use a sampler to compute the corresponding (sub-)pixel values, where the sampler enumerates all integer locations in a local 4-neighborhood of each fractional position and weights them with a sampling kernel. Following [7, 22], we use a two-dimensional bilinear kernel and split it into two one-dimensional ones. Note that the residual term in (3) is exactly the same as the explicit weighted average, but we aggregate pixels from sparsely chosen locations with the learned kernels, which is not feasible in current methods.
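The bilinear sampler can be made concrete as follows. This sketch uses the standard separable kernel g(a, b) = max(0, 1 − |a − b|) applied per dimension, as in the spatial transformer and deformable convolution literature; function and variable names are ours:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample img at a fractional (y, x) using the separable bilinear kernel."""
    H, W = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    # Enumerate the local 4-neighborhood of the fractional position.
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            # g(a, b) = max(0, 1 - |a - b|), applied to each dimension.
            g = max(0.0, 1 - abs(y - yy)) * max(0.0, 1 - abs(x - xx))
            if 0 <= yy < H and 0 <= xx < W:
                val += g * img[yy, xx]
    return val
```

At integer positions the kernel reduces to a delta, so the sampler returns the exact pixel value; in between, it interpolates, which is what enables sub-pixel information aggregation.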
When we do not use a residual connection, we compute the upsampling result directly as a weighted average using the learned kernels and offsets.
Loss. We train our model by minimizing the norm of the difference between the network output and the ground-truth HR reference depth map.
Testing. Two principles have guided the design of our learning architecture: (1) points from a large receptive field in the original guidance and target images should be used to compute the weighted averages associated with the value of the upsampled depth map at each of its pixels; and (2) inference should be fast. The second principle is rather self-evident. We believe that the first one is also rather intuitive, and it is justified empirically by the ablation study presented later. In fine, it is also the basis for our approach, since our network learns where and how to sample a small number of points in a large receptive field.
A reasonable compromise between receptive field size and speed is to use one or several convolutional layers with a multi-pixel stride, which enlarges the image area pixels are drawn from without increasing the number of weights in the network. This is the approach we have followed in our base architecture, DKN, with two stride-2 “DownConv” layers. The price to pay is a loss in spatial resolution for the final feature map, which retains only 1/16th of the total number of pixels in the input images. One could of course give as input to our network the receptive fields associated with all of the original guidance and target image pixels, at the cost of one forward pass per pixel during inference. DKN implements a much more efficient method where 16 shifted copies of the two images are used in turn as input to the network, and the corresponding network outputs are then stitched together into a single HR image, at the cost of only 16 forward passes. The details of this shift-and-stitch approach [34, 38] can be found in the supplemental material.
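A minimal sketch of shift-and-stitch, assuming a network `run_net` that subsamples its input by a factor `s` in each dimension; the names and the wrap-around shift via `np.roll` are our simplifications, not the paper's implementation:

```python
import numpy as np

def shift_and_stitch(run_net, img, s=4):
    """Run a subsampling network on s*s shifted copies of img and stitch
    the outputs back to full resolution.

    run_net : maps an (H, W) array to an (H//s, W//s) array.
    """
    H, W = img.shape
    out = np.zeros((H, W), dtype=img.dtype)
    for dy in range(s):
        for dx in range(s):
            # Shift the image so the network's subsampled grid lands on (dy, dx).
            shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
            # Place the network output on the corresponding subgrid.
            out[dy::s, dx::s] = run_net(shifted)
    return out
```

With s = 4 this gives the 16 forward passes mentioned above, instead of one pass per pixel.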
| Datasets | Middlebury | Lu | NYU v2 | Sintel |
|---|---|---|---|---|
| FDKN w/o Res. | 1.12 / 2.23 / 4.52 | 0.85 / 2.19 / 5.15 | 1.88 / 3.67 / 7.13 | 3.38 / 5.02 / 7.74 |
| DKN w/o Res. | 1.26 / 2.16 / 4.32 | 0.99 / 2.21 / 5.12 | 1.66 / 3.36 / 6.78 | 3.36 / 4.82 / 7.48 |
3.3 FDKN architecture
A more efficient alternative to DKN is to split the input images into the same 16 subsampled and shifted parts as before, but this time stack them into new target and guidance images (Fig. 7), with 16 channels for the former and 48 channels for the latter when the RGB image is used. The effective receptive field for FDKN is comparable to that of DKN, but FDKN involves far fewer parameters because of the reduced input image resolution and the shared weights across channels. The individual channels are then recomposed into the final upsampled image, at the cost of only one forward pass. Specifically, we use a series of 6 convolutional layers for feature extraction. For weight and offset regression, we apply a convolution on top of the feature extraction layers similar to DKN, but using more network parameters; this allows FDKN to estimate kernel weights and offsets for all pixels simultaneously. The details of this shift-and-stack approach can be found in the supplemental material. In practice, FDKN gives a 17-times speed-up over DKN. Because it involves fewer parameters, one might expect somewhat degraded results. Our experiments demonstrate that the performance of FDKN remains in the ballpark of DKN's, still significantly better than competing approaches, and in one case even outperforms DKN.
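The stacking step can be sketched as a rearrangement of each input into 16 subsampled, shifted channels (for the single-channel depth map; applying it per RGB channel gives the 48-channel guidance input). This is a generic pixel-unshuffle sketch, not the paper's code:

```python
import numpy as np

def shift_and_stack(img, s=4):
    """Rearrange an (H, W) image into s*s subsampled, shifted channels
    of size (H//s, W//s). Assumes H and W are divisible by s."""
    return np.stack(
        [img[dy::s, dx::s] for dy in range(s) for dx in range(s)],
        axis=0,
    )
```

Since every input pixel appears in exactly one channel, the transform is lossless and the channels can be recomposed into the full-resolution output after one forward pass.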
In this section we present a detailed analysis and evaluation of our approach. More results and other applications of our model including saliency image upsampling, cross-modality image restoration, texture removal and semantic segmentation can be found in the supplement.
4.1 Implementation details
Following the experimental protocol of [31, 32], we train different models to upsample depth maps at several scale factors, with random initialization. We sample 1,000 RGB/D image pairs from the NYU v2 dataset, and use the same image pairs as in [31, 32] to train the networks. The models are trained with a batch size of 1 for 40k iterations, giving roughly 20 epochs over the training data. We synthesize LR depth images from ground truth by bicubic downsampling. We use the Adam optimizer. We divide the learning rate by 5 every 10k iterations. Data augmentation and regularization techniques such as weight decay and dropout are not used, since 1,000 RGB/D image pairs from the NYU dataset have proven to be sufficient to train our models (see the supplement). All networks are trained end-to-end using PyTorch.
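The step schedule can be written as a small helper; the initial learning rate is elided in the text above, so it is left as a parameter here:

```python
def learning_rate(base_lr, iteration):
    """Step schedule: divide the learning rate by 5 every 10k iterations."""
    return base_lr / (5 ** (iteration // 10_000))
```

With 40k total iterations, this yields four plateaus, each one fifth of the previous.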
We test our models on the following four benchmark datasets, which feature aligned color and depth images. Note that we train our models on the NYU v2 dataset and do not fine-tune them on the other datasets, in order to evaluate generalization ability.
Sintel dataset: This dataset provides 1,064 RGB/D image pairs created from an animated 3D movie. It contains realistic scenes including fog and motion blur. We use 864 pairs from the final-pass dataset for testing.
We compare our method with the state of the art in Table 2, which shows the average RMSE between upsampling results and ground truth. All numbers except those for the Sintel dataset are taken from [31, 32]. The results of DJF and its residual version (DJFR) are obtained with the provided models trained on the NYU v2 dataset. DMSG uses the Middlebury and Sintel datasets for training the network. For a fair comparison of DMSG with other CNN-based methods including ours, we retrain the DMSG model using the same image pairs from the NYU v2 dataset as in [31, 32]. From this table, we observe four things: (1) Our models outperform the state of the art, including CNN-based methods [20, 31, 32], by significant margins in terms of RMSE, even without the residual connection (DKN w/o Res. and FDKN w/o Res.). For example, DKN substantially decreases the average RMSE compared to DJFR. (2) Our models trained without the guidance of HR color images, using the depth map only, also outperform the state of the art; on the Lu dataset, they even give better results than their guided counterparts. A plausible explanation is that depth and color boundaries are less correlated there, since the color images in that dataset were captured in a low-light condition. (3) Our models clearly perform well on both synthetic and real datasets (e.g., the Sintel and NYU v2 datasets), and generalize well to images outside the training dataset (e.g., on the Middlebury dataset). (4) FDKN retains the superior performance of DKN, and even outperforms DKN on the Lu dataset.
Qualitative results. Figure 17 shows a visual comparison of the upsampled depth maps. The better ability of our models to extract common structures from the color and depth images is clearly visible here: our results show sharp depth transitions without texture-copying artifacts. In contrast, artifacts are clearly visible even in the results of DJFR, which tends to over-smooth the results and does not recover fine details. This confirms once more the advantage of using a weighted average with spatially-variant kernels and an adaptive neighborhood system in depth map upsampling.
Runtime. Table 3 shows runtime comparisons on the same machine. We report the runtime for DMSG, DJFR, and our models on an Nvidia Titan XP, and for the other methods on an Intel i5 3.3 GHz CPU. Our current implementation of DKN is slower than DMSG and DJFR, but yields a significantly better RMSE (Fig. 1 and Table 2). FDKN runs about 17 times faster than the DKN, as fast as DJFR, but with significantly higher accuracy.
We conduct an ablation analysis of the different components of our models, and show the effects of different parameters for depth map upsampling on the NYU v2 dataset. More discussion can be found in the supplement.
Network architecture. We show the average RMSE for six variants of our models in Table 4. The baseline models learn kernel weights from HR color images only. The first row shows that this baseline already outperforms the state of the art (see Table 2). From the second row, we can see that our models trained using LR depth maps only give better results than the baseline, indicating that using the HR color images alone is not enough to fully exploit common structures. The third row demonstrates that constructing kernels from both images boosts performance, further decreasing the average RMSE of DKN. The fourth and fifth rows show that learning the offsets significantly boosts the performance of our models trained using the HR color or LR depth images only. The last two rows demonstrate that the effect of learning kernel weights and offsets from both inputs is significant, and combining all components, including the residual connection, gives the best results. Note that learning to predict the spatial offsets is important because (1) learning spatially-variant kernels for individual pixels would be very hard otherwise, unless much larger kernels were used to achieve the same neighborhood size, which would lead to an inefficient implementation, and (2) contrary to current architectures including DJF and DMSG, this allows sub-pixel information aggregation.
Kernel size. Table 4 also compares the performance of networks with different kernel sizes. We enlarge the kernel size gradually and compute the average RMSE. From the third row, we observe that the performance improves only up to a moderate kernel size; increasing the size further does not give an additional performance gain. This indicates that aggregating pixels from a small window is enough for the task. For offset learning, we restrict the maximum range of the sampling positions in all experiments. That is, the results from the third to last rows are computed by aggregating 9, 25, or 49 samples sparsely chosen from a larger window. The last row of Table 4 suggests that our final models also benefit from using more samples, decreasing the RMSE for DKN at the cost of additional runtime. For comparison, DKN with the three kernel sizes takes 0.17, 0.18, and 0.19 seconds, respectively, on an Nvidia Titan XP. The smallest kernel offers a good compromise in terms of RMSE and runtime, and this is what we have used in all experiments.
DownConv for DKN. We empirically find that extracting features from large receptive fields is important for incorporating context in weight and offset learning: reducing the receptive field size causes a noticeable increase in the average RMSE. The DKN without DownConv layers can be implemented in a single forward pass, but requires more parameters than DKN to maintain the same receptive field size, and substantially increases the total number of convolutions at each pixel. We could also use dilated convolutions, which support large receptive fields without loss of resolution; however, with the same receptive field size, the average RMSE for dilated convolutions increases. The resampling technique (Fig. 7) thus appears to be the preferable alternative.
We have presented a CNN architecture for depth map upsampling. Instead of regressing the upsampling results directly from the network, we use spatially-variant weighted averages where the set of neighbors and the corresponding kernel weights are learned end-to-end. A fast version achieves roughly a 17-times speed-up compared to the plain DKN without much (if any) loss in performance. Finally, we have shown that the weighted averaging process, even using the LR depth image alone without any guidance, with sparsely chosen samples, is sufficient to set a new state of the art.
-  https://www.asus.com/ae-en/3D-Sensor/Xtion_PRO_LIVE/.
-  Automatic differentiation in PyTorch.
-  S. Bako, T. Vogels, B. McWilliams, M. Meyer, J. Novák, A. Harvill, P. Sen, T. Derose, and F. Rousselle. Kernel-predicting convolutional networks for denoising Monte Carlo renderings. ACM Trans. Graph., 36(4):97, 2017.
-  J. T. Barron and B. Poole. The fast bilateral solver. In Proc. Eur. Conf. Comput. Vis., 2016.
-  A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005.
-  D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. Eur. Conf. Comput. Vis., 2012.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proc. Int. Conf. Comput. Vis., 2017.
-  J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Adv. Neural Inf. Process. Syst., 2006.
-  D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In Proc. Int. Conf. Comput. Vis., 2013.
-  D. Ferstl, M. Ruther, and H. Bischof. Variational depth superresolution using example-based edge representations. In Proc. Int. Conf. Comput. Vis., 2015.
-  Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell., 32(8):1362–1376, 2010.
-  P. Getreuer, I. Garcia-Dorado, J. Isidoro, S. Choi, F. Ong, and P. Milanfar. Blade: Filter learning for general purpose computational photography. In 2018 IEEE Conf. Computational Photography, 2018.
-  C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
-  S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
-  B. Ham, M. Cho, and J. Ponce. Robust guided image filtering using nonconvex potentials. IEEE Trans. Pattern Anal. Mach. Intell., 40(1):192–207, 2018.
-  K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell., 35(6):1397–1409, 2013.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
-  H. Hirschmuller and D. Scharstein. Evaluation of cost functions for stereo matching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
-  J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
-  T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. Eur. Conf. Comput. Vis., 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Machine Learning, 2015.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Adv. Neural Inf. Process. Syst., 2015.
-  J. Y. Jason, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Proc. Eur. Conf. Comput. Vis., 2016.
-  X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Adv. Neural Inf. Process. Syst., 2016.
-  J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learning Representations, 2015.
-  J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Trans. Graph., 26(3):96, 2007.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inf. Process. Syst., 2012.
-  H. Kwon, Y.-W. Tai, and S. Lin. Data-driven depth map refinement via multi-scale sparse representation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
-  A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):228–242, 2008.
-  Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. Eur. Conf. Comput. Vis., 2016.
-  Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
-  Y. Li, D. Min, M. N. Do, and J. Lu. Fast guided global interpolation for depth and motion. In Proc. Eur. Conf. Comput. Vis., 2016.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
-  S. Lu, X. Ren, and F. Liu. Depth enhancement via low-rank matrix completion. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
-  W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
-  B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll. Burst denoising with kernel prediction networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
-  S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
-  J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon. High quality depth map upsampling for 3D-ToF cameras. In Proc. Int. Conf. Comput. Vis., 2011.
-  G. Riegler, D. Ferstl, M. Rüther, and H. Bischof. A deep primal-dual network for guided depth super-resolution. In Proc. British Machine Vision Conference, 2016.
-  G. Riegler, M. Rüther, and H. Bischof. ATGV-Net: Accurate depth super-resolution. In Proc. Eur. Conf. Comput. Vis., 2016.
-  Y. Romano, J. Isidoro, and P. Milanfar. RAISR: Rapid and accurate image super resolution. IEEE Transactions on Computational Imaging, 3(1):110–125, 2017.
-  D. Scharstein and C. Pal. Learning conditional random fields for stereo. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
-  J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. Eur. Conf. Comput. Vis., 2012.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Adv. Neural Inf. Process. Syst., 2014.
-  C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. Int. Conf. Comput. Vis., 1998.
-  T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák. Denoising with kernel prediction and asymmetric loss functions. ACM Trans. Graph., 37(4):124, 2018.
-  J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
-  H. Wu, S. Zheng, J. Zhang, and K. Huang. Fast end-to-end trainable guided filter. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Trans. Image Process., 19(11):2861–2873, 2010.
-  Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth super resolution for range images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. Int. Conf. Learning Representations, 2016.
-  J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process., 26(7):3142–3155, 2017.
-  Z. Zhang. Microsoft Kinect sensor and its effect. IEEE MultiMedia, 19(2):4–10, 2012.