1 Introduction
Acquiring depth information is one of the fundamental tasks in computer vision, for scene recognition [19], pose estimation [45] and 3D reconstruction [11], for example. Recent stereo matching methods based on convolutional neural networks (CNNs) [36, 55] give high-quality depth maps, but still require a huge computational cost, especially in the case of a large search range. Consumer depth cameras (e.g., the ASUS Xtion Pro [1] and the Microsoft Kinect [57]), typically coupled with RGB sensors, are practical alternatives to obtain depth maps at low cost. Although they provide dense depth maps, these typically offer limited spatial resolution and depth accuracy. To address this problem, registered high-resolution (HR) color images can be used as guidance to enhance the spatial resolution of low-resolution (LR) depth maps [9, 14, 15, 20, 27, 31, 33, 39, 53]. The basic idea behind this approach, called guided or joint image filtering, is to exploit their statistical correlation to transfer structural details from the guidance HR color image to the target LR depth maps, typically by estimating spatially-variant kernels from the guidance. Concretely, given the target image $f$ and the guidance image $g$, the filtering output $\hat{f}_p$ at position $p$ is expressed as a weighted average [16, 27, 48, 51]:

$\hat{f}_p = \sum_{q \in \mathcal{N}(p)} W_{pq}\, f_q$,   (1)
where we denote by $\mathcal{N}(p)$ a set of neighbors (defined on a discrete regular grid) near the position $p$. The filter kernel $W_{pq}$ is typically a function of the guidance image $g$ [9, 16, 27, 39], normalized so that

$\sum_{q \in \mathcal{N}(p)} W_{pq} = 1$.   (2)
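As a concrete illustration of the weighted average in (1) and the normalization in (2), the following sketch implements a joint bilateral filter in plain NumPy. The Gaussian spatial/range kernels, function name, and parameter defaults are illustrative choices for a single-channel guidance, not the paper's method:

```python
import numpy as np

def joint_bilateral_filter(target, guide, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Sketch of (1)-(2): for every pixel p, output a normalized weighted sum
    of target values over a regular window, with weights computed from the
    guidance image (here, a hand-crafted bilateral kernel)."""
    h, w = target.shape
    out = np.zeros_like(target, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            wsum, vsum = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # clamp neighbors q to the image border
                    qy = min(max(y + dy, 0), h - 1)
                    qx = min(max(x + dx, 0), w - 1)
                    # spatial term times guidance (range) term
                    ws = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
                    wr = np.exp(-((guide[y, x] - guide[qy, qx]) ** 2) / (2 * sigma_r ** 2))
                    wgt = ws * wr
                    wsum += wgt
                    vsum += wgt * target[qy, qx]
            out[y, x] = vsum / wsum  # dividing by the weight sum enforces (2)
    return out
```

Because the weights sum to 1 at every pixel, a constant target image passes through unchanged regardless of the guidance.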
Classical approaches to depth map upsampling mainly focus on designing the filter kernels and the set of neighbors (i.e., the sampling locations $q$). They use hand-crafted kernels and predefined neighbors without learning [16, 27, 48]. For example, the guided filter [16] uses spatially-variant matting Laplacian kernels [30] to encode local structures from the HR color image. These methods use regularly sampled neighbors for aggregating pixels, and do not handle inconsistent structures in the HR color and LR depth images, causing texture-copying artifacts [9]. To address the problem, both HR color and LR depth images have been used to extract common structures [14, 15, 31, 32]. Recently, learning-based approaches using CNNs [20, 31, 32]
have also become increasingly popular. The networks are trained using large quantities of data, capturing natural image priors and often outperforming traditional methods by large margins. These methods do not use a weighted averaging process. They combine instead nonlinear activations of spatially-invariant kernels learned from the networks. That is, they approximate spatially-variant kernels by mixing the activations of spatially-invariant ones nonlinearly (e.g., via the ReLU function [28]).

In this paper, we propose to exploit spatially-variant kernels explicitly to encode the structural details from both HR color and LR depth images as in classical approaches, but learn the kernel weights in a supervised manner. We also learn the set of neighbors, building an adaptive and sparse neighborhood system for each pixel. This also allows subpixel information aggregation, which may be difficult to achieve by hand. To implement this idea, we propose a CNN architecture and its efficient implementation, called a deformable kernel network (DKN), for learning sampling locations of the neighboring pixels and their corresponding kernel weights at every pixel. We also propose a fast version of DKN (FDKN), achieving a 17 times speedup compared to the plain DKN for HR images, while retaining its superior performance. We show that the weighted averaging process, even trained with the LR depth map only without any guidance (i.e., computing the kernel weights in (1) from the depth alone), with points sparsely sampled, is sufficient to obtain a new state of the art (Fig. 1). Our code and models are available online: https://cvlab-yonsei.github.io/projects/DKN
Contributions. The main contributions of this paper can be summarized as follows:


We introduce a novel variant of the classical guided weighted averaging process for depth map upsampling and its implementation, the DKN, that computes the set of neighbors and their corresponding weights adaptively for individual pixels.

We propose a fast version of DKN (FDKN) that runs about 17 times faster than the DKN while retaining its superior performance.

We achieve a new state of the art, outperforming all existing methods we are aware of by a large margin, and clearly demonstrating the advantage of our approach to learning both kernel weights and sampling locations. We also provide an extensive experimental analysis to investigate the influence of all the components and parameters of our model.
2 Related work
Here we briefly describe the context of our approach, and review representative works related to ours.
Depth map upsampling. We categorize depth map upsampling into explicit/implicit weighted-average methods and learning-based ones. First, explicit weighted-average methods compute the output at each pixel by a weighted average of neighboring pixels in the LR depth image, where the weights are estimated from the HR color image [16, 27] to transfer fine-grained structures. The bilateral [27, 48] and guided [16] filters are representative methods that have been successfully adapted to depth map upsampling. They use hand-crafted kernels to estimate the weights, which may transfer erroneous structures to the target image [31]. Second, implicit weighted-average methods formulate depth map upsampling as an optimization problem, and minimize an objective function that usually involves fidelity and regularization terms [9, 14, 15, 33, 39]. The fidelity term encourages the output to be close to the LR depth image, and the regularization term encourages the output to have a structure similar to that of the HR color image. Although, unlike explicit ones, implicit weighted-average methods exploit global structures in the HR color image, hand-crafted regularizers may not capture structural priors. Finally, learning-based methods can further be categorized into dictionary- and CNN-based approaches. Dictionary-based methods exploit the relationship between paired LR and HR depth patches, additionally coupled with the HR color image [10, 29, 52]. In CNN-based methods [20, 31, 32], an encoder-decoder architecture is typically used to learn features from the HR color and/or LR depth images, and the output is then regressed directly from the network. Other methods [40, 41] integrate a variational optimization into CNNs by unrolling the optimization steps of a primal-dual algorithm, which requires two stages in training and a number of iterations in testing. Similar to implicit weighted-average methods, they use hand-crafted regularizers, which may not capture structural priors.
Our method borrows from both explicit weighted-average methods and CNN-based ones. Unlike existing explicit weighted-average methods [16, 27] that use hand-crafted kernels and neighbors defined on a fixed regular grid, we leverage CNNs to learn the set of sparsely chosen neighbors and their corresponding weights adaptively. Our method differs from previous CNN-based ones [20, 31, 32] in that we learn sparse and spatially-variant kernels for each pixel to obtain upsampling results as a weighted average. The bucketing stretch in single image super-resolution [12, 42] can be seen as a non-learning-based approach to filter selection. It assigns a single filter by solving a least-squares problem for a set of similar patches (buckets). In contrast, our model learns different filters using CNNs even for similar RGB patches, since we learn them from a set of multi-modal images (i.e., pairs of RGB/D images).

Variants of the spatial transformer [22]. Recent works introduce more flexible and effective CNN architectures. Jaderberg et al. propose a novel learnable module, the spatial transformer [22], that outputs the parameters of the desired spatial transformation (e.g., affine or thin plate spline) given a feature map or an input image. The spatial transformer makes a standard CNN for classification invariant to a set of geometric transformations, but it has a limited capability of handling local transformations. Most similar to ours are the dynamic filter network [24] and its variants (the adaptive convolution network [38] and the kernel prediction networks [3, 37, 49]), where a set of local transformation parameters is generated adaptively conditioned on the input image. The main differences between our model and these works are twofold. First, our network is more general in that it is not limited to learning spatially-variant kernels, but also learns the sampling locations of neighbors. This allows it to aggregate only sparse but highly related samples, enabling an efficient implementation in terms of speed and memory, and achieving state-of-the-art results even when aggregating just 9 sparsely chosen samples. For comparison, the adaptive convolution and kernel prediction networks require many more samples [3, 37, 38, 49]. As will be seen in our experiments, learning the sampling locations of neighbors significantly boosts the performance compared to learning kernel weights only. Second, as with other guided image filtering approaches [15, 16, 27, 31, 32], our model is easily adapted to other tasks such as saliency map upsampling, cross-modality image restoration, texture removal, and semantic segmentation. We focus here on depth upsampling but see the supplement for some examples. In contrast, the adaptive convolution network is specialized to video frame interpolation, and kernel prediction networks are applicable to denoising Monte Carlo renderings [3, 49] or burst denoising [37] only. Our work is also related to the deformable convolutional network [7]. The basic idea of deformable convolutions is to add offsets to the sampling locations defined on a regular grid in standard CNNs. The deformable convolutional network samples features directly from learned offsets, but shares the same weights for different sets of offsets as in standard CNNs. In contrast, we use spatially-variant weights for each sampling location.

Table 1: Network architecture details. “BN” and “Res.” denote batch normalization [21] and residual connection, respectively. We denote by “DownConv” convolution with stride 2. The inputs of our network are 3-channel HR color and 1-channel LR depth images. For the model without the residual connection, we use an L1 normalization layer (denoted by “L1 norm.”) instead of subtracting mean values for weight regression.

3 Approach
In this section, we briefly describe our approach, and present a concrete network architecture. We then describe a fast version of DKN.
3.1 Overview
Our network mainly consists of two parts (Fig. 2): We first learn spatially-variant kernel weights and spatial sampling offsets w.r.t. the regular grid. To this end, a two-stream CNN [47], where each subnetwork has the same structure (but different parameters), uses the guidance (HR color) and target (LR depth) images to extract features that are used to estimate the kernel weights and the offsets. We then compute a weighted average using the learned kernel weights and sampling locations computed from the offsets to obtain a residual image. Finally, the upsampling result is obtained by combining the residual image with the LR depth map. Note that we can train DKN without the residual connection, by directly computing the upsampling result as a weighted average. Note also that we can train our model without the guidance of the HR color image. In this case, we use a single-stream CNN to extract features from the LR depth map only in both training and testing. Our network is fully convolutional, does not require fixed-size input images, and is trained end-to-end.
Weight and offset learning. Direct supervisory information for the weights and offsets is typically not available. We instead learn these parameters by directly minimizing the discrepancy between the output of the network and a reference HR depth map. In particular, constraints on weight and offset regression (sigmoid and mean subtraction layers in Fig. 2) specify how the kernel weights and offsets behave and guide the learning process. For weight regression, we apply a sigmoid layer that makes all elements larger than 0 and smaller than 1. We then subtract the mean value from the output of the sigmoid layer so that the regressed weights resemble high-pass filters, with kernel weights summing to 0. For offset regression, we do not apply the sigmoid layer, since relative offsets (for the x and y positions) from locations on a regular grid can have negative values.
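The two regression constraints above can be sketched as follows for a single pixel. The function names and the toy input are hypothetical, and the sketch operates on raw per-pixel network outputs rather than full feature maps:

```python
import numpy as np

def regress_weights(logits):
    """Sketch of the weight-regression constraint: a sigmoid squashes raw
    network outputs into (0, 1), and subtracting the per-pixel mean makes
    the kernel weights sum to 0 (high-pass-like behavior).
    `logits` has shape (k*k,) for a single pixel."""
    w = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: 0 < w < 1
    return w - w.mean()                # zero-sum kernel

def regress_offsets(raw):
    """Offsets are left unconstrained: relative x/y displacements from the
    regular grid may be negative, so no sigmoid is applied."""
    return raw
```

After the mean subtraction, each regressed kernel sums to zero exactly, and every individual weight lies strictly in (-1, 1).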
Residual connection. The main reason behind using a residual connection is that the upsampling result is largely correlated with the LR depth map, and both share low-frequency content [17, 25, 32, 56]. Focusing on learning the residuals also accelerates training while achieving better performance [25]. Note that contrary to [17, 25, 32, 56], we obtain the residuals by a weighted averaging process with the learned kernels, instead of regressing them directly from the network output. Empirically, the kernels learned with the residual connection have the same characteristics as the high-pass filters widely used to extract important structures (e.g., object boundaries) from images (see the supplemental material).
3.2 DKN architecture
We design a fully convolutional network to learn the kernel weights and the sampling offsets for individual pixels. We show in Table 1 the detailed description of the network structure.
Feature extraction. We adapt an architecture similar to [38] for feature extraction. It consists of 7 convolutional layers, two of which use convolutions with stride 2 (“DownConv” in Table 1), which enlarge the receptive field with a small number of network parameters to estimate. We input the HR color and LR depth images to each of the subnetworks, resulting in a feature map for each receptive field. The LR depth map is initially upsampled using bicubic interpolation. We use the ReLU [28] as an activation function. Batch normalization [21] is used for speeding up training and for regularization.

Weight regression. For each subnetwork, we add a convolutional layer on top of the feature extraction layer. It gives a feature map with $k^2$ channels, where $k$ is the size of the filter kernel, which is used to regress the kernel weights. To estimate the weights, we apply a sigmoid layer to each feature map, and then combine the outputs by element-wise multiplication (see Fig. 2). We could use a softmax layer as in [3, 38, 49], but empirically find that it does not perform as well as the sigmoid layer. The softmax function encourages the estimated kernel to have only a few nonzero elements, which is not appropriate for estimating the weights for sparsely sampled pixels. The estimated kernels should be similar to high-pass filters, with kernel weights summing to 0. To this end, we subtract the mean value from the combined output. For our model without a residual connection, we instead apply L1 normalization to the output. Since the sigmoid layer makes all elements in the combined output larger than 0, applying L1 normalization forces the kernel weights to sum to 1 as in (2).

Offset regression. Similar to the weight regression case, we add a convolutional layer on top of the feature extraction layer. The resulting two feature maps are combined by element-wise multiplication. The final output contains relative offsets (for the x and y positions) from locations on a regular grid. In our implementation, we use 3×3 kernels, but the output is computed by aggregating 9 samples sparsely chosen from a much larger neighborhood. The two main reasons behind the use of small kernels are as follows: (1) this enables an efficient implementation in terms of speed and memory; (2) the reliability of samples is more important than the total number of samples aggregated. As will be seen in Sec. 4, our model outperforms the guided filter [16], which uses much larger kernels, by a large margin. A similar finding is noted in [50], which shows that only high-confidence samples should be chosen when estimating foreground and background images in image matting. Note that offset regression is closely related to non-local means [5] in that both select which pixels to aggregate instead of using immediate neighbors. Likewise, learning offsets is related to “self-supervised” correspondence models in stereo matching [13] and optical flow estimation [23].
For example, in the case of stereo matching, a model is trained to produce a flow field such that a right image is reconstructed by a left one according to that flow field. In the same spirit, our model computes correspondences for each pixel within the input images, and also learns the corresponding matching confidence (i.e., the kernel weights).
Weighted average. Given the learned kernel $W$ and sampling offsets, we compute the residuals $r_p$ as a weighted average:

$r_p = \sum_{q \in \mathcal{N}(p)} W_{pq}\, f_{s(q)}$,   (3)

where $\mathcal{N}(p)$ is a local window centered at the location $p$ on a regular grid (Fig. 6(a)). We denote by $s(q)$ the sampling position computed from the offset $\Delta q$ (Fig. 6(b)) of the location $q$ as follows:

$s(q) = q + \Delta q$.   (4)

The sampling position predicted by the network is irregular and typically fractional (Fig. 6(c)). We use a sampler to compute the corresponding (sub)pixel values as

$f_{s(q)} = \sum_{t \in \mathcal{R}(s(q))} G(s(q), t)\, f_t$,   (5)

where $\mathcal{R}(s(q))$ enumerates all integer locations in a local 4-neighborhood system around the fractional position $s(q)$, and $G$ is a sampling kernel. Following [7, 22], we use a two-dimensional bilinear kernel, and split it into two one-dimensional ones as

$G(s, t) = g(s_x, t_x)\, g(s_y, t_y)$,   (6)

where $g(a, b) = \max(0, 1 - |a - b|)$. Note that the residual term in (3) is exactly the same as the explicit weighted average in (1), but we aggregate pixels from the sparsely chosen locations $s(q)$ with the learned kernels $W$, which is not feasible in current methods.

When we do not use a residual connection, we compute the upsampling result $\hat{f}_p$ directly as a weighted average using the learned kernels and offsets:

$\hat{f}_p = \sum_{q \in \mathcal{N}(p)} W_{pq}\, f_{s(q)}$.   (7)

Loss. We train our model by minimizing the $L_1$ norm of the difference between the network output $\hat{f}$ and the ground-truth HR reference depth map $f^{\mathrm{gt}}$ as follows:

$\mathcal{L} = \sum_p \left| f^{\mathrm{gt}}_p - \hat{f}_p \right|$.   (8)
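The deformable weighted average of (3)-(6) can be sketched in NumPy for a single output pixel. The function names and explicit loops are illustrative (the actual implementation is batched on the GPU), and border handling here is a simple clamp:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Eq. (5)-(6): sample a fractional position with the separable bilinear
    kernel g(a, b) = max(0, 1 - |a - b|) over the 4 integer neighbors."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for ty in (y0, y0 + 1):
        for tx in (x0, x0 + 1):
            gy = max(0.0, 1.0 - abs(y - ty))
            gx = max(0.0, 1.0 - abs(x - tx))
            cy, cx = min(max(ty, 0), h - 1), min(max(tx, 0), w - 1)  # clamp
            val += gy * gx * img[cy, cx]
    return val

def deformable_average(img, p, grid_offsets, learned_offsets, weights):
    """Eq. (3)-(4): weighted average over sampling positions s(q) = q + dq,
    where q ranges over a local window around p. `grid_offsets` are the
    regular-grid displacements q - p, `learned_offsets` the predicted dq,
    and `weights` the per-sample kernel values."""
    py, px = p
    out = 0.0
    for (dy, dx), (oy, ox), w in zip(grid_offsets, learned_offsets, weights):
        out += w * bilinear_sample(img, py + dy + oy, px + dx + ox)
    return out
```

With zero learned offsets and integer grid positions, the bilinear sampler reduces to a direct pixel lookup, so the expression degenerates to the classical regular-grid weighted average of (1).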
Testing. Two principles have guided the design of our learning architecture: (1) points from a large receptive field in the original guidance and target images should be used to compute the weighted average associated with the value of the upsampled depth map at each one of its pixels; and (2) inference should be fast. The second principle is rather self-evident. We believe that the first one is also rather intuitive, and it is justified empirically by the ablation study presented later. In fine, it is also the basis for our approach, since our network learns where, and how, to sample a small number of points in a large receptive field.

A reasonable compromise between receptive field size and speed is to use one or several convolutional layers with a multi-pixel stride, which enlarges the image area pixels are drawn from without increasing the number of weights in the network. This is the approach we have followed in our base architecture, DKN, with two stride-2 “DownConv” layers. The price to pay is a loss in spatial resolution for the final feature map, which covers only 1/16 of the total number of pixels in the input images. One could of course give as input to our network the receptive fields associated with all of the original guidance and target image pixels, at the cost of one forward pass per pixel during inference. DKN implements a much more efficient method where 16 shifted copies of the two images are used in turn as input to the network, and the corresponding network outputs are then stitched together in a single HR image, at the cost of only 16 forward passes. The details of this shift-and-stitch approach [34, 38] can be found in the supplemental material.
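The shift-and-stitch idea can be sketched as follows, with a stand-in `f` for the stride-decimating network (the real network and its bookkeeping are more involved; `stride=4` corresponds to the two stride-2 layers and 16 shifted copies):

```python
import numpy as np

def shift_and_stitch(img, stride, f):
    """Run the decimating map `f` on stride*stride shifted copies of the
    input, then interleave the low-resolution outputs back into a
    full-resolution map. `f` must map an (h, w) array to
    (h // stride, w // stride)."""
    h, w = img.shape
    out = np.zeros((h, w))
    for sy in range(stride):
        for sx in range(stride):
            # shift the input so a different phase lands on the coarse grid
            shifted = np.roll(np.roll(img, -sy, axis=0), -sx, axis=1)
            out[sy::stride, sx::stride] = f(shifted)
    return out
```

As a sanity check, if `f` is plain decimation (keeping every stride-th pixel), stitching the shifted outputs reconstructs the input exactly.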
Datasets           Middlebury [18]      Lu [35]              NYU v2 [46]           Sintel [6]
Methods            4×    8×    16×      4×    8×    16×      4×     8×     16×     4×     8×     16×
Bicubic Int.       4.44  7.58  11.87    5.07  9.22  14.27    8.16   14.22  22.32   6.54   8.80   12.17
MRF [8]            4.26  7.43  11.80    4.90  9.03  14.19    7.84   13.98  22.20   8.81   11.77  15.75
GF [16]            4.01  7.22  11.70    4.87  8.85  14.09    7.32   13.62  22.03   6.10   8.22   11.22
JBU [27]           2.44  3.81  6.13     2.99  5.06  7.51     4.07   8.29   13.35   5.88   7.63   10.97
TGV [9]            3.39  5.41  12.03    4.48  7.58  17.46    6.98   11.23  28.13   32.01  36.78  43.89
Park [39]          2.82  4.08  7.26     4.09  6.19  10.14    5.21   9.56   18.10   9.28   12.22  16.51
SDF [15]           3.14  5.03  8.83     4.65  7.53  11.52    5.27   12.31  19.24   6.52   7.98   11.36
FBS [4]            2.58  4.19  7.30     3.03  5.77  8.48     4.29   8.94   14.59   11.96  12.29  13.08
FGI [33]           3.24  4.60  6.74     4.68  6.32  9.25     6.43   9.52   14.13   6.29   8.24   11.01
DMSG [20]          1.88  3.45  6.28     2.30  4.17  7.22     3.02   5.38   9.17    5.32   7.24   10.11
DJF [31]           2.14  3.77  6.12     2.54  4.71  7.66     3.54   6.20   10.21   5.51   7.52   10.63
DJFR [32]          1.98  3.61  6.07     2.22  4.54  7.48     3.38   5.86   10.11   5.50   7.43   10.48
FDKN (depth only)  1.07  2.23  5.09     0.85  1.90  5.33     2.05   4.10   8.10    3.31   5.08   8.51
DKN (depth only)   1.12  2.13  5.00     0.90  1.83  4.99     2.11   4.00   8.24    3.40   4.90   8.18
FDKN w/o Res.      1.12  2.23  4.52     0.85  2.19  5.15     1.88   3.67   7.13    3.38   5.02   7.74
DKN w/o Res.       1.26  2.16  4.32     0.99  2.21  5.12     1.66   3.36   6.78    3.36   4.82   7.48
FDKN               1.08  2.17  4.50     0.82  2.10  5.05     1.86   3.58   6.96    3.36   4.96   7.74
DKN                1.23  2.12  4.24     0.96  2.16  5.11     1.62   3.26   6.51    3.30   4.77   7.59
3.3 FDKN architecture
A more efficient alternative to DKN is to split the input images into the same 16 subsampled and shifted parts as before, but this time stack them into new target and guidance images (Fig. 7), with 16 channels for the former, and 48 channels for the latter when the RGB image is used. The effective receptive field for FDKN is comparable to that of DKN, but FDKN involves much fewer parameters because of the reduced input image resolution and the shared weights across channels. The individual channels are then recomposed into the final upsampled image [44], at the cost of only one forward pass. Specifically, we use a series of 6 convolutional layers for feature extraction. For weight and offset regression, we apply a convolution on top of the feature extraction layers similar to DKN, but using more network parameters; this allows FDKN to estimate kernel weights and offsets for all pixels simultaneously. The details of this shift-and-stack approach can be found in the supplemental material. In practice, FDKN gives a 17 times speedup over DKN. Because it involves fewer parameters than DKN, one might expect somewhat degraded results. Our experiments demonstrate that FDKN's performance remains in the ballpark of DKN's, still significantly better than competing approaches, and in one case even outperforming DKN.
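The channel stacking described above is the classical space-to-depth rearrangement; a NumPy sketch follows (function names ours, with r=4 giving the 16 parts and, with inputs subsampled this way, the inverse interleaving recomposing the final image):

```python
import numpy as np

def space_to_depth(img, r):
    """Split an (h, w) image into r*r subsampled, shifted parts and stack
    them as channels, giving an (r*r, h//r, w//r) tensor
    (e.g., 16 channels for r = 4)."""
    return np.stack([img[sy::r, sx::r] for sy in range(r) for sx in range(r)])

def depth_to_space(stack, r):
    """Inverse rearrangement: interleave the r*r channels back into a
    full-resolution map."""
    c, hh, ww = stack.shape
    out = np.zeros((hh * r, ww * r), dtype=stack.dtype)
    for i in range(c):
        sy, sx = divmod(i, r)
        out[sy::r, sx::r] = stack[i]
    return out
```

The two operations are exact inverses, so no information is lost by trading spatial resolution for channels; the network simply processes all 16 phases in one forward pass.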
4 Experiments
In this section we present a detailed analysis and evaluation of our approach. More results and other applications of our model including saliency image upsampling, crossmodality image restoration, texture removal and semantic segmentation can be found in the supplement.
4.1 Implementation details
Following the experimental protocol of [31, 32], we train different models to upsample depth maps for scale factors 4×, 8×, and 16×, with random initialization. We sample 1,000 RGB/D image pairs from the NYU v2 dataset [46]. We use the same image pairs as in [31, 32] to train the networks. The models are trained with a batch size of 1 for 40k iterations, giving roughly 20 epochs over the training data. We synthesize LR depth images from ground truth by bicubic downsampling. We use the Adam optimizer [26], dividing the learning rate by 5 every 10k iterations. Data augmentation and regularization techniques such as weight decay and dropout [28] are not used, since 1,000 RGB/D image pairs from the NYU dataset have proven to be sufficient to train our models (see the supplement). All networks are trained end-to-end using PyTorch [2].

4.2 Results
We test our models with the following four benchmark datasets, which feature aligned color and depth images. Note that we train our models with the NYU v2 dataset, and do not fine-tune them on other ones, in order to evaluate their generalization ability.


Sintel dataset [6]: This dataset provides 1,064 RGB/D image pairs created from an animated 3D movie. It contains realistic scenes including fog and motion blur. We use 864 pairs from the final-pass dataset for testing.
Methods    MRF [8]  GF [16]  JBU [27]  TGV [9]  Park [39]  SDF [15]  FBS [4]  FGI [33]  DMSG [20]  DJFR [32]  DKN (depth only)  FDKN (depth only)  DKN   FDKN
Times (s)  0.69     0.14     0.31      33       18         25        0.37     0.24      0.04       0.01       0.09              0.01               0.17  0.01
Table 4: Ablation study on the NYU v2 dataset: average RMSE (DKN/FDKN) for variants that learn kernel weights and/or sampling offsets from the RGB and/or depth inputs, with and without the residual connection (“Res.”).
We compare our method with the state of the art in Table 2, which shows the average RMSE between upsampling results and ground truth. All numbers except those for the Sintel dataset are taken from [31, 32]. The results of DJF [31] and its residual version (DJFR [32]) are obtained by the provided models trained on the NYU v2 dataset. DMSG [20] uses the Middlebury and Sintel datasets for training the network. For a fair comparison of DMSG with other CNN-based methods including ours, we retrain the DMSG model using the same image pairs from the NYU v2 dataset as in [31, 32]. From this table, we observe four things: (1) Our models outperform the state of the art, including CNN-based methods [20, 31, 32], by significant margins in terms of RMSE, even without the residual connection (DKN w/o Res. and FDKN w/o Res.). For example, DKN decreases the average RMSE compared to DJFR [32] on all datasets. (2) Our models trained without the guidance of HR color images, using the depth map only, also outperform the state of the art. In particular, they give better results than the guided DKN and FDKN for the Lu dataset [35]. A plausible explanation is that depth and color boundaries are less correlated there, since the color images in that dataset are captured in a low-light condition. (3) We can clearly see that our models perform well on both synthetic and real datasets (e.g., the Sintel and NYU v2 datasets), and generalize well to images (e.g., from the Middlebury dataset) outside the training dataset. (4) FDKN retains the superior performance of DKN, and even outperforms DKN for the Lu dataset.
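For reference, the evaluation metric behind Table 2 is the root-mean-square error between the upsampled depth map and ground truth; a minimal implementation:

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between an upsampled depth map and the
    ground-truth HR depth map."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))
```

A perfect prediction gives an RMSE of 0, and the metric penalizes large depth errors quadratically before the square root.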
Qualitative results. Figure 17 shows a visual comparison of the upsampled depth maps (8×). The better ability of our models to extract common structures from the color and depth images is clearly visible here. Specifically, our results show sharp depth transitions without texture-copying artifacts. In contrast, artifacts are clearly visible even in the results of DJFR [32], which tends to over-smooth the results and does not recover fine details. This confirms once more the advantage of using a weighted average with spatially-variant kernels and an adaptive neighborhood system in depth map upsampling.
Runtime. Table 3 shows runtime comparisons on the same machine. We report the runtime for DMSG [20], DJFR [32] and our models with an Nvidia Titan XP, and for the other methods with an Intel i5 3.3 GHz CPU. Our current implementation of DKN takes on average 0.17 seconds for HR images. It is slower than DMSG [20] and DJFR [32], but yields a significantly better RMSE (Fig. 1 and Table 2). FDKN runs about 17 times faster than DKN, as fast as DJFR, but with significantly higher accuracy.
4.3 Discussion
We conduct an ablation analysis of the different components of our models, and show the effects of different parameters for depth map upsampling on the NYU v2 dataset [46]. More discussion can be found in the supplement.
Network architecture. We show the average RMSE for six variants of our models in Table 4. The baseline models learn kernel weights from HR color images only. The first row shows that this baseline already outperforms the state of the art (see Table 2). From the second row, we can see that our models trained using LR depth maps only give better results than the baseline, indicating that using the HR color images alone is not enough to fully exploit common structures. The third row demonstrates that constructing kernels from both images boosts performance. The fourth and fifth rows show that learning the offsets significantly boosts the performance of our models, whether they are trained using the HR color or the LR depth images only. The last two rows demonstrate that the effect of learning kernel weights and offsets from both inputs is significant, and that combining all components, including the residual connection, gives the best results. Note that learning to predict the spatial offsets is important because (1) learning spatially-variant kernels for individual pixels would be very hard otherwise, unless using much larger kernels to achieve the same neighborhood size, which would lead to an inefficient implementation, and (2) contrary to current architectures including DJF [31] and DMSG [20], this allows subpixel information aggregation.
Kernel size. Table 4 also compares the performance of networks with different kernel sizes. We enlarge the kernel size gradually and compute the average RMSE. From the third row, we observe that the performance improves up to a certain kernel size, and increasing the size further does not give additional performance gains, indicating that aggregating pixels from a moderately sized window is enough for the task. For offset learning, we restrict the maximum range of the sampling positions for all experiments. That is, the results from the third to last rows are computed by aggregating 9, 25 or 49 samples sparsely chosen from a larger window. The last row of Table 4 suggests that our final models also benefit from using more samples, decreasing the RMSE at the cost of additional runtime. For comparison, DKN with kernels of size 3×3, 5×5 and 7×7 takes 0.17, 0.18 and 0.19 seconds, respectively, with an Nvidia Titan XP. A 3×3 kernel offers a good compromise in terms of RMSE and runtime, and this is what we have used in all experiments.
DownConv for DKN. We empirically find that extracting features from large receptive fields is important to incorporate context for weight and offset learning; reducing the receptive field size causes a noticeable increase of the average RMSE. The DKN without DownConv layers can be implemented in a single forward pass, but requires more parameters to maintain the same receptive field size, with a larger total number of convolutions at each pixel. We could use dilated convolutions [54], which support large receptive fields without loss of resolution, but for the same receptive field size they also increase the average RMSE. The resampling technique (Fig. 7) thus appears to be the preferable alternative.
5 Conclusion
We have presented a CNN architecture for depth map upsampling. Instead of regressing the upsampling results directly from the network, we use spatially-variant weighted averages where the set of neighbors and the corresponding kernel weights are learned end-to-end. A fast version achieves a 17 times speedup compared to the plain DKN without much (if any) loss in performance. Finally, we have shown that the weighted averaging process, even using the LR depth image only without any guidance, with samples sparsely chosen, is sufficient to set a new state of the art.
References
 [1] https://www.asus.com/aeen/3DSensor/Xtion_PRO_LIVE/.
 [2] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 [3] S. Bako, T. Vogels, B. McWilliams, M. Meyer, J. Novák, A. Harvill, P. Sen, T. Derose, and F. Rousselle. Kernel-predicting convolutional networks for denoising Monte Carlo renderings. ACM Trans. Graph., 36(4):97, 2017.
 [4] J. T. Barron and B. Poole. The fast bilateral solver. In Proc. Eur. Conf. Comput. Vis., 2016.

 [5] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005.
 [6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. Eur. Conf. Comput. Vis., 2012.
 [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proc. Int. Conf. Comput. Vis., 2017.
 [8] J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Adv. Neural Inf. Process. Syst., 2006.
 [9] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In Proc. Int. Conf. Comput. Vis., 2013.
 [10] D. Ferstl, M. Rüther, and H. Bischof. Variational depth super-resolution using example-based edge representations. In Proc. Int. Conf. Comput. Vis., 2015.
 [11] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell., 32(8):1362–1376, 2010.
 [12] P. Getreuer, I. Garcia-Dorado, J. Isidoro, S. Choi, F. Ong, and P. Milanfar. BLADE: Filter learning for general purpose computational photography. In Proc. IEEE Int. Conf. Computational Photography, 2018.
 [13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
 [14] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
 [15] B. Ham, M. Cho, and J. Ponce. Robust guided image filtering using non-convex potentials. IEEE Trans. Pattern Anal. Mach. Intell., 40(1):192–207, 2018.
 [16] K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell., 35(6):1397–1409, 2013.
 [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
 [18] H. Hirschmuller and D. Scharstein. Evaluation of cost functions for stereo matching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
 [19] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
 [20] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. Eur. Conf. Comput. Vis., 2016.

 [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Machine Learning, 2015.
 [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Adv. Neural Inf. Process. Syst., 2015.

 [23] J. Y. Jason, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Proc. Eur. Conf. Comput. Vis., 2016.
 [24] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Adv. Neural Inf. Process. Syst., 2016.
 [25] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
 [26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learning Representations, 2015.
 [27] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Trans. Graph., 26(3):96, 2007.
 [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inf. Process. Syst., 2012.
 [29] H. Kwon, Y.-W. Tai, and S. Lin. Data-driven depth map refinement via multi-scale sparse representation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
 [30] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):228–242, 2008.
 [31] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. Eur. Conf. Comput. Vis., 2016.
 [32] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
 [33] Y. Li, D. Min, M. N. Do, and J. Lu. Fast guided global interpolation for depth and motion. In Proc. Eur. Conf. Comput. Vis., 2016.
 [34] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
 [35] S. Lu, X. Ren, and F. Liu. Depth enhancement via low-rank matrix completion. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.

 [36] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
 [37] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll. Burst denoising with kernel prediction networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
 [38] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
 [39] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon. High quality depth map upsampling for 3D-ToF cameras. In Proc. Int. Conf. Comput. Vis., 2011.
 [40] G. Riegler, D. Ferstl, M. Rüther, and H. Bischof. A deep primal-dual network for guided depth super-resolution. In Proc. British Machine Vision Conference, 2016.
 [41] G. Riegler, M. Rüther, and H. Bischof. ATGV-Net: Accurate depth super-resolution. In Proc. Eur. Conf. Comput. Vis., 2016.
 [42] Y. Romano, J. Isidoro, and P. Milanfar. RAISR: Rapid and accurate image super resolution. IEEE Transactions on Computational Imaging, 3(1):110–125, 2017.
 [43] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
 [44] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
 [45] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
 [46] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. Eur. Conf. Comput. Vis., 2012.
 [47] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Adv. Neural Inf. Process. Syst., 2014.
 [48] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. Int. Conf. Comput. Vis., 1998.

 [49] T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák. Denoising with kernel prediction and asymmetric loss functions. ACM Trans. Graph., 37(4):124, 2018.
 [50] J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
 [51] H. Wu, S. Zheng, J. Zhang, and K. Huang. Fast end-to-end trainable guided filter. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
 [52] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image superresolution via sparse representation. IEEE Trans. Image Process., 19(11):2861–2873, 2010.
 [53] Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth super resolution for range images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
 [54] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. Int. Conf. Learning Representations, 2016.
 [55] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
 [56] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process., 26(7):3142–3155, 2017.
 [57] Z. Zhang. Microsoft Kinect sensor and its effect. IEEE MultiMedia, 19(2):4–10, 2012.