Deformable kernel networks for guided depth map upsampling

03/27/2019 ∙ by Beomjun Kim, et al.

We address the problem of upsampling a low-resolution (LR) depth map using a registered high-resolution (HR) color image of the same scene. Previous methods based on convolutional neural networks (CNNs) combine nonlinear activations of spatially-invariant kernels to estimate structural details from LR depth and HR color images, and regress upsampling results directly from the networks. In this paper, we revisit the weighted averaging process that has been widely used to transfer structural details from hand-crafted visual features to LR depth maps. We instead learn explicitly sparse and spatially-variant kernels for this task. To this end, we propose a CNN architecture and its efficient implementation, called the deformable kernel network (DKN), that outputs sparse sets of neighbors and the corresponding weights adaptively for each pixel. We also propose a fast version of DKN (FDKN) that runs about 17 times faster (0.01 seconds for an HR image of size 640 × 480). Experimental results on standard benchmarks demonstrate the effectiveness of our approach. In particular, we show that the weighted averaging process with 3 × 3 kernels (i.e., aggregating 9 samples sparsely chosen) outperforms the state of the art by a significant margin.


1 Introduction

Acquiring depth information is one of the fundamental tasks in computer vision, for scene recognition [19], pose estimation [45] and 3D reconstruction [11], for example. Recent stereo matching methods based on convolutional neural networks (CNNs) [36, 55] give high-quality depth maps, but still come at a high computational cost, especially in the case of a large search range. Consumer depth cameras (e.g., the ASUS Xtion Pro [1] and the Microsoft Kinect [57]), typically coupled with RGB sensors, are practical alternatives to obtain depth maps at low cost. Although they provide dense depth maps, these typically offer limited spatial resolution and depth accuracy. To address this problem, registered high-resolution (HR) color images can be used as guidance to enhance the spatial resolution of low-resolution (LR) depth maps [9, 14, 15, 20, 27, 31, 33, 39, 53]. The basic idea behind this approach, called guided or joint image filtering, is to exploit the statistical correlation between the two modalities to transfer structural details from the guidance HR color image to the target LR depth map, typically by estimating spatially-variant kernels from the guidance. Concretely, given the target image f and the guidance image g, the filtering output at a position p is expressed as a weighted average [16, 27, 48, 51]:

\hat{f}_p = \sum_{q \in \mathcal{N}(p)} W_{pq}(g) \, f_q,   (1)

where we denote by N(p) a set of neighbors (defined on a discrete regular grid) near the position p. The filter kernel W_pq is typically a function of the guidance image g [9, 16, 27, 39], normalized so that

\sum_{q \in \mathcal{N}(p)} W_{pq}(g) = 1.   (2)
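For concreteness, here is a minimal PyTorch sketch of the classical guided weighted average of (1)-(2), with hand-crafted joint-bilateral weights in the spirit of [27]; the window size and the Gaussian bandwidths sigma_s and sigma_r are illustrative choices, not values taken from any of the cited methods.

```python
# Sketch of the classical guided weighted average of Eqs. (1)-(2) with
# hand-crafted joint-bilateral weights computed from the guidance image.
# Window size and bandwidths are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_bilateral_upsample(target, guide, k=7, sigma_s=3.0, sigma_r=0.1):
    # target: (B, 1, H, W) bicubic-upsampled LR depth; guide: (B, 3, H, W) HR color, both in [0, 1].
    B, _, H, W = target.shape
    r = k // 2
    # Spatial Gaussian over the k x k window (shared by all pixels).
    ys, xs = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing='ij')
    w_s = torch.exp(-(xs**2 + ys**2) / (2 * sigma_s**2)).reshape(1, k * k, 1, 1).to(target)
    # Unfold k x k neighborhoods of the target and the guidance.
    t_patches = F.unfold(target, k, padding=r).reshape(B, 1, k * k, H, W)
    g_patches = F.unfold(guide, k, padding=r).reshape(B, 3, k * k, H, W)
    # Range Gaussian: color difference between the center pixel and its neighbors.
    diff = g_patches - guide.unsqueeze(2)
    w_r = torch.exp(-(diff**2).sum(1) / (2 * sigma_r**2))        # (B, k*k, H, W)
    w = w_s * w_r
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)           # Eq. (2): weights sum to 1
    return (w.unsqueeze(1) * t_patches).sum(dim=2)               # Eq. (1): weighted average
```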

Classical approaches to depth map upsampling mainly focus on designing the filter kernels W_pq and the set of neighbors N(p) (i.e., the sampling locations q). They use hand-crafted kernels and predefined neighbors without learning [16, 27, 48]. For example, the guided filter [16] uses spatially-variant matting Laplacian kernels [30] to encode local structures from the HR color image. These methods use regularly sampled neighbors for aggregating pixels, and do not handle inconsistent structures in the HR color and LR depth images, causing texture-copying artifacts [9]. To address the problem, both HR color and LR depth images have been used to extract common structures [14, 15, 31, 32]. Recently, learning-based approaches using CNNs [20, 31, 32] have also become increasingly popular. The networks are trained using large quantities of data, capturing natural image priors and often outperforming traditional methods by large margins. These methods do not use a weighted averaging process. They combine instead nonlinear activations of spatially-invariant kernels learned by the networks. That is, they approximate spatially-variant kernels by mixing the activations of spatially-invariant ones nonlinearly (e.g., via the ReLU function [28]).

In this paper, we propose to exploit spatially-variant kernels explicitly to encode the structural details from both HR color and LR depth images as in classical approaches, but learn the kernel weights in a supervised manner. We also learn the set of neighbors, building an adaptive and sparse neighborhood system for each pixel. This also allows sub-pixel information aggregation, which may be difficult to achieve by hand. To implement this idea, we propose a CNN architecture and its efficient implementation, called a deformable kernel network (DKN), for learning the sampling locations of the neighboring pixels and their corresponding kernel weights at every pixel. We also propose a fast version of DKN (FDKN), achieving a 17 times speed-up compared to the plain DKN for an HR image of size 640 × 480, while retaining its superior performance. We show that the weighted averaging process, even trained with the LR depth map only without any guidance (i.e., computing the weights in (1) from the depth map itself), with 9 points sparsely sampled, is sufficient to obtain a new state of the art (Fig. 1). Our code and models are available online: https://cvlab-yonsei.github.io/projects/DKN

Contributions. The main contributions of this paper can be summarized as follows:


  • We introduce a novel variant of the classical guided weighted averaging process for depth map upsampling and its implementation, the DKN, that computes the set of neighbors and their corresponding weights adaptively for individual pixels.

  • We propose a fast version of DKN (FDKN) that runs about 17 times faster than the DKN while retaining its superior performance.

  • We achieve a new state of the art, outperforming all existing methods we are aware of by a large margin, and clearly demonstrating the advantage of our approach to learning both kernel weights and sampling locations. We also provide an extensive experimental analysis to investigate the influence of all the components and parameters of our model.

2 Related work

Here we briefly describe the context of our approach, and review representative works related to ours.

Figure 2: The DKN architecture. We learn the kernel weights W and the spatial sampling offsets from the feature maps of HR color and LR depth images. To obtain the residual image, we then compute the weighted average with the kernel weights and the image values sampled at the offset locations from the neighbors N(p). Finally, the result is combined with the LR depth image to obtain the upsampling result. Our model is fully convolutional and is learned end-to-end. Element-wise multiplication and dot product are indicated by dedicated operators in the figure; the reshaping operator and the residual connection are drawn in dotted and dashed lines, respectively. See Table 1 for a detailed description of the network structure. (Best viewed in color.)

Depth map upsampling. We categorize depth map upsampling methods into explicit/implicit weighted-average methods and learning-based ones. First, explicit weighted-average methods compute the output at each pixel by a weighted average of neighboring pixels in the LR depth image, where the weights are estimated from the HR color image [16, 27] to transfer fine-grained structures. The bilateral [27, 48] and guided [16] filters are representative methods that have been successfully adapted to depth map upsampling. They use hand-crafted kernels to estimate the weights, which may transfer erroneous structures to the target image [31]. Second, implicit weighted-average methods formulate depth map upsampling as an optimization problem, and minimize an objective function that usually involves fidelity and regularization terms [9, 15, 14, 33, 39]. The fidelity term encourages the output to be close to the LR depth image, and the regularization term encourages the output to have a structure similar to that of the HR color image. Although, unlike explicit ones, implicit weighted-average methods exploit global structures in the HR color image, hand-crafted regularizers may not capture structural priors. Finally, learning-based methods can further be categorized into dictionary- and CNN-based approaches. Dictionary-based methods exploit the relationship between paired LR and HR depth patches, additionally coupled with the HR color image [10, 29, 52]. In CNN-based methods [20, 31, 32], an encoder-decoder architecture is typically used to learn features from the HR color and/or LR depth images, and the output is then regressed directly from the network. Other methods [40, 41] integrate variational optimization into CNNs by unrolling the optimization steps of a primal-dual algorithm, which requires two training stages and a number of iterations at test time. Similar to implicit weighted-average methods, they use hand-crafted regularizers, which may not capture structural priors.

Our method borrows from both explicit weighted-average methods and CNN-based ones. Unlike existing explicit weighted-average methods [16, 27], which use hand-crafted kernels and neighbors defined on a fixed regular grid, we leverage CNNs to learn the set of sparsely chosen neighbors and their corresponding weights adaptively. Our method differs from previous CNN-based ones [20, 31, 32] in that we learn sparse and spatially-variant kernels for each pixel to obtain upsampling results as a weighted average. The bucketing stretch in single image super-resolution [12, 42] can be seen as a non-learning-based approach to filter selection. It assigns a single filter by solving a least-squares problem for a set of similar patches (buckets). In contrast, our model learns different filters using CNNs even for similar RGB patches, since we learn them from a set of multi-modal images (i.e., pairs of RGB/D images).

Variants of the spatial transformer [22]. Recent works introduce more flexible and effective CNN architectures. Jaderberg et al. propose a novel learnable module, the spatial transformer [22], that outputs the parameters of a desired spatial transformation (e.g., affine or thin plate spline) given a feature map or an input image. The spatial transformer makes a standard CNN for classification invariant to a set of geometric transformations, but it has a limited capability of handling local transformations. Most similar to ours are the dynamic filter network [24] and its variants (the adaptive convolution network [38] and the kernel prediction networks [3, 37, 49]), where a set of local transformation parameters is generated adaptively conditioned on the input image. The main differences between our model and these works are two-fold. First, our network is more general in that it is not limited to learning spatially-variant kernels, but also learns the sampling locations of neighbors. This allows us to aggregate only sparse but highly related samples, enabling an efficient implementation in terms of speed and memory and achieving state-of-the-art results even when aggregating just 9 sparsely chosen samples. For comparison, the adaptive convolution and kernel prediction networks [3, 37, 38, 49] require many more samples. As will be seen in our experiments, learning the sampling locations of neighbors clearly boosts performance compared to learning kernel weights only. Second, like other guided image filtering approaches [15, 16, 27, 31, 32], our model is easily adapted to other tasks such as saliency map upsampling, cross-modality image restoration, texture removal, and semantic segmentation. We focus here on depth upsampling, but see the supplement for some examples. In contrast, the adaptive convolution network is specialized to video frame interpolation, and kernel prediction networks are applicable to denoising Monte Carlo renderings [3, 49] or burst denoising [37] only. Our work is also related to the deformable convolutional network [7]. The basic idea of deformable convolutions is to add offsets to the sampling locations defined on a regular grid in standard CNNs. The deformable convolutional network samples features directly at the learned offsets, but shares the same weights across different sets of offsets, as in standard CNNs. In contrast, we use spatially-variant weights for each sampling location.

Feature extraction: Input (HR color / LR depth) → Conv-BN-ReLU → DownConv-ReLU → Conv-BN-ReLU → DownConv-ReLU → Conv-BN-ReLU → Conv-ReLU → Conv-ReLU
Weight regression: Conv → Sigmoid → Mean subtraction (or L1 norm. w/o Res.)
Offset regression: Conv

Table 1: Network architecture details. "BN" and "Res." denote batch normalization [21] and the residual connection, respectively. We denote by "DownConv" a convolution with stride 2. The inputs of our network are 3-channel HR color and 1-channel LR depth images. For the model without the residual connection, we use an L1 normalization layer (denoted by "L1 norm.") instead of subtracting mean values for weight regression.

3 Approach

In this section, we briefly describe our approach, and present a concrete network architecture. We then describe a fast version of DKN.

3.1 Overview

Our network mainly consists of two parts (Fig. 2): We first learn spatially-variant kernel weights and spatial sampling offsets w.r.t. the regular grid. To this end, a two-stream CNN [47], where each sub-network has the same structure (but different parameters), uses the guidance (HR color) and target (LR depth) images to extract features that are used to estimate the kernel weights and the offsets. We then compute a weighted average using the learned kernel weights and the sampling locations computed from the offsets to obtain a residual image. Finally, the upsampling result is obtained by combining the residual image with the LR depth map. Note that we can train DKN without the residual connection, by directly computing the upsampling result as a weighted average. Note also that we can train our model without the guidance of the HR color image; in this case, we use a single-stream CNN to extract features from the LR depth map only in both training and testing. Our network is fully convolutional, does not require fixed-size input images, and is trained end-to-end.

Weight and offset learning. Dual supervisory information for the weights and offsets is typically not available. We instead learn these parameters by directly minimizing the discrepancy between the output of the network and a reference HR depth map. In particular, constraints on weight and offset regression (the sigmoid and mean subtraction layers in Fig. 2) specify how the kernel weights and offsets behave and guide the learning process. For weight regression, we apply a sigmoid layer that makes all elements larger than 0 and smaller than 1. We then subtract the mean value from the output of the sigmoid layer so that the regressed weights behave like high-pass filters, with kernel weights adding to 0. For offset regression, we do not apply the sigmoid layer, since relative offsets from locations on a regular grid can have negative values.

Residual connection. The main reason behind using a residual connection is that the upsampling result is largely correlated with the LR depth map, and both share low-frequency content [17, 25, 32, 56]. Focusing on learning the residuals also accelerates training while achieving better performance [25]. Note that contrary to [17, 25, 32, 56], we obtain the residuals by a weighted averaging process with the learned kernels, instead of regressing them directly from the network output. Empirically, the kernels learned with the residual connection have the same characteristics as the high-pass filters widely used to extract important structures (e.g., object boundaries) from images (see the supplemental material).

3.2 DKN architecture

We design a fully convolutional network to learn the kernel weights and the sampling offsets for individual pixels. We show in Table 1 the detailed description of the network structure.

Feature extraction. We adapt an architecture similar to [38] for feature extraction. It consists of 7 convolutional layers, two of which use a stride of 2 ("DownConv" in Table 1), which enlarges the receptive field with a small number of network parameters to estimate. We input the HR color and LR depth images to each of the sub-networks, resulting in a feature map whose spatial resolution is reduced by the two stride-2 layers and which covers a large receptive field. The LR depth map is initially upsampled using bicubic interpolation. We use the ReLU [28] as an activation function. Batch normalization [21] is used for speeding up training and regularization.
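To make the layer sequence of Table 1 concrete, the following sketch builds one feature-extraction sub-network in PyTorch; the kernel sizes and the channel width c are illustrative assumptions, since only the layer types and the two stride-2 DownConv layers are specified above.

```python
# Sketch of one feature-extraction sub-network following the layer order of Table 1.
# Kernel sizes and channel width are assumptions, not the paper's values.
import torch.nn as nn

def make_feature_extractor(in_ch, c=64):  # in_ch = 3 for the color stream, 1 for the depth stream
    return nn.Sequential(
        nn.Conv2d(in_ch, c, 3), nn.BatchNorm2d(c), nn.ReLU(True),   # Conv-BN-ReLU
        nn.Conv2d(c, c, 3, stride=2), nn.ReLU(True),                # DownConv-ReLU (stride 2)
        nn.Conv2d(c, c, 3), nn.BatchNorm2d(c), nn.ReLU(True),       # Conv-BN-ReLU
        nn.Conv2d(c, c, 3, stride=2), nn.ReLU(True),                # DownConv-ReLU (stride 2)
        nn.Conv2d(c, c, 3), nn.BatchNorm2d(c), nn.ReLU(True),       # Conv-BN-ReLU
        nn.Conv2d(c, c, 3), nn.ReLU(True),                          # Conv-ReLU
        nn.Conv2d(c, c, 3), nn.ReLU(True),                          # Conv-ReLU
    )
```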

Weight regression. For each sub-network, we add a convolutional layer on top of the feature extraction layer. It gives a feature map with k² channels, where k × k is the size of the filter kernel, which is used to regress the kernel weights. To estimate the weights, we apply a sigmoid layer to each of the two feature maps, and then combine the outputs by element-wise multiplication (see Fig. 2). We could use a softmax layer as in [3, 38, 49], but empirically find that it does not perform as well as the sigmoid layer. The softmax function encourages the estimated kernel to have only a few non-zero elements, which is not appropriate for estimating the weights of sparsely sampled pixels. The estimated kernels should be similar to high-pass filters, with kernel weights adding to 0. To this end, we subtract the mean value from the combined output. For our model without a residual connection, we instead apply L1 normalization to the output. Since the sigmoid layer makes all elements in the combined output larger than 0, applying L1 normalization forces the kernel weights to add to 1 as in (2).

Offset regression. Similar to the weight regression case, we add a convolutional layer on top of the feature extraction layer. The resulting two feature maps are combined by element-wise multiplication. The final output contains the relative offsets (in x and y, for each of the k² sampling positions) from locations on a regular grid. In our implementation, we use 3 × 3 kernels, but the output is computed by aggregating 9 samples sparsely chosen from a much larger neighborhood. The two main reasons behind the use of small kernels are as follows: (1) This enables an efficient implementation in terms of speed and memory. (2) The reliability of the samples is more important than the total number of samples aggregated. As will be seen in Sec. 4, our model outperforms the guided filter [16], which uses much larger kernels, by a large margin. A similar finding is noted in [50], which shows that only high-confidence samples should be chosen when estimating foreground and background images in image matting. Note that offset regression is closely related to nonlocal means [5] in that both select which pixels to aggregate instead of using only immediate neighbors. Likewise, learning offsets is related to "self-supervised" correspondence models in stereo matching [13] and optical flow estimation [23]. For example, in the case of stereo matching, a model is trained to produce a flow field such that a right image is reconstructed from a left one according to that flow field. Our model with 3 × 3 filter kernels computes 9 correspondences for each pixel within the input images, and also learns the corresponding matching confidence (i.e., the kernel weights).
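The two regression heads described above can be sketched as follows, assuming 3 × 3 kernels and an illustrative feature width; only the constraints described in the text (per-stream sigmoid, element-wise multiplication of the two streams, mean subtraction for the weights, and unconstrained offsets) are taken from the paper.

```python
# Sketch of the weight/offset regression heads (3 x 3 kernels assumed).
# Channel widths are illustrative assumptions.
import torch
import torch.nn as nn

class RegressionHeads(nn.Module):
    def __init__(self, feat_ch=64, k=3):
        super().__init__()
        self.k = k
        self.w_rgb = nn.Conv2d(feat_ch, k * k, 1)       # k^2 kernel weights per pixel
        self.w_dep = nn.Conv2d(feat_ch, k * k, 1)
        self.o_rgb = nn.Conv2d(feat_ch, 2 * k * k, 1)   # (dy, dx) offset per sample
        self.o_dep = nn.Conv2d(feat_ch, 2 * k * k, 1)

    def forward(self, feat_rgb, feat_dep):
        # Weights: sigmoid keeps each element in (0, 1); the two streams are combined by
        # element-wise multiplication; mean subtraction makes the k^2 weights sum to 0
        # (high-pass behaviour, used together with the residual connection).
        w = torch.sigmoid(self.w_rgb(feat_rgb)) * torch.sigmoid(self.w_dep(feat_dep))
        w = w - w.mean(dim=1, keepdim=True)
        # Offsets: no sigmoid, since relative offsets can be negative.
        off = self.o_rgb(feat_rgb) * self.o_dep(feat_dep)
        return w, off
```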

Figure 6: Illustration of the irregular sampling of neighboring pixels using offsets: (a) regular sampling locations q on the discrete grid; (b) learned offsets; (c) deformable sampling locations s(p, q) obtained with the offsets. The learned offsets are fractional and the corresponding pixel values are obtained by bilinear interpolation.

Weighted average. Given the learned kernel weights W_pq and the sampling offsets, we compute the residual at each position p as a weighted average:

\hat{f}^{res}_p = \sum_{q \in \mathcal{N}(p)} W_{pq} \, f(s(p, q)),   (3)

where is a local window centered at the location on a regular grid (Fig. 6(a)). We denote by the sampling position computed from the offset  (Fig. 6(b)) of the location as follows.

s(p, q) = q + \Delta_{pq}.   (4)

The sampling positions predicted by the network are irregular and typically fractional (Fig. 6(c)). We use a sampler to compute the corresponding (sub-)pixel values f(s(p, q)) as

f(s) = \sum_{t \in \mathcal{R}(s)} G(s, t) \, f(t),   (5)

where t enumerates all integer locations in a local 4-neighborhood system R(s) around the fractional position s, and G is a sampling kernel. Following [7, 22], we use a two-dimensional bilinear kernel, and split it into two one-dimensional ones as

G(s, t) = b(s_x, t_x) \, b(s_y, t_y),   (6)

where b(a, c) = max(0, 1 − |a − c|). Note that the residual term in (3) has exactly the same form as the explicit weighted average in (1), but we aggregate pixels from the sparsely chosen locations s(p, q) with the learned kernels W_pq, which is not feasible in current methods.

When we do not use a residual connection, we compute the upsampling result directly as a weighted average using the learned kernels and offsets:

\hat{f}_p = \sum_{q \in \mathcal{N}(p)} W_{pq} \, f(s(p, q)).   (7)
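A minimal PyTorch sketch of the weighted average of (3)-(7) is given below; it uses F.grid_sample as the bilinear sampler of (5)-(6), and assumes that the 2k² offset channels are laid out as a block of y offsets followed by a block of x offsets (an implementation choice, not something specified here).

```python
# Sketch of the weighted average of Eqs. (3)-(7) with learned kernels and offsets.
# weights: (B, k*k, H, W), offsets: (B, 2*k*k, H, W) in pixels, depth: (B, 1, H, W).
import torch
import torch.nn.functional as F

def deformable_weighted_average(depth, weights, offsets, k=3):
    B, _, H, W = depth.shape
    dev = depth.device
    r = k // 2
    # Regular k x k grid around each pixel (Fig. 6(a)).
    dy, dx = torch.meshgrid(torch.arange(-r, r + 1, device=dev, dtype=torch.float32),
                            torch.arange(-r, r + 1, device=dev, dtype=torch.float32), indexing='ij')
    base = torch.stack([dy, dx]).reshape(1, 2, k * k, 1, 1)
    # Absolute (fractional) sampling positions s(p, q) = q + offset (Eq. (4)).
    ys, xs = torch.meshgrid(torch.arange(H, device=dev, dtype=torch.float32),
                            torch.arange(W, device=dev, dtype=torch.float32), indexing='ij')
    pos = torch.stack([ys, xs]).unsqueeze(0).unsqueeze(2)            # (1, 2, 1, H, W)
    s = pos + base + offsets.reshape(B, 2, k * k, H, W)              # (B, 2, k*k, H, W)
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    sx = 2 * s[:, 1] / (W - 1) - 1
    sy = 2 * s[:, 0] / (H - 1) - 1
    grid = torch.stack([sx, sy], dim=-1).reshape(B, k * k * H, W, 2)
    # Bilinear sampling of the k*k (sub-)pixel values per output pixel (Eqs. (5)-(6)).
    samples = F.grid_sample(depth, grid, mode='bilinear', align_corners=True)
    samples = samples.reshape(B, 1, k * k, H, W)
    # Weighted average (Eq. (3) / (7)).
    return (weights.unsqueeze(1) * samples).sum(dim=2)               # (B, 1, H, W)
```

With the residual connection, the returned tensor is added to the bicubic-upsampled LR depth map to obtain the final result; without it, the returned tensor is the result itself, as in (7).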

Loss. We train our model by minimizing the L1 norm of the difference between the network output and the ground-truth HR reference depth map f^gt as follows.

\mathcal{L}(\hat{f}, f^{gt}) = \sum_p \lVert \hat{f}_p - f^{gt}_p \rVert_1.   (8)

Testing. Two principles have guided the design of our learning architecture: (1) points from a large receptive field in the original guidance and target images should be used to compute the weighted averages associated with the value of the upsampled depth map at each one of its pixels; and (2) inference should be fast. The second principle is rather self-evident. We believe that the first one is also rather intuitive, and it is justified empirically by the ablation study presented later. In the end, it is also the basis for our approach, since our network learns where and how to sample a small number of points in a large receptive field.

A reasonable compromise between receptive field size and speed is to use one or several convolutional layers with a multi-pixel stride, which enlarges the image area pixels are drawn from without increasing the number of weights in the network. This is the approach we have followed in our base architecture, DKN, with two stride-2 "DownConv" layers. The price to pay is a loss in spatial resolution for the final feature map, which covers only 1/16 of the total number of pixels in the input images. One could of course give as input to our network the receptive fields associated with all of the original guidance and target image pixels, at the cost of one forward pass per output pixel during inference. DKN implements a much more efficient method where shifted copies of the two images are used in turn as input to the network, and the corresponding network outputs are then stitched together into a single HR image, at the cost of only 16 forward passes. The details of this shift-and-stitch approach [34, 38] can be found in the supplemental material.
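The shift-and-stitch inference described above can be sketched as follows; `net` is assumed to map full-resolution inputs to an output subsampled by the two stride-2 layers (a factor of 4 per dimension), and the exact alignment between shifts and output positions depends on padding choices not detailed here.

```python
# Sketch of shift-and-stitch inference: 16 shifted copies of the inputs are processed
# and their subsampled outputs are interleaved back into a full-resolution map.
import torch

def shift_and_stitch(net, guide, depth, stride=4):
    B, _, H, W = depth.shape                   # H and W assumed divisible by stride
    out = depth.new_zeros(B, 1, H, W)
    for i in range(stride):
        for j in range(stride):
            g = torch.roll(guide, shifts=(-i, -j), dims=(2, 3))
            d = torch.roll(depth, shifts=(-i, -j), dims=(2, 3))
            y = net(g, d)                      # (B, 1, H // stride, W // stride)
            out[:, :, i::stride, j::stride] = y
    return out
```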

Figure 7: Illustration of resampling. An image of size H × W is reshaped with stride r in each dimension, resulting in a resampled one of size H/r × W/r with r² channels. (Best viewed in color.)
Methods             Middlebury [18]        Lu [35]                NYU v2 [46]            Sintel [6]
                    4×     8×     16×      4×     8×     16×      4×     8×     16×      4×     8×     16×
Bicubic Int.        4.44   7.58   11.87    5.07   9.22   14.27    8.16   14.22  22.32    6.54   8.80   12.17
MRF [8]             4.26   7.43   11.80    4.90   9.03   14.19    7.84   13.98  22.20    8.81   11.77  15.75
GF [16]             4.01   7.22   11.70    4.87   8.85   14.09    7.32   13.62  22.03    6.10   8.22   11.22
JBU [27]            2.44   3.81   6.13     2.99   5.06   7.51     4.07   8.29   13.35    5.88   7.63   10.97
TGV [9]             3.39   5.41   12.03    4.48   7.58   17.46    6.98   11.23  28.13    32.01  36.78  43.89
Park [39]           2.82   4.08   7.26     4.09   6.19   10.14    5.21   9.56   18.10    9.28   12.22  16.51
SDF [15]            3.14   5.03   8.83     4.65   7.53   11.52    5.27   12.31  19.24    6.52   7.98   11.36
FBS [4]             2.58   4.19   7.30     3.03   5.77   8.48     4.29   8.94   14.59    11.96  12.29  13.08
FGI [33]            3.24   4.60   6.74     4.68   6.32   9.25     6.43   9.52   14.13    6.29   8.24   11.01
DMSG [20]           1.88   3.45   6.28     2.30   4.17   7.22     3.02   5.38   9.17     5.32   7.24   10.11
DJF [31]            2.14   3.77   6.12     2.54   4.71   7.66     3.54   6.20   10.21    5.51   7.52   10.63
DJFR [32]           1.98   3.61   6.07     2.22   4.54   7.48     3.38   5.86   10.11    5.50   7.43   10.48
FDKN (depth only)   1.07   2.23   5.09     0.85   1.90   5.33     2.05   4.10   8.10     3.31   5.08   8.51
DKN (depth only)    1.12   2.13   5.00     0.90   1.83   4.99     2.11   4.00   8.24     3.40   4.90   8.18
FDKN w/o Res.       1.12   2.23   4.52     0.85   2.19   5.15     1.88   3.67   7.13     3.38   5.02   7.74
DKN w/o Res.        1.26   2.16   4.32     0.99   2.21   5.12     1.66   3.36   6.78     3.36   4.82   7.48
FDKN                1.08   2.17   4.50     0.82   2.10   5.05     1.86   3.58   6.96     3.36   4.96   7.74
DKN                 1.23   2.12   4.24     0.96   2.16   5.11     1.62   3.26   6.51     3.30   4.77   7.59
Table 2: Quantitative comparison with the state of the art on depth map upsampling in terms of average RMSE. Numbers in bold indicate the best performance and underscored ones the second best. Following [31, 32], the average RMSE is measured in centimeters for the NYU v2 dataset [46]. For the other datasets, we compute the RMSE with upsampled depth maps scaled to the range [0, 255].

3.3 FDKN architecture

A more efficient alternative to DKN is to split the input images into the same 16 subsampled and shifted parts as before, but this time stack them into new target and guidance images (Fig. 7), with 16 channels for the former and 48 channels for the latter when the 3-channel RGB image is used as guidance. The effective receptive field for FDKN is comparable to that of DKN, but FDKN involves much fewer parameters because of the reduced input image resolution and the shared weights across channels. The individual channels are then recomposed into the final upsampled image [44], at the cost of only one forward pass. Specifically, we use a series of 6 convolutional layers for feature extraction. For weight and offset regression, we apply a convolution on top of the feature extraction layers similar to DKN, but with more output channels, which allows FDKN to estimate kernel weights and offsets for all pixels simultaneously. The details of this shift-and-stack approach can be found in the supplemental material. In practice, FDKN gives a 17 times speed-up over DKN. Because it involves fewer parameters than DKN, one might expect somewhat degraded results. Our experiments demonstrate that FDKN remains in the ballpark of DKN, still significantly better than competing approaches, and in one case even outperforming DKN.
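A simplified sketch of the FDKN input re-arrangement and output recomposition is shown below, using PyTorch's pixel_unshuffle/pixel_shuffle; here `net` is a placeholder for the whole FDKN pipeline (feature extraction, weight/offset regression and weighted averaging) and is assumed to produce r² output channels per low-resolution position.

```python
# Sketch of the FDKN shift-and-stack re-arrangement (Fig. 7): the HR inputs are split
# into 16 subsampled, shifted parts stacked along the channel dimension (16 channels
# for depth, 48 for RGB), processed in one forward pass, and the output channels are
# recomposed into the HR result with sub-pixel convolution [44].
import torch
import torch.nn.functional as F

def fdkn_forward(net, guide, depth, r=4):
    # guide: (B, 3, H, W), depth: (B, 1, H, W); H and W assumed divisible by r.
    g = F.pixel_unshuffle(guide, r)   # (B, 3 * r * r, H / r, W / r) = (B, 48, H/4, W/4)
    d = F.pixel_unshuffle(depth, r)   # (B, 1 * r * r, H / r, W / r) = (B, 16, H/4, W/4)
    y = net(g, d)                     # assumed to predict r*r output channels per position
    return F.pixel_shuffle(y, r)      # (B, 1, H, W)
```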

4 Experiments

In this section we present a detailed analysis and evaluation of our approach. More results and other applications of our model including saliency image upsampling, cross-modality image restoration, texture removal and semantic segmentation can be found in the supplement.

4.1 Implementation details

Following the experimental protocol of [31, 32], we train different models to upsample depth maps for the scale factors 4×, 8× and 16×, with random initialization. We sample 1,000 RGB/D image pairs from the NYU v2 dataset [46], and use the same image pairs as in [31, 32] to train the networks. The models are trained with a batch size of 1 for 40k iterations, giving roughly 20 epochs over the training data. We synthesize LR depth images from the ground truth by bicubic downsampling. We use the Adam optimizer [26], and divide the learning rate by 5 every 10k iterations. Data augmentation and regularization techniques such as weight decay and dropout [28] are not used, since 1,000 RGB/D image pairs from the NYU dataset have proven to be sufficient to train our models (see the supplement). All networks are trained end-to-end using PyTorch [2].
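The training protocol above can be sketched as follows; the initial learning rate and the Adam momentum parameters are assumptions (the library defaults), since only the batch size, the number of iterations, the bicubic synthesis of the LR depth and the divide-by-5 schedule are specified, and `model` and `loader` (the 1,000 NYU v2 RGB/D pairs) are placeholders.

```python
# Sketch of the training loop: bicubically downsampled LR depth, L1 loss (Eq. (8)),
# Adam, batch size 1, learning rate divided by 5 every 10k iterations.
import torch
import torch.nn.functional as F

def train(model, loader, scale=8, lr=1e-3, iters=40000, device='cuda'):
    model = model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)                 # assumed lr and betas
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10000, gamma=0.2)  # lr / 5
    it = 0
    while it < iters:
        for rgb, depth_hr in loader:                                  # batch size 1
            rgb, depth_hr = rgb.to(device), depth_hr.to(device)
            # Synthesize the LR depth by bicubic downsampling, then upsample it back
            # to the HR size as the input to the target (depth) stream.
            depth_lr = F.interpolate(depth_hr, scale_factor=1.0 / scale,
                                     mode='bicubic', align_corners=False)
            depth_in = F.interpolate(depth_lr, size=depth_hr.shape[-2:],
                                     mode='bicubic', align_corners=False)
            pred = model(rgb, depth_in)
            loss = F.l1_loss(pred, depth_hr)                          # Eq. (8)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            it += 1
            if it >= iters:
                break
```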

Figure 17: Visual comparison of upsampled depth maps (8×): (a) RGB image, (b) GF [16], (c) TGV [9], (d) Park [39], (e) SDF [15], (f) DJFR [32], (g) DKN, (h) FDKN, (i) ground truth. Top to bottom: each row shows upsampled depth maps on the NYU v2 [46], Lu [35], and Sintel [6] datasets, respectively. Note that we train our models with the NYU v2 dataset, and do not fine-tune them on the other datasets. (Best viewed in color.)

4.2 Results

We test our models with the following four benchmark datasets, which feature aligned color and depth images. Note that we train our models with the NYU v2 dataset, and do not fine-tune them on the other datasets, in order to evaluate their generalization ability.


  • Middlebury dataset [18, 43]: We use the 30 RGB/D image pairs from the 2001-2006 datasets provided by Lu [35].

  • Lu dataset [35]: This provides 6 RGB/D image pairs acquired by the ASUS Xtion Pro camera [1].

  • NYU v2 dataset [46]: It consists of 1,449 RGB/D image pairs captured with the Microsoft Kinect [57]. We exclude the 1,000 pairs used for training, and use the rest (449 pairs) for evaluation.

  • Sintel dataset [6]: This dataset provides 1,064 RGB/D image pairs created from an animated 3D movie. It contains realistic scenes including fog and motion blur. We use 864 pairs from a final-pass dataset for testing.

Method     MRF [8]   GF [16]   JBU [27]   TGV [9]   Park [39]   SDF [15]   FBS [4]   FGI [33]   DMSG [20]   DJFR [32]   DKN (depth only)   FDKN (depth only)   DKN    FDKN
Time (s)   0.69      0.14      0.31       33        18          25         0.37      0.24       0.04        0.01        0.09               0.01                0.17   0.01
Table 3: Runtime comparison for HR images of size 640 × 480 (NYU v2 dataset [46]).
Weight learning     Offset learning     Res.   3×3         5×5         7×7         (three larger kernel sizes)
RGB                 -                   -      5.92/6.05   5.52/5.73   5.43/5.67   5.59/5.74   5.82/5.81   6.21/5.99
Depth               -                   -      5.24/5.30   4.36/4.47   4.09/4.24   4.09/4.17   4.11/4.18   4.15/4.21
RGB + Depth         -                   -      5.03/5.14   3.90/4.16   3.48/3.80   3.32/3.66   3.33/3.66   3.39/3.72
RGB                 RGB                 -      5.37/5.18   5.38/5.09   5.40/5.07
Depth               Depth               -      4.06/4.13   4.09/4.13   4.13/4.14
RGB + Depth         RGB + Depth         -      3.36/3.67   3.32/3.65   3.33/3.66
RGB + Depth         RGB + Depth         Yes    3.26/3.58   3.21/3.53   3.19/3.52
Table 4: Average RMSE comparison (DKN/FDKN) for different components and kernel sizes (from 3 × 3 up to larger sizes). From the third row, we can see that aggregating pixels from a window of intermediate size is enough; we thus restrict the maximum range of offset locations to that window. For example, results for the 3 × 3 kernel in the fourth row are computed using 9 pixels sparsely sampled from that larger window. We omit the results for the three largest kernels in the rows with offset learning, since they are equal to or beyond the maximum range of offset locations.

We compare our method with the state of the art in Table 2. It shows the average RMSE between upsampling results and ground truth. All numbers except those for the Sintel dataset are taken from [31, 32]. The results of DJF [31] and its residual version (DJFR [32]) are obtained with the provided models trained on the NYU v2 dataset. DMSG [20] uses the Middlebury and Sintel datasets for training the network. For a fair comparison of DMSG with other CNN-based methods including ours, we retrain the DMSG model using the same image pairs from the NYU v2 dataset as in [31, 32]. From this table, we observe four things: (1) Our models outperform the state of the art including CNN-based methods [20, 31, 32] by significant margins in terms of RMSE, even without the residual connection (DKN w/o Res. and FDKN w/o Res.). For example, DKN substantially decreases the average RMSE compared to DJFR [32] at all scale factors. (2) Our models trained without the guidance of HR color images (the depth-only DKN and FDKN), using the depth map only, also outperform the state of the art. In particular, they give better results than their guided counterparts on the Lu dataset [35]. A plausible explanation is that depth and color boundaries are less correlated there, since the color images in this dataset are captured in a low-light condition. (3) We can clearly see that our models perform well on both synthetic and real datasets (i.e., the Sintel and NYU v2 datasets), and generalize well to images outside the training dataset (e.g., on the Middlebury dataset). (4) FDKN retains the superior performance of DKN, and even outperforms DKN on the Lu dataset.

Qualitative results. Figure 17 shows a visual comparison of the upsampled depth maps (8×). The ability of our models to extract common structures from the color and depth images is clearly visible. Specifically, our results show sharp depth transitions without texture-copying artifacts. In contrast, artifacts are clearly visible in the results of DJFR [32], which tends to over-smooth the results and does not recover fine details. This confirms once more the advantage of using a weighted average with spatially-variant kernels and an adaptive neighborhood system for depth map upsampling.

Runtime. Table 3 shows runtime comparisons on the same machine. We report the runtime for DMSG [20], DJFR [32] and our models with an Nvidia Titan XP, and for the other methods with an Intel i5 3.3 GHz CPU. Our current implementation of DKN takes on average 0.17 seconds for HR images of size 640 × 480. It is slower than DMSG [20] and DJFR [32], but yields a significantly better RMSE (Fig. 1 and Table 2). FDKN runs about 17 times faster than DKN, as fast as DJFR, but with significantly higher accuracy.

4.3 Discussion

We conduct an ablation analysis of the different components of our models, and show the effects of different parameters for 8× depth map upsampling on the NYU v2 dataset [46]. More discussion can be found in the supplement.

Network architecture. We show the average RMSE for several variants of our models in Table 4. The baseline models learn kernel weights from HR color images only. The first row shows that this baseline already outperforms the state of the art (see Table 2). From the second row, we can see that our models trained using LR depth maps only give better results than the baseline, indicating that using the HR color images alone is not enough to fully exploit common structures. The third row demonstrates that constructing kernels from both images boosts performance further; for example, the average RMSE of DKN decreases for all kernel sizes compared to the second row. The fourth and fifth rows show that learning the offsets significantly boosts the performance of our models: the average RMSE of DKN trained using the HR color or LR depth images only drops markedly for the 3 × 3 kernel (compare the first and fourth rows, and the second and fifth rows). The last two rows demonstrate that the effect of learning kernel weights and offsets from both inputs is significant, and combining all components including the residual connection gives the best results. Note that learning to predict the spatial offsets is important because (1) learning spatially-variant kernels for individual pixels would be very hard otherwise, unless using much larger kernels to achieve the same neighborhood size, which would lead to an inefficient implementation, and (2) contrary to current architectures including DJF [31] and DMSG [20], this allows sub-pixel information aggregation.

Kernel size. Table 4 also compares the performance of networks with different kernel sizes. We enlarge the kernel size gradually, starting from 3 × 3, and compute the average RMSE. From the third row, we observe that the performance improves up to an intermediate kernel size; increasing the size further does not give an additional performance gain. This indicates that aggregating pixels from a window of that size is enough for the task. For offset learning, we restrict the maximum range of the sampling positions to this window in all experiments. That is, the results in the rows with offset learning are computed by aggregating 9, 25 or 49 samples sparsely chosen from that window. The last row of Table 4 suggests that our final models also benefit from using more samples: the RMSE for DKN decreases from 3.26 to 3.19 at the cost of additional runtime. For comparison, DKN with kernels of size 3 × 3, 5 × 5 and 7 × 7 takes 0.17, 0.18 and 0.19 seconds, respectively, with an Nvidia Titan XP. A size of 3 × 3 offers a good compromise in terms of RMSE and runtime, and this is what we have used in all experiments.

DownConv for DKN. We empirically find that extracting features from large receptive fields is important to incorporate context for weight and offset learning. For example, reducing the receptive field size causes a clear increase in the average RMSE. The DKN without DownConv layers can be implemented in a single forward pass, but requires many more parameters to maintain the same receptive field size, with a much larger number of convolutions at each pixel. We could use dilated convolutions [54], which support large receptive fields without loss of resolution, but with the same receptive field size they increase the average RMSE compared to our DownConv layers. The resampling technique (Fig. 7) thus appears to be the preferable alternative.

5 Conclusion

We have presented a CNN architecture for depth map upsampling. Instead of regressing the upsampling results directly from the network, we use spatially-variant weighted averages where the set of neighbors and the corresponding kernel weights are learned end-to-end. A fast version achieves a 17 times speed-up compared to the plain DKN without much (if any) loss in performance. Finally, we have shown that the weighted averaging process, even using the LR depth image only without any guidance, with 9 samples sparsely chosen, is sufficient to set a new state of the art.

References

  • [1] https://www.asus.com/ae-en/3D-Sensor/Xtion_PRO_LIVE/.
  • [2] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [3] S. Bako, T. Vogels, B. McWilliams, M. Meyer, J. Novák, A. Harvill, P. Sen, T. Derose, and F. Rousselle. Kernel-predicting convolutional networks for denoising Monte Carlo renderings. ACM Trans. Graph., 36(4):97, 2017.
  • [4] J. T. Barron and B. Poole. The fast bilateral solver. In Proc. Eur. Conf. Comput. Vis., 2016.
  • [5] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005.
  • [6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. Eur. Conf. Comput. Vis., 2012.
  • [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proc. Int. Conf. Comput. Vis., 2017.
  • [8] J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Adv. Neural Inf. Process. Syst., 2006.
  • [9] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In Proc. Int. Conf. Comput. Vis., 2013.
  • [10] D. Ferstl, M. Ruther, and H. Bischof. Variational depth superresolution using example-based edge representations. In Proc. Int. Conf. Comput. Vis., 2015.
  • [11] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell., 32(8):1362–1376, 2010.
  • [12] P. Getreuer, I. Garcia-Dorado, J. Isidoro, S. Choi, F. Ong, and P. Milanfar. Blade: Filter learning for general purpose computational photography. In 2018 IEEE Conf. Computational Photography, 2018.
  • [13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
  • [14] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
  • [15] B. Ham, M. Cho, and J. Ponce. Robust guided image filtering using nonconvex potentials. IEEE Trans. Pattern Anal. Mach. Intell., 40(1):192–207, 2018.
  • [16] K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell., 35(6):1397–1409, 2013.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [18] H. Hirschmuller and D. Scharstein. Evaluation of cost functions for stereo matching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
  • [19] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [20] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. Eur. Conf. Comput. Vis., 2016.
  • [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Machine Learning, 2015.
  • [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Adv. Neural Inf. Process. Syst., 2015.
  • [23] J. Y. Jason, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Proc. Eur. Conf. Comput. Vis., 2016.
  • [24] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Adv. Neural Inf. Process. Syst., 2016.
  • [25] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learning Representations, 2015.
  • [27] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Trans. Graph., 26(3):96, 2007.
  • [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inf. Process. Syst., 2012.
  • [29] H. Kwon, Y.-W. Tai, and S. Lin. Data-driven depth map refinement via multi-scale sparse representation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
  • [30] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):228–242, 2008.
  • [31] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. Eur. Conf. Comput. Vis., 2016.
  • [32] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
  • [33] Y. Li, D. Min, M. N. Do, and J. Lu. Fast guided global interpolation for depth and motion. In Proc. Eur. Conf. Comput. Vis., 2016.
  • [34] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
  • [35] S. Lu, X. Ren, and F. Liu. Depth enhancement via low-rank matrix completion. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
  • [36] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [37] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll. Burst denoising with kernel prediction networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
  • [38] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
  • [39] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon. High quality depth map upsampling for 3D-ToF cameras. In Proc. Int. Conf. Comput. Vis., 2011.
  • [40] G. Riegler, D. Ferstl, M. Rüther, and B. Horst. A deep primal-dual network for guided depth super-resolution. In Proc. British Machine Vision Conference, 2016.
  • [41] G. Riegler, M. Rüther, and B. Horst. ATGV-Net: Accurate depth super-resolution. In Proc. Eur. Conf. Comput. Vis., 2016.
  • [42] Y. Romano, J. Isidoro, and P. Milanfar. RAISR: Rapid and accurate image super resolution. IEEE Transactions on Computational Imaging, 3(1):110–125, 2017.
  • [43] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
  • [44] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [45] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
  • [46] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In Proc. Eur. Conf. Comput. Vis., 2012.
  • [47] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Adv. Neural Inf. Process. Syst., 2014.
  • [48] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. Int. Conf. Comput. Vis., 1998.
  • [49] T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák. Denoising with kernel prediction and asymmetric loss functions. ACM Trans. Graph., 37(4):124, 2018.
  • [50] J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
  • [51] H. Wu, S. Zheng, J. Zhang, and K. Huang. Fast end-to-end trainable guided filter. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
  • [52] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Trans. Image Process., 19(11):2861–2873, 2010.
  • [53] Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth super resolution for range images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
  • [54] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. Int. Conf. Learning Representations, 2016.
  • [55] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
  • [56] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process., 26(7):3142–3155, 2017.
  • [57] Z. Zhang. Microsoft Kinect sensor and its effect. IEEE Trans. Multimedia, 19(2):4–10, 2012.