Deeply Supervised Depth Map Super-Resolution as Novel View Synthesis

08/27/2018 ∙ by Xibin Song, et al.

Deep convolutional neural networks (DCNNs) have been successfully applied to depth map super-resolution and outperform existing methods by a wide margin. However, there still exist two major issues with these DCNN based depth map super-resolution methods that hinder the performance: i) the low-resolution depth maps either need to be up-sampled before being fed into the network, or substantial deconvolution has to be used; and ii) the supervision (high-resolution depth maps) is only applied at the end of the network, so it is difficult to handle large up-sampling factors, such as ×8 and ×16. In this paper, we propose a new framework to tackle the above problems. First, we propose to represent the task of depth map super-resolution as a series of novel view synthesis sub-tasks. Each novel view synthesis sub-task aims at generating (synthesizing) a depth map from a different camera pose, and the sub-tasks can be learned in parallel. Second, to handle large up-sampling factors, we present a deeply supervised network structure to enforce strong supervision in each stage of the network. Third, a multi-scale fusion strategy is proposed to effectively exploit the feature maps at different scales and handle the blocking effect. In this way, our proposed framework can deal with challenging depth map super-resolution efficiently under large up-sampling factors (e.g., ×8, ×16). Our method uses only the low-resolution depth map as input and does not require a supporting color image, which greatly relaxes the input requirements of our method. Extensive experiments on various benchmark datasets demonstrate the superiority of our method over current state-of-the-art depth map super-resolution methods.


1 Introduction

Depth map super-resolution (DSR) (c.f. Fig. 1) aims at super-resolving a high-resolution depth map from a low-resolution depth map input [1, 2, 3, 4], which is a challenging task especially under large up-sampling factors (×8, ×16, and beyond). This is mainly due to the great information loss incurred by down-sampling. For example, under an up-sampling factor of ×16, 256 (16 × 16) depth values have to be inferred from a single depth value on average. To tackle this highly under-constrained problem, various methods have been proposed by exploiting the availability of large-scale training datasets. Even though deep convolutional neural network (DCNN) based methods have achieved great success in various vision tasks such as image deblurring [5][6], image denoising [7][8], monocular depth estimation [46, 56], saliency prediction [28, 27], and even color image super-resolution (CSR) [9][10][11][12], it is only very recently that the success of DCNN in color image super-resolution [9][11][12] has been extended to the task of depth map super-resolution [2][13][14][15]. This is mainly due to the intrinsic differences between color images and depth maps: depth maps generally contain fewer textures and more sharp boundaries, and are usually degraded by noise due to imprecise consumer depth cameras. The difficulty of capturing high-resolution depth maps further increases the challenge.


Fig. 1: Qualitative comparison between our method and state-of-the-art methods for DSR with noisy input under a large up-sampling factor. (a) Ground truth, (b) Bicubic, (c) Xie et al. [16] and (d) Our result.

Under the pipelines of current DCNN based DSR methods [2][14][15], a depth map usually needs to be up-sampled before being fed into the network. However, the up-sampled depth maps do not necessarily provide a good initialization for the network learning, and the problem becomes even worse when the resolution of the input low-resolution depth map is very low and the up-sampling factor is very large. Hence, to improve the representative ability of the DCNN, Riegler et al. [14][15] resorted to increasing the depth of the network. Unfortunately, such deep structures may suffer from the vanishing gradient issue as the supervision is only enforced at the very end of the network. Besides, the deconvolution strategy [13], which can be viewed as an inverse operation of convolution, has also been used in DSR to improve the quality of the resultant feature maps. The deconvolution operator generally needs a large number of parameters. In this paper, we argue that neither hand-designed up-sampling nor deconvolution is necessary for depth map super-resolution.


Fig. 2: Depth map super-resolution as novel view synthesis. We illustrate the novel view synthesis process for an up-sampling factor of ×2. (a) shows the input single pixel, which can be regarded as the input low-resolution depth map; (b) shows the output four pixels, which can be regarded as pixels captured from four slightly different viewpoints, respectively. The red, black, blue and green pixels correspond to the four sub-pixel positions. The yellow dashed line in (b) corresponds to the yellow pixel in (a).

In this paper, we adopt a different approach by representing the task of depth map super-resolution as a series of novel view synthesis sub-tasks at different positions, where each sub-task generates a depth map at a different camera pose. Take the depth map super-resolution task with an up-sampling factor of ×2 as an example, where we would like to generate four new depth values from one depth measurement in the low-resolution depth map. We partition the desired high-resolution depth map D_h of size 2W × 2H into four parts D_00, D_01, D_10 and D_11, where D_uv(x, y) = D_h(2x + u, 2y + v) for u, v ∈ {0, 1}. These four depth maps share the same spatial resolution, W × H, as the input low-resolution depth map D_l, and can be viewed as depth maps captured by virtual depth cameras at four slightly different positions. Therefore, instead of learning a direct nonlinear mapping from D_l to D_h, we propose to learn four separate nonlinear mappings (i.e., novel view syntheses) from D_l to D_00, D_01, D_10 and D_11, where each nonlinear mapping can be learned through a DCNN in parallel. The super-resolution task is then formulated as predicting the depth maps corresponding to the four different virtual cameras from the input low-resolution depth map. In this way, the input and output of each novel view synthesis task have the same resolution, which makes the network structure easy to design and implement. In Fig. 2, we illustrate our idea of representing depth map super-resolution as novel view synthesis.
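To make the decomposition concrete, here is a minimal NumPy sketch for the ×2 case; the variable names (d_h, d_00, ...) merely mirror the notation above and are not code released with the paper.

```python
# A minimal NumPy sketch of the x2 decomposition described above; the names
# are ours, not the paper's.
import numpy as np

d_h = np.arange(16, dtype=np.float64).reshape(4, 4)   # toy HR depth map (2W x 2H)

# One "virtual camera" view per sub-pixel position (u, v).
d_00 = d_h[0::2, 0::2]
d_01 = d_h[0::2, 1::2]
d_10 = d_h[1::2, 0::2]
d_11 = d_h[1::2, 1::2]

# Each part has the same resolution as the LR input (W x H = 2 x 2 here), so
# each novel view synthesis sub-task maps an LR depth map to an equally
# sized output.
assert d_00.shape == d_01.shape == d_10.shape == d_11.shape == (2, 2)
```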

Furthermore, to handle large up-sampling factors, a deeply supervised learning strategy is proposed. In training each sub-task, the supervision on its output can be used independently; thus we achieve a deeply supervised learning framework for DSR. As the supervision is enforced at different layers of the network, strong supervision can be expected. However, since each novel view synthesis sub-task is learned independently, a blocking effect can appear between the different parts in the final output depth map. To reduce the blocking effect and effectively exploit the feature maps at different stages, we propose a multi-scale fusion strategy (MSF) to further improve the learned depth map, where intermediate depth maps at different scales are fused. Finally, to better utilize the individual information of each depth map, we impose a global depth field statistic prior (DFS) to further optimize the obtained depth maps. Our method does not need the support of color information, which removes the requirement of corresponding registered color images.

As illustrated in Fig. 1, our method outperforms state-of-the-art DSR methods especially under large up-sampling factors (e.g., ×8 and ×16). Our main contributions can be summarized as:

  1. We represent the task of depth map super-resolution as a series of novel view synthesis sub-tasks, where each sub-task can be efficiently solved in an end-to-end learning manner;

  2. A deeply supervised learning framework is proposed to handle large up-sampling factors in depth map super-resolution, where strong supervision is applied at different stages;

  3. A multi-scale fusion strategy (MSF) and a depth field statistic prior (DFS) are proposed to effectively exploit the feature maps at different scales and handle the blocking effect;

  4. Experiments on various benchmarking datasets demonstrate the superiority of our method over state-of-the-art depth map super-resolution methods, including the DSR methods using the color images.


Fig. 3: A nutshell of our depth map super-resolution method under a large up-sampling factor. Given an LR depth map as input, the DCNN unit is first used to train the novel view synthesis sub-tasks in parallel, then a re-organization operation is utilized to obtain the up-sampled depth map. The resultant depth map is then regarded as input to the next stage. Deep supervision is enforced at each stage, where the supervision signals are down-sampled from the ground truth high-resolution depth maps. The results of each stage are fused by using our multi-scale fusion (MSF) strategy. Finally, a depth field statistic (DFS) prior is applied to further improve the quality of the fused depth maps.

2 Related Work

2.1 Depth image super-resolution

Depth map super-resolution (DSR) methods can be roughly classified into three categories: conventional learning based DSR, high-resolution (HR) intensity image guided DSR and deep convolutional neural network (DCNN) based DSR.

Conventional learning based DSR: It has been shown that HR depth maps can be generated from low-resolution (LR) depth maps based on prior information. Inspired by Freeman et al. [17], Aodha et al. [18] proposed a patch based MRF method for DSR using prior information learned from depth map datasets. Hornácek et al. [19] generated HR depth maps by searching low- and high-resolution patch pairs of arbitrary size in the depth map itself. In addition, sparse representation and dictionary learning have also been utilized in DSR. Ferstl et al. [20] proposed to generate HR depth edges by learning from an external database of high- and low-resolution examples, where the Total Generalized Variation (TGV) was employed as regularization. Xie et al. [16] used an MRF optimization approach to generate sharp HR depth edges from LR depth edges, where the HR depth edges were used as guidance in generating the HR depth maps.

HR intensity image guided DSR: Pre-aligned HR intensity images can provide effective guidance for DSR since they contain plenty of useful high-frequency components which assist the process of DSR. Park et al. [21] generated HR depth maps by a non-local means filter with an HR intensity image as auxiliary information. Yang et al. [22] used an adaptive color-guided auto-regressive (AR) model to generate HR depth maps from LR depth maps. Ferstl et al. [23] utilized an anisotropic diffusion tensor with an HR color image as guidance. Kiechle et al. [24] made use of a bimodal co-sparse analysis model to generate an HR depth map from an LR depth map and an HR color image. Additionally, Matsuo et al. [25] generated HR depth maps by using auxiliary information extracted from HR color images to compute local tangent planes in depth maps. Lu et al. [26] utilized the consistency between color images and depth maps to generate HR depth maps. However, notwithstanding the appealing results that such approaches can produce, the lack of high-resolution color images fully registered with the depth maps in many cases makes the color image guided approaches less general.

DCNN based DSR: The success of DCNN in high-level computer vision tasks has only been extended to DSR very recently. Song et al. [2] proposed a progressive DCNN based end-to-end learning method to generate an HR depth map from an LR depth map, where SRCNN [9] was used as the mapping unit in the progressive process. Meanwhile, Riegler et al. [15] proposed to combine a DCNN with total variation in a novel ATGV-Net to generate HR depth maps, where the total variation was expressed by layers with fixed parameters. Riegler et al. [14] also proposed a DCNN based method by combining a DCNN with a non-local variational method, where the corresponding HR color images were also utilized. In [2][14][15], the LR depth maps all need to be up-sampled before being fed to the network. Very recently, Hui et al. [13] proposed to use a multi-scale fusion strategy in a Multi-Scale Guided convolutional network for DSR with and without the guidance of the color images.

2.2 DCNN based CSR

The success of DCNN has been extended to color image super-resolution (CSR). Using DCNN, effective frameworks have been proposed to super-resolve the low-resolution color images.

Dong et al. [9] proposed an end-to-end deep CNN framework to learn the nonlinear mapping between low- and high-resolution images. Based on [9], Kim et al. [11] proposed to use a deeper network to represent the nonlinear mapping and achieved improved performance. Meanwhile, in [12], a deeply-recursive convolutional network for CSR was proposed, which uses a deep recursive layer to obtain better results. Shi et al. [29] proposed a sub-pixel DCNN for CSR instead of using the deconvolution strategy, which utilizes a sub-pixel layer to learn an array of up-sampling filters that upscale the LR feature maps to the HR output. Ledig et al. [30] proposed a residual framework to infer photo-realistic natural images, where an adversarial loss and a content loss were used. Recently, Tai et al. [31] proposed a Deep Recursive Residual Network to mitigate the difficulty of training deep networks and used recursive learning to control the model parameters. Lai et al. [32] proposed a Laplacian Pyramid Super-Resolution Network to progressively solve the problem. To handle high magnification ratios and create realistic textures, Sajjadi et al. [33] proposed to use a feed-forward fully convolutional network and a perceptual loss to achieve automated texture synthesis. Tong et al. [34] introduced dense skip connections in a very deep network for CSR. Huang et al. [35] proposed to utilize weakly-supervised joint convolutional sparse coding to solve the problem of multi-modal image super-resolution.

3 Our Approach

In this paper, we propose a new framework to tackle the problems in handling large up-sampling factors (e.g., ×8, ×16) and the need for deconvolution or pre-processing. First, we propose to represent the task of depth map super-resolution as a series of novel view synthesis sub-tasks. Each novel view synthesis sub-task aims at generating (synthesizing) a depth map from a slightly different camera pose, and the sub-tasks can be learned in parallel. Second, to handle large up-sampling factors, we present a deeply supervised network structure to enforce strong supervision in each stage of the network. Third, to exploit the feature maps learned in different stages at different scales, we propose a multi-scale fusion (MSF) strategy, which fuses the intermediate feature maps at different scales. In this way, our framework can tackle the challenging task of depth map super-resolution with large up-sampling factors efficiently. In Fig. 3, we illustrate the whole pipeline of our method. Our method takes a low-resolution depth map as input and does not require the corresponding color image.

3.1 Deconvolution in DCNN based DSR

The deconvolution layer [10][13], as an inverse operation of convolution, is one way to recover an HR depth map from an LR depth map. A deconvolution layer employs a set of deconvolution filters to up-sample the feature maps: each filter is convolved with the input with a fractional stride of 1/r, so that the output is r times the size of the input. Most of the current DCNN based depth super-resolution methods [2][14][15] need up-sampled LR depth images as input; the process of up-sampling LR depth maps can also be viewed as a deconvolution layer whose parameters are fixed. However, up-sampling the low-resolution depth map does not necessarily provide a proper initialization for the network.

Recently, Shi et al. [29] proposed a sub-pixel DCNN for color image super-resolution without deconvolution: feature maps are extracted in the LR space, and a sub-pixel layer, which learns an array of up-sampling filters to upscale the LR feature maps to the HR output, is utilized. Inspired by [29], we present a novel view that treats DSR as a deeply supervised novel view synthesis task.

3.2 DSR as Novel View Synthesis

To avoid up-sampling the LR depth maps before feeding them into the network, we represent DSR as a series of novel view synthesis sub-tasks. Rather than treating it as learning a nonlinear mapping between the low-resolution depth map and the high-resolution depth map, we decompose the desired output into a collection of low-resolution depth maps acquired at slightly different viewpoints from different "virtual cameras". Under this setup, each virtual camera has the same spatial resolution as the input low-resolution depth map. For example, given a DSR task with an up-sampling factor of r, we can decompose the DSR task into r × r novel view synthesis tasks. Each virtual camera can be explained as having a small translation with respect to the others, together forming a light-field-like camera array as illustrated in Fig. 2.

Specifically, as shown in Fig. 4, given the current network input under ×2 up-sampling, we denote the task of DSR as predicting the depth maps viewed at the four sub-pixel positions. Under a pure down-sampling setting, the low-resolution depth map is exactly the depth map viewed at the first position, so the first novel view synthesis is simply an identity mapping. The other three tasks can be viewed as predicting the depth map under a sub-pixel motion vector: a half-pixel shift in the horizontal direction, in the vertical direction, and in the diagonal direction, respectively. With different down-sampling operators, the detailed meaning may differ, but the principle remains the same. The above formulation can be generalized to other up-sampling factors with ease.


Fig. 4: The process of novel view synthesis. The number in each grid cell of the input denotes the depth value of that pixel. The DCNN unit is first employed to estimate the depth value at each position of the novel views, then a re-organization operation is applied to generate the desired high-resolution output.

Through this representation, the low-resolution depth maps of the virtual cameras can be learned in parallel. Meanwhile, the whole DSR task has been decoupled into several novel view synthesis sub-tasks, and for each sub-task the output and input have the same spatial resolution. Thus, neither deconvolution nor an up-sampled input is needed in our approach.

As illustrated in Fig. 4, given an LR depth map D_l as input, a multi-layer DCNN unit is first employed to estimate the depth values of each novel view in parallel (D_00, D_01, D_10 and D_11), where each D_uv is the output of one sub-net. Then a re-organization operation is utilized to generate the output D_h. Note that we use a residual DCNN unit here.

Specifically, for each novel view synthesis sub-task at position (u, v), the process can be described as follows:

F_k(D_l) = σ(W_k * F_{k-1}(D_l) + b_k), k = 1, ..., L-1, with F_0(D_l) = D_l,     (1)
D_uv = W_L * F_{L-1}(D_l) + b_L + D_l,     (2)

where F_k denotes the nonlinear mapping of the k-th layer, W_k and b_k are the learnable network weights and biases respectively, * denotes convolution, D_l is the input LR depth map, σ is a nonlinear activation function, and D_uv is the estimated novel view synthesis output for position (u, v); the added D_l in Eq. (2) comes from the residual DCNN unit.

Then, the final output can be obtained by using the re-organization operation, which is described as follows:

D_h = R(D_00, D_01, D_10, D_11),     (3)

where R(·) is a periodic shuffling operation that re-arranges r × r low-resolution depth maps of dimension W × H into a high-resolution depth map of dimension rW × rH, i.e., D_h(r·x + u, r·y + v) = D_uv(x, y).

Training each sub-task: Take an up-sampling factor of ×2 as an example: the ground truth high-resolution depth map is decomposed into four sub-ground-truth depth maps (the reverse of the re-organization operation, as illustrated in Fig. 4), and in each sub-task the LR depth map has the same resolution as the corresponding high-resolution supervision sub-map, which makes the DCNN easy to design and implement. Thus, neither deconvolution nor up-sampling of the LR input is needed.
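As a rough illustration of the re-organization operation of Eq. (3) and of its reverse used to build the per-sub-task ground truths, the following NumPy sketch is one plausible implementation; the function names are ours, not the paper's.

```python
# Periodic shuffling (Eq. 3) and its reverse; a sketch, not the paper's code.
import numpy as np

def reorganize(sub_maps, r):
    """Shuffle r*r LR maps of size (W, H) into one HR map of size (rW, rH)."""
    w, h = sub_maps[0].shape
    d_h = np.zeros((r * w, r * h), dtype=sub_maps[0].dtype)
    for idx, d_uv in enumerate(sub_maps):
        u, v = idx // r, idx % r
        d_h[u::r, v::r] = d_uv                 # place view (u, v) on its sub-grid
    return d_h

def decompose(d_h, r):
    """Reverse operation: split an HR ground truth into r*r sub-ground truths."""
    return [d_h[u::r, v::r] for u in range(r) for v in range(r)]

# Round trip on a toy ground-truth map.
gt = np.arange(36, dtype=np.float64).reshape(6, 6)
assert np.array_equal(reorganize(decompose(gt, 2), 2), gt)
```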

3.3 Deeply Supervised Learning

During training, besides the high-resolution target/ground truth depth map used to supervise the final output, its down-sampled versions are also used at different stages of the depth map learning framework. Denoting the supervision of stage s as D_gt^(s), these are expressed as follows:

D_gt^(s) = B_{r^(S-s)}(D_gt), s = 1, ..., S,     (4)

where S is the number of supervision stages, r is the down-sampling factor between two consecutive stages, and B_t(·) denotes down-sampling by a factor of t. Here bicubic interpolation is used as the down-sampling strategy, which is commonly used in both depth map super-resolution and color image super-resolution. In this way, the aim of each sub-task can be viewed as learning the residual between two consecutive stages (two consecutive down-sampled versions of the high-resolution target/ground truth) using the novel view synthesis strategy. The residual at each stage is much smaller than the residual between the input low-resolution depth map and the ground truth depth map, and therefore does not need very deep layers to represent. Hence, deeply supervised learning can effectively handle the gradient vanishing issue and obtain better results. In this way, the network receives direct supervision at each stage, which better regularizes the learned depth map.
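A sketch of how the per-stage supervision maps of Eq. (4) could be generated is given below; it assumes scipy's cubic-spline zoom as a stand-in for the bicubic down-sampling named in the text, and treats the stage count and per-stage factor as parameters.

```python
# A sketch of the per-stage supervision of Eq. (4); scipy's cubic-spline zoom
# stands in for bicubic down-sampling, and r / num_stages are parameters.
import numpy as np
from scipy.ndimage import zoom

def stage_supervisions(d_gt, r, num_stages):
    """Return [D_gt^(1), ..., D_gt^(S)]: stage s is supervised by the ground
    truth down-sampled by r**(S - s); the last stage keeps full resolution."""
    targets = []
    for s in range(1, num_stages + 1):
        factor = r ** (num_stages - s)
        targets.append(d_gt if factor == 1 else zoom(d_gt, 1.0 / factor, order=3))
    return targets
```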


Fig. 5: Experimental results on the Middlebury dataset. (a) Ground Truth. (b) Bicubic. (c) Nearest neighbor. (d) Aodha et al. [18]. (e) Zeyde et al. [36]. (f) GR [37]. (g) Park et al. [21]. (h) Ferstl et al. [23]. (i) Ferstl et al. [20]. (j) Xie et al. [16]. (k) Song et al. [2]. (l) Our approach. Best viewed on screen.

3.4 Multi-scale Fusion

In the above deeply supervised learning structure, the ground truth depth map has been down-sampled to different resolutions to provide supervision at different scales. While providing deep supervision and guiding the generation of high-resolution depth maps, the quality of the down-sampled depth maps gradually decreases due to the down-sampling effect, i.e., the more down-sampling is applied, the smoother the obtained depth maps become. Therefore, the supervision depth maps actually encode the depth map details at different scales. To handle this side effect and better utilize the intermediate feature maps, we propose a multi-scale fusion strategy (MSF) (c.f. Fig. 3), which fuses the feature maps (depth maps) at different stages. Specifically, the predicted depth maps are up-sampled to the same resolution as the output and then concatenated together as the input to another DCNN unit, which not only achieves multi-scale fusion of the feature maps at different scales but also effectively handles the blocking effect introduced by the individual novel view synthesis sub-tasks. By integrating both the deeply supervised learning and multi-scale fusion strategies, our network is able to effectively exploit the supervision provided by the ground truth high-resolution depth map and the multi-scale feature maps.
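The input preparation for the fusion unit can be sketched as follows; the interpolation used for the intermediate up-sampling and the channel-stacking layout are our assumptions, since the paper only states that the stage outputs are up-sampled to the output resolution and concatenated.

```python
# A sketch of the MSF input preparation: stage outputs are up-sampled to the
# final resolution and stacked before being fed to a further DCNN unit.
import numpy as np
from scipy.ndimage import zoom

def msf_input(stage_outputs, final_shape):
    """Up-sample every intermediate depth map to final_shape and stack them
    along a channel axis, forming the input of the fusion DCNN unit."""
    fused = []
    for d in stage_outputs:
        fy = final_shape[0] / d.shape[0]
        fx = final_shape[1] / d.shape[1]
        fused.append(zoom(d, (fy, fx), order=3))
    return np.stack(fused, axis=0)             # shape: (num_stages, H, W)
```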

3.5 Network Architecture

For each novel view synthesis sub-task, we utilize VDSR-Net [11] as the DCNN unit due to its high performance in color image super-resolution. Furthermore, we propose to learn the residual between the input and the ground truth rather than learning the depth map itself. Note that the DCNN unit can be replaced by any other DCNN network. Fig. 4 shows the process of one sub-task. Taking the low-resolution depth map as input, the depth maps corresponding to the four sub-pixel positions are trained by their corresponding DCNN units in parallel. Then, the output is obtained by re-organizing the four depth maps.

Taking a DSR task with a large up-sampling factor as an example, we demonstrate the network structure in Fig. 3. The low-resolution depth maps are fed into the first stage, and the resultant depth maps are regarded as input for the next stage. Down-sampled ground truth is used as supervision to refine the depth maps at each stage. Then, the multi-scale fusion is employed to improve the quality of the resultant high-resolution depth maps, and finally, the depth field statistic (DFS) prior is used to further refine the resultant high-resolution depth maps.

Our method can also be extended to handle flexible up-sampling factors such as ×3 and ×5, which can be viewed as generating depth maps of virtual cameras at 9 or 25 positions from a low-resolution depth map.

3.6 Depth Field Statistics

In the above end-to-end learning framework for DSR, the learned depth maps could be biased by different depth map statistics, and we cannot guarantee that the network learns the high-frequency or edge information.

The distribution of a natural depth map can often be modeled with a generalized Laplace distribution [2]; in particular, the distribution of gradient magnitudes of depth images can be well approximated by a Laplacian distribution. Therefore, we propose to minimize the total variation of the depth map, which can be expressed in matrix form:

E_TV(D) = || G vec(D) ||_1,     (5)

where G is a matrix whose elements form the gradient operator, and vec(·) denotes the vectorization operator that transforms a matrix into a vector.
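Written with direct finite differences instead of the explicit matrix G (the two formulations are equivalent), the prior of Eq. (5) can be computed as in this small sketch:

```python
# Total variation of Eq. (5) via finite differences rather than the matrix G.
import numpy as np

def total_variation(d):
    """Anisotropic TV: sum of absolute horizontal and vertical differences."""
    dx = np.abs(np.diff(d, axis=1))            # horizontal gradients
    dy = np.abs(np.diff(d, axis=0))            # vertical gradients
    return float(dx.sum() + dy.sum())
```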

3.7 An Energy Minimization Formulation

To further refine the depth super-resolution results, we integrate the depth super-resolution cue from the deeply supervised DCNN and the depth field statistics (DFS), and reach the following energy minimization formulation:

min_D E(D) = || D - D_net ||_2^2 + λ || G vec(D) ||_1,     (6)

where D_net is the depth super-resolution result from the deep neural network (novel view synthesis strategy + MSF), and D is the final depth map we want to generate. λ is set empirically in our experiments. This is a convex optimization problem for which a globally optimal solution exists. We propose to use iterative reweighted least squares (IRLS) [38][39] to efficiently solve the problem. Given the depth map estimate D^(k) in the k-th iteration, the optimization for the (k+1)-th iteration can be expressed as:

D^(k+1) = argmin_D || D - D_net ||_2^2 + λ Σ_i w_i^(k) (g_i^T vec(D))^2, with w_i^(k) = 1 / (|g_i^T vec(D^(k))| + ε),     (7)

where g_i^T denotes the i-th row of G and ε is a small constant. This could be equivalently expressed as:

D^(k+1) = argmin_D || D - D_net ||_2^2 + λ || (W^(k))^{1/2} G vec(D) ||_2^2, with W^(k) = diag(w^(k)).     (8)

Denote G̃^(k) = (W^(k))^{1/2} G, i.e., the row-wise reweighted version of G; we then have:

D^(k+1) = argmin_D || D - D_net ||_2^2 + λ || G̃^(k) vec(D) ||_2^2.     (9)

The above least squares problem admits a closed-form solution.
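A minimal IRLS sketch for Eqs. (6)-(9) is given below; the values of lam, eps, the number of iterations and the exact construction of the gradient operator are illustrative assumptions, not the paper's settings.

```python
# A minimal IRLS sketch for Eqs. (6)-(9); lam, eps, n_iter and the gradient
# operator construction are illustrative assumptions.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def gradient_operator(h, w):
    """Forward-difference gradient operator G acting on vec(D) (row-major)."""
    dx = sp.diags([-1, 1], [0, 1], shape=(w, w))      # horizontal differences
    dy = sp.diags([-1, 1], [0, 1], shape=(h, h))      # vertical differences
    gx = sp.kron(sp.eye(h), dx)                       # d/dx on vec(D)
    gy = sp.kron(dy, sp.eye(w))                       # d/dy on vec(D)
    return sp.vstack([gx, gy]).tocsr()

def irls_tv_refine(d_net, lam=0.05, eps=1e-4, n_iter=5):
    """Refine the network output by minimizing
       ||D - D_net||_2^2 + lam * ||G vec(D)||_1  via IRLS."""
    h, w = d_net.shape
    g = gradient_operator(h, w)
    d0 = d_net.reshape(-1)
    d = d0.copy()
    eye = sp.eye(h * w)
    for _ in range(n_iter):
        # Re-weight each row of G by the current gradient magnitude (Eqs. 7-8).
        weights = 1.0 / (np.abs(g @ d) + eps)
        g_tilde = sp.diags(np.sqrt(weights)) @ g      # row-wise reweighted G
        # Closed-form solution of the resulting least-squares problem (Eq. 9).
        a = eye + lam * (g_tilde.T @ g_tilde)
        d = spsolve(a.tocsc(), d0)
    return d.reshape(h, w)
```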

Method
×2: Cones Teddy Tsukuba Venus | ×4: Cones Teddy Tsukuba Venus | ×8: Cones Teddy Tsukuba Venus
NN 4.4622 3.2363 9.2305 2.1298 6.0054 4.5466 12.9083 2.9333 7.5937 6.2416 18.4786 4.4645
Bicubic 2.5245 1.9495 5.7828 1.3119 3.8635 2.8930 8.7103 1.9403 5.3000 4.2423 13.3220 2.8948

Park et al. [21] 2.8497 2.1850 6.8869 1.2584 6.5447 4.3366 12.1231 2.2595 8.0078 6.3264 17.6225 3.4086
Yang et al. [22] 2.4214 1.8941 5.6312 1.2368 5.1390 4.0660 13.1748 2.7559 5.1390 4.0660 13.1748 2.7559

Ferstl et al. [23] 3.1651 2.4208 6.9988 1.4194 3.9968 2.8080 10.0352 1.6643 N/A N/A N/A N/A
JID [24] 1.7451 1.2681 3.7415 0.6879 3.0369 1.8043 5.9028 0.9625 4.5929 2.9342 10.0800 1.2684

Yang et al. [40] 2.8384 2.0079 6.1157 1.3777 3.9546 3.0908 8.2713 1.9850 5.3176 4.0447 13.0340 2.8140
Zeyde et al. [36] 1.9539 1.5013 4.5276 0.9305 3.2232 2.3527 7.3003 1.4751 4.8945 3.5670 11.9758 2.2879
GR [37] 2.3742 1.8010 5.4059 1.2153 3.5728 2.7044 8.0645 1.8175 5.0603 3.8137 12.3357 2.6384
ANR [37] 2.1237 1.6054 4.8169 1.0566 3.3156 2.4861 7.4895 1.6449 4.9904 3.6666 12.1035 2.4653
NE+LS 2.0437 1.5256 4.6372 0.9697 3.2868 2.4210 7.3404 1.5225 5.0948 3.6195 12.1448 2.3967
NE+NNLS 2.1158 1.5771 4.7287 1.0046 3.4362 2.4887 7.5344 1.6291 4.9906 3.6957 12.2283 2.4647
NE+LLE 2.1437 1.6173 4.8719 1.0827 3.3414 2.4905 7.5528 1.6449 4.9572 3.6916 12.1652 2.5202
Aodha et al. [18] 4.3185 3.2828 9.1089 2.2098 12.6938 4.1113 12.6938 2.6497 N/A N/A N/A N/A
Hornácek et al. [19] 3.7512 3.1395 8.8070 2.0383 5.4898 5.0212 11.1101 3.5833 N/A N/A N/A N/A
Huang et al. [41] 4.6273 3.4293 10.0766 2.1653 6.2723 4.8346 13.7645 3.0606 6.1629 6.6235 10.6618 4.1399
Schulter et al. [42] 1.9199 1.5545 4.2400 1.0185 2.9859 2.3793 6.8026 1.5477 N/A N/A N/A N/A
Ferstl et al. [20] 2.2139 1.7205 5.3252 1.1230 3.5680 2.6474 7.5356 1.7771 N/A N/A N/A N/A
Xie et al. [16] 2.7338 2.4911 6.3534 1.6390 4.4087 3.2768 9.7765 2.3714 N/A N/A N/A N/A
Wang et al. [43] 1.8895 1.4074 3.8789 0.8935 2.9263 2.0638 6.0356 1.2640 4.8933 3.0437 9.8942 1.8618

MSLaplas [32] 1.5119 1.1992 3.3239 0.7370 3.0733 1.8076 4.9121 0.9380 5.2976 2.9100 9.4433 1.4769

Laplas [32] 1.8810 1.4653 4.4722 0.8572 3.2320 2.0716 6.4812 1.1994 5.0544 2.8592 9.8536 1.2186

Song et al. [2] 1.4356 1.1974 2.9841 0.5592 2.9789 1.8006 6.1422 0.8796 4.5887 2.8850 11.6231 1.7082

MS-Net [13] 1.1000 0.8220 2.4720 0.2590 2.7700 1.5330 4.9960 0.4220 5.2170 2.8740 9.9860 0.8810

ATGV-Net [15] 1.0021 0.8155 2.3846 0.1991 2.9293 1.5029 6.6327 0.3764 N/A N/A N/A N/A

VDSR-Net [11] 0.9339 0.8548 1.6934 0.3934 2.3831 1.5469 4.5902 0.5616 4.9893 2.9458 10.8818 1.3126

NVS (Ours) 0.8991 0.7745 1.6786 0.2779 2.2233 1.5427 4.4539 0.5504 4.9932 2.8483 9.8526 1.1017
NVS + MSF (Ours) 0.9213 0.7891 1.7181 0.3206 2.2080 1.5174 4.4089 0.4978 4.9369 2.8055 9.9407 1.0854
NVS + MSF + DFS (Ours) 0.8757 0.7613 1.6505 0.2784 2.1956 1.5173 4.3839 0.4987 4.9318 2.8046 9.9206 1.0839


TABLE I: Quantitative evaluation under clean depth map input. The RMSE is calculated for different SOTA methods on the clean Middlebury dataset for up-sampling factors of ×2, ×4 and ×8. NVS denotes our deeply supervised novel view synthesis strategy, MSF the multi-scale fusion strategy, and DFS the depth field statistic prior. The best result is highlighted and the second best is underlined.
Method
×2: Cones Teddy Tsukuba Venus | ×4: Cones Teddy Tsukuba Venus | ×8: Cones Teddy Tsukuba Venus
NN 0.9645 0.9696 0.9423 0.9888 0.9360 0.9450 0.9003 0.9800 0.8996 0.9199 0.8387 0.9634
Bicubic 0.9720 0.9771 0.9536 0.9909 0.9538 0.9619 0.9205 0.9845 0.9314 0.9442 0.8564 0.9771
Park et al. [21] 0.9452 0.9610 0.9052 0.9811 0.9321 0.9510 0.8756 0.9799 0.9231 0.9426 0.8409 0.9792
Yang et al. [22] 0.9833 0.9850 0.9721 0.9946 0.9629 0.9697 0.9322 0.9882 0.9370 0.9488 0.8633 0.9773
Ferstl et al. [23] 0.9755 0.9795 0.9576 0.9938 0.9625 0.9707 0.9245 0.9901 N/A N/A N/A N/A

JID [24] 0.9913 0.9922 0.9904 0.9983 0.9811 0.9833 0.9751 0.9971 0.9612 0.9691 0.9441 0.9941

Yang et al. [40] 0.9473 0.9564 0.9072 0.9805 0.9482 0.9566 0.9015 0.9816 0.9339 0.9465 0.8662 0.9771
Zeyde et al. [36] 0.9655 0.9717 0.9438 0.9886 0.9604 0.9628 0.9147 0.9884 0.9385 0.9503 0.8718 0.9816
GR [37] 0.9587 0.9656 0.9314 0.9862 0.9500 0.9592 0.9012 0.9817 0.9320 0.9454 0.8581 0.9761
ANR [37] 0.9630 0.9693 0.9400 0.9879 0.9391 0.9452 0.8731 0.9806 0.9350 0.9478 0.8659 0.9784
NE+LS 0.9623 0.9692 0.9391 0.9887 1.6977 0.9514 0.9574 0.9042 0.9367 0.9493 0.8681 0.9799
NE+NNLS 0.9640 0.9707 0.9426 0.9883 0.9424 0.9499 0.8872 0.9820 0.9345 0.9472 0.8635 0.9785
NE+LLE 0.9588 0.9658 0.9405 0.9837 0.9270 0.9331 0.8794 0.9641 0.9344 0.9462 0.8650 0.9759
Aodha et al. [18] 0.9606 0.9690 0.9364 0.9874 0.9392 0.9520 0.9080 0.9822 N/A N/A N/A N/A
Hornácek et al. [19] 0.9696 0.9719 0.9461 0.9895 0.9501 0.9503 0.9137 0.9789 N/A N/A N/A N/A
Huang et al. [41] 0.9582 0.9673 0.9301 0.9875 0.9360 0.9425 0.8821 0.9784 0.9280 0.9254 0.9027 0.9712
Ferstl et al. [20] 0.9866 0.9884 0.9766 0.9963 0.9645 0.9716 0.9413 0.9893 N/A N/A N/A N/A
Xie et al. [16] 0.8932 0.9012 0.9053 0.9300 0.8885 0.8927 0.8405 0.9175 N/A N/A N/A N/A
Wang et al. [43] 0.9891 0.9907 0.9866 0.9966 0.9758 0.9798 0.9640 0.9937 0.9475 0.9618 0.9132 0.9878

MSLaplas [32] 0.9940 0.9909 0.9926 0.9984 0.9822 0.9867 0.9797 0.9976 0.9521 0.9664 0.9333 0.9927

Laplas [32] 0.9894 0.9946 0.9849 0.9975 0.9774 0.9824 0.9657 0.9963 0.9516 0.9658 0.9291 0.9950

Song et al. [2] 0.9915 0.9918 0.9905 0.9989 0.9783 0.9831 0.9666 0.9976 0.9510 0.9679 0.9051 0.9903

MS-Net [13] 0.9952 0.9953 0.9930 0.9993 0.9817 0.9860 0.9746 0.9987 0.9511 0.9652 0.9312 0.9967

VDSR-Net [11] 0.9947 0.9941 0.9970 0.9982 0.9823 0.9840 0.9784 0.9971 0.9481 0.9611 0.9174 0.9916
NVS (Ours) 0.9958 0.9953 0.9972 0.9991 0.9855 0.9886 0.9820 0.9979 0.9539 0.9663 0.9316 0.9953
NVS + MSF (Ours) 0.9954 0.9950 0.9969 0.9988 0.9860 0.9856 0.9829 0.9980 0.9543 0.9669 0.9307 0.9953
NVS + MSF + DFS (Ours) 0.9957 0.9951 0.9972 0.9991 0.9861 0.9855 0.9832 0.9980 0.9546 0.9670 0.9316 0.9954

TABLE II: Quantitative evaluation under clean depth map input. Note that the input is noise-free. The SSIM is calculated for different SOTA methods on the Middlebury dataset under up-sampling factors of ×2, ×4 and ×8. The best result is highlighted and the second best is underlined.

3.8 Implementation Details

DCNN unit: In this paper, we use VDSR-Net [11] with 10 convolutional layers as the DCNN unit. Each convolutional filter has a fixed spatial size (one size for the novel view synthesis sub-tasks and one for the multi-scale fusion unit), and each hidden layer of the network has a fixed number of feature maps. For each DCNN unit, the learning rate is gradually decreased during training and momentum is used. A stepwise decrease (4 steps) is used as the learning policy, together with the adjustable gradient clipping strategy of [11].

Method
Jadeplant Motorcycle Playtable Flower Art Books Laundry Reindeer Scan21 Scan30 Scan42

Bicubic 3.0186 3.4961 2.5275 3.3368 3.9117 2.2887 2.9908 3.140 2.1792 2.1401 3.7099
NN 3.6726 4.2625 3.0908 4.0940 4.7907 2.7146 3.5863 3.7993 3.1733 3.0431 5.8794
Park et al. [21] 3.7906 4.1823 2.2619 3.7799 4.1322 2.2075 3.0175 3.1835 N/A N/A N/A
Zeyde et al. [36] 3.0823 2.9615 1.8113 2.8241 3.8967 1.7817 2.9839 2.9304 2.3399 2.3371 3.3357
Yang et al. [40] 3.1787 3.0837 1.8748 2.9407 3.8967 1.7817 2.9839 2.9304 2.2071 2.1705 3.7650
ANR [37] 3.0660 2.9520 1.8175 2.8172 9.4621 8.8228 8.0385 8.7093 2.5350 2.5171 3.6840
GR [37] 3.0660 2.9520 1.8175 2.8172 9.5812 8.8838 8.1310 8.8010 2.7205 2.6990 3.9682
Aodha et al. [18] N/A N/A N/A N/A 3.8967 1.7817 2.9839 2.9304 2.3156 2.1225 4.0006
Ferstl et al. [23] 3.6394 3.8577 2.2640 3.4801 4.9346 4.5651 6.9055 4.6487 N/A N/A N/A
JID [24] 2.5476 2.0540 1.2785 1.8310 2.8314 2.2602 2.5027 2.6417 N/A N/A N/A
Huang et al. [41] 3.6394 3.8577 2.2640 3.4801 3.7537 3.1474 3.4566 3.5258 2.6827 2.7017 3.2517
Xie et al. [16] N/A N/A N/A N/A 3.7995 2.0742 2.6162 2.9992 2.3233 2.2411 4.0130
Wang et al. [43] 3.3520 3.3191 2.8827 3.0703 3.5072 2.7229 3.0895 3.1753 2.1460 2.1314 3.0442
ATGV-Net [15] N/A N/A N/A N/A 2.9800 1.7200 N/A N/A N/A N/A N/A

VDSR-Net [11] 1.9091 2.0237 1.3521 1.8153 2.0570 1.0768 1.5213 1.6711 1.4566 1.3233 1.8078

NVS (Ours) 1.8282 1.9518 1.2695 1.7667 1.9441 0.9675 1.3915 1.5935 1.3201 1.2636 1.7610
NVS + MSF (Ours) 1.7902 1.9240 1.2328 1.7371 1.9019 0.9346 1.3390 1.5658 1.3149 1.2498 1.6984
NVS + MSF + DFS (Ours) 1.7771 1.9138 1.2236 1.7240 1.8896 0.9230 1.3301 1.5554 1.2866 1.2264 1.6778

TABLE III: Quantitative results (RMSE) under noisy input. The RMSE is calculated for different SOTA methods on the Middlebury dataset (2005 and 2014) and the Laserscan dataset using noisy input. The notation is the same as in Table I.
Method
Jadeplant Motorcycle Playtable Flower Art Books Laundry Reindeer Scan21 Scan30 Scan42

NN 0.8949 0.9332 0.9389 0.9424 0.9171 0.9418 0.9325 0.9342 0.9702 0.9707 0.9567
Bicubic 0.9164 0.9595 0.9646 0.9653 0.9448 0.9634 0.9563 0.9581 0.9716 0.9722 0.9590
Park et al. [21] 0.9258 0.9762 0.9859 0.9824 0.9551 0.9717 0.9661 0.9679 N/A N/A N/A
Zeyde et al. [36] 0.9321 0.9799 0.9865 0.9839 0.9337 0.9408 0.9361 0.9378 0.9592 0.9592 0.9506
Yang et al. [40] 0.9311 0.9801 0.9866 0.9845 0.9486 0.9656 0.9597 0.9607 0.9725 0.9732 0.9599
ANR [37] 0.9321 0.9791 0.9859 0.9831 0.9213 0.9300 0.9248 0.9264 0.9510 0.9513 0.9396
GR [37] 0.9321 0.9791 0.9859 0.9831 0.9068 0.9197 0.9129 0.9146 0.9423 0.9429 0.9274
Aodha et al. [18] N/A N/A N/A N/A 0.9714 0.9869 0.9764 0.9828 0.9857 0.9873 0.9828
Ferstl et al. [23] 0.9237 0.9704 0.9796 0.9760 0.9631 0.9663 0.9365 0.9733 N/A N/A N/A
JID [24] 0.9421 0.9914 0.9932 0.9937 0.9287 0.9286 0.9244 0.9261 N/A N/A N/A
Huang et al. [41] 0.8834 0.9078 0.9063 0.9170 0.9074 0.9077 0.9014 0.9047 0.9359 0.9350 0.9301
Xie et al. [16] N/A N/A N/A N/A 0.9567 0.9706 0.9710 0.9685 0.9718 0.9716 0.9643
Wang et al. [43] 0.9077 0.9349 0.9333 0.9455 0.9320 0.9353 0.9288 0.9318 0.9655 0.9657 0.9582
VDSR-Net [11] 0.9885 0.9905 0.9921 0.9927 0.9868 0.9914 0.9876 0.9901 0.9920 0.9926 0.9924
NVS (Ours) 0.9411 0.9917 0.9931 0.9936 0.9885 0.9927 0.9894 0.9913 0.9935 0.9935 0.9936
NVS + MSF (Ours) 0.9898 0.9917 0.9931 0.9936 0.9886 0.9927 0.9895 0.9913 0.9933 0.9934 0.9935
NVS + MSF + DFS (Ours) 0.9902 0.9920 0.9934 0.9940 0.9889 0.9930 0.9898 0.9915 0.9937 0.9938 0.9940

TABLE IV: Quantitative evaluation under noisy depth input. The SSIM is calculated for different SOTA methods on the noisy Middlebury and Laserscan datasets. The best result is highlighted and the second best is underlined.
Method
Jadeplant Motorcycle Playtable Flower Art Books Laundry Reindeer Scan21 Scan30 Scan42

Bicubic 6.4744 7.5870 4.9166 5.9961 8.7821 4.0171 6.2251 6.2469 5.6547 6.0234 7.7719
NN 5.4158 6.3775 4.0157 7.3735 10.023 4.6584 7.0470 7.2392 9.1224 6.6863 6.5055
Park et al. [21] 6.5770 8.2173 4.5044 6.3069 9.1964 3.7853 6.4360 6.0934 N/A N/A N/A
Zeyde et al. [36] 5.3660 5.9962 3.5302 5.5972 6.6357 3.4477 4.6541 5.0603 4.4270 4.0878 7.2210
Yang et al. [40] 5.5559 6.2265 3.6571 5.8026 6.6079 2.9578 4.3285 4.8519 4.3007 3.8619 7.2310
ANR [37] 5.3135 5.9480 3.5046 5.5557 6.7198 3.6173 4.8064 5.2210 4.5430 4.1952 7.2291
GR [37] 5.3135 5.9480 3.5046 5.5557 6.5905 3.5932 4.7439 5.1316 4.4347 4.1142 7.0916
JID [24] 5.5170 5.5709 3.2141 5.0051 6.3400 3.0937 5.9453 5.3907 N/A N/A N/A
Wang et al. [43] 5.2034 5.6034 3.8110 4.6530 6.2276 3.3479 4.4003 4.7777 3.4242 3.1981 5.4845
VDSR-Net [11] 5.1063 6.3280 3.4919 4.7088 6.7515 2.9180 4.6333 4.8409 3.7227 3.3443 6.1008

NVS (Ours) 4.6949 5.2480 2.8377 3.6106 5.7697 2.0427 3.5901 3.9111 2.8423 2.6082 4.4192
NVS + MSF (Ours) 4.6067 5.1246 2.7952 3.5814 5.7122 2.0422 3.5603 3.8838 2.7718 2.5853 4.3061
NVS + MSF + DFS (Ours) 4.5960 5.1157 2.7914 3.5750 5.7067 2.0348 3.5624 3.8775 2.7637 2.6088 4.2957

TABLE V: Quantitative results (RMSE) for a large up-sampling factor. The RMSE is calculated for different SOTA methods on the Middlebury dataset (2005 and 2014) and the Laserscan dataset using noisy input. The notation is the same as in Table I.

Training Data: Depth maps from the Middlebury stereo dataset [44][45][47] (25 images), the Sintel dataset [48] (60 images) and the synthetic New Tsukuba dataset [49] (30 images) are collected to construct our dataset, part of which is used for training and the rest for validation. Using these depth maps as ground-truth high-resolution depth maps D_h, the input low-resolution depth maps D_l are generated by down-sampling D_h with the factor r, where bicubic interpolation is used as the down-sampling strategy. Note that, to accelerate the training for larger up-sampling factors such as ×4, ×8 and ×16, the parameters of the ×2 model (the deeply supervised novel view synthesis part) are used to initialize the parameters of the first stage of these larger up-sampling factors. For larger up-sampling factors, in each DCNN unit of the first stage, the learning rate varies from 0.01 to 0.0001, and a stepwise decrease (3 steps) is used as the learning policy. Using a Titan X GPU (Pascal), training for the task of depth map super-resolution under up-sampling factors ×2, ×4, ×8 and ×16 roughly takes 3 hours, 4 hours, 5 hours and 6 hours, respectively.
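A short sketch of how the LR training inputs could be generated from the ground-truth HR depth maps is shown below; scipy's cubic-spline zoom is used here as an approximation of the bicubic down-sampling named above.

```python
# A sketch of the LR input generation; scipy's cubic-spline zoom approximates
# the bicubic down-sampling named in the text.
import numpy as np
from scipy.ndimage import zoom

def generate_lr_input(d_h, r):
    """Down-sample a ground-truth HR depth map by the factor r."""
    h, w = d_h.shape
    d_h = d_h[:h - h % r, :w - w % r]          # crop so the size divides evenly by r
    return zoom(d_h, 1.0 / r, order=3)         # cubic resampling (approx. bicubic)
```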


Fig. 6: Experimental results on the clean Laserscan dataset [18] under a large up-sampling factor. (a) Ground truth; (b) Bicubic; (c) Nearest Neighbor; (d) Aodha et al. [18]; (e) Yang et al. [40]; (f) ANR [37]; (g) Zeyde et al. [36]; (h) Ferstl et al. [20]; (i) Xie et al. [16]; (j) Song et al. [2]; (k) VDSR-Net [11]; (l) Our results. Best viewed on screen.

4 Experimental Results

In this section, we present an extensive experimental evaluation of our proposed method. Both quantitative and qualitative results on noise-free and noisy benchmark datasets are provided. Cones, Teddy, Tsukuba and Venus, as noise-free depth maps, are extracted from the Middlebury 2001 and 2003 datasets [47][50]. Art, Books, Laundry and Reindeer, as noisy depth maps, are collected from the Middlebury 2005 dataset [44][45]. Furthermore, to further evaluate our proposed method, Jadeplant, Motorcycle, Playtable and Flower, as noisy depth maps, are extracted from the Middlebury 2014 dataset [52]. Additionally, we also demonstrate test results on the Laserscan dataset (Scan21, Scan30 and Scan42) provided by Aodha et al. [18].

Baseline Methods: We compare our methods with the following five categories of methods:

  1. State-of-the-art single DSR methods: Aodha et al. [18], Hornacek et al. [19], Ferstl et al. [20] and Xie et al. [16];

  2. State-of-the-art color guided DSR methods: Park et al. [21], Yang et al. [22], Ferstl et al. [23], Kiechle et al. [24] (JID);

  3. Single color image super-resolution approaches: Zeyde et al. [36], Yang et al. [40] and Timofte et al. [37], including two kinds of methods, Global Regression (GR) and Anchored Neighborhood Regression (ANR); the neighborhood embedding methods proposed by Bevilacqua et al. [51], including NE+LS, NE+NNLS and NE+LLE; Huang et al. [41], Schulter et al. [42] and Lai et al. [32];

  4. Standard interpolation approaches: Bicubic and Nearest Neighbour (NN);

  5. State-of-the-art deep convolutional neural networks: Wang et al. [43], VDSR-Net [11] (for a fair comparison, results of VDSR-Net are obtained using the same number of layers as ours, i.e., 10 layers for up-sampling factor ×2, 20 layers for ×4, 30 layers for ×8 and 40 layers for ×16), Song et al. [2], Hui et al. [13] (MS-Net) and Riegler et al. [15] (ATGV-Net).

Note that for these deep learning based methods, we retrained the models on our depth map dataset, either using the source code provided by the authors or our own implementations.


Fig. 7: Experimental results on the noisy Middlebury 2014 dataset. (a) Ground truth; (b) Bicubic; (c) Yang et al. [40]; (d) Zeyde et al. [36]; (e) ANR [37]; (f) GR [37]; (g) Park et al. [21]; (h) Ferstl et al. [23]; (i) Wang et al. [43]; (j) Xie et al. [16]; (k) VDSR-Net [11]; (l) Our results. Best viewed on screen.

Error metrics: For quantitative comparison, we use two error metrics to evaluate the results obtained by our method and other state-of-the-art methods: 1) Root Mean Squared Error (RMSE); 2) Structural Similarity Index (SSIM). As a frequently used measure, the RMSE represents the sample standard deviation of the differences between the obtained HR depth map and the ground truth. Specifically, RMSE = sqrt( (1/N) Σ_i (D(i) − D_gt(i))^2 ), where D and D_gt are the obtained HR depth map and the ground truth respectively, and N is the number of pixels in the HR depth map. Meanwhile, SSIM is an error metric which evaluates the perceived quality between the obtained HR depth map and the ground truth depth map. We have also computed the percentage of errors (percentage of pixels whose errors are larger than 1 pixel in the disparity map), and these results are consistent with the RMSE and SSIM metrics reported in our paper.
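The two metrics can be computed as in the following sketch; the use of scikit-image for SSIM is our choice, since the paper does not name a specific implementation.

```python
# The two error metrics in code form; scikit-image for SSIM is an assumption.
import numpy as np
from skimage.metrics import structural_similarity

def rmse(d_pred, d_gt):
    """Root mean squared error between the predicted HR depth map and GT."""
    return float(np.sqrt(np.mean((d_pred - d_gt) ** 2)))

def ssim(d_pred, d_gt):
    """Structural similarity; data_range taken from the ground truth."""
    return float(structural_similarity(d_pred, d_gt,
                                       data_range=float(d_gt.max() - d_gt.min())))
```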

4.1 Clean Input

We evaluate the performance of our method on clean depth maps (no noise is added in the down-sampling process). The commonly used Cones, Teddy, Tsukuba and Venus from the Middlebury stereo dataset are employed. The quantitative results in terms of RMSE and SSIM for up-sampling factors of ×2, ×4 and ×8 are reported in Table I and Table II. We can clearly see that all the DCNN based methods achieve a significant performance leap over other methods under all the up-sampling factors, even though methods [2][21][23] utilize auxiliary intensity information. Additionally, our proposed method outperforms the other DCNN based methods for almost all the up-sampling factors. Note that VDSR-Net [11] utilizes the up-sampled (bicubic) depth map as input, and its structure is the same as our DCNN unit. Additionally, for a fair comparison, results of VDSR-Net [11] are obtained using the same number of layers as ours, which means VDSR-Net contains 10 layers for up-sampling factor ×2, 20 layers for ×4, and 30 layers for ×8. Obviously, our proposed method outperforms VDSR-Net, which demonstrates the excellent performance of our deeply supervised novel view synthesis and multi-scale fusion strategies. Meanwhile, for [32], as shown in Table I and Table II, we trained two kinds of models, namely MSLaplas (up-sampling factors of ×2, ×4 and ×8 trained together) and Laplas (up-sampling factors of ×2, ×4 and ×8 trained separately); we can see that the results obtained by our method are better than [32] under up-sampling factors of ×2, ×4 and ×8, which further proves the effectiveness of our method.

As indicated in Table I and Table II, our deeply supervised novel view synthesis strategy (NVS) outperforms the others for most of the up-sampling factors, and the multi-scale fusion strategy (MSF) can handle the blocking effect and further improve the results for most up-sampling factors, though the results of NVS + MSF are slightly worse than NVS alone for an up-sampling factor of ×2. Lastly, the depth field statistic prior (DFS) can further improve the results.

Qualitative results are illustrated in Fig. 5 for a large up-sampling factor. Obviously, our method produces more visually appealing results. Boundaries generated by our method are sharper and more accurate, which shows that our method can effectively recover the structure of high-resolution depth maps.

Method
Jadeplant Motorcycle Playtable Flower Art Books Laundry Reindeer Scan21 Scan30 Scan42

NN 0.8812 0.9326 0.9502 0.9437 0.8855 0.9565 0.9362 0.9352 0.9410 0.9485 0.9159
Bicubic 0.9067 0.9577 0.9724 0.9647 0.9258 0.9755 0.9589 0.9614 0.9666 0.9699 0.9446
Park et al. [21] 0.9125 0.9613 0.9782 0.9727 0.9401 0.9792 0.9632 0.9721 N/A N/A N/A
Zeyde et al. [36] 0.9067 0.9575 0.9741 0.9651 0.9034 0.9792 0.9632 0.9721 0.9482 0.9644 0.9277
Yang et al. [40] 0.9080 0.9596 0.9750 0.9671 0.9356 0.9804 0.9666 0.9664 0.9675 0.9722 0.9426
ANR [37] 0.9056 0.9560 0.9734 0.9636 0.8995 0.9627 0.9421 0.9414 0.9443 0.9509 0.8970
GR [37] 0.9056 0.9560 0.9734 0.9636 0.9224 0.9723 0.9564 0.9566 0.9598 0.9644 0.9277
JID [24] 0.9080 0.9641 0.9769 0.9716 N/A N/A N/A N/A N/A N/A N/A
Wang et al. [43] 0.9083 0.9578 0.9708 0.9699 0.9362 0.9742 0.9609 0.9643 0.9746 0.9771 0.9577
VDSR-Net [11] 0.9582 0.9639 0.9766 0.9735 0.9372 0.9800 0.9658 0.9683 0.9737 0.9776 0.9547
NVS (Ours) 0.9618 0.9687 0.9810 0.9816 0.9501 0.9847 0.9724 0.9757 0.9817 0.9829 0.9757
NVS + MSF (Ours) 0.9620 0.9685 0.9809 0.9814 0.9504 0.9848 0.9725 0.9757 0.9815 0.9830 0.9746
NVS + MSF + DFS (Ours) 0.9623 0.9689 0.9811 0.9817 0.9509 0.9850 0.9728 0.9760 0.9819 0.9833 0.9752

TABLE VI: Quantitative evaluation (SSIM) for a large up-sampling factor. The SSIM is calculated for different SOTA methods on the noisy Middlebury and Laserscan datasets. The best result is highlighted and the second best is underlined.

4.2 Noisy Input

We also evaluate our proposed method on the noisy Middlebury dataset [45][44][52] and the Laserscan dataset [18]. To simulate the acquisition process of a Time-of-Flight sensor, depth-dependent Gaussian noise is added to the Middlebury dataset (Art, Books, Laundry and Reindeer of the Middlebury 2005 dataset, and Jadeplant, Motorcycle, Playtable and Flower of the Middlebury 2014 dataset) and to the Laserscan dataset (Scan21, Scan30 and Scan42). Following [15], we add depth-dependent Gaussian noise to our low-resolution training data, where the noise variance depends on the depth value of each pixel of the LR depth map.
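A hedged sketch of the noise simulation is shown below; because the exact noise parameters are not recoverable from the text, the depth-dependent standard deviation used here (proportional to depth) is purely illustrative.

```python
# Illustrative depth-dependent Gaussian noise; scale is an assumed parameter.
import numpy as np

def add_depth_dependent_noise(d_l, scale=0.01, rng=None):
    """Add zero-mean Gaussian noise whose std grows with the depth value."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = scale * d_l                        # assumed depth-dependent std
    return d_l + rng.normal(0.0, 1.0, size=d_l.shape) * sigma
```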

Table III and Table IV report the quantitative results in terms of RMSE and SSIM with noisy input, from which we can clearly see that our proposed method outperforms the others. We can also observe that our method can largely eliminate the influence of noise, thus depth maps with smaller RMSE and larger SSIM can be obtained. Note that the results of VDSR-Net [11] are obtained using the same number of layers as ours for the noisy input.

Besides, Fig. 7 illustrates the qualitative results of our method for this setting. As shown in Fig. 7, the scene contains complex textures and rich details, which makes it hard to recover an HR depth map from an LR depth map. Obviously, our method produces more visually appealing results. Boundaries generated by our method are sharper and more accurate, which shows that our method can effectively recover the structure of high-resolution depth maps.


Fig. 8: Experimental results on the noisy Middlebury 2014 dataset. (a) Ground truth; (b) Bicubic; (c) Yang et al. [40]; (d) Zeyde et al. [36]; (e) ANR [37]; (f) GR [37]; (g) Park et al. [21]; (h) Ferstl et al. [23]; (i) Wang et al. [43]; (j) Xie et al. [16]; (k) VDSR-Net [11]; (l) Our results. Best viewed on screen.

4.3 Large Up-sampling Factor

Most of the existing work such as [16][20] can only handle the depth super-resolution problem under relatively small up-sampling factors. In our paper, thanks to the proposed deeply supervised learning network structure, our method can handle large up-sampling factors effectively, while the multi-scale fusion and the depth field statistics further refine the obtained results. In this section, we demonstrate the qualitative results under large up-sampling factors (e.g., ×8 and ×16). This is a very challenging super-resolution task, i.e., on average 256 depth values have to be inferred from a single depth value for ×16.


Fig. 9: Experimental results on the noisy Middlebury dataset under a very large up-sampling factor. (a) Ground truth; (b) Bicubic; (c) Nearest Neighbor; (d) Yang et al. [40]; (e) Zeyde et al. [36]; (f) ANR [37]; (g) GR [37]; (h) Wang et al. [43]; (i) VDSR-Net [11]; (j) Our results. Best viewed on screen.

Fig. 8 shows the qualitative comparison results on the Middlebury 2014 dataset [52], which demonstrate the ability of our method in handling general up-sampling factors. Fig. 6 shows the qualitative comparison results on the Laserscan dataset for a large up-sampling factor. Clean low-resolution depth maps are used as input for Fig. 8 and Fig. 6. As these figures illustrate, the depth maps generated by the other competing methods all suffer from heavy edge blur, while our method obtains results with sharp boundaries and accurate depth edges.

Meanwhile, we also illustrate the qualitative results under a very large up-sampling factor in Fig. 9 (the Middlebury dataset). Note that the Gaussian noise described in Section 4.2 is added to the depth input of Fig. 9. Obviously, our method can largely eliminate the influence of noise and accurately recover high-resolution depth maps with sharp boundaries, even under a very large up-sampling factor.

In addition, Table V and Table VI report the corresponding quantitative results in terms of RMSE and SSIM, respectively. As illustrated in these tables, our method outperforms all competing methods by a margin, achieving smaller RMSE and larger SSIM on all the testing datasets. Note that the same depth-dependent Gaussian noise is added to the depth maps as described in Section 4.2.

4.4 Ablation Analysis

As shown in Table I, Table II, Table III, Table IV, Table V and Table VI, compared with other state-of-the-art methods, our deeply supervised novel view synthesis strategy (NVS) obtains better results. To handle the blocking effect introduced in the resultant depth maps, the multi-scale fusion strategy (NVS + MSF) is used, and as shown in these tables, the multi-scale fusion strategy (MSF) improves the RMSE and SSIM on most of the testing depth maps. Besides, since all DCNN based methods can only learn the generality of depth super-resolution, we also use the depth field statistic (DFS) information (the statistics of each individual depth map) to further refine the results. As shown in these tables, better RMSE and SSIM results can be obtained for most of the testing depth maps by using the depth field statistic prior. Therefore, we can conclude that the three components, namely the deeply supervised novel view synthesis strategy, the multi-scale fusion strategy and the depth field statistics, all contribute positively toward the final success of our approach.

4.5 Influence of the number of layers

In this section, we analyze the influence of the number of layers of the DCNN unit in the novel view synthesis sub-task. Fig. 10 shows the RMSE of our method on the Middlebury dataset, where the number of layers of the DCNN unit varies from 3 to 10. We can observe that the RMSE decreases as the number of layers of the DCNN unit increases; however, the difference between 3 layers and 5 layers is larger than the difference between 5 layers and 10 layers. Thus, we can conclude that a deeper DCNN generally results in better performance, but as the network goes deeper, the improvement becomes marginal. Hence, we fix the number of layers of the DCNN unit to 10 in our experiments.


Fig. 10: RMSE results of our method on the Middlebury dataset (DCNN unit with 3 layers, 5 layers and 10 layers respectively).

4.6 Running time

We summarize the computation time for up-sampling different LR depth maps to their full resolution in Table VII. Up-sampling is performed in MATLAB with a TITAN X GPU (Pascal) using MatConvNet [53]. Note that the running time of MS-Net [13] is obtained from their paper, which is also measured on a TITAN X GPU.

Method ×2 ×4 ×8 ×16
MS-Net [13] 0.211 0.221 0.247 0.277
NVS (Ours) 0.169 0.185 0.205 0.211
NVS + MSF (Ours) 0.315 0.331 0.351 0.357
TABLE VII: Computation time for different methods under different up-sampling factors.

Besides, the running time of the depth field statistics (DFS) step varies considerably depending on the size of the input depth map. For example, the depth field statistic strategy costs 2.845 seconds on a depth map of 370 × 450 resolution, while it costs 10.321 seconds on a depth map of 1376 × 1104 resolution.

4.7 Generalization ability

To evaluate the generalization ability of our network, we use a new training dataset (60 depth maps from the Sintel dataset and 30 depth maps from the synthetic New Tsukuba dataset) to retrain our network, and then test it on the Middlebury dataset. The RMSE results (×8 up-sampling) are shown in Table VIII, where the row NVS (retrained) corresponds to the retrained model and the row NVS shows the results obtained with the training data of Section 3.8. We can see that the two sets of results are very similar, and both are close to the results reported in Table I, which demonstrates the good generalization ability of our network.

To further demonstrate the generalization ability on raw depth maps, we show the average RMSE results on the NYU dataset [54][55] in Table IX. The NYU dataset comprises video sequences of indoor scenes captured by a Kinect. For each set, depth maps are selected randomly. Note that we use the model trained with the training data described in our paper (115 depth maps from the Middlebury, Sintel and New Tsukuba datasets) to test on the NYU dataset. As shown in Table IX, our network model can handle wild and raw depth maps. Note that the results shown here are obtained by our deeply supervised novel view synthesis strategy (NVS).

Method Cones Teddy Tsukuba Venus
NVS 4.9932 2.8483 9.8526 1.1017
NVS (retrained) 4.9010 2.8692 10.2326 1.0481
TABLE VIII: Experimental evaluation of the generalization ability (RMSE, ×8 up-sampling).
Method Set1 Set2 Set3 Set4
Bicubic 2.7900 2.3917 2.7752 1.8444
VDSR-Net [11] 2.0976 1.6226 2.0163 1.3767
NVS (Ours) 2.0399 1.5915 1.9832 1.3513
TABLE IX: Performance evaluation (RMSE) on raw depth maps in the NYU dataset [54][55] captured with a Kinect sensor.

5 Conclusions

In this paper, we propose to represent depth map super-resolution as a series of novel view synthesis sub-tasks, where each sub-task can be efficiently solved in an end-to-end learning manner. The training can be performed in parallel, and neither deconvolution nor pre-processing of the input depth map is needed. Furthermore, a deeply supervised learning framework is proposed to handle large up-sampling factors (e.g., ×8 and ×16), where strong supervision is directly applied at different stages. To further exploit the feature maps at different stages, a multi-scale fusion strategy has also been introduced. Our framework can handle DSR efficiently without the need for high-resolution intensity images, and both qualitative and quantitative results demonstrate the outstanding performance of our method compared with state-of-the-art DSR methods. In the future, we plan to investigate how to exploit color images under the same framework and how to learn natural and realistic depth map statistics through a generative adversarial network (GAN).

Acknowledgments

This work is supported by National Natural Science Foundation of China (Nos.61672326, 61671387, 61420106007), National 1000 Young Talents Plan of China, ARC Grants (DE140100180, LP100100588), and China Scholarship Council.

References

  • [1] J. Zhang, Y. Cao, Z.-J. Zha, Z. Zheng, C. W. Chen, Z. Wang: A Unified Scheme for Super-Resolution and Depth Estimation From Asymmetric Stereoscopic Video. IEEE Trans. Circuits Syst. Video Techn. 26(3): 479-493, 2016.
  • [2] X. Song, Y. Dai, X. Qin: Deep Depth Super-Resolution: Learning Depth Super-Resolution Using Deep Convolutional Neural Network. In Proc. Asian Conf. Comp. Vis. pp. 360-376, 2016.
  • [3] F. Zhou, S.-T. Xia, Qi. Liao: Nonlocal Pixel Selection for Multisurface Fitting-Based Super-Resolution. IEEE Trans. Circuits Syst. Video Techn. 24(12): 2013-2017, 2014.
  • [4] L.-J. Deng, W. Guo, T.-Z. Huang: Single-Image Super-Resolution via an Iterative Reproducing Kernel Hilbert Space Method. IEEE Trans. Circuits Syst. Video Techn. 26(11): 2001-2014, 2016.
  • [5] J. Sun, W. Cao, Z. Xu, J. Ponce: Learning a convolutional neural network for non-uniform motion blur removal. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 769-777, 2015.
  • [6] L. Xu, J. S. J. Ren, C. Liu, J. Jia: Deep Convolutional Neural Network for Image Deconvolution. In Proc. Adv. Neural Inf. Process. Syst. pp. 1790-1798, 2014.
  • [7] V. Jain, H. S. Seung: Natural Image Denoising with Convolutional Networks. In Proc. Adv. Neural Inf. Process. Syst. pp. 769-776, 2008.
  • [8] J. Xie, L. Xu, E. Chen: Image Denoising and Inpainting with Deep Neural Networks. In Proc. Adv. Neural Inf. Process. Syst. pp. 350-358, 2012.
  • [9] C. Dong, C. C. Loy, K. He, X. Tang: Learning a Deep Convolutional Network for Image Super-Resolution. In Proc. Eur. Conf. Comp. Vis. pp. 184-199, 2014.
  • [10] C. Dong, C. C. Loy, X. Tang: Accelerating the Super-Resolution Convolutional Neural Network. In Proc. Eur. Conf. Comp. Vis. pp. 391-407, 2016.
  • [11] J. Kim, J. K. Lee, K. M. Lee: Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1646-1654, 2016.
  • [12] J. Kim, J. K. Lee, K. M. Lee: Deeply-Recursive Convolutional Network for Image Super-Resolution. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1637-1645, 2016.
  • [13] T.-W. Hui, C. C. Loy, X. Tang: Depth Map Super-Resolution by Deep Multi-Scale Guidance. In Proc. Eur. Conf. Comp. Vis. pp. 353-369, 2016.
  • [14] G. Riegler, D. Ferstl, M. Rüther, H. Bischof: A Deep Primal-Dual Network for Guided Depth Super-Resolution. In Proc. Brit. Mach. Vis. Conf. 2016.
  • [15] G. Riegler, M. Rüther, H. Bischof: ATGV-Net: Accurate Depth Super-Resolution. In Proc. Eur. Conf. Comp. Vis. pp. 268-284, 2016.
  • [16] J. Xie, R. S. Feris, M.-T. Sun: Edge-Guided Single Depth Image Super Resolution. IEEE Trans. Image Proc. 25(1): 428-438, 2016.
  • [17] W. T. Freeman, T. R. Jones, E. C. Pasztor: Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22(2): 56-65, 2002.
  • [18] O. M. Aodha, N. D. F. Campbell, A. Nair, G. J. Brostow: Patch Based Synthesis for Single Depth Image Super-Resolution. In Proc. Eur. Conf. Comp. Vis. pp. 71-84, 2012.
  • [19] M. Hornacek, C. Rhemann, M. Gelautz, C. Rother: Depth Super Resolution by Rigid Body Self-Similarity in 3D. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1123-1130, 2013.
  • [20] D. Ferstl, M. Rüther, H. Bischof: Variational Depth Superresolution Using Example-Based Edge Representations. In Proc. IEEE Int. Conf. Comp. Vis. pp. 513-521, 2015.
  • [21] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, I.-S. Kweon: High quality depth map upsampling for 3D-TOF cameras. In Proc. IEEE Int. Conf. Comp. Vis. pp. 1623-1630, 2011.
  • [22] J. Yang, X. Ye, K. Li, C. Hou: Depth Recovery Using an Adaptive Color-Guided Auto-Regressive Model. In Proc. Eur. Conf. Comp. Vis. pp. 158-171, 2012.
  • [23] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, H. Bischof: Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation. In Proc. IEEE Int. Conf. Comp. Vis. pp. 993-1000, 2013.
  • [24] M. Kiechle, S. Hawe, M. Kleinsteuber: A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution. In Proc. IEEE Int. Conf. Comp. Vis. pp. 1545-1552, 2013.
  • [25] K. Matsuo, Y. Aoki: Depth image enhancement using local tangent plane approximations. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 3574-3583, 2015.
  • [26] J. Lu, D. A. Forsyth: Sparse depth super resolution. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 2245-2253, 2015.
  • [27] J. Zhang, T. Zhang, Y. Dai, M. Harandi, R. Hartley: Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 9029-9038, 2018.
  • [28] A. Borji, M. Cheng, H. Jiang, J. Li: Salient object detection: A benchmark. IEEE Trans. Image Proc. 24(12): 5706-5722, 2015.
  • [29] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, Z. Wang: Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1874-1883, 2016.
  • [30] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 105-114, 2017.
  • [31] Y. Tai, J. Yang, X. Liu: Image Super-Resolution via Deep Recursive Residual Network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 2790-2798, 2017.
  • [32] W.-S. Lai, J.-B. Huang, N. Ahuja, M.-H. Yang: Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 5835-5843, 2017.
  • [33] M. S. M. Sajjadi, B. Schölkopf, M. Hirsch: EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. In Proc. IEEE Int. Conf. Comp. Vis. pp. 4501-4510, 2017.
  • [34] T. Tong, G. Li, X. Liu, Q. Gao: Image Super-Resolution Using Dense Skip Connections. In Proc. IEEE Int. Conf. Comp. Vis. pp. 4809-4817, 2017.
  • [35] Y. Huang, L. Shao, A. F. Frangi: Simultaneous Super-Resolution and Cross-Modality Synthesis of 3D Medical Images Using Weakly-Supervised Joint Convolutional Sparse Coding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 5787-5796, 2017.
  • [36] R. Zeyde, M. Elad, M. Protter: On Single Image Scale-Up Using Sparse-Representations. In International Conference on Curves and Surfaces pp. 711-730, 2010.
  • [37] R. Timofte, V. D. Smet, L. J. Van Gool: Anchored Neighborhood Regression for Fast Example-Based Super-Resolution. In Proc. IEEE Int. Conf. Comp. Vis. pp. 1920-1927, 2013.
  • [38] R. Chartrand, W. Yin: Iteratively reweighted algorithms for compressive sensing. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process. pp. 3869-3872, 2008.
  • [39] T. Ajanthan, R. I. Hartley, M. Salzmann, H. Li: Iteratively reweighted graph cut for multi-label MRFs with non-convex priors. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 5144-5152, 2015.
  • [40] J. Yang, J. Wright, T. S. Huang, Y. Ma: Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 19(11): 2861-2873, 2010.
  • [41] J.-B. Huang, A. Singh, N. Ahuja: Single image super-resolution from transformed self-exemplars. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 5197-5206, 2015.
  • [42] S. Schulter, C. Leistner, H. Bischof: Fast and accurate image upscaling with super-resolution forests. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 3791-3799, 2015.
  • [43] Z. Wang, D. Liu, J. Yang, W. Han, T. S. Huang: Deep Networks for Image Super-Resolution with Sparse Prior. In Proc. IEEE Int. Conf. Comp. Vis. pp. 370-378, 2015.
  • [44] H. Hirschmüller, D. Scharstein: Evaluation of Cost Functions for Stereo Matching. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1-8, 2007.
  • [45] D. Scharstein, C. Pal: Learning Conditional Random Fields for Stereo. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1-8, 2007.
  • [46] B. Li, C. Shen, Y. Dai, A. van den Hengel, M. He: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1119-1127, 2015.
  • [47] D. Scharstein, R. Szeliski: A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal of Computer Vision 47(1-3): 7-42, 2002.
  • [48] D. J. Butler, J. Wulff, G. B. Stanley, M. J. Black: A Naturalistic Open Source Movie for Optical Flow Evaluation. In Proc. Eur. Conf. Comp. Vis. pp. 611-625, 2012.
  • [49] S. Martull, M. Peris, K. Fukui: Realistic CG Stereo Image Dataset with Ground Truth Disparity Maps. IEICE Technical Report 111(430): 117-118, 2012.
  • [50] D. Scharstein, R. Szeliski: High-Accuracy Stereo Depth Maps Using Structured Light. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 195-202, 2003.
  • [51] M. Bevilacqua, A. Roumy, C. Guillemot, M.-L. Alberi-Morel: Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding. In Proc. Brit. Mach. Vis. Conf. pp. 1-10, 2012.
  • [52] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, P. Westling: High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In Proc. German Conference on Pattern Recognition pp. 31-42, 2014.
  • [53] A. Vedaldi, K. Lenc: MatConvNet: Convolutional Neural Networks for MATLAB. In ACM international conference on Multimedia pp. 689-692, 2015.
  • [54] N. Silberman, D. Hoiem, P. Kohli, R. Fergus: Indoor Segmentation and Support Inference from RGBD Images. In Proc. Eur. Conf. Comp. Vis. pp.746-760, 2012.
  • [55] N. Silberman, R. Fergus: Indoor scene segmentation using a structured light sensor. In ICCV Workshops pp. 601-608, 2011.
  • [56] B. Li, Y. Dai, M. He: Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition 83(11): 328-339, 2018.