Normal Assisted Stereo Depth Estimation

Accurate stereo depth estimation plays a critical role in various 3D tasks in both indoor and outdoor environments. Recently, learning-based multi-view stereo methods have demonstrated competitive performance with a limited number of views. However, in challenging scenarios, especially when building cross-view correspondences is hard, these methods still cannot produce satisfactory results. In this paper, we study how to enforce consistency between surface normal and depth at training time to improve performance. We couple the learning of a multi-view normal estimation module and a multi-view depth estimation module. In addition, we propose a novel consistency loss to train an independent consistency module that refines the depths from depth/normal pairs. We find that joint learning improves both the normal and the depth predictions, and that accuracy and smoothness can be further improved by enforcing the consistency. Experiments on MVS, SUN3D, RGBD, and Scenes11 demonstrate the effectiveness of our method and its state-of-the-art performance.


1 Introduction

Work done while visiting University of California San Diego.

Multi-view stereo (MVS) is one of the most fundamental problems in computer vision and has been studied for decades. Recently, learning-based MVS methods have witnessed significant improvements over their traditional counterparts [46, 23, 47, 7]. In general, these methods formulate the task as an optimization problem whose target is to minimize the overall sum of pixel-wise depth discrepancies. However, the lack of geometric constraints leads to bumpy depth predictions, especially in low-texture or textureless areas, as shown in Fig. 1. Compared with depth, which is a property of the global geometry, surface normal is a more local geometric property and can be inferred more easily from visual appearance. For instance, it is much easier for humans to judge whether a wall is flat than to estimate its absolute depth. Fig. 1 shows an example where learning-based MVS methods perform poorly on depth estimation but significantly better on normal prediction.

Figure 1: Illustration of separate learning versus joint learning of depth and normal. While the normal prediction is smooth and accurate, the existing state-of-the-art stereo depth prediction is noisy. Our method improves the prediction quality significantly by jointly learning depth and normal and enforcing consistency between them. Best viewed in color on screen.

Attempts have been made to incorporate normal-based geometric constraints into the optimization to improve monocular depth prediction [49, 60]. One simple form of enforcing a consistency constraint between depth and normal is to enforce orthogonality between the predicted normal and the tangent directions computed from the predicted depths at every point. However, when used as a regularizing loss function during training, we find that this naive consistency in the world coordinate space is a very soft constraint, as there are many sub-optimal depth solutions that are consistent with a given normal. Optimizing depths to be consistent with the normal as a post-processing step [60] ensures local consistency; however, not only is this an expensive step at inference time, but the post-processed result may also lose grounding in the input images. Therefore, we propose a new formulation of depth-normal consistency that improves the training process. Our consistency is defined in the pixel coordinate space, and we show in Section 4.4 that our formulation outperforms both the simple consistency and previous methods for making the geometry consistent. This constraint is independent of the multi-view formulation and can be used to enforce consistency on any pair of depth and normal, even in the single-view setting. Our contributions are mainly the following:

First, we propose a novel cost-volume-based multi-view surface normal prediction network (NNet). By constructing a 3D cost volume through plane sweeping and accumulating multi-view image information onto different planes through projection, NNet learns to infer the normal accurately using the image information at the correct depth. The cost volume built from image features of multiple views retains the available feature information while also enforcing additional constraints on the correspondences, and thus on the depth of each point. We show that the cost volume is a better structural representation for learning to estimate the underlying surface normal from image features. While in the single-image setting the network tends to overfit to texture and color and generalizes poorly, our normal estimation generalizes better because it learns from a better abstraction than single-view images.

Further, we demonstrate that learning a normal estimation model on the cost volume jointly with the depth estimation pipeline benefits both tasks. Both traditional and learning-based stereo methods suffer from the noisy nature of the cost volume. The problem is most significant on textureless surfaces, where image-feature-based matching does not offer enough cues. We show that forcing the network to predict accurate normal maps from the cost volume regularizes the cost volume representation and thereby helps produce better depth estimates. Experiments on the MVS, SUN3D, RGBD, Scenes11, and Scene Flow datasets demonstrate that our method achieves state-of-the-art performance.

2 Related Work

In this section we review the literature relevant to our work on stereo depth estimation, normal estimation, and multi-task learning for multi-view geometry.

Classical Stereo Matching. A stereo algorithm typically consists of the following steps: matching cost calculation, matching cost aggregation, and disparity calculation. As the pixel representation plays a critical role in this process, the literature has exploited a variety of representations, from simple RGB colors to hand-crafted feature descriptors [50, 44, 39, 31, 2]. Together with post-processing techniques such as Markov random fields [38] and semi-global matching [20], these methods work well in relatively simple scenarios.

Learning-based Stereo. To deal with more complex real-world scenes, researchers have recently leveraged CNNs to extract pixel-wise features and match correspondences [25, 51, 28, 5, 30, 29, 19]. The learned representations are more robust to low-texture regions and varying lighting [22, 46, 47, 7, 32]. Rather than directly estimating depth from image pairs as in many previous deep learning methods, some approaches also incorporate semantic cues and context information into the cost aggregation process [43, 8, 24] and achieve positive results. While other geometric information such as normals and boundaries [57, 16, 26] is widely used in traditional methods to further improve reconstruction accuracy, it is non-trivial to explicitly enforce geometric constraints in learning-based approaches [15]. To the best of our knowledge, this is the first work that solves depth and normal estimation in the multi-view scenario in a joint learning fashion.

Surface Normal Estimation.

Surface normals are an important source of geometric information for 3D scene understanding. Recently, several data-driven methods have achieved promising results [12, 1, 14, 41, 3, 53]. While these methods learn image-level features and textures to address normal prediction in a single-image setting, we propose a multi-view method that generalizes better and reduces the learning complexity of the task.

Joint Learning of Normal and Depth. With deep learning, numerous methods have been proposed for jointly learning normal and depth [34, 18, 54, 11, 48, 59, 12]. Even though these methods have made progress, they all focus on the single-image scenario, while few works explore joint estimation of normal and depth in the multi-view setting. Gallup et al. [17] estimated candidate plane directions for warping during plane sweep stereo and further enforced an integrability penalty between the predicted depths and the candidate plane for better performance on slanted surfaces; however, the geometric constraints are only applied in a post-processing or optimization step (e.g., an energy model or graph cut). The lack of an end-to-end learning mechanism makes these methods prone to getting stuck in sub-optimal solutions. In this work, our experiments demonstrate that, with careful design, joint learning of normal and depth benefits both tasks, as the geometric information is easier to capture. Benefiting from the learned representations, our approach achieves competitive results and improves upon previous methods.

Figure 2: Illustration of our pipeline. We first extract deep image features from the input views and build a feature cost volume by feature warping. Depth and normal are jointly learned in a supervised fashion. We then use the proposed consistency module to refine the depth and apply a consistency loss.

3 Approach

We propose an end-to-end pipeline for multi-view depth and normal estimation as shown in Fig. 2. The entire pipeline can be viewed as two modules. The first module consists of joint estimation of depth and normal maps from the cost volume built from multi-view image features. The subsequent module refines the predicted depth by enforcing consistency between the predicted depth and normal maps using the proposed consistency loss. In the first module, joint prediction of normal from the cost volume implicitly improves the learned model for depth estimation. The second module is explicitly trained to refine the estimates so that the refined depth is consistent with the predicted normal map.

3.1 Learning based Plane Sweep Stereo

First, we describe our depth prediction module. In terms of the type of target, current learning-based stereo methods can be divided into two categories: single-object reconstruction [45, 7] and scene reconstruction [23, 22]. Compared with single-object reconstruction, scene reconstruction, where multiple objects are present, requires a larger receptive field so that the network can better infer context information. Because this work also aims at scene reconstruction, we adopt DPSNet [23], a state-of-the-art scene reconstruction method, as our depth prediction module.

The inputs to the network are a reference image and a neighboring view of the same scene, along with the intrinsic camera parameters and the extrinsic transformation between the two views. We first extract deep image features using a spatial pyramid pooling module. A cost volume is then built by plane sweeping, and 3D CNNs are applied to it. When more views are available, multiple cost volumes can be built and averaged. Context-aware cost aggregation [23] is further used to regularize the noisy cost volume. The final depth is regressed from the final cost volume using the soft argmin [27].
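To make the plane-sweep construction concrete, the following PyTorch sketch warps source-view features onto fronto-parallel planes in the reference frame and regresses depth with a soft argmin. It is a minimal illustration of the general mechanism, not the DPSNet implementation; the cost definition (feature concatenation) and all function names are our assumptions.

```python
import torch
import torch.nn.functional as F

def plane_sweep_volume(feat_ref, feat_src, K_ref, K_src, R, t, depth_values):
    """Build a plane-sweep cost volume by warping source features onto
    fronto-parallel planes defined in the reference camera frame.

    feat_ref, feat_src: [B, C, H, W] deep image features
    K_ref, K_src:       [B, 3, 3] intrinsics
    R, t:               [B, 3, 3], [B, 3, 1] relative pose (ref -> src)
    depth_values:       [D] hypothesized depths of the sweeping planes
    """
    B, C, H, W = feat_ref.shape
    device = feat_ref.device

    # Pixel grid of the reference view in homogeneous coordinates: [B, 3, H*W]
    v, u = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32),
                          indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=0).view(3, -1).unsqueeze(0).expand(B, -1, -1)

    cam_rays = torch.inverse(K_ref) @ pix                    # back-projected rays
    warped = []
    for d in depth_values:
        pts = cam_rays * d                                   # 3D points on the plane at depth d
        pts_src = R @ pts + t                                # transform into the source frame
        proj = K_src @ pts_src                               # project into the source image
        xy = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)      # perspective divide
        # Normalize pixel coordinates to [-1, 1] for grid_sample
        x_norm = 2.0 * xy[:, 0] / (W - 1) - 1.0
        y_norm = 2.0 * xy[:, 1] / (H - 1) - 1.0
        grid = torch.stack([x_norm, y_norm], dim=-1).view(B, H, W, 2)
        warped.append(F.grid_sample(feat_src, grid, align_corners=True))
    warped = torch.stack(warped, dim=2)                      # [B, C, D, H, W]

    # One common cost definition: concatenate reference and warped source features per plane
    ref = feat_ref.unsqueeze(2).expand_as(warped)
    return torch.cat([ref, warped], dim=1)                   # [B, 2C, D, H, W]

def soft_argmin_depth(prob_volume, depth_values):
    """Regress depth as the expectation over the per-pixel depth distribution."""
    # prob_volume: [B, D, H, W], softmax-normalized along D
    return torch.sum(prob_volume * depth_values.view(1, -1, 1, 1), dim=1)
```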

3.2 Cost Volume based Surface Normal Estimation

Figure 3: Network architecture of cost volume based surface normal estimation, NNet

In this section, we describe the network architecture for cost-volume-based surface normal estimation (Fig. 3). The cost volume contains all the spatial information of the scene as well as the image features within it. The probability volume models a depth distribution across candidate planes for each pixel. In the limiting case of infinitely many candidate planes, an accurately estimated probability volume becomes the implicit function representation of the underlying surface, i.e., it takes value 1 where a point lies on the surface and 0 everywhere else. This motivates us to use the cost volume, which also contains the image-level features, to estimate the surface normal map of the underlying scene.

Given the cost volume, we concatenate the world coordinates of every voxel to its feature. Then we use three layers of 2-strided convolution along the depth dimension to reduce the size of this input. Consider the fronto-parallel slices s_i of the reduced volume. We pass each slice through a normal-estimation network (NNet). NNet contains 7 layers of 2D convolutions whose receptive field grows with depth through dilated convolutions (dilation rates 1, 2, 4, 6, 8, 1, 1). We add the outputs over all slices and normalize the sum to obtain the estimate of the normal map:

\[ \mathcal{N} \;=\; \frac{\sum_i \mathrm{NNet}(s_i)}{\left\lVert \sum_i \mathrm{NNet}(s_i) \right\rVert} \qquad (1) \]

We explain the intuition behind this choice as follows. Each slice contains information about the patch-match similarity of each pixel across all views, conditioned on the hypothesized depths within the receptive field of that slice. In addition, due to the strided 3D convolutions, the slice features accumulate information from a group of neighboring planes. The positional information of each pixel in each plane is explicitly encoded into the feature by the concatenated world coordinates. Hence NNet(s_i) is an estimate of the normal at each pixel conditioned on the depths in the receptive field of the current slice. For a particular pixel, slices close to the ground-truth depth predict good normal estimates, whereas slices far from the ground truth predict near-zero estimates. One way to see this is that the magnitude of a slice's normal estimate at a pixel can be seen as the correspondence probability at that slice for that pixel, and its direction can be seen as the vector aligning with the strong correspondences in the local patch around the pixel in that slice. (Refer to the appendix for a visualization of the NNet slices.)
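The following PyTorch sketch illustrates the slice-wise design described above: a small dilated-convolution network applied to each fronto-parallel slice, with the per-slice outputs summed and normalized as in Eq. (1). Channel widths, the activation choices, and the slice tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NNet(nn.Module):
    """Per-slice normal estimator: 7 dilated 2D convolutions with growing
    receptive field (dilation rates 1, 2, 4, 6, 8, 1, 1)."""
    def __init__(self, in_channels, hidden=64):
        super().__init__()
        dilations = [1, 2, 4, 6, 8, 1, 1]
        layers, c = [], in_channels
        for i, d in enumerate(dilations):
            out = 3 if i == len(dilations) - 1 else hidden   # last layer outputs (nx, ny, nz)
            layers += [nn.Conv2d(c, out, kernel_size=3, padding=d, dilation=d)]
            if i < len(dilations) - 1:
                layers += [nn.ReLU(inplace=True)]
            c = out
        self.net = nn.Sequential(*layers)

    def forward(self, slice_feat):                           # [B, C, H, W]
        return self.net(slice_feat)                          # un-normalized per-slice normal

def estimate_normal_map(slice_feats, nnet):
    """Sum the per-slice estimates over all slices and normalize (Eq. 1)."""
    # slice_feats: [B, D, C, H, W]; each s_i is one fronto-parallel slice
    per_slice = [nnet(slice_feats[:, i]) for i in range(slice_feats.shape[1])]
    summed = torch.stack(per_slice, dim=0).sum(dim=0)        # [B, 3, H, W]
    return F.normalize(summed, dim=1)                        # unit-norm normal map
```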

We train the first module with ground-truth depth supervision on the predicted depths, along with ground-truth normal supervision on the predicted normal map. The loss function is defined as follows.

(2)

where the norm used for both the depth and normal terms is the Huber norm (also referred to as the smooth L1 loss).

3.3 Depth Normal Consistency

In addition to estimating depth and normal jointly from the cost volume, we use a novel consistency loss to enforce consistency between the estimated depth and normal maps. There are several ways of doing this; one straightforward approach is to estimate the surface tangent from the predicted depth map and enforce it to be orthogonal to the estimated normal, either through a loss or through optimization during post-processing [58]. This loss function can be written as,

(3)

where (X, Y, Z) are the world coordinates and (n_x, n_y, n_z) represents the normal map. This is the same as saying that the tangent estimated from two neighbors on the surface must be orthogonal to the normal estimate at them. This loss is minimized when the depth gradients in the world coordinate space align with the normals locally. It enforces a constraint on the depth gradient but not on the depth itself. When implemented as a loss over pixel-level predictions, the tangent estimated from neighboring pixels produces gradients back to the network that are symmetric with respect to the depths of both pixels. This does not differentiate between an already confident depth estimate and a poor one; it simply pushes the local gradient toward consistency with the normal. We show experimentally in Section 4.4 that the performance improvement from using this loss as a regularizer is insignificant when compared to a naive UNet-based refinement.
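For concreteness, one common instantiation of this tangent-normal orthogonality loss, written in the notation above, is the following sketch (not necessarily the exact form used in prior work):

```latex
% Illustrative orthogonality loss between depth-induced tangents and predicted normals.
% P_p = (X_p, Y_p, Z_p) is the back-projected world point at pixel p,
% N_p = (n_x, n_y, n_z) is the predicted normal, and \mathcal{N}(p) are the neighbors of p.
\[
  L_{\perp} \;=\; \sum_{p} \sum_{q \in \mathcal{N}(p)} \bigl|\, (P_q - P_p) \cdot N_p \,\bigr|
\]
```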

We propose a method that estimates the spatial gradient of the depth map in the pixel coordinate space (u, v) using the depth Z and the normal map. Projecting the gradient into the pixel coordinate space lets the loss depend on the depth of the pixel itself in addition to the depth gradient consistency. Thus, the loss not only requires the depth gradient to be consistent with the normal map, but also constrains the depths themselves. A simple way to view this is: many surfaces can have their depth gradient in the world coordinate space consistent with a given normal map, but the constraint on the projection of the depth gradient into pixel coordinates is stronger, as it implicitly constrains the depth itself. We compute two estimates of (∂Z/∂u, ∂Z/∂v) and enforce them to be consistent.

Estimate 1:

(4)

where Z refers to the depth map and (u, v) to the pixel coordinates. Note that the right-hand side is the output of a Sobel filter applied to the depth map.

We assume the underlying scene to be a smooth surface that can be expressed as an implicit function of the world coordinates. The normal map is an estimate of the gradient of this implicit function.

Estimate 2:

(5)
(6)
(7)

From (5), (6) & (7),

(8)

where f_x and f_y are the focal lengths in pixel coordinates. The consistency loss is given by

(9)

The first estimate, i.e., the Sobel gradient, propagates symmetric gradients to neighboring pixels, but the second estimate contains the depth of the current pixel, which makes the overall gradients asymmetric. We exploit this fact to propagate depths using the normal map. Instead of both pixels agreeing on a solution that minimizes the loss (as in the orthogonality loss), each pixel tries to move the other to fit its local geometry. With sufficient ground-truth supervision, we expect confident depths to be propagated throughout. We show experimentally that our loss formulation outperforms both the previous formulation and naive learning-based refinement in Section 4.4.
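A PyTorch sketch of the two gradient estimates and the resulting consistency loss is given below, assuming the standard pinhole model; the derivation behind Estimate 2 is our reconstruction from the description above, and the exact filters, masking, and loss weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def sobel_depth_gradient(depth):
    """Estimate 1: spatial gradient (dZ/du, dZ/dv) of the depth map via Sobel filters."""
    # depth: [B, 1, H, W]
    kx = depth.new_tensor([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]).view(1, 1, 3, 3) / 8.0
    ky = kx.transpose(2, 3)
    gu = F.conv2d(depth, kx, padding=1)
    gv = F.conv2d(depth, ky, padding=1)
    return gu, gv

def normal_depth_gradient(depth, normal, fx, fy, cx, cy):
    """Estimate 2: (dZ/du, dZ/dv) implied by the normal map under a pinhole camera.

    Our reconstruction: with X = (u - cx) Z / fx, Y = (v - cy) Z / fy and the
    surface constraint n . dP = 0, solving for dZ/du and dZ/dv yields the forms below.
    """
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype, device=depth.device),
                          torch.arange(W, dtype=depth.dtype, device=depth.device),
                          indexing="ij")
    nx, ny, nz = normal[:, 0:1], normal[:, 1:2], normal[:, 2:3]
    denom = nx * (u - cx) / fx + ny * (v - cy) / fy + nz
    denom = torch.where(denom.abs() < 1e-6, torch.full_like(denom, 1e-6), denom)
    gu = -nx * depth / (fx * denom)
    gv = -ny * depth / (fy * denom)
    return gu, gv

def consistency_loss(depth, normal, fx, fy, cx, cy):
    """Penalize disagreement between the two gradient estimates (smooth L1 / Huber)."""
    gu1, gv1 = sobel_depth_gradient(depth)
    gu2, gv2 = normal_depth_gradient(depth, normal, fx, fy, cx, cy)
    return F.smooth_l1_loss(gu1, gu2) + F.smooth_l1_loss(gv1, gv2)
```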

One significant advantage of our loss is that it can be used for any depth refinement/completion method, including single-view estimation, given an estimate of the normal map. In our pipeline, we implement this loss in an independent module. We use a UNet [35] that takes the raw depth and normal estimates as inputs and predicts refined depth and normal estimates. We train the entire pipeline in a modular fashion: we first train the first module with its supervision loss, and then add the second module trained with the consistency loss. We formulate the training of the second module as an ADMM optimization problem with the consistency loss as the objective and the depth and normal proximities to the ground truths as constraints (see the appendix).

4 Experiments

4.1 Datasets

We use the SUN3D [42], RGBD [37], and Scenes11 [4] datasets for training our end-to-end pipeline from scratch. The train set contains 166,285 image pairs from 50,420 scenes (SUN3D: 29,294; RGBD: 3,373; Scenes11: 17,753). Scenes11 is a synthetic dataset, whereas SUN3D and RGBD consist of real-world indoor environments. We test on the same split as previous methods and report the common quantitative measures of depth quality: absolute relative error (Abs Rel), absolute relative inverse error (Abs R-Inv), absolute difference error (Abs diff), square relative error (Sq Rel), root mean square error and its log scale (RMSE and RMSE log), and inlier ratios (percentage of pixels with δ < 1.25, 1.25², and 1.25³).
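For reference, a small sketch of these depth metrics under their standard (Eigen-style) definitions, which the text above assumes but does not spell out:

```python
import torch

def depth_metrics(pred, gt, eps=1e-6):
    """Standard depth-evaluation metrics (Eigen-style definitions assumed)."""
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]

    abs_rel = torch.mean(torch.abs(pred - gt) / gt)
    abs_diff = torch.mean(torch.abs(pred - gt))
    sq_rel = torch.mean((pred - gt) ** 2 / gt)
    rmse = torch.sqrt(torch.mean((pred - gt) ** 2))
    rmse_log = torch.sqrt(torch.mean((torch.log(pred) - torch.log(gt)) ** 2))

    # Inlier ratios: fraction of pixels whose max(pred/gt, gt/pred) is below 1.25^k
    ratio = torch.max(pred / gt, gt / pred)
    deltas = [torch.mean((ratio < 1.25 ** k).float()) for k in (1, 2, 3)]

    return {"abs_rel": abs_rel, "abs_diff": abs_diff, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log,
            "d1": deltas[0], "d2": deltas[1], "d3": deltas[2]}
```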

We also evaluate our method on a different class of datasets used by other state-of-the-art methods. We train and test on the Scene Flow datasets [33], which consist of 35,454 training and 4,370 test stereo pairs at 960×540 resolution with both synthetic and natural scenes. The metrics we use on this dataset are the widely used average end-point error (EPE) and the 1-pixel-threshold error rate.

Further, we evaluate our method on ScanNet [10]. The dataset consists of 94,212 image pairs from 1,201 scenes. We use the same test split as [52]. We follow [46] for neighboring-view selection and generate ground-truth normal maps following [13]. We use ScanNet to evaluate the surface normal estimation task as well, reporting the mean angle error (mean) and median angle error (median) per pixel. In addition, we report the fraction of pixels whose absolute angle difference from the ground truth is less than t, where t ∈ {11.25°, 22.5°, 30°}. In all tables, we mark metrics for which lower is better with (↓) and metrics for which higher is better with (↑).
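Similarly, a short sketch of the normal-estimation metrics, assuming unit-normalized normal vectors:

```python
import torch

def normal_metrics(pred, gt):
    """Per-pixel angular error between predicted and ground-truth normals
    (standard definitions assumed; inputs are unit-normalized [N, 3] vectors)."""
    cos = torch.clamp((pred * gt).sum(dim=1), -1.0, 1.0)
    ang = torch.rad2deg(torch.acos(cos))                     # angle error in degrees

    return {"mean": ang.mean(),
            "median": ang.median(),
            "11.25": (ang < 11.25).float().mean(),
            "22.5": (ang < 22.5).float().mean(),
            "30": (ang < 30.0).float().mean()}
```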

For all the experiments, to be consistent with other works, we use only two views for training and evaluation. Please refer to the appendix for details on view selection and ground-truth normal generation.

Dataset Method Abs Rel(↓) Abs diff(↓) Sq Rel(↓) RMSE(↓) RMSE log(↓) δ<1.25(↑) δ<1.25²(↑) δ<1.25³(↑)
MVS COLMAP [36] 0.3841 0.8430 1.257 1.4795 0.5001 0.4819 0.6633 0.8401
(Outdoor) DeMoN [40] 0.3105 1.3291 19.970 2.6065 0.2469 0.6411 0.9017 0.9667
DeepMVS [21] 0.2305 0.6628 0.6151 1.1488 0.3019 0.6737 0.8867 0.9414
DPSNet-U [23] 0.0813 0.2006 0.0971 0.4419 0.1595 0.8853 0.9454 0.9735
Ours 0.0679 0.1677 0.0555 0.3752 0.1419 0.9054 0.9644 0.9879
SUN3D COLMAP [36] 0.6232 1.3267 3.2359 2.3162 0.6612 0.3266 0.5541 0.7180
(Indoor) DeMoN [40] 0.2137 2.1477 1.1202 2.4212 0.2060 0.7332 0.9219 0.9626
DeepMVS [21] 0.2816 0.6040 0.4350 0.9436 0.3633 0.5622 0.7388 0.8951
DPSNet-U [23] 0.1469 0.3355 0.1165 0.4489 0.1956 0.7812 0.9260 0.9728
Ours 0.1332 0.3038 0.0910 0.3994 0.1820 0.8168 0.9421 0.9789
RGBD COLMAP [36] 0.5389 0.9398 1.7608 1.5051 0.7151 0.2749 0.5001 0.7241
(Indoor) DeMoN [40] 0.1569 1.3525 0.5238 1.7798 0.2018 0.8011 0.9056 0.9621
DeepMVS [21] 0.2938 0.6207 0.4297 0.8684 0.3506 0.5493 0.8052 0.9217
DPSNet-U [23] 0.1508 0.5312 0.2514 0.6952 0.2421 0.8041 0.8948 0.9268
Ours 0.1314 0.4737 0.2126 0.6190 0.2091 0.8565 0.9289 0.9450
Scenes11 COLMAP [36] 0.6249 2.2409 3.7148 3.6575 0.8680 0.3897 0.5674 0.6716
(Synthetic) DeMoN [40] 0.5560 1.9877 3.4020 2.6034 0.3909 0.4963 0.7258 0.8263
DeepMVS [21] 0.2100 0.5967 0.3727 0.8909 0.2699 0.6881 0.8940 0.9687
DPSNet-U [23] 0.0500 0.1515 0.1108 0.4661 0.1164 0.9614 0.9824 0.9880
Ours 0.0380 0.1130 0.0666 0.3710 0.0946 0.9754 0.9900 0.9947
Table 1: Comparative evaluation of our model on the SUN3D, RGBD, Scenes11, and MVS datasets. For all metrics except the inlier ratios, lower is better. We use the performance of COLMAP, DeMoN, and DeepMVS reported in [23].

4.2 Comparison with state-of-the-art

For comparisons on the DeMoN datasets (SUN3D, RGBD, Scenes11, and MVS), we choose state-of-the-art approaches of diverse kinds. We also evaluate on the MVS dataset [36], which contains outdoor scenes of buildings and is not used for training, to evaluate generalizability. The complete comparison on all metrics is presented in Table 1, and some qualitative results are shown in Fig. 4. Our method outperforms existing methods on all metrics. It also generates more accurate and smoother point clouds with fine details, even in textureless regions (e.g., bed, wall).

Figure 4: Visualizing the depths in 3D for SUN3D. Two views of the point cloud obtained from the depth prediction.
Method EPE(↓) 1-pixel error rate(↓)
GCNet 1.80 15.6
PSMNet 1.09 12.1
DPSNet 0.80 8.4
GANet-15 0.84 9.9
GANet-deep 0.78 8.7
GANet-NNet 0.77 8.0
Ours 0.69 7.0
Table 2: Comparative evaluation of our model on the Scene Flow datasets. For all metrics, lower is better.

We compare our performance against similar cost-volume-based approaches (GCNet [27], PSMNet [6], and GANet [55]), which differ in their choices of cost aggregation. Since we use the same testing protocol, we use the performance of GCNet, PSMNet, and GANet-15 as reported in [55]. We obtain the performance of GANet-deep, which uses a deeper network with more 3D convolutions, from the authors' website. Further, we append our NNet branch to the existing GANet architecture by passing the cost volume of GANet through our NNet and training this branch simultaneously with the full GANet architecture; we call this GANet-NNet. Finally, we also train DPSNet on the Scene Flow datasets to confirm that the better performance is due to normal supervision rather than better cost aggregation or a better architecture.

Dataset Method Abs Rel(↓) Abs diff(↓) Sq Rel(↓) RMSE(↓)
ScanNet DPSNet 0.1258 0.2145 0.0663 0.3145
Ours 0.1150 0.2068 0.0577 0.3009
Ours- 0.1070 0.1946 0.0508 0.2807
SUN3D DPSNet 0.1470 0.3234 0.1071 0.4269
Ours 0.1332 0.3038 0.0910 0.3994
Ours- 0.1186 0.2744 0.0753 0.3554
Table 3: Comparative evaluation of our consistency loss.

We also evaluate the performance of our consistency loss on the SUN3D and ScanNet datasets. We train DPSNet on SUN3D and on ScanNet independently, along with our method, and present the results in Table 3. We observe that our pipeline achieves significantly better performance on all metrics on the MVS, SUN3D, RGBD, Scenes11, and Scene Flow datasets. We find that joint multi-view normal and depth estimation improves performance on indoor, outdoor, real, and synthetic datasets. We further show that our consistency module significantly improves performance on top of our existing pipeline. We further evaluate the performance on planar and textureless surfaces and visualize the changes in the cost volume due to the addition of NNet.

4.3 Surface Normal Estimation

Table 4 compares our cost-volume-based surface normal estimation with existing RGB-based, depth-based, and RGB-D methods. We perform significantly better than the depth-completion-based method and similarly to the RGB-based method. The RGB-D method performs best because it uses the ground-truth depth data.

Method Mean(↓) Median(↓) 11.25°(↑) 22.5°(↑) 30°(↑)
RGB-D [52] 14.6 7.5 65.6 81.2 86.2
DC [60] 30.6 20.7 39.2 55.3 60.0
RGB [60] 31.1 17.2 37.7 58.3 67.1
Ours 23.1 18.0 31.1 61.8 73.6
Table 4: Comparison of normal estimation on ScanNet with single view normal estimation. Note that the RGB-D and depth completion (DC) based methods use ground truth depth. The performances of DC & RGB-D are from [52] and RGB from [60].

We evaluate surface normal estimation in the wild by comparing our method against the RGB-based method of [60]. We use models trained on ScanNet and test them on images from the SUN3D dataset. We present the results in Table 5 and visualize a few cases in Fig. 5. We notice that our method generalizes much better in the wild than the single-view RGB-based methods. NNet estimates normals accurately not only in regions of low texture but also in regions with high texture variance (the bed's surface). We attribute this to using two views instead of one, which reduces the learning complexity of the task and thus generalizes better.

We also observe that, irrespective of the dataset, the normal estimation loss and the validation accuracies saturate within 3 epochs, showing that the task of normal estimation from the cost volume is much easier than depth estimation.

Figure 5: Surface Normal Estimation. Test on SUN3D after training on ScanNet. The RGB-based method is from [60]
Method Mean(↓) Median(↓) 11.25°(↑) 22.5°(↑) 30°(↑)
RGB - SUN3D 31.6 25.7 17.9 45.6 57.6
Ours - SUN3D 25.1 19.8 21.8 58.3 71.3
RGB - MVS 33.3 27.8 11.8 42.4 55.1
Ours - MVS 25.9 20.0 32.0 56.2 66.2
Table 5: Generalization performance. Both models were trained on ScanNet (indoor) and tested on the SUN3D (indoor) and MVS (outdoor) datasets.

4.4 Consistency Loss

We perform a few experiments to analyze the performance gains due to our novel consistency loss. We freeze the stereo pipeline and train the UNet that takes the raw depth and normal estimates and refines the depth estimate. We train three configurations: (1) pure network-based refinement with only ground-truth supervision, (2) the simple consistency loss as a regularizer, and (3) our consistency loss as a regularizer. We analyze the performance of these configurations on the SUN3D dataset, which consists of indoor environments with many planar, textureless surfaces. As Table 6 shows, using our consistency loss as a regularizer improves the depth estimation accuracy, suggesting that depth prediction benefits from accurate normal prediction.

Method Abs Rel(↓) Abs diff(↓) Sq Rel(↓) RMSE(↓)
Raw 0.1332 0.3038 0.0910 0.3994
UNet 0.1307 0.2863 0.0878 0.3720
UNet- 0.1288 0.2980 0.0842 0.3820
UNet- 0.1186 0.2744 0.0753 0.3554
Table 6: Ablation Study of Consistency Loss on SUN3D

4.5 Visualizing the Cost Volume

Regularization. Existing stereo methods, both traditional and learning-based, perform explicit cost aggregation on the cost volume. This is because each pixel has good correspondences with only a few pixels in its neighborhood in the other view, while the cost volume contains many more candidates per pixel, namely one per slice. Further, in textureless regions there are no distinctive features, so all candidates have similar features. This induces a lot of noise in the cost volume and is responsible for false positives.

Figure 6: Cost slice visualization: The first column contains the reference image and the ground truth depth map. The first row contains the cost volume slices from DPSNet. The second row contains the same from our network. The third row contains the estimates of ground truth cost slices. This can be seen as a distribution around the ground truth depth corresponding to each slice.

We show that normal supervision during training regularizes the cost volume. Fig. 6 visualizes the probability volume slices and compares them against those of DPSNet. We consider the un-aggregated probability volume in both cases. We visualize the slices at disparities 14, 15, 16, and 17 (corresponding to depths 2.28, 2.13, 2.0, and 1.88), which encompass the wall of the building. The slices of DPSNet are very noisy and do not produce good outputs in textureless regions such as the walls and sky, or in reflective regions such as the windows.

Planar and Textureless Regions

Figure 7: Post-softmax probability distributions over disparity. Green lines illustrate the ground-truth disparities while the red lines illustrate the predicted disparities.

We also visualize the softmax distributions at a few regions in Fig. 7, choosing challenging regions that are planar, textureless, or both. (a) The chair image contains few distinctive textures, and the local patches on the chair look the same as those on the floor. But given two views, estimating the normal in regions with curvature is much easier than estimating the depth. This allows our model to perform better in these regions (red and yellow boxes). Cost-volume-based methods that estimate a probability for each candidate depth struggle in textureless regions and usually have long tails in the output distribution. Normal supervision provides additional local constraints on the cost volume and suppresses these tails. This further supports our understanding (from Section 3.2) that the correspondence probability is related to a slice's contribution to the normal map.

We quantify these observations in Table 7 by evaluating the depths obtained from our un-aggregated cost volume against those of DPSNet without cost aggregation. Normal supervision helps regularize the cost volume, improving the results both qualitatively in challenging cases and quantitatively across the entire test data.

Method Abs Rel(↓) Abs diff(↓) Sq Rel(↓) RMSE(↓)
DPSNet 0.1274 0.3388 0.1957 0.6230
Ours 0.1114 0.3276 0.1466 0.5631
Table 7: Test performance without cost aggregation on the DeMoN datasets.

Further, we quantify the performance of our methods on planar and textureless surfaces by evaluating on semantic classes of ScanNet test images. Specifically, we use the eigen13 classes [9] and report the depth estimation metrics of our methods against DPSNet on the two most frequent classes in Table 8; the remaining classes are in the appendix. Our methods perform well on all semantic categories, and the results quantitatively show the improvement on planar and textureless surfaces, which are usually found on walls and floors.

Label Method Abs Rel(↓) Abs diff(↓) Sq Rel(↓) RMSE(↓)
Wall DPSNet 0.1340 0.2968 0.0871 0.3599
Ours 0.1255 0.2835 0.0799 0.3436
Ours- 0.1173 0.2690 0.0721 0.3215
Floor DPSNet 0.1116 0.2472 0.0777 0.2973
Ours 0.1092 0.2242 0.0509 0.2642
Ours- 0.1037 0.2061 0.0474 0.2561
Table 8: Semantic class specific evaluation on ScanNet

5 Conclusion

In this paper, we proposed to apply geometric constraints between surface normal and depth at training time to improve stereo depth estimation. We jointly learn to predict the depth and the normal based on the multi-view cost volume. Moreover, we refine the depths from depth/normal pairs with an independent consistency module trained using a novel consistency loss. Experimental results show that joint learning improves both the normal and the depth predictions, and that accuracy and smoothness can be further improved by enforcing the consistency. We achieve state-of-the-art performance on MVS, SUN3D, RGBD, and Scenes11.

Appendix

Implementation details

We use 64 levels of depth/disparity when building the cost volumes. The two hyperparameters in the loss function are set to 0.7 and 3, respectively. We first train the network without the consistency module for 20 epochs with the ADAM optimizer. We then finetune the consistency module with the end-to-end pipeline for 10 more epochs. The training process takes 5 days on 4 NVIDIA GTX 1080Ti GPUs with a batch size of 12. We use a random crop size of 320 × 240 during training, which can optionally be increased in the later epochs by decreasing the batch size.

View Selection and Normal Generation

ScanNet [10] provides a depth map and camera pose for each image frame. To make it appropriate for stereo evaluation, view selection is a crucial step. Following Yao et al. [46], we calculate a score for each image pair based on the sparse points: every track visible in both views contributes according to its baseline angle, computed from the camera centers, through a piecewise Gaussian function [56] that favors a certain baseline angle.

In the experiments, the three parameters of the piecewise Gaussian are set to 5, 1, and 10, respectively. We generate ground-truth surface normal maps following the procedure of [13].
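A sketch of this pairwise scoring, assuming the MVSNet-style piecewise Gaussian and interpreting the three values above as θ0 = 5°, σ1 = 1, and σ2 = 10:

```python
import numpy as np

def baseline_angle_deg(point, center_i, center_j):
    """Angle (in degrees) subtended at a sparse 3D point by the two camera centers."""
    a = center_i - point
    b = center_j - point
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def piecewise_gaussian(theta, theta0=5.0, sigma1=1.0, sigma2=10.0):
    """Weight that favors baseline angles of roughly theta0 degrees."""
    sigma = sigma1 if theta <= theta0 else sigma2
    return np.exp(-((theta - theta0) ** 2) / (2.0 * sigma ** 2))

def pair_score(common_points, center_i, center_j):
    """Score for an image pair: sum of per-track weights over tracks visible in both views."""
    return sum(piecewise_gaussian(baseline_angle_deg(p, center_i, center_j))
               for p in common_points)
```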

Visualization of NNet slices

We justify the intuition in Section 3.2 of the main paper by visualizing the normal estimate contributed by each slice, i.e., NNet(s_i), in Figure 8. The slices in the figure clearly show that only slices with good correspondence probabilities contribute to the output of NNet.

Figure 8: Normal estimation contributions from different slices. The top two rows show the mask of the receptive field and the normal prediction contribution of two slices close to the ground-truth depth. The third row shows the sum of the outputs of NNet on all other slices.

ADMM-based training with the consistency loss

We use the Alternating Direction Method of Multipliers (ADMM) for training the consistency module. Given the ground-truth supervision over the depth and normal predictions in addition to the enforced consistency loss, we model this as an ADMM problem to attain stable convergence given the highly non-convex nature of the problem. For simplicity, we describe only the depth map variable and the depth map constraints; the normal map variable and its constraints can be seen as concatenated to the depth variable without loss of generality. The primal objective is simply the consistency loss under the constraint that the depth stays within a proximity of the ground truth. This can be reformulated by introducing an auxiliary variable: the new objective adds an indicator function that penalizes violations of the proximity constraint, subject to the constraint that the auxiliary variable equals the depth variable. Converting to the ADMM scaled dual form by introducing a Lagrange multiplier, the general ADMM iteration can be written as,

(10)

The solution to the second minimization step, over the auxiliary variable, can be obtained by clamping it to the constraint set around the ground truth. The first minimization step, over the depth variable, requires minimizing the consistency loss, which in turn requires minimizing over the learnable parameters of the consistency module. This step is therefore approximated using a few iterations of stochastic gradient descent; in our implementation we use 1 epoch.
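An illustrative ADMM-style training loop, reflecting our reading of the procedure above (the module signature, the penalty weight, and the per-sample bookkeeping are assumptions rather than the authors' exact implementation):

```python
import torch

def admm_train_consistency(module, loader, consistency_loss, clamp_to_gt,
                           admm_iters=10, lr=1e-5, rho=1.0):
    """Illustrative ADMM-style training of the consistency module.

    The x-update (network weights) is approximated by one epoch of SGD, the z-update
    clamps to the ground-truth proximity set, and u is the scaled dual variable.
    """
    opt = torch.optim.Adam(module.parameters(), lr=lr)
    z, u = {}, {}                                  # per-sample auxiliary and dual variables

    for _ in range(admm_iters):
        # x-update: one epoch of SGD on consistency loss + augmented-Lagrangian penalty
        for idx, (depth_raw, normal_raw, gt_depth, intrinsics) in enumerate(loader):
            x = module(depth_raw, normal_raw)
            z_i = z.get(idx, x.detach())
            u_i = u.get(idx, torch.zeros_like(x))
            loss = consistency_loss(x, normal_raw, intrinsics) \
                   + 0.5 * rho * ((x - z_i + u_i) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

        # z- and u-updates, performed with the network frozen
        with torch.no_grad():
            for idx, (depth_raw, normal_raw, gt_depth, intrinsics) in enumerate(loader):
                x = module(depth_raw, normal_raw)
                u_i = u.get(idx, torch.zeros_like(x))
                z[idx] = clamp_to_gt(x + u_i, gt_depth)   # project onto the constraint set
                u[idx] = u_i + x - z[idx]                 # scaled dual ascent
```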

Semantic class specific evaluation on ScanNet

We quantify the performance of our methods on planar and textureless surfaces by evaluating on semantic classes of ScanNet test images. Specifically, we use the eigen13 classes [9] and report the depth estimation metrics of our methods against DPSNet. We present the other frequently occurring classes, not shown in the main paper, in Table 9. Our methods perform well on all semantic categories and quantitatively show the improvement on planar and textureless surfaces, which are usually found on walls, floors, and ceilings.

Label Method Abs Rel(↓) Abs diff(↓) Sq Rel(↓) RMSE(↓)
Bed DPSNet 0.1291 0.1572 0.050 0.1986
Ours 0.1142 0.1449 0.0405 0.1830
Ours- 0.1049 0.1347 0.0345 0.1665
Books DPSNet 0.1087 0.2281 0.0733 0.2527
Ours 0.0970 0.2176 0.0650 0.2404
Ours- 0.0942 0.2139 0.0628 0.2334
Ceiling DPSNet 0.1693 0.3429 0.1029 0.3895
Ours 0.1496 0.3189 0.0840 0.3528
Ours- 0.1360 0.2244 0.0643 0.2900
Chair DPSNet 0.1602 0.2469 0.0836 0.3187
Ours 0.1417 0.2351 0.0697 0.3050
Ours- 0.1360 0.2244 0.0643 0.2900
Floor DPSNet 0.1116 0.2472 0.0777 0.2973
Ours 0.1092 0.2242 0.0509 0.2642
Ours- 0.1037 0.2061 0.0474 0.2561
Objects DPSNet 0.1305 0.2375 0.0785 0.2934
Ours 0.1165 0.2237 0.0661 0.2771
Ours- 0.1095 0.2113 0.0589 0.2587
Picture DPSNet 0.1160 0.2991 0.0949 0.3249
Ours 0.1110 0.2913 0.0912 0.3167
Ours- 0.1017 0.2724 0.0808 0.2923
Table DPSNet 0.1374 0.2211 0.0745 0.2808
Ours 0.1238 0.2116 0.0646 0.2694
Ours- 0.1164 0.2014 0.0590 0.2545
Wall DPSNet 0.1340 0.2968 0.0871 0.3599
Ours 0.1255 0.2835 0.0799 0.3436
Ours- 0.1173 0.2690 0.0721 0.3215
Window DPSNet 0.1559 0.3836 0.1353 0.4384
Ours 0.1468 0.3605 0.1111 0.4163
Ours- 0.1373 0.3385 0.1079 0.3848
Table 9: Semantic class specific evaluation on ScanNet. “DPSNet” corresponds to the predictions from DPSNet. “Ours” corresponds to our predictions before refinement by the consistency module. “Ours-” refers to our final predictions after refinement.

KITTI 2015 Benchmark

We also evaluate our method on the KITTI 2015 stereo benchmark. We pre-train our network on the Scene Flow datasets and finetune it on the KITTI 2015 train data. We also pre-train GANet-NNet (defined in Section 4.2 of the main paper) on the Scene Flow datasets. For GANet, we use the pretrained models provided by the authors. We first test these pretrained models on the KITTI train data without training on it and report the EPE and 3-pixel error rate in Table 10. We then train on the KITTI 2015 train data and provide the benchmark results in Table 11.

We observe that the pretrained models generalize better than other methods on KITTI 2015. We obtain a significant improvement over DPSNet on the KITTI 2015 test set by adding normal supervision. The KITTI 2015 dataset contains only 200 training images with sparse ground truths, and the sparsity increases toward the background. Our ground-truth normals are generated using a least squares optimization on the ground-truth depths, and sparsity in the ground-truth depths makes generating very accurate ground-truth normals difficult. We see this as a significant problem that affects our performance on KITTI 2015. Despite this, GANet-NNet performs better than GANet on the foreground regions.

Method EPE(↓) 3-pixel error rate(↓)
GANet-deep 1.66 10.5
GANet-NNet 1.64 9.7
Ours 1.64 8.2
Table 10: Evaluation of Scene Flow pretrained models on KITTI 2015. For all metrics, lower is better.
Method fg-noc(↓) both-noc(↓) fg-all(↓) both-all(↓)
DPSNet 6.08 4.00 7.58 4.77
GANet-deep 3.11 1.63 3.46 1.81
GANet-NNet 3.04 1.70 3.34 1.91
Ours 4.06 2.08 4.41 2.27
Table 11: Comparative evaluation of our model on the KITTI 2015 dataset. For all metrics, lower is better. fg: foreground, both: foreground and background, noc: non-occluded pixels, all: all pixels.

More Qualitative Results

We present more qualitative results on depth map estimation in Figure 9. The examples depict various situations, such as planar surfaces, reflective surfaces, and planar-textureless surfaces, as well as the overall quality of the prediction. The red boxes on the images highlight these regions. Our method produces more accurate depth maps than the previous state of the art.

Figure 9: Qualitative comparison of the predicted depth maps. GT represents Ground Truth Depth.

References

  • [1] A. Bansal, B. Russell, and A. Gupta (2016) Marr revisited: 2d-3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5965–5974. Cited by: §2.
  • [2] H. Bay, T. Tuytelaars, and L. Van Gool (2006) Surf: speeded up robust features. In European conference on computer vision, pp. 404–417. Cited by: §2.
  • [3] A. Boulch and R. Marlet (2016) Deep learning for robust normal estimation in unstructured point clouds. In Computer Graphics Forum, Vol. 35, pp. 281–290. Cited by: §2.
  • [4] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3d model repository. CoRR abs/1512.03012. External Links: Link, 1512.03012 Cited by: §4.1.
  • [5] J. Chang and Y. Chen (2018-06) Pyramid stereo matching network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [6] J. Chang and Y. Chen (2018) Pyramid stereo matching network. CoRR abs/1803.08669. External Links: Link, 1803.08669 Cited by: §4.2.
  • [7] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.1.
  • [8] I. Cherabier, J. L. Schonberger, M. R. Oswald, M. Pollefeys, and A. Geiger (2018) Learning priors for semantic 3d reconstruction. In Proceedings of the European conference on computer vision (ECCV), pp. 314–330. Cited by: §2.
  • [9] C. Couprie, C. Farabet, L. Najman, and Y. LeCun (2013) Indoor semantic segmentation using depth information. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Conference Track Proceedings, External Links: Link Cited by: §4.5, Semantic class specific evaluation on ScanNet.
  • [10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2432–2443. External Links: Link, Document Cited by: §4.1, View Selection and Normal Generation.
  • [11] T. Dharmasiri, A. Spek, and T. Drummond (2017) Joint prediction of depths, normals and surface curvature from rgb images using cnns. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1505–1512. Cited by: §2.
  • [12] D. Eigen and R. Fergus (2015-12) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §2.
  • [13] D. F. Fouhey, A. Gupta, and M. Hebert (2013) Data-driven 3d primitives for single image understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3392–3399. Cited by: §4.1, View Selection and Normal Generation.
  • [14] D. F. Fouhey, A. Gupta, and M. Hebert (2014) Unfolding an indoor origami world. In European Conference on Computer Vision, pp. 687–702. Cited by: §2.
  • [15] Y. Furukawa, C. Hernández, et al. (2015) Multi-view stereo: a tutorial. Foundations and Trends® in Computer Graphics and Vision 9 (1-2), pp. 1–148. Cited by: §2.
  • [16] S. Galliani, K. Lasinger, and K. Schindler (2016) Gipuma: massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V 25, pp. 361–369. Cited by: §2.
  • [17] D. Gallup, J. Frahm, P. Mordohai, Q. Yang, and M. Pollefeys (2007) Real-time plane-sweeping stereo with multiple sweeping directions. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA, External Links: Link, Document Cited by: §2.
  • [18] C. Hane, L. Ladicky, and M. Pollefeys (2015) Direction matters: depth estimation with a surface normal classifier. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 381–389. Cited by: §2.
  • [19] W. Hartmann, S. Galliani, M. Havlena, L. V. Gool, and K. Schindler (2017) Learned multi-patch similarity. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1595–1603. Cited by: §2.
  • [20] H. Hirschmuller (2007) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence 30 (2), pp. 328–341. Cited by: §2.
  • [21] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) DeepMVS: learning multi-view stereopsis. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2821–2830. External Links: Link, Document Cited by: Table 1.
  • [22] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) DeepMVS: learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1.
  • [23] S. Im, H. Jeon, S. Lin, and I. S. Kweon (2019) DPSNet: end-to-end deep plane sweep stereo. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §1, §3.1, §3.1, Table 1.
  • [24] S. Im, H. Jeon, S. Lin, and I. S. Kweon (2019) DPSNet: end-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538. Cited by: §2.
  • [25] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang (2017) SurfaceNet: an end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2307–2315. Cited by: §2.
  • [26] M. Kazhdan, M. Bolitho, and H. Hoppe (2006) Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, Vol. 7. Cited by: §2.
  • [27] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. CoRR abs/1703.04309. External Links: Link, 1703.04309 Cited by: §3.1, §4.2.
  • [28] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §2.
  • [29] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi (2018) Stereonet: guided hierarchical refinement for real-time edge-aware depth prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 573–590. Cited by: §2.
  • [30] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang (2018-06) Learning for disparity estimation through feature constancy. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [31] D. G. Lowe et al. Object recognition from local scale-invariant features. Cited by: §2.
  • [32] K. Luo, T. Guan, L. Ju, H. Huang, and Y. Luo (2019-10) P-mvsnet: learning patch-wise matching confidence aggregation for multi-view stereo. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [33] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Note: arXiv:1512.02134 External Links: Link Cited by: §4.1.
  • [34] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia (2018) Geonet: geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 283–291. Cited by: §2.
  • [35] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. External Links: Link, 1505.04597 Cited by: §3.3.
  • [36] J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4104–4113. External Links: Link, Document Cited by: §4.2, Table 1.
  • [37] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, Vilamoura, Algarve, Portugal, October 7-12, 2012, pp. 573–580. External Links: Link, Document Cited by: §4.1.
  • [38] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother (2006) A comparative study of energy minimization methods for markov random fields. In European conference on computer vision, pp. 16–29. Cited by: §2.
  • [39] E. Tola, V. Lepetit, and P. Fua (2009) Daisy: an efficient dense descriptor applied to wide-baseline stereo. IEEE transactions on pattern analysis and machine intelligence 32 (5), pp. 815–830. Cited by: §2.
  • [40] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) DeMoN: depth and motion network for learning monocular stereo. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5622–5631. External Links: Link, Document Cited by: Table 1.
  • [41] X. Wang, D. Fouhey, and A. Gupta (2015) Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–547. Cited by: §2.
  • [42] J. Xiao, A. Owens, and A. Torralba (2013) SUN3D: A database of big spaces reconstructed using sfm and object labels. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 1625–1632. External Links: Link, Document Cited by: §4.1.
  • [43] G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia (2018) Segstereo: exploiting semantic information for disparity estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 636–651. Cited by: §2.
  • [44] Q. Yang, L. Wang, R. Yang, H. Stewénius, and D. Nistér (2008) Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (3), pp. 492–504. Cited by: §2.
  • [45] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) MVSNet: depth inference for unstructured multi-view stereo. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pp. 785–801. External Links: Link, Document Cited by: §3.1.
  • [46] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) MVSNet: depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV). Cited by: §1, §2, §4.1, View Selection and Normal Generation.
  • [47] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.
  • [48] W. Yin, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5684–5693. Cited by: §2.
  • [49] Z. Yin and J. Shi (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, Cited by: §1.
  • [50] J. Yoo and T. H. Han (2009) Fast normalized cross-correlation. Circuits, systems and signal processing 28 (6), pp. 819. Cited by: §2.
  • [51] J. Zbontar, Y. LeCun, et al. (2016) Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17 (1-32), pp. 2. Cited by: §2.
  • [52] J. Zeng, Y. Tong, Y. Huang, Q. Yan, W. Sun, J. Chen, and Y. Wang (2019) Deep surface normal estimation with hierarchical RGB-D fusion. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 6153–6162. External Links: Link Cited by: §4.1, Table 4.
  • [53] J. Zeng, Y. Tong, Y. Huang, Q. Yan, W. Sun, J. Chen, and Y. Wang (2019) Deep surface normal estimation with hierarchical rgb-d fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6153–6162. Cited by: §2.
  • [54] H. Zhan, C. S. Weerasekera, R. Garg, and I. Reid (2019) Self-supervised learning for single view depth and surface normal estimation. arXiv preprint arXiv:1903.00112. Cited by: §2.
  • [55] F. Zhang, V. A. Prisacariu, R. Yang, and P. H. S. Torr (2019) GA-net: guided aggregation net for end-to-end stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 185–194. External Links: Link Cited by: §4.2.
  • [56] R. Zhang, S. Li, T. Fang, S. Zhu, and L. Quan (2015) Joint camera clustering and surface segmentation for large-scale multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2084–2092. Cited by: View Selection and Normal Generation.
  • [57] S. Zhang, W. Xie, G. Zhang, H. Bao, and M. Kaess (2017) Robust stereo matching with surface normal prediction. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2540–2547. Cited by: §2.
  • [58] Y. Zhang and T. A. Funkhouser (2018) Deep depth completion of a single RGB-D image. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 175–185. External Links: Link, Document Cited by: §3.3.
  • [59] Y. Zhang and T. Funkhouser (2018) Deep depth completion of a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 175–185. Cited by: §2.
  • [60] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. A. Funkhouser (2017) Physically-based rendering for indoor scene understanding using convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5057–5065. External Links: Link, Document Cited by: §1, Figure 5, §4.3, Table 4.