Joint Graph-based Depth Refinement and Normal Estimation

12/03/2019, by Mattia Rossi et al. (EPFL, Sony)

Depth estimation is an essential component in understanding the 3D geometry of a scene, with numerous applications in urban and indoor settings. These scenes are characterized by a prevalence of human-made structures, which, in most cases, are either inherently piece-wise planar or can be approximated as such. In these settings, we devise a novel depth refinement framework that aims at recovering the underlying piece-wise planarity of the inverse depth map. We formulate this task as an optimization problem involving a data fidelity term that minimizes the distance to the input inverse depth map, as well as a regularization that enforces a piece-wise planar solution. As for the regularization term, we model the inverse depth map as a weighted graph between pixels. The proposed regularization is designed to estimate a plane automatically at each pixel, without any need for an a priori estimation of the scene planes, and at the same time it encourages similar pixels to be assigned to the same plane. The resulting optimization problem is efficiently solved with the ADAM algorithm. Experiments show that our method leads to a significant improvement in depth refinement, both visually and numerically, with respect to state-of-the-art algorithms on the Middlebury, KITTI and ETH3D multi-view stereo datasets.


I Introduction

The accurate recovery of depth information in a scene represents a fundamental step for many applications, ranging from 3D imaging to the enhancement of machine vision systems and autonomous navigation. Typically, dense depth estimation is implemented either using active devices like Time-of-Flight cameras, or via dense stereo matching methods [hirschmuller_stereo_2008, zabih_non_param_1994, bleyer_patchmatch_2011, Galliani_massively_2015, schoenberger_pixelwise_2016] that rely on two or more images of the same scene to compute its geometry. Active methods suffer from noisy measurements, possibly caused by light interference or multiple reflections; therefore, they can benefit from a post-processing step to refine the depth map. Similarly, dense stereo matching methods have limited performance in untextured areas, where the matching becomes ambiguous, or in the presence of occlusions. Therefore, a stereo matching pipeline typically includes a refinement step to fill the missing depth map areas and remove the noise.

In general, the refinement step is guided by the image associated with the measured or estimated depth map. The depth refinement literature mostly focuses on enforcing some kind of first order smoothness among the depth map pixels, possibly avoiding smoothing across the edges of the guide image that may correspond to object boundaries [barron_fast_2016, tosi_leveraging_2019, yao_mvsnet_2018]. Although depth maps are typically piece-wise smooth, first order smoothness is a very general assumption, which does not exploit the geometrical simplicity of most 3D scenes. Based on the observation that most human-made environments are characterized by planar surfaces, some authors propose to enforce second order smoothness by computing a set of possible planar surfaces a priori and assigning each depth map pixel to one of them [park_asplanar_2019]. Unfortunately, this refinement strategy requires selecting a finite number of plane candidates a priori, which may not be optimal in practice and may lead to reduced performance.

In this article we propose a depth map refinement framework which promotes a piece-wise planar arrangement of scenes without any a priori knowledge of the planar surfaces in the scene. We cast the depth refinement problem into the optimization of a cost function involving a data fidelity term and a regularization. The former penalizes solutions that deviate from the input depth map in areas where the depth is considered reliable, whereas the latter promotes depth maps corresponding to piece-wise planar surfaces. In particular, our regularization models the depth map as a weighted graph whose nodes are the pixels of a guide image. The edge weight between two pixels captures the probability that their corresponding points in the 3D scene belong to the same planar surface.

Our contribution is twofold. On the one hand, we propose a graph-based regularization for depth refinement that explicitly promotes the reconstruction of piece-wise planar scenes. Moreover, thanks to its underlying graph, our regularization is flexible enough to handle scenes that are not fully piece-wise planar as well. On the other hand, our regularization is defined in order to estimate the normal map of the scene jointly with the refined depth map, which is useful in the context of 3D reconstruction, specifically in the depth map fusion step.

The proposed depth refinement and normal estimation framework is potentially very useful in the context of large scale 3D reconstruction, where the large number of images to be processed requires fast dense stereo matching methods, which in turn call for a later depth refinement step [Kuhn_2020, schoenberger_pixelwise_2016]. It is also relevant in the fusion of multiple depth maps into a single point cloud, where the estimated normals can be used to filter possible outliers [Kuhn_2020, schoenberger_pixelwise_2016]. We test our framework extensively and show that it is effective in both refining the input depth map and estimating the corresponding normal map.

The article is organized as follows. Section II provides an overview of the depth map refinement literature. Section III motivates the novel regularization term and derives the related geometry. Section IV presents our problem formulation and Section V presents our full algorithm. In Section VI we carry out extensive experiments to test the effectiveness of the proposed depth refinement and normal estimation approach. Section VII concludes the paper.

II Related work

Depth refinement methods fall mainly into three classes: local methods, global methods and learning-based methods.

Local methods are characterized by a greedy approach. Tosi et al. [tosi_leveraging_2019] adopt a two-step strategy. First, the input disparity map is used to compute a binary confidence mask that classifies each pixel as reliable or not. Then, the disparity at the pixels classified as reliable is kept unchanged and used to infer the disparity at the non reliable ones, using an interpolation heuristic. In particular, for each non reliable pixel, a set of anchor pixels with a reliable disparity is selected and the pixel disparity is estimated as a weighted average of the anchor disparities. Besides its low computational requirements, the method in [tosi_leveraging_2019] suffers from two major drawbacks. On the one hand, pixels classified as reliable are left unchanged: this does not permit correcting pixels misclassified as reliable, which may bias the refinement of the remaining pixels. On the other hand, the method in [tosi_leveraging_2019], and local methods in general, cannot take full advantage of the reliable parts of the disparity map, due to their local perspective.

Global methods rely on an optimization procedure to refine all the pixels of the input disparity map jointly. Barron and Poole [barron_fast_2016] propose the Fast Bilateral Solver, a framework which permits casting arbitrary image-related ill-posed problems into a global optimization formulation, whose prior resembles the popular bilateral filter [tomasi_bilateral_1998]. In [tosi_leveraging_2019] the Fast Bilateral Solver has been shown to be effective in the disparity refinement task, but its general purpose nature prevents it from competing with specialized methods, even local ones like [tosi_leveraging_2019]. The disparity refinement framework proposed by Park et al. [park_asplanar_2019] is global as well, and can be broken down into four steps. First, the input reference image is partitioned into super-pixels and a local plane is estimated for each one of them using RANSAC. Second, super-pixels are progressively merged into macro super-pixels to cover larger areas of the scene and a new global plane is estimated for each of them. Then, a Markov Random Field (MRF) is defined over the set of super-pixels and each one is assigned to one of four different classes: the class associated with the local plane of the super-pixel, the class associated with the global plane of the macro super-pixel to which the super-pixel belongs, the class of pixels not belonging to any planar surface, or the class of outliers. The MRF employs a prior that enforces connected super-pixels to belong to the same class, thus promoting a global consistency of the disparity map. Finally, the parameters of the plane associated to each super-pixel are slightly perturbed, again within an MRF model, to allow for a finer disparity refinement. This method is the closest to ours in flavour. However, the a priori detection of a finite number of candidate planes for the whole scene biases the refinement toward a set of plane hypotheses that may either not be correct, as they are estimated on the input noisy and possibly incomplete disparity map, or not be rich enough to cover the full geometry of the scene.

Finally, recent learning-based methods typically rely on a deep neural network which, fed with the noisy or incomplete disparity map, outputs a refined version of it [gidaris_detect_2016, knobelreiter_learned_2019]. In [gidaris_detect_2016] the task is split into three sub-tasks, each one addressed by a different network and finally trained end-to-end as a single one: detection of the non reliable pixels, gross refinement of the disparity map, and fine refinement. Instead, Knobelreiter and Pock [knobelreiter_learned_2019] revisit the work of Cherabier et al. [cherabier_learning_2018] in the context of disparity refinement. First, the disparity refinement task is cast into the minimization of a cost function, hence a global optimization, whose minimizer is the desired refined disparity map. However, the cost function is partially parametrized, rather than fully handcrafted. Then, the cost function solver is unrolled for a fixed number of iterations, thus obtaining a network structure, and the parametrized cost function can be learned. Once the network parameters are learned, the disparity refinement requires just a network evaluation. Both the methods in [gidaris_detect_2016] and [knobelreiter_learned_2019] permit a fast refinement of the input disparity map. However, due to their learning-based nature, they can easily fall short in scenarios that differ from the ones employed at training time, as shown for the method in [knobelreiter_learned_2019], which performs remarkably well on the training set of the Middlebury benchmark [scharstein_high_2014], but quite poorly on the test set of the same dataset.

Our graph-based depth refinement framework, instead, does not rely on any training procedure. It adopts a global approach which permits jointly computing a pair of consistent depth and normal maps. Moreover, it does not need any a priori knowledge of the possible planar surfaces in the scene, as it automatically assigns a plane to each pixel based on its neighbors in the underlying graph. Finally, the proposed framework does not call for a separate handling of pixels belonging to planar surfaces and pixels that do not, again thanks to the underlying graph.

III Depth map model

In this section we investigate the relationship between a plane in the 3D space and its 2D depth map. In particular, we show that, when a plane in the 3D space is imaged by a camera, the corresponding 2D inverse depth map is described by a plane as well, thus motivating a piece-wise planar model for the inverse depth map of scenes where planar structures are prevalent.

Let us consider a plane in the 3D scene in front of a pinhole camera. The plane can be described uniquely by a pair (P^0, n), where P^0 = (X^0, Y^0, Z^0)^T is a point of the plane and n = (n_X, n_Y, n_Z)^T, with \|n\|_2 = 1, is a vector defining the plane orientation, referred to as the plane normal. Thus, for every point P = (X, Y, Z)^T on the plane,

    n^T (P - P^0) = 0,    (1)

which can be rewritten as follows:

    n_X X + n_Y Y + n_Z Z = \beta,    (2)

where \beta = n^T P^0. Eq. (2) is expressed in the left-handed coordinate system of the pinhole camera, whose Z axis points outside the camera and is aligned with the camera optical axis. (We assume that the camera image plane is parallel to the X, Y plane of the 3D coordinate system.) In the pinhole model, where the pixel coordinate origin is at the top left corner of the image, the 3D point P is projected into the camera image plane at the image coordinates p = (x, y)^T:

    x = f_x (X / Z) + x_c,    y = f_y (Y / Z) + y_c,    (3)

where (x_c, y_c) are the coordinates of the camera center of projection and f_x, f_y are the horizontal and vertical focal lengths, respectively. Solving for X and Y in Eq. (3), the plane equation in (2) can be expressed as a function of the image coordinates (x, y) and the corresponding depth Z:

    (n_X / f_x) (x - x_c) Z + (n_Y / f_y) (y - y_c) Z + n_Z Z = \beta.    (4)

Similarly, in image coordinates, P^0 reads as follows:

    (n_X / f_x) (x^0 - x_c) Z^0 + (n_Y / f_y) (y^0 - y_c) Z^0 + n_Z Z^0 = \beta,    (5)

where p^0 = (x^0, y^0)^T is the projection of the point P^0 into the camera image plane.

The inverse depth d(p) = 1/Z is associated to the image coordinates p; in particular, d(p^0) = 1/Z^0. Now, let us introduce the vector \tilde{n} = (\tilde{n}_x, \tilde{n}_y)^T defined as follows:

    \tilde{n}_x = n_X / (\beta f_x),    \tilde{n}_y = n_Y / (\beta f_y).    (6)

Eq. (4) allows us to deduce that

    d(p) = d(p^0) + \tilde{n}^T (p - p^0).    (7)

The proof is provided in the supplementary material.

Eq. (7) can be interpreted as a first order Taylor expansion of the inverse depth map at the image coordinate p^0, such that \tilde{n} plays the role of the inverse depth map gradient at p^0. In addition, Eq. (7) shows that the inverse depth of every point p on the plane is described by a plane as well: in particular, this plane passes through the point (x^0, y^0, d(p^0)) and has a normal proportional to (\tilde{n}_x, \tilde{n}_y, -1)^T.

We showed that the plane is represented either by the pair (P^0, n) in the scene domain or by the pair (d(p^0), \tilde{n}) in the camera domain. Eq. (6) permits moving from the scene domain to the camera one by computing \tilde{n} when (P^0, n) is given. To recover n when (d(p^0), \tilde{n}) is given instead, it is sufficient to solve the following non linear system in the variables n_X, n_Y, n_Z in closed form:

    n_X = (n^T P^0) f_x \tilde{n}_x,    (8a)
    n_Y = (n^T P^0) f_y \tilde{n}_y,    (8b)
    n_Z = (n^T P^0) (d(p^0) - \tilde{n}^T (p^0 - p_c)),    (8c)

together with the normalization \|n\|_2 = 1, where p_c = (x_c, y_c)^T and P^0 is the back projection of p^0 at depth Z^0 = 1/d(p^0). The solution is provided in the supplementary material. In what follows, we present our optimization problem to jointly estimate the normal map and the refined inverse depth map.
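To make the two parametrizations concrete, the following is a minimal NumPy sketch of the mapping between the scene-domain pair (P^0, n) and the camera-domain pair (d(p^0), \tilde{n}), based on our reconstruction of Eqs. (3), (5), (6) and (8) above; the intrinsics fx, fy, the principal point (xc, yc) and the function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def plane_to_image(P0, n, fx, fy, xc, yc):
    """Scene-domain plane (P0, n) -> image-domain pair (p0, d0, n_tilde),
    following Eqs. (2), (3), (5) and (6)."""
    X0, Y0, Z0 = P0
    beta = float(np.dot(n, P0))                    # beta = n^T P0, Eq. (2)
    p0 = np.array([fx * X0 / Z0 + xc, fy * Y0 / Z0 + yc])  # Eq. (3)
    d0 = 1.0 / Z0                                  # inverse depth at p0
    n_tilde = np.array([n[0] / (beta * fx), n[1] / (beta * fy)])  # Eq. (6)
    return p0, d0, n_tilde

def image_to_plane(p0, d0, n_tilde, fx, fy, xc, yc):
    """Image-domain pair (p0, d0, n_tilde) -> (P0, n): closed-form
    solution of the system in Eq. (8)."""
    Z0 = 1.0 / d0
    P0 = np.array([(p0[0] - xc) * Z0 / fx, (p0[1] - yc) * Z0 / fy, Z0])
    offs = p0 - np.array([xc, yc])
    # n is proportional to v, see Eqs. (8a)-(8c); the unit-norm
    # constraint then fixes the proportionality factor beta.
    v = np.array([n_tilde[0] * fx, n_tilde[1] * fy, d0 - n_tilde @ offs])
    n = v / np.linalg.norm(v)
    return P0, n
```

A round trip of plane_to_image followed by image_to_plane returns the original pair, up to the sign ambiguity of the normal.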

IV Depth map refinement problem

Given an image I, we are interested in recovering the corresponding depth map when only a noisy and possibly incomplete estimate is available. We assume that the input inverse depth map \tilde{d} is provided together with a confidence mask M with entries in [0, 1]. In particular, the confidence map is such that M(p) = 0 when the entry \tilde{d}(p) is considered completely inaccurate, while M(p) = 1 when \tilde{d}(p) is considered highly accurate. (A wide variety of algorithms addressing pixel-wise confidence prediction exists in the literature, either based on hand-crafted features or learning-based [poggi_quantitative_2017]; in practice, even the simple stereo reprojection error could be adopted [knobelreiter_learned_2019].) In the following, we focus on estimating the refined inverse depth map d and the normal map \tilde{n}, given the initial estimate \tilde{d} and the mask M. We consider the following optimization problem:

    \min_{d, \tilde{n}} F(d) + \lambda R(d, \tilde{n}),    (9)

where F is a data fidelity term, R is a regularization term promoting piece-wise planar functions, and \lambda is a scalar weight factor. The refined depth map is eventually computed as 1/d, pixel-wise, while the 3D normal map is obtained from \tilde{n} via the closed form solution of the system in Eq. (8).

In more detail, the data fidelity term F enforces consistency between the estimate d of the inverse depth map and the input inverse depth map \tilde{d}. We adopt a data term of the following form:

    F(d) = \sum_{p} M(p) (d(p) - \tilde{d}(p))^2,    (10)

which enforces that the estimated inverse depth map d is close to \tilde{d} at those pixels where the latter is considered accurate, i.e., where M(p) tends to one. (The quality of the confidence map can affect the quality of the refined depth map. However, in the case of missing confidence, i.e., M constant, our formulation in Eq. (9) still promotes piece-wise planar scenes.)

Then, the regularization term R enforces the inverse depth map to be piece-wise planar, according to the model developed in the previous section. In particular, we choose to model the inverse depth map as a weighted graph, where each pixel represents a node, and where the weight of the edge between two pixels can be interpreted as the likelihood that the corresponding two points in the 3D scene belong to the same plane. Namely, if the image looks locally similar at two different pixels, the probability is high that these pixels belong to the same physical object, hence to the same plane. The regularization term parametrizes the inverse depth at each pixel with a different plane, but it enforces strongly connected pixels in the graph, i.e., those pixels connected by an edge with a high weight, to share the same plane parametrization. Specifically, our regularization term encompasses two terms balanced by a scalar weight \gamma and reads as follows:

    R(d, \tilde{n}) = \sum_{p} ( \sum_{q \in N(p)} w(p, q)^2 (d(q) - d(p) - \tilde{n}(p)^T (q - p))^2 )^{1/2}    (11)
                      + \gamma \sum_{p} ( \sum_{q \in N(p)} w(p, q)^2 \|\tilde{n}(q) - \tilde{n}(p)\|_2^2 )^{1/2},    (12)

where N(p) describes the direct neighbours of the pixel p in the graph and w(p, q) is the weight associated to the edge between the pixel p and its neighbour pixel q.

The first term of the regularization in Eq. (11) enforces the following constraint between the pixel p and its neighboring pixel q, for every q \in N(p):

    d(q) \approx d(p) + \tilde{n}(p)^T (q - p),    (13)

which requires the inverse depth map in the neighbourhood of the pixel p to be approximated by the plane whose orientation is given by the vector \tilde{n}(p). This constraint recalls Eq. (7) and it is weighted by w(p, q), the likelihood that p and q are the projections of two points belonging to the same plane in the 3D scene. However, using only Eq. (11) does not guarantee that the planes fitted at p and q are the same (e.g., that \tilde{n}(p) and \tilde{n}(q) are parallel). Hence, the second term of the regularization in Eq. (12) enforces that the two planes fitted at p and q, with orientations \tilde{n}(p) and \tilde{n}(q), respectively, are consistent with each other when p and q are considered likely to belong to the same 3D plane.

We conclude by observing that Eq. (12) can be interpreted as a generalization of the well-known anisotropic Total Variation (TV) regularization [condat_discrete_2017], typically referred to as Non Local Total Variation (NLTV) [gilboa_nonlocal_2009] in general graph settings. In fact, the quantity \|\tilde{n}(q) - \tilde{n}(p)\|_2 is driven by the magnitude of the partial derivative of \tilde{n} at the node p in the direction of the node q [shuman_emerging_2013], so that Eq. (12) enforces a piece-wise constant field \tilde{n} on the graph, which in turn enforces the inverse depth map to be piece-wise planar. This corresponds to the depth map model of Section III.
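Before moving to the algorithm, it may help to see the objective of Eq. (9) spelled out on an explicit edge-list graph. The following PyTorch sketch is our own illustration, not the authors' implementation: the names (src, dst, w for the edges, conf for the mask M) are assumptions, and the small epsilon inside the square roots is a common smoothing device for gradient-based solvers rather than part of the formulation.

```python
import torch

def cost(d, n_tilde, d_in, conf, p, src, dst, w, lam, gamma):
    """Objective of Eq. (9): data term of Eq. (10) plus the regularizer
    of Eqs. (11)-(12), evaluated on an edge-list graph.
    d: (N,) inverse depth, n_tilde: (N, 2) per-pixel plane field,
    p: (N, 2) pixel coordinates, conf: (N,) confidence mask M,
    src, dst: (E,) edge endpoints, w: (E,) edge weights."""
    # Data fidelity, Eq. (10): stay close to the input where reliable.
    fid = torch.sum(conf * (d - d_in) ** 2)
    # Plane-fitting residual of Eq. (13), one value per edge.
    r = d[dst] - d[src] - torch.sum(n_tilde[src] * (p[dst] - p[src]), dim=1)
    # Group the squared, weighted residuals node by node (mixed norms).
    sq1 = torch.zeros_like(d).index_add_(0, src, (w * r) ** 2)        # Eq. (11)
    dn = torch.sum((n_tilde[dst] - n_tilde[src]) ** 2, dim=1)
    sq2 = torch.zeros_like(d).index_add_(0, src, w ** 2 * dn)         # Eq. (12)
    eps = 1e-12  # keeps the square roots differentiable at zero
    reg = torch.sum(torch.sqrt(sq1 + eps)) + gamma * torch.sum(torch.sqrt(sq2 + eps))
    return fid + lam * reg
```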

V Depth refinement algorithm

In this section we present the structure of the graph underneath the regularizer in Eqs. (11) and (12). Then, we detail the optimization algorithm adopted to find the solution of the joint depth refinement and normal estimation problem presented in Eq. (9).

V-A Graph construction

We assume that areas of the image sharing the same texture correspond to the same object and, likewise, to the same planar surface in the 3D scene. Based on this assumption, we associate a weight w(p, q) to the graph edge (p, q), which quantifies our confidence about the two pixels p and q belonging to the same object. Formally, first we define a search window centered at the pixel p. Then, for each pixel q in the window we compute the following weight:

    w(p, q) = \exp( -\|T(p) - T(q)\|_F^2 / (2 \sigma_{int}^2) ) \exp( -\|p - q\|_2^2 / (2 \sigma_{spa}^2) ),    (14)

where T(p) is an image patch centered at the pixel p, \|.\|_F denotes the Frobenius norm, and \sigma_{int}, \sigma_{spa} are tunable parameters. The first exponential in Eq. (14) has a high value, hence a high likelihood, when the image pixels in the two patches centered at p and q are similar; it is low otherwise [buades_a_review_2005, foi_foveated_2012]. The second exponential makes the weight decay as the Euclidean distance between p and q increases.

After the weights associated to all the pixels in the considered search window have been computed, we build a k nearest neighbours (k-NN) graph by keeping, for each pixel, only the K edges with the largest weights. Limiting the number of connections at each pixel to K reduces the computation during the minimization of the problem in Eq. (9), on the one hand, and avoids weak edges that may connect pixels belonging to different objects, on the other.
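As an illustration, a naive construction of this k-NN graph for a grayscale guide image could look as follows; the default window, patch size and parameters of Eq. (14) are placeholders, not the values used in the experiments, and a practical implementation would vectorize the patch comparisons.

```python
import numpy as np

def knn_graph(img, win=5, patch=3, k=8, sigma_int=0.1, sigma_spa=3.0):
    """Build the k-NN graph of Section V-A on a grayscale guide image:
    score every candidate q in a (2*win+1)x(2*win+1) window around p
    with the weight of Eq. (14), then keep the k strongest edges."""
    h, w = img.shape
    r = patch // 2
    pad = np.pad(img, r, mode='edge')  # so every patch is fully defined
    edges = []                         # (i, j, w_ij) with flat indices
    for y in range(h):
        for x in range(w):
            tp = pad[y:y + patch, x:x + patch]            # patch T(p)
            cand = []
            for dy in range(-win, win + 1):
                for dx in range(-win, win + 1):
                    yq, xq = y + dy, x + dx
                    if (dy, dx) == (0, 0) or not (0 <= yq < h and 0 <= xq < w):
                        continue
                    tq = pad[yq:yq + patch, xq:xq + patch]  # patch T(q)
                    wpq = (np.exp(-np.sum((tp - tq) ** 2) / (2 * sigma_int ** 2))
                           * np.exp(-(dy * dy + dx * dx) / (2 * sigma_spa ** 2)))
                    cand.append((wpq, yq * w + xq))
            cand.sort(reverse=True)                        # strongest first
            edges += [(y * w + x, j, wpq) for wpq, j in cand[:k]]
    return edges
```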

V-B Solver

The problem in Eq. (9) is convex, but non smooth. Multiple solvers specifically tailored for this class of problems exist, such as the Forward Backward Primal Dual (FBPD) solver [condat_primaldual_2013]. However, the convergence of these methods calls for the estimation of multiple parameters before the actual minimization takes place, e.g., the operator norm associated to the implicit linear operator inside the regularizer in Eqs. (11) and (12).

Instead, we decide to solve the problem in Eq. (9) using Gradient Descent with momentum, in particular ADAM [kingma_adam_2015], as we empirically found it to be considerably faster (time-wise) than FBPD in our scenario.

To further speed up the convergence, we adopt a multi-scale approach.

The noisy and possibly incomplete inverse depth map \tilde{d} is progressively down-sampled to get \tilde{d}_s, with s = 0, 1, ..., S - 1, where S is the number of scales and \tilde{d}_0 = \tilde{d}. An instance of the problem in Eq. (9) is solved for each \tilde{d}_s and the solution at the scale s + 1 is up-sampled to initialize the solver at the scale s. (All up-sampling and down-sampling operations are performed using nearest neighbor interpolation.) It is crucial to observe that the field \tilde{n} must be both up-sampled and rescaled, as in the continuous domain the slope of the inverse depth plane scales with the sampling step: in fact, the up/down-sampling operations emulate a change of the pixel area, while the camera sensor area remains constant. We refer to the supplementary material for a formal proof.
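A single-scale version of the solver then reduces to a few lines on top of the cost function sketched in Section IV. The snippet below is again an illustrative sketch: the learning rate and iteration count are arbitrary, and the multi-scale scheme described above would wrap this routine, up-sampling d and rescaling \tilde{n} between scales.

```python
import torch

def refine(d_in, conf, p, src, dst, w, lam, gamma, iters=500, lr=1e-2):
    """Single-scale solver for Eq. (9) with Adam; see cost() above.
    The multi-scale scheme wraps this routine: solve a down-sampled
    instance, then up-sample d and rescale n_tilde to initialize the
    next (finer) scale."""
    d = d_in.clone().requires_grad_(True)       # init from the input map
    n_tilde = torch.zeros(d_in.shape[0], 2, requires_grad=True)
    opt = torch.optim.Adam([d, n_tilde], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = cost(d, n_tilde, d_in, conf, p, src, dst, w, lam, gamma)
        loss.backward()
        opt.step()
    return d.detach(), n_tilde.detach()
```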

VI Experimental results

In this section we test the effectiveness of our joint depth refinement and normal estimation framework on the training splits of the Middlebury v3 stereo dataset [scharstein_high_2014] at quarter resolution, of the KITTI stereo dataset [menze_object_2015] and of the ETH3D Multi-View Stereo (MVS) dataset [schops_multiview_2017] at half resolution. Since these datasets come with ground truth depth maps but lack ground truth normals, we provide numerical results for the depth refinement part of the framework, while we provide only visual results for the normal estimation part.

Due to space constraints, only the average results over the individual datasets are presented: the single scene results are provided in the supplementary material. Regarding the ground truth normal map, we approximate it by applying Eq. (8) to the ground truth inverse depth map, with the gradient of the latter, computed using a Gaussian derivative kernel with a small standard deviation, playing the role of \tilde{n}. The small standard deviation permits recovering fine details, as the ground truth inverse depth map is not affected by noise. Although this does not permit a numerical evaluation, it permits appreciating the normals estimated by our framework.

VI-A Middlebury and KITTI datasets

Similarly to the recent disparity refinement method in [tosi_leveraging_2019], we refine the disparity maps computed via Semi-Global Matching (SGM) [hirschmuller_stereo_2008] and census-based Block Matching (BM) [zabih_non_param_1994]. We compare our framework to the disparity refinement method recently proposed in [tosi_leveraging_2019], as it also relies on a confidence map and, most importantly, it was shown to outperform many other widely used disparity refinement methods, e.g., [barron_fast_2016, perreault_median_2007, zhang_100_2014, ma_constant_2013, matoccia_locally_2009], on both the Middlebury and the KITTI datasets. Moreover, since our new regularizer in Eqs. (11)-(12) resembles NLTGV [ranftl_non_local_2014], we compare to NLTGV as well. In particular, we replace our regularizer R with NLTGV in the problem formulation in Eq. (9).

It is crucial to observe that, originally, NLTGV was introduced in the context of optical flow [ranftl_non_local_2014] as a general purpose regularizer, without any ambition to connect the geometry of the optical flow to the geometry of the underlying scene. In this article, instead, we aim at modeling explicitly the joint piece-wise planarity of the inverse depth map and of the underlying scene. In fact, the mixed norms that we employ in both the terms of our regularizer, as opposed to the simple \ell_1-norm of NLTGV, are carefully chosen to make our regularizer more robust in its global plane fitting. (A thorough analysis of the differences between the proposed regularizer and NLTGV is provided in the supplementary material.)

The SGM and BM disparity maps to refine are provided by the authors of [tosi_leveraging_2019], who also provided their refined disparity maps and binary confidence maps. In order to carry out a fair comparison, these confidence maps are used by all the methods considered in the experiments. As described in [tosi_leveraging_2019], the considered binary confidence maps are the result of a learning-based framework trained on a split of the KITTI 2012 stereo dataset [geiger_object_2012]; therefore, there is no bias toward the Middlebury and KITTI datasets.

Since our framework assumes a depth map at its input, we convert the disparity map to be refined into a depth map and we then convert the refined depth map back to the disparity domain, in order to carry out the numerical evaluation. The evaluation involves the bad pixel metric, which is the percentage of pixels with an error larger than a predefined disparity threshold, together with the average absolute error (avgerr) and the root mean square error (rms). We carry out the evaluation on all the pixels with an available ground truth, regardless of the occlusions.
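For reference, the conversions and error metrics used in this evaluation are straightforward; the sketch below assumes a rectified stereo pair with focal length fx and baseline B, and uses NaN entries to mark pixels without ground truth.

```python
import numpy as np

def disparity_to_depth(disp, fx, baseline):
    """Rectified-stereo relation: depth = fx * B / disparity."""
    return fx * baseline / disp

def disparity_errors(disp, gt, thresh=2.0):
    """Bad-pixel percentage at `thresh`, avgerr and rms, over the
    pixels with an available ground truth (NaN marks missing ones)."""
    valid = np.isfinite(gt)
    err = np.abs(disp[valid] - gt[valid])
    bad = 100.0 * np.mean(err > thresh)     # bad `thresh`px metric
    return bad, err.mean(), np.sqrt(np.mean(err ** 2))
```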

Finally, the parameters adopted in NLTGV and in our framework are the result of a grid search. In particular, concerning the graph construction, NLTGV and our framework adopt the same parameters on both datasets: the weight parameters \sigma_{int} and \sigma_{spa}, the search window size, the patch size, the maximum number K of connections per pixel, and the number of scales S.

Middlebury dataset

The Middlebury training dataset [scharstein_high_2014] consists of a set of indoor scenes carefully crafted to challenge modern stereo algorithms. Some scenes contain multiple untextured planar surfaces, which represent a hard challenge for stereo methods but are compliant with the model underlying our framework; other scenes, instead, are inherently not piece-wise planar. Due to its variety, the Middlebury dataset permits evaluating the flexibility of our framework.

For NLTGV, the regularization weights are kept fixed regardless of the scale. For our framework, the weights \lambda and \gamma are tuned separately for SGM and BM disparity maps at the input, with \lambda set differently at the low and high scales and \gamma kept fixed regardless of the input disparity map.

The results of our experiments on the Middlebury dataset are presented in Table I. When BM is considered, our framework outperforms the method in [tosi_leveraging_2019] and NLTGV in all the considered metrics. Similarly, when SGM is considered, our framework outperforms the method in [tosi_leveraging_2019] and NLTGV in four out of the five metrics; in the bad 1px metric, where the best error is achieved by the method in [tosi_leveraging_2019], our result is comparable. Moreover, in the most common bad 2px metric, our framework always provides the best error regardless of the input disparity map. Clearly, some scenes in the dataset are far from fulfilling our piece-wise planar assumption, e.g., ArtL and Jadeplant. These scenes affect the average results in Table I and mask the large improvement exhibited by our framework in those scenes which fulfill the assumption even partially. This can be appreciated in the supplementary material.

Err. metric | Input | [tosi_leveraging_2019] | NLTGV [ranftl_non_local_2014] | Ours

SGM [hirschmuller_stereo_2008]:
bad 0.5px | 41.33 | 39.14 | 36.57 | 35.70
bad 1px   | 28.90 | 25.58 | 26.02 | 25.71
bad 2px   | 23.48 | 19.55 | 19.88 | 19.25
avgerr    |  4.06 |  3.32 |  3.31 |  2.87
rms       |  9.75 |  8.27 |  7.99 |  6.86

BM [zabih_non_param_1994]:
bad 0.5px | 47.48 | 39.01 | 38.49 | 35.01
bad 1px   | 37.56 | 25.83 | 28.28 | 25.40
bad 2px   | 33.98 | 20.61 | 22.03 | 19.41
avgerr    |  8.41 |  3.48 |  3.35 |  2.79
rms       | 17.32 |  8.58 |  7.91 |  6.97

TABLE I: Disparity refinement on the Middlebury dataset [scharstein_high_2014]. The row groups specify the stereo method whose disparity map is refined; the first column provides the error metric used in the evaluation: bad-pixel percentages at several thresholds, the average absolute error (avgerr) and the root mean square error (rms). All the pixels with a ground truth disparity are considered. The remaining columns report the error of the input disparity map and of the disparity maps refined by the method in [tosi_leveraging_2019], by NLTGV [ranftl_non_local_2014], and by our method, respectively.

In Figure 1 we provide the results of our experiments on the scene Piano, when the stereo method BM is considered. The normal maps associated to the input BM disparity map and to the one refined by the method in [tosi_leveraging_2019] are computed with the same approach adopted for the ground truth normal map, while employing a larger standard deviation in order to handle the noise. In fact, the input BM disparity map is significantly noisy, especially on the walls surrounding the piano. The method in [tosi_leveraging_2019] manages to decrease the error in some areas of the surrounding walls; however, since no global consistency is considered, the result is a speckled error. Instead, our method manages to approximate the surrounding walls better, using multiple planes. Finally, NLTGV fails to capture the geometry of the surrounding walls, as its reliance on a simple \ell_1-norm makes it more sensitive to outliers than our mixed norms.

Fig. 1: Middlebury [scharstein_high_2014] scene Piano. The first row hosts, from left to right, the reference image and the ground truth disparity and normal maps. Each following row hosts, from left to right, the bad 2px disparity error mask and the disparity and normal maps. The second row refers to the stereo method BM [zabih_non_param_1994] (bad 2px: 31.88%), whose disparity is refined by the method in [tosi_leveraging_2019] (18.29%), NLTGV [ranftl_non_local_2014] (27.86%), and ours (15.25%), in the rows three to five, respectively. The pixels in the error masks are color coded: error within 2 px in dark blue, error larger than 2 px in yellow, missing ground truth in white.
Fig. 2: KITTI [menze_object_2015] scene 126. The first column hosts, from top to bottom, the reference image and the ground truth disparity and normal maps. Each following column hosts, from top to bottom, the bad 3px disparity error mask and the disparity and normal maps. The second column refers to the stereo method BM [zabih_non_param_1994] (bad 3px: 26.97%), whose disparity is refined by the method in [tosi_leveraging_2019] (7.99%), NLTGV [ranftl_non_local_2014] (7.39%), and ours (5.46%), in the columns three, four and five, respectively. The pixels in the error masks are color coded: error within 3 px in blue, error larger than 3 px in yellow, missing ground truth in white.

KITTI dataset

The KITTI 2015 training dataset [menze_object_2015] consists of a set of scenes captured from the top of a moving car. As a consequence, the prevalent content of each scene is the road, together with possible vehicles and buildings at the two sides of the road. At first glance, this content may seem to match our piece-wise planar assumption. However, in practice, the buildings at the sides of the road are mostly occluded by vegetation, which is far from piece-wise planar. We select a subset of scenes randomly and test our framework on them, in order to analyze its flexibility.

For NLTGV, the regularization weights are kept fixed regardless of the scale. For our framework, \lambda is set differently at the lowest and highest scales, while \gamma is kept fixed regardless of the scale.

The results of our experiments on the KITTI dataset are presented in Table II. Regardless of the considered metric and stereo method, NLTGV outperforms the method in [tosi_leveraging_2019], while our framework outperforms all the others. Moreover, when the most common bad 3px error is considered, our framework improves the input SGM and BM disparity maps considerably: from 10.11 to 5.54 and from 38.15 to 6.40, respectively.

In Figure 2 we provide the results of our experiments on the scene 126, when the stereo method BM is considered. The method in [tosi_leveraging_2019], NLTGV and our framework all manage to substantially reduce the high amount of noise that affects the input disparity map, represented by the yellow speckles. However, only NLTGV and our framework manage to preserve fine details like the pole on the left side of the image, which appears broken in the disparity map associated to [tosi_leveraging_2019]. Finally, our framework provides the sharpest disparity map, as NLTGV exhibits some disparity bleeding at object boundaries. This is visible on the car at the bottom right corner of the image, both in the disparity maps and in the error masks. It is also confirmed by the numerical results, as our bad 3px error is significantly lower.

Err. metric | Input | [tosi_leveraging_2019] | NLTGV [ranftl_non_local_2014] | Ours

BM [zabih_non_param_1994]:
bad 2px | 40.96 | 16.75 | 11.09 | 10.54
bad 3px | 38.15 | 12.80 |  6.76 |  6.40
avgerr  |  1.94 |  1.63 |  1.30 |  1.22
rms     |  5.46 |  4.52 |  3.43 |  3.15

SGM [hirschmuller_stereo_2008]:
bad 2px | 14.25 | 11.58 | 10.49 |  9.82
bad 3px | 10.11 |  7.65 |  6.07 |  5.54
avgerr  | 21.12 |  2.97 |  1.62 |  1.51
rms     | 46.50 |  8.49 |  7.91 |  7.88

TABLE II: Disparity refinement on the KITTI dataset [menze_object_2015]. The row groups specify the stereo method whose disparity map is refined; the first column specifies the considered error metric: bad-pixel percentages at several thresholds, the average absolute error (avgerr) and the root mean square error (rms). All the pixels with a ground truth disparity are considered. The remaining columns report the error of the input disparity map and of the disparity maps refined by the method in [tosi_leveraging_2019], by NLTGV [ranftl_non_local_2014], and by our method, respectively.

VI-B ETH3D dataset

Fig. 3: ETH3D [schops_multiview_2017] scene Pipes. The first row hosts, from left to right, the reference image and the ground truth depth and normal maps. Each following row hosts, from left to right, the 2 cm error mask and the depth and normal maps. The second row refers to the MVS method [Kuhn_2020] (2 cm error: 33.22%), whose depth is refined by NLTGV [ranftl_non_local_2014] (24.34%) and by our method (22.20%) in the rows three and four, respectively. The pixels in the error masks are color coded: error within 2 cm in blue, error larger than 2 cm in yellow, missing ground truth in white.

Large scale 3D reconstruction methods [furukama_accurate_2010, jancosek_multi_2011, schoenberger_pixelwise_2016, kuhn_tv_2017, xu_multi_2019, kuhn_plane_2019] estimate a depth map for each of a large number of input images of the same scene, by leveraging photometric constraints among the images, and subsequently fuse the depth maps to produce a model of the scene itself. These methods can largely benefit from a refinement of the estimated depth maps and can exploit the corresponding normal maps during the fusion step. In order to demonstrate the suitability of our joint depth refinement and normal estimation framework on half-resolution images from challenging MVS configurations, we test it on the training split of the ETH3D dataset [schops_multiview_2017], a popular benchmark for large scale 3D reconstruction algorithms, involving both indoor and outdoor scenes.

The ground truth ETH3D depth maps are very sparse, but characterized by a resolution which is significantly higher than the resolution adopted in our tests. Therefore, similarly to [huang_deepmvs_2018], we back project the sparse ground truth depth maps to half resolution in order to obtain denser ones, to be used in our evaluation. In [Kuhn_2020], the authors propose a novel deep-network-based confidence prediction framework for depth maps computed by MVS algorithms, hence in the context of large baselines and severe occlusions. For our experiments, in order to estimate the confidence map M, we re-train the network proposed in [Kuhn_2020] jointly on the synthetic dataset proposed in the same work and on the dense ground truth depth maps of the ETH3D training split. For an unbiased evaluation, we exclude three sequences of the ETH3D training split (Pipes, Office, Delivery Area) from the training procedure and use them exclusively for our evaluation. We compare our refined depth and normal maps against the depth and normal maps derived from the MVS method based on Patch Match Stereo [bleyer_patchmatch_2011] presented in [Kuhn_2020].

For both NLTGV and our framework, the regularization weights are kept fixed regardless of the scale; we adopt the graph parameters used on Middlebury and KITTI, as well as the same number of scales. The continuous confidence map provided by the trained network is binarized with a fixed threshold.

Table III compares the input MVS depth maps with those refined by NLTGV and by our method. The top part of the table reports the percentage of pixels, computed over all the pixels of all the images in a scene, with an error exceeding a given threshold: 2 cm and 5 cm. On average, our method outperforms NLTGV and manages to improve the input depth maps by more than 7 percentage points (from 29.95% to 22.54%) when the 2 cm threshold, the most common in the ETH3D benchmark, is considered. In the bottom part of the same table, we also provide the average absolute error (avgerr) and the root mean square error (rms). The rms metric is very sensitive to outliers and, especially in the Delivery Area sequence, it highlights our improvement over the input depth maps.

Finally, a visual example is provided in Figure 3 for the scene Pipes, which is characterized by multiple untextured planar surfaces, representing a hard challenge for MVS methods. Our method targets exactly these scenarios: in fact, it manages to refine the input depth map by capturing the main planes, as exemplified by the estimated normal map. Moreover, it manages to fit the planes better than NLTGV, which fails to capture the correct floor orientation.

Percentage of pixels with an error exceeding 2 cm / 5 cm:

              |        2 cm             |        5 cm
              | Input | NLTGV  | Ours   | Input | NLTGV  | Ours
Pipes (14)    | 18.16 | 11.17  | 10.71  | 14.18 |  7.64  |  7.10
Delivery (44) | 24.15 | 19.20  | 18.33  | 12.05 |  6.41  |  5.80
Office (26)   | 47.54 | 39.34  | 38.59  | 39.32 | 30.23  | 29.13
avg           | 29.95 | 23.24  | 22.54  | 21.85 | 14.76  | 14.08

Average absolute error / root mean square error:

              |        avgerr           |        rms
              | Input | NLTGV  | Ours   | Input | NLTGV  | Ours
Pipes (14)    | 0.347 | 0.119  | 0.082  | 2.090 | 0.685  | 0.460
Delivery (44) | 0.233 | 0.025  | 0.023  | 26.06 | 0.227  | 0.211
Office (26)   | 0.330 | 0.183  | 0.167  | 1.218 | 0.409  | 0.381
avg           | 0.303 | 0.109  | 0.091  | 9.303 | 0.440  | 0.351

TABLE III: Refinement of MVS-derived [Kuhn_2020] depth maps on the ETH3D training dataset [schops_multiview_2017]. The first column specifies the test scenes, with the number of images in brackets; the NLTGV columns refer to [ranftl_non_local_2014]. The top sub-table reports the percentage of pixels with an error exceeding a predefined threshold (2 and 5 cm); the bottom sub-table reports the average absolute error (avgerr) and the root mean square error (rms).

VII Conclusions

In this article, we proposed a variational approach to the problem of depth map refinement, jointly estimating the depth and the normals of the 3D scene imaged by the camera. In particular, we formulated an optimization problem involving a data fidelity term and a graph-based regularization which enforces a piece-wise planar solution, a model that fits most human-made environments. Our new graph-based regularization, however, renders the framework flexible enough to handle scenes that are not fully piece-wise planar as well. We showed that the proposed framework outperforms state-of-the-art depth refinement methods when the piece-wise planar assumption is actually consistent with the 3D scene, and it leads to competitive results otherwise. Interesting perspectives include an a priori semantic segmentation of the reference image into planar and non-planar areas, so that the regularization could be adapted accordingly.

References