
Multi-View Guided Multi-View Stereo

by   Matteo Poggi, et al.

This paper introduces a novel deep framework for dense 3D reconstruction from multiple image frames, leveraging a sparse set of depth measurements gathered jointly with image acquisition. Given a deep multi-view stereo network, our framework uses sparse depth hints to guide the neural network by modulating the plane-sweep cost volume built during the forward step, enabling us to consistently infer much more accurate depth maps. Moreover, since multiple viewpoints can provide additional depth measurements, we propose a multi-view guidance strategy that increases the density of the sparse points used to guide the network, thus leading to even more accurate results. We evaluate our Multi-View Guided framework within a variety of state-of-the-art deep multi-view stereo networks, demonstrating its effectiveness at improving the results achieved by each of them on the BlendedMVG and DTU datasets.



I Introduction

Multi-view stereo (MVS) is a popular technique to obtain dense 3D reconstructions of real-world objects or scenes from a set of multiple, posed images. It is a first, pivotal step towards a variety of higher-level applications, such as augmented/virtual reality, robotics, cultural heritage and more. As one of the fundamental problems in computer vision, it has been studied for years, at first through classical algorithms [barnes2009patchmatch, campbell2008using, furukawa2009accurate, galliani2015massively, schonberger2016pixelwise] that rely on hand-crafted matching functions to measure consistency among the multiple views. However, many challenges keep MVS an open problem, such as occlusions between the views, lack of texture or non-Lambertian surfaces, to name a few [aanaes2016large, knapitsch2017tanks, schops2017multi].

The advent of deep learning in computer vision, in particular with the introduction of Convolutional Neural Networks (CNNs), allowed for rapid progress even in geometric tasks such as MVS, partially overcoming some of the issues mentioned above. Indeed, deep MVS networks [yao2018mvsnet, yao2019recurrent, Wei_2021_ICCV, wang2021patchmatchnet, gu2020cascade, luo2019p] are spreading, thanks to their ever-increasing accuracy on popular benchmarks [jensen2014large, schops2017multi, knapitsch2017tanks]. Common to most CNNs developed for this purpose is the presence of a 3D cost volume [yao2018mvsnet], built using plane-sweeping over the source view features and computing their similarity with respect to the reference image features. Such a volume is usually regularized through 3D convolutional layers (or other, more efficient alternatives, such as 2D Long Short-Term Memory (LSTM) layers [yao2019recurrent]) before regressing the final depth map. However, despite the more robust feature representations extracted by 2D CNNs and the strong regularization achieved through 3D convolutions, the demanding computational requirements still limit the full deployment of such solutions, often requiring some trade-off between accuracy and complexity. Examples include inferring depth at a resolution lower than that of the input images [yao2018mvsnet] or implementing coarse-to-fine strategies [gu2020cascade, yang2020cost, cheng2020deep]. Moreover, several challenges mentioned above, such as dealing with untextured regions, thin objects or occlusions, remain open.

Fig. 1: Multi-View Guided Multi-View Stereo in action (panels: RGB and hints, estimated depth, point cloud). Deep MVS networks struggle at generalizing from synthetic to real images, yielding inaccurate depth maps and poor 3D reconstructions (top). By guiding the network with a set of sparse depth measurements, aggregated over the multiple views, we can greatly ameliorate the results (bottom). Depth maps are encoded with turbo_r colormap, while sparse depth hints on bottom row are densified by a dilation filter to ease visualization in this teaser.

We argue that most of the challenges mentioned so far are inherent to the image domain itself. Thus, their impact could be significantly softened given the availability of additional information from different modalities, for instance, by having access to a sparse set of depth measurements perceived by an active sensor. Nowadays, such sensors are readily available as standalone off-the-shelf devices. Moreover, they are increasingly integrated into consumer products like mobile phones and tablets (e.g., Apple iPhones and iPads). However, despite their ever-increasing diffusion, they often provide only sparse depth data (i.e., at a much lower resolution compared to standard cameras).

The recent literature supports our intuition, with several approaches effectively exploiting the synergy between color images and sparse depth data, for instance in depth completion [Uhrig2017THREEDV], in fusion with stereo algorithms [park2018high, park2019high] and networks [Poggi_2019_CVPR, Cheng_2019_CVPR, wang20193d] or, more recently, with deep optical flow architectures [Poggi_2021_ICCV] as well.

Driven by these facts, we propose a framework for guided multi-view stereo depth estimation. Assuming the availability of a sparse set of depth measurements acquired together with images, we modulate [Poggi_2019_CVPR] the cost volume built by any state-of-the-art MVS network [yao2018mvsnet, gu2020cascade, yan2020dense, cheng2020deep, wang2021patchmatchnet] to provide stronger guidance to the architecture towards inferring more accurate depth maps. Moreover, by exploiting the possibility of having multiple sets of sparse depth points acquired from the different viewpoints of the source images, we introduce an integration mechanism that accumulates the multiple depth hints, so that the cost volume inside the deep network can be modulated with a higher density of guiding points. This boosts the performance of an MVS network, allowing it to infer more accurate depth maps and consequently higher quality 3D reconstructions, for instance when trained on synthetic data and tested on real images, as shown in Fig. 1. To validate this claim, we run an exhaustive set of experiments by training a variety of state-of-the-art MVS architectures and their guided counterparts on the BlendedMVG [yao2020blendedmvs] and DTU [jensen2014large] datasets and assessing their accuracy on them. The results show that our framework consistently boosts the accuracy achievable with each of the considered deep networks, in terms of depth map estimation and overall 3D reconstruction, when guidance is available. Our contributions are:

  • We propose the Guided Multi-View Stereo framework (gMVS), extending [Poggi_2019_CVPR] to suit our purposes. Then, on top of that, we propose the Multi-View Guided Multi-View Stereo (mvgMVS) to exploit multiple sets of depth hints acquired from different viewpoints of the multi-view reconstruction task.

  • We introduce coarse-to-fine guidance by applying cost volume modulation multiple times during the forward pass, in line with the coarse-to-fine strategy followed by recent MVS networks [gu2020cascade, cheng2020deep, wang2021patchmatchnet].

  • We implement the proposed mvgMVS framework within five state-of-the-art deep architectures [yao2018mvsnet, gu2020cascade, yan2020dense, cheng2020deep, wang2021patchmatchnet], each one characterized by different regularization and optimization strategies.

II Related Work

We review the literature relevant to our work concerning stereo vision, traditional MVS approaches, deep MVS networks and guided/conditioned deep learning frameworks.

Stereo Matching. Predicting depth from a set of calibrated images is a fundamental task in computer vision, and stereo matching [Taxonomy_Stereo] represents the simplest approach for this purpose, leveraging two rectified images. This task was tackled with hand-crafted algorithms for years [CostAggregationMethods, EnergyMinimization, SemiGlobalMatching], until the spread of deep learning. At first, hand-crafted features used to compute matching costs were replaced with learned ones [MC-CNN], then end-to-end architectures [DispNet, GCNet, GANet] became dominant [SURVEY_STEREO_DEEP].

Multi-View Stereo. MVS extends stereo matching to an arbitrary number of images, acquired from known viewpoints. Pre-deep learning techniques belong to four categories, respectively reasoning about voxels [MVSGraphCuts, SemantincMVS, MVSVoxelColoring], surface evolution [AQuasiDenseSurfaceReconstruction], matching patches [furukawa2009accurate] or estimated depth maps [Galliani_2015_ICCV]. The latter strategy is the most practical and efficient and has been embraced by modern deep learning MVS architectures. The first proposal in this direction was MVSNet [yao2018mvsnet], a deep network building a variance-based cost volume through plane-sweep, then processed through 3D convolutions. However, 3D CNNs are time and memory consuming, and two main strategies have been proposed to soften these constraints. The first consists of replacing 3D convolutions with 2D GRU units [yao2019recurrent, yan2020dense, wei2021aa]. The second implements multi-stage architectures capable of coarse-to-fine inference [gu2020cascade, cheng2020deep, wang2021patchmatchnet] or pyramidal cost volumes [yang2020cost].

Depth Completion and Guided Frameworks. Two other research trends are relevant to our work. One concerns depth prediction from a single RGB image and sparse depth points, namely depth completion [uhrig2017sparsity, SparseToDense, GuidedNet, PENet]. The second concerns the idea of conditioning deep features, by either acting in latent [huang2017arbitrary, courville2017modulating, park2019semantic] or geometric space [Cheng_2019_CVPR, Poggi_2019_CVPR, Poggi_2021_ICCV] via normalization or modulation.

Our proposal to leverage sparse depth data within MVS networks takes inspiration from previous successes in stereo [Poggi_2019_CVPR] and optical flow [Poggi_2021_ICCV]. Nonetheless, we extend the previous guided deep learning formulation in two respects for our purposes. Specifically: i) considering multiple sources of depth hints, placed at different viewpoints, to increase the guide density and effectiveness and ii) applying modulation within coarse-to-fine architectures, unexplored in previous works [Poggi_2019_CVPR, Poggi_2021_ICCV].

III Proposed framework

In this section, we introduce our Multi-View Guided Multi-View Stereo (mvgMVS) framework. First, we review the background relevant to our proposal, specifically concerning deep MVS networks. Then, we cast the guided stereo matching framework [Poggi_2019_CVPR] into the MVS setting and, finally, we extend it to deal with multi-view depth hints and coarse-to-fine architectures.

Fig. 2: Hints aggregation. Depth hints from many views (left) can be aggregated on the reference image viewpoint (right).

III-A Deep Multi-View Stereo background

Most learning-based MVS pipelines follow the same pattern. Given a set of images, one taken as the reference and the others as source images, deep MVS networks process them to predict a dense depth map aligned with the reference image. To this aim, most deep networks designed for this purpose define a cost volume, encoding feature similarity between pixels in the reference image and potential matching candidates from the source images. The latter are retrieved along the epipolar lines in the source views, given the intrinsic and extrinsic parameters of every camera collecting the images involved. Specifically, for a particular depth hypothesis, features extracted from a given source view are projected by means of a homography-based warping operation.

Then, to encode the similarity between the reference features and the warped source features, a variance-based volume is defined over the views for each depth hypothesis. Accordingly, for a given pixel, the lower the variance score, the more similar the features retrieved from the source views are and, thus, the more likely the hypothesis is to be the correct depth for it.
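The variance computation described above can be sketched as follows. This is a minimal illustration, assuming the source features have already been warped to every depth hypothesis (the homography warping itself is omitted) and stacked with the reference features; the function name and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def variance_volume(feats):
    """Variance across views of per-view feature volumes.

    feats: array of shape (V, C, D, H, W) (assumed layout), where V counts
    the reference view plus the source views, C the feature channels and
    D the depth hypotheses. Reference features are expected to be repeated
    along D, source features warped to each hypothesis.
    Returns an array of shape (C, D, H, W).
    """
    mean = feats.mean(axis=0, keepdims=True)       # average feature over the views
    return ((feats - mean) ** 2).mean(axis=0)      # squared deviation, averaged over the views
```

A value close to zero at a given hypothesis means that all views agree on the feature observed there.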

However, implementing this solution requires a large amount of memory and is computationally expensive. Consequently, several state-of-the-art networks [gu2020cascade, yan2020dense, cheng2020deep, wang2021patchmatchnet] implement a coarse-to-fine solution. Specifically, a set of variance-based cost volumes is built, one for each resolution or scale at which features are extracted, with the intrinsic parameters of each camera adjusted to that resolution and the depth hypotheses sampled in a range that differs at each scale.
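A small sketch of the two per-scale ingredients mentioned above, under simplifying assumptions: the intrinsics are rescaled with the plain convention that focal lengths and principal point scale with the image size, and the hypotheses are sampled uniformly. Both helper names are hypothetical, and individual networks may choose their ranges differently.

```python
import numpy as np

def scale_intrinsics(K, s):
    """Adjust a 3x3 intrinsic matrix to an image resized by factor s.

    Assumed convention: focal lengths and principal point scale by s.
    """
    K_s = np.array(K, dtype=float)
    K_s[:2, :] *= s            # first two rows hold fx, fy, cx, cy
    return K_s

def depth_hypotheses(z_min, z_max, n):
    """n uniformly spaced depth hypotheses in [z_min, z_max] (assumed sampling)."""
    return np.linspace(z_min, z_max, n)
```

In a coarse-to-fine network, finer scales would typically call `depth_hypotheses` with a narrower range centred on the depth estimated at the previous scale.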

III-B Guided Multi-View Stereo

By assuming a setup made of a standard camera and a low-resolution depth sensor, for instance a LiDAR, we leverage the output of the latter to shape the behavior of a deep network estimating depth from a set of color images. When this set is limited to a single frame, a neural network is usually trained to complete the sparse depth points [Uhrig2017THREEDV] guided by the color image [tang2020learning]. When multiple images are available, the mechanism often reverses, and depth measurements are used as hints to guide the image-based estimation process. This strategy is implemented, for instance, by the Guided Stereo framework [Poggi_2019_CVPR] for binocular stereo, which applies a Gaussian modulation to the feature volume so that it peaks in correspondence of a depth hint.

By analogy, this mechanism can also be applied to multi-view stereo, implementing a Guided Multi-View Stereo pipeline (gMVS). Indeed, the variance volume introduced in Sec. III-A can be conveniently modulated as well. In this case, since low variance encodes a high likelihood of the corresponding depth hypothesis being correct, we flip the Gaussian curve to force the variance-based cost volume to have a minimum near the depth hint. The modulation is applied only where a binary mask, equal to 1 for pixels with a valid hint (0 otherwise), is active, with k, c being the amplitude and width of the Gaussian itself. The gMVS formulation outlined so far extends the Guided Stereo framework [Poggi_2019_CVPR] to MVS. In the remainder, we will introduce two significant additional contributions conceived explicitly for the MVS setup and the models designed for it.
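One plausible form of the flipped Gaussian modulation, written as a sketch: the variance is multiplied by `1 - v + v * k * (1 - exp(-(z - h)^2 / (2 c^2)))`, with `v` the binary mask and `h` the hint. The exact expression is an assumption derived from the description above and from the Guided Stereo formulation; the function name, the array layout and the values of `k` and `c` used in any call are illustrative.

```python
import numpy as np

def modulate_variance_volume(volume, depths, hints, valid, k, c):
    """Flipped-Gaussian modulation of a variance volume (assumed form).

    volume: (D, H, W) or (C, D, H, W) variance volume
    depths: (D,) depth hypotheses
    hints:  (H, W) sparse depth hints (any value where valid == 0)
    valid:  (H, W) binary mask, 1 where a hint is available
    k, c:   amplitude and width of the Gaussian
    """
    diff = depths[:, None, None] - hints[None]                  # (D, H, W)
    flipped = 1.0 - np.exp(-(diff ** 2) / (2.0 * c ** 2))       # 0 at the hint, close to 1 far from it
    weight = 1.0 - valid[None] + valid[None] * k * flipped      # identity where no hint exists
    return volume * weight                                      # broadcasts over channels if present
```

With this form, pixels without a hint keep their original variance, while pixels with a hint see the variance suppressed around the hinted depth and amplified elsewhere.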

Fig. 3: Depth hints filtering. On the left, reference image (a). On the right, sparse hints over the region inside the red rectangle in (a), respectively from the single viewpoint (b), aggregated over multiple viewpoints (c) and filtered (d). Regions in yellow rectangles in (c) and (d) highlight the effect of filtering. Depth points are densified to ease visualization.

III-C Multi-View Guided Multi-View Stereo

MVS relies on the availability of multiple images acquired from different viewpoints. Moreover, we assume the availability of sparse depth measurements registered with the colour images in our setup. Therefore, a different set of hints is available for each source image. In such a case, we argue that aggregating the multiple sets of depth hints from each viewpoint can provide stronger guidance to the network and further improve the results of the baseline gMVS framework. To this aim, we perform two main steps.

Depth hints aggregation. Given a pixel in any source image, with known homogeneous 2D coordinates and an available depth value, its 3D coordinates in the reference image viewpoint are obtained by back-projecting it through the source camera and applying the relative pose between the source and reference cameras. From these 3D coordinates, we get the new depth hint expressed in the reference image viewpoint and project it on the reference image plane according to the reference camera intrinsics.

This allows us to aggregate depth hints on the reference view, as shown in Fig. 2, and thus obtain a denser depth hints map to modulate the volume in the network with stronger guidance. We refer to this extension of the gMVS framework as Multi-View Guided Multi-View Stereo (mvgMVS).
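The aggregation step can be sketched for one source view as below. The conventions are assumptions made for illustration: extrinsics are 4x4 world-to-camera matrices, a zero in the depth map means "no hint", projected coordinates are rounded to the nearest pixel, and when two points land on the same pixel the closest one is kept. The function name and signature are hypothetical.

```python
import numpy as np

def aggregate_hints(depth_src, K_src, E_src, K_ref, E_ref, ref_shape):
    """Reproject sparse hints from a source view onto the reference view."""
    v, u = np.nonzero(depth_src)                                  # pixels holding a hint
    z = depth_src[v, u].astype(float)
    pix = np.stack([u, v, np.ones_like(u)]).astype(float)         # homogeneous pixels, (3, N)
    pts_src = (np.linalg.inv(K_src) @ pix) * z                    # 3D points in the source camera
    pts_src_h = np.vstack([pts_src, np.ones((1, pts_src.shape[1]))])
    pts_ref = (E_ref @ np.linalg.inv(E_src) @ pts_src_h)[:3]      # 3D points in the reference camera
    z_ref = pts_ref[2]
    front = z_ref > 0                                             # discard points behind the camera
    pts_ref, z_ref = pts_ref[:, front], z_ref[front]
    proj = K_ref @ pts_ref
    u_ref = np.round(proj[0] / proj[2]).astype(int)
    v_ref = np.round(proj[1] / proj[2]).astype(int)
    h, w = ref_shape
    inside = (u_ref >= 0) & (u_ref < w) & (v_ref >= 0) & (v_ref < h)
    u_ref, v_ref, z_ref = u_ref[inside], v_ref[inside], z_ref[inside]
    order = np.argsort(-z_ref)                                    # far to near: nearest is written last
    hints = np.zeros(ref_shape, dtype=float)
    hints[v_ref[order], u_ref[order]] = z_ref[order]
    return hints
```

Calling this once per source view and merging the outputs with the hints already available in the reference view would give the denser map used for guidance.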

Depth hints filtering. Because of the different viewpoints, some of the depth measurements acquired in one of the source views may belong to regions that are occluded in the reference view. Given the sparse nature of the hints, naively projecting them across the views without reasoning about their visibility would aggregate several wrong values, as shown in Fig. 3 (c). As a consequence, we would guide the deep network with wrong depth hints, harming its accuracy. To detect and remove these outliers, we deploy the filtering strategy by Zhao et al. [ZhaoYiming_IEEE_2021], defining as an outlier any pixel for which there exists at least one pixel in its neighbourhood such that:

  • the two pixels swap their relative position because of occlusion. This occurs if the differences between their pixel coordinates and between their angles (in spherical coordinates) have different signs

  • the distance of the former from the camera is much higher than that of the latter, by more than a threshold set according to the specific dataset used

Although simple, this strategy allows for removing most of the outliers at a minor computational cost, as shown in Fig. 3 (d). We will show in our ablation experiments that this step is necessary to achieve optimal guidance. We refer to this final implementation as filtered mvgMVS (fmvgMVS).
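As a deliberately simplified stand-in for the filter, the sketch below implements only a depth-gap test in the spirit of the second criterion: a hint is discarded when some hint in its neighbourhood is closer to the camera by more than a threshold. It is not the strategy by Zhao et al.; the sign test on spherical angles is omitted, and `radius`, `max_gap` and the use of an absolute difference are hypothetical choices.

```python
import numpy as np

def filter_hints(hints, radius, max_gap):
    """Remove hints that lie far behind a neighbouring hint (simplified test).

    hints:   (H, W) aggregated depth hints, 0 where no hint is available
    radius:  half-size of the square neighbourhood, in pixels (hypothetical)
    max_gap: maximum tolerated depth difference to the nearest neighbour (hypothetical)
    """
    out = hints.copy()
    vs, us = np.nonzero(hints)
    for v, u in zip(vs, us):
        win = hints[max(v - radius, 0): v + radius + 1,
                    max(u - radius, 0): u + radius + 1]
        nearest = win[win > 0].min()            # closest hint in the window (includes the pixel itself)
        if hints[v, u] - nearest > max_gap:     # likely a background point seen through an occluder
            out[v, u] = 0.0
    return out
```

The loop reads from the unfiltered map, so the outcome does not depend on the order in which pixels are visited.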

III-D Coarse-to-Fine Guidance

Unlike deep stereo networks, which usually build a single volume processed through stacked 3D convolutions, MVS networks are often designed to embody coarse-to-fine estimation to reduce the computational burden, as introduced previously in Sec. III-A. We argue that each of the multiple cost volumes built by the network represents a possible entry point for guiding the network. Accordingly, we modulate each of them during the forward pass, with the binary mask and the depth hints map downsampled to the resolution of the corresponding volume with nearest-neighbor interpolation. Our experiments will show how the stronger guidance yielded by these multiple modulations improves the overall network accuracy.
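A minimal sketch of the downsampling step, assuming an integer reduction factor, for which nearest-neighbor downsampling reduces to strided sampling. The helper name is hypothetical.

```python
import numpy as np

def downsample_hints(hints, valid, factor):
    """Nearest-neighbor downsampling of a hint map and its mask.

    Keeps one pixel every `factor` along each axis, so hint values are
    never averaged with empty (zero) pixels.
    """
    return hints[::factor, ::factor], valid[::factor, ::factor]
```

Nearest-neighbor sampling matters here because the map is sparse: a bilinear or average filter would blend valid depths with the zeros marking missing hints and produce meaningless values.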

IV Experimental results

In this section, we collect the outcome of our experiments describing, at first, the datasets involved in our evaluation, details concerning the framework implementation, the networks evaluated and the training protocol. Source code is available at

IV-A Datasets

We begin by introducing the datasets involved in our experiments. Since none of the existing MVS datasets provides sparse depth points, we simulate the availability of sparse hints by randomly sampling them from ground-truth depth maps, similarly to [Poggi_2019_CVPR, Poggi_2021_ICCV]. Consequently, for our experiments, we can select only datasets providing such information, i.e. we cannot evaluate on Tanks and Temples [knapitsch2017tanks].

BlendedMVG. This dataset [yao2020blendedmvs] collects about 110K images sampled from about 500 scenes. It has been created by applying a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, meshes are rendered to color images and depth maps. Following [yao2020blendedmvs] we retain 8 sequences for validation and 7 for testing, using the rest for training each network involved in our experiments.

DTU [aanaes2016large]. This indoor dataset counts 124 different scenes, all sharing the very same camera trajectory. Images are acquired with a structured light scanner mounted on a robot arm, using one of the cameras in the scanner itself. We select training, validation and testing splits following existing works [yao2018mvsnet, gu2020cascade, yan2020dense, cheng2020deep, wang2021patchmatchnet]. In particular, we evaluate on the testing split both the networks trained on BlendedMVG alone and those further fine-tuned on the DTU training set.

IV-B Implementation details

Our framework is implemented in PyTorch, starting from existing solutions [wang2021patchmatchnet]. Concerning gMVS, we simulate the availability of sparse depth hints by randomly sampling 3% of pixels from the ground-truth depth maps. Following [Poggi_2019_CVPR, Poggi_2021_ICCV], we set and . Regarding filtering, we set . We conduct our experiments implementing gMVS and variants with five state-of-the-art networks.
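The hint simulation can be sketched as follows, assuming that pixels without ground truth are stored as zeros and that each valid pixel is kept independently with probability equal to the target density. The function name and the zero-as-missing convention are assumptions.

```python
import numpy as np

def sample_hints(gt_depth, density, rng):
    """Simulate sparse depth hints by randomly sampling a ground-truth depth map.

    gt_depth: (H, W) ground-truth depth, 0 where unavailable (assumed convention)
    density:  fraction of pixels to keep, e.g. 0.03 for 3%
    rng:      a numpy.random.Generator, passed in for reproducibility
    Returns the sparse hint map and its binary validity mask.
    """
    keep = (rng.random(gt_depth.shape) < density) & (gt_depth > 0)
    hints = np.where(keep, gt_depth, 0.0)
    return hints, keep.astype(np.float32)
```

For example, `sample_hints(gt, 0.03, np.random.default_rng(0))` returns a map in which roughly 3% of the pixels with ground truth carry a hint.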

MVSNet [yao2018mvsnet]. The very first deep network for MVS: it builds a single variance volume and processes it through 3D convolutions, similarly to 3D stereo networks [Kendall_2017_ICCV], and estimates depth at a quarter of the input resolution.

DHC-RMVSNet [yan2020dense]. A recurrent architecture, replacing 3D convolutions with 2D convolutional LSTM to reduce memory requirements.

CAS-MVSNet [gu2020cascade]. It implements a cascade cost volume formulation, inferring depth in a coarse-to-fine manner to achieve higher efficiency.

UCSNet [cheng2020deep]. It builds Adaptive Thin Volumes for coarse-to-fine processing. The volumes consist of only a few depth hypotheses selected by modeling uncertainty.

PatchMatchNet [wang2021patchmatchnet]. A very efficient model, implementing a differentiable variant of the PatchMatch algorithm [barnes2009patchmatch] within a deep network.

Every network is implemented by integrating the authors' code in our framework and following their default configuration, except for the number of depth hypotheses used by MVSNet and DHC-RMVSNet, set to 128 because of memory constraints. During both training and evaluation, if the final output of the original network is lower than the input resolution, it is upsampled to the original size through interpolation.

IV-C Training and testing protocol

We set the number of images processed by the networks to 5, both during training and testing. Accordingly, we accumulate depth hints coming from 5 views for mvgMVS.

Training schedule. We train each network for 100K iterations on the BlendedMVG dataset on images, with a constant learning rate of 10 – except DHC-RMVSNet, for which it was set to 10 to avoid instability. Any training has been carried out on a single Titan Xp GPU, allowing only for a single sample per batch – except for PatchMatchNet, for which batch 2 fits in memory.

We also fine-tune each network for 50K further iterations on the DTU training set, processing images and using the hyper-parameters as detailed for BlendedMVG.

Testing protocol. We test the networks on the BlendedMVG testing sequences and on the DTU testing split. For each dataset, we report the percentage of pixels in the estimated depth map having an error larger than a given threshold, expressed respectively in pixels and millimetres on the two datasets and set to 1, 2, 3 and 4. Concerning DTU, we also evaluate the quality of reconstructed point clouds, reporting accuracy and completeness metrics defined as in [aanaes2016large] and their average (the lower the better). Fused point clouds are obtained as in [wang2021patchmatchnet].
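The depth map metric amounts to a thresholded error rate, sketched below. The sketch assumes an absolute error and that pixels without ground truth are stored as zeros and excluded; the function name is hypothetical.

```python
import numpy as np

def error_rate(pred, gt, tau):
    """Fraction of valid pixels whose absolute depth error exceeds tau.

    pred, gt: arrays of the same shape; gt == 0 marks pixels without
    ground truth (assumed convention), which are ignored.
    """
    valid = gt > 0
    err = np.abs(pred - gt)[valid]
    return float((err > tau).mean())
```

Evaluating it at thresholds 1, 2, 3 and 4 gives one table row per network.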

IV-D Ablation study

We start by studying the impact of the different components in our framework, with the main emphasis on mvgMVS extension and coarse-to-fine modulation.

Multi-View Guided MVS. We first measure the improvements yielded by multi-view guidance. To this aim, we run experiments with MVSNet, by training different variants on the BlendedMVG training split and evaluating on the testing sequences. Tab. I collects the outcome of this experiment. From top to bottom, we report the error rates achieved by the original MVSNet architecture, by a variant implementing the baseline guided MVS framework described in Sec. III-B (-g), followed by mvgMVS versions respectively without (-mvg) and with (-fmvg) filtering.

| Network | Hints dens. | >1 Px. | >2 Px. | >3 Px. | >4 Px. |
|---|---|---|---|---|---|
| MVSNet [yao2018mvsnet] | - | 0.139 | 0.073 | 0.046 | 0.031 |
| MVSNet-g | 0.03 | 0.095 | 0.046 | 0.027 | 0.018 |
| MVSNet-mvg | 0.03 | 0.081 | 0.040 | 0.024 | 0.016 |
| MVSNet-fmvg | 0.03 | 0.076 | 0.037 | 0.023 | 0.015 |
| MVSNet-g | 0.15 | 0.068 | 0.032 | 0.020 | 0.013 |

TABLE I: Ablation study – guiding strategy. Results on BlendedMVG [yao2020blendedmvs] testing scans.

| Network | Hints dens. | >1 Px. | >2 Px. | >3 Px. | >4 Px. |
|---|---|---|---|---|---|
| MVSNet-fmvg | 0.03 | 0.076 | 0.037 | 0.023 | 0.015 |
| GuideNet-fmvg | 0.03 | 0.290 | 0.124 | 0.080 | 0.058 |

TABLE II: Ablation study – mvgMVS versus depth completion. Results on BlendedMVG [yao2020blendedmvs] testing scans.

| Network | Stages | >1 Px. | >2 Px. | >3 Px. | >4 Px. |
|---|---|---|---|---|---|
| CAS-MVSNet [gu2020cascade] | - | 0.071 | 0.036 | 0.023 | 0.016 |
| CAS-MVSNet-fmvg | 1 | 0.057 | 0.024 | 0.014 | 0.010 |
| CAS-MVSNet-fmvg | 2 | 0.084 | 0.042 | 0.027 | 0.019 |
| CAS-MVSNet-fmvg | 3 | 0.078 | 0.041 | 0.027 | 0.020 |
| CAS-MVSNet-fmvg | All | 0.048 | 0.018 | 0.012 | 0.009 |

TABLE III: Ablation study – multi-stage guidance. Results on BlendedMVG [yao2020blendedmvs] testing scans.

Starting from the gMVS baseline, it consistently achieves reduced error rates compared to MVSNet by exploiting the sparse depth guidance. Concerning mvgMVS, there are further improvements thanks to the aggregation of multiple sets of depth hints coming from the 5 different viewpoints. Nonetheless, even if this strategy increases the hints density from 3% up to roughly 15%, the improvement might appear less significant than one would expect with a more extensive set of hints. This is due to the several hints in non-visible parts of the source images that are wrongly projected in the reference point of view, as discussed previously. Indeed, by filtering out these outliers and consequently reducing the hints density to about 14%, we can improve the performance of MVSNet further when guided by mvgMVS. At the bottom of the table, we also report the performance achieved by MVSNet when guided by the baseline gMVS implementation and 15% hints density. Not surprisingly, having a higher density of depth hints from the single reference viewpoint is more effective than aggregating them over multiple viewpoints, because they are not affected by visibility and possible collisions between projected points. However, fmvgMVS achieves performance close to what is attainable with a depth sensor providing a much denser guide.

To conclude this study, we also compare the performance of our MVSNet-fmvg with a depth completion framework. To this end, we select GuideNet [GuidedNet] and train it to process single RGB images and multi-view aggregated sparse depth points (the very same used to guide MVSNet) for 100K iterations on BlendedMVG, as done for MVSNet. Tab. II directly compares the error rates achieved by both, highlighting how, when multiple sets of depth hints are available, the guided multi-view framework yields better depth maps compared to a depth completion approach.

| Network | >1 Px. | >2 Px. | >3 Px. | >4 Px. |
|---|---|---|---|---|
| MVSNet [yao2018mvsnet] | 0.139 | 0.073 | 0.046 | 0.031 |
| MVSNet-fmvg | 0.076 | 0.037 | 0.023 | 0.015 |
| DHC-RMVSNet [yan2020dense] | 0.174 | 0.094 | 0.059 | 0.040 |
| DHC-RMVSNet-fmvg | 0.081 | 0.041 | 0.025 | 0.017 |
| CAS-MVSNet [gu2020cascade] | 0.071 | 0.036 | 0.023 | 0.016 |
| CAS-MVSNet-fmvg | 0.048 | 0.018 | 0.012 | 0.009 |
| UCSNet [cheng2020deep] | 0.071 | 0.038 | 0.024 | 0.017 |
| UCSNet-fmvg | 0.040 | 0.018 | 0.011 | 0.008 |
| PatchMatchNet [wang2021patchmatchnet] | 0.075 | 0.039 | 0.025 | 0.018 |
| PatchMatchNet-fmvg | 0.062 | 0.033 | 0.022 | 0.016 |

TABLE IV: Evaluation on BlendedMVG [yao2020blendedmvs] testing scans. Comparison between original MVS networks [yao2018mvsnet, gu2020cascade, yan2020dense, cheng2020deep, wang2021patchmatchnet] and their guided counterparts.
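(a) trained on BlendedMVG

| Network | >1 mm | >2 mm | >3 mm | >4 mm | Acc. (mm) | Comp. (mm) | Avg. (mm) |
|---|---|---|---|---|---|---|---|
| MVSNet [yao2018mvsnet] | 0.658 | 0.457 | 0.368 | 0.326 | 0.764 | 0.468 | 0.616 |
| MVSNet-fmvg | 0.393 | 0.227 | 0.194 | 0.180 | 0.383 | 0.264 | 0.324 |
| DHC-RMVSNet [yan2020dense] | 0.708 | 0.519 | 0.423 | 0.372 | 0.764 | 0.586 | 0.675 |
| DHC-RMVSNet-fmvg | 0.401 | 0.177 | 0.134 | 0.115 | 0.393 | 0.234 | 0.314 |
| CAS-MVSNet [gu2020cascade] | 0.558 | 0.385 | 0.330 | 0.303 | 0.589 | 0.310 | 0.450 |
| CAS-MVSNet-fmvg | 0.323 | 0.243 | 0.220 | 0.207 | 0.345 | 0.286 | 0.316 |
| UCSNet [cheng2020deep] | 0.541 | 0.402 | 0.357 | 0.333 | 0.561 | 0.344 | 0.453 |
| UCSNet-fmvg | 0.199 | 0.174 | 0.164 | 0.157 | 0.290 | 0.264 | 0.277 |
| PatchMatchNet [wang2021patchmatchnet] | 0.627 | 0.440 | 0.370 | 0.335 | 0.574 | 0.484 | 0.529 |
| PatchMatchNet-fmvg | 0.446 | 0.328 | 0.301 | 0.287 | 0.339 | 0.297 | 0.318 |

(b) fine-tuned on DTU

| Network | >1 mm | >2 mm | >3 mm | >4 mm | Acc. (mm) | Comp. (mm) | Avg. (mm) |
|---|---|---|---|---|---|---|---|
| MVSNet [yao2018mvsnet] | 0.555 | 0.340 | 0.268 | 0.237 | 0.635 | 0.304 | 0.470 |
| MVSNet-fmvg | 0.219 | 0.103 | 0.081 | 0.072 | 0.324 | 0.235 | 0.280 |
| DHC-RMVSNet [yan2020dense] | 0.630 | 0.423 | 0.329 | 0.283 | 0.662 | 0.342 | 0.502 |
| DHC-RMVSNet-fmvg | 0.168 | 0.079 | 0.061 | 0.054 | 0.327 | 0.240 | 0.284 |
| CAS-MVSNet [gu2020cascade] | 0.480 | 0.307 | 0.257 | 0.233 | 0.528 | 0.262 | 0.395 |
| CAS-MVSNet-fmvg | 0.082 | 0.056 | 0.047 | 0.042 | 0.228 | 0.279 | 0.254 |
| UCSNet [cheng2020deep] | 0.506 | 0.332 | 0.277 | 0.254 | 0.551 | 0.272 | 0.412 |
| UCSNet-fmvg | 0.119 | 0.105 | 0.098 | 0.095 | 0.319 | 0.281 | 0.300 |
| PatchMatchNet [wang2021patchmatchnet] | 0.475 | 0.310 | 0.260 | 0.236 | 0.461 | 0.298 | 0.380 |
| PatchMatchNet-fmvg | 0.336 | 0.228 | 0.204 | 0.193 | 0.325 | 0.230 | 0.278 |

TABLE V: Evaluation on DTU [aanaes2016large] testing scans. Columns >1 mm to >4 mm report the depth map evaluation, the remaining columns the point cloud evaluation. Comparison between MVS networks [yao2018mvsnet, gu2020cascade, yan2020dense, cheng2020deep, wang2021patchmatchnet] and guided counterparts, trained on BlendedMVG and tested (a) without re-training or (b) after fine-tuning on the DTU training split.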
Fig. 4: Qualitative results on DTU dataset – scan9 (top) and scan114 (bottom). Depth maps and point clouds yielded by DHC-RMVSNet (top), CAS-MVSNet (bottom) and guided counterparts trained on BlendedMVG. Panels: RGB, original network, hints, guided network.

Coarse-to-fine strategy. We now ablate the coarse-to-fine guidance mechanism introduced in Sec. III-D, by training different variants of CAS-MVSNet. Tab. III reports results on the BlendedMVG testing split. From top to bottom, we report error rates achieved by the original CAS-MVSNet without guidance, three models guided during one out of the total three stages implemented by the network (i.e. modulating only one out of the three volumes built during inference), and finally the model guided by modulating every volume. All guided models implement the filtered mvgMVS formulation. Guiding only the volume computed during the first stage already improves the results of the original network. Guiding the second or third stage alone fails to improve over the unguided CAS-MVSNet, yielding higher error rates at every threshold. Nonetheless, providing a consistent modulation across the three stages allows for the best results.

IV-E Multi-View Guided MVS networks

We now evaluate the impact of the mvgMVS framework on the five state-of-the-art networks selected for our experiments. Specifically, we train both the original networks and their counterparts guided by filtered mvgMVS.

Evaluation on BlendedMVG. We start by evaluating all the networks on the BlendedMVG testing split, collecting the results in Tab. IV. Looking at the original networks, we can notice that models implementing coarse-to-fine processing [gu2020cascade, cheng2020deep, wang2021patchmatchnet] are, in general, more accurate than MVSNet and DHC-RMVSNet, achieving about half the error rates at any threshold. This gap is bridged by guiding both with the filtered mvgMVS framework.

Guided counterparts of CAS-MVSNet, UCSNet and PatchMatchNet are further improved too. In particular, CAS-MVSNet-fmvg and UCSNet-fmvg almost halve the error rates at any given threshold, while PatchMatchNet-fmvg benefits from the guidance to a lesser extent. We ascribe this latter fact to the random initialization performed at the very first stage of PatchMatchNet, left unchanged when implementing its guided counterpart.

Generalization to DTU. We now assess the impact of the multi-view guided framework on the generalization capacity of the networks to unseen datasets. To this end, we evaluate the five networks and their guided counterparts on the DTU testing split without fine-tuning on the DTU training split. Tab. V (a) collects the outcome of this experiment, reporting error metrics on both estimated depth maps (left) and 3D point clouds (right).

Focusing on the former, and differently from the experiments on BlendedMVG, we notice a consistent margin between MVSNet/DHC-RMVSNet and coarse-to-fine models [gu2020cascade, cheng2020deep, wang2021patchmatchnet] only for the number of pixels with error larger than 1 or 2 mm, with mixed results as the threshold increases. Looking at the guided counterparts, we can appreciate how they always produce much more accurate depth maps, dramatically reducing the error rates.

Concerning the quality of the reconstructed 3D point clouds, we observe that coarse-to-fine models achieve both better accuracy and better completeness than MVSNet/DHC-RMVSNet, confirming their effectiveness. Finally, when guided by accumulated depth hints, every network dramatically improves the quality of its fused point clouds, confirming that considerable improvements on single depth maps translate into better 3D reconstructions.
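Accuracy and completeness of a point cloud are commonly computed as mean nearest-neighbour distances, from the reconstruction to the ground truth and vice versa. The sketch below illustrates this idea with a brute-force search. It is a simplified stand-in under that assumption, not the official DTU evaluation protocol, which involves additional steps that are not reproduced here. The function names are hypothetical.

```python
import math


def mean_nearest_distance(source, target):
    """Mean distance from each point in `source` to its closest point in `target`."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)


def accuracy_completeness(reconstruction, ground_truth):
    """Return (accuracy, completeness) as mean distances, where lower is better.

    accuracy:     how far reconstructed points lie from the ground truth
    completeness: how far ground-truth points lie from the reconstruction
    """
    accuracy = mean_nearest_distance(reconstruction, ground_truth)
    completeness = mean_nearest_distance(ground_truth, reconstruction)
    return accuracy, completeness
```

The brute-force search is quadratic in the number of points, so a spatial index would be needed for clouds of realistic size. The two numbers capture different failure modes: a sparse but precise cloud scores well on accuracy and poorly on completeness.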

To summarize, this experiment suggests that mvgMVS notably improves the generalization capacity of MVS networks in terms of both depth map accuracy and 3D reconstruction quality. Fig. 4 shows some qualitative examples.

Fine-tuning and evaluation on DTU. To confirm that the effect of our framework on 3D reconstructions is not limited to generalization scenarios, we fine-tune all the previous networks on the DTU training split and evaluate their performance. Tab. V (b) collects results concerning both estimated depth maps (left) and point clouds (right).

Concerning the original networks, we witness a behaviour similar to the one observed in Tab. V (a), with a margin between coarse-to-fine models and the others that is consistent only for pixels with error larger than 1 mm. Not surprisingly, every network performs better after being fine-tuned, in terms of both depth map accuracy and point cloud quality. However, looking at the guided networks, we notice how their accuracy is further boosted by the fine-tuning phase, with drops in error rates much larger than those achieved by the original models.

Looking at the reconstructed point clouds of the original networks, we observe the same trend as in Tab. V (a), with coarse-to-fine models generally producing higher-quality point clouds. Once again, the more accurate depth maps yielded by mvgMVS correspond to better reconstructions.

To summarize, our experiments highlight that networks guided by the mvgMVS framework consistently outperform their original counterparts, both in terms of generalization capability and when data for fine-tuning is available.

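| Network | Test hints density | >1 Px. | >2 Px. | >3 Px. | >4 Px. |
| --- | --- | --- | --- | --- | --- |
| MVSNet [yao2018mvsnet] | - | 0.139 | 0.073 | 0.046 | 0.031 |
| MVSNet-fmvg | 0.03 | 0.076 | 0.037 | 0.023 | 0.015 |
| MVSNet-fmvg | 0.02 | 0.087 | 0.043 | 0.026 | 0.017 |
| MVSNet-fmvg | 0.01 | 0.109 | 0.054 | 0.033 | 0.022 |
| MVSNet-fmvg | 0.00 | 0.244 | 0.165 | 0.126 | 0.101 |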
TABLE VI: Ablation study – changing density at testing time. Validation errors on BlendedMVG [yao2020blendedmvs] testing scans. MVSNet is trained with 3% hints density and tested with different densities.

Limitations. Although our experiments highlight the potential of the Multi-View Guided Multi-View Stereo framework, which is effective on both synthetic and real datasets, our proposal suffers from a limitation that may be important in some environments: networks trained with a specific hints density do not generalize to sparser hints. Specifically, once a guided network has been trained with a fixed density of input depth points, its performance drops if that density is not guaranteed at test time. Tab. VI investigates this behaviour with a further experiment, carried out using MVSNet guided with 3% hints aggregated over the views during training and tested with varying densities. We can notice how, as the number of hints decreases, the network performance degrades as well, although it remains better than the original MVSNet trained without guidance (first row). However, when the hints are removed altogether (last row), the performance drops dramatically, falling below that of the original MVSNet. This behaviour highlights that the network exploits the hints almost blindly when trained with them, losing much accuracy when the hints are not available during deployment, consistently with [Poggi_2019_CVPR]. Future research will explore better training protocols that minimize the drop in accuracy in such circumstances. Moreover, the current evaluation is conducted by simulating the availability of depth hints from a sensor. Further experiments with real sensing devices would allow us to assess the robustness of the framework to noise in the sparse depth points, as studied in [Poggi_2019_CVPR].
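Simulated hints at a given density, as used in the experiment of Tab. VI, can be obtained by randomly sampling a ground-truth depth map. The sketch below is one possible way to do so, assuming that the density is expressed as a fraction of all pixels and that only pixels with valid ground truth can be selected. It samples a single view and does not reproduce the aggregation over views. The function name, the flat list layout and the seeding scheme are hypothetical.

```python
import random


def sample_hints(depth, density, seed=0):
    """Keep a random subset of valid depth values and mark the rest as missing.

    depth:   flat list of ground-truth depths (None or <= 0 means invalid)
    density: fraction of all pixels to keep as hints, e.g. 0.03 for 3%
    seed:    seed of the local random generator, for reproducibility
    """
    rng = random.Random(seed)
    valid = [i for i, d in enumerate(depth) if d is not None and d > 0]
    n_keep = min(round(density * len(depth)), len(valid))
    keep = set(rng.sample(valid, n_keep))
    return [d if i in keep else None for i, d in enumerate(depth)]
```

Calling the function with densities 0.03, 0.02, 0.01 and 0.00 on the same depth map gives progressively sparser inputs, with the last one containing no hints at all.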

V Conclusion

In this paper, we have presented a novel framework for accurate MVS depth estimation. Building on the successes in binocular stereo [Poggi_2019_CVPR], we extended guided stereo to fully exploit the potential of the multi-view setup by aggregating multiple depth hints acquired from different viewpoints. Our experiments with five state-of-the-art MVS networks show the effectiveness of our framework, which consistently generates much more accurate depth maps and consequently enables the reconstruction of higher-quality point clouds. This behaviour holds both when generalizing from synthetic to real data and after fine-tuning on real images.

Acknowledgment. This work was partially funded by University of Bologna and Ministero dello Sviluppo Economico (MISE) within the Proof of Concept 2020 program.