
IB-MVS: An Iterative Algorithm for Deep Multi-View Stereo based on Binary Decisions

by   Christian Sormann, et al.

We present a novel deep-learning-based method for Multi-View Stereo. Our method estimates high resolution and highly precise depth maps iteratively, by traversing the continuous space of feasible depth values at each pixel in a binary decision fashion. The decision process leverages a deep-network architecture: this computes a pixelwise binary mask that establishes whether each pixel actual depth is in front or behind its current iteration individual depth hypothesis. Moreover, in order to handle occluded regions, at each iteration the results from different source images are fused using pixelwise weights estimated by a second network. Thanks to the adopted binary decision strategy, which permits an efficient exploration of the depth space, our method can handle high resolution images without trading resolution and precision. This sets it apart from most alternative learning-based Multi-View Stereo methods, where the explicit discretization of the depth space requires the processing of large cost volumes. We compare our method with state-of-the-art Multi-View Stereo methods on the DTU, Tanks and Temples and the challenging ETH3D benchmarks and show competitive results.





1 Introduction

The objective of a Multi-View Stereo (MVS) system is the estimation of a dense depth map for a reference image, given one or multiple source images and all the camera poses. This involves computing dense matching costs between the reference image and one or more source images. In recent years, learning-based methods have shown promising results using learned input representations in the form of feature maps and learned similarity measures [mvsnet, casmvs, rmvsnet] for computing the matching costs. However, most learning-based methods discretize the depth space and compute the matching cost at each selected depth for each reference image pixel. The result is a cost volume whose size increases quadratically with respect to the image resolution for a given number of discretization steps. As a result, cost volume methods are subject to computational and memory bottlenecks. Newly proposed cascaded cost volume approaches [casmvs] mitigate these disadvantages, but they discretize the depth space using a pre-determined heuristic, which typically needs to be adapted for different datasets.

We propose a novel learning-based MVS method that explores the continuous depth space iteratively, without relying on an explicit cost volume. At each iteration, our method computes a pixelwise binary decision mask that estimates whether the actual depth of a given pixel is in front of or behind its current depth hypothesis. The binary decision allows a new depth hypothesis to be computed at each pixel, and the hypothesis is refined further at the next iteration. Our work is inspired by the two-view stereo method in [bi3d_stereo]. However, [bi3d_stereo] assumes the same depth hypothesis for each pixel, estimating the depth by means of binary masks computed for a predefined set of depth values, thus building a cost volume. Unlike [bi3d_stereo], we do not construct a cost volume and rather propose a novel iterative architecture capable of estimating a binary decision mask for arbitrary depth hypotheses at individual pixels. Moreover, our method targets the MVS scenario and takes advantage of the multiple source images available. In particular, at each iteration, a binary decision mask is generated for each source image and the resulting new depth map hypotheses are fused with a learned weighting scheme inspired by [vismvsnet]. However, unlike [vismvsnet], we do not fuse cost volumes. Instead, we employ the weights to fuse 2D maps within an iterative depth estimation scheme.

We describe the proposed algorithm in detail in Section 3. As our core contributions, we 1.) design a network architecture to estimate pixelwise depth dependent binary decision masks in the MVS setting, 2.) introduce a pixelwise depth inference algorithm based on the prediction from the previous network, 3.) implement a learning-based fusion strategy, inspired by  [vismvsnet], for the depth maps predicted from different source images, 4.) verify our results on the popular benchmarks DTU [dtu], Tanks and Temples [tanksandtemples] and ETH3D [eth3d].

2 Related work

In this section, we discuss the previously published related work in the MVS field and compare it with our proposed method. Traditional MVS methods rely on hand-crafted similarity measures such as normalized cross-correlation [colmap_mvs]. The depth hypotheses space is typically explored via a plane-sweeping cost volume [plane_sweep_mvs] or the PatchMatch algorithm [colmap_mvs, gipuma, acmm, pcf_mvs]. These methods also introduce techniques for pixelwise source view selection, in order to suppress the influence of matching results from occluded source images [colmap_mvs, acmm]. The main limitation of these methods is represented by their hand-crafted similarity measures. On the other hand, the PatchMatch exploration strategy for the depth hypothesis space and the employed pixelwise view-selection techniques elevate them above learning-based methods.

In recent years, deep-learning-based methods for MVS have received significant attention from the research community. Earlier works focus on learning a feature representation and combine this with a learned similarity measure in the form of a 3D convolutional neural network [deepmvs, mvsnet]. In order to reduce the computational cost of the regularization, recurrent [rmvsnet, dhcrmvsnet] and cascaded [casmvs, ucsnet] approaches were utilized. The addition of CRF-based cost volume regularization was also explored in several works [bp_mvsnet, mvscrf]. Other methods rely on a voxel-based representation of the input [surfacenet, surfacenet_plus] or refine an initial coarse estimate of a point cloud [pointmvsnet]. The attention [attention_neurips] mechanism has been incorporated by several works [attmvsnet, lanet] as well. More recent methods try to combine the benefits of traditional methods with the advantages of learned representation and similarity measures, e.g., by avoiding cost volumes and rather resorting to PatchMatch-based depth exploration strategies [patchmatchnet] or by leveraging pixelwise source view selection [vismvsnet].

Similarly, our proposed method does not employ a cost volume for depth estimation, but still incorporates beneficial concepts from learning-based methods. In fact, it benefits from learned input representations and similarity measures, like learning-based approaches, and it implements an efficient strategy for the exploration of the depth hypothesis space, similarly to traditional methods. Moreover, it employs a learning-based strategy to fuse estimates from different source images resembling the pixelwise view selection of traditional methods.

3 Method

In this section, we present our MVS method, named IB-MVS due to its Iterative approach and its reliance on Binary decisions. We first provide a system overview of IB-MVS, then elaborate on the details of its novel depth inference algorithm and its network architectures.

3.1 System overview

The goal of our system is to estimate a dense depth map of a reference image, given a set of source images. Our MVS method is iterative and traverses the continuous space of feasible depth values at each pixel in a binary decision fashion. Inspired by [bi3d_stereo], at each iteration and for each source image, our method predicts a binary mask with the following property, referred to as Eq. (1): at each pixel, the mask value indicates whether the ground truth depth of the reference image lies in front of or behind the current depth map hypothesis.

The pixelwise mask is predicted using a convolutional neural network named Decision Network (D-Net). In practice, we predict soft binary decision masks, hence each mask entry takes values in [0, 1]. Our method offsets the current depth hypothesis at each pixel and produces a new hypothesis compliant with the corresponding mask entry. The new depth map hypotheses, one for each source image, are then fused using learned pixelwise weights from a network named Weight Network (W-Net), in order to produce the depth map hypothesis for the next iteration. We depict the depth inference of IB-MVS in Figure 1 and provide a visual overview of the entire system in Figure 2.

3.2 Depth inference algorithm

The depth inference algorithm of IB-MVS takes a feasible depth range as input and computes the reference image depth map iteratively. The algorithm operates in the inverse depth domain, as this yields improved results in scenes with large depth ranges. To this end, we work with an inverse depth map hypothesis and define the bounds of the inverse depth range from the bounds of the feasible depth range.

The algorithm starts by initializing the inverse depth hypothesis at each pixel. It then uses D-Net with this initial inverse depth hypothesis in order to compute the first set of binary decision masks, one for each source image. For each source image, the next pixelwise inverse depth hypothesis is then calculated with the update in Eq. (2), which offsets the current hypothesis by an amount governed by the binary decision mask and by a quantity referred to as the step size.

The update in Eq. (2) is guided by the binary decision mask estimated from the current inverse depth hypothesis. When the mask indicates that the sought ground truth depth is in front of the current hypothesis, we step backwards; otherwise, we step forward. Since the mask takes values in [0, 1], the update of the current depth hypothesis is smooth. The step size decreases at each iteration, which represents the halving of the search space. However, it is noteworthy that the magnitude of the hypothesis update is adaptive (in both directions) thanks to its dependence on the mask, as modeled in Eq. (2). The update is sketched in Figure 1, where for the sake of simplicity one source image is assumed, hence omitting the source image subscript.

After computing the new inverse depth map hypothesis for each source image, the fused inverse depth map hypothesis is calculated in Eq. (3) as the pixelwise weighted sum of the per-source hypotheses, divided by the sum of the weights estimated by W-Net for each source image. In [vismvsnet] this normalization was argued to be more beneficial than thresholding. We perform T iterations and obtain the final depth map estimate from the last fused inverse depth hypothesis.

Figure 1: Visualization of the IB-MVS depth inference algorithm for a single source image. The inverse depth range is explored iteratively by updating the inverse depth map hypothesis by means of the binary decision mask.
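The loop described in this section can be sketched for a single pixel as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the midpoint initialization, the quarter-range first step, the update `hyp + step * (2 * m - 1)` and the convention that a mask value of 1 means "the true inverse depth is larger than the hypothesis" are all assumptions, and the callables `decide` and `weigh` are hypothetical stand-ins for D-Net and W-Net.

```python
def fuse(hypotheses, weights):
    """Weighted average of per-source hypotheses, normalized by the weight sum."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one fusion weight must be positive")
    return sum(w * h for w, h in zip(weights, hypotheses)) / total


def infer_inverse_depth(decide, weigh, inv_min, inv_max, num_sources, iterations):
    """Binary-decision search over [inv_min, inv_max] for one pixel.

    decide(n, hyp) -> soft mask value in [0, 1] for source n (stand-in for D-Net).
    weigh(n, mask) -> non-negative fusion weight (stand-in for W-Net).
    """
    hyp = 0.5 * (inv_min + inv_max)    # assumed initialization: midpoint of the range
    step = 0.25 * (inv_max - inv_min)  # assumed first step: a quarter of the range
    for _ in range(iterations):
        masks = [decide(n, hyp) for n in range(num_sources)]
        # Assumed soft update: m = 1 moves up by a full step, m = 0 moves down
        # by a full step, and m = 0.5 leaves the hypothesis unchanged.
        candidates = [hyp + step * (2.0 * m - 1.0) for m in masks]
        weights = [weigh(n, m) for n, m in enumerate(masks)]
        hyp = fuse(candidates, weights)
        step *= 0.5                    # halve the search space at each iteration
    return hyp


if __name__ == "__main__":
    true_inverse_depth = 0.3
    # Hard oracle in place of a trained network, with equal weights for all sources.
    oracle = lambda n, hyp: 1.0 if true_inverse_depth > hyp else 0.0
    print(infer_inverse_depth(oracle, lambda n, m: 1.0, 0.0, 1.0, 2, 8))
```

With a hard oracle the loop reduces to plain bisection, so the remaining error after each iteration is bounded by the step size used in that iteration; soft mask values shrink individual updates, which is the adaptive behavior the text attributes to Eq. (2).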

3.3 Binary decision network

The decision network D-Net exhibits a U-Net [unet]-like encoder-decoder structure and is depicted in Figure 3. At the top level, the network is fed with the feature maps obtained by applying a Feature Pyramid Network [casmvs] to the reference and source images, as well as with a depth map hypothesis. At each level of the encoder, deformable convolutions [deform_convsv2] are used to convolve the source image feature maps along locations on the epipolar line determined by the depth hypothesis. Specifically, we deform a kernel such that each of its sampling locations corresponds to a location on the epipolar line. With the kernel size used for the deformable convolutions (k=5, see Table 4), we can center the deformed kernel on the epipolar line at the location predicted by the current depth hypothesis and distribute the samples on either side, spaced by a unit vector in pixel coordinates. The resulting sampled features from the source feature map are then concatenated with the reference feature map, as suggested in [bi3d_stereo]. This procedure is repeated at each resolution level; the resulting feature maps are further processed with standard convolutional layers and passed both to the next lower resolution level of the encoder and to the decoder, as depicted in Figure 3. At the decoder side, the feature maps are upsampled and further concatenated with the features from the encoder at the next higher resolution level. The output binary mask is generated using a sigmoid activation function. We include a detailed specification of the convolutional hyper-parameters in Section A of the supplementary material.
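The placement of sampling locations along the epipolar line can be illustrated with a small sketch. It assumes that the center (the source-image location predicted by the current depth hypothesis) and the epipolar line direction have already been obtained from the camera geometry; the function name, its arguments and the default of five samples are hypothetical choices for illustration, not the authors' code.

```python
import math


def epipolar_kernel_samples(center, direction, num_samples=5):
    """Place num_samples points on a line through `center`, one pixel apart.

    center:    (x, y) location in the source image predicted by the depth hypothesis.
    direction: (dx, dy) direction of the epipolar line in pixel coordinates.
    """
    if num_samples % 2 == 0:
        raise ValueError("an odd number of samples is needed to center the kernel")
    norm = math.hypot(direction[0], direction[1])
    if norm == 0.0:
        raise ValueError("the epipolar direction must be non-zero")
    ux, uy = direction[0] / norm, direction[1] / norm  # unit vector along the line
    half = num_samples // 2
    # The middle sample coincides with the center; the others lie on either side.
    return [(center[0] + i * ux, center[1] + i * uy) for i in range(-half, half + 1)]


if __name__ == "__main__":
    for point in epipolar_kernel_samples((10.0, 20.0), (3.0, 4.0)):
        print(point)
```

In a deformable convolution, the offsets passed to the layer would be the differences between such positions and the regular grid positions of the kernel; that conversion depends on the specific layer implementation and is not shown here.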

Figure 2: Architecture of the overall system. Deep features extracted from the reference and source images using an FPN architecture [casmvs] are used as the input for three hierarchy levels of D-Nets, which predict the decision masks. These masks are then used to compute the new inverse depth hypotheses, one for each source image. The hypotheses are fused using weights estimated by W-Net, and the fused result is the new inverse depth hypothesis for the next iteration.

In practice, in order to predict the binary decision mask, we employ three D-Nets on full, half and quarter resolution inputs. In particular, as depicted in Figure 3, the output features from the previous scale D-Net are employed at the next scale. We observed that employing three resolution levels leads to higher quality binary decision masks. In fact, this choice increases the receptive field of the overall architecture and permits a coarse-to-fine refinement of the estimated masks, as each level employs the output of the previous level.

3.4 Fusion weights network

In the proposed architecture, each D-Net is followed by a W-Net, a network whose objective is to assign a confidence, in the form of a weight map, to the predicted binary decision mask. Figure 3 depicts W-Net in gray. Inspired by [vismvsnet], W-Net operates on the pixelwise entropy of the predicted binary mask. This design strategy, coupled with the choice to use the depth hypothesis to sample the source image, rather than as a network input, makes our overall algorithm scale independent. As suggested in [vismvsnet], the final weight at each pixel is computed from the network output. The pixelwise weights are used by our depth inference algorithm during the fusion stage in Eq. (3). Ideally, the predicted weights are small in those areas where the binary decision mask is not reliable, such as in occluded regions, and large otherwise, such that inverse depth hypotheses from different source images can complement each other. This approach mitigates the negative effect of potentially erroneous estimates from the source views when fusing them into the new depth hypothesis. This is crucial, as the fused hypothesis represents the starting point for the estimation of the binary decision masks at the next iteration.
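The pixelwise entropy on which W-Net operates can be sketched as below. The sketch assumes the standard binary (Bernoulli) entropy with the natural logarithm and a small clamping constant for numerical safety; both are assumptions, and the learned mapping from entropy to fusion weights is not shown.

```python
import math


def binary_entropy(m, eps=1e-7):
    """Entropy of a soft binary decision m in [0, 1] (natural logarithm)."""
    m = min(max(m, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(m * math.log(m) + (1.0 - m) * math.log(1.0 - m))


def entropy_map(mask):
    """Apply binary_entropy to every entry of a 2D mask given as nested lists."""
    return [[binary_entropy(m) for m in row] for row in mask]


if __name__ == "__main__":
    # An undecided pixel (0.5) carries more entropy than a confident one (0.95).
    print(entropy_map([[0.5, 0.95]]))
```

The entropy peaks at a mask value of 0.5 and vanishes towards 0 and 1, so it gives the weight network a direct signal of how undecided the binary decision is at each pixel.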

Figure 3: D-Net and W-Net architecture overview showing feature channel slices. The network uses standard 2D convolutions in combination with deformable 2D convolutions [deform_convsv2] which utilize depth hypothesis to sample the epipolar line of the source image.

4 Experiments

In this section, firstly we describe the training procedure adopted for IB-MVS, then we compare it to state-of-the-art works on popular MVS benchmarks.

4.1 Network training

First, the multi-level D-Net is pre-trained on random inverse depth map hypotheses, drawn from a uniform distribution at every pixel location, as proposed in [bi3d_stereo]. The ground truth binary decision mask for a given depth map hypothesis is computed from the ground truth depth map using Equation (1). We employ a loss at each level of the multi-level D-Net. In particular, the loss at each level is defined in Eq. (4) as the average of the Binary Cross Entropy (BCE) at the valid pixels: the per-pixel BCE terms are multiplied by a validity indicator, summed, and divided by the sum of the indicator over all pixels, where the indicator is equal to 1 if the pixel has a valid ground truth depth and 0 otherwise. The BCE loss in Eq. (4) takes its standard form (Eq. (5)), evaluated between the predicted soft mask and the ground truth binary decision mask. The overall loss is a weighted sum of the per-level losses.

After the pre-training, we train the multi-level D-Net using the depth inference algorithm described in Section 3.2. However, we do not consider the fusion step at this stage and work with a single randomly selected source image over the iterations. The random selection of source images was shown to be beneficial by [patchmatchnet]. We employ the previously introduced loss on each one of the binary decision masks generated during the depth inference procedure, one per iteration. At each iteration, the current inverse depth hypothesis is used both by the multi-level D-Net to generate the binary decision mask and in Eq. (1) to generate its ground truth. The loss at that iteration can then be computed and the next hypothesis generated. The procedure is iterated and the final loss is the sum of the per-iteration losses. We do not back-propagate across iterations.

Finally, in the third training stage, we train W-Net jointly with D-Net. At this stage, we fuse the binary decision masks from randomly selected source images using the weights from W-Net, with the same approach used to fuse the inverse depth hypotheses in Section 3.2, to generate a fused mask. We compute the loss of one randomly chosen iteration as the sum of the losses on the individual and fused masks. We implemented IB-MVS in PyTorch [pytorch] and trained it using ADAM [adam_optimizer].
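The masked BCE loss used at each level can be illustrated with a minimal pure-Python sketch. It assumes the convention that the ground truth mask is 1 where the ground truth depth is behind (larger than) the hypothesis, which is an assumption about the direction in Eq. (1); the function names and the clamping constant are hypothetical as well.

```python
import math


def ground_truth_mask(gt_depth, hypothesis):
    """Binary decision target: 1 where the ground truth is behind the hypothesis.

    The direction of the comparison is an assumed convention for illustration.
    """
    return [[1.0 if g > h else 0.0 for g, h in zip(g_row, h_row)]
            for g_row, h_row in zip(gt_depth, hypothesis)]


def masked_bce(pred, target, valid, eps=1e-7):
    """Average binary cross entropy over the pixels flagged as valid."""
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, valid):
        for p, t, v in zip(p_row, t_row, v_row):
            if not v:
                continue  # pixels without valid ground truth do not contribute
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    target = ground_truth_mask([[2.0, 1.0]], [[1.5, 1.5]])
    print(masked_bce([[0.9, 0.2]], target, [[True, True]]))
```

In a PyTorch training loop the same quantity would typically be computed on tensors, but the averaging over valid pixels only is the part this sketch is meant to make explicit.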

4.2 Evaluation metrics

In our evaluation, we present experimental results on three popular MVS benchmarks, namely DTU [dtu], Tanks and Temples [tanksandtemples] and ETH3D high and low-res [eth3d]. These benchmarks compare the reconstructed point cloud against a dense ground truth and extract completeness and accuracy metrics (recall and precision, respectively, for Tanks and Temples [tanksandtemples]). Completeness and accuracy are aggregated into a single metric: their average for DTU [dtu] and their harmonic mean, denoted F-score, for Tanks and Temples [tanksandtemples] and ETH3D [eth3d]. For DTU [dtu], accuracy and completeness are measured in mm, hence lower is better. For ETH3D [eth3d] and Tanks and Temples [tanksandtemples], these metrics are percentages, hence higher is better.
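The two aggregate metrics can be written down directly. The helper names below are hypothetical; the example values are the IB-MVS accuracy and completeness on DTU reported in Table 3.

```python
def dtu_average(accuracy_mm, completeness_mm):
    """DTU overall score: arithmetic mean of accuracy and completeness (lower is better)."""
    return 0.5 * (accuracy_mm + completeness_mm)


def f_score(precision, recall):
    """Harmonic mean of precision and recall (higher is better)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # IB-MVS on DTU: acc. 0.334 mm and cmp. 0.309 mm, reported avg. 0.321 mm.
    print(dtu_average(0.334, 0.309))
    # Harmonic mean for one scene with 50% precision and 100% recall.
    print(f_score(50.0, 100.0))
```

Note that a benchmark-level F-score need not equal the harmonic mean of the benchmark-level precision and recall, for example when the F-score is averaged over per-scene values.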

4.3 Ablation study

We first investigate the influence of the number of iterations T on the point cloud results for DTU [dtu]. In Table 1, we show that increasing the number of iterations leads to improved results, which is coherent with our iterative depth refinement. The larger performance difference between T=6 and T=7, compared to T=7 and T=8, is explained by the progressively shrinking step size of IB-MVS. Furthermore, we investigate the benefits of using our W-Net in the fusion step. In particular, we compare it to a naive strategy that simply averages the depth map hypotheses from the different source views. The results in Table 1 show that, for the same number of iterations T=8, W-Net leads to a better completeness and an overall better quality (avg. metric) than the naive fusion strategy, while exhibiting a competitive accuracy. The improved accuracy of the naive fusion strategy is obtained at the cost of worse completeness, as the absence of W-Net leads to more inconsistent estimates in occluded regions.

W-Net  T  avg.   acc.   cmp.   RT
yes    6  0.717  0.769  0.664  2.0s
yes    7  0.371  0.402  0.340  2.3s
yes    8  0.321  0.334  0.309  2.7s
-      8  0.342  0.326  0.359  2.3s
-      9  0.343  0.324  0.362  2.6s

Table 1: Ablation study on DTU test-set using accuracy, completeness and their average in mm (lower is better). We observe improvements with increasing iterations T. After T=8 the subdivision of the search space is sufficient, thus the result for T=9 is very close. Further, the inclusion of W-Net improves the results for the same number of iterations.

Figure: Qualitative depth map results from Tanks and Temples [tanksandtemples] and ETH3D [eth3d].

4.4 Benchmark results

We use Tanks and Temples [tanksandtemples] and ETH3D [eth3d] high and low-res in order to evaluate the generalization capabilities of IB-MVS, since we do not train on their respective training sets.

The feasible depth range is inferred from the SfM model using the method in [mvsnet]. The computed depth maps are fused into a single point cloud with the method proposed in [mvsnet], whose parameters are the number of consistent views and the geometric re-projection error threshold. In addition to these two parameters, each benchmark configuration is characterized by the image resolution used, the number of source views used, the runtime per image and the memory consumption.

For the evaluation on the DTU benchmark, we train the network on DTU in the three stages described in Section 4.1, each with its own number of epochs and learning rate. For the evaluations on Tanks and Temples and ETH3D, we train on DTU at the first stage and on Blended MVS [blended_mvs] at the second and third stages, again with stage-specific numbers of epochs and learning rates. Every training on DTU employs the ground truth depth maps and train-test split of [mvsnet]. Finally, our method is run for T=8 iterations on DTU, in line with Table 1, and for a fixed number of iterations on Tanks and Temples [tanksandtemples] and ETH3D [eth3d]. All the experiments were performed using an Nvidia RTX 2080Ti graphics card.

Figure 4: Point cloud comparison with PatchMatchNet [patchmatchnet] on the ETH3D [eth3d] high-res terrains dataset. We show the output point cloud, accuracy and completeness.
                        | low-res-test               | low-res-train        | high-res-test              | high-res-train
publication             | F      acc.   cmp.   RT    | F      acc.   cmp.   | F      acc.   cmp.   RT    | F      acc.   cmp.
DeepC-MVS [deepcmvs]    | 62.37  65.89  59.42  5746  | 61.99  65.98  59.27  | 87.08  89.15  85.52  3155  | 84.81  90.37  80.30
COLMAP [colmap_mvs]     | 52.32  61.51  45.89  3312  | 49.91  69.58  40.86  | 73.01  91.97  62.98  1658  | 67.66  91.85  55.13
ACMM [acmm]             | 55.01  52.37  58.27  1662  | 55.12  54.69  57.01  | 80.78  90.65  74.34  1165  | 78.86  90.67  70.42
PCF-MVS [pcf_mvs]       | 57.06  56.56  58.42  9289  | 57.32  57.03  58.17  | 80.38  82.15  79.29  2272  | 79.42  84.11  75.73
R-MVSNet [rmvsnet]      | 36.87  37.45  37.16  2413  | -      -      -      | -      -      -      -     | -      -      -
CasMVSNet [casmvs]      | 44.49  55.44  38.80  -     | 49.00  62.06  41.86  | -      -      -      -     | -      -      -
MVSCRF [mvscrf]         | 28.32  34.84  24.97  -     | -      -      -      | -      -      -      -     | -      -      -
P-MVSNet [pmvsnet]      | 44.46  54.95  38.28  627   | -      -      -      | -      -      -      -     | -      -      -
BP-MVSNet [bp_mvsnet]   | 43.22  32.65  64.34  -     | 50.87  49.12  55.29  | -      -      -      -     | -      -      -
Att-MVS [attmvsnet]     | 45.85  64.84  37.07  -     | -      -      -      | -      -      -      -     | -      -      -
PVSNet [pvsnet]         | 45.78  38.39  57.76  2116  | -      -      -      | 72.08  66.41  80.05  830   | 67.48  -      -
PMNet [patchmatchnet]   | -      -      -      -     | -      -      -      | 73.12  69.71  77.46  493   | 64.21  64.81  65.43
IB-MVS (ours)           | 49.19  39.31  67.29  1487  | 55.84  61.06  52.66  | 75.85  71.64  82.18  616   | 71.21  75.21  69.02

Table 2: ETH3D [eth3d] results: F-score, accuracy and completeness as percentage (2cm). Higher is better, overall best underlined, best among learning based in bold. For the test sets, we report the average per-scene runtime RT in seconds (for methods that provide this).
                               | DTU [dtu]             | Tanks and Temples [tanksandtemples]
                               | test                  | intermediate          | advanced
publication                    | avg.   acc.   cmp.    | F      preci. reca.   | F      preci. reca.
DeepC-MVS [deepcmvs]           | -      -      -       | 59.79  59.11  61.21   | 34.54  40.68  31.30
ACMM [acmm]                    | -      -      -       | 57.27  49.19  70.85   | 34.02  35.63  34.90
COLMAP [colmap_mvs]            | -      -      -       | 42.14  43.16  44.48   | 27.24  33.65  23.96
PCF-MVS [pcf_mvs]              | -      -      -       | 55.88  49.82  65.68   | 35.69  37.52  35.36
R-MVSNet [rmvsnet]             | 0.422  0.385  0.459   | 48.40  43.74  57.60   | 24.91  31.47  22.05
BP-MVSNet [bp_mvsnet]          | 0.327  0.333  0.320   | 57.60  51.26  68.77   | 31.35  29.62  35.61
P-MVSNet [pmvsnet]             | 0.420  0.406  0.434   | 55.62  49.93  63.82   | -      -      -
Att-MVS [attmvsnet]            | 0.356  0.383  0.329   | 60.05  61.89  58.93   | 31.93  40.58  27.26
CVP-MVSNet [cvp_mvsnet]        | 0.351  0.296  0.406   | 54.03  51.41  60.19   | -      -      -
CasMVSNet [casmvs]             | 0.355  0.325  0.385   | 56.84  47.62  74.01   | 31.12  29.68  35.24
PatchMatchNet [patchmatchnet]  | 0.352  0.427  0.277   | 53.15  43.64  69.37   | 32.31  27.27  41.66
UCS-Net [ucsnet]               | 0.344  0.338  0.349   | 54.83  46.66  70.34   | -      -      -
LANet [lanet]                  | 0.335  0.320  0.349   | 55.70  45.62  75.68   | -      -      -
D2HC-RMVSNet [dhcrmvsnet]      | 0.386  0.395  0.378   | 59.20  49.88  74.08   | -      -      -
VisMVSNet [vismvsnet]          | 0.365  0.369  0.361   | 60.03  54.44  70.48   | -      -      -
IB-MVS (ours)                  | 0.321  0.334  0.309   | 56.02  47.71  72.64   | 31.96  27.85  41.48

Table 3: DTU [dtu] results: accuracy, completeness and their average are in mm, lower is better. Tanks and Temples [tanksandtemples] results: precision, recall and F-score are percentages, higher is better. Overall best results are underlined, best among learning based methods are bold.

ETH3D benchmark [eth3d]

This dataset is the most challenging one for learning-based MVS methods, especially the high-res subset. On the one hand, the high resolution of its images represents a memory bottleneck for learning-based cost-volume methods. On the other hand, it is characterized by images with wide baselines and with a significantly lower overlap than in Tanks and Temples [tanksandtemples] and DTU [dtu], which can make matching without pixelwise source view selection difficult. The high-res and low-res datasets are processed with separate settings for the number of source views, the image resolution and the two point cloud fusion parameters. We provide quantitative results for both datasets in Table 2. Although this benchmark had been dominated by traditional methods such as ACMM [acmm] and DeepC-MVS [deepcmvs] in the past, the recently published PatchMatchNet [patchmatchnet] was able to achieve competitive results. IB-MVS outperforms [patchmatchnet] in terms of F-score on both the high-res training and test datasets. Learning-based approaches relying on cost volumes are limited on this benchmark; however, our iterative approach allows IB-MVS to infer accurate results, even on high resolution images. Further, since large viewpoint changes are present in this benchmark, the implemented fusion scheme allows IB-MVS to deal with occluded regions.

DTU benchmark [dtu]

This dataset contains close-range images of various objects. For this dataset, we use dedicated settings for the number of source views, the image resolution and the point cloud fusion parameters. In Table 3 we compare IB-MVS to recent learning-based state-of-the-art methods. It can be observed that IB-MVS provides very competitive accuracy and completeness values: in particular, it achieves a good trade-off between the two, which results in the best average score. Finally, we achieve a complete reconstruction, even with a strict filtering parameter. This is due to the ability of IB-MVS to yield very precise results without the need to use large cost volumes. This is especially important in the case of the DTU benchmark, where we target highly precise reconstructions of single objects.

Tanks and Temples benchmark [tanksandtemples]

The intermediate subset focuses on reconstructing small and large single objects, while the advanced subset consists of large scale indoor and outdoor scenes. In our experiments, we use partly different settings for the intermediate (interm.) and advanced (adv.) subsets. On the advanced subset, IB-MVS is competitive with other state-of-the-art learning-based methods such as CasMVSNet [casmvs] and PatchMatchNet [patchmatchnet], as shown in Table 3. Our iterative exploration of the hypothesis space, along with our pixelwise source view fusion, allows IB-MVS to achieve competitive results among learning-based methods on the challenging advanced subset. A possible direction for improving the intermediate set results could be to employ a confidence measure for the final predicted depth map, to filter out inaccurate points ahead of the point cloud fusion, similarly to [deepcmvs]. Another direction could be the improvement of the core architecture, for instance by incorporating a regularization stage. In particular, the binary decision mask prediction could be regularized by a differentiable CRF, such as [bp_mvsnet].

Finally, we discuss IB-MVS runtime performance on ETH3D, measured as the per-scene runtime of the complete reconstruction, in seconds, including the point cloud fusion step. IB-MVS is faster than traditional methods on both ETH3D high-res and low-res, as shown in Table 2. It is noteworthy that the low-res scenes contain more images. While the learning-based method PM-Net [patchmatchnet] yields a better runtime than IB-MVS, a direct comparison is difficult. In fact, the PatchMatch algorithm utilized within PM-Net [patchmatchnet] operates at half-resolution and the full resolution is obtained via a subsequent up-sampling and refinement. Instead, IB-MVS operates at full resolution and does not perform an additional refinement. Furthermore, IB-MVS achieves a better F-score than PM-Net [patchmatchnet] on ETH3D high-res. On low-res, IB-MVS offers a competitive runtime compared to learning-based methods. In terms of F-score, IB-MVS outperforms the fastest learning-based method P-MVSNet [pmvsnet], which does not participate on high-res.

5 Conclusion

We presented IB-MVS, a novel learning-based approach for MVS that explores the depth space iteratively in a binary decision fashion. IB-MVS couples the advantages of learning-based methods, such as learned input representations, with an efficient exploration strategy of the hypothesis space. In fact, IB-MVS can handle high resolution images, as it does not require a cost volume. In addition, IB-MVS benefits from a pixelwise source view fusion strategy. Extensive results show that IB-MVS achieves competitive results compared to state-of-the-art methods on popular MVS benchmarks.

Acknowledgement: This work has been supported by the FFG, Contract No. 881844: "ProFuture".

Supplementary material

A Network architecture hyper-parameters

We provide the network hyper-parameters of D-Net and W-Net in Table 4 and Table 5, respectively. As specified in the main paper, we use three resolution levels at quarter, half and full resolution. For generating the image features of the reference and source images, we utilize the FPN architecture of [casmvs] with a fixed number of feature channels for each level. Further, we replace the batch normalization [batch_norm] in the FPN [casmvs] with instance normalization [instance_norm]. W-Net takes as input the entropy calculated from the output mask according to Section 3.4 of the main paper. In Tables 4 and 5, we denote 2D convolutions as 2D conv, deformable 2D convolutions [deform_convsv2] as 2D def. conv and transposed 2D convolutions as 2D tran. conv. Further, we denote the leaky ReLU activation function as LReLU, the stride as str. and the kernel size as k. For each layer, the tables also specify the number of input and output channels, and a marker is added when no bias is used.


B Supplementary qualitative results

In Figure 5, we provide qualitative point cloud results of our method IB-MVS for the DTU [dtu], Tanks and Temples [tanksandtemples] and ETH3D [eth3d] high and low-res datasets. Additionally, in Figure 7, we provide supplementary qualitative IB-MVS depth map results from the DTU [dtu] dataset.

In order to provide further insights into IB-MVS, in Figure 6 we provide a visualization of its intermediate outputs at different iterations. For each source image, at each iteration the current hypothesis is used to compute the binary decision mask and the weight mask via D-Net and W-Net, respectively; this allows the new reference depth map hypothesis for that source image to be computed using Eq. (2) of the main paper. Rows 2-5 of Figure 6 show the resulting hypothesis, the binary decision mask and the weight mask for the 4 source images at different iterations. The new reference image depth hypotheses are then fused into a single depth map using the weight masks. The first row of Figure 6 shows the reference image along with the fused depth map at different iterations. We conclude by observing that, as desired, the weight masks in Figure 6 assign a low confidence to those areas of the reference image that are occluded in the source image, as these areas cannot be matched. This can be appreciated in the weight masks depicted in rows 2 and 3, where the area below the sofa and the left-most region (highlighted in yellow) of the reference image are dark because they are occluded in the respective source images.

Figure 5: Qualitative point cloud results for DTU [dtu] (first row), Tanks and Temples [tanksandtemples] (second row) and ETH3D [eth3d] high and low-res (third and fourth row).
in name out name operation
Conv1 2D conv., k=3, str.=1, act.=LReLU
DConv1 2D def. conv., k=5, str.=1, act.=LReLU
Conv1 | DConv1 Conc1 concatenate along channel dim.
Conc1 Conv2 2D conv., k=3, str.=1, act.=LReLU
Conv2 Sc1 2D conv., k=3, str.=2, act.=LReLU
half bilinear interp. downscale to half
half bilinear interp. downscale to half
half Conv3 2D conv., k=3, str.=1, act.=LReLU
half DConv2 2D def. conv., k=5, str.=1, act.=LReLU
Conv3 | DConv2 Conc2 concatenate along channel dim.
Conc2 Conv4 2D conv., k=3, str.=1, act.=LReLU
Sc1 | Conv4 Conc3 for concatenate along channel dim.
Conc3 Conv5 for 2D conv., k=3, str.=1, act.=LReLU
| Sc1 | Conv4 Conc3 for concatenate along channel dim.
Conc3 ConvPr for 2D conv., k=3, str.=1, act.=LReLU
ConvPr Conv5 for 2D conv., k=3, str.=1, act.=LReLU
Conv5 Sc2 2D conv., k=3, str.=2, act.=LReLU
quar. bilinear interp. downscale to quarter
quar. bilinear interp. downscale to quarter
quar. Conv6 2D conv., k=3, str.=1, act.=LReLU
quar. DConv3 2D def. conv., k=5, str.=1, act.=LReLU
Conv6 | DConv3 Conc4 concatenate along channel dim.
Conc4 Conv7 2D conv., k=3, str.=1, act.=LReLU
Sc2 | Conv7 Conc5 concatenate along channel dim.
Conc5 Conv8 2D conv., k=3, str.=1, act.=LReLU
Conv8 Conv9 2D conv., k=3, str.=1, act.=LReLU
Conv9 Conv10 2D conv., k=3, str.=1, act.=LReLU
Conv10 UConv1 2D tran. conv.,k=4, str.=2, act.=LReLU
Conv5 | UConv1 Conc6 concatenate along channel dim.
Conc6 Conv11 2D conv., k=3, str.=1, act.=LReLU
Conv11 Conv12 2D conv., k=3, str.=1, act.=LReLU
Conv12 UConv2 2D tran. conv.,k=4, str.=2, act.=LReLU
Conv2 | UConv2 Conc7 concatenate along channel dim.
Conc7 2D conv., k=3, str.=1, act.=LReLU
2D conv., k=3, str.=1, act.=sigmoid
Table 4: D-Net architecture hyper-parameters, specifying the convolution type, number of input and output channels, kernel size, stride and activation function.
Figure 6: We visualize intermediate results of IB-MVS for a view of the ETH3D [eth3d] high-res living room scene. The top row shows the depth hypothesis predicted at each iteration by fusing the depth hypotheses from different source images according to the weight masks. We also show the binary decision masks. We color code low to high depth values from blue to red. For the binary decision masks and the weight masks, black represents the lowest value and white the highest value.
in name out name operation
Conv1 for 2D conv., k=3, str.=1, act.=LReLU
Conv0 for 2D conv., k=3, str.=1, act.=LReLU
for bilinear interp. upscale to double res.
ConvPr for 2D conv., k=3, str.=1, act.=LReLU
Conv0 | ConvPr Conc1 for concatenate along channel dim.
Conc1 Conv1 for 2D conv., k=3, str.=1, act.=LReLU
Conv1 Conv2 2D conv., k=3, str.=1, act.=LReLU
Conv2 Conv3 2D conv., k=3, str.=1, act.=LReLU
Conv3 2D conv., k=3, str.=1, act.=LReLU
1 2D conv., k=3, str.=1, act.=identity
Table 5: W-Net architecture hyper-parameters, specifying the convolution type, number of input and output channels, kernel size, stride and activation function.
Figure 7: Qualitative depth map results for DTU [dtu]. For each column, the reference image is at the top and the corresponding IB-MVS depth map at the bottom. Low to high depth values are color coded from blue to red.