BP-MVSNet: Belief-Propagation-Layers for Multi-View-Stereo

by   Christian Sormann, et al.
TU Graz

In this work, we propose BP-MVSNet, a convolutional neural network (CNN)-based Multi-View-Stereo (MVS) method that uses a differentiable Conditional Random Field (CRF) layer for regularization. To this end, we propose to extend the BP layer and add what is necessary to successfully use it in the MVS setting. We therefore show how we can calculate a normalization based on the expected 3D error, which we can then use to normalize the label jumps in the CRF. This is required to make the BP layer invariant to different scales in the MVS setting. In order to also enable fractional label jumps, we propose a differentiable interpolation step, which we embed into the computation of the pairwise term. These extensions allow us to integrate the BP layer into a multi-scale MVS network, where we continuously improve a rough initial estimate until we get high quality depth maps as a result. We evaluate the proposed BP-MVSNet in an ablation study and conduct extensive experiments on the DTU, Tanks and Temples and ETH3D data sets. The experiments show that we can significantly outperform the baseline and achieve state-of-the-art results.



1 Introduction

The goal of a dense 3D reconstruction system is to initially estimate dense depth maps for a set of overlapping input images capturing an arbitrary scene by utilizing MVS. These dense depth maps are then fused into a dense point cloud. Thus, for each given input image, we want to compute a depth estimate for every pixel. Additional inputs of an MVS pipeline are the camera poses of each image and the camera calibration, which are obtained using a Structure-from-Motion (SfM) pipeline. Following MVSNet [33], we choose a reference view and a given set of source images having a substantial overlap with the reference image in terms of the viewed geometry. Analogous to the two view stereo case, we need to search along the epipolar lines in all source images using a similarity measure to find the correct depth value.

Figure 1: Top left: input RGB image. Top right: ground truth [33]. Bottom left: result from CasMVSNet [8]. Bottom right: result from BP-MVSNet (ours).

Traditional MVS methods usually use raw color information or hand-crafted features to describe local image patches. These features are then used with a traditional similarity measure such as normalized cross correlation (NCC) to find corresponding points. However, color information alone is often ambiguous and therefore a suboptimal choice for comparing patches. Therefore [33] proposed to learn optimal features for matching using a CNN and also to replace the traditional similarity measure with a learned one. However, the authors of [33] do not use any smoothness assumption on the resulting depth maps, which is commonly used in the canonical two-view stereo [25, 15]. [31] observed this and they were the first to suggest that the depth maps should also be regularized with a CRF in the MVS setting. However, the CRF inference in [31] has not been adapted for the MVS setting by explicitly handling different scales. Further, the employed mean-field approximation is highly dependent on the number of iterations. To overcome these limitations, we propose to extend the differentiable BP layer of [16] to enable its applicability in the MVS setting. The BP layer is a fully differentiable CRF inference layer that can deliver high quality results after a single iteration. We summarize our core contributions in the upcoming paragraph.


We propose the end-to-end learnable BP-MVSNet, which explicitly exploits prior knowledge of the MVS task via a CRF. To this end, we use the BP layer [16] and extend it to meet the requirements of an MVS method. In particular, we propose three extensions that correspond to our core contributions: We (i) propose a scale-independent normalization method to enable the handling of label jumps on different scene scales, (ii) add support for fractional label jumps in the CRF using a differentiable interpolation step that is fully integrable into the learning and (iii) propose a method to automatically calculate the sampling interval of the plane hypotheses beyond the initial stage. Our modifications described in Sections 3.4 and 3.5 required us to extend the forward path of [16]. Consequently, we also provide the required gradients to learn the parameters of the employed pairwise score function in the backward path. Thus, we are able to seamlessly integrate the BP layer into a learnable MVS network. We can significantly outperform the baseline CasMVSNet [8] and achieve state-of-the-art results on the DTU [1] and Tanks and Temples [14] benchmarks.

2 Related Work

We group the related work for MVS into traditional approaches and CNN based solutions.

2.1 Traditional MVS

Traditional MVS systems typically employ a photometric similarity measure such as bilateral weighted NCC [26] for evaluating different depth hypotheses. A possible technique for generating the hypotheses is the plane-sweeping MVS [6] approach, where the depth hypotheses for each pixel are computed by sampling a number of candidate planes in the scene. Another way of generating new depth hypotheses is via the PatchMatch [2, 3] algorithm, where a sampling scheme is used to propagate depth hypotheses across the image. This algorithm has also been adapted and extended for the MVS case [37, 26].

The work of [5] utilizes checkerboard-based propagation to further reduce runtime and is combined with a coarse-to-fine scheme by [29]. Furthermore, [26, 29] use a forward-backward reprojection error as an additional error term for the PatchMatch estimation. MARMVS [30] additionally estimates the optimal patch scale to reduce matching ambiguities. While these methods generally perform well on a variety of different datasets and can deal with high resolution images, their traditional similarity measures severely limit them in scenarios with reflective surfaces, occlusions and strong lighting changes. Recent works [24, 19] try to complete these missing areas using explicit plane priors. However, following [31], we propose to use a CRF based regularization [16] of the score volume for general scenarios, which helps the system to recover correct depth measurements without relying on assumptions about the scene structure.

2.2 Learning-based MVS

The concepts for supervised machine learning based MVS defined by MVSNet [33] and DeepMVS [9] were the basis for many other works which further improve upon this architecture. They utilize plane hypotheses to compute a variance metric from feature maps extracted by a CNN. The score volume is the output of a 3D convolution based neural network. R-MVSNet [34] significantly improves MVSNet [33] in terms of memory consumption by utilizing a recurrent neural network. P-MVSNet [20] learns a confidence for aggregating feature maps from the reference and source images. Recently, Att-MVS [21] utilized attention [28] for this task.

Other recent works adopt coarse-to-fine schemes to iteratively improve upon the depth prediction and to reduce memory consumption further. CVP-MVSNet [32] continuously refines the result by learning depth residuals to improve upscaled results from coarse resolution levels, similar to CasMVSNet [8]. In order to optimize for efficiency in terms of memory and runtime, Fast-MVSNet [36] learns to refine a sparse depth map; however, this comes at the expense of the quality of the results compared to [32, 8]. PointMVSNet [4] approaches the refinement problem in 3D space by refining the point cloud in a coarse-to-fine manner. A related approach is SurfaceNet [12], which operates on 3D voxel representations and is further improved by SurfaceNet+ [11], which introduces a novel view selection approach that can handle sparse MVS configurations. However, methods which work in 3D space are limited in terms of resolution because of increased memory requirements. By performing a coarse-to-fine scheme in 2.5D, [32, 8] achieve a good trade-off between accuracy and computational requirements. However, these methods apply no explicit regularization or refinement to the final result that would reduce outliers. MVSCRF [31] employs a CRF as RNN implementation [38], where the CRF distribution is approximated using a simpler distribution [17, 38]. While [31] is the first architecture to apply CRF regularization for learning-based MVS, we propose additional extensions to improve CRF inference for the MVS setting.

3 Methodology

In the following sections we describe the implemented hierarchical network architecture incorporating the BP-layer [16] in detail. Furthermore, we motivate and explain our contributions, which extend the BP-layer [16] for use in the MVS setting.

3.1 Model Overview

Figure 2: Left: High level overview of the model utilizing the extended BP-layer [16] for the MVS setting. Right: Detailed architecture of the matching network integrating the extended BP-layer. In the detailed architecture we perform trilinear interpolation to upscale the results from the lower resolution level when computing the input for the final BP-layer. In the 3D convolution description K denotes the kernel size, C the number of output channels and S the stride.

We work on images of size . Following [8, 33], we discretize our search space for the depth in each pixel location into a set of labels , which correspond to fronto-parallel plane hypotheses. The quality of the depth estimate that each of these planes represents is quantified by a score volume . One of the problems with this approach is that the computed scores in may be inconsistent in areas where occlusions, reflective surfaces and changing light conditions are present in the images, resulting in noisy or wrong depth estimates. Similar to [31], we perform a regularization step on the score volume using a CRF. The function maximized in the CRF of [16] for a given label assignment is defined as


where is the set of nodes of the graphical model, is the set of edges and is a normalization constant. This function includes a unary score term , as well as pairwise scores which allow the model to penalize inconsistencies between neighbouring pixel locations. The authors of [16] propose a BP-layer, which performs inference in the CRF. The BP-layer is fully differentiable and can thus be integrated into any labeling based model. We provide further details on how we extended the BP-layer for the MVS setting in Section 3.3.
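As a minimal sketch of the score function described above, the following pure-Python snippet evaluates the unnormalized CRF score of a label assignment on a 4-connected pixel grid: one unary score per node plus one pairwise score per edge. The function names, the toy unary values and the absolute-difference pairwise penalty are illustrative assumptions and not the learned terms of [16]. The normalization constant is omitted because it does not change which assignment maximizes the score.

```python
def crf_score(labels, unary, pairwise):
    """Sum unary scores over nodes and pairwise scores over 4-connected edges.

    labels:   H x W nested list of integer labels (one per pixel / node)
    unary:    H x W x L nested list, unary[y][x][l] = score of label l at (y, x)
    pairwise: function (source_label, target_label) -> score of one edge
    """
    height, width = len(labels), len(labels[0])
    score = 0.0
    for y in range(height):
        for x in range(width):
            score += unary[y][x][labels[y][x]]
            if x + 1 < width:   # edge to the right neighbour
                score += pairwise(labels[y][x], labels[y][x + 1])
            if y + 1 < height:  # edge to the bottom neighbour
                score += pairwise(labels[y][x], labels[y + 1][x])
    return score


def toy_pairwise(source_label, target_label):
    # Hypothetical penalty: larger label jumps receive lower scores.
    return -0.5 * abs(source_label - target_label)


# Toy 2 x 2 image with two labels per pixel (values are made up).
toy_unary = [[[1.0, 0.0], [0.2, 0.3]],
             [[0.9, 0.1], [0.6, 0.4]]]
smooth_labels = [[0, 0], [0, 0]]
noisy_labels = [[0, 1], [0, 0]]
```

In this toy example the noisy assignment has slightly better unary scores, but the two label jumps it introduces make its total score lower than that of the smooth assignment, which is the regularizing effect the pairwise term is meant to provide.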

3.2 Network Architecture

The first step of our MVS network is to extract features from the reference image and each of the corresponding source images. In this stage, we incorporate the multi-level feature extraction CNN of [8], which extracts feature maps. We use three resolution levels based on evaluations of [8]. The output of this CNN are the feature maps . The following steps are executed in the three hierarchy levels we incorporate. The source feature maps are warped according to fronto-parallel planes [8, 33], which yields a tensor as the output. These warped feature maps are then used to compute the variance between the reference and all warped source feature maps. The resulting variance tensor is used as the input for a matching network utilizing 3D convolutions, which outputs the final score volume . Following the architecture proposed by [31], the matching network integrates the BP-layer as a regularization component as shown in Figure 2. We utilize three BP-layers on different scales in the matching network and apply a temperature softmax [7] to the unary inputs as follows


using learned parameters , , for the respective levels. The input to the BP-layer on the highest scale is calculated as a weighted sum with weights , and which are learned. Further, we train three different pixel-wise pairwise networks using the UNet architecture described in [16] for each of the BP layers incorporated into the matching network. We train separate matching networks for each of the hierarchy levels as it has been shown by [8] that this improves performance. The resulting depth map is then computed as


where is the softmax normalized score volume and is a volume containing the depth hypothesis for each pixel. Following the hierarchical architecture of [8], we compute the hypothesis for the next level using the result from the current level as described in Section 3.4. This means that the hypothesis volume contains the same labels for each pixel in the first hierarchy level and contains different labels per pixel in further levels.
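The two per-pixel operations described in this section, the temperature softmax on the unary scores and the depth regression from the softmax-normalized score volume, can be sketched as follows. This is a minimal illustration for a single pixel using plain Python lists. The function names are hypothetical, and dividing the scores by the temperature is an assumed convention; the learned parameters in the network may enter differently.

```python
import math


def temperature_softmax(scores, temperature=1.0):
    """Softmax over the label dimension with an (assumed) divisive temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(scores, hypotheses, temperature=1.0):
    """Depth as the probability-weighted sum of the per-pixel depth hypotheses."""
    probabilities = temperature_softmax(scores, temperature)
    return sum(p * d for p, d in zip(probabilities, hypotheses))
```

With equal scores the regressed depth is simply the mean of the hypotheses, and a smaller temperature concentrates the probability mass on the best-scoring label.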

3.3 Extension of the BP-layer

We now provide a detailed description of our extensions to the BP-layer [16], which normalize CRF label jumps for the MVS setting and allow for fractional jumps in the pairwise score computation. Further, we describe how we automatically compute the depth hypothesis discretization beyond the initial stage. We show the performance improvements of our contributions in Section 5.

3.4 Label jump normalization

We use the differentiable BP-layer proposed in [16] employing the max-sum variant of belief propagation. The advantages of using belief propagation as the inference algorithm for the CRF used in the matching network are that it is able to model long range interactions, can be efficiently implemented and is interpretable [16]. In our case the unary scores defined in Equation (1) are inputs to the BP-layer from the matching CNN after softmax activation as shown in Figure 2. The term introduced in Equation (1) represents the pairwise scores from label to label . We apply the BP-layer as proposed in [16]. In the standard two-view stereo case the labels and represent disparities corresponding to horizontal displacements measured in pixels independent of scene scale. In the plane sweeping MVS case, each source and target label in the score volume represents the depth hypothesis of a different fronto-parallel plane which is dependent on the scene scale. If we now want to learn the pairwise score term in the MVS setting, we need to normalize depth jumps first, such that it can be applied for any scale. To this end, we apply a normalization to and . We define , based on the expected 3D error [22, 19, 18] in the depth dimension. For a given depth value , this error measure is defined as


where represents the average baselines from all used source images, is the focal length of the reference camera and is a pixelwise uncertainty which we set to in all experiments. We then normalize our source and target depths using this error measure by


where the tilde denotes normalized depth values. The normalized difference from one depth to another is then computed as


In Equation (6), we first invert both depth hypotheses and calculate the difference in the inverse depth. Afterwards, we scale the difference by a factor of to account for the differing scene scales in the MVS setting. This is also related to how a disparity in the two-view setting is calculated from depth. Thus, we are actually operating on a scaled inverse depth. Intuitively, calculating the distance in inverse depth also makes sense, as the BP-layer relies on the assumption that the gradient in the labels is constant for slanted surfaces, which is not the case when using non-inverted depth maps. By incorporating the error model, this yields an extended depth distance normalization measure compared to [19]. Since the label jumps are now normalized using , which is independent of the scene scale [19], the pairwise score function is also scale independent and can be applied in the same manner for any scene. Note that and are real values and are thus not explicitly contained in our label set . We tackle this problem by performing linear interpolation when computing the pairwise score.
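The normalization can be illustrated with a short sketch. It assumes the commonly used stereo depth error model, in which the expected error grows quadratically with depth and is inversely proportional to baseline times focal length. This is an assumption that is consistent with the description above (dividing a depth by this error yields a scaled inverse depth), but the exact form and the value of the pixelwise uncertainty are not taken from the text. All function and parameter names are hypothetical.

```python
def expected_depth_error(depth, baseline, focal, delta):
    # Assumed error model: eps(d) = d^2 * delta / (baseline * focal)
    return depth * depth * delta / (baseline * focal)


def normalized_depth(depth, baseline, focal, delta):
    # d / eps(d) = baseline * focal / (delta * d), i.e. a scaled inverse depth
    return depth / expected_depth_error(depth, baseline, focal, delta)


def normalized_jump(source_depth, target_depth, baseline, focal, delta):
    # Difference of the normalized depths; generally a fractional number
    return (normalized_depth(target_depth, baseline, focal, delta)
            - normalized_depth(source_depth, baseline, focal, delta))


# The same scene expressed in two different units (all lengths scaled by 10).
jump_small_scene = normalized_jump(2.0, 2.5, baseline=0.1, focal=1000.0, delta=0.5)
jump_large_scene = normalized_jump(20.0, 25.0, baseline=1.0, focal=1000.0, delta=0.5)
```

Scaling the scene scales both the depths and the baseline, so the two jumps above coincide. Under the assumed error model, this is the scale independence that allows one learned pairwise function to be used for any scene.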

Figure 3: Computation of the pairwise score from learned parameters with linear interpolation.

3.5 Pairwise score interpolation

When learning the pairwise score term we estimate the 5 parameters similarly to [16]. These model positive and negative label jumps of quantity and respectively. and can be different for positive or negative jumps while has the same value for both. Hence, we define the pairwise score function for two labels and by:


The way is defined in Equation (6) also implies that there are fractional jumps for our normalized depths. Consequently, we perform a linear interpolation between our learned discrete parameters. Hence, we define our interpolated pairwise score function as


This step can be integrated into the backward path of the BP-layer [16] by following the chain rule and applying


where is the incoming gradient as defined in [16]. Intuitively, we distribute the incoming gradient based on the weights and as shown in Equation (9) using the notation of [16]. Performing this interpolation allows us to more accurately represent the pairwise score function for our learning setting, where fractional normalized depth jumps are a common occurrence. We also visualize the pairwise score computation in Figure 3.
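The interpolation and its gradient can be sketched generically. The snippet below assumes a hypothetical table of learned scores stored at integer jumps from -K to K, with jumps outside that range clamped to the outermost entries. This layout is a simplification for illustration and not the exact five-parameter form used here. The forward pass blends the two neighbouring entries linearly, and the backward pass distributes the incoming gradient to the same two entries using the same weights.

```python
import math


def _interpolation_weights(jump, max_jump):
    """Return (lower index, upper index, weight of the upper entry)."""
    clamped = max(-max_jump, min(max_jump, jump))
    lower = min(math.floor(clamped), max_jump - 1)
    upper_weight = clamped - lower
    return lower + max_jump, lower + 1 + max_jump, upper_weight


def interpolated_pairwise(jump, params):
    """Linearly interpolate scores stored at integer jumps -K..K."""
    max_jump = (len(params) - 1) // 2
    i_lower, i_upper, w = _interpolation_weights(jump, max_jump)
    return (1.0 - w) * params[i_lower] + w * params[i_upper]


def pairwise_param_gradient(jump, incoming_gradient, num_params):
    """Distribute the incoming gradient onto the two interpolated entries."""
    max_jump = (num_params - 1) // 2
    i_lower, i_upper, w = _interpolation_weights(jump, max_jump)
    gradients = [0.0] * num_params
    gradients[i_lower] += (1.0 - w) * incoming_gradient
    gradients[i_upper] += w * incoming_gradient
    return gradients


# Hypothetical scores for jumps -2, -1, 0, +1, +2 (values are made up).
example_params = [-3.0, -2.0, 0.0, -1.0, -2.5]
```

For a fractional jump of 0.5, the score is the average of the entries for jumps 0 and +1, and each of those two entries receives half of the incoming gradient.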

3.6 Automatic calculation of depth hypothesis sampling interval

We compute the hypothesis volume for the initial hierarchy level as proposed in [8]. However, the error measure described in Section 3.4 allows us to compute the label tensor from the depth estimate after the initial hierarchy level automatically, as opposed to manually defining a scale factor as is done in [8]. We use to normalize our depth hypotheses, thus we want the resolution of the hypothesis volume to be at least as fine grained as this normalization factor, such that our pairwise score function can capture the corresponding depth jumps. Hence, from the upscaled depth estimate of the initial level we compute our pixelwise intervals as


When considering the number of depth hypotheses for a given hierarchy level we then compute the label tensor for every element as
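One possible reading of this step is sketched below for a single pixel. It assumes that the sampling interval equals the expected depth error at the upscaled depth estimate (using the same assumed error model, quadratic in depth) and that the hypotheses are placed symmetrically around that estimate. Both choices, as well as all names, are assumptions for illustration and may differ from the exact definitions used in the network.

```python
def sampling_interval(depth, baseline, focal, delta):
    # Assumed: the interval is the expected depth error at this depth
    return depth * depth * delta / (baseline * focal)


def depth_hypotheses(depth, num_hypotheses, baseline, focal, delta):
    """Place num_hypotheses labels symmetrically around the current depth."""
    interval = sampling_interval(depth, baseline, focal, delta)
    center = (num_hypotheses - 1) / 2.0
    return [depth + (k - center) * interval for k in range(num_hypotheses)]


# Hypothetical values: depth 2.0, baseline 0.1, focal length 1000 px
hypotheses = depth_hypotheses(2.0, 8, baseline=0.1, focal=1000.0, delta=0.5)
```

Because the interval depends on the per-pixel depth, distant pixels receive coarser sampling than nearby ones, which keeps the spacing of the labels in line with the normalization used by the pairwise term.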


4 Training

We use the Adam [13] optimizer to train the PyTorch [23] implementation of the network with the Huber loss [10] function


on the resulting pixelwise depth predictions and the ground truth depth . We use in all of our experiments. The loss in each hierarchy level is calculated as


where is the hierarchy level. and represent the estimated and ground truth depth map for the respective level. Following the work of [8], the final network loss is calculated as a weighted sum . We train our network with a batch size of . For our ablation study on the extensions of the BP-layer [16], we trained the network on a subset of the DTU training set of [33] for 7 epochs using a learning rate of . We evaluate on the full validation set from [33] after every epoch and use the epoch with the lowest error on the threshold for our results in Table 2. For the evaluations of depth maps and point clouds we trained on the full DTU training set by adding the validation set of [33] with a learning rate of for 10 epochs and then continued to train for 4 epochs with a learning rate of . For our experiments on Tanks and Temples [14] and ETH3D [27], we fine-tune the model trained on the full DTU training set using the BlendedMVS [35] training set. We train for an additional 7 epochs using a learning rate of . During training we use source images in addition to the reference, while we use source images during inference. Further, we provide the used image resolution , number of hypotheses per level , memory consumption, runtime and fusion parameters for training and inference in Table 1. For the fusion parameters we first state the number of views which have to satisfy that the forward-backward reprojection error is less than  [33]. We use the camera parameters provided by [33] in our experiments.
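The loss computation can be sketched with the standard definition of the Huber loss. The snippet assumes that the per-level loss is the mean Huber loss over pixels with valid ground truth and that the final loss is a weighted sum over the hierarchy levels. The threshold and the level weights are left as parameters because their values are not given here, and the function names are hypothetical.

```python
def huber(residual, threshold):
    """Standard Huber loss: quadratic near zero, linear for large residuals."""
    magnitude = abs(residual)
    if magnitude <= threshold:
        return 0.5 * magnitude * magnitude
    return threshold * (magnitude - 0.5 * threshold)


def level_loss(predicted, ground_truth, valid, threshold):
    """Mean Huber loss over pixels flagged as valid (flattened lists)."""
    losses = [huber(p - g, threshold)
              for p, g, ok in zip(predicted, ground_truth, valid) if ok]
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(level_losses, level_weights):
    """Weighted sum of the per-level losses."""
    return sum(w * l for w, l in zip(level_weights, level_losses))
```

The linear branch limits the influence of large depth errors on the gradient, which is the usual motivation for preferring the Huber loss over a purely quadratic one.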

5 Experiments

In the following sections we describe our evaluation procedures on the DTU [1], Tanks and Temples [14] and ETH3D-low resolution [27] datasets. Furthermore, we provide an ablation study to quantify the improvements gained by integrating the extended BP-layer [16]. Additionally, we evaluate the resulting depth maps on DTU [1] and discuss point cloud results on the datasets.

Dataset | #H | Mem. (GB) | t (s) | Fusion (views, reproj. error threshold)
DTU [1] train | 96, 32, 8 | 8.9 | 1.3 | -, -
DTU [1] test | 128, 32, 8 | 6.6 | 2.0 | 3, 0.25
Blended [35] train | 96, 32, 8 | 10.9 | 1.6 | -, -
T & T [14] inter. | 96, 32, 8 | 10.9 | 2.7 | 5, 0.50
T & T [14] adv. | 96, 32, 8 | 10.9 | 2.7 | 3, 0.25
ETH3D [27] test | 128, 32, 8 | 3.6 | 1.0 | 3, 0.10
Table 1: The parameter settings for our experiments. Column #H gives the number of depth hypotheses for the three hierarchy levels, and the last column lists the fusion parameters described in Section 5.1.3 (the number of consistent views, followed by the forward-backward reprojection error threshold). We also provide the used memory in GB and the runtime in seconds.
Method | norm. | interp. | auto | err. 1 | err. 2 | err. 3 | err. 4
BP-MVSNet | yes | - | - | 24.20 | 14.34 | 9.92 | 6.53
BP-MVSNet | yes | yes | - | 23.77 | 13.53 | 9.02 | 5.85
BP-MVSNet | yes | yes | yes | 23.21 | 14.04 | 9.38 | 5.95
CasMVSNet [8] | - | - | - | 24.51 | 16.18 | 11.93 | 8.41
Table 2: Ablation study on the DTU validation set depth maps of [33]. All methods have been trained on a representative subset of the full training set. We report the average percentage of pixels where the error is larger than a given threshold; columns err. 1 to err. 4 correspond to four increasing thresholds in the range of - (lower is better). Column norm. indicates that we enable normalized label jumps and column interp. indicates that we use interpolated pairwise scores. Column auto indicates that we enable the automatic computation of the label discretizations after the initial stage.

5.1 DTU dataset

The 128 scenes from the DTU dataset [1] capture objects placed on a table using a full image resolution of . The ground truth is provided by a structured light scanner. The evaluation metrics for DTU are a completeness metric, which is the average distance from ground truth points to the nearest reconstructed point, and an accuracy metric, which is the average distance from the reconstructed points to the nearest ground truth point. Furthermore, large outliers and points not in the observability mask are filtered.
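The two metrics can be sketched directly from their definitions. The brute-force nearest-neighbour search below is only meant for tiny toy point sets, and the filtering of outliers and of points outside the observability mask is omitted. The function names are hypothetical, and this is not the official DTU evaluation code.

```python
import math


def mean_nearest_distance(points_from, points_to):
    """Average distance from each point in points_from to its nearest point in points_to."""
    nearest = [min(math.dist(p, q) for q in points_to) for p in points_from]
    return sum(nearest) / len(nearest)


def accuracy(reconstruction, ground_truth):
    # Reconstructed points measured against the ground truth
    return mean_nearest_distance(reconstruction, ground_truth)


def completeness(reconstruction, ground_truth):
    # Ground truth points measured against the reconstruction
    return mean_nearest_distance(ground_truth, reconstruction)


toy_reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
```

In the toy example every reconstructed point lies exactly on the ground truth, so the accuracy error is zero, while the unreconstructed ground truth point at x = 3 raises the completeness error. This shows why both directions are reported.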


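Method | err. 1 | err. 2 | err. 3 | err. 4
CVP-MVSNet [32] | 28.99 | 21.78 | 16.16 | 9.48
CasMVSNet [8] | 24.84 | 19.74 | 16.14 | 11.56
BP-MVSNet | 21.79 | 15.88 | 11.57 | 7.70
Table 3: Comparison of DTU [1] depth maps on the test set of [33]. We list the average percentage of pixels with an error larger than - (lower is better); columns err. 1 to err. 4 correspond to four increasing thresholds.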
Method | overall | acc. | comp.
MVSNet [33] | 0.462 | 0.396 | 0.527
R-MVSNet [34] | 0.422 | 0.385 | 0.459
MVSCRF [31] | 0.398 | 0.371 | 0.426
P-MVSNet [20] | 0.420 | 0.406 | 0.434
Fast-MVSNet [36] | 0.370 | 0.336 | 0.403
CasMVSNet [8] | 0.355 | 0.325 | 0.385
Att-MVS [21] | 0.356 | 0.383 | 0.329
CVP-MVSNet [32] | 0.351 | 0.296 | 0.406
BP-MVSNet (ours) | 0.327 | 0.333 | 0.320
Table 4: Overall, accuracy and completeness (lower is better) results on the point clouds of the DTU [1] test set of [33].
Figure 4: RGB image and depth map results from BP-MVSNet and CasMVSNet [8]. The depth maps include the percentage of pixels with an error larger than compared to the ground truth (lower is better).
Columns 2 to 10: intermediate set (mean F and per-scene scores). Columns 11 to 17: advanced set (mean F and per-scene scores).
Method | F | Fam. | Franc. | Horse | Light. | M60 | Panther | Playg. | Train | F | Audit. | Ballr. | Courtr. | Museum | Palace | Temple
COLMAP [26] | 42.14 | 50.41 | 22.25 | 25.63 | 56.43 | 44.83 | 46.97 | 48.53 | 42.04 | 27.24 | 16.02 | 25.23 | 34.70 | 41.51 | 18.05 | 27.94
ACMM [29] | 57.27 | 69.24 | 51.45 | 46.97 | 63.20 | 55.07 | 57.64 | 60.08 | 54.48 | 34.02 | 23.41 | 32.91 | 41.17 | 48.13 | 23.87 | 34.60
PCF-MVS [19] | 55.88 | 70.99 | 49.60 | 40.34 | 63.44 | 57.79 | 58.91 | 56.59 | 49.40 | 35.69 | 28.33 | 38.64 | 35.95 | 48.36 | 26.17 | 36.69
MVSNet [33] | 43.48 | 55.99 | 28.55 | 25.07 | 50.79 | 53.96 | 50.86 | 47.90 | 34.69 | - | - | - | - | - | - | -
R-MVSNet [34] | 48.40 | 69.96 | 46.65 | 32.59 | 42.95 | 51.88 | 48.80 | 52.00 | 42.38 | 24.91 | 12.55 | 29.09 | 25.06 | 38.68 | 19.14 | 24.96
MVSCRF [31] | 45.73 | 59.83 | 30.60 | 29.93 | 51.15 | 50.61 | 51.45 | 52.60 | 39.68 | - | - | - | - | - | - | -
P-MVSNet [20] | 55.62 | 70.04 | 44.64 | 40.22 | 65.20 | 55.08 | 55.17 | 60.37 | 54.29 | - | - | - | - | - | - | -
Fast-MVSNet [36] | 47.39 | 65.18 | 39.59 | 34.98 | 47.81 | 49.16 | 46.20 | 53.27 | 42.91 | - | - | - | - | - | - | -
Att-MVS [21] | 60.05 | 73.90 | 62.58 | 44.08 | 64.88 | 56.08 | 59.39 | 63.42 | 56.06 | 31.93 | 15.96 | 27.71 | 37.99 | 52.01 | 29.07 | 28.84
CVP-MVSNet [32] | 54.03 | 76.50 | 47.74 | 36.34 | 55.12 | 57.28 | 54.28 | 57.43 | 47.54 | - | - | - | - | - | - | -
CasMVSNet [8] | 56.84 | 76.37 | 58.45 | 46.26 | 55.81 | 56.11 | 54.06 | 58.18 | 49.51 | 31.12 | 19.81 | 38.46 | 29.10 | 43.87 | 27.36 | 28.11
BP-MVSNet (ours) | 57.60 | 77.31 | 60.90 | 47.89 | 58.26 | 56.00 | 51.54 | 58.47 | 50.41 | 31.35 | 20.44 | 35.87 | 29.63 | 43.33 | 27.93 | 30.91
Table 5: F-score (higher is better) results on the Tanks and Temples benchmark [14] intermediate and advanced test sets. The overall best results are marked as bold numbers, while the best result between BP-MVSNet and CasMVSNet [8] is underlined.
Columns 2 to 7: test set (mean F and per-scene scores). Columns 8 to 13: training set (mean F and per-scene scores).
Method | F | lake. | sandb. | stor. | stor. 2 | tunnel | F | deliv. | electro | forest | playgr. | terrains
ACMM [29] | 55.01 | 59.60 | 66.07 | 35.89 | 50.48 | 63.01 | 55.12 | 38.65 | 61.75 | 60.21 | 43.87 | 71.11
PCF-MVS [19] | 57.06 | 66.85 | 62.36 | 43.32 | 52.89 | 59.86 | 57.32 | 48.61 | 56.36 | 67.24 | 43.68 | 70.70
COLMAP [26] | 52.32 | 56.18 | 61.09 | 38.61 | 46.28 | 59.41 | 49.91 | 37.30 | 52.31 | 61.83 | 31.91 | 66.23
MVSNet [33] | 38.33 | 40.49 | 57.57 | 20.52 | 33.47 | 39.63 | - | - | - | - | - | -
R-MVSNet [34] | 36.87 | 42.00 | 46.99 | 24.73 | 34.83 | 35.83 | - | - | - | - | - | -
P-MVSNet [20] | 44.46 | 49.27 | 49.30 | 34.35 | 39.83 | 49.54 | - | - | - | - | - | -
MVSCRF [31] | 28.32 | 32.16 | 44.37 | 14.66 | 21.69 | 28.73 | - | - | - | - | - | -
Att-MVS [21] | 45.85 | 49.36 | 51.75 | 34.83 | 43.70 | 49.63 | - | - | - | - | - | -
CasMVSNet [8] | 44.49 | 56.38 | 64.76 | 18.64 | 31.23 | 51.43 | 49.00 | 34.83 | 53.10 | 66.82 | 28.93 | 61.32
BP-MVSNet (ours) | 43.22 | 52.86 | 46.16 | 27.25 | 36.92 | 52.94 | 50.87 | 40.10 | 49.76 | 63.03 | 34.23 | 67.21
Table 6: F-score metrics for the ETH3D low-res [27] test and training set. For CasMVSNet [8], we chose the (base) results in the benchmark. The overall best results are marked as bold numbers, while the best result between BP-MVSNet and CasMVSNet [8] is underlined.

5.1.1 Ablation study

In our ablation study, we evaluate the impact of our extensions to the BP-layer [16] on the performance of BP-MVSNet by comparing error metrics computed on depth maps with resolution from the DTU validation set of [33]. The error metrics we use are the percentage of errors larger than thresholds mm between the ground truth depth map and the estimated depth map . We can observe in Table 2 that adding the BP layer using the normalized depth jumps improves all of the error metrics compared to the CasMVSNet [8] baseline. Enabling the interpolation step for pairwise scores further improves all of the metrics. This means that the combination of these two contributions to the BP-layer improves its performance significantly in the MVS setting. Finally, we also include the automatic computation of the label volume discretizations beyond the initial hierarchy level. This yields the lowest error on , thus we will use this model for all further experiments. The slight error increase at larger thresholds is related to the dependence of the sampling interval on the previous depth estimate: if the previous estimate is wrong, a larger fixed interval can potentially improve results for larger error thresholds.
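The error metric used in the ablation can be sketched as follows. The snippet assumes that pixels without ground truth are marked by a non-positive depth value and are excluded, which is a common convention but an assumption here. The function name and the flattened-list inputs are illustrative.

```python
def percent_above_threshold(predicted, ground_truth, threshold):
    """Percentage of valid pixels whose absolute depth error exceeds the threshold."""
    errors = [abs(p - g) for p, g in zip(predicted, ground_truth) if g > 0]
    if not errors:
        return 0.0
    return 100.0 * sum(1 for e in errors if e > threshold) / len(errors)


# Toy depth values; the third pixel has no ground truth (marked with 0.0).
toy_predicted = [10.0, 12.0, 15.0, 20.0]
toy_ground_truth = [10.5, 12.0, 0.0, 24.0]
```

Because the set of pixels above a threshold can only shrink as the threshold grows, the reported percentages decrease from the smallest to the largest threshold in Tables 2 and 3.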

5.1.2 Evaluation on depth maps

Table 3 shows quantitative results on depth maps of resolution , when using our proposed model BP-MVSNet trained on DTU as described in Section 4. We compare with CasMVSNet [8] and CVP-MVSNet [32] using their respective pretrained models on the same resolution with default parameter settings. This comparison allows us to measure the quality of the depth maps before performing the point cloud fusion, as even the same depth maps can yield different results in this stage depending on the fusion parameters. It can be seen that BP-MVSNet improves on all of the measured error metrics significantly compared to the other methods. This can also be observed in Figure 4, where we show that the extended BP-layer is able to correctly regularize even over large areas containing inconsistent estimates, such as the top of the box in the first image. However, very poor unary estimates in the matching network can still lead to artifacts in some background regions.

5.1.3 Evaluation on point clouds

For the point cloud fusion step performed for evaluations on point clouds, we utilize the method of [33]. The fusion parameters are set according to Table 1, where denotes the forward-backward reprojection error threshold and is the number of images which have to be consistent with respect to . Additionally we set the maximum relative depth difference parameter for the fusion of [33] to . In Table 4, we provide the results of the DTU point cloud evaluation [1] on the test set of [33]. We compare our results with other state-of-the-art works for learning based MVS. It can be seen that BP-MVSNet achieves the best overall score and the best completeness among the compared methods, while CVP-MVSNet [32] and CasMVSNet [8] attain a lower accuracy error. In the point cloud results visualized in Figure 5, one can see that the extended BP-layer [16] was able to regularize the depth maps such that we get complete results, even on untextured and reflective surfaces, as seen on the coffee can.
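The role of the fusion parameters can be illustrated with a simplified consistency check for a single reference pixel. The sketch assumes that the forward-backward reprojection error and the reprojected depth have already been computed for each source view; the projection itself is not shown. The function names, the observation layout and the example values are hypothetical, and the actual fusion of [33] contains further steps.

```python
def is_consistent(reprojection_error, reference_depth, reprojected_depth,
                  max_reprojection_error, max_relative_depth_difference):
    """One source view agrees with the reference pixel if both tests pass."""
    relative_difference = abs(reprojected_depth - reference_depth) / reference_depth
    return (reprojection_error < max_reprojection_error
            and relative_difference < max_relative_depth_difference)


def keep_pixel(observations, min_views, max_reprojection_error,
               max_relative_depth_difference):
    """Keep the pixel if at least min_views source views are consistent.

    observations: list of (reprojection_error, reference_depth, reprojected_depth)
    """
    consistent_views = sum(
        1 for error, d_ref, d_reproj in observations
        if is_consistent(error, d_ref, d_reproj,
                         max_reprojection_error, max_relative_depth_difference))
    return consistent_views >= min_views


# Four hypothetical source-view observations for one reference pixel.
toy_observations = [(0.10, 2.0, 2.001), (0.20, 2.0, 2.010),
                    (0.30, 2.0, 2.500), (0.05, 2.0, 1.999)]
```

Raising the required number of views or lowering the error threshold removes more outliers at the cost of completeness, which is why these parameters are listed per dataset in Table 1.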

Figure 5: Top row: Point cloud results from the DTU [1] dataset. Bottom row: Point clouds from Tanks and Temples [14] scenes.

5.2 Tanks and Temples dataset

The Tanks and Temples [14] dataset consists of real-world outdoor and indoor scenes, capturing objects, buildings and rooms, where the ground truth has been acquired by a laser scanner. The full image resolution is . This dataset includes many challenging scenes where reflections and occlusions are present. In this section, we compare the results of BP-MVSNet on the Tanks and Temples [14] benchmark with other state of the art methods. The metrics used in the Tanks and Temples [14] dataset are a precision score , which is the percentage of reconstructed points with a distance to the nearest ground truth point. Further, the recall gives the percentage of ground truth points where the closest distance to a reconstructed point is . The F-score [14] is then computed as .
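The metrics can be sketched from their definitions, with the F-score taken as the harmonic mean of precision and recall, as defined by the Tanks and Temples benchmark. The brute-force nearest-neighbour search is only suitable for toy inputs, and the function and variable names are illustrative rather than part of the official evaluation toolkit.

```python
import math


def fraction_within(points_from, points_to, tau):
    """Percentage of points_from whose nearest neighbour in points_to is closer than tau."""
    hits = sum(1 for p in points_from
               if min(math.dist(p, q) for q in points_to) < tau)
    return 100.0 * hits / len(points_from)


def precision_recall_fscore(reconstruction, ground_truth, tau):
    precision = fraction_within(reconstruction, ground_truth, tau)
    recall = fraction_within(ground_truth, reconstruction, tau)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    fscore = 2.0 * precision * recall / (precision + recall)
    return precision, recall, fscore


toy_reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
toy_ground_truth = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_precision, toy_recall, toy_fscore = precision_recall_fscore(
    toy_reconstruction, toy_ground_truth, tau=0.1)
```

The harmonic mean penalizes an imbalance between the two terms, so a reconstruction cannot reach a high F-score by being only precise or only complete.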

5.2.1 Benchmark results

In Table 5, we present the results of BP-MVSNet on the Tanks and Temples benchmark [14]. We also provide the fusion parameters in Table 1. For the Horse dataset of the intermediate scenes we increased to 1.0 as the base of the statue contains many reflections. It can be observed that BP-MVSNet achieves competitive results, outperforming the base architecture [8] in terms of the mean F-score metric on the advanced and intermediate sets. In Figure 5, we visualize the resulting point clouds. It can be observed that even difficult scenes containing surfaces that reflect the sky such as the base of the statue are quite complete. Furthermore, as seen in Figure 5, smaller details such as the rails on the train are preserved.

5.3 ETH3D dataset

The ETH3D low resolution [27] dataset consists of outdoor and indoor scenes of varying locations from a forest to a storage room. The full image resolution for this dataset is . Similar to Tanks and Temples [14], the fused point clouds are evaluated based on a laser-scanner ground truth, in terms of accuracy, completeness and F-score.

5.3.1 Benchmark results

In Table 6, we provide results for the ETH3D-low-resolution many view [27] benchmark. It can be seen that we achieve competitive results among learning-based methods such as MVSNet [33] and CasMVSNet [8]. Further, we did not train our network on the training set images of this dataset as described in Section 4. However, we can also observe that all of the learning based methods are outperformed by the traditional methods PCF-MVS [19] and ACMM [29] on this dataset. Compared to the base architecture CasMVSNet [8], we achieve better scores on some datasets such as storage room and terrains, while the score for other datasets such as sandbox or electro is lower, which results in a slightly worse result on the test and a slightly better one on the training set.

6 Conclusion

In this work, we have proposed BP-MVSNet, a CNN-based MVS system that employs a CRF regularization layer based on belief propagation [16]. In order to optimize the performance of the BP-layer [16] for the MVS setting, we have made three core contributions: (i) utilizing a scale-agnostic term for normalizing label jumps, (ii) implementing a differentiable interpolation step in the pairwise score computation, and (iii) automatically choosing the discretization in our hypothesis volume after the initial stage. These contributions improve the performance of the baseline architecture [8], as seen in our quantitative results presented in Section 5, where we achieve state-of-the-art results on both the DTU [1] and Tanks and Temples [14] datasets. Future work could involve the inclusion of additional information, such as surface normals, as guidance for the BP-layer.

Acknowledgement: This work was supported by ProFuture (FFG, Contract No. 854184).


  • [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pp. 1–16.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In SIGGRAPH.
  • [3] M. Bleyer, C. Rhemann, and C. Rother (2011) PatchMatch stereo - stereo matching with slanted support windows. In British Machine Vision Conference.
  • [4] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. In IEEE International Conference on Computer Vision (ICCV).
  • [5] S. Galliani, K. Lasinger, and K. Schindler (2015) Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conference on Computer Vision (ICCV).
  • [6] D. Gallup, J. Frahm, P. Mordohai, Q. Yang, and M. Pollefeys (2007) Real-time plane-sweeping stereo with multiple sweeping directions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. The MIT Press. ISBN 0262035618.
  • [8] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan (2020-06) Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [9] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) DeepMVS: learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [10] P. J. Huber (1964-03) Robust estimation of a location parameter. Ann. Math. Statist. 35 (1), pp. 73–101.
  • [11] M. Ji, J. Zhang, Q. Dai, and L. Fang (2020) SurfaceNet+: an end-to-end 3d neural network for very sparse multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
  • [12] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang (2017) SurfaceNet: an end-to-end 3d neural network for multiview stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2307–2315.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [14] A. Knapitsch, J. Park, Q. Zhou, and V. Koltun (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36 (4).
  • [15] P. Knöbelreiter, C. Reinbacher, A. Shekhovtsov, and T. Pock (2017) End-to-End Training of Hybrid CNN-CRF Models for Stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [16] P. Knöbelreiter, C. Sormann, A. Shekhovtsov, F. Fraundorfer, and T. Pock (2020-06) Belief propagation reloaded: learning bp-layers for labeling problems. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [17] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 109–117.
  • [18] A. Kuhn, H. Hirschmüller, D. Scharstein, and H. Mayer (2014) A TV prior for high-quality scalable multi-view stereo reconstruction. In International Conference on 3D Vision (3DV).
  • [19] A. Kuhn, S. Lin, and O. Erdler (2019) Plane completion and filtering for multi-view stereo reconstruction. In German Conference on Pattern Recognition (GCPR).
  • [20] K. Luo, T. Guan, L. Ju, H. Huang, and Y. Luo (2019) P-mvsnet: learning patch-wise matching confidence aggregation for multi-view stereo. In IEEE International Conference on Computer Vision (ICCV).
  • [21] K. Luo, T. Guan, L. Ju, Y. Wang, Z. Chen, and Y. Luo (2020-06) Attention-aware multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [22] N. Molton and M. Brady (2000-06) Practical structure and motion from stereo when motion is unconstrained. International Journal of Computer Vision 39.
  • [23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
  • [24] A. Romanoni and M. Matteucci (2019) TAPA-MVS: textureless-aware PatchMatch multi-view stereo. In IEEE International Conference on Computer Vision (ICCV).
  • [25] D. Scharstein (2007) Learning conditional random fields for stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [26] J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
  • [27] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
  • [29] Q. Xu and W. Tao (2019) Multi-scale geometric consistency guided multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [30] Z. Xu, Y. Liu, X. Shi, Y. Wang, and Y. Zheng (2020-06) MARMVS: matching ambiguity reduced multiple view stereo for efficient large scale scene reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [31] Y. Xue, J. Chen, W. Wan, Y. Huang, C. Yu, T. Li, and J. Bao (2019) MVSCRF: learning multi-view stereo with conditional random fields. In IEEE International Conference on Computer Vision (ICCV).
  • [32] J. Yang, W. Mao, J. M. Alvarez, and M. Liu (2020-06) Cost volume pyramid based depth inference for multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [33] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) MVSNet: depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV).
  • [34] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [35] Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan (2020) BlendedMVS: a large-scale dataset for generalized multi-view stereo networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [36] Z. Yu and S. Gao (2020) Fast-mvsnet: sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [37] E. Zheng, E. Dunn, V. Jojic, and J. Frahm (2014) PatchMatch based joint view selection and depthmap estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [38] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr (2015) Conditional random fields as recurrent neural networks. IEEE International Conference on Computer Vision (ICCV). ISBN 9781467383912.