1 Introduction
The goal of a dense 3D reconstruction system is to first estimate dense depth maps for a set of overlapping input images capturing an arbitrary scene by utilizing multi-view stereo (MVS). These dense depth maps are then fused into a dense point cloud. Thus, for each given input image, we want to compute a depth estimate for every pixel. Additional inputs of an MVS pipeline are the camera pose of each image and the camera calibration, which are obtained using a Structure-from-Motion (SfM) pipeline. Following MVSNet [33], we choose a reference view and a set of source images that have a substantial overlap with the reference image in terms of the viewed geometry. Analogous to the two-view stereo case, we need to search along the epipolar lines in all source images using a similarity measure to find the correct depth value.
Traditional MVS methods usually use raw color information or handcrafted features to describe local image patches. These features are then used with a traditional similarity measure such as normalized cross-correlation (NCC) to find corresponding points. However, color information alone is often ambiguous and therefore a suboptimal choice for comparing patches. Therefore, [33] proposed to learn optimal features for matching using a CNN and to also replace the traditional similarity measure with a learned one. However, the authors of [33] do not use any smoothness assumption on the resulting depth maps, which is commonly employed in canonical two-view stereo [25, 15]. [31] observed this and were the first to suggest that the depth maps should also be regularized with a conditional random field (CRF) in the MVS setting. However, the CRF inference in [31] has not been adapted to the MVS setting by explicitly handling different scales. Further, the employed mean-field approximation is highly dependent on the number of iterations. To overcome these limitations, we propose to extend the differentiable BP layer of [16] to enable its applicability in the MVS setting. The BP layer is a fully differentiable CRF inference layer that can deliver high-quality results after a single iteration. We summarize our core contributions in the following paragraph.
Contribution
We propose the end-to-end learnable BP-MVSNet, which explicitly exploits prior knowledge of the MVS task via a CRF. To this end, we use the BP layer [16] and extend it to meet the requirements of an MVS method. In particular, we propose three extensions that correspond to our core contributions: We (i) propose a scale-independent normalization method to enable the handling of label jumps at different scene scales, (ii) add support for fractional label jumps in the CRF using a differentiable interpolation step that is fully integrated into the learning and (iii) propose a method to automatically calculate the sampling interval of the plane hypotheses beyond the initial stage. Our modifications described in Sections 3.4 and 3.5 required us to extend the forward pass of [16]. Consequently, we also provide the required gradients to learn the parameters of the employed pairwise score function in the backward pass. Thus, we are able to seamlessly integrate the BP layer into a learnable MVS network. We significantly outperform the baseline CasMVSNet [8] and achieve state-of-the-art results on the DTU [1] and Tanks and Temples [14] benchmarks.
2 Related Work
We group the related work on MVS into traditional approaches and CNN-based solutions.
2.1 Traditional MVS
Traditional MVS systems typically employ a photometric similarity measure such as bilaterally weighted NCC [26] for evaluating different depth hypotheses. A possible technique for generating the hypotheses is the plane-sweeping MVS [6] approach, where the depth hypotheses for each pixel are computed by sampling a number of candidate planes in the scene. Another way of generating new depth hypotheses is the PatchMatch [2, 3] algorithm, where a randomized sampling scheme is used to propagate depth hypotheses across the image. This algorithm has also been adapted and extended for the MVS case [37, 26].
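To make the plane-sweeping idea concrete, here is a minimal sketch of sampling fronto-parallel depth hypotheses; the function name and the uniform inverse-depth spacing are our assumptions for illustration, not details prescribed by [6]:

```python
import numpy as np

def plane_sweep_hypotheses(d_min, d_max, num_planes, inverse=True):
    """Sample fronto-parallel depth plane hypotheses in [d_min, d_max].

    Uniform sampling in inverse depth concentrates hypotheses near the
    camera, which roughly equalizes the pixel displacement induced by
    neighboring planes.
    """
    if inverse:
        inv = np.linspace(1.0 / d_max, 1.0 / d_min, num_planes)
        return 1.0 / inv[::-1]  # reciprocal of descending inverse depths -> ascending depths
    return np.linspace(d_min, d_max, num_planes)

depths = plane_sweep_hypotheses(0.5, 10.0, 64)
```

Each sampled depth defines one fronto-parallel plane against which the source images are warped and compared.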
The work of [5] utilizes checkerboard-based propagation to further reduce runtime, and is combined with a coarse-to-fine scheme by [29]. Furthermore, [26, 29] use a forward-backward reprojection error as an additional error term for the PatchMatch estimation. MARMVS [30] additionally estimates the optimal patch scale to reduce matching ambiguities. While these methods generally perform well on a variety of datasets and can deal with high-resolution images, their traditional similarity measures severely limit them in scenarios with reflective surfaces, occlusions and strong lighting changes. Recent works [24, 19] try to complete these missing areas using explicit plane priors. However, following [31], we propose to use a CRF-based regularization [16] of the score volume for general scenarios, which helps the system recover correct depth measurements without relying on assumptions about the scene structure.
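For reference, plain (unweighted) NCC between two patches can be sketched as follows; the bilaterally weighted variant of [26] additionally weights pixels by color and spatial proximity, which this sketch omits:

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation between two equal-sized patches, in [-1, 1]."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    # eps guards against division by zero on constant patches
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))
```

Because the patches are mean-subtracted and normalized, NCC is invariant to affine brightness changes, but it remains ambiguous in textureless regions, which motivates the learned measures discussed below.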
2.2 Learning-based MVS
The concepts for supervised machine-learning-based MVS defined by MVSNet [33] and DeepMVS [9] were the basis for many other works that further improve upon this architecture. They utilize plane hypotheses to compute a variance metric from feature maps extracted by a CNN. The score volume is the output of a neural network based on 3D convolutions. R-MVSNet [34] significantly improves MVSNet [33] in terms of memory consumption by utilizing a recurrent neural network. P-MVSNet [20] learns a confidence for aggregating feature maps from the reference and source images. Recently, AttMVS [21] utilized attention [28] for this task. Other recent works adopt coarse-to-fine schemes to iteratively improve the depth prediction and further reduce memory consumption. CVP-MVSNet [32] continuously refines the result by learning depth residuals to improve upscaled results from coarse resolution levels, similar to CasMVSNet [8]. In order to optimize for efficiency in terms of memory and runtime, Fast-MVSNet [36] learns to refine a sparse depth map; however, this comes at the expense of result quality compared to [32, 8]. Point-MVSNet [4] approaches the refinement problem in 3D space by refining the point cloud in a coarse-to-fine manner. A related approach is SurfaceNet [12], which operates on 3D voxel representations and is further improved by SurfaceNet+ [11], which introduces a novel view-selection approach that can handle sparse MVS configurations. However, methods that work in 3D space are limited in terms of resolution because of increased memory requirements. By performing a coarse-to-fine scheme in 2.5D, [32, 8] achieve a good trade-off between accuracy and computational requirements. However, there is no explicit regularization or refinement of the end result to reduce outliers. MVSCRF [31] employs a CRF as an RNN implementation [38], where the CRF distribution is approximated using a simpler distribution [17, 38]. While [31] is the first architecture to apply CRF regularization for learning-based MVS, we propose additional extensions to improve CRF inference in the MVS setting.

3 Methodology
In the following sections, we describe the implemented hierarchical network architecture incorporating the BP layer [16] in detail. Furthermore, we motivate and explain our contributions, which extend the BP layer [16] for use in the MVS setting.
3.1 Model Overview
Figure 2 caption (beginning truncated): … for the MVS setting. Right: detailed architecture of the matching network integrating the extended BP layer. In the detailed architecture we perform trilinear interpolation to upscale the results from the lower resolution level when computing the input for the final BP layer. In the 3D convolution description, K denotes the kernel size, C the number of output channels and S the stride.
We work on images of size W × H. Following [8, 33], we discretize the search space for the depth at each pixel location into a set of K labels, which correspond to fronto-parallel plane hypotheses. The quality of the depth estimate that each of these planes represents is quantified by a score volume S. One problem with this approach is that the computed scores in S may be inconsistent in areas where occlusions, reflective surfaces or changing light conditions are present in the images, resulting in noisy or wrong depth estimates. Similar to [31], we perform a regularization step on the score volume using a CRF. The function maximized in the CRF of [16] for a given label assignment y is defined as
p(y) = (1/Z) exp( Σ_{i∈V} f_i(y_i) + Σ_{(i,j)∈E} f_{ij}(y_i, y_j) )   (1)
where V is the set of nodes of the graphical model, E is the set of edges and Z is a normalization constant. This function includes a unary score term f_i, as well as pairwise scores f_{ij} which allow the model to penalize inconsistencies between neighbouring pixel locations. The authors of [16] propose a BP layer, which performs inference in this CRF. The BP layer is fully differentiable and can thus be integrated into any labeling-based model. We provide further details on how we extend the BP layer for the MVS setting in Section 3.3.
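As a toy illustration of the inference such a layer performs, the following sketch runs max-sum belief propagation on a 1-D chain, where it is exact; the BP layer of [16] applies analogous sweeps on the image grid, which this sketch does not reproduce:

```python
import numpy as np

def chain_max_sum(unary, pairwise):
    """MAP labeling of a 1-D chain CRF via max-sum belief propagation.

    unary:    (N, L) unary scores f_i(y_i)
    pairwise: (L, L) pairwise scores f_ij(y_i, y_j), shared by all edges
    """
    n, num_labels = unary.shape
    fwd = np.zeros((n, num_labels))  # messages accumulated left-to-right
    bwd = np.zeros((n, num_labels))  # messages accumulated right-to-left
    for i in range(1, n):
        # maximize over the left neighbor's label (rows of pairwise)
        fwd[i] = np.max((unary[i - 1] + fwd[i - 1])[:, None] + pairwise, axis=0)
    for i in range(n - 2, -1, -1):
        # maximize over the right neighbor's label (columns of pairwise)
        bwd[i] = np.max((unary[i + 1] + bwd[i + 1])[None, :] + pairwise, axis=1)
    beliefs = unary + fwd + bwd  # max-marginal score per node and label
    return beliefs.argmax(axis=1)
```

With a pairwise score that penalizes label changes, an outlier node is overruled by its neighbors, which is exactly the smoothing effect exploited in the score volume regularization.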
3.2 Network Architecture
The first step of our MVS network is to extract features from the reference image and each of the corresponding source images. In this stage, we incorporate the multi-level feature extraction CNN of [8], using three resolution levels based on the evaluations of [8]. The output of this CNN are the feature maps of all views. The following steps are executed in each of the three hierarchy levels. The source feature maps are warped according to fronto-parallel planes [8, 33], which yields one warped feature tensor per source view. These warped feature maps are then used to compute the variance between the reference and all warped source feature maps. The resulting variance tensor is used as the input for a matching network utilizing 3D convolutions, which outputs the final score volume S. Following the architecture proposed by [31], the matching network integrates the BP layer as a regularization component, as shown in Figure 2. We utilize three BP layers at different scales in the matching network and apply a temperature softmax [7] to the unary inputs as follows:

f_i(k) = exp(S_i(k) / T) / Σ_l exp(S_i(l) / T)   (2)
using learned temperature parameters T for the respective levels. The input to the BP layer at the highest scale is calculated as a weighted sum with learned weights. Further, we train three different pixelwise pairwise networks using the U-Net architecture described in [16], one for each of the BP layers incorporated into the matching network. We train separate matching networks for each of the hierarchy levels, as it has been shown by [8] that this improves performance. The resulting depth map is then computed as
d(p) = Σ_k S̃(k, p) · D(k, p)   (3)
where S̃ is the softmax-normalized score volume and D is a volume containing the depth hypotheses for each pixel. Following the hierarchical architecture of [8], we compute the hypotheses for the next level using the result from the current level, as described in Section 3.4. This means that the hypothesis volume contains the same labels for every pixel in the first hierarchy level and different labels per pixel in subsequent levels.
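The cost volume construction and soft-argmax depth readout described above can be sketched in NumPy as follows (tensor shapes and function names are our assumptions; the actual implementation uses PyTorch, with the learned 3D convolutions and BP layers between these two steps):

```python
import numpy as np

def variance_volume(ref_feat, warped_src_feats):
    """Variance over the reference and all warped source feature maps.

    ref_feat:         (C, K, H, W) reference features repeated per hypothesis
    warped_src_feats: (V, C, K, H, W) features of V source views, warped
                      onto the K fronto-parallel plane hypotheses
    """
    stack = np.concatenate([ref_feat[None], warped_src_feats], axis=0)
    return stack.var(axis=0)  # (C, K, H, W)

def soft_argmax_depth(scores, depth_hyp):
    """Expected depth under the softmax-normalized score volume.

    scores:    (K, H, W) regularized matching scores
    depth_hyp: (K, H, W) depth hypotheses per pixel and label
    """
    e = np.exp(scores - scores.max(axis=0, keepdims=True))  # stable softmax
    prob = e / e.sum(axis=0, keepdims=True)
    return (prob * depth_hyp).sum(axis=0)  # (H, W)
```

The variance is small where a hypothesis places all views in photometric agreement, and the soft-argmax makes the depth readout differentiable.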
3.3 Extension of the BP layer
We now provide a detailed description of our extensions to the BP layer [16], which normalize CRF label jumps for the MVS setting and allow for fractional jumps in the pairwise score computation. Further, we describe how we automatically compute the depth hypothesis discretization beyond the initial stage. We show the performance improvements of our contributions in Section 5.
3.4 Label jump normalization
We use the differentiable BP layer proposed in [16], employing the max-sum variant of belief propagation. The advantages of using belief propagation as the inference algorithm for the CRF in the matching network are that it is able to model long-range interactions, can be implemented efficiently and is interpretable [16]. In our case, the unary scores f_i defined in Equation (1) are the inputs to the BP layer from the matching CNN after the softmax activation, as shown in Figure 2. The term f_{ij} introduced in Equation (1) represents the pairwise score for a jump from label y_i to label y_j. We apply the BP layer as proposed in [16]. In the standard two-view stereo case, the labels y_i and y_j represent disparities corresponding to horizontal displacements measured in pixels, independent of the scene scale. In the plane-sweeping MVS case, each source and target label in the score volume represents the depth hypothesis of a different fronto-parallel plane, which is dependent on the scene scale. If we want to learn the pairwise score term in the MVS setting, we therefore need to normalize depth jumps first, such that the learned function can be applied at any scale. To this end, we apply a normalization to the source and target depths d_s and d_t, based on the expected 3D error [22, 19, 18] in the depth dimension. For a given depth value d, this error measure is defined as
ε(d) = (σ · d²) / (⟨b⟩ · f)   (4)
where ⟨b⟩ represents the average baseline over all used source images, f is the focal length of the reference camera and σ is a pixelwise uncertainty, which we set to a fixed constant in all experiments. We then normalize our source and target depths using this error measure by
d̃ = d / ε(d) = (⟨b⟩ · f) / (σ · d)   (5)
where the tilde denotes normalized depth values. The normalized difference from one depth to another is then computed as
Δ(d_s, d_t) = d̃_s − d̃_t = (⟨b⟩ · f / σ) · (1/d_s − 1/d_t)   (6)
From Equation (6), we observe that we first invert both depth hypotheses and calculate the difference in inverse depth. Afterwards, we scale the difference by the factor ⟨b⟩f/σ to account for the differing scene scales in the MVS setting. This is also related to how a disparity is calculated from depth in the two-view setting. Thus, we are effectively operating on a scaled inverse depth. Intuitively, calculating the distance in inverse depth also makes sense, because the BP layer relies on the assumption that the gradient in the labels is constant for slanted surfaces, which is not the case when using non-inverted depth maps. By incorporating the error model, this yields an extended depth distance normalization measure compared to [19]. Since the label jumps are now normalized using ε(d), which is independent of the scene scale [19], the pairwise score function is also scale-independent and can be applied in the same manner to any scene. Note that d̃_s and d̃_t are real values and thus not explicit elements of our discrete label set. We tackle this problem by performing linear interpolation when computing the pairwise score.
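A small sketch of the normalization following Equations (4)–(6) as described above (the function and symbol names are ours):

```python
import numpy as np

def expected_depth_error(d, baseline, focal, sigma=1.0):
    """Expected 3D error in the depth dimension, eps(d) = sigma * d^2 / (b * f)."""
    return sigma * d ** 2 / (baseline * focal)

def normalized_jump(d_s, d_t, baseline, focal, sigma=1.0):
    """Scale-independent label jump between two depth hypotheses.

    Equals the difference of scaled inverse depths,
    (b * f / sigma) * (1/d_s - 1/d_t), so the same learned pairwise score
    function applies to scenes of any scale.
    """
    return baseline * focal / sigma * (1.0 / d_s - 1.0 / d_t)
```

By construction, scaling all depths and baselines by a common factor leaves the normalized jump unchanged, and a jump of one ε(d) corresponds to a normalized jump close to one.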
3.5 Pairwise score interpolation
When learning the pairwise score term, we estimate five parameters, similar to [16]: P1± and P2±, which model positive and negative label jumps of magnitude 1 and 2, respectively, and P3 for larger jumps. P1 and P2 can take different values for positive and negative jumps, while P3 has the same value for both directions. Hence, we define the pairwise score function for two labels y_i and y_j by:
f_{ij}(y_i, y_j) =
    0       if y_j − y_i = 0,
    −P1±    if y_j − y_i = ±1,
    −P2±    if y_j − y_i = ±2,
    −P3     otherwise.   (7)
The way Δ is defined in Equation (6) implies that fractional jumps occur for our normalized depths. Consequently, we perform a linear interpolation between our learned discrete parameters. Hence, we define our interpolated pairwise score function as
f̃_{ij}(Δ) = (1 − α) · f_{ij}(⌊Δ⌋) + α · f_{ij}(⌈Δ⌉),   α = Δ − ⌊Δ⌋   (8)
This step can be integrated into the backward pass of the BP layer [16] by following the chain rule and applying
∂L/∂f_{ij}(⌊Δ⌋) = (1 − α) · g,   ∂L/∂f_{ij}(⌈Δ⌉) = α · g   (9)
where g is the incoming gradient as defined in [16]. Intuitively, we distribute the incoming gradient to the two neighboring discrete parameters based on the weights (1 − α) and α, as shown in Equation (9), using the notation of [16]. Performing this interpolation allows us to represent the pairwise score function more accurately in our learning setting, where fractional normalized depth jumps are a common occurrence. We also visualize the pairwise score computation in Figure 3.
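The interpolated pairwise score can be sketched as follows; the exact parameterization (five penalties, zero score at zero jump) follows our reading of the text and may differ in detail from the implementation:

```python
import numpy as np

def pairwise_score(jump, params):
    """Pairwise score for an integer label jump.

    params = (p1_pos, p2_pos, p1_neg, p2_neg, p3): separate penalties for
    positive/negative jumps of 1 and 2, a shared penalty p3 for larger
    jumps, and score 0 for no jump (signs are our assumption).
    """
    p1p, p2p, p1n, p2n, p3 = params
    table = {0: 0.0, 1: -p1p, 2: -p2p, -1: -p1n, -2: -p2n}
    return table.get(int(jump), -p3)

def interpolated_pairwise_score(jump, params):
    """Linearly interpolate the discrete scores for a fractional jump."""
    lo = np.floor(jump)
    alpha = jump - lo
    return (1.0 - alpha) * pairwise_score(lo, params) \
        + alpha * pairwise_score(lo + 1.0, params)
```

In the backward pass, the incoming gradient for a fractional jump is distributed to the two neighboring discrete parameters with the same weights (1 − α) and α.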
3.6 Automatic calculation of depth hypothesis sampling interval
We compute the hypothesis volume for the initial hierarchy level as proposed in [8]. However, the error measure ε(d) described in Section 3.4 allows us to compute the label tensor from the depth estimate after the initial hierarchy level automatically, as opposed to manually defining a scale factor as done in [8]. Since we use ε(d) to normalize our depth hypotheses, we want the resolution of the hypothesis volume to be at least as fine-grained as this normalization factor, such that our pairwise score function can capture the corresponding depth jumps. Hence, from the upscaled depth estimate d̂ of the previous level, we compute our pixelwise intervals as
I_p = ε(d̂_p) = (σ · d̂_p²) / (⟨b⟩ · f)   (10)
Considering the number of depth hypotheses K for a given hierarchy level, we then compute the label tensor for every element as
D(p, k) = d̂_p + (k − ⌊K/2⌋) · I_p,   k ∈ {0, …, K − 1}   (11)
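A sketch of this automatic hypothesis computation, per Equations (10) and (11); centering the K hypotheses on the previous estimate is our assumption:

```python
import numpy as np

def hypothesis_volume(depth_prev, num_hyp, baseline, focal, sigma=1.0):
    """Per-pixel depth hypotheses around an upscaled previous estimate.

    The sampling interval equals the expected 3D error eps(d), so the
    discretization is at least as fine as the jump normalization factor.
    """
    interval = sigma * depth_prev ** 2 / (baseline * focal)  # (H, W)
    offsets = np.arange(num_hyp) - num_hyp // 2              # (K,)
    return depth_prev[None] + offsets[:, None, None] * interval[None]  # (K, H, W)
```

Because the interval grows quadratically with depth, far-away pixels are sampled more coarsely in absolute depth while their normalized label spacing stays constant.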
4 Training
We use the Adam [13] optimizer to train the PyTorch [23] implementation of the network with the Huber loss [10] function

ℓ_δ(r) = ½ r²  if |r| ≤ δ,   δ · (|r| − ½ δ)  otherwise   (12)
on the residuals between the resulting pixelwise depth predictions d and the ground truth depth d*. We use a fixed δ in all of our experiments. The loss in each hierarchy level is calculated as
L_s = Σ_p ℓ_δ(d_p^s − d*_p^s)   (13)
where s is the hierarchy level, and d^s and d*^s represent the estimated and ground truth depth map for the respective level. Following the work of [8], the final network loss is calculated as a weighted sum of the per-level losses. We train our network with a fixed batch size. For our ablation study on the extensions of the BP layer [16], we trained the network on a subset of the DTU training set of [33] for 7 epochs. We evaluate on the full validation set from [33] after every epoch and use the epoch with the lowest error on the smallest threshold for our results in Table 2. For the evaluations of depth maps and point clouds, we trained on the full DTU training set, adding the validation set of [33], for 10 epochs, and then continued to train for 4 epochs with a reduced learning rate. For our experiments on Tanks and Temples [14] and ETH3D [27], we fine-tune the model trained on the full DTU training set using the BlendedMVS [35] training set, training for 7 additional epochs. During training we use a fixed number of source images in addition to the reference image, and a larger number of source images during inference. Further, we provide the used image resolution, the number of hypotheses per level, memory consumption, runtime and fusion parameters for training and inference in Table 1. For the fusion parameters, we first state the number of views that have to satisfy that the forward-backward reprojection error is below a threshold [33]. We use the camera parameters provided by [33] in our experiments.

5 Experiments
In the following sections, we describe our evaluation procedures on the DTU [1], Tanks and Temples [14] and ETH3D low-resolution [27] datasets. Furthermore, we provide an ablation study to quantify the improvements gained by integrating the extended BP layer [16]. Additionally, we evaluate the resulting depth maps on DTU [1] and discuss point cloud results on all datasets.
Table 1: Per-dataset settings: depth hypotheses per level (#H), memory consumption (Mem.), runtime (t) and point cloud fusion parameters (n, τ).

| Dataset | #H | Mem. | t | n, τ |
|---|---|---|---|---|
| DTU [1] train | 96, 32, 8 | 8.9 | 1.3 | - |
| DTU [1] test | 128, 32, 8 | 6.6 | 2.0 | 3, 0.25 |
| Blended [35] train | 96, 32, 8 | 10.9 | 1.6 | - |
| T & T [14] interm. | 96, 32, 8 | 10.9 | 2.7 | 5, 0.50 |
| T & T [14] adv. | 96, 32, 8 | 10.9 | 2.7 | 3, 0.25 |
| ETH3D [27] test | 128, 32, 8 | 3.6 | 1.0 | 3, 0.10 |
Table 2: Ablation study on the DTU validation set: percentage of depth errors above four increasing thresholds e1–e4 (lower is better). norm.: label jump normalization; interp.: pairwise score interpolation; auto: automatic hypothesis interval.

| Method | norm. | interp. | auto | e1 | e2 | e3 | e4 |
|---|---|---|---|---|---|---|---|
| BP-MVSNet | ✓ | | | 24.20 | 14.34 | 9.92 | 6.53 |
| BP-MVSNet | ✓ | ✓ | | 23.77 | 13.53 | 9.02 | 5.85 |
| BP-MVSNet | ✓ | ✓ | ✓ | 23.21 | 14.04 | 9.38 | 5.95 |
| CasMVSNet [8] | | | | 24.51 | 16.18 | 11.93 | 8.41 |
5.1 DTU dataset
The 128 scenes from the DTU dataset [1] capture objects placed on a table at full image resolution. The ground truth is provided by a structured-light scanner. The evaluation metrics for DTU are a completeness metric, which is the average distance from ground truth points to the nearest reconstructed point, and an accuracy metric, which is the average distance from the reconstructed points to the nearest ground truth point. Furthermore, large outliers and points outside the observability mask are filtered [1].

Table 3: Depth map evaluation on DTU: percentage of depth errors above four increasing thresholds e1–e4 (lower is better).

| Method | e1 | e2 | e3 | e4 |
|---|---|---|---|---|
| CVP-MVSNet [32] | 28.99 | 21.78 | 16.16 | 9.48 |
| CasMVSNet [8] | 24.84 | 19.74 | 16.14 | 11.56 |
| BP-MVSNet | 21.79 | 15.88 | 11.57 | 7.70 |
Table 4: Point cloud evaluation on the DTU test set (distances in mm; lower is better).

| Method | overall | acc. | comp. |
|---|---|---|---|
| MVSNet [33] | 0.462 | 0.396 | 0.527 |
| R-MVSNet [34] | 0.422 | 0.385 | 0.459 |
| MVSCRF [31] | 0.398 | 0.371 | 0.426 |
| P-MVSNet [20] | 0.420 | 0.406 | 0.434 |
| Fast-MVSNet [36] | 0.370 | 0.336 | 0.403 |
| CasMVSNet [8] | 0.355 | 0.325 | 0.385 |
| AttMVS [21] | 0.356 | 0.383 | 0.329 |
| CVP-MVSNet [32] | 0.351 | 0.296 | 0.406 |
| BP-MVSNet (ours) | 0.327 | 0.333 | 0.320 |
Table 5: F-scores on the Tanks and Temples [14] intermediate and advanced sets (higher is better); "-" marks methods without published results.

| Method | F (interm.) | Fam. | Franc. | Horse | Light. | M60 | Panther | Playg. | Train | F (adv.) | Audit. | Ballr. | Courtr. | Museum | Palace | Temple |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| COLMAP [26] | 42.14 | 50.41 | 22.25 | 25.63 | 56.43 | 44.83 | 46.97 | 48.53 | 42.04 | 27.24 | 16.02 | 25.23 | 34.70 | 41.51 | 18.05 | 27.94 |
| ACMM [29] | 57.27 | 69.24 | 51.45 | 46.97 | 63.20 | 55.07 | 57.64 | 60.08 | 54.48 | 34.02 | 23.41 | 32.91 | 41.17 | 48.13 | 23.87 | 34.60 |
| PCF-MVS [19] | 55.88 | 70.99 | 49.60 | 40.34 | 63.44 | 57.79 | 58.91 | 56.59 | 49.40 | 35.69 | 28.33 | 38.64 | 35.95 | 48.36 | 26.17 | 36.69 |
| MVSNet [33] | 43.48 | 55.99 | 28.55 | 25.07 | 50.79 | 53.96 | 50.86 | 47.90 | 34.69 | - | - | - | - | - | - | - |
| R-MVSNet [34] | 48.40 | 69.96 | 46.65 | 32.59 | 42.95 | 51.88 | 48.80 | 52.00 | 42.38 | 24.91 | 12.55 | 29.09 | 25.06 | 38.68 | 19.14 | 24.96 |
| MVSCRF [31] | 45.73 | 59.83 | 30.60 | 29.93 | 51.15 | 50.61 | 51.45 | 52.60 | 39.68 | - | - | - | - | - | - | - |
| P-MVSNet [20] | 55.62 | 70.04 | 44.64 | 40.22 | 65.20 | 55.08 | 55.17 | 60.37 | 54.29 | - | - | - | - | - | - | - |
| Fast-MVSNet [36] | 47.39 | 65.18 | 39.59 | 34.98 | 47.81 | 49.16 | 46.20 | 53.27 | 42.91 | - | - | - | - | - | - | - |
| AttMVS [21] | 60.05 | 73.90 | 62.58 | 44.08 | 64.88 | 56.08 | 59.39 | 63.42 | 56.06 | 31.93 | 15.96 | 27.71 | 37.99 | 52.01 | 29.07 | 28.84 |
| CVP-MVSNet [32] | 54.03 | 76.50 | 47.74 | 36.34 | 55.12 | 57.28 | 54.28 | 57.43 | 47.54 | - | - | - | - | - | - | - |
| CasMVSNet [8] | 56.84 | 76.37 | 58.45 | 46.26 | 55.81 | 56.11 | 54.06 | 58.18 | 49.51 | 31.12 | 19.81 | 38.46 | 29.10 | 43.87 | 27.36 | 28.11 |
| BP-MVSNet (ours) | 57.60 | 77.31 | 60.90 | 47.89 | 58.26 | 56.00 | 51.54 | 58.47 | 50.41 | 31.35 | 20.44 | 35.87 | 29.63 | 43.33 | 27.93 | 30.91 |
Table 6: F-scores on the ETH3D low-resolution many-view benchmark [27] (higher is better); "-" marks methods without published results.

| Method | F (test) | lake. | sandb. | stor. | stor. 2 | tunnel | F (train) | deliv. | electro | forest | playgr. | terrains |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACMM [29] | 55.01 | 59.60 | 66.07 | 35.89 | 50.48 | 63.01 | 55.12 | 38.65 | 61.75 | 60.21 | 43.87 | 71.11 |
| PCF-MVS [19] | 57.06 | 66.85 | 62.36 | 43.32 | 52.89 | 59.86 | 57.32 | 48.61 | 56.36 | 67.24 | 43.68 | 70.70 |
| COLMAP [26] | 52.32 | 56.18 | 61.09 | 38.61 | 46.28 | 59.41 | 49.91 | 37.30 | 52.31 | 61.83 | 31.91 | 66.23 |
| MVSNet [33] | 38.33 | 40.49 | 57.57 | 20.52 | 33.47 | 39.63 | - | - | - | - | - | - |
| R-MVSNet [34] | 36.87 | 42.00 | 46.99 | 24.73 | 34.83 | 35.83 | - | - | - | - | - | - |
| P-MVSNet [20] | 44.46 | 49.27 | 49.30 | 34.35 | 39.83 | 49.54 | - | - | - | - | - | - |
| MVSCRF [31] | 28.32 | 32.16 | 44.37 | 14.66 | 21.69 | 28.73 | - | - | - | - | - | - |
| AttMVS [21] | 45.85 | 49.36 | 51.75 | 34.83 | 43.70 | 49.63 | - | - | - | - | - | - |
| CasMVSNet [8] | 44.49 | 56.38 | 64.76 | 18.64 | 31.23 | 51.43 | 49.00 | 34.83 | 53.10 | 66.82 | 28.93 | 61.32 |
| BP-MVSNet (ours) | 43.22 | 52.86 | 46.16 | 27.25 | 36.92 | 52.94 | 50.87 | 40.10 | 49.76 | 63.03 | 34.23 | 67.21 |
5.1.1 Ablation study
In our ablation study, we evaluate the impact of our extensions to the BP layer [16] on the performance of BP-MVSNet by comparing error metrics computed on depth maps from the DTU validation set of [33]. The error metrics we use are the percentages of errors larger than a set of increasing thresholds (in mm) between the ground truth depth map and the estimated depth map. We can observe in Table 2 that adding the BP layer with normalized depth jumps improves all of the error metrics compared to the CasMVSNet [8] baseline. Enabling the interpolation step for pairwise scores further improves all of the metrics. This means that the combination of these two contributions to the BP layer improves its performance significantly in the MVS setting. Finally, we also include the automatic computation of the label volume discretization beyond the initial hierarchy level. This yields the lowest error on the smallest threshold, and thus we use this model for all further experiments. The slight error increase at bigger thresholds is related to the dependence of the hypothesis sampling interval on the previous depth estimate: if the previous estimate is wrong, a larger fixed interval can potentially improve results for bigger error thresholds.
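The threshold metrics used in this ablation can be computed as follows (the handling of pixels without ground truth via an optional mask is our assumption):

```python
import numpy as np

def threshold_errors(d_est, d_gt, thresholds, valid=None):
    """Percentage of pixels whose absolute depth error exceeds each threshold.

    valid: optional boolean mask selecting pixels with ground truth.
    """
    err = np.abs(d_est - d_gt)
    if valid is not None:
        err = err[valid]
    return [100.0 * float((err > t).mean()) for t in thresholds]
```

A single depth map thus yields one percentage per threshold, matching the columns reported in Tables 2 and 3.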
5.1.2 Evaluation on depth maps
Table 3 shows quantitative results on depth maps, using our proposed model BP-MVSNet trained on DTU as described in Section 4. We compare with CasMVSNet [8] and CVP-MVSNet [32] using their respective pretrained models at the same resolution with default parameter settings. This comparison allows us to measure the quality of the depth maps before performing the point cloud fusion, as even identical depth maps can yield different results in this stage depending on the fusion parameters. It can be seen that BP-MVSNet improves significantly on all of the measured error metrics compared to the other methods. This can also be observed in Figure 4, where we show that the extended BP layer is able to correctly regularize even over large areas containing inconsistent estimates, such as the top of the box in the first image. However, it can also be seen that very poor unary estimates from the matching network can lead to artifacts in some background regions.
5.1.3 Evaluation on point clouds
For the point cloud fusion step performed for the evaluations on point clouds, we utilize the method of [33]. The fusion parameters are set according to Table 1, where τ denotes the forward-backward reprojection error threshold and n is the number of images which have to be consistent with respect to τ. Additionally, we set the maximum relative depth difference parameter for the fusion of [33] to a fixed value. In Table 4, we provide the results of the DTU point cloud evaluation [1] on the test set of [33]. We compare our results with other state-of-the-art works for learning-based MVS. It can be seen that BP-MVSNet outperforms all other methods in terms of the overall score and completeness. In the point cloud results visualized in Figure 5, one can see that the extended BP layer [16] was able to regularize the depth maps such that we get complete results, even on untextured and reflective surfaces, as seen on the coffee can.
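The forward-backward reprojection check underlying this fusion can be sketched for a single pixel as follows (the camera conventions, helper names and the pointwise source depth lookup are our assumptions):

```python
import numpy as np

def backproject(K, u, v, d):
    """Lift pixel (u, v) with depth d to a 3D point in camera coordinates."""
    return d * np.linalg.inv(K) @ np.array([u, v, 1.0])

def project(K, X):
    """Project a 3D camera-space point to pixel coordinates and depth."""
    x = K @ X
    return x[0] / x[2], x[1] / x[2], x[2]

def forward_backward_error(K_ref, K_src, R, t, u, v, d_ref, d_src_lookup):
    """Round-trip reprojection error in pixels for one reference pixel.

    R, t map reference-camera coordinates to source-camera coordinates;
    d_src_lookup(u, v) returns the source depth at the projected location.
    """
    X_ref = backproject(K_ref, u, v, d_ref)
    us, vs, _ = project(K_src, R @ X_ref + t)          # forward projection
    X_src = backproject(K_src, us, vs, d_src_lookup(us, vs))
    ub, vb, _ = project(K_ref, R.T @ (X_src - t))      # back into the reference view
    return float(np.hypot(ub - u, vb - v))
```

A pixel is kept if this error stays below τ for at least n source views, which is exactly the consistency criterion parameterized in Table 1.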
5.2 Tanks and Temples dataset
The Tanks and Temples [14] dataset consists of real-world outdoor and indoor scenes capturing objects, buildings and rooms, where the ground truth has been acquired by a laser scanner. This dataset includes many challenging scenes in which reflections and occlusions are present. In this section, we compare the results of BP-MVSNet on the Tanks and Temples [14] benchmark with other state-of-the-art methods. The metrics used in the Tanks and Temples [14] dataset are a precision score P, which is the percentage of reconstructed points within a distance threshold of the nearest ground truth point, and a recall R, which gives the percentage of ground truth points whose closest distance to a reconstructed point is below the threshold. The F-score [14] is then computed as the harmonic mean F = 2PR / (P + R).
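The F-score combination can be expressed directly:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the F-score rewards reconstructions that are balanced: a method that is very precise but incomplete (or vice versa) scores low.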
5.2.1 Benchmark results
In Table 5, we present the results of BP-MVSNet on the Tanks and Temples benchmark [14]. We also provide the fusion parameters in Table 1. For the Horse dataset of the intermediate scenes, we increased τ to 1.0, as the base of the statue contains many reflections. It can be observed that BP-MVSNet achieves competitive results, outperforming the base architecture [8] in terms of the mean F-score on both the intermediate and advanced sets. In Figure 5, we visualize the resulting point clouds. It can be observed that even difficult scenes containing surfaces that reflect the sky, such as the base of the statue, are quite complete. Furthermore, as seen in Figure 5, smaller details such as the rails on the train are preserved.
5.3 ETH3D dataset
The ETH3D low-resolution [27] dataset consists of outdoor and indoor scenes at varying locations, from a forest to a storage room. Similar to Tanks and Temples [14], the fused point clouds are evaluated against a laser-scanner ground truth in terms of accuracy, completeness and F-score.
5.3.1 Benchmark results
In Table 6, we provide results for the ETH3D low-resolution many-view [27] benchmark. It can be seen that we achieve competitive results among learning-based methods such as MVSNet [33] and CasMVSNet [8], even though we did not train our network on the training set images of this dataset, as described in Section 4. However, we can also observe that all of the learning-based methods are outperformed by the traditional methods PCF-MVS [19] and ACMM [29] on this dataset. Compared to the base architecture CasMVSNet [8], we achieve better scores on some scenes such as storage room and terrains, while the score on others such as sandbox or electro is lower, which results in a slightly worse result on the test set and a slightly better one on the training set.
6 Conclusion
In this work, we have proposed BP-MVSNet, a CNN-based MVS system employing a CRF regularization layer based on belief propagation [16]. In order to optimize the performance of the BP layer [16] for the MVS setting, we have made three core contributions: (i) utilizing a scale-agnostic term for normalizing label jumps, (ii) implementing a differentiable interpolation step in the pairwise score computation and (iii) automatically choosing the discretization of the hypothesis volume after the initial stage. These contributions improve the performance of the baseline architecture [8], as seen in our quantitative results presented in Section 5, where we achieve state-of-the-art results on both the DTU [1] and Tanks and Temples [14] datasets. Future work could involve the inclusion of additional information, such as surface normals, as further guidance for the BP layer. Acknowledgement: This work was supported by Pro²Future (FFG, Contract No. 854184).
References

[1]
(2016)
Largescale data for multipleview stereopsis.
International Journal of Computer Vision
, pp. 1–16. Cited by: §1, Figure 5, §5.1.3, §5.1, Table 1, Table 3, Table 4, §5, §6.  [2] (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In SIGGRAPH, Cited by: §2.1.
 [3] (2011) PatchMatch stereo  stereo matching with slanted support windows.. In British Machine Vision Conference, Cited by: §2.1.
 [4] (2019) Pointbased multiview stereo network. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
 [5] (2015) Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.

[6]
(2007)
Realtime planesweeping stereo with multiple sweeping directions.
In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, Cited by: §2.1.  [7] (2016) Deep learning. The MIT Press. External Links: ISBN 0262035618 Cited by: §3.2.
 [8] (202006) Cascade cost volume for highresolution multiview stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1, §1, §2.2, §3.1, §3.2, §3.6, §4, Figure 4, §5.1.1, §5.1.2, §5.2.1, §5.3.1, Table 2, Table 3, Table 4, Table 5, Table 6, §6.
 [9] (2018) DeepMVS: learning multiview stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
 [10] (196403) Robust estimation of a location parameter. Ann. Math. Statist. 35 (1), pp. 73–101. External Links: Document, Link Cited by: §4.
 [11] (2020) SurfaceNet+: an endtoend 3d neural network for very sparse multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. Cited by: §2.2.
 [12] (2017) SurfaceNet: an endtoend 3d neural network for multiview stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2307–2315. Cited by: §2.2.
 [13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
 [14] (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36 (4). Cited by: §1, §4, Figure 5, §5.2.1, §5.2, §5.3, Table 1, Table 5, §5, §6.
 [15] (2017) End-to-End Training of Hybrid CNN-CRF Models for Stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 [16] (2020) Belief propagation reloaded: learning BP-layers for labeling problems. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: BP-MVSNet: Belief-Propagation-Layers for Multi-View-Stereo, §1, §1, §2.1, Figure 2, §3.1, §3.2, §3.3, §3.4, §3.5, §3, §4, §5.1.1, §5.1.3, §5, §6.
 [18] (2014) A TV prior for highquality scalable multiview stereo reconstruction. In International Conference on 3D Vision (3DV), Cited by: §3.4.
 [19] (2019) Plane completion and filtering for multiview stereo reconstruction. In German Conference on Pattern Recognition (GCPR), Cited by: §2.1, §3.4, §5.3.1, Table 5, Table 6.
 [20] (2019) Pmvsnet: learning patchwise matching confidence aggregation for multiview stereo. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2, Table 4, Table 5, Table 6.
 [21] (202006) Attentionaware multiview stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, Table 4, Table 5, Table 6.
 [22] (200006) Practical structure and motion from stereo when motion is unconstrained. International Journal of Computer Vision 39, pp. . External Links: Document Cited by: §3.4.
 [23] (2019) PyTorch: an imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dÁlchéBuc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.
 [24] (2019) TAPAMVS: texturelessaware PatchMatch multiview stereo. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
 [25] (2007) Learning conditional random fields for stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 [26] (2016) Pixelwise view selection for unstructured multiview stereo. In European Conference on Computer Vision (ECCV), Cited by: §2.1, §2.1, Table 5, Table 6.
 [27] (2017) A multiview stereo benchmark with highresolution images and multicamera videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4, §5.3.1, §5.3, Table 1, Table 6, §5.
 [28] (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §2.2.
 [29] (2019) Multiscale geometric consistency guided multiview stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §5.3.1, Table 5, Table 6.
 [30] (202006) MARMVS: matching ambiguity reduced multiple view stereo for efficient large scale scene reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
 [31] (2019) MVSCRF: learning multiview stereo with conditional random fields. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, §2.2, §3.1, §3.2, Table 4, Table 5, Table 6.
 [32] (202006) Cost volume pyramid based depth inference for multiview stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §5.1.2, Table 3, Table 4, Table 5.
 [33] (2018) MVSNet: depth inference for unstructured multiview stereo. European Conference on Computer Vision (ECCV). Cited by: Figure 1, §1, §1, §2.2, §3.1, §3.2, §4, §5.1.1, §5.1.3, §5.3.1, Table 2, Table 3, Table 4, Table 5, Table 6.
 [34] (2019) Recurrent mvsnet for highresolution multiview stereo depth inference. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2, Table 4, Table 5, Table 6.
 [35] (2020) BlendedMVS: a largescale dataset for generalized multiview stereo networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4, Table 1.
 [36] (2020) Fastmvsnet: sparsetodense multiview stereo with learned propagation and gaussnewton refinement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, Table 4, Table 5.
 [37] (2014) PatchMatch based joint view selection and depthmap estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
 [38] (2015) Conditional random fields as recurrent neural networks. IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781467383912 Cited by: §2.2.