The goal of a dense 3D reconstruction system is to first estimate dense depth maps for a set of overlapping input images capturing an arbitrary scene using multi-view stereo (MVS). These dense depth maps are then fused into a dense point cloud. Thus, for each given input image, we want to compute a depth estimate for every pixel. Additional inputs of an MVS pipeline are the camera pose of each image and the camera calibration, which are obtained using a Structure-from-Motion (SfM) pipeline. Following MVSNet , we choose a reference view and a set of source images that substantially overlap the reference image in terms of the viewed geometry. Analogous to the two-view stereo case, we need to search along the epipolar lines in all source images using a similarity measure to find the correct depth value.
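The epipolar search over depth hypotheses is commonly realized by warping source views with plane-induced homographies. The following is a minimal, hypothetical sketch of this idea (function names, calibration values and the camera convention are our own, not taken from the paper):

```python
import numpy as np

def fronto_parallel_homography(K_ref, K_src, R, t, depth):
    """Homography induced by the fronto-parallel plane at the given depth,
    mapping reference pixels into a source view (illustrative sketch; the
    exact convention depends on how R, t are defined)."""
    n = np.array([[0.0, 0.0, 1.0]])                 # plane normal in the reference frame
    H = K_src @ (R - (t.reshape(3, 1) @ n) / depth) @ np.linalg.inv(K_ref)
    return H / H[2, 2]                              # normalize the projective scale

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])                       # 10 cm horizontal baseline
H = fronto_parallel_homography(K, K, R, t, depth=2.0)

# For a purely horizontal baseline, the principal-point pixel is shifted
# horizontally by f * b / d = 500 * 0.1 / 2 = 25 px in magnitude,
# exactly as in the two-view stereo case.
p = H @ np.array([320.0, 240.0, 1.0])
```

Evaluating such a homography per depth hypothesis and per source view yields the warped feature maps the score volume is built from.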
Traditional MVS methods usually use raw color information or hand-crafted features to describe local image patches. These features are then used with a traditional similarity measure, such as normalized cross correlation (NCC), to find corresponding points. However, color information alone is often ambiguous and therefore a suboptimal choice for comparing patches. Therefore,  proposed to learn optimal features for matching using a CNN and to also replace the traditional similarity measure with a learned one. However, the authors of  do not use any smoothness assumption on the resulting depth maps, which is commonly employed in canonical two-view stereo [25, 15]. The authors of  observed this and were the first to suggest that the depth maps should also be regularized with a CRF in the MVS setting. However, the CRF inference in  has not been adapted to the MVS setting by explicitly handling different scales. Further, the employed mean-field approximation is highly dependent on the number of iterations. To overcome these limitations, we propose to extend the differentiable BP layer of  to enable its applicability in the MVS setting. The BP layer is a fully differentiable CRF inference layer that can deliver high-quality results after a single iteration. We summarize our core contributions in the following paragraph.
We propose the end-to-end learnable BP-MVSNet, which explicitly exploits prior knowledge of the MVS task via a CRF. To this end, we use the BP layer  and extend it to meet the requirements of an MVS method. In particular, we propose three extensions that correspond to our core contributions: We (i) propose a scale-independent normalization method to enable the handling of label jumps at different scene scales, (ii) add support for fractional label jumps in the CRF using a differentiable interpolation step that is fully integrated into learning and (iii) propose a method to automatically calculate the sampling interval of the plane hypotheses beyond the initial stage. Our modifications described in Sections 3.4 and 3.5 required us to extend the forward path of . Consequently, we also provide the required gradients to learn the parameters of the employed pairwise score function in the backward path. Thus, we are able to seamlessly integrate the BP layer into a learnable MVS network. We significantly outperform the baseline CasMVSNet  and achieve state-of-the-art results on the DTU  and Tanks and Temples  benchmarks.
2 Related Work
We group the related work for MVS into traditional approaches and CNN-based solutions.
2.1 Traditional MVS
Traditional MVS systems typically employ a photometric similarity measure, such as bilateral weighted NCC , for evaluating different depth hypotheses. A possible technique for generating the hypotheses is the plane-sweeping MVS  approach, where the depth hypotheses for each pixel are computed by sampling a number of candidate planes in the scene. Another way of generating new depth hypotheses is via the PatchMatch [2, 3] algorithm, where a sampling scheme is used to propagate depth hypotheses across the image. This algorithm has also been adapted and extended for the MVS case [37, 26].
The work of  utilizes checkerboard-based propagation to further reduce runtime and is combined with a coarse-to-fine scheme by . Furthermore, [26, 29] use a forward-backward reprojection error as an additional error term for the PatchMatch estimation. MARMVS  additionally estimates the optimal patch scale to reduce matching ambiguities. While these methods generally perform well on a variety of datasets and can deal with high-resolution images, their traditional similarity measures severely limit them in scenarios with reflective surfaces, occlusions and strong lighting changes. Recent works [24, 19] try to complete these missing areas using explicit plane priors. However, following , we propose a CRF-based regularization  of the score volume for general scenarios, which helps the system recover correct depth measurements without relying on assumptions about the scene structure.
2.2 Learning-based MVS
The concepts for supervised, learning-based MVS introduced by MVSNet  and DeepMVS  form the basis for many subsequent works that further improve this architecture. These methods utilize plane hypotheses to compute a variance metric from feature maps extracted by a CNN; the score volume is then produced by a 3D-convolution-based neural network. RMVSNet  significantly improves upon MVSNet in terms of memory consumption by utilizing a recurrent neural network. PMVSNet  learns a confidence for aggregating feature maps from the reference and source images. Recently, AttMVS  utilized attention  for this task.
Other recent works adopt coarse-to-fine schemes to iteratively improve the depth prediction and to further reduce memory consumption. CVP-MVSNet  continuously refines the result by learning depth residuals to improve upscaled results from coarse resolution levels, similar to CasMVSNet . In order to optimize for efficiency in terms of memory and runtime, FastMVSNet  learns to refine a sparse depth map; however, this comes at the expense of result quality compared to [32, 8]. PointMVSNet  approaches the refinement problem in 3D space by refining the point cloud in a coarse-to-fine manner. A related approach is SurfaceNet , which operates on 3D voxel representations and is further improved by SurfaceNet+ , which introduces a novel view selection approach that can handle sparse MVS configurations. However, methods operating in 3D space are limited in resolution because of their increased memory requirements. By performing a coarse-to-fine scheme in 2.5D, [32, 8] achieve a good trade-off between accuracy and computational requirements. However, no explicit regularization or refinement of the end result is present to reduce outliers. MVSCRF  employs a CRF-as-RNN implementation , where the CRF distribution is approximated using a simpler distribution [17, 38]. While  is the first architecture to apply CRF regularization to learning-based MVS, we propose additional extensions to improve CRF inference for the MVS setting.
In the following sections we describe the implemented hierarchical network architecture incorporating the BP-layer  in detail. Furthermore, we motivate and explain our contributions, which extend the BP-layer  for use in the MVS setting.
3.1 Model Overview
[Figure caption] Left: overview of the network for the MVS setting. Right: detailed architecture of the matching network integrating the extended BP-layer. In the detailed architecture, we perform trilinear interpolation to upscale the results from the lower resolution level when computing the input for the final BP-layer. In the 3D convolution description, K denotes the kernel size, C the number of output channels and S the stride.
We work on images of size . Following [8, 33], we discretize the depth search space at each pixel location into a set of labels , which correspond to fronto-parallel plane hypotheses. The quality of the depth estimate that each of these planes represents is quantified by a score volume . One problem with this approach is that the computed scores in may be inconsistent in areas where occlusions, reflective surfaces and changing light conditions are present in the images, resulting in noisy or wrong depth estimates. Similar to , we perform a regularization step on the score volume using a CRF. The function maximized in the CRF of  for a given label assignment is defined as
where is the set of nodes of the graphical model, is the set of edges and is a normalization constant. This function includes a unary score term , as well as pairwise scores which allow the model to penalize inconsistencies between neighbouring pixel locations. The authors of  propose a BP-layer, which performs inference in the CRF. The BP-layer is fully differentiable and can thus be integrated into any labeling based model. We provide further details on how we extended the BP-layer for the MVS setting in Section 3.3.
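To make the CRF objective concrete, the following minimal sketch (our own toy example, not the paper's implementation or the BP inference itself) evaluates the score of a label assignment on a 4-connected grid, with a truncated-linear pairwise term as an assumed example:

```python
import numpy as np

def crf_score(unary, labeling, pairwise):
    """Score of one label assignment on a 4-connected grid CRF: sum of the
    unary scores plus pairwise scores over horizontal and vertical edges."""
    H, W, L = unary.shape
    s = sum(unary[i, j, labeling[i, j]] for i in range(H) for j in range(W))
    for i in range(H):
        for j in range(W):
            if j + 1 < W:   # horizontal edge
                s += pairwise(labeling[i, j], labeling[i, j + 1])
            if i + 1 < H:   # vertical edge
                s += pairwise(labeling[i, j], labeling[i + 1, j])
    return s

# Truncated-linear pairwise score: smooth labelings score higher.
pairwise = lambda a, b: -min(abs(a - b), 2)
unary = np.zeros((2, 2, 3))
flat = np.zeros((2, 2), dtype=int)              # constant labeling: no penalty
noisy = np.array([[0, 2], [2, 0]])              # large jumps on every edge
```

With zero unaries, the constant labeling scores 0 while the noisy one is penalized on all four edges; inference searches for the assignment maximizing this score.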
3.2 Network Architecture
The first step of our MVS network is to extract features from the reference image and each of the corresponding source images. In this stage, we incorporate the multi-level feature extraction CNN of , which extracts feature maps. We use three resolution levels based on the evaluations of . The outputs of this CNN are the feature maps . The following steps are executed at each of the three hierarchy levels. The source feature maps are warped according to fronto-parallel planes [8, 33], which yields a tensor as the output. These warped feature maps are then used to compute the variance between the reference and all warped source feature maps. The resulting variance tensor is used as the input for a matching network utilizing 3D convolutions, which outputs the final score volume . Following the architecture proposed by , the matching network integrates the BP-layer as a regularization component, as shown in Figure 2. We utilize three BP-layers at different scales in the matching network and apply a temperature softmax  to the unary inputs as follows
using learned parameters , ,  for the respective levels. The input to the BP-layer at the highest scale is calculated as a weighted sum with learned weights ,  and . Further, we train three different pixel-wise pairwise networks using the UNet architecture described in , one for each of the BP-layers incorporated into the matching network. We train separate matching networks for each of the hierarchy levels, as it has been shown by  that this improves performance. The resulting depth map is then computed as
where is the softmax-normalized score volume and is a volume containing the depth hypotheses for each pixel. Following the hierarchical architecture of , we compute the hypotheses for the next level using the result from the current level, as described in Section 3.4. This means that the hypothesis volume contains the same labels for each pixel in the first hierarchy level and different labels per pixel in subsequent levels.
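The two steps above, temperature softmax over the label dimension followed by a probability-weighted depth (soft argmax), can be sketched as follows (a minimal illustration with made-up scores; the learned temperatures and weights of the paper are not reproduced):

```python
import numpy as np

def temperature_softmax(scores, T):
    """Temperature-scaled softmax over the label dimension (axis 0);
    smaller T sharpens the distribution, larger T flattens it."""
    z = (scores - scores.max(axis=0, keepdims=True)) / T   # stabilized
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def expected_depth(prob, hyp):
    """Depth as the probability-weighted mean over the per-pixel depth
    hypotheses (soft argmax along the label dimension)."""
    return (prob * hyp).sum(axis=0)

# One pixel, 4 depth hypotheses with the best score at 2.0 m.
hyp = np.array([1.0, 2.0, 3.0, 4.0]).reshape(4, 1, 1)
scores = np.array([0.0, 3.0, 1.0, 0.0]).reshape(4, 1, 1)
prob = temperature_softmax(scores, T=1.0)
d = expected_depth(prob, hyp)   # close to, but not exactly, 2.0
```

The soft argmax keeps the depth estimate differentiable, which is what allows the whole pipeline, including the BP-layers, to be trained end-to-end.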
3.3 Extension of the BP-layer
We now provide a detailed description of our extensions to the BP-layer , which normalize CRF label jumps for the MVS setting and allow for fractional jumps in the pairwise score computation. Further, we describe how we automatically compute the depth hypothesis discretization beyond the initial stage. We show the performance improvements of our contributions in Section 5.
3.4 Label jump normalization
We use the differentiable BP-layer proposed in , employing the max-sum variant of belief propagation. The advantages of using belief propagation as the inference algorithm for the CRF in the matching network are that it can model long-range interactions, can be implemented efficiently and is interpretable . In our case, the unary scores defined in Equation (1) are the inputs to the BP-layer from the matching CNN after softmax activation, as shown in Figure 2. The term introduced in Equation (1) represents the pairwise score from label to label . We apply the BP-layer as proposed in . In the standard two-view stereo case, the labels and represent disparities, i.e. horizontal displacements measured in pixels, independent of the scene scale. In the plane-sweeping MVS case, each source and target label in the score volume represents the depth hypothesis of a different fronto-parallel plane, which does depend on the scene scale. If we want to learn the pairwise score term in the MVS setting, we therefore need to normalize depth jumps first, such that the score function can be applied at any scale. To this end, we apply a normalization to and . We define , based on the expected 3D error [22, 19, 18] in the depth dimension. For a given depth value , this error measure is defined as
where represents the average baselines from all used source images, is the focal length of the reference camera and is a pixelwise uncertainty which we set to in all experiments. We then normalize our source and target depths using this error measure by
where the tilde denotes normalized depth values. The normalized difference from one depth to another is then computed as
Using Equation (6), we observe that we first invert both depth hypotheses and calculate the difference in inverse depth. Afterwards, we scale the difference by a factor of to account for the differing scene scales in the MVS setting. This is also related to how a disparity in the two-view setting is calculated from depth; thus, we are effectively operating on a scaled inverse depth. Intuitively, calculating the distance in inverse depth also makes sense, as the BP-layer relies on the assumption that the label gradient is constant on slanted surfaces, which does not hold for non-inverted depth maps. By incorporating the error model, this yields an extended depth distance normalization measure compared to . Since the label jumps are now normalized using , which is independent of the scene scale, the pairwise score function is also scale-independent and can be applied in the same manner for any scene. Note that and are real-valued and thus not necessarily contained in our label set . We tackle this problem by performing linear interpolation when computing the pairwise score.
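The normalization idea can be illustrated with the standard first-order triangulation error model; the functions below are our own plausible instantiation under that assumption, not the paper's exact constants:

```python
def expected_depth_error(d, baseline, focal, delta=1.0):
    """Expected 3D error in the depth dimension for a triangulated point
    (first-order model; delta is the pixel-level matching uncertainty)."""
    return delta * d ** 2 / (baseline * focal)

def normalized_jump(d_src, d_tgt, baseline, focal, delta=1.0):
    """Scale-independent label jump: difference of inverse depths, scaled
    by b*f/delta, so one unit corresponds to roughly one expected-error
    step regardless of scene scale."""
    scale = baseline * focal / delta
    return scale * (1.0 / d_src - 1.0 / d_tgt)

# A 1 cm jump at 1 m in a small scene and a 1 m jump at 100 m in a large
# scene map to (roughly) the same normalized jump.
small = normalized_jump(1.0, 1.01, baseline=0.1, focal=500.0)
large = normalized_jump(100.0, 101.0, baseline=10.0, focal=500.0)
```

Because the normalized jump is dimensionless, a single learned pairwise score function can be shared across scenes of very different physical extent.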
3.5 Pairwise score interpolation
When learning the pairwise score term , we estimate 5 parameters, similar to . These model positive and negative label jumps of quantity and , respectively. and can differ for positive and negative jumps, while has the same value for both. Hence, we define the pairwise score function for two labels and by:
The way is defined in Equation (6) also implies that there are fractional jumps for our normalized depths. Consequently, we perform a linear interpolation between our learned discrete parameters. Hence, we define our interpolated pairwise score function as
This step can be integrated into the backward path of the BP-layer  by following the chain rule and applying
where is the incoming gradient as defined in . Intuitively, we distribute the incoming gradient based on the weights and as shown in Equation (9) using the notation of . Performing this interpolation allows us to more accurately represent the pairwise score function for our learning setting, where fractional normalized depth jumps are a common occurrence. We also visualize the pairwise score computation in Figure 3.
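A sketch of the interpolated pairwise score follows; the five parameter names and the placement of the knots at jumps 0, ±1 and beyond ±2 are our own assumptions for illustration:

```python
import numpy as np

def pairwise_score(jump, theta0, theta1p, theta1n, theta2p, theta2n):
    """Pairwise score for a (possibly fractional) normalized depth jump,
    linearly interpolated between learned scores at jumps 0, +/-1 and
    saturating beyond |2| (parameter naming is ours; the paper learns
    5 such values)."""
    j = np.clip(jump, -2.0, 2.0)              # larger jumps share one score
    knots_x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    knots_y = np.array([theta2n, theta1n, theta0, theta1p, theta2p])
    return np.interp(j, knots_x, knots_y)

# Fractional jump of 0.4 interpolates between the jump-0 and jump-1 scores:
# 0.6 * 0.0 + 0.4 * (-1.0) = -0.4.
s = pairwise_score(0.4, theta0=0.0, theta1p=-1.0, theta1n=-1.5,
                   theta2p=-3.0, theta2n=-4.0)
```

During backpropagation, the incoming gradient is distributed onto the two neighboring parameters with the same interpolation weights, which is exactly what makes the fractional jumps learnable.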
3.6 Automatic calculation of depth hypothesis sampling interval
We compute the hypothesis volume for the initial hierarchy level as proposed in . However, the error measure described in Section 3.4 allows us to automatically compute the label tensor from the depth estimate after the initial hierarchy level, as opposed to manually defining a scale factor as done in . We use to normalize our depth hypotheses; thus, we want the resolution of the hypothesis volume to be at least as fine-grained as this normalization factor, such that our pairwise score function can capture the corresponding depth jumps. Hence, from the upscaled depth estimate of the initial level, we compute our pixelwise intervals as
When considering the number of depth hypotheses for a given hierarchy level, we then compute the label tensor for every element as
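The construction above, hypotheses centered on the upscaled previous estimate and spaced by the expected error at that depth, can be sketched as follows (a hedged illustration using the first-order error model; names and constants are ours):

```python
import numpy as np

def build_hypotheses(depth_prev, num_hyp, baseline, focal, delta=1.0):
    """Per-pixel depth hypotheses for the next hierarchy level, centered on
    the upscaled previous estimate and spaced by the expected 3D error at
    that depth (a sketch of the automatic-interval idea)."""
    interval = delta * depth_prev ** 2 / (baseline * focal)   # (H, W)
    offsets = np.arange(num_hyp) - (num_hyp - 1) / 2.0        # symmetric
    # Broadcast to an (L, H, W) hypothesis volume.
    return depth_prev[None] + offsets[:, None, None] * interval[None]

depth_prev = np.full((2, 2), 2.0)                             # previous estimate
hyp = build_hypotheses(depth_prev, num_hyp=5, baseline=0.1, focal=500.0)
```

Because the interval grows quadratically with depth, nearby geometry is sampled finely while distant geometry is sampled coarsely, matching the achievable triangulation precision.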
We use the Adam  optimizer to train the PyTorch  implementation of the network with the Huber loss  function on the resulting pixelwise depth predictions and the ground truth depth . We use in all of our experiments. The loss in each hierarchy level is calculated as
where is the hierarchy level, and and represent the estimated and ground truth depth maps for the respective level. Following the work of , the final network loss is calculated as a weighted sum . We train our network with a batch size of . For our ablation study on the extensions of the BP-layer , we trained the network on a subset of the DTU training set of  for 7 epochs using a learning rate of . We evaluate on the full validation set from  after every epoch and use the epoch with the lowest error on the threshold for our results in Table 2. For the evaluations of depth maps and point clouds, we trained on the full DTU training set by adding the validation set of , with a learning rate of for 10 epochs, and then continued training for 4 epochs with a learning rate of . For our experiments on Tanks and Temples  and ETH3D , we fine-tune the model trained on the full DTU training set using the BlendedMVS  training set. We train for 7 additional epochs using a learning rate of . During training we use source images in addition to the reference image, while we use source images during inference. Further, we provide the used image resolution , number of hypotheses per level , memory consumption, runtime and fusion parameters for training and inference in Table 1. For the fusion parameters, we first state the number of views that have to satisfy a forward-backward reprojection error of less than . We use the camera parameters provided by  in our experiments.
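The per-level Huber loss and the weighted sum over hierarchy levels can be sketched as follows (a numpy illustration; the crossover point delta and the level weights are placeholders, not the paper's values):

```python
import numpy as np

def huber(pred, gt, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones,
    which keeps gradients bounded for outlier depth errors."""
    r = np.abs(pred - gt)
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quad, lin).mean()

def total_loss(preds, gts, weights):
    """Weighted sum of per-level losses over the hierarchy, mirroring the
    weighted-sum formulation of the final network loss."""
    return sum(w * huber(p, g) for w, p, g in zip(weights, preds, gts))

# Two hierarchy levels with small constant depth errors of 0.1 and 0.05.
pred = [np.full((4, 4), 2.1), np.full((8, 8), 2.05)]
gt = [np.full((4, 4), 2.0), np.full((8, 8), 2.0)]
loss = total_loss(pred, gt, weights=[0.5, 1.0])
```

In PyTorch, the per-level term corresponds to `torch.nn.SmoothL1Loss`, applied to valid ground-truth pixels only.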
In the following sections we describe our evaluation procedures on the DTU , Tanks and Temples  and ETH3D low-resolution  datasets. Furthermore, we provide an ablation study to quantify the improvements gained by integrating the extended BP-layer . Additionally, we evaluate the resulting depth maps on DTU  and discuss point cloud results on all three datasets.
| Dataset | Hypotheses per level | Memory | Runtime | Fusion parameters |
| --- | --- | --- | --- | --- |
| DTU  train | 96, 32, 8 | 8.9 | 1.3 | -, - |
| DTU  test | 128, 32, 8 | 6.6 | 2.0 | 3, 0.25 |
| Blended  train | 96, 32, 8 | 10.9 | 1.6 | -, - |
| T & T  inter. | 96, 32, 8 | 10.9 | 2.7 | 5, 0.50 |
| T & T  adv. | 96, 32, 8 | 10.9 | 2.7 | 3, 0.25 |
| ETH3D  test | 128, 32, 8 | 3.6 | 1.0 | 3, 0.10 |
5.1 DTU dataset
The 128 scenes of the DTU dataset  capture objects placed on a table, using a full image resolution of . The ground truth is provided by a structured light scanner. The evaluation metrics for DTU are a completeness metric, which is the average distance from ground truth points to the nearest reconstructed point, and an accuracy metric, which is the average distance from the reconstructed points to the nearest ground truth point. Furthermore, large outliers and points outside the observability mask are filtered.
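The two metrics are nearest-neighbor distances computed in opposite directions; a brute-force sketch for small point clouds (real evaluations use KD-trees and the official filtering):

```python
import numpy as np

def accuracy_completeness(recon, gt):
    """DTU-style metrics: accuracy is the mean distance from each
    reconstructed point to its nearest ground-truth point, completeness
    the mean distance from each ground-truth point to its nearest
    reconstructed point."""
    d = np.linalg.norm(recon[:, None, :] - gt[None, :, :], axis=-1)
    accuracy = d.min(axis=1).mean()       # recon -> nearest gt
    completeness = d.min(axis=0).mean()   # gt -> nearest recon
    return accuracy, completeness

gt = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0]])
recon = np.array([[0.1, 0, 0], [1.0, 0, 0]])   # misses the point at x = 2
acc, comp = accuracy_completeness(recon, gt)
```

The toy example shows the trade-off: the sparse reconstruction is accurate (small acc) but incomplete (large comp), because one ground-truth point has no nearby reconstructed point.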
5.1.1 Ablation study
In our ablation study, we evaluate the impact of our extensions to the BP-layer  on the performance of BP-MVSNet by comparing error metrics computed on depth maps with resolution from the DTU validation set of . The error metrics we use are the percentages of errors larger than thresholds of mm between the ground truth depth map and the estimated depth map . We observe in Table 2 that adding the BP-layer with normalized depth jumps improves all error metrics compared to the CasMVSNet  baseline. Enabling the interpolation step for pairwise scores further improves all metrics. This means that the combination of these two contributions improves the performance of the BP-layer significantly in the MVS setting. Finally, we also include the automatic computation of the label volume discretization beyond the initial hierarchy level. This yields the lowest error on , thus we use this model for all further experiments. The slight error increase at larger thresholds is related to the dependence of the hypothesis sampling interval on the previous depth estimate: if the previous estimate is wrong, a larger fixed interval can potentially improve results at larger error thresholds.
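The per-threshold error metric is simply the fraction of valid pixels whose absolute depth error exceeds the threshold; a minimal sketch:

```python
import numpy as np

def error_rate(pred, gt, tau, valid=None):
    """Percentage of valid pixels whose absolute depth error exceeds the
    threshold tau (the per-threshold metric used in the ablation)."""
    err = np.abs(pred - gt)
    if valid is None:
        valid = np.isfinite(gt)   # pixels with ground truth available
    return 100.0 * (err[valid] > tau).mean()

gt = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = np.array([[1.05, 2.0], [3.5, 4.0]])   # one pixel is off by 0.5
```

Here one of four pixels exceeds a threshold of 0.1, giving an error rate of 25%, while no pixel exceeds a threshold of 1.0.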
5.1.2 Evaluation on depth maps
The results in Table 3 show quantitative results on depth maps of resolution , using our proposed model BP-MVSNet trained on DTU as described in Section 4. We compare with CasMVSNet  and CVP-MVSNet  using their respective pretrained models at the same resolution with default parameter settings. This comparison allows us to measure the quality of the depth maps before the point cloud fusion, as even identical depth maps can yield different results in this stage depending on the fusion parameters. BP-MVSNet improves on all measured error metrics significantly compared to the other methods. This can also be observed in Figure 4, where we show that the extended BP-layer is able to correctly regularize even over large areas containing inconsistent estimates, such as the top of the box in the first image. However, it can also be seen that very poor unary estimates in the matching network can lead to artifacts in some background regions.
5.1.3 Evaluation on point clouds
For the point cloud fusion step performed for the evaluations on point clouds, we utilize the method of . The fusion parameters are set according to Table 1, where denotes the forward-backward reprojection error threshold and is the number of images that have to be consistent with respect to . Additionally, we set the maximum relative depth difference parameter for the fusion of  to . In Table 4, we provide the results of the DTU point cloud evaluation  on the test set of . We compare our results with other state-of-the-art works for learning-based MVS. BP-MVSNet outperforms all other methods both in terms of accuracy and completeness. In the point cloud results visualized in Figure 5, one can see that the extended BP-layer  was able to regularize the depth maps such that we obtain complete results even on untextured and reflective surfaces, as seen on the coffee can.
5.2 Tanks and Temples dataset
The Tanks and Temples  dataset consists of real-world outdoor and indoor scenes capturing objects, buildings and rooms, where the ground truth has been acquired by a laser scanner. The full image resolution is . This dataset includes many challenging scenes with reflections and occlusions. In this section, we compare the results of BP-MVSNet on the Tanks and Temples  benchmark with other state-of-the-art methods. The metrics used in the Tanks and Temples  dataset are a precision score , which is the percentage of reconstructed points with a distance of at most to the nearest ground truth point. Further, the recall gives the percentage of ground truth points whose closest distance to a reconstructed point is at most . The F-score  is then computed as .
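Under the standard definition F = 2PR / (P + R), the F-score is the harmonic mean of precision and recall; a minimal sketch:

```python
def f_score(precision, recall):
    """Tanks and Temples F-score: harmonic mean of precision (fraction of
    reconstructed points within the distance threshold of ground truth)
    and recall (fraction of ground-truth points within the threshold of
    the reconstruction)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f = f_score(0.8, 0.6)   # 2 * 0.48 / 1.4, roughly 0.686
```

The harmonic mean penalizes imbalance, so a reconstruction that is accurate but sparse (or dense but noisy) cannot score well on only one of the two quantities.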
5.2.1 Benchmark results
In Table 5, we present the results of BP-MVSNet on the Tanks and Temples benchmark . We also provide the fusion parameters in Table 1. For the Horse dataset of the intermediate scenes, we increased to 1.0, as the base of the statue contains many reflections. BP-MVSNet achieves competitive results, outperforming the base architecture  in terms of the mean F-score metric on both the advanced and intermediate sets. In Figure 5, we visualize the resulting point clouds. Even difficult scenes containing surfaces that reflect the sky, such as the base of the statue, are reconstructed quite completely, while smaller details such as the rails on the train are preserved.
5.3 ETH 3D dataset
The ETH3D low-resolution  dataset consists of outdoor and indoor scenes at varying locations, from a forest to a storage room. The full image resolution for this dataset is . Similar to Tanks and Temples , the fused point clouds are evaluated against a laser-scanner ground truth in terms of accuracy, completeness and F-score.
5.3.1 Benchmark results
In Table 6, we provide results for the ETH3D low-resolution many-view  benchmark. We achieve competitive results among learning-based methods such as MVSNet  and CasMVSNet , even though we did not train our network on the training images of this dataset, as described in Section 4. However, we also observe that all learning-based methods are outperformed by the traditional methods PCF-MVS  and ACMM  on this dataset. Compared to the base architecture CasMVSNet , we achieve better scores on some scenes such as storage room and terrains, while the score on others such as sandbox or electro is lower, resulting in a slightly worse result on the test set and a slightly better one on the training set.
In this work, we have proposed BP-MVSNet, a CNN-based MVS system employing a CRF regularization layer based on belief propagation . In order to optimize the performance of the BP-layer  for the MVS setting, we have made three core contributions: (i) utilizing a scale-agnostic term for normalizing label jumps, (ii) implementing a differentiable interpolation step in the pairwise score computation and (iii) automatically choosing the discretization of the hypothesis volume after the initial stage. These contributions improve the performance of the baseline architecture , as seen in the quantitative results presented in Section 5, where we achieve state-of-the-art results on both the DTU  and Tanks and Temples  datasets. Future work could include additional information, such as surface normals, as guidance for the BP-layer.

Acknowledgement: This work was supported by ProFuture (FFG, Contract No. 854184).
- Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pp. 1–16.
- (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In SIGGRAPH.
- (2011) PatchMatch stereo - stereo matching with slanted support windows. In British Machine Vision Conference.
- (2019) Point-based multi-view stereo network. In IEEE International Conference on Computer Vision (ICCV).
- (2015) Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conference on Computer Vision (ICCV).
- Real-time plane-sweeping stereo with multiple sweeping directions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Deep learning. The MIT Press.
- (2020) Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) DeepMVS: learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics 35 (1), pp. 73–101.
- (2020) SurfaceNet+: an end-to-end 3D neural network for very sparse multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
- (2017) SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2307–2315.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36 (4).
- (2017) End-to-end training of hybrid CNN-CRF models for stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Belief propagation reloaded: learning BP-layers for labeling problems. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems 24, pp. 109–117.
- (2014) A TV prior for high-quality scalable multi-view stereo reconstruction. In International Conference on 3D Vision (3DV).
- (2019) Plane completion and filtering for multi-view stereo reconstruction. In German Conference on Pattern Recognition (GCPR).
- (2019) P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo. In IEEE International Conference on Computer Vision (ICCV).
- (2020) Attention-aware multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2000) Practical structure and motion from stereo when motion is unconstrained. International Journal of Computer Vision 39.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- (2019) TAPA-MVS: textureless-aware PatchMatch multi-view stereo. In IEEE International Conference on Computer Vision (ICCV).
- (2007) Learning conditional random fields for stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
- (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
- (2019) Multi-scale geometric consistency guided multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) MARMVS: matching ambiguity reduced multiple view stereo for efficient large scale scene reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) MVSCRF: learning multi-view stereo with conditional random fields. In IEEE International Conference on Computer Vision (ICCV).
- (2020) Cost volume pyramid based depth inference for multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) MVSNet: depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
- (2019) Recurrent MVSNet for high-resolution multi-view stereo depth inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) BlendedMVS: a large-scale dataset for generalized multi-view stereo networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Fast-MVSNet: sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2014) PatchMatch based joint view selection and depthmap estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Conditional random fields as recurrent neural networks. In IEEE International Conference on Computer Vision (ICCV).