1 Introduction
Multi-View Stereo (MVS) methods enable the reconstruction of a 3D scene from multiple images whose camera calibration and relative camera poses are known. The major challenge is to reconstruct 3D point clouds that are as complete as possible while minimizing the number of outliers and maintaining high accuracy of the 3D points. One way of minimizing outliers is to use filtering methods that preserve the most accurate measurements and remove unreliable ones, which can result in rather sparse point clouds. Applications aiming at photorealistic 3D rendering, however, require complete 3D models.
For improving the completeness of reconstructed 3D scenes, regularization techniques are useful. Many conventional MVS methods perform regularization by using a 3D cost volume of matching costs [15, 35, 11]. One drawback of this technique is the large size of the cost volume and the resulting optimization complexity, which depends polynomially on the image resolution, limiting its applicability on high-resolution images [17]. The 3D cost volumes can also be used as an input to a DNN [12, 40] in order to optimize the volume with a learned regularization measure. However, the high memory requirements of DNNs make these approaches rather unsuitable for processing high-resolution image data.
For efficient estimation of depth maps, the PatchMatch (PM) method [7] has demonstrated high-quality results without the need to handle a global cost volume. Instead, PM utilizes stochastic search over the depth space. Many efficient implementations exist, which further improve efficiency by proposing parallelization schemes [8, 42] and sophisticated view selection schemes [30]. Due to its local optimization, however, PM depth maps lack completeness and need further processing. In this paper, we extend multi-scale estimation for PM [38] and show how to filter and optimize noisy depth maps.

2 Related Work
In this section we give an overview of work relevant to our proposed method. Firstly, we review MVS methods. Secondly, we summarize confidence prediction methods, which mainly focus on the classification of disparity maps from stereo images. Finally, depth refinement methods are reviewed, as they can naturally make use of our confidence prediction technique.
Multi-View Stereo: In a similar approach to classical two-view stereo methods, which build up a cost volume by matching image patches along the epipolar line, plane-sweep-based Multi-View Stereo methods construct a cost volume by computing costs for a set of given plane hypotheses [9]. Instead of the number of disparities, the depth of the volume is defined by the number of planes. As this leads to a significant consumption of computational resources, Bleyer et al. [7] make use of the PatchMatch algorithm [5], which reduces the amount of computed matching costs by propagating depth hypotheses across the image. This concept has also been implemented for the MVS case [4]. Zheng et al. [42] make use of a probabilistic scheme for view selection in PatchMatch MVS [4], which is improved upon by Schönberger et al. [30]. In order to increase the completeness of the depth estimates, Romanoni and Matteucci [27] introduce a method which propagates depth estimates from local planes estimated from superpixels. This approach is extended by Kuhn et al. [17], who propose a region growing for the superpixels and additional outlier filtering strategies. A red-black checkerboard sampling scheme was utilized by Galliani et al. [8] in order to decrease the runtime of PatchMatch-based Multi-View Stereo, which was further improved upon by the multi-scale approach of Xu and Tao [38].
In recent years, neural network based approaches to MVS [12, 39], working with cost volumes, have also been established. Yao et al. [41] have extended their approach in order to process higher-resolution imagery. However, as described in [41], this method is not able to process resolutions as high as the ones found in the high-resolution version of the ETH3D benchmark [31]. This is why we make use of a PatchMatch [5] based approach for our MVS pipeline, which has the advantage of not having to process a large cost volume for high-resolution imagery. Furthermore, we utilize the multi-scale approach of [38] to increase the robustness of the method when processing images with large amounts of non-textured regions.
Confidence Prediction: Prediction of confidences is an inherent part of MVS methods. It can be calculated as costs by local patch comparison based on metrics like Normalized Cross Correlation. Detailed analyses of the influence of local matching costs are given by Hirschmüller and Scharstein [10] as well as by Hu and Mordohai [2]. Moreover, the confidence can be derived from a globally optimized cost volume within the cost aggregation process [32]. The latter allows the consideration of global cost terms like the overall smoothness of a disparity map [15, 35, 11].
It has been shown that machine learning methods can improve the quality of confidence prediction, e.g., by employing hand-crafted features as input for a random forest classifier [26, 2, 19, 21]. The use of automatically learned features for confidence prediction was first proposed by Seki et al. [1], while Poggi et al. [20] presented the first end-to-end confidence prediction, giving the raw disparity map as input to a DNN. Further improvements are possible by exploiting local consistencies [18], adding the image as additional input to the network [36] or even using extended information from the entire cost volume as input data [13].

Due to scalability issues we avoid processing on global cost volumes and focus on methods that can be applied in the image domain. Unfortunately, all proposed 2D methods work on disparity maps from standard stereo configurations only. In this paper we present the first confidence prediction network for generic MVS-derived depth maps.
Depth Refinement: MVS methods tend to fall short in untextured areas, where the matching becomes ambiguous. Therefore, most 3D reconstruction pipelines include a refinement step meant to remove depth outliers or even estimate missing depth areas. Most of these methods rely on a confidence map, as it is critical to understand which depth map areas are reliable and which need to be completed.
Local refinement methods, such as Tosi et al. [37], binarize the confidence map in order to classify pixels as reliable; the depth at non-reliable pixels is then inpainted using an interpolation heuristic relying exclusively on the neighboring reliable pixels. Unfortunately, local approaches tend to fall short in the presence of large unreliable areas.
Global methods, instead, compute the refined depth map as the minimizer of a global cost function.
Some methods, e.g., the Fast Bilateral Solver [6], promote a smooth refined depth map while making sure that this is close to the input depth in those areas recognized as reliable by the confidence map.
Although smoothing is avoided across color edges, where depth discontinuities may occur, simple smoothness is a weak regularization.
In fact, other methods rely on stronger assumptions.
Park et al. [24] assume a piecewise planar world: a set of candidate planes is estimated a priori and a Markov Random Field is used to assign each pixel to a plane, from which the depth can be derived.
However, the a priori selection of the possible planes, first, fixes the complexity of the scene in front of the camera and, second, assumes that the selected planes are correct.
To overcome the previous limitations, in [28] the authors propose to adopt a regularizer which promotes depth maps fulfilling a piecewise planar world assumption, but without any a priori candidate plane selection.
Also in this method, the confidence plays a fundamental role, as the method implicitly fits planes based on the reliable depth areas.
Recently, some authors addressed the depth refinement problem within the deep learning framework [34, 25]. However, the performance of these methods tends to decrease significantly when applied to data even slightly different from that used in the training phase. Therefore, in this article we adopt the method in [28] and show that, coupled with our proposed confidence, it can improve the 3D reconstruction both qualitatively and quantitatively.
3 Algorithm
In this section we give a detailed explanation of our proposed MVS algorithm based on PatchMatch [5]. The generated depth maps are subsequently used as input to our confidence prediction network. For the latter we propose an extended input and a new synthetically derived dataset.
3.1 Multi-View Stereo
The proposed MVS pipeline makes use of the PatchMatch [5] algorithm with a red-black checkerboard propagation scheme, as proposed in [8]. A visualization of the used sampling pattern is shown in Figure 2. We have removed samples close to the central pixel, following the intuition that these hypotheses can be replaced by a perturbed estimation which can be utilized in addition to samples from the pattern [30]. The depth and normal estimates are perturbed according to the scheme presented in [30], where the perturbation parameter, which depends on the current red-black iteration, is calculated analogously to [30]. We also test hypotheses which result from combinations of the current, perturbed and random depth estimates with their respective normal vectors, as suggested by [30, 38]. However, as opposed to [38], we compute the matching costs and test these hypotheses during the main red-black iterations and do not perform these calculations in a separate step. We draw from six potential samples in each direction, yielding a total of 24, as we have found that sampling from too many hypotheses increases the chance of using incorrect estimates. This is because we follow the view selection scheme proposed by Xu and Tao [38], which selects the eight best candidates from the pattern based on their respective matching costs. This means that outliers are potentially selected due to ambiguities in the photometric matching cost metric we are using, which is the bilaterally weighted normalized cross-correlation proposed in [30]. In contrast to the scheme described in [38], the number of candidates considered when updating the current estimate is increased to 16. It should be noted that the view-selection matrix is still based on the eight candidates with the lowest matching costs [38]. Furthermore, we incorporate a plane-based depth propagation strategy, as proposed in [30]: instead of directly using the depth estimate at a given sampling location as a new hypothesis, we make use of the local plane defined by the depth and normal estimate at that location. By intersecting the viewing ray of the destination pixel with the local plane defined at the sampling location, we are able to propagate rapid changes in depth values more effectively. We also make use of the multi-scale estimation, geometric consistency and detail restorer proposed by [38] on three hierarchy levels with a fixed downsampling factor. For the view selection and detail restorer, we use the parameter settings from [38]. Example depth maps are shown in Fig. 3.

3.2 Confidence Prediction
We present the first confidence prediction method capable of handling complex MVS scenarios using a DNN. While for disparity maps from stereo configurations datasets with ground-truth disparities are available [29, 22], for MVS only semi-dense LiDAR ground truth [31] and data captured in a laboratory environment with relatively low resolutions and non-varying distance to the scene [32, 3] exist. Because sufficiently large amounts of training data are a requirement for training a DNN, we propose a new dataset generated from synthetic 3D scenes. We use the AirSim framework [33] to simulate a drone flight capturing multiple images with ground-truth depth maps and camera calibration from varying viewpoints. The dataset includes representative scenarios with large baselines, perspective deformations, varying distance to the scene, specular reflections and weakly textured surfaces. Examples from the dataset are shown in Fig. 3. Trees in the simulation framework can lead to erroneous ground-truth depth maps because branches are simulated, in simplified form, as partially transparent planes. Hence, we mask out vegetation areas from the ground-truth data. For the final learning of the network parameters, we combine the publicly available semi-dense real-world dataset [31] with our proposed synthetic dataset to alleviate the influence of the domain gap.
Our confidence prediction is inspired by ConfNet [36], which allows the prediction of confidences on disparity maps by employing derived global features. The major component of ConfNet is an encoder-decoder network which takes as input the RGB image and the disparity map derived from a stereo pair and outputs a pixel-wise confidence. In contrast to [36], we solely use the encoder-decoder network ConfNet, as we found that the originally proposed combination with a local confidence network does not improve the results in our application. The network was originally trained on the Middlebury [29] and KITTI [22] datasets. Because a binary classification into valid and non-valid measurements is performed, the cross-entropy loss function on binarized ground truth is used. For these datasets the ground truth can be created straightforwardly from disparities, by generating binary confidence maps with a maximum disparity error of, e.g., one pixel.
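A minimal sketch of this disparity-based label generation; the function and array names are illustrative, and the one-pixel threshold is the example value from the text:

```python
import numpy as np

def binarize_disparity_gt(est_disp, gt_disp, max_err_px=1.0):
    """Label a pixel as confident (1) if its disparity error is
    within max_err_px of the ground truth, else as an outlier (0)."""
    valid = np.abs(est_disp - gt_disp) <= max_err_px
    return valid.astype(np.uint8)
```

Such a fixed pixel threshold works for stereo disparities because their noise level is uniform across the image, which is precisely what no longer holds for MVS depth maps.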
We, in contrast, work on depth maps derived from multiple viewpoints, as is typical for MVS scenarios. Basically, confidence prediction is about assigning a probability value to each pixel specifying whether the corresponding depth value is an outlier or an inlier. Positive confidences support a depth value lying within a reasonable noise range. Setting a fixed noise level, e.g., in meters, is not possible because of varying baselines, focal lengths and distances to the scene. Hence, we make use of a pixel-wise estimation of the expected noise in 3D space. To this end, we use a well-known model derived from analytical error propagation [23] which was already used in the 3D fusion of depth maps [16]. The simplified expected 3D standard deviation σ_z as a scalar value is described as:

σ_z = (z² / (b · f)) · σ_d   (1)

with distance to the scene z, baseline b, focal length f and pixel uncertainty σ_d. Note that the error model is based on the assumption of classical stereo cameras. In MVS configurations we have multiple baselines; hence, for b we use the average baseline over all cameras, following [17]. Having a scalar depth range, we can perform a binary classification of the depth maps, generating ground truth for valid confidences (see Figs. 3 and 4).
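The error model of Eq. (1) and the resulting label generation can be sketched as follows; the three-sigma acceptance band matches the setting reported later in the experiments, while all names and default values are our own:

```python
import numpy as np

def expected_depth_std(depth, baseline, focal_px, sigma_px=1.0):
    """Analytical stereo error propagation: sigma_z = z^2 / (b * f) * sigma_d."""
    return depth ** 2 / (baseline * focal_px) * sigma_px

def label_depth_map(est_depth, gt_depth, baseline, focal_px, k=3.0, sigma_px=1.0):
    """Binary confidence ground truth: a depth measurement is an inlier
    if it lies within k expected standard deviations of the ground truth."""
    sigma_z = expected_depth_std(gt_depth, baseline, focal_px, sigma_px)
    return (np.abs(est_depth - gt_depth) <= k * sigma_z).astype(np.uint8)
```

Because σ_z grows quadratically with the distance z, distant points are allowed a proportionally larger absolute error, which is what makes the labels comparable across scenes with different scales.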
The input of ConfNet is a four-channel tensor consisting of the RGB image and the disparity map. In our case we have depth maps, which are generally not normalized, as the scale of a scene is not necessarily given. In addition, the unconsidered varying noise levels in MVS-derived depth maps can be problematic for the network. Therefore, the depth map input is replaced by two normal channels created by estimating the spherical coordinates of the normal vector. An example input normal map is shown in Fig. 5.

As already mentioned, we cannot use an entire cost volume from MVS as input for the network because of memory issues. Nonetheless, we give the network additional per-pixel information derived from MVS. As done by our baseline MVS methods, we perform consistency checks employing the geometric and photometric error of depth measurements. Instead of using a fixed threshold for the minimum number of successfully matched images, we keep this information for every pixel and generate a counter map as additional network input (see Fig. 5).
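A sketch of how such an input tensor could be assembled; the spherical-angle convention and the channel order are our assumptions, not taken from the paper:

```python
import numpy as np

def normals_to_spherical(normals):
    """Convert unit normal vectors (H, W, 3) into two angle channels
    (azimuth, inclination), which are scale-independent by construction."""
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    azimuth = np.arctan2(ny, nx)
    inclination = np.arccos(np.clip(nz, -1.0, 1.0))
    return np.stack([azimuth, inclination], axis=-1)

def build_network_input(rgb, normals, counter):
    """Stack RGB (H, W, 3), the two spherical normal channels and the
    per-pixel consistency counter into a six-channel input tensor."""
    sph = normals_to_spherical(normals)
    return np.concatenate([rgb, sph, counter[..., None]], axis=-1)
```

In contrast to a raw depth channel, the two angle channels carry no scene scale, so the network cannot overfit to the metric range of the training scenes.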
4 Applications
In the following we describe how the proposed learned confidence maps can be integrated in a 3D reconstruction pipeline to improve the quality of depth maps.
4.1 Filtering
MVS generates dense depth maps including outliers due to occlusions and wrongly matched patches. It is common practice to filter individual depths pixel by pixel within a post-processing step. State-of-the-art MVS methods make use of geometric filtering by first estimating the reprojection error when projecting the depth from the reference image into overlapping source images. In addition, photometric consistency is checked [30]. If an insufficient number of source images fulfills the geometric and photometric requirements, the depth is filtered. We, in contrast, use that information as additional input for our confidence prediction network. Subsequently, the depth is filtered depending on the optimized confidence, additionally considering the RGB image and the normal map.
Advanced filtering can be useful to decimate clusters of outliers. To this end, small islands in the depth image [27] or disparity space [11, 17] are removed if they are of insufficient size. In contrast to existing methods, we cluster normals instead of disparities or depths to overcome the problem of an unknown scale. In addition, we use probabilistic fusion to estimate an overall confidence for a clustered island. More precisely, we incrementally cluster neighboring pixels if their normal vectors are similar with respect to a given threshold. At this point, we consider our confidences as inlier probabilities Pᵢ. The overall probability P of an island is estimated by fusing the individual pixel probabilities Pᵢ. For the fusion of probabilities of binary states, Binary Bayes Fusion is used:

P = ∏ᵢ Pᵢ / ( ∏ᵢ Pᵢ + ∏ᵢ (1 − Pᵢ) )   (2)
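The island fusion of Eq. (2) can be sketched as follows; the log-domain formulation is our own choice for numerical stability with large islands:

```python
import numpy as np

def fuse_island_confidence(pixel_probs):
    """Binary Bayes Fusion of per-pixel inlier probabilities P_i:
    P = prod(P_i) / (prod(P_i) + prod(1 - P_i)), computed in the
    log domain so large islands do not underflow."""
    p = np.asarray(pixel_probs, dtype=np.float64)
    log_in = np.sum(np.log(p))            # log prod(P_i)
    log_out = np.sum(np.log1p(-p))        # log prod(1 - P_i)
    return 1.0 / (1.0 + np.exp(log_out - log_in))
```

A useful property of this fusion is that many moderately confident pixels quickly drive the island probability toward 0 or 1, so whole outlier clusters can be removed even when no single pixel is decisively classified.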
4.2 Depth Refinement
Most human-made environments either consist of piecewise planar surfaces or can be well approximated as such. Moreover, in the pinhole camera model, the inverse depth map of a planar surface imaged by the camera is a planar function. Formally, if the 3D point imaged at pixel q belongs to the same scene plane as the 3D point imaged at pixel p, then the inverse depth u(q) at q can be expressed as:

u(q) = u(p) + w(p) · (q − p)   (3)

where w(p) is a vector defining the plane orientation at p. Based on these observations, the authors in [28] propose to refine a noisy and possibly incomplete depth map by enforcing that the refined inverse depth map is piecewise planar. The method in [28] requires a depth map confidence: we adopt our proposed one.
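A one-line illustration of this plane-induced inverse depth model; all names are ours:

```python
import numpy as np

def inverse_depth_on_plane(u_p, w_p, p, q):
    """Inverse depth at pixel q for the plane passing through pixel p,
    given the inverse depth u_p at p and the plane-orientation vector w_p:
    u(q) = u_p + w_p . (q - p)."""
    return u_p + np.dot(w_p, np.asarray(q, float) - np.asarray(p, float))
```

The affinity of inverse depth in pixel coordinates is what makes the regularization below tractable: planarity becomes a linear constraint rather than a nonlinear one on the depth itself.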
The refinement of each inverse depth map is cast as the minimization of a cost function comprising two terms. The first one is a simple data term, which penalizes solutions deviating from the input inverse depth map û in those areas considered reliable by our confidence map c:

E_data(u) = Σ_p c_p (u_p − û_p)²   (4)
The second one is a regularization term, which promotes piecewise planar solutions explicitly.
In particular, the regularization term models the inverse depth map as a weighted graph, where the pixels are the nodes and the weight ω_{p,q} of the edge between pixels p and q encodes the probability that p and q belong to the same plane in the scene. The higher ω_{p,q}, the more the regularizer enforces the fulfillment of the constraint in Eq. (3). The regularizer reads as follows:

E_reg(u, w) = Σ_p Σ_{q ∈ N_p} ω_{p,q} ( u_q − u_p − w(p) · (q − p) )²   (5)

ω_{p,q} = exp( −‖I(p) − I(q)‖² / (2σ_c²) − ‖p − q‖² / (2σ_x²) )   (6)

where N_p is the set of pixels connected to pixel p in the graph and I denotes the color image. The weight ω_{p,q} is thus computed based both on the color difference between the two pixels and on their Euclidean distance. In particular, the weight is computed for all pixels in a searching window centered at p, but only the pixels with the highest weights are selected. This avoids edges between points that are unlikely to correspond to the same plane in the scene, as these would just increase the graph size unnecessarily. We refer to [28] for more details on the regularizer in Eqs. (5)–(6) and on the graph construction.
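A possible sketch of the edge-weight construction described above, using a bilateral-style Gaussian weight over color difference and pixel distance; the exact weight function and the parameters used in [28] may differ:

```python
import numpy as np

def graph_weights(image, p, window=5, k=4, sigma_c=10.0, sigma_x=3.0):
    """For pixel p, compute edge weights to all pixels in a searching
    window (penalizing color difference and Euclidean distance) and
    keep only the k highest-weighted neighbors."""
    h, w = image.shape[:2]
    py, px = p
    half = window // 2
    candidates = []
    for qy in range(max(0, py - half), min(h, py + half + 1)):
        for qx in range(max(0, px - half), min(w, px + half + 1)):
            if (qy, qx) == (py, px):
                continue
            dc = np.linalg.norm(image[qy, qx].astype(float) - image[py, px].astype(float))
            dx = np.hypot(qy - py, qx - px)
            weight = np.exp(-dc ** 2 / (2 * sigma_c ** 2) - dx ** 2 / (2 * sigma_x ** 2))
            candidates.append(((qy, qx), weight))
    # Keep only the k strongest edges, as described in the text.
    candidates.sort(key=lambda e: e[1], reverse=True)
    return candidates[:k]
```

On a uniform image patch the color term vanishes and the k nearest pixels win, which matches the intuition that, absent color edges, local neighbors are the most likely co-planar candidates.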
The resulting optimization problem is the following:

min_{u,w}  E_data(u) + λ E_reg(u, w)   (7)

where λ balances the two terms. Thanks to the direct relation between w and the normals of the planes in the scene, the solution of the optimization problem in Eq. (7) provides both the refined depth map and the corresponding normal map. An example is provided in Figure 6.
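As a toy illustration of such a confidence-weighted refinement, the following 1-D sketch minimizes a data term plus a simple pairwise smoothness term by gradient descent. It replaces the full planar regularizer of [28] with plain first-order smoothness and is not the authors' implementation:

```python
import numpy as np

def refine_inverse_depth(u0, conf, lam=1.0, steps=500, lr=0.1):
    """Gradient descent on
    E(u) = sum_p conf_p * (u_p - u0_p)^2 + lam * sum_p (u_{p+1} - u_p)^2.
    Low-confidence samples carry no data penalty and are pulled
    toward their neighbors by the smoothness term."""
    u = u0.copy()
    for _ in range(steps):
        grad = 2.0 * conf * (u - u0)      # data term gradient
        diff = np.diff(u)                 # smoothness term gradient
        grad[:-1] -= 2.0 * lam * diff
        grad[1:] += 2.0 * lam * diff
        u -= lr * grad
    return u
```

With a zero-confidence outlier between reliable samples, the refined value converges to its neighbors, mimicking how Eq. (7) inpaints unreliable regions from confident ones.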
5 Experiments
In this section, we report the evaluation and validation of our proposed confidence prediction method and demonstrate the improvement obtained by applying it for outlier filtering and depth map refinement on popular MVS datasets. We trained the confidence prediction network on the entire high-res dataset and tested on the ETH3D low-res and test data [31], which were not included in the training procedure. All network configurations are trained with the same parameters, number of epochs and learning rate, using ADAM [14] as the optimizer to minimize the binary cross-entropy on our ground-truth label maps.
5.1 MVS
We evaluate the performance of our vanilla MVS pipeline (prior to outlier filtering and depth map optimization) by benchmarking on the high-resolution and low-resolution training datasets of the ETH3D benchmark [31]. This benchmark provides ground-truth laser scan point clouds, which are used to evaluate a reconstructed point cloud based on an F-score metric, computed as the harmonic mean of a completeness and an accuracy term [31]. These terms represent the coverage of a given reconstruction with respect to the ground truth and the accuracy of the point locations measured against the laser scan point cloud. The reconstructed point cloud is generated by utilizing the depth fusion algorithm implemented in COLMAP [30] with the parameter settings from [38]. For our experiments, we perform 8 red-black iterations of the algorithm and downsample the input images to half of the original resolution. In Table 1 we first show the baseline performance of the pipeline, which is denoted as ours. The results are compared with both ACMM [38] and COLMAP [30], as our algorithm is based on features proposed in both pipelines. It can be observed that our pipeline achieves a better F1-score than both ACMM [38] and COLMAP [30]. The improvement is based on an increase in the completeness score and competitive results in terms of accuracy. As seen in Table 1, the plane-based depth propagation is the main contributor to this increase, as highly varying depth estimates along surfaces can now be propagated more effectively. However, the increased completeness also results in a lower F1-score on the low-res datasets, which contain many images, so that inaccuracies have a bigger impact. We have also evaluated the performance of the sampling pattern compared to the one used by ACMM [38]. Table 1 shows that the sampling pattern used in our pipeline leads to an improvement in terms of the F1-score as well, due to an increase in accuracy.

Method  F1  compl.  acc.  F1  compl.  acc.

COLMAP [30]  67.66  55.13  91.85  49.91  40.86  69.58  
ACMM [38]  78.86  70.42  90.67  55.12  57.01  54.69  
ours  83.62  83.25  84.17  55.61  62.55  50.66  

82.77  82.90  82.85  53.66  62.72  47.30  

81.23  78.96  83.97  56.18  61.74  52.51 
5.2 Confidence Prediction
Confidence prediction aims at separating correct from incorrect measurements. As described in Sec. 3.2, we generate pixel-wise ground-truth labels for each depth map by assigning depth measurements to an expected range of the 3D standard deviation. We found that the amount of positive confidences is higher than the amount of negative confidences in the training dataset. Hence, we reweight the influence by scaling the loss of negative values by a constant factor in the binary cross-entropy loss function. For the expected uncertainty (Eq. (1)) we use three times the standard deviation, assuming a fixed pixel uncertainty.
The effectiveness of confidence prediction is generally measured using the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) gives a scalar value representing the separability of two distributions and, hence, the quality of the predicted confidences. As already mentioned, our method is the first able to process depth maps from Multi-View Stereo configurations and cannot be compared to existing work directly, as prior methods cannot handle depth maps from complex configurations. Only when employing the estimated 3D error, which is one of our contributions, is the network able to learn complex depth errors. Another contribution of our work is the proposal of a novel multi-channel data input comprising the RGB image, normal vectors and counter information from MVS (see Sec. 3.2). For an evaluation, we analyse the performance of our method with varying input data. Firstly, we use solely the depth maps as network input, which is comparable to [20], which employs disparity maps. Secondly, we use the RGB image and depth maps, which is comparable to [36]. Thirdly, we add the counter map, and finally we replace the depth by normal maps as in our method. Note that for depth map input, we perturbed the depth by a random scale, because on generic scenes no scale is given. Our method yields the best AUC, while the baseline approaches (depth, RGB+depth, RGB+depth+counter) score lower. In general, the use of the normal map allows the prediction of confidences for generic scenes, as no scale is learned implicitly. The counter map information from MVS makes the network robust against complex configurations.
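The AUC metric used above can be computed directly from its rank-statistic interpretation; this small self-contained sketch (our own helper, not part of the paper's pipeline) is suitable for the moderate sample counts of a per-image evaluation:

```python
import numpy as np

def auc_score(labels, confidences):
    """Area under the ROC curve via the Mann-Whitney identity: the
    probability that a randomly drawn inlier receives a higher
    confidence than a randomly drawn outlier (ties count as half)."""
    labels = np.asarray(labels, bool)
    pos = np.asarray(confidences, float)[labels]
    neg = np.asarray(confidences, float)[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every inlier outranks every outlier; 0.5 corresponds to a confidence map with no separating power.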
5.3 Filtering
In this experiment we optimized the 3D fusion parameters by allowing a maximum normal difference and a maximum reprojection error in pixels. Additionally, a minimum number of consistent points in 3D is required, and the minimum angle is halved for the dense sequences (low-res ETH3D), as they provide a higher redundancy.
Dense depth maps from MVS suffer from outliers due to occlusions or wrongly matched patches. Therefore, depth maps are generally filtered in a post-processing step. State-of-the-art methods project the depth values of the reference image to the employed source images and estimate the geometric reprojection error and a photo-consistency measure [30, 38]. If a sufficient number of supporting source images is found, the depth value is kept valid; otherwise it is filtered. In the following, we call this mask filtering. In practice, a minimum of one (mask1) or two (mask2) supporting images is used for datasets with a limited number of images, as provided by the ETH3D datasets [31]. We generate a counter map including the pixel-wise count as additional input to the network, which implicitly takes into account additional information such as the degree of texturedness and the level of noise. We use a fixed confidence threshold and filter all depths below it. Sophisticated methods employ the filtering of clustered islands (peaks) with small sizes [17, 27]. We filter islands taking into account the fused confidence of Eq. (2) and compare this to filtering clusters smaller than 150 pixels (clust150), which we empirically found to be the best value. Table 2 shows a direct comparison of the competing methods.
Method  F1  compl.  acc.  F1  compl.  acc.
Ours  84.80  81.63  88.51  61.35  63.14  60.17 
+conf  84.56  79.87  90.32  61.48  59.83  63.80 
+mask1  84.09  79.77  89.38  61.35  60.15  63.16 
+mask2  79.57  71.78  90.88  61.31  60.03  63.19 
+clust conf  84.28  79.66  89.97  61.39  59.52  63.98 
+clust150  82.01  75.43  90.77  60.80  53.83  70.67 
+conf+clust  84.16  78.87  90.83  61.48  59.45  64.26 
Filtering improves the accuracy of the point cloud and hence minimizes the number of outliers in the 3D point cloud. Even though mask2 filtering has a higher accuracy, its F-score drops drastically. Our confidence-based filtering allows a better regulation of the filtering parameter. The confidence-based clustering allows a stable filtering of islands and achieves better scores than number-based island filtering. In comparison to our method without filtering, the F-score slightly decreases, and combining confidence and clustered confidence is numerically best only for the F1-scores on video data. On the other hand, the point cloud is visibly improved, as a comparison of the results in Fig. 7 shows. Note that the unfiltered point cloud has the best F-score because outliers behind the walls are not considered in the evaluation.
5.4 Optimization
Besides the filtering of outliers, confidences can be used for the refinement of depth maps.
As described in Sec. 4.2, we use our confidence map as input to an optimization framework which strongly regularizes depth and normals in areas marked as low confidence while preserving fine details in high-confidence areas. We use the same parameters as proposed by the authors in [28]. The finally optimized depth and normal maps are fused with the same method as the filtered maps, except that the minimum normal consistency was tightened because the refined normal maps are of higher accuracy (see Fig. 6). Table 3 shows the results for the ETH3D high- and low-resolution training datasets.
Method  F1  compl.  acc.  F1  compl.  acc.

Ours  84.80  81.63  88.51  61.35  63.14  60.17 
Ours opt  85.86  82.70  89.65  61.45  58.44  65.33 
The numerical improvement results especially from closed holes in the depth maps. As demonstrated in Fig. 8, the optimization is also suitable for the refinement of thin structures, although this does not influence the evaluation score significantly.
5.5 Final Evaluation
As a final experiment, we test our proposed method with subsequent filtering and refinement on the ETH3D test datasets [31]. We use the same parameters for all datasets for the final evaluation, as described in Secs. 5.3 and 5.4. As our refinement method generates complete depth maps, the geometric consistency check can fail to filter outliers in sky areas. Hence, we enable a sky filtering method [17] to minimize visual artifacts. The sky filtering influences the score on the ETH3D data only marginally (on the high-res F-score) but gives visually better results.
Our proposed refinement method is effective but computationally complex, while the confidence-based filtering allows efficient processing; we therefore provide both a refinement-based and a filtering-based variant of DeepC-MVS. Table 4 shows a direct numerical comparison against the leading methods on the ETH3D high-resolution and low-resolution datasets [31]. Both of our variants outperform the current state of the art, including ACMM, which we consider as the baseline method on the ETH3D high-resolution data [31]. Visually comparing the refinement against the filtering (Figs. 1 and 8), improvements are obvious on flat surface parts. Numerically, both perform almost identically because no extended filtering is currently enabled after the refinement. To this end, our confidence prediction network could be retrained on the refined depth maps in the future. Furthermore, thin structures are highly underrepresented in the datasets. When evaluating the F-score at a fine resolution (cm), the refinement method gives a better result than the filtering variant.
Method  train  test  train  test 

DeepC-MVS  85.85  86.80  61.47  61.99 
84.18  86.82  62.18  62.35  
ACMM  78.86  80.78  55.12  55.01 
PCF-MVS  79.42  79.29  57.32  57.06 
TAPA-MVS  77.69  79.15  55.13  58.67 
LTVRE  61.82  76.25  53.25  53.52 
COLMAP  67.66  73.01  49.91  52.32 
6 Summary
We have presented the first confidence prediction method for depth maps derived from challenging Multi-View Stereo (MVS) configurations. To allow training on dense depth maps, we proposed a new dataset with dense ground truth generated from synthetic scenes. Binarized ground-truth labels are estimated from ground-truth and estimated depth maps employing analytical error propagation. The predicted confidence maps are used for the probabilistic filtering of outliers as well as for the refinement of depth maps based on global optimization. Our method improves the state of the art both qualitatively and quantitatively when employed with our proposed PatchMatch-based Multi-View Stereo method, as demonstrated on popular 3D reconstruction benchmarks.
References
 [1] (2016) Patch based confidence prediction for dense disparity map. In BMVC, Cited by: §2.
 [2] (2014) Learning to detect ground control points for improving the accuracy of stereo matching. In CVPR, Cited by: §2, §2.
 [3] (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision (IJCV) 120 (2), pp. 153–168. Cited by: §3.2.
 [4] (2012) Scale robust multi-view stereo. In ECCV, Cited by: §2.
 [5] (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM SIGGRAPH 2009 Papers, SIGGRAPH ’09. Cited by: §2, §2, §3.1, §3.
 [6] (2016) The fast bilateral solver. In European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, pp. 617–632. Cited by: §2.
 [7] (2011) PatchMatch stereo - stereo matching with slanted support windows. In British Machine Vision Conference (BMVC), Cited by: §1, §2.
 [8] (2015) Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, Cited by: §1, §2, §3.1.
 [9] (2007) Realtime planesweeping stereo with multiple sweeping directions. In CVPR, Cited by: §2.
 [10] (2008) Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31, pp. 1582–1599. Cited by: §2.
 [11] (2008) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30 (2), pp. 328–341. Cited by: §1, §2, §4.1.
 [12] (2018) DeepMVS: learning multi-view stereopsis. In CVPR, Cited by: §1, §2.
 [13] (2019) LAF-Net: locally adaptive fusion networks for stereo confidence estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [14] (2014) Adam: a method for stochastic optimization. arXiv:1412.6980; published as a conference paper at the 3rd International Conference on Learning Representations (ICLR), San Diego, 2015. External Links: Link Cited by: §5.
 [15] (2002) Multi-camera scene reconstruction via graph cuts. In Proceedings of the 7th European Conference on Computer Vision - Part III, ECCV ’02, Berlin, Heidelberg, pp. 82–96. External Links: ISBN 3540437460, Link Cited by: §1, §2.
 [16] (2017) A TV prior for highquality scalable multiview stereo reconstruction. International Journal of Computer Vision (IJCV) 124 (1), pp. 2–17. Cited by: §3.2.
 [17] (2019) Plane completion and filtering for multiview stereo reconstruction. In GCPR, Cited by: §1, §2, §3.2, §4.1, Figure 7, §5.3, §5.5, Table 2.
 [18] (2017) Quantitative evaluation of confidence measures in a machine learning world. In ICCV, Cited by: §2.
 [19] (2016) Learning a general-purpose confidence measure based on O(1) features and a smarter aggregation strategy for semi-global matching. In 3DV, Cited by: §2.
 [20] (2017) Learning from scratch a confidence measure. In BMVC, Cited by: §2, §5.2.
 [21] (2015) Leveraging stereo matching with learning-based confidence measures. In CVPR, Cited by: §2.
 [22] (2015) Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2, §3.2.
 [23] (2000) Practical structure and motion from stereo when motion is unconstrained. International Journal of Computer Vision (IJCV) 39 (1), pp. 5–23. Cited by: §3.2.
 [24] (2019) As-planar-as-possible depth map estimation. Computer Vision and Image Understanding 181, pp. 50–59. Cited by: §2.
 [25] (2019) Learned collaborative stereo refinement. In German Conference on Pattern Recognition, GCPR 2019, Dortmund, Germany, Cited by: §2.
 [26] (2013) Ensemble learning for confidence measures in stereo vision. In CVPR, Cited by: §2.
 [27] (2019) TAPA-MVS: textureless-aware PatchMatch multi-view stereo. CoRR abs/1903.10929. External Links: Link Cited by: §2, §4.1, Figure 7, §5.3, Table 2.
 [28] (2019) Joint graph-based depth refinement and normal estimation. arXiv:1912.01306 [cs.CV]. Cited by: §2, §2, §4.2, §4.2, §5.4.
 [29] (2014) High-resolution stereo datasets with subpixel-accurate ground truth. In GCPR, Cited by: §3.2, §3.2.
 [30] (2016) Pixelwise view selection for unstructured multiview stereo. In ECCV, Cited by: §1, §2, §3.1, §4.1, Figure 7, §5.1, §5.3, Table 1, Table 2.
 [31] (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, Cited by: Figure 1, §2, Figure 3, §3.2, §5.1, §5.3, §5.5, §5.5, Table 1, §5.
 [32] (2006) A comparison and evaluation of multiview stereo reconstruction algorithms. In CVPR, Cited by: §2, §3.2.
 [33] (2017) AirSim: highfidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, External Links: arXiv:1705.05065, Link Cited by: §3.2.
 [34] (2017) Detect, replace, refine: deep structured prediction for pixelwise labeling. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Cited by: §2.
 [35] (2003) Stereo matching using belief propagation. IEEE Trans. Pattern Anal. Mach. Intell. 25 (7), pp. 787–800. External Links: ISSN 0162-8828, Link, Document Cited by: §1, §2.
 [36] (2018) Beyond local reasoning for stereo confidence estimation with deep learning. In ECCV, Cited by: §2, §3.2, §5.2.
 [37] (2019) Leveraging confident points for accurate depth refinement on embedded systems. In IEEE Embedded Vision Workshop, EVW 2019, Long Beach, CA, USA, Cited by: §2.
 [38] (2019) Multi-scale geometric consistency guided multi-view stereo. CoRR abs/1904.08103. External Links: Link Cited by: §1, §2, §2, Figure 2, §3.1, Figure 7, §5.1, §5.3, Table 1, Table 2.
 [39] (2018) MVSNet: depth inference for unstructured multi-view stereo. In ECCV, Cited by: §2.
 [40] (2018) MVSNet: depth inference for unstructured multi-view stereo. In ECCV, Cited by: §1.
 [41] (2019) Recurrent MVSNet for high-resolution multi-view stereo depth inference. In CVPR, Cited by: §2.
 [42] (2014) PatchMatch based joint view selection and depth map estimation. In CVPR, Cited by: §1, §2.