
PatchMatch-RL: Deep MVS with Pixelwise Depth, Normal, and Visibility

Recent learning-based multi-view stereo (MVS) methods show excellent performance with dense cameras and small depth ranges. However, non-learning based approaches still outperform for scenes with large depth ranges and sparser wide-baseline views, in part due to their PatchMatch optimization over pixelwise estimates of depth, normals, and visibility. In this paper, we propose an end-to-end trainable PatchMatch-based MVS approach that combines advantages of trainable costs and regularizations with pixelwise estimates. To overcome the challenge of the non-differentiable PatchMatch optimization that involves iterative sampling and hard decisions, we use reinforcement learning to minimize expected photometric cost and maximize likelihood of ground truth depth and normals. We incorporate normal estimation by using dilated patch kernels, and propose a recurrent cost regularization that applies beyond frontal plane-sweep algorithms to our pixelwise depth/normal estimates. We evaluate our method on widely used MVS benchmarks, ETH3D and Tanks and Temples (TnT), and compare to other state of the art learning based MVS models. On ETH3D, our method outperforms other recent learning-based approaches and performs comparably on advanced TnT.



1 Introduction

Figure 1 (panels: Image, Ground Truth, COLMAP [27], Ours): We propose PatchMatch-RL, an end-to-end trainable PatchMatch-based MVS approach that combines advantages of trainable costs and regularizations with pixelwise estimates of depth, normal, and visibility. The left half of each bottom image shows the depth, and the right half shows the normals. Our method achieves smoother and more complete depth and normal map estimation than the existing approach (COLMAP).

Multi-view stereo (MVS) aims to reconstruct 3D scene geometry from a set of RGB images with known camera poses, with many important applications such as robotics [25], self-driving cars [8], infrastructure inspection [7, 13], and mapping [31]. Non-learning based MVS methods [5, 26, 32, 34, 41] evolved to support pixelwise estimates of depths, normals, and source view selection, with PatchMatch based iterative optimization and cross-image consistency checks. Recent learning-based MVS methods [12, 15, 16, 39, 40] tend to use frontal plane sweeps, evaluating the same set of depth candidates for each pixel based on the same images. The trainable photometric scores and cost-volume regularization of the learning-based methods lead to excellent performance with dense cameras and small depth ranges, as evidenced in the DTU [1] and Tanks-and-Temples (TnT) benchmarks [18]. However, the pixelwise non-learning based approaches perform better for scenes with large depth ranges and slanted surfaces observed with sparser wide-baseline views, as evidenced in the ETH3D benchmark [28].

Our paper aims to incorporate pixelwise depth, normal, and view estimates into an end-to-end trainable system with advantages from both approaches:

  • Pixelwise depth and normal prediction efficiently models scenes with large depth ranges and slanted surfaces.

  • Pixelwise view selection improves robustness to occlusion and enables reconstruction from sparser images.

  • Learned photometric cost functions improve correspondence robustness.

  • Learned regularization and contextual inference enable completion of textureless and glossy surfaces.

One challenge is that PatchMatch optimization and pixelwise view selection involve iterative sampling and hard decisions that are not differentiable. We propose a reinforcement learning approach to minimize expected photometric cost and maximize discounted rewards for reaching a good final solution. Our techniques can also be used to enable learning for other PatchMatch applications (e.g. [3, 14, 21]), though we focus on MVS only. Estimating 3D normals of pixels is also challenging because convolutional features tend to be smooth so that neighboring cells add little new information, and patch-wise photometric costs are memory intensive. We find that with shallower feature channels and dilated patch kernels, we effectively estimate pixel normals. A third challenge is how to perform regularization or global inference. Each pixel has its own depth/normal estimate, so cost-volume based regularization does not apply. We propose a recurrent cost regularization that updates a hidden state via message passing that accounts for depth/normal similarities between pixels.

Figure 2: Architecture overview: We first extract multi-scale features using CNNs with shared weights. We then perform coarse-to-fine estimation, with correlation of feature maps at corresponding scales used to evaluate photometric costs and perform view selection. At the coarsest stage, we initialize pixelwise oriented points (depths/normals) and associated hidden states per plane. Then, a series of PatchMatch iterations updates the points and hidden state maps. The PatchMatch iteration consists of four stages: (1) pixelwise view selection; (2) candidate propagation; (3) candidate scoring with recurrent cost regularization; and (4) candidate selection. The current solution is then upsampled as an input to the finer level, and this continues until oriented point estimates at the finest level are fused from all images.

In summary, our main contribution is an end-to-end trainable PatchMatch-based MVS approach that combines advantages of trainable costs and regularizations with pixelwise estimates, requiring multiple innovations:

  • Reinforcement learning approach to train end-to-end within a PatchMatch sampling based optimization.

  • Use of normal estimates in learning-based MVS, enabled by trainable PatchMatch optimization and CNN patch features.

  • Depth/normal regularization that applies beyond frontal plane-sweep algorithms; e.g. to our pixelwise depth/normal estimates.

In experiments, our system outperforms other recent learning-based methods on ETH3D and performs similarly on TnT, and our ablation study validates the importance of pixelwise normal and view selection estimates.

2 Related Works

Given correct scene geometry, the pixels that correspond to a surface patch in different calibrated cameras can be determined, and their appearance patterns will be similar (“photometrically consistent”). This core idea of multi-view stereo (MVS) leads to an array of formulations, optimization algorithms, and refinements. We focus on our work’s direct lineage, referring the interested reader to a survey/tutorial [9] and paper list [2] for more complete background and coverage.

The first and simplest formulation is to assign each pixel to one of a set of candidate disparities or depth values [22]. The locally best assignment can be determined by filtering across rows in rectified images, and surface smoothness priors can be easily incorporated within this ordered labeling problem. However, per-view depth labeling has many shortcomings in a wide-baseline MVS setting: (1) depth maps do not align in different views, making consistency checking and fusion more difficult; (2) depth for oblique surfaces is not constant, degrading matching of intensity patches; and (3) the range of depth values may be large, so that large steps in depth are needed to feasibly evaluate the full range. Further, occlusion and partially overlapping images demand more care in evaluating photometric consistency.

These difficulties led to a reformulation of MVS as solving for a depth, normal, and view selection for each pixel in a reference image [27, 41]. The view selection identifies which other source images will be used to evaluate photometric consistency. This more complex formulation creates a challenging optimization problem, since each pixel has a 4D continuous value (depth/normal) and binary label vector (view selection). PatchMatch [3, 5, 27] is well-suited for the depth/normal optimization, since it employs a hypothesize-test-propagate framework that is ideal for efficient inference when labels have a large range but are approximately piecewise constant in local neighborhoods. The pixelwise PatchMatch formulations have been refined with better propagation schemes [32], multi-scale features [32], and plane priors [26, 34]. Though this line of work addresses the shortcomings of the depth labeling approach, it often fails to reconstruct smooth or glossy surfaces where photometric consistency is uninformative, mainly due to the challenge of incorporating global priors, which is addressed in part by Kuhn et al.’s post-process trainable regularization [19]. Also, though handcrafted photometric consistency functions, such as bilaterally weighted NCC, perform well in general, learned functions can potentially outperform them by being context-sensitive.

Naturally, the first inroads to fully trainable MVS also followed the simplistic depth labeling formulation [15, 16, 36], which comfortably fits the CNN forte of learning features, performing inference over “cost volumes” (features or scores for each position/label), and producing label maps. But despite improvements such as using recurrent networks [37] to refine estimates, coarse-to-fine reconstruction [39], visibility maps [35], and attention-based regularization [23], many of the original drawbacks of the depth labeling formulation persist.

Thus, we now have two parallel branches of MVS state-of-the-art: (1) complex hand-engineered formulations with PatchMatch optimization that outperform for large-scale scene reconstruction from sparse wide-baseline views; and (2) deep network depth-labeling formulations that outperform for smaller scenes, smooth surfaces, and denser views. Differentiation-based learning and sampling-based optimization are not easily reconciled with refinements or combinations of existing approaches. Duggal et al. [8] propose a differentiable PatchMatch that optimizes softmax-weighted samples, instead of argmax, and use it to prune the depth search space to initialize depth labeling. We use their idea of one-hot filter banks to perform propagation but use an expectation based loss that sharpens towards argmax during training to enable argmax inference. The very recent PatchmatchNet [30] minimizes a sum of per-iteration losses and employs a one-time prediction of visibility (soft view selection). We use reinforcement learning to train view selection and minimize the loss of the final depth/normal estimates. Our work is the first, to our knowledge, to propose an end-to-end trainable formulation that combines the advantages of pixelwise depth/normal/view estimates and PatchMatch optimization with deep network learned photometric consistency and refinement.

3 PatchMatch-RL MVS

We propose PatchMatch-RL, an end-to-end learning framework that uses PatchMatch for Multi-View Stereo (MVS) reconstruction. Figure 2 shows an overview of our approach. Given a set of images and their corresponding camera poses with intrinsic and extrinsic matrices, our goal is to recover the depths (and normals) of the reference image using a set of selected source images that overlap with the reference image.

Rather than solving only for depth, we also estimate surface normals, which enables propagating hypotheses and comparing spatially distributed features between reference and source images along the local plane. Surface normal estimation improves depth estimates for oblique surfaces and is also useful for consistency checks, surface modeling, and other downstream processing.

Our estimation proceeds coarse-to-fine. At the coarsest level, estimates are randomly initialized and then refined through a series of PatchMatch iterations that consist of pixelwise view selection, candidate propagation, regularized cost computation, and candidate update. The resulting estimates are then upsampled and further refined. This continues until the finest level, after which all depth estimates are fused into a 3D point cloud.

3.1 Initialization

For each level in the coarse-to-fine optimization, we extract CNN features for the reference and source images using a Feature Pyramid Network (FPN) [20]. For memory efficiency, the number of output channels varies per scale, with shallower feature channels in the higher-resolution feature maps. The extracted maps provide a feature vector for each pixel of each image.

Our goal is to solve for an oriented point, consisting of a plane-camera distance and a normal, for each pixel in the reference image. The depth of a pixel follows from the plane-camera distance, the normal, and the pixel's viewing ray. The initial depth is sampled uniformly in inverse depth, between the two bounds that specify the depth range. Sampling from the inverted range prioritizes depths closer to the camera center, which Gallup et al. [10] showed to be effective. The per-pixel normal is initialized independently of depth by sampling from a 3D Gaussian and applying L2 normalization [24]. The normal vector is reversed if it faces the same direction as the pixel ray.
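
The initialization described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the convention that the pixel ray has unit z-component, and the plane equation n·X + distance = 0 are all assumptions made for the example.

```python
import math
import random


def sample_inverse_depth(d_min, d_max, rng):
    """Draw a depth whose inverse is uniform on [1/d_max, 1/d_min]."""
    inv = rng.uniform(1.0 / d_max, 1.0 / d_min)
    return 1.0 / inv


def sample_normal(ray, rng):
    """Draw a unit normal from an isotropic 3D Gaussian, facing the camera."""
    while True:
        n = [rng.gauss(0.0, 1.0) for _ in range(3)]
        length = math.sqrt(sum(c * c for c in n))
        if length > 1e-12:  # guard against a degenerate zero-length draw
            break
    n = [c / length for c in n]
    # Reverse the normal if it faces the same direction as the pixel ray.
    if sum(a * b for a, b in zip(n, ray)) > 0.0:
        n = [-c for c in n]
    return n


def init_oriented_point(ray, d_min, d_max, rng):
    """Return (depth, normal, plane-camera distance) for one pixel.

    Assumption: `ray` is the back-projected pixel with unit z-component,
    and the plane satisfies n . X + distance = 0 in camera coordinates.
    """
    depth = sample_inverse_depth(d_min, d_max, rng)
    normal = sample_normal(ray, rng)
    point = [depth * c for c in ray]
    distance = -sum(a * b for a, b in zip(normal, point))
    return depth, normal, distance
```

Because the inverse depth is uniform, half of the samples fall below the harmonic midpoint of the range, which concentrates hypotheses near the camera.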

3.2 Feature Correlation

The feature maps can be differentiably warped [36] from the reference image to each source image according to the pixelwise plane homographies. Given a support window with a fixed size and dilation centered at a pixel, we define the correlation value of the pixel's oriented point as the attention-aggregated group-wise correlation for matching feature vectors in the source image.

The aggregation combines three terms: the group-wise feature vector correlation [33]; a scaled dot-product attention weight for each supporting pixel with respect to the center pixel, computed from the reference feature map; and an attentional feature projection vector, implemented as a 1x1 convolution. The resulting value represents the similarity of the features centered at the pixel in the reference image and the corresponding features in the source image, according to the oriented point.
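
As a rough illustration of the two building blocks named above, the sketch below computes a group-wise correlation between two feature vectors and then aggregates the per-pixel correlations of a support window with given attention weights. It assumes the common definition of group-wise correlation (mean of channel products within each group) and takes the attention weights as an input; the final 1x1 projection is omitted. All names are hypothetical.

```python
def groupwise_correlation(ref_feat, src_feat, num_groups):
    """Mean channel product within each group of two equal-length vectors."""
    channels = len(ref_feat)
    if channels != len(src_feat) or channels % num_groups != 0:
        raise ValueError("feature sizes must match and divide into groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        lo, hi = g * size, (g + 1) * size
        dot = sum(r * s for r, s in zip(ref_feat[lo:hi], src_feat[lo:hi]))
        out.append(dot / size)
    return out


def aggregate_patch(correlations, attention):
    """Attention-weighted sum of group correlations over a support window.

    `correlations` holds one group-correlation vector per supporting pixel;
    `attention` holds one weight per supporting pixel (assumed normalized).
    """
    num_groups = len(correlations[0])
    return [
        sum(w * corr[g] for w, corr in zip(attention, correlations))
        for g in range(num_groups)
    ]
```

For example, `groupwise_correlation([1, 2, 3, 4], [1, 1, 1, 1], 2)` yields one value per group, so the memory cost scales with the number of groups instead of the number of channels.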

In preliminary experiments, our estimation of normals was poor and did not improve depth estimation. The problem was that the smoothness of features prevented a 3x3 patch from providing much additional information. Making larger patches was not practical due to memory constraints. We solved this problem by dilating the patch kernel, and we further reduced memory usage by producing shallower feature channels.

Figure 3: Estimated Visibilities. The leftmost image corresponds to the reference image, and the right two images are the source images. We color-code the corresponding region to be estimated as visible in the last PatchMatch iteration. The estimated visibility matches precisely with the actual visibility. (Best viewed in color.)

3.3 Pixel-wise View Selection

Based on Schönberger et al. [27], we compute scale, incident-angle, and triangulation angle difference based geometric priors for each source image and each pixel's oriented point. Instead of hand-crafting the prior function, we concatenate the priors with the feature correlations and use a multi-layered perceptron (MLP) to predict a pixel-wise visibility estimate. Figure 3 shows an example of the estimated visibilities in the source images.

We then sample a fixed number of views for each pixel from the L1-normalized probability distribution over the visibility estimates, to obtain a sampled set of views. The visibility probabilities are further used to compute a weighted sum of feature correlations across views.
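
A minimal sketch of the sampling and weighting steps for a single pixel follows. Two details are assumptions made for the example and are not stated in the text: views are drawn without replacement, and the weighted sum is normalized by the total visibility of the sampled views. Function names are hypothetical.

```python
import random


def sample_views(visibility, num_views, rng):
    """Sample distinct view indices with probability proportional to visibility.

    Assumption: sampling is without replacement. At least `num_views`
    entries of `visibility` must be strictly positive.
    """
    remaining = list(range(len(visibility)))
    weights = list(visibility)
    chosen = []
    for _ in range(min(num_views, len(remaining))):
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(idx))
        weights.pop(idx)
    return chosen


def visibility_weighted_correlation(correlations, visibility, views):
    """Visibility-weighted average of per-view correlations (normalized)."""
    total = sum(visibility[v] for v in views)
    return sum(visibility[v] * correlations[v] for v in views) / total
```

`random.Random.choices` normalizes the weights internally, which corresponds to the L1 normalization of the visibility scores.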

3.4 Candidate Propagation

At each PatchMatch iteration, the current oriented point map is propagated according to the propagation kernel. A common kernel is the Red-Black propagation kernel by Galliani et al. [10], as illustrated in Figure 4. For each pixel, the set of candidate oriented points consists of the points obtained through the propagation kernel and a random perturbation of the current candidate. The propagation can be applied using a series of convolutional filters of one-hot encodings, with values of one in the positions that correspond to each neighbor defined by the kernel. The visibility-weighted feature correlations are then computed for each candidate.
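
Convolving a map with a one-hot kernel simply copies the value found at one fixed neighbor offset, so the propagation step can be pictured as a gather over offsets. The sketch below shows this on a small 2D grid; the offsets, the inclusion of the pixel's own value as the first candidate, and the fallback to the pixel's own value at image borders are assumptions for the example (a zero-padded convolution would produce zeros there instead).

```python
def propagate(grid, offsets):
    """Gather candidate values for every pixel from fixed neighbor offsets.

    Each (dy, dx) offset plays the role of one one-hot convolution kernel.
    Out-of-bounds neighbors fall back to the pixel's own value (assumption).
    """
    height, width = len(grid), len(grid[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            candidates = [grid[y][x]]  # keep the current estimate
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    candidates.append(grid[ny][nx])
                else:
                    candidates.append(grid[y][x])
            row.append(candidates)
        out.append(row)
    return out
```

In a full system each grid entry would be an oriented point (depth and normal) rather than a scalar, and a randomly perturbed copy of the current estimate would be appended to the candidate list.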

Figure 4: Red-Black PatchMatch Propagation Kernels, shown in panels (a), (b), and (c). In each kernel, the gray square indicates the pixel to be updated. The red squares indicate the neighboring pixels that provide PatchMatch a set of candidate oriented points for the gray pixel. We use kernel (c) for the coarsest level and kernel (b) for the finer levels.

3.5 Candidate Regularized Cost and Update

Existing learning-based cost regularization methods, such as 3D convolution on spatially aligned cost volume [36] or k-nearest neighbor based graph convolutions [6], exploit ordinal relationships between neighboring label maps. However, there is no consistent relationship between the candidates of a single pixel or between the candidates of neighboring pixels. Instead, we get insight from loopy Belief-Propagation (LBP), where each node’s belief is iteratively updated by message-passing from the neighboring nodes, so that confidently labeled nodes propagate to less confident neighbors. We represent beliefs for each candidate as hidden states, and use a recurrent neural network (RNN) to estimate a regularized score and an updated hidden state for each candidate. Figure 5 illustrates the process.

Similar to LBP, we compute the pairwise neighborhood smoothness [4] of each candidate with respect to the current label, defined as the sum of distances between each oriented point and the plane parameterized by the other oriented point. We append the smoothness terms to the weighted feature correlation as an input to the RNN. The RNN can then aggregate the confidences (represented by feature correlations) over similar oriented points.
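
The smoothness term can be illustrated with a short sketch. It assumes each oriented point is given as a 3D position together with a unit normal, and uses the absolute point-to-plane distance in both directions; the names and this exact parameterization are assumptions for the example.

```python
def point_to_plane(point, plane_point, plane_normal):
    """Absolute distance from `point` to the plane through `plane_point`."""
    return abs(sum(n * (p - q) for n, p, q in zip(plane_normal, point, plane_point)))


def pairwise_smoothness(point_a, normal_a, point_b, normal_b):
    """Sum of the two point-to-plane distances between two oriented points."""
    return (point_to_plane(point_a, point_b, normal_b)
            + point_to_plane(point_b, point_a, normal_a))
```

Two points lying on the same plane with the same normal give a value of zero, so small values indicate candidates that agree with the current label.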

The per-pixel candidates and corresponding hidden states are updated by sampling one candidate per pixel according to the regularized scores and keeping the hidden state of the sampled candidate.

In inference, the candidate is selected by argmax; in training, the sampling hardens from probabilistic to argmax as training progresses. The updated hidden states are used as an input to the recurrent cost regularization module in the next PatchMatch iteration.

Figure 5: Recurrent Cost Regularization. Given the hidden state of each pixel in the current iteration and the visibility-weighted feature correlations of each propagated candidate, we use a Gated Recurrent Unit (GRU) module to estimate the regularized cost and updated hidden state for each plane candidate. Then, the best candidate for the next iteration is hard-sampled according to the regularized costs, replacing the current oriented point at the pixel, and the hidden state of the pixel is updated using that of the sampled candidate.

3.6 Coarse-to-Fine PatchMatch and Fusion

The estimated map of oriented points and the corresponding hidden states are upsampled as an input to the finer level PatchMatch iteration using nearest neighbor interpolation. The oriented point maps of the finest level are fused together into a 3D point cloud by following the method used by other MVS systems [10, 27, 36]. First, consistency is checked for each reference image with the source views using reprojection distance, relative depth distance, and normal consistency. Then, we reproject the mean value of the depths that are consistent across a sufficient number of views into the world space to obtain consensus points.
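
The per-pixel consistency test can be sketched as follows. This is only an illustration of the three checks named above: the dictionary keys, the threshold parameters, the minimum view count, and the choice to include the reference depth in the mean are all hypothetical and not taken from the paper.

```python
def is_consistent(obs, max_reproj_px, max_rel_depth, max_normal_deg):
    """Check one source-view observation against the three thresholds."""
    return (obs["reproj_px"] <= max_reproj_px
            and obs["rel_depth"] <= max_rel_depth
            and obs["normal_deg"] <= max_normal_deg)


def fuse_pixel(ref_depth, observations, min_views,
               max_reproj_px, max_rel_depth, max_normal_deg):
    """Return the mean consistent depth, or None if too few views agree.

    Assumption: the reference depth is averaged together with the depths
    of the consistent source observations.
    """
    depths = [
        obs["depth"] for obs in observations
        if is_consistent(obs, max_reproj_px, max_rel_depth, max_normal_deg)
    ]
    if len(depths) < min_views:
        return None
    return (ref_depth + sum(depths)) / (1 + len(depths))
```

Pixels that return `None` are dropped from the point cloud, and the remaining mean depths are back-projected to world space.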

Figure 6: Qualitative comparison against COLMAP on the ETH3D high-resolution benchmark (column groups: Ref. Image, GT Depth, COLMAP, Ours). From the left: reference image, ground truth depth, depth estimate from COLMAP, normal estimate from COLMAP, depth estimate of our model, and normal estimate of our model. All of the depth maps share the same color scale based on the ground truth depth ranges. Our estimated depths and normals are more complete than those of COLMAP.

4 PatchMatch-RL Training

It is challenging to make PatchMatch MVS end-to-end trainable. The argmax-based hard decisions/sampling required for PatchMatch update and view selection are non-differentiable, and the incorporation of normal estimates with soft-argmax causes depth and normal to depend on each other. We propose a reinforcement learning approach to jointly learn the candidate cost and visibility estimation parameters.

We learn two parameterized functions. The first is a pixel-wise visibility estimation function that outputs a visibility score for each source image given the images and cameras. The second is a matching score function, with its own parameters, that produces a score for each plane candidate given the images, the cameras, and the selected views. Our formulation contains two agents: one selects views and the other selects the candidates.

4.1 Reward Function

We define the reward in each iteration as the probability of observing the current oriented point under a distribution given by the ground truth oriented point value. We define this distribution as a joint independent normal distribution over the depth and the normal of each pixel.

We let the expected future reward be a discounted sum of future rewards. Following the REINFORCE algorithm, we formulate the gradient of the reward as the negation of the gradient of the cross-entropy between the step-wise reward and the agent's policy.
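
To make the reward and its discounted sum concrete, the sketch below evaluates an independent Gaussian likelihood for depth and normal components and accumulates rewards backwards over iterations. The standard deviations `sigma_d` and `sigma_n`, the discount factor `gamma`, and the function names are assumptions for illustration; the paper's actual values are not reproduced here.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1D normal distribution."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def reward(depth, normal, gt_depth, gt_normal, sigma_d, sigma_n):
    """Joint independent normal likelihood of depth and normal components."""
    r = gaussian_pdf(depth, gt_depth, sigma_d)
    for c, g in zip(normal, gt_normal):
        r *= gaussian_pdf(c, g, sigma_n)
    return r


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every iteration t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

The backward recursion computes every return in a single pass, which is convenient when the rewards of all PatchMatch iterations are collected before the update.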


The sampling can be done in two ways: from the categorical distribution, which makes the policy approximate the expectation of the distribution; or by argmax, which makes the policy the greedy solution. As an exploration versus exploitation strategy, we employ a decaying ε-greedy approach where we sample candidates (1) from the categorical distribution with probability ε or (2) using argmax with probability 1 - ε. We also apply a decaying reward.
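
The two sampling modes can be combined in a small selection routine. This is a hedged sketch under the assumption that candidate scores are turned into probabilities with a softmax; `select_candidate` and its arguments are hypothetical names.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_candidate(scores, epsilon, rng):
    """Sample from the softmax with probability epsilon, else take the argmax."""
    if rng.random() < epsilon:
        probs = softmax(scores)
        return rng.choices(range(len(scores)), weights=probs, k=1)[0]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `epsilon` equal to 0 the routine reduces to the greedy argmax used at inference, so decaying `epsilon` moves training smoothly toward the inference behavior.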

Below, we describe the policy of each agent. The view selection agent and the candidate selection agent each have their own state space, action space, policy, and reward. For simplicity, the state, action, and reward discussed below refer to the corresponding agent in a given iteration and apply to a particular pixel.

4.2 Learning Photometric Cost

For the candidate selecting agent, the state space is the set of candidate plane parameters for each oriented point, and the action space is the selection of a candidate label for each pixel in each iteration according to the parameterized photometric cost function. The probability of selecting each candidate is defined as a softmax distribution based on the photometric cost of each plane candidate, and the stochastic policy samples from this distribution.

The parameters can be learned via gradient ascent on the negative cross-entropy between the probability distribution of the candidates given the ground truth and the probability distribution of the candidates estimated by the photometric cost function. The former represents the probability of observing each candidate according to the ground truth.
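
The cross-entropy objective for the candidate policy can be illustrated on raw scores. The sketch assumes the target distribution over candidates is already given and sums to one, and it uses the standard identity that the gradient of the negative cross-entropy with respect to softmax inputs is the target minus the softmax output. Names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, scores):
    """Cross-entropy between a target distribution and softmax(scores)."""
    probs = softmax(scores)
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0.0)


def ascent_direction(target, scores):
    """Gradient of the negative cross-entropy w.r.t. the scores.

    Valid when `target` sums to one: d(-CE)/d(score_i) = target_i - prob_i.
    """
    probs = softmax(scores)
    return [q - p for q, p in zip(target, probs)]
```

Taking a small step along `ascent_direction` raises the probability of candidates that the ground truth favors, which is what gradient ascent on the negative cross-entropy achieves.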

4.3 Learning View Selection

For the view selection agent, the state space contains the set of source images; the action space is a selection of images among the source images for each iteration; and the policy uses the parameterized view selection function to estimate the visibility. The stochastic policy samples views according to the normalized visibility estimates, and its gradient is formulated as in Section 4.1.

For robustness of training, we include only the selected views and worse views in the normalizing denominator of the policy, to prevent minimizing the probabilities of good but unselected views. This incentivizes training to assign more visibility to good views than to bad views (that do not view the point corresponding to the reference pixel).

Test set (each cell: Accuracy / Completeness / F1)
Method | Resolution | Time (s) | 2cm Indoor | 2cm Outdoor | 2cm Combined | 5cm Indoor | 5cm Outdoor | 5cm Combined
ACMH [32] | 3200x2130 | 546.77 | 91.1 / 64.8 / 73.9 | 84.0 / 80.0 / 81.8 | 89.3 / 68.6 / 75.9 | 97.4 / 78.0 / 83.7 | 94.1 / 75.0 / 90.4 | 96.6 / 87.1 / 85.4
Gipuma [10] | 2000x1332 | 272.81 | 86.3 / 31.4 / 41.9 | 78.8 / 45.3 / 55.2 | 84.4 / 34.9 / 45.2 | 95.8 / 42.1 / 54.9 | 93.8 / 54.3 / 67.2 | 95.3 / 45.1 / 58.0
COLMAP [27] | 3200x2130 | 2245.57 | 92.0 / 59.7 / 70.4 | 92.0 / 73.0 / 80.8 | 92.0 / 63.0 / 73.0 | 96.6 / 73.0 / 82.0 | 97.1 / 83.9 / 89.7 | 96.8 / 75.7 / 84.0
PVSNet [35] | 1920x1280 | - | 65.6 / 78.6 / 70.9 | 68.8 / 84.3 / 75.7 | 66.4 / 80.1 / 72.1 | 82.4 / 87.8 / 84.7 | 84.5 / 92.7 / 88.2 | 82.9 / 89.0 / 85.6
PatchmatchNet [30] | 2688x1792 | 491.69 | 68.8 / 74.6 / 71.3 | 72.3 / 86.0 / 78.5 | 69.7 / 77.5 / 73.1 | 84.6 / 85.1 / 84.7 | 87.0 / 92.0 / 89.3 | 85.2 / 86.8 / 85.9
Ours | 1920x1280 | 556.50 | 73.2 / 70.0 / 70.9 | 78.3 / 78.3 / 76.8 | 74.5 / 72.1 / 72.4 | 88.0 / 83.7 / 85.5 | 92.6 / 89.0 / 90.5 | 89.2 / 85.0 / 86.8

Train set (each cell: Accuracy / Completeness / F1)
Method | Resolution | Time (s) | 2cm Indoor | 2cm Outdoor | 2cm Combined | 5cm Indoor | 5cm Outdoor | 5cm Combined
ACMH [32] | 3200x2130 | 486.35 | 92.6 / 59.2 / 70.0 | 84.7 / 64.4 / 71.5 | 88.9 / 61.6 / 70.7 | 97.7 / 70.1 / 80.5 | 95.4 / 75.6 / 83.5 | 96.6 / 72.7 / 81.9
Gipuma [10] | 2000x1332 | 243.34 | 89.3 / 24.6 / 35.8 | 83.2 / 25.3 / 37.1 | 86.5 / 24.9 / 36.4 | 96.2 / 34.0 / 47.1 | 95.5 / 36.7 / 51.7 | 95.9 / 35.2 / 49.2
COLMAP [27] | 3200x2130 | 2102.71 | 95.0 / 52.9 / 66.8 | 88.2 / 57.7 / 68.7 | 91.9 / 55.1 / 67.7 | 98.0 / 66.6 / 78.5 | 96.1 / 73.8 / 82.9 | 97.1 / 69.9 / 80.5
PatchmatchNet [30] | 2688x1792 | 473.92 | 63.7 / 67.7 / 64.7 | 66.1 / 62.8 / 63.7 | 64.8 / 65.4 / 64.2 | 78.7 / 80.0 / 78.9 | 86.8 / 73.2 / 78.5 | 82.4 / 76.9 / 78.7
Ours | 1920x1280 | 555.58 | 76.6 / 60.7 / 66.7 | 75.4 / 64.0 / 69.1 | 76.1 / 62.2 / 67.8 | 89.6 / 76.5 / 81.4 | 88.8 / 81.4 / 85.7 | 90.5 / 78.8 / 83.3

Table 1: Results on the ETH3D high-resolution MVS benchmark train and test sets. We do not train on any ETH3D data. Bold denotes the method with the highest F1 score for each setting. Results from several other methods are shown for comparison. We measure the mean time taken for reconstructing each scene (including the fusion stage) using the author provided code on the same hardware. PVSNet results are not available on the train set. Our method outperforms other recent learning-based approaches (PVSNet and PatchmatchNet) in most of the metrics.

Each cell: Precision / Recall / F1
Method | Intermediate | Advanced
CIDER [33] | 42.8 / 55.2 / 46.8 | 26.6 / 21.3 / 23.1
COLMAP [27] | 43.2 / 44.5 / 42.1 | 33.7 / 24.0 / 27.2
R-MVSNet [37] | 43.7 / 57.6 / 48.4 | 31.5 / 22.1 / 24.9
CasMVSNet [11] | 47.6 / 74.0 / 56.8 | 29.7 / 35.2 / 31.1
AttMVS [23] | 61.9 / 58.9 / 60.1 | 40.6 / 27.3 / 31.9
PatchmatchNet [30] | 43.6 / 69.4 / 53.2 | 27.3 / 41.7 / 32.3
PVSNet [35] | 53.7 / 63.9 / 56.9 | 29.4 / 41.2 / 33.5
BP-MVSNet [29] | 51.3 / 68.8 / 57.6 | 29.6 / 35.6 / 31.4
Ours | 45.9 / 62.3 / 51.8 | 30.6 / 36.7 / 31.8

Table 2: Results on the Tanks and Temples benchmark. As in Table 1, the best performing model based on F1 score is marked in bold. Our method performs on par with existing learning-based methods on the advanced set.

5 Experiments

We evaluate our work on two large-scale benchmarks: Tanks and Temples Benchmark [18] and ETH3D High-Res Multi-View Benchmark [28].

5.1 Training Details

For all experiments, we train using the BlendedMVS dataset [38], which contains a combination of 113 object, indoor, and outdoor scenes with large viewpoint variations. We use the low-res version of the dataset which has a spatial resolution of . Throughout training and evaluation, we use and , 3 layers of hidden states , for photometric scorer and view selection scorer respectively, and feature map sizes corresponding to and of the original image size. For training, we use , , and iterations, and for evaluation we use , , and iterations for each scale respectively. We use the PatchMatch kernel shown in Figure 4(b) for training. As an exploitation versus exploration strategy, we employ a decaying ε-greedy approach where we either sample candidates proportional to their softmax scores with a probability of ε or select the argmax candidate with a probability of 1 - ε. The initial value of ε is 0.9, with an exponential decay of 0.999 per step.
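
The exploration schedule stated above (initial value 0.9, exponential decay of 0.999 per step) amounts to a one-line function. The function name and the interpretation of "step" as one training step are assumptions.

```python
def epsilon_at(step, initial=0.9, decay=0.999):
    """Exploration probability after `step` training steps."""
    return initial * decay ** step
```

Under this schedule the probability of categorical sampling falls to roughly a third after 1000 steps, after which most selections are greedy.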

To promote view-selection robustness, for each reference image, we select 6 total views from the same scene: 3 random views and 3 views sampled from the 10 best views according to BlendedMVS. Among 6 source images, we sample 1 best visibility-scoring image as visible and 2 worst visibility-scoring images as invisible. We train the model with Adam [17] and set the initial learning rate to and the decay to per epoch. We implemented our approach in PyTorch. We use an Nvidia RTX 3090 for training and evaluation.

5.2 ETH3D High-Res Multi-View Benchmark

We evaluate our method on the ETH3D High-res Multi-View Benchmark, which contains 17 different indoor or outdoor scenes with 6048x4032 resolution images for each scene. For evaluation, we fix the number of the source views to 10 and sample the 3 best views. We use a fixed image resolution of 1920x1280 (see Table 1) with camera intrinsics obtained by COLMAP [27]. The system takes 13.5 seconds and uses 7693MB of peak memory for each reference image. Table 1 provides quantitative results. Our method achieves comparable results to the other listed methods on the standard 2cm benchmark and the best results on the 5cm benchmark. Most learning-based methods fail to produce reasonable results on ETH3D because there are few images with wide baselines and large depth ranges. In Figure 6, we compare the inferred depth and normal maps with COLMAP [27]. Our method covers weakly textured regions, such as white walls and pillars, more completely than COLMAP [27], while still maintaining good accuracy. However, the model may fail to reconstruct reflective surfaces and large texture-less areas.

Figure 7: Point cloud reconstruction results. For the top row, from left to right, we show Ballroom and Playground from the Tanks and Temples benchmark [18]. For the bottom row, from left to right, we show the reconstruction results of COLMAP [27], PatchmatchNet [30] and our method in the Exhibition Hall from the ETH3D benchmark [28]. See the benchmark website for more reconstruction results.

Each cell: Accuracy / Completeness / F1
Model | Train 2cm | Train 5cm
w/o normal | 62.9 / 58.0 / 54.0 | 81.1 / 76.7 / 75.1
w/o view sel. | 75.8 / 56.7 / 64.1 | 89.4 / 72.9 / 79.8
w/o rcr. | 75.6 / 60.9 / 66.7 | 89.0 / 77.9 / 82.6
Ours | 76.1 / 62.2 / 67.8 | 90.5 / 78.8 / 83.3

Table 3: Ablation Study on ETH3D High-Res Training Set. We compare our original system to using 1x1 feature patches that do not take advantage of normal estimates (“w/o normal”), without pixelwise view selection, instead using the top 3-ranked source images for all pixels (“w/o view sel.”), and without recurrent cost regularization (“w/o rcr.”). The model with the highest score is marked with bold for each threshold.

5.3 Tanks and Temples Benchmark

With the same trained model, we evaluate on the Tanks and Temples [18] intermediate and advanced benchmarks, which contain 8 intermediate and 6 advanced large-scale scenes respectively. As with the ETH3D High-res benchmark, we fix the number of the source views to 10, sample the 3 best views, and fix the image resolution to . Our method takes 12.1 seconds and uses 5801MB of peak memory for each reference image. Table 2 shows the quantitative results of the benchmark. We achieve similar results to CasMVSNet [11] and PatchmatchNet [30]. In Figure 7, we present qualitative results on the reconstructed point clouds. Our method generates complete and accurate reconstructions, including repeated textures such as carpet and thin structures such as poles for the swing.

5.4 Ablation Studies

Table 3 shows how each component contributes to the performance of our method.

Importance of normals: Our use of normals enables modeling oblique surfaces and provides a locally planar support region for photometric costs, which has otherwise been achieved through deformable convolution [30] or k-NN [6]. Without normal estimation for more non-frontal planar propagation and support, accuracy drops by 13.2% and completeness drops by 3.2% for the 2cm threshold on ETH3D (Table 3 “w/o normal”).

Importance of pixelwise view selection: Without pixelwise selection of the source images, the completeness of the reconstruction at the 2cm threshold drops by 5.5% and accuracy drops slightly (Table 3 “w/o view sel”). Pixelwise view selection makes better use of many source views that are partially overlapping the reference image.
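The difference between the two settings can be sketched on a toy example. The code below is a deterministic simplification under our own assumptions, not the trained system: `scores[v][p]` is a hypothetical per-pixel matching or visibility score for source view `v` at pixel `p`, and the global baseline ranks views by their mean score, which is only one possible ranking criterion.

```python
def global_top_k(scores, k):
    """Pick the k views with the best mean score and reuse them for every pixel."""
    n_views, n_pixels = len(scores), len(scores[0])
    mean = [sum(row) / len(row) for row in scores]
    best = sorted(range(n_views), key=lambda v: mean[v], reverse=True)[:k]
    return [sorted(best) for _ in range(n_pixels)]


def pixelwise_top_k(scores, k):
    """Pick, independently for each pixel, the k views with the best score there."""
    n_views, n_pixels = len(scores), len(scores[0])
    selection = []
    for p in range(n_pixels):
        best = sorted(range(n_views), key=lambda v: scores[v][p], reverse=True)[:k]
        selection.append(sorted(best))
    return selection


# Four source views, four pixels. Views 0 and 1 cover the whole reference image
# with moderate scores; views 2 and 3 each overlap only half of it, but strongly.
scores = [
    [0.6, 0.6, 0.6, 0.6],
    [0.5, 0.5, 0.5, 0.5],
    [0.9, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.9],
]

print("global   :", global_top_k(scores, 2))
print("pixelwise:", pixelwise_top_k(scores, 2))
```

In this toy case the global ranking never uses views 2 and 3, although each is the best source for half of the pixels, whereas the pixelwise selection picks them exactly where they overlap the reference image.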

Importance of recurrent cost regularization: We introduce recurrent cost regularization to aggregate confidences (i.e., feature correlations) across similar points without requiring aligned cost volumes. For comparison, we try scoring candidates using a multi-layer network based on only the feature correlations for the centered patch. With this simplification, the overall F1 score drops by 1.1% for the 2cm threshold (Table 3 “w/o rcr.”).

Importance of argmax sampling: We tried to train using soft-argmax based candidates, where we take the expectation of the normals and the depths independently. However, we failed to train an effective model because the initial aggregated values cluster around the middle of the depth range, which limits the range of predictions. Existing works may avoid this problem by sampling in a more restrictive way, e.g., by performing a depth search with reduced range [8] or by sampling initial points from uniformly separated bins [30]. Our reinforcement learning approach enables us to perform argmax sampling in the same way as non-learning based approaches while benefiting from learned representations.
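The clustering effect described above can be reproduced with a few lines. This is a minimal sketch for depth only, assuming candidates spread over the full depth range and nearly uninformative scores such as an untrained network might produce; the depth range, the candidate count, and the score values are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax(candidates, scores):
    """Expectation of the candidates under the softmax of their scores."""
    return sum(p * c for p, c in zip(softmax(scores), candidates))


def hard_argmax(candidates, scores):
    """Candidate with the highest score (a hard, non-differentiable decision)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best]


depth_min, depth_max, n = 1.0, 100.0, 8
candidates = [depth_min + (depth_max - depth_min) * i / (n - 1) for i in range(n)]
# Nearly flat scores: no candidate is clearly preferred yet.
scores = [0.00, 0.01, 0.02, 0.01, 0.00, -0.01, 0.00, 0.01]

print("mid-range   :", (depth_min + depth_max) / 2)
print("soft-argmax :", round(soft_argmax(candidates, scores), 2))
print("hard argmax :", round(hard_argmax(candidates, scores), 2))
```

With nearly flat scores the expectation lands close to the middle of the range regardless of which candidates were sampled, while the hard argmax returns one of the sampled candidates and can therefore reach any part of the range.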

6 Conclusion

We propose an end-to-end trainable MVS system that estimates pixelwise depths, normals, and visibilities using PatchMatch optimization. We use reinforcement learning and a decaying ε-greedy sampling in training to learn effectively despite using view selection and argmax sampling in inference. Our system performs well compared to the latest learning-based MVS systems, but further improvements are possible. For example, we have not yet incorporated some of the sophisticated geometric checks of ACMM [32] or the post-process refinement of DeepC-MVS [19], and higher resolution processing would also yield better results. By incorporating most of the critical ideas from non-learning based methods into a learning-based framework, our work provides a promising direction for further improvements in end-to-end approaches.


This research was supported by ONR MURI Award N00014-16-1-2007.


  • [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision 120 (2), pp. 153–168. Cited by: §1.
  • [2] Awesome 3d reconstruction list. Cited by: §2.
  • [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman PatchMatch: a randomized correspondence algorithm for structural image editing. Cited by: §1, §2.
  • [4] F. Besse, C. Rother, A. Fitzgibbon, and J. Kautz (2012) PMBP: patchmatch belief propagation for correspondence field estimation. In BMVC (Best Industrial Impact Prize award). Cited by: §3.5.
  • [5] M. Bleyer, C. Rhemann, and C. Rother PatchMatch stereo - stereo matching with slanted support windows. Cited by: §1, §2.
  • [6] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547. Cited by: §3.5, §5.4.
  • [7] J. DeGol, M. Golparvar-Fard, and D. Hoiem (2016) Geometry-informed material recognition. In CVPR. Cited by: §1.
  • [8] S. Duggal, S. Wang, W. Ma, R. Hu, and R. Urtasun (2019) Deeppruner: learning efficient stereo matching via differentiable patchmatch. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4384–4393. Cited by: §1, §2, §5.4.
  • [9] Y. Furukawa and C. Hernández (2015) Multi-view stereo: a tutorial. Found. Trends Comput. Graph. Vis. 9, pp. 1–148. Cited by: §2.
  • [10] S. Galliani, K. Lasinger, and K. Schindler (2015) Massively parallel multiview stereopsis by surface normal diffusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 873–881. Cited by: §3.1, §3.4, §3.6, Table 1.
  • [11] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan (2020) Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 2, §5.3.
  • [12] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li (2019) Group-wise correlation stereo network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • [13] K. Han, J. Degol, and M. Golparvar-Fard (2018) Geometry- and appearance-based reasoning of construction progress monitoring. Journal of Construction Engineering and Management 144 (2), pp. 04017110. Cited by: §1.
  • [14] Y. Hu, R. Song, and Y. Li (2016) Efficient coarse-to-fine patchmatch for large displacement optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5704–5712. Cited by: §1.
  • [15] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) Deepmvs: learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830. Cited by: §1, §2.
  • [16] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang (2017) Surfacenet: an end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2307–2315. Cited by: §1, §2.
  • [17] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.). Cited by: §5.1.
  • [18] A. Knapitsch, J. Park, Q. Zhou, and V. Koltun (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36 (4). Cited by: §1, Figure 7, §5.3, §5.
  • [19] A. Kuhn, C. Sormann, M. Rossi, O. Erdler, and F. Fraundorfer (2020) DeepC-mvs: deep confidence prediction for multi-view stereo reconstruction. In 2020 International Conference on 3D Vision (3DV), pp. 404–413. Cited by: §2, §6.
  • [20] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §3.1.
  • [21] J. Lu, H. Yang, D. Min, and M. N. Do (2013) Patch match filter: efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1854–1861. Cited by: §1.
  • [22] B. D. Lucas and T. Kanade (1981) An iterative image registration technique with an application to stereo vision. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679. Cited by: §2.
  • [23] K. Luo, T. Guan, L. Ju, Y. Wang, Z. Chen, and Y. Luo (2020) Attention-aware multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1590–1599. Cited by: §2, Table 2.
  • [24] M. E. Muller (1959) A note on a method for generating points uniformly on n-dimensional spheres. Commun. ACM 2 (4), pp. 19–20. ISSN 0001-0782. Cited by: §3.1.
  • [25] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza (2018) EMVS: event-based multi-view stereo—3d reconstruction with an event camera in real-time. International Journal of Computer Vision 126 (12), pp. 1394–1414. Cited by: §1.
  • [26] A. Romanoni and M. Matteucci (2019) Tapa-mvs: textureless-aware patchmatch multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10413–10422. Cited by: §1, §2.
  • [27] J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV). Cited by: Figure 1, §2, §3.3, §3.6, Table 1, Table 2, Figure 7, §5.2.
  • [28] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, Figure 7, §5.
  • [29] C. Sormann, P. Knobelreiter, A. Kuhn, M. Rossi, T. Pock, and F. Fraundorfer (2020) BP-mvsnet: belief-propagation-layers for multi-view-stereo. In 2020 International Conference on 3D Vision (3DV), Los Alamitos, CA, USA, pp. 394–403. Cited by: Table 2.
  • [30] F. Wang, S. Galliani, C. Vogel, P. Speciale, and M. Pollefeys (2020) PatchmatchNet: learned multi-view patchmatch stereo. arXiv preprint arXiv:2012.01411. Cited by: §2, Table 1, Table 2, Figure 7, §5.3, §5.4, §5.4.
  • [31] R. Wang, M. Schworer, and D. Cremers (2017) Stereo dso: large-scale direct sparse visual odometry with stereo cameras. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3903–3911. Cited by: §1.
  • [32] Q. Xu and W. Tao (2019) Multi-scale geometric consistency guided multi-view stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5483–5492. Cited by: §1, §2, Table 1, §6.
  • [33] Q. Xu and W. Tao (2020) Learning inverse depth regression for multi-view stereo with correlation cost volume. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12508–12515. Cited by: §3.2, Table 2.
  • [34] Q. Xu and W. Tao (2020) Planar prior assisted patchmatch multi-view stereo. AAAI Conference on Artificial Intelligence (AAAI). Cited by: §1, §2.
  • [35] Q. Xu and W. Tao (2020) PVSNet: pixelwise visibility-aware multi-view stereo network. arXiv preprint arXiv:2007.07714. Cited by: §2, Table 1, Table 2.
  • [36] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §2, §3.2, §3.5, §3.6.
  • [37] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5534. Cited by: §2, Table 2.
  • [38] Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan (2020) BlendedMVS: a large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR). Cited by: §5.1.
  • [39] Z. Yu and S. Gao (2020) Fast-mvsnet: sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1949–1958. Cited by: §1, §2.
  • [40] J. Žbontar and Y. LeCun (2016) Stereo matching by training a convolutional neural network to compare image patches. The Journal of Machine Learning Research 17 (1), pp. 2287–2318. Cited by: §1.
  • [41] E. Zheng, E. Dunn, V. Jojic, and J. Frahm (2014) Patchmatch based joint view selection and depthmap estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1517. Cited by: §1, §2.