Log In Sign Up

Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference

by   Yao Yao, et al.

Deep learning has recently demonstrated its excellent performance for multi-view stereo (MVS). However, one major limitation of current learned MVS approaches is the scalability: the memory-consuming cost volume regularization makes the learned MVS hard to be applied to high-resolution scenes. In this paper, we introduce a scalable multi-view stereo framework based on the recurrent neural network. Instead of regularizing the entire 3D cost volume in one go, the proposed Recurrent Multi-view Stereo Network (R-MVSNet) sequentially regularizes the 2D cost maps along the depth direction via the gated recurrent unit (GRU). This reduces dramatically the memory consumption and makes high-resolution reconstruction feasible. We first show the state-of-the-art performance achieved by the proposed R-MVSNet on the recent MVS benchmarks. Then, we further demonstrate the scalability of the proposed method on several large-scale scenarios, where previous learned approaches often fail due to the memory constraint. Code is available at


page 3

page 4

page 6

page 7

page 8

page 12

page 14

page 15


Fast-MVSNet: Sparse-to-Dense Multi-View Stereo With Learned Propagation and Gauss-Newton Refinement

Almost all previous deep learning-based multi-view stereo (MVS) approach...

Learning Inverse Depth Regression for Multi-View Stereo with Correlation Cost Volume

Deep learning has shown to be effective for depth inference in multi-vie...

SRH-Net: Stacked Recurrent Hourglass Network for Stereo Matching

The cost aggregation strategy shows a crucial role in learning-based ste...

Non-local Recurrent Regularization Networks for Multi-view Stereo

In deep multi-view stereo networks, cost regularization is crucial to ac...

BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks

While deep learning has recently achieved great success on multi-view st...

DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

We propose an online multi-view depth prediction approach on posed video...

DRO: Deep Recurrent Optimizer for Structure-from-Motion

There are increasing interests of studying the structure-from-motion (Sf...

Code Repositories


MVSNet (ECCV2018) & R-MVSNet (CVPR2019)

view repo


Deep learning based depth map estimation

view repo

1 Introduction

Multi-view stereo (MVS) aims to recover the dense representation of the scene given multi-view images and calibrated cameras. While traditional methods have achieved excellent performance on MVS benchmarks, recent works [14, 13, 30] show that learned approaches are able to produce results comparable to the traditional state-of-the-arts. In particular, MVSNet [30]

proposed a deep architecture for depth map estimation, which significantly boosts the reconstruction completeness and the overall quality.

One of the key advantages of learning-based MVS is the cost volume regularization, where most networks apply multi-scale 3D CNNs [14, 15, 30] to regularize the 3D cost volume. However, this step is extremely memory expensive: it operates on 3D volumes and the memory requirement grows cubically with the model resolution (Fig. 1 (d)). Consequently, current learned MVS algorithms could hardly be scaled up to high-resolution scenarios.

Recent works on 3D with deep learning also acknowledge this problem. OctNet [24] and O-CNN [28] exploit the sparsity in 3D data and introduce the octree structure to 3D CNNs. SurfaceNet [14] and DeepMVS [13] apply the engineered divide-and-conquer strategy to the MVS reconstruction. MVSNet [30] builds the cost volume upon the reference camera frustum to decouple the reconstruction into smaller problems of per-view depth map estimation. However, when it comes to a high-resolution 3D reconstruction (e.g., volume size voxels), these methods will either fail or take a long time for processing.

To this end, we present a novel scalable multi-view stereo framework, dubbed as R-MVSNet, based on the recurrent neural network. The proposed network is built upon the MVSNet architecture [30], but regularizes the cost volume in a sequential manner using the convolutional gated recurrent unit (GRU) rather than 3D CNNs. With the sequential processing, the online memory requirement of the algorithm is reduced from cubic to quadratic to the model resolution (Fig. 1 (c)). As a result, the R-MVSNet is applicable to high resolution 3D reconstruction with unlimited depth-wise resolution.

We first evaluate the R-MVSNet on DTU [1], Tanks and Temples [17] and ETH3D [26] datasets, where our method produces results comparable or even outperforms the state-of-the-art MVSNet [30]. Next, we demonstrate the scalability of the proposed method on several large-scale scenarios with detailed analysis on the memory consumption. R-MVSNet is much more efficient than other methods in GPU memory and is the first learning-based approach applicable to such wide depth range scenes, e.g., the advance set of Tanks and Temples dataset [17].

Figure 1: Illustrations of different regularization schemes. For the interest red voxel, we use voxels in blue to denote its receptive field during the cost volume regularization. The runtime memory requirement is also listed on top of the volume, where H, W and D denote the image height, width and depth sample number respectively. The 3D CNNs gather the cost information across the whole space, however, requires a runtime memory cubical to the model resolution

2 Related Work

Learning-based MVS Reconstruction

Recent learning-based approaches have shown great potentials for MVS reconstruction. Multi-patch similarity [11] is proposed to replace the traditional cost metric with the learned one. SurfaceNet [14] and DeepMVS [13] pre-warp the multi-view images to 3D space, and regularize the cost volume using CNNs. LSM [15] proposes differentiable projection operations to enable the end-to-end MVS training. Our approach is mostly related to MVSNet [30], which encodes camera geometries in the network as differentiable homography and infers the depth map for the reference image. While some methods have achieved excellent performance in MVS benchmarks, aforementioned learning-based pipelines are restricted to small-scale MVS reconstructions due to the memory constraint.

Scalable MVS Reconstruction

The memory requirement of learned cost volume regularizations [14, 15, 13, 5, 30] grows cubically with the model resolution, which will be intractable when large image sizes or wide depth ranges occur. Similar problem also exists in traditional MVS reconstructions (e.g., semi-global matching [12]) if the whole volume is taken as the input to the regularization. To mitigate the scalability issue, learning-based OctNet [24] and O-CNN [28] exploit the sparsity in 3D data and introduce the octree structure to 3D CNNs, but are still restricted to reconstructions with resolution

voxels. Heuristic divide-and-conquer strategies are applied in both classical

[18] and learned MVS approaches [14, 13], however, usually lead to the loss of global context information and the slow processing speed.

On the other hand, scalable traditional MVS algorithms all regularize the cost volume implicitly. They either apply local depth propagation [19, 9, 10, 25] to iteratively refine depth maps/point clouds, or sequentially regularize the cost volume using simple plane sweeping [7] and 2D spatial cost aggregation with depth-wise winner-take-all [29, 31]. In this work, we follow the idea of sequential processing, and propose to regularize the cost volume using the convolutional GRU [6]. GRU is a RNN architecture [8] initially proposed for learning sequential speech and text data, and is recently applied to 3D volume processing, e.g., video sequence analysis [3, 34]. For our task, the convolutional GRU gathers spatial as well as temporal context information in the depth direction, which is able to achieve comparable regularization results to 3D CNNs.

Figure 2: The R-MVSNet architecture. Deep image features are extracted from input images and then warped to the fronto-parallel planes of the reference camera frustum. The cost maps are computed at different depths and are sequentially regularized by the convolutional GRU. The network is trained as a classification problem with the cross-entropy loss

3 Network Architecture

This section describes the detailed network architecture of R-MVSNet. Our method can be viewed as an extension to the recent MVSNet [30] with cost volume regularization using convolutional GRU. We first review the MVSNet architecture in Sec. 3.1, and then introduce the recurrent regularization in Sec. 3.2 and the corresponding loss formulation in Sec. 3.3.

3.1 Review of MVSNet

Given a reference image and a set of its neighboring source images , MVSNet [30] proposes an end-to-end deep neural network to infer the reference depth map . In its network, deep image features are first extracted from input images through a 2D network. These 2D image features will then be warped into the reference camera frustum by differentiable homographies to build the feature volumes in 3D space. To handle arbitrary

-view image input, a variance based cost metric is proposed to map N feature volumes to one cost volume

. Similar to other stereo and MVS algorithms, MVSNet regularizes the cost volume using the multi-scale 3D CNNs, and regresses the reference depth map through the soft argmin [16] operation. A refinement network is applied at the end of MVSNet to further enhance the depth map quality. As deep image features

are downsized during the feature extraction, the output depth map size is

to the original image size in each dimension.

MVSNet has shown state-of-the-art performance on DTU dataset [1] and the intermediate set of Tanks and Temples dataset [17], which contain scenes with outside-looking-in camera trajectories and small depth ranges. However, MVSNet can only handle a maximum reconstruction scale at with the 16 GB large memory Tesla P100 GPU, and will fail at larger scenes e.g., the advanced set of Tanks and Temples. To resolve the scalability issue especially for the wide depth range reconstructions, we will introduce the novel recurrent cost volume regularization in the next section.

3.2 Recurrent Regularization

Sequential Processing

An alternative to globally regularize the cost volume in one go is to sequentially process the volume through the depth direction. The simplest sequential approach is the winner-take-all plane sweeping stereo [7], which crudely replaces the pixel-wise depth value with the better one and thus suffers from noise (Fig. 1 (a)). To improve, cost aggregation methods [29, 31] filter the matching cost at different depths (Fig. 1 (b)) so as to gather spatial context information for each cost estimation. In this work, we follow the idea of sequential processing, and propose a more powerful recurrent regularization scheme based on convolutional GRU. The proposed method is able to gather spatial as well as the uni-directional context information in the depth direction (Fig. 1 (c)), which achieves regularization results comparable to the full-space 3D CNNs but is much more efficient in runtime memory.

Figure 3: Reconstruction pipeline. (a) Image 24 of DTU [1] scan 15. (b) Initial depth map from the network. (c) Final depth map (Sec. 4.3

). (d) Ground truth depth map. (e) Point cloud output. (f) Probability estimation for depth map filtering (Sec. 

4.3). (g) The gradient visualization of the initial depth map. (h) The gradient visualization after the refinement (Sec. 4.2)
Convolutional GRU

Cost volume could be viewed as cost maps concatenated in the depth direction. If we denote the output of regularized cost maps as , for the ideal sequential processing at the step, should be dependent on cost maps of the current step as well as all previous steps . Specifically, in our network we apply a convolutional variant of GRU to aggregate such temporal context information in depth direction, which corresponds to the time direction in language processing. In the following, we denote ‘’ as the element-wise multiplication, ‘[]’ the concatenation and ‘’ the convolution operation. Cost dependencies are formulated as:


where is the update gate map to decide whether to update the output for current step, is the regularized cost map of late step, and could be viewed as the updated cost map in current step, which is defined as:


here is the reset gate map to decide how much the previous should affect the current update.

is the nonlinear mapping, which is the element-wise sigmoid function. The update gate and reset gate maps are also related to the current input and previous output:


and are learned parameters. The nonlinear is the hyperbolic tangent to make soft decisions for the updates.

The convolutional GRU architecture not only spatially regularizes the cost maps through 2D convolutions, but also aggregates the temporal context information in depth direction. We will show in the experiment section that our GRU regularization can significantly outperform the simple winner-take-all or only the spatial cost aggregation.

Stacked GRU

The basic GRU model is comprised of a single layer. To further enhance the regularization ability, more GRU units could be stacked to make a deeper network. In our experiments, we adopt a 3-layer stacked GRU structure (Fig. 2). Specifically, we first apply a 2D convolutional layer to map the 32-channel cost map to 16-channel as the input to the first GRU layer. The output of each GRU layer will be used as the input to the next GRU layer, and the output channel numbers of the 3 layers are set to 16, 4, 1 respectively. The regularized cost maps

will finally go through a softmax layer to generate the probability volume

for calculating the training loss.

3.3 Training Loss

Most deep stereo/MVS networks regress the disparity/depth outputs using the soft argmin operation [16], which can be interpreted as the expectation value along the depth direction [30]. The expectation formulation is valid if depth values are uniformly sampled within the depth range. However, in recurrent MVSNet, we apply the inverse depth to sample the depth values in order to efficiently handle reconstructions with wide depth ranges. Rather than treat the problem as a regression task, we train the network as a multi-class classification problem with cross entropy loss:


where is the spatial image coordinate and is a voxel in the probability volume .

is the ground truth binary occupancy volume, which is generated by the one-hot encoding of the ground truth depth map.

is the corresponding voxel to .

One concern about the classification formulation is the discretized depth map output [32, 21, 13]. To achieve sub-pixel accuracy, a variational depth map refinement algorithm is proposed in Sec. 4.2 to further refine the depth map output. In addition, while we need to compute the whole probability volume during training, for testing, the depth map can be sequentially retrieved from the regularized cost maps using the winner-take-all selection.

4 Reconstruction Pipeline

The proposed network in the previous section generates the depth map per-view. This section describes the non-learning parts of our 3D reconstruction pipeline.

4.1 Preprocessing

To estimate the reference depth map using R-MVSNet, we need to prepare: 1) the source images of the given reference image , 2) the depth range of the reference view and 3) the depth sample number for sampling depth values using the inverse depth setting.

For selecting the source images, we follow MVSNet [30] to score each image pair using a piece-wise Gaussian function w.r.t. the baseline angle of the sparse point cloud [33]. The neighboring source images are selected according to the pair scores in descending order. The depth range is also determined by the sparse point cloud with the implementation of COLMAP [25]. Depth samples are chosen within using the inverse depth setting and we determine the total depth sample number by adjusting the temporal depth resolution to the spatial image resolution (details are described in the supplementary material).

4.2 Variational Depth Map Refinement

As mentioned in Sec. 3.3, a depth map will be retrieved from the regularized cost maps through the winner-take-all selection. Compare to the soft argmin [16] operation, the argmax operation of winner-take-all cannot produce depth estimations with sub-pixel accuracy. To alleviate the stair effect (see Fig. 3 (g) and (h)), we propose to refine the depth map in a small depth range by enforcing the multi-view photo-consistency.

Given the reference image , the reference depth map and one source image , we project to through to form the reprojected image . The image reprojection error between and at pixel is defined as:


where is the photo-metric error between two pixels, is the regularization term to ensure the depth map smoothness. We choose the zero-mean normalized cross-correlation (ZNCC) to measure the photo-consistency , and use the bilateral squared depth difference between and its neighbors for smoothness.

During the refinement, we iteratively minimize the total image reprojection error between the reference image and all source images w.r.t. depth map

. It is noteworthy that the initial depth map from R-MVSNet has already achieved satisfying result. The proposed variational refinement only fine-tunes the depth values within a small range to achieve sub-pixel depth accuracy, which is similar to the quadratic interpolation in stereo methods

[32, 21] and the DenseCRF in DeepMVS [13].

4.3 Filtering and Fusion

Similar to other depth map based MVS approaches[10, 25, 30], we filter and fuse depth maps in R-MVSNet into a single 3D point cloud. The photo-metric and the geometric consistencies are considered in depth map filtering. As described in previous sections, the regularized cost maps will go through a softmax layer to generate the probability volume. In our experiments, we take the corresponding probability of the selected depth value as its confidence measurement (Fig. 3 (f)), and we will filter out pixels with probability lower than a threshold of . The geometric constraint measures the depth consistency among multiple views, and we follow the geometric criteria in MVSNet [30] that pixels should be at least three view visible. For depth map fusion, we apply the visibility-based depth map fusion [22] as well as the mean average fusion [30] to further enhance the depth map quality and produce the 3D point cloud. Illustrations of our reconstruction pipeline are shown in Fig. 3.

5 Experiments

5.1 Implementation


We train R-MVSNet on the DTU dataset [1], which contains over 100 scans taken under 7 different lighting conditions and fixed camera trajectories. While the dataset only provides the ground truth point clouds, we follow MVSNet [30] to generate the rendered depth maps for training. The training image size is set to and the input view number is . The depth hypotheses are sampled from 425mm to 905mm with . In addition, to prevent depth maps from being biased on the GRU regularization order, each training sample is passed to the network with forward GRU regularization from to as well as the backward regularization from to . The dataset is splitted into the same training, validation and evaluation sets as previous works [14, 30]

. We choose TensorFlow


for the network implementation, and the model is trained for 100k iterations with batch size of 1 on a GTX 1080Ti graphics card. RMSProp is chosen as the optimizer and the learning rate is set to 0.001 with an exponential decay of 0.9 for every 10k iterations.


For testing, we use images as input, and the inverse depth samples are adaptively selected as described in Sec. 4.1. For Tanks and Temples dataset, the camera parameters are computed from OpenMVG [23] as suggested by MVSNet [30]. Depth map refinement, filtering and fusion are implemented using OpenGL on the same GTX 1080Ti GPU.

5.2 Benchmarks

We first demonstrate the state-of-the-art performance of the proposed R-MVSNet, which produces results comparable to or outperforms the previous MVSNet [30].

DTU Dataset [1]

We evaluate the proposed method on the DTU evaluation set. To compare R-MVSNet with MVSNet [30], we set and for all scans. Quantitative results are shown in Table 1. The accuracy and the completeness are calculated using the matlab script provided by the DTU dataset. To summarize the overall reconstruction quality, we calculate the average of the mean accuracy and the mean completeness as the overall score. Our R-MVSNet produces the best reconstruction completeness and overall score among all methods. Qualitative results can be found in Fig. 3.

Tanks and Temples Benchmark [17]

Unlike the indoor DTU dataset, Tanks and Temples is a large dataset captured in more complex environments. Specifically, the dataset is divided into the intermediate and the advanced sets. The intermediate set contains scenes with outside-look-in camera trajectories, while the advanced set contains large scenes with complex geometric layouts, where almost all previous learned algorithms fail due to the memory constraint.

The proposed method ranks on the intermediate set, which outperforms the original MVSNet [30]. Moreover, R-MVSNet successfully reconstructs all scenes and also ranks on the advanced set. The reconstructed point clouds are shown in Fig. 5. It is noteworthy that the benchmarking result of Tanks and Temples is highly dependent on the point cloud density. Our depth map is of size , which is relatively low-resolution and will result in low reconstruction completeness. So for the evaluation, we linearly upsample the depth map from the network by two () before the depth map refinement. The f_scores of intermediate and advanced sets increase from to and from to respectively.

ETH3D Benchmark [26]

We also evaluate our method on the recent ETH3D benchmark. The dataset is divided into the low-res and the high-res scenes, and provides the ground truth depth maps for MVS training. We first fine-tune the model on the ETH3D low-res training set, however, observe no performance gain compared to the model only pre-trained on DTU. We suspect the problem may be some images in low-res training set are blurred and overexposed as they are captured using hand-held devices. Also, the scenes of ETH3D dataset are complicated in object occlusions, which are not explicitly handled in the proposed network. We evaluate on this benchmark without fine-tuning the network. Our method achieves similar performance to MVSNet [30] and ranks on the low-res benchmark.

Mean Acc. Mean Comp. Overall ()
Camp [4] 0.835 0.554 0.695
Furu [9] 0.613 0.941 0.777
Tola [27] 0.342 1.19 0.766
Gipuma [10] 0.283 0.873 0.578
Colmap [10] 0.400 0.664 0.532
SurfaceNet [14] 0.450 1.04 0.745
MVSNet (D=256) [30] 0.396 0.527 0.462
R-MVSNet (D=256) 0.385 0.459 0.422
R-MVSNet (D=512) 0.383 0.452 0.417
Table 1: Quantitative results on the DTU evaluation scans [1]. R-MVSNet outperforms all methods in terms of reconstruction completeness and overall quality
Figure 4: Our results and the ground truth point clouds of scans 10 and 23, DTU [1] dataset
Dataset MVSNet[30] R-MVSNet (Ours) Mem-Util
Rank H W Ave. D Mem. Mem-Util Rank H W Ave. D Mem. Mem-Util Ratio
DTU [1] 2 1600 1184 256 15.4 GB 1.97 M 1 1600 1200 512 6.7 GB 9.17 M 4.7
T. Int. [17] 4 1920 1072 256 15.3 GB 2.15 M 3 1920 1080 898 6.7 GB 17.4 M 8.1
T. Adv. [17] - - - - - - 3 1920 1080 698 6.7 GB 13.5 M -
ETH3D [26] 5 928 480 320 8.7 GB 1.02 M 6 928 480 351 2.1 GB 4.65 M 4.6
Table 2: Comparisons between MVSNet [30] and the proposed R-MVSNet on benchmarking rankings, reconstruction scales and GPU memory requirements on the three MVS datasets [1, 17, 26]. The memory utility (Mem-Util) measures the data size processed per memory unit, and the high ratio between the two algorithms reflects the scalability of R-MVSNet
Figure 5: Point cloud reconstructions of Tanks and Temples dataset [17]

5.3 Scalability

Next, we demonstrate the scalability of R-MVSNet from: 1) wide-range and 2) high-resolution depth reconstructions.

Wide-range Depth Reconstructions

The memory requirement of R-MVSNet is independent to the depth sample number , which enables the network to infer depth maps with large depth range that is unable to be recovered by previous learning-based MVS methods. Some large scale reconstructions of Tanks and Temples dataset are shown in Fig. 5. Table 2 compares MVSNet [30] and R-MVSNet in terms of benchmarking rankings, reconstruction scales and memory requirements. We define the algorithm’s memory utility (Mem-Util) as the size of volume processed per memory unit ( / runtime memory size). R-MVSNet is more efficient than MVSNet in Mem-Util.

High-resolution Depth Reconstructions

R-MVSNet can also produce high-resolution depth reconstructions by sampling denser in depth direction. For the DTU evaluation in Sec. 5.2, if we fix the depth range and change the depth sample number from to , the overall distance score will be reduced from to (see last row of Table 1).

Ref. Pho. Geo. Fus. Acc. Comp. Overall 0.385 0.459 0.422 0.444 0.486 0.465 0.550 0.384 0.467 0.479 0.385 0.432 0.498 0.364 0.431 0.605 0.373 0.489 0.591 0.411 0.501
Figure 6: Ablation studies on network architectures, which demonstrate the importance of learned features and learned regularization. WTA is referred to Winner-Take-All and the figure records testing depth map errors during training
Table 3: Ablation studies on different combinations of variational Refinement, Photo-metic filtering, Geometry filtering and depth map Fusion for post-processing. Tested on DTU [1] evaluation set
Figure 7: Depth map reconstructions of scan 11, DTU dataset [1] using different image features and cost volume regularization methods. All models are trained for 100k iterations

5.4 Ablation studies

5.4.1 Networks

This section studies how different components in the network affect the depth map reconstruction. We perform the study on DTU validation set with , and use the average absolute difference between the inferred and the ground truth depth maps for the quantitative comparison. We denote the learned 2D image features as 2D CNNs. The comparison results of following settings are shown Fig. 6 and Fig. 7:

2D CNNs + 3D CNNs

Replace the GRU regularization with the same 3D CNNs regularization in MVSNet [30]. As shown in Fig. 6 and Fig. 7, 3D CNNs produces the best depth map reconstructions.


The setting of the proposed R-MVSNet, which produces the best depth map results among all settings. The qualitative comparison between 3D CNNs and GRU is shown in Fig. 7 (d) and (e).

2D CNNs + Spatial

Replace the GRU regularization with the simple spatial regularization. We approach the spatial regularization by a simple 3-layer, 32-channel 2D network on the cost map. The depth map error of spatial regularization is larger than the GRU regularization.

2D CNNs + Winner-Take-All

Replace the GRU regularization with simple the winner-take-all selection. We apply a single layer, 1-channel 2D CNN to directly map the cost map to the regularized cost map. The depth map error is further larger than the spatial regularization.

ZNCC + Winner-Take-All

Replace the learned image feature and cost metric with the engineered (window size of ). This setting is also referred to the classical plane sweeping [7]. As expected, plane sweeping produces the highest depth map error among all methods.

5.4.2 Post-processing

Next, we study the influences of post processing steps on the final point cloud reconstruction. We reconstruct the DTU evaluation without the variational refinement, photo-metric filtering, geometric filtering or depth map fusion. Quantitative results are shown in Table 3.

Without Variational Refinement

This setting is similar to the post-processing of MVSNet [30]. The f_score is changed to a larger number of 0.465, demonstrating the effectiveness of the proposed depth map refinement.

Without Photo-metric Filtering

Table 3 shows that the f_score without photo-metric filtering is increased to a larger number of 0.467, which demonstrates the importance of the probability map for photo-metric filtering (Fig. 3 (f)).

Without Geo-metric Filtering

The f_score is increased to 0.432, showing the effectiveness of depth consistency.

Without Depth Map Fusion

The f_score is also increased to 0.431, showing the effectiveness of depth fusion.

5.5 Discussion

Running Time

For DTU evaluation with , R-MVSNet generates the depth map at a speed of / view. Specifically, it takes to infer the initial depth map and to perform the depth map refinement. It is noteworthy that the runtime of depth map refinement only relates to refinement iterations and the input image size. Filtering and fusion takes neglectable runtime.


R-MVSNet is trained with fixed input size of , but it is applicable to arbitrary input size during testing. It is noteworthy that we use the model trained on the DTU dataset [1] for all our experiments without fine-tuning. While R-MVSNet has shown satisfying generalizability to the other two datasets [17, 26], we hope to train R-MVSNet on a more diverse MVS dataset, and expect better performances on Tanks and Temples [17] and ETH3D [26] benchmarks in the future.

Limitation on Image Resolution

While R-MVSNet is applicable to reconstructions with unlimited depth-wise resolution, the reconstruction scale is still restricted to the input image size. Currently R-MVSNet can handle a maximum input image size of on a 11GB GPU, which covers all modern MVS benchmarks except for the ETH3D high-res benchmark ().

6 Conclusions

We presented a scalable deep architecture for high-resolution multi-view stereo reconstruction. Instead of using 3D CNNs, the proposed R-MVSNet sequentially regularizes the cost volume through the depth direction with the convolutional GRU, which dramatically reduces the memory requirement for learning-based MVS reconstructions. Experiments show that with the proposed post-processing, R-MVSNet is able to produce high quality benchmarking results as the original MVSNet [30]. Also, R-MVSNet is applicable to large-scale reconstructions which cannot be handled by the previous learning-based MVS approaches.


  • [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-scale data for multiple-view stereopsis.

    International Journal of Computer Vision (IJCV)

    , 2016.
  • [2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng.

    TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

    Software available from
  • [3] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. International Conference on Learning Representations (ICLR), 2016.
  • [4] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. European Conference on Computer Vision (ECCV), 2008.
  • [5] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network.

    Computer Vision and Pattern Recognition (CVPR)

    , 2018.
  • [6] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation.

    Empirical Methods in Natural Language Processing (EMNLP)

    , 2014.
  • [7] R. T. Collins. A space-sweep approach to true multi-image matching. Computer Vision and Pattern Recognition (CVPR), 1996.
  • [8] J. L. Elman. Finding structure in time. Cognitive science, 1990.
  • [9] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
  • [10] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. International Conference on Computer Vision (ICCV), 2015.
  • [11] W. Hartmann, S. Galliani, M. Havlena, L. Van Gool, and K. Schindler. Learned multi-patch similarity. International Conference on Computer Vision (ICCV), 2017.
  • [12] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2008.
  • [13] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [14] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. Surfacenet: An end-to-end 3d neural network for multiview stereopsis. International Conference on Computer Vision (ICCV), 2017.
  • [15] A. Kar, C. Häne, and J. Malik. Learning a multi-view stereo machine. Advances in Neural Information Processing Systems (NIPS), 2017.
  • [16] A. Kendall, H. Martirosyan, S. Dasgupta, and P. Henry. End-to-end learning of geometry and context for deep stereo regression. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (TOG), 2017.
  • [18] A. Kuhn, H. Hirschmüller, D. Scharstein, and H. Mayer. A tv prior for high-quality scalable multi-view stereo reconstruction. International Journal of Computer Vision (IJCV), 2017.
  • [19] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2005.
  • [20] S. Li, S. Y. Siu, T. Fang, and L. Quan. Efficient multi-view surface refinement with adaptive resolution control. European Conference on Computer Vision (ECCV), 2016.
  • [21] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. Computer Vision and Pattern Recognition (CVPR), 2016.
  • [22] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér, and M. Pollefeys. Real-time visibility-based fusion of depth maps. International Conference on Computer Vision (ICCV), 2007.
  • [23] P. Moulon, P. Monasse, R. Marlet, and Others. Openmvg. an open multiple view geometry library.
  • [24] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [25] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2016.
  • [26] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [27] E. Tola, C. Strecha, and P. Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications (MVA), 2012.
  • [28] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong.

    O-cnn: Octree-based convolutional neural networks for 3d shape analysis.

    ACM Transactions on Graphics (TOG), 2017.
  • [29] Q. Yang. A non-local cost aggregation method for stereo matching. Computer Vision and Pattern Recognition (CVPR), 2012.
  • [30] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2018.
  • [31] K.-J. Yoon and I. S. Kweon. Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2006.
  • [32] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR), 2016.
  • [33] R. Zhang, S. Li, T. Fang, S. Zhu, and L. Quan. Joint camera clustering and surface segmentation for large-scale multi-view stereo. International Conference on Computer Vision (ICCV), 2015.
  • [34] X. Zhu, J. Dai, X. Zhu, Y. Wei, and L. Yuan. Towards high performance video object detection for mobiles. arXiv preprint arXiv 1804.05830, 2018.

1 Network Architecture

This section describes the network architecture of R-MVSNet (Table 1). R-MVSNet constructs cost maps at different depths, and recurrently regularizes cost maps through the depth direction. The probability volume need to be explicitly computed during the network training, but for testing, we can sequentially retrieve the regularized cost maps and all layers only require the GPU memory with size linear to the input image resolution.

Output Layer Input Output Size
Image Features Extration
2D_0 ConvBR,K=3x3,S=1,F=8 HW8
2D_1 ConvBR,K=3x3,S=1,F=8 2D_0 HW 8
2D_2 ConvBR,K=5x5,S=2,F=16 2D_1 12H12W16
2D_3 ConvBR,K=3x3,S=1,F=16 2D_2 12H12W16
2D_4 ConvBR,K=3x3,S=1,F=16 2D_3 12H12W16
2D_5 ConvBR,K=5x5,S=2,F=32 2D_4 14H14W32
2D_6 ConvBR,K=3x3,S=1,F=32 2D_5 14H14W32
Conv,K=3x3,S=1,F=32 2D_6 14H14W32
Differentiable Homography Warping
DH-Warping 14H14W32
Cost Map Construction
Variance Cost Metric 14H14W32
GRU Regularization
Conv,K=3x3,S=1,F=16 14H14W16
GRU, K=3x3, F=16 14H14W16
GRU, K=3x3, F=4 14H14W4
GRU, K=3x3, F=1 14H14W1
Probability Volume Construction
Softmax 14H14WD
Table 1:

R-MVSNet architecture. We denote the 2D convolution as Conv and use BR to abbreviate the batch normalization and the Relu. K is the kernel size, S the kernel stride and F the output channel number. N, H, W, D denote input view number, image width, height and depth sample number respectively

2 Depth Sample Number

Given the depth range , we sample depth values using the inverse depth setting:


where is the index of the depth sampling and is the depth sample number. To determine the sample number , we assume that the spatial image resolution should be the same as the temporal depth resolution. Supposing and are two 3D points by projecting the reference image center and its neighboring pixel to the space at depth , the spatial image resolution at depth is defined as . Meanwhile, we define the temporal depth resolution at depth as . Considering Equation 1, the depth sample number is calculated as:

Figure 1: Point cloud reconstructions of DTU dataset [1] with different post-processing settings

3 Variational Depth Map Refinement

We derive the iterative minimization procedure for Equation 8 in the main paper. Focusing on one pixel in the reference image, we denote its corresponding 3D point in the space as , where , and are the projection matrix, camera center of the reference camera and the depth of pixel . The projection of in the source image is . For the photo-consistency term, we assume and abbreviate it as . The image reprojection error will be changed as deforms, and we take the derivative of the photo-consistency term w.r.t. to depth :


where is the Jacobian of the projection matrix . is the derivative of the photo-metric measurement w.r.t. the pixel coordinate. For computing the derivatives of NCC and ZNCC, we refer readers to [20] for detailed implementations. Also, considering , the derivative of the smoothness term can be derived as:


where is the bilateral smoothness weighting.

We iteratively minimize the total image reprojection error by gradient descent with a descending step size of and . The reference depth map and all reprojected images will be updated at each step. The refinement iteration is fixed to 20 for all our experiments.

4 Sliding Window 3D CNNs

One concern about R-MVSNet is that whether the proposed GRU regularization could be simply replaced by streaming the 3D CNNs regularization in the depth direction. To address this concern, we conduct two more ablation studies. For DTU dataset, we divide the cost volume () into sub-volumes () along the depth direction. To better regularize the boundary voxels, we set the overlap between two adjacent sub-volumes to , so in this way is divided into 7 subsequent sub-volumes . We then sequentially apply 3D CNNs (except for the softmax layer) on to obtain the regularized sub-volumes. Then, we generate the final depth map by two different fusion strategies:

  • Volume Fusion First concatenate the regularized sub-volumes (truncated with to fit the overlap region) in depth direction. Then apply softmax and soft argmin to regress the final depth map.

  • Depth Map Fusion First regress 7 depth maps and probability maps from the regularized sub-volumes. Then fuse the 7 depth maps into the final depth map by winner-take-all selection on probability maps.

Qualitative and quantitative results are shown in Fig. 2. Both sliding strategies produce errors higher than GRU and 3D CNNs. Also, sliding strategies take to infer depth map (), which is slower than MVSNet and R-MVSNet.

The sliding window 3D CNNs regularization is a depth-wise divide-and-conquer algorithm and there are two major limitations: 1) One is the discrepancies among sub-volumes, as sub-volumes are not regularized as a whole. 2) The second is the limited size of the sub-volume, which is far less than the actual receptive field size of the multi-scale 3D CNNs (). As a result, such strategies cannot be fully benefit from the powerful 3D CNNs regularization.

Figure 2: Sliding window 3D CNNs. (a) and (b) are depth map results of the proposed two fusion strategies in A1.

5 Post-processing

We show in Fig. 1 the qualitative point cloud results of DTU evaluation set [1]

using different post-processing settings. The photo-metric filtering and the geometric filtering are able to remove different kinds of outliers and produce visually clean point clouds. Depth map refinement and depth map fusion have little influence on the qualitative results, however, they are able to reduce the

score for the quantitative evaluation (Table 3 in the main paper).

6 Point Cloud Results

This section presents the point cloud reconstructions of DTU dataset [1], Tanks and Temples benchmark [17] and ETH3D benchmark [26] that have not been shown in the main paper. The point cloud results of the three datasets can be found in Fig. 3, Fig. 4 and Fig. 5 respectively. R-MVSNet is able to produce visually clean and complete point cloud for all reconstructions.

Figure 3: Point cloud reconstructions of DTU evaluation set [1]
Figure 4: Point cloud reconstructions of Tanks and Temples dataset [17]
Figure 5: Point cloud reconstructions of ETH3D low-res dataset [26]