MVSNet (ECCV2018) & R-MVSNet (CVPR2019)
Deep learning has recently demonstrated excellent performance for multi-view stereo (MVS). However, one major limitation of current learned MVS approaches is scalability: the memory-consuming cost volume regularization makes learned MVS hard to apply to high-resolution scenes. In this paper, we introduce a scalable multi-view stereo framework based on the recurrent neural network. Instead of regularizing the entire 3D cost volume in one go, the proposed Recurrent Multi-view Stereo Network (R-MVSNet) sequentially regularizes the 2D cost maps along the depth direction via the gated recurrent unit (GRU). This dramatically reduces the memory consumption and makes high-resolution reconstruction feasible. We first show the state-of-the-art performance achieved by the proposed R-MVSNet on the recent MVS benchmarks. Then, we further demonstrate the scalability of the proposed method on several large-scale scenarios, where previous learned approaches often fail due to the memory constraint. Code is available at https://github.com/YoYo000/MVSNet.
Multi-view stereo (MVS) aims to recover the dense representation of the scene given multi-view images and calibrated cameras. While traditional methods have achieved excellent performance on MVS benchmarks, recent works [14, 13, 30] show that learned approaches are able to produce results comparable to the traditional state of the art. In particular, MVSNet proposed a deep architecture for depth map estimation, which significantly boosts the reconstruction completeness and the overall quality.
One of the key advantages of learning-based MVS is the cost volume regularization, where most networks apply multi-scale 3D CNNs [14, 15, 30] to regularize the 3D cost volume. However, this step is extremely memory expensive: it operates on 3D volumes and the memory requirement grows cubically with the model resolution (Fig. 1 (d)). Consequently, current learned MVS algorithms could hardly be scaled up to high-resolution scenarios.
Recent works on 3D with deep learning also acknowledge this problem. OctNet  and O-CNN  exploit the sparsity in 3D data and introduce the octree structure to 3D CNNs. SurfaceNet  and DeepMVS  apply the engineered divide-and-conquer strategy to the MVS reconstruction. MVSNet  builds the cost volume upon the reference camera frustum to decouple the reconstruction into smaller problems of per-view depth map estimation. However, when it comes to a high-resolution 3D reconstruction (e.g., volume size voxels), these methods will either fail or take a long time for processing.
To this end, we present a novel scalable multi-view stereo framework, dubbed R-MVSNet, based on the recurrent neural network. The proposed network is built upon the MVSNet architecture, but regularizes the cost volume in a sequential manner using the convolutional gated recurrent unit (GRU) rather than 3D CNNs. With the sequential processing, the online memory requirement of the algorithm is reduced from cubic to quadratic in the model resolution (Fig. 1 (c)). As a result, R-MVSNet is applicable to high-resolution 3D reconstruction with unlimited depth-wise resolution.
We first evaluate R-MVSNet on the DTU, Tanks and Temples and ETH3D datasets, where our method produces results comparable to or even better than the state-of-the-art MVSNet. Next, we demonstrate the scalability of the proposed method on several large-scale scenarios with detailed analysis of the memory consumption. R-MVSNet is much more efficient than other methods in GPU memory and is the first learning-based approach applicable to such wide depth range scenes, e.g., the advanced set of the Tanks and Temples dataset.
Recent learning-based approaches have shown great potential for MVS reconstruction. Multi-patch similarity is proposed to replace the traditional cost metric with a learned one. SurfaceNet and DeepMVS pre-warp the multi-view images to 3D space, and regularize the cost volume using CNNs. LSM proposes differentiable projection operations to enable end-to-end MVS training. Our approach is most closely related to MVSNet, which encodes camera geometries in the network as differentiable homographies and infers the depth map for the reference image. While some methods have achieved excellent performance in MVS benchmarks, the aforementioned learning-based pipelines are restricted to small-scale MVS reconstructions due to the memory constraint.
The memory requirement of learned cost volume regularizations [14, 15, 13, 5, 30] grows cubically with the model resolution, which becomes intractable when large image sizes or wide depth ranges occur. A similar problem also exists in traditional MVS reconstructions (e.g., semi-global matching) if the whole volume is taken as the input to the regularization. To mitigate the scalability issue, learning-based OctNet and O-CNN exploit the sparsity in 3D data and introduce the octree structure to 3D CNNs, but are still restricted to reconstructions with resolution voxels. Heuristic divide-and-conquer strategies are applied in both classical and learned MVS approaches [14, 13]; however, they usually lead to the loss of global context information and slow processing speed.
On the other hand, scalable traditional MVS algorithms all regularize the cost volume implicitly. They either apply local depth propagation [19, 9, 10, 25] to iteratively refine depth maps/point clouds, or sequentially regularize the cost volume using simple plane sweeping and 2D spatial cost aggregation with depth-wise winner-take-all [29, 31]. In this work, we follow the idea of sequential processing, and propose to regularize the cost volume using the convolutional GRU. GRU is an RNN architecture initially proposed for learning sequential speech and text data, and has recently been applied to 3D volume processing, e.g., video sequence analysis [3, 34]. For our task, the convolutional GRU gathers spatial as well as temporal context information in the depth direction, and is able to achieve regularization results comparable to 3D CNNs.
This section describes the detailed network architecture of R-MVSNet. Our method can be viewed as an extension to the recent MVSNet  with cost volume regularization using convolutional GRU. We first review the MVSNet architecture in Sec. 3.1, and then introduce the recurrent regularization in Sec. 3.2 and the corresponding loss formulation in Sec. 3.3.
Given a reference image and a set of its neighboring source images, MVSNet proposes an end-to-end deep neural network to infer the reference depth map. In its network, deep image features are first extracted from input images through a 2D network. These 2D image features will then be warped into the reference camera frustum by differentiable homographies to build the feature volumes in 3D space. To handle arbitrary N-view image input, a variance based cost metric is proposed to map N feature volumes to one cost volume. Similar to other stereo and MVS algorithms, MVSNet regularizes the cost volume using the multi-scale 3D CNNs, and regresses the reference depth map through the soft argmin operation. A refinement network is applied at the end of MVSNet to further enhance the depth map quality. As deep image features are downsized during the feature extraction, the output depth map size is 1/4 of the original image size in each dimension.
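The variance based cost metric mentioned above can be illustrated with a minimal sketch. It assumes each feature volume has been flattened to a plain list of floats and uses the population variance (division by N); the function name `variance_cost` and the toy values are hypothetical and not taken from the released code.

```python
def variance_cost(feature_volumes):
    """Element-wise variance across N flattened feature volumes of equal length."""
    n = len(feature_volumes)
    size = len(feature_volumes[0])
    cost = []
    for k in range(size):
        values = [vol[k] for vol in feature_volumes]
        mean = sum(values) / n
        # Population variance: low where all views agree, high where they differ.
        cost.append(sum((v - mean) ** 2 for v in values) / n)
    return cost


# Three views, each with a flattened 4-element feature volume (toy values).
views = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 5.0, 4.0],
    [1.0, 2.0, 7.0, 4.0],
]
cost_volume = variance_cost(views)
```

Because the variance is symmetric in its inputs, the same function accepts any number of views, which is what makes the metric suitable for arbitrary N-view input.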
MVSNet has shown state-of-the-art performance on DTU dataset  and the intermediate set of Tanks and Temples dataset , which contain scenes with outside-looking-in camera trajectories and small depth ranges. However, MVSNet can only handle a maximum reconstruction scale at with the 16 GB large memory Tesla P100 GPU, and will fail at larger scenes e.g., the advanced set of Tanks and Temples. To resolve the scalability issue especially for the wide depth range reconstructions, we will introduce the novel recurrent cost volume regularization in the next section.
An alternative to globally regularizing the cost volume in one go is to process the volume sequentially along the depth direction. The simplest sequential approach is the winner-take-all plane sweeping stereo, which crudely replaces the pixel-wise depth value with the better one and thus suffers from noise (Fig. 1 (a)). To improve on this, cost aggregation methods [29, 31] filter the matching cost at different depths (Fig. 1 (b)) so as to gather spatial context information for each cost estimation. In this work, we follow the idea of sequential processing, and propose a more powerful recurrent regularization scheme based on the convolutional GRU. The proposed method is able to gather spatial as well as uni-directional context information in the depth direction (Fig. 1 (c)), which achieves regularization results comparable to the full-space 3D CNNs but is much more efficient in runtime memory.
The cost volume can be viewed as $D$ cost maps $\{C(i)\}_{i=1}^{D}$ concatenated in the depth direction. If we denote the output of regularized cost maps as $\{C_r(i)\}_{i=1}^{D}$, for the ideal sequential processing at the $t$-th step, $C_r(t)$ should be dependent on cost maps of the current step $C(t)$ as well as all previous steps $\{C(i)\}_{i=1}^{t-1}$. Specifically, in our network we apply a convolutional variant of GRU to aggregate such temporal context information in the depth direction, which corresponds to the time direction in language processing. In the following, we denote '$\odot$' as the element-wise multiplication, '$[\,]$' the concatenation and '$*$' the convolution operation. Cost dependencies are formulated as:

$$C_r(t) = (1 - U(t)) \odot C_r(t-1) + U(t) \odot C_u(t)$$

where $U(t)$ is the update gate map to decide whether to update the output for the current step, $C_r(t-1)$ is the regularized cost map of the previous step, and $C_u(t)$ could be viewed as the updated cost map in the current step, which is defined as:

$$C_u(t) = \sigma_c(W_c * [C(t), R(t) \odot C_r(t-1)] + b_c)$$

Here $R(t)$ is the reset gate map to decide how much the previous $C_r(t-1)$ should affect the current update. $\sigma_c(\cdot)$ is the nonlinear mapping, which is the element-wise sigmoid function. The update gate and reset gate maps are also related to the current input and previous output:

$$R(t) = \sigma_g(W_r * [C(t), C_r(t-1)] + b_r)$$

$$U(t) = \sigma_g(W_u * [C(t), C_r(t-1)] + b_u)$$

$W$ and $b$ are learned parameters. The nonlinear $\sigma_g(\cdot)$ is the hyperbolic tangent to make soft decisions for the updates.
The convolutional GRU architecture not only spatially regularizes the cost maps through 2D convolutions, but also aggregates the temporal context information in depth direction. We will show in the experiment section that our GRU regularization can significantly outperform the simple winner-take-all or only the spatial cost aggregation.
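As an illustration of the recurrence described above, the following is a minimal single-channel sketch of one convolutional GRU step in plain Python. It is not the authors' implementation: the helper names (`conv3x3`, `map2d`, `gru_step`), the single-channel simplification and the zero padding are assumptions. The sketch uses a sigmoid for the two gates and a hyperbolic tangent for the candidate map, which is the conventional GRU assignment that keeps the gates in (0, 1); treat that choice as an assumption of the sketch as well.

```python
import math


def conv3x3(channels, kernels, bias):
    """Zero-padded 3x3 convolution: several HxW input channels -> one HxW map."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[bias for _ in range(w)] for _ in range(h)]
    for ch, k in zip(channels, kernels):
        for y in range(h):
            for x in range(w):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            out[y][x] += k[dy + 1][dx + 1] * ch[yy][xx]
    return out


def map2d(fn, m):
    """Apply a scalar function to every element of a 2D map."""
    return [[fn(v) for v in row] for row in m]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(cost, prev, params):
    """One depth step: current cost map and previous output -> new regularized map."""
    (w_r, b_r), (w_u, b_u), (w_c, b_c) = params
    # Both gates see the concatenation [current cost map, previous output].
    reset = map2d(sigmoid, conv3x3([cost, prev], w_r, b_r))
    update = map2d(sigmoid, conv3x3([cost, prev], w_u, b_u))
    # The reset gate scales how much of the previous output enters the candidate.
    gated_prev = [[r * p for r, p in zip(rr, pr)] for rr, pr in zip(reset, prev)]
    cand = map2d(math.tanh, conv3x3([cost, gated_prev], w_c, b_c))
    # Blend previous output and candidate according to the update gate.
    return [[(1 - u) * p + u * c for u, p, c in zip(ur, pr, cr)]
            for ur, pr, cr in zip(update, prev, cand)]


# Toy check with all-zero weights: gates are 0.5 and the candidate is 0,
# so the output is half of the previous regularized map.
zero_k = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]  # 2 input channels
params = [(zero_k, 0.0)] * 3
prev = [[1.0, 2.0], [3.0, 4.0]]
cost = [[0.5, 0.5], [0.5, 0.5]]
out = gru_step(cost, prev, params)
```

Only the current cost map and the previous output are held at any step, which is why the memory footprint does not grow with the number of depth samples.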
The basic GRU model is comprised of a single layer. To further enhance the regularization ability, more GRU units could be stacked to make a deeper network. In our experiments, we adopt a 3-layer stacked GRU structure (Fig. 2). Specifically, we first apply a 2D convolutional layer to map the 32-channel cost map to 16 channels as the input to the first GRU layer. The output of each GRU layer will be used as the input to the next GRU layer, and the output channel numbers of the 3 layers are set to 16, 4, 1 respectively. The regularized cost maps will finally go through a softmax layer to generate the probability volume for calculating the training loss.
Most deep stereo/MVS networks regress the disparity/depth outputs using the soft argmin operation, which can be interpreted as the expectation value along the depth direction. The expectation formulation is valid if depth values are uniformly sampled within the depth range. However, in recurrent MVSNet, we apply the inverse depth to sample the depth values in order to efficiently handle reconstructions with wide depth ranges. Rather than treat the problem as a regression task, we train the network as a multi-class classification problem with cross entropy loss:

$$Loss = \sum_{p} \left( \sum_{i=1}^{D} -Q(i, p) \cdot \log P(i, p) \right)$$

where $p$ is the spatial image coordinate and $P(i, p)$ is a voxel in the probability volume $P$. $Q$ is the ground truth binary occupancy volume, which is generated by the one-hot encoding of the ground truth depth map. $Q(i, p)$ is the corresponding voxel to $P(i, p)$.
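The classification loss can be sketched in a few lines. Because the ground truth volume is one-hot along depth, the cross entropy at each pixel reduces to the negative log of the probability assigned to the ground truth depth index. The sketch below assumes one list of D regularized values per pixel and treats them directly as softmax logits; the function names and that sign convention are assumptions, not the released code.

```python
import math


def softmax(values):
    """Numerically stable softmax over the depth samples of one pixel."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(cost_columns, gt_indices):
    """Sum over pixels of -log P(ground truth depth index)."""
    loss = 0.0
    for column, gt in zip(cost_columns, gt_indices):
        probs = softmax(column)
        # One-hot ground truth keeps only the term at the true depth index.
        loss += -math.log(probs[gt])
    return loss


# Two pixels with D = 4 uninformative (uniform) outputs each.
columns = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
loss = cross_entropy_loss(columns, gt_indices=[0, 3])
```

With uniform outputs every depth index receives probability 1/4, so each pixel contributes log 4 to the loss.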
One concern about the classification formulation is the discretized depth map output [32, 21, 13]. To achieve sub-pixel accuracy, a variational depth map refinement algorithm is proposed in Sec. 4.2 to further refine the depth map output. In addition, while we need to compute the whole probability volume during training, for testing, the depth map can be sequentially retrieved from the regularized cost maps using the winner-take-all selection.
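The test-time retrieval described above can be sketched as a running arg max that consumes one regularized map per depth step and never stores the full volume. The function name and the toy values are hypothetical.

```python
def sequential_winner_take_all(score_maps, depth_values):
    """Consume one HxW map per depth step; keep only the running best per pixel."""
    best_score = None
    best_depth = None
    for scores, depth in zip(score_maps, depth_values):
        if best_score is None:
            # First depth step initializes the running maximum.
            best_score = [row[:] for row in scores]
            best_depth = [[depth] * len(row) for row in scores]
            continue
        for y, row in enumerate(scores):
            for x, s in enumerate(row):
                if s > best_score[y][x]:
                    best_score[y][x] = s
                    best_depth[y][x] = depth
    return best_depth, best_score


# A 1x2 image with three depth hypotheses (toy values).
maps = [[[0.1, 0.7]], [[0.6, 0.2]], [[0.3, 0.1]]]
depth_map, score_map = sequential_winner_take_all(maps, [425.0, 600.0, 905.0])
```

`score_maps` may be any iterable, including a generator that yields each map as the recurrent regularization produces it, so the memory held is a constant number of 2D maps.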
The proposed network in the previous section generates the depth map per-view. This section describes the non-learning parts of our 3D reconstruction pipeline.
To estimate the reference depth map using R-MVSNet, we need to prepare: 1) the source images of the given reference image , 2) the depth range of the reference view and 3) the depth sample number for sampling depth values using the inverse depth setting.
For selecting the source images, we follow MVSNet  to score each image pair using a piece-wise Gaussian function w.r.t. the baseline angle of the sparse point cloud . The neighboring source images are selected according to the pair scores in descending order. The depth range is also determined by the sparse point cloud with the implementation of COLMAP . Depth samples are chosen within using the inverse depth setting and we determine the total depth sample number by adjusting the temporal depth resolution to the spatial image resolution (details are described in the supplementary material).
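A minimal sketch of inverse depth sampling: hypotheses are spaced uniformly in 1/d, so they are dense near the camera and sparse far away. The function name, the far-to-near ordering and the use of the 425 mm to 905 mm training range with five samples are assumptions made for illustration.

```python
def inverse_depth_samples(d_min, d_max, num_samples):
    """Depth hypotheses spaced uniformly in 1/d between d_min and d_max."""
    inv_near, inv_far = 1.0 / d_min, 1.0 / d_max
    step = (inv_near - inv_far) / (num_samples - 1)
    # Ordered from far to near here; the ordering is an assumption of this sketch.
    return [1.0 / (inv_far + k * step) for k in range(num_samples)]


samples = inverse_depth_samples(425.0, 905.0, 5)
```

The depth gap between neighbouring hypotheses grows roughly with the square of the depth, which is what lets a fixed sample budget cover a wide depth range.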
As mentioned in Sec. 3.3, a depth map will be retrieved from the regularized cost maps through the winner-take-all selection. Compared to the soft argmin operation, the argmax operation of winner-take-all cannot produce depth estimations with sub-pixel accuracy. To alleviate the stair effect (see Fig. 3 (g) and (h)), we propose to refine the depth map in a small depth range by enforcing the multi-view photo-consistency.
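The difference between the two read-outs can be seen on a single pixel. The sketch below computes the expectation over depth (the soft argmin style read-out) and the winner-take-all choice from the same probabilities; the function names and numbers are illustrative only.

```python
def soft_argmin_depth(probs, depths):
    """Expectation of depth under the per-pixel probability distribution."""
    return sum(p * d for p, d in zip(probs, depths))


def winner_take_all_depth(probs, depths):
    """Depth hypothesis with the highest probability."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return depths[best]


depths = [500.0, 510.0, 520.0]
probs = [0.1, 0.6, 0.3]
soft = soft_argmin_depth(probs, depths)     # lies between the samples
hard = winner_take_all_depth(probs, depths)  # snaps to a sample
```

The expectation lands between hypotheses (about 512 here), whereas winner-take-all can only return one of the sampled depths (510), which is the source of the stair effect that the refinement step targets.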
Given the reference image, the reference depth map and one source image, we project the source image to the reference view through the reference depth map to form the reprojected image. The image reprojection error between the reference image and the reprojected image at pixel $p$ is defined as:

$$E^{i}(p) = E^{i}_{photo}(p) + E^{i}_{smooth}(p)$$

where $E^{i}_{photo}$ is the photo-metric error between two pixels and $E^{i}_{smooth}$ is the regularization term to ensure the depth map smoothness. We choose the zero-mean normalized cross-correlation (ZNCC) to measure the photo-consistency, and use the bilateral squared depth difference between $p$ and its neighbors for smoothness.
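For reference, ZNCC between two image patches can be computed as below. The sketch assumes the patches are flattened to equal-length lists of intensities; the small `eps` guard against constant patches is an addition of this sketch, and how the score is turned into an error (for example, one minus ZNCC) is not specified here.

```python
import math


def zncc(patch_a, patch_b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two flattened patches."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # eps avoids division by zero when a patch has no texture.
    return num / (den + eps)


patch = [1.0, 2.0, 4.0, 7.0]
same_up_to_gain = zncc(patch, [2.0 * v + 3.0 for v in patch])
inverted = zncc(patch, [-v for v in patch])
```

Subtracting the mean and dividing by the norm makes the score invariant to affine brightness changes between views, so a patch and its gain-and-offset copy correlate at nearly 1 while an inverted patch scores nearly -1.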
During the refinement, we iteratively minimize the total image reprojection error between the reference image and all source images w.r.t. the depth map. It is noteworthy that the initial depth map from R-MVSNet already achieves a satisfactory result. The proposed variational refinement only fine-tunes the depth values within a small range to achieve sub-pixel depth accuracy, which is similar to the quadratic interpolation in stereo methods [32, 21] and the DenseCRF in DeepMVS.
Similar to other depth map based MVS approaches [10, 25, 30], we filter and fuse depth maps in R-MVSNet into a single 3D point cloud. Both photo-metric and geometric consistencies are considered in depth map filtering. As described in previous sections, the regularized cost maps will go through a softmax layer to generate the probability volume. In our experiments, we take the corresponding probability of the selected depth value as its confidence measurement (Fig. 3 (f)), and we will filter out pixels with probability lower than a threshold of . The geometric constraint measures the depth consistency among multiple views, and we follow the geometric criteria in MVSNet that pixels should be visible in at least three views. For depth map fusion, we apply the visibility-based depth map fusion as well as the mean average fusion to further enhance the depth map quality and produce the 3D point cloud. Illustrations of our reconstruction pipeline are shown in Fig. 3.
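The two filtering criteria can be combined into a per-pixel mask, as in the sketch below. It assumes the number of geometrically consistent views per pixel has already been computed elsewhere; the function name is hypothetical, and the probability threshold of 0.5 in the example is a placeholder, not the value used in the experiments.

```python
def filter_depth(depth, prob, views_consistent, prob_threshold, min_views=3):
    """Keep a pixel's depth only if it passes both checks; rejected pixels become None."""
    kept = []
    for d_row, p_row, v_row in zip(depth, prob, views_consistent):
        kept.append([
            d if (p >= prob_threshold and v >= min_views) else None
            for d, p, v in zip(d_row, p_row, v_row)
        ])
    return kept


depth = [[600.0, 610.0, 620.0]]
prob = [[0.9, 0.2, 0.8]]      # confidence of the selected depth value
views = [[3, 4, 2]]           # views in which the depth is consistent
filtered = filter_depth(depth, prob, views, prob_threshold=0.5)
```

The second pixel fails the photo-metric check and the third fails the geometric check, so only the first depth survives; the two criteria remove different kinds of outliers.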
We train R-MVSNet on the DTU dataset, which contains over 100 scans taken under 7 different lighting conditions and fixed camera trajectories. While the dataset only provides the ground truth point clouds, we follow MVSNet to generate the rendered depth maps for training. The training image size is set to and the input view number is . The depth hypotheses are sampled from 425 mm to 905 mm with . In addition, to prevent depth maps from being biased on the GRU regularization order, each training sample is passed to the network with forward GRU regularization from to as well as the backward regularization from to . The dataset is split into the same training, validation and evaluation sets as previous works [14, 30]. We choose TensorFlow for the network implementation, and the model is trained for 100k iterations with a batch size of 1 on a GTX 1080Ti graphics card. RMSProp is chosen as the optimizer and the learning rate is set to 0.001 with an exponential decay of 0.9 for every 10k iterations.
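The learning rate schedule stated above can be written as a small function. Whether the decay is applied in discrete steps every 10k iterations (staircase) or continuously is not stated, so the `staircase` flag and its default are assumptions of this sketch.

```python
def learning_rate(step, base_lr=0.001, decay=0.9, decay_steps=10_000, staircase=True):
    """Exponentially decayed learning rate: base_lr * decay ** (step / decay_steps)."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay ** exponent


lr_start = learning_rate(0)
lr_end = learning_rate(100_000)  # after the full 100k training iterations
```

Under either variant the rate after 100k iterations is 0.001 × 0.9^10, roughly 35% of the initial value.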
For testing, we use images as input, and the inverse depth samples are adaptively selected as described in Sec. 4.1. For Tanks and Temples dataset, the camera parameters are computed from OpenMVG  as suggested by MVSNet . Depth map refinement, filtering and fusion are implemented using OpenGL on the same GTX 1080Ti GPU.
We first demonstrate the state-of-the-art performance of the proposed R-MVSNet, which produces results comparable to or outperforms the previous MVSNet .
We evaluate the proposed method on the DTU evaluation set. To compare R-MVSNet with MVSNet, we set and for all scans. Quantitative results are shown in Table 1. The accuracy and the completeness are calculated using the MATLAB script provided by the DTU dataset. To summarize the overall reconstruction quality, we calculate the average of the mean accuracy and the mean completeness as the overall score. Our R-MVSNet produces the best reconstruction completeness and overall score among all methods. Qualitative results can be found in Fig. 3.
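As a quick worked example of the overall score, the sketch below averages the mean accuracy and mean completeness reported for MVSNet (D=256) in Table 1. The function name is hypothetical.

```python
def overall_score(mean_accuracy, mean_completeness):
    """Average of mean accuracy and mean completeness (lower is better)."""
    return (mean_accuracy + mean_completeness) / 2.0


mvsnet_overall = overall_score(0.396, 0.527)
```

The result is about 0.4615, which agrees with the 0.462 listed for that row after rounding.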
Unlike the indoor DTU dataset, Tanks and Temples is a large dataset captured in more complex environments. Specifically, the dataset is divided into the intermediate and the advanced sets. The intermediate set contains scenes with outside-look-in camera trajectories, while the advanced set contains large scenes with complex geometric layouts, where almost all previous learned algorithms fail due to the memory constraint.
The proposed method ranks on the intermediate set, which outperforms the original MVSNet . Moreover, R-MVSNet successfully reconstructs all scenes and also ranks on the advanced set. The reconstructed point clouds are shown in Fig. 5. It is noteworthy that the benchmarking result of Tanks and Temples is highly dependent on the point cloud density. Our depth map is of size , which is relatively low-resolution and will result in low reconstruction completeness. So for the evaluation, we linearly upsample the depth map from the network by two () before the depth map refinement. The f_scores of intermediate and advanced sets increase from to and from to respectively.
We also evaluate our method on the recent ETH3D benchmark. The dataset is divided into the low-res and the high-res scenes, and provides the ground truth depth maps for MVS training. We first fine-tune the model on the ETH3D low-res training set, but observe no performance gain compared to the model only pre-trained on DTU. We suspect the reason is that some images in the low-res training set are blurred and overexposed, as they are captured using hand-held devices. Also, the scenes of the ETH3D dataset contain complicated object occlusions, which are not explicitly handled in the proposed network. We therefore evaluate on this benchmark without fine-tuning the network. Our method achieves similar performance to MVSNet and ranks on the low-res benchmark.
|Method||Mean Acc.||Mean Comp.||Overall (mm)|
|MVSNet (D=256)||0.396||0.527||0.462|
|Dataset||MVSNet Rank||MVSNet H||MVSNet W||MVSNet Ave. D||MVSNet Mem.||MVSNet Mem-Util||R-MVSNet Rank||R-MVSNet H||R-MVSNet W||R-MVSNet Ave. D||R-MVSNet Mem.||R-MVSNet Mem-Util||Mem-Util Ratio|
|DTU||2||1600||1184||256||15.4 GB||1.97 M||1||1600||1200||512||6.7 GB||9.17 M||4.7|
|T. Int.||4||1920||1072||256||15.3 GB||2.15 M||3||1920||1080||898||6.7 GB||17.4 M||8.1|
|T. Adv.||-||-||-||-||-||-||3||1920||1080||698||6.7 GB||13.5 M||-|
|ETH3D||5||928||480||320||8.7 GB||1.02 M||6||928||480||351||2.1 GB||4.65 M||4.6|
Next, we demonstrate the scalability of R-MVSNet from: 1) wide-range and 2) high-resolution depth reconstructions.
The memory requirement of R-MVSNet is independent of the depth sample number, which enables the network to infer depth maps with large depth ranges that cannot be recovered by previous learning-based MVS methods. Some large-scale reconstructions of the Tanks and Temples dataset are shown in Fig. 5. Table 2 compares MVSNet and R-MVSNet in terms of benchmarking rankings, reconstruction scales and memory requirements. We define the algorithm's memory utility (Mem-Util) as the size of volume processed per memory unit ($\frac{H}{4} \times \frac{W}{4} \times D$ / runtime memory size). R-MVSNet is more efficient than MVSNet in Mem-Util.
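The Mem-Util figures in Table 2 can be checked with a few lines of arithmetic. The table values are reproduced when the volume is counted at the quarter-resolution depth map size and memory is expressed in GB; that reading is inferred from the table numbers, and the function name is hypothetical.

```python
def mem_util(height, width, depth_samples, memory_gb, downsample=4):
    """Voxels processed per GB, counting the volume at the output depth map size."""
    voxels = (height // downsample) * (width // downsample) * depth_samples
    return voxels / memory_gb


# DTU rows of Table 2.
mvsnet_dtu = mem_util(1600, 1184, 256, 15.4)
rmvsnet_dtu = mem_util(1600, 1200, 512, 6.7)
ratio_dtu = rmvsnet_dtu / mvsnet_dtu
```

On the DTU row this gives about 1.97 M and 9.17 M voxels per GB and a ratio of about 4.7, in line with the table.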
This section studies how different components in the network affect the depth map reconstruction. We perform the study on the DTU validation set with , and use the average absolute difference between the inferred and the ground truth depth maps for the quantitative comparison. We denote the learned 2D image features as 2D CNNs. The comparison results of the following settings are shown in Fig. 6 and Fig. 7:
The setting of the proposed R-MVSNet, which produces the best depth map results among all settings. The qualitative comparison between 3D CNNs and GRU is shown in Fig. 7 (d) and (e).
Replace the GRU regularization with the simple spatial regularization. We approach the spatial regularization by a simple 3-layer, 32-channel 2D network on the cost map. The depth map error of spatial regularization is larger than the GRU regularization.
Replace the GRU regularization with the simple winner-take-all selection. We apply a single-layer, 1-channel 2D CNN to directly map the cost map to the regularized cost map. The depth map error is larger still than that of the spatial regularization.
Replace the learned image feature and cost metric with the engineered (window size of ). This setting is also referred to as the classical plane sweeping. As expected, plane sweeping produces the highest depth map error among all methods.
Next, we study the influences of post processing steps on the final point cloud reconstruction. We reconstruct the DTU evaluation without the variational refinement, photo-metric filtering, geometric filtering or depth map fusion. Quantitative results are shown in Table 3.
This setting is similar to the post-processing of MVSNet . The f_score is changed to a larger number of 0.465, demonstrating the effectiveness of the proposed depth map refinement.
The f_score is increased to 0.432, showing the effectiveness of depth consistency.
The f_score is also increased to 0.431, showing the effectiveness of depth fusion.
For DTU evaluation with , R-MVSNet generates the depth map at a speed of / view. Specifically, it takes to infer the initial depth map and to perform the depth map refinement. It is noteworthy that the runtime of depth map refinement only depends on the number of refinement iterations and the input image size. Filtering and fusion take negligible runtime.
R-MVSNet is trained with fixed input size of , but it is applicable to arbitrary input size during testing. It is noteworthy that we use the model trained on the DTU dataset  for all our experiments without fine-tuning. While R-MVSNet has shown satisfying generalizability to the other two datasets [17, 26], we hope to train R-MVSNet on a more diverse MVS dataset, and expect better performances on Tanks and Temples  and ETH3D  benchmarks in the future.
While R-MVSNet is applicable to reconstructions with unlimited depth-wise resolution, the reconstruction scale is still restricted by the input image size. Currently R-MVSNet can handle a maximum input image size of on an 11 GB GPU, which covers all modern MVS benchmarks except for the ETH3D high-res benchmark ().
We presented a scalable deep architecture for high-resolution multi-view stereo reconstruction. Instead of using 3D CNNs, the proposed R-MVSNet sequentially regularizes the cost volume through the depth direction with the convolutional GRU, which dramatically reduces the memory requirement for learning-based MVS reconstructions. Experiments show that with the proposed post-processing, R-MVSNet is able to produce benchmarking results on par with the original MVSNet. Also, R-MVSNet is applicable to large-scale reconstructions which cannot be handled by previous learning-based MVS approaches.
International Journal of Computer Vision (IJCV), 2016.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Computer Vision and Pattern Recognition (CVPR), 2018.
Empirical Methods in Natural Language Processing (EMNLP), 2014.
O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG), 2017.
This section describes the network architecture of R-MVSNet (Table 1). R-MVSNet constructs cost maps at different depths, and recurrently regularizes cost maps through the depth direction. The probability volume needs to be explicitly computed during network training, but for testing, we can sequentially retrieve the regularized cost maps, and all layers only require GPU memory of size linear in the input image resolution.
|Image Features Extraction|
|Differentiable Homography Warping|
|Cost Map Construction|
|Variance Cost Metric||1/4 H × 1/4 W × 32|
|GRU, K=3x3, F=16||1/4 H × 1/4 W × 16|
|GRU, K=3x3, F=4||1/4 H × 1/4 W × 4|
|GRU, K=3x3, F=1||1/4 H × 1/4 W × 1|
|Probability Volume Construction|
R-MVSNet architecture. We denote the 2D convolution as Conv and use BR to abbreviate the batch normalization and the ReLU. K is the kernel size, S the kernel stride and F the output channel number. N, H, W, D denote input view number, image height, width and depth sample number respectively.
Given the depth range , we sample depth values using the inverse depth setting:
where is the index of the depth sampling and is the depth sample number. To determine the sample number , we assume that the spatial image resolution should be the same as the temporal depth resolution. Supposing and are two 3D points by projecting the reference image center and its neighboring pixel to the space at depth , the spatial image resolution at depth is defined as . Meanwhile, we define the temporal depth resolution at depth as . Considering Equation 1, the depth sample number is calculated as:
We derive the iterative minimization procedure for Equation 8 in the main paper. Focusing on one pixel in the reference image, we denote its corresponding 3D point in the space as , where , and are the projection matrix, camera center of the reference camera and the depth of pixel . The projection of in the source image is . For the photo-consistency term, we assume and abbreviate it as . The image reprojection error will be changed as deforms, and we take the derivative of the photo-consistency term w.r.t. to depth :
where is the Jacobian of the projection matrix . is the derivative of the photo-metric measurement w.r.t. the pixel coordinate. For computing the derivatives of NCC and ZNCC, we refer readers to  for detailed implementations. Also, considering , the derivative of the smoothness term can be derived as:
where is the bilateral smoothness weighting.
We iteratively minimize the total image reprojection error by gradient descent with a descending step size of and . The reference depth map and all reprojected images will be updated at each step. The refinement iteration is fixed to 20 for all our experiments.
One concern about R-MVSNet is whether the proposed GRU regularization could simply be replaced by streaming the 3D CNNs regularization in the depth direction. To address this concern, we conduct two more ablation studies. For the DTU dataset, we divide the cost volume () into sub-volumes () along the depth direction. To better regularize the boundary voxels, we set the overlap between two adjacent sub-volumes to , so in this way is divided into 7 consecutive sub-volumes . We then sequentially apply 3D CNNs (except for the softmax layer) on to obtain the regularized sub-volumes. Then, we generate the final depth map by two different fusion strategies:
Volume Fusion First concatenate the regularized sub-volumes (truncated with to fit the overlap region) in depth direction. Then apply softmax and soft argmin to regress the final depth map.
Depth Map Fusion First regress 7 depth maps and probability maps from the regularized sub-volumes. Then fuse the 7 depth maps into the final depth map by winner-take-all selection on probability maps.
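The depth-wise partition used by both sliding strategies can be sketched as overlapping index ranges. The concrete sizes below (256 depth samples, windows of 64 with an overlap of 32) are assumptions chosen only because they yield the 7 sub-volumes mentioned above; the function name is hypothetical.

```python
def sub_volume_ranges(total_depth, window, overlap):
    """Half-open [start, end) depth index ranges of overlapping sub-volumes."""
    stride = window - overlap
    starts = range(0, total_depth - window + 1, stride)
    return [(s, s + window) for s in starts]


# Assumed sizes for illustration; they produce 7 overlapping windows.
ranges = sub_volume_ranges(total_depth=256, window=64, overlap=32)
```

Each 3D CNN pass then sees only one window of depth samples, so context outside that window is unavailable to it.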
Qualitative and quantitative results are shown in Fig. 2. Both sliding strategies produce errors higher than GRU and 3D CNNs. Also, sliding strategies take to infer depth map (), which is slower than MVSNet and R-MVSNet.
The sliding window 3D CNNs regularization is a depth-wise divide-and-conquer algorithm with two major limitations: 1) there are discrepancies among sub-volumes, as sub-volumes are not regularized as a whole; 2) the size of the sub-volume is limited and far smaller than the actual receptive field size of the multi-scale 3D CNNs (). As a result, such strategies cannot fully benefit from the powerful 3D CNNs regularization.
using different post-processing settings. The photo-metric filtering and the geometric filtering are able to remove different kinds of outliers and produce visually clean point clouds. Depth map refinement and depth map fusion have little influence on the qualitative results; however, they are able to reduce the score for the quantitative evaluation (Table 3 in the main paper).
This section presents the point cloud reconstructions of DTU dataset , Tanks and Temples benchmark  and ETH3D benchmark  that have not been shown in the main paper. The point cloud results of the three datasets can be found in Fig. 3, Fig. 4 and Fig. 5 respectively. R-MVSNet is able to produce visually clean and complete point cloud for all reconstructions.