1 Introduction
Multiview stereo (MVS) is an important task in 3D computer vision. It seeks to reconstruct a full 3D model, typically in the form of a dense 3D point cloud, from multiple RGB images with known camera intrinsics and poses. It is a difficult task that remains unsolved; the main challenge is producing a 3D model that is not only accurate but also complete, that is, no parts should be missing and all fine details should be recovered.
Many of the latest results of multiview stereo are achieved by deep networks. In particular, many recent leading methods [zhang2020visibility, yan2020dense] are variants of MVSNet [yao2018mvsnet], a deep architecture that consists of two main steps: (1) constructing a 3D cost volume in the frustum of a reference view, by warping features from other views, and (2) using 3D convolutional layers to transform, or “regularize”, the cost volume before using it to predict a depth map. The resulting depth maps, one from each reference view, are then combined to form a single 3D point cloud through a heuristic procedure.
However, a drawback of MVSNet is that regularizing the 3D plane-sweeping cost volume using 3D convolutions can be costly in terms of computation and memory, potentially limiting the quality of reconstruction under finite resources. Subsequent variants [yao2019recurrent] of MVSNet have attempted to address this issue by replacing 3D convolutions with recurrent sequential processing of 2D slices. Despite significant empirical improvements, however, such sequential processing can be suboptimal because the 3D cost volume does not have a natural sequential structure.
In this work, we propose CERMVS, a new deep-learning multiview stereo approach that differs significantly from existing methods. Like prior deep-learning work on multiview stereo, CERMVS predicts individual depth maps and then fuses them, but it differs in how it predicts each depth map. Given a reference view and multiple neighbor views, CERMVS constructs a 3D cost volume for each neighbor view by computing the similarity between each pixel in the reference view and pixels along the epipolar line, indexed by increments of inverse depth (i.e. disparity) in the reference view. Then, the cost volumes from all neighbor views are aggregated into a single cost volume. CERMVS uses a GRU to iteratively update a disparity field, the field that represents pixel correspondence. The GRU generates each update by sampling from the aggregated cost volume using the current disparity field.
The key difference of CERMVS from MVSNet and its variants lies in how depth is predicted from the 3D cost volume. MVSNet updates (i.e. regularizes) the 3D cost volume and predicts depth through a soft argmax on the updated cost volume. In contrast, CERMVS does not update the cost volume at all; instead it iteratively updates a disparity field, which is used to retrieve values from the cost volume. The final depth prediction is simply the inverted disparity field. Updating a disparity field, which is less expensive than updating the cost volume, can allow more effective use of finite computing resources.
CERMVS builds upon RAFT [teed2020raft], an architecture that estimates optical flow between two video frames. Compared to RAFT, which cannot be directly applied to multiview stereo, CERMVS introduces five novel changes:

Epipolar cost volume: RAFT constructs a 4D cost volume that compares all pairs of pixels from two views, whereas we construct a 3D cost volume comparing each pixel in the reference view with pixels which are on the epipolar line in a neighbor view and spaced by uniform increments of disparity.

Cost volume cascading: Unlike RAFT, the size of our epipolar cost volumes depends not only on the image resolution but also on the number of disparity increments. To reconstruct fine details, a large number of disparity increments is necessary, but can exhaust GPU memory. To address this issue, we introduce cascaded epipolar cost volumes, a novel design in the context of RAFT. In particular, after a fixed number of RAFT iterations, we construct additional finer-grained epipolar cost volumes centered around current disparity predictions with finer increments of disparity, allowing reconstruction of fine details with less memory.

Multiview fusion of cost volumes: RAFT constructs a single cost volume from two views, whereas CERMVS constructs multiple cost volumes, one for each neighbor of a reference view. The cost volumes are then aggregated into a single volume through a simple averaging operator.

Dynamic supervision: RAFT uses exponentially decaying weights to add up flow errors in each iteration. We also use such weights, but supervise a dynamic combination of depth errors and disparity errors.

Multiresolution fusion of depth maps: RAFT operates on a single resolution of the input images, whereas CERMVS applies the same network to predict depth maps on multiple resolutions, and aggregates the depth maps into a single high-resolution depth map through a simple but novel heuristic.
When stitching the depth maps into point clouds, a filtering algorithm is often used, e.g., Dynamic Consistency Checking proposed in D2HCRMVSNet [yan2020dense]. However, these algorithms ignore the fact that high scores on the evaluation metric require a good balance of accuracy and completeness. Therefore, we propose an adaptive thresholding method built on top of [yan2020dense].
We evaluate CERMVS on two challenging benchmarks, DTU [aanaes2016large] and TanksandTemples [knapitsch2017tanks]. On DTU, CERMVS achieves performance competitive with the current state of the art (the second best among published results). On TanksandTemples, CERMVS significantly advances the state of the art in mean F1 score on both the intermediate set and the advanced set.
2 Related Work
Classical MVS
Classical methods [campbell2008using, furukawa2009accurate, galliani2015massively, schonberger2016pixelwise, tola2012efficient, hirschmuller2007stereo] essentially formulate multiview stereo as an optimization problem, which seeks to find a 3D model that is most compatible with the observed images. The compatibility is typically based on some hand-designed notion of photo-consistency, assuming that pixels that are projections of the same 3D point should have similar appearance. Often photo-consistency alone does not sufficiently constrain the solution space, and the optimization objective can also include shape priors, which make additional assumptions about what shapes are likely. To solve the optimization problem, a concrete classical algorithm usually consists of a particular 3D representation (e.g. polygon meshes, voxels, or depth maps) and an optimization procedure to compute the best model under that representation. The different combinations of photo-consistency measures, shape priors, 3D representations, and optimization procedures give rise to a large variety of algorithms. For more details, we refer the reader to excellent surveys of these algorithms by Seitz et al. [seitz2006comparison] and by Furukawa and Hernández [10.1561/0600000052].
One family of classical MVS methods [zheng2014patchmatch, schonberger2016pixelwise, galliani2015massively, xu2019multi, xu2020planar, romanoni2019tapa] is based on the PatchMatch [barnes2009patchmatch] algorithm, which enables efficient dense matching of pixels across views. PatchMatch methods have proved very effective and have demonstrated highly competitive performance. In particular, Xu and Tao [xu2020planar] introduced the ACMP algorithm, which, among other enhancements, incorporates planar priors and has achieved competitive results on TanksandTemples.
Learning-based MVS
Unlike classical algorithms, our approach is learning-based. Existing learning-based MVS methods either use learning to improve parts of a classical pipeline such as PatchMatch [zagoruyko2015learning, han2015matchnet, zbontar2015computing, zbontar2016stereo], or develop end-to-end architectures [kar2017learning, ji2017surfacenet, yao2018mvsnet, yao2019recurrent, chen2019point, luo2019p, xue2019mvscrf, gu2020cascade, yang2020cost, yu2020fast, xu2020learning, yi2020pyramid, cheng2020deep, yan2020dense, zhang2020visibility]. A common step in existing end-to-end architectures is the construction of a 3D cost volume (or feature grid) through some differentiable geometric operations. Then, this 3D cost volume undergoes further updates, often through 3D convolutions, before being transformed into the final 3D model in some particular representation such as voxels [kar2017learning, ji2017surfacenet], depth maps [yao2018mvsnet, yao2019recurrent, luo2019p, xue2019mvscrf, gu2020cascade, yang2020cost, yu2020fast, xu2020learning, yi2020pyramid, cheng2020deep, yan2020dense, zhang2020visibility, ma2021epp, wei2021aa], or point clouds [chen2019point].
The main difference between our approach and existing works is that although we also construct a 3D cost volume, we do not update it. Instead, we update an inverse-depth field that is used to iteratively index into the 3D cost volume to produce 2D feature maps. Our approach thus avoids the costly operations of updating a 3D volume and focuses limited computing resources on refining the depth maps directly.
3 Approach
This section describes the detailed architecture and pipeline of CERMVS, as shown in Fig. 1. Given a reference view and a set of neighbor views, we first extract features using a set of convolutional networks. Features are then used to build a collection of cost volumes. We then predict a depth map through recurrent iterative updates, followed by the fusion of multiresolution depths. Finally, depth maps from all reference views are fused and stitched to produce a final point cloud.
3.1 Cost Volume Construction
Image Features
We need to extract image features from both reference views and neighbor views before using them to construct the cost volumes. In addition, the iterative update unit, to be introduced later, needs context features from reference views. We extract these image features using convolutional encoders following RAFT. Each encoder maps an input image to a feature map at reduced resolution, where the downsize ratio and the feature dimension are hyperparameters (see Sec. 4.1 and Appendix A for more details).
Epipolar Cost Volume
After extracting feature maps for the reference view and its neighbor views, each at the reduced feature resolution, we construct a 3D cost volume by computing the correlation of each pixel in the reference view with pixels along its epipolar line in a neighbor view. Specifically, for a pixel in the reference view, we backproject it to 3D points with disparity (inverse depth) uniformly spaced in the range from zero to a maximum disparity (after proper scaling as described in Sec. 4.1), reproject the 3D points to the epipolar line in the neighbor view, and use differentiable bilinear sampling to retrieve the features from the neighbor view. This procedure outputs a 3D volume of correlation values indexed by reference pixel and disparity sample.
Like RAFT, we compute a stack of multiscale cost volumes by repeated average pooling.
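The construction above can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names, the dot-product correlation scaled by the square root of the feature dimension (borrowed from RAFT), the convention that `R` and `t` map reference-camera coordinates to neighbor-camera coordinates, and the choice to pool along the disparity axis are all assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at float pixel coordinates; out-of-image taps contribute zero."""
    H, W, C = feat.shape
    x = np.clip(x, -1.0, W)  # keep indices finite; clipped points fall outside the image
    y = np.clip(y, -1.0, H)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    out = np.zeros(x.shape + (C,))
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1 - np.abs(x - xi)) * (1 - np.abs(y - yi))
            valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            out += (w * valid)[..., None] * feat[np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
    return out

def epipolar_cost_volume(f_ref, f_nbr, K_ref, K_nbr, R, t, disparities):
    """Correlate each reference pixel with neighbor features along its epipolar line.

    f_ref, f_nbr: (H, W, C) feature maps; K_*: 3x3 intrinsics at feature resolution;
    disparities: (D,) inverse depths. Returns a (H, W, D) volume.
    """
    H, W, C = f_ref.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K_ref).T                      # back-projected rays, (H, W, 3)
    # A point at inverse depth d is rays / d. In the neighbor frame, R @ rays / d + t is
    # proportional to R @ rays + d * t, which stays finite even when d = 0.
    pts = (rays @ R.T)[:, :, None, :] + disparities[None, None, :, None] * t
    proj = pts @ K_nbr.T                                     # (H, W, D, 3)
    z = proj[..., 2]
    front = z > 1e-8                                         # points in front of the neighbor camera
    safe_z = np.where(front, z, 1.0)
    sampled = bilinear_sample(f_nbr, proj[..., 0] / safe_z, proj[..., 1] / safe_z)
    corr = np.einsum('hwc,hwdc->hwd', f_ref, sampled) / np.sqrt(C)
    return corr * front

def pooled_stack(volume, levels):
    """Repeated 2x average pooling along the disparity axis (assumed pooling axis)."""
    stack = [volume]
    for _ in range(levels - 1):
        v = stack[-1]
        d = v.shape[-1] // 2 * 2
        stack.append(0.5 * (v[..., 0:d:2] + v[..., 1:d:2]))
    return stack
```

With an identity relative pose, every disparity sample reprojects to the same pixel, so the volume reduces to the scaled squared norm of each feature vector, which is a convenient sanity check.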
Cost Volume Cascading
Unlike RAFT, the size of an epipolar cost volume depends not only on the image resolution but also on the number of disparity values sampled. A dense sampling of a large number of disparity values effectively increases the resolution of the cost volume along the depth dimension and can help reconstruct fine details. However, using a large number of disparity values can take too much GPU memory. To address this issue, we introduce a cascade design. The basic idea is to construct additional cost volumes that are finer-grained along the disparity dimension and centered around the current disparity predictions.
Concretely, after a fixed number of iterative updates, we create a new stack of cost volumes whose disparity values are uniformly sampled, centered around the current disparity prediction, with smaller increments than those used in the initial stack. The number of disparity values in the new stack is determined by the hyperparameter that controls the size of the lookup neighborhood described in Sec. 3.2, multiplied by a factor that is needed to allow repeated pooling. In this work we use up to 2 stages in our experiments, but the design can be trivially extended to more stages.
It is worth noting that cost volume cascading has been used in prior MVS work [gu2020cascade, yang2020cost], but it is a novel design in the context of a RAFT-like architecture, which differs significantly from prior MVS work in that the cost volumes are not updated and are only used as static lookup tables.
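As a rough illustration of the second-stage sampling, the sketch below places disparity samples symmetrically around the current prediction. The sample count used here, neighborhood size times 2^(stack size - 1), is an assumption: it is one reading that agrees with the values reported elsewhere in the paper (neighborhood size 11, stack size 3, and 44 second-stage disparity values). The function name and the exact centering convention are hypothetical.

```python
import numpy as np

def cascade_disparity_samples(d_pred, neighborhood_size, num_levels, increment):
    """Finer disparity samples centered on the current prediction.

    d_pred: (H, W) current disparity. Returns (H, W, N) with
    N = neighborhood_size * 2 ** (num_levels - 1)  (assumed rule, see text).
    """
    n = neighborhood_size * 2 ** (num_levels - 1)
    # Offsets are symmetric around zero and spaced by the finer increment.
    offsets = (np.arange(n) - (n - 1) / 2.0) * increment
    return d_pred[..., None] + offsets

# Example with the Table 1 settings: 11 * 2**2 = 44 samples per pixel.
samples = cascade_disparity_samples(np.full((2, 3), 0.001), 11, 3, 0.0025 / 320)
```

Because only a narrow band around the prediction is sampled, the second-stage volume needs 44 disparity values per pixel rather than the 320 a dense volume at the same increment would need.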
3.2 Iterative Updates
The iterative updates follow RAFT in overall structure. We iteratively update a disparity field initialized to zero. In each iteration, the input to the update operator includes a hidden state, the current disparity field, the context features from the reference view, as well as per-pixel features retrieved from the cost volumes using the current disparity field. The output of the update operator includes a new hidden state and an increment to the disparity field.
Multiview Fusion of Cost Volumes
Unlike RAFT, in multiview stereo we need to consider multiple neighbor views. For each pixel in the reference view, we generate one correlation feature vector against each neighbor view. Given such feature vectors from multiple neighbor views, we take the element-wise mean as the final vector. The intuition behind this operator is that the mean is robust to the number of neighbor views, which can vary at test time.
To generate the correlation feature vector for each pixel against a single neighbor view, we perform the same lookup procedure as RAFT. Given the current disparity estimate for the pixel and the stack of cost volumes against that neighbor view, we retrieve, from each cost volume in the stack, the correlation values on a local 1D integer grid of fixed length (the retrieved neighborhood size in Table 1) centered around the current disparity. The values from all levels of the stack are concatenated to form a single feature vector.
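A minimal sketch of the lookup and the mean fusion follows. The names are hypothetical, and several details are assumptions: linear interpolation between disparity bins, zero values outside the volume, a first-stage volume whose first sample sits at disparity zero, and the RAFT convention of halving the index at each pooled level.

```python
import numpy as np

def lookup_1d(volume, index, size):
    """Read `size` values around a fractional index along the last axis.

    volume: (H, W, D); index: (H, W) float position along the disparity axis.
    Returns (H, W, size), linearly interpolated, zero outside the volume.
    """
    D = volume.shape[-1]
    offsets = np.arange(size) - size // 2
    pos = index[..., None] + offsets
    lo = np.floor(pos).astype(int)
    frac = pos - lo

    def gather(i):
        valid = (i >= 0) & (i < D)
        return np.take_along_axis(volume, np.clip(i, 0, D - 1), axis=-1) * valid

    return (1 - frac) * gather(lo) + frac * gather(lo + 1)

def correlation_features(stacks, disparity, increment, size):
    """Concatenate lookups over stack levels, then average over neighbor views.

    stacks: one list of pooled volumes per neighbor view; level l is assumed
    to have a disparity increment of increment * 2 ** l.
    """
    per_view = []
    for stack in stacks:
        feats = [lookup_1d(vol, disparity / (increment * 2 ** level), size)
                 for level, vol in enumerate(stack)]
        per_view.append(np.concatenate(feats, axis=-1))
    # Element-wise mean keeps the feature size fixed for any number of views.
    return np.mean(per_view, axis=0)
```

The output has `size` times the number of levels channels per pixel regardless of how many neighbor views are supplied, which is what lets the view count change between training and inference.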
Update Operator
We use a GRU-based update operator to propose a sequence of incremental updates to the disparity field.
First, we extract features from the current disparity estimate. The feature vector is formed by subtracting each pixel's disparity from the disparities in its 7x7 neighborhood, then reshaping the result into a 49-dimensional vector. This operation makes the feature vector invariant to a constant shift of the disparity field, since it depends only on the relative disparity between neighboring pixels.
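The disparity feature described above can be sketched as follows. The function name and the edge-replication padding at image borders are assumptions; the text does not say how borders are handled.

```python
import numpy as np

def disparity_features(disp, k=7):
    """Relative disparity of each pixel's k x k neighborhood.

    disp: (H, W). Returns (H, W, k * k); channel j holds the disparity of the
    j-th neighbor minus the disparity of the center pixel.
    """
    r = k // 2
    H, W = disp.shape
    padded = np.pad(disp, r, mode='edge')  # border handling is an assumption
    feats = [padded[dy:dy + H, dx:dx + W] - disp
             for dy in range(k) for dx in range(k)]
    return np.stack(feats, axis=-1)
```

Adding a constant to `disp` leaves the output unchanged, and the center channel (index 24 for a 7x7 window) is always zero.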
Second, because we have a cascade of cost volumes and our update operator accesses different cost volumes at different stages of the cascade, the operator, while still recurrent, should be given the flexibility to behave somewhat differently for different stages of the cascade. Thus, we modify the weight tying scheme of RAFT such that some weights are tied across all iterations while others are tied only within a single stage of the cascade. Specifically, we tie all weights across iterations except the decoder layer that decodes a disparity update from the hidden state of the GRU. The weights of the decoder layer are tied only within each stage of the cascade.
Third, RAFT uses upsampling layers for final predictions of flow field, whereas we do not use any upsampling layer.
The update equations are as follows, for a 2-stage cascade with a fixed number of iterations in stage 1.
(1)  
(2)  
(3)  
(4)  
(5)  
(6) 
In these equations, the context features are those extracted from the reference view, and the correlation features are transformed by an encoder that consists of two convolution layers (see Appendix A for details).
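For orientation, a generic RAFT-style convolutional GRU update with a stage-specific decoder can be written as below. This is a sketch under assumed notation, not the paper's own equations: $c_t$ denotes the retrieved correlation features, $\phi(d_t)$ the relative-disparity features, $f_{\mathrm{ctx}}$ the context features, $h_t$ the hidden state, $\mathrm{Enc}_c$ and $\mathrm{Enc}_d$ hypothetical encoders, and $\mathrm{Dec}_{s(t)}$ the decoder of the cascade stage $s(t)$ that iteration $t$ belongs to.

```latex
\begin{align*}
x_t &= \big[\,\mathrm{Enc}_c(c_t),\ \mathrm{Enc}_d(\phi(d_t)),\ f_{\mathrm{ctx}}\,\big] \\
z_t &= \sigma\big(\mathrm{Conv}([h_{t-1}, x_t];\, W_z)\big) \\
r_t &= \sigma\big(\mathrm{Conv}([h_{t-1}, x_t];\, W_r)\big) \\
\tilde{h}_t &= \tanh\big(\mathrm{Conv}([r_t \odot h_{t-1}, x_t];\, W_h)\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
d_{t+1} &= d_t + \mathrm{Dec}_{s(t)}(h_t)
\end{align*}
```

In this form the GRU weights $W_z$, $W_r$, $W_h$ and the encoders are shared across all iterations, while only the decoder changes with the stage, which matches the weight tying scheme described in the text.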
3.3 Multiresolution Depth Fusion
To construct fine details, it generally helps to operate at high resolution, but the available GPU memory limits the highest resolution the network can access, especially during training with large minibatches. One approach to get around this limit is to apply the network to a higher resolution during inference, which is the common approach adopted in prior works.
However, we find that while using a higher resolution during inference can help, an even better approach is to apply the same network on two input resolutions, the “low” resolution used to train the network and a higher resolution, and to combine the two resulting disparity maps into a fused disparity map governed by a control parameter:
(7) 
That is, if the low-resolution prediction and the high-resolution prediction are similar at a pixel, we use the high-resolution prediction; otherwise we use the low-resolution prediction. This is motivated by the observation that low-resolution predictions are more reliable on large textureless structures such as planes, whereas high-resolution predictions are more reliable on fine details, which do not tend to deviate drastically from low-resolution predictions. Note that as the control parameter varies from 0 to infinity, the fused disparity map varies from the low-resolution prediction to the high-resolution prediction.
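The selection rule can be sketched in a few lines. The function name is hypothetical, the low-resolution map is assumed to have been upsampled to the high-resolution grid beforehand, and the similarity test is taken here to be an absolute difference against the control parameter; the paper's exact test (for example, a relative difference) may differ.

```python
import numpy as np

def fuse_disparities(d_low, d_high, eps):
    """Per-pixel choice between two disparity maps of the same shape.

    Uses the high-resolution value where the two predictions agree to within
    eps, and falls back to the low-resolution value elsewhere.
    """
    agree = np.abs(d_high - d_low) <= eps
    return np.where(agree, d_high, d_low)
```

Setting `eps` to 0 returns the low-resolution map and setting it to infinity returns the high-resolution map, which reproduces the two limits noted in the text.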
Table 1: Training and test hyperparameters.
Training dataset  DTU  BlendedMVS
Native resolution  (1200, 1600)  (1536, 2048)
# neighbor views  10  8
# training epochs  15  16
Feature map downsize ratio  4
Feature map dimension  64
Cost volume stack size  3
Retrieved neighborhood size  11
Cascaded stages  2
Max disparity  0.0025
Disparity increment in stage 1  max disparity / 64
Disparity increment in stage 2  max disparity / 320
# GRU iterations in each stage  8
Batch size  2
Loss parameter
Test dataset  DTU  TanksandTemples
Native resolution  (1200, 1600)
  10  15
  10  25
  0.02  0.02
  native resolution  1/2 native resolution
  0.25  0.25
DTU mean distance (mm)
Method  Acc.  Comp.  Overall
COLMAP [schonberger2016pixelwise]  
MVSNet [yao2018mvsnet]  
D2HCMVSNet [yan2020dense]  
PointMVSNet [chen2019point]  
VisMVSNet [zhang2020visibility]  
AARMVSNet [wei2021aa]  
CasMVSNet [gu2020cascade]  
EPPMVSNet [ma2021epp]  
CVPMVSNet [yang2020cost]  
UCSNet [cheng2020deep]  
IBMVS [sormann2021ib]  
Ours 
intermediate  advanced  
Method  mean  Fam.  Franc.  Horse  Light.  M60  Pan.  Play.  Train  mean  Audi.  Ballr.  Courtr.  Museum  Palace  Temple 
COLMAP [schonberger2016pixelwise]  
MVSNet [yao2018mvsnet]                
PointMVSNet [chen2019point]                
CVPMVSNet [yang2020cost]                
UCSNet [cheng2020deep]                
AltizureSFM, PCFMVS [kuhn2019plane]  
IBMVS [sormann2021ib]  
CasMVSNet [gu2020cascade]  
ACMM [xu2019multi]  
ACMP [xu2020planar]  
AltizureHKUST2019 [Altizure]  
DeepCMVS [kuhn2020deepc]  
VisMVSNet [zhang2020visibility]  
AttMVS [luo2020attention]  
D2HCMVSNet [yan2020dense]                
AARMVSNet [wei2021aa]                
EPPMVSNet [ma2021epp]  
Ours 
3.4 Adaptive Point Cloud Stitching
As a last step, the depth maps from the reference views are stitched together to form a single point cloud. We use an adaptive thresholding approach based on Dynamic Consistency Checking (DCC) proposed in D2HCRMVSNet [yan2020dense]. DCC hard-codes two thresholds for reprojection errors. We instead use two thresholds governed by a parameter that differs for each scene and is set so that a fixed percentage of all pixels pass the consistency test. This percentage is optimized on the validation set.
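One way to realize a per-scene threshold that passes a fixed percentage of pixels is to take a quantile of a normalized error, as in the sketch below. This is an illustration under stated assumptions rather than the paper's procedure: it assumes the two base thresholds are scaled jointly by a single factor `alpha`, it treats the two error maps as already computed per pixel, and the names `theta1`, `theta2`, and both functions are hypothetical.

```python
import numpy as np

def adaptive_scale(err_pix, err_depth, theta1, theta2, percentage):
    """Per-scene scale so that `percentage` of pixels pass both tests.

    err_pix, err_depth: per-pixel reprojection errors for one scene.
    theta1, theta2: base thresholds (hypothetical stand-ins for DCC's values).
    """
    # A pixel passes both scaled tests exactly when this normalized error <= alpha.
    normalized = np.maximum(err_pix / theta1, err_depth / theta2)
    return np.quantile(normalized, percentage)

def consistency_mask(err_pix, err_depth, theta1, theta2, alpha):
    """Pixels that satisfy both scaled thresholds."""
    return (err_pix <= alpha * theta1) & (err_depth <= alpha * theta2)
```

A scene with noisier depth maps receives a larger `alpha` than a clean scene, so the fraction of retained pixels, and therefore the balance between accuracy and completeness, stays roughly constant across scenes.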
3.5 Supervision
We supervise our network with a loss consisting of two parts. The first part measures the L1 error of the predicted disparity against the ground truth at each iteration, with exponentially increasing weights for later iterations. This part enables faster training of all disparity ranges regardless of outliers at the beginning. The second part of the loss is similar to the first part except that (1) it measures the error of depth (i.e. inverted disparity) so as to be more aligned with point cloud evaluation, and that (2) the error is capped at a constant so as to prevent outliers from dominating the loss.
Given the predicted disparity at each iteration and the ground-truth disparity, the combined loss is defined as follows:
(8)  
(9)  
(10) 
Here one parameter controls the weights across iterations, and a second parameter scales the two parts so that they have roughly the same range. A third parameter balances the two parts and changes from 0 to 1 linearly as training progresses to focus more on the depth error; e.g., for a total of 16 training epochs, it would be 0.5 when 8 epochs are finished.
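The structure of this loss can be sketched as below. All names and default values (`gamma`, `depth_cap`, `depth_scale`) are hypothetical placeholders, and the exact form of the paper's equations may differ; the sketch only follows the verbal description: per-iteration L1 disparity error, a capped depth error, exponentially larger weights for later iterations, and a balance `progress` that moves linearly from 0 to 1 during training.

```python
import numpy as np

def dynamic_loss(disp_preds, disp_gt, progress,
                 gamma=0.9, depth_cap=1.0, depth_scale=1.0):
    """Weighted sum over iterations of a disparity term and a capped depth term.

    disp_preds: list of predicted disparity maps, one per iteration.
    progress: fraction of training completed, e.g. 8 / 16 = 0.5.
    """
    n = len(disp_preds)
    total = 0.0
    for i, d in enumerate(disp_preds):
        weight = gamma ** (n - 1 - i)            # the last iteration gets weight 1
        disp_err = np.abs(d - disp_gt).mean()
        # Clamp before inverting so zero or negative disparities do not divide by zero.
        depth = 1.0 / np.maximum(d, 1e-8)
        depth_err = np.minimum(np.abs(depth - 1.0 / disp_gt), depth_cap).mean()
        total += weight * ((1 - progress) * disp_err
                           + progress * depth_scale * depth_err)
    return total
```

Early in training (`progress` near 0) the loss is dominated by the disparity term, which is well behaved even for poor predictions; later it shifts toward the depth term, which is closer to what the point cloud evaluation measures.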
4 Experiments
4.1 Implementation Details
We evaluate our models on two datasets, DTU and TanksandTemples. On DTU, we train on its training split and evaluate on its test split, as suggested by Yao et al. [yao2018mvsnet] and followed by most authors. On TanksandTemples, we train on the BlendedMVS dataset [yao2020blendedmvs], following the practice of prior work [yao2018mvsnet, yan2020dense, ma2021epp]. For all datasets, the training input to the network is the image at native resolution after random cropping and scaling. Other hyperparameters are given in Table 1.
To pair neighbor views with reference views, we use the same method as MVSNet [yao2018mvsnet]. Because the scenes in BlendedMVS, which is used for training only, have large variations in the range of depth values, we scale each reference view, along with its neighbor views, so that its ground-truth depth has a fixed median value (in mm). When we evaluate on TanksandTemples, due to the lack of ground truth and the noisy background, we scale each reference view, along with its neighbor views, so that the minimum depth of a set of reliable feature points (computed by COLMAP [schonberger2016pixelwise] as in MVSNet [yao2018mvsnet]) equals a fixed value (in mm). To stitch the predicted depth maps from multiple reference views, we simply scale each depth map back to its original scale.
4.2 Main Results
DTU
TanksandTemples
On the TanksandTemples dataset, we achieve state-of-the-art performance, as shown in Table 3. Notably, the model is trained on the BlendedMVS dataset without fine-tuning on TanksandTemples except for some test-time hyperparameter selection using the validation set, as described in Table 1. This indicates a good generalization ability of our approach. A visualization of some results is shown in Fig. 3, from which we can see that many reconstructed scenes look reasonably accurate, detailed, and complete, but there is still substantial room for improvement, especially on low-texture planar regions.
4.3 Ablations
We run our ablation experiments on the official TanksandTemples training set (used as a validation set) in a restricted setting where we train the model on BlendedMVS for only 2 epochs but keep everything else the same as in Table 1.
Cost Volume Cascading
We study the effect of cost volume cascading on memory consumption. In Fig. 4, we plot GPU memory usage versus score on the TanksandTemples validation set for (1) a series of cascaded models (with different disparity increments in the first stage), and (2) their non-cascaded counterparts, each of which matches the first-stage disparity resolution used in the cascaded model and has an equal total number of GRU iterations. We train all models as described in Sec. 4.1 and chose the cascaded model (64, 320) for full-length training and benchmarking. It uses 44 disparity values with an increment of 1/320 of the maximum disparity in the second stage, and 64 values with a coarser increment in the first stage to cover the entire disparity range from zero to the maximum disparity. Because the non-cascaded model needs to fill the entire disparity range, it needs significantly more memory as the disparity resolution increases. We see from Fig. 4 that cascading produces significant savings of memory. Note that the reported memory is the peak memory reported by the command "nvidia-smi".
Method  score 
(1) Truncated depth loss  N/A 
(2) disparity loss  
(3) Average of (1) and (2)  
(4) Proposed dynamic loss 
Mean score (%) 




Dynamic Supervision
In Table 5, we show our model trained with different loss supervision. With the truncated depth loss alone, the model fails to start training; the disparity loss alone gives inferior performance; and the proposed dynamic loss is marginally better than the direct average of the depth loss and the disparity loss.
Number of Neighbor Views
During inference, our network can use a different number of neighbor views than in training. In Table 5, we study the effect of changing the number of neighbor views during inference. In particular, we study how this number can be chosen differently for the two resolutions we use to predict depth maps. As the results on the validation set show, the best combination is 15 views for the native-resolution prediction and 25 views for the 2x native-resolution prediction. These are the numbers we use on the test set.
Aggregation option  Mean score (%) 
max  
max + mean  
std  
std + mean  
mean 
Controlled percentage  
Mean score (%) 
Fixed threshold  
Mean score (%) 
Aggregation of Cost Volumes
In Table 7, we study aggregation options other than our simple averaging, including both one-channel and two-channel options. The results show that taking the mean works best.
Adaptive Thresholding
To strike a balance between accuracy and completeness scores, we use an adaptive thresholding method and search for the best controlled percentage. The results are in Table 7, in comparison with results from fixed thresholds. We see that our adaptive thresholding approach is significantly better than fixed thresholding.


Control parameter  0.01  0.02  0.04
Mean score (%)
Weighted average with weight  0.25  0.5  0.75
Mean F1 score (%)
Multiresolution Fusion of Depth Maps
An important part of CERMVS is the multiresolution fusion of depth maps. Unlike the previous components, its effect is most obvious on our final model trained for 16 epochs. We report the following results on the validation set of TanksandTemples: (1) different values of the control parameter, and (2) a simple weighted average of the native-input results and the 2x native-input results with different weights. We see from Table 8 that our novel fusion approach is significantly better than all the other approaches.
Method
CasMVSNet  4  (1056, 1920)  792.2  9.5
VisMVSNet  (528, 960)  864.2  4.5
PatchmatchNet  (1056, 1920)  317.7  3.2
EPPMVSNet  (528, 960)  522.2  8.2
Ours  (264, 480)  664.4  3.0
Ours  (528, 960)  1754.5  7.0
Ours  25  7611.3  22.6
4.4 Memory and Runtime
The computational cost of CERMVS is compared with other methods in Table 9. When using similar resolution and numbers of views, the time and memory cost of our method is comparable to others.
5 Conclusion
We have proposed CERMVS, a new approach based on the RAFT architecture developed for optical flow. CERMVS introduces five new changes to RAFT: epipolar cost volumes, cost volume cascading, multiview fusion of cost volumes, dynamic supervision, and multiresolution fusion of depth maps, as well as adaptive thresholding to construct point clouds. Experiments show that our approach achieves competitive performance on DTU and state-of-the-art performance on TanksandTemples.
Acknowledgments: This work is partially supported by the National Science Foundation under Award IIS1942981.