Deep Stereo using Adaptive Thin Volume Representation with Uncertainty Awareness

11/27/2019 ∙ by Shuo Cheng, et al. ∙ University of California, San Diego

We present Uncertainty-aware Cascaded Stereo Network (UCS-Net) for 3D reconstruction from multiple RGB images. Multi-view stereo (MVS) aims to reconstruct fine-grained scene geometry from multi-view images. Previous learning-based MVS methods estimate per-view depth using plane sweep volumes with a fixed depth hypothesis at each plane; this generally requires densely sampled planes for desired accuracy, and it is very hard to achieve high-resolution depth. In contrast, we propose adaptive thin volumes (ATVs); in an ATV, the depth hypothesis of each plane is spatially varying, which adapts to the uncertainties of previous per-pixel depth predictions. Our UCS-Net has three stages: the first stage processes a small standard plane sweep volume to predict low-resolution depth; two ATVs are then used in the following stages to refine the depth with higher resolution and higher accuracy. Our ATV consists of only a small number of planes; yet, it efficiently partitions local depth ranges within learned small intervals. In particular, we propose to use variance-based uncertainty estimates to adaptively construct ATVs; this differentiable process introduces reasonable and fine-grained spatial partitioning. Our multi-stage framework progressively subdivides the vast scene space with increasing depth resolution and precision, which enables scene reconstruction with high completeness and accuracy in a coarse-to-fine fashion. We demonstrate that our method achieves superior performance compared with state-of-the-art benchmarks on various challenging datasets.




1 Introduction

Figure 1: Our UCS-Net leverages adaptive thin volumes (ATVs) to progressively reconstruct a highly accurate high-resolution depth map through multiple stages. We show the input RGB image, depth predictions with increasing sizes from three stages, and our final point cloud reconstruction obtained by fusing multiple depth maps. We also show local 2D slices of our ATVs around a pixel (red dot). Note that our ATVs become thinner after each stage because of reduced uncertainty.

Inferring 3D scene geometry from captured images is a core problem in computer vision and graphics with applications in 3D visualization, scene understanding, robotics and autonomous driving. Multi-view stereo (MVS) aims to reconstruct dense 3D representations from multiple images with calibrated cameras. Inspired by the success of deep convolutional neural networks (CNN), several learning-based MVS methods have been presented [20, 23, 45, 17, 38]; the most recent work leverages cost volumes in a learning pipeline [48, 18], and outperforms many traditional MVS methods [10].

At the core of the recent success on MVS [48, 18] is the application of 3D CNNs on plane sweep cost volumes to effectively infer multi-view correspondence. However, such 3D CNNs involve massive memory usage for depth estimation with high accuracy and completeness. In particular, for a large scene, high accuracy requires sampling a large number of sweeping planes and high completeness requires reconstructing high-resolution depth maps. In general, given limited memory, there is an undesired trade-off between accuracy (more planes) and completeness (more pixels) in previous work [48, 18].

Our goal is to achieve highly accurate and highly complete depth reconstruction with moderate memory usage at the same time. To do so, we propose a novel learning-based uncertainty-aware multi-view stereo framework, which utilizes multiple small volumes, instead of a large standard plane sweep volume, to progressively regress a high-quality depth map in a coarse-to-fine fashion. A key in our method is that we propose novel adaptive thin volumes (ATVs, as shown in Fig. 1) to achieve efficient spatial partitioning.

Specifically, we propose a novel cascaded network with three stages (see Fig. 2): each stage of the cascade predicts a depth map with a different size; each following stage constructs an ATV to refine the predicted depth from the previous stage with higher pixel-wise resolution and finer depth partitioning. The first stage uses a small standard plane sweep volume with low image resolution and relatively sparse depth planes (160 planes, fewer than the 256 or 512 planes used in previous work [48, 49]); the following two stages use ATVs with higher image resolutions and significantly fewer depth planes (only 16 and 8 planes). While consisting of a very small number of planes, our ATVs are constructed within learned local depth ranges, which enables efficient and fine-grained spatial partitioning for accurate and complete depth reconstruction.

This is made possible by the novel uncertainty-aware construction of an ATV. In particular, we leverage the variances of the predicted per-pixel depth probabilities, and infer the uncertainty intervals (as shown in Fig. 3) by calculating variance-based confidence intervals of the per-pixel probability distributions for the ATV construction. Specifically, we apply the previously predicted depth map as a central curved plane, and construct an ATV around the central plane within local per-pixel uncertainty intervals. In this way, we explicitly express the uncertainty of the depth prediction at one stage, and embed this knowledge into the input volume for the next stage.

Our variance-based uncertainty estimation is differentiable, and we train our UCS-Net end to end with depth supervision on the predicted depths from all three stages. Our network can thus learn to optimize the estimated uncertainty intervals, so that an ATV is constructed with a depth coverage that is both large enough to cover the ground truth depth and small enough to enable accurate reconstruction in the following stages. Overall, our multi-stage framework progressively sub-divides the local space at a finer scale, which leads to high-quality depth reconstruction. We demonstrate that our novel UCS-Net outperforms the state-of-the-art learning-based MVS methods on various datasets.

2 Related Work

Figure 2: Overview of our UCS-Net. Our UCS-Net leverages multi-scale cost volumes to achieve coarse-to-fine depth prediction with three cascade stages. The cost volumes are constructed using multi-scale deep image features from a multi-scale feature extractor. The last two stages utilize the uncertainty of the previous depth prediction to build adaptive thin volumes (ATVs) for depth reconstruction at a finer scale. We mark different parts of the network in different colors. Please refer to Sec 3 and the corresponding subsections for more details.

We present a multi-view stereo framework leveraging our novel spatial representation, the ATV, to reconstruct high-quality depth for fine-grained scene reconstruction. We discuss our work in the context of spatial representation for 3D reconstruction, deep multi-view stereo and high-resolution depth estimation.

Spatial Representation for 3D Reconstruction. Existing methods can be categorized based on learned 3D representations. Volumetric approaches partition the space into a regular 3D volume with millions of small voxels [20, 23, 45, 46, 50, 33], and the network predicts whether a voxel is on the surface or not. Ray tracing can be incorporated into this voxelized structure [40, 32, 41]. The main drawback of these methods is computation and memory inefficiency, given that most voxels are not on the surface. Researchers have also tried to reconstruct point clouds [19, 10, 29, 43, 30, 2]; however, the high dimensionality of a point cloud often results in noisy outliers, since a point cloud does not efficiently encode connectivity between points. Some recent works utilize single or multiple images to reconstruct a point cloud given strong shape priors [9, 19, 30], which cannot be directly extended to large-scale scene reconstruction. Recent work also tried to directly reconstruct surface meshes [28, 22, 44, 16, 37, 24], deformable shapes [21, 22], and some learned implicit distance functions [7, 34, 31, 6]. These reconstructed surfaces often look smoother than the results of point-cloud-based approaches, but often lack high-frequency details. A depth map represents dense 3D information that is perfectly aligned with a reference view; depth reconstruction has been demonstrated in many previous works on reconstruction with both single view [8, 42, 13, 14, 52] and multiple views [4, 39, 15, 11, 35, 47, 35]. Some of them leverage normal information as well [11, 12]. In this paper, we present the ATV, a novel spatial representation for depth estimation; we use two ATVs to progressively partition local space, which is the key to achieving coarse-to-fine reconstruction.

Deep Multi-View Stereo (MVS). The traditional MVS pipeline mainly relies on photo-consistency constraints to infer the underlying 3D geometry, but usually performs poorly on texture-less or occluded areas, or under complex lighting environments. To overcome such limitations, many deep learning-based MVS methods have emerged in the last two years, including regression-based approaches [48, 18], classification-based approaches [17], approaches based on recurrent or iterative architectures [49, 51, 5] and many other approaches [26, 32, 3, 36]. Most of these methods build a single cost volume with uniformly sampled depth hypotheses by projecting 2D image features into 3D space, and then use a stack of either 2D or 3D CNNs to infer the final depth. However, a single cost volume often requires a large number of depth planes to achieve enough reconstruction accuracy, and it is difficult to reconstruct high-resolution depth, limited by the memory bottleneck. R-MVSNet [49] leverages recurrent networks to sequentially build a cost volume with a high depth-wise sampling rate (512 planes). In contrast, we apply an adaptive sampling strategy with ATVs, which enables more efficient spatial partitioning with a higher depth-wise sampling rate using fewer depth planes (184 planes in total, see Sec. 3.5), and our method achieves significantly better reconstruction than R-MVSNet (see Tab. 1 and Tab. 2). On the other hand, Point-MVSNet [5] densifies a coarse reconstruction within a predefined local spatial range for better reconstruction with learning-based refinement. We propose to refine depth in a learned local space with adaptive thin volumes to obtain accurate high-resolution depth, which leads to better reconstruction than Point-MVSNet and other state-of-the-art methods (see Tab. 1 and Tab. 2).

3 Method

3.1 Overview of UCS-Net

Some recent works aim to improve learning-based MVS methods. Recurrent networks [49] have been utilized to achieve fine depth-wise partitioning for high accuracy; a PointNet-based method [5] is also presented to densify the reconstruction for high completeness. Our goal is to reconstruct high-quality 3D geometry with both high accuracy and high completeness. To this end, we propose a novel uncertainty-aware cascaded network (UCS-Net) to reconstruct highly accurate per-view depth with high resolution.

Given a reference image $I_1$ and $N-1$ source images $\{I_i\}_{i=2}^{N}$, our UCS-Net progressively regresses a fine-grained depth map at the same resolution as the reference image. We show the architecture of the UCS-Net in Fig. 2. Our UCS-Net first leverages a 2D CNN to extract multi-scale deep image features at three resolutions (Sec. 3.2). Our depth prediction is achieved through three stages, which leverage multi-scale image features to predict multi-resolution depth maps. In these stages, we construct multi-scale cost volumes (Sec. 3.3), where each volume is a plane sweep volume or an adaptive thin volume (ATV). We then apply 3D CNNs to process the cost volumes to predict per-pixel depth probability distributions, and a depth map is reconstructed from the expectations of the distributions (Sec. 3.4). To achieve efficient spatial partitioning, we utilize the uncertainty of the depth prediction to construct ATVs as cost volumes for the last two stages (Sec. 3.5). Our multi-stage network effectively reconstructs depth in a coarse-to-fine fashion (Sec. 3.6).

3.2 Multi-scale feature extractor

Many methods [48, 49] use downsampling convolutional layers to extract deep features and build a plane sweep volume at a single downsampled resolution. To reconstruct high-resolution depth, we introduce a multi-scale feature extractor, which enables constructing multiple cost volumes at different scales for multi-resolution depth prediction. As schematically shown in Fig. 2, our feature extractor is a small 2D UNet, which has an encoder and a decoder with skip connections. The encoder consists of a set of convolutional layers followed by GN (group normalization) and ReLU activation layers; we use stride-2 convolutions to downsample the original image size twice. The decoder upsamples the feature maps, convolves the upsampled features and the concatenated features from skip links, and also applies GN and ReLU layers. Given each input image $I_i$, the feature extractor provides feature maps at three scales, $F_{i,1}$, $F_{i,2}$ and $F_{i,3}$, from the decoder for the following cost volume construction. We represent the original image size as $W \times H$, where $W$ and $H$ denote the image width and height; correspondingly, $F_{i,1}$, $F_{i,2}$ and $F_{i,3}$ have resolutions of $\frac{W}{4} \times \frac{H}{4}$, $\frac{W}{2} \times \frac{H}{2}$ and $W \times H$, and their numbers of channels are 32, 16 and 8 respectively. Our multi-scale feature extractor allows the high-resolution features to properly incorporate the information at lower resolutions through the learned upsampling process; thus in the multi-stage depth prediction, each stage is aware of the meaningful feature knowledge used in previous stages, which leads to reasonable high-frequency feature extraction.

3.3 Cost volume construction

We construct multiple cost volumes at multiple scales by warping the extracted feature maps, $F_{i,1}$, $F_{i,2}$ and $F_{i,3}$, from source views to a reference view. Similar to previous work, this process is achieved through differentiable unprojecting and projecting. In particular, given camera intrinsics, rotations and translations $\{K_i, R_i, t_i\}$ for each view $i$, the warping matrix $H_i(d)$ at depth $d$ at the reference view is given by:


In particular, when warping to a pixel $x$ in the reference image at depth $d$, the warping matrix $H_i(d)$ finds its corresponding pixel location in each source feature map $F_{i,k}$ in homogeneous coordinates.
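The unproject-then-project step can be illustrated for a single pixel. The sketch below is not the paper's warping matrix; it assumes a pinhole camera with world-to-camera extrinsics (`X_cam = R @ X_world + t`), and the helper names and the camera dictionary layout are hypothetical.

```python
def intrinsics_inverse(fx, fy, cx, cy):
    """Closed-form inverse of K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix; for a rotation this is also its inverse."""
    return [[m[c][r] for c in range(3)] for r in range(3)]


def warp_pixel(x, y, depth, ref_cam, src_cam):
    """Map reference pixel (x, y) at the given depth to source pixel coordinates.

    Assumption (not from the paper): each camera is a dict with keys
    fx, fy, cx, cy, R (3x3) and t (3-vector), where X_cam = R @ X_world + t.
    """
    # Unproject: back-project the pixel and scale the ray so that z == depth.
    k_inv = intrinsics_inverse(ref_cam["fx"], ref_cam["fy"],
                               ref_cam["cx"], ref_cam["cy"])
    ray = mat_vec(k_inv, [x, y, 1.0])
    p_ref = [depth * r for r in ray]
    # Reference camera frame -> world frame: X_world = R_ref^T (X_cam - t_ref).
    shifted = [p_ref[i] - ref_cam["t"][i] for i in range(3)]
    p_world = mat_vec(transpose(ref_cam["R"]), shifted)
    # World frame -> source camera frame.
    rotated = mat_vec(src_cam["R"], p_world)
    p_src = [rotated[i] + src_cam["t"][i] for i in range(3)]
    # Project with the source intrinsics; (u, v, w) is homogeneous.
    k_src = [[src_cam["fx"], 0.0, src_cam["cx"]],
             [0.0, src_cam["fy"], src_cam["cy"]],
             [0.0, 0.0, 1.0]]
    u, v, w = mat_vec(k_src, p_src)
    return u / w, v / w
```

Because the depth is an explicit argument, the same routine covers both a constant depth per plane and a depth that varies per pixel.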

Each cost volume consists of multiple planes; we use $L_{k,j}$ to denote the depth hypothesis of the $j$th plane at the $k$th stage, and $L_{k,j}(x)$ represents its value at pixel $x$. At stage $k$, once we warp per-view feature maps $F_{i,k}$ at all depth planes with corresponding hypotheses $L_{k,j}$, we calculate the variance of the warped feature maps across views at each plane to construct a cost volume. We use $D_k$ to represent the number of planes for stage $k$. For the first stage, we build a standard plane sweep volume, whose depth hypotheses are of constant values, i.e. $L_{1,j}(x) = d_j$. We uniformly sample $\{d_j\}_{j=1}^{D_1}$ from a pre-defined depth interval $[d_{\min}, d_{\max}]$ to construct the volume, in which each plane is constructed using the warping matrix at $d = d_j$ to warp multi-view images. For the second and third stages, we build novel adaptive thin volumes, whose depth hypotheses $L_{k,j}$ have spatially-varying depth values according to pixel-wise uncertainty estimates of the previous depth prediction. In this case, we calculate per-pixel per-plane warping matrices by setting $d = L_{k,j}(x)$ in Eqn. 1 to warp images and construct cost volumes. Please refer to Sec. 3.5 for uncertainty estimation.
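Two ingredients of this construction are easy to sketch: the uniformly spaced depth hypotheses of the first-stage volume, and the across-view variance used as the matching cost. The function names are hypothetical, and two conventions are assumed rather than taken from the paper: the interval endpoints are included in the sampling, and the variance is the population variance (divided by the number of views).

```python
def uniform_depths(d_min, d_max, num_planes):
    """Uniformly spaced depth hypotheses over [d_min, d_max], endpoints included."""
    step = (d_max - d_min) / (num_planes - 1)
    return [d_min + j * step for j in range(num_planes)]


def variance_cost(warped_features):
    """Per-channel variance across views at a single voxel.

    warped_features: one feature vector (list of floats) per view, all
    sampled at the same reference pixel and depth hypothesis.
    """
    num_views = len(warped_features)
    num_channels = len(warped_features[0])
    cost = []
    for c in range(num_channels):
        values = [feat[c] for feat in warped_features]
        mean = sum(values) / num_views
        # Low variance means the views agree, i.e. the depth is plausible.
        cost.append(sum((v - mean) ** 2 for v in values) / num_views)
    return cost
```

For example, `variance_cost([[1.0, 2.0], [1.0, 4.0]])` gives zero cost in the first channel, where the two views agree, and a positive cost in the second.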

3.4 Depth prediction and probability distribution

At each stage, we apply a 3D CNN to process the cost volume, infer multi-view correspondence and predict depth probability distributions. In particular, we use a 3D UNet similar to [48], which has multiple downsampling and upsampling 3D convolutional layers to reason about scene geometry at multiple scales. We apply depth-wise softmax at the end of the 3D CNNs to predict per-pixel depth probabilities. Our three stages use the same network architecture without sharing weights, so that each stage learns to process its information at a different scale. Please refer to the supplemental material for details of our 3D CNN architecture.

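The 3D CNN at each stage $k$ predicts a depth probability volume that consists of $D_k$ depth probability maps $P_{k,j}$ associated with the $D_k$ depth hypotheses $L_{k,j}$. $P_k$ expresses per-pixel depth probability distributions, where $P_{k,j}(x)$ represents how probable it is that the depth at pixel $x$ is $L_{k,j}(x)$. A depth map $\hat{L}_k$ at stage $k$ is reconstructed by a weighted sum:

$$\hat{L}_k(x) = \sum_{j=1}^{D_k} L_{k,j}(x) \cdot P_{k,j}(x). \tag{2}$$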
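For a single pixel, the depth-wise softmax followed by the weighted sum can be sketched as below. The function names are hypothetical, and the raw scores stand in for the 3D CNN output at that pixel.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the depth dimension of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(probs, hypotheses):
    """Weighted sum of depth hypotheses, i.e. the expectation of the distribution."""
    return sum(p * d for p, d in zip(probs, hypotheses))
```

Taking the expectation instead of the most probable plane keeps the operation differentiable and yields depth values between the sampled planes.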


3.5 Uncertainty estimation and ATV

The key to our framework is to progressively sub-partition the local space and refine the depth prediction with increasing resolution and accuracy. To do so, we construct novel ATVs for the last two stages, which have curved sweeping planes with spatially-varying depth hypotheses (as illustrated in Fig. 1 and Fig. 2), based on uncertainty inference of the predicted depth in the previous stage.

Figure 3: At a cascade stage, we predict a depth map (bottom right) from input RGB images (top left), and infer the uncertainty of the prediction, expressed by a confidence interval (marked in purple). On the bottom left, we show the predicted depth probabilities (connected blue dots) of a pixel (red points in depth), depth prediction (red dash line), the ground truth depth (blue dash line) and confidence intervals in the three stages.

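Given a set of depth probability maps, previous work only utilizes the expectation of the per-pixel distributions (using Eqn. (2)) to determine an estimated depth map. For the first time, we leverage the variance of the distribution for uncertainty estimation, and construct ATVs using the uncertainty. In particular, with $P_{k,j}(x)$ the predicted probability of the depth hypothesis $L_{k,j}(x)$ and $\hat{L}_k(x)$ the predicted depth, the variance $\hat{V}_k(x)$ of the probability distribution at pixel $x$ and stage $k$ is calculated as:

$$\hat{V}_k(x) = \sum_{j=1}^{D_k} P_{k,j}(x) \cdot \left(L_{k,j}(x) - \hat{L}_k(x)\right)^2, \tag{3}$$

and the corresponding standard deviation is $\hat{\sigma}_k(x) = \sqrt{\hat{V}_k(x)}$. Given the depth prediction $\hat{L}_k(x)$ and its variance $\hat{V}_k(x)$ at pixel $x$, we propose to use a variance-based confidence interval to measure the uncertainty of the prediction:

$$C_k(x) = \left[\hat{L}_k(x) - \lambda \hat{\sigma}_k(x),\; \hat{L}_k(x) + \lambda \hat{\sigma}_k(x)\right], \tag{4}$$

where $\lambda$ is a scalar parameter that determines how large the confidence interval is. For each pixel $x$, we uniformly sample $D_{k+1}$ depth values from $C_k(x)$ of the $k$th stage, to get its depth values $L_{k+1,1}(x)$, $L_{k+1,2}(x)$, …, $L_{k+1,D_{k+1}}(x)$ of the depth planes for stage $k+1$. In this way, we construct $D_{k+1}$ spatially-varying depth hypotheses $L_{k+1,j}$, which form the ATV for stage $k+1$.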
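The per-pixel steps described above (mean, variance, confidence interval, uniform resampling) can be sketched in a few lines. The function names are hypothetical, the `scale` argument plays the role of the interval-size parameter, and including both interval endpoints in the sampling is an assumption, not something the text specifies.

```python
import math


def depth_mean_and_std(probs, hypotheses):
    """Expectation and standard deviation of a per-pixel depth distribution."""
    mean = sum(p * d for p, d in zip(probs, hypotheses))
    variance = sum(p * (d - mean) ** 2 for p, d in zip(probs, hypotheses))
    return mean, math.sqrt(variance)


def confidence_interval(mean, std, scale):
    """Variance-based interval: mean plus/minus scale times the standard deviation."""
    return mean - scale * std, mean + scale * std


def atv_hypotheses(probs, hypotheses, scale, num_planes):
    """Depth hypotheses of the next stage for one pixel.

    Assumption: samples are uniformly spaced with both endpoints included.
    """
    mean, std = depth_mean_and_std(probs, hypotheses)
    low, high = confidence_interval(mean, std, scale)
    step = (high - low) / (num_planes - 1)
    return [low + j * step for j in range(num_planes)]
```

A sharply peaked distribution yields a small standard deviation and therefore a thin local volume, while an ambiguous pixel keeps a wider search range.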

The estimated confidence interval $C_k(x)$ expresses the uncertainty interval of the prediction $\hat{L}_k(x)$, which determines the physical thickness of an ATV at each pixel. In Fig. 3, we show an actual example of the three intervals $C_k(x)$ of a pixel around the prediction $\hat{L}_k(x)$ (red dash line), with , , . The interval $C_k(x)$ essentially depicts a probabilistic local space around the ground truth surface, and the ground truth depth is located in the uncertainty interval with a very high confidence. Note that our variance-based uncertainty estimation is differentiable, which enables our UCS-Net to learn to adjust the probability prediction at each stage to achieve optimized intervals and corresponding ATVs for following stages in an end-to-end training process. As a result, the spatially varying depth hypotheses in ATVs naturally adapt to the uncertainty of depth predictions, which leads to highly efficient spatial partitioning.

3.6 Coarse-to-fine prediction

Our UCS-Net leverages three stages to reconstruct depth at multiple scales from coarse to fine. In particular, with $W \times H$ denoting the original image size, we use the feature maps $F_{i,1}$, $F_{i,2}$ and $F_{i,3}$ to construct a plane sweep volume and two ATVs with sizes of $\frac{W}{4} \times \frac{H}{4} \times 160$, $\frac{W}{2} \times \frac{H}{2} \times 16$ and $W \times H \times 8$ to estimate depth at corresponding resolutions. While our two ATVs have small numbers (16 and 8) of depth planes, they in fact partition local depth ranges at finer scales than the first stage volume; this is achieved by our novel uncertainty-aware volume construction process which adaptively controls local depth intervals. This efficient usage of a small number of depth planes enables the last two stages to deal with higher pixel-wise resolutions given the limited memory, which makes fine-grained depth reconstruction possible. Our novel ATV effectively expresses the locality and uncertainty in the depth prediction, which enables state-of-the-art depth reconstruction results with high accuracy and high completeness through a coarse-to-fine framework.
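A quick count shows how the three volumes compare in size. The sketch below uses the plane counts (160, 16, 8) and feature channel counts (32, 16, 8) stated in the text; treating the number of cost volume channels as equal to the number of feature channels is an assumption, and the function name is hypothetical.

```python
def stage_volumes(width, height, planes=(160, 16, 8),
                  downscale=(4, 2, 1), channels=(32, 16, 8)):
    """Per-stage volume size, voxel count and feature element count.

    Assumption: the cost volume has as many channels as the feature maps.
    """
    volumes = []
    for scale, depth_planes, feat_channels in zip(downscale, planes, channels):
        w, h = width // scale, height // scale
        voxels = w * h * depth_planes
        volumes.append({
            "size": (w, h, depth_planes),
            "voxels": voxels,
            "feature_elements": voxels * feat_channels,
        })
    return volumes


if __name__ == "__main__":
    for stage, vol in enumerate(stage_volumes(640, 480), start=1):
        print(stage, vol["size"], vol["voxels"], vol["feature_elements"])
```

For a 640x480 input, the three stages hold 3,072,000, 1,228,800 and 2,457,600 voxels, against 78,643,200 for a single full-resolution volume with 256 planes. Under the channel assumption, the first stage is the largest of the three.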

3.7 Training details

Training set. We train our network on the DTU dataset [1]. We split the dataset into training, validation and testing sets, and create ground truth depth similar to [48]. In particular, we apply Poisson reconstruction [25] on the point clouds in DTU, and render the surface at the captured views with three resolutions, , and the original . In particular, we use for training.

Loss function. Our UCS-Net predicts depth at three resolutions; we apply loss on depth prediction at each resolution with the rendered ground truth at the same resolution. Our final loss is the combination of the three losses.

Training policy. To train our UCS-Net, we first train the first stage for 10 epochs. We then turn on all three stages and train the entire network from end to end for 20 epochs. We use the Adam optimizer to train the network with an initial learning rate of . We use 2 NVIDIA GTX 1080Ti GPUs to train the network.

4 Experiments

We now evaluate our UCS-Net. We first benchmark on the DTU and Tanks and Temples datasets. Afterwards, we justify the effectiveness of the designs of our network, in terms of uncertainty estimation and multi-stage prediction.


| Method | Acc. (mm) | Comp. (mm) | Overall (mm) |
|---|---|---|---|
| Camp [4] | 0.835 | 0.554 | 0.695 |
| Furu [10] | 0.613 | 0.941 | 0.777 |
| Tola [39] | 0.342 | 1.190 | 0.766 |
| Gipuma [11] | 0.283 | 0.873 | 0.578 |
| SurfaceNet [37] | 0.450 | 1.040 | 0.745 |
| MVSNet [48] | 0.396 | 0.527 | 0.462 |
| R-MVSNet [49] | 0.383 | 0.452 | 0.417 |
| Point-MVSNet [5] | 0.342 | 0.411 | 0.376 |
| Our 1st stage | 0.507 | 0.498 | 0.502 |
| Our 2nd stage | 0.410 | 0.389 | 0.399 |
| Our full model | 0.330 | 0.372 | 0.351 |

Table 1: Quantitative results of accuracy, completeness and overall score on the DTU testing set. Numbers represent distances in millimeters; smaller is better.

Evaluation on the DTU dataset [1]. We evaluate our method on the DTU testing set. To reconstruct the final point cloud, we follow [11] to fuse the depth from multiple views; we use this fusion method for all our experiments. For fair comparisons, we use the same view selection, image size and initial depth range as in [48] with , , , and ; similar settings are also used in other learning-based MVS methods [5, 49]. We use an NVIDIA P6000 GPU to run the evaluation.


| Method | Mean | Family | Francis | Horse | Lighthouse | M60 | Panther | Playground | Train |
|---|---|---|---|---|---|---|---|---|---|
| MVSNet [48] | 43.48 | 55.99 | 28.55 | 25.07 | 50.79 | 53.96 | 50.86 | 47.90 | 34.69 |
| R-MVSNet [49] | 48.40 | 69.96 | 46.65 | 32.59 | 42.95 | 51.88 | 48.80 | 52.00 | 42.38 |
| Dense-R-MVSNet [49] | 50.55 | 73.01 | 54.46 | 43.42 | 43.88 | 46.80 | 46.69 | 50.87 | 45.25 |
| Point-MVSNet [5] | 48.27 | 61.79 | 41.15 | 34.20 | 50.79 | 51.97 | 50.85 | 52.38 | 43.06 |
| Our full model | 53.14 | 70.93 | 51.75 | 42.66 | 53.43 | 54.33 | 50.67 | 54.37 | 47.02 |

Table 2: Quantitative results of F-scores (higher is better) on Tanks and Temples.

Figure 4: Comparisons with R-MVSNet on an example in the DTU dataset. We show rendered images of the point clouds of our method, R-MVSNet and the ground truth. In this example, the ground truth from scanning is incomplete. We also show insets for detailed comparisons marked as a blue box in the ground truth. Note that our result is smoother and has fewer outliers than R-MVSNet’s result.

We compare the accuracy and the completeness of the final reconstructions using the distance metric in [1]. We compare against both traditional methods and learning-based methods, and the average quantitative results are shown in Tab. 1. While Gipuma [11] (a traditional method) achieves the best accuracy among all methods, our method has significantly better completeness and overall scores. Besides, our method outperforms all learning-based baseline methods in terms of both accuracy and completeness. Note that with the same input, MVSNet and R-MVSNet predict depth maps at only a quarter of the input width and height; our final depth maps are estimated at the original image size, which are of much higher resolution and lead to significantly better completeness. Meanwhile, such high completeness is obtained without losing accuracy; our accuracy is also better thanks to our uncertainty-aware progressive reconstruction. Point-MVSNet [5] densifies low-resolution depth within a predefined local depth range, which also reconstructs depth at the original image resolution; in contrast, our UCS-Net leverages learned adaptive local depth ranges and achieves better accuracy and completeness.

We also show results from our intermediate low-resolution depth of the first and the second stages in Tab. 1. Note that, because of sparser depth planes, our first-stage results (160 planes) are worse than MVSNet (256 planes) and R-MVSNet (512 planes) that reconstruct depth at the same low resolution. Nevertheless, our novel uncertainty-aware network introduces highly efficient spatial partitioning with ATVs in the following stages, so that our intermediate second-stage reconstruction is already much better than the two previous methods, and our third stage further improves the quality and achieves the best reconstruction.

We show qualitative comparisons between our method and R-MVSNet [49] in Fig. 4, in which we use the released point cloud reconstruction on R-MVSNet’s website for the comparison. While both methods achieve comparable completeness in this example, it is very hard for R-MVSNet to achieve high accuracy at the same time, which is demonstrated by the outliers and noise on the surface. In contrast, our method is able to obtain high completeness and high accuracy simultaneously as reflected by the smooth complete geometry in the image.

Evaluation on Tanks and Temples dataset [27]. We now evaluate the generalization of our model by testing our network trained with the DTU dataset on complex outdoor scenes in the Tanks and Temples intermediate dataset. We use , and for this experiment. Our method outperforms most published methods, and to the best of our knowledge, when comparing with all published learning-based methods, we achieve the best average F-score (53.14) as shown in Tab. 2. In particular, our method obtains higher F-scores than MVSNet [48] and Point-MVSNet [5] in all testing scenes except Panther (seven of the eight scenes). Dense R-MVSNet leverages a well-designed post-processing method and achieves better performance than ours on some of the scenes (Family, Francis and Horse), whereas our work is focused on high-quality per-view depth reconstruction, and we use a traditional fusion technique for post-processing. Thanks to our high-quality depth, our method still outperforms R-MVSNet on most of the testing scenes and achieves the best overall performance.

Evaluation of uncertainty estimation. One key design of our UCS-Net is leveraging differentiable uncertainty estimation for the ATV construction. We now evaluate our uncertainty estimation on the DTU validation set. In Tab. 3, we show the average length of our estimated uncertainty intervals, the corresponding average sampling distances between planes, and the ratio of the pixels whose estimated uncertainty intervals cover the ground truth depth in the ATVs; we also show the corresponding values of the standard plane sweep volume (PSV) used in the first stage, which has an interval length of 496mm and covers the ground truth depth with certainty.


| Volume | Ratio (%) | Interval |
|---|---|---|
| PSV | 100 | 496mm |
| 1st ATV | 95.2 | 13.42mm |
| 2nd ATV | 89.0 | 6.69mm |

Table 3: Evaluation of uncertainty estimation. The PSV is the plane sweep volume used in the first stage; the 1st ATV is constructed after the first stage and used in the second stage; the 2nd ATV is used in the third stage. The second column shows the percentages of uncertainty intervals that cover the ground truth depth. We also show the average length of the intervals, the number of depth planes and the unit sampling distance.

Figure 5: Qualitative comparisons between multi-stage point clouds and the ground truth point cloud on a scene in the DTU validation set. We show zoom-out (top) and zoom-in (bottom) rendered point clouds; the corresponding zoom-in region is marked in the ground truth as a green box. Our UCS-Net achieves increasingly dense and accurate reconstruction through the multiple stages. Note that the ground truth point cloud is obtained by scanning, and is of even lower quality than our reconstructions in this example.

We can see that our method is able to construct efficient ATVs that cover very local depth ranges. Specifically, the first ATV significantly reduces the initial depth range from 496mm to only 13.42mm on average, and the second ATV further halves the interval length to 6.69mm. Our ATV enables high-resolution depth sampling in an adaptive way, and obtains about 0.84mm sampling distance with only 16 or 8 depth planes. Note that MVSNet and R-MVSNet sample the same large depth range (496mm) in a uniform way with a large number of planes (256 and 512); however, such uniform sampling merely obtains volumes with sampling distances of 1.94mm and 0.97mm along depth. In contrast, our UCS-Net achieves a higher actual depth-wise sampling rate with a small number of planes; this shifts the focus of the cost volumes from sampling the depth to sampling the image plane with dense pixels in ATVs given the limited memory, which leads to high-resolution depth reconstruction.
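The sampling distances quoted above follow from dividing each interval length by its number of planes. The short check below reproduces them; the function name is hypothetical, and dividing by the plane count (rather than by the number of gaps between planes) is the convention that matches the reported figures.

```python
def sampling_distance(interval_mm, num_planes):
    """Average distance between neighboring depth planes, in millimeters."""
    return interval_mm / num_planes


if __name__ == "__main__":
    cases = [
        ("MVSNet, 256 planes", 496.0, 256),
        ("R-MVSNet, 512 planes", 496.0, 512),
        ("1st ATV, 16 planes", 13.42, 16),
        ("2nd ATV, 8 planes", 6.69, 8),
    ]
    for name, interval, planes in cases:
        print(f"{name}: {sampling_distance(interval, planes):.2f} mm")
```

Both ATVs land at roughly 0.84mm, which is finer than the 0.97mm reached with 512 uniformly sampled planes.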

Besides, our adaptive thin volumes achieve high ratios (95.2% and 89.0%) of covering the ground truth depth in the validation set, as shown in Tab. 3; this justifies that our estimated uncertainty intervals are of high confidence. Our variance-based uncertainty estimation is equivalent to approximating a depth probability distribution as a Gaussian distribution and then computing its confidence interval with a specified scale on its standard deviation as in Eqn. 4. Note that our specified scale parameter $\lambda$ corresponds to an uncertainty interval with confidence for a Gaussian distribution, whereas our actual ratio of covering the ground truth depth is much higher than that. This is made possible by the differentiable uncertainty estimation and the end-to-end training process, from which the network learns to control per-stage probability estimation to obtain proper uncertainty intervals for ATV construction. Because of this, we find that our network is not very sensitive to different values of $\lambda$, and learns to predict similar uncertainty. Our uncertainty-aware volume construction process enables highly efficient spatial partitioning, which further allows the final reconstruction to be of high accuracy and high completeness.
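The Gaussian reference value mentioned above can be computed for any scale: for a normal distribution, the probability mass within `scale` standard deviations of the mean is `erf(scale / sqrt(2))`. The sketch below uses illustrative scales only; they are not the value used in the paper, and the function name is hypothetical.

```python
import math


def gaussian_coverage(scale):
    """Probability that a Gaussian sample lies within scale standard deviations."""
    return math.erf(scale / math.sqrt(2.0))


if __name__ == "__main__":
    # Illustrative scales, not the paper's setting.
    for scale in (1.0, 1.5, 2.0):
        print(f"scale={scale}: coverage={gaussian_coverage(scale):.4f}")
```

Comparing such a reference value against the measured ratios in Tab. 3 indicates whether the learned intervals cover the ground truth more often than a Gaussian fit would predict.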

Evaluation of multi-stage depth prediction. We have quantitatively demonstrated that our multi-stage framework reconstructs scene geometry with increasing accuracy and completeness in every stage (see Tab. 1). We now further evaluate our network, and do an ablation study with different stages on the DTU testing set with more detailed quantitative and qualitative comparisons. Fig. 5 shows qualitative comparisons between our reconstructed point clouds and the ground truth point cloud. Our UCS-Net is able to effectively refine and densify the reconstruction through multiple stages. Note that our MVS-based reconstruction is even more complete than the ground truth point cloud that is obtained by scanning, which shows the high quality of our reconstruction.

Besides, we compare with naive upsampling to justify the effectiveness of our uncertainty-aware coarse-to-fine framework. In particular, we compare the results from our full model and the results from the first two stages with naive bilinear upsampling using a scale of 2 (for both height and width), as shown in Tab. 4. We can see that upsampling does improve the reconstruction, which benefits from denser geometry and using our high-quality low-resolution results as input. However, the improvement made by naive upsampling is very limited, which is much lower than our improvement from our ATV-based upsampling. Our UCS-Net makes use of the ATV – a learned local spatial representation that is constructed in an uncertainty-aware way – to reasonably densify the map with a significant increase of both completeness and accuracy at the same time.


Stage  Scale  Size       Acc. (mm)  Comp. (mm)  Overall (mm)
1      1      400x296    0.506      0.498       0.502
1      2      800x592    0.449      0.453       0.451
2      1      800x592    0.410      0.389       0.399
2      2      1600x1184  0.359      0.381       0.370
3      1      1600x1184  0.330      0.373       0.351


Table 4: Ablation study on the DTU testing set with different stages and upsampling scales (a scale of 1 represents the original result at the stage). The quantitative results represent average distances in mm (lower is better).


Method          Running time (s)  Memory (MB)  Input size  Prediction size
One stage       0.659             4261         640x480     160x120
Two stages      0.954             4261         640x480     320x240
Our full model  2.013             4261         640x480     640x480
MVSNet [48]     1.049             4511         640x480     160x120
R-MVSNet [49]   1.421             4261         640x480     160x120


Table 5: Performance comparisons. We show the running time and memory of our method by running the first stage, the first two stages and our full model.

Comparing runtime performance. We now evaluate the timing and memory usage of our method. We run our model on the DTU validation set with an input image resolution of 640x480 (the input size listed in Tab. 5); we compare performance with MVSNet and R-MVSNet with 256 depth planes using the same inputs. Table 5 shows the comparisons of running time and memory. Since the memory bottleneck of our model is the first stage, running our UCS-Net with different numbers of stages takes the same memory, which is lower than MVSNet and comparable to R-MVSNet. While our full model requires a longer running time than the other two methods, our first two stages alone run in a time comparable to theirs. Note that our two-stage reconstruction already achieves better reconstruction than the comparison methods, as shown in Tab. 1. Overall, our UCS-Net with ATVs achieves high-quality reconstruction with moderate computational and memory resources.

5 Conclusion

In this paper, we present a novel deep learning-based approach for multi-view stereo. We propose the uncertainty-aware cascaded stereo network (UCS-Net), which utilizes the adaptive thin volume (ATV), a novel spatial representation. For the first time, we make use of the uncertainty of the prediction in a learning-based MVS system. Specifically, we leverage variance-based uncertainty intervals at one cascade stage to construct an ATV for the following stage. The ATVs progressively sub-partition the local space at a finer scale, and ensure that the smaller volume still surrounds the actual surface with a high probability. Our UCS-Net achieves highly accurate and highly complete scene reconstruction in a coarse-to-fine fashion. We compare our method with various state-of-the-art benchmarks and demonstrate that it achieves the best qualitative and quantitative performance with moderate computational and memory complexity. Our UCS-Net takes a step towards making learning-based MVS methods more reliable and efficient.



In this appendix, we evaluate the uncertainty estimation with additional experiments, show the sub-networks of our network architecture in detail, and present our final point cloud reconstruction results on the DTU testing set and the Tanks and Temples dataset.

6 Additional experiments of uncertainty estimation.

In this section, we discuss additional experiments and analysis of our uncertainty estimation, evaluated on the DTU validation set.

Figure 6: Histograms of the uncertainty interval lengths. We create bins for every 0.5mm to compute the histograms of the lengths of the uncertainty intervals in the two ATVs. We mark the median and the mean values of the lengths in the histograms.
Figure 7: Uncertainty in depth predictions. We show three examples from the DTU validation set with their depth predictions and pixel-wise uncertainty estimates. In each example, we show the reference RGB image, the ground truth depth, the depth prediction and a corresponding error map; we also illustrate the uncertainty intervals by showing the difference between the ground truth depth and the interval boundaries (lower bound and upper bound). Note that in the right two columns, white represents small intervals with low uncertainty, red represents large intervals with large uncertainty, and blue corresponds to intervals that fail to cover the ground truth.

We have shown the average lengths of the uncertainty intervals and the corresponding average sampling distances between the depth planes of the ATVs in Tab. 3 of the main paper. We now show the histograms of the uncertainty interval lengths in Fig. 6 to better illustrate their distributions, with the mean and median lengths marked. The distributions of the two ATVs are unimodal, and most lengths concentrate around the peaks; however, the mean interval lengths differ significantly from the modes, because a small portion of the intervals have very large uncertainty. Consequently, discussing the depth-wise sampling in terms of the mean interval lengths, as we do for Tab. 3 in the main paper, underestimates the sampling efficiency achieved for most pixels, even though the mean lengths already correspond to a high sampling rate. We therefore additionally show the median values in the histograms, which are less sensitive to large-value outliers and more representative than the mean values for these distributions. As shown in Fig. 6, the median interval lengths of the two ATVs are 8.78mm and 4.47mm respectively, which are closer to the peaks of the histograms; these lengths correspond to a depth-wise sampling distance of about 0.55mm, given our specified 16 and 8 depth planes. This is a significantly higher sampling rate than previous works, such as MVSNet [48], which uses 256 planes to obtain a sampling distance of 1.94mm, and R-MVSNet [49], which uses 512 planes to obtain a sampling distance of 0.97mm. Our ATV allows for highly efficient spatial partitioning, which achieves a high sampling rate with a small number of depth planes.
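The sampling distances quoted above can be checked with a few lines of arithmetic. This sketch assumes the sampling distance is the interval length divided by the number of depth planes, which is the convention that reproduces the roughly 0.55mm figure; the helper name `sampling_distance` is hypothetical.

```python
def sampling_distance(interval_length_mm, num_planes):
    """Depth-wise distance between neighboring planes, assuming length / planes."""
    return interval_length_mm / num_planes


# Median uncertainty interval lengths and plane counts of the two ATVs (from the text).
atv_stage2 = sampling_distance(8.78, 16)
atv_stage3 = sampling_distance(4.47, 8)

# Depth ranges implied by the plane counts and sampling distances quoted for prior work.
mvsnet_range = 1.94 * 256
rmvsnet_range = 0.97 * 512

print(f"ATV stage 2: {atv_stage2:.3f} mm, ATV stage 3: {atv_stage3:.3f} mm")
print(f"Implied sweep range: MVSNet {mvsnet_range:.2f} mm, R-MVSNet {rmvsnet_range:.2f} mm")
```

Under this convention the two prior methods sweep the same overall depth range with their 256 and 512 planes, whereas each ATV only has to cover a per-pixel interval of a few millimeters, which is why far fewer planes suffice.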

To illustrate how the per-pixel uncertainty estimates vary in a predicted depth map, we show the pixel-wise difference between the ground truth depth and the boundaries of the uncertainty intervals in Fig. 7. While our estimated uncertainty intervals have small lengths (as shown in Fig. 6), the uncertainty estimation is very reliable: most intervals cover the ground truth depth in both ATVs (the red and white colors in the right two columns of Fig. 7). This is consistent with the high average covering ratios of the two ATVs reported in Tab. 3 of the main paper. We also observe more white in the 3rd-stage ATV than in the 2nd-stage ATV, which reflects that the uncertainty is reduced after a stage and the prediction becomes more precise. Note that while our method may predict incorrect intervals that fail to cover the ground truth for some pixels (blue colors in the right two columns of Fig. 7), those pixels are mostly around shape boundaries, oblique surfaces and highly textureless regions, which are known to be challenging and remain open problems for depth estimation. Moreover, our method predicts large uncertainty for these challenging pixels, which is as expected and reflects the inaccuracies in the predictions.
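As a minimal sketch of how a covering ratio of this kind could be computed, the snippet below counts the fraction of valid pixels whose ground truth depth lies inside the predicted interval. The function name, the flat list layout and the convention that a ground truth value of 0 marks a pixel without ground truth are assumptions made for illustration.

```python
def covering_ratio(gt_depth, lower, upper):
    """Fraction of valid pixels whose ground truth lies within [lower, upper]."""
    # Assumption: gt == 0 marks pixels without ground truth; they are skipped.
    valid = [(g, lo, hi) for g, lo, hi in zip(gt_depth, lower, upper) if g > 0]
    if not valid:
        return 0.0
    covered = sum(1 for g, lo, hi in valid if lo <= g <= hi)
    return covered / len(valid)


# Toy example with four pixels (depths in mm); the last pixel has no ground truth.
gt_depth = [600.0, 610.0, 620.0, 0.0]
lower = [595.0, 612.0, 615.0, 0.0]
upper = [605.0, 618.0, 625.0, 0.0]
ratio = covering_ratio(gt_depth, lower, upper)
print(ratio)
```

In this toy case the second pixel corresponds to a blue pixel in Fig. 7: its interval lies entirely above the ground truth depth, so it does not count as covered.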

7 Network architecture.

We have shown the overview of our network in Fig. 2 of the main paper and discussed our network in Sec. 3 of the paper. Our network consists of a 2D U-Net for feature extraction and three 3D U-Nets with the same architecture for cost volume processing. We show the details of the 2D U-Net in Tab. 6, which is used for our multi-scale feature extractor (see Sec. 3.2 of the paper); we also show the details of our 3D U-Net in Tab. 7 that is used to process the cost volume at each stage (see Sec. 3.4 of the paper).


Layer Stride Output size Input
conv2_1, deconv4_1
conv1_1, deconv5_1


Table 6: The U-Net architecture of our multi-scale feature extractor. We show the detailed convolutional layers of our multi-scale feature extractor; each convolutional layer is followed by a GN (group normalization) layer and a ReLU layer. We use a kernel size of for all convolutional and deconvolutional layers. The outputs of the three bold layers (conv3_1, conv4_2, conv5_2) are further processed to provide multi-scale features: an additional convolutional layer with stride=1 is applied after each bold layer to compute the final features for cost volume construction.


Layer Stride Output size Input
conv2_1, deconv4_1
conv1_1, deconv5_1


Table 7: The network architecture of the 3D U-Net. We show the 3D U-Net architecture that is used to process the cost volume and predict the depth probabilities at each stage. We use a kernel size of for all convolutional and deconvolutional layers. Except for the final convolutional layer, we apply a BN (batch normalization) layer and a ReLU layer after each convolution. We apply soft-max on the final one-channel output over the depth planes to compute the final depth probability maps.
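The soft-max step in the caption above can be illustrated for a single pixel. This is a minimal sketch, assuming one raw score per depth plane and using the standard max-subtraction trick for numerical stability; the function names and the toy values are hypothetical.

```python
import math


def depth_softmax(scores):
    """Turn per-plane scores for one pixel into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(probs, depths):
    """Probability-weighted mean of the depth hypotheses."""
    return sum(p * d for p, d in zip(probs, depths))


# Toy example: four depth planes (mm) and their raw scores for one pixel.
plane_depths = [500.0, 510.0, 520.0, 530.0]
raw_scores = [0.2, 2.0, 1.5, -0.5]
plane_probs = depth_softmax(raw_scores)
pixel_depth = expected_depth(plane_probs, plane_depths)
print(plane_probs, pixel_depth)
```

The resulting probabilities are the per-pixel distribution from which both a depth estimate and a variance-based uncertainty can be derived.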

8 Point cloud reconstruction.

We show our final point cloud reconstruction results on the DTU testing set [1] in Fig. 8 and Fig. 9, and the results on the Tanks and Temples dataset [27] in Fig. 10. Please refer to Tab. 1 and Tab. 2 in the main paper for quantitative results on these datasets.

Figure 8: Point cloud reconstruction on the DTU test set.
Figure 9: Point cloud reconstruction on the DTU test set.
Figure 10: Point cloud reconstruction on the Tanks and Temples dataset.


  • [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision 120 (2), pp. 153–168. Cited by: §3.7, §4, §4, §8.
  • [2] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, pp. 40–49. Cited by: §2.
  • [3] K. Batsos, C. Cai, and P. Mordohai (2018) CBMV: a coalesced bidirectional matching volume for disparity estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2060–2069. Cited by: §2.
  • [4] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla (2008) Using multiple hypotheses to improve depth-maps for multi-view stereo. In European Conference on Computer Vision, pp. 766–779. Cited by: §2, Table 1.
  • [5] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547. Cited by: §2, §3.1, Table 1, Table 2, §4, §4, §4.
  • [6] Z. Chen and H. Zhang (2018) Learning implicit fields for generative shape modeling. arXiv preprint arXiv:1812.02822. Cited by: §2.
  • [7] A. Dai, C. Ruizhongtai Qi, and M. Nießner (2017) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5868–5877. Cited by: §2.
  • [8] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision, pp. 2650–2658. Cited by: §2.
  • [9] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §2.
  • [10] Y. Furukawa and J. Ponce (2010) Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence 32 (8), pp. 1362–1376. Cited by: §1, §2, Table 1.
  • [11] S. Galliani, K. Lasinger, and K. Schindler (2015) Massively parallel multiview stereopsis by surface normal diffusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 873–881. Cited by: §2, Table 1, §4, §4.
  • [12] S. Galliani and K. Schindler (2016) Just look at the image: viewpoint-specific surface normal prediction for improved multi-view reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5479–5487. Cited by: §2.
  • [13] R. Garg, V. K. BG, G. Carneiro, and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European Conference on Computer Vision, pp. 740–756. Cited by: §2.
  • [14] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279. Cited by: §2.
  • [15] W. Hartmann, S. Galliani, M. Havlena, L. Van Gool, and K. Schindler (2017) Learned multi-patch similarity. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1586–1594. Cited by: §2.
  • [16] P. Henderson and V. Ferrari (2019) Learning single-image 3d reconstruction by generative modelling of shape, pose and shading. arXiv preprint arXiv:1901.06447. Cited by: §2.
  • [17] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) Deepmvs: learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830. Cited by: §1, §2.
  • [18] S. Im, H. Jeon, S. Lin, and I. S. Kweon (2018) DPSNet: end-to-end deep plane sweep stereo. Cited by: §1, §1, §2.
  • [19] E. Insafutdinov and A. Dosovitskiy (2018) Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems, pp. 2807–2817. Cited by: §2.
  • [20] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang (2017) Surfacenet: an end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2307–2315. Cited by: §1, §2.
  • [21] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122–7131. Cited by: §2.
  • [22] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018) Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–386. Cited by: §2.
  • [23] A. Kar, C. Häne, and J. Malik (2017) Learning a multi-view stereo machine. In Advances in neural information processing systems, pp. 365–376. Cited by: §1, §2.
  • [24] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3907–3916. Cited by: §2.
  • [25] M. Kazhdan and H. Hoppe (2013) Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32 (3), pp. 29. Cited by: §3.7.
  • [26] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §2.
  • [27] A. Knapitsch, J. Park, Q. Zhou, and V. Koltun (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36 (4), pp. 78. Cited by: §4, §8.
  • [28] L. Ladicky, O. Saurer, S. Jeong, F. Maninchedda, and M. Pollefeys (2017) From point clouds to mesh using regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3893–3902. Cited by: §2.
  • [29] M. Lhuillier and L. Quan (2005) A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE transactions on pattern analysis and machine intelligence 27 (3), pp. 418–433. Cited by: §2.
  • [30] C. Lin, C. Kong, and S. Lucey (2018) Learning efficient point cloud generation for dense 3d object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [31] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2018) Occupancy networks: learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828. Cited by: §2.
  • [32] D. Paschalidou, O. Ulusoy, C. Schmitt, L. Van Gool, and A. Geiger (2018) Raynet: learning volumetric 3d reconstruction with ray potentials. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3897–3906. Cited by: §2, §2.
  • [33] S. R. Richter and S. Roth (2018) Matryoshka networks: predicting 3d geometry via nested shape layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1936–1944. Cited by: §2.
  • [34] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger (2017) Octnetfusion: learning depth fusion from data. In 2017 International Conference on 3D Vision (3DV), pp. 57–66. Cited by: §2.
  • [35] J. L. Schönberger, E. Zheng, J. Frahm, and M. Pollefeys (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pp. 501–518. Cited by: §2.
  • [36] D. Shin, Z. Ren, E. B. Sudderth, and C. C. Fowlkes (2019) Multi-layer depth and epipolar feature transformers for 3d scene reconstruction. arXiv preprint arXiv:1902.06729. Cited by: §2.
  • [37] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani (2017) Surfnet: generating 3d shape surfaces using deep residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6040–6049. Cited by: §2, Table 1.
  • [38] C. Tang and P. Tan (2018) Ba-net: dense bundle adjustment network. arXiv preprint arXiv:1806.04807. Cited by: §1.
  • [39] E. Tola, C. Strecha, and P. Fua (2012) Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications 23 (5), pp. 903–920. Cited by: §2, Table 1.
  • [40] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik (2017) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2626–2634. Cited by: §2.
  • [41] A. O. Ulusoy, A. Geiger, and M. J. Black (2015) Towards probabilistic volumetric reconstruction using ray potentials. In 2015 International Conference on 3D Vision, pp. 10–18. Cited by: §2.
  • [42] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047. Cited by: §2.
  • [43] J. Wang, B. Sun, and Y. Lu (2018) MVPNet: multi-view point regression networks for 3d object reconstruction from a single image. arXiv preprint arXiv:1811.09410. Cited by: §2.
  • [44] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67. Cited by: §2.
  • [45] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum (2017) Marrnet: 3d shape reconstruction via 2.5 d sketches. In Advances in neural information processing systems, pp. 540–550. Cited by: §1, §2.
  • [46] J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. T. Freeman, and J. B. Tenenbaum (2018) Learning shape priors for single-view 3d completion and reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 646–662. Cited by: §2.
  • [47] Y. Yao, S. Li, S. Zhu, H. Deng, T. Fang, and L. Quan (2017) Relative camera refinement for accurate dense reconstruction. In 2017 International Conference on 3D Vision (3DV), pp. 185–194. Cited by: §2.
  • [48] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §1, §1, §1, §2, §3.2, §3.4, §3.7, Table 1, Table 2, Table 5, §4, §4, §6.
  • [49] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. arXiv preprint arXiv:1902.10556. Cited by: §1, §2, §3.1, §3.2, Table 1, Table 2, Table 5, §4, §4, §6.
  • [50] X. Zhang, Z. Zhang, C. Zhang, J. Tenenbaum, B. Freeman, and J. Wu (2018) Learning to reconstruct shapes from unseen classes. In Advances in Neural Information Processing Systems, pp. 2263–2274. Cited by: §2.
  • [51] H. Zhou, B. Ummenhofer, and T. Brox (2018) Deeptam: deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 822–838. Cited by: §2.
  • [52] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2.