3DVNet: Multi-View Depth Prediction and Volumetric Refinement

We present 3DVNet, a novel multi-view stereo (MVS) depth-prediction method that combines the advantages of previous depth-based and volumetric MVS approaches. Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions, resulting in highly accurate predictions which agree on the underlying scene geometry. Unlike existing depth-prediction techniques, our method uses a volumetric 3D convolutional neural network (CNN) that operates in world space on all depth maps jointly. The network can therefore learn meaningful scene-level priors. Furthermore, unlike existing volumetric MVS techniques, our 3D CNN operates on a feature-augmented point cloud, allowing for effective aggregation of multi-view information and flexible iterative refinement of depth maps. Experimental results show our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics on the ScanNet dataset, as well as a selection of scenes from the TUM-RGBD and ICL-NUIM datasets. This shows that our method is effective and generalizes to new settings.




1 Introduction

Figure 1: Volumetric methods lack local detail while depth-based methods lack global coherence. Our method cyclically predicts depth, back-projects into 3D space, volumetrically models geometry, and updates all depth predictions to match, resulting in local detail and global coherence.

Multi-view stereo (MVS) is a central problem in computer vision with applications from augmented reality to autonomous navigation. In MVS, the goal is to reconstruct a scene using only posed RGB images as input. This reconstruction can take many forms, from voxelized occupancy or truncated signed distance fields (TSDFs) to per-frame depth prediction, the focus of this paper. In recent years, MVS methods based on deep learning [2, 6, 11, 12, 17, 18, 22, 24, 26, 29, 30, 31, 32] have surpassed traditional MVS methods [9, 21] on numerous benchmark datasets [5, 13, 15]. In this work, we consider these methods as falling into two categories, depth estimation and volumetric reconstruction, each with advantages and disadvantages.

The most recent learning methods in depth estimation use deep features to perform dense multi-view matching that is robust to large environmental lighting changes and textureless or specular surfaces, among other things. These methods take advantage of well-researched multi-view aggregation techniques and the flexibility of depth as an output modality. They formulate explicit multi-view matching costs and include iterative refinement layers in which a network predicts a small depth offset between an initial prediction and the ground truth depth map [2, 32]. While these techniques have been successful for depth prediction, most are constrained to making independent, per-frame predictions. This results in predictions that do not agree on the underlying 3D geometry of the scene. Those that do make joint predictions across multiple frames use either regularization constraints [11] or recurrent neural networks (RNNs) to encourage frames close in pose space to make similar predictions. However, these methods do not directly operate on a unified 3D scene representation, and their resulting reconstructions lack global coherence (see Fig. 1).


Meanwhile, volumetric techniques operate directly on a unified 3D scene representation by back-projecting and aggregating 2D features into a 3D voxel grid and using a 3D convolutional neural network (CNN) to regress a voxelized parameter, often a TSDF. These methods benefit from the use of 3D CNNs and naturally produce highly coherent 3D reconstructions and accurate depth predictions. However, they do not explicitly formulate a multi-view matching cost like depth-based methods, generally averaging deep features from different views to populate the 3D voxel grid. This results in overly-smooth output meshes (see Fig. 1).

In this paper, we propose 3DVNet, an end-to-end differentiable method for learned multi-view depth prediction that leverages the advantages of both volumetric scene modeling and depth-based multi-view matching and refinement. The key idea behind our method is the use of a 3D scene-modeling network which outputs a multi-scale volumetric encoding of the scene. This encoding is used with a modified PointFlow algorithm [2] to iteratively update a set of initial coarse depth predictions, resulting in predictions that agree on the underlying scene geometry.

Our 3D network operates on all depth predictions at once, and extracts meaningful, scene-level priors similar to volumetric MVS methods. However, the 3D network operates on features aggregated using depth-based multi-view matching and can be used iteratively to update depth maps. In this way, we combine the advantages of the two separate classes of techniques. Because of this, 3DVNet exceeds state-of-the-art results on ScanNet [5] in nearly all depth map prediction and 3D reconstruction metrics when compared with the current best depth and volumetric baselines. Furthermore, we show our method generalizes to other real and synthetic datasets [10, 23], again exceeding the best results on nearly all metrics. Our contributions are as follows:

  1. We present a 3D scene-modeling network which outputs a volumetric scene encoding, and show its effectiveness for iterative depth residual prediction.

  2. We modify PointFlow [2], an existing method for depth map residual predictions, to use our volumetric scene encoding.

  3. We design 3DVNet, a full MVS pipeline, using our 3D scene-modeling network and PointFlow refinement.

2 Related Works

We cover MVS methods using deep learning, categorizing them as either depth-prediction methods or volumetric methods. Our method falls into the first category, but is very much inspired by volumetric techniques.

Depth-Prediction MVS Methods: With some notable exceptions [22, 28], nearly all depth-prediction methods follow a similar paradigm: (1) they construct a plane sweep cost volume on a reference image's camera frustum, (2) they fill the volume with deep features using a cost function that operates on source and reference image features, and (3) they use a network to predict depth from this cost volume. Most methods differ in the cost metric used to construct the volume. Many cost metrics exist, including per-channel variance of deep features [29, 30], learned aggregation using a network [17, 31], concatenation of deep features [12], the dot product of deep features [6], and absolute intensity difference of raw image RGB values [11, 26]. We find per-channel variance [29] to be the most commonly used cost metric, and adopt it in our system.

The choice of cost aggregation method results in either a vectorized matching cost and thus a 4D cost volume [2, 12, 17, 29, 30, 31, 32] or a scalar matching cost and thus a 3D cost volume [6, 11, 26]. Methods with 4D cost volumes generally require 3D networks for processing, while 3D cost volumes can be processed with a 2D U-Net-style [20] encoder-decoder architecture. Some methods operate on the deep features at the bottleneck of this U-Net to make joint depth predictions for all frames or a subset of frames in a given scene. This is similar to our proposed method, and we highlight the differences.

GPMVS [11] uses a Gaussian Process (GP) constraint conditioned on pose distance to regularize these deep features. This GP constraint only operates on deep features and assumes Gaussian priors. In contrast, we directly learn priors from predicted depth maps and explicitly predict depth residuals to modify the depth maps to match. DVMVS [6] introduces an RNN to propagate information in the deep features from frame to frame given an ordered sequence of frames. While it does propagate this information in a geometrically plausible way, the RNN operates only on deep features, similar to GPMVS. Furthermore, the RNN never considers all frames jointly like our method does.

Similar to our method, some networks iteratively predict a residual to refine an initial depth prediction [2, 32]. We specifically highlight Point-MVSNet [2], which introduces PointFlow, a point cloud learning method for residual prediction. Our method is very much inspired by this work. We briefly describe the differences.

In their work, they operate on a point cloud back-projected from a single depth map and augmented with additional points. Features are extracted from this point cloud using point cloud learning techniques and used in their PointFlow module for residual prediction. Crucially, these features do not come from a unified 3D representation of the scene, so the residual prediction is conditioned only on information local to the individual depth prediction and not on global scene information. In contrast, our variation of PointFlow uses our volumetric scene model to condition residual prediction on information from all depth maps. For an in-depth discussion of the differences, see Sec. 3.2.

Volumetric MVS Methods: In volumetric MVS, the goal is to directly regress a global volumetric representation of the scene, generally a TSDF volume. We highlight two methods which inspired our work. Atlas [18] back-projects rays of deep features extracted from images into a global voxel grid, pools features from multiple views using a running average, then directly regresses a TSDF in a coarse-to-fine fashion using a 3D U-Net. NeuralRecon [24] improves on the memory consumption and run-time of Atlas by reconstructing local fragments using the most recent keyframes, then fusing the local fragments into a global volume using an RNN. The reconstructions these methods produce are visually pleasing. However, both construct feature volumes using averaging in a single forward pass, which we believe is non-optimal. In contrast, our depth-based method allows us to construct a feature volume using multi-view matching features and perform iterative refinement.

Figure 2: Our novel 3D scene modeling and refinement method first constructs a multi-scale volumetric scene encoding from a set of input depth maps with corresponding feature maps. It then uses that encoding in a variation of the PointFlow algorithm [2] to predict a residual for each of the depth maps. The full method can be run in a nested for-loop fashion, predicting multiple residuals per depth map in the inner loop and running scene modeling in the outer loop.

3 Methods

Our method takes as input a set of images with corresponding known extrinsic and intrinsic camera parameters. Our goal is to predict a depth map for each image. As a pre-processing step, we define for every image a set of indices pointing to the images to use as source images for depth prediction, and append the reference index to form the final index set for that image.

Our pipeline is as follows. First, a small depth-prediction network independently predicts an initial coarse depth map for every frame using extracted image features (Sec. 3.3). Second, we back-project the initial depth maps to form a joint point cloud (Sec. 3.1). Because each point is associated with one depth map, and that depth map has associated feature maps, we can augment the point with a multi-view matching feature aggregated from those feature maps. Third, our 3D scene-modeling network takes as input this feature-rich point cloud and outputs a multi-scale scene encoding (Sec. 3.1). Fourth, we update each depth map to match this scene encoding using a modified PointFlow algorithm, resulting in highly coherent depth maps and thus highly coherent reconstructions (Sec. 3.2). Steps 2-4 can be run in a nested for-loop, with steps 2 and 3 run in the outer loop to generate updated scene models from the current depth maps and step 4 run in the inner loop to refine depth maps with the current scene model. Finally, we upsample the resulting refined depth maps to the size of the original image in a coarse-to-fine manner, guided by deep features and the original image, to arrive at a final prediction for every image (Sec. 3.3).

3.1 3D Scene Modeling

A visualization of our 3D scene modeling method is given in the upper half of Fig. 2. As stated previously, our 3D scene-modeling network operates on a feature-rich point cloud back-projected from the initial or subsequently updated depth maps. To process this point cloud, we adopt a voxelize-then-extract approach. We first generate a sparse 3D grid of voxels, culling voxels that do not contain depth points. To avoid losing granular information from the point cloud, we generate a deep feature for each voxel using a per-voxel PointNet [1]. The PointNet inputs are the features of each depth point in the voxel as well as the 3D offset of that point to the voxel center. Finally, we run a 3D U-Net [20] on the resulting voxelized feature volume and extract intermediate outputs at multiple resolutions. By nature of its construction, this U-Net learns meaningful, scene-level priors. The result is a multi-scale, volumetric scene encoding.

Point Cloud Formation: We form our point cloud by back-projecting all depth pixels in all depth maps. For the multi-view matching feature associated with each point p, we follow existing work [2, 29] and use per-channel variance aggregation over the reference and source feature maps associated with each depth pixel. For a point p belonging to depth map D_i with reference-and-source index set S_i, the variance feature C(p), applied per-channel, is:

C(p) = (1 / |S_i|) * sum_{j in S_i} ( F_j(pi_j(p)) - F_bar(p) )^2,     (1)

where pi_j(p) is the projection of p to feature map F_j, F_j(pi_j(p)) is the bilinear interpolation of F_j at that projection, and F_bar(p) is the average interpolated feature over all indices in S_i. Intuitively, if p lies on a surface, it is likely to have low variance in most feature channels of C(p), while if it does not lie on a surface, the variance will likely be high.
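As a concrete sketch of this aggregation, the per-channel variance over view-sampled features can be computed as follows (a minimal illustration; the function name and tensor layout are ours, and the bilinear sampling of each feature map is assumed to have already happened):

```python
import torch

def variance_feature(feats):
    """Per-channel variance across views (Eq. 1 style).

    feats: (V, C) tensor -- the feature sampled (e.g. bilinearly) at the
    projection of one 3D point into each of V reference/source feature maps.
    Returns a (C,) variance feature: low where the views agree (the point
    likely lies on a surface), high where they disagree.
    """
    mean = feats.mean(dim=0, keepdim=True)       # (1, C) average over views
    return ((feats - mean) ** 2).mean(dim=0)     # (C,) per-channel variance

# A point seen identically in all views has zero variance:
agree = torch.ones(5, 8)
zero_var = variance_feature(agree)
```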

Point Cloud Voxelization: To form our initial feature volume, we regularly sample an initial 3D grid of points at a fixed interval within the axis-aligned bounding box of the point cloud and define the voxel associated with each grid point c as the cube with center c. We denote the set of depth points that fall within the voxel with center c as P_c. We sparsify this grid by discarding c if no depth points lie within the associated voxel, denoting the surviving set of grid coordinates as G. For c in G, we produce a feature for the associated voxel using PointNet [1] with max pooling. The PointNet feature f_c for each voxel is defined as:

f_c = MAX_{p in P_c} ( MLP( [ p - c, C(p) ] ) ),     (2)

where MLP is a learnable multi-layer perceptron, [ p - c, C(p) ] indicates concatenation of the 3D offset to the voxel center with the feature channels of C(p) to form a feature with 3 additional channels, and MAX is the channel-wise max pooling operation. The result of this stage is a sparse feature volume with features given by Eq. 2 and coordinates G.
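A per-voxel PointNet of this form can be sketched as below (layer sizes and names are illustrative placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn

class VoxelPointNet(nn.Module):
    """Per-voxel PointNet sketch: an MLP applied to [offset-to-voxel-center,
    point feature] for each point, followed by channel-wise max pooling over
    the points in the voxel (in the spirit of Eq. 2)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_ch + 3, out_ch), nn.ReLU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, pts, feats, center):
        # pts: (N, 3) points inside one voxel, feats: (N, C), center: (3,)
        x = torch.cat([pts - center, feats], dim=-1)   # offsets + features
        return self.mlp(x).max(dim=0).values           # channel-wise max pool

net = VoxelPointNet(in_ch=8, out_ch=16)
voxel_feat = net(torch.randn(12, 3), torch.randn(12, 8), torch.zeros(3))
```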

Multi-Scale 3D Feature Extraction: In this stage, we use a sparse 3D U-Net to model the underlying scene geometry. We use a basic U-Net architecture with skip connections, with group normalization used throughout. See the supplementary material for a more detailed description of our architecture. Our sparse U-Net takes as input the sparse feature volume from the previous stage. From intermediate outputs of the U-Net, we extract three scales of feature volumes V_1, V_2, and V_3, with increasing voxel edge lengths, describing the scene. In this way, we extract a rich, multi-scale, volumetric encoding of the scene.

Figure 3: Diagram of standard PointFlow hypothesis point construction and our proposed feature generation, shown in 2D for simplicity. The feature volume in the diagram corresponds to a single scale of our multi-scale scene encoding. Our key change from the original formulation is to generate hypothesis point features by trilinear interpolation of our volumetric scene encoding rather than by edge convolutions on the point cloud from a single back-projected depth map.

3.2 PointFlow-Based Refinement

Figure 4: Overview of the full 3DVNet pipeline. See Secs. 3.1 and 3.2 for a description of our scene modeling and refinement.

In this stage, we use our multi-scale scene encoding from the previous stage in a variation of the PointFlow algorithm proposed by Chen et al. [2]. The goal is to refine our predicted depth maps to match our scene model by predicting a residual for each depth pixel. We briefly review the core components of PointFlow and the intuition behind our proposed change.

In PointFlow, a set of points called hypothesis points is constructed at regular intervals along a depth ray, centered about the depth prediction associated with the given depth ray; the blue and red points in Fig. 3 illustrate this. Features are generated for the hypothesis points. Then, a network processes these features and outputs a probability score for every point, indicating confidence that the given point is at the correct depth. Finally, the expected offset is calculated using these probabilities and added to the original depth prediction. Our key innovation is the use of our multi-scale scene encoding to generate the hypothesis point features.

In the original PointFlow, hypothesis points are constructed for a single depth map, augmented with features using Eq. 1, and aggregated into a point cloud. Note this point cloud is strictly different from ours, as (1) it is produced using a single depth map, and (2) it includes hypothesis points. Features are generated for each point using edge convolutions [27] on the k-nearest-neighbor (kNN) graph. Crucially, these edge convolutions never operate on a unified 3D scene representation in the original PointFlow. This prevents the offset predictions from learning global information, which we believe is a critical step for depth residual prediction. Furthermore, because of the required kNN search, this formulation cannot scale to process a joint point cloud from an arbitrary number of depth maps, preventing it from scaling to learn global information.

Inspired by convolutional occupancy networks [19] and IFNets [3], we instead generate hypothesis features by interpolating each scale of our multi-scale scene encoding (see Fig. 3). With this key change, we use powerful scene-level priors in our offset prediction conditioned on all depth predictions for a given scene. Furthermore, by using the same encoding to update all depth predictions, we encourage global consistency of predictions. We now describe in detail our variation of the PointFlow method (see Figs. 2 and 3), using notation similar to the original paper.

Hypothesis Point Construction: For a given back-projected depth pixel p from depth map D_i, we generate 2m + 1 point hypotheses p_k:

p_k = p + k * s * t,   k = -m, ..., m,

where t is the normalized reference camera direction of p, and s is the displacement step size.
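The hypothesis construction above can be sketched directly (a minimal illustration; the function name, default m, and default step size are assumptions, not the paper's values):

```python
import torch

def hypothesis_points(p, cam_dir, m=2, s=0.05):
    """Generate 2m+1 PointFlow hypothesis points spaced s metres apart
    along the normalized camera ray, centered on the current back-projected
    depth point p."""
    t = cam_dir / cam_dir.norm()                    # unit viewing direction
    k = torch.arange(-m, m + 1, dtype=p.dtype)      # displacement steps -m..m
    return p.unsqueeze(0) + k[:, None] * s * t      # (2m+1, 3) hypotheses

pts = hypothesis_points(torch.tensor([0., 0., 2.]), torch.tensor([0., 0., 1.]))
```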

Feature Generation: We generate a multi-scale feature for each hypothesis point p_k using trilinear interpolation of our sparse feature volumes V_1, V_2, and V_3 at p_k, using zeros where features are not defined:

g(p_k) = [ V_1(p_k), V_2(p_k), V_3(p_k) ].

Next, we generate a variance feature C(p_k) for each hypothesis point using Eq. 1. The final feature for a hypothesis point is the channel-wise concatenation of these features:

h(p_k) = [ g(p_k), C(p_k) ].

We stack our point-hypothesis features to form a 2D feature map H, whose channel dimension is the sum of the dimensions of our variance and scene-encoding features.

Offset Prediction: We apply a 4-layer 1D CNN followed by a softmax function to predict a probability scalar w_k for each point-wise entry in H. The predicted displacement of point p is then:

delta_p = sum_{k = -m}^{m} w_k * k * s * t.

The updated depth for each depth map is the depth of the point p + delta_p with respect to the camera associated with that depth map.
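The soft-argmax step of the offset prediction can be sketched as follows (the scoring CNN is omitted; function name and defaults are ours):

```python
import torch

def expected_offset(scores, m=2, s=0.05):
    """Expected displacement along the ray: softmax the per-hypothesis
    scores into probabilities w_k, then take the probability-weighted sum
    of the candidate displacements k*s for k = -m..m."""
    probs = torch.softmax(scores, dim=-1)                    # (2m+1,) confidences
    disp = torch.arange(-m, m + 1, dtype=scores.dtype) * s   # candidate offsets
    return (probs * disp).sum()                              # expected residual

# Symmetric (uninformative) scores give zero expected offset:
flat = expected_offset(torch.zeros(5))
# Strongly favouring the k = +2 hypothesis pulls the offset toward +2*s:
peaked = expected_offset(torch.tensor([0., 0., 0., 0., 100.]))
```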

3.3 Bringing It All Together: 3DVNet

In this section, we describe our full depth-prediction pipeline using our multi-scale volumetric scene modeling and PointFlow-based refinement, which we name 3DVNet (see Fig. 4). Our pipeline consists of (1) initial feature extraction and depth prediction, (2) scene modeling and refinement, and (3) upsampling of the refined depth maps to the size of the original image. The scene modeling and refinement are done in a nested for-loop fashion, extracting a scene model in the outer loop and iteratively refining the depth predictions using that scene model in the inner loop. 3DVNet operates at a fixed input image size.

2D Feature Extraction: For our 2D feature extraction, we adopt the approach of Düzçeker et al. [6], and use a 32-channel feature pyramid network (FPN) [16] constructed on a MnasNet [25] backbone to extract a coarse- and a fine-resolution feature map for every image.

MVSNet Prediction: For the coarse depth prediction of each image, we use a small MVSNet [29] that takes the reference and source coarse feature maps and predicts an initial coarse depth map. Our cost volume is constructed using traditional plane sweep stereo with depth hypotheses sampled uniformly at intervals of 5 cm starting at 50 cm. Similar to Yu and Gao [32], our predicted depth map is spatially sparse compared to the coarse feature map and has a fixed, smaller prediction size.
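The uniform plane-sweep depth hypotheses described above can be generated in one line (the 5 cm spacing and 50 cm start come from the text; the number of planes here is an assumed placeholder):

```python
import torch

def plane_sweep_depths(n=64, start=0.5, step=0.05):
    """Uniformly spaced plane-sweep depth hypotheses in metres: n planes
    every `step` m starting at `start` m (5 cm spacing from 50 cm)."""
    return start + step * torch.arange(n, dtype=torch.float32)

depths = plane_sweep_depths()
```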

Nested For-Loop Refinement: We first use the initial depth predictions and coarse feature maps to generate the multi-scale scene encoding V_1, V_2, V_3. We then run PointFlow refinement three times with progressively smaller displacement step sizes to get updated depths. In early experiments, we found running two iterations at a single step size to be helpful. We then re-generate our scene encoding using the updated depths and coarse feature maps, and run PointFlow three more times with further reduced step sizes to get the final updated depths. We find our depth maps converge at this point.
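The nested for-loop can be written as a short control-flow sketch. The two stand-in functions represent the scene-modeling network (Sec. 3.1) and one PointFlow pass (Sec. 3.2), and the step-size schedule is an illustrative placeholder, not the paper's values:

```python
# Outer loop: rebuild the volumetric scene encoding from the current depths.
# Inner loop: refine all depth maps against that encoding.
def refine(depths, feats, build_scene_encoding, pointflow_refine,
           schedule=((0.08, 0.04, 0.02), (0.04, 0.02, 0.01))):
    for steps in schedule:
        encoding = build_scene_encoding(depths, feats)
        for s in steps:
            depths = pointflow_refine(depths, feats, encoding, step=s)
    return depths

# Trace the control flow with stub functions:
calls = []
refine("depths", "feats",
       build_scene_encoding=lambda d, f: calls.append("model") or "enc",
       pointflow_refine=lambda d, f, e, step: calls.append(step) or d)
```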

Coarse-to-Fine Upsampling: In this stage, we upsample each refined depth prediction to the size of the original image. We find PointFlow refinement does not remove interpolation artifacts, as this generally requires predicting large offsets across depth boundaries. We therefore outline a simple, coarse-to-fine method for upsampling while removing artifacts (see the right section of Fig. 4). At each step, we upsample the current depth prediction using nearest-neighbor interpolation to the size of the next-largest feature map and concatenate the two, using the original image in the final step. We then pass the concatenated feature map and depth through a smoothing network, a version of the propagation network proposed by Yu and Gao [32]. For every pixel q in depth map D, the smoothed depth D'(q) is a weighted sum of D over the 3x3 neighborhood N(q) about q:

D'(q) = sum_{r in N(q)} w(q, r) * D(r),

where w is a 4-layer CNN that takes as input the concatenated feature and depth map and outputs 9 weights for every pixel q, and w(q, r) indexes those weights for neighbor r. A softmax function is applied to the 9 weights for normalization. We apply this coarse-to-fine upsampling to every refined depth map to arrive at a final depth prediction for every input image.
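The weighted 3x3 smoothing step can be sketched with `unfold` (the guidance CNN that produces the weights is omitted; names and shapes are ours):

```python
import torch
import torch.nn.functional as F

def propagate(depth, weights):
    """Weighted 3x3 smoothing in the spirit of the propagation network:
    each output pixel is a softmax-normalized weighted sum of the 9 input
    depths in its neighborhood (zero padding at the border).

    depth:   (B, 1, H, W) depth map
    weights: (B, 9, H, W) raw per-pixel weights from a guidance CNN
    """
    w = torch.softmax(weights, dim=1)                 # normalize the 9 weights
    nbhd = F.unfold(depth, kernel_size=3, padding=1)  # (B, 9, H*W) 3x3 patches
    return (w * nbhd.view_as(w)).sum(dim=1, keepdim=True)

smoothed = propagate(torch.full((1, 1, 4, 4), 2.0), torch.randn(1, 9, 4, 4))
```

Because the weights are normalized per pixel, a constant depth map stays constant away from the border.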

4 Experiments


Columns (left to right): PMVS, PMVS (FT), FMVS, FMVS (FT), DVMVS pair, DVMVS fusion, GPMVS, GPMVS (FT), Atlas, NeuralRecon, Ours.

ScanNet
Abs-rel    0.389  0.085  0.274  0.084  0.069  0.061  0.121  0.062  0.062  0.063  0.040
Abs-diff   0.668  0.168  0.444  0.165  0.142  0.127  0.214  0.124  0.116  0.099  0.079
Abs-inv    0.148  0.048  0.145  0.050  0.044  0.038  0.066  0.039  0.044  0.039  0.026
Sq-rel     0.798  0.046  0.463  0.045  0.026  0.021  0.860  0.022  0.040  0.039  0.015
RMSE       1.051  0.267  0.776  0.267  0.220  0.200  0.339  0.199  0.238  0.206  0.154
δ < 1.25   0.630  0.922  0.732  0.922  0.949  0.963  0.890  0.960  0.935  0.948  0.975
δ < 1.25²  0.768  0.981  0.857  0.979  0.989  0.992  0.971  0.992  0.971  0.976  0.992
δ < 1.25³  0.859  0.994  0.915  0.993  0.997  0.997  0.990  0.998  0.985  0.989  0.997
Acc        0.093  0.039  0.059  0.043  0.059  0.067  0.077  0.057  0.078  0.058  0.051
Comp       0.303  0.256  0.184  0.212  0.145  0.128  0.150  0.111  0.097  0.108  0.075
Prec       0.651  0.738  0.570  0.707  0.595  0.557  0.486  0.604  0.607  0.636  0.715
Rec        0.317  0.433  0.486  0.454  0.489  0.504  0.453  0.565  0.546  0.509  0.625
F-score    0.409  0.529  0.511  0.541  0.524  0.520  0.459  0.574  0.573  0.564  0.665

TUM-RGBD
Abs-rel    0.318  0.111  0.273  0.113  0.117  0.095  0.102  0.093  0.163  0.106  0.076
Abs-diff   0.642  0.275  0.573  0.281  0.339  0.273  0.243  0.239  0.404  0.167  0.210
δ < 1.25   0.662  0.858  0.694  0.851  0.838  0.886  0.874  0.891  0.816  0.912  0.912
F-score    0.115  0.145  0.150  0.154  0.141  0.162  0.157  0.170  0.129  0.117  0.181

ICL-NUIM
Abs-rel    0.614  0.107  0.303  0.095  0.106  0.114  0.107  0.066  0.110  0.123  0.050
Abs-diff   1.469  0.262  0.707  0.245  0.278  0.322  0.290  0.176  0.332  0.303  0.120
δ < 1.25   0.311  0.877  0.659  0.894  0.878  0.847  0.855  0.965  0.833  0.709  0.980
F-score    0.064  0.144  0.382  0.246  0.173  0.150  0.241  0.323  0.194  0.055  0.440

Table 1: Metrics for three datasets (ScanNet, TUM-RGBD, and ICL-NUIM). Abs-rel through the inlier ratios (δ) are 2D depth metrics; Acc through F-score are 3D metrics. "FT" denotes the method was finetuned on ScanNet. Our method outperforms all other baseline methods by a wide margin on most metrics.

4.1 Implementation and Training Details


Our model is implemented in PyTorch using PyTorch Lightning [7] and PyTorch Geometric [8]. We use Minkowski Engine [4] as our sparse tensor library, and Open3D [33] for both visualization and evaluation.

Training Parameters: We train our network on a single NVIDIA RTX 3090 GPU. Our network is trained end-to-end with a mini-batch size of 2, where each mini-batch consists of 7 images for depth prediction. For our loss function, we accumulate the average error between the ground truth and predicted depth maps, appropriately downsampling the ground truth depth map to the correct resolution, for all predicted, refined, and upsampled depth maps at every stage in our pipeline. Additionally, we employ random geometric scale augmentation and random rotation about the gravitational axis.

We first train with the pre-trained MnasNet backbone frozen, using the Adam optimizer [14] with an initial learning rate that is divided by 10 every 100 epochs (approximately 1.5k iterations), to convergence (1.8k iterations). We then unfreeze the MnasNet backbone and finetune the entire network using Adam and an initial learning rate that is halved every 50 epochs, to convergence (1.8k iterations).
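The multi-stage loss accumulation can be sketched as follows (a simplification: masking of invalid ground-truth pixels and the exact error norm are omitted, and the function name is ours):

```python
import torch
import torch.nn.functional as F

def multi_stage_loss(preds, gt):
    """Accumulate the mean absolute error over every intermediate depth
    prediction (coarse, refined, upsampled, ...), resizing the ground
    truth to each prediction's resolution."""
    loss = 0.0
    for pred in preds:
        gt_s = F.interpolate(gt, size=pred.shape[-2:], mode="nearest")
        loss = loss + (pred - gt_s).abs().mean()
    return loss

gt = torch.ones(1, 1, 8, 8)
preds = [torch.zeros(1, 1, 4, 4), torch.ones(1, 1, 8, 8)]
total = multi_stage_loss(preds, gt)
```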

4.2 Datasets, Baselines, Metrics, and Protocols

Datasets: To train and validate our model, we use the ScanNet [5] official training and validation splits. For our main comparison experiment, we use the ScanNet official test set, which consists of 100 test scenes in a variety of indoor settings. To evaluate the generalization ability of our model, we select 10 sequences from TUM-RGBD [23], and 4 sequences from ICL-NUIM [10] for comparison.

Baselines: We compare our method to seven state-of-the-art baselines: Point-MVSNet (PMVS) [2], Fast-MVSNet (FMVS) [32], the DeepVideoMVS pair/fusion networks (DVMVS pair/fusion) [6], GPMVS batched [11], Atlas [18], and NeuralRecon [24]. The first five baselines are depth-prediction methods while the last two are volumetric methods. Of these, we consider GPMVS and Atlas the most relevant depth-based and volumetric methods respectively, as both use information from all frames simultaneously during inference. We use the ScanNet training scenes to finetune the methods not originally trained on ScanNet [2, 11, 32]. We report both the finetuned and pretrained results, denoted with and without "FT". To account for depth-range differences between the DTU dataset [13] and ScanNet, we use our model's plane sweep parameters with PMVS and FMVS.

Metrics: We use the 2D and 3D metrics presented by Murez et al. [18] for evaluation; see the supplementary material for definitions. Amongst these metrics, we consider Abs-rel, Abs-diff, and the first inlier ratio metric the most suitable 2D metrics for measuring depth-prediction quality, and F-score the most suitable 3D metric for measuring 3D reconstruction quality. Following Düzçeker et al. [6], we only consider ground truth depth values greater than 50 cm, to account for some methods not being able to predict smaller depths. We note that F-score, Precision, and Recall are calculated per-scene and then averaged across all scenes. This results in a different F-score than the one obtained by calculating it from the averaged Precision and Recall we report.
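The difference between the two averaging protocols is easy to see on a small example (the per-scene precision/recall numbers here are hypothetical):

```python
def fscore(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical per-scene (precision, recall) pairs for two scenes:
scenes = [(0.9, 0.3), (0.5, 0.7)]

# Protocol used in the paper: compute F-score per scene, then average.
f_per_scene = sum(fscore(p, r) for p, r in scenes) / len(scenes)

# Alternative: average precision/recall first, then take the F-score.
p_avg = sum(p for p, _ in scenes) / len(scenes)
r_avg = sum(r for _, r in scenes) / len(scenes)
f_of_avg = fscore(p_avg, r_avg)
```

The two quantities generally differ, which is why the reported F-score cannot be recovered from the reported averaged Precision and Recall.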

Protocols: For depth-based methods, we fuse predicted depths using standard multi-view-consistency-based point cloud fusion. Based on results on validation sets, we modify the implementation of Galliani et al. [9] to use a depth-based multi-view consistency check rather than a disparity-based check (see Sec. 3.3 of the supplementary materials). For volumetric methods, we use marching cubes to extract a mesh from the predicted TSDF. Following Murez et al. [18], we trim the meshes to remove geometry not observed in the ground truth camera frustums. Additionally, ScanNet ground truth meshes often contain holes in observed regions. We mask out these holes for all baselines to avoid false penalization. All meshes are single layer to match the ScanNet ground truth, as noted by Sun et al. [24].

We use the DVMVS keyframe selection. For depth-based methods, we use each keyframe as a reference image for depth prediction, with the 2 previous and 2 next keyframes as source images (4 source images total), and resize the output depth map to the evaluation resolution using nearest-neighbor interpolation. For volumetric methods, we use the predicted mesh to render depth maps for each keyframe.

4.3 Evaluation Results and Discussion

Figure 5: Qualitative depth results on ScanNet. Our method produces sharp details with well-defined object boundaries.
Figure 6: Qualitative reconstruction results on ScanNet for the four best-performing methods. Our technique produces globally coherent reconstructions like purely volumetric methods while containing the local detail of depth-based methods.

See Tab. 1 for 2D depth and 3D geometry metrics on all datasets. Our method outperforms all baselines by a wide margin on most metrics. Notably, our Abs-rel error on ScanNet is 0.021 less than that of DVMVS fusion, the second best method, while the Abs-rel of the third, fourth, and fifth best methods are all within 0.002 of DVMVS fusion. Similarly, our ScanNet F-score is 0.09 more than that of GPMVS (FT), the second best method, while the F-score of Atlas, the third best method, is within 0.001 of GPMVS (FT). This demonstrates the significant quantitative improvement of our method in both depth and reconstruction metrics. Results on the additional datasets show the strong generalization ability of our model.

We include qualitative results on ScanNet. See Figs. 5 and 6. See Sec. 4 of the supplementary materials for additional qualitative results. Our depth maps are visually pleasing, with clearly defined edges. They are comparable in quality to those of GPMVS and DVMVS fusion while being quantitatively more accurate. Our reconstructions are coherent like volumetric methods, without the noise present in other depth-based reconstructions, which we believe is a result of our volumetric scene encoding and refinement.

We do note one benefit of Atlas is its ability to fill large unobserved holes. Though not reflected in the metrics, this leads to qualitatively more complete reconstructions. Our system relies on depth maps and thus cannot do this as designed. However, as a result of averaging across image features, Atlas produces meshes that are overly smooth and lack detail. In contrast, our reconstructions contain sharper, better defined features than purely volumetric methods. Finally, we note our system cannot naturally be run in an online fashion, requiring availability of all frames prior to use.


        Abs-rel   Abs-diff             F-score
0   0   0.070     0.137     0.949     0.559
1   1   0.050     0.100     0.965     0.651
1   2   0.044     0.088     0.971     0.661
1   3   0.043     0.086     0.972     0.664
2   1   0.041     0.081     0.974     0.668
2   2   0.040     0.079     0.975     0.667
2   3   0.040     0.079     0.975     0.665
Table 2: Metrics as a function of the number of inner PointFlow-refinement iterations and the number of outer-loop scene-modeling passes.


Model          Abs-rel   Abs-diff             F-score
no 3D          0.067     0.134     0.952     0.551
single scale   0.041     0.080     0.973     0.662
avg feats      0.043     0.082     0.975     0.656
full           0.040     0.079     0.975     0.665
Table 3: Metrics for our ablation study. See Sec. 4.4 for descriptions of each condition.

4.4 Ablation and Additional Studies

Does Iterative Refinement Help? We study the effect of each inner and outer iteration of our depth refinement; see Tab. 2. We exceed state-of-the-art metrics after only 2 iterations, and 3 additional iterations yield continued improvement, confirming the effectiveness of iterative refinement. By 5 iterations, our metrics have converged: the depth metrics stabilize and the F-score decreases slightly, suggesting the final iteration is mildly detrimental.
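The inner/outer schedule ablated above can be sketched as a toy nested loop. Everything below is an illustrative stand-in, not the paper's implementation: `build_scene_encoding` and `pointflow_residual` abbreviate the learned volumetric scene model and PointFlow residual network, and "depth maps" are reduced to single scalars.

```python
# Toy sketch of the nested refinement schedule of Tab. 2: an outer loop
# rebuilds the scene encoding from current predictions, an inner loop applies
# PointFlow-style residual updates conditioned on that encoding.

def build_scene_encoding(depths):
    # Stand-in for the volumetric scene model: here, just the mean depth,
    # a shared world-space estimate every view can condition on.
    return sum(depths) / len(depths)

def pointflow_residual(depth, scene):
    # Stand-in for PointFlow: nudge each prediction toward the shared scene
    # estimate (the real network predicts learned depth residuals).
    return 0.5 * (scene - depth)

def refine(depths, n_outer, n_inner):
    for _ in range(n_outer):              # outer scene-modeling passes
        scene = build_scene_encoding(depths)
        for _ in range(n_inner):          # inner PointFlow iterations
            depths = [d + pointflow_residual(d, scene) for d in depths]
    return depths

# Two inconsistent single-pixel "depth maps" converge toward agreement.
print(refine([1.0, 3.0], n_outer=2, n_inner=3))
```

Each outer pass re-encodes the current predictions in world space, so later inner iterations condition on progressively more consistent geometry.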

Does Multi-Scale Scene Modeling Help? To test this, we (1) completely remove the multi-scale scene encoding from the PointFlow refinement and (2) use only the coarsest scale, respectively denoted “no 3D” and “single scale” in Tab. 3. Without any scene-level information, our refinement breaks down, indicating that scene modeling is essential. The single-scale model does slightly worse than the full model, confirming the effectiveness of our multi-scale encoding.
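A minimal 1-D analogue may clarify what the “single scale” condition removes. The grids and query function below are assumptions for illustration; the actual model interpolates learned features from 3-D volumes at several resolutions.

```python
# Toy 1-D analogue of querying a multi-scale scene encoding at a point.

def query_1d(grid, x):
    # Linear interpolation into a 1-D feature grid over [0, 1]
    # (the real model uses trilinear interpolation into 3-D volumes).
    pos = x * (len(grid) - 1)
    i = min(int(pos), len(grid) - 2)
    t = pos - i
    return (1 - t) * grid[i] + t * grid[i + 1]

coarse = [0.0, 1.0]                    # scene-level context
mid    = [0.0, 1.0, 0.0]
fine   = [1.0, 0.0, 1.0, 0.0, 1.0]    # local detail

x = 0.3
multi_scale_feature = [query_1d(g, x) for g in (coarse, mid, fine)]
single_scale_feature = [query_1d(coarse, x)]   # the "single scale" ablation

print(multi_scale_feature)  # point conditions on all scales jointly
```

The multi-scale query concatenates information at every resolution, while the single-scale ablation sees only the coarse grid and so loses local detail.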

Do Multi-View Matching Features Help? We replace variance aggregation with a per-channel average for each point in our feature-rich point cloud, denoted “avg feats” in Tab. 3. Most metrics, notably the F-score, suffer. This supports our hypothesis that multi-view matching features are more beneficial for reconstruction than averaged features.
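The intuition behind this ablation can be shown with a toy example. The feature vectors below are made up for illustration: per-channel variance across views encodes photometric consistency (features agree where a point lies on the true surface), whereas the mean discards that matching signal.

```python
# Toy illustration of the "avg feats" ablation: aggregating a point's
# per-view image features by per-channel variance versus by mean.

def aggregate(features_per_view, mode):
    n = len(features_per_view)
    c = len(features_per_view[0])
    mean = [sum(f[k] for f in features_per_view) / n for k in range(c)]
    if mode == "avg":
        return mean
    # Per-channel variance across views: low where views agree.
    return [sum((f[k] - mean[k]) ** 2 for f in features_per_view) / n
            for k in range(c)]

on_surface  = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]   # views agree
off_surface = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]   # views disagree

print(aggregate(on_surface, "var"))   # near-zero: photometrically consistent
print(aggregate(off_surface, "var"))  # large: inconsistent point
```

Note that both candidate points can have identical mean features, so the averaged representation cannot distinguish them; the variance can.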

For additional studies, see the supplementary material.

5 Conclusion

We present 3DVNet, which combines the advantages of depth-based and volumetric MVS. Our experiments show that depth-based iterative refinement and multi-view matching, combined with volumetric scene modeling, greatly improve both depth-prediction and reconstruction metrics. We believe our 3D scene-modeling network bridges an important gap between depth prediction, image-feature aggregation, and volumetric scene modeling, and has applications beyond depth-residual prediction. In future work, we will explore its use for segmentation, normal estimation, and direct TSDF prediction.

Acknowledgements: Support for this work was provided by ONR grants N00014-19-1-2553 and N00174-19-1-0024, as well as NSF grant 1911230.


  • [1] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85.
  • [2] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. In International Conference on Computer Vision (ICCV).
  • [3] J. Chibane, T. Alldieck, and G. Pons-Moll (2020) Implicit functions in feature space for 3D shape reconstruction and completion. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [4] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3075–3084.
  • [5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [6] A. Düzçeker, S. Galliani, C. Vogel, P. Speciale, M. Dusmanu, and M. Pollefeys (2021) DeepVideoMVS: multi-view stereo on video with recurrent spatio-temporal fusion. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] W. Falcon et al. (2019) PyTorch Lightning. GitHub. https://github.com/PyTorchLightning/pytorch-lightning
  • [8] M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
  • [9] S. Galliani, K. Lasinger, and K. Schindler (2015) Massively parallel multiview stereopsis by surface normal diffusion. In International Conference on Computer Vision (ICCV).
  • [10] A. Handa, T. Whelan, J. McDonald, and A. J. Davison (2014) A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In International Conference on Robotics and Automation (ICRA), Hong Kong, China.
  • [11] Y. Hou, J. Kannala, and A. Solin (2019) Multi-view stereo by temporal nonparametric fusion. In International Conference on Computer Vision (ICCV).
  • [12] S. Im, H. Jeon, S. Lin, and I. S. Kweon (2019) DPSNet: end-to-end deep plane sweep stereo. In International Conference on Learning Representations (ICLR).
  • [13] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs (2014) Large scale multi-view stereopsis evaluation. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 406–413.
  • [14] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • [15] A. Knapitsch, J. Park, Q. Zhou, and V. Koltun (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (TOG) 36 (4).
  • [16] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944.
  • [17] K. Luo, T. Guan, L. Ju, H. Huang, and Y. Luo (2019) P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo. In International Conference on Computer Vision (ICCV).
  • [18] Z. Murez, T. van As, J. Bartolozzi, A. Sinha, V. Badrinarayanan, and A. Rabinovich (2020) Atlas: end-to-end 3D scene reconstruction from posed images. In European Conference on Computer Vision (ECCV).
  • [19] S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger (2020) Convolutional occupancy networks. In European Conference on Computer Vision (ECCV).
  • [20] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241.
  • [21] J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
  • [22] A. Sinha, Z. Murez, J. Bartolozzi, V. Badrinarayanan, and A. Rabinovich (2020) DELTAS: depth estimation by learning triangulation and densification of sparse points. In European Conference on Computer Vision (ECCV).
  • [23] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of RGB-D SLAM systems. In International Conference on Intelligent Robot Systems (IROS).
  • [24] J. Sun, Y. Xie, L. Chen, X. Zhou, and H. Bao (2021) NeuralRecon: real-time coherent 3D reconstruction from monocular video. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [25] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [26] K. Wang and S. Shen (2018) MVDepthNet: real-time multiview depth estimation neural network. In International Conference on 3D Vision (3DV).
  • [27] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG).
  • [28] Z. Yang, Z. Ren, Q. Shan, and Q. Huang (2021) MVS2D: efficient multi-view stereo via attention-driven 2D convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [29] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) MVSNet: depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
  • [30] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent MVSNet for high-resolution multi-view stereo depth inference. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [31] H. Yi, Z. Wei, M. Ding, R. Zhang, Y. Chen, G. Wang, and Y. Tai (2020) Pyramid multi-view stereo net with self-adaptive view aggregation. In European Conference on Computer Vision (ECCV).
  • [32] Z. Yu and S. Gao (2020) Fast-MVSNet: sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [33] Q. Zhou, J. Park, and V. Koltun (2018) Open3D: a modern library for 3D data processing. arXiv:1801.09847.