1 Introduction
†https://github.com/alexrich021/3dvnet
Multi-view stereo (MVS) is a central problem in computer vision with applications from augmented reality to autonomous navigation. In MVS, the goal is to reconstruct a scene using only posed RGB images as input. This reconstruction can take many forms, from voxelized occupancy or truncated signed distance fields (TSDFs), to per-frame depth prediction, the focus of this paper. In recent years, MVS methods based on deep learning
[2, 6, 11, 12, 17, 18, 22, 24, 26, 29, 30, 31, 32] have surpassed traditional MVS methods [9, 21] on numerous benchmark datasets [5, 13, 15]. In this work, we consider these methods as falling into two categories, depth estimation and volumetric reconstruction, each with advantages and disadvantages.
The most recent learning methods in depth estimation use deep features to perform dense multi-view matching robust to large environmental lighting changes and textureless or specular surfaces, among other things. These methods take advantage of well-researched multi-view aggregation techniques and the flexibility of depth as an output modality. They formulate explicit multi-view matching costs and include iterative refinement layers in which a network predicts a small depth offset between an initial prediction and the ground truth depth map
[2, 32]. While these techniques have been successful for depth prediction, most are constrained to making independent, per-frame predictions. This results in predictions that do not agree on the underlying 3D geometry of the scene. Those that do make joint predictions across multiple frames use either regularization constraints [11] or recurrent neural networks (RNNs)
[6] to encourage frames close in pose space to make similar predictions. However, these methods do not directly operate on a unified 3D scene representation, and their resulting reconstructions lack global coherence (see Fig. 1).
Meanwhile, volumetric techniques operate directly on a unified 3D scene representation by backprojecting and aggregating 2D features into a 3D voxel grid and using a 3D convolutional neural network (CNN) to regress a voxelized parameter, often a TSDF. These methods benefit from the use of 3D CNNs and naturally produce highly coherent 3D reconstructions and accurate depth predictions. However, they do not explicitly formulate a multi-view matching cost like depth-based methods, generally averaging deep features from different views to populate the 3D voxel grid. This results in overly smooth output meshes (see Fig. 1).
In this paper, we propose 3DVNet, an end-to-end differentiable method for learned multi-view depth prediction that leverages the advantages of both volumetric scene modeling and depth-based multi-view matching and refinement. The key idea behind our method is the use of a 3D scene-modeling network which outputs a multi-scale volumetric encoding of the scene. This encoding is used with a modified PointFlow algorithm [2] to iteratively update a set of initial coarse depth predictions, resulting in predictions that agree on the underlying scene geometry.
Our 3D network operates on all depth predictions at once and extracts meaningful, scene-level priors similar to volumetric MVS methods. However, it operates on features aggregated using depth-based multi-view matching and can be applied iteratively to update depth maps. In this way, we combine the advantages of the two separate classes of techniques. As a result, 3DVNet exceeds state-of-the-art results on ScanNet [5] in nearly all depth map prediction and 3D reconstruction metrics when compared with the current best depth-based and volumetric baselines. Furthermore, we show our method generalizes to other real and synthetic datasets [10, 23], again exceeding the best results on nearly all metrics. Our contributions are as follows:

We present a 3D scene-modeling network which outputs a volumetric scene encoding, and show its effectiveness for iterative depth residual prediction.

We modify PointFlow [2], an existing method for depth map residual prediction, to use our volumetric scene encoding.

We design 3DVNet, a full MVS pipeline, using our 3D scene-modeling network and PointFlow refinement.
2 Related Works
We cover MVS methods using deep learning, categorizing them as either depth-prediction methods or volumetric methods. Our method falls into the first category, but is very much inspired by volumetric techniques.
Depth-Prediction MVS Methods: With some notable exceptions [22, 28], nearly all depth-prediction methods follow a similar paradigm: (1) they construct a plane-sweep cost volume on a reference image's camera frustum, (2) they fill the volume with deep features using a cost function that operates on source and reference image features, and (3) they use a network to predict depth from this cost volume. Most methods differ in the cost metric used to construct the volume. Many cost metrics exist, including per-channel variance of deep features
[29, 30], learned aggregation using a network [17, 31], concatenation of deep features [12], the dot product of deep features [6], and absolute intensity difference of raw image RGB values [11, 26]. We find per-channel variance [29] to be the most commonly used cost metric, and adopt it in our system. The choice of cost aggregation method results in either a vectorized matching cost and thus a 4D cost volume
[2, 12, 17, 29, 30, 31, 32] or a scalar matching cost and thus a 3D cost volume [6, 11, 26]. Methods with 4D cost volumes generally require 3D networks for processing, while 3D cost volumes can be processed with a 2D U-Net-style [20] encoder-decoder architecture. Some methods operate on the deep features at the bottleneck of this U-Net to make joint depth predictions for all frames or a subset of frames in a given scene. This is similar to our proposed method, and we highlight the differences. GPMVS [11] uses a Gaussian Process (GP) constraint conditioned on pose distance to regularize these deep features. This GP constraint only operates on deep features and assumes Gaussian priors. In contrast, we directly learn priors from predicted depth maps and explicitly predict depth residuals to modify depth maps to match. DVMVS [6] introduces an RNN to propagate information from the deep features frame to frame given an ordered sequence of frames. While they do propagate this information in a geometrically plausible way, the RNN operates only on deep features, similar to GPMVS. Furthermore, the RNN never considers all frames jointly like our method.
Similar to our method, some networks iteratively predict a residual to refine an initial depth prediction [2, 32]. We specifically highlight PointMVSNet [2], which introduces PointFlow, a point cloud learning method for residual prediction. Our method is very much inspired by this work. We briefly describe the differences.
In their work, they operate on a point cloud backprojected from a single depth map and augmented with additional points. Features are extracted from this point cloud using point cloud learning techniques and used in their PointFlow module for residual prediction. Crucially, these features do not come from a unified 3D representation of the scene. Thus the residual prediction is conditioned only on information local to the individual depth prediction and not on global scene information. In contrast, our variation of PointFlow uses our volumetric scene model to condition residual prediction on information from all depth maps. For an in-depth discussion of the differences, see Sec. 3.2.
Volumetric MVS Methods: In volumetric MVS, the goal is to directly regress a global volumetric representation of the scene, generally a TSDF volume. We highlight two methods which inspired our work. Atlas [18]
backprojects rays of deep features extracted from images into a global voxel grid, pools features from multiple views using a running average, then directly regresses a TSDF in a coarse-to-fine fashion using a 3D U-Net. NeuralRecon
[24] improves on the memory consumption and runtime of Atlas by reconstructing local fragments using the most recent keyframes, then fusing the local fragments into a global volume using an RNN. The reconstructions these methods produce are visually pleasing. However, both construct feature volumes using averaging in a single forward pass, which we believe is non-optimal. In contrast, our depth-based method allows us to construct a feature volume using multi-view matching features and perform iterative refinement.
3 Methods
Our method takes as input a set of images with corresponding known extrinsic and intrinsic camera parameters. Our goal is to predict a depth map for each image. As a preprocessing step, we define for every image a set of indices pointing to which images to use as source images for depth prediction, and append the reference index to this set.
Our pipeline is as follows. First, a small depth-prediction network is used to independently predict an initial coarse depth map for every frame using extracted image features (Sec. 3.3). Second, we backproject our initial depth maps to form a joint point cloud (Sec. 3.1). Because each point is associated with one depth map that has associated feature maps, we can augment it with a multi-view matching feature aggregated from those feature maps. Third, our 3D scene-modeling network takes as input this feature-rich point cloud and outputs a multi-scale scene encoding (Sec. 3.1). Fourth, we update each depth map to match this scene encoding using a modified PointFlow algorithm, resulting in highly coherent depth maps and thus highly coherent reconstructions (Sec. 3.2). Steps 2-4 can be run in a nested for-loop, with steps 2 and 3 run in the outer loop to generate updated scene models from the current depth maps and step 4 run in the inner loop to refine depth maps with the current scene model. We index each updated depth map by its outer-loop iteration of scene modeling and its inner-loop iteration of updating. Finally, we upsample the resulting refined depth maps to the size of the original image in a coarse-to-fine manner, guided by deep features and the original image, to arrive at a final prediction for every image (Sec. 3.3).
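The nested control flow described above can be sketched schematically. Everything below is a placeholder: the stage functions stand in for the learned networks, and the shapes and step sizes are invented for illustration, not taken from the paper.

```python
import numpy as np

# Schematic sketch of the 3DVNet nested refinement loop. All stage functions
# are hypothetical stand-ins for the learned networks described in the text.

def predict_coarse_depths(images):
    # Stage 1 stand-in: one coarse depth map per frame (shape is illustrative).
    return [np.full((60, 80), 2.0) for _ in images]

def build_scene_encoding(depths, feats):
    # Stages 2-3 stand-in: back-project, voxelize, run the 3D U-Net.
    return {"scales": 3, "depths_used": len(depths)}

def pointflow_update(depths, encoding, step_cm):
    # Stage 4 stand-in: a per-pixel residual bounded by the step size.
    return [d + 0.01 * step_cm / 100.0 for d in depths]

def refine(images, feats, outer_steps):
    depths = predict_coarse_depths(images)
    for steps in outer_steps:              # outer loop: rebuild scene model
        encoding = build_scene_encoding(depths, feats)
        for s in steps:                    # inner loop: PointFlow updates
            depths = pointflow_update(depths, encoding, s)
    return depths

images = [np.zeros((256, 320, 3)) for _ in range(7)]
depths = refine(images, feats=None, outer_steps=[(8, 4, 2), (4, 2, 1)])
print(len(depths), depths[0].shape)
```

The two tuples in `outer_steps` mirror the two outer-loop iterations of Sec. 3.3, each with three inner PointFlow updates of decreasing step size.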
3.1 3D Scene Modeling
A visualization of our 3D scene modeling method is given in the upper half of Fig. 2. As stated previously, our 3D scene-modeling network operates on a feature-rich point cloud backprojected from the initial or subsequently updated depth maps. To process this point cloud, we adopt a voxelize-then-extract approach. We first generate a sparse 3D grid of voxels, culling voxels that do not contain depth points. To avoid losing granular information of the point cloud, we generate a deep feature for each voxel using a per-voxel PointNet [1]. The PointNet inputs are the features of each depth point in the voxel as well as the 3D offset of that point to the voxel center. Finally, we run a 3D U-Net [20] on the resulting voxelized feature volume and extract intermediate outputs at multiple resolutions. By nature of its construction, this U-Net learns meaningful, scene-level priors. The result is a multi-scale, volumetric scene encoding.
Point Cloud Formation: We form our point cloud by backprojecting all depth pixels in all depth maps. For the multi-view matching feature associated with each point, we follow existing work [2, 29] and use per-channel variance aggregation over the reference and source feature maps associated with each depth pixel. For a point $\mathbf{p}$ belonging to depth map $i$ with view set $S_i$, the variance feature $\mathbf{c}_\mathbf{p}$, applied per-channel, is:

$$\mathbf{c}_\mathbf{p} = \frac{1}{|S_i|} \sum_{j \in S_i} \left( \mathbf{F}_j(\mathbf{p}_j) - \overline{\mathbf{F}}(\mathbf{p}) \right)^2 \quad (1)$$

where $\mathbf{p}_j$ is the projection of $\mathbf{p}$ to feature map $\mathbf{F}_j$, $\mathbf{F}_j(\mathbf{p}_j)$ is the bilinear interpolation of $\mathbf{F}_j$ to point $\mathbf{p}_j$, and $\overline{\mathbf{F}}(\mathbf{p})$ is the average interpolated feature over all indices $j \in S_i$. Intuitively, if $\mathbf{p}$ lies on a surface it is more likely to have low variance in most feature channels, while if it does not lie on a surface the variance will likely be high.
Point Cloud Voxelization: To form our initial feature volume, we regularly sample an initial 3D grid of points at a fixed spacing within the axis-aligned bounding box of the point cloud, and define the voxel associated with each grid point as the cube centered on it. We sparsify this grid by discarding every grid point whose associated voxel contains no depth points. For each remaining voxel, we produce a feature using PointNet [1]
with max pooling. The PointNet feature for each voxel is defined as:
$$\mathbf{f} = \mathrm{MAX}_{\mathbf{p} \in \mathcal{P}_v} \left( h\left( \left[ \mathbf{p} - \mathbf{v},\; \mathbf{c}_\mathbf{p} \right] \right) \right) \quad (2)$$

where $h$ is a learnable multi-layer perceptron (MLP), $[\mathbf{p} - \mathbf{v}, \mathbf{c}_\mathbf{p}]$ indicates concatenation of the 3D offset of point $\mathbf{p}$ from the voxel center $\mathbf{v}$ with the feature channels of the point's matching feature $\mathbf{c}_\mathbf{p}$ (Eq. 1) to form a feature with 3 additional channels, $\mathcal{P}_v$ is the set of depth points inside the voxel, and $\mathrm{MAX}$ is the channel-wise max-pooling operation. The result of this stage is a sparse feature volume with features given by Eq. 2.
Multi-Scale 3D Feature Extraction: In this stage, we use a sparse 3D U-Net to model the underlying scene geometry. We use a basic U-Net architecture with skip connections. Group normalization is used throughout. See the supplementary material for a more detailed description of our architecture. Our sparse U-Net takes the sparse feature volume as input. From intermediate outputs of the U-Net, we extract feature volumes describing the scene at three voxel resolutions, from fine to coarse. In this way, we extract a rich, multi-scale, volumetric encoding of the scene.
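The two feature-construction steps of this section, per-channel variance aggregation (Eq. 1) and max-pooled per-voxel feature extraction (Eq. 2), can be sketched with NumPy. This is not the trained network: the random features stand in for interpolated CNN features, a single random linear layer with ReLU stands in for the learnable MLP, and the voxel size is invented.

```python
import numpy as np

# Sketch of Eq. 1 (per-channel variance over views) and Eq. 2 (per-voxel
# max-pooled PointNet feature). All shapes and values are illustrative.

rng = np.random.default_rng(0)
P, V, C = 500, 5, 32                   # points, views, feature channels
interp = rng.normal(size=(P, V, C))    # F_j interpolated at each projection

# Eq. 1: variance over the V views, computed independently per channel.
var_feat = interp.var(axis=1)          # (P, C)
pts = rng.uniform(0, 2.0, size=(P, 3)) # back-projected 3D points (metres)

# Voxelize: keep only occupied voxels of edge length `edge` (assumed value).
edge = 0.25
vox_idx = np.floor(pts / edge).astype(int)
keys, inverse = np.unique(vox_idx, axis=0, return_inverse=True)
inverse = inverse.ravel()

# Per-point "PointNet" input: offset to voxel center concatenated with the
# variance feature, giving 3 extra channels as in Eq. 2.
centers = (keys + 0.5) * edge
x = np.concatenate([pts - centers[inverse], var_feat], axis=1)  # (P, C+3)
W = rng.normal(size=(C + 3, C))        # stand-in for the learnable MLP h
h = np.maximum(x @ W, 0.0)             # shared MLP with ReLU

# Eq. 2: channel-wise max pool over the points inside each voxel.
vox_feat = np.full((len(keys), C), -np.inf)
np.maximum.at(vox_feat, inverse, h)
print(var_feat.shape, vox_feat.shape)
```

Because only occupied voxels are kept, every row of `vox_feat` receives at least one point, so the `-inf` initialization is always overwritten.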
3.2 PointFlow-Based Refinement
In this stage, we use the multi-scale scene encoding from the previous stage in a variation of the PointFlow algorithm proposed by Chen et al. [2]. The goal is to refine our predicted depth maps to match our scene model by predicting a residual for each depth pixel. We briefly review the core components of PointFlow and the intuition behind our proposed change.
In PointFlow, a set of points called hypothesis points are constructed at regular intervals along a depth ray, centered about the depth prediction associated with the given depth ray. The blue and red points in Fig. 3 illustrate this. Features are generated for the hypothesis points. Then, a network processes these features and outputs a probability score for every point indicating confidence that the given point is at the correct depth. Finally, the expected offset is calculated using these probabilities and added to the original depth prediction. Our key innovation is the use of our multi-scale scene encoding to generate the hypothesis point features.
In the original PointFlow, hypothesis points are constructed for a single depth map, augmented with features using Eq. 1, and aggregated into a point cloud. Note this point cloud is strictly different from ours as (1) it is produced using a single depth map, and (2) it includes hypothesis points. Features are generated for each point using edge convolutions [27] on the k-Nearest-Neighbor (kNN) graph. Crucially, these edge convolutions never operate on a unified 3D scene representation in the original PointFlow. This prevents the offset predictions from using global information, which we believe is critical for depth residual prediction. Furthermore, because of the required kNN search, this formulation cannot scale to process a joint point cloud from an arbitrary number of depth maps, preventing it from learning global information.
Inspired by convolutional occupancy networks [19] and IFNets [3], we instead generate hypothesis features by interpolating each scale of our multiscale scene encoding (see Fig. 3). With this key change, we use powerful scenelevel priors in our offset prediction conditioned on all depth predictions for a given scene. Furthermore, by using the same encoding to update all depth predictions, we encourage global consistency of predictions. We now describe in detail our variation of the PointFlow method (see Figs. 2 and 3), using notation similar to the original paper.
Hypothesis Point Construction: For a given backprojected depth pixel $\mathbf{x}$ from depth map $i$, we generate $2K+1$ point hypotheses $\tilde{\mathbf{x}}_k$:

$$\tilde{\mathbf{x}}_k = \mathbf{x} + k s \mathbf{t}, \quad k = -K, \ldots, K \quad (3)$$

where $\mathbf{t}$ is the normalized reference camera direction of $\mathbf{x}$, and $s$ is the displacement step size.
Feature Generation: We generate a multi-scale feature for each hypothesis point $\tilde{\mathbf{x}}_k$ using trilinear interpolation of our sparse feature volumes to the point, using 0s where features are not defined:

$$\mathbf{f}^{3D}_k = \left[ \psi(\tilde{\mathbf{x}}_k, \mathbf{V}_1),\; \psi(\tilde{\mathbf{x}}_k, \mathbf{V}_2),\; \psi(\tilde{\mathbf{x}}_k, \mathbf{V}_3) \right] \quad (4)$$

where $\psi(\cdot, \mathbf{V})$ denotes trilinear interpolation into sparse feature volume $\mathbf{V}$, and $\mathbf{V}_1$, $\mathbf{V}_2$, $\mathbf{V}_3$ are the three scales of our scene encoding. Next, we generate a variance feature $\mathbf{c}_k$ for each hypothesis point using Eq. 1. The final feature for a hypothesis point is the channel-wise concatenation of these features:

$$\mathbf{f}_k = \left[ \mathbf{f}^{3D}_k,\; \mathbf{c}_k \right] \quad (5)$$
We stack our $2K+1$ point-hypothesis features to form a 2D feature map of shape $(2K+1) \times C$, where $C$ is the sum of the dimensions of our variance and scene encoding features.
Offset Prediction: We apply a 4-layer 1D CNN followed by a softmax function to predict a probability scalar $p_k$ for each point-wise entry. The predicted displacement of point $\mathbf{x}$ is then:

$$\Delta d = \sum_{k=-K}^{K} k s\, p_k \quad (6)$$

The updated depth for the corresponding pixel of depth map $i$ is the depth of the displaced point $\mathbf{x} + \Delta d\, \mathbf{t}$ with respect to the camera associated with depth map $i$.
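A single PointFlow update for one depth pixel can be sketched numerically. This is a toy stand-in: the hand-made logit vector replaces the 1D CNN's scores, and the trilinear lookup into the scene encoding is omitted, so only the hypothesis construction (Eq. 3) and the expected displacement (Eq. 6) are exercised.

```python
import numpy as np

# Toy sketch of one PointFlow update for a single depth pixel: build 2K+1
# hypothesis points along the camera ray (Eq. 3), convert scores to
# probabilities, and take the probability-weighted displacement (Eq. 6).

def pointflow_step(x, t, depth, s, K, logits):
    # Eq. 3: hypothesis points x_k = x + k*s*t for k = -K..K.
    ks = np.arange(-K, K + 1)
    hyps = x[None, :] + ks[:, None] * s * t[None, :]
    # Softmax over hypothesis scores -> probability each point is on-surface.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Eq. 6: expected displacement along the ray, added to the depth.
    delta = float((ks * s * p).sum())
    return depth + delta, hyps

x = np.array([0.0, 0.0, 2.0])            # back-projected point at depth 2 m
t = np.array([0.0, 0.0, 1.0])            # normalized camera ray direction
logits = np.array([0., 0., 0., 4., 0.])  # peak at k=+1: surface is deeper
new_depth, hyps = pointflow_step(x, t, depth=2.0, s=0.04, K=2, logits=logits)
print(round(new_depth, 4), hyps.shape)
```

With the probability mass concentrated at $k=+1$ and step size $s=4$ cm, the depth moves most of the way toward the $+4$ cm hypothesis, as Eq. 6 prescribes.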
3.3 Bringing It All Together: 3DVNet
In this section, we describe our full depth-prediction pipeline using our multi-scale volumetric scene modeling and PointFlow-based refinement, which we name 3DVNet (see Fig. 4). Our pipeline consists of (1) initial feature extraction and depth prediction, (2) scene modeling and refinement, and (3) upsampling of our refined depth maps to the size of the original images. The scene modeling and refinement is done in a nested for-loop fashion, extracting a scene model in the outer loop and iteratively refining the depth predictions using that scene model in the inner loop. We fix the input images of 3DVNet to a constant resolution.
2D Feature Extraction: For our 2D feature extraction, we adopt the approach of Düzçeker et al. [6], and use a 32-channel feature pyramid network (FPN) [16] constructed on a MnasNet [25] backbone to extract coarse- and fine-resolution feature maps for every image.
MVSNet Prediction: For the coarse depth prediction of each image, we use a small MVSNet [29] that takes the reference and source coarse feature maps and predicts an initial coarse depth map. Our cost volume is constructed using traditional plane-sweep stereo with depth hypotheses sampled uniformly at intervals of 5 cm starting at 50 cm. Similar to Yu and Gao [32], our predicted depth map is spatially sparse compared to the feature map, and we fix the coarse depth map prediction to a low resolution.
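The plane-sweep sampling above (hypotheses every 5 cm starting at 50 cm) can be written directly; the hypothesis count here is an assumption, since the paper does not state it in this excerpt.

```python
import numpy as np

# Depth hypotheses for the plane-sweep cost volume: uniform 5 cm spacing
# starting at 50 cm. The number of hypotheses (48) is illustrative only.
num_hyp = 48
depth_hypotheses = 0.50 + 0.05 * np.arange(num_hyp)   # metres
print(depth_hypotheses[0], depth_hypotheses[-1])
```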
Nested For-Loop Refinement: We index the updated depths by scene-modeling iteration and PointFlow iteration. We use the initial depth predictions and coarse feature maps to generate the first multi-scale scene encoding. We then run PointFlow refinement three times with progressively smaller displacement step sizes to obtain updated depths; in early experiments, we found two iterations at the finest step size to be helpful. We regenerate our scene encoding using the updated depths and coarse feature maps, then run PointFlow three more times, again with progressively smaller step sizes. We find our depth maps converge at this point.
Coarse-to-Fine Upsampling: In this stage, we upsample each refined depth prediction to the size of the original image. We find PointFlow refinement does not remove interpolation artifacts, as this generally requires predicting large offsets across depth boundaries. We outline a simple, coarse-to-fine method for upsampling while removing artifacts (see the right section of Fig. 4). At each step, we upsample the current depth prediction using nearest-neighbor interpolation to the size of the next-largest feature map and concatenate the two, using the original image in the final step. We then pass the concatenated feature map and depth through a smoothing network. We use a version of the propagation network proposed by Yu and Gao [32]. For every pixel $\mathbf{q}$ in depth map $D$, the smoothed depth $\hat{D}(\mathbf{q})$ is a weighted sum of $D$ over the $3 \times 3$ neighborhood $\mathcal{N}(\mathbf{q})$ about $\mathbf{q}$:

$$\hat{D}(\mathbf{q}) = \sum_{\mathbf{r} \in \mathcal{N}(\mathbf{q})} w_{\mathbf{q}, \mathbf{r}}\, D(\mathbf{r}) \quad (7)$$

where the weights $w_{\mathbf{q}, \mathbf{r}}$ are produced by a 4-layer CNN that takes as input the concatenated feature and depth map and outputs 9 weights for every pixel. A softmax function is applied to the weights for normalization. We apply this coarse-to-fine upsampling to every refined depth map to arrive at a final depth prediction for every input image.
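The Eq. 7 propagation step can be sketched as follows. The 4-layer weight-predicting CNN is replaced here by fixed zero logits (giving uniform weights, i.e. a box filter), so this shows only the weighted-sum mechanics, not the learned behavior.

```python
import numpy as np

# Sketch of the Eq. 7 smoothing step: each output depth is a
# softmax-weighted sum of the input depth over a 3x3 neighborhood.
# A real implementation predicts the 9 logits per pixel with a CNN.

def propagate(depth, logits):
    H, W = depth.shape
    pad = np.pad(depth, 1, mode="edge")
    # Gather the 9 neighbors of every pixel: shape (H, W, 9).
    nbrs = np.stack([pad[dy:dy + H, dx:dx + W]
                     for dy in range(3) for dx in range(3)], axis=-1)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax-normalized weights
    return (nbrs * w).sum(axis=-1)

depth = np.full((8, 8), 1.0)
depth[4, 4] = 5.0                               # an interpolation artifact
logits = np.zeros((8, 8, 9))                    # uniform weights: box filter
out = propagate(depth, logits)
print(round(float(out[4, 4]), 3))
```

With uniform weights the spike at (4, 4) is averaged with its eight neighbors; a trained network would instead learn edge-aware weights that suppress such artifacts without blurring true depth boundaries.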
4 Experiments


PMVS  PMVS (FT)  FMVS  FMVS (FT)  DVMVS pair  DVMVS fusion  GPMVS  GPMVS (FT)  Atlas  NeuralRecon  Ours

ScanNet
Abs-rel  0.389  0.085  0.274  0.084  0.069  0.061  0.121  0.062  0.062  0.063  0.040
Abs-diff  0.668  0.168  0.444  0.165  0.142  0.127  0.214  0.124  0.116  0.099  0.079
Abs-inv  0.148  0.048  0.145  0.050  0.044  0.038  0.066  0.039  0.044  0.039  0.026
Sq-rel  0.798  0.046  0.463  0.045  0.026  0.021  0.860  0.022  0.040  0.039  0.015
RMSE  1.051  0.267  0.776  0.267  0.220  0.200  0.339  0.199  0.238  0.206  0.154
δ < 1.25  0.630  0.922  0.732  0.922  0.949  0.963  0.890  0.960  0.935  0.948  0.975
δ < 1.25²  0.768  0.981  0.857  0.979  0.989  0.992  0.971  0.992  0.971  0.976  0.992
δ < 1.25³  0.859  0.994  0.915  0.993  0.997  0.997  0.990  0.998  0.985  0.989  0.997
Acc  0.093  0.039  0.059  0.043  0.059  0.067  0.077  0.057  0.078  0.058  0.051
Comp  0.303  0.256  0.184  0.212  0.145  0.128  0.150  0.111  0.097  0.108  0.075
Prec  0.651  0.738  0.570  0.707  0.595  0.557  0.486  0.604  0.607  0.636  0.715
Rec  0.317  0.433  0.486  0.454  0.489  0.504  0.453  0.565  0.546  0.509  0.625
F-score  0.409  0.529  0.511  0.541  0.524  0.520  0.459  0.574  0.573  0.564  0.665

TUM-RGBD
Abs-rel  0.318  0.111  0.273  0.113  0.117  0.095  0.102  0.093  0.163  0.106  0.076
Abs-diff  0.642  0.275  0.573  0.281  0.339  0.273  0.243  0.239  0.404  0.167  0.210
δ < 1.25  0.662  0.858  0.694  0.851  0.838  0.886  0.874  0.891  0.816  0.912  0.912
F-score  0.115  0.145  0.150  0.154  0.141  0.162  0.157  0.170  0.129  0.117  0.181

ICL-NUIM
Abs-rel  0.614  0.107  0.303  0.095  0.106  0.114  0.107  0.066  0.110  0.123  0.050
Abs-diff  1.469  0.262  0.707  0.245  0.278  0.322  0.290  0.176  0.332  0.303  0.120
δ < 1.25  0.311  0.877  0.659  0.894  0.878  0.847  0.855  0.965  0.833  0.709  0.980
F-score  0.064  0.144  0.382  0.246  0.173  0.150  0.241  0.323  0.194  0.055  0.440
4.1 Implementation and Training Details
Libraries: Our model is implemented in PyTorch using PyTorch Lightning [7] and PyTorch Geometric [8]. We use Minkowski Engine [4] as our sparse tensor library. We use Open3D [33] for both visualization and evaluation.
Training Parameters:
We train our network on a single NVIDIA RTX 3090 GPU. Our network is trained end-to-end with a minibatch size of 2. Each minibatch consists of 7 images for depth prediction. For our loss function, we accumulate the average error between the ground truth and predicted depth maps, appropriately downsampling the ground truth depth map to the correct resolution, for all predicted, refined, and upsampled depth maps at every stage of our pipeline. Additionally, we employ random geometric scale augmentation and random rotation about the gravitational axis. We first train with the pretrained MnasNet backbone frozen, using the Adam optimizer [14] with an initial learning rate that is divided by 10 every 100 epochs (1.5k iterations), to convergence (1.8k iterations). We then unfreeze the MnasNet backbone and finetune the entire network using Adam with an initial learning rate that is halved every 50 epochs, to convergence (1.8k iterations).
4.2 Datasets, Baselines, Metrics, and Protocols
Datasets: To train and validate our model, we use the ScanNet [5] official training and validation splits. For our main comparison experiment, we use the ScanNet official test set, which consists of 100 test scenes in a variety of indoor settings. To evaluate the generalization ability of our model, we select 10 sequences from TUM-RGBD [23] and 4 sequences from ICL-NUIM [10] for comparison.
Baselines: We compare our method to seven state-of-the-art baselines: PointMVSNet (PMVS) [2], FastMVSNet (FMVS) [32], the DeepVideoMVS pair and fusion networks (DVMVS pair/fusion) [6], GPMVS batched [11], Atlas [18], and NeuralRecon [24]. The first five baselines are depth-prediction methods while the last two are volumetric methods. Of these, we consider GPMVS and Atlas the most relevant depth-based and volumetric methods respectively, as both use information from all frames simultaneously during inference. We use the ScanNet training scenes to finetune methods not trained on ScanNet [2, 11, 32]. We report both the finetuned and pretrained results, denoted with and without "FT". To account for depth-range differences between the DTU dataset [13] and ScanNet, we use our model's plane-sweep parameters with PMVS and FMVS.
Metrics: We use the 2D and 3D metrics presented by Murez et al. [18] for evaluation; see the supplementary material for definitions. Amongst these metrics, we consider Abs-rel, Abs-diff, and the first inlier ratio metric (δ < 1.25) the most suitable 2D metrics for measuring depth prediction quality, and F-score the most suitable 3D metric for measuring 3D reconstruction quality. Following Düzçeker et al. [6], we only consider ground truth depth values greater than 50 cm to account for some methods not being able to predict smaller depths. We note that F-score, Precision, and Recall are calculated per-scene and then averaged across all scenes. This results in a different F-score than one calculated from the averaged Precision and Recall we report.
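The averaging caveat above is easy to demonstrate: the mean of per-scene F-scores generally differs from the F-score computed from mean precision and recall, because the harmonic mean does not commute with averaging. The numbers below are made up for illustration.

```python
import numpy as np

# Demonstration that mean(F(p_i, r_i)) != F(mean(p), mean(r)) in general.

def fscore(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

prec = np.array([0.9, 0.5])      # per-scene precision (illustrative)
rec  = np.array([0.3, 0.8])      # per-scene recall (illustrative)

per_scene = np.mean([fscore(p, r) for p, r in zip(prec, rec)])
from_means = fscore(prec.mean(), rec.mean())
print(round(float(per_scene), 3), round(float(from_means), 3))
```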
Protocols: For depth-based methods, we fuse predicted depths using standard multi-view-consistency-based point cloud fusion. Based on results on validation sets, we modify the implementation of Galliani et al. [9] to use a depth-based multi-view consistency check rather than a disparity-based check (see Sec. 3.3 of the supplementary material). For volumetric methods, we use marching cubes to extract a mesh from the predicted TSDF. Following Murez et al. [18], we trim the meshes to remove geometry not observed in the ground truth camera frustums. Additionally, ScanNet ground truth meshes often contain holes in observed regions. We mask out these holes for all baselines to avoid false penalization. All meshes are single layer to match the ScanNet ground truth, as noted by Sun et al. [24].
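A depth-based consistency check of the kind described above can be sketched as follows. This is an assumed, minimal form, not the authors' exact implementation: a depth pixel is kept if, in at least a minimum number of source views, the source view's own depth agrees with the reprojected reference depth to within a relative threshold.

```python
import numpy as np

# Minimal sketch (assumed, not the paper's exact code) of a depth-based
# multi-view consistency check used during point cloud fusion. The threshold
# tau and the minimum view count are illustrative parameters.

def consistent(reproj_depths, src_depths, tau=0.01, min_views=2):
    # reproj_depths[j]: depth of the re-projected reference point in view j.
    # src_depths[j]:    view j's own depth prediction at that pixel.
    rel_err = np.abs(reproj_depths - src_depths) / reproj_depths
    return int((rel_err < tau).sum()) >= min_views

reproj = np.array([2.00, 1.51, 3.02, 2.40])
src    = np.array([2.01, 1.50, 2.70, 2.41])   # third view disagrees
print(consistent(reproj, src))
```

A disparity-based check would instead threshold differences of inverse depth, which penalizes near and far geometry unevenly; the relative depth check above treats all ranges uniformly, which is presumably why it validated better here.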
We use the DVMVS keyframe selection. For depth-based methods, we use each keyframe as a reference image for depth prediction, with the 2 previous and 2 next keyframes as source images (4 source images total). For depth-based methods, we resize the output depth map to the evaluation resolution using nearest-neighbor interpolation. For volumetric methods, we use the predicted mesh to render depth maps for each keyframe.
4.3 Evaluation Results and Discussion
See Tab. 1 for 2D depth and 3D geometry metrics on all datasets. Our method outperforms all baselines by a wide margin on most metrics. Notably, our Abs-rel error on ScanNet is 0.021 less than that of DVMVS fusion, the second-best method, while the Abs-rel of the third-, fourth-, and fifth-best methods are all within 0.002 of DVMVS fusion. Similarly, our ScanNet F-score is 0.09 more than that of GPMVS (FT), the second-best method, while the F-score of Atlas, the third-best method, is within 0.001 of GPMVS (FT). This demonstrates the significant quantitative gains of our method in both depth and reconstruction metrics. Results on the additional datasets show the strong generalization ability of our model.
We include qualitative results on ScanNet in Figs. 5 and 6; see Sec. 4 of the supplementary material for additional qualitative results. Our depth maps are visually pleasing, with clearly defined edges. They are comparable in quality to those of GPMVS and DVMVS fusion while being quantitatively more accurate. Our reconstructions are coherent like those of volumetric methods, without the noise present in other depth-based reconstructions, which we believe is a result of our volumetric scene encoding and refinement.
We do note one benefit of Atlas is its ability to fill large unobserved holes. Though not reflected in the metrics, this leads to qualitatively more complete reconstructions. Our system relies on depth maps and thus cannot do this as designed. However, as a result of averaging image features, Atlas produces meshes that are overly smooth and lack detail. In contrast, our reconstructions contain sharper, better-defined features than those of purely volumetric methods. Finally, we note our system cannot naturally be run in an online fashion, as it requires all frames to be available prior to use.



Scene-model iters  PointFlow iters  Abs-rel  Abs-diff  δ < 1.25  F-score
0  0  0.070  0.137  0.949  0.559
1  1  0.050  0.100  0.965  0.651
1  2  0.044  0.088  0.971  0.661
1  3  0.043  0.086  0.972  0.664
2  1  0.041  0.081  0.974  0.668
2  2  0.040  0.079  0.975  0.667
2  3  0.040  0.079  0.975  0.665



Model  Abs-rel  Abs-diff  δ < 1.25  F-score
no 3D  0.067  0.134  0.952  0.551
single scale  0.041  0.080  0.973  0.662
avg feats  0.043  0.082  0.975  0.656
full  0.040  0.079  0.975  0.665
4.4 Ablation and Additional Studies
Does Iterative Refinement Help? We study the effect of each inner- and outer-loop iteration of our depth refinement (see Tab. 2). We exceed state-of-the-art metrics after 2 iterations, and 3 additional iterations bring continued improvement, confirming the effectiveness of iterative refinement. By 5 iterations, our metrics have converged, with the depth metrics stabilizing and the F-score decreasing slightly; interestingly, the final iteration appears slightly detrimental.
Does Multi-Scale Scene Modeling Help? To test this, we (1) completely remove our multi-scale scene encoding from the PointFlow refinement, and (2) use only the coarsest scale, denoted "no 3D" and "single scale" respectively in Tab. 3. Without any scene-level information, our refinement breaks down, indicating the scene modeling is essential. The single-scale model does slightly worse than the full model, confirming the effectiveness of our multi-scale encoding.
Do Multi-View Matching Features Help? We use a per-channel average instead of variance aggregation for each point in our feature-rich point cloud, denoted "avg feats" in Tab. 3. Most metrics, notably the F-score, suffer. This supports our hypothesis that multi-view matching is more beneficial for reconstruction than averaging.
For additional studies, see the supplementary material.
5 Conclusion
We present 3DVNet, which uses the advantages of both depth-based and volumetric MVS. Our experiments with 3DVNet show that depth-based iterative refinement and multi-view matching combined with volumetric scene modeling greatly improve both depth-prediction and reconstruction metrics. We believe our 3D scene-modeling network bridges an important gap between depth prediction, image feature aggregation, and volumetric scene modeling, and has applications beyond depth-residual prediction. In future work, we will explore its use for segmentation, normal estimation, and direct TSDF prediction.
Acknowledgements: Support for this work was provided by ONR grants N000141912553 and N001741910024, as well as NSF grant 1911230.
References

[1]
(2017)
PointNet: deep learning on point sets for 3D classification and segmentation.
In
Conference on Computer Vision and Pattern Recognition (CVPR)
, pp. 77–85. External Links: Document Cited by: §3.1, §3.1.  [2] (201910) Pointbased multiview stereo network. In International Conference on Computer Vision (ICCV), Cited by: item 2, §1, §1, §1, Figure 2, §2, §2, §3.1, §3.2, §4.2.
 [3] (2020) Implicit functions in feature space for 3D shape reconstruction and completion. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.
 [4] (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3075–3084. Cited by: §4.1.
 [5] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §4.2.
 [6] (2021) DeepVideoMVS: multi-view stereo on video with recurrent spatio-temporal fusion. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2, §2, §2, §3.3, §4.2, §4.2.
 [7] (2019) PyTorch Lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorchlightning Cited by: §4.1.
 [8] (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §4.1.
 [9] (2015) Massively parallel multiview stereopsis by surface normal diffusion. In International Conference on Computer Vision (ICCV), Cited by: §1, §4.2.
 [10] (2014) A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In International Conference on Robotics and Automation (ICRA), Hong Kong, China. Cited by: §1, §4.2.
 [11] (2019) Multi-view stereo by temporal nonparametric fusion. In International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2, §2, §2, §4.2.
 [12] (2019) DPSNet: end-to-end deep plane sweep stereo. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, §2.
 [13] (2014) Large scale multi-view stereopsis evaluation. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 406–413. Cited by: §1, §4.2.
 [14] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
 [15] (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (TOG) 36 (4). Cited by: §1.
 [16] (2017) Feature pyramid networks for object detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944. Cited by: §3.3.
 [17] (2019) P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo. In International Conference on Computer Vision (ICCV), Cited by: §1, §2, §2.
 [18] (2020) Atlas: endtoend 3D scene reconstruction from posed images. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.2, §4.2, §4.2.
 [19] (2020) Convolutional occupancy networks. In European Conference on Computer Vision (ECCV), Cited by: §3.2.
 [20] (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: (available on arXiv:1505.04597 [cs.CV]) Cited by: §2, §3.1.
 [21] (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), Cited by: §1.
 [22] (2020) DELTAS: depth estimation by learning triangulation and densification of sparse points. In European Conference on Computer Vision (ECCV), Cited by: §1, §2.
 [23] (2012) A benchmark for the evaluation of RGB-D SLAM systems. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §1, §4.2.
 [24] (2021) NeuralRecon: real-time coherent 3D reconstruction from monocular video. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.2, §4.2.
 [25] (2019) MnasNet: platformaware neural architecture search for mobile. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
 [26] (2018) MVDepthNet: real-time multiview depth estimation neural network. In International Conference on 3D Vision (3DV), Cited by: §1, §2, §2.
 [27] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG). Cited by: §3.2.
 [28] (2021) MVS2D: efficient multi-view stereo via attention-driven 2D convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [29] (2018) MVSNet: depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), Cited by: §1, §2, §2, §3.1, §3.3.
 [30] (2019) Recurrent MVSNet for high-resolution multi-view stereo depth inference. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2.
 [31] (2020) Pyramid multi-view stereo net with self-adaptive view aggregation. In European Conference on Computer Vision (ECCV), Cited by: §1, §2, §2.
 [32] (2020) Fast-MVSNet: sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2, §2, §3.3, §3.3, §4.2.
 [33] (2018) Open3D: A modern library for 3D data processing. arXiv:1801.09847. Cited by: §4.1.