1 Introduction
Multiview stereopsis (MVS) aims to recover a dense 3D model from a set of 2D images with known camera parameters. As the observations become sparser, more 3D information about the imaged scene is lost during sensing, which makes downstream perception tasks such as MVS more challenging. Dense multiview sensing has attracted tremendous attention in light field imaging and rendering. Its advantages, such as robustness to occlusion [yucer2016depth][wu2018light] and image noise reduction [bishop2009light][chung2019computational], have been well studied. Unfortunately, it is impractical to densely sample a scene for high-resolution 3D reconstruction, especially for large-scale scenes. In contrast, sparser sensing with a wide baseline is more practical and more cost-efficient; however, it makes the MVS problem harder, since larger baseline angles make dense correspondence matching difficult.
We propose a sparse-MVS leaderboard and call for the community’s attention to the general sparse MVS problem with a large range of baseline angles that could be up to . Although several approaches recover a 3D model from a single view, they are biased towards specific objects or scenes and generalize poorly. For instance, some works focus on improving depth map generation with the aid of semantic embeddings [chen2016single][godard2017unsupervised][liu2015deep] or object-level shape priors [huang2018deep][mescheder2019occupancy][kar2017learning]. Other methods [schoenberger2016mvs][paschalidou2018raynet][yao2019recurrent][yao2018mvsnet][chen2019point][gu2019cascade], classified as depth map fusion algorithms, estimate a depth map for each camera view and fuse the maps into a 3D model. Unfortunately, in the sparse MVS setting with a large baseline angle, e.g. larger than, these algorithms produce incomplete models, because the large baseline angle significantly skews the matching patches from different views and weakens the photo-consistency check. Additionally, as the baseline angle gets larger, 2D regularization on the depth maps becomes less helpful for obtaining a complete and smooth 3D surface. Because each 2D observation is formed by uneven samples of the 3D surface, the depth predictions from two views with a large baseline angle can hardly reach photo-consistency agreement, as shown in Fig. 1 and Fig. 2(a).
Instead of fusing multiple pieces of 2D information into 3D, SurfaceNet [ji2017surfacenet] was the first to optimize the 3D geometry in an end-to-end manner by directly learning the volume-wise geometric context from 3D unprojected color volumes. Although directly applying 3D regularization may avoid the aforementioned shortcomings of depth map fusion methods, it still suffers from distinct disadvantages, such as noisy surfaces and large holes around regions with repeating patterns and complex geometry. The main reason is that the volume-wise predictions are performed independently, without a global geometric prior. Consequently, around regions with repeating patterns, SurfaceNet returns periodic floating surface fragments around the ground-truth surface. Additionally, such noisy predictions further interfere with the view selection and lead to large holes, as shown in Fig. 1.
In this paper, we present an end-to-end learning framework, SurfaceNet+, that tackles the very sparse MVS problem. As the sensing sparsity increases, fewer photo-consistent views are available and the view selection scheme becomes more critical. Therefore, to adapt to a large range of sparsity, the core innovation is a trainable occlusion-aware view selection scheme that takes the geometric prior into account via a coarse-to-fine scheme. Such a volume-wise view selection strategy can significantly boost the performance of learning-based volumetric MVS methods. More specifically, as shown in Fig. 2, it starts from a very coarse 3D surface prediction using all the view candidates, and subsequently refines the recovered geometry by gradually discarding the occluded views based on the coarser-level geometric prediction. Unlike traditional image-wise [barnes2009patchmatch][pang2014self][zheng2015motion] or pixel-wise view selection [schoenberger2016mvs], which cannot filter out the less relevant visible views for the final 3D model fusion, the proposed occlusion-aware volume-wise view selection can identify the most valuable view pairs for each 3D subvolume, and the ranking weights are end-to-end trainable. Consequently, only a small proportion of view pairs is needed for volume-wise surface prediction, with little performance reduction. This can dramatically reduce the computational complexity by removing the redundancy of the multiview sampling. Benefiting from the coarse-to-fine fashion, SurfaceNet+ makes volume-wise occlusion detection more feasible and leads to a high-recall 3D model.
The proposed sparse-MVS leaderboard is built on the large-scale DTU dataset [aanaes2016large] and the Tanks and Temples dataset [Knapitsch2017] with sparsely sampled camera views. The sparse-MVS setting selects one view from every consecutive camera index, i.e., , where is termed Sparsity and is positively correlated with the baseline angle. The poor performance of state-of-the-art MVS algorithms on the proposed leaderboard demonstrates the need for further effort and attention from the community on achieving MVS with various degrees of sparsity. Additionally, the extensive comparison demonstrates the substantial performance improvement of SurfaceNet+ over existing methods in terms of precision, recall, and efficiency. As illustrated in Fig. 1, under a very sparse camera setting, SurfaceNet+ predicts a much more complete 3D model than recent methods, especially around the border region viewed by fewer cameras. In summary, the technical contributions of this work are twofold.

Considering the practical necessity of very sparse MVS and the poor performance of existing MVS methods, we propose a sparse MVS evaluation benchmark and call for the community’s attention to the general sparse MVS problem with a broad range of baseline angles.

To tackle the sparse MVS problem, we propose a novel trainable occlusion-aware view selection scheme, which is a volume-wise strategy and can significantly boost the performance of the volumetric MVS learning framework. The benchmark and the implementation are publicly available at https://github.com/mjiUST/SurfaceNetplus.
2 Related Work
2.1 Multiview Stereopsis Reconstruction
Works in the multiview stereopsis (MVS) field can be roughly categorised into 1) direct point cloud reconstruction, 2) depth map fusion algorithms, and 3) volumetric methods. Point-cloud-based methods operate directly on 3D points, usually relying on a propagation strategy to gradually densify the reconstruction [1388267][5226635]. As the propagation of point clouds proceeds sequentially, these methods are difficult to fully parallelize and usually take a long time to process. Depth map fusion algorithms [tola2012efficient][campbell2008using][galliani2015massively] decouple the complex MVS problem into relatively small problems of per-view depth map estimation, each of which focuses on only one reference image and a few source images at a time, and then fuse the depth maps into a point cloud [MerrellAWMFYNP07]. Yet they produce incomplete fused models with large baseline angles or occluded views, since skewed patches and uneven samples on the 3D surface in these cases lead to poor photo-consistency agreement.
Volumetric methods divide the 3D space into regular grids and handle the problem in a global coordinate system. They use either implicit representations [Zach2008][lempitsky2007global][curless1996volumetric][riegler2017octnetfusion] or explicit surface properties [spacecarving99][Seitz1999][lsmKarHM2017][ji2017surfacenet][jancosek2011multi][galliani2015massively] to represent and optimize the surface in a global framework. These methods are easy to parallelize for multiview processing, using a regularization function [lempitsky2007global][Zach2008] to minimize the error over all points by gradient descent. Though they are more robust to data noise and outliers, the downside of this representation is its high memory consumption, which leads to space discretization error, so they are only applicable to synthetic data with low-resolution inputs [lsmKarHM2017]. To overcome this restriction to small-scale reconstruction, these methods either apply a divide-and-conquer strategy [ji2017surfacenet] or allow a hierarchical multiscale structure. [riegler2017octnetfusion][tatarchenko2017octree] use an octree representation network to represent both the structure of the octree and the probability of each cell and reconstruct the scene in a coarse-to-fine manner, so that time and space complexities are proportional to the size of the reconstructed model. To perceive more geometric details with limited memory, [kazhdan2006poisson][hane2017hierarchical] adopt a hierarchical adaptive multiscale algorithm and further facilitate the prediction of high-resolution surfaces. Compared with the mentioned volumetric methods, the proposed SurfaceNet+ shares the idea of the divide-and-conquer strategy but infers the 3D surface in a coarse-to-fine fashion with a dynamic view selection strategy.
2.2 Learning-based MVS
Many learning-based MVS methods have also been developed in recent years. 2D convolutional neural networks (2D-CNNs) [matchnet_cvpr_15][Seki2017CVPR][knoebelreiter_cvpr2017] are applied for better patch representation and matching, and others such as [7780960] learn the normals of a given depth map to improve the depth map fusion. Yet these methods focus on improving individual steps in the pipeline, and their performance is limited in challenging scenes due to the lack of contextual geometric knowledge. The main advance in this area is the 3D cost volume regularization proposed by [Kendall2017EndtoEndLO][yao2018mvsnet][lsmKarHM2017]. This method deploys a 3D volume in the scene space or in the reference camera space. Then, an inverse projection procedure fills the 3D volume with 2D image features obtained from different camera positions. Other similar approaches, such as the colored voxel cube [ji2017surfacenet] and recurrent regularization [yao2019recurrent], also use unprojected volumes to obtain 3D information from 2D image features. The key advantage of processing a 3D volume instead of 2D features is that the camera pose and image information can be implicitly written into the 3D volume, and the 3D geometry of the scene can be predicted explicitly by 3D convolutional layers. Additionally, during the convolution process, the network performs the equivalent of patch matching in a highly parallel way, regardless of image distortion and varying lighting conditions. Our approach is most closely related to SurfaceNet [ji2017surfacenet], which encodes camera geometries in the network as multiple unprojected volumes to infer the surface prediction in the global coordinate system.
2.3 Depth Map Fusion Methods
Depth map fusion algorithms first recover depth maps [xu2013novel] from view pairs by matching similar patches [barnes2009patchmatch][pang2014self][zheng2015motion] along the epipolar line and then fuse the depth maps to obtain a 3D reconstruction of the object [tola2012efficient][campbell2008using][galliani2015massively]. [tola2012efficient] is designed for ultra high-resolution image sets and uses a robust descriptor for efficient matching. [furu2010accurate] describes a patch model that consists of a quasi-dense set of rectangular patches covering the surface. By aggregating image similarity across multiple views, [galliani2015massively] obtains more accurate depth maps. However, for views with a large baseline angle the photo-consistency check becomes problematic because of the significantly skewed patches from different view angles. Therefore, these methods produce incomplete models in sparse-MVS.
After obtaining multiple depth maps, the depth map fusion algorithm integrates them into a unified and augmented scene representation while mitigating any inconsistencies among the individual estimates. To improve fusion accuracy, [campbell2008using] learns several sources of depth map outliers. [Jancosek2011] fuses multiple depth estimates into a surface by evaluating visibility in 3D space, and also attempts to reconstruct regions that are not directly supported by depth measurements. [Goesele2007] explicitly targets reconstruction from crowd-sourced images. [Zach2008] proposes a variational depth map formulation that enables parallelized computation on the GPU. COLMAP [schoenberger2016mvs] directly maximizes the estimated surface support in the depth maps and allows dataset-wide, pixel-wise sampling for view selection. However, as the observations become sparser, 2D depth fusion regularization is less helpful for a complete 3D model, because each 2D view is formed by uneven samples on the 3D surface and the sparse MVS scenario can hardly lead to photo-consistency agreement of the 3D surface predictions from multiple views.
Compared with the heuristic pixel-wise and image-wise view selection methods that manually filter out the occluded views, the proposed volume-wise view selection method is end-to-end trainable from both geometric and photometric priors for each subvolume.
2.4 Review SurfaceNet
SurfaceNet [ji2017surfacenet] first proposed an end-to-end learning framework for MVS that automatically learns both the photo-consistency and the geometric relations of the surface structure. Given two images (,) and the corresponding camera views (,), SurfaceNet reconstructs the 3D surface in each subvolume by estimating for each voxel whether it is on the surface or not.
Firstly, each image of and is unprojected into by colorizing the voxels on a traced pixel ray with the same pixel color, so that the new representation (,) encodes the camera parameters implicitly. The appealing property of the unprojected subvolume is that it is view-invariant, because the subvolume lives in the global coordinate system rather than in a camera-relative one, unlike the view-variant sweep planes widely used by depth-fusion methods [chen2019point][yao2019recurrent]. Therefore, it does not lead to the uneven sampling effect.
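The unprojection step can be illustrated with a minimal NumPy sketch. It is not the SurfaceNet implementation: the function name `unproject_colors`, the 3x4 projection-matrix convention, and the nearest-pixel lookup are assumptions made for illustration. Each voxel center is projected into one image and receives the color of the pixel it lands on, so all voxels along one pixel ray share that pixel's color.

```python
import numpy as np

def unproject_colors(image, P, origin, voxel_size, n):
    """Colorize an n*n*n subvolume from a single image (illustrative sketch).

    image:      (H, W, 3) array of RGB values.
    P:          (3, 4) projection matrix; assumed convention is that a
                homogeneous world point X maps to pixel (x/z, y/z) of P @ X.
    origin:     world coordinates of the subvolume's minimum corner.
    voxel_size: edge length of one voxel in world units.
    """
    h, w, _ = image.shape
    idx = np.stack(np.meshgrid(*[np.arange(n)] * 3, indexing="ij"), axis=-1)
    centers = np.asarray(origin, dtype=float) + (idx + 0.5) * voxel_size
    homo = np.concatenate([centers, np.ones((n, n, n, 1))], axis=-1)
    proj = homo @ np.asarray(P, dtype=float).T          # (n, n, n, 3)

    depth = proj[..., 2]
    valid = depth > 1e-9                                # in front of the camera
    safe_depth = np.where(valid, depth, 1.0)            # avoid division by zero
    u = np.rint(proj[..., 0] / safe_depth).astype(int)  # pixel column
    v = np.rint(proj[..., 1] / safe_depth).astype(int)  # pixel row
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Voxels that project outside the image (or behind the camera) stay black.
    volume = np.zeros((n, n, n, 3), dtype=image.dtype)
    volume[valid] = image[v[valid], u[valid]]
    return volume
```

Stacking the volumes of two views along the channel axis would give a 6-channel input of the kind described for the network; the volume is indexed in world axes, which is what makes the representation independent of the viewing direction.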
Then, a pair of colored voxel cubes (,) is fed into SurfaceNet, a fully 3D convolutional neural network, to predict for each voxel the confidence , which indicates whether the voxel is on the surface or not; the network is trained with a cross-entropy loss. Due to the fully convolutional design, the subvolume size for inference can differ from that used for training and can be adapted to various graphics memory sizes.
Lastly, to generalize to the case of multiple views , it selects only a subset of view pairs () to predict , i.e., the confidence that a voxel is on the surface, and then combines the predictions by taking their weighted average based on the relative weight for each view pair
(1) 
which is inferred by function with the inputs of the patch embeddings and the baseline angle , i.e., the angle between the projection rays from the center of to the optical centers of and . In this way, the volume-wise reconstruction becomes computationally feasible by ignoring the majority of possible view pairs.
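The baseline angle described above is the angle at the subvolume center between the two rays pointing to the optical centers of a view pair. A minimal sketch, with the hypothetical function name `baseline_angle_deg` and all points given in world coordinates:

```python
import numpy as np

def baseline_angle_deg(subvolume_center, cam_center_a, cam_center_b):
    """Angle (degrees) between the rays from a subvolume center to two cameras."""
    ray_a = np.asarray(cam_center_a, dtype=float) - np.asarray(subvolume_center, dtype=float)
    ray_b = np.asarray(cam_center_b, dtype=float) - np.asarray(subvolume_center, dtype=float)
    cos_angle = ray_a @ ray_b / (np.linalg.norm(ray_a) * np.linalg.norm(ray_b))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

For example, two cameras at (1, 0, 0) and (0, 1, 0) observing a subvolume centered at the origin form a baseline angle of 90 degrees.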
Benefiting from the direct regularization of the 3D surface, SurfaceNet does not suffer from the shortcoming of 2D regularization caused by the uneven sampling of 2D projection. However, the view selection scheme becomes non-trivial and is challenging for the sparse MVS scenario, where SurfaceNet still has distinct disadvantages, such as large holes and noisy surfaces around regions with complex geometry and repeating patterns. Additionally, the volume-wise prediction becomes extremely computationally heavy for large scene reconstruction. In this paper, SurfaceNet+ addresses the aforementioned problems with a large margin of performance improvement and around a 10x speedup compared with SurfaceNet.
3 SurfaceNet+
In this section, we present SurfaceNet+, an end-to-end learning framework for the very sparse MVS problem, where the critical problem to be solved is the view selection. As the sensing sparsity increases, fewer photo-consistent views are available; thus, the view selection scheme becomes more critical. SurfaceNet+ utilizes a novel trainable occlusion-aware view selection scheme that takes the geometric prior into account via a coarse-to-fine strategy. In short, the multiscale inference (subsection 3.1) outputs the geometric prior required by the occlusion-aware view selection scheme (subsection 3.2). As shown in Fig. 2, starting from a bounding box, a very coarse 3D surface is predicted by considering all the view candidates. Subsequently, the coarse-level geometry is iteratively refined by gradually discarding the occluded views based on the coarser-level geometric prior. In subsection 3.3, the backbone network, a fully convolutional network structure, is presented in detail.
3.1 Multiscale Inference
For a volume-wise reconstruction pipeline, noisy predictions occur frequently around 3D surfaces with repeating patterns. Moreover, iterating through the mostly empty space imposes a huge computational burden. While it may be intuitive to consider a 3D geometric prior during reconstruction, the difficulty is that the general MVS task does not have any shape prior of the scene. We propose a coarse-to-fine architecture that gradually refines the geometric details under the assumption that only a small fraction of the space is occupied by the 3D surface of the scene.
In the first stage, SurfaceNet+ divides the entire bounding box into a set of subvolumes of the coarsest level with the side length , where is the voxel resolution of the coarsest level when the voxelization forms a tensor of size . The tensor size depends highly on the graphic memory size, for example . As the output of this stage, the estimated surface of the coarsest level is denoted as , where means an occupied voxel in the surface prediction.
The following iterative stage divides the space into a different subvolume set at each scale level, i.e., , whose resolutions form a geometric sequence with the common ratio , i.e., . Usually, we set to compromise between efficiency and effectiveness. This procedure is iterated until meeting the condition , where is the desired resolution and is the finest one. The subvolume division depends heavily on the predicted point cloud of the coarser level , when , so that each of the regular subvolume divisions contains at least one point:
(2)  
where denotes the number of subvolume divisions, and is a short representation for the union of all the subvolumes, i.e., . To eliminate the boundary effect of the convolution operation, we usually relax the above constraint and allow a slight overlap between neighboring subvolumes. The point cloud output of SurfaceNet+ will be introduced in subsection 3.3.
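The coarse-to-fine subdivision can be sketched as follows. This is an illustrative sketch, not the released implementation: the function names are hypothetical, the default common ratio of 0.5 is an assumed placeholder (the value used in the paper is not recoverable from the text above), and the subvolumes are taken as non-overlapping axis-aligned cells.

```python
import numpy as np

def resolution_schedule(coarsest, desired, ratio=0.5):
    """Voxel resolutions forming a geometric sequence, from coarse to fine.

    Iteration stops once the current resolution is at least as fine as
    `desired`. `ratio` is an assumed placeholder value.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie in (0, 1)")
    levels = [coarsest]
    while levels[-1] > desired:
        levels.append(levels[-1] * ratio)
    return levels

def occupied_subvolumes(coarse_points, side):
    """Minimum corners of the regular cells that contain at least one point
    of the coarser-level surface prediction; empty cells are skipped."""
    cells = np.floor(np.asarray(coarse_points, dtype=float) / side).astype(int)
    return np.unique(cells, axis=0) * side
```

With a coarsest resolution of 8 units and a desired resolution of 1 unit, the schedule is 8, 4, 2, 1. Because only cells that contain coarse surface points are refined, the empty space that dominates the bounding box is never visited at the finer levels.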
3.2 Trainable Occlusion-aware View Selection
As depicted in Fig. 1, even though SurfaceNet [ji2017surfacenet] does not have the artifacts caused by uneven sampling from the 3D surface to 2D depth, it suffers from large holes around complex geometry. The key reason is that the view selection becomes more critical for the sparse MVS problem. Following the notation in subsection 2.4, we introduce how the proposed trainable occlusion-aware view selection scheme can rank and select the top most valuable view pairs of each subvolume from all the possible view pairs
(3) 
based on the learned relative weights , which are inferred from both the geometric and photometric priors. Note that the multiscale scheme can provide us with the crucial geometric prior . Consequently, according to Eq. 8, the surface in each subvolume is fused from the predictions.
Geometric Prior. The geometric prior can be easily encoded from the multiscale predictions. For any camera view w.r.t. each subvolume , a convex hull is uniquely defined by a set of points
(4) 
where is the camera center of , and the set contains the 8 corners of .
The more points in the coarser level of surface prediction that appear in the region between the camera view and the subvolume , the more likely the view is occluded. These barrier points are defined in the set
(5) 
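A minimal sketch of counting barrier points is given below. It assumes an axis-aligned subvolume and treats as barriers the coarse-level surface points that fall inside the convex hull of the camera center and the 8 subvolume corners but outside the subvolume itself; excluding the subvolume's own points is an assumption of this sketch, and the function names are hypothetical. The hull test relies on the fact that the hull of a point and a convex box is the union of all segments from the point to the box, so a point lies in the hull exactly when the ray from the camera through it reaches the box at or beyond that point.

```python
import numpy as np

def in_camera_box_hull(points, cam_center, box_min, box_max, tol=1e-12):
    """Mask of points inside the convex hull of a camera center and a box."""
    p = np.atleast_2d(np.asarray(points, dtype=float))
    c = np.asarray(cam_center, dtype=float)
    lo = np.asarray(box_min, dtype=float)
    hi = np.asarray(box_max, dtype=float)
    d = p - c                                   # ray directions camera -> point
    s_min = np.ones(len(p))                     # s >= 1: box is hit at/after the point
    s_max = np.full(len(p), np.inf)
    ok = np.ones(len(p), dtype=bool)
    for k in range(3):                          # slab test per axis
        moving = np.abs(d[:, k]) > tol
        # No motion along this axis: the camera coordinate must lie in the slab.
        ok &= moving | bool(lo[k] <= c[k] <= hi[k])
        dk = np.where(moving, d[:, k], 1.0)
        s1 = (lo[k] - c[k]) / dk
        s2 = (hi[k] - c[k]) / dk
        s_min = np.maximum(s_min, np.where(moving, np.minimum(s1, s2), -np.inf))
        s_max = np.minimum(s_max, np.where(moving, np.maximum(s1, s2), np.inf))
    at_camera = ~np.any(np.abs(d) > tol, axis=1)
    return (ok & (s_min <= s_max)) | at_camera

def count_barrier_points(coarse_points, cam_center, box_min, box_max):
    """Number of coarse surface points between the camera and the subvolume."""
    p = np.atleast_2d(np.asarray(coarse_points, dtype=float))
    in_box = np.all((p >= np.asarray(box_min)) & (p <= np.asarray(box_max)), axis=1)
    in_hull = in_camera_box_hull(p, cam_center, box_min, box_max)
    return int(np.sum(in_hull & ~in_box))
```

A larger count suggests that the view is more likely occluded for this subvolume; how the count is turned into an occlusion probability is defined by Eq. 7 and is not reproduced here.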
Trainable Relative Weights. As suggested in [ji2017surfacenet], the endtoend trainable relative weights not only can improve the efficiency by filtering out the majority of the less crucial view pairs for each subvolume but also can improve the effectiveness of the surface prediction by weighted fusion. Note that, for sparse MVS, the number of the valid views for each subvolume could be too few to heuristically detect occlusions. Instead, we propose a trainable occlusionaware view pair selection scheme that learns the relative weights based on both the geometric and photometric priors:
(6) 
where the photometric priors are the same as in SurfaceNet[ji2017surfacenet], i.e., the baseline angle as well as the embeddings of the cropped patches around the 2D image of on both and in Eq. 1, and the geometric prior is encoded as the probability of being not occluded, i.e., :
(7) 
where is a hyperparameter controlling the sensitivity of this occlusion probability term and the coefficient can be understood as a normalization term w.r.t. different scales. In Section we will show the effect of and how it improves the performance of the reconstruction.
Weighted Average Surface Prediction. Lastly, for the general MVS problem, we follow the fusion strategy in SurfaceNet [ji2017surfacenet], which ranks and selects only a small subset of view pairs . Subsequently, the confidence that a voxel is on the surface, , is inferred by the weighted average of the predictions :
(8) 
where denotes the set of selected view pairs with the size of , and the relative weight for each view pair is end-to-end trainable and is inferred by Eq. 6. Note that a smaller can lead to more efficient but less accurate results, which is discussed in section 5.
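The ranking and weighted-average fusion described above can be sketched in a few lines of NumPy. The function name `fuse_view_pairs` and the argument layout are assumptions; the relative weights are taken as given (in the paper they are produced by the trainable function of Eq. 6) and are assumed to be positive.

```python
import numpy as np

def fuse_view_pairs(pair_predictions, relative_weights, top_n):
    """Weighted average of per-view-pair surface confidences.

    pair_predictions: (num_pairs, ...) array, one confidence volume per view pair.
    relative_weights: (num_pairs,) positive weights, one per view pair.
    top_n:            number of highest-weighted view pairs to keep.
    """
    preds = np.asarray(pair_predictions, dtype=float)
    weights = np.asarray(relative_weights, dtype=float)
    keep = np.argsort(weights)[::-1][:top_n]    # indices of the top-N weights
    kept_weights = weights[keep]
    # Contract the pair axis: sum_i w_i * p_i, then normalize by sum_i w_i.
    return np.tensordot(kept_weights, preds[keep], axes=1) / kept_weights.sum()
```

Keeping fewer view pairs reduces the number of network evaluations per subvolume, which is the efficiency and accuracy trade-off noted above.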
3.3 Network
Network Architecture. At each stage of reconstruction, we use a 3D convolutional neural network to predict whether each voxel in each subvolume is on the surface or not. Specifically, given and the corresponding image view pairs , we first blur each image using a Gaussian kernel to spread the local information around the large receptive field and to guarantee the image consistency in all stages. The unprojected 3D subvolume for a view pair is demonstrated in Fig. 2(b). The beauty of this representation is that it implicitly encodes the camera parameters as well as scale information to adapt to a fully convolutional neural network.
The detailed network configuration is shown in Fig. 4. The model consists of a U-Net-like architecture followed by a refinement network. SurfaceNet+ takes two colored subvolumes as input, which store two RGB color values per voxel and form a 6-channel tensor of shape , and predicts the on-surface probability for each voxel, forming a tensor of size . To extract distinct geometric information at various scales, we first use a pyramid structure to process the features with different receptive fields. To better aggregate multiscale information, after concatenating the features from different scales we use two convolution layers followed by a one-channel convolution layer with a sigmoid activation function. Inspired by [yao2018mvsnet], we apply a prediction refinement network at the end of the previous network. After the initial output with a tensor shape of , the skip connections at each layer are used to learn the residual prediction and to generate the final output .
Loss. The training loss consists of two parts that penalize both the initial prediction and the refined prediction . In the first stage, the discriminative prediction per voxel is compared with the ground truth . Since the majority of the voxels do not contain the surface, i.e., , a class-balanced cross-entropy function is utilized, i.e., for each we have
(9)  
where the hyperparameter is the occupancy ratio of the groundtruth in the scale .
In the second stage, the refined prediction is regressed to the ground truth by the mean squared error (MSE), so that small residuals are penalized as well,
(10) 
where . Consequently, the training loss is defined as:
(11) 
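The two loss terms can be sketched in NumPy as follows. The weighting convention is an assumption of this sketch: on-surface voxels are weighted by one minus the occupancy ratio and off-surface voxels by the occupancy ratio, which is one common form of class balancing, and the two terms are summed without an additional coefficient. The exact forms of Eq. 9 to Eq. 11 may differ, and the function names are hypothetical.

```python
import numpy as np

def class_balanced_bce(pred, target, occupancy_ratio, eps=1e-7):
    """Class-balanced cross-entropy over voxels (assumed weighting convention)."""
    p = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    t = np.asarray(target, dtype=float)
    on_surface = -(1.0 - occupancy_ratio) * t * np.log(p)
    off_surface = -occupancy_ratio * (1.0 - t) * np.log(1.0 - p)
    return float(np.mean(on_surface + off_surface))

def refinement_mse(refined, target):
    """Mean squared error between the refined prediction and the ground truth."""
    diff = np.asarray(refined, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def training_loss(initial, refined, target, occupancy_ratio):
    """Sum of both terms; equal weighting is an assumption of this sketch."""
    return class_balanced_bce(initial, target, occupancy_ratio) + refinement_mse(refined, target)
```

With an occupancy ratio of 0.25, a missed surface voxel costs three times as much as a false positive on an empty voxel, which counteracts the dominance of empty voxels in each subvolume.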
3.4 Implementation Details
Our network is trained on the DTU dataset [jensen2014large]. We use volumes with voxels to train the network, with a batch size of 16, and the voxel resolution is set to 0.4mm, 1.0mm and 2.0mm, respectively, for each set so that the network generalizes to different scales of surface geometry. To generalize well to sparse-MVS, the network needs to be trained on a variety of view pairs. Therefore, the 3D convolutional network is first trained on randomly sampled non-occluded view pairs () without the relative weight . Then the training is combined with , and the number of view pairs is fixed to 6. Specifically, the relative weights are learned by a 2-layer fully connected neural network . The computation follows subsection 3.2, except that the surface prediction at the previous stage is replaced with the reference model. During the reconstruction stage, the volume size is and the output is upsampled to . All training and reconstruction processes are run on one GTX 1080Ti graphics card.
4 Sparse-MVS Benchmark
In this section, the sparse-MVS leaderboard on different datasets, namely the DTU dataset[aanaes2016large], the Tanks and Temples dataset[Knapitsch2017] (T&T), and the ETH3D low-res dataset[schops2017multi], is introduced with extensive comparisons to recent MVS methods under various observation sparsity levels.
We benchmark SurfaceNet+ at all sparsities from 1 to 11 against several state-of-the-art methods. The sparse MVS setting in our leaderboard selects a small proportion of the camera views by consecutively sampling a view from every camera index, i.e., . In reality, it is also practical to sample small batches of images at sparse viewpoints, i.e., grouping batches of views with a certain Batchsize at the previously defined sparse viewpoints with a certain Sparsity. When and , the chosen camera indexes are 1 / 4 / 7 / 10 / . When and , the chosen camera indexes are 1,2 / 4,5 / 7,8 / 10,11 / .
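The sampling rule can be sketched as a short function. The rule is inferred from the two index lists above, which are consistent with a step of 3 between batch starts and batches of 1 and 2 views; these parameter values are not stated in the surviving text, and the function name is hypothetical. Camera indexes are 1-based.

```python
def sparse_view_indexes(num_views, sparsity, batchsize=1):
    """1-based camera indexes kept by the sparse sampling (inferred rule).

    Every `sparsity`-th camera index starts a batch of `batchsize`
    consecutive views.
    """
    if sparsity < 1 or not 1 <= batchsize <= sparsity:
        raise ValueError("require sparsity >= 1 and 1 <= batchsize <= sparsity")
    indexes = []
    for start in range(1, num_views + 1, sparsity):
        for offset in range(batchsize):
            if start + offset <= num_views:
                indexes.append(start + offset)
    return indexes
```

For 12 cameras, `sparse_view_indexes(12, 3, 1)` gives 1, 4, 7, 10 and `sparse_view_indexes(12, 3, 2)` gives 1, 2, 4, 5, 7, 8, 10, 11, matching the two listings above.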
Fig. 5 depicts the relationship between sparsity and the average baseline angle averaging over all the groundtruth points in the 22 models of the DTU dataset, 8 models of the Tanks and Temples dataset, and 5 models of the ETH3D lowres dataset, respectively. Note that, for simplicity, only the nearest view pairs are considered to calculate the baseline angle statistics.
(12) 
As the sparsity increases , the average baseline angle , defined by the intersected projection rays, gradually grows over a large range, e.g. reaching more than in both the DTU and T&T datasets. Given the positive correlation between and , we claim that our sparse-MVS setting is reasonable, as it not only covers various degrees of sparsity but also contains irregular sampling locations.
4.1 DTU Dataset[aanaes2016large]
We quantify the performance on the DTU dataset [aanaes2016large] in different sparse MVS settings. The DTU dataset is a large-scale MVS benchmark, which features a variety of objects and materials, and contains 80 different scenes seen from 49 camera positions under seven different lighting conditions. 22 models are selected from the DTU dataset as the evaluation set, following [ji2017surfacenet] (see footnote 1).
Footnote 1: We follow the same dataset split as SurfaceNet[ji2017surfacenet]. Training: 2, 6, 7, 8, 14, 16, 18, 19, 20, 22, 30, 31, 36, 39, 41, 42, 44, 45, 46, 47, 50, 51, 52, 53, 55, 57, 58, 60, 61, 63, 64, 65, 68, 69, 70, 71, 72, 74, 76, 83, 84, 85, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 107, 108, 109, 111, 112, 113, 115, 116, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128. Validation: 3, 5, 17, 21, 28, 35, 37, 38, 40, 43, 56, 59, 66, 67, 82, 86, 106, 117. Evaluation: 1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, 118.
| Sparsity | Method | Mean Distance (mm): Precision | Mean Distance (mm): Recall | Mean Distance (mm): Overall | Percentage (<1mm): Precision | Percentage (<1mm): Recall | Percentage (<1mm): f-score | Percentage (<2mm): Precision | Percentage (<2mm): Recall | Percentage (<2mm): f-score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | SurfaceNet+ | 0.385 | 0.448 | 0.416 | 88.01 | 73.01 | 78.44 | 92.33 | 78.1 | 83.55 |
| 1 | SurfaceNet[ji2017surfacenet] | 0.450 | 1.021 | 0.735 | 84.49 | 64.58 | 71.65 | 89.10 | 68.72 | 76.21 |
| 1 | Gipuma[galliani2015massively] | 0.283 | 0.873 | 0.578 | 94.65 | 59.93 | 70.64 | 96.42 | 63.81 | 74.16 |
| 1 | R-MVSNet[yao2019recurrent] | 0.383 | 0.452 | 0.417 | 87.63 | 72.48 | 77.09 | 91.74 | 76.39 | 82.01 |
| 1 | COLMAP[schoenberger2016mvs] | 0.411 | 0.657 | 0.534 | 82.24 | 52.48 | 61.34 | 88.26 | 62.20 | 72.93 |
| 3 | SurfaceNet+ | 0.446 | 0.482 | 0.464 | 86.06 | 74.41 | 78.15 | 90.87 | 78.25 | 82.91 |
| 3 | SurfaceNet | 0.461 | 0.997 | 0.729 | 83.02 | 61.09 | 68.87 | 88.31 | 66.39 | 74.41 |
| 3 | Gipuma | 0.267 | 1.252 | 0.759 | 95.51 | 50.88 | 64.63 | 97.49 | 50.33 | 63.68 |
| 3 | R-MVSNet | 0.465 | 1.012 | 0.738 | 89.55 | 48.03 | 59.28 | 96.96 | 57.92 | 69.04 |
| 3 | COLMAP | 0.467 | 1.090 | 0.778 | 78.45 | 49.26 | 59.62 | 91.44 | 55.98 | 65.77 |
| 5 | SurfaceNet+ | 0.446 | 0.491 | 0.469 | 88.58 | 71.63 | 77.48 | 92.86 | 76.04 | 82.28 |
| 5 | SurfaceNet | 0.445 | 0.948 | 0.701 | 81.07 | 58.62 | 66.55 | 85.40 | 62.76 | 70.97 |
| 5 | Gipuma | 0.460 | 1.633 | 1.046 | 92.38 | 38.53 | 52.36 | 95.10 | 48.15 | 61.78 |
| 5 | R-MVSNet | 0.329 | 2.209 | 1.269 | 89.26 | 20.51 | 31.60 | 93.99 | 32.74 | 46.37 |
| 5 | COLMAP | 0.443 | 1.284 | 0.863 | 88.79 | 42.51 | 55.94 | 92.91 | 54.89 | 65.77 |
| 7 | SurfaceNet+ | 0.435 | 0.524 | 0.479 | 91.36 | 72.23 | 75.59 | 95.21 | 76.54 | 81.86 |
| 7 | SurfaceNet | 0.688 | 1.130 | 0.909 | 66.86 | 36.91 | 50.24 | 69.21 | 46.91 | 61.70 |
| 7 | Gipuma | 0.569 | 1.770 | 1.169 | 85.35 | 17.91 | 28.66 | 90.78 | 28.00 | 41.31 |
| 7 | R-MVSNet | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 7 | COLMAP | 0.545 | 1.756 | 1.150 | 59.28 | 15.14 | 22.46 | 80.92 | 31.56 | 41.89 |
| 9 | SurfaceNet+ | 0.441 | 0.895 | 0.668 | 85.99 | 53.16 | 63.01 | 89.86 | 57.63 | 67.86 |
| 9 | SurfaceNet | 1.112 | 2.176 | 1.644 | 35.84 | 29.53 | 31.47 | 38.36 | 34.01 | 35.49 |
| 9 | Gipuma | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 9 | R-MVSNet | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 9 | COLMAP | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 11 | SurfaceNet+ | 0.445 | 0.880 | 0.663 | 85.81 | 51.52 | 61.54 | 90.05 | 55.41 | 65.99 |
| 11 | SurfaceNet | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 11 | Gipuma | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 11 | R-MVSNet | empty | empty | empty | empty | empty | empty | empty | empty | empty |
| 11 | COLMAP | empty | empty | empty | empty | empty | empty | empty | empty | empty |

| Method | Average Rank | Mean | Family | Francis | Horse | Lighthouse | M60 | Panther | Playground | Train |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACMM [xu2019multi] | 14.00 | 57.27 | 69.24 | 51.45 | 46.97 | 63.20 | 55.07 | 57.64 | 60.08 | 54.48 |
| CasMVSNet [gu2019cascade] | 15.75 | 56.84 | 76.37 | 58.45 | 46.26 | 55.81 | 56.11 | 54.06 | 58.18 | 49.51 |
| ACMH [xu2019multi] | 22.25 | 54.82 | 69.99 | 49.45 | 45.12 | 59.04 | 52.64 | 52.37 | 58.34 | 51.61 |
| UCSNet [cheng2019deep] | 22.62 | 54.83 | 76.09 | 53.16 | 43.03 | 54.00 | 55.60 | 51.49 | 57.38 | 47.89 |
| PLC [liao2019pyramid] | 24.38 | 54.56 | 70.09 | 50.30 | 41.94 | 58.86 | 49.19 | 55.53 | 56.41 | 54.13 |
| SurfaceNet+ | 36.12 | 49.38 | 62.38 | 32.35 | 29.35 | 62.86 | 54.77 | 54.14 | 56.13 | 43.10 |
| Dense R-MVSNet [yao2019recurrent] | 41.00 | 50.55 | 73.01 | 54.46 | 43.42 | 43.88 | 46.80 | 46.69 | 50.87 | 45.25 |
| VisibilityAwarePointMVSNet [chen2020visibility] | 43.88 | 48.70 | 61.95 | 43.73 | 34.45 | 50.01 | 52.67 | 49.71 | 52.29 | 44.75 |
| PointMVSNet [ChenPMVSNet2019ICCV] | 44.38 | 48.27 | 61.79 | 41.15 | 34.20 | 50.79 | 51.97 | 50.85 | 52.38 | 43.06 |
| R-MVSNet [yao2019recurrent] | 46.88 | 48.40 | 69.96 | 46.65 | 32.59 | 42.95 | 51.88 | 48.80 | 52.00 | 42.38 |
| MVSNet [yao2018mvsnet] | 57.50 | 43.48 | 55.99 | 28.55 | 25.07 | 50.79 | 53.96 | 50.86 | 47.90 | 34.69 |
| COLMAP [schoenberger2016mvs] | 60.50 | 42.14 | 50.41 | 22.25 | 25.63 | 56.43 | 44.83 | 46.97 | 48.53 | 42.04 |

Fig. 6 plots the performance over a wide range of sparsity in terms of the f-score (1 mm), which combines recall and precision. It shows that the proposed method consistently outperforms the others in all the sparse settings. In particular, for the case of , SurfaceNet+ performs well without obvious degradation. In the extremely sparse case, i.e., , SurfaceNet+ shows, as expected, a small performance reduction. In contrast, the other methods, especially the depth-fusion methods, predict only a few points. Readers can refer to subsection 2.3 for a discussion of why the depth-fusion methods cannot return a complete result. In our leaderboard, depth-map-based methods such as RMVSNet [yao2019recurrent] and Gipuma [galliani2015massively] share the same depth fusion code. For a fair comparison, we tuned the hyperparameters of the depth fusion algorithm to improve the f-score at 1 mm for each sparsity setting.
More detailed quantitative results are listed in Table I, where three different metrics are adopted for evaluation. Precision and recall are each reported under two measures: the distance metric[aanaes2016large] and the percentage metric[Knapitsch2017]. The overall score for the percentage metric is the f-score, and the corresponding overall score for the distance metric is the average of the mean precision and the mean recall. SurfaceNet+ outperforms the state-of-the-art methods in both recall and precision at all sparsity settings. Unlike the other methods, whose recall decays dramatically as the sparsity increases, SurfaceNet+ maintains almost constant recall with high precision.
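The two overall scores described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for Table I: the function names and the helper `percentage_under` are hypothetical, and the f-score is assumed to be the usual harmonic mean of precision and recall.

```python
def percentage_under(distances, threshold):
    """Percentage of points whose distance error is below `threshold`.

    Hypothetical helper: applied to prediction-to-reference distances it gives
    a precision-like percentage, and to reference-to-prediction distances a
    recall-like percentage.
    """
    if not distances:
        return 0.0
    hits = sum(1 for d in distances if d < threshold)
    return 100.0 * hits / len(distances)


def f_score(precision, recall):
    """Overall score for the percentage metric (harmonic mean, assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def overall_distance(mean_precision, mean_recall):
    """Overall score for the distance metric: average of the two mean distances."""
    return 0.5 * (mean_precision + mean_recall)


if __name__ == "__main__":
    # First three numbers of the SurfaceNet+ row at sparsity 9 in Table I.
    print(round(overall_distance(0.441, 0.895), 3))
    # Toy distances (in mm), purely illustrative.
    print(percentage_under([0.2, 0.5, 1.5, 3.0], 1.0))
```

With the distance-metric values 0.441 and 0.895 from the SurfaceNet+ row at sparsity 9, the average reproduces the listed overall value of 0.668.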
The qualitative comparison of SurfaceNet+ with two other methods, RMVSNet [yao2019recurrent] and Gipuma [galliani2015massively], is illustrated in Fig. 7, showing that SurfaceNet+ precisely reconstructs the scenes while maintaining high recall. In particular, SurfaceNet+ is able to generate a high-recall point cloud in regions with complex geometry and repeating patterns when , which means it fuses an accurate 3D model from correctly selected non-occluded views. The detailed analysis is given in Section 5.
To have a slightly different way of sparse sampling, three values are evaluated as depicted in Fig. 6. It can be observed that SurfaceNet+ consistently outperforms the others, even though the depth-fusion methods (Gipuma[galliani2015massively], RMVSNet[yao2019recurrent], COLMAP[schoenberger2016mvs]) improve as the increases. Moreover, as the sparsity increases, the performance drop of the existing methods is clearly larger than that of SurfaceNet+. In particular, we have retrained RMVSNet for sparse MVS with randomly sampled non-occluded view pairs at . As shown in Fig. 6, the gain in terms of f-score is not apparent. As the depth-fusion based MVS methods (RMVSNet) rely more on the photo-consistency in 2D images, the large baseline angles of a very sparse MVS problem lead to severely skewed matching patches across views, which makes the dense correspondence problem significantly harder. In contrast, the learning-based volumetric MVS methods like SurfaceNet+ avoid the 2D correspondence search problem by directly inferring the 3D surface from the unprojected 3D subvolumes. That may explain why the learning-based volumetric methods outperform the depth-fusion based methods in the very sparse MVS settings. For the experiment settings, both RMVSNet and Gipuma share the same depth fusion code, and we tuned its hyperparameters to improve the f-score at 1 mm for each sparsity setting. More specifically, following Gipuma[galliani2015massively], since there is a trade-off between accuracy and completeness, we choose the depth fusion parameter settings that achieve high accuracy at sparsity=1,2 and high completeness at sparsity>=3. The remaining settings are the same as in the papers of RMVSNet[yao2019recurrent] and Gipuma[galliani2015massively]. For COLMAP[schoenberger2016mvs], all parameters were set to their default values.
4.2 Tanks and Temples Dataset[Knapitsch2017]
The Tanks and Temples (TT) dataset[Knapitsch2017] contains real-world large scenes under complex lighting conditions. In Fig. 8, we qualitatively compare our results on this dataset with those of RMVSNet [yao2019recurrent] and COLMAP [schoenberger2016mvs]. The results indicate the effectiveness of the proposed method at different sparsities. We submitted the SurfaceNet+ results () to the online leaderboard for evaluation. As shown in Table II, even under the dense MVS condition, the overall rank of SurfaceNet+ is still higher than that of RMVSNet[yao2019recurrent], MVSNet[yao2018mvsnet], COLMAP[schoenberger2016mvs], and PointMVSNet[ChenPMVSNet2019ICCV]. Note that Table II lists all the top non-anonymous methods on the leaderboard for comparison.
4.3 Generalization on the ETH3D Dataset[schops2017multi]
We also evaluate the generalization ability on the ETH3D dataset[schops2017multi], i.e., we directly evaluate the proposed method, trained only on the DTU training dataset, without fine-tuning the network. The results on the low-resolution scenes are shown in Fig. 9. It is worth noting that the baseline angle in the ETH3D dataset is tiny among all the camera views because the images were acquired by just rotating the camera with little camera translation. Fig. 5(c) further depicts the relationship between sparsity and the average baseline angle over all the models in the ETH3D low-resolution training set. The average baseline angle is far less than , indicating that the ETH3D dataset may not be suitable for the sparse-MVS benchmark.
5 Ablation Study
To investigate the influence of each key component of the proposed method, we design an ablation study with respect to the coarse-to-fine fashion (Multiscale) and the trainable occlusion-aware view selection (Viewselection). All the experiments below are performed and evaluated on a specific model (model 23) in the DTU dataset because it contains many challenging cases such as complex geometry, textureless regions, and repeating patterns.
In the sparse case, for example, , we quantitatively illustrate the performance gain of the multiscale fashion in Fig. 10(a), in which we compare four settings: ICCV SurfaceNet [ji2017surfacenet] ( curve), ICCV SurfaceNet with the new backbone ( curve), denoted as SurfaceNet in the rest of the paper, SurfaceNet with multiscale inference ( curve), and the proposed trainable occlusion-aware view selection scheme ( curve). From the comparison of v.s. , we can conclude that the proposed trainable occlusion-aware view selection scheme, which is a volume-wise strategy, significantly improves both completeness (Recall) and accuracy (Precision).
5.1 Multiscale Mechanism
Fig. 11(a) shows the predictions at the various scale levels. Note that the volume-wise occlusion detection is turned off. Fig. 11(b) contains the result without the coarse-to-fine mechanism, which is the same as SurfaceNet[ji2017surfacenet]. The reference model scanned by laser is shown in Fig. 11(c). In each group, (top) the front view of the reconstructed model and (bottom) the intersection with a horizontal plane (red line) are shown. The top view of the red line is useful for observing the surface thickness, noise level, and completeness.
Comparing (a) and (b), the method with the coarse-to-fine mechanism clearly leads to higher precision in the textured area and the complex geometry region. Although (b) accurately predicts the results in some complex regions, it suffers from thick surface prediction and floating noise around the repeating pattern regions. The floating noise occurs close to the real surface, because the volume-wise method processes each subvolume locally and individually, without a global prior to filter out the floating noise. In contrast, the coarse-to-fine mechanism helps to gradually reject the empty space and to refine the geometric prediction.
In the sparse case, for example, , the multiscale mechanism dramatically improves precision, as seen by comparing the round curve and the triangle curve in Fig. 10(a). The triangle curve appears to be a version of the round curve shifted in the direction of better precision with constant recall.
Feature aggregation. To give the network more global context, we try to use features coming from the coarser level of the network, so that the coarse level is used not only to decide on the visibility/occlusions but also to provide additional feature content. We study the advantages and disadvantages of this multiscale inference architecture and report the results in Fig. 10(c). It shows that the multiscale feature aggregation scheme () improves the completeness (Recall) of the results by providing the global context. However, when there are few view pairs, e.g., fewer than 6, the multiscale aggregation worsens the accuracy (Precision) of the prediction. The reason is that the volume-wise surface prediction relies on multiple pairs of unprojected subvolumes, and in the coarse-to-fine procedure the selected views may be updated based on the geometric priors at different scales. Consequently, when the multiscale scheme aggregates features from different view pairs, the global context may become less useful and lead to worse accuracy (Precision).
5.2 Occlusion Detection
To analyze the qualitative impact of the occlusion-aware view selection, the comparison is based on the result obtained with the multiscale mechanism. For better visualization, we only probe the occlusion-aware view selection in the final multiscale stage. The results with and without occlusion-aware view selection are shown in Fig. 12, which contains the front views and the intersection (the red line shown in the model) of the results, accompanied by different camera views.
Note that SurfaceNet+ (with Viewselection) has a higher-recall output, especially around the corner of the reconstructed house model (shown in the orange box of the intersection). The gap comes from the different views selected by each method. Both methods use image patches (bottom right corner of the picture) to select valid views (the four views shown at the bottom of the figure). Yet the left two views are blocked by the surface, which means that only the right two views can provide useful patch information for reconstruction. The occluded views reduce the weight given to the correct views; therefore, without occlusion-aware view selection, incomplete predictions occur in complex geometry regions. In SurfaceNet+, the rejected occluded views (shown in red) are detected by combining the projection rays with the point cloud output by the previous stage, as described in subsection 3.2. It is worth noting that these occluded views are extremely hard to detect using only image patches. These patches are similar to each other, so it is difficult to infer their relative positions in the absence of a three-dimensional prior.
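The idea of combining projection rays with the previous-stage point cloud can be illustrated with a simple geometric test. The sketch below is only one possible reading and is not the formulation of subsection 3.2: the function names, the `radius` and `margin` thresholds, and the hard accept/reject decision are all assumptions made for illustration.

```python
import math


def is_occluded(camera, target, coarse_points, radius=0.05, margin=0.1):
    """Return True if a coarse point blocks the ray from `camera` to `target`.

    A point counts as blocking when it lies within `radius` of the ray and
    strictly between the camera and the target, leaving a `margin` in front of
    the target so that the surface being reconstructed does not block itself.
    Both thresholds are hypothetical values.
    """
    ray = [t - c for t, c in zip(target, camera)]
    length = math.sqrt(sum(r * r for r in ray))
    if length == 0.0:
        return False
    direction = [r / length for r in ray]
    for point in coarse_points:
        rel = [p - c for p, c in zip(point, camera)]
        along = sum(a * d for a, d in zip(rel, direction))  # distance along the ray
        if not (0.0 < along < length - margin):
            continue
        perp_sq = sum(a * a for a in rel) - along * along  # squared distance to the ray
        if perp_sq < radius * radius:
            return True
    return False


def select_views(cameras, target, coarse_points):
    """Indices of the cameras whose ray to `target` is not blocked."""
    return [i for i, cam in enumerate(cameras)
            if not is_occluded(cam, target, coarse_points)]


if __name__ == "__main__":
    cameras = [(0.0, 0.0, 0.0), (2.0, 0.0, 2.0)]
    target = (0.0, 0.0, 2.0)          # centre of the subvolume of interest
    coarse = [(0.0, 0.0, 1.0)]        # one coarse surface point from the previous stage
    print(select_views(cameras, target, coarse))
```

In this toy scene the first camera looks at the target through the coarse point and is rejected, while the second camera has a free line of sight. An image-patch-only criterion has no access to this geometric cue, which is the point made in the paragraph above.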
In Fig. 10(b), to further demonstrate the benefit of an explicit “relative weight (with occlusion-aware)” ( curve), we investigate the setting “relative weight (without occlusion-aware)” ( curve) and the setting “without relative weight” ( curve). Enabling the “relative weight (with occlusion-aware)” significantly improves the Recall (Completeness) of the reconstructed model, indicating the effectiveness of the proposed trainable occlusion-aware view selection scheme.
Additionally, in Fig. 10(a), we evaluate the proposed end-to-end trainable occlusion-aware view selection scheme (trainable, curve) against the heuristic view selection method (heuristic, curve). Note that both share the same backbone network structure and the same multiscale fusion strategy; the only difference is the view selection module. The proposed end-to-end trainable occlusion-aware view selection scheme significantly boosts the completeness (Recall) of the reconstructed model.
SurfaceNet[ji2017surfacenet]. For a fair comparison, in Fig. 10(a) we also show the performance difference between the ICCV SurfaceNet[ji2017surfacenet] and the modified SurfaceNet with the new backbone, where the only modification to the ICCV SurfaceNet is the network structure that SurfaceNet+ uses (Fig. 4). It is worth noting how the relative position of each curve changes. There is a clear downward shift after adding the proposed trainable occlusion-aware view selection (Viewselection), which indicates better recall with comparable precision. Overall, the gain achieved by SurfaceNet+ over SurfaceNet does not come from the backbone network adopted; instead, it comes from the proposed multiscale pipeline and the novel view selection strategy.
5.3 Discussion
Hyperparameters. The number of view pairs is also critical for the algorithm. Too few view pairs may lead to noisy and inaccurate reconstruction, while too many in sparse-MVS lead to incomplete (low recall) prediction. The trade-off of achieves the best overall performance as indicated in the figure.
To further analyze the effect of occlusion-aware view selection, we experiment with different occlusion parameters at different sparsities with a fixed view pair number . The recall gain is measured as the recall improvement over the method without occlusion detection.
As shown in Fig. 13, the gain increases as the sparsity grows. The reason is that, as the sparsity increases, the growing baseline angle and the smaller number of view pairs lead to a lower percentage of non-occluded views. Therefore, a lower weight on occluded views, controlled by alpha, brings a larger benefit at larger sparsity.
Dataset  No. of subvolumes (SurfaceNet)  No. of subvolumes (SurfaceNet+)  Speed-up (mean)  Speed-up (max)  Speed-up (min)

DTU  140,608  12,320  11X  23X  7X
T&T  158,992  15,892  15X  33X  11X
Efficiency. To evaluate the efficiency brought by the coarse-to-fine mechanism, we measure the speed of the algorithm by the total number of sampled volumes. Specifically, we count the number of cubes sampled by each method on all the models in the DTU [aanaes2016large] evaluation set and the Tanks and Temples [Knapitsch2017] ‘Intermediate’ set. We set the whole reconstruction scene as a cubic box with length in the DTU dataset and the final voxel resolution . Each volume forms a tensor of size and we set . SurfaceNet [ji2017surfacenet] uses all the cubes for reconstruction, whereas SurfaceNet+ uses a three-stage coarse-to-fine pipeline. The settings for Tanks and Temples are the same as for DTU except that we set and . The left part of Table III shows the average number of subvolumes used for reconstruction, and the right part shows the speed-up factor brought by the coarse-to-fine mechanism. The speed-up factor is computed as the ratio between the two methods. The mean, maximum, and minimum factors show that the volume selection mechanism achieves more than 10 times higher efficiency on average on both datasets.
To better understand the efficiency gain, in Fig. 14 we visualize the speed-up ratio as a function of the relative resolution . Note how the coarse-to-fine mechanism leads to an efficient representation compared to SurfaceNet. At low relative resolution, the ratio is near 1 because the sampling based on the coarse prediction is nearly dense. With finer reconstruction, however, the speed-up ratio grows dramatically because the finer prediction leads to a higher percentage of empty subvolumes.
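The trend described above can be reproduced on a toy scene. The following sketch is an illustration under strong assumptions, not the SurfaceNet+ pipeline: the "surface" is an analytic sphere rather than a network prediction, occupied cubes are split into eight children at each level, and `base`, `levels`, and all function names are hypothetical. It compares the number of cubes evaluated at each level with the number a dense grid of the same resolution would require.

```python
import itertools


def cube_hits_sphere(lo, size, center=(0.5, 0.5, 0.5), radius=0.4):
    """True if the sphere surface passes through the axis-aligned cube.

    The surface intersects the cube exactly when the radius lies between the
    nearest and the farthest distance from the sphere centre to the cube.
    """
    near_sq = 0.0
    far_sq = 0.0
    for a, c in zip(lo, center):
        b = a + size
        d_near = max(a - c, 0.0, c - b)
        d_far = max(abs(a - c), abs(b - c))
        near_sq += d_near * d_near
        far_sq += d_far * d_far
    return near_sq <= radius * radius <= far_sq


def coarse_to_fine_counts(base=4, levels=4):
    """Return (level, cubes evaluated, dense cubes, ratio) for each level."""
    size = 1.0 / base
    active = [(i * size, j * size, k * size)
              for i, j, k in itertools.product(range(base), repeat=3)]
    rows = []
    for level in range(levels):
        evaluated = len(active)
        dense = (base * 2 ** level) ** 3
        rows.append((level, evaluated, dense, dense / evaluated))
        # Keep only cubes that contain surface, then split each into 8 children.
        occupied = [lo for lo in active if cube_hits_sphere(lo, size)]
        half = size / 2.0
        active = [(x + dx * half, y + dy * half, z + dz * half)
                  for (x, y, z) in occupied
                  for dx, dy, dz in itertools.product((0, 1), repeat=3)]
        size = half
    return rows


if __name__ == "__main__":
    for level, evaluated, dense, ratio in coarse_to_fine_counts():
        print(level, evaluated, dense, round(ratio, 2))
```

At the coarsest level every cube is evaluated, so the ratio is exactly 1. Because a surface occupies a number of cubes that grows roughly with the square of the resolution while a dense grid grows with its cube, the ratio increases at every finer level, in line with the behaviour reported for Fig. 14.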
Noisy camera poses. The camera poses used in our previous experiments are provided by the public datasets and are estimated by the registration of laser scans (denoted as GT camera pose). In practice, however, the camera poses may be computed from the sparse set of views, which inevitably introduces noise (denoted as noisy camera pose). To evaluate how the noisy camera pose affects the performance of SurfaceNet+, we apply structure-from-motion (SfM)[schoenberger2016sfm] to the sparse set of views to obtain the noisy camera pose. As expected, using noisy camera poses (Fig. 15) degrades the performance of the MVS methods relative to using GT camera poses (Fig. 6): the f-score drops.
We examine the f-score degradation between Fig. 15 and Fig. 6. The image-wise view selection scheme, used in Gipuma[galliani2015massively] and RMVSNet[yao2019recurrent], is more sensitive to the camera pose noise, especially at high sparsity levels. In contrast, the pixel-wise (COLMAP[schoenberger2016mvs]) and volume-wise (SurfaceNet[ji2017surfacenet] and SurfaceNet+) view selection strategies are relatively more robust to camera pose noise. The reason is that the camera pose noise introduces an inhomogeneous shift of the photo-consistent matches, so the pixel-wise and volume-wise view selection can adaptively choose the relatively better views based on the photometric consistency despite the noisy camera pose. In contrast, the image-wise view selection matches correspondences only on the preselected views, which may no longer be the best views for a large proportion of pixels or subvolumes once the camera pose noise is considered.
6 Conclusion
As sparser sensation is more practical and more cost-efficient, instead of focusing only on the dense MVS setup, we present a comprehensive analysis of sparse-MVS under various observation sparsities. The proposed leaderboard calls for more attention and effort from the community on the sparse-MVS problem, since the state-of-the-art depth-fusion methods perform significantly worse as the baseline angle gets larger in the sparser setting. As another line of solution, we propose a volumetric method, SurfaceNet+, to handle sparse-MVS by introducing the novel occlusion-aware view selection scheme as well as the multiscale strategy. The experiments demonstrate a substantial performance gap between SurfaceNet+ and the recent methods in terms of precision and recall. Under the extreme sparse-MVS settings in two datasets, where existing methods can only return very few points, SurfaceNet+ still works as well as in the dense MVS setting.
Limitations. (1) Ideally, for a simple geometric region, each piece of a surface in a subvolume should be effectively reconstructed using only ONE view pair with a large baseline angle, i.e.,
. However, due to various shading and lighting conditions, the colorization of the 3D model becomes more challenging when fewer views are used. (2) Furthermore, even though the scanned models in the MVS datasets are large-scale scenes, it will be challenging for SurfaceNet+ to effectively and efficiently reconstruct a city-level 3D model. (3) Last but not least, despite the great generalization ability of the learnt model, it still requires dozens of laser-scanned 3D models for supervision. That significantly limits the application scenarios, such as astro-observation and multiview microscopic observation, where supervision signals can rarely be captured.