
M^3VSNet: Unsupervised Multi-metric Multi-view Stereo Network

Learning-based MVS methods now outperform traditional MVS methods. However, learning-based networks need large amounts of ground-truth 3D training data, which is not always available. To relieve this cost, we propose an unsupervised normal-aided multi-metric network, named M^3VSNet, for multi-view stereo reconstruction without ground-truth 3D training data. Our network introduces: (a) pyramid feature aggregation to extract more contextual information; (b) normal-depth consistency to make the estimated depth maps more reasonable and precise in the real 3D world; (c) a multi-metric combination of pixel-wise and feature-wise loss functions to learn the inherent constraints from the perspective of perception, beyond the pixel value. Extensive experiments show that M^3VSNet achieves state-of-the-art performance among unsupervised methods on the DTU dataset. Without any finetuning, M^3VSNet ranks 1st among all unsupervised MVS networks on the leaderboard of the Tanks & Temples dataset as of April 17, 2020. Our codebase is available at



I Introduction

Multi-view stereo (MVS) reconstruction has remained a hot topic over the past decade. MVS can be regarded as an extension of structure from motion (SFM) [6][9]. SFM extracts and matches feature points from multi-view photos or continuous videos and then reconstructs sparse point clouds [28][31]. The difference is that MVS aims to reconstruct dense point clouds [25]. Traditional methods have made great progress in dense reconstruction through handcrafted similarity measures (e.g. NCC) and regularization [10][12][24][11][33]. However, traditional methods may fail in some scenarios, such as textureless regions, mirror effects, or reflections [25].

To relieve this limitation, deep learning has been introduced into MVS [16][32]. Some outstanding networks based on convolutional neural networks (CNN) and recurrent neural networks (RNN), such as MVSNet [37] and R-MVSNet [38], infer depth from multi-view stereo correspondences. The features are learned by the network instead of being selected by hand, and the underlying correspondences are taken into account in the forward and backward passes, which has proved valid under the constraints of geometric and photometric consistency [14].

Current learning-based MVS methods depend heavily on ground-truth 3D training data, which is a big hurdle because such data is expensive to acquire [13][37][40]. One effective solution is to build an unsupervised network that does not need ground-truth 3D training data [4][36]. Such a network is also easier to deploy and transfer [7].

This paper introduces a novel method, named M^3VSNet, which infers depth maps for multi-view stereo without ground-truth 3D training data. The key insight is that traditional photometric consistency holds when the geometric information is correct [35][26]. Further, multi-scale information plays a vital role in similarity measurement for textureless regions or non-Lambertian surfaces. Previous works such as MVSNet [37] and R-MVSNet [38] extract features from only a single layer to construct the 3D cost volume, which has been shown to be an important representation for depth estimation [29]. Here, we aggregate multi-scale pyramid features to construct a 3D cost volume with more contextual information. For the loss function, previous unsupervised works [7][36][23] pay more attention to pixels than to multi-scale features. In view of this, a multi-scale feature loss is introduced as a significant supplement. The multi-metric loss combines a feature-wise loss, derived from the pre-trained VGG16 network, with a pixel-wise loss: the feature-wise loss captures perceptual similarity, while the pixel-wise loss focuses on the accuracy of pixel values. To improve performance further, normal-depth consistency is introduced to regularize the depth maps in real 3D space. This regularization increases the accuracy and precision of the depth maps and counteracts the possible deterioration caused by multi-scale information.

Our main contributions are summarized as below:

  • We propose a novel unsupervised network for multi-view stereo without the ground-truth 3D training data.

  • The paper puts forward three methods to deal with textureless regions or non-Lambertian surfaces. A multi-metric loss, including pixel-wise and feature-wise terms, captures perceptual similarity beyond pixel values. Multi-scale features are extracted to obtain more contextual information. Normal-depth consistency regularizes the depth maps to be more precise and more reasonable.

  • M^3VSNet achieves SOTA performance and ranks 1st among all unsupervised MVS networks on the leaderboard of the Tanks & Temples dataset as of April 17, 2020.

II Related Work

II-A Traditional MVS

Many traditional methods have been proposed in this field, such as voxel-based methods [27], feature-point diffusion [10], and the fusion of estimated depth maps [3]. Voxel-based methods consume large computing resources, and their accuracy depends on the voxel resolution [16]. Feature-point diffusion tends to leave holes in textureless areas. The most widely used approach is depth-map fusion, which estimates a depth map per view and then fuses all depth maps into the final point cloud [5]. Many improvements have been proposed. Campbell et al. [5] use a spatial consistency constraint to remove outliers from the depth maps. Galliani et al. [11] formulate patch matching in 3D space so that the process can be massively parallelized. Schönberger et al. [24] estimate depth and normal maps jointly, using photometric and geometric priors to refine the image-based depth and normal fusion. However, accuracy and completeness could still be improved if the problems of textureless regions and non-Lambertian surfaces were solved.

II-B Depth Estimation

Depth-map-based methods decouple reconstruction into depth estimation and depth fusion. Depth estimation from monocular video or binocular image pairs has many similarities with multi-view stereo [20], but there are also clear differences. Monocular video [41] lacks the real scale of depth. Binocular image pairs are usually rectified so that the two images are parallel [8]; in this case, only the disparity needs to be inferred, without considering the camera intrinsics and extrinsics. In multi-view stereo, the input is an arbitrary number of images, and the transformations among all cameras must be taken into account as a whole [37]. Other obstacles, such as multi-view occlusion and consistency [7], make depth estimation for multi-view stereo harder than for monocular video and binocular image pairs.

II-C Supervised Learning MVS

Since Yao et al. proposed MVSNet in 2018 [37], many supervised networks based on MVSNet have been put forward. To reduce GPU memory consumption, Yao et al. introduced R-MVSNet, which uses GRUs [38]. Gu et al. use a cascade formulation to shrink the cost volume [13]. Yi et al. introduced two new self-adaptive view aggregation schemes with pyramid multi-scale images to enhance point clouds in textureless regions [39]. Luo et al. use plane-sweep volumes with isotropic and anisotropic 3D convolutions to get better results [22]. Yu et al. introduced Fast-MVSNet [40], which first builds a sparse cost volume and then refines the depth maps with a simple but efficient Gauss-Newton layer, greatly improving efficiency. In this kind of task, the cost volume and 3D regularization are memory consuming, and the ground-truth depth requires heavy labor to obtain, so it is not easily available in other scenarios.

II-D Unsupervised Learning in MVS

An unsupervised network uses internal and external constraints to learn depth by itself, which removes the need for laborious ground-truth depth annotation. Many works explore unsupervised learning for monocular video and binocular images with photometric and geometric consistency. Mahjourian et al. [23] present an unsupervised method for learning depth and ego-motion from monocular video, using an image reconstruction loss, a 3D point cloud alignment loss, and additional image-based losses. As in unsupervised learning for monocular video and binocular images [2], the losses for MVS are also based on photometric and geometric consistency. Dai et al. [7] predict the depth maps for all views simultaneously in a symmetric way, which guarantees cross-view photometric and geometric consistency but consumes a lot of GPU memory. Khot et al. [18] propose a simple network with a traditional loss design, but the results are unsatisfactory. More effort in this direction is worthwhile.

III M^3VSNet

This section presents M^3VSNet in detail. M^3VSNet is an unsupervised network based on MVSNet [37]. It performs multi-view stereo reconstruction without ground-truth 3D training data and achieves the best accuracy and completeness of point clouds among all unsupervised MVS networks. More importantly, the overall performance of M^3VSNet is comparable to that of supervised MVSNet in the same setting.

III-A Network Architecture

Fig. 1: Our unsupervised network: M^3VSNet. It contains four components: pyramid feature aggregation, cost volume and 3D U-Net regularization, normal-depth consistency, and multi-metric loss including pixel-wise & feature-wise loss.

M^3VSNet consists of feature extraction, cost volume construction, 3D U-Net regularization, normal-depth refinement, and the multi-metric loss. As figure 1 shows, pyramid feature aggregation with only the finest level is adopted to extract features from an arbitrary number of images. Cost volume construction, 3D U-Net regularization, and initial depth estimation follow MVSNet, which has proved effective. The initial depth is then transferred to the normal domain, and the final depth is refined with the 3D geometric constraint by mapping from the normal domain back to the depth domain. In addition, to construct the multi-metric loss, a pre-trained VGG16 network provides the feature-wise constraint. Together with the traditional pixel-wise constraint, M^3VSNet estimates depth maps and fuses them into the final point cloud in an unsupervised way.

III-A1 Pyramid Feature Aggregation

In MVSNet [37], only the 1/4-scale feature is used (1/4 denotes a quarter of the size of the original reference image). A single feature map lacks contextual information. Lin et al. [21] present several alternatives. A featurized image pyramid predicts at every image scale, and a pyramidal feature hierarchy predicts at every level of the feature hierarchy. The feature pyramid network makes the most of contextual information by upsampling across scales and predicting independently at each level, at the cost of more memory. M^3VSNet uses pyramid feature aggregation with only the finest level, which has been shown to be more helpful than a single feature layer [21]. The aggregation is described next.

Fig. 2: Pyramid Feature Aggregation

Figure 2 shows the aggregation of pyramid features. For the input images, the feature extraction network is constructed to extract the aggregated 1/4-scale feature. In the bottom-up pass of the twelve-layer 2D CNN, the stride of layers 3, 6, and 9 is set to 2 to obtain features at four scales. Each convolutional layer is followed by BatchNorm and ReLU. In the top-down pass, each level is obtained by concatenating the upsampled feature from the coarser level with the feature of the same level, using fewer channels. In particular, the 1/2-scale feature must be downsampled before it is aggregated into the final 1/4-scale feature. To reduce the dimension of the final 1/4-scale feature, a convolution is applied after each concatenation. The final feature has 32 channels and aggregates as much contextual information as possible.
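As a rough check on the scales involved, the sketch below tracks the spatial size of the feature maps through a twelve-layer encoder whose layers 3, 6, and 9 use stride 2. The function name `pyramid_sizes` and the assumption that all other layers keep the spatial size unchanged are illustrative and are not taken from the released code.

```python
def pyramid_sizes(width, height, num_layers=12, stride2_layers=(3, 6, 9)):
    """Return the (width, height) of each distinct feature scale, finest first."""
    sizes = [(width, height)]
    for layer in range(1, num_layers + 1):
        if layer in stride2_layers:
            # A stride-2 convolution halves each spatial dimension
            # (assuming padding that keeps the division exact for even sizes).
            width, height = width // 2, height // 2
            sizes.append((width, height))
    return sizes


# Training crops in the paper are 640 x 512.
scales = pyramid_sizes(640, 512)
# scales[2] is the 1/4 scale onto which the other levels are aggregated.
quarter = scales[2]
```

For a 640 × 512 input, the 1/4 scale is 160 × 128, which is the resolution at which the cost volume and the depth map are built.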

III-A2 Cost volume and 3D U-Net regularization

The cost volume is constructed by homography warping under different depth hypotheses [37]. In general, more depth samples, or equivalently smaller depth intervals, lead to better accuracy. Here D = 192 is adopted for comparison with the two previous unsupervised methods. 3D U-Net regularization, an accepted approach in 2D and 3D semantic segmentation, removes noise from the cost volume. We keep the 3D U-Net of MVSNet, which is simple but effective. Finally, the initial depth is computed from the regularized probability volume; in MVSNet [37] this is the soft argmin, i.e. the expectation over the depth hypotheses. Cost volume construction and 3D U-Net regularization occupy most of the memory of the whole network. As an unsupervised method, this paper focuses on the loss function.
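The depth regression step can be illustrated for a single pixel. The sketch below follows the soft argmin used in MVSNet [37]: a softmax over the depth dimension gives a probability per hypothesis, and the depth is the expectation under that distribution. The function names, the endpoint-inclusive spacing of the hypotheses, and the "lower cost is better" sign convention are assumptions made for this illustration.

```python
import math


def depth_hypotheses(d_min, d_max, num):
    """Uniformly spaced depth hypotheses, both endpoints included (assumption)."""
    step = (d_max - d_min) / (num - 1)
    return [d_min + i * step for i in range(num)]


def soft_argmin_depth(costs, depths):
    """Expected depth under a softmax over negated matching costs."""
    lowest = min(costs)
    # Subtract the minimum cost before exponentiating for numerical stability.
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(w / total * d for w, d in zip(weights, depths))


# Depth range and sample count reported for DTU training.
depths = depth_hypotheses(425.0, 935.0, 192)
```

Because the result is an expectation rather than a hard argmin, it is differentiable and can fall between two hypotheses.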

III-A3 Normal-Depth Consistency

The initial depth relies mainly on the probability of feature matches, and textureless regions and occlusion lead to wrong matches. Refining the depth is therefore a key step for improving the estimate. Instead of a refinement network guided by the reference image, M^3VSNet uses normal-depth consistency to refine the initial depth in 3D space [36]. The consistency makes the depth more reasonable and accurate. It is computed in two steps. First, the normal is calculated from the depth using orthogonality. Then the refined depth is inferred from the normal and the initial depth.

Fig. 3: Normal from the depth

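As figure 3 demonstrates, eight neighbors are selected to infer the normal of the central point. Because the normal is orthogonal to the local surface, it can be computed with a cross product. For each central pixel $p_o$, a pair of neighbors is denoted $p_{ox}$ and $p_{oy}$. If the depth $Z$ of each pixel and the camera intrinsics $K$ are known, each pixel $p$ (in homogeneous coordinates) back-projects to the 3D point $P = Z K^{-1} p$, and the normal estimate for this pair is:

$$\tilde{N} = \overrightarrow{P_o P_{ox}} \times \overrightarrow{P_o P_{oy}}$$

To make the final normal estimate $N_o$ more reliable, the cross products over the neighbor pairs formed by the eight neighbors are averaged:

$$N_o = \frac{1}{8} \sum_{i=1}^{8} \tilde{N}_i$$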
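A minimal sketch of the normal-from-depth step for one pixel follows. It back-projects the central pixel and its eight neighbors with the camera intrinsics, takes a cross product for each pair of neighbors, and averages the unit results. Pairing each neighbor with the one a quarter turn further around the pixel, as well as all function names, are assumptions of this illustration rather than details confirmed by the paper.

```python
import math

# Eight neighbors in circular order as (du, dv) pixel offsets.
OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def backproject(u, v, z, intrinsics):
    """Pixel (u, v) at depth z -> 3D point z * K^{-1} [u, v, 1]^T."""
    fx, fy, cx, cy = intrinsics
    return ((u - cx) / fx * z, (v - cy) / fy * z, z)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return tuple(c / norm for c in vec)


def normal_from_depth(depth, u, v, intrinsics):
    """Mean of unit cross products over eight neighbor pairs around (u, v).

    depth is a 2D list indexed as depth[v][u]; (u, v) must not be on the border.
    """
    center = backproject(u, v, depth[v][u], intrinsics)
    total = [0.0, 0.0, 0.0]
    for k in range(8):
        edges = []
        # Pair neighbor k with the neighbor a quarter turn further (assumption).
        for du, dv in (OFFSETS[k], OFFSETS[(k + 2) % 8]):
            point = backproject(u + du, v + dv, depth[v + dv][u + du], intrinsics)
            edges.append(tuple(p - c for p, c in zip(point, center)))
        pair_normal = unit(cross(edges[0], edges[1]))
        total = [t + n for t, n in zip(total, pair_normal)]
    return unit(total)
```

With this pairing every cross product has the same orientation, so the mean does not cancel out; for a fronto-parallel plane the result is the optical axis direction.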

Fig. 4: Depth from the normal

The refined depth can be computed once the normal and the initial depth are available. In figure 4, for each pixel $p_i$, the depth of a neighbor $p_{neighbor}$ is to be refined. Their corresponding 3D points are $P_i$ and $P_{neighbor}$. Let the normal at $P_i$ be $N$, the depth of $P_i$ be $Z_i$, and the depth of $P_{neighbor}$ be $Z_{neighbor}$. Then $N \cdot (P_{neighbor} - P_i) = 0$, where $P_i = Z_i K^{-1} p_i$ and $P_{neighbor} = Z_{neighbor} K^{-1} p_{neighbor}$ for camera intrinsics $K$. This follows from orthogonality and from surface consistency in a small neighborhood. In summary, the depth of a neighbor can be inferred from the depth and normal of the central point.
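The orthogonality constraint can be solved for the neighbor's depth in closed form. Writing each 3D point as its depth times the viewing ray $K^{-1} p$, the constraint gives the neighbor depth as the central depth times the ratio of the two ray projections onto the normal. The sketch below is an illustration under that reading; the function names and the error handling for rays parallel to the surface are assumptions.

```python
def ray(u, v, intrinsics):
    """Viewing ray K^{-1} [u, v, 1]^T for a pinhole camera (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neighbor_depth(z_center, normal, pix_center, pix_neighbor, intrinsics):
    """Depth of the neighbor so that it lies on the plane through the center.

    Solves N . (Z_n * r_n - Z_c * r_c) = 0 for Z_n, where r_c and r_n are
    the viewing rays of the central and neighboring pixels.
    """
    r_c = ray(pix_center[0], pix_center[1], intrinsics)
    r_n = ray(pix_neighbor[0], pix_neighbor[1], intrinsics)
    denom = dot(normal, r_n)
    if abs(denom) < 1e-12:
        raise ValueError("neighbor ray is parallel to the local surface")
    return z_center * dot(normal, r_c) / denom
```

For a surface facing the camera head-on, the neighbor depth equals the central depth; for a tilted surface it grows or shrinks along the tilt direction.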

Eight neighbors are also considered for the refined depth: each neighbor contributes an estimate of the depth of the central point. Because the normal is discontinuous at edges and on irregular surfaces of real objects, a weight computed from the reference image is introduced to make the depth more reasonable. The weight depends on the image gradient between the central pixel and its neighbor: the larger the gradient, the less reliable the refined depth and the smaller the weight. The final refined depth is the weighted sum of the estimates from the eight directions.

The final refined depth is thus the result of regularization in 3D space. The 3D geometric constraint makes the depth more accurate and reasonable.

III-B Multi-metric Loss

Because the method is unsupervised, the design of the loss function is especially important, and the multi-metric loss plays a crucial role in this paper. In addition to the pixel-wise loss, a feature-wise loss is designed to cope with textureless regions and to raise the completeness of the point clouds.

Both the pixel-wise and the feature-wise loss are built on photometric consistency across views [3]. Consider the reference image $I_{ref}$ and a source image $I_{src}$, their intrinsics $K_{ref}$ and $K_{src}$, and the extrinsic transformation $T$ from $I_{ref}$ to $I_{src}$. For a pixel $p_i$ in $I_{ref}$ with estimated depth $D_i$, the corresponding pixel $p'_i$ in $I_{src}$ is, in homogeneous coordinates and up to scale:

$$p'_i \simeq K_{src}\, T\, (D_i\, K_{ref}^{-1}\, p_i)$$

The warped image $I'_{src}$, which covers the area where the reference and source views overlap, is obtained by sampling $I_{src}$ at $p'_i$ with bilinear interpolation.

For occluded areas, the pixel values of $I'_{src}$ are set to zero. A mask $M$ is obtained by marking the pixels of $I_{ref}$ that project outside $I_{src}$ as invalid. Based on these constraints, the multi-metric loss of M^3VSNet combines the pixel-wise and feature-wise losses described below.
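The cross-view correspondence can be sketched for a single pixel: back-project the reference pixel with its depth, move the point into the source camera frame, and project it with the source intrinsics. The mask then marks pixels whose projection leaves the source image. The function names and the representation of the extrinsic as a rotation matrix plus translation vector are assumptions of this illustration.

```python
def warp_pixel(u, v, depth, k_ref, k_src, rotation, translation):
    """Project reference pixel (u, v) with the given depth into the source view.

    k_ref and k_src are (fx, fy, cx, cy); rotation is a 3x3 nested list and
    translation a length-3 sequence mapping reference-frame points to the
    source frame. Returns (u', v'), or None if the point is behind the camera.
    """
    fx, fy, cx, cy = k_ref
    # Back-project: depth * K_ref^{-1} [u, v, 1]^T.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform into the source camera frame.
    moved = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
             for r in range(3)]
    if moved[2] <= 0:
        return None
    fx_s, fy_s, cx_s, cy_s = k_src
    return (fx_s * moved[0] / moved[2] + cx_s, fy_s * moved[1] / moved[2] + cy_s)


def is_valid(projected, width, height):
    """Mask value: True when the projection falls inside the source image."""
    if projected is None:
        return False
    return 0 <= projected[0] <= width - 1 and 0 <= projected[1] <= height - 1
```

In a full implementation the projected coordinates would be computed for every pixel at once and passed to a bilinear sampler; the per-pixel version only makes the geometry explicit.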

III-B1 Pixel-Wise Loss

For the pixel-wise loss, the reference image $I_{ref}$ serves as the target for cross-view consistency against the warped source image $I'_{src}$. The loss has three parts. First, the photometric loss $L_{photo}$, which compares the pixel values of $I_{ref}$ and $I'_{src}$, is the most widely used term. To reduce the influence of lighting changes, the difference of the image gradients at every pixel is also included in $L_{photo}$. The loss is averaged over $m$, the number of valid points according to the mask $M$.

Second, the structural similarity (SSIM) measures the similarity of $I_{ref}$ and $I'_{src}$. SSIM equals 1 when $I'_{src}$ is identical to $I_{ref}$, so the loss $L_{SSIM}$ pushes the two images to be more similar.

Third, the smoothness loss $L_{smooth}$ on the final refined depth map penalizes steep changes in its first-order and second-order derivatives. It is averaged over $n$, the number of points in the reference image.

Finally, the pixel-wise loss is a weighted sum of the three terms:

$$L_{pixel} = \lambda_1 L_{photo} + \lambda_2 L_{SSIM} + \lambda_3 L_{smooth}$$
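The photometric and SSIM terms can be sketched on small grayscale images stored as nested lists. This is a simplified illustration, not the paper's exact loss: the absolute-difference norm, the use of a single SSIM window covering the whole image, the placeholder weights of 1.0, and the omission of the gradient and smoothness terms are all assumptions. The SSIM constants are the standard ones for intensities in [0, 1].

```python
def masked_photometric(ref, warped, mask):
    """Mean absolute intensity difference over pixels where mask is 1."""
    valid = sum(sum(row) for row in mask)  # number of valid pixels
    if valid == 0:
        return 0.0
    total = 0.0
    for ref_row, warped_row, mask_row in zip(ref, warped, mask):
        for r, w, keep in zip(ref_row, warped_row, mask_row):
            total += abs(r - w) * keep
    return total / valid


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equally sized patches given as flat lists in [0, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def pixel_loss(ref, warped, mask, w_photo=1.0, w_ssim=1.0):
    """Weighted sum of the photometric term and (1 - SSIM); weights are placeholders."""
    flat_ref = [p for row in ref for p in row]
    flat_warped = [p for row in warped for p in row]
    return (w_photo * masked_photometric(ref, warped, mask)
            + w_ssim * (1.0 - ssim(flat_ref, flat_warped)))
```

Identical images give an SSIM of 1 and a loss of 0, which is the fixed point the training objective pushes toward.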

III-B2 Feature-Wise Loss

Apart from the pixel-wise loss, the main contribution of M^3VSNet is the feature-wise loss. In textureless areas, pixel matching can be wrong, which leads to low precision. The feature-wise loss mitigates this: by using higher-level information such as semantic features, depth can be learned reasonably well even in some textureless regions.

Fig. 5: Feature-wise extraction from pre-trained VGG16

Because the estimated depth is strongly correlated with the pyramid feature network of section III-A1, the high-level features are extracted from a pre-trained VGG16 instead of the pyramid feature network. The feature-wise loss is constructed by feeding the reference image into the pre-trained VGG16 network, as shown in figure 5. We extract the outputs of layers 8, 15, and 22, which are one half, one quarter, and one eighth of the size of the original input images. Layer 3 outputs features at the same size as the original input images, so using it would effectively repeat the pixel-wise loss.

For every feature map from VGG16, we construct a loss based on the same cross-view consistency. As in section III-B1, the corresponding pixel $p'_i$ in the source view is available from the warping, so the source feature map $F_{src}$ can be warped to the reference view to give a matching feature $F'_{src}$, which is compared with the reference feature $F_{ref}$.

Compared with pixel values, the feature domain has a larger receptive field, so the difficulty of textureless regions is relieved to some extent, and the final estimated depth is driven by the similarity of features as well.

The final feature-wise loss is a weighted sum of the losses at the different feature scales, the first of which corresponds to the layer-8 feature from the pre-trained VGG16.

IV Experiments

This section reports experiments that demonstrate the effectiveness of M^3VSNet. First, we evaluate M^3VSNet on the DTU dataset and give the training and testing details. We then compare it with the existing unsupervised MVS networks. In section IV-C, ablation studies examine the contribution of each proposed module. Finally, we test M^3VSNet on a different dataset, Tanks and Temples, to study the generalization of our model.

IV-A Performance on DTU

The DTU dataset is a multi-view stereo dataset with 124 different scenes, each scanned from 49 positions by a camera mounted on a robotic arm [15][1]. Each scan is captured under seven lighting conditions, with known camera poses. We use the same train-validation-test split as MVSNet [37] and MVS^2 [7]. The test set consists of scenes 1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, and 118.

IV-A1 Implementation Details

M^3VSNet is implemented in PyTorch. Training uses the DTU training set without the ground-truth depth maps; the images are cropped versions of the original pictures with a resolution of 640 × 512. Because of the pyramid feature aggregation, the resolution of the final depth map is 160 × 128. Depth hypotheses are sampled from 425 mm to 935 mm, and the number of depth samples is D = 192. The models are trained with a batch size of 4 on four NVIDIA RTX 2080Ti GPUs in data-parallel mode, each GPU having around 11 GB of available memory. We train with the Adam optimizer for 10 epochs; the learning rate is set to 1e-3 for the first epoch and is multiplied by 0.5 every two epochs. Each loss term is assigned a fixed weight to balance the terms. During each iteration, one reference image and two source images are used. At test time, the resolution of the input images is 1600 × 1200, which needs up to 10.612 GB of GPU memory.
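The learning-rate schedule described above can be written as a simple step decay. The sketch assumes 0-indexed epochs and that the halving happens at the start of epochs 2, 4, 6, and 8; the text does not pin down the exact boundaries, so this is one plausible reading, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.5, step=2):
    """Learning rate for a 0-indexed epoch under step decay."""
    return base_lr * decay ** (epoch // step)


# Ten training epochs, as reported for DTU.
schedule = [learning_rate(epoch) for epoch in range(10)]
```

Under this reading the rate is halved four times over the ten epochs.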

IV-A2 Results on DTU

Fig. 6: Qualitative comparison of 3D reconstruction between M^3VSNet and supervised or unsupervised MVS methods on the DTU dataset. From left to right: (a) ground truth, (b) MVSNet (D=256) [37], (c) only the pixel-wise constraint, which is similar to unsup_mvs [18], and (d) our proposed M^3VSNet.

The official metrics [15] are used to evaluate M^3VSNet on the DTU dataset: accuracy, completeness, and overall. To demonstrate the effectiveness of the model, we compare M^3VSNet against three classic traditional methods, Furu [10], Tola [30], and Colmap [24]; the supervised learning methods SurfaceNet [16] and MVSNet [37] with different numbers of depth samples; and two unsupervised learning methods, Unsup_MVS [18] and MVS^2 [7].

As table I shows, M^3VSNet outperforms two of the traditional methods, Furu and Tola, in overall score and is close to Colmap [24]. As described in MVSNet [37], learning-based methods are about ten times more efficient than traditional methods like Colmap. Because of the GPU memory limit, M^3VSNet uses D = 192 depth samples. In this setting it surpasses supervised MVSNet (D=192) in overall score (0.583 vs. 0.592). Compared with the other unsupervised methods, M^3VSNet achieves the best overall score, making it the SOTA unsupervised network for multi-view stereo reconstruction. Figure 6 shows the difference in the point clouds in more detail: the reconstruction by M^3VSNet has more texture detail. With the aid of the feature-wise loss and pyramid feature aggregation, M^3VSNet recovers more textureless regions, while normal-depth consistency keeps the estimated depth maps accurate in real 3D space.

| Method | Acc. (mm) | Comp. (mm) | Overall (mm) |
| --- | --- | --- | --- |
| Furu [10] | 0.612 | 0.939 | 0.775 |
| Tola [30] | 0.343 | 1.190 | 0.766 |
| Colmap [24] | 0.400 | 0.664 | 0.532 |
| SurfaceNet [16] | 0.450 | 1.043 | 0.746 |
| MVSNet (D=192) | 0.444 | 0.741 | 0.592 |
| MVSNet (D=256) | 0.396 | 0.527 | 0.462 |
| Unsup_MVS [18] | 0.881 | 1.073 | 0.977 |
| MVS^2 [7] | 0.760 | 0.515 | 0.637 |
| M^3VSNet (D=192) | 0.636 | 0.531 | 0.583 |

TABLE I: Quantitative results on DTU's evaluation set, reported as mean distance in mm (lower is better). Three classical MVS methods, two supervised learning-based MVS methods, and two unsupervised methods are listed alongside M^3VSNet.

IV-B Comparison with Unsupervised Methods

Only two unsupervised MVS methods have been published so far. The first is unsup_mvs [18], one of the first attempts in this direction, which performs poorly, with an overall mean distance of 0.977. The other is MVS^2 [7]. Although it reaches 0.637 in overall mean distance, it consumes more GPU memory than unsup_mvs because three cost volumes and their regularization must be constructed, which is unaffordable for the single NVIDIA RTX 2080Ti used for M^3VSNet. In summary, our unsupervised method achieves the best overall mean distance (0.583).

IV-C Ablation Study

This section analyzes the effect of the modules proposed in M^3VSNet. Three sets of comparison experiments examine the roles of pyramid feature aggregation, normal-depth consistency, and the multi-metric loss. Each experiment changes only one variable at a time.

Pyramid Feature Aggregation. This module captures more contextual information across feature layers and is an enhancement over a single feature map. Given the high cost of cost volume construction and 3D U-Net regularization, we use pyramid feature aggregation with only the 1/4 scale. As table II shows, the module decreases both the accuracy and completeness mean distances. Pyramid feature aggregation improves the overall score by about 2% (from 0.596 to 0.583).

| Method | Acc. (mm) | Comp. (mm) | Overall (mm) |
| --- | --- | --- | --- |
| Without Pyramid Feature | 0.638 | 0.554 | 0.596 |
| With Pyramid Feature | 0.636 | 0.531 | 0.583 |

TABLE II: Performance comparison with and without the pyramid feature aggregation module (mean distance in mm, lower is better).
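The numbers in the table can be checked with a short calculation. The tabulated overall scores are consistent, up to rounding, with the mean of accuracy and completeness, and the relative improvement from the pyramid feature module comes to roughly 2%. The helper names below are illustrative.

```python
def overall(acc, comp):
    """Overall score as the mean of accuracy and completeness distances."""
    return (acc + comp) / 2


def relative_gain(before, after):
    """Fractional reduction of a lower-is-better score."""
    return (before - after) / before


without_pyramid = overall(0.638, 0.554)
with_pyramid = overall(0.636, 0.531)
gain = relative_gain(0.596, 0.583)
```

The gain is computed from the tabulated overall scores rather than the recomputed means so that it matches the figures the text quotes.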

Normal-depth consistency. By refining the initial depth map, this module regularizes the depth map in 3D space, which makes the depth more reasonable. Depth error is used to evaluate the quality of the estimated depth before point clouds are reconstructed. We report the percentage of predicted depths within 2 mm, 4 mm, and 8 mm of the ground-truth depth maps (higher is better). As table III shows, normal-depth consistency improves the result at all three thresholds. Furthermore, in the later depth fusion step, the point clouds show that most of the outliers around the object are removed with the help of normal-depth consistency.

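| Method | < 2 mm (%) | < 4 mm (%) | < 8 mm (%) |
| --- | --- | --- | --- |
| Without Normal-depth | 58.8 | 74.8 | 83.8 |
| With Normal-depth | 60.3 | 76.9 | 85.7 |

TABLE III: Performance comparison with and without the normal-depth consistency module, as the percentage of predicted depths within each depth-error threshold (higher is better).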
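The depth-error metric in the table is a thresholded accuracy. A minimal sketch is given below; treating non-positive ground-truth values as invalid pixels and using a strict inequality at the threshold are assumptions, since the text does not specify either.

```python
def percent_within(pred, gt, threshold_mm):
    """Percentage of valid pixels whose absolute depth error is below the threshold.

    pred and gt are flat lists of depths in mm; ground-truth values <= 0
    are treated as missing and skipped (assumption).
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g > 0]
    if not errors:
        return 0.0
    return 100.0 * sum(e < threshold_mm for e in errors) / len(errors)
```

For example, `[percent_within(pred, gt, t) for t in (2, 4, 8)]` would produce one row of the table for a given pair of depth maps.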
Fig. 7: Qualitative comparison of reconstructed 3D point clouds with and without normal-depth consistency: (a) only pixel-wise, (b) no normal-depth, (c) normal-depth.

Figure 7 compares three cases: (a) only the pixel-wise loss, (b) the multi-metric loss without normal-depth consistency, and (c) both the multi-metric loss and normal-depth consistency. Cases (b) and (c) perform better than case (a). Case (b) clearly has more outliers than case (c), which shows that normal-depth consistency makes the depth maps more reasonable, at the cost of a slight deterioration in completeness. The explanation is that the 3D space regularization forces the refined depth maps to follow real-world geometry. Figure 7 and table III together demonstrate the benefit of this module.

Multi-metric loss. Most unsupervised networks for depth estimation, from either monocular video or rectified binocular image pairs, focus on constructing a pixel-wise loss, which matches human intuition. However, feature-wise constraints have proved effective in related work [17][34][4]. We compare the pixel-wise loss alone with different combinations of feature-wise losses; the multi-metric loss brings a big improvement. We also compare different selections of multi-scale features.

As table IV shows, the overall mean distance with only the pixel-wise loss is considerably higher (worse). Every combination with feature-wise losses gives a substantial improvement. We further ablate the scales of the features taken from the pre-trained VGG16. The 1/4-scale feature matches the resolution of the depth map output by the network. The combination of the 1/2, 1/4, and 1/8 features achieves the best overall result. Adding the 1/16 feature improves the accuracy but degrades the completeness, possibly because overly high-level semantic information is too coarse to constrain the estimated depth.

| Method | Acc. (mm) | Comp. (mm) | Overall (mm) |
| --- | --- | --- | --- |
| Only pixel-wise | 0.832 | 0.924 | 0.878 |
| pixel + 1/4 feature | 0.646 | 0.591 | 0.618 |
| pixel + 1/2, 1/4, 1/8 feature | 0.636 | 0.531 | 0.583 |
| pixel + 1/2, 1/4, 1/8, 1/16 feature | 0.566 | 0.653 | 0.609 |

TABLE IV: Performance comparison of the different losses (mean distance in mm, lower is better). The scale 1/2 means that the feature extracted from the pre-trained VGG16 network (layer 8) is half the size of the original reference image. The scales 1/4, 1/8, and 1/16 correspond to layers 15, 22, and 29 of the pre-trained VGG16.

IV-D Generalization Ability on Tanks & Temples

| Method | Mean | Family | Francis | Horse | Lighthouse | M60 | Panther | Playground | Train |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M^3VSNet | 37.67 | 47.74 | 24.38 | 18.74 | 44.42 | 43.45 | 44.95 | 47.39 | 30.31 |
| MVS^2 | 37.21 | 47.74 | 21.55 | 19.50 | 44.54 | 44.86 | 46.32 | 43.48 | 29.72 |

TABLE V: Quantitative comparison of 3D point cloud reconstruction on the Tanks and Temples dataset [19] among all the unsupervised methods, taken from the leaderboard of intermediate T&T.
Fig. 8: Our unsupervised network's performance on the Tanks and Temples dataset [19] without any finetuning. Panels: (a) Family, (b) Francis, (c) Horse, (d) M60, (e) Panther, (f) Playground, (g) Train, (h) Lighthouse.

To evaluate the generalization ability of our unsupervised network, we use the intermediate set of the Tanks and Temples dataset, which contains high-resolution images of outdoor scenes. The M^3VSNet model trained on the DTU dataset is applied without any finetuning. We use the intermediate scenes at a resolution of 1920 × 1056 with 160 depth intervals, because 192 depth intervals run out of memory. Another core hyperparameter is the photometric threshold used in depth fusion: for the same depth maps, different photometric thresholds lead to different accuracy and completeness. For M^3VSNet, the photometric threshold is set to 0.6. As shown in table V, which is taken from the leaderboard of intermediate T&T, M^3VSNet has a better mean score over the 8 scenes than MVS^2 (37.67 vs. 37.21), making it the best unsupervised MVS network as of April 17, 2020. The point clouds in figure 8 are detailed and reasonable for the scenes Family, Francis, Horse, M60, Panther, Playground, Train, and Lighthouse. M^3VSNet can also be applied to the advanced set of T&T, but the reconstruction is sparse because of the GPU memory limit; there is a trade-off between GPU memory consumption and point cloud quality.

V Conclusion

In this paper, we proposed M^3VSNet, an unsupervised network for multi-view stereo reconstruction that achieves state-of-the-art results among unsupervised MVS networks. With pyramid feature aggregation, normal-depth consistency, and the multi-metric loss, M^3VSNet captures contextual and high-level semantic information from the perspective of perception and keeps the estimated depth maps plausible in the real 3D world, giving the best performance on DTU and other MVS datasets among unsupervised networks. In the future, more high-precision MVS datasets are desirable, and domain transfer across datasets can be further improved. As in other areas of computer vision with deep learning, tasks such as semantic segmentation, instance segmentation, and depth completion could be combined with multi-view stereo reconstruction.


  • [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision 120, pp. 153–168. Cited by: §IV-A.
  • [2] I. Alhashim and P. Wonka (2018) High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941. Cited by: §II-D.
  • [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG), pp. 24. Cited by: §II-A, §III-B.
  • [4] W. Benzhang, F. Yiliu, F. Huini, and H. Liu (2018) Unsupervised stereo depth estimation refined by perceptual loss. In 2018 Ubiquitous Positioning, Indoor Navigation and Location-Based Services (UPINLBS), pp. 1–6. Cited by: §I, §IV-C.
  • [5] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla (2008) Using multiple hypotheses to improve depth-maps for multi-view stereo. In European Conference on Computer Vision, pp. 766–779. Cited by: §II-A.
  • [6] H. Cui, X. Gao, S. Shen, and Z. Hu (2017) HSfM: hybrid structure-from-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1212–1221. Cited by: §I.
  • [7] Y. Dai, Z. Zhu, Z. Rao, and B. Li (2019) MVS2: deep unsupervised multi-view stereo with multi-view symmetry. In 2019 International Conference on 3D Vision (3DV), pp. 1–8. Cited by: §I, §I, §II-B, §II-D, §IV-A2, §IV-A, §IV-B, TABLE I.
  • [8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §II-B.
  • [9] J. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, S. Lazebnik, et al. (2010) Building rome on a cloudless day. In European Conference on Computer Vision, pp. 368–381. Cited by: §I.
  • [10] Y. Furukawa and J. Ponce (2007) Accurate, dense, and robust multi-view stereopsis. 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §I, §II-A, §IV-A2, TABLE I.
  • [11] S. Galliani, K. Lasinger, and K. Schindler (2015) Massively parallel multiview stereopsis by surface normal diffusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 873–881. Cited by: §I, §II-A.
  • [12] M. Goesele, B. Curless, and S. M. Seitz (2006) Multi-view stereo revisited. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 2402–2409. Cited by: §I.
  • [13] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan (2019) Cascade cost volume for high-resolution multi-view stereo and stereo matching. arXiv preprint arXiv:1912.06378. Cited by: §I, §II-C.
  • [14] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) Deepmvs: learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830. Cited by: §I.
  • [15] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs (2014) Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 406–413. Cited by: §IV-A2, §IV-A.
  • [16] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang (2017) Surfacenet: an end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2307–2315. Cited by: §I, §II-A, §IV-A2, TABLE I.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §IV-C.
  • [18] T. Khot, S. Agrawal, S. Tulsiani, C. Mertz, S. Lucey, and M. Hebert (2019) Learning unsupervised multi-view stereopsis via robust photometric consistency. arXiv preprint arXiv:1905.02706. Cited by: §II-D, Fig. 6, §IV-A2, §IV-B, TABLE I.
  • [19] A. Knapitsch, J. Park, Q. Zhou, and V. Koltun (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–13. Cited by: Fig. 8, TABLE V.
  • [20] H. Laga (2019) A survey on deep learning architectures for image-based depth reconstruction. arXiv preprint arXiv:1906.06113. Cited by: §II-B.
  • [21] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §III-A1.
  • [22] K. Luo, T. Guan, L. Ju, H. Huang, and Y. Luo (2019) P-mvsnet: learning patch-wise matching confidence aggregation for multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10452–10461. Cited by: §II-C.
  • [23] R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675. Cited by: §I, §II-D.
  • [24] J. L. Schönberger, E. Zheng, J. Frahm, and M. Pollefeys (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pp. 501–518. Cited by: §I, §II-A, §IV-A2, §IV-A2, TABLE I.
  • [25] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), Vol. 1, pp. 519–528. Cited by: §I.
  • [26] T. Shen, L. Zhou, Z. Luo, Y. Yao, S. Li, J. Zhang, T. Fang, and L. Quan (2019) Self-supervised learning of depth and motion under photometric inconsistency. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §I.
  • [27] S. N. Sinha, P. Mordohai, and M. Pollefeys (2007) Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. Cited by: §II-A.
  • [28] N. Snavely, S. M. Seitz, and R. Szeliski (2006) Photo tourism: exploring photo collections in 3d. In ACM Siggraph 2006 Papers, pp. 835–846. Cited by: §I.
  • [29] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §I.
  • [30] E. Tola, C. Strecha, and P. Fua (2011) Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications 23, pp. 903–920. Cited by: §IV-A2, TABLE I.
  • [31] R. Tron, X. Zhou, and K. Daniilidis (2016) A survey on rotation optimization in structure from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 77–85. Cited by: §I.
  • [32] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047. Cited by: §I.
  • [33] H. Vu, P. Labatut, J. Pons, and R. Keriven (2011) High accuracy and visibility-consistent dense multiview stereo. IEEE transactions on pattern analysis and machine intelligence 34 (5), pp. 889–901. Cited by: §I.
  • [34] A. Wang, Z. Fang, Y. Gao, X. Jiang, and S. Ma (2018) Depth estimation of video sequences with perceptual losses. IEEE Access 6, pp. 30536–30546. Cited by: §IV-C.
  • [35] M. Weber, A. Blake, and R. Cipolla (2002) Towards a complete dense geometric and photometric reconstruction under varying pose and illumination. In BMVC, pp. 1–10. Cited by: §I.
  • [36] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia (2018) Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §I, §I, §III-A3.
  • [37] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §I, §I, §I, §II-B, §II-C, §III-A1, §III-A2, §III, Fig. 6, §IV-A2, §IV-A2, §IV-A.
  • [38] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5534. Cited by: §I, §I, §II-C.
  • [39] H. Yi, Z. Wei, M. Ding, R. Zhang, Y. Chen, G. Wang, and Y. Tai (2019) Pyramid multi-view stereo net with self-adaptive view aggregation. arXiv preprint arXiv:1912.03001. Cited by: §II-C.
  • [40] Z. Yu and S. Gao (2020) Fast-mvsnet: sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. arXiv preprint arXiv:2003.13017. Cited by: §I, §II-C.
  • [41] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §II-B.