BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks
While deep learning has recently achieved great success on multi-view stereo (MVS), limited training data makes it hard for trained models to generalize to unseen scenarios. Compared with other computer vision tasks, it is rather difficult to collect a large-scale MVS dataset, as it requires expensive active scanners and a labor-intensive process to obtain ground truth 3D structures. In this paper, we introduce BlendedMVS, a novel large-scale dataset, to provide sufficient training ground truth for learning-based MVS. To create the dataset, we apply a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, we render these mesh models to color images and depth maps. The rendered color images are further blended with the input images to generate photo-realistic blended images as the training input. Our dataset contains over 17k high-resolution images covering a variety of scenes, including cities, architectures, sculptures and small objects. Extensive experiments demonstrate that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets. The entire dataset with pretrained models will be made publicly available at https://github.com/YoYo000/BlendedMVS.
Multi-view stereo (MVS) reconstructs the dense representation of a scene from multi-view images and the corresponding camera parameters. While the problem was previously addressed by classical methods, recent studies [30, 31, 10] show that learning-based approaches are also able to produce results comparable to or even better than classical state-of-the-art methods. Conceptually, learning-based approaches implicitly take into account global semantics such as specularity, reflection and lighting information during the reconstruction, which is beneficial for reconstructions of textureless and non-Lambertian areas. On the small-object DTU dataset, the best overall quality has been largely improved by recent learning-based approaches [30, 31, 4, 13].
By contrast, the leaderboards of the Tanks and Temples and ETH3D benchmarks are still dominated by classical MVS methods. In fact, current learning-based methods are all trained on the DTU dataset, which consists of small objects captured with a fixed camera trajectory. As a result, the trained models do not generalize well to other scenes. Moreover, previous MVS benchmarks [24, 26, 1, 14, 23] mainly focus on point cloud evaluation rather than network training. Compared with other computer vision tasks (e.g., classification and stereo), the training data for MVS reconstruction is rather limited, and a new dataset is needed to provide sufficient training ground truth for learning-based MVS.
In this paper, we introduce BlendedMVS, a large-scale synthetic dataset for multi-view stereo training. Instead of using expensive active scanners to obtain ground truth point clouds, we propose to generate training images and depth maps by rendering textured 3D models to different viewpoints. Textured meshes are first reconstructed from images of different scenes, which are then rendered into color images and depth maps. We further apply the high-pass filter on the rendered image and the low-pass filter on the input image, and then blend the two images into our training input. The resulting images inherit detailed visual cues (high frequency signals) from rendered color images, which makes them consistently align with rendered depth maps. At the same time, the blended images still largely preserve the realistic ambient lighting information (low frequency signals) from input images, which helps the trained model better generalize to real-world scenarios.
Our dataset contains 113 well-selected and reconstructed 3D models. These textured models cover a variety of different scenes, including cities, architectures, sculptures and small objects. Each scene contains 20 to 1,000 input images, and there are more than 17,000 images in total. We train the recent MVSNet and R-MVSNet on several MVS datasets. Extensive experiments on different validation sets demonstrate that models trained on BlendedMVS achieve significantly better generalization ability compared with models trained on other MVS datasets.
Our main contributions can be summarized as:
We propose a low-cost data generation pipeline with a novel fusion approach to automatically generate training ground truth for learning-based MVS.
We establish the large-scale BlendedMVS dataset. All models in the dataset are well selected and cover a variety of diversified reconstruction scenarios.
We report on several benchmarks that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets.
Learning-based approaches for MVS reconstruction have recently shown great potential. Learned multi-patch similarity first applies deep neural networks to MVS cost metric learning. SurfaceNet and DeepMVS unproject images to the 3D voxel space, and use 3D CNNs to classify whether a voxel belongs to the object surface. LSM and RayNet encode the camera projection in the network, and utilize 3D CNNs or a Markov Random Field to predict the surface label. To overcome the precision deficiency of the volume representation, MVSNet applies differentiable homographies to build the cost volume upon the camera frustum. The network applies 3D CNNs for the cost volume regularization and regresses the per-view depth map as output. The follow-up R-MVSNet is designed for high-resolution MVS: it replaces the memory-consuming 3D CNNs with recurrent regularization and significantly reduces the peak memory size. More recently, Point-MVSNet presents a point-based depth map refinement network, while MVS-CRF introduces the conditional random field for depth map refinement.
Middlebury MVS is the earliest dataset for MVS evaluation. It contains two indoor objects with low-resolution (640 × 480) images and calibrated cameras. Later, the EPFL benchmark captures ground truth models of building facades and provides high-resolution images (6.2 MP) and ground truth point clouds for MVS evaluation. To evaluate algorithms under different lighting conditions, the DTU dataset captures images and point clouds for more than 100 indoor objects with a fixed camera trajectory. The point clouds are further triangulated into mesh models and rendered to different viewpoints to generate ground truth depth maps. Current learning-based MVS networks [30, 31, 4, 13] usually use the DTU dataset as their training data. The recent Tanks and Temples benchmark captures indoor and outdoor scenes using high-speed video cameras; however, its training set only contains 7 scenes with ground truth point clouds. The ETH3D benchmark contains one low-resolution set and one high-resolution set. But similar to Tanks and Temples, ETH3D only provides a small number of ground truth scans for network training. The available training data in these datasets is rather limited, and a larger-scale dataset is required to further exploit the potential of learning-based MVS. In contrast, the proposed dataset provides more than 17,000 images with ground truth depth maps, which cover a variety of diversified scenes and can greatly improve the generalization ability of the trained model.
Generating synthetic datasets for training is a common practice in many computer vision tasks, as a large amount of ground truth can be generated at very low cost. Thanks to recent advances in computer graphics, the rendering effect becomes increasingly photo-realistic, making the usage of synthetic datasets more plausible. For example, synthetic rendered images are used in stereo matching [3, 16, 32], optical flow [3, 16, 5], object detection [6, 27] and semantic segmentation [6, 19, 5, 20, 25]. Similar to these datasets, we consider incorporating the lighting effects in rendering synthetic datasets for 3D reconstruction. However, since it is difficult to generate correct material properties in different parts of the model, we resort to a blending approach with original images to recover the lighting effects.
The proposed data generation pipeline is shown in Fig. 1. We first apply a full 3D reconstruction pipeline to produce the 3D textured mesh from input images (Sec. 3.1). Next, the mesh is rendered to each camera view point to obtain the rendered image and the corresponding depth map. The final training image input is generated by blending the rendered image and input image in our proposed manner (Sec. 3.2).
The first step in building a synthetic MVS dataset is generating sufficient high-quality textured mesh models. Given input images, we use the Altizure online platform for the textured mesh reconstruction. Altizure is one of the best 3D photogrammetry software packages on the market, but it could also be replaced with other comparable software. The software performs the full 3D reconstruction pipeline and returns the textured mesh and camera poses as the final output.
With the textured mesh model and camera positions of all input images, we then render the mesh model to each camera view point to generate the rendered images and rendered depth maps. One example is shown in Fig. 1. The rendered depth maps will be used as the ground truth depth maps during training.
Intuitively, rendered images and depth maps can be directly used for network training. However, one potential problem is that rendered images do not contain view-dependent lighting. In fact, a desirable training sample for a multi-view stereo network should satisfy:
Images and depth maps should be consistently aligned. The training sample should provide reliable mappings from input images to ground truth depth maps.
Images should reflect view-dependent lighting. Realistic ambient lighting could strengthen the model's generalization ability to real-world scenarios.
To introduce lighting to rendered images, one solution is to manually assign mesh materials and set up lighting sources during the rendering process. However, this is extremely labor-intensive, which makes it rather difficult to build a large-scale dataset.
On the other hand, the original input images already contain the natural lighting information. The lighting could be automatically overlaid onto rendered images if we can directly extract such information from input images. Specifically, we notice that ambient lighting is mostly a low-frequency signal in images, while visual cues for establishing multi-view dense correspondences (e.g., rich textures) are mostly high-frequency signals. Following this observation, we propose to extract visual cues from the rendered image $I_r$ using a high-pass filter $H$, and to extract the view-dependent lighting from the input image $I$ using a low-pass filter $L$. The visual cues and lighting are fused to generate the blended image $I_b$ (Fig. 2):
$$I_b = I_r * H + I * L = \mathcal{F}^{-1}\big(\mathcal{F}(I_r) \cdot F_H\big) + \mathcal{F}^{-1}\big(\mathcal{F}(I) \cdot F_L\big)$$
where '$*$' denotes the convolution operation, '$\cdot$' the element-wise multiplication, and $\mathcal{F}$ the Fourier transform. $F_L$ and $F_H$ are approximated by 2D Gaussian low-pass and high-pass filters:
$$F_L(u, v) = 1 - F_H(u, v) = \exp\left(-\frac{u^2 + v^2}{2\sigma_f^2}\right)$$
The Gaussian kernel factor $\sigma_f$ is set empirically in our experiments. The blended image inherits detailed visual cues from the rendered image, while at the same time largely preserving the realistic environmental lighting from the input image. Fig. 3 illustrates the differences between these three images. We will demonstrate in Sec. 5.2 that models trained with blended images generalize better to different scenes.
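The blending step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released implementation: the function names, the single-channel input layout and the default `sigma` value are assumptions made for the example, and the Gaussian mask is built in the shifted frequency domain so that the low-pass and high-pass masks sum to one.

```python
import numpy as np

def gaussian_lowpass(shape, sigma):
    """Gaussian low-pass mask in the (fftshift-centred) frequency domain."""
    h, w = shape
    v = np.arange(h) - h // 2          # vertical frequency index, 0 at centre
    u = np.arange(w) - w // 2          # horizontal frequency index, 0 at centre
    uu, vv = np.meshgrid(u, v)
    return np.exp(-(uu ** 2 + vv ** 2) / (2.0 * sigma ** 2))

def blend(rendered, original, sigma=2.0):
    """Keep high frequencies of `rendered` and low frequencies of `original`.

    Both inputs are float arrays of shape (H, W); `sigma` is a placeholder
    value chosen for this sketch, not the value used by the authors.
    """
    low = gaussian_lowpass(rendered.shape, sigma)
    high = 1.0 - low                   # complementary high-pass mask
    f_r = np.fft.fftshift(np.fft.fft2(rendered))
    f_i = np.fft.fftshift(np.fft.fft2(original))
    fused = f_r * high + f_i * low     # element-wise product in frequency domain
    return np.fft.ifft2(np.fft.ifftshift(fused)).real
```

For a color image, the same function could be applied to each channel independently. Because the two masks are complementary, blending an image with itself returns the image unchanged, which is a convenient sanity check.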
For the content of the proposed dataset, we manually select 113 well-reconstructed models publicly available on the Altizure.com online platform. These models cover a variety of different scenes, including architectures, street-views, sculptures and small objects. Each scene contains 20 to 1,000 input images, and there are 17,818 images in total in the whole dataset. It is also noteworthy that, unlike the DTU dataset where all scenes are captured by a fixed robot arm, scenes in BlendedMVS contain a variety of different camera trajectories. The unstructured camera trajectories can better model different image capturing styles, and can make the network more generalizable to real-world reconstructions. Fig. 4 shows 7 scenes in the BlendedMVS dataset with camera positions.
The dataset also provides training images and ground truth depth maps at a unified image resolution. As input images usually have different resolutions, we first resize all blended images and rendered depth maps to the minimum image size that is at least as large as the target resolution in both height and width. Then, we crop image patches of the target size from the resized image centers to build training samples for the BlendedMVS dataset. The corresponding camera parameters are changed accordingly. Also, the depth range is provided for each image, as this information is usually required by depth map estimation algorithms.
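The resize-and-crop step and the matching camera update can be illustrated as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the target size is passed in as a parameter rather than taken from the paper, and the intrinsic matrix is assumed to be a standard 3 × 3 pinhole matrix with the principal point in its last column.

```python
import numpy as np

def resize_crop_params(h, w, target_h, target_w):
    """Smallest uniform resize that covers the target, plus centre-crop offsets."""
    scale = max(target_h / h, target_w / w)
    new_h = max(target_h, round(h * scale))
    new_w = max(target_w, round(w * scale))
    top = (new_h - target_h) // 2
    left = (new_w - target_w) // 2
    return scale, new_h, new_w, top, left

def adjust_intrinsics(K, h, w, new_h, new_w, top, left):
    """Update a 3x3 pinhole intrinsic matrix for the resize and centre crop."""
    K = np.asarray(K, dtype=float).copy()
    K[0, :] *= new_w / w     # scale fx, skew and cx
    K[1, :] *= new_h / h     # scale fy and cy
    K[0, 2] -= left          # the crop shifts the principal point
    K[1, 2] -= top
    return K
```

The extrinsic parameters are unaffected by resizing and cropping, so only the intrinsics need to change in this sketch.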
We also augment the training data during the training process. The following photometric augmentations are considered in our training: 1) Random brightness: we change the brightness of each image by adding a random value drawn from a bounded range, and then clip the image intensities to the standard range. 2) Random contrast: we change the contrast of each image with a random contrast factor drawn from a bounded range, and then clip the image to the standard range. 3) Random motion blur: we add Gaussian motion blur to each input image, with a random motion direction and a motion kernel size randomly chosen between two candidate sizes. The above augmentations are applied to each training image in a random order. In the ablation study (Sec. 5.2), we will demonstrate the improvement brought by the online augmentation.
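The three augmentations could be sketched as below. All numeric ranges here (brightness delta, contrast factor, kernel sizes, the [0, 255] intensity range) are placeholder assumptions for illustration, not the values used in the paper, and the motion blur is simplified to a 1D Gaussian kernel along either the horizontal or the vertical axis rather than an arbitrary direction.

```python
import numpy as np

def random_brightness(img, rng, max_delta=50.0):
    """Add a random offset, then clip to the assumed [0, 255] range."""
    return np.clip(img + rng.uniform(-max_delta, max_delta), 0.0, 255.0)

def random_contrast(img, rng, low=0.5, high=1.5):
    """Scale deviations from the mean by a random factor, then clip."""
    factor = rng.uniform(low, high)
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 255.0)

def gaussian_kernel_1d(size):
    """Normalized 1D Gaussian kernel of the given odd size."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / (size / 3.0)) ** 2)
    return k / k.sum()

def random_motion_blur(img, rng, sizes=(3, 5)):
    """Blur along a random axis (simplified: horizontal or vertical only)."""
    kernel = gaussian_kernel_1d(int(rng.choice(sizes)))
    axis = int(rng.integers(0, 2))
    blur = lambda line: np.convolve(line, kernel, mode="same")
    return np.apply_along_axis(blur, axis, img)

def augment(img, rng):
    """Apply the three augmentations to a (H, W) float image in random order."""
    ops = [random_brightness, random_contrast, random_motion_blur]
    for i in rng.permutation(len(ops)):
        img = ops[i](img, rng)
    return img
```

Passing an explicit `numpy.random.Generator` keeps the augmentation reproducible when a fixed seed is used.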
MVSNet is an end-to-end deep learning architecture for depth map estimation from multiple images. Given a reference image and several source images, MVSNet first extracts deep image features for all images through a 5-layer 2D network. Next, image features are warped into the reference camera frustum through differentiable homographies to build the feature volumes in 3D space. The network applies a variance-based cost metric to build the cost volume from N feature volumes, and applies a multi-scale 3D convolutional network for the cost volume regularization. The initial depth map is regressed from the volume through the soft argmin operation. MVSNet further refines the depth map output through a small refinement network. The network is trained with the standard L1 loss function.
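Two of the operations mentioned above, the variance-based cost metric and the soft argmin depth regression, are simple enough to sketch directly. The following NumPy snippet is an illustration only: the array layouts and function names are assumptions for this example, and the 3D regularization network between the two steps is omitted.

```python
import numpy as np

def variance_cost(feature_volumes):
    """Variance across views.

    feature_volumes: array of shape (N, C, D, H, W), one volume per view.
    Returns a cost volume of shape (C, D, H, W).
    """
    mean = feature_volumes.mean(axis=0)
    return ((feature_volumes - mean) ** 2).mean(axis=0)

def soft_argmin(cost, depth_values):
    """Expected depth under a softmax over negated costs.

    cost: regularized cost volume of shape (D, H, W).
    depth_values: depth hypotheses of shape (D,).
    Returns a depth map of shape (H, W).
    """
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return np.tensordot(depth_values, prob, axes=(0, 0))
```

Identical features across views give zero variance, which is why a low cost indicates a likely surface at that depth hypothesis; the soft argmin then turns the cost distribution into a sub-interval depth estimate.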
The recurrent MVSNet (R-MVSNet) is an extended version of MVSNet for high-resolution MVS reconstruction. Instead of regularizing the whole 3D cost volume with 3D CNNs at once, R-MVSNet applies a recurrent neural network to sequentially regularize the 2D cost maps along the depth direction. At each step the network only takes one cost map into account, and thus the memory consumption is reduced from cubic to quadratic in the model resolution. Meanwhile, R-MVSNet treats depth map estimation as a classification problem and applies the cross-entropy loss during training.
In our experiments, we make several modifications to the original versions of MVSNet and R-MVSNet: (1) We replace the 5-layer feature extraction network with a 2D U-Net to enlarge the receptive field in the image feature extraction stage. (2) MVSNet and R-MVSNet are memory consuming and can only be trained with a small batch size. Thus, we replace batch normalization (BN) with group normalization (GN) with a fixed group channel size of 8 to improve the network performance. (3) We remove the refinement network in MVSNet, as this part only brings limited performance gain to the network. (4) We remove the variational depth map refinement step in R-MVSNet to prevent the non-learning component from affecting the training dataset evaluation.
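To make modification (2) concrete, the sketch below shows group normalization with a fixed number of channels per group. It is a plain NumPy illustration under assumptions: the function name and the (B, C, H, W) layout are chosen for this example, and the learnable scale and shift parameters of a real GN layer are left out.

```python
import numpy as np

def group_norm(x, channels_per_group=8, eps=1e-5):
    """Normalize each group of channels independently for every sample.

    x: array of shape (B, C, H, W); C must be divisible by channels_per_group.
    """
    b, c, h, w = x.shape
    assert c % channels_per_group == 0, "C must be a multiple of the group size"
    groups = c // channels_per_group
    xg = x.reshape(b, groups, channels_per_group, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(b, c, h, w)
```

Since the statistics are computed per sample, the output for one sample does not depend on the other samples in the batch, which is the property that makes this kind of normalization attractive when memory limits the batch size.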
In order to demonstrate the capacity of the BlendedMVS dataset, we compare models trained on 1) the DTU training set, 2) the ETH3D low-res training set, 3) the MegaDepth dataset, and 4) the BlendedMVS training set. All models are trained for 160k iterations and we evaluate the models on the DTU, ETH3D and BlendedMVS validation sets. Three metrics are considered in our experiments: 1) the end point error (EPE), which is the average loss between the inferred depth map and the ground truth depth map; 2) the 1-pixel error, which is the ratio of pixels with error larger than 1 depth-wise pixel; and 3) the 3-pixel error, defined in the same way with a threshold of 3 depth-wise pixels. Quantitative and qualitative results are shown in Fig. 5 and Fig. 6 respectively.
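The three validation metrics can be computed with a few lines of NumPy. The sketch below makes two assumptions that are not stated explicitly in the text: the EPE is taken as the mean absolute depth difference, and one "depth-wise pixel" is interpreted as one depth sampling interval (`depth_interval`). Pixels with non-positive ground truth depth are treated as invalid by default.

```python
import numpy as np

def depth_metrics(pred, gt, depth_interval, valid=None):
    """Return (EPE, 1-pixel error ratio, 3-pixel error ratio).

    pred, gt: depth maps of the same shape.
    depth_interval: assumed size of one depth-wise pixel.
    valid: optional boolean mask; defaults to gt > 0.
    """
    if valid is None:
        valid = gt > 0
    err = np.abs(pred - gt)[valid]
    epe = err.mean()
    e1 = (err > 1.0 * depth_interval).mean()
    e3 = (err > 3.0 * depth_interval).mean()
    return epe, e1, e3
```

Lower is better for all three values; the two ratio metrics are less sensitive to a few large outliers than the EPE.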
Trained on DTU. As suggested by previous methods [30, 10, 31], the DTU dataset is divided into training, validation and evaluation sets. We train both MVSNet and R-MVSNet with a fixed input sample size and a fixed depth range.
Both models trained on DTU perform very well on the DTU validation set, but produce high validation errors on the BlendedMVS and ETH3D datasets. In fact, the two models overfit to small-scale indoor scenes, showing the importance of having rich object categories in MVS training data.
Trained on ETH3D. The ETH3D training set contains 5 scenes. To separate training and validation, we take delivery_area, electro and forest as our training scenes, and playground and terrains as our validation scenes. The training sample size is fixed. The per-view depth range is determined by the sparse point cloud provided by the dataset.
From Fig. 5 we find that validation errors of models trained on ETH3D are high on all validation sets, including its own, indicating that the ETH3D training set does not provide sufficient data to train the networks.
Trained on MegaDepth. The MegaDepth dataset was originally built for single-view depth map estimation; it applies multi-view depth map estimation to generate the depth training data. The dataset provides image-depth map training pairs and SfM output files from COLMAP. To apply MegaDepth to MVS training, we apply the view selection and the depth range estimation [30, 31] to generate training files in the MVSNet format. Also, as reconstructed depth maps of crowdsourced images are usually incomplete, we only use training samples whose reference depth map contains a sufficient fraction of valid pixels. There are 39k MVS training samples in the MegaDepth dataset after the proposed pre-processing. The training input size is fixed by applying the resize-and-crop strategy described in Sec. 4.1.
Although MegaDepth contains more training samples than BlendedMVS, models trained on MegaDepth are still inferior to models trained on BlendedMVS (Fig. 5). We believe there are two major problems with applying MegaDepth to MVS training: 1) The ground truth depth map is generated through MVS reconstructions. In this case, input images and reconstructed depth maps are not consistently aligned, and the network will tend to overfit to the chosen algorithm. 2) MegaDepth is built upon crowdsourced internet photos. The crowdsourced images are not well captured, and the training data quality could have a significant influence on the training result.
Trained on BlendedMVS. To train MVS networks with BlendedMVS, we resize all training samples to a fixed size and use a fixed depth sample number. Our dataset is also divided into 106 training scenes and 7 validation scenes to evaluate the network training.
As shown in Fig. 5, models trained on BlendedMVS generalize well to both DTU and ETH3D scenes. Both MVSNet and R-MVSNet achieve the best validation results on the BlendedMVS and ETH3D validation sets, and achieve the second best result (very close to the best) on the DTU validation set, showing the strong generalization ability brought by the proposed dataset.
We also compare point cloud reconstructions of models trained on DTU, ETH3D and BlendedMVS on the Tanks and Temples training set. As the dataset contains wide-depth-range scenes that cannot be handled by MVSNet, we only test R-MVSNet (trained for 150k iterations) in this experiment. We follow the methods described in the R-MVSNet paper to recover camera parameters of input images, and then perform the per-view source image selection and depth range estimation based on the sparse point cloud. For post-processing, we also follow previous works [30, 31] and apply the visibility-based depth map fusion, the average depth map fusion and the visibility depth map filter to generate the 3D point cloud.
The dataset reports three evaluation metrics, namely precision (accuracy), recall (completeness) and the overall f_score [14, 23], to quantitatively measure the reconstruction quality. As shown in Table 1, R-MVSNet trained on DTU and on MegaDepth achieve similar f_score performance, while R-MVSNet trained on the proposed dataset significantly outperforms models trained on the other three datasets for all scenes. The average f_score is improved simply by replacing the training data from DTU with BlendedMVS.
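For readers unfamiliar with these point cloud metrics, the sketch below shows how precision, recall and their harmonic mean (the f_score) can be computed for a distance threshold `tau`. It is a brute-force illustration for small point sets only; the function names are hypothetical, and the official benchmark toolkit additionally handles alignment, cropping and resampling, which are not reproduced here.

```python
import numpy as np

def precision_recall(recon, gt, tau):
    """Fraction of points within distance tau of the other point set.

    recon: reconstructed points, shape (M, 3).
    gt: ground truth points, shape (K, 3).
    """
    dist = np.linalg.norm(recon[:, None, :] - gt[None, :, :], axis=2)
    precision = (dist.min(axis=1) < tau).mean()   # recon -> gt
    recall = (dist.min(axis=0) < tau).mean()      # gt -> recon
    return precision, recall

def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a reconstruction needs to be both accurate and complete to score well.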
| Networks | Training Images | EPE | <1 Px. Err | <3 Px. Err |
Next, we study the differences between using 1) input images, 2) rendered images and 3) blended images as our training images. For these three settings, we also study the effectiveness of the online photometric augmentation. All models are trained for 150k iterations and are validated on the DTU validation set. Comparison results are shown in Table 2.
Environmental Lightings. The proposed setting of blended images with photometric augmentation produces the best result, while the setting of rendered images alone produces the worst result among all. Also, all image types with photometric augmentation result in lower validation errors than without, showing that view-dependent lighting is indeed important for MVS network training.
Training with Input Images. It is noteworthy that while input images are not completely consistent with rendered depth maps, training with input images (with or without the augmentation) also produces satisfying results. The reason might be that 3D structures have been correctly recovered for most of the scenes in BlendedMVS, as all scenes are well selected in advance. In this case, rendered depth maps can be regarded as semi ground truth given input images, which could be jointly used for MVS network training.
Imperfect Reconstruction. One concern about using the reconstructed model for MVS training is whether defects or imperfect reconstructions in textured models would affect the training process. If we used input images as training inputs, images and depth maps might not be consistently aligned, which could potentially cause problems for the network training. However, the blended images we use inherit detailed visual cues from rendered images, which are always consistent with rendered depth maps even if defects occur. In this case, defects in reconstructed models will not deteriorate the network training.
For the same reason, we could replace the Altizure online platform with any other 3D reconstruction pipeline to recover the mesh model. What we have presented is a low-cost MVS training data generation pipeline that does not rely on any particular textured model reconstruction method.
Occlusion and Normal Information. While current learning-based approaches [30, 31, 4, 13] do not take into account pixel-wise occlusion and normal information, our dataset provides such ground truth information as well. The occlusion and normal information could be useful for future visibility-aware and patch-based MVS networks.
Privacy. Another advantage of using blended images is that it could help preserve data privacy. For example, pedestrians in input images are usually dynamic, and will therefore not be reconstructed in the textured model and rendered images (first row in Fig. 7). Furthermore, if pedestrians appear in front of the reconstructed object, our image blending process will only extract blurred human shapes from the input image, which helps conceal user identities in the blended image (second row in Fig. 7).
We have presented the large-scale BlendedMVS dataset for learning-based MVS reconstruction. The proposed dataset provides more than 17k high-quality training samples which cover a variety of different scenes for multi-view depth estimation. To build the dataset, we have reconstructed textured meshes from input images, and have rendered these models into color images and depth maps. The rendered color image has been further blended with the input image to generate the training image input. We have trained recent MVS networks using BlendedMVS and other MVS datasets. Both quantitative and qualitative results have demonstrated that models trained on BlendedMVS achieve significantly better generalization ability than models trained on other datasets.
A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV).
Computer Vision and Pattern Recognition.
International Conference on Machine Learning (ICML).
We visualize two imperfect mesh reconstructions during the data generation process in Fig. 8. The two yellow regions are a reflective water pond and a thin key ring respectively. We failed to reconstruct these two regions in the mesh models, and the resulting blended images and rendered depth maps are also incomplete. However, as discussed in Sec. 5.3, such reconstruction defects will not affect the network training, because we do not use the original images as training inputs, and the blended images we use remain consistently aligned with the rendered depth maps.
In this section, we further illustrate the difference between 1) input images, 2) blended images, and 3) rendered images. The corresponding rendered depth maps are also jointly visualized in Fig. 1. The blended image has similar lightings to the input image, and also inherits detailed visual cues from the rendered image.
In this section, we list all textured models used in BlendedMVS dataset. We manually crop the boundary areas of some scenes for better visualization. Models in our dataset can be roughly categorized into 1) large-scale scenes in Fig. 2, 2) small-scale objects in Fig. 3, and 3) high-quality sculptures in Fig. 4. Also, the 7 validation scenes are shown in Fig. 5.