Multi-view stereo (MVS) reconstructs a dense representation of a scene from multi-view images and the corresponding camera parameters. While the problem was previously addressed by classical methods, recent studies [30, 31, 10] show that learning-based approaches can produce results comparable to, or even better than, the classical state of the art. Conceptually, learning-based approaches implicitly take into account global semantics such as specularity, reflection and lighting information during reconstruction, which benefits the reconstruction of textureless and non-Lambertian areas. On the small-object DTU dataset, the best overall quality has been largely improved by recent learning-based approaches [30, 31, 4, 13].
By contrast, the leaderboards of the Tanks and Temples and ETH3D benchmarks are still dominated by classical MVS methods. In fact, current learning-based methods are all trained on the DTU dataset, which consists of small objects captured with a fixed camera trajectory. As a result, the trained models do not generalize well to other scenes. Moreover, previous MVS benchmarks [24, 26, 1, 14, 23] mainly focus on point cloud evaluation rather than network training. Compared with other computer vision tasks (e.g., classification and stereo), the training data for MVS reconstruction is rather limited, and a new dataset is needed to provide sufficient training ground truth for learning-based MVS.
In this paper, we introduce BlendedMVS, a large-scale synthetic dataset for multi-view stereo training. Instead of using expensive active scanners to obtain ground truth point clouds, we propose to generate training images and depth maps by rendering textured 3D models to different viewpoints. Textured meshes are first reconstructed from images of different scenes and are then rendered into color images and depth maps. We further apply a high-pass filter to the rendered image and a low-pass filter to the input image, and then blend the two filtered images into our training input. The resulting images inherit detailed visual cues (high-frequency signals) from the rendered color images, which keeps them consistently aligned with the rendered depth maps. At the same time, the blended images still largely preserve the realistic ambient lighting information (low-frequency signals) from the input images, which helps the trained model generalize better to real-world scenarios.
Our dataset contains 113 well-selected and reconstructed 3D models. These textured models cover a variety of scenes, including cities, architectures, sculptures and small objects. Each scene contains 20 to 1,000 input images, and there are more than 17,000 images in total. We train the recent MVSNet and R-MVSNet on several MVS datasets. Extensive experiments on different validation sets demonstrate that models trained on BlendedMVS generalize significantly better than models trained on other MVS datasets.
Our main contributions can be summarized as:
We propose a low-cost data generation pipeline with a novel fusion approach to automatically generate training ground truth for learning-based MVS.
We establish the large-scale BlendedMVS dataset. All models in the dataset are well selected and cover a variety of diversified reconstruction scenarios.
We report on several benchmarks that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets.
2 Related Works
2.1 Learning-based MVS
Learning-based approaches for MVS reconstruction have recently shown great potential. Learned multi-patch similarity first applies deep neural networks to MVS cost metric learning. SurfaceNet and DeepMVS unproject images to the 3D voxel space and use 3D CNNs to classify whether a voxel belongs to the object surface. LSM and RayNet encode the camera projection into the network and use 3D CNNs or a Markov Random Field to predict the surface label. To overcome the precision deficiency of the volumetric representation, MVSNet applies differentiable homographies to build the cost volume upon the camera frustum. The network applies 3D CNNs for cost volume regularization and regresses the per-view depth map as output. The follow-up R-MVSNet is designed for high-resolution MVS: it replaces the memory-consuming 3D CNNs with recurrent regularization and significantly reduces the peak memory size. More recently, Point-MVSNet presents a point-based depth map refinement network, while MVS-CRF introduces a conditional random field for depth map refinement.
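As a rough illustration of the homography warping mentioned above, the following sketch computes the plane-induced homography for a fronto-parallel plane at a given depth in the reference camera. The function names and the pose convention (a point `X_ref` in the reference camera frame maps to `R @ X_ref + t` in the source camera frame) are assumptions made for this example; it is not the formulation used in any of the cited networks.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, depth):
    """Homography taking reference pixels to source pixels for the
    fronto-parallel plane z = depth in the reference camera frame.

    Assumed convention: X_src = R @ X_ref + t.
    """
    n = np.array([0.0, 0.0, 1.0])  # plane normal along the reference principal axis
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)

def warp_pixel(H, u, v):
    """Apply a 3x3 homography to one pixel and dehomogenize."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]
```

Sweeping `depth` over a set of hypotheses and warping the source features with each homography yields one feature volume per source view.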
2.2 MVS Datasets
Middlebury MVS is the earliest dataset for MVS evaluation. It contains two indoor objects with low-resolution (640 × 480) images and calibrated cameras. Later, the EPFL benchmark captures ground truth models of building facades and provides high-resolution images (6.2 MP) and ground truth point clouds for MVS evaluation. To evaluate algorithms under different lighting conditions, the DTU dataset captures images and point clouds for more than 100 indoor objects with a fixed camera trajectory. The point clouds are further triangulated into mesh models and rendered to different viewpoints to generate ground truth depth maps. Current learning-based MVS networks [30, 31, 4, 13] usually use the DTU dataset as their training data. The recent Tanks and Temples benchmark captures indoor and outdoor scenes using high-speed video cameras; however, its training set only contains 7 scenes with ground truth point clouds. The ETH3D benchmark contains one low-resolution set and one high-resolution set but, similar to Tanks and Temples, only provides a small number of ground truth scans for network training. The available training data in these datasets is rather limited, and a larger-scale dataset is required to further exploit the potential of learning-based MVS. In contrast, the proposed dataset provides more than 17,000 images with ground truth depth maps, which cover a variety of scenes and can greatly improve the generalization ability of the trained model.
2.3 Synthetic Datasets
Generating synthetic datasets for training is a common practice in many computer vision tasks, as a large amount of ground truth can be generated at very low cost. Thanks to recent advances in computer graphics, the rendering effect becomes increasingly photo-realistic, making the usage of synthetic datasets more plausible. For example, synthetic rendered images are used in stereo matching [3, 16, 32], optical flow [3, 16, 5], object detection [6, 27] and semantic segmentation [6, 19, 5, 20, 25]. Similar to these datasets, we consider incorporating the lighting effects in rendering synthetic datasets for 3D reconstruction. However, since it is difficult to generate correct material properties in different parts of the model, we resort to a blending approach with original images to recover the lighting effects.
3 Dataset Generation
The proposed data generation pipeline is shown in Fig. 1. We first apply a full 3D reconstruction pipeline to produce the 3D textured mesh from input images (Sec. 3.1). Next, the mesh is rendered to each camera view point to obtain the rendered image and the corresponding depth map. The final training image input is generated by blending the rendered image and input image in our proposed manner (Sec. 3.2).
3.1 Textured Mesh Generation
The first step in building a synthetic MVS dataset is to generate sufficient high-quality textured mesh models. Given input images, we use the Altizure online platform for the textured mesh reconstruction. Altizure is one of the leading 3D photogrammetry software packages on the market, but it could also be replaced with other comparable software. The software performs the full 3D reconstruction pipeline and returns the textured mesh and camera poses as final output.
With the textured mesh model and camera positions of all input images, we then render the mesh model to each camera view point to generate the rendered images and rendered depth maps. One example is shown in Fig. 1. The rendered depth maps will be used as the ground truth depth maps during training.
3.2 Blended Image Generation
Intuitively, rendered images and depth maps can be directly used for network training. However, one potential problem is that rendered images do not contain view-dependent lighting. In fact, a desirable training sample for a multi-view stereo network should satisfy:
Images and depth maps should be consistently aligned. The training sample should provide reliable mappings from input images to ground truth depth maps.
Images should reflect view-dependent lighting. Realistic ambient lighting could strengthen the model's generalization ability to real-world scenarios.
To introduce lightings to rendered images, one solution is to manually assign mesh materials and set up lighting sources during the rendering process. However, this is extremely labor-intensive, which makes it rather difficult to build a large-scale dataset.
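On the other hand, the original input images already contain the natural lighting information. The lighting could be automatically overlaid onto rendered images if we can directly extract such information from input images. Specifically, we notice that ambient lighting is mostly a low-frequency signal in images, while visual cues for establishing multi-view dense correspondences (e.g., rich textures) are mostly high-frequency signals. Following this observation, we propose to extract visual cues from the rendered image $I_r$ using a high-pass filter $H$, and to extract the view-dependent lighting from the input image $I$ using a low-pass filter $L$. The visual cues and lighting are fused to generate the blended image $I_b$ (Fig. 2):

$$I_b = I_r * H + I * L = \mathcal{F}^{-1}\big(\mathcal{F}(I_r) \cdot \mathcal{F}_H\big) + \mathcal{F}^{-1}\big(\mathcal{F}(I) \cdot \mathcal{F}_L\big),$$

where '$*$' denotes the convolution operation and '$\cdot$' the element-wise multiplication. $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the 2D Fourier transform and its inverse. The frequency responses $\mathcal{F}_L$ and $\mathcal{F}_H$ are approximated by 2D Gaussian low-pass and high-pass filters:

$$\mathcal{F}_L(u, v) = \exp\left(-\frac{u^2 + v^2}{2\sigma_f^2}\right), \qquad \mathcal{F}_H(u, v) = 1 - \mathcal{F}_L(u, v).$$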
The Gaussian kernel factor is set empirically in our experiments. The blended image inherits detailed visual cues from the rendered image, while at the same time largely preserving realistic environmental lighting from the input image. Fig. 3 illustrates the differences between these three images. We will demonstrate in Sec. 5.2 that models trained with blended images generalize better to different scenes.
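The blending step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on single-channel float images, performs the filtering in the frequency domain, takes the high-pass response as one minus a Gaussian low-pass response, and uses a placeholder value for `sigma` (the value used for the dataset is not given here).

```python
import numpy as np

def gaussian_lowpass(shape, sigma):
    """Gaussian low-pass transfer function on the FFT grid (DC at index [0, 0])."""
    h, w = shape
    u = np.fft.fftfreq(h) * h  # integer frequency indices along rows
    v = np.fft.fftfreq(w) * w  # integer frequency indices along columns
    uu, vv = np.meshgrid(u, v, indexing="ij")
    return np.exp(-(uu ** 2 + vv ** 2) / (2.0 * sigma ** 2))

def blend(input_img, rendered_img, sigma=8.0):
    """Low frequencies from the input image plus high frequencies from the
    rendered image. `sigma` is a hypothetical placeholder value."""
    low = gaussian_lowpass(input_img.shape, sigma)
    high = 1.0 - low  # assumed complementary high-pass response
    spectrum = np.fft.fft2(rendered_img) * high + np.fft.fft2(input_img) * low
    return np.real(np.fft.ifft2(spectrum))
```

Because the two responses sum to one, blending an image with itself returns the image unchanged, and the mean intensity of the blended image equals that of the input image.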
4 Scenes and Networks
For the content of the proposed dataset, we manually select 113 well-reconstructed models publicly available on the Altizure.com online platform. These models cover a variety of scenes, including architectures, street views, sculptures and small objects. Each scene contains 20 to 1,000 input images, and there are 17,818 images in the whole dataset. It is also noteworthy that, unlike the DTU dataset where all scenes are captured by a fixed robot arm, scenes in BlendedMVS contain a variety of camera trajectories. The unstructured camera trajectories better model different image capturing styles and make the network more generalizable to real-world reconstructions. Fig. 4 shows 7 scenes in the BlendedMVS dataset with camera positions.
The dataset also provides training images and ground truth depth maps with a unified image resolution. As input images usually have different resolutions, we first resize all blended images and rendered depth maps to the minimum image size that still covers the target resolution in both height and width. Then, we crop image patches of the target size from the resized image centers to build training samples for the BlendedMVS dataset. The corresponding camera parameters are changed accordingly. Also, the depth range is provided for each image, as this information is usually required by depth map estimation algorithms.
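The resize-and-crop step and the matching update of the camera intrinsics can be sketched as follows. The function names are hypothetical, the target size is left as a parameter, and half-pixel center offsets are ignored for simplicity, so this is only an approximation of such a preprocessing step.

```python
import numpy as np

def resize_and_crop_params(h, w, out_h, out_w):
    """Smallest resized size covering (out_h, out_w), plus the center-crop offset."""
    if out_h * w >= out_w * h:  # height is the binding dimension
        new_h, new_w = out_h, -(-w * out_h // h)  # ceiling division
    else:                       # width is the binding dimension
        new_h, new_w = -(-h * out_w // w), out_w
    top, left = (new_h - out_h) // 2, (new_w - out_w) // 2
    return (new_h, new_w), (top, left)

def update_intrinsics(K, old_size, new_size, top, left):
    """Scale a 3x3 intrinsic matrix for the resize, then shift it for the crop."""
    sy, sx = new_size[0] / old_size[0], new_size[1] / old_size[1]
    K = K.astype(float).copy()
    K[0, :] *= sx   # fx, skew and cx scale with the image width
    K[1, :] *= sy   # fy and cy scale with the image height
    K[0, 2] -= left
    K[1, 2] -= top
    return K
```

The extrinsic parameters are unaffected by resizing and cropping; only the focal lengths and principal point change.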
We also augment the training data during the training process. The following photometric augmentations are considered in our training: 1) Random brightness: we change the brightness of each image by adding a random value drawn from a fixed interval, and then clip the image intensities to the standard range. 2) Random contrast: we change the contrast of each image with a random contrast factor drawn from a fixed interval, and then clip the image to the standard range. 3) Random motion blur: we add Gaussian motion blur to each input image, using a random motion direction and a randomly chosen motion kernel size. These augmentations are applied to each training image in a random order. In the ablation study (Sec. 5.2), we will demonstrate the improvement brought by the online augmentation.
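A minimal sketch of such an online photometric augmentation is given below. All numeric ranges (brightness offset, contrast factor, kernel sizes) and the [0, 255] intensity range are hypothetical placeholders, and the motion blur is simplified to a box kernel along one of four directions rather than a Gaussian motion kernel.

```python
import numpy as np

def random_brightness(img, rng, max_delta=50.0):
    """Add one random offset to the whole image, then clip (placeholder range)."""
    return np.clip(img + rng.uniform(-max_delta, max_delta), 0.0, 255.0)

def random_contrast(img, rng, lo=0.5, hi=1.5):
    """Scale deviations from the mean intensity by a random factor, then clip."""
    factor = rng.uniform(lo, hi)
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 255.0)

def random_motion_blur(img, rng, sizes=(3, 5)):
    """Average k shifted copies along a random direction (simplified box kernel)."""
    k = int(rng.choice(sizes))  # odd kernel length
    dy, dx = [(0, 1), (1, 0), (1, 1), (1, -1)][rng.integers(4)]
    r = k // 2
    padded = np.pad(img, r, mode="edge")
    h, w = img.shape
    acc = np.zeros((h, w), dtype=float)
    for s in range(-r, r + 1):
        acc += padded[r + s * dy: r + s * dy + h, r + s * dx: r + s * dx + w]
    return acc / k

def augment(img, rng):
    """Apply the three augmentations to a 2D float image in a random order."""
    ops = [random_brightness, random_contrast, random_motion_blur]
    for i in rng.permutation(len(ops)):
        img = ops[i](img, rng)
    return img
```

Only the image is modified; the ground truth depth map and camera parameters stay untouched, since the augmentations are purely photometric.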
MVSNet is an end-to-end deep learning architecture for depth map estimation from multiple images. Given a reference image and several source images, MVSNet first extracts deep image features for all images through a 5-layer 2D network. Next, image features are warped into the reference camera frustum through differentiable homographies to build the feature volumes in 3D space. The network applies a variance-based cost metric to build the cost volume from N feature volumes, and applies a multi-scale 3D convolutional network for cost volume regularization. The initial depth map is regressed from the volume through the soft argmin operation. MVSNet further refines the depth map output through a small refinement network. The network is trained with the standard L1 loss function.
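The two parameter-free steps of this pipeline, the variance-based cost metric and the soft argmin, can be illustrated with a small NumPy sketch. The array layouts are assumptions made for the example, and the learned regularization network that sits between the two steps is omitted.

```python
import numpy as np

def variance_cost(feature_volumes):
    """Variance across N views; input (N, D, H, W, C), output (D, H, W, C)."""
    return np.var(feature_volumes, axis=0)

def soft_argmin(cost, depth_values):
    """Expected depth under softmax(-cost) along the depth axis.

    cost: (D, H, W) regularized matching cost; depth_values: (D,) hypotheses.
    """
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return np.tensordot(depth_values, prob, axes=(0, 0))  # (H, W)
```

Because the output is a probability-weighted sum of the depth hypotheses, it is differentiable and can take values between the sampled depths.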
The recurrent MVSNet is an extended version of MVSNet for high-resolution MVS reconstruction. Instead of regularizing the whole 3D cost volume with 3D CNNs at once, R-MVSNet applies a recurrent neural network to sequentially regularize the 2D cost maps along the depth direction. At each step the network only takes one cost map into account, and thus the memory consumption is reduced from cubic to quadratic in the model resolution. Meanwhile, R-MVSNet treats depth map estimation as a classification problem and applies the cross-entropy loss during training.
In our experiments, we make several modifications to the original versions of MVSNet and R-MVSNet: (1) We replace the 5-layer feature extraction network with a 2D U-Net to enlarge the receptive field in the image feature extraction stage. (2) MVSNet and R-MVSNet are memory consuming and can only be trained with a small batch size. Thus, we replace batch normalization (BN) with group normalization (GN) with a fixed group channel size of 8 to improve the network performance. (3) We remove the refinement network in MVSNet, as this part only brings limited performance gain to the network. (4) We remove the variational depth map refinement step in R-MVSNet to prevent this non-learning component from affecting the training dataset evaluation.
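To illustrate why group normalization suits small-batch training, the sketch below normalizes each group of channels within a single sample, so the statistics do not depend on the batch. It assumes that "group channel size of 8" means 8 channels per group, uses a `(B, C, H, W)` layout, and omits the learnable scale and shift.

```python
import numpy as np

def group_norm(x, channels_per_group=8, eps=1e-5):
    """Normalize each group of channels per sample; x has shape (B, C, H, W)."""
    b, c, h, w = x.shape
    assert c % channels_per_group == 0, "channel count must divide into groups"
    groups = c // channels_per_group
    xg = x.reshape(b, groups, channels_per_group * h * w)
    mean = xg.mean(axis=2, keepdims=True)
    var = xg.var(axis=2, keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(b, c, h, w)
```

The output for one sample is the same whether it is processed alone or inside a larger batch, which is not the case for batch normalization.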
5.1 Quantitative Evaluation
5.1.1 Depth Map Validation
To demonstrate the capacity of the BlendedMVS dataset, we compare models trained on 1) the DTU training set, 2) the ETH3D low-res training set, 3) the MegaDepth dataset, and 4) the BlendedMVS training set. All models are trained for 160k iterations, and we evaluate them on the DTU, ETH3D and BlendedMVS validation sets. Three metrics are considered in our experiments: 1) the end point error (EPE), which is the average loss between the inferred depth map and the ground truth depth map; 2) the 1-pixel error, which is the ratio of pixels with error larger than 1 depth-wise pixel; and 3) the 3-pixel error, defined analogously with a threshold of 3 depth-wise pixels. Quantitative and qualitative results are shown in Fig. 5 and Fig. 6 respectively.
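The three metrics can be sketched as follows. This is an illustration under assumptions rather than the evaluation code: the EPE is taken as the mean absolute depth error, one "depth-wise pixel" is interpreted as one depth sampling interval (`interval`), and pixels with zero ground truth depth are treated as invalid.

```python
import numpy as np

def depth_metrics(pred, gt, interval, mask=None):
    """Return EPE and the ratios of valid pixels with error above 1 and 3 intervals."""
    if mask is None:
        mask = gt > 0  # assumed convention: zero depth marks invalid pixels
    abs_err = np.abs(pred - gt)[mask]
    err_in_intervals = abs_err / interval
    return {
        "epe": float(abs_err.mean()),
        "e1": float((err_in_intervals > 1).mean()),
        "e3": float((err_in_intervals > 3).mean()),
    }
```

Lower is better for all three values.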
Trained on DTU. As suggested by previous methods [30, 10, 31], the DTU dataset is divided into training, validation and evaluation sets. We train both MVSNet and R-MVSNet with a fixed input sample size and a fixed depth range.
Both models trained on DTU perform very well on the DTU validation set but produce high validation errors on the BlendedMVS and ETH3D datasets. In fact, the two models overfit to small-scale indoor scenes, showing the importance of having rich object categories in MVS training data.
Trained on ETH3D. The ETH3D training set contains 5 scenes. To separate training and validation, we take delivery_area, electro and forest as our training scenes, and playground and terrains as our validation scenes. The training sample size is fixed. The per-view depth range is determined from the sparse point cloud provided by the dataset.
Fig. 5 shows that the validation errors of models trained on ETH3D are high on all validation sets, including its own, indicating that the ETH3D training set does not provide sufficient data to train the networks.
Trained on MegaDepth. The MegaDepth dataset was originally built for single-view depth map estimation and applies multi-view depth map estimation to generate its depth training data. The dataset provides image-depth map training pairs and SfM output files from COLMAP. To use MegaDepth for MVS training, we apply the view selection and depth range estimation of [30, 31] to generate training files in the MVSNet format. Also, as reconstructed depth maps of crowdsourced images are usually incomplete, we only use training samples whose reference depth map contains a sufficient ratio of valid pixels. There are 39k MVS training samples in the MegaDepth dataset after this pre-processing. The training input size is fixed by applying the resize-and-crop strategy described in Sec. 4.1.
Although MegaDepth contains more training samples than BlendedMVS, models trained on MegaDepth are still inferior to models trained on BlendedMVS (Fig. 5). We believe there are two major problems with using MegaDepth for MVS training: 1) The ground truth depth maps are generated through MVS reconstruction. In this case, input images and reconstructed depth maps are not consistently aligned, and the network tends to overfit to the chosen algorithm. 2) MegaDepth is built upon crowdsourced internet photos. These images are not well captured, and the training data quality can have a significant influence on the training result.
Trained on BlendedMVS. To train MVS networks with BlendedMVS, we resize all training samples to a fixed resolution and use a fixed number of depth samples. Our dataset is divided into 106 training scenes and 7 validation scenes to evaluate the network training.
As shown in Fig. 5, models trained on BlendedMVS generalize well to both DTU and ETH3D scenes. Both MVSNet and R-MVSNet achieve the best validation results on the BlendedMVS and ETH3D validation sets, and the second best result (very close to the best) on the DTU validation set, showing the strong generalization ability brought by the proposed dataset.
5.1.2 Point Cloud Evaluation
We also compare point cloud reconstructions of models trained on DTU, ETH3D and BlendedMVS on the Tanks and Temples training set. As the dataset contains wide-depth-range scenes that cannot be handled by MVSNet, we only test R-MVSNet (trained for 150k iterations) in this experiment. We follow the methods described in the R-MVSNet paper to recover the camera parameters of input images, and then perform per-view source image selection and depth range estimation based on the sparse point cloud. For post-processing, we also follow previous works [30, 31] and apply visibility-based depth map fusion, average depth map fusion and the visibility depth map filter to generate the 3D point cloud.
The dataset reports three evaluation metrics, namely precision (accuracy), recall (completeness) and the overall f_score [14, 23], to quantitatively measure the reconstruction quality. As shown in Table 1, R-MVSNet models trained on DTU and MegaDepth achieve similar f_score performance, while R-MVSNet trained on the proposed dataset significantly outperforms models trained on the other three datasets for all scenes. The average f_score improves (Table 1) simply by replacing the training data from DTU with BlendedMVS.
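For reference, the three point cloud metrics can be computed as in the following sketch: precision is the fraction of reconstructed points within a distance threshold of the ground truth, recall is the fraction of ground truth points within that threshold of the reconstruction, and the f_score is their harmonic mean. The brute-force nearest-neighbor search and the function names are choices made for this small example; the official benchmark tooling differs in its details.

```python
import numpy as np

def nearest_dist(a, b):
    """Distance from every point in a (Na, 3) to its nearest neighbor in b (Nb, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1))

def f_score(recon, gt, tau):
    """Precision, recall and their harmonic mean at distance threshold tau."""
    precision = float((nearest_dist(recon, gt) < tau).mean())
    recall = float((nearest_dist(gt, recon) < tau).mean())
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)
```

The pairwise distance matrix makes this quadratic in the number of points, so it is only suitable for small clouds; a spatial index would be needed at benchmark scale.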
[Table columns: Networks | Training Images | EPE | <1 Px. Err | <3 Px. Err]
5.2 Ablation Study on Training Image
Next, we study the differences between using 1) input images, 2) rendered images and 3) blended images as our training images. For these three settings, we also study the effectiveness of the online photometric augmentation. All models are trained for 150k iterations and are validated on the DTU validation set. Comparison results are shown in Table 2.
Environmental Lighting. The proposed setting of blended images with photometric augmentation produces the best result, while training with rendered images alone produces the worst. Also, every image type yields lower validation errors with photometric augmentation than without, showing that view-dependent lighting is indeed important for MVS network training.
Training with Input Images. It is noteworthy that while input images are not completely consistent with rendered depth maps, training with input images (with or without the augmentation) also produces satisfactory results. The reason might be that 3D structures have been correctly recovered for most of the scenes in BlendedMVS, as all scenes are well selected in advance. In this case, rendered depth maps can be regarded as semi ground truth for the input images, and the two can be jointly used for MVS network training.
Imperfect Reconstruction. One concern about using reconstructed models for MVS training is whether defects or imperfect reconstructions in the textured models would affect the training process. If we used input images as training inputs, images and depth maps might not be consistently aligned, which could cause problems for network training. However, the blended images we use inherit detailed visual cues from the rendered images, which are always consistent with the rendered depth maps even if defects occur. In this case, defects in the reconstructed models will not deteriorate the network training.
For the same reason, we could replace the Altizure online platform with any other 3D reconstruction pipeline to recover the mesh model. What we have presented is a low-cost MVS training data generation pipeline that does not rely on any particular textured model reconstruction method.
Occlusion and Normal Information. While current learning-based approaches [30, 31, 4, 13] do not take pixel-wise occlusion and normal information into account, our dataset provides such ground truth information as well. The occlusion and normal information could be useful for future visibility-aware and patch-based MVS networks.
Privacy. Another advantage of using blended images is that it helps preserve data privacy. For example, pedestrians in input images are usually dynamic and will therefore not be reconstructed in the textured model or appear in the rendered images (first row in Fig. 7). Furthermore, if pedestrians appear in front of the reconstructed object, our image blending process only extracts blurred human shapes from the input image, which helps conceal identities in the blended image (second row in Fig. 7).
We have presented the large-scale BlendedMVS dataset for learning-based MVS reconstruction. The proposed dataset provides more than 17k high-quality training samples covering a variety of scenes for multi-view depth estimation. To build the dataset, we have reconstructed textured meshes from input images and rendered these models into color images and depth maps. The rendered color image has been further blended with the input image to generate the training image input. We have trained recent MVS networks using BlendedMVS and other MVS datasets. Both quantitative and qualitative results have demonstrated that models trained on BlendedMVS achieve significantly better generalization ability than models trained on other datasets.
[1] (2016) Large-scale data for multiple-view stereopsis. In International Journal of Computer Vision (IJCV).
[2] Altizure: mapping the world in 3D. https://www.altizure.com
[3] A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV).
[4] (2019) Point-based multi-view stereo network. In International Conference on Computer Vision (ICCV).
[5] Virtual worlds as proxy for multi-object tracking analysis. In Computer Vision and Pattern Recognition (CVPR).
[6] (2016) Understanding real world indoor scenes with synthetic data. In Computer Vision and Pattern Recognition (CVPR).
[7] (2017) Learned multi-patch similarity. In International Conference on Computer Vision (ICCV).
[8] (2018) DeepMVS: learning multi-view stereopsis. In Computer Vision and Pattern Recognition (CVPR).
[9] Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).
[10] (2017) SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. In International Conference on Computer Vision (ICCV).
[11] (2017) Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems (NIPS).
[12] (2017) End-to-end learning of geometry and context for deep stereo regression. In Computer Vision and Pattern Recognition (CVPR).
[13] (2019) P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo. In International Conference on Computer Vision (ICCV).
[14] (2017) Tanks and temples: benchmarking large-scale scene reconstruction. In ACM Transactions on Graphics (TOG).
[15] (2018) MegaDepth: learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR).
[16] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Computer Vision and Pattern Recognition (CVPR).
[17] (2007) Real-time visibility-based fusion of depth maps. In International Conference on Computer Vision (ICCV).
[18] (2018) RayNet: learning volumetric 3D reconstruction with ray potentials. In Computer Vision and Pattern Recognition (CVPR).
[19] (2016) Playing for data: ground truth from computer games. In European Conference on Computer Vision (ECCV).
[20] (2016) The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In Computer Vision and Pattern Recognition (CVPR).
[21] (2016) Structure-from-motion revisited. In Computer Vision and Pattern Recognition (CVPR).
[22] (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
[23] (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos.
[24] (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In Computer Vision and Pattern Recognition (CVPR).
[25] (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.
[26] (2008) On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Computer Vision and Pattern Recognition (CVPR).
[27] (2018) Training deep networks with synthetic data: bridging the reality gap by domain randomization. In Computer Vision and Pattern Recognition Workshops (CVPRW).
[28] (2018) Group normalization. In European Conference on Computer Vision (ECCV).
[29] (2019) MVSCRF: learning multi-view stereo with conditional random fields. In International Conference on Computer Vision (ICCV).
[30] (2018) MVSNet: depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
[31] (2019) Recurrent MVSNet for high-resolution multi-view stereo depth inference. In Computer Vision and Pattern Recognition (CVPR).
[32] (2018) UnrealStereo: controlling hazardous factors to analyze stereo vision. In International Conference on 3D Vision (3DV).
1 Imperfect Reconstruction
We visualize two imperfect mesh reconstructions from the data generation process in Fig. 8. The two yellow regions are a reflective water pond and a thin key ring respectively. We failed to reconstruct these two regions in the mesh models, and the resulting blended images and rendered depth maps are also incomplete. However, as discussed in Sec. 5.3, such reconstruction defects will not affect the network training, because we do not use the original images as training inputs, and the blended images we use are consistently aligned with the rendered depth maps.
2 Input vs. Blended vs. Rendered Images
In this section, we further illustrate the difference between 1) input images, 2) blended images, and 3) rendered images. The corresponding rendered depth maps are also jointly visualized in Fig. 1. The blended image has similar lightings to the input image, and also inherits detailed visual cues from the rendered image.
In this section, we list all textured models used in BlendedMVS dataset. We manually crop the boundary areas of some scenes for better visualization. Models in our dataset can be roughly categorized into 1) large-scale scenes in Fig. 2, 2) small-scale objects in Fig. 3, and 3) high-quality sculptures in Fig. 4. Also, the 7 validation scenes are shown in Fig. 5.