A curated list of papers & resources linked to 3D reconstruction from images.
We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts to arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset. With simple post-processing, our method not only significantly outperforms the previous state of the art, but is also several times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranks first without any fine-tuning, showing the strong generalization ability of MVSNet.
Multi-view stereo (MVS) estimates the dense representation from overlapping images, which is a core problem of computer vision extensively studied for decades. Traditional methods use hand-crafted similarity metrics and engineered regularizations (e.g., normalized cross correlation and semi-global matching) to compute dense correspondences and recover 3D points. While these methods have shown great results under ideal Lambertian scenarios, they suffer from some common limitations. For example, low-textured, specular and reflective regions of the scene make dense matching intractable and thus lead to incomplete reconstructions. It is reported in recent MVS benchmarks [1, 18] that, although current state-of-the-art algorithms [7, 36, 8, 32] perform very well on the accuracy, the reconstruction completeness still has large room for improvement.
Recent success in convolutional neural network (CNN) research has also triggered interest in improving stereo reconstruction. Conceptually, a learning-based method can introduce global semantic information such as specular and reflective priors for more robust matching. There have been some attempts at two-view stereo matching that replace either hand-crafted similarity metrics [39, 10, 23, 11] or engineered regularizations [34, 19, 17] with learned ones. They have shown promising results and gradually surpassed traditional methods in stereo benchmarks [9, 25]. In fact, the stereo matching task is well suited to CNN-based methods, as image pairs are rectified in advance and the problem thus becomes horizontal pixel-wise disparity estimation, without having to deal with camera parameters.
However, directly extending learned two-view stereo to multi-view scenarios is non-trivial. Although one can simply pre-rectify all selected image pairs for stereo matching and then merge all pairwise reconstructions into a global point cloud, this approach fails to fully utilize the multi-view information and leads to less accurate results. Unlike stereo matching, input images to MVS can have arbitrary camera geometries, which poses a tricky issue for learning methods. Only a few works acknowledge this problem and try to apply CNNs to MVS reconstruction: SurfaceNet constructs the Colored Voxel Cubes (CVC) in advance, which combine all image pixel colors and camera information into a single volume as the input of the network. In contrast, the Learned Stereo Machine (LSM) directly leverages differentiable projection/unprojection to enable end-to-end training and inference. However, both methods exploit the volumetric representation of regular grids. Restricted by the huge memory consumption of 3D volumes, their networks can hardly be scaled up: LSM only handles synthetic objects at low volume resolution, and SurfaceNet applies a heuristic divide-and-conquer strategy and takes a long time for large-scale reconstructions. For the moment, the leaderboards of modern MVS benchmarks are still occupied by traditional methods [7, 8, 32].
To this end, we propose an end-to-end deep learning architecture for depth map inference, which computes one depth map at a time rather than the whole 3D scene at once. Similar to other depth map based MVS methods [35, 3, 8, 32], the proposed network, MVSNet, takes one reference image and several source images as input, and infers the depth map for the reference image. The key insight here is the differentiable homography warping operation, which implicitly encodes camera geometries in the network to build the 3D cost volume from 2D image features and enables end-to-end training. To adapt to an arbitrary number of source images in the input, we propose a variance-based metric that maps multiple features into one cost feature in the volume. This cost volume then undergoes multi-scale 3D convolutions to regress an initial depth map. Finally, the depth map is refined with the reference image to improve the accuracy in boundary areas. There are two major differences between our method and previous learned approaches [15, 14]. First, for the purpose of depth map inference, our 3D cost volume is built upon the camera frustum instead of the regular Euclidean space. Second, our method decouples MVS reconstruction into smaller problems of per-view depth map estimation, which makes large-scale reconstruction possible.
We train and evaluate the proposed MVSNet on the large-scale DTU dataset. Extensive experiments show that, with simple post-processing, MVSNet outperforms all competing methods in terms of completeness and overall quality. Besides, we demonstrate the generalization power of the network on the outdoor Tanks and Temples benchmark, where MVSNet ranks first (before April 18, 2018) over all submissions, including open-source MVS methods (e.g., COLMAP and OpenMVS) and commercial software (Pix4D), without any fine-tuning. It is also noteworthy that the runtime of MVSNet is several times or even several orders of magnitude faster than the previous state of the art.
MVS Reconstruction. According to output representations, MVS methods can be categorized into 1) direct point cloud reconstructions [22, 7], 2) volumetric reconstructions [20, 33, 14, 15] and 3) depth map reconstructions [35, 3, 8, 32, 38]. Point cloud based methods operate directly on 3D points, usually relying on a propagation strategy to gradually densify the reconstruction [22, 7]. As the propagation of point clouds proceeds sequentially, these methods are difficult to parallelize fully and usually take a long time to process. Volumetric methods divide the 3D space into regular grids and then estimate whether each voxel adheres to the surface. The downsides of this representation are the space discretization error and the high memory consumption. In contrast, the depth map is the most flexible representation of all. It decouples the complex MVS problem into relatively small problems of per-view depth map estimation, each of which focuses on only one reference and a few source images at a time. Also, depth maps can be easily fused into a point cloud or a volumetric reconstruction. According to the recent MVS benchmarks [1, 18], the current best MVS algorithms [8, 32] are both depth map based approaches.
Learned Stereo. Rather than using traditional handcrafted image features and matching metrics, recent studies on stereo apply deep learning for better pair-wise patch matching. Han et al. first propose a deep network to match two image patches. Zbontar et al. and Luo et al. use learned features for stereo matching and semi-global matching (SGM) for post-processing. Beyond the pair-wise matching cost, learning is also applied to cost regularization. SGMNet learns to adjust the parameters used in SGM, while CNN-CRF integrates conditional random field optimization into the network for end-to-end stereo learning. The recent state-of-the-art method is GCNet, which applies a 3D CNN to regularize the cost volume and regresses the disparity by the soft argmin operation. It has been reported in the KITTI benchmark that learning-based stereo methods, especially the end-to-end learning algorithms [24, 19, 17], significantly outperform traditional stereo approaches.
Learned MVS. There are fewer attempts at learned MVS approaches. Hartmann et al. propose the learned multi-patch similarity to replace the traditional cost metric for MVS reconstruction. The first learning-based pipeline for the MVS problem is SurfaceNet, which pre-computes the cost volume with sophisticated voxel-wise view selection and uses a 3D CNN to regularize and infer the surface voxels. The approach most related to ours is LSM, where camera parameters are encoded in the network as the projection operation to form the cost volume, and a 3D CNN is used to classify whether a voxel belongs to the surface. However, due to the common drawback of the volumetric representation, the networks of SurfaceNet and LSM are restricted to small-scale reconstructions. They either apply a divide-and-conquer strategy or are only applicable to synthetic data with low-resolution inputs. In contrast, our network focuses on producing the depth map for one reference image at a time, which allows us to adaptively reconstruct a large scene directly.
This section describes the detailed architecture of the proposed network. The design of MVSNet strictly follows the rules of camera geometry and borrows insights from previous MVS approaches. In the following sections, we compare each step of our network to traditional MVS methods and demonstrate the advantages of our learning-based MVS system. The full architecture of MVSNet is visualized in Fig. 1.
The first step of MVSNet is to extract the deep features $\{\mathbf{F}_i\}_{i=1}^N$ of the $N$ input images $\{\mathbf{I}_i\}_{i=1}^N$ for dense matching. An eight-layer 2D CNN is applied, where the strides of layers 3 and 6 are set to two to divide the feature towers into three scales. Within each scale, two convolutional layers are applied to extract the higher-level image representation. Each convolutional layer is followed by a batch-normalization (BN) layer and a rectified linear unit (ReLU), except for the last layer. Also, similar to common matching tasks, parameters are shared among all feature towers for efficient learning.
The outputs of the 2D network are 32-channel feature maps downsized by four in each dimension compared with the input images. It is noteworthy that although the image frame is downsized after feature extraction, the original neighboring information of each remaining pixel has already been encoded into the 32-channel pixel descriptor, which prevents dense matching from losing useful context information. Compared with simply performing dense matching on the original images, the extracted feature maps significantly boost the reconstruction quality (see Sec. 5.3).
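The stated downsampling factor can be checked with a few lines of Python. This is only a sketch: the helper name `feature_map_size`, the example resolution and the use of 'same' padding are assumptions made for illustration, not details given in the text.

```python
def feature_map_size(width, height, strides=(1, 1, 2, 1, 1, 2, 1, 1)):
    """Spatial size after a stack of 'same'-padded convolutions.

    The default strides model an eight-layer network whose layers 3 and 6
    use stride two, as described above. 'Same' padding is an assumption.
    """
    for s in strides:
        # With 'same' padding, a stride-s convolution yields ceil(size / s) samples.
        width = -(-width // s)
        height = -(-height // s)
    return width, height


# Illustrative input size (hypothetical, chosen only for the example).
print(feature_map_size(512, 384))  # (128, 96)
```

Two stride-two layers give the overall factor of four in each dimension.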
The next step is to build a 3D cost volume from the extracted feature maps and input cameras. While previous works [14, 15] divide the space using regular grids, for our task of depth map inference we construct the cost volume upon the reference camera frustum. For simplicity, in the following we denote $\mathbf{I}_1$ as the reference image, $\{\mathbf{I}_i\}_{i=2}^N$ the source images, and $\{\mathbf{K}_i, \mathbf{R}_i, \mathbf{t}_i\}_{i=1}^N$ the camera intrinsics, rotations and translations that correspond to the feature maps.
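All feature maps are warped into different fronto-parallel planes of the reference camera to form $N$ feature volumes $\{\mathbf{V}_i\}_{i=1}^N$. The coordinate mapping from the warped feature map $\mathbf{V}_i(d)$ to the $i$-th feature map $\mathbf{F}_i$ at depth $d$ is determined by the planar transformation $\mathbf{x}' \sim \mathbf{H}_i(d) \cdot \mathbf{x}$, where '$\sim$' denotes the projective equality and $\mathbf{H}_i(d)$ the homography between the $i$-th feature map and the reference feature map at depth $d$. Let $\mathbf{n}_1$ be the principal axis of the reference camera; the homography $\mathbf{H}_i(d)$ is then a $3 \times 3$ matrix determined by the intrinsics, rotations and translations of the reference and $i$-th cameras, by $\mathbf{n}_1$ and by the depth $d$. The warping process resembles classical plane sweeping stereo, except that differentiable bilinear interpolation is used to sample pixels from the feature maps rather than from the images. As the core step bridging the 2D feature extraction and the 3D regularization networks, the warping operation is implemented in a differentiable manner, which enables end-to-end training of depth map inference.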
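To make the plane-induced homography concrete, the following NumPy sketch builds the standard homography for a fronto-parallel plane of the reference camera. It assumes world-to-camera extrinsics (`x_cam = R @ x_world + t`); the function name `plane_homography` and this convention are assumptions of the sketch, so the expression may differ in form from the one used by the authors.

```python
import numpy as np


def plane_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, depth):
    """Homography mapping reference pixels to source pixels for the plane
    z = depth in the reference camera frame.

    Assumes world-to-camera extrinsics: x_cam = R @ x_world + t.
    """
    n = np.array([0.0, 0.0, 1.0])      # principal axis of the reference camera
    R_rel = R_src @ R_ref.T            # rotation from reference to source frame
    t_rel = t_src - R_rel @ t_ref      # translation from reference to source frame
    # For points on the plane, n . X_ref / depth == 1, so the translation can be
    # folded into a 3x3 matrix acting on reference-frame coordinates.
    return K_src @ (R_rel + np.outer(t_rel, n) / depth) @ np.linalg.inv(K_ref)


# Example: identical cameras give the identity homography at any depth.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
H_self = plane_homography(K, np.eye(3), np.zeros(3), K, np.eye(3), np.zeros(3), 2.0)
print(np.allclose(H_self, np.eye(3)))  # True
```

Evaluating this matrix for each sampled depth and each source view, and sampling the source feature map at the mapped coordinates, yields one slice of a feature volume per depth.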
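Next, we aggregate the multiple feature volumes $\{\mathbf{V}_i\}_{i=1}^N$ into one cost volume $\mathbf{C}$. To adapt to an arbitrary number of input views, we propose a variance-based cost metric $\mathcal{M}$ for $N$-view similarity measurement. Let $W$, $H$, $D$, $F$ be the input image width, height, depth sample number and the channel number of the feature map, and $V = \frac{W}{4} \cdot \frac{H}{4} \cdot D \cdot F$ the feature volume size. Our cost metric defines the mapping $\mathcal{M}: \underbrace{\mathbb{R}^V \times \cdots \times \mathbb{R}^V}_{N} \to \mathbb{R}^V$ such that:

$$\mathbf{C} = \mathcal{M}(\mathbf{V}_1, \cdots, \mathbf{V}_N) = \frac{\sum_{i=1}^{N} \left(\mathbf{V}_i - \overline{\mathbf{V}}\right)^2}{N} \tag{2}$$

where $\overline{\mathbf{V}}$ is the average volume among all feature volumes, and all operations above are element-wise.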
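The element-wise variance over views can be sketched in NumPy as follows. The array layout `(N, D, H, W, F)` and the function name `variance_cost_volume` are assumptions chosen for illustration.

```python
import numpy as np


def variance_cost_volume(feature_volumes):
    """Element-wise variance across views.

    feature_volumes: array-like of shape (N, D, H, W, F), one warped feature
    volume per view (this layout is an assumption of the sketch).
    Returns a cost volume of shape (D, H, W, F).
    """
    volumes = np.asarray(feature_volumes, dtype=np.float64)
    mean_volume = volumes.mean(axis=0)                 # average volume over views
    return ((volumes - mean_volume) ** 2).mean(axis=0)  # sum of squares divided by N


# The same function accepts any number of views N >= 1.
rng = np.random.default_rng(0)
cost_3 = variance_cost_volume(rng.normal(size=(3, 4, 2, 2, 8)))
cost_5 = variance_cost_volume(rng.normal(size=(5, 4, 2, 2, 8)))
print(cost_3.shape, cost_5.shape)  # (4, 2, 2, 8) (4, 2, 2, 8)
```

Because the output shape does not depend on the number of views, the downstream regularization network never sees N, which is what allows testing with a different view count than the one used in training.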
Most traditional MVS methods aggregate pairwise costs between the reference image and all source images in a heuristic way. Instead, our metric design follows the philosophy that all views should contribute equally to the matching cost, and it gives no preference to the reference image. We notice that recent work applies the mean operation with multiple CNN layers to infer the multi-patch similarity. Here we choose the 'variance' operation instead, because the 'mean' operation itself provides no information about the feature differences, and their network requires pre- and post-CNN layers to help infer the similarity. In contrast, our variance-based cost metric explicitly measures the multi-view feature difference. In later experiments, we show that such explicit difference measurement improves the validation accuracy.
The raw cost volume computed from image features could be noise-contaminated (e.g., due to the existence of non-Lambertian surfaces or object occlusions) and should be combined with smoothness constraints to infer the depth map. Our regularization step is designed to refine the above cost volume into a probability volume for depth inference. Inspired by recent learning-based stereo and MVS [14, 15] methods, we apply a multi-scale 3D CNN for cost volume regularization. The four-scale network here is similar to a 3D version of UNet, which uses the encoder-decoder structure to aggregate neighboring information from a large receptive field with relatively low memory and computation cost. To further lessen the computational requirement, we reduce the 32-channel cost volume to 8 channels after the first 3D convolutional layer, and change the convolutions within each scale from 3 layers to 2 layers. The last convolutional layer outputs a 1-channel volume. We finally apply the softmax operation along the depth direction for probability normalization.
The resulting probability volume is highly desirable in depth map inference because it can be used not only for per-pixel depth estimation but also for measuring the estimation confidence, as discussed in Sec. 3.3.2.
The simplest way to retrieve the depth map $\mathbf{D}$ from the probability volume $\mathbf{P}$ is the pixel-wise winner-take-all (i.e., argmax). However, the argmax operation cannot produce sub-pixel estimates and cannot be trained with back-propagation because it is not differentiable. Instead, we compute the expectation value along the depth direction, i.e., the probability-weighted sum over all hypotheses:

$$\mathbf{D} = \sum_{d = d_{min}}^{d_{max}} d \times \mathbf{P}(d)$$

where $\mathbf{P}(d)$ is the probability estimation for all pixels at depth $d$. Note that this operation is also referred to as the soft argmin operation in GCNet. It is fully differentiable and able to approximate the argmax result. While the depth hypotheses are uniformly sampled within the range $[d_{min}, d_{max}]$ during cost volume construction, the expectation value here is able to produce a continuous depth estimation. The output depth map (Fig. 2 (b)) is of the same size as the 2D image feature maps, which are downsized by four in each dimension compared to the input images.
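A minimal NumPy sketch of the softmax normalization followed by the expectation over depth hypotheses is given below. It treats the regularized volume as per-depth scores in which larger means more likely; that sign convention, the `(D, H, W)` layout and the function name are assumptions of the sketch.

```python
import numpy as np


def expected_depth(scores, depth_values):
    """Softmax along the depth axis, then the probability-weighted depth.

    scores: (D, H, W) regularized volume, larger meaning more likely
            (an assumed convention).
    depth_values: (D,) sampled depth hypotheses.
    Returns (depth map of shape (H, W), probability volume of shape (D, H, W)).
    """
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(shifted)
    prob /= prob.sum(axis=0, keepdims=True)
    depth = np.tensordot(depth_values, prob, axes=(0, 0))  # sum_d d * P(d)
    return depth, prob


# Two equally likely neighboring hypotheses yield a depth between the samples.
depth_values = np.linspace(1.0, 2.0, 5)        # 1.0, 1.25, 1.5, 1.75, 2.0
scores = np.zeros((5, 1, 1))
scores[1] = scores[2] = 50.0
depth, _ = expected_depth(scores, depth_values)
print(depth)
```

In the example, the estimate lands halfway between the second and third hypotheses, which a hard argmax could not express.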
The probability distribution along the depth direction also reflects the depth estimation quality. Although the multi-scale 3D CNN has a very strong ability to regularize the probability into a single-modal distribution, we notice that for falsely matched pixels the probability distributions are scattered and cannot be concentrated in one peak (see Fig. 2 (c)). Based on this observation, we define the quality of a depth estimation $\hat{d}$ as the probability that the ground truth depth is within a small range near the estimation. As depth hypotheses are discretely sampled along the camera frustum, we simply take the probability sum over the four nearest depth hypotheses to measure the estimation quality. Note that other statistical measurements, such as standard deviation or entropy, can also be used here, but in our experiments we observe no significant improvement from these measurements for depth map filtering. Moreover, our probability sum formulation gives better control over the thresholding parameter for outlier filtering.
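One way to realize the probability sum over the four nearest hypotheses is sketched below. The choice of the four neighbors (one below the floor index, the floor index, and the two above) and the handling of out-of-range indices are assumptions of this sketch; the function name `probability_map` is hypothetical.

```python
import numpy as np


def probability_map(prob, depth, depth_min, depth_interval):
    """Sum of probabilities over the four hypotheses nearest to each estimate.

    prob: (D, H, W) probability volume; depth: (H, W) estimated depth.
    Hypotheses are assumed to lie at depth_min + k * depth_interval.
    """
    num_depths = prob.shape[0]
    index = (np.asarray(depth, dtype=np.float64) - depth_min) / depth_interval
    lower = np.floor(index).astype(int)
    confidence = np.zeros(lower.shape, dtype=np.float64)
    for offset in (-1, 0, 1, 2):
        idx = lower + offset
        valid = (idx >= 0) & (idx < num_depths)
        idx = np.clip(idx, 0, num_depths - 1)
        picked = np.take_along_axis(prob, idx[None], axis=0)[0]
        # Hypotheses outside the sampled range contribute nothing.
        confidence += np.where(valid, picked, 0.0)
    return confidence


# A flat distribution over 8 hypotheses gives 4 / 8 at an interior estimate.
flat = np.full((8, 1, 1), 1.0 / 8.0)
print(probability_map(flat, np.array([[3.5]]), depth_min=0.0, depth_interval=1.0))
```

A peaked distribution pushes the value toward one, while a scattered one stays low, so a single threshold on this map separates reliable from unreliable pixels.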
While the depth map retrieved from the probability volume is a qualified output, the reconstruction boundaries may suffer from oversmoothing due to the large receptive field involved in the regularization, which is similar to the problems in semantic segmentation and image matting. Since the reference image naturally contains boundary information, we use it as guidance to refine the depth map. Inspired by a recent image matting algorithm, we apply a depth residual learning network at the end of MVSNet. The initial depth map and the resized reference image are concatenated as a 4-channel input, which is then passed through three 32-channel 2D convolutional layers followed by one 1-channel convolutional layer to learn the depth residual. The initial depth map is then added back to generate the refined depth map. The last layer does not contain the BN layer and the ReLU unit, so that it can learn negative residuals. Also, to avoid a bias toward a certain depth scale, we pre-scale the initial depth magnitude to the range [0, 1] and convert it back after the refinement.
Losses for both the initial depth map and the refined depth map are considered. We use the mean absolute difference between the ground truth depth map and the estimated depth map as our training loss. As ground truth depth maps are not always complete over the whole image (see Sec. 4.1), we only consider those pixels with valid ground truth labels:

$$Loss = \sum_{p \in \mathbf{p}_{valid}} \left( \| d(p) - \hat{d}_i(p) \|_1 + \lambda \cdot \| d(p) - \hat{d}_r(p) \|_1 \right)$$

where $\mathbf{p}_{valid}$ denotes the set of valid ground truth pixels, $d(p)$ the ground truth depth value of pixel $p$, $\hat{d}_i(p)$ the initial depth estimation and $\hat{d}_r(p)$ the refined depth estimation. The parameter $\lambda$ weights the loss on the refined depth map and is held fixed in experiments.
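A masked version of this loss can be sketched in NumPy as follows. The sketch normalizes by the number of valid pixels, matching the "mean absolute difference" wording; the function name and the default `lam=1.0` are placeholders for illustration and are not values taken from the text.

```python
import numpy as np


def masked_l1_depth_loss(gt_depth, initial_depth, refined_depth, valid_mask, lam=1.0):
    """Mean absolute depth error over valid pixels, for both depth outputs.

    lam weights the refined-depth term; its default here is a placeholder.
    """
    valid = np.asarray(valid_mask, dtype=bool)
    count = max(int(valid.sum()), 1)   # avoid division by zero on empty masks
    loss_initial = np.abs(gt_depth - initial_depth)[valid].sum() / count
    loss_refined = np.abs(gt_depth - refined_depth)[valid].sum() / count
    return loss_initial + lam * loss_refined


gt = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[1, 0], [1, 1]])
print(masked_l1_depth_loss(gt, gt + 0.5, gt - 0.25, mask))  # 0.75
```

Pixels outside the mask never enter the sum, so holes in the rendered ground truth do not contribute gradients.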
Current MVS datasets provide ground truth data in either point cloud or mesh formats, so we need to generate the ground truth depth maps ourselves. The DTU dataset is a large-scale MVS dataset containing more than 100 scenes with different lighting conditions. As it provides ground truth point clouds with normal information, we use the screened Poisson surface reconstruction (SPSR) to generate the mesh surface, and then render the mesh to each viewpoint to generate the depth maps for our training. The depth-of-tree parameter is set to 11 in SPSR to acquire a high-quality mesh. Also, we set the mesh trimming factor to 9.5 to alleviate mesh artifacts in surface edge areas. To fairly compare MVSNet with other learning-based methods, we choose the same training, validation and evaluation sets as in SurfaceNet (validation set: scans 3, 5, 17, 21, 28, 35, 37, 38, 40, 43, 56, 59, 66, 67, 82, 86, 106, 117; evaluation set: scans 1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, 118; training set: the other 79 scans). Considering that each scan contains 49 images with 7 different lighting conditions, by setting each image as the reference, the DTU dataset provides 27097 training samples in total.
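The split sizes and the sample count quoted above can be verified directly. The scan lists are copied from the text; the variable names are illustrative.

```python
validation_scans = [3, 5, 17, 21, 28, 35, 37, 38, 40, 43, 56, 59,
                    66, 67, 82, 86, 106, 117]
evaluation_scans = [1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34,
                    48, 49, 62, 75, 77, 110, 114, 118]
training_scan_count = 79
images_per_scan = 49
lighting_conditions = 7

# Every image under every lighting condition serves once as the reference.
training_samples = training_scan_count * images_per_scan * lighting_conditions

print(len(validation_scans), len(evaluation_scans), training_samples)  # 18 22 27097
```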
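A reference image and two source images ($N = 3$) are used in our training. We calculate a score $s(i, j) = \sum_{\mathbf{p}} \mathcal{G}(\theta_{ij}(\mathbf{p}))$ for each image pair according to the sparse points, where $\mathbf{p}$ is a common track in both views $i$ and $j$, $\theta_{ij}(\mathbf{p})$ is $\mathbf{p}$'s baseline angle, i.e., the angle at $\mathbf{p}$ between the directions to the camera centers $\mathbf{c}_i$ and $\mathbf{c}_j$, and $\mathbf{c}$ is the camera center. $\mathcal{G}$ is a piecewise Gaussian function that favors a certain baseline angle $\theta_0$:

$$\mathcal{G}(\theta) = \begin{cases} \exp\left(-\dfrac{(\theta - \theta_0)^2}{2\sigma_1^2}\right), & \theta \leq \theta_0 \\ \exp\left(-\dfrac{(\theta - \theta_0)^2}{2\sigma_2^2}\right), & \theta > \theta_0 \end{cases}$$

In the experiments, $\theta_0$, $\sigma_1$ and $\sigma_2$ are set to 5, 1 and 10 respectively.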
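The view-pair score can be sketched as below, using the three parameter values 5, 1 and 10 quoted above. The sketch assumes angles in degrees, normalized viewing rays, and that the narrower width applies below the preferred angle; the function names are hypothetical.

```python
import numpy as np


def piecewise_gaussian(theta, theta0=5.0, sigma1=1.0, sigma2=10.0):
    """Gaussian-shaped weight peaking at theta0, with different widths on
    each side (sigma1 below theta0, sigma2 above; an assumed assignment)."""
    theta = np.asarray(theta, dtype=np.float64)
    sigma = np.where(theta <= theta0, sigma1, sigma2)
    return np.exp(-((theta - theta0) ** 2) / (2.0 * sigma ** 2))


def view_pair_score(points, center_i, center_j, **kwargs):
    """Sum of weights over sparse points seen in both views.

    points: (M, 3) common tracks; center_i, center_j: (3,) camera centers.
    """
    ray_i = center_i - points
    ray_j = center_j - points
    cos = np.sum(ray_i * ray_j, axis=1) / (
        np.linalg.norm(ray_i, axis=1) * np.linalg.norm(ray_j, axis=1))
    theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # baseline angle
    return float(np.sum(piecewise_gaussian(theta, **kwargs)))


# With these widths, a 4-degree and a 15-degree baseline get the same weight.
print(float(piecewise_gaussian(4.0)), float(piecewise_gaussian(15.0)))
```

The asymmetric widths penalize baselines that are too narrow far more sharply than baselines that are too wide.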
Because images are downsized in feature extraction and the 3D regularization part uses a four-scale encoder-decoder structure, the input image size must be divisible by 32. Considering this requirement and the limited GPU memory, we downsize the images and then crop a patch of width $W$ and height $H$ from the center as the training input. The input camera parameters are changed accordingly. The depth hypotheses are uniformly sampled over a fixed depth range at a fixed resolution, giving $D$ depth samples. We use TensorFlow to implement MVSNet, and the network is trained on one Tesla P100 graphics card.
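The preprocessing described above (resize, center crop, matching intrinsics update, uniform depth sampling) can be sketched as follows. All names and numbers are illustrative assumptions, and half-pixel offsets in the resize convention are ignored.

```python
import numpy as np


def resize_and_center_crop_intrinsics(K, scale, src_size, crop_size, base=32):
    """Adjust a 3x3 intrinsic matrix for a resize followed by a center crop.

    src_size and crop_size are (width, height); the crop must be divisible
    by `base` in both dimensions.
    """
    crop_w, crop_h = crop_size
    if crop_w % base or crop_h % base:
        raise ValueError("crop size must be divisible by %d" % base)
    K_new = np.array(K, dtype=np.float64)
    K_new[:2, :] *= scale                          # resizing scales fx, fy, cx, cy
    resized_w, resized_h = src_size[0] * scale, src_size[1] * scale
    K_new[0, 2] -= (resized_w - crop_w) / 2.0      # cropping shifts the principal point
    K_new[1, 2] -= (resized_h - crop_h) / 2.0
    return K_new


def uniform_depth_hypotheses(depth_min, depth_interval, num_depths):
    """Uniformly spaced depth samples starting at depth_min."""
    return depth_min + depth_interval * np.arange(num_depths, dtype=np.float64)


# Hypothetical example values, chosen only so that the crop is divisible by 32.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 480.0], [0.0, 0.0, 1.0]])
print(resize_and_center_crop_intrinsics(K, 0.5, (1280, 960), (512, 384)))
print(uniform_depth_hypotheses(1.0, 0.5, 4))
```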
The above network estimates a depth value for every pixel. Before converting the result to dense point clouds, it is necessary to filter out outliers in background and occluded areas. We propose two criteria, namely photometric and geometric consistency, for robust depth map filtering.
The photometric consistency measures the matching quality. As discussed in Sec. 3.3.2, we compute the probability map to measure the depth estimation quality. In our experiments, we regard pixels with probability lower than 0.8 as outliers. The geometric constraint measures the depth consistency among multiple views. Similar to the left-right disparity check for stereo, we project a reference pixel $p_1$ through its depth $d_1$ to pixel $p_i$ in another view, and then reproject $p_i$ back to the reference image by $p_i$'s depth estimation $d_i$. If the reprojected coordinate $p_{reproj}$ and the reprojected depth $d_{reproj}$ stay within fixed thresholds of $p_1$ and $d_1$, we say the depth estimation $d_1$ of $p_1$ is two-view consistent. In our experiments, all depths should be at least three-view consistent. This simple two-step filtering strategy shows strong robustness for filtering different kinds of outliers.
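The reprojection check can be sketched as below. The thresholds are left as required arguments because they are not stated here; the world-to-camera convention, nearest-neighbor depth lookup, the relative form of the depth test and all function names are assumptions of this sketch.

```python
import numpy as np


def backproject(K, R, t, pixel, depth):
    """Pixel (x, y) at the given depth -> world point, for x_cam = R @ x_world + t."""
    x_cam = depth * (np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0]))
    return R.T @ (x_cam - t)


def project(K, R, t, x_world):
    """World point -> (pixel (x, y), depth) in the given camera."""
    x_cam = R @ x_world + t
    uv = K @ x_cam
    return uv[:2] / uv[2], x_cam[2]


def two_view_consistent(p_ref, d_ref, cam_ref, cam_src, depth_src,
                        max_pixel_error, max_relative_depth_error):
    """Project a reference pixel into a source view and back again, then
    compare the reprojected pixel and depth with the originals."""
    K1, R1, t1 = cam_ref
    Ki, Ri, ti = cam_src
    p_src, _ = project(Ki, Ri, ti, backproject(K1, R1, t1, p_ref, d_ref))
    col, row = int(round(p_src[0])), int(round(p_src[1]))
    height, width = depth_src.shape
    if not (0 <= row < height and 0 <= col < width):
        return False                      # falls outside the source image
    d_src = depth_src[row, col]           # nearest-neighbor lookup (simplification)
    p_reproj, d_reproj = project(K1, R1, t1, backproject(Ki, Ri, ti, p_src, d_src))
    pixel_error = np.linalg.norm(p_reproj - np.asarray(p_ref, dtype=np.float64))
    depth_error = abs(d_reproj - d_ref) / d_ref
    return bool(pixel_error < max_pixel_error
                and depth_error < max_relative_depth_error)
```

Counting, for each reference pixel, the number of source views for which this test passes gives the view-consistency count used for filtering.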
Similar to other multi-view stereo methods [8, 32], we apply a depth map fusion step to integrate depth maps from different views into a unified point cloud representation. The visibility-based fusion algorithm is used in our reconstruction, where depth occlusions and violations across different viewpoints are minimized. To further suppress reconstruction noise, we determine the visible views for each pixel as in the filtering step, and take the average over all reprojected depths as the pixel's final depth estimation. The fused depth maps are then directly reprojected to space to generate the 3D point cloud. An illustration of our MVS reconstruction is shown in Fig. 3.
We first evaluate our method on the 22 evaluation scans of the DTU dataset. The input view number $N$, image width $W$, height $H$ and depth sample number $D$ are held fixed across all scans. For quantitative evaluation, we calculate the accuracy and the completeness under both the distance metric and the percentage metric. While the MATLAB code for the distance metric is provided with the DTU dataset, we implement the percentage evaluation ourselves. Note that the percentage metric also measures the overall performance of accuracy and completeness as the f-score. To give a similar measurement for the distance metric, we define the overall score as the average of mean accuracy and mean completeness and use it as the reconstruction quality.
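The two summary scores can be written out as follows. The overall score is the plain average described above; the f-score is written as the usual harmonic mean of accuracy and completeness, which is an assumption about the exact definition used by the percentage metric.

```python
def overall_distance_score(mean_accuracy, mean_completeness):
    """Average of mean accuracy and mean completeness (lower is better)."""
    return 0.5 * (mean_accuracy + mean_completeness)


def f_score(accuracy_pct, completeness_pct):
    """Harmonic mean of accuracy and completeness percentages (higher is better)."""
    total = accuracy_pct + completeness_pct
    if total == 0:
        return 0.0
    return 2.0 * accuracy_pct * completeness_pct / total


# Illustrative inputs only, not results from the paper.
print(overall_distance_score(0.4, 0.6))  # 0.5
print(f_score(80.0, 80.0))               # 80.0
```

Unlike the arithmetic average, the harmonic mean stays low when either accuracy or completeness is poor, so a method cannot score well by excelling at only one of the two.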
Quantitative results are shown in Table 1. While Gipuma performs best in accuracy, our MVSNet outperforms all methods in both completeness and overall quality by a significant margin. As shown in Fig. 4, MVSNet produces the most complete point clouds, especially in textureless and reflective areas, which are commonly considered the most difficult parts to recover in MVS reconstruction.
[Table 1: quantitative results on the DTU evaluation set. Column groups: Mean Distance (mm): Acc., Comp., overall; Percentage (<) at two distance thresholds: Acc., Comp., f-score.]

[Tanks and Temples intermediate-set benchmark rows: rank, mean and per-scene scores.]
|Method|Rank|Mean|Family|Francis|Horse|Lighthouse|M60|Panther|Playground|Train|
|OpenMVG + OpenMVS|3.62|41.71|58.86|32.59|26.25|43.12|44.73|46.85|45.97|35.27|
|OpenMVG + MVE|6.00|38.00|49.91|28.19|20.75|43.35|44.51|44.76|36.58|35.95|
|OpenMVG + SMVS|10.38|30.67|31.93|19.92|15.02|39.38|36.51|41.61|35.89|25.12|
|OpenMVG-G + OpenMVS|10.88|22.86|56.50|29.63|21.69|6.55|39.54|28.48|0.00|0.53|
|OpenMVG + PMVS|11.88|29.66|41.03|17.70|12.83|36.68|35.93|33.20|31.78|28.10|
The DTU scans are taken in a well-controlled indoor environment with a fixed camera trajectory. To further demonstrate the generalization ability of MVSNet, we test the proposed method on the more complex outdoor Tanks and Temples dataset, using the model trained on DTU without any fine-tuning. While $N$, $W$, $H$ and $D$ are fixed for all reconstructions, the depth range and the source image set for each reference image are determined according to the sparse point cloud and camera positions, which are recovered by the open-source SfM software OpenMVG.
Our method ranks first (before April 18, 2018) among all submissions on the intermediate set according to the online benchmark (Table 2). Although the model is trained on the very different DTU indoor dataset, MVSNet is still able to produce the best reconstructions on these outdoor scenes, demonstrating the strong generalization ability of the proposed network. The qualitative point cloud results of the intermediate set are visualized in Fig. 5.
This section analyzes several components in MVSNet. For all following studies, we use the validation loss to measure the reconstruction quality. The 18 validation scans (see Sec. 4.1) are pre-processed in the same way as the training set, with fixed $N$, $W$, $H$ and $D$ for the validation loss computation.
We first study the influence of the input view number $N$ and demonstrate that our model can be applied to an arbitrary number of input views. While the model in Sec. 4.1 is trained using $N = 3$ views, we test the model with several different view numbers. As expected, Fig. 6 (a) shows that adding input views lowers the validation loss, which is consistent with our knowledge about MVS reconstruction. It is noteworthy that testing with more than three views performs better than testing with three views, even though the model is trained with the 3-view setting. This highly desirable property makes MVSNet flexible enough to be applied to different input settings.
We demonstrate in this study that the learning-based image feature can significantly boost the MVS reconstruction quality. To model the traditional patch-based image feature in MVSNet, we replace the original 2D feature extraction network with a single 32-channel convolutional layer. The filter kernel is set to a large size and the stride is set to 4. As shown in Fig. 6 (b), the network with the 2D feature extraction significantly outperforms the single-layer one on validation loss.
We also compare our variance-based cost metric with the mean-based metric. The element-wise variance operation in Eq. 2 is replaced with the mean operation to train the new model. Fig. 6 (b) shows that our cost metric results in faster convergence with lower validation loss, which demonstrates that it is more reasonable to use the explicit difference measurement to compute the multi-view feature similarity.
Lastly, we train MVSNet with and without the depth map refinement network. The models are also tested on the DTU evaluation set as in Sec. 5.1, and we use the percentage metric to quantitatively compare the two models. While Fig. 6 (b) shows that the refinement does not affect the validation loss much, the refinement network improves the evaluation results from 75.58 to 75.69 and from 79.98 to 80.25 in f-score at the two distance thresholds of the percentage metric.
We compare the running speed of MVSNet to Gipuma, COLMAP and SurfaceNet using the evaluation set. The other methods are compiled from their source code and all methods are tested on the same machine. MVSNet is much more efficient: it takes around 230 seconds to reconstruct one scan (4.7 seconds per view), which is faster than Gipuma, COLMAP and SurfaceNet.
The GPU memory required by MVSNet is related to the input image size and the depth sample number. In order to test on the Tanks and Temples with the original image resolution and sufficient depth hypotheses, we choose the Tesla P100 graphics card (16 GB) to implement our method. It is noteworthy that the training and validation on DTU dataset could be done using one consumer level GTX 1080ti graphics card (11 GB).
As mentioned in Sec. 4.1, DTU provides ground truth point clouds with normal information so that we can convert them into mesh surfaces for depth maps rendering. However, currently Tanks and Temples dataset does not provide the normal information or mesh surfaces, so we are unable to fine-tune MVSNet on Tanks and Temples for better performance.
Although using such rendered depth maps has already achieved satisfactory results, some limitations still exist: 1) the provided ground truth meshes are not complete, so some triangles behind the foreground will be falsely rendered to the depth map as valid pixels, which may deteriorate the training process. 2) If a pixel is occluded in all other views, it should not be used for training. However, without complete mesh surfaces we cannot correctly identify the occluded pixels. We hope future MVS datasets can provide ground truth depth maps with complete occlusion and background information.
We have presented a deep learning architecture for MVS reconstruction. The proposed MVSNet takes unstructured images as input and infers the depth map for the reference image in an end-to-end fashion. The core contribution of MVSNet is to encode the camera parameters as the differentiable homography to build the cost volume upon the camera frustum, which bridges the 2D feature extraction and 3D cost regularization networks. It has been demonstrated on the DTU dataset that MVSNet not only significantly outperforms previous methods, but is also several times more efficient in speed. Also, MVSNet has produced state-of-the-art results on the Tanks and Temples dataset without any fine-tuning, which demonstrates its strong generalization ability.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), https://www.tensorflow.org/, software available from tensorflow.org
Collins, R.T.: A space-sweep approach to true multi-image matching. Computer Vision and Pattern Recognition (CVPR) (1996)