Multi-view stereo (MVS) aims to reconstruct the observed 3D scene structure from its multi-view images, where both the intrinsic and extrinsic calibration of the cameras are available. Traditional geometry-based approaches exploit multi-view photometric consistency and various kinds of regularizations/priors [18, 9, 19], and binocular depth estimation [33, 34] has been extended to MVS. Existing deep CNN based MVS approaches [30, 31, 11, 24] tend to represent MVS as an end-to-end regression problem. By exploiting large-scale ground-truth 3D training data, these methods outperform traditional geometry-based approaches and dominate the leaderboards of different benchmarking datasets [30, 31]. However, the success of these supervised MVS approaches strongly depends on the availability of large-scale ground-truth 3D training data, which is not always available and may further hinder their generalization ability in never-seen-before open-world scenarios. Thus it is highly desirable to develop unsupervised learning based MVS approaches.
In this paper, we propose the first unsupervised deep MVS network as shown in Fig. 1, which can be learned in an end-to-end manner without using ground-truth depth maps as supervision signals. We demonstrate that the multi-view image warping errors (photometric consistency across different views) themselves are sufficient to drive a deep network to converge to a correct state that yields superior MVS performance. Our network structure differs from existing MVS networks and from a simple extension of unsupervised binocular stereo matching in the following aspects:
Our network is symmetric to all the views, i.e., it treats each view equivalently and predicts the depth map for each view simultaneously. Existing supervised learning based MVS methods [30, 31, 11, 27] apply an "asymmetric" design and infer the depth map for the reference image only. Thus, multiple depth maps estimated from different viewpoints do not comply with the same 3D geometry, and 3D point cloud processing is required to derive a consistent 3D geometry. We argue that this kind of "centralized" and "asymmetric" design does not fully exploit the multi-view relations encoded in the multi-view images.
We propose a new cross-view consistency on depth maps, built upon our multi-view symmetric network design. The underlying principle is that, since the multi-view images observe the same 3D scene structure from different viewpoints, the estimated depth maps from an MVS network should be consistent in 3D geometry. As our experiments demonstrate, this consistency plays a key role in strengthening the image warping error and guiding the network to converge to meaningful states.
We integrate multi-view occlusion reasoning into our network, which enables us to detect occluded regions by using the cross-view consistency of depth maps. Under our framework, multi-view depth map prediction and occlusion reasoning are alternately updated.
Our main contributions are summarized as follows:
We present the first deep unsupervised MVS approach, which naturally fills the gap between traditional geometry-based approaches and deep supervised MVS methods. Our proposed unsupervised method avoids the necessity of large-scale 3D training data.
We introduce the cross-view consistency in depth maps and propose a loss function to measure the consistency. We demonstrate that this kind of consistency could be utilized to guide the training of a deep neural network.
Extensive experiments conducted on the SUN3D, RGB-D, DTU and Scenes11 benchmarking datasets demonstrate the effectiveness and the excellent generalization ability of our method.
2 Related Work
MVS has been an active research topic in geometric vision. Existing methods can be roughly classified into two categories: 1) Geometry-based MVS and 2) Supervised learning based MVS. We will also discuss related work in unsupervised monocular and binocular depth estimation.
Geometry-based Multi-view Stereo: Traditional MVS methods focus on designing neighbor selection and photometric error measures for efficient and accurate reconstruction [5, 8, 4]. Furukawa et al. adopted geometric structures to reconstruct textured regions and applied Markov random fields to recover per-view depth maps. Langguth et al. used a shading-aware mechanism to improve the robustness of view selection. Wu et al. utilized lighting and shadow information to improve performance in ill-posed regions. Goesele et al. chose which images to match (both at a per-view and per-pixel level) to handle dramatic changes in lighting, scale, clutter, and other effects. Schönberger et al. proposed the COLMAP framework, which applies photometric and geometric priors to optimize view selection and uses geometric consistency to refine the depth maps.
Supervised Deep Multi-view Stereo: Different from the above geometry-based methods, learning-based approaches adopt convolution operations, whose powerful feature learning capability enables better pair-wise patch matching [32, 12, 14]. Ji et al. pre-warped the multi-view images to 3D space, then used CNNs to regularize the cost volume. Huang et al. proposed DeepMVS, which aggregates information across a set of unordered images. Kar et al. directly leveraged the camera parameters in the projection operation to form the cost volume, achieving an end-to-end network. Yao et al. adopted a variance-based cost metric to aggregate the cost volume, then applied 3D convolutions to regularize it and regress the depth map. Im et al. applied a plane sweeping approach to build a cost volume from deep features, then regularized the cost volume via context-aware aggregation to improve depth regression. Very recently, Yao et al. introduced a scalable MVS framework based on recurrent neural networks to reduce memory consumption.
Unsupervised Geometric Learning: Unsupervised learning has been developed for monocular depth estimation and binocular stereo matching by exploiting photometric consistency and regularization. Xie et al. proposed Deep3D to automatically convert 2D videos and images to the stereoscopic 3D format. Zhou et al. proposed an unsupervised monocular depth prediction method by minimizing the image reconstruction error. Mahjourian et al. explicitly considered the inferred 3D geometry of the whole scene, enforcing consistency of the estimated 3D point clouds and ego-motion across consecutive frames. Zhong et al. [33, 34] used the image warping error as the loss function to drive the learning process for estimating the disparity map.
3 Our Network
In this section, we present our unsupervised learning based multi-view stereo network, MVS², which can be learned without the need of ground-truth 3D data. We formulate MVS as the task of predicting a depth map for each view simultaneously, such that the estimated multiple depth maps comply with the underlying 3D geometry. Our network structure follows the MVSNet model proposed in [30], but with significant modifications to achieve unsupervised MVS with multi-view symmetry, i.e., MVS².
3.1 Multi-view Symmetric Network Design
Under the MVS configuration, each image observes the underlying 3D scene structure from a different viewpoint. Therefore, the estimated depth maps from an MVS network should be consistent in 3D geometry, and each depth map estimation is not independent. However, existing deep MVS networks [30, 11, 31] generally apply an "asymmetric" design and infer the depth map for one image (termed the "reference image") at a time. Thus, multiple depth maps estimated from different viewpoints do not necessarily comply with the same underlying 3D geometry.
In this paper, we propose a de-centralized and multi-view symmetric network structure for MVS, as illustrated in Fig. 1. Our network is symmetric to all the views, i.e., it treats each view equivalently and predicts the depth map for each view simultaneously. Our unsupervised deep MVS network consists of five modules, namely, multi-scale feature extraction, cost volume construction, cost volume regularization, depth map refinement through a spatial propagation network, and unsupervised loss evaluation. We briefly describe each module with a focus on how to achieve multi-view symmetry and how to enforce multi-view consistency.
3.1.1 Cost Volume Construction
Under our multi-view symmetry configuration, we need to estimate a depth map for each input view. Following the MVSNet network, a cost volume has to be constructed for each input view. Denote the feature map extracted by the feature extraction module for each view as F_i of size H × W × C, where H, W and C denote the image height, image width and feature dimension, respectively. We adopt the classical plane sweeping based stereo pipeline and use a differentiable homography to warp the current image into each of the remaining images, as shown in Fig. 2.
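For intuition, the plane-sweep homography between two calibrated views for a fronto-parallel plane at a given depth can be sketched as follows. This is a minimal NumPy sketch with an illustrative function name; the network uses a differentiable version of the same mapping inside TensorFlow.

```python
import numpy as np

def plane_sweep_homography(K_i, K_j, R, t, depth):
    """Homography mapping pixels of view i to view j for a fronto-parallel
    plane at the given depth in view i's camera frame.

    K_i, K_j: 3x3 intrinsics; R, t: relative pose from view i to view j
    (x_j = R @ x_i + t); depth: plane depth (scalar, same unit as t)."""
    n = np.array([[0.0, 0.0, 1.0]])  # plane normal in view i's frame
    H = K_j @ (R - (t.reshape(3, 1) @ n) / depth) @ np.linalg.inv(K_i)
    return H / H[2, 2]  # normalize the projective scale

# Sanity check: identical cameras and zero baseline map pixels to themselves.
H = plane_sweep_homography(np.eye(3), np.eye(3), np.eye(3), np.zeros(3), 1.0)
```

Sweeping the depth value over the sampled hypotheses yields one homography per depth slice, which is how the feature volumes below are produced.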
In this way, we obtain warped feature volumes for each depth value d, and we add the current view's feature volume into the group of warped feature volumes. Denote D as the depth sample number; we then obtain D groups of feature volumes. Finally, the multiple feature volumes are aggregated into one cost volume by the variance operation, which has been shown to be better than other operations such as the mean or sum.
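The variance aggregation itself is a single element-wise operation across views; a minimal NumPy sketch (shapes are illustrative, not the network's actual dimensions):

```python
import numpy as np

def variance_cost_volume(feature_volumes):
    """Aggregate N per-view feature volumes, each of shape (H, W, D, C),
    into one cost volume of shape (H, W, D, C) via the element-wise
    variance across the N views."""
    volumes = np.stack(feature_volumes, axis=0)  # (N, H, W, D, C)
    return volumes.var(axis=0)  # Var = E[V^2] - (E[V])^2 across views

# Identical feature volumes give zero cost everywhere: perfect
# photo-consistency at that depth hypothesis.
vols = [np.ones((4, 4, 8, 16)) for _ in range(3)]
cost = variance_cost_volume(vols)
```

Low variance across views signals that the depth hypothesis places the warped features in agreement, which is exactly what the regularization stage then sharpens into a probability volume.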
3.1.2 Cost Volume Regularization
The raw cost volume aggregated by the variance-based cost metric could be noise-contaminated, so we utilize a 3D CNN to regularize each raw cost volume and generate a probability volume. After that, we apply the ArgMin operation to regress the depth map for the current view. The cost volume regularization process is illustrated in Fig. 3.
As shown in Fig. 3, we apply a multi-scale 3D CNN to regularize the cost volume. The multi-scale 3D CNN consists of four scales, where each convolutional operation is followed by a BN layer and a ReLU layer. On this basis, we pass the feature maps between the same scales to form a residual architecture, avoiding the loss of critical information. The output of our regularization module is a 1-channel volume of dimension H/4 × W/4 × D.
Finally, we obtain an initial depth map by regression. We first use the softmax function along the depth dimension to convert the volume into a probability map P. Then, we apply the ArgMin operation to regress the depth map. The whole process is expressed as:

$$\hat{D}(p) = \sum_{d = d_{\min}}^{d_{\max}} d \cdot P(p, d),$$

where $d_{\min}$ and $d_{\max}$ denote the min and max depth values.
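The softmax-plus-expectation regression (the "soft argmin" of MVSNet) can be sketched in NumPy as follows; shapes and names are illustrative:

```python
import numpy as np

def soft_argmin_depth(scores, depth_values):
    """Regress a depth map as the probability-weighted mean over the
    sampled depth hypotheses.

    scores: (H, W, D) raw per-hypothesis scores from the regularized
    cost volume; depth_values: (D,) sampled depth hypotheses."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    prob = e / e.sum(axis=-1, keepdims=True)                 # P over depth
    return (prob * depth_values).sum(axis=-1)                # expected depth

depths = np.linspace(425.0, 935.0, 8)   # toy depth hypotheses (mm)
scores = np.zeros((2, 2, 8))            # uniform scores -> mean of the range
depth_map = soft_argmin_depth(scores, depths)
```

Because the expectation is differentiable, gradients from the unsupervised losses can flow back through the probability volume, unlike a hard argmin.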
3.1.3 Depth Map Refinement
Even though the initial depth map is already a qualified output, the reconstructed object boundaries may suffer from over-smoothing due to up-sampling. To tackle this problem and improve performance, we apply the spatial propagation network (SPN) [20] to refine the initial depth map. In this step, we obtain guidance from the feature extraction module, which produces an affinity matrix that is spatially dependent on the input image. We then adopt the affinity matrix to guide the refinement process.
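For intuition, one scan-line sweep of spatial propagation might look like the following simplified sketch. This is a hypothetical one-neighbour variant; the actual SPN of [20] uses three-way connections, four scan directions, and learned affinities.

```python
import numpy as np

def spn_sweep_left_to_right(depth, affinity):
    """One left-to-right sweep of spatial propagation: each pixel blends
    its own depth with its already-updated left neighbour, weighted by a
    per-pixel affinity in [0, 1] predicted from image features."""
    out = depth.astype(float).copy()
    for j in range(1, out.shape[1]):
        w = affinity[:, j]
        out[:, j] = (1.0 - w) * out[:, j] + w * out[:, j - 1]
    return out

# The middle pixel (affinity 0.5) is pulled toward its left neighbour;
# the right pixel (affinity 0) is left untouched.
d = np.array([[1.0, 5.0, 5.0]])
a = np.array([[0.0, 0.5, 0.0]])
refined = spn_sweep_left_to_right(d, a)
```

Because the affinities come from image features, propagation is strong inside smooth regions and suppressed across likely object boundaries, which is what recovers sharp edges in the refined depth map.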
3.2 Multi-view Occlusion Reasoning
Occlusion is inevitable in MVS, so we have to determine an occlusion mask to prevent occluded points from participating in the loss evaluation. Different from occlusion mask detection based on a forward-backward consistency check, we exploit pixel-wise cross-view depth consistency to obtain the occlusion mask. Specifically, given a pair of estimated depth maps, we can synthesize two versions of the first depth map by using the two depth maps and the warping relations in each direction between the views. The cross-view depth consistency check is illustrated in Fig. 4. Given perfect depth maps, the two synthesized depth maps should be identical up to occlusion. Therefore, we mark points whose two synthesized depths differ by more than a threshold as invalid. For the sake of robustness, we use cross-view depth consistency rather than brightness consistency. In Fig. 5, we present a visualization of the evolution of the occlusion mask, together with the corresponding source images and synthesized images. It can be observed that, as the number of iterations increases, the occlusion mask becomes more and more accurate.
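Once the two synthesized depth maps are available, the masking rule reduces to a thresholded comparison; a minimal sketch (the threshold value here is illustrative, not the paper's setting):

```python
import numpy as np

def occlusion_mask(depth_synth_a, depth_synth_b, tau=1.0):
    """Mark pixels where two independently synthesized versions of the
    same depth map disagree by more than tau as occluded.

    Returns a boolean mask: True = consistent (usable in the loss),
    False = occluded/invalid. tau is a hypothetical threshold."""
    return np.abs(depth_synth_a - depth_synth_b) <= tau

da = np.array([[10.0, 10.0], [10.0, 10.0]])
db = np.array([[10.2, 10.0], [15.0, 10.0]])  # one strongly inconsistent pixel
mask = occlusion_mask(da, db, tau=1.0)
```

In training, this mask and the depth maps are updated alternately: better depths sharpen the mask, and a sharper mask removes occluded pixels from the warping losses.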
3.3 Loss Functions for Unsupervised MVS
In this paper, we aim to develop an unsupervised learning framework that estimates a fine and smooth depth map for each input image. To optimize the quality of the depth maps, we adopt two kinds of loss functions: a view synthesis loss, which includes a unary term loss and a smoothness loss, and a cross-view consistency loss. Given multi-view images (I_1, I_2, ..., I_N), we first obtain the corresponding estimated depth maps during training. With the estimated depth map of one view and the given camera poses between two views, we can produce a synthesized version of the second view using the pixels of the first view and the mapping relation between them. Similarly, we can obtain the synthesized image of the first view and the reverse mapping relation. According to these bilateral mapping relations, we can further produce secondary synthesized images with the bilinear sampling method.
Our overall loss function can be formulated as follows:

$$\mathcal{L} = \sum_{i=1}^{N} \big( \mathcal{L}_{vs}^{i} + \mathcal{L}_{cc}^{i} \big),$$

where $N$ denotes the total number of selected views, and $\mathcal{L}_{vs}^{i}$ and $\mathcal{L}_{cc}^{i}$ stand for the view synthesis loss between view $i$ and its synthesized counterpart and the cross-view consistency loss, respectively.
3.3.1 View Synthesis Loss
The view synthesis loss between a view and its synthesized counterpart is defined as:

$$\mathcal{L}_{vs} = \mathcal{L}_{u} + \mathcal{L}_{s},$$

where $\mathcal{L}_{u}$ denotes the unary term loss and $\mathcal{L}_{s}$ denotes the depth field smoothness regularization loss.
Unary term loss. During the reconstruction process, we would like to minimize the discrepancy between the source image and the reconstructed image. Our loss consists of not only the distance between the images and their gradients, but also the structural similarity (SSIM). In order to further improve the robustness to brightness changes, we also exploit the Census transform to measure the difference. Thus, our unary term loss is defined as:

$$\mathcal{L}_{u} = \frac{1}{|M|} \sum_{p \in M} \Big( \alpha\, \frac{1 - \mathrm{SSIM}(I, \hat{I})(p)}{2} + \beta \big( |I - \hat{I}|(p) + |\nabla I - \nabla \hat{I}|(p) \big) + \gamma\, |C(I) - C(\hat{I})|(p) \Big),$$

where $M$ is the non-occluded mask selecting the valid points, $\mathrm{SSIM}$ denotes the structural similarity, which elevates the robustness of our loss, $\nabla$ denotes the gradient operator, and $C(\cdot)$ denotes the Census transform of an image. The trade-off weights $\alpha$, $\beta$ and $\gamma$ are fixed throughout the experiments.
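To make the Census term concrete, a minimal NumPy sketch of a windowed Census transform follows. The window size and padding are assumptions for illustration; the key property is that the code depends only on intensity orderings, so it is invariant to brightness offsets.

```python
import numpy as np

def census_transform(img, window=3):
    """Census transform of a grayscale image: each pixel becomes a bit
    vector recording whether each neighbour in the window is darker
    than the centre pixel. Comparing census codes (e.g. via Hamming
    distance) is robust to global brightness changes."""
    r = window // 2
    H, W = img.shape
    pad = np.pad(img, r, mode='edge')
    bits = []
    for dy in range(window):
        for dx in range(window):
            if dy == r and dx == r:
                continue  # skip the centre pixel itself
            bits.append((pad[dy:dy + H, dx:dx + W] < img).astype(np.uint8))
    return np.stack(bits, axis=-1)  # (H, W, window*window - 1)

# A constant brightness offset leaves the census code unchanged.
img = np.array([[1.0, 2.0], [3.0, 4.0]])
unchanged = np.array_equal(census_transform(img), census_transform(img + 10.0))
```

This brightness invariance is exactly why the Census term complements the raw intensity and SSIM terms under varying lighting across views.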
Smoothness regularization term loss. To encourage smoothness in the predicted depth map, the depth smoothness term is defined as:

$$\mathcal{L}_{s} = \frac{1}{N_p} \sum_{p} \big( |\nabla_x D(p)|\, e^{-|\nabla_x I(p)|} + |\nabla_y D(p)|\, e^{-|\nabla_y I(p)|} \big),$$

where $N_p$ denotes the total number of pixels.
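An edge-aware smoothness term of this kind can be sketched as follows; the exact weighting scheme is an assumption, but the structure (depth gradients down-weighted at image edges) is standard:

```python
import numpy as np

def smoothness_loss(depth, image):
    """Edge-aware smoothness: penalize depth gradients, down-weighted
    where the image itself has strong gradients, since intensity edges
    often coincide with true depth discontinuities."""
    dD_x = np.abs(np.diff(depth, axis=1))
    dD_y = np.abs(np.diff(depth, axis=0))
    dI_x = np.abs(np.diff(image, axis=1))
    dI_y = np.abs(np.diff(image, axis=0))
    return (dD_x * np.exp(-dI_x)).mean() + (dD_y * np.exp(-dI_y)).mean()

flat = smoothness_loss(np.ones((4, 4)), np.zeros((4, 4)))  # constant depth
```

A depth discontinuity that lines up with an image edge is penalized much less than the same discontinuity in a textureless region, which steers the network toward piecewise-smooth depth.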
3.3.2 Cross-view Consistency Loss
Besides the above brightness constancy loss, we also apply a new cross-view consistency loss by considering the consistency between the images and depth maps across views. We introduce the following two losses: the cross-view consistency loss $\mathcal{L}_{cv}$ and the multi-view brightness consistency loss $\mathcal{L}_{bc}$, which together form $\mathcal{L}_{cc}$.
The cross-view consistency loss consists of an image consistency loss defined on images and a depth consistency loss defined on depth maps. It can be formulated as:

$$\mathcal{L}_{cv} = \mathcal{L}_{ic} + \mathcal{L}_{dc}.$$
Given two images $I_i$ and $I_j$, we can produce a synthesized image $\hat{I}_j$ by using $I_i$, the estimated depth and the relative pose between them. Naturally, we can also produce the secondary synthesized image $\tilde{I}_i$ using $\hat{I}_j$ and their relative pose. Suppose the predicted depth maps are accurate; then the discrepancy between $I_i$ and $\tilde{I}_i$ should be very small, and vice versa. In order to further improve the robustness of the consistency loss, we also introduce another term that assesses the difference between $I_j$ and $\hat{I}_j$. The cross-view image consistency loss is defined as:

$$\mathcal{L}_{ic} = \frac{1}{|M|} \sum_{p \in M} \big( |I_i - \tilde{I}_i|(p) + |I_j - \hat{I}_j|(p) \big).$$
For the sake of robustness, we exploit the constraint between the predicted depth map $D_i$ and the synthesized depth map $\hat{D}_i$. Therefore, the cross-view depth consistency loss is defined as:

$$\mathcal{L}_{dc} = \frac{1}{|M|} \sum_{p \in M} |D_i - \hat{D}_i|(p).$$
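A masked L1 depth consistency of this form can be sketched as:

```python
import numpy as np

def depth_consistency_loss(depth, depth_synth, valid_mask):
    """Masked L1 between a predicted depth map and the depth map
    synthesized from another view; occluded pixels (mask=False) are
    excluded from the average."""
    diff = np.abs(depth - depth_synth)[valid_mask]
    return diff.mean() if diff.size else 0.0

d = np.array([[1.0, 2.0], [3.0, 4.0]])
ds = np.array([[1.0, 2.5], [9.0, 4.0]])
m = np.array([[True, True], [False, True]])  # occluded pixel masked out
loss = depth_consistency_loss(d, ds, m)      # mean of |0|, |0.5|, |0|
```

Note how the large error at the occluded pixel does not contaminate the loss: this interplay between the occlusion mask of Sec. 3.2 and the consistency terms is what keeps the gradients meaningful.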
Besides the above consistency losses, we also present a multi-view brightness consistency loss to strengthen the relationship between the other views and the reference view. Our multi-view brightness consistency loss is formulated as:

$$\mathcal{L}_{bc} = \frac{1}{|M|} \sum_{p \in M} \sum_{j \neq i} |I_i - \hat{I}_i^{\,j}|(p),$$

where $\hat{I}_i^{\,j}$ denotes view $i$ synthesized from view $j$. This loss evaluates the brightness constancy across views, i.e., multi-view consistency.
4 Experimental Results
To evaluate the performance of our proposed network MVS², we conducted experiments on widely used multi-view stereo datasets: DTU, SUN3D, RGB-D, MVS and Scenes11 (https://github.com/lmb-freiburg/demon). To align with other related works, we only trained our network on the training set of the DTU dataset, and directly tested it on the other datasets.
4.1 Implementation Details
Dataset: The DTU dataset is a large-scale multi-view stereo dataset consisting of 128 scenes, each containing 49 images captured under 7 different lighting conditions. For a fair comparison, we follow the experimental setting in [30]. We generate the ground-truth depth maps from the point clouds with the screened Poisson surface reconstruction method [15]. We choose scenes 1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, 118 as the testing set and the other scenes as the training set. The RGB-D, SUN3D, MVS and Scenes11 datasets contain more than 30000 different scenes in total, which are very different from the DTU dataset. We use these datasets to validate the strong generalization ability of our network.
Training Details: Our MVS² network is implemented in TensorFlow and trained on an NVIDIA V100 GPU. We train our model on the DTU training set, but test it on the DTU test set and the other datasets directly. The resolution of the predicted depth map is one quarter of the original input due to down-sampling. The depth hypotheses for DTU are uniformly sampled from 425mm to 935mm with a resolution of 2.6mm. For the other datasets, in order to align the depth range, we set the depth to start from 0.5mm with a depth sample resolution of 0.25.
For the hyper-parameters, we keep the same setting throughout the experiments. The batch size is set to 1 due to the memory limit. The models are trained with the RMSProp optimizer for 10 epochs, with a learning rate of 2e-4 for the first 2 epochs, decayed by a factor of 0.9 every two epochs.
Error Metrics: We use the standard metrics of a public benchmark suite for performance evaluation. These quantitative measures include the absolute relative error (Abs Rel), absolute difference error (Abs Diff), square relative error (Sq Rel), root mean square error and its log scale (RMSE and RMSE log), and the inlier ratios.
4.2 Comparison with SOTA Methods
To verify the performance of our MVS², we tested it on the widely used DTU dataset. First, we conducted extensive quantitative comparisons with recently published state-of-the-art (SOTA) methods. The comparison with other SOTA MVS methods is reported in Tab. 1, from which we can conclude that MVS² achieves higher completeness than the other SOTA MVS methods while achieving comparable performance under the other metrics. We applied a depth map fusion step to integrate the depth maps from different views into a unified point cloud representation, choosing gipuma [7] to fuse our depth maps. Qualitative comparisons of the 3D reconstructions are shown in Fig. 6, where MVS² achieves 3D reconstructions comparable with the state-of-the-art supervised MVS method.
|Error metric|Abs Rel|Abs Diff|Sq Rel|RMSE|RMSE log|
|(c) BC and CC|0.0147|11.3912|1.5478|28.4428|0.0156|
4.3 Ablation Studies
To analyze the contribution of the different modules of our network, we conduct three ablation studies on the DTU validation set. Quantitative results are reported in Tab. 2.
SPN Refinement. Under our network model, we introduce the spatial propagation network (SPN)  to refine the initial depth map. To analyze the contribution of this module, we conduct experimental comparison with and without this module and the results are reported in Tab. 2. It can be observed that when the SPN module is removed, the performance consistently drops. For example the Abs Diff increases from 11.3912 to 13.0339, and the Abs Rel increases from 0.0147 to 0.0175, which clearly demonstrates the effectiveness of the SPN refinement module.
Cost Volume. In building the cost volume, we exploit both the feature of the current view and the variance-based feature. To validate the effectiveness of our cost volume construction, we compare with a baseline implementation that uses the variance-based feature only, as in [30]. As illustrated in Tab. 2, when the feature of the current view is excluded from the cost volume, the performance consistently drops; for example, the Abs Rel jumps from 0.0147 to 0.0204 and the Abs Diff increases from 11.3912 to 15.1751. These results prove the effectiveness of our cost volume construction in exploiting the feature of the current view.
Consistency Loss. In this paper, we have proposed a consistency loss to further constrain the multiple estimated depth maps, which is also a key contribution. To analyze the contribution of this consistency loss, we conducted experiments with and without this loss term, and the results are reported in Tab. 2. When the cross-view consistency loss is removed from our unsupervised loss, the performance deteriorates sharply; for example, the Abs Rel shoots up from 0.0147 to 0.0355, while the Abs Diff increases from 11.3912 to 24.9464 and the Sq Rel jumps from 1.5478 to 5.2399. The experimental results clearly demonstrate the significance of our proposed consistency loss.
Besides the above ablation studies analyzing the contribution of our novel consistency loss term, since our consistency term actually consists of two terms (multi-view brightness consistency and cross-view consistency in depth maps), we also conducted two additional experiments to analyze the effectiveness of each term; the corresponding results are reported in Tab. 3. From Tab. 3, we can draw the following conclusions: 1) Both the multi-view brightness consistency term (BC) and the cross-view consistency term (CC) are critical for achieving improved performance; 2) the cross-view consistency term (CC) plays a more important role than the multi-view brightness consistency term (BC) in depth map estimation.
|Datasets||Method||Abs Rel||Abs Diff||Sq Rel||RMSE||RMSE log|
4.4 Generalization Ability
As agreed in monocular depth estimation and binocular stereo matching, supervised depth estimation methods strongly depend on the availability of large-scale ground-truth 3D data, and their generalization ability can be hindered when evaluated on never-seen-before open-world scenarios. Here, we verify the generalization ability of our unsupervised MVS² network. We conducted experiments on the SUN3D, RGB-D, MVS and Scenes11 datasets using our pre-trained model without any fine-tuning. In Table 4, we compare the performance of our MVS² with state-of-the-art traditional MVS methods and supervised MVS methods. We can conclude from Table 4 that: 1) Our MVS² outperforms the state-of-the-art traditional geometry-based multi-view method COLMAP by a wide margin, which shows the benefit of exploiting large-scale datasets; 2) compared with supervised MVS methods trained on each dataset individually, our MVS², even though trained only on the DTU training set, outperforms the current state-of-the-art supervised MVS method DeepMVS on part of the error metrics. A qualitative comparison between our MVS² and competing MVS methods (COLMAP, DeMoN, DeepMVS) on the RGB-D dataset is shown in Fig. 8, where our method consistently achieves performance comparable with SOTA supervised methods.
We also conducted experiments on the Tanks and Temples dataset without any fine-tuning to further validate the generalization ability of our network. Qualitative point cloud results for the selected scenes are presented in Fig. 7, where our MVS² reconstructs very detailed 3D structures.
5 Conclusion

In this paper, we have proposed the first unsupervised learning based MVS network, which learns the depth map for each view simultaneously without the need of ground-truth 3D data. With our proposed multi-view symmetric network design, we can enforce the cross-view consistency of depth maps during both training and testing. Our learned multi-view depth maps comply with the underlying 3D geometry. Our network learns the multi-view occlusion maps in an alternating way. Experimental results on multiple benchmarking datasets demonstrate the effectiveness and excellent generalization ability of our network. In the future, we plan to extend the depth consistency beyond pairwise relations, such as consistency inside a clique. Extension to dynamic scenes could be another interesting future direction.
This research was supported in part by the Natural Science Foundation of China grants (61871325, 61420106007, 61671387). We thank all anonymous reviewers for their valuable comments.
-  Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120(2):153–168, 2016.
-  Neill D. F. Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European Conference on Computer Vision, 2008.
-  Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
-  Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  Yasutaka Furukawa, Carlos Hernández, et al. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015.
-  Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009.
-  Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conference on Computer Vision, 2015.
-  David Gallup, Jan Michael Frahm, and Marc Pollefeys. Piecewise planar and non-planar stereo for urban scene reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  C. Godard, O. M. Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6602–6611, July 2017.
-  Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for community photo collections. In IEEE International Conference on Computer Vision, pages 1–8, 2007.
-  P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang. Deepmvs: Learning multi-view stereopsis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018.
-  Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. Surfacenet: An end-to-end 3d neural network for multiview stereopsis. In IEEE International Conference on Computer Vision, 2017.
-  H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):190–204, Jan 2019.
-  Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, 2017.
-  Michael M. Kazhdan, Matthew Bolitho, and Hugues Hoppe. Screened poisson surface reconstruction. Acm Transactions on Graphics, 32(3):1–13, 2013.
-  Arno Knapitsch, Jaesik Park, Qian Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. Acm Transactions on Graphics, 36(4):78, 2017.
-  Fabian Langguth, Kalyan Sunkavalli, Sunil Hadap, and Michael Goesele. Shading-aware multi-view stereo. In European Conference on Computer Vision, pages 469–485, 2016.
-  Bo Li, Yuchao Dai, and Mingyi He. Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition, 83:328–339, 2018.
-  Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In IEEE Conference on Computer Vision and Pattern Recognition, June 2015.
-  Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems, pages 1520–1530, 2017.
-  R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5667–5675, June 2018.
-  Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518, 2016.
-  Narayanan Sundaram, Thomas Brox, and Kurt Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In European Conference on Computer Vision, 2010.
-  Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. Dpsnet: End-to-end deep plane sweep stereo. In International Conference on Learning Representations, 2019.
-  Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision & Applications, 23(5):903–920, 2012.
-  B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5622–5631, 2017.
-  Kaixuan Wang and Shaojie Shen. Mvdepthnet: Real-time multiview depth estimation neural network. In International Conference on 3D Vision, pages 248–257, 2018.
-  Chenglei Wu, Bennett Wilburn, Yasuyuki Matsushita, and Christian Theobalt. High-quality shape from multi-view stereo and shading under general illumination. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
-  Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In European Conference on Computer Vision, pages 842–857, 2016.
-  Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision, 2018.
-  Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In IEEE Conference on Computer Vision and Pattern Recognition, June 2019.
-  Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
-  Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-supervised learning for stereo matching with self-improving ability. arXiv preprint, 2017.
-  Yiran Zhong, Hongdong Li, and Yuchao Dai. Open-world stereo video matching with deep rnn. In European Conference on Computer Vision, September 2018.
-  Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision, 2018.