Code of single-view depth prediction algorithm on Internet Photos described in "MegaDepth: Learning Single-View Depth Prediction from Internet Photos, Z. Li and N. Snavely, CVPR 2018".
Single-view depth prediction is a fundamental problem in computer vision. Recently, deep learning methods have led to significant progress, but such methods are limited by the available training data. Current datasets based on 3D sensors have key limitations, including indoor-only images (NYU), small numbers of training examples (Make3D), and sparse sampling (KITTI). We propose to use multi-view Internet photo collections, a virtually unlimited data source, to generate training data via modern structure-from-motion and multi-view stereo (MVS) methods, and present a large depth dataset called MegaDepth based on this idea. Data derived from MVS comes with its own challenges, including noise and unreconstructable objects. We address these challenges with new data cleaning methods, as well as automatically augmenting our data with ordinal depth relations generated using semantic segmentation. We validate the use of large amounts of Internet data by showing that models trained on MegaDepth exhibit strong generalization, not only to novel scenes, but also to other diverse datasets including Make3D, KITTI, and DIW, even when no images from those datasets are seen during training.
Figure 1: (a) Internet photo of the Colosseum, (b) image from Make3D, and (e) image from KITTI, with our single-view depth predictions in (c), (d), and (f), respectively.
Predicting 3D shape from a single image is an important capability of visual reasoning, with applications in robotics, graphics, and other vision tasks such as intrinsic images. While single-view depth estimation is a challenging, underconstrained problem, deep learning methods have recently driven significant progress. Such methods thrive when trained with large amounts of data. Unfortunately, fully general training data in the form of (RGB image, depth map) pairs is difficult to collect. Commodity RGB-D sensors such as Kinect have been widely used for this purpose, but are limited to indoor use. Laser scanners have enabled important datasets such as Make3D and KITTI, but such devices are cumbersome to operate (in the case of industrial scanners), or produce sparse depth maps (in the case of LIDAR). Moreover, both Make3D and KITTI are collected in specific scenarios (a university campus, and atop a car, respectively). Training data can also be generated through crowdsourcing, but this approach has so far been limited to gathering sparse ordinal relationships or surface normals [12, 4, 5].
In this paper, we explore the use of a nearly unlimited source of data for this problem: images from the Internet from overlapping viewpoints, from which structure-from-motion (SfM) and multi-view stereo (MVS) methods can automatically produce dense depth. Such images have been widely used in research on large-scale 3D reconstruction [35, 14, 2, 8]. We propose to use the outputs of these systems as the inputs to machine learning methods for single-view depth prediction. By using large amounts of diverse training data from photos taken around the world, we seek to learn to predict depth with high accuracy and generalizability. Based on this idea, we introduce MegaDepth (MD), a large-scale depth dataset generated from Internet photo collections, which we make fully available to the community.
To our knowledge, ours is the first use of Internet SfM+MVS data for single-view depth prediction. Our main contribution is the MD dataset itself. In addition, in creating MD, we found that care must be taken in preparing a dataset from noisy MVS data, and so we also propose new methods for processing raw MVS output, and a corresponding new loss function for training models with this data. Notably, because MVS tends not to reconstruct dynamic objects (people, cars, etc.), we augment our dataset with ordinal depth relationships automatically derived from semantic segmentation, and train with a joint loss that includes an ordinal term. In our experiments, we show that by training on MD, we can learn a model that works well not only on images of new scenes, but that also generalizes remarkably well to completely different datasets, including Make3D, KITTI, and DIW, achieving much better generalization than prior datasets. Figure 1 shows example results spanning different test sets from a network trained solely on our MD dataset.
Single-view depth prediction. A variety of methods have been proposed for single-view depth prediction, most recently by utilizing machine learning [15, 28]. A standard approach is to collect RGB images with ground truth depth, and then train a model (e.g., a CNN) to predict depth from RGB [7, 22, 23, 27, 3, 19]. Most such methods are trained on a few standard datasets, such as NYU [33, 34], Make3D, and KITTI, which are captured using RGB-D sensors (such as Kinect) or laser scanning. Such scanning methods have important limitations, as discussed in the introduction. Recently, Novotny et al. trained a network on 3D models derived from SfM+MVS on videos to learn 3D shapes of single objects. However, their method is limited to images of objects, rather than scenes.
Multiple views of a scene can also be used as an implicit source of training data for single-view depth prediction, by utilizing view synthesis as a supervisory signal [38, 10, 13, 43]. However, view synthesis is only a proxy for depth, and may not always yield high-quality learned depth. Ummenhofer et al. trained from overlapping image pairs taken with a single camera, and learned to predict image matches, camera poses, and depth. However, their method requires two input images at test time.
Ordinal depth prediction. Another way to collect depth data for training is to ask people to manually annotate depth in images. While labeling absolute depth is challenging, people are good at specifying relative (ordinal) depth relationships (e.g., closer-than, further-than). Zoran et al. used such relative depth judgments to predict ordinal relationships between points using CNNs. Chen et al. leveraged crowdsourcing of ordinal depth labels to create a large dataset called “Depth in the Wild”. While useful for predicting depth ordering (and so we incorporate ordinal data automatically generated from our imagery), the Euclidean accuracy of depth learned solely from ordinal data is limited.
Depth estimation from Internet photos. Estimating geometry from Internet photo collections has been an active research area for a decade, with advances in both structure from motion [35, 2, 37, 30] and multi-view stereo [14, 9, 32]. These techniques generally operate on 10s to 1000s of images. Using such methods, past work has used retrieval and SfM to build a 3D model seeded from a single image, or registered a photo to an existing 3D model to transfer depth. However, this work requires either having a detailed 3D model of each location in advance, or building one at run-time. Instead, we use SfM+MVS to train a network that generalizes to novel locations and scenarios.
In this section, we describe how we construct our dataset. We first download Internet photos from Flickr for a set of well-photographed landmarks from the Landmarks10K dataset. We then reconstruct each landmark in 3D using state-of-the-art SfM and MVS methods. This yields an SfM model as well as a dense depth map for each reconstructed image. However, these depth maps have significant noise and outliers, and training a deep network on this raw depth data will not yield a useful predictor. Therefore, we propose a series of processing steps that prepare these depth maps for use in learning, and additionally use semantic segmentation to automatically generate ordinal depth data.
We build a 3D model from each photo collection using Colmap, a state-of-the-art SfM system (for reconstructing camera poses and sparse point clouds) and MVS system (for generating dense depth maps). We use Colmap because we found that it produces high-quality 3D models via its careful incremental SfM procedure, but other such systems could be used. Colmap produces a depth map for every reconstructed photo (where some pixels of the depth map can be empty if Colmap was unable to recover a depth), as well as other outputs, such as camera parameters and sparse SfM points plus camera visibility.
The raw depth maps from Colmap contain many outliers from a range of sources, including: (1) transient objects (people, cars, etc.) that appear in a single image but nonetheless are assigned (incorrect) depths, (2) noisy depth discontinuities, and (3) bleeding of background depths into foreground objects. Other MVS methods exhibit similar problems due to inherent ambiguities in stereo matching. Figure 2(b) shows two example depth maps produced by Colmap that illustrate these issues.
Figure 2: (a) Input photo, (b) raw depth, (c) refined depth.
Such outliers have a highly negative effect on the depth prediction networks we seek to train. To address this problem, we propose two new depth refinement methods designed to generate high-quality training data:
First, we devise a modified MVS algorithm based on Colmap, but more conservative in its depth estimates, based on the idea that we would prefer less training data over bad training data. Colmap computes depth maps iteratively, at each stage trying to ensure geometric consistency between nearby depth maps. One adverse effect of this strategy is that background depths can tend to “eat away” at foreground objects, because one way to increase consistency between depth maps is to consistently predict the background depth (see Figure 2 (top)). To counter this effect, at each depth inference iteration in Colmap, we compare the depth values at each pixel before and after the update and keep the smaller (closer) of the two. We then apply a median filter to remove unstable depth values. We describe our modified MVS algorithm in detail in the supplemental material.
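The per-pixel update described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the modified Colmap code: depth maps are plain nested lists, `0.0` marks a pixel with no recovered depth, and the "median filter" step is read as rejecting pixels that deviate strongly from their 3x3 median. The names `keep_closer`, `reject_unstable`, and the tolerance `rel_tol` are hypothetical.

```python
import statistics

EMPTY = 0.0  # assumed marker for "no depth recovered"


def keep_closer(prev_depth, new_depth):
    """Per pixel, keep the smaller (closer) of the depths before and after an update."""
    merged = []
    for prev_row, new_row in zip(prev_depth, new_depth):
        row = []
        for p, q in zip(prev_row, new_row):
            if p > EMPTY and q > EMPTY:
                row.append(min(p, q))
            else:
                # Assumption: if only one value is valid, keep it; if none, stay empty.
                row.append(max(p, q))
        merged.append(row)
    return merged


def reject_unstable(depth, rel_tol=0.2):
    """Drop pixels whose depth differs from the 3x3 median by more than rel_tol (relative)."""
    h, w = len(depth), len(depth[0])
    cleaned = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            d = depth[y][x]
            if d <= EMPTY:
                continue
            window = [
                depth[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
                if depth[yy][xx] > EMPTY
            ]
            med = statistics.median(window)
            if abs(d - med) > rel_tol * med:
                cleaned[y][x] = EMPTY
    return cleaned
```

In a real pipeline `keep_closer` would run once per depth inference iteration, with the median-based cleanup applied at the end.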
Second, we utilize semantic segmentation to enhance and filter the depth maps, and to yield large amounts of ordinal depth comparisons as additional training data. The second row of Figure 2 shows an example depth map computed with our object-aware filtering. We now describe our use of semantic segmentation in detail.
Multi-view stereo methods can have problems with a number of object types, including transient objects such as people and cars, difficult-to-reconstruct objects such as poles and traffic signals, and sky regions. However, if we can understand the semantic layout of an image, then we can attempt to mitigate these issues, or at least identify problematic pixels. We have found that deep learning methods for semantic segmentation are starting to become reliable enough for this use.
We propose three new uses of semantic segmentation in the creation of our dataset. First, we use such segmentations to remove spurious MVS depths in foreground regions. Second, we use the segmentation as a criterion to categorize each photo as providing either Euclidean depth or ordinal depth data. Finally, we combine semantic information and MVS depth to automatically annotate ordinal depth relationships, which can be used to help training in regions that cannot be reconstructed by MVS.
Semantic filtering. To process a given photo, we first run semantic segmentation using PSPNet, a recent segmentation method, trained on the MIT Scene Parsing dataset (consisting of 150 semantic categories). We then divide the pixels into three subsets by predicted semantic category:
Foreground objects, corresponding to objects that often appear in the foreground of scenes, including static foreground objects (e.g., statues, fountains) and dynamic objects (e.g., people, cars).
Background objects, including buildings, towers, mountains, etc. (See supplemental material for full details of the foreground/background classes.)
Sky, which is treated as a special case in the depth filtering described below.
We use this semantic categorization of pixels in several ways. As illustrated in Figure 2 (bottom), transient objects such as people can result in spurious depths. To remove these from each image, we consider each connected component of the foreground mask. If fewer than 50% of the pixels in a component have a reconstructed depth, we discard all depths from that component. We use a threshold of 50%, rather than simply removing all foreground depths, because pixels on certain foreground objects (such as sculptures) can indeed be accurately reconstructed (and we found that PSPNet can sometimes mistake sculptures and people for one another). This simple filtering of foreground depths yields large improvements in depth map quality. Additionally, we remove reconstructed depths that fall inside the sky region, as such depths tend to be spurious.
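The foreground and sky filtering can be illustrated with a small sketch. It assumes nested-list depth maps with `0.0` for missing depth, boolean-like masks of the same shape, and 4-connectivity for connected components; the connectivity choice and the function name `filter_depth` are assumptions, not details given in the text.

```python
from collections import deque


def filter_depth(depth, fg_mask, sky_mask, min_fraction=0.5):
    """Discard depths in sparsely reconstructed foreground components and in sky."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    seen = [[False] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if not fg_mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search collects one 4-connected foreground component.
            component, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and fg_mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            valid = sum(1 for y, x in component if depth[y][x] > 0)
            # Fewer than min_fraction of pixels reconstructed: drop the whole component.
            if valid < min_fraction * len(component):
                for y, x in component:
                    out[y][x] = 0.0
    # Depths inside the sky region are always removed.
    for y in range(h):
        for x in range(w):
            if sky_mask[y][x]:
                out[y][x] = 0.0
    return out
```

Keeping well-reconstructed components intact is what lets accurately reconstructed sculptures survive even when the segmentation labels them as foreground.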
Euclidean vs. ordinal depth. For each 3D model we have thousands of reconstructed Internet photos, and ideally we would use as much of this depth data as possible for training. However, some depth maps are more reliable than others, due to factors such as the accuracy of the estimated camera pose or the presence of large occluders. Hence, we found that it is beneficial to limit training to a subset of highly reliable depth maps. We devise a simple but effective way to compute a subset of high-quality depth maps, by thresholding by the fraction of reconstructed pixels. In particular, if a sufficiently large fraction of an image (ignoring the sky region) consists of valid depth values, then we keep that image as training data for learning Euclidean depth. This criterion prefers images without large transient foreground objects (e.g., “no selfies”). At the same time, such foreground-heavy images are extremely useful for another purpose: automatically generating training data for learning ordinal depth relationships.
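A minimal sketch of this selection rule follows. The threshold is passed in as the hypothetical parameter `min_valid`, since its value is not stated here; depth maps are nested lists with `0.0` for missing depth, and the function names are illustrative.

```python
def valid_fraction(depth, sky_mask):
    """Fraction of non-sky pixels that carry a reconstructed depth."""
    total = 0
    valid = 0
    for depth_row, sky_row in zip(depth, sky_mask):
        for d, is_sky in zip(depth_row, sky_row):
            if is_sky:
                continue  # sky pixels are ignored entirely
            total += 1
            if d > 0:
                valid += 1
    return valid / total if total else 0.0


def assign_subset(depth, sky_mask, min_valid):
    """Label an image as Euclidean or ordinal training data (min_valid is an assumed input)."""
    if valid_fraction(depth, sky_mask) >= min_valid:
        return "euclidean"
    return "ordinal"
```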
Automatic ordinal depth labeling. As noted above, transient or difficult to reconstruct objects, such as people, cars, and street signs are often missing from MVS reconstructions. Therefore, using Internet-derived data alone, we will lack ground truth depth for such objects, and will likely do a poor job of learning to reconstruct them. To address this issue, we propose a novel method of automatically extracting ordinal depth labels from our training images based on their estimated 3D geometry and semantic segmentation.
Let us denote as $O$ (“Ordinal”) the subset of photos that do not satisfy the “no selfies” criterion described above. For each image in $O$, we compute two regions, a foreground region (based on semantic information) and a background region (based on 3D geometry information), such that all pixels in the foreground region are likely closer to the camera than all pixels in the background region. Briefly, the foreground region consists of large connected components of the foreground mask, and the background region consists of large components of the background mask that also contain valid depths in the last quartile of the full depth range for the image (see supplementary for full details). We found this simple approach works very well, achieving high accuracy in pairwise ordinal relationships, likely because natural photos tend to be composed in certain common ways. Several examples of our automatic ordinal depth labels are shown in Figure 3.
We use the approach above to densely reconstruct 200 3D models from landmarks around the world, representing about 150K reconstructed images. After our proposed filtering, we are left with 130K valid images. Of these 130K photos, around 100K images are used for Euclidean depth data, and the remaining 30K images are used to derive ordinal depth data. We also include images from  in our training set. Together, this data comprises the MegaDepth (MD) dataset, available at http://www.cs.cornell.edu/projects/megadepth/.
This section presents our end-to-end deep learning algorithm for predicting depth from a single photo.
The 3D data produced by SfM+MVS is only up to an unknown scale factor, so we cannot compare predicted and ground truth depths directly. However, as noted by Eigen and Fergus, the ratios of pairs of depths are preserved under scaling (or, in the log-depth domain, the difference between pairs of log-depths). Therefore, we solve for a depth map in the log domain and train using a scale-invariant loss function, $\mathcal{L}_{si}$, which combines three terms weighted by $\alpha$ and $\beta$:
$$\mathcal{L}_{si} = \mathcal{L}_{data} + \alpha \mathcal{L}_{grad} + \beta \mathcal{L}_{ord} \quad (1)$$
Scale-invariant data term. We adopt the loss of Eigen and Fergus, which computes the mean square error (MSE) of the difference between all pairs of log-depths in linear time. Suppose we have a predicted log-depth map $L$ and a ground truth log-depth map $L^*$. $L_i$ and $L^*_i$ denote corresponding individual log-depth values indexed by pixel position $i$. We denote $R_i = L_i - L^*_i$ and define:
$$\mathcal{L}_{data} = \frac{1}{n}\sum_{i=1}^{n} R_i^2 - \frac{1}{n^2}\Big(\sum_{i=1}^{n} R_i\Big)^2 \quad (2)$$
where $n$ is the number of valid depths in the ground truth depth map.
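To make the "all pairs in linear time" claim concrete, the sketch below computes the data term two ways: the linear-time form (mean of squared residuals minus squared mean residual) and a quadratic-time reference that averages squared differences over all residual pairs. It assumes flat lists of log-depths with `None` marking invalid ground truth; the function names are illustrative.

```python
import math


def data_term(pred_log, gt_log):
    """Linear-time scale-invariant data term over valid ground-truth pixels."""
    residuals = [p - g for p, g in zip(pred_log, gt_log) if g is not None]
    n = len(residuals)
    if n == 0:
        return 0.0
    mean_sq = sum(r * r for r in residuals) / n
    mean = sum(residuals) / n
    return mean_sq - mean * mean


def data_term_pairwise(pred_log, gt_log):
    """Quadratic-time reference: squared residual differences over all pairs, divided by 2n^2."""
    residuals = [p - g for p, g in zip(pred_log, gt_log) if g is not None]
    n = len(residuals)
    if n == 0:
        return 0.0
    total = sum((ri - rj) ** 2 for ri in residuals for rj in residuals)
    return total / (2 * n * n)


# A prediction that is a scaled copy of the ground truth incurs (numerically) no loss.
gt = [math.log(d) for d in (1.0, 2.0, 4.0)]
pred = [math.log(3.0 * d) for d in (1.0, 2.0, 4.0)]
print(data_term(pred, gt))
```

The two functions agree because summing $(R_i - R_j)^2$ over all ordered pairs equals $2n\sum_i R_i^2 - 2(\sum_i R_i)^2$.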
Multi-scale scale-invariant gradient matching term. To encourage smoother gradient changes and sharper depth discontinuities in the predicted depth map, we introduce a multi-scale scale-invariant gradient matching term $\mathcal{L}_{grad}$, defined as an $\ell_1$ penalty on differences in log-depth gradients between the predicted and ground truth depth map:
$$\mathcal{L}_{grad} = \frac{1}{n}\sum_{k}\sum_{i}\left(\left|\nabla_x R_i^k\right| + \left|\nabla_y R_i^k\right|\right) \quad (3)$$
where $R_i^k$ is the value of the log-depth difference map at position $i$ and scale $k$. Because the loss is computed at multiple scales, $\mathcal{L}_{grad}$ captures depth gradients across large image distances. In our experiments, we use four scales. We illustrate the effect of $\mathcal{L}_{grad}$ in Figure 4.
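A rough sketch of a multi-scale gradient matching penalty is given below. Several details are assumptions rather than the paper's implementation: scales are produced by subsampling every 2^k-th pixel, gradients are forward differences between adjacent valid pixels, and each scale is normalized by its own count of valid pixels. Maps are nested lists of log-depths with `None` for invalid ground truth.

```python
def _single_scale(pred_log, gt_log):
    """L1 penalty on x and y forward differences of the log-depth difference map."""
    h, w = len(gt_log), len(gt_log[0])
    diff = [
        [None if gt_log[y][x] is None else pred_log[y][x] - gt_log[y][x] for x in range(w)]
        for y in range(h)
    ]
    n = sum(1 for row in diff for v in row if v is not None)
    if n == 0:
        return 0.0
    acc = 0.0
    for y in range(h):
        for x in range(w):
            if diff[y][x] is None:
                continue
            if x + 1 < w and diff[y][x + 1] is not None:
                acc += abs(diff[y][x + 1] - diff[y][x])
            if y + 1 < h and diff[y + 1][x] is not None:
                acc += abs(diff[y + 1][x] - diff[y][x])
    return acc / n


def gradient_matching_term(pred_log, gt_log, num_scales=4):
    """Sum the single-scale penalty over subsampled copies of the maps."""
    total = 0.0
    for k in range(num_scales):
        step = 2 ** k  # assumed subsampling factor at scale k
        pred_k = [row[::step] for row in pred_log[::step]]
        gt_k = [row[::step] for row in gt_log[::step]]
        total += _single_scale(pred_k, gt_k)
    return total
```

Because the penalty acts on differences of the residual map, adding a constant to every predicted log-depth (a global rescaling of depth) leaves it unchanged.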
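Figure 4: Effect of the gradient matching term: input photo, output without the term, and output with the term.
Figure 5: Effect of the ordinal depth loss: input photo, output without the term, and output with the term.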
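Robust ordinal depth loss. Inspired by Chen et al., our ordinal depth loss term $\mathcal{L}_{ord}$ utilizes the automatic ordinal relations described in Section 3.3. During training, for each image in our ordinal set $O$, we pick a single pair of pixels $(i, j)$, with pixels $i$ and $j$ each belonging to either the foreground region or the background region. $\mathcal{L}_{ord}$ is designed to be robust to the small number of incorrectly ordered pairs:
$$\mathcal{L}_{ord} = \begin{cases} \log\left(1 + \exp(P_{ij})\right) & \text{if } P_{ij} \le \tau \\ \log\left(1 + \exp\left(\sqrt{P_{ij}}\right)\right) + c & \text{if } P_{ij} > \tau \end{cases} \quad (4)$$
where $P_{ij} = -r^*_{ij}(L_i - L_j)$, $L_i$ is the predicted log-depth at pixel $i$, and $r^*_{ij}$ is the automatically labeled ordinal depth relation between $i$ and $j$ ($r^*_{ij} = 1$ if pixel $i$ is further than $j$ and $-1$ otherwise). $c$ is a constant set so that $\mathcal{L}_{ord}$ is continuous. $\mathcal{L}_{ord}$ encourages the depth difference of a pair of points to be large (and ordered) if our automatic labeling method judged the pair to have a likely depth ordering. We illustrate the effect of $\mathcal{L}_{ord}$ in Figure 5. In our tests, we set $\tau$ based on cross-validation.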
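One way to realize a robust ordinal term of this kind is sketched below: a softplus penalty for small violations that switches to a slower-growing square-root branch above a threshold, with the constant chosen so the two branches meet. The exact branch shapes are an assumption consistent with the description (a threshold, a continuity constant, robustness to mislabeled pairs), and the value of `tau` in the example is purely illustrative.

```python
import math


def ordinal_term(log_i, log_j, relation, tau):
    """Robust ordinal loss for one pixel pair.

    relation is +1 if pixel i was labeled further than pixel j, and -1 otherwise.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    p = -relation * (log_i - log_j)  # positive when the prediction violates the label
    if p <= tau:
        return math.log1p(math.exp(p))
    # Constant that makes the two branches agree at p == tau.
    c = math.log1p(math.exp(tau)) - math.log1p(math.exp(math.sqrt(tau)))
    return math.log1p(math.exp(math.sqrt(p))) + c


# Illustrative threshold only; the value used in the experiments is not assumed here.
tau = 0.5
print(ordinal_term(2.0, 0.0, 1, tau) < ordinal_term(0.0, 2.0, 1, tau))  # True
```

The square-root branch limits how strongly a badly violated (and possibly mislabeled) pair can dominate the gradient, which is the sense in which the term is robust.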
In this section, we evaluate our networks on a number of datasets, and compare to several state-of-art depth prediction algorithms, trained on a variety of training data. In our evaluation, we seek to answer several questions, including:
How well does our model trained on MD generalize to new Internet photos from never-before-seen locations?
How important is our depth map processing? What is the effect of the terms in our loss function?
How well does our model trained on MD generalize to other types of images from other datasets?
The third question is perhaps the most interesting, because the promise of training on large amounts of diverse data is good generalization. Therefore, we run a set of experiments training on one dataset and testing on another, and show that our MD dataset gives the best generalization performance.
We also show that our depth refinement strategies are essential for achieving good generalization, and show that our proposed loss function—combining scale-invariant data terms with an ordinal depth loss—improves prediction performance both quantitatively and qualitatively.
Experimental setup. Out of the 200 reconstructed models in our MD dataset, we randomly select 46 to form a test set (locations not seen during training). For the remaining 154 models, we randomly split images from each model into training and validation sets with a ratio of 96% and 4% respectively. We set the loss term weights $\alpha$ and $\beta$ using the MD validation set. We implement our networks in PyTorch, and train using Adam for 20 epochs with batch size 32.
For fair comparison, we train and validate our network using MD data for all experiments. Due to variance in performance of cross-dataset testing, we train four models on MD and compute the average error (see supplemental material for the performance of each individual model).
In this subsection, we describe experiments where we train on our MD training set and test on the MD test set.
Error metrics. For numerical evaluation, we use two scale-invariant error measures (as with our loss function, we use scale-invariant measures due to the scale-free nature of SfM models). The first measure is the scale-invariant RMSE (si-RMSE) (Equation 2), which measures precise numerical depth accuracy. The second measure is based on the preservation of depth ordering. In particular, we use a measure similar to [44, 4] that we call the SfM Disagreement Rate (SDR). SDR is based on the rate of disagreement with ordinal depth relationships derived from estimated SfM points. We use sparse SfM points rather than dense MVS because we found that sparse SfM points capture some structures not reconstructed by MVS (e.g., complex objects such as lampposts). We define $SDR(D, D^*)$, the ordinal disagreement rate between the predicted (non-log) depth map $D$ and ground-truth SfM depths $D^*$, as:
$$SDR(D, D^*) = \frac{1}{n}\sum_{(i,j)\in\mathcal{P}} \mathbb{1}\left(\mathrm{ord}(D_i, D_j) \ne \mathrm{ord}(D^*_i, D^*_j)\right) \quad (5)$$
where $\mathcal{P}$ is the set of pairs of pixels with available SfM depths to compare, $n$ is the total number of pairwise comparisons, and $\mathrm{ord}(\cdot, \cdot)$ is one of three depth relations (further-than, closer-than, and same-depth-as):
$$\mathrm{ord}(D_i, D_j) = \begin{cases} 1 & \text{if } D_i / D_j > 1 + \delta \\ -1 & \text{if } D_i / D_j < 1 - \delta \\ 0 & \text{otherwise} \end{cases} \quad (6)$$
We also define $SDR^{=}$ and $SDR^{\ne}$ as the disagreement rate with $\mathrm{ord}(D^*_i, D^*_j) = 0$ and $\mathrm{ord}(D^*_i, D^*_j) \ne 0$ respectively. In our experiments, we choose the tolerance $\delta$ to allow for uncertainty in SfM points. For efficiency, we sample SfM points from the full set to compute this error term.
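A disagreement rate of this kind can be computed as in the following sketch. It assumes the three relations are decided by comparing the depth ratio against a tolerance band around 1, takes depths as flat sequences indexed by point id, and leaves the tolerance `delta` and the pair sampling to the caller; the names are illustrative.

```python
def ord_relation(d_i, d_j, delta):
    """Return +1 (further-than), -1 (closer-than), or 0 (same-depth-as)."""
    ratio = d_i / d_j
    if ratio > 1 + delta:
        return 1
    if ratio < 1 - delta:
        return -1
    return 0


def sdr(pred, gt, pairs, delta):
    """Fraction of pairs whose predicted relation differs from the SfM-derived one."""
    if not pairs:
        return 0.0
    wrong = sum(
        1
        for i, j in pairs
        if ord_relation(pred[i], pred[j], delta) != ord_relation(gt[i], gt[j], delta)
    )
    return wrong / len(pairs)


def sdr_subset(pred, gt, pairs, delta, equal):
    """SDR restricted to pairs whose ground-truth relation is 'same' (equal=True) or not."""
    chosen = [(i, j) for i, j in pairs if (ord_relation(gt[i], gt[j], delta) == 0) == equal]
    return sdr(pred, gt, chosen, delta)
```

Since only depth ratios enter the comparison, multiplying every predicted depth by the same positive factor leaves the score unchanged, which suits scale-free SfM depths.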
Table 3: Effect of depth refinement. Columns: test set, error measure, Raw MD, Clean MD.
Effect of network and loss variants. We evaluate three popular network architectures for depth prediction on our MD test set: the VGG network used by Eigen et al., an “hourglass” (HG) network, and ResNets. To compare our loss function to that of Eigen et al., we also test the same network and loss function as theirs trained on MD. Their method uses a VGG network with a scale-invariant loss plus a single-scale gradient matching term. Quantitative results are shown in Table 1 and qualitative comparisons are shown in Figure 6. We also evaluate variants of our method trained using only some of our loss terms: (1) a version with only the scale-invariant data term (the loss of Eigen and Fergus), (2) a version that adds our multi-scale gradient matching loss, and (3) the full version including both the gradient matching loss and the ordinal depth loss. Results are shown in Table 2.
As shown in Tables 1 and 2, the HG architecture achieves the best performance of the three architectures, and training with our full loss yields better performance compared to other loss variants, including that of Eigen et al. (first row of Table 1). One thing to notice is that adding the ordinal depth loss can significantly improve $SDR^{\ne}$ while increasing $SDR^{=}$. Figure 6 shows that our joint loss helps preserve the structure of the depth map and capture nearby objects such as people and buses.
Finally, we experiment with training our network on MD with and without our proposed depth refinement methods, testing on three datasets: KITTI, Make3D, and DIW. The results, shown in Table 3, show that networks trained on raw MVS depth do not generalize well. Our proposed refinements significantly boost prediction performance.
A powerful application of our 3D-reconstruction-derived training data is to generalize to outdoor images beyond landmark photos. To evaluate this capability, we train our model on MD and test on three standard benchmarks: Make3D, KITTI, and DIW, without seeing training data from these datasets. Since our depth prediction is defined up to a scale factor, for each dataset, we align each prediction from all non-target-dataset-trained models with the ground truth by a scalar computed from the least-squares solution to the ratio between ground truth and predicted depth.
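The alignment step can be sketched as follows, under one plausible reading of "least-squares solution": find the scalar s that minimizes the sum of squared differences between s times the prediction and the ground truth over valid pixels, which has the closed form s = sum(pred * gt) / sum(pred^2). The function name and the convention that non-positive ground truth means "invalid" are assumptions.

```python
def align_scale(pred, gt):
    """Scale a predicted depth list to best match ground truth in the least-squares sense."""
    num = 0.0
    den = 0.0
    for p, g in zip(pred, gt):
        if g > 0:  # assumed convention: non-positive ground truth is invalid
            num += p * g
            den += p * p
    if den == 0.0:
        raise ValueError("no valid pixels to align against")
    scale = num / den
    return [scale * p for p in pred], scale
```

After alignment, metric errors such as RMS can be computed on the scaled prediction.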
Table 4: Results on the Make3D test set for methods trained on Make3D and non-Make3D data.
|Training set||Method||RMS||Abs Rel||log10|
|Make3D||Karsch et al.||9.20||0.355||0.127|
|Make3D||Liu et al.||9.49||0.335||0.137|
|Make3D||Liu et al.||8.60||0.314||0.119|
|Make3D||Li et al.||7.19||0.278||0.092|
|Make3D||Laina et al.||4.45||0.176||0.072|
|Make3D||Xu et al.||4.38||0.184||0.065|
|NYU||Eigen et al.||6.89||0.505||0.198|
|NYU||Liu et al.||7.20||0.669||0.212|
|NYU||Laina et al.||7.31||0.669||0.216|
|KITTI||Zhou et al.||8.39||0.651||0.231|
|KITTI||Godard et al.||9.88||0.525||0.319|
|DIW||Chen et al.||7.25||0.550||0.200|
Figure 7: Depth predictions on Make3D. (a) Image, (b) ground truth, and predictions from models trained on (c) DIW, (d) NYU, (e) KITTI, and (f) MD.
Table 5: Results on the KITTI test set for methods trained on KITTI and non-KITTI data.
|Training set||Method||RMS||RMS(log)||Abs Rel||Sq Rel|
|KITTI||Liu et al.||6.52||0.275||0.202||1.614|
|KITTI||Eigen et al.||6.31||0.282||0.203||1.548|
|KITTI||Zhou et al.||6.86||0.283||0.208||1.768|
|KITTI||Godard et al.||5.93||0.247||0.148||1.334|
|Make3D||Laina et al.||8.68||0.422||0.339||3.136|
|Make3D||Liu et al.||8.70||0.447||0.362||3.465|
|NYU||Eigen et al.||10.37||0.510||0.521||5.016|
|NYU||Liu et al.||10.10||0.526||0.540||5.059|
|NYU||Laina et al.||10.07||0.527||0.515||5.049|
|CS||Zhou et al.||7.58||0.334||0.267||2.686|
|DIW||Chen et al.||7.12||0.474||0.393||3.260|
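Figure 8: Depth predictions on KITTI. (a) Image, (b) ground truth, and predictions from (c) a DIW-trained model, (d) the best NYU-trained model, (e) the best Make3D-trained model, and (f) our MD-trained model.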
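Table 6: Results on the DIW test set, measured by WHDR.
|Training set||Method||WHDR (%)|
|DIW||Chen et al.||22.14|
|KITTI||Zhou et al.||31.24|
|KITTI||Godard et al.||30.52|
|NYU||Eigen et al.||25.70|
|NYU||Laina et al.||45.30|
|NYU||Liu et al.||28.27|
|Make3D||Laina et al.||31.65|
|Make3D||Liu et al.||29.58|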
Figure 9: Depth predictions on DIW. (a) Image, and predictions from models trained on (b) NYU, (c) KITTI, (d) Make3D, and (e) MD (ours).
Make3D. To test on Make3D, we follow the protocol of [23, 19], resizing all images to 345×460, and removing ground truth depths larger than 70m (since Make3D data is unreliable at large distances). We train our network only on MD using our full loss. Table 4 shows numerical results, including comparisons to several methods trained on both Make3D and non-Make3D data, and Figure 7 visualizes depth predictions from our model and several other non-Make3D-trained models. Our network trained on MD has the best performance among all non-Make3D-trained models. Finally, the last row of Table 4 shows that our model fine-tuned on Make3D achieves better performance than the state-of-the-art.
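For reference, the three error columns reported for Make3D (RMS, Abs Rel, log10) are commonly defined as in the sketch below. These are the standard definitions rather than code from this work; the function name is illustrative, predictions are assumed to be already scale-aligned and positive, and the `max_depth` cutoff mirrors the 70m ground-truth limit mentioned above.

```python
import math


def depth_errors(pred, gt, max_depth=70.0):
    """RMS, mean absolute relative error, and mean log10 error over valid pixels."""
    pairs = [(p, g) for p, g in zip(pred, gt) if 0 < g <= max_depth]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid ground-truth depths")
    rms = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    return {"rms": rms, "abs_rel": abs_rel, "log10": log10}
```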
KITTI. Next, we evaluate our model on the KITTI test set based on the split used in prior work. As with our Make3D experiments, we do not use images from KITTI during training. The KITTI dataset is very different from ours, consisting of driving sequences that include objects, such as sidewalks, cars, and people, that are difficult to reconstruct with SfM/MVS. Nevertheless, as shown in Table 5, our MD-trained network still outperforms approaches trained on non-KITTI datasets. Finally, the last row of Table 5 shows that we can achieve state-of-the-art performance by fine-tuning our network on KITTI training data. Figure 8 shows visual comparisons between our results and models trained on other non-KITTI datasets. One can see that we achieve much better visual quality compared to models trained on other non-KITTI datasets, and our predictions can reasonably capture nearby objects such as traffic signs, cars, and trees, due to our ordinal depth loss.
DIW. Finally, we test our network on the DIW dataset. DIW consists of Internet photos with general scene structures. Each image in DIW has a single pair of points with a human-labeled ordinal depth relationship. As with Make3D and KITTI, we do not use DIW data during training. For DIW, quality is computed via the Weighted Human Disagreement Rate (WHDR), which measures the frequency of disagreement between predicted depth maps and human annotations on a test set. Numerical results are shown in Table 6. Our MD-trained network again has the best performance among all non-DIW-trained models. Figure 9 visualizes our predictions and those of other non-DIW-trained networks on DIW test images. Our predictions achieve visually better depth relationships. Our method even works reasonably well for challenging scenes such as offices and close-ups.
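A simplified sketch of a weighted disagreement rate is shown below. It assumes one annotated point pair per image, a human label of `+1` when the first point is further and `-1` when it is closer, and optional per-pair weights that default to one; the function name, the tie handling, and the returned fraction (rather than a percentage) are illustrative choices, not the benchmark's reference implementation.

```python
def whdr(pred_depth_pairs, human_labels, weights=None):
    """Weighted fraction of pairs where the predicted ordering contradicts the human label."""
    if weights is None:
        weights = [1.0] * len(human_labels)
    total = sum(weights)
    if total == 0:
        return 0.0
    wrong = 0.0
    for (depth_a, depth_b), label, weight in zip(pred_depth_pairs, human_labels, weights):
        predicted = 1 if depth_a > depth_b else -1  # ties count as "closer" in this sketch
        if predicted != label:
            wrong += weight
    return wrong / total
```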
We presented a new use for Internet-derived SfM+MVS data: generating large amounts of training data for single-view depth prediction. We demonstrated that this data can be used to predict state-of-the-art depth maps for locations never observed during training, and generalizes very well to other datasets. However, our method also has a number of limitations. MVS methods still do not perfectly reconstruct even static scenes, particularly when there are oblique surfaces (e.g., ground), thin or complex objects (e.g., lampposts), and difficult materials (e.g., shiny glass). Our method does not predict metric depth; future work in SfM could use learning or semantic information to correctly scale scenes. Our dataset is currently biased towards outdoor landmarks, though by scaling to much larger input photo collections we will find more diverse scenes. Despite these limitations, our work points towards the Internet as an intriguing, useful source of data for geometric learning problems.
Acknowledgments. We thank the anonymous reviewers for their valuable comments. This work was funded by the National Science Foundation under grant IIS-1149393.
Proc. Computer Vision and Pattern Recognition (CVPR), pages 1434–1441, 2010.
Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 1119–1127, 2015.
Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In Proc. European Conf. on Computer Vision (ECCV), 2016.