Code to extract stereo frame pairs from 3D videos, as used in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, arXiv:1907.01341"
The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a training objective that is invariant to changes in depth range and scale. Armed with this objective, we explore an abundant source of training data: 3D films. We demonstrate that despite pervasive inaccuracies, 3D films constitute a useful source of data that is complementary to existing training sets. We evaluate the presented approach on diverse datasets, focusing on zero-shot cross-dataset transfer: testing the generality of the learned model by evaluating it on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources yields improved depth estimates, particularly on previously unseen datasets. Some results are shown in the supplementary video: https://youtu.be/ITI0YS6IrUQ
Depth is among the most useful intermediate representations for action in physical environments. Yet despite its utility, monocular depth estimation remains a challenging open problem. When only one image is given as input, depth is heavily underconstrained; its estimation calls for the use of multiple monocular cues along with comprehensive prior knowledge, and must therefore rely on learning-based techniques [20, 37].
To learn models that are effective across a variety of scenarios, we need training data that provides relevant supervision and captures the diversity of the visual world. The key challenge is acquiring such data at sufficient scale. Sensors that provide dense ground-truth depth in dynamic scenes, such as structured light or time-of-flight, have limited range and operating conditions [22, 19, 10]. Laser scanners are expensive and most designs can only provide sparse depth measurements when the scene is in motion. Stereo cameras are a promising source of data [13, 15], but collecting stereo images in diverse environments at scale remains a challenge. Structure-from-motion (SfM) reconstruction has been used to construct training data for monocular depth estimation across a variety of scenes, but the result does not include independently moving objects and is incomplete due to the limitations of multi-view matching. On the whole, none of the existing datasets is sufficiently rich and unbiased to support the training of a general model that works robustly on real data from wildly diverse scenes. At present, we are faced with multiple datasets that may usefully complement each other, but are individually biased and incomplete.
In this paper, we propose an approach to mixing diverse datasets for training monocular depth estimation models. We develop a novel loss function that is invariant to the major sources of incompatibility between datasets, including unknown and inconsistent scale and baselines. Our loss enables training on data that was acquired with diverse sensing modalities such as stereo cameras (with potentially unknown calibration), laser scanners, and structured light sensors. We explore strategies for mixing datasets during training and show that a principled approach based on multi-objective optimization can lead to improved generalization performance.
Equipped with the ability to combine diverse datasets, we tap into a new source of data for monocular depth estimation: 3D film. We construct a new 3D Movies dataset and show that it provides a powerful training resource that improves generalization to new and dynamic environments.
Our experiments show that a model trained on a rich and diverse set of images from different sources, with the appropriate training procedure, delivers state-of-the-art results across a variety of environments. Our primary experimental procedure is zero-shot cross-dataset transfer. That is, we train a model on certain datasets and then test its performance on other datasets that were not seen during training. The basic intuition is that zero-shot cross-dataset performance is a more faithful proxy for performance in the “real world” than training and testing on subsets from a single biased dataset.
Our evaluation on eight different datasets suggests that our model outperforms prior art both quantitatively and qualitatively. Example results are shown in the teaser figure.
Early work on monocular depth estimation used MRF-based formulations , simple geometric assumptions , and non-parametric methods . More recently, significant advances have been made by leveraging the expressive power of convolutional networks. Eigen et al.  trained a multi-scale deep network to perform depth regression. Various architectural innovations have been proposed to enhance prediction accuracy [25, 36, 29, 12, 26]. These methods need ground-truth depth for training, which is commonly acquired using RGB-D cameras or LiDAR sensors. The performance of these methods in unconstrained scenes is limited by the lack of diverse ground-truth data at scale.
Garg et al.  proposed to use calibrated stereo cameras for self-supervision. While this significantly simplifies the acquisition of training data, diverse, large-scale stereo datasets are still not available.
Various approaches that leverage self-supervision have been proposed, but they either require stereo images [15, 47] or are based on apparent motion [49, 31, 2], and are thus challenging to apply to highly dynamic scenes. Other approaches leverage existing stereo matching networks to obtain supervision [17, 30].
We argue that the deployment of high-capacity deep models for monocular depth estimation in unconstrained environments is limited by the lack of large-scale, dense ground truth that spans a variety of scenes. Indeed, commonly used datasets feature homogeneous scene compositions such as street scenes in a limited geographic area [14, 32, 37] or indoor environments, and have only a limited number of dynamic objects. Models that are trained on data with such strong biases are prone to fail in unconstrained environments.
Efforts have been made to create datasets that overcome these limitations. Chen et al.  used crowdsourcing to sparsely annotate ordinal relations in images that were collected from the web. Xian et al.  collected a stereo dataset from the web and used off-the-shelf tools to generate dense ground-truth disparity; while this dataset is fairly diverse and provides dense ground truth, it only contains 3,600 images. Li and Snavely  used SfM and MVS to reconstruct many (predominantly static) scenes to obtain depth supervision. Each of these efforts contributes datasets with different characteristics and limitations. On the whole, there is no single large-scale dataset with dense ground truth for diverse dynamic scenes.
| Dataset | Scene properties (✓) | Accuracy | Diversity | Ground truth | # Images |
|---|---|---|---|---|---|
| KITTI LiDAR [14, 32] | ✓ ✓ ✓ ✓ | Medium | Low | Laser | 93K |
| KITTI Stereo [14, 32] | ✓ ✓ ✓ ✓ ✓ | Medium | Low | Stereo | 93K |
| DIW | ✓ ✓ ✓ | Low | High | User clicks | 496K |
| Tanks and Temples | ✓ ✓ ✓ ✓ | High | Low | Laser | 3,290 |
| 3D Movies (Ours) | ✓ ✓ ✓ ✓ ✓ | Medium | High | Stereo | 50K/movie |
To the best of our knowledge, the controlled mixing of multiple data sources has not been explored before in this context. Ummenhofer et al.  presented a model for two-view structure and motion estimation and trained it on the union of several datasets that depict static scenes. However, this work did not propose strategies for optimal mixing or study the influence of combining multiple datasets. The model was further constrained to work with the intrinsic parameters of a single camera model.
Concurrent work. Several concurrent projects aim to extend the generalization capabilities of monocular depth estimation by collecting larger and more diverse datasets. Li et al.  use SfM and MVS to construct a dataset from a collection of videos of people imitating mannequins (i.e. the people are frozen in action while the camera moves through the scene). Chen et al.  propose an approach to automatically assess the quality of sparse SfM reconstructions to enable the construction of a large dataset. Wang et al.  build a large and diverse dataset from stereo videos sourced from the Web, while Cho et al.  collect a large dataset of outdoor scenes using handheld stereo cameras. Gordon et al.  estimate intrinsic parameters of YouTube videos in order to leverage them for training. Our approach can be used to directly integrate these datasets in a single training procedure to train even more general and accurate models.
Table 1 summarizes datasets that provide relevant supervision for monocular depth estimation. Most datasets feature a limited range of environments, such as indoor [40, 41, 7] or road scenes [14, 32]. Some are restricted to static scenes [37, 6, 24, 38, 28]. Only DIW  and ReDWeb  include diverse dynamic scenes in a variety of settings. However, the diversity of the data comes with drawbacks. DIW is large and diverse, but each image provides only an ordinal depth relation for a single pair of points. And the ReDWeb dataset contains only 3,600 images.
We propose to extract depth data from 3D movies. This source of data is complementary to existing datasets and comes with its own strengths. 3D movies feature diverse dynamic environments that range from human-centric imagery in story- and dialogue-driven Hollywood films to nature scenes with landscapes and animals in documentary features. While the data does not provide metric depth, we can use stereo matching to obtain relative depth. Using relative depth for supervision comes with challenges and we will present techniques for overcoming them.
Our driving inspiration is the scale and diversity of the data. 3D movies provide the largest known source of stereo pairs, presenting the possibility of tapping into millions of images from an ever-growing library of content. We note that 3D movies have been used in related tasks in isolation [18, 46]. We will show that this data reveals its full potential in combination with other, complementary data sources.
Challenges. Movie data comes with its own challenges and imperfections. The primary objective when producing stereoscopic film is providing a visually pleasing viewing experience while avoiding discomfort for the viewer. This means that the disparity range for any given scene (also known as the depth budget) is limited and depends on both artistic and psychophysical considerations. For example, disparity ranges are often increased in the beginning and the end of a movie, in order to induce a very noticeable stereoscopic effect for a short time. Depth budgets in the middle may be lower to allow for more comfortable viewing. Stereographers thus adjust their depth budget depending on the content, transitions, and even the rhythm of scenes.
In consequence, focal lengths, baseline, and convergence angle between the cameras of the stereo rig are unknown and vary between scenes even within a single film. Furthermore, in contrast to image pairs obtained directly from a standard stereo camera, stereo pairs in movies usually contain both positive and negative disparities to allow objects to be perceived either in front of or behind the screen. Additionally, the depth that corresponds to the screen is scene-dependent and is often modified in post-production by shifting the image pairs. We describe data extraction and training procedures that address these challenges.
Movie selection and preprocessing. We selected a diverse set of 23 movies. The selection was based on the following considerations. 1) We only selected movies that were shot using a physical stereo camera. (Some 3D films are shot with a monocular camera and the stereoscopic effect is added in post-production by artists.) 2) We tried to balance realism and diversity. 3) We only selected movies that are available in Blu-ray format and thus allow extraction of high-resolution images. The complete list of movies can be found in supplementary material.
We extract stereo image pairs at 1920×1080 resolution and 24 frames per second (fps). Movies have varying aspect ratios, resulting in black bars at the top and bottom of the frame, and some movies have thin black bars along frame boundaries due to realignment of the stereo images in post-production. We thus center-crop all frames to 1880×800 pixels. We use the chapter information (Blu-ray metadata) to split each movie into individual chapters. We drop the first and last chapters, since they usually include the introduction and credits.
We use the scene detection tool of FFmpeg  with a threshold of 0.1 to extract individual clips. We discard clips that are shorter than one second to filter out chaotic action scenes and highly correlated clips that rapidly switch between protagonists during dialogues. To balance scene diversity, we sample the first 24 frames of each clip and additionally sample 24 frames every four seconds for longer clips. Example frames from the resulting dataset are shown in Figure 1.
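As a concrete illustration, the sampling rule above (24 frames at the start of each clip, and another 24 frames every four seconds for longer clips) can be sketched as follows; the function name and clip representation are ours, not from the released extraction code.

```python
def sample_frame_indices(clip_length, fps=24, window=24, interval_s=4):
    """Indices of frames to keep from a clip of `clip_length` frames:
    `window` consecutive frames at the start of the clip, and another
    `window` frames every `interval_s` seconds for clips long enough."""
    indices = []
    start = 0
    while start < clip_length:
        # take up to `window` consecutive frames from this position
        indices.extend(range(start, min(start + window, clip_length)))
        start += interval_s * fps  # jump four seconds ahead
    return indices
```

For a one-second clip (24 frames) this keeps every frame; for a ten-second clip it keeps three windows of 24 frames each.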
Disparity extraction. The extracted image pairs can be used to estimate disparity maps using stereo matching. Unfortunately, state-of-the-art stereo matchers perform poorly when applied to movie data, since the matchers were designed and trained to match only over positive disparity ranges. This assumption is appropriate for the rectified output of a standard stereo camera, but not for image pairs extracted from stereoscopic film. Moreover, disparity ranges encountered in 3D movies are usually smaller than ranges that are common in standard stereo setups due to the limited depth budget. (The average disparity range in our dataset is 32 pixels.)
To alleviate these problems, we apply a modern optical flow algorithm  to the stereo pairs. We retain the horizontal component of the flow as a proxy for disparity. Optical flow algorithms naturally handle both positive and negative disparities and usually perform well for displacements of moderate size. For each stereo pair we use the left camera as the reference and extract the optical flow from the left to the right image and vice versa. We perform a left-right consistency check and mark pixels with a disparity difference of more than one pixel as invalid. In a final step, we detect pixels that belong to sky regions using a pre-trained semantic segmentation model  and set their disparity to the minimum disparity in the image.
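A minimal numpy sketch of the consistency check and sky handling described above (function and variable names are ours; a full implementation would also warp the right-to-left flow into the reference view before comparing):

```python
import numpy as np

def disparity_from_flow(flow_lr, flow_rl, sky_mask, max_diff=1.0):
    """Use the horizontal flow component as a disparity proxy.

    flow_lr:  horizontal flow from the left to the right image, shape (H, W)
    flow_rl:  horizontal flow from the right to the left image, shape (H, W)
    sky_mask: boolean (H, W) map of pixels classified as sky
    Returns the disparity map and a validity mask.
    """
    disparity = flow_lr.copy()
    # Left-right consistency check: forward and backward flow should cancel,
    # so their sum deviating by more than max_diff marks the pixel invalid.
    valid = np.abs(flow_lr + flow_rl) <= max_diff
    # Sky is effectively at infinity: assign the minimum disparity in the image.
    if valid.any():
        disparity[sky_mask] = disparity[valid].min()
    return disparity, valid
```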
Table 2: Statistics of the 3D Movies training set (excerpt).

| Number of clips | 38,000 |
| Avg./max. disparity range (pixels) | 32 / 223 |
Dataset statistics. We use frames from 19 movies for training and set aside two movies for validation and two movies for testing. An overview of the statistics of the resulting training set is shown in Table 2. The complete dataset contains 38,000 clips of one second each and close to one million frames. Since multiple frames are part of the same clip, the complete dataset is highly correlated. We thus subsample the dataset at 1 fps or 4 fps for our experiments.
Training models for monocular depth estimation on diverse datasets presents a challenge because ground-truth data may take different forms. Ground truth may be present in the form of absolute depth (from laser-based measurements or stereo cameras with known calibration), depth up to an unknown scale (from SfM), or disparity maps (from stereo cameras with unknown calibration). The main requirement for a sensible training scheme is to carry out computations in an appropriate output space that is compatible with all ground-truth representations and is numerically well-behaved. We further need to design a loss function that is flexible enough to handle diverse sources of data while making optimal use of all available information.
We identify three major challenges. 1) Inherently different representations of depth: direct depth versus inverse depth representations (such as disparity). 2) Scale ambiguity: for some data sources, depth is only given up to an unknown scale. 3) Shift ambiguity: the ground truth in the 3D Movies dataset is only given up to an unknown global shift that is a function of the unknown baseline and a possible shift of the disparity range in post-production.
Scale- and shift-invariant loss. We propose to perform prediction in inverse depth space together with a scale- and shift-invariant dense loss to handle the aforementioned ambiguities. Let $M$ denote the number of pixels in an image with valid ground truth and let $\theta$ be the parameters of the prediction model. Let $d = d(\theta) \in \mathbb{R}^M$ be an inverse depth prediction and let $d^* \in \mathbb{R}^M$ be the corresponding ground-truth inverse depth. We index individual pixels by subscripts. We define a scale- and shift-invariant loss for a single sample as

$$\mathcal{L}_{ssi}(d, d^*) = \min_{s,t} \frac{1}{2M} \sum_{i=1}^{M} \left( s\,d_i + t - d_i^* \right)^2, \tag{1}$$

where $s$ accounts for the unknown scale and $t$ accounts for an unknown shift between the inverse depth maps. The loss effectively aligns the scale and shift of the estimate to the ground truth based on a least-squares criterion before measuring the mean squared error. The factors $s$ and $t$ can be efficiently determined in closed form, as follows. Let $\vec{d}_i = (d_i, 1)^\top$ and $h = (s, t)^\top$. We can rewrite (1) as

$$\min_{h} \sum_{i=1}^{M} \left( \vec{d}_i^{\,\top} h - d_i^* \right)^2, \tag{2}$$

which has the closed-form solution

$$h^{\mathrm{opt}} = \left( \sum_{i=1}^{M} \vec{d}_i\, \vec{d}_i^{\,\top} \right)^{-1} \left( \sum_{i=1}^{M} \vec{d}_i\, d_i^* \right). \tag{3}$$

Substituting $h^{\mathrm{opt}}$ into (2) yields

$$\mathcal{L}_{ssi}(d, d^*) = \frac{1}{2M} \sum_{i=1}^{M} \left( \vec{d}_i^{\,\top} h^{\mathrm{opt}} - d_i^* \right)^2. \tag{4}$$
It is straightforward to show that this loss is indeed invariant to scale and shift of the prediction. We provide a proof sketch in supplementary material.
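Concretely, the closed-form alignment amounts to solving a 2×2 linear system per image. A numpy sketch with our naming (not the released implementation):

```python
import numpy as np

def align_scale_shift(d, d_star):
    """Closed-form (s, t) minimizing sum_i (s*d_i + t - d*_i)^2."""
    d, d_star = d.ravel(), d_star.ravel()
    # Normal equations for h = (s, t), with each d_i extended to (d_i, 1)^T.
    A = np.array([[np.dot(d, d), d.sum()],
                  [d.sum(),      d.size ]])
    b = np.array([np.dot(d, d_star), d_star.sum()])
    s, t = np.linalg.solve(A, b)
    return s, t

def ssi_loss(d, d_star):
    """Scale- and shift-invariant mean squared error on inverse depth."""
    s, t = align_scale_shift(d, d_star)
    residual = s * d.ravel() + t - d_star.ravel()
    return 0.5 * np.mean(residual ** 2)
```

Since any affine reparametrization of the prediction is undone by the alignment, `ssi_loss(a * d + b, d_star)` equals `ssi_loss(d, d_star)` for any `a != 0`, which is exactly the claimed invariance.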
Relation to existing loss functions. The importance of accounting for unknown or varying scale in the training of monocular depth estimation models was recognized early. Eigen et al.  proposed a scale-invariant loss in log-depth space. Their loss can be written as

$$\mathcal{L}_{\mathrm{eigen}}(z, z^*) = \frac{1}{2M} \sum_{i=1}^{M} \left( \log z_i - \log z_i^* + \alpha(z, z^*) \right)^2, \tag{5}$$

where $z$ and $z^*$ are depths up to unknown scale and $\alpha(z, z^*) = \frac{1}{M} \sum_{j=1}^{M} \left( \log z_j^* - \log z_j \right)$ compensates for the global scale mismatch. By comparing this loss to (1), it becomes clear that both losses account for the unknown scale of the predictions, but only (1) accounts for an unknown global shift of the inverse depth range. Moreover, the two losses are evaluated on different representations of depth. Our loss (1) is defined in inverse depth space, which is numerically stable, compatible with common representations of relative depth, and allows modeling the error distribution as Gaussian, which is represented well by a quadratic loss.
Chen et al.  proposed a generally applicable loss for relative depth estimation based on ordinal relations:

$$\mathcal{L}_{\mathrm{ord}}(d_i, d_j) = \begin{cases} \log\left(1 + \exp\left(-r_{ij}\,(d_i - d_j)\right)\right), & r_{ij} \neq 0, \\ \left(d_i - d_j\right)^2, & r_{ij} = 0, \end{cases} \tag{6}$$

where $r_{ij} \in \{-1, 0, +1\}$ encodes the ground-truth ordinal relation between points $i$ and $j$. This loss encourages pushing points infinitely far apart when $r_{ij} \neq 0$ and pulling them to the same depth when $r_{ij} = 0$. Note that (6) measures the ordinal relation between a pair of points, which is inefficient if applied exhaustively to dense ground truth. Xian et al.  suggest to evaluate this loss sparsely by randomly sampling point pairs, even when dense ground truth is available. In contrast, our proposed loss takes all available data into account. While the ordinal loss can be applied to arbitrary depth representations and is thus suited for mixing diverse datasets, we will show that our scale- and shift-invariant loss leads to consistently better performance.
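A direct transcription of the ordinal loss (6) for a single point pair (Python, our naming):

```python
import numpy as np

def ordinal_loss(d_i, d_j, r_ij):
    """Ordinal loss for one point pair. r_ij in {-1, 0, +1} encodes the
    ground-truth ordinal relation; r_ij = 0 means equal depth."""
    if r_ij != 0:
        # Drives the (inverse) depths apart in the direction given by r_ij.
        return float(np.log1p(np.exp(-r_ij * (d_i - d_j))))
    # Pulls the two points to the same depth.
    return float((d_i - d_j) ** 2)
```

The log term never reaches zero, which is why correctly ordered pairs keep being pushed apart.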
Regularization terms. We adapt the multi-scale, scale-invariant gradient matching term  to the inverse depth space. This term biases discontinuities to be sharp and to coincide with discontinuities in the ground truth. Let $R_i = s\,d_i + t - d_i^*$ denote the difference between the aligned prediction and the ground truth at pixel $i$, with $(s, t)$ given by (3). We define the gradient matching term as

$$\mathcal{L}_{\mathrm{reg}}(d, d^*) = \frac{1}{M} \sum_{k=1}^{K} \sum_{i=1}^{M} \left( \left| \nabla_x R_i^k \right| + \left| \nabla_y R_i^k \right| \right), \tag{7}$$

where $R^k$ denotes the difference of inverse depth maps at scale $k$. As proposed by Li and Snavely , we use $K$ scale levels, halving the image resolution at each level. Note that in inverse depth space the estimated scale needs to be applied to the prediction before measuring gradients. This is in contrast to the term proposed in , where predictions are given in log-depth space and no pre-scaling is required.
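A sketch of the gradient matching term with simple finite differences and nearest-neighbor downsampling; the number of scales and the resampling scheme are implementation choices, and the names are ours:

```python
import numpy as np

def gradient_matching(pred_aligned, target, num_scales=4):
    """Multi-scale gradient matching on the residual
    R = (aligned prediction) - (ground truth), both in inverse depth."""
    R = pred_aligned - target
    total, M = 0.0, R.size
    for _ in range(num_scales):
        # L1 norm of horizontal and vertical gradients of the residual.
        total += np.abs(np.diff(R, axis=1)).sum()
        total += np.abs(np.diff(R, axis=0)).sum()
        R = R[::2, ::2]  # halve the resolution for the next scale
    return total / M
```

Because only gradients of the residual enter, a constant offset between prediction and ground truth contributes nothing, while misplaced depth discontinuities do.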
Mixing strategies. Our final loss for a training set $l$ is

$$\mathcal{L}_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \mathcal{L}_{ssi}\left(d^n, d^{*n}\right) + \alpha\, \mathcal{L}_{\mathrm{reg}}\left(d^n, d^{*n}\right), \tag{8}$$

where $N_l$ is the number of samples in the training set and $\alpha$ is set to 0.5.
While our choice of prediction space together with our loss enables mixing datasets, it is not immediately clear in what proportions different datasets should be integrated during training with a stochastic optimization algorithm. We explore two different strategies in our experiments.
The first, naive strategy is to mix datasets in equal parts in each minibatch. For a minibatch of size $B$, we sample $B/L$ training samples from each dataset, where $L$ denotes the number of distinct datasets. This strategy ensures that all datasets are represented equally in the effective training set, irrespective of their size.
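The naive strategy can be sketched as follows (names are ours):

```python
import random

def equal_mix_minibatch(datasets, batch_size):
    """Draw batch_size / len(datasets) samples from each dataset, so every
    dataset is represented equally regardless of its size."""
    per_set, remainder = divmod(batch_size, len(datasets))
    assert remainder == 0, "batch size must be divisible by the number of datasets"
    batch = []
    for data in datasets:
        batch.extend(random.choices(data, k=per_set))  # sample with replacement
    return batch
```

Sampling with replacement means small datasets are effectively oversampled relative to large ones.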
Our second strategy explores a more principled approach, where we adapt a recent procedure for Pareto-optimal multi-task learning  to our setting. We define learning on each dataset as a separate task and thus seek an approximate Pareto optimum over datasets (i.e., the loss cannot be decreased on any of the training sets without increasing the loss on at least one of the other training sets). Formally, we use the algorithm presented in  to minimize the multi-objective optimization criterion

$$\min_{\theta} \left( \mathcal{L}_1(\theta), \ldots, \mathcal{L}_L(\theta) \right)^{\top}, \tag{9}$$

where $\mathcal{L}_l$ is the loss (8) on dataset $l$ and we share the parameters $\theta$ of the deep network across all datasets.
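For intuition, in the two-dataset case the minimum-norm subproblem at the core of the multiple-gradient descent algorithm (MGDA, as abbreviated later in the paper) has a closed-form solution. A numpy sketch (our naming), where `g1` and `g2` are the gradients of the two dataset losses with respect to the shared parameters:

```python
import numpy as np

def min_norm_coefficient(g1, g2):
    """Closed-form minimizer of || a*g1 + (1 - a)*g2 ||^2 over a in [0, 1]."""
    diff = g1 - g2
    denom = np.dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any convex combination is optimal
    a = np.dot(g2 - g1, g2) / denom
    return float(np.clip(a, 0.0, 1.0))

def mgda_direction(g1, g2):
    """Convex combination of the gradients; when nonzero, its negation is a
    common descent direction for both objectives."""
    a = min_norm_coefficient(g1, g2)
    return a * g1 + (1.0 - a) * g2
```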
Experimental setup. We start from the experimental setup proposed by Xian et al.  and use their ResNet-based multi-scale architecture for single-image depth prediction. We initialize all ResNet-50 blocks with pretrained ImageNet weights and initialize the other layers randomly. We train with Adam, using a lower learning rate for the layers that were initialized with pretrained weights than for the randomly initialized layers, and keep the exponential decay rates of the moving averages fixed in all experiments. The batch size is set to 8. Images are flipped horizontally with a 50% chance and randomly cropped to augment the data and to maintain the aspect ratio across different input images. We pretrain the network for 300 epochs on 3,240 images from the ReDWeb dataset to produce a baseline model comparable to Xian et al. .
For all experiments that follow, we start from this pretrained model and fine-tune it on the respective collections of datasets. When fine-tuning, we use a single fixed learning rate for all layers. We use the same data augmentation and a batch size of 8 per dataset, i.e., when mixing three datasets the effective batch size is 24. When comparing datasets of different sizes, the term epoch is no longer well-defined; we thus denote processing a fixed number of images, roughly the size of the MegaDepth training set, as one epoch, and train all models for the same number of epochs. For all datasets, we shift and scale the ground-truth inverse depth to a common fixed range.
Training datasets. We use three complementary datasets for training. ReDWeb  (RW) is small but features diverse, dynamic scenes with ground truth that was acquired with a relatively large stereo baseline. The 3D Movies dataset (MV) features highly dynamic scenes and is considerably larger but is biased towards dominant foreground objects due to the depth budget. MegaDepth  (MD) is large and shows predominantly static scenes; the ground truth in this dataset is usually more accurate in background regions due to the large-baseline multi-view stereo reconstruction. We hold out test and validation sets for all datasets (details in the supplement).
Test datasets. To benchmark the generalization performance of different models, we use a variety of datasets for testing. We split datasets in two distinct groups. The first group are datasets where the model was trained on similar data, namely our held-out validation sets of ReDWeb, MegaDepth, and Movies. The second group are entirely different datasets that were never seen during training. We chose five datasets based on diversity and accuracy of their ground truth. DIW  is highly diverse but provides ground truth only in the form of sparse ordinal relations. ETH3D  features highly accurate laser-scanned ground truth on static scenes. Sintel  features perfect ground truth for synthetic scenes. KITTI  and NYU  are commonly used datasets with characteristic biases. Note that we never fine-tune models on any of the datasets in this second group. We refer to this experimental procedure as zero-shot cross-dataset transfer.
Metrics. For each dataset, we use a single metric that fits the ground-truth data in that dataset. For DIW we use the Weighted Human Disagreement Rate (WHDR) . For datasets that are based on relative depth (Movies, ReDWeb, MegaDepth), we measure the root mean squared error in disparity space. For datasets that provide accurate absolute depth (ETH3D, Sintel), we measure the mean absolute value of the relative error in depth space. Finally, we use the percentage of pixels with $\max(z_i/z_i^*,\, z_i^*/z_i) > 1.25$ to evaluate models on the KITTI and NYU datasets. We align predictions and dense ground truth in scale and shift before measuring errors (details in the supplement). To summarize the performance of different models across test datasets, we rank the methods by their performance on each dataset and compute the average rank. This is our primary performance measure for zero-shot cross-dataset transfer.
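As an illustration, the standard thresholded-ratio criterion used for KITTI and NYU can be computed as below (a sketch; the scale/shift alignment mentioned above is assumed to have been applied to `pred_depth` already):

```python
import numpy as np

def delta_error(pred_depth, gt_depth, threshold=1.25):
    """Percentage of pixels whose predicted depth deviates from the ground
    truth by more than a factor `threshold` (reported as an error)."""
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    return 100.0 * np.mean(ratio > threshold)
```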
| Method | Training set | DIW (WHDR) | ETH3D (rel) | Sintel (rel) | KITTI | NYU | Rank |
|---|---|---|---|---|---|---|---|
| Godard  | CS → K | 28.87 | 0.224 | 0.430 | 13.25 | 35.52 | 5.4 |
| Chen  | NYU → DIW | 14.81 | 0.236 | 0.444 | 41.43 | 28.08 | 7.0 |
Comparison of loss functions. We show the effect of different loss functions on validation performance in Table 4. We used the ReDWeb dataset to train networks with different losses: the mean squared error in disparity space (MSE); the scale-invariant loss in log-depth space (5) as defined by Eigen et al. ; the ordinal loss (6), where we sample 5,000 point pairs randomly (ORD); a scale-invariant loss in disparity space, where we fix the shift to zero in (1) and only estimate the scale (SIMSE); and the full scale- and shift-invariant loss (1) (SSIMSE). Note that the model trained with the ordinal loss (ORD) corresponds to our reimplementation of Xian et al. . Table 4 shows that our proposed loss yields the lowest validation error on all datasets. We thus conduct all experiments that follow using the scale- and shift-invariant loss.
[Figure 2: qualitative comparison. Rows: input image, Chen et al. , Godard et al. , Li and Snavely , Xian et al. , ours, and ground truth; columns show samples from DIW, ETH3D, Sintel, KITTI, and NYU.]
Qualitative comparison of our approach to various baselines. Ground truth on KITTI was interpolated from sparse LiDAR measurements for visualization. On DIW the yellow and purple dots represent sparse human annotations for close and far, respectively.
Training on diverse datasets. We show the performance of models that were trained on different combinations of training sets in Table 4. We trained models with the MV dataset sampled at 1 fps as well as 4 fps. We indicate if models were trained using Pareto-optimal mixing (MGDA) in a separate column. We observe that adding more training sets consistently improves performance across validation sets. The strongest models (RW+MD+MV) were trained on all three datasets and outperform the best single-dataset model (RW) on all validation sets. Oversampling clips in the movie dataset (4 fps) typically leads to an increase in performance, likely because it provides a natural form of data augmentation due to small shifts in perspective and scene composition over consecutive frames. We can additionally see that performing principled Pareto-optimal dataset mixing (MGDA) leads to a noticeable improvement over the naive mixing strategy on most datasets. Using all three datasets (RW+MD+MV) with MGDA yields our best-performing model.
Effect of 3D Movies dataset. Our 3D Movies dataset features smaller baselines than ReDWeb and thus provides less accurate disparities in the background. However, an analysis of Table 4 shows that it consistently improves performance when used in combination with other datasets, especially for diverse test sets. Fine-tuning only on MegaDepth or 3D Movies decreases performance compared to training on ReDWeb only, while fine-tuning on MegaDepth and 3D Movies (MD+MV) leads to consistently better results. We further observe that mixing 3D Movies and ReDWeb always leads to an improvement on the ReDWeb validation set. Similar findings hold on DIW, arguably the most diverse test set. We achieve the best results when mixing all three datasets. Without Pareto-optimal mixing, RW+MD+MV (4 fps) improves performance on five out of seven datasets when compared to RW+MD, while RW+MD+MV (1 fps) improves performance on four out of seven datasets.
Comparison to the state of the art. We compare our best-performing model to various state-of-the-art baselines in Table 5. The top part of the table compares to baselines that were not fine-tuned on any of the evaluated datasets (i.e., zero-shot transfer, akin to our model). The bottom part shows baselines that were fine-tuned on a subset of the datasets for reference. In the training set column, CS refers to Cityscapes , K to KITTI, and A → B indicates that a model was pretrained on A and fine-tuned on B.
Our model outperforms the baselines by a comfortable margin on most datasets. The only exception is KITTI, where a model that was trained on the visually similar Cityscapes dataset achieves the lowest error. It can also be seen that fine-tuning on the KITTI dataset improves accuracy on this specific dataset but often leads to worse performance on other datasets. A qualitative comparison is shown in Figure 2. The visual results correspond well to our quantitative evaluation. Only our model and the model of Xian et al. adequately handle diverse scenes. The model of Xian et al. tends to miss details or place parts of the scene at the wrong depth. Interestingly, the bias of the model that was trained on street scenes can be clearly observed in the sample from ETH3D, where the slope of the staircase is too flat and the thin horizontal shadow on the left side of the building is mistaken for a free-standing pole. Note, however, that this model better reconstructs the thin structures in the foreground, likely because such structures are frequently encountered in street scenes (e.g. street signs and poles). Additional results are shown in the supplement.
The success of deep networks has been driven by massive datasets. We believe that learning truly general models for monocular depth estimation will require not only innovations in network architectures, but also creative ways to boost the amount and diversity of training data. Motivated by the difficulty of capturing diverse depth datasets at scale, we have introduced tools for combining complementary sources of data. We have proposed a flexible loss function and a principled dataset mixing strategy. We have further introduced a dataset based on 3D movies that provides dense ground truth for diverse dynamic scenes. To evaluate the robustness and generality of trained models, we used zero-shot cross-dataset transfer: systematically testing models on datasets that were never seen during training. The results indicate that the presented ideas substantially advance monocular depth estimation in diverse environments. Our code and pretrained models will be made available online.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.
Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
Does computer vision matter for action? Science Robotics, 4(30), 2019.
Table 6 shows the complete list of movies that were used for creating the 3D Movies dataset. We additionally state the number of extracted clips of one second (24 frames each). Note that discrepancies in the number of extracted clips per movie occur due to varying runtimes.
| Movie title | # Clips |
| --- | --- |
| Battle of the Year (2013) | 2053 |
| Billy Lynn's Long Halftime Walk (2016) | 1645 |
| Drive Angry (2011) | 1722 |
| Exodus: Gods and Kings (2014) | 2847 |
| Final Destination 5 (2011) | 1502 |
| A Very Harold & Kumar 3D Christmas (2011) | 1601 |
| The Hobbit: An Unexpected Journey (2012) | 2742 |
| The Three Musketeers (2011) | 1958 |
| Nurse 3D (2013) | 1397 |
| Dawn of the Planet of the Apes (2014) | 2087 |
| The Amazing Spider-Man (2012) | 2240 |
| Step Up 3D (2010) | 1841 |
| Step Up: All In (2014) | 1876 |
| Transformers: Age of Extinction (2014) | 2903 |
| Le Dernier Loup / Wolf Totem (2015) | 1874 |
| X-Men: Days of Future Past (2014) | 2687 |
| The Great Gatsby (2013) | 2606 |
| Step Up: Miami Heat / Revolution (2012) | 1829 |
| Doctor Who - The Day of the Doctor (2013) | 1447 |
| StreetDance 2 (2012) | 1514 |
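The extraction of one-second clips (24 frames each) described above can be sketched with plain frame-index arithmetic. The stride between sampled clips (the `stride_s` parameter below) is an illustrative assumption, not the paper's exact protocol:

```python
def clip_ranges(n_frames, fps=24, clip_len_s=1.0, stride_s=4.0):
    """Return (start, end) frame indices of one-second clips sampled
    every `stride_s` seconds from a movie with `n_frames` frames at `fps`.

    Movies with longer runtimes yield more clips, which explains the
    per-movie differences in clip counts in Table 6.
    """
    clip_len = int(round(clip_len_s * fps))
    stride = int(round(stride_s * fps))
    ranges = []
    start = 0
    while start + clip_len <= n_frames:  # only keep complete clips
        ranges.append((start, start + clip_len))
        start += stride
    return ranges
```

For example, a 10-second movie at 24 fps with a 4-second stride yields clips starting at frames 0, 96, and 192.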
To see that the proposed loss is invariant to the scale and shift of the prediction, let $\hat{\mathbf{d}} \in \mathbb{R}^M$ be the prediction, $\mathbf{d}^* \in \mathbb{R}^M$ the ground truth, and $\mathbf{h} = (s, t)^\top$ the scale and shift obtained from the least-squares alignment $\mathbf{h}_{\mathrm{opt}} = \arg\min_{\mathbf{h}} \|D\mathbf{h} - \mathbf{d}^*\|^2$, with $D = (\hat{\mathbf{d}}, \mathbf{1})$. A scale-and-shift transformation of the prediction, $\alpha\hat{\mathbf{d}} + \beta\mathbf{1}$ with $\alpha > 0$, amounts to replacing $D$ by $DA$, where $A = \begin{pmatrix} \alpha & 0 \\ \beta & 1 \end{pmatrix}$ is invertible. We have
$$\mathbf{h}'_{\mathrm{opt}} = \arg\min_{\mathbf{h}} \|DA\mathbf{h} - \mathbf{d}^*\|^2.$$
The optimality condition is:
$$\mathbf{h}'_{\mathrm{opt}} = \big((DA)^\top DA\big)^{-1}(DA)^\top \mathbf{d}^* = A^{-1}\big(D^\top D\big)^{-1}D^\top \mathbf{d}^* = A^{-1}\mathbf{h}_{\mathrm{opt}}.$$
Substitute into (11) to see that $A$ cancels out: the aligned prediction $DA\mathbf{h}'_{\mathrm{opt}} = D\mathbf{h}_{\mathrm{opt}}$, and hence the loss, is unchanged.
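The invariance can also be checked numerically: after least-squares alignment, the aligned prediction is identical no matter what scale and shift were applied to the raw prediction. A small sketch with synthetic data (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100
d_hat = rng.random(M)  # raw prediction, defined only up to scale and shift
d_star = 2.5 * d_hat + 0.3 + 0.01 * rng.standard_normal(M)  # pseudo ground truth

def align(pred, target):
    """Least-squares alignment: returns D @ h with h = argmin ||D h - target||^2."""
    D = np.stack([pred, np.ones_like(pred)], axis=1)
    h, *_ = np.linalg.lstsq(D, target, rcond=None)
    return D @ h

a, b = 4.0, -1.7  # arbitrary scale (a > 0) and shift applied to the prediction
aligned1 = align(d_hat, d_star)
aligned2 = align(a * d_hat + b, d_star)
assert np.allclose(aligned1, aligned2)  # aligned prediction (and loss) unchanged
```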
Our evaluation in Table 4 of the main paper was performed on the validation sets of MegaDepth (2963 images) and the 3D Movies dataset (4435 images). For DIW, we created a validation set of 10000 images from the DIW training set. For ReDWeb, we left out 360 images of the training set for validation. For KITTI, we used the Eigen test split of 697 images. For NYU, we used the official test split of 654 images. For ETH3D and the MPI Sintel depth dataset, we used all images from the respective datasets with publicly available ground truth (454 and 1064 images, respectively). For comparisons to the state of the art (Table 5), we used the test set of DIW with 74441 images.
Alignment. We align the scale and shift of all predictions (our models as well as baselines) to the ground truth before conducting evaluations. We perform the alignment in inverse depth space based on a least-squares criterion.
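A minimal sketch of this alignment step, assuming the network outputs inverse depth (disparity-like values) and the ground truth is metric depth with a validity mask; function and variable names are illustrative:

```python
import numpy as np

def align_prediction(pred_disp, gt_depth, mask):
    """Align a predicted inverse-depth map to ground truth via least squares.

    pred_disp: network output in inverse-depth units (arbitrary scale/shift).
    gt_depth:  metric ground-truth depth; mask: boolean valid-pixel mask.
    Returns metric depth computed from the aligned prediction.
    """
    gt_disp = np.zeros_like(gt_depth)
    gt_disp[mask] = 1.0 / gt_depth[mask]  # alignment happens in inverse-depth space
    p = pred_disp[mask].ravel()
    g = gt_disp[mask].ravel()
    D = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(D, g, rcond=None)  # least-squares scale and shift
    aligned = s * pred_disp + t
    aligned = np.maximum(aligned, 1e-8)  # guard against non-positive disparities
    return 1.0 / aligned
```

The same alignment is applied to our models and to all baselines before computing metrics, so no method is penalized for its output scale.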
Depth cap. Following prior work, we cap predictions at an appropriate maximum value for datasets that are evaluated in depth space (ETH3D, Sintel, KITTI, NYU). For ETH3D, KITTI, and NYU, the depth cap was set to the maximum ground-truth depth value (72, 80, and 10 meters, respectively). For Sintel, we evaluate only on areas with ground-truth depth smaller than 72 meters and accordingly use a depth cap of 72 meters.
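As a concrete illustration of the capping step, here is a sketch that applies a depth cap before computing the absolute relative error (one representative metric; the exact metric set is not prescribed by this snippet):

```python
import numpy as np

def capped_abs_rel(pred_depth, gt_depth, cap):
    """Absolute relative error with predictions capped at `cap` meters.

    Only pixels with valid ground truth at or below the cap are evaluated,
    mirroring the Sintel-style restriction described above.
    """
    pred = np.minimum(pred_depth, cap)  # cap the prediction
    valid = (gt_depth > 0) & (gt_depth <= cap)
    return float(np.mean(np.abs(pred[valid] - gt_depth[valid]) / gt_depth[valid]))
```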
Input resolution for evaluation. For the 3D Movies dataset, we use a center crop after downscaling the original image by a factor of four. For all other datasets, we downscale input images to a size that is as close as possible to the training resolution, while maintaining the aspect ratio and ensuring that each dimension is divisible by 32 (a constraint imposed by the network architecture).
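The resizing rule can be sketched as follows; the target side length of 384 is an assumption used for illustration, not a value stated in this section:

```python
def eval_size(width, height, target=384):
    """Scale (width, height) so the smaller side lands near `target`,
    keeping the aspect ratio and rounding each side to the nearest
    multiple of 32 (required by the network architecture)."""
    scale = target / min(width, height)

    def round32(x):
        # round to the nearest multiple of 32, never below 32
        return max(32, int(round(x * scale / 32)) * 32)

    return round32(width), round32(height)
```

For a 1920x1080 input this yields a 672x384 evaluation size: the smaller side is mapped to the target and the larger side is rounded to the nearest multiple of 32, slightly perturbing the aspect ratio.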
Resolution of evaluation. On the 3D Movies dataset, we evaluate at the resolution of the prediction. On the MegaDepth dataset, we follow the original evaluation protocol, which prescribes fixed resolutions for landscape and portrait images. On the remaining datasets (DIW, ReDWeb, ETH3D, Sintel, KITTI, and NYU), the prediction is upscaled to the original resolution of the input images and evaluated at full resolution. For ETH3D, we rendered ground-truth depth maps from the 3D point clouds.
We show additional results of our best-performing model that was trained on all three datasets (RW+MD+MV) with the multi-task learning strategy (MGDA).
Supplementary video. In the supplementary video, we show qualitative results on the DAVIS video dataset . Note that every frame was processed individually, i.e. no temporal information was used in any way. For each clip, the inverse depth maps were jointly scaled and shifted for visualization. The dataset consists of a diverse set of videos and includes humans, animals, and cars in action. This dataset was filmed with monocular cameras, hence, no ground truth depth information is available.
Additional qualitative results. We show qualitative results from the two movies in the test set (Doctor Who - The Day of the Doctor and StreetDance 2) in Figures 3 and 4. The ground truth (obtained from stereo matching) is shown for reference. Invalid pixels have been masked out in the ground truth. Note that our method is able to handle various camera angles, complex scenes depicting multiple people or animals, atypical objects such as robots, and various static objects.
To further showcase the generalization ability of our model, Figure 5 provides qualitative results on the DIW test set . We again show results on a diverse set of input images depicting various objects and scenes, including humans, mammals, birds, cars, helicopters in flight, and other man-made and natural objects. The images feature indoor, street and nature scenes, various lighting conditions, and various camera angles. Additionally, subject areas vary from close-up to long-range shots.
Failure cases. We identify common failure cases and biases of our model. As observed in prior work, images have a natural bias where the lower parts of the image are closer to the camera than the higher image regions. When randomly sampling two points and classifying the lower point as closer to the camera, this simple heuristic achieved an agreement rate of 85.8% with human annotators. To some extent, this bias has also been learned by our network and can be observed in some extreme cases that are shown in Figure 6. In the example at the top, the model fails to recover the ground plane, likely because the input image was rotated by 90 degrees. In the bottom image, pellets at approximately the same distance to the camera are reconstructed as closer to the camera in the lower part of the image. Such cases could be prevented by augmenting the training data with rotated images. However, it is not clear whether invariance to image rotations is a necessary or desired property for this task.
Some particularly interesting failure cases are shown in Figure 7. Paintings, photos, and mirrors are often not recognized as such, especially if they are very prominent in the image. The network estimates depth based on the content that is depicted on the reflector rather than predicting the depth of the reflector itself.
A selection of additional failure cases is shown in Figure 8. Strong edges in RGB space can lead to hallucinated depth discontinuities, resulting, for example, in heads being detached from the lower body. Thin structures can be missed by the network and relative depth arrangement between disconnected objects might fail in some situations (e.g. relative placement of people). The results tend to get blurred in background areas, which might be explained by the limited resolution of the input images and imperfect ground truth in the far range.