Learning to Recover 3D Scene Shape from a Single Image

by Wei Yin, et al.

Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape due to an unknown depth shift induced by shift-invariant reconstruction losses used in mixed-data depth prediction training, and possible unknown camera focal length. We investigate this problem in detail, and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length that allow us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: https://git.io/Depth


1 Introduction

3D scene reconstruction is a fundamental task in computer vision. The established approach to address this task is SLAM or SfM [hartley2003multiple], which reconstructs 3D scenes based on feature-point correspondence with consecutive frames or multiple views. In contrast, this work aims to achieve dense 3D scene shape reconstruction from a single in-the-wild image. Without multiple views available, we rely on monocular depth estimation. However, as shown in the first-page figure, existing monocular depth estimation methods [eigen2014depth, wang2020sdc, Yin2019enforcing] alone are unable to faithfully recover an accurate 3D point cloud.

Unlike multi-view reconstruction methods, monocular depth estimation requires leveraging high-level scene priors, so data-driven approaches have become the de facto solution to this problem [li2018megadepth, Ranftl2020, wang2019web, yin2020diversedepth]. Recent works have shown promising results by training deep neural networks on diverse in-the-wild data, web stereo images and stereo videos [chen2016single, chen2020oasis, Ranftl2020, wang2019web, xian2018monocular, xian2020structure, yin2020diversedepth]. However, the diversity of the training data also poses challenges for the model training, as training data captured by different cameras can exhibit significantly different image priors for depth estimation [facil2019cam]. Moreover, web stereo images and videos can only provide depth supervision up to a scale and shift due to the unknown camera baselines and stereoscopic post-processing [lang2010nonlinear]. As a result, state-of-the-art in-the-wild monocular depth models use various types of losses invariant to scale and shift in training. While an unknown scale in depth will not cause any shape distortion, as it scales the 3D scene uniformly, an unknown depth shift will (see Sec. 3.1 and the first-page figure). In addition, the camera focal length of a given image may not be accessible at test time, leading to more distortion of the 3D scene shape. This scene shape distortion is a critical problem for downstream tasks such as 3D view synthesis and 3D photography.

To address these challenges, we propose a novel monocular scene shape estimation framework that consists of a depth prediction module and a point cloud reconstruction module. The depth prediction module is a convolutional neural network trained on a mixture of existing datasets that predicts depth maps up to a scale and shift. The point cloud reconstruction module leverages point cloud encoder networks that predict shift and focal length adjustment factors from an initial guess of the scene point cloud reconstruction. A key observation that we make here is that, when operating on point clouds derived from depth maps, and not on images themselves, we can train models to learn 3D scene shape priors using synthetic 3D data or data acquired by 3D laser scanning devices. The domain gap is significantly less of an issue for point clouds than for images, although these data sources are significantly less diverse than internet images. We empirically show that these point cloud encoders generalize well to unseen datasets.

Furthermore, to train a robust monocular depth prediction model on mixed data from multiple sources, we propose a simple but effective image-level normalized regression loss, and a pair-wise surface normal regression loss. The former loss transforms the depth data to a canonical scale-shift-invariant space for more robust training, while the latter improves the geometry of our predicted depth maps. To summarize, our main contributions are:

  • A novel framework for in-the-wild monocular 3D scene shape estimation. To the best of our knowledge, this is the first fully data-driven method for this task, and the first method to leverage 3D point cloud neural networks for improving the structure of point clouds derived from depth maps.

  • An image-level normalized regression loss and a pair-wise surface normal regression loss for improving monocular depth estimation models trained on mixed multi-source datasets.

Experiments show that our point cloud reconstruction module can recover accurate 3D shape from a single image, and that our depth prediction module achieves state-of-the-art results on zero-shot dataset transfer to unseen datasets.

2 Related Work

Monocular depth estimation in the wild.

This task has recently seen impressive progress [chen2016single, chen2019learning, chen2020oasis, li2018megadepth, wang2019web, wang2020foresee, xian2018monocular, xian2020structure, yin2020diversedepth]. The key properties of such approaches are what data can be used for training, and what objective function makes sense for that data. When metric depth supervision is available, networks can be trained to directly regress these depths [eigen2014depth, liu2015learning, Yin2019enforcing]. However, obtaining metric ground truth depth for diverse datasets is challenging. As an alternative, Chen et al. [chen2016single] collect diverse relative depth annotations for internet images, while other approaches propose to scrape stereo images or videos from the internet [Ranftl2020, wang2019web, xian2018monocular, xian2020structure, yin2020diversedepth]. Such diverse data is important for generalizability, but as the metric depth is not available, direct depth regression losses cannot be used. Instead, these methods rely either on ranking losses which evaluate relative depth [chen2016single, xian2018monocular, xian2020structure] or scale and shift invariant losses [Ranftl2020, wang2019web] for supervision. The latter methods produce especially robust depth predictions, but as the camera model is unknown and an unknown shift resides in the depth, the 3D shape cannot be reconstructed from the predicted depth maps. In this paper, we aim to reconstruct the 3D shape from a single image in the wild.

3D reconstruction from a single image.

A number of works have addressed reconstructing different types of objects from a single image [barron2014shape, wang2018pixel2mesh, wu2018learning], such as humans [saito2019pifu, saito2020pifuhd], cars, planes, tables, etc. The main challenge is how to best recover object details, and how to represent them with limited memory. Pixel2Mesh [wang2018pixel2mesh] proposes to reconstruct the 3D shape from a single image and express it as a triangular mesh. PIFu [saito2019pifu, saito2020pifuhd] proposes a memory-efficient implicit function to recover high-resolution surfaces of humans, including unseen or occluded regions. However, all these methods rely on learning priors specific to a certain object class or instance, typically from 3D supervision, and therefore cannot be applied to full scene reconstruction.

On the other hand, several works have proposed reconstructing 3D scene structure from a single image. Saxena et al. [saxena2008make3d] assume that the whole scene can be segmented into several pieces, each of which can be regarded as a small plane. They predict the orientation and the location of the planes and stitch them together to represent the scene. Other works propose to use image cues, such as shading [prados2005shape] and contour edges [karpenko2006smoothsketch] for scene reconstruction. However, these approaches use hand-designed priors and restrictive assumptions about the scene geometry. Our method is fully data driven, and can be applied to a wide range of scenes.

Figure 1: Method Pipeline. During training, the depth prediction model (top left) and point cloud module (top right) are trained separately on different sources of data. During inference (bottom), the two networks are combined together to predict depth and from that, the depth shift and focal length that together allow for an accurate scene shape reconstruction. Note that we employ point cloud networks to predict shift and focal length scaling factor separately. Please see the text for more details.

Camera intrinsic parameter estimation.

Recovering a camera’s focal length is an important part of 3D scene understanding. Traditional methods utilize reference objects such as planar calibration grids [zhang2000flexible] or vanishing points [deutscher2002automatic], which can then be used to estimate a focal length. Other methods [hold2018perceptual, workman2015deepfocal] propose a data driven approach where a CNN recovers the focal length on in-the-wild data directly from an image. In contrast, our point cloud module estimates the focal length directly in 3D, which we argue is an easier task than operating on natural images directly.

3 Method

Our two-stage single image 3D shape estimation pipeline is illustrated in Fig. 1. It is composed of a depth prediction module (DPM) and a point cloud module (PCM). The two modules are trained separately on different data sources, and are then combined at inference time. The DPM takes an RGB image and outputs a depth map [yin2020diversedepth] with unknown scale and shift in relation to the true metric depth map. The PCM takes as input a distorted 3D point cloud, computed using a predicted depth map and an initial estimate of the focal length, and outputs shift adjustments to the depth map and focal length to improve the geometry of the reconstructed 3D scene shape.

3.1 Point Cloud Module

Figure 2: Illustration of the distorted 3D shape caused by incorrect shift and focal length. A ground truth depth map is projected in 3D and visualized. When the focal length is incorrectly estimated, we observe significant structural distortion, e.g., see the angle between the two walls A and B. Second row (front view): a depth shift also causes shape distortion, see the roof.

We assume a pinhole camera model for the 3D point cloud reconstruction, which means that the unprojection from 2D coordinates $(u, v)$ and depth $d$ to 3D points $(x, y, z)$ is:

$$x = \frac{u - u_0}{f}\, d, \qquad y = \frac{v - v_0}{f}\, d, \qquad z = d,$$

where $(u_0, v_0)$ are the camera optical center, $f$ is the focal length, and $d$ is the depth. The focal length affects the point cloud shape as it scales the $x$ and $y$ coordinates, but not $z$. Similarly, a shift of $d$ will affect the $x$, $y$, and $z$ coordinates non-uniformly, which will result in shape distortions.
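To make the effect of scale, shift, and focal length concrete, the following is a minimal sketch of pinhole unprojection in plain Python. The three pixels, their depths, and the camera parameters are made-up illustrative values, and `side_ratio` is a hypothetical helper that compares two segment lengths as a crude shape descriptor; none of this is taken from the authors' code.

```python
import math

def unproject(u, v, d, f, u0, v0):
    """Pinhole unprojection of pixel (u, v) with depth d to a 3D point (x, y, z)."""
    x = (u - u0) / f * d
    y = (v - v0) / f * d
    return (x, y, d)

def side_ratio(pixels, depths, f, u0, v0):
    """Length ratio of segments P0-P1 and P1-P2; unchanged by uniform scaling of the scene."""
    p0, p1, p2 = (unproject(u, v, d, f, u0, v0) for (u, v), d in zip(pixels, depths))
    return math.dist(p0, p1) / math.dist(p1, p2)

# Hypothetical toy scene: three pixels with made-up depths and intrinsics.
pixels = [(0, 0), (100, 0), (100, 100)]
depths = [1.0, 2.0, 3.0]
f, u0, v0 = 100.0, 50.0, 50.0

reference = side_ratio(pixels, depths, f, u0, v0)
scaled = side_ratio(pixels, [2.0 * d for d in depths], f, u0, v0)   # unknown depth scale
shifted = side_ratio(pixels, [d + 1.0 for d in depths], f, u0, v0)  # unknown depth shift
refocal = side_ratio(pixels, depths, 2.0 * f, u0, v0)               # wrong focal length

print(f"reference={reference:.4f} scaled={scaled:.4f} "
      f"shifted={shifted:.4f} refocal={refocal:.4f}")
```

In this toy setup the ratio is preserved under a depth scale but changes under a depth shift or a wrong focal length, which is the distinction the paragraph above draws.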

For a human observer, these distortions are immediately recognizable when viewing the point cloud at an oblique angle (Fig. 2), although they cannot be observed looking at a depth map alone. As a result, we propose to directly analyze the point cloud to determine the unknown shift and focal length parameters. We tried a number of network architectures that take unstructured 3D point clouds as input, and found that the recent PVCNN [liu2019pvcnn] performed well for this task, so we use it in all experiments here.

During training, a perturbed input point cloud with incorrect shift and focal length is synthesized by perturbing the known ground truth depth shift and focal length. The ground truth depth is transformed by a randomly drawn shift, and the ground truth focal length is transformed by a randomly drawn scale whose range keeps the focal length positive and non-zero.

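When recovering the depth shift, the perturbed 3D point cloud $\mathcal{F}(u_0, v_0, f^*, d^* + \Delta^*_d)$, i.e. the unprojection of the ground truth depth $d^*$ shifted by $\Delta^*_d$, is given as input to the shift point cloud network $\mathcal{N}_d(\cdot)$, trained with the objective:

$$L = \min_{\theta} \left| \mathcal{N}_d\big(\mathcal{F}(u_0, v_0, f^*, d^* + \Delta^*_d), \theta\big) - \Delta^*_d \right|,$$

where $\theta$ are network weights and $f^*$ is the true focal length.

Similarly, when recovering the focal length, the point cloud $\mathcal{F}(u_0, v_0, \alpha^*_f f^*, d^*)$, unprojected with the true focal length scaled by $\alpha^*_f$, is fed to the focal length point cloud network $\mathcal{N}_f(\cdot)$, trained with the objective:

$$L = \min_{\theta} \left| \mathcal{N}_f\big(\mathcal{F}(u_0, v_0, \alpha^*_f f^*, d^*), \theta\big) - \alpha^*_f \right|.$$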
During inference, the ground truth depth is replaced with the predicted affine-invariant depth, which is normalized prior to the 3D reconstruction. We use an initial guess of the focal length, giving us a reconstructed point cloud, which is fed to the shift and focal length networks to predict the shift and the focal length scaling factor, respectively. In our experiments we simply use an initial focal length derived from a fixed field of view (FOV). We have also tried to employ a single network to predict both the shift and the scaling factor, but have empirically found that two separate networks achieve better performance.
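The perturb-then-correct idea can be sketched as follows. This is a hedged illustration, not the authors' implementation: `focal_from_fov`, `perturb`, and `correct` are hypothetical helper names, the FOV is assumed to be horizontal, all numeric values are made up, and the correction step assumes the networks regress exactly the shift and scale that were applied during synthesis (so the correction subtracts the shift and divides by the scale). The "predictions" below are oracle values standing in for network outputs.

```python
import math

def focal_from_fov(width_px, fov_deg):
    """Focal length in pixels for a given horizontal field of view (assumed convention)."""
    return 0.5 * width_px / math.tan(math.radians(fov_deg) / 2.0)

def perturb(depth, focal, shift, scale):
    """Training-time synthesis: add a shift to the depth and scale the focal length."""
    return [d + shift for d in depth], focal * scale

def correct(depth, focal, pred_shift, pred_scale):
    """Inference-time correction, assuming the networks output the applied shift and scale."""
    return [d - pred_shift for d in depth], focal / pred_scale

# Made-up ground truth: three depth values and a focal length from a 90 degree FOV.
gt_depth, gt_focal = [0.2, 0.5, 0.9], focal_from_fov(640, 90.0)

# Synthesize a distorted sample, then undo it with oracle "predictions".
bad_depth, bad_focal = perturb(gt_depth, gt_focal, shift=0.3, scale=0.8)
fixed_depth, fixed_focal = correct(bad_depth, bad_focal, pred_shift=0.3, pred_scale=0.8)
```

With perfect predictions the round trip returns the original depth and focal length; in practice the quality of the correction depends on how well the point cloud networks generalize.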

3.2 Monocular Depth Prediction Module

We train our depth prediction module on multiple data sources including high-quality LiDAR sensor data [zamir2018taskonomy], and low-quality web stereo data [Ranftl2020, wang2019web, xian2020structure] (see Sec. 4). As these datasets have varied depth ranges and web stereo datasets contain unknown depth scale and shift, we propose an image-level normalized regression (ILNR) loss to address this issue. Moreover, we propose a pair-wise normal regression (PWN) loss to improve local geometric features.

Image-level normalized regression loss.

Depth maps of different data sources can have varied depth ranges. Therefore, they need to be normalized to make the model training easier. Simple Min-Max normalization [garcia2015data, singh2019investigating] is sensitive to depth value outliers. For example, a large value at a single pixel will affect the rest of the depth map after the Min-Max normalization. We investigate more robust normalization methods and propose a simple but effective image-level normalized regression loss for mixed-data training.

Our image-level normalized regression loss transforms each ground truth depth map to a similar numerical range based on its individual statistics. To reduce the effect of outliers and long-tail residuals, we combine tanh normalization [singh2019investigating] with a trimmed Z-score, after which we can simply apply a pixel-wise mean absolute error (MAE) between the prediction and the normalized ground truth depth maps. The ILNR loss is formally defined as follows.

$$L_{\text{ILNR}} = \frac{1}{N} \sum_{i}^{N} \left| d_i - \bar{d}^*_i \right| + \left| \tanh(d_i / 100) - \tanh(\bar{d}^*_i / 100) \right|,$$

where $\bar{d}^*_i = (d^*_i - \mu_{\text{trim}}) / \sigma_{\text{trim}}$, and $\mu_{\text{trim}}$ and $\sigma_{\text{trim}}$ are the mean and the standard deviation of a trimmed depth map from which a fixed fraction of the nearest and farthest pixels has been removed, $d$ is the predicted depth, and $d^*$ is the ground truth depth map. We have tested a number of other normalization methods such as Min-Max normalization [singh2019investigating], Z-score normalization [fukunaga2013introduction], and median absolute deviation normalization (MAD) [singh2019investigating]. In our experiments, we found that our proposed ILNR loss achieves the best performance.
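The normalization described above can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' code: depth maps are flattened to lists, the function names are hypothetical, and the default `trim_frac` and `tanh_scale` values are assumptions chosen for the example.

```python
import math

def trimmed_stats(values, trim_frac):
    """Mean and standard deviation after dropping the smallest and largest trim_frac of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_frac)
    core = ordered[k:len(ordered) - k]
    mean = sum(core) / len(core)
    std = math.sqrt(sum((v - mean) ** 2 for v in core) / len(core))
    return mean, std

def normalize_gt(gt, trim_frac=0.1):
    """Trimmed Z-score of a ground truth depth map (flattened to a list)."""
    mu, sigma = trimmed_stats(gt, trim_frac)
    sigma = max(sigma, 1e-8)  # guard against constant depth maps
    return [(g - mu) / sigma for g in gt]

def ilnr_loss(pred, gt, trim_frac=0.1, tanh_scale=100.0):
    """MAE plus a tanh-compressed MAE between prediction and normalized ground truth."""
    target = normalize_gt(gt, trim_frac)
    total = 0.0
    for p, t in zip(pred, target):
        total += abs(p - t)
        total += abs(math.tanh(p / tanh_scale) - math.tanh(t / tanh_scale))
    return total / len(pred)
```

Because the target is a Z-score of each ground truth map, replacing `gt` by `a * gt + b` with `a > 0` leaves the loss unchanged, which is what lets depth maps with different unknown scales and shifts share one regression target. The trimming keeps a single extreme pixel from dominating the statistics.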

Pair-wise normal loss.

Normals are an important geometric property, which have been shown to be a complementary modality to depth [silberman2012indoor]. Many methods have been proposed to use normal constraints to improve the depth quality, such as the virtual normal loss [Yin2019enforcing]. However, as the virtual normal only leverages global structure, it cannot help improve the local geometric quality, such as depth edges and planes. Recently, Xian et al. [xian2020structure] proposed a structure-guided ranking loss, which can improve edge sharpness. Inspired by these methods, we follow their sampling method but enforce the supervision in surface normal space. Moreover, our samples include not only edges but also planes. Our proposed pair-wise normal (PWN) loss can better constrain both the global and local geometric relations.

The surface normal is obtained from the reconstructed 3D point cloud by local least squares fitting [Yin2019enforcing]. Before calculating the predicted surface normal, we align the predicted depth and the ground truth depth with a scale and shift factor, which are retrieved by least squares fitting [Ranftl2020]. From the surface normal map, the planar regions where normals are almost the same and edges where normals change significantly can be easily located. Then, we follow [xian2020structure] and sample paired points on both sides of these edges. If planar regions can be found, paired points will also be sampled on the same plane. In doing so, we sample K paired points per training sample on average. In addition, to improve the global geometric quality, we also randomly sample paired points globally. The sampled points are $\{(A_i, B_i),\ i = 0, \dots, N\}$, while their corresponding normals are $\{(n_{A_i}, n_{B_i}),\ i = 0, \dots, N\}$. The PWN loss is:

$$L_{\text{PWN}} = \frac{1}{N} \sum_{i}^{N} \left| n_{A_i} \cdot n_{B_i} - n^*_{A_i} \cdot n^*_{B_i} \right|,$$

where $n^*$ denotes ground truth surface normals. As this loss accounts for both local and global geometry, we find that it improves the overall reconstructed shape.
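A pair-wise comparison in surface normal space can be sketched as below. The sketch assumes the loss is the mean absolute difference between the dot products of predicted and ground truth normal pairs; `dot` and `pwn_loss` are hypothetical names, normals are plain 3-tuples, and the sampling of pairs around edges and planes is left out.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))

def pwn_loss(pred_pairs, gt_pairs):
    """Mean absolute difference between predicted and ground truth normal-pair dot products.

    Each element of pred_pairs / gt_pairs is a pair (n_A, n_B) of unit normals
    at the two sampled points A and B.
    """
    total = 0.0
    for (pa, pb), (ga, gb) in zip(pred_pairs, gt_pairs):
        total += abs(dot(pa, pb) - dot(ga, gb))
    return total / len(pred_pairs)

# Toy example: the ground truth pair straddles a right-angle edge,
# while the prediction has smoothed the edge into a single plane.
up, side = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
gt_pairs = [(up, side)]
pred_pairs = [(up, up)]
edge_penalty = pwn_loss(pred_pairs, gt_pairs)
```

Comparing dot products rather than the normals themselves makes each term depend only on the relative orientation of the two sampled surfaces, so over-smoothed edges and bent planes are both penalized.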

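Finally, we also use a multi-scale gradient loss [li2018megadepth]:

$$L_{\text{MSG}} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{N} \left| \nabla^k_x d_i - \nabla^k_x \bar{d}^*_i \right| + \left| \nabla^k_y d_i - \nabla^k_y \bar{d}^*_i \right|,$$

where $\nabla^k_x$ and $\nabla^k_y$ are image gradients at scale $k$, $d$ is the predicted depth, and $\bar{d}^*$ is the normalized ground truth depth.

The overall loss function is formally defined as follows.

$$L = L_{\text{PWN}} + \lambda_a L_{\text{ILNR}} + \lambda_g L_{\text{MSG}},$$

where $\lambda_a$ and $\lambda_g$ are weights that are kept fixed in all experiments.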

4 Experiments

Dataset # Img Scene Evaluation Supervision
Type Metric Type
NYU Indoor AbsRel & Kinect
ScanNet Indoor AbsRel & Kinect
2D-3D-S Indoor LSIV LiDAR
iBims-1 Indoor
AbsRel &
KITTI Outdoor AbsRel & LiDAR
Sintel Outdoor AbsRel & Synthetic
ETH3D Outdoor AbsRel & LiDAR
YouTube3D In the Wild WHDR SfM, Ordinal pairs
OASIS In the Wild
User clicks,
Small patches with GT
& Outdoor
AbsRel &
Table 1: Overview of the test sets in our experiments.

Datasets and implementation details.

To train the PCM, we sampled K Kinect-captured depth maps from ScanNet, K LiDAR-captured depth maps from Taskonomy, and K synthetic depth maps from the 3D Ken Burns paper [Niklaus_TOG_2019]. We train the network using SGD with a batch size of , an initial learning rate of , and a learning rate decay of . For parameters specific to PVCNN, such as the voxel size, we follow the original work [liu2019pvcnn].

To train the DPM, we sampled K RGBD pairs from LiDAR-captured Taskonomy [zamir2018taskonomy], K synthetic RGBD pairs from the 3D Ken Burns paper [Niklaus_TOG_2019], K RGBD pairs from calibrated stereo DIML [kim2018deep], K RGBD pairs from web-stereo Holopix50K [hua2020holopix50k], and K web-stereo HRWSI [xian2020structure] RGBD pairs. Note that when doing the ablation study about the effectiveness of PWN and ILNR, we sampled a smaller dataset which is composed of K images from Taskonomy, K images from DIML, and K images from HRWSI. During training, images are withheld from all datasets as a validation set. We use the depth prediction architecture proposed in Xian et al. [xian2020structure], which consists of a standard backbone for feature extraction (e.g., ResNet50 [he2016deep] or ResNeXt101 [xie2017aggregated]), followed by a decoder, and train it using SGD with a batch size of , an initial learning rate for all layers, and a learning rate decay of . Images are resized to × , and flipped horizontally with a chance. Following [yin2020diversedepth], we load data from different datasets evenly for each batch.
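Loading data evenly from several sources can be done in many ways; the following is one minimal sketch, assuming a simple round-robin over dataset names with a random sample drawn from each. The function name, the dataset keys, and the file identifiers are hypothetical placeholders, and the actual sampler used by the authors may differ.

```python
import random
from collections import Counter

def balanced_batch(datasets, batch_size, rng):
    """Build one batch by cycling over datasets and drawing a random sample from each."""
    names = sorted(datasets)
    batch = []
    for i in range(batch_size):
        name = names[i % len(names)]
        batch.append((name, rng.choice(datasets[name])))
    return batch

# Hypothetical sample identifiers for three sources of very different size.
datasets = {
    "lidar": [f"lidar_{i}" for i in range(1000)],
    "synthetic": [f"synthetic_{i}" for i in range(50)],
    "web_stereo": [f"web_{i}" for i in range(10)],
}
batch = balanced_batch(datasets, batch_size=6, rng=random.Random(0))
counts = Counter(name for name, _ in batch)
```

Each source contributes the same number of samples per batch regardless of its size, so a large dataset does not drown out the smaller ones.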

Evaluation details.

The focal length prediction accuracy is evaluated on 2D-3D-S [armeni2017joint] following [hold2018perceptual]. Furthermore, to evaluate the accuracy of the reconstructed 3D shape, we use the Locally Scale Invariant RMSE (LSIV) [chen2020oasis] metric on both OASIS [chen2020oasis] and 2D-3D-S [armeni2017joint]. It is consistent with the previous work [chen2020oasis]. The OASIS [chen2020oasis] dataset only has the ground truth depth on some small regions, while 2D-3D-S has the ground truth for the whole scene.

Recovered Shift 15.9 15.1 17.5 40.3 36.9
Table 2: Effectiveness of recovering the shift from 3D point clouds with the PCM. Compared with the baseline, the AbsRel is much lower after recovering the depth shift over all test sets.
Figure 3: Comparison of recovered focal length on the 2D-3D-S dataset. Left, our method outperforms Hold-Geoffroy  [hold2018perceptual]. Right, we conduct an experiment on the effect of the initialization of field of view (FOV). Our method remains robust across different initial FOVs, with a slight degradation in quality past and .
Figure 4: Qualitative comparison. We compare the reconstructed 3D shape of our method with several baselines. As MiDaS [Ranftl2020] does not estimate the focal length, we use the focal length recovered from [hold2018perceptual] to convert the predicted depth to a point cloud. “Ours-Baseline” does not recover the depth shift or focal length and uses an orthographic camera, while “Ours” recovers the shift and focal length. We can see that our method better reconstructs the 3D shape, especially at edges and planar regions (see arrows).

To evaluate the generalizability of our proposed depth prediction method, we take nine datasets which are unseen during training, including YouTube3D [chen2019learning], OASIS [chen2020oasis], NYU [silberman2012indoor], KITTI [geiger2012we], ScanNet [dai2017scannet], DIODE [vasiljevic2019diode], ETH3D [schops2017multi], Sintel [Butler:ECCV:2012], and iBims-1 [Koch18:ECS]. On OASIS and YouTube3D, we use the Weighted Human Disagreement Rate (WHDR) [xian2018monocular] for evaluation. On other datasets, except for iBims-1, we evaluate the absolute mean relative error (AbsRel) and the percentage of pixels within a relative accuracy threshold. We follow Ranftl et al. [Ranftl2020] and align the scale and shift before evaluation. To evaluate the geometric quality of the depth, i.e. the quality of edges and planes, we follow [Niklaus_TOG_2019, xian2020structure] and evaluate the depth boundary error [Koch18:ECS] as well as the planarity error [Koch18:ECS] on iBims-1. The two planarity error terms evaluate the flatness and orientation of reconstructed 3D planes compared to the ground truth 3D planes, respectively, while the two depth boundary error terms measure the localization accuracy and the sharpness of edges, respectively. More details as well as a comparison of these test datasets are summarized in Tab. 1.
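The align-then-evaluate protocol can be illustrated with a short sketch. It fits a scale and shift by closed-form least squares and then computes AbsRel and a threshold accuracy. The function names are hypothetical, the fit is done directly on depth values for simplicity (the space in which the cited work performs the alignment may differ), and the 1.25 threshold is a commonly used value assumed here.

```python
def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimising sum((s * p + t - g) ** 2)."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pg = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov_pg / var_p
    t = mean_g - s * mean_p
    return s, t

def abs_rel(pred, gt):
    """Mean absolute relative error; gt values must be positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def threshold_accuracy(pred, gt, thresh=1.25):
    """Fraction of pixels whose ratio max(p / g, g / p) is below thresh (assumed value)."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thresh)
    return hits / len(gt)

# Toy prediction that matches the ground truth up to an unknown scale and shift.
pred = [0.1, 0.4, 0.7, 1.0]
gt = [2.0 * p + 1.0 for p in pred]
s, t = align_scale_shift(pred, gt)
aligned = [s * p + t for p in pred]
```

After alignment the toy prediction matches the ground truth, so both metrics report a perfect score; without alignment the same prediction would look poor even though its relative structure is correct.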

4.1 3D Shape Reconstruction

Shift recovery.

To evaluate the effectiveness of our depth shift recovery, we perform zero-shot evaluation on datasets unseen during training. We recover a 3D point cloud by unprojecting the predicted depth map, and then compute the depth shift using our PCM. We then align the unknown scale [bian2019unsupervised, monodepth2] of the original depth and our shifted depth to the ground truth, and evaluate both using the AbsRel error. The results are shown in Tab. 2, where we see that, on all test sets, the AbsRel error is lower after recovering the shift. We also trained a standard 2D convolutional neural network to predict the shift given an image composed of the unprojected point coordinates, but this approach did not generalize well to samples from unseen datasets.

Focal length recovery.

To evaluate the accuracy of our recovered focal length, we follow Hold-Geoffroy  [hold2018perceptual] and compare on the 2D-3D-S dataset, which is unseen during training for both methods. The model of [hold2018perceptual] is trained on the in-the-wild SUN360 [xiao2012recognizing] dataset. Results are illustrated in Fig. 3, where we can see that our method demonstrates better generalization performance. Note that PVCNN is very lightweight and only has parameters, but shows promising generalizability, which could indicate that there is a smaller domain gap between datasets in the 3D point cloud space than in the image space where appearance variation can be large.

Furthermore, we analyze the effect of different initial focal lengths during inference. We set the initial field of view (FOV) from to and evaluate the accuracy of the recovered focal length, Fig. 3 (right). The experimental results demonstrate that our method is not particularly sensitive to different initial focal lengths.

Method OASIS 2D-3D-S
Orthographic Camera Model
MegaDepth [li2018megadepth]
MiDaS [Ranftl2020]
Pinhole Camera Model
MegaDepth [li2018megadepth] + Hold-Geoffroy [hold2018perceptual]
MiDaS [Ranftl2020] + Hold-Geoffroy [hold2018perceptual]
MiDaS [Ranftl2020] + Ours-PCM
Ours 0.52 0.80
Table 3: Quantitative evaluation of the reconstructed 3D shape quality on OASIS and 2D-3D-S. Our method achieves better performance than previous methods. Compared with the orthographic projection, our method using the pinhole camera model obtains better performance. DPM and PCM refer to our depth prediction module and point cloud module respectively.
Figure 5: Qualitative comparisons with state-of-the-art methods, including MegaDepth [li2018megadepth], Xian et al. [xian2020structure], and MiDaS [Ranftl2020]. Our method predicts more accurate depths at far locations and in regions with complex details. In addition, we see that our method generalizes better on in-the-wild scenes.

Evaluation of 3D shape quality.

Following OASIS [chen2020oasis], we use LSIV for the quantitative comparison of recovered 3D shapes on the OASIS [chen2020oasis] dataset and the 2D-3D-S [armeni2017joint] dataset. OASIS only provides the ground truth point cloud on small regions, while 2D-3D-S covers the whole 3D scene. Following OASIS [chen2020oasis], we evaluate the reconstructed 3D shape with two different camera models, i.e. the orthographic projection camera model [chen2020oasis] (infinite focal length) and the (more realistic) pinhole camera model. As MiDaS [Ranftl2020] and MegaDepth [li2018megadepth] do not estimate the focal length, we use the focal length recovered from Hold-Geoffroy [hold2018perceptual] to convert the predicted depth to a point cloud. We also evaluate a baseline using MiDaS instead of our DPM with the focal length predicted by our PCM (“MiDaS + Ours-PCM”). From Tab. 3 we can see that with an orthographic projection, our method (“Ours-DPM”) performs roughly as well as existing state-of-the-art methods. However, for the pinhole camera model our combined method significantly outperforms existing approaches. Furthermore, comparing “MiDaS + Ours-PCM” and “MiDaS + Hold-Geoffroy”, we note that our PCM is able to generalize to different depth prediction methods.

A qualitative comparison of the reconstructed 3D shape on in-the-wild scenes is shown in Fig. 4. It demonstrates that our model can recover more accurate 3D scene shapes. For example, planar structures such as walls, floors, and roads are much flatter in our reconstructed scenes, and the angles between surfaces (walls) are also more realistic. Also, the shape of the car has fewer distortions.

Method iBims-1
Xian [xian2020structure]
MegaDepth [li2018megadepth]
MiDaS [Ranftl2020]
3D Ken Burns [Niklaus_TOG_2019] 5.44
Ours   w/o PWN
Ours Full 1.90 2.0 7.41 0.079
Table 4: Quantitative comparison of the quality of depth boundaries (DBE) and planes (PE) on the iBims-1 dataset. We use to indicate when a method was trained on the small training subset.

4.2 Depth prediction

In this section, we conduct several experiments to demonstrate the effectiveness of our depth prediction method, including a comparison with state-of-the-art methods, a comparison of our proposed image-level normalized regression loss with other methods, and an analysis of the effectiveness of our pair-wise normal regression loss.

Method Backbone OASIS YT3D NYU KITTI DIODE ScanNet ETH3D Sintel Rank
WHDR AbsRel AbsRel AbsRel AbsRel AbsRel AbsRel
OASIS [chen2020oasis] ResNet50
MegaDepth [li2018megadepth] Hourglass
Xian [xian2020structure] ResNet50
WSVD [wang2019web] ResNet50
Chen [chen2019learning] ResNet50
DiverseDepth [yin2020diversedepth] ResNeXt50
MiDaS [Ranftl2020] ResNeXt101
Ours ResNet50 14.3 80.0
Ours ResNeXt101 28.3 19.2 9.0 91.6 27.1 76.6 9.5 91.2 17.1 77.7 31.9 65.9 1.1
Table 5: Quantitative comparison of our depth prediction with state-of-the-art methods on eight zero-shot (unseen during training) datasets. Our method achieves better performance than existing state-of-the-art methods across all test datasets.

Comparison with state-of-the-art methods.

In this comparison, we test on datasets unseen during training. We compare with methods that have been shown to best generalize to in-the-wild scenes. Their results are obtained by running the publicly released code. Each method is trained on its own proposed datasets. When comparing the AbsRel error, we follow Ranftl et al. [Ranftl2020] to align the scale and shift before the evaluation. The results are shown in Tab. 5. Our method outperforms prior works, and using a larger ResNeXt101 backbone further improves the results. Some qualitative comparisons can be found in Fig. 5.

Pair-wise normal loss.

To evaluate the quality of our full method and dataset on edges and planes, we compare our depth model with previous state-of-the-art methods on the iBims-1 dataset. In addition, we evaluate the effect of our proposed pair-wise normal (PWN) loss through an ablation study. As training on our full dataset is computationally demanding, we perform this ablation on the small training subset. The results are shown in Tab. 4. We can see that our full method outperforms prior work for this task. In addition, under the same settings, both edges and planes are improved by adding the PWN loss. We further show a qualitative comparison in Fig. 6.

Figure 6: Qualitative comparison of reconstructed point clouds. Using the pair-wise normal loss (PWN), we can see that edges and planes are better reconstructed (see highlighted regions).

Image-level normalized regression loss.

To show the effectiveness of our proposed image-level normalized regression (ILNR) loss, we compare it with the scale-shift invariant loss (SSMAE) [Ranftl2020] and the scale-invariant multi-scale gradient loss [wang2019web]. Each of these methods is trained on the small training subset to limit the computational overhead, and comparisons are made to datasets that are unseen during training. All models have been trained for epochs, and we have verified that all models fully converged by then. The quantitative comparison is shown in Tab. 6, where we can see an improvement of ILNR over other scale and shift invariant losses. Furthermore, we also analyze different options for normalization, including image-level Min-Max (ILNR-MinMax) normalization and image-level median absolute deviation (ILNR-MAD) normalization, and found that our proposed loss performs a bit better.

5 Discussion


We observed a few limitations of our method. For example, our PCM cannot recover accurate focal length or depth shift when the scene does not have enough geometric cues, e.g. when the whole image is mostly a wall or a sky region. The accuracy of our method will also decrease with images taken from uncommon view angles (e.g., top-down) or extreme focal lengths. More diverse 3D training data may address these failure cases. In addition, our method does not model the effect of radial distortion from the camera and thus the reconstructed scene shape can be distorted in cases with severe radial distortion. Studying how to recover the radial distortion parameters using our PCM can be an interesting future direction.

Method RedWeb NYU KITTI ScanNet DIODE
SMSG [wang2019web]
SSMAE [Ranftl2020]
ILNR 18.7 13.9 16.1 12.3 34.2
Table 6: Quantitative comparison of different losses on zero shot generalization to datasets unseen during training.

6 Conclusion

In summary, we presented, to our knowledge, the first fully data-driven method that reconstructs 3D scene shape from a monocular image. To recover the shift and focal length for 3D reconstruction, we proposed to use point cloud networks trained on datasets with known global depth shifts and focal lengths. This approach showed strong generalization capabilities, and we believe it may be helpful for related depth-based tasks. Extensive experiments demonstrated the effectiveness of our scene shape reconstruction method and its superior ability to generalize to unseen data.


Appendix A Datasets

A.1 Datasets for Training

To train a robust model, we use a variety of data sources, each with its own unique properties:

  • Taskonomy [zamir2018taskonomy] contains high-quality RGBD data captured by a LiDAR scanner. We sampled around K RGBD pairs for training.

  • DIML [kim2018deep] contains calibrated stereo images. We use the GA-Net [Zhang2019GANet] method to compute the disparity for supervision. We sampled around K RGBD pairs for training.

  • 3D Ken Burns [Niklaus_TOG_2019] contains synthetic data with ground truth depth. We sampled around K RGBD pairs for training.

  • Holopix50K [hua2020holopix50k] contains diverse uncalibrated web stereo images. Following [xian2018monocular], we use FlowNet [IMKDB17] to compute the relative depth (inverse depth) data for training.

  • HRWSI [xian2020structure] contains diverse uncalibrated web stereo images. We use the entire dataset, consisting of K RGBD images.

A.2 Datasets Used in Testing

To evaluate the generalizability of our method, we test our depth model on a range of datasets:

  • NYU [silberman2012indoor] consists of mostly indoor RGBD images where the depth is captured by a Kinect sensor. We test our method on the official test set, which contains images.

  • KITTI [geiger2012we] consists of street scenes, with sparse metric depth captured by a LiDAR sensor. We use the standard test set ( images) of the Eigen split.

  • ScanNet [dai2017scannet] contains similar data to NYU, indoor scenes captured by a Kinect. We randomly sampled images from the official validation set for testing.

  • DIODE [vasiljevic2019diode] contains high-quality LiDAR-generated depth maps of both indoor and outdoor scenes. We use the whole validation set ( images) for testing.

  • ETH3D [schops2017multi] consists of outdoor scenes whose depth is captured by a LiDAR sensor. We sampled images from it for testing.

  • Sintel [Butler:ECCV:2012] is a synthetic dataset, mostly with outdoor scenes. We collected images from it for testing.

  • OASIS [chen2020oasis] is a diverse dataset consisting of images in the wild, with ground truth depth annotations by humans. It contains both sparse relative depth labels (similar to DIW [chen2016single]), and some planar regions. We test on the entire validation set, containing K images.

  • YouTube3D [chen2019learning] consists of in-the-wild videos that are reconstructed using structure from motion, with the sparse reconstructed points as supervision. We randomly sampled K images from the whole dataset for testing.

  • RedWeb [xian2018monocular] consists of in-the-wild stereo images, with disparity labels derived from an optical flow matching algorithm. We use K images to evaluate the WHDR error, and we randomly sampled K point pairs on each image.

  • iBims-1 [Koch18:ECS] is an indoor-scene dataset, which consists of high-quality images captured by a LiDAR sensor. We use the whole dataset for evaluating edge and plane quality.

We will release a list of all images used for testing to facilitate reproducibility.
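For the pair-based evaluation on RedWeb, the ordinal error can be sketched as below. This is an unweighted simplification written for illustration: the tolerance `tau`, the function names, and the use of flat pixel indices are assumptions, not values or interfaces taken from this paper.

```python
import numpy as np

def ordinal_labels(values, idx_a, idx_b, tau=0.02):
    """Label each pair +1 (a larger), -1 (b larger), or 0 (roughly equal)."""
    ratio = values[idx_a] / values[idx_b]
    labels = np.zeros(len(ratio), dtype=int)
    labels[ratio > 1.0 + tau] = 1
    labels[ratio < 1.0 / (1.0 + tau)] = -1
    return labels

def pair_disagreement_rate(pred, gt_labels, idx_a, idx_b, tau=0.02):
    """Fraction of sampled pairs whose predicted ordering disagrees with the labels."""
    pred_labels = ordinal_labels(pred, idx_a, idx_b, tau)
    return float(np.mean(pred_labels != gt_labels))
```

Here `pred` is a flattened array of positive predicted values, and `idx_a`, `idx_b` index the two points of each sampled pair.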

Appendix B Details for Depth Prediction Model and Training

We use the depth prediction model proposed by Xian et al. [xian2020structure]. We follow [yin2020diversedepth] and combine the multi-source training data by evenly sampling from all sources per batch. As HRWSI and Holopix50K are both web stereo data, we merge them together. Therefore, there are four different data sources, i.e. high-quality Taskonomy, synthetic 3D Ken Burns, middle-quality DIML, and low-quality Holopix50K and HRWSI. For example, if the batch size is , we sample images from each of the four sources. Furthermore, as the ground truth depth quality varies between data sources, we enforce different losses for them.
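The even per-batch sampling can be sketched as follows. This is a minimal illustration under our own assumptions: the source names, the dictionary layout, and the requirement that the batch size be divisible by the number of sources are choices made for this sketch.

```python
import random

def balanced_batch(sources, batch_size, rng):
    """Draw the same number of samples from every data source.

    `sources` maps a source name to a list of sample identifiers.
    """
    names = sorted(sources)
    if batch_size % len(names) != 0:
        raise ValueError("batch_size must be divisible by the number of sources")
    per_source = batch_size // len(names)
    batch = []
    for name in names:
        # Sample without replacement within one batch.
        picks = rng.sample(sources[name], per_source)
        batch.extend((name, item) for item in picks)
    return batch

# Usage: four sources, with the two web-stereo datasets merged into one.
sources = {
    "taskonomy": list(range(10)),
    "ken_burns": list(range(10)),
    "diml": list(range(10)),
    "web_stereo": list(range(10)),
}
batch = balanced_batch(sources, batch_size=8, rng=random.Random(0))
```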

For the web-stereo data, such as Holopix50K [hua2020holopix50k] and HRWSI [xian2020structure], the inverse depths have unknown scale and shift, so they cannot be used to compute the affine-invariant depth (up to an unknown scale and shift to the metric depth). Consequently, the pixel-wise regression loss and geometry loss cannot be applied to such data. Therefore, during training, we only enforce the ranking loss [xian2018monocular] on them.

For the middle-quality calibrated stereo data, such as DIML [kim2018deep], we enforce the proposed image-level normalized regression loss, multi-scale gradient loss and ranking loss. As the recovered disparities contain much noise in local regions, enforcing the pair-wise normal regression loss on noisy edges will cause many artifacts. Therefore, we enforce the pair-wise normal regression loss only on planar regions for this data.

For the high-quality data, such as Taskonomy [zamir2018taskonomy] and synthetic 3D Ken Burns [Niklaus_TOG_2019], accurate edges and planes can be located. Therefore, we apply the pair-wise normal regression loss, ranking loss, and multi-scale gradient loss for this data.

Furthermore, we follow [liu2019training] and add a light-weight auxiliary path on the decoder. The auxiliary path outputs the inverse depth and the main branch (decoder) outputs the depth. For the auxiliary path, we enforce the ranking loss and the image-level normalized regression loss in the inverse depth space on all data. The network is illustrated in Fig. 7.
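The per-source loss assignment described in this appendix can be summarized as a small lookup table. The loss labels and source names below are shorthand invented for this sketch, and the table simply transcribes the lists given in the text above.

```python
# Losses on the main (depth) branch, keyed by data source.
MAIN_BRANCH_LOSSES = {
    "web_stereo": {"ranking"},                            # Holopix50K + HRWSI
    "diml": {"ilnr", "msg", "ranking", "pwn_planes"},     # PWN on planes only
    "taskonomy": {"pwn", "ranking", "msg"},               # PWN on edges and planes
    "ken_burns": {"pwn", "ranking", "msg"},
}

# Losses on the auxiliary (inverse depth) branch, applied to all data.
AUX_BRANCH_LOSSES = {"ranking", "ilnr_inverse_depth"}

def losses_for(source, branch="main"):
    """Return the set of loss labels enforced for a source and branch."""
    if branch == "aux":
        return set(AUX_BRANCH_LOSSES)
    return set(MAIN_BRANCH_LOSSES[source])
```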

Figure 7: The network architecture for the DPM. The network has two output branches. The decoder outputs the depth map, while the auxiliary path outputs the inverse depth. Different losses are enforced on these two branches.

Appendix C Sampling Strategy for Pairwise Normal Loss

We enforce the pairwise normal regression loss on Taskonomy and DIML data. As DIML is noisier than Taskonomy, we only enforce the normal regression loss on its planar regions, such as pavements and roads, whereas for Taskonomy, we sample points on edges and on planar regions. We use the local least squares fitting method [Yin2019enforcing] to compute the surface normal from the depth map.
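One common way to obtain normals by local least squares fitting is to fit a plane to the 3D points in a small window around each pixel. The sketch below shows that variant; it is an illustration under our own assumptions (window size, total least squares via an eigen-decomposition, normals oriented toward the camera) and may differ in detail from the cited method. The input is an HxWx3 array of points obtained by unprojecting the depth map.

```python
import numpy as np

def normals_from_points(points, k=1):
    """Estimate per-pixel normals by fitting a plane to a (2k+1)x(2k+1) window.

    Border pixels without a full window are left as zero vectors.
    """
    h, w, _ = points.shape
    normals = np.zeros((h, w, 3), dtype=float)
    for i in range(k, h - k):
        for j in range(k, w - k):
            patch = points[i - k:i + k + 1, j - k:j + k + 1].reshape(-1, 3)
            centered = patch - patch.mean(axis=0)
            # The eigenvector with the smallest eigenvalue is the plane normal.
            _, vecs = np.linalg.eigh(centered.T @ centered)
            n = vecs[:, 0]
            # Orient the normal toward the camera at the origin.
            if n @ points[i, j] > 0:
                n = -n
            normals[i, j] = n
    return normals
```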

For edges, we follow the method of Xian et al. [xian2020structure], which we describe here. The first step is to locate image edges. At each edge point, we then sample pairs of points on both sides of the edge, i.e. . The ground truth normals for these points are , while the predicted normals are . To locate the object boundaries and plane folds, where the normals change significantly, we require the angle difference between the two normals to be greater than . To balance the samples, we also collect some negative samples, which are also detected as edges but whose angle difference is smaller than . The sampling method is illustrated as follows.
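The angle-based split of edge pairs into positive and negative samples can be sketched as below. The two thresholds are left as parameters because their values are not reproduced here; the function names and the array layout (one row per sampled pair) are assumptions for this sketch.

```python
import numpy as np

def normal_angles(n_a, n_b):
    """Angle in radians between corresponding rows of two Nx3 normal arrays."""
    n_a = n_a / np.linalg.norm(n_a, axis=1, keepdims=True)
    n_b = n_b / np.linalg.norm(n_b, axis=1, keepdims=True)
    cos = np.clip(np.sum(n_a * n_b, axis=1), -1.0, 1.0)
    return np.arccos(cos)

def split_edge_pairs(n_a, n_b, pos_thresh, neg_thresh):
    """Return indices of positive pairs (large angle) and negative pairs (small angle)."""
    angles = normal_angles(n_a, n_b)
    positives = np.flatnonzero(angles > pos_thresh)
    negatives = np.flatnonzero(angles < neg_thresh)
    return positives, negatives
```

Pairs whose angle falls between the two thresholds are discarded in this sketch.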


For planes, on DIML, we use DeepLabv3+ [deeplabv3plus2018] to segment the roads, which we assume to be planar regions. On Taskonomy, we locate planes by finding regions with the same normal. On each detected plane, we sample paired points. Finally, we combine both sets of paired points and enforce the normal regression loss on them, see Eq. in our main paper.

Appendix D Illustration of the Reconstructed Point Cloud

We illustrate some examples of the 3D point clouds reconstructed by our proposed approach in Fig. 8. All these data are unseen during training. This shows that our method generalizes well to in-the-wild scenes and can recover realistic shapes for a wide range of scenes.

Figure 8: Point Cloud Illustration. The first column shows the input images. The remaining columns show the point clouds recovered by our proposed approach, viewed from the left, right, and top, respectively.

Appendix E Illustration of Depth Prediction In the Wild

We illustrate examples of our single image depth prediction results in Fig. 9. The images are randomly sampled from DIW and OASIS, which are unseen during training. On these diverse scenes, our method predicts reasonably accurate depth maps, in terms of global structure and local details.

Figure 9: Examples of depths on in-the-wild scenes. Purple indicates closer regions whereas red indicates farther regions.


This work was in part supported by ARC DP Project “Deep learning that scales”.