1 Introduction
3D scene reconstruction is a fundamental task in computer vision. The established approach to this task is SLAM or SfM [hartley2003multiple], which reconstructs 3D scenes from feature-point correspondences across consecutive frames or multiple views. In contrast, this work aims to achieve dense 3D scene shape reconstruction from a single in-the-wild image. Without multiple views available, we rely on monocular depth estimation. However, as shown in the first-page figure, existing monocular depth estimation methods [eigen2014depth, wang2020sdc, Yin2019enforcing] alone are unable to faithfully recover an accurate 3D point cloud.
Unlike multi-view reconstruction methods, monocular depth estimation requires leveraging high-level scene priors, so data-driven approaches have become the de facto solution to this problem [li2018megadepth, Ranftl2020, wang2019web, yin2020diversedepth]. Recent works have shown promising results by training deep neural networks on diverse in-the-wild data, web stereo images and stereo videos [chen2016single, chen2020oasis, Ranftl2020, wang2019web, xian2018monocular, xian2020structure, yin2020diversedepth]. However, the diversity of the training data also poses challenges for model training, as training data captured by different cameras can exhibit significantly different image priors for depth estimation [facil2019cam]. Moreover, web stereo images and videos can only provide depth supervision up to a scale and shift due to the unknown camera baselines and stereoscopic post-processing [lang2010nonlinear]. As a result, state-of-the-art in-the-wild monocular depth models use various types of losses that are invariant to scale and shift during training. While an unknown scale in depth will not cause any shape distortion, as it scales the 3D scene uniformly, an unknown depth shift will (see Sec. 3.1 and the first-page figure). In addition, the camera focal length of a given image may not be accessible at test time, leading to more distortion of the 3D scene shape. This scene shape distortion is a critical problem for downstream tasks such as 3D view synthesis and 3D photography.
To address these challenges, we propose a novel monocular scene shape estimation framework that consists of a depth prediction module and a point cloud reconstruction module. The depth prediction module is a convolutional neural network trained on a mixture of existing datasets that predicts depth maps up to a scale and shift. The point cloud reconstruction module leverages point cloud encoder networks that predict shift and focal length adjustment factors from an initial guess of the scene point cloud reconstruction. A key observation that we make here is that, when operating on point clouds derived from depth maps, and not on images themselves, we can train models to learn 3D scene shape priors using synthetic 3D data or data acquired by 3D laser scanning devices. The domain gap is significantly less of an issue for point clouds than for images, although these data sources are significantly less diverse than internet images. We empirically show that these point cloud encoders generalize well to unseen datasets.
Furthermore, to train a robust monocular depth prediction model on mixed data from multiple sources, we propose a simple but effective image-level normalized regression loss and a pairwise surface normal regression loss. The former transforms the depth data to a canonical scale-shift-invariant space for more robust training, while the latter improves the geometry of our predicted depth maps. To summarize, our main contributions are:


A novel framework for in-the-wild monocular 3D scene shape estimation. To the best of our knowledge, this is the first fully data-driven method for this task, and the first method to leverage 3D point cloud neural networks for improving the structure of point clouds derived from depth maps.

An image-level normalized regression loss and a pairwise surface normal regression loss for improving monocular depth estimation models trained on mixed multi-source datasets.
Experiments show that our point cloud reconstruction module can recover accurate 3D shape from a single image, and that our depth prediction module achieves state-of-the-art results on zero-shot dataset transfer to unseen datasets.
2 Related Work
Monocular depth estimation in the wild.
This task has recently seen impressive progress [chen2016single, chen2019learning, chen2020oasis, li2018megadepth, wang2019web, wang2020foresee, xian2018monocular, xian2020structure, yin2020diversedepth]. The key properties of such approaches are what data can be used for training, and what objective function makes sense for that data. When metric depth supervision is available, networks can be trained to directly regress these depths [eigen2014depth, liu2015learning, Yin2019enforcing]. However, obtaining metric ground truth depth for diverse datasets is challenging. As an alternative, Chen et al. [chen2016single] collect diverse relative depth annotations for internet images, while other approaches propose to scrape stereo images or videos from the internet [Ranftl2020, wang2019web, xian2018monocular, xian2020structure, yin2020diversedepth]. Such diverse data is important for generalizability, but as the metric depth is not available, direct depth regression losses cannot be used. Instead, these methods rely either on ranking losses which evaluate relative depth [chen2016single, xian2018monocular, xian2020structure] or on scale- and shift-invariant losses [Ranftl2020, wang2019web] for supervision. The latter methods produce especially robust depth predictions, but as the camera model is unknown and an unknown shift resides in the depth, the 3D shape cannot be reconstructed from the predicted depth maps. In this paper, we aim to reconstruct the 3D shape from a single image in the wild.
3D reconstruction from a single image.
A number of works have addressed reconstructing different types of objects from a single image [barron2014shape, wang2018pixel2mesh, wu2018learning], such as humans [saito2019pifu, saito2020pifuhd], cars, planes, tables, etc. The main challenge is how to best recover object details, and how to represent them with limited memory. Pixel2Mesh [wang2018pixel2mesh] proposes to reconstruct the 3D shape from a single image and express it as a triangular mesh. PIFu [saito2019pifu, saito2020pifuhd] proposes a memory-efficient implicit function to recover high-resolution surfaces of humans, including unseen/occluded regions. However, all these methods rely on learning priors specific to a certain object class or instance, typically from 3D supervision, and therefore cannot work for full scene reconstruction.
On the other hand, several works have proposed reconstructing 3D scene structure from a single image. Saxena et al. [saxena2008make3d] assume that the whole scene can be segmented into several pieces, each of which can be regarded as a small plane. They predict the orientation and the location of the planes and stitch them together to represent the scene. Other works propose to use image cues, such as shading [prados2005shape] and contour edges [karpenko2006smoothsketch], for scene reconstruction. However, these approaches use hand-designed priors and restrictive assumptions about the scene geometry. Our method is fully data-driven, and can be applied to a wide range of scenes.
Camera intrinsic parameter estimation.
Recovering a camera’s focal length is an important part of 3D scene understanding. Traditional methods utilize reference objects such as planar calibration grids [zhang2000flexible] or vanishing points [deutscher2002automatic], which can then be used to estimate a focal length. Other methods [hold2018perceptual, workman2015deepfocal] propose a data-driven approach where a CNN recovers the focal length on in-the-wild data directly from an image. In contrast, our point cloud module estimates the focal length directly in 3D, which we argue is an easier task than operating on natural images directly.
3 Method
Our two-stage single-image 3D shape estimation pipeline is illustrated in Fig. 1. It is composed of a depth prediction module (DPM) and a point cloud module (PCM). The two modules are trained separately on different data sources, and are then combined at inference time. The DPM takes an RGB image and outputs a depth map [yin2020diversedepth] with unknown scale and shift in relation to the true metric depth map. The PCM takes as input a distorted 3D point cloud, computed using a predicted depth map and an initial estimation of the focal length, and outputs shift adjustments to the depth map and focal length to improve the geometry of the reconstructed 3D scene shape.
3.1 Point Cloud Module
We assume a pinhole camera model for the 3D point cloud reconstruction, which means that the unprojection from 2D coordinates and depth to 3D points is:
$$x = \frac{u - u_0}{f}\, d, \qquad y = \frac{v - v_0}{f}\, d, \qquad z = d \qquad (1)$$
where $(u_0, v_0)$ is the camera optical center, $f$ is the focal length, and $d$ is the depth. The focal length affects the point cloud shape as it scales the $x$ and $y$ coordinates, but not $z$. Similarly, a shift of $d$ will affect the $x$, $y$, and $z$ coordinates non-uniformly, which will result in shape distortions.
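The effect of an unknown depth scale versus an unknown depth shift under the pinhole model can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function name `unproject`, the default optical center at the image center, and the toy depth ramp are all assumptions made for illustration.

```python
import numpy as np

def unproject(depth, f, u0=None, v0=None):
    """Unproject an (H, W) depth map into an (H*W, 3) point cloud (pinhole model)."""
    h, w = depth.shape
    # Assumption: the optical center defaults to the image center.
    u0 = (w - 1) / 2.0 if u0 is None else u0
    v0 = (h - 1) / 2.0 if v0 is None else v0
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - u0) / f * depth
    y = (v - v0) / f * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Toy depth map: a ramp, so that depth varies across the image.
depth = np.tile(np.linspace(1.0, 3.0, 8), (6, 1))
base = unproject(depth, f=10.0)
scaled = unproject(2.0 * depth, f=10.0)    # unknown scale
shifted = unproject(depth + 1.0, f=10.0)   # unknown shift

# A depth scale multiplies every point by the same factor (no distortion).
print(np.allclose(scaled, 2.0 * base))
# A depth shift rescales each point by (d + shift) / d, which varies with d.
ratio = shifted[:, 2] / base[:, 2]
print(ratio.min(), ratio.max())
```

Because the per-point factor depends on the depth of that point, a shifted depth map cannot be turned back into the true shape by any single global rescaling.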
For a human observer, these distortions are immediately recognizable when viewing the point cloud at an oblique angle (Fig. 2), although they cannot be observed looking at a depth map alone. As a result, we propose to directly analyze the point cloud to determine the unknown shift and focal length parameters. We tried a number of network architectures that take unstructured 3D point clouds as input, and found that the recent PVCNN [liu2019pvcnn] performed well for this task, so we use it in all experiments here.
During training, a perturbed input point cloud with incorrect shift and focal length is synthesized by perturbing the known ground truth depth shift and focal length. The ground truth depth is transformed by a shift drawn from , and the ground truth focal length is transformed by a scale drawn from to keep the focal length positive and nonzero.
When recovering the depth shift, the perturbed 3D point cloud is given as input to the shift point cloud network, trained with the objective:
(2) 
where are network weights and is the true focal length.
Similarly, when recovering the focal length, the point cloud is fed to the focal length point cloud network, trained with the objective:
(3) 
During inference, the ground truth depth is replaced with the predicted affine-invariant depth , which is normalized to prior to the 3D reconstruction. We use an initial guess of focal length , giving us the reconstructed point cloud , which is fed to and to predict the shift and focal length scaling factor respectively. In our experiments we simply use an initial focal length with a field of view (FOV) of . We have also tried to employ a single network to predict both the shift and the scaling factor, but have empirically found that two separate networks achieve better performance.
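Specifying the initial focal length through a field of view relies on the standard pinhole relation between the two. The sketch below is only illustrative: it assumes a horizontal FOV measured across the image width, and the 90 degree and 640 pixel values are hypothetical examples, not the settings used in the experiments.

```python
import math

def focal_from_fov(fov_deg, image_width):
    """Focal length in pixels for a horizontal field of view given in degrees."""
    return 0.5 * image_width / math.tan(math.radians(fov_deg) / 2.0)

def fov_from_focal(focal, image_width):
    """Inverse mapping: horizontal field of view in degrees for a focal length in pixels."""
    return math.degrees(2.0 * math.atan(0.5 * image_width / focal))

# Hypothetical example values, chosen only to make the arithmetic easy to follow.
f_init = focal_from_fov(90.0, 640)
print(f_init, fov_from_focal(f_init, 640))
```

A wider FOV corresponds to a shorter focal length, so an initial guess expressed as a FOV is independent of the image resolution.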
3.2 Monocular Depth Prediction Module
We train our depth prediction on multiple data sources including high-quality LiDAR sensor data [zamir2018taskonomy], and low-quality web stereo data [Ranftl2020, wang2019web, xian2020structure] (see Sec. 4). As these datasets have varied depth ranges and web stereo datasets contain unknown depth scale and shift, we propose an image-level normalized regression (ILNR) loss to address this issue. Moreover, we propose a pairwise normal regression (PWN) loss to improve local geometric features.
Image-level normalized regression loss.
Depth maps of different data sources can have varied depth ranges. Therefore, they need to be normalized to make the model training easier. Simple Min-Max normalization [garcia2015data, singh2019investigating] is sensitive to depth value outliers. For example, a large value at a single pixel will affect the rest of the depth map after the Min-Max normalization. We investigate more robust normalization methods and propose a simple but effective image-level normalized regression loss for mixed-data training.
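The outlier sensitivity of Min-Max normalization can be reproduced with a small NumPy experiment. This is a sketch under stated assumptions rather than the ILNR loss itself: the helper names, the synthetic depth values, and the trim fraction of 0.1 are all hypothetical, and the trimmed Z-score is included only as an example of a normalization built on robust statistics.

```python
import numpy as np

def minmax_normalize(depth):
    """Map the depth map to [0, 1] using its global minimum and maximum."""
    return (depth - depth.min()) / (depth.max() - depth.min())

def trimmed_zscore(depth, trim=0.1):
    """Z-score using the mean/std of the depth values left after trimming both tails."""
    flat = np.sort(depth.ravel())
    k = int(trim * flat.size)  # number of values dropped at each end (assumed fraction)
    core = flat[k:flat.size - k]
    return (depth - core.mean()) / core.std()

rng = np.random.default_rng(0)
clean = rng.uniform(1.0, 5.0, size=(32, 32))
noisy = clean.copy()
noisy[0, 0] = 1000.0  # a single outlier pixel

inliers = (slice(1, None), slice(None))  # every row except the one with the outlier
# Min-Max squeezes all inliers into a tiny range near zero.
print(minmax_normalize(noisy)[inliers].max())
# The trimmed statistics barely move, so the inliers keep almost the same values.
print(np.abs(trimmed_zscore(noisy) - trimmed_zscore(clean))[inliers].max())
```

In this toy setting a single bad pixel compresses the useful depth range under Min-Max normalization, whereas the trimmed statistics leave the remaining pixels essentially unchanged.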
Our image-level normalized regression loss transforms each ground truth depth map to a similar numerical range based on its individual statistics. To reduce the effect of outliers and long-tail residuals, we combine tanh normalization [singh2019investigating] with a trimmed Z-score, after which we can simply apply a pixel-wise mean absolute error (MAE) between the prediction and the normalized ground truth depth maps. The ILNR loss is formally defined as follows.
where and and are the mean and the standard deviation of a trimmed depth map which has the nearest and farthest of pixels removed, is the predicted depth, and is the ground truth depth map. We have tested a number of other normalization methods such as Min-Max normalization [singh2019investigating], Z-score normalization [fukunaga2013introduction], and median absolute deviation normalization (MAD) [singh2019investigating]. In our experiments, we found that our proposed ILNR loss achieves the best performance.
Pairwise normal loss.
Normals are an important geometric property, which have been shown to be a complementary modality to depth [silberman2012indoor]. Many methods have been proposed to use normal constraints to improve the depth quality, such as the virtual normal loss [Yin2019enforcing]. However, as the virtual normal only leverages global structure, it cannot help improve the local geometric quality, such as depth edges and planes. Recently, Xian et al. [xian2020structure] proposed a structure-guided ranking loss, which can improve edge sharpness. Inspired by these methods, we follow their sampling method but enforce the supervision in surface normal space. Moreover, our samples include not only edges but also planes. Our proposed pairwise normal (PWN) loss can better constrain both the global and local geometric relations.
The surface normal is obtained from the reconstructed 3D point cloud by local least squares fitting [Yin2019enforcing]. Before calculating the predicted surface normal, we align the predicted depth and the ground truth depth with a scale and shift factor, which are retrieved by least squares fitting [Ranftl2020]. From the surface normal map, the planar regions where normals are almost the same and edges where normals change significantly can be easily located. Then, we follow [xian2020structure] and sample paired points on both sides of these edges. If planar regions can be found, paired points will also be sampled on the same plane. In doing so, we sample K paired points per training sample on average. In addition, to improve the global geometric quality, we also randomly sample paired points globally. The sampled points are , while their corresponding normals are . The PWN loss is:
(4) 
where denotes ground truth surface normals. As this loss accounts for both local and global geometry, we find that it improves the overall reconstructed shape.
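As an illustration of supervising point pairs in surface normal space, the sketch below compares the dot product of the two predicted normals in each pair with the dot product of the corresponding ground truth normals. This specific form is an assumption made for the example, since the equation is not reproduced in this text; the function name and the toy normals are hypothetical as well.

```python
import numpy as np

def pairwise_normal_loss(pred_a, pred_b, gt_a, gt_b):
    """Mean absolute difference between predicted and ground truth normal agreement.

    All inputs are (N, 3) arrays of unit normals for the sampled point pairs (A_i, B_i).
    Assumed form: |n_A . n_B - n*_A . n*_B| averaged over the pairs.
    """
    pred_dot = np.sum(pred_a * pred_b, axis=-1)
    gt_dot = np.sum(gt_a * gt_b, axis=-1)
    return float(np.mean(np.abs(pred_dot - gt_dot)))

# Ground truth: one pair straddles a 90 degree edge, one pair lies on a plane.
gt_a = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
gt_b = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# Prediction that smooths the edge away: both pairs look planar.
flat_a = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
flat_b = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

print(pairwise_normal_loss(gt_a, gt_b, gt_a, gt_b))      # perfect prediction
print(pairwise_normal_loss(flat_a, flat_b, gt_a, gt_b))  # edge pair is penalized
```

Pairs sampled across an edge penalize over-smoothed boundaries, while pairs sampled on the same plane penalize bumpy planar regions.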
Finally, we also use a multi-scale gradient loss [li2018megadepth]:
(5) 
The overall loss function is formally defined as follows.
(6) 
where and in all experiments.
4 Experiments
Dataset  # Img  Scene Type  Evaluation Metric  Supervision Type
NYU  Indoor  AbsRel &  Kinect
ScanNet  Indoor  AbsRel &  Kinect
2D3DS  Indoor  LSIV  LiDAR
iBims1  Indoor  LiDAR
KITTI  Outdoor  AbsRel &  LiDAR
Sintel  Outdoor  AbsRel &  Synthetic
ETH3D  Outdoor  AbsRel &  LiDAR
YouTube3D  In the Wild  WHDR  SfM, Ordinal pairs
OASIS  In the Wild
DIODE  LiDAR
Datasets and implementation details.
To train the PCM, we sampled K Kinect-captured depth maps from ScanNet, K LiDAR-captured depth maps from Taskonomy, and K synthetic depth maps from the 3D Ken Burns paper [Niklaus_TOG_2019]. We train the network using SGD with a batch size of , an initial learning rate of , and a learning rate decay of . For parameters specific to PVCNN, such as the voxel size, we follow the original work [liu2019pvcnn].
To train the DPM, we sampled K RGBD pairs from LiDAR-captured Taskonomy [zamir2018taskonomy], K synthetic RGBD pairs from the 3D Ken Burns paper [Niklaus_TOG_2019], K RGBD pairs from calibrated stereo DIML [kim2018deep], K RGBD pairs from web-stereo Holopix50K [hua2020holopix50k], and K web-stereo HRWSI [xian2020structure] RGBD pairs. Note that for the ablation study on the effectiveness of PWN and ILNR, we sampled a smaller dataset composed of K images from Taskonomy, K images from DIML, and K images from HRWSI. During training, images are withheld from all datasets as a validation set. We use the depth prediction architecture proposed by Xian et al. [xian2020structure], which consists of a standard backbone for feature extraction (e.g., ResNet50 [he2016deep] or ResNeXt101 [xie2017aggregated]), followed by a decoder, and train it using SGD with a batch size of , an initial learning rate for all layers, and a learning rate decay of . Images are resized to × , and flipped horizontally with a chance. Following [yin2020diversedepth], we load data from different datasets evenly for each batch.
Evaluation details.
The focal length prediction accuracy is evaluated on 2D3DS [armeni2017joint] following [hold2018perceptual]. Furthermore, to evaluate the accuracy of the reconstructed 3D shape, we use the Locally Scale Invariant RMSE (LSIV) [chen2020oasis] metric on both OASIS [chen2020oasis] and 2D3DS [armeni2017joint]. It is consistent with the previous work [chen2020oasis]. The OASIS [chen2020oasis] dataset only has the ground truth depth on some small regions, while 2D3DS has the ground truth for the whole scene.
Method  ETH3D  NYU  KITTI  Sintel  DIODE 

AbsRel  
Baseline  
Recovered Shift  15.9  15.1  17.5  40.3  36.9 
To evaluate the generalizability of our proposed depth prediction method, we take datasets which are unseen during training, including YouTube3D [chen2019learning], OASIS [chen2020oasis], NYU [silberman2012indoor], KITTI [geiger2012we], ScanNet [dai2017scannet], DIODE [vasiljevic2019diode], ETH3D [schops2017multi], Sintel [Butler:ECCV:2012], and iBims1 [Koch18:ECS]. On OASIS and YouTube3D, we use the Weighted Human Disagreement Rate (WHDR) [xian2018monocular] for evaluation. On other datasets, except for iBims1, we evaluate the absolute mean relative error (AbsRel) and the percentage of pixels with . We follow Ranftl et al. [Ranftl2020] and align the scale and shift before evaluation. To evaluate the geometric quality of the depth, i.e. the quality of edges and planes, we follow [Niklaus_TOG_2019, xian2020structure] and evaluate the depth boundary error [Koch18:ECS] as well as the planarity error [Koch18:ECS] on iBims1. The planarity errors evaluate the flatness and orientation of reconstructed 3D planes compared to the ground truth 3D planes, while the depth boundary errors measure the localization accuracy and the sharpness of edges. More details as well as a comparison of these test datasets are summarized in Tab. 1.
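The evaluation protocol of aligning scale and shift before computing AbsRel can be sketched with a closed-form least squares fit. The code below is an illustrative approximation rather than the evaluation script used for the reported numbers: the function names are hypothetical, the alignment is done directly in depth space, and the threshold of 1.25 is the value commonly used for this accuracy metric in the depth estimation literature, taken here as an assumption.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least squares fit of scale * pred + shift to gt, applied to pred."""
    p, g = pred.ravel(), gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, g, rcond=None)
    return scale * pred + shift

def abs_rel(pred, gt):
    """Absolute mean relative error."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred / gt, gt / pred) below the threshold (assumed 1.25)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(16, 16))
pred = 0.5 * gt - 1.5  # correct up to an unknown scale and shift

aligned = align_scale_shift(pred, gt)
print(abs_rel(pred, gt), abs_rel(aligned, gt), delta_accuracy(aligned, gt))
```

Without the alignment step, an affine-invariant prediction would be heavily penalized even when its structure is correct, which is why the fit is performed per image before the metrics are computed.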
4.1 3D Shape Reconstruction
Shift recovery.
To evaluate the effectiveness of our depth shift recovery, we perform zero-shot evaluation on datasets unseen during training. We recover a 3D point cloud by unprojecting the predicted depth map, and then compute the depth shift using our PCM. We then align the unknown scale [bian2019unsupervised, monodepth2] of the original depth and our shifted depth to the ground truth, and evaluate both using the AbsRel error. The results are shown in Tab. 2, where we see that, on all test sets, the AbsRel error is lower after recovering the shift. We also trained a standard 2D convolutional neural network to predict the shift given an image composed of the unprojected point coordinates, but this approach did not generalize well to samples from unseen datasets.
Focal length recovery.
To evaluate the accuracy of our recovered focal length, we follow Hold-Geoffroy et al. [hold2018perceptual] and compare on the 2D3DS dataset, which is unseen during training for both methods. The model of [hold2018perceptual] is trained on the in-the-wild SUN360 [xiao2012recognizing] dataset. Results are illustrated in Fig. 3, where we can see that our method demonstrates better generalization performance. Note that PVCNN is very lightweight and only has parameters, but shows promising generalizability, which could indicate that there is a smaller domain gap between datasets in the 3D point cloud space than in the image space where appearance variation can be large.
Furthermore, we analyze the effect of different initial focal lengths during inference. We set the initial field of view (FOV) from to and evaluate the accuracy of the recovered focal length, Fig. 3 (right). The experimental results demonstrate that our method is not particularly sensitive to different initial focal lengths.
Method  OASIS  2D3DS 

LSIV  LSIV  
Orthographic Camera Model  
MegaDepth [li2018megadepth]  
MiDaS [Ranftl2020]  
OursDPM  
Pinhole Camera Model  
MegaDepth [li2018megadepth] + Hold-Geoffroy [hold2018perceptual]  
MiDaS [Ranftl2020] + Hold-Geoffroy [hold2018perceptual]  
MiDaS [Ranftl2020] + OursPCM  
Ours  0.52  0.80 
Evaluation of 3D shape quality.
Following OASIS [chen2020oasis], we use LSIV for the quantitative comparison of recovered 3D shapes on the OASIS [chen2020oasis] dataset and the 2D3DS [armeni2017joint] dataset. OASIS only provides the ground truth point cloud on small regions, while 2D3DS covers the whole 3D scene. Following OASIS [chen2020oasis], we evaluate the reconstructed 3D shape with two different camera models, i.e. the orthographic projection camera model [chen2020oasis] (infinite focal length) and the (more realistic) pinhole camera model. As MiDaS [Ranftl2020] and MegaDepth [li2018megadepth] do not estimate the focal length, we use the focal length recovered by Hold-Geoffroy et al. [hold2018perceptual] to convert the predicted depth to a point cloud. We also evaluate a baseline using MiDaS instead of our DPM with the focal length predicted by our PCM (“MiDaS + OursPCM”). From Tab. 3 we can see that with an orthographic projection, our method (“OursDPM”) performs roughly as well as existing state-of-the-art methods. However, for the pinhole camera model our combined method significantly outperforms existing approaches. Furthermore, comparing “MiDaS + OursPCM” and “MiDaS + Hold-Geoffroy”, we note that our PCM is able to generalize to different depth prediction methods.
A qualitative comparison of the reconstructed 3D shape on in-the-wild scenes is shown in Fig. 4. It demonstrates that our model can recover more accurate 3D scene shapes. For example, planar structures such as walls, floors, and roads are much flatter in our reconstructed scenes, and the angles between surfaces (walls) are also more realistic. Also, the shape of the car has fewer distortions.
Method  iBims1  
AbsRel  
Xian [xian2020structure]  
MegaDepth [li2018megadepth]  
MiDaS [Ranftl2020]  
3D Ken Burns [Niklaus_TOG_2019]  5.44  
Ours^{†} w/o PWN  
Ours^{†}  
Ours Full  1.90  2.0  7.41  0.079 
4.2 Depth prediction
In this section, we conduct several experiments to demonstrate the effectiveness of our depth prediction method, including a comparison with state-of-the-art methods, a comparison of our proposed image-level normalized regression loss with other methods, and an analysis of the effectiveness of our pairwise normal regression loss.
Method  Backbone  OASIS  YT3D  NYU  KITTI  DIODE  ScanNet  ETH3D  Sintel  Rank  
WHDR  AbsRel  AbsRel  AbsRel  AbsRel  AbsRel  AbsRel  
OASIS [chen2020oasis]  ResNet50  
MegaDepth [li2018megadepth]  Hourglass  
Xian [xian2020structure]  ResNet50  
WSVD [wang2019web]  ResNet50  
Chen [chen2019learning]  ResNet50  
DiverseDepth [yin2020diversedepth]  ResNeXt50  
MiDaS [Ranftl2020]  ResNeXt101  
Ours  ResNet50  14.3  80.0  
Ours  ResNeXt101  28.3  19.2  9.0  91.6  27.1  76.6  9.5  91.2  17.1  77.7  31.9  65.9  1.1 
Comparison with state-of-the-art methods.
In this comparison, we test on datasets unseen during training. We compare with methods that have been shown to best generalize to in-the-wild scenes. Their results are obtained by running the publicly released code. Each method is trained on its own proposed datasets. When comparing the AbsRel error, we follow Ranftl et al. [Ranftl2020] and align the scale and shift before the evaluation. The results are shown in Tab. 5. Our method outperforms prior works, and using a larger ResNeXt101 backbone further improves the results. Some qualitative comparisons can be found in Fig. 5.
Pairwise normal loss.
To evaluate the quality of our full method and dataset on edges and planes, we compare our depth model with previous state-of-the-art methods on the iBims1 dataset. In addition, we evaluate the effect of our proposed pairwise normal (PWN) loss through an ablation study. As training on our full dataset is computationally demanding, we perform this ablation on the small training subset. The results are shown in Tab. 4. We can see that our full method outperforms prior work for this task. In addition, under the same settings, both edges and planes are improved by adding the PWN loss. We further show a qualitative comparison in Fig. 6.
Image-level normalized regression loss.
To show the effectiveness of our proposed image-level normalized regression (ILNR) loss, we compare it with the scale-shift invariant loss (SSMAE) [Ranftl2020] and the scale-invariant multi-scale gradient loss [wang2019web]. Each of these methods is trained on the small training subset to limit the computational overhead, and comparisons are made on datasets that are unseen during training. All models have been trained for epochs, and we have verified that all models fully converged by then. The quantitative comparison is shown in Tab. 6, where we can see an improvement of ILNR over other scale and shift invariant losses. Furthermore, we also analyze different options for normalization, including image-level Min-Max (ILNRMinMax) normalization and image-level median absolute deviation (ILNRMAD) normalization, and found that our proposed loss performs slightly better.
5 Discussion
Limitations.
We observed a few limitations of our method. For example, our PCM cannot recover accurate focal length or depth shift when the scene does not have enough geometric cues, e.g. when the whole image is mostly a wall or a sky region. The accuracy of our method will also decrease with images taken from uncommon view angles (e.g., top-down) or extreme focal lengths. More diverse 3D training data may address these failure cases. In addition, our method does not model the effect of radial distortion from the camera and thus the reconstructed scene shape can be distorted in cases with severe radial distortion. Studying how to recover the radial distortion parameters using our PCM can be an interesting future direction.
Method  RedWeb  NYU  KITTI  ScanNet  DIODE 

WHDR  AbsRel  
SMSG [wang2019web]  
SSMAE [Ranftl2020]  
ILNRMinMax  
ILNRMAD  
ILNR  18.7  13.9  16.1  12.3  34.2 
6 Conclusion
In summary, we presented, to our knowledge, the first fully data-driven method that reconstructs 3D scene shape from a monocular image. To recover the shift and focal length for 3D reconstruction, we proposed to use point cloud networks trained on datasets with known global depth shifts and focal lengths. This approach showed strong generalization capabilities and we believe that it may be helpful for related depth-based tasks. Extensive experiments demonstrated the effectiveness of our scene shape reconstruction method and its superior ability to generalize to unseen data.
Appendix
Appendix A Datasets
A.1 Datasets for Training
To train a robust model, we use a variety of data sources, each with its own unique properties:

Taskonomy [zamir2018taskonomy] contains high-quality RGBD data captured by a LiDAR scanner. We sampled around K RGBD pairs for training.

DIML [kim2018deep] contains calibrated stereo images. We use the GANet [Zhang2019GANet] method to compute the disparity for supervision. We sampled around K RGBD pairs for training.

3D Ken Burns [Niklaus_TOG_2019] contains synthetic data with ground truth depth. We sampled around K RGBD pairs for training.

Holopix50K [hua2020holopix50k] contains diverse uncalibrated web stereo images. Following [xian2018monocular], we use FlowNet [IMKDB17] to compute the relative depth (inverse depth) data for training.

HRWSI [xian2020structure] contains diverse uncalibrated web stereo images. We use the entire dataset, consisting of K RGBD images.
A.2 Datasets Used in Testing
To evaluate the generalizability of our method, we test our depth model on a range of datasets:

NYU [silberman2012indoor] consists of mostly indoor RGBD images where the depth is captured by a Kinect sensor. We test our method on the official test set, which contains images.

KITTI [geiger2012we] consists of street scenes, with sparse metric depth captured by a LiDAR sensor. We use the standard test set ( images) of the Eigen split.

ScanNet [dai2017scannet] contains similar data to NYU, indoor scenes captured by a Kinect. We randomly sampled images from the official validation set for testing.

DIODE [vasiljevic2019diode] contains high-quality LiDAR-generated depth maps of both indoor and outdoor scenes. We use the whole validation set ( images) for testing.

ETH3D [schops2017multi] consists of outdoor scenes whose depth is captured by a LiDAR sensor. We sampled images from it for testing.

Sintel [Butler:ECCV:2012] is a synthetic dataset, mostly with outdoor scenes. We collected images from it for testing.

OASIS [chen2020oasis] is a diverse dataset consisting of images in the wild, with ground truth depth annotations by humans. It contains both sparse relative depth labels (similar to DIW [chen2016single]), and some planar regions. We test on the entire validation set, containing K images.

YouTube3D [chen2019learning] consists of in-the-wild videos that are reconstructed using structure from motion, with the sparse reconstructed points as supervision. We randomly sampled K images from the whole dataset for testing.

RedWeb [xian2018monocular] consists of in-the-wild stereo images, with disparity labels derived from an optical flow matching algorithm. We use K images to evaluate the WHDR error, and we randomly sampled K point pairs on each image.

iBims1 [Koch18:ECS] is an indoor-scene dataset, which consists of high-quality images captured by a LiDAR sensor. We use the whole dataset for evaluating edge and plane quality.
We will release a list of all images used for testing to facilitate reproducibility.
Appendix B Details for Depth Prediction Model and Training.
We use the depth prediction model proposed by Xian et al. [xian2020structure]. We follow [yin2020diversedepth] and combine the multi-source training data by evenly sampling from all sources per batch. As HRWSI and Holopix50K are both web stereo data, we merge them together. Therefore, there are four different data sources, i.e. high-quality Taskonomy, synthetic 3D Ken Burns, middle-quality DIML, and low-quality Holopix50K and HRWSI. For example, if the batch size is , we sample images from each of the four sources. Furthermore, as the ground truth depth quality varies between data sources, we enforce different losses for them.
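Evenly sampling from all sources per batch can be sketched in a few lines of plain Python. This is a minimal illustration and not the training code: the function name `balanced_batch`, the placeholder source contents, and the batch size of 16 are hypothetical, and a real data loader would additionally handle shuffling across epochs and sources of very different sizes.

```python
import random

def balanced_batch(sources, batch_size, rng):
    """Draw the same number of samples from every data source for one batch."""
    if batch_size % len(sources) != 0:
        raise ValueError("batch_size must be divisible by the number of sources")
    per_source = batch_size // len(sources)
    batch = []
    for name, items in sources.items():
        # Sample without replacement inside a batch; tag each item with its source
        # so that source-specific losses can be selected later.
        batch.extend((name, item) for item in rng.sample(items, per_source))
    rng.shuffle(batch)
    return batch

# Placeholder sources: integers stand in for image/depth pairs.
sources = {
    "taskonomy": list(range(100)),
    "ken_burns": list(range(100)),
    "diml": list(range(100)),
    "web_stereo": list(range(100)),  # merged Holopix50K and HRWSI
}
batch = balanced_batch(sources, batch_size=16, rng=random.Random(0))
print(len(batch), sorted({name for name, _ in batch}))
```

Keeping the source tag with every sample makes it straightforward to apply a different combination of losses to each data source within the same batch.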
For the web-stereo data, such as Holopix50K [hua2020holopix50k] and HRWSI [xian2020structure], as their inverse depths have unknown scale and shift, these inverse depths cannot be used to compute the affine-invariant depth (up to an unknown scale and shift to the metric depth). The pixel-wise regression loss and geometry loss cannot be applied to such data. Therefore, during training, we only enforce the ranking loss [xian2018monocular] on them.
For the middle-quality calibrated stereo data, such as DIML [kim2018deep], we enforce the proposed image-level normalized regression loss, multi-scale gradient loss and ranking loss. As the recovered disparities contain much noise in local regions, enforcing the pairwise normal regression loss on noisy edges will cause many artifacts. Therefore, we enforce the pairwise normal regression loss only on planar regions for this data.
For the high-quality data, such as Taskonomy [zamir2018taskonomy] and synthetic 3D Ken Burns [Niklaus_TOG_2019], accurate edges and planes can be located. Therefore, we apply the pairwise normal regression loss, ranking loss, and multi-scale gradient loss for this data.
Furthermore, we follow [liu2019training] and add a lightweight auxiliary path on the decoder. The auxiliary path outputs the inverse depth and the main branch (decoder) outputs the depth. For the auxiliary path, we enforce the ranking loss and the image-level normalized regression loss in the inverse depth space on all data. The network is illustrated in Fig. 7.
Appendix C Sampling Strategy for Pairwise Normal Loss
We enforce the pairwise normal regression loss on Taskonomy and DIML data. As DIML is more noisy than Taskonomy, we only enforce the normal regression loss on the planar regions, such as pavements and roads, whereas for Taskonomy, we sample points on edges and on planar regions. We use the local least squares fitting method [Yin2019enforcing] to compute the surface normal from the depth map.
For edges, we follow the method of Xian et al. [xian2020structure], which we describe here. The first step is to locate image edges. At each edge point, we then sample pairs of points on both sides of the edge, i.e. . The ground truth normals for these points are , while the predicted normals are . To locate the object boundaries and planes folders, where the normals change significantly, we set the angle difference of two normals greater than . To balance the samples, we also get some negative samples, where the angle difference is smaller than and they are also detected as edges. The sampling method is illustrated as follows.
(7) 
For planes, on DIML, we use [deeplabv3plus2018] to segment the roads, which we assume to be planar regions. On Taskonomy, we locate planes by finding regions with the same normal. On each detected plane, we sample paired points. Finally, we combine both sets of paired points and enforce the normal regression loss on them, see the PWN loss equation in our main paper.
Appendix D Illustration of the Reconstructed Point Cloud
We illustrate some examples of the reconstructed 3D point clouds from our proposed approach in Fig. 8. All these data are unseen during training. This shows that our method demonstrates good generalizability on in-the-wild scenes and can recover realistic shapes for a wide range of scenes.
Appendix E Illustration of Depth Prediction In the Wild
We illustrate examples of our single image depth prediction results in Fig. 9. The images are randomly sampled from DIW and OASIS, which are unseen during training. On these diverse scenes, our method predicts reasonably accurate depth maps, in terms of global structure and local details.
Acknowledgment
This work was in part supported by ARC DP Project “Deep learning that scales”.