Benchmarks are the drivers of progress as they facilitate measurable technical increments, and can also provide explainable insights for diverging technical approaches. They must be unbiased, especially given the emergence of data-driven methods that can easily exploit any hidden bias in the data. Besides, the expressiveness of deep models necessitates the enrichment of benchmarks with varying data distributions to allow for the assessment of their generalization and their capacity to exploit different sources of data.
The recent availability of depth datasets out of stitched raw sensor data [armeni2017joint, chang2018matterport3d], 3D reconstruction renderings [zioulis2018omnidepth, zioulis2019spherical], and photorealistic synthetic scenes [jin2020geometric, Structured3D] has stimulated research in monocular depth estimation [tateno2018distortion, eder2019mapped, wang2020bifuse, jiang2021unifuse, zeng2020joint, sun2020hohonet]. Still, the progress in monocular depth estimation has been mainly driven by research for traditional cameras, and assessed on perspective datasets, starting with the pioneering work of [eigen2014depth]. Even though other approaches exist (e.g. ordinal regression [fu2018deep, bhat2020adabins]), depth estimation is most typically addressed as a dense regression objective. Various estimator choices are available for the direct objective like L1, L2, or robust versions like the reverse Huber (berHu) loss [laina2016deeper]. Complementary errors have also been introduced like the virtual normal loss [yin2019enforcing] which captures longer range depth relations. Additional smoothness ensuring losses can be used to enforce a reasonable and established prior of depth maps, which is their piece-wise smoothly spatially varying nature [huang2000statistics].
Depth maps also exhibit sharp edges at object boundaries [huang2000statistics], whose preservation is important for various downstream applications. Recent works that focus explicitly on improving the estimated boundaries introduced new metrics to measure boundary preservation performance [hu2019revisiting, ramamonjisoa2020predicting]. Since convolutional data-driven methods spatially downscale the encoded representations, predicting neighboring values relies on neighborhood information, leading to interpolation blurriness. Counteracting approaches, like encoder-decoder skip connections or guided filters [wu2018fast], can lead to texture transfer artifacts, hurting the predictions’ smoothness. Smoothness is also an important trait for tasks like scene-scale 3D reconstruction, which usually relies on surface orientation information [kazhdan2013screened] to preserve structural planarities, while boundaries are necessary for applications like view synthesis [attal2020matryodshka] or object retrieval [karsch2013boundary]. Still, smoothness-related metrics are usually presented only in surface [wang2020vplnet, karakottas2019360] or plane [eder2019pano] estimation works. Further, the balance between the two needs to be tuned, as they are conflicting objectives.
In this work we set out to deliver an unbiased and holistic benchmark for monocular depth estimation that provides performance analysis across all traits: i) depth estimation, ii) boundary preservation, and iii) smoothness. We also consider an orthogonal evaluation strategy that assesses the models’ generalization across its different facets: i) varying depth distributions, ii) adaptation to the scenes’ contexts, and iii) different camera domains. To support the benchmark, we design a set of solid baselines that respect best practices as reported in the literature and rely on standard architectures. Our results, data, code, configurations and trained models are publicly available at vcl3d.github.io/Pano3D/.
We show that recently made available datasets contain significant biases or artifacts that prevent them from being suitable as solid benchmarks.
We provide depth estimation performance results for all different traits, across different domains, contexts, distributions and resolutions, while also taking depth refinement advances into account.
We demonstrate the effectiveness of skip connections, a rare architectural choice for omnidirectional depth estimation.
2 Related Work
Monocular Omnidirectional Depth Estimation. The first works addressing the monocular data-driven omnidirectional depth estimation task were [tateno2018distortion] and [zioulis2018omnidepth]. The former applied traditional CNNs trained on perspective images in a distortion-aware manner to spherical images, while the latter introduced a rendered spherical dataset of paired color and depth images, in addition to a simplistic rectangular filtering preprocessing block. Pano Popups [eder2019pano] simultaneously predict depth and surface orientation to construct planar 3D models, showing the insufficiency of depth estimates alone to approximate planar regions.
The generalized Mapped Convolutions [eder2019mapped] were applied to omnidirectional depth estimation, showing how accounting for the distortion when using equirectangular projection increases performance in the image regions closer to the equator. Although these spatially imbalanced predictions are an important issue to address for depth estimation methods, the usual evaluation methodologies do not address this apart from [zioulis2019spherical]. The omnidirectional extension networks [cheng2020omnidirectional] employ a near field-of-view (NFoV) perspective depth camera to accompany the spherical one, offering a necessary, albeit not full FoV, constraint to enhance the preservation of details in the inferred depth map.
Recent omnidirectional depth estimation works diverged in two paths. One route is to exploit the nature of the spherical images within network architectures, with BiFuse [wang2020bifuse] fusing features from a cubemap and an equirectangular representation, while UniFuse [jiang2021unifuse] shows that the fusion of cubemap features to the equirectangular ones is more effective. HoHoNet [sun2020hohonet] adapts classical CNNs to operate on images by flattening the meridians to DCT coefficients, allowing for efficient dense feature reconstruction, and applying it to monocular depth estimation from spherical panoramas. Other recent works [jin2020geometric, zeng2020joint] explore the connection between the layout and depth estimation tasks, while [feng2020deep] relies on the joint optimization between depth and surface orientation estimates using a UNet model [ronneberger2015u].
Monocular Perspective Depth Estimation. The pioneering work for data-driven monocular dense depth estimation [eigen2014depth] employed a scale-invariant loss and established the set of metrics used to evaluate follow-up works. Naturally, the progress in monocular depth estimation for perspective images is larger, as traditional images find more widespread use. While impressive gains have been presented using ordinal regression [fu2018deep] or adaptive binning [bhat2020adabins], they have not been applied to omnidirectional depth estimation, which exhibits more complex depth distributions than perspective depth maps due to its holistic FoV.
Results like the berHu loss presented in [laina2016deeper] have found traction in omnidirectional models as they are more easily transferable. On the contrary, the more recently presented virtual normal loss [yin2019enforcing] has not been applied to omnidirectional depth estimation, although its longer-range depth relation modelling is highly aligned with the global reasoning required for the spherical task. Recently, the balance between the multitude of losses required to balance smoothness, boundary preservation and depth accuracy was investigated in [lee2020multi] to help models initially focus on easier to optimize losses (i.e. depth accuracy), and then on harder ones (i.e. smoothness, boundary).
Regarding depth discontinuity preservation performance, [hu2019revisiting] showed that a combination of three different loss terms, a depth, a surface and a spatial derivative one, help increase performance at object boundaries. Similarly, a boundary consistency was introduced in [huang2019indoor] to overcome blurriness and bleeding artifacts. Another approach, based on learnable guided filtering [wu2018fast], exploits the color image as guidance. Recently, displacement fields [ramamonjisoa2020predicting] showed that predicting resampling offsets instead of residuals is more suitable to increase performance at sharp depth discontinuities, while preserving depth estimation accuracy.
Our goal is two-fold, first to deliver a new benchmark for depth estimation, and second, to methodically analyze the task in light of recent developments, to identify a set of solid baselines, which future works will use as the starting points for assessing performance gains. Section 3.1 introduces the benchmark data, which set the ground for the subsequent analysis. Section 3.2 describes the benchmark’s holistic approach in terms of evaluation, while Section 3.3 presents the experiment design rationale.
Up to now depth datasets either rendered purely synthetic scenes like Structured3D [Structured3D] or 360D’s SunCG and SceneNet parts [zioulis2018omnidepth], or relied on 3D scanned datasets like Matterport3D [chang2018matterport3d] and Stanford2D3D [armeni2017joint]. The latter offer both panoramas and the 3D textured meshes, with some works using the original Matterport camera derived data and others the rendered panoramas from the 3D scanned meshes. Both approaches come with certain drawbacks, the original data contain invalid (i.e. true black) regions towards the sphere’s poles, while the 3D rendered data contain invalid regions where the 3D scans failed to reconstruct the surface. At the same time, the original data present with stitching artifacts (mostly blurring), while the rendered data sometimes suffer from 3D reconstruction errors which manifest in color discontinuities. We opt for the generation via rendering approach [zioulis2018omnidepth] as it produces true spherical panoramas and higher quality depth maps at lower resolution compared to nearest neighbor sampling. However, we fix a critical issue of the 360D [zioulis2018omnidepth] and 3D60 [zioulis2019spherical] datasets, namely, the introduction of a light source that alters the scene’s photorealism. Instead, we only sample the raw diffuse texture, preserving the original scene lighting, a crucial factor for unbiased learning and performance evaluation.
Zero-shot Cross-Dataset Transfer. However, there is a need to move beyond traditional train/test split performance analysis to support model deployment in real-world conditions. Thus, assessing generalization performance is very important. Towards that end, apart from re-rendering the Matterport3D scans for training, we introduce a new dataset of color-depth pairs generated from the GibsonV2 (GV2) [xia2018gibson] 3D scans. Compared to Matterport3D’s (M3D) buildings, it is a vastly larger dataset, whose scenes offer higher variety as well. These renders can be used for assessing generalization performance across its different splits: tiny, medium, full and fullplus (the larger GV2 full split is kept for future training purposes). After removing outlier scans and filtering samples by their share of invalid pixels, we are left with the M3D train/test samples and the GV2 tiny, medium, fullplus, and full split samples.
|Model||Depth Error||Depth Accuracy|
Since the introduction of the first set of metrics for data-driven depth estimation [eigen2014depth], namely root mean squared error (RMSE), root mean squared logarithmic error (RMSLE), absolute relative error (AbsRel), squared relative error (SqRel), and the relative threshold (δ) based accuracies (δ < 1.25, 1.25², 1.25³), these metrics have been the standard approach for evaluating depth estimation performance. More recent works have identified some shortcomings of these metrics. Specifically, in [koch2018evaluation] an expanded analysis of depth estimation quality measures was conducted, focused on two important traits, planarity and discontinuities. The latter is very important for some downstream applications like view synthesis, and apart from the completeness (comp) and accuracy (acc) of the depth boundary errors (dbe) proposed in [koch2018evaluation], another set of accuracy metrics was proposed in [hu2019revisiting]. The precision, recall and their harmonic mean (F1-score) were used after extracting different boundary layers via Sobel edge thresholding. Planarity is also very important for various downstream applications, especially for indoor 3D reconstruction. Finally, to overcome resolution [cadena2016measuring] and focal length variations [chen2020oasis], recent perspective depth estimation works resort to nearest-neighbor 3D metrics.
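As a point of reference for the direct metrics listed above, the following is a minimal NumPy sketch, not the benchmark's own implementation. The function name `depth_metrics`, the validity mask `gt > 0`, the use of the natural logarithm for RMSLE and the dictionary keys are assumptions made for illustration; predictions are assumed to be strictly positive on valid pixels.

```python
import numpy as np


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Error and accuracy metrics over valid pixels (assumed: gt > 0)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    diff = p - g
    metrics = {
        "rmse": np.sqrt(np.mean(diff ** 2)),
        # Natural log is an assumption; some works use log10 instead.
        "rmsle": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "abs_rel": np.mean(np.abs(diff) / g),
        "sq_rel": np.mean(diff ** 2 / g),
    }
    # Relative threshold accuracy: share of pixels whose ratio
    # max(pred / gt, gt / pred) falls below each threshold.
    ratio = np.maximum(p / g, g / p)
    for i, t in enumerate(thresholds, start=1):
        metrics[f"delta_{i}"] = np.mean(ratio < t)
    return metrics
```

Lower thresholds can be evaluated by passing a different `thresholds` tuple; the specific values used by the benchmark are not assumed here.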
Direct Depth Metrics. We build upon these developments and design our benchmark to provide a holistic evaluation of depth estimation models. Given the progress of recent data-driven models, we expand the accuracies with two lower thresholds, a strict and a precise one, similar to [huang2019indoor]. However, these metrics, when applied directly on equirectangular images, are biased by the projection’s distortion towards the poles. To remove this bias we take the spherically weighted mean (denoted with a prefix), which is standard practice for image/video quality assessment [xu2020state] and was also used in [zioulis2019spherical]. For the accuracies though, we turn to uniform sampling on the sphere using the projected vertices of a subdivided icosahedron, denoted by the icosahedron’s order.
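To illustrate the spherically weighted mean, here is a small sketch that weights each equirectangular pixel by the cosine of its latitude, which is proportional to the area the pixel covers on the sphere. The helper names `latitude_weights` and `weighted_rmse`, the pixel-centre latitude convention and the `gt > 0` validity mask are assumptions of this sketch rather than details taken from the benchmark code.

```python
import numpy as np


def latitude_weights(height, width):
    """Per-pixel cos(latitude) weights for an equirectangular grid."""
    # Row centres mapped to latitudes in (-pi/2, pi/2), top row first.
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    return np.repeat(np.cos(lat)[:, None], width, axis=1)


def weighted_rmse(pred, gt):
    """RMSE with spherical weighting over valid pixels (assumed: gt > 0)."""
    w = latitude_weights(*gt.shape) * (gt > 0)
    return np.sqrt(np.sum(w * (pred - gt) ** 2) / np.sum(w))
```

With this weighting, an error located in the heavily oversampled polar rows contributes less than the same error near the equator, which is the bias the unweighted image-level mean would otherwise introduce.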
Depth Discontinuity Metrics. We complement the direct depth performance metrics with a set of secondary metrics measuring performance at preserving the depth discontinuities, usually manifesting at object boundaries. While [koch2018evaluation] used manual annotation and structured edge detection [dollar2014fast], we follow the approach of [ramamonjisoa2020predicting] that relies on automatic Canny edge detection [canny1986computational]. In addition, we complement the depth boundary errors (dbe_acc and dbe_comp) with the accuracy metrics of [hu2019revisiting] (prec_t, rec_t), using the same thresholds for both sets of metrics.
Depth Smoothness Metrics. While the planarity metric of [koch2018evaluation] required the manual annotation of samples, its goal is to measure the smoothness of the inferred depth with respect to dominant structures. A straightforward adaptation that removes the need for annotations is the use of surface orientation metrics, as surface orientation is a property directly derived from the depth measurements. Using a spherical-to-Cartesian coordinates conversion, the depth/radius measurements are lifted to 3D points, with the surface orientation extracted by exploiting the structured nature of images. Similar to how surface estimation methods measure performance, we use the angular RMSE (RMSE_o) and a set of accuracies with pre-defined angle thresholds, using those from [wang2020vplnet].
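The lifting and normal extraction described above can be sketched as follows. This is an illustrative approximation under stated assumptions: the longitude/latitude convention, the y-up axis layout, the finite-difference scheme and all function names are choices made for this sketch, not the benchmark's exact procedure.

```python
import numpy as np


def lift_to_points(depth):
    """Lift an equirectangular depth (radius) map to 3D points."""
    h, w = depth.shape
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions (assumed convention: y is up).
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    return depth[..., None] * dirs


def normals_from_depth(depth):
    """Unit normals from the cross product of image-space tangents."""
    pts = lift_to_points(depth)
    # Horizontal differences wrap around, as the panorama is periodic.
    du = np.roll(pts, -1, axis=1) - np.roll(pts, 1, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-12)


def angular_rmse_deg(pred, gt):
    """Angular RMSE (degrees) between normals of two depth maps."""
    cos = np.sum(normals_from_depth(pred) * normals_from_depth(gt), axis=-1)
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.sqrt(np.mean(ang ** 2))
```

Threshold accuracies follow the same pattern, by taking the share of pixels whose angular error falls below each chosen angle.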
Geometric Metrics. Depth estimations are typically used in downstream applications for metric-scale 3D perception. Therefore, 3D performance metrics are reasonable to assess suitability for downstream tasks. We use two different metrics that aggregate the performance of the aforementioned different depth traits, i.e. accuracy and precision, boundary preservation and smoothness. The first geometric metric is computed on the point cloud level (c2c), using a point-to-plane distance between each point and its closest correspondence with the ground truth point cloud. The point-to-plane distance jointly encodes depth correctness and smoothness, while the closest point query will penalize boundary errors. The second geometric metric is computed on the mesh level, having each point cloud (predicted and ground truth) 3D reconstructed using the Screened Poisson Surface Reconstruction [kazhdan2013screened]. We then calculate the Hausdorff distance [cignoni1998metro] between the two meshes. Similarly, Poisson reconstruction leverages both position and surface information when generating the scene’s mesh. Through this metric we assess the capacity to represent the entire scene’s geometry with the estimated depth, an important trait for some downstream applications.
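For intuition about the point cloud metric, the sketch below computes a point-to-plane distance from each predicted point to its nearest ground truth point, using the ground truth normal at that point. It uses a brute-force nearest-neighbor search that is only practical for small clouds, and the function name and signature are hypothetical; a dedicated tool or a spatial index would be used in practice.

```python
import numpy as np


def point_to_plane_c2c(pred_pts, gt_pts, gt_normals):
    """Point-to-plane distances from predicted to ground truth points.

    pred_pts: (N, 3), gt_pts: (M, 3), gt_normals: (M, 3) unit normals.
    """
    # Brute-force pairwise squared distances, shape (N, M).
    d2 = np.sum((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2, axis=-1)
    nearest = np.argmin(d2, axis=1)
    # Project the offset to the closest point onto that point's normal.
    offsets = pred_pts - gt_pts[nearest]
    return np.abs(np.sum(offsets * gt_normals[nearest], axis=-1))
```

Because the correspondence is chosen by proximity, a depth value that bleeds across an object boundary is matched to the wrong surface, which is how the closest point query penalizes boundary errors.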
3.3 Experimental Setup
We design our experiments and search for a solid baseline taking recent developments into account.
Supervision. As shown in [carvalho2018regression], the L1 loss exhibits the best convergence for monocular depth estimation irrespective of the model size and architecture complexity, indicating that models behaving like median estimators are more appropriate. Most recent works for depth estimation [wang2020bifuse, jiang2021unifuse, eder2019mapped] use the berHu loss [laina2016deeper], with the exception being [sun2020hohonet], which uses the L1 loss.
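The two direct objectives discussed here can be compared with a short NumPy sketch. The berHu form follows the commonly used definition from [laina2016deeper], where the switch point is a fraction of the largest absolute error in the batch; the `ratio=0.2` default and the function names are assumptions of this sketch.

```python
import numpy as np


def l1_loss(pred, gt):
    """Mean absolute error."""
    return np.mean(np.abs(pred - gt))


def berhu_loss(pred, gt, ratio=0.2):
    """Reverse Huber: L1 below the threshold c, scaled L2 above it."""
    err = np.abs(pred - gt)
    c = ratio * np.max(err)
    if c == 0:
        return 0.0
    quadratic = (err ** 2 + c ** 2) / (2.0 * c)
    return np.mean(np.where(err <= c, err, quadratic))
```

The quadratic branch penalizes large residuals more strongly than L1, while small residuals keep the L1 gradient magnitude.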
We additionally observe that these works rely solely on a single direct depth loss, while recent works on perspective depth estimation also include additional losses. MiDaS [ranftl2020towards], MegaDepth [li2018megadepth], as well as [xian2020structure] and [hu2019revisiting], use a multi-scale gradient matching term that enforces consistent depth discontinuities. While their terms are scale-invariant and operate in log-space, spherical depth does not suffer from disparity/baseline or focal length variations, and since we do not use the L1 loss in log-space (as its performance is inferior to pure L1 [carvalho2018regression]), we use a non-scale-invariant version of this loss. Apart from boundary preservation, the piece-wise smooth nature of depth necessitates the use of a suitable prior for the predictions. This was acknowledged in [hu2019revisiting], where a surface orientation consistency (normal) loss was used. Prior works employed smoothness priors on the predictions and, to overcome cross-boundary smoothing, relied on image gradient weighting [godard2017unsupervised]. Yet image gradients do not necessarily align with depth discontinuities, making the normal loss a better candidate.
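A non-scale-invariant gradient matching term of the kind described above could look like the following sketch, which penalizes the spatial gradients of the raw depth residual at several subsampled scales. The number of scales, the strided subsampling and the per-scale averaging are assumptions made for illustration and may differ from the cited formulations.

```python
import numpy as np


def gradient_matching_loss(pred, gt, scales=4):
    """Multi-scale L1 penalty on the gradients of the depth residual.

    Assumes the inputs are large enough that every scale keeps at
    least two rows and two columns after subsampling.
    """
    residual = pred - gt
    total = 0.0
    for k in range(scales):
        step = 2 ** k
        r = residual[::step, ::step]
        grad_x = np.abs(r[:, 1:] - r[:, :-1])
        grad_y = np.abs(r[1:, :] - r[:-1, :])
        total += grad_x.mean() + grad_y.mean()
    return total / scales
```

A constant depth offset leaves this term at zero, which is why it complements, rather than replaces, the direct depth objective.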
Finally, the newly introduced virtual normal loss (VNL) [yin2019enforcing] is a long-range relationship oriented objective, which, given the global context of spherical panoramas, is well aligned with the task. In our experiments we follow a progressive loss ablation: we start with an L1 objective, examine the effect of adding the gradient matching and the normal loss to this baseline individually, then their combined effect, and finally extend the combined objective with the virtual normal loss, so that the last experiment includes all losses.
Model Architecture. The importance of high-capacity encoders, pre-training and multi-scale predictions is acknowledged in the literature [ranftl2020towards]. Building on the first, we preserve a consistent convolution decoder and use a DenseNet-161 [huang2017densely] (M parameters) and ResNet-152 [he2016identity] (M parameters) encoder as baselines. Inspired by recent work [lee2020multi] we also include their Pnas model (M parameters), whose encoder is a product of neural architecture search [liu2018progressive]. In addition, taking into account the boundary preservation performance of skip connections, we also use the UNet model [ronneberger2015u] (M parameters), which is largely unpopular for depth estimation. Since it is a purely convolutional model, we additionally modify the ResNet-152 model with skip connections starting from the first residual block (M parameters), in contrast to UNet’s very early layer encoder-to-decoder skip. Since pre-trained weights are not available for UNet, we experiment with cold-started models, and also simplify training by using single-scale predictions, as the multi-scale effect should be horizontal across all models with the same convolution decoder structure.
Periodic Displacement Fields Refinement. We additionally consider the refinement of the predicted depth, using a shallow hourglass module [newell2016stacked]. It is adapted for the task at hand, with two branches, one for the input color image and the other for the predicted depth map. Across each stage, we account for the varying nature of each branch’s feature statistics using Adaptive Instance Normalization [huang2017arbitrary]. We follow the recent approach of [ramamonjisoa2020predicting] that shows how predicting displacement fields instead of residuals produces higher quality depth refinement. However, the spherical domain is continuous, and thus, we need to account for the horizontal discontinuity of the equirectangular projection. To achieve this in a locally differentiable manner, we resort to a periodic reconstruction of the sampling coordinates. Considering the final sampling coordinates after adding the displacement field, we wrap the horizontal ones around the image’s width, so that samples displaced past one vertical border continue from the opposite one.
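One way to realize a periodic reconstruction of the sampling coordinates is a modulo-based wrap, sketched below. It assumes horizontal coordinates normalized to the half-open range [-1, 1), which is a common convention for grid-based resampling; both the normalization and the function name are assumptions of this sketch and not necessarily the exact formulation used for the refinement module.

```python
import numpy as np


def wrap_horizontal(x):
    """Wrap normalized horizontal coordinates into [-1, 1).

    The mapping has unit slope almost everywhere, so gradients with
    respect to the displacement pass through unchanged.
    """
    return np.mod(x + 1.0, 2.0) - 1.0
```

For example, a base coordinate of 0.9 displaced by 0.3 lands at 1.2, outside the image, and is wrapped to approximately -0.8 on the opposite border.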
Training and Evaluation. We train all models solely on the official train split of M3D, and evaluate them on its official test split as well. Evaluation is conducted across all the aforementioned axes of depth performance. Apart from this holistic performance analysis, we additionally take an orthogonal direction and assess the models’ generalization performance on zero-shot cross-dataset transfer using the GV2 tiny, medium and fullplus splits. Given that both GV2 and M3D scenes were scanned with the same type of camera (i.e. Matterport), we render another version of tiny which is tone mapped to a film-like dynamic range, dubbed tiny-filmic, changing the camera-related data domain. Our experiments are conducted on two different resolutions (we render all datasets to both) to assess cross-resolution performance.
|Model||Depth Error||Depth Accuracy|
|UNet @ 3D60||0.3140||0.0455||0.0741||0.0316||49.99%||75.16%||95.49%||99.11%||99.60%|
|ResNetskip @ 3D60||0.3758||0.6100||0.0883||0.0481||46.03%||70.29%||93.12%||98.41%||99.34%|
|UNet @ S3D||0.1815||0.0546||0.0919||0.0398||50.61%||75.98%||92.23%||96.56%||97.53%|
|ResNetskip @ S3D||0.2450||0.1335||0.1349||0.1249||40.48%||67.29%||88.67%||95.01%||96.68%|
Table 2: Direct depth performance using spherical metrics. A UNet model with spherical padding is also presented (light pink), as well as the two better performing models trained and tested on the 3D60 (light orange) and Structured3D (light red) datasets.
|Model||Depth Discontinuity||Depth Smoothness|
|ResNetskip & ref||1.4291||4.3115||60.78%||58.09%||51.49%||37.79%||32.55%||27.23%||15.05||65.16%||78.26%||82.77%|
|GV2||Model||Direct Depth||Depth Discontinuity||Depth Smoothness|
|UNetskip & aug||0.4580||0.0840||0.1701||39.73%||81.19%||1.4480||4.2681||62.69%||66.19%||62.27%||16.30||82.16%|
|UNetskip & aug||0.4321||0.0823||0.1673||39.70%||81.90%||1.5045||4.2659||63.94%||67.27%||61.69%||15.43||83.70%|
|UNetskip & aug||0.6014||0.1033||0.1758||42.70%||76.97%||1.7040||5.0063||56.24%||58.18%||53.33%||20.87||75.09%|
|UNetskip & aug||0.4750||0.0871||0.1743||39.51%||80.39%||1.5326||4.3923||60.69%||63.32%||59.43%||16.66||81.62%|
|UNetskip & aug||0.6237||0.1084||0.1829||41.57%||75.56%||1.7688||5.1482||54.64%||55.56%||50.02%||21.23||74.58%|
|GV2||Model||Direct Depth||Depth Discontinuity||Depth Smoothness|
Implementation Details. We implement all experiments with moai [moai], using the same seed across all experiments. For data generation we use Blender and the Cycles path tracer using samples. Our ResNets are built with pre-activated bottleneck blocks [he2016identity] and all our models’ weights are initialized with [he2015delving]. We optimize all models for epochs on a NVidia 2080 Ti, using Adam [kingma2014adam] with a learning rate of and default momentum parameters, and a consistent batch size of . All losses are unity (i.e. equally) weighted across all experiments. We use CloudCompare to calculate the c2c distance [girardeaumontaut:pastel-00001745], and MeshLab to calculate the m2m distance [cignoni1998metro]. During evaluation, we consider the raw values predicted by the models and clip the valid depth range to m.
Which loss combination offers better performance? Contrary to their focused nature, both the gradient matching and the normal loss increase depth estimation performance across all models when complementing the direct objective, as evident in Table 1. In addition, they provide the expected boost in smoothness/discontinuity preservation across all models, as presented in our supplementary material, which is appended after the references. When viewed purely from a depth estimation perspective, their combination benefits performance. But when examining the specific depth traits that they seek to enforce, their conflicting nature is also apparent. Overall, almost all models achieve their highest performance when both losses are present, with or without the virtual normal loss (VNL), which is added in the all-losses case. The latter greatly boosts the UNet model, which is reasonable as the localised nature of skip connections is aided by the global depth constraints that VNL introduces.
Which architecture is better performing? We compare architectures after selecting the best performing loss combination for each model. For ResNetskip, the rationale for the selection is that while the alternative combination behaves better on direct depth metrics (except at closer distances, as indicated by the RMSLE), it leaves a large performance gap in the discontinuity and smoothness metrics, which outweighs the discrepancy in depth estimation. Table 2 presents the results using the spherical metrics that account for the distortion. These are unbiased metrics, which is evident given the deteriorated performance across all metrics compared to those estimated at the image level on each equirectangular panorama. A more straightforward comparison is available in our supplementary material appended after the references. Interestingly, we observe that models employing encoder-decoder skip connections exhibit better performance in direct depth metrics (Table 2). Curiously, contrary to the expectation set by the literature [ranftl2020towards] that high-capacity encoders are required, the UNet architecture showcases the best performance. Regarding domain-oriented techniques, we train the best performing model with circular padding [sun2019horizonnet, zioulis2021single] that connects features across the horizontal equirectangular boundary, denoted as circ. Evidently, this simple scheme increases performance across all metrics, allowing the model to exploit the spherical nature of its input.
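Circular padding can be sketched with NumPy's `np.pad`: the feature map is wrapped horizontally so that a convolution at the left border sees features from the right border, and vice versa. The zero padding applied vertically and the `(C, H, W)` layout are assumptions of this sketch; the function name is hypothetical.

```python
import numpy as np


def spherical_pad(feat, pad=1):
    """Pad a (C, H, W) feature map before a convolution.

    Horizontal borders wrap around (the panorama is periodic in
    longitude); vertical borders are zero padded (an assumption).
    """
    feat = np.pad(feat, ((0, 0), (0, 0), (pad, pad)), mode="wrap")
    return np.pad(feat, ((0, 0), (pad, pad), (0, 0)), mode="constant")
```

Applying this before a convolution without built-in padding keeps the output size unchanged while removing the artificial seam at the horizontal image boundary.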
Is this performance consistent when considering secondary traits? Regarding the discontinuity and smoothness traits presented in Table 3, it is evident that skip connections result in higher discontinuity performance, especially for the dominating UNet, but at the expense of the smoothness trait. This is reasonable as early layer skip connections result in texture transfer, and is further evidenced by the improved performance of ResNetskip, which lacks early layer skips, on both discontinuity and smoothness metrics. Overall, UNet achieves the best performance on depth and discontinuity metrics at the expense of the smoothness trait and closer range performance, as indicated by its inferior RMSLE. On the other side, the different metrics indicate that the Pnas model produces oversmoothed results that are more metrically accurate and precise at closer distances. Nonetheless, ResNetskip achieves a better balance without significant sacrifices across the secondary traits.
How helpful is depth refinement? We also examine the effect of a shallow depth refinement module on these models, with the results after training presented in Table 3. All models, apart from UNet, improve their performance at boundary preservation while also preserving depth estimation performance, but at the expense of smoothness, with the exception in this case being ResNetskip. For UNet specifically, texture transfer leads to noise, which prevents an interpolation-based warping technique from improving results, as it was designed to improve smooth depth predictions. However, ResNetskip closes the performance gap and even improves smoothness performance, further solidifying its well-balanced nature.
Why this benchmark? Table 2 shows the performance of the two higher performing models when trained and tested on other recently introduced depth datasets, namely 3D60 [zioulis2019spherical], which is an extension of [zioulis2018omnidepth], and Structured3D [Structured3D], at their respective resolutions. All metrics indicate significantly higher performance, which evidences the unsuitability of these datasets for benchmarking progress. This is largely because of their inherent biases. For 3D60, the bias is the result of lighting, as it includes an extra light source at the center, as also explained in [jiang2021unifuse]; models learn to exploit this unfortunate bias as farther depths are darker. For the purely synthetic Structured3D dataset, the bias stems from the omission of the noisy camera-based image formation process and the lack of real-world scene complexity.
What is their generalization capacity? We test these models in a zero-shot cross-dataset transfer setting on the GV2 splits, using a subset of all metrics, with the results presented in Table 4. We observe reduced performance for all models across all splits, which is the result of applying these models in different contexts/scenes and to out-of-distribution depths (tiny/medium). Yet, the ranking between models is not severely disrupted, indicating that architecture changes do not significantly affect generalization. The fullplus split is noticeably harder than the others, as all metrics are considerably worse, showcasing that pure context shifts (similar depth distribution) are detrimental to performance. However, camera domain shifts are another significant generalization barrier, as shown by the models’ results on the filmic splits, where a different color transfer function was applied during rendering. The latter also received the biggest gains when training with photometric augmentation (UNetaug), specifically random gamma, contrast, brightness and saturation shifts, which also boosted performance horizontally across all splits. Still, augmentation alone did not raise performance to levels similar to the M3D test set, indicating that other techniques are required.
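A photometric augmentation with random gamma, contrast, brightness and saturation shifts could be sketched as below. The sampling ranges, the order of operations and the function name are assumptions chosen for illustration; they are not the values used to train the augmented models.

```python
import numpy as np


def photometric_augment(rgb, rng, gamma=(0.8, 1.2), contrast=(0.8, 1.2),
                        brightness=(-0.1, 0.1), saturation=(0.8, 1.2)):
    """Randomly perturb an (H, W, 3) image with values in [0, 1].

    All ranges are hypothetical defaults. Only the color image is
    altered, so the paired depth map stays valid.
    """
    out = np.clip(rgb, 0.0, 1.0) ** rng.uniform(*gamma)
    mean = out.mean()
    out = (out - mean) * rng.uniform(*contrast) + mean
    out = out + rng.uniform(*brightness)
    # Saturation: scale each pixel's distance from its gray value.
    gray = out.mean(axis=-1, keepdims=True)
    out = (out - gray) * rng.uniform(*saturation) + gray
    return np.clip(out, 0.0, 1.0)
```

A typical call would pass a seeded generator, e.g. `photometric_augment(image, np.random.default_rng(0))`, so that the perturbations are reproducible across runs.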
|Pnas||0.2502 (7.02%)||0.1439 (0.1881)|
|UNet||0.2397 (6.52%)||0.1305 (0.1663)|
|DenseNet||0.2475 (6.98%)||0.1425 (0.1852)|
|ResNet||0.2573 (7.01%)||0.1405 (0.1907)|
|ResNetskip||0.2424 (6.83%)||0.1300 (0.1770)|
metric we also report the error standard deviation.
How does performance vary with resolution? Given their FoV, spherical panoramas require higher resolutions to be able to more robustly estimate detailed depth. Table 5 presents the results of the two better performing models, trained on M3D’s higher-resolution data, and tested on the GV2 splits at the same resolution. We observe a change in performance between the UNet and the ResNet with skip connections. The latter’s expanded receptive field and higher capacity encoder offer significantly higher performance in the direct depth and smoothness metrics, although the UNet still localizes boundaries better.
How about downstream application suitability? We also assess each model’s performance using the 3D metrics that aggregate performance across all axes. Table 6 presents the results using the cloud and mesh distances presented in Section 3.2. Overall the performance ranking is preserved, with UNet’s noisy predictions being moderated by the reconstruction process in the mesh distance metric, while the point cloud distance’s nearest-neighbor nature is more sensitive to it. Thus, downstream applications like view synthesis should investigate model results using c2c metrics, while applications relying on 3D reconstruction should resort to the m2m metric. Again, as shown by these metrics, the skip connections based ResNet is a reasonably balanced choice that closely follows UNet’s top performance.
Spherical depth estimation is a task that comes with certain advantages (holistic view) and disadvantages (resolution requirements) compared to traditional perspective depth estimation. Preserving boundaries is challenging because the distortion frequently squeezes objects towards the equator into smaller spatial areas, and because of the discontinuities that the different projections introduce. Imposing a smoothness prior is also not as straightforward as for perspective depth. The presented Pano3D benchmark can stimulate future progress in depth estimation that will take all these aspects into account. From our extensive analysis, which nonetheless does not cover all cases, we identify the effectiveness of skip connections in terms of boundary preservation, as a means to overcome the weakness of spatial downscaling, which in turn is necessary to exploit the panoramas’ global context. While the UNet architecture achieves top performance in lower resolutions, a ResNet with skip connections is a more balanced architectural choice that scales better across resolutions.
Finally, Pano3D relies on zero-shot cross-dataset transfer to move beyond a simple train/test split performance comparison. By decomposing generalization into three distinct performance-reducing barriers, our goal is to better facilitate the assessment of the real-world applicability of data-driven models for geometric inference.
We provide extra quantitative and qualitative comparisons in the supplementary material following the references. Supplementary experiments also reproduce the prior work used as a basis for designing our methodology. Finally, a live web demo of our baseline models can be found at share.streamlit.io/tzole1155/ThreeDit.
This work was supported by the EC funded H2020 project ATLANTIS [GA 951900].
Appendix A Supplementary Material
This supplementary material complements our original manuscript with additional quantitative results, extra ablation experiments, qualitative results on real data, and comparisons between the different architectures.
A.1 Quantitative Results
Table 7 complements Table 1 of the main document, presenting the performance of all remaining metrics, namely the spherical direct depth metrics, the boundary preservation metrics, and the smoothness metrics. In addition, Figure 6 presents the different models’ performance in terms of three indicators, one for each trait. These indicators combine an error and an accuracy metric:
with and the depth, boundary and smoothness performance indicators. Evidently, UNet performs significantly better than the other models, especially in the boundary consistency metrics, while all models benefit from the addition of extra losses. The addition of skip connections in a common ResNet architecture offers better performance. While offers better depth performance for ResNetskip, the variant trained with offers higher performance across the two secondary traits.
In addition, Table 8 complements the main paper’s spherical metrics by collating the traditional ones for a straightforward comparison.
|Model||Direct Depth||Depth Discontinuity||Depth Smoothness|
|Model||Depth Error||Depth Accuracy|
Figure 6: Performance indicators (higher is better) of different loss functions per model along three different axes. From left to right: depth indicator, boundary indicator, and smoothness indicator.
|Depth Error||Depth Accuracy|
Finally, Table 9 reproduces the grounds upon which our methodology was designed, namely the efficacy of pre-trained models [ranftl2020towards] and the L1 loss [carvalho2018regression]. We use the DenseNet and Pnas models with the encoders initialized using weights pre-trained on ImageNet. Both claims stand, with all pre-trained models achieving better performance than the model trained from scratch. In addition, the L1 loss outperforms both the berHu [laina2016deeper] and log losses. Interestingly, the performance drops significantly for DenseNet when trained with other losses, while for Pnas the performance gap is smaller. Therefore, when benchmarking different models, this needs to be taken into account as well. Only through consistent experimentation across different aspects will measurable and explainable progress be possible.
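For reference, two of the estimators compared above can be sketched as follows. The berHu definition follows the commonly used reverse Huber formulation, which is linear for small residuals and quadratic for large ones. Setting the threshold c to 0.2 of the maximum absolute residual, as well as the function names, are assumptions for illustration and not necessarily the settings used in these experiments.

```python
import numpy as np


def l1_loss(pred, gt):
    """Mean absolute error over all pixels."""
    return np.abs(pred - gt).mean()


def berhu_loss(pred, gt, fraction=0.2):
    """Reverse Huber (berHu): L1 for small residuals, scaled L2 for large ones.

    The threshold c = fraction * max|residual| is an assumed, commonly used choice.
    """
    residual = np.abs(pred - gt)
    c = fraction * residual.max()
    if c == 0.0:  # perfect prediction; avoid dividing by zero below
        return 0.0
    quadratic = (residual ** 2 + c ** 2) / (2.0 * c)
    return np.where(residual <= c, residual, quadratic).mean()
```

Because the quadratic branch grows faster than the linear one, berHu weighs the largest residuals in a batch more heavily than L1 does, which is one way the two objectives can lead to different trained models.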
A.2 Qualitative Results
Finally, we present additional qualitative results for different models. Apart from collating the predicted depth maps of the different models, we provide an advantage visualisation similar to that presented in HoHoNet [sun2020hohonet], which shows the MAE difference between two comparable models.
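A minimal sketch of such an advantage map is given below, assuming it is computed as the per-pixel difference of the two models' absolute errors over valid ground-truth pixels. The function name `advantage_map`, the validity mask, and the sign convention (positive where model A is more accurate) are illustrative assumptions.

```python
import numpy as np


def advantage_map(pred_a, pred_b, gt, valid=None):
    """Per-pixel |pred_b - gt| - |pred_a - gt|; positive where model A is more accurate.

    Pixels outside `valid` (assumed to be gt > 0 when not given) are set to zero.
    """
    if valid is None:
        valid = gt > 0
    err_a = np.abs(pred_a - gt)
    err_b = np.abs(pred_b - gt)
    return np.where(valid, err_b - err_a, 0.0)
```

Averaging this map over the valid pixels gives the difference between the two models' MAE values, so the visualisation can be read as a spatial breakdown of that difference.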
To that end, Figure 32 demonstrates the comparison of the ResNet and ResNetskip architectures, Figure 58 that of the UNet and Pnas architectures, and, finally, Figure 84 presents the differences between the UNet and ResNetskip architectures.
Additionally, Figure 85 presents comparative results regarding the boundary preservation performance across models. Once again, UNet is able to capture finer-grained details, while the Pnas model produces smoother results. Similarly, the differences between ResNet and ResNetskip, attributed to the addition of the skip connections, are apparent across all samples.
Nonetheless, Pnas better captures the global context, as seen in Figure 110, where it preserves the scene’s dominant planar surfaces better than UNet does.
Figures 111, 112, and 113 demonstrate qualitative results on the GV2 tiny split for the UNet, Pnas, and ResNetskip architectures, respectively. Apart from the predicted point cloud, we visualise the c2c error on the ground truth point cloud, with a blue-green-red colormap denoting the error’s magnitude.
Finally, Figures 114 and 115 offer qualitative results of our best performing method on real-world, in-the-wild data captures. We also qualitatively compare our predictions with a state-of-the-art depth estimation model (i.e. BiFuse [wang2020bifuse]). It is worth highlighting that even though two of the three images are captured by a panorama camera, the last two images are captured by a smartphone camera and, as such, contain artifacts. Yet, this does not seem to greatly affect the performance of the models. The UNet model produces higher quality depth estimates than BiFuse, despite being trained only on the train split of M3D, whereas the publicly available BiFuse model, as reported in UniFuse [jiang2021unifuse], has been trained on the entire M3D dataset.