Omnidirectional imaging is currently experiencing a surge in popularity, thanks to the advent of interactive panorama photo sharing on social media platforms, the rise of small, affordable cameras like the Ricoh Theta and Samsung Gear360, and the host of potential applications that arise from capturing a wide field of view (FoV) in a single frame. At the same time, deep learning has never been a more useful tool for solving computer vision tasks from object recognition to 3D reconstruction. In order to fully utilize this rising form of media, we must extend existing deep learning methods to the omnidirectional domain. Unfortunately, this is not necessarily a trivial task.
Due to the radically different camera models, deep networks trained on perspective images do not transfer well to omnidirectional images. Omnidirectional images replace the concept of the image plane with that of the image sphere. Yet because we require a 2D planar representation of the image, omnidirectional cameras typically provide outputs as full-FoV equirectangular projections. This representation of the spherical image, while compact, suffers from significant horizontal distortion, especially near the poles.
While there have been a number of efforts to handle the difficulties of equirectangular projections [1, 2, 3, 5, 25, 26], we are interested in exploring their possible uses. There is excitement over the range of applications of omnidirectional imaging from head-mounted displays to medical scopes to autonomous vehicles. In this paper, we target indoor scene modeling.
Perspective image methods are impeded by a small FoV that is more likely to be limited by featureless, homogeneous regions in an indoor scene. With the larger FoV in omnidirectional images, these homogeneous regions can be reasoned about in the larger context of the scene. Our goal is to predict the dense depth and surface normals for a piecewise-planar reconstruction of the scene. This objective differs from much of the existing work that uses omnidirectional images for indoor 3D modeling. Those, such as RoomNet [14] and LayoutNet [30], aim to generate a simple model of the scene by leveraging a Manhattan World constraint to estimate the dominant planes. That type of model is useful for determining the shapes of rooms and floor-plans of buildings, but not for modeling the objects that comprise the captured scene. While we, too, are essentially estimating planes in the scene, we aim for a more fine-grained model in order to better capture these important details. To this end, we relax the Manhattan constraint to a simple planar one. That is, we assume only that our scene is piecewise-planar.
We use a convolutional neural network (CNN) to predict depth and surface normal estimates per pixel as well as a map of the plane boundaries in the image. We enforce the planar assumption by using a plane-aware loss function that modifies each pixel’s contribution to the learning based on its principal curvature. Using our network outputs, we then generate high quality 3D planar models of the scene as seen in Figure 1.
We summarize our contributions in this paper as follows:
- We propose a plane-aware cost function to estimate depth, surface normals, and plane boundaries from a single image.
- We demonstrate that the inclusion of geodesic coordinate maps as extra inputs to the network improves surface normal prediction from omnidirectional images.
- We qualitatively show that our network can be used to generate a 3D planar model from a single image.
2 Related Work
2.1 Single-view estimation
There is a significant body of existing research on the task of monocular depth estimation from perspective images. One of the first papers to report success in this task was from Saxena et al. [21], who use a Markov Random Field to infer depth from a blend of local and global image features. With the advent of practical deep learning, more recent methods have focused on applying CNNs to estimate depth. Eigen et al. [7] present a CNN for depth estimation that uses multi-scale predictions to provide coarse and fine supervision for the depth predictions. Eigen and Fergus [6] built on that work to simultaneously generate surface normal predictions and semantic labels as well. Dharmasiri et al. [4] follow a similar network design but replace semantic label prediction with principal curvature prediction. Our network architecture has some commonalities with the aforementioned, primarily in our use of multi-scale predictions and similar prediction modalities. However, our goal is more aligned with that of Qi et al. [19], who propose a method for enforcing geometric consistency in the network outputs. In that work, the authors use the depth predictions to refine normal predictions and vice versa. In our case, we use a plane-aware loss to make our network predictions geometrically consistent. Our objective is also somewhat similar to that of Liu et al. [16], who predict a planar segmentation of the scene. However, they rely on a separate plane classification branch in their network and are limited to a fixed number of planes. We use a parametric definition of a plane derived from the principal curvature map and are thus unlimited in the number of planes we can predict.
There have been other recent works in monocular depth estimation that, while interesting and useful, are not currently feasible for our task. Godard et al. [10] use stereo image pairs to train a model for monocular depth estimation using an image reconstruction loss. In our case, we only have access to monocular images. Li and Snavely [15] train a network on a dataset built from large-scale, unordered image collections. Alas, there is not yet such a repository for omnidirectional images.
2.2 Omnidirectional images
The primary distinction between our work and those presented above is the mode of our input data. Most research in monocular depth estimation has relied thus far on perspective image projections. We instead operate on equirectangular image projections, which image a spherical capture onto a plane. This representation carries high levels of distortion. There is an active branch of research in developing solutions to account for these factors. Su and Grauman [25] propose a transfer learning approach to train networks to operate on equirectangular projections. Using an existing perspective-projection-trained network as the target, they train an equirectangular network with a learnable adaptive convolutional kernel to match the outputs. Tateno et al. [26] present a distortion-aware convolutional kernel that convolves over the sampling grid transformed by a distortion function. In this way, the network can be trained on perspective images and still perform effectively on spherical projections. Coors et al. [3] independently derive the same operation and show that it can be highly effective for object detection on omnidirectional images. Both methods train on perspective images and evaluate on spherical projections. Another promising method is the spherical convolution derived by Cohen et al. Spherical convolutions address the nuances of spherical projections by filtering rotations of the feature maps rather than translations. Most recently, Eder and Frahm [5] demonstrate that resampling spherical images to a subdivided icosahedron substantially improves the performance of CNNs trained on spherical data. In our work we do not directly address the problem of specialized convolutions. Rather, we explore the application of omnidirectional image inference for the task of indoor 3D modeling. Our work is most similar to that of Zioulis et al. [29], who estimate depth directly from omnidirectional images.
There is also a growing body of work using panorama images to generate indoor scene layouts. Xu et al. [28] fuse object detection and 3D geometry estimation using Bayesian inference to generate 3D room layouts from a single image. Rather than dividing the problem into sub-tasks, Lee et al. [14] use an end-to-end CNN to generate a 3D room layout from a single perspective image. Zou et al. [30] improve this technique by incorporating vanishing point alignment and predicting additional layout elements in their model. All of the aforementioned layout generation models assume a Manhattan World in their predictions. While this may be useful for common room shapes, it is too simple a prior for general indoor scene modeling. Our work focuses on a more complete indoor 3D model, so we relax this Manhattan constraint to a planar one.
3 Plane-Aware Estimation
We present a CNN that estimates dense depth and surface normal predictions as well as a planar boundary map from a single image. To learn depth and normal prediction, we supervise training with ground truth values. Observing that a non-zero principal curvature indicates the presence of a planar boundary, we supervise training for the planar boundary map using the norm of the principal curvature.
3.1 Network architectures
We analyze our plane-aware loss function using a network based on the RectNet architecture used by Zioulis et al. [29]. Our network uses the same encoder-decoder structure with rectangular filter banks on the input layers, but with two decoder branches: one for depth predictions and one for joint surface normal and plane boundary map prediction. We also include skip connections from encoder to decoder layers as in U-Net from Ronneberger et al. [20], as we observe that they improve performance. Our network takes a five-channel input: an RGB equirectangular projection and the associated geodesic map containing latitude and longitude coordinates for each pixel. This design is based on the observation that distortion in equirectangular projections is location dependent. Given that these images are indexed by their geodesic coordinates, given in latitude and longitude, we provide the network with location information in the form of a geodesic coordinate map of the image. We find that this provides a significant boost in performance for surface normal prediction in particular and discuss it in more detail in Section 4.4. Figure 2 provides a detailed overview of our network.
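As an illustration of the five-channel input described above, the following is a minimal sketch of how a geodesic coordinate map could be built and stacked onto an RGB equirectangular image. The function names, the pixel-center sampling, and the sign and range conventions (latitude from +π/2 at the top row to -π/2 at the bottom, longitude from -π to +π) are assumptions made for this example, not details taken from the network implementation.

```python
import numpy as np

def geodesic_map(height, width):
    """Return a (2, H, W) array holding per-pixel latitude and longitude in radians."""
    # Assumed convention: sample at pixel centers, latitude decreasing from
    # +pi/2 (top) to -pi/2 (bottom), longitude increasing from -pi to +pi.
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lon_grid, lat_grid = np.meshgrid(lon, lat)  # both have shape (H, W)
    return np.stack([lat_grid, lon_grid], axis=0)

def make_network_input(rgb):
    """Concatenate a (3, H, W) RGB image with its geodesic map into (5, H, W)."""
    _, h, w = rgb.shape
    return np.concatenate([rgb, geodesic_map(h, w)], axis=0)

# Example: a blank 4 x 8 equirectangular image becomes a 5-channel tensor.
rgb = np.zeros((3, 4, 8), dtype=np.float64)
x = make_network_input(rgb)
```

Channels 0 to 2 carry color, channel 3 carries latitude (constant along each row), and channel 4 carries longitude (constant along each column).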
Recall our premise that each scene is piecewise-planar. This assumption provides a few constraints. First, each scene should be segmented by some web of edges that define the boundaries between each plane. Second, each planar region should have a constant depth gradient and all pixels within should have the same surface normal. Furthermore, the principal curvature, which is effectively the second derivative of depth, should be zero. Lastly, the depth and normal predictions within a planar region should satisfy the plane equation $n^\top X = d$, where $n$ is the normal, $X$ is the 3D point, and $d$ is the plane’s distance from the origin.
We enforce these constraints through a multi-scale, multi-task loss function. We compute individual losses over the depth, surface normals, and plane boundary map predictions as well as a loss over the plane distance prediction for each pixel. This last term is computed as a function of both the depth and normal predictions, which encourages planar consistency. Each of the losses is also weighted using a plane-aware function. For the depth, curvature, and plane distance losses, we use the reverse Huber, or BerHu, loss proposed by Laina et al. [13]. This loss is given as
$$B(x) = \begin{cases} |x| & \text{if } |x| \le c, \\ \dfrac{x^2 + c^2}{2c} & \text{otherwise,} \end{cases}$$
where we adjust $c$ on a per-batch basis to be 20% of the max per-batch error as in [13]. Our plane-aware function weights the impact of each pixel to the loss by the norm of its ground truth principal curvature.
As curvature is zero on a planar surface, this term gives full weight to all pixels that lie on planes. However, pixels that fall along sharp plane boundaries and thus have higher curvatures will have their contribution to the loss down-weighted. This is similar to the texture-edge-aware loss weighting used by Godard , except that we use the curvature values instead of intensity gradients. Our formulation makes more sense for our task, given that we are interested in planar boundaries rather than texture ones.
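To make the weighting concrete, the sketch below combines the BerHu loss with a curvature-based per-pixel weight. The BerHu definition and the 20% threshold follow the text. The exponential form of `plane_aware_weight` is an assumption borrowed from the edge-aware weighting that the text compares against, and the mean reduction is likewise assumed, so this should be read as one plausible instance rather than the exact function used for training.

```python
import numpy as np

def berhu(residual, c):
    """Reverse Huber (BerHu): L1 for |x| <= c, scaled L2 for |x| > c."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= c, abs_r, (abs_r ** 2 + c ** 2) / (2.0 * c))

def plane_aware_weight(curvature_norm):
    """ASSUMED weighting: 1 where curvature is zero, decaying toward 0 at sharp edges."""
    return np.exp(-curvature_norm)

def weighted_berhu_loss(pred, target, curvature_norm):
    """Mean plane-aware BerHu loss over a batch (mean reduction is an assumption)."""
    residual = pred - target
    # Threshold c is 20% of the maximum absolute error in the batch;
    # the small floor avoids dividing by zero when the prediction is exact.
    c = max(0.2 * float(np.max(np.abs(residual))), 1e-12)
    per_pixel = plane_aware_weight(curvature_norm) * berhu(residual, c)
    return float(np.mean(per_pixel))
```

With zero curvature everywhere the function reduces to a plain BerHu loss, while a pixel with large ground truth curvature contributes almost nothing.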
Each component of the loss is given below. The subscript denotes the -th pixel in the image; is depth, is normal, and is curvature.
where is the relevant output map and the asterisks denote ground truth values. In Equation (6), where is the directional unit vector from the camera center to pixel on the sphere, i.e. is the back-projected 3D point.
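The plane distance term can be illustrated with a short sketch that back-projects each equirectangular pixel along its unit ray and evaluates the plane equation per pixel. It assumes the plane equation is taken in the form n · X = d, a camera frame with x right, y up, and z forward, and the same pixel-center latitude and longitude convention as elsewhere in these examples. All function names are hypothetical.

```python
import numpy as np

def unit_rays(height, width):
    """Unit direction (3, H, W) from the camera center to each equirectangular pixel."""
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Assumed axes: x right, y up, z forward.
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=0)

def plane_distance(depth, normals):
    """Per-pixel plane distance n . X, where X = depth * ray is the back-projected point."""
    rays = unit_rays(*depth.shape)
    points = depth[None] * rays              # (3, H, W) back-projected 3D points
    return np.sum(normals * points, axis=0)  # (H, W)
```

For pixels that truly lie on one plane, the returned map is constant within that region, which is the consistency that the plane distance loss encourages.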
It is worth noting that other single-view depth estimation papers typically include a penalty on the gradient of the depth or disparity prediction to account for homogeneous regions where depth may be ambiguous [9, 29]. However, this term is known to lead to over-smoothing, especially for surfaces that are not fronto-planar to the camera. In the case of omnidirectional images, where depth is defined as the distance from a 3D point to the camera center (rather than to the image plane), this gradient penalty would encourage the prediction of a circular scene wherein each point is locally fronto-planar to the camera. Thus, we do not penalize the depth gradient at all. In the planar boundary map prediction, however, we do include a penalty to encourage sparsity in the edge predictions.
Our total loss is thus the sum of all of these terms at two scales weighted by some hyper-parameters , , , and :
We empirically set the hyper-parameters to balance the contribution of each component loss. In our reported results, , , , , , , , and . The penalty coefficient in Equation (5) is always . Nonetheless, we observed that small changes to these hyper-parameters have negligible effects on the network training. Note that we do not use any loss for planar boundary map prediction for the down-scaled prediction () as we observed that it made no impact in the final plane boundary map. We train the network for epochs with a batch size of and use the Adam optimizer  with an initial learning rate of decayed by half every epochs.
Table 1: Depth estimation results for the following loss variants: L2 + smoothing, L2 instead of BerHu, no curvature penalty, no plane loss.
In this section we evaluate our proposed plane-aware depth and normal estimation. First, we demonstrate the benefit of our plane-aware loss through comparison to a baseline, the loss used by Zioulis et al. [29], as well as in a series of ablation experiments. Second, we demonstrate the importance of predicting surface normals rather than relying on derived normals from predicted depth. We then examine the effect of including coordinate priors as inputs to the network. Finally, we qualitatively show how we can leverage the predicted plane boundary map to create 3D reconstructions in Section 5.
We train and evaluate our method using the Scene Understanding and Modeling (SUMO) dataset [27], a collection of 58,631 computer generated omnidirectional images of indoor scenes derived from SunCG [23]. As released, the SUMO dataset contains RGB-D cube map images with a cube face dimension of pixels. To prepare this data for our experiments, we resample the cube maps to pixel equirectangular images using bilinear interpolation for color information and nearest-neighbor interpolation for depth. For the purposes of surface normal and principal curvature prediction, we augment the dataset with normal and curvature maps for each image as well. We derive the ground truth normal maps from the provided images by first resampling them to the vertices of an icosahedral triangular mesh as in [5], scaling each vertex by the ground truth depth, computing the surface normal for each face, and rendering the normal maps back into an equirectangular projection. For the ground truth planar boundary maps, we use the norm of the principal curvature. The curvature maps are derived as in [24] using the eigenvalues of the matrix given by:
where and are vectors that, with the surface normal , form an orthonormal basis at a given point . , , and are defined by the derivatives of the surface normal at that point:
Table 2: Surface normal estimation results for the following variants: plane-aware + lat./lon., derived from depth, no curvature penalty, no plane loss.
4.2 Depth estimation
We evaluate the depth estimation task using the standard set of metrics defined in Eigen et al. [7], shown in Table 1. Because depth estimates are subject to the arbitrary scale of the training distribution, we use median scaling to normalize the depth distributions during evaluation. The numbers we report are based on pixels whose ground truth depth falls within a valid range. We set the upper bound of this range to be 4.375 standard deviations above the mean of the training set, deriving this value from an analysis of the evaluation threshold used by Zioulis et al. [29]. To evaluate our proposed loss, we compare to network training under the loss used by Zioulis et al. [29] as a baseline. This loss is simply an $L_2$ minimization with a gradient penalty at two scales, as given by Equation (8):
The results in Table 1 show that our loss formulation outperforms the baseline. We note that training on synthetic images leads to high performance for the baseline as well, so we also look to a qualitative analysis to reinforce the effect of our plane-aware formulation. Figure 4 shows a selection of network outputs comparing our loss to the baseline. Observe the finer-grained depth estimate of the lounge chair in the center of row (1) and the shelving and counters in rows (2) and (3). We find that training with our proposed plane-aware loss results in sharper details in the resulting depth maps. We posit that this effect is due to extra supervision provided by the ground truth curvature penalty, which limits smoothing on geometric edges.
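The evaluation protocol for depth can be sketched as follows: restrict to pixels with valid ground truth depth, align the prediction to the ground truth by the ratio of medians, and compute the usual error and accuracy measures. The metric set shown (absolute and squared relative error, linear and log RMSE, and threshold accuracies at 1.25, 1.25², and 1.25³) is the standard one from the monocular depth literature. The validity rule `0 < gt <= max_depth` and all names are assumptions for this example.

```python
import numpy as np

def median_scale(pred, gt):
    """Rescale predictions so that their median matches the ground truth median."""
    return pred * (np.median(gt) / np.median(pred))

def depth_metrics(pred, gt, max_depth):
    """Standard depth metrics over pixels with 0 < gt <= max_depth (assumed range)."""
    valid = (gt > 0) & (gt <= max_depth)
    g = gt[valid]
    p = median_scale(pred[valid], g)
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": float(np.mean(np.abs(p - g) / g)),
        "sq_rel": float(np.mean((p - g) ** 2 / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Because of the median alignment, a prediction that differs from the ground truth only by a global scale factor scores perfectly, which is the intent of the normalization.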
We perform an ablation study on elements of our loss function, also listed in Table 1. Among other things, these results demonstrate that our improvement is not simply due to the use of the BerHu loss. We see a moderate impact from both the planar-consistency regularizer and the curvature penalty. Interestingly, we found that removing the associated curvature prediction task altogether affected neither the depth nor the normal prediction accuracy. However, we keep it in the network as it plays a key role in generating the 3D reconstructions, discussed in Section 5.
4.3 Surface normal estimation
For surface normal estimates, we examine pixels that fall within the same valid ground truth depth range. We evaluate the average angular error per pixel as well as the percentage of pixels whose angular error falls within a threshold of the ground truth. Table 2 shows that our loss formulation is useful for improving surface normal prediction. As a baseline we use the surface normals derived from the depth predictions. These results indicate that derived normals are no replacement for an independent surface normal prediction. Our predicted normals are much less susceptible to noisy depth values than their derived counterparts. Figure 5 shows a qualitative comparison of our predicted results to the derived normals. When the depth estimation is fairly accurate, the derived normals are only slightly noisier than the prediction, as in row (1). However, in cases where the depth predictions are not as high quality, the predicted normals are often still very good, while the derived normals degrade significantly, as in rows (2) and (3). This effect is why we rely on the independent surface normal prediction branch when generating a 3D reconstruction.
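The two surface normal measures mentioned above, mean angular error and the fraction of pixels under an angular threshold, can be computed as in this sketch. The threshold values of 11.25°, 22.5°, and 30° are a common choice in the surface normal literature and are assumed here, since the text does not list the thresholds. Function names are hypothetical.

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Per-pixel angle in degrees between two (3, H, W) normal maps."""
    pred = pred / np.linalg.norm(pred, axis=0, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=0, keepdims=True)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(pred * gt, axis=0), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def normal_metrics(pred, gt, valid, thresholds=(11.25, 22.5, 30.0)):
    """Mean angular error and fraction of valid pixels under each threshold."""
    err = angular_error_deg(pred, gt)[valid]
    metrics = {"mean_deg": float(err.mean())}
    for t in thresholds:
        metrics[f"within_{t}"] = float(np.mean(err < t))
    return metrics
```

The `valid` mask would be the same depth-range mask used for the depth metrics, so that both tasks are scored on the same pixels.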
4.4 Geodesic map inputs
We also delve deeper into the impact of the latitude and longitude map priors in the network. Fixing all other aspects of the network, we evaluate the performance of our network on the SUMO dataset with and without the geodesic map channels. Consistent with our expectations, the results in the bottom block of Table 2 suggest that the geodesic map inputs have a positive impact in surface normal estimation. We surmise that the geodesic map helps the network disambiguate the orientation of the surface normal. It is notable that without the geodesic map, we see errors occur at the peak point of barreling on planes in the equirectangular projection as in the top-left image in Figure 6. Interestingly, longitude provides the most important information, which aligns with what we observe in Figure 6: predictions changing abruptly along the rows.
Because the equirectangular grid is indexed by spherical coordinates rather than a Cartesian grid, the distance between adjacent pixels is row-dependent as well. Adjacent pixels nearer to the top and bottom of the image actually lie closer together on the sphere than adjacent pixels near the middle of the image do. This sampling scheme is problematic for CNNs because the convolution operation’s translation equivariance inherently assumes an even sampling. Somehow the network needs to learn to map the geodesic sampling to a Cartesian one. Our experiments suggest that including the geodesic maps as extra input channels is a useful way to pass this information to the network. These findings line up with the results of Liu et al. [17], who show that incorporating pixel location information can help a network learn some degree of translation dependence, which is what we also need to achieve.
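The row dependence of the sampling can be quantified directly. On a unit sphere, horizontally adjacent pixels in a row at latitude φ are separated by an arc of length (2π / W) cos φ along that circle of latitude, so the spacing shrinks toward the poles. The sketch below evaluates this per row, assuming pixel-center sampling; the function name is hypothetical.

```python
import numpy as np

def horizontal_spacing(height, width):
    """Arc length on the unit sphere between horizontally adjacent pixels, per row."""
    # Latitude of each row center, from near +pi/2 (top) to near -pi/2 (bottom).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    # Each row spans a circle of circumference 2*pi*cos(lat), split into `width` steps.
    return (2.0 * np.pi / width) * np.cos(lat)

spacing = horizontal_spacing(4, 8)
```

For this tiny 4 x 8 grid, the two middle rows have a spacing more than twice that of the top and bottom rows, which is the uneven sampling that a standard convolution has no way to know about without extra location information.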
5 3D Planar Model Reconstruction
An important consequence of our planarity assumption is that the network provides all of the information necessary to detect and segment planes in the input images. By defining these planes, we can generate “pop-up” models from a single image, as proposed by Hoiem et al. [11]. Indoor omnidirectional images are uniquely suited to produce these types of reconstructions as they are capable of capturing entire rooms in a single image.
To generate these reconstructions, we first isolate the sharpest edges in the planar boundary map using Otsu thresholding [18] and then identify each connected component in the resulting segmentation. An example of the result of this plane segmentation is shown in Figure 7. Thanks to the quality of our plane boundary predictions, this segmentation process requires no threshold tuning. To turn this segmentation into a 3D planar model, we first compute the median normal within each segmented plane. Then, we estimate the distance parameter of the plane equation in each segment using a 1-parameter RANSAC [8] with a final least-squares refinement over the inliers. Lastly, we project each pixel onto its associated plane. The model is finally “popped-up” in 3D by back-projecting the point cloud according to these new depths. We mesh the points by resampling to the vertices of an icosahedral triangular grid and scaling the vertices according to the adjusted depths, resulting in the models shown in Figure 8.
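The per-segment plane fitting step can be sketched as below. Given the back-projected points of one segment and its median normal, every point proposes a candidate distance n · X, so a single sample defines the 1-parameter model. The least-squares refinement of that single parameter over the inliers is simply the mean of their candidate distances. The inlier threshold, iteration count, seed, and function names are hypothetical, the plane equation is assumed to be n · X = d, and the Otsu thresholding and connected-component steps are not shown.

```python
import numpy as np

def fit_plane_distance(points, normal, threshold=0.05, iters=100, seed=0):
    """1-parameter RANSAC for d in n . X = d, with least-squares refinement."""
    rng = np.random.default_rng(seed)
    dists = points @ normal  # (N,) candidate distance proposed by each point
    best_inliers = None
    for _ in range(iters):
        d = dists[rng.integers(len(dists))]  # one sample fixes the 1-D model
        inliers = np.abs(dists - d) < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Minimizing sum((n . X_i - d)^2) over the inliers gives their mean.
    return float(dists[best_inliers].mean())

def project_depth_to_plane(rays, normal, d):
    """Depth along each unit ray (N, 3) that places the point on the plane n . X = d."""
    return d / (rays @ normal)
```

A usage pattern would be to call `fit_plane_distance` once per connected component and then `project_depth_to_plane` on that component's rays to obtain the adjusted depths used for the pop-up model. Rays nearly parallel to the plane make the denominator small, so a practical version would need to guard against that case.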
Reiterating the importance of surface normal prediction, we found incorporating normal information to be vital to our RANSAC routine. Estimating planes solely from the depth estimates gives a much noisier reconstruction. Furthermore, we observe that having plane information allows us to produce higher quality 3D models than those generated from depth estimates alone. Figure 9 compares our method, which leverages depth, normals, and boundary information, to the baseline network, which only estimates depth. Where the latter model suffers from smoothed edges, ours is able to produce sharp plane boundaries.
The significant drawback of monocular depth estimation is that the lack of any regularization over the estimates leads to fairly noisy predictions. This stands in contrast to stereo methods (and even pseudo-stereo methods like Godard et al. [10]) in which a second image can be used to ensure consistency in the depth map. However, with our planar assumption, we can resolve some of the depth ambiguity while staying purely monocular. Moreover, the planar constraint removes the dependence on texture to recover depth. Although making assumptions about the scene may be impractical for specific tasks like autonomous vehicle depth estimation [22], Figure 9 demonstrates that a simple planarity assumption can be leveraged with great effect for indoor 3D modeling.
We have presented a CNN capable of predicting depth, surface normals, and planar boundaries from a single indoor image. Using a novel plane-aware loss function, we have achieved state-of-the-art results for these tasks. We have also demonstrated that the inclusion of a geodesic map can improve surface normal estimates for omnidirectional images. Lastly, we have shown that our network provides all the information necessary to produce a 3D planar model of the scene. Looking ahead, we see an emerging opportunity to utilize this type of all-in-one prediction from omnidirectional images to bootstrap indoor 3D reconstruction.
[1] T. Cohen, M. Geiger, J. Köhler, and M. Welling. Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893, 2017.
[2] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
[3] B. Coors, A. P. Condurache, and A. Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 518–533, 2018.
[4] T. Dharmasiri, A. Spek, and T. Drummond. Joint prediction of depths, normals and surface curvature from rgb images using cnns. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 1505–1512. IEEE, 2017.
[5] M. Eder and J.-M. Frahm. Convolutions on spherical images. arXiv preprint arXiv:1905.08409, 2019.
[6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.
[8] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[9] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
[10] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
[11] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In ACM transactions on graphics (TOG), volume 24, pages 577–584. ACM, 2005.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[14] C.-Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. Roomnet: End-to-end room layout estimation. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4875–4884. IEEE, 2017.
[15] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[16] C. Liu, J. Yang, D. Ceylan, E. Yumer, and Y. Furukawa. Planenet: Piece-wise planar reconstruction from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2579–2588, 2018.
[17] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. arXiv preprint arXiv:1807.03247, 2018.
[18] N. Otsu. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1):62–66, 1979.
[19] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
[20] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[21] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.
[22] N. Smolyanskiy, A. Kamenev, and S. Birchfield. On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach. arXiv preprint arXiv:1803.09719, 2018.
[23] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] A. Spek, W. H. Li, and T. Drummond. A fast method for computing principal curvatures from range images. arXiv preprint arXiv:1707.00385, 2017.
[25] Y.-C. Su and K. Grauman. Learning spherical convolution for fast features from 360 imagery. In Advances in Neural Information Processing Systems, pages 529–539, 2017.
[26] K. Tateno, N. Navab, and F. Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 707–722, 2018.
[27] L. Tchapmi and D. Huber. The sumo challenge.
[28] J. Xu, B. Stenger, T. Kerola, and T. Tung. Pano2cad: Room layout from a single panorama image. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 354–362. IEEE, 2017.
[29] N. Zioulis, A. Karakottas, D. Zarpalas, and P. Daras. Omnidepth: Dense depth estimation for indoors spherical panoramas. arXiv preprint arXiv:1807.09620, 2018.
[30] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.
7 Extended Results
In this section, we provide a further qualitative review of our work. Figure 10 shows more examples of our network’s depth estimates compared to our baseline. Figure 11 provides more cases to justify the prediction of normals independently from depth. Finally, Figure 12 shows more comparisons of pop-up reconstructions along with examples of the plane boundary predictions and segmentations.