Pano Popups: Indoor 3D Reconstruction with a Plane-Aware Network

07/01/2019 ∙ by Marc Eder, et al. ∙ Zillow ∙ University of North Carolina at Chapel Hill ∙ Wormpex Technology

In this work we present a method to train a plane-aware convolutional neural network for dense depth and surface normal estimation as well as plane boundaries from a single indoor image. Using our proposed loss function, our network outperforms existing methods for single-view, indoor, omnidirectional depth estimation and provides an initial benchmark for surface normal prediction from images. Our improvements are due to the use of a novel plane-aware loss that leverages principal curvature as an indicator of planar boundaries. We also show that including geodesic coordinate maps as network priors provides a significant boost in surface normal prediction accuracy. Finally, we demonstrate how we can combine our network's outputs to generate high quality 3D "pop-up" models of indoor scenes.




1 Introduction

Omnidirectional imaging is currently experiencing a surge in popularity, thanks to the advent of interactive panorama photo sharing on social media platforms, the rise of small, affordable cameras like the Ricoh Theta and Samsung Gear360, and the host of potential applications that arise from capturing wide field of view (FoV) in a single frame. At the same time, deep learning has never been a more useful tool for solving computer vision tasks from object recognition to 3D reconstruction. In order to fully utilize this rising form of media, we must extend existing deep learning methods to the omnidirectional domain. Unfortunately, this is not necessarily a trivial task.

Figure 1: Visualization of our paper’s contributions. Given an RGB omnidirectional image (top), we predict depth, surface normals, and plane boundary maps (middle) with state-of-the-art accuracy. Then we show we can use this information to achieve a planar segmentation and 3D reconstruction of the input image (bottom).

Due to the radically different camera models, deep networks trained on perspective images do not transfer well to omnidirectional images. Omnidirectional images replace the concept of the image plane with that of the image sphere. Yet because we require a 2D planar representation of the image, omnidirectional cameras typically provide outputs as 360° FoV equirectangular projections. This representation of the spherical image, while compact, suffers from significant horizontal distortion, especially near the poles.

While there have been a number of efforts to handle the difficulties of equirectangular projections [1, 2, 3, 5, 25, 26], we are interested in exploring their possible uses. There is excitement over the range of applications of omnidirectional imaging from head-mounted displays to medical scopes to autonomous vehicles. In this paper, we target indoor scene modeling.

Perspective image methods are impeded by a small FoV that is more likely to be limited by featureless, homogeneous regions in an indoor scene. With the larger FoV of omnidirectional images, these homogeneous regions can be reasoned about in the larger context of the scene. Our goal is to predict the dense depth and surface normals for a piecewise-planar reconstruction of the scene. This objective differs from much of the existing work that uses omnidirectional images for indoor 3D modeling. Those methods, such as RoomNet [14] and LayoutNet [30], aim to generate a simple model of the scene by leveraging a Manhattan World constraint to estimate the dominant planes. That type of model is useful for determining the shapes of rooms and floor-plans of buildings, but not for modeling the objects that comprise the captured scene. While we, too, are essentially estimating planes in the scene, we aim for a more fine-grained model in order to better capture these important details. To this end, we relax the Manhattan constraint to a simple planar one. That is, we assume only that our scene is piecewise-planar.

We use a convolutional neural network (CNN) to predict depth and surface normal estimates per pixel as well as a map of the plane boundaries in the image. We enforce the planar assumption by using a plane-aware loss function that modifies each pixel’s contribution to the learning based on its principal curvature. Using our network outputs, we then generate high quality 3D planar models of the scene as seen in Figure 1.

We summarize our contributions in this paper as follows:

  • We propose a plane-aware cost function to estimate depth, surface normals, and plane boundaries from a single image.

  • We demonstrate that the inclusion of geodesic coordinate maps as extra inputs to the network improves surface normal prediction from omnidirectional images.

  • We qualitatively show that our network can be used to generate a 3D planar model from a single image.

Figure 2: Overview of our network architecture. The vectors next to each layer are for each. Above each layer is the kernel size; ‘T’ indicates transposed convolution, ‘s#’ indicates stride, and ‘d#’ indicates dilation. The network follows an encoder-decoder model similar to that of Zioulis et al. [29]. There are two decoder branches: one for depth, and the other for normals and curvature. The downsampled predictions are upsampled and used for scale prediction in both branches.

2 Related Work

2.1 Single-view estimation

There is a significant body of existing research on the task of monocular depth estimation from perspective images. One of the first papers to report success in this task was from Saxena [21], who use a Markov Random Field to infer depth from a blend of local and global image features. With the advent of practical deep learning, more recent methods have focused on applying CNNs to estimate depth. Eigen [7] present a CNN for depth estimation that uses multi-scale predictions to provide coarse and fine supervision for the depth predictions. Eigen [6] built on that work to simultaneously generate surface normal predictions and semantic labels as well. Dharmasiri [4] follow a similar network design but replace semantic label prediction with principal curvature prediction. Our network architecture has some commonalities with the aforementioned, primarily in our use of multi-scale predictions and similar prediction modalities. However, our goal is more aligned with that of Qi [19] who propose a method for enforcing geometric consistency in the network outputs. In that work, the authors use the depth predictions to refine normal predictions and vice versa. In our case, we use a plane-aware loss to make our network predictions geometrically consistent. Our objective is also somewhat similar to that of Liu [16] who predict a planar segmentation of the scene. However, they rely on a separate plane classification branch in their network and are limited to a fixed number of planes. We use a parametric definition of a plane derived from the principal curvature map and are thus unlimited in the number of planes we can predict.

There have been other recent works in monocular depth estimation that, while interesting and useful, are not currently feasible for our task. Godard [10] use stereo image pairs to train a model for monocular depth estimation using an image reconstruction loss. In our case, we only have access to monocular images. Li and Snavely [15] train a network on a dataset built from large-scale, unordered image collections. Alas, there is not yet such a repository for omnidirectional images.

2.2 Omnidirectional images

The primary distinction between our work and those presented above is the mode of our input data. Most research in monocular depth estimation has relied thus far on perspective image projections. We instead operate on equirectangular image projections, which image a spherical capture onto a plane. This representation carries high levels of distortion. There is an active branch of research in developing solutions to account for these factors. Su and Grauman [25] propose a transfer learning approach to train networks to operate on equirectangular projections. Using an existing perspective-projection-trained network as the target, they train an equirectangular network with a learnable adaptive convolutional kernel to match the outputs. Tateno et al. [26] present a distortion-aware convolutional kernel that convolves over the sampling grid transformed by a distortion function. In this way, the network can be trained on perspective images and still perform effectively on spherical projections. Coors et al. [3] independently derive the same operation and show that it can be highly effective for object detection on omnidirectional images. Both methods train on perspective images and evaluate on spherical projections. Another promising method is the spherical convolution derived by Cohen et al. [1, 2]. Spherical convolutions address the nuances of spherical projections by filtering rotations of the feature maps rather than translations. Most recently, Eder and Frahm [5] demonstrate that resampling spherical images to a subdivided icosahedron substantially improves the performance of CNNs trained on spherical data. In our work we do not directly address the problem of specialized convolutions. Rather, we explore the application of omnidirectional image inference for the task of indoor 3D modeling. Our work is most similar to that of Zioulis et al. [29], who estimate depth directly from omnidirectional images.

There is also a growing body of work using panorama images to generate indoor scene layouts. Xu et al. [28] fuse object detection and 3D geometry estimation using Bayesian inference to generate 3D room layouts from a single image. Rather than dividing the problem into sub-tasks, Lee et al. [14] use an end-to-end CNN to generate a 3D room layout from a single perspective image. Zou et al. [30] improve this technique by incorporating vanishing point alignment and predicting additional layout elements in their model. All of the aforementioned layout generation models assume a Manhattan World in their predictions. While this may be useful for common room shapes, it is too simple a prior for general indoor scene modeling. Our work focuses on a more complete indoor 3D model, so we relax this Manhattan constraint to a planar one.

3 Plane-Aware Estimation

We present a CNN that estimates dense depth and surface normal predictions as well as a planar boundary map from a single image. To learn depth and normal prediction, we supervise training with ground truth values. Observing that a non-zero principal curvature indicates the presence of a planar boundary, we supervise training for the planar boundary map using the norm of the principal curvature.

3.1 Network architectures

We analyze our plane-aware loss function using a network based on the RectNet architecture used by Zioulis et al. [29]. Our network uses the same encoder-decoder structure with rectangular filter banks on the input layers, but with two decoder branches: one for depth predictions and one for joint surface normal and plane boundary map prediction. We also include skip connections from encoder to decoder layers, as in the U-Net of Ronneberger et al. [20], because we observe that they improve performance. Our network takes a five-channel input: an RGB equirectangular projection and the associated geodesic map containing latitude and longitude coordinates for each pixel. This design is based on the observation that distortion in equirectangular projections is location dependent. Given that these images are indexed by their geodesic coordinates, given in latitude and longitude, we provide the network with location information in the form of a geodesic coordinate map of the image. We find that this provides a significant boost in performance for surface normal prediction in particular and discuss it in more detail in Section 4.4. Figure 2 provides a detailed overview of our network.
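As an illustration of the five-channel input, the following is a minimal sketch of how a geodesic coordinate map could be built and stacked onto an RGB image. The function names, the pixel-center sampling, and the sign conventions (latitude from +π/2 at the top row to -π/2 at the bottom, longitude from -π to π) are assumptions made for this example and are not taken from the paper.

```python
import math

def geodesic_map(height, width):
    """Return per-pixel latitude and longitude grids (radians) for an
    equirectangular image. Conventions here are assumed, not the paper's."""
    lat, lon = [], []
    for row in range(height):
        # Sample at pixel centers; row 0 is the row nearest the north pole.
        phi = math.pi / 2 - (row + 0.5) * math.pi / height
        lat.append([phi] * width)
        lon.append([-math.pi + (col + 0.5) * 2 * math.pi / width
                    for col in range(width)])
    return lat, lon

def stack_input(rgb, lat, lon):
    """Append latitude and longitude as channels 4 and 5 of an H x W x 3 image."""
    return [[list(rgb[r][c]) + [lat[r][c], lon[r][c]]
             for c in range(len(rgb[0]))]
            for r in range(len(rgb))]
```

Every row of the latitude channel is constant and every column of the longitude channel is constant, so the two extra channels simply tell each convolution where on the sphere it is operating.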

Figure 3: Examples of our network predictions on the SUMO dataset [27]. Panels: (a) RGB input, (b) predicted depth, (c) ground truth depth, (d) predicted normals, (e) ground truth normals, (f) predicted boundaries. Observe that the plane boundary maps only include the geometric edges in the scene. For example, they do not include the highly textured floor and ceiling in the top row input.

3.2 Training

Recall our premise that each scene is piecewise-planar. This assumption provides a few constraints. First, each scene should be segmented by some web of edges that define the boundaries between each plane. Second, each planar region should have a constant depth gradient and all pixels within should have the same surface normal. Furthermore, the principal curvature, which is effectively the second derivative of depth, should be zero. Lastly, the depth and normal predictions within a planar region should satisfy the plane equation $n^\top X = d$, where $n$ is the normal, $X$ is the 3D point, and $d$ is the plane’s distance from the origin.

We enforce these constraints through a multi-scale, multi-task loss function. We compute individual losses over the depth, surface normals, and plane boundary map predictions as well as a loss over the plane distance prediction for each pixel, denoted as $\mathcal{L}_D$, $\mathcal{L}_N$, $\mathcal{L}_B$, and $\mathcal{L}_P$, respectively. This last term is computed as a function of both the depth and normal predictions, which encourages planar consistency. Each of the losses is also weighted using a plane-aware function $\Phi$. For the depth, curvature, and plane distance losses, we use the reverse Huber, or BerHu, loss proposed by Laina et al. [13]. This loss is given as

$$B(x) = \begin{cases} |x| & \text{if } |x| \le c \\ \dfrac{x^2 + c^2}{2c} & \text{if } |x| > c \end{cases} \qquad (1)$$

where we adjust $c$ on a per-batch basis to be 20% of the max per-batch error as in [13]. Our plane-aware function weights the impact of each pixel on the loss by the norm of its ground truth principal curvature, $\|\kappa^*\|$:


As curvature is zero on a planar surface, this term gives full weight to all pixels that lie on planes. However, pixels that fall along sharp plane boundaries and thus have higher curvatures will have their contribution to the loss down-weighted. This is similar to the texture-edge-aware loss weighting used by Godard [10], except that we use the curvature values instead of intensity gradients. Our formulation makes more sense for our task, given that we are interested in planar boundaries rather than texture ones.
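The weighting scheme described above can be sketched as follows. The BerHu penalty follows its standard definition from [13]. The exact form of the plane-aware weight is not recoverable from this text, so the exponential decay used below (by analogy with the edge-aware weighting of [10]) is an assumption, as are all function names.

```python
import math

def berhu(residual, c):
    """Reverse Huber penalty: L1 for small residuals, scaled L2 beyond c."""
    a = abs(residual)
    if a <= c:
        return a
    return (a * a + c * c) / (2.0 * c)

def plane_aware_weight(curvature_norm):
    """ASSUMED form: weight 1 on planes (zero curvature), decaying
    exponentially as the ground truth curvature norm grows."""
    return math.exp(-curvature_norm)

def weighted_berhu_loss(pred, target, curvature_norm):
    """Mean plane-aware BerHu loss over flat lists of per-pixel values."""
    residuals = [p - t for p, t in zip(pred, target)]
    # Threshold c is 20% of the largest absolute residual in the batch.
    c = 0.2 * max(abs(r) for r in residuals)
    if c == 0.0:
        return 0.0
    total = sum(plane_aware_weight(k) * berhu(r, c)
                for r, k in zip(residuals, curvature_norm))
    return total / len(residuals)
```

Any monotonically decreasing function of the curvature norm that equals one at zero would match the description in the text; the exponential is only one such choice.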

Each component of the loss is given below. The subscript denotes the -th pixel in the image; is depth, is normal, and is curvature.


where is the relevant output map and the asterisks denote ground truth values. In Equation (6), where is the directional unit vector from the camera center to pixel on the sphere, i.e. is the back-projected 3D point.
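To make the plane distance term concrete, the sketch below back-projects an equirectangular pixel using its depth and evaluates the plane equation with a given normal. The y-up axis convention and the sign convention (distance equals the dot product of normal and point) are assumptions for this example; the paper's own conventions are not visible in this text.

```python
import math

def unit_ray(lat, lon):
    """Unit direction from the camera center for a latitude/longitude pair
    in radians (assumed convention: y up, z forward at lon = 0)."""
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

def back_project(depth, lat, lon):
    """3D point at the given distance from the camera center along the ray."""
    return tuple(depth * component for component in unit_ray(lat, lon))

def plane_distance(normal, depth, lat, lon):
    """Plane offset implied by one pixel: dot product of normal and 3D point."""
    point = back_project(depth, lat, lon)
    return sum(n * p for n, p in zip(normal, point))
```

For pixels on the same plane, the value returned by `plane_distance` is constant when depth and normal are both correct, which is what ties the two predictions together.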

It is worth noting that other single-view depth estimation papers typically include a penalty on the gradient of the depth or disparity prediction to account for homogeneous regions where depth may be ambiguous [9, 29]. However, this term is known to lead to over-smoothing, especially for surfaces that are not fronto-planar to the camera. In the case of omnidirectional images, where depth is defined as the distance from a 3D point to the camera center (rather than to the image plane), this gradient penalty would encourage the prediction of a circular scene wherein each point is locally fronto-planar to the camera. Thus, we do not penalize the depth gradient at all. In the planar boundary map prediction, however, we do include an L1 penalty to encourage sparsity in the edge predictions.

Our total loss is thus the sum of all of these terms at two scales weighted by some hyper-parameters , , , and :


We empirically set the hyper-parameters to balance the contribution of each component loss. In our reported results, , , , , , , , and . The penalty coefficient in Equation (5) is always . Nonetheless, we observed that small changes to these hyper-parameters have negligible effects on the network training. Note that we do not use any loss for planar boundary map prediction for the down-scaled prediction () as we observed that it made no impact in the final plane boundary map. We train the network for epochs with a batch size of and use the Adam optimizer [12] with an initial learning rate of decayed by half every epochs.

Loss AbsRel SqRel RMSLin RMSLog
L2 + smoothing [29]
Plane-aware (ours)
L2 instead of BerHu
No curvature penalty
No plane loss
Table 1: Depth estimation results comparing our loss to alternatives and ablated forms. Our baseline, L2 + smoothing, is the approach taken by Zioulis et al. [29]. We also evaluate using an L2 loss in place of BerHu, training the network without the curvature-aware penalty, Eq. (2), removing the planar-consistency regularizer, Eq. (6), and omitting the joint plane boundary map prediction, Eq. (5).

4 Evaluation

In this section we evaluate our proposed plane-aware depth and normal estimation. First, we demonstrate the benefit of our plane-aware loss through comparison to a baseline, the loss used by Zioulis [29], as well as in a series of ablation experiments. Second, we demonstrate the importance of predicting surface normals rather than relying on derived normals from predicted depth. We then examine the effect of including coordinate priors as inputs to the network. Finally, we qualitatively show how we can leverage the predicted plane boundary map to create 3D reconstructions in Section 5.

4.1 Dataset

We train and evaluate our method using the Scene Understanding and Modeling (SUMO) dataset [27], a collection of 58,631 computer generated omnidirectional images of indoor scenes derived from SunCG [23]. As released, the SUMO dataset contains RGB-D cube map images with a cube face dimension of pixels. To prepare this data for our experiments, we resample the cube maps to pixel equirectangular images using bilinear interpolation for color information and nearest-neighbor interpolation for depth. For the purposes of surface normal and principal curvature prediction, we augment the dataset with normal and curvature maps for each image as well. We derive the ground truth normal maps from the provided images by first resampling them to the vertices of an icosahedral triangular mesh as in [5], scaling each vertex by the ground truth depth, computing the surface normal for each face, and rendering the normal maps back into an equirectangular projection. For the ground truth planar boundary maps, we use the norm of the principal curvature. The curvature maps are derived as in [24] using the eigenvalues of the matrix given by:

where and are vectors that, with the surface normal , form an orthonormal basis at a given point . , , and are defined by the derivatives of the surface normal at that point:
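The curvature matrix itself did not survive in this text. As a hedged illustration only, the sketch below assumes it is a symmetric 2x2 matrix with entries `e`, `f`, and `g` (hypothetical names), as is common for curvature estimation from normal derivatives, and computes its eigenvalues in closed form along with their norm.

```python
import math

def principal_curvatures(e, f, g):
    """Eigenvalues of the symmetric 2x2 matrix [[e, f], [f, g]]
    (assumed layout; e, f, g are hypothetical entry names)."""
    mean = 0.5 * (e + g)
    radius = math.hypot(0.5 * (e - g), f)
    return mean + radius, mean - radius

def curvature_norm(e, f, g):
    """Euclidean norm of the two principal curvatures; zero on a plane."""
    k1, k2 = principal_curvatures(e, f, g)
    return math.hypot(k1, k2)
```

On a planar region the normal does not change, so all three entries vanish and the norm is zero, which is the property the boundary map supervision relies on.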

Method Avg. Ang.
Plane-aware + Lat./Lon.
Derived from depth
No curvature penalty
No plane loss
No coordinates
Lat. only
Lon. only
Table 2: Surface normal prediction results. Due to a dearth of existing methods for surface normal prediction on omnidirectional images, we evaluate against surface normals derived from predicted depth and perform ablation studies.

4.2 Depth estimation

We evaluate the depth estimation task using the standard set of metrics defined by Eigen et al. [7], shown in Table 1. Because depth estimates are subject to the arbitrary scale of the training distribution, we use the median scaling technique given by [29] to normalize the depth distributions during evaluation. The numbers we report are based on pixels whose ground truth depth falls within a valid range. We set the upper bound of this range to be 4.375 standard deviations above the mean of the training set, deriving this value from an analysis of the evaluation threshold used by Zioulis et al. [29]. To evaluate our proposed loss, we compare to network training under the loss used by Zioulis et al. [29] as a baseline. This loss is simply an L2 minimization with a gradient penalty at two scales, as given by Equation (8):


The results in Table 1 show that our loss formulation outperforms the baseline. We note that training on synthetic images leads to high performance for the baseline as well, so we also look to a qualitative analysis to reinforce the effect of our plane-aware formulation. Figure 4 shows a selection of network outputs comparing our loss to the baseline. Observe the finer-grained depth estimate of the lounge chair in the center of row (1) and the shelving and counters in rows (2) and (3). We find that training with our proposed plane-aware loss results in sharper details in the resulting depth maps. We posit that this effect is due to extra supervision provided by the ground truth curvature penalty, which limits smoothing on geometric edges.
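For reference, the four depth metrics reported in Table 1 and the median scaling step can be sketched as below, using their commonly used definitions. The function names and the `d_min` and `d_max` arguments are placeholders for this example; the actual valid range used in the evaluation is not recoverable from this text.

```python
import math
import statistics

def median_scale(pred, gt):
    """Rescale predictions so their median matches the ground truth median."""
    scale = statistics.median(gt) / statistics.median(pred)
    return [scale * p for p in pred]

def depth_metrics(pred, gt, d_min, d_max):
    """AbsRel, SqRel, RMSLin, and RMSLog over pixels whose ground truth
    depth lies in [d_min, d_max] (placeholder bounds)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if d_min <= g <= d_max]
    n = len(pairs)
    return {
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,
        "rms_lin": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "rms_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n),
    }
```

A prediction that differs from the ground truth only by a global scale factor scores zero on all four metrics after `median_scale`, which is the purpose of the normalization.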

Figure 4: A qualitative comparison of depth predictions using our plane-aware loss compared to the baseline method based on Zioulis et al. [29]. Columns: (a) RGB input, (b) baseline, (c) ours, (d) ground truth. Notice that our depth estimates are able to capture finer details of the scene.

We perform an ablation study on elements of our loss function, also listed in Table 1. Among other things, these results demonstrate that our improvement is not simply due to the use of the BerHu loss. We see a moderate impact from both the planar-consistency regularizer and the curvature penalty. Interestingly, we found that removing the associated curvature prediction task altogether affected neither the depth nor the normal prediction accuracy. However, we keep it in the network as it plays a key role in generating the 3D reconstructions, discussed in Section 5.

4.3 Surface normal estimation

For surface normal estimates, we examine pixels that fall within the same valid ground truth depth range. We evaluate the average angular error per pixel as well as the percentage of pixels whose angular error falls within a threshold of the ground truth. Table 2 shows that our loss formulation is useful for improving surface normal prediction. As a baseline we use the surface normals derived from the depth predictions. These results indicate that derived normals are no replacement for an independent surface normal prediction. Our predicted normals are much less susceptible to noisy depth values than their derived counterparts. Figure 5 shows a qualitative comparison of our predicted results compared to the derived normals. When the depth estimation is fairly accurate, the derived normals are only slightly noisier than the prediction, as in row (1). However, in cases where the depth predictions are not as high quality, the predicted normals are often still very good, while the derived normals degrade significantly, as in rows (2) and (3). This effect is why we rely on the independent surface normal prediction branch when generating a 3D reconstruction.
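The two surface normal metrics can be sketched as follows. The function names are illustrative, and the default thresholds of 11.25, 22.5, and 30 degrees are a common convention assumed here; the thresholds actually used for Table 2 are not stated in this text.

```python
import math

def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between a predicted and a ground truth normal."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norms = (math.sqrt(sum(a * a for a in n_pred))
             * math.sqrt(sum(b * b for b in n_gt)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norms))
    return math.degrees(math.acos(cosine))

def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Mean angular error and fraction of pixels under each threshold
    (threshold values are assumed, not taken from the paper)."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred, gt)]
    mean_error = sum(errors) / len(errors)
    within = {t: sum(e < t for e in errors) / len(errors) for t in thresholds}
    return mean_error, within
```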

Figure 5: Comparison between our surface normal predictions and those derived from the depth predictions alone. Columns: (a) RGB input, (b) derived, (c) predicted, (d) ground truth. Normal predictions are more reliable than normals derived from depth, as there is no direct dependence between the two predictions. This is important for generating a realistic 3D reconstruction.

4.4 Geodesic map inputs

We also delve deeper into the impact of the latitude and longitude map priors in the network. Fixing all other aspects of the network, we evaluate the performance of our network on the SUMO dataset with and without the geodesic map channels. Consistent with our expectations, the results in the bottom block of Table 2 suggest that the geodesic map inputs have a positive impact in surface normal estimation. We surmise that the geodesic map helps the network disambiguate the orientation of the surface normal. It is notable that without the geodesic map, we see errors occur at the peak point of barreling on planes in the equirectangular projection as in the top-left image in Figure 6. Interestingly, longitude provides the most important information, which aligns with what we observe in Figure 6: predictions changing abruptly along the rows.

Because the equirectangular grid is indexed by spherical coordinates rather than a Cartesian grid, the distance between adjacent pixels is row-dependent as well. Adjacent pixels nearer to the top and bottom of the image actually lie closer together on the sphere than adjacent pixels near the middle of the image do. This sampling scheme is problematic for CNNs because the convolution operation’s translation equivariance inherently assumes an even sampling. Somehow the network needs to learn to map the geodesic sampling to a Cartesian one. Our experiments suggest that including the geodesic maps as extra input channels is a useful way to pass this information to the network. These findings line up with the results of Liu [17] who show that incorporating pixel location information can help a network learn some degree of translation dependence, which is what we also need to achieve.
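The row dependence of the sampling can be made explicit with a small calculation. On a unit sphere, adjacent pixels in the same row are separated along their circle of latitude by an arc that shrinks with the cosine of the latitude. The sketch below assumes pixel-center sampling and uses a hypothetical function name.

```python
import math

def horizontal_spacing(row, height, width):
    """Arc length on the unit sphere between horizontally adjacent pixel
    centers in a given row, measured along that row's circle of latitude."""
    lat = math.pi / 2 - (row + 0.5) * math.pi / height
    return (2 * math.pi / width) * math.cos(lat)
```

The spacing is largest at the equator and approaches zero at the poles, so a fixed-size convolution kernel covers a very different region of the sphere depending on the row it is applied to.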

Figure 6: Demonstrating the impact of geodesic map inputs on surface normal prediction. TL: output without the geodesic maps, TR: output with geodesic maps, B: ground truth. Notice the error in the large wall in the center of the image.
Figure 7: Plane segmentation using the plane prediction output from our network. TL: RGB input, TR: plane boundary prediction, B: plane boundary segmentation color-coded by label.

5 3D Planar Model Reconstruction

An important consequence of our planarity assumption is that the network provides all of the information necessary to detect and segment planes in the input images. By defining these planes, we can generate “pop-up” models from a single image, as proposed by Hoiem [11]. Indoor omnidirectional images are uniquely suited to produce these types of reconstructions as they are capable of capturing entire rooms in a single image.

To generate these reconstructions, we first isolate the sharpest edges in the planar boundary map using Otsu thresholding [18] and then identify each connected component in the resulting segmentation. An example of the result of this plane segmentation is shown in Figure 7. Thanks to the quality of our plane boundary predictions, this segmentation process requires no threshold tuning. To turn this segmentation into a 3D planar model, we first compute the median normal within each segmented plane. Then, we estimate the distance parameter of the plane equation in each segment using a 1-parameter RANSAC [8] with a final least-squares refinement over the inliers. Lastly, we project each pixel onto its associated plane. The model is finally “popped-up” in 3D by back-projecting the point cloud according to these new depths. We mesh the points by resampling to the vertices of an icosahedral triangular grid and scaling the vertices according to the adjusted depths, resulting in the models shown in Figure 8.
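The segmentation step above can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used for the results: the histogram-based Otsu threshold is the textbook formulation, while the 4-connectivity and the horizontal wrap-around at the left and right image borders are assumptions made for panoramic input.

```python
from collections import deque

def otsu_threshold(values, bins=256):
    """Threshold maximizing between-class variance of a value histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    step = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / step), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0, sum0, best_var, best_bin = 0, 0.0, -1.0, 0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += i * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best_var, best_bin = var, i
    # Upper edge of the last bin assigned to the low (non-edge) class.
    return lo + (best_bin + 1) * step

def label_planes(boundary, threshold):
    """Label connected non-edge regions; label 0 marks boundary pixels.
    Columns wrap around because the panorama is periodic in longitude."""
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if boundary[r][c] >= threshold or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x),
                               (y, (x - 1) % w), (y, (x + 1) % w)):
                    if (0 <= ny < h and boundary[ny][nx] < threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Each positive label then identifies one candidate plane whose pixels are passed to the plane fitting stage.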

Reiterating the importance of surface normal prediction, we found incorporating normal information to be vital to our RANSAC routine. Estimating planes solely from the depth estimates gives a much noisier reconstruction. Furthermore, we observe that having plane information allows us to produce higher quality 3D models than those generated from depth estimates alone. Figure 9 compares our method, which leverages depth, normals, and boundary information, to the baseline network, which only estimates depth. Where the latter model suffers from smoothed edges, ours is able to produce sharp plane boundaries.
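A minimal sketch of the per-segment plane fit is given below, under several assumptions: the function names, the inlier tolerance, the iteration count, and the fixed random seed are all illustrative, and the least-squares refinement is taken to be the mean of the inlier distances (which minimizes the squared error for a single offset parameter). Rays are unit directions from the camera center.

```python
import math
import random
import statistics

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def median_normal(normals):
    """Component-wise median of a segment's normals, renormalized."""
    m = [statistics.median(n[i] for n in normals) for i in range(3)]
    length = math.sqrt(dot(m, m))
    return tuple(c / length for c in m)

def fit_plane_distance(normal, rays, depths, tol=0.05, iters=100, seed=0):
    """1-parameter RANSAC over the plane offset, with the normal held fixed."""
    rng = random.Random(seed)
    # Offset implied by each pixel: normal . (depth * ray).
    candidates = [dep * dot(normal, ray) for ray, dep in zip(rays, depths)]
    best = []
    for _ in range(iters):
        hypothesis = rng.choice(candidates)
        inliers = [c for c in candidates if abs(c - hypothesis) < tol]
        if len(inliers) > len(best):
            best = inliers
    # Least-squares refinement of a single offset is the inlier mean.
    return sum(best) / len(best)

def project_depth(normal, ray, distance):
    """Depth at which the ray meets the plane, or None if nearly parallel."""
    denom = dot(normal, ray)
    if abs(denom) < 1e-6:
        return None
    return distance / denom
```

Because the normal is fixed before sampling, each hypothesis needs only one pixel, which is what makes the search a 1-parameter problem and keeps it robust to noisy depth values.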

Figure 8: View of the 3D “pop-up” model created from our network outputs. Left: our planar reconstruction textured with RGB image. Right: same model textured with plane segmentation.
Figure 9: Left: Snapshot of an untextured, meshed 3D model produced from the baseline depth predictions using the image from Figure 8. Right: Equivalent popup model generated using our proposed method.

The significant drawback of monocular depth estimation is that the lack of any regularization over the estimates leads to fairly noisy predictions. This stands in contrast to stereo methods (and even pseudo-stereo methods like Godard [10]) in which a second image can be used to ensure consistency in the depth map. However, with our planar assumption, we can resolve some of the depth ambiguity while staying purely monocular. Moreover, the planar constraint removes the dependence on texture to recover depth. Although making assumptions about the scene may be impractical for specific tasks like autonomous vehicle depth estimation [22], Figure 9 demonstrates that a simple planarity assumption can be leveraged with great effect for indoor 3D modeling.

6 Conclusion

We have presented a CNN capable of predicting depth, surface normals, and planar boundaries from a single indoor image. Using a novel plane-aware loss function, we have achieved state-of-the-art results for these tasks. We have also demonstrated that the inclusion of a geodesic map can improve surface normal estimates for omnidirectional images. Lastly, we have shown that our network provides all the information necessary to produce a 3D planar model of the scene. Looking ahead, we see an emerging opportunity to utilize this type of all-in-one prediction from omnidirectional images to bootstrap indoor 3D reconstruction.


7 Extended Results

In this section, we provide a further qualitative review of our work. Figure 10 shows more examples of our network’s depth estimates compared to our baseline. Figure 11 provides more cases to justify the prediction of normals independently from depth. Finally, Figure 12 shows more comparisons of popup reconstructions along with examples of the plane boundary predictions and segmentations.

Figure 10: More qualitative comparisons of depth predictions using our plane-aware loss compared to the baseline method based on Zioulis et al. [29]. Columns: (a) RGB input, (b) baseline, (c) ours, (d) ground truth. Our results are noticeably better at capturing the depth of planar objects in the scenes, such as the shelves in row (2) or the table in row (5). Row (3) shows a case where our method is unable to capture a large planar section, but it is worth noting that the baseline method was unsuccessful as well.

Figure 11: Extended comparison between our surface normal predictions and those derived from the depth predictions alone. Columns: (a) RGB input, (b) derived, (c) predicted, (d) ground truth. Rows (1) and (2) give more examples where the normals derived from depth perform well, but rows (3)-(6) show that, generally, we are better off predicting normals independently from depth.

Figure 12: More comparisons of plane segmentations and 3D popups. For each example, the top row shows RGB input, plane boundary prediction, and plane segmentation, respectively, from left to right. Beneath those is a comparison of our popup reconstruction (left) and a mesh constructed from the baseline depth estimation (right). We show the untextured mesh to better highlight the differences in geometry. The rough regions in our reconstruction fall on the boundaries of the plane segmentation, highlighting that our method, while generally useful, falls prey to ‘fat edges’ on the plane boundaries.