Corners for Layout: End-to-End Layout Recovery from 360 Images

03/19/2019 ∙ by Clara Fernandez-Labrador, et al. ∙ University of Zaragoza ∙ Université de Bourgogne

The problem of 3D layout recovery in indoor scenes has been a core research topic for over a decade. However, several major challenges remain unsolved. Among the most relevant ones, most state-of-the-art methods make implicit or explicit assumptions about the scenes -- e.g. box-shaped or Manhattan layouts. Also, current methods are computationally expensive and not suitable for real-time applications like robot navigation and AR/VR. In this work we present CFL (Corners for Layout), the first end-to-end model for 3D layout recovery on 360 images. Our experimental results show that we outperform the state of the art while relaxing assumptions about the scene and at a lower computational cost. We also show that our model generalizes better to camera position variations than conventional approaches by using EquiConvs, a type of convolution applied directly on the sphere projection and hence invariant to equirectangular distortions.




1 Introduction

Recovering the 3D layout of an indoor scene from a single view has attracted the attention of computer vision and graphics researchers in the last decade. The idea is to go beyond pure geometrical reconstructions and provide higher-level contextual information about the scene, even in the presence of clutter. Layout estimation is a key technology in several emerging application markets, such as augmented and virtual reality and robot navigation, but also in more traditional ones, like real estate.


Layout estimation, however, is not a trivial task and there are several major problems that still remain unsolved. For example, most existing methods are based on strong assumptions on the geometry (Manhattan scenes) or the over-simplification of the room types (box-shaped layouts), often underfitting the richness of real indoor spaces. The limited field of view of conventional cameras leads to ambiguities, which could be solved by considering a wider context. For this reason it is advantageous to use wide fields of view, like 360 panoramas. In these cases, however, the methods for conventional cameras are not suitable due to the image distortions and new ones have to be developed.

In recent years, the main improvements in layout recovery from panoramas have come from the application of deep learning. The high-level features learned by deep networks have proven to be as useful for this problem as for many others. Nevertheless, these techniques entail other problems such as the lack of data or overfitting. State-of-the-art methods require additional pre- and/or post-processing. As a consequence they are very slow, which is a major drawback considering the aforementioned applications that require real-time layout recovery.

Figure 1: Corners for Layout: The first end-to-end model from the sphere to the 3D layout.

In this work, we present Corners for Layout (CFL), the first end-to-end neural network that recovers the 3D layout from a single image (Figure 1). CFL predicts a map of the corners of the room that is directly used to obtain the layout without further processing. This makes CFL more than 100 times faster than the state of the art, while still outperforming current approaches in accuracy. Furthermore, our proposal is not limited by typical scene assumptions, meaning that it can predict complex geometries, such as rooms with more than four walls or non-Manhattan structures. Additionally, we propose a novel implementation of the convolution for 360 images [30, 6] in the equirectangular projection. We deform [7] the kernel to compensate for the distortion and make CFL more robust to camera rotation and pose variations. Hence, it is equivalent to applying a convolution operation directly to the spherical image, which is geometrically more coherent than applying a standard convolution on the equirectangular panorama. We have extensively evaluated our network on two public datasets with several training configurations, including data augmentation techniques to address occlusions by forcing the network to learn from the context. Our code and labeled dataset can be found here: CFL webpage.

2 Related Work

The layout of a room provides a strong prior for other visual tasks like depth recovery [11], realistic insertions of virtual objects into indoor images [18], indoor object recognition [3, 29] or human pose estimation [15]. A large variety of methods have been developed for this purpose using multiple input images [31, 14] or depth sensors [35], which deliver high-quality reconstruction results. For the common case when a single RGB image is available, the problem becomes considerably more challenging and researchers need very often to rely on strong assumptions.

The seminal approaches to layout prediction from a single view were [9, 21], followed by [17, 27]. They basically model the layout of the room with a vanishing-point-aligned 3D box, being hence constrained to this particular room geometry and unable to generalize to others appearing frequently in real applications. Most recent approaches exploit CNNs and their excellent performance in a wide range of applications such as image classification, segmentation and detection. [23, 24, 36, 38], for example, focus on predicting the informative edges separating the geometric classes (walls, floor and ceiling). Alternatively, Dasgupta [8] proposed a FCN to predict labels for each of the surfaces of the room. All these methods require extra computation added to the forward propagation of the network to retrieve the actual layout. In [20], for example, an end-to-end network predicts the layout corners in a perspective image, but after that it has to infer the room type within a limited set of manually chosen configurations.

While layout recovery from conventional images has progressed rapidly with both geometry and deep learning, the works that address these challenges using omnidirectional images are still very few. Panoramic cameras have the potential to improve the performance of the task: their 360 field of view captures the entire viewing sphere surrounding the optical center, acquiring the whole room at once and hence allowing layouts to be predicted with more visual information. PanoContext [37] was the first work that extended the frameworks designed for perspective images to panoramas. It recovers both the layout, which is also assumed to be a simple 3D box, and bounding boxes for the most salient objects inside the room. Pano2CAD [33] extends the method to non-cuboid rooms, but it is limited by its dependence on the output of object detectors. Motivated by the need of addressing complex room geometries, [13] generates layout hypotheses by geometric reasoning from a small set of structural corners obtained from the combination of geometry and deep learning. The most recent works along this line are LayoutNet [39], which trains a FCN from panoramas and vanishing lines, generating the layout models from edge and corner maps, and DuLa-Net [34], which predicts Manhattan-world layouts leveraging a perspective ceiling-view of the room. All of these approaches require pre- or post-processing steps, like line and vanishing point extraction or room model fitting, that increase their cost.

In addition to all the challenges mentioned above, we also notice that there is an incongruence between panoramic images and conventional CNNs. The space-varying distortions caused by the equirectangular representation make translational weight sharing ineffective. Very recently, Cohen [6] made a relevant theoretical contribution by studying convolutions on the sphere using spectral analysis. However, it is not clearly demonstrated whether Spherical CNNs can reach the same accuracy and efficiency on equirectangular images. Our EquiConvs have more in common with the idea of [30], which proposes distortion-aware convolutional filters to train a model on conventional perspective images and then use it to regress depth from panoramic images. We propose a novel parameterization and implementation of the deformable convolutions [7], adapting the receptive field of the convolutional kernels by deforming their shape according to the distortion of the equirectangular projection.

3 Corners for Layout

Here we describe our end-to-end approach for recovering the layout, i.e. the main structure of the room, from single images. After introducing some details about the target data, we describe the proposed network architecture and how we directly transform the output into the 3D layout. The network architecture is adapted for Standard Convolutions and for our proposed Equirectangular Convolutions implementation, the latter being explained in Section 4.

Figure 2: CFL architecture. Our network is built upon ResNet-50, adding a single decoder that jointly predicts edge and corner maps. Here we propose two network variations: on top, the network applies StdConvs on the equirectangular panorama, whereas the one on the bottom applies EquiConvs directly on the sphere.

3.1 Ground truth

The ground truth (GT) for every panorama consists of two maps: one represents the room edges, i.e. the intersections between walls, ceiling and floor, and the other encodes the corner locations. Both maps are defined at the image resolution, with pixel values in [0, 1]. A pixel has a value of 1 if it belongs to an edge or a corner, and 0 otherwise. We apply line thickening and Gaussian blur for easier convergence during training, since it makes the loss progression continuous instead of binary: the loss is gradually reduced as the prediction approaches the target.
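As a concrete illustration (a sketch of the idea, not the authors' code), such smooth target maps can be generated by stamping a Gaussian bump at each labeled corner and at points sampled densely along the layout edges; the function names, map size and sigma below are our own choices:

```python
import numpy as np

def gaussian_stamp(heatmap, u, v, sigma=2.0):
    """Add a 2D Gaussian bump centered at pixel (u, v); peak value is 1."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    bump = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    # Keep the per-pixel maximum so overlapping bumps stay in [0, 1].
    np.maximum(heatmap, bump, out=heatmap)

def make_gt_maps(corners_uv, edges_uv, h=128, w=256, sigma=2.0):
    """Build smooth edge and corner ground-truth maps from labeled points.

    corners_uv: list of (u, v) pixel coordinates of layout corners.
    edges_uv:   list of (u, v) pixel coordinates sampled along layout edges.
    """
    corner_map = np.zeros((h, w))
    edge_map = np.zeros((h, w))
    for u, v in corners_uv:
        gaussian_stamp(corner_map, u, v, sigma)
    for u, v in edges_uv:
        gaussian_stamp(edge_map, u, v, sigma)
    return edge_map, corner_map
```

Compared with binary maps, these blurred targets give the cross-entropy loss a smooth gradient as predictions drift toward the true locations.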

Notice here that our target is considerably simpler than others that usually divide the ground truth into different classes. This contributes to the small computational footprint of our proposal. For example, [23, 38] use independent feature maps for background, wall-floor, wall-wall and wall-ceiling edges. A full image segmentation into left, front and right wall, ceiling and floor categories is performed in [8]. In [20], they represent a total of 48 different corner types by a 2D Gaussian heatmap centered at the true keypoint location. Here, instead, we only use two probability maps, one for edges and another one for corners – see the outputs in Figure 2.

3.2 Network architecture

The proposed FCN follows the encoder-decoder structure and builds upon ResNet-50 [16]. We replace the final fully-connected layer with a decoder that jointly predicts refined layout edge and corner locations. We illustrate the proposed architecture in Figure 2.

Encoder. Most deep-learning approaches facing the layout recovery problem have used VGG16 [28] as encoder [23, 8, 20]. Instead, [38] builds its model over ResNet-101 [16], outperforming the state of the art. Here, we use ResNet-50 [16], pre-trained on the ImageNet dataset [26], which leads to a faster convergence due to the general low-level features learned from ImageNet. Residual networks allow us to increase the depth without increasing the number of parameters with respect to their plain counterparts. This leads, in ResNet-50, to a receptive field large enough for our input resolution.

Decoder. Most of the recent work [23, 39, 24] builds two output branches for multi-task learning, which increases the computation time and the network parameters. We instead propose a single branch with two output channels, the corner and edge maps, which helps to reinforce the quality of both map types. In the decoder, we combine two different ideas. First, skip-connections [25] from the encoder to the decoder: specifically, we concatenate "up-convolved" features with their corresponding features from the contracting part. Second, we make preliminary predictions at lower resolutions, which are also concatenated and fed back to the network in the spirit of [10], ensuring that internal features at early stages are already aimed at the task. We use ReLU as non-linear function except in the prediction layers, where we use Sigmoid.

We propose two variations of the network architecture for two different convolution operations (Figure 2). The first one, CFL StdConvs, convolves the feature maps with Standard Convolutions and uses up-convolutions to decode the output. The second one, CFL EquiConvs, uses Equirectangular Convolutions both in the encoder and the decoder, using unpooling to upsample the output. Equirectangular Convolutions are deformable convolutions that adapt their size and shape depending on their position in the equirectangular image, for which we propose a new implementation in Section 4.

Figure 3: Layout from corner predictions. From the corner probability map, the coordinates with maximum values are directly selected to generate the layout.

3.3 Loss functions

Edge and corner maps are learned through a pixel-wise sigmoid cross-entropy loss function. Since we know a priori that the natural distribution of pixels in these maps is extremely unbalanced (only a small fraction of the pixels have a value of 1), we introduce weighting factors to make the training stable. Denoting the positive and negative classes by p and n, the weighting factors are defined as w_c = 1 − n_c/N, where N is the total number of pixels and n_c the number of pixels of class c per sample. The per-pixel, per-map loss is as follows:

L(x, m) = −w_p · y(x, m) · log ŷ(x, m) − w_n · (1 − y(x, m)) · log(1 − ŷ(x, m)),

where y(x, m) is the GT for pixel x in the map m and ŷ(x, m) is the network output for pixel x and map m. We minimize this loss at different resolutions r, specifically in the network output and 3 intermediate layers. The total loss is then the sum over all pixels, the resolutions and both the edge and corner maps:

L_total = Σ_m Σ_r Σ_x L_r(x, m).
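A minimal NumPy sketch of this class-balanced cross-entropy for one map (assuming binary ground truth; the weight definition w_c = 1 − n_c/N follows the text, and the function name is our own):

```python
import numpy as np

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Class-balanced sigmoid cross-entropy over one edge/corner map.

    y_true: binary ground-truth map; y_pred: predicted probabilities in (0, 1).
    The rare positive class receives the larger weight w_p = 1 - n_p / N.
    """
    n_total = y_true.size
    n_pos = y_true.sum()
    w_pos = 1.0 - n_pos / n_total          # large: positives are scarce
    w_neg = 1.0 - (n_total - n_pos) / n_total  # small: negatives dominate
    y_pred = np.clip(y_pred, eps, 1 - eps)     # numerical safety for log
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1 - y_true) * np.log(1 - y_pred))
    return loss.sum()
```

The total training loss would then sum this quantity over both maps and over the output plus intermediate resolutions.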


3.4 3D Layout

Aiming at a fast end-to-end model, CFL avoids post-processing and strong scene assumptions and just follows a natural transformation from corner coordinates to the 2D and 3D layout. The 2D corner coordinates are the maximum activations in the probability map. Assuming that the corner set is consistent, they are directly joined, from left to right, in the unit sphere space and re-projected to the equirectangular image plane. From this 2D layout, we infer the 3D layout by only assuming ceiling-floor parallelism, leaving the wall structure unconstrained – i.e., we do not force the usual Manhattan perpendicularity between walls. Corners are projected to the floor and ceiling planes given a unitary camera height (trivial, as results are up to scale). See Figure 3.
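The corner-to-3D lifting can be sketched as follows, under only the stated assumptions (horizontal floor and ceiling planes, unit camera height, floor corners below the horizon). The helper names and axis conventions (x right, y down, z forward) are our own:

```python
import numpy as np

def uv_to_sphere(u, v, w, h):
    """Equirectangular pixel -> unit bearing vector (x right, y down, z fwd)."""
    phi = (u / w - 0.5) * 2 * np.pi      # longitude
    theta = (v / h - 0.5) * np.pi        # latitude, positive below horizon
    return np.array([np.cos(theta) * np.sin(phi),
                     np.sin(theta),
                     np.cos(theta) * np.cos(phi)])

def corners_to_3d(floor_uv, ceil_uv, w=256, h=128, cam_height=1.0):
    """Lift paired floor/ceiling corner pixels to 3D.

    Only assumes horizontal floor/ceiling planes (no Manhattan constraint).
    floor_uv / ceil_uv: lists of (u, v) corner pixels in left-to-right order;
    floor corners must lie below the horizon (v > h/2).
    """
    floor_pts, ceil_pts = [], []
    for (uf, vf), (uc, vc) in zip(floor_uv, ceil_uv):
        rf = uv_to_sphere(uf, vf, w, h)
        fp = rf * (cam_height / rf[1])   # intersect ray with floor y = height
        floor_pts.append(fp)
        rc = uv_to_sphere(uc, vc, w, h)
        # The wall is vertical, so the ceiling corner shares the floor
        # corner's horizontal distance from the camera.
        scale = np.hypot(fp[0], fp[2]) / np.hypot(rc[0], rc[2])
        ceil_pts.append(rc * scale)
    return np.asarray(floor_pts), np.asarray(ceil_pts)
```

Joining consecutive floor (and ceiling) points then yields the wall polygons, with no perpendicularity imposed between them.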

Figure 4: Spherical parametrization of EquiConvs. The spherical kernel, defined by its angular size (α) and resolution (r), is convolved around the sphere with angles φ and θ.

Limitations of CFL: We directly join corners from left to right, meaning that our end-to-end model would not work if any wall is occluded, i.e. when the scene is not convex. In those particular cases, the joining process should follow a different order. [13] proposes a geometry-based post-processing that could alleviate this problem, but its cost is high and it needs the Manhattan World assumption. The addition of this post-processing to our work, in any case, could be done similarly to [12].

4 Equirectangular Convolutions

Spherical images are receiving increasing attention due to the growing number of omnidirectional sensors in drones, robots and autonomous cars. A naïve application of convolutional networks to an equirectangular projection is not, in principle, a good choice due to the space-varying distortions introduced by such projection.

In this section we present a convolution that we name EquiConv, which is defined in the spherical domain instead of the image domain and is implicitly invariant to the distortions of the equirectangular representation. The kernel in EquiConvs is defined as a spherical surface patch – see Figure 4. We parametrize its receptive field by the angles α_w and α_h. Thus, we directly define a convolution over the field of view. The kernel is rotated and applied along the sphere, and its position is defined by the spherical coordinates of its center (φ and θ in the figure). Unlike standard kernels, which are parameterized by their size k_w × k_h, with EquiConvs we define the angular size (α_w × α_h) and the resolution (r_w × r_h). In practice, we keep the aspect ratio, α_w / r_w = α_h / r_h, and we use square kernels, so we will refer to the field of view as α and to the resolution as r from now on.

Figure 5: Effect of changing field of view (rad) and resolution in EquiConvs. The first column shows a narrow field of view. The second column shows a wider kernel keeping its resolution (atrous-like). The third column shows an even larger field of view for the kernel. Notice how the kernel adapts to the equirectangular distortion. Rows show two different resolutions.

As we increase the resolution of the kernel, the angular distance between its elements decreases, with the intuitive upper limit of not giving more resolution to the kernel than the image itself. In other words, the kernel is defined on a sphere whose radius is less than or equal to the image sphere radius. EquiConvs can also be seen as a general model for spherical Atrous Convolutions [4, 5], where the kernel size is what we call resolution, and the rate is the field of view of the kernel divided by the resolution. Examples of EquiConvs for different values of α and r can be seen in Figure 5.

4.1 EquiConvs Details

In [7], deformable convolutions are introduced by learning additional offsets from the preceding feature maps. Offsets are added to the regular kernel locations of the Standard Convolution, enabling free-form deformation of the kernel.

Figure 6: Effect of offsets on a kernel. Left: Regular kernel in Standard Convolution. Center: Deformable kernel in [7]. Right: Spherical surface patch in EquiConvs.

Inspired by this work, we deform the shape of the kernels according to the geometric priors of the equirectangular image projection. To do that, we use offsets that are not learned but fixed by the spherical distortion model, and that are constant for all positions in the same image row (same latitude). Here, we describe how to obtain the distorted pixel locations from the original ones.

Let us define p₀ = (u₀, v₀) as the pixel location on the equirectangular image where we apply the convolution operation (i.e. the image coordinate where the center of the kernel is located). First, we define the coordinates of every element in the kernel, and afterwards we rotate them to the point of the sphere where the kernel is being applied. We define each point of the kernel as

p̂(i, j) = (i, j, d)ᵀ,

where i and j are integers in the range [−(r−1)/2, (r−1)/2] and d is the distance from the center of the sphere to the kernel grid. In order to cover the field of view α,

d = r / (2 tan(α/2)).

We project each point onto the sphere surface by normalizing the vectors, and rotate them to align the kernel center with the point where the kernel is applied:

p(i, j) = R_y(φ₀) R_x(θ₀) p̂(i, j) / ‖p̂(i, j)‖,

where R_a(β) stands for a rotation matrix of an angle β around the axis a. φ₀ and θ₀ are the spherical angles of the center of the kernel – see Figure 4 – and are defined as

φ₀ = (u₀ − W/2) · 2π/W,   θ₀ = −(v₀ − H/2) · π/H,

where W and H are, respectively, the width and height of the equirectangular image in pixels. Finally, the rest of the elements are back-projected to the equirectangular image domain. First, we convert the unit sphere coordinates p = (p_x, p_y, p_z)ᵀ to longitude and latitude angles:

φ = arctan2(p_x, p_z),   θ = arcsin(p_y).

And then, to the original 2D equirectangular image domain:

u = (φ/2π + 1/2) · W,   v = (θ/π + 1/2) · H.
In Figure 6 we show how these offsets are applied to a regular kernel; and in Figure 7 three kernel samples on the spherical and on the equirectangular images.
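As an illustrative sketch (not the authors' implementation), the steps above can be written in a few lines of NumPy; the kernel resolution r, field of view alpha, and the longitude/latitude conventions follow the equations in this section:

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def equiconv_offsets(u0, v0, w, h, r=3, alpha=np.pi / 6):
    """Sampling locations of an r x r EquiConv kernel centered at pixel (u0, v0).

    Steps: build the kernel grid at distance d, normalize onto the unit
    sphere, rotate to the kernel center, and back-project to equirectangular
    pixels (with horizontal wrap-around instead of padding).
    """
    d = r / (2 * np.tan(alpha / 2))
    ij = np.arange(r) - (r - 1) / 2
    jj, ii = np.meshgrid(ij, ij)
    pts = np.stack([ii, jj, np.full_like(ii, d)], axis=-1).reshape(-1, 3)
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # onto the unit sphere
    phi0 = (u0 - w / 2) * 2 * np.pi / w                 # kernel center angles
    theta0 = -(v0 - h / 2) * np.pi / h
    pts = (rot_y(phi0) @ rot_x(theta0) @ pts.T).T       # rotate to the center
    lon = np.arctan2(pts[:, 0], pts[:, 2])              # back-project
    lat = np.arcsin(np.clip(pts[:, 1], -1, 1))
    u = (lon / (2 * np.pi) + 0.5) * w
    v = (lat / np.pi + 0.5) * h
    return np.stack([u % w, v], axis=1)                 # horizontal wrap
```

Since the offsets depend only on the kernel center's latitude, they can be precomputed once per image row and reused across channels and training steps.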

Figure 7: EquiConvs on spherical images. We show three kernel positions to highlight the differences between the offsets. As we approach the poles (larger θ angles), the deformation of the kernel on the equirectangular image is bigger, in order to reproduce a regular kernel on the sphere surface. Additionally, with EquiConvs, we do not use padding when the kernel is on the border of the image, since the offsets take the points to their correct position on the other side of the image.
5 Experiments

                                   Edges                               Corners
Conv. Type  IntPred Edges   IoU    Acc    P      R      F1      IoU    Acc    P      R      F1
StdConvs      –       –      –      –     –      –       –       –      –     –      –      –
StdConvs      ✓       –      –      –     –      –       –       –      –     –      –      –
StdConvs      ✓       ✓    0.588  0.933  0.782  0.691  0.733   0.465  0.974  0.872  0.498  0.632
EquiConvs     –       –      –      –     –      –       –       –      –     –      –      –
EquiConvs     ✓       –      –      –     –      –     0.686     –      –     –      –    0.491
EquiConvs     ✓       ✓    0.575  0.931  0.789  0.722    –     0.460  0.974  0.887  0.627    –
(bigger is better for all metrics)
Table 1: Ablation study on SUN360 dataset. We show results for both Standard Convolutions (StdConvs) and our proposed Equirectangular Convolutions (EquiConvs) with some modifications: Using or not intermediate predictions (IntPred) in the decoder and edge map predictions (Edges).

We present a set of experiments to evaluate CFL using both Standard Convolutions (StdConvs) and the proposed Equirectangular Convolutions (EquiConvs). We not only analyze how well it predicts edge and corner maps, but also the impact of each algorithmic component through ablation studies. We report the performance of our proposal on two different datasets, and show qualitative 2D and 3D models of different indoor scenes.

5.1 Datasets

We use two public datasets that comprise several indoor scenes, SUN360 [32] and Stanford (2D-3D-S) [2] in equirectangular projection (360). The former is used for ablation studies, and both are used for comparison against several state-of-the-art baselines.

SUN360 [32]: We use 500 bedroom and living room panoramas from this dataset labeled by Zhang et al. [37]. We use these labels but, since all panoramas were labeled as box-type rooms, we hand-label and substitute 35 panoramas to represent the actual shapes of the rooms more faithfully. We split the dataset randomly into training and test partitions, making sure that rooms with more than 4 walls appear in both.

Stanford 2D-3D-S [2]: This dataset contains more challenging scenarios like cluttered laboratories or corridors. In [39], areas 1, 2, 4 and 6 are used for training, and area 5 for testing. For our experiments we use the same partitions and the ground truth they provide.

5.2 Implementation details

The input to the network is a single panoramic RGB image. The outputs are, on the one hand, the room layout edge map and, on the other hand, the corner map, both at the same resolution. A widely used strategy to improve the generalization of neural networks is data augmentation: we apply random erasing, horizontal mirroring and horizontal rotation of the input images during training. The weights are initialized using ResNet-50 [16] trained on ImageNet [26]. For CFL EquiConvs we use the same kernel resolutions and fields of view as in ResNet-50; this means that, for a standard 3×3 kernel applied to a W×H feature map, we set r = 3 and choose the field of view α to match the corresponding receptive field on the panorama. We minimize the cross-entropy loss using Adam [19], regularized by penalizing the loss with the sum of the L2 norms of all weights. The initial learning rate is exponentially decayed every epoch, and we apply dropout in the decoder.

The network is implemented using TensorFlow [1] and trained and tested on an NVIDIA Titan X. Training with StdConvs takes around an hour, with a test time well under a second per image. For EquiConvs, training takes hours and testing is also slower (see Table 4).

5.3 FCN evaluation

We measure the quality of our predicted probability maps using five standard metrics: intersection over union of predicted corner/edge pixels (IoU), precision (P), recall (R), F1 score and accuracy (Acc). Table 1 summarizes our results and allows us to answer the following questions:
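These five metrics can be computed from a thresholded prediction map in a few lines (a straightforward sketch; the threshold value and function name are our own):

```python
import numpy as np

def map_metrics(gt, pred, thr=0.5):
    """IoU, precision, recall, F1 and accuracy between two probability maps,
    binarized at threshold thr."""
    g = gt > thr
    p = pred > thr
    tp = np.sum(g & p)      # true positives
    fp = np.sum(~g & p)     # false positives
    fn = np.sum(g & ~p)     # false negatives
    tn = np.sum(~g & ~p)    # true negatives
    iou = tp / max(tp + fp + fn, 1)
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    acc = (tp + tn) / g.size
    return iou, prec, rec, f1, acc
```

Because edge/corner pixels are rare, IoU and F1 are more informative here than accuracy, which is dominated by the negative class.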

What are the effects of different convolutions? As one would expect, EquiConvs, being aware of the distortion model, learn in a non-distorted generic feature space and achieve accurate predictions, like StdConvs on conventional images [20]. Counterintuitively, however, StdConvs, which ignore the distortion model, rely on the image patterns that the distortion generates and obtain similar performance – see Table 1. Distortion understanding, nonetheless, gives the network other advantages. While StdConvs learn a strong bias correlating features and distortion patterns (e.g. the ceiling line at the top of the image or clutter in the mid-bottom), EquiConvs are invariant to that. For this reason, the performance of EquiConvs does not degrade when varying the camera 6DOF pose – see Section 5.4. Additionally, EquiConvs allow a more direct use of networks pre-trained on conventional images. Specifically, this translates into faster convergence, which is desirable since, to date, 360 datasets contain far fewer images than datasets of conventional images. Moreover, Tateno et al. demonstrate in their recent work [30] that other tasks like depth prediction, panoramic monocular SLAM, panoramic semantic segmentation and panoramic style transfer can also benefit from this type of convolutions.

How can we refine predictions? There are some techniques that we can use in order to obtain more accurate and refined predictions. Here, we make pyramid preliminary predictions in the decoder and iteratively refine them, by feeding them back to the network, until the final prediction. Also, although we only use the corner map to recover the layout of the room, we train the network to additionally predict edge maps as an auxiliary task. This is another representation of the same task that ensures that the network learns to exploit the relationship between both outputs, i.e., the network learns how edges intersect to generate the corners. The improvement is shown in Table 1.

                                            Edges       Corners
Translation (−0.3h : +0.3h)   StdConvs        –            –
                              EquiConvs       –            –
Rotation                      StdConvs        –            –
                              EquiConvs       –            –
Table 2: Robustness analysis. Values represent the mean value (bigger is better) ± standard deviation (smaller is better). We apply two types of transformations to the panoramas: vertical translations dependent on the room height h, and rotations. We do not use these images for training but just for testing, in order to show the generalization capabilities of EquiConvs.
Figure 8: Augmenting the data with virtual occlusions. Left: Image with erased pixels. Right: Input panorama and predictions without and with pixel erasing. Notice the improvement by random erasing.

How can we deal with occlusions? We apply Random Erasing Data Augmentation. This operation randomly selects rectangles in the training images and removes their content, generating various levels of virtual occlusion. In this manner we simulate real situations where objects in the scene occlude the corners of the room layout, and force the network to learn context-aware features to overcome this challenging situation. Figure 8 illustrates this strategy with an example.
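A minimal version of this augmentation (our own sketch; the rectangle counts and size limits are illustrative, not the paper's exact settings):

```python
import numpy as np

def random_erase(img, rng, n_max=3, max_frac=0.25):
    """Erase up to n_max random rectangles to simulate occlusions.

    img: H x W x C array. Erased pixels are set to zero here (they could
    equally be filled with noise). Returns a modified copy.
    """
    out = img.copy()
    h, w = img.shape[:2]
    for _ in range(rng.integers(1, n_max + 1)):
        eh = rng.integers(1, int(h * max_frac) + 1)   # rectangle height
        ew = rng.integers(1, int(w * max_frac) + 1)   # rectangle width
        y = rng.integers(0, h - eh + 1)
        x = rng.integers(0, w - ew + 1)
        out[y:y + eh, x:x + ew] = 0
    return out
```

The ground-truth corner and edge maps are left untouched, so the network must infer occluded structure from the surrounding context.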

Figure 9: Relaxation of assumptions. The figure shows two CFL predictions of non-Manhattan/not box-like rooms.

Is it possible to relax the scene assumptions while keeping a good performance? Our end-to-end approach overcomes the Manhattan assumption as well as the box-type simplification (four-wall rooms). On the one hand, although we label some panoramas more accurately to their actual shape, we still have a largely unbalanced dataset. We address this problem by forcing every training batch to always include one non-box sample. This favors the learning of more complex rooms despite having few examples. On the other hand, while recent works [39, 13, 37] use pre-computed vanishing points and posterior optimizations, here we directly obtain the corner coordinates from the FCN output without applying geometric constraints. In Figure 9 we show two examples where CFL predicts more than four walls. Notice also the non-Manhattan ceiling in the left image.

5.4 Robustness analysis

With the motivation of exploiting the potential of EquiConvs, we test our model with previously unseen images where the camera viewpoint is different from that in the training set. The distortion in the equirectangular projection is location dependent; specifically, it depends on the polar angle θ. Since EquiConvs are invariant to this distortion, it is interesting to see how modifications in the camera extrinsic parameters (translation and rotation) affect the model performance using EquiConvs against StdConvs. When we apply translations over the vertical axis and rotations, the shape of the layout is modified by the distortion, losing its characteristic pattern (which StdConvs use in their favor).

Since standard datasets have a strong bias when referring to camera pose and rotation, we synthetically render these transformations along our test set. The rotation is trivial as we work on the spherical domain. As the complete 3D dense model of the rooms is not available, the translation simulation is performed by using the existing information, ignoring occlusions produced by viewpoint changes. Nevertheless, as we do not work with wide translations the effect is minimal and images are realistic enough to prove the point we want to highlight (see Figure 10).

Figure 10: Synthetic images for robustness analysis. Here we show two examples of panoramas generated with upward translation and with rotation, respectively.

For both experiments, we uniformly sample from a minimum to a maximum transformation and calculate the mean and standard deviation of all the metrics. What we see in Table 2 is that we obtain higher mean values and smaller standard deviations by using EquiConvs. This means that EquiConvs make the model more robust and generalizable to real-life situations not covered in the datasets, such as panoramas taken by hand, drones or small robots. This effect is especially pronounced in the evaluation of the edges, since it is their appearance that is most modified by these changes of the camera.

5.5 3D Layout comparison

We evaluate our layout predictions using three standard metrics – 3D intersection over union (3DIoU), corner error (CE) and pixel error (PE) – and compare against four approaches from the state of the art [37, 39, 13, 34]. Pano2CAD [33] has no source code available nor evaluation of layouts, making direct comparison difficult. The pixel error metric given by [39] only distinguishes between ceiling, floor and walls (simple segmentation). Instead, our proposed segmented mask distinguishes between ceiling, floor and each wall separately (complete segmentation), which is more informative since it also takes into account errors in wall-wall boundaries. For all experiments, only the SUN360 dataset is used for training. Table 3 shows the performance of our proposal testing on both datasets, SUN360 and Stanford 2D-3D. Results are averaged across all images. It can be seen that our approach clearly outperforms the state of the art in all the metrics.

It is worth mentioning that our approach not only obtains better accuracy but also recovers shapes more faithful to the real ones, since it can handle non-box-type room designs with few training examples. In Table 4 we show that, apart from achieving better localization of layout boundaries and corners, our end-to-end approach is much faster, processing a room in under a second, which is a major advantage considering that the aforementioned applications of layout recovery (robot navigation, AR/VR) need to run in real time.

Test       Method             3DIoU (%)   CE (%)   PE-SS (%)   PE-CS (%)
SUN360     PanoContext [37]       –          –         –           –
           Fernandez [13]         –          –         –           –
           LayoutNet [39]         –          –         –           –
           DuLa-Net [34]          –          –         –           –
           CFL StdConvs         78.79      2.49      3.33          –
           CFL EquiConvs          –        0.78        –           –
Std.2D3D   Fernandez [13]         –          –         –           –
           CFL StdConvs           –        1.44      4.75        6.05
           CFL EquiConvs        65.23        –         –           –
(bigger is better for 3DIoU; smaller is better for CE and PE)
Table 3: Layout results on both datasets, training on SUN360 data. SS: Simple Segmentation (3 categories): ceiling, floor and walls [39]. CS: Complete Segmentation: ceiling, floor, wall,…, wall [13]. Observe how our method outperforms all the baselines in all the metrics.
Method             Computation Time (s)
PanoContext [37]            –
LayoutNet [39]              –
DuLa-Net [34]               –
CFL EquiConvs               –
CFL StdConvs              0.46
Table 4: Average computing time per image. Every approach is evaluated using NVIDIA Titan X and Intel Xeon 3.5 GHz (6 cores) except DuLa-Net, evaluated using NVIDIA 1080ti GPU. Our end-to-end method is more than 100 times faster than other methods.
Figure 11: Layout predictions (light magenta) and ground truth (dark magenta) for complex room geometries.

6 Conclusions

In this work we present CFL, the first end-to-end algorithm for layout recovery from 360 images. Our experimental results demonstrate that our predicted layouts are clearly more accurate than the state of the art. Additionally, removing the extra pre- and post-processing stages makes our method much faster than previous works. Finally, being entirely data-driven frees our model from the geometric assumptions that are commonly made in the state of the art and that limit its applicability to complex geometries. We present two variants of CFL. The first, implemented with Standard Convolutions, reduces computation time by a factor of 100 and is well suited to images taken with a tripod. The second uses our proposed implementation of Equirectangular Convolutions, which adapt their shape to the equirectangular projection of the spherical image. This variant proves more robust to translations and rotations of the camera, making it ideal for panoramas taken with a hand-held camera.
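The intuition behind EquiConvs can be sketched numerically: instead of sampling a fixed square grid on the image, the kernel is laid out on the tangent plane of the viewing sphere at the pixel of interest and projected back into equirectangular coordinates, so its footprint widens near the poles where the projection stretches. The sketch below is an illustration of this sampling idea, not the authors' implementation; the kernel size and field of view are arbitrary illustrative choices.

```python
import numpy as np

def equiconv_sampling(u, v, width, height, k=3, fov=np.radians(10)):
    """Illustrative distortion-aware sampling: project a k x k grid from
    the sphere's tangent plane at pixel (u, v) back into equirectangular
    pixel coordinates. Returns two (k, k) arrays of sample positions."""
    # Spherical angles of the kernel center.
    lon = (u / width) * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v / height) * np.pi
    # Unit ray through the center and an orthonormal tangent basis.
    center = np.array([np.cos(lat) * np.sin(lon),
                       np.sin(lat),
                       np.cos(lat) * np.cos(lon)])
    right = np.array([np.cos(lon), 0.0, -np.sin(lon)])
    up = np.cross(center, right)
    # Regular grid on the tangent plane spanning the chosen field of view.
    r = np.tan(fov / 2.0)
    g = np.linspace(-r, r, k)
    xs, ys = np.meshgrid(g, g)
    rays = center + xs[..., None] * right + ys[..., None] * up
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Back to equirectangular pixel coordinates.
    lon_s = np.arctan2(rays[..., 0], rays[..., 2])
    lat_s = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))
    us = (lon_s + np.pi) / (2.0 * np.pi) * width
    vs = (np.pi / 2.0 - lat_s) / np.pi * height
    return us, vs
```

Evaluating this at the equator yields a nearly square footprint, while near the poles the same kernel spreads horizontally over many more pixels, which is precisely the invariance to equirectangular distortion that makes the model robust to camera rotations.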

Acknowledgement: This project was partly funded by the Spanish government (DPI2015-65962-R, DPI2015-67275), the Regional Council of Bourgogne-Franche-Comté (2017-9201AAO048S01342) and the Aragon government (DGA-T45_17R/FSE). We also thank NVIDIA for their Titan X and Xp donation. Finally, we would like to acknowledge Jesus Bermudez-Cameo for his valuable discussions and alpha testing.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [2] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. ArXiv preprint, Feb. 2017.
  • [3] S. Y. Bao, M. Sun, and S. Savarese. Toward coherent object detection and scene layout understanding. Image and Vision Computing, 29(9):569–579, 2011.
  • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [5] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [6] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
  • [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 1(2):3, 2017.
  • [8] S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 616–624, 2016.
  • [9] E. Delage, H. Lee, and A. Y. Ng. A dynamic bayesian network model for autonomous 3D reconstruction from a single indoor image. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 2418–2428, 2006.
  • [10] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
  • [11] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
  • [12] C. Fernandez-Labrador, J. M. Facil, A. Perez-Yus, C. Demonceaux, and J. J. Guerrero. Panoroom: From the sphere to the 3d layout. arXiv preprint arXiv:1808.09879, 2018.
  • [13] C. Fernandez-Labrador, A. Perez-Yus, G. Lopez-Nicolas, and J. J. Guerrero. Layouts from panoramic images with geometry and deep learning. IEEE Robotics and Automation Letters, 3(4):3153–3160, 2018.
  • [14] A. Flint, D. Murray, and I. Reid. Manhattan scene understanding using monocular, stereo, and 3d features. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2228–2235. IEEE, 2011.
  • [15] D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic. People watching: Human actions as a cue for single view geometry. International journal of computer vision, 110(3):259–274, 2014.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [17] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In IEEE International Conference on Computer Vision, pages 1849–1856, 2009.
  • [18] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (TOG), 30(6):157, 2011.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] C. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. RoomNet: End-to-end room layout estimation. In IEEE International Conference on Computer Vision, 2017.
  • [21] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2136–2143, 2009.
  • [22] C. Liu, A. G. Schwing, K. Kundu, R. Urtasun, and S. Fidler. Rent3d: Floor-plan priors for monocular layout estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [23] A. Mallya and S. Lazebnik. Learning informative edge maps for indoor scene layout prediction. In IEEE International Conference on Computer Vision, pages 936–944, 2015.
  • [24] Y. Ren, S. Li, C. Chen, and C.-C. J. Kuo. A coarse-to-fine indoor layout estimation (cfile) method. In Asian Conference on Computer Vision, pages 36–51, 2016.
  • [25] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [27] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the box: Joint 3D layout and object reasoning from single images. In IEEE International Conference on Computer Vision, pages 353–360, 2013.
  • [28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [29] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
  • [30] K. Tateno, N. Navab, and F. Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 707–722, 2018.
  • [31] G. Tsai, C. Xu, J. Liu, and B. Kuipers. Real-time indoor scene understanding using bayesian filtering with motion cues. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 121–128. IEEE, 2011.
  • [32] J. Xiao, K. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2695–2702, 2012.
  • [33] J. Xu, B. Stenger, T. Kerola, and T. Tung. Pano2CAD: Room layout from a single panorama image. In IEEE Winter Conference on Applications of Computer Vision, pages 354–362, 2017.
  • [34] S.-T. Yang, F.-E. Wang, C.-H. Peng, P. Wonka, M. Sun, and H.-K. Chu. Dula-net: A dual-projection network for estimating room layouts from a single rgb panorama. arXiv preprint arXiv:1811.11977, 2018.
  • [35] J. Zhang, C. Kan, A. G. Schwing, and R. Urtasun. Estimating the 3d layout of indoor scenes and its clutter from depth sensors. In 2013 IEEE International Conference on Computer Vision, pages 1273–1280. IEEE, 2013.
  • [36] W. Zhang, W. Zhang, K. Liu, and J. Gu. Learning to predict high-quality edge maps for room layout estimation. Transactions on Multimedia, 19(5):935–943, 2017.
  • [37] Y. Zhang, S. Song, P. Tan, and J. Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In European Conference on Computer Vision, pages 668–686. Springer, 2014.
  • [38] H. Zhao, M. Lu, A. Yao, Y. Guo, Y. Chen, and L. Zhang. Physics inspired optimization on semantic transfer features: An alternative method for room layout estimation. arXiv preprint arXiv:1707.00383, 2017.
  • [39] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.