Corners for Layout: End-to-End Layout Recovery from 360 Images
The problem of 3D layout recovery in indoor scenes has been a core research topic for over a decade. However, there are still several major challenges that remain unsolved. Among the most relevant ones, a major part of the state-of-the-art methods make implicit or explicit assumptions on the scenes -- e.g. box-shaped or Manhattan layouts. Also, current methods are computationally expensive and not suitable for real-time applications like robot navigation and AR/VR. In this work we present CFL (Corners for Layout), the first end-to-end model for 3D layout recovery on 360 images. Our experimental results show that we outperform the state of the art relaxing assumptions about the scene and at a lower cost. We also show that our model generalizes better to camera position variations than conventional approaches by using EquiConvs, a type of convolution applied directly on the sphere projection and hence invariant to the equirectangular distortions.
Recovering the 3D layout of an indoor scene from a single view has attracted the attention of computer vision and graphics researchers in the last decade. The idea is to go beyond pure geometrical reconstructions and provide higher-level contextual information about the scene, even in the presence of clutter. Layout estimation is a key technology in several emerging application markets, such as augmented and virtual reality and robot navigation, but also in more traditional ones, like real estate [22].
Layout estimation, however, is not a trivial task and there are several major problems that still remain unsolved. For example, most existing methods are based on strong assumptions on the geometry (Manhattan scenes) or on an over-simplification of the room types (box-shaped layouts), often underfitting the richness of real indoor spaces. The limited field of view of conventional cameras leads to ambiguities, which could be solved by considering a wider context. For this reason it is advantageous to use wide fields of view, like 360 panoramas. In these cases, however, the methods for conventional cameras are not suitable due to the image distortions, and new ones have to be developed.
In the last years, the main improvements in layout recovery from panoramas have come from the application of deep learning. The high-level features learned by deep networks have proven to be as useful for this problem as for many others. Nevertheless, these techniques entail other problems such as the lack of data or overfitting. State-of-the-art methods require additional pre- and/or post-processing. As a consequence they are very slow, and this is a major drawback considering the aforementioned applications for real-time layout recovery.
In this work, we present Corners for Layout (CFL), the first end-to-end neural network that recovers the 3D layout from a single 360 image (Figure 1). CFL predicts a map of the corners of the room that is directly used to obtain the layout without further processing. This makes CFL more than 100 times faster than the state of the art, while still outperforming the accuracy of current approaches. Furthermore, our proposal is not limited by typical scene assumptions, meaning that it can predict complex geometries, such as rooms with more than four walls or non-Manhattan structures. Additionally, we propose a novel implementation of the convolution for 360 images [30, 6] in the equirectangular projection. We deform the kernel [7] to compensate for the distortion and make CFL more robust to camera rotation and pose variations. Hence, it is equivalent to applying a convolution operation directly to the spherical image, which is geometrically more coherent than applying a standard convolution on the equirectangular panorama. We have extensively evaluated our network on two public datasets with several training configurations, including data augmentation techniques that address occlusions by forcing the network to learn from the context. Our code and labeled dataset can be found here: CFL webpage.
The layout of a room provides a strong prior for other visual tasks like depth recovery [11], realistic insertions of virtual objects into indoor images [18], indoor object recognition [3, 29] or human pose estimation [15]. A large variety of methods have been developed for this purpose using multiple input images [31, 14] or depth sensors [35], which deliver high-quality reconstruction results. For the common case when a single RGB image is available, the problem becomes considerably more challenging and researchers very often need to rely on strong assumptions.
The seminal approaches to layout prediction from a single view were [9, 21], followed by [17, 27]. They basically model the layout of the room with a vanishing-point-aligned 3D box, being hence constrained to this particular room geometry and unable to generalize to others appearing frequently in real applications. Most recent approaches exploit CNNs and their excellent performance in a wide range of applications such as image classification, segmentation and detection. [23, 24, 36, 38], for example, focus on predicting the informative edges separating the geometric classes (walls, floor and ceiling). Alternatively, Dasgupta [8] proposed a FCN to predict labels for each of the surfaces of the room. All these methods require extra computation added to the forward propagation of the network to retrieve the actual layout. In [20], for example, an end-to-end network predicts the layout corners in a perspective image, but after that it has to infer the room type within a limited set of manually chosen configurations.
While layout recovery from conventional images has progressed rapidly with both geometry and deep learning, the works that address these challenges using omnidirectional images are still very few. Panoramic cameras have the potential to improve the performance of the task: their 360 field of view captures the entire viewing sphere surrounding its optical center, allowing to acquire the whole room at once and hence predicting layouts with more visual information. PanoContext [37] was the first work that extended the frameworks designed for perspective images to panoramas. It recovers both the layout, which is also assumed as a simple 3D box, and bounding boxes for the most salient objects inside the room. Pano2CAD [33] extends the method to non-cuboid rooms, but it is limited by its dependence on the output of object detectors. Motivated by the need of addressing complex room geometries, [13] generates layout hypotheses by geometric reasoning from a small set of structural corners obtained from the combination of geometry and deep learning. The most recent works along this line are LayoutNet [39], that trains a FCN from panoramas and vanishing lines, generating the layout models from edge and corner maps, and DuLa-Net [34], that predicts Manhattan-world layouts leveraging a perspective ceiling-view of the room. All of these approaches require pre- or post-processing steps like line and vanishing point extraction or room model fitting, that increase their cost.
In addition to all the challenges mentioned above, we also notice that there is an incongruence between panoramic images and conventional CNNs. The space-varying distortions caused by the equirectangular representation make the translational weight sharing ineffective. Very recently, Cohen [6] made a relevant theoretical contribution by studying convolutions on the sphere using spectral analysis. However, it is not clearly demonstrated whether Spherical CNNs can reach the same accuracy and efficiency on equirectangular images. Our EquiConvs have more in common with the idea of [30], which proposes distortion-aware convolutional filters to train a model using conventional perspective images and then use it to regress depth from panoramic images. We propose a novel parameterization and implementation of the deformable convolutions [7], following the idea of adapting the receptive field of the convolutional kernels by deforming their shape according to the distortion of the equirectangular projection.
Here we describe our end-to-end approach for recovering the layout, i.e. the main structure of the room, from single 360 images. After introducing some details about the target data, we describe the proposed network architecture and how we directly transform the output into the 3D layout. The network architecture is adapted for Standard Convolutions and for our proposed Equirectangular Convolutions implementation, the latter being explained in Section 4.
The ground truth (GT) for every panorama consists of two maps $m$: one represents the room edges ($m = e$), i.e. the intersections between walls, ceiling and floor, and the other encodes the corner locations ($m = c$). Both maps are defined as $Y^m = \{y_1^m, \dots, y_i^m, \dots\}$, with pixel values $y_i^m \in \{0, 1\}$. A pixel $y_i^m$ has a value of $1$ if it belongs to an edge or a corner, and $0$ otherwise. We apply line thickening and Gaussian blur for easier convergence during training, since this makes the loss progression continuous instead of binary: the loss is gradually reduced as the prediction approaches the target.
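As an illustration of such a soft target, the following minimal sketch builds a corner map in which every labeled corner has value 1 and the value decays smoothly around it. It is a stand-in, not the authors' pipeline: it splats a Gaussian directly instead of thickening and blurring, and the function name `corner_heatmap`, the `sigma` value and the horizontal wrap-around are assumptions made for this example.

```python
import math

def corner_heatmap(corners, width, height, sigma=2.0):
    """Soft corner map: 1.0 at each (u, v) corner, Gaussian decay around it."""
    heat = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            for cu, cv in corners:
                # Horizontal distance wraps around: a panorama is periodic in u.
                du = min(abs(u - cu), width - abs(u - cu))
                dv = v - cv
                value = math.exp(-(du * du + dv * dv) / (2.0 * sigma * sigma))
                # Keep the strongest response when several corners are close.
                heat[v][u] = max(heat[v][u], value)
    return heat

# Hypothetical labels: two corners on a tiny 16 x 8 map.
gt = corner_heatmap([(3, 4), (12, 4)], width=16, height=8)
```

Pixels far from any corner stay close to 0, so the map keeps the strong class imbalance that the weighted loss has to handle.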
Notice here that our target is considerably simpler than others that usually divide the ground truth into different classes. This contributes to the small computational footprint of our proposal. For example, [23, 38] use independent feature maps for background, wall-floor, wall-wall and wall-ceiling edges. A full image segmentation into left, front and right wall, ceiling and floor categories is performed in [8]. In [20], they represent a total of 48 different corner types by a 2D Gaussian heatmap centered at the true keypoint location. Here, instead, we only use two probability maps, one for edges and another one for corners (see the outputs in Figure 2).
The proposed FCN follows the encoder-decoder structure and builds upon ResNet-50 [16]. We replace the final fully-connected layer with a decoder that jointly predicts layout edges and corner locations, already refined. We illustrate the proposed architecture in Figure 2.
Encoder. Most deep-learning approaches to the layout recovery problem have used VGG16 [28] as encoder [23, 8, 20]. Instead, [38] builds its model over ResNet-101 [16], outperforming the state of the art. Here, we use ResNet-50 [16], pre-trained on the ImageNet dataset [26], which leads to a faster convergence due to the general low-level features learned from ImageNet. Residual networks allow us to increase the depth without increasing the number of parameters with respect to their plain counterparts. In ResNet-50, this yields a receptive field large enough for our input resolution.
Decoder. Most of the recent work [23, 39, 24] builds two output branches for multi-task learning, which increases the computation time and the network parameters. We instead propose a single branch with two output channels, corner and edge maps, which helps to reinforce the quality of both map types. In the decoder, we combine two different ideas. First, skip-connections [25] from the encoder to the decoder. Specifically, we concatenate “up-convolved” features with their corresponding features from the contracting part. Second, we make preliminary predictions at lower resolutions, which are also concatenated and fed back to the network following the spirit of [10], ensuring that the early stages of internal features aim for the task. We use ReLU as the non-linear function except for the prediction layers, where we use Sigmoid.
We propose two variations of the network architecture for two different convolution operations (Figure 2). The first one, CFL StdConvs, convolves the feature maps with Standard Convolutions and uses up-convolutions to decode the output. The second one, CFL EquiConvs, uses Equirectangular Convolutions both in the encoder and the decoder, using unpooling to upsample the output. Equirectangular Convolutions are deformable convolutions that adapt their size and shape depending on the position in the equirectangular image, for which we propose a new implementation in Section 4.
Edge and corner maps are learned through a pixel-wise sigmoid cross-entropy loss function. Since we know a priori that the natural distribution of pixels in these maps is extremely unbalanced (the vast majority have a value of $0$), we introduce weighting factors to make the training stable. Defining $1$ and $0$ as the positive and negative labels, the weighting factors are defined as $w_t = N / n_t$, with $N$ the total number of pixels and $n_t$ the number of pixels of class $t$ per sample. The per-pixel per-map loss $L_i^m$ is as follows:
$$L_i^m = w_1 \, y_i^m \left(-\log \hat{y}_i^m\right) + w_0 \left(1 - y_i^m\right) \left(-\log\left(1 - \hat{y}_i^m\right)\right) \qquad (1)$$
where $y_i^m$ is the GT for pixel $i$ in the map $m$ and $\hat{y}_i^m$ is the network output for pixel $i$ and map $m$. We minimize this loss at 4 different resolutions $k = 1, \dots, 4$, specifically in the network output ($k = 4$) and 3 intermediate layers ($k = 1, \dots, 3$). The total loss is then the sum over all pixels, the 4 resolutions and both the edge ($m = e$) and corner ($m = c$) maps:
$$L = \sum_{k=1}^{4} \sum_{m \in \{e, c\}} \sum_{i} L_i^m[k] \qquad (2)$$
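The class-balanced loss can be sketched in a few lines of plain Python. This is only a sketch under stated assumptions: the maps are flattened to lists of binary labels and probabilities, the names `weighted_bce` and `total_loss` and the clipping constant `eps` are hypothetical, and a class that is absent from a sample simply gets weight zero.

```python
import math

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Class-balanced sigmoid cross-entropy summed over the pixels of one map."""
    n = len(y_true)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    # Weight of class t is N / n_t; an absent class gets weight 0 (assumption).
    w_pos = n / n_pos if n_pos else 0.0
    w_neg = n / n_neg if n_neg else 0.0
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += w_pos * y * -math.log(p) + w_neg * (1 - y) * -math.log(1.0 - p)
    return total

def total_loss(pairs):
    """Sum the map loss over (ground truth, prediction) pairs: both maps, every resolution."""
    return sum(weighted_bce(y, p) for y, p in pairs)

# One positive pixel out of four, with an uninformative prediction of 0.5 everywhere.
loss = weighted_bce([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

With these weights the single positive pixel contributes as much to the loss as the three negative pixels together, which is the balancing effect the weighting factors are meant to achieve.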
Aiming for a fast end-to-end model, CFL avoids post-processing and strong scene assumptions and simply follows a natural transformation from corner coordinates to the 2D and 3D layout. The 2D corner coordinates are the maximum activations in the probability map. Assuming that the corner set is consistent, they are directly joined, from left to right, in the unit sphere space and re-projected to the equirectangular image plane. From this 2D layout, we infer the 3D layout by only assuming ceiling-floor parallelism, leaving the wall structure unconstrained, i.e., we do not force the usual Manhattan perpendicularity between walls. Corners are projected to the floor and ceiling planes given a unit camera height (a trivial choice, as results are up to scale). See Figure 3.
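The projection of a corner onto the floor and ceiling planes can be illustrated as follows. This is a hedged sketch, not the released code: it assumes the usual equirectangular mapping (longitude along the image width, latitude along the height, horizon at mid-height), a y-up coordinate frame, and that each ceiling corner lies directly above its floor corner; all function names are hypothetical.

```python
import math

def pixel_to_angles(u, v, width, height):
    """Equirectangular pixel -> (longitude phi, latitude theta) in radians."""
    phi = (u - width / 2.0) * 2.0 * math.pi / width
    theta = -(v - height / 2.0) * math.pi / height
    return phi, theta

def floor_corner_3d(u, v, width, height, camera_height=1.0):
    """Intersect the viewing ray of a floor corner with the plane y = -camera_height."""
    phi, theta = pixel_to_angles(u, v, width, height)
    if theta >= 0.0:
        raise ValueError("a floor corner must lie below the horizon")
    rho = camera_height / math.tan(-theta)  # horizontal distance from the camera
    return (rho * math.sin(phi), -camera_height, rho * math.cos(phi))

def ceiling_height(v_ceiling, height, rho):
    """Height above the camera of the ceiling corner seen at image row v_ceiling."""
    theta = -(v_ceiling - height / 2.0) * math.pi / height
    return rho * math.tan(theta)

# A corner straight ahead, seen 45 degrees below and 45 degrees above the horizon.
x, y, z = floor_corner_3d(u=512, v=384, width=1024, height=512)
h = ceiling_height(v_ceiling=128, height=512, rho=math.hypot(x, z))
```

Because the camera height is fixed to one, every recovered length is expressed in units of camera height, which matches the up-to-scale nature of the result.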
Limitations of CFL: We directly join corners from left to right, meaning that our end-to-end model would not work if any wall is occluded because of the convexity of the scene. In those particular cases, the joining process should follow a different order. [13] proposes a geometry-based post-processing that could alleviate this problem, but its cost is high and it needs the Manhattan World assumption. The addition of this post-processing into our work, in any case, could be done similarly to [12].
Spherical images are receiving increasing attention due to the growing number of omnidirectional sensors in drones, robots and autonomous cars. A naïve application of convolutional networks to an equirectangular projection is not, in principle, a good choice due to the space-varying distortions introduced by such a projection.
In this section we present a convolution that we name EquiConv, which is defined in the spherical domain instead of the image domain and is implicitly invariant to the distortions of the equirectangular representation. The kernel in EquiConvs is defined as a spherical surface patch (see Figure 4). We parametrize its receptive field by the angles $\alpha_w$ and $\alpha_h$. Thus, we directly define a convolution over the field of view. The kernel is rotated and applied along the sphere, and its position is defined by the spherical coordinates ($\phi$ and $\theta$ in the figure) of its center. Unlike standard kernels, which are parameterized by their size $k_w \times k_h$, with EquiConvs we define the angular size ($\alpha_w \times \alpha_h$) and resolution ($r_w \times r_h$). In practice, we keep the aspect ratio, $\alpha_w / r_w = \alpha_h / r_h$, and we use square kernels, so from now on we will refer to the field of view as $\alpha$ ($\alpha_w = \alpha_h$) and to the resolution as $r$ ($r_w = r_h$).
As we increase the resolution of the kernel, the angular distance between the elements decreases, with the intuitive upper limit of not giving more resolution to the kernel than the image itself. In other words, the kernel is defined on a sphere whose radius is less than or equal to the image sphere radius. EquiConvs can also be seen as a general model for spherical Atrous Convolutions [4, 5], where the kernel size is what we call resolution, and the rate is the field of view of the kernel divided by the resolution. An example of how EquiConvs change when modifying $\alpha$ and $r$ can be seen in Figure 5.
In [7], they introduce deformable convolutions by learning additional offsets from the preceding feature maps. Offsets are added to the regular kernel locations in the Standard Convolution enabling free form deformation of the kernel.
Inspired by this work, we deform the shape of the kernels according to the geometrical priors of the equirectangular image projection. To do that, we generate offsets that are not learned but fixed given the spherical distortion model and constant over the same horizontal locations. Here, we describe how to obtain the distorted pixel locations from the original ones.
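Let us define $(u_{0,0}, v_{0,0})$ as the pixel location on the equirectangular image where we apply the convolution operation (i.e. the image coordinate where the center of the kernel is located). First, we define the coordinates for every element in the kernel and afterwards we rotate them to the point of the sphere where the kernel is being applied. We define each point of the kernel as
$$\hat{p}_{ij} = \begin{bmatrix} \hat{x}_{ij} \\ \hat{y}_{ij} \\ \hat{z}_{ij} \end{bmatrix} = \begin{bmatrix} i \\ j \\ d \end{bmatrix} \qquad (3)$$
where $i$ and $j$ are integers in the range $\left[-\frac{r-1}{2}, \frac{r-1}{2}\right]$ and $d$ is the distance from the center of the sphere to the kernel grid. In order to cover the field of view $\alpha$,
$$d = \frac{r}{2 \tan\left(\frac{\alpha}{2}\right)} \qquad (4)$$
We project each point onto the sphere surface by normalizing the vectors, and rotate them to align the kernel center with the point where the kernel is applied:
$$p_{ij} = \begin{bmatrix} x_{ij} \\ y_{ij} \\ z_{ij} \end{bmatrix} = R_y(\phi_{0,0}) \, R_x(\theta_{0,0}) \, \frac{\hat{p}_{ij}}{\lVert \hat{p}_{ij} \rVert} \qquad (5)$$
where $R_a(\beta)$ stands for a rotation matrix of an angle $\beta$ around the $a$ axis. $\phi_{0,0}$ and $\theta_{0,0}$ are the spherical angles of the center of the kernel (see Figure 4), and are defined as
$$\phi_{0,0} = \left(u_{0,0} - \frac{W}{2}\right) \frac{2\pi}{W}; \qquad \theta_{0,0} = -\left(v_{0,0} - \frac{H}{2}\right) \frac{\pi}{H} \qquad (6)$$
where $W$ and $H$ are, respectively, the width and height of the equirectangular image in pixels. Finally, the rest of the elements are back-projected to the equirectangular image domain. First, we convert the unit sphere coordinates to latitude and longitude angles:
$$\phi_{ij} = \arctan\left(\frac{x_{ij}}{z_{ij}}\right); \qquad \theta_{ij} = \arcsin\left(y_{ij}\right) \qquad (7)$$
And then, to the original 2D equirectangular image domain:
$$u_{ij} = \left(\frac{\phi_{ij}}{2\pi} + \frac{1}{2}\right) W; \qquad v_{ij} = \left(-\frac{\theta_{ij}}{\pi} + \frac{1}{2}\right) H \qquad (8)$$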
In Figure 6 we show how these offsets are applied to a regular kernel; and in Figure 7 three kernel samples on the spherical and on the equirectangular images.
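The construction of the distorted sampling locations can be sketched as below. This is an illustrative sketch, not the released implementation: the function name `equiconv_locations`, the default field of view (the angle spanned by `r` pixels at the equator), the explicit rotation signs (chosen so that the kernel center lands back on its own pixel) and the negation of the vertical kernel index (so that kernel rows follow image rows) are all assumptions of this example.

```python
import math

def equiconv_locations(u0, v0, width, height, r=3, fov=None):
    """Return an r x r grid of (u, v) sampling positions for a kernel centred at (u0, v0)."""
    if fov is None:
        # Assumption: the kernel spans the angle covered by r pixels at the equator.
        fov = r * 2.0 * math.pi / width
    d = r / (2.0 * math.tan(fov / 2.0))  # distance from sphere centre to kernel grid
    phi0 = (u0 - width / 2.0) * 2.0 * math.pi / width
    theta0 = -(v0 - height / 2.0) * math.pi / height
    half = (r - 1) // 2
    grid = []
    for j in range(-half, half + 1):          # kernel rows, top to bottom
        row = []
        for i in range(-half, half + 1):      # kernel columns, left to right
            # Kernel element on the tangent grid, normalised onto the unit sphere.
            # y is negated so that kernel rows follow image rows (v grows downwards).
            norm = math.sqrt(i * i + j * j + d * d)
            x, y, z = i / norm, -j / norm, d / norm
            # Tilt by theta0 (about the x axis), then pan by phi0 (about the y axis).
            y1 = y * math.cos(theta0) + z * math.sin(theta0)
            z1 = -y * math.sin(theta0) + z * math.cos(theta0)
            x2 = x * math.cos(phi0) + z1 * math.sin(phi0)
            z2 = -x * math.sin(phi0) + z1 * math.cos(phi0)
            # Back to longitude/latitude, then to pixel coordinates.
            phi = math.atan2(x2, z2)
            theta = math.asin(max(-1.0, min(1.0, y1)))
            u = (phi / (2.0 * math.pi) + 0.5) * width
            v = (-theta / math.pi + 0.5) * height
            row.append((u % width, v))        # longitude wraps around the seam
        grid.append(row)
    return grid

equator = equiconv_locations(128, 64, width=256, height=128)
near_pole = equiconv_locations(128, 8, width=256, height=128)
```

At the equator the right-hand neighbour of the kernel center falls almost exactly one pixel away, as in a standard convolution, while near the pole the same kernel element is sampled several pixels away, which is the horizontal stretching that the fixed offsets compensate. The offsets of a deformable convolution would then be the differences between these positions and the regular grid.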
Table 1: Ablation study on SUN360 for edge and corner map prediction. Bigger is better for all metrics.
Conv. Type | IntPred | Edges | Edges IoU | Edges Acc | Edges P | Edges R | Edges F1 | Corners IoU | Corners Acc | Corners P | Corners R | Corners F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
StdConvs | - | - | - | - | - | - | - | | | | | |
StdConvs | - | ✓ | | | | | | | | | | |
StdConvs | ✓ | ✓ | 0.588 | 0.933 | 0.782 | 0.691 | 0.733 | 0.465 | 0.974 | 0.872 | 0.498 | 0.632 |
EquiConvs | - | - | - | - | - | - | - | | | | | |
EquiConvs | - | ✓ | | | | 0.686 | | | | | 0.491 | |
EquiConvs | ✓ | ✓ | 0.575 | 0.931 | 0.789 | | 0.722 | 0.460 | 0.974 | 0.887 | | 0.627 |
We present a set of experiments to evaluate CFL using both Standard Convolutions (StdConvs) and the proposed Equirectangular Convolutions (EquiConvs). We do not only analyze how well it predicts edge and corner maps, but also the impact of each algorithmic component through ablation studies. We report the performance of our proposal in two different datasets, and show qualitative 2D and 3D models of different indoor scenes.
We use two public datasets that comprise several indoor scenes, SUN360 [32] and Stanford (2D-3D-S) [2] in equirectangular projection (360). The former is used for ablation studies, and both are used for comparison against several state-of-the-art baselines.
SUN360 [32]: We use 500 bedroom and living room panoramas from this dataset, labeled by Zhang et al. [37]. We use these labels but, since all panoramas were labeled as box-type rooms, we hand-label and substitute 35 panoramas to represent more faithfully the actual shapes of the rooms. We randomly split the raw dataset into 85% training scenes and 15% test scenes, making sure that there are rooms with more than 4 walls in both partitions.
The input to the network is a single panoramic RGB image. The outputs are, on the one hand, the room layout edge map and, on the other hand, the corner map. A widely used strategy to improve generalization of neural networks is data augmentation. We apply random erasing, horizontal mirroring and horizontal rotation of input images during training. The weights are all initialized using ResNet-50 [16] trained on ImageNet [26]. For CFL EquiConvs we use the same kernel resolutions and fields of view as in ResNet-50. This means that, for a standard 3×3 kernel applied to a W×H feature map, the EquiConv kernel keeps a resolution of 3 and covers the same field of view as the standard kernel. We minimize the cross-entropy loss using Adam [19], regularized by penalizing the loss with the sum of the L2 norms of all weights. The learning rate is exponentially decayed every epoch, and we apply dropout.
The network is implemented using TensorFlow [1] and trained and tested on an NVIDIA Titan X. Training StdConvs takes around one hour, and EquiConvs take longer both in training and at test time, where StdConvs need 0.46 seconds per image (Table 4).
We measure the quality of our predicted probability maps using five standard metrics: intersection over union of predicted corner/edge pixels IoU, precision P, recall R, F1 Score and accuracy Acc. Table 1 summarizes our results and allows us to answer the following questions:
What are the effects of different convolutions? As one would expect, EquiConvs, aware of the distortion model, learn in a non-distorted generic feature space and achieve accurate predictions, like StdConvs on conventional images [20]. However, and counterintuitively, StdConvs, which ignore the distortion model, rely on the image patterns that the distortion generates and obtain similar performance (see Table 1). Distortion understanding, nonetheless, gives the network other advantages. While StdConvs learn a strong bias in the correlation between features and distortion patterns (e.g. ceiling line on the top of the image or clutter in the mid-bottom), EquiConvs are invariant to that. For this reason, the performance of EquiConvs does not degrade when varying the camera 6DOF pose (see Section 5.4). Additionally, EquiConvs allow a more direct use of networks pre-trained on conventional images. Specifically, this translates into a faster convergence, which is desirable as, to date, 360 datasets contain far fewer images than datasets with conventional images. Moreover, Tateno et al. demonstrate in their recent work [30] that other tasks like depth prediction, panoramic monocular SLAM, panoramic semantic segmentation and panoramic style transfer can also benefit from this type of convolution.
How can we refine predictions? There are some techniques that we can use in order to obtain more accurate and refined predictions. Here, we make pyramid preliminary predictions in the decoder and iteratively refine them, by feeding them back to the network, until the final prediction. Also, although we only use the corner map to recover the layout of the room, we train the network to additionally predict edge maps as an auxiliary task. This is another representation of the same task that ensures that the network learns to exploit the relationship between both outputs, i.e., the network learns how edges intersect between them generating the corners. The improvement is shown in the Table 1.
Table 2: Edge and corner prediction metrics for StdConvs and EquiConvs under camera translation (-0.3h:+0.3h) and rotation.
How can we deal with occlusions? We do Random Erasing Data Augmentation. This operation randomly selects rectangles in the training images and removes its content, generating various levels of virtual occlusion. In this manner we simulate real situations where objects in the scene occlude the corners of the room layout, and force the network to learn context-aware features to overcome this challenging situation. Figure 8 illustrates this strategy with an example.
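A minimal sketch of this augmentation is given below. It assumes the image is stored as a list of pixel rows, that a single rectangle is erased per call and that the erased area is filled with a constant; the function name `random_erase` and the `max_fraction` bound are hypothetical choices for this example.

```python
import random

def random_erase(image, max_fraction=0.3, fill=0.0, rng=random):
    """Return a copy of `image` (a list of pixel rows) with one random rectangle overwritten."""
    height, width = len(image), len(image[0])
    # Rectangle size: at least one pixel, at most max_fraction of each dimension.
    erase_h = rng.randint(1, max(1, int(height * max_fraction)))
    erase_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - erase_h)
    left = rng.randint(0, width - erase_w)
    erased = [row[:] for row in image]  # the input image is left untouched
    for v in range(top, top + erase_h):
        for u in range(left, left + erase_w):
            erased[v][u] = fill
    return erased

# Only the input image is erased; the target edge and corner maps stay as they are.
panorama = [[1.0] * 20 for _ in range(10)]
occluded = random_erase(panorama, rng=random.Random(0))
```

Because the targets are not modified, the network is still asked to predict corners that lie inside the erased rectangle, which is what pushes it towards context-aware features.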
Is it possible to relax the scene assumptions while keeping a good performance? Our end-to-end approach overcomes the Manhattan assumption as well as the box-type simplification (four-wall rooms). On the one hand, although we label some panoramas more accurately according to their actual shape, we still have a largely unbalanced dataset. We address this problem by forcing every batch to always include one non-box sample. This favors the learning of more complex rooms despite having few examples. On the other hand, while recent works [39, 13, 37] use pre-computed vanishing points and posterior optimizations, here we directly obtain the corner coordinates from the FCN output without applying geometric constraints. In Figure 9 we show two examples where CFL predicts more than four walls. Notice also the non-Manhattan ceiling in the left image.
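The batch construction can be sketched as follows, assuming the training set is already split into identifiers of box-shaped and non-box rooms. The function name `sample_batch` and the batch size used in the example are placeholders, not values taken from the paper.

```python
import random

def sample_batch(box_ids, non_box_ids, batch_size, rng=random):
    """Draw a batch without repetition that contains at least one non-box room."""
    forced = rng.choice(non_box_ids)
    # The remaining slots are filled from the whole dataset, excluding the forced sample.
    pool = [i for i in box_ids + non_box_ids if i != forced]
    batch = [forced] + rng.sample(pool, batch_size - 1)
    rng.shuffle(batch)
    return batch

box_rooms = list(range(100))      # hypothetical ids of box-shaped rooms
non_box_rooms = [100, 101, 102]   # hypothetical ids of rooms with more complex shapes
batch = sample_batch(box_rooms, non_box_rooms, batch_size=8, rng=random.Random(0))
```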
With the motivation of exploiting the potential of EquiConvs, we test our model with previously unseen images where the camera viewpoint is different from that in the training set. The distortion in the equirectangular projection is location dependent; specifically, it depends on the polar angle. Since EquiConvs are invariant to this distortion, it is interesting to see how modifications in the camera extrinsic parameters (translation and rotation) affect the model performance using EquiConvs against StdConvs. When we generate translations over the vertical axis and rotations, the shape of the layout is modified by the distortion, losing its characteristic pattern (which StdConvs use in their favor).
Since standard datasets have a strong bias when referring to camera pose and rotation, we synthetically render these transformations along our test set. The rotation is trivial as we work on the spherical domain. As the complete 3D dense model of the rooms is not available, the translation simulation is performed by using the existing information, ignoring occlusions produced by viewpoint changes. Nevertheless, as we do not work with wide translations the effect is minimal and images are realistic enough to prove the point we want to highlight (see Figure 10).
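For the special case of a rotation about the vertical axis, the transformation reduces to a circular shift of the image columns, which is also how a horizontal rotation or mirroring can be applied as data augmentation. The sketch below covers only this case, assumes images stored as lists of pixel rows, and uses hypothetical function names; a general 3D rotation would instead require resampling the panorama on the sphere.

```python
def rotate_horizontal(image, degrees):
    """Rotate an equirectangular image about the vertical axis by shifting its columns."""
    width = len(image[0])
    shift = round(degrees / 360.0 * width) % width
    if shift == 0:
        return [row[:] for row in image]
    # Columns leaving on the right re-enter on the left: the panorama is periodic.
    return [row[-shift:] + row[:-shift] for row in image]

def mirror_horizontal(image):
    """Flip an equirectangular image left to right."""
    return [row[::-1] for row in image]

# The same transform must be applied to the image and to its edge and corner maps.
rotated = rotate_horizontal([[0, 1, 2, 3]], degrees=90)
```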
For both experiments, we uniformly sample from a minimum to a maximum transformation and calculate the mean and standard deviation for all the metrics. What we see in Table 2 is that we obtain higher mean values and smaller standard deviations by using EquiConvs. This means that EquiConvs make the model more robust and generalizable to real-life situations not covered in the datasets, such as panoramas taken by hand, drones or small robots. This effect is especially visible in the evaluation of the edges, since it is their appearance that is most modified by these changes of the camera.
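The five map-quality metrics reported in Tables 1 and 2 can be computed from a predicted probability map and its ground truth as in the following sketch. It assumes both maps are binarized with a threshold of 0.5, which is a choice made for this example, and the function name `map_metrics` is hypothetical.

```python
def map_metrics(pred, target, threshold=0.5):
    """IoU, precision, recall, F1 and accuracy of a binarized edge or corner map."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on, t_on = p >= threshold, t >= threshold
            if p_on and t_on:
                tp += 1
            elif p_on:
                fp += 1
            elif t_on:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"IoU": iou, "P": precision, "R": recall, "F1": f1, "Acc": accuracy}

# One true positive, one false positive, one false negative and one true negative.
scores = map_metrics([[0.9, 0.8, 0.1, 0.2]], [[1, 0, 1, 0]])
```

Averaging these scores over the sampled translations or rotations, and taking their standard deviation, gives the kind of summary described above.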
We evaluate our layout predictions using three standard metrics, 3D intersection over union (3DIoU), corner error (CE) and pixel error (PE), and compare ourselves against four approaches from the state of the art [37, 39, 13, 34]. Pano2CAD [33] has neither source code available nor an evaluation of layouts, making direct comparison difficult. The pixel error metric given by [39] only distinguishes between ceiling, floor and walls. Instead, our proposed segmented mask distinguishes between ceiling, floor and each wall separately, which is more informative since it also takes into account errors in wall-wall boundaries. For all experiments, only the SUN360 dataset is used for training. Table 3 shows the performance of our proposal when testing on both datasets, SUN360 and Stanford 2D-3D. Results are averaged across all images. It can be seen that our approach clearly outperforms the state of the art in all the metrics.
It is worth mentioning that our approach not only obtains better accuracy but also recovers shapes more faithful to the real ones, since it can handle non box-type room designs with few training examples. In Table 4 we show that, apart from achieving better localization of layout boundaries and corners, our end-to-end approach is much faster. Our full method with StdConvs takes just 0.46 seconds to process one room, and the EquiConvs variant is slower. This speed is a major advantage considering that the aforementioned applications of layout recovery (robot navigation, AR/VR) need to run in real time.
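Table 3: Layout results on both test datasets. Bigger is better for 3DIoU; smaller is better for the error metrics.
Test | Method | 3DIoU | CE | PE (ceiling, floor, walls) | PE (each wall separately) |
---|---|---|---|---|---|
SUN360 | PanoContext [37] | | | | |
SUN360 | Fernandez [13] | - | - | - | |
SUN360 | LayoutNet [39] | - | | | |
SUN360 | DuLa-Net [34] | - | - | - | |
SUN360 | CFL StdConvs | 78.79 | | 2.49 | 3.33 |
SUN360 | CFL EquiConvs | | 0.78 | | |
Std.2D3D | Fernandez [13] | - | - | - | |
Std.2D3D | CFL StdConvs | | 1.44 | 4.75 | 6.05 |
Std.2D3D | CFL EquiConvs | 65.23 | | | |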
Method | Computation Time (s) |
---|---|
PanoContext [37] | |
LayoutNet [39] | |
DuLa-Net [34] | |
CFL EquiConvs | |
CFL StdConvs | 0.46 |
In this work we present CFL, the first end-to-end algorithm for layout recovery in 360 images. Our experimental results demonstrate that our predicted layouts are clearly more accurate than the state of the art. Additionally, the removal of extra pre- and post-processing stages makes our method much faster than other works. Finally, being entirely data-driven removes the geometric assumptions that are commonly used in the state of the art and that limit its usability in complex geometries. We present two different variants of CFL. The first one, implemented using Standard Convolutions, reduces the computation time by a factor of 100 and is very suitable for images taken with a tripod. The second one uses our proposed implementation of Equirectangular Convolutions, which adapt their shape to the equirectangular projection of the spherical image. This proves to be more robust to translations and rotations of the camera, making it ideal for panoramas taken by a hand-held camera.
Acknowledgement: This project was in part funded by the Spanish government (DPI2015-65962-R, DPI2015-67275), the Regional Council of Bourgogne-Franche-Comté (2017-9201AAO048S01342) and the Aragon government (DGA-T45_17R/FSE). We also thank Nvidia for their Titan X and Xp donation. Also, we would like to acknowledge Jesus Bermudez-Cameo for his valuable discussions and alpha testing.
Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv, Feb. 2017.
IEEE Conference on Computer Vision and Pattern Recognition, pages 616–624, 2016.
A dynamic bayesian network model for autonomous 3D reconstruction from a single indoor image. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 2418–2428, 2006.