
Transferable End-to-end Room Layout Estimation via Implicit Encoding

by Hao Zhao, et al.

We study the problem of estimating room layouts from a single panorama image. Most previous works have two stages: feature extraction and parametric model fitting. Here we propose an end-to-end method that directly predicts parametric layouts from an input panorama image. It exploits an implicit encoding procedure that embeds parametric layouts into a latent space. Learning a mapping from images to this latent space then makes end-to-end room layout estimation possible. However, end-to-end methods have several notorious drawbacks despite many intriguing properties. A widely raised criticism is that they are troubled by dataset bias and do not transfer to unfamiliar domains. Our study echoes this common belief. To address this, we propose to use semantic boundary prediction maps as an intermediate domain. This brings a significant performance boost on four benchmarks (Structured3D, PanoContext, S3DIS, and Matterport3D), notably in the zero-shot transfer setting. Code, data, and models will be released.





1 Introduction

Room layout estimation is the task of recovering parametric room structure elements (e.g., walls, floors and ceilings) from images. Efficient and effective room layout estimation would benefit many robotics and graphics applications. We notice that existing state-of-the-art methods all share a common paradigm of fitting on features, which consists of two stages. In the first stage, a fully convolutional neural network extracts semantic cues from inputs. Semantic cues may come in various forms like keypoints, boundaries or facets. In the second stage, parametric representations are fitted on these cues, using tailored cost functions.

Different from these methods, we pursue end-to-end room layout estimation. There are works [chen2020bsp][li2019supervised] that predict compact parametric shape representations for objects from clean inputs. Yet whether this is possible for cluttered scenes remains unclear. We believe giving a positive answer to this question is methodologically meaningful. The other motivation is building a solution that naturally benefits from the ongoing progress of deep learning. Once the problem is addressed by an end-to-end neural network, new generic techniques can be incorporated seamlessly, including but not limited to layers, losses or training schemes.

Specifically, we resort to the idea of implicit encoding [park2019deepsdf][mescheder2019occupancy][chen2020bsp]. A shape is represented as a latent vector, on which an implicit function is conditioned. The latent vector lives in a space that is a surrogate of the structured output space of room layouts, on which we can build discriminative models. This space bridges the gap between sensory inputs and parametric representations, making end-to-end room layout estimation possible.

However, end-to-end methods often struggle with generalization. When training such a naive end-to-end model on the largest known synthetic dataset, Structured3D [zheng2020structured3d], we achieve about 82% top-view IoU (Table 1). Unfortunately, evaluating the trained model on three other small-scale benchmarks, PanoContext [zhang2014panocontext], S3DIS [armeni2016s3dis] and Matterport3D [zou2021manhattan], leads to 67.04%, 51.43% and 24.68% IoU, respectively (Table 2). This echoes the common belief that end-to-end methods are troubled by dataset bias [agrawal2018assume][zhao2019lds]. Even worse, fine-tuning on these smaller datasets often yields even lower performance. We delve into this problem and identify two sources of domain drift.

The first is related to a biased shape embedding regressor. Datasets vary in both low-level and high-level properties. Photo-realistic rendering in Structured3D cannot reproduce all subtle visual effects caused by real-world materials and lighting, and consequently differs from real images in low-level details. S3DIS predominantly features office rooms while other datasets do not. Office rooms do not contain typical home furniture, making S3DIS different from other datasets in terms of high-level scene composition. As such, the first domain drift results from regressing layout shape embeddings from RGB inputs with different statistics. The second is related to a biased shape embedding space. Similar to the widely studied domain-adaptive road scene parsing problem [richter2016playing], the second source is comparable to the issue of label set mismatch.

We develop two techniques to address domain drift:

The first is to pre-process panoramic images into semantic boundary prediction maps (Fig. 3). It is inspired by semantic transfer [zhao2017physics], but differs from it in terms of motivation. [zhao2017physics] aims at proper initialization, while our aim is to address zero-shot transferability. This technique works surprisingly well, despite the fact that this map is spatially sparse (i.e., most pixels have nearly zero values) and suffers from radial distortion. Specifically, it improves the IoU on PanoContext, S3DIS and Matterport3D to 84.43%, 82.69% and 73.73%, respectively (Table 2). Notably, this performance boost is achieved without fine-tuning.

The second is to improve the implicit encoding step with an enormous amount of synthetic data. Using synthetic data for 3D scene understanding is an actively explored topic, yet has seen limited success thus far [su2015render][song2017ssc][zhang2017physically]. Our domain, the top-view layout occupancy image, is very simple and hardly influenced by the sim-to-real gap. Translating a wall in the occupancy image faithfully generates new in-domain samples. Specifically, we use the training set of Structured3D [zheng2020structured3d] as anchors and generate one million synthetic samples via a conditional uniform augmentation strategy. It leads to nearly perfect implicit self-encoding performance on PanoContext (98.91%) and S3DIS (98.88%), as reported in Table 3. Again, this is achieved without fine-tuning.

2 Related Work

Room Layout Estimation. The task of estimating room layouts from perspective images was first introduced by [hedau2009recovering]. Line segments are clustered according to three Manhattan directions [coughlan1999manhattan]. Sampling lines originating from three orthogonal vanishing points yields room layout proposals, which can be later ranked by a discriminative model [tsochantaridis2005large]. Handcrafted statistical features like geometric context [hoiem2005popup] or orientation maps [lee2009geometric] show strong discriminative power for this task. An interesting feature of this scheme is that 3D object boxes can be sampled and ranked in a similar way, naturally leading to the joint parsing of objects and layouts [hedau2010thinking][lee2010estimating]. A further extension is to build a Bayesian model that is aware of the prior distribution of object-layout relationship. Although inference is usually expensive, better 3D scene parsing results can be achieved [zhao2013scene][choi2015gp].

After the advent of deep learning, robust features generated by fully convolutional architectures improved room layout estimation performance by large margins. Pixel-wise semantic features come in various forms like boundaries [mallya2015learning][zhao2017physics], facets [dasgupta2016delay][ren2016cfile] or keypoints [lee2017roomnet][huang2018hopr]. However, these methods still rely on post-processing algorithms to generate parametric layout results. These algorithms can be as simple as keypoint linking [lee2017roomnet] or as complex as MCMC sampling [huang2018hopr]. We call this paradigm fitting on features.

Panoramic room layout estimation was first proposed by [zhang2014panocontext], which adapts perspective techniques to this problem. Panoramic images have a 360° field of view, and are thus suitable for inferring the layout of a whole room. LayoutNet [zou2021manhattan] learns keypoint/boundary cues from panoramic RGB and remapped segment images, using deep networks. HorizonNet [sun2019horizonnet] introduces a tailored column-wise representation, making use of calibrated panoramic images whose rows are exactly parallel to the horizon. Dula-Net [yang2019dula] learns deep occupancy cues on the remapped top view of panoramic images. CFL [fernandez2020cfl] explores the usage of equirectangular convolution for better representation learning on distorted panoramic images. However, all these methods are designed based on the fitting on features scheme. Different from them, we propose the first end-to-end room layout estimation method that directly predicts parametric layouts without fitting in a post-processing stage.

Parametric Reconstruction. Morphable models are principled representations for parametric reconstruction. They have been successfully used for faces [blanz1999morphable], bodies [loper2015smpl] and chairs [wu2018interpreter]. High-dimensional geometric data is projected onto a set of base shapes, and the coefficient vector is treated as a compact parametric representation. As such, regressing these coefficients naturally allows end-to-end parametric reconstruction from sensory inputs, e.g., using deep neural networks. However, it is not clear if these techniques can be applied to the shape space of room layouts. Deforming a rectangular room into a non-convex room with 14 walls is different from deforming a neutral face into a smiling one. To this end, we borrow the recently proposed idea of deep binary space partitioning [chen2020bsp]. We choose a set of hyperplanes as the parameterization, which can be converted to an implicit representation for differentiable self-supervision.

Synthetic Data. Using synthetic data for deep learning is an exciting topic as it provides a virtually unlimited amount of data. But the photo-realism of current rendering techniques is still far from satisfactory. Render4CNN [su2015render] shows improved pose estimation performance using cropped object images. But this success does not transfer to pixel-wise detailed understanding, as evidenced by the limited accuracy boost reported in [zhang2017physically]. We argue that the right way to use synthetic data is to choose a proper domain that is less influenced by rendering artifacts. For example, semantic scene completion [song2017ssc][zhang2018sgc] benefits from an enormous corpus of synthetic depth images because rendering only geometry is obviously easier than rendering images.

3 Method

Figure 1: An illustration of implicit encoding. (h) can be supervised by accessing input (a) with coordinates in (d). (i) is generated by thresholding (h) with 0.5. (j) is the same as (h) but visualized with color temperature. Other details are described in text.

3.1 Implicit Encoding

The objective of implicit encoding is to turn a room layout into a latent code, so that we can predict the code using a neural network. Getting a code during test time means getting its corresponding layout, which is exactly what we pursue: end-to-end room layout estimation. We choose the (rasterized) top-down occupancy image (Fig. 1-a) as the representation of the room layout. Training an auto-encoder on this image with a pixel-wise reconstruction loss is a natural way to get the code. However, in this way, the code can only be used to recover a rasterized occupancy image, which is of limited use in many scenarios.
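As a minimal sketch of what a rasterized top-down occupancy image could look like, the snippet below fills a floor-plan polygon on a regular pixel grid using the even-odd rule. The function name `rasterize_layout`, the normalized extent and the resolution are illustrative assumptions; the text above does not specify how the projection is rasterized.

```python
import numpy as np

def rasterize_layout(corners, resolution=64, extent=1.0):
    """Rasterize a simple polygon (list of (x, y) corners) into a binary
    top-down occupancy image: 1 inside the room, 0 outside (assumed convention)."""
    step = 2.0 * extent / resolution
    centers = -extent + step * (np.arange(resolution) + 0.5)  # pixel centers
    xs, ys = np.meshgrid(centers, centers)  # xs varies along columns, ys along rows
    inside = np.zeros((resolution, resolution), dtype=bool)
    n = len(corners)
    for i in range(n):
        x0, y0 = corners[i]
        x1, y1 = corners[(i + 1) % n]
        # Even-odd rule: a horizontal ray from the pixel toggles the state
        # each time it crosses a wall segment.
        crosses = (y0 > ys) != (y1 > ys)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at = x0 + (ys - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (xs < x_at)
    return inside.astype(np.float32)

square = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
occupancy = rasterize_layout(square, resolution=64, extent=1.0)
```

A pixel-wise auto-encoder would be trained to reproduce `occupancy` directly, which is why its code could only be decoded back into a raster of this kind.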

Instead of pixel-wise auto-encoding, our implicit encoding module (detailed in Fig. 1) firstly maps a layout into a code, which can be later mapped to a set of hyper-planes. Intriguingly, the parameters of hyper-planes are trained in a self-supervised manner, i.e., we don’t provide ground truth parameters for training. This is achieved through an implicit occupancy function that links the hyper-plane parameters with the input image. Specifically speaking, occupancy values can be calculated with the hyper-plane parameters and supervised by accessing corresponding positions in the input occupancy image, as formally stated below.

3.1.1 Self-encoding and Hyper-plane Generation

The input occupancy image (Fig. 1-a) is denoted as $O \in \{0, 1\}^{r \times r}$, which is generated by projecting the layout to the top-down viewpoint. The representation of $O$ is binary, with in-room pixels having value $1$ and out-of-room pixels having value $0$. The resolution $r$ is a hyper-parameter. Fig. 1-A shows the architecture of a self-encoding convolutional network, whose final activation function is Sigmoid. This Sigmoid function keeps the latent code (Fig. 1-b) magnitude-constrained, which is critical to the success of regressing the code in the subsequent step. The self-encoding network is formally denoted as $f_\theta$:

$$z = f_\theta(O)$$

The $d$-dimensional latent code $z$ is generated by $f_\theta$. The latent code is a compact representation of the room layout, yet its individual components have no interpretable meaning. We map $z$ into a set of oriented hyper-planes, each of which partitions the space into two parts. One side of the hyper-plane is occupied by the room while the other is not. This mapping is done with a hyper-plane generation network denoted by $g_\phi$:

$$P = g_\phi(z)$$

$g_\phi$ is implemented by a multi-layer perceptron whose final layer reshapes its output array into the size of $p \times 3$. The set of hyper-planes $P$ is generated by $g_\phi$. Here $p$ is a hyper-parameter representing the number of hyper-planes. In practice, $p$ is set to a fixed value much larger than the number of walls. Each hyper-plane has three parameters corresponding to the coefficients $(a, b, c)$ in $ax + by + c = 0$. A visualization of $P$ is given in Fig. 1-c. The coordinate system of $P$ is set such that the origin is at the center of the input image and the two axes align with the pixel coordinates.

As a reminder, our learning system is self-supervised so we do not have the ground truth hyper-plane equations for the hyper-plane set $P$. In fact, it is not even possible to annotate hyper-plane equations for each layout. In order to make $P$ naturally take on the meaning we want, we render $P$ back to the input occupancy image $O$ in a differentiable manner.

3.1.2 Differentiable Rendering

This differentiable rendering is achieved through an implicit occupancy function, with the help of a set of $n$ continuous coordinates in the coordinate system of the input occupancy image $O$. These coordinates only serve as a surrogate in our training, so they neither receive gradients nor undergo optimization.

These coordinates are illustrated in Fig. 1-d. In practice, we use the homogeneous representation so that the coordinates are denoted as $X \in \mathbb{R}^{n \times 3}$. As shown in Fig. 1-C, we multiply $X$ with the transposed hyper-plane parameters $P^\top$ and apply a ReLU function, getting the initial occupancy values $D \in \mathbb{R}^{n \times p}$. $D$ corresponds to Fig. 1-e. The reason why $D$ is called initial is that we still cannot impose supervision on it. We can get ground truth occupancy values $S^{*}$ by bilinearly accessing $O$ at the coordinates $X$. However, there are no ground truth values at the level of individual hyper-planes to supervise $D$.
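The bilinear access to the occupancy image can be sketched as follows. This is an illustrative NumPy version, assuming coordinates are given in pixel units with the origin at the image center (as described for the hyper-plane coordinate system); the function name `bilinear_sample` and the border clamping are assumptions, not details taken from the paper.

```python
import numpy as np

def bilinear_sample(image, coords):
    """Sample a 2D image at continuous (x, y) coordinates.
    Coordinates are in pixel units with the origin at the image center."""
    h, w = image.shape
    # Shift center-origin coordinates to array indices and clamp to the border.
    u = np.clip(coords[:, 0] + (w - 1) / 2.0, 0, w - 1)
    v = np.clip(coords[:, 1] + (h - 1) / 2.0, 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    au, av = u - u0, v - v0  # fractional offsets inside the pixel cell
    top = (1 - au) * image[v0, u0] + au * image[v0, u1]
    bottom = (1 - au) * image[v1, u0] + au * image[v1, u1]
    return (1 - av) * top + av * bottom

toy_image = np.array([[0.0, 1.0],
                      [0.0, 1.0]])
queries = np.array([[0.0, 0.0], [0.5, 0.0], [-0.5, 0.0]])
ground_truth_occupancy = bilinear_sample(toy_image, queries)
```

Because the sampled values only serve as regression targets, the query coordinates themselves need no gradient, which matches the remark that they are not optimized.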

The first step to convert the initial occupancy values $D$ into final occupancy values $S$ is grouping. We resort to a set of learnable grouping weights $T \in \mathbb{R}^{p \times c}$, where $c$ is the number of shape primitives. The intuition is that this step groups hyper-planes into shape primitives. We first multiply $D$ with $T$, then apply a $1 - x$ function to align with the training data definition and finally impose a clamp function (as illustrated in Fig. 1-D). As such we get the intermediate occupancy values $C \in \mathbb{R}^{n \times c}$, shown in Fig. 1-f.

The final step is to combine shape primitives into a final layout shape. This is achieved by another set of learnable combining weights $W \in \mathbb{R}^{c \times 1}$. As shown in Fig. 1-E, we multiply $C$ with $W$ and clamp it. The final occupancy value $S$ is generated by $W$, which corresponds to Fig. 1-g.

Now that all components in Fig. 1 have been described, we can write them together in a single equation:

$$S = \mathrm{clamp}\Big(\mathrm{clamp}\big(1 - \mathrm{ReLU}\big(X \, g_\phi(f_\theta(O))^\top\big) \, T\big) \, W\Big)$$

The original inputs are the rasterized occupancy image $O$ and the homogeneous representation of continuous coordinates $X$. The output is the occupancy values $S$ at coordinates $X$. The ground truth occupancy values $S^{*}$ can be obtained by accessing coordinates $X$ bilinearly in $O$. There are four sets of network parameters in this equation: $\theta$, $\phi$, $T$ and $W$. These parameters can be trained by forcing the final occupancy $S$ towards $S^{*}$. Different from conventional self-encoding, the input is a rasterized representation while the output is an implicit representation, although they correspond to the same underlying signal.
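The rendering chain described above (multiply by hyper-planes, group, combine) can be sketched in a few lines of NumPy. The specific activation choices here, a ReLU on the hyper-plane responses and a `1 - x` flip before clamping, follow the formulation of [chen2020bsp] and are assumptions for illustration; the names `render_occupancy`, `planes`, `grouping` and `combining` are hypothetical.

```python
import numpy as np

def render_occupancy(coords_h, planes, grouping, combining):
    """Render occupancy at homogeneous coordinates from hyper-planes.
    coords_h: (n, 3), planes: (p, 3), grouping: (p, c), combining: (c,)."""
    # Positive response means the point lies on the outer side of a plane.
    initial = np.maximum(coords_h @ planes.T, 0.0)               # (n, p)
    # A primitive is occupied where none of its grouped planes is violated.
    intermediate = np.clip(1.0 - initial @ grouping, 0.0, 1.0)   # (n, c)
    # Combine primitives into the final layout occupancy.
    final = np.clip(intermediate @ combining, 0.0, 1.0)          # (n,)
    return initial, intermediate, final

# Toy room: the square |x| <= 1, |y| <= 1 written as four half-planes a*x + b*y + c <= 0.
planes = np.array([[1.0, 0.0, -1.0],
                   [-1.0, 0.0, -1.0],
                   [0.0, 1.0, -1.0],
                   [0.0, -1.0, -1.0]])
grouping = np.ones((4, 1))    # all four planes form one primitive
combining = np.array([1.0])   # a single primitive makes up the room
coords_h = np.array([[0.0, 0.0, 1.0],   # room center
                     [2.0, 0.0, 1.0],   # outside the room
                     [0.5, 0.5, 1.0]])  # inside the room
_, _, occupancy_values = render_occupancy(coords_h, planes, grouping, combining)
```

Every operation is a matrix product, a clamp or a ReLU, so gradients can flow from the final occupancy back to the hyper-plane parameters and both weight matrices.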

3.1.3 Loss Functions

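Formally, the first loss $\mathcal{L}_{rec}$ is occupancy value reconstruction, which penalizes the difference between the predicted occupancy values and the ground truth occupancy values.

We need regularization terms to ensure the behaviors of the grouping weights $T$ and the combining weights $W$. Following [chen2020bsp], the second loss $\mathcal{L}_{T}$ enforces each element in $T$ to be bounded in $[0, 1]$ so that it behaves like a soft grouping operator.

Finally, the third loss $\mathcal{L}_{W}$ drags the sum of $W$ towards 1 so that it functions as a combining operator.

These three loss functions altogether supervise the implicit encoding network. Note that they correspond to occupancy values, grouping weights and combining weights, which are all at a comparable scale. So we add them using the same balancing coefficient:

$$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{T} + \mathcal{L}_{W}$$
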
This implicit encoding can effectively reconstruct complicated rooms like the one shown in Fig. 1.
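One possible reading of the three loss terms is sketched below. The exact forms are assumptions: a mean squared error for reconstruction, a hinge-style penalty that is zero when every grouping weight lies in [0, 1] (the form used in [chen2020bsp]), and an absolute penalty on the sum of the combining weights, which follows the wording above literally. Note that [chen2020bsp] instead pulls each combining weight towards 1 individually, so the third term in particular should be treated as a guess.

```python
import numpy as np

def implicit_encoding_losses(pred, target, grouping, combining):
    """Hypothetical loss terms for the implicit encoding network."""
    # Occupancy reconstruction (assumed mean squared error).
    loss_rec = np.mean((pred - target) ** 2)
    # Penalize grouping weights that leave the interval [0, 1].
    loss_grouping = (np.sum(np.maximum(-grouping, 0.0))
                     + np.sum(np.maximum(grouping - 1.0, 0.0)))
    # Pull the sum of the combining weights towards 1 (literal reading of the text).
    loss_combining = np.abs(np.sum(combining) - 1.0)
    # Equal balancing coefficients, as stated in the text.
    total = loss_rec + loss_grouping + loss_combining
    return loss_rec, loss_grouping, loss_combining, total

pred = np.array([1.0, 0.0, 1.0])
target = np.array([1.0, 0.0, 1.0])
grouping = np.array([[0.5, 1.5],
                     [-0.2, 1.0]])
combining = np.array([0.5, 0.5])
l_rec, l_grp, l_cmb, l_total = implicit_encoding_losses(pred, target, grouping, combining)
```

In this toy input only the grouping term is active: the entries 1.5 and -0.2 fall outside [0, 1] by 0.5 and 0.2.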

3.1.4 Data Augmentation

Figure 2: Conditional uniform data augmentation.

Finally, we describe a conditional uniform data augmentation strategy. As a reminder, this implicit encoding network is trained between layout occupancy images and functions. This is a domain in which we can easily generate synthetic samples without rendering artifacts.

Figure 3: Left: Semantic boundary map ground truth and prediction. Right: Full and visible ground truth semantic boundary maps.

Since training generative models for structured data is still an open problem, we resort to a conditional formulation. As demonstrated in Fig. 2, for every sample in the Structured3D training set, we randomly generate synthetic rooms from it. The first step is to traverse all walls and identify the shortest one, with length $l$. The second step is to randomly select a wall to augment. The third step is to sample a value from a uniform distribution whose range is restricted according to $l$ and translate the selected wall along its normal by that value. Augmenting existing rooms and restricting the translation range guarantee that the synthetic rooms have a reasonable topology. Using this strategy, we generate a total of one million synthetic rooms (referred to as RoomAug-1M) to train the implicit encoding network.
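The three augmentation steps can be sketched on a floor-plan polygon as follows. The translation range of plus or minus half the shortest wall length (`scale=0.5`) is a hypothetical choice, since the exact bounds of the uniform distribution are not recoverable from the text. Moving both endpoints of the wall along its normal is exact only when the neighbouring walls are perpendicular to it, as in Manhattan rooms.

```python
import numpy as np

def augment_room(corners, rng, scale=0.5):
    """Translate one randomly chosen wall of a polygonal room along its normal."""
    corners = np.asarray(corners, dtype=float)
    n = len(corners)
    next_corners = np.roll(corners, -1, axis=0)
    lengths = np.linalg.norm(next_corners - corners, axis=1)
    shortest = lengths.min()                                  # step 1: shortest wall
    wall = rng.integers(n)                                    # step 2: pick a wall
    shift = rng.uniform(-scale * shortest, scale * shortest)  # step 3: bounded offset
    direction = (next_corners[wall] - corners[wall]) / lengths[wall]
    normal = np.array([direction[1], -direction[0]])
    augmented = corners.copy()
    augmented[wall] += shift * normal
    augmented[(wall + 1) % n] += shift * normal
    return augmented

rng = np.random.default_rng(0)
anchor_room = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
new_room = augment_room(anchor_room, rng)
```

Bounding the shift by a fraction of the shortest wall keeps every neighbouring wall at positive length, which is one way to read the guarantee of a reasonable topology.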

3.2 Shape Code Regression via Image Encoding

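Now the ground truth layout occupancy image $O$ is transformed into a latent code $z$ by the self-encoding network $f_\theta$. Then we learn an image encoding network $h_\psi$ to bridge the gap between the input panoramic image $I$ and the latent code:

$$\hat{z} = h_\psi(I)$$

We use a convolutional network to implement $h_\psi$. We apply $f_\theta$ to all training samples in the datasets to get (pseudo) ground truth latent codes. Then we optimize $\psi$ to minimize the $\ell_1$ or $\ell_2$ distance between the output of $h_\psi$ and the ground truth latent codes. Formally, we optimize:

$$\min_{\psi} \; \big\| h_\psi(I) - f_\theta(O) \big\|_{k}, \quad k \in \{1, 2\}$$

As such, during test time, we can obtain a set of hyper-planes from a single panoramic image by applying the hyper-plane generation network $g_\phi$ to the predicted code:

$$\hat{P} = g_\phi(h_\psi(I))$$

As will be shown later in the experiments, $h_\psi$ applied directly to RGB panoramas shows poor transferability, and we address the issue using a data pre-processing step.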

3.3 Data Pre-processing

We propose a data pre-processing step to map input RGB panoramas into an intermediate domain that is hardly influenced by low-level visual differences and high-level scene composition drift. We choose semantic boundary maps as the intermediate domain (Fig. 3).

The original input is a panoramic image $I$. The predicted semantic boundary map is denoted by $\hat{B}$. The ground truth semantic boundary map is denoted by $B$. The three channels of $\hat{B}$ and $B$ correspond to the wall-floor boundary, wall-ceiling boundary and wall-wall boundary, respectively. The mapping from $I$ to $\hat{B}$ is denoted by $b_\omega$:

$$\hat{B} = b_\omega(I)$$

$b_\omega$ is a fully convolutional network. As shown in the right panel of Fig. 3, using the visible semantic boundary ground truth map is critical. (Strictly speaking, visible here means being free of layout self-occlusion. It does not mean the map is free of furniture occlusion.) We demonstrate $(B, \hat{B})$ pairs in the upper rows of Fig. 3's left panel. We optimize a pixel-wise loss function $\ell$ between $\hat{B}$ and $B$. This data pre-processing step is formally stated as:

$$\min_{\omega} \; \ell\big(b_\omega(I), B\big)$$

Then we apply $b_\omega$ to both the training and test sets. Semantic boundary prediction maps on unseen test sets are illustrated in the lower two rows of Fig. 3's left panel. They are good but not perfect. Complex non-cuboid layout structures in the Structured3D and Matterport3D samples are successfully captured. Yet in the simple PanoContext and S3DIS samples, blurred boundaries caused by occlusion and ceiling decoration still exist. We use the inferred, imperfect $\hat{B}$ in place of the RGB panorama as the input to shape code regression, which significantly improves zero-shot transfer performance.

4 Experiments

| Exp | Encoding | Size | Manhattan | SBM | Loss | RoomAug-1M | IoU-IE (%) | IoU-LE (%) |
|---|---|---|---|---|---|---|---|---|
| A | FC | 128/16 | no | no | $\ell_2$ | no | 98.32 | 83.02 |
| B | GAP | 128/16 | no | no | $\ell_2$ | no | 96.89 | 81.93 |
| C | GAP | 256/32 | no | no | $\ell_2$ | no | 97.24 | 81.98 |
| D | GAP | 128/16 | yes | no | $\ell_2$ | no | 97.02 | 82.12 |
| E | GAP | 128/16 | no | yes | $\ell_2$ | no | 96.89 | 89.99 |
| F | GAP | 128/16 | no | yes | $\ell_2$ | yes | 98.38 | 90.16 |
| G | GAP | 128/16 | no | yes | $\ell_1$ | yes | 98.38 | 90.34 |

Table 1: Ablation results evaluated on Structured3D. Encoding means how representations are mapped to the latent code in the self-encoding network. Size means the numbers of hyper-planes and shape primitives. Manhattan means whether hyper-planes are constrained to two orthogonal directions. SBM means whether the data pre-processing step is used or not. Loss means the loss function used for training the image encoding network. RoomAug-1M means whether the augmented synthetic dataset is used.

4.1 Dataset and Evaluation Protocol

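| Ft IE | Ft SR | SBM | PanoContext IoU-IE (%) | PanoContext IoU-LE (%) | S3DIS IoU-IE (%) | S3DIS IoU-LE (%) | Matterport3D IoU-IE (%) | Matterport3D IoU-LE (%) |
|---|---|---|---|---|---|---|---|---|
| no | no | no | 95.94 | 67.04 | 97.35 | 51.43 | 90.18 | 24.68 |
| no | no | yes | 95.94 | 84.43 | 97.35 | 82.69 | 90.18 | 73.73 |
| no | yes | no | 95.94 | 51.90 | 97.35 | 44.69 | 90.18 | 22.75 |
| no | yes | yes | 95.94 | 81.22 | 97.35 | 80.76 | 90.18 | 73.26 |
| yes | yes | no | 97.05 | 51.08 | 98.72 | 42.93 | 87.59 | 25.09 |
| yes | yes | yes | 97.05 | 81.95 | 98.72 | 80.25 | 87.59 | 71.06 |

Table 2: Transferability evaluation. Ft IE/SR means whether to fine-tune the implicit encoding network and the shape code regressor, respectively. SBM means whether semantic boundary maps are used as the intermediate domain.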

We use four datasets for evaluation: Structured3D [zheng2020structured3d], PanoContext [zhang2014panocontext], S3DIS [armeni2016s3dis] and Matterport3D [zou2021manhattan]. Structured3D is a synthetic dataset which can be used to generate diverse annotations. Here we use the version published in the ECCV2020 Holistic 3D workshop, which has 21727 samples. In the alphabetical order of file names, we take one for evaluation every 70 samples. We have 21329 panoramas for training and 308 panoramas for testing. To be consistent with the literature [zou2021manhattan], we combine the training sets of PanoContext and S3DIS, forming a set of 896 panoramas. For testing, we use the original split, which contains 53 samples for PanoContext and 113 samples for S3DIS. Matterport is also a dataset that allows various usages. We use the layout version annotated by [zou2021manhattan], which has 1835 panoramas for training and 458 panoramas for testing.

We use 2D intersection over union (IoU) in the top-down viewpoint as our metric. As shown in Fig. 1, we generate a discretized output (i) and compare it with (a). Pixel-wise intersection and union between (a) and (i) are calculated and we divide intersection by union to get an IoU value. This is used to evaluate the implicit encoding network and referred to as IoU-IE. Meanwhile, we can evaluate this IoU value for end-to-end room layout estimation, which is named as IoU-LE. Besides, we also give a systematic evaluation on the quality of semantic boundary networks. Following former edge detection papers [arbelaez2010contour], we use the F1-score under optimal dataset scale (ODS) and optimal image scale (OIS) to evaluate the accuracy of semantic boundary prediction. Since the OIS measure allows us to select an optimal scale value for each image, it is always higher than ODS. Three boundary maps are evaluated separately.
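The top-down IoU described above can be computed as in the following sketch. The 0.5 threshold matches the discretization mentioned for Fig. 1; the function name `topdown_iou` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
import numpy as np

def topdown_iou(pred_occupancy, gt_occupancy, threshold=0.5):
    """Pixel-wise IoU between a predicted occupancy map and the ground truth mask."""
    pred = np.asarray(pred_occupancy) >= threshold  # discretize the prediction
    gt = np.asarray(gt_occupancy) >= 0.5
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty; treated as a perfect match by convention
    return np.logical_and(pred, gt).sum() / union

gt_mask = np.zeros((4, 4))
gt_mask[0:2, 0:2] = 1.0    # 4 occupied pixels
pred_map = np.zeros((4, 4))
pred_map[0:2, 0:3] = 0.9   # 6 pixels above the threshold
iou_value = topdown_iou(pred_map, gt_mask)
```

The same function would serve for both IoU-IE (prediction rendered from the self-encoded code) and IoU-LE (prediction rendered from the code regressed from the panorama), since the two metrics differ only in where the code comes from.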

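| Ft IE | Ft SR | Aug | Loss | PanoContext IoU-IE (%) | PanoContext IoU-LE (%) | S3DIS IoU-IE (%) | S3DIS IoU-LE (%) | Matterport3D IoU-IE (%) | Matterport3D IoU-LE (%) |
|---|---|---|---|---|---|---|---|---|---|
| no | no | no | $\ell_2$ | 95.94 | 84.43 | 97.35 | 82.69 | 90.18 | 73.73 |
| no | no | yes | $\ell_2$ | 98.91 | 83.60 | 98.88 | 82.89 | 94.79 | 73.89 |
| no | yes | yes | $\ell_2$ | 98.91 | 81.48 | 98.88 | 80.75 | 94.79 | 72.84 |
| no | yes | yes | $\ell_1$ | 98.91 | 83.30 | 98.88 | 81.23 | 94.79 | 73.86 |

Table 3: Impact of RoomAug-1M. Ft IE/SR means whether to fine-tune the implicit encoding network and the shape code regression network, respectively. Aug means whether to train the implicit encoding network with RoomAug-1M. Loss means how to train the shape code regressor.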

4.2 Ablation Studies

The first set of experiments is presented on Structured3D as it is the largest among the four datasets we inspected. We show the impact of the basic building blocks of the proposed learning system in Table 1. Experiment indexes are shown in the leftmost column. To make comparisons clearer, we use a v.s. operator between indexes.

Encoding (A v.s. B): We study two alternatives for the final encoding step in the self-encoding network. The first is to directly use a fully connected layer to map (flattened) convolutional features to the latent shape code. This preserves spatial information, which is naturally favorable since the shape code should be capable of reconstructing the layout. The second is to apply a global average pooling layer before the fully connected layer. This is an established choice in image recognition [zhou2016learning], yet whether it is reasonable for implicit encoding still needs justification. Not surprisingly, A outperforms B by 1.43% for IoU-IE and 1.09% for IoU-LE. However, this comes at a high cost in parameters. For the implicit encoding network, model A has a size of 33MB while model B has a size of 21MB. As such, we choose the GAP version for all later experiments.

Size (B v.s. C): We answer the question of how many hyper-planes and shape primitives are enough for implicitly encoding generic room layouts. We tried three settings: 64/8, 128/16 and 256/32. The first setting does not converge, implying that it is not enough to approximate the shape space. C outperforms B only by 0.35% for IoU-IE and 0.05% for IoU-LE, from which we conclude that there is no need to pursue bigger models.

Manhattan (B v.s. D): The Manhattan constraint is widely used in room layout estimation, thus we explore the possibility of enforcing hyper-plane equations to align with two orthogonal Manhattan directions. Interestingly, D outperforms B by 0.13% for IoU-IE and 0.19% for IoU-LE, implying the advantages of incorporating this inductive bias. However, to keep the formulation generic, we do not use the constraint in later experiments.

SBM (B v.s. E): Then we show the impact of obtaining semantic boundary prediction maps as a data pre-processing step. Note that IoU-IE remains the same and E outperforms B by 8.06% for IoU-LE. This significant margin illustrates the importance of using an intermediate domain in our newly proposed learning system. The mapping to boundary maps alleviates the negative influence of appearance change and provides consistent patterns to recognize. We keep this setting in later experiments due to its effectiveness.

RoomAug-1M (E v.s. F): We then inspect the setting of training with one million augmented rooms. F outperforms E by 1.49% for IoU-IE and 0.17% for IoU-LE. As expected, the large amount of augmented data better models the shape code space, leading to a clear margin for IoU-IE. As a reminder, RoomAug-1M is generated from Structured3D, so we expect this margin to be larger on other datasets. This improvement barely translates to IoU-LE, suggesting that other factors like the capability of the image encoding network are the bottleneck.

Loss (F v.s. G): Finally, we use the $\ell_1$ loss for training the image encoding network. Since $\ell_1$ naturally leads to sparse non-zero entries, if the latent shape code is well-disentangled, using $\ell_1$ may lead to better performance. Empirically, G outperforms F by 0.18% for IoU-LE, which is quite marginal.

4.3 On Transferability

In Table 2, we show quantitative results on model transferability for PanoContext, S3DIS and Matterport3D. There are three combinations: (1) Directly evaluating on other datasets using models trained on Structured3D, without fine-tuning the implicit encoding network or the shape code regression network on them. This amounts to a zero-shot transfer setting. (2) Only fine-tuning the shape code regression network on other datasets. This setting re-uses shape embeddings trained on Structured3D. (3) Fine-tuning the whole system on other datasets.

Poor Transferability of Direct Image Encoding: The first noticeable fact is that, without the data pre-processing step, the whole system trained on Structured3D transfers poorly to other datasets. This is supported by the first row of Table 2. The zero-shot transfer results for IoU-IE are as high as 95.94%, 97.35% and 90.18%, respectively. This is understandable as occupancy images/functions are rarely influenced by domain drift and Structured3D is the largest dataset, covering a wide range of room shapes. However, the zero-shot transfer results for IoU-LE are only 67.04%, 51.43%, and 24.68%, respectively. This shows that direct image encoding suffers from severe domain drift.

| S3D | Floor | Ceiling | Wall |
|---|---|---|---|
| Train | 63.53/60.70 | 63.97/54.27 | 68.79/63.82 |
| Test | 63.68/61.12 | 64.34/54.42 | 69.16/63.60 |
| All | 63.53/60.71 | 63.95/54.28 | 68.80/63.82 |

| P&S | Floor | Ceiling | Wall |
|---|---|---|---|
| Train | 50.13/47.41 | 52.49/49.53 | 52.80/48.99 |
| Test-P | 46.02/44.09 | 42.92/39.14 | 42.78/37.79 |
| Test-S | 44.96/42.71 | 56.07/53.84 | 45.80/40.94 |
| All | 49.27/46.67 | 52.39/49.35 | 51.56/47.53 |

| MP3D | Floor | Ceiling | Wall |
|---|---|---|---|
| Train | 45.85/43.44 | 44.75/41.11 | 46.35/41.46 |
| Test | 44.75/42.18 | 38.97/35.27 | 44.18/38.99 |
| All | 45.63/43.17 | 43.60/39.85 | 45.92/40.95 |

Table 4: Semantic boundary prediction accuracy evaluation. Numbers before and after the slash are F1-score under OIS and ODS respectively, which are measured in %.
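The difference between the two scale choices can be illustrated with a small sketch. It assumes that boundary matching has already been done, so that true positive, false positive and false negative counts are available per image and per threshold; the aggregation follows the common edge detection benchmark convention, and the names `f1_score` and `ods_ois` are hypothetical.

```python
import numpy as np

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

def ods_ois(tp, fp, fn):
    """tp, fp, fn: arrays of shape (num_images, num_thresholds)."""
    tp, fp, fn = (np.asarray(a, dtype=float) for a in (tp, fp, fn))
    # ODS: a single threshold shared by the whole dataset.
    ods = f1_score(tp.sum(axis=0), fp.sum(axis=0), fn.sum(axis=0)).max()
    # OIS: the best threshold is picked per image, then counts are pooled.
    best = f1_score(tp, fp, fn).argmax(axis=1)
    rows = np.arange(tp.shape[0])
    ois = f1_score(tp[rows, best].sum(), fp[rows, best].sum(), fn[rows, best].sum())
    return float(ods), float(ois)

# Two images, two thresholds (toy counts).
tp = [[8, 4], [2, 8]]
fp = [[2, 0], [0, 2]]
fn = [[2, 6], [8, 2]]
ods_value, ois_value = ods_ois(tp, fp, fn)
```

In this toy case the two images prefer different thresholds, so the per-image choice (OIS) scores higher than the best shared threshold (ODS), mirroring the ordering of the numbers before and after the slash in Table 4.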

Semantic Boundary Maps Help: As demonstrated in the second row of Table 2, introducing the intermediate domain significantly improves the zero-shot transfer performance, which is the most important finding of this paper. Note that IoU-IE is not influenced and IoU-LE sees a performance boost of 17.39%, 31.26% and 49.05%, respectively. This is achieved without fine-tuning the latent space or the shape code regressor on other datasets. In conclusion, using the intermediate domain largely improves the generalization ability of our method, because the domain drift of biased shape code regression is effectively addressed by bypassing the negative influences of low-level appearance change and high-level scene composition change.

Does Fine-tuning the Shape Code Regressor Help? Very interestingly, the answer is no, as evidenced by the third and fourth rows of Table 2. Without SBM, fine-tuning the shape code regressor leads to 15.14%, 6.74% and 1.93% performance drops, respectively. With SBM, it leads to 3.21%, 1.93% and 0.47% performance drops, respectively. These results consistently show that fine-tuning the shape code regressor on small datasets results in over-fitting and hurts model performance. This fact further confirms the significance of SBM because the most straightforward solution of fine-tuning on target datasets cannot address the domain drift of biased shape code regression.

Does Fine-tuning the Implicit Encoding Network Help? The answer is still no, as evidenced by the last two rows of Table 2. Compared with fine-tuning only the shape code regressor, fine-tuning the whole system without SBM leads to -0.82%, -1.76% and +2.34% performance changes, respectively. With SBM, it leads to +0.73%, -0.51% and -2.20% performance changes, respectively. The margins do not point to clear conclusions, but they demonstrate that even fine-tuning the whole system on target datasets cannot fully address the domain drift problems.

4.4 Impact of RoomAug-1M

Table 3 summarizes the quantitative results on training the implicit encoding network with RoomAug-1M. As demonstrated by the second row, introducing RoomAug-1M significantly improves IoU-IE on the three small datasets in the zero-shot transfer setting. We achieve 2.97%, 1.53% and 4.61% boosts on PanoContext, S3DIS and Matterport3D, respectively. Notably, IoU-IE values on PanoContext and S3DIS are as high as 98.91% and 98.88%, which are nearly perfect. This validates our assumption that using synthetic data in a simple domain free of rendering artifacts is a good practice. Meanwhile, the IoU-LE values change only slightly in either direction, implying that other factors are becoming bottlenecks that prevent us from fully unleashing the power of a better shape code space.

Fine-tuning the Shape Code Regressor Still Hurts Performance: With the RoomAug-1M dataset used, fine-tuning the implicit encoding network on small datasets becomes meaningless. But we still investigate the option of fine-tuning the shape code regressor. As shown in the third row of Table 3, this still brings 2.12%, 2.14% and 1.05% performance drops, respectively. This indicates that there are two sources of domain drift. Even when the domain drift of a biased shape code space is alleviated by enormous data, fine-tuning the shape code regressor on small datasets still amplifies the domain drift of a biased shape code regressor.

$\ell_1$ Regression Empirically Helps: As shown in the last row of Table 3, switching to an $\ell_1$ loss function for shape code regression brings 1.82%, 0.48%, and 1.02% improvements, respectively, although the results are still lower than in the zero-shot transfer setting. This suggests the shape code space learned on RoomAug-1M may have some disentanglement characteristics.

4.5 Semantic Boundary Accuracy

Lastly, we provide quantitative evaluations for semantic boundary map quality in Table 4. S3D means Structured3D, P&S means PanoContext and S3DIS, and MP3D means Matterport3D. As a reminder, the training sets of PanoContext and S3DIS are combined. Since the semantic boundary network is used as a data pre-processing step, we apply it to both training and testing images. The semantic boundary map prediction accuracy on Structured3D is the highest, which is consistent with its high IoU-LE (90.34%).

5 Conclusion

We propose an end-to-end room layout estimation method taking panoramic images as inputs. It is based upon the implicit encoding principle. We first learn shape codes that serve as a surrogate for a structured implicit representation. The representation is self-supervised by a combination of several differentiable rendering layers. Then we learn a mapping from input images to the latent shape code, which makes end-to-end room layout estimation possible. Like other end-to-end formulations, the proposed one is troubled by limited generalization ability. On one hand, we show that a data pre-processing step significantly improves the zero-shot transfer performance. On the other hand, we propose a conditional uniform data augmentation strategy that alleviates the domain drift of a biased shape code space. Extensive evaluations are conducted on four widely used public datasets, with code, data and models released.

Limitation: The current representation exploits a fixed number of hyper-planes, which is not compatible with several widely used metrics on those four benchmarks.