Predictive and Semantic Layout Estimation for Robotic Applications in Manhattan Worlds

11/19/2018 ∙ by Armon Shariati, et al. ∙ University of Pennsylvania

This paper describes an approach to automatically extracting floor plans from the kinds of incomplete measurements that could be acquired by an autonomous mobile robot. The approach proceeds by reasoning about extended structural layout surfaces which are automatically extracted from the available data. The scheme can be run in an online manner to build water tight representations of the environment. The system effectively speculates about room boundaries and free space regions which provides useful guidance to subsequent motion planning systems. Experimental results are presented on multiple data sets.




1 Introduction

(a) Quadrotor w/ Sensor Suite
(b) 3D Point Cloud Reconstruction
(c) Semantic Floor Plan
Figure 1: The goal of this work is to automatically construct semantic layouts of indoor spaces from the kinds of data that could be acquired by an autonomous robot like the one shown in (a). This system is equipped with a pair of stereo cameras, an IMU, and a PMD depth camera. (b) shows a small portion of the 3D point cloud that we can acquire by integrating information from the robot's sensors. (c) shows the abstracted floor plan distilled from the 3D measurements that are acquired as the sensor suite is moved through the scene.

In this paper we present an algorithm designed to generate semantic floor plans for an autonomous agent, such as the one shown in Figure 1, that must navigate through previously unknown indoor environments. In many such applications it is neither necessary nor desirable for the robot to view every corner or crevice of the environment. Humans moving through such a scene are able to quickly apprehend the overall structure and make reasonable inferences about it even with incomplete information gleaned from a few scans. We would like to endow our robots with similar capabilities.

One of our goals is to produce floor plans that are useful in solving motion planning problems. As such our method is designed to be computationally efficient, to provide robust speculation as to the presence of free space and boundaries in regions of the environment which have not yet been observed, and to provide a semantic context for higher level planning tasks.

Many current layout estimation pipelines focus on scene understanding from a single image, or focus on generating models only after the entire space has been observed. Our approach fuses information from multiple vantage points and can be run in an online manner to provide predictions as the robot moves.

2 Related Work

2.1 Single-View Layout Estimation

A number of approaches have been proposed that use geometric reasoning to infer scene layout from a single RGB image, including [7, 14, 6, 5]. These methods typically focus on detecting vanishing points within a Manhattan frame of reference in order to estimate layout through scene occlusion. Taylor and Cowley [17] present a related single view approach that uses depth images rather than RGB images.

More recently, several learning-based approaches to the single view variant of the problem have been developed. Lee et al. [9] use CNN features, instead of the more traditional handcrafted ones, in order to detect vanishing points and, in turn, scene layout. Methods such as [19] and [16] can produce quite flexible factored representations of the completed scene; however, neither produces the water-tight floor plan models which are desirable in robotic applications. Finally, while [21] and [3] are able to produce such floor plans, both rely on panoramic images, which may be difficult to obtain on a robotic platform.

Ultimately, despite the success and apparent efficacy of these systems, most fail to present a way in which additional information from multiple views can be incorporated into the information processing pipeline to model extended environments.

2.2 Multi-View Layout Estimation

The class of multi-view layout estimation techniques comprises solutions arising predominantly from three different problem domains.

The first domain is robotics, where the goal is often to distill a semantically meaningful representation of an environment from sequential image data. Like this paper, these works often target real-time planning and navigation applications. Among the first of these approaches are [4] and [18] which use a Bayesian filtering framework to update a model of the scene based on observations provided by some state-estimation subsystem. Although we may share many of the same goals and high-level approaches as some of these authors, our work expands upon their insights with a focus on prediction and model utility. A learning approach designed for a similar end can be found in [10], where the authors use RGBD information to produce semantic floor plans, yet they do not emphasize the need for speculation in their system as we do.

The second subset of techniques is automated floor plan generation. Okorn et al. [13] focus on generating accurate 2D floor plans by extracting and reasoning about complete wall segments from high-fidelity point clouds. In [12] and [11], the authors utilize a 3D piecewise-planar representation and focus on volumetric reasoning of space. These techniques are appealing as they can produce more flexible models of indoor space than those permitted by making a Manhattan world assumption; however, they also require that all of the data be of a certain quality and available at once.

The third category is that of large-scale indoor modeling, where the work most closely related to ours may be found. The approach of Xiao et al. [20] is similar to ours in that the authors leverage aspects of the Manhattan world assumption and use rectangular primitives to model free space. Our approach differs in that we build our candidate regions using infinitely extended structural planes that are automatically detected, as opposed to finite wall extents. Furthermore, our online approach is specifically designed to suggest unexplored regions that could be free space. This can be contrasted with the batch-oriented approach in [20], which is discouraged from including unexplored regions in its volumetric model. Armeni et al. [2] focus on large-scale semantic parsing and demonstrate how larger point clouds, such as those of a building, can be parsed into semantically meaningful components such as rooms and hallways. However, despite providing a useful labeling of the points, their model of the environment is still, ultimately, a point cloud instead of a floor plan.

3 Technical Approach

(a) SL-LMS Input
(b) Free Space Speculation
(c) Semantic Floor Plan
Figure 2: A bird's-eye perspective of the 3D reconstruction provided by [15] is shown in (a). Red and green dotted lines indicate the position of different layout planes perpendicular to the x and y axes, respectively. Each red and green point cloud illustrates the portion of its corresponding layout plane which is observed. A generated floor plan outlined in blue and overlaid on top of the occupancy grid is given in (b). Known free cells are colored white, while unobserved cells speculated to be free based on the floor plan are colored gray. Occupied cells and unobserved cells outside of the domain of the floor plan are colored black. The final semantically colored floor plan with labeled regions is shown in (c). Cyan regions correspond to rooms, while magenta regions correspond to corridors. Open doorways on the borders of each region are indicated.

The method we present in this section builds upon the work of Shariati et al. [15]. In their paper, the authors present a SLAM approach that models partial observations of layout structures, including walls, floors, and ceilings, as segments residing on structural supporting layout planes, which are modeled as axis-aligned surfaces with infinite extent. The output of their system is an optimized trajectory and the minimum set of layout planes necessary to explain all observed layout segments. Each layout plane is parameterized by its position and orientation: north, east, south, or west. This representation of indoor spaces leads to a simplified feature set for more robust localization and loop closure, and provides a strong prior for scene understanding. The primary shortcoming of such a representation, which we seek to address, is illustrated in Figure 2(a): although a human may be able to discern independent regions within this reconstruction, it is unclear how this may be done automatically, since the true extent of individual walls and the meeting of perpendicular walls at corners are typically unobserved and ill defined. Furthermore, it is unclear how a robot could speculate about the unobserved free space in the environment given such a representation.

Although the approach described in the sequel can readily be extended to the multi-floor 3D case, for the sake of simplicity we describe the single-floor formulation of the problem, which will output a floor plan of the building. Note however that the system still analyzes a complete 3D point cloud and 3D trajectory to produce the distilled floor plan.

3.1 Door Detection

We first observe that in the context of buildings, doorways play a special role in scene understanding in that they signal a clear transition between two functionally disjoint spaces such as distinct rooms and hallways. This motivates us to develop a scheme to automatically detect these openings in walls.

Given a layout plane we begin by accumulating all of the points associated with that plane as shown in Figure 3. Notice how the discontinuity of observations makes this problem of finding doors more challenging than simply looking for negative space.

Figure 3: Result of merging the individual cloud segments associated with a particular layout plane (top). Histogram of projected points corresponding to the point cloud in Figure 3 cropped at 2 meters (bottom). The distance between ticks along the axis is 10 meters. Histogram bin counts range from 0 - 500.

Each point set is cropped at 2 meters which is close to the typical door height and the remaining points are aggregated in a histogram as shown in Figure 3. We compute a smoothed gradient of the resulting signal and then convolve that result with a matched filter that is designed to detect 1 meter wide apertures. The existing system readily detects openings between 0.8 and 1.2 meters wide and can easily be extended to accommodate varying dimensions. Note that the detected door openings are effectively children of the underlying layout planes. Once the layout has been determined these doorways help to define the functional transitions between different spaces.
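The door detection steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `detect_doors`, the 0.1 m bin size, the Gaussian smoothing width, the score threshold, and the two-tap matched filter (a falling edge followed one door width later by a rising edge) are all assumptions made for illustration. It also searches only for the nominal 1 meter aperture rather than the 0.8 to 1.2 meter range reported above.

```python
import numpy as np

def detect_doors(coords, bin_size=0.1, door_width=1.0, sigma_bins=1.5, min_score=0.25):
    """Return (start, end) intervals, in meters, of door-sized gaps along one wall.

    coords: non-empty 1D positions (meters) along the layout plane of the points
    that survive the 2 m height crop. All parameter values are hypothetical.
    """
    coords = np.asarray(coords, dtype=float)
    lo = coords.min()
    n_bins = max(int(np.ceil((coords.max() - lo) / bin_size)), 1)
    w = int(round(door_width / bin_size))  # door width expressed in bins
    if w < 1 or n_bins <= w:
        return []

    counts, _ = np.histogram(coords, bins=n_bins, range=(lo, lo + n_bins * bin_size))
    counts = counts.astype(float)

    # Smoothed gradient: Gaussian smoothing (edge-padded so the wall ends do not
    # look like door edges) followed by a central difference.
    radius = int(3 * sigma_bins)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma_bins) ** 2)
    kernel /= kernel.sum()
    smooth = np.convolve(np.pad(counts, radius, mode="edge"), kernel, mode="valid")
    grad = np.gradient(smooth)

    # Matched filter: reward a drop in point density at bin i and a rise at bin i + w.
    response = (grad[w:] - grad[:-w]) / max(counts.max(), 1.0)

    doors = []
    while True:
        i = int(np.argmax(response))
        if response[i] < min_score:
            break
        start = lo + (i + 0.5) * bin_size
        doors.append((start, start + door_width))
        # Suppress overlapping detections around the accepted peak.
        response[max(0, i - w): i + w + 1] = -np.inf
    return doors
```

Because the filter responds to a paired drop and rise in density rather than to empty bins alone, unobserved stretches of wall that are wider than a door do not produce a strong response at the door scale.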

3.2 Layout Estimation

We begin by recognizing that regions of indoor free space are typically enclosed by pairs of inward-facing structures, i.e., a north-facing wall paired with a south-facing wall and an east-facing wall paired with a west-facing wall. Therefore, given a set of axis-aligned layout planes as well as their observed orientations, we enumerate all possible east-west and north-south pairs,

$$X = \{(x_i, x_j) \mid x_i \in E,\ x_j \in W\}, \qquad Y = \{(y_i, y_j) \mid y_i \in N,\ y_j \in S\},$$

where $E$ and $W$ are defined as the sets of east-facing plane positions and west-facing plane positions respectively. Similarly, $N$ and $S$ are defined in the same way for north and south-facing planes. Using these intermediary sets, we then enumerate all the possible rectangles that could be used to explain the free space,

$$\mathcal{R} = \{(x, y) \mid x \in X,\ y \in Y\}.$$

This set however contains several different types of invalid rectangles including: those that have opposing faces which are outward facing; those whose length or width are too narrow (less than 1 meter for most indoor spaces); those which include portions of observed layout segments (projected to the ground plane after thresholding all points at a height greater than 2 meters) within their bounds; and those which include detected doorways within their bounds. Therefore, we prune the set of all such offending elements. It is interesting to note that this operation typically reduces the size of the original space of candidates by about - , which greatly improves the speed of our algorithm.
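The enumeration and pruning described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `candidate_rectangles` is hypothetical, an east-facing plane is assumed to have free space on its +x side and a north-facing plane on its +y side, and observed wall segments and detected doorways are both represented as a flat list of 2D points (`blocked_points`) already projected to the ground plane.

```python
from itertools import product

def candidate_rectangles(east, west, north, south, blocked_points, min_side=1.0):
    """Enumerate axis-aligned rectangles from plane pairs, then prune invalid ones.

    east, west: x positions of east-facing and west-facing layout planes.
    north, south: y positions of north-facing and south-facing layout planes.
    blocked_points: (x, y) samples of observed wall segments and doorways.
    Returns rectangles as (x_min, y_min, x_max, y_max) tuples.
    """
    x_pairs = list(product(east, west))    # every east-west pairing
    y_pairs = list(product(north, south))  # every north-south pairing

    kept = []
    for (x0, x1), (y0, y1) in product(x_pairs, y_pairs):
        # A negative extent means the faces point away from each other;
        # an extent below min_side is too narrow to be a usable space.
        if x1 - x0 < min_side or y1 - y0 < min_side:
            continue
        # Reject rectangles with wall or doorway evidence strictly inside them.
        if any(x0 < px < x1 and y0 < py < y1 for px, py in blocked_points):
            continue
        kept.append((x0, y0, x1, y1))
    return kept
```

For example, with east-facing planes at x = 0 and x = 5, west-facing planes at x = 4 and x = 9, and a wall sample at (4.5, 1.5), the rectangle spanning x = 0 to x = 9 is rejected because it would swallow the wall between the two rooms.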

This approach to defining rectangular regions is similar to the scheme employed by Xiao et al. [20], but here we leverage the fact that we are pairing structural planes with infinite extent rather than incomplete wall segments. More specifically, if we consider the example environment shown in Figure 2(a), the system considers pairs of the infinite dotted lines shown rather than just the solid segments where direct evidence is available. This approach allows the scheme to effectively speculate in regions that have not been observed yet.

In addition to the planar reconstruction, [15] also provides a voxel map reconstruction of the environment, from which we can sample at a particular height in order to determine which cells each candidate rectangle spans. Each voxel is meters on side.

At this point, we observe that the problem we are presented with can be phrased as a set cover problem. We are given a universe $U$ of free space cells in the 2D occupancy grid (generated by sampling from the 3D voxel map at a particular height), each covered by at least one candidate rectangle $r$, and a list of rectangle subsets of $U$, each with its own weight defined as the total number of free, occupied, and unobserved cells in the grid it covers. We would like to select a collection of rectangles, $C$, of minimum total weight, whose union is equal to all of $U$. Minimizing this objective should, in principle, select those candidates which explain as much of the free space as they can, while also yielding the simplest explanation of the space. The weighted set cover problem is NP-hard; however, effective greedy approximations have been developed and we exploit one of these.

We construct the cover of rectangles, $C$, by iteratively making the following greedy choice: select the candidate rectangle $r$ that minimizes

$$\frac{w(r)}{|r \cap F|}$$

until no free space voxels remain uncovered, where $w(r)$ denotes the sum of the number of free, occupied, and unobserved voxels within the span of $r$, and $F$ is the set of remaining uncovered free voxels. If there happen to be two or more rectangles with the same ratio, we choose the one with the largest $|r \cap F|$. This algorithm has the interesting property that the cover selected has weight within a factor of $H(d^*)$ of the optimal, where $d^*$ is the largest number of free voxels spanned by any single candidate and $H(n) = \sum_{i=1}^{n} 1/i$ is the $n$-th harmonic number [8].
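The greedy selection rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `greedy_cover` and the representation of each candidate as a set of grid cell indices are assumptions. Each candidate's weight is taken to be the number of cells it spans, whether free, occupied, or unobserved, as described above.

```python
def greedy_cover(candidates, free_cells):
    """Greedy weighted set cover over free cells.

    candidates: dict mapping a rectangle name to the set of ALL grid cells it
        spans (free, occupied, and unobserved); its weight is the set size.
    free_cells: set of free cells that must be covered.
    Returns the list of selected rectangle names in selection order.
    """
    uncovered = set(free_cells)
    cover = []
    while uncovered:
        best_name, best_key = None, None
        for name, cells in candidates.items():
            gain = len(cells & uncovered)  # newly covered free cells
            if gain == 0:
                continue
            # Minimize weight per newly covered free cell; on ties prefer
            # the candidate that covers more of the remaining free cells.
            key = (len(cells) / gain, -gain)
            if best_key is None or key < best_key:
                best_name, best_key = name, key
        if best_name is None:
            raise ValueError("some free cells are not covered by any candidate")
        cover.append(best_name)
        uncovered -= candidates[best_name]
    return cover
```

A candidate that spans many unobserved or occupied cells pays for them in its weight, so it is only selected when it also explains a proportionally large amount of free space.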

Given this cover, a floor plan can be generated by computing the union of all rectangles in the cover. An example of such a floor plan can be seen in Figure 2(b). Notice that our segment and doorway collision constraint on the candidate rectangles results in the generation of functionally disjoint regions. These regions may also be given unique identifiers as illustrated in Figure 2(c). It is important to mention that these regions are subject to one filtering criterion, which is that no region may have a ratio of total cells spanned to free cells spanned greater than 1000. This threshold may be tuned in order to limit the desired degree of risk in the speculation.
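As a rough sketch of how the union of selected rectangles turns into speculated free space, and how the ratio filter could be applied, consider the following. The cell labels, function names, and the use of half-open (row, column) index ranges for rectangles are assumptions for illustration only.

```python
import numpy as np

FREE, OCCUPIED, UNOBSERVED = 0, 1, 2  # hypothetical occupancy grid labels

def rasterize(shape, rectangles):
    """Boolean mask of the union of rectangles given as (r0, c0, r1, c1),
    covering rows r0..r1-1 and columns c0..c1-1."""
    mask = np.zeros(shape, dtype=bool)
    for r0, c0, r1, c1 in rectangles:
        mask[r0:r1, c0:c1] = True
    return mask

def speculated_free(grid, plan_mask):
    """Cells inside the floor plan that were never observed."""
    return plan_mask & (grid == UNOBSERVED)

def passes_risk_filter(grid, region_mask, max_ratio=1000.0):
    """Keep a region only if total cells / free cells does not exceed max_ratio."""
    total = int(np.count_nonzero(region_mask))
    free = int(np.count_nonzero(region_mask & (grid == FREE)))
    return free > 0 and total / free <= max_ratio
```

Lowering `max_ratio` makes the system more conservative: a region supported by only a handful of observed free cells is dropped rather than speculated.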

Based on which layout planes form the faces of each region, we can also reason as to which doorways act as transitions between pairs of adjacent regions, given each region’s well defined boundary. These doorways are highlighted in the semantically annotated version of the floor plan shown in Figure 2(c).

Note that this optimization procedure seeks to find the simplest set of boxes that explains the available data which encourages the system to expand corridors and rooms since this allows it to explain larger regions with fewer primitives. In contrast the optimization in [20] was designed for situations where the space was completely scanned so the optimization penalizes primitives that include unexplored voxels.

3.3 Semantic Labeling

For any functional interpretation of space, it is important to understand what each region represents. In our classification scheme, we distinguish between two types of spaces: rooms and corridors. While this label space may not be comprehensive, we argue that it does capture the general purpose of most types of space – either the space itself acts as a transition, or the space is itself a terminal point where some particular event or action takes place. Observe however, that these categories, rooms in particular, can each be readily extended to include subcategories such as office, kitchen, etc.

The layout estimation algorithm described in the previous section produces a floor plan composed of disjoint regions parameterized by sets of vertices. For each of the regions we can compute several features to describe the particular space, including the area, perimeter and aspect ratio. Recognizing that the outer boundary of most rooms is typically close to square, the feature which yields the largest information gain between the two classes is the turning distance [1] between the region’s outer boundary, with its perimeter normalized to 1, and the unit square. This quantity turns out to be quite useful as it implicitly captures the magnitude of various other attributes at once, such as the number of sides and the aspect ratio.
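The simpler geometric features mentioned above can be computed directly from a region's vertex list. The sketch below is illustrative only: the function name `region_features` is hypothetical, the aspect ratio is taken from the axis-aligned bounding box (an assumption, since the text does not say how it is defined), and the turning distance of [1] is not implemented here.

```python
def region_features(vertices):
    """Area, perimeter, and bounding-box aspect ratio of a rectilinear polygon.

    vertices: list of (x, y) corners in order around the boundary (either
    winding direction); consecutive corners share an x or a y coordinate.
    """
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0          # shoelace formula term
        perimeter += abs(x1 - x0) + abs(y1 - y0)  # exact for axis-aligned edges
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return {
        "area": abs(twice_area) / 2.0,
        "perimeter": perimeter,
        "aspect_ratio": max(width, height) / min(width, height),
    }
```

An L-shaped region and a square can share the same bounding-box aspect ratio, which is one reason a shape-aware measure such as the turning distance is more discriminative than these features alone.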

Using these features, it is possible to use the following classifier to discriminate between the two types of regions:


An example of a semantically labeled floor plan can be seen in Figure 2(c). While this handcrafted classifier is quite simple, it does represent a useful baseline against which other, more sophisticated, schemes could be compared.

We feel it is important to emphasize that in this approach we are not performing semantic segmentation, but rather a semantic labeling of high-level regions in a water-tight model. While it may help provide a description of individual observations, semantic segmentation of a point cloud or an occupancy grid does not produce any abstraction or model that may be useful for higher-level reasoning.

4 Experimental Results

(a) Floor Plan A
(b) Floor Plan B
             Floor Plan A               Floor Plan B
ID           GT       Est.     Err.     GT       Est.     Err.
1            3.62     3.75     0.13     2.72     2.66     0.06
2            73.15    73.91    0.76     3.23     3.25     0.02
3            3.34     3.32     0.02     27.13    27.04    0.09
4            9.03     8.31     0.72     2.17     2.10     0.07
5            8.46     8.31     0.15     9.15     9.14     0.01
6            12.60    12.73    0.13     4.61     4.57     0.04
7            4.72     4.72     0        3.59     3.58     0.01
8            29.99    30.34    0.35     45.04    45.72    0.68
9            8.41     8.38     0.04     2.02     2.02     0
10           8.48     8.31     0.17     2.10     2.10     0
11           6.33     6.37     0.04     30.01    30.06    0.05
12           9.02     9.06     0.04     2.00     2.08     0.08
13           4.80     4.84     0.04     3.30     3.31     0.01
14           8.44     8.31     0.13     1.68     1.6      0.08
15           4.56     4.56     0        -        -        -
16           8.57     8.31     0.26     -        -        -
17           1.47     1.58     0.11     -        -        -
Mean Err.                      0.18                       0.09
Mean % Err.                    1.98                       1.31
Figure 4: Floor plans A and B (left) annotated with locations of ground truth measurements. Differences between ground truth measurements and floor plan estimates are provided in table (right). All values are given in meters.

In order to evaluate the efficacy of the proposed scheme we applied it to a number of extended indoor environments. Our data sets consisted of range images, stereo images and inertial measurements acquired with the sensor system shown on the robot in Figure 1(a). Figure 2 and Figure 7 show the results of applying the interpretation in a batch mode on various data sets. In each of these cases the system was able to correctly infer the large scale building layout and partition the space into rooms and corridors. It was also able to correctly detect doorways, which are indicated on the figures. These figures also compare the inferred free space area with the free space area that is actually observed to provide an indication of the system's ability to speculate about unexplored regions.

The results show the system’s ability to infer the presence of structure which is not directly observed. For instance, in Area 3, the algorithm is able to use the easternmost plane of Rooms 3, 4, and 5, in order to infer the presence of a back wall in Rooms 6 and 7, and cover the free space observed in them. Also notice that the use of these rectangular primitives allows the system to approximate more complicated structure such as that seen in Room 4 of Area 4, which would have otherwise been lost in direct geometric model fitting schemes that would seek to approximate the entire space with a single cuboid.

In order to provide a more quantitative evaluation of the scheme, we measured the dimensions of a subset of the rooms and corridors used in our experiments with a handheld laser range finder and compared these measurements with the dimensions predicted by our automated layout scheme. These results are presented in Figure 4. The results indicate that, over all the dimensions that were considered, the measurements and the predictions agreed on average within 2%.
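The summary rows for Floor Plan A in Figure 4 can be approximately reproduced from the per-measurement values. The sketch below rests on two assumptions that the source does not state explicitly: that the first value in each row is the ground truth and the second is the estimate, and that the mean percent error is the average of the per-measurement relative errors.

```python
# Floor Plan A values transcribed from Figure 4 (meters).
# Column roles (ground truth first, estimate second) are an assumption.
ground_truth = [3.62, 73.15, 3.34, 9.03, 8.46, 12.60, 4.72, 29.99, 8.41,
                8.48, 6.33, 9.02, 4.80, 8.44, 4.56, 8.57, 1.47]
estimate = [3.75, 73.91, 3.32, 8.31, 8.31, 12.73, 4.72, 30.34, 8.38,
            8.31, 6.37, 9.06, 4.84, 8.31, 4.56, 8.31, 1.58]

abs_err = [abs(g - e) for g, e in zip(ground_truth, estimate)]
mean_err = sum(abs_err) / len(abs_err)

pct_err = [100.0 * err / g for err, g in zip(abs_err, ground_truth)]
mean_pct_err = sum(pct_err) / len(pct_err)

print(f"mean abs error: {mean_err:.2f} m, mean percent error: {mean_pct_err:.2f} %")
```

Small differences from the tabulated summary values are expected because the per-row errors in the table are themselves rounded.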

In a second set of experiments we ran the scene interpretation scheme in an online manner at regular intervals ( - seconds) in the data set to provide an understanding of how a robot's concept of the space would evolve as it moved through the environment.

The procedure was carried out in two extended environments and the results are shown in Figures 5(b) and 6(b). Although both spaces are of roughly the same dimension, meters, the length of the exploratory path taken through these spaces as well as their respective topologies are quite different. Sample images taken in both environments can be seen in Figures 5(a) and 6(a), which provide more context for the types of environments being explored.

The first environment is an academic building featuring vast hallways, large classrooms, and highly visible walls. These qualities naturally lead to a simpler space, which in turn leads to faster convergence and a more accurate model. The second environment is an abandoned industrial laboratory and contains a larger number of densely packed interconnected rooms with more built-in furniture, which occludes the structural wall surfaces. This more complex structure results in fewer observations of the dominant structure but a significant overall increase in the total number of planes detected, and as a result leads to a more challenging optimization. The sequence is also significantly longer than the first ( meters vs meters). Nonetheless, despite these challenges the system is still able to extract the major structural features of the space and produce an estimate for the floor plan.

Again, as the exploration proceeds and the system learns about more structural planes it is able to use these to posit more accurate completions of the space. For example the system is able to apprehend the dimensions of neighboring rooms on a corridor by suggesting that they share some of the same structural walls even when those surfaces are not directly observed in each room since the optimization algorithm favors simple, regular explanations.

5 Conclusion

In this paper we have described an algorithm that can be used to automatically extract plausible floor plans of indoor scenes based on the kinds of incomplete 3D data that an autonomous mobile robot could acquire. The extracted floor plan is designed to be useful for subsequent motion planning procedures since it produces water-tight explanations of space that complete partially observed layout surfaces and infer likely regions of free space. This ability to better understand the space from limited data allows the system to construct better motion plans with less data, obviating the need to exhaustively explore each corner of the scene.

Our approach exploits an algorithm that provides it with estimates for the salient structural planes in the scene and it constructs volumetric explanations using those infinite planes as boundaries. Importantly, this allows the system to suggest relationships between rooms that are not evident in the acquired data. The algorithm also exploits the connection between the layout estimation task and the set cover problem to propose effective optimization algorithms with provable performance guarantees.

The ability to accurately extract room layout structure is an important first step in scene interpretation. These results provide context which can be used to inform other semantic analysis operations such as detecting and positioning furniture and speculating about the function of different spaces. Ultimately, it helps the autonomous system to apprehend the scene at a higher level of abstraction and communicate more effectively with human interlocutors.

(a) Sample Images
(b) Estimated floor plan at , , , and seconds.
Figure 5: Online estimation results for Building A. Total distance traveled meters.
(a) Sample Images
(b) Estimated floor plan at , , , and seconds.
Figure 6: Online estimation results for Building B. Total distance traveled meters.
[Figure 7 panel grid: one row per environment (Area 1 through Area 6) and three columns (SL-LMS Input, Free Space Speculation, Semantic Floor Plan). The Area 3 panels are those shown in Figure 2(a), 2(b), and 2(c).]
Figure 7: Batch floor plan generation results in several indoor environments. The first column shows the input provided by [15]. The second column illustrates the difference between occupied, free, and speculated free space. The third column shows the semantic floor plan with doors, corridors, and rooms highlighted in yellow, magenta, and cyan respectively.


  • [1] E. M. Arkin, L. P. Chew, D. P. Huttenlocher, K. Kedem, and J. S. Mitchell. An efficiently computable metric for comparing polygonal shapes. Technical report, Cornell University, Ithaca NY, 1991.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1534–1543, 2016.
  • [3] C. Fernandez-Labrador, A. Perez-Yus, G. Lopez-Nicolas, and J. J. Guerrero. Layouts from Panoramic Images with Geometry and Deep Learning.
  • [4] A. Flint, D. Murray, and I. Reid. Manhattan scene understanding using monocular, stereo, and 3D features. In Proceedings of the IEEE International Conference on Computer Vision, pages 2228–2235. IEEE, nov 2011.
  • [5] R. Guo and D. Hoiem. Support surface prediction in indoor scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2144–2151, 2013.
  • [6] A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. In Neural Information Processing Systems (NIPS), pages 1288–1296, 2010.
  • [7] V. Hedau, D. Hoiem, and D. Forsyth. Recovering free space of indoor scenes from a single image. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2807–2814. IEEE, jun 2012.
  • [8] J. Kleinberg and E. Tardos. Algorithm design. Pearson Education India, 2006.
  • [9] C. Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. RoomNet: End-to-End Room Layout Estimation. In Proceedings of the IEEE International Conference on Computer Vision, volume 2017-Octob, pages 4875–4884. IEEE, oct 2017.
  • [10] C. Liu, J. Wu, and Y. Furukawa. Floornet: A unified framework for floorplan reconstruction from 3d scans. CoRR, abs/1804.00090, 2018.
  • [11] C. Mura, O. Mattausch, and R. Pajarola. Piecewise-planar Reconstruction of Multi-room Interiors with Arbitrary Wall Arrangements. Computer Graphics Forum, 35(7):179–188, oct 2016.
  • [12] S. Ochmann, R. Vock, R. Wessel, and R. Klein. Automatic reconstruction of parametric building models from indoor point clouds. Computers and Graphics (Pergamon), 54:94–103, feb 2016.
  • [13] B. Okorn, X. Xiong, B. Akinci, and D. Huber. Toward Automated Modeling of Floor Plans. In Symposium on 3D Data Processing, Visualization and Transmission, 2010.
  • [14] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the box: Joint 3D layout and object reasoning from single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 353–360, 2013.
  • [15] A. Shariati, B. Pfrommer, and C. J. Taylor. Simultaneous localization and layout model selection in manhattan worlds, 2018.
  • [16] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic Scene Completion from a Single Depth Image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 190–198. IEEE, jul 2017.
  • [17] C. Taylor and A. Cowley. Parsing Indoor Scenes Using RGB-D Imagery. Robotics: Science and Systems VIII, 2012.
  • [18] G. Tsai, C. Xu, J. Liu, and B. Kuipers. Real-time indoor scene understanding using Bayesian filtering with motion cues. In Proceedings of the IEEE International Conference on Computer Vision, pages 121–128. IEEE, nov 2011.
  • [19] S. Tulsiani, S. Gupta, D. Fouhey, A. A. Efros, and J. Malik. Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [20] J. Xiao and Y. Furukawa. Reconstructing the World’s Museums. International Journal of Computer Vision, 110(3):243–258, dec 2014.
  • [21] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2051–2059, 2018.