1 Introduction
Realistic 3D indoor scenes are useful for many tasks. In recent years, datasets of 3D scenes have stimulated advances in computer vision [zhang2018deep, zamir2018taskonomy], multimodal scene understanding [MatterportSim, chen2020soundspaces], 3D reconstruction [LocalImplicitGrids, SceneCAD], and visual navigation [gupta2017cognitive, GibsonEnv, AiHabitat]. In particular, multi-room 3D interiors have enabled work on long-horizon embodied AI tasks involving both manipulation and navigation.
Unfortunately, the scale and diversity of existing 3D indoor scene datasets are a bottleneck for this research agenda: the datasets are neither very large nor configurable, and enlarging them is time-consuming and expensive. For example, existing datasets provide mostly ‘detached’ rooms (ScanNet [ScanNet]), or are limited to a small number of high-quality multi-room residences (Gibson [GibsonEnv], Matterport3D [Matterport3D] and Replica [ReplicaDataset]).
Instead of using 3D reconstructions, one could turn to synthetic scene datasets, such as AI2-THOR [AI2Thor], CHALET [CHALET] and 3D-FRONT [3DFront]. One could also try to use generative models of furniture layouts to generate new synthetic 3D data [GRAINS, PlanIT, SceneGraphNet]. Regardless of how synthetic scenes are created, there exists a “reality gap” between synthetic scenes and real ones, both in terms of content (e.g. synthetic scenes are not messy enough) and style (e.g. synthetic scenes are not photorealistic enough) [RenderingSUNCG].
In this paper, we propose a new paradigm for creating house-level 3D environments by piecing together existing 3D rooms, a task that we call Roominoes. This mix-and-match strategy is similar to room-level 3D scene synthesis that selects from a database of 3D objects and arranges them to create a furnished room. Instead of arranging 3D objects, the Roominoes task requires arranging entire rooms. In other words, the rooms are building blocks to produce a coherent floor plan through a ‘retrieve and edit’ approach. This approach allows for scalability through combination of existing high-quality 3D rooms, as well as control of the level of complexity (from simple environments with few rooms, to complex environments with many rooms).
To analyze the challenges presented by the Roominoes task, we decompose it into three subtasks: (1) generating a 2D floor plan that is compatible with available 3D rooms, (2) retrieving a set of 3D rooms that can fit into the 2D floor plan, and (3) deforming individual 3D rooms so they fit the 2D floor plan. A Roominoes method must address these three subtasks. There is a spectrum of strategies for performing them, each with different tradeoffs. One could use existing 2D floor plans to provide plausible layouts; however, fitting available 3D rooms into those layouts may induce undesirably large deformations to the 3D rooms and the objects they contain. Alternatively, one could directly piece together existing 3D rooms without starting with a target layout; this will result in minimal distortion to the 3D rooms but may produce a less realistic layout. In this paper, we implement and evaluate these two strategies. To measure the quality of the generated 3D environments, we introduce a set of evaluation metrics for measuring both the 2D layout quality and 3D mesh quality (e.g. amount of deformation, navigability). We also discuss the potential usefulness of these generated environments for downstream tasks, and we demonstrate the applicability of generated 3D houses on a visual navigation task.
We find that the Roominoes task is challenging, requiring non-trivial geometry processing to properly deform room architectures while keeping contained objects relatively undeformed. We believe that compositional generation of realistic multi-room 3D houses will become increasingly useful.
2 Background & Related Work
Multi-room 3D environment generation
There is limited work that attempts to generate 3D environments with multiple rooms. Procedural content generation for studying RL is starting to leverage generated 3D environments. However, these are mostly restricted to 3D mazes [beattie2016deepmind] or towers with rooms [juliani2019obstacle], which do not reflect the complexity of real-world furnished houses. A related line of work addresses procedural generation of floor plan layouts for residences and other buildings [ResidentialLayout, bao2013generating, LayoutExploration]. These works, however, focus on generating the 3D architectural structure and do not typically produce realistically furnished 3D interiors.
Modeling by Assembly
Our task is related to the long line of research on 3D modeling by retrieving and assembling parts. Early work in this area focused on interactive modeling, using either shape similarity search [ModelingByExample] or a learned probabilistic graphical model [SidVangelisAssembly] to retrieve relevant parts for a user to combine. More recent approaches have automated both part retrieval and placement [SidVangelisSynthesis, ComplementMe]. ComplementMe [ComplementMe] is most similar to one of our approaches: it trains a deep metric learning model to retrieve 3D parts that are compatible with a partial query shape. However, floor plans differ from part-based shapes both in data representation and in constraints that must be respected, which motivates us to develop a different formulation.
Indoor Scene Synthesis
Our problem statement is similar to that of indoor scene synthesis, which aims to generate room layouts composed of existing 3D objects, whereas we focus on generating 3D layouts from existing 3D rooms. One family of approaches to this problem attempts to place objects via optimization, typically a variant of Markov Chain Monte Carlo, with respect to learned priors, handcrafted constraints, or some combination of the two [MakeItHome, InteractiveFurnitureLayout, SceneSynth, HumanCentricSUNCGSceneSynth]. Another family of approaches trains deep generative models to output scenes, using either a latent variable model such as a Variational Autoencoder [GRAINS] or a Generative Adversarial Network [QixingSynth], or using an iterative object-by-object approach [DeepSynthSIGGRAPH2018, FastSynthCVPR, PlanIT]. The iterative approaches are most similar to our setting. However, floor plan generation is a considerably different domain than furniture layout, as inter-room relationships require more geometric precision than most inter-object relationships (e.g. walls must align, portals must connect). Hence, a different approach is needed.
Floor Plan Generation
There is a large body of existing work on generating 2D floor plans. As part of a larger system for generating 3D residential house models, Merrell et al. introduced a Bayesian network for modeling the room relationship graph along with an optimization-based approach for realizing this graph through precise wall geometry [ResidentialLayout]. Liu et al. also use an optimization-based approach as part of a machine-assisted interactive system for designing precast concrete buildings [LayoutExploration]. More recently, Wu et al. developed a mixed-integer quadratic programming (MIQP) formulation for optimizing building interior layouts [MIQPLayout]. Finally, the past year has seen the introduction of deep learning methods into this problem space, resulting in learning-based methods for generating floor plans given building outlines [RPLAN], room relationship graphs [HouseGAN], or both [Graph2Plan]. These methods all address generation of 2D layouts with unconstrained room shapes. Very recent work takes a different approach of learning to generate a constraint graph and then solving the optimization problem posed by this graph to produce a final layout [GenerativeLayoutModelingConstraintGraphs]. In contrast to all these approaches, we seek to solve the problem of generating new 3D layouts, where the room shapes are constrained by an available set of existing 3D rooms.
3 Overview
Generating 3D floor plans using existing 3D rooms is a more constrained problem than typical 2D floor plan generation: one cannot generate rooms of arbitrary shapes, but instead is limited to the shapes available in an existing 3D room database. It would be too restrictive, however, to allow no changes to the 3D room geometry at all, as it is unlikely that an existing set of 3D rooms can be recombined perfectly into a novel floor plan. The likelihood of creating plausible layouts from existing rooms can be greatly increased if we allow deformations to the original rooms. By distorting the outline of the room and moving the positions of connecting doors or spaces (which we call portals), we can reuse the available 3D rooms in a more flexible manner and generate higher quality layouts. However, this approach comes at the cost of degrading 3D room quality, which decreases as the amount of distortion increases.
There is no definitive answer on how to trade off between 2D layout plausibility and 3D mesh quality; different applications may warrant different tradeoffs. Thus, we explore a spectrum of strategies which span this tradeoff space. We propose to break down the Roominoes task into three parts: i) generating a 2D floor plan that is compatible with available 3D rooms; ii) retrieving 3D rooms that can fit into the 2D floor plan; and iii) deforming the 3D rooms so they better fit the 2D floor plan. We note that the third step, room deformation, can be solved separately, provided that a correspondence can be found between each retrieved 3D room and its counterpart in the 2D floor plan. Thus, our exploration of possible strategies focuses on approaches for solving the first two steps, taking into account the interaction between them. In the remainder of this section, we first identify strategies for each of these steps and then propose two full pipelines, which we describe further in the following sections.
3.1 Generating 2D Floor Plans
3D interiors follow the structure of an underlying 2D floor plan. Taking into consideration that the rooms in a 2D floor plan must be matched with available 3D rooms, we propose the following strategies for obtaining a 2D layout:
Existing Dataset
An obvious option for obtaining 2D floor plans is to draw them from an existing floor plan dataset, such as RPLAN [RPLAN] (80K annotated floor plans), or LiFULL [LiFULL] (millions of floor plan images). These datasets are far larger than their 3D scene counterparts: the Matterport3D [Matterport3D] dataset, for example, contains only 184 floors. Using an existing floor plan guarantees the quality of the layout, but has other disadvantages. The first issue is lack of control: if one wants to replace a bedroom with a living room, or to make a balcony larger, it is unlikely that another floor plan in the dataset satisfies such demands. The second issue concerns the availability of 3D rooms which match the shape of the rooms in the 2D layouts. In general, 3D rooms with exact shape matches are not available. To fit the 3D rooms into the layout, large deformation is often necessary, which can impact the resulting 3D mesh quality.
Generative Model
Instead of using floor plans drawn directly from a dataset, one could also use one of the many available data-driven generative models of 2D floor plans to generate novel 2D layouts [RPLAN, Graph2Plan, HouseGAN]. This approach will improve the control over the resulting layout, since these methods usually accept some user input specification (e.g. a room relationship graph) and are capable of generating multiple possible layouts given the same input specification. However, this approach does not address the issue with 3D deformation artifacts.
Naive Portal Stitching
One can avoid these deformation artifacts entirely by ignoring 2D layout altogether: instead of combining 3D rooms according to a layout, we can instead stitch rooms together by connecting pairs of portals. This approach removes the need to deform 3D rooms, but may produce an incoherent layout and is incompatible with downstream tasks that involve reasoning about the global scene layout.
Smart Portal Stitching
Finally, one can attempt to find a “middle ground” tradeoff between 2D layout plausibility and amount of 3D deformation by modifying the portal stitching method above: identifying 3D rooms that can likely fit together into a plausible layout without causing too much deformation.
Table 1 summarizes the strengths and weaknesses of these different strategies for obtaining a 2D layout.
Strategy  Layout Quality  Control  Deformation 
Dataset  High  Low  Large 
Generative Model  High  Medium  Large 
Naive Portal Stitching  Low  High  None 
Smart Portal Stitching  Medium  High  Medium 
3.2 Retrieving Compatible 3D Rooms
Depending on the strategy used to generate the 2D layout, the task of retrieving 3D rooms can be quite different. If the entire 2D layout is known in advance, then the task becomes finding the 3D rooms that best match the rooms in the layout (in terms of shape and/or portal placements). On the other hand, if the layout is determined on-the-fly (as in the two “Portal Stitching” approaches above), then the task becomes more ill-posed. As there is no ‘ground truth’ target room shape to match, one must instead retrieve rooms that are both geometrically compatible with the rooms already in the layout and lead to a globally plausible layout.
3.3 Deforming 3D Rooms
The final task involves taking the retrieved 3D rooms and deforming them according to the 2D floor plan. The problem can be solved in two steps: first computing a correspondence between the original room and the target room, and then applying a standard mesh deformation algorithm according to the defined correspondence. When the available 3D rooms come equipped with semantic object annotations, one can also leverage this information to avoid introducing visually objectionable artifacts such as non-rigid bending of semantically meaningful rigid objects.
3.4 Task Paradigms
While the final task of deforming 3D rooms can be performed independently of the methods used for the first two tasks, these first two tasks are more correlated. We identify two major paradigms for solving them together:

2D before 3D: Generate the 2D floor plan first and use its room shapes as ground truth shape targets for 3D room retrieval (i.e. the “Dataset” strategy in Table 1).

2D by 3D: Build up a layout iteratively by retrieving and connecting 3D rooms (i.e. “Smart Portal Stitching” in Table 1).
In this paper, we propose one representative algorithm for each of the two paradigms. In Section 4, we describe a method that takes an existing 2D floor plan from the dataset and retrieves 3D rooms similar to the rooms in that floor plan. In Section 5, we implement a method that uses deep metric learning to iteratively retrieve compatible 3D rooms and then applies a mixed integer quadratic programming (MIQP) optimization to assemble them into the final layout. Finally, in Section 6, we describe a strategy for deforming the 3D rooms, by first optimizing for a set of dense correspondences between the room outlines, and then applying cage deformation to deform the 3D rooms to the target outline.
[Figure: Query Room | Best 4 Retrievals | Worst]
4 Generating 2D Layout Before Retrieving 3D Rooms
In this section, we describe an algorithm that utilizes a large collection of available 2D floor plans to guide the retrieval of 3D rooms. Since the 2D layout is given by a floor plan from the dataset, the focus of this algorithm is on minimizing deformation of the retrieved 3D rooms needed to fit into the 2D floor plan. Since it is computationally expensive to perform deformation for all available 3D rooms, we instead develop approximate metrics to assess how much deformation would be required.
4.1 Data Preparation
As our source of 2D floor plans, we use the RPLAN dataset [RPLAN], a collection of floor plan images annotated at pixel level. As in prior work, we apply additional filtering to the data: we make all walls the same thickness, retain only the largest portal between each pair of connecting rooms, and remove floor plans containing multipurpose rooms.
As our source of 3D room data, we use the Matterport3D dataset [Matterport3D], a collection of floor plans containing rooms. For each 3D room, we build a 2D representation equivalent to those in the 2D floor plan dataset by extracting its outline (given by human annotations) and using ray casting to identify open regions on its walls, which we consider as portals. Since all of our 2D rooms are rectilinear, we also convert the non-rectilinear architecture of 3D rooms by computing their rectilinear bounding cages (i.e. the union of axis-aligned bounding boxes of all wall segments). We discard rooms that are not closed as well as rooms where a portal falls on a non-rectilinear part of the wall. We also discard room types not present in RPLAN. This leaves us with a final retrieval database of rooms. Finally, we apply a four-way rotational augmentation of this dataset.
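The four-way rotational augmentation can be illustrated on a 2D room outline. The following is a minimal sketch, assuming outlines are stored as lists of (x, y) vertices and rotated about the origin; the function names `rotate90` and `four_way_augment` are hypothetical and not taken from our implementation. Portal positions would be rotated in the same way.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def rotate90(outline: List[Point], k: int) -> List[Point]:
    """Rotate a polygon counter-clockwise about the origin by k * 90 degrees."""
    pts = list(outline)
    for _ in range(k % 4):
        # (x, y) -> (-y, x) is an exact 90-degree counter-clockwise rotation.
        pts = [(-y, x) for (x, y) in pts]
    return pts

def four_way_augment(outline: List[Point]) -> List[List[Point]]:
    """Return the outline rotated by 0, 90, 180 and 270 degrees."""
    return [rotate90(outline, k) for k in range(4)]

# Example: an L-shaped rectilinear room outline.
room = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 3), (0, 3)]
variants = four_way_augment(room)
```

Because rotations by multiples of 90 degrees map axis-aligned edges to axis-aligned edges, every augmented outline stays rectilinear.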
4.2 Retrieving Compatible 3D Rooms
Given a room from the 2D floor plan, we first filter the 3D rooms down to a compatible subset: those that have the same type label and contain the same number of portals. For each 3D room in the filtered subset, we compute a matching score between the source 3D room and the target 2D room, which combines three terms:
An area term penalizes large differences in room areas, which would lead to large uniform scaling in the deformation process.
A shape term computes the two-way chamfer distance between point samples of the two room outlines, with the rooms normalized to have unit area and centered. This penalizes room outline differences that lead to non-rigid deformations.
A portal term tries all possible ways of pairing the portals. There are only as many pairings as there are portals, because the order of the portals along the outline cannot be changed, so fixing the match for a single portal uniquely determines the pairing of the rest.
The portal matching cost compares each pair of matched portals using three quantities: the position of the midpoint of the portal in the 1D, unit-length parameterization of the room outline; the length of the portal in the same 1D parameterization; and an indicator function that takes the value 1 if the front-facing directions of the two portals are different. This penalizes matching portals that might lead to large deformation: we do not want to slide the portal along the wall by too much, scale it too much, or put it in the wrong orientation. The same term weights are used for all experiments.
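To make the structure of this score concrete, the following is a minimal sketch of the three terms. It is an illustration under stated assumptions rather than the exact formulation: the log-ratio form of the area term, the circular distance between portal midpoints, the unit weights, and all function names are assumptions. Portals are represented as hypothetical `(midpoint, length, facing)` tuples in the 1D unit-length parameterization of the outline.

```python
import numpy as np

def polygon_area(poly):
    """Shoelace area of a simple polygon given as an (N, 2) array."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def sample_outline(poly, n=200):
    """Sample n points evenly by arc length along a closed polyline."""
    nxt = np.roll(poly, -1, axis=0)
    seg_len = np.linalg.norm(nxt - poly, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    t = np.linspace(0.0, cum[-1], n, endpoint=False)
    idx = np.searchsorted(cum, t, side="right") - 1
    frac = (t - cum[idx]) / seg_len[idx]
    return poly[idx] + frac[:, None] * (nxt[idx] - poly[idx])

def chamfer(a, b):
    """Two-way chamfer distance between two point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def shape_term(src, tgt):
    """Chamfer distance between outlines scaled to unit area and centered."""
    def normalized_samples(poly):
        pts = sample_outline(poly) / np.sqrt(polygon_area(poly))
        return pts - pts.mean(axis=0)
    return chamfer(normalized_samples(src), normalized_samples(tgt))

def portal_term(src, tgt, w_len=1.0, w_dir=1.0):
    """Cheapest pairing over cyclic shifts, which preserve portal order."""
    n = len(src)
    if n == 0:
        return 0.0
    best = np.inf
    for shift in range(n):
        cost = 0.0
        for i in range(n):
            m_s, l_s, f_s = src[i]
            m_t, l_t, f_t = tgt[(i + shift) % n]
            d = abs(m_s - m_t)
            cost += min(d, 1.0 - d)              # sliding along the outline
            cost += w_len * abs(l_s - l_t)       # change in portal length
            cost += w_dir * float(f_s != f_t)    # facing-direction mismatch
        best = min(best, cost)
    return best

def matching_score(src_poly, tgt_poly, src_portals, tgt_portals):
    """Unweighted sum of the area, shape and portal terms (assumed form)."""
    area_term = abs(np.log(polygon_area(src_poly) / polygon_area(tgt_poly)))
    return (area_term + shape_term(src_poly, tgt_poly)
            + portal_term(src_portals, tgt_portals))

room = np.array([[0, 0], [4, 0], [4, 3], [0, 3]], dtype=float)
portals = [(0.10, 0.05, 0), (0.60, 0.05, 2)]
score_same = matching_score(room, room, portals, portals)
score_scaled = matching_score(room, 2 * room, portals, portals)
```

In this sketch a room matched against itself scores zero, while a uniformly scaled copy is penalized only by the area term, since the shape term is computed after normalizing both outlines to unit area.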
Figure 1 shows examples of rooms retrieved using this score. The score allows retrieval of rooms that are similar regarding size, shape and portal locations. However, due to the limited availability of 3D rooms, it becomes increasingly unlikely to get an exact match as the number of portals and the complexity of the room shape increase.
5 Generating 2D Layout By Retrieving 3D Rooms
While guiding 3D retrieval with existing 2D floor plans ensures the quality of the resulting scene’s 2D layouts, it can lead to large deformations of 3D rooms. To reduce the amount of deformation, it is possible to instead build the layout ‘on the fly’: retrieving 3D rooms one by one and iteratively assembling them into a plausible 2D layout, changing the rooms slightly as needed. In this section, we describe one pipeline that follows this paradigm. We extract a graph abstraction of an existing 2D floor plan, where each node encodes the type of the room and each unlabeled edge represents a portal connection. Conditioned on the graph, we iteratively retrieve compatible 3D rooms and use an optimization procedure to assemble them into a partial floor plan, repeating these steps until the layout is complete.
5.1 Room Retrieval
Each iteration of the algorithm must retrieve a 3D room corresponding to a node in the input layout graph. The retrieved room needs to be compatible not only with the shapes of the other rooms retrieved so far, but also with the rooms retrieved later on, in order to form a coherent floor plan. Instead of trying to manually select features that lead to such compatibility, we formulate the retrieval problem as a deep metric learning problem: we learn an embedding space of possible room shapes, where rooms which are compatible with the same layouts should be grouped together. This is similar in spirit to the approach taken by Sung et al. for learning to assemble part-based shapes [ComplementMe]. Our neural architecture has two subnetworks: an embedding network and a retrieval network. The embedding network maps individual 3D rooms into a high-dimensional vector space; the retrieval network maps an input partial floor plan, as well as a relation graph of the entire floor plan, into a probability distribution over this space. The two networks are trained jointly using a contrastive learning framework, i.e. rooms that are known to fit well with a layout in the training set should map to higher-probability points than other rooms. Figure 2 shows the architecture of these networks and their interaction; the rest of this section describes them in more detail.
Embedding Network
The embedding network maps a 2D room into an embedding vector. For this, we pass an image-based representation of the room through a convolutional neural network (CNN). The image representation uses separate channels to represent the rooms, walls, portals, as well as the specific portal to which we want to connect the next room.
Retrieval Network
The retrieval network takes the current partial floor plan, the target floor plan relation graph, and the node corresponding to the room to be inserted, and predicts a conditional probability distribution over room embeddings. Similarly to the work of Sung et al. [ComplementMe], we model this distribution as a mixture of Gaussians over a room shape embedding space. We use a mixture density network [MDN] to predict this distribution.
Since the features of the partial floor plan and the relation graph are highly correlated, we use a single neural network to learn from both simultaneously. We first use a message passing neural network [NeuralMessagePassing] to extract information from the relation graph. After a fixed number of rounds of message propagation, we predict a per-node descriptor, as well as a descriptor of the entire graph. To extract information from the partial floor plan, we use a CNN conditioned on an image-based representation of the floor plan to predict an image feature. The image-based representation is the same as that used by the embedding network, with an additional set of channels. These channels encode the corresponding node embedding for each room in the layout. Finally, we concatenate the image feature, the graph descriptor, and the descriptor for the room to be retrieved, and feed the result to a mixture density network, which predicts the parameters of the Gaussian mixture, where each Gaussian is parameterized by a weight, a mean, and a standard deviation.
Joint Training
We train the embedding and retrieval networks jointly using a triplet loss. Given a room embedding, we define an embedding loss based on the log-likelihood that the embedding is sampled from the mixture distribution predicted by the retrieval network. Using this embedding loss, we define a contrastive loss [TripletLoss] between the embedding of a positive example room and the embedding of a negative example room, separated by a constant margin. In other words, this loss encourages positive example rooms to have a higher predicted likelihood than negative example rooms. Positive example rooms are created by removing a room from one of the training layouts. Negative examples are created by sampling rooms from other floor plans. For rooms with more than one portal, we randomly select a subset of portals to show in the query portal mask.
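The training objective can be illustrated with a small NumPy sketch. This is not our training code: it assumes isotropic Gaussians (one scalar standard deviation per mixture component), takes the embedding loss to be the negative log-likelihood, and uses a hinge form for the margin loss. The function names and the margin value are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, sigmas):
    """Log-likelihood of x under an isotropic Gaussian mixture.

    x: (D,), weights: (K,), means: (K, D), sigmas: (K,).
    """
    d = x.shape[0]
    sq_dist = np.sum((x[None, :] - means) ** 2, axis=1)
    log_comp = (np.log(weights)
                - 0.5 * d * np.log(2.0 * np.pi * sigmas ** 2)
                - 0.5 * sq_dist / sigmas ** 2)
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))

def embedding_loss(x, weights, means, sigmas):
    """Negative log-likelihood of an embedding (assumed form)."""
    return -gmm_log_likelihood(x, weights, means, sigmas)

def triplet_loss(x_pos, x_neg, weights, means, sigmas, margin=1.0):
    """Hinge loss: the positive should be more likely than the negative."""
    e_pos = embedding_loss(x_pos, weights, means, sigmas)
    e_neg = embedding_loss(x_neg, weights, means, sigmas)
    return max(0.0, margin + e_pos - e_neg)

# Example: a single unit Gaussian centered at the origin in 2D.
w = np.array([1.0])
mu = np.zeros((1, 2))
sigma = np.array([1.0])
good = np.zeros(2)            # embedding at the predicted mode
bad = np.array([10.0, 0.0])   # embedding far from the mode
loss_ok = triplet_loss(good, bad, w, mu, sigma)
loss_violated = triplet_loss(bad, good, w, mu, sigma)
```

The loss is zero once the positive embedding is more likely than the negative one by at least the margin, and grows linearly with the violation otherwise.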
[Figure: Graph Context | Floor Plan | Best 3 Retrievals | Worst]
We use rounds of propagation for the graph neural network. We choose as the dimension of the per-node embedding and as the dimension of the graph embedding. The CNN contains seven layers, without batch normalization, which we find to be detrimental to the training process. We use as the dimension of the predicted image feature. For the retrieval network, we predict a mixture of Gaussians over an dimensional embedding space. We use a margin of for the triplet contrastive loss. We train the neural networks using the Adam [Adam] optimizer, and with a batch size of . At each batch, negative examples are sampled from randomly selected floor plans. We start with uniformly selecting from these examples, gradually favoring harder examples as the training proceeds.
Figure 3 shows examples of rooms predicted by the retrieval network. The network learns to retrieve rooms of the correct shape, correct type, and the correct number of portals. Note that the networks are not trained to recognize the possibility of using a room after rotations. To allow for this, we apply a four-way rotational augmentation to all the candidate 3D rooms instead.
5.2 Placing Retrieved Rooms into the Floor Plan
We use the trained embedding and retrieval networks to iteratively retrieve and insert rooms into the layout. We determine the location of the new room by matching the position of the portal to the connecting rooms. When multiple such portals exist, we randomly select one for initialization.
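The initial placement by portal matching can be sketched as a pure translation. This is a simplified illustration, assuming each portal is summarized by its 2D midpoint and that the retrieved room is already in the desired rotation; `place_by_portal` is a hypothetical helper name. The result is only an initialization.

```python
import numpy as np

def place_by_portal(room_outline, room_portal_mid, target_portal_mid):
    """Translate a room so that its portal midpoint lands on the target portal.

    Returns the translated outline and the applied offset.
    """
    offset = (np.asarray(target_portal_mid, dtype=float)
              - np.asarray(room_portal_mid, dtype=float))
    return np.asarray(room_outline, dtype=float) + offset, offset

# An existing room spans [0, 4] x [0, 3] with a portal on its right wall.
existing_portal_mid = (4.0, 1.5)
# The new room spans [0, 3] x [0, 2] locally, with a portal on its left wall.
new_room = np.array([[0, 0], [3, 0], [3, 2], [0, 2]], dtype=float)
new_portal_mid = (0.0, 1.0)

placed, offset = place_by_portal(new_room, new_portal_mid, existing_portal_mid)
```

In this example the new room ends up directly to the right of the existing one, sharing the wall that carries the matched portal.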
[Figures: panel labels Retrieval | Optimization; Without | With]
However, it is unlikely that the retrieved room will fit perfectly into the current partial floor plan. We use beam search to increase the chance that we find better fits. To resolve remaining inconsistencies, we apply an optimization procedure after each step of room retrieval. Inspired by prior work [MIQPLayout, GenerativeLayoutModelingConstraintGraphs], we adopt a mixed integer quadratic programming (MIQP) based procedure for layout optimization. The optimization procedure has two goals. The first is to ensure that the partial floor plan is valid, i.e. each room has a positive area, no overlaps exist between different rooms, and all paired portals align with each other. Figure 4 shows examples of this process. The second goal is to avoid obvious semantic issues in the floor plan. We tackle one such issue, interior voids in the floor plan, by encouraging the rooms to be adjacent to each other, though additional heuristics could be employed to improve the quality of the 2D floor plans further. Figure 5 shows the effect of encouraging room adjacency. These goals must be achieved while minimizing deformation of the room outlines and the portal locations. The precise formulation of the objective function and constraints can be found in the supplemental material.
6 Deforming Retrieved 3D Rooms
Having generated a 2D layout and retrieved 3D rooms, the final task involves deforming the individual room geometries to fit the layout. This is a non-trivial problem: the 3D room must be warped to fit the new 2D outline, but we must take care to avoid introducing visually objectionable artifacts, most prominently non-rigid distortion to semantically meaningful objects within the room.
Deforming a 3D room to fit an optimized floor plan layout amounts to solving an analogy problem: given the source 3D room mesh and its 2D outline polygon, as well as the target 2D outline polygon produced by the optimizer, what is the corresponding target 3D room mesh?
In this section, we describe a two-step solution. First, we warp the source outline onto the target outline by optimizing for a dense correspondence between them. Then, we use the warped outline as a control cage for cage-based deformation of the source mesh to produce the target mesh.
Finding a correspondence between outlines
Our first step in deforming the 3D room mesh to fit the new target room outline is to deform its 2D outline to the target outline. As both outlines are closed, one-dimensional, piecewise linear curves, this problem reduces to finding a correspondence between the source outline and the target outline. We solve this problem by sampling a finite set of 1D points along the source outline (including all of its corner points) and then optimizing for a corresponding set of 1D points on the target outline, by minimizing an objective that combines three terms:
An elasticity regularization term encourages the output points to be spread out evenly on the target outline.
A normal matching term encourages originally vertical segments to stay vertical and originally horizontal segments to stay horizontal; it is evaluated on the 2D positions of the output points along the target outline.
A corner term encourages source corner points to correspond to target corner points, using a constant that indicates whether each point is a corner in the source or target outline.
In addition, we constrain the sequence of output points to remain monotonic along the target outline. To make sure that the portals fall on the right segment of the outline after deformation, we impose the additional constraint that, for each pair of matched source and target portals, the endpoints of the original portal fall between the corner points that contain the target portal. For both the source and target outlines, we define the start of the 1D parameterization to occur at the upper-leftmost corner of the outline.
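The 1D parameterization that the correspondence is expressed in can be sketched as follows. This illustrates only the parameterization, the sampling of source points, and the monotonicity check, not the optimization itself. It assumes the outline's first vertex is the chosen start corner and normalizes the outline to unit length; all function names are hypothetical.

```python
import numpy as np

def corner_params(outline):
    """1D parameter in [0, 1) of every corner of a closed polyline."""
    pts = np.asarray(outline, dtype=float)
    seg = np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    return cum[:-1] / cum[-1]

def point_at(outline, t):
    """2D position of the point at parameter t (taken modulo 1)."""
    pts = np.asarray(outline, dtype=float)
    params = np.append(corner_params(outline), 1.0)
    t = t % 1.0
    i = min(int(np.searchsorted(params, t, side="right")) - 1, len(pts) - 1)
    a, b = pts[i], pts[(i + 1) % len(pts)]
    frac = (t - params[i]) / (params[i + 1] - params[i])
    return a + frac * (b - a)

def sample_params(outline, n_extra):
    """All corner parameters plus n_extra evenly spaced ones, sorted."""
    extra = np.linspace(0.0, 1.0, n_extra, endpoint=False)
    return np.unique(np.concatenate([corner_params(outline), extra]))

def is_monotonic(ts):
    """True if the sequence of 1D parameters never decreases."""
    return bool(np.all(np.diff(ts) >= 0))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
ts = sample_params(square, 8)
mid_bottom = point_at(square, 0.125)
mid_left = point_at(square, 0.875)
```

A candidate correspondence is then a second sequence of parameters on the target outline, one per source sample, which must pass `is_monotonic` to satisfy the ordering constraint.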
Outlinedriven deformation
Given the warped outline produced by the correspondence step, we drive a cage-based deformation of the room mesh. That is, the source outline is treated as the initial polygonal control cage for the mesh, and the warped outline provides the new positions of the cage vertices. We then interpolate the positions of all interior mesh vertices using mean value coordinates (MVC) [MeanValueCoordinates]. Finally, we deform each wall that contains a portal to make sure the portal lands at the right position. Figure 6 shows an example of these procedures applied to a 3D room.
[Figure: Input Room | Input Outline | Target Outline | Deformed Room]
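The interpolation step can be illustrated with a compact 2D implementation of mean value coordinates. This is a sketch under stated assumptions: it treats only the horizontal (x, y) coordinates of a vertex, assumes the query point lies strictly inside the cage and not on an edge, and does not include the portal-specific wall adjustment. The function names are hypothetical.

```python
import numpy as np

def mean_value_coordinates(cage, x):
    """Mean value coordinates of a 2D point x w.r.t. a closed polygon cage."""
    cage = np.asarray(cage, dtype=float)
    d = cage - np.asarray(x, dtype=float)      # vectors from x to each vertex
    r = np.linalg.norm(d, axis=1)
    d_next = np.roll(d, -1, axis=0)
    r_next = np.roll(r, -1)
    dot = np.sum(d * d_next, axis=1)
    cross = d[:, 0] * d_next[:, 1] - d[:, 1] * d_next[:, 0]
    # tan(alpha_i / 2) = (1 - cos alpha_i) / sin alpha_i, with a signed angle.
    # A zero cross product for an interior point means alpha_i = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        tan_half = np.where(np.abs(cross) > 1e-12,
                            (r * r_next - dot) / cross, 0.0)
    w = (np.roll(tan_half, 1) + tan_half) / r
    return w / w.sum()

def deform_points(points, src_cage, dst_cage):
    """Move points by re-applying their source-cage coordinates to dst_cage."""
    dst = np.asarray(dst_cage, dtype=float)
    return np.array([mean_value_coordinates(src_cage, p) @ dst
                     for p in points])

src_cage = np.array([[0, 0], [2, 0], [2, 2], [0, 2]], dtype=float)
dst_cage = src_cage * np.array([2.0, 1.0])     # stretch the room along x
p = np.array([0.5, 0.7])
lam = mean_value_coordinates(src_cage, p)
moved = deform_points([p], src_cage, dst_cage)
```

Mean value coordinates reproduce linear functions, so a cage that is stretched affinely, as in this example, moves interior points by the same affine map. More general outline changes produce a smooth non-rigid warp.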
Handling rigid objects
The deformation scheme described above non-rigidly distorts the room geometry. Since room retrieval and layout optimization try to retrieve and place rooms that fit into a layout together without changing their shape much, this distortion is typically small and unobjectionable. However, when objects within a room undergo such deformations, it can result in semantically implausible bending artifacts. To prevent such artifacts, we cut all labeled objects other than those with labels floor, ceiling, wall or curtain out of the mesh, deform the rest of the mesh using the scheme above, and then insert the objects back into the deformed room mesh. To determine object insertion positions, we move their original centroid position according to the MVC deformation. In some cases, this results in placing the object in a position that intersects with other objects or the room geometry. If this occurs, we push the object away from the collision until the intersection is resolved. If doing so results in another intersection before the first is resolved, we uniformly scale the object down until the collision is resolved. Figure 7 shows deformation results with and without this special logic for handling rigid objects.

[Figure: w/out Rigid Objects | w/ Rigid Objects]
7 Evaluation Strategies for 3D Layout Generation
In this section, we propose a set of metrics for evaluating the effectiveness of methods that solve the Roominoes task. First, we identify the properties of the output that are of interest to downstream tasks. We then define metrics that estimate the quality of the results, without having to spend many machine hours to actually test the generated data on downstream tasks. Finally, we demonstrate that 3D scenes generated by Roominoes can be useful for downstream tasks.
7.1 Evaluating 2D Layout Quality
The quality of a generated floor plan’s 2D layout is important for tasks that require reasoning about the global structure of a scene, e.g. indoor navigation. We propose the following metrics to evaluate 2D layout quality:

Fréchet Distance (FD): a measure of distributional similarity between the generated floor plans and those in a held-out test set [FrechetInceptionDistance]. We evaluate this distance in two different feature spaces: a pretrained Inception image classifier, i.e. the standard Fréchet Inception Distance (FID), and the CNN component of the retrieval network described in Section 5 (FD[R]). The first is more general, whereas the second is domain-specific.
Classifier Fool Percentage: the percentage of held-out test layouts classified as “generated” by a classifier trained to distinguish between generated 2D layouts and layouts from the training dataset. A value of would suggest that the two sets of layouts are indistinguishable from each other.
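To make the Fréchet Distance concrete: it fits a Gaussian to each set of feature vectors and compares the two Gaussians. The general form is ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). The sketch below is a simplified illustration that assumes diagonal covariances, in which case the trace term reduces to a per-dimension sum; it is not the full-covariance computation used for FID, and the function names are hypothetical.

```python
import math

def gaussian_stats(features):
    """Per-dimension mean and (population) variance of equal-length vectors."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in features) / n for d in range(dims)
    ]
    return means, variances

def frechet_distance_diagonal(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Simplifying assumption: covariances are treated as diagonal.
    """
    mu_a, var_a = gaussian_stats(feats_a)
    mu_b, var_b = gaussian_stats(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term
```

In this sketch, `feats_a` and `feats_b` would hold the feature vectors extracted from generated and held-out layouts, respectively; lower values indicate more similar distributions.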
We compare the following sources of 2D floor plans:

Rectangles: Randomly place rectangles, where equals the number of nodes in the graph. This simple baseline establishes a lower bound on performance.

Dataset: Floor plans drawn from a dataset (RPLAN [RPLAN]) that are not used for training the following models.

Generative Model: Generate a floor plan given the polygonal outline of the building, using a recent method [RPLAN].

Smart Portal Stitching: Build a 2D layout while retrieving 3D rooms using the algorithm described in Section 5.

Smart Portal Stitching (no net): Use the algorithm from Section 5, but do not use the learned retrieval network; instead, retrieve random rooms with correct type and portal number.
Method  FID  FD[R]  % fool 
Rectangles  
Dataset  
Generative Model  
Smart Portal Stitching  
Smart Portal Stitching (no net) 
Table 2 shows the results of this experiment. As expected, sourcing 2D layouts from a floor plan dataset or a data-driven 2D floor plan generative model produces the highest quality layouts. The smart portal stitching methods sacrifice layout plausibility, although the version with the learned retrieval network performs better. Figure 8 shows examples of dataset layouts vs. those produced by smart portal stitching. The layouts generated by the smart portal stitching method have less consistent outlines and sometimes contain implausible features, such as interior holes. In extreme cases, the layout quality can be too low to be useful, as visualized in Figure 8(a) and 8(b).
[Figure 8 panels. Match 2D Layout Shape (2D: Dataset): 2D Plan, 2D Original, 3D Plan. Smart Portal Stitching: 2D Plan, 2D Original, 3D Plan.]
7.2 Evaluating 3D Mesh Quality
Large deformations of 3D room meshes will degrade perceived visual quality of the output scenes, which is important to most downstream tasks. We propose the following metrics to evaluate mesh quality in the presence of deformations:

Area Change (Area): Percent change in the area of the room’s 2D outline after deformation.

Outline Change (Outline): Average distance between corresponding points on the outlines before and after deformation.

Portal Change (Portal): Average distance between the midpoint of the portal after the deformation and the final portal location (obtained through deforming the wall segment).

Mesh Statistics: For each mesh, we discretize the area of each triangle and the length of each edge into 64 bins, and compute the Wasserstein distance of the triangle area distribution (Mesh A) and edge length distribution (Mesh E).
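The mesh statistics can be sketched as follows. This is a minimal illustration, assuming that the two distributions being compared are those of a mesh before and after deformation and that both histograms share one bin range; neither detail is spelled out above, and all function names are hypothetical.

```python
import math

def triangle_areas(vertices, faces):
    """Area of each triangle; vertices are (x, y, z), faces are index triples."""
    areas = []
    for i, j, k in faces:
        ux, uy, uz = (vertices[j][d] - vertices[i][d] for d in range(3))
        vx, vy, vz = (vertices[k][d] - vertices[i][d] for d in range(3))
        cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
        areas.append(0.5 * math.sqrt(cx * cx + cy * cy + cz * cz))
    return areas

def histogram(values, lo, hi, bins=64):
    """Normalized histogram over [lo, hi]; assumes hi > lo and non-empty values."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def wasserstein_1d(p, q, bin_width):
    """1-Wasserstein distance between two histograms on the same bins."""
    cdf_gap = 0.0
    distance = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi          # running difference of the two CDFs
        distance += abs(cdf_gap) * bin_width
    return distance
```

The same `histogram` and `wasserstein_1d` helpers apply unchanged to edge lengths; only the list of input values differs.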
We use these metrics to compare the following 3D room retrieval methods:

Match 2D Layout Shape: Retrieve rooms whose shape best matches the shape of their corresponding room in a given input 2D floor plan (i.e. the algorithm from Section 4).

Smart Portal Stitching: Same as above.

Smart Portal Stitching (no net): Same as above.
Method  Area  Outline  Portal  Mesh A  Mesh E 
Match 2D Layout Shape  
Smart Portal Stitching  
Smart Portal Stitching (no net) 
Scene Source  Success Rate  SPL 
Match 2D Layout Shape  
Smart Portal Stitching  
MP3D Scenes 
Scene Source  Success Rate  SPL 
Smart Portal Stitching  
MP3D Scenes 
Table 3 shows the results of this experiment. The smart portal stitching strategy requires significantly fewer changes to the retrieved 3D rooms with respect to size, outline shape, portal locations, and mesh properties. Using a neural network further amplifies this effect. On the other hand, due to the limited availability of 3D rooms, the match and deform strategy struggles to find 3D rooms that match exactly, and thus has to modify each room much more. Figure 8 illustrates the differences in terms of the amount of change required and the impact on 3D mesh quality. In extreme cases, the retrieved rooms are so incompatible that deforming them leads to unacceptable mesh quality. Figure 8(c) and 8(d) show two such examples.
7.3 Evaluating on Downstream Tasks
To demonstrate that the 3D house scenes we generate through Roominoes can be useful for downstream tasks, we use them for two small-scale indoor visual navigation experiments. For the first experiment, we use a DD-PPO agent [wijmans2020ddppo] trained on the Gibson [GibsonEnv] dataset in the Habitat simulation environment [AiHabitat]. We task the agent with performing a series of point goal navigation tasks in scenes generated by our method, as well as in original Matterport3D scenes. To evaluate how well the agent navigates these scenes, we use two standard metrics from the visual navigation literature: Success Rate and Success Weighted by Path Length (SPL) [EmbodiedNavEval]. The pretrained agent can navigate reasonably well in the scenes generated by both methods. Additional qualitative examples of agents navigating in the generated scenes are in the supplementary material.
For the second experiment, we train two agents with proximal policy optimization [PPO] using the settings from [AiHabitat]: one on 104 floor plans generated by the Smart Portal Stitching strategy, and the other on 184 floor plans from Matterport3D. Both agents are trained for 30 million steps. We then evaluate the agents on the Gibson [GibsonEnv] validation set as used in the Habitat challenge [AiHabitat]. Table 5 summarizes the results. The agent trained on our data achieves a success rate comparable to that of the agent trained on the original Matterport3D data, though with a relatively lower SPL. These experiments suggest that, with respect to indoor visual navigation, data generated by our method is of comparable quality to original Matterport3D data.
A full-scale evaluation of the impact of data augmentation with our data is computationally challenging, as it has been shown that navigation agents continue to learn from data after billions of steps [wijmans2020ddppo]. The supplemental material reports on an experiment intended to approximate this setting; we leave rigorous, full-scale evaluation to future work. Regardless, it has been shown that additional training data benefits agent performance, even if the additional data is of poor quality [wijmans2020ddppo]. Thus, we believe that the scenes we generate, which we have shown to be of reasonable quality, can be used to further improve the performance of navigation agents.
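For reference, the two navigation metrics can be computed as in the sketch below, which follows the standard definition of SPL: each successful episode contributes the ratio of the shortest path length to the longer of the shortest and the traveled path length. The episode tuple layout is an assumption made for illustration, and shortest path lengths are assumed to be positive.

```python
def success_rate(episodes):
    """Fraction of successful episodes.

    Each episode is a (success, shortest_path_length, agent_path_length) tuple.
    """
    return sum(1 for success, _, _ in episodes if success) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A detour (taken > shortest) scales the credit down.
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

An agent that always succeeds but takes paths twice as long as necessary would therefore reach a success rate of 1 and an SPL of 0.5, which is why the two metrics can diverge.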
7.4 Timing
On average, the Match 2D Layout Shape strategy takes about 30 seconds to complete a 2D layout. The Smart Portal Stitching strategy, in contrast, requires about 10 minutes per floor plan. Virtually all of this time is spent on 2D layout optimization, where the optimization time per room insertion varies considerably with layout complexity (from 0.1 to 60 seconds, the latter being our hard time limit). We note that the time spent on optimization can be reduced from 10 minutes to 1 minute by disabling beam search. Computing outline correspondence takes about 1 minute per floor plan with 250 point samples. This can be reduced drastically by using fewer point samples. Finally, deforming the 3D mesh takes 3 minutes per floor plan, primarily due to the large number of vertices we have to handle.
8 Conclusion
In this paper, we presented Roominoes, a new task for creating house-level 3D environments by mixing and matching existing 3D rooms. We examined possible solutions to three subtasks and implemented two representative algorithms for solving the task. As discussed in Section 7, we found Roominoes to be challenging, primarily due to the need to trade off between 2D layout quality and 3D mesh quality. There are several potential directions one can take to address such challenges. First, the 2D floor plan quality for the “2D by 3D” class of methods can be improved by designing additional semantic features capturing what constitutes a good floor plan. For example, one could attempt to reduce the number of corners of the generated 2D floor plans to make them more regular, or fine-tune the shape of individual rooms to make the floor plan more similar to real ones. The 3D room deformation procedure can also be improved. Our current method does not fare well in the case of large deformations, particularly when portals need to slide significantly along walls. Future work might instead perform additional mesh surgery, e.g. cutting and pasting the portal regions to a different part of the wall instead of deforming the entire wall, or copying and pasting some existing regions instead of applying large deformations to those regions. Doing so leads to the need to synthesize new textures, which is a challenging problem on its own. Currently, there is no guarantee that the layout of the 3D room is still realistic after deformation. Thus, it would be beneficial to augment the control points with semantic layout rules learned from available 2D layout data. This can be extended further to take into account relations between individual rooms. It is also possible to experiment with additional 3D room datasets such as ScanNet [ScanNet] or Gibson [GibsonEnv], in order to increase the chance of retrieving rooms that fit the target floor plan well.
Finally, a more thorough evaluation on downstream tasks such as visual navigation, as well as on other tasks that benefit from 3D scene datasets, can reveal which properties of the generated scenes are most beneficial for such tasks.
Beyond addressing these limitations, there are several broader avenues for future work. For instance, one can extend Roominoes to support on-demand data generation, creating scenes tailored to scenarios where methods for downstream tasks are underperforming. It is also worth investigating strategies that bridge the gap between different datasets: in this work, we simply combined one dataset of 2D floor plans with another dataset of 3D rooms. While the performance is acceptable, a more careful investigation addressing differences between datasets can potentially improve the generalization performance of the algorithms.
Finally, we would be excited to see Roominoes applied at scale to more of the applications that originally motivated its creation: computer vision, scene understanding, 3D reconstruction, and embodied AI. There are a number of interesting questions for future work. How do we find the right trade-off between 2D and 3D quality depending on the type of downstream task? Does training models for these applications on massive sets of Roominoes-generated 3D floor plans improve their generalization performance? And is it possible to reduce the number of training scenes needed if those scenes are tailored to the task(s) the method must address? These are some exciting questions that would be fruitful to explore with Roominoes.
Acknowledgments
We thank Supriya Gadi Patil and Jingling Li for discussions in the early stage of the project, and the anonymous reviewers for their helpful suggestions. An earlier version of this project was started when Kai Wang interned at Facebook AI Research, with helpful inputs from Dhruv Batra, Oleksandr Maksymets and Erik Wijmans. This work was funded in part by NSF award #1907547. Angel X. Chang is supported by a Canada CIFAR AI Chair and NSERC Discovery Grant, and Manolis Savva by a Canada Research Chair and NSERC Discovery Grant. Daniel Ritchie is an advisor to Geopipe and owns equity in the company. Geopipe is a startup that is developing 3D technology to build immersive virtual copies of the real world with applications in various fields, including games and architecture.
References
A Optimizing Post-Retrieval Layouts
In this section, we describe the details of the optimization process we use to align the rooms retrieved by the retrieval process described in Section 5 of the main paper. Inspired by prior work [MIQPLayout, GenerativeLayoutModelingConstraintGraphs], we adopt a mixed integer quadratic programming (MIQP) based procedure for layout optimization.
A.1 Decomposing Rooms into Rectangles
Rooms in our representation are arbitrary rectilinear polygons. It is intractable to define certain important constraints, such as non-overlap, between such polygons. Thus, our first step is to decompose each room into a set of rectangles. We then use constraints to bind these rectangles together so they behave as a continuous room.
In general, the decomposition of a rectilinear polygon into rectangles is not unique. In our implementation, we use the maximal decomposition, which is found by constructing a grid over all vertex coordinates and then taking the grid cells which fall inside the polygon as the decomposed rectangles (Figure 10). The maximal decomposition is quick to compute and also has the property that every pair of adjacent rectangles shares a complete edge. This edge-sharing property makes it easy to impose additional constraints to bind the decomposed rectangles of a given room together.
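The maximal decomposition described above can be sketched in a few lines. This is an illustrative version, not our implementation: it assumes the room is a simple rectilinear polygon given as a list of (x, y) vertices, tests each grid cell by its center using ray casting, and returns rectangles as (x, y, width, height) tuples anchored at the corner with the smallest coordinates. The function names are hypothetical.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def maximal_decomposition(polygon):
    """Split a rectilinear polygon into the grid cells that lie inside it."""
    xs = sorted({x for x, _ in polygon})
    ys = sorted({y for _, y in polygon})
    rectangles = []
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            # Cell centers never lie on a polygon edge, since every edge
            # of a rectilinear polygon lies on a grid line.
            if point_in_polygon((x0 + x1) / 2.0, (y0 + y1) / 2.0, polygon):
                rectangles.append((x0, y0, x1 - x0, y1 - y0))
    return rectangles
```

For an L-shaped room, the grid has four cells and three of them fall inside the outline, so the room decomposes into three rectangles, each sharing a complete edge with a neighbor.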
[Figure 10 panels: Original Rooms, Decomposed Rectangles]
A.2 Optimization Variables
In our solver, we express each rectangle in terms of four variables: the two coordinates of its upper-left vertex position, its width, and its height. In addition to the room shapes, we must also represent and optimize for the positions and sizes of portals between rooms (e.g. doors), so that the resulting floor plan is navigable. We describe a portal as a line segment with a centroid position and a radius (i.e. half-length). Each portal is permitted to slide along certain walls of the room which contains it, as the next section will describe.
A.3 Constraints
Nonnegativity
All variables must be nonnegative so that our solution lies in the positive quadrant and no output rectangles or portals have a (nonphysical) negative width or height:
Minimal Room Size
To prevent the optimization from collapsing certain rooms in favor of others, we enforce that each room has a width and height of at least . To do this, we identify a sequence of indices of rectangles that spans the horizontal extent of the room, as well as a sequence of for rectangles that spans its vertical extent. Then:
Nonoverlap
We require the solution to have no overlapping pairs of rectangles . There are four possible relationships to account for: is either to the top, bottom, left, or right side of . Let
represent these relationships respectively. We let the MIQP optimizer select from this set of possible relationships by introducing an auxiliary binary variable
, where if and only if rectangles and have the relationship . This results in the following set of constraints for each pair of rectangles : where is a large constant to ensure that rectangles and do not overlap in direction when (we set in our implementation). The last constraint requires that at least one of the four auxiliary variables has a value of 1.
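To illustrate how a large constant turns the four-way disjunction into linear constraints, the sketch below checks a standard big-M encoding for one pair of rectangles. All names are hypothetical, the value of `BIG_M` is a placeholder rather than the constant used in our implementation, and the y axis is assumed to point downward so that a rectangle (x, y, w, h) spans [x, x + w] by [y, y + h].

```python
from itertools import product

BIG_M = 1000.0  # hypothetical placeholder; must exceed the layout extent

def nonoverlap_constraints_hold(a, b, sigma, big_m=BIG_M):
    """Check the big-M constraints for rectangles a, b given binary `sigma`.

    a, b: (x, y, w, h). sigma: (left, right, top, bottom), each 0 or 1,
    where e.g. left == 1 activates the constraint "a is left of b".
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    left, right, top, bottom = sigma
    return (
        ax + aw <= bx + big_m * (1 - left)        # a left of b
        and bx + bw <= ax + big_m * (1 - right)   # a right of b
        and ay + ah <= by + big_m * (1 - top)     # a above b
        and by + bh <= ay + big_m * (1 - bottom)  # a below b
        and left + right + top + bottom >= 1      # at least one must be active
    )

def can_be_separated(a, b):
    """True if some assignment of the four binaries satisfies all constraints."""
    return any(
        nonoverlap_constraints_hold(a, b, sigma)
        for sigma in product((0, 1), repeat=4)
    )
```

When a binary is 0, the added `big_m` makes its constraint trivially true, so only the selected relationship is actually enforced. In the MIQP, the solver chooses these binaries itself; the brute-force search here only mimics that choice for a single pair.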
Decomposition constraints
For each decomposed room, we introduce the following constraints for all pairs of its rectangles such that is to the left of (the rectangles share a vertical edge):
and the following constraints for all pairs of rectangles such that is to the top of (the rectangles share a horizontal edge):
These constraints bind the rectangles together such that they maintain shared edges.
Portal connection
If two portals are specified as connected, then their positions and halflengths must be equivalent:
Portal sliding
We require that portals stay on the same wall that they are initially defined to be on, and that their position and length do not extend beyond this wall. If a room decomposes to a single rectangle, then the portal can slide along one edge of this rectangle (for example, means the portal lies on the top wall). The sliding constraint thus takes on one of four cases:
In general, a room decomposes into multiple rectangles. Here, we must handle the case where a portal lies on a room wall that is shared by more than one decomposed rectangle. This scenario uses the same form of constraint as above, but requires us to know the indices of the rooms between which the portal can slide. For instance, if a portal slides along the left side of a wall shared by three rectangles where is the topmost rectangle and is the bottommost, then the constraints would be:
A.4 Objective
There may be multiple floor plan configurations which satisfy all the constraints defined above. Within this feasible set, there are certain configurations which are preferable. Primarily, we prefer layouts that change room shapes and portal positions/sizes as little as possible, as such changes will introduce distortion when transferred to the 3D mesh.
Secondarily, we prefer layouts which maximize the number of wall-to-wall adjacencies between rooms, as this results in more plausibly compact/space-efficient layouts and also avoids introducing interior voids in the layout. To keep track of adjacencies, we use a similar formulation to the non-overlap constraint. We add a binary variable for every pair of rectangles to indicate whether the two rectangles should be adjacent, and then we add the following constraints to account for possible adjacency relationships:
where is a binary variable for whether the rectangles are horizontally or vertically adjacent, and is the minimum length of the line segment and must share to be considered adjacent (we use ).
Finally, we minimize the following overall objective function:
where the version of a variable denotes its initial value. The first term penalizes changes in room rectangle shape; the second penalizes changes in portal radius. The third term rewards pairs of adjacent rectangles from different rooms, but only if the rooms containing those two rectangles have already had all rooms marked adjacent to them in the input graph placed into the layout (this is the role of the indicator function ). Finally, the last two terms penalize deviations in all portals’ positions along their respective walls. Here, and return the indices of all vertical and horizontal portals, respectively; , and give the index of the top, bottom, left, and right adjacent rectangle to portal ’s wall, respectively. In our implementation, we use , with all rooms scaled with a ratio of meters to units.
B Qualitative Results for the Navigation Experiment
Figure 11 shows example trajectories of a pretrained DD-PPO [wijmans2020ddppo] agent walking through scenes generated by our methods, as well as an original Matterport3D [Matterport3D] scene. The full videos can be found at DDPPO_portal_stitching.mp4, DDPPO_match_2d.mp4, DDPPO_mp3d.mp4 respectively. In these videos, the left side shows the depth image that the agent is seeing, whereas the right side shows the navigable areas of the scene, with the shortest trajectory to goal visualized in green and the trajectory the agent took visualized in blue.
[Figure 11 rows: Smart Portal Stitching, Match 2D Layout Shape, Matterport3D]
C Walkthrough of a Generated Scene
Figure 12 shows a trajectory of a first-person walkthrough of one of the generated scenes. The original semantic annotations of the Matterport3D meshes were made on meshes reconstructed with a different pipeline, which are consequently of lower visual quality. To produce this trajectory, we manually annotated the higher-quality meshes of the set of rooms contained in a layout generated by the smart portal stitching strategy, and then manually performed a walkthrough. The full video can be found at first_person_walkthrough.mp4.
D Evaluating the Effect of Data Augmentation with Our Data
Directly evaluating the impact of data augmentation with our data is challenging, as it has been shown that navigation agents continue to learn from data after billions of steps [wijmans2020ddppo]. Here, we provide an approximation by evaluating two agents that perform similarly on their respective training sets. We train one agent on 184 floor plans from the Matterport3D dataset, and the other on the same 184 Matterport3D floor plans plus 104 floor plans generated by the Smart Portal Stitching strategy. For the second agent, we construct the training set such that half of the episodes are from Matterport3D, and the other half from the generated data. We train the first agent for 30 million steps. We record the training success rate (about ) and SPL (about ), and then train the second agent until it reaches similar performance, at around 50 million steps. We then evaluate the trained agents on the Gibson validation set. The results are summarized in Table 6. The agent trained on Matterport3D + Smart Portal Stitching outperforms the agent trained on only Matterport3D with respect to both success rate and SPL. We stress that this is only an approximation, and it is possible that the better performance results from other factors. We leave rigorous, full-scale evaluation to future work.
Scene Source  Success Rate  SPL 
Smart Portal Stitching + MP3D Scenes  
MP3D Scenes 