People spend much of their time indoors, in spaces such as bedrooms, kitchens, living rooms, and study rooms. Online virtual interior design tools help designers produce indoor-redecoration solutions in tens of minutes. In recent years, however, there has been growing demand for producing such solutions in even less time. One of the bottlenecks is automatically processing the indoor layout of furniture. The goal is an assistive model that helps designers produce industry-level indoor-redecoration solutions in a few seconds.
Data-hungry models have been trained for computer vision and robotic navigation [1, 4], and two major families of approaches have emerged in the line of auto-design work. The first family is object-oriented: the objects in the indoor scene and their properties are represented. Another model has been developed to take advantage of both spaces and objects. However, these assistive models are unsatisfactory, since they satisfy only part of the customers' requirements and the resulting auto-layouts all tend towards the same kind.
Motivated by this challenge, we propose an assistant model that produces different types of indoor-scene auto-layout solutions for different groups of customers at the industrial end (see Figure 1). Specifically, we propose a conditional auto-layout assistive model in which the condition is a representation of the preferences or requirements of customers and property owners. Firstly, the proposed model generates a structural representation according to the conditional representation of the different requirements. This conditional generation is produced by a trained conditional model. Secondly, the generated graph and the representation of the condition are mixed through a deep module. Finally, a location module is trained to predict the location of furniture for the different requirements.
In this paper, we first introduce the related work in Section 2. The indoor synthesis pipeline, which consists of three proposed modules, is described in Section 3. In Section 4, we propose a graph representation procedure that extracts the geometric indoor scene into an abstract graph. We also propose a dataset of indoor layouts collected from the designs of professional interior designers at the industrial end. In Section 5, we develop a model for graph generation with conditions, where the condition is based on the preferences of the property owners. In Section 6, we develop a conditional scene instantiation module that produces the location of furniture for different property owners. In Section 7, we evaluate the proposed model on three types of rooms: the tatami room, the balcony, and the kitchen. Finally, we discuss the advantages and the future work of the proposed model in Section 8.
2 Related Work
We discuss related work on indoor furniture layout, including indoor-scene representation, graph generative models, and models for locating furniture.
2.1 Indoor-scene Representation
The study of indoor-scene representation started two decades ago. In the early years, rule-based formulations were developed to generate 3D object layouts for pre-specified sets of objects [11]. Interior design principles were applied through the optimization of a cost function [8], and the object-object relationships in interior scenes were analyzed [12]. Data-driven models were then developed in this field. The earliest data-driven work models object co-occurrences through a Bayesian network [2]. Follow-up work makes use of an undirected factor graph learned from RGB-D images [6]. Other interior scene representations have also been used as model inputs, including RGB-D frames, 2D sketches of the scenes, and natural language text. However, these representations are practical only for a limited range of indoor scenes.
Learning frameworks for the representation of indoor scenes have been developed recently. These learning methods apply a range of techniques, including human-centric probabilistic grammars [9], generative adversarial networks on a matrix representation of the interior scene, recursive neural networks trained on 3D scene hierarchies [7], and convolutional neural networks trained on top-down image representations of interior scenes. However, these learning methods are not able to obtain a high-level representation of the indoor scene. Moreover, they are not practical at the industrial end, as the generated indoor scenes cannot meet the personalized preferences of the property owners.
2.2 Graph Generative Models
Another line of recent work explores learning graph-structured representations with deep neural networks. This work aims to learn generative models of graphs. Early work along this line develops a graph convolutional neural network (GCN) to perform graph classification. Message passing between nodes in a graph [3] has been applied to the graph generation of interior scenes [10]. However, these graph generative models cannot produce graphs under different conditions, and so cannot meet the personalized preferences of property owners.
3 Indoor Synthesis Pipeline
In this section, we tackle the following indoor-scene layout problem: given the architectural specification of a room (e.g., walls, doors, and windows) of a particular type (e.g., kitchen or tatami), and the personalized preferences of a house owner for the indoor decoration, we choose and locate a set of objects to decorate that type of room. We aim to build an assistant model that helps the designer produce a decoration solution quickly through auto-layout of the furniture for a room, as shown in Figure 2. To synthesize a scene from an empty room according to the owner's suggestion, the assistant model should complete the following three steps. Firstly, partial scenes of the rooms are completed. Secondly, a high-level representation of the layout of a room is extracted in the form of a relation graph. Thirdly, the house owner's requirements for the indoor decoration are encoded and adapted in the model. As a consequence, the proposed system generates a relation graph that follows the owner's requirements and then produces a solution to the house layout.
This assistant model starts with the extraction of relation graphs from 3D indoor scenes. The extraction applies geometric rules and a hierarchy rule to build a structural graph of each indoor scene, as shown in Figure 3 and Figure 4. Next, the corpus of extracted graphs and the customers' requirements on the scenes are encoded. The encoded representation is used to learn a generative model of graphs, as shown in Figure 5. The generated graph therefore corresponds to the representation of an indoor scene under a particular condition. Finally, the abstract conditional graph and the representation of a suggestion are encoded and instantiated into a concrete scene, as shown in Figure 6.
Formally, we are given a set of indoor scenes. Each scene consists of an empty room with basic elements including walls, doors and windows, together with the corresponding furniture layout for that room. Each layout is a set of elements, and each element represents a piece of furniture described by four attributes: its position, its size, its direction, and its category label. In addition, each scene carries the requirement from the property owner, which determines the layout type of the room. The representation of each indoor scene is first extracted into a structural representation made up of the graph representation of the scene and the code of the requirement. A generative model is then trained to produce the abstract graph from hidden vectors. Finally, given the generated graph and the requirement code, an instantiation model is learned to convert this abstract representation into the completed scene. Industrial rendering software is then applied to produce an industrial solution for the property owner. The whole assistant system is intended to help interior designers produce the furniture layout in a shorter time at the industrial end.
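As a minimal sketch of the data described above, the following Python types hold one training example. The names `Element` and `Scene`, all field names, and the 2D tuple layout are hypothetical choices made for illustration; the storage format used by the authors is not specified.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """One piece of furniture in a layout (field names are illustrative)."""
    position: Tuple[float, float]   # location of the element in the room
    size: Tuple[float, float]       # footprint of the element
    direction: float                # orientation, assumed here to be an angle in degrees
    category: str                   # category label, e.g. "cabinet"


@dataclass
class Scene:
    """An empty room, the owner's requirement, and the designer's layout."""
    walls: List[Tuple[float, float, float, float]]  # wall segments (x0, y0, x1, y1)
    doors: List[Tuple[float, float]]                # door positions
    windows: List[Tuple[float, float]]              # window positions
    requirement: str                                # e.g. "tea-functional tatami"
    layout: List[Element] = field(default_factory=list)


# A 4 x 3 room with one door, one window and a single placed object.
room = Scene(
    walls=[(0, 0, 4, 0), (4, 0, 4, 3), (4, 3, 0, 3), (0, 3, 0, 0)],
    doors=[(2.0, 0.0)],
    windows=[(4.0, 1.5)],
    requirement="tea-functional tatami",
)
room.layout.append(Element((1.0, 1.0), (2.0, 2.0), 0.0, "tatami"))
```

Keeping the requirement next to the layout mirrors the pairing of each scene with the preference that determines its layout type.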
4 Graph Representation of Scenes
In this section, we define the graph representation of the indoor scenes, and introduce a procedure of automatically extracting the graphs from 3D indoor scenes.
4.1 Dataset
We propose a dataset that collects over fifteen thousand indoor-scene designs from professional interior designers who use industrial design software on a daily basis. These designs cover three common types of rooms: tatami, kitchen and balcony. The data are pre-processed to filter out incomplete room layouts and mislabeled room types, resulting in 5000 kitchens (with 41 unique object categories), 5000 balconies (with 37 unique categories) and 1000 tatami rooms (with 53 unique categories). Each room has a list of properties including its category, the geometry of each object, and the requirements for the layout. The raw data contain no hierarchical or semantic relationships among the objects in a room.
4.2 Graph Definition
The hierarchical and semantic relationships among the objects in a room are encoded in the form of a relation graph. In each graph, each node denotes an object and each edge denotes a spatial or semantic relationship between objects. Each node is labeled with the category label of its object. In particular, the walls, doors, windows and different objects have different category labels. The distance between objects takes one of three labels: near, middle and further. The spatial relationship between objects is represented with these three labels. Similarly, the semantic relationship is represented with four labels: walls-doors, walls-windows, walls-objects and objects-objects.
4.3 Graph Extraction
To convert an indoor scene into the above graph representation, we first label each node: walls, doors, and windows each receive their own node label, and objects of different categories receive distinct labels. The number of edge types follows from the above definition, which gives three types of spatial relationship and four types of semantic relationship. For example, an edge between a wall and a door whose distance is middle has one edge type, while an edge between a wall and an object whose distance is further has another.
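The edge labeling can be sketched as follows. This is only an illustration: the integer ids, the hyphenated label strings, and the assumption that every combination of a semantic label and a distance label forms its own edge type (12 in total) are not taken from the paper.

```python
from itertools import product

# Labels follow the graph definition; the integer encoding is an assumption.
SPATIAL = ("near", "middle", "further")
SEMANTIC = ("wall-door", "wall-window", "wall-object", "object-object")

# Enumerate every (semantic, spatial) pair and give it an integer id.
EDGE_TYPE_ID = {pair: i for i, pair in enumerate(product(SEMANTIC, SPATIAL))}


def semantic_label(kind_a: str, kind_b: str) -> str:
    """Map two node kinds ("wall", "door", "window", "object") to a semantic label."""
    kinds = {kind_a, kind_b}
    if kinds == {"wall", "door"}:
        return "wall-door"
    if kinds == {"wall", "window"}:
        return "wall-window"
    if kinds == {"wall", "object"}:
        return "wall-object"
    if kinds == {"object"}:
        return "object-object"
    raise ValueError(f"no semantic label defined for {kind_a!r} and {kind_b!r}")


def edge_type(kind_a: str, kind_b: str, distance: str) -> int:
    """Return the integer edge type for a pair of node kinds and a distance label."""
    return EDGE_TYPE_ID[(semantic_label(kind_a, kind_b), distance)]


wall_door_middle = edge_type("wall", "door", "middle")
wall_object_further = edge_type("object", "wall", "further")
```

The two calls at the end correspond to the two examples in the text; the argument order of the node kinds does not matter because the lookup works on the unordered pair.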
The extracted graph is dense in terms of the average number of edges per node, and it contains many relationships that are not semantically meaningful. This density may cause problems when learning a generative graph model: it is very likely to confuse a neural network, which finds it hard to recognize the most important structural relationships among so many edges. Therefore, less important edges are pruned from the extracted graph with the following procedure. Edges between two wall nodes are deleted. Edges whose spatial relationship is further are deleted. For each node labeled as a door or window, we keep the links to the closest wall and the closest object. Similarly, for each node labeled as an object, we keep the links to the closest object and the closest door or window. Finally, to ensure graph connectivity, we reconnect the nodes using the same technique as in previous work.
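The pruning rules above can be sketched as follows. This is one reading of the rules under stated assumptions, not the authors' implementation: each node (including each wall) is reduced to a single representative 2D point, "closest" means Euclidean distance between those points, the names `prune`, `Node` and `Edge` are hypothetical, and the final reconnection step is omitted.

```python
import math
from typing import Dict, List, Set, Tuple

Node = Tuple[str, Tuple[float, float]]  # (kind, position); kind is "wall", "door", "window" or "object"
Edge = Tuple[int, int, str]             # (node a, node b, distance label)


def prune(nodes: Dict[int, Node], edges: List[Edge]) -> Set[Tuple[int, int]]:
    """Return the node pairs kept after applying the pruning rules."""
    def dist(a: int, b: int) -> float:
        return math.dist(nodes[a][1], nodes[b][1])

    # Drop wall-wall edges and edges labeled "further"; index the rest by node.
    candidates: Dict[int, List[int]] = {n: [] for n in nodes}
    for a, b, label in edges:
        if label == "further":
            continue
        if nodes[a][0] == "wall" and nodes[b][0] == "wall":
            continue
        candidates[a].append(b)
        candidates[b].append(a)

    def closest(n: int, kinds: Tuple[str, ...]):
        """Closest remaining neighbor of n whose kind is in kinds, or None."""
        pool = [m for m in candidates[n] if nodes[m][0] in kinds]
        return min(pool, key=lambda m: dist(n, m)) if pool else None

    kept: Set[Tuple[int, int]] = set()
    for n, (kind, _) in nodes.items():
        if kind in ("door", "window"):
            targets = [closest(n, ("wall",)), closest(n, ("object",))]
        elif kind == "object":
            targets = [closest(n, ("object",)), closest(n, ("door", "window"))]
        else:
            continue  # walls keep only the edges selected by other nodes
        for m in targets:
            if m is not None:
                kept.add((min(n, m), max(n, m)))
    return kept


example_nodes = {
    0: ("wall", (0.0, 0.0)), 1: ("wall", (4.0, 0.0)), 2: ("door", (1.0, 0.0)),
    3: ("object", (1.0, 1.0)), 4: ("object", (3.0, 2.0)),
}
example_edges = [
    (0, 1, "middle"), (0, 2, "near"), (1, 2, "middle"), (2, 3, "near"),
    (2, 4, "middle"), (3, 4, "middle"), (0, 4, "further"),
]
kept_pairs = prune(example_nodes, example_edges)
```

In this toy room the wall-wall edge and the edge labeled further disappear, and the door keeps only its nearest wall and nearest object.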
5 Deep Graph Generation with Conditions
Once the graphs are extracted, the indoor scenes and the corresponding customers' requirements are available as graph representations and requirement codes. We focus on the problem of conditional structure generation for the completed indoor scene.
We are given a set of graphs, where each graph is the extracted representation of an indoor scene and corresponds to the structure described by its set of nodes and its set of edges. The semantic attribute of each graph is denoted by a code. As introduced above, this code describes the requirements of the corresponding property owner and serves as the context of the graph.
A model is trained on a set of graphs with conditions and is then used to generate more graphs that mimic the structure of those in the training set. We apply a conditional graph generation model that uses three techniques to generate a graph for different conditions. Firstly, latent space conjugation is applied to convert node-level encodings into a permutation-invariant graph-level encoding. This conjugation allows learning on arbitrary numbers of graphs and generating graphs of variable sizes. Formally, a given graph is regarded as a plain network with an adjacency matrix, and its node features are generated as the standard spectral embedding of a fixed dimension based on that adjacency matrix. A stochastic latent variable is then introduced for each node and regarded as the embedding of that node. A single distribution is enforced to model all of these node-level latent variables. The encoder is a two-layer GCN model. Two GCN heads compute the matrices of mean and standard deviation vectors, and they share the first-layer parameters; the row with a given index in each matrix belongs to the node with that index. The GCN operates on the symmetrically normalized adjacency matrix of the graph, which is obtained from the adjacency matrix and its degree matrix, whose diagonal entries are the row sums of the adjacency matrix.
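The encoder described above can be sketched in plain Python. This is an illustrative sketch under assumptions: the helper names, the ReLU between the two layers, the absence of self-loops, and the use of a simple row average to obtain the graph-level encoding are choices made here, and the weights are arbitrary rather than trained.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Symmetric normalization D^{-1/2} A D^{-1/2}, with D_ii = sum_j A_ij."""
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def gcn_encoder(adj: Matrix, feats: Matrix, w0: Matrix,
                w_mu: Matrix, w_sigma: Matrix) -> Tuple[Matrix, Matrix]:
    """Two-layer GCN with a shared first layer and separate mean / std heads."""
    a_hat = normalize_adjacency(adj)
    hidden = relu(matmul(a_hat, matmul(feats, w0)))   # shared first layer
    mu = matmul(a_hat, matmul(hidden, w_mu))          # one mean vector per node (row)
    sigma = matmul(a_hat, matmul(hidden, w_sigma))    # one std vector per node (row)
    return mu, sigma


def graph_level(rows: Matrix) -> List[float]:
    """Average the per-node rows; the result does not depend on node ordering."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


# Path graph on three nodes with arbitrary two-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w_mu = [[1.0, 0.0], [0.0, 1.0]]
w_sigma = [[0.2, 0.0], [0.0, 0.2]]
mu, sigma = gcn_encoder(adj, feats, w0, w_mu, w_sigma)
z_bar = graph_level(mu)
```

Averaging over rows is what makes the graph-level code insensitive to how the nodes are numbered, which is the property the conjugation step relies on.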
After a desirable number of node embeddings are sampled, a few fully connected neural network (FNN) layers are applied to them to improve the capability of the graph decoder, before the logistic sigmoid function is computed for link prediction. The model is optimized by minimizing the negative variational lower bound, which is the sum of a link reconstruction loss and a prior loss based on the Kullback-Leibler divergence towards the Gaussian prior. The model now consists of a GCN-based graph encoder and an FNN-based graph decoder/generator.
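The decoder and the two loss terms can be sketched as follows. The sketch assumes an inner-product decoder followed by the logistic sigmoid, a binary cross-entropy reconstruction term averaged over all node pairs, and a standard Gaussian prior with diagonal covariance; the FNN layers applied before the inner product are left out, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def link_probabilities(z: Matrix) -> Matrix:
    """Probability of a link between nodes i and j from their embeddings."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in z] for zi in z]


def reconstruction_loss(adj: Matrix, probs: Matrix, eps: float = 1e-12) -> float:
    """Binary cross-entropy between the adjacency matrix and predicted links."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            a, p = adj[i][j], probs[i][j]
            total -= a * math.log(p + eps) + (1 - a) * math.log(1 - p + eps)
    return total / (n * n)


def kl_to_standard_normal(mu: List[float], sigma: List[float]) -> float:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a single latent vector."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def negative_lower_bound(adj: Matrix, z: Matrix,
                         mu: List[float], sigma: List[float]) -> float:
    """Sum of the link reconstruction loss and the prior loss."""
    return reconstruction_loss(adj, link_probabilities(z)) + kl_to_standard_normal(mu, sigma)


# With all-zero embeddings every link probability is 0.5.
toy_adj = [[0, 1], [1, 0]]
toy_z = [[0.0, 0.0], [0.0, 0.0]]
toy_loss = negative_lower_bound(toy_adj, toy_z, mu=[0.0, 0.0], sigma=[1.0, 1.0])
```

In the toy case the prior term vanishes because the latent distribution already equals the prior, so the total loss is just the cross-entropy of an uninformed decoder.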
GCN is further leveraged to devise a permutation-invariant graph discriminator. This discriminator enforces the intrinsic structural similarity between the real and the generated graphs under arbitrary node ordering. In particular, the discriminator is constructed as a two-layer GCN followed by a two-layer FNN. It is trained together with the above encoder and generator by following the GAN loss of a two-player minimax game, in which the spectral embedding provides the node features for the GCN of the discriminator.
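For reference, the standard two-player minimax objective of a GAN has the following form. The symbols are assumptions introduced here for illustration: $A$ is the adjacency matrix of a real graph, $Z$ the sampled latent variables, $G$ the generator (decoder), $D$ the discriminator, and $q$ the encoder distribution; this is the generic GAN objective and may differ in detail from the loss used by the authors.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{A \sim p_{\mathrm{data}}}\bigl[\log D(A)\bigr]
+ \mathbb{E}_{Z \sim q(Z \mid X, A)}\bigl[\log\bigl(1 - D(G(Z))\bigr)\bigr]
```

The discriminator tries to score real graphs high and generated graphs low, while the generator is updated to make its graphs indistinguishable from real ones.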
6 Scene Instantiation
In this section, we describe the procedure for taking a relationship graph and instantiating it into an actual indoor scene. During graph extraction, the edge pruning step is likely to leave the graph incomplete. Moreover, the graph representation of the scene is abstract: it does not contain enough relationship edges to uniquely determine the spatial positions and orientations of all object nodes. The scene instantiation is also required to follow the preferences of the property owner.
Formally, a procedure is required for sampling from a conditional probability distribution over scenes that corresponds to the property owner's requirements. This distribution is defined for a scene given a graph, with its vertices and edges, and the requirements. It involves a predicate function that indicates whether the relationship implied by an edge is satisfied in the scene, evaluated for a sample of the requirements.
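One plausible way to write such a distribution is given below, purely as a sketch. All symbols are assumptions introduced here: $S$ is a scene, $G = (V, E)$ the graph with vertices $V$ and edges $E$, $t$ a sample of the requirements, and $\mathbb{1}(e, S)$ the predicate that equals one when the relationship implied by edge $e$ holds in $S$. The factorization into a requirement-conditioned scene prior and a product of edge predicates is an assumption, not the equation from the paper.

```latex
p\bigl(S \mid G = (V, E),\, t\bigr) \;\propto\; p(S \mid t)\,\prod_{e \in E} \mathbb{1}(e, S)
```

Under this form, any scene that violates a single edge relationship receives zero probability, and the remaining scenes are weighted by how well they fit the requirements.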
6.1 Object Instantiation Order
Before the locations of objects in the room are predicted, the order in which the objects are instantiated must be set. The algorithm for determining this order follows logic similar to that of previous work. It uses the structure of the graph, along with statistics about the typical sizes of nodes of different categories. For example, when instantiating a tatami room, the order is set as tatami, work-desk, and cabinet.
6.2 Neural-guided Object Instantiation
After the order of instantiation is determined and an object of a given category is selected to be added to the scene, the complete layout prediction for this object is produced following the requirements of the property owner. This prediction includes the location, orientation and physical dimensions of the object.
Ideally, a generator function would produce output values with probability proportional to the true conditional distribution for the scene category. As explored in previous work on iterative object insertion, the location, orientation and dimensions are sampled iteratively, conditioned on the corresponding requirements.
6.2.1 Structural representation of requirements
The requirements from the property owner are encoded as a vector in the graph generation phase. For example, if balconies fall into two categories, leisure balcony and living balcony, the requirement is encoded as one of two one-hot vectors, one per category. However, this one-hot vector loses much information in the conditional scene instantiation phase. Therefore, we develop a mixture embedding module that combines the embedding of the global graph representation, the local graph representation, and the edge type representation.
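The one-hot encoding of a requirement can be sketched as follows. The function name `one_hot` and the ordering of the categories are assumptions; which balcony type maps to which vector is not stated in the text.

```python
from typing import List, Sequence


def one_hot(label: str, categories: Sequence[str]) -> List[int]:
    """Encode a requirement label as a one-hot vector over the known categories."""
    if label not in categories:
        raise ValueError(f"unknown requirement: {label!r}")
    return [1 if c == label else 0 for c in categories]


# The two balcony categories named in the text, in an assumed order.
BALCONY_TYPES = ("leisure balcony", "living balcony")

leisure_code = one_hot("leisure balcony", BALCONY_TYPES)
living_code = one_hot("living balcony", BALCONY_TYPES)
```

The encoding only records which category was requested and carries no information about the graph itself, which is why a richer mixture embedding is introduced for the instantiation phase.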
We first employ a GCN to produce the embedding of the global representation of the generated graph. Then, we employ the GCN together with the instantiation order set above to produce the embedding of the local representation of the generated graph. Specifically, we apply the order to obtain the vector of the adjacency matrix of the selected objects, from which the embedding of the local graph representation is obtained. The embedding of the edge type is obtained with a method similar to that of the previous model. Note that this embedding follows a different rule from the one introduced above. Finally, we apply a two-layer FCN to combine these three types of embedding.
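The final combination step can be sketched as below. The sketch assumes that the three embeddings are concatenated before the two layers and that a ReLU separates the layers; the function names, dimensions and weights are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mix_embeddings(global_emb: Vector, local_emb: Vector, edge_emb: Vector,
                   w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Concatenate the three embeddings and pass them through two dense layers."""
    x = global_emb + local_emb + edge_emb          # list concatenation
    hidden = [max(0.0, v) for v in dense(x, w1, b1)]  # ReLU between the layers
    return dense(hidden, w2, b2)


# Toy one-dimensional embeddings and hand-picked weights.
mixed = mix_embeddings(
    global_emb=[1.0], local_emb=[2.0], edge_emb=[3.0],
    w1=[[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]], b1=[0.0, 0.0],
    w2=[[0.5, 2.0]], b2=[1.0],
)
```

In the toy call the first hidden unit sums the three inputs and the second is clipped to zero by the ReLU, so the output is determined by the first unit and the output bias.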
6.2.2 Conditional instantiation
After the structural representation of the suggestion is obtained, we train a model that is conditioned on the requirements. This model receives the generated graph and the code of the requirements as inputs, and then predicts the location, the orientation, and the size of the selected object. We train the embedding module and the instantiation module together. Here the instantiation module is similar to the instantiation network in previous work.
7 Evaluation
In this section, we show both qualitative and quantitative results that demonstrate the utility of the proposed assistant model for the auto-layout of indoor scenes under property owners' personalized requirements. Generated examples are shown in Figure 7.
7.1 Generation Accuracy
The accuracy of graph generation for each type of room is defined in terms of the number of generated graphs for each label and the number of graph categories.
We measure the accuracy of the generated graphs for three types of rooms. For the evaluation of the tatami room, we collect the four types that are most popular among property owners in our sold interior decoration solutions: the sleeping-functional tatami, the storage-functional tatami, the tea-functional tatami, and the working-functional tatami. Similarly, three types of balconies are chosen: the leisure-functional balcony, the washing-functional balcony, and the storage-functional balcony. There are two types of kitchen: the classical-functional kitchen and the multi-functional kitchen.
| Averaged accuracy | For sleep | For tea | For storage | For work |

| Averaged accuracy | For leisure | For wash | For storage |

| Averaged accuracy | For classical | For mixture |
7.2 Perceptual Study
In addition to the quantitative evaluation of labeled graph generation, a two-alternative forced-choice (2AFC) perceptual study is conducted to compare images of the generated scenes with the corresponding scenes from the sold industrial solutions. The generated scenes are rendered using industrial rendering software. We remark that this rendering process differs from that of previous work, which renders objects only in solid colors. Participants are shown two top-view scene images side by side and are asked to pick the more plausible one. For each comparison and each room type, professional interior designers were recruited as the participants. As shown in Table 4, the generated scenes are similar to the scenes from the sold solutions.
Another 2AFC perceptual study is conducted to compare images of the generated scenes with the corresponding scenes generated by state-of-the-art models. Professional interior designers were again recruited as the participants. As Table 4 shows, the scenes generated by the proposed system outperform those of state-of-the-art models such as PlanIT [10] and Grains [7]. Our model is able to generate indoor scenes for a variety of function categories within a particular room type, while the baseline models are not able to produce such layouts. We show an example of the generated tatami room with four functionalities in Figure 1.
8 Discussion
In this paper, we present an assistive method for interior decoration. The method consists of the extraction of the abstract graph, conditional graph generation, and conditional scene instantiation. The proposed system supports interior designers in the industrial process by producing decoration solutions more quickly. In particular, the system produces auto-layouts of objects that are plausible according to the choices of professional designers. In addition, the proposed system is shown to meet the personalized preferences of property owners.
There are many challenges in automatic interior decoration. The proposed method supports the auto-layout of objects in interior scenes only to the extent that the global layout follows the preferences of the property owners. Personalized preferences on the type, color, and brand of each object are not supported. In addition, the proposed method produces good results when the shape of the room is close to regular, whereas rooms in the real world come in a wide variety of shapes. It is worthwhile to explore smarter systems to address these challenges.
References
[1] Dai, A., Ritchie, D., Bokeloh, M., Reed, S., Sturm, J., Nießner, M.: Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
[2] Fisher, M., Ritchie, D., Savva, M., Funkhouser, T., Hanrahan, P.: Example-based synthesis of 3d object arrangements. ACM Trans. Graph. 31(6) (Nov 2012). https://doi.org/10.1145/2366145.2366154
[3] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR (2017)
[4] Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A.: Iqa: Visual question answering in interactive environments. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
[5] Jiang, B., Zhang, Z., Lin, D., Tang, J., Luo, B.: Semi-supervised learning with graph learning-convolutional networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11305–11312 (2019)
[6] Kermani, Z., Liao, Z., Tan, P., Zhang, H.: Learning 3d scene synthesis from annotated rgb-d images. Computer Graphics Forum 35, 197–206 (08 2016). https://doi.org/10.1111/cgf.12976
[7] Li, M., Patil, A.G., Xu, K., Chaudhuri, S., Khan, O., Shamir, A., Tu, C., Chen, B., Cohen-Or, D., Zhang, H.: Grains: Generative recursive autoencoders for indoor scenes. ACM Trans. Graph. 38(2) (Feb 2019). https://doi.org/10.1145/3303766
[8] Merrell, P., Schkufza, E., Li, Z., Agrawala, M., Koltun, V.: Interactive furniture layout using interior design guidelines. SIGGRAPH 2011 (Aug 2011), http://graphics.berkeley.edu/papers/Merrell-IFL-2011-08/, to appear
[9] Qi, S., Zhu, Y., Huang, S., Jiang, C., Zhu, S.C.: Human-centric indoor scene synthesis using stochastic grammar. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
[10] Wang, K., Lin, Y.A., Weissmann, B., Savva, M., Chang, A.X., Ritchie, D.: Planit: Planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Trans. Graph. 38(4) (Jul 2019). https://doi.org/10.1145/3306346.3322941
[11] Xu, K., et al.: Constraint-based automatic placement for scene composition. In: Graphics Interface. pp. 25–34 (2002)
[12] Yu, L.F., Yeung, S.K., Tang, C.K., Terzopoulos, D., Chan, T.F., Osher, S.J.: Make it home: Automatic optimization of furniture arrangement. ACM Trans. Graph. 30(4) (Jul 2011). https://doi.org/10.1145/2010324.1964981