People spend much of their time indoors: in bedrooms, offices, living rooms, gyms and so on. Function, beauty, cost and comfort are key considerations when redecorating an indoor scene. Nowadays, proprietors prefer to see a demonstration of the layout of an indoor scene within a few minutes. Online virtual interior tools are therefore useful for designing indoor spaces: they are faster, cheaper and more flexible than redecorating real-world scenes.
This fast demonstration relies on the automatic layout of indoor furniture and a good graphics engine. Machine-learning researchers use virtual tools to train data-hungry models for automatic layout [2, 5]. These models reduce the time needed to lay out furniture from hours to minutes and thus support fast demonstration.
Generative models of indoor scenes are valuable for the automatic layout of furniture. The problem of indoor scene synthesis has been studied for over a decade. One family of approaches is object-oriented: the objects in the space are represented explicitly [4, 12, 14]. The other family is space-oriented: space is treated as a first-class entity, and each point in space is modeled as occupied or empty.
Recently, deep generative models have been applied to the efficient generation of indoor scenes for auto-layout. These deep models further reduce the time from minutes to seconds and also increase the variety of the generated layouts. Deep generative models directly produce the layout of the furniture given an empty room. However, in the real world the orientation of a room varies: south, north, northwest and so on are all possible. Layouts for real indoor scenes are therefore required to accommodate different orientations (Figure 1).
Therefore, in this paper we propose an adversarial generative model for indoor scenes with rotation. The model produces a furniture layout when the indoor scene is rotated. This adversarial model consists of several modules: rotation modules, two mode modules and double discriminators. The rotation modules are applied in the hidden layers of the generator, the mode modules are applied after the generator output and the ground truth, and the double discriminators enforce the rotation robustness of the indoor scenes.
This paper is organized as follows. Section 2 reviews related work. Section 3 introduces the problem formulation. Section 4 describes the proposed adversarial model. Section 5 introduces the dataset, and Section 6 presents the experiments and a comparison with two baseline generative models, followed by the discussion.
2 Related Work
Our work is related to data-hungry methods for synthesizing indoor scenes through the layout of furniture, either unconditionally or partially conditionally.
2.1 Structured data representation
Since the layout of furniture for indoor scenes is highly structured, representing a scene as a graph is an elegant methodology. In such a graph, semantic relationships are encoded as edges and objects are encoded as nodes. A small dataset of annotated scene hierarchies has been used to learn a grammar for the prediction of hierarchical indoor scenes. Scene graphs have also been generated from images and used in applications including image retrieval and the generation of 2D images from an input scene graph. However, this family of structured representations is limited to small datasets, which is not practical for the automatic layout of furniture in the real world.
2.2 Indoor scene synthesis
Early work proposed Bayesian networks for the synthesis of objects by modeling object co-occurrence and placement statistics [graph-layout-3d]. Graphical models have been employed to model the compatibility between furniture and input sketches of scenes. However, these early methods are mostly limited by scene size: it is hard to produce good-quality layouts for large scenes. With the availability of large scene datasets such as SUNCG, more sophisticated learning methods have been proposed, as described below.
2.3 Image CNN networks
Image-based CNNs have been applied to encode top-down views of input scenes; the encoded scenes are then decoded to predict object categories and locations. A variational auto-encoder has been applied to generate scenes represented as a matrix in which each column represents an object with location and geometry attributes. A semantically enriched image-based representation has been learned from top-down views of indoor scenes, and convolutional object-placement priors have been trained on it. However, this family of image CNN networks does not study rooms with different rotations, whereas rooms in the real world face a variety of directions.
2.4 Graph generative networks
As a significant number of methods have been proposed to model graphs as networks [6, 15], representations of indoor scenes in the form of tree-structured scene graphs have also been studied. GRAINS, which consists of a recursive auto-encoder network for graph generation, is trained to produce different relationships, including surrounding, supporting and so on. Similarly, a graph neural network has been proposed for scene synthesis, in which edges represent the spatial and semantic relationships of objects in a dense graph. Both relationship graphs and their instantiations are generated for the design of indoor scenes; the relationship graph helps to find symbolic objects and high-level patterns.
2.5 CNN generative networks
The layout of indoor scenes has also been explored as a layout-generation problem. Geometric relations among different types of 2D elements of indoor scenes are modeled through the synthesis of layouts, trained with an adversarial network with self-attention modules. A variational autoencoder has been proposed for the generation of stochastic scene layouts given a label prior for each scene. However, this line of layout generation is limited to a single direction of the indoor scene, while real scenes face a variety of directions.
3 Problem Formulation
In this section, the automatic layout of indoor scenes facing a variety of directions is formalized as follows. We are given a set of indoor scenes, where each scene is an empty room with basic elements including walls, doors and windows, together with its corresponding furniture layout. Each layout contains several elements, and each element records the position, size and direction of one furniture item in the scene. In addition, each indoor scene has an overall direction. As shown in Figure 2, four samples illustrate the direction of the indoor scene; the position, size and direction of each furniture item; and the walls, doors and windows of the scene.
A model is then expected to work as follows: given an empty room with walls, windows and doors, together with the direction of the room, the model produces a layout consisting of the position, size and direction of each furniture item.
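As a concrete illustration of this formulation, the per-element representation can be sketched as a simple record type. The class and field names below are hypothetical, since the paper's symbols are not fixed in code:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FurnitureElement:
    """One furniture item in a layout: its position, size and direction."""
    position: Tuple[float, float]  # (x, y) location in the room
    size: Tuple[float, float]      # (width, height) of the item
    direction: float               # orientation of the item in degrees

@dataclass
class IndoorScene:
    """An empty room plus its overall direction and furniture layout."""
    direction: float                # direction of the room in degrees
    layout: List[FurnitureElement]  # furniture placed in the room
```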
In this section, we propose an adversarial model that produces this layout given the direction of each room. The proposed model consists of the following modules: a conditional adversarial module with a generator and a discriminator; a rotation module with several rotation filters; a mode module with two mode filters; and a rotation discriminator module. Together, these modules form the whole model, as shown in Figure 3.
4.1 Conditional Adversarial Module
4.2 Rotation Module
This rotation module consists of several rotation filters. Each filter rotates the hidden representation of the generator according to the rotation of the given room (Figure 4). This module helps the generator produce layouts of a room with different directions. It is formally expressed as follows:
where the rotation of the room conditions the filters, and one rotation filter is applied after each hidden layer of the generator.
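A minimal sketch of one such rotation filter, assuming each filter realizes a quarter-turn rotation of a generator feature map (the exact filter form is not specified in the text):

```python
import numpy as np

def rotation_filter(hidden, theta_deg):
    """Rotate a hidden feature map of shape (channels, H, W) by a
    multiple of 90 degrees, matching the rotation of the room."""
    k = (theta_deg // 90) % 4  # number of counter-clockwise quarter turns
    return np.rot90(hidden, k=k, axes=(1, 2))

# applying the filter four times by 90 degrees returns the original map
h = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
h4 = h
for _ in range(4):
    h4 = rotation_filter(h4, 90)
```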
4.3 Mode Module
This mode module consists of two mode filters. Each filter produces a binary attention map according to the ground truth (Figure 5): positions inside the bounding box of a furniture item are labeled 1, and the remaining positions are labeled 0. One filter is placed in front of the rotation discriminator; the other is applied after the ground-truth layout. This module helps the adversarial model keep the furniture consistent with the ground truth under rotation. The generated layout and the ground truth after these two filters are formulated as follows (Figure 5):
where the filters are conditioned on the direction of the indoor scene.
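A sketch of one mode filter under this description: a binary map over the rendered layout, 1 inside each furniture bounding box and 0 elsewhere. The box coordinates are hypothetical pixel indices:

```python
import numpy as np

def mode_filter(boxes, height, width):
    """Binary attention map: 1 inside each furniture bounding box, 0 elsewhere.

    boxes: iterable of (x0, y0, x1, y1) pixel coordinates, exclusive on x1/y1.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask

# two hypothetical furniture boxes in a 6x6 rendered layout
attn = mode_filter([(1, 1, 3, 3), (4, 0, 6, 2)], height=6, width=6)
```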
4.4 Rotation Discriminator
This module adds an extra discriminator to the adversarial model. The extra discriminator determines whether the generated layout is rotated by the same degree as the ground truth, and whether the number and categories of the furniture in the layout collapse during rotation.
4.5 Training Formulation and Objectives
As introduced above, the model comprises the generator of the conditional adversarial model, its discriminator and the rotation discriminator, together with the rotation filters and the first and second mode filters. The formulation of the proposed model is as follows.
We are given a rendered image with a fixed height and width. Suppose the generator has several levels; the generator is then conditioned on the rotation of the room, with one rotation filter applied after each hidden layer. The first discriminator determines whether the generated layout image is real. The first mode filter is applied in front of the generated layout and transforms it; the second mode filter is applied to the ground-truth layout and transforms it in the same way.
Rotation discriminator network training
To train the first discriminator network, the first discriminator loss is formally written as follows:
where the label is 0 if the sample is drawn from the generator and 1 if the sample is from the ground truth. Here, the generated sample denotes the rendered layout image produced by the generator under the given rotation, and the real sample denotes the rendered ground-truth layout under the same rotation.
Mode discriminator network training
To train the second discriminator network, the second discriminator loss is formally written as follows:
where the label is 0 if the sample is drawn from the generator and 1 if the sample is from the ground truth. Here, the generated sample is drawn from the generator after the application of the first mode filter, and the real sample is the ground truth after the application of the second mode filter.
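Both discriminator losses follow the standard binary cross-entropy form for adversarial training. A sketch, assuming label 1 for ground-truth samples and 0 for generated ones, as is conventional:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy over discriminator outputs in (0, 1):
    real samples are pushed toward 1, generated samples toward 0."""
    eps = 1e-8  # numerical stability for the logarithms
    return float(-(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)).mean())

# a confident, correct discriminator incurs near-zero loss
loss = discriminator_loss(np.array([0.99]), np.array([0.01]))
```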
Rotation Generator Training.
To train the generator network, a conditional loss function is formally set as follows:
where the two terms denote the generation loss with rotation and the adversarial loss, respectively, and two constants balance the multi-task training.
Given the rendered indoor scene with its rotation, the ground truth and the prediction result, the generator loss is written as follows:
The generation loss and the adversarial loss are written as follows:
During training, the adversarial loss is used to fool the discriminator by maximizing the probability that the generated prediction is judged to come from the ground-truth distribution.
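A sketch of the multi-task generator objective described above: a reconstruction term on the rendered layout plus an adversarial term, weighted by the two balancing constants. The L1 reconstruction form and the particular weights are assumptions, since the paper's exact loss is not shown here:

```python
import numpy as np

def generator_loss(pred, target, d_fake, lam_rec=1.0, lam_adv=0.01):
    """Weighted sum of a generation (reconstruction) loss and an
    adversarial loss that tries to fool the discriminator."""
    eps = 1e-8
    rec = np.abs(pred - target).mean()  # L1 generation loss (assumed form)
    adv = -np.log(d_fake + eps).mean()  # push discriminator output toward 1
    return float(lam_rec * rec + lam_adv * adv)

# a perfect prediction that fully fools the discriminator gives ~zero loss
loss = generator_loss(np.zeros((4, 4)), np.zeros((4, 4)), np.array([0.999]))
```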
We provide a database of indoor furniture layouts together with an end-to-end rendered image of each interior layout. These layout data come from designers at the real selling end, where proprietors chose the designs for their properties.
5.1 Interior Layouts
Professional designers work with an industry-level virtual tool to produce a variety of designs, a part of which are sold to proprietors for their interior decoration. We collected these designs at the selling end and provide the interior layouts. Each layout sample has the following representation: the categories of the furniture in the room, the position and direction of each furniture item, the positions of the doors and windows in the room, and the position of each wall segment. Figure 6 illustrates samples of our layouts, which are both adopted from the interior design industry and sold to proprietors. The dataset contains four types of rooms: the bedroom, the bathroom, the study room and the tatami room. The designs of these rooms have been sold to proprietors since last year. Moreover, each design has been revised over several versions, following both the professional designers' knowledge and the personalized suggestions of each proprietor.
In addition, all the designs are rotated in several directions. The position and direction of each furniture item, the positions of the doors and windows in the room, and the position of each wall segment are all rotated accordingly (Figure 7). The total number of layouts therefore grows with the number of rotations.
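Rotating every element of a design can be sketched as a point rotation about the room center. A minimal version for one coordinate, with hypothetical inputs:

```python
import math

def rotate_point(x, y, cx, cy, theta_deg):
    """Rotate (x, y) counter-clockwise by theta_deg degrees about (cx, cy).
    Applied to furniture, door, window and wall coordinates alike."""
    t = math.radians(theta_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))

# a point one unit right of the room center moves one unit above it
x, y = rotate_point(2.0, 1.0, 1.0, 1.0, 90)
```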
5.2 Rendered Layouts
Each layout sample corresponds to rendered layout images, which are the key demonstration of the interior decoration. The rendered images contain several views, and we collect the top-down view as the rendered view (Figure 8). The dataset therefore also contains rendered layouts in the top-down view, each corresponding to a design. The rendered data are produced by an industry-level virtual tool that has already provided millions of rendered layout solutions to proprietors (Figure 8).
In this section, we present qualitative and quantitative results demonstrating the utility of the proposed adversarial model for scene synthesis and compare it to two baselines. Four types of indoor rooms are evaluated: the bedroom, the bathroom, the study room and the tatami room. Samples are randomly split into a training set and a test set. Both the training and test rooms are rotated in several directions. The first baseline is a classical adversarial model that takes a pair of samples, a rendered empty room and its layout, for training; at inference it produces the furniture layout given the rendered empty room. The second baseline is a conditional adversarial model [st-gan] that takes the pair of samples together with the rotation for training; at inference it encodes the direction of the room and the rendered empty room and produces the layout. Similarly, our model encodes the rendered empty room and its direction to produce the layout.
6.1 Evaluation metrics
For the task of interior scene synthesis, we apply three metrics for evaluation. First, we use the average mode accuracy, which measures how accurately the furniture categories of a generated layout match the ground truth. The average mode accuracy is formally expressed as follows:
where the denominator is the total number of furniture categories in the ground-truth dataset and the numerator counts the categories of furniture in the generated layout that match the ground truth. For example, if a furniture item appears in the predicted layout and the ground-truth layout also contains that item, it is counted as a match.
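Under this description, the average mode accuracy can be sketched as a per-category match count against the ground truth. The counting scheme is an assumption consistent with the text:

```python
from collections import Counter

def mode_accuracy(pred_categories, gt_categories):
    """Fraction of ground-truth furniture instances whose category also
    appears in the predicted layout (matched per category, with multiplicity)."""
    pred, gt = Counter(pred_categories), Counter(gt_categories)
    matched = sum(min(pred[c], gt[c]) for c in gt)
    total = sum(gt.values())
    return matched / total if total else 1.0

# a hypothetical bedroom: the prediction recovers 2 of 3 ground-truth items
acc = mode_accuracy(["bed", "wardrobe"], ["bed", "wardrobe", "tv"])
```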
Second, to evaluate the position accuracy of the furniture layout, we apply the classical mAP to measure the positions of the furniture in the predicted layout. Note that a fixed threshold is set for the IoU between the predicted furniture bounding box and the ground-truth bounding box.
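The IoU underlying the mAP computation is the standard intersection-over-union of two axis-aligned boxes; a sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) axis-aligned boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# two partially overlapping furniture boxes
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```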
Third, we apply a direction metric to measure the rotation accuracy of each furniture item in the prediction. At the industry end, the direction of the furniture is also key for interior design in the real world; for example, a TV set should face the inside of the room. The direction metric is formally expressed as:
where the total number of furniture items in the dataset normalizes the comparison between the rotation of each furniture item in the prediction and the rotation of the corresponding furniture item in the ground truth.
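A sketch of this direction metric as an exact-match rate over furniture angles compared modulo 360 degrees; the matching rule is an assumption consistent with the description:

```python
def direction_accuracy(pred_degrees, gt_degrees):
    """Fraction of furniture items whose predicted direction equals the
    ground-truth direction (angles compared modulo 360 degrees)."""
    hits = sum(1 for p, g in zip(pred_degrees, gt_degrees)
               if (p - g) % 360 == 0)
    return hits / len(gt_degrees)

# two of three hypothetical items face the correct direction
acc = direction_accuracy([0, 90, 180], [0, 90, 270])
```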
6.2 Qualitative Comparisons
We compare with the two baseline models for scene synthesis on four types of rooms (Figures 9, 10, 11 and 12). Our model outperforms the baselines in the following aspects. First, during rotation of the indoor room, our model predicts the same furniture categories as the ground-truth layout, while the two baselines lose furniture categories. Second, our model predicts a good position for each furniture item during rotation, while the baselines sometimes predict unsatisfactory positions that strongly contradict the knowledge of professional interior designers. Third, the baselines sometimes fail to give the position and size of the furniture in their predictions, while our model seldom produces this failure.
6.3 Quantitative Comparisons
We also compare with the two baseline models quantitatively. All three metrics are evaluated for the four types of rooms. Table 1 reports the accuracy of the mode, the position and size, and the direction of the furniture in the predicted layouts; our model outperforms the baseline models. Table 1 also presents the comparison for the other types of rooms.
We presented an adversarial model with three modules to predict interior scene synthesis with rotation. We also release an interior layout dataset in which all designs are drawn from professional designers and collected at the selling end.
There are several avenues for future work. First, our method is currently limited to generating layouts for common rooms; the layout of luxury rooms is hard to predict. For example, it is difficult to predict the layout of a luxury bedroom that also contains a bathroom and a cloakroom. Second, our model is limited in its high-level understanding of the interior scene, likely because a structured model such as a graph model has not yet been developed here. Third, the furniture categories for each type of room are limited to a small number. For example, a generated bedroom layout often contains a bed, a TV set and a wardrobe; it cannot support other furniture such as a dressing table, an office desk or a leisure sofa.
- (2013-06) Understanding indoor scenes using 3D geometric phrases.
- (2018-06) ScanComplete: large-scale scene completion and semantic segmentation for 3D scans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- DeLay: robust spatial layout estimation for cluttered indoor scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2012-11) Example-based synthesis of 3D object arrangements. ACM Trans. Graph. 31 (6).
- (2018-06) IQA: visual question answering in interactive environments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Representation learning on graphs: methods and applications. CoRR abs/1709.05584.
- (2017) Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.
- (2018-06) Image generation from scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015-06) Image retrieval using scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019-10) LayoutVAE: stochastic scene layout generation from a label set. In The IEEE International Conference on Computer Vision (ICCV).
- (2019) LayoutGAN: generating graphic layouts with wireframe discriminators. CoRR abs/1901.06767.
- (2019-02) GRAINS: generative recursive autoencoders for indoor scenes. ACM Trans. Graph. 38 (2).
- (2018-06) ST-GAN: spatial transformer generative adversarial networks for image compositing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018-06) Human-centric indoor scene synthesis using stochastic grammar. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61-80.
- (2017-07) Semantic scene completion from a single depth image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019-07) PlanIT: planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Trans. Graph. 38 (4).
- (2018-07) Deep convolutional priors for indoor scene synthesis. ACM Trans. Graph. 37 (4).
- (2013-07) Sketch2Scene: sketch-based co-retrieval and co-placement of 3D models. ACM Trans. Graph. 32 (4).
- (2020-04) Deep generative modeling for scene synthesis via hybrid representations. ACM Trans. Graph. 39 (2).