Towards Adversarial Planning for Indoor Scenes with Rotation

06/24/2020 ∙ by Xinhan Di, et al. ∙ IBM ∙ Tencent QQ

In this paper, we propose an adversarial model for producing furniture layouts for interior scene synthesis when the interior room is rotated. The proposed model combines a conditional adversarial network, a rotation module, a mode module, and a rotation discriminator module. Compared with prior work on scene synthesis, our three proposed modules enhance the ability of auto-layout generation and reduce mode collapse during rotation of the interior room. We provide an interior layout dataset that contains 14400 designs from professional designers, with rotation. In our experiments, we compare the quality of the layouts with two baselines. The numerical results demonstrate that the proposed model provides higher-quality layouts for four types of rooms: the bedroom, the bathroom, the study room, and the tatami room.


1 Introduction

People spend much of their time indoors, in bedrooms, offices, living rooms, gyms, and so on. Function, beauty, cost, and comfort are key considerations when redecorating indoor scenes. Nowadays, proprietors expect to see a demonstration of an indoor layout within minutes. Online virtual interior tools are therefore useful for helping people design indoor spaces: they are faster, cheaper, and more flexible than redecorating a real-world scene.

This fast demonstration relies on the automatic layout of indoor furniture and a good graphics engine. Machine learning researchers use virtual tools to train data-hungry models for auto layout [2, 5]. These models reduce the time needed to lay out furniture from hours to minutes and make such fast demonstrations possible.

Generative models of indoor scenes are valuable for the automatic layout of furniture, and indoor scene synthesis has been studied over the last decade. One family of approaches is object-oriented: the objects in the space are represented explicitly [4, 12, 14]. The other family is space-oriented: space is treated as a first-class entity, and each point in space is modeled in terms of its occupancy [18].

More recently, deep generative models have been used for the efficient generation of indoor scenes for auto layout. These deep models further reduce the time from minutes to seconds and increase the variety of the generated layouts. They directly produce a furniture layout given an empty room. However, in the real world rooms face diverse directions: south, north, northwest, and so on. Layouts for real indoor scenes are required to accommodate these different directions (Figure 1).

Therefore, in this paper we propose an adversarial generative model for indoor scenes with rotation. The model produces a furniture layout when the indoor scene is rotated. It consists of several modules, including rotation modules, two mode modules, and two discriminators. The rotation modules are applied in the hidden layers of the generator, the mode modules are applied to the generated output and the ground truth, and the two discriminators provide robustness to rotation of the indoor scene.

This paper is organized as follows. Section 2 reviews related work, Section 3 introduces the problem formulation, and Section 4 describes the proposed adversarial model. Section 5 presents the dataset, Section 6 reports the experiments and comparisons with two baseline generative models, and Section 7 concludes with a discussion.

2 Related Work

Our work is related to data-hungry methods for synthesizing indoor scenes through furniture layout, either unconditionally or partially conditionally.

2.1 Structured data representation

Since furniture layouts of indoor scenes are highly structured, representing scenes as graphs is an elegant methodology. In such a graph, semantic relationships are encoded as edges and objects as nodes. A small dataset of annotated scene hierarchies has been learned as a grammar for the prediction of hierarchical indoor scenes [18]. Scene graphs have also been generated from images and applied to tasks including image retrieval [9] and the generation of 2D images from an input scene graph [8]. However, this family of structured representations is limited to small datasets and is not practical for the automatic layout of furniture in the real world.

2.2 Indoor scene synthesis

Early work in scene modeling applied kernels and graph walks to retrieve objects from a database [1, 3]. Bayesian networks were then proposed to synthesize objects by modeling object co-occurrence and placement statistics [4]. Graphical models have also been employed to model the compatibility between furniture and input sketches of scenes [19]. However, these early methods are mostly limited by scene size, and it is hard to produce good-quality layouts for large scenes. With the availability of large scene datasets such as SUNCG [16], more sophisticated learning methods have been proposed, as described below.

2.3 Image CNN networks

Image-based CNNs have been applied to encode top-down views of input scenes, which are then decoded to predict object categories and locations [18]. A variational auto-encoder has been applied to generate scenes represented as a matrix in which each column corresponds to an object with location and geometry attributes [20]. A semantically enriched image-based representation has been learned from top-down views of indoor scenes, and convolutional object placement priors have been trained [18]. However, this family of image-based CNNs does not study rooms under different rotations, whereas rooms in the real world face a variety of directions.

2.4 Graph generative networks

Since a significant number of methods have been proposed for modeling graphs with neural networks [6, 15], representing indoor scenes as tree-structured scene graphs has also been studied. GRAINS [12] trains a recursive auto-encoder network for graph generation and aims to produce different relationships, including surrounding and supporting. Similarly, a graph neural network has been proposed for scene synthesis in which edges represent spatial and semantic relationships between objects [18] in a dense graph. Both relationship graphs and their instantiations are generated for the design of indoor scenes; the relationship graph helps to find symbolic objects and high-level patterns [17].

2.5 CNN generative networks

The layout of indoor scenes has also been explored as a layout generation problem. Geometric relations between different types of 2D elements of indoor scenes are modeled through layout synthesis, trained with an adversarial network equipped with self-attention modules [11]. A variational autoencoder has been proposed for the generation of stochastic scene layouts given a label prior for each scene [10]. However, this layout generation is limited to a single direction of the indoor scene, whereas real scenes face a variety of directions.

Figure 2: Rooms facing four different directions with their furniture layouts. The position and direction of each piece of furniture, wall, door, and window are represented.

3 Problem Formulation

In this section, the auto layout of indoor scenes facing a variety of directions is formalized as follows. We are given a set of indoor scenes $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of scenes, $x_i$ is an empty indoor scene with basic elements including walls, doors, and windows, and $y_i$ is the corresponding furniture layout for $x_i$. Each $y_i$ contains several elements $e_j = (p_j, s_j, d_j)$, where $p_j$ is the position of the element, $s_j$ is its size, and $d_j$ is its direction; each element $e_j$ represents a piece of furniture in the indoor scene $x_i$. Besides, $\theta_i$ is the direction of the indoor scene $x_i$. As shown in Figure 2, four samples illustrate the direction of the indoor scene and the position, size, and direction of each piece of furniture, as well as the walls, doors, and windows in the scene.

Then, a model $f$ is expected to work as $f(x_i, \theta_i) = y_i$. That is, given an empty room $x_i$ with walls, windows, and doors, and the direction $\theta_i$ of the room, the model produces a layout $y_i$ including the position, size, and direction of each piece of furniture.
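To make this representation concrete, the following minimal sketch shows one possible in-memory encoding of a scene and its layout; the class and field names are hypothetical and only mirror the elements defined above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FurnitureElement:
    """One furniture element e_j = (p_j, s_j, d_j) plus its category."""
    category: str                   # e.g. "bed", "wardrobe" (hypothetical labels)
    position: Tuple[float, float]   # p_j: (x, y) centre in room coordinates
    size: Tuple[float, float]       # s_j: (width, depth) of the bounding box
    direction: float                # d_j: orientation angle in degrees

@dataclass
class IndoorScene:
    """An empty room x_i (walls, doors, windows), its direction, and layout y_i."""
    walls: List[Tuple[float, float, float, float]]    # wall segments (x0, y0, x1, y1)
    doors: List[Tuple[float, float, float, float]]    # door boxes
    windows: List[Tuple[float, float, float, float]]  # window boxes
    direction: float                                  # theta_i: room direction in degrees
    layout: List[FurnitureElement] = field(default_factory=list)  # y_i
```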

Figure 3: The overall model architecture and the proposed modules, including the rotation module, the mode module, and the rotation discriminator module.

4 Methods

In this section we propose an adversarial model that produces the layout given the direction of each room. The proposed model consists of the following modules: a conditional adversarial module [13] with a generator and a discriminator, a rotation module with several rotation filters, a mode module with two mode filters, and a rotation discriminator module. Together these modules form the whole model, as shown in Figure 3.

4.1 Conditional Adversarial Module

This module is a conditional adversarial model [13]: the generator takes as input a rendered image of an empty room, the condition part encodes the direction of the room as a vector, and the discriminator determines whether the generated layout is real (Figure 3).
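As a minimal sketch of such a conditional generator, assuming a pix2pix-style encoder-decoder over the rendered room image and a small set of discrete room directions; the layer widths, embedding size, and direction encoding are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Encoder-decoder over the rendered empty-room image, conditioned on an
    embedding of the room direction. Layer widths are illustrative only."""
    def __init__(self, num_directions: int = 4, cond_dim: int = 16):
        super().__init__()
        self.direction_embed = nn.Embedding(num_directions, cond_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128 + cond_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, empty_room: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        h = self.encoder(empty_room)                      # B x 128 x H/4 x W/4
        cond = self.direction_embed(direction)            # B x cond_dim
        cond = cond[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        return self.decoder(torch.cat([h, cond], dim=1))  # rendered layout image
```

The discriminator would receive the same direction embedding together with either the generated or the ground-truth layout image.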

Figure 4: The rotation module consists of several rotation filters. Samples of four types of rooms are shown with the rotation filters applied in the hidden layers.

4.2 Rotation Module

This rotation module consists of several rotation filters. Each filter rotates the hidden representation of the generator according to the rotation of the given room (Figure 4). This module helps the generator produce layouts for rooms facing different directions. It is formally expressed as follows:

$\tilde{h}_{l} = R_{l}(h_{l}, \theta), \quad l = 1, \dots, L$   (1)

where $\theta$ is the rotation of a room, $h_{l}$ are the hidden layers, and $R_{l}$ are the rotation filters applied after each hidden layer.
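Below is a sketch of one rotation filter, assuming the room rotation is restricted to multiples of 90 degrees and the filter acts by rotating the spatial dimensions of the hidden feature maps; the paper does not specify the filter's internals, so this is an illustrative reading of Eq. (1).

```python
import torch

def rotation_filter(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Rotate each sample's hidden feature map to match its room rotation.

    hidden:    B x C x H x W feature maps from one generator layer (assumed square).
    direction: B integer labels, one per sample, giving the number of 90-degree turns;
               the label set is an assumption, the paper does not list it.
    """
    rotated = [torch.rot90(h, k=int(d), dims=(-2, -1))
               for h, d in zip(hidden, direction.tolist())]
    return torch.stack(rotated, dim=0)
```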

Figure 5: Mode samples for the corresponding generated layout and the ground truth.

4.3 Mode Module

This mode module consists of two mode filters. Each filter produces a binary attention map according to the ground truth (Figure 5): positions inside the bounding box of a piece of furniture are labeled 1, and the remaining positions are labeled 0. One filter is placed in front of the rotation discriminator, and the other is applied to the ground-truth layout. This module helps the adversarial model keep the same furniture as the ground truth under rotation. The generated layout and the ground truth after applying these two filters are formulated as follows (Figure 5):

$\hat{y}_{m} = M_{1}\big(G(x, \theta)\big)$   (2)
$y_{m} = M_{2}(y, \theta)$   (3)

where $\theta$ is the direction of the indoor scene, $G(x, \theta)$ is the generated layout for the empty room $x$, and $y$ is the ground-truth layout.
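A minimal sketch of a mode filter, under the assumption that it simply multiplies the rendered layout by the binary attention map built from the ground-truth furniture boxes; the filter's exact form is not given in the text, and the box format here is hypothetical.

```python
import torch

def mode_filter(layout: torch.Tensor, boxes_per_sample: list) -> torch.Tensor:
    """Mask a rendered layout with a binary attention map from ground-truth boxes.

    layout:            B x C x H x W rendered layout (generated or ground truth).
    boxes_per_sample:  per-sample lists of (x0, y0, x1, y1) pixel boxes; positions
                       inside a furniture box are labelled 1, the rest 0.
    """
    mask = torch.zeros_like(layout[:, :1])            # B x 1 x H x W attention map
    for i, boxes in enumerate(boxes_per_sample):
        for (x0, y0, x1, y1) in boxes:
            mask[i, :, y0:y1, x0:x1] = 1.0
    return layout * mask                              # zero out positions outside boxes
```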

4.4 Rotation Discriminator

This module adds an extra discriminator to the adversarial model. This extra discriminator determines whether the generated layout is rotated by the same degree as the ground truth, and whether the number and categories of the furniture in the layout collapse during rotation.

4.5 Training Formulation and Objectives

As introduced above, let $G$ denote the generator of the conditional adversarial model, $D$ its discriminator, and $D_{r}$ the rotation discriminator. Similarly, let $R_{l}$ denote the rotation filters, $M_{1}$ the first mode filter, and $M_{2}$ the second mode filter. The formulation of the proposed model is then as follows.

Given a rendered image $x$ of size $H \times W$, where $H$ and $W$ denote the height and width of the rendered image, the adversarial network model is denoted as $(G, D, D_{r})$. Suppose the generator has $L$ layers; then the generator with the rotation filter applied to each hidden layer is formulated as $\tilde{h}_{l} = R_{l}(h_{l}, \theta)$ for $l = 1, \dots, L$, where $\theta$ is the rotation of a room, $h_{l}$ are the hidden layers, and $R_{l}$ are the rotation filters applied after each hidden layer. The first discriminator $D$ determines whether the generated layout image is real. The first mode filter $M_{1}$ is applied to the generated layout $G(x, \theta)$, transforming it into $\hat{y}_{m} = M_{1}(G(x, \theta))$. The second mode filter $M_{2}$ is applied to the ground-truth layout $y$, transforming it into $y_{m} = M_{2}(y, \theta)$.

Rotation discriminator network training

To train the first discriminator network $D$, the first discriminator loss is formally written as follows:

$\mathcal{L}_{D} = -\,\mathbb{E}\big[\, t \log D(s) + (1 - t)\log\big(1 - D(s)\big) \big]$   (4)

where $t = 0$ if the sample $s$ is drawn from the generator and $t = 1$ if the sample is from the ground truth. Here, $s = G(x, \theta)$ denotes the rendered layout image generated by the generator with rotation $\theta$, and $s = y_{\theta}$ denotes the rendered ground-truth layout with rotation $\theta$.
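The following is a hedged sketch of this objective as the standard binary cross-entropy discriminator loss implied by the label definition above; the exact form of Eq. (4) is reconstructed rather than quoted, and the same form would apply to the mode discriminator of Eq. (5) with the mode-filtered layouts as inputs.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy discriminator loss: label 1 for ground-truth layouts
    (t = 1) and label 0 for generated layouts (t = 0)."""
    real_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss
```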

Mode discriminator network training

To train the second discriminator network $D_{r}$, the second discriminator loss is formally written as follows:

$\mathcal{L}_{D_{r}} = -\,\mathbb{E}\big[\, t \log D_{r}(s) + (1 - t)\log\big(1 - D_{r}(s)\big) \big]$   (5)

where $t = 0$ if the sample $s$ is drawn from the generator and $t = 1$ if the sample is from the ground truth. Here, $s = M_{1}\big(G(x, \theta)\big)$ is drawn from the generator after applying the filter $M_{1}$, and $s = M_{2}(y, \theta)$ is drawn from the ground truth after applying the filter $M_{2}$.

Rotation generator training

To train the generator network, a conditional loss function is formally set as follows:

$\mathcal{L}_{G} = \lambda_{1}\,\mathcal{L}_{gen} + \lambda_{2}\,\mathcal{L}_{adv}$   (6)

where $\mathcal{L}_{gen}$ and $\mathcal{L}_{adv}$ denote the generation loss with rotation and the adversarial loss, respectively, and $\lambda_{1}$ and $\lambda_{2}$ are two constants for balancing the multi-task training.

Given the rendered indoor scene $x$ and its rotation $\theta$, the ground truth $y$ and the prediction $\hat{y} = G(x, \theta)$, the generator loss is written as follows:

$\mathcal{L}_{G}(x, \theta, y) = \lambda_{1}\,\mathcal{L}_{gen}(\hat{y}, y) + \lambda_{2}\,\mathcal{L}_{adv}(\hat{y})$   (7)

and $\mathcal{L}_{gen}$ and $\mathcal{L}_{adv}$ are written as follows:

$\mathcal{L}_{gen}(\hat{y}, y) = \lVert \hat{y} - y \rVert$   (8)
$\mathcal{L}_{adv}(\hat{y}) = -\,\mathbb{E}\big[\log D(\hat{y})\big]$   (9)

During training, the adversarial loss is used to fool the discriminator by maximizing the probability that the generated prediction is considered to come from the ground-truth distribution.
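As a sketch of the combined generator objective, assuming an L1 reconstruction term for the generation loss (the text does not state which distance it uses) and a non-saturating adversarial term; the balancing weights correspond to the $\lambda_{1}$, $\lambda_{2}$ of Eq. (6).

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_layout: torch.Tensor,
                   gt_layout: torch.Tensor,
                   d_fake: torch.Tensor,
                   lambda_gen: float = 1.0,
                   lambda_adv: float = 1.0) -> torch.Tensor:
    """Multi-task generator objective: a reconstruction term against the rotated
    ground-truth layout plus an adversarial term that tries to fool the
    discriminator. The L1 reconstruction is an assumption, not the paper's stated choice."""
    l_gen = F.l1_loss(pred_layout, gt_layout)
    l_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return lambda_gen * l_gen + lambda_adv * l_adv
```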

5 Dataset

We provide a database of indoor furniture layouts, together with a rendered image of each interior layout. These layout data come from designers at the real selling end, where proprietors choose layout designs for their properties.

Figure 6: Samples of the indoor-layout dataset, including four types of rooms: the study room, the bedroom, the tatami room, and the bathroom. The rooms face different directions.

5.1 Interior Layouts

Professional designers work with an industry-level virtual tool to produce a variety of designs, a part of which are sold to proprietors for their interior decoration. We collect these designs at the selling end and provide the interior layouts. Each layout sample contains the following representation: the categories of the furniture in a room, the position and direction of each piece of furniture, the position of the doors and windows in the room, and the position of each fragment of the walls. Figure 6 illustrates samples of our layouts, which are both adopted from the interior design industry and sold to proprietors. The dataset contains four types of rooms: the bedroom, the bathroom, the study room, and the tatami room. The designs of these rooms have been sold to proprietors for their properties since last year. Besides, each design goes through several revisions following both the professional designers' knowledge and the personalized suggestions of each proprietor.

Figure 7: Samples of the indoor-layout data rotated in four directions.

Besides, all the designs are rotated in four directions. The position and direction of each piece of furniture, the position of the doors and windows in the room, and the position of each fragment of the walls are all rotated (Figure 7). Therefore, the total number of layouts with rotation is 14400.

Figure 8: Layout samples and the corresponding rendered scenes. For each sample, the layout, including the position, direction, and size of each piece of furniture, is shown on the left side, and the corresponding rendered scene is shown on the right side.

5.2 Rendered Layouts

Besides, each layout sample corresponds to rendered layout images. These images are the key demonstration of the interior decoration. The rendered images contain several views, and we collect the top-down view as the rendered view (Figure 8). Therefore, the dataset also contains rendered layouts in the top-down view, each corresponding to a design. The rendered data are produced by an industry-level virtual tool that has already provided millions of rendered layout solutions to proprietors (Figure 8).

6 Evaluation

In this section, we present qualitative and quantitative results demonstrating the utility of the proposed adversarial model for scene synthesis and comparing it to two baselines. Four types of indoor rooms are evaluated: the bedroom, the bathroom, the study room, and the tatami room. Samples are randomly split into a training set and a test set, and both the training and test rooms are rotated in four directions. The first baseline model is a classical adversarial model [7] that takes pairs of a rendered empty room and its layout for training; at inference, it produces the furniture layout given the rendered empty room. The second baseline model is a conditional adversarial model [13], which takes the pairs of samples together with the rotation for training; at inference, it encodes the direction of the room and the rendered empty room and produces the layout. Similarly, our model encodes the empty room $x$ and its direction $\theta$ to produce the layout.

6.1 Evaluation metrics

For the task of interior scene synthesis, we apply three metrics for the evaluation. Firstly, we use the average mode accuracy. It measures the accuracy of the furniture categories in a layout with respect to the ground truth. This average mode accuracy is formally expressed as follows:

$\text{Acc}_{mode} = \frac{c}{C}$   (10)

where $C$ is the total number of furniture categories in the ground truth and $c$ is the number of furniture categories in the generated layout that match the ground truth. For example, if a piece of furniture appears in the predicted layout and the ground-truth layout also contains this furniture, it is counted in $c$.
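A sketch of how this metric could be computed, under the assumption that it counts, per room, the ground-truth furniture categories that also appear in the prediction and then averages over rooms; the function name and counting rule are illustrative.

```python
from collections import Counter

def mode_accuracy(pred_categories, gt_categories):
    """Average mode accuracy over rooms.

    pred_categories, gt_categories: lists of per-room category lists,
    e.g. [["bed", "wardrobe"], ...].
    """
    scores = []
    for pred, gt in zip(pred_categories, gt_categories):
        pred_count, gt_count = Counter(pred), Counter(gt)
        matched = sum(min(pred_count[c], n) for c, n in gt_count.items())
        scores.append(matched / max(sum(gt_count.values()), 1))
    return sum(scores) / max(len(scores), 1)
```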

Secondly, in order to evaluate the positional accuracy of the furniture layout, we apply the classical mAP to measure the position of the furniture in the predicted layout. Note that a predicted furniture bounding box is counted as correct when its IoU with the ground-truth bounding box exceeds a fixed threshold.
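For reference, a minimal IoU computation for two axis-aligned furniture boxes; the mAP aggregation on top of it follows the standard detection protocol and is omitted here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) axis-aligned boxes."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)
```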

Thirdly, we apply a degree metric to measure the rotation accuracy of each piece of furniture in the prediction. At the industry end, the direction of the furniture is also key for interior designs in the real world; for example, a TV set should face into the room. This degree metric is formally expressed as:

$\text{Acc}_{rot} = \frac{1}{N_{f}}\sum_{k=1}^{N_{f}} \mathbb{1}\big[\hat{d}_{k} = d_{k}\big]$   (11)

where $N_{f}$ is the total number of furniture items in the dataset, $\hat{d}_{k}$ is the rotation of the $k$-th furniture item in the prediction, and $d_{k}$ is the rotation of the corresponding furniture item in the ground truth.
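A sketch of this rotation accuracy under the assumption that it is the fraction of furniture items whose predicted direction matches the ground-truth direction; the exact form of Eq. (11) is reconstructed, so the matching rule here is illustrative.

```python
def rotation_accuracy(pred_angles, gt_angles, tol: float = 1e-6) -> float:
    """Fraction of furniture items whose predicted direction matches the ground truth.

    pred_angles, gt_angles: flat lists of angles in degrees, aligned item by item.
    """
    correct = sum(abs(p - g) < tol for p, g in zip(pred_angles, gt_angles))
    return correct / max(len(gt_angles), 1)
```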

Figure 9: Comparison of layouts for the tatami room across three models: the proposed model and the two baselines. For each comparison sample, the left layout is from the first baseline, the middle layout is from the second baseline, and the right layout is from the proposed model.
Figure 10: Comparison of layouts for the bathroom across three models: the proposed model and the two baselines. For each comparison sample, the left layout is from the first baseline, the middle layout is from the second baseline, and the right layout is from the proposed model.
Figure 11: Comparison of layouts for the bedroom across three models: the proposed model and the two baselines. For each comparison sample, the left layout is from the first baseline, the middle layout is from the second baseline, and the right layout is from the proposed model.
Figure 12: Comparison of layouts for the study room across three models: the proposed model and the two baselines. For each comparison sample, the left layout is from the first baseline, the middle layout is from the second baseline, and the right layout is from the proposed model.
                Mode                        mAP                      ROT
model     base1    base2    ours      base1   base2   ours     base1    base2    ours
tatami    0.7862   0.9326   0.9565    0.626   0.625   0.726    0.5860   0.6913   0.7613
bathroom  0.7522   0.8545   0.8645    0.506   0.538   0.708    0.4563   0.7020   0.7861
bedroom   0.7563   0.7242   0.8871    0.585   0.527   0.782    0.4287   0.6826   0.7864
study     0.7444   0.8885   0.9000    0.472   0.575   0.775    0.4419   0.6625   0.7704
Table 1: Comparison of the evaluation metrics (Mode, mAP, ROT) for four types of rooms; base1 and base2 denote the two baselines, and ours denotes the proposed model.

6.2 Qualitative Comparisons

We compare our model with the two baseline models for scene synthesis on four types of rooms (Figures 9, 10, 11, 12). Our model outperforms the baselines in the following aspects. Firstly, during rotation of the indoor room, our model predicts the same furniture categories as the ground-truth layout, while the two baseline models lose furniture categories. Secondly, our model predicts good positions for each piece of furniture during rotation of the room, while the baseline models sometimes predict unsatisfactory positions that strongly contradict the knowledge of professional interior designers. Thirdly, the baseline models sometimes fail to give the position and size of the furniture in their predictions, while our model seldom produces this failure.

6.3 Quantitative Comparisons

Similarly, we also compare with the two baseline models quantitatively. All three metrics are evaluated for the four types of rooms. Table 1 reports the accuracy of the mode, the position and size, and the direction of the furniture in the predicted layouts for all four room types, and our model outperforms the baseline models on every metric.

7 Discussion

We presented an adversarial model with three modules for interior scene synthesis with rotation. Besides, we release an interior layout dataset in which all of the designs are drawn from professional designers and come from the selling end.

There are several avenues for future work. Our method is currently limited to generating layouts for common rooms; layouts of luxury rooms are hard to predict. For example, it is difficult to predict the layout of a luxury bedroom in which the bathroom and the cloakroom are part of the bedroom. Besides, our model is limited in its high-level understanding of the interior scene, largely because a structured model such as a graph model has not yet been developed for it. Thirdly, the furniture categories for each type of room are limited to a small number. For example, a generated bedroom layout often contains a bed, a TV set, and a wardrobe; it cannot support other furniture such as a dressing table, an office desk, or a leisure sofa.

References

  • [1] W. Choi, Y. Chao, C. Pantofaru, and S. Savarese (2013-06) Understanding indoor scenes using 3d geometric phrases. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [2] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner (2018-06) ScanComplete: large-scale scene completion and semantic segmentation for 3d scans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [3] S. Dasgupta, K. Fang, K. Chen, and S. Savarese (2016-06) DeLay: robust spatial layout estimation for cluttered indoor scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [4] M. Fisher, D. Ritchie, M. Savva, T. Funkhouser, and P. Hanrahan (2012-11) Example-based synthesis of 3d object arrangements. ACM Trans. Graph. 31 (6). External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • [5] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018-06) IQA: visual question answering in interactive environments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [6] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Representation learning on graphs: methods and applications. CoRR abs/1709.05584. External Links: Link, 1709.05584 Cited by: §2.4.
  • [7] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, Cited by: §6.
  • [8] J. Johnson, A. Gupta, and L. Fei-Fei (2018-06) Image generation from scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [9] J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015-06) Image retrieval using scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [10] A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori (2019-10) LayoutVAE: stochastic scene layout generation from a label set. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.5.
  • [11] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu (2019) LayoutGAN: generating graphic layouts with wireframe discriminators. CoRR abs/1901.06767. External Links: Link, 1901.06767 Cited by: §2.5.
  • [12] M. Li, A. G. Patil, K. Xu, S. Chaudhuri, O. Khan, A. Shamir, C. Tu, B. Chen, D. Cohen-Or, and H. Zhang (2019-02) GRAINS: generative recursive autoencoders for indoor scenes. ACM Trans. Graph. 38 (2). External Links: ISSN 0730-0301, Link, Document Cited by: §1, §2.4.
  • [13] C. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey (2018-06) ST-gan: spatial transformer generative adversarial networks for image compositing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1, §4.
  • [14] S. Qi, Y. Zhu, S. Huang, C. Jiang, and S. Zhu (2018-06) Human-centric indoor scene synthesis using stochastic grammar. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [15] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.4.
  • [16] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017-07) Semantic scene completion from a single depth image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [17] K. Wang, Y. Lin, B. Weissmann, M. Savva, A. X. Chang, and D. Ritchie (2019-07) PlanIT: planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Trans. Graph. 38 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §2.4.
  • [18] K. Wang, M. Savva, A. X. Chang, and D. Ritchie (2018-07) Deep convolutional priors for indoor scene synthesis. ACM Trans. Graph. 37 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §1, §2.1, §2.3, §2.4.
  • [19] K. Xu, K. Chen, H. Fu, W. Sun, and S. Hu (2013-07) Sketch2Scene: sketch-based co-retrieval and co-placement of 3d models. ACM Trans. Graph. 32 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §2.2.
  • [20] Z. Zhang, Z. Yang, C. Ma, L. Luo, A. Huth, E. Vouga, and Q. Huang (2020-04) Deep generative modeling for scene synthesis via hybrid representations. ACM Trans. Graph. 39 (2). External Links: ISSN 0730-0301, Link, Document Cited by: §2.3.