Neural Scene Decoration from a Single Photograph

by Hong-Wing Pang, et al.

Furnishing and rendering an indoor scene is a common but tedious task for interior design: an artist needs to observe the space, create a conceptual design, build a 3D model, and perform rendering. In this paper, we introduce a new problem of domain-specific image synthesis using generative modeling, namely neural scene decoration. Given a photograph of an empty indoor space, we aim to synthesize a new image of the same space that is fully furnished and decorated. Neural scene decoration can be applied in practice to efficiently generate conceptual but realistic interior designs, bypassing the traditional multi-step and time-consuming pipeline. Our approach to neural scene decoration in this paper is a generative adversarial neural network that takes the input photograph and directly produces the image of the desired furnishing and decorations. Our network contains a novel image generator that transforms an initial point-based object layout into a realistic photograph. We demonstrate the performance of our proposed method by showing that it outperforms baselines built upon previous works on image translation both qualitatively and quantitatively. Our user study further validates the plausibility and aesthetics of the generated designs.




1 Introduction

Figure 1: Examples of neural scene decoration under our system (panels: empty scene, object labels, generated image). The two examples show indoor decoration plans and renderings generated from an empty background image and a set of object locations.

Decorating and rendering an indoor scene is a common task for interior design. It is typically performed by professional designers, who carefully craft the room design and furniture placement from a conceptual draft, model the scene extensively in CAD/CAM software, and finally translate it into a realistic image with a powerful 3D rendering engine. Such a task often requires extensive background knowledge and experience in the field of interior design, as well as high-end professional software. This makes it difficult for amateurs to design their own scenes from scratch.

On the other hand, various image synthesis methods using deep neural network architectures have been gaining popularity in the computer vision community. Various types of convolutional neural networks (CNNs) - typically in the form of encoder-decoder networks and generative adversarial networks (GANs) [4] - have been shown to be able to generate images from a variety of inputs, such as semantic maps and text descriptions.

At the same time, the generative nature of deep neural networks has also been extended to the generation of indoor scene structures. Scenes can be generated as collections of 3D objects placed in specific arrangements, or in 2D representations such as top-down coordinates of objects or even floor plans.

In this paper we attempt to combine both directions of research and present an image generation architecture that accepts empty scenes as input, and produces images of the empty scene decorated with furniture. While previous work has been able to synthesize images from simple structural inputs such as bounding boxes of objects, the problem of generation from an empty background image has not yet been extensively explored. Our goal is to generate images of decorated scenes with improved visual quality compared with existing image-to-image generation approaches, where the placement of objects should be coherent. We call this problem neural scene decoration.

An immediate application of neural scene decoration is building new conceptual but realistic designs of an interior space. Despite the evolution of computer-aided design software, generating an interior design is still a tedious task that requires close collaboration among artists, customers, and sometimes sales representatives. Based on a few example images, conceptual designs are generated as top-down diagrams before 3D models are created, textured, and rendered. The entire process is expensive, requires some manual steps, and could take several days. Our goal is to let neural scene decoration rapidly generate realistic furnishing of an empty space either from an example photograph or from an existing interior space.

To this end, we propose a novel solution that consists of two steps, each of which corresponds to a neural network. In the first step, we take the image of the empty scene as input and use an object proposal neural network to generate an object layout for the scene. We leverage the idea of object detection to design our object proposal network. In the second step, the proposed layout is used to create a semantic map that is subsequently used to synthesize a new image of the scene with furniture. In particular, we train a generative adversarial network (GAN) that takes both the object layout and the background image and outputs the rendering in a multi-resolution fashion. Our experiments show that our proposed method generates realistic images of the scene as if it were furnished. The quantitative results with Inception score and Fréchet Inception distance confirm that our method performs better than existing image-to-image translation methods.

In summary, our contributions are

  • A new task in generative image modeling that we name neural scene decoration: synthesizing a realistic image with furnished decorations from an empty background image of a scene;

  • A neural network that features an effective image generator for solving the neural scene decoration problem;

  • Extensive experiments that demonstrate the performance of the proposed method and its potential for future research.

2 Related Works

2.1 Image Synthesis

Before modern deep learning, a single photograph could be edited by building a physical model of the scene in the photo for object composition and rendering [15, 14, 16]. More recently, deep neural networks have been used to synthesize realistic images from a set of training data. Recent image synthesis methods mostly fall within two categories: generative adversarial networks (GAN) [4] and variational autoencoders (VAE) [17]. In this section, we focus on recent deep learning based techniques. Our proposed method adopts the GAN framework, which produces synthesized images from a generator network and employs a discriminator network to distinguish between real images and generated samples. The generator is optimized to generate realistic samples that are able to "fool" the discriminator.

Conditional GANs (cGAN) [21] refer to GANs where the generated result is conditioned on the input. A typical example is image-to-image translation, where the GAN learns a mapping between two image domains such that the generated image not only follows the output domain distribution but also corresponds to the input image in some way. pix2pix, proposed by Isola et al. [10], is able to translate semantic labels and rough sketches into corresponding images by learning the translation mapping from a paired dataset, where a pair of matching images from both domains is sampled at a time. Efforts have been made to extend the paired generation problem in various directions, such as generation of images in higher resolution [32], greater diversity of appearance of generated images [44], and generation from arbitrary viewpoints [29]. CycleGAN, proposed by Zhu et al. [43], extends image translation to unpaired datasets by adopting the cycle-consistency loss, which enables applications where it is impractical to collect large numbers of paired images. Recently, Park et al. proposed using spatially-adaptive normalisation (SPADE) [26] to modulate the intermediate activations with the input semantic map, as opposed to feeding it directly into the generator network, to strengthen the semantic information during generation.

Recently, many GAN variants have been applied to 2D layout or scene generation. LayoutGAN [18] proposed a wireframe discriminator to better generate structural graphics. HiGAN [34] manipulates latent variation semantic layers, i.e., layout, category and lighting, to better synthesize scenes with new views and texture. ArchiGAN [3] and HouseGAN [23] generate planar apartment room layouts and/or furniture layouts. In particular, Li et al. [19] produce high-quality scene images from a salient object layout and an invented background using a background hallucination GAN (BachGAN). The main difference between BachGAN [19] and our generator is that BachGAN varies the background randomly, while our generator is conditioned on an image of an empty scene. Additionally, our system is trained on a dataset with ground-truth backgrounds and can generate novel scenes constrained by the input background and the object layouts.

2.2 Indoor Scene Planning

Indoor scene generation methods fall within two major categories: generation of 3D scenes and generation of 2D representations. Earlier methods in this field relied on optimization-based approaches that aim to minimize certain custom-defined objectives regarding the relative position and orientation of scene objects (e.g., a TV should be facing the sofa) [36]. A range of data-driven methods has since been proposed, attempting to learn the relationships between objects over large indoor datasets. Henderson et al. [5] proposed to model the generation as a graphical model in terms of object placement, room type, room shape and other high-end factors. Recent works have shown that deep neural networks can also be adapted to generate 3D scenes; for example, Zhang et al. [38] proposed a GAN-based model that approximates the distribution of positions and orientations of scene furniture, jointly optimizing discriminators on both 3D and rendered 2D arrangements.

Approaches that focus on generating 2D representations of scenes often make use of the top-down view of an indoor area. As with 3D models, prior work in this line has also seen a trend towards data-driven techniques powered by deep generative models. Ritchie et al. [28] suggested iteratively inserting scene objects, using four separate CNN modules to reason about the category, location, orientation and dimension of the objects. Subsequently, Wang et al. proposed the PlanIT framework [30], which first generates a high-level hierarchical framework of objects before learning to place the objects, thus ensuring better coherence within the proposed arrangement of scene objects. Some previous works used spatial or network latent constraints as priors, e.g., relation graph priors [30, 7, 23] and a convolution prior [31], and performed well on spatial organization.

Our work is also relevant to scene decoration and manipulation techniques. One of the most relevant ideas is by Zhang et al. [37], who aim to generate an empty scene from an RGB-D scan with applications to scene editing such as furnishing, material editing, and relighting. Unlike their work, which uses 3D inputs, our method performs decoration and furnishing with only a single input photograph. Our planning is done directly in the perspective view captured by the input background image, which allows us to consistently relate the input, object layout, and generated images in the same camera view. As an interactive application, ST-GAN used a spatial transformer network to generate geometric corrections for compositing a foreground image of a furniture object onto a background image. Since their method can only operate on a single object at a time, furnishing an entire scene is very time-consuming. By contrast, our method can generate the image of the entire room with a single inference.

3 Our Method

3.1 Problem formulation

Our goal is to create a generator network which produces a decorated indoor scene image, given an empty scene image of the same size and a set of object labels indicating a list of objects to be placed in the empty scene. Ideally, the generator should be able to create a realistic image decorated with objects of the correct type and location as specified in the object labels, while remaining consistent with the provided background image.

The format of the object labels is crucial in determining how easily a potential user can use the generation network to insert objects into the given background scene. SPADE [26] demonstrated the ability to synthesize realistic images using pixel-wise semantic annotations. This is, however, not entirely suitable for generating indoor scene images, as semantic labels of indoor scenes are often composed of a large number of complicated shapes.

In BachGAN [19], the authors proposed to generate images from the salient layout, i.e. an array of bounding boxes of objects present in the scene. For each image, the object labels can be represented as a one-hot tensor with one channel per object class, where each object is represented as a rectangle of pixels with value 1 over background pixels with a value of 0.

In this paper, we additionally discuss generation from a point-based layout, i.e., the location of each object is represented by its centroid in a 2D image space. We find this format interesting to explore, as it implies end-users are able to generate decorated scenes only by specifying rough locations of objects. The size and shape of the objects are automatically inferred from the trained generator, based on observed appearances of object instances in the training dataset. This idea is closely related to center-based object detection and tracking [42, 41], where the centroids are first identified by a neural network using keypoint estimation, and other properties of an object such as bounding box and pose are subsequently regressed from the centroids.

Mathematically, we represent object labels of point positions by tensors in which each object is represented by a single pixel, identified by its position coordinates and its object class ID. While this leads to a very sparse representation, we demonstrate in Section 3.3 how our architecture downscales each object label and conditions the generation process at multiple resolution levels.
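The two label formats discussed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`box_layout`, `point_layout`), the channel-per-class encoding of the point layout, the inclusive-exclusive box convention, and the toy class IDs are all hypothetical choices made here, not details confirmed by the paper.

```python
from typing import List, Tuple

# A label map is a nested list indexed as label[class_id][row][col].
LabelMap = List[List[List[int]]]


def empty_label(num_classes: int, height: int, width: int) -> LabelMap:
    """Return an all-zero label tensor of shape (num_classes, height, width)."""
    return [[[0] * width for _ in range(height)] for _ in range(num_classes)]


def box_layout(boxes: List[Tuple[int, int, int, int, int]],
               num_classes: int, height: int, width: int) -> LabelMap:
    """Bounding-box layout: fill each (class_id, x0, y0, x1, y1) box with 1s.

    Assumption: x1 and y1 are exclusive pixel indices.
    """
    label = empty_label(num_classes, height, width)
    for class_id, x0, y0, x1, y1 in boxes:
        for row in range(max(y0, 0), min(y1, height)):
            for col in range(max(x0, 0), min(x1, width)):
                label[class_id][row][col] = 1
    return label


def point_layout(points: List[Tuple[int, int, int]],
                 num_classes: int, height: int, width: int) -> LabelMap:
    """Point layout: a single nonzero pixel per (class_id, x, y) centroid."""
    label = empty_label(num_classes, height, width)
    for class_id, x, y in points:
        if 0 <= x < width and 0 <= y < height:
            label[class_id][y][x] = 1
    return label


def count_nonzero(label: LabelMap) -> int:
    """Number of pixels set to 1 across all class channels."""
    return sum(v for channel in label for row in channel for v in row)


# Toy example: 3 hypothetical classes on a 16x16 canvas.
boxes = [(0, 2, 3, 8, 9), (2, 10, 1, 14, 5)]
points = [(0, 5, 6), (2, 12, 3)]
box_label = box_layout(boxes, num_classes=3, height=16, width=16)
point_label = point_layout(points, num_classes=3, height=16, width=16)
print(count_nonzero(box_label), count_nonzero(point_label))
```

The point layout sets far fewer pixels than the box layout for the same objects, which is the sparsity that the multi-resolution conditioning has to cope with.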

3.2 Dataset

While there are various choices of datasets designed for indoor scene understanding and synthesis, we choose to conduct our experiments on the Structured3D dataset [40], as it is, to the best of our knowledge, the only publicly available dataset with pre-rendered image pairs of empty and fully decorated scenes. From the larger Structured3D dataset, we obtain two smaller subsets of images of bedrooms and living rooms for separate training. Dataset statistics and the procedure for obtaining the two training splits are described in detail in appendix section A.1.

3.3 Architecture design

Figure 2: Diagram of our generator design. Numbers inside boxes are the target side lengths of the feature map outputs of the upsampling / convolution modules.
Figure 3: Example of label downsampling. (a) Sample image. (b) Flattened visualization of scaled-down point position labels at sizes 32×32, 16×16, 8×8 and 4×4.

We base our architecture design on the recent work by Liu et al. [20], who proposed an unconditional image generation framework with a lightweight architecture that significantly reduces the resources and time needed for training, compared with the state-of-the-art StyleGAN2 [13]. Our architecture has multiple key features that allow end-to-end conditional generation given a pair of background image and object labels. We discuss these features below.

Object label insertion.

Our architecture generates the target image with a bottom-up approach, where multiple upsampling layers increase the feature map size from 4×4 to the final output size of 512×512. During this feed-forward process, we provide object label information at four resolution levels, namely 4×4, 8×8, 16×16 and 32×32. The choice of injecting object label information only at these early levels is partially inspired by [34], who observed that layers in the first half of unconditional GAN architectures (such as StyleGAN [12] and BigGAN [2]) are mostly responsible for the generation of foreground objects in a scene, whereas the latter half mostly affects factors such as style and color scheme.

Therefore, we need to downscale the given object label into smaller counterparts, using consecutive pooling layers. For bounding box labels, we use an average pool layer with kernel size 16 to scale down the original 512×512 label map into a 32×32 label map; we then use further average pool layers with kernel size 2 to compute the remaining maps. For point position labels, we replace average pool with max pool layers, so that object pixels are preserved at the corresponding locations in the smaller label maps. Figure 3 shows an example of downscaling the label.
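The difference between average and max pooling for sparse point labels can be checked with a small pure-Python sketch. The function names `pool2d` and `label_pyramid` are hypothetical, and the sketch assumes non-overlapping windows (stride equal to kernel size) on a single class channel; it follows the kernel sizes stated above (16 first, then 2).

```python
def pool2d(grid, kernel, mode):
    """Non-overlapping 2D pooling (stride == kernel) over a nested-list grid."""
    height, width = len(grid), len(grid[0])
    out = []
    for top in range(0, height - kernel + 1, kernel):
        out_row = []
        for left in range(0, width - kernel + 1, kernel):
            window = [grid[r][c]
                      for r in range(top, top + kernel)
                      for c in range(left, left + kernel)]
            if mode == "max":
                out_row.append(max(window))
            else:  # "avg"
                out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def label_pyramid(label, mode, first_kernel=16, num_levels=4):
    """Downscale one label channel: 512 -> 32 (kernel 16), then 16, 8, 4 (kernel 2)."""
    levels = [pool2d(label, first_kernel, mode)]
    for _ in range(num_levels - 1):
        levels.append(pool2d(levels[-1], 2, mode))
    return levels


# One object centroid at (row=300, col=100) in a 512x512 single-class label map.
size = 512
point_label = [[0] * size for _ in range(size)]
point_label[300][100] = 1

max_levels = label_pyramid(point_label, mode="max")
avg_levels = label_pyramid(point_label, mode="avg")
for max_map, avg_map in zip(max_levels, avg_levels):
    side = len(max_map)
    # Peak value per level: max pooling keeps it at 1, averaging dilutes it.
    print(side, max(map(max, max_map)), max(map(max, avg_map)))
```

With max pooling the single object pixel survives as a 1 at every level, whereas averaging shrinks it to 1/256 already at 32×32, which matches the motivation given above for switching pooling types.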

The scaled-down labels are then fed into label injection modules. Each module consists of a single SPADE residual block [26], which allows spatial annotations in the label maps to condition the generation process via learned denormalization attributes. There is a skip connection before and after the SPADE layer.
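To make the denormalization idea concrete, the following sketch applies the SPADE modulation formula to a single feature channel in plain Python. It is a simplification under stated assumptions: in SPADE the per-pixel scale `gamma` and shift `beta` are predicted from the label map by small convolutions, whereas here they are hand-built, hypothetical values passed in directly, and the residual block and skip connection are omitted.

```python
import math


def spade_denormalize(activation, gamma, beta, eps=1e-5):
    """Normalize one feature channel, then modulate it per pixel.

    out[y][x] = gamma[y][x] * (activation[y][x] - mean) / std + beta[y][x]
    """
    values = [v for row in activation for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    return [[g * (a - mean) / std + b
             for a, g, b in zip(a_row, g_row, b_row)]
            for a_row, g_row, b_row in zip(activation, gamma, beta)]


# Toy 4x4 feature channel and a 4x4 label map with one labelled pixel.
activation = [[float(4 * r + c) for c in range(4)] for r in range(4)]
label = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

# Hypothetical stand-in for the learned modulation: labelled pixels get a shift.
gamma = [[1.0 for _ in row] for row in label]
beta = [[2.0 * v for v in row] for row in label]
zeros = [[0.0 for _ in row] for row in label]

plain = spade_denormalize(activation, gamma, zeros)  # normalization only
out = spade_denormalize(activation, gamma, beta)     # label-conditioned output
```

Only the labelled location differs between `plain` and `out`, which illustrates how a sparse label map can steer the features in a specific local region.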

Background image insertion.

Another important generation constraint is that the background area of the generated image must conform to the input background image. We observed that our problem formulation is similar to image inpainting, in the sense that the generator should ideally leave most of the background image untouched and only generate new content over certain areas. Similar to the inpainting approaches in [9, 35], we make use of an encoder-decoder structure where the size of the middle latent feature maps is one quarter of the original image size, i.e. 128×128. The encoded background feature maps are then merged with the object generation feature maps, which have the same size. The merging module is another SPADE residual block with two SPADE layers, where the object feature map is used to denormalize the background feature map, producing a merged feature map of the same size. The feature map is then upsampled twice to the output size of 512, where we follow the implementation of the upsampling and skip-layer excitation (SLE) modules from [20], the latter of which merges activations from earlier layers.

Label discriminator.

Liu et al. [20] introduced a discriminator architecture, shown in Figure 2, that improves generation quality using two simple decoders. To ensure that suitable objects are generated at the locations specified in the object labels, we introduce an additional label discriminator that branches off from this discriminator at size 32×32. In this branch, each feature map is first concatenated with the label map of the corresponding size.

3.4 Training

Loss function.

We make use of the hinge adversarial loss function in [20] to train the discriminator. The real sample is the decorated scene image paired with the empty scene image; it is also the source image from which the object locations in the object labels are captured. The discriminator output is a weighted combination of the logit outputs of the dual discriminator heads, i.e. the discriminator of [20] and the label discriminator.

The generator is updated to push the discriminator output towards the real image direction, with an additional reconstruction loss term that constrains the output to keep the background scene specified by the empty scene image.

The reconstruction term is computed with a binary "background mask" applied by element-wise multiplication. The mask specifies the fixed background region shared by the pair of empty / decorated scene images, and is determined by checking which pixels belong to one of the background semantic classes (e.g. floor, wall) listed in section A.1.

Thus, we optimize the discriminator and the generator iteratively, using mixing coefficients to weight the loss terms. Empirically we found that setting these coefficients to smaller values is sufficient for constraining the generation results to the provided background image and object labels. The same coefficient values are used for all experiments in this paper.
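The loss terms described in this section can be sketched as follows. This is a minimal illustration of the standard hinge formulation, not necessarily the paper's exact expressions: the function names, the L1 norm in the background term, the choice of comparing against the empty scene image, and the weights `label_weight` and `bg_weight` are all assumptions introduced here.

```python
def combined_logit(adv_logit, label_logit, label_weight):
    """Weighted combination of the two discriminator heads (weight is hypothetical)."""
    return adv_logit + label_weight * label_logit


def hinge_discriminator_loss(real_logits, fake_logits):
    """Standard hinge loss: mean(max(0, 1 - D(real))) + mean(max(0, 1 + D(fake)))."""
    real_term = sum(max(0.0, 1.0 - d) for d in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + d) for d in fake_logits) / len(fake_logits)
    return real_term + fake_term


def masked_l1(generated, background, mask):
    """Mean absolute difference restricted to pixels where mask == 1."""
    total, count = 0.0, 0
    for g_row, b_row, m_row in zip(generated, background, mask):
        for g, b, m in zip(g_row, b_row, m_row):
            total += m * abs(g - b)
            count += m
    return total / count if count else 0.0


def generator_loss(fake_logits, generated, background, mask, bg_weight):
    """Adversarial term (raise D on fakes) plus weighted background reconstruction."""
    adversarial = -sum(fake_logits) / len(fake_logits)
    return adversarial + bg_weight * masked_l1(generated, background, mask)


# Toy values: two real and two generated samples, and a 2x2 single-channel image.
real_logits = [2.0, 0.5]
fake_logits = [-2.0, 0.0]
generated = [[1.0, 2.0], [3.0, 4.0]]
background = [[1.0, 0.0], [0.0, 4.0]]
mask = [[1, 1], [0, 1]]  # 1 marks background pixels that must be preserved

d_loss = hinge_discriminator_loss(real_logits, fake_logits)
g_loss = generator_loss(fake_logits, generated, background, mask, bg_weight=0.5)
```

Pixels outside the mask (where furniture is expected) do not contribute to the reconstruction term, so the generator is free to place new content there.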

4 Experiments

In this section, we present and analyze the generation quality of our proposed method. As it is hard to find existing baselines that exactly match our novel problem formulation, we adopt two architectures for conditional image synthesis - SPADE [26] and BachGAN [19] - and make the necessary modifications to each of them.


SPADE is a state-of-the-art generation architecture conditioned on pixelwise semantic maps. A simple baseline would be to adopt SPADE but with object labels passed as input, instead of semantic maps; however, we also aim to generate images with a given background image, which SPADE does not take into consideration. Thus, we propose two baselines altered from SPADE, namely SPADE-BG1 and SPADE-BG2:

  • For SPADE-BG1, the background image is passed through an encoder to form the latent code input to the generator. We use the VAE encoder shown in Fig. 14 in [26] (with one component discarded). In order to pass the point labels to the earlier (i.e. spatially smaller) layers, such labels are downsampled by max-pooling layers instead of direct interpolation.

  • SPADE-BG2 follows the same setting as SPADE-BG1, except that the background image is scaled and concatenated to the feature map following every upsampling operation in the generator.


Building upon the SPADE generator architecture, BachGAN aims to generate from foreground object bounding boxes and without any information regarding the background scene. It includes an additional background hallucination module that produces backgrounds matching the input object layout. Similarly to SPADE, the generated output of BachGAN is not conditioned on any input background image.

Thus, we adopt a variant of BachGAN - which is hereby denoted as BachGAN-BG - as a comparison baseline. In this variant, a given input background image is directly fed into the background generation module, instead of pooling features from multiple background image candidates. Figure 4 illustrates how BachGAN-BG differs from the original architecture.

Figure 4: Top: BachGAN architecture, simplified from Fig. 2 in [19]. Bottom: BachGAN-BG.

We perform evaluation on two types of object labels: the point-based layout proposed in Section 3.1, and the bounding-box layout originally employed in BachGAN.

4.1 Quantitative Evaluation

Data split     Method        FID      KID
Bedroom        BachGAN-BG    64.9     0.0186
Bedroom        SPADE-BG1     62.0     0.0204
Bedroom        SPADE-BG2     56.7     0.0161
Bedroom        Ours          56.2*    0.0104*
Living room    BachGAN-BG    61.3     0.0170
Living room    SPADE-BG1     64.4     0.0245
Living room    SPADE-BG2     57.8     0.0174
Living room    Ours          54.5*    0.0094*
Table 1: Quantitative assessment using point locations as label input. An asterisk (*) marks the best result in each data split; lower is better for both metrics.

Data split     Method        FID      KID
Bedroom        BachGAN-BG    52.5     0.0111
Bedroom        Ours          51.1*    0.0062*
Living room    BachGAN-BG    50.4*    0.0088
Living room    Ours          52.7     0.0080*
Table 2: Quantitative assessment of the image quality from our generator and the baselines, using bounding boxes as label input. An asterisk (*) marks the best result in each data split; lower is better for both metrics.

We evaluate the image synthesis performance in terms of the Fréchet Inception Distance (FID) [6] and the Kernel Inception Distance (KID) [1]. FID and KID scores measure the feature similarity between synthesized outputs and real images, i.e. in our case synthesized decorated rooms and ground-truth decorated rooms. Both scores are measured with a PyTorch implementation of the metrics.
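As an illustration of what KID measures, the sketch below computes the unbiased squared maximum mean discrepancy with the cubic polynomial kernel k(x, y) = (x·y / d + 1)^3 used by KID [1]. It is a toy version under stated assumptions: real evaluations use Inception network features and typically average the estimate over several subsets, while the 2-dimensional feature vectors and the function names here are hypothetical.

```python
def polynomial_kernel(x, y):
    """Cubic polynomial kernel (x . y / d + 1)^3, with d the feature dimension."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def kid_estimate(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two sets of feature vectors."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = sum(polynomial_kernel(real_feats[i], real_feats[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_ff = sum(polynomial_kernel(fake_feats[i], fake_feats[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_rf = sum(polynomial_kernel(r, f)
               for r in real_feats for f in fake_feats) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf


# Toy 2-D "features": one generated set close to the real set, one far from it.
real = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.1]]
near = [[0.1, 0.9], [0.0, 1.1], [0.2, 1.0]]
far = [[3.0, -2.0], [2.8, -2.2], [3.1, -1.9]]

kid_near = kid_estimate(real, near)
kid_far = kid_estimate(real, far)
```

A generated set whose features resemble the real ones yields a much smaller estimate than a distant set, which is why lower KID values in the tables indicate better image quality. Because the estimator is unbiased, it can be slightly negative for very similar sets.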


Figure 5: Comparison of generation results, based on point labels, of BachGAN-BG and our method (panels: empty scene, labels, BachGAN-BG, SPADE-BG1, SPADE-BG2, ours).

We compare our generation results for both the bedroom and living room dataset splits in point label format. Table 1 reports the evaluation metric scores of our models compared with the baselines; our models outperform the baselines in both splits. The comparison is further illustrated in Figure 5; our model is more likely to capture the details of the foreground objects. We hypothesize that our model is more successful in feeding object label information to the initial layers of the generator, as each pixel in a downscaled object label encourages object generation within a specific local region of the final image.

We also compare, as a reference, our generation results in bounding box format with BachGAN-BG; the results are shown in Table 2. Compared to point labels, the performance of the two methods is relatively similar; our method performs better than BachGAN-BG in terms of KID, but does not hold a consistent advantage in FID. Additionally, we observe that our method performs better on bedroom images, which often have less complicated arrangements of objects. Regardless, we believe our method still has potential, considering that the BachGAN model is trained for roughly the same duration but with 4 GPUs, i.e. our method is able to produce results of similar quality using fewer computational resources.

4.2 Qualitative Evaluation

Figure 6: Different decoration renderings with different semantic maps based on the same background image (panels: empty scene, then two pairs of object labels and generated image). With the same background, our generator furnishes the scene according to diverse object layouts.
Figure 7: Experiments on furnishing real-world scenes (panels: empty scene, then two pairs of object labels and generated image). Our method can generate images with diverse settings.

In Figure 6 we illustrate our model's flexibility in handling different sets of object layouts over the same input background. With this framework, it is possible to produce different generation results by modifying the locations of different objects, as well as adding / deleting objects in the scene. The resulting decoration renderings demonstrate the versatility of our system across diverse semantic object layouts. Furthermore, we apply the model to real-world indoor scenes in Figure 7 to test how well it generalizes to data outside the training dataset [40]. Interestingly, the generated paintings and furniture have different artistic styles that are in line with the overall decoration and background styles.

Data split     Labels    Ours    BachGAN    Neither
Bedroom        Points    70.1    19.3       10.6
Bedroom        Boxes     57.0    28.7       14.3
Living room    Points    53.0    30.4       16.6
Living room    Boxes     38.9    43.9       17.2
Table 3: User preferences (% of responses).

4.3 User study

To complement our quantitative evaluation results, we conduct a user study to evaluate users' preferences between images generated with our method and with the baseline. We randomly selected 15 pairs of generated images for each of the four dataset settings, with each pair conditioned on the same input background and object labels. Each respondent is presented with the 60 pairs of images in random order, and chooses the image that he / she considers to be more natural and realistic. The order in which the two images are shown in each question is also randomized. Additionally, the user may choose the option "Neither is better" if neither image is significantly better than the other. We collected 29 responses in total, and the result is shown in Table 3. In particular, our model is more successful in generating images from point locations, especially in the bedroom split, where there tend to be fewer objects and less clutter. When using bounding boxes as object label input, our method is still able to produce results at around the same level of quality as the baseline.

4.4 Automatic generation

Our current method requires both a background image and a set of object locations for successful generation, which is sufficient in use cases where the rough arrangement of objects is already decided. However, our method can be further extended to automatic generation of decorated scenes from the background scene alone, by taking an additional step to automatically generate a set of plausible object arrangements based on the background input. Thus, we further explore the possibility of performing neural scene decoration by conditioning on the background image only and automatically inferring plausible layouts from the supplied background scene.

We believe that background scenes in general should provide sufficient information for learned neural networks to propose object locations via their visual cues, e.g. visible room structure and vanishing lines. A close example of this task would be LayoutVAE [11], which samples entire object layouts (i.e. a set of bounding boxes) of complex scenes from a variational autoencoder (VAE)-based framework. However, their approach is purely unconditional, which is not applicable in our setting of generating object layouts conditioned on the background image.

As a simple experiment, we train a Faster-RCNN [27] object detector to predict object locations using the Detectron2 [33] implementation. The object detector only sees empty background images, as well as the ground truth bounding boxes of objects in the corresponding decorated image. Due to the large number of duplicate box proposals, we filter out certain classes (e.g. the sofa class when generating bedroom images) and select the highest confidence bounding box per class. Finally, we obtain the point locations by computing the center point of the bounding box. This is different from using the centroid of object pixels during training, but it should serve as a rough estimate of the centroid location.
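The post-processing of detector outputs described above (class filtering, one highest-confidence box per class, box center as point location) can be sketched as below. The detection record format, the function name `detections_to_points`, and the example class names and scores are hypothetical; this sketch does not use or reproduce the Detectron2 output API.

```python
def detections_to_points(detections, allowed_classes):
    """Keep the best box per allowed class and return (label, cx, cy) tuples.

    detections: list of dicts with keys 'label', 'score', 'box' = (x0, y0, x1, y1).
    """
    best = {}
    for det in detections:
        label = det["label"]
        if label not in allowed_classes:
            continue  # e.g. drop 'sofa' proposals when generating bedrooms
        if label not in best or det["score"] > best[label]["score"]:
            best[label] = det
    points = []
    for label, det in sorted(best.items()):
        x0, y0, x1, y1 = det["box"]
        # Box center as a rough estimate of the object centroid.
        points.append((label, (x0 + x1) / 2.0, (y0 + y1) / 2.0))
    return points


# Hypothetical proposals for an empty bedroom image.
detections = [
    {"label": "bed", "score": 0.9, "box": (100, 200, 300, 400)},
    {"label": "bed", "score": 0.6, "box": (120, 210, 310, 390)},
    {"label": "sofa", "score": 0.8, "box": (50, 220, 200, 380)},
    {"label": "lamp", "score": 0.7, "box": (10, 50, 50, 150)},
]
points = detections_to_points(detections, allowed_classes={"bed", "lamp"})
print(points)
```

The resulting points can then be rasterized into a point-based label map and passed to the generator together with the background image.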

Under this problem setting, we are able to produce images of decorated rooms from an empty background image alone. We present the generation results, as well as a comparison with naive image-to-image translation using Pix2PixHD [25], in detail in appendix section B.2.

5 Conclusion

We introduced a new task for image synthesis called neural scene decoration: the goal of this task is to furnish an indoor space by directly rendering furniture and wall decorations onto an image of the empty scene. To solve this task, we propose a generative architecture conditioned on a set of point locations that indicate the rough locations of the different types of objects the user wishes to place within the provided empty scene. We demonstrate that this method can be trained on interior scene datasets to generate images of furnished rooms automatically, bypassing the traditional pipeline that requires familiarity with professional 3D modeling and rendering software with complex settings. Neural scene decoration is hence a small step towards building next-generation user-friendly interior design and rendering applications, especially for novice users.


  • [1] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. arXiv preprint arXiv:1801.01401. Cited by: §4.1.
  • [2] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §3.3.
  • [3] S. Chaillou (2019) ArchiGAN: a generative stack for apartment building design. Cited by: §2.1.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.1.
  • [5] P. Henderson, K. Subr, and V. Ferrari (2017) Automatic generation of constrained furniture layouts. arXiv preprint arXiv:1711.10939. Cited by: §2.2.
  • [6] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §A.3, §4.1.
  • [7] R. Hu, Z. Huang, Y. Tang, O. van Kaick, H. Zhang, and H. Huang (2020) Graph2Plan: learning floorplan generation from layout graphs. arXiv preprint arXiv:2004.13204. Cited by: §2.2.
  • [8] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §B.2.
  • [9] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–14. Cited by: §3.3.
  • [10] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.1.
  • [11] A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori (2019) Layoutvae: stochastic scene layout generation from a label set. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9895–9904. Cited by: §4.4.
  • [12] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §3.3.
  • [13] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, Cited by: §A.2, §3.3.
  • [14] K. Karsch, K. Sunkavalli, S. Hadap, N. Carr, H. Jin, R. Fonte, M. Sittig, and D. Forsyth (2014) Automatic scene inference for 3d object compositing. ACM Trans. Graph.. Cited by: §2.1.
  • [15] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem (2011) Rendering synthetic objects into legacy photographs. In Proceedings of the 2011 SIGGRAPH Asia Conference, Cited by: §2.1.
  • [16] K. Karsch (2015) Inverse rendering techniques for physically grounded image editing. Ph.D. Thesis, University of Illinois at Urbana-Champaign. Cited by: §2.1.
  • [17] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.1.
  • [18] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu (2019) Layoutgan: generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767. Cited by: §2.1.
  • [19] Y. Li, Y. Cheng, Z. Gan, L. Yu, L. Wang, and J. Liu (2020) BachGAN: high-resolution image synthesis from salient object layout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §3.1, Figure 4, §4.
  • [20] B. Liu, Y. Zhu, K. Song, and A. Elgammal (2021) Towards faster and stabilized gan training for high-fidelity few-shot image synthesis. arXiv preprint arXiv:2101.04775. Cited by: §3.3, §3.3, §3.3, §3.4.
  • [21] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.1.
  • [22] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §A.1.
  • [23] N. Nauata, K. Chang, C. Cheng, G. Mori, and Y. Furukawa (2020) House-gan: relational generative adversarial networks for graph-constrained house layout generation. arXiv preprint arXiv:2003.06988. Cited by: §2.1, §2.2.
  • [24] A. Obukhov, M. Seitzer, P. Wu, S. Zhydenko, J. Kyl, and E. Y. Lin (2020) High-fidelity performance metrics for generative models in pytorch. Zenodo. Version 0.2.0, DOI: 10.5281/zenodo.3786540. Cited by: §4.1.
  • [25] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §B.2, §4.4.
  • [26] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §3.1, §3.3, 1st item, §4.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §4.4.
  • [28] D. Ritchie, K. Wang, and Y. Lin (2019) Fast and flexible indoor scene synthesis via deep convolutional generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6182–6190. Cited by: §2.2.
  • [29] H. Tang, D. Xu, N. Sebe, Y. Wang, J. J. Corso, and Y. Yan (2019) Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2417–2426. Cited by: §2.1.
  • [30] K. Wang, Y. Lin, B. Weissmann, M. Savva, A. X. Chang, and D. Ritchie (2019) Planit: planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–15. Cited by: §2.2.
  • [31] K. Wang, M. Savva, A. X. Chang, and D. Ritchie (2018) Deep convolutional priors for indoor scene synthesis. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–14. Cited by: §2.2.
  • [32] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §2.1.
  • [33] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §4.4.
  • [34] C. Yang, Y. Shen, and B. Zhou (2019) Semantic hierarchy emerges in deep generative representations for scene synthesis. arXiv preprint arXiv:1911.09267. Cited by: §2.1, §3.3.
  • [35] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5505–5514. Cited by: §3.3.
  • [36] L. F. Yu, S. K. Yeung, C. K. Tang, D. Terzopoulos, T. F. Chan, and S. J. Osher (2011) Make it home: automatic optimization of furniture arrangement. ACM Transactions on Graphics (TOG)-Proceedings of ACM SIGGRAPH 2011, v. 30,(4), July 2011, article no. 86 30 (4). Cited by: §2.2.
  • [37] E. Zhang, M. F. Cohen, and B. Curless (2016) Emptying, refurnishing, and relighting indoor spaces. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016) 35 (6). Cited by: §2.2.
  • [38] Z. Zhang, Z. Yang, C. Ma, L. Luo, A. Huth, E. Vouga, and Q. Huang (2020) Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG) 39 (2), pp. 1–21. Cited by: §2.2.
  • [39] S. Zhao, Z. Liu, J. Lin, J. Zhu, and S. Han (2020) Differentiable augmentation for data-efficient gan training. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §A.2.
  • [40] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou (2020) Structured3D: a large photo-realistic dataset for structured 3d modeling. arXiv preprint arXiv:1908.00222. Cited by: §3.2, §4.2.
  • [41] X. Zhou, V. Koltun, and P. Krähenbühl (2020) Tracking objects as points. In ECCV, Cited by: §3.1.
  • [42] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. In arXiv preprint arXiv:1904.07850, Cited by: §3.1.
  • [43] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.1.
  • [44] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476. Cited by: §2.1.

Appendix A Implementation details

We provide additional information on the preprocessing, augmentation, and training pipeline used in our experiments.

A.1 Dataset and preprocessing

The Structured3D dataset consists of pairs of fully decorated and empty indoor images, rendered from a total of 3500 distinct 3D scenes. Each image pair is supplemented by additional pixel-wise annotations, e.g. semantic maps and instance maps. Following the recommendation of the dataset authors, we split the 3500 scenes into a training split of 3000 scenes, a validation split of 250 scenes, and a test split of 250 scenes. All quantitative evaluations and all samples shown in the figures are produced from the test split only.

Each semantic map is annotated using the 40-class label set from the NYU-Depth V2 dataset [22]. Five of the classes (window, door, wall, ceiling, floor) are "background" classes, which are present in both empty and decorated scenes. The remaining classes represent "foreground" objects found only in the decorated scenes, and our goal is to synthesize object instances from these classes. However, the distribution of these classes is highly unbalanced, and some of the classes are in fact virtually nonexistent in the dataset. Thus, in our experiments we focus only on generating the ten most frequently occurring classes in the dataset (excluding the "wildcard" classes otherprop, otherstructure and otherfurniture). The list of classes used and their corresponding colors is shown in Table 5.
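The class selection described above can be illustrated with a short sketch. It assumes that frequency is measured by counting labelled object instances (the text does not say whether instances or pixels are counted), and the function name `most_frequent_foreground` and its input format are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

# Class names taken from the text above.
BACKGROUND = {"window", "door", "wall", "ceiling", "floor"}
WILDCARD = {"otherprop", "otherstructure", "otherfurniture"}


def most_frequent_foreground(instance_labels: Iterable[str], k: int = 10) -> List[str]:
    """Return the k most frequent classes, ignoring background and wildcard classes.

    `instance_labels` is assumed to hold one class name per labelled instance
    across the whole dataset.
    """
    ignored = BACKGROUND | WILDCARD
    counts = Counter(label for label in instance_labels if label not in ignored)
    return [name for name, _ in counts.most_common(k)]
```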

The Structured3D dataset consists of a wide variety of indoor scenes, including bedrooms, living rooms, and in some cases non-residential locations. Thus, we carry out experiments on smaller subsets of the dataset, each consisting of samples of the same type of room. Specifically, we perform experiments on two subsets: bedrooms and living rooms. Each subset has an associated anchor class (bed and sofa, respectively) that should appear in every sample of the subset. To make generation easier, we filter out images where the anchor class object is barely visible, i.e. the number of pixels of that object is less than a certain threshold. The number of samples for each subset is outlined in Table 4.
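A minimal sketch of the anchor visibility filter is given below. The threshold value is not stated in the text, so `min_pixels` is left as a parameter; the function name `keep_sample` and the use of an integer class id in a NumPy semantic map are likewise assumptions made for illustration.

```python
import numpy as np


def keep_sample(semantic_map: np.ndarray, anchor_id: int, min_pixels: int) -> bool:
    """Return True if the anchor class covers at least `min_pixels` pixels.

    `semantic_map` is assumed to be an (H, W) array of integer class ids,
    and `anchor_id` the id of the subset's anchor class (bed or sofa).
    """
    anchor_pixels = int(np.count_nonzero(semantic_map == anchor_id))
    return anchor_pixels >= min_pixels
```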

| Data split | Anchor class | Train | Validation | Test |
|---|---|---|---|---|
| Bedroom | bed | 5,112 | 460 | 465 |
| Living room | sofa | 4,063 | 418 | 379 |

Table 4: Statistics of data used for training and testing.
| Name | Color | Name | Color |
|---|---|---|---|
| cabinet | | picture | |
| bed | | curtain | |
| chair | | television | |
| sofa | | nightstand | |
| table | | lamp | |

Table 5: Object classes used in our experiments.

All images from the dataset are of size 1280 × 720. Thus, we resize the image by scaling the shorter, vertical side to match the generation image size, then randomly crop a square image along the x-axis. Furthermore, to ensure that the anchor class object is visible in each image, we first compute the leftmost and rightmost cropping positions where the centroid of the anchor class object is inside the square frame, and is at least pixels away from the left and right edges of the frame. We then randomly choose a position between the leftmost and rightmost positions. The randomness introduced here serves as data augmentation and provides a larger variety of images to the training process. The same procedure is carried out for all experiments, including baseline models. We set and when generating images with our proposed architecture.
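The anchor-aware cropping procedure can be sketched as follows. The names `size` (generation image size) and `margin` (minimum distance of the anchor centroid from the left and right crop edges) are placeholders introduced here; the values used in the usage note are illustrative only and are not taken from the paper. The fallback for anchors too close to the image border is also an assumption, since the text does not describe that case. The sketch assumes `margin` is smaller than `size / 2`.

```python
import math
import random
from typing import Tuple


def crop_range(scaled_width: int, size: int, anchor_x: float, margin: int) -> Tuple[int, int]:
    """Leftmost and rightmost x-offsets of a size x size crop that keep the
    anchor centroid at least `margin` pixels from the crop's left/right edges."""
    # Constraint on the offset: offset + margin <= anchor_x <= offset + size - margin
    leftmost = max(0, math.ceil(anchor_x - (size - margin)))
    rightmost = min(scaled_width - size, math.floor(anchor_x - margin))
    return leftmost, rightmost


def sample_crop_offset(
    anchor_x_original: float,
    size: int,
    margin: int,
    rng: random.Random,
    original_size: Tuple[int, int] = (1280, 720),
) -> int:
    """Scale the image so its height equals `size`, then pick a random crop offset."""
    width, height = original_size
    scale = size / height
    scaled_width = round(width * scale)
    leftmost, rightmost = crop_range(scaled_width, size, anchor_x_original * scale, margin)
    if leftmost > rightmost:
        # Assumption: the anchor is too close to the image border for the margin
        # to hold, so fall back to the crop aligned with that border.
        return min(leftmost, scaled_width - size)
    return rng.randint(leftmost, rightmost)
```

For example, with the illustrative values `size=256` and `margin=32`, calling `sample_crop_offset(640, 256, 32, random.Random(0))` returns an offset between 4 and 195 pixels in the rescaled image.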

A.2 Data augmentation

Figure 8: (a) Sample image with corresponding label map, downscaled to size 32 × 32. (b) Same sample after translation and horizontal flipping.

A direct consequence of training on smaller subsets of the Structured3D dataset is that the number of usable training samples the model observes is greatly reduced. To deal with this issue, we use the DiffAugment technique [39] in our training process. DiffAugment improves generation quality by randomly perturbing both the generated and real images with differentiable augmentations when training both the generator and the discriminator, and is reported to significantly boost the generation quality of the state-of-the-art unconditional StyleGAN2 [13] architecture when training data is limited to a few thousand samples. Thus, we adopt this technique when training our architecture to compensate for the reduction in training samples.

While the authors of DiffAugment have proposed multiple augmentation methods, we only apply translation augmentation to the images, as other methods (e.g. random square cutouts) may affect the integrity of the decorated scene images. We set the translation augmentation probability to 30%, and also horizontally flip the images 50% of the time. For each augmented image, its corresponding label map is perturbed in exactly the same way. Figure 8 shows an example of our augmentation scheme.
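The key point of the scheme above is that an image and its label map receive the same perturbation. The NumPy sketch below illustrates only this pairing; it is not the differentiable tensor implementation of DiffAugment. It assumes that the image and label map share the same spatial size, that translated regions are zero-padded, and that shifts are limited to a fraction `max_shift_ratio` of the image size; the function names and the default ratio are illustrative assumptions.

```python
import numpy as np


def translate(arr: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an (H, W, ...) array by (dx, dy) pixels, filling exposed areas with zeros."""
    out = np.zeros_like(arr)
    h, w = arr.shape[:2]
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = arr[src_y, src_x]
    return out


def augment_pair(image, label_map, rng, p_translate=0.3, p_flip=0.5, max_shift_ratio=0.125):
    """Apply the same random translation and horizontal flip to image and label map."""
    if rng.random() < p_translate:
        h, w = image.shape[:2]
        max_dy, max_dx = int(h * max_shift_ratio), int(w * max_shift_ratio)
        dy = int(rng.integers(-max_dy, max_dy + 1))
        dx = int(rng.integers(-max_dx, max_dx + 1))
        image, label_map = translate(image, dx, dy), translate(label_map, dx, dy)
    if rng.random() < p_flip:
        # Flip along the width axis for both arrays.
        image, label_map = image[:, ::-1], label_map[:, ::-1]
    return image, label_map
```

A typical call would be `augment_pair(image, label_map, np.random.default_rng(0))`, with the probabilities left at the 30% and 50% values stated in the text.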

A.3 Training details

Our architecture is trained with a batch size of 16, with gradients accumulated every 4 iterations. Each training instance can be run on a single NVIDIA RTX 2080 Ti GPU. We monitor the training process by regularly computing the FID score [6] over the test split, and stop training when the improvement in generation quality begins to stagnate. We found that, under the given training setup, our architecture reaches its best generation quality from around 50 epochs onward, which typically requires around two full days of training.
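One way to make the stopping rule concrete is a patience-based check on the history of FID scores (lower is better). This is only an illustrative sketch: the text does not specify how stagnation was judged, and the function name `should_stop` and the parameters `patience` and `min_delta` are assumptions.

```python
from typing import Sequence


def should_stop(fid_history: Sequence[float], patience: int = 5, min_delta: float = 0.5) -> bool:
    """Return True when the last `patience` evaluations have not improved the
    best earlier FID by at least `min_delta`."""
    if len(fid_history) <= patience:
        return False  # not enough evaluations to judge stagnation
    best_before = min(fid_history[:-patience])
    best_recent = min(fid_history[-patience:])
    return best_recent > best_before - min_delta
```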

A.4 Limitations of implementation

While the current proposal network is able to propose plausible object locations, we notice that the arrangement of the objects currently lacks flexibility. For example, supplying an object label map with only one or two objects is less likely to result in realistic generated images. We suggest that this is a result of the training dataset containing only fully decorated rooms, i.e. the generator is not trained to produce partially decorated rooms. Likewise, our trained models also tend to perform worse on arrangements of objects that rarely occur in the dataset.

Additionally, we found that multiple object instances in an image are occasionally labelled by Structured3D as the same object, e.g. paintings and curtains. This explains why a single picture object label can result in two (or more) paintings being generated. Reflections and highlights caused by foreground objects (e.g. lights) are also present in the corresponding empty scene image, which could hinder the ability of our approach to generalize to real-life empty scene images that are not lit up.

Appendix B Additional Results

B.1 Comparisons to bounding box representation

While the amount of information stored in point locations is significantly less than in pixel-wise semantic maps or bounding boxes, we argue that this gives the trained generator greater flexibility in determining the placement of objects, and therefore does not necessarily lead to a reduction in generation quality. In Figure 9 we provide a side-by-side comparison of the generation results conditioned on the same background and set of objects, where the object labels are in bounding box and point location formats respectively. The examples demonstrate that the generator is able to create objects of similar quality under our method, even when the label information of each object is reduced to a single 2D point.

Note that in Figure 9, some objects may not appear in the point location image; this is because the centroids of objects are precomputed according to the original image dimensions, and may fall outside the image boundary after random cropping.
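The effect noted above can be sketched as a coordinate mapping from the original image to the cropped square frame. The function name `centroid_in_crop` and its parameters are hypothetical; `scale` is the resize factor applied before cropping and `crop_x` is the horizontal crop offset in the resized image.

```python
from typing import Optional, Tuple


def centroid_in_crop(
    cx: float, cy: float, scale: float, crop_x: int, size: int
) -> Optional[Tuple[float, float]]:
    """Map a centroid from original image coordinates into the cropped frame.

    Returns None when the centroid falls outside the size x size crop, in which
    case the object has no point in the point location image.
    """
    x = cx * scale - crop_x  # horizontal position relative to the crop
    y = cy * scale           # no vertical cropping is applied
    if 0 <= x < size and 0 <= y < size:
        return (x, y)
    return None
```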

B.2 Generation with background images only

We compare the images synthesized from automatically generated object labels, obtained via the method discussed in Section 4.4, with images generated by Pix2PixHD [25], a general image-to-image translation architecture. Note that the label proposal method is relatively naive: it learns to place objects at commonly occurring areas (e.g. curtains on windows) and does not consider the relationships among the proposed objects. Thus, the proposed layout is not guaranteed to be fully coherent and may deviate from the distribution of object placements in Structured3D, which affects generation quality. The generation quality is therefore not as high as the results in previous sections, which use well-designed object layouts sampled from Structured3D. Nevertheless, there are quite a number of relatively well-generated images, some of which are shown in Figure 10.

Furthermore, our approach also outperforms the image-to-image translation method, which fails to generate any tangible output. This is expected, as existing image translation methods are oriented toward translating between domains with similar visual structure (e.g. semantic maps vs. images, 2D maps vs. satellite imagery); an empty-to-fully-decorated mapping is therefore likely to be too challenging for such methods. Our early studies also show that unsupervised image translation methods such as MUNIT [8] exhibit the same behavior and are unable to produce meaningful images.

B.3 Additional qualitative results

Figures 11 and 12 provide additional examples of generation results on the bedroom and living room subsets respectively.

Figure 9: Comparison between generation results from bounding box and point location formats. Columns: empty scene, labels, boxes, points, ground truth.
Figure 10: Several examples of scene decoration with automatically proposed object locations. Columns: empty scene, generated layout, Pix2PixHD, ours.
Figure 11: More generation results on the test split of the bedroom subset. Columns: empty scene, input layout, generated, ground truth.
Figure 12: More generation results on the test split of the living room subset. Columns: empty scene, input layout, generated, ground truth.