
Attribute-guided image generation from layout

Recent approaches have achieved great success in image generation from structured inputs, e.g., semantic segmentation, scene graphs or layouts. Although these methods allow specification of objects and their locations at the image level, they lack the fidelity and semantic control to specify the visual appearance of these objects at the instance level. To address this limitation, we propose a new image generation method that enables instance-level attribute control. Specifically, the input to our attribute-guided generative model is a tuple that contains: (1) object bounding boxes, (2) object categories and (3) an (optional) set of attributes for each object. The output is a generated image where the requested objects are in the desired locations and have the prescribed attributes. Several losses work collaboratively to encourage accurate, consistent and diverse image generation. Experiments on the Visual Genome dataset demonstrate our model's capacity to control object-level attributes in generated images, and validate the plausibility of a disentangled object-attribute representation in the image generation from layout task. Also, the images generated by our model have higher resolution, object classification accuracy and consistency compared to the previous state of the art.





1 Introduction

Controlled image generation methods have achieved great success in recent years, driven by advances in conditional Generative Adversarial Networks (GANs) [goodfellow2014generative, kim2017learning, miyato2018cgans, odena2017conditional, reed2016generative, zhang2018self, zhang2017stackgan, zhu2017unpaired, zhu2017toward] and disentangled representations [kazemi2019style, zhu2018visual]. The goal of these methods is to generate high-fidelity images from various user-specified guidelines (conditions), such as textual descriptions [hong2018inferring, mansimov2015generating, tan2018text2scene, xu2018attngan, zhang2017stackgan], attributes [dong2017semantic, karacan2016learning, li2019attribute, lu2018attribute, nam2018text, zhou2019text], scene graphs [ashual2019specifying, johnson2018image, li2019pastegan], layout [sun2019image, zhao2019image] and semantic segmentation [chen2017photographic, huang2018multimodal, isola2017image, karacan2016learning, liu2017unsupervised, park2019semantic, wang2018high, zhu2017unpaired, zhu2017toward]. The high-level nature of most of these specifications is desirable from the point of view of ease of use and control, but they are severely impoverished in terms of pixel-level spatial and appearance information, leading to a challenging image generation problem.

In this paper, we specifically focus on image generation from layout, where the input is a coarse spatial layout of the scene (e.g., bounding boxes and corresponding object class labels) and the output is an image consistent with the specified layout. Compared to the text-to-image [reed2016generative, zhang2017stackgan] and scene-graph-to-image [ashual2019specifying, johnson2018image, li2019pastegan] generation paradigms, layout-to-image provides an easy, spatially aware and interactive abstraction for specifying desired content. This makes the paradigm compelling and effective for users across the spectrum of artistic skill, from children and amateurs to designers. Image generation from a layout is a relatively new problem; earlier methods used the layout only as an intermediate representation [hong2018inferring, johnson2018image], not as a core abstraction or specification exposed to the user.

Figure 1: Attribute-guided Image Generation from Layout. Unlike prior layout-based image generation architectures, our model allows for instance-level granular semantic attribute control over individual objects (e.g., specifying that a person should be wearing something black (top) or red (bottom)); it also ensures appearance consistency when bounding boxes in the layout undergo translation.

Layout2Im [zhao2019image] was the first model proposed for image generation from layout, followed by the more recent LostGAN [sun2019image], which improved overall image quality. However, all current image generation from layout frameworks [sun2019image, zhao2019image] are limited in several fundamental ways. First, they lack the ability to semantically control individual object instances. While both Layout2Im and LostGAN model distributions over the appearance of objects through appearance [zhao2019image] or style [sun2019image] latent codes, neither is able to control these variations semantically. One can imagine using encoded sample patches depicting desired objects as an implicit control mechanism (i.e., generating an instance of a tree or sky that resembles an example in a given image patch); however, this is at the very least awkward and time-consuming from the user's perspective. Second, they generally lack consistency: they are not spatially equivariant. Intuitively, shifting the location (bounding box) of an object in the layout specification, while keeping the appearance/style latent code fixed, should result in the object simply shifting by the same relative amount in the output image (a property known as equivariance). However, current models fail to achieve this intuitive consistency. Finally, they are limited to low-resolution output images, typically of size 64×64.

In this paper, we address these challenges by proposing a new framework for attribute-guided image generation from layout, building on, and substantially extending, the backbone of [zhao2019image]. In particular, we show that a series of simple and intuitive architectural changes (incorporating optional attribute information, adopting a global context encoder, and adding an additional image generation path where object locations are shifted) leads to instance-level fine-grained control over the generation of objects, while increasing the image quality and resolution. We call this model attribute-guided layout2im (see Figure 1).

Contributions: Our contributions are threefold: (1) our attribute-guided layout2im architecture allows (but does not require) direct instance-level granular semantic attribute control over individual objects; (2) it is directly optimized to be consistent, or equivariant, with respect to spatial shifts of object bounding boxes in the input layout; and (3) it allows for higher-resolution output images of size up to 128×128 by utilizing a global context encoder and progressive upsampling. Both qualitatively and quantitatively, we show state-of-the-art generative performance on the Visual Genome [krishna2017visual] benchmark dataset, while benefiting from desirable control properties that are unavailable in other models. The code and pretrained models will be made available.

2 Related Work

Image Generation from Scene Graph: A scene graph is a convenient directed graph structure designed to represent a scene, where nodes correspond to objects and edges to the relationships between objects. Recently, scene graphs have been used in many image generation tasks due to their flexibility in explicitly encoding object-object relationships and interactions [ashual2019specifying, johnson2018image, li2019pastegan]. The typical generation process involves two phases: (1) a graph convolutional neural network [henaff2015deep] is applied to the scene graph to predict the scene layout (i.e., bounding boxes and segmentation masks); and (2) the layout is then decoded into an image. Different from methods that generate the image as a whole, Li et al. [li2019pastegan] propose a semi-parametric approach and a crop refining network to reconcile the isolated crops into an integrated image. Unlike in our approach, the scene layout in these models is used as an intermediate semantic representation to bridge the abstract scene graph input and the image output.

Image Generation from Layout: An image layout, comprising bounding box locations, often serves as an intermediate step for image generation (see above). Zhao et al. [zhao2019image] proposed image generation from layout as a task in its own right, where the image is generated from bounding boxes and corresponding object categories. To combine multiple objects, [zhao2019image] sequentially fuse object feature maps using a convolutional LSTM (cLSTM) network; the resulting fused feature map is then decoded to an image. Turkoglu et al. [turkoglu2019layer], on the other hand, divide the generation into multiple stages where objects are added to the canvas one by one. To better bridge the gap between layouts and images, Li et al. [li2019object] use a shape generator to outline the object shape and provide the model with fine-grained information from text using object-driven attention. Similarly, [sun2019image] learns object-level fine-grained mask maps that outline the object shape. In addition, [karras2019style, sun2019image] show that incorporating layout information into the normalization layer yields better results: adopting the instance normalization technique [karras2019style] in the generator realizes multi-object style control [sun2019image], whereas spatially-adaptive normalization [park2019semantic] based on the segmentation mask modulates activations in upsampling layers. Taking inspiration from [karras2019style, park2019semantic, sun2019image], we apply spatially-adaptive normalization in our generator to compute layout-aware affine transformations in normalization layers.

Figure 2: Overview of our Attribute-Guided Layout2Im training pipeline. Our architecture has three generation paths: the image reconstruction path (top/blue), the image generation path (middle/red) and the layout reconfiguration path (bottom/orange). The attribute classifier is used in the reconstruction path to estimate attributes of objects that do not have any attribute annotations. Non-GT (sampled) attributes are used in the image generation path and the layout reconfiguration path to disentangle attribute information from the class and appearance representations.

Semantic Image Synthesis: Semantic image synthesis is an image-to-image translation task. While most methods use conditional adversarial training [goodfellow2014generative, mirza2014conditional], such as pix2pix [isola2017image], pix2pixHD [wang2018high], cVAE-GAN and cLR-GAN [zhu2017toward], others such as Cascaded Refinement Networks [chen2017photographic] also yield plausible results. To preserve semantic information, normalization techniques like SPADE [park2019semantic] have recently been deployed to modulate the activations in normalization layers through a spatially adaptive and learned transformation. Semantic image synthesis can also serve as an intermediate step for image modeling [wang2016generative]. In addition, some image-to-image translation tasks are achieved in an unsupervised manner [huang2018multimodal, liu2017unsupervised, zhu2017unpaired], but most existing semantic image synthesis models require paired data.

Attribute-guided Image Generation: In image generation, various attempts have been made to specify the attributes of generated images and objects. For example, [dong2017semantic, li2019attribute, lu2018attribute, nam2018text] aim to edit the attributes of the generated image using natural language descriptions. In [li2019attribute, zhou2019text], the authors embed a visual attribute vector (e.g., blonde hair) for attribute-guided face generation. Li et al. [li2019pastegan] also incorporate object-level appearance information in the input, but rely on external memory to sample objects. Another line of the literature concentrates on editing images by providing reference images (e.g., [ashual2019specifying]) to transfer style. Different from prior approaches, our model allows direct attribute control over individual instances of objects in complex generative scenes.

3 Approach

Let us start by formally defining the problem. Let $I$ be an image canvas (e.g., of size 128×128) and let $L = \{(c_i, A_i, b_i)\}_{i=1}^{n}$ be a layout consisting of $n$ labeled object instances with class labels $c_i \in \{1, \dots, C\}$, attributes $A_i = \{a_{i,j}\} \subseteq \mathcal{A}$, and bounding boxes $b_i$ defined by top-left and bottom-right coordinates on the canvas, $b_i = (x^{\mathrm{tl}}_i, y^{\mathrm{tl}}_i, x^{\mathrm{br}}_i, y^{\mathrm{br}}_i)$, where $C$ is the total number of object categories and $a_{i,j}$ is the $j$-th attribute of the $i$-th object instance; $\mathcal{A}$ is the attribute set. Note that each object might have more than one attribute. Let $z_i$ be the latent code for object instance $i$, modeling class- and attribute-unconstrained appearance. We denote the set of all object instance latent codes in the layout $L$ by $Z = \{z_i\}_{i=1}^{n}$.

Attribute-guided image generation from a layout can then be formulated as learning a generator function $G$ which maps a given input $(L, Z)$ to an output image $I$ conforming to the specifications:

$$I = G(L, Z; \Theta_G), \tag{1}$$

where $\Theta_G$ parameterizes the generator function $G$ and corresponds to the set of parameters that need to be learned.

What differs from previous layout-to-image generation methods is the explicit but optional inclusion of attributes (the attribute set $A_i$ of an object can be empty or sampled from the prior). Further, we specifically aim to learn a generator $G$ which is, at least to some extent, equivariant with respect to the location of objects in the layout. Our attribute-guided layout2im formulation builds on and improves [zhao2019image]; as such, it shares some of the basic architecture design principles outlined in Zhao et al. [zhao2019image].

Training: The overall training pipeline of the proposed approach is illustrated in Figure 2. Given the input image $I$ and its layout $L = \{(c_i, A_i, b_i)\}_{i=1}^{n}$, our model first creates a word embedding $w_i$ for each object label $c_i$ and a multi-hot attribute embedding $\mathbf{a}_i$ for the object attribute(s) $A_i$ (note that exactly $|A_i|$ elements of $\mathbf{a}_i$ are 1 and the rest are 0), and forms a joint object-attribute embedding $e_i = \mathrm{MLP}(w_i \oplus \mathbf{a}_i)$, where $\oplus$ is vector concatenation and $\mathrm{MLP}$ is a multi-layer perceptron, composed of three fully connected layers, that maps the concatenated vector to a lower-dimensional vector. A set of object latent codes $Z^{p} = \{z^{p}_i\}_{i=1}^{n}$ is sampled from the standard normal prior distribution $\mathcal{N}(0, 1)$, and a new layout $L' = \{(c_i, A_i, b'_i)\}_{i=1}^{n}$ is constructed, where $b'_i$ represents bounding boxes that are randomly shifted in the canvas $I$. The shifts are horizontal to maintain consistency. Then, our model estimates another set of latent codes $Z^{r} = \{z^{r}_i\}_{i=1}^{n}$, where $z^{r}_i$ is sampled from the posterior $Q(z^{r}_i \mid O_i)$ conditioned on CNN features of the object $O_i$ cropped from the input image $I$. We effectively end up with three input sets:

  • ($L$, $Z^{r}$) for reconstruction of the original image. Mapping this input through the generator $G$ should result in an image $\tilde{I}$, which is a reconstruction of the training image $I$ serving as the source of the layout $L$;

  • ($L$, $Z^{p}$) for generation of a new image sharing the original layout. The output of $G$ here should be an image $I'$ that shares the layout with $\tilde{I}$ above, but where the appearance of each object instance is novel (sampled from the prior);

  • ($L'$, $Z^{p}$), which is used to generate an image from the reconfigured layout (i.e., the reconfiguration path; see Appendix A.1 for details). The output should be a corresponding shifted image $I''$, which shares its latent appearance codes with those from Set 2.
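The construction of the three input sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: boxes are integer pixel tuples `(x0, y0, x1, y1)`, every box receives its own uniformly sampled horizontal offset, and the function and key names are hypothetical. The text only says that boxes are shifted randomly and horizontally, so the sampling scheme here is a guess, and the latent codes are stand-in strings.

```python
import random


def shift_boxes_horizontally(boxes, canvas_width, rng):
    """Shift each (x0, y0, x1, y1) box left or right by a random integer offset."""
    shifted = []
    for x0, y0, x1, y1 in boxes:
        # Offsets are bounded so the shifted box stays fully on the canvas.
        dx = rng.randint(-x0, canvas_width - x1)
        shifted.append((x0 + dx, y0, x1 + dx, y1))
    return shifted


def build_input_sets(layout, z_posterior, z_prior, canvas_width, rng):
    """Return the three (layout, latent codes) pairs used by the three paths."""
    boxes = [obj["box"] for obj in layout]
    new_boxes = shift_boxes_horizontally(boxes, canvas_width, rng)
    # Copy each object so the original layout is left untouched.
    shifted_layout = [dict(obj, box=box) for obj, box in zip(layout, new_boxes)]
    return {
        "reconstruction": (layout, z_posterior),
        "generation": (layout, z_prior),
        "reconfiguration": (shifted_layout, z_prior),
    }


layout = [
    {"label": "boat", "attrs": ["white"], "box": (10, 20, 30, 40)},
    {"label": "person", "attrs": [], "box": (40, 8, 52, 48)},
]
sets = build_input_sets(
    layout, ["zr_0", "zr_1"], ["zp_0", "zp_1"], 64, random.Random(0)
)
```

Note that the generation and reconfiguration entries share the same prior-sampled codes, which is what allows the two outputs to be compared for consistency.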

The same pipeline is applied to all three input sets simultaneously: multiple object feature maps $F_i$ are constructed based on the layout and the latent codes, and then fed into the object encoder and the objects fuser sequentially, generating a fused hidden feature map $H$ containing information from all specified objects. Lastly, we incorporate a global context embedding into $H$ to form a context-aware feature canvas $\hat{H}$, and decode it back to an image with a combination of deconvolution, upsampling and SPADE normalization layers [park2019semantic]. For each generated object $O'_i$ in $I'$ and $O''_i$ in $I''$, we make the object estimator regress the sampled latent code $z^{p}_i$ from the generated object, to encourage the object to be consistent with its latent code, and use an auxiliary object classifier and attribute classifier to ensure that the object has the desired category and attributes. To train the model adversarially, we also introduce a pair of discriminators, $D_{\mathrm{img}}$ and $D_{\mathrm{obj}}$, to classify results as being real or fake at the image and object level.

Inference: At inference time, the model is able to synthesize a realistic image from a given (user-specified) layout $L$ and object latent codes $Z$ sampled from the normal prior over object appearance, $\mathcal{N}(0, 1)$. The attributes can be specified by the user or sampled from prior class distributions of object-attribute co-occurrence counts, which we also estimate from data during training. In this way, attributes can be treated as "optional" at the individual instance level; i.e., one can specify a subset of attributes for any subset of instances.

3.1 Attribute Encoder

The Visual Genome [krishna2017visual] dataset provides attribute descriptions for some objects. For example, a car object might have the attributes red and parked, and a person object might have the attributes smile and tall. There are over 40,000 attributes in the dataset. We only keep the 106 most common attributes for simplicity; in other words, $|\mathcal{A}| = 106$. Each object might have more than one attribute, hence we adopt a multi-hot embedding $\mathbf{a}_i$ for the attribute encoder. If no attributes are specified for the object, the attribute embedding is a vector of zeros, i.e., $\mathbf{a}_i = \mathbf{0}$. We concatenate the multi-hot embedding with the object word embedding $w_i$ and pass it through an MLP to obtain the final object-attribute embedding $e_i$, which is then concatenated with the (prior sampled) latent code $z_i$ to construct the object feature map $F_i$. The feature map $F_i$ is therefore constructed by filling the region inside bounding box $b_i$ of the feature canvas with the concatenated vector $[e_i, z_i]$.
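To make the encoding concrete, the following sketch builds a multi-hot attribute vector and fills a bounding-box region of a feature canvas with a per-object vector. It is a toy illustration: the three-word vocabulary, the helper names, and the use of plain nested lists are assumptions, box coordinates are assumed to be given in feature-canvas cells, and the learned MLP is replaced by plain concatenation.

```python
def multi_hot(attributes, vocabulary):
    """Return a 0/1 vector with a 1 at the index of every given attribute."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for name in attributes:
        vector[index[name]] = 1.0
    return vector


def object_feature_map(feature, box, height, width):
    """Place `feature` in every cell inside `box` (x0, y0, x1, y1); zeros elsewhere.

    The result is a height x width x len(feature) nested list.
    """
    x0, y0, x1, y1 = box
    zeros = [0.0] * len(feature)
    return [
        [
            list(feature) if (x0 <= x < x1 and y0 <= y < y1) else list(zeros)
            for x in range(width)
        ]
        for y in range(height)
    ]


vocabulary = ["red", "parked", "tall"]           # toy stand-in for the 106 attributes
attr_vec = multi_hot(["red", "parked"], vocabulary)
word_vec = [0.5, -0.2]                           # hypothetical word embedding
embedding = word_vec + attr_vec                  # stand-in for the MLP output
latent = [0.1]                                   # hypothetical latent code
fmap = object_feature_map(embedding + latent, (1, 0, 3, 2), 4, 4)
```

An object with no annotated attributes simply passes an empty list to `multi_hot` and receives the all-zero vector described above.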

Attribute Disentanglement: For the two novel image generation paths, namely ($L$, $Z^{p}$) and ($L'$, $Z^{p}$), we further encourage the model to use the attribute embedding to determine the appearance of the corresponding objects. To explicitly disentangle attribute information from the class embedding $w_i$ and the latent code $z_i$ during training, we randomly choose a subset of training objects and replace their GT attributes with new attributes sampled from the attribute frequency distribution for the object class. By doing this, we force the generator to use the attribute code to generate objects with the corresponding attributes, instead of encoding attribute information into $w_i$ and/or $z_i$.
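The replacement step can be illustrated with a short sketch. The co-occurrence table, the replacement probability, and the rule for how many attributes to draw are all assumptions made for this example; the text specifies only that a random subset of objects has its GT attributes replaced by samples from the class-wise attribute frequency distribution.

```python
import random


def sample_attributes(label, counts, k, rng):
    """Draw up to k distinct attributes for `label`, weighted by co-occurrence count."""
    pool = dict(counts[label])
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


def replace_attributes(objects, counts, replace_prob, rng):
    """Replace the attributes of a random subset of objects with sampled ones."""
    result = []
    for obj in objects:
        if obj["label"] in counts and rng.random() < replace_prob:
            k = max(1, len(obj["attrs"]))  # assumption: keep the attribute count
            new_attrs = sample_attributes(obj["label"], counts, k, rng)
            obj = dict(obj, attrs=new_attrs)
        result.append(obj)
    return result


# Hypothetical object-attribute co-occurrence counts.
counts = {"car": {"red": 5, "parked": 3}, "person": {"tall": 1}}
objects = [{"label": "car", "attrs": ["blue"]}, {"label": "tree", "attrs": []}]
augmented = replace_attributes(objects, counts, 1.0, random.Random(1))
```

The same `sample_attributes` helper could serve at inference time when the user leaves the attributes of an object unspecified.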

Figure 3: Global Context Encoder. The aggregated feature map is fed into a set of conv layers; then it is down-sampled, spatially expanded and concatenated to itself to form context-aware feature map that can then be decoded into an image.

3.2 Global Context Encoder

At the last stage of the generation process, the fused feature map $H$ is decoded back to the output image by a stack of deconvolution layers. However, the image obtained from simply decoding $H$ often contains objects that are not coherent within the scene. For example, some generated objects and the background appear incoherent and exhibit sharp transitions (an image-patch pasting effect). Hence, it is desirable to explicitly incorporate global context information in each receptive field of the feature map $H$, so that, locally, object generation is more informed. Since $H$ contains the information for all objects, it is itself a natural choice for encoding the global context (Figure 3). To encode $H$, we feed the 8×8 feature map through two convolution layers to downsample it to a 2×2 feature map, which is average pooled to a global context vector. We then expand the vector to the spatial size of the fused feature map $H$ and concatenate the two. This concatenation, $\hat{H}$, is the final feature map used to decode the image.
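The pool, expand and concatenate pattern can be shown with a simplified sketch. Note the simplification: the two learned convolution layers are omitted, and the context vector is taken as the plain spatial mean of the input map, so this illustrates the data flow only, not the trained encoder. The function name is hypothetical.

```python
def add_global_context(fmap):
    """Append a global mean vector to every cell of an H x W x D nested list.

    Returns an H x W x 2D nested list.
    """
    height, width, depth = len(fmap), len(fmap[0]), len(fmap[0][0])
    cells = height * width
    # Global average pooling over all spatial positions, one value per channel.
    context = [
        sum(fmap[y][x][d] for y in range(height) for x in range(width)) / cells
        for d in range(depth)
    ]
    # Broadcast the context vector to every position and concatenate it.
    return [[fmap[y][x] + context for x in range(width)] for y in range(height)]


fused = [[[1.0], [3.0]], [[5.0], [7.0]]]  # toy 2 x 2 map with one channel
context_aware = add_global_context(fused)
```

Every spatial position now carries both its local feature and a summary of the whole scene, which is the property the paragraph above relies on.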

3.3 SPADE normalization

Spatially-adaptive (SPADE) normalization [park2019semantic] is an improved normalization technique that prevents semantic information from being washed away by conventional normalization layers. In SPADE normalization, the transformation (i.e., scale and shift) is learned directly from the semantic layout. In our model, the layout-derived feature map resembles a semantic layout because it encodes both spatial and semantic information. Hence, we add SPADE normalization layers between the deconvolution layers in our image decoder, using this feature map as the semantic layout. As we show later in the ablation study (Table 3), the object accuracy of generated results improves when we adopt SPADE.
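The core of a spatially-adaptive normalization layer is small enough to sketch directly. The version below handles a single channel of a single sample and takes the per-pixel scale and shift maps as inputs; in SPADE these maps are produced by convolutions over the layout, and statistics are computed over a batch, both of which are left out here. The function name and the choice of `eps` are assumptions.

```python
import math


def spade_normalize(x, gamma, beta, eps=1e-5):
    """Normalize an H x W activation map, then apply per-pixel scale and shift.

    x, gamma and beta are H x W nested lists of floats.
    """
    values = [v for row in x for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)
    return [
        [gamma[i][j] * (x[i][j] - mean) / std + beta[i][j] for j in range(len(x[0]))]
        for i in range(len(x))
    ]


activations = [[1.0, 3.0], [5.0, 7.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
plain = spade_normalize(activations, ones, zeros)        # ordinary normalization
layout_shift = [[1.0, 2.0], [3.0, 4.0]]
shift_only = spade_normalize(activations, zeros, layout_shift)
```

Because the scale and shift vary per pixel, spatial information from the layout survives the normalization step instead of being averaged out.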

3.4 Loss Function

The structure of our discriminator follows the discriminator proposed in layout2im [zhao2019image] (see Appendix A.2 for details), but adds an attribute classifier, which predicts the attributes of cropped objects and is used to train the generator to synthesize objects with the correct attributes. It is trained on real objects and their attributes.

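Our GAN model utilizes two adversarial losses: the Image Adversarial Loss $\mathcal{L}^{\mathrm{img}}_{\mathrm{adv}}$ and the Object Adversarial Loss $\mathcal{L}^{\mathrm{obj}}_{\mathrm{adv}}$. Five more losses, including the KL Loss $\mathcal{L}_{\mathrm{KL}}$, the Image Reconstruction Loss $\mathcal{L}^{\mathrm{img}}_{1}$, the Object Latent Code Reconstruction Loss $\mathcal{L}^{\mathrm{latent}}_{1}$, the Auxiliary Object Classification Loss $\mathcal{L}^{\mathrm{obj}}_{\mathrm{AC}}$ and the Auxiliary Attribute Classification Loss $\mathcal{L}^{\mathrm{attr}}_{\mathrm{AC}}$, are added to facilitate the generation of realistic images. Due to lack of space, we provide details of the loss terms in Appendix A.3. As a result, the generator minimizes:

$$\mathcal{L}_{G} = \lambda_1 \mathcal{L}^{\mathrm{img}}_{\mathrm{adv}} + \lambda_2 \mathcal{L}^{\mathrm{obj}}_{\mathrm{adv}} + \lambda_3 \mathcal{L}_{\mathrm{KL}} + \lambda_4 \mathcal{L}^{\mathrm{img}}_{1} + \lambda_5 \mathcal{L}^{\mathrm{latent}}_{1} + \lambda_6 \mathcal{L}^{\mathrm{obj}}_{\mathrm{AC}} + \lambda_7 \mathcal{L}^{\mathrm{attr}}_{\mathrm{AC}}, \tag{2}$$

and the discriminator minimizes:

$$\mathcal{L}_{D} = -\lambda_1 \mathcal{L}^{\mathrm{img}}_{\mathrm{adv}} - \lambda_2 \mathcal{L}^{\mathrm{obj}}_{\mathrm{adv}} + \lambda_6 \mathcal{L}^{\mathrm{obj}}_{\mathrm{AC}} + \lambda_7 \mathcal{L}^{\mathrm{attr}}_{\mathrm{AC}}, \tag{3}$$

where $\lambda_1, \dots, \lambda_7$ are weights for the different loss terms.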

4 Experiments

Datasets: We evaluate our proposed model on the Visual Genome [krishna2017visual] dataset. We preprocess and split the dataset following the settings of [johnson2018image, sun2019image, zhao2019image]. In total, we have 62,565 training, 5,506 validation and 5,088 testing images. Each image contains 3 to 30 objects from 178 categories, and each object has 0 to 5 attributes from a set of 106 attributes. We are unable to train on COCO [lin2014microsoft] because it does not provide attribute annotations.

| Method (64×64) | FID ↓ | Inception ↑ | Obj Acc ↑ | Diversity ↑ | Attribute Recall ↑ | Attribute Precision ↑ | Consistency bg ↑ | Consistency fg ↑ |
|---|---|---|---|---|---|---|---|---|
| Real Images | - | 13.9 ± 0.5 | 49.13 | - | 0.30 | 0.31 | - | - |
| sg2im [johnson2018image] | 74.61 | 6.3 ± 0.2 | 40.29 | 0.15 ± 0.12 | 0.07 | 0.15 | 0.87 | 0.84 |
| Itr. SG [ashual2019specifying] | 65.3 | 5.6 ± 0.5 | 28.23 | 0.18 ± 0.12 | 0.04 | 0.09 | 0.82 | 0.81 |
| layout2im [zhao2019image] | 40.07 | 8.1 ± 0.1 | 48.09 | 0.17 ± 0.09 | 0.09 | 0.22 | 0.87 | 0.85 |
| LostGAN [sun2019image] | 34.75 | 8.7 ± 0.4 | 27.50 | 0.34 ± 0.10 | 0.17 | 0.06 | 0.63 | 0.61 |
| Ours | 33.09 | 8.1 ± 0.2 | 48.82 | 0.10 ± 0.02 (GT attributes); 0.20 ± 0.01 (sampled attributes) | 0.26 | 0.30 | 0.90 | 0.89 |

Table 1: Performance of 64×64 image generation on the Visual Genome [krishna2017visual] dataset. For the Diversity score of our model, we report two versions of attribute use: GT attribute specification (first value), and attributes sampled from prior class distributions of object-attribute co-occurrence counts (second value). For the Attribute Score, we predict the attributes of generated objects and calculate recall and precision against the GT attributes. We trained Interactive SG without the GT object mask. ↑: higher is better; ↓: lower is better; bg: background, fg: foreground.
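For readers who want to reproduce the Attribute Score columns, the computation can be sketched as multi-label precision and recall. The sketch pools true positives, false positives and false negatives over all objects (micro-averaging); whether the reported numbers are micro- or macro-averaged is not stated, so this choice is an assumption, as are the function name and toy data.

```python
def attribute_precision_recall(predicted, ground_truth):
    """Micro-averaged precision and recall over per-object attribute sets."""
    true_pos = false_pos = false_neg = 0
    for pred, truth in zip(predicted, ground_truth):
        pred, truth = set(pred), set(truth)
        true_pos += len(pred & truth)
        false_pos += len(pred - truth)
        false_neg += len(truth - pred)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy example: attributes predicted for two generated objects vs. their GT.
predicted = [{"red", "parked", "old"}, {"tall"}]
ground_truth = [{"red"}, {"tall", "smiling"}]
precision, recall = attribute_precision_recall(predicted, ground_truth)
```

A model that predicts many attributes per object tends to score high recall but low precision, which is the pattern the text attributes to LostGAN at 128×128.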

Experimental setup: Our experiments use the pre-trained VGG-net [simonyan2014very] as the base model to compute the inception score for generated images. For the object classification loss and the attribute classification loss, we adopt the ResNet-50 model [he2016deep] and replace its last layer with one of the corresponding dimension. Both the object and the attribute classifier are trained using objects cropped from real training images. For attribute accuracy, we estimate the attributes of generated objects using a separately trained attribute classifier, which consists of five residual blocks, and compute the recall and precision against the GT attribute annotations. Lastly, we generate two sets of test images and use the LPIPS metric [zhang2018unreasonable] to compute the diversity score. More specifically, we take the activations of conv1 to conv5 from AlexNet [krizhevsky2012imagenet], normalize them in the channel dimension and take the L2 distance. We then average across the spatial dimensions and across all layers to get the LPIPS metric. An LPIPS-based metric is also used for the consistency score, where we compare each object before and after it is shifted (foreground), and compare the rest of the image, which did not undergo the shift (background). Higher consistency for both is better.
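The distance computation described above can be sketched without running a network, by taking the layer activations as given. The sketch follows the steps in the text (channel-wise unit normalization, L2 distance, averaging over space and layers) and uses the squared L2 distance; it omits the learned per-channel weights of the published LPIPS metric, and the nested-list activations and function names are assumptions for illustration.

```python
import math


def unit_normalize(vector, eps=1e-10):
    """Scale a channel vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


def layer_distance(act_a, act_b):
    """Spatial mean of the squared L2 distance between normalized channel vectors.

    act_a and act_b are H x W x C nested lists from the same network layer.
    """
    total, count = 0.0, 0
    for row_a, row_b in zip(act_a, act_b):
        for vec_a, vec_b in zip(row_a, row_b):
            norm_a, norm_b = unit_normalize(vec_a), unit_normalize(vec_b)
            total += sum((p - q) ** 2 for p, q in zip(norm_a, norm_b))
            count += 1
    return total / count


def perceptual_distance(layers_a, layers_b):
    """Average the per-layer distances over all layers."""
    distances = [layer_distance(a, b) for a, b in zip(layers_a, layers_b)]
    return sum(distances) / len(distances)


# Toy activations: one layer, a 1 x 2 spatial grid, two channels.
image_a = [[[[1.0, 0.0], [2.0, 0.0]]]]
image_b = [[[[0.0, 1.0], [1.0, 0.0]]]]
distance = perceptual_distance(image_a, image_b)
```

In the toy example the first position holds orthogonal channel vectors (squared distance 2) and the second holds parallel ones (distance 0), so the spatial mean is close to 1.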

Baselines: We compare our model with Sg2Im [johnson2018image], Interactive Scene Graph [ashual2019specifying], Layout2im [zhao2019image] and LostGAN [sun2019image].

4.1 Quantitative Results

Tables 1 and 2 show the image generation results of the different models. For 64×64 images, our attribute-guided image generation from layout outperforms all other models in terms of object accuracy, attribute score and consistency score. Notably, our attribute classification scores (recall and precision) outperform those of the other models by a substantial margin, demonstrating our model's capability to control the attributes of generated objects. For consistency in layout reconfiguration, our consistency score is the highest for both background and foreground in the generated images, reflecting the effectiveness of the layout reconfiguration path. Note that, as expected, specifying attributes limits the diversity of output objects (0.10 ± 0.02). However, sampling attributes from prior class distributions of object-attribute co-occurrence counts leads to much higher diversity of generated images (0.20 ± 0.01).

We also conduct experiments at 128×128 resolution and compare with LostGAN [sun2019image]. Our model obtains better results on object accuracy, attribute score and consistency score.

4.2 Qualitative Results

| Method (128×128) | FID ↓ | Inception ↑ | Obj Acc ↑ | Diversity ↑ | Attribute Recall ↑ | Attribute Precision ↑ | Consistency bg ↑ | Consistency fg ↑ |
|---|---|---|---|---|---|---|---|---|
| Real Images | - | 26.15 ± 0.23 | 41.92 | - | 0.27 | 0.27 | - | - |
| LostGAN [sun2019image] | 29.36 | 11.1 ± 0.6 | 25.89 | 0.43 ± 0.09 | 0.19 | 0.04 | 0.54 | 0.51 |
| Ours | 39.12 | 8.5 ± 0.1 | 31.02 | 0.15 ± 0.09 | 0.10 | 0.25 | 0.84 | 0.80 |

Table 2: Performance of 128×128 image generation on the Visual Genome [krishna2017visual] dataset. We note that images generated by LostGAN [sun2019image] contain too many attribute signals, which explains its high recall and low precision. ↑: higher is better; ↓: lower is better.

| Method (64×64) | Inception ↑ | Accu. ↑ | Diversity ↑ | Attribute Recall ↑ | Attribute Precision ↑ | Consistency bg ↑ | Consistency fg ↑ |
|---|---|---|---|---|---|---|---|
| w/o attribute specification | 7.9 ± 0.05 | 48.01 | 0.19 ± 0.08 | 0.08 | 0.13 | 0.88 | 0.87 |
| w/o location change | 7.8 ± 0.1 | 48.96 | 0.12 ± 0.05 | 0.25 | 0.30 | 0.86 | 0.84 |
| w/o SPADE [park2019semantic] | 7.9 ± 0.1 | 45.05 | 0.15 ± 0.07 | 0.23 | 0.29 | 0.89 | 0.88 |
| w/o context encoder | 7.7 ± 0.1 | 47.96 | 0.13 ± 0.15 | 0.24 | 0.30 | 0.89 | 0.87 |
| full model | 8.0 ± 0.2 | 48.82 | 0.10 ± 0.02 | 0.26 | 0.30 | 0.90 | 0.89 |

Table 3: Ablation study of our model on the Visual Genome [krishna2017visual] dataset, obtained by removing different components. Inception is the inception score, Accu. is the object classification accuracy, and Diversity is the diversity score. ↑: higher is better; ↓: lower is better.

Figure 4 demonstrates our model's ability to control the attributes of generated objects. For each image, we pick an object and replace its current attribute with a different one, while keeping the rest of the layout unchanged. It can be seen from Figure 4 that our model is able to change the attributes of the object of interest and keep the rest of the image unmodified.

Figure 5 compares the results before and after some object bounding boxes in the canvas are horizontally shifted. For each image pair, the image on the left is generated from the GT layout, and the image on the right from the reconfigured layout. Our model shows better layout reconfigurability than other methods. For example, in Figure 5(b') the boat is moved, in (d') two humans are moved, and in (e') the horse is moved. In contrast, layout2im [zhao2019image] and LostGAN [sun2019image] either change the theme of the image (see 5(f')) or have missing objects (see 5(d')) for reconfigured layouts. This is also reflected in their much lower consistency scores.

Additional examples at 128×128 resolution are given in the Appendix, Figures 6 and 7. Appendix Figure 9 shows generated images obtained using different SoTA models. Our method consistently outperforms the baselines in quality and consistency of generated images. LostGAN [sun2019image] fails to generate plausible human faces, and layout2im [zhao2019image] does not generate realistic objects such as food.

4.3 Ablation Study

We demonstrate the necessity of our key components by comparing the scores of several ablated models trained on the Visual Genome [krishna2017visual] dataset. As shown in Table 3, removing any component is detrimental to the model's performance. Not surprisingly, attribute specification is the key to successful attribute classification. The absence of the layout reconfiguration path decreases the inception score by 0.2, slightly increases the classification accuracy and, more importantly, reduces the consistency for reconfigured layouts. SPADE [park2019semantic] is beneficial for object classification accuracy, and the global context encoder improves the inception score by 0.3.

Figure 4: Examples of 64×64 generated images with modified attributes on the Visual Genome [krishna2017visual] dataset obtained by our proposed method.
Figure 5: Examples of 64×64 generated images with horizontally shifted bounding boxes on the Visual Genome [krishna2017visual] dataset.

5 Conclusions

This paper proposes attribute-guided image generation from layout, an effective approach to control the visual appearance of generated images at the instance level. We also showed that the model ensures visual consistency of generated images when bounding boxes in the layout undergo translation. Qualitative and quantitative results on the Visual Genome [krishna2017visual] dataset demonstrated our model's ability to synthesize images with instance-level attribute control and an improved level of visual consistency.


This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC CRC and the NSERC DG and Discovery Accelerator Grants. Hardware support was provided by JELF CFI grant and Compute Canada under the RAC award.


Appendix A Approach Details

A.1 Layout Reconfiguration

In addition to the image reconstruction path and the image generation path described in the main paper, a layout reconfiguration path is introduced in our model to increase the spatial equivariance of the generator. Here we describe the layout reconfiguration path in more detail. As in the image generation path, an object latent code $z^{p}_i$ is sampled from the normal prior distribution $\mathcal{N}(0, 1)$ and is concatenated to the object-attribute embedding $e_i$. When composing the feature map $F_i$, however, the input bounding boxes are randomly shifted. We limit ourselves to horizontal shifts in order to preserve the coherence of the scene and not introduce perspective inconsistencies. Hence, each $F_i$ is composed based on a new bounding box $b'_i$. Then, the set of feature maps is downsampled and passed to a cLSTM network to form the fused map $H$, which is then decoded back to an image $I''$. The same image discriminator is applied to the generated image $I''$, and the object discriminator, the object classifier and the attribute classifier are applied to each generated object $O''_i$ cropped based on the shifted bounding boxes $b'_i$.

A.2 Discriminator

The structure of the discriminator in our model follows the discriminator proposed in layout2im [zhao2019image], but adds an additional attribute classifier:

  • The image discriminator $D_{\mathrm{img}}$ classifies the input image $I$ as real and the generated images $\tilde{I}$, $I'$ and $I''$ as fake.

  • The object discriminator $D_{\mathrm{obj}}$ classifies the objects $O$ cropped from $I$ as real, and the objects $\tilde{O}$, $O'$ and $O''$ cropped from $\tilde{I}$, $I'$ and $I''$, respectively, as fake.

  • The auxiliary object classifier predicts the category of cropped objects and is used to train the generator to synthesize correct objects. It is trained on real objects $O$ and their labels.

  • The auxiliary attribute classifier predicts the attributes of cropped objects and is used to train the generator to synthesize objects with correct attributes. It is trained on real objects $O$ and their attributes.

A.3 Loss Function

Our model follows the Generative Adversarial Networks (GAN) framework [4]. Namely, one image generator and two discriminators are jointly trained in a minimax game.
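For reference, the standard GAN minimax objective of [4] can be written as below. The symbols $G$, $D$, $x$, $z$, $p_{\text{data}}$ and $p_z$ are the conventional ones and are an assumption here; this is presumably the form that the text below refers to as Eq. (4).

```latex
\min_{G}\max_{D}\;
\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```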


Here, the real image is sampled from the data distribution, and the latent codes are the input from which the generator produces a fake image. Since we have two separate discriminators, one for images and one for objects, there are two adversarial losses:

  • Image Adversarial Loss. In each training iteration, our generator produces three images: a generated image, a reconstructed image and a shifted image. The image adversarial loss is therefore defined as in Eq. (4) for each of the three types of generated images and averaged over them; the generator minimizes this loss and the discriminator maximizes it.

  • Object Adversarial Loss. We crop and resize the objects from the generated, reconstructed and shifted images. Treating the cropped objects as images, the object adversarial loss is also defined as in Eq. (4).
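The averaging over the three generated images can be sketched as follows, assuming the standard log-likelihood form of the GAN objective and discriminator outputs that are probabilities in (0, 1). The function names and the `eps` stabilizer are hypothetical and not taken from the paper.

```python
import math

def adversarial_value(d_real, d_fake, eps=1e-12):
    """Standard GAN value log D(x) + log(1 - D(G(z))) for one real/fake pair."""
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)

def image_adversarial_loss(d_real, d_fakes):
    """Average the GAN value over the fake images scored by one discriminator.

    d_fakes is assumed to hold the discriminator's scores for the generated,
    reconstructed and shifted images, in any order.
    """
    return sum(adversarial_value(d_real, d) for d in d_fakes) / len(d_fakes)
```

The same function applies unchanged to the object adversarial loss if the scores come from the object discriminator on cropped objects.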


In addition, we have another five losses to facilitate the generation of realistic images:

  • KL Loss. This loss encourages the posterior distribution of each object’s latent code to be close to the prior, for all of the objects in the given image/layout.

  • Image Reconstruction Loss. This loss is the L1 difference between the ground-truth image and the reconstructed image produced by the generator.

  • Object Latent Code Reconstruction Loss. This loss penalizes the difference between the randomly sampled latent codes and the latent codes re-estimated from the corresponding generated objects.

  • Auxiliary Object Classification Loss. This loss is defined as the cross entropy loss from the object classifier. Cropped objects with labels from real images are used to train the object classifier, and the generator is then trained to generate realistic objects (in the generated, reconstructed and shifted images) that minimize this loss.

  • Auxiliary Attribute Classification Loss. This loss is defined as the weighted binary cross entropy loss from the attribute classifier. Similarly, real objects are used to train the classifier, and the generator is trained to generate objects (in the generated, reconstructed and shifted images) with correct attribute labels that minimize this loss.
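Three of the losses listed above have simple closed forms, sketched here in plain Python. The sketch assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance with a standard normal prior, and a single `pos_weight` factor for the weighted binary cross entropy; these parameterizations and all names are assumptions for illustration, since the paper's exact weighting scheme is not given here.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def l1_loss(a, b):
    """Mean absolute difference between two equally sized flat sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def weighted_bce(probs, targets, pos_weight=1.0, eps=1e-12):
    """Binary cross entropy over attributes; positive targets are scaled by
    pos_weight (a hypothetical weighting, not the paper's)."""
    total = 0.0
    for p, t in zip(probs, targets):
        total -= pos_weight * t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
    return total / len(probs)
```

`l1_loss` covers both the image reconstruction loss (on flattened pixels) and the latent code reconstruction loss (on latent vectors); a positive weight above 1 in `weighted_bce` is one common way to compensate for attributes that are rarely present.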

Therefore, the generator and the discriminator each minimize a weighted combination of the losses above, with a separate weight for each loss term.
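One plausible way to write the two objectives, following the structure used in layout2im [40], is sketched below. The loss names, the assignment of the seven weights $\lambda_1, \dots, \lambda_7$ to the terms, and the placement of the classification terms on the discriminator side are assumptions for illustration rather than the paper's exact equations. The signs follow the text: the generator minimizes the adversarial losses and the discriminators maximize them.

```latex
\begin{aligned}
\mathcal{L}_{G} ={}& \lambda_1 \mathcal{L}_{\text{adv}}^{\text{img}}
  + \lambda_2 \mathcal{L}_{\text{adv}}^{\text{obj}}
  + \lambda_3 \mathcal{L}_{\text{KL}}
  + \lambda_4 \mathcal{L}_{1}^{\text{img}}
  + \lambda_5 \mathcal{L}_{1}^{\text{latent}}
  + \lambda_6 \mathcal{L}_{\text{AC}}^{\text{obj}}
  + \lambda_7 \mathcal{L}_{\text{AC}}^{\text{attr}}, \\
\mathcal{L}_{D} ={}& -\lambda_1 \mathcal{L}_{\text{adv}}^{\text{img}}
  - \lambda_2 \mathcal{L}_{\text{adv}}^{\text{obj}}
  + \lambda_6 \mathcal{L}_{\text{AC}}^{\text{obj}}
  + \lambda_7 \mathcal{L}_{\text{AC}}^{\text{attr}}.
\end{aligned}
```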

Implementation Details: We set the image canvas size to 64×64 (128×128) and the object size to 32×32 (64×64). The loss weights are , , , , , , . The dimensions of the category embedding and the latent code are both 64. The model is trained using Adam with a learning rate of 0.0001 and a batch size of for 300,000 iterations on 2 GeForce GTX 1080 Ti GPUs. In each training iteration, we first train the object classifier, the attribute classifier and the two discriminators, and then the generator.

Appendix B Results

Due to limited space in the main paper, we provide additional evaluations here.

B.1 Spatial Equivariance Experiments

Figures 6 and 7 demonstrate the ability of our model to generate high-quality images (at 128×128 resolution) and to maintain the consistency of objects when the boxes are shifted. We draw the reader’s attention to the fourth row from the top in Figure 7. Note how our model generates images in which the tree shifts from left to right following the change in the layout (cyan), while the structure and appearance of the boat remain largely unchanged. In contrast, LostGAN [31], when presented with the same shifted layout, generates an image that is entirely incoherent with the original: the boat is no longer discernible, the sky changes color, etc. Similar behavior can be observed in the last row, where our model generates a new version of the image with shifted placement of the elephants while maintaining the tree line and overcast sky. The images produced by LostGAN [31] are highly incoherent, with visible changes in both foreground and background objects as well as in their appearance (despite fixed appearance latent vectors). The same behavior is readily observed in Figure 6. For example, consider the new shifted placement of the person in the third row from the top, or the almost mirrored image produced by shifting the trees and the house from right to left and vice versa in the fifth row. LostGAN [31], while it generates plausible images, consistently fails to maintain the style, appearance, structure and placement of objects when the layout is modified simply to spatially rearrange the same objects.

Figure 8 shows a similar ability to maintain consistency under spatial shifts of objects in the layout at the lower 64×64 resolution. Note that the results of LostGAN are less blurry because, unlike all other methods in Figure 8, they are computed at 128×128 resolution (but displayed at 64×64); the authors of LostGAN do not provide a trained 64×64 model. As such, the comparison to LostGAN isn’t exactly fair and is less favorable to us. Despite this, our model is able to generate high-quality images that are consistent under spatial shifts in layout (see last row).

Figure 6: Examples of 128×128 generated images with horizontally shifted bounding boxes on the Visual Genome dataset obtained by our proposed method.
Figure 7: Examples of 128×128 generated images with horizontally shifted bounding boxes on the Visual Genome dataset obtained by our proposed method.
Figure 8: Examples of 64×64 generated images with horizontally shifted bounding boxes on the Visual Genome dataset obtained by our proposed method.

B.2 Qualitative Generation Experiments

Figure 9 showcases our model’s ability to generate plausible images for a wide variety of layout configurations (e.g., human, food, animal, furniture, house). Notably, the results of sg2im [10] and layout2im [40] are blurry and of lower quality. LostGAN [31] does not perform well on human faces. As in Figure 8, the results of LostGAN in Figure 9 are less blurry because, unlike all other methods, they are at 128×128 resolution; LostGAN does not provide a trained 64×64 model, so we use a higher-resolution model for visualization instead.

Figure 9: Examples of 64×64 generated images on the Visual Genome dataset obtained by our proposed method.

B.3 Attribute Modification Experiments

Figure 10 illustrates additional examples of our model’s ability to modify attributes of various objects. The change of attributes does not affect the layout or other objects in the image.

Figure 10: Examples of 64×64 generated images with modified attributes on the Visual Genome dataset obtained by our proposed method.