Exploiting Relationship for Complex-scene Image Generation

04/01/2021 · Tianyu Hua, et al. · Microsoft, Ryerson University

The significant progress on Generative Adversarial Networks (GANs) has facilitated realistic single-object image generation based on language input. However, complex-scene generation (with various interactions among multiple objects) still suffers from messy layouts and object distortions, due to diverse configurations in layouts and appearances. Prior methods are mostly object-driven and ignore the inter-relations that play a significant role in complex-scene images. This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph. With the help of relationships, we propose three major updates to the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects. Compared to standard location regression, we show that relative scales and distances serve as a more reliable target. Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships. Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph. Our method tends to synthesize plausible layouts and objects, respecting the interplay of multiple objects in an image. Experimental results on the Visual Genome and HICO-DET datasets show that our proposed method significantly outperforms prior art in terms of IS and FID metrics. Based on our user study and visual inspection, our method is more effective in generating logical layout and appearance for complex scenes.




In the past few years, text-to-image generation has drawn extensive research attention for its potential applications in art generation, computer-aided design, image manipulation, etc. However, such success has been restricted to simple images containing a single object from a narrow domain, such as flowers, birds, and faces Reed et al. (2016); Bao et al. (2017). Complex-scene generation, on the other hand, targets synthesizing realistic scene images from complex sentences depicting multiple objects as well as their interactions. Nevertheless, generating complex scenes on demand is still far from mature based on recent studies Johnson et al. (2018); Xu et al. (2018); Li et al. (2019); Hinz et al. (2019).

Figure 1: Relationship matters for complex-scene image generation. The same object pair (e.g., man and board) could show different object shapes, scene layouts and appearances under different relationships.

Scene graph, a structured language representation, captures the objects and their relationships described in a sentence Xu et al. (2017). Such a representation has proven effective for image-text cross-modal tasks, such as structural image retrieval Johnson et al. (2015); Schuster et al. (2015); Johnson et al. (2018), image captioning Yang et al. (2019); Li and Jiang (2019); Li et al. (2018), and visual question answering Teney et al. (2017); Norcliffe-Brown et al. (2018). In this work, we focus on complex-scene image generation from scene graphs. Although extensive work has been done on scene graph generation from images Xu et al. (2017); Zellers et al. (2018); Li et al. (2017); Zhang et al. (2017b) (i.e., image to scene graph), reversely generating a complex-scene image from a scene graph remains challenging, due to the one-to-many nature of mapping a given scene graph to multiple reasonable images with different scene layouts.

A general pipeline for scene graph based image generation usually consists of two stages Johnson et al. (2018). The first stage learns to synthesize a rough layout prediction from the scene-graph input: the object features are typically encoded with a graph module Johnson et al. (2018); Ashual and Wolf (2019), followed by a direct regression of bounding-box locations. At the second stage, a position-aware feature tensor, which combines the object features and the layout generated in the first stage, is fed into an image decoder to generate the final image. To enhance object appearances in generated images, Ashual and Wolf (2019) separate the appearance embedding from the layout embedding.

However, previous works on complex-scene generation suffer heavily from two fundamental problems: messy layouts and object distortion. 1) Messy layouts. Image generation models are expected to figure out a reasonable layout from the scene-graph input. However, there exists an infinite number of reasonable layouts for a given scene graph. Directly fitting one specific layout introduces huge confusion and limits generalization ability. As a result, existing methods still struggle with messy layouts in practice. 2) Distortion in object appearance. Due to the high diversity in object categories, layouts, and relationship dependencies, objects are often distorted during generation. For each object, the texture and local appearance should be inferred respecting both its category and its allocated spatial arrangement. Moreover, the complex and varied relations among objects in the scene graph further increase the diversity of shapes and appearances. As shown in Fig. 1, even for the same object pair, different relationships can lead to totally different scene layouts and appearances.

Some works Ashual and Wolf (2019) simplify the task by considering only a few simple spatial relationships among objects (such as "left", "right" or "above") while ignoring more complex relationships (such as verbs). Meanwhile, to reduce complexity, other works consider only one specific stage of the task, such as layout generation from scene graphs Jyothi et al. (2019) or image generation from layouts Zhao et al. (2018); Sun and Wu (2019). None of these works takes into account the semantics and complex relationships among objects, which limits their application prospects.

In this work, we explore relationships to mitigate the above issues, taking both simple spatial relationships and complex semantic relationships into consideration. We observe that, across different realistic images, the relative scale and distance ratios between two interrelated objects from the same "subject-relation-object" triplet usually conform to a normal distribution with low variance, as shown in Fig. 2. Even though humans take various poses and a skateboard can be oriented in different directions, the scale ratio between the two bounding boxes naturally clusters with very low variance. Thus, we first introduce relative scale ratios and distances for measuring the rationality of layouts generated from the scene graph: all the various reasonable layouts for one specific scene graph can be measured under a common standard and yield very similar results. We then propose a Pair-wise Spatial Constraint Module for assisting layout generation, driven jointly by object pairs and their corresponding relation. The objective of this module is to correct the layout by fitting the relative scale ratio and relative distance ratio between each interrelated object pair, besides the absolute position coordinates of each object. In this way, the spatial commonsense of a complex scene with multiple objects can be modeled.

Figure 2: Distributions of relative scale and distance for “man riding skateboard” and “man sitting on bench”.

Moreover, to strengthen the influence of relations on object appearance generation, we propose a Relation-guided Appearance Generator and a novel Scene Graph Discriminator for guiding image generation. Unlike a traditional discriminator that only judges whether an image is fake, our discriminator has two main functions: it determines whether the objects in the generated image are relevant to the objects described in the textual scene graph, and it discriminates whether the relations predicted among objects in the generated image correspond to the relationships described in the input scene graph. By feeding back these strong discriminant signals, our Scene Graph Discriminator ensures the generated object patches align not only with single-object fine-grained information but also with the relation discrepancy among objects.

The main contributions can be summarized as follows:

  • A novel pair-wise spatial constraint module with supervision of relative scale and distance between objects, for learning relationship-aware spatial perceptions.

  • A relation-guided appearance generator module, followed by a scene graph discriminator, for generating reasonable object patches respecting fine-grained object information and relation discrepancies.

  • A general framework for synthesizing scene layouts and images from scene graphs. Experimental results on Visual Genome Krishna et al. (2017) and the human-object interactions dataset HICO-DET Chao et al. (2018) demonstrate that the complex-scene images generated by our proposed method follow common sense.

Related Work

Image Synthesis from Sentence

is a conditional image generation task whose conditioning signal is natural language. Textual descriptions are traditionally fed directly to a recurrent model for semantic information extraction. A generative model then produces results conditioned on this vectorized sentence representation. Most of these works specialize in single-object image generation Reed et al. (2016); Zhang et al. (2017a); Xu et al. (2018), where the layout is simple and the object is usually centered and covers a large area of the image. However, generating realistic multi-object images conditioned on text descriptions is still difficult, since it requires very complex scene layout generation and varied object appearance mapping, and both the scene layout and the object appearances are heavily influenced by the spatial and semantic relationships across objects. Furthermore, encoding all information, including multiple object categories and the interactions among them, into one vector usually loses critical details, while directly decoding images from such an encoded vector hurts the interpretability of the model.

Scene Graph Xu et al. (2017) is a directed graph that represents the structured relationships among objects in an explicit manner. Scene graphs have been widely used in tasks such as image retrieval Johnson et al. (2015) and image captioning Anderson et al. (2016), serving as a medium that bridges the gap between language and vision.

Image Synthesis from Scene Graph Johnson et al. (2018) is a derivative of sentence-based multi-object image generation. Since the conditioning signals are scene graphs, graph models are usually applied to extract essential information from the textual scene graph. The extracted features are directly used to regress the scene layout and are then treated as inputs to an image decoder for generating the final image Ashual and Wolf (2019). Such a framework is applicable to generating images containing multiple objects with simple spatial interactions. However, it still struggles to model reasonable, commonsense scene layouts and appearances, because the semantic relationships in the scene graph remain only implicit.

In this paper, we focus on image generation from the textual scene graph. Different from previous methods, we highlight the impact of relationships among objects for dealing with the messy layout and various object appearance.


A scene graph is denoted as G = (O, E), where O = {o_1, …, o_n} indicates the nodes of the graph, and each o_i denotes the category embedding of an object or instance. Note that we use words like "object" or "instance" to refer to a broad range of categories, from "human" and "tree" to "sky", "water", etc. The edges of the graph are extracted as a relationship embedding set R. Two related objects o_i and o_j associate with one relationship r_k, which leads to a triplet (o_i, r_k, o_j) in the directed edge set E.

Given a scene graph G and its corresponding image I, a scene graph-based image generation model aims to generate an image Î according to G by minimizing L(Î, I), where L measures the differences between Î and I. A standard scene graph to image generation task can be formulated as two separate tasks: a scene graph to layout generation task, which extracts object features with spatial constraints from scene graphs, and an image generation task, which generates images conditioned on the predicted object features and learned layout constraints, as shown in Fig. 3 (left).

Figure 3:

Illustrations of standard (left) and our (right) framework for scene graph to image generation. Left: Directly generating layout and image based on object features extracted from scene graph. Right: Our proposed framework with object pair-wise spatial constraints and appearance supervision respecting relationships among objects.

In this paper, we extend the traditional framework by emphasizing the influence of relationship for both object layouts and object appearances generation. As shown in Fig. 3 (right), three novel modules are proposed:

  • Pair-wise Spatial Constraint Module: a module for constraining layout generation according to the semantic information extracted from the edge set E.

  • Relation-guided Appearance Generator: for each object o_i, we introduce the semantic information of its connected relationships to influence the shape and appearance of the generated image of o_i.

  • Scene Graph Discriminator (D_sg): a novel discriminator for strengthening the relevance of the generated image to the appearances of the objects o_i and the relationships in the edge set E.

Layout Generator

The layout generator aims to predict a bounding box b_i = (x_i, y_i, w_i, h_i) for each object o_i in a given scene graph G, where b_i specifies the normalized coordinates of the box center, together with its width and height, in the ground-truth image I.

In previous work, object representations are usually extracted from the scene graph input and then passed to a box regression network to obtain bounding box predictions b̂_i. The box regression network is trained by minimizing the objective

    L_box = (1/n) · Σ_i || b_i − b̂_i ||_1,

which penalizes the difference between ground-truth and predicted boxes; n indicates the number of objects.
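As a concrete reference point, this standard regression objective can be sketched in a few lines (a minimal NumPy sketch; the function name is ours, not from the paper's code):

```python
import numpy as np

def box_l1_loss(pred_boxes, gt_boxes):
    """Mean L1 distance between predicted and ground-truth boxes,
    each given as normalized (cx, cy, w, h) rows."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    return float(np.abs(pred - gt).mean())
```

This is the per-object target the paper argues against fitting directly, since many other box placements would be equally reasonable.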

Since various reasonable layouts exist, as previously stated, a scene graph to layout task requires addressing a challenging one-to-many mapping. Directly regressing the layout to the offsets of one specific bounding box would hurt the generalization ability of the layout generator and make it difficult to converge. To generate reasonable layouts, we relax the constraint of bounding-box offset regression and propose a novel spatial constraint module for ensuring the rationality of the layout.

Our Pair-wise Spatial Constraint Module introduces two novel metrics for measuring how realistic layouts are.

1. Relative Scale Invariance. The scale of an object is represented by the diagonal length of its bounding box. For any given triplet, the ratio between the scale of the subject and the scale of the object is often roughly the same across different images, as shown in Fig. 4 (Left). We formulate the relative scale between layouts b_i and b_j as

    s(b_i, b_j) = diag(b_i) / diag(b_j),  where diag(b) = sqrt(w² + h²).
2. Relative Distance Invariance. Similar to relative scale, relative distance targets the distance between the two objects in a triplet, normalized by their scales. The relative distance of a related object pair o_i and o_j in realistic images also naturally clusters around one specific value, and the distributions of relative distance for different triplets usually have low variance, as shown in Fig. 4 (Right). Since horizontal flips of images rarely alter spatial relationship distributions, we relax this constraint by using the absolute value of the horizontal coordinate difference. Most importantly, we normalize the distance by the summed scales of the object pair so that the zooming effect of object depth cancels out. Therefore, the relative distance between layouts b_i and b_j can be formulated as

    d(b_i, b_j) = sqrt(|x_i − x_j|² + (y_i − y_j)²) / (diag(b_i) + diag(b_j)).
Figure 4: Distributions of relative scale and distance variance among top-100 triplets in VG and HICO-DET datasets. Low diversity of relative scale and distance is observed, following the property of commonsense knowledge.
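The two pairwise metrics can be sketched directly from their definitions (a hedged Python sketch; the (cx, cy, w, h) box format and helper names are our assumptions):

```python
import math

def diag(box):
    """Scale of a box (cx, cy, w, h): its diagonal length."""
    return math.hypot(box[2], box[3])

def relative_scale(box_i, box_j):
    """Ratio of subject to object diagonal lengths."""
    return diag(box_i) / diag(box_j)

def relative_distance(box_i, box_j):
    """Center distance, with the horizontal offset taken as an absolute
    value (flip invariance) and normalized by the summed scales so that
    depth-induced zooming cancels out."""
    dx = abs(box_i[0] - box_j[0])
    dy = box_i[1] - box_j[1]
    return math.hypot(dx, dy) / (diag(box_i) + diag(box_j))
```

Note that scaling both boxes by the same factor leaves both metrics unchanged, which is exactly the invariance the module exploits.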

We have keenly observed that a relationship in semantic form carries an inherent spatial constraint that has not been fully explored by others. For example, the relationship "holding" implies that the object should be within arm's reach of the subject instead of miles away. The relationship "walking on" strongly constrains the relative vertical arrangement between subject and object, whether it is "man walking-on street" or "dog walking-on floor". In other words, the relative scale and relative distance between two objects heavily depend on the relationship or interaction between them. Therefore, we devise a training scheme that explicitly leverages this constraint.

In this work, the scene graph is first converted to object feature vectors {v_i} and relation embeddings {r_k}, which are fed into a Graph Convolutional Network (GCN). The GCN outputs an updated object-level feature vector u_i aggregated with relation information, where u_i depends jointly on the representations of the relationships and of all objects connected to o_i via graph edges. We then apply the updated object representations to generate the layout for each object by b̂_i = f_box(u_i), where f_box is a bounding-box offset regression network. We construct a scale-distance objective for our proposed spatial constraint module to assist the training of f_box:

    L_sd = Σ_{(i,j)∈E} [ (s(b̂_i, b̂_j) − s(b_i, b_j))² + (d(b̂_i, b̂_j) − d(b_i, b_j))² ],

where s and d are the relative scale and relative distance between the generated layouts of the related object pair o_i and o_j, respectively. L_sd is only computed on connected object pairs in the scene graph, since the relative scale and distance of two objects depend on the relationship between them, as shown in Fig. 1.

With the supervision of relative scale and distance, the box regression network learns to arrange object boxes properly for reasonable layout generation.
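The scale-distance supervision can be sketched as follows (self-contained Python; boxes are (cx, cy, w, h) tuples and all names are ours, not the paper's code):

```python
import math

def _diag(b):
    return math.hypot(b[2], b[3])

def _rel_scale(bi, bj):
    return _diag(bi) / _diag(bj)

def _rel_dist(bi, bj):
    return math.hypot(abs(bi[0] - bj[0]), bi[1] - bj[1]) / (_diag(bi) + _diag(bj))

def scale_distance_loss(pred, gt, edges):
    """L_sd sketch: squared error of relative scale and relative distance,
    accumulated over related (subject, object) index pairs only."""
    total = 0.0
    for i, j in edges:
        total += (_rel_scale(pred[i], pred[j]) - _rel_scale(gt[i], gt[j])) ** 2
        total += (_rel_dist(pred[i], pred[j]) - _rel_dist(gt[i], gt[j])) ** 2
    return total / max(len(edges), 1)
```

Because only relative quantities enter the loss, a predicted layout that translates every box by the same offset incurs no penalty, which is the relaxation the paper argues for over absolute box regression.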

Image Generator

Starting from the original object representations {v_i} and initial relation embeddings {r_k}, we compute a combined "object-relation" vector c_i for each object in the scene graph:

    c_i = [ v_i ; mean_{k ∈ T_i} r_k ; z_i ],

where [ · ; · ] indicates vector concatenation, T_i is the collection of all triplets relevant to object o_i, and z_i is a noise vector randomly sampled from a Gaussian distribution, which makes the object features non-deterministic. The object and averaged relation embeddings are eventually concatenated as inputs to our Relation-guided Appearance Generator, which consists of an object mask predictor M, an object appearance feature predictor A, and a full image generator.
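A minimal sketch of this combined vector, assuming NumPy arrays for the embeddings and an illustrative noise dimension (both are our assumptions, not the paper's settings):

```python
import numpy as np

def object_relation_vector(obj_emb, rel_embs, noise_dim=8, rng=None):
    """Combined 'object-relation' vector: the object embedding, the mean of
    the embeddings of all relations touching the object, and Gaussian noise
    for non-deterministic object features."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if len(rel_embs):
        rel_part = np.mean(rel_embs, axis=0)
    else:  # isolated object: no relations to average
        rel_part = np.zeros_like(np.asarray(obj_emb, float))
    z = rng.standard_normal(noise_dim)
    return np.concatenate([np.asarray(obj_emb, float), rel_part, z])
```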

The combined vector c_i is sent simultaneously to M and A, both of which are four-layer conv nets normalized with spectral normalization Miyato et al. (2018). Through an STN block Jaderberg et al. (2015), the two outputs for each object are first filled into their respective bounding-box layouts, yielding a set of object shape tensors and appearance tensors. By multiplying these two tensors, we generate the final relation-guided appearance feature tensor for each object in the scene graph as

    F_i = STN(M(c_i), b̂_i) ⊙ STN(A(c_i), b̂_i),

where STN denotes the spatial transformer block and ⊙ is element-wise multiplication.
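To make the fill-and-multiply step concrete, here is a toy stand-in that pastes a small object map into its box region with nearest-neighbour resizing instead of a learned STN warp (all names and sizes are ours, purely illustrative):

```python
import numpy as np

def fill_box(canvas_hw, box, patch):
    """Resize a small object map into its bounding-box region of an
    otherwise empty canvas. box = (cx, cy, w, h), normalized."""
    H, W = canvas_hw
    cx, cy, w, h = box
    bh, bw = max(int(h * H), 1), max(int(w * W), 1)
    y0, x0 = int((cy - h / 2) * H), int((cx - w / 2) * W)
    # nearest-neighbour index maps from box pixels back to patch pixels
    ys = np.arange(bh) * patch.shape[0] // bh
    xs = np.arange(bw) * patch.shape[1] // bw
    region = patch[ys][:, xs]
    canvas = np.zeros((H, W))
    y1, x1 = min(y0 + bh, H), min(x0 + bw, W)
    yc, xc = max(y0, 0), max(x0, 0)
    canvas[yc:y1, xc:x1] = region[yc - y0:y1 - y0, xc - x0:x1 - x0]
    return canvas

def object_appearance_tensor(mask_patch, app_patch, box, size=(64, 64)):
    """Element-wise product of the warped shape mask and appearance map."""
    return fill_box(size, box, mask_patch) * fill_box(size, box, app_patch)
```

The mask zeroes out everything outside the predicted object shape, so the appearance features only survive where the object actually is.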

After that, our full image generator generates the image conditioned on all object appearance feature tensors and an additional noise vector z. In detail, our image generator uses a ResNet architecture He et al. (2016) consisting of six ResBlocks as the backbone. Consider generating a 256×256 image for a scene graph: a randomly generated global latent vector is sampled from a normal distribution, then mapped and reshaped into a 1024×4×4 (channels, width, height) tensor through a fully-connected layer. This tensor is sent to the first ResBlock. Each of the six ResBlocks upsamples its input bilinearly by a factor of two; in the meantime, the channel number drops by a factor of 2, except for the third block. Block by block, we fuse the object appearance tensors with the outputs of each ResBlock (the global appearance tensors) using the ISLA-Norm method proposed by Sun and Wu (2019). The final generated image comes from the output of the last ResBlock,

    x_l = ResBlock_ISLA(x_{l−1}, {F_i}),  Î = x_L,

where ResBlock_ISLA indicates a ResBlock equipped with the ISLA-Norm module, L is the number of ResBlocks in our image generator, and x_l is the output of the l-th ResBlock.
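Under our reading of this schedule (which block keeps its channel count is our interpretation of "except for the third block"), the resulting (channels, spatial size) progression can be computed as:

```python
def decoder_schedule(n_blocks=6, start_ch=1024, start_size=4, keep_ch_at=3):
    """(channels, size) after each ResBlock, following the text: every block
    upsamples x2; channels halve except at block `keep_ch_at` (1-indexed)."""
    ch, size = start_ch, start_size
    schedule = []
    for b in range(1, n_blocks + 1):
        size *= 2
        if b != keep_ch_at:
            ch //= 2
        schedule.append((ch, size))
    return schedule
```

Six doublings take the 4×4 seed tensor to the target 256×256 resolution.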

Our image generator takes the object appearance features with relational information and the global random noise as conditions, adapts to the scene composition, and finally generates a realistic image for the scene graph G.

Image Discriminator

Similar to the image generator, we adopt a ResNet with downsampling blocks for the image discriminator. The ResNet backbone consists of a number of downsampling ResBlocks that depends on the input image size. The downsampled image features go through a linear layer, whose outputs are summed channel-wise to form a scalar global discriminator score measuring whether the input image is real, similar to traditional GAN-based methods.

Since different relationships result in diverse appearances of the same object, we argue that the learned object feature representation should reflect not just class-related object styles but also relationship-aware appearances. Thus, we propose a novel Scene Graph Discriminator D_sg to measure whether the scene graph extracted from the generated image is associated with the given textual scene graph. In detail, we first extract object-level feature patches rerouted from the second layer of the ResNet backbone, then resize these feature patches to the same size with an RoI align layer He et al. (2017). We then introduce an object classifier, which attempts to classify each feature patch f_i into its category. By pairing object feature tensors according to the edges of the scene graph, we send each paired object feature f_i and f_j to the relationship classifier, which predicts the type of relationship for the given feature pair. Our proposed D_sg encourages the image generator to be aware of the object categories and relationships that exist in the scene graph:

    L_sg = − Σ_i log p(y_i | f_i) − Σ_{(i,k,j)∈E} log p(r_k | f_i, f_j),

where y_i is the category label of object o_i.
Moreover, we introduce an object discriminator D_obj to measure whether each object in the image appears realistic, based on its box b̂_i.
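A hedged sketch of the scene graph discriminator objective, with the classifier outputs given as pre-computed logits (function names are ours, not the paper's):

```python
import numpy as np

def _xent(logits, label):
    """Numerically stable cross-entropy of a single logit vector."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def scene_graph_loss(obj_logits, obj_labels, rel_logits, triplets):
    """L_sg sketch: object-classification loss on each RoI feature patch,
    plus relation-classification loss on each (subject, relation, object)
    triplet paired along the scene-graph edges."""
    loss = sum(_xent(obj_logits[i], obj_labels[i]) for i in range(len(obj_labels)))
    loss += sum(_xent(rel_logits[k], r) for k, (s, r, o) in enumerate(triplets))
    return loss / (len(obj_labels) + len(triplets))
```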

The overall objective function for training the layout generator, image generator, and discriminators is defined as:

    L = L_box + L_sd + L_img + λ_obj · L_obj + λ_sg · L_sg,

where L_img is the image adversarial loss from the image discriminator, L_obj is the object adversarial loss from D_obj, and L_sg is the scene graph relevance loss from D_sg. In our experiments, the loss weight parameters λ_obj and λ_sg are set to fixed constants.

Experimental Results

We evaluate our proposed method for generating images at three resolutions (64×64, 128×128, and 256×256) on the following two datasets:

Visual Genome Krishna et al. (2017) provides crowd-sourced dense annotations of both scene graphs and images, collected for cognitive tasks. Following the settings of Johnson et al. (2018), we experiment on Visual Genome version 1.4. We keep 178 object and 45 relation categories by removing object and relationship categories that appear fewer than 2000 and 500 times, respectively.

HICO-DET Chao et al. (2018) was built for modeling human-object interactions. Compared with Visual Genome, the scene graphs it provides are human-centered. We keep object categories that appear more than 1000 times and discard interaction types that repeat fewer than 250 times, leaving 19 object and 22 relationship types in total. Images with objects smaller than 32×32, or with fewer than 2 or more than 10 objects, are discarded. Finally, we obtain 15963 training images and 4034 test images.

The COCO dataset Caesar et al. (2018) is not used in this paper because the relationship types in COCO are too simple, consisting mainly of naive spatial arrangement relations. We trained models using Adam with an initial lr= and a batch size of 32 for 200 epochs.

Several previous works target multi-object image generation. Most of them synthesize images from ground-truth layouts or pixel-level instance segmentation annotations Sun and Wu (2019); Hong et al. (2018); Li et al. (2019). The work of Ashual and Wolf aims to generate images from an input scene graph; however, their scene graphs are simplified to six spatial relationships (right-of, left-of, above, below, surrounding, and inside), and location attributes are assisted by additional information for each node in the scene graph. Luo et al. focus only on spatial relationships instead of semantic relationships, and the objects in their paper are mostly rigid bodies. Our paper learns not just from spatial relationships but also from semantic relationships (e.g., "looking at"), and uses datasets involving a large number of non-rigid objects whose varied shapes and appearances are sensitive to their semantic relationships, which drastically increases the difficulty of our task. PasteGAN Yikang et al. (2019) applies both the scene graph and ground-truth image crops as inputs for complex-scene image generation. To the best of our knowledge, sg2im Johnson et al. (2018) is the only related work on complex-scene image generation from scene graphs that contain semantic and complex relationships among objects.

Compared Methods In this paper, we compare our proposed method with sg2im and PasteGAN for complex-scene image generation. Moreover, to demonstrate the effectiveness of our relation-guided appearance generator and scene graph discriminator, we also compare our method with LostGAN Sun and Wu (2019) which is designed for generating images by given ground-truth layout.

Resolution   Method         Visual Genome (IS / FID)   HICO-DET (IS / FID)
64×64        Real images*   13.9 ± 1 / 0.0             9.8 ± 0.5 / 0.0
             sg2im†         6.3 ± 0.2 / 47.6           4.4 ± 0.1 / 99.9
             LostGAN†       6.9 ± 0.1 / 38.7           4.5 ± 0.3 / 86.4
             Ours†          7.5 ± 0.4 / 29.0           5.5 ± 0.1 / 41.7
             sg2im          5.5 ± 0.1 / 47.5           4.4 ± 0.1 / 94.3
             PasteGAN       6.9 ± 0.2 / 58.5           - / -
             Ours           7.0 ± 0.2 / 37.7           5.3 ± 0.7 / 47.4
128×128      Real images*   22.5 ± 1.9 / 0.0           13.7 ± 0.7 / 0.0
             sg2im†         6.3 ± 0.2 / 83.9           4.6 ± 0.1 / 83.7
             LostGAN†       7.4 ± 0.3 / 53.4           4.8 ± 0.1 / 79.9
             Ours†          9.4 ± 0.4 / 41.0           6.5 ± 0.1 / 60.6
             sg2im          6.2 ± 0.2 / 83.8           4.6 ± 0.1 / 123.0
             Ours           9.2 ± 0.8 / 53.0           5.0 ± 0.3 / 61.4
256×256      Real images*   30.1 ± 2.3 / 0.0           16.3 ± 0.5 / 0.0
             Ours†          12.6 ± 0.5 / 68.3          7.5 ± 0.1 / 78.3
             Ours           10.8 ± 0.9 / 85.7          6.9 ± 0.3 / 80.5

Table 1: Comparison of IS and FID among different methods. On each dataset, the test-set samples are randomly split into 5 groups; the mean and standard deviation across splits are reported. † indicates images generated from ground-truth layouts instead of generated layouts; * denotes real images.
Figure 5: Examples of layouts and images generated from scene graphs in Visual Genome and HICO-DET for our method and sg2im. In the layout examples, we use red color patches to denote bounding boxes that fail to reflect the distance between object pairs. The bounding boxes with blue background have an unnatural scale configuration. Best viewed in color version.

Quantitative Results

We adopt two metrics to evaluate the generated images.

Inception Score (IS) Salimans et al. (2016) measures the diversity and quality of generated images. A pre-trained InceptionV3 model is used to predict the class probabilities for each generated image. Larger Inception Scores are better.

Fréchet Inception Distance (FID) Heusel et al. (2017) (https://github.com/mseitzer/pytorch-fid) measures the Fréchet distance between the multivariate Gaussian distributions of real and generated images. Lower FID scores are better.

These two metrics are widely used for evaluating generative models: IS mainly evaluates the realism of single objects, while FID better reflects the quality of generated images containing multiple objects.

Table 1 summarizes the performance on the two aforementioned datasets in terms of Inception Score and FID. Our model outperforms sg2im on both VG and HICO-DET. Moreover, even without external information such as image crops, our method still achieves better results than those reported for PasteGAN. In addition, we conduct experiments with GT-layout versions that use ground-truth bounding boxes during both training and testing. This setting gives an upper bound on a model's performance in the case of perfect layout construction. As shown in Table 1, our method has more potential than sg2im and LostGAN.

We also conduct ablation studies in Table 2. The relative importance of the Pair-wise Spatial Constraint Module and the Scene Graph Discriminator is measured by removing L_sd and L_sg from the overall objective, respectively. The Relation-guided Appearance Generator is ablated by erasing the relation embeddings when computing object shape and texture features. The layout constraint module predicts reasonable spatial arrangements that improve generated image quality, the relation-guided generator introduces more reasonable appearance information, and the scene graph discriminator improves the correspondence between textual scene graph inputs and generated images.

Method                                            IS          FID
Ours                                              9.2 ± 0.8   53.0
w/o Pair-wise Spatial Constraint Module (L_sd)    8.6 ± 1.2   59.8
w/o Relation-guided Appearance Generator          8.7 ± 0.9   57.4
w/o Scene Graph Discriminator (L_sg)              7.4 ± 0.2   73.3
Table 2: Ablation studies of our proposed method, reported on the 128×128 image generation task on Visual Genome.
Figure 6: Generated samples from ground truth layouts on Visual Genome by sg2im, LostGAN and our method.
User study sg2im Same Ours
Image is more realistic 9% 26% 65%
Image has reasonable object arrangement 12% 27% 61%
Image reflects relationships in scene graph 9% 19% 72%
Layout is more reasonable 11% 30% 59%
Table 3: We performed a user study to compare the quality of generated layouts and images of our method against sg2im.

Qualitative Results

Fig. 5 compares the capability of our method with that of sg2im on VG and HICO-DET. In the 1st column, sg2im places the human above the motorcycle, which is not a normal arrangement. Similarly, in the 5th column, sg2im predicts that the "pant" is not vertically in line with the "shirt". In the last column, sg2im predicts a "sky" region that is too small, leading to chaotic color fill in the generated image. Likewise, in the 7th column, sg2im predicts a "snow" region much bigger than the "mountain", which conflicts with the triplet "snow on mountain". These displacements occur when training is not enhanced with relative distances and scales.

Fig. 6 shows images generated conditioned on ground-truth layouts. We compare our model against sg2im and LostGAN grounded on the same position layouts. Our method is more likely to generate realistic images with natural objects from rich layouts.

User Study We also conduct a user study to measure human preference between images generated by our method and by sg2im in Table 3. We choose the 128×128 resolution models for both. The study involves 40 students with a background in computer science, on 500 test cases generated from the VG test set. A majority of users preferred the generated layouts and images from our method in 65% of image pairs.


The relationships between objects significantly shape the localization of objects and even their appearances. Prior literature mainly focuses on fitting single-object appearance, overlooking the semantic interactions among objects, which may result in inconsistent and chaotic results. In this paper, we propose a new framework for generating complex-scene images by exploiting the importance of relationships among multiple objects. Quantitative results, qualitative results, and user studies show our method's ability to generate reasonable layouts and align object interactions.


This work was supported by the National Key R&D Program of China under Grant Nos. 2020AAA0108600 and 2020AAA0103800.

Ethical Impact

Images are tiny visual samples of the grand physical world. The elementary particles that comprise our world first evolve and cluster into objects; the appearance of those objects is then gradually shaped by interactions with their counterparts. To model this natural clustering and interaction, we construct a generative model that structurally mimics the graph-like arrangement of our world. Our model builds a projection from the symbolic graph space to the pixel space, and we demonstrate how a small alteration in an object relationship can greatly affect the appearance of surrounding objects. In the future, we should further investigate neural designs that fit the data structure of pixels and encode more visually commonsensical (relational) patterns into the model. Overall, this paper has a positive impact on both industry and academia and enhances people’s understanding of visual thinking.


  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398.
  • O. Ashual and L. Wolf (2019) Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4561–4569.
  • J. Bao, D. Chen, F. Wen, H. Li, and G. Hua (2017) CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754.
  • H. Caesar, J. R. R. Uijlings, and V. Ferrari (2018) COCO-Stuff: thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218.
  • Y. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng (2018) Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 381–389.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  • T. Hinz, S. Heinrich, and S. Wermter (2019) Generating multiple objects at spatially distinct locations. arXiv preprint arXiv:1901.00686.
  • S. Hong, D. Yang, J. Choi, and H. Lee (2018) Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994.
  • M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pp. 2017–2025.
  • J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1219–1228.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3668–3678.
  • A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori (2019) LayoutVAE: stochastic scene layout generation from a label set. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9895–9904.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
  • W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao (2019) Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12174–12182.
  • X. Li and S. Jiang (2019) Know more say less: image captioning based on scene graphs. IEEE Transactions on Multimedia 21 (8), pp. 2117–2130.
  • Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei (2018) Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7492–7500.
  • Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang (2017) Scene graph generation from objects, phrases and region captions. In IEEE International Conference on Computer Vision, pp. 1270–1279.
  • A. Luo, Z. Zhang, J. Wu, and J. B. Tenenbaum (2020) End-to-end optimization of scene layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3754–3763.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
  • W. Norcliffe-Brown, S. Vafeias, and S. Parisot (2018) Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pp. 8334–8343.
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, pp. 70–80.
  • W. Sun and T. Wu (2019) Image synthesis from reconfigurable layout and style. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10531–10540.
  • D. Teney, L. Liu, and A. van den Hengel (2017) Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419.
  • T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324.
  • X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694.
  • L. Yikang, T. Ma, Y. Bai, N. Duan, S. Wei, and X. Wang (2019) PasteGAN: a semi-parametric method to generate image from scene graph. In Advances in Neural Information Processing Systems, pp. 3948–3958.
  • R. Zellers, M. Yatskar, S. Thomson, and Y. Choi (2018) Neural Motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017a) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915.
  • H. Zhang, Z. Kyaw, S. Chang, and T. Chua (2017b) Visual translation embedding network for visual relation detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3107–3115.
  • B. Zhao, L. Meng, W. Yin, and L. Sigal (2018) Image generation from layout. arXiv preprint.