Synthesizing images from text descriptions (known as Text-to-Image synthesis) is an important machine learning task, which requires handling ambiguous and incomplete information in natural language descriptions and learning across the vision and language modalities. Approaches based on Generative Adversarial Networks (GANs) have recently achieved promising results on this task [23, 22, 32, 33, 29, 16, 9, 12, 34]. Most GAN-based methods synthesize the image conditioned only on a global sentence vector, which may miss important fine-grained information at the word level and prevents the generation of high-quality images. More recently, AttnGAN introduced the attention mechanism [28, 30, 2, 27] into the GAN framework, allowing attention-driven, multi-stage refinement for fine-grained text-to-image generation.
Although images with realistic texture have been synthesized on simple datasets such as birds [29, 16] and flowers, most existing approaches do not explicitly model objects and their relations in images and thus have difficulties in generating complex scenes such as those in the COCO dataset. For example, generating images from the sentence “several people in their ski gear are in the snow” requires modeling different objects (people, ski gear) and their interactions (people on top of ski gear), as well as filling in missing information (e.g., the rocks in the background). In the top row of Fig. 1, the image generated by AttnGAN does contain scattered textures of people and snow, but the shapes of the people are distorted and the picture’s layout is not semantically meaningful. Prior work remedies this problem by first constructing a semantic layout from the text and then synthesizing the image with a deconvolutional image generator. However, the fine-grained word/object-level information is still not explicitly used for generation, so the synthesized images do not contain enough details to look realistic (see the middle row of Fig. 1).
In this study, we aim to generate high-quality complex images with semantically meaningful layouts and realistic objects. To this end, we propose a novel Object-driven Attentive Generative Adversarial Network (Obj-GAN) that effectively captures and utilizes fine-grained word/object-level information for text-to-image synthesis. The Obj-GAN consists of an object-driven attentive image generator, an object-wise discriminator, and a new object-driven attention mechanism. The proposed image generator takes as input the text description and a pre-generated semantic layout, and synthesizes high-resolution images via a multi-stage coarse-to-fine process. At every stage, the generator synthesizes the image region within a bounding box by focusing on the words that are most relevant to the object in that bounding box, as illustrated in the bottom row of Fig. 1. More specifically, using a new object-driven attention layer, it uses the class label to query the words in the sentence to form a word context vector, as illustrated in Fig. 4, and then synthesizes the image region conditioned on the class label and the word context vector. The object-wise discriminator checks every bounding box to make sure that the generated object indeed matches the pre-generated semantic layout. To compute the discrimination losses for all bounding boxes simultaneously and efficiently, our object-wise discriminator is based on a Fast R-CNN, with a binary cross-entropy loss for each bounding box.
The contribution of this work is threefold. (i) An Object-driven Attentive Generative Network (Obj-GAN) is proposed for synthesizing complex images from text descriptions. Specifically, two novel components are proposed: the object-driven attentive generative network and the object-wise discriminator. (ii) Comprehensive evaluation on the large-scale COCO benchmark shows that our Obj-GAN significantly outperforms previous state-of-the-art text-to-image synthesis methods. A detailed ablation study is performed to empirically evaluate the effect of different components in Obj-GAN. (iii) A thorough analysis is performed by visualizing the attention layers of the Obj-GAN, providing insights into how the proposed model generates high-quality complex scenes. Compared with previous work, our object-driven attention is more robust and interpretable, and significantly improves object generation quality in complex scenes.
2 Related Work
Generating photo-realistic images from text descriptions, though challenging, is important to many real-world applications such as art generation and computer-aided design. There has been much research effort on this task through different approaches, such as variational inference [17, 6], the approximate Langevin process, conditional PixelCNNs via maximum likelihood estimation [26, 24], and conditional generative adversarial networks [23, 22, 32, 33]. Compared with other approaches, Generative Adversarial Networks (GANs) have shown better performance in image generation [21, 3, 25, 13, 11, 10]. However, existing GAN-based text-to-image synthesis is usually conditioned only on the global sentence vector, which misses important fine-grained information at the word level, and thus lacks the ability to generate high-quality images. AttnGAN uses the traditional grid visual attention mechanism for this task, which enables synthesizing fine-grained details in different image regions by paying attention to the relevant words in the text description.
To explicitly encode the semantic layout into the generator, prior work proposes to decompose the generation process into two steps: first constructing a semantic layout (bounding boxes and object shapes) from the text, and then synthesizing an image conditioned on the layout and the text description. Sg2Im also proposes such a two-step process to generate images from scene graphs, and its pipeline can be trained end-to-end. In this work, the proposed Obj-GAN follows a similar two-step generation process. However, previous methods encode the text into a single global sentence vector, which loses word-level fine-grained information. Moreover, they use the image-level GAN loss for the discriminator, which is less effective at providing object-wise discrimination signals for generating salient objects. We propose a new object-driven attention mechanism to provide fine-grained information (words in the text description and objects in the layout) to different components, including an attentive seq2seq bounding box generator, an attentive image generator, and an object-wise discriminator.
The attention mechanism has recently become a crucial part of vision-language multi-modal tasks. The traditional grid attention mechanism has been successfully used to model multi-level dependencies in image captioning, image question answering, text-to-image generation, unconditional image synthesis [16], and image/text retrieval. In 2018, the bottom-up attention mechanism was proposed, which enables attention to be calculated over semantically meaningful regions/objects in the image, for image captioning and visual question answering. Inspired by these works, we propose Obj-GAN, which for the first time develops an object-driven attentive generator plus an object-wise discriminator, enabling GANs to synthesize high-quality images of complicated scenes.
3 Object-driven Attentive GAN
As illustrated in Fig. 2, the Obj-GAN performs text-to-image synthesis in two steps: generating a semantic layout (class labels, bounding boxes, shapes of salient objects), and then generating the image. In the image generation step, the object-driven attentive generator and object-wise discriminator are designed to enable image generation conditioned on the semantic layout generated in the first step.
The input of Obj-GAN is a tokenized sentence. With a pre-trained bi-LSTM model, we encode its words as word vectors and the entire sentence as a global sentence vector. We provide details of this pre-trained bi-LSTM model and the implementation details of the other modules of Obj-GAN in Appendix A.
3.1 Semantic layout generation
In the first step, the Obj-GAN takes the sentence as input and generates a semantic layout, a sequence of objects specified by their bounding boxes (with class labels) and shapes. As illustrated in Fig. 2, a box generator first generates a sequence of bounding boxes, and then a shape generator generates their shapes. This part resembles the bounding box generator and shape generator in prior work, and we provide our implementation details in Appendix A.
Box generator. The box generator is an attentive seq2seq model that takes the pre-trained bi-LSTM word vectors as input and outputs, for each object, its class label and its bounding box. In the rest of the paper, we will also refer to a label-box pair as a bounding box when no confusion arises. Since most of the bounding boxes have corresponding words in the sentence, the attentive seq2seq model captures this correspondence better than the plain seq2seq model used in prior work.
Shape generator. Given the bounding boxes, the shape generator predicts the shape of each object within its bounding box, taking a random noise vector as an additional input. Since the generated shapes not only need to match the location and category information provided by the bounding boxes, but also should be aligned with their surrounding context, we build the shape generator on a bi-directional convolutional LSTM, as illustrated in Fig. 2. Training is based on the GAN framework, in which a perceptual loss is also used to constrain the generated shapes and to stabilize training.
3.2 Image generation
3.2.1 Attentive multistage image generator
As shown in Fig. 3, the proposed attentive multi-stage generative network has two generators. The base generator first generates a low-resolution image conditioned on the global sentence vector and the pre-generated semantic layout. The refiner then refines details in different regions by paying attention to the most relevant words and pre-generated class labels, and generates a higher-resolution image. Specifically,
(i) the input includes a random vector drawn from the standard normal distribution; (ii) an encoding of the low-resolution shapes (higher-resolution shapes for the refiners); (iii) the patch-wise context vectors from the traditional grid attention; and (iv) the object-wise context vectors from our new object-driven attention, together with the label context vectors from the class labels. We can stack more refiners onto the generation process to obtain images of increasingly higher resolution. In this paper, we use two refiners and finally generate images at the final resolution.
Compute context vectors via attention. Both patch-wise context vectors and object-wise context vectors are attention-driven context vectors for specific image regions, and encode information from the words that are most relevant to that image region. Patch-wise context vectors are for uniform-partitioned image patches determined by the uniform down-sampling/up-sampling structure of CNN, but these patches are not semantically meaningful. Object-wise context vectors are for semantically meaningful image regions specified by bounding boxes, but these regions are at different scales and may have overlaps.
Specifically, the patch-wise context vector (object-wise context vector) is a dynamic representation of the word vectors relevant to a patch (bounding box), which is calculated by
Here, the attention weight indicates how much the model attends to each word when generating a patch (bounding box), and is computed by
For the traditional grid attention, we use the image region feature, which is one column in the previous hidden layer, to query the pre-trained bi-LSTM word vectors. For the new object-driven attention, we use the GloVe embedding of the object class label to query the GloVe embeddings of the words in the sentence, as illustrated in the lower part of Fig. 4.
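The object-driven attention step can be sketched as follows; this is a minimal numpy illustration, and the toy embeddings, dimensions, and values are assumptions for demonstration, not the model's real GloVe or bi-LSTM vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hand-constructed toy embeddings: four sentence words, where word 2
# ("dog", say) matches the object's class label.
word_glove = np.array([[1., 0., 0.],
                       [0., 1., 0.],
                       [0., 0., 1.],
                       [1., 1., 0.]])      # GloVe vectors of the sentence words
word_vecs = np.arange(20.).reshape(4, 5)   # bi-LSTM word vectors (dim 5)
label_glove = np.array([0., 0., 1.])       # GloVe vector of the class label

scores = word_glove @ label_glove          # the label queries each word
weights = softmax(scores)                  # attention distribution over words
context = weights @ word_vecs              # word context vector for this object
```

The matching word receives the largest attention weight, and the resulting context vector mixes the bi-LSTM word vectors accordingly.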
Feature map concatenation. The patch-wise context vectors can be directly concatenated with the image feature vectors in the previous layer. However, the object-wise context vectors cannot, because they are associated with bounding boxes instead of pixels in the hidden feature map. We propose to copy each object-wise context vector to every pixel where that object is present, via a vector outer product with the object's shape mask, as illustrated in the upper-right part of Fig. 4. This operation can be viewed as an inverse of the pooling operator.
If there are multiple bounding boxes covering the same pixel, we have to decide whose context vector should be used on this pixel. In this case, we simply do a max-pooling across all the bounding boxes:
Then the resulting object-wise context map can be concatenated with the feature map and the patch-wise context vectors for the next-stage generation.
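The copy-and-max-pool scheme above can be sketched as follows; the masks, context vectors, and dimensions are toy assumptions, and the per-channel elementwise max across objects is one plausible reading of the max-pooling step.

```python
import numpy as np

# Toy setup: a 4x4 feature map, two objects with 3-dim context vectors
# and binary shape masks that overlap in the bottom-right corner.
H, W, d = 4, 4, 3
masks = np.zeros((2, H, W))
masks[0, 0:3, 0:3] = 1.0                 # object 0 covers the top-left 3x3 block
masks[1, 2:4, 2:4] = 1.0                 # object 1 overlaps it at the bottom-right
contexts = np.array([[1., 0., 0.],
                     [0., 2., 0.]])      # per-object context vectors

# Copy each context vector to every pixel the object covers (an outer
# product of mask and vector), then max-pool across objects per pixel.
per_obj = masks[:, :, :, None] * contexts[:, None, None, :]  # (2, H, W, d)
obj_context_map = per_obj.max(axis=0)                        # (H, W, d)
```

Pixels covered by a single object carry that object's context vector; overlapping pixels carry the channel-wise maximum across the covering objects.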
Label context vectors. Similarly, we distribute the class label information to the entire hidden feature map to get the label context vectors, i.e.,
Finally, we concatenate the feature map with the patch-wise, object-wise, and label context vectors, and pass the concatenated tensor through one up-sampling layer and several residual layers to generate a higher-resolution image.
Grid attention vs. object-driven attention. The process to compute the patch-wise context vectors above is the traditional grid attention mechanism used in AttnGAN. Note that its attention weights and context vectors are useful only when the hidden features of the current stage correctly capture the content to be drawn in each patch. This essentially assumes that the previous stage of generation already captures a rough sketch (semantic layout). This assumption is valid for simple datasets like birds, but fails for complex datasets like COCO, where the generated low-resolution image typically does not have a meaningful layout. In this case, the grid attention is even harmful, because a patch-wise context vector may attend to a wrong word and thus generate the texture associated with that wrong word. This may be the reason why AttnGAN’s generated image contains scattered patches of realistic texture but overall is not semantically meaningful; see Fig. 1
for example. A similar phenomenon is also observed in DeepDream. On the contrary, in our object-driven attention, the attention weights and context vectors rely on the class labels of the bounding boxes and are independent of the generation in the previous stage. Therefore, the object-wise context vectors are always helpful for generating images that are consistent with the pre-generated semantic layout. Another benefit of this design is that the context vectors can also be used in the discriminator, as we present in 3.2.2.
3.2.2 Discriminators
We design patch-wise and object-wise discriminators to train the attentive multi-stage generator above. Given a patch from the uniformly-partitioned image patches determined by the uniform down-sampling structure of the CNN, the patch-wise discriminator determines whether this patch is realistic (unconditional) and whether it is consistent with the sentence description (conditional). Given a bounding box and the class label of the object within it, the object-wise discriminator determines whether this region is realistic (unconditional) and whether it is consistent with the sentence description and the given class label (conditional).
Patch-wise discriminators. Given an image-sentence pair with its sentence vector, the patch-wise unconditional and text-conditional discriminators can be written as
where Enc is a convolutional feature extractor that extracts patch-wise features, and the two output probabilities determine whether each patch is realistic and whether it is consistent with the text description.
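A minimal sketch of how such patch-wise probabilities and their binary cross-entropy losses could be computed; the 1x1-conv heads are simplified to dot products, and all shapes and weights here are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y):
    # binary cross-entropy, averaged over all patches
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical patch features from Enc: a 4x4 grid of 8-dim vectors, plus
# a 6-dim sentence vector broadcast to every patch for the conditional head.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
sent = rng.normal(size=6)

w_uncond = rng.normal(size=8)       # unconditional head (1x1 conv as dot product)
w_cond = rng.normal(size=8 + 6)     # text-conditional head

p_uncond = sigmoid(feats @ w_uncond)                            # (4, 4) probs
cond_in = np.concatenate([feats, np.broadcast_to(sent, (4, 4, 6))], axis=-1)
p_cond = sigmoid(cond_in @ w_cond)                              # (4, 4) probs

# For a real, matching image-sentence pair, every patch targets label 1.
loss_real = bce(p_uncond, 1.0) + bce(p_cond, 1.0)
```

Each patch gets its own realism and text-consistency probability, so a single forward pass scores all patches of the image at once.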
Shape discriminator. In a similar manner, we have our patch-wise shape discriminator
where we first concatenate the image and shapes along the channel dimension, and then extract patch-wise features with another convolutional feature extractor Enc. The resulting probabilities determine whether each patch is consistent with the given shape. Our patch-wise discriminators resemble the PatchGAN for the image-to-image translation task. Compared with the global discriminators in AttnGAN, the patch-wise discriminators not only reduce the model size, and thus enable generating higher-resolution images, but also improve the quality of the generated images; see Table 1 for experimental evidence.
Object-wise discriminators. Given an image, the bounding boxes of its objects, and their shapes, we propose the following object-wise discriminators:
Here, we first concatenate the image and shapes and extract a region feature vector for each bounding box through a Fast R-CNN model with an ROI-align layer; see Fig. 5(a). Then, similar to the patch-wise discriminator (8), the unconditional (conditional) probabilities determine whether each object is realistic (consistent with its class label and its text context information) or not; see Fig. 5(b). Here, the class label is represented by its GloVe embedding, and the text context information is defined in (3).
All discriminators are trained with the traditional cross-entropy loss.
3.2.3 Loss function for the image generator
The generator’s GAN loss is a weighted sum of these discriminators’ losses, i.e.,
Here, the object-wise term is averaged over the number of bounding boxes and the patch-wise terms over the number of regular patches, and the weights balance the object-wise GAN loss, the patch-wise text-conditional loss, and the patch-wise shape-conditional loss, respectively. We tried combining our discriminators with the spectral-normalized projection discriminator [18, 19], but did not see significant performance improvement. We report the performance of the spectral-normalized version in 4.1 and provide model architecture details in Appendix A.
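The weighted combination can be sketched numerically as follows; the individual loss values and all weights are placeholders for illustration, not values from the paper.

```python
# Placeholder weights for the object-wise, patch-wise text-conditional,
# and patch-wise shape-conditional terms (hyper-parameters in the paper).
lambda_obj, lambda_txt, lambda_shp = 1.0, 1.0, 1.0

loss_patch_uncond = 0.9   # patch-wise unconditional (realism) loss
loss_patch_txt = 0.7      # patch-wise text-conditional loss
loss_patch_shp = 0.5      # patch-wise shape-conditional loss
loss_obj = 1.1            # object-wise loss, averaged over bounding boxes

gan_loss = (loss_patch_uncond
            + lambda_txt * loss_patch_txt
            + lambda_shp * loss_patch_shp
            + lambda_obj * loss_obj)

# The final generator objective adds the DAMSM word-level matching loss,
# weighted by another tuned hyper-parameter.
lambda_damsm = 1.0
loss_damsm = 0.3
total_loss = gan_loss + lambda_damsm * loss_damsm
```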
Combined with the deep multi-modal attentive similarity model (DAMSM) loss introduced in AttnGAN, our final image generator’s loss is
where the DAMSM weight is a hyper-parameter to be tuned. The DAMSM loss is a word-level fine-grained image-text matching loss, which will be elaborated in Appendix A. We set the hyper-parameters in this section based on experiments on a held-out validation set.
4 Experiments
Dataset. We use the COCO dataset for evaluation. It contains 80 object classes, and each image is associated with object-wise annotations (i.e., bounding boxes and shapes) and 5 text descriptions. We use the official 2014 train (over 80K images) and validation (over 40K images) splits for the training and test stages, respectively.
[Table 1: quantitative comparison of the ablative versions (P-AttnGAN w/ Lyt, Obj-GAN w/ SN, Obj-GAN) and previous methods (e.g., Reed et al., with some entries n/a).]
Evaluation metrics. We use the Inception score and the Fréchet Inception Distance (FID) as quantitative evaluation metrics. In our experiments, we found that the Inception score can saturate, or even over-fit, while FID is a more robust measure that aligns better with human qualitative evaluation. Following prior work, we also use R-precision, a common evaluation metric for ranking retrieval results, to evaluate whether a generated image is well conditioned on the given text description. More specifically, given a pre-trained image-to-text retrieval model, we use generated images to query their corresponding text descriptions. Given a generated image conditioned on a sentence, together with 99 randomly sampled sentences, we rank these 100 sentences with the pre-trained image-to-text retrieval model. If the ground-truth sentence is ranked highest, we count this as a successful retrieval. We perform this retrieval task once for every image in the test dataset and report the percentage of successful retrievals as the R-precision score.
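The R-precision protocol above can be sketched as follows; the toy example uses only 2 distractor sentences instead of 99, and the similarity scores are made up for illustration.

```python
import numpy as np

def r_precision(sim, truth_idx=0):
    """Fraction of queries whose ground-truth sentence ranks first.

    sim: (num_images, num_sentences) similarity scores from a pretrained
    image-to-text retrieval model; column `truth_idx` holds the score of
    the ground-truth sentence, the other columns are random sentences.
    """
    best = sim.argmax(axis=1)
    return float(np.mean(best == truth_idx))

# Toy scores: 3 generated images, ground-truth sentence in column 0.
sim = np.array([
    [0.9, 0.1, 0.2],   # success: ground truth ranked top
    [0.2, 0.8, 0.1],   # failure: a distractor ranked top
    [0.7, 0.3, 0.6],   # success
])
score = r_precision(sim)   # 2 of 3 retrievals succeed
```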
It is important to point out that none of these quantitative metrics is perfect; better metrics are needed to evaluate image generation quality in complicated scenes. In fact, the Inception score completely fails to evaluate the semantic layout of the generated images. The R-precision score depends on the pre-trained image-to-text retrieval model it uses and can only capture the aspects that the retrieval model captures. The pre-trained model we use is still limited in capturing the relations between objects in complicated scenes, and so is our R-precision score.
Quantitative evaluation. We compute these three metrics under two settings for the full validation dataset.
Qualitative evaluation. Apart from the quantitative evaluation, we also visualize the outputs of all ablative versions of Obj-GAN and of the state-of-the-art methods whose pre-trained models are publicly available.
4.1 Ablation study
In this section, we first evaluate the effectiveness of the object-driven attention. Next, we compare the object-driven attention mechanism with the grid attention mechanism. Then, we evaluate the impact of spectral normalization on Obj-GAN. We use Fig. 6 and the upper half of Table 1 to present the comparison among the different ablative versions of Obj-GAN. Note that all ablative versions are trained with the same batch size for the same number of epochs. In addition, we use the lower half of Table 1 to compare Obj-GAN with previous methods. Finally, we validate Obj-GAN’s generalization ability on novel text descriptions.
Object-driven attention. To evaluate the efficacy of the object-driven attention mechanism, we implement a baseline, named P-AttnGAN w/ Lyt, by disabling the object-driven attention mechanism in Obj-GAN. In essence, P-AttnGAN w/ Lyt can be considered an improved version of AttnGAN with the patch-wise discriminator (abbreviated as the prefix “P-” in the name) and the modules (e.g., the shape discriminator) for handling the conditional layout (abbreviated as “Lyt”). Moreover, it can also be considered a modified implementation of prior work, as it resembles their two-step (layout-image) generation. Note that there are three key differences between P-AttnGAN w/ Lyt and that prior work: (i) P-AttnGAN w/ Lyt has a multi-stage image generator that gradually increases the resolution and refines the generated images, while the prior work has a single-stage image generator. (ii) With the help of the grid attention module, P-AttnGAN w/ Lyt is able to utilize fine-grained word-level information, while the prior work conditions only on the global sentence information. (iii) The third difference lies in their loss functions: P-AttnGAN w/ Lyt uses the DAMSM loss in (11) to penalize the mismatch between the generated images and the input text descriptions, while the prior work uses the perceptual loss to penalize the mismatch between the generated images and the ground-truth images. As shown in Table 1, P-AttnGAN w/ Lyt yields a higher Inception score.
We compare Obj-GAN with P-AttnGAN w/ Lyt under three settings, each corresponding to a set of conditional layout inputs: the predicted boxes and shapes, the ground-truth boxes with predicted shapes, and the ground-truth boxes and shapes. As presented in Table 1, Obj-GAN consistently outperforms P-AttnGAN w/ Lyt on all three metrics. In Fig. 7, we use the same layout as the conditional input and compare the visual quality of the generated images. An interesting phenomenon shown in Fig. 7 is that both the foreground-object (e.g., airplane and train) and the background (e.g., airport and trees) textures synthesized by Obj-GAN are much richer and smoother than those from P-AttnGAN w/ Lyt. The effectiveness of the object-driven attention for foreground objects is easy to understand. Its benefit for background textures is probably due to the fact that it implicitly provides a stronger signal that distinguishes the foreground, so the image generator has richer guidance and clearer emphasis when synthesizing textures for a certain region.
Grid attention vs. object-driven attention. We compare Obj-GAN with P-AttnGAN here, so as to contrast the effects of the object-driven and grid attention mechanisms. In Fig. 8, we show the generated image of each method as well as the corresponding attention maps aligned on the right side. In a grid attention map, the brightness of a region reflects how much this region attends to the word above the map. As for the object-driven attention map, the word above each attention map is the word most attended to by the highlighted object, and the highlighted region of an object-driven attention map is the object shape.
As analyzed in 3.2.1, the reliability of the grid attention weights depends on the quality of the previous layer’s image region features. This makes the grid attention unreliable at times, especially for complex scenes. For example, the grid attention weights in Fig. 8 are unreliable because they are scattered (e.g., the attention map for “man”) and inaccurate. This is not a problem for the object-driven attention mechanism, because its attention weights are calculated directly from the embedding vectors of words and class labels and are independent of image features. Moreover, as shown in Fig. 4 and Equ. (6), the impact region of the object-driven attention context vector is bounded by the object shapes, which further enhances its semantic meaningfulness. As a result, the object-driven attention significantly improves the visual quality of the generated images, as demonstrated in Fig. 8. The performance can be further improved if the semantic layout generation is improved; in the extreme case, Obj-GAN based on the ground-truth layout has the best visual quality (the rightmost column of Fig. 8) and the best quantitative evaluation (Table 1).
Obj-GAN w/ SN vs. Obj-GAN. We present the comparison between discriminators with and without spectral normalization in Table 1 and Fig. 6. We observe no obvious improvement in visual quality, and slightly worse quantitative metrics. We show more results and discussions in Appendix A.
Comparison with previous methods. To compare Obj-GAN with previous methods, we trained Obj-GAN-SOTA, initialized from the Obj-GAN models in the ablation study, with a larger batch size for 10 more epochs. To evaluate AttnGAN on FID, we conducted the evaluation with its officially released pre-trained model. Note that Sg2Im focuses on generating images from scene graphs and evaluates on a different split of COCO; we still include its results to reflect the broader context of the related topic. As shown in Table 1, Obj-GAN-SOTA significantly outperforms all previous methods. We notice that the larger batch size does boost the Inception score and R-precision, but does not improve FID. A possible explanation is that with a larger batch size, the DAMSM loss (in essence a ranking loss) in (11) plays a more important role and improves the Inception score and R-precision, but it does not focus on reducing the FID between the generated images and the real ones.
Generalization ability. We further investigate whether Obj-GAN merely memorizes the scenarios in COCO or indeed learns the relations between objects and their surroundings. To this end, we compose several descriptions that reflect novel scenarios unlikely to happen in the real world, e.g., a double-decker bus floating on top of a lake, or a cat catching a frisbee. We use Obj-GAN to synthesize images for these rare scenes. The results in Fig. 9 further demonstrate the good generalization ability of Obj-GAN.
5 Conclusion
In this paper, we have presented a multi-stage Object-driven Attentive Generative Adversarial Network (Obj-GAN) for synthesizing images of complex scenes from text descriptions. With a novel object-driven attention layer at each stage, our generators are able to utilize fine-grained word/object-level information to gradually refine the synthesized image. We also proposed Fast R-CNN based object-wise discriminators, each of which is paired with a conditional input of the generator and provides an object-wise discrimination signal for that condition. Obj-GAN significantly outperforms previous state-of-the-art GAN models on various metrics on the large-scale, challenging COCO benchmark. Extensive experiments demonstrate the effectiveness and generalization ability of Obj-GAN on text-to-image generation for complex scenes.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and vqa. CVPR, 2018.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
-  E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
-  R. B. Girshick. Fast R-CNN. In ICCV, 2015.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
-  K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, 2017.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a nash equilibrium. NIPS, 2017.
-  S. Hong, D. Yang, J. Choi, and H. Lee. Inferring semantic layout for hierarchical text-to-image synthesis. CVPR, 2018.
-  Q. Huang, P. Zhang, D. O. Wu, and L. Zhang. Turbo learning for captionbot and drawingbot. In NeurIPS, 2018.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
-  J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
-  K. Lee, X. Chen, G. Hua, H. Hu, and X. He. Stacked cross attention for image-text matching. ECCV, 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  S. Ma, J. Fu, C. W. Chen, and T. Mei. DA-GAN: Instance-level image translation by deep attention generative adversarial networks. In CVPR, 2018.
-  E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
-  T. Miyato and M. Koyama. cgans with projection discriminator. ICLR, 2018.
-  A. Mordvintsev, C. Olah, and M. Tyka. Deep dream, 2015, 2017.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
-  S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
-  S. E. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017.
-  T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
-  A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. NIPS, 2017.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
-  T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. CVPR, 2018.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
-  H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. TPAMI, 2018.
-  S. Zhang, H. Dong, W. Hu, Y. Guo, C. Wu, D. Xie, and F. Wu. Text-to-image synthesis via visual-memory creative adversarial network. In PCM, 2018.
Appendix A
A.1 Obj-GAN vs. the ablative versions
In this section, we show more images generated by our Obj-GAN and its ablative versions on the COCO dataset. Fig. 10 and Fig. 11 provide additional comparisons complementary to Fig. 6. We find no obvious improvement in visual quality when using spectral normalization.
A.2 Visualization of attention maps
A.3 Results based on the ground-truth layout
A.4 Bi-LSTM text encoder, DAMSM and R-precision
We use the deep attentive multi-modal similarity model (DAMSM) proposed in , which learns a joint embedding of image regions and the words of a sentence in a common semantic space. The fine-grained conditional loss encourages each sub-region of the generated image to match the corresponding word in the sentence.
The Bi-LSTM text encoder serves as the text encoder for both the DAMSM and the box generator (see A.5). It is a bi-directional LSTM that extracts semantic vectors from the text description. In the Bi-LSTM, each word corresponds to two hidden states, one for each direction; we concatenate these two hidden states to represent the semantic meaning of a word. The feature matrix of all words is denoted by $e \in \mathbb{R}^{D \times T}$, whose $i$-th column $e_i$ is the feature vector of the $i$-th word, where $D$ is the dimension of the word vector and $T$ is the number of words. Meanwhile, the last hidden states of the two directions are concatenated to form the global sentence vector, denoted by $\bar{e} \in \mathbb{R}^{D}$. We present the network architecture of the Bi-LSTM text encoder in Table 2.
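The way the two directional hidden-state sequences are combined into word features and a sentence vector can be sketched as follows (a minimal numpy illustration with invented function and variable names, not the paper's code; a real encoder would produce the hidden states with an LSTM):

```python
import numpy as np

def bilstm_word_and_sentence_features(h_fwd, h_bwd):
    """Combine the two directional hidden-state sequences of a Bi-LSTM.

    h_fwd, h_bwd: arrays of shape (T, d) -- hidden states aligned with the
    T words, for the forward and backward passes respectively.
    Returns:
      e:     (2d, T) word feature matrix; column i represents word i
      e_bar: (2d,)   global sentence vector from the final hidden states
    """
    # Each word is represented by its two directional hidden states, concatenated.
    e = np.concatenate([h_fwd, h_bwd], axis=1).T  # (2d, T)
    # The forward pass ends at the last word; the backward pass ends at the first.
    e_bar = np.concatenate([h_fwd[-1], h_bwd[0]])  # (2d,)
    return e, e_bar
```

Here the word-vector dimension is $D = 2d$, twice the per-direction hidden size, matching the concatenation described above.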
The image encoder is a convolutional neural network that maps images to semantic vectors. The intermediate layers of the CNN learn local features of different regions of the image, while the later layers learn global features of the image. More specifically, the image encoder is built upon the Inception-v3 model pre-trained on ImageNet. We first rescale the input image to $299 \times 299$ pixels. We then extract the local feature matrix $f \in \mathbb{R}^{768 \times 289}$ (reshaped from $768 \times 17 \times 17$) from an intermediate layer of Inception-v3. Each column of $f$ is the feature vector of a local image region; 768 is the dimension of the local feature vector, and 289 is the number of regions in the image. Meanwhile, the global feature vector $\bar{f} \in \mathbb{R}^{768}$ is extracted from the last average-pooling layer of Inception-v3. Finally, we convert the image features to the common semantic space of the text features by adding a perceptron layer, as shown in Eq. (12):

$$v = W f, \qquad \bar{v} = \bar{W} \bar{f},$$

where $v \in \mathbb{R}^{D \times 289}$ and its $j$-th column $v_j$ is the visual feature vector of the $j$-th image region, and $\bar{v} \in \mathbb{R}^{D}$ is the visual feature vector of the whole image. While $v_j$ is a local image feature vector corresponding to a word embedding, $\bar{v}$ is the global feature vector related to the sentence embedding. $D$ is the dimension of the multimodal (i.e., image and text) feature space. For efficiency, all parameters in the layers taken from Inception-v3 are fixed, and the parameters in the newly added layers are jointly learned with the rest of the networks.
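The projection of Eq. (12) amounts to two matrix multiplications. A minimal numpy sketch (the dimension D and the random "features" are illustrative placeholders, not values or code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256                            # assumed multimodal feature dimension
f = rng.normal(size=(768, 289))    # local Inception-v3 features (289 regions)
f_bar = rng.normal(size=(768,))    # global feature from the average-pooling layer

W = rng.normal(size=(D, 768)) * 0.01      # learnable local projection
W_bar = rng.normal(size=(D, 768)) * 0.01  # learnable global projection

v = W @ f              # (D, 289): column j is the visual vector of region j
v_bar = W_bar @ f_bar  # (D,): visual vector for the whole image
```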
The fine-grained conditional loss is designed to learn the correspondence between image regions and words. However, it is difficult to obtain manual annotations of such correspondences. Moreover, many words relate to concepts that cannot easily be visually defined, such as open or old. One possible solution is to learn the word-image correspondence in a semi-supervised manner, in which the only supervision is the correspondence between the entire image and the whole text description (a sequence of words).
We first calculate the similarity matrix for all possible pairs of words and image regions by Eq. (13):

$$s = e^{T} v,$$

where $s \in \mathbb{R}^{T \times 289}$ and $s_{i,j}$ is the similarity between the $i$-th word and the $j$-th image region.
Generally, a sub-region of the image is described by none or a few words of the text description, and it is unlikely to be described by the whole sentence. Therefore, we normalize the similarity matrix by Eq. (14):

$$\bar{s}_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=0}^{T-1} \exp(s_{k,j})}.$$
Second, we build an attention model to compute a context vector for each word (query). The context vector $c_i$ is a dynamic representation of the image regions related to the $i$-th word of the text description. It is computed as the weighted sum over all visual feature vectors, i.e., Eq. (15):

$$c_i = \sum_{j=0}^{288} \alpha_j v_j,$$

where we define the weight $\alpha_j$ via Eq. (16):

$$\alpha_j = \frac{\exp(\gamma_1 \bar{s}_{i,j})}{\sum_{k=0}^{288} \exp(\gamma_1 \bar{s}_{i,k})}.$$

Here, $\gamma_1$ is a factor that decides how much more attention is paid to the features of relevant regions when computing the context vector for a word.
Finally, we define the relevance between the $i$-th word and the image using the cosine similarity between $c_i$ and $e_i$, i.e., $R(c_i, e_i) = (c_i^{T} e_i) / (\lVert c_i \rVert \lVert e_i \rVert)$. The relevance between the entire image ($Q$) and the whole text description ($U$) is computed by Eq. (17):

$$R(Q, U) = \log \Big( \sum_{i=1}^{T-1} \exp\big(\gamma_2 R(c_i, e_i)\big) \Big)^{1/\gamma_2},$$

where $\gamma_2$ is a factor that determines how much to magnify the importance of the most relevant word-image pair. When $\gamma_2 \to \infty$, $R(Q, U)$ approximates $\max_{i} R(c_i, e_i)$.
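The attention and relevance computation of Eqs. (13)-(17) can be sketched in a few lines of numpy (a simplified single-example version with illustrative dimensions and our own function name; the actual model works on batches):

```python
import numpy as np

def word_region_attention(e, v, gamma1=4.0, gamma2=5.0):
    """Word-to-region attention and image-text relevance (DAMSM-style).

    e: (D, T) word features; v: (D, N) region features.
    gamma1/gamma2 are the attention/relevance sharpness factors.
    Returns the context vectors c (D, T) and the scalar relevance R(Q, U).
    """
    s = e.T @ v                                               # (T, N) similarities
    s_bar = np.exp(s) / np.exp(s).sum(axis=0, keepdims=True)  # normalize over words
    # Attention weights over regions for each word, then context vectors.
    a = np.exp(gamma1 * s_bar)
    a /= a.sum(axis=1, keepdims=True)                         # (T, N)
    c = v @ a.T                                               # (D, T)
    # Cosine relevance of each word with its context vector.
    r = (c * e).sum(axis=0) / (np.linalg.norm(c, axis=0) * np.linalg.norm(e, axis=0))
    # Smooth (log-sum-exp) aggregation over words, Eq. (17).
    R = np.log(np.exp(gamma2 * r).sum()) / gamma2
    return c, R
```

As gamma2 grows, the log-sum-exp term approaches the maximum per-word relevance, which is the limiting behavior described above.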
For a text-image pair, we compute the posterior probability of the text description $U$ matching the image $Q$ via Eq. (18):

$$P(U|Q) = \frac{\exp(\gamma_3 R(Q, U))}{\sum_{u=1}^{M} \exp(\gamma_3 R(Q, U_u))},$$

where $\gamma_3$ is a smoothing factor determined by experiments. $\{U_u\}_{u=1}^{M}$ denotes a minibatch of text descriptions, in which only one description matches the image $Q$; thus, for each image, there are $M-1$ mismatching text descriptions. The objective is to learn the model parameters by minimizing the negative log posterior probability that the images are matched with their corresponding text descriptions (the ground truth), i.e., Eq. (19):

$$\mathcal{L}_1^{w} = - \sum_{q=1}^{M} \log P(U_q \mid Q_q),$$

where 'w' stands for "word".
Symmetrically, we can compute Eq. (20):

$$\mathcal{L}_2^{w} = - \sum_{q=1}^{M} \log P(Q_q \mid U_q).$$
If we redefine Eq. (17) as $R(Q, U) = (\bar{v}^{T} \bar{e}) / (\lVert \bar{v} \rVert \lVert \bar{e} \rVert)$ and substitute it into Eq. (18), Eq. (19) and Eq. (20), we obtain the loss functions $\mathcal{L}_1^{s}$ and $\mathcal{L}_2^{s}$ (where 's' stands for "sentence") using the sentence embedding $\bar{e}$ and the global visual vector $\bar{v}$.
The fine-grained conditional loss is defined via Eq. (21):

$$\mathcal{L}_{\mathrm{DAMSM}} = \mathcal{L}_1^{w} + \mathcal{L}_2^{w} + \mathcal{L}_1^{s} + \mathcal{L}_2^{s}.$$
The DAMSM is pre-trained by minimizing $\mathcal{L}_{\mathrm{DAMSM}}$ using real image-text pairs. Since the size of the images used for pre-training the DAMSM is not limited by the size of the images that can be generated, real images of size $299 \times 299$ are used. Furthermore, the pre-trained DAMSM provides visually discriminative word features and a stable fine-grained conditional loss for the attentive generative network.
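The minibatch posterior and negative log-likelihood of Eqs. (18)-(20) can be sketched as follows (a simplified numpy version with an invented function name; we assume a precomputed matrix of pairwise relevances with matching pairs on the diagonal):

```python
import numpy as np

def damsm_matching_loss(R_matrix, gamma3=10.0):
    """Negative log posterior over a minibatch of matched image-text pairs.

    R_matrix[i, j] is the relevance R(Q_i, U_j) between image i and sentence j;
    the diagonal holds the matching (ground-truth) pairs.
    """
    logits = gamma3 * R_matrix
    # P(U_i | Q_i): softmax over sentences for each image (Eq. 18 direction).
    log_p_u_given_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Symmetric direction: softmax over images for each sentence (Eq. 20).
    log_p_q_given_u = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    M = R_matrix.shape[0]
    idx = np.arange(M)
    return -(log_p_u_given_q[idx, idx].sum() + log_p_q_given_u[idx, idx].sum())
```

When the diagonal relevances dominate the off-diagonal ones, both posteriors approach 1 and the loss approaches 0, which is the intended training signal.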
The R-precision score. The DAMSM model is also used to compute the R-precision score. If there are $R$ relevant documents for a query, we examine the top $R$ ranked retrieval results of a system; if $r$ of them are relevant, then by definition the R-precision (and in this case also the precision and recall) is $r/R$. More specifically, we use generated images to query their corresponding text descriptions. First, the image encoder and the Bi-LSTM text encoder learned in our pre-trained DAMSM are used to extract features of the generated images and the given text descriptions. Then, we compute the cosine similarities between the image features and the text features. Finally, we rank the candidate text descriptions for each image in descending similarity and take the top $R$ descriptions for computing the R-precision.
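This retrieval procedure can be sketched as follows (a minimal numpy version with our own function name; we assume each image has exactly one relevant description, stored at the same index):

```python
import numpy as np

def r_precision(img_feats, txt_feats, R=1):
    """Fraction of the top-R retrieved descriptions that are relevant.

    img_feats[i] and txt_feats[i] are the features of a matching pair;
    each generated image queries the pool of candidate descriptions.
    """
    # Cosine similarity between every image and every candidate description.
    a = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    b = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = a @ b.T
    hits = 0
    for i in range(len(a)):
        top = np.argsort(-sim[i])[:R]   # rank candidates by descending similarity
        hits += np.sum(top == i)        # the i-th description is the relevant one
    return hits / (len(a) * R)
```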
A.5 Network architectures for semantic layout generation
Box generator. We design our box generator by improving the one in  to be attentive. We denote the bounding box of the $t$-th object as $B_t$. We then formulate the joint probability of sampling from the box generator as

$$p(B_{1:T}) = \prod_{t=1}^{T} p(B_t \mid B_{1:t-1}).$$

We implement the class-label distribution as a categorical distribution, and the box-coordinate distribution as a mixture of quadravariate Gaussians. As described in , in order to reduce the parameter space, we decompose the box-coordinate probability as $p(x, y, w, h) = p(x, y)\, p(w, h \mid x, y)$, and approximate it with two bivariate Gaussian mixtures. The Gaussian mixture model (GMM) parameters for Eq. (22) are obtained from the outputs of the decoder LSTM. Given the text encoder's final hidden state and outputs, we initialize the decoder's initial hidden state with the former, and use the latter to compute the contextual input for the decoder:
where the first term is a learnable parameter, the second is the parameter of a linear transformation, and the remaining operators represent the dot product and the concatenation operation, respectively.
The GMM parameters are then calculated as follows:
where the outputs are the parameters of the GMM concatenated into a vector. We use the same Adam optimizer and training hyperparameters (i.e., learning rate and momentum terms) as in .
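Sampling a coordinate pair from one of these bivariate Gaussian mixtures can be sketched as follows (a minimal numpy illustration under our own parameterization: mixture weights, per-component means, standard deviations, and correlation coefficients; the actual parameters come from the decoder LSTM):

```python
import numpy as np

def sample_bivariate_gmm(pis, mus, sigmas, rhos, rng):
    """Draw one (x, y) sample from a K-component bivariate Gaussian mixture.

    pis: (K,) mixture weights; mus: (K, 2) means; sigmas: (K, 2) std devs;
    rhos: (K,) correlation coefficients.
    """
    k = rng.choice(len(pis), p=pis)           # pick a mixture component
    sx, sy, rho = sigmas[k, 0], sigmas[k, 1], rhos[k]
    cov = np.array([[sx * sx, rho * sx * sy],
                    [rho * sx * sy, sy * sy]])
    return rng.multivariate_normal(mus[k], cov)
```

Under the decomposition above, a full box would be sampled by first drawing $(x, y)$ from one such mixture and then $(w, h)$ from the second, conditioned on $(x, y)$.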
) - Instance Normalization - ReLU]. We found that using transposed convolutions leads to unstable training, which is reflected in frequent and severe grid artifacts. We therefore replace this upsampling block with the one in our image generator (see Table 3), switching batch normalization to instance normalization.
A.6 Network architectures for image generation
We present the network architecture of the image generators in Table 4 and the network architectures of the discriminators in Table 5, Table 6 and Table 7. They are built with the basic blocks defined in Table 3. We set the hyperparameters of the network structures as: , , , , , , , and .
We employ an Adam optimizer for the generators with learning rate , and . For each discriminator, we also employ an Adam optimizer with the same hyperparameters.
We design separate object-wise discriminators for small objects and for large objects. We specify that if the maximum of an object's width and height is greater than one third of the image size, the object is large; otherwise, it is small.
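The large/small rule can be written directly (a trivial helper with a name of our choosing):

```python
def is_large_object(w, h, image_size):
    """An object is 'large' if its longer side exceeds one third of the image."""
    return max(w, h) > image_size / 3.0
```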
A.7 Network architectures for spectral normalized projection discriminators
We combine our discriminators above with the spectral normalized projection discriminator in [18, 19]. The difference between the object-wise discriminator and the object-wise spectral normalized projection discriminator is illustrated in Figure 16. We present detailed network architectures of the spectral normalized projection discriminators in Table 8, Table 9 and Table 10, with basic blocks defined in Table 3.
|LSTM||, , , ,|
|Name||Operations / Layers|
|Interpolating ()||Nearest neighbor upsampling layer (up-scaling the spatial size by )|
|Upsampling ()||Interpolating () - convolution (stride , padding , decreasing channels to ) -|
|Batch Normalization (BN) - Gated Linear Unit (GLU).|
|Downsampling ()||In s: convolution (stride , increasing channels to ) - BN - LeakyReLU.|
|In s, the convolutional kernel size is . In the first block of s, BN is not applied.|
|Downsampling w/ SN ()||Convolution (spectral normalized, stride , increasing channels to ) - BN - LeakyReLU.|
|In the first block of s, BN is not applied.|
|Concat||Concatenate input tensors along the channel dimension.|
|Residual||Input [Reflection Pad (RPad) 1 - convolution (stride , doubling channels) -|
|Instance Normalization (IN) - GLU - RPad 1 - convolution (stride , keeping channels) - IN].|
|FC||At the beginning of s: fully connected layer - BN - GLU - reshape to 3D tensor.|
|FC w/ SN ()||Fully connected layer (spectral normalized, changing channels to ).|
|Outlogits||Convolution (stride , decreasing channels to ) - sigmoid.|
|Repeat ()||Copy a vector times.|
|Fmap Sum||Element-wise sum of the two input feature maps.|
|Fmap Mul||Element-wise product of the two input feature maps.|
|Avg Pool ()||Average pooling along the -th dimension.|
|Conv||In s: convolution (stride , padding , changing channels to ) - Tanh.|
|In s, convolution (stride , padding , changing channels to ) - BN - LeakyReLU.|
|Conv w/ SN||Convolution (spectral normalized, stride , keeping channels).|
|Conv w/ SN||Convolution (spectral normalized, stride , decreasing channels to ).|
|Conditioning augmentation that converts the sentence embedding to the conditioning vector :|
|fully connected layer - ReLU.|
|Grid attention module. Refer to the paper for more details.|
|Object-driven attention module. Refer to the paper for more details.|
|Label distribution module. Refer to the paper for more details.|
|Shape Encoder ()||RPad 1 - convolution (stride , decreasing channels to ) - IN - LeakyReLU.|
|Shape Encoder w/ SN ()||RPad 1 - convolution (spectral normalized, stride , decreasing channels to ) - IN - LeakyReLU.|
|ROI Encoder||Convolution (stride , padding , decreasing channels to ) - LeakyReLU.|
|ROI Encoder w/ SN||Convolution (spectral normalized, stride , padding , decreasing channels to ) - LeakyReLU.|
|ROI Align ()||Pooling feature maps for ROI.|
|Stage||Name||Input Tensors||Output Tensors|
|FC||100-dimensional , and|
|Shape Encoder ()||()|
|Shape Encoder ()||()|
|Shape Encoder ()||()|
|Stage||Name||Input Tensors||Output Tensors|
|Concat - Conv||()|
|Outlogits (unconditional loss)|
|Outlogits (conditional loss)|