There are two main aspects to consider when approaching the text-to-image (T2I) task: the image generation quality and the image-text semantic consistency. The task can be modeled by a conditional Generative Adversarial Network (cGAN) [18, 5], where a Generator (G), conditioned on the encoded text features, generates the corresponding images, and a Discriminator (D) determines the authenticity of the images, conditioned on the text.
To address the first aspect, Zhang et al. [37] introduced StackGAN by letting G generate images at multiple resolutions, and adopting multiple Ds to jointly refine G from coarse to fine levels. StackGAN invokes a pre-trained Recurrent-Neural-Network (RNN) [7, 17] and considers the final hidden state as the sentence-level feature representation to condition the image generation. To approach the second aspect, Xu et al. [35] take StackGAN as the base model and propose AttnGAN. Apart from sentence-level features, AttnGAN incorporates word embeddings into the generation and consistency-checking processes. The RNN is pre-trained with a Convolution-Neural-Network (CNN) image encoder to obtain the Deep-Attentional-Multimodal-Similarity-Model (DAMSM), which better aligns the image features and word embeddings. An attention mechanism is then derived to relate image regions to corresponding words.
While the T2I performance continues to advance [23, 39, 2, 12, 36, 6], the follow-up methods all share two common traits. First, they all adopt the same stacked model structure of G, along with multiple Ds. Second, they all rely on the pre-trained DAMSM from AttnGAN for image-text consistency.
However, these methods fail to take advantage of recent advances in both the GAN and NLP literature. On the one hand, ProgressiveGAN and StyleGAN [8, 9] achieve the new state-of-the-art image generation quality from a similar multi-resolution perspective. On the other hand, the Transformer architecture [31, 4, 24] has engendered substantial gains across a wide range of challenging NLP tasks.
This fast-progressing research motivates us to explore new opportunities for text-to-image modeling. In particular, as StackGAN and follow-up work all depend on a pre-trained text encoder for word and sentence embeddings, and an additional image encoder to ascertain image-text consistency, two important questions arise. First, can we skip the pre-training step and train the text encoder as a part of D? Second, can we abandon the extra CNN and use D as the image encoder? If the answers are affirmative, two further questions can be explored. When D and the text encoder are jointly trained to match the visual and text features, can we obtain an image captioning model from them? Furthermore, since D is trained to extract text-relevant image features, will it benefit G in generating more semantically consistent images?
With these questions in mind, we present the Text and Image Mutual-translation adversarial nEtwork (TIME). To the best of our knowledge, this is the first work that jointly handles both text-to-image and image captioning in a single model. Our contributions can be summarized as follows:
- We propose an efficient model for T2I tasks trained in an end-to-end fashion, without any need for pre-trained models or complex training strategies.
- We present an aggregated generator that only outputs an image at the finest scale to eliminate the need for multiple Ds, as required for StackGAN-like generators.
- We design both text-to-image and image captioning Transformers, with a more comprehensive image-text inter-domain attention mechanism for bidirectional text-image mutual translation.
- We introduce two technical contributions: 2-D positional encoding for a better attention operation and the annealing hinged loss to dynamically balance the learning paces of G and D.
- We show that the commonly used sentence-level text features are no longer needed in TIME, which leads to a more controllable T2I generation that is hard to achieve in previous models.
2 Related Work and Background
Generating realistic high-resolution images from text descriptions is an important task with a wide range of real-world applications, such as reducing the repetitive tasks in story-boarding, decorative painting design, and film or video game scene editing. Recent years have witnessed substantial progress in these directions [15, 20, 26, 25, 37, 35] owing largely to the success of deep generative models [5, 10, 30]. Reed et al. [25] first demonstrated the superior ability of conditional GANs to synthesize plausible images from text descriptions. Zhang et al. [37, 38] presented StackGAN, where several GANs are stacked to generate images at different resolutions. AttnGAN [35] further equips StackGAN with an attention mechanism to model multi-level textual conditions.
Subsequent work [23, 39, 2, 12, 36, 6] has built on StackGAN and AttnGAN. MirrorGAN [23] incorporates a pre-trained text re-description RNN to better align the images with the given texts. DMGAN [39] relies on a dynamic memory module on G to adaptively fuse word embeddings into image features. ControlGAN [12] uses channel-wise attention in G, and can thus generate shape-invariant images when changing the text descriptions. SDGAN [36] includes a contrastive loss to strengthen the image-text correlation. In the following, we describe the key components of StackGAN and AttnGAN.
2.1 StackGAN as the Image Generation Backbone
StackGAN adopts a coarse-to-fine structure that has shown substantial success on the T2I task. In practice, the generator G takes three steps to produce the final image, as shown in Figure 2-(a). In stage-I, a low-resolution image with coarse shapes is generated. In stage-II and III, the feature maps are further up-sampled to produce more detailed images with better textures. Three discriminators (Ds) are required to train G, where the lowest-resolution D guides G with regard to coarse shapes, while localized defects are refined by the higher-resolution Ds.
However, there are several reasons for seeking an alternative architecture. First, the multi-D design is memory-demanding and has a high computational burden during training. As the image resolution increases, the respective higher-resolution Ds can raise the cost dramatically. Second, it is hard to balance the effects of the multiple Ds. Since the Ds are trained on different resolutions, their learning paces diverge, and such differences can result in conflicting signals when training G. In our experiments, we notice a consistently slower convergence rate of the stacked structure compared to a single-D design.
2.2 Dependence on Pre-trained Modules
While the overall framework for T2I models resembles a conditional GAN (cGAN), multiple modules have to be pre-trained in previous works. As illustrated in Figure 2-(b), AttnGAN requires a DAMSM, which includes an Inception-v3 model [28] that is first pre-trained on ImageNet, and then used to pre-train an RNN text encoder. MirrorGAN further proposes the STREAM model as shown in Fig. 2-(c), which is pre-trained for image captioning.
Such pre-training has a number of drawbacks. The first and foremost is the computational burden. Second, the additional pre-trained CNN for image feature extraction introduces a significant amount of weights, which can be avoided as we shall later show. Third, using pre-trained modules leads to extra hyper-parameters that require dataset-specific tuning. For instance, in AttnGAN, the weight for the DAMSM loss can range from 0.2 to 100 across different datasets. While these pre-trained models boost the performance, empirical studies [23, 37] show that they do not converge if jointly trained with the cGAN.
2.3 The Image-Text Attention Mechanism
The attention mechanism employed in AttnGAN can be interpreted as a simplified version of the Transformer architecture [31], where the three-dimensional image features in the CNN are flattened into a two-dimensional sequence. This process is demonstrated in Fig. 4-(a), where an image-context feature is derived via an attention operation on the reshaped image feature and the sequence of word embeddings. The resulting image-context features are then concatenated to the image features to generate the images. We will show that a full-fledged version of the Transformer can further improve the performance without a substantial additional computational burden.
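The flatten-then-attend operation described above can be illustrated with a minimal sketch in plain Python. It assumes that the image regions and the word embeddings have already been projected to the same dimensionality; the function names `flatten_feature_map` and `image_context` are hypothetical and are not taken from the AttnGAN code base.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def flatten_feature_map(fmap):
    """Flatten a (C, H, W) nested list into H*W region vectors of length C."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[k][y][x] for k in range(c)] for y in range(h) for x in range(w)]

def image_context(regions, words):
    """For each image region, return an attention-weighted sum of word embeddings.

    Assumption: a single attention map, with the same word vector used both
    for scoring against the region and for building the context.
    """
    context = []
    for q in regions:
        scores = [sum(a * b for a, b in zip(q, w)) for w in words]
        weights = softmax(scores)
        context.append([sum(wt * w[k] for wt, w in zip(weights, words))
                        for k in range(len(q))])
    return context
```

Each returned context vector would then be concatenated with its region feature before the next generation step.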
In this section, we present our proposed approach, starting with the model structure and training schema, followed by details of the architecture and method.
The upper panel in Fig. 3 shows the overall structure of TIME, consisting of a Text-to-Image Generator G and an Image-Captioning Discriminator D. We treat a text encoder and a text decoder as parts of D. G's Text-Conditioned Image Transformer accepts a series of word embeddings from the text encoder and produces an image-context representation for G to generate a corresponding image. D accepts three kinds of input pairs, consisting of captions alongside: (a) matched real images; (b) randomly mismatched real images; and (c) generated images from G. D emits three outputs: the predicted conditional and unconditional authenticity scores of the image, and the predicted captions from the given images.
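As a rough illustration of the three kinds of input pairs, the sketch below builds them for one batch. The function name and the string labels are hypothetical, and the mismatched pairs are formed by shifting the batch by one position, which is an assumed stand-in for random mismatching that only works if neighbouring samples carry different captions.

```python
def build_discriminator_pairs(real_images, captions, fake_images):
    """Return (image, caption, kind) triples for one training batch."""
    n = len(captions)
    if not (len(real_images) == n and len(fake_images) == n and n > 1):
        raise ValueError("need equally sized batches with at least two samples")
    pairs = []
    for i in range(n):
        # (a) real image with its own caption
        pairs.append((real_images[i], captions[i], "matched_real"))
        # (b) real image that belongs to a different caption (shift by one)
        pairs.append((real_images[(i + 1) % n], captions[i], "mismatched_real"))
        # (c) image generated from this caption
        pairs.append((fake_images[i], captions[i], "generated"))
    return pairs
```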
3.1 Text-to-Image Generator
3.1.1 Text-Conditioned Image Transformer
To condition the generation process on text descriptions in G, we propose to adopt the more comprehensive Transformer model [31] to replace the attention mechanism used in AttnGAN. To this end, we present the Text-Conditioned Image Transformer (TCIT), illustrated in Fig. 3-(a), as the text conditioning module for G. In TCIT, self-attention is first applied on the image features, which are then paired with the word embeddings to obtain the image-context feature representation via multi-head attention. This entire operation is considered a single layer, and we employ a two-layer design in TIME. As demonstrated in Fig. 4, there are three main differences between TCIT and the attention from AttnGAN.
First, while in AttnGAN the projected key from the word embeddings is used both for matching with the query and for calculating the image-context feature, TCIT has two separate linear layers to project the word embeddings, as illustrated in Fig. 4-(b). We show that such separation is beneficial for the T2I task. As the key focuses on matching with the query, the other projection obtains the value, which can instead be better optimized towards a better image-context feature.
Second, TCIT adopts a multi-head structure as shown in Fig. 4-(c). Unlike in AttnGAN, where only one attention map is calculated, the Transformer replicates the attention module, thus adding more flexibility for each image region to account for multiple words. The benefit of applying multi-head attention to T2I is intuitive, as a given region in an image may be described by multiple words.
Third, TCIT achieves better performance by stacking the attention layers in a residual structure, while AttnGAN adopts it only as one layer. As shown in Fig. 4-(d), we argue that provisioning multiple attention layers and recurrently revising the image-context feature enables an improved T2I ability.
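The three differences (separate key and value projections, multiple heads, and residual stacking) can be sketched together in plain Python. This is an illustrative sketch under stated assumptions rather than the TIME implementation: it omits the image self-attention step, layer normalization, and any output projection, and `init_layer`, `cross_attention_layer`, and `stacked_attention` are hypothetical names.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(dim, rng):
    """Hypothetical initializer: three independent (dim x dim) projections."""
    def rand():
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return {"wq": rand(), "wk": rand(), "wv": rand()}

def cross_attention_layer(img, words, params, n_heads):
    """img: N x d region features (queries); words: T x d word embeddings."""
    head_dim = len(img[0]) // n_heads
    q = matmul(img, params["wq"])
    k = matmul(words, params["wk"])  # used only for matching with the query
    v = matmul(words, params["wv"])  # used only for building the context
    out = []
    for i, region in enumerate(img):
        context = []
        for h in range(n_heads):  # every head computes its own attention map
            sl = slice(h * head_dim, (h + 1) * head_dim)
            scores = [sum(a * b for a, b in zip(q[i][sl], kj[sl])) / math.sqrt(head_dim)
                      for kj in k]
            weights = softmax(scores)
            context += [sum(w * vj[sl][c] for w, vj in zip(weights, v))
                        for c in range(head_dim)]
        # residual connection: the context revises the region feature
        out.append([x + c for x, c in zip(region, context)])
    return out

def stacked_attention(img, words, layers, n_heads):
    """Apply several attention layers in sequence."""
    for params in layers:
        img = cross_attention_layer(img, words, params, n_heads)
    return img
```

For example, `stacked_attention(img, words, [init_layer(8, rng), init_layer(8, rng)], n_heads=4)` mirrors a 4-head, 2-layer configuration on 8-dimensional features.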
Aggregated structure To reduce the number of weights for the StackGAN structure, we present the design of an aggregated Generator from the same multi-resolution perspective. As shown in the upper panel of Fig. 3, G outputs an image only at the finest level. Specifically, G still yields RGB outputs at multiple resolutions. However, instead of being treated as individual images at different scales, these RGB outputs are re-scaled to the highest resolution and added together to obtain a single aggregated image output. Therefore, only one D is needed to train G. D remains able to perceive an image at multiple scales via residual blocks, in which skip connections (formulated by a convolution) can directly pass down-sampled low-level visual features to deeper layers.
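A minimal sketch of the aggregation step follows, assuming square single-channel outputs whose sizes divide the largest one and nearest-neighbour re-scaling. The interpolation mode is not stated in the text, so it is an assumption here, and the function names are hypothetical.

```python
def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling of an H x W single-channel image."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def aggregate(outputs):
    """Rescale every output to the largest resolution and sum element-wise."""
    target = max(len(o) for o in outputs)
    total = [[0.0] * target for _ in range(target)]
    for o in outputs:
        up = upsample_nearest(o, target // len(o))
        for y in range(target):
            for x in range(target):
                total[y][x] += up[y][x]
    return total
```

Because the sum is a single image at the finest scale, it can be scored by one discriminator.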
3.2 Image-Captioning Discriminator
We treat the text encoder and text decoder as a part of our D. Specifically, the text encoder is a Transformer that first maps the word indices into an embedding space, and then adds contextual information to the embeddings via self-attention operations. To train the text decoder to actively generate text descriptions of an image, an attention mask is applied on the input of the text encoder, such that each word can only attend to the words preceding it in a sentence. The text decoder is a Transformer decoder [24] that performs image captioning by predicting the next word's probability distribution from the masked word embeddings (considering only previous words) and the image features.
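The attention mask can be sketched as a lower-triangular boolean matrix. The sketch assumes the common convention in which a word may also attend to itself; the text only says "preceding", so this detail and the function names are assumptions.

```python
import math

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]
```

With this mask, the attention weights of the first word fall entirely on itself, regardless of how large the scores of later words are.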
Image-Captioning Transformer In contrast to TCIT, where the image features are revised by the word embeddings, the inverse operation is leveraged for the image captioning task. As shown in Fig. 3-(b), we design the Image-Captioning Transformer (ICT), which first applies self-attention on the word embeddings, and then revises the embeddings by attending to the most relevant image features along the spatial regions. ICT is used in D for the image captioning task. In TIME, we find that a simple 4-layer 4-head ICT is sufficient to obtain high-quality captions and facilitate the consistency checking in the T2I task.
Conditional Image Text Matching We observe that a basic convolution design already succeeds in measuring the image-text consistency. Therefore, to provide a basic conditional restriction, we simply reshape the word embeddings into the same shape as the image feature-maps, and concatenate them into an image-context feature as illustrated in Fig. 3-(c). In TIME, such a naïve operation works surprisingly well and we only use two further convolutional layers to derive the consistency score from the image-context feature.
3.3 2-D Positional Encoding for Image Features
When we reshape the image features for the attention operation, there is no way for the Transformer to discern spatial information from the flattened features. To take advantage of coordinate signals, we propose 2-D positional encoding as a counterpart to the 1-D positional encoding in the Transformer [31].
The encoding at each position has the same dimensionality as the channel size of the image feature, and is directly added to the reshaped image feature. The first half of the dimensions encodes the y-axis positions and the second half encodes the x-axis positions, with sinusoidal functions of different frequencies.
Here, x and y are the coordinates of each pixel location, and the dimension index runs along the channel. Such 2-D encoding ensures that closer visual features have a more similar representation compared to features that are spatially more remote from each other. An example from a trained TIME feature space is visualized in Fig. 5. In practice, we apply 2-D positional encoding on the image features for both TCIT and ICT.
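One way to realize such an encoding is sketched below. It assumes the sine/cosine pairing and the base of 10000 from the 1-D Transformer encoding, applied to each axis with half of the channels; the exact frequencies used in TIME are not given in the text, so this is only an illustration, and `positional_encoding_2d` is a hypothetical name.

```python
import math

def positional_encoding_2d(channels, height, width):
    """Return a (height, width, channels) nested list of encodings.

    The first half of the channels depends only on y, the second half only on x.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4 (sin/cos pairs per axis)")
    half = channels // 2
    # Assumed frequency schedule, borrowed from the 1-D Transformer encoding.
    freqs = [1.0 / (10000 ** (2 * i / half)) for i in range(half // 2)]

    def axis_code(pos):
        code = []
        for f in freqs:
            code += [math.sin(pos * f), math.cos(pos * f)]
        return code

    return [[axis_code(y) + axis_code(x) for x in range(width)]
            for y in range(height)]
```

The encoding at position (y, x) would be added channel-wise to the image feature at that location before the features are flattened for attention.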
Formally, D produces three kinds of outputs: the image feature at a fixed resolution; the unconditional image real/fake score; and the conditional image real/fake score. The text decoder predicts the next-word distribution from this image feature and the masked word embeddings. Finally, D, the text encoder, and the text decoder are trained to jointly minimize objectives defined over these outputs.
3.4.1 Hinged Image-Text Matching Loss
During training, we find that G can learn a good semantic visual translation at very early iterations. As shown in Fig. 6, while the convention is to train the model for 600 epochs on the CUB dataset, we observe that the semantic features begin to emerge in the generated images as early as after 20 epochs. Thus, we argue that it is not ideal to penalize the generated images by the conditional loss on D in a static manner. Since a generated image is already very consistent with the given text, if we let D consider an already well-matched input as inconsistent, this may confuse D and in turn hurt the consistency-checking performance.
Therefore, we revise the conditional loss for D in Eqs. (3)-(5). We employ a hinged loss [13, 29] and dynamically anneal the penalty on the generated images according to how confidently D predicts the matched real pairs. Here, a stop-gradient operator denotes that the gradient is not computed for the enclosed function, and the penalty is scaled by an annealing factor. The hinged loss ensures that D yields a lower score on generated images compared to real ones, while the annealing term ensures that D penalizes generated images sufficiently in early epochs.
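The idea can be illustrated with one possible form of an annealed hinge penalty. This is an assumed formulation written for illustration, not the loss from Eqs. (3)-(5): the margin of 1, the use of the mean real score as a pivot, and an annealing factor that grows from 0 to 1 over training are all assumptions, and the function name is hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def annealed_hinge_loss(real_scores, mismatch_scores, fake_scores, anneal):
    """Conditional hinge loss for the discriminator on one batch of scores.

    anneal: assumed to grow from 0.0 (early training) to 1.0 (late training).
    """
    # Pivot: how confidently matched real pairs are scored. In an autograd
    # framework this value would be detached so that no gradient flows through it.
    pivot = mean(real_scores)
    loss_real = mean([max(0.0, 1.0 - s) for s in real_scores])
    loss_mismatch = mean([max(0.0, 1.0 + s) for s in mismatch_scores])
    # Generated pairs are penalized only when they score above the annealed pivot.
    loss_fake = mean([max(0.0, s - anneal * pivot) for s in fake_scores])
    return loss_real + loss_mismatch + loss_fake
```

Under these assumptions, a generated pair with score 1.0 is penalized when `anneal` is 0 but not once the threshold has risen to the real-pair score of 2.0, so the penalty is strict early and relaxes as training progresses.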
On the other side, G considers random noise and word embeddings from the text encoder as inputs, and is trained to generate images that can fool D into giving high scores on authenticity and semantic consistency with the text. Moreover, since D can now caption the images, G is also encouraged to make D reconstruct the same sentences as provided as input. Thus, the objectives of G combine these adversarial terms with the captioning term.
Note that the text encoder is only trained with D. Hence, the word embeddings are only optimized towards making it easier for D to check the image-text consistency and predict the correct captions. In our experiments, we find that such a setting works fairly well, and G is able to catch up with D and produce good generations.
In this section, we evaluate the proposed model from both the text-to-image and image-captioning directions, and analyze each module’s effectiveness individually. Moreover, we highlight the desirable property of TIME being a more controllable generator compared to other T2I models.
Experiments are conducted on two commonly used datasets: CUB [33] (8,855 images for training and 2,933 images for validating) and MS-COCO [14] (80k images for training and 40k images for validating). We train the models on the training set and benchmark them on the validation set. Following the same convention as in previous T2I works [35, 23, 39], we measure the image quality by Inception Score [27] and the image-text consistency by R-precision [35]. Our work is implemented in PyTorch [22], and all the code will be published.
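For readers unfamiliar with R-precision, the sketch below shows the top-1 variant commonly used in this line of work: for each generated image, the true caption competes against a set of mismatched captions, ranked by cosine similarity between image and text features. The feature extractor and the number of distractors are not specified here, so the inputs are assumed to be precomputed feature vectors, and `r_precision_top1` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def r_precision_top1(image_feats, true_text_feats, distractor_text_feats):
    """Fraction of images whose true caption outranks all of its distractors."""
    hits = 0
    for img, true_txt, distractors in zip(image_feats, true_text_feats,
                                          distractor_text_feats):
        true_sim = cosine(img, true_txt)
        if all(true_sim > cosine(img, d) for d in distractors):
            hits += 1
    return hits / len(image_feats)
```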
4.1 A More Controllable G without Sentence-Level Embedding
Most previous T2I models rely on a sentence-level embedding as a vital conditioning factor for G [37, 35, 23, 39, 12]. Specifically, the sentence embedding is concatenated with the noise as the input for G, and is leveraged to compute the conditional authenticity of the images in D. Sentence embeddings are preferred over word embeddings, as the latter lack context and because semantic concepts are often expressed in multiple words.
However, since the sentence embedding is a part of the input alongside the noise, any slight change in the sentence embedding can lead to major visual changes in the resulting images, even when the noise is fixed. This is undesirable when we like a generated image but want to slightly revise it by altering the text description. Examples are given in Fig. 7-(a), where changing just a single word leads to unpredictably large changes in the image. In contrast, since we adopt the Transformer as the text encoder, where the word embeddings already come with contextual information, the sentence embedding is no longer needed in TIME. Via our Transformer text encoder, the same word in different sentences or at different positions will have different embeddings. As a result, the word embeddings are sufficient to provide semantically accurate information.
As shown in Fig. 7-(b) and (c), when changing the captions while fixing the noise, TIME shows a more controllable generation. While previous works [12, 11] approach such controllability with great effort, including a channel-wise attention and extra content-wise perceptual losses, TIME naturally enables fine-grained manipulation of synthetic images via their text descriptions.
4.2 Backbone Model Structure
Table 1 compares the performance of the StackGAN structure and our proposed “aggregating” structure in a T2I-only context. AttnGAN as the T2I backbone has been revised by recent advances in the GAN literature [39, 23]. For instance, Zhu et al. [39] adopted spectral normalization [19], which directly results in a performance boost. To keep the backbone updated, we also brought in new advances from recent years. In the column names, “+new” means we train the model with the latest GAN techniques, including an equalized learning rate [8], a style-based generator [9], and R-1 regularization [16]. “Aggr” means we remove the “stacked” G and multiple Ds and replace them with the proposed aggregated G and a single D. To compare the computing cost, we list the relative training times of all models with respect to StackGAN. All models are trained with the optimal hyper-parameter settings from [35] on the same GPU.
Table 1 columns: StackGAN w/o stack | StackGAN | Aggr GAN | Aggr GAN +new | AttnGAN w/o stack | AttnGAN | Aggr AttnGAN +new
In Table 1, our aggregated structure achieves the best performance/computing-cost ratio in both the image quality and the image-text consistency. Moreover, we find that the abandoned lower-resolution Ds in StackGAN have limited effect on image-text consistency. Instead, the image-text consistency appears more related to the generated image quality, as a higher IS always yields a better R-precision. It is worth noting that the last column already performs similarly to several of the latest T2I models that are based on AttnGAN.
4.3 Attention Mechanisms
We conducted experiments to explore the best attention settings for the T2I task from the mechanisms discussed in Section 3.1. Table 2 lists the settings we tested, where all the models are configured the same based on AttnGAN, except for the attention mechanisms used in G.
In particular, column 1 shows the baseline performance that employs the basic attention operation, described in Fig. 4-(a), from AttnGAN. The following columns show the results of using the Transformer illustrated in Fig. 4-(d) with different numbers of heads and layers (e.g., Tf-h4-l2 means a Transformer with 4 heads and 2 layers). According to the results, a Transformer with a more comprehensive attention yields a better performance than the baseline. However, when increasing the number of layers and heads beyond a threshold, a clear performance degradation emerges on the CUB dataset. We hypothesize that the optimal numbers of heads and layers depend on the dataset, where the 4-head 2-layer setting is the sweet spot for the CUB dataset. Intuitively, the increased parameters, as shown in the last two columns, could make the model harder to converge, and more susceptible to overfitting the training data.
4.4 Comparison with State-of-the-Art and Ablation Study
On CUB, TIME yields a more consistent image synthesis quality, while AttnGAN is more likely to generate failure samples. On MS-COCO, where the images are much more diverse and complicated, TIME is still able to generate the essential content that is consistent with the given text. Note that, although we do not particularly tune TIME's hyper-parameters for MS-COCO (such as the Transformer settings and weights for loss functions), the T2I performance of TIME is still competitive. The overall performance of TIME demonstrates its effectiveness, given that it also provides image captioning besides T2I, and does not rely on any pre-trained modules.
Importantly, TIME is a counterpart to AttnGAN and to the aforementioned revisions of it, with fundamental differences (no pre-training, no extra CNN/RNN modules). The other compared models all improve incrementally over AttnGAN, with orthogonal contributions that could also be incorporated into TIME.
As shown in Table 3, TIME demonstrates competitive performance on MS-COCO, and sets the new state-of-the-art Inception Score on CUB. Unlike the other models that require a well pre-trained language module and an Inception-v3 image encoder, TIME itself is sufficient to learn the cross-modal relationships between image and language. Regarding the image-text consistency performance, the visual variance is larger in MS-COCO, and thus it is easier for the models to generate matched images from text descriptions. Therefore, all models obtain a better R-Precision score on MS-COCO compared to CUB, while TIME is among the top performers on both datasets.
Table 4 columns: - img captioning | Baseline | - Sentence emb | + 2D-Pos Encode | + Hinged loss
Table 4 provides an ablation study. We take the model described up to Section 3.2 as the baseline, and perform cumulative experiments on it. First, we remove the image captioning text decoder to show its impact on the T2I direction. Then, we show that dropping the sentence-level embedding does not hurt the performance, while adding 2-D positional encoding brings improvements in both image-text consistency and the overall image quality. Lastly, the hinged loss releases D from a potentially conflicting signal, and therefore lets D provide a better objective for G, resulting in a substantial boost in image quality.
4.5 Text Encoder and Image-captioning Text Decoder
While training under the cGAN framework, we show that our text encoder successfully acquires semantics. In Fig. 8-(a), words with similar meanings reside close to each other, such as “bill” and “beak”, “belly” and “breast”. Moreover, “large” ends up close to “red”, as the latter often applies to large birds, while “small” is close to “brown” and “grey”, which often apply to small birds.
Apart from a strong T2I performance, D becomes a stand-alone image captioning model after training. In Table 5, we report the standard metrics [21, 32, 1] for a comprehensive evaluation of TIME's image captioning performance. According to the results, TIME performs better than the pre-trained captioning model used in MirrorGAN, which is the mainstream “CNN encoder RNN decoder”-based captioning model [34]. The superior captioning performance of TIME also explains its better T2I performance, as the pre-trained text decoder does not perform well enough to provide good conditioning information for G. We include the result of TIME trained without the image generation part (i.e., trained only as an image captioning model). It suggests that the captioning performance does not benefit from adversarial training with G, but also shows that the adversarial training does not hurt the captioning performance. This reveals a promising area for future research, aimed at improving the performance in both directions (text-to-image and image-captioning) under the single TIME framework.
In this paper, we proposed the Text and Image Mutual-translation adversarial nEtwork (TIME), a unified framework trained with an adversarial schema that accomplishes both the text-to-image and image-captioning tasks. While previous work in the T2I field requires pre-training several supportive modules, TIME establishes the new state-of-the-art T2I performance on the CUB dataset without pre-training. Meanwhile, the joint process of learning both a text-to-image and an image-captioning model fully harnesses the power of GANs (since in related work, D is typically abandoned after training G), yielding a promising image-captioning performance using D. TIME bridges the gap between the visual and language domains, unveiling the immense potential of mutual translations between the two modalities within a single model.
References
- [1] (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72.
- [2] (2019) Dualattn-gan: text to image synthesis with dual attentional generative adversarial network. IEEE Access 7, pp. 183706–183716.
- [3] (2009) Imagenet: a large-scale hierarchical image database. pp. 248–255.
- [4] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [5] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
- [6] (2019) Semantic object accuracy for generative text-to-image synthesis. arXiv preprint arXiv:1910.13321.
- [7] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- [8] (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
- [9] (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
- [10] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [11] (2019) Dual adversarial inference for text-to-image synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7567–7576.
- [12] (2019) Controllable text-to-image generation. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 2063–2073.
- [13] (2017) Geometric gan. arXiv preprint arXiv:1705.02894.
- [14] (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755.
- [15] (2015) Generating images from captions with attention. arXiv preprint arXiv:1511.02793.
- [16] (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406.
- [17] (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
- [18] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- [19] (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
- [20] (2017) Plug & play generative networks: conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477.
- [21] (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318.
- [22] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- [23] (2019) Mirrorgan: learning text-to-image generation by redescription. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514.
- [24] (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
- [25] (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
- [26] (2017) Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2912–2921.
- [27] (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242.
- [28] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
- [29] (2017) Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896.
- [30] (2016) Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798.
- [31] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
- [32] (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575.
- [33] (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
- [34] (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057.
- [35] (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324.
- [36] (2019) Semantics disentangling for text-to-image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2327–2336.
- [37] (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 5907–5915.
- [38] (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1947–1962.
- [39] (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5802–5810.