TIME: Text and Image Mutual-Translation Adversarial Networks

by   Bingchen Liu, et al.

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce the image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design a hinged and annealing conditional loss that dynamically balances the adversarial learning. In our experiments, TIME establishes the new state-of-the-art Inception Score of 4.88 on the CUB dataset, and shows competitive performance on MS-COCO on both text-to-image and image captioning tasks.


1 Introduction

There are two main aspects to consider when approaching the text-to-image (T2I) task: the image generation quality and the image-text semantic consistency. The task can be modeled by a conditional Generative Adversarial Network (cGAN) [18, 5], where a Generator (G), conditioned on the encoded text features, generates the corresponding images, and a Discriminator (D) determines the authenticity of the images, conditioned on the text.

To address the first aspect, Zhang et al. [37] introduced StackGAN by letting G generate images at multiple resolutions, and adopting multiple Ds to jointly refine G from coarse to fine levels. StackGAN invokes a pre-trained Recurrent-Neural-Network (RNN) [7, 17] and considers the final hidden state as the sentence-level feature representation to condition the image generation. To approach the second aspect, Xu et al. [35] take StackGAN as the base model and propose AttnGAN. Apart from sentence-level features, AttnGAN incorporates word embeddings into the generation and consistency-checking processes. The RNN is pre-trained with a Convolution-Neural-Network (CNN) image encoder to obtain the Deep-Attentional-Multimodal-Similarity-Model (DAMSM), which better aligns the image features and word embeddings. An attention mechanism is then derived to relate image regions to corresponding words.

Figure 1: Qualitative results of TIME on the CUB dataset: the generated images show a more consistent level of quality, and D works as a stand-alone image-captioning model.

While the T2I performance continues to advance [23, 39, 2, 12, 36, 6], the follow-up methods all share two common traits. First, they all adopt the same stacked model structure of G, along with multiple Ds. Second, they all rely on the pre-trained DAMSM from AttnGAN for image-text consistency.

However, these methods fail to take advantage of recent advances in both the GAN and NLP literature. On the one hand, ProgressiveGAN and StyleGAN [8, 9] achieve the new state-of-the-art image generation quality from a similar multi-resolution perspective. On the other hand, the Transformer architecture [31, 4, 24] has engendered substantial gains across a wide range of challenging NLP tasks.

This fast-progressing research motivates us to explore new opportunities for text-to-image modeling. In particular, as StackGAN and follow-up work all depend on a pre-trained text encoder for word and sentence embeddings, and an additional image encoder to ascertain image-text consistency, two important questions arise. First, can we skip the pre-training step and train the text encoder as a part of D? Second, can we abandon the extra CNN and use D as the image encoder? If the answers are affirmative, two further questions can be explored. When D and the text encoder are jointly trained to match the visual and text features, can we obtain an image captioning model from them? Furthermore, since D is trained to extract text-relevant image features, will it benefit G in generating more semantically consistent images?

With these questions in mind, we present the Text and Image Mutual-translation adversarial nEtwork (TIME). To the best of our knowledge, this is the first work that jointly handles both text-to-image and image captioning in a single model. Our contributions can be summarized as follows:

  1. We propose an efficient model for T2I tasks trained in an end-to-end fashion, without any need for pre-trained models or complex training strategies.

  2. We present an aggregated generator that only outputs an image at the finest scale to eliminate the need for multiple Ds, as required for StackGAN-like generators.

  3. We design both text-to-image and image captioning Transformers, with a more comprehensive image-text inter-domain attention mechanism for bidirectional text-image mutual translation.

  4. We introduce two technical contributions: 2-D positional encoding for a better attention operation and the annealing hinged loss to dynamically balance the learning paces of G and D.

  5. We show that the commonly used sentence-level text features are no longer needed in TIME, which leads to a more controllable T2I generation that is hard to achieve in previous models.

  6. Extensive experiments show that our proposed TIME achieves a superior performance on text-to-image tasks and promising results on image captioning. Fig. 1-(c) showcases the superior synthetic image quality from TIME, while Fig. 1-(e) demonstrates TIME’s image captioning performance.

2 Related Work and Background

Generating realistic high-resolution images from text descriptions is an important task with a wide range of real-world applications, such as reducing the repetitive tasks in story-boarding, decorative painting design, and film or video game scene editing. Recent years have witnessed substantial progress in these directions [15, 20, 26, 25, 37, 35] owing largely to the success of deep generative models [5, 10, 30]. Reed et al. [25] first demonstrated the superior ability of conditional GANs to synthesize plausible images from text descriptions. Zhang et al. [37, 38] presented StackGAN, where several GANs are stacked to generate images at different resolutions. AttnGAN [35] further equips StackGAN with an attention mechanism to model multi-level textual conditions.

Subsequent work [23, 39, 2, 12, 36, 6] has built on StackGAN and AttnGAN. MirrorGAN [23] incorporates a pre-trained text re-description RNN to better align the images with the given texts. DMGAN [39] relies on a dynamic memory module on G to adaptively fuse word embeddings into image features. ControlGAN [12] uses channel-wise attention in G, and can thus generate shape-invariant images when changing the text descriptions. SDGAN [36] includes a contrastive loss to strengthen the image-text correlation. In the following, we describe the key components of StackGAN and AttnGAN.

2.1 StackGAN as the Image Generation Backbone

Figure 2: (a) The StackGAN structure that serves as the backbone in SOTA T2I models [37, 35, 23, 39, 2, 12, 6]. (b)&(c) Representative models built upon StackGAN, with red parts indicating modules that require pre-training.

StackGAN adopts a coarse-to-fine structure that has shown substantial success on the T2I task. In practice, the generator G takes three steps to produce a full-resolution image, as shown in Figure 2-(a). In stage-I, a low-resolution image with coarse shapes is generated. In stage-II and III, the feature maps are further up-sampled to produce more detailed images with better textures. Three discriminators (Ds) are required to train G, where the lowest-resolution D guides G with regard to coarse shapes, while localized defects are refined by the higher-resolution Ds.

However, there are several reasons for seeking an alternative architecture. First, the multi-D design is memory-demanding and has a high computational burden during training. As the image resolution increases, the respective higher-resolution Ds can raise the cost dramatically. Second, it is hard to balance the effects of the multiple Ds. Since the Ds are trained on different resolutions, their learning paces diverge, and such differences can result in conflicting signals when training G. In our experiments, we notice a consistently slower convergence rate of the stacked structure compared to a single-D design.

2.2 Dependence on Pre-trained modules

While the overall framework for T2I models resembles a conditional GAN (cGAN), multiple modules have to be pre-trained in previous works. As illustrated in Figure 2-(b), AttnGAN requires a DAMSM, which includes an Inception-v3 model [28] that is first pre-trained on ImageNet [3], and then used to pre-train an RNN text encoder. MirrorGAN further proposes the STREAM model as shown in Fig. 2-(c), which is pre-trained for image captioning.

Such pre-training has a number of drawbacks, including, first and foremost, the computational burden. Second, the additional pre-trained CNN for image feature extraction introduces a significant amount of weights, which can be avoided as we shall later show. Third, using pre-trained modules leads to extra hyper-parameters that require dataset-specific tuning. For instance, in AttnGAN, the weight for the DAMSM loss can range from 0.2 to 100 across different datasets. While these pre-trained models boost the performance, empirical studies [23, 37] show that they do not converge if jointly trained with the cGAN.

2.3 The Image-Text Attention Mechanism

The attention mechanism employed in AttnGAN can be interpreted as a simplified version of the Transformer architecture [31], where the three-dimensional image features in the CNN are flattened into a two-dimensional sequence. This process is demonstrated in Fig. 4-(a), where an image-context feature is derived via an attention operation on the reshaped image feature and the sequence of word embeddings. The resulting image-context features are then concatenated to the image features to generate the images. We will show that a full-fledged version of the Transformer can further improve the performance without a substantial additional computational burden.

3 Methodology

In this section, we present our proposed approach, starting with the model structure and training schema, followed by details of the architecture and method.

Figure 3: Model overview of TIME. The upper panel shows a high-level summary of our architecture while the lower panel demonstrates the details of the individual modules.

The upper panel in Fig. 3 shows the overall structure of TIME, consisting of a Text-to-Image Generator G and an Image-Captioning Discriminator D. We treat a text encoder and a text decoder as parts of D. G's Text-Conditioned Image Transformer accepts a series of word embeddings from the text encoder and produces an image-context representation for G to generate a corresponding image. D accepts three kinds of input pairs, consisting of captions alongside: (a) matched real images; (b) randomly mismatched real images; and (c) generated images from G. D emits three outputs: the predicted conditional and unconditional authenticity scores of the image, and the predicted captions from the given images.

3.1 Text-to-Image Generator

3.1.1 Text-Conditioned Image Transformer

To condition the generation process on text descriptions in G, we propose to adopt the more comprehensive Transformer model [31] to replace the attention mechanism used in AttnGAN. To this end, we present the Text-Conditioned Image Transformer (TCIT), illustrated in Fig. 3-(a), as the text conditioning module for G. In TCIT, self-attention is first applied on the image features, which are then paired with the word embeddings to obtain the image-context feature representation via multi-head attention. This entire operation is considered as a single layer, and we employ a two-layer design in TIME. As demonstrated in Fig. 4, there are three main differences between TCIT and the attention from AttnGAN.

Figure 4: Differences between the attention mechanisms of AttnGAN and our model.

First, while the projected key from the word embeddings is used both for matching with the query from the image features and for calculating the image-context feature in AttnGAN, TCIT has two separate linear layers to project the word embeddings, as illustrated in Fig. 4-(b). We show that such separation is beneficial for the T2I task. As the key projection focuses on matching with the image features, the other projection obtains the value, which can instead be better optimized towards refining the image features for a better image-context feature.

Second, TCIT adopts a multi-head structure as shown in Fig. 4-(c). Unlike in AttnGAN, where only one attention map is calculated, the Transformer replicates the attention module, thus adding more flexibility for each image region to account for multiple words. The benefits of applying multi-head attention to T2I are intuitive, as a given region in an image may be described by multiple words.

Third, TCIT achieves better performance by stacking the attention layers in a residual structure, while AttnGAN adopts it only as one layer. As shown in Fig. 4-(d), we argue that provisioning multiple attention layers and recurrently revising the image-context feature enables an improved T2I ability.
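The first difference (separate key and value projections of the word embeddings) can be illustrated with a minimal single-head, single-layer sketch in plain Python. This is not the TIME implementation: the function name `cross_attention`, the toy weight matrices, and the scaled dot-product form are assumptions made for illustration. A multi-head variant would run several such operations with different projections, and a multi-layer variant would repeat the operation with residual connections.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_feats, word_embs, w_query, w_key, w_value):
    """Each image region (query) attends over the words (keys) and
    collects a mixture of separately projected word values."""
    query = matmul(image_feats, w_query)   # (regions, d)
    key = matmul(word_embs, w_key)         # (words, d), used only for matching
    value = matmul(word_embs, w_value)     # (words, d), used only for the output
    scale = math.sqrt(len(key[0]))
    scores = matmul(query, transpose(key))  # (regions, words)
    weights = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(weights, value)        # image-context feature per region
    return context, weights

# Toy inputs (hypothetical values): 3 flattened image regions, 2 words, d = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_embs = [[4.0, 0.0], [0.0, 4.0]]
context, weights = cross_attention(image_feats, word_embs, identity, identity, double)
```

Because the value projection (`double`) differs from the key projection (`identity`), the returned context is free to carry information that is not tied to how well a word matched the region, which is the point of separating the two projections.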

Aggregated structure. To reduce the number of weights for the StackGAN structure, we present the design of an aggregated Generator from the same multi-resolution perspective. As shown in the upper panel of Fig. 3, G outputs images only at the finest level. Specifically, G still yields RGB outputs at multiple resolutions. However, instead of being treated as individual images at different scales, these RGB outputs are re-scaled to the highest resolution and added together to obtain a single aggregated image output. Therefore, only one D is needed to train G. D remains able to perceive an image at multiple scales via residual blocks, in which skip connections (formulated by a convolution) can directly pass down-sampled low-level visual features to deeper layers.
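The aggregation step can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes square single-channel images stored as nested lists, resolutions that divide the finest one evenly, and nearest-neighbour up-sampling (the text does not specify the re-scaling method). The helper names `upsample_nearest` and `aggregate_outputs` are hypothetical.

```python
def upsample_nearest(image, factor):
    """Nearest-neighbour up-sampling of a 2-D grid (list of rows)."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def aggregate_outputs(outputs):
    """Re-scale every output to the largest resolution and add them element-wise."""
    target = max(len(img) for img in outputs)
    total = [[0.0] * target for _ in range(target)]
    for img in outputs:
        factor = target // len(img)  # assumes sizes divide the target evenly
        up = upsample_nearest(img, factor)
        for r in range(target):
            for c in range(target):
                total[r][c] += up[r][c]
    return total

# Toy multi-resolution outputs (hypothetical values): 1x1, 2x2 and 4x4.
coarse = [[1.0]]
middle = [[0.0, 1.0], [2.0, 3.0]]
fine = [[0.1] * 4 for _ in range(4)]
aggregated = aggregate_outputs([coarse, middle, fine])
```

Only `aggregated` would be passed to the single discriminator, so the coarse outputs still shape the overall layout while the finest output adds detail.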

3.2 Image-Captioning Discriminator

We treat the text encoder and text decoder as a part of our D. Specifically, the text encoder is a Transformer that first maps the word indices into an embedding space, and then adds contextual information to the embeddings via self-attention operations. To train the text decoder to actively generate text descriptions of an image, an attention mask is applied on the input of the text encoder, such that each word can only attend to the words preceding it in a sentence. The text decoder is a Transformer decoder [24] that performs image captioning by predicting the next word's probability distribution from the masked word embeddings (considering only previous words) and the image features.
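The attention mask described here can be sketched in a few lines. The snippet assumes the common causal convention in which each position may attend to itself and to all earlier positions. The function names are hypothetical and the scores are toy values, not outputs of a trained model.

```python
import math

def causal_mask(length):
    """mask[row][col] is True when position `row` may attend to position `col`."""
    return [[col <= row for col in range(length)] for row in range(length)]

def masked_attention_weights(scores, mask):
    """Softmax over each row of scores, with masked-out positions given zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        visible = [s if keep else float("-inf") for s, keep in zip(score_row, mask_row)]
        peak = max(visible)  # the diagonal is always visible, so peak is finite
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(3)
scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [0.0, 0.0, 0.0]]
weights = masked_attention_weights(scores, mask)
```

Even though later words have high raw scores in the first two rows, they receive zero weight, so the prediction for each position cannot depend on words that follow it.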

Image-Captioning Transformer. In contrast to TCIT, where the image features are revised by the word embeddings, the inverse operation is leveraged for the image captioning task. As shown in Fig. 3-(b), we design the Image-Captioning Transformer (ICT), which first applies self-attention on the word embeddings, and then revises the embeddings by attending to the most relevant image features along the spatial regions. ICT is used in D for the image captioning task. In TIME, we find that a simple 4-layer 4-head ICT is sufficient to obtain high-quality captions and facilitate the consistency checking in the T2I task.

Conditional Image Text Matching. We observe that a basic convolution design already succeeds in measuring the image-text consistency. Therefore, to provide a basic conditional restriction, we simply reshape the word embeddings into the same shape as the image feature-maps, and concatenate them into an image-context feature as illustrated in Fig. 3-(c). In TIME, such a naïve operation works surprisingly well and we only use two further convolutional layers to derive the consistency score from the image-context feature.

3.3 2-D Positional Encoding for Image Features

When we reshape the image features for the attention operation, there is no way for the Transformer to discern spatial information from the flattened features. To take advantage of coordinate signals, we propose 2-D positional encoding as a counter-part to the 1-D positional encoding in the Transformer [31].

Figure 5: Visualization of 2-D positional embedding on image feature-maps. We take the first 3 channels (encoded by y-axis) and the middle 3 channels (encoded by x-axis) from the feature-maps at one resolution level, and display them as RGB images.

The encoding at each position has the same dimensionality as the channel size of the image feature, and is directly added to the reshaped image feature. The first half of dimensions encode the y-axis positions and the second half encode the x-axis, using sinusoidal functions of different frequencies computed from the coordinates (x, y) of each pixel location and the dimension index along the channel. Such 2-D encoding ensures that closer visual features have a more similar representation compared to features that are spatially more remote from each other. An example from a trained TIME feature space is visualized in Fig. 5. In practice, we apply 2-D positional encoding on the image features for both TCIT and ICT.
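A minimal sketch of such an encoding is given below. It follows the description (first half of the channels from the y coordinate, second half from the x coordinate, sinusoids of different frequencies), but the exact frequency schedule, the base of 10000 borrowed from the 1-D Transformer encoding, and the interleaving of sine and cosine are assumptions, not details confirmed by the text. The function names are hypothetical.

```python
import math

def positional_encoding_2d(height, width, channels, base=10000.0):
    """Return a height x width grid of encodings, each a list of `channels` floats."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4 (sin/cos pairs for two axes)")
    half = channels // 2

    def axis_code(pos):
        # Interleaved sin/cos pairs with geometrically decreasing frequencies.
        code = []
        for k in range(0, half, 2):
            freq = 1.0 / (base ** (k / half))
            code.append(math.sin(pos * freq))
            code.append(math.cos(pos * freq))
        return code

    # First half encodes the row (y), second half encodes the column (x).
    return [[axis_code(y) + axis_code(x) for x in range(width)] for y in range(height)]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

encoding = positional_encoding_2d(4, 4, 8)
near = squared_distance(encoding[0][0], encoding[0][1])  # horizontal neighbours
far = squared_distance(encoding[0][0], encoding[0][3])   # three pixels apart
```

In this toy grid `near` is smaller than `far`, in line with the intent that nearby locations receive more similar codes. In a model, each code would be added to the flattened image feature at the same location before the attention operation.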

3.4 Objectives

Formally, D produces three kinds of outputs: the image feature at a fixed resolution; the unconditional image real/fake score; and the conditional image real/fake score. The predicted next-word distribution is then obtained by the text decoder from this image feature and the masked word embeddings. Finally, D, the text encoder, and the text decoder are trained jointly by minimizing objectives defined over these outputs.


3.4.1 Hinged Image-Text Matching Loss

During training, we find that G can learn a good semantic visual translation at very early iterations. As shown in Fig. 6, while the convention is to train the model for 600 epochs on the CUB dataset, we observe that the semantic features begin to emerge on the generated images as early as after 20 epochs. Thus, we argue that it is not ideal to penalize the generated images by the conditional loss on D in a static manner. Since a generated image is already very consistent with the given text, if we let D consider an already well-matched input as inconsistent, this may confuse D and in turn hurt the consistency-checking performance.

Figure 6: Samples generated in the early iterations during the training of TIME

Therefore, we revise the conditional loss for D in Eqs. (3)-(5). We employ a hinged loss [13, 29] and dynamically anneal the penalty on the generated images according to how confidently D predicts the matched real pairs:


Here, the stop-gradient operator denotes that the gradient is not computed for the enclosed function, and the penalty is scaled by an annealing factor. The hinged loss ensures that D yields a lower score on generated images compared to matched real images, while the annealing term ensures that D penalizes generated images sufficiently in early epochs.
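To make the idea concrete, here is one plausible form of a hinged, annealed penalty on generated images, written in plain Python. It is an illustrative assumption and not the paper's exact equation: the pivot is taken as the mean conditional score on matched real pairs (treated as a constant, as a stop-gradient would do in an autograd framework), and the hypothetical `anneal` factor is assumed to grow from 0 towards 1 during training.

```python
def hinged_annealed_fake_loss(fake_scores, real_scores, anneal):
    """Penalize generated-image scores only when they exceed an annealed pivot.

    fake_scores: conditional scores D assigns to generated images
    real_scores: conditional scores D assigns to matched real pairs
    anneal:      factor in [0, 1]; assumed small early in training (hypothetical schedule)
    """
    # Pivot from the real pairs; with autograd this value would be detached.
    pivot = sum(real_scores) / len(real_scores)
    threshold = anneal * pivot
    # Hinge: no penalty once a generated image already scores below the threshold.
    penalties = [max(0.0, score - threshold) for score in fake_scores]
    return sum(penalties) / len(penalties)

real = [1.0, 3.0]
fake = [0.5, -1.0]
early = hinged_annealed_fake_loss(fake, real, anneal=0.0)  # strict threshold at 0
late = hinged_annealed_fake_loss(fake, real, anneal=0.5)   # relaxed threshold at 1.0
```

Under these assumptions, a generated image scoring 0.5 is penalized early in training but not later, once the threshold has relaxed towards the score of matched real pairs, so D stops pushing down inputs that are already reasonably consistent with the text.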

On the other side, G considers random noise and word embeddings from the text encoder as inputs, and is trained to generate images that can fool D into giving high scores on authenticity and semantic consistency with the text. Moreover, since D can now caption the images, G is also encouraged to make D reconstruct the same sentences as provided as input. Thus, the objective for G combines these adversarial terms with the caption reconstruction term.


Note that the text encoder is only trained with D. Hence, the word embeddings are only optimized towards making it easier for D to check the image-text consistency and predict the correct captions. In our experiments, we find that such a setting works out fairly well, where G is able to catch up with D and produce good generations.

4 Experiments

In this section, we evaluate the proposed model from both the text-to-image and image-captioning directions, and analyze each module’s effectiveness individually. Moreover, we highlight the desirable property of TIME being a more controllable generator compared to other T2I models.

Experiments are conducted on two commonly used datasets: CUB [33] (8,855 images for training and 2,933 images for validating) and MS-COCO [14] (80k images for training and 40k images for validating). We train the models on the training set and benchmark them on the validation set. Following the same convention as in previous T2I works [35, 23, 39], we measure the image quality by Inception Score [27] and the image-text consistency by R-precision [37]. Our work is implemented in PyTorch [22], and all the code will be published.

4.1 A More Controllable G without Sentence-Level Embedding

Most previous T2I models rely on a sentence-level embedding as a vital conditioning factor for G [37, 35, 23, 39, 12]. Specifically, the sentence embedding is concatenated with the noise as the input for G, and is leveraged to compute the conditional authenticity of the images in D. Sentence embeddings are preferred over word embeddings, as the latter lack the context and because semantic concepts are often expressed in multiple words.

Figure 7: Images from TIME with fixed noise and varied sentences

However, since the sentence embedding is a part of the input alongside the noise, any slight changes in the sentence embedding can lead to major visual changes in the resulting images, even when the noise is fixed. This is undesirable when we like a generated image but want to slightly revise it by altering the text description. Examples are given in Fig. 7-(a), where changing just a single word leads to unpredictably large changes in the image. In contrast, since we adopt the Transformer as the text encoder, where the word embeddings already come with contextual information, the sentence embedding is no longer needed in TIME. Via our Transformer text encoder, the same word in different sentences or at different positions will have different embeddings. As a result, the word embeddings are sufficient to provide semantically accurate information.

As shown in Fig. 7-(b) and (c), when changing the captions while fixing the noise, TIME shows a more controllable generation. While previous works [12, 11] approach such controllability with great effort, including a channel-wise attention and extra content-wise perceptual losses, TIME naturally enables fine-grained manipulation of synthetic images via their text descriptions.

4.2 Backbone Model Structure

Table 1 compares the StackGAN structure with our proposed “aggregating” structure in a T2I-only context. AttnGAN as the T2I backbone has been revised by recent advances in the GAN literature [39, 23]. For instance, Zhu et al. [39] incorporated spectral normalization [19] into their model, which directly results in a performance boost. To keep the backbone updated, we also brought in new advances from recent years. In particular, in the column names, “+new” means we train the model with the latest GAN technologies, including an equalized learning rate [8], a style-based generator [9], and R-1 regularization [16]. “Aggr” means we remove the “stacked” G and multiple Ds and replace them with the proposed aggregated G and a single D. To compare the computing cost, we list the relative training times of all models with respect to StackGAN. All models are trained with the optimal hyper-parameter settings from [35] on the same GPU.

| | StackGAN w/o stack | StackGAN | Aggr GAN | Aggr GAN +new | AttnGAN w/o stack | AttnGAN | Aggr AttnGAN +new |
|---|---|---|---|---|---|---|---|
| Inception Score | | | | | | | |
| Training time | 0.57 | 1.0 | 0.78 | 0.85 | 0.71 | 1.14 | 1.0 |

Table 1: Comparison between stacked and aggregated structures on CUB dataset

In Table 1, our aggregated structure achieves the best performance/computing-cost ratio in both the image quality and the image-text consistency. Moreover, we find that the abandoned lower-resolution Ds in StackGAN have limited effect on image-text consistency. Instead, the image-text consistency appears more related to the generated image quality, as a higher IS always yields a better R-precision. It is worth noticing that the last column already performs similarly to several of the latest T2I models that are based on AttnGAN.

4.3 Attention Mechanisms

We conducted experiments to explore the best attention settings for the T2I task from the mechanisms discussed in Section 3.1. Table 2 lists the settings we tested, where all the models are configured the same based on AttnGAN, except for the attention mechanisms used in G.

| | AttnGAN | Tf-h1-l1 | Tf-h4-l1 | Tf-h4-l2 | Tf-h4-l4 | Tf-h8-l4 |
|---|---|---|---|---|---|---|
| Inception Score | | | | | | |

Table 2: Comparison between different attention settings on CUB dataset

In particular, column 1 shows the baseline performance that employs the basic attention operation, described in Fig. 4-(a), from AttnGAN. The following columns show the results of using the Transformer illustrated in Fig. 4-(d) with different numbers of heads and layers (e.g., Tf-h4-l2 means a Transformer with 4 heads and 2 layers). According to the results, a Transformer with a more comprehensive attention yields a better performance than the baseline. However, when increasing the number of layers and heads beyond a threshold, a clear performance degradation emerges on the CUB dataset. We hypothesize that the optimal numbers of heads and layers depend on the dataset, where the 4-head 2-layer setting is the sweet spot for the CUB dataset. Intuitively, the increased parameters, as shown in the last two columns, could make the model harder to converge, and more susceptible to overfitting the training data.

4.4 Comparison with State-of-the-Art and Ablation Study

We next compare TIME with several state-of-the-art models. Qualitative results of TIME can be found in Fig. 1, 7, and 8. On CUB, TIME yields a more consistent image synthesis quality, while AttnGAN is more likely to generate failure samples. On MS-COCO, where the images are much more diverse and complicated, TIME is still able to generate the essential content that is consistent with the given text. Note that, although we do not particularly tune TIME’s hyper-parameters for MS-COCO (such as the Transformer settings and weights for loss functions), the T2I performance of TIME is still competitive. The overall performance of TIME proves its effectiveness, given that it also provides image captioning besides T2I, and does not rely on any pre-trained modules.

Importantly, TIME is a counterpart to AttnGAN and its aforementioned revisions, with fundamental differences (no pre-training, no extra CNN/RNN modules), whereas the other compared models all improve incrementally over AttnGAN with orthogonal contributions that could also be incorporated into TIME.

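| Dataset | Metric | AttnGAN | ControlGAN | MirrorGAN | DMGAN | TIME | Real-Image |
|---|---|---|---|---|---|---|---|
| CUB | Inception Score | | | | | | |
| CUB | R-precision | | | | | | NA |
| COCO | Inception Score | | | | | | |
| COCO | R-precision | | | | | | NA |

Table 3: Comparison between TIME and other models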

As shown in Table 3, TIME demonstrates competitive performance on MS-COCO, and sets the new state-of-the-art Inception Score on CUB. Unlike the other models that require a well pre-trained language module and an Inception-v3 image encoder, TIME itself is sufficient to learn the cross-modal relationships between image and language. Regarding the image-text consistency performance, the visual variance is larger in MS-COCO, and thus it is easier for the models to generate matched images from text descriptions. Therefore, all models obtain a better R-Precision score on MS-COCO compared to CUB, while TIME is among the top performers on both datasets.

| | - img captioning | Baseline | - Sentence emb | + 2D-Pos Encode | + Hinged loss |
|---|---|---|---|---|---|
| Inception Score | | | | | |

Table 4: Ablation Study of TIME on CUB dataset

Table 4 provides an ablation study. We take the model described up to Section 3.2 as the baseline, and perform cumulative experiments on it. First, we remove the image captioning text decoder to show its impact on the T2I direction. Then, we show that dropping the sentence-level embedding does not hurt the performance, while adding 2-D positional encoding brings improvements in both image-text consistency and the overall image quality. Lastly, the hinged loss releases D from a potentially conflicting signal, and therefore leads D to provide a better objective for G, resulting in a substantial boost in image quality.

Figure 8: Learned word embeddings on CUB, and qualitative results on MS-COCO

4.5 Text Encoder and Image-captioning Text Decoder

Although it is trained under the cGAN framework, we show that our text encoder successfully acquires semantics. In Fig. 8-(a), words with similar meanings reside close to each other, such as “bill” and “beak”, or “belly” and “breast”. Moreover, “large” ends up close to “red”, as the latter often applies to large birds, while “small” is close to “brown” and “grey”, which often apply to small birds.

| Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR |
|---|---|---|---|---|---|---|---|
| CUB | MirrorGAN pre-trained | 0.52 | 0.247 | 0.074 | 0.029 | 0.098 | 0.112 |
| CUB | TIME | 0.79 | 0.64 | 0.511 | 0.39 | 0.214 | 0.285 |
| CUB | TIME cap-only | 0.80 | 0.65 | 0.532 | 0.41 | 0.236 | 0.291 |
| COCO | MirrorGAN pre-trained | 0.70 | 0.515 | 0.362 | 0.251 | 0.623 | 0.369 |
| COCO | TIME | 0.79 | 0.62 | 0.471 | 0.365 | 0.671 | 0.412 |
| COCO | TIME cap-only | 0.79 | 0.65 | 0.482 | 0.371 | 0.68 | 0.412 |

Table 5: Image captioning performance

Apart from a strong T2I performance, D becomes a stand-alone image captioning model after training. In Table 5, we report the standard metrics [21, 32, 1] for a comprehensive evaluation of TIME’s image captioning performance. According to the results, TIME performs better than the pre-trained captioning model used in MirrorGAN, which is the mainstream “CNN encoder, RNN decoder” captioning model [34]. The superior captioning performance of TIME also helps explain its better T2I performance, as the pre-trained text decoder does not perform well enough to provide comparably good conditioning information for G. We also include the result of TIME trained without the image generation part (i.e., trained only as an image captioning model, “TIME cap-only”). It suggests that the captioning performance does not benefit from adversarial training with G, but also that the adversarial training does not hurt the captioning performance noticeably. This reveals a promising area for future research, aimed at improving the performance in both directions (text-to-image and image-captioning) under the single TIME framework.

5 Conclusion

In this paper, we proposed the Text and Image Mutual-translation adversarial nEtwork (TIME), a unified framework trained with an adversarial schema that accomplishes both the text-to-image and image-captioning tasks. While previous work in the T2I field requires pre-training several supportive modules, TIME establishes the new state-of-the-art T2I performance on the CUB dataset without pre-training. Meanwhile, the joint process of learning both a text-to-image and an image-captioning model fully harnesses the power of GANs (since in related work, D is typically abandoned after training G), yielding a promising image-captioning performance using D. TIME bridges the gap between the visual and language domains, unveiling the immense potential of mutual translations between the two modalities within a single model.


  • [1] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.5.
  • [2] Y. Cai, X. Wang, Z. Yu, F. Li, P. Xu, Y. Li, and L. Li (2019) Dualattn-gan: text to image synthesis with dual attentional generative adversarial network. IEEE Access 7, pp. 183706–183716. External Links: ISSN 2169-3536 Cited by: §1, Figure 2, §2.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.2.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.
  • [6] T. Hinz, S. Heinrich, and S. Wermter (2019) Semantic object accuracy for generative text-to-image synthesis. arXiv preprint arXiv:1910.13321. Cited by: §1, Figure 2, §2.
  • [7] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • [8] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1, §4.2.
  • [9] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1, §4.2.
  • [10] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [11] Q. Lao, M. Havaei, A. Pesaranghader, F. Dutil, L. D. Jorio, and T. Fevens (2019) Dual adversarial inference for text-to-image synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7567–7576. Cited by: §4.1.
  • [12] B. Li, X. Qi, T. Lukasiewicz, and P. Torr (2019) Controllable text-to-image generation. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 2063–2073. Cited by: §1, Figure 2, §2, §4.1, §4.1.
  • [13] J. H. Lim and J. C. Ye (2017) Geometric gan. arXiv preprint arXiv:1705.02894. Cited by: §3.4.1.
  • [14] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.
  • [15] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov (2015) Generating images from captions with attention. arXiv preprint arXiv:1511.02793. Cited by: §2.
  • [16] L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §4.2.
  • [17] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association. Cited by: §1.
  • [18] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1.
  • [19] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §4.2.
  • [20] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski (2017) Plug & play generative networks: conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477. Cited by: §2.
  • [21] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.5.
  • [22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.
  • [23] T. Qiao, J. Zhang, D. Xu, and D. Tao (2019) Mirrorgan: learning text-to-image generation by redescription. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514. Cited by: §1, Figure 2, §2.2, §2, §4.1, §4.2, §4.
  • [24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1, §3.2.
  • [25] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §2.
  • [26] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. De Freitas (2017) Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2912–2921. Cited by: §2.
  • [27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §4.
  • [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §2.2.
  • [29] D. Tran, R. Ranganath, and D. M. Blei (2017) Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896. Cited by: §3.4.1.
  • [30] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §2.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.3, §3.1.1, §3.3.
  • [32] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §4.5.
  • [33] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §4.
  • [34] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §4.5.
  • [35] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324. Cited by: §1, Figure 2, §2, §4.1, §4.2, §4.
  • [36] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao (2019) Semantics disentangling for text-to-image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2327–2336. Cited by: §1, §2.
  • [37] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 5907–5915. Cited by: §1, Figure 2, §2.2, §2, §4.1, §4.
  • [38] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1947–1962. Cited by: §2.
  • [39] M. Zhu, P. Pan, W. Chen, and Y. Yang (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5802–5810. Cited by: §1, Figure 2, §2, §4.1, §4.2, §4.