
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

by Ming Ding, et al.

The development of transformer-based text-to-image models is impeded by their slow generation and the complexity of high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, the Cross-modal general Language Model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.



1 Introduction

Recently, text-to-image generation has been greatly advanced by large-scale pretrained transformers, e.g. DALL-E Ramesh et al. (2021) and CogView Ding et al. (2021). These models generally learn to generate image tokens in an auto-regressive way, and thus suffer from the following defects:

Slow generation. The generation of auto-regressive models is usually much slower than that of non-autoregressive models, e.g. GANs Goodfellow et al. (2014), with the same FLOPs. Rather than the large number of parameters, this shortcoming is mainly attributed to the token-by-token nature of auto-regressive generation, which cannot fully utilize the parallel computing ability of GPUs even after caching hidden states Ramachandran et al. (2017).

Expensive high-resolution training. The current large-scale pretrained models are usually based on Transformers Vaswani et al. (2017), where the attention operation has both time and space complexity of $O(n^2)$ for training sequences of length $n$. Within a limited budget, we face a trade-off between the number of parameters, which represents the modeling power, and the resolution of the generated images. For this reason, the majority of current text-to-image models choose a resolution of $32\times 32$ tokens (usually $256\times 256$ pixels) Ding et al. (2021); Ramesh et al. (2021); Gu et al. (2021), which is far lower than the resolution of real photos.

Uni-direction. Auto-regressive models for images, e.g. GPTs, usually generate tokens in raster-scan order. This order shows the best perplexity during evaluation Esser et al. (2020). However, it makes the models unaware of the tokens below or to the right during generation, so text-guided infilling is not supported. Moreover, the uni-direction creates a gap between the pretrained text-to-image models and vision transformers (ViTs) Dosovitskiy et al. (2020) based on bidirectional masked prediction, e.g. MAE He et al. (2021) and SimMIM Xie et al. (2021), limiting their application to traditional visual tasks, e.g. image classification and object detection.

Present Work. To overcome the defects above, we first propose a simple and versatile pretraining method, the Cross-Modal general Language Model (CogLM). CogLM masks various types of tokens in the sequence of text and image tokens, and learns to predict them in an auto-regressive way. Specifically, (1) if we mask all the image tokens, the task is equivalent to the text-to-image generation of the original CogView. (2) If we mask random patches of image tokens, it works similarly to MAE as an infilling task. (3) If we mask text tokens, the task becomes image captioning.

The versatility of CogLM enables us to finetune a pretrained CogLM for different downstream tasks and to construct a hierarchical model, CogView2. The hierarchical generation process has three steps:

  1. First, we generate a batch of low-resolution images ($20\times 20$ tokens in CogView2) using the pretrained CogLM, and then (optionally) filter out the bad samples based on the perplexity of CogLM image captioning, which is the post-selection method introduced in CogView Ding et al. (2021).

  2. The generated images are directly mapped into $60\times 60$-token images by a direct super-resolution module finetuned from the pretrained CogLM. We use local attention implemented by our customized CUDA kernel to reduce the training expense. The high-resolution images from this step usually have inconsistent textures and lack details.

  3. These high-resolution images are refined via another iterative super-resolution module finetuned from the pretrained CogLM. Most tokens are re-masked and re-generated in a local parallel auto-regressive (LoPAR) way, which is much faster than usual auto-regressive generation.

How does CogView2 conquer the three defects? Firstly, during pretraining the masked patch prediction task trains CogLM to handle bidirectional context, making it easy to adapt to bidirectional tasks, e.g. the direct and iterative super-resolution. Secondly, the hierarchical design allows us to care only about local coherence at the high-resolution level, so that local attention can be leveraged to reduce the training expense. Thirdly, the local parallel auto-regressive generation reduces the number of model forward passes from 3,600 to 6, greatly accelerating the generation of high-resolution images. CogView2 is about $10\times$ faster than CogView (with sliding-window super-resolution) at generating images of similar resolution and better quality.

2 Related Work

Text-to-image generation for arbitrary inputs is a long-held dream for many cross-modal machine learning researchers. Early attempts at this task are usually based on Generative Adversarial Nets Goodfellow et al. (2014), including AttnGAN Xu et al. (2018), DM-GAN Zhu et al. (2019), DF-GAN Tao et al. (2020), etc. Although they can perform vivid synthesis on domain-specific datasets, e.g. Caltech-UCSD Birds 200, general-domain datasets, e.g. MS COCO Lin et al. (2014), remain a great challenge for these methods. DALL-E Ramesh et al. (2021), CogView Ding et al. (2021) and similar works Wu et al. (2021); Gafni et al. (2022) leverage VQ-VAE van den Oord et al. (2017) to compress an image into a sequence of discrete tokens and pretrain large transformers for auto-regressive generation, greatly advancing the task in the general domain. LAFITE Zhou et al. (2021) learns to invert the pretrained CLIP Radford et al. (2021) embeddings in the shared space of text and image for text-free training. Recently, many researchers have turned to diffusion models, e.g. Glide Nichol et al. (2021), largely due to the slow generation of auto-regressive models.

Non-autoregressive generation (NAR) has recently become a popular topic in natural language generation, e.g. Mask-Predict Ghazvininejad et al. (2019) and GLAT Qian et al. (2021), which explore parallel decoding methods for auto-regressive-like models. The speed of generation was not an issue in the era when GANs dominated image generation, but it constitutes a considerable challenge for current auto-regressive text-to-image models. M6-UFC Zhang et al. (2021) first introduced NAR methods into the VQ-VAE framework, and similar ideas are adopted by VQ-diffusion Gu et al. (2021) and MaskGIT Chang et al. (2022). A possible drawback of pure NAR methods is that tokens sampled at the same time might lead to global inconsistency in later steps during the generation of complex scenes. Our method introduces a hierarchical design to combine the consistency merit of auto-regressive models and the speed advantage of NAR methods.

3 Method

3.1 The Cross-Modal General Language Model

Figure 2: CogLM. (Left) The sequence consists of both text and image tokens. [BOI] (Begin-Of-Image) is the separator token. The mask regions are sampled according to different strategies. Only the second to last tokens in the mask regions are predicted to compute the loss. (Right) The mask does not change the input sequence but changes the attention map, where the rows and columns of all the masked tokens together form a lower-triangular attention mask matrix.

As previous self-supervised pretext tasks in computer vision often target mask prediction Xie et al. (2021); He et al. (2021), our approach pursues a unification of auto-regressive generation and bidirectional context-aware mask prediction.

In NLP, General Language Model (GLM) Du et al. (2021) first proposes to change the direct mask prediction into blockwise auto-regressive generation. However, a part of its design is redundant for images. For instance, the sizes of the masked image patches are fixed, so that we do not require the capacity of filling blocks of indefinite length as in NLP. Moreover, GLM inserts a sentinel token for each mask region to predict its first token, which will greatly increase the sequence length and restrict the usage of 2D local attention.

Based on the analysis above, we present a simpler and more general language model for both text and image data, the Cross-modal general Language Model (CogLM). As shown in Figure 2, CogLM takes as input a concatenation of text and images tokenized by icetk, whose dictionary contains 20,000 image tokens and 130,000 text (both Chinese and English) tokens. Formally, let $t = [t_1, \dots, t_M]$ be the text tokens and $im = [im_1, \dots, im_{N^2}]$ be the image tokens, where $M$ and $N^2$ are the lengths of text and image tokens respectively.

The crucial step in CogLM is to sample mask regions according to various strategies. In practice, the following two strategies are used:

  • (Text-to-Image GPT.) The input sequence is $x = [\{t\}\ \text{[BOI]}\ \{im\}]$. We mask all the image tokens, which is similar to the pretraining task of CogView Ding et al. (2021).

  • (A Combination of Mask Prediction and Image Captioning.) The input sequence is $x = [\{im\}\ \text{[BOE]/[BOC]}\ \{t\}]$, where [BOE] and [BOC] are separators meaning begin-of-English and begin-of-Chinese, used for the corresponding language. We mask random patches and the text tokens. Ideally, the two tasks should be separated, but we combine them for training efficiency.

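Instead of replacing the tokens in the mask regions with [MASK], we leave the input unchanged and build an attention mask based on the mask regions. All the tokens outside mask regions are seen as context and can be attended to by all the other tokens. A token in mask regions can only be attended to by the tokens in mask regions and behind it. Specifically, writing $A_{i,j} = 1$ when token $i$ may attend to token $j$,

$$A_{i,j} = \begin{cases} 1, & \text{if } j \text{ is outside the mask regions, or } i \text{ and } j \text{ are both in mask regions and } i \ge j, \\ 0, & \text{otherwise.} \end{cases}$$
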
Figure 2 shows an example of the attention mask matrix of two mask regions.
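As a minimal sketch of the rule above (not the released implementation), the attention mask can be built from a per-position flag marking the mask regions. The function name `coglm_attention_mask` and the flag list are illustrative assumptions, and the sketch assumes that a masked token may attend to itself.

```python
def coglm_attention_mask(in_mask_region):
    """Return A where A[i][j] == 1 means position i may attend to position j.

    in_mask_region[i] is True when position i lies inside a mask region.
    """
    n = len(in_mask_region)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if not in_mask_region[j]:
                # Context tokens are visible to every position.
                A[i][j] = 1
            elif in_mask_region[i] and j <= i:
                # Masked tokens are visible only to masked tokens at or after them.
                A[i][j] = 1
    return A


# Two mask regions: positions 1-2 and position 4.
flags = [False, True, True, False, True]
for row in coglm_attention_mask(flags):
    print(row)
```

Restricted to the masked positions (1, 2 and 4 here), the rows and columns form a lower-triangular matrix, while context rows never see masked columns.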

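In the mask regions, the model learns to predict the next token. The loss function can be written as follows,

$$\mathcal{L} = -\sum_{i \in \mathcal{P}} \log p\left(x_i \mid x_{\text{context}},\ \{x_j : j \in \mathcal{M},\ j < i\}\right),$$

where $x_{\text{context}}$ denotes the tokens outside the mask regions, $\mathcal{M}$ is the set of positions inside mask regions, and $\mathcal{P} \subseteq \mathcal{M}$ contains the predicted positions, i.e. all but the first token of each mask region.

Figure 3: Image Infilling of CogLM. Tokens (viewed as patches) in light green mean mask regions.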

Infilling. Note that the first token in each mask region is not predicted during training. This feature seems to prevent CogLM from performing image infilling or cloze filling in natural language, but the problem has a simple solution. During inference, we can move the last context token before each mask region into the region, as illustrated in Figure 3. Although these moved tokens become blind spots for the mask regions before them, this has only a minor influence in practice. To avoid even this minor influence and fully maintain the context information, we can deal with the mask regions one by one. For each region, we only move the last context token before this region, and keep all the known tokens outside the mask regions. In this way, we cannot reuse the cached hidden states from the previous region, which slightly slows down multi-region infilling.
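The token-moving trick can be illustrated with a few lines of Python. This is only a sketch of the bookkeeping, assuming mask regions are given as a list of boolean flags over positions; the helper name `move_context_into_regions` is hypothetical and not taken from the released code.

```python
def move_context_into_regions(in_mask_region):
    """Extend every mask region by the last context token that precedes it.

    After the move, the original first masked token of each region becomes the
    second token of the extended region, so the model can predict it.
    """
    moved = list(in_mask_region)
    for i in range(1, len(in_mask_region)):
        starts_region = in_mask_region[i] and not in_mask_region[i - 1]
        if starts_region:
            moved[i - 1] = True
    return moved


flags = [False, False, True, True, False, True]
print(move_context_into_regions(flags))
# prints [False, True, True, True, True, True]
```

In the example, position 4 is pulled into the second region and therefore is no longer visible as context to the first region, which is the "blind spot" mentioned above. Handling the regions one by one avoids this at the cost of recomputing hidden states.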

Advantages over GPT Radford et al. (2019), GLM Du et al. (2021) and MAE He et al. (2021). (GPT) The main advantage over GPT is that bidirectional context is modeled in CogLM, which benefits many tasks relying on global information, e.g. super-resolution in the next section and image classification. The importance of bidirectional context has been verified in the comparison of BERT Devlin et al. (2018) and GPT on GLUE Wang et al. (2018). (GLM) The main advantage over GLM is simplicity. To unify generation and bidirectional understanding, GLM needs to define many new special tokens and a new type of position embedding, insert a sentinel for each mask region and change the order of input tokens. This destroys the spatial relevance in the image data and excludes the possibility of using 2D local attention or convolution. (MAE) MAE is designed for self-supervised learning on pure image data and is not ready for generation. Even without text, CogLM is more parameter-efficient because MAE is an encoder-decoder structure. A considerable part of the parameters in the encoder and decoder are learned for the same function, e.g. extracting the basic features from inputs.

3.2 Pretraining

As we have introduced CogLM as a general pretraining framework, in this section, we will describe the details and hyperparameters of our pretrained CogLM.

Tokenization. We develop a unified tokenizer, icetk, for images, Chinese and English. As shown in DeBERTaV2 He et al. (2020), a large vocabulary (128,000 tokens) is beneficial. For text, we extract a bilingual vocabulary of 130,000 tokens in icetk and explicitly classify them as Chinese, English, Common or Rare Symbols, so that we can specify the generated language via a sampling mask. The image tokenizer is a 20,000-token first-stage VQ-VAE van den Oord et al. (2017), largely following the tokenizer in CogView Ding et al. (2021). Inspired by Esser et al. (2020), a perceptual loss term Zhang et al. (2018) is added to the reconstruction loss, significantly improving the reconstruction performance.

Transformer. The backbone of our pretrained CogLM is a Transformer with Sandwich LayerNorm Ding et al. (2021). The model has 6 billion parameters (48 layers, hidden size 3072, 48 attention heads) and was trained for 300,000 iterations in FP16 with batch size 4,096. The sequence length is 512, consisting of 400 image tokens, 1 separator and up to 111 text tokens.
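The stated size can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the common approximation of $12h^2$ weights per layer (four $h\times h$ attention projections plus a feed-forward block with $4\times$ expansion) and one embedding row per vocabulary entry (20,000 image plus 130,000 text tokens); biases, LayerNorm and position embeddings are ignored, and the actual layer layout of CogLM may differ.

```python
def transformer_param_estimate(layers, hidden, vocab):
    """Rough parameter count: (transformer blocks, token embeddings)."""
    # Assumption: 4*h^2 for attention (Q, K, V, output projection)
    # plus 8*h^2 for a feed-forward block with 4x expansion.
    block_params = 12 * layers * hidden ** 2
    embedding_params = vocab * hidden
    return block_params, embedding_params


blocks, embeddings = transformer_param_estimate(layers=48, hidden=3072, vocab=150_000)
print(f"{blocks:,} + {embeddings:,} = {blocks + embeddings:,}")
# prints 5,435,817,984 + 460,800,000 = 5,896,617,984

# Sequence budget: 400 image tokens + 1 separator + up to 111 text tokens.
assert 400 + 1 + 111 == 512
```

Under these assumptions the estimate comes to roughly 5.9 billion weights, consistent with the reported 6 billion parameters.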

Masking Strategy. We assign a probability of 50% to each sampling strategy of mask regions. The analysis in SimMIM Xie et al. (2021) shows the great importance of the mask percentage and patch distribution. Following their results, we sample token patches at random until 75% of the tokens are in the mask regions. For bilingual samples, we randomly choose one of the languages during training.
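The patch-sampling procedure can be sketched as follows. The $20\times 20$ grid follows from the 400 image tokens per sequence; the patch side length (`patch=4`) is an illustrative assumption, as is the function name, and the real sampler may place patches differently.

```python
import random


def sample_patch_mask(grid=20, patch=4, ratio=0.75, rng=None):
    """Mark square patches at random until at least `ratio` of tokens are masked."""
    rng = rng or random.Random()
    masked = [[False] * grid for _ in range(grid)]
    target = ratio * grid * grid
    count = 0
    while count < target:
        # Pick the top-left corner of a patch that fits inside the grid.
        top = rng.randrange(0, grid - patch + 1)
        left = rng.randrange(0, grid - patch + 1)
        for r in range(top, top + patch):
            for c in range(left, left + patch):
                if not masked[r][c]:
                    masked[r][c] = True
                    count += 1
    return masked


mask = sample_patch_mask(rng=random.Random(0))
print(sum(cell for row in mask for cell in row), "of 400 tokens masked")
```

Because patches may overlap, the loop counts only newly masked tokens and stops as soon as the 75% target is reached, so the final ratio can slightly exceed the target.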

3.3 Hierarchical Generation

Although the pretrained CogLM can generate images from text, the resolution is only $20\times 20$ tokens ($160\times 160$ pixels). The short sequence is in fact an intentional design for fast generation. The versatility of CogLM allows us to finetune it into super-resolution models. The whole hierarchical pipeline makes up our CogView2 system.

Direct super-resolution. In this step, we want a model to map a generated low-resolution image token sequence $x^0$ to a higher-resolution sequence $x^1$. We finetune the pretrained CogLM into an encoder-decoder architecture. The input of the encoder is the sequence of generated image tokens, and the input of the decoder is just a sequence of [MASK] tokens. We do not follow the original transformer Vaswani et al. (2017) in adding a cross-attention layer; instead, we make the tokens in the decoder attend to local tokens in both the decoder and the encoder. This cross-resolution local attention is implemented via a customized CUDA kernel introduced in Section 4.3. Both encoder and decoder are initialized using the pretrained CogLM. In practice, we find it sufficient to finetune only the weights of the attention layers in the decoder, so that we can fix and share the other parameters between the encoder and decoder to reduce the memory consumption.

Although direct mapping is a traditional practice for super-resolution, e.g. SRCNN Dong et al. (2014), it hardly qualifies as generation; it focuses more on texture transformation. The loss function of direct mapping is token-based or pixel-based (MAE), meaning that it predicts or maximizes the marginal distribution $p(x^1_i \mid x^0)$ for each token instead of the joint distribution $p(x^1 \mid x^0)$. As we use the cross-entropy loss and multinomial sampling during generation, we get

$$x^1_i \sim p_\theta(x^1_i \mid x^0) \quad \text{independently for each token } i.$$

Therefore, we need to refine $x^1$ using another module.

Figure 4: Super-resolution modules. The low-resolution images are mapped into high-resolution images via the direct super-resolution module. In each snapshot during the iterative super-resolution, the tokens in the same color are generated at the same time. All the local windows work in parallel.

Iterative super-resolution. In this step, we aim to refine the initial high-resolution sequence $x^1$ into a better one $x^2$. The working principle of the refinement is to break the independence of the generated tokens while keeping the parallelism. Thus, we propose a local parallel auto-regressive (LoPAR) approach.

The motivation of LoPAR is that the hierarchical process frees us from global dependence. As long as we maintain 25% (a ratio from MAE He et al. (2021)) of random tokens as context, it is enough to recover the global scene of the image. If the re-generated tokens are locally coherent with the 25% kept tokens, the global coherence is also guaranteed. We mask 75% of the tokens of the initial high-resolution sequence $x^1$ and assume that, for the refined sequence $x^2$, there is a local window size $\sigma$ such that

$$p(x^2_i \mid x^1) = p\left(x^2_i \mid \{x^1_j : \operatorname{dist}(i, j) < \sigma\}\right),$$

so that local attention is sufficient and tokens from different local windows can be generated in parallel. To further increase the parallelism, we observe that local inconsistency usually occurs when directly adjacent (vertically or horizontally) tokens are generated at the same time. We therefore factorize the generation process into different iterations diagonally, as in Figure 4 and as follows,

$$p(x^2 \mid x^1) = \prod_{t}\ \prod_{(r,c):\, r + c = t} p\left(x^2_{r,c} \mid x^1,\ \{x^2_{a,b} : a + b < t\}\right),$$

where $r$ and $c$ are the indices of row and column in the local window.

To implement the iterative super-resolution module, we finetune the pretrained CogLM for 20,000 iterations into a BERT-style masked prediction model on $60\times 60$-token sequences with local attention. The mask ratio is sampled anew for each sample. During inference, we fix the local window size and compress the iterative process to 6 iterations by arranging the unmasked tokens and merging the first and final iterations (implemented by a manually designed matrix; details are included in our released code).
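To make the diagonal ordering concrete, the sketch below assigns every token of a $60\times 60$ grid to an iteration equal to its anti-diagonal index inside its local window, and checks that no two directly adjacent tokens share an iteration. The anti-diagonal direction, the window size of 4 and the function names are assumptions made for illustration; the sketch ignores the 25% kept context tokens and does not reproduce the manually designed matrix that compresses the schedule to 6 iterations.

```python
def lopar_schedule(height, width, window):
    """Iteration index of each token: its anti-diagonal within its local window."""
    return [[(r % window) + (c % window) for c in range(width)]
            for r in range(height)]


def adjacent_conflicts(schedule):
    """Count horizontally or vertically adjacent pairs generated in the same iteration."""
    h, w = len(schedule), len(schedule[0])
    conflicts = 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w and schedule[r][c] == schedule[r][c + 1]:
                conflicts += 1
            if r + 1 < h and schedule[r][c] == schedule[r + 1][c]:
                conflicts += 1
    return conflicts


schedule = lopar_schedule(60, 60, window=4)
iterations = max(max(row) for row in schedule) + 1
print(iterations, adjacent_conflicts(schedule))  # prints 7 0
```

With a window of side $\sigma$, this ordering needs $2\sigma - 1$ iterations regardless of image size, compared with 3,600 forward passes for token-by-token generation of the same grid.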

4 Plug-in Improved Techniques for Transformers

4.1 Cluster Sampling

In auto-regressive generation, the sampling strategy over the predicted distribution of the tokens is crucial. Top-k and top-p (nucleus) sampling Holtzman et al. (2019) are the most common strategies, but they suffer from an incomplete truncation problem.

Figure 5: (Best viewed in color.) Incomplete truncation. The same color indicates very similar embeddings of the tokens. The hard truncation of top-k sampling twists the proportion between blue, green and red tokens.

The vocabulary of the image tokens is learned by VQ-VAE van den Oord et al. (2017), where the embeddings of some tokens are very similar. To represent frequent patterns at a finer granularity, we use a large vocabulary of 20,000 tokens, three times larger than that of previous works Ramesh et al. (2021); Ding et al. (2021), further exacerbating the situation. For instance, there are about 42 basically “white” tokens in icetk, which show subtle differences only when connected to some other tokens. Although the sum of the probabilities of these “white” tokens might be large enough, most of them could be filtered out by top-k sampling. Figure 5 illustrates the problem.

To solve the incomplete truncation problem, we propose cluster sampling. We group the 20,000 tokens into 500 clusters via K-means MacQueen et al. (1967) based on their vectors in the VQ-VAE. During sampling, we first sample a cluster using top-k sampling based on the sum of the probabilities of the tokens in each cluster, and then sample a token within the cluster. All the tokens within a cluster are treated as a whole and are filtered or kept together, alleviating the incomplete truncation problem.
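A toy version of cluster sampling is sketched below. The cluster assignment would come from K-means on the VQ-VAE embeddings; here it is simply given as a list, and the function name, the five-token vocabulary and the probabilities are hypothetical.

```python
import random


def cluster_sample(probs, cluster_of, top_k, rng):
    """Sample a token id: top-k over clusters first, then within the chosen cluster."""
    # Total probability mass of every cluster.
    mass = {}
    for token, p in enumerate(probs):
        mass[cluster_of[token]] = mass.get(cluster_of[token], 0.0) + p
    # Keep the k heaviest clusters and sample one in proportion to its mass.
    kept = sorted(mass, key=mass.get, reverse=True)[:top_k]
    cluster = rng.choices(kept, weights=[mass[c] for c in kept])[0]
    # Sample a token inside the chosen cluster in proportion to its probability.
    members = [t for t in range(len(probs)) if cluster_of[t] == cluster]
    return rng.choices(members, weights=[probs[t] for t in members])[0]


# Four near-identical "white" tokens (cluster 0) and one "red" token (cluster 1).
probs = [0.15, 0.15, 0.15, 0.15, 0.40]
cluster_of = [0, 0, 0, 0, 1]
rng = random.Random(0)
samples = {cluster_sample(probs, cluster_of, top_k=1, rng=rng) for _ in range(200)}
print(sorted(samples))
```

Plain top-1 sampling over tokens would always return the "red" token (probability 0.40), although the "white" tokens jointly hold 0.60 of the mass. Cluster sampling with `top_k=1` instead keeps the whole "white" cluster and only ever returns one of its members.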

4.2 Upweighting Textual Attention

Most text-image pairs are only weakly relevant in the large training data of CogLM. Even if the model perfectly fits the data, it would have a considerable probability of generating irrelevant images. To strengthen the relevance, we leverage the explainability of the attention operation. We directly add a constant $c$ to all the attention scores from image tokens to text tokens. This technique costs negligible time but largely improves the textual relevance of the generated images. In practice, a sufficiently small $c$ will not influence the quality of the images.
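The effect of the added constant can be seen on a single attention row. The sketch assumes the constant is added to the pre-softmax scores of one image-token query; the function name, the scores and the value of `c` are illustrative only.

```python
import math


def attention_weights(scores, is_text_key, c=0.0):
    """Softmax over one image query's scores, with +c added to text keys."""
    shifted = [s + c if is_text else s for s, is_text in zip(scores, is_text_key)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0, 0.0, 0.0, 0.0]
is_text_key = [True, True, False, False]
for c in (0.0, 1.0):
    w = attention_weights(scores, is_text_key, c)
    print(c, round(w[0] + w[1], 3))
# prints "0.0 0.5" then "1.0 0.731"
```

Adding $c$ before the softmax multiplies the unnormalized weight of every text key by $e^c$, so the share of attention going to the text grows while the relative weights among text keys, and among image keys, stay unchanged.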

4.3 Local Attention

Locality is one of the most important properties of image data. Local operations, e.g. convolution, dominated visual computing before ViTs Dosovitskiy et al. (2020). Even attention in ViTs mainly deals with the interactions between local tokens Raghu et al. (2021). We find it possible to finetune the pretrained CogLM using local attention and textual attention, which is generally compatible with the global attention weights from pretraining. However, 2D local attention cannot be implemented efficiently in high-level frameworks, e.g. PyTorch Paszke et al. (2019). We develop a customized CUDA kernel to support 2D local attention, 2D auto-regressive local attention and cross-resolution local attention. In the super-resolution modules, we use local attention with a fixed kernel size, which is faster and consumes less memory than global attention on a 4,096-token sequence with hidden size 64 per head.
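The saving from locality can be estimated by counting query-key pairs. The sketch below is plain Python rather than a CUDA kernel; it treats the 4,096-token sequence as a $64\times 64$ grid and uses a kernel size of 9 purely as an illustrative assumption, with windows clipped at the image border.

```python
def local_window(row, col, grid, kernel):
    """Key positions that a query at (row, col) attends to under 2D local attention."""
    half = kernel // 2
    return [
        (r, c)
        for r in range(max(0, row - half), min(grid, row + half + 1))
        for c in range(max(0, col - half), min(grid, col + half + 1))
    ]


def attention_pairs(grid, kernel):
    """Number of (query, key) pairs for local and for global attention."""
    local = sum(len(local_window(r, c, grid, kernel))
                for r in range(grid) for c in range(grid))
    return local, (grid * grid) ** 2


local, full = attention_pairs(grid=64, kernel=9)
print(local, full, round(full / local, 1))  # prints 309136 16777216 54.3
```

The number of pairs grows linearly with the number of tokens for local attention and quadratically for global attention, which is why the gap widens at higher resolutions.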

5 Experiments

5.1 Dataset

Our pretraining dataset contains about 30 million text-image pairs, mostly overlapping with that of CogView Ding et al. (2021). We filter out about 5 million text-image pairs from the CogView dataset with some keywords, e.g. “abstract” and “texture”, because they are mostly background images used for design. These images consist of repeated patterns and contribute little to text-to-image generation. We then replenish the dataset with 5 million tag-image pairs. About half of the text is translated from English, and both the Chinese and English text are kept to train our bilingual CogLM. Only the images whose resolution is at least $480\times 480$ are used to train the super-resolution modules.

5.2 Machine Evaluation

To compare with previous and concurrent works, we follow the most popular benchmark, which originated from DALL-E Ramesh et al. (2021): Fréchet Inception Distance (FID) and Inception Score (IS) evaluated on MS-COCO Lin et al. (2014). 30,000 captions from the validation set are sampled to evaluate the FID. Since each image in COCO has up to 5 different captions, we carefully select the sampled captions so that they describe different images. We generate 16 samples for each caption (translated into Chinese), and select the one with the lowest caption perplexity (the Caption Score in Ding et al. (2021)). Note that FID is not a perfect metric for evaluating CogView2 because (1) the advantage of CogView2 is generating high-resolution images, but we need to resize the images back to $256\times 256$ for a meaningful comparison. (2) There are mistakes when translating English captions into Chinese. (3) Our training data contain many single-object images, which are quite different from the distribution of COCO (common objects in context).
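The best-of-16 selection reduces to picking the candidate whose caption has the lowest perplexity. In the sketch below, the per-token log-probabilities stand in for what CogLM would assign to the caption given each generated image; the helper names and the numbers are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a caption from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_best(candidates):
    """candidates: list of (image_id, caption_token_logprobs) pairs."""
    return min(candidates, key=lambda item: perplexity(item[1]))[0]


candidates = [
    ("sample_a", [-2.0, -2.0]),
    ("sample_b", [-0.5, -1.5]),
    ("sample_c", [-1.0, -3.0]),
]
print(select_best(candidates))  # prints sample_b
```

A lower caption perplexity means the model finds the original text a more plausible description of the image, so it serves as a proxy for text-image relevance.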

The results of machine evaluation are shown in Table 1. We find that finetuning CogLM on the MS-COCO dataset largely improves the FID. During finetuning, we observe the FID decreasing from 24.0 (0 iterations) to 19.2 (2,500 iterations) to 17.7 (5,000 iterations). However, we find that the quality of generation (by human evaluation) deteriorates. Though the style is more similar to COCO, the generation is not as accurate as that of the non-finetuned version.

Model           FID-0   FID-1   FID-2   FID-4   FID-8   IS
AttnGAN*        35.2    44.0    72.0    108.0   100.0   23.3
DM-GAN*         26.0    39.0    73.0    119.0   112.3   32.2
DF-GAN*         26.0    33.8    55.9    91.0    97.0    18.7
DALL-E          27.5    28.0    45.5    83.5    85.0    17.9
CogView         27.1    19.4    13.9    19.4    23.6    18.2
XMC-GAN*        9.3     -       -       -       -       30.5
NÜWA*           12.9    13.8    15.7    19.3    24      27.2
LAFITE          26.9    23.0    18.7    15.7    14.8    26.0
Make-A-Scene*   7.55    -       -       -       -       -
DALL-E-2        10.9    -       -       -       -       -
CogView2        24.0    19.7    16.8    17.2    17.2    22.4
CogView2*       17.7    13.8    11.7    12.2    12.3    26.4
Table 1: Machine Evaluation Results on MS-COCO. (Downsampling CogView2 images to $256\times 256$.) * means finetuning on MS-COCO.

6 Discussion

Auto-regressive or Diffusion? Although GPTs have achieved great success in text generation, diffusion models have become increasingly popular in image generation. Here we compare diffusion models with auto-regressive models in terms of speed, the largest disadvantage of auto-regressive models discussed in Section 1. With the same architecture, diffusion models require more FLOPs but have a high degree of parallelism. They can also trade off quality against time consumption by manually scheduling the stride of sampling. For example, Glide Nichol et al. (2021) samples 250 diffusion steps for evaluation, and 27 steps for interactive sampling to reduce the latency to 15s. Auto-regressive models must generate the image token by token, but our LoPAR can upsample the image with a high degree of parallelism, so that (potentially) we can reduce the time cost by introducing more hierarchies and design models much faster than diffusion models.

Comparison between DALL-E-2 and CogView2. DALL-E-2 is a recently released work for text-to-image generation at $1024\times 1024$ resolution. Although its probabilistic model and architecture are quite different from those of CogView2, they share the same spirit: hierarchical generation. Its quality gain over CogView2 mainly originates from a third-level super-resolution and a “zeroth”-level image prior generation. Moreover, DALL-E-2 is trained on 650M text-image pairs, roughly 20 times the size of the CogView2 dataset, which might also influence the performance.

7 Conclusion

Breakthroughs in the text-to-image domain have been made by auto-regressive models. However, slow generation and high complexity hinder researchers from improving the quality in this direction. In this paper, we put forward hierarchical transformers as a way to help auto-regressive models overcome these disadvantages, and to bridge the gap between text-to-image pretraining and recent visual representation learning methods, e.g. MAE He et al. (2021).

The advancement of text-to-image generation, especially text-guided image editing, will benefit the creative work of artists and designers, but will also bring the risk of misinformation, leading to permanent damage to the reliability of web photos.


  • H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman (2022) MaskGIT: masked generative image transformer. arXiv preprint arXiv:2202.04200. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.1.
  • M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al. (2021) Cogview: mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34. Cited by: item 1, §1, §1, §2, 1st item, §3.2, §3.2, §4.1, §5.1, §5.2.
  • C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §3.3.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1, §4.3.
  • Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang (2021) All nlp tasks are generation tasks: a general pretraining framework. arXiv preprint arXiv:2103.10360. Cited by: §3.1, §3.1.
  • P. Esser, R. Rombach, and B. Ommer (2020) Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841. Cited by: §1, §3.2.
  • O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman (2022) Make-a-scene: scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131. Cited by: §2.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Mask-predict: parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6112–6121. Cited by: §2.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §1, §2.
  • S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo (2021) Vector quantized diffusion model for text-to-image synthesis. CoRR abs/2111.14822. Cited by: §1, §2.
  • K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick (2021) Masked autoencoders are scalable vision learners. CoRR abs/2111.06377. Cited by: §1, §3.1, §3.1, §3.3, §7.
  • P. He, X. Liu, J. Gao, and W. Chen (2020) Deberta: decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654. Cited by: §3.2.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §4.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2, §5.2.
  • J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: §4.1.
  • A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen (2021) Glide: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741. Cited by: §2, §6.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §4.3.
  • L. Qian, H. Zhou, Y. Bao, M. Wang, L. Qiu, W. Zhang, Y. Yu, and L. Li (2021) Glancing transformer for non-autoregressive neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1993–2003. Cited by: §2.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §3.1.
  • M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy (2021) Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems 34. Cited by: §4.3.
  • P. Ramachandran, T. L. Paine, P. Khorrami, M. Babaeizadeh, S. Chang, Y. Zhang, M. A. Hasegawa-Johnson, R. H. Campbell, and T. S. Huang (2017) Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001. Cited by: §1.
  • A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092. Cited by: §1, §1, §2, §4.1, §5.2.
  • M. Tao, H. Tang, S. Wu, N. Sebe, F. Wu, and X. Jing (2020) Df-gan: deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865. Cited by: §2.
  • A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6309–6318. Cited by: §2, §3.2, §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §3.3.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Cited by: §3.1.
  • C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, and N. Duan (2021) NÜWA: visual synthesis pre-training for neural visual world creation. arXiv preprint arXiv:2111.12417. Cited by: §2.
  • Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu (2021) SimMIM: A simple framework for masked image modeling. CoRR abs/2111.09886. Cited by: §1, §3.1, §3.2.
  • T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324. Cited by: §2.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595. Cited by: §3.2.
  • Z. Zhang, J. Ma, C. Zhou, R. Men, Z. Li, M. Ding, J. Tang, J. Zhou, and H. Yang (2021) M6-ufc: unifying multi-modal controls for conditional image synthesis. arXiv preprint arXiv:2105.14211. Cited by: §2.
  • Y. Zhou, R. Zhang, C. Chen, C. Li, C. Tensmeyer, T. Yu, J. Gu, J. Xu, and T. Sun (2021) LAFITE: towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792. Cited by: §2.
  • M. Zhu, P. Pan, W. Chen, and Y. Yang (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810. Cited by: §2.