Vector Quantized Diffusion Model for Text-to-Image Synthesis

by Shuyang Gu, et al.

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.




1 Introduction

Recent success of the Transformer [vaswani2017attention, devlin2018bert] in natural language processing (NLP) has raised tremendous interest in using successful language models for computer vision tasks. The autoregressive (AR) model [radford2018improving, radford2019language, brown2020language] is one of the most natural and popular approaches for transferring from text-to-text generation (e.g., machine translation) to text-to-image generation. Based on the AR model, the recent work DALL-E [ramesh2021zero] has achieved impressive results for text-to-image generation.

Despite their success, existing text-to-image generation methods still have weaknesses that need to be improved. One issue is the unidirectional bias. Existing methods predict pixels or tokens in the reading order, from top-left to bottom-right, based on the attention to all prefix pixels/tokens and the text description. This fixed order introduces unnatural bias in the synthesized images because important contextual information may come from any part of the image, not just from left or above. Another issue is the accumulated prediction errors. Each step of the inference stage is performed based on previously sampled tokens – this is different from that of the training stage, which relies on the so-called “teacher-forcing” practice [esser2021imagebart] and provides the ground truth for each step. This difference is important and its consequence merits careful examination. In particular, a token in the inference stage, once predicted, cannot be corrected and its errors will propagate to the subsequent tokens.

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation, a model that eliminates the unidirectional bias and avoids accumulated prediction errors. We start with a vector quantized variational autoencoder (VQ-VAE) and model its latent space by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM) [sohl2015deep, ho2020denoising], which has been applied to image synthesis with compelling results [dhariwal2021diffusion]. We show that the latent-space model is well-suited for the task of text-to-image generation. Roughly speaking, the VQ-Diffusion model samples the data distribution by reversing a forward diffusion process that gradually corrupts the input via a fixed Markov chain. The forward process yields a sequence of increasingly noisy latent variables of the same dimensionality as the input, producing pure noise after a fixed number of timesteps. Starting from this noise result, the reverse process gradually denoises the latent variables towards the desired data distribution by learning the conditional transit distribution.

The VQ-Diffusion model eliminates the unidirectional bias. It consists of an independent text encoder and a diffusion image decoder, which performs denoising diffusion on discrete image tokens. At the beginning of the inference stage, all image tokens are either masked or random. Here the masked token serves the same function as those in mask-based generative models [devlin2018bert]. The denoising diffusion process gradually estimates the probability density of image tokens step-by-step based on the input text. In each step, the diffusion image decoder leverages the contextual information of all tokens of the entire image predicted in the previous step to estimate a new probability density distribution and uses this distribution to predict the tokens in the current step. This bidirectional attention provides global context for each token prediction and eliminates the unidirectional bias.

The VQ-Diffusion model, with its mask-and-replace diffusion strategy, also avoids the accumulation of errors. In the training stage, we do not use the “teacher-forcing” strategy. Instead, we deliberately introduce both masked tokens and random tokens and let the network learn to predict the masked tokens and modify incorrect tokens. In the inference stage, we update the density distribution of all tokens in each step and resample all tokens according to the new distribution. Thus we can modify the wrong tokens and prevent error accumulation. Compared to the conventional replace-only diffusion strategy for unconditional image generation [austin2021structured], the masked tokens effectively direct the network’s attention to the masked areas and thus greatly reduce the number of token combinations to be examined by the network. This mask-and-replace diffusion strategy significantly accelerates the convergence of the network.

To assess the performance of the VQ-Diffusion method, we conduct text-to-image generation experiments with a wide variety of datasets, including CUB-200 [wah2011caltech], Oxford-102 [nilsback2008automated], and MSCOCO [lin2014microsoft]. Compared with AR models with a similar number of parameters, our method achieves significantly better results, as measured by both image quality metrics and visual examination, and is much faster. Compared with previous GAN-based text-to-image methods [xu2018attngan, zhang2017stackgan, zhang2018stackgan++, zhu2019dm], our method can handle more complex scenes and the synthesized image quality is improved by a large margin. Compared with extremely large models (models with ten times more parameters than ours), including DALL-E [ramesh2021zero] and CogView [ding2021cogview], our model achieves comparable or better results for specific types of images, i.e., the types of images that our model has seen during the training stage. Furthermore, our method is general and produces strong results in our experiments on both unconditional and conditional image generation with the FFHQ [karras2019style] and ImageNet [deng2009imagenet] datasets.

The VQ-Diffusion model also provides important benefits for the inference speed. With traditional AR methods, the inference time increases linearly with the output image resolution and the image generation is quite time consuming even for normal-size images (e.g., images larger than small thumbnail images of pixels). The VQ-Diffusion provides the global context for each token prediction and makes it independent of the image resolution. This allows us to provide an effective way to achieve a better tradeoff between the inference speed and the image quality by a simple reparameterization of the diffusion image decoder. Specifically, in each step, we ask the decoder to predict the original noise-free image instead of the noise-reduced image in the next denoising diffusion step. Through experiments we have found that the VQ-Diffusion method with reparameterization can be fifteen times faster than AR methods while achieving a better image quality.

2 Related Work

GAN-based Text-to-image generation. In the past few years, Generative Adversarial Networks (GANs) [goodfellow2014generative] have shown promising results on text-to-image generation [reed2016generative, zhang2017stackgan, dash2017tac, nguyen2017plug, sharma2018chatpainter, hong2018inferring, xu2018attngan, zhang2018stackgan++, zhang2018photographic, gao2019perceptual, lao2019dual, li2019object, qiao2019learn, qiao2019mirrorgan, yin2019semantics, tan2019semantics, zhu2019dm, li2019controllable, cha2019adversarial, el2019tell, cheng2020rifegan, souza2020efficient, liang2020cpgan, tao2020df, zhang2021cross, ruan2021dae]. GAN-INT-CLS [reed2016generative] was the first to use a conditional GAN formulation for text-to-image generation. Based on this formulation, some approaches [zhang2017stackgan, zhang2018stackgan++, xu2018attngan, zhu2019dm, qiao2019mirrorgan, yin2019semantics, zhang2021cross, liang2020cpgan] were proposed to further improve the generation quality. These models generate high fidelity images on single domain datasets, e.g., birds [wah2011caltech] and flowers [nilsback2008automated]. However, due to the inductive bias on the locality of convolutional neural networks, they struggle on complex scenes with multiple objects, such as those in the MS-COCO dataset [lin2014microsoft].

Other works [hong2018inferring, li2019object] adopt a two-step process which first infers the semantic layout and then generates the different objects, but this kind of method requires fine-grained object labels, e.g., object bounding boxes or segmentation maps.

Autoregressive Models. AR models [radford2018improving, radford2019language, brown2020language] have shown powerful capability of density estimation and have recently been applied to image generation [salimans2017pixelcnn++, van2016pixel, parmar2018image, oord2017neural, razavi2019generating, chen2020generative, esser2021taming]. PixelRNN [salimans2017pixelcnn++, van2016pixel], Image Transformer [parmar2018image] and ImageGPT [chen2020generative] factorize the probability density of an image over raw pixels. Thus, they can only generate low-resolution images due to the unaffordable amount of computation for large images.

VQ-VAE [oord2017neural, razavi2019generating], VQGAN [esser2021taming] and ImageBART [esser2021imagebart] train an encoder to compress the image into a low-dimensional discrete latent space and fit the density of the hidden variables. This greatly improves the performance of image generation.

DALL-E [ramesh2021zero], CogView [ding2021cogview] and M6 [lin2021m6] propose AR-based text-to-image frameworks. They model the joint distribution of text and image tokens. With powerful large transformer structure and massive text-image pairs, they greatly advance the quality of text-to-image generation, but still have weaknesses of unidirectional bias and accumulated prediction errors due to the limitation of AR models.

Denoising Diffusion Probabilistic Models. Diffusion generative models were first proposed in [sohl2015deep] and recently achieved strong results on image generation [ho2020denoising, nichol2021improved, ho2021cascaded, dhariwal2021diffusion] and image super-resolution [saharia2021image]. However, most previous works only considered continuous diffusion models on the raw image pixels. Discrete diffusion models were also first described in [sohl2015deep], and then applied to text generation in Argmax Flow [hoogeboom2021argmax]. D3PMs [austin2021structured] apply discrete diffusion to image generation. However, they also estimate the density of raw image pixels and can only generate low-resolution images.

3 Background: Learning Discrete Latent Space of Images Via VQ-VAE

Transformer architectures have shown great promise in image synthesis due to their outstanding expressivity [chen2020generative, esser2021taming, ramesh2021zero]. In this work, we aim to leverage the transformer to learn the mapping from text to image. Since the computation cost is quadratic to the sequence length, it is computationally prohibitive to directly model raw pixels using transformers. To address this issue, recent works [oord2017neural, esser2021taming] propose to represent an image by discrete image tokens with reduced sequence length. Hereafter a transformer can be effectively trained upon this reduced context length and learn the translation from the text to image tokens.

Formally, a vector quantized variational autoencoder (VQ-VAE) [oord2017neural] is employed. The model consists of an encoder $E$, a decoder $G$ and a codebook $\mathcal{Z} = \{z_k\}_{k=1}^{K}$ containing a finite number of embedding vectors $z_k \in \mathbb{R}^{d}$, where $K$ is the size of the codebook and $d$ is the dimension of codes. Given an image $x \in \mathbb{R}^{H \times W \times 3}$, we obtain a spatial collection of image tokens $z_q$ with the encoder $z = E(x) \in \mathbb{R}^{h \times w \times d}$ and a subsequent spatial-wise quantizer $Q(\cdot)$ which maps each spatial feature $z_{ij}$ into its closest codebook entry $z_k$:

$$z_q = Q(z) = \Big( \arg\min_{z_k \in \mathcal{Z}} \| z_{ij} - z_k \|_2^2 \Big) \in \mathbb{R}^{h \times w \times d}, \tag{1}$$

where $h \times w$ represents the encoded sequence length and is usually much smaller than $H \times W$. Then the image can be faithfully reconstructed via the decoder, i.e., $\tilde{x} = G(z_q)$. Hence, image synthesis is equivalent to sampling image tokens from the latent distribution. Note that the image tokens are quantized latent variables in the sense that they take discrete values. The encoder $E$, the decoder $G$ and the codebook $\mathcal{Z}$ can be trained end-to-end via the following loss function:

$$\mathcal{L}_{\mathrm{VQVAE}} = \| x - \tilde{x} \|_1 + \| \mathrm{sg}[E(x)] - z_q \|_2^2 + \beta \, \| \mathrm{sg}[z_q] - E(x) \|_2^2, \tag{2}$$

where $\mathrm{sg}[\cdot]$ stands for the stop-gradient operation. In practice, we replace the second term of Equation 2 with exponential moving averages (EMA) [oord2017neural] to update the codebook entries, which has been shown to work better than directly using the loss function.
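The nearest-codebook lookup performed by the quantizer can be illustrated with a minimal pure-Python sketch. The function name `quantize`, the toy two-dimensional codebook and the feature values are illustrative assumptions, not taken from any released implementation; a real VQ-VAE applies the same lookup to every spatial feature of the encoder output.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). The indices play the role of image tokens."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    tokens = []
    for feature in features:
        nearest = min(range(len(codebook)),
                      key=lambda k: sq_dist(feature, codebook[k]))
        tokens.append(nearest)
    return tokens


# Toy example (hypothetical values): K = 3 codes of dimension d = 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
tokens = quantize(features, codebook)
print(tokens)
```

Each feature is replaced by the index of its closest code, so the image is represented by a short sequence of integers rather than by raw pixels.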

4 Vector Quantized Diffusion Model

Given the text-image pairs, we obtain the discrete image tokens $x \in \mathbb{Z}^{N}$ with a pretrained VQ-VAE, where $N = hw$ represents the sequence length of tokens. Suppose the size of the VQ-VAE codebook is $K$; the image token $x^i$ at location $i$ takes the index that specifies the entries in the codebook, i.e., $x^i \in \{1, 2, \dots, K\}$. On the other hand, the text tokens $y$ can be obtained through BPE-encoding [sennrich2015neural]. The overall text-to-image framework can be viewed as maximizing the conditional transition distribution $q(x \mid y)$.

Previous autoregressive models, e.g., DALL-E [ramesh2021zero] and CogView [ding2021cogview], sequentially predict each image token conditioned on the text tokens as well as the previously predicted image tokens, i.e., $q(x \mid y) = \prod_{i=1}^{N} q(x^i \mid x^1, \dots, x^{i-1}, y)$. While achieving remarkable quality in text-to-image synthesis, autoregressive modeling has several limitations. First, image tokens are predicted in a unidirectional ordering, e.g., raster scan, which neglects the structure of 2D data and restricts the expressivity for image modeling, since the prediction of a specific location should not merely attend to the context on the left or above. Second, there is a train-test discrepancy, as training employs the ground truth whereas inference relies on the model's own predictions as previous tokens. The so-called “teacher-forcing” practice [esser2021imagebart], or exposure bias [schmidt2019generalization], leads to error accumulation due to mistakes in earlier sampling. Moreover, it requires a forward pass of the network to predict each token, which consumes an inordinate amount of time even when sampling in a low-resolution latent space, making the AR model impractical for real usage.

We aim to model the VQ-VAE latent space in a non-autoregressive manner. The proposed VQ-Diffusion method maximizes the probability $q(x \mid y)$ with the diffusion model [sohl2015deep, ho2020denoising], an emerging approach that produces compelling quality on image synthesis [dhariwal2021diffusion]. While the majority of recent works focus on continuous diffusion models, using them for categorical distributions is much less researched [hoogeboom2021argmax, austin2021structured]. In this work, we propose to use a conditional variant of the discrete diffusion process for text-to-image generation. We will subsequently introduce the discrete diffusion process inspired by masked language modeling (MLM) [devlin2018bert], and then discuss how to train a neural network to reverse this process.

Figure 1: Overall framework of our method. It starts with the VQ-VAE. Then, the VQ-Diffusion models the discrete latent space by reversing a forward diffusion process that gradually corrupts the input via a fixed Markov chain.

4.1 Discrete diffusion process

On a high level, the forward diffusion process gradually corrupts the image data $x_0$ via a fixed Markov chain $q(x_t \mid x_{t-1})$, e.g., by randomly replacing some tokens of $x_{t-1}$. After a fixed number of $T$ timesteps, the forward process yields a sequence of increasingly noisy latent variables $x_1, \dots, x_T$ of the same dimensionality as $x_0$, and $x_T$ becomes pure noise tokens. Starting from the noise $x_T$, the reverse process gradually denoises the latent variables and restores the real data $x_0$ by sampling from the reverse distribution $q(x_{t-1} \mid x_t, x_0)$ sequentially. However, since $x_0$ is unknown in the inference stage, we train a transformer network to approximate the conditional transit distribution $p_\theta(x_{t-1} \mid x_t, y)$, which depends on the entire data distribution.

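To be more specific, consider a single image token $x_0^i$ of $x_0$ at location $i$, which takes the index that specifies the entries in the codebook, i.e., $x_0^i \in \{1, 2, \dots, K\}$. Without introducing confusion, we omit the superscript $i$ in the following description. We define the probabilities that $x_{t-1}$ transits to $x_t$ using the matrices $[Q_t]_{mn} = q(x_t = m \mid x_{t-1} = n) \in \mathbb{R}^{K \times K}$. Then the forward Markov diffusion process for the whole token sequence can be written as,

$$q(x_t \mid x_{t-1}) = v^{\top}(x_t) \, Q_t \, v(x_{t-1}), \tag{3}$$

where $v(x)$ is a one-hot column vector of length $K$ in which only the entry $x$ is 1. The categorical distribution over $x_t$ is given by the vector $Q_t v(x_{t-1})$.

Importantly, due to the Markov property, one can marginalize out the intermediate steps and derive the probability of $x_t$ at an arbitrary timestep directly from $x_0$ as,

$$q(x_t \mid x_0) = v^{\top}(x_t) \, \bar{Q}_t \, v(x_0), \quad \text{with } \bar{Q}_t = Q_t \cdots Q_1. \tag{4}$$

Besides, another notable characteristic is that by conditioning on $x_0$, the posterior of this diffusion process is tractable, i.e.,

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0) \, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} = \frac{\big( v^{\top}(x_t) \, Q_t \, v(x_{t-1}) \big) \big( v^{\top}(x_{t-1}) \, \bar{Q}_{t-1} \, v(x_0) \big)}{v^{\top}(x_t) \, \bar{Q}_t \, v(x_0)}. \tag{5}$$

The transition matrix $Q_t$ is crucial to the discrete diffusion model and should be carefully designed such that it is not too difficult for the reverse network to recover the signal from noises.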
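The tractable posterior can be made concrete with a small sketch. It assumes the column convention in which matrix entry `[m][n]` is the probability of moving to state `m` from state `n`; the function name `posterior` and the two-state toy chain are hypothetical and only serve to show the Bayes-rule computation.

```python
def posterior(x_t, x_0, Q_t, Qbar_prev):
    """Return q(x_{t-1} = k | x_t, x_0) for every state k.

    Q_t is the one-step matrix at step t and Qbar_prev the cumulative matrix
    up to step t-1. Entry [m][n] is the probability of reaching state m from
    state n, so every column sums to one.
    """
    num_states = len(Q_t)
    # Numerator of Bayes' rule: q(x_t | x_{t-1} = k) * q(x_{t-1} = k | x_0).
    unnormalized = [Q_t[x_t][k] * Qbar_prev[k][x_0] for k in range(num_states)]
    total = sum(unnormalized)  # this sum equals q(x_t | x_0)
    if total == 0.0:
        raise ValueError("x_t cannot be reached from x_0")
    return [u / total for u in unnormalized]


# Toy two-state chain at t = 2, so the cumulative matrix up to t-1 is Q itself.
Q = [[0.9, 0.1],
     [0.1, 0.9]]
post = posterior(x_t=0, x_0=0, Q_t=Q, Qbar_prev=Q)
print(post)
```

Normalizing by the sum of the numerators avoids forming the cumulative matrix for step $t$ explicitly, since that sum is exactly the marginal in the denominator.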

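Previous works [hoogeboom2021argmax, austin2021structured] propose to introduce a small amount of uniform noise to the categorical distribution, and the transition matrix can be formulated as,

$$Q_t = \begin{bmatrix} \alpha_t + \beta_t & \beta_t & \cdots & \beta_t \\ \beta_t & \alpha_t + \beta_t & \cdots & \beta_t \\ \vdots & \vdots & \ddots & \vdots \\ \beta_t & \beta_t & \cdots & \alpha_t + \beta_t \end{bmatrix}, \tag{6}$$

with $\alpha_t \in [0, 1]$ and $\beta_t = (1 - \alpha_t)/K$. Each token has a probability of $(\alpha_t + \beta_t)$ to retain its previous value at the current step, while with a probability of $K\beta_t$ it is resampled uniformly over all $K$ categories.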

Nonetheless, data corruption using uniform diffusion is a somewhat aggressive process that may pose a challenge for the reverse estimation. First, as opposed to the Gaussian diffusion process for ordinal data, an image token may be replaced by an utterly uncorrelated category, which leads to an abrupt semantic change for that token. Second, the network has to make an extra effort to figure out which tokens have been replaced before fixing them. In fact, due to the semantic conflict within the local context, the reverse estimation for different image tokens may form a competition and run into the dilemma of identifying the reliable tokens.

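Mask-and-replace diffusion strategy. To solve the above issues of uniform diffusion, we draw inspiration from masked language modeling [devlin2018bert] and propose to corrupt the tokens by stochastically masking some of them so that the corrupted locations can be explicitly known by the reverse network. Specifically, we introduce an additional special token, the [MASK] token, so each token now has $(K+1)$ discrete states. We define the mask diffusion as follows: each ordinary token has a probability of $\gamma_t$ to be replaced by the [MASK] token and a chance of $K\beta_t$ to be uniformly diffused, leaving the probability of $\alpha_t = 1 - K\beta_t - \gamma_t$ to be unchanged, whereas the [MASK] token always keeps its own state. Hence, we can formulate the transition matrix $Q_t \in \mathbb{R}^{(K+1) \times (K+1)}$ as,

$$Q_t = \begin{bmatrix} \alpha_t + \beta_t & \beta_t & \cdots & \beta_t & 0 \\ \beta_t & \alpha_t + \beta_t & \cdots & \beta_t & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \beta_t & \beta_t & \cdots & \alpha_t + \beta_t & 0 \\ \gamma_t & \gamma_t & \cdots & \gamma_t & 1 \end{bmatrix}. \tag{7}$$

The benefits of this mask-and-replace transition are as follows: 1) The corrupted tokens are distinguishable to the network, which eases the reverse process. 2) Compared to the mask-only approach in [austin2021structured], we theoretically prove that it is necessary to include a small amount of uniform noise besides the token masking, otherwise we get a trivial posterior when $x_t \neq x_0$. 3) The random token replacement forces the network to understand the context rather than only focusing on the [MASK] tokens. 4) The cumulative transition matrix $\bar{Q}_t$ and the probability $q(x_t \mid x_0)$ in Equation 4 can be computed in closed form with:

$$\bar{Q}_t v(x_0) = \bar{\alpha}_t \, v(x_0) + (\bar{\gamma}_t - \bar{\beta}_t) \, v(K+1) + \bar{\beta}_t \mathbf{1}, \tag{8}$$

where $\mathbf{1}$ is the all-ones vector, and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, $\bar{\gamma}_t = 1 - \prod_{i=1}^{t} (1 - \gamma_i)$, and $\bar{\beta}_t = (1 - \bar{\alpha}_t - \bar{\gamma}_t)/K$ can be calculated and stored in advance. Thus, the computation cost of $\bar{Q}_t v(x_0)$ is reduced from $O(tK^2)$ to $O(K)$. The proof is given in the appendix.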
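The closed form for the cumulative transition can be checked numerically on a toy problem. The sketch below is only an illustration: the helper names, the codebook size `K = 4`, the number of steps `T = 5` and the per-step rates are all hypothetical choices, and the matrix uses the column convention (entry `[m][n]` is the probability of moving to state `m` from state `n`, with index `K` standing for [MASK]).

```python
def mask_and_replace_matrix(K, alpha, beta, gamma):
    """One-step (K+1)x(K+1) mask-and-replace matrix; index K is [MASK]."""
    Q = [[0.0] * (K + 1) for _ in range(K + 1)]
    for n in range(K):                      # column n: previous ordinary token n
        for m in range(K):
            Q[m][n] = beta + (alpha if m == n else 0.0)
        Q[K][n] = gamma                     # ordinary token becomes [MASK]
    Q[K][K] = 1.0                           # [MASK] always stays [MASK]
    return Q


def matmul(A, B):
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def closed_form_column(x_0, K, alphas, gammas):
    """Distribution of x_t given an ordinary token x_0, from cumulative rates."""
    alpha_bar, keep_unmasked = 1.0, 1.0
    for a, g in zip(alphas, gammas):
        alpha_bar *= a
        keep_unmasked *= 1.0 - g
    gamma_bar = 1.0 - keep_unmasked
    beta_bar = (1.0 - alpha_bar - gamma_bar) / K
    column = [beta_bar] * K + [gamma_bar]
    column[x_0] += alpha_bar
    return column


# Hypothetical schedule: rates grow linearly over T = 5 steps.
K, T = 4, 5
gammas = [0.05 * t for t in range(1, T + 1)]
betas = [0.01 * t for t in range(1, T + 1)]
alphas = [1.0 - K * b - g for b, g in zip(betas, gammas)]

Q_bar = mask_and_replace_matrix(K, 1.0, 0.0, 0.0)        # identity matrix
for a, b, g in zip(alphas, betas, gammas):
    Q_bar = matmul(mask_and_replace_matrix(K, a, b, g), Q_bar)  # Q_t ... Q_1

max_error = max(
    abs(Q_bar[m][x_0] - closed_form_column(x_0, K, alphas, gammas)[m])
    for x_0 in range(K) for m in range(K + 1)
)
print(max_error)
```

The explicit product costs a full matrix multiplication per step, whereas the closed form only needs three precomputed scalars per timestep, which is why the cumulative rates are worth caching.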

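Algorithm 1 Training of the VQ-Diffusion, given transition matrices $\{Q_t\}$, initial network parameters $\theta$, loss weight $\lambda$, learning rate $\eta$.
repeat
    $(x_0, y) \leftarrow$ sample training image-text pair
    $t \leftarrow$ sample a timestep from $\{1, \dots, T\}$
    $x_t \leftarrow$ sample from $q(x_t \mid x_0)$ (Eqn. 4 and 8)
    $\mathcal{L} \leftarrow$ VLB term for timestep $t$ plus the auxiliary denoising loss weighted by $\lambda$ (Eqn. 9 and 12)
    $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$ (update network parameters)
until converged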
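The corruption step of the training loop can be sketched as follows. This is a simplified illustration, not the released training code: the function name `sample_x_t`, the token values and the cumulative rates are hypothetical, and in practice the rates would be looked up from the precomputed schedule at the sampled timestep.

```python
import random


def sample_x_t(x_0, alpha_bar, gamma_bar, K, rng):
    """Corrupt a token sequence in one shot using cumulative rates.

    Each ordinary token becomes [MASK] (index K) with probability gamma_bar,
    is kept with probability alpha_bar, and is otherwise resampled uniformly
    over the K ordinary tokens (which may return the same token again).
    """
    x_t = []
    for token in x_0:
        u = rng.random()
        if u < gamma_bar:
            x_t.append(K)                    # masked
        elif u < gamma_bar + alpha_bar:
            x_t.append(token)                # kept
        else:
            x_t.append(rng.randrange(K))     # uniformly replaced
    return x_t


rng = random.Random(0)
T, K = 100, 8                                # hypothetical sizes
x_0 = [3, 1, 4, 1, 5]
t = rng.randint(1, T)                        # timestep for this training example
# Hypothetical cumulative rates; a real schedule would depend on t.
x_t = sample_x_t(x_0, alpha_bar=0.6, gamma_bar=0.3, K=K, rng=rng)
print(t, x_t)
```

Because the corrupted sequence is drawn directly from the clean one, a training iteration never has to simulate the intermediate timesteps.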

4.2 Learning the reverse process

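To reverse the diffusion process, we train a denoising network $p_\theta(x_{t-1} \mid x_t, y)$ to estimate the posterior transition distribution $q(x_{t-1} \mid x_t, x_0)$. The network is trained to minimize the variational lower bound (VLB) [sohl2015deep]:

$$\begin{aligned} \mathcal{L}_{vlb} &= \mathcal{L}_0 + \mathcal{L}_1 + \cdots + \mathcal{L}_{T-1} + \mathcal{L}_T, \\ \mathcal{L}_0 &= -\log p_\theta(x_0 \mid x_1, y), \\ \mathcal{L}_{t-1} &= D_{KL}\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t, y) \big), \\ \mathcal{L}_T &= D_{KL}\big( q(x_T \mid x_0) \,\|\, p(x_T) \big), \end{aligned} \tag{9}$$

where $p(x_T)$ is the prior distribution of timestep $T$. For the proposed mask-and-replace diffusion, the prior is:

$$p(x_T) = \big[ \bar{\beta}_T, \bar{\beta}_T, \cdots, \bar{\beta}_T, \bar{\gamma}_T \big]^{\top}, \tag{10}$$

where $\bar{\beta}_T$ and $\bar{\gamma}_T$ are the cumulative replace and mask probabilities at the final timestep. Note that since the transition matrix is fixed in training, $\mathcal{L}_T$ is a constant which measures the gap between training and inference and can be ignored in training.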

Reparameterization trick on discrete stage. The network parameterization affects the synthesis quality significantly. Instead of directly predicting the posterior $q(x_{t-1} \mid x_t, x_0)$, recent works [ho2020denoising, hoogeboom2021argmax, austin2021structured] find that approximating some surrogate variables, e.g., the noiseless target data $x_0$, gives better quality. In the discrete setting, we let the network predict the noiseless token distribution $p_\theta(\tilde{x}_0 \mid x_t, y)$ at each reverse step. We can thus compute the reverse transition distribution according to:

$$p_\theta(x_{t-1} \mid x_t, y) = \sum_{\tilde{x}_0 = 1}^{K} q(x_{t-1} \mid x_t, \tilde{x}_0) \, p_\theta(\tilde{x}_0 \mid x_t, y). \tag{11}$$

Based on the reparameterization trick, we can introduce an auxiliary denoising objective, which encourages the network to predict the noiseless token $x_0$:

$$\mathcal{L}_{x_0} = -\log p_\theta(x_0 \mid x_t, y). \tag{12}$$

We find that combining this loss with $\mathcal{L}_{vlb}$ improves the image quality.
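The reparameterized reverse step amounts to averaging the tractable posteriors under the predicted distribution of the noiseless token. The sketch below assumes the posteriors have already been computed for each candidate noiseless token; the function name `reverse_distribution` and all numbers are hypothetical (two ordinary tokens plus a [MASK] state).

```python
def reverse_distribution(posteriors, p_x0):
    """Mix posteriors q(x_{t-1} | x_t, x0 = k) with weights p(x0 = k | x_t, y).

    posteriors[k] is a distribution over all states (ordinary tokens and
    [MASK]); p_x0 is the network's predicted distribution over the K
    ordinary tokens.
    """
    num_states = len(posteriors[0])
    mixed = [0.0] * num_states
    for k, weight in enumerate(p_x0):
        for state in range(num_states):
            mixed[state] += weight * posteriors[k][state]
    return mixed


# Hypothetical values: K = 2 ordinary tokens, state index 2 is [MASK].
posteriors = [
    [0.7, 0.1, 0.2],   # q(x_{t-1} | x_t, x0 = 0)
    [0.1, 0.7, 0.2],   # q(x_{t-1} | x_t, x0 = 1)
]
p_x0 = [0.75, 0.25]    # predicted noiseless-token distribution
p_prev = reverse_distribution(posteriors, p_x0)
print(p_prev)
```

The network only ever outputs a distribution over clean tokens; the diffusion-specific structure enters through the fixed posteriors.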

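Algorithm 2 Inference of the VQ-Diffusion, given the fast inference time stride $\Delta_t$ and the input text.
$t \leftarrow T$; $y \leftarrow$ text tokens of the input text
$x_t \leftarrow$ sample from $p(x_T)$ (Eqn. 10)
while $t > 0$ do
    $x_t \leftarrow$ sample from $p_\theta(x_{t-\Delta_t} \mid x_t, y)$ (Eqn. 13)
    $t \leftarrow t - \Delta_t$
end while
return VQVAE-Decoder($x_t$)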

Model architecture. We propose an encoder-decoder transformer to estimate the distribution $p_\theta(\tilde{x}_0 \mid x_t, y)$. As shown in Figure 1, the framework contains two parts: a text encoder and a diffusion image decoder. Our text encoder takes the text tokens $y$ and yields a conditional feature sequence. The diffusion image decoder takes the image token $x_t$ and timestep $t$ and outputs the noiseless token distribution $p_\theta(\tilde{x}_0 \mid x_t, y)$. The decoder contains several transformer blocks and a softmax layer. Each transformer block contains a full attention, a cross attention to combine text information and a feed forward network block. The current timestep $t$ is injected into the network with the Adaptive Layer Normalization [ba2016layer] (AdaLN) operator, i.e., $\mathrm{AdaLN}(h, t) = a_t \, \mathrm{LayerNorm}(h) + b_t$, where $h$ denotes the intermediate activations, and $a_t$ and $b_t$ are obtained from a linear projection of the timestep embedding.
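The AdaLN operator can be sketched in a few lines of plain Python. The per-channel scale and shift, the helper names and the projection weights below are assumptions made for illustration; the text only specifies that the scale and shift come from a linear projection of the timestep embedding.

```python
import math


def layer_norm(h, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def ada_layer_norm(h, t_emb, W_a, c_a, W_b, c_b):
    """AdaLN(h, t) = a_t * LayerNorm(h) + b_t, with a_t and b_t projected
    from the timestep embedding t_emb (hypothetical parameter names)."""
    a_t = linear(W_a, c_a, t_emb)
    b_t = linear(W_b, c_b, t_emb)
    return [a * n + b for a, n, b in zip(a_t, layer_norm(h), b_t)]


h = [1.0, 2.0, 3.0, 4.0]                     # toy activations
t_emb = [0.5, -0.5]                          # toy timestep embedding
W_zero = [[0.0, 0.0] for _ in h]
# With zero weights, unit scale bias and zero shift bias, AdaLN reduces
# to plain layer normalization.
out = ada_layer_norm(h, t_emb, W_zero, [1.0] * 4, W_zero, [0.0] * 4)
print(out)
```

Conditioning through the normalization parameters lets every block see the timestep without adding extra tokens to the sequence.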

Figure 2: Comparison with GAN-based method on CUB-200 and MSCOCO datasets.

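Fast inference strategy. In the inference stage, by leveraging the reparameterization trick, we can skip some steps in the diffusion model to achieve faster inference.

Specifically, assuming the time stride is $\Delta_t$, instead of sampling images in the chain of $x_T, x_{T-1}, x_{T-2}, \dots, x_0$, we sample images in the chain of $x_T, x_{T-\Delta_t}, x_{T-2\Delta_t}, \dots, x_0$ with the reverse transition distribution:

$$p_\theta(x_{t-\Delta_t} \mid x_t, y) = \sum_{\tilde{x}_0 = 1}^{K} q(x_{t-\Delta_t} \mid x_t, \tilde{x}_0) \, p_\theta(\tilde{x}_0 \mid x_t, y). \tag{13}$$

We found that this makes the sampling more efficient while causing only little harm to quality. The whole training and inference procedure is shown in Algorithms 1 and 2.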
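The shortened sampling chain can be sketched as a list of visited timesteps. The function name `inference_timesteps` is hypothetical, and shortening the final jump so that the chain ends exactly at 0 when the stride does not divide the number of timesteps is an assumption of this sketch.

```python
def inference_timesteps(T, stride):
    """Timesteps visited by the reverse process: T, T - stride, ..., 0."""
    steps = list(range(T, 0, -stride))
    steps.append(0)          # always finish at the noiseless step
    return steps


chain = inference_timesteps(100, 4)
num_network_calls = len(chain) - 1   # one decoder call per transition
print(chain[:4], chain[-2:], num_network_calls)
```

With 100 training timesteps and a stride of 4, the decoder is called 25 times instead of 100.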

5 Experiments

In this section, we first introduce the overall experiment setups and then present extensive results to demonstrate the superiority of our approach in text-to-image synthesis. Finally, we point out that our method is a general image synthesis framework that achieves great performance on other generation tasks, including unconditional and class conditional image synthesis.

Datasets. To demonstrate the capability of our proposed method for text-to-image synthesis, we conduct experiments on CUB-200 [wah2011caltech], Oxford-102 [nilsback2008automated], and MSCOCO [lin2014microsoft] datasets. The CUB-200 dataset contains 8855 training images and 2933 test images belonging to 200 bird species. Oxford-102 dataset contains 8189 images of flowers of categories. Each image in CUB-200 and Oxford-102 dataset contains 10 text descriptions. MSCOCO dataset contains images for training and images for testing. Each image in this dataset has five text descriptions.

To further demonstrate the scalability of our method, we also train our model on large scale datasets, including Conceptual Captions [sharma2018conceptual, changpinyo2021conceptual] and LAION-400M [schuhmann2021laion]. The Conceptual Caption dataset, including both CC3M [sharma2018conceptual] and CC12M [changpinyo2021conceptual] datasets, contains 15M images. To balance the text and image distribution, we filter a 7M subset according to the word frequency. The LAION-400M dataset contains 400M image-text pairs. We train our model on three subsets from LAION, i.e., cartoon, icon, and human, each of them contains 0.9M, 1.3M, 42M images, respectively. For each subset, we filter the data according to the text.

Training Details. Our VQ-VAE’s encoder and decoder follow the setting of VQGAN [esser2021taming] which leverages the GAN loss to get a more realistic image. We directly adopt the publicly available VQGAN model trained on OpenImages [krasin2017openimages] dataset for all text-to-image synthesis experiments. It converts images into tokens. The codebook size after removing useless codes. We adopt a publicly available tokenizer of the CLIP model [radford2021learning] as text encoder, yielding a conditional sequence of length 77. We fix both image and text encoders in our training.

For fair comparison with previous text-to-image methods under similar parameters, we build two different diffusion image decoder settings: 1) VQ-Diffusion-S (Small), it contains transformer blocks with dimension of . The model contains parameters. 2) VQ-Diffusion-B (Base), it contains transformer blocks with dimension of . The model contains parameters.

In order to show the scalability of our method, we also train our base model on a larger database Conceptual Captions, and then fine-tune it on each database. This model is denoted as VQ-Diffusion-F.

For the default setting, we set timesteps and loss weight . For the transition matrix, we linearly increase and from to and , respectively. We optimize our network using AdamW [loshchilov2017decoupled] with and . The learning rate is set to after 5000 iterations of warmup. More training details are provided in the appendix.

5.1 Comparison with state-of-the-art methods

We quantitatively compare the proposed method with several state-of-the-art methods, including some GAN-based methods [xu2018attngan, zhang2017stackgan, souza2020efficient, tan2019semantics, zhang2018stackgan++, zhu2019dm, tao2020df], DALL-E [ramesh2021zero] and CogView [ding2021cogview], on MSCOCO, CUB-200 and Oxford-102 datasets. We use FID [heusel2017gans] as the comparison metric and show the results in Table 1.

We can see that our small model, VQ-Diffusion-S, which has a similar number of parameters to previous GAN-based models, achieves strong performance on the CUB-200 and Oxford-102 datasets. Our base model, VQ-Diffusion-B, further improves the performance. Our VQ-Diffusion-F model achieves the best results and surpasses all previous methods by a large margin, even surpassing DALL-E [ramesh2021zero] and CogView [ding2021cogview], which have ten times more parameters than ours, on the MSCOCO dataset.

Some visualized comparison results with DM-GAN [zhu2019dm] and DF-GAN [tao2020df] are shown in Figure 2. Our synthesized images have more realistic fine-grained details and are more consistent with the input text.

5.2 In the wild text-to-image synthesis

To demonstrate the capability of generating in-the-wild images, we train our model on three subsets from LAION-400M dataset, e.g., cartoon, icon and human. We provide our results here in Figure 3. Though our base model is much smaller than previous works like DALL-E and CogView, we also achieved a strong performance.

Compared with the AR method, which generates images from top-left to bottom-right, our method generates images in a global manner. This allows our method to be applied to many vision tasks, e.g., irregular mask inpainting. For this task, we do not need to re-train a new model. We simply set the tokens in the irregular region to the [MASK] token and send them to our model. This strategy supports both unconditional mask inpainting and text-conditional mask inpainting. Due to the space limitation, we show these results in the appendix.
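Preparing the input for mask inpainting only requires overwriting the tokens of the region to be filled. The sketch below is a hypothetical illustration: the helper `mask_region`, the 3x3 token grid and the choice of index 8 for the [MASK] token are assumptions, and the subsequent denoising by the trained decoder is not shown.

```python
def mask_region(tokens, region, mask_id):
    """Return a copy of the token grid with every (row, col) in `region`
    replaced by the [MASK] token index."""
    masked = [row[:] for row in tokens]
    for r, c in region:
        masked[r][c] = mask_id
    return masked


MASK_ID = 8                                   # hypothetical: K = 8 ordinary tokens
tokens = [[0, 1, 2],
          [3, 4, 5],
          [6, 7, 0]]
irregular_region = [(0, 1), (1, 1), (1, 2)]   # any set of positions works
masked_tokens = mask_region(tokens, irregular_region, MASK_ID)
print(masked_tokens)
```

Since the decoder attends to all positions at once, the region does not have to follow any scan order or rectangular shape.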

| Method | MSCOCO | CUB-200 | Oxford-102 |
|---|---|---|---|
| StackGAN [zhang2017stackgan] | 74.05 | 51.89 | 55.28 |
| StackGAN++ [zhang2018stackgan++] | 81.59 | 15.30 | 48.68 |
| EFF-T2I [souza2020efficient] | - | 11.17 | 16.47 |
| SEGAN [tan2019semantics] | 32.28 | 18.17 | - |
| AttnGAN [xu2018attngan] | 35.49 | 23.98 | - |
| DM-GAN [zhu2019dm] | 32.64 | 16.09 | - |
| DF-GAN [tao2020df] | 21.42 | 14.81 | - |
| DAE-GAN [ruan2021dae] | 28.12 | 15.19 | - |
| DALLE [ramesh2021zero] | 27.50 | 56.10 | - |
| Cogview [ding2021cogview] | 27.10 | - | - |
| VQ-Diffusion-S | - | 12.97 | 14.95 |
| VQ-Diffusion-B | 19.75 | 11.94 | 14.88 |
| VQ-Diffusion-F | 13.86 | 10.32 | 14.10 |

Table 1: FID comparison of different text-to-image synthesis methods on MSCOCO, CUB-200, and Oxford-102 datasets.
Figure 3: In the wild text-to-image synthesis results.

5.3 Ablations

Number of timesteps. We investigate the timesteps in training and inference. As shown in Table 2, we perform the experiment on the CUB-200 dataset. We find that when the training steps increase from 10 to 100, the result improves; when they further increase to 200, the result seems saturated. So we set the default number of timesteps to 100 in our experiments. To demonstrate the fast inference strategy, we evaluate the generated images from 10, 25, 50, 100, and 200 inference steps on five models with different training steps. We find that the model still maintains a good performance when a large share of the inference steps is dropped, which saves a corresponding share of the inference time.

Mask-and-replace diffusion strategy. We explore how the mask-and-replace strategy benefits our performance on the Oxford-102 dataset. We set different final mask rate () to investigate the effect. Both mask only strategy () and replace only strategy () are special cases of our mask-and-replace strategy. From Figure 4, we find it get the best performance when . When , it may suffer from the error accumulation problem, when , the network may be difficult to find which region needs to pay more attention.

Inference  Training steps
steps      10     25     50     100    200
10         32.35  27.62  23.47  19.84  20.96
25         -      18.53  15.25  14.03  16.13
50         -      -      13.82  12.45  13.67
100        -      -      -      11.94  12.27
200        -      -      -      -      11.80
Table 2: Ablation study on training steps and inference steps. Each column shares the same training steps while each row shares the same inference steps.
Figure 4: Ablation study on the mask rate and the truncation rate.
Model           Steps  FID    Throughput (s/img)
VQ-AR-S         -      18.12  12.1
VQ-Diffusion-S  25     15.46  0.8
VQ-Diffusion-S  50     13.62  1.5
VQ-Diffusion-S  100    12.97  2.7
VQ-AR-B         -      17.76  36.2
VQ-Diffusion-B  25     14.03  2.1
VQ-Diffusion-B  50     12.45  4.1
VQ-Diffusion-B  100    11.94  8.0
Table 3: Comparison between VQ-Diffusion and VQ-AR models. By reducing the number of inference steps, the VQ-Diffusion model is about 15 times faster than the VQ-AR model while maintaining better performance.
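The speed-up implied by Table 3 can be checked with a few lines of arithmetic. The timings below are copied from the table; the dictionary layout and the `speedup` helper are illustrative choices, not part of the original work.

```python
# Seconds per image from Table 3, keyed by (model size, inference steps).
# "AR" marks the autoregressive baseline of the same size.
SEC_PER_IMG = {
    ("S", "AR"): 12.1,
    ("S", 25): 0.8,
    ("S", 50): 1.5,
    ("S", 100): 2.7,
    ("B", "AR"): 36.2,
    ("B", 25): 2.1,
    ("B", 50): 4.1,
    ("B", 100): 8.0,
}


def speedup(size, steps):
    """Ratio of VQ-AR time to VQ-Diffusion time for one image."""
    return SEC_PER_IMG[(size, "AR")] / SEC_PER_IMG[(size, steps)]


small_25 = speedup("S", 25)
base_25 = speedup("B", 25)
```

With 25 inference steps the ratio is roughly 15.1 for the small model and roughly 17.2 for the base model; with all 100 steps it falls to about 4.5 for both.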

Truncation. We also demonstrate that the truncation sampling strategy is extremely important for our discrete diffusion based method. It prevents the network from sampling low-probability tokens. Specifically, we keep only the top tokens of the predicted distribution in the inference stage. We evaluate the results with different truncation rates on the CUB-200 dataset. Figure 4 shows the truncation rate that achieves the best performance.
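One plausible reading of truncation sampling is sketched below. It assumes the truncation rate is a cumulative-probability threshold: tokens are kept in order of decreasing probability until their total mass reaches the rate, and the rest are zeroed out before renormalizing. The text here does not pin down this interpretation, so treat the function `truncate` and its behavior as an assumption.

```python
def truncate(probs, rate):
    """Keep the most probable tokens until their mass reaches `rate`, then renormalize.

    `probs` is a list of token probabilities summing to 1; `rate` is the
    assumed cumulative-probability threshold in (0, 1].
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= rate:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


# Toy distribution over four codebook entries.
truncated = truncate([0.5, 0.3, 0.15, 0.05], rate=0.7)
```

In this toy case the two least likely tokens receive zero probability, so they can never be sampled at that step.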

VQ-Diffusion vs VQ-AR. For a fair comparison, we replace our diffusion image decoder with an autoregressive decoder with the same network structure and keep other settings the same, including both image and text encoders. The autoregressive models are denoted as VQ-AR-S and VQ-AR-B, corresponding to VQ-Diffusion-S and VQ-Diffusion-B. The experiment is performed on the CUB-200 dataset. As shown in Table 3, in both the -S and -B settings the VQ-Diffusion model surpasses the VQ-AR model by a large margin. Meanwhile, we evaluate the throughput of both methods on a V100 GPU with a batch size of 32. The VQ-Diffusion with the fast inference strategy is about 15 times faster than the VQ-AR model while achieving a better FID score.

5.4 Unified generation model

Our method is general and can also be applied to other image synthesis tasks, e.g., unconditional image synthesis and image synthesis conditioned on labels. To generate images from a given class label, we first remove the text encoder network and the cross-attention part in the transformer blocks, and inject the class label through the AdaLN operator. Our network contains transformer blocks with dimension . We train our model on the ImageNet dataset. For VQ-VAE, we adopt the publicly available model from VQ-GAN [esser2021taming] trained on the ImageNet dataset, which downsamples images from to . For unconditional image synthesis, we train our model on the FFHQ256 dataset, which contains 70k high-quality face images. The image encoder also downsamples images to tokens.
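Injecting a class label through an adaptive layer norm can be sketched as follows. This is a simplified illustration under assumptions: the scale and shift vectors are looked up from a small hand-written table instead of being produced by a learned embedding, and the `(1 + scale)` form is one common AdaLN variant that may differ from the one used in the paper. `ada_layer_norm` and `CLASS_PARAMS` are hypothetical names.

```python
import math


def ada_layer_norm(x, scale, shift, eps=1e-5):
    """Normalize x to zero mean and unit variance, then modulate per channel."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


# Hypothetical per-class (scale, shift) vectors; in a trained model these
# would come from a learned class embedding.
CLASS_PARAMS = {
    0: ([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    1: ([0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]),
}

features = [1.0, 2.0, 3.0, 4.0]
out_class0 = ada_layer_norm(features, *CLASS_PARAMS[0])
out_class1 = ada_layer_norm(features, *CLASS_PARAMS[1])
```

The same features produce different activations for different labels, which is how the label steers generation without a text encoder or cross-attention.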

We assess the performance of our model in terms of FID and compare with a variety of previously established models [brock2018large, nichol2021improved, dhariwal2021diffusion, esser2021taming, esser2021imagebart]. Following [esser2021taming], we can further increase the quality by only accepting images with a top 5% classification score, denoted as acc0.05. We show the quantitative results in Table 4. While some task-specialized GAN models report better FID scores, our approach provides a unified model that works well across a wide range of tasks.
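The acceptance step can be sketched as a simple filter. The sketch assumes each generated image already has a classifier score for its target class; how that score is computed is outside the scope of this example, and `accept_top` is a hypothetical helper name.

```python
def accept_top(samples, scores, rate=0.05):
    """Keep the fraction `rate` of samples with the highest classifier scores."""
    n_keep = max(1, round(len(samples) * rate))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:n_keep]]


# Toy data: 100 sample identifiers with increasing scores.
samples = [f"img_{i}" for i in range(100)]
scores = [i / 100 for i in range(100)]
accepted = accept_top(samples, scores, rate=0.05)
```

With 100 candidates and a rate of 0.05, five images are kept, so the FID is computed on a much smaller, higher-confidence subset of what the model generates.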

Model                                     ImageNet  FFHQ
StyleGAN2 [karras2020analyzing]           -         3.8
BigGAN [brock2018large]                   7.53      12.4
BigGAN-deep [brock2018large]              6.84      -
IDDPM [nichol2021improved]                12.3      -
ADM-G [dhariwal2021diffusion]             10.94     -
VQGAN [esser2021taming]                   15.78     9.6
ImageBART [esser2021imagebart]            21.19     9.57
Ours                                      11.89     6.33
ADM-G (1.0guid) [dhariwal2021diffusion]   4.59      -
VQGAN (acc0.05) [esser2021taming]         5.88      -
ImageBART (acc0.05) [esser2021imagebart]  7.44      -
Ours (acc0.05)                            5.32      -
Table 4: FID score comparison for class-conditional synthesis on ImageNet and unconditional synthesis on the FFHQ dataset. ’guid’ denotes using classifier guidance [dhariwal2021diffusion]; ’acc’ denotes adopting an acceptance rate [esser2021taming].

6 Conclusion

In this paper, we present a novel text-to-image architecture named VQ-Diffusion. The core design is to model the VQ-VAE latent space in a non-autoregressive manner. The proposed mask-and-replace diffusion strategy avoids the accumulation of errors of the AR model. Our model has the capacity to generate more complex scenes, which surpasses previous GAN-based text-to-image methods. Our method is also general and produces strong results on unconditional and conditional image generation.

Limitations and Future work. We are aware that there are some limitations of our model. 1) Due to the model capacity and the data bias, the current model has a bias for generating high quality objects. 2) To save GPU memory and make the training more efficient, the VQ-VAE and VQ-Diffusion are not trained end-to-end, which may not be optimal. In the future, we are looking forward to using a larger model and data for training and applying the VQ-Diffusion to more generation tasks.


We thank Qiankun Liu from the University of Science and Technology of China for his help; he provided the initial code and datasets.


Appendix A Implementation details

In our experiments on text-to-image synthesis, we adopt the public VQ-VAE [oord2017neural] model provided by VQGAN [esser2021taming] trained on the OpenImages [krasin2017openimages] dataset, which downsamples images from to . We use the CLIP [radford2021learning] pretrained model (ViT-B) as our text encoder, which encodes a sentence to tokens. Our diffusion image decoder consists of several transformer blocks; each block contains full attention, cross attention, and a feed-forward network (FFN). Our base model contains transformer blocks, and the channel of each block is . The FFN contains two linear layers, which expand the dimension to in the middle layer. The model contains M parameters. Our small model contains transformer blocks with channel ; its FFN contains two convolution layers with kernel size , and the channel expand rate is . The model contains M parameters.

For our class-conditional generation model on ImageNet, we adopt the public VQ-VAE model provided by VQGAN trained on ImageNet, which downsamples images from to . Our model contains transformer blocks; each block contains a full attention layer and an FFN. The base channel number is . In addition, the FFN uses convolution layers instead of linear layers, and the channel expand rate is .

Figure 5: Text guided image editing by VQ-Diffusion.
Figure 6: In the wild text-to-image synthesis by VQ-Diffusion.
Figure 7: Comparison of our results with XMC-GAN; their results are taken from their paper.
Figure 8: VQ-Diffusion results on FFHQ1024 and FFHQ256 datasets.
Figure 9: VQ-Diffusion results of class conditional synthesis on ImageNet dataset.

Appendix B Proof of Equation 8

Mathematical induction can be used to prove Equation 8 in the paper.

When , we have


which clearly holds. Suppose Equation 8 holds at step , then for :

When ,

When ,

When and ,

This completes the proof.

Appendix C Results

In this part, we provide more visualization results. First, we compare our results with XMC-GAN in Figure 7; their results are taken directly from their paper. The irregular mask inpainting results are shown in Figure 5. We show more in-the-wild text-to-image results in Figure 6. Finally, we provide our results on ImageNet and FFHQ in Figure 9 and Figure 8, respectively.