Automatic synthesis of realistic images from arbitrary text descriptions is one of the core aspirations in artificial intelligence. Most existing works achieve this goal by consuming a large number of high-quality image-text pairs [xu2018attngan, zhu2019dm, zhang2021crossmodal, ramesh2021zero, ding2021cogview], which, however, often require a heavy workload of precise human captioning and filtering. For instance, MS-COCO [lin2014microsoft], the most commonly used dataset in text-to-image generation tasks, required over 70,000 worker hours to gather and annotate the captions. Even a less curated dataset such as Google Conceptual Captions [Sharma2018ConceptualCA] consists of 3.3 million image-text pairs that are heavily filtered from 5 billion images from around 1 billion English webpages. In practice, for a customized domain, it is infeasible to collect such a large number of image-text pairs for model training, due to the high cost of human captioning and filtering. This challenge underscores the importance of zero-shot text-to-image generation, where no domain-specific image-text pairs are used to train a model to generate images in a given domain.
Recently, several attempts have been made to tackle the zero-shot text-to-image generation problem by pre-training giant generative models on web-scale image-text pairs, such as DALL-E [ramesh2021zero] and CogView [ding2021cogview]. Both are auto-regressive Transformer models built for zero-shot text-to-image generation, as they can generate corresponding images given arbitrary text descriptions without training on domain-specific datasets. However, to ensure good performance, these models require data collection, model size and model training at a gigantic scale. Specifically, DALL-E contains over 12 billion parameters and is trained on a dataset consisting of 250 million image-text pairs; CogView is a model with 4 billion parameters trained on 30 million image-text pairs. For this reason, hundreds of GPUs are required to train these models, which significantly increases the carbon footprint and decreases inclusivity, making it extremely difficult for more researchers to participate in the study of this topic.
It is therefore desirable to provide affordable solutions for building text-to-image generation models in settings with limited image-text pair data, by reducing the requirements on model size, data collection and model training. In terms of data collection, in the ideal scenario, the language-free setting is probably the minimal and cheapest requirement, where only image data is provided. This is important because collecting only image data is much easier than constructing high-quality image-text pairs, given the ample domain-specific image datasets available online.
Figure 1: Performance of zero-shot text-to-image generation on the COCO dataset. Lafite has a much smaller model size, especially when considering trainable parameters (left figure), but shows a higher Inception Score (middle figure) and a lower FID (right figure). Please refer to Section 4 for details.
To this end, we propose Lafite (LAnguage-Free traIning for Text-to-image gEneration), a generative adversarial approach that significantly lowers the cost barrier to building efficient text-to-image generation models, based on the pre-trained CLIP model [radford2021learning]. Specifically, (i) we take advantage of CLIP's property of image-text feature alignment in the joint semantic space to construct pseudo image-text feature pairs; (ii) we propose a text-to-image GAN (Generative Adversarial Network) model [goodfellow2014generative] that can effectively leverage pseudo image-text feature pairs. Our major contributions can be summarized as follows:
We propose Lafite, a versatile system that works effectively in a wide range of text-to-image generation settings, including language-free, zero-shot and fully-supervised learning.
To the best of our knowledge, Lafite is the first work that enables language-free training for the text-to-image generation task. We propose two novel schemes to construct pseudo image-text feature pairs, and conduct a comprehensive study of the new setting. The effectiveness is validated with quantitative results on several datasets with different training schemes (training from scratch and fine-tuning from pre-trained generative models).
In zero-shot text-to-image generation settings, Lafite outperforms the prior art DALL-E and CogView on the COCO benchmark, with less than 1% of the trainable model parameter size (with frozen CLIP model weights). Please see Figure 1 for comparisons.
In the standard fully-supervised settings, Lafite outperforms several state-of-the-art (SoTA) methods by a large margin. Surprisingly, even our language-free model outperforms most existing models that are trained with full image-text pairs.
2 Related Work
Existing models for text-to-image generation can be categorized into two classes: fully-supervised text-to-image generation [xu2018attngan, zhu2019dm, zhang2021crossmodal] and zero-shot text-to-image generation [ramesh2021zero, ding2021cogview]. The SoTA in the full image-text pair setting is still dominated by GAN variants [xu2018attngan, zhu2019dm, zhang2021crossmodal]. GANs [goodfellow2014generative] have inspired many advances in image synthesis [mirza2014conditional, karras2017progressive, liu2017unsupervised, li2017alice, karras2019analyzing]. For text-to-image synthesis, improved model performance often comes from large generative adversarial image models [zhang2021crossmodal] and pre-trained text encoders [liu2019roberta]. Recently, excellent zero-shot text-to-image generation performance has been achieved by DALL-E [ramesh2021zero] and CogView [ding2021cogview]. The basic idea is to encode images into discrete latent tokens using VQ-VAE [van2017neural, razavi2019generating], and to pre-train a huge auto-regressive Transformer [vaswani2017attention] to predict these discrete tokens based on paired text sequences. Our Lafite is the first generative adversarial approach that achieves SoTA on zero-shot generation.
Multi-modal feature learning
Learning a joint and aligned feature space for vision and language has been a long-standing problem in artificial intelligence [weston2010large, socher2010connecting]. Inspired by the BERT model [devlin2018bert], a number of methods attempt to learn generic multi-modal fusion layers, given pre-extracted visual region features and a textual encoder [lu2019vilbert, li2020oscar, su2019vl, zhang2021vinvl, kim2021vilt, li2021align]. These works aim at learning generic multi-modal representations for downstream tasks such as visual question answering [antol2015vqa, hudson2019gqa], image captioning [lin2014microsoft, agrawal2019nocaps] and visual commonsense reasoning [zellers2019recognition]. Unlike the aforementioned works, another line of work focuses on learning visual representations from natural language supervision, including both generative [desai2021virtex] and discriminative [wang2016learning, wang2018learning, zhang2020contrastive] methods. The latter learn an aligned visual-semantic space. This idea was recently scaled up in CLIP/ALIGN [radford2021learning, jia2021scaling], which pave the way toward building a universal image-text representation space. Our Lafite is built on this universal space, and is the first to leverage its multi-modal alignment property for language-free text-to-image generation.
CLIP for generation/manipulation.
The idea of a multi-modal feature space also inspires some recent works on generative models [galatolo2021generating, patashnik2021styleclip, gal2021stylegan, pakhomov2021segmentation]. All of these works are related to ours in that the pre-trained CLIP model and StyleGAN2 are employed. Our Lafite is different in two aspects: (i) The motivations and scenarios are different. Existing works focus on latent optimization [galatolo2021generating], image manipulation [patashnik2021styleclip], domain adaptation [gal2021stylegan] and image segmentation [pakhomov2021segmentation]. We present the first study on training text-to-image generation models without the requirement of paired captions. (ii) The techniques are different. Though the image-text feature alignment property is leveraged in all of these works, our Lafite is the only one to generate pseudo text features in the joint multi-modal space.
3 Lafite: A Language-Free Paradigm
A natural idea to avoid human captioning when constructing image-text pair training data is to use an off-the-shelf image captioning model that can automatically generate captions for the collected training images. However, this is especially challenging due to the lack of a universal captioning model that can (i) bridge the modality gap between text and image to generate high-quality captions; and (ii) generalize to diverse image domains with large domain gaps. In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.
Throughout the paper, $(\mathbf{x}, \mathbf{t})$ denotes an image-text pair, and $\mathbf{x}'$ is the corresponding generated image of $\mathbf{t}$. $G$ and $D$ denote the generator and discriminator, respectively. We use $f_{\text{txt}}$ and $f_{\text{img}}$ to denote the pre-trained text encoder and image encoder, which map text descriptions and image samples into a joint multi-modal feature space. $\mathbf{h} = f_{\text{txt}}(\mathbf{t})$ denotes the real text feature, and $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ denotes latent noise sampled from the standard Gaussian distribution, serving as one input of the generator. Our idea to achieve language-free training is to generate pseudo text features $\mathbf{h}'$, which aim to approximate $\mathbf{h}$, by leveraging the image-text feature alignment of a pre-trained model. The generated features are then fed into the text-to-image generator to synthesize the corresponding images. Without loss of generality, we denote the mapping from input data to the multi-modal feature space as a translator $T$ in two settings. If only images $\mathbf{x}$ are provided (i.e., the language-free setting), we consider a pseudo text-feature generation process $T: \mathbf{x} \rightarrow \mathbf{h}'$; if image-text pairs $(\mathbf{x}, \mathbf{t})$ are provided (i.e., the standard fully-supervised setting), we encode the ground-truth text, $T: \mathbf{t} \rightarrow \mathbf{h}$.
3.1 Pseudo Text-Feature Generation
To achieve the goal, a universal multimodal feature space is desired, where features of paired texts and images are well aligned. Recent vision-and-language models such as CLIP and ALIGN achieve this by pre-training on hundreds or thousands of millions of image-text pairs using contrastive learning. The cosine similarity between matched image-text features is maximized, while the cosine similarity of mismatched pairs is minimized. This naturally provides a high-dimensional hyper-sphere for the multimodal features (in our implementation, we normalize the features extracted with CLIP by their L2 norm), where paired image-text features should be close to each other, with a small angle between their feature vectors.
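This inspires us to explore the potential of generating pseudo text features $\mathbf{h}' \in \mathcal{H}(\mathbf{x})$ for a given image $\mathbf{x}$ on this hyper-sphere: $\mathcal{H}(\mathbf{x}) = \{\mathbf{h}' \mid \text{Sim}(\mathbf{h}', f_{\text{img}}(\mathbf{x})) \geq c\}$, where $\text{Sim}$ denotes cosine similarity and $c > 0$ is a threshold. This idea is illustrated in Figure 2. Based on the analysis, we consider two schemes to generate pseudo text features.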
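To generate the pseudo text feature $\mathbf{h}'$, we propose to perturb the image feature $f_{\text{img}}(\mathbf{x})$ with adaptive Gaussian noise:

$$\mathbf{h}' = \tilde{\mathbf{h}} / \|\tilde{\mathbf{h}}\|_2, \qquad \tilde{\mathbf{h}} = f_{\text{img}}(\mathbf{x}) + \xi \, \epsilon \, \|f_{\text{img}}(\mathbf{x})\|_2 / \|\epsilon\|_2, \tag{1}$$

where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the Gaussian noise, $\xi > 0$ is a fixed hyper-parameter representing the level of perturbation, and $\|\cdot\|_2$ denotes the L2 norm. The added Gaussian noise is adaptive in the sense that it is normalized to a hyper-sphere, then re-scaled by the norm of the image feature. We can prove that, with the adaptive noise, our Lafite$_{\text{G}}$ can generate $\mathbf{h}' \in \mathcal{H}(\mathbf{x})$ with a high probability which depends on $\xi$, $c$ and the feature dimension $d$. The formal theorem and its proof are provided in the Appendix.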
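The fixed-perturbation scheme can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the perturbed feature is the image feature plus unit-normalized Gaussian noise, re-scaled by the feature norm and a perturbation level `xi`, followed by L2 normalization, as the paragraph above describes. Plain Python lists stand in for CLIP features, and the function names, the 512-dimensional random feature and `xi=0.1` are hypothetical choices made only for this example.

```python
import math
import random


def l2_norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def normalize(v):
    """Project a vector onto the unit hyper-sphere."""
    n = l2_norm(v)
    return [a / n for a in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))


def fixed_perturbation(img_feat, xi, rng):
    """Perturb an image feature with adaptive Gaussian noise, then normalize.

    The noise is first normalized to unit length and then re-scaled by
    xi * ||img_feat||, so the perturbation size adapts to the feature norm.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in img_feat]
    scale = xi * l2_norm(img_feat) / l2_norm(eps)
    h_tilde = [f + scale * e for f, e in zip(img_feat, eps)]
    return normalize(h_tilde)


rng = random.Random(0)
# Hypothetical stand-in for a normalized CLIP image feature.
img_feat = normalize([rng.gauss(0.0, 1.0) for _ in range(512)])
pseudo = fixed_perturbation(img_feat, xi=0.1, rng=rng)
print(round(cosine(img_feat, pseudo), 4))
```

Under this form, the perturbation has norm exactly `xi` times the feature norm, so for `xi < 1` the cosine similarity between the image feature and the pseudo text feature cannot fall below `sqrt(1 - xi**2)`.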
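It is natural to extend this to learn more adaptive noise instead of using a vanilla Gaussian. To this end, we propose to train an inference model consisting of two neural networks, $r_1(\cdot)$ and $r_2(\cdot)$. With the re-parameterization trick [kingma2013auto], the generation of pseudo text features is:

$$\mathbf{h}' = \tilde{\mathbf{h}} / \|\tilde{\mathbf{h}}\|_2, \qquad \tilde{\mathbf{h}} = f_{\text{img}}(\mathbf{x}) + r_1(f_{\text{img}}(\mathbf{x})) + \epsilon \odot \exp(r_2(f_{\text{img}}(\mathbf{x}))), \tag{2}$$

where $\exp$ denotes the element-wise exponent operation, $\odot$ denotes element-wise multiplication, and $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ denotes noise sampled from the standard Gaussian. In practice, we construct $r_1(\cdot)$ and $r_2(\cdot)$ with 4 fully-connected (FC) layers each, and train them in a supervised way by maximizing the cosine similarity between generated text features and real text features.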
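The re-parameterized sampling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the perturbed feature takes the form `f + r1(f) + eps * exp(r2(f))` followed by L2 normalization, where `f` is the image feature. The two networks are replaced by hypothetical constant functions (`zero_shift`, `small_log_scale`) so the example stays self-contained; in practice they would be trained FC networks.

```python
import math
import random


def trainable_perturbation(img_feat, r1, r2, rng):
    """Sample a pseudo text feature with the re-parameterization trick.

    r1 predicts an additive shift and r2 an element-wise log-scale for the
    Gaussian noise; both take the image feature as input.
    """
    shift = r1(img_feat)
    log_scale = r2(img_feat)
    eps = [rng.gauss(0.0, 1.0) for _ in img_feat]
    h_tilde = [
        f + m + e * math.exp(s)
        for f, m, e, s in zip(img_feat, shift, eps, log_scale)
    ]
    norm = math.sqrt(sum(a * a for a in h_tilde))
    return [a / norm for a in h_tilde]


def zero_shift(feat):
    """Hypothetical stand-in for r1: predicts no shift."""
    return [0.0] * len(feat)


def small_log_scale(feat):
    """Hypothetical stand-in for r2: a small, constant noise scale."""
    return [-3.0] * len(feat)


rng = random.Random(0)
img_feat = [0.6, 0.8, 0.0]  # toy unit-norm feature
pseudo = trainable_perturbation(img_feat, zero_shift, small_log_scale, rng)
print([round(a, 3) for a in pseudo])
```

Predicting a log-scale and exponentiating it keeps the noise scale positive without constraining the network output, which is the usual motivation for this parameterization.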
Both schemes have their own pros and cons. The trainable perturbation generally yields better performance than the fixed perturbation. However, the fixed perturbation is easier to use, without the requirement of training an inference model on an additional dataset with annotated image-text pairs. Further, the performance of trainable perturbation is influenced by the gap between datasets used in training the inference model and the generative model, as empirically verified in our ablation studies in the experiment section.
3.2 Network Architectures
We propose to adapt the unconditional StyleGAN2 to a conditional generative model for our goal. Note that although we discuss our model in a language-free setting, it can be directly generalized to standard text-to-image generation by using $\mathbf{h}$ (real text feature) instead of $\mathbf{h}'$ (pseudo text feature).
It is shown in recent works [liu2020style, wu2021stylespace] that the StyleSpace of StyleGAN2 is a well-disentangled intermediate feature space, whose dimensions are highly independent. By leveraging this property, we propose a simple yet effective approach to enable conditional generation: injecting new conditional information directly into the StyleSpace, as illustrated in Figure 3. Specifically, we choose to inject text information as follows. Random noise vectors $\mathbf{z} \in \mathcal{Z}$ are transformed into an intermediate latent space $\mathcal{W}$ via a so-called mapping network, which consists of a sequence of FC layers. The $\mathcal{W}$ space is claimed to better reflect the disentangled nature of the learned distribution. Each $\mathbf{w} \in \mathcal{W}$ is further transformed to channel-wise unconditional style codes $\mathbf{s}$, using a different learned affine transformation for each layer of the generator. The space spanned by these style parameters is often referred to as StyleSpace, or $\mathcal{S}$. A conditional vector $\mathbf{h}'$ from the image-text joint semantic space of CLIP is transformed into condition codes $\mathbf{c}$, using a different learned 2-layer FC network for each generator layer. At each layer of the generator, we concatenate its style and condition codes to obtain $[\mathbf{s}, \mathbf{c}]$, which is further transformed to channel-wise conditional style codes $\mathbf{u}$, using a different learned affine transformation for each generator layer. We refer to the space spanned by these style parameters as Conditional StyleSpace, or $\mathcal{U}$. In sum, the generator $G$ synthesizes a fake image as:

$$\mathbf{x}' = G(\mathbf{h}', \mathbf{z}). \tag{3}$$
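The per-layer computation of a conditional style code can be sketched with toy dimensions. This is a hedged illustration of the data flow only (affine map to a style code, 2-layer FC network to a condition code, concatenation, then another affine map); the parameter names in `params`, the ReLU activation between the two FC layers, and all weight values are assumptions made for this example and do not come from the paper.

```python
def linear(weight, bias, x):
    """Affine map: weight is a list of rows, bias a list, x the input vector."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weight, bias)]


def conditional_style(w_latent, cond_feat, params):
    """Compute the conditional style code of one generator layer."""
    # Unconditional style code from the intermediate latent vector.
    style = linear(params["style_w"], params["style_b"], w_latent)
    # Condition code from the (pseudo) text feature via a 2-layer FC network.
    hidden = [max(0.0, v) for v in linear(params["fc1_w"], params["fc1_b"], cond_feat)]
    cond = linear(params["fc2_w"], params["fc2_b"], hidden)
    # Concatenate both codes and map them to the conditional style code.
    return linear(params["cond_w"], params["cond_b"], style + cond)


# Toy, hand-picked parameters (hypothetical values for illustration).
params = {
    "style_w": [[1.0, 0.0], [0.0, 1.0]], "style_b": [0.0, 0.0],
    "fc1_w": [[1.0, 0.0], [0.0, 1.0]], "fc1_b": [0.0, 0.0],
    "fc2_w": [[2.0, 0.0]], "fc2_b": [0.0],
    "cond_w": [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]], "cond_b": [0.0, 0.0],
}
u = conditional_style([1.0, 2.0], [1.0, -1.0], params)
print(u)
```

In a real generator each layer would hold its own copy of these parameters, and the output dimension would match the number of channels modulated at that layer.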
In the text-to-image task, the discriminator ensures that the generated image satisfies two criteria: photo-realism to human perception and fidelity to the text condition. To this end, we encode the input image $\mathbf{x}$ with a shared discriminator backbone, then perform two tasks (each with a task-specific FC layer), as illustrated in Figure 4. (i) $f_d(\mathbf{x})$ projects $\mathbf{x}$ into a scalar, indicating how real or fake the input image $\mathbf{x}$ is. This is a common task shared by all GAN models. (ii) $f_s(\mathbf{x})$ embeds $\mathbf{x}$ into a semantic space, which is expected to be similar to the semantic space of CLIP. We compute the inner product $\langle \mathbf{h}', f_s(\mathbf{x}) \rangle$ to indicate how well the input image $\mathbf{x}$ is semantically aligned with, or conditioned on, the pseudo text feature. In summary, the discriminator output is defined as:

$$D(\mathbf{x}, \mathbf{h}') = \underbrace{f_d(\mathbf{x})}_{\text{true or fake}} + \underbrace{\langle \mathbf{h}', f_s(\mathbf{x}) \rangle}_{\text{semantic alignment}}. \tag{4}$$

Intuitively, $D(\mathbf{x}, \mathbf{h}')$ yields a high value for an image $\mathbf{x}$ when it is real (with large $f_d(\mathbf{x})$ values) and the semantic similarity between $\mathbf{h}'$ and $f_s(\mathbf{x})$ is high.
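The two-headed discriminator output can be sketched as a realism score plus an inner product. The snippet below is a toy illustration, assuming that both heads are single FC layers applied to a shared backbone feature as described above; the function and argument names and all numeric values are hypothetical.

```python
def linear(weight, bias, x):
    """Affine map: weight is a list of rows, bias a list, x the input vector."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weight, bias)]


def discriminator_logit(backbone_feat, cond_feat, fd_w, fd_b, fs_w, fs_b):
    """Sum of a real/fake score and a semantic-alignment score."""
    # Head 1: project the backbone feature to a single real/fake scalar.
    realism = linear([fd_w], [fd_b], backbone_feat)[0]
    # Head 2: embed the backbone feature into the semantic space.
    semantic = linear(fs_w, fs_b, backbone_feat)
    # Inner product with the (pseudo) text feature measures alignment.
    alignment = sum(h * s for h, s in zip(cond_feat, semantic))
    return realism + alignment


logit = discriminator_logit(
    backbone_feat=[1.0, 2.0],
    cond_feat=[0.0, 1.0, 0.0],
    fd_w=[0.5, -0.25], fd_b=0.1,
    fs_w=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], fs_b=[0.0, 0.0, 0.0],
)
print(logit)
```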
3.3 Training Objectives
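For a mini-batch of $n$ images $\{\mathbf{x}_i\}_{i=1}^{n}$, $\mathbf{h}'_i$ is the corresponding generated pseudo text feature for the $i$-th image. Our model is trained in an adversarial manner, with additional contrastive losses to ensure that the GAN feature space is aligned with pre-trained CLIP. The first one is the standard conditional GAN loss. The losses for the generator and discriminator are defined, with the logits from (4), as:

$$\mathcal{L}_G = -\sum_{i=1}^{n} \log \sigma\big(D(\mathbf{x}'_i, \mathbf{h}'_i)\big), \qquad \mathcal{L}_D = -\sum_{i=1}^{n} \log \sigma\big(D(\mathbf{x}_i, \mathbf{h}'_i)\big) - \sum_{i=1}^{n} \log\big(1 - \sigma(D(\mathbf{x}'_i, \mathbf{h}'_i))\big), \tag{5}$$

where $\sigma(\cdot)$ denotes the Sigmoid function.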
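A minimal sketch of the conditional GAN losses, assuming the standard non-saturating logistic form applied to discriminator logits (sums over the mini-batch, with `log(1 - sigmoid(x))` computed as `log_sigmoid(-x)`). The function names are hypothetical, and the logits would come from the discriminator in practice.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def generator_loss(fake_logits):
    """Generator wants fake images to receive high logits."""
    return -sum(log_sigmoid(l) for l in fake_logits)


def discriminator_loss(real_logits, fake_logits):
    """Discriminator wants high logits on real and low logits on fake images."""
    real_term = -sum(log_sigmoid(l) for l in real_logits)
    fake_term = -sum(log_sigmoid(-l) for l in fake_logits)  # log(1 - sigmoid(l))
    return real_term + fake_term


print(generator_loss([0.0, 0.0]))
print(discriminator_loss([3.0], [-3.0]))
```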
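To enforce that the discriminator-extracted feature $f_s(\mathbf{x})$ is semantically aligned in the pre-trained CLIP feature space, we consider the following contrastive regularizer for the discriminator:

$$\mathcal{L}_{\text{ConD}} = -\tau \sum_{i=1}^{n} \log \frac{\exp\big(\text{Sim}(f_s(\mathbf{x}_i), \mathbf{h}'_i) / \tau\big)}{\sum_{j=1}^{n} \exp\big(\text{Sim}(f_s(\mathbf{x}_j), \mathbf{h}'_i) / \tau\big)}, \tag{6}$$

where $\text{Sim}$ denotes the cosine similarity and $\tau$ is a non-negative hyper-parameter. Intuitively, $\mathcal{L}_{\text{ConD}}$ enforces the discriminator to output an image feature $f_s(\mathbf{x}_i)$ that is similar to the corresponding text feature $\mathbf{h}'_i$.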
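We further utilize the pre-trained CLIP model to improve the semantic correspondence between the generated image $\mathbf{x}'_i$ and its conditioning pseudo text feature $\mathbf{h}'_i$. We define the following contrastive loss for the generator with the same hyper-parameter $\tau$ as in (6):

$$\mathcal{L}_{\text{ConG}} = -\tau \sum_{i=1}^{n} \log \frac{\exp\big(\text{Sim}(f_{\text{img}}(\mathbf{x}'_i), \mathbf{h}'_i) / \tau\big)}{\sum_{j=1}^{n} \exp\big(\text{Sim}(f_{\text{img}}(\mathbf{x}'_j), \mathbf{h}'_i) / \tau\big)}. \tag{7}$$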
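Both contrastive terms compare a batch of image-side features against a batch of text-side features, so one helper can illustrate them. The sketch below assumes an InfoNCE-style form: for each text feature, a softmax over temperature-scaled cosine similarities to all image-side features in the batch, with the matching index as the target and the sum scaled by the temperature. The function names, the toy 2-dimensional features and the value `tau=0.5` are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(image_feats, text_feats, tau):
    """Temperature-scaled contrastive loss over a mini-batch.

    image_feats[i] is expected to match text_feats[i]; every other feature
    in the batch acts as a negative for that text feature.
    """
    total = 0.0
    for i, h in enumerate(text_feats):
        logits = [cosine(f, h) / tau for f in image_feats]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[i] - log_denominator
    return -tau * total


feats = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(feats, [[1.0, 0.0], [0.0, 1.0]], tau=0.5)
mismatched = contrastive_loss(feats, [[0.0, 1.0], [1.0, 0.0]], tau=0.5)
print(matched < mismatched)
```

For the discriminator regularizer the image-side features would be the discriminator's semantic embeddings of real images; for the generator term they would be CLIP image features of generated images.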
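With the above contrastive regularizers, the final training losses for the generator and discriminator are defined as:

$$\mathcal{L}'_D = \mathcal{L}_D + \gamma \mathcal{L}_{\text{ConD}}, \qquad \mathcal{L}'_G = \mathcal{L}_G + \gamma \mathcal{L}_{\text{ConD}} + \lambda \mathcal{L}_{\text{ConG}}, \tag{8}$$

where the hyper-parameters $\tau$, $\lambda$ and $\gamma$ are set separately for the language-free and fully-supervised settings (details about hyper-parameter tuning are provided in the Appendix).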
3.4 Training Details
We summarize the language-free training schedule of Lafite in Algorithm 1. For the settings with full image-text pairs, one may replace the pseudo text-feature generation step with the ground-truth text feature $\mathbf{h} = f_{\text{txt}}(\mathbf{t})$.
To demonstrate the zero-shot task transfer ability of our model, we also consider a variant that is pre-trained on the Google Conceptual Captions 3M (CC3M) dataset [Sharma2018ConceptualCA], which consists of 3.3 million image-text pairs. For pseudo text-feature generation with trainable perturbation, we also train its inference model on CC3M. There is no image overlap between the pre-training and downstream datasets, which ensures fairness when comparing our method against others in transfer learning. For the face domain, we pre-train a model on the FFHQ dataset [karras2019style], which contains 70,000 images. The pre-trained models can be fine-tuned with Lafite under the language-free setting on different datasets, which will be discussed in the experiment section.
In practice, we also consider image data augmentation to improve the extracted image features $f_{\text{img}}(\mathbf{x})$ in (1). We use random cropping and avoid augmentations such as color transformation, because they may lead to a mismatch between the generated text feature $\mathbf{h}'$ and the associated image $\mathbf{x}$. The details are summarized in the Appendix.
As the proposed Lafite is a versatile system, we conduct experiments under different settings, including the proposed language-free setting, as well as the zero-shot and fully-supervised text-to-image generation settings. Because of the two different schemes for generating pseudo text features described in Section 3.1, we denote our system in two variants: fixed perturbations as Lafite$_{\text{G}}$ and trainable perturbations as Lafite$_{\text{NN}}$, respectively. All of our experiments are conducted on 4 Nvidia Tesla V100 GPUs and implemented using PyTorch [paszke2019pytorch]. CLIP-ViT/B-32 is used in our methods unless specified otherwise. All code and pre-trained models will be made publicly available upon acceptance.
We consider a suite of datasets that are commonly used in literature [xu2018attngan, zhu2019dm, zhang2021crossmodal, ye2021improving], including MS-COCO [cho2014learning], CUB [WahCUB_200_2011], LN-COCO [pont2020connecting], Multi-modal CelebA-HQ (MM CelebA-HQ) [xia2021tedigan]. All the images are scaled to resolution . Statistics of these datasets are summarized in Table 7 in the Appendix.
Following [ramesh2021zero, ding2021cogview], we report the blurred Fréchet Inception Distance (FID) [heusel2017gans] and Inception Score (IS) [salimans2016improved] on the MS-COCO dataset, which are computed using 30,000 generated images with randomly sampled text from the validation set. FID-$k$ means the FID is computed after blurring all the images with a Gaussian filter of radius $k$.
4.1 Language-free Text-to-image Generation
We first study Lafite under the proposed language-free setting, in which only images are provided in a given domain, and no paired caption is available during training.
As a baseline, we employed the SoTA image captioning model VinVL [zhang2021vinvl] to generate some associated captions for images. Note that MS-COCO image-text pairs were used to train the author-provided VinVL image captioning model, so the MS-COCO comparison is unfairly biased in favor of the baseline due to this information leakage. We compare this baseline method with our Lafite using the same network architecture and hyper-parameter setting for fairness.
The main results are in Table 1. Both variants of our Lafite significantly outperform the captioning-based baseline method. The simple Lafite$_{\text{G}}$ performs best on this dataset, indicating the generality of the method. For Lafite$_{\text{NN}}$, note that CC3M is used to train the inference model, so there is no information leakage in this method when we test on the MS-COCO dataset. Some generated examples are provided in Figure 5, from which we can see that our Lafite leads to text-aligned generation even though no text data is used during training, verifying the effectiveness of the proposed method.
Furthermore, we can perform generation conditioned on images: for a given image, we generate an image-conditioned pseudo text feature vector with Lafite. Passing this pseudo text feature vector to $G$ leads to generated images that are similar to the given image. Consequently, Lafite enables image generation with multi-modal conditions, i.e., it can be conditioned on both image and text simultaneously. The implementation details are discussed in the Appendix. Some generated examples are provided in Figure 6; more results are provided in the Appendix.
4.2 Zero-Shot Text-to-image Generation
Zero-shot is a setting that evaluates a pre-trained text-to-image generation model without training the model on any downstream data. The MS-COCO dataset is used for evaluating our model pre-trained on CC3M. The main results are shown in Table 2. Compared to DALL-E [ramesh2021zero] and CogView [ding2021cogview], Lafite achieves better quantitative results in most cases. We also emphasize that our model has only 75 million trainable parameters, while DALL-E has over 12 billion parameters. Arguably, our pre-training dataset CC3M is much smaller than the pre-training dataset used in DALL-E, which contains 250 million image-text pairs, though we acknowledge that Lafite is based on an off-the-shelf discriminative model, CLIP, which is trained on 400 million image-text pairs.
4.3 Standard Text-to-image Generation
We now consider the standard text-to-image generation task, where all the ground-truth image-text pairs are provided during training. We compare Lafite against a series of competitive systems: AttnGAN [xu2018attngan], Obj-GAN [li2019object], DM-GAN [zhu2019dm], OP-GAN, DF-GAN [tao2021dfgan] and XMC-GAN [zhang2021crossmodal]. The main results evaluated by FID and IS on different datasets are provided in Table 3. We also report the Semantic Object Accuracy (SOA) on MS-COCO following previous works [9184960, zhang2021crossmodal]. Results of competing models are cited directly from the corresponding papers. Our proposed model consistently outperforms all other methods, setting new SoTA results in standard text-to-image generation.
4.4 Adaptation of Pre-trained Models
Language-free model fine-tuning.
Compared with existing works, one key advantage of the pre-trained Lafite model is that it naturally enables language-free model fine-tuning. The results are provided in Table 4, where both Lafite$_{\text{G}}$ and Lafite$_{\text{NN}}$ are investigated on different datasets. We see that fine-tuning from the pre-trained model generally outperforms training from scratch. We also notice that the performance of pre-trained Lafite largely depends on the domain gap between the pre-training and fine-tuning datasets. For example, Lafite$_{\text{NN}}$ sometimes obtains worse results than Lafite$_{\text{G}}$, especially when the fine-tuning dataset is dissimilar to CC3M, i.e., CUB and MM CelebA-HQ. This indicates that the inference model used for generating text features may be biased, because it may over-fit to its training dataset CC3M.
Pre-trained Lafite is also highly training-efficient. For example, training from scratch with Lafite on MS-COCO dataset requires around 4 days to reach FID of 18, while fine-tuning only needs 3 hours. This becomes a critical advantage especially when we require several text-to-image generation models across different datasets.
Adaptation of pre-trained Lafite is sample-efficient. One interesting question is: what percentage of image-text pairs do we need to outperform the previous SoTA, XMC-GAN, on the MS-COCO dataset? To answer this question, we conduct an experiment in which only a portion of the images are associated with ground-truth text. Our model is first pre-trained using all the images under the language-free setting, then fine-tuned with varying percentages of image-text pairs. The main results are summarized in Figure 7. Our method outperforms XMC-GAN on both IS and FID when less than half of the image-text pairs are employed.
4.5 Ablation Study
Ablation study of training objectives
We first investigate the impact of each component in our objective functions. The standard generator and discriminator losses are always employed; we ablate by excluding $\mathcal{L}_{\text{ConG}}$ and $\mathcal{L}_{\text{ConD}}$ one by one. The results are provided in Table 5. For both variants of Lafite, we observe that the model performance can drop significantly.
Ablations of pre-trained text/image encoders
To demonstrate the importance of using a multi-modal feature-aligned pre-trained model in our Lafite, we compare the CLIP model with single-modality models. We adopt the popular RoBERTa [liu2019roberta] as the baseline text encoder, which was trained on a large text corpus only. Note that it is infeasible to perform language-free training without the joint feature space. Thus this experiment is based on the fully-supervised text-to-image generation setting. For a fair comparison, we also report the results of using only the text encoder of CLIP while discarding the image encoder. In this setting, there is no image encoder, so the $\mathcal{L}_{\text{ConG}}$ term is removed from the objective function. The results are reported in Table 6. As expected, even if the image encoder of CLIP is not used, models with only the CLIP text encoder still significantly outperform models using RoBERTa. From the results, we can conclude that: (i) the feature space of CLIP is semantically meaningful for text-to-image generation, so using only the text encoder of CLIP still leads to better results than RoBERTa; (ii) text-to-image generation results can be improved by using a feature-aligned joint feature space (CLIP vs. others), and can be further improved with a stronger joint space (CLIP-ViT/B-16 outperforms CLIP-ViT/B-32, where ViT/B-16 and ViT/B-32 are different designs of visual transformers [dosovitskiy2020image]).
We have presented Lafite, an approach to building text-to-image generation systems without domain-specific image-text pairs in training. We achieve this goal by generating pseudo text features from images. Excellent performance in a variety of text-to-image generation tasks, including the language-free, zero-shot and fully-supervised settings, has demonstrated the effectiveness of Lafite. In particular, Lafite sets a new SoTA in the zero-shot setting, with only 1% of the trainable parameter count of recent advances such as DALL-E/CogView. Lafite also outperforms prior art in the fully-supervised settings. We believe that language-free training is a promising direction to enable broader application areas for text-to-image generation, as it significantly lowers the burden of data collection. One interesting future direction is to explore image synthesis in the wild, where long-tail and open-set conditions are provided for generation.
Appendix A
A.1 Theoretical Results
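For a given threshold $c > 0$, the generated text feature $\mathbf{h}'_i$ by Lafite$_{\text{G}}$ satisfies $\text{Sim}(f_{\text{img}}(\mathbf{x}_i), \mathbf{h}'_i) \geq c$ with probability at least
where $d$ is the dimension number of features, and $\Gamma(\cdot)$ is the Gamma function.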
Without loss of generality, we omit the subscript $i$ for clarity.
Denote , then we have
By the cumulative distribution function (CDF) of inner product of random vectors on sphere[cho2009inner], we know that
where is the dimension number of features, is the Gamma function. Thus we have
which completes the proof. ∎
A.2 Experiment Details
The statistics of datasets are summarized in Table 7.
Image feature extraction
In practice, we use random cropping as data augmentation when we extract the image features, as presented in Algorithm 2. The pseudo text features are generated by perturbing the average feature of the augmented samples. In our implementation, we pre-process the image samples to obtain their augmented features before the training stage, so the augmentation does not increase the training time. Note that in the contrastive loss (7), we also apply random cropping, with a limited number of crops because a large number would slow down training.
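The crop-and-average step can be sketched as below. This is a toy illustration under stated assumptions: images are 2D lists of floats, `toy_encoder` is a hypothetical stand-in for the CLIP image encoder, and the number of crops `k` and the crop size are arbitrary example values, not the paper's settings.

```python
import random


def random_crop(image, size, rng):
    """Return a random size x size crop of a 2D image (list of rows)."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def averaged_crop_feature(image, encoder, k, size, rng):
    """Encode k random crops and average the resulting feature vectors."""
    feats = [encoder(random_crop(image, size, rng)) for _ in range(k)]
    return [sum(column) / k for column in zip(*feats)]


def toy_encoder(crop):
    """Hypothetical encoder: returns [mean pixel, max pixel] of the crop."""
    pixels = [p for row in crop for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


rng = random.Random(0)
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
feature = averaged_crop_feature(image, toy_encoder, k=8, size=2, rng=rng)
print(feature)
```

Because the averaged features depend only on the images, they can be computed once and cached before training, which is consistent with the pre-processing described above.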
The hyper-parameters are selected based on the performance on MS-COCO dataset. Specifically, is selected from , are selected from .
A.3 More Results
We provide the implementation details of image generation with multi-modal conditions, and more generated examples under language-free setting.
To generate an image conditioned on both a reference image and a text description, we first extract the text feature from the given text and the pseudo text feature from the image. Both features are then fed into the pre-trained generator, leading to two conditional style codes, one per feature. We construct a new conditional style code whose elements are randomly selected from the corresponding elements of either of the two codes. The new conditional style code is then fed into the generator to generate the desired image.
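The element-wise mixing of two conditional style codes can be sketched as follows. This is only an illustration: the function name and the selection probability `p_text` are assumptions made here, since the text above states only that elements are randomly selected from one code or the other.

```python
import random


def mix_style_codes(code_from_text, code_from_image, rng, p_text=0.5):
    """Build a new style code by picking each element from one of two codes.

    p_text is the (assumed) probability of taking an element from the
    text-conditioned code; otherwise the image-conditioned element is used.
    """
    return [
        t if rng.random() < p_text else i
        for t, i in zip(code_from_text, code_from_image)
    ]


rng = random.Random(0)
code_text = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
code_image = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
mixed = mix_style_codes(code_text, code_image, rng)
print(mixed)
```

Mixing at the level of individual style-code elements relies on the disentanglement of the style space noted in Section 3.2: each channel can be taken from either condition without the others needing to change.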
Note that generation conditioned on an image is not reconstruction. Thus, when only a reference image is provided, the generated image may differ from the given image. However, the two will share some visible characteristics that are semantically meaningful, as illustrated in our examples.
Some text-to-image generation results on CUB, MS-COCO, MM CelebA-HQ, LN-COCO are provided in the following figures.