
LAFITE: Towards Language-Free Training for Text-to-Image Generation

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied in fine-tuning pre-trained models, which saves both training time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data size of the recently proposed large DALL-E model.



1 Introduction

Automatic synthesis of realistic images from arbitrary text descriptions is one of the core aspirations in artificial intelligence. Most existing works achieve this goal by consuming a large number of high-quality image-text pairs [xu2018attngan, zhu2019dm, zhang2021crossmodal, ramesh2021zero, ding2021cogview], which, however, often requires a heavy workload of precise human captioning and filtering. For instance, MS-COCO [lin2014microsoft], the most commonly used dataset in text-to-image generation tasks, required over 70,000 worker hours to gather and annotate the captions. Even a less curated dataset such as Google Conceptual Captions [Sharma2018ConceptualCA] consists of 3.3 million image-text pairs that are heavily filtered from 5 billion images from around 1 billion English webpages. In practice, for a customized domain, it is infeasible to collect such a large number of image-text pairs for model training, due to the high cost of human captioning and filtering. This challenge underscores the importance of zero-shot text-to-image generation tasks, where no domain-specific image-text pairs are used to train a model to generate images in a given domain.

Recently, several attempts have been made to tackle the zero-shot text-to-image generation problem by pre-training giant generative models on web-scale image-text pairs, such as DALL-E [ramesh2021zero] and CogView [ding2021cogview]. Both are auto-regressive Transformer models built for zero-shot text-to-image generation, as they can generate corresponding images given arbitrary text descriptions without training on domain-specific datasets. However, to ensure good performance, these models require data collection, model size and model training at a gigantic scale. Specifically, DALL-E contains over 12 billion parameters and is trained on a dataset consisting of 250 million image-text pairs; CogView is a model with 4 billion parameters trained on 30 million image-text pairs. For this reason, hundreds of GPUs are required to train these models, which significantly increases the carbon footprint and decreases inclusivity, making it extremely difficult for more researchers to participate in the study of this topic.

It is therefore desirable to provide affordable solutions for building text-to-image generation models in settings with limited image-text pair data, by reducing the requirements on model size, data collection and model training. In terms of data collection, in the ideal scenario, the language-free setting is probably the minimal and cheapest requirement, where only image data is provided. This is important because collecting only image data is much easier than constructing high-quality image-text pairs, given the ample domain-specific image datasets available online.

Figure 1: Model size vs. performance of zero-shot text-to-image generation on the COCO dataset. Lafite has a much smaller model size, especially when considering trainable parameters (left), but shows a higher Inception Score (middle) and a lower FID (right). Please refer to Section 4 for details.

To this end, we propose Lafite (LAnguage-Free traIning for Text-to-image gEneration), a generative adversarial approach that significantly lowers the cost barrier to building efficient text-to-image generation models, based on the pre-trained CLIP model [radford2021learning]. Specifically, we take advantage of CLIP's image-text feature alignment in the joint semantic space to construct pseudo image-text feature pairs, and we propose a text-to-image GAN (Generative Adversarial Network) model [goodfellow2014generative] that can effectively leverage these pseudo image-text feature pairs. Our major contributions can be summarized as follows:

  • We propose Lafite, a versatile system that works effectively in a large range of text-to-image generation settings, including language-free, zero-shot and fully-supervised learning.

  • To the best of our knowledge, Lafite is the first work that enables language-free training for the text-to-image generation task. We propose two novel schemes to construct pseudo image-text feature pairs, and conduct a comprehensive study of the new setting. The effectiveness is validated with quantitative results on several datasets with different training schemes (training from scratch and fine-tuning from pre-trained generative models).

  • In zero-shot text-to-image generation settings, Lafite outperforms the prior art DALL-E and CogView on the COCO benchmark, with less than 1% of the trainable model parameter size (with frozen CLIP model weights). Please see Figure 1 for comparisons.

  • In the standard fully-supervised settings, Lafite outperforms several state-of-the-art (SoTA) methods by a large margin. Surprisingly, even our language-free model performs better than most existing models that are trained with full image-text pairs.

2 Related Work

Text-to-image generation

Existing models for text-to-image generation can be categorized into two classes: fully-supervised text-to-image generation [xu2018attngan, zhu2019dm, zhang2021crossmodal] and zero-shot text-to-image generation [ramesh2021zero, ding2021cogview]. The SoTA in the full image-text pair setting is still dominated by GAN variants [xu2018attngan, zhu2019dm, zhang2021crossmodal]. GANs [goodfellow2014generative] have inspired many advances in image synthesis [mirza2014conditional, karras2017progressive, liu2017unsupervised, li2017alice, karras2019analyzing]. For text-to-image synthesis, improved model performance often benefits from large generative adversarial image models [zhang2021crossmodal] and pre-trained text encoders [liu2019roberta]. Recently, excellent zero-shot text-to-image generation performance has been achieved by DALL-E [ramesh2021zero] and CogView [ding2021cogview]. The basic idea is to encode images into discrete latent tokens using VQ-VAE [van2017neural, razavi2019generating], and to pre-train a huge auto-regressive Transformer [vaswani2017attention] to predict these discrete tokens based on paired text sequences. Our Lafite is the first generative adversarial approach that achieves SoTA on zero-shot generation.

Multi-modal feature learning

Learning a joint and aligned feature space for vision-and-language has been a long-standing problem in artificial intelligence [weston2010large, socher2010connecting]. Inspired by the BERT model [devlin2018bert], a number of methods attempt to learn generic multi-modal fusion layers, given pre-extracted visual region features and a textual encoder [lu2019vilbert, li2020oscar, su2019vl, zhang2021vinvl, kim2021vilt, li2021align]. These works aim at learning generic multi-modal representations for downstream tasks such as visual question answering [antol2015vqa, hudson2019gqa], image captioning [lin2014microsoft, agrawal2019nocaps] and visual commonsense reasoning [zellers2019recognition]. Unlike the aforementioned works, another line of work focuses on learning visual representations from natural language supervision, including both generative [desai2021virtex] and discriminative [wang2016learning, wang2018learning, zhang2020contrastive] methods. The latter learn an aligned visual-semantic space. This idea has recently been scaled up in CLIP/ALIGN [radford2021learning, jia2021scaling], which pave the way toward building a universal image-text representation space. Our Lafite is built on this universal space, and is the first to leverage its multi-modal alignment property for language-free text-to-image generation.

CLIP for generation/manipulation.

The idea of a multi-modal feature space has also inspired some recent works on generative models [galatolo2021generating, patashnik2021styleclip, gal2021stylegan, pakhomov2021segmentation]. All of these works are related to ours in that the pre-trained CLIP model and StyleGAN2 are employed. Our Lafite differs in two aspects. (i) The motivations and scenarios are different. Existing works focus on latent optimization [galatolo2021generating], image manipulation [patashnik2021styleclip], domain adaptation [gal2021stylegan] and image segmentation [pakhomov2021segmentation]. We present the first study on training text-to-image generation models without the requirement of paired captions. (ii) The techniques are different. Though the image-text feature alignment property is leveraged in all of these works, our Lafite is the only one to generate pseudo text features in the joint multi-modal space.

3 Lafite: A Language-Free Paradigm

A natural idea to avoid human captioning when constructing image-text pair training data is to use an off-the-shelf image captioning model that can automatically generate captions for the collected training images. However, this is especially challenging due to the lack of a universal captioning model that can (i) bridge the modality gap between text and image to generate high-quality captions and (ii) generalize to diverse image domains with large domain gaps. In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.

Throughout the paper, $(x, t)$ denotes an image-text pair, and $x'$ is the corresponding generated image of $t$. $G$ and $D$ denote the generator and discriminator, respectively. We use $f_{\text{txt}}$ and $f_{\text{img}}$ to denote the pre-trained text encoder and image encoder, which map text descriptions and image samples into a joint multi-modal feature space. $h = f_{\text{txt}}(t)$ denotes the real text feature, and $z \sim \mathcal{N}(0, I)$ denotes latent noise sampled from the standard Gaussian distribution, serving as one input of the generator. Our idea to achieve language-free training is to generate pseudo text features $h'$, which aim to approximate $h$, by leveraging the image-text feature alignment of a pre-trained model. The generated features are then fed into the text-to-image generator to synthesize the corresponding images. Without loss of generality, we denote the mapping from input data to the multi-modal feature space as a translator $T$ in two settings. If only images $x$ are provided (i.e., the language-free setting), we consider a pseudo text-feature generation process $T: x \to h'$; if image-text pairs $(x, t)$ are provided (i.e., the standard fully-supervised setting), we encode the ground-truth text, $T: t \to h$.

3.1 Pseudo Text-Feature Generation

To achieve this goal, a universal multimodal feature space is desired, where features of paired texts and images are well aligned. Recent vision-and-language models such as CLIP and ALIGN achieve this by pre-training on hundreds or thousands of millions of image-text pairs using contrastive learning. The cosine similarity between matched image-text features is maximized, while the cosine similarity of mis-matched pairs is minimized. This naturally provides a high-dimensional hyper-sphere for the multimodal features (in our implementation, we normalize the features extracted with CLIP by their L2 norm), where paired image-text features should be close to each other, with a small angle between their feature vectors.

Figure 2: Illustration that the generated pseudo text feature vector $h'$ (blue dashed arrow) should have high cosine similarity with the image feature $f_{\text{img}}(x)$ (red solid arrow), i.e., $\text{Sim}(h', f_{\text{img}}(x)) \geq c$.

This inspires us to explore the potential of generating pseudo text features $h' \in \mathcal{H}(x)$ for a given image $x$ on this hyper-sphere: $\mathcal{H}(x) = \{h' \mid \text{Sim}(h', f_{\text{img}}(x)) \geq c\}$, where $\text{Sim}$ denotes cosine similarity and $c > 0$ is a threshold. This idea is illustrated in Figure 2. Based on this analysis, we consider two schemes to generate pseudo text features.

Fixed perturbations

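To generate the pseudo text feature $h'$, we propose to perturb the image feature $f_{\text{img}}(x)$ with adaptive Gaussian noise:

$$h' = \tilde{h} / \|\tilde{h}\|_2, \qquad \tilde{h} = f_{\text{img}}(x) + \xi \, \epsilon \, \|f_{\text{img}}(x)\|_2 / \|\epsilon\|_2, \tag{1}$$

where $\epsilon \sim \mathcal{N}(0, I)$ is the Gaussian noise, $\xi > 0$ is a fixed hyper-parameter representing the level of perturbations, and $\|\cdot\|_2$ denotes the L2 norm. The added Gaussian noise is adaptive in the sense that it is normalized to a hyper-sphere, then re-scaled by the norm of the image feature. We can prove that, with the adaptive noise, this fixed-perturbation scheme generates a pseudo text feature whose cosine similarity with the image feature reaches a given threshold $c$ with a high probability, which depends on $\xi$, $c$ and the feature dimension $d$. The formal theorem and its proof are provided in the Appendix.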
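The fixed perturbation scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the random 512-dimensional vector is only a stand-in for a CLIP image feature, and the value `xi=0.1` and the helper names are assumptions chosen for the example.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_similarity(u, v):
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))

def fixed_perturbation(image_feature, xi, rng):
    """Add Gaussian noise rescaled to xi * ||feature||, then L2-normalize."""
    eps = [rng.gauss(0.0, 1.0) for _ in image_feature]
    scale = xi * l2_norm(image_feature) / l2_norm(eps)
    h_tilde = [f + scale * e for f, e in zip(image_feature, eps)]
    norm = l2_norm(h_tilde)
    return [a / norm for a in h_tilde]

rng = random.Random(0)
# Stand-in for a CLIP image feature (assumption: any non-zero vector works here).
feature = [rng.gauss(0.0, 1.0) for _ in range(512)]
pseudo_text = fixed_perturbation(feature, xi=0.1, rng=rng)
sim = cosine_similarity(feature, pseudo_text)
```

Because the noise has norm exactly `xi` times the feature norm, simple geometry gives a worst-case cosine similarity of `sqrt(1 - xi**2)` whenever `xi < 1`; in high dimensions the sampled noise is almost orthogonal to the feature, so the similarity is typically higher than this bound.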

Trainable perturbations

It is natural to extend the fixed perturbation scheme to learn more adaptive noise instead of using a vanilla Gaussian. To this end, we propose to train an inference model which takes the image features as inputs and outputs the mean and variance of the desired noise distribution. Specifically, the inference model consists of two neural networks, $r_1(\cdot)$ and $r_2(\cdot)$. With the re-parameterization trick [kingma2013auto], the generation of pseudo text features is:

$$h' = \tilde{h} / \|\tilde{h}\|_2, \qquad \tilde{h} = f_{\text{img}}(x) + r_1(f_{\text{img}}(x)) + \epsilon \odot \exp(r_2(f_{\text{img}}(x))), \tag{2}$$

where $\exp$ denotes the element-wise exponent operation, $\odot$ denotes element-wise multiplication, and $\epsilon \sim \mathcal{N}(0, I)$ denotes noise sampled from the standard Gaussian. In practice, we construct $r_1(\cdot)$ and $r_2(\cdot)$ with 4 fully-connected (FC) layers each, and train them in a supervised way by maximizing the cosine similarity between generated text features and real text features.
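The re-parameterized sampling step can be illustrated as follows. This is a rough sketch under simplifying assumptions: each of the two networks is reduced to a single randomly initialized linear layer (the text uses 4 FC layers each), no training is performed, and the names `make_layer`, `linear` and the bias value `-3.0` are hypothetical choices that only keep the toy noise small.

```python
import math
import random

def linear(weights, bias, v):
    """One fully-connected layer; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def make_layer(dim, rng, bias_value=0.0, scale=0.01):
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
    return weights, [bias_value] * dim

def trainable_perturbation(image_feature, r1, r2, rng):
    mean = linear(*r1, image_feature)     # r1: predicted mean shift
    log_std = linear(*r2, image_feature)  # r2: predicted log-scale of the noise
    eps = [rng.gauss(0.0, 1.0) for _ in image_feature]
    h_tilde = [f + m + e * math.exp(s)
               for f, m, e, s in zip(image_feature, mean, eps, log_std)]
    norm = math.sqrt(sum(a * a for a in h_tilde))
    return [a / norm for a in h_tilde]

rng = random.Random(0)
dim = 16
r1 = make_layer(dim, rng)
r2 = make_layer(dim, rng, bias_value=-3.0)  # small initial noise scale (assumption)
feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]
pseudo_text = trainable_perturbation(feature, r1, r2, rng)
```

Sampling the noise outside the networks and scaling it by their outputs keeps the whole computation differentiable with respect to the network parameters, which is what allows the inference model to be trained with a cosine-similarity objective.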

Figure 3: The process of injecting text-conditional information into each layer of the generator, where FC denotes a fully-connected layer. The green modules have their own trainable parameters per generator layer. The original StyleGAN2 can be viewed as constructing its StyleSpace through the process from the latent noise $z$ to the style codes $s$. We propose to inject the semantic conditional information and further build our Conditional StyleSpace, whose elements are used to modulate image generation. This figure illustrates the language-free setting, where a real image $x$ is used to generate the pseudo text feature $h'$; for the fully-supervised text-to-image generation setting, real text $t$ is used for the extraction of the text feature $h$. Please refer to the definition of the translator in Section 3 for details.
Figure 4: Illustration of discriminator outputs and training objectives for the language-free setting. Panel (a) shows the discriminator output; panels (b) and (c) show the training objectives.


Both schemes have their own pros and cons. The trainable perturbation generally yields better performance than the fixed perturbation. However, the fixed perturbation is easier to use, without the requirement of training an inference model on an additional dataset with annotated image-text pairs. Further, the performance of trainable perturbation is influenced by the gap between datasets used in training the inference model and the generative model, as empirically verified in our ablation studies in the experiment section.

3.2 Network Architectures

We propose to adapt the unconditional StyleGAN2 to a conditional generative model for our goal. Note that although we discuss our model in a language-free setting, it can be directly generalized to standard text-to-image generation by using $h$ (real text feature) instead of $h'$ (pseudo text feature).


It is shown in recent works [liu2020style, wu2021stylespace] that the StyleSpace of StyleGAN2 is a well-disentangled intermediate feature space, whose dimensions are highly independent. By leveraging this property, we propose a simple yet effective approach to enable conditional generation: injecting new conditional information directly into the StyleSpace, as illustrated in Figure 3. Specifically, we choose to inject text information as follows. Random noise vectors $z \in \mathcal{Z}$ are transformed into an intermediate latent space $\mathcal{W}$ via a so-called mapping network, which consists of a sequence of FC layers. The $\mathcal{W}$ space is claimed to better reflect the disentangled nature of the learned distribution. Each $w \in \mathcal{W}$ is further transformed to channel-wise unconditional style codes $s$, using a different learned affine transformation for each layer of the generator. The space spanned by these style parameters is often referred to as StyleSpace, or $\mathcal{S}$. A conditional vector $h'$ from the image-text joint semantic space of CLIP is transformed into condition codes $c$, using a different learned 2-layer FC network for each generator layer. At each layer of the generator, we concatenate its style and condition codes to obtain $[s, c]$, which is further transformed to channel-wise conditional style codes $u$, using a different learned affine transformation for each generator layer. We refer to the space spanned by these style parameters as Conditional StyleSpace, or $\mathcal{U}$. In sum, the generator $G$ synthesizes a fake image as:

$$x' = G(h', z). \tag{3}$$
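The per-layer data flow of the Conditional StyleSpace (style code, condition code, concatenation, affine map) can be sketched with toy dimensions. This is only an illustration of the wiring: the layer sizes, the random initialization, the dictionary keys and the ReLU between the two condition FC layers are assumptions, and the convolutional synthesis layers that consume the resulting codes are omitted.

```python
import random

def affine(params, v):
    """Apply a fully-connected layer; params is (weight_rows, bias)."""
    weights, bias = params
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_fc(in_dim, out_dim, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def init_layer(w_dim, h_dim, cond_dim, channels, rng):
    """Parameters owned by ONE generator layer (hypothetical sizes)."""
    return {
        "style": init_fc(w_dim, channels, rng),                # w -> s
        "cond_fc1": init_fc(h_dim, cond_dim, rng),             # h' -> hidden
        "cond_fc2": init_fc(cond_dim, cond_dim, rng),          # hidden -> c
        "joint": init_fc(channels + cond_dim, channels, rng),  # [s, c] -> u
    }

def conditional_style_code(w, h_prime, layer):
    s = affine(layer["style"], w)
    c = affine(layer["cond_fc2"], relu(affine(layer["cond_fc1"], h_prime)))
    return affine(layer["joint"], s + c)  # list concatenation implements [s, c]

rng = random.Random(0)
layers = [init_layer(8, 6, 4, channels, rng) for channels in (16, 8)]
w = [rng.gauss(0.0, 1.0) for _ in range(8)]
h_prime = [rng.gauss(0.0, 1.0) for _ in range(6)]
codes = [conditional_style_code(w, h_prime, layer) for layer in layers]
```

Each generator layer owns its own set of parameters, so the same intermediate latent and the same conditional vector yield a different conditional style code per layer, with one entry per channel of that layer.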



In the text-to-image task, the discriminator ensures that the generated image satisfies two criteria: photo-realism to human perception and fidelity to the text condition. To this end, we encode the input image $x$ with a shared discriminator backbone, then perform two tasks (each with a task-specific FC layer), as illustrated in Figure 4. (i) $f_d(x)$ projects $x$ into a scalar, indicating how real or fake the input image $x$ is. This is a common task shared by all GAN models. (ii) $f_s(x)$ embeds $x$ into a semantic space, which is expected to be similar to the semantic space of CLIP. We compute the inner product $\langle h', f_s(x) \rangle$ to indicate how well the input image $x$ is semantically aligned/conditioned with the pseudo text feature $h'$. In summary, the discriminator output is defined as:

$$D(x, h') = f_d(x) + \langle h', f_s(x) \rangle. \tag{4}$$

Intuitively, $D(x, h')$ yields a high value for an image $x$ when it is real (with large $f_d(x)$ values) and the semantic similarity between $h'$ and $f_s(x)$ is high.
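The two-headed discriminator output described above (a realness scalar plus an inner product with the text feature) can be written compactly. The sketch below assumes the shared backbone has already produced a feature vector; the tiny hand-picked weights and the function name `discriminator_output` are illustrative only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def discriminator_output(backbone_feature, h_prime, fd_head, fs_head):
    """Realness score plus semantic alignment with the (pseudo) text feature."""
    fd_weights, fd_bias = fd_head
    fs_weights, fs_bias = fs_head
    realness = dot(fd_weights, backbone_feature) + fd_bias        # scalar head
    semantic = [dot(row, backbone_feature) + b                    # embedding head
                for row, b in zip(fs_weights, fs_bias)]
    return realness + dot(h_prime, semantic)

backbone_feature = [1.0, 2.0]                       # output of the shared backbone
fd_head = ([0.5, -0.25], 0.1)                       # task-specific FC layer -> scalar
fs_head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])    # task-specific FC layer -> vector
logit = discriminator_output(backbone_feature, [0.6, 0.8], fd_head, fs_head)
```

With these numbers the realness head contributes 0.1 and the inner product contributes 2.2, so the logit is 2.3; an image whose embedding points away from the text feature would lower the second term even if it looks realistic.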

3.3 Training Objectives

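For a mini-batch of $n$ images $\{x_i\}_{i=1}^{n}$, $h'_i$ is the corresponding generated pseudo text feature for the $i$-th image, and $x'_i$ is the image generated from it. Our model is trained in an adversarial manner, with additional contrastive losses to ensure that the GAN feature space is aligned with pre-trained CLIP. The first one is the standard conditional GAN loss. The losses for the generator and discriminator are defined, with the logits from (4), as:

$$\mathcal{L}_G = -\sum_{i=1}^{n} \log \sigma\big(D(x'_i, h'_i)\big),$$
$$\mathcal{L}_D = -\sum_{i=1}^{n} \log \sigma\big(D(x_i, h'_i)\big) - \sum_{i=1}^{n} \log\big(1 - \sigma(D(x'_i, h'_i))\big), \tag{5}$$

where $\sigma(\cdot)$ denotes the Sigmoid function.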
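As a small numerical illustration, the standard (non-saturating) conditional GAN losses can be computed directly from discriminator logits. The sketch assumes this standard form; the function names are hypothetical, and the log-sigmoid is written in a numerically stable way.

```python
import math

def log_sigmoid(x):
    """Stable log(sigmoid(x)) for both large positive and negative x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def generator_loss(fake_logits):
    # The generator wants the discriminator to score fake images as real.
    return -sum(log_sigmoid(l) for l in fake_logits)

def discriminator_loss(real_logits, fake_logits):
    # log(1 - sigmoid(x)) equals log_sigmoid(-x).
    real_term = -sum(log_sigmoid(l) for l in real_logits)
    fake_term = -sum(log_sigmoid(-l) for l in fake_logits)
    return real_term + fake_term

undecided = discriminator_loss([0.0], [0.0])   # discriminator has no opinion
confident = discriminator_loss([4.0], [-4.0])  # discriminator separates the two
```

A discriminator that outputs a logit of 0 for everything pays `2 * log(2)` per real/fake pair, and the loss shrinks toward zero as real logits grow and fake logits fall.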

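To enforce that the discriminator-extracted feature $f_s(x)$ is semantically aligned in the pre-trained CLIP feature space, we consider the following contrastive regularizer for the discriminator:

$$\mathcal{L}_{\text{ConD}} = -\tau \sum_{i=1}^{n} \log \frac{\exp\big(\text{Sim}(f_s(x_i), h'_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\text{Sim}(f_s(x_j), h'_i)/\tau\big)}, \tag{6}$$

where $\text{Sim}$ denotes the cosine similarity, and $\tau$ is a non-negative hyper-parameter. Intuitively, $\mathcal{L}_{\text{ConD}}$ enforces the discriminator to output an image feature $f_s(x_i)$ that is similar to the corresponding text feature $h'_i$.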
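A contrastive regularizer of this kind can be sketched as a temperature-scaled softmax cross-entropy over cosine similarities within a mini-batch. The form below is a common InfoNCE-style formulation and is assumed for illustration; the toy 2-dimensional features and the value `tau=0.5` are arbitrary choices for the example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(image_features, text_features, tau):
    """For each text feature, the matching image feature should win the softmax."""
    total = 0.0
    for i, h in enumerate(text_features):
        logits = [cosine(f, h) / tau for f in image_features]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += logits[i] - log_denominator
    return -tau * total

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]], tau=0.5)
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]], tau=0.5)
```

The same function covers both uses in the text: feeding it features from the discriminator's semantic head regularizes the discriminator, while feeding it CLIP image features of generated images gives the generator-side loss.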

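We further utilize the pre-trained CLIP model to improve the semantic correspondence between the generated image $x'_i$ and its conditioning pseudo text feature $h'_i$. We define the following contrastive loss for the generator with the same hyper-parameter $\tau$ as (6):

$$\mathcal{L}_{\text{ConG}} = -\tau \sum_{i=1}^{n} \log \frac{\exp\big(\text{Sim}(f_{\text{img}}(x'_i), h'_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\text{Sim}(f_{\text{img}}(x'_j), h'_i)/\tau\big)}. \tag{7}$$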


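With the above contrastive regularizers, the final training losses for the generator and discriminator are defined as:

$$\mathcal{L}'_D = \mathcal{L}_D + \gamma \mathcal{L}_{\text{ConD}}, \tag{8}$$
$$\mathcal{L}'_G = \mathcal{L}_G + \gamma \mathcal{L}_{\text{ConD}} + \lambda \mathcal{L}_{\text{ConG}}, \tag{9}$$

where $\mathcal{L}_G$ and $\mathcal{L}_D$ are the conditional GAN losses, $\mathcal{L}_{\text{ConD}}$ and $\mathcal{L}_{\text{ConG}}$ are the contrastive regularizers for the discriminator and the generator, and $\gamma$ and $\lambda$ are weighting hyper-parameters whose values differ between the language-free and fully-supervised settings. Details about hyper-parameter tuning are provided in the Appendix.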

3.4 Training Details

Algorithm 1 Language-free training of Lafite
Input: An image dataset $\{x_i\}$, pre-trained encoders $f_{\text{img}}$ and $f_{\text{txt}}$, hyper-parameters
while not converged do
    Sample a mini-batch of images $\{x_i\}_{i=1}^{n}$;
    Sample perturbation noise $\epsilon \sim \mathcal{N}(0, I)$;
    // Pseudo text feature generation
    Generate pseudo text features $\{h'_i\}_{i=1}^{n}$ according to (1) or (2);
    // Forward pass of G and D
    Sample latent noise $z \sim \mathcal{N}(0, I)$;
    Synthesize fake images $\{x'_i\}_{i=1}^{n}$ with G using (3);
    Feed real/fake images to D using (4);
    // Update G and D with gradient descent
    Update D with (8);
    Update G with (9);
end while

We summarize the language-free training schedule of Lafite in Algorithm 1. For the settings with full image-text pairs, one may replace the pseudo text feature generation step with the ground-truth text feature $h = f_{\text{txt}}(t)$.


To demonstrate the zero-shot task transfer ability of our model, we also consider a variant that is pre-trained on the Google Conceptual Captions 3M (CC3M) dataset [Sharma2018ConceptualCA], which consists of 3.3 million image-text pairs. For pseudo text-feature generation with trainable perturbation, we also train its inference model on CC3M. There is no image overlap between the pre-training and downstream datasets, which ensures fairness when comparing our method against others in transfer learning. For the face domain, we pre-trained a model on the FFHQ dataset [karras2019style], which contains 70,000 images. The pre-trained models can be fine-tuned with Lafite under the language-free setting on different datasets, as discussed in the experiment section.

Data augmentation.

In practice, we also consider image data augmentation to improve the extracted image features $f_{\text{img}}(x)$ in (1). We choose random cropping and avoid augmentations such as color transformation, because they may lead to a mismatch between the generated text feature $h'$ and the associated image $x$. The details are summarized in the Appendix.

4 Experiments

As the proposed Lafite is a versatile system, we conduct experiments under different settings, including the proposed language-free setting, as well as the zero-shot and fully-supervised text-to-image generation settings. Because there are two schemes to generate pseudo text features, as described in Section 3.1, we denote the two variants of our system as $\text{Lafite}_{G}$ (fixed perturbations) and $\text{Lafite}_{NN}$ (trainable perturbations), respectively. All of our experiments are conducted on 4 Nvidia Tesla V100 GPUs and implemented using PyTorch [paszke2019pytorch]. CLIP-ViT/B-32 is used in our methods unless otherwise specified. All code and pre-trained models will be made publicly available upon acceptance.


We consider a suite of datasets that are commonly used in the literature [xu2018attngan, zhu2019dm, zhang2021crossmodal, ye2021improving], including MS-COCO [lin2014microsoft], CUB [WahCUB_200_2011], LN-COCO [pont2020connecting] and Multi-modal CelebA-HQ (MM CelebA-HQ) [xia2021tedigan]. All the images are scaled to the same resolution. Statistics of these datasets are summarized in Table 7 in the Appendix.

Evaluation metrics.

Following [ramesh2021zero, ding2021cogview], we report the blurred Fréchet Inception Distance (FID) [heusel2017gans] and Inception Score (IS) [salimans2016improved] on the MS-COCO dataset, computed using 30,000 generated images with text randomly sampled from the validation set. FID-$k$ means that the FID is computed after blurring all the images with a Gaussian filter of radius $k$.
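For reference, the standard definition of FID [heusel2017gans] compares Gaussian fits of Inception features from real and generated images. The symbols below are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the feature mean and covariance of the real and generated sets, respectively.

```latex
\mathrm{FID} = \|\mu_r - \mu_g\|_2^2
  + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

Lower values indicate that the two feature distributions are closer. In the blurred variant, both image sets are passed through the same Gaussian filter before the statistics are computed.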

Figure 5: Language-free text-to-image generation examples on MS-COCO validation set.
Figure 6: Image generation with multi-modal conditions (conditioned on both image and text).

4.1 Language-free Text-to-image Generation

We first study Lafite under the proposed language-free setting, in which only images are provided in a given domain, and no paired caption is available during training.

Captioning-based baseline:

As a baseline, we employed the SoTA image captioning model VinVL [zhang2021vinvl] to generate some associated captions for images. Note that MS-COCO image-text pairs were used to train the author-provided VinVL image captioning model, so the MS-COCO comparison is unfairly biased in favor of the baseline due to this information leakage. We compare this baseline method with our Lafite using the same network architecture and hyper-parameter setting for fairness.

Model IS FID-0 FID-1 FID-2 FID-4 FID-8
Table 1: Results of language-free setting on MS-COCO dataset. ‘Cap’ indicates a text-to-image generation baseline method based on VinVL captioning.

The main results are in Table 1. Both variants of our Lafite significantly outperform the captioning-based baseline method. The simple fixed-perturbation variant $\text{Lafite}_{G}$ performs the best on this dataset, indicating the generality of the method. For the trainable-perturbation variant $\text{Lafite}_{NN}$, note that CC3M is used to train the inference model, so there is no information leakage in this method when we test on the MS-COCO dataset. Some generated examples are provided in Figure 5, from which we can see that our Lafite leads to text-aligned generation even though no text data is used during training, verifying the effectiveness of the proposed method.

Model IS FID-0 FID-1 FID-2 FID-4 FID-8
Table 2: Results of zero-shot setting on MS-COCO dataset, the model is pre-trained with image-text pairs from CC3M dataset.
AttnGAN - 125.98
Obj-GAN - - - - - -
DM-GAN - - - 131.05
OP-GAN - - - - - -
DF-GAN - - - - - - 137.60
XMC-GAN - - - -
Table 3: Standard text-to-image generation on CUB, LN-COCO and MM CelebA-HQ datasets.

Furthermore, we can also perform generation conditioned on images: for a given image, we generate an image-conditioned pseudo text feature vector with Lafite. Passing this pseudo text feature vector to the generator $G$ leads to generated images that are similar to the given image. Consequently, Lafite enables image generation with multi-modal conditions, i.e., it can be conditioned on both image and text simultaneously. The implementation details are discussed in the Appendix. Some generated examples are provided in Figure 6; more results are provided in the Appendix.

4.2 Zero-Shot Text-to-image Generation

Zero-shot is a setting for evaluating a pre-trained text-to-image generation model without training the model on any downstream data. The MS-COCO dataset is used to evaluate our model pre-trained on CC3M. The main results are shown in Table 2. Compared to DALL-E [ramesh2021zero] and CogView [ding2021cogview], Lafite achieves better quantitative results in most cases. We also emphasize that our model has only 75 million trainable parameters, while DALL-E has over 12 billion parameters. Our pre-training dataset CC3M is also much smaller than the pre-training dataset used in DALL-E, which contains 250 million image-text pairs, though we acknowledge that Lafite is based on an off-the-shelf discriminative model, CLIP, which is trained on 400 million image-text pairs.
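The scale comparison can be checked with simple arithmetic on the figures quoted in the text (75 million trainable parameters versus over 12 billion, and 3.3 million versus 250 million image-text pairs). The variable names are illustrative, and the frozen CLIP weights and CLIP's own training data are deliberately not counted here.

```python
lafite_trainable_params = 75e6   # trainable parameters of Lafite
dalle_params = 12e9              # DALL-E has "over" this many parameters
cc3m_pairs = 3.3e6               # image-text pairs in CC3M
dalle_pairs = 250e6              # image-text pairs used to train DALL-E

param_ratio = lafite_trainable_params / dalle_params
data_ratio = cc3m_pairs / dalle_pairs

print(f"trainable parameters: {param_ratio:.3%} of DALL-E")  # 0.625%
print(f"pre-training pairs:   {data_ratio:.2%} of DALL-E")   # 1.32%
```

Since DALL-E has more than 12 billion parameters, the parameter ratio is an upper bound, consistent with the "less than 1%" figure stated in the contributions.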

4.3 Standard Text-to-image Generation

We now consider the standard text-to-image generation task, where all the ground-truth image-text pairs are provided during training. We compare Lafite against a series of competitive systems: AttnGAN [xu2018attngan], Obj-GAN [li2019object], DM-GAN [zhu2019dm], OP-GAN [9184960], DF-GAN [tao2021dfgan] and XMC-GAN [zhang2021crossmodal]. The main results, evaluated by FID and IS on different datasets, are provided in Table 3. We also report the Semantic Object Accuracy (SOA) on MS-COCO following previous works [9184960, zhang2021crossmodal]. Results of competing models are cited directly from the corresponding papers. Our proposed model consistently outperforms all other methods, setting new SoTA results in standard text-to-image generation.

Training from Scratch
Fine-tuned from Pre-trained Model
Table 4: Comparisons between two schemes for language-free training on different datasets.

4.4 Adaptation of Pre-trained Models

Language-free model fine-tuning.

Compared with existing works, one key advantage of the pre-trained Lafite model is that it naturally enables language-free model fine-tuning. The results are provided in Table 4, where both $\text{Lafite}_{G}$ and $\text{Lafite}_{NN}$ are investigated on different datasets. We see that fine-tuning from the pre-trained model generally outperforms training from scratch. We also notice that the performance of pre-trained Lafite largely depends on the domain gap between the pre-training and fine-tuning datasets. For example, $\text{Lafite}_{NN}$ sometimes obtains worse results than $\text{Lafite}_{G}$, especially when the fine-tuning dataset is dissimilar to CC3M, e.g., CUB and MM CelebA-HQ. This indicates that the inference model used for generating text features may be biased, because it may over-fit to its training dataset CC3M.

Pre-trained Lafite is also highly training-efficient. For example, training from scratch with Lafite on MS-COCO dataset requires around 4 days to reach FID of 18, while fine-tuning only needs 3 hours. This becomes a critical advantage especially when we require several text-to-image generation models across different datasets.

Semi-supervised fine-tuning.

Adaptation of pre-trained Lafite is sample-efficient. One interesting question is: what percentage of image-text pairs do we need to outperform the previous SoTA, XMC-GAN, on the MS-COCO dataset? To answer this question, we conduct an experiment in which only a portion of the images are associated with ground-truth text. Our model is first pre-trained using all the images under the language-free setting, then fine-tuned with varying percentages of image-text pairs. The main results are summarized in Figure 7. Our method outperforms XMC-GAN on both IS and FID when less than half of the image-text pairs are employed.

(a) FID
(b) IS
Figure 7: Comparison of Lafite and the prior art XMC-GAN. The x-axis is the percentage of image-text pairs in the full MS-COCO dataset. XMC-GAN has over 166 million trainable parameters, while our Lafite has only 75 million trainable parameters.

4.5 Ablation Study

Ablation study of training objectives

We first investigate the impact of each component in our objective functions. The standard generator and discriminator losses are always employed; we ablate by excluding the contrastive losses $\mathcal{L}_{\text{ConG}}$ and $\mathcal{L}_{\text{ConD}}$ one by one. The results are provided in Table 5. For both variants of Lafite, we observe that the model performance can drop significantly when a contrastive loss is removed.


Table 5: Ablations of training losses on the MS-COCO dataset; a check mark means the component is used during training.
Model Feature dim IS FID SOA-C SOA-I
CLIP(B-32) Text encoder
CLIP(B-16) Text encoder
Table 6: Results of using different pre-trained models on MS-COCO dataset.

Ablations of pre-trained text/image encoders

To demonstrate the importance of using a multi-modal feature-aligned pre-trained model in our Lafite, we compare the CLIP model with other single-modality models. We adopt the popular RoBERTa [liu2019roberta] as the baseline text encoder, which was trained on a large text corpus only. Note that it is infeasible to perform language-free training without the joint feature space, so this experiment is based on the fully-supervised text-to-image generation setting. For a fair comparison, we also report the results of using only the text encoder of CLIP while discarding the image encoder. In this setting, there is no image encoder, so the generator contrastive loss $\mathcal{L}_{\text{ConG}}$ is removed from the objective function. The results are reported in Table 6. As expected, even if the image encoder of CLIP is not used, models with only the CLIP text encoder still significantly outperform models using RoBERTa. From the results, we can conclude that: (i) the feature space of CLIP is semantically meaningful for text-to-image generation, so using only the text encoder of CLIP still leads to better results than RoBERTa; (ii) text-to-image generation results can be improved by using a feature-aligned joint feature space (CLIP vs. others), and can be further improved with a stronger joint space (CLIP-ViT/B-16 outperforms CLIP-ViT/B-32, where ViT/B-16 and ViT/B-32 are different designs of visual transformers [dosovitskiy2020image]).

5 Conclusion

We have presented Lafite, an approach to building text-to-image generation systems without domain-specific image-text pairs in training. We achieve this goal by generating pseudo text features from images. Excellent performance on a variety of text-to-image generation tasks, covering language-free, zero-shot and fully-supervised settings, demonstrates the effectiveness of Lafite. In particular, Lafite sets a new SoTA in the zero-shot setting with only 1% of the trainable parameters of recent advances such as DALL-E/CogView. Lafite also outperforms prior art in the fully-supervised setting. We believe that language-free training is a promising direction for enabling broader application areas for text-to-image generation, as it significantly lowers the burden of data collection. One interesting future direction is to explore image synthesis in the wild, where long-tail and open-set conditions are provided for generation.


Appendix A

A.1 Theoretical Results

Theorem 1.

For a given threshold , the generated text feature by satisfies with probability at least

where is the dimension number of features, is the Gamma function.


Without loss of generality, we omit the subscript for clarity.

Denote , then we have


By the cumulative distribution function (CDF) of the inner product of random vectors on the sphere [cho2009inner], we know that

where is the dimension number of features, is the Gamma function. Thus we have

which completes the proof. ∎
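The CDF fact invoked in the proof can be checked numerically. The following is a minimal sketch, not the paper's derivation: it assumes the standard density of the inner product between a fixed unit vector and a vector drawn uniformly from the unit sphere in R^d, namely f(t) = C_d (1 - t^2)^((d-3)/2) on [-1, 1] with C_d = Γ(d/2) / (√π Γ((d-1)/2)). The function names, the dimension and the threshold are illustrative choices.

```python
import math
import random


def sphere_inner_product_cdf(c: float, d: int, steps: int = 20_000) -> float:
    """P(<u, v> <= c) for a fixed unit v and u uniform on the unit sphere in R^d.

    Integrates the density C_d * (1 - t^2)^((d - 3) / 2) over [-1, c] with the
    trapezoidal rule. Requires d >= 3 so that the density stays bounded.
    """
    log_c_d = (math.lgamma(d / 2) - 0.5 * math.log(math.pi)
               - math.lgamma((d - 1) / 2))
    c_d = math.exp(log_c_d)

    def density(t: float) -> float:
        return c_d * max(0.0, 1.0 - t * t) ** ((d - 3) / 2)

    width = (c + 1.0) / steps
    total = 0.5 * (density(-1.0) + density(c))
    total += sum(density(-1.0 + i * width) for i in range(1, steps))
    return total * width


def empirical_cdf(c: float, d: int, n: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        # By rotational symmetry, the fixed vector can be the first basis vector,
        # so the inner product is the first coordinate of the normalized sample.
        if g[0] / norm <= c:
            hits += 1
    return hits / n


if __name__ == "__main__":
    rng = random.Random(0)
    print("analytic :", sphere_inner_product_cdf(0.3, 8))
    print("empirical:", empirical_cdf(0.3, 8, 20_000, rng))
```

The two printed values should agree up to Monte Carlo error. In high dimensions the density concentrates around zero, which is why a random perturbation direction is almost orthogonal to the feature it perturbs.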

A.2 Experiment Details


The statistics of datasets are summarized in Table 7.

Dataset #train #validation caption/image
MS-COCO 82k 40k 5
CUB 9k 3k 10
LN-COCO 134k 8k 1
MM CelebA-HQ 24k 6k 10
Table 7: Statistics of datasets. The last column gives the number of captions per image.

Image feature extraction

In practice, we use random cropping as data augmentation when we extract the image features, as presented in Algorithm 2. The pseudo text features are generated by perturbing the average feature of the augmented samples. In our implementation, we fix the number of augmented samples per image and pre-process the images to obtain their augmented features before the training stage, so training time does not increase. Note that we also apply random cropping in the contrastive loss (7), with a limited number of crops, since a large number would slow down training.
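The perturbation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact implementation: it assumes the perturbation adds Gaussian noise rescaled to a fixed fraction `xi` of the feature norm and then renormalizes the result to unit length. The names `perturb_feature`, `l2_normalize` and `xi` are hypothetical.

```python
import math
import random
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def perturb_feature(feature: List[float], xi: float,
                    rng: random.Random) -> List[float]:
    """Return a unit-norm pseudo text feature from an (averaged) image feature.

    Assumed form: feature + xi * ||feature|| * noise / ||noise||, renormalized,
    where noise is standard Gaussian and xi >= 0 controls the noise level.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in feature]
    feature_norm = math.sqrt(sum(x * x for x in feature))
    noise_norm = math.sqrt(sum(x * x for x in noise))
    scale = xi * feature_norm / noise_norm
    perturbed = [f + scale * e for f, e in zip(feature, noise)]
    return l2_normalize(perturbed)


if __name__ == "__main__":
    rng = random.Random(0)
    averaged = [0.3, -1.2, 0.5, 2.0]  # toy stand-in for an averaged image feature
    pseudo = perturb_feature(averaged, xi=0.1, rng=rng)
    cosine = sum(a * b for a, b in zip(pseudo, l2_normalize(averaged)))
    print("cosine similarity to the original feature:", cosine)
```

Under this assumed form and for `xi < 1`, the cosine similarity between the perturbed and the original feature is never below sqrt(1 - xi^2), so a small noise level keeps the pseudo text feature close to the image feature.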

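Input: an image dataset, a pre-trained image encoder, hyper-parameters
// Image feature generation
for each image in the dataset do
    if use data augmentation then
        Initialize the feature accumulator;
        for each augmentation round do
            Add the encoder feature of a randomly cropped copy of the image to the accumulator;
        end for
        Set the image feature to the average of the accumulated features;
    else
        Initialize the image feature as the encoder feature of the uncropped image;
    end if
end for
Algorithm 2 Image feature extraction process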
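The crop-and-average loop of Algorithm 2 can be sketched in a few lines. This is an illustrative sketch only: `toy_encoder` is a hypothetical stand-in for the pre-trained image encoder, images are plain 2-D lists of pixel values, and `num_crops` and `crop_size` are assumed hyper-parameter names.

```python
import random
from typing import Callable, List

Image = List[List[float]]
Feature = List[float]


def random_crop(image: Image, size: int, rng: random.Random) -> Image:
    """Return a random size x size window of a 2-D image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def toy_encoder(image: Image) -> Feature:
    """Hypothetical stand-in for a pre-trained image encoder."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]


def extract_features(images: List[Image],
                     encoder: Callable[[Image], Feature],
                     num_crops: int,
                     crop_size: int,
                     rng: random.Random,
                     augment: bool = True) -> List[Feature]:
    """Compute one feature per image, averaging over random crops if requested."""
    features = []
    for image in images:
        if augment:
            total = None
            for _ in range(num_crops):
                f = encoder(random_crop(image, crop_size, rng))
                total = f if total is None else [a + b for a, b in zip(total, f)]
            features.append([a / num_crops for a in total])
        else:
            features.append(encoder(image))
    return features


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[(r * 8 + c) / 64 for c in range(8)] for r in range(8)]
    print(extract_features([image], toy_encoder, num_crops=4, crop_size=6, rng=rng))
```

Because the features depend only on the images and the frozen encoder, they can be computed once and cached before training, which matches the remark that augmentation does not increase training time.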
Figure 8: Generated examples on the MS-COCO dataset.
Figure 9: Generated examples on the CUB dataset.


The hyper-parameters are selected based on the performance on MS-COCO dataset. Specifically, is selected from , are selected from .

A.3 More Results

We provide the implementation details of image generation with multi-modal conditions, and more generated examples under the language-free setting.

To generate an image conditioned on both a reference image and a text description, we first extract the text feature from the given text and the pseudo text feature from the image. Both features are then fed into the pre-trained generator, yielding two conditional style codes. We construct a new conditional style code, each element of which is randomly selected from the corresponding element of one of the two codes. The new conditional style code is then fed into the generator to generate the desired image.
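The element-wise mixing of the two style codes can be sketched as below. This is a minimal sketch: the text only states that elements are randomly selected, so the selection probability `p_text` and the function name `mix_style_codes` are assumptions, and the style codes are represented as plain lists of floats.

```python
import random
from typing import List


def mix_style_codes(code_text: List[float],
                    code_image: List[float],
                    rng: random.Random,
                    p_text: float = 0.5) -> List[float]:
    """Build a new style code by picking each element from one of two codes.

    Element k of the result is code_text[k] with probability p_text (assumed
    value) and code_image[k] otherwise.
    """
    if len(code_text) != len(code_image):
        raise ValueError("style codes must have the same length")
    return [t if rng.random() < p_text else i
            for t, i in zip(code_text, code_image)]


if __name__ == "__main__":
    rng = random.Random(0)
    from_text = [1.0, 2.0, 3.0, 4.0]
    from_image = [-1.0, -2.0, -3.0, -4.0]
    print(mix_style_codes(from_text, from_image, rng))
```

Raising `p_text` toward 1 would make the result follow the text description more closely, while lowering it would favor the reference image.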

Note that generation conditioned on an image is not reconstruction. Thus, when only a reference image is provided, the generated image may differ from the given image. However, the two will share some visible characteristics that are semantically meaningful, as illustrated in our examples.

Some text-to-image generation results on CUB, MS-COCO, MM CelebA-HQ, LN-COCO are provided in the following figures.

Figure 10: Generated examples on the MM CelebA-HQ dataset.
Figure 11: Generated examples on the LN-COCO dataset.
Figure 12: Generating images with multi-modal conditions (conditioned on both image and text) on MS-COCO dataset.
Figure 13: Generating images with multi-modal conditions (conditioned on both image and text) on CUB dataset.
Figure 14: Generating images with multi-modal conditions (conditioned on both image and text) on MM CelebA-HQ dataset.
Figure 15: Generating images with multi-modal conditions (conditioned on both image and text) on LN-COCO dataset.