CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

03/01/2022
by Zihao Wang, et al.

Training a text-to-image generator in the general domain (e.g., DALL·E, CogView) requires huge amounts of paired text-image data, which is too expensive to collect. In this paper, we propose a self-supervised scheme named CLIP-GEN for general text-to-image generation using the language-image priors extracted by a pre-trained CLIP model. Our approach requires only a set of unlabeled images in the general domain to train a text-to-image generator. Specifically, given an image without text labels, we first extract its embedding in the unified language-vision embedding space with the image encoder of CLIP. Next, we convert the image into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN model can be trained on the unlabeled image dataset at hand). Finally, we train an autoregressive transformer that maps the unified language-vision representation to the image tokens. Once trained, the transformer can generate coherent image tokens conditioned on the text embedding extracted by the text encoder of CLIP from an input text. Such a strategy enables us to train a strong and general text-to-image generator on a large text-free image dataset such as ImageNet. Qualitative and quantitative evaluations verify that our method significantly outperforms optimization-based text-to-image methods in terms of image quality without compromising text-image matching. Our method can even achieve performance comparable to flagship supervised models like CogView.
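For concreteness, the sketch below shows how the three steps above could be wired together in PyTorch. The calls into OpenAI's `clip` package (`clip.load`, `encode_image`, `encode_text`, `clip.tokenize`) are real; the `vqgan.encode_to_indices` / `decode_from_indices` helpers and the `transformer(cond, tokens)` interface are hypothetical placeholders standing in for the paper's VQGAN tokenizer and conditional transformer, so treat this as an illustration of the training/inference split rather than the authors' implementation.

```python
# Minimal CLIP-GEN-style sketch. Hypothetical APIs (not from the paper's code):
#   vqgan.encode_to_indices(images) -> (B, L) int64 codebook indices
#   vqgan.decode_from_indices(tokens) -> pixels
#   transformer(cond, tokens) -> logits of shape (B, len(tokens)+1, vocab),
#   with the CLIP embedding acting as a BOS-like conditioning prefix.
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def training_step(transformer, vqgan, images, optimizer):
    """Self-supervised step: needs only images, no captions."""
    with torch.no_grad():
        # 1) Embed the image in the joint language-vision space.
        cond = clip_model.encode_image(images).float()      # (B, 512)
        cond = cond / cond.norm(dim=-1, keepdim=True)
        # 2) Quantize the image into discrete VQGAN codebook indices.
        tokens = vqgan.encode_to_indices(images)            # (B, L)
    # 3) Next-token prediction conditioned on the CLIP embedding:
    #    output position i predicts token i, so feeding tokens[:, :-1]
    #    yields L predictions aligned with the L targets.
    logits = transformer(cond, tokens[:, :-1])              # (B, L, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate(transformer, vqgan, prompt, seq_len=256):
    """At inference, a *text* embedding stands in for the image
    embedding the model was trained on."""
    cond = clip_model.encode_text(clip.tokenize([prompt]).to(device)).float()
    cond = cond / cond.norm(dim=-1, keepdim=True)
    tokens = torch.zeros(1, 0, dtype=torch.long, device=device)
    for _ in range(seq_len):
        logits = transformer(cond, tokens)                  # (1, t+1, vocab)
        probs = logits[:, -1].softmax(dim=-1)
        tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
    return vqgan.decode_from_indices(tokens)                # tokens -> pixels
```

The key design point is visible in `generate`: because CLIP places images and text in the same embedding space, a text embedding can replace the image embedding at test time even though training never saw a caption.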

