clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP

10/05/2022
by Justin N. M. Pinkney, et al.

We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN. It enables text-driven sampling with an existing generative model without any external data or fine-tuning. This is achieved by training a diffusion model, conditioned on CLIP embeddings, to sample latent vectors of a pre-trained StyleGAN, which we call clip2latent. We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text-labelled data when training the conditional diffusion model. We demonstrate that clip2latent allows us to generate high-resolution (1024x1024 pixels) images from text prompts with fast sampling, high image quality, and low training compute and data requirements. We also show that using the well-studied StyleGAN architecture, without further fine-tuning, lets us directly apply existing methods to control and modify the generated images, adding a further layer of control to our text-to-image pipeline.
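To make the core idea concrete, below is a minimal PyTorch sketch of how such a diffusion prior could be trained, under assumed interfaces; it is not the authors' released code. The StyleGAN handle G (with placeholder methods G.sample_w and G.synthesize) and the clip_image_encoder callable are hypothetical stand-ins for the real pretrained networks. The key point the sketch illustrates is that no external dataset is needed: training pairs come from sampling the frozen StyleGAN itself and embedding its outputs with CLIP's image encoder, while CLIP's image/text alignment lets the text encoder be swapped in at inference time.

    # Minimal sketch (assumed interfaces, not the paper's implementation) of a
    # diffusion prior that maps CLIP embeddings to StyleGAN W latents.
    import torch
    import torch.nn as nn

    W_DIM, CLIP_DIM, T = 512, 512, 1000  # latent size, CLIP size, diffusion steps

    class LatentDenoiser(nn.Module):
        """Predicts the noise added to a W latent, conditioned on a CLIP
        embedding and the timestep (a simplified MLP stand-in)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(W_DIM + CLIP_DIM + 1, 1024), nn.SiLU(),
                nn.Linear(1024, 1024), nn.SiLU(),
                nn.Linear(1024, W_DIM),
            )
        def forward(self, w_noisy, clip_emb, t):
            t_feat = t.float().unsqueeze(-1) / T  # crude timestep encoding
            return self.net(torch.cat([w_noisy, clip_emb, t_feat], dim=-1))

    # Standard DDPM noise-schedule quantities (linear schedule).
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    denoiser = LatentDenoiser()
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

    def training_step(G, clip_image_encoder, batch_size=32):
        """One prior-training step. No external data: images are sampled from
        the frozen StyleGAN and embedded with CLIP's image encoder.
        G.sample_w / G.synthesize are hypothetical placeholder APIs."""
        with torch.no_grad():
            w = G.sample_w(batch_size)            # W latents from the frozen GAN
            images = G.synthesize(w)              # corresponding images
            clip_emb = clip_image_encoder(images) # CLIP image embeddings
        t = torch.randint(0, T, (batch_size,))
        noise = torch.randn_like(w)
        a = alphas_cumprod[t].unsqueeze(-1)
        w_noisy = a.sqrt() * w + (1 - a).sqrt() * noise  # forward process q(w_t | w_0)
        loss = nn.functional.mse_loss(denoiser(w_noisy, clip_emb, t), noise)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # At inference: embed a text prompt with CLIP's text encoder, run the
    # reverse diffusion to sample a latent w, then call G.synthesize(w).

Because the denoiser operates on 512-dimensional latents rather than pixels, both training and sampling are cheap relative to image-space diffusion, which is consistent with the low compute and fast sampling claimed in the abstract.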
