Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

11/14/2022
by Dominic Rampas, et al.

Conditional text-to-image generation has seen many recent improvements in quality, diversity, and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture samples a single image in less than 500 ms while having 573M parameters. The model operates on a compressed, quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous work. Aside from text-conditional image generation, our model can perform latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing. We release all of our code and pretrained models at https://github.com/dome272/Paella.
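
To make the idea of few-step sampling over quantized latent tokens more concrete, the following is a minimal toy sketch and not the authors' released sampler. It assumes a generic loop: start from random codebook indices, let a model predict per-position token distributions, sample from them, and re-noise a shrinking fraction of positions at each step. The names `toy_denoiser` and `sample_tokens`, the noise schedule, and the `sharpness` parameter are all hypothetical; a real model would be a trained network that also receives the current tokens and a CLIP text embedding.

```python
import numpy as np

def toy_denoiser(tokens, target, vocab_size, sharpness=8.0):
    """Hypothetical stand-in for a trained network.

    Returns logits of shape (H, W, vocab_size) that favor `target`.
    A real model would use `tokens` and a text embedding; this toy
    only uses the shape of `tokens`.
    """
    logits = np.zeros(tokens.shape + (vocab_size,))
    rows, cols = np.indices(tokens.shape)
    logits[rows, cols, target] = sharpness
    return logits

def sample_tokens(target, vocab_size, steps=8, seed=0):
    """Iteratively denoise a grid of codebook indices in `steps` steps."""
    rng = np.random.default_rng(seed)
    shape = target.shape
    # Start from pure noise: uniformly random codebook indices.
    tokens = rng.integers(0, vocab_size, size=shape)
    for step in range(steps):
        logits = toy_denoiser(tokens, target, vocab_size)
        # Numerically stable softmax over the vocabulary axis.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Sample one token per position via the inverse CDF.
        cdf = probs.cumsum(axis=-1)
        u = rng.random(shape + (1,))
        tokens = np.minimum((u > cdf).sum(axis=-1), vocab_size - 1)
        # Re-noise a fraction of positions that shrinks to zero
        # at the final step (assumed linear schedule).
        noise_frac = 1.0 - (step + 1) / steps
        mask = rng.random(shape) < noise_frac
        random_tokens = rng.integers(0, vocab_size, size=shape)
        tokens = np.where(mask, random_tokens, tokens)
    return tokens

if __name__ == "__main__":
    target = np.arange(64).reshape(8, 8) % 16
    out = sample_tokens(target, vocab_size=16, steps=8)
    print("agreement with target:", (out == target).mean())
```

In a full pipeline, the resulting index grid would be passed to the decoder of a vector-quantized autoencoder to obtain pixels; that stage is omitted here.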

research
09/19/2022

MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

Although two-stage Vector Quantized (VQ) generative models allow for syn...
research
01/17/2018

Semi-supervised FusedGAN for Conditional Image Generation

We present FusedGAN, a deep network for conditional image synthesis with...
research
07/24/2023

Interpolating between Images with Diffusion Models

One little-explored frontier of image generation and editing is the task...
research
04/27/2022

Optimized latent-code selection for explainable conditional text-to-image GANs

The task of text-to-image generation has achieved remarkable progress du...
research
06/07/2023

Designing a Better Asymmetric VQGAN for StableDiffusion

StableDiffusion is a revolutionary text-to-image generator that is causi...
research
02/25/2022

OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

Text-to-image generation intends to automatically produce a photo-realis...
research
11/02/2022

TextCraft: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Text

Language is one of the primary means by which we describe the 3D world a...
