Zero-shot spatial layout conditioning for text-to-image diffusion models

06/23/2023
by   Guillaume Couairon, et al.
0

Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and allow for an intuitive and powerful user interface to drive the image generation process. Expressing spatial constraints, e.g. to position specific objects in particular locations, is cumbersome using text; and current text-based image generation models are not able to accurately follow such instructions. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models, and does not require any additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align the generation with input masks. Our experimental results combine high image quality with accurate alignment of generated content with input segmentations, and improve over prior work both quantitatively and qualitatively, including methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state-of-the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.

READ FULL TEXT

page 2

page 4

page 7

page 8

page 12

page 13

page 14

page 15

research
02/24/2021

Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better mod...
research
08/11/2023

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Text-to-image synthesis has achieved high-quality results with recent ad...
research
04/14/2023

Text-Conditional Contextualized Avatars For Zero-Shot Personalization

Recent large-scale text-to-image generation models have made significant...
research
04/11/2022

No Token Left Behind: Explainability-Aided Image Classification and Generation

The application of zero-shot learning in computer vision has been revolu...
research
05/29/2023

Controllable Text-to-Image Generation with GPT-4

Current text-to-image generation models often struggle to follow textual...
research
04/26/2023

Controllable Image Generation via Collage Representations

Recent advances in conditional generative image models have enabled impr...
research
01/17/2023

GLIGEN: Open-Set Grounded Text-to-Image Generation

Large-scale text-to-image diffusion models have made amazing advances. H...

Please sign up or login with your details

Forgot password? Click here to reset