Guiding Text-to-Image Diffusion Model Towards Grounded Generation

01/12/2023
by Ziyi Li, et al.

This paper aims to augment a pre-trained text-to-image diffusion model with open-vocabulary object grounding, i.e., simultaneously generating images and segmentation masks for the visual entities described in the text prompt. We make the following contributions: (i) we insert a grounding module into the existing diffusion model, which can be trained to align the model's visual and textual embedding spaces using only a small number of object categories; (ii) we propose an automatic pipeline for constructing a dataset of (image, segmentation mask, text prompt) triplets to train the proposed grounding module; (iii) we evaluate open-vocabulary grounding on images generated by the text-to-image diffusion model and show that the module can segment objects of categories beyond those seen at training time; (iv) we use the guided diffusion model to build a synthetic semantic segmentation dataset and show that a standard segmentation model trained on it achieves competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for applying powerful diffusion models to discriminative tasks.
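To make the triplet format and the seen/unseen evaluation split more concrete, here is a minimal sketch in Python. It is not the authors' code. The `GroundedSample` field names, the use of intersection-over-union (IoU) as the mask metric, and the helper names are all assumptions chosen for illustration; the abstract does not specify the metric or the data layout.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Mask = List[List[int]]  # binary mask: 1 marks an object pixel, 0 background


@dataclass
class GroundedSample:
    """One (image, segmentation mask, text prompt) triplet.

    Field names are hypothetical; the abstract only states that the
    dataset consists of such triplets.
    """
    image_path: str
    mask: Mask
    prompt: str
    category: str


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape.

    IoU is assumed here as the metric; two empty masks score 1.0.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter / union if union else 1.0


def mean_iou_by_split(
    results: Iterable[Tuple[str, float]], seen_categories: Set[str]
) -> Dict[str, float]:
    """Average per-sample IoU separately for seen and unseen categories.

    `results` holds (category, iou) pairs. A split with no samples is
    reported as 0.0.
    """
    buckets: Dict[str, List[float]] = {"seen": [], "unseen": []}
    for category, iou in results:
        key = "seen" if category in seen_categories else "unseen"
        buckets[key].append(iou)
    return {k: (sum(v) / len(v) if v else 0.0) for k, v in buckets.items()}
```

For example, scoring each predicted mask with `mask_iou` and passing the resulting `(category, iou)` pairs to `mean_iou_by_split` along with the set of training categories separates performance on categories seen during training from performance on novel ones, which is the distinction contribution (iii) draws.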

research
09/03/2023

VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Large-scale text-to-image diffusion models have shown impressive capabil...
research
06/15/2023

Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

The variety of objects in the real world is nearly unlimited and is thus...
research
06/01/2023

Intelligent Grimm – Open-ended Visual Storytelling via Latent Diffusion Models

Generative models have recently exhibited exceptional capabilities in va...
research
02/27/2023

LMSeg: Language-guided Multi-dataset Segmentation

It's a meaningful and attractive topic to build a general and inclusive ...
research
01/17/2023

GLIGEN: Open-Set Grounded Text-to-Image Generation

Large-scale text-to-image diffusion models have made amazing advances. H...
research
09/08/2023

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Diffusion models have revolutionized the field of text-to-image generation rec...
research
08/29/2023

Shatter and Gather: Learning Referring Image Segmentation with Text Supervision

Referring image segmentation, the task of segmenting any arbitrary entit...
