Counting Guidance for High Fidelity Text-to-Image Synthesis

06/30/2023
by Wonjun Kang, et al.

Recently, the quality and performance of text-to-image generation have advanced significantly, driven by the impressive results of diffusion models. However, text-to-image diffusion models still fail to generate high-fidelity content with respect to the input prompt. One problem where text-to-image diffusion models struggle is generating the exact number of objects specified in the text prompt. For example, given the prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects. In this paper, we propose a method that steers diffusion models toward producing the correct object count for the input prompt. We adopt a counting network that performs reference-less, class-agnostic counting on any given image. We calculate the gradients of the counting network and use them to refine the predicted noise at each denoising step. To handle multiple types of objects in the prompt, we use a novel attention map guidance to obtain high-fidelity masks for each object. Finally, we guide the denoising process with the gradients calculated for each object. Through extensive experiments and evaluation, we demonstrate that our proposed guidance method greatly improves the fidelity of diffusion models to object count.
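
The noise-refinement step described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes a classifier-guidance-style update, in which the gradient of a squared count error is scaled by sqrt(1 - alpha_bar_t) and added to the predicted noise. The "counting network" here is a hypothetical stand-in (a sum of sigmoids over pixel values) chosen only because its gradient can be written by hand. The names `soft_count`, `count_loss_and_grad`, `predict_x0`, `refine_noise`, and the `scale` parameter are all illustrative, and the sketch covers a single object type with no attention masks.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_count(x0, sharpness=10.0, threshold=0.5):
    """Toy differentiable 'counter' (assumption): each bright pixel adds ~1."""
    return sum(_sigmoid(sharpness * (v - threshold)) for v in x0)


def count_loss_and_grad(x0, target, sharpness=10.0, threshold=0.5):
    """Squared count error and its analytic gradient with respect to x0."""
    s_vals = [_sigmoid(sharpness * (v - threshold)) for v in x0]
    error = sum(s_vals) - target
    loss = error ** 2
    grad = [2.0 * error * sharpness * s * (1.0 - s) for s in s_vals]
    return loss, grad


def predict_x0(x_t, eps, alpha_bar):
    """Standard DDPM estimate of the clean sample from x_t and predicted noise."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps)]


def refine_noise(x_t, eps, alpha_bar, target, scale):
    """Refine predicted noise with the count-loss gradient (assumed form).

    eps is treated as a constant when differentiating, so
    d(loss)/d(x_t) = d(loss)/d(x0) / sqrt(alpha_bar).
    """
    x0 = predict_x0(x_t, eps, alpha_bar)
    _, grad_x0 = count_loss_and_grad(x0, target)
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [e + scale * b * (g / a) for e, g in zip(eps, grad_x0)]


# One guided step on a 6-"pixel" toy latent, asking for 2 objects.
x_t = [0.9, 0.8, 0.7, 0.1, 0.2, 0.6]
eps = [0.0] * 6
eps_refined = refine_noise(x_t, eps, alpha_bar=0.9, target=2, scale=0.1)
loss_before, _ = count_loss_and_grad(predict_x0(x_t, eps, 0.9), 2)
loss_after, _ = count_loss_and_grad(predict_x0(x_t, eps_refined, 0.9), 2)
print(loss_after < loss_before)
```

Under these assumptions, re-estimating the clean sample from the refined noise amounts to one gradient-descent step on the count loss, so a small `scale` lowers the count error. In a real pipeline the toy counter would be replaced by a trained counting network differentiated with automatic differentiation, and with several object types the loss would be evaluated per object on masked inputs.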

research
08/11/2023

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Text-to-image synthesis has achieved high-quality results with recent ad...
research
10/15/2021

Counting Objects by Diffused Index: geometry-free and training-free approach

Counting objects is a fundamental but challenging problem. In this paper...
research
08/09/2023

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

In the text-to-image generation field, recent remarkable progress in Sta...
research
01/28/2023

SEGA: Instructing Diffusion using Semantic Dimensions

Text-to-image diffusion models have recently received a lot of interest ...
research
06/09/2023

RePaint-NeRF: NeRF Editting via Semantic Masks and Diffusion Models

The emergence of Neural Radiance Fields (NeRF) has promoted the developm...
research
07/17/2023

Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation

Conditional diffusion models have demonstrated impressive performance in...
research
11/02/2022

TextCraft: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Text

Language is one of the primary means by which we describe the 3D world a...
