Controllable Text-to-Image Generation with GPT-4

05/29/2023
by Tianjun Zhang, et al.

Current text-to-image generation models often struggle to follow textual instructions, especially those requiring spatial reasoning. Large Language Models (LLMs) such as GPT-4, on the other hand, have shown remarkable precision in generating code snippets that sketch out text inputs graphically, e.g., via TikZ. In this work, we introduce Control-GPT, which guides diffusion-based text-to-image pipelines with programmatic sketches generated by GPT-4, improving their ability to follow instructions. Control-GPT queries GPT-4 to write TikZ code, and the generated sketches are used as references alongside the text instructions for diffusion models (e.g., ControlNet) to generate photo-realistic images. One major challenge in training our pipeline is the lack of a dataset containing aligned text, images, and sketches. We address this by converting instance masks in existing datasets into polygons that mimic the sketches used at test time. As a result, Control-GPT greatly boosts the controllability of image generation. It establishes a new state of the art in spatial arrangement and object positioning, and it enhances users' control over object positions, sizes, etc., nearly doubling the accuracy of prior models. Our work, as a first attempt, shows the potential of employing LLMs to enhance performance in computer vision tasks.
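The mask-to-polygon step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the conversion is done, so everything here is an assumption: the function name `mask_to_polygon` is hypothetical, the mask is taken to be a plain 2D list of 0/1 values, and a convex hull (Andrew's monotone chain) stands in for whatever contour extraction the authors actually use. A convex hull discards concavities, so it is only a rough approximation of an object outline.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def mask_to_polygon(mask: List[List[int]]) -> List[Point]:
    """Approximate a binary instance mask by the convex hull of its pixels.

    Hypothetical helper: a convex hull is a simplification chosen for
    illustration, not the conversion described by the paper.
    """
    # Collect foreground pixels as sorted, de-duplicated (x, y) points.
    points = sorted(
        {(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v}
    )
    if len(points) <= 2:
        return points

    def cross(o: Point, a: Point, b: Point) -> int:
        # 2D cross product of vectors OA and OB; its sign gives the turn direction.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    # Build the lower hull, dropping points that do not make a strict left turn.
    lower: List[Point] = []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull the same way, scanning the points in reverse.
    upper: List[Point] = []
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half repeats the first point of the other half.
    return lower[:-1] + upper[:-1]


# A 2x2 square of foreground pixels inside a 4x4 mask.
square_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
square_polygon = mask_to_polygon(square_mask)
```

For the square mask above, the returned vertices are the four corner pixels of the square in hull order. A real pipeline would more likely trace the mask contour so that non-convex shapes are preserved.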


research
06/23/2023

Zero-shot spatial layout conditioning for text-to-image diffusion models

Large-scale text-to-image diffusion models have significantly improved t...
research
08/09/2023

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

In the text-to-image generation field, recent remarkable progress in Sta...
research
05/18/2023

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Achieving machine autonomy and human control often represent divergent o...
research
06/19/2023

Conditional Text Image Generation with Diffusion Models

Current text recognition systems, including those for handwritten script...
research
05/02/2023

Multimodal Procedural Planning via Dual Text-Image Prompting

Embodied agents have achieved prominent performance in following human i...
research
08/16/2023

Painter: Teaching Auto-regressive Language Models to Draw Sketches

Large language models (LLMs) have made tremendous progress in natural la...
research
06/01/2023

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

Constructing AI models that respond to text instructions is challenging,...
