Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

06/22/2022
by Jiahui Yu, et al.

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
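
As a rough illustration of what "encoding images as sequences of discrete tokens" can mean, the sketch below shows a generic nearest-neighbor vector-quantization lookup in plain Python. It is not the ViT-VQGAN implementation: the function names, the toy codebook, and the toy patch embeddings are all hypothetical, and a real tokenizer learns both the encoder and the codebook. The sketch only shows how continuous patch embeddings could be mapped to integer token ids that a sequence-to-sequence model can then predict.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patch_embeddings: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each patch embedding to the index of its nearest codebook vector.

    The returned list of indices plays the role of the image token sequence.
    """
    tokens = []
    for emb in patch_embeddings:
        distances = [squared_distance(emb, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook: 4 entries of dimension 2.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical encoder outputs for 3 image patches.
patches = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

image_tokens = quantize(patches, codebook)
print(image_tokens)  # [0, 3, 2]
```

In this framing, a text prompt is the source sequence and a list such as `image_tokens` is the target sequence; a decoder would map predicted token ids back to pixels.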


research
05/23/2022

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

We present Imagen, a text-to-image diffusion model with an unprecedented...
research
03/07/2023

Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding

Generative transformers have shown their superiority in synthesizing hig...
research
09/02/2023

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model

Text-to-image generation (TTI) refers to the usage of models that could ...
research
02/24/2021

Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better mod...
research
11/24/2021

Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences

Autoregressive models have proven to be very powerful in NLP text genera...
research
06/06/2023

Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue

Factual correctness is often the limiting factor in practical applicatio...
research
10/09/2021

Vector-quantized Image Modeling with Improved VQGAN

Pretraining language models with next-token prediction on massive text c...
