ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

12/31/2021
by   Han Zhang, et al.

Conventional methods for image-text generation tackle the two naturally bidirectional tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, vision-language pre-training models have greatly improved performance on image-to-text generation, but large-scale pre-training models for the text-to-image synthesis task remain under-developed. In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Based on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input. This bidirectional image-text generative modeling eases semantic alignment across vision and language. For the text-to-image generation process, we further propose an end-to-end training method that jointly learns the visual sequence generator and the image reconstructor. To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs. The model achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS-COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.


Related research:

- GENIE: Large Scale Pre-training for Text Generation with Diffusion Model (12/22/2022)
- MAGVLT: Masked Generative Vision-and-Language Transformer (03/21/2023)
- L-Verse: Bidirectional Generation Between Image and Text (11/22/2021)
- Unifying Multimodal Transformer for Bi-directional Image and Text Generation (10/19/2021)
- Turbo Learning for Captionbot and Drawingbot (05/21/2018)
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP (03/01/2022)
- Prefix Language Models are Unified Modal Learners (06/15/2022)
