Unifying Multimodal Transformer for Bi-directional Image and Text Generation

10/19/2021
by Yupan Huang, et al.

We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Existing works typically design two separate task-specific models, one per direction, which imposes expensive design effort. In this work, we propose a unified image-and-text generative framework based on a single multimodal model that jointly learns both directions. We adopt the Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation: images and text are represented as unified sequences of tokens, and the Transformer learns multimodal interactions to generate those sequences. We further propose two-level-granularity feature representations and sequence-level training to improve this Transformer-based unified framework. Experiments show that our approach significantly improves on the previous Transformer-based model X-LXMERT, reducing its FID from 37.0 to 29.9 (lower is better) for text-to-image generation and improving the CIDEr-D score beyond its 100.9 for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.
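
To illustrate the sequence-generation formulation described above, the sketch below shows how text tokens and discrete image tokens (e.g. indices from a VQ codebook) could be embedded into one joint sequence and processed by a single Transformer with separate output heads for each modality. This is a minimal illustration of the general idea, not the paper's actual implementation: the class names, vocabulary sizes, sequence lengths, and layer counts are all illustrative assumptions.

```python
# Minimal sketch of a single Transformer over a unified [text ; image] token
# sequence. All sizes and names are hypothetical placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB = 30522      # assumed subword vocabulary size
IMAGE_VOCAB = 8192      # assumed visual codebook size (e.g. a VQ quantizer)
SEQ_LEN_TEXT = 32
SEQ_LEN_IMAGE = 64      # e.g. an 8x8 grid of visual tokens
D_MODEL = 512

class UnifiedSeqModel(nn.Module):
    """One backbone serves both generation directions via modality-specific heads."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        self.image_emb = nn.Embedding(IMAGE_VOCAB, D_MODEL)
        # Shared positional embedding over the concatenated sequence.
        self.pos_emb = nn.Embedding(SEQ_LEN_TEXT + SEQ_LEN_IMAGE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Text positions are scored against the text vocabulary,
        # image positions against the visual codebook.
        self.text_head = nn.Linear(D_MODEL, TEXT_VOCAB)
        self.image_head = nn.Linear(D_MODEL, IMAGE_VOCAB)

    def forward(self, text_tokens, image_tokens):
        # Embed each modality, then concatenate into one unified token sequence.
        x = torch.cat([self.text_emb(text_tokens),
                       self.image_emb(image_tokens)], dim=1)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)
        h = self.encoder(x)
        text_logits = self.text_head(h[:, :SEQ_LEN_TEXT])
        image_logits = self.image_head(h[:, SEQ_LEN_TEXT:])
        return text_logits, image_logits

# Usage with random token ids, batch size 2.
model = UnifiedSeqModel()
text = torch.randint(0, TEXT_VOCAB, (2, SEQ_LEN_TEXT))
image = torch.randint(0, IMAGE_VOCAB, (2, SEQ_LEN_IMAGE))
text_logits, image_logits = model(text, image)
print(text_logits.shape, image_logits.shape)  # (2, 32, 30522) (2, 64, 8192)
```

In practice, generating one modality conditioned on the other is a matter of masking: predict image-token positions given the text tokens for text-to-image generation, and predict text-token positions given the image tokens for image-to-text generation.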

