Unifying Vision-and-Language Tasks via Text Generation

02/04/2021
by Jaemin Cho, et al.

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, visual question answering uses a multi-label answer classifier, referring expression comprehension uses a region scorer, and image captioning uses a language decoder. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach generalizes better when answering questions that have rare answers. In addition, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, which achieves similar performance to separately optimized single-task models. Our code will be publicly available at: https://github.com/j-min/VL-T5
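To make the unified text-generation interface concrete, the minimal sketch below casts several vision-and-language tasks into a single (task-prefixed input text, output text) format and answers all of them with one generation call, using an off-the-shelf T5 encoder-decoder as a stand-in. This is an illustration only, not the released VL-T5 code: the actual model additionally conditions the encoder on visual region features, and the specific prefixes, the t5-small checkpoint, and the example inputs here are assumptions for demonstration.

    # Illustrative sketch (not the authors' implementation): casting several
    # vision-and-language tasks as conditional text generation with task prefixes.
    # A multimodal model such as VL-T5 would also feed detected region features
    # into the encoder; a plain text-only T5 stands in here to show the shared
    # text-generation interface across tasks.
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Every task shares the same input/output format: prefix + text in, text out.
    # Prefixes and inputs below are hypothetical examples, not the paper's exact setup.
    examples = {
        "vqa": "vqa: question: what is the man holding?",
        "grounding": "visual grounding: a dog on the left of the bench",
        "captioning": "caption:",  # image regions would normally condition this
    }

    for task, source in examples.items():
        inputs = tokenizer(source, return_tensors="pt")
        # The same generation call produces an answer word, a region tag, or a
        # full caption, depending only on the task prefix and conditioning input.
        output_ids = model.generate(**inputs, max_new_tokens=20)
        print(task, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))

Because every task is reduced to generating a text label, a single set of parameters and a single cross-entropy objective can serve tasks that are otherwise handled by separate classifiers, region scorers, and decoders.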

Related research

Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks (08/17/2023)
Natural Language Explanations (NLE) aim at supplementing the prediction ...

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (06/14/2022)
Unified vision-language frameworks have greatly advanced in recent years...

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling (11/23/2021)
In this paper, we propose UNICORN, a vision-language (VL) model that uni...

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (02/07/2022)
In this work, we pursue a unified paradigm for multimodal pretraining to...

Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP (08/27/2023)
Mechanistic interpretability seeks to understand the neural mechanisms t...

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (09/23/2020)
Mirroring the success of masked language models, vision-and-language cou...

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (06/17/2022)
We propose Unified-IO, a model that performs a large variety of AI tasks...
