12-in-1: Multi-Task Vision and Language Representation Learning

12/05/2019
by Jiasen Lu, et al.

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task, including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform an in-depth analysis of the effects of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
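To make the setup concrete, below is a minimal sketch (not the authors' released code) of the general pattern the abstract describes: a single shared vision-and-language encoder with lightweight task-specific heads, trained by interleaving batches from several datasets. The names here (SharedEncoder, heads, task_loaders) are hypothetical, the backbone is a crude stand-in for a ViLBERT-style encoder, and the simple round-robin task sampling is a placeholder rather than the paper's actual scheduling strategy.

```python
# Minimal sketch (not the authors' released code): one shared encoder with
# per-task heads, trained by interleaving batches from several datasets.
# SharedEncoder, heads, and task_loaders are hypothetical placeholders.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """Stand-in for a ViLBERT-style vision-and-language backbone."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # placeholder for co-attention layers

    def forward(self, image_feats, text_emb):
        # A real model would fuse the two streams with co-attentional
        # transformer layers; here we just project, mix, and pool.
        fused = self.proj(image_feats) + text_emb.mean(dim=1, keepdim=True)
        return fused.mean(dim=1)  # (batch, dim) pooled representation

encoder = SharedEncoder()

# One lightweight output head per task group (VQA-style answer classification,
# image-text matching for retrieval, region scoring for grounding, ...).
heads = nn.ModuleDict({
    "vqa": nn.Linear(768, 3129),
    "retrieval": nn.Linear(768, 1),
    "grounding": nn.Linear(768, 1),
})

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(heads.parameters()), lr=1e-5)

def training_step(task, batch):
    image_feats, text_emb, target = batch
    pooled = encoder(image_feats, text_emb)
    logits = heads[task](pooled)
    if task == "vqa":    # multi-class answer prediction
        loss = F.cross_entropy(logits, target)
    else:                # binary matching / scoring tasks
        loss = F.binary_cross_entropy_with_logits(logits.squeeze(-1), target.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def train(task_loaders, num_steps):
    """task_loaders: dict mapping task name -> iterable of (img, txt, target) batches."""
    iters = {t: itertools.cycle(loader) for t, loader in task_loaders.items()}
    tasks = list(iters)
    for step in range(num_steps):
        task = tasks[step % len(tasks)]  # simple round-robin over tasks
        training_step(task, next(iters[task]))
```

The paper itself uses more careful per-task output formulations and training schedules; the point of the sketch is only that every task shares the encoder's roughly 270 million parameters while each task-specific head stays small.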



Related research

08/06/2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for lea...

12/03/2018
Multi-task Learning of Hierarchical Vision-Language Representation
It is still challenging to build an AI system that can perform tasks tha...

05/02/2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
We present Answer-Me, a task-aware multi-task framework which unifies a ...

02/14/2022
ASC me to Do Anything: Multi-task Training for Embodied AI
Embodied AI has seen steady progress across a diverse set of independent...

04/18/2019
Attentive Single-Tasking of Multiple Tasks
In this work we address task interference in universal networks by consi...

05/27/2023
A Match Made in Heaven: A Multi-task Framework for Hyperbole and Metaphor Detection
Hyperbole and metaphor are common in day-to-day communication (e.g., "I ...

04/06/2022
Improving Multi-task Generalization Ability for Neural Text Matching via Prompt Learning
Text matching is a fundamental technique in both information retrieval a...
