Multi-task Learning of Hierarchical Vision-Language Representation

12/03/2018
by   Duy-Kien Nguyen, et al.
0

It is still challenging to build an AI system that can perform tasks that involve vision and language at human level. So far, researchers have singled out individual tasks separately, for each of which they have designed networks and trained them on its dedicated datasets. Although this approach has seen a certain degree of success, it comes with difficulties of understanding relations among different tasks and transferring the knowledge learned for a task to others. We propose a multi-task learning approach that enables to learn vision-language representation that is shared by many tasks from their diverse datasets. The representation is hierarchical, and prediction for each task is computed from the representation at its corresponding level of the hierarchy. We show through experiments that our method consistently outperforms previous single-task-learning methods on image caption retrieval, visual question answering, and visual grounding. We also analyze the learned hierarchical representation by visualizing attention maps generated in our network.

READ FULL TEXT
research
12/05/2019

12-in-1: Multi-Task Vision and Language Representation Learning

Much of vision-and-language research focuses on a small but diverse set ...
research
08/30/2019

Multi-Task Learning with Language Modeling for Question Generation

This paper explores the task of answer-aware questions generation. Based...
research
05/10/2021

You Only Learn One Representation: Unified Network for Multiple Tasks

People “understand” the world via vision, hearing, tactile, and also the...
research
12/17/2020

MIX : a Multi-task Learning Approach to Solve Open-Domain Question Answering

In this paper, we introduce MIX : a multi-task deep learning approach to...
research
06/07/2021

Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring

Despite recent progress, learning new tasks through language instruction...
research
10/10/2021

Multi-task Learning with Metadata for Music Mood Classification

Mood recognition is an important problem in music informatics and has ke...
research
04/02/2017

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

An important goal of computer vision is to build systems that learn visu...

Please sign up or login with your details

Forgot password? Click here to reset