Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

04/02/2017
by Tanmay Gupta, et al.

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show that this alignment yields greater inductive transfer from recognition to VQA than standard multi-task learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
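The core mechanism, a single word-region embedding space shared by the recognition and VQA tasks, can be sketched in a few lines. The following is a minimal PyTorch illustration, not the paper's implementation: the class name, dimensions, and cosine scoring are assumptions about how shared region-word alignment scores could feed both task heads.

import torch
import torch.nn as nn

class SharedWordRegionEmbedding(nn.Module):
    """Hypothetical shared embedding space for image regions and words."""

    def __init__(self, vocab_size, region_feat_dim, embed_dim=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)     # word side
        self.region_proj = nn.Linear(region_feat_dim, embed_dim)  # region side

    def forward(self, region_feats, word_ids):
        # region_feats: (num_regions, region_feat_dim), word_ids: (num_words,)
        r = nn.functional.normalize(self.region_proj(region_feats), dim=-1)
        w = nn.functional.normalize(self.word_embed(word_ids), dim=-1)
        # Cosine alignment score between every region and every word.
        return r @ w.t()  # (num_regions, num_words)

Under this setup, recognition can read per-region label scores directly off the alignment matrix, while a VQA model scores question and answer words against the same regions. Because both losses backpropagate into one shared embedding, supervision from either task updates the representation the other task uses, which is the source of the inductive transfer described above.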


Related research

02/12/2020 · Component Analysis for Visual Question Answering Architectures
Recent research advances in Computer Vision and Natural Language Process...

08/17/2019 · Language Features Matter: Effective Language Representations for Vision-Language Tasks
Shouldn't language and vision features be treated equally in vision-lang...

05/12/2020 · Cross-Modality Relevance for Reasoning on Language and Vision
This work deals with the challenge of learning and reasoning over langua...

04/03/2018 · Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering
A key solution to visual question answering (VQA) exists in how to fuse ...

12/03/2018 · Multi-task Learning of Hierarchical Vision-Language Representation
It is still challenging to build an AI system that can perform tasks tha...

07/29/2021 · Towards robust vision by multi-task learning on monkey visual cortex
Deep neural networks set the state-of-the-art across many tasks in compu...

12/06/2019 · Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
The large adoption of the self-attention (i.e. transformer model) and BE...