Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

by Tanmay Gupta et al.

An important goal of computer vision is to build systems that learn visual representations over time and can apply them to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, we align the task of visual recognition with the task of visual question answering (VQA) by forcing both to use the same word-region embeddings, which yields greater inductive transfer from recognition to VQA than standard multi-task learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
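To make the alignment idea concrete, here is a minimal NumPy sketch of a shared word-region embedding used by two task heads. All names, dimensions, and the uniform attention weights are illustrative assumptions, not the paper's actual architecture; the point is only that both heads read (and would backpropagate through) the same word-region similarity matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64            # shared embedding dimension (illustrative)
n_regions = 5     # image regions, e.g. from a region proposal network
n_words = 3       # words shared between the recognition and VQA vocabularies

# Shared word embeddings used by BOTH tasks -- this is the alignment.
word_emb = rng.standard_normal((n_words, D))
# Region features projected into the same embedding space.
region_emb = rng.standard_normal((n_regions, D))

# Word-region similarity matrix, shared across tasks.
scores = region_emb @ word_emb.T             # shape (n_regions, n_words)

# Recognition head: max-pool over regions -> per-word presence score.
recognition_logits = scores.max(axis=0)      # shape (n_words,)

# VQA head: weight regions by question-driven attention over words,
# reusing the SAME similarity matrix, so VQA training updates the
# embeddings that recognition also uses (and vice versa).
question_attn = np.ones(n_words) / n_words   # placeholder uniform attention
region_relevance = scores @ question_attn    # shape (n_regions,)

print(recognition_logits.shape, region_relevance.shape)
```

In a real system the pooling, attention, and projections would be learned; the sketch only shows where the shared representation sits between the two tasks.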



