Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

04/02/2017
by Tanmay Gupta, et al.

An important goal of computer vision is to build systems whose visual representations improve over time and can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering (VQA) by forcing both to use the same word-region embeddings. We show that this alignment yields greater inductive transfer from recognition to VQA than standard multi-task learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step toward more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
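To make the shared-embedding idea concrete, here is a minimal PyTorch sketch, not the authors' code: the module names, dimensions, and the max-pooling answer head are all illustrative assumptions. It shows two task heads, recognition and VQA, that score inputs against one shared word-region embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of a shared word-region embedding used by two tasks.
# All names and dimensions are assumptions, not the paper's architecture.

class SharedWordRegionEmbedding(nn.Module):
    """Maps image-region features and word indices into one common space."""
    def __init__(self, vocab_size, region_dim, embed_dim):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.region_proj = nn.Linear(region_dim, embed_dim)

    def score(self, region_feats, word_ids):
        # Cosine-style alignment score between every region and every word.
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        w = F.normalize(self.word_embed(word_ids), dim=-1)
        return r @ w.transpose(-1, -2)  # [num_regions, num_words]

class RecognitionHead(nn.Module):
    """Classifies each region by scoring it against all category words."""
    def __init__(self, shared, category_word_ids):
        super().__init__()
        self.shared = shared
        self.register_buffer("category_word_ids", category_word_ids)

    def forward(self, region_feats):
        return self.shared.score(region_feats, self.category_word_ids)

class VQAHead(nn.Module):
    """Answers by pooling region-word alignment scores for the question."""
    def __init__(self, shared, num_answers, num_regions):
        super().__init__()
        self.shared = shared
        self.classifier = nn.Linear(num_regions, num_answers)

    def forward(self, region_feats, question_word_ids):
        scores = self.shared.score(region_feats, question_word_ids)
        pooled = scores.max(dim=-1).values  # best word match per region
        return self.classifier(pooled)

if __name__ == "__main__":
    shared = SharedWordRegionEmbedding(vocab_size=1000, region_dim=512,
                                       embed_dim=300)
    rec = RecognitionHead(shared, category_word_ids=torch.arange(80))
    vqa = VQAHead(shared, num_answers=100, num_regions=36)
    regions = torch.randn(36, 512)
    print(rec(regions).shape)                              # [36, 80]
    print(vqa(regions, torch.tensor([5, 17, 42])).shape)   # [100]
```

Because both heads backpropagate into the same word_embed and region_proj parameters, recognition supervision shapes the embedding space that VQA reads from, which, under these assumptions, is the inductive-transfer mechanism the abstract describes.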


Related research

02/12/2020: Component Analysis for Visual Question Answering Architectures
Recent research advances in Computer Vision and Natural Language Process...

05/24/2023: Interpretable by Design Visual Question Answering
Model interpretability has long been a hard problem for the AI community...

08/17/2019: Language Features Matter: Effective Language Representations for Vision-Language Tasks
Shouldn't language and vision features be treated equally in vision-lang...

05/12/2020: Cross-Modality Relevance for Reasoning on Language and Vision
This work deals with the challenge of learning and reasoning over langua...

12/03/2018: Multi-task Learning of Hierarchical Vision-Language Representation
It is still challenging to build an AI system that can perform tasks tha...

09/14/2022: Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering
Benefiting from large-scale Pretrained Vision-Language Models (VL-PMs), ...

07/29/2021: Towards robust vision by multi-task learning on monkey visual cortex
Deep neural networks set the state-of-the-art across many tasks in compu...
