In Defense of Grid Features for Visual Question Answering

01/10/2020
by   Huaizu Jiang, et al.

Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision-and-language tasks such as visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g., better localization) are the key to the success of bottom-up attention. In this paper, we revisit grid features for VQA and find that they can work surprisingly well: they run more than an order of magnitude faster while achieving the same accuracy. Through extensive experiments, we verify that this observation holds across different VQA models and datasets, and that it generalizes to other tasks such as image captioning. Because grid features make model design and training much simpler, they allow us to train models end-to-end and to use more flexible network designs. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve both the scientific understanding and the practical application of VQA. Code and features will be made available.

Related research

09/21/2017  Visual Question Generation as Dual Task of Visual Question Answering
Recently visual question answering (VQA) and visual question generation ...

01/16/2021  Latent Variable Models for Visual Question Answering
Conventional models for Visual Question Answering (VQA) explore determin...

06/25/2021  A Picture May Be Worth a Hundred Words for Visual Question Answering
How far can we go with textual representations for understanding picture...

02/27/2020  Visual Commonsense R-CNN
We present a novel unsupervised feature representation learning method, ...

11/17/2015  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
We address the problem of Visual Question Answering (VQA), which require...

01/27/2018  Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions
Visual Question Answering (VQA) has attracted attention from both comput...

08/07/2017  Structured Attentions for Visual Question Answering
Visual attention, which assigns weights to image regions according to th...
