
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

07/25/2017
by Peter Anderson, et al.
The University of Adelaide · Microsoft · JD.com, Inc. · Macquarie University

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines the feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state of the art for the task, improving the best published CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the broad applicability of the method, the same approach applied to VQA obtains first place in the 2017 VQA Challenge.
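The top-down weighting step the abstract describes is compact enough to sketch. Below is a minimal PyTorch sketch, assuming 2048-d region features produced by the bottom-up Faster R-CNN detector and a 512-d task context vector (an LSTM state for captioning, or a question encoding for VQA); the class name, layer sizes, and the additive tanh scoring function are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft top-down attention over a set of bottom-up region features.

    Hypothetical sketch: dimensions and layer names are assumptions,
    not taken from the authors' implementation.
    """

    def __init__(self, feat_dim=2048, ctx_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)  # project region features
        self.proj_ctx = nn.Linear(ctx_dim, hidden_dim)    # project task context
        self.score = nn.Linear(hidden_dim, 1)             # scalar attention logit

    def forward(self, feats, ctx):
        # feats: (batch, k, feat_dim) -- k bottom-up region proposals
        # ctx:   (batch, ctx_dim)     -- e.g. LSTM state or question encoding
        joint = torch.tanh(self.proj_feat(feats) + self.proj_ctx(ctx).unsqueeze(1))
        logits = self.score(joint).squeeze(-1)            # (batch, k)
        weights = F.softmax(logits, dim=-1)               # normalized over regions
        attended = (weights.unsqueeze(-1) * feats).sum(dim=1)  # (batch, feat_dim)
        return attended, weights

if __name__ == "__main__":
    attn = TopDownAttention()
    regions = torch.randn(2, 36, 2048)    # 36 region proposals per image
    question = torch.randn(2, 512)        # question (or caption-state) encoding
    v_hat, alpha = attn(regions, question)
    print(v_hat.shape, alpha.shape)       # (2, 2048) (2, 36)
```

The division of labor is the point: the bottom-up detector fixes *what* can be attended to (a set of object-level regions), while the learned context projection decides *how much* each region matters for the current word or question.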


Related Research

02/13/2020  Sparse and Structured Visual Attention
Visual attention mechanisms are widely used in multimodal tasks, such as...

02/27/2020  Visual Commonsense R-CNN
We present a novel unsupervised feature representation learning method, ...

11/17/2015  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
We address the problem of Visual Question Answering (VQA), which require...

03/09/2016  Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
Much recent progress in Vision-to-Language problems has been achieved th...

03/07/2023  Graph Neural Networks in Vision-Language Image Understanding: A Survey
2D image understanding is a complex problem within Computer Vision, but ...

08/18/2020  Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
Attention models are widely used in Vision-language (V-L) tasks to perfo...

01/17/2020  Adapting Grad-CAM for Embedding Networks
The gradient-weighted class activation mapping (Grad-CAM) method can fai...

Code Repositories

bottom-up-attention-vqa
An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

bottom-up-attention
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome.