Image Captioning and Visual Question Answering Based on Attributes and External Knowledge

03/09/2016
by Qi Wu, et al.

Much recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper, we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state of the art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. In particular, it allows questions to be asked about the contents of an image, even when the image itself does not contain a complete answer. Our final model achieves the best reported results on both image captioning and visual question answering on several benchmark datasets.
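
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute vocabulary, the toy knowledge base, and all function names below are hypothetical, and the CNN that would produce the attribute scores and the RNN that would consume the assembled context are left out. The sketch only shows how predicted high-level concepts might be used to look up external knowledge and be combined with a question.

```python
from typing import Dict, List, Sequence

# Hypothetical attribute vocabulary; in practice this would be learned from data.
ATTRIBUTES: List[str] = ["dog", "frisbee", "grass", "umbrella", "beach"]

# Toy stand-in for a general knowledge base (assumption: keyed by attribute name).
TOY_KB: Dict[str, str] = {
    "dog": "The dog is a domesticated animal often kept as a pet.",
    "frisbee": "A frisbee is a disc thrown and caught for recreation.",
}


def top_k_attributes(scores: Sequence[float], k: int) -> List[str]:
    """Return the k attribute names with the highest predicted scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [ATTRIBUTES[i] for i in ranked[:k]]


def retrieve_knowledge(attrs: Sequence[str], kb: Dict[str, str]) -> List[str]:
    """Look up each attribute in the knowledge base, skipping missing entries."""
    return [kb[a] for a in attrs if a in kb]


def build_context(scores: Sequence[float], question: str, k: int = 2) -> dict:
    """Assemble the inputs a language model could condition on:
    the attribute scores, retrieved knowledge text, and the question."""
    attrs = top_k_attributes(scores, k)
    return {
        "attribute_scores": list(scores),
        "top_attributes": attrs,
        "knowledge": retrieve_knowledge(attrs, TOY_KB),
        "question": question,
    }


if __name__ == "__main__":
    # Made-up scores standing in for the output of an attribute predictor.
    context = build_context([0.9, 0.8, 0.3, 0.05, 0.1], "What is the dog chasing?")
    print(context["top_attributes"])
    print(context["knowledge"])
```

In this sketch the knowledge lookup is a plain dictionary access; a real system would query an actual knowledge base and encode the retrieved text before passing it to the answer generator.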

research
06/03/2015

What value do explicit high level concepts have in vision to language problems?

Much of the recent progress in Vision-to-Language (V2L) problems has bee...
research
07/27/2022

Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base

Knowledge-based visual question answering (KVQA) task aims to answer que...
research
07/25/2017

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Top-down visual attention mechanisms have been used extensively in image...
research
10/30/2018

Gated Hierarchical Attention for Image Captioning

Attention modules connecting encoder and decoders have been widely appli...
research
09/04/2019

Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Object detection plays an important role in current solutions to vision ...
research
11/09/2020

CapWAP: Captioning with a Purpose

The traditional image captioning task uses generic reference captions to...
research
08/09/2017

Learning to Disambiguate by Asking Discriminative Questions

The ability to ask questions is a powerful tool to gather information in...
