Joint Image Captioning and Question Answering

05/22/2018
by   Jialin Wu, et al.

Answering visual questions requires acquiring everyday common knowledge and modeling the semantic connections among different parts of an image, which is too difficult for VQA systems to learn from images when the only supervision comes from answers. Meanwhile, image captioning systems that use beam search tend to generate similar captions and fail to describe images diversely. To address these issues, we present a system in which the two tasks complement each other and which jointly produces image captions and answers visual questions. In particular, we use question and image features to generate question-related captions, and we use the generated captions as additional features that provide new knowledge to the VQA system. For image captioning, our system attains more informative results in terms of the relative improvements on VQA tasks, as well as competitive results on automated metrics. Applying our system to the VQA tasks, our results on the VQA v2 dataset achieve 65.8 validation set and 68.4 models results in 69.7
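The data flow described in the abstract (question and image features produce a question-related caption, and the caption is fed back to the VQA model as an extra feature) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the element-wise fusion, the word-ranking "decoder", and the tiny hand-made vectors are hypothetical stand-ins, not the authors' architecture or learned features.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse(image_feat: Vector, question_feat: Vector) -> Vector:
    """Toy joint feature: element-wise product of image and question features."""
    return [i * q for i, q in zip(image_feat, question_feat)]


def generate_caption(joint: Vector, word_vecs: Dict[str, Vector], length: int = 2) -> List[str]:
    """Stand-in for a caption decoder: keep the words that best match the joint feature."""
    ranked = sorted(word_vecs, key=lambda w: dot(joint, word_vecs[w]), reverse=True)
    return ranked[:length]


def embed_caption(caption: List[str], word_vecs: Dict[str, Vector]) -> Vector:
    """Average the word vectors of the caption to obtain a caption feature."""
    dim = len(next(iter(word_vecs.values())))
    return [sum(word_vecs[w][d] for w in caption) / len(caption) for d in range(dim)]


def answer_question(image_feat: Vector, question_feat: Vector,
                    word_vecs: Dict[str, Vector], answer_vecs: Dict[str, Vector]) -> str:
    """Score candidate answers using the joint feature plus the caption feature."""
    joint = fuse(image_feat, question_feat)
    caption = generate_caption(joint, word_vecs)       # question-related caption
    caption_feat = embed_caption(caption, word_vecs)   # caption as an additional feature
    combined = [j + c for j, c in zip(joint, caption_feat)]
    return max(answer_vecs, key=lambda a: dot(combined, answer_vecs[a]))


# Hand-made toy vectors (hypothetical; a real system would learn these).
IMAGE_FEAT = [1.0, 0.0, 1.0]
QUESTION_FEAT = [1.0, 1.0, 0.5]
WORD_VECS = {"dog": [1.0, 0.0, 0.0], "frisbee": [0.0, 0.0, 1.0], "car": [0.0, 1.0, 0.0]}
ANSWER_VECS = {"yes": [1.0, 0.0, 1.0], "no": [0.0, 1.0, 0.0]}

if __name__ == "__main__":
    joint = fuse(IMAGE_FEAT, QUESTION_FEAT)
    print(generate_caption(joint, WORD_VECS))
    print(answer_question(IMAGE_FEAT, QUESTION_FEAT, WORD_VECS, ANSWER_VECS))
```

The point of the sketch is only the coupling: the caption is conditioned on the question, and the answer scorer sees the caption feature in addition to the fused image and question feature.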


