Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

09/04/2019
by   Soravit Changpinyo, et al.
0

Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making it less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for down-stream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.

READ FULL TEXT

page 1

page 5

research
05/29/2023

Contextual Object Detection with Multimodal Large Language Models

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision...
research
09/01/2023

Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding

Object proposal generation serves as a standard pre-processing step in V...
research
10/03/2018

Transfer Learning via Unsupervised Task Discovery for Visual Question Answering

We study how to leverage off-the-shelf visual and linguistic data to cop...
research
03/09/2016

Image Captioning and Visual Question Answering Based on Attributes and External Knowledge

Much recent progress in Vision-to-Language problems has been achieved th...
research
03/27/2020

Assessing Image Quality Issues for Real-World Problem

We introduce a new large-scale dataset that links the assessment of imag...
research
03/27/2020

Assessing Image Quality Issues for Real-World Problems

We introduce a new large-scale dataset that links the assessment of imag...
research
07/26/2017

Deep Interactive Region Segmentation and Captioning

With recent innovations in dense image captioning, it is now possible to...

Please sign up or login with your details

Forgot password? Click here to reset