Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

09/15/2021
by Ander Salaberria, et al.

Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to encode world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task that requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective on a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that our unimodal model is complementary to multimodal models on both OK-VQA and VQA 2.0, yielding the best result to date on OK-VQA among systems that do not use external knowledge graphs, and results comparable to systems that do use them. Our qualitative analysis of OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the stronger inference ability of text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
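The pipeline the abstract describes reduces to two stages: caption the image with an off-the-shelf model, then hand the caption and the question to a text-only language model. Below is a minimal sketch of that caption-then-answer flow; the specific Hugging Face models (nlpconnect/vit-gpt2-image-captioning, google/flan-t5-base), the prompt format, and the example image path are illustrative assumptions, not the captioner or language model used in the paper.

```python
# Minimal sketch of a caption-then-answer VQA pipeline.
# Assumptions: the two pipeline models below are stand-ins for whatever
# captioner and language model one chooses; "example.jpg" is a hypothetical path.
from transformers import pipeline

# Stage 1: off-the-shelf image captioner (text comes out, image goes in).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Stage 2: text-only language model acting as the answerer.
reader = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(image_path: str, question: str) -> str:
    # Generate a caption for the image, then answer from text alone.
    caption = captioner(image_path)[0]["generated_text"]
    prompt = (
        "Answer the question given the image caption.\n"
        f"caption: {caption}\n"
        f"question: {question}"
    )
    return reader(prompt, max_new_tokens=16)[0]["generated_text"]

print(answer("example.jpg", "What country do these animals come from?"))
```

The design point the paper makes is that stage 2 never sees pixels: any world knowledge needed to answer must come from the language model itself, which is why this setup pays off most on knowledge-demanding benchmarks like OK-VQA.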


Related research

- LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection (07/26/2022)
- Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization (05/24/2022)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization (05/24/2022)
- External Knowledge enabled Text Visual Question Answering (08/22/2021)
- Recommending Themes for Ad Creative Design via Visual-Linguistic Representations (01/20/2020)
- Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering (09/14/2022)
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (08/19/2023)
