Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

06/15/2023
by Thomas Mensink, et al.

We propose Encyclopedic-VQA, a large-scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs, each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence that supports each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models: PaLI [14] is state-of-the-art on OK-VQA [37], yet it achieves only 13.0% accuracy on our dataset. We experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy, while an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models.
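The retrieval-augmentation idea described above can be illustrated with a minimal sketch: select the knowledge-base passage most similar to the question, then prepend it as evidence before asking the answer model. Everything here (the toy knowledge base, bag-of-words cosine similarity, the prompt format, and all function names) is a hypothetical illustration, not the paper's actual pipeline, which uses a Wikipedia-derived knowledge base and learned retrievers.

```python
from collections import Counter
import math

# Toy knowledge base: entity -> evidence passage (illustrative only).
KNOWLEDGE_BASE = {
    "Golden Gate Bridge": "The Golden Gate Bridge opened in 1937 and spans the Golden Gate strait.",
    "Eiffel Tower": "The Eiffel Tower was completed in 1889 for the Paris World's Fair.",
}

def _vec(text):
    """Bag-of-words term counts, lowercased."""
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query):
    """Return the knowledge-base passage most similar to the query."""
    q = _vec(query)
    return max(KNOWLEDGE_BASE.values(), key=lambda p: _cosine(q, _vec(p)))

def augmented_prompt(question, image_description):
    """Combine question + recognized entity to retrieve evidence, then build a prompt."""
    evidence = retrieve(question + " " + image_description)
    return (
        f"Evidence: {evidence}\n"
        f"Image: {image_description}\n"
        f"Question: {question}\n"
        f"Answer:"
    )

prompt = augmented_prompt("When did this bridge open?", "a photo of the Golden Gate Bridge")
```

A real system would replace the bag-of-words retriever with a dense (learned) retriever over the full knowledge base and feed the augmented prompt to a large vision+language model; an oracle variant would substitute the ground-truth evidence passage directly.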

Related research

08/05/2022 · ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Visual question answering is an important task in both natural language ...

08/11/2023 · Detecting and Preventing Hallucinations in Large Vision Language Models
Instruction tuned Large Vision Language Models (LVLMs) have made signifi...

03/09/2023 · Toward Unsupervised Realistic Visual Question Answering
The problem of realistic VQA (RVQA), where a model has to reject unanswe...

11/16/2015 · Yin and Yang: Balancing and Answering Binary Visual Questions
The complex compositional structure of language makes problems at the in...

06/09/2020 · Roses Are Red, Violets Are Blue... but Should Vqa Expect Them To?
To be reliable on rare events is an important requirement for systems ba...

10/18/2022 · Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ...

08/19/2023 · BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
Vision Language Models (VLMs), which extend Large Language Models (LLM) ...
