ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding

08/05/2022
by   Bingning Wang, et al.

Visual question answering is an important task in both natural language and vision understanding. However, in most public visual question answering datasets, such as VQA and CLEVR, the questions are human-generated and specific to the given image, for example, 'What color are her eyes?'. These crowdsourced questions are relatively simple and sometimes biased toward certain entities or attributes. In this paper, we introduce a new image-based question answering dataset, ChiQA. It contains real-world queries issued by internet users, each paired with several related open-domain images, and the system must determine whether an image can answer the question. Unlike previous VQA datasets, the questions are real-world, image-independent queries that are more diverse and less biased. Compared with previous image-retrieval or image-captioning datasets, ChiQA measures not only relatedness but also answerability, which demands more fine-grained vision and language reasoning. ChiQA contains more than 40K questions and more than 200K question-image pairs. Each pair is assigned a three-level label (2/1/0) indicating a perfect answer, a partial answer, or an irrelevant image. Data analysis shows that ChiQA requires a deep understanding of both language and vision, including grounding, comparison, and reading. We evaluate several state-of-the-art vision-language models, such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
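The three-level answerability labels lend themselves to a simple per-pair evaluation setup. Below is a minimal Python sketch of how ChiQA-style question-image pairs and an accuracy metric over the 2/1/0 labels could be represented; the field names, example questions, and file layout are illustrative assumptions, not the dataset's official schema.

```python
# Minimal sketch of ChiQA-style data and scoring.
# Field names and example data are assumptions for illustration,
# not the official ChiQA format.
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionImagePair:
    question: str      # real-world query issued by an internet user
    image_path: str    # path or URL of a related open-domain image
    label: int         # 2 = perfect answer, 1 = partial answer, 0 = irrelevant


def answerability_accuracy(pairs: List[QuestionImagePair],
                           predictions: List[int]) -> float:
    """Fraction of question-image pairs whose predicted 2/1/0 label is correct."""
    assert len(pairs) == len(predictions)
    if not pairs:
        return 0.0
    correct = sum(1 for p, pred in zip(pairs, predictions) if p.label == pred)
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy example with hypothetical entries.
    data = [
        QuestionImagePair("what does a luna moth look like", "img/luna_moth.jpg", 2),
        QuestionImagePair("what does a luna moth look like", "img/butterfly.jpg", 1),
        QuestionImagePair("what does a luna moth look like", "img/forest.jpg", 0),
    ]
    preds = [2, 0, 0]
    print(f"answerability accuracy: {answerability_accuracy(data, preds):.2f}")
```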


Related research

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (05/31/2019)
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories (06/15/2023)
FashionVQA: A Domain-Specific Visual Question Answering System (08/24/2022)
Yin and Yang: Balancing and Answering Binary Visual Questions (11/16/2015)
RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training (03/01/2023)
GQA: a new dataset for compositional question answering over real-world images (02/25/2019)
HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images (01/23/2023)
