Point and Ask: Incorporating Pointing into Visual Question Answering

11/27/2020
by Arjun Mani et al.

Visual Question Answering (VQA) has become one of the key benchmarks of visual recognition progress. Multiple VQA extensions have been explored to better simulate real-world settings: different question formulations, changing training and test distributions, conversational consistency in dialogues, and explanation-based answering. In this work, we further expand this space by considering visual questions that include a spatial point of reference. Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region. Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges. There are two key distinctions from prior work. First, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the spatial reference. Second, we explicitly explore the more realistic point spatial input rather than the standard but unnatural bounding box input. Through our exploration we uncover and address several visual recognition challenges, including the ability to infer human intent, reason both locally and globally about the image, and effectively combine visual, language, and spatial inputs. Code is available at: https://github.com/princetonvisualai/pointingqa.
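The abstract does not specify how the baseline models combine the three input modalities. As a rough, hypothetical sketch only, the PyTorch module below (the name PointVQABaseline, the dimensions, and the coordinate-MLP encoding of the point are illustrative assumptions, not details taken from the paper) fuses a global image feature, an LSTM encoding of the question, and an encoding of the normalized point coordinates, then classifies over a fixed answer vocabulary.

import torch
import torch.nn as nn

class PointVQABaseline(nn.Module):
    """Hypothetical point-input VQA baseline: fuses image, question, and point."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, num_answers=1000):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Point encoder: normalized (x, y) coordinates mapped to a hidden vector.
        self.point_fc = nn.Sequential(nn.Linear(2, hidden_dim), nn.ReLU())
        # Image projection: pre-extracted global CNN features to the same space.
        self.img_fc = nn.Sequential(nn.Linear(img_feat_dim, hidden_dim), nn.ReLU())
        # Answer classifier over the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens, point_xy):
        # img_feats: (B, img_feat_dim); question_tokens: (B, T); point_xy: (B, 2) in [0, 1]
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                    # (B, hidden_dim) question encoding
        v = self.img_fc(img_feats)     # (B, hidden_dim) image encoding
        p = self.point_fc(point_xy)    # (B, hidden_dim) point encoding
        fused = torch.cat([v, q, p], dim=1)
        return self.classifier(fused)  # answer logits

Other plausible encodings of the point, such as a spatial heatmap applied over convolutional feature maps, would slot into the same place; the repository linked above contains the authors' actual implementations.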


