WebQA: Multihop and Multimodal QA

09/01/2021
by Yingshan Chang, et al.

Web search is fundamentally multimodal and multihop. Often, even before asking a question, we choose to go directly to image search to find our answers. Further, we rarely find an answer in a single source; instead we aggregate information and reason through its implications. Despite the frequency of this everyday occurrence, there is at present no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources, akin to a human's experience. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that (a) our multihop text queries are difficult for a large-scale transformer model, and (b) existing multimodal transformers and visual representations do not perform well on open-domain visual queries. Our challenge to the community is to create a unified multimodal reasoning model that transitions seamlessly and reasons reliably regardless of the source modality.
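To make the task format concrete, here is a minimal sketch of what one WebQA-style instance could look like: a question, a mixed pool of text and image sources, the subset of sources needed to answer, and a long-form answer. The class and field names (Source, WebQAExample, gold_source_ids) are hypothetical illustrations for this sketch, not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Source:
    """One candidate evidence item: a text snippet or an image with its caption."""
    modality: str      # "text" or "image"
    content: str       # the passage text, or a path/URL for image sources
    caption: str = ""  # caption accompanying an image source

@dataclass
class WebQAExample:
    """Hypothetical shape of a single multihop, multimodal QA instance."""
    question: str
    sources: List[Source]       # mixed pool of text and image candidates
    gold_source_ids: List[int]  # indices of the sources needed to answer
    answer: str                 # long-form natural language answer

example = WebQAExample(
    question="Are Tower Bridge and the Golden Gate Bridge painted the same color?",
    sources=[
        Source("image", "tower_bridge.jpg", caption="Tower Bridge, London"),
        Source("image", "golden_gate.jpg", caption="Golden Gate Bridge, San Francisco"),
        Source("text", "Tower Bridge is a combined bascule and suspension bridge in London."),
    ],
    gold_source_ids=[0, 1],  # answering requires aggregating evidence from both images
    answer="No. Tower Bridge is painted blue and white, while the Golden Gate Bridge "
           "is painted International Orange.",
)
```

Under this framing, a single model must both identify the relevant sources in the mixed pool and generate a fluent answer, regardless of whether the needed evidence is textual or visual.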

Related research

- 03/05/2023 · VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
- 11/01/2018 · Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
- 01/09/2023 · MAQA: A Multimodal QA Benchmark for Negation
- 01/31/2023 · UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
- 08/14/2019 · Fusion of Detected Objects in Text for Visual Question Answering
- 04/17/2021 · Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
- 12/08/2020 · Edited Media Understanding: Reasoning About Implications of Manipulated Images
