Focal Visual-Text Attention for Memex Question Answering

01/12/2020
by Junwei Liang, et al.

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photo albums, we have to look at whole collections with sequences of photos. This paper proposes a new multimodal MemexQA task: given a sequence of photos from a user, the goal is to automatically answer questions that help users recover their memory about an event captured in these photos. In addition to a text answer, a few grounding photos are also given to justify the answer. The grounding photos are necessary because they help users quickly verify the answer. Towards solving the task, we 1) present the MemexQA dataset, the first publicly available multimodal question answering dataset consisting of real personal photo albums; and 2) propose an end-to-end trainable network that uses a hierarchical process to dynamically determine which media and which points in time to focus on in the sequential data to answer the question. Experimental results on the MemexQA dataset demonstrate that our model outperforms strong baselines and yields the most relevant grounding photos on this challenging task.
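To make the idea of a hierarchical process over media and time more concrete, the following is a minimal sketch of a generic two-level soft attention. It is not the network proposed in the paper: the dot-product scoring, the function names (`hierarchical_attention`, `top_grounding`), and the toy feature vectors are all assumptions chosen for illustration. The sketch first weighs the modalities inside each time step against the question, then weighs the time steps themselves, and finally picks the highest-weighted steps as candidate grounding photos.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors using the given weights, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def hierarchical_attention(question, sequence):
    """Illustrative two-level attention (an assumption, not the paper's model).

    question: feature vector for the question.
    sequence: sequence[t][m] is the feature vector of modality m
              (for example photo or caption) at time step t.
    """
    step_summaries = []
    modality_weights = []
    for step in sequence:
        # Level 1: which modality matters within this time step.
        weights = softmax([dot(question, vec) for vec in step])
        modality_weights.append(weights)
        step_summaries.append(weighted_sum(weights, step))
    # Level 2: which time step matters across the whole sequence.
    time_weights = softmax([dot(question, s) for s in step_summaries])
    context = weighted_sum(time_weights, step_summaries)
    return context, time_weights, modality_weights


def top_grounding(time_weights, k=1):
    """Indices of the k most attended time steps, as candidate grounding photos."""
    order = sorted(range(len(time_weights)), key=lambda t: time_weights[t], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy data: 3 time steps, 2 modalities each, 2-dimensional features.
    question = [1.0, 0.0]
    sequence = [
        [[0.0, 1.0], [0.0, 0.5]],
        [[2.0, 0.0], [1.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]],
    ]
    context, time_weights, _ = hierarchical_attention(question, sequence)
    print(top_grounding(time_weights, k=1))
```

In a trained model the features and the scoring function would be learned end to end; here they are fixed so that the two attention levels are easy to follow.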


