Episodic Memory Question Answering

05/03/2022
by Samyak Datta, et al.

Egocentric augmented reality devices such as wearable glasses passively capture visual data as a human wearer tours a home environment. We envision a scenario wherein the human communicates with an AI agent powering such a device by asking questions (e.g., "Where did you last see my keys?"). In order to succeed at this task, the egocentric AI assistant must (1) construct semantically rich and efficient scene memories that encode spatio-temporal information about objects seen during the tour and (2) possess the ability to understand the question and ground its answer into the semantic memory representation. To that end, we introduce (1) a new task, Episodic Memory Question Answering (EMQA), wherein an egocentric AI assistant is provided with a video sequence (the tour) and a question as input and is asked to localize its answer to the question within the tour, (2) a dataset of grounded questions designed to probe the agent's spatio-temporal understanding of the tour, and (3) a model for the task that encodes the scene as an allocentric, top-down semantic feature map and grounds the question into the map to localize the answer. We show that our choice of episodic scene memory outperforms naive, off-the-shelf solutions for the task as well as a host of very competitive baselines, and is robust to noise in depth and pose estimates as well as to camera jitter. The project page can be found at https://samyak-268.github.io/emqa.
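
To make the two components above concrete, here is a minimal PyTorch sketch of the general recipe the abstract describes: egocentric frame features are scattered into an allocentric top-down grid (using map-cell indices assumed to be precomputed from depth and camera pose), and a question embedding is compared against every map cell to produce a localization heatmap. All module names, dimensions, and the bag-of-words question encoder are illustrative assumptions, not the paper's actual architecture.

# Minimal EMQA-style sketch. Assumptions: module names, feature sizes,
# the bag-of-words question encoder, and the dot-product grounding head
# are all illustrative; the paper's model differs in detail.
import torch
import torch.nn as nn

class TopDownSemanticMap(nn.Module):
    """Accumulate egocentric frame features into an allocentric top-down map."""

    def __init__(self, feat_dim=128, map_size=250):
        super().__init__()
        self.feat_dim = feat_dim
        self.map_size = map_size

    def forward(self, frame_feats, map_xy):
        # frame_feats: (N, feat_dim) features of N observed points
        # map_xy:      (N, 2) long tensor of top-down cell indices, assumed to
        #              come from back-projecting pixels with depth + camera pose
        #              (the geometric projection itself is omitted here)
        memory = frame_feats.new_zeros(self.map_size, self.map_size, self.feat_dim)
        counts = frame_feats.new_zeros(self.map_size, self.map_size, 1)
        y, x = map_xy[:, 1], map_xy[:, 0]
        memory.index_put_((y, x), frame_feats, accumulate=True)
        counts.index_put_((y, x), frame_feats.new_ones(len(y), 1), accumulate=True)
        return memory / counts.clamp(min=1.0)  # average repeated observations

class QuestionGrounder(nn.Module):
    """Ground a question embedding into the map to score each cell as the answer."""

    def __init__(self, vocab_size=1000, feat_dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, feat_dim)  # bag-of-words encoder
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, memory, question_tokens):
        # memory:          (H, W, feat_dim) allocentric scene memory
        # question_tokens: (1, L) long tensor of token ids
        q = self.proj(self.embed(question_tokens))       # (1, feat_dim)
        logits = torch.einsum("hwc,bc->bhw", memory, q)  # score every map cell
        return torch.sigmoid(logits)                     # (1, H, W) answer heatmap

if __name__ == "__main__":
    mapper, grounder = TopDownSemanticMap(), QuestionGrounder()
    feats = torch.randn(5000, 128)                    # dummy per-point features
    cells = torch.randint(0, 250, (5000, 2))          # dummy precomputed map cells
    tokens = torch.randint(0, 1000, (1, 6))           # dummy tokenized question
    heatmap = grounder(mapper(feats, cells), tokens)  # (1, 250, 250) in [0, 1]

Training such a grounding head against ground-truth answer masks would be one natural way to supervise localization; again, this is a sketch of the general recipe, not the authors' implementation.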

research · 04/25/2019
TVQA+: Spatio-Temporal Grounding for Video Question Answering
We present the task of Spatio-Temporal Video Question Answering, which r...

research · 10/02/2020
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views
We study the task of semantic mapping - specifically, an embodied agent ...

research · 03/26/2022
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) ta...

research · 01/17/2021
HySTER: A Hybrid Spatio-Temporal Event Reasoner
The task of Video Question Answering (VideoQA) consists in answering nat...

research · 11/12/2018
Blindfold Baselines for Embodied QA
We explore blindfold (question-only) baselines for Embodied Question Ans...

research · 08/10/2023
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Audio-Visual Question Answering (AVQA) task aims to answer questions abo...

research · 04/04/2016
Detecting Engagement in Egocentric Video
In a wearable camera video, we see what the camera wearer sees. While th...
