AssistSR: Affordance-centric Question-driven Video Segment Retrieval

11/30/2021
by Stan Weixian Lei, et al.

It is still a pipe dream that AI assistants on phones and AR glasses can help us in daily life by answering questions like "how to adjust the date for this watch?" or "how to set its heating duration?" (while pointing at an oven). The queries used in conventional tasks (e.g. Video Question Answering, Video Retrieval, Moment Localization) are typically factoid and purely textual. In contrast, we present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text query that focuses on the affordance of items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of this AQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. The dataset contains 1.4k multimodal questions on 1k video segments drawn from instructional videos covering a diverse range of daily-used items. To address AQVSR, we develop a straightforward yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods, while still leaving large room for future improvement. Moreover, we present detailed ablation analyses. Our code and data are available at https://github.com/StanLei52/AQVSR.
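The abstract does not detail the DME architecture, but its name suggests the standard dual-encoder retrieval pattern: one encoder embeds the image-box-text query and another embeds each video-transcript segment into a shared space, and segments are ranked by similarity to the query. As a generic illustration only (not the authors' actual model), the NumPy sketch below ranks toy segment embeddings by cosine similarity; `rank_segments` and the fixed 4-dim vectors are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize vectors so a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_segments(query_emb, segment_embs):
    """Rank corpus segments by cosine similarity to the query embedding."""
    q = l2_normalize(query_emb)
    s = l2_normalize(segment_embs)
    scores = s @ q                      # one cosine score per segment
    order = np.argsort(-scores)         # best match first
    return order, scores

# Toy example: a 4-dim joint embedding space with 3 candidate segments.
query = np.array([1.0, 0.0, 1.0, 0.0])
segments = np.array([
    [0.9, 0.1, 1.1, 0.0],   # nearly parallel to the query
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
    [1.0, 0.0, 0.0, 0.0],   # partial overlap
])
order, scores = rank_segments(query, segments)
print(list(order))  # [0, 2, 1]
```

In practice the two encoders are trained jointly (e.g. with a contrastive objective) so that a question and its answer segment land close together in the shared space; at query time, retrieval reduces to this nearest-neighbor ranking over precomputed segment embeddings.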


Related research

03/08/2022
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
A long-standing goal of intelligent assistants such as AR glasses/robots...

01/24/2020
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
We introduce a new multimodal retrieval task - TV show Retrieval (TVR), ...

02/03/2021
QuizCram: A Quiz-Driven Lecture Viewing Interface
QuizCram is an interface for navigating lecture videos that uses quizzes...

08/22/2023
Multi-event Video-Text Retrieval
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of ma...

03/29/2023
Hierarchical Video-Moment Retrieval and Step-Captioning
There is growing interest in searching for information from large video ...

05/13/2020
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA
Videos convey rich information. Dynamic spatio-temporal relationships be...

07/08/2023
Reading Between the Lanes: Text VideoQA on the Road
Text and signs around roads provide crucial information for drivers, vit...
