Learning to Retrieve Videos by Asking Questions

05/11/2022
by   Avinash Madasu, et al.

The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query is ambiguous, which leads to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent over multiple rounds of dialog, refining the retrieved results by answering questions generated by the agent. Our novel multimodal question generator learns to ask questions that maximize the subsequent video retrieval performance. It uses (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, so that the generated questions incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our proposed approach generalizes to real-world settings that involve interactions with real humans, demonstrating the robustness and generality of our framework.
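The interaction loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `overlap_score`, `retrieve`, `interactive_retrieval`, `ask`, and `answer` are hypothetical, a toy word-overlap ranker over captions stands in for the learned video retrieval model, and a scripted question stands in for the learned multimodal question generator. Only the control flow follows the abstract: retrieve, ask a question conditioned on the current candidates and the dialog history, record the answer, and retrieve again.

```python
from collections import Counter
from typing import Callable, Dict, List


def overlap_score(query: str, caption: str) -> int:
    """Toy relevance score: number of word tokens shared by query and caption."""
    q = Counter(query.lower().split())
    c = Counter(caption.lower().split())
    return sum((q & c).values())


def retrieve(query: str, videos: Dict[str, str], k: int = 2) -> List[str]:
    """Return the ids of the k best-scoring videos (ties broken by id)."""
    ranked = sorted(videos, key=lambda vid: (-overlap_score(query, videos[vid]), vid))
    return ranked[:k]


def interactive_retrieval(
    initial_query: str,
    videos: Dict[str, str],
    ask: Callable[[List[str], List[str]], str],
    answer: Callable[[str], str],
    rounds: int = 2,
    k: int = 2,
) -> List[str]:
    """Refine retrieval over several dialog rounds (assumed control flow)."""
    history = [initial_query]
    candidates = retrieve(" ".join(history), videos, k)
    for _ in range(rounds):
        # The question generator sees the current candidates and the dialog history.
        question = ask(candidates, history)
        history.append(question)
        history.append(answer(question))
        # Re-rank using the whole dialog history as the query.
        candidates = retrieve(" ".join(history), videos, k)
    return candidates


# Hypothetical toy data: captions stand in for video content.
videos = {
    "v1": "a man cooks pasta in a kitchen",
    "v2": "a man repairs a bike in a garage",
    "v3": "a woman walks a dog in a park",
}


def ask(candidates: List[str], history: List[str]) -> str:
    # Scripted stand-in for a learned question generator.
    return "where is the person"


def answer(question: str) -> str:
    # Scripted stand-in for the user's reply.
    return "in a garage"


print(retrieve("a man", videos, k=1))
print(interactive_retrieval("a man", videos, ask, answer, rounds=1, k=1))
```

In this toy setting the ambiguous query "a man" matches two captions equally well, and a single answered question is enough to move the intended video to the top of the ranking.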

research
08/21/2023

Simple Baselines for Interactive Video Retrieval with Questions and Answers

To date, the majority of video retrieval systems have been optimized for...
research
05/07/2019

Interactive Video Retrieval with Dialog

Now that everyone can easily record videos, the quantity of which is con...
research
08/18/2020

Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

With the arising concerns for the AI systems provided with direct access...
research
02/02/2022

The slurk Interaction Server Framework: Better Data for Better Dialog Models

This paper presents the slurk software, a lightweight interaction server...
research
03/02/2021

Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query

Text-based image retrieval has seen considerable progress in recent year...
research
06/26/2021

Saying the Unseen: Video Descriptions via Dialog Agents

Current vision and language tasks usually take complete visual data (e.g...
research
09/23/2019

Improving Generative Visual Dialog by Answering Diverse Questions

Prior work on training generative Visual Dialog models with reinforcemen...
