Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval

06/05/2023
by Minjoon Jung, et al.

Video moment retrieval (VMR) aims to identify the specific moment in an untrimmed video that corresponds to a given natural language query. However, the task is prone to a weak visual-textual alignment problem caused by query ambiguity, which can limit further performance gains and generalization capability. Because of the complex multimodal interactions in videos, a query may not fully cover the relevant details of the corresponding moment, and the moment may contain misaligned or irrelevant frames. To tackle this problem, we propose a straightforward yet effective model called Background-aware Moment DEtection TRansformer (BM-DETR). In addition to a target query and its moment, BM-DETR takes negative queries that correspond to different moments. Specifically, our model learns to predict the target moment from the joint probability of the given query and the complement of the negative queries for each candidate frame. In this way, it leverages the surrounding background to weigh the relative importance of frames, improving moment sensitivity. Extensive experiments on Charades-STA and QVHighlights demonstrate the effectiveness of our model. Moreover, we show that BM-DETR performs robustly in three challenging VMR scenarios, including several out-of-distribution test cases, demonstrating superior generalization ability.
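The per-frame scoring idea in the abstract (the joint probability of the target query and the complement of the negative queries) can be illustrated with a minimal sketch. It assumes the model already outputs, for every candidate frame, a probability that the frame matches each query. It also assumes the queries are independent, so the joint probability factorizes into a product. The function name `background_aware_scores` and the list-of-lists input layout are hypothetical and do not come from BM-DETR's code.

```python
import math
from typing import List, Sequence


def background_aware_scores(
    target_probs: Sequence[float],
    negative_probs: Sequence[Sequence[float]],
) -> List[float]:
    """Hypothetical helper: score each frame as P(target query matches) times
    the product over negative queries of P(negative query does not match).
    Assumes independence across queries (an illustrative simplification)."""
    num_frames = len(target_probs)
    for neg in negative_probs:
        if len(neg) != num_frames:
            raise ValueError("every query must have one probability per frame")
    return [
        # Complement of each negative query at frame t: 1 - p_neg[t].
        # math.prod of an empty iterable is 1, so no negatives leaves p unchanged.
        p * math.prod(1.0 - neg[t] for neg in negative_probs)
        for t, p in enumerate(target_probs)
    ]


# Toy example: frame 2 matches the target query well but also matches a
# negative query, so its combined score is suppressed.
target = [0.1, 0.8, 0.9, 0.7, 0.2]
negatives = [[0.1, 0.1, 0.6, 0.1, 0.1]]
print(background_aware_scores(target, negatives))
```

With one negative query, frame 2 drops from the highest target probability (0.9) to a combined score of about 0.36, below frames 1 and 3. This matches the intuition that considering background queries sharpens moment sensitivity. The paper's actual formulation, including how these probabilities are produced and normalized, may differ from this sketch.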

