Knowing Where to Focus: Event-aware Transformer for Video Grounding

08/14/2023
by Jinhyun Jang, et al.

Recent DETR-based video grounding models learn moment queries to predict moment timestamps directly, without hand-crafted components such as pre-defined proposals or non-maximum suppression. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information. In this paper, we formulate event-aware dynamic moment queries that enable the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, which outperform state-of-the-art approaches on several video grounding benchmarks.
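
The abstract names two concrete mechanisms: slot attention for discovering event units and a gated fusion transformer layer for injecting the sentence into the moment queries. To make the two-level design concrete, here is a minimal PyTorch sketch of how these pieces could fit together, assuming standard slot attention in the style of Locatello et al. (2020) and a simple sigmoid-gated cross-attention fusion. The class names, shapes, iteration count, and gating design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Event reasoning sketch: iteratively groups frame features into a
    fixed number of event slots (assumed Locatello-style slot attention)."""
    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))  # learned init (assumption)
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # frames: (B, T, D)
        B = frames.size(0)
        x = self.norm_in(frames)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.einsum('bsd,btd->bst', q, k) * self.scale
            attn = attn.softmax(dim=1)  # softmax over slots: slots compete for each frame
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bst,btd->bsd', attn, v)  # per-slot weighted mean
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).view(B, self.num_slots, -1)
        return slots  # (B, num_slots, D) event-level slots

class GatedFusion(nn.Module):
    """Moment reasoning sketch: cross-attends queries to the sentence and
    mixes the result back in through a learned sigmoid gate (assumption)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, queries, sentence):  # queries: (B, S, D), sentence: (B, L, D)
        attended, _ = self.attn(queries, sentence, sentence)
        g = torch.sigmoid(self.gate(torch.cat([queries, attended], dim=-1)))
        return g * attended + (1 - g) * queries  # gated video-sentence queries

# Toy usage with random features (shapes are illustrative).
B, T, L, D = 2, 64, 12, 256
video, sentence = torch.randn(B, T, D), torch.randn(B, L, D)
events = SlotAttention(num_slots=10, dim=D)(video)   # event reasoning
moment_queries = GatedFusion(D)(events, sentence)    # (B, 10, D) dynamic moment queries
```

In the actual model, the resulting moment queries would then interact with video-sentence representations in a DETR-style decoder to regress moment timestamps; that stage is omitted from this sketch.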

Related research

09/23/2021 · End-to-End Dense Video Grounding via Parallel Regression
Video grounding aims to localize the corresponding video moment in an un...

10/12/2021 · Relation-aware Video Reading Comprehension for Temporal Language Grounding
Temporal language grounding in videos aims to localize the temporal span...

11/19/2020 · VLG-Net: Video-Language Graph Matching Network for Video Grounding
Grounding language queries in videos aims at identifying the time interv...

06/05/2023 · Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval
Video moment retrieval (VMR) aims to identify the specific moment in an ...

08/11/2019 · Exploiting Temporal Relationships in Video Moment Localization with Natural Language
We address the problem of video moment localization with natural languag...

03/10/2022 · A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a nat...

11/23/2017 · Self-view Grounding Given a Narrated 360° Video
Narrated 360 videos are typically provided in many touring scenarios to ...
