Learning to Combine the Modalities of Language and Video for Temporal Moment Localization

09/07/2021
by Jungkyoo Shin, et al.

Temporal moment localization (TML) aims to retrieve the video segment that best matches a moment specified by a query. Existing methods generate visual and semantic embeddings independently and fuse them without fully considering the long-term temporal relationship between them. To address this shortcoming, we introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), which mimics the human cognitive process of localizing temporal moments: it focuses on the part of a video segment related to a part of the query and recurrently accumulates contextual information across the entire video. In addition, we devise a two-stream attention mechanism that uses both the video features attended by the input query and the unattended ones, so that necessary visual information is not neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal interaction network (TACI) that generates two 2D proposal maps, one obtained globally from the integrated contextual features produced by CM-LSTM and one obtained locally from boundary score sequences, and then combines them into a final 2D map in an end-to-end manner. On the TML benchmark dataset ActivityNet-Captions, TACI outperforms state-of-the-art TML methods with an R@1 of 45.50. We also show that state-of-the-art methods revised by replacing the original LSTM with our CM-LSTM achieve performance gains.
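
As a rough illustration of the final step described above, the sketch below builds a local 2D proposal map from boundary score sequences, fuses it with a global map, and picks the best segment. It is not the authors' implementation: the function names, the toy scores, and the element-wise product used for fusion are all assumptions made for this example, and the abstract does not specify how the two maps are actually combined.

```python
# Hypothetical sketch; names, toy values, and the fusion rule are assumptions.

def local_map_from_boundaries(start_scores, end_scores):
    """Build a 2D map whose entry [i][j] scores the segment from clip i to clip j.

    Entries with j < i are not valid segments and are set to 0.0.
    """
    n = len(start_scores)
    return [
        [start_scores[i] * end_scores[j] if j >= i else 0.0 for j in range(n)]
        for i in range(n)
    ]


def combine_maps(global_map, local_map):
    """Fuse two maps element-wise (a product here; the real fusion is assumed)."""
    n = len(global_map)
    return [[global_map[i][j] * local_map[i][j] for j in range(n)] for i in range(n)]


def best_segment(score_map):
    """Return (start, end) clip indices of the highest-scoring valid segment."""
    n = len(score_map)
    candidates = [(score_map[i][j], i, j) for i in range(n) for j in range(i, n)]
    _, start, end = max(candidates)
    return start, end


# Toy inputs for a video split into 4 clips.
start_scores = [0.1, 0.8, 0.2, 0.1]  # how likely each clip starts the moment
end_scores = [0.1, 0.2, 0.9, 0.3]    # how likely each clip ends the moment
global_map = [
    [0.2, 0.3, 0.4, 0.3],
    [0.0, 0.3, 0.7, 0.5],
    [0.0, 0.0, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.2],
]

local_map = local_map_from_boundaries(start_scores, end_scores)
fused = combine_maps(global_map, local_map)
print(best_segment(fused))  # (1, 2)
```

In this toy case both maps agree on the segment from clip 1 to clip 2, so the fused map selects it. A product penalizes segments that only one map supports; a weighted sum would be an equally plausible choice under the same assumptions.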
