Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

08/04/2020
by   Daizong Liu, et al.

Query-based moment localization is a new task that localizes the best-matched segment in an untrimmed video according to a given sentence query. This task demands thorough mining of both visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts the task as iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, CMG highlights relevant instances across the video and sentence, and SMG then models the pairwise relations inside each modality to correlate frames (words). By stacking multiple layers of this joint graph, CSMGAN effectively captures high-order interactions between the two modalities, enabling more precise localization. In addition, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder that enhances query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms the state of the art.
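The joint-graph layer described above can be illustrated with a minimal, parameter-free sketch: cross-modal attention first passes messages between frame and word nodes (CMG), then self-modal attention relates nodes within each modality (SMG), and layers are stacked for higher-order interactions. This is an assumption-laden simplification, not the authors' implementation: it omits the learned projections, gating, and the hierarchical sentence encoder of the actual CSMGAN.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_messages(queries, keys, values):
    # Scaled dot-product attention: each query node aggregates
    # messages from all key/value nodes.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def joint_graph_layer(frames, words):
    # CMG: frames attend to words and words attend to frames,
    # highlighting relevant instances across modalities.
    frames_cm = attention_messages(frames, words, words)
    words_cm = attention_messages(words, frames, frames)
    # SMG: pairwise relations inside each modality (self-attention).
    frames_sm = attention_messages(frames_cm, frames_cm, frames_cm)
    words_sm = attention_messages(words_cm, words_cm, words_cm)
    return frames_sm, words_sm

# Toy example: 8 frame nodes and 5 word nodes with 16-d features.
rng = np.random.default_rng(0)
frames, words = rng.normal(size=(8, 16)), rng.normal(size=(5, 16))
for _ in range(2):  # stacked layers capture high-order interactions
    frames, words = joint_graph_layer(frames, words)
print(frames.shape, words.shape)
```

In the paper the attention weights are learned (parametric message passing); here plain dot-product attention stands in to show the CMG-then-SMG ordering of one layer.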

