Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

06/06/2019
by Zhu Zhang, et al.

Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task, including (1) the syntactic structure of natural language queries; (2) long-range semantic dependencies in the video context; and (3) sufficient cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, propose a multi-head self-attention mechanism to capture long-range semantic dependencies from the video context, and then employ a multi-stage cross-modal interaction to explore the potential relations between video and query contents. Extensive experiments demonstrate the effectiveness of our proposed method.
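
To make the second component more concrete, the following is a minimal sketch of generic multi-head self-attention over a sequence of frame features, written in plain Python. It is not the CMIN implementation: the function names, the random projection weights, the dimensions, and the omission of the usual output projection are all assumptions made for illustration. It only shows how every frame can attend to every other frame, which is what allows long-range dependencies to be captured.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention_head(frames, w_q, w_k, w_v):
    """One scaled dot-product attention head over all frames."""
    q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)
    scale = math.sqrt(len(q[0]))
    out = []
    for q_i in q:
        # Each frame scores every frame in the video, near or far.
        weights = softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        out.append([sum(w * v_j[d] for w, v_j in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def multi_head_self_attention(frames, heads):
    """Run each head independently and concatenate the per-frame outputs.

    Assumption: the final output projection of standard multi-head
    attention is left out to keep the sketch short.
    """
    per_head = [self_attention_head(frames, *h) for h in heads]
    return [sum((head[t] for head in per_head), []) for t in range(len(frames))]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_frames, d_model, num_heads, d_head = 6, 8, 2, 4  # illustrative sizes
frames = random_matrix(num_frames, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
contextual = multi_head_self_attention(frames, heads)
```

Here `contextual` holds one vector per frame, each of width `num_heads * d_head`. In a trained model the projection matrices would be learned rather than sampled at random.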

research
11/19/2020

VLG-Net: Video-Language Graph Matching Network for Video Grounding

Grounding language queries in videos aims at identifying the time interv...
research
06/18/2020

Language Guided Networks for Cross-modal Moment Retrieval

We address the challenging task of cross-modal moment retrieval, which a...
research
09/04/2020

Video Moment Retrieval via Natural Language Queries

In this paper, we propose a novel method for video moment retrieval (VMR...
research
07/01/2022

ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022

In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D...
research
09/21/2021

CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

This paper tackles a recently proposed Video Corpus Moment Retrieval tas...
research
07/06/2020

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

The rapid growth of user-generated videos on the Internet has intensifie...
research
10/12/2021

Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

This paper focuses on tackling the problem of temporal language localiza...
