You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

05/25/2022
by   Xin Sun, et al.

Moment retrieval in videos is a challenging task that aims to retrieve the most relevant video moment from an untrimmed video given a sentence description. Previous methods tend to perform self-modal learning and cross-modal interaction in a coarse manner, neglecting fine-grained clues contained in the video content, the query context, and their alignment. To this end, we propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at multiple granularities. Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework. A coarse-grained feature encoder and a co-attention mechanism first obtain a preliminary perception of intra-modality and inter-modality information. A fine-grained feature encoder and a conditioned interaction module then refine this initial perception, mirroring how humans revisit a passage when answering comprehension questions. Moreover, to alleviate the heavy computational burden of some existing methods, we design an efficient choice comparison module and reduce the hidden size with negligible quality loss. Extensive experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets demonstrate that our solution outperforms existing state-of-the-art methods.
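The abstract describes a co-attention mechanism that lets video and query features attend to each other for a preliminary cross-modal perception. A minimal, hypothetical sketch of such bidirectional co-attention is shown below; this is not the paper's actual MGPN implementation, and the shapes, scaling factor, and function names are assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(video, query):
    """Bidirectional co-attention between modalities (illustrative sketch).

    video: (num_clips, d) clip features; query: (num_words, d) word features.
    Returns query-attended video features and video-attended query features.
    """
    # Scaled similarity between every clip and every word: (num_clips, num_words)
    sim = video @ query.T / np.sqrt(video.shape[1])
    # Each clip aggregates the words it attends to (row-wise softmax)
    v2q = softmax(sim, axis=1) @ query
    # Each word aggregates the clips it attends to (column-wise softmax)
    q2v = softmax(sim, axis=0).T @ video
    return v2q, q2v

rng = np.random.default_rng(0)
v, q = rng.normal(size=(8, 16)), rng.normal(size=(5, 16))
att_v, att_q = co_attention(v, q)
print(att_v.shape, att_q.shape)  # (8, 16) (5, 16)
```

The attended outputs keep each modality's sequence length while mixing in information from the other modality, which is what allows a later fine-grained stage to build on this coarse alignment.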


