A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

11/18/2020
by Bowen Zhang, et al.

Identifying a short segment in a long video that semantically matches a text query is a challenging task with important potential applications in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but localizing undefined segments in untrimmed and unsegmented videos is difficult because exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments to evaluate our model on moment localization in video corpus on the ActivityNet Captions and TVR datasets. Our approach outperforms previous methods as well as strong baselines, establishing a new state of the art for this task.
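To make the intractability argument concrete, the following is a minimal sketch, not the HAMMER implementation. It counts the contiguous candidate segments in a video of a given length and shows how grouping frames into fixed-length clips yields a much smaller coarse level. The function names, the frame count, and the clip length of 32 are assumptions chosen only for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def count_candidate_segments(num_frames: int) -> int:
    """Count contiguous segments [start, end] with start <= end.

    A video with n frames has n * (n + 1) / 2 such segments, so the
    search space grows quadratically with video length.
    """
    return num_frames * (num_frames + 1) // 2


def split_into_clips(frames: Sequence[int], clip_len: int) -> List[List[int]]:
    """Group frame indices into consecutive fixed-length clips.

    The last clip may be shorter than clip_len. The clip length is a
    hypothetical parameter, not a value taken from the paper.
    """
    return [list(frames[i:i + clip_len]) for i in range(0, len(frames), clip_len)]


if __name__ == "__main__":
    num_frames = 3600  # illustrative length, e.g. one frame per second for an hour
    print("candidate segments:", count_candidate_segments(num_frames))
    clips = split_into_clips(range(num_frames), clip_len=32)
    print("coarse-level clips:", len(clips))
```

The quadratic count applies per video, so the cost multiplies again across a corpus; a coarse clip level followed by a fine frame level is one way to narrow the search before scoring exact boundaries.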


research
08/06/2020

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Temporal language localization in videos aims to ground one video segmen...
research
09/21/2021

CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

This paper tackles a recently proposed Video Corpus Moment Retrieval tas...
research
05/25/2022

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

Moment retrieval in videos is a challenging task that aims to retrieve t...
research
10/11/2021

ViSeRet: A simple yet effective approach to moment retrieval via fine-grained video segmentation

Video-text retrieval has many real-world applications such as media anal...
research
01/18/2023

Temporal Perceiving Video-Language Pre-training

Video-Language Pre-training models have recently significantly improved ...
research
10/31/2021

Hierarchical Deep Residual Reasoning for Temporal Moment Localization

Temporal Moment Localization (TML) in untrimmed videos is a challenging ...
research
03/15/2021

Boundary Proposal Network for Two-Stage Natural Language Video Localization

We aim to address the problem of Natural Language Video Localization (NL...
