Multi-scale 2D Representation Learning for weakly-supervised moment retrieval

11/04/2021
by   Ding Li, et al.
Video moment retrieval aims to find the moment in a video that is most relevant to a given language query. However, most existing methods require temporal boundary annotations, which are expensive and time-consuming to label. Weakly supervised methods, which use only coarse video-level labels, have therefore been proposed recently. Despite their effectiveness, these methods usually process moment candidates independently and ignore a critical issue: the natural temporal dependencies between candidates at different temporal scales. To address this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates. The two dimensions of this map indicate the start and end time points of the candidates. Then, we select the top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. Finally, the similarity between captions and the language query serves as supervision for further training the candidate selector. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, demonstrate that our approach outperforms state-of-the-art methods.
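
To make the map construction above concrete, here is a minimal sketch of a start/end candidate map built at several temporal scales, followed by top-K selection. It is an illustration under stated assumptions, not the authors' implementation: mean pooling over clips, the stride-based downsampling used to obtain coarser scales, and the hand-written `score_fn` (a stand-in for the learnable convolutional network) are all hypothetical choices, as are the function names.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length feature vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def downsample(clips: List[Vector], stride: int) -> List[Vector]:
    """Assumed way to get a coarser temporal scale: pool every `stride` clips."""
    return [mean_pool(clips[i:i + stride]) for i in range(0, len(clips), stride)]


def build_2d_map(clips: List[Vector]) -> List[List[Optional[Vector]]]:
    """Entry [start][end] holds the pooled feature of clips start..end.

    Cells with end < start are not valid moments and stay None.
    """
    n = len(clips)
    grid: List[List[Optional[Vector]]] = [[None] * n for _ in range(n)]
    for start in range(n):
        for end in range(start, n):
            grid[start][end] = mean_pool(clips[start:end + 1])
    return grid


def top_k_candidates(
    grid: List[List[Optional[Vector]]],
    score_fn: Callable[[Vector], float],
    k: int,
) -> List[Tuple[int, int]]:
    """Return the (start, end) indices of the k highest-scoring valid cells."""
    scored = [
        (score_fn(feature), start, end)
        for start, row in enumerate(grid)
        for end, feature in enumerate(row)
        if feature is not None
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(start, end) for _, start, end in scored[:k]]


# Toy input: four clips with one-dimensional features (hypothetical values).
clips = [[0.0], [1.0], [4.0], [1.0]]

# One 2D map per temporal scale; strides 1 and 2 are arbitrary example scales.
maps = {stride: build_2d_map(downsample(clips, stride)) for stride in (1, 2)}

# Hypothetical scorer: use the single feature value as the candidate score.
selected = {
    stride: top_k_candidates(grid, lambda feature: feature[0], k=1)
    for stride, grid in maps.items()
}
```

In this sketch each scale yields its own map, so a candidate's neighbours in the grid are moments with nearby start or end points. That adjacency is what a convolution over the map could exploit, which a scorer applied to each cell independently, as here, does not.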

research
11/19/2019

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

Video moment retrieval is to search the moment that is most relevant to ...
research
04/05/2019

Weakly Supervised Video Moment Retrieval From Text Queries

There have been a few recent methods proposed in text to video moment re...
research
04/20/2022

Video Moment Retrieval from Text Queries via Single Frame Annotation

Video moment retrieval aims at finding the start and end timestamps of a...
research
08/24/2020

VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

Video Moment Retrieval (VMR) is a task to localize the temporal moment i...
research
12/04/2020

Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language

We address the problem of retrieving a specific moment from an untrimmed...
research
09/27/2019

wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval

Given a video and a sentence, the goal of weakly-supervised video moment...
research
11/30/2018

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

This research strives for natural language moment retrieval in long, unt...
