VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

08/24/2020
by Minuk Ma, et al.

Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video that is specified by a natural language query. Several fully supervised methods have been proposed for VMR. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is labor-intensive. This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training uses no temporal moment labels, only the text query that describes a segment of the video. Existing wVMR methods generate multi-scale proposals and apply query-guided attention mechanisms to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which encourages higher scores for correct video-query pairs than for incorrect ones. It has been observed that a large number of candidate proposals, coarse query representations, and one-way attention mechanisms lead to blurry attention maps that limit localization performance. To address this, the Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning spurious candidate proposals and applying a multi-directional attention mechanism with fine-grained query representation. The Surrogate Proposal Selection module selects a proposal based on its proximity to the query in the joint embedding space, substantially reducing the number of candidate proposals, which lowers the computation load and sharpens attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flow to learn the multi-modal alignment. VLANet is trained end-to-end with a contrastive loss that pulls semantically similar videos and queries close together in the embedding space. Experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets.
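The two ideas in the abstract, selecting the surrogate proposal nearest to the query in the joint embedding space and training with a margin-based contrastive objective over video-query pair scores, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cosine-similarity scoring, and the margin value are assumptions for the sketch.

```python
import numpy as np

def cosine_sim(a, b):
    # Proximity measure in the joint embedding space (assumed cosine).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_surrogate(proposals, query):
    # Surrogate Proposal Selection (sketch): keep only the candidate
    # proposal embedding closest to the query embedding.
    sims = [cosine_sim(p, query) for p in proposals]
    return int(np.argmax(sims))

def contrastive_loss(pos_score, neg_scores, margin=0.5):
    # Hinge-style contrastive objective (sketch): the correct video-query
    # pair should outscore every incorrect pair by at least the margin.
    return sum(max(0.0, margin - pos_score + n) for n in neg_scores)
```

For example, a well-separated positive pair (score 0.9 against negatives 0.2 and 0.1 with margin 0.5) incurs zero loss, while a positive score of 0.3 against a negative of 0.2 is penalized.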


Related research

11/19/2019
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
Video moment retrieval is to search the moment that is most relevant to ...

08/10/2023
Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization
Video moment localization aims to retrieve the target segment of an untr...

09/07/2021
Learning to Combine the Modalities of Language and Video for Temporal Moment Localization
Temporal moment localization aims to retrieve the best video segment mat...

09/27/2019
wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval
Given a video and a sentence, the goal of weakly-supervised video moment...

08/19/2020
Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos
Video moment retrieval aims to localize the target moment in a video ac...

11/04/2021
Multi-scale 2D Representation Learning for weakly-supervised moment retrieval
Video moment retrieval aims to search the moment most relevant to a give...

11/16/2022
An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022
This technical report describes the CONE approach for Ego4D Natural Lang...
