Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos

03/16/2020
by Yijun Song, et al.

Temporally grounding a textual query in a video means localizing the one video segment that semantically corresponds to the query. Most existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work we present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during training. The proposed method leverages the idea of attentional reconstruction and directly scores the candidate segments with the learnt proposal-level attentions. Moreover, a second branch that learns clip-level attention is used to refine the proposals at both the training and testing stages. We develop a novel proposal sampling mechanism that leverages intra-proposal information to learn better proposal representations, and we adopt 2D convolution to exploit inter-proposal clues for learning a reliable attention map. Experiments on the Charades-STA and ActivityNet-Captions datasets show that MARN outperforms existing weakly-supervised methods.
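To make the idea of scoring candidate segments with a proposal-level attention distribution more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the mean pooling of clip features, and the dot-product scoring against a query vector are all assumptions chosen for illustration. The learned attention, the reconstruction objective, the proposal sampling mechanism, the 2D convolution, and the clip-level refinement branch described in the abstract are all omitted.

```python
import math

def build_proposals(num_clips):
    """Enumerate every candidate segment as an inclusive (start, end) clip range."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]

def mean_pool(clip_feats, start, end):
    """Average the clip features inside one proposal (an assumed pooling choice)."""
    span = clip_feats[start:end + 1]
    dim = len(span[0])
    return [sum(f[d] for f in span) / len(span) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def score_proposals(clip_feats, query_vec):
    """Return all proposals and an attention weight for each one.

    The dot product with the query stands in for a learned attention module.
    """
    proposals = build_proposals(len(clip_feats))
    logits = [
        sum(p * q for p, q in zip(mean_pool(clip_feats, s, e), query_vec))
        for s, e in proposals
    ]
    return proposals, softmax(logits)

def best_segment(clip_feats, query_vec):
    """Pick the proposal with the highest attention weight."""
    proposals, attn = score_proposals(clip_feats, query_vec)
    top = max(range(len(attn)), key=attn.__getitem__)
    return proposals[top]

# Toy input: three clips with 2-d features; only the middle clip matches the query.
clips = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]
segment = best_segment(clips, query)
```

In this toy setup the attention weights form a distribution over all start/end pairs, so the highest-weighted pair can be read off directly as the predicted segment. In a weakly-supervised setting such weights would have to be trained indirectly, for example through a reconstruction objective as the abstract describes, since no segment labels are available.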

research
01/25/2020

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

In this paper, we study the problem of weakly-supervised temporal ground...
research
06/21/2020

Weak Supervision and Referring Attention for Temporal-Textual Association Learning

A system capturing the association between video frames and textual quer...
research
10/22/2022

Weakly-Supervised Temporal Article Grounding

Given a long untrimmed video and natural language queries, video groundi...
research
08/19/2020

Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos

Video moment retrieval aims to localize the target moment in a video ac...
research
05/29/2023

Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization

Weakly-supervised temporal action localization aims to localize and reco...
research
06/06/2019

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

In this paper, we address a novel task, namely weakly-supervised spatio-...
research
01/14/2022

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Temporal video grounding (TVG) aims to localize a target segment in a vi...
