Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

09/18/2020
by Jie Wu, et al.

Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task that facilitates cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which has access only to coarse video-level language descriptions without temporal boundary annotations. This setting is closer to reality, since such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that uses reinforcement learning (RL) to guide the progressive refinement of the temporal boundary. To the best of our knowledge, this is the first attempt to extend RL to the temporal localization task with weak supervision. Because a straightforward reward function is hard to obtain without fine-grained pairwise boundary-query annotations, we design a cross-modal alignment evaluator that measures the alignment degree of each segment-query pair and provides tailored rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and yields more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly supervised method and even beats some competitive fully supervised ones.
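
The general idea of refining a boundary step by step, with a reward taken from the change in an alignment score, can be illustrated with a toy sketch. This is not the BAR implementation: the action set, the step size, the greedy choice of action (standing in for a learned policy) and every name below (`candidate_moves`, `refine`, `toy_alignment`) are assumptions made for illustration. The toy scorer uses a hidden target segment, which would not be available under weak supervision; it only stands in for a learned cross-modal alignment evaluator.

```python
from typing import Callable, List, Tuple

# A segment is (start, end), expressed as fractions of the video length.
Segment = Tuple[float, float]


def candidate_moves(seg: Segment, step: float = 0.05) -> List[Segment]:
    """Hypothetical action set: shift left/right, expand, shrink."""
    s, e = seg
    moves = [
        (s - step, e - step),  # shift left
        (s + step, e + step),  # shift right
        (s - step, e + step),  # expand both ends
        (s + step, e - step),  # shrink both ends
    ]
    # Keep only non-empty segments that stay inside the video.
    return [(a, b) for a, b in moves if 0.0 <= a < b <= 1.0]


def refine(seg: Segment,
           score: Callable[[Segment], float],
           max_steps: int = 20) -> Segment:
    """Greedy stand-in for a learned policy: take the move with the
    highest alignment score, and stop once the reward (the change in
    score) is no longer positive."""
    current = score(seg)
    for _ in range(max_steps):
        moves = candidate_moves(seg)
        if not moves:
            break
        best = max(moves, key=score)
        reward = score(best) - current
        if reward <= 0:
            break
        seg, current = best, score(best)
    return seg


def toy_alignment(target: Segment) -> Callable[[Segment], float]:
    """Toy evaluator (assumption): temporal IoU with a hidden target."""
    def score(seg: Segment) -> float:
        inter = max(0.0, min(seg[1], target[1]) - max(seg[0], target[0]))
        union = max(seg[1], target[1]) - min(seg[0], target[0])
        return inter / union if union > 0 else 0.0
    return score


score = toy_alignment((0.3, 0.6))
start = (0.0, 1.0)            # begin with the whole video
refined = refine(start, score)
```

Starting from the whole video, each step either tightens or shifts the segment, so no fixed set of sliding windows is enumerated. In a real system the scorer would be a trained model over video and query features, and the policy would be learned rather than greedy.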

research
10/21/2022

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Temporal language grounding (TLG) aims to localize a video segment in an...
research
06/30/2021

Weakly Supervised Temporal Adjacent Network for Language Grounding

Temporal language grounding (TLG) is a fundamental and challenging probl...
research
01/21/2019

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

The task of video grounding, which temporally localizes a natural langua...
research
01/18/2020

Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video

Temporally language grounding in untrimmed videos is a newly-raised task...
research
02/20/2023

Constraint and Union for Partially-Supervised Temporal Sentence Grounding

Temporal sentence grounding aims to detect the event timestamps describe...
research
06/21/2020

Weak Supervision and Referring Attention for Temporal-Textual Association Learning

A system capturing the association between video frames and textual quer...
research
08/10/2023

Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization

Video moment localization aims to retrieve the target segment of an untr...
