Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

01/21/2019
by Dongliang He, et al.

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in video understanding. Existing studies slide a window over the entire video or exhaustively rank all possible clip-sentence pairs in a pre-segmented video, and thus inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate the task as sequential decision making: an agent learns a policy that progressively adjusts the temporal grounding boundaries. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains when additional supervised boundary information is considered during training. Our framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
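The sequential-decision formulation in the abstract can be sketched as an episode in which an agent repeatedly observes its current temporal window and either moves, resizes, or stops. The action set, step size, initial window, and IoU-gain reward below are illustrative assumptions, not the authors' exact design; the paper's multi-task variant additionally supervises the boundaries directly, which this minimal sketch omits.

```python
# Hypothetical sketch of grounding as sequential decision making.
# The agent adjusts a (start, end) window for at most `max_steps`
# steps (the paper reports observing 10 or fewer clips per video).

ACTIONS = ["shift_left", "shift_right", "expand", "shrink", "stop"]

def iou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def step(window, action, delta, duration):
    """Apply one boundary-adjustment action, clamped to the video."""
    s, e = window
    if action == "shift_left":
        s, e = s - delta, e - delta
    elif action == "shift_right":
        s, e = s + delta, e + delta
    elif action == "expand":
        s, e = s - delta, e + delta
    elif action == "shrink":
        s, e = s + delta, e - delta
    s = max(0.0, s)
    e = min(duration, e)
    if e - s < delta:  # keep the window non-degenerate
        e = min(duration, s + delta)
    return (s, e)

def ground(policy, gt, duration, max_steps=10, delta_frac=0.1):
    """Run one episode; reward each step is the IoU gain w.r.t. gt."""
    window = (0.25 * duration, 0.75 * duration)  # assumed initial guess
    delta = delta_frac * duration
    trajectory = []
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        new_window = step(window, action, delta, duration)
        reward = iou(new_window, gt) - iou(window, gt)
        trajectory.append((window, action, reward))
        window = new_window
    return window, trajectory
```

In the full method the policy would be a learned network conditioned on clip and sentence features; here any callable mapping a window to an action (e.g., a greedy IoU-improving policy for testing) can drive the loop.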

Related research

- 01/18/2020: Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video. Temporally language grounding in untrimmed videos is a newly-raised task...
- 09/18/2020: Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos. Temporal grounding of natural language in untrimmed videos is a fundamen...
- 07/09/2018: Video Summarisation by Classification with Deep Reinforcement Learning. Most existing video summarisation methods are based on either supervised...
- 09/11/2019: Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction. The task of temporally grounding language queries in videos is to tempor...
- 09/22/2022: CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding. Video temporal grounding (VTG) targets to localize temporal moments in a...
- 03/15/2023: Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos. Video temporal grounding aims to pinpoint a video segment that matches t...
- 02/26/2023: Localizing Moments in Long Video Via Multimodal Guidance. The recent introduction of the large-scale long-form MAD dataset for lan...
