Video Captioning via Hierarchical Reinforcement Learning

11/29/2017
by   Xin Wang, et al.
0

Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and a low-level Worker module recognizes the primitive actions to fulfill the sub-goal. With this compositional framework to reinforce video captioning at different levels, our approach significantly outperforms all the baseline methods on a newly introduced large-scale dataset for fine-grained video captioning. Furthermore, our non-ensemble model has already achieved the state-of-the-art results on the widely-used MSR-VTT dataset.

READ FULL TEXT

page 1

page 8

research
04/24/2018

Fine-grained Video Classification and Captioning

We describe a DNN for fine-grained action classification and video capti...
research
10/15/2019

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

This notebook paper presents our model in the VATEX video captioning cha...
research
03/26/2023

GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation

Despite the recent emergence of video captioning models, how to generate...
research
05/10/2021

PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene Rearrangement Planning

Scene Rearrangement Planning (SRP) is an interior task proposed recently...
research
12/02/2021

Relational Graph Learning for Grounded Video Description Generation

Grounded video description (GVD) encourages captioning models to attend ...
research
10/24/2020

Improved Actor Relation Graph based Group Activity Recognition

Video understanding is to recognize and classify different actions or ac...
research
05/15/2023

Edit As You Wish: Video Description Editing with Multi-grained Commands

Automatically narrating a video with natural language can assist people ...

Please sign up or login with your details

Forgot password? Click here to reset