Global2Local: A Joint-Hierarchical Attention for Video Captioning

03/13/2022
by Chengpeng Dai, et al.

Recently, automatic video captioning has attracted increasing attention. Its core challenge lies in capturing the key semantic items, such as objects and actions, as well as their spatial-temporal correlations, from redundant frames and semantic content. To this end, existing works select either the key video clips at a global level (across multiple frames) or the key regions within each frame; they neglect, however, the hierarchical order, i.e., key frames first and key regions later. In this paper, we propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames, and the key regions jointly into the captioning model in a hierarchical manner. This joint-hierarchical attention model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation to further identify key regions within those frames, yielding an accurate global-to-local feature representation to guide the captioning. Extensive quantitative evaluations on two public benchmark datasets, MSVD and MSR-VTT, demonstrate the superiority of the proposed method over state-of-the-art methods.
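The Gumbel sampling step mentioned above makes a hard, discrete selection of regions while remaining trainable. The abstract does not specify the exact selection head, so the following is a minimal generic Gumbel-softmax sketch in NumPy; the region-score values, temperature, and shapes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, rng=None):
    """Soft or hard categorical sample via the Gumbel-softmax trick.

    A generic sketch: in a real autodiff framework the hard one-hot
    would be combined with a straight-through gradient estimator.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Temperature-scaled softmax over perturbed logits (max-shifted for stability)
    z = logits + g
    y = np.exp((z - z.max(axis=-1, keepdims=True)) / tau)
    y = y / y.sum(axis=-1, keepdims=True)
    if hard:
        # Discrete selection: one-hot of the argmax
        one_hot = np.zeros_like(y)
        one_hot[np.argmax(y, axis=-1)] = 1.0
        return one_hot
    return y

# Hypothetical attention scores for 5 candidate regions in one key frame
region_scores = np.array([0.1, 2.0, 0.3, 0.5, 1.2])
mask = gumbel_softmax(region_scores, tau=0.5, hard=True)
```

At low temperature the soft distribution concentrates on the top-scoring region, so the hard one-hot mask approximates sampling a single key region per frame.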


research
05/28/2016

Video Key Frame Extraction using Entropy value as Global and Local Feature

Key frames play an important role in video annotation. It is one of the ...
research
05/22/2022

GL-RG: Global-Local Representation Granularity for Video Captioning

Video captioning is a challenging task as it needs to accurately transfo...
research
01/02/2021

Video Captioning in Compressed Video

Existing approaches in video captioning concentrate on exploring global ...
research
11/25/2021

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

The canonical approach to video captioning dictates a caption generation...
research
12/20/2022

METEOR Guided Divergence for Video Captioning

Automatic video captioning aims for a holistic visual scene understandin...
research
11/20/2016

Recurrent Memory Addressing for describing videos

In this paper, we introduce Key-Value Memory Networks to a multimodal se...
research
03/05/2018

Less Is More: Picking Informative Frames for Video Captioning

In video captioning task, the best practice has been achieved by attenti...
