Less Is More: Picking Informative Frames for Video Captioning

03/05/2018
by Yangyu Chen, et al.

In the video captioning task, the best performance has been achieved by attention-based models that associate salient visual components of a video with sentences. However, existing studies follow a common procedure that performs frame-level appearance and motion modeling on frames sampled at equal intervals, which can introduce redundant visual information, sensitivity to content noise, and unnecessary computation cost. We propose PickNet, a plug-and-play network that picks informative frames for video captioning. Built on a standard encoder-decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame-picking action is designed to maximize visual diversity and minimize textual discrepancy. If a candidate frame is rewarded, it is selected and the corresponding latent representation of the encoder-decoder is updated for future trials. This procedure continues until the end of the video sequence. Consequently, a compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation. Experimental results show that our model can use only 6-8 frames to achieve competitive performance across popular benchmarks.
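The sequential picking loop described above can be illustrated with a minimal sketch. This is not the paper's method: the learned policy network and the textual-discrepancy term are omitted, and the reward is replaced by a simple visual-diversity proxy (minimum cosine distance to already-picked frames, with a hypothetical `reward_threshold`). It only shows the control flow of scanning frames in order and keeping a compact subset.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def pick_frames(frame_features, max_picks=8, reward_threshold=0.5):
    """Greedy sketch of sequential frame picking.

    A candidate frame is 'rewarded' (selected) if it is sufficiently
    different from every frame already picked -- a stand-in for the
    visual-diversity reward; the RL policy is not modeled here.
    """
    picked = []
    for t, feat in enumerate(frame_features):
        if len(picked) >= max_picks:
            break  # compact subset is full
        if not picked:
            picked.append(t)  # always keep the first frame
            continue
        # diversity proxy: distance to the closest already-picked frame
        if min(cosine_distance(feat, frame_features[p]) for p in picked) > reward_threshold:
            picked.append(t)
    return picked

# toy 2-d "frame features": near-duplicate frames should be skipped
feats = [
    [1.0, 0.0], [0.99, 0.1],   # near-duplicates of the first frame
    [0.0, 1.0], [0.1, 0.95],   # a new, visually distinct "shot"
    [-1.0, 0.0],               # another distinct frame
]
print(pick_frames(feats, max_picks=3))  # → [0, 2, 4]
```

Near-duplicate frames (indices 1 and 3) are skipped, so the selected subset stays small while still covering the visually distinct content, mirroring the paper's goal of captioning from only 6-8 frames.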


