Not All Words are Equal: Video-specific Information Loss for Video Captioning

01/01/2019
by Jiarong Dong, et al.

An ideal description for a given video should fix its gaze on the salient and representative content that distinguishes this video from all others. However, the distribution of words in video captioning datasets is unbalanced: distinctive words describing video-specific salient objects are far rarer than common words such as 'a', 'the', and 'person'. This dataset bias often leads to recognition errors or missing details for salient but unusual objects. To address this issue, we propose a novel learning strategy called Information Loss, which focuses on the relationship between the video-specific visual content and the corresponding representative words. Moreover, we establish a framework with hierarchical visual representations and an optimized hierarchical attention mechanism to capture the most salient spatial-temporal visual information, fully exploiting the potential of the proposed learning strategy. Extensive experiments demonstrate that this guidance strategy, together with the optimized architecture, outperforms state-of-the-art video captioning methods on MSVD with a CIDEr score of 87.5, and achieves a superior CIDEr score of 47.7 on MSR-VTT. We also show that our Information Loss is generic, improving various models by significant margins.
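The abstract describes Information Loss only at a high level. As a minimal sketch of the general idea it conveys (reweighting the captioning cross-entropy so that rare, video-specific words contribute more to the loss than common function words), the PyTorch snippet below uses IDF-style weights computed from the training captions. The function names, the IDF-based weighting, and the tensor shapes are all illustrative assumptions, not the paper's actual formulation.

import math
from collections import Counter

import torch
import torch.nn.functional as F

def idf_weights(train_captions, vocab):
    # Document frequency: the number of captions each word appears in.
    df = Counter(w for cap in train_captions for w in set(cap))
    n = len(train_captions)
    # Rare, video-specific words get large weights; 'a', 'the', ... get small ones.
    return torch.tensor([math.log(n / (1.0 + df[w])) + 1.0 for w in vocab])

def information_weighted_loss(logits, targets, word_weights, pad_idx=0):
    # logits:  (batch, seq_len, vocab_size) decoder scores
    # targets: (batch, seq_len) gold caption word indices
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative log-likelihood of each gold token.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    weights = word_weights[targets]        # per-token weight lookup
    mask = (targets != pad_idx).float()    # skip padding positions
    return (weights * nll * mask).sum() / mask.sum().clamp(min=1.0)

In this sketch the weights are fixed per word from corpus statistics alone; the paper's actual strategy ties the weighting to the video-specific visual content, so this should be read only as a starting point for the intuition.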

Related research

06/05/2017
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention based encoder-decoder f...

12/26/2018
Hierarchical LSTMs with Adaptive Attention for Visual Captioning
Recent progress has been made in using attention based encoder-decoder f...

06/11/2019
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
Video captioning aims to automatically generate natural language descrip...

07/19/2017
Supervising Neural Attention Models for Video Captioning by Human Gaze Data
The attention mechanisms in deep neural networks are inspired by human's...

11/05/2019
Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning
Automatically describing video content with natural language has been at...

05/03/2021
MemX: An Attention-Aware Smart Eyewear System for Personalized Moment Auto-capture
This work presents MemX: a biologically-inspired attention-aware eyewear...

10/15/2019
Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019
This notebook paper presents our model in the VATEX video captioning cha...
