Memory-Attended Recurrent Network for Video Captioning

05/10/2019
by Wenjie Pei, et al.

Typical techniques for video captioning follow the encoder-decoder framework, which can focus only on the one source video being processed. A potential drawback of this design is that it cannot capture the multiple visual contexts of a word that appears in more than one relevant video in the training data. To tackle this limitation, we propose the Memory-Attended Recurrent Network (MARN) for video captioning, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in the training data. Our model is thus able to achieve a more comprehensive understanding of each word and yield higher captioning quality. Furthermore, the built memory structure enables our method to model the compatibility between adjacent words explicitly, instead of requiring the model to learn it implicitly as most existing models do. Extensive validation on two real-world datasets demonstrates that our MARN consistently outperforms state-of-the-art methods.
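To illustrate the core idea, here is a minimal, hypothetical PyTorch sketch of one memory-attended decoding step: the decoder attends over the current video's frame features and additionally looks up a stored visual-context vector for the previously generated word, making the word-to-visual-context correspondence explicit. All names, dimensions, and the choice to learn the memory as a plain parameter table are simplifying assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a memory-attended decoding step (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttendedDecoderStep(nn.Module):
    def __init__(self, vocab_size, d_word, d_vis, d_hid):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word)
        # Word memory: one visual-context vector per vocabulary word. The paper
        # builds this from the word's visual contexts across training videos;
        # here it is simplified to a learned parameter table (assumption).
        self.word_memory = nn.Parameter(torch.randn(vocab_size, d_vis) * 0.01)
        self.attn = nn.Linear(d_hid + d_vis, 1)            # additive attention score
        self.cell = nn.GRUCell(d_word + 2 * d_vis, d_hid)  # word + video ctx + memory
        self.out = nn.Linear(d_hid, vocab_size)

    def forward(self, prev_word, h, vis_feats):
        # prev_word: (B,) token ids; h: (B, d_hid); vis_feats: (B, T, d_vis)
        B, T, _ = vis_feats.shape
        # Temporal attention over the current video's frame features.
        q = h.unsqueeze(1).expand(B, T, -1)
        scores = self.attn(torch.cat([q, vis_feats], dim=-1)).squeeze(-1)
        ctx = (F.softmax(scores, dim=-1).unsqueeze(-1) * vis_feats).sum(dim=1)
        # Cross-video memory lookup for the previously generated word; feeding
        # it into the recurrent cell lets the model score the compatibility of
        # adjacent words explicitly rather than learning it implicitly.
        mem = self.word_memory[prev_word]                  # (B, d_vis)
        h = self.cell(torch.cat([self.embed(prev_word), ctx, mem], dim=-1), h)
        return self.out(h), h                              # logits over next word

# Example usage with arbitrary sizes:
step = MemoryAttendedDecoderStep(vocab_size=10000, d_word=300, d_vis=512, d_hid=512)
logits, h = step(torch.zeros(4, dtype=torch.long),   # previous tokens
                 torch.zeros(4, 512),                # decoder hidden state
                 torch.randn(4, 20, 512))            # 20 frame features per video
```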

Related Research

12/20/2020 · Guidance Module Network for Video Captioning
Video captioning has been a challenging and significant task that descri...

06/05/2017 · Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention based encoder-decoder f...

01/14/2021 · Exploration of Visual Features and their weighted-additive fusion for Video Captioning
Video captioning is a popular task that challenges models to describe ev...

11/20/2016 · Recurrent Memory Addressing for describing videos
In this paper, we introduce Key-Value Memory Networks to a multimodal se...

10/13/2021 · CLIP4Caption: CLIP for Video Caption
Video captioning is a challenging task since it requires generating sent...

09/19/2018 · MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description
Learning visual feature representations for video analysis is a daunting...

11/05/2019 · Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning
Automatically describing video content with natural language has been at...
