Multimodal Memory Modelling for Video Captioning

11/17/2016
by   Junbo Wang, et al.
0

Video captioning which automatically translates video clips into natural language sentences is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is still a challenging problem. In this paper, we propose a Multimodal Memory Model (M3) to describe videos, which builds a visual and textual shared memory to model the long-term visual-textual dependency and further guide global visual attention on described targets. Specifically, the proposed M3 attaches an external memory to store and retrieve both visual and textual contents by interacting with video and sentence with multiple read and write operations. First, text representation in the Long Short-Term Memory (LSTM) based text decoder is written into the memory, and the memory contents will be read out to guide an attention to select related visual targets. Then, the selected visual information is written into the memory, which will be further read out to the text decoder. To evaluate the proposed model, we perform experiments on two publicly benchmark datasets: MSVD and MSR-VTT. The experimental results demonstrate that our method outperforms the state-of-theart methods in terms of BLEU and METEOR.

READ FULL TEXT
research
06/15/2016

Bidirectional Long-Short Term Memory for Video Description

Video captioning has been attracting broad research attention in multime...
research
11/20/2016

Recurrent Memory Addressing for describing videos

In this paper, we introduce Key-Value Memory Networks to a multimodal se...
research
06/07/2020

NITS-VC System for VATEX Video Captioning Challenge 2020

Video captioning is process of summarising the content, event and action...
research
02/22/2022

Exploiting long-term temporal dynamics for video captioning

Automatically describing videos with natural language is a fundamental c...
research
02/27/2020

Hierarchical Memory Decoding for Video Captioning

Recent advances of video captioning often employ a recurrent neural netw...
research
11/05/2022

Differentiable Neural Computers with Memory Demon

A Differentiable Neural Computer (DNC) is a neural network with an exter...
research
12/19/2016

Few-Shot Object Recognition from Machine-Labeled Web Images

With the tremendous advances of Convolutional Neural Networks (ConvNets)...

Please sign up or login with your details

Forgot password? Click here to reset