Multimodal Semantic Attention Network for Video Captioning

05/08/2019
by   Liang Sun, et al.
0

Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network(MSAN), which is a new encoder-decoder framework incorporating multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating it as a multi-label classification problem. Moreover, we add auxiliary classification loss to our model that can obtain more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ attention mechanism to pay attention to different attributes at each time of the captioning process. We evaluate algorithm on two popular public benchmarks: MSVD and MSR-VTT, achieving competitive results with current state-of-the-art across six evaluation metrics.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/01/2016

Video Captioning with Multi-Faceted Attention

Recently, video captioning has been attracting an increasing amount of i...
research
01/11/2017

Attention-Based Multimodal Fusion for Video Description

Currently successful methods for video description are based on encoder-...
research
11/20/2016

Recurrent Memory Addressing for describing videos

In this paper, we introduce Key-Value Memory Networks to a multimodal se...
research
11/23/2016

Semantic Compositional Networks for Visual Captioning

A Semantic Compositional Network (SCN) is developed for image captioning...
research
09/05/2019

Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

Recently, automatic image caption generation has been an important focus...
research
01/04/2022

Variational Stacked Local Attention Networks for Diverse Video Captioning

While describing Spatio-temporal events in natural language, video capti...
research
11/30/2018

Deep Multimodal Learning: An Effective Method for Video Classification

Videos have become ubiquitous on the Internet. And video analysis can pr...

Please sign up or login with your details

Forgot password? Click here to reset