Video Captioning with Multi-Faceted Attention

12/01/2016
by Xiang Long et al.

Video captioning has recently attracted increasing interest due to its potential for improving accessibility and information retrieval. While existing methods rely on various kinds of visual features and model structures, they do not fully exploit relevant semantic information. We present an extensible approach that jointly leverages several types of visual features and semantic attributes. Our novel architecture builds on LSTMs for sentence generation, with several attention layers and two multimodal layers. The attention mechanism learns to automatically select the most salient visual features or semantic attributes, and the multimodal layers yield overall representations for the input and output of the sentence-generation component. Experimental results on the challenging MSVD and MSR-VTT datasets show that our framework outperforms state-of-the-art approaches, while ground-truth semantic attributes further elevate output quality to a near-human level.
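The core mechanism the abstract describes — attention that weighs a set of visual features or semantic attributes against the decoder's current state — can be sketched with the common additive soft-attention form. This is an illustrative sketch only, not the authors' code: the function names, dimensions, and the exact parameterisation of the scoring function are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, hidden, W_f, W_h, w):
    # features: (n, d_f) candidate vectors (e.g. frame features or
    # semantic-attribute embeddings); hidden: (d_h,) decoder LSTM state.
    # Additive-attention scores e_i = w^T tanh(W_f f_i + W_h h); the
    # paper's exact formulation may differ -- hypothetical sketch.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w  # (n,)
    alpha = softmax(scores)          # attention weights over candidates
    return alpha @ features, alpha   # weighted summary for the decoder

# Toy usage with random parameters (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
n, d_f, d_h, d_a = 5, 8, 6, 4
feats = rng.normal(size=(n, d_f))
h = rng.normal(size=d_h)
ctx, alpha = attend(feats, h,
                    rng.normal(size=(d_f, d_a)),
                    rng.normal(size=(d_h, d_a)),
                    rng.normal(size=d_a))
```

A multi-faceted variant would run one such layer per feature type (motion, appearance, attributes) and fuse the resulting context vectors in a multimodal layer, which is the role the abstract assigns to its two multimodal layers.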


Related research

- Multimodal Semantic Attention Network for Video Captioning (05/08/2019): Inspired by the fact that different modalities in videos carry complemen...
- Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding (11/28/2018): We address the problem of phrase grounding by learning a multi-level com...
- Visual-aware Attention Dual-stream Decoder for Video Captioning (10/16/2021): Video captioning is a challenging task that captures different visual pa...
- Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning (11/05/2019): Automatically describing video content with natural language has been at...
- Thinking Hallucination for Video Captioning (09/28/2022): With the advent of rich visual representations and pre-trained language ...
- Multimodal Attention Branch Network for Perspective-Free Sentence Generation (09/10/2019): In this paper, we address the automatic sentence generation of fetching ...
- Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching (05/15/2020): We propose in this paper an architecture for near-duplicate video detect...
