Spatio-Temporal Attention Models for Grounded Video Captioning

10/17/2016
by Mihai Zanfir et al.

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map directly from raw video data to a high-level textual description, bypassing localization and recognition and thus discarding information that is potentially valuable for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark, while also offering the advantage of localizing the visual concepts (subjects, verbs, objects) over space and time, with no grounding supervision.
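To make the core idea concrete, here is a minimal NumPy sketch of soft spatio-temporal attention: a decoder state scores every frame-region feature jointly across space and time, and the resulting distribution produces a grounded context vector. All names, dimensions, and the scoring function are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(features, h, W_f, W_h, v):
    """Soft attention over all T*R frame-region features given decoder state h.

    features: (T, R, D) region features from T frames, R regions per frame
    h:        (H,)      current LSTM decoder hidden state
    W_f, W_h, v: learned projection parameters (here, random placeholders)
    Returns the attended context vector (D,) and attention weights (T, R),
    which indicate where in space and time the generated word is grounded.
    """
    T, R, D = features.shape
    flat = features.reshape(T * R, D)           # treat all regions across time jointly
    scores = np.tanh(flat @ W_f + h @ W_h) @ v  # (T*R,) additive-attention relevance scores
    alpha = softmax(scores)                     # distribution over space-time locations
    context = alpha @ flat                      # (D,) weighted sum of region features
    return context, alpha.reshape(T, R)

# Toy usage with hypothetical dimensions
rng = np.random.default_rng(0)
T, R, D, H, A = 5, 4, 8, 6, 7
feats = rng.standard_normal((T, R, D))
h = rng.standard_normal(H)
W_f = rng.standard_normal((D, A))
W_h = rng.standard_normal((H, A))
v = rng.standard_normal(A)
ctx, alpha = spatio_temporal_attention(feats, h, W_f, W_h, v)
print(ctx.shape, alpha.shape)  # (8,) (5, 4)
```

In a full captioning model, `context` would be fed back into the LSTM at each decoding step, and the per-word weights `alpha` provide the spatio-temporal localization of subjects, verbs, and objects that the abstract describes, learned without grounding supervision.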


