Generating Video Descriptions with Topic Guidance

08/31/2017
by   Shizhe Chen, et al.
0

Generating video descriptions in natural language (a.k.a. video captioning) is a more challenging task than image captioning as the videos are intrinsically more complicated than images in two aspects. First, videos cover a broader range of topics, such as news, music, sports and so on. Second, multiple topics could coexist in the same video. In this paper, we propose a novel caption model, topic-guided model (TGM), to generate topic-oriented descriptions for videos in the wild via exploiting topic information. In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way based on training captions by an unsupervised topic mining model. We show that data-driven topics reflect a better topic schema than the predefined topics. As for testing video topic prediction, we treat the topic mining model as teacher to train the student, the topic prediction model, by utilizing the full multi-modalities in the video especially the speech modality. We propose a series of caption models to exploit topic guidance, including implicitly using the topics as input features to generate words related to the topic and explicitly modifying the weights in the decoder with topics to function as an ensemble of topic-aware language decoders. Our comprehensive experimental results on the current largest video caption dataset MSR-VTT prove the effectiveness of our topic-guided model, which significantly surpasses the winning performance in the 2016 MSR video to language challenge.

READ FULL TEXT

page 1

page 3

page 7

research
08/31/2017

Video Captioning with Guidance of Multimodal Latent Topics

The topic diversity of open-domain videos leads to various vocabularies ...
research
10/26/2020

Multimodal Topic Learning for Video Recommendation

Facilitated by deep neural networks, video recommendation systems have m...
research
07/10/2018

Topic-Guided Attention for Image Captioning

Attention mechanisms have attracted considerable interest in image capti...
research
04/04/2023

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Scaling up weakly-supervised datasets has shown to be highly effective i...
research
12/09/2015

Video captioning with recurrent networks based on frame- and video-level features and visual content classification

In this paper, we describe the system for generating textual description...
research
07/31/2023

Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

Stylized visual captioning aims to generate image or video descriptions ...
research
06/29/2021

Topic-to-Essay Generation with Comprehensive Knowledge Enhancement

Generating high-quality and diverse essays with a set of topics is a cha...

Please sign up or login with your details

Forgot password? Click here to reset