Enriching Video Captions With Contextual Text

07/29/2020
by Philipp Rimle, et al.

Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods, which typically generate generic video captions without context, our architecture contextualizes captioning by infusing information extracted from relevant text data. We propose an end-to-end sequence-to-sequence model that generates video captions based on visual input and mines relevant knowledge, such as names and locations, from contextual text. In contrast to previous approaches, we do not preprocess the text further and instead let the model learn to attend over it directly. Guided by the visual input, the model can copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.
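As a rough illustration of the copy mechanism mentioned above, the following sketch blends a decoder's vocabulary distribution with an attention-based copy distribution over the contextual text, following the general pointer-generator idea. It is not the authors' implementation: the function name `pointer_generator_mix`, its parameters, and the toy numbers are all assumptions made for illustration, and in a real model the attention weights and the gate `p_gen` would be predicted by the network from the visual input and decoder state.

```python
from collections import defaultdict


def pointer_generator_mix(vocab_probs, context_tokens, attention, p_gen):
    """Blend a generation distribution with a copy distribution.

    vocab_probs    -- dict mapping word -> probability under the fixed vocabulary
    context_tokens -- tokens of the contextual text (may contain rare names)
    attention      -- one attention weight per context token, summing to 1
    p_gen          -- probability in [0, 1] of generating instead of copying
    (All names here are hypothetical, chosen for this sketch.)
    """
    if len(context_tokens) != len(attention):
        raise ValueError("need exactly one attention weight per context token")
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")

    final = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_probs.items():
        final[word] += p_gen * prob
    # Copy path: add attention mass for every occurrence of a token,
    # so out-of-vocabulary words such as names can still be produced.
    for token, weight in zip(context_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Toy example with made-up numbers: "Merkel" is not in the vocabulary,
# but receives probability mass through the copy path.
vocab_probs = {"a": 0.5, "man": 0.3, "speaks": 0.2}
context_tokens = ["Merkel", "speaks", "in", "Berlin"]
attention = [0.6, 0.1, 0.1, 0.2]

mixed = pointer_generator_mix(vocab_probs, context_tokens, attention, p_gen=0.5)
best_word = max(mixed, key=mixed.get)
print(best_word, round(mixed[best_word], 3))
```

In this toy setting the name from the contextual text ends up as the most likely next word even though the fixed vocabulary cannot generate it, which is the behavior that lets a captioning model produce specific names and locations.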


