Joint Event Detection and Description in Continuous Video Streams

02/28/2018
by   Huijuan Xu, et al.
0

As a fine-grained video understanding task, dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) that solves the dense captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers and proposes variable-length temporal events based on pooled features. In order to explicitly model temporal relationships between visual events and their captions in a single video, we propose a two-level hierarchical LSTM module that transcribes the event proposals into captions. Unlike existing dense video captioning approaches, our proposal generation and language captioning networks are trained end-to-end, allowing for improved temporal segmentation. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by most language generation metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.

READ FULL TEXT
research
04/08/2019

Streamlined Dense Video Captioning

Dense video captioning is an extremely challenging task since accurate a...
research
04/23/2018

Jointly Localizing and Describing Events for Dense Video Captioning

Automatically describing a video with natural language is regarded as a ...
research
04/18/2022

End-to-end Dense Video Captioning as Sequence Generation

Dense video captioning aims to identify the events of interest in an inp...
research
04/03/2018

End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events...
research
07/29/2020

Enriching Video Captions With Contextual Text

Understanding video content and generating caption with context is an im...
research
06/20/2023

Dense Video Object Captioning from Disjoint Supervision

We propose a new task and model for dense video object captioning – dete...
research
03/11/2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Joint video-language learning has received increasing attention in recen...

Please sign up or login with your details

Forgot password? Click here to reset