Dense Video Object Captioning from Disjoint Supervision

06/20/2023
by   Xingyi Zhou, et al.

We propose a new task and model for dense video object captioning: detecting, tracking, and captioning the trajectories of all objects in a video. This task unifies spatial and temporal understanding of the video, and requires fine-grained language descriptions. Our model is trained end-to-end and consists of dedicated modules for spatial localization, tracking, and captioning. As such, we can train it with a mixture of disjoint tasks, leveraging diverse, large-scale datasets that supervise different parts of the model. This results in noteworthy zero-shot performance. Moreover, finetuning from this initialization further improves performance, surpassing strong image-based baselines by a significant margin. Although we are not aware of prior work on this task, we repurpose existing video grounding datasets, namely VidSTG and VLN, to evaluate it. We show that our task is more general than grounding: a model trained on our task can be applied directly to grounding by selecting the bounding box with the maximum likelihood of generating the query sentence. Our model outperforms dedicated, state-of-the-art models for spatial grounding on both VidSTG and VLN.
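
As a rough sketch of how such disjoint supervision can be wired up (the batch fields, output keys, and loss choices below are illustrative assumptions, not the paper's actual interface), each training batch contributes losses only for the labels its source dataset provides:

```python
import torch
import torch.nn.functional as F

def train_step(model, batch, optimizer):
    """One step of mixed-dataset training. Each batch carries only the
    labels its source dataset annotates, and only the matching heads
    receive a loss. All field and output names here are hypothetical.
    """
    out = model(batch["frames"])  # assumed dict with detection, tracking,
                                  # and captioning outputs
    loss = batch["frames"].new_zeros(())
    if "boxes" in batch:       # detection labels supervise localization
        loss = loss + F.l1_loss(out["boxes"], batch["boxes"])
    if "track_ids" in batch:   # identity labels supervise association
        loss = loss + F.cross_entropy(out["assoc_logits"], batch["track_ids"])
    if "captions" in batch:    # caption tokens supervise the language head
        loss = loss + F.cross_entropy(
            out["caption_logits"].flatten(0, 1), batch["captions"].flatten()
        )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```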
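The grounding reduction admits a similarly compact sketch: score every detected trajectory by the log-likelihood that the captioning head generates the query sentence from it, and return the boxes of the highest-scoring trajectory. The method names `detect_and_track` and `caption_log_prob` are hypothetical stand-ins for the model's modules:

```python
import torch

@torch.no_grad()
def ground_query(model, frames, query_tokens):
    """Spatial grounding via captioning likelihood: pick the tracked
    object whose caption head assigns the query sentence the highest
    probability. Both model methods below are assumed interfaces.
    """
    trajectories = model.detect_and_track(frames)   # one entry per object
    scores = torch.stack([
        model.caption_log_prob(traj, query_tokens)  # sum of token log-probs
        for traj in trajectories
    ])
    best = scores.argmax().item()
    return trajectories[best]["boxes"]  # per-frame boxes for the query
```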
