Fine-grained Audible Video Description

03/27/2023
by Xuyang Shen, et al.

We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in the videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. In contrast, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the effectiveness of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that fine-grained video descriptions yield more intricate videos than captions alone.
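To make the training objective more concrete, below is a minimal PyTorch-style sketch of how a masked language modeling loss and an auto-regressive language modeling loss could be combined into a single objective. The weighting factor `alpha`, the function name `combined_lm_loss`, and the tensor shapes are illustrative assumptions, not the paper's exact formulation, and the audio-visual-language transformer itself is not reproduced here.

```python
# Illustrative sketch only: combining masked-LM and auto-regressive LM losses.
# `alpha` and the tensor names are assumptions made for clarity.
import torch.nn.functional as F

def combined_lm_loss(mlm_logits, mlm_labels, ar_logits, ar_labels, alpha=0.5):
    """Weighted sum of masked-LM and auto-regressive cross-entropy losses.

    Logits have shape (batch, seq_len, vocab); labels have shape
    (batch, seq_len) and use -100 at positions to be ignored, following
    the common PyTorch convention.
    """
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )
    ar_loss = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)),
        ar_labels.reshape(-1),
        ignore_index=-100,
    )
    return alpha * mlm_loss + (1.0 - alpha) * ar_loss
```

In practice the two losses would be computed from separate forward passes (one with masked tokens, one with a causal attention mask) over the same paragraph-level target description; how the two terms are weighted is a design choice not specified in the abstract.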

Related research

Fine-grained Video Classification and Captioning (04/24/2018)
We describe a DNN for fine-grained action classification and video capti...

Rescribe: Authoring and Automatically Editing Audio Descriptions (10/07/2020)
Audio descriptions make videos accessible to those who cannot see them b...

TennisVid2Text: Fine-grained Descriptions for Domain Specific Videos (11/26/2015)
Automatically describing videos has ever been fascinating. In this work,...

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions (09/22/2018)
Automatic generation of textual video descriptions that are time-aligned...

Neural Naturalist: Generating Fine-Grained Image Comparisons (09/09/2019)
We introduce the new Birds-to-Words dataset of 41k sentences describing ...

Image Retrieval from Contextual Descriptions (03/29/2022)
The ability to integrate context, including perceptual and temporal cues...

Edit As You Wish: Video Description Editing with Multi-grained Commands (05/15/2023)
Automatically narrating a video with natural language can assist people ...
