A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

05/17/2020
by   Vladimir Iashin, et al.

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate their importance only on a dataset with a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of taking any two modalities as input in a sequence-to-sequence task. We also show that a bi-modal encoder pre-trained along with a bi-modal decoder for captioning can be used as a feature extractor for a simple proposal generation module. The model is evaluated on the challenging ActivityNet Captions dataset, where it achieves outstanding performance.

research
03/17/2020

Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from a...
research
12/15/2021

Dense Video Captioning Using Unsupervised Semantic Information

We introduce a method to learn unsupervised semantic visual information ...
research
09/22/2019

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Multi-modal learning, particularly among imaging and linguistic modaliti...
research
10/31/2016

Bi-modal First Impressions Recognition using Temporally Ordered Deep Audio and Stochastic Visual Features

We propose a novel approach for First Impressions Recognition in terms o...
research
12/07/2018

An Attempt towards Interpretable Audio-Visual Video Captioning

Automatically generating a natural language sentence to describe the con...
research
01/14/2021

Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Video captioning is a popular task that challenges models to describe ev...
research
12/20/2022

METEOR Guided Divergence for Video Captioning

Automatic video captioning aims for a holistic visual scene understandin...
