DeepAI AI Chat
Log In Sign Up

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

by   Tanzila Rahman, et al.
The University of British Columbia

Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or those that model the audio-visual correlations in service of sound or sound source localization. In this paper, we present the evidence, that audio signals can carry surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods, as well as validate specific feature representation and architecture design choices.


page 1

page 7


Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from a...

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Dense video captioning aims to localize and describe important events in...

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

Recognizing sounds is a key aspect of computational audio scene analysis...

From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

Video captioning in essential is a complex natural process, which is aff...

Support-set based Multi-modal Representation Enhancement for Video Captioning

Video captioning is a challenging task that necessitates a thorough comp...

iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

Most prior art in visual understanding relies solely on analyzing the "w...

Weakly Supervised Dense Video Captioning

This paper focuses on a novel and challenging vision task, dense video c...