Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

09/22/2019
by Tanzila Rahman, et al.
The University of British Columbia

Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take the audio corresponding to video into account at all, or that model audio-visual correlations in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods, and validate specific feature representation and architecture design choices.
