Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

09/22/2019
by Tanzila Rahman, et al.

Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take the audio corresponding to video into account at all, or that model audio-visual correlations in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods, and validate specific feature representation and architecture design choices.

research
03/17/2020

Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from a...
research
05/17/2020

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Dense video captioning aims to localize and describe important events in...
research
08/08/2017

From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

Video captioning in essential is a complex natural process, which is aff...
research
08/29/2020

iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

Most prior art in visual understanding relies solely on analyzing the "w...
research
05/19/2022

Support-set based Multi-modal Representation Enhancement for Video Captioning

Video captioning is a challenging task that necessitates a thorough comp...
research
04/05/2017

Weakly Supervised Dense Video Captioning

This paper focuses on a novel and challenging vision task, dense video c...
research
10/28/2021

Audio-visual Representation Learning for Anomaly Events Detection in Crowds

In recent years, anomaly events detection in crowd scenes attracts many ...
