iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

08/29/2020
by amanchadha, et al.

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and the "where" (e.g., event localization), which, in some cases, fails to describe the correct contextual relationships between events or leads to incorrect underlying visual attention. Part of what defines us as human and fundamentally different from machines is our instinct to seek causality behind any association, say an event Y that happened as a direct result of event X. To this end, we propose iPerceive, a framework capable of understanding the "why" between events in a video by building a common-sense knowledge base that uses contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique on the dense video captioning (DVC) and video question answering (VideoQA) tasks. Furthermore, while most prior work in DVC and VideoQA relies solely on visual information, other modalities such as audio and speech are vital for a human observer's perception of an environment. We formulate the DVC and VideoQA tasks as machine translation problems that utilize multiple modalities. By evaluating the performance of iPerceive DVC and iPerceive VideoQA on the ActivityNet Captions and TVQA datasets, respectively, we show that our approach advances the state of the art. Code and samples are available at: https://iperceive.amanchadha.com

research
03/17/2020

Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from a...
research
06/25/2021

iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability

Causality knowledge is vital to building robust AI systems. Deep learnin...
research
09/22/2019

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Multi-modal learning, particularly among imaging and linguistic modaliti...
research
04/08/2019

Streamlined Dense Video Captioning

Dense video captioning is an extremely challenging task since accurate a...
research
08/17/2021

End-to-End Dense Video Captioning with Parallel Decoding

Dense video captioning aims to generate multiple associated captions wit...
research
12/12/2022

Contextual Explainable Video Representation: Human Perception-based Understanding

Video understanding is a growing field and a subject of intense research...
research
06/12/2020

Video Understanding as Machine Translation

With the advent of large-scale multimodal video datasets, especially seq...
