Co-attentional Transformers for Story-Based Video Understanding

10/27/2020
by Björn Bebensee, et al.

Inspired by recent trends in vision and language learning, we explore applications of attention mechanisms for visio-linguistic fusion in story-based video understanding. Like other video-based QA tasks, video story understanding requires agents to grasp complex temporal dependencies. However, because it focuses on the narrative aspect of video, it also requires understanding of the interactions between different characters, as well as their actions and motivations. We propose a novel co-attentional transformer model to better capture the long-term dependencies found in visual stories such as dramas, and we measure its performance on the video question answering task. We evaluate our approach on the recently introduced DramaQA dataset, which features character-centered video story understanding questions. Our model outperforms the baseline by 8 percentage points overall, improves on every difficulty level by between 4.95 and 12.8 percentage points, and beats the winner of the DramaQA challenge.
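The co-attentional fusion referred to above follows the two-stream design popularized by ViLBERT, in which each modality's queries attend over the other modality's keys and values. The sketch below is a minimal illustration of one such layer, not the authors' actual implementation; the layer sizes, names, and the use of PyTorch's nn.MultiheadAttention are assumptions.

```python
# A minimal sketch of one co-attentional transformer layer (ViLBERT-style):
# visual queries attend over language keys/values, and vice versa.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Cross-attention for each stream; batch_first=True expects
        # tensors shaped (batch, sequence, d_model).
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
        self.lang_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.vis_ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        self.lang_ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                     nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, vis, lang):
        # vis: (batch, num_frames, d_model); lang: (batch, num_tokens, d_model)
        v, _ = self.vis_attn(query=vis, key=lang, value=lang)   # vision attends to language
        l, _ = self.lang_attn(query=lang, key=vis, value=vis)   # language attends to vision
        vis = self.norms[0](vis + v)
        lang = self.norms[1](lang + l)
        vis = self.norms[2](vis + self.vis_ff(vis))
        lang = self.norms[3](lang + self.lang_ff(lang))
        return vis, lang

# Example: fuse 40 frame features with 20 token embeddings.
layer = CoAttentionLayer()
vis, lang = layer(torch.randn(2, 40, 768), torch.randn(2, 20, 768))
```

Stacking several such layers lets character-grounded visual features and the dialogue/question tokens condition on each other before answer scoring, which is how co-attention can capture the long-range, character-centered dependencies the abstract describes.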

Related research

- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA (05/07/2020)
- A Restricted Visual Turing Test for Deep Scene and Event Understanding (12/06/2015)
- Constructing Hierarchical Q&A Datasets for Video Story Understanding (04/01/2019)
- CogME: A Novel Evaluation Metric for Video Understanding Intelligence (07/21/2021)
- Story Understanding in Video Advertisements (07/29/2018)
- EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding (08/17/2023)
- On the hidden treasure of dialog in video question answering (03/26/2021)
