On Modality Bias in the TVQA Dataset

12/18/2020
by Thomas Winterbottom, et al.

TVQA is a large-scale video question answering (video-QA) dataset based on popular TV shows. The questions were specifically designed to require "both vision and language understanding to answer". In this work, we demonstrate an inherent bias in the dataset towards the textual subtitle modality. We infer said bias both directly and indirectly, notably finding that models trained with subtitles learn, on average, to suppress video feature contribution. Our results demonstrate that models trained on only the visual information can answer roughly 45% of questions. We find that a bilinear pooling based joint representation of modalities damages model performance by 9%. We also show that TVQA fails to benefit from the RUBi modality bias reduction technique popularised in VQA. By simply improving text processing using BERT embeddings with the simple model first proposed for TVQA, we achieve state-of-the-art results (72.13%, compared to 70.50% for the previous state of the art). We recommend a multimodal evaluation framework that can highlight biases in models and isolate visually and textually reliant subsets of data. Using this framework we propose subsets of TVQA that respond exclusively to either or both modalities in order to facilitate multimodal modelling as TVQA originally intended.
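The bilinear-pooling result refers to fusing the subtitle and video feature vectors into a single joint subspace. Below is a minimal PyTorch sketch of one such fusion using an MFB-style low-rank approximation; the class name, feature dimensions, and rank are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Fuse a subtitle vector and a video vector via a low-rank
    (MFB-style) approximation of bilinear pooling."""
    def __init__(self, text_dim=768, video_dim=2048, joint_dim=512, rank=5):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim * rank)
        self.video_proj = nn.Linear(video_dim, joint_dim * rank)
        self.joint_dim, self.rank = joint_dim, rank

    def forward(self, text_feat, video_feat):
        # Element-wise product in the expanded (joint_dim * rank) space ...
        joint = self.text_proj(text_feat) * self.video_proj(video_feat)
        # ... then sum-pool over the rank dimension (the low-rank trick)
        joint = joint.view(-1, self.joint_dim, self.rank).sum(dim=-1)
        # Signed square-root and L2 normalisation, conventional for MFB
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
        return nn.functional.normalize(joint, dim=-1)

fusion = LowRankBilinearFusion()
text = torch.randn(4, 768)    # e.g. pooled BERT subtitle embeddings
video = torch.randn(4, 2048)  # e.g. pooled ImageNet frame features
print(fusion(text, video).shape)  # torch.Size([4, 512])
```

The abstract's finding is that forcing both modalities through a joint representation of this kind damages TVQA performance (by 9%) relative to the simpler baseline processing.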
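RUBi (Cadene et al., 2019), the bias-reduction technique the abstract reports does not help on TVQA, trains a unimodal branch whose predictions mask the main model's logits during training, so that examples solvable by unimodal shortcuts contribute less gradient. A minimal sketch, assuming TVQA's 5-way multiple-choice setting; module and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RUBiHead(nn.Module):
    """RUBi-style logit masking: a question-only branch rescales the
    main (multimodal) logits during training."""
    def __init__(self, q_dim=768, n_answers=5):
        super().__init__()
        self.q_only = nn.Linear(q_dim, n_answers)  # unimodal branch

    def forward(self, fused_logits, q_feat):
        q_logits = self.q_only(q_feat)
        # Examples the unimodal branch already answers confidently and
        # correctly yield small losses, reducing their gradient share.
        masked_logits = fused_logits * torch.sigmoid(q_logits)
        return masked_logits, q_logits

def rubi_loss(masked_logits, q_logits, target):
    # Both branches are supervised with cross-entropy; at test time
    # only the unmasked multimodal logits are used.
    return F.cross_entropy(masked_logits, target) + F.cross_entropy(q_logits, target)
```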

research · 04/17/2020 · Knowledge-Based Visual Question Answering in Videos
We propose a novel video understanding task by fusing knowledge-based an...

research · 10/23/2019 · KnowIT VQA: Answering Knowledge-Based Questions about Videos
We propose a novel video understanding task by fusing knowledge-based an...

research · 12/14/2021 · Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering
Answering semantically-complicated questions according to an image is ch...

research · 07/06/2023 · Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
The prevalence of large-scale multimodal datasets presents unique challe...

research · 11/01/2018 · Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
Language-and-vision navigation and question answering (QA) are exciting ...

research · 12/18/2020 · Trying Bilinear Pooling in Video-QA
Bilinear pooling (BLP) refers to a family of operations recently develop...

research · 06/01/2023 · PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Information extraction, e.g., attribute value extraction, has been exten...
