Read, Look or Listen? What's Needed for Solving a Multimodal Dataset

07/06/2023
by Netta Madvil, et al.

The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotations to analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text- and audio-based ones, as well as with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.
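
To make the annotation scheme concrete, here is a minimal sketch of the bookkeeping such an analysis implies: each question is tagged with the set of single modalities that suffice to answer it on their own. The record format, field names, and example data below are hypothetical illustrations, not the paper's released annotations.

```python
from collections import Counter

# Hypothetical annotation records: for each question, the set of single
# modalities that suffice to answer it on their own. An empty set means
# the question genuinely requires combining modalities.
annotations = {
    "q1": {"video"},               # solvable by looking alone
    "q2": {"audio", "subtitles"},  # two independent single-modality routes
    "q3": set(),                   # needs multiple modalities together
    "q4": {"subtitles"},
}

n = len(annotations)

# Fraction of questions solvable with at least one single modality.
single = sum(1 for mods in annotations.values() if mods) / n

# Fraction solvable via several *different* single-modality strategies,
# e.g. either watching the video or listening to the audio.
redundant = sum(1 for mods in annotations.values() if len(mods) >= 2) / n

# How often each modality suffices on its own (modality importance).
importance = Counter(m for mods in annotations.values() for m in mods)

print(f"single-modality solvable: {single:.0%}")
print(f"several single-modality strategies: {redundant:.0%}")
print(f"per-modality counts: {dict(importance)}")
```

Under this bookkeeping, the questions with an empty modality set are exactly the candidates for a test set that necessitates multiple modalities, as the abstract describes.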


Related research

10/15/2020 · MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention
This paper presents MAST, a new model for Multimodal Abstractive Text Su...

08/01/2023 · MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers
In line with the human capacity to perceive the world by simultaneously ...

12/18/2020 · On Modality Bias in the TVQA Dataset
TVQA is a large scale video question answering (video-QA) dataset based ...

07/27/2023 · Cortex Inspired Learning to Recover Damaged Signal Modality with ReD-SOM Model
Recent progress in the fields of AI and cognitive sciences opens up new ...

07/02/2019 · E-Sports Talent Scouting Based on Multimodal Twitch Stream Data
We propose and investigate feasibility of a novel task that consists in ...

11/02/2022 · Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild
Laughter is considered one of the most overt signals of joy. Laughter is...

07/14/2020 · DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection
For recognizing speakers in video streams, significant research studies ...
