MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

12/01/2021
by   Mattia Soldan, et al.
The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made to assess the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and instead focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases of video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately localized in diverse long-form videos that can last up to three hours.
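Grounding a sentence in a long video amounts to predicting a temporal span and comparing it against the annotated one, usually via temporal intersection-over-union (tIoU). The following is a minimal sketch of that metric; the function name and the `[start, end]` tuple convention are illustrative, not taken from the MAD codebase:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments given as (start, end) in seconds.

    Returns 0.0 when the spans do not overlap or the union is degenerate.
    """
    # Overlap is the length of the intersection of the two intervals.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union spans from the earliest start to the latest end.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


# A prediction of (4.0, 8.0) against ground truth (2.0, 5.0)
# overlaps for 1 second over a 6-second union: tIoU = 1/6.
print(temporal_iou((4.0, 8.0), (2.0, 5.0)))
```

A prediction counts as correct when its tIoU exceeds a threshold (e.g. 0.3 or 0.5), which is what makes grounding seconds-long moments inside hours-long movies substantially harder than in short trimmed clips: the space of candidate spans grows with video length while the target stays tiny.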
