Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework

04/09/2021
by   Santiago Castro, et al.
1

Work to date on language-informed video understanding has primarily addressed two tasks: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit the fact that candidate answers are readily available; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. In this paper, we propose fill-in-the-blanks as a video understanding evaluation framework that addresses these previous evaluation drawbacks, and more closely reflects real-life settings where no multiple choices are given. The task tests a system understanding of a video by requiring the model to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests. We show that both a multimodal model and a strong language model have a large gap with human performance, thus suggesting that the task is more challenging than current video understanding benchmarks.

READ FULL TEXT
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 3

page 5

page 9

page 11

08/11/2021

Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering

Video question answering has recently received a lot of attention from m...
12/09/2015

MovieQA: Understanding Stories in Movies through Question-Answering

We introduce the MovieQA dataset which aims to evaluate automatic story ...
06/08/2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Most existing video-and-language (VidL) research focuses on a single dat...
08/04/2017

MemexQA: Visual Memex Question Answering

This paper proposes a new task, MemexQA: given a collection of photos or...
07/05/2019

Video Question Generation via Cross-Modal Self-Attention Networks Learning

Video Question Answering (Video QA) is a critical and challenging task i...
11/23/2016

A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

While deep convolutional neural networks frequently approach or exceed h...
06/23/2016

VideoMCC: a New Benchmark for Video Comprehension

While there is overall agreement that future technology for organizing, ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.