Just Ask: Learning to Answer Questions from Millions of Narrated Videos

12/01/2020
by Antoine Yang, et al.

Modern approaches to visual question answering require large annotated datasets for training. Manual annotation of questions and answers for videos, however, is tedious and expensive, and it does not scale. In this work, we propose to avoid manual annotation and to learn video question answering (VideoQA) from millions of readily available narrated videos. We automatically generate question-answer pairs from transcribed video narrations using a state-of-the-art text transformer pipeline, and obtain a new large-scale VideoQA training dataset. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer embedding. We evaluate our model on the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that finetuning our model on target datasets significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA and ActivityNet-QA. Finally, for a detailed evaluation, we introduce a new manually annotated VideoQA dataset with reduced language biases and high-quality annotations. Our code and datasets will be made publicly available at https://www.di.ens.fr/willow/research/just-ask/.
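
The abstract does not spell out the form of the contrastive loss, so the following is only a minimal sketch of one common choice, an InfoNCE-style softmax over dot-product scores. It assumes the video-question transformer and the answer encoder have already produced fixed-size vectors, and that the other answers in a batch serve as negatives. The function names `contrastive_loss` and `predict`, and the toy two-dimensional embeddings, are illustrative assumptions and not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(u, v))


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean InfoNCE-style loss over a batch (assumed form, not the paper's).

    vq_embeddings[i] and answer_embeddings[i] form the positive pair;
    every other answer in the batch is treated as a negative.
    """
    n = len(vq_embeddings)
    total = 0.0
    for i in range(n):
        scores = [dot(vq_embeddings[i], a) for a in answer_embeddings]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the matching answer.
        total += log_norm - scores[i]
    return total / n


def predict(vq_embedding, answer_embeddings):
    """Return the index of the answer whose embedding scores highest."""
    scores = [dot(vq_embedding, a) for a in answer_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: two video-question vectors and two answer vectors.
vq = [[5.0, 0.0], [0.0, 5.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(vq, answers)
mismatched = contrastive_loss(vq, answers[::-1])
```

In this toy setup `aligned` is much smaller than `mismatched`, because each video-question vector scores highest against its own answer. Scoring against an answer embedding rather than a fixed classification layer is what lets `predict` rank any candidate answer that can be embedded, which is consistent with the open-vocabulary motivation stated above.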


