Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

06/16/2022
by Antoine Yang, et al.

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
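To illustrate the masked-language-modeling inference described above, the following is a minimal, text-only sketch rather than the paper's implementation: it uses a generic Hugging Face masked LM (`bert-base-uncased` as a placeholder, not the BiLM used by the authors), a hypothetical question and candidate answer list, and it omits the visual inputs and the light trainable projection modules, which are only noted in comments.

```python
# Sketch: zero-shot answering by filling a masked answer slot with a frozen
# bidirectional LM. Assumptions: bert-base-uncased stands in for the paper's
# BiLM, and the question/candidate answers are made-up examples.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # the language model stays frozen

# In FrozenBiLM, features from a frozen video encoder would be mapped by light
# trainable modules and prepended to the text tokens; this sketch scores the
# textual prompt only.
question = "What is the person playing?"
candidate_answers = ["guitar", "piano", "football", "chess"]

prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
inputs = tokenizer(prompt, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_index]

# Rank single-token candidate answers by their log-probability at the mask.
log_probs = torch.log_softmax(logits, dim=-1)
scores = {a: log_probs[tokenizer.convert_tokens_to_ids(a)].item()
          for a in candidate_answers}
print(max(scores, key=scores.get))
```

In the full method, the answer prediction is conditioned on the video: visual tokens produced by a frozen video encoder and a trainable projection are combined with the frozen BiLM, so the mask is filled given both the question and the visual context.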
