Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models

08/18/2023
by Dohwan Ko, et al.

Video Question Answering (VideoQA) is a challenging task that entails complex multi-modal reasoning. In contrast to multiple-choice VideoQA, which aims to predict the answer given several options, the goal of open-ended VideoQA is to answer questions without restricting the candidate answers. However, the majority of previous VideoQA models formulate open-ended VideoQA as a classification task that assigns each video-question pair to a fixed answer set, i.e., a closed vocabulary, which contains only frequent answers (e.g., the top-1000 answers). This biases the model toward frequent answers and causes it to fail to generalize to out-of-vocabulary answers. We hence propose a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models by considering rare and unseen answers. In addition, to improve the model's generalization power, we introduce a novel GNN-based soft verbalizer that enhances prediction on rare and unseen answers by aggregating information from their similar words. For evaluation, we introduce new baselines by modifying existing (closed-vocabulary) open-ended VideoQA models and improve their performance by further taking rare and unseen answers into account. Our ablation studies and qualitative analyses demonstrate that the GNN-based soft verbalizer further improves model performance, especially on rare and unseen answers. We hope that our benchmark OVQA can serve as a guide for evaluating the generalizability of VideoQA models and inspire future research. Code is available at https://github.com/mlvlab/OVQA.
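The abstract's central idea is that a soft verbalizer can let rare and unseen answers borrow evidence from semantically similar answer words via graph message passing. The sketch below is only a minimal illustration of that idea under our own assumptions, not the paper's implementation: the k-nearest-neighbor cosine-similarity graph over answer embeddings, the single round of message passing, and all names (SoftVerbalizer, build_knn_graph, the k and hidden_dim parameters) are hypothetical choices made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_knn_graph(answer_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Row-normalized adjacency over answers from cosine k-NN.

    answer_emb: (V, d) embeddings of all answer words/phrases
    (frequent, rare, and unseen). Returns a dense (V, V) matrix.
    """
    emb = F.normalize(answer_emb, dim=-1)
    sim = emb @ emb.t()                              # cosine similarities
    topk = sim.topk(k + 1, dim=-1).indices           # k neighbors + self
    adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    adj = (adj + adj.t()).clamp(max=1.0)             # symmetrize
    return adj / adj.sum(dim=-1, keepdim=True)       # row-normalize


class SoftVerbalizer(nn.Module):
    """Hypothetical GNN-based soft verbalizer: propagates answer
    embeddings over the answer graph so that rare/unseen answers
    aggregate information from their similar words before scoring."""

    def __init__(self, answer_emb: torch.Tensor, hidden_dim: int, k: int = 5):
        super().__init__()
        self.register_buffer("adj", build_knn_graph(answer_emb, k))
        self.register_buffer("answer_emb", answer_emb)
        self.proj = nn.Linear(answer_emb.size(-1), hidden_dim)

    def forward(self, video_question_feat: torch.Tensor) -> torch.Tensor:
        # One round of message passing, then score every answer,
        # including those never seen during training.
        smoothed = self.adj @ self.answer_emb            # (V, d)
        weights = self.proj(smoothed)                    # (V, hidden_dim)
        return video_question_feat @ weights.t()         # (B, V) logits


# Toy usage: 1,500 answer words with 300-d GloVe-style vectors.
vocab_emb = torch.randn(1500, 300)
verbalizer = SoftVerbalizer(vocab_emb, hidden_dim=768, k=5)
logits = verbalizer(torch.randn(4, 768))   # (4, 1500) scores over all answers
```

Because the classifier weights are derived from graph-smoothed answer embeddings rather than learned per-class parameters, new answers can be scored at test time simply by adding their embeddings to the vocabulary graph.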
