Locate before Answering: Answer Guided Question Localization for Video Question Answering

10/05/2022
by Tianwen Qian, et al.

Video question answering (VideoQA) is an essential task in vision-language understanding that has recently attracted considerable research attention. Nevertheless, existing methods mostly achieve promising performance on short videos lasting no more than 15 seconds. On minute-level long-term videos, these methods are likely to fail because they cannot handle the noise and redundancy caused by scene changes and multiple actions in the video. Because a question usually concerns only a short temporal range, we propose to first locate the question to a segment of the video and then infer the answer using the located segment only. Under this scheme, we propose "Locate before Answering" (LocAns), a novel approach that integrates a question locator and an answer predictor into an end-to-end model. During the training phase, the available answer label not only serves as the supervision signal for the answer predictor but is also used to generate pseudo temporal labels for the question locator. Moreover, we design a decoupled alternative training strategy to update the two modules separately. In the experiments, LocAns achieves state-of-the-art performance on two modern long-term VideoQA datasets, NExT-QA and ActivityNet-QA, and qualitative examples show that its question localization is reliable.
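The locate-then-answer scheme and the answer-guided pseudo labels can be illustrated with a minimal sketch. The abstract does not say how the pseudo temporal labels are computed, so the rule below (pick the segment whose answer scores give the ground-truth answer the highest probability) is an assumption made for illustration, not the authors' procedure. The function names `pseudo_temporal_label` and `locate_then_answer`, and the toy scores, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_temporal_label(segment_answer_logits, answer_index):
    """Assumed rule: the pseudo label is the segment that makes the
    ground-truth answer most probable. Not taken from the paper."""
    probs = [softmax(logits)[answer_index] for logits in segment_answer_logits]
    return max(range(len(probs)), key=probs.__getitem__)


def locate_then_answer(locator_scores, segment_answer_logits):
    """Inference: pick the segment the locator scores highest, then
    answer using that segment's answer scores only."""
    segment = max(range(len(locator_scores)), key=locator_scores.__getitem__)
    logits = segment_answer_logits[segment]
    answer = max(range(len(logits)), key=logits.__getitem__)
    return segment, answer


# Toy example: 3 video segments, 3 candidate answers (hypothetical scores).
answer_logits = [
    [0.1, 0.2, 0.0],
    [0.0, 3.0, 0.1],
    [1.0, 0.5, 0.2],
]
label = pseudo_temporal_label(answer_logits, answer_index=1)
prediction = locate_then_answer([0.2, 0.7, 0.1], answer_logits)
```

In this sketch, `label` would serve as the training target for the locator, while the answer label itself supervises the predictor. Under the decoupled strategy described above, the two modules would be updated in separate steps rather than jointly.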

research
08/01/2022

Video Question Answering with Iterative Video-Text Co-Tokenization

Video question answering is a challenging task that requires understandi...
research
09/14/2022

WildQA: In-the-Wild Video Question Answering

Existing video understanding datasets mostly focus on human interactions...
research
08/17/2023

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

We introduce EgoSchema, a very long-form video question-answering datase...
research
10/11/2022

Learning to Locate Visual Answer in Video Corpus Using Question

We introduce a new task, named video corpus visual answer localization (...
research
03/02/2022

Video Question Answering: Datasets, Algorithms and Challenges

Video Question Answering (VideoQA) aims to answer natural language quest...
research
05/11/2023

Self-Chained Image-Language Model for Video Localization and Question Answering

Recent studies have shown promising results on utilizing pre-trained ima...
research
07/04/2020

Modality Shifting Attention Network for Multi-modal Video Question Answering

This paper considers a network referred to as Modality Shifting Attentio...
