Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

09/08/2022
by   Jiong Wang, et al.

Multi-modal video question answering aims to predict the correct answer and to localize the temporal boundary relevant to the question. Temporal annotations of questions improve the QA performance and interpretability of recent works, but they are usually empirical and costly to obtain. To avoid temporal annotations, we devise a weakly supervised question grounding (WSQG) setting in which only QA annotations are used and the relevant temporal boundaries are generated from temporal attention scores. As a substitute for temporal annotations, we transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps optimize the temporal attention scores and thereby improves video-language understanding in the VideoQA model. Extensive experiments on the TVQA and TVQA+ datasets demonstrate that the proposed WSQG strategy achieves comparable performance on question grounding, and that FS self-supervision improves both question answering and grounding performance under the QA-supervision-only and full-supervision settings.
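The two ingredients of the approach can be sketched in a few lines. The helper names below (`attention_to_boundary`, `subtitle_alignment_mask`) and the max-ratio thresholding rule are illustrative assumptions, not the paper's exact formulation: the first turns per-frame temporal attention scores into a (start, end) boundary, and the second builds a frame-level mask from a subtitle's timestamps, the kind of free alignment signal FS self-supervision exploits.

```python
import numpy as np

def attention_to_boundary(scores, ratio=0.5):
    """Convert per-frame temporal attention scores into a (start, end)
    frame boundary by keeping frames whose score exceeds a fraction of
    the maximum score. The thresholding rule is a hypothetical sketch."""
    scores = np.asarray(scores, dtype=float)
    threshold = ratio * scores.max()
    keep = np.flatnonzero(scores >= threshold)
    return int(keep[0]), int(keep[-1])

def subtitle_alignment_mask(num_frames, fps, sub_start, sub_end):
    """Binary mask marking frames that fall inside a subtitle's time span.
    Because subtitles carry timestamps for free, such masks can supervise
    the temporal attention without manual boundary annotation."""
    times = np.arange(num_frames) / fps  # timestamp of each frame (seconds)
    return ((times >= sub_start) & (times <= sub_end)).astype(float)

# Example: attention peaks around frames 2-3.
start, end = attention_to_boundary([0.1, 0.2, 0.9, 0.8, 0.3], ratio=0.5)
print(start, end)  # -> 2 3
```

In training, a mask like the one from `subtitle_alignment_mask` could be used as a soft target for the attention distribution (e.g. via a cross-entropy or KL term), which is one plausible way the frame-subtitle correspondence substitutes for human-labeled temporal boundaries.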


Related research

- 08/01/2022 · Video Question Answering with Iterative Video-Text Co-Tokenization
  Video question answering is a challenging task that requires understandi...

- 12/19/2022 · MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
  To build Video Question Answering (VideoQA) systems capable of assisting...

- 09/04/2023 · Can I Trust Your Answer? Visually Grounded Video Question Answering
  We study visually grounded VideoQA in response to the emerging trends of...

- 05/28/2019 · Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering
  This paper proposes a method to gain extra supervision via multi-task le...

- 03/13/2022 · Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
  The temporal answering grounding in the video (TAGV) is a new task natur...

- 05/11/2023 · Self-Chained Image-Language Model for Video Localization and Question Answering
  Recent studies have shown promising results on utilizing pre-trained ima...

- 07/26/2022 · Equivariant and Invariant Grounding for Video Question Answering
  Video Question Answering (VideoQA) is the task of answering the natural ...
