Video Question Answering with Iterative Video-Text Co-Tokenization

08/01/2022
by AJ Piergiovanni, et al.

Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new iterative video-text co-tokenization approach to answer a variety of questions about videos. We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA, and it outperforms the previous state-of-the-art by large margins. At the same time, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.

research
05/03/2017

The Forgettable-Watcher Model for Video Question Answering

A number of visual question answering approaches have been proposed rece...
research
06/02/2022

Structured Two-stream Attention Network for Video Question Answering

To date, visual question answering (VQA) (i.e., image QA and video QA) i...
research
01/05/2021

End-to-End Video Question-Answer Generation with Generator-Pretester Network

We study a novel task, Video Question-Answer Generation (VQAG), for chal...
research
07/05/2019

Video Question Generation via Cross-Modal Self-Attention Networks Learning

Video Question Answering (Video QA) is a critical and challenging task i...
research
09/08/2022

Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

Multi-modal video question answering aims to predict correct answer and ...
research
10/05/2022

Locate before Answering: Answer Guided Question Localization for Video Question Answering

Video question answering (VideoQA) is an essential task in vision-langua...
research
07/31/2019

Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

Understanding and conversing about dynamic scenes is one of the key capa...
