Structured Two-stream Attention Network for Video Question Answering

06/02/2022
by Lianli Gao, et al.

To date, visual question answering (VQA), covering both image QA and video QA, remains a holy grail in vision and language understanding, and video QA is especially challenging. Whereas image QA focuses primarily on understanding the associations between image region-level details and the corresponding questions, video QA requires a model to jointly reason over both the spatial and the long-range temporal structure of a video, together with the text, to produce an accurate answer. In this paper, we specifically tackle video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structure in videos using our structured segment component and encode the text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates the different segments of the query- and video-aware context representations and infers the answer. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by up to 13.0% on the Trans., FrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by between 4.1% and 5.1%.
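The abstract's two-stream idea can be illustrated with a heavily simplified sketch: cross-attention scores between video-segment features and question-word features drive two softmax-weighted streams (one over segments, one over words), whose context vectors are then fused. This is an assumption-laden toy, not the paper's actual STA architecture; all names (`two_stream_attention`, `w_v`, `w_q`) and the max-pooling of scores are illustrative choices of ours.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_attention(video_segments, question_words, w_v, w_q):
    """Toy two-stream attention (not the paper's exact formulation).

    video_segments: (n_seg, d) segment-level video features
    question_words: (n_word, d) word-level question features
    w_v, w_q:       (d, d) fusion projection matrices (hypothetical)
    """
    # cross-modal affinity between every segment and every word
    scores = video_segments @ question_words.T        # (n_seg, n_word)
    # visual stream: weight segments by their best-matching word
    alpha_v = softmax(scores.max(axis=1))             # (n_seg,)
    # textual stream: weight words by their best-matching segment
    alpha_q = softmax(scores.max(axis=0))             # (n_word,)
    v_ctx = alpha_v @ video_segments                  # (d,) video context
    q_ctx = alpha_q @ question_words                  # (d,) question context
    # fusion: combine both context vectors into one representation
    fused = np.tanh(w_v @ v_ctx + w_q @ q_ctx)        # (d,)
    return fused, alpha_v, alpha_q

rng = np.random.default_rng(0)
segs = rng.standard_normal((4, 8))    # 4 video segments
words = rng.standard_normal((5, 8))   # 5 question words
w = np.eye(8)
fused, a_v, a_q = two_stream_attention(segs, words, w, w)
print(fused.shape, a_v.shape, a_q.shape)  # (8,) (4,) (5,)
```

In a full model the fused vector would feed an answer decoder; attending over segments rather than raw frames is what lets such a design capture long-range temporal structure while down-weighting background content.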

Related research:

- 08/01/2022: Video Question Answering with Iterative Video-Text Co-Tokenization
- 03/29/2018: Motion-Appearance Co-Memory Networks for Video Question Answering
- 07/11/2019: Two-stream Spatiotemporal Feature for Video QA Task
- 09/04/2023: Can I Trust Your Answer? Visually Grounded Video Question Answering
- 08/17/2023: EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
- 11/07/2015: Stacked Attention Networks for Image Question Answering
- 12/18/2020: Trying Bilinear Pooling in Video-QA
