Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

05/09/2022
by Min Peng, et al.

Video question answering (VideoQA) is challenging because it combines visual understanding with natural language processing. Most existing approaches ignore visual appearance-motion information at different temporal scales, and it remains unclear how to combine the multilevel processing capacity of a deep learning model with such multiscale information. To address these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules: Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). Using multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations. Then, with a shared transformer encoder, PVR infers the visual cues at each level in parallel, so that different question types can be answered from the visual information at the relevant levels. Through extensive experiments on three VideoQA datasets, we demonstrate improved performance over previous state-of-the-art methods and justify the effectiveness of each part of our method.
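To make the idea of multiscale sampling more concrete, the following is a minimal sketch of how frame indices could be drawn at several temporal scales. It is not the authors' implementation: the function name `multiscale_sample`, the rule that scale `s` splits the video into `2**s` equal segments, and the evenly spaced frame selection are all assumptions made for illustration, since the abstract does not specify the sampling scheme.

```python
from typing import List


def multiscale_sample(num_frames: int, num_scales: int, clip_len: int) -> List[List[List[int]]]:
    """Return, for each scale, a list of clips; each clip is a list of frame indices.

    Assumption (hypothetical, not taken from the paper): scale s (0-based)
    splits the video into 2**s equal segments and draws `clip_len` evenly
    spaced frames from each segment.
    """
    if num_frames < 1 or num_scales < 1 or clip_len < 1:
        raise ValueError("num_frames, num_scales and clip_len must be positive")

    pyramid = []
    for s in range(num_scales):
        num_clips = 2 ** s                  # finer scales yield more, shorter clips
        seg_len = num_frames / num_clips    # segment length in frames (may be fractional)
        step = seg_len / clip_len           # spacing between sampled frames
        clips = []
        for c in range(num_clips):
            start = c * seg_len
            # Take the midpoint of each of the clip_len sub-intervals,
            # clamped to the last valid frame index.
            clip = [min(num_frames - 1, int(start + (k + 0.5) * step))
                    for k in range(clip_len)]
            clips.append(clip)
        pyramid.append(clips)
    return pyramid


if __name__ == "__main__":
    # A 64-frame video sampled at 3 scales with 4 frames per clip.
    for scale, clips in enumerate(multiscale_sample(64, 3, 4)):
        print(f"scale {scale}: {len(clips)} clip(s)")
```

Under these assumptions, the coarsest scale covers the whole video with a single clip, while finer scales cover shorter spans with more clips. In a model like the one described, the features of each scale would then interact with the question embeddings level by level.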

research
09/10/2021

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Video question answering (VideoQA) is challenging given its multimodal c...
research
06/24/2019

Adversarial Multimodal Network for Movie Question Answering

Visual question answering by using information from multiple modalities ...
research
04/29/2021

Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering

This paper presents a novel method, termed Bridge to Answer, to infer co...
research
07/31/2019

Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

Understanding and conversing about dynamic scenes is one of the key capa...
research
06/28/2019

Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

Open-ended video question answering aims to automatically generate the n...
research
07/17/2023

PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

We present in this paper a novel scheme for multimodal learning named th...
research
03/20/2021

Paying Attention to Multiscale Feature Maps in Multimodal Image Matching

We propose an attention-based approach for multimodal image patch matchi...
