Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

06/28/2019
by   Zhu Zhang, et al.
1

Open-ended video question answering aims to automatically generate the natural-language answer from referenced video contents according to the given question. Currently, most existing approaches focus on short-form video question answering with multi-modal recurrent encoder-decoder networks. Although these works have achieved promising performance, they may still be ineffectively applied to long-form video question answering due to the lack of long-range dependency modeling and the suffering from the heavy computational cost. To tackle these problems, we propose a fast Hierarchical Convolutional Self-Attention encoder-decoder network(HCSA). Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context. We then devise a multi-scale attentive decoder to incorporate multi-layer video representations for answer generation, which avoids the information missing of the top encoder layer. The extensive experiments show the effectiveness and efficiency of our method.

READ FULL TEXT

page 1

page 3

page 6

research
07/20/2017

Video Question Answering via Attribute-Augmented Attention Network Learning

Video Question Answering is a challenging problem in visual information ...
research
04/04/2022

Long Movie Clip Classification with State-Space Video Models

Most modern video recognition models are designed to operate on short vi...
research
09/27/2021

Improving Stack Overflow question title generation with copying enhanced CodeBERT model and bi-modal information

Context: Stack Overflow is very helpful for software developers who are ...
research
11/07/2019

Blockwise Self-Attention for Long Document Understanding

We present BlockBERT, a lightweight and efficient BERT model that is des...
research
05/02/2019

Conditioning LSTM Decoder and Bi-directional Attention Based Question Answering System

Applying neural-networks on Question Answering has gained increasing pop...
research
11/15/2018

Exploiting Sentence Embedding for Medical Question Answering

Despite the great success of word embedding, sentence embedding remains ...
research
05/09/2022

Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Video question answering (VideoQA) is challenging given its multimodal c...

Please sign up or login with your details

Forgot password? Click here to reset