A Better Way to Attend: Attention with Trees for Video Question Answering

09/05/2019
by Hongyang Xue, et al.

We propose a new attention model for video question answering. The goal of an attention model is to focus on the most informative parts of the visual data, and attention mechanisms are now widely used. However, most existing visual attention mechanisms treat the question as a whole, ignoring word-level semantics: individual words can warrant different amounts of attention, and some need none at all. Nor do they consider the semantic structure of the sentences. Although the Extended Soft Attention (E-SA) model for video question answering leverages word-level attention, it performs poorly on long question sentences. In this paper, we propose the Heterogeneous Tree-structured Memory Network (HTreeMN) for video question answering. Our approach is built upon the syntax parse trees of the question sentences. HTreeMN treats words differently: visual words are processed with an attention module, while verbal words are not. It also exploits the semantic structure of the sentences by combining neighboring nodes according to the recursive structure of the parse trees, so that the understanding of the words and the video is propagated and merged from the leaves to the root. Furthermore, we build a hierarchical attention mechanism to distill the attended features. We evaluate our approach on two datasets. The experimental results show the superiority of our HTreeMN model over other attention models, especially on complex questions. Our code is available at https://github.com/ZJULearning/TreeAttention
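The leaves-to-root scheme described in the abstract can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names (`encode`, `attend`), the toy parse tree, the use of dot-product attention for visual words, and the tanh merge at internal nodes are all hypothetical choices made for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(word_vec, frames):
    # soft attention over video frame features (dot-product scoring,
    # an assumption for this sketch)
    scores = softmax(frames @ word_vec)
    return scores @ frames

def encode(node, frames, embed, visual_words):
    # Leaf: a word string. Visual words get an attended video feature
    # added to their embedding; verbal words use the embedding alone.
    if isinstance(node, str):
        h = embed[node]
        if node in visual_words:
            h = h + attend(h, frames)
        return h
    # Internal node: merge children recursively, propagating the
    # representations from leaves toward the root.
    return np.tanh(sum(encode(c, frames, embed, visual_words) for c in node))

# Toy example: a binary parse of "what color is the car"
rng = np.random.default_rng(0)
d = 4
frames = rng.standard_normal((3, d))          # 3 video frame features
words = ["what", "color", "is", "the", "car"]
embed = {w: rng.standard_normal(d) for w in words}
visual_words = {"color", "car"}                # assumed visual-word set
tree = (("what", "color"), ("is", ("the", "car")))

root = encode(tree, frames, embed, visual_words)
```

The root vector `root` would then feed the answer module; the hierarchical attention over intermediate node features mentioned in the abstract is omitted here for brevity.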


