PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

07/17/2023
by Nghia Hieu Nguyen, et al.

In this paper, we present a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to exploit the grammatical and contextual characteristics of Vietnamese, we propose the Hierarchical Linguistic Features Extractor in place of an LSTM network for extracting linguistic features. Based on these two novel modules, we introduce the Parallel Attention Transformer (PAT), which achieves the best accuracy on the benchmark ViVQA dataset, outperforming all baselines as well as other SOTA methods, including SAAA and MCAN.
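To make the idea concrete, below is a minimal PyTorch sketch of one plausible reading of a "parallel attention" block: self-attention within a modality and cross-attention over the other modality run side by side on the same input and are fused, rather than being stacked sequentially as in a standard Transformer decoder layer. The module name (ParallelAttentionBlock), the feature dimension, and the tensor shapes are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ParallelAttentionBlock(nn.Module):
    """Hypothetical sketch: self-attention and cross-attention are computed
    in parallel over the same input and fused, instead of sequentially."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, context):
        # Both attention branches read the same input x; one attends within
        # the modality, the other attends to the second modality (context).
        self_out, _ = self.self_attn(x, x, x)
        cross_out, _ = self.cross_attn(x, context, context)
        x = self.norm1(x + self_out + cross_out)  # fuse the parallel branches
        x = self.norm2(x + self.ffn(x))
        return x


if __name__ == "__main__":
    vision = torch.randn(2, 49, 512)    # e.g. grid features from an image
    language = torch.randn(2, 20, 512)  # e.g. token features from a question
    block = ParallelAttentionBlock()
    fused_q = block(language, vision)   # question tokens attend to image regions
    fused_v = block(vision, language)   # image regions attend to question tokens
    print(fused_q.shape, fused_v.shape)
```

In this reading, the design choice is that neither modality is privileged: each stream can be refined against the other symmetrically, and the self- and cross-attention results are summed before normalization rather than chained.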
