Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering

01/24/2018
by Zhe Wang, et al.

Visual question answering (VQA) is of significant interest due to its potential to serve as a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task and provide good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms for incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question, and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves state-of-the-art performance of 68.2% on Visual7W and a competitive 69.6% on VQA Real Multiple Choice.
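To make the POS-tag guided attention idea concrete, here is a minimal, hypothetical sketch (not the paper's actual model): each question word receives a scalar score determined by its POS tag, a softmax over those scores yields attention weights, and the question representation is the weighted sum of word embeddings. All tag scores and embeddings below are toy values for illustration.

```python
# Hypothetical sketch of POS-tag guided attention over question words.
# Tag scores and embeddings are toy values, not learned parameters.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy question with Penn-Treebank-style POS tags
# (WP = wh-pronoun, VB = verb, DT = determiner, NN = noun).
tokens = ["what", "is", "the", "dog", "chasing"]
pos_tags = ["WP", "VB", "DT", "NN", "VB"]

# Per-tag scalar scores (assumption: content words such as nouns and
# verbs are weighted above function words).
tag_score = {"NN": 2.0, "VB": 1.5, "WP": 0.5, "DT": 0.1}

# Toy word embeddings: one 4-dimensional vector per token.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), 4))

# POS-guided attention: softmax over the per-tag scores gives one
# weight per word; the question vector is the weighted sum.
weights = softmax(np.array([tag_score[t] for t in pos_tags]))
question_vec = weights @ embeddings  # shape (4,)
```

With these toy scores, the noun "dog" receives the largest attention weight, which matches the intuition that content words carry most of the question's semantics.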


Related research

06/27/2016: Revisiting Visual Question Answering Baselines
Visual question answering (VQA) is an interesting learning setting for e...

10/05/2020: Attention Guided Semantic Relationship Parsing for Visual Question Answering
Humans explain inter-object relationships with semantic labels that demo...

01/16/2021: Latent Variable Models for Visual Question Answering
Conventional models for Visual Question Answering (VQA) explore determin...

10/09/2018: Knowing Where to Look? Analysis on Attention of Visual Question Answering System
Attention mechanisms have been widely used in Visual Question Answering ...

09/25/2017: Can you fool AI with adversarial examples on a visual Turing test?
Deep learning has achieved impressive results in many areas of Computer ...

04/16/2016: Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
This paper proposes deep convolutional network models that utilize local...

01/11/2022: On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
In recent years, multi-modal transformers have shown significant progres...
