OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

05/07/2023
by   Nghia Hieu Nguyen, et al.

In recent years, visual question answering (VQA) has attracted attention from the research community because of its potentially impactful applications (such as virtual assistants in intelligent cars, assistive devices for blind people, or information retrieval from document images using natural language as queries) and its inherent challenges. The VQA task requires methods that can fuse information from questions and images to produce appropriate answers. Neural VQA models have achieved tremendous growth on large-scale datasets, which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task to an answer-selection or answer-classification task. We argue that this form of VQA is far from human ability and removes the challenge of the answering aspect of the task, since answers are merely selected rather than generated. In this paper, we introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering) dataset, the first large-scale dataset for VQA with open-ended answers in Vietnamese, consisting of 11,000+ images associated with 37,000+ question-answer pairs (QAs). Moreover, we propose FST, QuMLAG, and MLPAG, which fuse information from images and answers, then use these fused features to construct answers iteratively, as humans do. Our proposed methods achieve results competitive with SOTA models such as SAAA, MCAN, LoRRA, and M4C. The dataset is available to encourage the research community to develop more generalized algorithms, including transformers, for low-resource languages such as Vietnamese.
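The two ingredients the abstract names — fusing visual and textual features, then generating the answer token by token rather than classifying it — can be illustrated with a minimal sketch. This is not the paper's FST/QuMLAG/MLPAG architecture; it is a toy numpy example with invented dimensions and a random toy vocabulary, showing cross-attention fusion followed by a greedy iterative decoding loop:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: question tokens attend over image regions
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 16
question = rng.normal(size=(5, d))   # 5 question-token embeddings (toy)
image = rng.normal(size=(9, d))      # 9 image-region features (toy)

# fuse: each question token becomes a mixture of image-region features
fused = cross_attention(question, image, image)   # shape (5, d)

# iterative (open-ended) answer generation instead of answer classification:
# at each step, score a toy vocabulary against the running fused context and
# feed the chosen token back in, mimicking autoregressive decoding
vocab = rng.normal(size=(20, d))     # hypothetical 20-word embedding table
context = fused.mean(axis=0)
answer = []
for _ in range(3):                   # generate a 3-token answer
    logits = vocab @ context
    tok = int(np.argmax(logits))
    answer.append(tok)
    context = context + vocab[tok]   # condition the next step on this token
```

The contrast with answer classification is the loop: a classifier would score a fixed answer set once, while open-ended generation composes the answer one token at a time from the fused features.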


