Achieving Human Parity on Visual Question Answering

11/17/2021
by Ming Yan et al.

The Visual Question Answering (VQA) task combines visual and language analysis to answer a textual question about an image. It has been a popular research topic over the last decade, with a growing number of real-world applications. This paper describes our recent research on AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), which obtains results similar to, or even slightly better than, those of humans on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representations; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of our VQA architecture to the human level. Extensive experiments and analyses demonstrate the effectiveness of the proposed approach.
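
The abstract leaves the three pipeline components at a high level, so the sketch below is only an illustration, not the authors' implementation: a toy PyTorch module in which question tokens attend over image-region features (component 2, "learning to attend") and the pooled representation is routed through per-question-type expert heads (component 3). The class name, dimensions, and the four question-type buckets are assumptions; the 3,129-answer vocabulary follows the common VQA v2 classification setup.

```python
import torch
import torch.nn as nn

class ToyVQAModel(nn.Module):
    """Hypothetical sketch: cross-modal attention plus per-type expert heads."""
    def __init__(self, dim=768, num_heads=12, num_answers=3129, num_experts=4):
        super().__init__()
        # (2) "learning to attend": question tokens attend over image regions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # (3) specialized expert modules, one per coarse question type
        # (e.g. yes/no, counting, text reading, other -- an assumption).
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                          nn.Linear(dim, num_answers))
            for _ in range(num_experts)
        )

    def forward(self, text_feats, visual_feats):
        # text_feats:   (batch, num_tokens,  dim) from a text encoder
        # visual_feats: (batch, num_regions, dim) from a visual backbone
        attended, _ = self.cross_attn(query=text_feats,
                                      key=visual_feats,
                                      value=visual_feats)
        fused = self.norm(text_feats + attended).mean(dim=1)  # pool tokens
        weights = self.router(fused).softmax(dim=-1)          # (batch, E)
        logits = torch.stack([e(fused) for e in self.experts], dim=1)
        # Soft mixture over experts; a hard argmax route would also work.
        return (weights.unsqueeze(-1) * logits).sum(dim=1)    # (batch, A)

# Toy usage with random tensors standing in for real encoder outputs.
model = ToyVQAModel()
out = model(torch.randn(2, 16, 768), torch.randn(2, 36, 768))
print(out.shape)  # torch.Size([2, 3129])
```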
