Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering

08/10/2017
by Zhou Yu, et al.

Visual question answering (VQA) is challenging because it requires simultaneous understanding of both the visual content of images and the textual content of questions. Supporting the VQA task requires good solutions to three issues: 1) fine-grained feature representations for both the image and the question; 2) multi-modal feature fusion that can capture the complex interactions between multi-modal features; and 3) automatic answer prediction that can account for the complex correlations between multiple diverse answers to the same question. For fine-grained image and question representations, a `co-attention' mechanism is developed using a deep neural network architecture that jointly learns the attentions for both the image and the question, effectively suppressing irrelevant features and producing more discriminative image and question representations. For multi-modal feature fusion, a generalized Multi-modal Factorized High-order pooling approach (MFH) is developed that fuses multi-modal features more effectively by fully exploiting their correlations, leading to superior VQA performance compared with state-of-the-art approaches. For answer prediction, the KL (Kullback-Leibler) divergence is used as the loss function to more accurately characterize the complex correlations between multiple diverse answers with the same or similar meanings, which yields a faster convergence rate and slightly better answer-prediction accuracy. A deep neural network architecture is designed to integrate all of these modules into one unified model that achieves superior VQA performance. With an ensemble of 9 models, we achieve state-of-the-art performance on the large-scale VQA datasets and finish as runner-up in the VQA Challenge 2017.
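To make the co-attention idea concrete, below is a minimal sketch of the question-guided half of such a mechanism (the question-attention half is symmetric), assuming PyTorch; the `GuidedAttention` name, layer sizes, and glimpse count are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Question-guided attention over image region features: one half of a
    co-attention scheme; attending the question words with the image feature
    is the symmetric counterpart. Hyper-parameters are illustrative."""
    def __init__(self, v_dim, q_dim, hidden=512, glimpses=2):
        super().__init__()
        self.v_fc = nn.Linear(v_dim, hidden)
        self.q_fc = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, glimpses)

    def forward(self, v, q):
        # v: (batch, regions, v_dim) region features; q: (batch, q_dim) question feature
        joint = torch.tanh(self.v_fc(v) + self.q_fc(q).unsqueeze(1))
        attn = F.softmax(self.score(joint), dim=1)      # softmax over regions
        # One attention-weighted sum of region features per glimpse.
        glimpsed = torch.einsum('brg,brd->bgd', attn, v)
        return glimpsed.flatten(1)                      # (batch, glimpses * v_dim)
```

The attended image and question vectors produced this way are what the fusion module below consumes.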
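The MFH fusion itself can be sketched as a cascade of factorized bilinear (MFB) blocks whose pre-pooled features are chained by element-wise multiplication, so that the i-th block captures higher-order interactions between the two modalities. A minimal sketch, again assuming PyTorch; the factor dimension, output dimension, block count, and dropout rate are assumed hyper-parameters, not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFH(nn.Module):
    """Multi-modal Factorized High-order pooling: a cascade of p factorized
    bilinear (MFB) blocks. Chaining each block's pre-pooled feature into the
    next by element-wise product raises the order of the interaction."""
    def __init__(self, x_dim, y_dim, factor_dim=5, out_dim=1000,
                 num_blocks=2, dropout=0.1):  # illustrative values
        super().__init__()
        self.k, self.o, self.p = factor_dim, out_dim, num_blocks
        # Each block projects both modalities into a (k * o)-dim space.
        self.x_proj = nn.ModuleList(nn.Linear(x_dim, factor_dim * out_dim)
                                    for _ in range(num_blocks))
        self.y_proj = nn.ModuleList(nn.Linear(y_dim, factor_dim * out_dim)
                                    for _ in range(num_blocks))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        batch = x.size(0)
        z_prev = x.new_ones(batch, self.k * self.o)  # neutral element of the cascade
        outputs = []
        for i in range(self.p):
            # Element-wise product fuses the modalities (bilinear term);
            # multiplying by z_prev lifts it to a higher-order interaction.
            z_exp = self.drop(z_prev * self.x_proj[i](x) * self.y_proj[i](y))
            z_prev = z_exp
            # Sum-pool over the factor dimension k.
            z = z_exp.view(batch, self.o, self.k).sum(dim=2)
            # Power normalization followed by L2 normalization.
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
            outputs.append(F.normalize(z, dim=1))
        # Final MFH feature: concatenation of all p block outputs.
        return torch.cat(outputs, dim=1)
```

With `num_blocks=1` the cascade reduces to plain bilinear MFB pooling, which is the sense in which MFH generalizes beyond bilinear fusion.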
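For the answer-prediction loss, the multiple (possibly disagreeing) human answers per question can be normalized into a soft target distribution and compared against the model's predicted distribution with KL divergence, rather than one-hot cross-entropy. A hedged sketch in PyTorch; the `kld_answer_loss` helper and the count-based soft-target construction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kld_answer_loss(logits, answer_counts):
    """KL divergence between the predicted answer distribution and the
    empirical distribution of annotator answers (hypothetical helper).

    logits        : (batch, num_answers) raw scores from the model
    answer_counts : (batch, num_answers) how often annotators gave each
                    candidate answer (e.g. 10 answers per question)
    """
    # Normalize annotator counts into a soft target distribution.
    target = answer_counts / answer_counts.sum(dim=1, keepdim=True)
    log_pred = F.log_softmax(logits, dim=1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_pred, target, reduction='batchmean')
```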


