Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

08/04/2017
by Zhou Yu, et al.

Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner, and to fuse these multi-modal features, play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a co-attention mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB-with-co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code is available at https://github.com/yuzcccc/mfb.
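To make the fusion step concrete, below is a minimal NumPy sketch of factorized bilinear pooling: two modality features are projected into a shared space with learned factor matrices, combined by element-wise product, sum-pooled over windows of size k, then power- and l2-normalized. The dimensions, variable names, and the tiny random inputs here are illustrative assumptions, not the paper's exact hyperparameters or implementation.

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Factorized bilinear pooling of two feature vectors (illustrative sketch).

    x: image feature (m,), y: question feature (n,)
    U: (m, o*k), V: (n, o*k) learned projection factors
    k: sum-pooling window size; output has o = (o*k)/k dimensions
    """
    joint = (U.T @ x) * (V.T @ y)               # element-wise product in shared space
    pooled = joint.reshape(-1, k).sum(axis=1)   # sum-pool over non-overlapping windows
    z = np.sign(pooled) * np.sqrt(np.abs(pooled))  # power (signed sqrt) normalization
    return z / (np.linalg.norm(z) + 1e-12)         # l2 normalization

# Toy usage with hypothetical sizes: m=8, n=6, output dim o=4, window k=5.
rng = np.random.default_rng(0)
m, n, o, k = 8, 6, 4, 5
x, y = rng.normal(size=m), rng.normal(size=n)
U, V = rng.normal(size=(m, o * k)), rng.normal(size=(n, o * k))
z = mfb_pool(x, y, U, V, k)
print(z.shape)  # (4,)
```

The factorization is what keeps the method practical: a full bilinear map would need an m×n weight matrix per output dimension, while the factored form only needs the two thin projections U and V.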

Related research

- 08/10/2017: Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering
- 03/31/2020: X-Linear Attention Networks for Image Captioning
- 06/25/2019: Deep Modular Co-Attention Networks for Visual Question Answering
- 01/31/2019: BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
- 04/12/2020: YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos
- 08/11/2023: Detecting and Preventing Hallucinations in Large Vision Language Models
- 11/16/2015: Yin and Yang: Balancing and Answering Binary Visual Questions
