Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

06/06/2016
by Akira Fukui, et al.

Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge.
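To make the pooling step concrete, below is a minimal NumPy sketch of compact bilinear pooling via Count Sketch and FFT, the construction MCB builds on. It is illustrative only: the function names, the random hash/sign generation, and the 16,000-dimensional output size are assumptions for this example, not the authors' released implementation. The idea is that projecting each modality's vector with a fixed Count Sketch and then taking the circular convolution of the two sketches (computed cheaply in the frequency domain) yields a compact approximation of their outer product.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dimensions with Count Sketch hash h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(v, q, d=16000, seed=0):
    """Approximate the outer product of v and q in d dimensions
    (Multimodal Compact Bilinear pooling, sketched with NumPy)."""
    rng = np.random.default_rng(seed)
    # Hash and sign functions for each modality. In a real model these are
    # sampled once and held fixed; re-sampling here is only for brevity.
    h_v = rng.integers(0, d, size=v.shape[0])
    s_v = rng.choice([-1.0, 1.0], size=v.shape[0])
    h_q = rng.integers(0, d, size=q.shape[0])
    s_q = rng.choice([-1.0, 1.0], size=q.shape[0])
    psi_v = count_sketch(v, h_v, s_v, d)
    psi_q = count_sketch(q, h_q, s_q, d)
    # Circular convolution of the sketches approximates the sketch of the
    # outer product; the FFT makes it O(d log d) instead of O(d^2).
    return np.real(np.fft.ifft(np.fft.fft(psi_v) * np.fft.fft(psi_q)))

# Example: fuse a 2048-d visual feature with a 2400-d question embedding
# (dimensions chosen for illustration).
v = np.random.randn(2048)
q = np.random.randn(2400)
fused = mcb_pool(v, q)  # d-dimensional fused representation
```

In the full architecture this operation would run batched on the GPU, applied once per spatial location to predict attention and once more to fuse the attended visual feature with the question representation.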
