Multi-modality Latent Interaction Network for Visual Question Answering

08/10/2019
by Peng Gao, et al.

Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which is not enough to answer the question correctly. From a human perspective, answering a visual question requires understanding summarizations of the visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities and is used to update the visual and word features. Such MLI modules can be stacked for several stages to model complex and latent relations between the two modalities, and they achieve highly competitive performance on the public VQA benchmarks VQA v2.0 and TDIUC. In addition, we show that the performance of our method can be significantly improved by combining it with the pre-trained language model BERT.
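
The data flow described in the abstract (summarize each modality, relate the summaries across modalities, propagate, then update the original features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the softmax pooling used for summarization, the element-wise product used to pair summaries, the dot-product attention used for propagation and update, and all names (`summarize`, `attend`, `mli_stage`, `mli_stack`, `w_v`, `w_q`) are assumptions chosen only to make the shapes concrete.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def summarize(feats, w):
    # Assumed pooling: feats (n, d) and w (d, k) give k latent summaries (k, d).
    # Each summary is a convex combination of the n input features.
    weights = softmax(feats @ w, axis=0)
    return weights.T @ feats


def attend(queries, keys, values):
    # Plain scaled dot-product attention (an assumption, not from the source).
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    return softmax(scores, axis=1) @ values


def mli_stage(visual, words, w_v, w_q):
    d = visual.shape[1]
    s_v = summarize(visual, w_v)  # (k, d) latent visual summaries
    s_q = summarize(words, w_q)   # (k, d) latent language summaries
    # Pair every visual summary with every language summary: (k * k, d).
    pairs = (s_v[:, None, :] * s_q[None, :, :]).reshape(-1, d)
    # Propagate information among the latent cross-modality pairs.
    pairs = pairs + attend(pairs, pairs, pairs)
    # Update the original region and word features from the latent pairs.
    visual_out = visual + attend(visual, pairs, pairs)
    words_out = words + attend(words, pairs, pairs)
    return visual_out, words_out


def mli_stack(visual, words, params):
    # Stack several stages; params is a list of (w_v, w_q) projection pairs.
    for w_v, w_q in params:
        visual, words = mli_stage(visual, words, w_v, w_q)
    return visual, words


# Hypothetical sizes: 36 regions, 14 words, 16-dim features, 6 summaries, 3 stages.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 16))
tokens = rng.normal(size=(14, 16))
stages = [(rng.normal(size=(16, 6)), rng.normal(size=(16, 6))) for _ in range(3)]
regions_out, tokens_out = mli_stack(regions, tokens, stages)
print(regions_out.shape, tokens_out.shape)
```

With k summaries per modality, each stage relates k × k latent pairs rather than every region-word pair, which is the saving the abstract points to. The projections here are random placeholders; in a trained model they would be learned.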

12/14/2021

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Answering semantically-complicated questions according to an image is ch...

01/25/2022

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Visual Question Answering (VQA) attracts much attention from both indust...

05/12/2020

Cross-Modality Relevance for Reasoning on Language and Vision

This work deals with the challenge of learning and reasoning over langua...

10/27/2020

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings),...

12/06/2019

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

The large adoption of the self-attention (i.e. transformer model) and BE...

01/25/2022

MGA-VQA: Multi-Granularity Alignment for Visual Question Answering

Learning to answer visual questions is a challenging task since the mult...

05/28/2019

Leveraging Medical Visual Question Answering with Supporting Facts

In this working notes paper, we describe IBM Research AI (Almaden) team'...