The Visual QA Devil in the Details: The Impact of Early Fusion and Batch Norm on CLEVR

09/11/2018
by Mateusz Malinowski et al.

Visual QA is a pivotal challenge for higher-level reasoning, requiring an understanding of language, vision, and the relationships between many objects in a scene. Although datasets like CLEVR are designed to be unsolvable without such complex relational reasoning, some surprisingly simple feed-forward, "holistic" models have recently shown strong performance on this dataset. These models lack any explicit iterative, symbolic reasoning procedure, which is hypothesized to be necessary for counting objects, narrowing down the set of relevant objects based on several attributes, and similar operations. The reason for this strong performance is poorly understood. Hence, our work analyzes such models and finds that minor architectural elements are crucial to performance. In particular, we find that early fusion of language and vision provides large performance improvements, in contrast to the late fusion approaches that were popular at the dawn of Visual QA. We propose a simple module, which we call the Multimodal Core, that we hypothesize performs the fundamental operations for multimodal tasks. We believe that understanding why these elements are so important to complex question answering will aid the design of better-performing algorithms for Visual QA while minimizing hand-engineering effort.
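To make the early-fusion idea concrete, here is a minimal PyTorch sketch of a fusion block in the spirit the abstract describes: the question embedding is tiled over the spatial grid of CNN features and concatenated along the channel dimension before any further processing, with batch norm applied after the fusion convolution. The class name `EarlyFusion`, the 1x1 convolution, and all layer sizes are illustrative assumptions, not the paper's exact Multimodal Core.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Illustrative early-fusion block (hypothetical, not the paper's
    exact architecture): broadcast a question embedding over the spatial
    grid of CNN features, concatenate along channels, then mix with a
    convolution followed by batch norm and ReLU."""

    def __init__(self, vis_channels=1024, q_dim=256, out_channels=512):
        super().__init__()
        # Fuse language and vision with a 1x1 conv over the concatenated channels.
        self.conv = nn.Conv2d(vis_channels + q_dim, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_channels)  # batch norm right after fusion
        self.relu = nn.ReLU(inplace=True)

    def forward(self, vis_feats, q_emb):
        # vis_feats: (B, C, H, W) CNN feature map; q_emb: (B, q_dim) question embedding.
        B, _, H, W = vis_feats.shape
        # Tile the question embedding across every spatial location.
        q_tiled = q_emb[:, :, None, None].expand(B, q_emb.size(1), H, W)
        # Early fusion: concatenate language with vision before any reasoning layers.
        fused = torch.cat([vis_feats, q_tiled], dim=1)
        return self.relu(self.bn(self.conv(fused)))

if __name__ == "__main__":
    block = EarlyFusion()
    out = block(torch.randn(2, 1024, 14, 14), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 512, 14, 14])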

