Barlow constrained optimization for Visual Question Answering

03/07/2022
by   Abhishek Jha, et al.

Visual question answering (VQA) is a vision-and-language multimodal task that aims at predicting answers given samples from the question and image modalities. Most recent methods focus on learning a good joint embedding space of images and questions, either by improving the interaction between these two modalities or by making the space more discriminative. However, how informative this joint space is has not been well explored. In this paper, we propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB), that improves the information content of the joint space by minimizing redundancy. It reduces the correlation between the learned feature components and thereby disentangles semantic concepts. Our model also aligns the joint space with the answer embedding space, where we consider the answer and the image+question as two different `views' of what is in essence the same semantic information. We propose a constrained optimization policy to balance the categorical and redundancy-minimization forces. When built on the state-of-the-art GGE model, the resulting model improves VQA accuracy by 1.4% and exhibits better interpretability.
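The redundancy-minimization idea above follows Barlow Twins-style objectives: treat the image+question embedding and the answer embedding as two views, align them, and decorrelate the feature components. The sketch below is a generic Barlow-style loss under that assumption, not the paper's exact COB formulation; the weight `lamb` and the embedding shapes are illustrative.

```python
import torch

def barlow_redundancy_loss(z_a, z_b, lamb=5e-3):
    """Barlow Twins-style loss between two 'views' of the same semantics.

    z_a: joint image+question embeddings, shape (batch, dim)
    z_b: answer embeddings, shape (batch, dim)
    lamb: illustrative trade-off weight for the off-diagonal (redundancy) term
    """
    N, D = z_a.shape
    # Standardize each feature dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # Cross-correlation matrix between the two views.
    c = (z_a.T @ z_b) / N
    diag = torch.diagonal(c)
    # Invariance term: pull each feature's cross-correlation toward 1
    # (aligns the joint space with the answer space).
    on_diag = ((diag - 1) ** 2).sum()
    # Redundancy term: push all off-diagonal correlations toward 0
    # (decorrelates feature components, disentangling concepts).
    off_diag = (c ** 2).sum() - (diag ** 2).sum()
    return on_diag + lamb * off_diag
```

In training, such a term would be added to the usual categorical answer-classification loss; the paper's contribution is a constrained optimization policy for balancing the two, rather than a fixed weight like `lamb` here.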


