VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

10/24/2022
by Sahithya Ravi, et al.

There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed analysis, we explain which questions benefit, and which don't, from contextualized commonsense knowledge from COMET.
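The generate-select-encode pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the COMET checkpoint name, the chosen ATOMIC relations, and the similarity-based selection heuristic are assumptions for the sketch, not the authors' released implementation, and the final text concatenation stands in for however the knowledge is actually fused with the visual and textual inputs inside VLC-BERT.

# Hypothetical sketch of the generate -> select -> encode steps.
# Checkpoint names, relations, and thresholds are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, util

# 1. Generate: expand the question with COMET inferences for a few ATOMIC relations.
COMET_NAME = "mismayil/comet-bart-ai2"  # assumed COMET-ATOMIC-2020 checkpoint; swap in your own
comet_tok = AutoTokenizer.from_pretrained(COMET_NAME)
comet = AutoModelForSeq2SeqLM.from_pretrained(COMET_NAME)

def comet_expansions(question, relations=("xNeed", "xIntent", "AtLocation")):
    expansions = []
    for rel in relations:
        prompt = f"{question} {rel} [GEN]"  # standard COMET-ATOMIC 2020 input format
        ids = comet_tok(prompt, return_tensors="pt").input_ids
        outputs = comet.generate(ids, num_beams=5, num_return_sequences=3, max_length=24)
        expansions += [comet_tok.decode(o, skip_special_tokens=True).strip() for o in outputs]
    return expansions

# 2. Select: keep only the expansions that are semantically close to the question.
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def select_expansions(question, expansions, top_k=5):
    q_emb = sbert.encode(question, convert_to_tensor=True)
    e_emb = sbert.encode(expansions, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, e_emb)[0]
    ranked = sorted(zip(expansions, scores.tolist()), key=lambda pair: -pair[1])
    return [text for text, _ in ranked[:top_k]]

# 3. Encode: pass the selected knowledge along with the question (and, in the full model,
#    the image features) to the vision-language transformer.
question = "What sport can you use this for?"
knowledge = select_expansions(question, comet_expansions(question))
model_input_text = question + " [SEP] " + " [SEP] ".join(knowledge)
print(model_input_text)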

Related research

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge (06/03/2022)
The Visual Question Answering (VQA) task aspires to provide a meaningful...

From Two Graphs to N Questions: A VQA Dataset for Compositional Reasoning on Vision and Commonsense (08/08/2019)
Visual Question Answering (VQA) is a challenging task for evaluating the...

Improving Question Answering by Commonsense-Based Pre-Training (09/05/2018)
Although neural network approaches achieve remarkable success on a varie...

KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning (12/13/2020)
Reasoning is a critical ability towards complete visual understanding. T...

Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge (01/15/2021)
The limits of applicability of vision-and-language models are defined by...

REM-Net: Recursive Erasure Memory Network for Commonsense Evidence Refinement (12/24/2020)
When answering a question, people often draw upon their rich world knowl...

Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads (04/30/2021)
Vision-and-Language (VL) pre-training has shown great potential on many ...
