Interpretable Visual Understanding with Cognitive Attention Network

08/06/2021
by   Xuejiao Tang, et al.

While recognition-level image understanding has advanced remarkably, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text. Second, a novel inference module is designed to encode commonsense among the image, query, and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN
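
As a rough illustration of what fusing image and text with attention can look like, the sketch below lets each text token attend over a set of image-region features and concatenates the result onto the token. This is not the CAN implementation (see the repository linked above for that); the function names `attend` and `fuse`, the scaled dot-product scoring, and the concatenation step are all assumptions chosen only to make the idea concrete.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, regions):
    """Hypothetical helper: scaled dot-product attention.

    For each query vector, returns the attention-weighted average
    of the region vectors.
    """
    dim = len(regions[0])
    scale = math.sqrt(dim)
    contexts = []
    for q in queries:
        weights = softmax([dot(q, r) / scale for r in regions])
        context = [
            sum(w * r[i] for w, r in zip(weights, regions))
            for i in range(dim)
        ]
        contexts.append(context)
    return contexts


def fuse(text_tokens, image_regions):
    """Hypothetical fusion step (an assumption, not the paper's module):
    concatenate each text token with its attended visual context."""
    visual_context = attend(text_tokens, image_regions)
    return [tok + ctx for tok, ctx in zip(text_tokens, visual_context)]


# Toy features: one text token, two image regions, all 2-dimensional.
tokens = [[1.0, 0.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse(tokens, regions)
```

Each fused vector here has twice the input dimension: the first half is the original token, the second half is the visual context it attended to. In a VCR-style setting, query and response tokens would both be fused this way before any downstream reasoning step.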

research
04/17/2022

Attention Mechanism based Cognition-level Scene Understanding

Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...
research
07/04/2021

Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory

Visual Commonsense Reasoning (VCR) predicts an answer with corresponding...
research
02/04/2023

Learning to Agree on Vision Attention for Visual Commonsense Reasoning

Visual Commonsense Reasoning (VCR) remains a significant yet challenging...
research
11/10/2022

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Visual commonsense understanding requires Vision Language (VL) models to...
research
04/17/2023

SCANet: Self-Paced Semi-Curricular Attention Network for Non-Homogeneous Image Dehazing

The presence of non-homogeneous haze can cause scene blurring, color dis...
research
07/17/2020

Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

Image-text matching plays a central role in bridging vision and language...
research
05/04/2022

Great Truths are Always Simple: A Rather Simple Knowledge Encoder for Enhancing the Commonsense Reasoning Capacity of Pre-Trained Models

Commonsense reasoning in natural language is a desired ability of artifi...