Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning

by Jian Zhu, et al.
Tongji University

A framework for Visual Commonsense Reasoning (VCR) must choose an answer and then provide a rationale that justifies it, based on a given image and question. The image contains all the facts needed for reasoning and must be understood sufficiently. Previous methods apply a detector to the image to obtain a set of visual objects without considering their exact positions in the scene, which is inadequate for properly understanding the spatial and semantic relationships between objects. In addition, VCR samples are quite diverse, so the parameters of the framework tend to be trained suboptimally on mini-batches. To address these challenges, this paper proposes a pseudo 3D perception Transformer with multi-level confidence optimization, named PPTMCO, for VCR. Specifically, image depth is introduced to represent the pseudo 3-dimensional (3D) positions of objects, together with their 2-dimensional (2D) coordinates in the image, and to further enhance visual features. Then, because relationships between objects are influenced by depth, a depth-aware Transformer is proposed whose attention is guided by depth differences, both from answer words to objects and from objects to objects, where each word is tagged with a pseudo depth value according to its related objects. To better optimize the parameters of the framework, a model parameter estimation method is further proposed that integrates the parameters optimized on individual mini-batches as a weighted combination based on multi-level reasoning confidence. Experiments on the benchmark VCR dataset demonstrate that the proposed framework performs better than state-of-the-art approaches.
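The idea of attention guided by depth differences can be illustrated with a minimal sketch. The abstract does not give the actual formulation used in PPTMCO, so everything here is an assumption made for illustration: the function name `depth_biased_attention`, the `penalty` parameter, and the choice to subtract a penalty proportional to the absolute depth difference from ordinary scaled dot-product scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_biased_attention(queries, keys, values, q_depths, k_depths, penalty=1.0):
    """Scaled dot-product attention with a depth-difference penalty.

    Hypothetical formulation (not taken from the paper):
        score[i][j] = (q_i . k_j) / sqrt(d) - penalty * |depth_i - depth_j|
    so that items at similar pseudo depth attend to each other more strongly.
    Returns (outputs, attention_weights).
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q, q_depth in zip(queries, q_depths):
        scores = []
        for k, k_depth in zip(keys, k_depths):
            dot = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            scores.append(dot - penalty * abs(q_depth - k_depth))
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    # One answer word (query) and two objects (keys) with identical features
    # but different pseudo depths; the word is tagged with depth 0.2.
    word = [[1.0, 0.0]]
    objects = [[1.0, 0.0], [1.0, 0.0]]
    object_values = [[1.0, 0.0], [0.0, 1.0]]
    _, weights = depth_biased_attention(
        word, objects, object_values, [0.2], [0.2, 0.9], penalty=5.0
    )
    print(weights)
```

In this sketch, setting `penalty` to zero recovers ordinary scaled dot-product attention, and a larger value shifts attention toward the object whose pseudo depth is closer to that of the word.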




SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Answering complex questions about images is an ambitious goal for machin...

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Visual Dialog requires an agent to engage in a conversation with humans ...

Attention Mechanism based Cognition-level Scene Understanding

Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...

Learning Visual Commonsense for Robust Scene Graph Generation

Scene graph generation models understand the scene through object and pr...

CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension

The task of multimodal referring expression comprehension (REC), aiming ...

A Multi-Level Approach to Waste Object Segmentation

We address the problem of localizing waste objects from a color image an...

Detecting Visual Relationships with Deep Relational Networks

Relationships among objects play a crucial role in image understanding. ...