Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning

01/30/2023
by   Jian Zhu, et al.
0

A framework performing Visual Commonsense Reasoning(VCR) needs to choose an answer and further provide a rationale justifying based on the given image and question, where the image contains all the facts for reasoning and requires to be sufficiently understood. Previous methods use a detector applied on the image to obtain a set of visual objects without considering the exact positions of them in the scene, which is inadequate for properly understanding spatial and semantic relationships between objects. In addition, VCR samples are quite diverse, and parameters of the framework tend to be trained suboptimally based on mini-batches. To address above challenges, pseudo 3D perception Transformer with multi-level confidence optimization named PPTMCO is proposed for VCR in this paper. Specifically, image depth is introduced to represent pseudo 3-dimension(3D) positions of objects along with 2-dimension(2D) coordinates in the image and further enhance visual features. Then, considering that relationships between objects are influenced by depth, depth-aware Transformer is proposed to do attention mechanism guided by depth differences from answer words and objects to objects, where each word is tagged with pseudo depth value according to related objects. To better optimize parameters of the framework, a model parameter estimation method is further proposed to weightedly integrate parameters optimized by mini-batches based on multi-level reasoning confidence. Experiments on the benchmark VCR dataset demonstrate the proposed framework performs better against the state-of-the-art approaches.

READ FULL TEXT

page 1

page 3

page 6

research
12/16/2021

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Answering complex questions about images is an ambitious goal for machin...
research
04/10/2022

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Visual Dialog requires an agent to engage in a conversation with humans ...
research
04/17/2022

Attention Mechanism based Cognition-level Scene Understanding

Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...
research
06/17/2020

Learning Visual Commonsense for Robust Scene Graph Generation

Scene graph generation models understand the scene through object and pr...
research
05/12/2023

Learning Monocular Depth in Dynamic Environment via Context-aware Temporal Attention

The monocular depth estimation task has recently revealed encouraging pr...
research
02/03/2023

DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps

Text-based image captioning is an important but under-explored task, aim...

Please sign up or login with your details

Forgot password? Click here to reset