Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

04/10/2022
by Shunyu Zhang, et al.

Visual Dialog requires an agent to engage in a conversation with humans grounded in an image. Many studies on Visual Dialog focus on understanding the dialog history or the image content, while a considerable number of questions that require commonsense are ignored. Handling these questions depends on logical reasoning that requires commonsense priors. How to capture relevant commonsense knowledge complementary to the history and the image remains a key challenge. In this paper, we propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK). In our model, external knowledge is represented as sentence-level facts and graph-level facts, to suit a setting that combines dialog history and image. On top of these multi-structure representations, our model captures relevant knowledge and incorporates it into the visual and semantic features via graph-based interaction and transformer-based fusion. Experimental results and analysis on the VisDial v1.0 and VisDialCK datasets show that our proposed model outperforms comparative methods.


research
12/18/2019

DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog

Visual Dialog is a vision-language task that requires an AI agent to eng...
research
04/29/2020

Multi-View Attention Networks for Visual Dialog

Visual dialog is a challenging vision-language task in which a series of...
research
09/16/2017

Augmenting End-to-End Dialog Systems with Commonsense Knowledge

Building dialog agents that can converse naturally with humans is a chal...
research
10/07/2020

Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions

Existing persona-grounded dialog models often fail to capture simple imp...
research
01/30/2023

Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning

A framework performing Visual Commonsense Reasoning(VCR) needs to choose...
research
08/05/2021

Hybrid Reasoning Network for Video-based Commonsense Captioning

The task of video-based commonsense captioning aims to generate event-wi...
research
08/28/2022

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Building a conversational embodied agent to execute real-life tasks has ...
