KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue

08/11/2020
by Xiaoze Jiang, et al.

Visual dialogue is a challenging task that requires extracting implicit information from both visual (image) and textual (dialogue history) contexts. Classical approaches pay more attention to integrating the current question with vision and text knowledge, overlooking the heterogeneous semantic gaps between cross-modal information. Meanwhile, concatenation has become the de facto standard for cross-modal information fusion, yet it has limited ability to retrieve relevant information. In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model that uses a graph to bridge the cross-modal semantic relations between vision and text knowledge at fine granularity, and retrieves the required knowledge via an adaptive information selection mode. Moreover, the reasoning clues for visual dialogue can be clearly drawn from intra-modal entities and inter-modal bridges. Experimental results on the VisDial v1.0 and VisDial-Q datasets demonstrate that our model outperforms existing models, achieving state-of-the-art results.
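The abstract's core idea, bridging vision and text knowledge and replacing plain concatenation with adaptive information selection, can be illustrated with a short sketch. The following PyTorch code is not the authors' implementation; the module name `CrossModalBridge`, the single-head attention bridge, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalBridge(nn.Module):
    """Minimal sketch of a vision-text knowledge bridge (not the authors' code).

    Each text node attends over vision nodes (the cross-modal "bridge"),
    and a learned gate adaptively mixes the retrieved vision knowledge with
    the original text knowledge instead of simply concatenating them.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)      # project text nodes to queries
        self.key = nn.Linear(dim, dim)        # project vision nodes to keys
        self.value = nn.Linear(dim, dim)      # project vision nodes to values
        self.gate = nn.Linear(2 * dim, dim)   # adaptive information selection

    def forward(self, text_nodes: torch.Tensor, vision_nodes: torch.Tensor) -> torch.Tensor:
        # text_nodes:   (batch, T, dim), e.g. dialogue-history entities
        # vision_nodes: (batch, V, dim), e.g. detected image regions
        q = self.query(text_nodes)
        k = self.key(vision_nodes)
        v = self.value(vision_nodes)

        # Cross-modal attention: each text entity retrieves vision knowledge.
        scores = torch.matmul(q, k.transpose(-1, -2)) / q.size(-1) ** 0.5
        bridged = torch.matmul(F.softmax(scores, dim=-1), v)  # (batch, T, dim)

        # Gated fusion: the gate decides, per feature, how much retrieved
        # vision knowledge to keep versus the original text knowledge.
        g = torch.sigmoid(self.gate(torch.cat([text_nodes, bridged], dim=-1)))
        return g * bridged + (1 - g) * text_nodes


if __name__ == "__main__":
    bridge = CrossModalBridge(dim=512)
    text = torch.randn(2, 10, 512)     # 10 text entities per example
    vision = torch.randn(2, 36, 512)   # 36 image regions per example
    print(bridge(text, vision).shape)  # torch.Size([2, 10, 512])
```

The gate stands in for the paper's adaptive information selection mode: instead of concatenating modalities and hoping a downstream layer sorts them out, the model learns per feature how much cross-modal knowledge to retrieve.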

Related research

02/13/2023
CLIP-RR: Improved CLIP Network for Relation-Focused Cross-Modal Information Retrieval
Relation-focused cross-modal information retrieval focuses on retrieving...

05/28/2021
Learning Relation Alignment for Calibrated Cross-modal Retrieval
Despite the achievements of large-scale multimodal pre-training approach...

07/02/2022
Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review
The intelligent dialogue system, aiming at communicating with humans har...

06/28/2023
Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection
The explosive growth of rumors with text and images on social media plat...

10/26/2022
Visual Answer Localization with Cross-modal Mutual Knowledge Transfer
The goal of visual answering localization (VAL) in the video is to obtai...

07/19/2023
Multi-Grained Multimodal Interaction Network for Entity Linking
Multimodal entity linking (MEL) task, which aims at resolving ambiguous ...

08/16/2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Vision-and-language pretraining (VLP) aims to learn generic multimodal r...
