Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning

04/15/2022
by Feilong Chen, et al.

Visual Dialog is a challenging vision-language task because the visual dialog agent must answer a series of questions after reasoning over both the image content and the dialog history. Although existing methods attempt to address cross-modal understanding in visual dialog, they still fall short when ranking candidate answers based on their understanding of the visual and textual contexts. In this paper, we analyze cross-modal understanding in visual dialog based on the vision-language pre-training model VD-BERT and propose a novel approach, named ICMU, to improve it. ICMU enhances cross-modal understanding by distinguishing different pulled inputs (i.e., pulled images, questions, or answers) based on four-way contrastive learning. In addition, ICMU exploits single-turn visual question answering to enhance the visual dialog model's cross-modal understanding so that it can handle multi-turn, visually grounded conversation. Experiments show that the proposed approach improves the visual dialog model's cross-modal understanding and yields satisfactory gains on the VisDial dataset.
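The four-way contrastive objective mentioned in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes a "pulled" input is an image, question, or answer swapped in from a different sample, and that the model then classifies each input into one of four classes (matched, pulled image, pulled question, pulled answer). The names `LABELS`, `make_four_way_example`, and `four_way_loss` are hypothetical, and the all-zero logits stand in for the output of a real model such as VD-BERT.

```python
import math
import random

# Hypothetical class names for the four-way task (illustrative, not from the paper).
LABELS = ("matched", "pulled_image", "pulled_question", "pulled_answer")

def make_four_way_example(sample, pool, label, rng):
    """Return a copy of `sample` in which the field selected by `label` is pulled."""
    example = dict(sample)
    field = {1: "image", 2: "question", 3: "answer"}.get(label)
    if field is not None:
        # Assumption: a pulled input is drawn from a different sample in the pool.
        others = [s for s in pool if s[field] != sample[field]]
        example[field] = rng.choice(others)[field]
    return example, label

def four_way_loss(logits, label):
    """Softmax cross-entropy over the four classes, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

# Toy data: two samples, each with an image id, a question, and an answer.
pool = [
    {"image": "img_a", "question": "what color is the cat?", "answer": "black"},
    {"image": "img_b", "question": "how many people?", "answer": "two"},
]
rng = random.Random(0)

# Build a "pulled question" example (label 2) from the first sample.
example, label = make_four_way_example(pool[0], pool, 2, rng)

# Placeholder logits: a real model would produce these from the example.
loss = four_way_loss([0.0, 0.0, 0.0, 0.0], label)
```

With uniform logits the loss equals log 4, the value expected from a classifier that cannot yet tell the four input types apart; training would push the logit for the correct class upward.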

research
03/15/2020

Vision-Dialog Navigation by Exploring Cross-modal Memory

Vision-dialog navigation posed as a new holy-grail task in vision-langua...
research
09/11/2019

Probabilistic framework for solving Visual Dialog

In this paper, we propose a probabilistic framework for solving the task...
research
05/15/2021

Premise-based Multimodal Reasoning: A Human-like Cognitive Process

Reasoning is one of the major challenges of Human-like AI and has recent...
research
02/20/2018

Combining Textual Content and Structure to Improve Dialog Similarity

Chatbots, taking advantage of the success of the messaging apps and rece...
research
09/13/2021

Learning to Ground Visual Objects for Visual Dialog

Visual dialog is challenging since it needs to answer a series of cohere...
research
07/07/2019

Informative Visual Storytelling with Cross-modal Rules

Existing methods in the Visual Storytelling field often suffer from the ...
research
08/16/2017

mAnI: Movie Amalgamation using Neural Imitation

Cross-modal data retrieval has been the basis of various creative tasks ...
