Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

09/17/2021
by Feilong Chen, et al.

Visual dialogue is a challenging task because the model must answer a series of coherent questions based on an understanding of the visual environment. Previous studies explore multimodal co-reference only implicitly, by attending to spatial or object-level image features, and neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. In this paper, we therefore propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and a multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history together with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.
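The two-stage idea in the abstract (ground first, then encode the dialogue turn by turn) can be illustrated with a toy sketch. This is not the MITVG implementation: label matching stands in for the learned visual grounding model, a single dot-product attention step stands in for the transformer layers, and every name here (`ground_objects`, `masked_attention`, `encode_incrementally`, the `label`/`feat`/`entities`/`vec` fields) is hypothetical and chosen only for illustration.

```python
import math


def ground_objects(entities, objects):
    """Toy stand-in for visual grounding (assumption: exact label match).

    Returns the indices of objects whose label matches a textual entity.
    """
    wanted = {e.lower() for e in entities}
    return [i for i, obj in enumerate(objects) if obj["label"].lower() in wanted]


def masked_attention(query, objects, keep):
    """Scaled dot-product attention restricted to the grounded objects.

    Objects outside `keep` receive no attention weight at all.
    """
    dim = len(query)
    if not keep:
        return [0.0] * dim
    scores = [
        sum(q * f for q, f in zip(query, objects[i]["feat"])) / math.sqrt(dim)
        for i in keep
    ]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * objects[i]["feat"][d] for w, i in zip(weights, keep))
        for d in range(dim)
    ]


def encode_incrementally(turns, objects, dim):
    """Fold over the dialogue turns in order, carrying a context vector.

    Each step combines the previous context, the current turn, and the
    visual features of the objects grounded by that turn's entities.
    The averaging rule is an illustrative choice, not the paper's.
    """
    context = [0.0] * dim
    for turn in turns:
        keep = ground_objects(turn["entities"], objects)
        query = [c + t for c, t in zip(context, turn["vec"])]
        visual = masked_attention(query, objects, keep)
        context = [0.5 * (q + v) for q, v in zip(query, visual)]
    return context
```

Because the context vector is carried from one turn to the next, feeding the same turns in a different order generally yields a different final encoding, which is the property the incremental design relies on.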

research
09/13/2021

Learning to Ground Visual Objects for Visual Dialog

Visual dialog is challenging since it needs to answer a series of cohere...
research
09/28/2022

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Multimodal transformer exhibits high capacity and flexibility to align i...
research
03/19/2020

Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding

We propose a new spatial memory module and a spatial reasoner for the Vi...
research
11/23/2016

GuessWhat?! Visual object discovery through multi-modal dialogue

We introduce GuessWhat?!, a two-player guessing game as a testbed for re...
research
07/12/2021

Modeling Explicit Concerning States for Reinforcement Learning in Visual Dialogue

To encourage AI agents to conduct meaningful Visual Dialogue (VD), the u...
research
10/24/2022

Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?

Decoding strategies play a crucial role in natural language generation s...
research
05/10/2021

Visual Grounding with Transformers

In this paper, we propose a transformer based approach for visual ground...
