Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

05/24/2021
by   Tao Tu, et al.

GuessWhat?! is a two-player visual dialog guessing game in which player A asks a sequence of yes/no questions (the Questioner) and makes a final guess (the Guesser) about a target object in an image, based on answers from player B (the Oracle). The Guesser makes its final guess from the dialog history between the Questioner and the Oracle. The previous baseline Oracle model encodes no visual information, and it cannot fully understand complex questions about color, shape, relationships and so on. Most existing work on the Guesser encodes the dialog history as a whole and trains the Guesser model from scratch on the GuessWhat?! dataset. This is problematic because language encoders tend to forget long-term history and the GuessWhat?! data is sparse in terms of learning visual grounding of objects. Previous work on the Questioner introduces a state-tracking mechanism into the model, but the state is learned as a soft intermediate without any prior vision-linguistic insights. To bridge these gaps, in this paper we propose a Vilbert-based Oracle, Guesser and Questioner, all built on top of the pretrained vision-linguistic model Vilbert. We introduce a two-way background/target fusion mechanism into Vilbert-Oracle to account for both intra- and inter-object questions. We propose a unified framework for Vilbert-Guesser and Vilbert-Questioner, in which a state estimator is introduced to best utilize Vilbert's power on single-turn referring expression comprehension. Experimental results show that our proposed models significantly outperform state-of-the-art models, by 7 [...] for Oracle, Guesser and End-to-End Questioner respectively.
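To make the three roles concrete, here is a minimal sketch of one GuessWhat?! game loop. It is not the paper's method: the rule-based `oracle`, the elimination-style `update_state`, the `Obj` record and the `play` driver are all hypothetical stand-ins chosen for illustration, with objects reduced to a category and a color instead of image regions, and questions reduced to (attribute, value) pairs instead of free-form text. It only shows how a per-turn state over candidate objects can be updated from each question/answer pair before a final guess is made.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Toy candidate object (hypothetical stand-in for a detected image region)."""
    name: str
    category: str
    color: str


def oracle(target: Obj, question: tuple[str, str]) -> bool:
    """Oracle role: answer a yes/no question of the form (attribute, value)."""
    attribute, value = question
    return getattr(target, attribute) == value


def update_state(state: dict[Obj, float], question: tuple[str, str],
                 answer: bool) -> dict[Obj, float]:
    """Toy state update: drop candidates inconsistent with the answer, renormalize."""
    attribute, value = question
    new_state = {
        obj: (p if (getattr(obj, attribute) == value) == answer else 0.0)
        for obj, p in state.items()
    }
    total = sum(new_state.values())
    if total == 0.0:  # no candidate is consistent; keep the previous belief
        return state
    return {obj: p / total for obj, p in new_state.items()}


def play(objects: list[Obj], target: Obj, questions: list[tuple[str, str]]):
    """Run one game: ask each question, track the state, then guess."""
    state = {obj: 1.0 / len(objects) for obj in objects}  # uniform prior
    dialog = []
    for question in questions:           # Questioner role (fixed script here)
        answer = oracle(target, question)
        dialog.append((question, answer))
        state = update_state(state, question, answer)
    guess = max(state, key=state.get)    # Guesser role: most probable candidate
    return guess, dialog


objects = [
    Obj("cat", "animal", "black"),
    Obj("dog", "animal", "brown"),
    Obj("sofa", "furniture", "brown"),
]
guess, dialog = play(objects, objects[1], [("category", "animal"), ("color", "brown")])
print(guess.name, dialog)
```

In this toy run the two questions narrow three candidates down to one. A learned system would replace the hard elimination with soft scores from a model and the scripted questions with generated ones.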
