Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

03/20/2017
by Abhishek Das et al.

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents, Qbot and Abot, who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end: from pixels to multi-agent, multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world where the agents communicate in an ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that the two bots invent their own communication protocol and begin using certain symbols to ask and answer about certain visual attributes (shape/color/style). We thus demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL fine-tuned agents significantly outperform agents trained with supervised learning (SL). Interestingly, the RL Qbot learns to ask questions that Abot is good at answering, ultimately resulting in more informative dialog and a better team.
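The synthetic-world result can be illustrated with a toy referential game trained by REINFORCE. This is a minimal sketch under strong simplifications, not the paper's actual architecture (which uses recurrent encoder-decoders over pixels and multi-round dialog): here a single round, tabular softmax policies, one hidden attribute that Abot must convey to Qbot via an ungrounded symbol, and a shared game reward. All names and hyperparameters below are illustrative assumptions.

```python
import math
import random

random.seed(0)
N_ATTR, N_SYM = 3, 3  # attribute values and ungrounded symbols (cf. X, Y, Z)
LR = 0.5              # policy-gradient step size (illustrative choice)

# Tabular policy logits: Abot maps attribute -> symbol; Qbot maps symbol -> guess.
abot = [[0.0] * N_SYM for _ in range(N_ATTR)]
qbot = [[0.0] * N_ATTR for _ in range(N_SYM)]

def sample(logits):
    """Sample an action from a softmax over logits; return (action, probs)."""
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs

def reinforce(logits, action, probs, reward):
    """REINFORCE update for a tabular softmax: grad log pi wrt logit_k
    is (1[k == action] - p_k); move logits up the reward-weighted gradient."""
    for k in range(len(logits)):
        logits[k] += LR * reward * ((1.0 if k == action else 0.0) - probs[k])

for episode in range(3000):
    attr = random.randrange(N_ATTR)          # hidden attribute only Abot sees
    sym, ap = sample(abot[attr])             # Abot utters an ungrounded symbol
    guess, qp = sample(qbot[sym])            # Qbot guesses the attribute
    reward = 1.0 if guess == attr else -1.0  # shared (cooperative) game reward
    reinforce(abot[attr], sym, ap, reward)   # both agents get the same reward
    reinforce(qbot[sym], guess, qp, reward)

# Evaluate the greedy protocol: how many attributes round-trip correctly?
correct = sum(
    1 for a in range(N_ATTR)
    if max(range(N_ATTR),
           key=lambda g: qbot[max(range(N_SYM), key=lambda s: abot[a][s])][g]) == a
)
print(correct, "/", N_ATTR)
```

When training succeeds, the agents settle on a consistent (but arbitrary) attribute-to-symbol code, which is the toy analogue of the emergent grounding the paper reports; with pure RL from scratch, runs can also get stuck in local optima where two attributes share a symbol.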

Related research

08/10/2018  Mind Your Language: Learning Visually Grounded Dialog in a Multi-Agent Setting
09/23/2019  Improving Generative Visual Dialog by Answering Diverse Questions
09/06/2021  Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser
08/10/2018  Community Regularization of Visually-Grounded Dialog
08/18/2020  Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
05/16/2017  Cooperative Learning with Visual Attributes
06/26/2021  Saying the Unseen: Video Descriptions via Dialog Agents
