Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene

03/16/2022
by Duo Zheng, et al.

Visual dialog has witnessed great progress after introducing various vision-oriented goals into the conversation, notably GuessWhich and GuessWhat, where the single image is visible to only one of the two agents or to both the questioner and the answerer, respectively. Research has concentrated on visual dialog tasks in such single- or perfectly co-observable visual scenes, while largely neglecting tasks in non-perfectly co-observable visual scenes, where the images accessed by the two agents may not be exactly the same, a situation that often occurs in practice. Although building common ground in a non-perfectly co-observable visual scene through conversation is important for advanced dialog agents, the lack of such a dialog task and a corresponding large-scale dataset has made in-depth research impossible. To address this limitation, we propose an object-referring game in a non-perfectly co-observable visual scene, where the goal is to spot the difference between similar visual scenes by conversing in natural language. The task poses challenges in dialog strategy under non-perfect co-observability and in the ability to categorize objects. Correspondingly, we construct a large-scale multimodal dataset, named SpotDiff, which contains 87k Virtual Reality images and 97k dialogs generated by self-play. Finally, we provide benchmark models for this task, and conduct extensive experiments to evaluate their performance and analyze the task's main challenges.
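The abstract describes the game only at a high level, so the following is a minimal, hypothetical sketch of one self-play episode: two rule-based agents, each seeing its own (possibly different) scene, exchange templated question-answer turns and then guess the object that differs. All names here (`self_play`, `Episode`, the tuple-based scene encoding, and the toy question/answer policies) are illustrative assumptions, not the paper's actual interfaces or models.

```python
# Illustrative sketch only: the scene representation and the rule-based
# "agents" below are hypothetical stand-ins for the self-play procedure
# described in the abstract, not the paper's implementation.
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Obj = Tuple[str, str]  # (object_id, category)

@dataclass
class Episode:
    dialog: List[Tuple[str, str]]  # (question, answer) turns
    guess: Optional[Obj]           # object the questioner believes differs
    success: bool

def self_play(scene_a: Set[Obj], scene_b: Set[Obj], max_turns: int = 5) -> Episode:
    """Agent A only sees scene_a, agent B only sees scene_b (non-perfectly
    co-observable). A asks about objects it sees; B answers from its own view;
    A guesses the object that B cannot confirm."""
    dialog: List[Tuple[str, str]] = []
    guess: Optional[Obj] = None
    categories_b = {cat for _, cat in scene_b}
    for obj_id, cat in sorted(scene_a)[:max_turns]:
        question = f"Do you see a {cat}?"           # toy questioning policy
        answer = "yes" if cat in categories_b else "no"
        dialog.append((question, answer))
        if answer == "no":                          # unconfirmed object -> candidate difference
            guess = (obj_id, cat)
            break
    success = guess is not None and guess in (scene_a - scene_b)
    return Episode(dialog, guess, success)

# Toy usage: the lamp is present in A's scene but missing from B's.
a = {("o1", "chair"), ("o2", "lamp"), ("o3", "table")}
b = {("o1", "chair"), ("o3", "table")}
ep = self_play(a, b)
print(ep.dialog, ep.guess, ep.success)
```

In the paper's setting, the toy policies above would presumably be replaced by learned questioner and answerer models; the success signal corresponds to the game's goal of correctly spotting the difference between the two scenes.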
