
Situated and Interactive Multimodal Conversations

by Seungwhan Moon, et al.

Next-generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, etc., in addition to the user's utterances), and perform multimodal actions (e.g., displaying a route in addition to generating the system's utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ~13K human-human dialogs (~169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture (grounded in a shared virtual environment) and (b) fashion (grounded in an evolving set of images). We also provide logs of the items appearing in each scene, and contextual NLU and coreference annotations, using a novel and unified framework of SIMMC conversational acts for both user and assistant utterances. Finally, we present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, code, and models will be made publicly available.
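To make the Structural API Prediction task concrete, here is a minimal sketch of what a single SIMMC-style turn might look like. All field names, the `SpecifyInfo` action, and the placeholder predictor are illustrative assumptions, not the official SIMMC schema; the point is only that the prediction target is an API action plus arguments grounded in both the dialog history and the current scene.

```python
# Hypothetical (not official) sketch of one SIMMC-style dialog turn:
# a user utterance paired with the multimodal scene context and the
# structural API call the agent is expected to predict.
turn = {
    "user_utterance": "Show me the brown couch on the left.",
    "multimodal_context": {
        "domain": "furniture",  # or "fashion"
        "visible_items": ["couch_01", "couch_02", "table_01"],
    },
    "dialog_history": [
        {"speaker": "user", "text": "I'm looking for a new couch."},
        {"speaker": "assistant", "text": "Here are a few options."},
    ],
    # Target for the Structural API Prediction task: an action and
    # its arguments, grounded in the co-evolving scene.
    "target_api": {
        "action": "SpecifyInfo",
        "arguments": {"object_id": "couch_01", "attribute": "color"},
    },
}


def predict_api(turn):
    """Trivial placeholder 'model' that always emits one action.

    A real baseline would condition on the utterance, the dialog
    history, and the logged scene items to choose action + arguments.
    """
    return {"action": "SpecifyInfo", "arguments": {}}


pred = predict_api(turn)
print(pred["action"])
```

A real benchmark model would replace `predict_api` with a learned classifier over the action inventory plus argument prediction, evaluated against the annotated `target_api` for each turn.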




Related research

- SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations
- A Knowledge-Grounded Multimodal Search-Based Conversational Agent
- Audio-Visual Understanding of Passenger Intents for In-Cabin Conversational Agents
- Multimodal Conversational AI: A Survey of Datasets and Approaches
- Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation
- Multimodal Dialogs (MMD): A large-scale dataset for studying multimodal domain-aware conversations
- An empirical user-study of text-based nonverbal annotation systems for human-human conversations

Code Repositories


With the aim of building next-generation virtual assistants that can handle multimodal inputs and perform multimodal actions, we introduce two new datasets (both in the virtual shopping domain), the annotation schema, the core technical tasks, and the baseline models. The code for the baselines and the datasets will be open-sourced.



Code for SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations



Situated Interactive MultiModal Conversations (SIMMC) Challenge 2020
