Situated and Interactive Multimodal Conversations

06/02/2020 · by Seungwhan Moon, et al. · Facebook

Next-generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, etc., in addition to the user's utterances), and to perform multimodal actions (e.g., displaying a route in addition to generating the system's utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context, in addition to the dialog history. We provide two SIMMC datasets totaling ~13K human-human dialogs (~169K utterances) collected using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture (grounded in a shared virtual environment) and (b) fashion (grounded in an evolving set of images). We also provide logs of the items appearing in each scene, and contextual NLU and coreference annotations, using a novel and unified framework of SIMMC conversational acts for both user and assistant utterances. Finally, we present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, code, and models will be made publicly available.





1 Introduction

Figure 1: Illustration of a Situated Interactive MultiModal Conversation (SIMMC): Two agents, a “user” and an “assistant”, interact in a shared (co-observed) multimodal environment for a shopping scenario, where the dialog is grounded in an evolving multimodal context. The ground truth of which items (e.g., prefabs) appear is known for each view.

As virtual digital assistants become increasingly ubiquitous, the expectation is that they will become embedded in the day-to-day life of users the same way a human assistant would. We thus envision that the next generation of virtual assistants will be equipped with capabilities to process multimodal inputs that a user and an assistant co-observe, and provide outputs in modalities beyond the traditional NLP stack, much like the human counterparts they intend to emulate. It is therefore important for the community to tackle the plethora of novel and non-trivial research challenges that will arise as a result.

To this end, we present Situated Interactive MultiModal Conversations (SIMMC). SIMMC comprises the core technical tasks and the datasets that enable the community to start on the challenges involved in this new research direction. Specifically, the SIMMC tasks address task-oriented dialogs that encompass a rich, situated multimodal user context in the form of a co-observed image or a VR environment, which gets updated dynamically based on the dialog flow and the assistant actions. To address these tasks, we provide two new SIMMC datasets in the domain of interactive shopping (Section 3), collected using the SIMMC Platform Crook et al. (2019). In addition, we provide fine-grained annotations to allow for both end-to-end and component-level modelling, e.g., natural language understanding (NLU), dialog state tracking (DST), dialog management (DM), and natural language generation (NLG) (Section 4).

| Dataset | Modality | Task | Provided Context (Q’er) | Provided Context (A’er) | Context Updated | Annotation Granularity |
|---|---|---|---|---|---|---|
| Visual Dialog Das et al. (2017) | Image | Q&A | N/A | Visual | N/A | N/A |
| CLEVR-Dialog Kottur et al. (2019) | Simulated | Q&A | N/A | Visual | N/A | N/A |
| GuessWhat De Vries et al. (2017) | Image | Q&A | N/A | Visual | N/A | N/A |
| Audio Visual Scene-Aware Dialog Hori et al. (2018) | Video | Q&A | N/A | Visual | N/A | N/A |
| TalkTheWalk de Vries et al. (2018) | Image | Navigation | Visual | Visual + Meta | Location | U→A |
| Visual-Dialog Navigation Thomason et al. (2019) | Simulated | Navigation | Visual | Visual + Meta | Location | U→A |
| Relative Captioning Guo et al. (2018) | Image | Image Retrieval | Visual | Visual + Meta | New Image | U→A |
| MMD Saha et al. (2018) | Image | Image Retrieval | Visual | Visual + Meta | New Image | U→A |
| SIMMC (proposed) | Image/VR | Task-oriented | Visual | Visual + Meta | Situated | U→A + Semantic |
Table 1: Comparison with existing multimodal dialog corpora. Notation: (U→A) utterance-to-action pair labels. (Task-oriented) Includes API action prediction, Q&A, recommendation, item/image retrieval, and interaction. (Semantic) Dialog annotations such as NLU, NLG, DST, and Coref. (Situated) VR environment and/or newly highlighted images.

Figure 1 illustrates an exemplary SIMMC dialog from our SIMMC-Furniture Dataset (Section 3), where a user interacts with an assistant with the goal of browsing and shopping for furniture. In our setting, the assistant can dynamically update the co-observed environment to create a new situated context based on the preceding dialog with the user (e.g., visually presenting recommended chairs in the VR environment, or responding to the request “I like the brown one. Show me the back of it.” by executing the actions of focusing on, and rotating, the indicated item). These assistant actions change the shared multimodal context, which grounds the next part of the dialog. The example also highlights a number of challenges, such as multimodal action prediction and multimodal coreference resolution (indicated by the underlined elements).

The rest of this paper is organized as follows.


  • Section 2 highlights the novelty of the proposed datasets and tasks with respect to the existing literature.

  • Section 3 describes the SIMMC-Furniture (VR) and SIMMC-Fashion (Image) datasets.

  • Section 4 presents the SIMMC Dialog Annotation Schema for the datasets.

  • Section 5 provides the detailed analysis on the dataset and annotations.

  • Section 6 defines the SIMMC tasks and metrics.

  • Section 7 presents our SIMMC models, which are adaptations of the state-of-the-art models for solving the SIMMC tasks.

  • Section 8 provides the experimental results using the baseline models on the SIMMC-Furniture and SIMMC-Fashion datasets.

  • Section 9 concludes this paper.

2 Novelty & Related Work

Novelty of SIMMC. The SIMMC datasets present the following important distinctions from the existing multimodal dialog datasets (Table 1).

First, with the ultimate goal of laying the foundations for the real-world assistant scenarios, we assume a co-observed multimodal context between a user and an assistant, and record the ground-truth item appearance logs of each item that appears. This shifts the primary focus onto the core problem of grounding conversations in the co-observed multimodal context. In contrast, the existing literature Das et al. (2017); Kottur et al. (2019); De Vries et al. (2017); de Vries et al. (2018), drawing motivation from the Visual Question Answering Antol et al. (2015), often posits the roles of a primary and secondary observer, i.e., “Questioner” and “Answerer”, who do not co-observe the same multimodal context. Additionally, while work in this area has focused heavily on raw image processing, the SIMMC tasks emphasize semantic processing of the input modalities.

Secondly, we frame the problem as a task-oriented, multimodal dialog system, with the aim of extending the capabilities of digital assistants to real-world multimodal settings. Compared to the conventional task-oriented conversational datasets (e.g. MultiWoZ Budzianowski et al. (2018)), the agent actions in the SIMMC datasets span across a diverse multimodal action space (e.g. rotate, search, add_to_cart). Our study thus shifts the focus of the visual dialog research from the token or the phrase-level grounding of visual scenes to the task-level understanding of dialogs given complex multimodal context.

Third, we primarily study scenarios in which the situated multimodal context gets dynamically updated, reflecting the corresponding agent actions. In our settings, agent actions can be enacted on both the object-level (e.g. changing the view of a specific object within a scene) and the scene-level (e.g. introducing a new scene or an image). While the dialog-based image retrieval tasks Guo et al. (2018); Saha et al. (2018) and the visual navigation tasks Thomason et al. (2019); de Vries et al. (2018) do comprise context updates, they are limited to the introduction of new visual scenes (e.g. new images, new locations).

Last but not least, we present a novel flexible schema for semantic annotations that we developed specifically for the natural multimodal conversations. The proposed SIMMC annotation schema allows for a more systematic and structural approach for visual grounding of conversations, which is essential for solving this challenging problem in the real-world scenarios. To the best of our knowledge, our dataset is the first among the related multimodal dialog corpora to provide fine-grained semantic annotations.

Multimodal Dialog Datasets. Grounded conversational learning has recently gained traction in the community, spanning across various tasks and settings. For example, inspired by the Visual Question Answering (VQA) Antol et al. (2015) dataset, many previous works Das et al. (2017); Kottur et al. (2019); De Vries et al. (2017); Al Amri et al. (2018); Hori et al. (2018) tackle the problem of answering multi-turn questions about the provided multimodal contexts. Another line of work studies scenarios where conversations are grounded in an interactive multimodal environment, such as visual dialog navigation Thomason et al. (2019) or TalkTheWalk de Vries et al. (2018). Guo et al. (2018) study the task of retrieving relevant images through a dialog, where the focus is on the language understanding of visual characteristics. Unlike the existing multimodal dialog datasets, we bring the primary focus on the grounding of the co-observed and dynamic multimodal contexts, targeted mainly towards building the real-world assistant scenarios.

Task-oriented Dialog Datasets. A main focus of the dialog community has been on task-oriented dialog for its practical applicability in many consumer-facing virtual assistants. The existing task-oriented datasets often focus on tasks in a single specified domain (e.g. restaurant booking) Henderson et al. (2014), or across multiple domains Budzianowski et al. (2018); Eric et al. (2019); Rastogi et al. (2019), where the success of agents can be automatically verified through task success rate (e.g. did the agent book the correct restaurant?). In SIMMC-Furniture, target goals, in the form of images or item descriptions, are provided to the user.

A key focus of dialog literature lies in the tracking of the cumulative dialog context Wu et al. (2019); Gao et al. (2019); Chao and Lane (2019), which is often neglected in many existing multimodal ‘dialog’ datasets (both in terms of the task design and the annotations), where the primary efforts lie in visual grounding of language. Our emphasis on the task-oriented dialog brings many important challenges actively studied in the dialog community to the multimodal setting, bridging the gap between the two fields.

3 SIMMC Datasets

We choose shopping experiences as the domain for the SIMMC datasets, as it often induces rich multimodal interactions around browsing visually grounded items. As shown in Figure 1, the setup consists of two agents, a user and an assistant conversing with each other to simulate a shopping scenario. In addition to having an interactive dialog, the assistant manipulates the co-observed environment to display items from the shopping inventory and help the user. Thus, a conversational assistant model for the SIMMC datasets would need to (i) understand the user’s utterance using both the dialog history and the state of the environment – the latter provided as multimodal context, and (ii) produce a multimodal response to the user utterance, including updates to the co-observed environment to convey meaningful information as part of the user’s shopping experience.

We provide two SIMMC datasets with slightly different setups and modalities: (1) SIMMC-Furniture (VR) Dataset, where the assistant can manipulate a virtual 3D environment constructed in Unity while engaging in a conversation, and (2) SIMMC-Fashion (Image) Dataset, in which the conversations are grounded in real-world images that simulate a shopping scene in a user’s point-of-view (POV). Both datasets were collected through the SIMMC Platform Crook et al. (2019), an extension to ParlAI Miller et al. (2017) for multimodal conversational data collection and system evaluation that allows human annotators to each play the role of either the assistant or the user.

3.1 SIMMC-Furniture (VR) Dataset

The SIMMC-Furniture dataset captures a scenario where a user is interacting with a conversational assistant to obtain recommendations for a furniture item (e.g., couch, side table, etc.). We seed the conversation by presenting the user with either a high-level directive such as ‘Shop for a table’ or an image of a furniture item to shop for. The user is then connected randomly with a human assistant, who addresses this request by conversing with the user, in addition to manipulating the co-observed Unity UI. Through this interface, the assistant can filter the available catalog of 3D Wayfair assets using attributes such as furniture category, price, color, and material. They can also navigate through the filtered results and share their view with the user. The user then requests one of the following follow-ups: (i) look in depth into one of the available options, or (ii) show other furniture options. If the user picks (i), the assistant can either zoom into the object, interact with it and present an alternate view by rotating it, or consult the catalog description to answer further questions. To enable these assistant interactions, the environment is designed to transition between the following two states: (a) Carousel, which contains three slots in view to display filtered furniture items (top view, Figure 1); and (b) Focused, which provides a zoomed-in view of an item from the carousel view (bottom view, Figure 1). The conversation between the user and the assistant, grounded in the Unity UI, continues for 6–12 turns until the user considers that they have reached a successful outcome. Table 8 shows example dialogs from the SIMMC-Furniture dataset.

3.2 SIMMC-Fashion (Image) Dataset

Akin to SIMMC-Furniture, the SIMMC-Fashion dataset represents user interactions with an assistant to obtain recommendations for clothing items (e.g., jacket, dress, etc.). We present the user with a randomly selected ‘seed’ item from the catalog to emulate (visually) the act of shopping in a store, as well as a sequence of synthetic memories of ‘previously viewed items’. In addition to the user’s context, the assistant has access to a broader catalog with fine-grained information (e.g., price, brand, color, etc.) to allow for information lookup and item recommendations in response to the user’s requests. We ask the user to browse and explore options by asking the assistant for recommendations based on, e.g., shared attributes and preferences, as inferred from visual scenes, memories, and assistant-recommended items. The conversation continues for 6–10 turns until the user is assumed to have been given a successful recommendation. Please refer to Table 9 for example dialogs from the SIMMC-Fashion dataset.

3.3 Item Appearance Logs

For both datasets, the ground truth of which items appear in each view is logged. This allows the computer-vision problem to be sidestepped, placing the focus on semantically combining the modalities. In the SIMMC-Furniture dataset, the item appearance logs consist of item (prefab) identifiers (see Figure 1). When the carousel is displayed, identifiers are listed in the same order as they appear in the scene. Similarly, in the SIMMC-Fashion dataset, the item appearance log is the identifier of the displayed clothing item. Given an item’s identifier, its catalog description and other attributes can be easily retrieved.
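As an illustration, an appearance-log lookup might look like the following sketch; the field names, identifiers, and catalog values here are hypothetical, not the released schema:

```python
# Hypothetical sketch: resolving logged prefab identifiers (in on-screen
# order) to catalog attributes, with no computer vision involved.
CATALOG = {
    "prefab_42": {"category": "chair", "color": "brown", "price": 149.99},
    "prefab_07": {"category": "table", "color": "white", "price": 299.00},
}

def items_in_view(view_log):
    """Return catalog descriptions for the items logged in this view."""
    return [CATALOG[item_id] for item_id in view_log["item_ids"]]

carousel_view = {"state": "carousel", "item_ids": ["prefab_42", "prefab_07"]}
print(items_in_view(carousel_view)[0]["color"])  # brown
```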

4 SIMMC Dialog Annotations

Building a task-oriented multimodal conversational model introduces many new challenges, as it requires both action and item-level understanding of multimodal interactions. While most of the previous multimodal corpora provide surface-level annotations (e.g., utterance to multimodal action pairs), we believe it is critical to provide the semantic-level fine-grained annotations that ground the visual context, allowing for a more systematic and structural study for visual grounding of conversations. Towards this end, we develop a novel SIMMC ontology that captures the detailed multimodal interactions within dialog flows. In this section, we describe the proposed SIMMC ontology and the hierarchical labeling language centered around objects (Section 4.1 and 4.2), and the multimodal coreference schema that links the annotated language with the co-observed multimodal context (Section 4.3).

4.1 SIMMC Annotation Ontology

The SIMMC ontology provides common semantics for both the assistant and user utterances. The ontology is developed in the Resource Description Framework (RDF) and is an expansion of the Basic Formal Ontology Arp et al. (2015). It consists of four primary components:


  • Objects: A hierarchy of objects is defined in the ontology. This hierarchy is a rooted tree, with finer-grained objects at deeper levels. Sub-types are related to super-types via the isA relationship, e.g., sofa isA furniture. Fine-grained objects include user, dress, and sofa.

  • Activities: A hierarchy of activities is defined as a sub-graph of objects within the ontology. These represent activities the virtual assistant can perform, such as get, refine, and add_to_cart.

  • Attributes: A given object has a list of attributes which relate that object to other objects, to primitive data types, or to enums. Finer-grained objects inherit the attributes of their parents. There are restrictions on the available types for both the domain and range of attributes. For example, a sofa can be related to a company via the brand attribute. A person can be related to an item of clothing via the attentionOn attribute. Activities are related to the objects that they act upon via the takesArgument attribute.

  • Dialog Acts: A hierarchy of dialog acts is also defined as a sub-graph of objects within the ontology. Dialog acts indicate the linguistically motivated purpose of the user or system’s utterance. They define the manner in which the system conveys information to the user and vice versa. Examples of dialog acts include: ask, inform, and prompt. Dialog acts are related to the activities that they act upon via the takesArgument attribute. Table 7 lists the activities and dialog acts used in our work.
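The object hierarchy and attribute inheritance described above can be sketched as follows. The isA relations (e.g., sofa isA furniture) and the brand attribute follow the paper; the data structure and the seatingCapacity attribute are illustrative assumptions:

```python
# Minimal sketch of the ontology hierarchy: sub-types reach super-types via
# isA, and finer-grained objects inherit the attributes of their parents.
IS_A = {"sofa": "furniture", "dress": "clothing",
        "furniture": "object", "clothing": "object"}
ATTRIBUTES = {"furniture": ["brand", "price"],     # brand is from the paper
              "sofa": ["seatingCapacity"]}         # illustrative only

def inherited_attributes(obj_type):
    """Collect attributes of obj_type plus everything inherited via isA."""
    attrs = []
    while obj_type is not None:
        attrs.extend(ATTRIBUTES.get(obj_type, []))
        obj_type = IS_A.get(obj_type)  # climb one level; None at the root
    return attrs

print(inherited_attributes("sofa"))  # ['seatingCapacity', 'brand', 'price']
```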

4.2 SIMMC Labeling Language

From the SIMMC ontology, we derive a compositional, linearized, and interpretable labeling language for linguistic annotation, allowing for the representation of natural language utterances as well-formed subgraphs of the ontology Kollar et al. (2018). The labeling language consists of intents and slots Gupta et al. (2006). Intents are taken to represent instances of the types they are composed of and take one of two forms: 1) dialog_act:activity:object or 2) dialog_act:activity:object.attribute. Only combinations of objects and attributes declared to be valid in the ontology are made available in the labeling language. Within these intents, slots further specify values for attributes of objects, activities, and attribute types. In the basic case, slots take the form of attributes of the intent-level objects and restrict those attributes. More complex cases include slot-in-slot nesting to restrict the type of the embedding slot, object-attribute combinations for type-shifting contexts, i.e., utterances in which an intent-level object is identical to the range of another object’s property, and a system of indexing to restrict objects introduced within the intent. Crucially, the labeling language is speaker agnostic: it makes no distinction between the parses of the user’s utterances and those of the assistant.
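To make the two intent forms concrete, a minimal parser sketch is shown below. It assumes the leading da: token is part of the dialog-act field (as in the annotated example da:request:get:chair used later in the paper); this is an assumption for illustration, not the released annotation tooling:

```python
# Illustrative parser for the two intent forms:
#   dialog_act:activity:object   or   dialog_act:activity:object.attribute
def parse_intent(intent):
    # The object is the last colon-separated field; the activity is the
    # second-to-last; everything before that is the dialog act.
    *da_parts, activity, obj = intent.split(":")
    obj, _, attribute = obj.partition(".")  # optional .attribute suffix
    return {"dialog_act": ":".join(da_parts), "activity": activity,
            "object": obj, "attribute": attribute or None}

print(parse_intent("da:request:get:chair"))
print(parse_intent("da:ask:get:sofa.price"))
```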

A number of additional conventions are placed on the annotation task to ensure consistency and accuracy.

Type ambiguity. When an object appears in an utterance, the most fine-grained type is annotated. For example, in the utterance “Show me some dresses”, the token ‘dresses’ needs to be annotated as dress, as opposed to a coarser-grained type clothing. When more than one fine-grained type is possible, the annotator utilizes a parent-level coarse-grained type instead. Thus the assigned type is the finest-grained type that still captures the ambiguity.

Attribute ambiguity. Attributes are annotated when they are unambiguous. When there is uncertainty in the attribute that should be selected for the representation, the annotator falls back to a more generic attribute.

Attribute inverses. When an attribute can be annotated in two different directions, a canonical attribute is defined in the ontology and used for all annotations. For example, attentionOn and inAttentionOf are inverses. The former is designated as the canonical attribute in this case.

Smart prefixes. Attribute slots are prefixed by A and O respectively to indicate whether they serve to restrict the intent-level Activity or Object. This is primarily for human-annotator convenience.

Attribute variables. The attribute .info is employed when the speaker’s intent targets more than one attribute simultaneously. The specific attributes being targeted are then identified with the INFO smart prefix. Table 8 and Table 9 show our SIMMC ontology in action for both our datasets.

Figure 2: SIMMC Datasets Analysis. Distribution of rounds and utterance lengths (# of tokens) for SIMMC-Furniture (panels a, b) and SIMMC-Fashion (panels c, d).

4.3 SIMMC Coreference Annotations

Note that the proposed labeling language allows for the annotation of object types in a dialog, which may in turn refer to specific canonical listings from the underlying multimodal contexts. For example, given an annotated utterance “[da:request:get:chair Show me the back of it]”, the annotated object ‘chair’ (it) would refer to a specific catalog item, represented as an item id within the image metadata. To allow for structural grounding between the verbal and visual modalities in a shared catalog, we further annotate the mapping of object type mentions in the annotated utterance to the corresponding item id in the image metadata. The final SIMMC annotations thus capture the semantic relations of objects in multimodal contexts together with their corresponding dialog annotations (activities, attributes, and dialog acts), as outlined in the proposed SIMMC ontology (Section 4.1).
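A hypothetical record illustrating this mention-to-item grounding (the keys and identifiers are illustrative, not the released annotation format):

```python
# Sketch of a coreference-annotated turn: a mention in the parsed utterance
# is linked to a concrete item id from the co-observed scene.
turn = {
    "utterance": "Show me the back of it",
    "intent": "da:request:get:chair",
    "coref_map": {"it": "prefab_42"},        # mention -> item id
    "scene_items": ["prefab_42", "prefab_07"],
}

def resolve_mention(turn, mention):
    """Ground a textual mention in the logged multimodal context."""
    item_id = turn["coref_map"][mention]
    assert item_id in turn["scene_items"]    # must be visible in this view
    return item_id

print(resolve_mention(turn, "it"))  # prefab_42
```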

5 Dataset & Annotation Analysis

| Statistics | Furniture (VR), Text | Furniture (VR), Audio | Fashion (Image) |
|---|---|---|---|
| Total # dialogs | 6.4k | 1.3k | 6.6k |
| Total # utterances | 97.6k | 15.8k | 71.2k |
| Avg # rounds / dialog | 7.62 | 7.16 | 5.39 |
| Avg # tokens (user) | 11.0 | N/A | 11.10 |
| Avg # tokens (assistant) | 12.2 | N/A | 10.87 |
Table 2: SIMMC Datasets Statistics. We also collected additional SIMMC-Furniture dialogs in the aural medium, where annotators exchanged audio messages instead of text.

We now analyze the dataset and annotation trends for both of the proposed SIMMC datasets, and compare the two where meaningful. Table 2 contains the overall statistics. SIMMC-Furniture has 6.4k dialogs with an average of 7.62 rounds (or turn pairs), leading to a total of about 97.6k utterances. Similarly, SIMMC-Fashion consists of 6.6k dialogs, each around 5.39 rounds on average, totaling 71.2k utterances. In addition to these sets, we also collect a smaller, audio-based SIMMC-Furniture dataset (1.3k dialogs) where the dialog exchanges are aural as opposed to written text.

Figure 3: Distribution of Dialog Acts and Activities in the SIMMC datasets. See Section 5 for details.

Dataset Analysis.

In Figure 2, we visualize: (a) Distribution of rounds. Dialogs in SIMMC-Furniture cover a wide range of rounds (shorter ones are omitted from the dataset), averaging 7.62 rounds per dialog (Figure 2a). Dialogs in SIMMC-Fashion are somewhat shorter, averaging 5.39 rounds per dialog, as shown in Figure 2c. We hope that this wide spread will help train models that can handle diverse conversations of varied lengths. (b) Distribution of utterance lengths. For both user and assistant, we tokenize their utterances and plot the length distributions in Figure 2b. For SIMMC-Furniture, the assistant utterances are slightly longer, with higher variance, than those of the user (12.2 vs. 11.0 tokens on average; Table 2). A potential reason is that, because the assistant has access to the catalog, it is expected to be more verbose when responding to description-related queries (‘User: Tell me more about the brown table’). However, we do not observe a similar trend for SIMMC-Fashion, where user and assistant turns both average around 11 tokens per utterance (Figure 2d). (c) Catalog coverage. Recall that both SIMMC datasets contain conversations in a shopping scenario grounded in a catalog of furniture and fashion items, respectively. In SIMMC-Furniture, each dialog contains several shares of different views between the user and assistant, and each furniture item is shared across multiple dialogs. Similarly, each SIMMC-Fashion item appears in several dialogs on average, providing a rich catalog context to support interesting multimodal dialogs.

Annotation Analysis.

Using the unified ontology framework described in Section 4.1, we annotate both the user and assistant utterances of the SIMMC datasets. The dialog acts are combined with a set of activities for SIMMC-Furniture and with a slightly smaller set for SIMMC-Fashion, which by design excludes count and rotate. A detailed list with examples is in the Appendix, Table 7. Not all combinations of dialog acts and activities are observed in our dataset; for instance, request:disprefer is an invalid combination. The key takeaways from Figure 3 are: (a) inform is the most dominant dialog act in both SIMMC-Fashion and SIMMC-Furniture. This is intuitive, as conversations in the shopping domain require the user to inform the assistant of their preferences, while the assistant informs the user about item attributes and availability. (b) Interestingly, get is the dominant activity across most dialog acts, where the assistant either gets new items or additional information about existing items that the user is perusing. (c) The relatively low occurrence of the confirm dialog act perhaps arises from the effectiveness of the human assistant agent. This is desirable, to avoid learning assistant models that excessively repeat user requests (e.g., repeatedly seeking explicit confirm), as this leads to lower user satisfaction. Note that this analysis of the dialog act and activity distribution is per sentence, with an utterance occasionally containing multiple sentences (see Figure 1 for an example).

User Satisfaction Metrics.

Since the SIMMC datasets aim at goal-oriented dialog, we also collect turn-level and dialog-level user satisfaction scores in the range of 1–5 as part of the data collection. The dialog-level user satisfaction scores for the SIMMC-Furniture dataset are heavily concentrated around 5. Since the dialogs are collected between humans interacting with each other, we hypothesize that the assistant (wizard) is able to efficiently respond to user requests, leading to high satisfaction scores. Similar trends were observed across the different metrics for both datasets. We therefore drop further analysis on this front due to the absence of a clear signal in these collected metrics.

6 SIMMC Tasks & Metrics

As a first step towards the evaluation of models trained on SIMMC datasets, we define several offline evaluation tasks within the SIMMC framework, to train reasonable models on these new datasets using the fine-grained annotations that are provided. We first present the general offline evaluation framework for defining SIMMC tasks (Section 6.1), and then present three major tasks that we focus on in this paper (currently, only results for the first two tasks are presented). These tasks are primarily aimed at replicating human-assistant actions in order to enable rich and interactive shopping scenarios (Section 6.2).

6.1 Offline Evaluation Framework

Consider a generic SIMMC dialog that is N rounds long, where U_t and A_t are the user and assistant utterances, M_t is the domain-specific multimodal context, and a_t is the action (API call) taken by the assistant at round t, respectively. Formally, a task is defined as: at each round t, given the current user utterance U_t, the dialog history H_t = (U_1, A_1, …, U_{t−1}, A_{t−1}), and the multimodal context M_t, predict the assistant action a_t along with the free-form, natural language assistant response A_t.

| Task Name | Description | Evaluation |
|---|---|---|
| Assistant Action Selection (Structural API Call Prediction) | Given user utterances, evaluate the model’s performance on retrieving the correct API(s) | Perplexity, Mean Average Precision, (Human Evaluation) |
| Response Generation | Given user utterances and/or ground-truth APIs, evaluate the model’s response generation (both as generation and as retrieval) | Generation: BLEU, Perplexity, Human Evaluation (Naturalness, etc.); Retrieval: Accuracy@k, Entropy |
| Dialog State Tracking | Given user utterances, evaluate the model’s performance on tracking the cumulative dialog state across multiple turns | Intent Accuracy; Slot Precision / Recall / F1; Coreference Precision / Recall / F1 |
Table 3: Overview of tasks enabled by our SIMMC datasets.

The proposed offline evaluation framework has a three-fold advantage: (a) It accurately represents the scenario encountered by a SIMMC model during deployment; in other words, models trained for the above task can be deployed to interact with humans to provide a situated, interactive, multimodal conversation. (b) Instead of evaluating performance on an entire dialog, we evaluate models on a per-turn basis with the ground-truth history. This avoids taking the conversation out of the dataset and reduces the dependency on a user simulator, with the caveat of not encouraging the model to learn multiple equally valid routes to satisfy the user’s request. (c) Finally, it allows us to define and evaluate several sub-tasks within SIMMC, such as action prediction, response generation, and dialog state tracking, which lets us bootstrap from prior work on these sub-tasks.
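The per-turn protocol in (b) can be sketched as a simple evaluation loop; the model interface and the dictionary keys are hypothetical stand-ins, not the released evaluation code:

```python
# Per-turn offline evaluation sketch: at every round the model receives the
# ground-truth history (never its own earlier predictions), and each
# predicted action is scored against the collected assistant action.
def evaluate_dialog(model, dialog):
    """dialog: list of rounds, each a dict with the user and assistant
    utterances, the multimodal context, and the ground-truth action."""
    correct = 0
    history = []
    for rnd in dialog:
        pred_action, pred_response = model.predict(
            rnd["user"], history, rnd["context"])
        correct += int(pred_action == rnd["action"])
        # The ground-truth turn, not the prediction, extends the history.
        history.append((rnd["user"], rnd["assistant"]))
    return correct / len(dialog)
```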

Table 3 provides an overview of the sub-tasks we study in this work that are described below.

| API Description | Arguments |
|---|---|
| Search items using the item attributes | Category, color, intended room, material, price range, etc. |
| Get and specify information (attributes) about an item | Material, price range (min–max), customer rating, etc. |
| Focus on an item to enlarge it (for a better view) | Position of the argument item on the carousel (left, center, right) |
| Rotate a focused furniture item in the view | Rotational directions (left, right, up, down, front, back) |
| Navigate the carousel to explore search results | Navigation directions (next and previous) |
| Get and specify information (attributes) about an item | Brand, price, customer rating, available sizes, colors, etc. |
| Search (Database / Memory): select a relevant image from either the database or memory, and specify information | Brand, price, customer rating, available sizes, colors, etc. |
Table 4: List of APIs supported in our SIMMC datasets, with attributes. We also include None as an action when no API call is required, and AddToCart to specify adding an item to the cart for purchase.

6.2 SIMMC Furniture & Fashion Tasks

Task 1: Structural API Call Prediction. This task involves predicting the assistant action as an API call, along with the necessary arguments, given the dialog and multimodal context as inputs. For example, enquiring about an attribute value (e.g., price) for a shared furniture item is realized through a call to the SpecifyInfo API with the price argument. A comprehensive set of APIs for our SIMMC dataset is given in Table 4. Apart from these APIs, we also include a None API call to capture situations without an underlying API call; e.g., responding to ‘U: Can I see some tables?’ with ‘A: What color are you looking for?’ does not require any API call. Action prediction is cast as a round-wise, multi-class classification problem over the set of APIs, measured using the accuracy of predicting the action taken by the assistant during data collection. However, we note that several actions can be equally valid in a given context. For instance, in response to ‘U: Show me some black couches.’, one could show black couches (‘A: Here are a few.’) or enquire further about specific preferences (‘A: What price range would you like to look at?’). Since accuracy does not account for the existence of multiple valid actions, we report perplexity (defined as the exponential of the negative mean log-likelihood of the ground-truth action) alongside accuracy. To also measure the correctness of the predicted action (API) arguments, we use attribute accuracy against the ground truth in the collected datasets.
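As a concrete illustration of these two metrics, the following sketch computes round-wise action accuracy and perplexity from per-round predicted distributions; the API names and probability values are hypothetical.

```python
import math

def action_metrics(predictions, gold_actions):
    """Round-wise action accuracy, plus perplexity computed as the
    exponential of the mean negative log-likelihood of the gold action."""
    correct, total_nll = 0, 0.0
    for probs, gold in zip(predictions, gold_actions):
        # probs maps each API name to its predicted probability.
        if max(probs, key=probs.get) == gold:
            correct += 1
        total_nll -= math.log(probs[gold])
    n = len(gold_actions)
    return correct / n, math.exp(total_nll / n)

# Two hypothetical rounds over a toy API set.
preds = [
    {"SearchFurniture": 0.7, "SpecifyInfo": 0.2, "None": 0.1},
    {"SearchFurniture": 0.25, "SpecifyInfo": 0.5, "None": 0.25},
]
accuracy, perplexity = action_metrics(preds, ["SearchFurniture", "SpecifyInfo"])
```

Note that perplexity stays above 1 even when the argmax action is always correct, reflecting how confidently the model ranked the gold action rather than only whether it won.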

Task 2: Response Generation. This task measures the relevance of the assistant response at the current turn. We treat response generation as a conditional language modeling problem, and use the token-wise perplexity of the ground-truth responses under the model as the metric. In addition, taking inspiration from the machine translation literature, we use BLEU scores Papineni et al. (2002) to measure the closeness between the generated and ground-truth responses.
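For illustration, a simplified sentence-level BLEU can be sketched as below; the add-one smoothing is a common convenience for short sentences and is not part of the original corpus-level metric of Papineni et al. (2002).

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of add-one-smoothed
    modified n-gram precisions, times a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

An exact match scores 1.0, and partial overlaps fall strictly between 0 and 1.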

Task 3: Dialog State Tracking (DST). The dialog annotations collected using the flexible ontology enable us to study dialog state tracking (DST) in SIMMC, aside from providing additional supervision to train goal-driven agents. As mentioned in Section 4, the user and assistant utterances are accompanied by a hierarchy of dialog act labels and text spans for the corresponding attributes, if any. The goal of DST is to systematically track the dialog acts and the associated (slot, value) pairs across multiple turns. We use the intent and slot accuracy metrics, following prior DST literature Henderson et al. (2014). In addition, we measure the performance of resolving coreferences across modalities, using the annotated labels.
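The slot metrics above can be sketched as set overlap between predicted and annotated (slot, value) pairs; the slot names below are illustrative.

```python
def slot_prf(predicted, gold):
    """Slot precision / recall / F1, treating each (slot, value)
    pair as one unit."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One hypothetical turn: two predicted slots, one matching the gold annotation.
p, r, f = slot_prf([("color", "black"), ("price", "cheap")],
                   [("color", "black"), ("category", "couch")])
```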

Figure 4: Model Architecture Overview (Section 7) with four components: utterance and history encoder, multimodal fusion, action predictor, and response decoder. Example taken from Figure 1.

7 Modeling for SIMMC Tasks

We now propose several models building on top of prior work and train them on the tasks formulated in Section 6 to benchmark the SIMMC dataset. Our overall model architecture is illustrated in Figure 4, which is composed of four main components: Utterance and History Encoder, MultiModal Fusion, Action Predictor, and Response Generator. In this section, we describe our modeling choices for these components.

7.1 Utterance & History Encoder

The utterance and history encoder takes as input the user utterance at the current round and the dialog history so far, and produces an utterance encoding and a history encoding that capture the respective textual semantics. Inspired by prior work, we consider several utterance and history encoders, whose functional forms are outlined in Table 5. We embed each token in the input sequences through learned word embeddings, which are then fed into the encoders.

(a) History-Agnostic Encoder (HAE) ignores the dialog context and encodes only the user utterance, through an LSTM Hochreiter and Schmidhuber (1997), for the downstream components.

(b) Hierarchical Recurrent Encoder (HRE) Serban et al. (2016) models dialogs at two hierarchical recurrence levels: utterance and turn. The utterance encoder LSTM operates at the former, while a history LSTM, which consumes the hidden states of the utterance encoder LSTM from all previous rounds, operates at the latter.

(c) Memory Network (MN) encoder Sukhbaatar et al. (2015) treats the dialog history as a collection of memory units, each the concatenation of a user and assistant utterance pair, and uses the current utterance encoding to selectively attend to these units to produce an utterance-conditioned history encoding.

(d) Transformer-based History-Agnostic Encoder (T-HAE) is a variant of HAE with the LSTMs replaced by Transformer units Vaswani et al. (2017), which have achieved state-of-the-art results in language modeling Devlin et al. (2019).

Table 5: Overview of user utterance and history encoders for SIMMC models: History-Agnostic Encoder (HAE), Hierarchical Recurrent Encoder (HRE), Memory Network (MN), and Transformer-based History-Agnostic Encoder (T-HAE). Attention() is defined in Eq. 5. For further details, see Section 7.

7.2 Multimodal Fusion

As the name suggests, this component fuses the semantic information from the text (the utterance and history encodings) with the multimodal context (described in Section 8) to create the fused context tensor, which is double the size of the text encoding in the last dimension. In our setup, the multimodal context is modelled as a tensor with one embedding per multimodal unit in the current round. Note that all of our models share the same architecture for fusing multimodal information. At a high level, we first embed the multimodal context $M$ to match the size of the utterance encoding $u$ using a linear layer followed by a non-linearity (ReLU) (Eq. 1), then use $u$ to attend to the multimodal units (Eq. 2), and finally fuse the attended multimodal information by concatenating it with the text encoding (Eq. 3). More concretely,

$\hat{M} = \mathrm{ReLU}(W_M M + b_M)$   (1)
$\tilde{m} = \mathrm{Attention}(u, \hat{M}, \hat{M})$   (2)
$F = [\,u\,;\,\tilde{m}\,]$   (3)

where the Attention operator Vaswani et al. (2017), for a query $q$ over keys $K$ (of size $d$) and values $V$, is defined as

$\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(qK^{\top}/\sqrt{d}\right)V.$
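The fusion steps described above can be sketched in NumPy; the dimensions and random weights below are illustrative placeholders, not the trained model’s.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_units, d_mm = 8, 3, 5                 # illustrative sizes

u = rng.standard_normal(d)                 # utterance encoding
M = rng.standard_normal((n_units, d_mm))   # one row per multimodal unit

# Eq. 1: project multimodal units to the text embedding size (linear + ReLU).
W, b = rng.standard_normal((d_mm, d)), np.zeros(d)
M_hat = np.maximum(M @ W + b, 0.0)

# Eq. 2: attend to the multimodal units with the utterance as the query
# (scaled dot-product attention).
scores = M_hat @ u / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
m_att = weights @ M_hat

# Eq. 3: fuse by concatenation, doubling the last dimension.
fused = np.concatenate([u, m_att])
```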
7.3 Action Predictor

Using the fused context, the Action Predictor predicts the appropriate action (API) and the corresponding API arguments to be taken by the assistant. The former is a multi-class classification over the set of actions (APIs), while the latter is modelled as a set of binary classifiers, one for each attribute such as category, color, and price. For the list of APIs and their arguments supported in our work, see Table 4. First, the fused context tensor is condensed into a single vector through self-attention with a learned attention parameter (Eq. 5). Next, a classifier (MLP) takes this vector and predicts a distribution over the possible APIs (Eq. 6). In addition, we learn several binary classifiers (MLPs), one for each of the corresponding API arguments. Having predicted the structured API call, we execute it and encode its output as the action context. The dataset-dependent specifics of the API call output encoding are in Section 8. Finally, the fused context and the action context feed into the last component to generate the assistant response. As the training objective, we minimize the cross-entropy loss for both the action and the action attributes.


7.4 Response Generator

As the last component, the response generator (decoder) generates the assistant response. We model it as a language model conditioned on both the action context and the fused context: the former ensures that the response is influenced by the API call output, while the latter maintains semantic relevance to the user utterance. For example, the response to ‘Show me black couches less than $500’ depends on the availability of such couches in the inventory and could lead to either ‘Here are some’ or ‘Sorry, we do not have any black couches cheaper than $500’. For models that use LSTMs for the user and history encoders, the response decoder is also an LSTM, with attention over the fused context and the action API output at every decoding time step, similar to Bahdanau et al. (2014). Similarly, we use a Transformer-based decoder for the other models to keep the underlying architecture consistent (either LSTM or Transformer). Like any conditional language model, we decode individual tokens at each time step to generate the response, and minimize the negative log-likelihood of the human response under the model during training.

8 Experiments & Results

Dataset Splits and Baselines.

All our models are trained on a random split of the data, with model hyperparameters chosen via early stopping on a separately sampled validation portion, and evaluation numbers reported on the remaining unseen data. In addition to the models described in Section 7, we compare against two simple baselines: an action predictor built on TF-IDF features of the utterance and history, and an LSTM-based language model (LM-LSTM) trained solely on the assistant responses.
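The TF-IDF featurization underlying the first baseline can be sketched as follows; the whitespace tokenization and weighting variant here are one common choice, not necessarily the exact one used.

```python
import math
from collections import Counter

def tfidf_vectors(utterances):
    """Bag-of-words TF-IDF vectors, one per utterance."""
    docs = [Counter(u.lower().split()) for u in utterances]
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in d)
    vocab = sorted(doc_freq)
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}
    vectors = [[(d[w] / sum(d.values())) * idf[w] for w in vocab] for d in docs]
    return vectors, vocab

vecs, vocab = tfidf_vectors(["show me black couches",
                             "show me brown tables"])
```

Terms shared by every utterance get zero weight, so the features emphasize the discriminative content words an action classifier needs.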

Dataset-specific Model Details.

Below are the dataset-specific details surrounding our models, particularly around modeling multimodal context and encoding action (API call) output for each of the SIMMC datasets.

SIMMC-Furniture (VR).

Since the data collection for SIMMC-Furniture is grounded in a co-observed virtual 3D environment (Section 3), the environment state becomes the multimodal context. For both the carousel and focused environment states, we concatenate the furniture item representation in each slot (or a zero vector if the slot is empty) with a jointly learned positional embedding (‘left’, ‘center’, ‘right’, ‘focused’), yielding one multimodal unit per carousel slot or a single unit for the focused item. In addition, each furniture item is represented with the concatenated GloVe embeddings Pennington et al. (2014) of its attributes like category, color, intended room, etc. Similarly, we construct the action output context from the environment representation after executing the necessary structural API call, e.g., searching for an item or focusing on an existing item. The information-seeking action SpecifyInfo is an exception, for which the action context is the GloVe embedding of the attributes of the desired item.
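The carousel-state construction described above can be sketched as follows; the embedding sizes and random vectors are illustrative stand-ins for the learned positional embeddings and GloVe-based item vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d_item, d_pos = 6, 2                    # illustrative embedding sizes
slots = ["left", "center", "right"]
pos_emb = {s: rng.standard_normal(d_pos) for s in slots}  # learned in practice

# Carousel state: the right slot is empty, so it gets a zero item vector.
items = {
    "left": rng.standard_normal(d_item),
    "center": rng.standard_normal(d_item),
    "right": None,
}

M = np.stack([
    np.concatenate([
        items[s] if items[s] is not None else np.zeros(d_item),
        pos_emb[s],
    ])
    for s in slots
])  # multimodal context: one row per carousel slot
```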

SIMMC-Fashion (Image).

Dialogs in SIMMC-Fashion use a fashion item (updated as the conversation progresses) and a sequence of ‘previously viewed items’ (memory) as context (Section 3.1). To reflect this scenario, we extract the representations for each fashion item using concatenated GloVe embeddings of its attributes (similar to SIMMC-Furniture) in addition to learning the source embedding (‘memory’ or ‘current’ item), as the multimodal context . Akin to SIMMC-Furniture, is modeled simply as the updated multimodal state after executing the current API.

SIMMC-Furniture:
  Model     API Acc.  API Perp.  Att. Acc.  BLEU   Resp. Perp.
  TF-IDF    76.5      2.94       42.7       -      -
  LM-LSTM   -         -          -          0.10   9.65
  HAE       79.6      1.73       52.2       0.22   8.53
  HRE       79.5      1.72       51.2       0.23   9.03
  MN        78.2      1.87       51.8       0.23   8.68
  T-HAE     78.7      1.79       51.2       0.14   7.72

SIMMC-Fashion:
  Model     API Acc.  API Perp.  Att. Acc.  BLEU   Resp. Perp.
  TF-IDF    82.5      3.75       77.9       -      -
  LM-LSTM   -         -          -          0.10   7.08
  HAE       84.5      1.77       80.2       0.23   6.41
  HRE       85.1      1.75       81.2       0.19   6.55
  MN        84.5      1.67       80.0       0.25   6.54
  T-HAE     84.6      1.73       80.3       0.17   5.63

Table 6: Results on SIMMC-Furniture and SIMMC-Fashion for: (a) API prediction, measured using accuracy (Acc.), perplexity (Perp.), and attribute prediction accuracy (Att. Acc.), and (b) response generation, measured using BLEU and perplexity (Perp.). For accuracy and BLEU, higher is better; for perplexity, lower is better. See text for details.


We learn SIMMC models end-to-end by jointly minimizing the sum of the action prediction and the response generation losses, i.e., $\mathcal{L} = \mathcal{L}_{\text{action}} + \mathcal{L}_{\text{response}}$. To extract supervision for API call prediction (along with attributes), we utilize a combination of the assistant (Wizard) interface activity logged during data collection (Section 3.3) and the fine-grained NLU annotations.

Implementation Details.

All our models are trained using PyTorch Paszke et al. (2019). We build the model dictionaries for SIMMC-Furniture and SIMMC-Fashion from lowercased words that appear a minimum number of times in the training set, and learn word embeddings for these words that are fed into the utterance and history encoders. All LSTMs and Transformers in our experiments share the same hidden-state size. We optimize the objective function using Adam Kingma and Ba (2015) and clip the gradients by value. The model hyperparameters are selected via early stopping on a randomly chosen held-out portion of the data.


Table 6 summarizes the performance of SIMMC baseline models on structural API prediction and response generation. The key observations are: (a) All SIMMC neural models (HAE, HRE, MN, T-HAE) outperform the baselines (TF-IDF and LM-LSTM) across all metrics on both datasets. (b) For response generation, MN has the best BLEU score while T-HAE has the lowest perplexity on both SIMMC-Furniture and SIMMC-Fashion. Surprisingly, T-HAE also has among the lowest BLEU scores of the SIMMC models, perhaps because it resorts to safe, frequent responses. (c) HRE achieves the highest API prediction accuracy for SIMMC-Fashion and is on par with the best-performing model for SIMMC-Furniture, followed by T-HAE in both cases. (d) The confusion matrix for HRE on SIMMC-Furniture (Figure 5) reveals a high confusion between SearchFurniture and None. This reflects the natural decision between searching for an item and further eliciting user preferences to narrow the search. Note that the proposed baselines do not leverage the rich, fine-grained annotations of the SIMMC datasets (understandably so), as they are mainly adaptations of existing state-of-the-art models.
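A confusion matrix of this kind can be computed in a few lines; the rounds below are hypothetical.

```python
from collections import Counter

def confusion_matrix(gold, predicted, labels):
    """Raw confusion counts: rows are gold actions, columns are predictions."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]

labels = ["SearchFurniture", "None", "SpecifyInfo"]
cm = confusion_matrix(
    ["SearchFurniture", "SearchFurniture", "None", "SpecifyInfo"],
    ["SearchFurniture", "None", "None", "SpecifyInfo"],
    labels,
)
```

Off-diagonal mass in the SearchFurniture row against the None column is exactly the pattern Figure 5 highlights.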

Figure 5: Confusion matrix for hierarchical recurrent encoder (HRE) on SIMMC-Furniture.

9 Conclusions

In this work, we presented Situated Interactive Multi-Modal Conversations (SIMMC), an important new direction towards building next generation virtual assistants with evolving multimodal inputs. In particular, we collected two new datasets using the SIMMC platform, and provided the contextual NLU and coreference annotations on these datasets, creating a new SIMMC task for the community to study. We established several strong baselines for some of the tasks enabled by the datasets, showcasing various uses of the datasets in real-world applications. The fine-grained annotations we collected open the door for studying several different tasks in addition to the ones highlighted in this work, which we leave as future work for the community to tackle.

Acknowledgements. We thank the following for their invaluable technical contributions to the data collection platforms, annotation schema development, annotation process, tooling, and coordination: Pararth Shah, Oksana Buniak, Semir Shafi, Ümit Atlamaz, Jefferson Barlew, Becka Silvert, Kent Jiang, Himanshu Awasthi, and Nicholas Flores. We also extend many thanks to all the annotators who meticulously labelled these datasets.


  • Al Amri et al. (2018) H. Al Amri, V. Cartillier, R. G. Lopes, A. Das, J. Wang, I. Essa, D. Batra, D. Parikh, A. Cherian, T. K. Marks, and C. Hori. 2018. Audio Visual Scene-Aware Dialog (AVSD) challenge at DSTC7.
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV.
  • Arp et al. (2015) Robert Arp, Barry Smith, and Andrew D. Spear. 2015. Building Ontologies with Basic Formal Ontology. MIT Press.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate.
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In EMNLP.
  • Chao and Lane (2019) Guan-Lin Chao and Ian Lane. 2019. BERT-DST: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer. In INTERSPEECH.
  • Crook et al. (2019) Paul A. Crook, Shivani Poddar, Ankita De, Semir Shafi, David Whitney, Alborz Geramifard, and Rajen Subba. 2019. SIMMC: Situated Interactive Multi-Modal Conversational Data Collection And Evaluation Platform. ASRU.
  • Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In CVPR.
  • De Vries et al. (2017) Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. Guesswhat?! visual object discovery through multi-modal dialogue. In CVPR.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Eric et al. (2019) Mihail Eric, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. MultiWOZ 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.
  • Gao et al. (2019) Shuyang Gao, Abhishek Sethi, Sanchit Agarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. In SIGDIAL.
  • Guo et al. (2018) Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. 2018. Dialog-based interactive image retrieval. In NeurIPS.
  • Gupta et al. (2006) N. Gupta, G. Tur, D. Hakkani-Tur, S. Bangalore, G. Riccardi, and M. Gilbert. 2006. The AT&T spoken language understanding system. TASLP.
  • Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In SIGDIAL.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
  • Hori et al. (2018) Chiori Hori, Anoop Cherian, Tim K. Marks, and Florian Metze. 2018. Audio visual scene-aware dialog track in dstc8. DSTC Track Proposal.
  • Kingma and Ba (2015) Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Kollar et al. (2018) Thomas Kollar, Danielle Berry, Lauren Stuart, Karolina Owczarzak, Tagyoung Chung, Lambert Mathias, Michael Kayser, Bradford Snow, and Spyros Matsoukas. 2018. The Alexa meaning representation language. In NAACL.
  • Kottur et al. (2019) Satwik Kottur, José MF Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2019. Clevr-dialog: A diagnostic dataset for multi-round reasoning in visual dialog. arXiv preprint arXiv:1903.03166.
  • Miller et al. (2017) A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. ParlAI: A Dialog Research Software Platform. arXiv.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
  • Rastogi et al. (2019) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In AAAI.
  • Saha et al. (2018) Amrita Saha, Mitesh M Khapra, and Karthik Sankaranarayanan. 2018. Towards building large scale multimodal domain-aware conversation systems. In AAAI.
  • Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In NIPS.
  • Thomason et al. (2019) Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019. Vision-and-dialog navigation. arXiv preprint arXiv:1907.04957.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • de Vries et al. (2018) Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. 2018. Talk the walk: Navigating new york city through grounded dialogue. arXiv preprint arXiv:1807.03367.
  • Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In ACL.