Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset

by Abhinav Rastogi, et al.

Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a zero-shot dialogue state tracking model that achieves state-of-the-art performance on recent benchmark datasets.



1 Introduction

Virtual assistants help users accomplish tasks including but not limited to finding flights, booking restaurants and, more recently, navigating user interfaces, by providing a natural language interface to services and APIs on the web. The recent popularity of conversational interfaces and the advent of frameworks like Actions on Google and Alexa Skills, which allow developers to easily add support for new services, have resulted in a major increase in the number of application domains and individual services that assistants need to support, following the pattern of smartphone applications.

Consequently, recent work has focused on scalable dialogue systems that can handle tasks across multiple application domains. Data-driven deep learning based approaches for multi-domain modeling have shown promise, both for end-to-end and modular systems involving dialogue state tracking and policy learning. This line of work has been facilitated by the release of multi-domain dialogue corpora such as MultiWOZ [budzianowski2018multiwoz], M2M [shah2018building] and FRAMES [el2017frames].

However, existing datasets for multi-domain task-oriented dialogue do not sufficiently capture a number of challenges that arise with scaling virtual assistants in production. These assistants need to support a large, constantly increasing number of services over a large number of domains [kim-etal-2018-efficient]. In comparison, existing public datasets cover few domains. Furthermore, they define a single static API per domain, whereas multiple services with overlapping functionality, but heterogeneous interfaces, exist in the real world.

To highlight these challenges, we introduce the Schema-Guided Dialogue (SGD) dataset, which has been publicly released and is, to the best of our knowledge, the largest public task-oriented dialogue corpus. It exceeds existing corpora in scale, with over 16,000 dialogues in the training set spanning 26 services belonging to 16 domains (more details in Table 1). Further, to adequately test the models’ ability to generalize in zero-shot settings, the evaluation sets contain unseen services and domains. The dataset is designed to serve as an effective testbed for intent prediction, slot filling, state tracking and language generation, among other tasks in large-scale virtual assistants.

We also propose the schema-guided paradigm for task-oriented dialogue, advocating building a single unified dialogue model for all services and APIs. Using a service’s schema as input, the model would make predictions over this dynamic set of intents and slots present in the schema. This setting enables effective sharing of knowledge among all services, by relating the semantic information in the schemas, and allows the model to handle unseen services and APIs. Under the proposed paradigm, we present a novel architecture for multi-domain dialogue state tracking. By using large pretrained models like BERT [devlin2019bert], our model can generalize to unseen services and is robust to API changes, while achieving state-of-the-art results on the original and updated [eric2019multiwoz] MultiWOZ datasets.

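| Metric                  | DSTC2  | WOZ2.0 | FRAMES | M2M    | MultiWOZ | SGD     |
|-------------------------|--------|--------|--------|--------|----------|---------|
| No. of domains          | 1      | 1      | 3      | 2      | 7        | 16      |
| No. of dialogues        | 1,612  | 600    | 1,369  | 1,500  | 8,438    | 16,142  |
| Total no. of turns      | 23,354 | 4,472  | 19,986 | 14,796 | 113,556  | 329,964 |
| Avg. turns per dialogue | 14.49  | 7.45   | 14.60  | 9.86   | 13.46    | 20.44   |
| Avg. tokens per turn    | 8.54   | 11.24  | 12.60  | 8.24   | 13.13    | 9.75    |
| Total unique tokens     | 986    | 2,142  | 12,043 | 1,008  | 23,689   | 30,352  |
| No. of slots            | 8      | 4      | 61     | 13     | 24       | 214     |
| No. of slot values      | 212    | 99     | 3,871  | 138    | 4,510    | 14,139  |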
Table 1: Comparison of our SGD dataset to existing related datasets for task-oriented dialogue. Note that the numbers reported are for the training portions for all datasets except FRAMES, where the numbers for the complete dataset are reported.
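As a quick consistency check on Table 1, the "Avg. turns per dialogue" row can be recomputed from the two rows above it. The sketch below only copies the dialogue and turn counts from the table; the dictionary and function names are illustrative and not part of the dataset release.

```python
# Figures copied from Table 1: (number of dialogues, total number of turns).
TABLE_1 = {
    "DSTC2": (1_612, 23_354),
    "WOZ2.0": (600, 4_472),
    "FRAMES": (1_369, 19_986),
    "M2M": (1_500, 14_796),
    "MultiWOZ": (8_438, 113_556),
    "SGD": (16_142, 329_964),
}


def avg_turns_per_dialogue(num_dialogues, total_turns):
    """Average number of turns per dialogue, rounded to two decimals as in the table."""
    return round(total_turns / num_dialogues, 2)


AVG_TURNS = {name: avg_turns_per_dialogue(*counts) for name, counts in TABLE_1.items()}

for name, value in AVG_TURNS.items():
    print(f"{name}: {value:.2f}")
```

The recomputed values agree with the averages reported in the table, which suggests the flattened rows were read back in the right column order.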

2 Related Work

Task-oriented dialogue systems have constituted an active area of research for decades. The growth of this field has been consistently fueled by the development of new datasets. Initial datasets were limited to one domain, such as ATIS [hemphill1990atis] for spoken language understanding for flights. The Dialogue State Tracking Challenges [williams2013dialog, henderson2014second, Henderson2014TheTD, kim2017fourth] contributed to the creation of dialogue datasets with increasing complexity. Other notable related datasets include WOZ2.0 [wen2017network], FRAMES [el2017frames], M2M [shah2018building] and MultiWOZ [budzianowski2018multiwoz]. These datasets have utilized a variety of data collection techniques, falling within two broad categories:

  • Wizard-of-Oz: This setup [kelley1984iterative] connects two crowd workers playing the roles of the user and the system. The user is provided a goal to satisfy, and the system accesses a database of entities, which it queries as per the user’s preferences. WOZ2.0, FRAMES and MultiWOZ, among others, have utilized such methods.

  • Machine-machine Interaction: A related line of work explores simulation-based dialogue generation, where the user and system roles are simulated to generate a complete conversation flow, which can then be converted to natural language using crowd workers [shah2018building]. Such a framework may be cost-effective and error-resistant since the underlying crowd worker task is simpler, and semantic annotations are obtained automatically.

As virtual assistants incorporate diverse domains, recent work has focused on zero-shot modeling [bapna2017towards, xia2018zero, shah-etal-2019-robust], domain adaptation and transfer learning techniques. Deep-learning based approaches have achieved state-of-the-art performance on dialogue state tracking tasks. Popular approaches on small-scale datasets estimate the dialogue state as a distribution over all possible slot-values [henderson2014, wen2017network] or individually score all slot-value combinations [mrkvsic2017neural, zhong-etal-2018-global]. Such approaches are not practical for deployment in virtual assistants operating over real-world services having a very large and dynamic set of possible values. Addressing these concerns, approaches utilizing a dynamic vocabulary of slot values have been proposed [rastogi2018multi, goel2019hyst, wu-etal-2019-transferable].

3 The Schema-Guided Dialogue Dataset

An important goal of this work is to create a benchmark dataset highlighting the challenges associated with building large-scale virtual assistants. Table 1 compares our dataset with other public datasets. Our Schema-Guided Dialogue (SGD) dataset exceeds other datasets on most scale-related metrics. In particular, the much larger number of domains, slots, and slot values, and the presence of multiple services per domain, are representative of these scale-related challenges. Furthermore, our evaluation sets contain many services, and consequently slots, which are not present in the training set, to help evaluate model performance on unseen services.

The 17 domains (‘Alarm’ domain not included in training) present in our dataset are listed in Table 2. We create synthetic implementations of a total of 34 services or APIs over these domains. Our simulator framework interacts with these services to generate dialogue outlines, which are a structured representation of dialogue semantics. We then use a crowd-sourcing procedure to paraphrase these outlines to natural language utterances. Our novel crowd-sourcing procedure preserves all annotations obtained from the simulator and does not require any extra annotations after dialogue collection. In this section, we describe these steps in detail and then present analyses of the collected dataset.

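| Domain   | #Intents | #Dialogs | Domain     | #Intents | #Dialogs |
|----------|----------|----------|------------|----------|----------|
| Alarm    | 2 (1)    | 37       | Movie      | 4 (2)    | 1758     |
| Bank     | 4 (2)    | 1021     | Music      | 4 (2)    | 1486     |
| Bus      | 4 (2)    | 2609     | RentalCar  | 4 (2)    | 1966     |
| Calendar | 3 (1)    | 1602     | Restaurant | 4 (2)    | 2755     |
| Event    | 5 (2)    | 3927     | RideShare  | 2 (2)    | 1973     |
| Flight   | 8 (3)    | 3138     | Service    | 8 (4)    | 2090     |
| Home     | 2 (1)    | 1027     | Travel     | 1 (1)    | 2154     |
| Hotel    | 8 (4)    | 3930     | Weather    | 1 (1)    | 1308     |
| Media    | 4 (2)    | 1292     |            |          |          |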
Table 2: The number of intents (services in parentheses) and dialogues for each domain in the train and dev sets. Multi-domain dialogues contribute to counts of each domain. The domain Service includes salons, dentists, doctors etc.

3.1 Services and APIs

We define the schema for a service as a combination of intents and slots with additional constraints, with an example in Figure 1. We implement all services using a SQL engine. For constructing the underlying tables, we sample a set of entities from Freebase and obtain the values for slots defined in the schema from the appropriate attribute in Freebase. We decided to use Freebase to sample real-world entities instead of synthetic ones since entity attributes are often correlated (e.g., a restaurant’s name is indicative of the cuisine served). Some slots like event dates/times and available ticket counts, which are not present in Freebase, are synthetically sampled.

To reflect the constraints present in real-world services and APIs, we impose a few other restrictions. First, our dataset does not expose the set of all possible slot values for some slots. Having such a list is impractical for slots like date or time, which have infinitely many possible values, and for slots like movie or song names, for which new values are periodically added. Our dataset specifically identifies such slots as non-categorical and does not provide a set of all possible values for these. We also ensure that the evaluation sets have a considerable fraction of slot values not present in the training set to evaluate the models in the presence of new values. Some slots like gender, number of people, day of the week etc. are defined as categorical and we specify the set of all possible values taken by them. However, these values are not assumed to be consistent across services. E.g., different services may use (‘male’, ‘female’), (‘M’, ‘F’) or (‘he’, ‘she’) as possible values for the gender slot.

Second, real-world services can only be invoked with a limited number of slot combinations: e.g. restaurant reservation APIs do not let the user search for restaurants by date without specifying a location. However, existing datasets simplistically allow service calls with any given combination of slot values, thus giving rise to flows unsupported by actual services or APIs. As in Figure 1, the different service calls supported by a service are listed as intents. Each intent specifies a set of required slots and the system is not allowed to call this intent without specifying values for these required slots. Each intent also lists a set of optional slots with default values, which the user can override.

Figure 1: Example schema for a digital wallet service.
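The schema constraints described in this section (categorical versus non-categorical slots, required slots, and optional slots with defaults) can be sketched as a small data structure. This is a minimal illustration and not the dataset's actual schema format; the class names, the `build_call` helper, and the wallet-style intent and slot names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    name: str
    description: str
    # Categorical slots enumerate every possible value; non-categorical slots leave this empty.
    possible_values: tuple = ()

    @property
    def is_categorical(self):
        return len(self.possible_values) > 0


@dataclass
class Intent:
    name: str
    description: str
    required_slots: tuple = ()
    optional_slots: dict = field(default_factory=dict)  # slot name -> default value

    def build_call(self, user_values):
        """Return the arguments of a service call, refusing calls that lack required slots."""
        known = set(self.required_slots) | set(self.optional_slots)
        unknown = [slot for slot in user_values if slot not in known]
        if unknown:
            raise ValueError(f"slots not supported by {self.name}: {unknown}")
        missing = [slot for slot in self.required_slots if slot not in user_values]
        if missing:
            raise ValueError(f"missing required slots for {self.name}: {missing}")
        call = dict(self.optional_slots)  # start from the defaults
        call.update(user_values)          # the user may override them
        return call


# Hypothetical example loosely inspired by a digital wallet service.
VISIBILITY = Slot("private_visibility", "Whether the transaction is private", ("True", "False"))
AMOUNT = Slot("amount", "Amount of money to transfer")
EXAMPLE_INTENT = Intent(
    name="TransferMoney",
    description="Send money to a contact",
    required_slots=("amount", "receiver"),
    optional_slots={"private_visibility": "False"},
)
```

Calling `EXAMPLE_INTENT.build_call({"amount": "$20", "receiver": "Ann"})` fills in the default visibility, whereas omitting `receiver` raises an error, mirroring the rule that an intent cannot be called without its required slots.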

3.2 Dialogue Simulator Framework

The dialogue simulator interacts with the services to generate dialogue outlines. Figure 2 shows the overall architecture of our dialogue simulator framework. It consists of two agents playing the roles of the user and the system. Both agents interact with each other using a finite set of actions specified through dialogue acts over a probabilistic automaton designed to capture varied dialogue trajectories. These dialogue acts can take a slot or a slot-value pair as argument. Figure 6(b) shows all dialogue acts supported by the agents.

At the start of a conversation, the user agent is seeded with a scenario, which is a sequence of intents to be fulfilled. We identified over 200 distinct scenarios for the training set, each comprising up to 5 intents. For multi-domain dialogues, we also identify combinations of slots whose values may be transferred when switching intents, e.g. the ‘address’ slot value in a restaurant service could be transferred to the ‘destination’ slot for a taxi service invoked right after.

The user agent then generates the dialogue acts to be output in the next turn. It may retrieve arguments i.e. slot values for some of the generated acts by accessing either the service schema or the raw SQL backend. The acts, combined with the respective parameters yield the corresponding user actions. Next, the system agent generates the next set of actions using a similar procedure. Unlike the user agent, however, the system agent has restricted access to the services (denoted by dashed line), e.g. it can only query the services by supplying values for all required slots for some service call. This helps us ensure that all generated flows are valid.

After an intent is fulfilled through a series of user and system actions, the user agent queries the scenario to proceed to the next intent. Alternatively, the system may suggest related intents e.g. reserving a table after searching for a restaurant. The simulator also allows for multiple intents to be active during a given turn. While we skip many implementation details for brevity, it is worth noting that we do not include any domain-specific constraints in the simulation automaton. All domain-specific constraints are encoded in the schema and scenario, allowing us to conveniently use the simulator across a wide variety of domains and services.

Figure 2: The overall architecture of the dialogue simulation framework for generating dialogue outlines.

3.3 Dialogue Paraphrasing

The dialogue paraphrasing framework converts the outlines generated by the simulator into a natural conversation. Figure 3a shows a snippet of the dialogue outline generated by the simulator, containing a sequence of user and system actions. The slot values present in these actions are in a canonical form because they are obtained directly from the service. However, users may refer to these values in different ways during the conversation, e.g., “los angeles” may be referred to as “LA” or “LAX”. To introduce these natural variations in the slot values, we replace different slot values with a randomly selected variation (kept consistent across user turns in a dialogue) as shown in Figure 3b.
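The value-variation step can be illustrated with a short sketch. It assumes an outline is a list of turns, each a list of `(act, slot, value)` triples, and that a table of surface forms per canonical value is available; both representations, and the function name, are assumptions made for illustration rather than the authors' implementation.

```python
import random


def vary_slot_values(outline, variations, seed=0):
    """Replace canonical slot values with one randomly chosen surface form per dialogue."""
    rng = random.Random(seed)
    chosen = {}  # canonical value -> surface form reused for the rest of the dialogue
    varied = []
    for turn in outline:
        new_turn = []
        for act, slot, value in turn:
            if value in variations:
                if value not in chosen:
                    chosen[value] = rng.choice(variations[value])
                value = chosen[value]
            new_turn.append((act, slot, value))
        varied.append(new_turn)
    return varied


# Hypothetical two-turn outline; the variations come from the example in the text.
OUTLINE = [
    [("INFORM", "city", "los angeles")],
    [("INFORM", "city", "los angeles"), ("REQUEST", "time", None)],
]
VARIATIONS = {"los angeles": ["los angeles", "LA", "LAX"]}
VARIED = vary_slot_values(OUTLINE, VARIATIONS)
```

Caching the choice in `chosen` is what keeps a value consistent across turns: whichever form is drawn first is reused every time the same canonical value reappears in the dialogue.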

Next, we define a set of action templates for converting each action into an utterance. A few examples of such templates are shown below. These templates are used to convert each action into a natural language utterance, and the resulting utterances for the different actions in a turn are concatenated together as shown in Figure 3c. The dialogue transformed by these steps is then sent to the crowd workers. One crowd worker is tasked with paraphrasing all utterances of a dialogue to ensure naturalness and coherence.


In our paraphrasing task, the crowd workers are instructed to exactly repeat the slot values in their paraphrases. This not only helps us verify the correctness of the paraphrases, but also lets us automatically obtain slot spans in the generated utterances by string search. This automatic slot span generation greatly reduced the annotation effort required, with little impact on dialogue naturalness, thus allowing us to collect more data with the same resources. Furthermore, it is important to note that this entire procedure preserves all other annotations obtained from the simulator including the dialogue state. Hence, no further annotation is needed.
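Because workers repeat slot values verbatim, span annotation reduces to exact string search. A minimal sketch of that idea follows; the function name, the return format, and the example utterance are hypothetical, and the handling of values that occur more than once (first match wins) is an assumption.

```python
def find_slot_spans(utterance, slot_values):
    """Locate each slot value in the paraphrase by exact string search.

    Returns (spans, missing): spans maps slot -> (start, end) character offsets
    with an exclusive end, and missing lists slots whose value was not repeated
    verbatim, which flags the paraphrase for review.
    """
    spans = {}
    missing = []
    for slot, value in slot_values.items():
        start = utterance.find(value)  # first occurrence only
        if start == -1:
            missing.append(slot)
        else:
            spans[slot] = (start, start + len(value))
    return spans, missing


# Hypothetical paraphrase and the slot values it was asked to preserve.
UTTERANCE = "Could you book a table at Nopa in San Francisco for 7 pm?"
SLOT_VALUES = {"restaurant_name": "Nopa", "city": "San Francisco", "time": "7 pm"}
SPANS, MISSING = find_slot_spans(UTTERANCE, SLOT_VALUES)
```

The same search doubles as the correctness check mentioned above: a non-empty `missing` list means the worker altered a slot value.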

Figure 3: Steps for obtaining paraphrased conversations. To increase the presence of relative dates like tomorrow, next Monday, the current date is assumed to be March 1, 2019.

3.4 Dataset Analysis

With over 16,000 dialogues in the training set, the Schema-Guided Dialogue dataset is the largest publicly available annotated task-oriented dialogue dataset. The annotations include the active intents and dialogue states for each user utterance and the system actions for every system utterance. We have a few other annotations like the user actions but we withhold them from the public release. These annotations enable our dataset to be used as a benchmark for tasks like intent detection, dialogue state tracking, imitation learning of dialogue policy, dialogue act to text generation etc. The schemas contain semantic information about the schema and the constituent intents and slots, in the form of natural language descriptions and other details (example in Figure 1).

The single-domain dialogues in our dataset contain an average of 15.3 turns, whereas the multi-domain ones contain 23 turns on average. These numbers are also reflected in Figure 6(a), which shows the histogram of dialogue lengths on the training set. Table 2 shows the distribution of dialogues across the different domains. We note that the dataset is largely balanced in terms of the domains and services covered, with the exception of the Alarm domain, which is only present in the development set. Figure 6(b) shows the frequency of dialogue acts contained in the dataset. Note that all dialogue acts except INFORM, REQUEST and GOODBYE are specific to either the user or the system.

(a) Histogram of lengths of training set dialogues.
(b) Distribution of dialogue acts in training set.
Figure 6: Detailed statistics of the SGD dataset.

4 The Schema-Guided Approach

Figure 7: The predicted dialogue state (shown with dashed edges) for the first two user turns for an example dialogue, showing the active intent and slot assignments, with two related annotation schemas. Note that the dialogue state representation is conditioned on the schema under consideration, which is provided as input, as are the user and system utterances.

Virtual assistants aim to support a large number of services available on the web. One possible approach is to define a large unified schema for the assistant, with which different service providers can integrate. However, it is difficult to come up with a common schema covering all use cases. Having a common schema also complicates integration of tail services with limited developer support. We propose the schema-guided approach as an alternative to allow easy integration of new services and APIs.

Under our proposed approach, each service provides a schema listing the supported slots and intents along with their natural language descriptions (Figure 1 shows an example). These descriptions are used to obtain a semantic representation of these schema elements. The assistant employs a single unified model containing no domain or service specific parameters to make predictions conditioned on these schema elements. For example, Figure 7 shows how dialogue state representation for the same dialogue can vary for two different services. Here, the departure and arrival cities are captured by analogously functioning but differently named slots in both schemas. Furthermore, values for the number_stops and direct_only slots highlight idiosyncrasies between services interpreting the same concept.

There are many advantages to this approach. First, using a single model facilitates representation and transfer of common knowledge across related services. Second, since the model utilizes semantic representation of schema elements as input, it can interface with unseen services or APIs on which it has not been trained. Third, it is robust to changes like addition of new intents or slots to the service.

5 Zero-Shot Dialogue State Tracking

Models in the schema-guided setting can condition on the pertinent services’ schemas using descriptions of intents and slots. These models, however, also need access to representations for potentially unseen inputs from new services. Recent pretrained models like ELMo [peters2018deep] and BERT [devlin2019bert] can help, since they are trained on very large corpora. Building upon these, we present our zero-shot schema-guided dialogue state tracking model.

5.1 Model

We use a single model (whose code has been publicly released), shared among all services and domains, to make these predictions. We first encode all the intents, slots and slot values for categorical slots present in the schema into an embedded representation. Since different schemas can have differing numbers of intents or slots, predictions are made over dynamic sets of schema elements by conditioning them on the corresponding schema embeddings. This is in contrast to existing models which make predictions over a static schema and are hence unable to share knowledge across domains and services. They are also not robust to changes in schema and require the model to be retrained with new annotated data upon addition of a new intent, slot, or in some cases, a slot value to a service.

Schema Embedding

This component obtains the embedded representations of intents, slots and categorical slot values in each service schema. Table 3 shows the sequence pairs used for embedding each schema element. These sequence pairs are fed to a pretrained BERT encoder shown in Figure 8 and the output is used as the schema embedding.

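For a given service with $I$ intents and $S$ slots, let $\{\mathbf{i}_j\}$, $1 \le j \le I$ and $\{\mathbf{s}_j\}$, $1 \le j \le S$ be the embeddings of all intents and slots respectively. As a special case, we let $\{\mathbf{s}^n_j\}$, $1 \le j \le N \le S$ denote the embeddings for the $N$ non-categorical slots in the service. Also, let $\{\mathbf{v}^k_j\}$, $1 \le j \le V^k$ denote the embeddings for all possible values taken by the $k^{\text{th}}$ categorical slot, $1 \le k \le C$, with $C$ being the number of categorical slots and $N + C = S$. All these embeddings are collectively called schema embeddings.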

| Element | Sequence 1          | Sequence 2         |
|---------|---------------------|--------------------|
| Intent  | service description | intent description |
| Slot    | service description | slot description   |
| Value   | slot description    | value              |
Table 3: Input sequences for the pretrained BERT model to obtain embeddings of different schema elements.
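The pairing in Table 3 can be made concrete with a small helper that lists the two input sequences for every schema element, which would then be fed to the encoder. The dictionary layout of the service and the example descriptions below are hypothetical; only the pairing rule itself is taken from the table.

```python
def schema_sequence_pairs(service):
    """Build the (sequence 1, sequence 2) encoder inputs for every schema element."""
    pairs = {}
    for intent in service["intents"]:
        pairs[("intent", intent["name"])] = (service["description"], intent["description"])
    for slot in service["slots"]:
        pairs[("slot", slot["name"])] = (service["description"], slot["description"])
        # Only categorical slots list possible values, so only they yield value pairs.
        for value in slot.get("possible_values", []):
            pairs[("value", slot["name"], value)] = (slot["description"], value)
    return pairs


# Hypothetical service used only to exercise the helper.
EXAMPLE_SERVICE = {
    "description": "Find and reserve restaurants",
    "intents": [{"name": "FindRestaurants", "description": "Search for a restaurant"}],
    "slots": [
        {"name": "city", "description": "City where the restaurant is located"},
        {"name": "price_range", "description": "Price range of the restaurant",
         "possible_values": ["cheap", "moderate", "expensive"]},
    ],
}
PAIRS = schema_sequence_pairs(EXAMPLE_SERVICE)
```

Since the pairs depend only on the schema and not on the dialogue, the resulting embeddings could be computed once per service and cached.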

Utterance Encoding

Like [chao2019bert], we use BERT to encode the user utterance and the preceding system utterance to obtain utterance pair embedding $\mathbf{u} = \mathbf{t}_{\texttt{CLS}}$ and token level representations $\mathbf{t}_1, \mathbf{t}_2, \cdots, \mathbf{t}_M$, $M$ being the total number of tokens in the two utterances. The utterance and schema embeddings are used together to obtain model predictions using a set of projections (defined below).

Figure 8: BERT encoder, taking in two sequences $p$ and $q$ as input and outputting an embedded sequence pair representation $\mathbf{u}_{\texttt{CLS}}$ and token level representations $\{\mathbf{t}_1 \cdots \mathbf{t}_{n+m}\}$. We use BERT to obtain schema element embeddings and encode system and user utterances for dialogue state tracking.


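Projection

Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$. For a task $K$, we define $\mathbf{l} = \mathcal{F}_K(\mathbf{x}, \mathbf{y}, p)$ as a projection transforming $\mathbf{x}$ and $\mathbf{y}$ into the vector $\mathbf{l} \in \mathbb{R}^p$ using Equations 1-3. Here, $\mathbf{h}_1, \mathbf{h}_2 \in \mathbb{R}^d$, $W^K_i$ and $b^K_i$ for $1 \le i \le 3$ are trainable parameters of suitable dimensions and $A$ is the activation function. We use gelu [hendrycks2016gaussian] activation as in BERT.

$$\mathbf{h}_1 = A(W^K_1 \mathbf{x} + b^K_1) \tag{1}$$

$$\mathbf{h}_2 = A(W^K_2 (\mathbf{y} \oplus \mathbf{h}_1) + b^K_2) \tag{2}$$

$$\mathbf{l} = W^K_3 \mathbf{h}_2 + b^K_3 \tag{3}$$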
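One way to read the projection is as a small feed-forward network: a gelu layer on the first input, a second gelu layer on the second input concatenated with that hidden vector, and a final linear layer producing $p$ logits. The pure-Python sketch below follows that reading; the layer layout, the random initialization, and all function names are assumptions for illustration, not the released model code.

```python
import math
import random


def gelu(x):
    """Exact Gaussian error linear unit: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def affine(weight, bias, vec):
    """Compute weight @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_layer(rng, out_dim, in_dim):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def make_projection(d, p, seed=0):
    """Return project(x, y) mapping two d-dimensional vectors to p logits."""
    rng = random.Random(seed)
    layer1 = init_layer(rng, d, d)      # acts on x
    layer2 = init_layer(rng, d, 2 * d)  # acts on the concatenation of y and h1
    layer3 = init_layer(rng, p, d)      # final linear layer, no activation

    def project(x, y):
        h1 = [gelu(a) for a in affine(*layer1, x)]
        h2 = [gelu(a) for a in affine(*layer2, list(y) + h1)]
        return affine(*layer3, h2)

    return project


PROJECT = make_projection(d=4, p=3)
LOGITS = PROJECT([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

In the model, each prediction task would own a separate set of such parameters, with `x` taken from the utterance encoding and `y` from the schema embedding of the element being scored.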


Active Intent

For a given service, the active intent denotes the intent requested by the user and currently being fulfilled by the system. It takes the value “NONE” if no intent for the service is currently being processed. Let $\mathbf{i}_0$ be a trainable parameter in $\mathbb{R}^d$ for the “NONE” intent. We define the intent network as below.

$$l^j_{\text{int}} = \mathcal{F}_{\text{int}}(\mathbf{u}, \mathbf{i}_j, 1), \quad 0 \le j \le I \tag{4}$$

The logits $l^j_{\text{int}}$ are normalized using softmax to yield a distribution over all $I$ intents and the “NONE” intent. During inference, we predict the highest probability intent as active.

Requested Slots

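These are the slots whose values are requested by the user in the current utterance. Projection $\mathcal{F}_{\text{req}}$ predicts logit $l^j_{\text{req}}$ for the $j^{\text{th}}$ slot. Obtained logits are normalized using sigmoid to get a score in $[0, 1]$. During inference, all slots with score $> 0.5$ are predicted as requested.

$$l^j_{\text{req}} = \mathcal{F}_{\text{req}}(\mathbf{u}, \mathbf{s}_j, 1), \quad 1 \le j \le S \tag{5}$$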
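The two inference rules described so far (softmax over the intents plus “NONE”, and an independent sigmoid per slot) can be sketched as follows. The logits are passed in as plain lists, the “NONE” logit is assumed to come first, and the 0.5 cut-off is an assumed default; the function names and example values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_active_intent(intent_names, logits):
    """logits[0] scores the NONE intent; logits[1:] follow the order of intent_names."""
    names = ["NONE"] + list(intent_names)
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]


def predict_requested_slots(slot_names, logits, threshold=0.5):
    """Each slot is scored independently, so any number of slots can be requested."""
    return [name for name, logit in zip(slot_names, logits) if sigmoid(logit) > threshold]


ACTIVE, CONFIDENCE = predict_active_intent(["FindEvents", "BuyEventTickets"], [0.1, 2.0, -1.0])
REQUESTED = predict_requested_slots(["venue", "time", "price"], [1.2, -0.3, 0.4])
```

The contrast between the two heads is the point of the sketch: at most one intent is active per service, so intents compete under softmax, while requested slots are not mutually exclusive and are thresholded one by one.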


User Goal

We define the user goal as the user constraints specified over the dialogue context till the current user utterance. Instead of predicting the entire user goal after each user utterance, we predict the difference between the user goal for the current turn and preceding user turn. During inference, the predicted user goal updates are accumulated to yield the predicted user goal. We predict the user goal updates in two stages. First, for each slot, a distribution of size 3 denoting the slot status and taking values none, dontcare and active is obtained by normalizing the logits obtained in equation 6 using softmax. If the status of a slot is predicted to be none, its assigned value is assumed to be unchanged. If the prediction is dontcare, then the special dontcare value is assigned to it. Otherwise, a slot value is predicted and assigned to it in the second stage.

$$\mathbf{l}^j_{\text{status}} = \mathcal{F}_{\text{status}}(\mathbf{u}, \mathbf{s}_j, 3), \quad 1 \le j \le S \tag{6}$$
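The accumulation of per-turn updates into a user goal can be sketched as below. The sketch assumes the status of each slot has already been decoded to one of the three labels and that second-stage values are available for active slots; the function name and the example slots are hypothetical.

```python
def apply_goal_update(previous_goal, slot_statuses, predicted_values):
    """Apply one turn's predicted update to the accumulated user goal."""
    goal = dict(previous_goal)  # never mutate the goal from the previous turn
    for slot, status in slot_statuses.items():
        if status == "none":
            continue  # value carried over unchanged
        if status == "dontcare":
            goal[slot] = "dontcare"
        elif status == "active":
            goal[slot] = predicted_values[slot]  # filled in by the second stage
        else:
            raise ValueError(f"unknown slot status: {status}")
    return goal


# Two hypothetical user turns.
GOAL_TURN_1 = apply_goal_update(
    {},
    {"city": "active", "category": "none", "date": "none"},
    {"city": "New York"},
)
GOAL_TURN_2 = apply_goal_update(
    GOAL_TURN_1,
    {"city": "none", "category": "active", "date": "dontcare"},
    {"category": "Sports"},
)
```

Predicting only the change keeps each turn's decision local to the latest utterance pair, at the cost that an error in one turn persists until a later update overwrites the slot.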


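In the second stage, equation 7 is used to obtain a logit for each value taken by each categorical slot. Logits for a given categorical slot are normalized using softmax to get a distribution over all possible values. The value with the maximum mass is assigned to the slot. For each non-categorical slot, logits obtained using equations 8 and 9 are normalized using softmax to yield two distributions over all tokens. These two distributions respectively correspond to the start and end index of the span corresponding to the slot. The indices $p \le q$ maximizing $\text{start}[p] + \text{end}[q]$ are predicted to be the span boundary and the corresponding value is assigned to the slot.

$$l^{j,k}_{\text{value}} = \mathcal{F}_{\text{value}}(\mathbf{u}, \mathbf{v}^k_j, 1), \quad 1 \le j \le V^k, \; 1 \le k \le C \tag{7}$$

$$l^{j,k}_{\text{start}} = \mathcal{F}_{\text{start}}(\mathbf{t}_k, \mathbf{s}^n_j, 1), \quad 1 \le j \le N, \; 1 \le k \le M \tag{8}$$

$$l^{j,k}_{\text{end}} = \mathcal{F}_{\text{end}}(\mathbf{t}_k, \mathbf{s}^n_j, 1), \quad 1 \le j \le N, \; 1 \le k \le M \tag{9}$$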
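Span selection for a non-categorical slot can be sketched as a single pass over the tokens. The sketch assumes the objective is the sum of a start score and an end score, with the start index not allowed to exceed the end index; both the objective and the function name are assumptions made for illustration.

```python
def best_span(start_scores, end_scores):
    """Return (p, q) with p <= q maximizing start_scores[p] + end_scores[q]."""
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    best_start = 0   # index of the highest start score seen so far (positions <= q)
    best = None      # (score, p, q) of the best span found so far
    for q in range(len(end_scores)):
        if start_scores[q] > start_scores[best_start]:
            best_start = q
        score = start_scores[best_start] + end_scores[q]
        if best is None or score > best[0]:
            best = (score, best_start, q)
    return best[1], best[2]


START = [0.1, 2.0, 0.3, 0.0]
END = [3.0, 0.2, 1.5, 0.1]
SPAN = best_span(START, END)
```

In this example the highest end score sits at index 0 and the highest start score at index 1, but pairing them would give a span that ends before it starts, so the constrained search returns `(1, 2)` instead. Tracking the running maximum of the start scores keeps the search linear in the number of tokens.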

5.2 Evaluation

We consider the following metrics for evaluation of the dialogue state tracking task:

  1. Active Intent Accuracy: The fraction of user turns for which the active intent has been correctly predicted.

  2. Requested Slot F1: The macro-averaged F1 score for requested slots over all eligible turns. Turns with no requested slots in ground truth and predictions are skipped.

  3. Average Goal Accuracy: For each turn, we predict a single value for each slot present in the dialogue state. The slots which have a non-empty assignment in the ground truth dialogue state are considered for accuracy. This is the average accuracy of predicting the value of a slot correctly. A fuzzy matching score is used for non-categorical slots to reward partial matches with the ground truth.

  4. Joint Goal Accuracy: This is the average accuracy of predicting all slot assignments for a turn correctly. For non-categorical slots a fuzzy matching score is used.
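The two goal-accuracy metrics can be sketched for a single turn as follows. This is a simplified illustration, not the official evaluation script: `difflib.SequenceMatcher` stands in for the unspecified fuzzy matching score, a slot predicted without a ground-truth value is scored zero for the joint metric, and the joint score is taken as the product of per-slot scores. All names and example values are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_match(truth, prediction):
    """Stand-in fuzzy score in [0, 1]; 1.0 means an exact case-insensitive match."""
    return SequenceMatcher(None, truth.lower(), prediction.lower()).ratio()


def slot_scores(truth_state, predicted_state, categorical_slots):
    """Score every slot that appears in either state."""
    scores = {}
    for slot in set(truth_state) | set(predicted_state):
        truth = truth_state.get(slot)
        prediction = predicted_state.get(slot)
        if truth is None or prediction is None:
            scores[slot] = 0.0  # missed slot or spurious prediction
        elif slot in categorical_slots:
            scores[slot] = float(truth == prediction)
        else:
            scores[slot] = fuzzy_match(truth, prediction)
    return scores


def goal_accuracies(truth_state, predicted_state, categorical_slots):
    """Return (average goal accuracy, joint goal accuracy) for one turn."""
    scores = slot_scores(truth_state, predicted_state, categorical_slots)
    joint = 1.0
    for score in scores.values():
        joint *= score
    # The average only considers slots with a ground-truth assignment.
    active = [scores[slot] for slot in truth_state]
    average = sum(active) / len(active) if active else None
    return average, joint


TRUTH = {"city": "New York", "category": "Sports"}
PREDICTION = {"city": "New York", "category": "Music"}
AVERAGE_GA, JOINT_GA = goal_accuracies(TRUTH, PREDICTION, {"category"})
```

With one of two slots wrong, the average is 0.5 while the joint score drops to 0, which illustrates why joint goal accuracy is the stricter of the two metrics. Dataset-level numbers would average these per-turn values over all turns.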

Performance on other datasets

We evaluate our model on public datasets WOZ2.0, MultiWOZ 2.0 and the updated MultiWOZ 2.1 [eric2019multiwoz]. As results in Table 4 show, our model performs competitively on all these datasets. Furthermore, we obtain state-of-the-art joint goal accuracies of 0.516 on MultiWOZ 2.0 and 0.489 on MultiWOZ 2.1 test sets respectively, exceeding the best-known results of 0.486 and 0.456 on these datasets as reported in [eric2019multiwoz].

Performance on SGD

The model performs well for Active Intent Accuracy and Requested Slot F1 across both seen and unseen services, as shown in Table 4. For joint goal and average goal accuracy, the model performs better on seen services compared to unseen ones (Figure 9). The main reason for this performance difference is a significantly higher OOV rate for slot values of unseen services.

Performance on different domains (SGD)

The model performance also varies across various domains. The performance for the different domains is shown in Table 5 below. We observe that one of the factors affecting the performance across domains is still the presence of the service in the training data (seen services). Among the seen services, those in the ‘Events’ domain have a very low OOV rate for slot values and the largest number of training examples which might be contributing to the high joint goal accuracy. For unseen services, we notice that the ‘Services’ domain has a lower joint goal accuracy because of higher OOV rate and higher average turns per dialogue. For ‘Services’ and ‘Flights’ domains, the difference between joint goal accuracy and average accuracy indicates a possible skew in performance across slots where the performance on a few of the slots is much worse compared to all the other slots, thus considerably degrading the joint goal accuracy. The ‘RideSharing’ domain also exhibits poor performance, since it possesses the largest number of possible slot values across the dataset. We also notice that for categorical slots with similar slot values (e.g. “Psychologist” and “Psychiatrist”), there is a very weak signal for the model to distinguish between the different classes, resulting in inferior performance.

| Dataset      | Active Int Acc | Req Slot F1 | Avg GA | Joint GA |
|--------------|----------------|-------------|--------|----------|
| WOZ2.0       | N.A.           | 0.970       | 0.920  | 0.810    |
| MultiWOZ 2.0 | N.A.           | N.A.        | 0.831  | 0.516    |
| MultiWOZ 2.1 | N.A.           | N.A.        | 0.823  | 0.489    |
| SGD-S        | 0.966          | 0.965       | 0.776  | 0.486    |
| SGD-All      | 0.908          | 0.973       | 0.740  | 0.411    |
Table 4: Model performance on test sets of the respective datasets (except SGD variants, where dev sets were used). The SGD-S (SGD-Single) model is trained and evaluated on single-domain dialogues only, whereas the SGD-All model is trained and evaluated on the entire dataset. We also report results on MultiWOZ 2.0, the updated MultiWOZ 2.1, and WOZ2.0. N.A. indicates tasks not available for those datasets.
Figure 9: Performance of the model on all services, services seen in training data, services not seen in training data.
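| Domain       | Joint GA | Avg GA | Domain  | Joint GA | Avg GA |
|--------------|----------|--------|---------|----------|--------|
| Services*    | 0.115    | 0.556  | Movies* | 0.345    | 0.337  |
| Flights*     | 0.173    | 0.620  | Weather | 0.423    | 0.674  |
| Alarm*       | 0.205    | 0.086  | Travel  | 0.437    | 0.641  |
| Restaurants* | 0.247    | 0.564  | Buses   | 0.511    | 0.784  |
| Media*       | 0.250    | 0.670  | Music   | 0.552    | 0.753  |
| Hotels**     | 0.286    | 0.601  | Homes   | 0.564    | 0.831  |
| RideSharing  | 0.300    | 0.631  | Events  | 0.656    | 0.890  |
| RentalCars   | 0.323    | 0.786  | Banks*  | 0.708    | 0.908  |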
Table 5: Model performance per domain (GA: goal accuracy). Domains marked with ’*’ are those for which the service in the dev set is not present in the training set. Hotels domain marked with ’**’ has one unseen and one seen service. For other domains, the service in the dev set was also seen in the training set. We see that the model generally performs better for domains containing services seen during training.

6 Discussion

It is often argued that simulation-based data collection does not yield natural dialogues or sufficient coverage when compared to other approaches such as Wizard-of-Oz. We argue that simulation-based collection is a better alternative for collecting datasets like this one, owing to the factors below.

  • Fewer Annotation Errors: All annotations are automatically generated, so annotation errors are rare. In contrast, [eric2019multiwoz] reported annotation errors in 40% of turns in MultiWOZ 2.0, which utilized a Wizard-of-Oz setup.

  • Simpler Task: The crowd worker task of paraphrasing a readable utterance for each turn is simple. The error-prone annotation task requiring skilled workers is not needed.

  • Low Cost: The simplicity of the crowd worker task and lack of an annotation task greatly cut data collection costs.

  • Better Coverage: A wide variety of dialogue flows can be collected and specific use cases can be targeted.

7 Conclusions

We presented the Schema-Guided Dialogue dataset to encourage scalable modeling approaches for virtual assistants. We also introduced the schema-guided paradigm for task-oriented dialogue, which simplifies the integration of new services and APIs with large-scale virtual assistants. Building upon this paradigm, we presented a scalable zero-shot dialogue state tracking model achieving state-of-the-art results.


The authors thank Guan-Lin Chao for help with model design and implementation, and Amir Fayazi and Maria Wang for help with data collection.


Appendix A

User I have some free time this weekend and I am looking for some interesting events to go to. Can you find me any interesting events I could attend?
System Is there a particular city you would like me to search for events in. Do you have a preference on the category of events.
User I am interested in events around New York. I enjoy going to sports games.
System I found 10 events that you would enjoy. The Mets vs Braves are playing at Citi Field on March 13th at 6 pm.
User Please share some more information with me. Where is the venue located?
System This is a baseball game. The address is 123-01 Roosevelt Avenue
User This is perfect! Can you find some other events for me? I will be in Toronto, Ontario. Please look for events happening near Toronto on Friday next week.
System There is a Blue Jays vs Braves game. They are playing at the Rogers Centre next Friday at 8 pm.
User This is perfect! Please reserve 2 tickets to the Blue Jays vs. Indians game for me
System Please confirm the following details: You are reserving 2 tickets for the Blue Jays vs Indians event. This reservation is for next Friday in Toronto.
User Incorrect. I would like to reserve four tickets.
System Please confirm the following details: You are reserving 4 tickets for the Blue Jays vs. Indians event in Toronto.
User Yes, this is the correct information. May you provide me the address of the venue?
System Your reservation is confirmed Enjoy the game! The address is 1 Blue Jays Way.
User Thank you so much for your help! That is all I need.
System Have a great day.
Figure A.1: An example single domain dialogue from our dataset in the “Events” domain.
User I wanna change my hair style, help me find a salon
System Where are you trying to get your hair done?
User Let’s find a place in Oakland
System How about 17 Jewels Salon in Oakland, I hear it’s nice
User Umm…Doesn’t sound like something I’d like, find me something else
System What about this other Oakland salon called Benny Adem Grooming Parlor
User What’s their contact info and address?
System You can reach them at 510-227-4268 and they’re at 408 14th Street
User Actually find me something else
System Bibo Salon in Oakland has great reviews
User Yes that’ perfect
System Shall I book an appointment for you then?
User Yeah I want it done ASAP, maybe later today if possible?
System What time then?
User afternoon 12:30 would be great
System Confirming your 12:30 pm Bibo Salon appointment today
User Yes that’s correct
System Ok glad I was able to book it
User Thanks that’s it for now
System Can’t wait to see your new hair!
Figure A.2: An example single domain dialogue from our dataset in the “Services” domain.
User I’d like to catch a movie nearby. Can you find me any movie showings?
System Sure, where should I look?
User Please look around SFO for movie showings.
System Okay, I found 10 movies being shown in that area that you might like. What do you think about the movies Ash Is Purest White, Captain Marvel or Citizen Kane?
User Citizen Kane sounds like a good one.
System How about buying any tickets for the movie?
User No thanks, not right now. I’d like to grab something to eat around there too. Can you find any good restaurants to eat at?
System Are you looking for a restaurant that serves a certain kind of food?
User Yeah, can you find restaurants that serve Punjabi food and are moderate with their price range?
System How about eating at a nice restaurant in San Francisco called August 1 Five?
User Great, that sounds perfect. When can I watch the movie you found me earlier?
System How about seeing it on March 2nd?
User Yeah, that works for me.
System Okay, there’s 1 showing at 7:30 pm at the Vogue Theatre.
User Great, that works.
System Anything else I can help you with today?
User Yeah, can you book me a table at the restaurant you found for me?
System Sure, what time would you like to eat there?
User Reserve a table for me at around 5:30 pm, dinner time.
System Okay, please confirm that you’d like to reserve a table for 2 at August 1 Five located in San Francisco for March 2nd at 5:30 pm.
User Yup, that sounds right. Where exactly are they located and do you know if they play live music?
System Your table has been successfully booked! Their address is 524 Van Ness Avenue and unfortunately they don’t have live music here.
User Okay, can you get me a cab to take me there?
System Sure, do you have a ride preference? I can get you a regular, luxury or other type of ride if you’d like.
User Yes, please get me a luxury ride.
System Okay, please confirm that you want a luxury ride with 2 seats to take you to 524 Van Ness Avenue.
User Yeah that sounds right to me.
System Okay, your cab has been booked and it’s on its way.
User Awesome, thanks!
System Anything else I can do for you today?
User Nope, that’s all I needed, thanks!
System No problem, enjoy your day!
Figure A.3: An example multi-domain dialogue from our dataset covering “Movies”, “Restaurants” and “RideSharing” domains.
User My buddy is getting married and I need to find a bus ticket.
System I can help you with that. Can you give me a little more information like where you are traveling from, where you are traveling to, and when you would like to leave.
User I need to go from Las Vegas to Sacramento on the 7th of this month.
System Great, I was able to find 4 buss’s matching that criteria. How does a bus with 1 transfer, leaving at 7:20 am, for $88 sound?
User That sounds like just what i’m looking for. I will also be needing a rental car until the 11th of this month.
System Will this car be from Sacramento from March 7th.
User I would like to pick up the car on the 8th around two in the afternoon.
System The first of 5 cars I have found for you is a standard Accord, located at SMF International Airport, available for pickup on March 8th.
User That one sounds perfect.
System Would you like me to proceed with making that reservation for you?
User Yes, let’s go ahead with that car.
System Before I reserve this I would like to make sure all the details I have are correct. You would like to pickup a standard vehicle from SMF International Airport on March 8th at 2 pm, and will be returning the car on March 11th.
User Yes, that is all correct but can you give me an estimated price too please?
System I was successfully able to reserve that with a total cost of $105 for you.
User Great, could you also help me find an online movie to watch?
System What genre of movie are you looking for?
User I am really into detective movies right now.
System The 2 movies I found that you may like are Body Double and Citizen Kane.
User I would really like to see Body Double now if possible, I heard it’s pretty good.
System Okay, so you would like me to begin playing the movie Body Double without subtitles now. Is that correct?
User Yes, that would be great.
System I’ve went ahead and started playing that movie for you.
User Thanks so much!
System Is there anything else I can help you with for now?
User No, that’s all for now.
System Have a good day and enjoy your movie!
Figure A.4: An example multi-domain dialogue from our dataset covering “Buses”, “RentalCars” and “Media” domains.
Figure A.5: Interface of the dialogue paraphrasing task where the crowd workers are asked to rephrase the dialogue outlines to a more natural expression. The actual interface shows the entire conversation, but only a few utterances have been shown in this figure. All non-categorical slot values are highlighted in blue. The task cannot be submitted unless all highlighted values in the outline are also present in the conversational dialogue.
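The submission constraint described in the caption of Figure A.5 can be sketched as a simple check that every highlighted value from the outline appears in the crowd worker's paraphrase. The function names are hypothetical, and case-insensitive substring matching is an assumption: the actual matching rule used by the interface is not specified here.

```python
from typing import List


def missing_slot_values(paraphrase: str, slot_values: List[str]) -> List[str]:
    """Return the highlighted values that do not occur in the paraphrase."""
    text = paraphrase.casefold()
    # Assumption: a value counts as present if it occurs as a
    # case-insensitive substring of the paraphrased utterance.
    return [value for value in slot_values if value.casefold() not in text]


def can_submit(paraphrase: str, slot_values: List[str]) -> bool:
    """Allow submission only when no highlighted value is missing."""
    return not missing_slot_values(paraphrase, slot_values)


# Example using values from the dialogue in Figure A.2.
values = ["Bibo Salon", "12:30 pm"]
ok = can_submit("Confirming your 12:30 pm Bibo Salon appointment today", values)
missing = missing_slot_values("Confirming your salon appointment today", values)
```

A check of this kind keeps the automatically generated slot-span annotations recoverable from the paraphrased utterance, since each non-categorical value must still occur in the text.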