The Eighth Dialog System Technology Challenge

by   Seokhwan Kim, et al.

This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.



1 Introduction

The Dialog System Technology Challenge (DSTC) is an ongoing series of research competitions for dialog systems. To accelerate the development of new dialog technologies, the DSTCs have provided common testbeds for various research problems. The earlier Dialog State Tracking Challenges Williams et al. (2013); Henderson et al. (2014c, b) focused on developing a single component for dialog state tracking on goal-oriented human-machine conversations. Then, DSTC4 Kim et al. (2017) and DSTC5 Kim et al. (2016) introduced human-human conversations and started to offer multiple tasks, not only for dialog state tracking but also, as pilot tasks, for other components in dialog systems. Starting with the sixth challenge Hori et al. (2019c), the DSTC rebranded itself as the “Dialog System Technology Challenge” and organized multiple main tracks in parallel to address a wider variety of dialog-related problems. Most recently, DSTC7 Yoshino et al. (2019) focused on developing end-to-end dialog technologies for the following three tracks: noetic response selection Gunasekara et al. (2019, 2019), grounded response generation Galley et al. (2019), and audio visual scene-aware dialog Alamri et al. (2018).

For the eighth DSTC, we received seven track proposals and went through a formal peer review process focusing on each task’s potential for (a) broad interest from the research community, (b) practical impact of the task outcomes, and (c) continuity from the previous challenges. We ultimately selected four main tracks: two newly introduced tasks and two follow-ups to DSTC7 tasks. The multi-domain task-completion track (Section 2) addresses end-to-end response generation problems in multi-domain task completion and cross-domain adaptation scenarios. NOESIS II (Section 3) explores a response selection task that extends the first NOESIS track in DSTC7 and offers two additional subtasks for identifying task success and disentangling conversations. The audio visual scene-aware dialog track (Section 4) is another follow-up track of DSTC7, which aims to generate dialog responses using the multi-modal information given in an input video. The schema-guided dialog state tracking track (Section 5) revisits dialog state tracking problems in a practical setting associated with the large number of services/APIs required to build virtual assistants in practice. The remainder of this paper describes the details of each track.

2 Multi-Domain Task-Completion Track

This track offers two tasks to foster progress in two important aspects of dialog systems: dialog complexity and scaling to new domains.

2.1 Task 1: End-to-end multi-domain dialog system

Previous work in the dialog research community has mainly focused on individual components of a dialog system, pushing forward the performance of each component. However, improving individual components does not necessarily boost the performance of the entire system Lee et al. (2019); Gao et al. (2019). The metrics used for an individual component might not be significant for an end-to-end system, and the propagation of errors down the pipeline is likely to diminish the component-wise improvement. With these concerns in mind, researchers have recently made efforts to create end-to-end approaches Wen et al. (2017a); Lei et al. (2018), but it is hard to compare them with conventional methods given the effort and complexity involved in combining individual models in conventional approaches.

To address these concerns, we provide ConvLab Lee et al. (2019), a multi-domain end-to-end dialog system platform covering a range of state-of-the-art models, to reduce the effort of building and evaluating end-to-end dialog systems. Based on ConvLab, participants in the task are to build a dialog system that takes natural language as input, tracks dialog states during the conversation, interacts with a task-specific knowledge base, and generates a natural language response as output. There is no restriction on system architectures, and participants are encouraged to explore various approaches ranging from conventional pipeline systems to end-to-end neural approaches.
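The input-to-output loop described above (understand the utterance, update the dialog state, query the knowledge base, produce a response) can be illustrated with a toy sketch. This is not ConvLab code and does not use its API; the class `ToyDialogSystem`, its methods, and the keyword-spotting logic are hypothetical stand-ins chosen only to make the data flow concrete.

```python
class ToyDialogSystem:
    """Hypothetical pipeline: NLU -> state tracking -> KB lookup -> NLG."""

    def __init__(self, knowledge_base):
        self.kb = knowledge_base  # list of dicts, one per entity
        self.state = {}           # accumulated slot -> value constraints

    def understand(self, utterance):
        """Keyword-spotting NLU: return slot values mentioned in the text."""
        text = utterance.lower()
        found = {}
        for entity in self.kb:
            for slot, value in entity.items():
                if slot != "name" and value in text:
                    found[slot] = value
        return found

    def respond(self, utterance):
        # State tracking: merge newly detected constraints into the state.
        self.state.update(self.understand(utterance))
        if not self.state:
            return "What kind of place are you looking for?"
        # Knowledge base query: keep entities satisfying every constraint.
        matches = [
            e for e in self.kb
            if all(e.get(slot) == value for slot, value in self.state.items())
        ]
        # Template-based response generation.
        if not matches:
            return "Sorry, nothing matches your request."
        return f"I found {len(matches)} option(s), for example {matches[0]['name']}."


kb = [
    {"name": "Thai House", "food": "thai", "area": "north"},
    {"name": "Pizza Hut", "food": "italian", "area": "south"},
]
system = ToyDialogSystem(kb)
first_reply = system.respond("I want thai food")
second_reply = system.respond("in the south please")
```

Because the state persists across turns, the second request combines both constraints and finds no match, which is the kind of multi-turn behavior a component-level metric would not capture.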

2.1.1 Data

In this task, we use the MultiWOZ Budzianowski et al. (2018) dataset, a dialog corpus collected from conversations over multiple domains under the tourist information desk setting. We enhanced the dataset with additional annotations for user dialog acts, which are missing in the original dataset, and included it in ConvLab.

2.1.2 Evaluation and Results

| Team | Human Succ. % | Human Under. | Human Appr. | Human Turns | Sim. Succ. % | Sim. Reward | Sim. Turns | Sim. Prec. | Sim. Rec. | Sim. F1 | Sim. Book % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Best | 68.32 | 4.15 | 4.29 | 19.51 | 88.80 | 61.56 | 7.00 | 0.92 | 0.96 | 0.93 | 93.75 |
| Baseline | 56.45 | 3.10 | 3.56 | 17.54 | 63.40 | 30.41 | 7.67 | 0.72 | 0.83 | 0.75 | 86.37 |

H: human evaluation. Sim.: simulator-based evaluation. The best results for human evaluation and simulator-based evaluation are from different teams.
Metrics: Succ.: success rate, Under.: understanding score, Appr.: appropriateness score, Prec./Rec.: precision/recall of slots prediction.
Table 1: Task 1 Evaluation Results

Two evaluation metrics are offered in this task:

Simulator-based evaluation: The end-to-end user simulator for automatic evaluation is constructed by combining agenda-based user simulator Schatzmann et al. (2007), rule-based NLG and MILU, all of which have been implemented in ConvLab. The evaluation metrics employed include success rate, average reward, and number of turns for each dialog. We also report precision, recall, and F1 score for slot prediction.
Crowdworker-based human evaluation: Using the simulator-based automatic evaluation, we filter out low-quality submissions and send the remaining systems to Amazon Mechanical Turk for human evaluation. Crowd-workers communicate with the system via natural language, judge the system, and rate its language understanding correctness and response appropriateness on a 5-point scale. Extra metrics including success rate and number of turns are also reported.
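As a minimal sketch of how the simulator-based numbers could be aggregated, the following computes success rate, average reward, average turns, and slot precision/recall/F1 over a set of simulated dialogs. The `DialogResult` record layout and the pooled (micro-averaged) slot counting are assumptions made for illustration; the actual ConvLab evaluation may aggregate differently.

```python
from dataclasses import dataclass


@dataclass
class DialogResult:
    """Outcome of one simulated dialog (hypothetical record layout)."""
    success: bool
    reward: float
    turns: int
    predicted_slots: set  # (slot, value) pairs produced by the system
    gold_slots: set       # (slot, value) pairs in the simulated user goal


def summarize(results):
    """Aggregate corpus-level metrics over a list of DialogResult."""
    n = len(results)
    true_pos = sum(len(r.predicted_slots & r.gold_slots) for r in results)
    n_pred = sum(len(r.predicted_slots) for r in results)
    n_gold = sum(len(r.gold_slots) for r in results)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "success_rate": 100.0 * sum(r.success for r in results) / n,
        "avg_reward": sum(r.reward for r in results) / n,
        "avg_turns": sum(r.turns for r in results) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


results = [
    DialogResult(True, 40.0, 6,
                 {("area", "north"), ("food", "thai")}, {("area", "north")}),
    DialogResult(False, -10.0, 8,
                 {("day", "monday")}, {("day", "monday"), ("people", "2")}),
]
summary = summarize(results)
```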

Twelve teams participated in this task. Table 1 lists the results for both human evaluation and simulator-based evaluation. A component-wise system with a BERT-based NLU model Devlin et al. (2019) and an elaborate rule-based dialog policy and dialog state tracker achieves the best success rate of 88.80% in simulator-based evaluation. However, there are discrepancies between human evaluation and simulator-based evaluation. The best system in the human evaluation is based on fine-tuning GPT-2 Radford et al. (2019). It predicts dialog states, system actions, and responses in an end-to-end fashion, and achieves a success rate of 68.32%.

2.2 Task 2: Fast Adaptation Task

Neural dialog response generators require very large datasets to learn to output consistent and grammatically correct sentences Vinyals and Le (2015); Li et al. (2016); Wen et al. (2017b). This makes it extremely hard to scale out the system to new domains with limited in-domain data, for example, when modeling user responses for a task-oriented chatbot on a narrow domain. With this challenge, our goal is to investigate whether sample complexity can decrease with time, i.e., if a dialog system that was trained on a large corpus can learn to converse about a new domain given a much smaller in-domain corpus.

2.2.1 Data

We provide two dialog datasets, where each dialog belongs to exactly one domain.

Reddit Dataset We constructed a corpus of over five million dialogs from Reddit submissions and comments spanning one year of data. Content is selected from a curated list of one thousand subreddits using a methodology similar to the DSTC7 sentence generation task Galley et al. (2019). We provide pre-processing code for Reddit data so that all participants work on the same corpus.

Goal-Oriented Corpus MetaLWOz We collected 37 884 goal-oriented dialogs via crowd-sourcing using a Wizard of Oz scheme. These dialogs span 47 domains (e.g. bus schedule, alarm setting, banking) and are particularly suited for meta-learning dialog models. For each dialog, we paired two crowd-workers, one had the role of being a bot, and the other one was the user. We defined 227 tasks distributed over the domains. Note that all entities were invented by the crowd-workers (for instance, the address of a bus stop) and the goal of this challenge is to predict convincing user utterances.

2.2.2 Evaluation and Results

We evaluate responses by the domain-adapted dialog model using two metrics:
Automatic metrics: A small set of complete single-domain MultiWOZ Budzianowski et al. (2018) dialogs is provided to the model, which is then asked to respond to an incomplete dialog. Intents and slot values correctly detected by the baseline NLU (cf. Sec. 2.1) in the response serve as an indicator that the domain adaptation was successful. We report intent F1 as well as intent+slot F1.

Human evaluation: The model is given a small set of complete dialogs from a held-out MetaLWOz domain, and is asked to predict a response to an incomplete dialog from the same domain. Human annotators were asked to judge the appropriateness, informativeness and utility of the responses Galley et al. (2019) given the MetaLWOz task, i.e. whether the simulated user tries to complete the task. Crowd-workers submit pairwise binary preference judgements given dialog context and metric. Pairs are picked using Multisort Maystre and Grossglauser (2017) and per dialog/metric rankings are aggregated using Copeland’s method Copeland (1951). We use bootstrapping Hall et al. (2009) over dialog contexts to assess ranking robustness and found it to be stable. Inter-annotator agreement Cohen (1960); Callison-Burch et al. (2011) is at . No method outperformed the ground truth.
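The aggregation step can be sketched as follows. This is one common formulation of Copeland's method (pairwise majority wins minus losses), written under the assumption that each judgement is recorded as a `(winner, loser)` pair; the Multisort pair selection and the bootstrap over dialog contexts are omitted, and the function name is illustrative.

```python
from collections import defaultdict


def copeland_ranking(preferences):
    """Rank systems from pairwise binary preference judgements.

    preferences: iterable of (winner, loser) pairs, one per judgement.
    Returns (ranking, scores), where each score is the number of
    opponents beaten by majority minus the number lost to.
    """
    wins = defaultdict(int)  # wins[(a, b)] = times a was preferred over b
    systems = set()
    for winner, loser in preferences:
        wins[(winner, loser)] += 1
        systems.update((winner, loser))

    scores = {s: 0 for s in systems}
    for a in systems:
        for b in systems:
            if a == b:
                continue
            if wins[(a, b)] > wins[(b, a)]:
                scores[a] += 1
            elif wins[(a, b)] < wins[(b, a)]:
                scores[a] -= 1
    # Sort by score (descending); break ties by name for determinism.
    ranking = sorted(systems, key=lambda s: (-scores[s], s))
    return ranking, scores


judgements = [
    ("gold", "A"), ("gold", "A"), ("A", "gold"),
    ("gold", "B"),
    ("A", "B"), ("B", "A"), ("A", "B"),
]
ranking, scores = copeland_ranking(judgements)
```

In this toy input, the ground-truth responses win both of their pairwise contests and rank first, mirroring the observation that no method outperformed the ground truth.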

As a baseline, we provided a retrieval model that relies on FastText Joulin et al. (2016) embeddings of SentencePiece Kudo and Richardson (2018) tokens and only takes into account the given in-domain dialogs. The track received four submissions, all of which surpassed baseline performance on automatic evaluation. As in Task 1 (Sec. 2.1.2), we find differences in ranking between human and automatic evaluation.

The two best teams use a Transformer Vaswani et al. (2017) (TeamB) or BiLSTM-based Hochreiter and Schmidhuber (1997) (TeamA) base model that is fine-tuned on the in-domain dialogs. The BiLSTM-based model is additionally fine-tuned on dynamically sampled Reddit dialogs, while the Transformer model additionally ranks both the observed in-domain dialog responses and the generated response using next sentence classification.


Automatic Evaluation Human Evaluation
Submission Intent F1 Intent & Slot F1 Mean Bootstrap Rank Final Rank
Table 2: Fast Adaptation Task Evaluation Results

3 NOESIS II: Predicting Responses Track

This track is a follow-up to DSTC 7 Track 1, "NOESIS: Noetic End-to-End Response Selection Challenge" Yoshino et al. (2019). That task considered the next-utterance selection problem in dialogues with two participants and in two domains. This task extends the challenge in three ways: (1) conversations with more than two participants; (2) being able to predict whether a dialogue has solved the problem yet; (3) handling multiple simultaneous conversations in the same communication channel. Each of these adds an important aspect of real-world conversations.

3.1 Task definition

The primary task is next-utterance selection. In this problem, each example consists of a partial dialogue and a set of potential messages to come next in the dialogue. Participants must rank the potential messages plus the possibility that the true next message is not in the set. We followed the configuration from DSTC 7 track 1, with one hundred options for the next message. In 20% of cases the true next message is not in the set. Participants are also permitted to use certain external knowledge sources in their system.

We also consider three other subtasks that probe specific challenges in dialogue. The second subtask is a variant of the main task in which the conversation context contains a mixture of different conversations. This can occur in settings where a group of people are communicating in the same channel. To reduce ambiguity about which conversation the next message is part of, we provide the identity of the speaker. In the third subtask, the goal is to determine whether the conversation has succeeded in solving the user’s problem. Systems must predict the point in the conversation so far at which success or failure occurred, or that no conclusion has been reached yet. As an optional task, we consider a conversation disentanglement problem, in which data from a channel with multiple conversations must be separated into a set of separate conversations.

3.2 Data

As in DSTC 7 track 1, two sources of data were considered. Both are task oriented, but one is much broader in scope and has more data (Ubuntu) while the other is smaller and more focused (Advising).

Time Speaker Message
12:30 s how can i boost microphone volume? The volume is toooooo low
12:30 s s , look for a microphone boost in alsamixer
12:30 s s : type ’alsamixer’ into terminal
12:31 s how the heck do i use alsamixer? :P what is microphone ?
12:32 s how do i choose volume on input s ?
12:33 s s : arrow keys up and down
12:33 s s , yes i understand that. But wich one of those things am i supposed to choose ?
12:33 s s : you wanted input, right?
12:34 s s , yes. But i there is no way i can turn that up. :S
12:34 s s : press tab to go over to capture, then turn it up
12:34 s aha :) thanks
Speaker Message
Student Hello!
Advisor Hi!
Student I am currently trying to figure out what courses to take next semester.
Student Could you suggest any?
Advisor Let me see. Give me a minute to go over your transcript
Advisor Can you tell me what your preferences are?
Student Of course! I am interested in Computer Science, video game design is something that has always been interesting for me.
Advisor Eecs 280 is a prerequisite for most computer science classes, including game design
Student Okay yeah I will take that course. Do you know of any other prerequisites for game design?
Advisor Eecs 281 is also necessary, and unfortunately you can’t take both 280 and 281 in the same
Advisor You should take Eecs 203 as that is also a prerequisite for most Eecs classes
Student Okay thanks for the info! Are both EECS 203 and EECS 280 project based?
Advisor 280 is all project based and 203 is not, but don’t let that fool you. Many students say 203 is harder than 280
Student Oh wow okay so do you think that taking them both in the same semester will be manageable?
Advisor If you have a good grasp of probability and combinations it is perfectly manageable

Figure 1: Examples of data in NOESIS II track: new dialogues from Ubuntu (top) and prior dialogues from Advising (bottom).

A new set of disentangled Ubuntu IRC dialogs was provided for this challenge based on recent work Kummerfeld et al. (2019). These are derived from the raw Ubuntu logs directly, not from any prior corpus. The dataset consists of multi-party conversations extracted from the Ubuntu IRC channel. A typical dialog starts with a question that was asked by one participant, and then other participants respond with either an answer or follow-up questions that then lead to a back-and-forth conversation. In this challenge, the context of each dialog contains at least three messages between the participants. The next turn in the conversation is guaranteed to be from one of the participants who has spoken so far.


The Advising dataset contains two-party dialogues that simulate a discussion between a student and an academic advisor. The purpose of the dialogues is to guide the student to pick courses that fit not only their curriculum, but also personal preferences about time, difficulty, areas of interest, etc. The conversations used are the same as those used in DSTC 7 track 1 Yoshino et al. (2019). They were collected by having students at the University of Michigan act as the two roles using provided personas. Structured information in the form of a database of course information was provided, as well as the personas (though at test time only information available to the advisor was provided, i.e. not the explicit student preferences). The data also includes paraphrases of the sentences and of the target responses.

3.3 Evaluation and Results

The main task and the second subtask used Recall@k (k=1,10) and mean reciprocal rank (MRR) as the evaluation metrics, following DSTC 7 track 1. The teams were ranked using the mean of Recall@10 and MRR. The third subtask used accuracy, precision, recall, and F-score, which indicate the model’s ability to correctly identify whether the dialog task has succeeded or not.
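The ranking metrics can be sketched in a few lines. The snippet below assumes each system output is a ranked list of candidate ids and that the "true message is not in the set" case is represented by a special candidate id `"NONE"`; that encoding and the function names are assumptions for illustration, not the official scoring script.

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of examples whose gold candidate is in the top k."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists, gold):
    """Mean of 1/rank of the gold candidate (0 if it is not ranked)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


def ranking_score(ranked_lists, gold):
    """Team ranking criterion: mean of Recall@10 and MRR."""
    return (recall_at_k(ranked_lists, gold, 10)
            + mean_reciprocal_rank(ranked_lists, gold)) / 2


ranked = [["a", "b", "c"], ["b", "NONE", "a"], ["c", "a", "b"]]
gold = ["a", "NONE", "b"]
r1 = recall_at_k(ranked, gold, 1)
r10 = recall_at_k(ranked, gold, 10)
mrr = mean_reciprocal_rank(ranked, gold)
score = ranking_score(ranked, gold)
```

Applied to the reported Ubuntu results for Team 15 (Recall@10 of 0.979 and MRR of 0.848), the ranking criterion would be (0.979 + 0.848) / 2 = 0.9135.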

We received 20 submissions from 17 teams. Tables 3 and 4 show the performances of the top 3 teams for main task and subtasks respectively. The best performing team (Team 15) of the main task used the BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019) models and fine-tuned the models on the provided in-domain dialogs.

| Ubuntu Team | Recall@1 | Recall@10 | MRR | Advising Team | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|---|---|---|
| 15 | 0.761 | 0.979 | 0.848 | 17 | 0.564 | 0.878 | 0.677 |
| 12 | 0.719 | 0.976 | 0.819 | 15 | 0.306 | 0.762 | 0.455 |
| 5 | 0.663 | 0.974 | 0.786 | 13 | 0.254 | 0.69 | 0.401 |

Table 3: Results of the top 3 performers in Track 2 - main task (subtask 1)

| Team | Recall@1 | Recall@10 | MRR |
|---|---|---|---|
| 15 | 0.706 | 0.957 | 0.799 |
| 13 | 0.596 | 0.904 | 0.707 |
| 3 | 0.505 | 0.834 | 0.621 |

(a) Subtask 2 - Ubuntu

| Team | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| 15 | 0.802 | 0.832 | 0.802 | 0.817 |
| 3 | 0.802 | 0.832 | 0.802 | 0.817 |
| 13 | 0.662 | 0.707 | 0.687 | 0.697 |

(b) Subtask 3 - Advising
Table 4: Results of the top 3 performers in Track 2 - Subtask 2 and 3

4 Audio Visual Scene-Aware Dialog Track

The goal of building an automated system that can converse with humans about video scenes via natural dialogs is a challenging research problem that lies at the intersection of natural language processing, computer vision, and audio processing. As seen at DSTC6 and DSTC7, end-to-end dialog modeling using paired input and output sentences is a way to reduce the cost of data preparation and system development to generate reasonable dialogs in many situations. Such end-to-end approaches have been shown to better handle flexible conversations by enabling model training on large conversational datasets Hori et al. (2019c); Yoshino et al. (2019). In the field of computer vision, interaction with humans about visual information has been explored in visual question answering (VQA) by Antol et al. (2015) and visual dialog (VisDial) by Das et al. (2017). The state of the art in video description uses multimodal fusion to combine different input modalities (feature types), such as spatiotemporal motion features and audio features Hori et al. (2017). Since the recent revolution of neural network models allows us to combine different modules into a single end-to-end differentiable network, this framework allows us to build scene-aware dialog systems by combining dialog and multimodal video description approaches. That is, we can simultaneously use video features and user utterances as input to an encoder-decoder-based system whose outputs are natural-language responses.

To advance research into multimodal reasoning-based dialog generation, we developed the Audio Visual Scene-Aware Dialog (AVSD) dataset and proposed the AVSD challenge in DSTC7. The goal was to design systems to generate responses in a dialog about a video, given the dialog history and audio-visual content of the video. The winning system of the challenge applied hierarchical attention mechanisms to combine text and visual information, yielding a relative improvement of 22% in the human rating of the output of the winning system vs. that of the baseline system. This suggests that there is perhaps significantly more potential in store for advancing this new research area. Toward this end, we propose a second edition of our AVSD challenge in DSTC8.

4.1 Task definition

In this track, the system must generate responses to a user input in the context of a given dialog. The target of both VQA and VisDial was sentence selection based on information retrieval. For real-world applications, however, spoken dialog systems cannot simply select from a small set of pre-determined sentences. Instead, they need to immediately output a response to a user input. For this reason, in this track we focus on sentence generation rather than sentence selection. The system’s task is to use a dialog history (the previous rounds of questions and answers in a dialog between user and system) and (optionally) a brief video script, plus (in one version of the task) the visual and audio information from the input video, to answer the next question about the video. The detailed task description is available on the GitHub page of DSTC8 AVSD.

4.2 Data and Baseline System

We collected (in Alamri et al. (2018)) text-based dialogs about short videos from the Charades dataset Sigurdsson et al. (2016), which consists of untrimmed and multi-action videos along with a brief script for each video. The data collection paradigm for dialogs was introduced in Alamri et al. (2019). In our audio visual scene-aware dialog case, two parties had a discussion about events in a video. One of the two parties played the role of an answerer who had already watched the video and read the video script. The answerer answered questions asked by their counterpart, the questioner. The questioner was not allowed to watch the video but was able to see three frames of the video (the first, middle, and last frames) as static images. The two parties had 10 rounds of Q and A, in which the questioner asked about the events that happened in the video. At the end, the questioner summarized the events in the video as a video description. This downstream task incentivized the questioner to collect useful answers for the video description.

The baseline system and an additional submitted system featuring encoder-decoder models using multimodal fusion are described in Hori et al. (2019a). Detailed results from all models on the DSTC7 challenge, including additional techniques and data set details, were reported in Alamri et al. (2019).

4.3 Evaluation

The automatically generated answers are evaluated by comparing them with the 6 ground truth sentences (one original answer and 5 subsequently collected answers). We used the MS COCO evaluation tool for objective evaluation of system outputs. The supported metrics include word-overlap-based metrics such as BLEU, METEOR, ROUGE_L, and CIDEr. We also collected human ratings of the responses of each system using a 5-point Likert Scale, where humans rated system responses given a dialog context as: 5 (very good), 4 (good), 3 (acceptable), 2 (poor), or 1 (very poor).

4.4 What We Learned from DSTC7

AVSD at DSTC7 was the first attempt to combine end-to-end conversation and end-to-end multimodal video description models into a single end-to-end differentiable network to build scene-aware dialog systems. Most systems employed an LSTM, Bi-LSTM, or GRU encoder/decoder. Some systems used hierarchical and attention frameworks. Furthermore, several additional techniques were introduced to improve the response quality, such as MMI and Episodic Memory Module Alamri et al. (2019). The best system applied hierarchical attention mechanisms to combine text and visual information, yielding an improvement of 22% in human ratings compared to the baseline system. The language models trained from QA (without video or audio) also performed strongly despite the lack of multimodal information.

After the AVSD challenge at DSTC7, Alamri et al. (2019) reported the performance of sentence selection (as opposed to sentence generation, which was used in this AVSD challenge) using the AVSD dataset. In the paper, Question (Q), Video (V), Dialog History (DH), and Audio (A) features were fused. The addition of audio features generally improves model performance (Q+V to Q+V+A being the exception). Interestingly, the model performance improves even more when audio is combined with dialog history and video features (Q+DH+V+A) for some metrics, indicating that audio signals still provide complementary knowledge to the video signals despite their close relationship.

Further, it is found that the best performance is achieved when including text features extracted from the available summary (video script). Using such manual descriptions improves the performance of all systems. However, such summaries are unavailable in the real world, posing challenges during deployment. Recently, Hori et al. (2019b) proposed an approach to transfer the power of a teacher model that was trained using summaries to a student model that does not have access to summaries at test time.

4.5 DSTC8 Results

The AVSD task received 27 system submissions from 12 teams. The best system applied a "Fine tuned seq-to-seq model with GPT-2 embedding". Table 5 shows the evaluation results for the baseline and best systems at DSTC7 and DSTC8 in terms of objective metrics and human ratings.

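| Task | System | BLEU-4 | METEOR | CIDEr | Human rating |
|---|---|---|---|---|---|
| DSTC7 | Baseline | 0.309 | 0.215 | 0.746 | 2.848 |
| DSTC7 | Best | 0.394 | 0.267 | 1.094 | 3.491 |
| DSTC7 | Human | - | - | - | 3.938 |
| DSTC8 | Baseline | 0.289 | 0.21 | 0.651 | 2.885 |
| DSTC8 | Best | 0.442 | 0.287 | 1.231 | 3.934 |
| DSTC8 | Human | - | - | - | 4.000 |

Table 5: Performance comparison between the baseline and the best system.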
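As a quick check on the human ratings in Table 5, the relative improvement of the best system over the baseline can be computed directly. The helper name `relative_improvement` is illustrative; the inputs are the human rating values from the table.

```python
def relative_improvement(system, baseline):
    """Relative gain of a system score over a baseline, in percent."""
    return 100.0 * (system - baseline) / baseline


# Human ratings from Table 5.
ratings = {
    "DSTC7": {"baseline": 2.848, "best": 3.491},
    "DSTC8": {"baseline": 2.885, "best": 3.934},
}
gains = {
    task: relative_improvement(r["best"], r["baseline"])
    for task, r in ratings.items()
}
```

With these values the DSTC7 gain comes out at roughly 22.6%, in line with the 22% relative improvement quoted for the DSTC7 winning system, and the DSTC8 gain at roughly 36.4%.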

4.6 Summary

We followed up the natural language generation task for Audio Visual Scene-Aware Dialog (AVSD) in DSTC8. The task attempts to combine end-to-end conversation and end-to-end multimodal video description models into a single end-to-end differentiable network to build scene-aware dialog systems. The language models trained from QA and video description are still strong approaches, and the quality of the results obtained using text-only models and multimodal fusion models is almost comparable on this task. The ability to predict the objects and events in the video has improved, allowing systems to answer the questions more correctly. Future work includes exploratory research on reasoning features in response to questions.

5 Schema-Guided Dialogue State Tracking Track

Today’s virtual assistants such as the Google Assistant, Alexa, Siri, Cortana, etc. help users accomplish a wide variety of tasks including finding flights, searching for nearby events, surfacing information from the web etc. They provide this functionality by offering a unified natural language interface to a variety of services and APIs from the web. Building such large scale assistants offers many new challenges such as supporting a large variety of domains, data-efficient handling of APIs with similar functionality and reducing maintenance overhead for integration of new APIs among others. Despite tremendous progress in dialogue research, these critical challenges have not been sufficiently explored, owing to an absence of datasets matching the scale and complexity presented by virtual assistants. To this end, we created the Schema-Guided Dialogue (SGD) dataset, a large-scale corpus of over 18K multi-domain task-oriented conversations spanning 17 domains. This track explores the aforementioned challenges on this dataset, focusing on dialogue state tracking (DST).

5.1 Task definition

The dialogue state is a summary of the entire conversation till the current turn. In a task-oriented system, it is used to invoke APIs with appropriate parameters as specified by the user over the dialogue history. The state is also used by the assistant to generate the next actions to continue the dialogue. DST, therefore, is a core component of virtual assistants. Deep learning-based approaches to DST have recently gained popularity. Some of these approaches estimate the dialogue state as a distribution over all possible slot-values Henderson et al. (2014a); Wen et al. (2017a) or individually score all slot-value combinations Mrkšić et al. (2017); Zhong et al. (2018). Such approaches are, however, hard to scale to real-world virtual assistants, where the set of possible values for certain slots may be very large (date, time or restaurant name) and even dynamic (movie or event name). Other approaches utilizing a dynamic vocabulary of slot values Rastogi et al. (2018); Goel et al. (2019) still do not allow zero-shot generalization to new services and APIs Wu et al. (2019), since they use schema elements, i.e. intents and slots, as class labels.

The primary task of this challenge is to develop multi-domain models for DST with particular emphasis on joint modeling across different services or APIs (for data-efficiency) and zero-shot generalization (for handling new/unseen APIs). This takes the shape of a DST task where the dialogue state annotations are guided by the APIs under consideration. Figure 2 illustrates how the dialogue state representations can be conditioned on the corresponding schema for two different flight services (extreme left and right). In order to generate these schema-guided dialogue state representations, the systems are required to take the relevant schemas as additional inputs. The systems can also utilize the natural language descriptions of slots and intents supported by the APIs to yield distributed semantic representations, which can help in joint modeling of related concepts and generalization to new APIs. In addition, the participants are allowed to use any external datasets or resources to bootstrap their models.
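To make the idea of conditioning on a schema concrete, here is a minimal sketch in which the same user utterance is paired with the natural language descriptions of two different flight services. The class names, field names, service names, and descriptions are all hypothetical and chosen for illustration; they are not taken from the SGD data files.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    description: str


@dataclass
class ServiceSchema:
    service_name: str
    intents: dict  # intent name -> natural language description
    slots: list    # list of Slot


def build_model_inputs(utterance, schema):
    """Pair the utterance with every schema element description.

    A description-conditioned model can score each pair, so nothing
    in the model is tied to a fixed label set of intents or slots.
    """
    pairs = [("intent", name, utterance, desc)
             for name, desc in schema.intents.items()]
    pairs += [("slot", s.name, utterance, s.description)
              for s in schema.slots]
    return pairs


def empty_state(schema):
    """Dialogue state keyed by the slot names of the given service."""
    return {slot.name: None for slot in schema.slots}


flights_a = ServiceSchema(
    "Flights_A",
    {"SearchFlight": "Find a flight between two cities"},
    [Slot("origin", "City the flight departs from"),
     Slot("destination", "City the flight arrives in")],
)
flights_b = ServiceSchema(
    "Flights_B",
    {"FindFlights": "Search for flights to a destination"},
    [Slot("depart_city", "Departure city of the trip"),
     Slot("arrive_city", "Arrival city of the trip")],
)
utterance = "Find me a flight from Baltimore to Seattle"
inputs_a = build_model_inputs(utterance, flights_a)
inputs_b = build_model_inputs(utterance, flights_b)
```

The two services expose different slot names for the same concepts, so the state for the same dialogue has different keys depending on which schema is supplied, while the descriptions give a model a way to relate them.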

Figure 2: Illustration of Track 4: the dialogue state (dashed edges) for the same dialogue is conditioned on the domain/service schema under consideration (extreme left/right), provided as input.

5.2 Data and Baseline

| Domain | #Intents | #Dialogs |
|---|---|---|
| Alarm | 2 (1) | 37 |
| Bank | 4 (2) | 1021 |
| Bus | 4 (2) | 2609 |
| Calendar | 3 (1) | 1602 |
| Event | 5 (2) | 3927 |
| Flight | 8 (3) | 3138 |
| Home | 2 (1) | 1027 |
| Hotel | 8 (4) | 3930 |
| Media | 4 (2) | 1292 |
| Movie | 4 (2) | 1758 |
| Music | 4 (2) | 1486 |
| RentalCar | 4 (2) | 1966 |
| Restaurant | 4 (2) | 2755 |
| RideShare | 2 (2) | 1973 |
| Service | 8 (4) | 2090 |
| Travel | 1 (1) | 2154 |
| Weather | 1 (1) | 1308 |

Table 6: The number of intents (services in parentheses) and dialogues per domain in the train and dev sets for Track 4. Multi-domain dialogues contribute to counts of each domain.

The SGD dataset consists of over 18K annotated multi-domain task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services/APIs spanning 17 domains (see Table 6). For most of these domains, SGD contains multiple APIs having overlapping functionalities but different interfaces - common in the real world; it is the first dataset set up this way. The schemas for all services/APIs pertinent to a dialogue, as well as natural language descriptions and other semantic features for a service and its intents and slots, are also included in the dataset. Rastogi et al. (2019) contains more details about the dataset and the data collection methodology.

With annotations for slot spans, intent, dialogue state and system actions, our dataset is designed to serve as an effective testbed for intent prediction, slot filling, state tracking and language generation, among other tasks in large-scale virtual assistants. Furthermore, the evaluation set is tailored to contain many new services not present in the training set. This helps to quantify the robustness to changes in an API’s interface or the addition of new APIs.

We also provide a baseline system Rastogi et al. (2019), which feeds user and system utterances together with schema element descriptions into a model based on BERT Devlin et al. (2019). The baseline extends BERT-DST Chao and Lane (2019) by removing all domain-specific parameters, which makes zero-shot generalization to new APIs possible.

5.3 Evaluation

Joint goal accuracy, popular for DST evaluation, is our primary metric for comparison of different approaches, with a modification that uses a fuzzy matching score for non-categorical slots (i.e. slots with large or unbounded sets of possible values) to reward partial matches. For better understanding of the underlying models, we define other auxiliary metrics such as:

  • Active Intent Accuracy: Fraction of user turns for which the active intent is predicted correctly.

  • Requested Slot F1: Macro-averaged F1 score for slots requested by the user over all valid turns.

  • Average Goal Accuracy: Average accuracy of predicting the slot assignments for a turn correctly. Like the joint goal accuracy, this also uses a fuzzy matching score for non-categorical slots.
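The metrics above can be sketched as follows. This is an illustrative reading rather than the official evaluation script: the fuzzy score uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever string similarity the organizers used, joint goal accuracy is taken as the product of per-slot scores, average goal accuracy is averaged over the slots present in the reference state, and a "valid" turn for requested slot F1 is assumed to be one where the reference or the prediction is non-empty. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def slot_score(ref, hyp, categorical):
    """Exact match for categorical slots, fuzzy similarity in [0, 1] otherwise."""
    if categorical:
        return 1.0 if ref == hyp else 0.0
    return SequenceMatcher(None, ref.lower(), hyp.lower()).ratio()


def goal_accuracies(ref_state, hyp_state, categorical_slots):
    """Return (joint, average) goal accuracy for one turn of one service.

    States map slot names to string values; a missing slot is treated as "".
    """
    slots = sorted(set(ref_state) | set(hyp_state))
    scores = {
        s: slot_score(ref_state.get(s, ""), hyp_state.get(s, ""),
                      s in categorical_slots)
        for s in slots
    }
    # Joint: every slot must be right; fuzzy scores give partial credit.
    joint = 1.0
    for value in scores.values():
        joint *= value
    # Average: mean score over slots that have a reference value (assumption).
    ref_scores = [scores[s] for s in ref_state]
    average = sum(ref_scores) / len(ref_scores) if ref_scores else None
    return joint, average


def active_intent_accuracy(ref_intents, hyp_intents):
    """Fraction of user turns whose active intent is predicted correctly."""
    correct = sum(r == h for r, h in zip(ref_intents, hyp_intents))
    return correct / len(ref_intents)


def requested_slot_f1(ref_turns, hyp_turns):
    """Macro-average of per-turn F1 over turns with any requested slot."""
    f1_scores = []
    for ref, hyp in zip(ref_turns, hyp_turns):
        ref, hyp = set(ref), set(hyp)
        if not ref and not hyp:
            continue  # nothing requested or predicted: skipped as not valid
        true_pos = len(ref & hyp)
        if true_pos == 0:
            f1_scores.append(0.0)
            continue
        precision = true_pos / len(hyp)
        recall = true_pos / len(ref)
        f1_scores.append(2 * precision * recall / (precision + recall))
    return sum(f1_scores) / len(f1_scores) if f1_scores else None
```

For instance, with a reference state `{"city": "San Jose", "party_size": "2"}` and a prediction that gets the city right but the categorical `party_size` wrong, the joint score is 0 while the average score is 0.5. A prediction of `"Jose"` for the non-categorical `city` slot would earn a fuzzy score strictly between 0 and 1 instead of being counted as a complete miss.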

5.4 Results

We received submissions from 25 teams. Table 7 lists the results for the top 3 teams (determined by joint goal accuracy) and the baseline system. The evaluation set includes three new domains (“Messaging”, “Payment” and “Trains”), in addition to a few unseen APIs for some of the domains present in the training and dev sets. We observe that the submitted models generalize well to new APIs and domains. Most of them rely on pre-trained models such as BERT Devlin et al. (2019) and XLNet Yang et al. (2019) to generalize to unseen domains and APIs.

We also observe a higher joint goal accuracy metric than reported on other public datasets. This is because our dataset excludes the slots for APIs not under consideration in the current turn from the dialogue state for multi-domain dialogues, as opposed to other datasets which include slots for all domains and APIs present over the dialogue history. Thus, in our setup, an incorrect dialogue state prediction for a service only penalizes the joint goal accuracy metric for the turns in which that service is under consideration by the user or the system. Further, our fuzzy matching score rewards partial matches for non-categorical slots, leading to still higher joint and average goal accuracy values.
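The effect of scoring only the service under consideration can be illustrated with a toy calculation. The dialogue below is hypothetical: it has four turns, the first two about a flight service and the last two about a hotel service, and the tracker makes an error on the flight state at turn 2 that it never corrects. The sketch assumes exactly one active service per turn and binary correctness, both simplifications.

```python
# Service under consideration at each turn (hypothetical dialogue).
active_service = ["Flights", "Flights", "Hotels", "Hotels"]

# Whether the predicted state for each service seen so far is correct.
# The Flights error made at turn 2 persists in the accumulated state.
state_correct = [
    {"Flights": True},
    {"Flights": False},
    {"Flights": False, "Hotels": True},
    {"Flights": False, "Hotels": True},
]


def jga_active_only(active, correct):
    """Score each turn only on the service under consideration."""
    hits = sum(turn[service] for service, turn in zip(active, correct))
    return hits / len(active)


def jga_cumulative(correct):
    """Score each turn on the states of all services seen so far."""
    hits = sum(all(turn.values()) for turn in correct)
    return hits / len(correct)


active_only = jga_active_only(active_service, state_correct)
cumulative = jga_cumulative(state_correct)
```

Under the active-service convention only turn 2 is penalized, giving 0.75, whereas the cumulative convention also penalizes turns 3 and 4 for the stale flight error, giving 0.25.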

Team | Joint Goal Accuracy | Avg Goal Accuracy | Active Intent Accuracy | Requested Slots F1
Baseline | 0.254 | 0.560 | 0.906 | 0.965
Team 9 | 0.865 | 0.971 | 0.948 | 0.985
Team 14 | 0.773 | 0.922 | 0.969 | 0.995
Team 12 | 0.738 | 0.920 | 0.926 | 0.995
Table 7: Evaluation Results for Schema-Guided State Tracking track

6 Conclusions

This paper summarizes the tracks of the Eighth Dialog System Technology Challenge (DSTC8). The multi-domain task-completion track offered two sub-tasks: an end-to-end multi-domain dialog task and a fast adaptation task. The NOESIS II track extended the response selection task of DSTC7 with new datasets containing multi-party dialogs and two additional subtasks. The audio visual scene-aware dialog track explored further improvements over its first edition in DSTC7 with a new test dataset. The schema-guided dialog state tracking track introduced a new dialog state tracking task from a practical perspective. All the datasets and resources introduced for every track will remain publicly available after the challenge period to support future dialog system research.


  • [1] H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh (2019-06) Audio visual scene-aware dialog. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, §4.4.
  • [2] H. Alamri, C. Hori, T. K. Marks, D. Batra, and D. Parikh (2018) Audio visual scene-aware dialog (avsd) track for natural language generation in dstc7. In DSTC7 at AAAI2019 Workshop, Vol. 2. Cited by: §1, §4.2.
  • [3] H. Alamri, C. Hori, T. K. Marks, D. Batra, and D. Parikh (2019) Track 3 overview: audio visual scene-aware dialog (AVSD) track for natural language generation in dstc7. In AAAI 2019 Workshop: DSTC7, Cited by: §4.2, §4.4.
  • [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), Cited by: §4.
  • [5] P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278. Cited by: §2.1.1, §2.2.2.
  • [6] C. Callison-Burch, P. Koehn, C. Monz, and O. F. Zaidan (2011) Findings of the 2011 workshop on statistical machine translation. In Proc. of the Workshop on Statistical Machine Translation, Cited by: §2.2.2.
  • [7] G. Chao and I. Lane (2019) BERT-dst: scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer. arXiv preprint arXiv:1907.03040. Cited by: §5.2.
  • [8] J. Cohen (1960) A coefficient of agreement for nominal scales. Educational and psychological measurement 20 (1), pp. 37–46. Cited by: §2.2.2.
  • [9] A. H. Copeland (1951) A ‘reasonable’ social welfare function. In Seminar on Mathematics in Social Sciences, Cited by: §2.2.2.
  • [10] A. Das, S. Kottur, J. M.F. Moura, S. Lee, and D. Batra (2017) Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV), Cited by: §4.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §2.1.2, §3.3, §5.2, §5.4.
  • [12] M. Galley, C. Brockett, X. Gao, J. Gao, and B. Dolan (2019) Grounded response generation task at dstc7. In AAAI Dialog System Technology Challenges Workshop, Cited by: §1, §2.2.1, §2.2.2.
  • [13] J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval 13 (2-3), pp. 127–298. External Links: Link, Document, ISSN 1554-0669 Cited by: §2.1.
  • [14] R. Goel, S. Paul, and D. Hakkani-Tür (2019) Hyst: a hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883. Cited by: §5.1.
  • [15] C. Gunasekara, J. K. Kummerfeld, L. Polymenakos, and W. Lasecki (2019) Dstc7 task 1: noetic end-to-end response selection. In Proceedings of the First Workshop on NLP for Conversational AI, pp. 60–67. Cited by: §1.
  • [16] C. Gunasekara, J. K. Kummerfeld, L. Polymenakos, and W. S. Lasecki (2019-01) DSTC7 task 1: noetic end-to-end response selection. In 7th Edition of the Dialog System Technology Challenges at AAAI 2019, External Links: Link Cited by: §1.
  • [17] P. Hall, H. Miller, et al. (2009) Using the bootstrap to quantify the authority of an empirical ranking. The Annals of Statistics 37 (6B), pp. 3929–3959. Cited by: §2.2.2.
  • [18] M. Henderson, B. Thomson, and S. Young (2014) Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 292–299. Cited by: §5.1.
  • [19] M. Henderson, B. Thomson, and J. D. Williams (2014) The third dialog state tracking challenge. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pp. 324–329. Cited by: §1.
  • [20] M. Henderson, B. Thomson, and J. Williams (2014) The second dialog state tracking challenge. In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 263. Cited by: §1.
  • [21] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8). Cited by: §2.2.2.
  • [22] C. Hori, H. Alamri, J. Wang, G. Wichern, T. Hori, A. Cherian, T. K. Marks, V. Cartillier, R. G. Lopes, A. Das, et al. (2019) End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2352–2356. Cited by: §4.2.
  • [23] C. Hori, T. Hori, A. Cherian, and T. K. Marks (2019) Joint student-teacher learning for audio-visual scene-aware dialog. In Interspeech 2019, Cited by: §4.4.
  • [24] C. Hori, T. Hori, T. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi (2017) Attention-based multimodal fusion for video description. In ICCV, Cited by: §4.
  • [25] C. Hori, J. Perez, R. Higashinaka, T. Hori, Y. Boureau, M. Inaba, Y. Tsunomori, T. Takahashi, K. Yoshino, and S. Kim (2019) Overview of the sixth dialog system technology challenge: dstc6. Computer Speech & Language 55, pp. 1–25. Cited by: §1, §4.
  • [26] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016) FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §2.2.2.
  • [27] S. Kim, L. F. D’Haro, R. E. Banchs, J. D. Williams, M. Henderson, and K. Yoshino (2016) The fifth dialog state tracking challenge. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 511–517. Cited by: §1.
  • [28] S. Kim, L. F. D’Haro, R. E. Banchs, J. D. Williams, and M. Henderson (2017) The fourth dialog state tracking challenge. In Dialogues with Social Robots, pp. 435–449. Cited by: §1.
  • [29] T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.2.2.
  • [30] J. K. Kummerfeld, S. R. Gouravajhala, J. J. Peper, V. Athreya, C. Gunasekara, J. Ganhotra, S. S. Patel, L. Polymenakos, and W. S. Lasecki (2019-07) A large-scale corpus for conversation disentanglement. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3846–3856. External Links: Link Cited by: §3.2.
  • [31] S. Lee, Q. Zhu, R. Takanobu, Z. Zhang, Y. Zhang, X. Li, J. Li, B. Peng, X. Li, M. Huang, and J. Gao (2019-07) ConvLab: multi-domain end-to-end dialog system platform. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, pp. 64–69. External Links: Link, Document Cited by: §2.1, §2.1.
  • [32] W. Lei, X. Jin, M. Kan, Z. Ren, X. He, and D. Yin (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1437–1447. Cited by: §2.1.
  • [33] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky (2016) Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §2.2.
  • [34] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §3.3.
  • [35] L. Maystre and M. Grossglauser (2017) Just sort it! a simple and effective approach to active preference learning. In International Conference on Machine Learning (ICML), Cited by: §2.2.2.
  • [36] N. Mrkšić, D. Ó. Séaghdha, T. Wen, B. Thomson, and S. Young (2017) Neural belief tracker: data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1777–1788. Cited by: §5.1.
  • [37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §2.1.2.
  • [38] A. Rastogi, R. Gupta, and D. Hakkani-Tur (2018) Multi-task learning for joint language understanding and dialogue state tracking. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 376–384. Cited by: §5.1.
  • [39] A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2019) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855. Cited by: §5.2, §5.2.
  • [40] J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye, and S. Young (2007) Agenda-based user simulation for bootstrapping a pomdp dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pp. 149–152. Cited by: §2.1.2.
  • [41] G. A. Sigurdsson, G. Varol, X. Wang, I. Laptev, A. Farhadi, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. ArXiv. External Links: 1604.01753, Link Cited by: §4.2.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Cited by: §2.2.2.
  • [43] O. Vinyals and Q. V. Le (2015) A neural conversational model. arXiv:1506.05869. Cited by: §2.2.
  • [44] T. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017-Proceedings of Conference, Vol. 1, pp. 438–449. Cited by: §2.1, §5.1.
  • [45] T. Wen, Y. Miao, P. Blunsom, and S. Young (2017) Latent Intention Dialogue Models. In Proceedings of the International Conference on Machine Learning, Cited by: §2.2.
  • [46] J. Williams, A. Raux, D. Ramachandran, and A. Black (2013) The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pp. 404–413. Cited by: §1.
  • [47] C. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung (2019-07) Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 808–819. External Links: Link Cited by: §5.1.
  • [48] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §5.4.
  • [49] K. Yoshino, C. Hori, J. Perez, L. F. D’Haro, L. Polymenakos, C. Gunasekara, W. S. Lasecki, J. K. Kummerfeld, M. Galley, C. Brockett, et al. (2019) Dialog system technology challenge 7. arXiv preprint arXiv:1901.03461. Cited by: §1, §3.2, §3, §4.
  • [50] V. Zhong, C. Xiong, and R. Socher (2018-07) Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1458–1467. External Links: Link, Document Cited by: §5.1.