RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems

For task-oriented dialog systems to be maximally useful, they must be able to process conversations in a way that is (1) generalizable with a small number of training examples for new task domains, and (2) robust to user input in various styles, modalities or domains. In pursuit of these goals, we introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. By including tasks with limited training data, RADDLE is designed to favor and encourage models with a strong generalization ability. RADDLE also includes a diagnostic checklist that facilitates detailed robustness analysis in aspects such as language variations, speech errors, unseen entities, and out-of-domain utterances. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain. Overall, existing models are less than satisfactory in robustness evaluation, which suggests opportunities for future improvement.


1 Introduction

Dialogs constitute a crucial communication channel in completing a broad range of tasks, such as weather query, flight and restaurant booking, movie booking, and IT helpdesk. Compared to chit-chat systems, which are usually modeled with single-turn context-response pairs, task-oriented dialog systems involve retrieving information from knowledge bases and reasoning over multiple dialog turns. This makes it especially important for a system to be able to produce responses that are grounded in task goals and user intents. To support human-computer interaction, task-oriented dialog systems have been built to allow users to converse with a computer system using natural language, such as Siri (https://www.apple.com/siri/), Google Assistant (https://assistant.google.com/), Amazon Alexa (https://developer.amazon.com/en-US/alexa), and Microsoft XiaoIce Zhou et al. (2020). Traditionally, a task-oriented dialog system uses a modularized pipeline with four modules that execute sequentially Gao et al. (2019). A natural language understanding (NLU) module identifies user intents and extracts associated information such as slots and corresponding values from user input. A dialog state tracker (DST) infers the belief state (or user goal) from dialog history. The belief state is often used to query a task-specific database (DB) to obtain the DB state, such as the number of entities that match the user goal. The dialog state and DB state are then passed to a dialog policy (POL) module to select the next system action. A natural language generation (NLG) module converts the action to a natural language response.

The human ability to converse is general, flexible, and robust. In contrast, most popular tools for dialog system development adopting the above modular systems are designed for specific tasks and struggle with out-of-scope data. If we aspire to develop models beyond extensively hand-crafted rules and annotated data for each single domain/task, it is critical to develop a more unified, efficient and robust model that can more quickly learn to execute a range of different tasks in different domains.

To fuel research in this direction, we present the Raddle benchmark. It includes a collection of task-oriented dialog tasks (e.g., end-to-end modeling, dialog state tracking) in diverse domains. The benchmark also has a companion online platform for model evaluation, comparison, and robustness analysis. Importantly, Raddle exhibits two unique advantages that pave the way for building more pragmatic dialog systems.

First, the limited data setting is the major focus of Raddle, in order to evaluate the generalization ability of models. It aims at simulating real-world application scenarios where only a very limited amount of labelled data is available for new domains. Given this focus, Raddle is a favorable benchmark for evaluating recent models in the pre-training and fine-tuning paradigm, which learn to represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer.

Second, robustness analysis is introduced to study model performance in various challenging scenarios, where models are evaluated with anomalous user input such as language variations, speech errors, unseen entities and out-of-domain utterances. Failing to handle these inputs often produces inappropriate responses, leading to a frustrating user experience. These scenarios are common for deployed systems in the real world, but are largely ignored in existing dialog benchmarks. To the best of our knowledge, Raddle presents the first work to fill this gap.

To better understand the challenges posed by Raddle, we conduct experiments with simple baselines and state-of-the-art task-oriented dialog models. We find that grounded pre-trained models with a unified multi-task learning objective outperform models separately trained on each domain. Moreover, even the best performing model (Soloist Peng et al. (2020a)) in our evaluation achieves a fairly low score in robustness analysis. This suggests that our baseline models can handle common inputs with strong regularities, but struggle with anomalous inputs that require deeper reasoning.

In summary, our key contributions are: (1) a novel dialog benchmark with an emphasis on limited data and multiple domains/tasks, which formally creates a scenario to evaluate the grounding and generalization ability of pre-trained models; (2) a crowd-sourced diagnostic evaluation dataset that covers a broad range of real-world sophistication to study model robustness; (3) an online evaluation platform and leaderboard to track research progress, with human evaluation services to be granted to top-ranked submissions on a bi-monthly basis; and (4) baseline results for major existing approaches to task-oriented dialogs.

2 Related Work

2.1 Dialog Benchmarks

To drive the progress of building dialogue systems using data-driven approaches, a number of conversational corpora have been released. They are roughly grouped into two categories: (1) Corpora with structured semantic labels Wen et al. (2016); Shah et al. (2018). These datasets are often specifically annotated, and used to study an individual module in the dialog pipeline. For example, DialoGLUE Mehri et al. (2020) is a recently proposed benchmark with a focus on NLU and DST tasks. (2) Corpora with an implicit user goal Lowe et al. (2015). These datasets are often without semantic labels but can be used in end-to-end (E2E) dialog modeling Li et al. (2016); Zhu (2020); Wu et al. (2019); Zhu et al. (2019a). For example, ConvLab Lee et al. (2019); Zhu et al. (2020) is a recent platform for multi-domain E2E evaluation.

MultiWOZ Budzianowski et al. (2018) is the work most closely related to Raddle. It is a large-scale multi-turn conversational corpus across several domains. It can be used to develop individual dialog modules as separate tasks for existing modular-based methods, or serve as a benchmark for E2E dialogue modeling methods. Raddle inherits the advantages of MultiWOZ in its flexibility for separate/joint task modeling and its comprehensiveness in multi-domain data coverage, but differs significantly in two aspects: an emphasis on limited data settings and a unique robustness checklist. Both are essential qualities in building task bots at scale.

Further, Raddle provides an online platform for model evaluation and fair comparison based on privately-held test data, inspired by GLUE Wang et al. (2018). To the best of our knowledge, Raddle is the first online platform for DST and E2E tasks in the dialog community. This can reduce the inconsistency caused by different researchers/teams using varying processing/evaluation scripts, which obscures where the gain comes from.

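| Setting | Domain | #Train | #Test | Task | Metrics |
|---|---|---|---|---|---|
| Standard | Attraction | 50 | 100 | DST / E2E | JGA / Combined score |
| Standard | Train | 50 | 200 | DST / E2E | JGA / Combined score |
| Standard | Hotel | 50 | 200 | DST / E2E | JGA / Combined score |
| Standard | Restaurant | 50 | 200 | DST / E2E | JGA / Combined score |
| Language variations / Speech errors | Attraction | - | 100 | DST / E2E | JGA / Combined score |
| Language variations / Speech errors | Train | - | 200 | DST / E2E | JGA / Combined score |
| Language variations / Speech errors | Hotel | - | 200 | DST / E2E | JGA / Combined score |
| Language variations / Speech errors | Restaurant | - | 200 | DST / E2E | JGA / Combined score |
| Unseen | Reminder | 50 | 400 | DST / IC | JGA / Acc. |
| OOD | Attraction | 50 | 800 | DST / OOD | JGA / F1 |

Table 1: Dataset descriptions and statistics. DST is short for dialog state tracking, E2E denotes end-to-end modeling, and IC stands for intent classification. Joint goal accuracy (JGA) is used for DST and the Combined score is used for E2E.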

2.2 Evaluation of Pre-trained Models

Pre-trained language models (PLMs) have substantially advanced the state of the art across a variety of language understanding and generation tasks Peters et al. (2018); Devlin et al. (2019); Yang et al. (2019); Liu et al. (2019); Radford et al. (2019); Keskar et al. (2019); Dong et al. (2019); Peng et al. (2020b, c); Li et al. (2020a). PLMs are often trained to predict words based on their context on massive text data, and the learned models can be fine-tuned to quickly adapt to various downstream tasks, exhibiting strong generalization capacity even with just a few in-domain training examples. Building task bots at scale requires the model to deal with the limited data problem for each domain, which can be used as a testbed to evaluate the generalization ability of PLMs. To this end, we limit the number of task-specific training examples in Raddle to evaluate the sample-efficiency of models.

Meanwhile, task-oriented dialogues pose a unique set of challenges for PLMs Gao et al. (2020): a dialog is intrinsically goal-driven, multi-turn and often informal/noisy. Indeed, dialog-specific PLMs have been proposed Wu et al. (2020a); Peng et al. (2020a). However, the robustness of PLMs to linguistic perturbations often occurring in dialog settings (see Section 4 for details) is largely unexplored. Note that our notion of robustness emphasizes natural language variations, which is different from adversarial examples/training that aim to fool a trained model Nie et al. (2019). From this perspective, Raddle provides a unique benchmark for assessing PLMs with a robustness orientation.

3 Tasks

Raddle is centered on five English dialog scenarios in daily life, which cover a broad range of data collection schemes, task types and complexities. As the first goal of Raddle is to spur development of generalizable dialog systems, we design the benchmark such that good performance requires a model to leverage substantial knowledge (e.g., pre-trained parameters) learned from its previous life cycle, while still maintaining some task-specific components Coope et al. (2020); Henderson et al. (2020); Peng et al. (2020a); Wu et al. (2020b). Specifically, we deliberately keep a small number of training examples for each scenario. This is consistent with the common practice that only limited labelled data is provided when deploying a dialog system to new domains. Table 1 shows the data statistics. The four domains in the standard setting are sampled from MultiWOZ2.0 Budzianowski et al. (2018). Reminder is intentionally used only for unseen entity tracking, because it is a human-machine corpus with a relatively small action space, so the impact of policy learning on models is largely alleviated. Therefore, the performance of models on this corpus will mostly reflect their capability of unseen entity tracking. Note that the number of training examples is limited to 50, an accepted scale that users can provide. Though it is possible to train a single model for each task from scratch without outside sources of knowledge, we expect that our focus on data-scarce settings will render this approach uncompetitive.

Furthermore, a typical task-oriented dialog system uses a modularized pipeline that has four modules and executes sequentially. Recent research has shown promising results on parameterizing the modularized pipeline using a single neural auto-regressive model, and training it in an end-to-end manner Peng et al. (2020a); Ham et al. (2020); Hosseini-Asl et al. (2020). In fact, a single auto-regressive model can significantly ease the workflow of training and deploying dialog systems for new tasks, compared to existing modularized tools and methods. Therefore, we design the benchmark to allow evaluations on end-to-end dialog modeling, in addition to the modularized evaluation on dialog state tracking. To reveal the gap between the complexity of dialogues in lab environments and that in real scenarios, we construct a suite of tasks to study the robustness of models. We describe these tasks below and in Table 1.

On the evaluation front, we concentrate on simulation-based methodologies, in order to facilitate automation. Although we only offer human-based evaluations Gao et al. (2019) to top-ranked submissions at this point, we emphasize realistic scenarios in pursuit of system robustness (see Section 4).

Task 1: Dialog State Tracking

Robust NLU and DST are the first step towards building a reliable dialog system. The dialogue state is a summary of the entire conversation up to the current turn. In a task-oriented system, it is represented in the form of slot-value pairs, where slot indicates the category/attribute of the user goal expressed in the utterance, and value is the corresponding information. For the evaluation metric, we report joint goal accuracy, which indicates the proportion of dialogue turns where all the user’s search goal constraints are correctly identified Mrksic et al. (2017). To specifically study the NLU performance, we consider intent classification, which aims to automatically extract meaning from a natural language utterance in order to understand the user’s goal Hemphill et al. (1990); Zhu et al. (2019b).
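As an illustration of the joint goal accuracy definition above, the following is a minimal sketch rather than the benchmark's official scorer. It assumes each turn's belief state is a flat dictionary from slot name to value and that a turn counts as correct only on an exact match; the function name `joint_goal_accuracy` and the example slots are hypothetical.

```python
from typing import Dict, List

# Assumption: a belief state is a flat mapping from slot name to value.
BeliefState = Dict[str, str]


def joint_goal_accuracy(predicted: List[BeliefState], gold: List[BeliefState]) -> float:
    """Fraction of turns whose predicted belief state matches the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A turn is correct only if every slot-value pair matches and no extra slot is predicted.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


gold_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
    {"restaurant-area": "centre", "restaurant-food": "italian", "restaurant-day": "friday"},
]
predicted_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
    # One wrong value makes the whole turn incorrect.
    {"restaurant-area": "centre", "restaurant-food": "italian", "restaurant-day": "sunday"},
]
jga = joint_goal_accuracy(predicted_states, gold_states)  # 2 of 3 turns are fully correct
```

Because the match is over the full state, a single mistracked slot costs the entire turn, which is why this metric is sensitive to perturbations of individual slot values.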

Task 2: End-to-end Modeling

The end-to-end (E2E) dialog models take the dialog history as input and produce a natural language response. They jointly implement the dialogue management (including DST and POL) and response generation (i.e., NLG) components. Following Budzianowski et al. (2018), Inform, Success, and BLEU scores are reported. The first two metrics evaluate dialog task completion: Inform measures whether the system provides a correct entity (inform rate), while Success measures whether all the requested information is answered exactly (success rate) and whether the answered information matches the user’s goal. BLEU evaluates how fluent the generated responses are compared to human-written responses. A combined score (Combined) is also reported using Combined = (Inform + Success) × 0.5 + BLEU as an overall quality measure, as suggested in Budzianowski et al. (2018).
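The overall E2E quality measure can be made concrete with a short sketch. It assumes the definition commonly used with MultiWOZ, Combined = (Inform + Success) × 0.5 + BLEU, with all three inputs on a 0-100 scale; the function name `combined_score` is hypothetical, and the example values are the Soloist Env-0 Attraction numbers from Table 3.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined = (Inform + Success) * 0.5 + BLEU, with all inputs on a 0-100 scale."""
    return 0.5 * (inform + success) + bleu


# Soloist on the standard Attraction test set (Table 3, Env-0).
attraction_env0 = combined_score(inform=79.00, success=61.00, bleu=13.33)
```

Since BLEU values here are far smaller than the task-completion rates, changes in Inform and Success dominate the combined score.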

4 Robustness Diagnostic Checklist

Existing benchmarks assume a world of a “perfect” user who always provides precise, concise, and semantically unambiguous utterances. These goal-oriented dialog datasets are largely collected by crowd-sourcing, where a crowd-sourced worker enacts the part of a real user by following a set of template instructions provided for the task. This method results in a dataset where most user utterances are straight-forward, stick to the goal and tend to leave out the variation/errors commonly found in real-world conversational data. To this end, we collect a suite of language variations to reveal the dialog sophistication in the real world, and measure the robustness of dialog models.

Figure 1: Illustration of different language perturbations in the robustness diagnostic checklist. Panels: (a) standard dialog session, (b) paraphrase, (c) verbosity, (d) simplification, (e) typos, (f) speech errors, (g) unseen entities, (h) out-of-domain utterance. The standard dialog example is shown in (a). Based on it, (b)-(e) are four types of language variations, (f) shows a speech error, (g) shows unseen entities, and (h) shows an out-of-domain utterance. In each case, some representative examples are highlighted in red text.

Language Variations 

It is well known that humans communicate using language with fairly large variations, such as different ways of expression or personalized styles Sacks et al. (1978), while template-based crowd-sourcing fails to cover these linguistic variations Schegloff et al. (1977); Moore and Arar (2019). Specifically, we consider four types of variations in Raddle. Paraphrase widely exists among different users, who may restate the meaning of a text or message using other words. Verbosity means that users express their intents using more words than needed. Simplification means that users express their intents using fewer words to be concise. Typos often result from illegitimate abbreviations. In Figure 1(b)-(e), we provide examples to illustrate these language variations.

Speech Errors 

It is desirable that dialog systems can leverage automatic speech recognition (ASR) techniques to serve the speech modality, as in Amazon Alexa. However, almost all dialog systems have typically assumed that the user input is written text, and hoped that the system would seamlessly integrate with speech inputs. Recently, it has been empirically shown in Gopalakrishnan et al. (2020) that dialog systems trained on written data are very sensitive to various types of synthetic and actual ASR hypotheses in the dialog history. To bring attention to this gap, Raddle promotes speech robustness as an evaluation criterion. For example, in Figure 1(f), “what’s available” can be transcribed as “once available” due to ASR deficiency, and a robust dialog system is expected to still correctly perceive user intents.

Unseen Entities 

Most existing DST methods are not designed to handle slot values that are not known to the tracker. The assumption that a pre-defined ontology exists for the dialog and one can enumerate all possible values for each slot is often not valid in real-world scenarios. Even if such lists or dictionaries exist, they can be very large in size and highly dynamic Xu and Hu (2018). Therefore, unseen entities are common in dialogs, i.e., entities that are not observed during training, but appear in the testing stage. In Figure 1(g), the entity Bellevue downtown is in the knowledge base but never appears in model training. A robust DST should be able to recognize it as a city/place by generalizing from other similar entities learned during training.

Out-of-Domain Utterances 

Most deployed task-oriented dialog systems are built for a closed set of target domains. Thus, they are fragile when dealing with out-of-domain (OOD) utterances Lee and Shalyminov (2019). Failure to detect OOD utterances often prevents the model from responding with an appropriate fallback action, hence leading to a frustrating user experience. Therefore, it is important to endow task bots with the ability to detect OOD utterances for special handling Larson et al. (2019). For example, in Figure 1(h), the user suggests an excursion to a task bot trained in college consulting, which is out of the bot’s scope. The bot is expected to raise a flag to label the utterance as an outlier, and guide the user to focus on the current domain.

Collection Protocols

The standard setting is sampled from MultiWOZ2.0 Budzianowski et al. (2018) but re-purposed in a few-shot learning setting. The language variations corpus is then created by workers on Amazon Mechanical Turk based on the standard corpus. To maximize the quality, we require workers to be in the US locale and to have a minimum previous approval rate of 90%. Assignments are constructed at the turn level. Given a user utterance and associated dialog history, workers are required to answer four questions: what are the paraphrased, typo, verbose, and simplified versions of the user utterance? Moreover, in each assignment, the workers are instructed to mention the slot values exactly in their answers if the given user utterance has them. For the speech recognition errors setting, we employ the audio-level error simulation Gopalakrishnan et al. (2020), which generates audio signals from texts, adds noise into the audio, and then decodes the audio with an ASR model to obtain hypotheses. In particular, we employ Microsoft Cognition text-to-speech service to synthesize audio signals. After injecting background noise into the audio signals, we use the speech recognition service to obtain a corpus with a word error rate (WER) of 30%. For the reminder domain, which is used for unseen entity evaluation, we first simulate several dialogs as seed scenarios using an agenda-based simulator and then randomly replace the slots in the dialogs with new values. Similar to constructing the language variations corpus, we then hire workers to rewrite the corpus to be as diverse and realistic as possible. Finally, the out-of-domain corpus is developed following Lee and Shalyminov (2019). We randomly choose 50% of the utterances in DSTC Henderson et al. (2014) for the Attraction domain as the training set. For the test set, besides utterances from DSTC, we also introduce utterances from a diverse set of domains such as Stanford Eric and Manning (2017), Reddit, and Twitter Sordoni et al. (2015) to evaluate the capability of handling different out-of-domain utterances.
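To make the word error rate used to characterize the speech-error corpus concrete, here is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis divided by the number of reference words. It assumes whitespace tokenization and is not the tooling used to build the corpus; the function name `word_error_rate` is hypothetical. The example pair is the one from Figure 1(f).

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# "what's" is misrecognized as "once": one substitution over two reference words.
example_wer = word_error_rate("what's available", "once available")  # 0.5
```

A corpus-level WER is usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.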


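| Setting | Metric | DAMD | GPT-2 | Soloist |
|---|---|---|---|---|
| Avg. | | - | 47.46 | 59.09 |
| Standard | JGA | 14.18 | 40.52 | 53.17 |
| Standard | Combined | 48.99 | 67.36 | 76.13 |
| Paraphrase | JGA | 6.75 | 31.36 | 40.27 |
| Paraphrase | Combined | 44.13 | 62.72 | 64.89 |
| Simplification | JGA | 5.78 | 28.82 | 37.18 |
| Simplification | Combined | 42.93 | 59.44 | 63.61 |
| Typos | JGA | 5.33 | 22.31 | 22.73 |
| Typos | Combined | 42.58 | 54.15 | 57.77 |
| Verbosity | JGA | 7.08 | 30.40 | 38.21 |
| Verbosity | Combined | 42.56 | 54.16 | 65.71 |
| Speech errors | JGA | 9.1 | 31.41 | 36.81 |
| Speech errors | Combined | 45.94 | 65.95 | 70.48 |
| Unseen | JGA | - | 28.28 | 69.05 |
| Unseen | Acc. | - | 51.29 | 96.98 |
| OOD | JGA | - | 47.37 | 56.28 |
| OOD | F1 | - | 83.86 | 96.18 |

Table 2: Overall results of baselines across all Raddle tasks. JGA is joint goal accuracy, Combined is the combined score for end-to-end modeling, and Acc. denotes intent classification accuracy. Note that it is not straightforward to directly apply DAMD to the Unseen and OOD tasks since it requires extra annotations. As such, we omit results of DAMD on these two tasks.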

5 Baselines

For baselines, we consider three representative methods that hold state-of-the-art positions on existing benchmarks such as MultiWOZ Budzianowski et al. (2018).

DAMD Zhang et al. (2019a) is a state-of-the-art modular system, where each dialog module is implemented using a neural network, and the whole system is trained in an end-to-end manner.

GPT-2 represents a single multi-task learning model with impressive results on general language understanding and generation tasks. GPT-2 is an auto-regressive language model that leverages 12-24 layers of masked, multi-head self-attention Transformers. GPT-2 is pre-trained on the massive OpenWebText corpus (Radford et al., 2019). It has demonstrated superior performance on characterizing the distribution of human language data and on knowledge transfer. Given text prompts, GPT-2 can often generate fluent sentences. Its ancestral work GPT (with a smaller model size and less training data) has shown impressive results on language understanding tasks. In this paper, we use GPT-2 to denote the approach of directly fine-tuning the pre-trained GPT-2 on a specific domain. Hence, GPT-2 can be viewed as Soloist without grounded pre-training, and serves as a strong baseline for both DST and E2E tasks.

Soloist represents recent model variants Ham et al. (2020); Hosseini-Asl et al. (2020) that parameterize a dialog system as a single auto-regressive model. Soloist subsumes different dialog modules (e.g., state tracker, dialog policy, response generator) into a single Transformer model. It has capabilities similar to those of GPT-2 in understanding and generating natural language sentences, but is pre-trained on large heterogeneous dialog corpora to gain the additional capability of grounding text responses in user goals and real-world knowledge for task completion Peng et al. (2020a); Gao et al. (2020).

We leverage the pre-trained checkpoints from the corresponding work, and fine-tune them on Raddle. Each domain is trained separately. We train our models with Adam with an initial learning rate of 5e-5 and a batch size of 1 for 20 epochs. To allow for fair comparisons between the models, we do not tune hyperparameters or training settings for each model.


The Raddle benchmark follows the same evaluation model as GLUE Wang et al. (2018) or Kaggle (https://www.kaggle.com/). To evaluate a system on the benchmark, one must run the system on the provided test data for the tasks, then upload the results to the website http://aka.ms/raddle for scoring. The benchmark site shows per-task scores and a macro-average of those scores to determine a system’s position on the leaderboard. The website also provides fine- and coarse-grained results on the robustness diagnostic datasets. We will provide human evaluation services for top-ranked submissions on a bimonthly basis. The human evaluation protocol follows (Peng et al., 2020a; Li et al., 2020b).
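The leaderboard's macro-average can be illustrated with a short sketch. It assumes an unweighted mean over per-task scores; the exact aggregation used by the website is not spelled out here, but the unweighted mean of the sixteen Soloist scores listed in Table 2 does reproduce that table's reported average of 59.09. The function name `macro_average` and the dictionary keys are hypothetical.

```python
from statistics import mean
from typing import Dict


def macro_average(task_scores: Dict[str, float]) -> float:
    """Unweighted mean over per-task scores (every task counts equally)."""
    return mean(list(task_scores.values()))


# Soloist scores from Table 2, keyed as "<setting>/<metric>".
soloist_scores = {
    "standard/JGA": 53.17, "standard/Combined": 76.13,
    "paraphrase/JGA": 40.27, "paraphrase/Combined": 64.89,
    "simplification/JGA": 37.18, "simplification/Combined": 63.61,
    "typos/JGA": 22.73, "typos/Combined": 57.77,
    "verbosity/JGA": 38.21, "verbosity/Combined": 65.71,
    "speech/JGA": 36.81, "speech/Combined": 70.48,
    "unseen/JGA": 69.05, "unseen/Acc": 96.98,
    "ood/JGA": 56.28, "ood/F1": 96.18,
}
soloist_avg = round(macro_average(soloist_scores), 2)  # 59.09
```

A macro-average weights each task equally regardless of test-set size, so the 100-dialog Attraction set counts as much as the 800-example OOD set.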

6 Benchmark Results

6.1 Overall Results

We first present the results of baseline methods across all tasks on the Raddle benchmark in Table 2. As shown, GPT-2 fine-tuned with domain-specific dialog corpora outperforms the strong modular-based method DAMD. This highlights the efficacy of pre-trained language models. Soloist is the best-performing model: it improves upon GPT-2 by over 10 points in terms of average score, and consistently performs better than GPT-2 across all the tasks. These strong results indicate that large-scale task-specific pre-training on dialog corpora is crucial for effective and robust task adaptation.

6.2 Robustness Diagnostic Checklist Results

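| Env. | Variation | Attraction Inform | Attraction Success | Attraction BLEU | Train Inform | Train Success | Train BLEU | Hotel Inform | Hotel Success | Hotel BLEU | Restaurant Inform | Restaurant Success | Restaurant BLEU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Env-0 | - | 79.00 | 61.00 | 13.33 | 78.28 | 68.18 | 11.73 | 71.00 | 44.00 | 10.21 | 84.00 | 53.00 | 12.20 |
| Env-1 | Para. | 71.00 | 51.00 | 12.40 | 81.31 | 71.72 | 11.74 | 66.50 | 35.50 | 9.40 | 70.50 | 40.00 | 12.09 |
| Env-1 | Simp. | 63.00 | 47.00 | 12.40 | 78.28 | 69.19 | 11.97 | 57.50 | 32.50 | 9.45 | 68.00 | 43.00 | 12.23 |
| Env-1 | Typo | 68.00 | 49.00 | 11.99 | 78.28 | 69.19 | 11.72 | 53.50 | 30.50 | 9.76 | 63.00 | 37.00 | 11.57 |
| Env-1 | Verbo. | 76.00 | 54.00 | 12.79 | 75.25 | 67.17 | 11.97 | 61.50 | 40.00 | 10.41 | 71.00 | 43.50 | 11.43 |
| Env-2 | Para. | 63.00 | 44.00 | 12.41 | 75.76 | 66.16 | 11.70 | 56.00 | 33.00 | 9.96 | 66.00 | 40.50 | 12.29 |
| Env-2 | Simp. | 58.00 | 45.00 | 12.40 | 76.77 | 65.66 | 11.74 | 56.00 | 33.00 | 9.52 | 68.00 | 42.50 | 12.25 |
| Env-2 | Typo | 60.00 | 41.00 | 11.75 | 75.25 | 66.67 | 11.67 | 49.00 | 27.50 | 10.08 | 52.50 | 30.50 | 11.62 |
| Env-2 | Verbo. | 74.00 | 53.00 | 12.46 | 72.73 | 64.65 | 11.50 | 56.50 | 37.00 | 9.92 | 68.00 | 42.00 | 10.91 |
| Env-3 | Para. | 63.00 | 39.00 | 12.48 | 78.28 | 64.65 | 11.26 | 59.00 | 35.00 | 10.08 | 63.50 | 34.00 | 11.67 |
| Env-3 | Simp. | 63.00 | 43.00 | 11.37 | 76.77 | 63.64 | 11.21 | 53.00 | 27.00 | 9.68 | 66.50 | 31.00 | 11.81 |
| Env-3 | Typo | 62.00 | 33.00 | 11.13 | 74.24 | 61.11 | 11.14 | 46.50 | 23.00 | 9.60 | 52.00 | 24.00 | 10.82 |
| Env-3 | Verbo. | 75.00 | 50.00 | 11.25 | 72.73 | 58.59 | 11.30 | 56.00 | 34.00 | 10.04 | 64.00 | 33.50 | 10.88 |

Table 3: End-to-end evaluation on Raddle and environments mixed with different ratios of user language variations. Env-0 is the original test set; Env-1, Env-2, and Env-3 denote replacing 10%, 50%, and 80% of the original test set with randomly sampled language variation examples, respectively.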
Figure 2: Evaluation results of Soloist on different levels of language variation corpus.
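| Env. | Variation | Attraction | Train | Hotel | Restaurant |
|---|---|---|---|---|---|
| Env-0 | - | 56.10 | 60.47 | 29.10 | 67.01 |
| Env-1 | Para. | 45.45 | 53.39 | 25.67 | 55.59 |
| Env-1 | Simp. | 39.48 | 53.39 | 24.37 | 53.54 |
| Env-1 | Typo | 34.55 | 36.97 | 16.96 | 42.69 |
| Env-1 | Verbo. | 45.45 | 47.79 | 23.82 | 55.82 |
| Env-2 | Para. | 43.38 | 46.61 | 22.24 | 50.91 |
| Env-2 | Simp. | 37.66 | 42.18 | 22.98 | 50.46 |
| Env-2 | Typo | 29.87 | 21.04 | 13.62 | 31.74 |
| Env-2 | Verbo. | 43.90 | 39.04 | 21.59 | 54.11 |
| Env-3 | Para. | 42.60 | 45.03 | 22.06 | 51.03 |
| Env-3 | Simp. | 37.92 | 39.92 | 21.59 | 48.97 |
| Env-3 | Typo | 29.09 | 19.37 | 12.60 | 30.25 |
| Env-3 | Verbo. | 43.90 | 36.18 | 20.67 | 52.05 |

Table 4: State tracking evaluation (joint goal accuracy) on Raddle and environments mixed with different ratios of language variations.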
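| Setting | Attraction Inform | Attraction Success | Attraction BLEU | Train Inform | Train Success | Train BLEU | Hotel Inform | Hotel Success | Hotel BLEU | Restaurant Inform | Restaurant Success | Restaurant BLEU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SR-0 | 79.00 | 61.00 | 13.33 | 78.28 | 68.18 | 11.73 | 71.00 | 44.00 | 10.21 | 84.00 | 53.00 | 12.20 |
| SR-20 | 72.00 | 53.00 | 13.37 | 77.78 | 67.17 | 12.37 | 62.00 | 39.00 | 9.99 | 73.50 | 44.50 | 11.38 |
| SR-30 | 74.00 | 52.00 | 13.10 | 73.74 | 62.63 | 11.68 | 52.50 | 30.00 | 9.90 | 69.50 | 37.50 | 11.34 |

Table 5: Evaluation results of Soloist with different levels of speech errors. SR denotes speech errors. SR-X means a corpus with X% word error rate (WER).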
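| OOD ratio | Model | DSTC P | DSTC R | DSTC F1 | DSTC JGA | Stanford P | Stanford R | Stanford F1 | Stanford JGA | Reddit P | Reddit R | Reddit F1 | Reddit JGA | Twitter P | Twitter R | Twitter F1 | Twitter JGA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | GPT-2 | - | - | - | 46.75 | - | - | - | 46.75 | - | - | - | 46.75 | - | - | - | 46.75 |
| 0% | Soloist | - | - | - | 56.10 | - | - | - | 56.10 | - | - | - | 56.10 | - | - | - | 56.10 |
| 10% | GPT-2 | 98.86 | 90.58 | 94.54 | 42.88 | 96.08 | 27.53 | 42.79 | 41.92 | 94.59 | 19.66 | 32.56 | 42.10 | 95.35 | 23.03 | 37.10 | 42.45 |
| 10% | Soloist | 99.45 | 95.81 | 97.60 | 49.48 | 99.12 | 63.48 | 77.40 | 47.96 | 98.25 | 31.46 | 47.66 | 46.89 | 97.73 | 24.16 | 38.74 | 47.96 |
| 20% | GPT-2 | 100.00 | 98.42 | 99.21 | 45.66 | 100.00 | 76.40 | 86.62 | 44.58 | 100.00 | 48.88 | 65.66 | 44.40 | 100.00 | 57.30 | 72.86 | 45.12 |
| 20% | Soloist | 97.94 | 99.48 | 98.70 | 49.48 | 97.65 | 93.26 | 95.40 | 49.56 | 96.75 | 66.85 | 79.07 | 50.62 | 96.99 | 72.47 | 82.96 | 52.04 |
| 50% | GPT-2 | 98.96 | 100.00 | 99.48 | 50.52 | 100.00 | 82.58 | 90.46 | 48.13 | 100.00 | 52.81 | 69.12 | 48.85 | 100.00 | 61.80 | 76.39 | 49.02 |
| 50% | Soloist | 99.47 | 100.00 | 99.74 | 55.21 | 100.00 | 97.75 | 98.86 | 52.93 | 100.00 | 83.71 | 91.13 | 53.82 | 100.00 | 90.45 | 94.99 | 54.71 |

Table 6: Results of out-of-domain detection using varying sizes of training examples on different target domains. P, R, and F1 are the precision, recall, and F1 score of out-of-domain detection, and JGA is joint goal accuracy. N% denotes injecting out-of-domain utterances amounting to N% of the training set.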
| Model | Name | Time | Day | JGA | Acc. |
|---|---|---|---|---|---|
| GPT-2 | 84.47 | 65.01 | 26.85 | 28.28 | 51.29 |
| Soloist | 91.00 | 90.89 | 78.21 | 69.05 | 96.98 |

Table 7: Evaluation results on unseen entities.

Table 2 shows the overall performance of DST and E2E modeling under different variation settings.

Language variations

All the models incur significant performance drops under each type of variation. Among all variation types, Typos has the most substantial impact on both JGA and Combined score, resulting in a drop of 10 to 20 points in performance. This is expected, as misspelled keywords pose significant challenges for state tracking. The influence of the other three types of variations is also prominent. The results reveal that existing SOTA dialog models trained on limited task-specific examples are not robust enough to handle various types of user utterances.

Speech errors

We observe a clear degradation in all metrics for all models. This shows that during inference, models trained on textual data are sensitive and not robust to actual ASR hypotheses introduced in dialog history.

Unseen entities

Without task-specific pre-training, GPT-2 achieves less than 30% JGA and 51.29 dialog act accuracy, even on a simple domain with mostly common entity values. Soloist performs significantly better than GPT-2, achieving 69.05% JGA and 96.98 dialog act accuracy, but remains imperfect. These results imply that task-specific pre-training can improve the generalization capability of models but is still far from enough for production environments.

Out-of-domain utterances

It is non-trivial for conventional modular-based dialog systems to handle out-of-domain detection. It often requires an additional component to classify whether a user utterance is in-domain or not. As such, we omit the result of DAMD in our experiments. We observe that pre-trained models handle out-of-domain detection relatively well. GPT-2 achieves an F1 score of 83.86 while Soloist achieves 96.18, which shows that task-specific pre-training can improve the robustness of models to out-of-domain utterances.

6.3 Robustness detailed case studies

First, to better understand the impact of different language variations, we evaluated Soloist on corpora with different variation levels. Env-0 denotes the standard corpus, while Env-1, Env-2, and Env-3 denote that 10%, 50%, and 80% of the standard corpus is replaced with language variation examples, respectively. Table 3 lists the detailed results on the end-to-end task and Table 4 shows the performance of state tracking. In general, the performance drops as the variation level increases for all types of variations across the four domains. Even for a small variation level, Env-1 (10%), the performance drops significantly. We found that the degradation is mainly due to incorrectly tracked dialog states. Moreover, as depicted in Fig. 2 and shown in Table 3, although the combined score drops on Env-2 and Env-3, Soloist still maintains good BLEU scores. These observations indicate that the policy and response generation of Soloist are relatively robust to language variations, and that dialog state tracking is the major bottleneck towards robust dialog models. An intriguing possibility to improve robustness is to apply adversarial training Liu et al. (2020) to task-specific pre-training.
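The Env-X construction can be sketched as follows. This is an illustrative reading of the description rather than the authors' script: it assumes the standard examples and their rewritten variations are index-aligned, that replacement happens at the level of individual examples, and that the number of replaced examples is the rounded ratio times the corpus size. The function name `mix_variations` and the seed parameter are hypothetical.

```python
import random
from typing import List, Sequence


def mix_variations(
    standard: Sequence[str],
    variations: Sequence[str],
    ratio: float,
    seed: int = 0,
) -> List[str]:
    """Replace a random `ratio` of standard examples with their aligned variations."""
    if len(standard) != len(variations):
        raise ValueError("standard and variations must be index-aligned")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the mixed set is reproducible
    n_replace = round(ratio * len(standard))
    replaced = set(rng.sample(range(len(standard)), n_replace))
    return [variations[i] if i in replaced else standard[i] for i in range(len(standard))]


standard_set = [f"standard-{i}" for i in range(10)]
typo_set = [f"typo-{i}" for i in range(10)]
# Env-1, Env-2, Env-3 correspond to ratios 0.1, 0.5, 0.8 in the text above.
env2 = mix_variations(standard_set, typo_set, ratio=0.5)
```

Fixing the random seed keeps each environment identical across models, so score differences reflect the models rather than the sampled subset.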

Next, similar to the experiments on language variations, we evaluated Soloist on corpora with different levels of speech errors. Results are shown in Table 5. We observe that, compared with language variations, speech errors have a smaller impact on the performance of Soloist. It is noteworthy that the evaluation corpus we chose has considerably higher word error rates, ranging from 10% to 30%, than a modern speech recognizer, which usually has a single-digit word error rate in quiet environments. We speculate that pre-trained dialog models trained on textual data have the potential to be deployed to smart home devices such as Amazon Alexa and Apple HomePod. However, they still have defects when used in noisy environments, such as in-car smart assistants or outdoor usage. There is less work on jointly pre-training speech and text modalities in the dialog community. We believe that adding the speech modality to dialog pre-training may enhance robustness to speech errors.
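As a reminder of what the word error rates above measure, the standard definition (word-level edit distance divided by the number of reference words) can be sketched as follows. Whitespace tokenization and the function name `word_error_rate` are assumptions of this sketch, not details taken from the evaluation corpus.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

One substituted word in a five-word utterance gives a word error rate of 20%, which falls within the 10% to 30% range of the evaluation corpus.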

Evaluation results on unseen entities are listed in Table 7. We observe that GPT-2 is able to handle unseen entities such as Name and Time to some extent in this controlled experiment but fails to track Day properly, leading to inferior results in terms of joint goal accuracy and action selection. In contrast, with task-specific pre-training, Soloist substantially improves the performance on all metrics. Nevertheless, for this simple task, string matching can effortlessly achieve nearly 100% accuracy. Therefore, a joint goal accuracy of 69.05 is insufficient to affirm that Soloist is robust to unseen entities. Incorporating knowledge into pre-training can be a solid basis for further research to improve robustness to unseen entities.
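Joint goal accuracy is a strict metric: a turn counts as correct only if every slot in the predicted dialog state matches the gold state, so a single mistracked slot such as Day makes the whole turn wrong. A minimal sketch is given below, assuming dialog states are represented as slot-to-value dictionaries; the function name and the example values are illustrative and not taken from the benchmark.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state matches the gold state exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    if not gold_states:
        raise ValueError("at least one turn is required")
    # Dictionary equality requires every slot and value to agree.
    correct = sum(pred == gold
                  for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

# Illustrative turns: the second prediction gets only the Day slot wrong.
gold = [
    {"name": "Rosa", "time": "7pm", "day": "Friday"},
    {"name": "Ivo", "time": "noon", "day": "Sunday"},
]
pred = [
    {"name": "Rosa", "time": "7pm", "day": "Friday"},
    {"name": "Ivo", "time": "noon", "day": "Monday"},
]
jga = joint_goal_accuracy(pred, gold)
```

In this example one of the two turns matches exactly, so `jga` is 0.5 even though five of the six slot values are correct.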

We also present the evaluation results of out-of-domain detection using varying sizes of training examples on different target domains in Table 6. In the homologous DSTC domain Lee and Shalyminov (2019), GPT-2 performs similarly to Soloist. Both are able to identify out-of-domain utterances with a nearly 100% F1 score when injecting 50% OOD data. However, Soloist leads by 3 points in F1 score when trained using only 10% data. In the other, heterogeneous domains, Soloist performs consistently better than GPT-2. In the Reddit and Twitter domains, which are distinct from DSTC, Soloist outperforms GPT-2 by over 20 points in F1 score when trained using 50% data, showing that Soloist is more robust than GPT-2 to out-of-domain utterances. An inspiring observation is that injecting out-of-domain data can increase the performance of state tracking. While task-specific pre-training helps with OOD detection, incorporating open-domain data into pre-training or initializing from open-domain dialog models such as DialoGPT Zhang et al. (2019b) might further enhance the robustness of dialog models.

(a) DSTC8 (b) DSTC9
Figure 3: Corpus and human evaluation for different models in two recent Multi-domain Dialog Challenges: (a) DSTC8 and (b) DSTC9. The regions indicate the gap between human and corpus evaluations for different types of models. In DSTC8, Team 5 is the winner and the only submission adopting pre-trained GPT-2 models; its performance discrepancy between corpus and human evaluation is significantly smaller than that of other teams using modular methods without pre-training. In DSTC9, we observe a general trend shifting from modular systems to pre-trained end-to-end systems, together with a substantial drop in performance, which indicates that pre-trained methods remain sensitive to noisy inputs.

Finally, it is worth pointing out some important trends in the dialog research community, based on the DSTC challenges Kim et al. (2019); Gunasekara et al. (2020) of the last two years (Figure 3). In DSTC8 Kim et al. (2019), the winning submission by Team 5 is the only one that uses pre-trained models (GPT-2). When moving from corpus evaluation to human evaluation, it exhibits the smallest performance drop among all submissions, which is strong evidence of the robustness of pre-trained models. By the time of DSTC9 Gunasekara et al. (2020), the community has witnessed a general shift from modular systems to pre-trained end-to-end architectures. However, the significant performance gap between corpus evaluation and human evaluation indicates that pre-trained methods remain sensitive to noisy inputs. Such observations underscore the importance of robustness-oriented design and evaluation, for which Raddle fills a major void.

7 Conclusion

We introduce Raddle, a platform and collection of resources for evaluating and analyzing task-oriented dialog systems. We confirm the utility of grounded pre-training and transfer learning methods in dialog systems: pre-training improves generalization in a limited data setting, but still leaves room for improvement. When evaluating these models on our diagnostic dataset, we find that they fail (often spectacularly) on many robustness test cases, suggesting possible avenues for future work. In summary, the question of how to design unified, efficient, robust models remains largely unexplored, and we believe that Raddle can provide fertile soil for addressing this challenge.


References

  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278. Cited by: §2.1, §3, §3, §4, §5.
  • S. Coope, T. Farghly, D. Gerz, I. Vulic, and M. Henderson (2020) Span-convert: few-shot span extraction for dialog with pretrained conversational representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 107–121. Cited by: §3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. NAACL. Cited by: §2.2.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13042–13054. Cited by: §2.2.
  • M. Eric and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414. Cited by: §4.
  • J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval 13 (2-3), pp. 127–298. Cited by: §1, §3.
  • J. Gao, B. Peng, C. Li, J. Li, S. Shayandeh, L. Liden, and H. Shum (2020) Robust conversational ai with grounded text generation. arXiv preprint arXiv:2009.03457. Cited by: §2.2, §5.
  • K. Gopalakrishnan, B. Hedayatnia, L. Wang, Y. Liu, and D. Hakkani-Tur (2020) Are neural open-domain dialog systems robust to speech recognition errors in the dialog history? an empirical study. arXiv preprint arXiv:2008.07683. Cited by: §4, §4.
  • C. Gunasekara, S. Kim, L. F. D’Haro, A. Rastogi, Y. Chen, M. Eric, B. Hedayatnia, K. Gopalakrishnan, Y. Liu, C. Huang, et al. (2020) Overview of the ninth dialog system technology challenge: dstc9. arXiv preprint arXiv:2011.06486. Cited by: §6.3.
  • D. Ham, J. Lee, Y. Jang, and K. Kim (2020) End-to-end neural pipeline for goal-oriented dialogue system using gpt-2. ACL. Cited by: §3, §5.
  • C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. Cited by: §3.
  • M. Henderson, I. Casanueva, N. Mrksic, P. Su, T. Wen, and I. Vulic (2020) ConveRT: efficient and accurate conversational representations from transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 2161–2174. Cited by: §3.
  • M. Henderson, B. Thomson, and J. D. Williams (2014) The second dialog state tracking challenge. In Proceedings of the 15th annual meeting of the special interest group on discourse and dialogue (SIGDIAL), pp. 263–272. Cited by: §4.
  • E. Hosseini-Asl, B. McCann, C. Wu, S. Yavuz, and R. Socher (2020) A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796. Cited by: §3, §5.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §2.2.
  • S. Kim, M. Galley, C. Gunasekara, S. Lee, A. Atkinson, B. Peng, H. Schulz, J. Gao, J. Li, M. Adada, et al. (2019) The eighth dialog system technology challenge. arXiv preprint arXiv:1911.06394. Cited by: §6.3.
  • S. Larson, A. Mahendran, J. J. Peper, C. Clarke, A. Lee, P. Hill, J. K. Kummerfeld, K. Leach, M. A. Laurenzano, L. Tang, and J. Mars (2019) An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1311–1316. Cited by: §4.
  • S. Lee and I. Shalyminov (2019) Contextual out-of-domain utterance handling with counterfeit data augmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7205–7209. Cited by: §4, §4, §6.3.
  • S. Lee, Q. Zhu, R. Takanobu, X. Li, Y. Zhang, Z. Zhang, J. Li, B. Peng, X. Li, M. Huang, and J. Gao (2019) ConvLab: multi-domain end-to-end dialog system platform. CoRR abs/1904.08637. Cited by: §2.1.
  • C. Li, X. Gao, Y. Li, X. Li, B. Peng, Y. Zhang, and J. Gao (2020a) Optimus: organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092. Cited by: §2.2.
  • J. Li, B. Peng, S. Lee, J. Gao, R. Takanobu, Q. Zhu, M. Huang, H. Schulz, A. Atkinson, and M. Adada (2020b) Results of the multi-domain task-completion dialog challenge. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, Eighth Dialog System Technology Challenge Workshop. Cited by: §5.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §2.1.
  • X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, and J. Gao (2020) Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994. Cited by: §6.3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.2.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909. Cited by: §2.1.
  • S. Mehri, M. Eric, and D. Hakkani-Tur (2020) DialoGLUE: a natural language understanding benchmark for task-oriented dialogue. arXiv preprint arXiv:2009.13570. Cited by: §2.1.
  • R. J. Moore and R. Arar (2019) Conversational ux design: a practitioner’s guide to the natural conversation framework. ACM. Cited by: §4.
  • N. Mrksic, D. Ó. Séaghdha, T. Wen, B. Thomson, and S. J. Young (2017) Neural belief tracker: data-driven dialogue state tracking. In ACL (1), Cited by: §3.
  • Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2019) Adversarial NLI: a new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599. Cited by: §2.2.
  • B. Peng, C. Li, J. Li, S. Shayandeh, L. Liden, and J. Gao (2020a) SOLOIST: few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298. Cited by: §1, §2.2, §3, §3, §5, §5.
  • B. Peng, C. Zhu, C. Li, X. Li, J. Li, M. Zeng, and J. Gao (2020b) Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328. Cited by: §2.2.
  • B. Peng, C. Zhu, M. Zeng, and J. Gao (2020c) Data augmentation for spoken language understanding via pretrained models. arXiv preprint arXiv:2004.13952. Cited by: §2.2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §2.2, §5.
  • H. Sacks, E. A. Schegloff, and G. Jefferson (1978) A simplest systematics for the organization of turn taking for conversation. In Studies in the organization of conversational interaction, Cited by: §4.
  • E. A. Schegloff, G. Jefferson, and H. Sacks (1977) The preference for self-correction in the organization of repair in conversation. Language. Cited by: §4.
  • P. Shah, D. Hakkani-Tür, G. Tür, A. Rastogi, A. Bapna, N. Nayak, and L. Heck (2018) Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871. Cited by: §2.1.
  • A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714. Cited by: §4.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §2.1, §5.
  • T. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2016) A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562. Cited by: §2.1.
  • C. Wu, S. Hoi, R. Socher, and C. Xiong (2020a) Tod-bert: pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871. Cited by: §2.2.
  • C. Wu, S. Hoi, R. Socher, and C. Xiong (2020b) ToD-BERT: pre-trained natural language understanding for task-oriented dialogues. Cited by: §3.
  • Q. Wu, Y. Zhang, Y. Li, and Z. Yu (2019) Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756. Cited by: §2.1.
  • P. Xu and Q. Hu (2018) An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1448–1457. Cited by: §4.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. NeurIPS. Cited by: §2.2.
  • Y. Zhang, Z. Ou, and Z. Yu (2019a) Task-oriented dialog systems that consider multiple appropriate responses under the same context. arXiv preprint arXiv:1911.10484. Cited by: §5.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019b) DialoGPT: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. Cited by: §6.3.
  • L. Zhou, J. Gao, D. Li, and H. Shum (2020) The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics 46 (1), pp. 53–93. Cited by: §1.
  • C. Zhu, M. Zeng, and X. Huang (2019a) Multi-task learning for natural language generation in task-oriented dialogue. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1261–1266. Cited by: §2.1.
  • C. Zhu, M. Zeng, and X. Huang (2019b) SIM: a slot-independent neural model for dialogue state tracking. arXiv preprint arXiv:1909.11833. Cited by: §3.
  • C. Zhu (2020) Boosting naturalness of language in task-oriented dialogues via adversarial training. arXiv preprint arXiv:2004.14565. Cited by: §2.1.
  • Q. Zhu, Z. Zhang, Y. Fang, X. Li, R. Takanobu, J. Li, B. Peng, J. Gao, X. Zhu, and M. Huang (2020) ConvLab-2: an open-source toolkit for building, evaluating, and diagnosing dialogue systems. CoRR abs/2002.04793. Cited by: §2.1.