ToD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues

04/15/2020 ∙ by Chien-Sheng Wu, et al.

The use of pre-trained language models has emerged as a promising direction for improving dialogue systems. However, the underlying differences in linguistic patterns between conversational data and general text make existing pre-trained language models less effective in practice. Recently, some pre-training approaches based on open-domain dialogues have been proposed, leveraging large-scale social media data such as Twitter or Reddit. Pre-training for task-oriented dialogues, on the other hand, is rarely discussed because of the long-standing and crucial data scarcity problem. In this work, we combine nine English-based, human-human, multi-turn and publicly available task-oriented dialogue datasets to conduct language model pre-training. The experimental results show that our pre-trained task-oriented dialogue BERT (ToD-BERT) surpasses BERT and other strong baselines in four downstream task-oriented dialogue applications, including intention detection, dialogue state tracking, dialogue act prediction, and response selection. Moreover, in simulated limited-data experiments, we show that ToD-BERT has a stronger few-shot ability that can mitigate the data scarcity problem in task-oriented dialogues.




1 Introduction

Recent advances in pre-training with self-attention encoder architectures Devlin et al. (2018); Liu et al. (2019); Lan et al. (2019) have been widely adopted in many NLP applications. Such models are usually trained on massive general-text corpora, such as English Wikipedia or books Zhu et al. (2015). The distributed representations are learned self-supervised from the raw text. By further fine-tuning these representations, breakthroughs have been continuously reported in various downstream tasks, especially those of natural language understanding.

However, previous work Rashkin et al. (2018); Wolf et al. (2019) shows that directly fine-tuning such models on conversational corpora leads to performance deficiencies. One possible reason is the intrinsic difference in linguistic patterns between human conversations and written text, which results in a large gap between the data distributions Bao et al. (2019). Therefore, pre-training dialogue language models on chit-chat conversational corpora from social media, such as Twitter or Reddit, has recently been investigated, especially for dialogue response generation Zhang et al. (2019b) and retrieval Henderson et al. (2019b) tasks. Although these open-domain dialogues are diverse and easy to collect, they are usually short, noisy, and without specific chatting goals.

Task-oriented dialogues, on the other hand, have explicit goals (e.g. restaurant reservation or ticket booking) and many conversational interactions. But each of these datasets is usually small and scattered since obtaining and labeling such data is difficult and expensive. Moreover, task-oriented dialogues have clear user and system behaviors where the user has his/her goal and the system has its belief and database information, which makes the language understanding component more essential than those chit-chat scenarios.

In this paper, we aim to prove the following hypothesis: self-supervised language model pre-training on task-oriented corpora can learn better representations than existing pre-trained models for task-oriented downstream tasks. We emphasize that what we care about most is not whether our pre-trained model can achieve state-of-the-art results on each downstream task, since most of the current best models are built on top of pre-trained models, which can easily be replaced by ours. In our experiments, we avoid adding too many additional components on top of the pre-trained architecture when fine-tuning on each downstream task, and simply rely on the learned representations to show the full strength of a pre-trained model.

We collect and combine nine English-based, human-human, multi-turn, and publicly available task-oriented dialogue corpora to train a task-oriented dialogue BERT (ToD-BERT). In total, there are around 100k dialogues with 1.4M utterances across 60 different domains. Like BERT Devlin et al. (2018), ToD-BERT is formulated as a masked language model and uses the deep bidirectional Transformer Vaswani et al. (2017) encoder as its model architecture. We select the BERT architecture simply because it is currently the most widely used model in NLP research. Note that the unified datasets we combine can easily be applied to pre-train any existing language model.

We test ToD-BERT on four common downstream tasks of task-oriented dialogue systems, including intention detection, dialogue state tracking, dialogue act prediction, and response selection. This is what we observe: ToD-BERT outperforms BERT and other strong baselines on all the selected downstream tasks, which further confirms its effectiveness for improving dialogue language understanding. More importantly, ToD-BERT has stronger few-shot capacity than BERT on each task, implying that it can reduce the need for expensive human-annotated labels in the future. ToD-BERT can be easily leveraged and adapted to new task-oriented dialogue datasets, especially those with few training examples. Our source code and pre-trained model will be released soon to facilitate future research.

2 Related Work

General Pre-trained Language Models, which are trained on massive general text such as Wikipedia and BookCorpus, can be roughly divided into two categories: uni-directional or bi-directional attention mechanisms. GPT Radford et al. (a) and GPT-2 Radford et al. (b) are representatives of uni-directional language models using a Transformer decoder, where the objective is to maximize left-to-right generation likelihood. These models are commonly applied in natural language generation tasks. On the other hand, BERT Devlin et al. (2018), RoBERTa Liu et al. (2019) and their variants are pre-trained using a Transformer encoder with bi-directional token prediction. These models are usually evaluated on classification tasks such as the GLUE benchmark Wang et al. (2018) or span-based question answering tasks Rajpurkar et al. (2016).

Some language models can support both uni-directional and bi-directional attention, such as UniLM Dong et al. (2019). Conditional language model pre-training has also been proposed; for example, CTRL Keskar et al. (2019) is a conditional Transformer model trained to condition on control codes that govern style, content, and task-specific behavior. Recently, multi-task language model pre-training with unified sequence-to-sequence generation has been proposed: the Text-to-Text Transformer (T5) Raffel et al. (2019) unifies multiple text modeling tasks and achieves promising results on various NLP benchmarks.

Dialogue Pre-trained Language Models

are mostly trained on open-domain conversational data from Reddit or Twitter for dialogue response generation. TransferTransfo Wolf et al. (2019) achieves good performance in the ConvAI-2 dialogue competition using GPT-2. DialoGPT Zhang et al. (2019b) is an extension of GPT-2 that is pre-trained on Reddit data for open-domain response generation. ConveRT Henderson et al. (2019a) pre-trains a dual Transformer encoder for the response selection task on large-scale Reddit (input, response) pairs. PLATO Bao et al. (2019) uses both Twitter and Reddit data to pre-train a dialogue generation model with discrete latent variables. All of them are designed to cope with the response generation task for open-domain chatbots.

Name # Dialogue # Utterance Avg. Turn # Domain
MetaLWOZ Lee et al. (2019) 37,884 432,036 11.4 47
Schema Rastogi et al. (2019) 22,825 463,284 20.3 17
Taskmaster Byrne et al. (2019) 13,215 303,066 22.9 6
MWOZ Budzianowski et al. (2018) 10,420 71,410 6.9 7
MSR-E2E Li et al. (2018) 10,087 74,686 7.4 3
SMD Eric and Manning (2017) 3,031 15,928 5.3 3
Frames Asri et al. (2017) 1,369 19,986 14.6 3
WOZ Mrkšić et al. (2016) 1,200 5,012 4.2 1
CamRest676 Wen et al. (2016) 676 2,744 4.1 1
Table 1: Data statistics for task-oriented dialogue pre-training.

Pre-training for task-oriented dialogues, on the other hand, has little related work. Budzianowski and Vulić (2019) first applied the GPT-2 model to the response generation task, taking the system belief, database result, and last dialogue turn as input to predict the next system response. It only uses one dataset to train its model because few public datasets have database information available. Henderson et al. (2019b) pre-trained a response selection model for task-oriented dialogues. They first pre-train on Reddit corpora and then fine-tune on target dialogue domains, but their training and fine-tuning code is not released. Peng et al. (2020) focus on the natural language generation (NLG) task, which assumes dialogue acts and slot-tagging results are given to generate a natural language response. By pre-training on a set of annotated NLG corpora, it improves conditional generation quality using a GPT-2 model.

3 Method

In this section, we first discuss each dataset used for our task-oriented pre-training and how we process the data. Then we introduce the selected pre-training base model and its objective functions.

3.1 Datasets

We collect nine different task-oriented datasets which are English-based, human-human, multi-turn and publicly available. In total, there are 100,707 dialogues, which contain 1,388,152 utterances over 60 domains. Dataset statistics are shown in Table 1.

  • MetaLWOZ Lee et al. (2019): Meta-Learning Wizard-of-Oz is a dataset designed to help develop models capable of predicting user responses in unseen domains. This large dataset was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. The MetaLWOZ dataset is used as the fast adaptation task for DSTC8 Kim et al. (2019) dialogue competition.

  • Schema Rastogi et al. (2019): Schema-guided dialogue has 22,825 dialogues and provides a challenging testbed for several tasks, in particular, dialogue state tracking. Each schema is a set of tracking slots and each domain could have multiple possible schemas. This allows a single dialogue system to support a large number of services and facilitates the simple integration of new services without requiring much training data. The Schema dataset is used as the dialogue state tracking task for DSTC8 Kim et al. (2019) dialogue competition.

  • Taskmaster Byrne et al. (2019): This dataset includes 13,215 dialogues spanning six domains, including 5,507 spoken and 7,708 written dialogues created with two distinct procedures: a two-person Wizard-of-Oz approach in which one person acts as the system, and a self-dialogue approach in which crowdsourced workers wrote the entire dialogue themselves. It has 22.9 conversational turns per dialogue on average, the longest among all task-oriented datasets listed.

  • MWOZ Budzianowski et al. (2018): Multi-Domain Wizard-of-Oz dataset contains 10,420 dialogues over seven domains, and it has multiple domains in a single dialogue. It has a detailed description of the data collection procedure, and user goal, system act, and dialogue state labels. Different from most of the existing corpora, it also provides full database information.

  • MSR-E2E Li et al. (2018): Microsoft end-to-end dialogue challenge has 10,087 dialogues in three domains, movie-ticket booking, restaurant reservation, and taxi booking. It also includes an experiment platform with built-in simulators in each domain.

  • SMD Eric and Manning (2017): Stanford multi-domain dialogue is an in-car personal assistant dataset, comprising 3,031 dialogues and three domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. It is designed to smoothly interface with knowledge bases, where a knowledge snippet is attached to each dialogue as a piece of simplified database information.

  • Frames Asri et al. (2017): This dataset is composed of 1,369 human-human dialogues with an average of 14.6 turns per dialogue, where users are given constraints to book a trip and assistants search a database to find appropriate trips. Different from other datasets, it has labels that keep track of different semantic frames, i.e., the decision-making behavior of users, throughout each dialogue.

  • WOZ Mrkšić et al. (2016) and CamRest676 Wen et al. (2016): These two corpora use the same data collection procedure and the same ontology from DSTC2 Henderson et al. (2014). They are among the first task-oriented dialogue datasets to use the Wizard-of-Oz style with text input instead of speech input, which emphasizes a model's capacity for semantic understanding rather than its robustness to automatic speech recognition errors.

3.2 Model

We train our ToD-BERT based on the BERT Devlin et al. (2018) architecture. Note that the datasets we combine can be used to pre-train any existing language model architecture; here we select BERT simply because it is currently the most widely used model in NLP research. We use the BERT-Base model, a Transformer self-attention encoder Vaswani et al. (2017) with 12 layers, 12 attention heads, and a hidden size of 768.

To capture speaker information and the underlying interaction behavior in dialogues, we add two special tokens, “[USR]” and “[SYS]”, to the byte-pair embeddings Mrkšić et al. (2016). We prefix the special token to each user utterance and system response, and concatenate all the utterances in the same dialogue into one flat sequence, as shown in Figure 1. For example, for a dialogue $D = \{S_1, U_1, \dots, S_n, U_n\}$, where $n$ is the number of dialogue turns and each $S_i$ or $U_i$ contains a sequence of words, the input of the pre-training model is processed as “[SYS] $S_1$ [USR] $U_1$ $\dots$” with standard positional embeddings and segmentation embeddings.
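The flattening described above can be sketched in a few lines. This is a minimal illustration: the helper name, its signature, and the dialogue content are our own, not from the released code.

```python
# Sketch of flattening a dialogue into one ToD-BERT input sequence,
# prefixing each turn with its speaker special token.
def flatten_dialogue(turns):
    """`turns` is a list of ("sys"|"usr", text) pairs."""
    prefix = {"sys": "[SYS]", "usr": "[USR]"}
    pieces = [f"{prefix[speaker]} {text}" for speaker, text in turns]
    # [CLS] is prepended as in standard BERT input formatting.
    return "[CLS] " + " ".join(pieces)

example = flatten_dialogue([
    ("sys", "how can i help you ?"),
    ("usr", "book a table for two ."),
])
# example == "[CLS] [SYS] how can i help you ? [USR] book a table for two ."
```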

ToD-BERT is trained with the masked language model (MLM) objective, in which a random sample of the tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM objective is a cross-entropy loss on predicting the masked tokens. In the original BERT implementation, random masking and replacement are performed once at the beginning and saved for the duration of training; here, instead, we mask tokens dynamically during batch training. ToD-BERT is initialized from BERT, a good starting parameter set, and is then further pre-trained on the task-oriented corpora mentioned above.
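A minimal sketch of dynamic masking, assuming a simple token list rather than real subword IDs: a fresh mask is sampled every time a batch is built, instead of once at preprocessing time. The function and the exclusion of special tokens are our own simplifications.

```python
import random

def dynamic_mask(tokens, mask_prob=0.3, mask_token="[MASK]", rng=None):
    """Sample a fresh MLM mask over `tokens`; 0.3 matches the doubled rate above."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SYS]", "[USR]") and rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # MLM target: recover the original token
        else:
            masked.append(tok)
            labels.append(None)      # not a prediction target
    return masked, labels

tokens = "[CLS] [USR] book a table for two".split()
masked, labels = dynamic_mask(tokens, rng=random.Random(0))
```

Because the mask is re-sampled per batch, the same dialogue yields different training targets across epochs, which is the point of dynamic masking.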

We double the masking probability from 0.15 to 0.3, and we gradually reduce the learning rate without a warm-up period. We optimize ToD-BERT with AdamW Loshchilov and Hutter (2017) and train with a dropout of 0.1 on all layers and attention weights. GELU activation functions Hendrycks and Gimpel (2016) are used. Models are trained with early stopping based on the perplexity of a held-out development set, with mini-batches containing 32 sequences of maximum length 512 tokens.

Figure 1: Dialogue pre-training based on BERT architecture with user and system special tokens.

4 Downstream Tasks

We emphasize that what we care about most in this paper is whether our ToD-BERT, a language model pre-trained on multiple task-oriented corpora, shows any advantage over BERT. Therefore, we try to avoid adding too many additional components on top of their architecture when fine-tuning on each downstream task and simply rely on their learned representations. Also, we always use the same architecture with the same number of parameters for a fair comparison.

We select four common task-oriented downstream tasks to evaluate our pre-trained ToD-BERT: intent classification, dialogue state tracking, dialogue act prediction, and response selection. All of them are core components in modularized task-oriented systems. We briefly introduce them below:

Intent classification

task is a multi-class classification problem: given an input sentence, the model predicts a single intent class over $I$ possible intents,

$$P_{int} = \mathrm{Softmax}(W_1 \cdot F(X)) \in \mathbb{R}^{I},$$

where $F$ is a pre-trained language model that takes a sequence of tokens $X$ as input, and we use its [CLS] embedding as the output representation. $W_1$ is a trainable linear mapping. The model is trained with cross-entropy loss between the predicted distributions and the true intent labels.
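The intent head above amounts to one linear layer plus a softmax on the [CLS] embedding. A hedged numpy sketch, where `cls_emb` stands in for the [CLS] output of $F$ and the hidden size and weights are made-up toy values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def predict_intent(cls_emb, W1):
    """W1 has shape (num_intents, hidden); returns a distribution over intents."""
    return softmax(W1 @ cls_emb)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))             # 3 intents, hidden size 4 (toy)
p = predict_intent(rng.normal(size=4), W1)
pred_intent = int(np.argmax(p))
```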

Dialogue state tracking

task can be treated as a multi-class classification problem using a predefined ontology. Unlike intent classification, the input is the dialogue history $X$ (a sequence of utterances, e.g., 6.9 average turns in MWOZ), and the model predicts a value for each (domain, slot) pair at each dialogue turn. Each candidate value $v^j_i$, the $i$-th value for the $j$-th (domain, slot) pair, is passed into the pre-trained model, and its representation is fixed during training. The number of slot projection layers $G_j$ is equal to the number of (domain, slot) pairs:

$$S^j_i = \mathrm{Sim}(G_j(F(X)), F(v^j_i)) \in \mathbb{R}^{1},$$
$$P^j = \mathrm{Softmax}(S^j) \in \mathbb{R}^{|V^j|},$$

where $\mathrm{Sim}$ is the cosine similarity function and $P^j$ is the probability distribution of the $j$-th (domain, slot) pair over its $|V^j|$ possible values. The model is trained with cross-entropy loss summed over all the (domain, slot) pairs.
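The per-slot scoring can be sketched as follows: the projected dialogue representation is compared against each fixed candidate-value embedding by cosine similarity, then normalized with a softmax. The names follow the equations above, but the dimensions and random vectors are purely illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def slot_value_distribution(proj_dialogue_emb, value_embs):
    """Softmax over cosine similarities to each candidate value embedding."""
    scores = np.array([cosine(proj_dialogue_emb, v) for v in value_embs])
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(1)
dialogue = rng.normal(size=8)                    # stands in for G_j(F(X))
values = [rng.normal(size=8) for _ in range(4)]  # F(v_i^j), fixed during training
dist = slot_value_distribution(dialogue, values)
pred_value = int(np.argmax(dist))
```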

Dialogue act prediction

task is a multi-label classification problem because a system response may contain multiple dialogue acts, e.g., requesting and informing at the same time. The model takes the dialogue history $X$ as input and predicts a binary result for each possible dialogue act:

$$A = \mathrm{Sigmoid}(W_2 \cdot F(X)) \in \mathbb{R}^{N},$$

where $W_2$ is a trainable linear mapping, $N$ is the number of possible dialogue acts, and each value in $A$ is between 0 and 1 after the Sigmoid layer. The model is trained with binary cross-entropy loss, and the $i$-th dialogue act is considered triggered if $A_i > 0.5$.
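The multi-label head above can be sketched with independent sigmoids and a 0.5 trigger threshold. $W_2$ and the input embedding here are toy values with $N = 5$ possible acts:

```python
import numpy as np

def predict_acts(hist_emb, W2, threshold=0.5):
    """Independent sigmoid score per act; multi-label decision at `threshold`."""
    scores = 1.0 / (1.0 + np.exp(-(W2 @ hist_emb)))
    triggered = scores > threshold
    return scores, triggered

rng = np.random.default_rng(2)
W2 = rng.normal(size=(5, 8))              # 5 acts, hidden size 8 (toy)
scores, acts = predict_acts(rng.normal(size=8), W2)
```

Unlike the softmax in intent classification, the scores do not sum to one, so any subset of acts can fire at once.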

Response selection

task is a ranking problem, aiming to retrieve the most relevant system response from a candidate pool. We use a dual-encoder strategy Henderson et al. (2019b) and compute similarity scores between the source $X$ and each target $Y_i$:

$$r_i = \mathrm{Sim}(F(X), F(Y_i)) \in \mathbb{R}^{1},$$

where $Y_i$ is the $i$-th response candidate and $r_i$ is its cosine similarity score. We randomly sample several system responses from the corpus as negative samples. Although a sampled response may not be a true negative, this is a common way to train a ranker and evaluate its results Henderson et al. (2019a).
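A sketch of dual-encoder scoring with in-batch negatives, assuming pre-computed embeddings stand in for $F(X)$ and $F(Y)$: each context is scored against every response in the batch by cosine similarity, with the true (context, response) pairs on the diagonal.

```python
import numpy as np

def rank_responses(context_embs, response_embs):
    """scores[i, j] = cosine similarity of context i with response j."""
    C = context_embs / np.linalg.norm(context_embs, axis=1, keepdims=True)
    R = response_embs / np.linalg.norm(response_embs, axis=1, keepdims=True)
    return C @ R.T

rng = np.random.default_rng(3)
contexts = rng.normal(size=(4, 8))
# Near-duplicate embeddings simulate well-matched true pairs on the diagonal.
responses = contexts + 0.01 * rng.normal(size=(4, 8))
scores = rank_responses(contexts, responses)
top1 = scores.argmax(axis=1)   # best candidate per context
```

Off-diagonal entries play the role of the randomly sampled negatives described above.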

                    Acc (all)       Acc (in)        Acc (out)       Recall (out)
1-Shot    BERT      40.2% ± 0.3%    48.9% ± 0.3%    81.8% ± 0.1%     1.0% ± 0.1%
          ToD-BERT  44.8% ± 0.2%    54.4% ± 0.3%    82.0% ± 0.1%     1.4% ± 0.5%
10-Shot   BERT      75.5% ± 0.6%    90.1% ± 0.5%    83.5% ± 0.3%     9.8% ± 1.6%
          ToD-BERT  75.8% ± 0.2%    90.4% ± 0.1%    83.5% ± 0.3%    10.0% ± 1.4%
Full Data FastText*    -            89.0%              -             9.7%
          SVM*         -            91.0%              -            14.5%
          CNN*         -            91.2%              -            18.9%
          MLP*         -            93.5%              -            47.4%
          BERT      85.6%           95.8%           89.2%           41.6%
          ToD-BERT  85.9%           96.1%           89.9%           46.3%
Table 2: Intent classification results on the OOS dataset, one of the largest intent classification corpora. Models with * are reported from Larson et al. (2019). The “in” columns consider only the in-scope intent classes, the “out” columns consider the out-of-scope intent class, and the “all” column takes both into account.

5 Evaluation Datasets

We pick up several datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream tasks evaluation. The first three corpora are not included in the pre-trained task-oriented datasets. For MWOZ, to be fair, we do not include its test set dialogues during the pre-training stage. Details of each evaluation dataset are discussed in the following:

  • OOS Larson et al. (2019): The out-of-scope intent dataset is one of the largest annotated intent datasets, including 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It covers 151 intent classes over 10 domains, including 150 in-scope intents and 1 out-of-scope intent. An out-of-scope intent means a user utterance that does not fall into any of the predefined intents. Each of the intents has 100 training samples. We use this dataset to evaluate the performance of the intent classification task.

  • DSTC2 Henderson et al. (2014): DSTC2 is a human-machine task-oriented dataset, which has 1,612/506/1,117 dialogues for the train, validation, and test sets, respectively. We follow Paul et al. (2019) to map the original dialogue act labels to universal dialogue acts, which results in 19 different system dialogue acts. We use this dataset to evaluate the performance of the dialogue act prediction and response selection tasks.

  • GSIM Shah et al. (2018a): GSIM is a human-rewritten machine-to-machine task-oriented corpus, including 1,500/469/1,039 dialogues for the train, validation, and test sets, respectively. We combine its two domains, movie and restaurant, into one single corpus. It was collected with the Machines Talking To Machines (M2M) Shah et al. (2018b) approach, a functionality-driven process combining a dialogue self-play step and a crowd-sourcing step. We map its dialogue act labels to universal dialogue acts Paul et al. (2019), resulting in 13 different system dialogue acts. We use this dataset to evaluate the performance of the dialogue act prediction and response selection tasks.

  • MWOZ Budzianowski et al. (2018): MWOZ is the most common benchmark for task-oriented dialogues, especially for dialogue state tracking. It has 8,420/1,000/1,000 dialogues for the train, validation, and test sets, respectively. Across seven different domains, it has in total 30 (domain, slot) pairs that need to be tracked in the test set. We use its revised version MWOZ 2.1 from Eric et al. (2019), which has the same dialogue transcripts but cleaner state label annotations. We use this dataset to evaluate the performance of the dialogue state tracking, dialogue act prediction, and response selection tasks.

6 Results

For each downstream task, we first conduct the experiments using the whole dataset, then we simulate the few-shot setting to show the strength of our ToD-BERT. We run at least three times with different random seeds for each few-shot experiment to reduce the variance of data sampling, and we report its mean and standard deviation for these limited data scenarios.

6.1 Intent Classification

ToD-BERT outperforms BERT and other strong baselines (the numbers for FastText, SVM, CNN and MLP are reported from Larson et al. (2019)) on one of the largest intent classification datasets, as shown in Table 2. We evaluate accuracy on all the data, on only the in-scope intents, and on only the out-of-scope intent.

ToD-BERT achieves 85.9% accuracy over the 151 intent classes, 96.1% accuracy over the defined 150 intent classes, and 89.9% accuracy and 46.3% recall on the out-of-scope intent. Besides, we conduct 1-shot and 10-shot experiments by randomly sampling one and ten utterances from each intent class in the training set. To reduce the variance of data sampling, the numbers reported are averaged over five runs. ToD-BERT has a 4.6% all-intent accuracy improvement and a 5.5% in-scope accuracy improvement compared with BERT in the 1-shot setting. These results confirm our hypothesis that ToD-BERT indeed learns better representations for task-oriented dialogues.

                      Joint Acc       Slot Acc
1% Data    BERT        6.7% ± 0.5%    84.4% ± 0.1%
           ToD-BERT   10.0% ± 0.5%    87.0% ± 0.2%
5% Data    BERT       20.6% ± 0.7%    92.4% ± 0.1%
           ToD-BERT   27.5% ± 0.4%    93.8% ± 0.2%
10% Data   BERT       25.2% ± 3.1%    93.6% ± 0.4%
           ToD-BERT   35.9% ± 1.1%    95.2% ± 0.2%
25% Data   BERT       40.2% ± 0.4%    95.8% ± 0.1%
           ToD-BERT   42.8% ± 0.4%    96.3% ± 0.1%
Full Data  DSTReader* 36.4%           -
           HyST*      38.1%           -
           ZSDST*     43.4%           -
           TRADE*     45.6%           -
           BERT       46.6%           96.6%
           ToD-BERT   47.7%           96.8%
Table 3: Dialogue state tracking results on the MWOZ 2.1 dataset. We report joint goal accuracy and slot accuracy for the full data setting and the simulated few-shot settings.
                    MWOZ (13)                      DSTC2 (19)                     GSIM (13)
                    micro-F1       macro-F1        micro-F1       macro-F1        micro-F1       macro-F1
1% Data   BERT      77.3% ± 2.0%   58.3% ± 1.8%    79.2% ± 0.4%   16.8% ± 0.7%    83.8% ± 3.6%   35.3% ± 2.6%
          ToD-BERT  85.8% ± 1.7%   67.0% ± 2.8%    82.3% ± 0.8%   18.5% ± 0.7%    92.2% ± 0.9%   40.7% ± 0.7%
10% Data  BERT      89.1% ± 0.4%   77.3% ± 0.7%    82.6% ± 0.9%   26.5% ± 0.6%    97.1% ± 0.1%   44.2% ± 0.1%
          ToD-BERT  89.8% ± 0.2%   79.1% ± 0.5%    85.1% ± 1.7%   29.2% ± 1.3%    98.6% ± 0.2%   44.9% ± 0.2%
Full Data MLP       61.6%          45.5%           77.6%          18.1%           89.5%          26.1%
          RNN       90.4%          77.3%           90.8%          29.4%           98.4%          45.2%
          BERT      90.8%          78.9%           91.3%          32.9%           97.3%          44.5%
          ToD-BERT  91.2%          79.8%           92.0%          34.7%           98.9%          45.2%
Table 4: Dialogue act prediction results on three different datasets. The numbers reported are micro- and macro-F1 scores, and each dataset has a different number of dialogue acts (in parentheses). Each few-shot result is averaged over three runs.
                    1% Data                        10% Data                       Full Data
                    BERT           ToD-BERT        BERT           ToD-BERT        BERT           ToD-BERT
MWOZ   1-to-100      9.6% ± 0.1%   19.3% ± 0.3%    26.7% ± 0.3%   37.1% ± 0.4%    59.7% ± 0.3%   63.3% ± 0.2%
       3-to-100     24.0% ± 0.2%   41.2% ± 0.3%    50.7% ± 0.2%   62.8% ± 0.3%    83.0% ± 0.2%   85.0% ± 0.2%
DSTC2  1-to-100     75.1% ± 0.3%   75.7% ± 0.2%    78.8% ± 0.2%   79.3% ± 0.3%    79.2% ± 0.2%   79.4% ± 0.1%
       3-to-100     93.0% ± 0.2%   93.8% ± 0.3%    94.2% ± 0.1%   94.5% ± 0.1%    94.5% ± 0.2%   94.7% ± 0.1%
GSIM   1-to-100     62.3% ± 0.4%   62.7% ± 0.1%    78.1% ± 0.0%   78.3% ± 0.1%    78.3% ± 0.2%   78.4% ± 0.1%
       3-to-100     75.5% ± 0.2%   76.1% ± 0.1%    81.2% ± 0.0%   81.3% ± 0.0%    81.3% ± 0.0%   81.4% ± 0.0%
Table 5: Response selection evaluation results on three corpora for the 1%, 10%, and full data settings. The 1-to-100 and 3-to-100 accuracies are averaged over five runs.

6.2 Dialogue State Tracking

Two evaluation metrics are commonly used in dialogue state tracking task, joint goal accuracy and slot accuracy. The joint goal accuracy compares the predicted dialogue states to the ground truth at each dialogue turn, where the ground truth includes slot values for all the possible (domain, slot) pairs. The output is considered as a correct prediction if and only if all the predicted values exactly match its ground truth values. The slot accuracy, on the other hand, individually compares each (domain, slot, value) triplet to its ground truth label.

In Table 3, we first compare BERT with ToD-BERT on the MWOZ dataset (the 2.1 version) and find that the latter has a 1.1% joint goal accuracy improvement. Since the original ontology provided by Budzianowski et al. (2018) is not complete (some labeled values are not included in the ontology), we create a new ontology of all the possible annotated values. We also list several well-known dialogue state trackers for reference, including DSTReader Gao et al. (2019), HyST Goel et al. (2019), TRADE Wu et al. (2019), and ZSDST Rastogi et al. (2019). ToD-BERT outperforms DSTReader, HyST, and TRADE by 11.3%, 9.6%, and 2.1% joint goal accuracy, respectively.

All dialogue state trackers that are based on pre-trained models can easily be improved by ToD-BERT. We replace the BERT used in DS-DST-picklist Zhang et al. (2019a) with our ToD-BERT. We observe that the replacement gains 0.2% joint goal accuracy, from 53.2% to 53.4%, and the model achieves 58.3% validation joint goal accuracy using only 50-60% of the original training steps.

We also report the few-shot experiments using 1%, 5%, 10% and 25% of the data for dialogue state tracking. Each result shown is averaged over three different runs. ToD-BERT outperforms BERT in all settings, which further shows the strength of task-oriented dialogue pre-training. ToD-BERT surpasses BERT by 3.3%, 7.1%, 10.7%, and 2.6% joint goal accuracy in the 1%, 5%, 10%, and 25% data settings, respectively. Note that 1% of the data is around 84 dialogues.

6.3 Dialogue Act Prediction

We conduct experiments on three different datasets and report micro-F1 and macro-F1 scores for the dialogue act prediction task, a multi-label classification problem. For the MWOZ dataset, we remove the domain information from the original system dialogue act labels, for example, the “taxi-inform” will be simplified to “inform”. This process reduces the number of possible dialogue acts from 31 to 13. For DSTC2 and GSIM corpora, we follow Paul et al. (2019) to apply universal dialogue act mapping that maps the original dialogue act labels to a general dialogue act format, resulting in 19 and 13 system dialogue acts in DSTC2 and GSIM, respectively.

We run two other baselines, MLP and RNN, to further show the strengths of BERT-based models. The MLP model simply takes bag-of-words embeddings to make dialogue act predictions, and the RNN model is a bi-directional GRU network. In Table 4, one can observe that ToD-BERT consistently works better than BERT and the other baselines, regardless of dataset or evaluation metric.

In the few-shot experiments, we run each setting three times and report the averaged results. ToD-BERT outperforms BERT by 8.5% micro-F1 and 8.7% macro-F1 on the MWOZ corpus in the 1% data scenario. Also, ToD-BERT using only 10% of the data can achieve 89.8% micro-F1 and 79.1% macro-F1 on MWOZ, and 98.6% micro-F1 and 44.9% macro-F1 on GSIM, which are similar to or better than BERT using full data.

6.4 Response Selection

To evaluate response selection in task-oriented dialogues, we follow the k-to-100 accuracy, which is becoming a research community standard Yang et al. (2018); Henderson et al. (2019a). The k-of-100 ranking accuracy is a Recall@k metric, which indicates whether the relevant response occurs in the top k ranked candidate responses. The metric is computed using random batches of 100 examples, so that the responses from the other examples in the batch are used as random negative candidates. This allows efficient computation of the metric across many examples in batches. While it is not guaranteed that the random negatives will indeed be “true” negatives, the 1-of-100 metric still provides a useful evaluation signal that correlates with downstream tasks. We run five different random seeds to sample random batches and report the average results.
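The k-of-100 computation described above can be sketched directly from a batch score matrix. This is our own minimal implementation, with identity matrices standing in for model similarity scores:

```python
import numpy as np

def k_of_n_accuracy(scores, k):
    """scores[i, j]: similarity of context i with response j; true pair on the diagonal.

    The true response counts as retrieved if fewer than k candidates
    score strictly higher than it.
    """
    n = scores.shape[0]
    diag = scores[np.arange(n), np.arange(n)][:, None]
    ranks = (scores > diag).sum(axis=1)
    return float((ranks < k).mean())

# A perfect model ranks the true response first for every context.
perfect = np.eye(100)
acc1 = k_of_n_accuracy(perfect, k=1)
```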

In Table 5, we conduct response selection experiments on three datasets: MWOZ, DSTC2, and GSIM. ToD-BERT achieves 63.3% 1-to-100 accuracy and 85.0% 3-to-100 accuracy on MWOZ, surpassing BERT by 3.6% and 2.0%, respectively. The advantage of ToD-BERT is more obvious in the few-shot scenario: in the 1% data setting, it has around 10% 1-to-100 accuracy improvement and 17.2% 3-to-100 accuracy improvement. Although we observe a similar trend on the DSTC2 and GSIM datasets, the results are not as clear as those on MWOZ. One possible reason is that neither of them is a human-human dataset, so their system responses are less diverse than in the MWOZ corpus, which makes the negative samples in each random batch noisier. (The pre-trained ConveRT model Henderson et al. (2019a) achieves 5.2% ± 0.1% 1-to-100 accuracy and 10.4% ± 0.2% 3-to-100 accuracy. Since they only released code for model inference, we report their results without fine-tuning.)

(a) BERT
(b) ToD-BERT
Figure 2: The t-SNE visualization of (a) BERT and (b) ToD-BERT utterance representations on the MWOZ test set. Different colors indicate different domains. ToD-BERT has a higher normalized mutual information score than BERT.

7 Visualization

In Figure 2, we visualize the embeddings of BERT and ToD-BERT given the same input, the utterances in the test set of MWOZ. Each sample point is an utterance representation, produced by passing the utterance through a pre-trained model and reducing its high-dimensional features to two dimensions with the t-distributed stochastic neighbor embedding (t-SNE) method. Since we know the true domain label for each utterance, we use different colors to represent different domains. As one can observe, ToD-BERT has clearer group boundaries than BERT.

To analyze the results quantitatively, we run K-means, a common unsupervised clustering algorithm, on top of the output embeddings of BERT and ToD-BERT, setting K to 10 and 20. After clustering, each utterance in the MWOZ test set is assigned to a predicted class. We then compute the normalized mutual information (NMI) between the clustering result and the true domain label of each utterance. ToD-BERT consistently achieves higher NMI scores than BERT: for K=10, ToD-BERT has a 0.143 NMI score while BERT only has 0.094; for K=20, ToD-BERT achieves 0.213 while BERT has 0.109.
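The NMI score used above can be computed from the contingency counts of the two labelings. A hedged pure-Python sketch, normalizing the mutual information by the arithmetic mean of the two entropies (one common convention; library implementations may offer other averaging choices), checked on a toy labeling rather than real utterance embeddings:

```python
import math
from collections import Counter

def nmi(pred, true):
    """Normalized mutual information between two labelings of the same items."""
    n = len(pred)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    joint = Counter(zip(pred, true))
    pc, tc = Counter(pred), Counter(true)
    # MI = sum over joint cells of p(a,b) * log(p(a,b) / (p(a) p(b)))
    mi = sum((c / n) * math.log((c / n) * n * n / (pc[a] * tc[b]))
             for (a, b), c in joint.items())
    h = (entropy(pred) + entropy(true)) / 2
    return mi / h if h > 0 else 0.0

# Perfectly aligned clusterings give NMI = 1.
score = nmi([0, 0, 1, 1], ["taxi", "taxi", "hotel", "hotel"])
```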

8 Conclusion

We propose task-oriented dialogue BERT (ToD-BERT), which is trained on nine English-based, human-human, multi-turn and publicly available task-oriented datasets across over 60 domains. ToD-BERT outperforms BERT on four dialogue downstream tasks, including intention classification, dialogue state tracking, dialogue act prediction, and response selection. It also shows a clear advantage in the few-shot experiments, where only limited labeled data is available. ToD-BERT is easy to deploy and will be open-sourced, allowing the NLP research community to apply or fine-tune it on any task-oriented conversational problem. Lastly, the nine task-oriented datasets we combined can be leveraged to train and test any other pre-trained architectures in the future.