Recent advances in pre-training with self-attention encoder architectures Devlin et al. (2018); Liu et al. (2019); Lan et al. (2019) have been widely adopted in many NLP applications. Such models are usually pre-trained on massive general text corpora, such as English Wikipedia or books Zhu et al. (2015). The distributed representations are learned in a self-supervised way from the raw text. By further fine-tuning these representations, breakthroughs have been continuously reported for various downstream tasks, especially natural language understanding.
However, previous work Rashkin et al. (2018); Wolf et al. (2019) shows that directly fine-tuning such models on conversational corpora yields deficient performance. One possible reason is the intrinsic difference in linguistic patterns between human conversations and written text, which results in a large gap between the data distributions Bao et al. (2019). Therefore, pre-training dialogue language models on chit-chat conversational corpora from social media, such as Twitter or Reddit, has recently been investigated, especially for the dialogue response generation Zhang et al. (2019b) and retrieval Henderson et al. (2019b) tasks. Although such open-domain dialogues are diverse and easy to obtain, they are usually short, noisy, and without specific chatting goals.
Task-oriented dialogues, on the other hand, have explicit goals (e.g., restaurant reservation or ticket booking) and many conversational interactions. However, each such dataset is usually small and scattered, since obtaining and labeling the data is difficult and expensive. Moreover, task-oriented dialogues exhibit clear user and system behaviors: the user has a goal and the system has its belief state and database information, which makes the language understanding component more essential than in chit-chat scenarios.
In this paper, we aim to verify the following hypothesis: self-supervised language model pre-training on task-oriented corpora can learn better representations than existing pre-trained models for task-oriented downstream tasks. We emphasize that what we care about most is not whether our pre-trained model can achieve state-of-the-art results on each downstream task, since most of the current best models are built on top of pre-trained models, which can easily be replaced by ours. In our experiments, we avoid adding too many additional components on top of the pre-trained architecture when fine-tuning on each downstream task and rely simply on the learned representations to show the full strength of a pre-trained model.
We collect and combine nine English-based, human-human, multi-turn, and publicly available task-oriented dialogue corpora to train a task-oriented dialogue BERT (ToD-BERT). In total, there are around 100k dialogues with 1.4M utterances across 60 different domains. Like BERT Devlin et al. (2018), ToD-BERT is formulated as a masked language model and uses the deep bidirectional Transformer Vaswani et al. (2017) encoder as its model architecture. We select the BERT architecture simply because it is currently the most widely used model in NLP research. Note that the unified dataset we assemble can easily be applied to pre-train any existing language model.
We test ToD-BERT on four common downstream tasks of task-oriented dialogue systems: intent detection, dialogue state tracking, dialogue act prediction, and response selection. We observe the following: ToD-BERT outperforms BERT and other strong baselines on all selected downstream tasks, which confirms its effectiveness for improving dialogue language understanding. More importantly, ToD-BERT shows a stronger few-shot capacity than BERT on every task, implying that it can reduce the need for expensive human-annotated labels. ToD-BERT can easily be leveraged and adapted to new task-oriented dialogue datasets, especially those with few training examples. Our source code and pre-trained model will be released soon to facilitate future research.
2 Related Work
General Pre-trained Language Models, which are trained on massive general text such as Wikipedia and BookCorpus, can be roughly divided into two categories: uni-directional and bi-directional attention mechanisms. GPT Radford et al. (a) and GPT-2 Radford et al. (b) are representative uni-directional language models that use a Transformer decoder, where the objective is to maximize left-to-right generation likelihood. These models are commonly applied to natural language generation tasks. On the other hand, BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), and their variants are pre-trained using a Transformer encoder with bi-directional token prediction. These models are usually evaluated on classification tasks such as the GLUE benchmark Wang et al. (2018) or span-based question answering tasks Rajpurkar et al. (2016).
Some language models support both uni-directional and bi-directional attention, such as UniLM Dong et al. (2019). Conditional language model pre-training has also been proposed; for example, CTRL Keskar et al. (2019) is a conditional Transformer model trained to condition on control codes that govern style, content, and task-specific behavior. Recently, multi-task language model pre-training with unified sequence-to-sequence generation has been proposed: the Text-to-Text Transformer (T5) Raffel et al. (2019) unifies multiple text modeling tasks and achieves promising results on various NLP benchmarks.
Dialogue Pre-trained Language Models are mostly trained on open-domain conversational data from Reddit or Twitter for dialogue response generation. TransferTransfo Wolf et al. (2019) achieves good performance in the ConvAI-2 dialogue competition using GPT-2. DialoGPT Zhang et al. (2019b) is an extension of GPT-2 pre-trained on Reddit data for open-domain response generation. ConveRT Henderson et al. (2019a) pre-trains a dual Transformer encoder for the response selection task on large-scale Reddit (input, response) pairs. PLATO Bao et al. (2019) uses both Twitter and Reddit data to pre-train a dialogue generation model with discrete latent variables. All of them are designed to cope with the response generation task for open-domain chatbots.
Table 1: Statistics of the nine task-oriented datasets used for pre-training.

| Name | # Dialogue | # Utterance | Avg. Turn | # Domain |
|---|---|---|---|---|
| MetaLWOZ Lee et al. (2019) | 37,884 | 432,036 | 11.4 | 47 |
| Schema Rastogi et al. (2019) | 22,825 | 463,284 | 20.3 | 17 |
| Taskmaster Byrne et al. (2019) | 13,215 | 303,066 | 22.9 | 6 |
| MWOZ Budzianowski et al. (2018) | 10,420 | 71,410 | 6.9 | 7 |
| MSR-E2E Li et al. (2018) | 10,087 | 74,686 | 7.4 | 3 |
| SMD Eric and Manning (2017) | 3,031 | 15,928 | 5.3 | 3 |
| Frames Asri et al. (2017) | 1,369 | 19,986 | 14.6 | 3 |
| WOZ Mrkšić et al. (2016) | 1,200 | 5,012 | 4.2 | 1 |
| CamRest676 Wen et al. (2016) | 676 | 2,744 | 4.1 | 1 |
Pre-training for task-oriented dialogues, on the other hand, has little prior work. Budzianowski and Vulić (2019) first applied the GPT-2 model to the response generation task, taking the system belief, database result, and last dialogue turn as input to predict the next system response. They use only one dataset to train their model because few public datasets have database information available. Henderson et al. (2019b) pre-trained a response selection model for task-oriented dialogues: they first pre-train on Reddit corpora and then fine-tune on target dialogue domains, but their training and fine-tuning code is not released. Peng et al. (2020) focus on the natural language generation (NLG) task, which assumes dialogue acts and slot-tagging results are given to generate a natural language response. By pre-training on a set of annotated NLG corpora, they improve conditional generation quality using a GPT-2 model.
3 Method

In this section, we first discuss each dataset used for our task-oriented pre-training and how we process the data. We then introduce the selected pre-training base model and its objective functions.
We collect nine task-oriented datasets that are English-based, human-human, multi-turn, and publicly available. In total, there are 100,707 dialogues containing 1,388,152 utterances over 60 domains. Dataset statistics are shown in Table 1.
MetaLWOZ Lee et al. (2019): Meta-Learning Wizard-of-Oz is a dataset designed to help develop models capable of predicting user responses in unseen domains. This large dataset was created by crowdsourcing 37,884 goal-oriented dialogues, covering 227 tasks in 47 domains. The MetaLWOZ dataset is used as the fast-adaptation task in the DSTC8 Kim et al. (2019) dialogue competition.
Schema Rastogi et al. (2019): Schema-guided dialogue has 22,825 dialogues and provides a challenging testbed for several tasks, in particular dialogue state tracking. Each schema is a set of tracking slots, and each domain can have multiple possible schemas. This allows a single dialogue system to support a large number of services and facilitates the simple integration of new services without requiring much training data. The Schema dataset is used as the dialogue state tracking task in the DSTC8 Kim et al. (2019) dialogue competition.
Taskmaster Byrne et al. (2019): This dataset includes 13,215 dialogues in six domains, comprising 5,507 spoken and 7,708 written dialogues created with two distinct procedures: a two-person Wizard-of-Oz approach, in which one person acts as the bot, and a self-dialogue approach, in which crowdsourced workers write the entire dialogue themselves. With 22.9 conversational turns per dialogue on average, it is the longest among all the task-oriented datasets listed.
MWOZ Budzianowski et al. (2018): The Multi-Domain Wizard-of-Oz dataset contains 10,420 dialogues over seven domains, and a single dialogue can span multiple domains. It comes with a detailed description of the data collection procedure, along with user goal, system act, and dialogue state labels. Unlike most existing corpora, it also provides full database information.
MSR-E2E Li et al. (2018): Microsoft end-to-end dialogue challenge has 10,087 dialogues in three domains, movie-ticket booking, restaurant reservation, and taxi booking. It also includes an experiment platform with built-in simulators in each domain.
SMD Eric and Manning (2017): Stanford multi-domain dialogue is an in-car personal assistant dataset comprising 3,031 dialogues in three domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. It is designed to smoothly interface with knowledge bases: a knowledge snippet is attached to each dialogue as a piece of simplified database information.
Frames Asri et al. (2017): This dataset is composed of 1,369 human-human dialogues with an average of 14.6 turns per dialogue, where users are given constraints to book a trip and assistants search a database to find appropriate trips. Unlike other datasets, it has labels that keep track of different semantic frames, i.e., the decision-making behavior of users, throughout each dialogue.
WOZ Mrkšić et al. (2016) and CamRest676 Wen et al. (2016): These two datasets are among the first task-oriented dialogue corpora collected in the Wizard-of-Oz style with text input instead of speech input, which improves a model's capacity for semantic understanding rather than its robustness to automatic speech recognition errors.
We train our ToD-BERT based on the BERT Devlin et al. (2018) architecture. Note that the dataset we assemble can be used to pre-train any existing language model architecture; here we select BERT simply because it is currently the most widely used model in NLP research. We use the BERT-Base model, a Transformer self-attention encoder Vaswani et al. (2017) with 12 layers, 12 attention heads, and a hidden size of 768.
To capture speaker information and the underlying interaction behavior in dialogues, we add two special tokens, “[USR]” and “[SYS]”, to the byte-pair embeddings Mrkšić et al. (2016). We prefix the special token to each user utterance and system response, and concatenate all utterances in the same dialogue into one flat sequence, as shown in Figure 1. For example, for a dialogue $D = \{S_1, U_1, \dots, S_n, U_n\}$, where $n$ is the number of dialogue turns and each $S_i$ or $U_i$ contains a sequence of words, the input of the pre-training model is processed as “[SYS] $S_1$ [USR] $U_1$ $\dots$ [SYS] $S_n$ [USR] $U_n$” with standard positional embeddings and segmentation embeddings.
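To make this preprocessing concrete, here is a minimal sketch of the flattening step. It assumes the HuggingFace `transformers` tokenizer API; the function and variable names are illustrative, not the released implementation.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Register the two speaker tokens; the model's token embeddings must be
# resized accordingly (e.g., model.resize_token_embeddings(len(tokenizer))).
tokenizer.add_special_tokens({"additional_special_tokens": ["[USR]", "[SYS]"]})

def flatten_dialogue(turns):
    """Concatenate a dialogue into one flat sequence.

    `turns` is a list of (speaker, utterance) pairs, where speaker is
    "user" or "system"; each utterance is prefixed with its speaker token.
    """
    pieces = []
    for speaker, utterance in turns:
        prefix = "[USR]" if speaker == "user" else "[SYS]"
        pieces.append(f"{prefix} {utterance}")
    return " ".join(pieces)

dialogue = [("system", "how may i help you ?"),
            ("user", "i need a cheap restaurant in the north .")]
inputs = tokenizer(flatten_dialogue(dialogue), truncation=True, max_length=512)
```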
ToD-BERT is trained with the masked language model (MLM) objective, in which a random sample of the tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM objective is a cross-entropy loss on predicting the masked tokens. In the original BERT implementation, random masking and replacement are performed once at the beginning and kept for the duration of training; here we mask tokens dynamically during batch training. ToD-BERT is initialized from BERT, a good starting parameter set, and is then further pre-trained on the task-oriented corpora mentioned above.
We double the masking probability from 0.15 to 0.3 and gradually reduce the learning rate without a warm-up period. We optimize ToD-BERT with AdamW Loshchilov and Hutter (2017) and train with a dropout of 0.1 on all layers and attention weights. The GELU activation function Hendrycks and Gimpel (2016) is used. Models are trained with early stopping based on the perplexity of a held-out development set, with mini-batches containing 32 sequences of a maximum length of 512 tokens.
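The dynamic masking step can be sketched as follows; this is an illustrative PyTorch implementation under the setup above (masking probability 0.3, special and padding tokens never masked), not the authors' released code.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, protected_ids, p=0.3):
    """Re-sample the MLM corruption at every batch (dynamic masking).

    Returns the corrupted inputs and the MLM labels, where -100 marks
    positions that contribute no loss (the usual PyTorch convention).
    """
    labels = input_ids.clone()
    prob = torch.full(labels.shape, p)
    for tid in protected_ids:            # e.g., [CLS], [SEP], [USR], [SYS], [PAD]
        prob[input_ids == tid] = 0.0
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100               # loss is computed on masked positions only
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id    # replace selected tokens with [MASK]
    return corrupted, labels
```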
4 Downstream Tasks
We emphasize that what we care about most in this paper is whether ToD-BERT, a language model pre-trained on multiple task-oriented corpora, shows any advantage over BERT. Therefore, we avoid adding too many additional components on top of their architectures when fine-tuning on each downstream task and rely simply on the learned representations. We also always use the same architecture with the same number of parameters for a fair comparison.
We select four common task-oriented downstream tasks to evaluate our pre-trained ToD-BERT: intent classification, dialogue state tracking, dialogue act prediction, and response selection. All of them are core components in modularized task-oriented systems. We briefly introduce them below:
Intent classification
task is a multi-class classification problem, where we input a sentence $U$ and the model predicts one single intent class over $I$ possible intents:

$$P_{\mathrm{int}} = \mathrm{Softmax}(W_1 \cdot F(U)) \in \mathbb{R}^{I},$$

where $F$ is a pre-trained language model that takes a sequence of tokens as input, and we use its [CLS] embedding as the output representation. $W_1$ is a trainable linear mapping. The model is trained with cross-entropy loss between the predicted distributions and the true intent labels.
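A minimal sketch of this classification head is shown below. It assumes a HuggingFace-style encoder whose forward pass exposes `last_hidden_state`; the class and parameter names are illustrative.

```python
import torch.nn as nn

class IntentClassifier(nn.Module):
    """P_int = Softmax(W1 . F(U)): a single linear layer over [CLS]."""

    def __init__(self, encoder, hidden_size=768, num_intents=151):
        super().__init__()
        self.encoder = encoder                         # the pre-trained model F
        self.w1 = nn.Linear(hidden_size, num_intents)  # trainable mapping W1

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]   # [CLS] embedding as the sentence representation
        return self.w1(cls)  # logits; train with nn.CrossEntropyLoss
```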
Dialogue state tracking
task can be treated as a multi-class classification problem using a predefined ontology. Unlike intent classification, we input the dialogue history $X$ (a sequence of utterances, e.g., 6.9 turns on average in MWOZ) and a model predicts a value for each (domain, slot) pair at each dialogue turn. Each corresponding value $v_i^j$, the $i$-th value for the $j$-th (domain, slot) pair, is passed into the pre-trained model, and its representation is kept fixed during training. The number of slot projection layers $G_j$ is equal to the number of (domain, slot) pairs:

$$S_i^j = \mathrm{Sim}(G_j(F(X)), F(v_i^j)) \in \mathbb{R}^{1},$$

where $\mathrm{Sim}$ is the cosine similarity function, and $S^j \in \mathbb{R}^{|V^j|}$ is the probability distribution of the $j$-th (domain, slot) pair over its $|V^j|$ possible values. The model is trained with cross-entropy loss summed over all the (domain, slot) pairs.
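The per-slot scoring can be sketched as follows, assuming the candidate-value representations $F(v_i^j)$ are pre-computed and kept frozen as described; the names are illustrative.

```python
import torch
import torch.nn.functional as F_t

def slot_value_distribution(dialogue_repr, value_reprs, proj):
    """Distribution of one (domain, slot) pair over its candidate values.

    dialogue_repr: [d] encoder output F(X) for the dialogue history.
    value_reprs:   [num_values x d] frozen representations F(v_i^j).
    proj:          the slot-specific projection layer G_j (an nn.Linear).
    """
    query = proj(dialogue_repr)                              # G_j(F(X))
    sims = F_t.cosine_similarity(query.unsqueeze(0), value_reprs, dim=-1)
    return sims.softmax(dim=-1)   # S^j over the |V^j| values
```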
Dialogue act prediction
task is a multi-label classification problem, because a system response may contain multiple dialogue acts, e.g., requesting and informing at the same time. The model takes the dialogue history $X$ as input and predicts a binary result for each possible dialogue act:

$$A = \mathrm{Sigmoid}(W_2 \cdot F(X)) \in \mathbb{R}^{N},$$

where $W_2$ is a trainable linear mapping, $N$ is the number of possible dialogue acts, and each value in $A$ is between 0 and 1 after the Sigmoid layer. The model is trained with binary cross-entropy loss, and the $i$-th dialogue act is considered triggered if $A_i > 0.5$.
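A minimal sketch of this multi-label head, under the same illustrative assumptions as the intent classifier above:

```python
import torch
import torch.nn as nn

class DialogueActPredictor(nn.Module):
    """A = Sigmoid(W2 . F(X)): one binary decision per dialogue act."""

    def __init__(self, encoder, hidden_size=768, num_acts=13):
        super().__init__()
        self.encoder = encoder
        self.w2 = nn.Linear(hidden_size, num_acts)   # trainable mapping W2

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return torch.sigmoid(self.w2(cls))           # each value in (0, 1)

# Train with nn.BCELoss; at inference, act i is triggered if output[i] > 0.5.
```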
Response selection
task is a ranking problem, aiming to retrieve the most relevant system response from a candidate pool. We use a dual-encoder strategy Henderson et al. (2019b) and compute similarity scores between the source $X$ and a target $Y$:

$$r_i = \mathrm{Sim}(F(X), F(Y_i)) \in \mathbb{R}^{1},$$

where $Y_i$ is the $i$-th response candidate and $r_i$ is its cosine similarity score. We randomly sample several system responses from the corpus as negative samples. Although they may not be true negatives, this is a common way to train a ranker and evaluate its results Henderson et al. (2019a).
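One common way to realize this training objective is to reuse the other responses in a mini-batch as the sampled negatives; the sketch below assumes pre-computed source and target encodings and is illustrative rather than the exact training code.

```python
import torch
import torch.nn.functional as F_t

def response_selection_loss(src_reprs, tgt_reprs):
    """Dual-encoder ranking loss with in-batch negatives.

    src_reprs / tgt_reprs: [B x d] encodings F(X) and F(Y) of dialogue
    histories and their true responses; the other B-1 responses in the
    batch serve as (possibly noisy) negative samples.
    """
    src = F_t.normalize(src_reprs, dim=-1)
    tgt = F_t.normalize(tgt_reprs, dim=-1)
    scores = src @ tgt.t()                   # [B x B] cosine similarities r_i
    labels = torch.arange(scores.size(0))    # the diagonal holds the positives
    return F_t.cross_entropy(scores, labels)
```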
Table 2: Few-shot intent classification results on the OOS dataset (mean ± standard deviation).

| | | Acc (all) | Acc (in-scope) | Acc (out-of-scope) | Recall (out-of-scope) |
|---|---|---|---|---|---|
| 1-Shot | BERT | 40.2% ± 0.3% | 48.9% ± 0.3% | 81.8% ± 0.1% | 1.0% ± 0.1% |
| | ToD-BERT | 44.8% ± 0.2% | 54.4% ± 0.3% | 82.0% ± 0.1% | 1.4% ± 0.5% |
| 10-Shot | BERT | 75.5% ± 0.6% | 90.1% ± 0.5% | 83.5% ± 0.3% | 9.8% ± 1.6% |
| | ToD-BERT | 75.8% ± 0.2% | 90.4% ± 0.1% | 83.5% ± 0.3% | 10.0% ± 1.4% |
5 Evaluation Datasets
We pick several datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream task evaluation. The first three corpora are not included in our task-oriented pre-training datasets. For MWOZ, to be fair, we do not include its test-set dialogues during the pre-training stage. Details of each evaluation dataset are discussed below:
OOS Larson et al. (2019): The out-of-scope intent dataset is one of the largest annotated intent datasets, including 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It covers 151 intent classes over 10 domains, including 150 in-scope intents and 1 out-of-scope intent, where the out-of-scope intent means a user utterance does not fall into any of the predefined intents. Each of the intents has 100 training samples. We use this dataset to evaluate the intent classification task.
DSTC2 Henderson et al. (2014): DSTC2 is a human-machine task-oriented dataset with 1,612/506/1,117 dialogues for the train, validation, and test sets, respectively. We follow Paul et al. (2019) to map the original dialogue act labels to universal dialogue acts, resulting in 19 different system dialogue acts. We use this dataset to evaluate the dialogue act prediction and response selection tasks.
GSIM Shah et al. (2018a): GSIM is a human-rewritten machine-to-machine task-oriented corpus, including 1,500/469/1,039 dialogues for the train, validation, and test sets, respectively. We combine its two domains, movie and restaurant, into one single corpus. It was collected with the Machines Talking To Machines (M2M) approach Shah et al. (2018b), a functionality-driven process combining a dialogue self-play step and a crowdsourcing step. We map its dialogue act labels to universal dialogue acts Paul et al. (2019), resulting in 13 different system dialogue acts. We use this dataset to evaluate the dialogue act prediction and response selection tasks.
MWOZ Budzianowski et al. (2018): MWOZ is the most common benchmark for task-oriented dialogues, especially for dialogue state tracking. It has 8,420/1,000/1,000 dialogues for the train, validation, and test sets, respectively. Across seven different domains, it has in total 30 (domain, slot) pairs that need to be tracked in the test set. We use the revised version MWOZ 2.1 from Eric et al. (2019), which has the same dialogue transcripts but cleaner state label annotations. We use this dataset to evaluate the dialogue state tracking, dialogue act prediction, and response selection tasks.
6 Results

For each downstream task, we first conduct experiments using the full dataset, then we simulate few-shot settings to show the strength of our ToD-BERT. We run each few-shot experiment at least three times with different random seeds to reduce the variance of data sampling, and we report the mean and standard deviation for these limited-data scenarios.
6.1 Intent Classification
ToD-BERT outperforms BERT and other strong baselines (the numbers for FastText, SVM, CNN, and MLP are reported from Larson et al. (2019)) on one of the largest intent classification datasets, as shown in Table 2. We evaluate accuracy on all the data, on the in-scope intents only, and on the out-of-scope intent only.
ToD-BERT achieves 85.9% accuracy over the 151 intent classes, 96.1% accuracy over the 150 defined intent classes, and 89.9% accuracy and 46.3% recall on the out-of-scope intent. In addition, we conduct 1-shot and 10-shot experiments by randomly sampling one and ten utterances per intent class from the training set. To reduce the variance of data sampling, the reported numbers are averaged over five runs. ToD-BERT improves over BERT by 4.6% all-domain accuracy and 5.5% in-domain accuracy in the 1-shot setting. These results confirm our hypothesis that ToD-BERT indeed learns better representations for task-oriented dialogues.
Table 3: Few-shot dialogue state tracking results on MWOZ 2.1 (mean ± standard deviation).

| | | Joint Acc | Slot Acc |
|---|---|---|---|
| 1% Data | BERT | 6.7% ± 0.5% | 84.4% ± 0.1% |
| | ToD-BERT | 10.0% ± 0.5% | 87.0% ± 0.2% |
| 5% Data | BERT | 20.6% ± 0.7% | 92.4% ± 0.1% |
| | ToD-BERT | 27.5% ± 0.4% | 93.8% ± 0.2% |
| 10% Data | BERT | 25.2% ± 3.1% | 93.6% ± 0.4% |
| | ToD-BERT | 35.9% ± 1.1% | 95.2% ± 0.2% |
| 25% Data | BERT | 40.2% ± 0.4% | 95.8% ± 0.1% |
| | ToD-BERT | 42.8% ± 0.4% | 96.3% ± 0.1% |
Table 4: Few-shot dialogue act prediction results (mean ± standard deviation); the number of system dialogue acts per dataset is given in parentheses.

| | | MWOZ (13) micro-F1 | MWOZ (13) macro-F1 | DSTC2 (19) micro-F1 | DSTC2 (19) macro-F1 | GSIM (13) micro-F1 | GSIM (13) macro-F1 |
|---|---|---|---|---|---|---|---|
| 1% Data | BERT | 77.3% ± 2.0% | 58.3% ± 1.8% | 79.2% ± 0.4% | 16.8% ± 0.7% | 83.8% ± 3.6% | 35.3% ± 2.6% |
| | ToD-BERT | 85.8% ± 1.7% | 67.0% ± 2.8% | 82.3% ± 0.8% | 18.5% ± 0.7% | 92.2% ± 0.9% | 40.7% ± 0.7% |
| 10% Data | BERT | 89.1% ± 0.4% | 77.3% ± 0.7% | 82.6% ± 0.9% | 26.5% ± 0.6% | 97.1% ± 0.1% | 44.2% ± 0.1% |
| | ToD-BERT | 89.8% ± 0.2% | 79.1% ± 0.5% | 85.1% ± 1.7% | 29.2% ± 1.3% | 98.6% ± 0.2% | 44.9% ± 0.2% |
Table 5: Response selection results (k-to-100 accuracy, mean ± standard deviation) under the 1% data, 10% data, and full data settings.

| | | 1% Data BERT | 1% Data ToD-BERT | 10% Data BERT | 10% Data ToD-BERT | Full Data BERT | Full Data ToD-BERT |
|---|---|---|---|---|---|---|---|
| MWOZ | 1-to-100 | 9.6% ± 0.1% | 19.3% ± 0.3% | 26.7% ± 0.3% | 37.1% ± 0.4% | 59.7% ± 0.3% | 63.3% ± 0.2% |
| | 3-to-100 | 24.0% ± 0.2% | 41.2% ± 0.3% | 50.7% ± 0.2% | 62.8% ± 0.3% | 83.0% ± 0.2% | 85.0% ± 0.2% |
| DSTC2 | 1-to-100 | 75.1% ± 0.3% | 75.7% ± 0.2% | 78.8% ± 0.2% | 79.3% ± 0.3% | 79.2% ± 0.2% | 79.4% ± 0.1% |
| | 3-to-100 | 93.0% ± 0.2% | 93.8% ± 0.3% | 94.2% ± 0.1% | 94.5% ± 0.1% | 94.5% ± 0.2% | 94.7% ± 0.1% |
| GSIM | 1-to-100 | 62.3% ± 0.4% | 62.7% ± 0.1% | 78.1% ± 0% | 78.3% ± 0.1% | 78.3% ± 0.2% | 78.4% ± 0.1% |
| | 3-to-100 | 75.5% ± 0.2% | 76.1% ± 0.1% | 81.2% ± 0% | 81.3% ± 0% | 81.3% ± 0% | 81.4% ± 0% |
6.2 Dialogue State Tracking
Two evaluation metrics are commonly used in the dialogue state tracking task: joint goal accuracy and slot accuracy. Joint goal accuracy compares the predicted dialogue states to the ground truth at each dialogue turn, where the ground truth includes slot values for all possible (domain, slot) pairs. The output is considered a correct prediction if and only if all predicted values exactly match the ground truth values. Slot accuracy, on the other hand, individually compares each (domain, slot, value) triplet to its ground truth label.
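The two metrics can be computed as in the sketch below, assuming predictions and ground truths are given as per-turn dictionaries over all (domain, slot) pairs (with a "none" value for unmentioned pairs); this is an illustrative implementation.

```python
def dst_metrics(predictions, ground_truths):
    """Joint goal accuracy and slot accuracy as defined above."""
    joint, slot_correct, slot_total = 0, 0, 0
    for pred, gold in zip(predictions, ground_truths):
        joint += int(pred == gold)           # every pair must match exactly
        for pair, value in gold.items():
            slot_correct += int(pred.get(pair) == value)
            slot_total += 1
    return joint / len(ground_truths), slot_correct / slot_total
```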
In Table 3, we first compare BERT with ToD-BERT on the MWOZ dataset (version 2.1) and find that the latter achieves a 1.1% joint goal accuracy improvement. Since the original ontology provided by Budzianowski et al. (2018) is incomplete (some labeled values are not included in the ontology), we create a new ontology of all possible annotated values. We also list several well-known dialogue state trackers for reference, including DSTReader Gao et al. (2019), HyST Goel et al. (2019), TRADE Wu et al. (2019), and ZSDST Rastogi et al. (2019). ToD-BERT outperforms DSTReader, HyST, and TRADE by 11.3%, 9.6%, and 2.1% joint goal accuracy, respectively.
Any dialogue state tracker built on a pre-trained model can easily be improved with ToD-BERT. We replace the BERT used in DS-DST-picklist Zhang et al. (2019a) with our ToD-BERT and observe that the replacement gains 0.2% joint goal accuracy, from 53.2% to 53.4%. The model also achieves 58.3% validation joint goal accuracy using only 50-60% of the original training steps needed.
We also report few-shot experiments using 1%, 5%, 10%, and 25% of the data for dialogue state tracking. Each reported result is averaged over three runs. ToD-BERT outperforms BERT in all settings, which further shows the strength of task-oriented dialogue pre-training. ToD-BERT surpasses BERT by 3.3%, 7.1%, 10.7%, and 2.6% in the 1%, 5%, 10%, and 25% data settings, respectively. Note that 1% of the data corresponds to around 84 dialogues.
6.3 Dialogue Act Prediction
We conduct experiments on three different datasets and report micro-F1 and macro-F1 scores for the dialogue act prediction task, a multi-label classification problem. For the MWOZ dataset, we remove the domain information from the original system dialogue act labels; for example, “taxi-inform” is simplified to “inform”. This process reduces the number of possible dialogue acts from 31 to 13. For the DSTC2 and GSIM corpora, we follow Paul et al. (2019) to apply universal dialogue act mapping, which maps the original dialogue act labels to a general dialogue act format, resulting in 19 and 13 system dialogue acts for DSTC2 and GSIM, respectively.
We run two other baselines, MLP and RNN, to further show the strengths of BERT-based models. The MLP model simply takes bag-of-words embeddings to predict dialogue acts, and the RNN model is a bidirectional GRU network. In Table 4, one can observe that ToD-BERT consistently works better than BERT and the other baselines, regardless of the dataset or the evaluation metric.
In the few-shot experiments, we run each setting three times and report the averaged results. ToD-BERT outperforms BERT by 8.5% micro-F1 and 8.7% macro-F1 on the MWOZ corpus in the 1% data scenario. Moreover, ToD-BERT using only 10% of the data achieves 89.8% micro-F1 and 79.1% macro-F1 on MWOZ, and 98.6% micro-F1 and 44.9% macro-F1 on GSIM, which is similar to or better than BERT using the full data.
6.4 Response Selection
To evaluate response selection in task-oriented dialogues, we follow the k-to-100 accuracy, which is becoming a research community standard Yang et al. (2018); Henderson et al. (2019a). The k-of-100 ranking accuracy is a Recall@k metric indicating whether the relevant response occurs among the top-k ranked candidates. The metric is computed over random batches of 100 examples, so that the responses of the other examples in the batch serve as random negative candidates. This allows the metric to be computed efficiently across many examples in batches. While it is not guaranteed that the random negatives are “true” negatives, the 1-of-100 metric still provides a useful evaluation signal that correlates with downstream performance. We sample random batches with five different random seeds and report the averaged results.
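A sketch of the metric, assuming pre-computed, L2-normalized source and response encodings so that a dot product serves as the similarity score; the names and batching details are illustrative.

```python
import torch

def k_of_100_accuracy(src_reprs, tgt_reprs, k=1, runs=5, seed=0):
    """k-of-100 ranking accuracy over random batches of 100 examples."""
    g = torch.Generator().manual_seed(seed)
    hits, total = 0, 0
    for _ in range(runs):
        perm = torch.randperm(src_reprs.size(0), generator=g)
        for i in range(0, len(perm) - 99, 100):           # batches of exactly 100
            idx = perm[i:i + 100]
            scores = src_reprs[idx] @ tgt_reprs[idx].t()  # [100 x 100]
            ranks = scores.argsort(dim=-1, descending=True)
            # is the true (diagonal) response among the top-k candidates?
            hit = (ranks[:, :k] == torch.arange(100).unsqueeze(1)).any(dim=-1)
            hits += hit.sum().item()
            total += 100
    return hits / total
```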
In Table 5, we conduct response selection experiments on three datasets: MWOZ, DSTC2, and GSIM. ToD-BERT achieves 63.3% 1-to-100 accuracy and 85.0% 3-to-100 accuracy on MWOZ, surpassing BERT by 3.6% and 2.0%, respectively. The advantage of ToD-BERT is even more obvious in the few-shot scenario: in the 1% data setting, ToD-BERT improves 1-to-100 accuracy by around 10% and 3-to-100 accuracy by 17.2%. Although we observe a similar trend on the DSTC2 and GSIM datasets, the results are not as clear as on MWOZ. One possible reason is that neither of them is a human-human dataset, so their system responses are less diverse than in the MWOZ corpus, which makes the negative samples in each random batch noisier.¹

¹ The pre-trained ConveRT model Henderson et al. (2019a) achieves 5.2% ± 0.1% 1-to-100 accuracy and 10.4% ± 0.2% 3-to-100 accuracy. Since they only released code for model inference, we report their results without fine-tuning.
In Figure 2, we visualize the embeddings of BERT and ToD-BERT given the same input: the utterances in the MWOZ test set. Each point is an utterance representation, obtained by passing the utterance through a pre-trained model and reducing its high-dimensional features to two dimensions using t-distributed stochastic neighbor embedding (t-SNE). Since we know the true domain label of each utterance, we use different colors for different domains. As one can observe, ToD-BERT shows clearer group boundaries than BERT.
To analyze the results quantitatively, we run K-means, a common unsupervised clustering algorithm, on the output embeddings of BERT and ToD-BERT, setting K to 10 and 20. After clustering, each utterance in the MWOZ test set is assigned to a predicted class. We then compute the normalized mutual information (NMI) between the clustering result and the true domain label of each utterance. ToD-BERT consistently achieves higher NMI scores than BERT: for K=10, ToD-BERT obtains 0.143 NMI while BERT obtains only 0.094; for K=20, ToD-BERT achieves 0.213 while BERT has 0.109.
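This analysis can be reproduced with standard scikit-learn utilities; the sketch below assumes the utterance embeddings and true domain labels have already been extracted.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_nmi(utterance_embeddings, domain_labels, k=10, seed=0):
    """Cluster utterance embeddings and score agreement with domain labels.

    utterance_embeddings: [num_utterances x d] array of [CLS] outputs.
    domain_labels:        the true domain of each utterance.
    """
    predicted = KMeans(n_clusters=k, random_state=seed).fit_predict(utterance_embeddings)
    return normalized_mutual_info_score(domain_labels, predicted)
```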
7 Conclusion

We propose task-oriented dialogue BERT (ToD-BERT), trained on nine English-based, human-human, multi-turn, and publicly available task-oriented datasets across over 60 domains. ToD-BERT outperforms BERT on four dialogue downstream tasks: intent classification, dialogue state tracking, dialogue act prediction, and response selection. It also shows a clear advantage in the few-shot experiments, when limited labeled data is available. ToD-BERT is easy to deploy and will be open-sourced, allowing the NLP research community to apply or fine-tune it on any task-oriented conversational problem. Lastly, the nine task-oriented datasets we combine can be leveraged to train and test any other pre-trained architecture in the future.
- Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
- Bao et al. (2019) Siqi Bao, Huang He, Fan Wang, and Hua Wu. 2019. Plato: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.
- Budzianowski and Vulić (2019) Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s gpt-2–how can i help you? towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.
- Byrne et al. (2019) Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Andy Cedilnik, and Kyu-Young Kim. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
- Eric et al. (2019) Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyag Gao, and Dilek Hakkani-Tur. 2019. Multiwoz 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.
- Eric and Manning (2017) Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414.
- Gao et al. (2019) Shuyang Gao, Abhishek Sethi, Sanchit Aggarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. arXiv preprint arXiv:1908.01946.
- Goel et al. (2019) Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür. 2019. Hyst: A hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883.
- Henderson et al. (2019a) Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Ivan Vulić, et al. 2019a. Convert: Efficient and accurate conversational representations from transformers. arXiv preprint arXiv:1911.03688.
- Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
- Henderson et al. (2019b) Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019b. Training neural response selection for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5392–5404, Florence, Italy. Association for Computational Linguistics.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
- Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
- Kim et al. (2019) Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Adam Atkinson, Sungjin Lee, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, and Raghav Gupta. 2019. The eighth dialog system technology challenge. arXiv preprint.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Larson et al. (2019) Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. 2019. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027.
- Lee et al. (2019) Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. 2019. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges 8.
- Li et al. (2018) Xiujun Li, Sarah Panda, JJ (Jingjing) Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. In SLT 2018.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid O Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777.
- Paul et al. (2019) Shachi Paul, Rahul Goel, and Dilek Hakkani-Tür. 2019. Towards universal dialogue act tagging for task-oriented dialogues. arXiv preprint arXiv:1907.03020.
- Peng et al. (2020) Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.
- Radford et al. (a) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. a. Improving language understanding by generative pre-training.
- Radford et al. (b) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. b. Language models are unsupervised multitask learners.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Rashkin et al. (2018) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2018. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.
- Rastogi et al. (2019) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
- Shah et al. (2018a) Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. 2018a. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41–51.
- Shah et al. (2018b) Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018b. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
- Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
- Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819, Florence, Italy. Association for Computational Linguistics.
- Yang et al. (2018) Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2019a) Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S Yu, Richard Socher, and Caiming Xiong. 2019a. Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.
- Zhang et al. (2019b) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019b. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.