A Simple Language Model for Task-Oriented Dialogue

05/02/2020 ∙ by Ehsan Hosseini-Asl, et al. ∙ Salesforce

Task-oriented dialogue is often decomposed into three tasks: understanding user input, deciding actions, and generating a response. This allows for dedicated models for each sub-task, but we find a simple, unified approach leads to state-of-the-art performance across multiple settings on the MultiWOZ dataset. SimpleTOD is a simple approach to task-oriented dialogue that uses a single causal language model trained on all sub-tasks recast as a single sequence prediction problem. This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2. SimpleTOD improves over the prior state-of-the-art by 1.22 points in joint goal accuracy for dialogue state tracking. SimpleTOD also improves all three metrics used to evaluate action and response generation in the most complete setting for task-oriented dialog systems: inform rate by 8.1 points, success rate by 9.7 points, and BLEU by 23.5 points.


1 Introduction

Conversational AI has been a long-standing problem in computer science and has gained increasing attention in both academia and industry with recent advances in neural approaches (Gao et al., 2019). Open-domain dialogue systems are usually trained end-to-end on large-scale data from social media, with the main goal of making conversations with humans more natural and engaging (Adiwardana et al., 2020). Task-oriented dialogue (TOD) systems, on the other hand, are designed to accomplish a goal described by a user in natural language. Such systems are usually built with a pipeline approach, which often requires natural language understanding (NLU) for belief state tracking, dialogue management (DM) for deciding which actions to take, and natural language generation (NLG) for generating responses (Wen et al., 2016).

Traditionally, each component of a task-oriented system is trained separately with its own labels: the NLU module has domain and intent labels, while the DM module has dialogue belief and dialogue act labels. The modular dependencies of these approaches can lead to error propagation when information is bottlenecked and not provided to modules later in the pipeline (Liu and Lane, 2018). For example, many systems do not consider the entire dialogue history at every turn, but instead rely on the NLU module to pass the belief state reliably to the next stage (Zhang et al., 2019b). Moreover, modular approaches fall short of solving all the sub-tasks in a unified way.

We propose recasting task-oriented dialogue as a causal language modeling task, and we show that such a model can solve all the sub-tasks in a unified way using multi-task maximum likelihood training. This simple task-oriented dialogue (SimpleTOD) approach enables modeling of the inherent dependencies between the sub-tasks of task-oriented dialogue, and it allows us to optimize all the tasks in an end-to-end manner. SimpleTOD also opens the path towards fully leveraging large language models such as GPT-2 (Radford et al., 2019) for task-oriented dialogue. The success of SimpleTOD demonstrates a strong connection between the implicit language understanding in the open domain required of high-quality causal language models and the kind of understanding required for a full task-oriented dialogue system.

Evaluation results demonstrate the advantages of SimpleTOD. We achieve 56.45 joint goal accuracy on MultiWOZ, which surpasses all prior work on the dialogue state tracking (i.e., belief state tracking) sub-task. We also show large improvements in the combined scores for action and response generation. In the setting closest to testing a full task-oriented dialogue system, in which belief states and action decisions are generated rather than retrieved from an oracle, SimpleTOD surpasses prior work on each individual action and response generation metric (+8.1 inform rate, +9.7 success rate, +23.5 BLEU).

Figure 1: SimpleTOD is a simple approach to task-oriented dialogue that uses a single causal language model to generate all outputs given the dialogue context and database search results. The delexicalized response can then be lexicalized into a human-readable response using information from the belief state and DB search results.

2 Related Work

Task-Oriented Dialogue

Much work on task-oriented dialogue focuses on a specific module and evaluates only that module. These components include understanding user intent via intent detection modules (Liu and Lane, 2016), tracking the constraints imposed by the user via dialogue state tracking modules (Henderson et al., 2013; Mrkšić et al., 2017; Rastogi et al., 2017; Nouri and Hosseini-Asl, 2018; Wu et al., 2019a; Zhang et al., 2019a; Zhou and Small, 2019; Chen et al., 2020), determining system actions via dialogue policy modules (Wen et al., 2017), and generating responses via dedicated response generation modules (Wen et al., 2015).

Some recent work has started to bridge multiple sub-tasks by connecting modules and evaluating in settings that rely partially on generated results handed off from one module to another. Chen et al. (2019) proposed joint action-response generation using oracle dialogue states. Peng et al. (2020), on the other hand, used GPT-2 to learn a response generator conditioned on oracle dialogue acts. Hence, neither of these works evaluates dialogue state tracking.

Dependencies between these independently optimized modules make such pipeline approaches vulnerable to natural error propagation across different components (Liu et al., 2018). Recent approaches have consequently shifted towards more end-to-end solutions allowing for more flexible design and architectures.

Towards End-to-End Task-Oriented Dialogue

End-to-end approaches aim to reduce human effort and task-specific design. Several works have used both dialogue history and knowledge bases as input and optimized neural encoder-decoder models to directly generate or retrieve the next system response without any modularized supervision (Eric and Manning, 2017; Zhao et al., 2017; Madotto et al., 2018; Wu et al., 2019b, c). The lines are blurry once systems are mostly end-to-end but still call out to additional APIs or skip intermediate tasks like dialogue state tracking altogether (Bordes et al., 2017).

Others have incorporated additional supervision from human annotations and trained systems in multi-task settings. For example, Lei et al. (2018) and Shu et al. (2018) incorporated dialogue state tracking and jointly trained it with response generation using a sequence-to-sequence approach. Liu et al. (2018) proposed a hybrid imitation and reinforcement learning method that jointly learns dialogue policy and response generation. Wen et al. (2016) and Liang et al. (2019) trained language understanding, dialogue state tracking, and dialogue policy modules with a shared dialogue encoder.

Many other works fall somewhere in between, either combining some tasks and not others or training some modules jointly and others separately. Neelakantan et al. (2019) modeled system action and response generation jointly, incorporating latent knowledge reasoning through attention without using belief states. Zhao et al. (2019) proposed to model system actions as latent variables, inducing a latent action space from data through different optimization methods based on variational inference. Zhang et al. (2019b) proposed a domain-aware multi-decoder model and augmented dialogue data to model the one-to-many dialogue property, achieving the state-of-the-art combined score for dialogue management and response generation on the MultiWOZ dataset.

Although all these approaches have come closer to unifying the stack in different ways, none are as simple as SimpleTOD: treating all of task-oriented dialogue as a single sequence prediction problem, using a single model, trained with a single, joint, multi-task loss.

Unsupervised pre-training for natural language processing

Pre-training approaches for natural language processing focus on transferable representations for contextualized word vectors (McCann et al., 2017; Peters et al., 2018), generative models (Radford et al., 2019; Keskar et al., 2019), or a combination of both (Dong et al., 2019; Yang et al., 2019). Variants of pre-trained, bidirectional Transformers like BERT (Devlin et al., 2018) are usually evaluated on classification tasks such as those in the GLUE benchmark (Wang et al., 2018) or span-based question answering tasks (Rajpurkar et al., 2016). Unidirectional (causal) pre-trained language models like GPT-2 (Radford et al., 2019) or CTRL (Keskar et al., 2019) resemble the decoder from the original Transformer architecture (Vaswani et al., 2017) and aim to learn a strong distribution for next-word prediction, which makes them particularly useful for tasks that require text generation. In dialogue, Zhang et al. (2019c) built on GPT-2 by further pre-training it on Reddit data for open-domain response generation. Henderson et al. (2019) pre-trained a dual Transformer encoder for the response selection task on large-scale Reddit data. Bao et al. (2019) used both Twitter and Reddit data to pre-train a Transformer model with discrete latent variables. Wu et al. (2020) pre-trained a BERT architecture with response selection using multiple task-oriented corpora.

3 Methods

This section describes task-oriented dialogue, how we frame it to better accommodate our simple approach, the architecture we use, training details, dataset details, and evaluation metrics.

3.1 Task-Oriented Dialogue

Systems that aim to perform task-oriented dialogue (TOD) are evaluated on three sub-tasks: dialogue state (belief state) tracking, action generation, and response generation. This decomposition has made it possible to create dedicated models for each sub-task, which is the dominant approach to TOD. By contrast, we explore the possibility of using a single-model approach, SimpleTOD, to replace what is often a multi-model system.

A dialogue consists of multiple turns. In turn t, the user provides input U_t and the system generates a response S_t. To generate a response, SimpleTOD reads all previous turns as context, C_t = [U_0, S_0, ..., U_t]. With this context, it generates a belief state B_t, which is a list of triplets recording values for slots in a particular domain: (domain, slot_name, value). This belief state is used to query a database for information. The database search returns rows from the database that satisfy the conditions of the belief state. While the rows returned can later be used to lexicalize the response (filling in generated placeholders), SimpleTOD only takes as input the aggregated database search results D_t, which convey how many rows were returned as matches and whether any of them have an available booking status. SimpleTOD then conditions on C_t, B_t, and D_t to decide which actions to take, A_t. These actions are generated as another list of triplets: (domain, action_type, slot_name). Finally, a delexicalized response S_t is generated conditioned on all prior information. See Table 1 for a schematic overview of the model inputs, which are concatenated together as a single sequence when training SimpleTOD.

Context [context] [user] user input [system] system response [user] user input [endofcontext]
Belief State [belief] domain slot_name value, domain slot_name value, [endofbelief]
DB Search [db] #_matches, booking_status [endofdb]
Action [action] domain action_type slot_name, domain action_type slot_name, [endofaction]
Response [response] system delexicalized response [endofresponse]
Table 1: A schematic representation of the different components of inputs/outputs in task-oriented dialogue. When training SimpleTOD, these are concatenated together into a single sequence.
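For illustration, the Python sketch below (not the authors' released code; the helper name and exact spacing are illustrative) assembles one such training sequence from its components, using the delimiter tokens in the form that appears in the dialogue examples of Tables 5, 8, and 9.

def serialize_turn(context_turns, belief, db, action, response):
    """Concatenate one dialogue turn into a single training sequence,
    following the schematic layout of Table 1."""
    ctx_parts = []
    for user_utt, system_utt in context_turns:
        ctx_parts.append(f"<|user|> {user_utt}")
        if system_utt is not None:  # the final turn has no system response yet
            ctx_parts.append(f"<|system|> {system_utt}")
    belief_str = ", ".join(f"{d} {slot} {val}" for d, slot, val in belief)
    action_str = ", ".join(f"{d} {act} {slot}" for d, act, slot in action)
    return (
        f"<|context|> {' '.join(ctx_parts)} <|endofcontext|> "
        f"<|belief|> {belief_str} <|endofbelief|> "
        f"<|db|> {db} <|endofdb|> "
        f"<|action|> {action_str} <|endofaction|> "
        f"<|response|> {response} <|endofresponse|>"
    )

# Hypothetical single-turn example:
seq = serialize_turn(
    context_turns=[("i need a cheap restaurant in the centre .", None)],
    belief=[("restaurant", "pricerange", "cheap"), ("restaurant", "area", "centre")],
    db="4 matches, booking available",
    action=[("restaurant", "recommend", "name")],
    response="[restaurant_name] is a nice [value_pricerange] restaurant in the [value_area] .",
)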

A single training sequence consists of the concatenation x_t = [C_t; B_t; D_t; A_t; S_t], which allows modeling the joint probability p(x_t) over the sequence. This joint probability decomposes into conditional probabilities, p(B_t | C_t), p(A_t | C_t, B_t, D_t), and p(S_t | C_t, B_t, D_t, A_t), that are independently modeled by some modular approaches.[1]

[1] Systems typically do not attempt to model p(D_t | C_t, B_t), as this information is queried from the database during inference. We find that including it simplifies modeling, but see Sec. 4 for experimental results questioning the necessity of the DB Search results for SimpleTOD applied to MultiWOZ.

When combined with information from the belief state and database search results, the response can be lexicalized to recover human readable response text.
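A minimal sketch of that lexicalization step follows (a hypothetical helper; the actual pipeline follows the MultiWOZ pre/post-processing scripts):

def lexicalize(delex_response, belief, db_row):
    """Fill delexicalized placeholders using the generated belief state
    and one matching database row."""
    filled = delex_response
    # Placeholders of the form [value_<slot>] are filled from belief-state values.
    for _domain, slot, value in belief:
        filled = filled.replace(f"[value_{slot}]", value)
    # Entity-specific placeholders such as [restaurant_name] come from the DB row.
    for key, value in db_row.items():
        filled = filled.replace(f"[{key}]", value)
    return filled

lexicalize(
    "[restaurant_name] is a nice [value_pricerange] restaurant in the [value_area] .",
    belief=[("restaurant", "pricerange", "cheap"), ("restaurant", "area", "centre")],
    db_row={"restaurant_name": "charlie chan"},
)
# -> 'charlie chan is a nice cheap restaurant in the centre .'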

3.2 Causal Language Modeling

Considering these simply as sequences of tokens, we can use a causal language model to model the joint probability of such sequences directly. Given example sequences of the form x = (x_1, ..., x_n), where each x_i comes from a fixed set of symbols, the goal of language modeling is to learn p(x). Because x is a sequence, it is natural to factorize this distribution using the chain rule of probability (Bengio et al., 2003):

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}) \qquad (1)$$

This decomposes language modeling into next-word prediction. Current state-of-the-art methods (Dai et al., 2019; Radford et al., 2019) train a neural network with parameters θ to minimize the negative log-likelihood over a dataset D = {x^1, ..., x^|D|}, where sequence x^k has length n^k:

$$\mathcal{L}(D) = -\sum_{k=1}^{|D|} \sum_{i=1}^{n^k} \log p_\theta\!\left(x_i^k \mid x_{<i}^k\right) \qquad (2)$$
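In code, this objective is ordinary token-level cross-entropy with targets shifted by one position. A minimal PyTorch sketch, under the assumption that the model maps token ids directly to next-token logits:

import torch
import torch.nn.functional as F

def nll_loss(model, token_ids):
    """Negative log-likelihood of a batch of token sequences (Eq. 2).

    token_ids: LongTensor of shape (batch, seq_len).
    model(token_ids) is assumed to return logits of shape
    (batch, seq_len, vocab_size), one distribution per position.
    """
    logits = model(token_ids)              # unnormalized p_theta(x_i | x_<i)
    shift_logits = logits[:, :-1, :]       # predictions for positions 1..n-1
    shift_targets = token_ids[:, 1:]       # the tokens they should predict
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )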

3.3 Architecture

We train a variant of the Transformer (Vaswani et al., 2017) to learn these conditional distributions. A sequence of n tokens is embedded as a sequence of n corresponding vectors in R^d. Each vector is the sum of a learned token embedding and a sinusoidal positional embedding as in the original Transformer architecture. This sequence of vectors is stacked into a matrix X_0 ∈ R^{n×d} so that it can be processed by the attention layers. The i-th layer consists of two blocks, each of which preserves the model dimension d.

The core of the first block is multi-head attention with k heads that uses a causal mask to preclude attending to future tokens:

$$\mathrm{Attention}(X, Y, Z) = \mathrm{softmax}\!\left(\frac{\mathrm{mask}(XY^\top)}{\sqrt{d}}\right) Z$$
$$\mathrm{MultiHead}(X, k) = [h_1; \dots; h_k] W_o, \quad h_j = \mathrm{Attention}(X W_j^1, X W_j^2, X W_j^3)$$

The core of the second block is a feedforward network with ReLU activation that projects inputs to an inner dimension f, with parameters U ∈ R^{d×f} and V ∈ R^{f×d}:

$$FF(X) = \max(0, XU)V$$

Each block precedes core functionality with layer normalization (Ba et al., 2016; Child et al., 2019) and follows it with a residual connection (He et al., 2016). Together, they yield X_{i+1}:

Block 1: $\bar{X}_i = \mathrm{LayerNorm}(X_i)$, $H_i = \mathrm{MultiHead}(\bar{X}_i) + \bar{X}_i$
Block 2: $\bar{H}_i = \mathrm{LayerNorm}(H_i)$, $X_{i+1} = FF(\bar{H}_i) + \bar{H}_i$

Scores are then computed from the output of the last layer:

$$\mathrm{Scores}(X_0) = \mathrm{LayerNorm}(X_l) W_{vocab}$$

During training, these scores are the inputs of a cross-entropy loss function. During generation, the scores corresponding to the final token are normalized with a softmax, yielding a distribution for sampling a new token.
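The following PyTorch sketch shows one such pre-norm block following the equations above; it is a simplified illustration with GPT-2-style default sizes assumed, not the exact released implementation.

import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """One decoder block: layer norm, causal multi-head attention with a
    residual connection, then layer norm, a ReLU feedforward network, and
    another residual connection."""

    def __init__(self, d_model=768, n_heads=12, d_inner=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_inner), nn.ReLU(), nn.Linear(d_inner, d_model)
        )

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        h = self.attn(h, h, h, attn_mask=mask, need_weights=False)[0] + h
        g = self.ln2(h)
        return self.ff(g) + g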

3.4 Training Details

The input to the model is tokenized with BPE codes (Sennrich et al., 2016). We use the pretrained BPE codes associated with DistilGPT2 (Sanh et al., 2019), a distilled version of the original pretrained GPT-2 model (Radford et al., 2019). All results reported for SimpleTOD use this model unless otherwise stated. The model consists of 6 self-attention layers with an embedding size of 768, 12 heads, and 1024 positions. For sequences longer than 1024 tokens, the dialogue context is truncated so that the combined input length never exceeds 1024.
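A sketch of this setup with the HuggingFace transformers library is given below; it is our guess at the recipe, and the exact special tokens and truncation policy in the released code may differ.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

# Delimiter tokens (as in Table 1 / Table 5), added so BPE never splits them.
special_tokens = [
    "<|context|>", "<|endofcontext|>", "<|user|>", "<|system|>",
    "<|belief|>", "<|endofbelief|>", "<|db|>", "<|endofdb|>",
    "<|action|>", "<|endofaction|>", "<|response|>", "<|endofresponse|>",
]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))

# Keep the most recent tokens so the combined input never exceeds 1024 positions.
training_sequence = "<|context|> <|user|> i need a cheap restaurant ... <|endofresponse|>"
input_ids = tokenizer.encode(training_sequence)[-1024:]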

3.5 Dataset Details

We evaluate SimpleTOD on the Multi-domain Wizard-of-Oz (MultiWOZ) dataset (Budzianowski et al., 2018), a large-scale, multi-domain dialogue dataset of human-human conversations. The dataset contains 8538 multi-turn dialogues with 13.68 turns on average, spanning seven domains (restaurant, train, attraction, hotel, taxi, hospital, police). The police and hospital domains are excluded from evaluation, since they do not have valid/test splits. This leaves 30 domain-slot pairs across the remaining five domains with 4,500 possible values. SimpleTOD is trained on delexicalized system responses following the pre-processing described in Budzianowski et al. (2018). Recently, Eric et al. (2019) proposed an improved MultiWOZ 2.1 by removing noisy values from the dialogue state (belief state) tracking annotations. For dialogue state (belief state) tracking evaluation, we use version 2.1 only, in order to compare to recent state-of-the-art methods. For action and response generation, we report results on both versions. To the best of our knowledge, all prior work has evaluated on version 2.0, so we include those results for direct comparison, but we also include results on version 2.1 so that future systems can compare to SimpleTOD on the improved version of the dataset.

3.6 Evaluation Details

We follow the original MultiWOZ guidance (Budzianowski et al., 2018) for all individual metrics and follow Mehri et al. (2019) for the combined score. Joint goal accuracy is used to evaluate the performance of dialogue state (belief state) tracking. It measures the accuracy of the generated belief states against the oracle belief states; model outputs are counted as correct only when all predicted values exactly match the oracle values. Action and response generation uses three metrics. The first two, inform and success rates, are designed to capture how well the task was completed: inform rate measures how often the entities provided by the system are correct, and success rate measures how often the system answers all of the attributes requested by the user. BLEU score (Papineni et al., 2002) is used to measure the fluency of the generated responses. The combined score for action and response generation is computed as (Inform + Success) × 0.5 + BLEU.
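For reference, these two scalar computations can be sketched as follows (our paraphrase, not the official MultiWOZ evaluation scripts):

def joint_goal_accuracy(predicted_beliefs, oracle_beliefs):
    """Fraction of turns where the predicted belief state matches the oracle
    exactly (all (domain, slot, value) triplets, order ignored)."""
    correct = sum(
        set(pred) == set(gold)
        for pred, gold in zip(predicted_beliefs, oracle_beliefs)
    )
    return correct / len(oracle_beliefs)

def combined_score(inform, success, bleu):
    """Combined metric for action/response generation (Mehri et al., 2019)."""
    return 0.5 * (inform + success) + bleu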

4 Experimental Results and Discussion

SimpleTOD is a Unified System for Task-Oriented Dialogue

SimpleTOD is, to the best of our knowledge, the first system that achieves state-of-the-art results on all the standard metrics for task-oriented dialogue systems trained on MultiWOZ. For each sub-task, we compare to other recent methods that report results on it. For dialogue state tracking, we compare to TRADE (Wu et al., 2019a), DSTQA (Zhou and Small, 2019), DST-Picklist (Zhang et al., 2019a), and SST (Chen et al., 2020). For action and response generation, we compare across a variety of settings against HDSA (Chen et al., 2019), ARDM (Wu et al., 2019c), LaRL (Zhao et al., 2019), PARG (Gao et al., 2020), and DAMD (Zhang et al., 2019b).

4.1 Dialogue State Tracking

This section reviews the performance of SimpleTOD on belief state tracking and compares it to recent state-of-the-art methods. Table 2 compares joint goal accuracy to previous models. All previous models use a bidirectional encoder to learn a better representation of the dialogue context, whereas SimpleTOD uses a unidirectional (causal) decoder with no additional bidirectional encoder and no extra supervision from structured graphs. It nonetheless achieves state-of-the-art joint goal accuracy.

Model Decoder Context Encoder Extra Supervision Joint Accuracy
TRADE Generative + Classifier Bidirectional - 45.6
DSTQA Classifier Bidirectional knowledge graph 51.17
DST-Picklist Classifier Bidirectional - 53.3
SST Generative Bidirectional schema graph 55.23
SimpleTOD (ours) Generative Unidirectional - 56.45
Table 2: Evaluation of Dialogue State Tracking (DST) on MultiWOZ 2.1 using joint accuracy metric.

4.2 Action and Response Generation

This section demonstrates the effectiveness of SimpleTOD for action and response generation, especially in the most realistic, fully end-to-end setting, when it must generate its own belief states, actions, and responses.

We report results in Table 3 for three different settings regularly employed in the literature. These settings are determined by how much oracle information is used. The first setting uses oracle belief states and oracle actions. The second uses oracle belief states, but requires the system to generate its own actions. The third requires the system to generate its own belief states and its own actions.

Note that all prior work uses oracle DB Search results as supervision during training and as input during inference in all three of these settings. We include directly comparable experiments using oracle DB Search results for all settings. We also include experiments that completely ignore the DB Search results in all settings, to show the surprising effectiveness of SimpleTOD without them. Finally, in the third setting, we attempt to compute dynamic DB Search results to the greatest extent possible.[2] Even without oracle DB Search results, and with approximate, dynamic DB Search results, SimpleTOD performs with surprising effectiveness.

[2] The annotation of the MultiWOZ dataset precludes dynamically computing whether a booking is available in some domains. In the dynamic setting, we ignore booking status altogether during training and inference. However, we train with the number of matched entries and compute this number dynamically at inference when using generated belief states.

Oracle Belief State, Oracle Actions

SimpleTOD outperforms other methods in this setting according to BLEU and the combined metric. DAMD is superior on individual inform and success rates by 1.4 to 3.1 points, but response generation is low enough that SimpleTOD reaches a higher combined score. Perhaps surprisingly, completely ignoring the DB Search results led to lower inform, higher success, and higher BLEU for the best combined score in this setting.

Oracle Belief State, Generated Actions

SimpleTOD once again outperforms other methods according to BLEU and the combined metric. The gap between the best SimpleTOD success rate and the best overall success rate is 6.4 points, much larger than in the previous setting. The best SimpleTOD inform rate only trails the best overall by 0.3 points. This suggests that action generation might be the weakest function of SimpleTOD. Again, even without the oracle DB Search results used by other methods, the combined score remains high, though in this setting using oracle DB Search results did provide an improvement.

Generated Belief State, Generated Actions

To the best of our knowledge, DAMD (Zhang et al. (2019b)) is the only prior work that has evaluated with generated belief states from dialogue state tracking during inference. They still use the oracle DB Search information for action and response generation. We found in additional ablation experiments that we could increase scores for individual metrics like inform rate and success rate by training three separate SimpleTOD language models: one for dialogue state tracking, one for action generation, and one for response generation. However, the combined scores remained nearly identical to the full end-to-end, single model approach. For example, separating the models might improve inform rate, but hurt response generation measured by BLEU. Regardless, in this most realistic setting SimpleTOD achieves state-of-the-art on each individual metric as well.

Regarding Oracle DB Search Results

In the case where we dynamically compute partial DB Search results (number of entries matched only), the results are actually lower than when ignoring them entirely. Perhaps surprisingly, the best result in this setting (as in the oracle-oracle setting above) comes from using SimpleTOD without any DB Search results at all. We have found that in some cases the generated belief states conflict with the information in the database. For example, there can be discrepancies in restaurant names: ‘pizza hut fenditton’ in the target belief states but ‘pizza hut fen ditton’ in the database. When dynamically computing the number of matches, this can lead to incorrect DB Search results. Since we use the same scripts to generate the dynamic number of matches as the original code released with MultiWOZ,[3] we speculate that similar issues exist in the annotated oracle DB Search results as well. We are still investigating this issue, as the oracle DB Search results appeared to help in one of the three settings.

[3] https://github.com/budzianowski/multiwoz
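To make the dynamic setting concrete, a hypothetical match-counting helper is sketched below (the real counts come from the original MultiWOZ database code); exact string comparison is what makes it brittle to surface-form mismatches like the one above.

def num_matches(db_entries, belief, domain):
    """Count database entries consistent with the generated belief state for
    one domain. Exact string comparison makes the count brittle to small
    surface-form differences in generated values."""
    constraints = {slot: value for d, slot, value in belief if d == domain}
    return sum(
        all(entry.get(slot) == value for slot, value in constraints.items())
        for entry in db_entries
    )

restaurants = [{"name": "pizza hut fen ditton", "area": "east", "pricerange": "moderate"}]
belief = [("restaurant", "name", "pizza hut fenditton")]
num_matches(restaurants, belief, "restaurant")  # 0, although the intended entity exists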

Model Belief State DB Search Action Inform Success BLEU Combined
DAMD oracle oracle oracle 95.4 87.2 27.3 118.5
PARG oracle oracle oracle 91.1 78.9 18.8 103.8
SimpleTOD (ours) oracle oracle oracle 93.4 83.2 53.14 141.44
SimpleTOD (ours) oracle - oracle 92.3 85.8 55.2 144.25
HDSA oracle oracle generated 82.9 68.9 23.6 99.5
DAMD oracle oracle generated 89.2 77.9 18.6 102.5
ARDM oracle oracle - 87.4 72.8 20.6 100.7
LaRL oracle oracle generated 82.78 79.2 12.8 93.79
SimpleTOD (ours) oracle oracle generated 84 72.8 42.15 120.55
SimpleTOD (ours) oracle - generated 88.9 67.1 35.8 113.8
DAMD generated oracle generated 76.3 60.4 18.6 86.95
SimpleTOD (ours) generated oracle generated 78.1 63.4 40.92 111.67
SimpleTOD (ours) generated dynamic generated 81.4 66.8 40.32 114.42
SimpleTOD (ours) generated - generated 84.4 70.1 42.1 119.3
Table 3: Action and response generation on MultiWOZ 2.0 reveals that SimpleTOD, a single causal language model, is sufficient to surpass all prior approaches according to combined score, regardless of how much oracle information is used. SimpleTOD is strongest in the most realistic setting: when systems must generate their own belief states and actions. Note: dynamic refers to computing the number of DB matches from the generated belief state. See Section 4.2 for details.
Belief State DB Search Action Inform Success BLEU Combined
oracle oracle oracle 92.8 84.5 52.2 140.85
oracle - oracle 92.6 86.1 54.15 143.5
oracle oracle generated 85.1 73.5 41.97 121.27
oracle - generated 89.6 68.6 34.66 113.76
generated oracle generated 79.3 65.4 41.03 113.38
generated dynamic generated 83.4 67.1 41.15 116.4
generated - generated 85 70.5 41.61 119.36
Table 4: SimpleTOD results for action and response generation on MultiWOZ 2.1. Note: dynamic refers to computing the number of DB matches from the generated belief state. See Section 4.2 for details.

MultiWOZ 2.1 Results

The evaluation results on MultiWOZ 2.1, shown in Table 4, follow the same patterns as discussed above. We provide these results for future comparisons on the improved version of the dataset.

5 Analysis and Further Discussion

Understanding Long, Multi-turn Dialogues

To further investigate the performance of SimpleTOD, Table 5 shows an example of SimpleTOD performing dialogue state tracking for a many-turn conversation (long context). In our qualitative analysis of generated results, SimpleTOD, using only a causal language model, regularly generates correct belief states for the full context.

Context <|context|> <|user|> i am looking for a pool somewhere in the south of cambridge . <|system|> i am sorry , but it does not look like we have a pool that matches your criteria . <|user|> how about some type of theater ? <|system|> the junction theatre is in the south area . would you like the address ? <|user|> yes , please provide the address and postcode as well . <|system|> the junction theatre is located on clifton way at postcode cb17gx . is there anything else i can assist you with ? <|user|> thanks . yes , i also need to find a guesthouse for my stay there . <|system|> are you needing the guesthouse in the south , also ? do you have a preference in price range ? <|user|> the area should be the centre . i need it to be cheap and have free parking . <|system|> i have 2 guesthouses that fit your criteria . 1 is the alexander bed and breakfast and the other is el shaddai . <|user|> what is the star rating -s for them ? <|system|> the alexander bed and breakfast has a 4 star rating while the el shaddai has 0 stars . would you like to book a room with either of these ? <|user|> ok great , thank you . i also need to book a taxi to get back and forth . <|system|> where did you wan to leave from and go to ? <|user|> i’ll be leaving the theatre at 14:45 and going to the guesthouse . <|system|>

let s back up a moment . which lodgings did you prefer , so we know where the taxi will be taking you ?

<|user|> let s go with the alexander bed and breakfast . <|system|> your booking is complete , a grey volkswagen will be picking you up , contact number 07753429052 . can i help with anything else ? <|user|> cool beans ! sorry i was so much trouble . thanks so much for all of your help , you need a raise dealing with customers like me ! <|endofcontext|>
Target <|belief|> attraction area south, attraction name the junction, attraction type theatre, hotel area centre, hotel name alexander bed and breakfast, hotel parking yes, hotel pricerange cheap, hotel type guesthouse, taxi departure the junction, taxi destination alexander bed and breakfast, taxi leaveat 14:45 <|endofbelief|> <|action|> general bye none <|endofaction|> <|response|> it was no trouble at all ! i hope you have a fantastic trip and enjoy your stay ! <|endofresponse|>
LM generation <|belief|> attraction area south, attraction name the junction, attraction type theatre, hotel area centre, hotel name alexander bed and breakfast, hotel parking yes, hotel pricerange cheap, hotel type guesthouse, taxi departure the junction, taxi destination alexander bed and breakfast, taxi leaveat 14:45 <|endofbelief|> <|action|> general bye none <|endofaction|> <|response|> thank you for using our service <|endofresponse|>
Table 5: Example of multi-domain dialogue state tracking with long context, where the dialogue spans three domains (attraction, hotel, and taxi); dialogue ’MUL1015’.

Decoding Method

Initialized from pre-trained weights, SimpleTOD does not need to employ advanced, more costly decoding strategies such as beam search, diverse beam search, or top-k sampling, unlike HDSA (Chen et al., 2019) and DAMD (Zhang et al., 2019b). Our results are reported with simple greedy decoding. In initial experiments we also tried nucleus sampling (Holtzman et al., 2019), but we found that it degraded performance, and we greedily chose to pursue greedy decoding afterward. This relates to the observations in Keskar et al. (2019) around controllable generation: when precision is required, sampling from the distribution is inherently less reliable than greedy decoding.
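A minimal greedy decoding loop is sketched below, assuming a HuggingFace-style model whose output exposes .logits; calling the library's generate method with do_sample=False would behave equivalently.

import torch

@torch.no_grad()
def greedy_decode(model, tokenizer, prompt_ids, end_token="<|endofresponse|>", max_new=200):
    """Generate tokens one at a time, always taking the most likely next
    token, until the end-of-response delimiter is produced."""
    end_id = tokenizer.convert_tokens_to_ids(end_token)
    ids = list(prompt_ids)
    for _ in range(max_new):
        # Recomputing logits over the full prefix each step is inefficient
        # but keeps the sketch simple.
        logits = model(torch.tensor([ids])).logits   # (1, len, vocab)
        next_id = int(logits[0, -1].argmax())        # greedy choice
        ids.append(next_id)
        if next_id == end_id:
            break
    return tokenizer.decode(ids[len(prompt_ids):])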

End token User/System token Joint Acc Inform Success BLEU Combined
No No 17.68 33.8 10.6 24.5 46.7
Yes No 22.32 54.5 41.2 39.46 87.31
No Yes 23.25 61.9 52.7 39.38 96.68
Yes Yes 56.45 85 70.5 41.61 119.36
Table 6: Ablations on MultiWOZ 2.1 comparing different ways to represent TOD as a single sequence. Performance on all metrics drops without <endof(segment)> and <user/system> tokens.

The Role of Special Tokens

Table 6 shows results of evaluating SimpleTOD with different special tokens used to identify sub-tasks and components of the input. We were surprised that end tokens such as <|endofaction|> improved performance, but analysis revealed that without them, SimpleTOD tended to generate much longer belief state, action, and response sequences. Clearly differentiating user and system text is even more important for SimpleTOD.

Layers Pretrained Joint Acc Inform Success BLEU Combined
6 Random 17.98 63.5 49.6 18.84 75.39
6 DistilGPT2 56.45 85 70.5 41.61 119.36
12 Random 20.52 58.7 37.4 29.77 77.82
12 GPT2 54.13 88 61.7 39.69 114.54
Table 7: Ablations on MultiWOZ 2.1 comparing the importance of pretraining. Recasting as single sequence prediction enables fully leveraging pre-trained models for the language understanding they have gathered in an open-domain setting.

SimpleTOD and Pre-training

Table 7 highlights the importance of initializing SimpleTOD with pre-trained weights before fine-tuning. Without such pre-training, recasting as a single sequence prediction problem alone would trail behind prior methods. Recasting as single sequence prediction simply enables fully leveraging such pre-trained models for the language understanding they have gathered in an open-domain setting.

Full Dialogue Examples

In Table 8 and Table 9 we provide several turns of a full dialogue, with delexicalized and lexicalized outputs respectively. These are from our strongest SimpleTOD model in the setting that requires it to generate all outputs, which means that it ignores all DB Search results.

Turn 1 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|endofcontext|>
Target <|belief|> train destination ely <|endofbelief|> <|action|> train inform choice, train inform day, train inform dest, train request arrive <|endofaction|> <|response|> we have [value_count] trains traveling to [value_place] on [value_day] . what time would you like to arrive ? <|endofresponse|>
LM generation <|belief|> train day sunday, train destination ely <|endofbelief|> <|action|> train inform choice, train inform day, train inform leave, train inform depart, train inform dest, train inform arrive, train inform id, train offerbook none <|endofaction|> <|response|> there are [value_count] trains leaving [value_place] on [value_day]. [train_id] leaves [value_place] at [value_time] and arrives in [value_place] at [value_time]. would you like to book a ticket? <|endofresponse|>
Turn 2 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|endofcontext|>
Target <|belief|> train destination ely, train day sunday, train arriveby 13:00, train departure cambridge <|endofbelief|> <|action|> train inform arrive, train inform day, train inform dest, train inform id, train offerbook none <|endofaction|> <|response|> [train_id] will get you to [value_place] by [value_time] on [value_day] . should i book you a ticket ? <|endofresponse|>
LM generation <|belief|> train day sunday, train destination ely, train departure cambridge, train arriveby 13:00 <|endofbelief|> <|action|> train inform arrive, train inform leave, train inform id, train offerbook none <|endofaction|> <|response|> [train_id] leaves at [value_time] and arrives at [value_time]. would you like me to book it for you? <|endofresponse|>
Turn 3 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|endofcontext|>
Target <|belief|> train destination ely, train day sunday, train arriveby 13:00, train departure cambridge, train book people 8 <|endofbelief|> <|action|> train offerbooked ref, train offerbooked ticket <|endofaction|> <|response|> booking was successful , the total fee is [value_price] gbp payable at the station . reference number is [train_reference] . <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general reqmore none, train offerbooked ref, train offerbooked ticket <|endofaction|> <|response|> booking was successful, the total fee is [value_price] gbp payable at the station. reference number is [train_reference]. is there anything else i can help you with? <|endofresponse|>
Turn 4 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|system|> booking was successful , the total fee is 28.16 gbp payable at the station . reference number is : hdx8ix63 . <|user|> i am also looking for a museum . can you recommend 1 ? <|endofcontext|>
Target <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> attraction recommend name, general greet none <|endofaction|> <|response|> how about [attraction_name] ? it looks lovely <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> attraction recommend name, attraction recommend area, attraction recommend fee, attraction recommend addr, attraction recommend phone <|endofaction|> <|response|> i would recommend the [attraction_name] in the [value_area]. it is located at [attraction_address]. the entrance fee is [value_count] pounds. the phone number is [attraction_phone] <|endofresponse|>
Turn 5 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|system|> booking was successful , the total fee is 28.16 gbp payable at the station . reference number is : hdx8ix63 . <|user|> i am also looking for a museum . can you recommend 1 ? <|system|> how about cafe jello gallery ? it looks lovely <|user|> that will work . could you get me the phone number and area its in ? <|endofcontext|>
Target <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general greet none, general reqmore none, attraction inform phone, attraction inform area <|endofaction|> <|response|> sure ! it s in the [value_area] area of town , and the phone number is [attraction_phone] . can i help you with anything else ? <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> attraction inform phone, attraction inform area, attraction inform name<|endofaction|> <|response|> [attraction_name] is in the [value_area] area and the phone number is [attraction_phone] <|endofresponse|>
Turn 6 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|system|> booking was successful , the total fee is 28.16 gbp payable at the station . reference number is : hdx8ix63 . <|user|> i am also looking for a museum . can you recommend 1 ? <|system|> how about cafe jello gallery ? it looks lovely <|user|> that will work . could you get me the phone number and area its in ? <|system|> sure ! it s in the west area of town , and the phone number is 01223312112 . can i help you with anything else ? <|user|> you were great today . thanks so much for all the help . that is all . goodbye . <|endofcontext|>
Target <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general bye none, general welcome none, general greet none <|endofaction|> <|response|> it was a pleasure to help . have a good day . goodbye . <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general bye none <|endofaction|> <|response|> thank you for using our service <|endofresponse|>
Table 8: SimpleTOD generation samples (delexicalized responses) on MultiWOZ 2.1 (dialogue ’PMUL3663’).
Turn 1 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|endofcontext|>
Target <|belief|> train destination ely <|endofbelief|> <|action|> train inform choice, train inform day, train inform dest, train request arrive <|endofaction|> <|response|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|endofresponse|>
LM generation <|belief|> train day sunday, train destination ely <|endofbelief|> <|action|> train inform choice, train inform day, train inform leave, train inform depart, train inform dest, train inform arrive, train inform id, train offerbook none <|endofaction|> <|response|> there are 10 trains leaving cambridge on sunday. tr3177 leaves cambridge at 9:50 and arrives in ely at 10:07. would you like to book a ticket? <|endofresponse|>
Turn 2 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|endofcontext|>
Target <|belief|> train destination ely, train day sunday, train arriveby 13:00, train departure cambridge <|endofbelief|> <|action|> train inform arrive, train inform day, train inform dest, train inform id, train offerbook none <|endofaction|> <|response|>tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|endofresponse|>
LM generation <|belief|> train day sunday, train destination ely, train departure cambridge, train arriveby 13:00 <|endofbelief|> <|action|> train inform arrive, train inform leave, train inform id, train offerbook none <|endofaction|> <|response|> tr1159 leaves at 11:50 and arrives at 12:07. would you like me to book it for you? <|endofresponse|>
Turn 3 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|endofcontext|>
Target <|belief|> train destination ely, train day sunday, train arriveby 13:00, train departure cambridge, train book people 8 <|endofbelief|> <|action|> train offerbooked ref, train offerbooked ticket <|endofaction|> <|response|>booking was successful , the total fee is 28.16 gbp payable at the station . reference number is hdx8ix63. <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general reqmore none, train offerbooked ref, train offerbooked ticket <|endofaction|> <|response|> booking was successful, the total fee is 28.16 gbp payable at the station. reference number is hdx8ix63. is there anything else i can help you with? <|endofresponse|>
Turn 4 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|system|> booking was successful , the total fee is 28.16 gbp payable at the station . reference number is : hdx8ix63 . <|user|> i am also looking for a museum . can you recommend 1 ? <|endofcontext|>
Target <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> attraction recommend name, general greet none <|endofaction|> <|response|> how about cafe jello gallery ? it looks lovely <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> attraction recommend name, attraction recommend area, attraction recommend fee, attraction recommend addr, attraction recommend phone <|endofaction|> <|response|> i would recommend the cafe jello gallery in the west. it is located at 13 magdalene street. the entrance fee is 0 pounds. the phone number is 01223464646 <|endofresponse|>
Turn 5 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|system|> booking was successful , the total fee is 28.16 gbp payable at the station . reference number is : hdx8ix63 . <|user|> i am also looking for a museum . can you recommend 1 ? <|system|> how about cafe jello gallery ? it looks lovely <|user|> that will work . could you get me the phone number and area its in ? <|endofcontext|>
Target <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general greet none, general reqmore none, attraction inform phone, attraction inform area <|endofaction|> <|response|> sure ! it s in the west area of town , and the phone number is 01223464646 . can i help you with anything else ? <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> attraction inform phone, attraction inform area, attraction inform name<|endofaction|> <|response|> cafe jello gallery is in the west area and the phone number is 01223464646 <|endofresponse|>
Turn 6 Context <|context|> <|user|> hello , i need a train to ely on sunday please <|system|> we have 10 trains traveling to ely on sunday . what time would you like to arrive ? <|user|> how about 1 arriving by 13:00 departing from cambridge . <|system|> tr1159 will get you to ely by 12:07 on sunday . should i book you a ticket ? <|user|> yes , that would be great . i’ll need tickets for 8 people . do you have a reference number ? <|system|> booking was successful , the total fee is 28.16 gbp payable at the station . reference number is : hdx8ix63 . <|user|> i am also looking for a museum . can you recommend 1 ? <|system|> how about cafe jello gallery ? it looks lovely <|user|> that will work . could you get me the phone number and area its in ? <|system|> sure ! it s in the west area of town , and the phone number is 01223312112 . can i help you with anything else ? <|user|> you were great today . thanks so much for all the help . that is all . goodbye . <|endofcontext|>
Target <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general bye none, general welcome none, general greet none <|endofaction|> <|response|> it was a pleasure to help . have a good day . goodbye . <|endofresponse|>
LM generation <|belief|> train book people 8, train arriveby 13:00, attraction type museum, train day sunday, train departure cambridge, train destination ely <|endofbelief|> <|action|> general bye none <|endofaction|> <|response|> thank you for using our service <|endofresponse|>
Table 9: SimpleTOD lexicalized generation (using generated belief states) on MultiWOZ 2.1 (dialogue ’PMUL3663’).

6 Conclusion

We explored a simple approach to task-oriented dialogue (SimpleTOD) that uses a single, causal language model. To do this, during training we treat all inputs for dialogue state tracking, action and response generation as a single sequence. SimpleTOD can then directly leverage pre-trained models like GPT-2 to transfer language understanding from open-domain settings where data is more readily available. Empirical results on the multi-domain dialogue dataset (MultiWOZ) showed that the proposed approach outperformed all prior methods in dialogue state tracking as well as in action and response generation, especially in the setting most pertinent to evaluating complete task-oriented dialogue systems. We found that the pre-trained weights were essential, but to leverage these weights fully we had to guide the system with special tokens that mark user and system responses as well as different portions of the sequence related to different sub-tasks. We found that SimpleTOD was effective at tracking dialogue state over long context with many turns and required no more than greedy decoding to achieve new state-of-the-art results. We hope that these results with SimpleTOD encourage further exploration of simple, unified approaches for dialogue systems.

References

  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §1.
  • J. Ba, R. Kiros, and G. E. Hinton (2016) Layer normalization. CoRR abs/1607.06450. Cited by: §3.3.
  • S. Bao, H. He, F. Wang, and H. Wu (2019) PLATO: pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931. Cited by: §2.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155. Cited by: §3.2.
  • A. Bordes, Y. Boureau, and J. Weston (2017) Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations, Cited by: §2.
  • P. Budzianowski, I. Casanueva, B. Tseng, and M. Gasic (2018) Towards end-to-end multi-domain dialogue modelling. Cited by: §3.5, §3.6.
  • L. Chen, B. Lv, C. Wang, S. Zhu, B. Tan, and K. Yu (2020) Schema-guided multi-domain dialogue state tracking with graph attention neural networks. Cited by: §2, §4.
  • W. Chen, J. Chen, P. Qin, X. Yan, and W. Y. Wang (2019) Semantically conditioned dialog response generation via hierarchical disentangled self-attention. arXiv preprint arXiv:1905.12866. Cited by: §2, §4, §5.
  • R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §3.3.
  • Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13042–13054. Cited by: §2.
  • M. Eric, R. Goel, S. Paul, A. Sethi, S. Agarwal, S. Gao, and D. Hakkani-Tur (2019) Multiwoz 2.1: multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669. Cited by: §3.5.
  • M. Eric and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414. Cited by: §2.
  • J. Gao, M. Galley, L. Li, et al. (2019) Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval 13 (2-3), pp. 127–298. Cited by: §1.
  • S. Gao, Y. Zhang, Z. Ou, and Z. Yu (2020) Paraphrase augmented task-oriented dialog generation. arXiv preprint arXiv:2004.07462. Cited by: §4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.3.
  • M. Henderson, I. Casanueva, N. Mrkšić, P. Su, I. Vulić, et al. (2019) ConveRT: efficient and accurate conversational representations from transformers. arXiv preprint arXiv:1911.03688. Cited by: §2.
  • M. Henderson, B. Thomson, and S. Young (2013) Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, Cited by: §2.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §5.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §2, §5.
  • W. Lei, X. Jin, M. Kan, Z. Ren, X. He, and D. Yin (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Cited by: §2.
  • W. Liang, Y. Tian, C. Chen, and Z. Yu (2019) Moss: end-to-end dialog system framework with modular supervision. arXiv preprint arXiv:1909.05528. Cited by: §2.
  • B. Liu and I. Lane (2016) Attention-based recurrent neural network models for joint intent detection and slot filling. In INTERSPEECH, Cited by: §2.
  • B. Liu and I. Lane (2018) End-to-end learning of task-oriented dialogs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 67–73. Cited by: §1.
  • B. Liu, G. Tür, D. Hakkani-Tür, P. Shah, and L. Heck (2018) Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §2, §2.
  • A. Madotto, C. Wu, and P. Fung (2018) Mem2seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv preprint arXiv:1804.08217. Cited by: §2.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305. Cited by: §2.
  • S. Mehri, T. Srinivasan, and M. Eskenazi (2019) Structured fusion networks for dialog. arXiv preprint arXiv:1907.10016. Cited by: §3.6.
  • N. Mrkšić, D. Ó Séaghdha, T. Wen, B. Thomson, and S. Young (2017) Neural belief tracker: data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Cited by: §2.
  • A. Neelakantan, S. Yavuz, S. Narang, V. Prasad, B. Goodrich, D. Duckworth, C. Sankar, and X. Yan (2019) Neural assistant: joint action prediction, response generation, and latent knowledge reasoning. In NeurIPS 2019 Conversational AI Workshop, Cited by: §2.
  • E. Nouri and E. Hosseini-Asl (2018) Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899. Cited by: §2.
  • K. Papineni, S. Roukos, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, Cited by: §3.6.
  • B. Peng, C. Zhu, C. Li, X. Li, J. Li, M. Zeng, and J. Gao (2020) Few-shot natural language generation for task-oriented dialog. External Links: 2002.12328 Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1, §2, §3.2, §3.4.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §2.
  • A. Rastogi, D. Hakkani-Tur, and L. Heck (2017) Scalable multi-domain dialogue state tracking. In Proceedings of IEEE ASRU, Cited by: §2.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §3.4.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §3.4.
  • L. Shu, P. Molino, M. Namazifar, B. Liu, H. Xu, H. Zheng, and G. Tur (2018) Incorporating the structure of the belief state in end-to-end task-oriented dialogue systems. In NeurIPS 2018 Conversational AI Workshop, Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.3.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §2.
  • T. Wen, M. Gašić, N. Mrkšić, P. Su, D. Vandyke, and S. Young (2015) Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • T. Wen, Y. Miao, P. Blunsom, and S. Young (2017) Latent intention dialogue models. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §2.
  • T. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2016) A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562. Cited by: §1, §2.
  • C. Wu, S. Hoi, R. Socher, and C. Xiong (2020) ToD-bert: pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871. Cited by: §2.
  • C. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung (2019a) Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743. Cited by: §2, §4.
  • C. Wu, R. Socher, and C. Xiong (2019b) Global-to-local memory pointer networks for task-oriented dialogue. arXiv preprint arXiv:1901.04713. Cited by: §2.
  • Q. Wu, Y. Zhang, Y. Li, and Z. Yu (2019c) Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756. Cited by: §2, §4.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764. Cited by: §2.
  • J. Zhang, K. Hashimoto, C. Wu, Y. Wan, P. S. Yu, R. Socher, and C. Xiong (2019a) Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544. Cited by: §2, §4.
  • Y. Zhang, Z. Ou, and Z. Yu (2019b) Task-oriented dialog systems that consider multiple appropriate responses under the same context. arXiv preprint arXiv:1911.10484. Cited by: §1, §2, §4, §4.2, §5.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019c) DialoGPT: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. Cited by: §2.
  • T. Zhao, A. Lu, K. Lee, and M. Eskenazi (2017) Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Cited by: §2.
  • T. Zhao, K. Xie, and M. Eskenazi (2019) Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv preprint arXiv:1902.08858. Cited by: §2, §4.
  • L. Zhou and K. Small (2019) Multi-domain dialogue state tracking as dynamic knowledge graph enhanced question answering. arXiv preprint arXiv:1911.06192. Cited by: §2, §4.