Figure 1: Example conversation with the Neural Assistant model (restaurant booking).

User: Hello, please find me a restaurant?
Response: Sure, what type of food are you looking for?
User: I feel like eating mediterranean food.
Response: There are two restaurants in the moderate price range, la mimosa and shiraz restaurant. Do you have a preference?
Response: La Mimosa is in the south part of town in the moderate price range.
User: Sounds good. Please book a table on Monday at 19:30 for 4 people.
Response: I have booked you a table for 4 at La Mimosa on Monday at 19:30. Your reference number is VRZRPRCM. Is there anything else I can help you with?
Action: restaurant-book(people=4, time=19:30, day=monday)
User: No, thanks. Have a good day.
Response: You’re welcome. Have a great day!
Task-oriented dialog systems have been a longstanding goal of artificial intelligence. A single turn of the task-oriented dialog problem can be formalized as mapping a conversation history containing both user and system utterances to a system action and a system text response (we ignore speech-to-text and text-to-speech components in this work). Along with the conversation history, task-oriented systems have access to an external knowledge source (knowledge bases, documents, etc.) relevant to the task being completed. For example, a knowledge base containing restaurants and their attributes is typically provided to a food-ordering dialog system.
Dialog systems, from the early rule-based expert systems Weizenbaum (1966) to the present commercially available virtual assistants like Apple Siri, Amazon Alexa, and Google Assistant, rely on a pipeline containing many components. Such a pipeline seems unavoidable given that task-oriented dialog encompasses multiple problems, including multi-turn language understanding and generation, knowledge retrieval and reasoning, and action prediction. Dialog systems typically begin by converting the conversation history to a belief state using supervised learning Henderson et al. (2013); Rastogi et al. (2017); Mrkšić et al. (2017); Wen et al. (2017). The belief state is then used to reason over an external knowledge source, whose result, along with the conversation history, is used in independently trained action prediction and response generation tasks. However, relying on a pipeline of individually optimized components makes these systems hard to scale. Moreover, the success of consumer-facing systems relies on efficient incorporation of user reinforcement signals, which is non-trivial for a pipeline system.
End-to-end learned deep learning methods have recently enjoyed much success over pipeline systems in many tasks such as image recognition, speech recognition, and machine translation Lecun et al. (2015). Such methods have been applied to task-oriented dialog only in a limited way. For example, Rojas-Barahona et al. (2017) use a separate deep neural network trained independently for every individual component. Bordes et al. (2017) attend to a small knowledge base but do not have a generative model for text response generation. A major difficulty has been efficiently incorporating external (structured or unstructured) knowledge into action prediction and text response generation models. In this paper, we develop Neural Assistant: a single neural network model that takes conversation history and an external knowledge source as input and jointly produces both the text response and the action to be taken by the system as output. The model learns to reason over the provided knowledge source with a weak supervision signal coming from the text generation and action prediction tasks, hence removing the need for belief state annotations.
We evaluate our approach on the MultiWOZ dataset Budzianowski et al. (2018), which contains multi-turn dialogs between users and wizards. Along with conversations, the dataset contains both belief state and dialog act (or semantic parse) annotations. We only predict the belief state annotations that correspond to action prediction and remove the belief state annotations that are used for accessing the knowledge base. We do not use the dialog act annotations in our study. Figures 1, 4 and 5 are example conversations with the Neural Assistant model. We study the effect of distant supervision and of the size of the knowledge base on model performance. We find that the Neural Assistant without belief states is able to incorporate external knowledge information, achieving higher factual accuracy scores than the Transformer baseline. In settings comparable to reported baseline systems, the Neural Assistant, when provided with the oracle belief state, significantly improves language generation performance. Even with a weakly labeled knowledge base, our system comes very close to the quality of the baseline belief state system.
2 Neural Assistant
We formulate the task-oriented dialog problem as taking conversation history along with a relevant knowledge base (KB) as input, and generating system action and the assistant’s next turn text response as output (Figure 2). For example, the conversation history could contain a single turn of user utterance "find me an inexpensive Italian restaurant in San Francisco," and one possible next turn assistant response could be "how about The Great Italian?" Here, the external knowledge required to generate the output would be present in the provided KB. A common way to store such facts is in triple format, e.g. in this case the KB could contain (The Great Italian, type, restaurant), (The Great Italian, cuisine, Italian), (The Great Italian, price, cheap) and so on. Given the above two utterances, the user might say "sounds good, can you book a table for 4 at 7pm?", for which the assistant performs a system action book_table(name=The Great Italian, num_seats=4, time=7pm), and generates a text response "Done!"
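To make the setup concrete, the running example above can be written out as plain data. This is only an illustrative sketch: the container types and field names below are ours, not the paper's actual serialization (the model consumes everything as token sequences).

```python
# Hypothetical sketch of the inputs/outputs described in Section 2: a KB of
# (subject, relation, object) triples, a conversation history of alternating
# speaker turns, and a target consisting of a system action plus a response.
kb = [
    ("The Great Italian", "type", "restaurant"),
    ("The Great Italian", "cuisine", "Italian"),
    ("The Great Italian", "price", "cheap"),
]

history = [
    ("user", "find me an inexpensive Italian restaurant in San Francisco"),
    ("assistant", "how about The Great Italian?"),
    ("user", "sounds good, can you book a table for 4 at 7pm?"),
]

# The model actually emits the action and response as one concatenated token
# sequence; they are shown here as a structured pair for readability.
target = {
    "action": "book_table(name=The Great Italian, num_seats=4, time=7pm)",
    "response": "Done!",
}
```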
Neural Assistant learns to directly map the conversation history and KB to next system action and text response without any intermediate symbolic states or intermediate supervision signals. We first begin by introducing notation, then we describe the model architecture and the training objective.
Conversation History consists of alternating user and assistant turns. Let $(u_1, a_1, \ldots, u_T, a_T)$ denote a conversation history containing $T$ turns each of user utterance ($u_i$) and assistant utterance ($a_i$). The user and assistant turns each contain a variable number of word tokens.
Knowledge Base: We assume the external knowledge required to solve the task is provided. While it is possible to leverage both structured and unstructured knowledge in our framework, in this work we consider external knowledge in the form of a structured KB containing a list of triples. Let $K = \{k_1, \ldots, k_N\}$ be the list of triples in the provided KB.
Output consists of both the system action and text response.
2.1 Model and Training Objective
Neural Assistant is an extension of the Transformer Vaswani et al. (2017) encoder-decoder model. Our model additionally attends to the provided KB to incorporate external knowledge. We encode the knowledge triples separately (in parallel) and the decoder attends to the triples in addition to the input conversation history.
A Transformer encoder consumes the input conversation history. Let $x = (x_1, \ldots, x_n)$ be the concatenated conversation history (both assistant and user turns, separated by delimiters) containing $n$ tokens. The encoder produces hidden states $e_1, \ldots, e_n$ after word-embedding lookup and multiple self-attention layers. We represent each KB fact as the average of the word embeddings of the tokenized triple, and denote the representations of the triples $k_1, \ldots, k_N$ by $m_1, \ldots, m_N$.
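As a sketch of this fact encoding, a triple representation can be computed as the average of its token embeddings. The toy vocabulary and randomly initialized embedding table below are assumptions for illustration; in the real model the embeddings are learned jointly with the rest of the network.

```python
import numpy as np

def embed_triple(triple, embedding, vocab, dim):
    # Average the word embeddings of the tokenized (subject, relation, object)
    # triple; tokens missing from the vocabulary are skipped (illustrative choice).
    tokens = [tok for field in triple for tok in field.lower().split()]
    vecs = [embedding[vocab[t]] for t in tokens if t in vocab]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(
    ["the", "great", "italian", "cuisine", "price", "cheap", "type", "restaurant"])}
embedding = rng.normal(size=(len(vocab), 4))

m = embed_triple(("The Great Italian", "cuisine", "Italian"), embedding, vocab, 4)
```

Averaging word embeddings keeps the per-triple encoding cheap and order-invariant, which is what allows the triples to be encoded separately and in parallel.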
The Transformer decoder, which contains both self-attention and encoder-decoder attention layers, generates the output sequence consisting of both the system action and the text response one token at a time, left to right. We tokenize the system action with the text tokenizer and generate a concatenated version of the system action and text response as one long sequence. While the encoder-decoder attention layers in the Transformer Vaswani et al. (2017) only attend to the input (conversation history), we modify the Transformer decoder so that it attends to both the encoder hidden states of the conversation history and the representations of the fact triples (Figure 3). The decoder attention heads thus attend to the set $\{e_1, \ldots, e_n, m_1, \ldots, m_N\}$; in previous work with the Transformer, the decoder attends only to $\{e_1, \ldots, e_n\}$.
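A minimal single-head sketch of this modified attention, using scaled dot-product attention over the concatenated memory. The real model uses multi-head attention inside every decoder layer; the function name, dimensions, and random inputs here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def joint_attention(query, encoder_states, triple_reps):
    # Keys/values are the union of conversation-history encoder states
    # e_1..e_n and KB triple representations m_1..m_N, not the history alone.
    memory = np.concatenate([encoder_states, triple_reps], axis=0)  # (n + N, d)
    scores = memory @ query / np.sqrt(query.shape[0])               # (n + N,)
    weights = softmax(scores)          # one distribution over both sources
    context = weights @ memory         # (d,)
    return context, weights

rng = np.random.default_rng(1)
enc = rng.normal(size=(5, 6))      # n = 5 conversation-history states
kb_reps = rng.normal(size=(3, 6))  # N = 3 triple representations
context, weights = joint_attention(rng.normal(size=6), enc, kb_reps)
```

Because a single softmax spans both memories, the decoder can trade off copying from the conversation history against grounding in the KB at every step.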
Let $y = (y_1, \ldots, y_m)$ denote the target sequence; we model the target sequence distribution as

$p(y \mid x, K) = \prod_{j=1}^{m} p(y_j \mid y_{<j}, x, K)$.  (1)

Given a training set of $D$ examples $\{(x^{(d)}, K^{(d)}, y^{(d)})\}_{d=1}^{D}$, the objective function to be maximized is given by

$\mathcal{L}_{\text{gen}} = \sum_{d=1}^{D} \log p(y^{(d)} \mid x^{(d)}, K^{(d)})$.  (2)
We use teacher-forcing (Williams and Zipser, 1989) where the model conditions on ground-truth previous tokens in the output and ground-truth previous assistant turns in the conversation history.
2.2 Distant Supervision
We adopt a technique called distant supervision Mintz et al. (2009), widely used in knowledge base construction research. At train time, we weakly label a fact in the KB as positive if some word in the entities of the triple is present in the ground-truth target sequence. This weak supervision signal can guide the decoder attention over the KB described above.
The distant supervision objective to be maximized is given by

$\mathcal{L}_{\text{ds}} = \sum_{d=1}^{D} \sum_{i=1}^{N} z_i^{(d)} \log \alpha_i^{(d)}$,  (3)

where $\alpha_i$ is the attention probability over triple $k_i$, and $z_i$ is an indicator variable that is set to 1 if some word in the entities of the triple $k_i$ is present in the ground-truth target sequence, and 0 otherwise. The model now maximizes an interpolation of the two objective functions in Equation 2 and Equation 3, given by

$\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda \, \mathcal{L}_{\text{ds}}$,  (4)

where $\lambda$ is a weighting term tuned on the development set.
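The weak labeling rule and the distant supervision term can be sketched as follows. `weak_labels` and `distant_supervision_loss` are hypothetical helpers mirroring the indicator $z_i$ and the (negated, per-example) distant supervision objective; they are not the paper's code.

```python
import numpy as np

def weak_labels(triples, target_text):
    # z_i = 1 iff some word of the triple's entities (subject or object)
    # appears in the ground-truth target sequence, else 0.
    target_words = set(target_text.lower().split())
    labels = []
    for subj, _rel, obj in triples:
        entity_words = set(subj.lower().split()) | set(obj.lower().split())
        labels.append(1 if entity_words & target_words else 0)
    return labels

def distant_supervision_loss(attention_probs, labels):
    # Negative of the distant supervision objective for one example:
    # -sum_i z_i * log(alpha_i), pushing attention mass onto positive triples.
    eps = 1e-9
    return -sum(z * np.log(a + eps) for a, z in zip(attention_probs, labels))

kb = [("la mimosa", "area", "south"),
      ("shiraz restaurant", "food", "mediterranean"),
      ("acorn guest house", "area", "north")]
labels = weak_labels(kb, "la mimosa is in the south part of town")
loss = distant_supervision_loss([0.7, 0.2, 0.1], labels)
```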
3 Related Work
In past work, dialog systems have generally relied on pipeline systems Singh et al. (2002); Levin and Pieraccini (2000). Deep learning has been applied to task-oriented dialog in many recent studies Henderson et al. (2013); Wen et al. (2015); Williams et al. (2017); Rastogi et al. (2017); Mrkšić et al. (2017); Wen et al. (2017); Bordes et al. (2017). One line of work has been on using deep learning to predict belief states using supervised learning Henderson et al. (2013); Rastogi et al. (2017). The other line of work makes use of pipelines consisting of many components each represented as a neural network trained independently Wen et al. (2015); Mrkšić et al. (2017); Wen et al. (2017); Eric and Manning (2017).
The line of work closest to ours is the use of memory networks Weston et al. (2015); Sukhbaatar et al. (2015) for task-oriented dialog Bordes et al. (2017); Perez and Liu (2017); Henderson et al. (2017). While all of these works incorporate an external knowledge source directly into text response generation, they do not employ a generative model for response generation, and instead rely on selecting a response from a list of candidate responses, which is impractical in real-world settings. More recently, Wu et al. (2019) use a generative model instead of a text classification model, but they, along with previous work Bordes et al. (2017); Perez and Liu (2017); Henderson et al. (2017), work with much smaller knowledge bases where, unlike in our case, full softmax attention over the knowledge base is computationally feasible. Also, they do not generate both the text response and the system action jointly in a single model.
Other kinds of dialog tasks have also been tackled by deep learning. This line of work has predominantly been in the chit-chat setting where generative deep learning models are used to generate text responses Vinyals and Le (2015); Serban et al. (2017); Li et al. (2016). More recent work has extended this line of work to language based negotiation games Lewis et al. (2017) and dialog systems with persona Zhang et al. (2018).
We evaluate our method on the MultiWOZ Budzianowski et al. (2018) dataset, which contains training examples as well as validation and test sets; an associated knowledge base of triples is also provided. We report results on the test set in the tables below. To evaluate the performance of different methods, we use the F-1 score for action prediction (Action F-1) and the BLEU score for text response generation. Apart from the BLEU score, which primarily measures fluency, we also report the Entity F-1 score, an approximate metric of the factualness of the text response: we obtain the list of entities mentioned in the ground-truth response and compare it to the list of entities in the model prediction, using exact string match to extract the entities. Our models are implemented in the Tensor2Tensor Vaswani et al. (2018) framework. All models are trained for 50k steps. Due to the small size of the dataset, we use the tiny Transformer hyper-parameter setting in Tensor2Tensor. Unless otherwise stated, the Neural Assistant is trained without the distant supervision objective.
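A minimal reimplementation of the Entity F-1 computation described above (F-1 over entity lists under exact string match). This is an illustrative sketch; the paper's exact entity extraction and tie-breaking details are not shown in the text.

```python
def entity_f1(predicted_entities, gold_entities):
    # F-1 between the sets of entities (exact string match) mentioned in the
    # model prediction and in the ground-truth response.
    pred, gold = set(predicted_entities), set(gold_entities)
    if not pred or not gold:
        # Degenerate cases: both empty counts as a perfect match.
        return 1.0 if pred == gold else 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting {"la mimosa", "shiraz restaurant"} against a ground truth of {"la mimosa"} gives precision 0.5 and recall 1.0.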
Figures 1, 4 and 5 are example conversations conducted with the Neural Assistant model in real time to complete a task. Note that the model is trained at turn level, where the dialog history fed into the model as input consists of the previous ground-truth turns of the dialog example; at training time the model is never exposed to text responses it generated in previous turns. In the conversations in the figures, however, the text responses generated by the model itself are used as the assistant's side of the dialog history fed as input to the model when generating the text responses and actions of the following turns.
First, we benchmark the Transformer model on the belief state prediction and text generation problems to compare with the results reported in Budzianowski et al. (2018). The Transformer baseline models take only the conversation history as input; they skip the KB and do not use oracle belief state annotations. The text generation results are in Table 1. We also treat belief state prediction as a sequence-to-sequence problem and achieve an F-1 score on belief state prediction that is again significantly higher than that of the baseline system.
Next, we report results on the Neural Assistant model. We evaluate our framework in increasingly harder settings by gradually increasing the size of the external knowledge source to be incorporated by the model. To begin with, as done in Budzianowski et al. (2018), we include oracle belief state annotations, which reduces the size of the KB to be considered for a given input to fewer than 10 triples. As shown in Table 1, the Neural Assistant model achieves a BLEU score significantly higher than that of the baseline system Budzianowski et al. (2018). Since the oracle belief states are provided to the model, we do not evaluate the Entity F-1 and Action F-1 scores for this setting. Then we make the setting slightly harder: the model consumes only the weakly labeled positive triples from distant supervision (Section 2.2). Here, the size of the KB to be considered is around 50 triples per example. Even with a weakly labeled knowledge base, our system comes very close to the quality of the baseline belief state system.
Table 1: Text generation and action prediction results.

| Model | BLEU | Action F-1 | Entity F-1 |
| System with Oracle Belief State Budzianowski et al. (2018) | | N/A | N/A |
| Neural Assistant (oracle triples) | | N/A | N/A |
| Neural Assistant (weakly labeled positive triples) | | | |
4.2 Neural Assistant with Large Knowledge Base
Now, we carefully study the extent to which Neural Assistant models can handle large KBs. We take the set of weakly labeled positive triples for every example and fill up the rest of the KB with randomly sampled negative examples, both at train and test time. The goal of this experiment is to study the effect of KB size on Neural Assistant performance; another way to look at it is as a study of the extent to which our model can tolerate the errors of a retrieval system. The performance of Neural Assistant on different KB sizes is shown in Table 2. The BLEU and Entity F-1 scores for Neural Assistant decrease as the KB size increases: the model is able to incorporate external knowledge effectively for smaller KB sizes, but beyond that the Entity F-1 score degrades quite rapidly. We also study the effect of the distant supervision objective discussed in Section 2.2 as an additional training objective. Our experiments show that in some cases distant supervision helps the model achieve better performance, particularly a higher Entity F-1 score, but not in all cases. Finally, we report results from using the entire KB at test time with a model trained without distant supervision on smaller per-example KBs. In this setting, the Entity F-1 score is quite low, indicating that the model is not able to select the relevant entities from the knowledge base at test time. The model cannot consume the entire KB at train time, as it runs out of memory on ML accelerators.
Table 2: Effect of KB size on Neural Assistant performance, with and without distant supervision (DS).

| | Neural Assistant | | | Neural Assistant with DS | | |
| Size of KB | BLEU | Action F-1 | Entity F-1 | BLEU | Action F-1 | Entity F-1 |
| 28483 (at test) | | | | - | - | - |
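The construction of per-example KBs in this experiment can be sketched as follows. `build_kb_window` is a hypothetical helper (the paper does not name its sampling procedure): it keeps the weakly labeled positives and pads with randomly sampled negatives from the full KB up to the target size.

```python
import random

def build_kb_window(positive_triples, full_kb, kb_size, seed=0):
    # Keep the weakly labeled positive triples; pad with randomly sampled
    # negatives from the full KB until the per-example KB reaches kb_size.
    rng = random.Random(seed)
    positives = list(positive_triples)
    negatives = [t for t in full_kb if t not in set(positives)]
    n_needed = max(0, kb_size - len(positives))
    kb = positives + rng.sample(negatives, min(n_needed, len(negatives)))
    rng.shuffle(kb)  # avoid positional leakage of which triples are positive
    return kb

full_kb = [("entity_%d" % i, "attr", "value_%d" % i) for i in range(100)]
positives = full_kb[:5]
kb = build_kb_window(positives, full_kb, 20)
```

Growing `kb_size` simulates a noisier retrieval system: the positives stay fixed while the model must attend past ever more distractors.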
We introduced Neural Assistant, a single neural network model that takes conversation history and an external knowledge source as input and jointly produces both the text response and the action to be taken by the system as output. The model learns to reason over the provided knowledge source with a weak supervision signal coming from the text generation and action prediction tasks, thereby removing the need for belief state annotations. Future work involves developing techniques that can effectively handle much larger KBs.
Figure 4: Example conversation with the Neural Assistant model (guesthouse booking).

User: Hi there, can you find me a place to stay?
Response: Sure, do you have a price range in mind?
User: I prefer a place with moderate price.
Response: There are several in the moderate price range. Do you have a preference?
User: I would prefer it to be in the south.
Response: I have two guesthouses in the south in the moderate price range. Would you like to book one of them?
Response: I’d be happy to book you a room at aylesbray lodge guest house, and bridge guest house. Which one do you prefer?
User: bridge guest house sounds good.
Response: I can book that for you. How many people will be staying and how many nights?
User: 4 people will be staying for 2 nights starting on this friday.
Response: Booking was successful. Reference number is : GBHPGZH.
User: That was quick, thank you Neural Assistant!
Response: You’re welcome. Is there anything else I can help you with?
User: No, thanks. Have good week!
Response: You’re welcome. Have a great day.
Figure 5: Example conversation with the Neural Assistant model (hotel and restaurant booking).

User: Hello, can you find me a hotel to stay?
Response: Sure, do you have a price range in mind?
User: Yes, somewhere in the moderate price range.
Response: There are several in the moderate price range. Do you have a preference on area?
User: Yes, how about somewhere in the north?
Response: I have two options for you. The Ashley Hotel and Lovell Lodge. Would you like to book one of these?
Response: What day would you like to start your stay?
User: Can you book it for 1 people starting on Thursday for 2 night?
Response: Booking was successful. Reference number is : FKQVSY9Y.
Response: You’re welcome. Is there anything else I can help you with?
User: Can you also find me a nice restaurant?
Response: Sure, do you have a food preference?
User: Yes, I prefer Italian food.
Response: There are two Italian restaurants in the south area. One is in the moderate price range and one is in the centre. Do you have a preference?
User: I prefer the one in moderate price range.
Response: Pizza Hut Cherry Hinton is in the south part of town and is in the moderate price range. Would you like to book a table?
User: Yes, please book a table for 1 person at 20:30.
Response: I’m sorry, that time is not available. Would you like to try another time or day?
Response: I was able to book you at Pizza Hut Cherry Hinton. Your reference number is FG1KWP5. Is there anything else I can help you with?
User: No, thanks. Have a good night.
Response: Thank you for using our service. Have a great day.
- Learning end-to-end goal-oriented dialog. ICLR. Cited by: §1, §3, §3.
- MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. EMNLP. Cited by: §1, §4.1, §4.1, Table 1, §4.
- A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. EACL. Cited by: §3.
- Deep neural network approach for the dialog state tracking challenge. SIGDIAL. Cited by: §1, §3.
- Query-reduction networks for question answering. Cited by: §3.
- Deep learning. Nature. Cited by: §1.
- A stochastic model of computer-human interaction for learning dialogue strategies. IEEE Transactions on Speech and Audio Processing. Cited by: §3.
- Deal or no deal? end-to-end learning for negotiation dialogues. EMNLP. Cited by: §3.
- A diversity-promoting objective function for neural conversation models. NAACL. Cited by: §3.
- Distant supervision for relation extraction without labeled data. ACL. Cited by: §2.2.
- Neural belief tracker: data-driven dialogue state tracking. ACL. Cited by: §1, §3.
- Gated end-to-end memory networks. ACL. Cited by: §3.
- Scalable multi-domain dialogue state tracking. Proceedings of IEEE ASRU. Cited by: §1, §3.
- A network-based end-to-end trainable task-oriented dialogue system. EACL. Cited by: §1.
- A hierarchical latent variable encoder-decoder model for generating dialogues. AAAI. Cited by: §3.
- Optimizing dialogue management with reinforcement learning: experiments with the NJFun system. Journal of Artificial Intelligence Research. Cited by: §3.
- End-to-end memory networks. NeurIPS. Cited by: §3.
- Tensor2Tensor for neural machine translation. CoRR. Cited by: §4.
- Attention is all you need. NeurIPS. Cited by: §2.1, §2.1.
- A neural conversational model. CoRR. Cited by: §3.
- ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM. Cited by: §1.
- Semantically conditioned lstm-based natural language generation for spoken dialogue systems. EMNLP. Cited by: §3.
- Latent intention dialogue models. ICML. Cited by: §1, §3.
- Memory networks. ICLR. Cited by: §3.
- Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. ACL. Cited by: §3.
- A learning algorithm for continually running fully recurrent neural networks. Neural Computation. Cited by: §2.1.
- Global-to-local memory pointer networks for task-oriented dialogue. ICLR. Cited by: §3.
- Personalizing dialogue agents: i have a dog, do you have pets too?. ACL. Cited by: §3.