Spoken Dialogue Systems (SDS) are widely used as assistants that help users handle daily affairs. Tasks vary from searching for a restaurant to booking several flight tickets. Completing tasks in such diverse situations requires an SDS to handle dialogues in different domains.
State representation is an essential part of a successful end-to-end dialogue system. A commonly used method is a human-defined state representation, in which the state records necessary information such as the slot values the system needs. The dialogue policy then chooses actions according to the state. To produce such a representation in a real dialogue, a belief state tracker is usually adopted to recognize ontology items from the user’s text.
An alternative is to use a hidden state representation. The raw utterance text is compressed into hidden vectors; the model summarizes the context, and dialogue acts are made from the hidden states. In this setting, the dialogue system is a purely end-to-end model with only text as input.
Both methods have their limitations. A model that takes human-defined states as input is bounded by the attached state tracking model: errors accumulate during state tracking, especially in a multi-domain setting where the ontology space is large. Abandoning human-defined states, on the other hand, weakens the model, and a model with latent states is hard for humans to understand and debug.
In this paper, we introduce a universal dialogue generation system for multi-domain dialogues. Rather than using an external tracker to recognize the ontology, our model generates hidden states directly from raw text. To produce proper responses and benefit from the well-labeled semantics in the training data, we adopt a teacher-student framework. In this framework, multiple teachers are trained, one for each required domain, to learn the dispersed dialogue knowledge and the extra labeled semantic information. We then extract the knowledge and policies learned by the individual teacher models and merge them into a universal student model. The framework ensures that the student learns the teachers’ well-learned responses during conversations. Our model takes full advantage of the labeled data and is not bounded by the performance of an external belief tracker.
The main contributions of our work are twofold:
- We use a multi-teacher single-student framework to gather the extra knowledge from individual domains into one universal model for dialogue systems.
- We build a multi-domain dialogue system that takes no human-defined states as input and can still benefit from semantic labeling during training.
2 Related Works
2.1 Multi-domain Dialogue System
The architecture of a dialogue system usually consists of the following key components: a Spoken Language Understanding (SLU) module that understands the user’s intents, a dialogue manager that tracks the dialogue state and decides on the response, and a Natural Language Generator (NLG) that generates human-readable text responses. Task-oriented dialogue systems in specific domains handle problems according to a domain-specific ontology. Scaling up the number of domains depends on improving every component above.
A large body of work addresses how to construct a multi-domain dialogue system, such as dynamic dialogue processing for a daily assistant Pakucs (2003) and scalable action spaces that let a dialogue system share knowledge between domains Dzikovska et al. (2003). As neural networks became widely used in dialogue systems, approaches appeared that handle multi-domain dialogue with deep networks. For example, Wen et al. (2016) improve a dialogue system’s ability in one domain by pre-training the neural network on another domain, and Ultes et al. (2017) developed a multi-domain dialogue system toolkit with implementations of all dialogue system modules, such as a DQN-based dialogue policy.
2.2 Dialogue State
The dialogue state describes the current step of the dialogue process. Dialogue state tracking can be formulated within a Partially Observable Markov Decision Process (POMDP) Thomson and Young (2010). A common way to represent the state is to apply human-defined features and use multi-hot embedding vectors as the states. These features often contain the slots that must be filled in the task, and domain tags if there is more than one domain. States generated by this kind of embedding are usually easy to interpret.
To apply this method in practice, an external state tracker is needed to recognize the correct features from the user utterance. Much work addresses this problem, such as the rule-based state tracker of Sun et al. (2014) and the Neural Belief Tracker (NBT) Mrksic et al. (2017). Other work focuses on state trackers that track user intent and slot values across multiple domains Mrksic et al. (2015); Rastogi et al. (2017); Goel et al. (2018); Nouri and Hosseini-Asl (2018).
Another approach to dialogue state representation is to use a hidden state vector generated directly from the raw text. A Hierarchical Recurrent Encoder-Decoder (HRED) Sordoni et al. (2015); Serban et al. (2016, 2017) dialogue system summarizes the dialogue history through utterance vectors, without handcrafted features, in an open domain. The Attention with Intention (AWI) architecture Yao et al. (2015) works in a similar way. These models take only raw text as input and output and need no human-defined state information during training, so they do not require a state tracker component.
2.3 The Teacher-Student framework
In the teacher-student framework, a teacher model uses its internal knowledge to guide training and produce a better student. The idea was introduced to deep learning as knowledge distillation by Hinton et al. (2015), where knowledge is transferred from a large teacher model to a small one, or assembled from several models into one student. The earliest applications of knowledge distillation were mainly in computer vision. Recent work shows that the knowledge-distillation-based teacher-student method also works well for language models Kim and Rush (2016). Fan et al. (2018) extend the framework so that the teacher no longer simply transfers knowledge but decides what kind of data the student learns from, in what hyper-parameter space, and how well the student can perform. Tan et al. (2019) proposed a multi-teacher single-student framework that combines several individual models into one multilingual model for machine translation.
3 Dialogue Generation Systems
Our multi-domain dialogue generation system consists of three parts: a multi-domain hierarchical dialogue generation model, which serves as the main model and as the student that learns external knowledge; several individual dialogue models, which act as teachers guiding the student; and the guiding step, which transfers knowledge from the individual models to the universal one.
The problem of multi-turn dialogue can be considered a sequence-to-sequence mapping problem. At each turn, the user inputs an utterance, and the dialogue system finds the most proper response given that utterance and the history context; that is, it maximizes the probability of the response conditioned on the utterance and the context. By introducing a POMDP, the dialogue history can be summarized as a state. Usually, the state is generated from all user utterances and the system’s past responses. In a reinforcement learning setting of the dialogue problem, actions represent what the system should respond, and the dialogue policy produces actions from the states. The response is then generated by the NLG module from the corresponding action alone, or, in an attention-enhanced dialogue model, from both the action and the user’s utterance.
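The formulation above can be written compactly as follows. The symbols are chosen here for illustration and are assumptions, not necessarily the notation of the original equations: $u_t$ is the user utterance at turn $t$, $c_t$ the history context, $r_t$ the system response, $s_t$ the state, $a_t$ the action, $f$ the state summarizer, and $\pi$ the dialogue policy.

```latex
\begin{align}
r_t^{*} &= \arg\max_{r} \; p(r \mid u_t, c_t), \\
s_t &= f(u_1, r_1, \dots, u_{t-1}, r_{t-1}, u_t), \\
a_t &= \pi(s_t), \\
r_t &= \mathrm{NLG}(a_t) \quad \text{or} \quad r_t = \mathrm{NLG}(a_t, u_t) \ \text{(with attention)}.
\end{align}
```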
In our model, the state for the teacher model is defined directly from the human-labeled semantics of the utterance, while the student model generates the state itself from all past user utterances. The details of the two models are discussed in the rest of this section.
3.1 A Universal Dialogue Generation System
A typical method for dialogue generation is to use a sequence-to-sequence (Seq2Seq) model Cho et al. (2014). The user’s input utterance is a sequence of words. The encoder of a Seq2Seq model takes the words as input and learns a representation of the utterance. The action is then made from the collection of all utterance representations.
We use an encoder-decoder model to encode each user utterance into a latent vector representation, and summarize all utterance vectors with a context-level encoder in a hierarchical encoder-decoder architecture, as shown in Figure 1. The utterance at each turn is a sequence of words, and the utterance encoder is an LSTM Hochreiter and Schmidhuber (1997) network.
We then take the last hidden state of the LSTM as the utterance representation vector, and use the hierarchical encoder as the context-level policy module. The action is made based on the history of all utterances. We use another LSTM as the context-level decoder.
The action takes the form of an abstract latent vector that guides the dialogue system toward proper responses. Although the policy module is not a necessary part of a dialogue system, we will show how the action representation improves the performance of our model under the teacher-student framework.
The action is then fed into the generation part. The NLG module takes the action as its initial state and generates the final response. The decoder is enhanced with an attention mechanism over the encoder outputs at each word position of the utterance.
3.2 Individual Models as Teachers
We also train a dialogue model in each individual domain. Unlike the universal model, the individual models do not generate actions directly from the utterance. Instead, we use human-labeled semantics to construct the states of the conversation.
We use the model proposed by Budzianowski et al. (2018a). As shown in Figure 2, it contains three parts: an encoder and a decoder that are the same as in the hierarchical model, and a middle policy model that takes both the utterance representation and human-defined features as input. The features are split into two vector representations. The first is the belief state vector, where each dimension stands for the one-hot value of a specific slot in each domain, that is, a slot whose value should be received from the user. The belief state thus holds the necessary information the system keeps at the current point of the dialogue, and it is updated at every turn according to the semantic labeling of the user utterance. The second is the database pointer vector, which encodes the number of entities that match the user’s request. We use a 6-dimensional one-hot vector whose positions stand for 0, 1, 2, 3, 4, and more than 4 candidate entities, respectively. We concatenate three vectors, the utterance vector, the belief state, and the database pointer, to obtain the vector of the current stage of the conversation.
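The state construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy ontology, and the 3-dimensional utterance vector are all hypothetical, and only the 6-way bucketing of the database pointer and the three-way concatenation come from the text.

```python
def db_pointer(num_matches):
    """6-dim one-hot vector: positions 0..4 mean exactly 0..4 matching
    entities, position 5 means more than 4."""
    if num_matches < 0:
        raise ValueError("num_matches must be non-negative")
    vec = [0.0] * 6
    vec[min(num_matches, 5)] = 1.0
    return vec


def belief_state(slot_values, ontology):
    """One one-hot block per slot. `ontology` maps each slot name to its
    list of possible values (a hypothetical toy ontology here). Slots the
    user has not filled yet stay all-zero."""
    vec = []
    for slot, values in ontology.items():
        block = [0.0] * len(values)
        value = slot_values.get(slot)
        if value in values:
            block[values.index(value)] = 1.0
        vec.extend(block)
    return vec


def build_state(utterance_vec, slot_values, ontology, num_matches):
    """Concatenate utterance vector, belief state and database pointer."""
    return (list(utterance_vec)
            + belief_state(slot_values, ontology)
            + db_pointer(num_matches))


# Toy example: the user asked for a cheap restaurant in the north,
# and 7 database entities match the request.
toy_ontology = {
    "area": ["north", "south", "centre"],
    "pricerange": ["cheap", "moderate", "expensive"],
}
state = build_state(
    utterance_vec=[0.2, -0.1, 0.4],
    slot_values={"area": "north", "pricerange": "cheap"},
    ontology=toy_ontology,
    num_matches=7,
)
```

Because each teacher is trained in its own domain, each teacher would use its own ontology, which is why the meaning of the belief state dimensions differs between teachers.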
After the state vector is computed, we feed it to the policy model. The vector is processed by a nonlinear layer with tanh as the activation function, and the action vector is generated from this layer. The action is finally delivered to the decoder module, and the response is generated with an attention mechanism as described above. We train the teacher models individually, one in each domain, so the meaning of the belief state differs among teachers. After the teachers are pre-trained in all domains, we use them as guidance to train the student model.
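As a rough illustration of the teacher's policy layer, the sketch below applies a single tanh layer to a concatenated state vector. The weight matrix, the bias term, the dimensions, and the random initialization are assumptions made for the example; the text only specifies a nonlinear layer with tanh activation over the concatenated vector.

```python
import math
import random


def policy_layer(state, weights, bias):
    """Compute action = tanh(W @ state + b) with plain Python lists.
    `weights` has one row per action dimension; the bias is an assumption
    of this sketch and may be set to zeros."""
    if len(weights) != len(bias):
        raise ValueError("weights and bias must have the same number of rows")
    if any(len(row) != len(state) for row in weights):
        raise ValueError("each weight row must match the state length")
    return [
        math.tanh(sum(w * x for w, x in zip(row, state)) + b)
        for row, b in zip(weights, bias)
    ]


# Toy state: 3-dim utterance vector, 3-dim belief state, 6-dim database pointer.
random.seed(0)
state = [0.2, -0.1, 0.4] + [1.0, 0.0, 0.0] + [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
action_dim = 4  # hypothetical size of the latent action vector
weights = [[random.uniform(-0.1, 0.1) for _ in state] for _ in range(action_dim)]
bias = [0.0] * action_dim
action = policy_layer(state, weights, bias)
```

The resulting latent action vector is what the decoder would receive as its initial state.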
4 The Teacher-Student Framework
In this section, we describe the method for transferring knowledge from the individual models in each domain to the universal model. We consider the dialogue problem a mapping from the user’s inputs, together with the context or history of the dialogue, to the responses the system gives. For well-formed domain-specific dialogue data, semantic labeling is available to construct the state representation, which serves as an extra input. That is, the teachers learn the mapping from inputs, context, and states to the responses. Because their states are based on human labeling, we assume that the teachers learn a better response strategy than a model without such states. While training the student model, we use the extra information from these teachers to guide training. We aim for a final student model that performs as well as the teachers without the extra state information. Following the idea of knowledge distillation, we guide the student at both the output part and the decision-making part so that its performance comes as close as possible to that of the teacher models.
4.1 Output Guiding
We first train our model to learn proper responses from the single-domain models. Given the user utterance and the context at each turn, the purpose of the generation model is to find the most suitable response as a sequence of words, each chosen from the vocabulary of all possible words under the parameters of the generation model. To apply guidance from the teachers’ output, the student should produce outputs similar to the teachers’. In log-likelihood form, the student is trained to assign high probability to the words that the teacher models, under their own parameters, consider likely. Rather than simply learning the ground truth at each turn, the student model also learns the responses the teachers give.
We apply two methods of distillation:
Vocabulary size distillation. All output logits at each position are used for knowledge distillation. This is the naive way to distill knowledge from the teacher model. From the ground truth of the training data, the generation part of the model learns only a one-hot value at each position. The guidance from the teachers’ output instead provides a smoother probability distribution over words. Vocabulary size distillation improves the naturalness and correctness of the generated dialogue.
Top-K distillation. Only the top k logits at each position are used for knowledge distillation Tan et al. (2019). This method does not use the full vocabulary-size probability distribution at every step of response generation; the top k logits in the teachers’ output are selected as the guidance for the student’s training. This kind of distillation is more efficient than the full-vocabulary one. Moreover, not all words need to be considered when generating a sentence, so omitting low-probability words helps the guiding process.
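The two distillation variants can be illustrated at a single decoding position with the sketch below. It is a simplified example under stated assumptions: the helper names are hypothetical, and renormalizing the kept top-k teacher logits into a probability distribution is one reasonable choice that the text does not spell out. Setting k to the vocabulary size recovers vocabulary size distillation, and k = 1 yields a one-hot target on the teacher's most likely word.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_k_targets(teacher_logits, k):
    """Keep the k largest teacher logits, renormalize them into a
    distribution (an assumption of this sketch), and zero out the rest."""
    if not 1 <= k <= len(teacher_logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    top = order[:k]
    probs = softmax([teacher_logits[i] for i in top])
    target = [0.0] * len(teacher_logits)
    for i, p in zip(top, probs):
        target[i] = p
    return target


def distillation_nll(student_logits, target):
    """Negative log-likelihood of the student under the soft teacher target:
    -sum_w target(w) * log p_student(w)."""
    log_probs = log_softmax(student_logits)
    return -sum(q * lp for q, lp in zip(target, log_probs) if q > 0.0)


# Toy 4-word vocabulary at one decoding position.
teacher_logits = [2.0, 0.5, 1.0, -1.0]
student_logits = [1.0, 1.0, 0.0, 0.0]
full_target = top_k_targets(teacher_logits, k=4)   # vocabulary size distillation
top2_target = top_k_targets(teacher_logits, k=2)   # top-k distillation
top2_loss = distillation_nll(student_logits, top2_target)
```

In a full model this loss would be summed over all positions of the response and computed on batched tensors rather than Python lists.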
We use both the ground truth and the teachers’ output as targets in training, and apply the negative log-likelihood as the output distillation loss. This loss is added to the ground-truth loss. To adjust the influence of the teachers, we apply a scalar weight that changes their importance during training.
4.2 Policy Guiding
Besides applying distillation to the final outputs, we also want the universal model to learn from other parts of the teachers. The teacher and student models differ in the structure of the policy part but share the same decoder module as the NLG part. We can therefore assume that the teacher and student models should make similar decisions as input to the NLG module, so teaching the student the teachers’ policy is helpful. We use the action from the teachers’ policy output as extra information to train the student’s policy. Since the teacher and student actions are both latent vectors, we use a mean squared error (MSE) loss during training to force the student to make decisions like the teachers. As with output distillation, we add the policy distillation loss to the existing loss, multiplied by another scalar weight.
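Putting the pieces together, the training objective can be sketched as a weighted sum of the ground-truth loss, the output distillation loss, and the policy distillation loss. The function and argument names below are hypothetical. The policy weight of 0.05 is the value reported in the settings; the output weight of 0.5 in the usage example is a placeholder chosen only for illustration.

```python
def mse(student_action, teacher_action):
    """Mean squared error between two latent action vectors."""
    if len(student_action) != len(teacher_action):
        raise ValueError("action vectors must have the same length")
    n = len(student_action)
    return sum((s - t) ** 2 for s, t in zip(student_action, teacher_action)) / n


def total_loss(ground_truth_nll, output_distill_nll,
               student_action, teacher_action,
               output_weight, policy_weight):
    """Ground-truth loss plus weighted output and policy distillation losses."""
    policy_loss = mse(student_action, teacher_action)
    return (ground_truth_nll
            + output_weight * output_distill_nll
            + policy_weight * policy_loss)


# Toy values for one training example.
loss = total_loss(
    ground_truth_nll=2.0,
    output_distill_nll=1.0,
    student_action=[0.0, 1.0],
    teacher_action=[0.0, 0.0],
    output_weight=0.5,    # placeholder value, not taken from the paper
    policy_weight=0.05,   # value reported in the settings
)
```

Setting either weight to zero recovers the ablations with only output guiding or only policy guiding.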
To evaluate the approaches described above, we apply our model to a multi-domain dialogue problem and test the teacher-student based dialogue systems.
We choose MultiWOZ Budzianowski et al. (2018b), a multi-domain human-human conversation dataset. MultiWOZ consists of dialogue data in 7 domains: restaurant, hotel, attraction, taxi, train, hospital, and police. The conversations aim to satisfy the user’s intent and supply the information the user needs about certain entities. An episode contains around 14 turns between the user and the system. Some episodes stay in one domain from beginning to end, while others switch among 2 to 5 domains. In each domain, there are about 4 slots that the system can receive from the user and about 3 properties of the entity that the system should provide to the user. For example, in the restaurant domain, the user can choose the area, the price range, and the food type of a restaurant, and the information the system should offer includes the address, the reference number, the phone number, etc.
To test the response ability of the models, we first pre-process the dialogue data, replacing the names of entities and their property values with placeholders. Then we generate the belief states and the database pointers, which serve as the extra inputs of the teachers, from the human-labeled semantics. To train the individual teachers in different domains, we split the dataset into domains at the turn level, tagging the domain of each turn by the entities mentioned in the user utterance, the system response, or the human-defined dialogue actions. Because some episodes involve more than one domain, one episode may be split into several. Although the fluency of the conversation may be affected by the missing context when training individual teachers, we believe the teacher models can still be trained well, since they take the manual state as input in addition to the turn-level sentences.
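The placeholder replacement step can be illustrated with a small sketch. It is not the authors' pre-processing script: the function name, the bracketed placeholder format, and the example values are hypothetical, and a real pipeline would draw the values from the dataset's database and annotations.

```python
import re


def delexicalize(text, value_to_placeholder):
    """Replace known entity names and property values with placeholders.
    Longer values are replaced first so that a short value cannot break
    up a longer one that contains it."""
    pairs = sorted(value_to_placeholder.items(),
                   key=lambda kv: len(kv[0]), reverse=True)
    for value, placeholder in pairs:
        pattern = r"\b" + re.escape(value) + r"\b"
        # A function replacement avoids any escape handling of the placeholder.
        text = re.sub(pattern, lambda _m, p=placeholder: p, text,
                      flags=re.IGNORECASE)
    return text


# Hypothetical values and placeholder names for one system response.
values = {
    "pizza hut city centre": "[restaurant_name]",
    "regent street": "[restaurant_address]",
    "01223323737": "[restaurant_phone]",
}
response = "Pizza Hut City Centre is at Regent Street, phone 01223323737."
delexicalized = delexicalize(response, values)
```

Training on placeholders lets the generator learn response patterns that are independent of specific entities, which is also what makes the entity-based evaluation metrics computable.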
We compare our model with the following models:
The universal model without teachers. We use an HRED model without any teacher guidance. The model learns directly from the raw data, which shows how much the teacher-student framework helps in training.
The dialogue model with the belief state. We use a dialogue model that takes the belief state as input, together with a state tracking model. The dialogue model is the same as the teacher model in Section 3.2 and takes the belief state as part of its input. We also use a Globally Conditioned Encoder (GCE) Budzianowski et al. (2018a) state tracking model, which is the best state tracking model on the MultiWOZ dataset, to update the state as the dialogue proceeds. We test this model to see whether our framework can summarize the abstract state and produce responses from utterances rather than from a belief state.
The dialogue model with the manual state. We also evaluate a universal dialogue model that uses the manual state as input during testing. This model therefore receives more input data than the others in the comparison, and its test result is treated as the upper bound that the other models can reach.
5.3 Settings and metrics
We use 50-dimensional word embeddings. The input and output vocabularies are each limited to 400 words. The HRED model has three LSTM components, each with hidden size 150. The teacher models also use 150-dimensional LSTM networks as encoder and decoder. We train each teacher on its individual domain and use the model with the best performance as the guidance. For the student model, we use the Adam optimizer with a learning rate of 0.005. To balance the data training and the teacher guidance, the output distillation loss has a weight and the weight for policy one is set to 0.05.
To measure the performance of the different models, we use several metrics to judge the generated responses. First, we calculate sentence-level BLEU-4 scores to measure the similarity between the real response and the generated one. Because BLEU scores correlate weakly with the quality of the dialogue content, we apply further measurements. We use two metrics suggested by Budzianowski et al. (2018b) for evaluating the dialogue-context-to-text task on the MultiWOZ dataset. Both are measured at the episode level. The Inform rate indicates whether the dialogue system suggests suitable entities according to the user’s intent in an episode. The Success rate indicates whether the system provides all the correct properties the user requests after a successful inform. We run the models on the test set, which includes 1000 episodes of conversations, and compute the ratios of successfully informed dialogues and fully successful ones.
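The two episode-level ratios can be computed as in the sketch below. The per-episode flags `informed` and `provided_all` are hypothetical field names; deciding them for a real episode requires matching the generated responses against the user goal, which this example does not attempt. Following the description above, an episode only counts as a success if it was also informed.

```python
def inform_and_success_rates(episodes):
    """Return (inform_rate, success_rate) over a list of episodes.
    Each episode is a dict with two booleans:
      'informed'     - a suitable entity was suggested,
      'provided_all' - all requested properties were provided."""
    if not episodes:
        raise ValueError("at least one episode is required")
    n = len(episodes)
    informed = sum(1 for e in episodes if e["informed"])
    succeeded = sum(1 for e in episodes
                    if e["informed"] and e["provided_all"])
    return informed / n, succeeded / n


# Toy evaluation over four episodes.
toy_episodes = [
    {"informed": True, "provided_all": True},
    {"informed": True, "provided_all": False},
    {"informed": False, "provided_all": True},   # not informed, so not a success
    {"informed": True, "provided_all": True},
]
inform_rate, success_rate = inform_and_success_rates(toy_episodes)
```

By construction the success rate can never exceed the inform rate.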
6 Result & Analysis
The comparison between the models is shown in Table 1. In the multi-domain setting, our model (HRED-TS) achieves the best scores in both the inform rate and the success rate. Adding the teacher-student framework improves the inform rate and the success rate by 4% over the original model. Our model comes close to the one that takes manual states as input, which is considered an upper bound. The model with a GCE state tracker increases the inform rate but does not improve the dialogue success rate. We think that, although a tracker-based state has advantages in representing the dialogue process, the errors in the state tracker harm the performance of the model. Our model avoids this problem by using the teacher-student framework.
Table 1 also reports our dialogue models’ performance in the largest single domain, the restaurant domain. Our model produces the best results among all models and even outperforms the one with manual state input. We attribute this to the individual restaurant-domain teacher used during training, which yields better performance in this domain than the universal model. It is worth noting that the model with a GCE state has a higher inform rate than the one with the manual state. We believe the lower inform score with the manual state is caused by the influence of state labels from other domains.
Table 2 shows the effect of top-k distillation on our framework. The total success rate increases with top-k distillation, except when k is set to 8. Top-128 gives the best performance. A plausible explanation is that top-k distillation removes unnecessary information that the teachers give to the student, while keeping the significant words so that the response remains good. As k decreases, both the success rate and the inform rate drop because too little guidance remains. The result is slightly abnormal when k = 1. We believe that top-1 distillation amounts to adding more ground truth to the training data.
Table 3 shows the results of using different kinds of guidance. Policy distillation alone brings the smallest improvement over the original model. Output distillation has the highest inform rate of all guidance methods. With both policy distillation and output distillation, the model succeeds more often in the dialogue but has a lower inform rate, which suggests that policy guidance helps less with entity suggestion and more with providing properties, a task that involves more decision making over multiple turns. We also show that using a single universal model as the teacher is not as good as using individual teachers, since the inform rate of the universal distillation is lower than that of the output distillation.
7 Conclusion
In this paper, we propose a multi-domain dialogue generation model trained with a teacher-student framework. The model takes only raw text as input and takes full advantage of the human-labeled states during training. The model performs better than the one using an external state tracker, with improvements in the success rate during a conversation.
A limitation of our model is that it focuses on text generation during the conversation and does not consider knowledge base querying, so it cannot be regarded as a complete dialogue system. We do not think this is insurmountable: adding an extra component such as a memory network could solve the problem.
References
- Budzianowski et al. (2018a). Towards end-to-end multi-domain dialogue modelling.
- Budzianowski et al. (2018b). MultiWOZ - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 5016–5026.
- Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, A. Moschitti, B. Pang, and W. Daelemans (Eds.), pp. 1724–1734.
- Dzikovska et al. (2003). Integrating linguistic and domain knowledge for spoken dialogue systems in multiple domains. In Proc. of IJCAI-03 Workshop on Knowledge and Reasoning in Practical Dialogue Systems.
- Fan et al. (2018). Learning to teach. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
- Goel et al. (2018). Flexible and scalable state tracking framework for goal-oriented dialogue systems. CoRR abs/1811.12891.
- Hinton et al. (2015). Distilling the knowledge in a neural network. CoRR abs/1503.02531.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Kim and Rush (2016). Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, J. Su, X. Carreras, and K. Duh (Eds.), pp. 1317–1327.
- Mrksic et al. (2015). Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pp. 794–799.
- Mrksic et al. (2017). Neural belief tracker: data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan (Eds.), pp. 1777–1788.
- Nouri and Hosseini-Asl (2018). Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899.
- Pakucs (2003). Towards dynamic multi-domain dialogue processing. In 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003.
- Rastogi et al. (2017). Scalable multi-domain dialogue state tracking. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017, pp. 561–568.
- Serban et al. (2016). Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, D. Schuurmans and M. P. Wellman (Eds.), pp. 3776–3784.
- Serban et al. (2017). A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, S. P. Singh and S. Markovitch (Eds.), pp. 3295–3301.
- Sordoni et al. (2015). A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015, J. Bailey, A. Moffat, C. C. Aggarwal, M. de Rijke, R. Kumar, V. Murdock, T. K. Sellis, and J. X. Yu (Eds.), pp. 553–562.
- Sun et al. (2014). A generalized rule based tracker for dialogue state tracking. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pp. 330–335.
- Tan et al. (2019). Multilingual neural machine translation with knowledge distillation. CoRR abs/1902.10461.
- Thomson and Young (2010). Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech & Language 24 (4), pp. 562–588.
- Ultes et al. (2017). PyDial: A multi-domain statistical dialogue system toolkit. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, System Demonstrations, M. Bansal and H. Ji (Eds.), pp. 73–78.
- Wen et al. (2016). Multi-domain neural network language generation for spoken dialogue systems. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, K. Knight, A. Nenkova, and O. Rambow (Eds.), pp. 120–129.
- Yao et al. (2015). Attention with intention for a neural network conversation model. CoRR abs/1510.08565.