Research on dialogue systems has attracted a lot of attention recently due to its potentially huge value in reducing human cost in many commercial domains, such as restaurant reservation Henderson et al. (2014a), travel planning Peng et al. (2017) and retail service Zhang et al. (2019). The majority of goal-oriented dialogue agents are expected to complete tasks efficiently, with as few dialogue turns as possible, in contrast with their chit-chat counterparts Ritter et al. (2011). The following two examples show an efficient and an inefficient case respectively. The scenario is restaurant reservation and the agent’s goal is to reserve a table at noon.
“Human: We don’t have empty tables at 11 o’clock tomorrow. All are reserved.” “Agent: How about 12 o’clock? We are also okay then.”
“Human: We don’t have empty tables at 11 o’clock tomorrow. All are reserved.” “Agent: What else time is available?” “Human: 12 o’clock is ok.” “Agent: All right. We want that time.”
To express the same opinion, the inefficient example consumes four turns while the efficient example needs only two. As can be seen, efficiency is important for goal-oriented dialogue systems to achieve goals quickly.
Usually, a dialogue system consists of a pipeline of natural language understanding (NLU), dialogue management (DM) and natural language generation (NLG), where the DM part is treated as two separate components: dialogue state tracking (DST) and dialogue control (DC, i.e. dialogue policy selection). The DM part is widely considered to determine the dialogue’s efficiency, because it decides what to say in the next turn. Recently, methods based on reinforcement learning have been proposed for the policy selection component to build efficient dialogue systems. However, reinforcement learning based methods have some drawbacks. For example, they require a lot of human work to design the learning strategy. Also, a real-world environment for the agent to learn from, which is essential, is expensive to obtain, for example from domain experts. Moreover, training the dialogue manager as two separate components could lead to the error propagation issue Rastogi et al. (2018).
In addition to reinforcement learning based methods, sequence-to-sequence based methods have also become popular recently, because they can learn a dialogue agent purely from data with little human effort. The error propagation issue can also be reduced because they are end-to-end, and they scale better to different scenarios. However, it is difficult to build efficient dialogue agents with those methods, since their training objectives are usually inclined toward general responses, such as I don’t know, yes and OK, or they often generate the same response for totally different contexts because the contextual information is not well modeled Dodge et al. (2015).
In this paper, we address the problem of learning an efficient dialogue manager from the perspective of reducing manual intervention and error propagation, and propose a new sequence-to-sequence based approach. The proposed end-to-end model contains a novel looking-ahead module that lets the dialogue manager learn the looking-ahead ability. Our intuition is that by predicting the next several dialogue turns, the agent can make a better decision about what to say in the current turn, and therefore goals can be achieved sooner in the long run.
More specifically, our model includes three modules: (1) encoding module, (2) looking-ahead module, and (3) decoding module. At each dialogue turn, three kinds of information, the goals, historical utterances and the current user utterance, are utilized. First they are encoded by three separate Bidirectional Gated Recurrent Units (BiGRU) models. Then the three encoded embeddings are concatenated to one vector, which is then sent to a new bidirectional neural network that can look ahead for several turns. The decoding module will generate utterances for each turn through a learned language model. At last, by considering all the predicted future utterances, a new real system utterance for the next turn is re-generated by using an attention model through the same language model.
Our proposed approach has several advantages. First, it is an end-to-end model and does not require much human effort for system design. Although the goals must be handcrafted for each specific scenario, the number of goals is small and defining them is relatively easy. Moreover, compared with naive sequence-to-sequence based models, our agent can make the dialogue more efficient by modeling the looking-ahead ability. Experimental results show that our model performs better than the baselines on two datasets from different domains, which suggests that our model is also scalable to various domains.
The contributions in this paper include:
We identify the problem of how to make dialogues efficient with as little manual intervention as possible during system design, from the perspective of end-to-end deep learning.
We propose a novel end-to-end and data-driven model that enables the dialogue agent to learn to look ahead and make efficient decisions about what to say in the next turn.
Experiments conducted on two datasets demonstrate that our model outperforms the baselines and can be applied to different domains.
2 Related Work
In most situations, dialogue systems require handcrafted definitions of dialogue states and dialogue policies Williams and Young (2007); Henderson et al. (2014b); Asher et al. (2012); Chen et al. (2017). Those methods make the pipeline of dialogue systems clear to design and easy to maintain, but suffer from massive, expensive human effort and the error propagation issue Henderson et al. (2014c); Liu and Lane (2017).
Reinforcement learning based methods for dialogue policy selection have been widely studied recently Lipton et al. (2018); Dhingra et al. (2017); Zhao and Eskenazi (2016); Su et al. (2016). These methods only need humans to design the learning strategies and do not require massive training data. However, expensive domain knowledge and human expert effort are necessary for the agents to learn from Liu et al. (2018); Shah et al. (2018). Therefore, hybrid methods that integrate supervised learning and reinforcement learning have been proposed recently Williams et al. (2017); Williams and Zweig (2016). As a consequence, collecting massive training data becomes another manual task.
More recently, end-to-end dialogue systems attract much attention because almost no human efforts are required and they are scalable for different domains Wen et al. (2017); Li et al. (2017); Lewis et al. (2017); Luo et al. (2019), especially with sequence-to-sequence based models Sutskever et al. (2014). Although those models have been proved to be effective on chit-chat conversations Ritter et al. (2011); Li et al. (2016a); Zhang et al. (2018), how to build agents that are goal-oriented with efficient dialogue managers through end-to-end approaches still remains questionable Bordes et al. (2017); Joshi et al. (2017), and we investigate the question in this paper.
Our idea of enabling the agent to be efficient by modeling the looking-ahead ability is inspired by the concept of AI planning, which is a traditional search technique in the field of AI and is suitable for goal-based tasks, such as robotics control Norvig and Russell (1995). Recently, the concept has been adopted by the dialogue system community and integrated into deep learning models. For example, a trade-off method for training agents with neither real humans nor user simulators has been proposed in order to obtain better policy learning results Peng et al. (2018). In addition, the planning idea was used earlier to improve the dialogue generation task Stent et al. (2004); Walker et al. (2007).
3 End-to-end Dialogue Model
We propose an end-to-end model that contains three modules: (1) encoding module, (2) looking-ahead module, and (3) decoding module. Figure 1 shows the model architecture. We leverage Bidirectional GRU models Bahdanau et al. (2014) to encode agent goals, historical and current utterances. Then the obtained representations by encoding goals and utterances are regarded as inputs of the looking-ahead module, and they are used to predict several future turns. At last the predicted future turns are merged by an attention model and the new real system utterance is generated for the next turn.
Suppose for each dialogue session we have turns, and we do not distinguish whether it is user’s turn or system’s turn. If the agent has goals that are denoted as , each goal is formalized as a binary vector. For example in the restaurant reservation scenario, we can define that each variate in the vector corresponds to a yes-no condition, such as the means agent accepts bar table and the means agent does not want to change time. As to the utterance information, imagine at turn , we denote utterances for historical ones and for current user utterance. Our model predicts the system and user utterances for the next turns and then a new is generated as the system utterance after considering all the predicted turns. The model separates the current user utterance from historical ones in order to highlight the user’s current states. In general, the model is end-to-end and needs little human intervention or domain knowledge.
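As an illustration of the goal formalization above, the following minimal sketch maps yes-no conditions to a binary vector. The condition names and the helper `encode_goals` are hypothetical and only mirror the restaurant example in the text; the actual goal pool is not specified here.

```python
from typing import Dict, List

# Hypothetical yes/no conditions for the restaurant scenario; the names are
# illustrative and are not taken from the paper's goal pool.
GOAL_CONDITIONS = ["accepts_bar_table", "can_change_time", "accepts_vip_room"]


def encode_goals(goals: Dict[str, bool]) -> List[int]:
    """Map yes/no goal conditions to a fixed-order binary vector.

    Conditions missing from `goals` are treated as "no" (0), which is an
    assumption of this sketch.
    """
    return [1 if goals.get(name, False) else 0 for name in GOAL_CONDITIONS]


# The agent accepts a bar table and does not want to change the time.
agent_goals = {"accepts_bar_table": True, "can_change_time": False}
goal_vector = encode_goals(agent_goals)
print(goal_vector)  # [1, 0, 0]
```

Keeping the condition order fixed means each position in the vector always refers to the same yes-no condition across dialogue sessions.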
3.1 Encoding Module
In this module, the agent goals, historical utterances within the dialogue session, and the current user utterance are encoded by using three GRU models which is expected to learn long-range temporal dependencies Cho et al. (2014). is defined to encode agent’s goals and the final hidden state is taken as the representation of goals. The input of is a one-hot binary vector with length . is used to encode the historical utterances, and is used to encode the current user utterance. and are denoted as the final encoded representations of and respectively.
To get the -th hidden state for the three GRUs, respective inputs include the previous hidden state , or , and the embeddings of current observations, , or , where is a goal, is an utterance and is a token. For the textual tokens, we use the Word2vec embeddings as their representations Mikolov et al. (2013). Then the token embeddings are averaged to represent utterances. The formal denotation of the hidden states for the three GRU models is:
where represents the embeddings.
The final output of the encoding module is a concatenation of , and , which is denoted as . serves as the input of the following looking-ahead module. The right arrow means the initial direction to train the looking-ahead module is from the current to the future.
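The two data-handling steps of the encoding module, averaging token embeddings into an utterance representation and concatenating the three encoded representations, can be sketched as follows. This is a simplified illustration in plain Python: the toy embedding table stands in for Word2vec, the fixed vectors stand in for the GRU final hidden states, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_embedding(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Represent an utterance as the mean of its token embeddings."""
    vectors = [table[tok] for tok in tokens if tok in table]
    if not vectors:
        # No known token: fall back to a zero vector (an assumption of this sketch).
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def concatenate(*parts: Vector) -> Vector:
    """Concatenate several representations into a single vector."""
    merged: Vector = []
    for part in parts:
        merged.extend(part)
    return merged


# Toy 2-dimensional embedding table standing in for Word2vec.
toy_table = {"book": [1.0, 0.0], "table": [0.0, 1.0]}
utterance_vec = average_embedding(["book", "table"], toy_table, dim=2)

# Placeholder vectors standing in for the goal and history encodings.
goal_repr = [1.0, 0.0]
history_repr = [0.2, 0.4]
encoder_output = concatenate(goal_repr, history_repr, utterance_vec)
```

In the model itself the concatenated parts are the final hidden states of the three encoders rather than raw averages; the sketch only shows the shape of the computation.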
3.2 Looking-ahead Module
With the output of the encoding module as input, this module predicts several future dialogue turns. Since the process is sequential, we use a recurrent neural network to model it. In order to exploit the predicted information for later generating a real system utterance, another recurrent neural network is used to backtrack the information from the future to the current turn. To reduce the computing cost, the two neural networks share the same parameters, and the whole looking-ahead module looks similar to a bidirectional GRU, as shown in Figure 1.
We denote the module as . represent the predicted hidden states for future turns. To get , the hidden states from two directions, and , are concatenated. To calculate each or , their inputs include the previous hidden state and the previously-predicted hidden state. Formally, suppose we look ahead for turns, the hidden state of is calculated as following:
where is a weight parameter and is the hidden state for predicting future turns. If the number of looking-ahead turns is zero, our model has no looking-ahead ability and degrades to a naive goal-based sequence-to-sequence model.
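The roll-out described in this section can be illustrated with a toy sketch. It is not the model's actual recurrence: a scalar-weight tanh cell (`toy_cell`, hypothetical) replaces the GRU, the backward pass starts from a zero state, and feeding the previously predicted state as the cell input is an assumption based on the description above. What it does show is the forward pass into the future, the backward pass with the same shared parameters, and the per-turn concatenation of the two directions.

```python
import math
from typing import List

Vector = List[float]


def toy_cell(h_prev: Vector, x: Vector, w_h: float = 0.5, w_x: float = 0.5) -> Vector:
    """Stand-in for a GRU cell: an element-wise tanh recurrence."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(h_prev, x)]


def look_ahead(encoded: Vector, k: int) -> List[Vector]:
    """Return one concatenated (forward + backward) state per predicted turn."""
    # Forward pass: roll out k future states starting from the encoder output.
    forward: List[Vector] = []
    h = list(encoded)
    for _ in range(k):
        h = toy_cell(h, h)  # the input is the previously predicted state
        forward.append(h)

    # Backward pass: the same cell (shared parameters), from the future back
    # to the current turn.
    backward: List[Vector] = [[] for _ in range(k)]
    h = [0.0] * len(encoded)
    for i in reversed(range(k)):
        h = toy_cell(h, forward[i])
        backward[i] = h

    # Concatenate the two directions for each predicted turn.
    return [f + b for f, b in zip(forward, backward)]


states = look_ahead([0.1, -0.2], k=3)
```

With `k=0` the function returns an empty list, which corresponds to the degenerate case without any looking-ahead ability.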
3.3 Decoding Module
where is the attention weight parameter and is the input representation for generating a new that is regarded as the real system utterance.
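A minimal sketch of merging the predicted future turns with attention is given below. The dot-product scoring and the `query` vector are assumptions made for illustration; the exact form of the attention weight parameter used by the model is not recoverable from the text.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(states: List[Vector], query: Vector) -> Tuple[List[float], Vector]:
    """Weight the predicted states by their match with `query` and sum them."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return weights, context


# Two predicted future states and a hypothetical query vector.
weights, context = attend([[1.0, 0.0], [0.0, 1.0]], query=[2.0, 0.0])
```

The resulting `context` vector plays the role of the input representation from which the real system utterance is re-generated.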
Given the hidden state , the decoding module can also generate the corresponding utterance for learning the looking-ahead ability. We share the parameters of decoding with those in the encoding module, in order to reduce the computing cost Vinyals and Le (2015). The token sequence in is generated from left to right by selecting the tokens with the maximum probability through a language model learned by the following equation:
3.4 Model Training
To train the proposed model, we define a loss function to maximize three terms: (1) a language model for predicting tokens in language generation, (2) the probability distribution of predicting utterances of future dialogue turns, and (3) a binary classifier to predict if the dialogue will be complete or not. The final joint loss function is formally denoted as:
Here, a sigmoid function produces the completion prediction, and the label of the dialogue that the current user utterance belongs to is 1 if the dialogue ends with the goals achieved and 0 otherwise. The three terms are weighted with two hyper-parameters. We adopt stochastic gradient descent to minimize the loss.
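The structure of the joint objective, three terms combined with two weights, can be sketched as below. The names `alpha` and `beta`, the choice of which terms they scale, and the use of negative log-likelihoods for the first two terms are assumptions of this sketch rather than the paper's exact formulation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logit: float, label: int, eps: float = 1e-12) -> float:
    """Loss of the dialogue-completion classifier for a single dialogue."""
    p = sigmoid(logit)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


def joint_loss(lm_nll: float, look_ahead_nll: float,
               completion_logit: float, completion_label: int,
               alpha: float, beta: float) -> float:
    """Weighted sum of the three training terms.

    lm_nll:          negative log-likelihood of the language model
    look_ahead_nll:  negative log-likelihood of the predicted future turns
    alpha, beta:     hypothetical names for the two weighting hyper-parameters
    """
    completion_loss = binary_cross_entropy(completion_logit, completion_label)
    return lm_nll + alpha * look_ahead_nll + beta * completion_loss


loss = joint_loss(lm_nll=1.0, look_ahead_nll=2.0,
                  completion_logit=0.0, completion_label=1,
                  alpha=0.5, beta=1.0)
```

Setting one of the weights to zero removes the corresponding term from the objective, which is one way to obtain ablated variants of the model.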
In the looking-ahead module, the hidden state is used to generate an utterance , and is also used to calculate and . We design an EM-like algorithm to optimize the loss function, as described in Algorithm 1. Line 3-4 optimize the language model, i.e. the first term of . Line 5-16 optimize the looking-ahead module, i.e. the second term, among which Line 7-14 are for E-step and Line 15-16 are for M-step. In E-step the language model is fixed for updating all the hidden states in looking-ahead module, and in M-step all the hidden states are fixed for updating the language model. Line 17-18 optimize the third term of , which is a binary classifier.
4.1 Data Collection
We use two datasets for two different scenarios to evaluate our model. Table 1 shows the statistics of two datasets.
4.1.1 Dataset 1 - Object Division
Dataset 1 contains crowd-sourced dialogues between humans collected on the Amazon Mechanical Turk platform Lewis et al. (2017). The dataset is for an object division task in which both sides have separate goals defined by each object’s value. We use the textual data and transform the goals into yes-no questions as our binary vectors. The final state of each dialogue session, agree or disagree, is used for training the agent.
4.1.2 Dataset 2 - Restaurant Reservation
To the best of our knowledge, there is no other public dataset for goal-oriented dialogues where the two sides have different goals. We therefore construct Dataset 2 to verify the scalability of our model. The common scenario of restaurant table reservation is chosen.
In this dataset, the two agents are expected to have different goals, and they converse with each other to find the intersection of their goals. Agent A plays the role of a customer and Agent B the restaurant server. At the beginning of each dialogue session, Agent A is given the available time slot, the number of people, and several other constraints (e.g. whether it can sit at the bar or not). All the constraints are regarded as its goals, represented by a binary vector. Similarly, Agent B has its own constraints (e.g. whether bar tables are available or not), which are also treated as goals represented by a binary vector. We predefine a pool of ‘goals’, and at the beginning of each dialogue session the goals for the two sides are randomly sampled separately from the pool. The two agents cannot see each other’s goals, and they converse in natural language until a final decision, agreement or disagreement, is reached. In summary, the objective of constructing this dataset is to see whether our model can reach the intersection of the two agents’ goals more efficiently.
To generate dialogues for Dataset 2, we resort to a rule-based method via AI planning search Ghallab et al. (2016); Jiang et al. (2019). The Watson AI platform (https://www.ibm.com/watson/ai-assistant/) is leveraged for natural language understanding by defining intents and entities with examples. A planner is designed for the dialogue manager by defining several states and actions. The goals are represented as part of the states, and the STRIPS algorithm is used to search for the shortest path to the goals at each turn and return the first planned action for generating the next response. Each action has several handcrafted utterances, since the diversity of utterances is not our focus in this paper. Table 2 shows a sample dialogue.
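The planning step can be illustrated with a small STRIPS-style sketch. States are sets of facts, each action has preconditions, add effects and delete effects, and a breadth-first search returns a shortest action sequence, of which only the first action is executed. The action names and facts below are hypothetical, and breadth-first search is an assumed search strategy; the planner used for data generation is not described at this level of detail.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str]


def plan(initial: Iterable[str], goal: Iterable[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search for a shortest sequence of actions reaching the goal."""
    start = frozenset(initial)
    target = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if target <= state:
            return path
        for action in actions:
            if action.preconditions <= state:
                nxt = (state - action.del_effects) | action.add_effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [action.name]))
    return None  # the goal is unreachable


def next_action(initial: Iterable[str], goal: Iterable[str],
                actions: List[Action]) -> Optional[str]:
    """Return only the first planned action, as done at each dialogue turn."""
    path = plan(initial, goal, actions)
    return path[0] if path else None


# Hypothetical actions for the restaurant scenario.
ACTIONS = [
    Action("ask_time", frozenset(), frozenset({"time_known"}), frozenset()),
    Action("propose_bar", frozenset({"time_known", "table_unavailable"}),
           frozenset({"bar_offered"}), frozenset()),
    Action("confirm", frozenset({"bar_offered"}), frozenset({"reserved"}), frozenset()),
]

full_plan = plan({"table_unavailable"}, {"reserved"}, ACTIONS)
first = next_action({"table_unavailable"}, {"reserved"}, ACTIONS)
```

Re-planning at every turn from the updated state lets the rule-based agent react to whatever the other side says while still following a shortest path to its goals.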
Table 1: Statistics of the two datasets.

| Metric | Dataset 1 | Dataset 2 |
|---|---|---|
| Number of Dialogues | 5,808 | 1,613 |
| Average Turns per Dialogue | 6.6 | 6.3 |
| Average Words per Turn | 7.6 | 8.9 |
| Number of Words | 566,779 | 98,726 |
| % Goal Achieved | 80.1% | 71.5% |

Table 2: A sample dialogue.

|Alice: May I reserve a table for 6 people at 17 tomorrow?|
|Bob: Sorry, we don’t have a table at this point.|
|Alice: Can we sit at the bar then?|
|Bob: We don’t have a bar in the restaurant.|
|Alice: Can I have more expensive tables then?|
|Bob: My apologies, we are required not to do that.|
|Alice: In this case, can I reserve a bigger table?|
|Bob: Yes, we have VIP rooms but more expensive.|
|Alice: I want that.|
4.2 Training Sample Preparation
For each dialogue session with turns, we re-organize the utterances into samples. For each turn , we can get the current user utterance , and a training sample is created with a historical utterance sequence , and the goals are consistent with the same dialogue session. The future turns of utterances are used as the supervised information. In total, we get 38,333 and 10,162 samples including training set and test set for the two datasets respectively.
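The re-organization of a dialogue session into training samples can be sketched as follows. The field names and the helper `build_samples` are hypothetical. Producing one sample per turn is an assumption, though it agrees with the reported totals, which are close to the number of dialogues multiplied by the average number of turns (5,808 × 6.6 and 1,613 × 6.3).

```python
from typing import Dict, List


def build_samples(utterances: List[str], goals: List[int],
                  look_ahead: int) -> List[Dict[str, object]]:
    """Create one training sample per turn of a dialogue session."""
    samples = []
    for t, current in enumerate(utterances):
        samples.append({
            "goals": goals,                    # shared by the whole session
            "history": utterances[:t],         # all utterances before turn t
            "current": current,                # utterance at turn t
            # the next `look_ahead` utterances serve as supervision
            "future": utterances[t + 1:t + 1 + look_ahead],
        })
    return samples


dialogue = [
    "May I reserve a table for 6 people?",
    "Sorry, we don't have a table at this point.",
    "Can we sit at the bar then?",
    "Yes you can.",
]
samples = build_samples(dialogue, goals=[1, 0, 0], look_ahead=2)
```

Near the end of a session fewer future turns are available, so the `future` list simply becomes shorter; how the model handles these truncated targets is not specified in this sketch.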
Since our model is based on purely data-driven learning, we compare our model with the supervised counterparts. Our baselines include:
Seq2Seq(goal): This is a naive baseline by adapting the sequence-to-sequence model Sutskever et al. (2014) and encoding goals, which removes the looking-ahead module and the supervised information of final state prediction from our model.
Seq2Seq(goal+state): This is a baseline model by removing the looking-ahead module from our proposed model. The parameter is set to zero.
Seq2Seq(goal+look): This is a baseline model by removing the supervised information of final state prediction from our model. The parameter is set to zero.
Seq2Seq(goal+look+state): This is our proposed model that includes all the modules and supervised information.
4.4 Evaluation Criteria
A dialogue system can be treated as efficient if it achieves more final goals with as few dialogue turns as possible. Thus we set two criteria for evaluating and comparing the models in our experiments: (1) the goal achievement ratio, i.e. the number of goal-achieved dialogues divided by the number of attempted dialogues, and (2) the average number of dialogue turns.
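Both criteria are simple to compute from logged sessions, as the sketch below shows. The session format (a success flag and a turn count) is hypothetical, and averaging the turns over all attempted dialogues rather than only the successful ones is an assumption.

```python
from typing import List, Tuple


def evaluate(sessions: List[Tuple[bool, int]]) -> Tuple[float, float]:
    """Compute (goal achievement ratio, average dialogue turns).

    Each session is a pair (goal_achieved, number_of_turns).
    """
    if not sessions:
        raise ValueError("at least one dialogue session is required")
    achieved = sum(1 for ok, _ in sessions if ok)
    ratio = achieved / len(sessions)
    average_turns = sum(turns for _, turns in sessions) / len(sessions)
    return ratio, average_turns


ratio, average_turns = evaluate([(True, 4), (False, 6), (True, 5), (True, 5)])
print(ratio, average_turns)  # 0.75 5.0
```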
Our experiments aim to achieve goals through conversations, and it is difficult to directly adopt existing simulators Asri et al. (2016). We follow the work of Li et al. (2016b) and fine-tune it for our task. For each dataset, a naive sequence-to-sequence model that encodes goals is used as the user simulator. We run 1,000 dialogue sessions using the simulator.
Apart from using the simulator, we also invite humans to converse with the agents, 100 times per person for each dataset, and we report the average results.
Table 3: Performance of each model on Dataset 1 and Dataset 2, evaluated vs. Simulator and vs. Human, in terms of % Achieved and # Turns.
4.6 Training Settings
All the baselines are implemented in PyTorch. One-hot input tokens are embedded into a 64-dimensional space. The goal encoder has a hidden layer of size 64. The hidden-state sizes of the input utterance encoders and of the looking-ahead module are all set to 256. Stochastic gradient descent is employed to optimize the model with a mini-batch size of 32 for supervised learning, an initial learning rate of 1.0, momentum, and gradients clipped at a norm of 0.5. The best model is chosen during the process of training the model for 400 epochs. After that, the learning rate decays by a factor of 2 for every epoch. The hyper-parameters in the loss function (Equation (11)) are set to initial values. Words that appear fewer than 5 times in the training dataset are replaced with the ‘unknown’ token. A validation dataset is employed to choose the optimal hyper-parameters.
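The rare-word replacement mentioned above can be sketched in a few lines. Whitespace tokenization and the spelling of the unknown token (`<unk>`) are assumptions of this sketch.

```python
from collections import Counter
from typing import Iterable, Set

UNK = "<unk>"  # assumed spelling of the 'unknown' token


def build_vocab(utterances: Iterable[str], min_count: int = 5) -> Set[str]:
    """Keep words that appear at least `min_count` times in the training data."""
    counts = Counter(tok for utt in utterances for tok in utt.split())
    return {tok for tok, count in counts.items() if count >= min_count}


def replace_rare(utterance: str, vocab: Set[str]) -> str:
    """Replace every out-of-vocabulary word with the unknown token."""
    return " ".join(tok if tok in vocab else UNK for tok in utterance.split())


corpus = ["book a table"] * 5 + ["book a booth"]
vocab = build_vocab(corpus, min_count=5)
print(replace_rare("book a booth", vocab))  # book a <unk>
```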
4.7 Results and Analysis
Table 3 shows the performance of baselines against user simulator and human on the two datasets. Both reveal that models that learn looking-ahead ability can achieve better performance and deliver more efficient dialogues in terms of both goal achievement ratio and dialogue turns. However, in the table, the dialogue turns of Seq2Seq(goal+look+state) are larger than those of Seq2Seq(goal+look), which may suggest that more dialogue turns lead to more achievement. In spite of this, the looking-ahead ability learned by our model is demonstrated to be effective on the two different scenarios. Moreover, the supervised information of final states (the third term of Equation (11)) is also proven effective in delivering more achievement, which can be seen from the second and last rows of Table 3. Compared to the human evaluation, the results with the simulator generally are better. It is probable that human evaluators tend to be more rigorous and more turns are necessary to achieve goals.
|Alice: i just want the book|
|Bob: no way i have the book and you can take others|
|Alice: i keep the book you can have the hat and two balls|
|Bob: i will take the book and you can have others|
|Alice: can we make a deal?|
|Bob: ok deal|
We also evaluate the models’ performance with different parameters. Figure 2 and Figure 3 show the results of using different numbers of looking-ahead turns in the looking-ahead module on Dataset 1 and Dataset 2 respectively. In most cases, Seq2Seq(goal+look+state) is better than Seq2Seq(goal+look). We recommend looking ahead for 3 turns, since our scenarios are relatively simple and 3 dialogue turns are enough for the agents. The results are consistent with those in Table 3.
The performance on different dimension sizes of hidden states used in all the baselines is evaluated and the results are shown in Figure 4 and Figure 5. We find that setting the size to 512 yields the optimal results by balancing the performance and the computing cost in most cases.
In addition, we present case studies of dialogues between the agent trained with Seq2Seq(goal+look+state) (the proposed model) and the simulator (Seq2Seq(goal)) for the two datasets. Alice denotes the agent and Bob denotes the simulator. Table 4 is a case where our model with the looking-ahead ability (Seq2Seq(goal+look+state)) tends to achieve goals while consuming more dialogue turns. Table 5 gives an example where our model not only tends to achieve goals but also takes fewer dialogue turns. Based on our observation, the language model for natural language generation is learned well, although confusing utterances are sometimes generated.
|Alice: can you help me book a table for 2 people at 6pm?|
|Bob: sorry, we dont have a table at this point|
|Alice: can i reserve the seats at the bar instead?|
|Bob: yes you can|
|Alice: can i book a table for us|
|Bob: sure i have written down your reservation|
|Alice: can you help me book a table for 2 people at 6pm?|
|Bob: sorry we dont have a table at this point|
|Alice: can i reserve the seats at the bar instead?|
|Bob: sure i have written down your reservation|
In this paper, we propose an end-to-end model for the problem of learning an efficient dialogue manager without much manual effort. We model the looking-ahead ability to foresee several turns, so that the agent can decide what to say in a way that leads the conversation to achieve goals with as few dialogue turns as possible. Experiments on two datasets from different domains demonstrate that our model is efficient in terms of goal achievement ratio and average dialogue turns. Our method is also scalable and can reduce error propagation due to the nature of end-to-end learning.
For future work, we plan to investigate whether other kinds of abilities, such as reasoning ability, can be modeled for the agent to address this problem. In addition to efficiency, the quality of natural language generation also deserves attention in order to guarantee the quality of the overall dialogue system.
The work is partially supported by SFSMBRP (2018YFB1005100), BIGKE (No. 20160754021), NSFC (No. 61772076 and 61751201), NSFB (No. Z181100008918002), CETC (No. w-2018018) and OPBKLICDD (No. ICDD201901). We thank Tian Lan, Henda Xu and Jingyi Lu for experiment preparation. We also thank the three anonymous reviewers for their insightful comments.
- Asher et al. (2012). Modelling strategic conversation: the STAC project. In SemDial, pp. 27.
- Asri et al. (2016). A sequence-to-sequence model for user simulation in spoken dialogue systems. In INTERSPEECH, pp. 1151–1155.
- Bordes et al. (2017). Learning end-to-end goal-oriented dialog. In ICLR.
- Chen et al. (2017). A survey on dialogue systems: recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19 (2), pp. 25–35.
- Cho et al. (2014). On the properties of neural machine translation: encoder-decoder approaches. In SSST-8, pp. 103–114.
- Su et al. (2016). Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689.
- Dhingra et al. (2017). Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL, pp. 484–495.
- Dodge et al. (2015). Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931.
- Ghallab et al. (2016). Automated planning and acting. Cambridge University Press.
- Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Henderson et al. (2014a). The third dialog state tracking challenge. In SLT, pp. 324–329.
- Henderson et al. (2014b). The second dialog state tracking challenge. In SIGDIAL, pp. 263–272.
- Henderson et al. (2014c). Word-based dialog state tracking with recurrent neural networks. In SIGDIAL, pp. 292–299.
- Jiang et al. (2019). A general planning-based framework for goal-driven conversation assistant. In AAAI, pp. 9857–9858.
- Joshi et al. (2017). Personalization in goal-oriented dialog. In NIPS.
- Lewis et al. (2017). Deal or no deal? end-to-end learning for negotiation dialogues. In EMNLP, pp. 2443–2453.
- Li et al. (2016a). A diversity-promoting objective function for neural conversation models. In NAACL, pp. 110–119.
- Li et al. (2017). End-to-end task-completion neural dialogue systems. In IJCNLP, pp. 733–743.
- Li et al. (2016b). A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.
- Lipton et al. (2018). BBQ-networks: efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In AAAI, pp. 5237–5244.
- Liu and Lane (2017). An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In INTERSPEECH, pp. 2506–2510.
- Liu et al. (2018). Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In NAACL, pp. 2060–2069.
- Luo et al. (2019). Learning personalized end-to-end goal-oriented dialog. In AAAI.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119.
- Vinyals and Le (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
- Norvig and Russell (1995). Artificial intelligence: a modern approach. Prentice Hall.
- Peng et al. (2018). Deep dyna-q: integrating planning for task-completion dialogue policy learning. In ACL, pp. 2182–2192.
- Peng et al. (2017). Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In EMNLP, pp. 2231–2240.
- Rastogi et al. (2018). Multi-task learning for joint language understanding and dialogue state tracking. In SIGDIAL, pp. 376–384.
- Ritter et al. (2011). Data-driven response generation in social media. In EMNLP, pp. 583–593.
- Shah et al. (2018). Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In NAACL, pp. 41–51.
- Stent et al. (2004). Trainable sentence planning for complex information presentation in spoken dialog systems. In ACL, pp. 79.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
- Walker et al. (2007). Individual and domain adaptation in sentence planning for dialogue. Journal of Artificial Intelligence Research 30, pp. 413–456.
- Wang et al. (2016). Attention-based lstm for aspect-level sentiment classification. In EMNLP, pp. 606–615.
- Wen et al. (2017). A network-based end-to-end trainable task-oriented dialogue system. In EACL, pp. 438–449.
- Williams et al. (2017). Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In ACL, pp. 665–677.
- Williams and Young (2007). Partially observable markov decision processes for spoken dialogue systems. Computer Speech & Language 21 (2), pp. 393–422.
- Williams and Zweig (2016). End-to-end lstm-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.
- Zhang et al. (2018). Context-sensitive generation of open-domain conversational responses. In COLING, pp. 2437–2447.
- Zhang et al. (2019). Neural multimodal belief tracker with adaptive attention for dialogue systems. In WWW, pp. 2401–2412.
- Zhao and Eskenazi (2016). Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL, pp. 1–10.