Task-oriented dialogue systems are an important tool for building personal virtual assistants, which help users complete many daily tasks by interacting with devices via natural language. This area is attracting increasing attention from researchers, and many works have been proposed Peng et al. (2018); Eric and Manning (2017); Lipton et al. (2018); Young et al. (2013); Wen et al. (2016); Lei et al. (2018); Schatzmann et al. (2007a, b).
Existing task-oriented dialogue systems usually consist of four components: (1) natural language understanding (NLU), which identifies the intent of the user; (2) a dialogue state tracker (DST), which keeps track of the user's goals and constraints in every turn; (3) a dialogue policy maker (DP), which generates the next available dialogue action; and (4) a natural language generator (NLG), which produces a natural language response based on the dialogue action. Among the four components, the dialogue policy maker plays a key role in completing dialogues effectively, because it decides the next dialogue action to be executed.
As far as we know, the dialogue policy makers in most existing task-oriented dialogue systems simply use classifiers over the predefined acts to obtain the dialogue policy Peng et al. (2018); Lipton et al. (2018); Wen et al. (2016); Liu and Lane (2017a, b). These classification-based dialogue policy learning methods can assign either a single dialogue act with its corresponding parameters Su et al. (2016); Lipton et al. (2018); Peng et al. (2018) or multiple dialogue acts without their corresponding parameters Chi et al. (2017) for a dialogue action. However, none of these existing methods can obtain multiple dialogue acts and their corresponding parameters for a dialogue action at the same time.
Intuitively, it is more reasonable to construct multiple dialogue acts and their corresponding parameters for a dialogue action at the same time: 49.4% of turns in the DSTC2 dataset and 61.5% of turns in the Maluuba dataset have multiple dialogue acts and their corresponding parameters as the dialogue action. If multiple dialogue acts and their corresponding parameters can be obtained at the same time, the final response of a task-oriented dialogue system becomes more accurate and effective. For example, as shown in Figure 1, a user wants the name of a cheap french restaurant. The correct dialogue policy should generate three acts in the current dialogue turn: offer(name=name_slot), inform(food=french) and inform(price=cheap), so that the response conveys the user's real need: "name_slot is a cheap french restaurant". If the offer act is lost, the system may generate a response like "There are some french restaurants", which is far from the user's goal.
To address this challenge, we propose a Generative Dialogue Policy model (GDP), which casts dialogue policy learning as a sequence optimization problem. The proposed model generates a sequence of acts and their corresponding parameters according to the learned dialogue policy. Specifically, it uses a recurrent neural network (RNN) as the action decoder to construct the dialogue policy maker instead of traditional classifiers. An attention mechanism helps the decoder decode the dialogue acts and their corresponding parameters, and a template-based natural language generator then uses the results of the dialogue policy maker to choose an appropriate sentence template as the final response to the user.
Extensive experiments conducted on two benchmark datasets verify the effectiveness of our proposed method. Our contributions in this work are three-fold.
- The existing methods cannot construct multiple dialogue acts and their corresponding parameters at the same time. In this paper, we propose a novel generative dialogue policy model to solve this problem.
- Extensive experiments demonstrate that the proposed model significantly outperforms state-of-the-art baselines on two benchmarks.
- We publicly release the source code.
2 Related Work
Existing task-oriented dialogue systems usually use a pipeline of four separate modules: natural language understanding, dialogue belief tracker, dialogue policy and natural language generator. Among these four modules, the dialogue policy maker plays a key role, as it generates the next dialogue action.
As far as we know, nearly all existing approaches obtain the dialogue policy by using classifiers over all predefined dialogue acts Su et al. (2017); Jurčíček et al. (2011). There are usually two kinds of dialogue policy learning methods. The first constructs a single dialogue act and its corresponding parameters for a dialogue action. For example, Peng et al. (2018) construct a simple classifier over all the predefined dialogue acts. Lipton et al. (2018) build a more complex classifier over some predefined dialogue acts and additionally add two acts for each parameter: one to inform its value and the other to request it. The second kind obtains the dialogue policy by multi-label classification, considering multiple dialogue acts but without their parameters: Chi et al. (2017) perform multi-label multi-class classification for dialogue policy learning, and the multiple acts can then be decided based on a threshold. On top of these classifiers, reinforcement learning can be used to further update the dialogue policy of task-oriented dialogue systems Young et al. (2013); Cuayáhuitl et al. (2015); Liu and Lane (2017b).
In real scenarios, a correct dialogue action usually consists of multiple dialogue acts and their corresponding parameters. However, it is very hard for existing classification-based dialogue policy makers to achieve this. Thus, in this paper we propose a novel generative dialogue policy maker that addresses this issue by casting dialogue policy learning as a sequence optimization problem.
3 Technical Background
3.1 Encoder-Decoder Seq2Seq Models
The Seq2Seq model was first introduced by Cho et al. (2014) for statistical machine translation. It uses two recurrent neural networks (RNNs) to solve the sequence-to-sequence mapping problem: an encoder, which encodes the user utterance into a dense vector representing its semantics, and a decoder, which decodes this vector into the target sentence. The Seq2Seq framework has already been used in task-oriented dialogue systems such as Wen et al. (2016) and Eric and Manning (2017), and shows competitive performance. In the Seq2Seq model, given the user utterance $X = (x_1, \dots, x_T)$, the encoder squeezes it into a context vector $c$, which is then used by the decoder to generate the response $Y = (y_1, \dots, y_{T'})$ word by word by maximizing the generation probability of $Y$ conditioned on $X$. The objective function of Seq2Seq can be written as:

$$p(Y \mid X) = p(y_1 \mid c) \prod_{t=2}^{T'} p(y_t \mid c, y_1, \dots, y_{t-1}).$$
In particular, the encoder RNN produces the context vector $c$ by the calculation below:

$$h_t = f(x_t, h_{t-1}), \qquad c = h_T,$$

where $h_t$ is the hidden state of the encoder RNN at time step $t$, and $f$ can be a long short-term memory unit (LSTM) Hochreiter and Schmidhuber (1997) or a gated recurrent unit (GRU) Cho et al. (2014). In this paper, we implement $f$ using a GRU.
The decoder RNN generates each word of the reply conditioned on the context vector $c$. The probability distribution over candidate words at every time step $t$ is calculated as:

$$s_t = f(y_{t-1}, s_{t-1}, c), \qquad p(y_t \mid y_1, \dots, y_{t-1}, X) = \mathrm{softmax}(W s_t),$$

where $s_t$ is the hidden state of the decoder RNN at time step $t$ and $y_{t-1}$ is the word generated in the reply at time step $t-1$.
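As a concrete illustration, the encoder recurrence $h_t = f(x_t, h_{t-1})$ with a GRU as $f$ can be sketched in plain numpy. The weights and dimensions below are illustrative assumptions, not the model's actual parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step h_t = f(x_t, h_{t-1}); the weight layout is an assumption."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # interpolate old/new state

# Encode a toy sequence: the final hidden state serves as the context vector c.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy embedding and hidden sizes (assumptions)
params = [rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)),
          rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)),
          rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))]
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # 5 toy "word embeddings"
    h = gru_step(x, h, params)
c = h  # context vector summarizing the sequence
```

Because each step interpolates between the previous state and a tanh candidate, the state stays bounded, which is one reason gated units train more stably than plain RNNs.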
3.2 Attention Mechanism
Attention mechanisms Bahdanau et al. (2014) have been shown to effectively improve generation quality for the Seq2Seq framework. In Seq2Seq with attention, each decoder state $s_t$ corresponds to a context vector $c_t$ that is calculated dynamically as a weighted average of all hidden states of the encoder RNN. Formally, $c_t$ is defined as $c_t = \sum_{k=1}^{T} \alpha_{tk} h_k$, where the weight $\alpha_{tk}$ is given by:

$$\alpha_{tk} = \frac{\exp(e_{tk})}{\sum_{j=1}^{T} \exp(e_{tj})}, \qquad e_{tk} = \eta(s_{t-1}, h_k),$$

where $\eta$ is a learned scoring function.
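The weighted-average computation above can be sketched as follows. The dot-product score used for $e_{tk}$ is an assumption for brevity; the scoring function $\eta$ could equally be a small feed-forward network:

```python
import numpy as np

def attention(s_prev, H):
    """c_t = sum_k alpha_tk * h_k with a (assumed) dot-product score e_tk."""
    scores = H @ s_prev                  # e_tk = <s_{t-1}, h_k> for each encoder step
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over encoder time steps
    return alpha @ H, alpha              # weighted average of encoder states

H = np.array([[1.0, 0.0],                # 3 toy encoder hidden states
              [0.0, 1.0],
              [1.0, 1.0]])
c, alpha = attention(np.array([1.0, 0.0]), H)
```

The softmax guarantees that the weights are non-negative and sum to one, so $c_t$ always lies in the convex hull of the encoder states.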
4 Generative Dialogue Policy
Figure 2 shows the overall system architecture of the proposed GDP model. Our model contains five main components: (1) utterance encoder; (2) dialogue belief tracker; (3) dialogue policy maker; (4) knowledge base; (5) template-based natural language generator. Next, we will describe each component of our proposed GDP model in detail.
4.1 Notations and Task Formulation
Given the user utterance $U_t$ at turn $t$ and the dialogue context $C_t$, which contains the result of the dialogue belief tracker at turn $t-1$, the task-oriented dialogue system first generates the user's intents $B_t$ with the dialogue belief tracker and then uses this information to obtain the knowledge base query result $k_t$. The model then generates the next dialogue action $A_t$ based on $k_t$, $C_t$ and $B_t$. The natural language generator provides the template-based response $R_t$ as the final reply by using $A_t$. Here $U_t$, $B_t$ and $A_t$ are sequences, and $k_t$ is a one-hot vector representing the number of query results. For the baselines, $A_t$ is the classification result for the next dialogue action, but in our proposed model it is a sequence that contains multiple acts and their corresponding parameters.
4.2 Utterance Encoder
A bidirectional GRU is used to encode the user utterance $U_t$, the last turn's system response $R_{t-1}$ and the dialogue context $C_t$ into a continuous representation, where $U_t$ is the user utterance at turn $t$, $C_t$ is the dialogue context produced by the dialogue belief tracker at turn $t-1$, and $R_{t-1}$ is the response made by our task-oriented dialogue system at the last turn. The words of this input are first mapped into an embedding space and then serve as the inputs at each step of the bidirectional GRU. Let $n$ denote the number of words in the input sequence, and let $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ represent the forward and backward GRU state outputs at time step $i$; the encoder output at time step $i$ is $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, and the summary vector is generated by concatenating the last forward and backward GRU states. The matrix $H = (h_1, \dots, h_n)$, where $d$ is the hidden size of the GRU, contains the encoder hidden state of each time step and will be used by the attention mechanism in the dialogue policy maker.
4.3 Dialogue State Tracker
The dialogue state tracker maintains the state of the conversation and collects the user's goals during the dialogue. Recent work has successfully represented this component with discriminative classifiers, but Lei et al. (2018) verified that generation is a better way to model the dialogue state tracker.
Specifically, we use a GRU as the generator to decode the belief state $B_t = (b_1, \dots, b_m)$ of the current turn, where $m$ is the length of $B_t$. To capture user intent information accurately, the basic attention mechanism of Eq. (4) is computed at each decoding step. The hidden state of the RNN in the dialogue state tracker at time step $j$ is denoted $s^B_j$, the token decoded at step $j$ is denoted $b_j$, and its embedding serves as the input of the next step; $d$ is the hidden size of the GRU.
4.4 Knowledge Base
The knowledge base is a database that stores information about the related task. For example, for restaurant reservation, the knowledge base stores information about all the restaurants, such as location and price. After the dialogue belief tracker, $B_t$ is used as the set of constraints to search for results in the knowledge base, and the one-hot vector $k_t$ is produced from the number of returned results.

The search result has a great influence on the dialogue policy; for example, if the result contains multiple matches, the system should request more constraints from the user. In practice, we let $k_t$ be a one-hot vector of 20 dimensions representing the number of query results; $k_t$ then serves as a cue for the dialogue policy maker.
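A minimal sketch of producing the 20-dimensional one-hot vector from a query-result count; capping large counts into the last bucket is an assumption, since the source does not say how counts beyond the vector size are handled:

```python
def kb_onehot(num_results, dim=20):
    """One-hot vector k_t encoding the number of KB matches.

    Counts of dim-1 or more share the last bucket (a capping assumption).
    """
    v = [0] * dim
    v[min(num_results, dim - 1)] = 1
    return v

k_t = kb_onehot(3)  # e.g. three restaurants matched the constraints
```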
4.5 Dialogue Policy Maker
In task-oriented dialogue systems, supervised classification is a straightforward solution for dialogue policy modeling. However, we observe that classification cannot hold enough information for dialogue policy modeling. The generative approach is another way to model the dialogue policy maker for task-oriented dialogue systems: it generates the next dialogue acts and their corresponding parameters word by word based on the dialogue context, thus converting the dialogue policy learning problem into a sequence optimization problem.
The dialogue policy maker generates the next dialogue action $A_t$ based on $[k_t; B_t]$ and the encoder states $H$. Our proposed model uses a GRU as the action decoder to decode the acts and their parameters for the response. In particular, at step $j$, to decode the token $a_j$ of $A_t$, the decoder GRU takes the embedding of $a_{j-1}$ to generate a hidden vector $s^A_j$. The basic attention mechanism is computed over both the utterance encoder states and the dialogue state tracker states, yielding the context vector $c^U_j$ of the input utterance and the context vector $c^B_j$ of the dialogue state tracker, and the token $a_j$ decoded at time step $j$ is obtained by a softmax over $s^A_j$. The final result of the dialogue policy maker is denoted $A_t = (a_1, \dots, a_l)$, where $l$ is its length. In our proposed model, the dialogue policy maker can be viewed as the decoder of a seq2seq model conditioned on $[k_t; B_t]$ and $H$.
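To make the decoded action concrete, the flat token sequence (e.g. "offer name name_slot inform food french") can be grouped back into act/parameter triples. The act vocabulary and the act-slot-value layout below are assumptions for illustration:

```python
ACTS = {"offer", "inform", "request"}  # assumed act vocabulary

def parse_action(tokens):
    """Group a flat decoded sequence A_t into (act, slot, value) triples.

    Assumes each act token is followed by its slot/value tokens
    until the next act token appears.
    """
    triples, current = [], None
    for tok in tokens:
        if tok in ACTS:
            current = [tok]          # start a new triple at each act token
            triples.append(current)
        elif current is not None:
            current.append(tok)      # attach parameter tokens to the open act
    return [tuple(t) for t in triples]

triples = parse_action("offer name name_slot inform food french".split())
# -> [('offer', 'name', 'name_slot'), ('inform', 'food', 'french')]
```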
4.6 Natural Language Generator
After obtaining the dialogue action from the learned dialogue policy maker, the task-oriented dialogue system needs to generate an appropriate response for the user. We construct the natural language generator using template sentences: for each dataset, we extract all the system responses and manually modify them to construct sentence templates. In our proposed model, the sequence of acts and parameters is used to search for the appropriate template, whereas the classification-based baselines use the categories of acts and their corresponding parameters to search for the corresponding template.
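A toy sketch of this template lookup, assuming (hypothetically) that templates are keyed by the act/slot skeleton of the decoded action and filled with the decoded values:

```python
def generate(triples, templates):
    """Pick the template matching the act/slot skeleton, then fill in values.

    `triples` is a list of (act, slot, value); `templates` maps a tuple of
    "act(slot)" strings to a format string. Both layouts are assumptions.
    """
    key = tuple(f"{act}({slot})" for act, slot, _ in triples)
    tmpl = templates.get(key)
    if tmpl is None:
        return None  # no template matches this action skeleton
    return tmpl.format(**{slot: val for _, slot, val in triples})

templates = {  # hypothetical hand-written templates for one dataset
    ("offer(name)", "inform(food)"): "{name} is a nice {food} restaurant.",
}
reply = generate([("offer", "name", "name_slot"), ("inform", "food", "french")],
                 templates)
```

Keying on the full act/slot skeleton is what lets a generative policy select richer templates than an act-label-only classifier could.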
4.7 Training

In supervised learning, because our proposed model is built in a seq2seq way, the standard cross entropy is adopted as the objective function to train the dialogue belief tracker and the dialogue policy maker.
After supervised learning, the dialogue policy can be further updated by reinforcement learning. In the context of reinforcement learning, the decoder of the dialogue policy maker can be viewed as a policy network $\pi_{\theta}(a_j)$ for decoding $a_j$, where $\theta$ denotes the parameters of the decoder. Accordingly, the hidden state created by the GRU is the corresponding state, and the choice of the current token $a_j$ is an action (an "action" here in the reinforcement learning sense, distinct from the dialogue action).
The reward function is also very important for reinforcement learning when decoding each token. To encourage our policy maker to generate correct acts and their corresponding parameters, we set the reward as follows: if the dialogue acts and their parameters are decoded correctly, the reward is 2; if only the label of the dialogue act is decoded correctly but the parameters are wrong, the reward is 1; otherwise, the reward is -5. A decay parameter $\gamma$ discounts the reward over decoding steps; more details are shown in Sec. 5.3. In our proposed model, rewards can only be obtained at the end of decoding $A_t$. To obtain a reward at each decoding step, we sample some completions after choosing $a_j$, and the reward of $a_j$ is set to the average of all the sampled results' rewards.
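The reward scheme above, together with discounting by the decay parameter, can be sketched as follows; comparing at the granularity of (act, slot, value) triples is an assumption:

```python
def step_reward(pred, gold):
    """Reward for one decoded (act, slot, value) triple, per the scheme above."""
    if pred == gold:
        return 2.0    # act and parameters all correct
    if pred[0] == gold[0]:
        return 1.0    # act label correct, parameters wrong
    return -5.0       # act label wrong

def returns(rewards, gamma=0.8):
    """Discounted return at each decoding step with decay parameter gamma."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

rewards = [step_reward(("inform", "food", "french"), ("inform", "food", "french")),
           step_reward(("inform", "food", "thai"), ("inform", "food", "french"))]
```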
To ensure that the model's performance remains stable during the fine-tuning phase of reinforcement learning, we freeze the parameters of the utterance encoder and the dialogue belief tracker; only the parameters of the dialogue policy maker are optimized by reinforcement learning. The policy gradient algorithm REINFORCE Williams (1992) is used to update the pretrained dialogue policy maker:

$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{j=1}^{l} R_j \, \nabla_{\theta} \log \pi_{\theta}(a_j)\right],$$

where $l$ is the length of the decoded action and $R_j$ is the reward at step $j$. The objective function can be optimized by gradient descent.
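A toy single-step version of the REINFORCE update for a softmax policy over tokens (illustrative only; the real update runs over every step of the decoded action):

```python
import numpy as np

def reinforce_grad(logits, action, reward):
    """Gradient of -reward * log pi(action) w.r.t. the logits of a softmax policy.

    Minimizing this with gradient descent raises the probability of
    actions that earned positive reward.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()                 # softmax probabilities pi(a)
    grad = p.copy()
    grad[action] -= 1.0          # d(-log softmax(a)) / d logits
    return reward * grad

logits = np.zeros(3)             # uniform toy policy over 3 tokens
g = reinforce_grad(logits, action=0, reward=2.0)
```

With a positive reward, the gradient at the chosen token's logit is negative, so a descent step increases that token's probability, as intended.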
5 Experiments

We evaluate the performance of the proposed model in three respects: (1) the accuracy of the dialogue state tracker, which shows the impact of the state tracker on the dialogue policy maker; (2) the accuracy of the dialogue policy maker, which compares different ways of constructing the dialogue policy; and (3) the quality of the final response, which shows the impact of the dialogue policy on the final dialogue response. The evaluation metrics are listed as follows:
BPRA: Belief Per-Response Accuracy (BPRA) tests the ability to generate the correct user intents during the dialogue. This metric is used to evaluate the accuracy of dialogue belief tracker Eric and Manning (2017).
APRA: Action Per-Response Accuracy (APRA) evaluates the per-turn accuracy of the dialogue actions generated by the dialogue policy maker. For the baselines, APRA evaluates the classification accuracy of the dialogue policy maker. Our model, however, generates each individual token of the action, and we consider a prediction to be correct only if every token of the model output matches the corresponding token in the ground truth.
BLEU Papineni et al. (2002): This metric evaluates the quality of the final response produced by the natural language generator and is widely used to measure the performance of task-oriented dialogue systems.
We also choose the following metrics to evaluate the efficiency of training the model:
$T_{model}$: The time for training the whole model, which is important in industrial settings.
$T_{policy}$: The time for training the dialogue policy maker in a task-oriented dialogue system.
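The exact-match criterion behind APRA can be sketched as:

```python
def apra(predictions, golds):
    """Per-response action accuracy: a prediction counts as correct only if
    every token matches the ground-truth action sequence exactly."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)

acc = apra([["offer", "name", "name_slot"], ["inform", "food", "thai"]],
           [["offer", "name", "name_slot"], ["inform", "food", "french"]])
```

This is a strictly harder criterion for a generative model than for a classifier, since a single wrong token anywhere in the sequence invalidates the whole turn.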
5.1 Datasets

Table 1: Detailed slot information of the DSTC2 and Maluuba datasets.

| | DSTC2 | Maluuba |
|---|---|---|
| Actions | 11 (offer, inform, request, etc.) | 16 (offer, inform, request, etc.) |
| Slots | 8 (area, food, price, etc.) | 60 (startdate, enddate, etc.) |
| Distinct values | – | inf (continuous values) |
We adopt the DSTC2 Henderson et al. (2014) and Maluuba Asri et al. (2017) datasets to evaluate our proposed model. Both are benchmark datasets for building task-oriented dialogue systems. Specifically, DSTC2 is a human-machine dataset in the single domain of restaurant search. Maluuba is a much more complex human-human dataset in the travel booking domain, which contains more slots and values than DSTC2. Detailed slot information for each dataset is shown in Table 1.
Table 2: Experimental results on DSTC2 (left block) and Maluuba (right block).

| Model | BPRA | APRA | BLEU | $T_{model}$ | $T_{policy}$ | BPRA | APRA | BLEU | $T_{model}$ | $T_{policy}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| E2ECM | 0.9689 | – | 0.1782 | 42.30 m | 0.78 m | 0.7458 | – | 0.0797 | 45.81 m | 0.84 m |
| CDM | 0.9704 | 0.2791 | 0.2039 | 45.71 m | 2.96 m | 0.6771 | 0.1542 | 0.0704 | 50.22 m | 3.25 m |
| GDP | 0.9719 | 0.5732 | 0.2847 | 46.43 m | 9.63 m | 0.7500 | 0.4512 | 0.1156 | 55.51 m | 11.49 m |
| E2ECM+RL | 0.9689 | – | 0.1823 | 30.01 m | 30.01 m | 0.7458 | – | 0.0799 | 35.13 m | 35.13 m |
| CDM+RL | 0.9704 | 0.2873 | 0.2088 | 101.0 m | 101.0 m | 0.6771 | 0.1625 | 0.0734 | 29.00 m | 29.00 m |
| GDP+RL | 0.9719 | 0.5766 | 0.2879 | 98.07 m | 98.07 m | 0.7500 | 0.4521 | 0.1226 | 134.8 m | 134.8 m |
5.2 Baselines

For comparison, we choose two state-of-the-art baselines and their variants.

- E2ECM Chi et al. (2017): This approach performs multi-label classification over the predefined dialogue acts to model the dialogue policy, without predicting the corresponding parameters.
- CDM Su et al. (2016): This approach designs a group of classifications (two multi-class classifications and some binary classifications) to model the dialogue policy.
- E2ECM+RL: It fine-tunes the classification parameters of the dialogue policy by REINFORCE Williams (1992).
- CDM+RL: It fine-tunes the classification of the act and its corresponding parameters by REINFORCE Williams (1992).
In order to verify the performance of the dialogue policy maker, the utterance encoder and dialogue belief tracker are identical for our proposed model and all baselines; only the dialogue policy maker differs.
5.3 Parameter Settings
For all models, the hidden size of the dialogue belief tracker and utterance encoder is 350, and the embedding size is set to 300. For our proposed model, the hidden size of the decoder in the dialogue policy maker is 150. The vocabulary size is 540 for DSTC2 and 4712 for Maluuba, and the size of $k_t$ is set to 20. An Adam optimizer Kingma and Ba (2014) is used for training our models and the baselines, with a learning rate of 0.001 for supervised training and 0.0001 for reinforcement learning. In reinforcement learning, the decay parameter $\gamma$ is set to 0.8. The weight decay is set to 0.001, and early stopping is performed on the development set.
5.4 Experimental Results
The experimental results of the proposed model and baselines will be analyzed from the following aspects.
BPRA Results: As shown in Table 2, most models achieve similar BPRA on the two datasets, which guarantees a consistent input to the dialogue policy maker. All models perform very well on BPRA on the DSTC2 dataset. On the Maluuba dataset, BPRA decreases because of the more complex domain. We notice that the BPRA of CDM is slightly worse than that of the other models on Maluuba: CDM's dialogue policy maker contains many classifiers and thus incurs a bigger loss in complex domains, which affects the training of the dialogue belief tracker.
APRA Results: Compared with the baselines, GDP achieves the best APRA on both datasets. Note that we do not compare with the E2ECM baseline on APRA: E2ECM uses only a simple classifier to recognize the labels of the acts and ignores the parameter information. In our experiments, the act-only accuracy of E2ECM is slightly better than our method's, but since E2ECM lacks the parameters of the acts, a direct comparison would be unfair to GDP. The CDM baseline does consider the parameters of the acts, and GDP is far better than CDM in both supervised learning and reinforcement learning.
Table 3: Case study on the DSTC2 dataset, comparing the ground truth with the outputs of GDP, E2ECM and CDM. Dialogue context: "Inf: cheap, east; sys: name_slot is a nice place in the east of town and the price is cheap; user: what's the address?" Ground-truth dialogue action: offer name name_slot …
BLEU Results: GDP significantly outperforms the baselines on BLEU. As mentioned above, E2ECM is actually slightly better than GDP on act-only accuracy, but the language quality of the responses generated by GDP is still better than E2ECM's, which shows that the lack of parameter information makes it difficult to find the appropriate sentence template in the NLG. The BLEU of all models is very poor on the Maluuba dataset: Maluuba is a human-human task-oriented dialogue dataset whose utterances are very flexible, so the natural language generator of every method struggles to produce an accurate utterance from the context. DSTC2, by contrast, is a human-machine dialogue dataset with very regular responses, so the NLG is more effective on it than on Maluuba. Nevertheless, GDP still outperforms the baselines on Maluuba, which again verifies that our proposed method models the dialogue policy on complex domains more accurately than the classification-based methods.
Time and Model Size: To obtain a more accurate and complete dialogue policy for task-oriented dialogue systems, the proposed model has more parameters in the dialogue policy maker than the baselines. As shown in Figure 3, E2ECM has the fewest dialogue policy parameters because of its simple classification, and therefore needs the least training time, but its performance is poor. The number of parameters of CDM is slightly larger than that of E2ECM; however, because both are classification methods, they both lose important information about the dialogue policy, and the experimental results show that the quality of CDM's dialogue policy is as bad as E2ECM's. The dialogue policy maker of the GDP model has far more parameters than those of the baselines. Although the proposed model needs more time to be optimized by supervised learning and reinforcement learning, its performance is much better than all baselines.
5.5 Case Study
Table 3 illustrates an example of our proposed model and baselines on DSTC2 dataset. In this example, a user’s goal is to find a cheap restaurant in the east part of the town. In the current turn, the user wants to get the address of the restaurant.
E2ECM chooses the inform and offer acts accurately, but the lack of the inform act's parameters makes the final output deviate from the user's goal. CDM generates the parameters of offer successfully, but missing the inform information also leads to a bad result. By contrast, the proposed GDP model generates all the acts and their corresponding parameters as the dialogue action. Notably, the final result of GDP is exactly the same as the ground truth, which verifies that the proposed model is better than the state-of-the-art baselines.
6 Conclusion

In this paper, we propose a novel model named GDP. Our proposed model treats dialogue policy modeling as a generative task instead of a discriminative task, which can hold more information for dialogue policy modeling. We evaluate GDP on two benchmark task-oriented dialogue datasets. Extensive experiments show that GDP outperforms the existing classification-based methods on both action accuracy and BLEU.
References

- Asri et al. (2017). Frames: a corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
- Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Chi et al. (2017). Speaker role contextual modeling for language understanding and dialogue policy learning. arXiv preprint arXiv:1710.00164.
- Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Cuayáhuitl et al. (2015). Strategic dialogue management via deep reinforcement learning. arXiv preprint arXiv:1511.08099.
- Eric and Manning (2017). A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXiv preprint arXiv:1701.04024.
- Henderson et al. (2014). The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 263–272.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- Jurčíček et al. (2011). Natural actor and belief critic: reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs. ACM Transactions on Speech and Language Processing (TSLP) 7(3), 6.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lei et al. (2018). Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1437–1447.
- Lipton et al. (2018). BBQ-networks: efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Liu and Lane (2017a). An end-to-end trainable neural network model with belief tracking for task-oriented dialog. arXiv preprint arXiv:1708.05956.
- Liu and Lane (2017b). Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 482–489.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Peng et al. (2018). Deep Dyna-Q: integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176.
- Schatzmann et al. (2007a). Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pp. 149–152.
- Schatzmann et al. (2007b). Statistical user simulation with a hidden agenda. In Proc. SIGDial, Antwerp.
- Su et al. (2017). Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130.
- Su et al. (2016). Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689.
- Wen et al. (2016). A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
- Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4), pp. 229–256.
- Young et al. (2013). POMDP-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101(5), pp. 1160–1179.