Generative Dialog Policy for Task-oriented Dialog Systems

09/17/2019 · Tian Lan et al.

There is an increasing demand for task-oriented dialogue systems that can assist users in various activities such as booking tickets and making restaurant reservations. In order to complete dialogues effectively, dialogue policy plays a key role in task-oriented dialogue systems. As far as we know, existing task-oriented dialogue systems obtain the dialogue policy through classification, which can assign either one dialogue act and its corresponding parameters or multiple dialogue acts without their corresponding parameters for a dialogue action. In fact, a good dialogue policy should construct multiple dialogue acts and their corresponding parameters at the same time, which is hard for existing classification-based methods to achieve. To address this issue, we propose a novel generative dialogue policy learning method. Specifically, the proposed method uses an attention mechanism to find relevant segments of the dialogue context and the input utterance, and then constructs the dialogue policy in a seq2seq manner for task-oriented dialogue systems. Extensive experiments on two benchmark datasets show that the proposed model significantly outperforms the state-of-the-art baselines. In addition, we have publicly released our code.


1 Introduction

Task-oriented dialogue systems are an important tool for building personal virtual assistants, which can help users complete most daily tasks by interacting with devices via natural language. They are attracting increasing attention from researchers, and many works have been proposed in this area Peng et al. (2018); Eric and Manning (2017); Lipton et al. (2018); Young et al. (2013); Wen et al. (2016); Lei et al. (2018); Schatzmann et al. (2007b, a).

Existing task-oriented dialogue systems usually consist of four components: (1) natural language understanding (NLU), which identifies the user's intent; (2) a dialogue state tracker (DST), which keeps track of the user's goals and constraints at every turn; (3) a dialogue policy maker (DP), which generates the next available dialogue action; and (4) a natural language generator (NLG), which generates a natural language response based on the dialogue action. Among the four components, the dialogue policy maker plays a key role in completing dialogues effectively, because it decides the next dialogue action to be executed.

As far as we know, the dialogue policy makers in most existing task-oriented dialogue systems simply use classifiers over predefined acts to obtain the dialogue policy Peng et al. (2018); Lipton et al. (2018); Wen et al. (2016); Liu and Lane (2017a, b). These classification-based dialogue policy learning methods can assign either only one dialogue act and its corresponding parameters Su et al. (2016); Lipton et al. (2018); Peng et al. (2018) or multiple dialogue acts without their corresponding parameters Chi et al. (2017) for a dialogue action. However, none of these existing methods can obtain multiple dialogue acts and their corresponding parameters for a dialogue action at the same time.

Figure 1: Examples from the DSTC2 dataset. Our proposed model can hold more information about the dialogue policy than the classification models mentioned above. "MA, w/o P" is the model that chooses multiple acts without corresponding parameters during dialogue policy modeling; "w/o MA, P" is the model that chooses only one act and its parameters.

Intuitively, it is more reasonable to construct multiple dialogue acts and their corresponding parameters for a dialogue action at the same time. For example, 49.4% of turns in the DSTC2 dataset and 61.5% of turns in the Maluuba dataset have multiple dialogue acts and their corresponding parameters as the dialogue action. If multiple dialogue acts and their corresponding parameters can be obtained at the same time, the final response of the task-oriented dialogue system becomes more accurate and effective. For example, as shown in Figure 1, a user wants the name of a cheap French restaurant. The correct dialogue policy should generate three acts in the current turn: offer(name=name_slot), inform(food=french) and inform(price=cheap). Thus, the user's real thought may be: "name_slot is a cheap french restaurant". If the offer act is lost, the system may generate a response like "There are some french restaurants", which is far from the user's goal.

To address this challenge, we propose a Generative Dialogue Policy model (GDP) that casts dialogue policy learning as a sequence optimization problem. The proposed model generates a series of acts and their corresponding parameters according to the learned dialogue policy. Specifically, our model uses a recurrent neural network (RNN) as an action decoder to construct the dialogue policy maker, instead of traditional classifiers. An attention mechanism helps the decoder decode the dialogue acts and their corresponding parameters, and a template-based natural language generator then uses the results of the dialogue policy maker to choose an appropriate sentence template as the final response to the user.

Extensive experiments conducted on two benchmark datasets verify the effectiveness of our proposed method. Our contributions in this work are three-fold.

  • The existing methods cannot construct multiple dialogue acts and their corresponding parameters at the same time. In this paper, we propose a novel generative dialogue policy model to solve this problem.

  • The extensive experiments demonstrate that the proposed model significantly outperforms the state-of-the-art baselines on two benchmarks.

  • We publicly release the source code.

2 Related Work

Usually, existing task-oriented dialogue systems use a pipeline of four separate modules: natural language understanding, dialogue belief tracker, dialogue policy maker and natural language generator. Among these four modules, the dialogue policy maker plays a key role in task-oriented dialogue systems, as it generates the next dialogue action.

As far as we know, nearly all existing approaches obtain the dialogue policy by using classifiers over all predefined dialogue acts Su et al. (2017); Jurčíček et al. (2011). There are usually two kinds of dialogue policy learning methods. One constructs a single dialogue act and its corresponding parameters for a dialogue action. For example, Peng et al. (2018) construct a simple classifier over all the predefined dialogue acts. Lipton et al. (2018) build a more complex classifier over some predefined dialogue acts and additionally add two acts for each parameter: one to inform its value and the other to request it. The other kind obtains the dialogue policy by multi-label classification, considering multiple dialogue acts without their parameters. Chi et al. (2017) perform multi-label multi-class classification for dialogue policy learning, and the multiple acts are then decided based on a threshold. Based on these classifiers, reinforcement learning can be used to further update the dialogue policy of task-oriented dialogue systems Young et al. (2013); Cuayáhuitl et al. (2015); Liu and Lane (2017b).

In real scenarios, a correct dialogue action usually consists of multiple dialogue acts and their corresponding parameters. However, it is very hard for existing classification-based dialogue policy makers to achieve this. Thus, in this paper we propose a novel generative dialogue policy maker that addresses this issue by casting dialogue policy learning as a sequence optimization problem.

3 Technical Background

3.1 Encoder-Decoder Seq2Seq Models

The Seq2Seq model was first introduced by Cho et al. (2014) for statistical machine translation. It uses two recurrent neural networks (RNNs) to solve the sequence-to-sequence mapping problem. One, called the encoder, encodes the user utterance into a dense vector representing its semantics; the other, called the decoder, decodes this vector into the target sentence. The Seq2Seq framework has already been used in task-oriented dialogue systems such as Wen et al. (2016) and Eric and Manning (2017), and shows competitive performance. In the Seq2Seq model, given the user utterance $X = (x_1, x_2, \dots, x_n)$, the encoder squeezes it into a context vector $c$, which is then used by the decoder to generate the response $Y = (y_1, y_2, \dots, y_m)$ word by word by maximizing the generation probability of $Y$ conditioned on $X$. The objective function of Seq2Seq can be written as:

$$P(Y \mid X) = \prod_{t=1}^{m} P(y_t \mid y_1, \dots, y_{t-1}, c) \qquad (1)$$

In particular, the encoder RNN produces the context vector $c$ as follows:

$$h_t = f(x_t, h_{t-1}), \qquad c = h_n \qquad (2)$$

where $h_t$ is the hidden state of the encoder RNN at time step $t$ and $f$ is a non-linear transformation, which can be a long short-term memory unit (LSTM) Hochreiter and Schmidhuber (1997) or a gated recurrent unit (GRU) Cho et al. (2014). In this paper, we implement $f$ using a GRU.

The decoder RNN generates each word of the reply conditioned on the context vector $c$. The probability distribution over candidate words at every time step $t$ is calculated as:

$$s_t = f(y_{t-1}, s_{t-1}, c), \qquad P(y_t \mid y_1, \dots, y_{t-1}, c) = \mathrm{softmax}(g(s_t)) \qquad (3)$$

where $s_t$ is the hidden state of the decoder RNN at time step $t$, $g$ is an output projection, and $y_t$ is the word generated at time step $t$ through the softmax operation.
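To make the encoder-decoder formulation above concrete, the following is a minimal PyTorch sketch of a GRU-based Seq2Seq model. It is only an illustration under our own naming and sizes, not the released implementation.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder, following Eqs. (1)-(3)."""

    def __init__(self, vocab_size, emb_size=300, hidden_size=350):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Eq. (2): the last encoder hidden state serves as the context vector c.
        _, context = self.encoder(self.emb(src_ids))
        # Eq. (3): the decoder is initialised with c and predicts the reply token
        # by token (teacher forcing: the ground-truth previous token is fed in).
        dec_states, _ = self.decoder(self.emb(tgt_ids), context)
        return torch.log_softmax(self.out(dec_states), dim=-1)  # log P(y_t | y_<t, X)
```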

3.2 Attention Mechanism

Attention mechanisms Bahdanau et al. (2014) have been shown to effectively improve the generation quality of the Seq2Seq framework. In Seq2Seq with attention, each decoding step $t$ corresponds to a context vector $c_t$ that is calculated dynamically as a weighted average of all hidden states of the encoder RNN. Formally, $c_t$ is defined as $c_t = \sum_{j=1}^{n} \alpha_{tj} h_j$, where the weight $\alpha_{tj}$ is given by:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})}, \qquad e_{tj} = \eta(s_{t-1}, h_j) \qquad (4)$$

where $s_{t-1}$ is the last hidden state of the decoder and $\eta$ is often implemented as a multi-layer perceptron (MLP) with tanh as the activation function.
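For illustration, the additive (MLP) attention of Eq. (4) can be sketched as follows in PyTorch; the class name, tensor shapes and attention size are our own assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """MLP attention with tanh scoring, as in Eq. (4)."""

    def __init__(self, dec_size, enc_size, attn_size=150):
        super().__init__()
        self.w = nn.Linear(dec_size + enc_size, attn_size)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev:     (batch, dec_size)     last decoder hidden state s_{t-1}
        # enc_states: (batch, n, enc_size)  all encoder hidden states h_1..h_n
        n = enc_states.size(1)
        s_rep = s_prev.unsqueeze(1).expand(-1, n, -1)     # repeat s_{t-1} for every h_j
        scores = self.v(torch.tanh(self.w(torch.cat([s_rep, enc_states], dim=-1))))
        alpha = torch.softmax(scores, dim=1)              # attention weights alpha_{tj}
        return (alpha * enc_states).sum(dim=1)            # context vector c_t
```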

4 Generative Dialogue Policy

Figure 2 shows the overall system architecture of the proposed GDP model. Our model contains five main components: (1) utterance encoder; (2) dialogue belief tracker; (3) dialogue policy maker; (4) knowledge base; (5) template-based natural language generator. Next, we will describe each component of our proposed GDP model in detail.

Figure 2: GDP overview. The utterance encoder encodes the user utterance, the dialogue context and the last system reply into dense vectors. For the dialogue belief tracker, we use the approach of Lei et al. (2018) to generate the dialogue context. This information is then used to query the knowledge base. Based on the user's intents and the query results, the dialogue policy maker generates the next dialogue action using our proposed RNN-based method.

4.1 Notations and Task Formulation

Given the user utterance $U_t$ at turn $t$ and the dialogue context $C_t$, which contains the result of the dialogue belief tracker at turn $t-1$, the task-oriented dialogue system needs to generate the user's intents $B_t$ with the dialogue belief tracker and then use this information to obtain the knowledge base query result $k_t$. The model then needs to generate the next dialogue action $A_t$ based on $k_t$, $U_t$ and $B_t$. The natural language generator provides the template-based response $R_t$ as the final reply by using $A_t$. $U_t$ and $B_t$ are sequences, and $k_t$ is a one-hot vector representing the number of query results. For the baselines in this paper, $A_t$ is the classification result for the next dialogue action, whereas in our proposed model it is a sequence containing multiple acts and their corresponding parameters.

4.2 Utterance Encoder

A bidirectional GRU is used to encode the user utterance $U_t$, the last response $R_{t-1}$ made by the system and the dialogue context $C_t$ into a continuous representation. The summary vector is generated by concatenating the last forward and backward GRU states. $U_t$ is the user utterance at turn $t$, $C_t$ is the dialogue context produced by the dialogue belief tracker at turn $t-1$, and $R_{t-1}$ is the response made by our task-oriented dialogue system at the last turn. The words of the concatenated input are first mapped into an embedding space and then serve as the inputs at each step of the bidirectional GRU. Let $n$ denote the number of words in the input sequence. $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ represent the forward and backward GRU state outputs at time step $i$, and the encoder output at time step $i$ is denoted $h_i^E$:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(\mathrm{emb}(w_i), \overrightarrow{h_{i-1}}), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(\mathrm{emb}(w_i), \overleftarrow{h_{i+1}}), \qquad h_i^E = [\overrightarrow{h_i}; \overleftarrow{h_i}] \qquad (5)$$

where $\mathrm{emb}(w_i)$ is the embedding of input word $w_i$ and the GRU hidden size determines the dimensionality of each state. $H^E = (h_1^E, \dots, h_n^E)$ contains the encoder hidden state of each time step, which will be used by the attention mechanism in the dialogue policy maker.
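A minimal sketch of such a bidirectional GRU utterance encoder is given below. It is illustrative only; the class and variable names are ours, and the input is assumed to be the already concatenated token sequence of $C_t$, $R_{t-1}$ and $U_t$.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Bidirectional GRU over the concatenated context, last response and utterance."""

    def __init__(self, vocab_size, emb_size=300, hidden_size=350):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.bigru = nn.GRU(emb_size, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, n) word indices of [C_t ; R_{t-1} ; U_t]
        states, last = self.bigru(self.emb(token_ids))
        # states: (batch, n, 2*hidden) per-step [forward; backward] states H^E,
        # later attended over by the dialogue policy maker.
        # summary: concatenation of the last forward and backward states.
        summary = torch.cat([last[0], last[1]], dim=-1)   # (batch, 2*hidden)
        return states, summary
```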

4.3 Dialogue State Tracker

The dialogue state tracker maintains the state of the conversation and collects the user's goals during the dialogue. Recent work typically models this component as a discriminative classifier, but Lei et al. (2018) verified that generation is a better way to model the dialogue state tracker.

Specifically, we use a GRU as the generator to decode the $B_t$ of the current turn. In order to capture the user's intent accurately, the basic attention mechanism of Eq. (4) is applied over $H^E$ at each decoding step, yielding a context vector $c_j$:

$$h_j^B = \mathrm{GRU}([\mathrm{emb}(b_{j-1}); c_j], h_{j-1}^B), \qquad j = 1, \dots, m \qquad (6)$$

where $m$ is the length of $B_t$, $\mathrm{emb}(b_{j-1})$ is the embedding of the previously decoded token, the GRU hidden size determines the dimensionality of $h_j^B$, and $h_j^B$ denotes the hidden state of the dialogue state tracker RNN at time step $j$. The token decoded at step $j$ is denoted $b_j$. The tracker hidden states $H^B = (h_1^B, \dots, h_m^B)$ will later be attended over by the dialogue policy maker.

4.4 Knowledge Base

The knowledge base is a database that stores information about the relevant task. For example, for restaurant reservation, the knowledge base stores information about all the restaurants, such as location and price. After the dialogue belief tracker, $B_t$ is used as the constraint to search for results in the knowledge base, and the one-hot vector $k_t$ is produced from the number of returned results.

The search result has a great influence on the dialogue policy. For example, if the result contains multiple matches, the system should request more constraints from the user. In practice, we let $k_t$ be a one-hot vector of 20 dimensions representing the number of query results. $k_t$ is then used as a cue for the dialogue policy maker.
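As a small illustration, $k_t$ can be built from the number of knowledge-base matches roughly as follows; clipping counts of 20 or more into the last bucket is our assumption, since the text only specifies a 20-dimensional one-hot vector.

```python
import torch

def kb_result_vector(num_matches: int, dim: int = 20) -> torch.Tensor:
    """Encode the number of KB query results as a one-hot cue vector k_t."""
    k = torch.zeros(dim)
    # Counts at or above the last bucket are clipped into it (assumed behaviour).
    k[min(num_matches, dim - 1)] = 1.0
    return k
```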

4.5 Dialogue Policy Maker

In task-oriented dialogue systems, supervised classification is a straightforward solution for dialogue policy modeling. However, we observe that classification cannot hold enough information for dialogue policy modeling. A generative approach is another way to model the dialogue policy maker: it generates the next dialogue acts and their corresponding parameters word by word based on the dialogue context, thereby converting dialogue policy learning into a sequence optimization problem.

The dialogue policy maker generates the next dialogue action $A_t$ based on $H^E$, $H^B$ and $k_t$. Our proposed model uses a GRU as the action decoder to decode the acts and their parameters for the response. In particular, at step $i$, for decoding the token $a_i$ of $A_t$, the decoder GRU takes the embedding of $a_{i-1}$ to generate a hidden vector $h_i^A$. The basic attention mechanism is calculated as:

$$c_i^U = \mathrm{Attn}(h_{i-1}^A, H^E), \qquad c_i^B = \mathrm{Attn}(h_{i-1}^A, H^B) \qquad (7)$$

where $\mathrm{emb}(a_{i-1})$ is the embedding of the previously decoded token, $c_i^U$ is the context vector over the input utterance and $c_i^B$ is the context vector over the dialogue state tracker. $h_i^A$ is the hidden state of the GRU in the dialogue policy maker at time step $i$:

$$h_i^A = \mathrm{GRU}([\mathrm{emb}(a_{i-1}); c_i^U; c_i^B; k_t], h_{i-1}^A) \qquad (8)$$

where $a_i$ is the token decoded at time step $i$. The final result of the dialogue policy maker is denoted $A_t = (a_1, \dots, a_{l_A})$, where $l_A$ is its length. In our proposed model, the dialogue policy maker can be viewed as the decoder of a seq2seq model conditioned on $H^E$, $H^B$ and $k_t$.
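Under the notation above, one decoding step of the action decoder can be sketched as follows, reusing the AdditiveAttention sketch from Section 3.2. This is a simplified illustration; the concatenation order of the inputs and the default sizes (taken from Sec. 5.3) are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class PolicyDecoderStep(nn.Module):
    """One step of the GRU action decoder in the dialogue policy maker (Eqs. (7)-(8))."""

    def __init__(self, act_vocab, emb_size=300, enc_size=700, trk_size=350,
                 kb_size=20, hidden_size=150):
        super().__init__()
        self.emb = nn.Embedding(act_vocab, emb_size)
        self.attn_u = AdditiveAttention(hidden_size, enc_size)   # over utterance encoder states
        self.attn_b = AdditiveAttention(hidden_size, trk_size)   # over state-tracker states
        self.cell = nn.GRUCell(emb_size + enc_size + trk_size + kb_size, hidden_size)
        self.out = nn.Linear(hidden_size, act_vocab)

    def forward(self, prev_token, h_prev, enc_states, trk_states, k_t):
        # Eq. (7): context vectors from the two attention modules.
        c_u = self.attn_u(h_prev, enc_states)
        c_b = self.attn_b(h_prev, trk_states)
        # Eq. (8): update the decoder state and score the next act/parameter token.
        x = torch.cat([self.emb(prev_token), c_u, c_b, k_t], dim=-1)
        h = self.cell(x, h_prev)
        return torch.log_softmax(self.out(h), dim=-1), h
```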

4.6 Natural Language Generator

After obtaining the dialogue action $A_t$ from the learned dialogue policy maker, the task-oriented dialogue system needs to generate an appropriate response for the user. We construct the natural language generator using template sentences: for each dataset, we extract all the system responses and then manually modify them to construct sentence templates. In our proposed model, the sequence of acts and parameters is used to search for the appropriate template, whereas the classification-based baselines use the predicted categories of acts and their corresponding parameters to search for the corresponding template.
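A minimal sketch of this template lookup is shown below; the template table, the key format and the slot-filling step are illustrative assumptions rather than the actual templates used in our experiments.

```python
# Illustrative template table: keys are normalized act/parameter sequences.
TEMPLATES = {
    "offer name inform addr": "sure, {name} is on {addr}",
    "offer name inform food": "{name} serves {food} food",
}

def render(action_tokens, slot_values):
    """Map a decoded action sequence to a filled-in template sentence."""
    key = " ".join(t for t in action_tokens if not t.endswith("_slot"))
    template = TEMPLATES.get(key, "sorry, i could not find a matching template")
    return template.format(**slot_values)

# Example with the GDP output from Table 3:
print(render(["offer", "name", "name_slot", "inform", "addr", "addr_slot"],
             {"name": "name_slot", "addr": "addr_slot"}))
```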

4.7 Training

Because our proposed model is built in a seq2seq way, the standard cross-entropy loss is adopted as the objective function to train the dialogue belief tracker and the dialogue policy maker under supervised learning:

$$\mathcal{L} = -\sum_{j=1}^{m} \log P(b_j \mid b_{<j}, U_t, C_t) - \sum_{i=1}^{l_A} \log P(a_i \mid a_{<i}, U_t, B_t, k_t) \qquad (9)$$

After supervised learning, the dialogue policy can be further updated by reinforcement learning. In this context, the decoder of the dialogue policy maker can be viewed as a policy network, denoted $\pi_\theta(a_i)$ for decoding $a_i$, where $\theta$ denotes the parameters of the decoder. Accordingly, the hidden state created by the GRU is the corresponding state, and the choice of the current token $a_i$ is an action (the action here is different from the dialogue action; it is a concept from reinforcement learning).

The reward function is also very important when decoding every token. To encourage our policy maker to generate correct acts and their corresponding parameters, we set the reward function as follows: if the dialogue acts and their parameters are decoded correctly, the reward is 2; if only the labels of the dialogue acts are decoded correctly but the parameters are wrong, the reward is 1; otherwise, the reward is -5. $\gamma$ is a decay parameter; more details are shown in Sec 5.3. In our proposed model, the reward can only be obtained at the end of decoding $A_t$. In order to obtain a reward at each decoding step, we sample some completions after choosing $a_i$, and the reward of $a_i$ is set to the average reward of all the sampled results.
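The reward scheme can be sketched as follows; the dict-based act/parameter comparison and the Monte-Carlo averaging over sampled completions are simplified illustrations of the description above, not the exact implementation.

```python
def terminal_reward(pred_acts, gold_acts):
    """Reward for a fully decoded dialogue action A_t.

    pred_acts / gold_acts: dicts mapping act labels to their parameter lists.
    """
    if pred_acts == gold_acts:
        return 2.0        # acts and parameters all correct
    if set(pred_acts) == set(gold_acts):
        return 1.0        # act labels correct, parameters wrong
    return -5.0           # otherwise

def step_reward(sampled_completions, gold_acts):
    """Per-token reward: average terminal reward over sampled completions."""
    rewards = [terminal_reward(s, gold_acts) for s in sampled_completions]
    return sum(rewards) / len(rewards)
```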

In order to keep the model's performance stable during the reinforcement-learning fine-tuning phase, we freeze the parameters of the utterance encoder and the dialogue belief tracker; only the parameters of the dialogue policy maker are optimized by reinforcement learning. The policy gradient algorithm REINFORCE Williams (1992) is used to fine-tune the pretrained dialogue policy maker:

$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_{i=1}^{l_A} R_i \, \nabla_\theta \log \pi_\theta(a_i)\Big] \qquad (10)$$

where $l_A$ is the length of the decoded action and $R_i$ is the reward of token $a_i$. The objective can be optimized by gradient descent.
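A corresponding surrogate loss for the REINFORCE update of Eq. (10), assuming per-step rewards computed as in the sketch above, might look like the following.

```python
import torch

def reinforce_loss(log_probs, rewards):
    """Surrogate loss whose gradient matches Eq. (10).

    log_probs: (T,) log pi_theta(a_i) of the decoded tokens
    rewards:   (T,) per-step rewards R_i (averaged over sampled roll-outs)
    """
    # Minimising this loss ascends the expected reward; rewards are constants here.
    return -(rewards.detach() * log_probs).sum()
```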

5 Experiments

We evaluate the performance of the proposed model in three aspects: (1) the accuracy of the dialogue state tracker, which shows the impact of the dialogue state tracker on the dialogue policy maker; (2) the accuracy of the dialogue policy maker, which compares different ways of constructing the dialogue policy; and (3) the quality of the final response, which shows the impact of the dialogue policy on the final dialogue response. The evaluation metrics are listed as follows:

  • BPRA: Belief Per-Response Accuracy (BPRA) tests the ability to generate the correct user intents during the dialogue. This metric is used to evaluate the accuracy of dialogue belief tracker Eric and Manning (2017).

  • APRA: Action Per-Response Accuracy (APRA) evaluates the per-turn accuracy of the dialogue actions generated by the dialogue policy maker. For the baselines, APRA evaluates the classification accuracy of the dialogue policy maker. Our model, however, generates each individual token of the action, and we consider a prediction correct only if every token of the model output matches the corresponding token in the ground truth (see the sketch after this list).

  • BLEU Papineni et al. (2002): BLEU evaluates the quality of the final response generated by the natural language generator. It is a common metric for measuring the performance of task-oriented dialogue systems.
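The following is a sketch of how APRA is computed for generated action sequences under the exact-match criterion described above; the function and argument names are ours.

```python
def apra(predictions, references):
    """Action Per-Response Accuracy for generated action sequences.

    A predicted action counts as correct only if every token matches the ground truth.
    """
    correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return correct / len(references)

# Example with tokenised actions:
# apra([["offer", "name", "name_slot"]], [["offer", "name", "name_slot"]]) -> 1.0
```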

We also choose the following metrics to evaluate the efficiency of training the model:

  • $T_{full}$: the time for training the whole model, which is important in industrial settings.

  • $T_{DP}$: the time for training the dialogue policy maker in a task-oriented dialogue system.

5.1 Datasets

Dataset          DSTC2                                Maluuba
Size             Train: 1612, Test: 506, Dev: 1117    Train: 8621, Test: 478, Dev: 480
Domains          restaurant reservation               travel booking
Actions          11 (offer, inform, request, etc.)    16 (offer, inform, request, etc.)
Slots            8 (area, food, price, etc.)          60 (startdate, enddate, etc.)
Distinct values  212                                  inf (continuous values)
Table 1: Details of the DSTC2 and Maluuba datasets. The Maluuba dataset is more complex than DSTC2 and has some continuous value spaces, such as time and price, which are hard for classification models to handle.

We adopt the DSTC2 dataset Henderson et al. (2014) and the Maluuba dataset Asri et al. (2017) to evaluate our proposed model. Both are benchmark datasets for building task-oriented dialogue systems. Specifically, DSTC2 is a human-machine dataset in the single domain of restaurant searching, while Maluuba is a much more complex human-human dataset in the travel booking domain, containing more slots and values than DSTC2. Detailed slot information for each dataset is shown in Table 1.

Models     |              DSTC2                        |              Maluuba
           | BPRA   APRA   BLEU   T_full   T_DP        | BPRA   APRA   BLEU   T_full   T_DP
E2ECM      | 0.9689   -    0.1782  42.30 m   0.78 m    | 0.7458   -    0.0797  45.81 m   0.84 m
CDM        | 0.9704 0.2791 0.2039  45.71 m   2.96 m    | 0.6771 0.1542 0.0704  50.22 m   3.25 m
GDP        | 0.9719 0.5732 0.2847  46.43 m   9.63 m    | 0.7500 0.4512 0.1156  55.51 m  11.49 m
E2ECM+RL   | 0.9689   -    0.1823  30.01 m  30.01 m    | 0.7458   -    0.0799  35.13 m  35.13 m
CDM+RL     | 0.9704 0.2873 0.2088  101.0 m  101.0 m    | 0.6771 0.1625 0.0734  29.00 m  29.00 m
GDP+RL     | 0.9719 0.5766 0.2879  98.07 m  98.07 m    | 0.7500 0.4521 0.1226  134.8 m  134.8 m

Table 2: Performance of the baselines and the proposed model on the DSTC2 and Maluuba datasets. $T_{full}$ is the time spent training the whole model; $T_{DP}$ is the time spent training the dialogue policy maker.

5.2 Baselines

For comparison, we choose two state-of-the-art baselines and their variants.

  • E2ECM Chi et al. (2017): For the dialogue policy maker, it adopts a classic classification approach to select a skeletal sentence template. In our implementation, we construct multiple binary classifiers, one for each act, to search for the sentence template, following Chi et al. (2017).

  • CDM Su et al. (2016): This approach designs a group of classifiers (two multi-class classifiers and several binary classifiers) to model the dialogue policy.

  • E2ECM+RL: It fine-tunes the classification parameters of the dialogue policy by REINFORCE Williams (1992).

  • CDM+RL: It fine-tunes the classifiers of the acts and their corresponding parameters by REINFORCE Williams (1992).

To isolate the performance of the dialogue policy maker, the utterance encoder and dialogue belief tracker of our proposed model and the baselines are identical; only the dialogue policy maker differs.

5.3 Parameters settings

For all models, the hidden size of the dialogue belief tracker and the utterance encoder is 350, and the embedding size is set to 300. For our proposed model, the hidden size of the decoder in the dialogue policy maker is 150. The vocabulary size is 540 for DSTC2 and 4712 for Maluuba, and the size of $k_t$ is set to 20. An Adam optimizer Kingma and Ba (2014) is used to train our models and the baselines, with a learning rate of 0.001 for supervised training and 0.0001 for reinforcement learning. In reinforcement learning, the decay parameter $\gamma$ is set to 0.8. The weight decay is set to 0.001, and early stopping is performed on the development set.
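For concreteness, the optimizer settings above roughly correspond to the following sketch; `model` and `model.policy_maker` are hypothetical handles for the full network and its dialogue policy maker.

```python
import torch

# Supervised pre-training of the whole model (model is a hypothetical nn.Module).
sl_optim = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)

# RL fine-tuning: only the dialogue policy maker's parameters are updated,
# the utterance encoder and belief tracker stay frozen.
rl_optim = torch.optim.Adam(model.policy_maker.parameters(), lr=1e-4, weight_decay=1e-3)
```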

5.4 Experimental Results

The experimental results of the proposed model and baselines will be analyzed from the following aspects.

BPRA Results: As shown in Table 2, most of the models have similar BPRA performance on the two datasets, which guarantees a consistent input to the dialogue policy maker. All models perform very well in BPRA on the DSTC2 dataset. On the Maluuba dataset, BPRA decreases because of the more complex domain. We notice that the BPRA of CDM is slightly worse than that of the other models on Maluuba: CDM's dialogue policy maker contains many classifiers and therefore incurs a larger loss than the other models on complex domains, which affects the training of the dialogue belief tracker.

APRA Results: Compared with the baselines, GDP achieves the best APRA on both datasets. Note that we do not compare against E2ECM on APRA: E2ECM only uses a simple classifier to recognize the act labels and ignores the parameter information. In our experiments, the act-only accuracy of E2ECM is slightly better than our method, but because it ignores the parameters of the acts, the comparison would be unfair to GDP. The CDM baseline does consider the parameters of the acts, and GDP is far better than CDM under both supervised learning and reinforcement learning.

Dialogue Context: Inf: cheap, east; sys: name_slot is a nice place in the east of town and the price is cheap; user: what's the address?

               Dialogue action                               Final response
Ground Truth   offer name name_slot; inform addr addr_slot   sure, name_slot is on addr_slot
GDP            offer name name_slot; inform addr addr_slot   sure, name_slot is on addr_slot
E2ECM          inform; offer                                 name_slot is a nice place in the east of the town
CDM            offer name name_slot                          name_slot is a nice place

Table 3: Case study on the DSTC2 dataset. The first row is the Dialogue Context of this case, which contains three parts: (1) Inf is the user's intent captured by the dialogue state tracker; (2) sys is the system response at the last turn; (3) user is the user utterance in this turn. For the ground truth and each model, the first column is the action made by the learned dialogue policy maker and the second column is the final response made by the template-based generator.

BLEU Results: GDP significantly outperforms the baselines on BLEU. As mentioned above, E2ECM is slightly better than GDP in act-only accuracy, but the language quality of the responses generated by GDP is still better than that of E2ECM, which shows that the lack of parameter information makes it difficult to find the appropriate sentence template in the NLG. The BLEU of all models is quite poor on the Maluuba dataset because Maluuba is a human-human task-oriented dialogue dataset: the utterances are very flexible, and it is difficult for the natural language generator of any method to produce an accurate utterance from the context. DSTC2, in contrast, is a human-machine dialogue dataset with very regular responses, so the NLG is more effective there than on Maluuba. Nevertheless, GDP is still better than the baselines on Maluuba, which further verifies that our proposed method models the dialogue policy on complex domains more accurately than the classification-based methods.

Figure 3: The number of parameters. GDP has a larger model size and more dialogue policy parameters because of its RNN-based dialogue policy maker.

Time and Model Size: In order to obtain a more accurate and complete dialogue policy, the proposed model has more parameters in the dialogue policy maker than the baselines. As shown in Figure 3, E2ECM has the fewest dialogue policy parameters because of its simple classifier, and it needs the least training time, but its performance is poor. The number of parameters in CDM is slightly larger than in E2ECM; however, because both are classification methods, they both lose important information about the dialogue policy, and the experimental results show that the quality of CDM's dialogue policy is as poor as E2ECM's. The dialogue policy maker of GDP has many more parameters than the baselines. Although the proposed model needs more time to be optimized by supervised learning and reinforcement learning, its performance is much better than that of all baselines.

5.5 Case Study

Table 3 illustrates an example of our proposed model and baselines on DSTC2 dataset. In this example, a user’s goal is to find a cheap restaurant in the east part of the town. In the current turn, the user wants to get the address of the restaurant.

E2ECM chooses the inform and offer acts accurately, but the lack of the inform act's parameters makes the final output deviate from the user's goal. CDM generates the parameters of offer successfully, but the absence of the inform act also leads to a bad result. By contrast, the proposed model GDP can generate all the acts and their corresponding parameters as the dialogue action. Interestingly, the final result of GDP is exactly the same as the ground truth, which verifies that the proposed model is better than the state-of-the-art baselines.

6 Conclusion

In this paper, we propose a novel model named GDP. Our proposed model treats dialogue policy modeling as a generative task instead of a discriminative task, which allows it to hold more information for dialogue policy modeling. We evaluate GDP on two benchmark task-oriented dialogue datasets. Extensive experiments show that GDP outperforms the existing classification-based methods on both action accuracy and BLEU.

References

  • L. E. Asri, H. Schulz, S. Sharma, J. Zumer, J. Harris, E. Fine, R. Mehrotra, and K. Suleman (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057. Cited by: §5.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.
  • T. Chi, P. Chen, S. Su, and Y. Chen (2017) Speaker role contextual modeling for language understanding and dialogue policy learning. arXiv preprint arXiv:1710.00164. Cited by: §1, §2, 1st item.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.1, §3.1.
  • H. Cuayáhuitl, S. Keizer, and O. Lemon (2015) Strategic dialogue management via deep reinforcement learning. arXiv preprint arXiv:1511.08099. Cited by: §2.
  • M. Eric and C. D. Manning (2017) A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXiv preprint arXiv:1701.04024. Cited by: §1, §3.1, 1st item.
  • M. Henderson, B. Thomson, and J. D. Williams (2014) The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 263–272. Cited by: §5.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.1.
  • F. Jurčíček, B. Thomson, and S. Young (2011) Natural actor and belief critic: reinforcement algorithm for learning parameters of dialogue systems modelled as pomdps. ACM Transactions on Speech and Language Processing (TSLP) 7 (3), pp. 6. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.3.
  • W. Lei, X. Jin, M. Kan, Z. Ren, X. He, and D. Yin (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1437–1447. Cited by: §1, Figure 2, §4.3.
  • Z. Lipton, X. Li, J. Gao, L. Li, F. Ahmed, and L. Deng (2018) BBQ-networks: efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §1, §2.
  • B. Liu and I. Lane (2017a) An end-to-end trainable neural network model with belief tracking for task-oriented dialog. arXiv preprint arXiv:1708.05956. Cited by: §1.
  • B. Liu and I. Lane (2017b) Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 482–489. Cited by: §1, §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: 3rd item.
  • B. Peng, X. Li, J. Gao, J. Liu, K. Wong, and S. Su (2018) Deep dyna-q: integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176. Cited by: §1, §1, §2.
  • J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye, and S. Young (2007a) Agenda-based user simulation for bootstrapping a pomdp dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pp. 149–152. Cited by: §1.
  • J. Schatzmann, B. Thomson, and S. Young (2007b) Statistical user simulation with a hidden agenda. Proc SIGDial, Antwerp 273282 (9). Cited by: §1.
  • P. Su, P. Budzianowski, S. Ultes, M. Gasic, and S. Young (2017) Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130. Cited by: §2.
  • P. Su, M. Gasic, N. Mrksic, L. Rojas-Barahona, S. Ultes, D. Vandyke, T. Wen, and S. Young (2016) Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689. Cited by: §1, 2nd item.
  • T. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2016) A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562. Cited by: §1, §1, §3.1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §4.7, 3rd item, 4th item.
  • S. Young, M. Gašić, B. Thomson, and J. D. Williams (2013) Pomdp-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101 (5), pp. 1160–1179. Cited by: §1, §2.