Task-oriented dialog systems, such as Siri, Google Assistant and Amazon Alexa, aim to assist users in completing tasks through interaction. The progress of reinforcement learning in robotics and other domains has brought a new perspective on learning the dialog policy (Williams and Young, 2007; Gašić and Young, 2014; Su et al., 2017; Li et al., 2019). As it is not practical to interact with a real user in the policy training loop, a common and essential strategy is to build a user simulator that provides replies to the dialog agent (Schatzmann et al., 2007; Li et al., 2016). Real dialog systems aim to maximize the positive feedback they receive from the user. To simulate the user feedback during training, a reward function is designed and embedded in the user simulator; it returns a reward signal to the dialog agent according to the given dialog context and system action (Peng et al., 2018b; Williams et al., 2017; Dhingra et al., 2016; Su et al., 2016). The reward signal can take the form of binary feedback or a continuous score. The most straightforward way to design such a reward function is to provide the agent with different reward signals based on the dialog status: if the dialog ends successfully, a large positive reward is returned; if the dialog fails, the reward is a large negative value; if the dialog is still ongoing, a small negative reward is returned to encourage shorter sessions (Peng et al., 2018b). However, this solution assigns the same negative signal to every system action in the dialog except the last one, so the qualities of different actions cannot be distinguished. Moreover, the truly meaningful reward signal is only returned at the end of a dialog, which delays the penalty for low-quality actions and the reward for high-quality actions.
Liu and Lane (2018) address the difficulties listed above by adopting adversarial training for policy learning. They jointly train two systems: (1) a policy model that decides on the actions to take at each turn, and (2) a discriminator that marks a dialog as being successful or not. Feedback from the discriminator is used as a reward signal to push the policy model to complete a task in a way that is indistinguishable from how a human agent completes it. Following this solution, Takanobu et al. (2019) replace the discriminator with a reward function that has a specific architecture and takes as input the dialog state, the system action and the next dialog state. This method achieves higher performance with respect to success rate and other metrics.
However, to alternately update the dialog policy and the reward model on the fly, the algorithms for updating the dialog policy are limited to policy gradient based algorithms, such as REINFORCE (Williams, 1992) and PPO (Schulman et al., 2017), while off-policy methods cannot benefit from the self-learned reward functions. Besides, the alternating training of the dialog agent and the reward model can easily get stuck in a local optimum or result in mode collapse. To alleviate these problems, we decompose the adversarial learning method for dialog policy learning into two sequential steps. In the first step, we learn the reward function using an auxiliary dialog state generator, where the loss from the discriminator can be backpropagated to the generator directly. In the second step, we discard the state generator and keep only the trained discriminator as the dialog reward model. The trained reward model is incorporated into the reinforcement learning process and is not updated further. In this way, we can utilize any reinforcement learning algorithm to update the dialog policy, including both on-policy and off-policy methods. Besides, we show how to use the pretrained reward functions to transfer knowledge learned in existing domains to a new dialog domain. To summarize, we make the following technical contributions:
- A reward learning method that is applicable to off-policy reinforcement learning methods in dialog training.
- A reward learning method that can alleviate the problem of local optima in adversarial dialog training.
- A reward function that can transfer knowledge learned in existing domains to a new dialog domain.
2 Related Work
Building a dialog system that can handle conversations across different domains has attracted a lot of attention in the last few years. Rule-based dialog systems become powerless in multi-domain scenarios because of the rich and diverse interactions: it is intractable to account for all possible situations and prepare the corresponding solutions by predefining a set of rules. Reinforcement learning methods (Peng et al., 2017; Lipton et al., 2018; Li et al., 2017; Su et al., 2018; Dhingra et al., 2016; Williams et al., 2017) have been widely used to train a dialog agent by interacting with users. With the help of reinforcement learning, the dialog agent is able to explore dialog contexts that may not exist in the previously observed data. However, the reward signal used to update the dialog policy usually comes from a reward function predefined with domain knowledge, and designing it can become very tricky in multi-domain dialog scenarios. To provide the dialog policy with a high-quality reward signal, Peng et al. (2018a) proposed to use the adversarial loss as an extra critic in addition to the main reward function. Inspired by the success of adversarial learning in other research fields, Liu and Lane (2018) learn the reward function directly from dialog samples by alternately updating the dialog policy and the reward function. The reward function is in fact a discriminator that aims to assign high values to real human dialogues and low values to dialogues generated by the current dialog policy. In contrast, the dialog policy attempts to achieve a higher reward from the discriminator given the generated dialog. Following this solution, Takanobu et al. (2019) replace the discriminator with a reward function that has a specific architecture and report higher performance with respect to success rate and other metrics.
3 Learning Reward Functions with an Auxiliary Generator
Different from previous adversarial training methods (Liu and Lane, 2018; Takanobu et al., 2019), in our method the dialog policy and the reward model are trained consecutively rather than alternately at different time steps. We believe this can avoid potential training issues, such as mode collapse and local optima. To achieve this goal, we introduce an auxiliary generator in the first step, which is used to explore potential dialog situations. The advantage of this setup is that we move from the SeqGAN setting (Yu et al., 2017) to a vanilla GAN setting (Goodfellow et al., 2014). The SeqGAN setup refers to the adversarial training style in which policy gradient is used to deliver the update signal from the discriminator to the dialog agent. In contrast, in the vanilla GAN framework the discriminator can directly backpropagate the update signal to the generator. Once we have obtained a high-quality reward model with the auxiliary generator in the first step, we can use it in common reinforcement learning methods to update the dialog agent. Since the reward model is kept fixed during policy training, we can adopt different kinds of reinforcement learning methods, whereas the adversarial learning methods are restricted to policy gradient based methods.
3.1 Dialog State Tracker
We reuse the rule-based dialog state tracker from ConvLab (Lee et al., 2019) to keep track of the information that emerges in the interactions between users and the dialog agent. The state tracker plays an important role in dialog systems since its output is the foundation for the dialog policy decisions in the next step. The embedded tracker in ConvLab is able to handle multi-domain interactions. The output of the NLU module is fed to the dialog state tracker to extract the relevant information, including the informable slots that express the constraints from users and the requestable slots that indicate what users want to inquire about. Besides, a belief vector is maintained and updated for each slot in every domain.
Dialog State The scattered information from the dialog state tracker is integrated to form a structured state representation $s_t$ at time step $t$. The final representation mainly consists of six feature segments: the embedded results of returned entities for a query, the availability of the booking option with respect to a given domain, the state of informable slots, the state of requestable slots, the last user action, and how many times the last user action has been repeated without interruption. The final state representation $s$ is an information vector with 392 dimensions, and each position is filled in with $0$ or $1$.
Dialog Action The action space consists of two different sets. In the first action set, each action is a concatenation of domain name, action type and slot name, such as Attraction_Inform_Address and Hotel_Request_Internet. Since in real scenarios the response from a human or a dialog agent can cover several single actions defined in the first action set, we extract the most frequently used dialog actions from the human-human dialog dataset to form the second action set. In other words, every action in the second set is a combination of two or three single actions from the first action set. For example, [Attraction_Inform_Address, Hotel_Request_Internet] is regarded as a new action that the policy agent can execute. In the end, the final action space has 300 different dialog actions. We use one-hot embeddings to represent the actions.
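The construction of the two action sets can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `build_action_space` and `one_hot`, the cutoff parameter `num_composite`, and the choice to treat a combination as an unordered set of single actions are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One entry of the action space: a tuple of one, two or three single actions.
Action = Tuple[str, ...]


def build_action_space(
    single_actions: Sequence[str],
    observed_turns: Sequence[Sequence[str]],
    num_composite: int,
) -> List[Action]:
    """Return the single actions followed by the most frequent combinations.

    `observed_turns` holds, for each system turn in the corpus, the single
    actions that the turn covers. Only turns covering two or three distinct
    single actions are counted as candidate composite actions.
    """
    counts = Counter(
        tuple(sorted(set(turn)))
        for turn in observed_turns
        if 2 <= len(set(turn)) <= 3
    )
    composite = [combo for combo, _ in counts.most_common(num_composite)]
    return [(a,) for a in single_actions] + composite


def one_hot(index: int, size: int) -> List[int]:
    """One-hot embedding of the action at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec
```

With this layout, the index of an action in the returned list is the position that is set to one in its embedding.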
3.2 Exploring Dialog Scenarios with an Auxiliary Generator
We aim to train a reward function that is able to distinguish high-quality dialogs from unreasonable and inappropriate dialogs. We use a generator $G_\theta$ to explore the possible dialog scenarios that could happen in real life. The dialog scenario at time $t$ is a pair of a dialog state $s_t$ and the corresponding system action $a_t$ at the same time step $t$. The dialog state-action pairs generated by this generator are fed to the reward model as negative samples. During reward training, the reward function can benefit from the rich and high-quality negative instances produced by the generator to improve its discriminability. The dialog simulating step can be formulated as:
$$(s, a) = G_\theta(z),$$
where $z$ is sampled Gaussian noise and each $z$ corresponds to one potential state-action pair $(s, a)$.
3.2.1 Action Simulation
To simulate the dialog actions, we adopt an MLP as the action generator $G_a$, followed by a Gumbel-Softmax function with $k$ dimensions, where each dimension corresponds to a specific action in the action space. The Gumbel-Max trick (Gumbel, 1954) is commonly used to draw samples $u$ from a categorical distribution with class probabilities $p = (p_1, \dots, p_k)$:
$$u = \mathrm{one\_hot}\left(\arg\max_{i}\left[g_i + \log p_i\right]\right),$$
where $g_i$ is independently sampled from Gumbel(0, 1). However, the $\arg\max$ operation is not differentiable, thus no gradient can be backpropagated through $u$. Instead, we can adopt the soft-argmax approximation (Jang et al., 2016) as a continuous and differentiable approximation to $\arg\max$ and generate $k$-dimensional sample vectors $y$ as follows:
$$y_i = \frac{\exp\left((\log p_i + g_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\log p_j + g_j)/\tau\right)}$$
for $i = 1, \dots, k$. When the temperature $\tau \to 0$, the $\arg\max$ operation is exactly recovered and samples from the Gumbel-Softmax distribution become one-hot. However, the gradient vanishes when $\tau$ approaches $0$. On the contrary, as $\tau$ increases, the Gumbel-Softmax samples become similar to samples from a uniform distribution over $k$ categories. In practice, $\tau$ should be selected to balance the approximation bias and the magnitude of the gradient variance. In our case, $p$ corresponds to the action distribution and $k$ equals the dimension of the action space.
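The sampling procedure described above can be illustrated with a short sketch in plain Python. It is only meant to show the computation of one Gumbel-Softmax sample and its hard one-hot version; the function names are chosen for this example, and because plain Python has no automatic differentiation, the gradient path of the straight-through variant is described in a comment rather than implemented.

```python
import math
import random
from typing import List, Sequence


def sample_gumbel(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample by inverse transform: g = -log(-log(u))."""
    # Clamp u away from 0 and 1 so that both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(
    log_probs: Sequence[float], tau: float, rng: random.Random
) -> List[float]:
    """Return one relaxed sample y, with y_i proportional to exp((log p_i + g_i) / tau)."""
    perturbed = [(lp + sample_gumbel(rng)) / tau for lp in log_probs]
    shift = max(perturbed)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def hard_one_hot(y: Sequence[float]) -> List[float]:
    """Discretize a relaxed sample to the one-hot vector of its argmax.

    In a straight-through estimator this vector is used in the forward pass,
    while gradients are taken with respect to the relaxed sample y.
    """
    k = max(range(len(y)), key=y.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(y))]
```

Lower values of `tau` push the relaxed sample toward the one-hot vector, while larger values flatten it toward the uniform distribution.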
3.2.2 State Simulation
In our setting, the state representation is a vector filled in with discrete values, which means we cannot connect the generator with the discriminator directly. Similar to the action generating method, the Gumbel-Softmax trick could be the bridge to deliver the gradient from the discriminator to the state generator $G_s$. In this solution, we would have to attach a series of Gumbel-Softmax functions to the output of $G_s$, and their number depends on how many meaningful segments are included in the state representation. The Gumbel-Softmax trick is powerless in our setting because the state representation contains a large number of independent meaningful segments. Besides, a preprocessing step is required to expand the discrete representation into a concatenation of one-hot embeddings, which demands familiarity with the state structure. These disadvantages lead us to an alternative solution for state simulation that uses a pretrained Variational AutoEncoder (Kramer, 1991; Kingma and Welling, 2013).
State transferring with Variational AutoEncoder
Compared to the scenarios of GANs in computer vision, the output of the generator in our setting is a discrete vector, which makes it challenging to backpropagate the loss from the discriminator to the generator directly. To address this problem, we propose to project the discrete representation $s$ in the expert demonstrations to a continuous space with the encoder of a pretrained variational autoencoder (Kingma and Welling, 2013). Assuming the expert-like dialogue state $s$ is generated by a latent variable $h$ via the distribution $p_\psi(s \mid h)$, the variable $h$ can serve as the representation we aim for in a continuous space. Given a human-generated state $s$, the VAE uses a conditional probabilistic encoder $\mathrm{Enc}$ to infer the latent $h$:
$$h \sim \mathrm{Enc}_\phi(s) = q_\phi(h \mid s),$$
where $\phi$ are the variational parameters of the encoder and $\psi$ those of the decoder. The optimization objective is given as:
$$L(\phi, \psi) = -\mathbb{E}_{h \sim q_\phi(h \mid s)}\left[\log p_\psi(s \mid h)\right] + \mathrm{KL}\left(q_\phi(h \mid s) \,\|\, p(h)\right).$$
The first term on the right-hand side is the reconstruction loss, which encourages the decoder parameterized by $\psi$ to learn to reconstruct the input $s$. The second term is the KL-divergence between the encoder's distribution $q_\phi(h \mid s)$ and a standard Gaussian distribution $p(h) = \mathcal{N}(0, I)$.
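The two terms of this objective can be made concrete with a small numerical sketch. It assumes a diagonal Gaussian encoder, for which the KL term has a closed form, and a per-dimension Bernoulli decoder for the binary state vector; the Bernoulli likelihood and all function names are assumptions made for illustration, since the text does not specify the reconstruction loss.

```python
import math
import random
from typing import List, Sequence


def reparameterize(
    mu: Sequence[float], log_var: Sequence[float], rng: random.Random
) -> List[float]:
    """Sample a latent as mu + sigma * eps with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def bernoulli_reconstruction_loss(
    state: Sequence[int], probs: Sequence[float], eps: float = 1e-7
) -> float:
    """Negative log-likelihood of a binary state under the decoder outputs."""
    loss = 0.0
    for s, p in zip(state, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        loss -= s * math.log(p) + (1 - s) * math.log(1.0 - p)
    return loss


def vae_loss(
    state: Sequence[int],
    probs: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
) -> float:
    """Single-sample estimate of the objective: reconstruction term plus KL term."""
    return bernoulli_reconstruction_loss(state, probs) + kl_to_standard_normal(
        mu, log_var
    )
```

Here `probs` stands for the decoder output computed from one latent drawn with `reparameterize`, so `vae_loss` is a one-sample Monte Carlo estimate of the expectation in the reconstruction term.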
The benefit of projecting the state representations to a different space is that we can directly simulate the dialog states in the continuous space, just like generating realistic images in computer vision. Besides, similar dialog states will be embedded into close latent representations. As shown in Figure 1, we use a variational autoencoder to learn the state projection function from dialog states in real human dialogs. In summary, we transfer the discrete dialog state from the state tracker to a continuous state space through a pretrained state encoder, and all subsequent training happens in the latent continuous space rather than the original state space.
3.2.3 Adversarial training
By applying Gumbel-Softmax to action simulation and state transferring to state simulation, we can simulate the real state-action distribution in a differentiable setup. The whole process of the generator $G_\theta$ can be formulated as follows:
$$G_\theta(z) = G_s(z) \oplus G_a(z),$$
where $\theta$ denotes all the parameters in the generator and $\oplus$ is the concatenation operation. During the adversarial training process, the generator $G_\theta$ takes noise $z$ as input and outputs a sample $G_\theta(z)$, and it aims to get a higher reward signal from the discriminator $D_\omega$. The training loss for the generator can be given as:
$$L_G(\theta) = -\mathbb{E}_{z \sim p(z)}\left[D_\omega(G_\theta(z))\right],$$
where $p(z)$ is the Gaussian noise distribution and $D_\omega(G_\theta(z))$ denotes the discriminator measuring the reality of the generated state-action pair $G_\theta(z)$.
The discriminator $D_\omega$ in this work is an MLP which takes as input the state-action pair and outputs the probability that the sample comes from the real data distribution. Since the discriminator's goal is to assign a higher probability to the real data and a lower score to the fake data, the objective can be given as the average log probability it assigns to the correct classification. Given an equal mixture of real data samples and generated samples from the generator $G_\theta$, the loss function for the discriminator $D_\omega$ is:
$$L_D(\omega) = -\mathbb{E}_{(s, a) \sim p_{\mathrm{data}}}\left[\log D_\omega(\mathrm{Enc}(s), a)\right] - \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D_\omega(G_\theta(z))\right)\right],$$
where $s$ denotes the discrete state representation from the state tracker and $\mathrm{Enc}$ is the pretrained state encoder.
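As a numerical illustration of the two adversarial objectives, the sketch below computes both losses from discriminator outputs on a batch. It assumes the discriminator loss is the standard binary cross-entropy over real and generated samples, and that the generator simply maximizes the average discriminator score on its samples; the exact form of the generator loss and the function names are assumptions of this example.

```python
import math
from typing import Sequence


def _safe_log(p: float, eps: float = 1e-7) -> float:
    """Logarithm with the argument clipped to [eps, 1 - eps]."""
    return math.log(min(max(p, eps), 1.0 - eps))


def discriminator_loss(
    real_probs: Sequence[float], fake_probs: Sequence[float]
) -> float:
    """Negative average log probability assigned to the correct class.

    `real_probs` are discriminator outputs on encoded human state-action
    pairs, `fake_probs` are outputs on generated pairs.
    """
    real_term = sum(_safe_log(p) for p in real_probs) / len(real_probs)
    fake_term = sum(_safe_log(1.0 - p) for p in fake_probs) / len(fake_probs)
    return -(real_term + fake_term)


def generator_loss(fake_probs: Sequence[float]) -> float:
    """Negative mean discriminator score on generated pairs."""
    return -sum(fake_probs) / len(fake_probs)
```

A discriminator that cannot separate the two sources outputs 0.5 everywhere, which gives a discriminator loss of 2 log 2.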
4 Experimental Setup
4.1 Dataset and Training environment
MultiWOZ MultiWOZ (Budzianowski et al., 2018) is a multi-domain dialogue dataset spanning 7 distinct domains and consisting of 10,438 dialogues. The main scenario in this dataset is that a dialogue agent tries to satisfy the demands of tourists, such as booking a restaurant or recommending a hotel with specific requirements. The interactions between the dialogue agent and users can happen in 7 different domains: Attraction, Hospital, Police, Hotel, Restaurant, Taxi and Train. The average number of turns is 8.93 for single-domain dialogs and 15.39 for multi-domain dialogs.
ConvLab ConvLab (Lee et al., 2019) is an open-source multi-domain end-to-end dialogue system platform. ConvLab offers the annotated MultiWOZ dataset and associated pre-trained reference models. We reuse the rule-based dialog state tracker from ConvLab to keep track of the information that emerges in the interactions between users and the dialog agent. Besides, an agenda-based (Schatzmann et al., 2007) user simulator is embedded in ConvLab and has been adapted for multi-domain dialogue scenarios.
4.2 Architecture and Training Details
Variational AutoEncoder The encoder is a two-layer MLP which takes the discrete state representation (392 dimensions) as input and outputs two intermediate embeddings (64 dimensions) corresponding to the mean and the variance, respectively. At inference time, we regard the mean as the embedded representation of a given state input $s$.
The generator takes randomly sampled Gaussian noise as input and outputs a continuous state representation and a one-hot action embedding. The input noise is first fed to a one-layer MLP, followed by the state generator and the action generator. The state generator is implemented as a two-layer MLP whose output is the simulated state representation (64 dimensions) corresponding to the input noise. The main component of the action generator is a two-layer MLP followed by a Gumbel-Softmax function. The output of the Gumbel-Softmax function is a one-hot representation (300 dimensions). Specifically, in order to sample a discrete action, we implemented the "Straight-Through" Gumbel-Softmax estimator (Jang et al., 2016) with a fixed temperature $\tau$.
The discriminator is a three-layer MLP which takes as input the concatenation of the latent state representation (64 dimensions) and the one-hot encoding (300 dimensions) of the action. During adversarial training, the real samples come from the real human dialogues in the training set, while the fake samples have three different sources. The main source is the output of the generator introduced above. The second source is built by randomly sampling state-action pairs from the training set and replacing the action in each pair with a different action. The third source is a history buffer of limited size that records fake state-action pairs from the generator; the pairs in the buffer are replaced randomly by newly generated pairs. To strengthen the reward signal, we incorporate the human reward signal into the pretrained reward function, and we use the mixed reward as the final reward function to train the dialog agent.
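The second and third sources of fake samples can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the class name `HistoryBuffer`, the helper `swap_action`, the uniform replacement rule, and the representation of actions as integer indices are all chosen for the example.

```python
import random
from typing import Generic, List, TypeVar

Sample = TypeVar("Sample")


class HistoryBuffer(Generic[Sample]):
    """Fixed-capacity store of generated pairs with random replacement when full."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Sample] = []
        self._rng = random.Random(seed)

    def add(self, item: Sample) -> None:
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Overwrite a uniformly chosen slot with the newly generated pair.
            self.items[self._rng.randrange(self.capacity)] = item

    def sample(self, n: int) -> List[Sample]:
        """Draw up to n stored pairs without replacement."""
        return self._rng.sample(self.items, min(n, len(self.items)))


def swap_action(action: int, num_actions: int, rng: random.Random) -> int:
    """Return a uniformly chosen action index that differs from `action`."""
    if num_actions < 2:
        raise ValueError("need at least two actions to swap")
    other = rng.randrange(num_actions - 1)
    # Skip over the original index so that it can never be returned.
    return other + 1 if other >= action else other
```

Pairing a real state with the index returned by `swap_action` yields a fake sample whose state is realistic but whose action does not match the human demonstration.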
4.3 Reinforcement Learning Methods
In this work, we validate our pre-trained reward with two different types of reinforcement learning methods: Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). DQN (Mnih et al., 2015) is an off-policy reinforcement learning algorithm, while PPO (Schulman et al., 2017) is a policy gradient based algorithm. Note that the adversarial learning methods can only be applied to PPO or other policy gradient algorithms. Besides, to speed up training, we extend the vanilla DQN to WDQN, where the dialog policy has access to the expert data from the training set at the very beginning. We implemented the DQN and PPO algorithms according to the reinforcement learning module in ConvLab.
The handcrafted reward signal is defined as follows: at the end of a dialogue, if the dialog agent successfully accomplishes the task within $T$ turns, it receives a large positive reward; otherwise, it receives a large negative penalty. $T$ is the maximum number of turns in each dialogue and is kept the same in all experiments. Furthermore, the dialogue agent receives a small negative intermediate reward at every turn during the dialogue. We use r(Human) to denote the handcrafted reward function.
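The structure of such a handcrafted reward can be sketched as below. The numeric defaults in `RewardConfig` are placeholder values chosen only to make the example runnable and should not be read as the settings used in the experiments; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Placeholder settings for illustration only."""

    max_turns: int = 40             # hypothetical maximum number of turns
    success_reward: float = 80.0    # hypothetical terminal reward on success
    failure_penalty: float = -40.0  # hypothetical terminal penalty on failure
    step_reward: float = -1.0       # hypothetical per-turn intermediate reward


def handcrafted_reward(
    dialog_over: bool, success: bool, cfg: RewardConfig = RewardConfig()
) -> float:
    """Return the reward for the current turn.

    Ongoing turns receive the small intermediate reward; the terminal turn
    receives either the success reward or the failure penalty.
    """
    if not dialog_over:
        return cfg.step_reward
    return cfg.success_reward if success else cfg.failure_penalty
```

Because every non-terminal turn receives the same value, this signal cannot distinguish good intermediate actions from poor ones, which is the limitation the learned reward model is meant to address.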
In terms of DQN based methods, we have DQN(Human), DQN(GAN-AE) and DQN(GAN-VAE), where GAN-VAE is our method and GAN-AE denotes the variant in which the variational autoencoder is replaced with a vanilla autoencoder. With respect to WDQN, we also provide three different reward signals: Human, GAN-AE and GAN-VAE.
In terms of PPO based methods, we implemented Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016). In GAIL, the reward signal is provided by a discriminator, and the parameters of this discriminator are updated during the adversarial training process. To compare the efficiency of different reward signals in a fair setup, the discriminator in GAIL has been pretrained, but the dialog policies are initialized randomly for all methods. We report the average performance obtained by running each method multiple times with different random seeds. In the rest of this paper, we use GAN-VAE to denote the reward function trained with GAN and VAE, and likewise for GAN-AE.
5 Experimental Results
5.1 Results with DQN
Figure 2 shows the learning process with different reward functions but the same user simulator. With respect to DQN agents, the dialogue policy trained with GAN-VAE shows the best performance in terms of convergence speed and success rate. Compared to GAN-VAE and GAN-AE, the update signal from the handcrafted reward function r(Human) can still optimize the dialog policy to a reasonable performance, but more slowly. This indirectly supports the claim that denser reward signals speed up the training process of a dialog policy. Besides, the policy with the handcrafted reward function r(Human) converges to a lower success rate than GAN-VAE and GAN-AE. We believe that, to some extent, the pre-trained reward functions have captured the underlying information needed to measure the quality of given state-action pairs. The knowledge that the reward function learned during the adversarial learning step can be generalized to unseen dialog states and actions to avoid potential local optima. In contrast, the dialog agent DQN(Human) only relies on the final reward signal from the simulator at the end of a dialog, which cannot provide much guidance for the ongoing dialogue turns during the conversation. This could be the reason why DQN(Human) shows a lower success rate than DQN(GAN-VAE) and DQN(GAN-AE). The representation quality of the learned state embeddings leads to different performance between GAN-VAE and GAN-AE, where the VAE brings more benefits to the reward function because of its better generalizability.
In terms of WDQN agents, all three methods reach their inflection points early in training. By comparing DQN(Human) and WDQN(Human), we find that the expert dialog pairs from the training set do alleviate the problem of sparse reward signals for the handcrafted reward function during the initial stage of policy training. Similar results can be observed for agents with pre-trained reward functions. Later in training, the curve of WDQN(Human) coincides with that of DQN(Human), and they converge to the same point in the end. The faster convergence of WDQN(Human) does not bring a higher success rate because the dialog policy still has no access to precise intermediate reward signals for the ongoing dialogue turns.
Table 1: Final performance of different dialog agents at test time. Columns: Dialog Agent, Success Rate, Average Turn.
Table 1 reports the final performance of different dialog agents at test time. All the agents have been trained with the same number of frames, and we save the model which has the best performance during the training stage. One interesting finding is that DQN(GAN-VAE) outperforms WDQN(GAN-VAE), while WDQN(Human) beats DQN(Human). The warm-up stage in WDQN(GAN-VAE) improves the training speed but also has the side effect of a lower success rate in the final stage. A potential reason is that the expert dialogs bring a strong update signal at the beginning of the training process but also limit the exploration ability of the dialog agent. To verify this argument, we designed two more WDQN agents, WDQN_keep(Human) and WDQN_keep(GAN-AE), as shown in Table 1. The expert dialogues in these two agents are kept during the whole training stage instead of being removed gradually. With respect to the human designed reward function, there is a huge performance gap between WDQN_keep(Human) and WDQN(Human). The performance difference between WDQN_keep(GAN-AE) and WDQN(GAN-AE) is much smaller because the pre-trained reward function brings more precise and consistent update signals, which were explored and uncovered during the adversarial training step.
5.2 Results with PPO
With respect to the adversarial training methods, the reward functions are updated on the fly and only policy gradient based reinforcement learning algorithms are applicable. To compare the performance of a pretrained reward function with reward functions updated in real time, we use PPO to train the dialog agent with different reward functions. Depending on how we design the discriminator in GAIL, we have two PPO variants. To pretrain the discriminator of the first variant, we first train a dialogue policy with imitation learning to generate negative samples, which are then used to pretrain the discriminator. In contrast, the second variant reuses the pretrained reward function of our proposed method PPO(GAN-VAE) but keeps updating it during the training process. The training performance is shown in Figure 3. It should be noted that the dialog agents shown in Figure 3 are not pretrained; only the corresponding reward models have been tuned in advance. PPO(GAN-VAE) manages to increase the success rate through interacting with the user simulator. With respect to the reward functions updated in real time, corresponding to the two GAIL variants, the success rate increases gradually at first, then slows down in the following interactions and gets stuck in a local optimum. The learning curve of the human designed reward function keeps growing, albeit slowly.
5.3 Transfer learning with pretrained reward function
When defining the action space, we keep the most frequent actions from the MultiWOZ dataset and use one-hot embeddings to represent the actions. As shown in Figure 1, the action representation is concatenated to the state representation to denote a specific state-action pair. However, this way of formulating the action space ignores the relations between different actions. For example, Restaurant_Inform_Price and Restaurant_Request_People should be close in the same conversation since they happen in the same domain. On the other hand, there are still connections between actions from two different domains: for example, the slot types Inform_Price and Request_People can also occur in the Hotel domain, corresponding to the actions Hotel_Inform_Price and Hotel_Request_People. We are curious whether we can transfer the knowledge learned in several domains to a new domain never seen before through the pre-trained reward function. To verify this hypothesis, we first reformulate the action representation as a concatenation of three different segments: Onehot(Domain), Onehot(Diact), Onehot(Slot). In this way, actions containing similar information are linked through the corresponding segments in the action representation. Following this formulation, we retrained our reward function in several domains and incorporated it into the training process of a dialogue agent in a new domain. In our work, the existing domains are Restaurant, Bus, Attraction and Train, and the testing domain is Hotel, since Hotel has the most slot types and some of them are unique, such as Internet, Parking and Stars. As shown in Figure 4, the agent DQN(GAN-VAE + NoHotel) still benefits from the reward function trained in different domains and manages to outperform DQN(Human). For comparison, Figure 4 also includes the dialog agent trained in all domains with the action represented by a single one-hot embedding, as well as the agent obtained from it by replacing that action representation. As expected, the reward function trained in all domains performs better than the one trained in different domains.
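The segmented action representation can be illustrated with a short sketch for single actions. The vocabularies passed to `encode_action` are toy examples, the function names are chosen for this illustration, and how composite actions would be encoded is left open here because the text does not specify it.

```python
from typing import List, Sequence


def one_hot(value: str, vocab: Sequence[str]) -> List[int]:
    """One-hot vector of `value` over `vocab`; raises ValueError if absent."""
    index = list(vocab).index(value)
    return [1 if i == index else 0 for i in range(len(vocab))]


def encode_action(
    action: str,
    domains: Sequence[str],
    diacts: Sequence[str],
    slots: Sequence[str],
) -> List[int]:
    """Encode 'Domain_Diact_Slot' as Onehot(Domain) + Onehot(Diact) + Onehot(Slot)."""
    domain, diact, slot = action.split("_")
    return one_hot(domain, domains) + one_hot(diact, diacts) + one_hot(slot, slots)
```

Under this encoding, Restaurant_Inform_Price and Hotel_Inform_Price share the active positions of the dialog act and slot segments and differ only in the domain segment, which is what allows a reward model trained without Hotel dialogs to reuse what it learned about Inform_Price.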
6 Conclusion
In this work, we propose a guided dialog policy training method that does not use adversarial training in the loop. We first train the reward model with an auxiliary generator and then incorporate this trained reward model into a common reinforcement learning method to train a high-quality dialog agent. Through several experiments, we show that the proposed method achieves a high task success rate and has the potential to transfer knowledge from existing domains to a new domain.
References
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.
- Dhingra et al. (2016) Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2016. Towards end-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777.
- Gašić and Young (2014) Milica Gašić and Steve Young. 2014. Gaussian processes for pomdp-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):28–40.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
- Gumbel (1954) Emil Julius Gumbel. 1954. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office.
- Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In NIPS, pages 4565–4573.
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Kramer (1991) Mark A Kramer. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal, 37(2):233–243.
- Lee et al. (2019) Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Xiang Li, Yaoqin Zhang, Zheng Zhang, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, et al. 2019. Convlab: Multi-domain end-to-end dialog system platform. arXiv preprint arXiv:1904.08637.
- Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.
- Li et al. (2016) Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.
- Li et al. (2019) Ziming Li, Julia Kiseleva, and Maarten de Rijke. 2019. Dialogue generation: From imitation learning to inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6722–6729.
- Lipton et al. (2018) Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2018. Bbq-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Liu and Lane (2018) Bing Liu and Ian Lane. 2018. Adversarial learning of task-oriented neural dialog models. In Proceedings of the SIGDIAL 2018 Conference, pages 350–359.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529.
- Peng et al. (2018a) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, and Kam-Fai Wong. 2018a. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6149–6153. IEEE.
- Peng et al. (2018b) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Kam-Fai Wong. 2018b. Deep dyna-q: Integrating planning for task-completion dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2182–2192.
- Peng et al. (2017) Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. arXiv preprint arXiv:1704.03084.
- Schatzmann et al. (2007) Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a pomdp dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 149–152. Association for Computational Linguistics.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Su et al. (2017) Pei-Hao Su, Pawel Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130.
- Su et al. (2016) Pei-Hao Su, Milica Gasic, Nikola Mrkšić, Lina M Rojas Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2431–2441.
- Su et al. (2018) Shang-Yu Su, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Yun-Nung Chen. 2018. Discriminative deep dyna-q: Robust planning for dialogue policy learning. In EMNLP.
- Takanobu et al. (2019) Ryuichi Takanobu, Hanlin Zhu, and Minlie Huang. 2019. Guided dialog policy learning: Reward estimation for multi-domain task-oriented dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 100–110.
- Williams et al. (2017) Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274.
- Williams and Young (2007) Jason D Williams and Steve Young. 2007. Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.
- Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
- Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.