Guided Dialog Policy Learning without Adversarial Learning in the Loop

04/07/2020 ∙ by Ziming Li, et al. ∙ Amazon, Microsoft, University of Amsterdam

Reinforcement learning methods have emerged as the most popular choice for training an efficient and effective dialog policy. However, these methods suffer from sparse and unstable reward signals, which are usually returned by the user simulator only at the end of the dialog. In addition, the reward signal is manually designed by human experts, which requires domain knowledge. A number of adversarial learning methods have been proposed to learn the reward function together with the dialog policy. However, to alternately update the dialog policy and the reward model on the fly, the algorithms used to update the dialog policy are limited to policy-gradient-based algorithms such as REINFORCE and PPO. Moreover, the alternating training of the dialog agent and the reward model can easily get stuck in local optima or result in mode collapse. In this work, we propose to decompose the previous adversarial training into two steps. We first train the discriminator with an auxiliary dialog generator and then incorporate this trained reward model into a common reinforcement learning method to train a high-quality dialog agent. This approach is applicable to both on-policy and off-policy reinforcement learning methods. Through several experiments, we show that the proposed method achieves remarkable task success and has the potential to transfer knowledge from existing domains to a new domain.




1 Introduction

Task-oriented dialog systems, such as Siri, Google Assistant and Amazon Alexa, aim to help users complete tasks through conversation. The development of reinforcement learning in robotics and other domains has brought a new perspective on learning the dialog policy (Williams and Young, 2007; Gašić and Young, 2014; Su et al., 2017; Li et al., 2019). As it is not practical to interact with real users in the policy training loop, a common and essential strategy is to build a user simulator that provides replies to the dialog agent (Schatzmann et al., 2007; Li et al., 2016). Real dialog systems aim to maximize the positive feedback they receive from users. To simulate user feedback during training, a reward function is designed and embedded in the user simulator; it returns a reward signal to the dialog agent according to the given dialog context and system action (Peng et al., 2018b; Williams et al., 2017; Dhingra et al., 2016; Su et al., 2016). The reward signal can take the form of binary feedback or a continuous score. The most straightforward way to design such a reward function is to base the reward on the dialog status: if the dialog ends successfully, a large positive reward is returned; if the dialog fails, the reward is a large negative value; if the dialog is still ongoing, a small negative reward is returned to encourage shorter sessions (Peng et al., 2018b). However, this solution assigns the same negative signal to every system action in the dialog except the last one, so the quality of different actions cannot be distinguished. Moreover, the truly meaningful reward signal is returned only at the end of a dialog, which delays both the penalty for low-quality actions and the reward for high-quality ones.
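The status-based reward scheme described above can be sketched in a few lines. This is only an illustration: the function name and the three numeric values are placeholders chosen for the example, not values taken from this paper or from any cited system.

```python
def handcrafted_reward(dialog_over: bool, success: bool,
                       success_reward: float = 20.0,
                       failure_penalty: float = -10.0,
                       step_penalty: float = -1.0) -> float:
    """Turn-level reward under a sparse, status-based scheme.

    All three magnitudes are hypothetical placeholders.
    """
    if not dialog_over:
        # Every ongoing turn gets the same small penalty, regardless of
        # how good or bad the system action actually was.
        return step_penalty
    return success_reward if success else failure_penalty


# A successful five-turn dialog: four identical step penalties, then the
# only informative signal arrives at the very end.
rewards = [handcrafted_reward(dialog_over=(t == 4), success=True) for t in range(5)]
print(rewards)
```

The printed list makes the limitation visible: the first four actions are indistinguishable from the reward alone.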
Liu and Lane (2018) address these difficulties by adopting adversarial training for policy learning. They jointly train two systems: (1) a policy model that decides which action to take at each turn, and (2) a discriminator that marks a dialog as successful or not. Feedback from the discriminator is used as a reward signal to push the policy model to complete a task in a way that is indistinguishable from how a human agent completes it. Following this solution, Takanobu et al. (2019) replace the discriminator with a reward function that has a specific architecture and takes as input the dialog state, the system action and the next dialog state. This method achieves higher performance with respect to success rate and other metrics.

However, to alternately update the dialog policy and the reward model on the fly, the algorithms used to update the dialog policy are limited to policy-gradient-based algorithms such as REINFORCE (Williams, 1992) and PPO (Schulman et al., 2017), while off-policy methods cannot benefit from the self-learned reward functions. Moreover, the alternating training of the dialog agent and the reward model can easily get stuck in local optima or result in mode collapse. To alleviate these problems, we decompose the adversarial learning method for dialog policy learning into two sequential steps. In the first step, we learn the reward function using an auxiliary dialog state generator, where the loss from the discriminator can be backpropagated to the generator directly. In the second step, we discard the state generator and keep only the trained discriminator as the dialog reward model. The trained reward model is incorporated into the reinforcement learning process and is not updated further. In this way, we can use any reinforcement learning algorithm to update the dialog policy, including both on-policy and off-policy methods. In addition, we show how to use the pretrained reward functions to transfer knowledge learned in existing domains to a new dialog domain. To summarize, we make the following technical contributions:


  • A reward learning method that is applicable to off-policy reinforcement learning methods in dialog training.

  • A reward learning method that can alleviate the problem of local optima in adversarial dialog training.

  • A reward function that can transfer knowledge learned in existing domains to a new dialog domain.

2 Related Work

Building a dialog system that can handle conversations across different domains has attracted a lot of attention in the last few years. Rule-based dialog systems struggle in multi-domain scenarios because of the rich and diverse interactions: it is intractable to anticipate every possible situation and prepare a corresponding solution with a set of predefined rules. Reinforcement learning methods (Peng et al., 2017; Lipton et al., 2018; Li et al., 2017; Su et al., 2018; Dhingra et al., 2016; Williams et al., 2017) have been widely used to train a dialog agent by interacting with users. With the help of reinforcement learning, the dialog agent is able to explore dialog contexts that may not exist in previously observed data. However, the reward signal used to update the dialog policy usually comes from a reward function predefined with domain knowledge, which becomes very tricky to design in multi-domain dialog scenarios. To provide the dialog policy with a high-quality reward signal, Peng et al. (2018a) proposed using an adversarial loss as an extra critic alongside the main reward function. Inspired by the success of adversarial learning in other research fields, Liu and Lane (2018) learn the reward function directly from dialog samples by alternately updating the dialog policy and the reward function. The reward function is in fact a discriminator that aims to assign high values to real human dialogues and low values to dialogues generated by the current dialog policy. The dialog policy, in turn, attempts to obtain a higher reward from the discriminator for its generated dialogs. Following this solution, Takanobu et al. (2019) replace the discriminator with a reward function that has a specific architecture and report higher performance with respect to success rate and other metrics.

3 Learning Reward Functions with an Auxiliary Generator

Unlike previous adversarial training methods (Liu and Lane, 2018; Takanobu et al., 2019), our method trains the dialog policy and the reward model consecutively rather than alternately at different time steps. We believe this avoids potential training issues such as mode collapse and local optima. To achieve this, we introduce an auxiliary generator in the first step, which is used to explore potential dialog situations. The advantage of this setup is that it moves us from the SeqGAN setting (Yu et al., 2017) to the vanilla GAN setting (Goodfellow et al., 2014). The SeqGAN setup refers to the adversarial training style in which policy gradient is used to deliver the update signal from the discriminator to the dialog agent. In contrast, in the vanilla GAN framework the discriminator can backpropagate its loss directly to the generator. Once we have obtained a high-quality reward model with the auxiliary generator in the first step, we can use it in common reinforcement learning methods to update the dialog agent. Since the reward model is kept fixed during policy training, we can adopt different kinds of reinforcement learning methods, whereas adversarial learning methods are restricted to policy-gradient-based methods.
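To illustrate why a fixed reward model removes the restriction to policy gradient methods, here is a minimal sketch of an off-policy, value-based update that simply queries a frozen reward function. It uses tabular Q-learning rather than a deep Q-network to stay self-contained; `q_learning_update`, the state and action labels, and the constant reward model are all hypothetical stand-ins, not the implementation used in this work.

```python
def q_learning_update(q, state, action, next_state, done, reward_model,
                      actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step whose reward comes from a frozen model.

    `reward_model` is any callable (state, action) -> float; it is only
    queried here and never updated, mirroring a pretrained reward function.
    """
    reward = reward_model(state, action)
    # Bootstrapped value of the best next action (zero at the end of a dialog).
    best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]


q_table = {}
frozen_reward = lambda s, a: 1.0  # stand-in for a trained discriminator
value = q_learning_update(q_table, "s0", "a0", "s1", False, frozen_reward, ["a0", "a1"])
```

Because the update only reads from `reward_model`, the same reward function could be plugged into an on-policy learner without any change.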

3.1 Dialog State Tracker

We reuse the rule-based dialog state tracker in ConvLab (Lee et al., 2019) to keep track of the information that emerges in the interactions between users and the dialog agent (ConvLab is introduced in more detail in Section 4.1). The state tracker plays an important role in dialog systems since its output is the foundation for the dialog policy's decisions in the next step. The tracker embedded in ConvLab can handle multi-domain interactions. The output of the NLU module is fed to the dialog state tracker to extract useful information, including the informable slots, which capture the user's constraints, and the requestable slots, which indicate what the user wants to ask about. In addition, a belief vector is maintained and updated for each slot in every domain.

Dialog State. The scattered information from the dialog state tracker is integrated into a structured state representation $s_t$ at time step $t$. The final representation consists of six feature segments: the embedded results of the entities returned for a query, the availability of the booking option for a given domain, the state of the informable slots, the state of the requestable slots, the last user action, and the number of times the last user action has been repeated without interruption. The final state representation is a vector with 392 dimensions in which each position is filled with 0 or 1.

Dialog Action. The action space consists of two sets. In the first action set, each action is a concatenation of domain name, action type and slot name, such as Attraction_Inform_Address and Hotel_Request_Internet. Since in real scenarios a response from a human or a dialog agent can cover several single actions from the first set, we extract the most frequently used dialog actions from the human-human dialog dataset to form the second action set. In other words, every action in the second set is a combination of two or three single actions from the first set. For example, [Attraction_Inform_Address, Hotel_Request_Internet] is regarded as a new action that the policy agent can execute. The final action space has 300 different dialog actions. We use one-hot embeddings to represent the actions.
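The construction of the action space can be sketched as follows. The helper names (`build_action_space`, `one_hot`) and the toy data are illustrative assumptions; the sketch only shows the idea of appending the most frequent two- or three-action combinations to the single actions and indexing the result with one-hot vectors.

```python
from collections import Counter


def build_action_space(single_actions, dialog_turns, num_composite):
    """Single actions plus the most frequent 2- or 3-action combinations."""
    combos = Counter(
        tuple(sorted(turn)) for turn in dialog_turns if 2 <= len(turn) <= 3
    )
    composite = [combo for combo, _ in combos.most_common(num_composite)]
    return [(a,) for a in single_actions] + composite


def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec


# Toy data standing in for system turns of a human-human dialog corpus.
single = ["Attraction_Inform_Address", "Hotel_Request_Internet", "Hotel_Inform_Price"]
turns = [
    ["Attraction_Inform_Address", "Hotel_Request_Internet"],
    ["Hotel_Request_Internet", "Attraction_Inform_Address"],
    ["Hotel_Inform_Price"],
    ["Hotel_Inform_Price", "Hotel_Request_Internet"],
]
actions = build_action_space(single, turns, num_composite=1)
combo = ("Attraction_Inform_Address", "Hotel_Request_Internet")
encoded = one_hot(actions.index(combo), len(actions))
```

Sorting each combination before counting treats the two orderings of the same pair as one composite action, which is a design choice of this sketch.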

3.2 Exploring Dialog Scenarios with an Auxiliary Generator

We aim to train a reward function that can distinguish high-quality dialogs from unreasonable and inappropriate ones. We use a generator to explore the dialog scenarios that could happen in real life. The dialog scenario at time $t$ is a pair $(s_t, a_t)$ of a dialog state $s_t$ and the corresponding system action $a_t$ at the same time step. The state-action pairs produced by this generator are fed to the reward model as negative samples. During reward training, the reward function benefits from the rich, high-quality negative instances produced by the generator, which improves its discriminative power. The dialog simulation step can be formulated as:

$$(s, a)_{\text{fake}} = G_\theta(z), \quad z \sim \mathcal{N}(0, I),$$

where $z$ is sampled Gaussian noise and each $z$ corresponds to one potential state-action pair $(s, a)_{\text{fake}}$.

3.2.1 Action Simulation

To simulate the dialog actions, we adopt an MLP as the action generator, followed by a Gumbel-Softmax function with $k$ dimensions, where each dimension corresponds to a specific action in the action space. The Gumbel-Max trick (Gumbel, 1954) is commonly used to draw samples $u$ from a categorical distribution with class probabilities $p = (p_1, \dots, p_k)$:

$$u = \text{one\_hot}\left(\arg\max_{i} \left[ g_i + \log p_i \right]\right),$$

where each $g_i$ is independently sampled from Gumbel(0, 1). However, the argmax operation is not differentiable, so no gradient can be backpropagated through $u$. Instead, we can adopt the soft-argmax approximation (Jang et al., 2016) as a continuous and differentiable approximation to argmax and generate $k$-dimensional sample vectors $y$:

$$y_i = \frac{\exp\left((\log p_i + g_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\log p_j + g_j)/\tau\right)}$$

for $i = 1, \dots, k$. When the temperature $\tau \to 0$, the argmax operation is exactly recovered and samples from the Gumbel-Softmax distribution become one-hot. However, the gradient vanishes as $\tau$ approaches 0. Conversely, as $\tau$ increases, the Gumbel-Softmax samples become similar to samples from a uniform distribution over the $k$ categories. In practice, $\tau$ should be selected to balance the approximation bias against the magnitude of the gradient variance. In our case, $p$ corresponds to the action distribution and $k$ equals the action dimension.
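The two sampling schemes can be compared with a small standard-library sketch. The function names are illustrative and the probabilities are toy values; the code assumes every class probability is strictly positive so that the logarithm is defined.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_max(probs, rng):
    """Exact categorical sample as a one-hot vector (not differentiable)."""
    scores = [math.log(p) + sample_gumbel(rng) for p in probs]
    k = max(range(len(probs)), key=scores.__getitem__)
    return [1 if i == k else 0 for i in range(len(probs))]


def gumbel_softmax(probs, tau, rng):
    """Relaxed sample: softmax of (log p_i + g_i) / tau."""
    noisy = [(math.log(p) + sample_gumbel(rng)) / tau for p in probs]
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.1, 0.6, 0.3]
soft_cold = gumbel_softmax(probs, tau=0.1, rng=random.Random(0))
soft_hot = gumbel_softmax(probs, tau=100.0, rng=random.Random(0))
```

With the same noise, the largest entry of the relaxed sample sits at the position chosen by the Gumbel-Max trick, since dividing by a positive temperature does not change the ordering. A large temperature pushes the sample toward the uniform vector.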

3.2.2 State Simulation

In our setting, the state representation is a vector of discrete values, which means we cannot connect the generator to the discriminator directly. As with action generation, the Gumbel-Softmax trick could serve as the bridge that delivers the gradient from the discriminator to the state generator $G_s$. In that solution, we would have to attach a set of Gumbel-Softmax functions to the output of $G_s$, one for each meaningful segment in the state representation. The Gumbel-Softmax trick is impractical in our setting because the state representation contains many independent meaningful segments. In addition, it requires a preprocessing step that expands the discrete representation into a concatenation of one-hot embeddings, which demands familiarity with the state structure. These disadvantages lead us to an alternative solution for state simulation that uses a pretrained Variational AutoEncoder (Kramer, 1991; Kingma and Welling, 2013).

State transferring with Variational AutoEncoder

Compared to GAN applications in computer vision, the output of the generator in our setting is a discrete vector, which makes it challenging to backpropagate the loss from the discriminator to the generator directly. To address this problem, we propose to project the discrete representation $s$ in the expert demonstrations to a continuous space with the encoder of a pretrained variational autoencoder (Kingma and Welling, 2013). Assuming the expert-like dialogue state $s$ is generated by a latent variable $z_{\text{vae}}$ via the distribution $p_\psi(s \mid z_{\text{vae}})$, the variable $z_{\text{vae}}$ can serve as the representation we aim for in a continuous space. Given a human-generated state $s$, the VAE uses a conditional probabilistic encoder to infer the latents $z_{\text{vae}}$:

$$z_{\text{vae}} \sim \text{Enc}(s) = q_\omega(z_{\text{vae}} \mid s),$$

where $\omega$ are the variational parameters of the encoder and $\psi$ those of the decoder. The optimization objective is given as:

$$\mathcal{L}(\omega, \psi) = -\mathbb{E}_{z_{\text{vae}} \sim q_\omega(z_{\text{vae}} \mid s)}\left[\log p_\psi(s \mid z_{\text{vae}})\right] + \mathrm{KL}\left(q_\omega(z_{\text{vae}} \mid s) \,\|\, p(z_{\text{vae}})\right).$$

The first term on the right-hand side is the reconstruction loss, which encourages the decoder parameterized by $\psi$ to learn to reconstruct the input $s$. The second term is the KL-divergence between the encoder's distribution $q_\omega(z_{\text{vae}} \mid s)$ and a standard Gaussian distribution $p(z_{\text{vae}}) = \mathcal{N}(0, I)$.

The benefit of projecting the state representations to a different space is that we can simulate dialog states directly in the continuous space, just like generating realistic images in computer vision. In addition, similar dialog states are embedded into nearby latent representations. As shown in Figure 1, we use a variational autoencoder to learn the state projection function from dialog states in real human dialogs. In summary, we transfer the discrete dialog state from the state tracker to a continuous state space through a pretrained state encoder, and all subsequent training happens in the continuous latent space rather than the original state space.
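The two terms of a VAE training loss, a reconstruction term plus a KL term against a standard Gaussian, can be written out in plain Python. This sketch assumes a diagonal Gaussian encoder and a Bernoulli decoder over the binary state vector; both are common choices but are assumptions here, and the function names are illustrative.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def bernoulli_reconstruction_loss(state, probs, eps=1e-7):
    """Negative log-likelihood of a binary state under decoder probabilities."""
    loss = 0.0
    for s, p in zip(state, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        loss -= s * math.log(p) + (1 - s) * math.log(1.0 - p)
    return loss


def vae_loss(state, probs, mu, log_var):
    """Reconstruction loss plus KL regularizer for one state."""
    return bernoulli_reconstruction_loss(state, probs) + kl_to_standard_normal(mu, log_var)


latent = reparameterize([0.0, 0.0], [0.0, 0.0], random.Random(0))
loss = vae_loss([1, 0], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0])
```

When the encoder output already matches the standard Gaussian (zero mean, unit variance), the KL term is zero and only the reconstruction term remains.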

3.2.3 Adversarial training

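By applying Gumbel-Softmax to action simulation and state transfer to state simulation, we can simulate the real state-action distribution in a differentiable setup.

The whole generation process can be formulated as follows:

$$G_\theta(z) = G_s(z) \oplus G_a(z),$$

where $G_s$ and $G_a$ denote the state generator and the action generator (including its Gumbel-Softmax output), $\theta$ denotes all the parameters in the generator and $\oplus$ is the concatenation operation. During adversarial training, the generator $G_\theta$ takes noise $z$ as input and outputs a sample $(s, a)_{\text{fake}}$, aiming to obtain a higher reward signal from the discriminator $D_\phi$. The training loss for the generator can be given as:

$$L_G(\theta) = -\mathbb{E}_{(s, a)_{\text{fake}} \sim G_\theta}\left[ D_\phi\left((s, a)_{\text{fake}}\right) \right],$$

where $(s, a)_{\text{fake}} = G_\theta(z)$ and $D_\phi$ denotes the discriminator, which measures how realistic the generated state-action pairs $(s, a)_{\text{fake}}$ are.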

Figure 1: The architecture used to simulate state-action representations with a variational autoencoder. $z$ is the sampled Gaussian noise.

The discriminator $D_\phi$ in this work is an MLP that takes a state-action pair as input and outputs the probability that the sample comes from the real data distribution. Since the discriminator's goal is to assign a higher probability to real data and a lower score to fake data, its objective can be given as the average log probability it assigns to the correct classification. Given an equal mixture of real data samples and generated samples from the generator $G_\theta$, the loss function for the discriminator is:

$$L_D(\phi) = -\left( \mathbb{E}_{(s, a) \sim \text{data}}\left[\log D_\phi(\text{Enc}(s), a)\right] + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\log\left(1 - D_\phi(G_\theta(z))\right)\right] \right),$$

where Enc is the pretrained state encoder and $s$ denotes the discrete state representation from the state tracker.
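As a numerical illustration of the two objectives, the sketch below computes both losses from lists of discriminator outputs. The discriminator loss follows the description above (negative average log-probability of the correct class). Taking the generator loss to be the negative mean discriminator score is an assumption of this sketch; the negative mean log-score is a common alternative.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Negative average log-probability assigned to the correct class.

    d_real: discriminator outputs on real (encoded state, action) pairs.
    d_fake: discriminator outputs on generated pairs.
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Negative mean discriminator score on generated pairs (assumed form)."""
    return -sum(d_fake) / len(d_fake)


confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
sharp = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
g_loss = generator_loss([0.2, 0.4])
```

A discriminator that outputs 0.5 everywhere incurs a loss of 2 log 2, and the loss shrinks as it separates real from generated pairs, while the generator's loss decreases as its samples receive higher scores.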

4 Experimental Setup

4.1 Dataset and Training environment

MultiWOZ. MultiWOZ (Budzianowski et al., 2018) is a multi-domain dialogue dataset spanning 7 distinct domains and consisting of 10,438 dialogues. The main scenario in this dataset is a dialogue agent trying to satisfy the demands of tourists, such as booking a restaurant or recommending a hotel with specific requirements. The interactions between the dialogue agent and users can happen in 7 different domains: Attraction, Hospital, Police, Hotel, Restaurant, Taxi and Train. The average number of turns is 8.93 for single-domain dialogs and 15.39 for multi-domain dialogs.

ConvLab. ConvLab (Lee et al., 2019) is an open-source multi-domain end-to-end dialogue system platform. ConvLab offers the annotated MultiWOZ dataset and associated pre-trained reference models. We reuse the rule-based dialog state tracker from ConvLab to keep track of the information that emerges in the interactions between users and the dialog agent. In addition, an agenda-based (Schatzmann et al., 2007) user simulator is embedded in ConvLab and has been adapted for multi-domain dialogue scenarios.

4.2 Architecture and Training Details

Variational AutoEncoder. The encoder is a two-layer MLP that takes the discrete state representation (392 dimensions) as input and outputs two intermediate embeddings (64 dimensions each), corresponding to the mean and the variance respectively. At inference time, we take the mean as the embedded representation of a given state input $s$.

The generator takes randomly sampled Gaussian noise as input and outputs a continuous state representation and a one-hot action embedding. The input noise is first fed to a one-layer MLP, followed by the state generator and the action generator. The state generator is implemented as a two-layer MLP whose output is the simulated state representation (64 dimensions) corresponding to the input noise. The main component of the action generator is a two-layer MLP followed by a Gumbel-Softmax function. The output of the Gumbel-Softmax function is a one-hot representation (300 dimensions). Specifically, in order to sample a discrete action, we implemented the "Straight-Through" Gumbel-Softmax estimator (Jang et al., 2016) with the temperature set to a fixed value.

The discriminator is a three-layer MLP that takes as input the concatenation of the latent state representation (64 dimensions) and the one-hot encoding of the action (300 dimensions). During adversarial training, the real samples come from the real human dialogues in the training set, while the fake samples come from three different sources. The main source is the output of the generator introduced above. The second source is built by randomly sampling state-action pairs from the training set and replacing the action in each pair with a different action. The third source is a history buffer of limited size that records fake state-action pairs from the generator; pairs in the buffer are replaced at random by newly generated pairs. To strengthen the reward signal, we incorporate the human reward signal into the pretrained reward function and use the mixed reward as the final reward function for training the dialog agent.
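Two of the negative-sample sources described above, the bounded history buffer with random replacement and the action-swapped pairs, can be sketched as follows. The class and function names are illustrative, actions are represented by integer ids for brevity, and the capacity is a free parameter of the sketch rather than a value from this work.

```python
import random


class HistoryBuffer:
    """Fixed-capacity store of past generator outputs with random replacement."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []

    def add(self, pair):
        if len(self.items) < self.capacity:
            self.items.append(pair)
        else:
            # Overwrite a randomly chosen old pair with the new one.
            self.items[self.rng.randrange(self.capacity)] = pair

    def sample(self, n):
        return self.rng.sample(self.items, min(n, len(self.items)))


def swap_action(pair, num_actions, rng):
    """Build a fake pair by replacing the action id with a different one."""
    state, action = pair
    other = rng.randrange(num_actions - 1)
    if other >= action:
        other += 1  # skip over the original action id
    return state, other


rng = random.Random(0)
buffer = HistoryBuffer(capacity=3, rng=rng)
for step in range(10):
    buffer.add((f"state_{step}", step % 5))
fake_pair = swap_action(("real_state", 2), num_actions=5, rng=rng)
```

Drawing from `num_actions - 1` ids and shifting past the original keeps the replacement uniform over all other actions without rejection sampling.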

4.3 Reinforcement Learning Methods

In this work, we validate our pre-trained reward with two different types of reinforcement learning methods: Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). DQN (Mnih et al., 2015) is an off-policy reinforcement learning algorithm, while PPO (Schulman et al., 2017) is a policy-gradient-based algorithm. Note that adversarial learning methods can only be applied to PPO or other policy gradient algorithms. In addition, to speed up training, we extend the vanilla DQN to WDQN, in which the dialog policy has access to the expert data from the training set at the very beginning. We implemented the DQN and PPO algorithms based on the reinforcement learning module in ConvLab.
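One simple way to give a value-based agent early access to expert data is to pre-fill its replay buffer with expert transitions. The sketch below assumes a first-in, first-out buffer, so expert transitions are displaced gradually as the agent collects its own experience; this eviction rule and the helper name are assumptions for illustration, not a description of the WDQN implementation.

```python
from collections import deque


def make_warm_replay_buffer(expert_transitions, capacity):
    """Replay buffer that starts out filled with expert transitions."""
    buffer = deque(maxlen=capacity)
    buffer.extend(expert_transitions)
    return buffer


# Four expert transitions fill the buffer at the start of training.
replay = make_warm_replay_buffer([("expert", i) for i in range(4)], capacity=4)
for step in range(3):
    # Each new agent transition evicts the oldest entry once the buffer is full.
    replay.append(("agent", step))
expert_left = sum(1 for source, _ in replay if source == "expert")
```

After three agent transitions only one expert transition remains, so the expert signal dominates early updates and fades as training proceeds.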

4.4 Baselines

The handcrafted reward signal is defined as follows: at the end of a dialogue, if the dialog agent successfully accomplishes the task within the maximum number of turns allowed per dialogue, it receives a large positive reward; otherwise, it receives a negative penalty. The maximum number of turns is fixed throughout the experiments. Furthermore, the dialogue agent receives a small negative intermediate reward at each turn during the dialogue. We use r(Human) to denote the handcrafted reward function.

For DQN-based methods, we have DQN(Human), DQN(GAN-AE) and DQN(GAN-VAE), where GAN-VAE is our method and GAN-AE denotes the variant in which the variational autoencoder is replaced with a vanilla autoencoder. For WDQN, we likewise provide three different reward signals: Human, GAN-AE and GAN-VAE.

For PPO-based methods, we implemented Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016). In GAIL, the reward signal is provided by a discriminator whose parameters are updated during the adversarial training process. To compare the efficiency of different reward signals in a fair setup, the discriminator in GAIL is pretrained, but the dialog policies are initialized randomly for all methods. We report the average performance over multiple runs of each method with different random seeds. In the rest of this paper, we use GAN-VAE to denote the reward function trained with GAN and VAE, and likewise for GAN-AE.

5 Experimental Results

5.1 Results with DQN

Figure 2: The learning process with different reward functions and training agents

Figure 2 plots the learning process with different reward functions but the same user simulator. Among the DQN agents, the dialogue policy trained with GAN-VAE shows the best performance in terms of convergence speed and success rate. Compared to GAN-VAE and GAN-AE, the update signal from the handcrafted reward function r(Human) can still optimize the dialog policy to a reasonable performance, but more slowly. This indirectly supports the claim that denser reward signals speed up the training of a dialog policy. In addition, the policy trained with the handcrafted reward function r(Human) converges to a lower success rate than GAN-VAE and GAN-AE. We believe that, to some extent, the pre-trained reward functions have captured the underlying information needed to measure the quality of given state-action pairs. The knowledge that the reward function learned during the adversarial learning step can generalize to unseen dialog states and actions, which helps avoid potential local optima. In contrast, the dialog agent DQN(Human) relies only on the final reward signal from the simulator at the end of a dialog, which provides little guidance for the ongoing turns of the conversation. This could be why DQN(Human) shows a lower success rate than DQN(GAN-VAE) and DQN(GAN-AE). The difference in performance between GAN-VAE and GAN-AE stems from the quality of the learned state embeddings: the VAE benefits the reward function more because it generalizes better.

For the WDQN agents, all three methods reach their inflection points early in training. Comparing DQN(Human) and WDQN(Human), we find that the expert dialog pairs from the training set do alleviate the problem of sparse reward signals for the handcrafted reward function during the initial stage of policy training. Similar results can be observed for agents with pre-trained reward functions. Later in training, the curve of WDQN(Human) coincides with that of DQN(Human), and the two converge to the same point in the end. The faster convergence of WDQN(Human) does not bring a higher success rate because the dialog policy still has no access to precise intermediate reward signals for the ongoing dialogue turns.

Dialog Agent          Success Rate    Average Turn
WDQN_keep(Human)      0.741           9.572
WDQN_keep(GAN-AE)     0.879           7.559
WDQN(Human)           0.906           6.790
WDQN(GAN-AE)          0.911           6.649
WDQN(GAN-VAE)         0.937           6.130
DQN(Human)            0.870           7.480
DQN(GAN-AE)           0.953           6.150
DQN(GAN-VAE)          0.985           5.520
Table 1: WDQN_keep means the dialog policy has access to the expert state-action pairs during the whole training stage; WDQN is the agent described in Section 4.3, where we gradually remove the expert dialogues from the expert buffer. Performance is averaged over multiple runs of each method.

Table 1 reports the final performance of the different dialog agents at test time. All agents are trained for the same number of frames, and we save the model with the best performance during training. One interesting finding is that DQN(GAN-VAE) outperforms WDQN(GAN-VAE), while WDQN(Human) beats DQN(Human). The warm-up stage in WDQN(GAN-VAE) improves training speed but has the side effect of a lower final success rate. A possible reason is that the expert dialogs provide a strong update signal at the beginning of training but also limit the exploration ability of the dialog agent. To test this explanation, we designed two more WDQN agents, WDQN_keep(Human) and WDQN_keep(GAN-AE), as shown in Table 1. The expert dialogues in these two agents are kept throughout training instead of being removed gradually. With the human-designed reward function, there is a large gap in success rate between WDQN_keep(Human) and WDQN(Human) (0.741 versus 0.906). The difference between WDQN_keep(GAN-AE) and WDQN(GAN-AE) is much smaller (0.879 versus 0.911) because the pre-trained reward function provides more precise and consistent update signals, which were explored and uncovered during the adversarial training step.

5.2 Results with PPO

Figure 3: The learning process with different reward functions and PPO agents

In adversarial training methods, the reward functions are updated on the fly and only policy-gradient-based reinforcement learning algorithms are applicable. To compare a pretrained reward function with reward functions updated in real time, we used PPO to train the dialog agent with different reward functions. Depending on how the discriminator in GAIL is designed, we have two PPO variants. For the first variant, we first trained a dialogue policy with imitation learning to generate negative samples, which were then used to pretrain the discriminator. The second variant reuses the pretrained reward function from our proposed method PPO(GAN-VAE) but keeps updating it during training. The training performance is shown in Figure 3. Note that the dialog agents shown in Figure 3 are not pretrained; only the corresponding reward models have been tuned in advance. PPO(GAN-VAE) manages to increase the success rate through interaction with the user simulator. For the two GAIL variants, whose reward functions are updated in real time, the success rate increases gradually in the early frames, then slows down in the following interactions and gets stuck in a local optimum. The learning curve of the human-designed reward function keeps growing, albeit slowly.

5.3 Transfer learning with pretrained reward function

Figure 4: The learning process with different reward functions and PPO agents

When defining the action space, we keep the most frequent actions from the MultiWOZ dataset and use a one-hot embedding to represent them. As shown in Figure 1, the action representation is concatenated to the state representation to denote a specific state-action pair. However, this formulation of the action space ignores the relations between different actions. For example, Restaurant_Inform_Price and Restaurant_Request_People should be close within the same conversation since they occur in the same domain. Conversely, actions from different domains are also connected: the slot types Inform_Price and Request_People can occur in the Hotel domain as well, corresponding to the actions Hotel_Inform_Price and Hotel_Request_People. We are curious whether the pre-trained reward function can transfer knowledge learned in several domains to a new, previously unseen domain. To test this hypothesis, we first reformulate the action representation as a concatenation of three segments: Onehot(Domain), Onehot(Diact), Onehot(Slot). In this way, actions containing similar information are linked through the corresponding segments of the action representation. Following this formulation, we retrain our reward function on several domains and incorporate it into the training process of a dialogue agent in a new domain. In our work, the existing domains are Restaurant, Bus, Attraction and Train, and the testing domain is Hotel, since Hotel has the most slot types and some of them are unique, such as Internet, Parking and Stars. As shown in Figure 4, the agent DQN(GAN-VAE + NoHotel) still benefits from the reward function trained on different domains and manages to outperform DQN(Human). Two further agents serve as reference points: one whose reward function is trained on all domains with the action represented by a single one-hot embedding, and a variant of it that uses the segmented action representation instead.
As expected, the reward function trained on all domains should perform better than the one trained only on other domains.
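The segmented action representation can be illustrated with a short sketch. The vocabularies below are toy examples and `encode_action` is a hypothetical helper; the sketch handles single actions of the form Domain_Diact_Slot only and leaves composite actions aside.

```python
def one_hot(value, vocabulary):
    return [1 if value == item else 0 for item in vocabulary]


def encode_action(action, domains, dialog_acts, slots):
    """Concatenate Onehot(Domain), Onehot(Diact) and Onehot(Slot)."""
    domain, dialog_act, slot = action.split("_")
    return one_hot(domain, domains) + one_hot(dialog_act, dialog_acts) + one_hot(slot, slots)


# Toy vocabularies for illustration only.
DOMAINS = ["Restaurant", "Hotel"]
DIALOG_ACTS = ["Inform", "Request"]
SLOTS = ["Price", "People", "Internet"]

restaurant = encode_action("Restaurant_Inform_Price", DOMAINS, DIALOG_ACTS, SLOTS)
hotel = encode_action("Hotel_Inform_Price", DOMAINS, DIALOG_ACTS, SLOTS)
# Number of active positions the two encodings have in common.
shared = sum(a & b for a, b in zip(restaurant, hotel))
```

The two actions differ only in the domain segment and share the dialog-act and slot segments, which is what lets a reward function trained without the Hotel domain still recognize part of a Hotel action. Under a single one-hot embedding, the same two actions would have no positions in common.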

6 Conclusion

In this work, we propose a guided dialog policy training method that does not use adversarial training in the loop. We first train the reward model with an auxiliary generator and then incorporate this trained reward model into a common reinforcement learning method to train a high-quality dialog agent. Through several experiments, we show that the proposed method achieves remarkable task success and has the potential to transfer knowledge from existing domains to a new domain.