Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning

by Baolin Peng, et al.

This paper presents a new method, adversarial advantage actor-critic (Adversarial A2C), which significantly improves the efficiency of dialogue policy learning in task-completion dialogue systems. Inspired by generative adversarial networks (GAN), we train a discriminator to differentiate responses/actions generated by dialogue agents from responses/actions by experts. Then, we incorporate the discriminator as another critic into the advantage actor-critic (A2C) framework, to encourage the dialogue agent to explore state-action regions where the agent takes actions similar to those of the experts. Experimental results in a movie-ticket booking domain show that the proposed Adversarial A2C can accelerate policy exploration efficiently.


1 Introduction

There has been growing interest in exploiting reinforcement learning (RL) for policy learning in task-oriented dialogue systems  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. One of the biggest challenges in these approaches is the reward sparsity issue. Dialogue policy learning for complex tasks, such as movie-ticket booking and travel planning, requires exploration in a large state-action space, and it often takes many conversation turns between the user and the agent to fulfill a task, leading to a long trajectory. Thus, the reward signals (usually provided by users at the end of a conversation) are often delayed and sparse.

To deal with reward sparsity, different approaches have been proposed recently, with promising empirical results. One approach is to leverage prior knowledge learned from expert-generated (or human-human) dialogue. For example, instead of learning a dialogue policy from scratch, we construct an initial policy learned from human-human dialogues, via imitation learning or hand-crafted rules. Prior work showed that a pre-trained supervised policy or a weak rule-based policy can significantly improve the efficiency of exploration [4, 11]. Another approach is to introduce heuristics, often in the form of intrinsic rewards, to guide the exploration [12, 13, 14, 15]. While the extrinsic reward (e.g., feedback provided by users at the end of a conversation) could be sparse, it is possible to get an intrinsic reward after each action in order to guide the agent to explore the region more effectively. For example, VIME maximizes information gain about the agent’s belief of environment dynamics [14]. It adds an intrinsic reward bonus to the reward function, which quantifies the agent’s surprise to encourage the agent to explore the regions that are relatively unexplored. BBQN encourages the agent to explore those state-action regions where the agent is relatively uncertain in action selection [11]. UNREAL converts the training signals from three auxiliary tasks into intrinsic rewards, which significantly improves the learning speed and the robustness of the agent [15].

In this paper, we present a new method that combines the strengths of the two approaches mentioned above. Similar to the first approach, we also leverage expert-generated dialogues as prior knowledge. However, instead of constructing an initial dialogue policy using prior knowledge, we, inspired by generative adversarial networks (GAN) [16], train a discriminator to differentiate the responses (or actions) generated by dialogue agents from those by human experts. Then, we use the output of the discriminator as an intrinsic reward to encourage the dialogue agent to explore state-action regions in which the agent takes actions similar to what human experts do. Specifically, we incorporate the discriminator as another critic into the advantage actor-critic (A2C) framework, resulting in a new model, called adversarial advantage actor-critic (Adversarial A2C). The modeling assumption behind our method is that the expert policies (embedded in the expert-generated dialogues) are reasonably good, so agent-selected actions that are more similar to expert-selected ones lead more often to successful dialogues with positive rewards. In short, we remedy the reward sparsity problem on two fronts, by leveraging human-human dialogues as prior knowledge and by introducing intrinsic rewards. Experiments in a movie-ticket booking domain show that the proposed Adversarial A2C model can significantly improve dialogue policy learning in terms of both effectiveness and efficiency.

2 Methodology

Figure 1: Illustration of a task-completion dialogue system.

Figure 1 illustrates a typical task-completion dialogue system that contains three main components: language understanding (LU) that converts natural language to system-readable semantic frames, natural language generation (NLG) that converts system actions to natural language, and a dialogue manager (DM). The dialogue manager controls state tracking and policy learning, where dialogue policy learning can be regarded as a sequential decision process. The system will learn to select the best response action at each step, by maximizing the long-term objective associated with a reward function.

This paper focuses on dialogue policy learning (the bottom-right part of Figure 1), where the input of the policy learner is the dialogue state representation that consists of the latest user action (e.g., request_moviename(genre=action, date=today)), the last agent action (e.g., request_location), the dialogue history, and available database results. The learned dialogue policy then helps the agent decide what action to take in each turn of the conversation, in order to maximize the future cumulative reward.
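As a rough illustration of the state representation described above, the sketch below concatenates one-hot encodings of the latest user act and the last agent act with a turn counter and a database match count. The act inventories, the normalisation constants, the use of a turn counter as a stand-in for the dialogue history, and the function names are all assumptions made for this example, not the feature set used in the paper.

```python
from typing import List

# Hypothetical act inventories; a real system would use its own act and slot sets.
USER_ACTS = ["request_moviename", "inform", "confirm", "thanks"]
AGENT_ACTS = ["request_location", "request_date", "inform", "book"]


def one_hot(item: str, vocabulary: List[str]) -> List[float]:
    """Return a one-hot vector marking the position of item in vocabulary."""
    return [1.0 if entry == item else 0.0 for entry in vocabulary]


def encode_state(user_act: str, agent_act: str, turn: int, db_matches: int,
                 max_turns: int = 40, max_matches: int = 100) -> List[float]:
    """Concatenate the pieces of the dialogue state into one flat feature vector."""
    features = one_hot(user_act, USER_ACTS) + one_hot(agent_act, AGENT_ACTS)
    features.append(min(turn, max_turns) / max_turns)            # dialogue progress
    features.append(min(db_matches, max_matches) / max_matches)  # database results
    return features


state = encode_state("request_moviename", "request_location", turn=3, db_matches=12)
```

The resulting vector is what a policy network would take as input; richer encodings (per-slot indicators, a window of past turns) follow the same pattern.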

As mentioned above, dialogue policy optimization can be formulated as a sequential decision problem to maximize the long-term objective associated with a reward function. The advantage actor-critic (A2C) method has achieved superior performance on solving sequential decision problems [17, 18, 19]. Su et al. applied the actor-critic model to dialogue policy optimization and demonstrated better convergence than other methods such as deep Q-networks [20]. Similarly, we employ an actor-critic approach to learn dialogue policy in our model. In addition, inspired by GAN [16] (using a discriminator to guide the training of generative models), we form a minimax game between a generator (an actor that selects actions in our scenario) and a discriminator, which judges whether an action is performed by the expert or the actor. The discriminator can be regarded as another critic and serves as a heuristic intrinsic reward function to guide the actor towards expert-like regions. Another related topic is inverse reinforcement learning [21], which aims to recover the reward function from expert demonstrations, i.e., samples of the trajectories executed by experts [22]. Ho and Ermon also drew a connection between inverse reinforcement learning and generative adversarial networks to learn the reward function in the GAN framework [23]. Compared to their work, which focused on learning the extrinsic reward, in this paper we use the intrinsic reward to speed up the training.

2.1 Advantage Actor-Critic for Dialogue Policy Learning

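The training objective of policy-based approaches is to find a policy $\pi_\theta$ that maximizes the expected reward $J(\theta)$ (equivalently, minimizes the loss $-J(\theta)$) over all possible dialogue trajectories. The expected reward is defined as $J(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t} r_t\right]$ over a dialogue with the length $T$, where $r_t$ is the reward at time stamp $t$, and $\gamma$ is the discount factor. The policy is a parametrized probabilistic mapping function between the state space and the action space:

$$\pi_\theta(a \mid s) = P(a \mid s; \theta) \tag{1}$$

where $\theta$ represents the parameters learned by policy gradient algorithms [24]. Given the objective function, the gradients of the parameters are computed as

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right] \tag{2}$$

where $Q^{\pi_\theta}(s, a)$ is the long-term reward value. However, the gradients usually have high variance, which makes the learning task more challenging. A baseline function $b(s)$ is usually employed to reduce the variance, while keeping the expected gradient unchanged [25]. Here we can simply choose the state value function as the baseline, $b(s) = V(s)$. With this strategy, we can rewrite (2) using the advantage function $A(s, a)$:

$$A(s, a) = Q^{\pi_\theta}(s, a) - V(s) \tag{3}$$

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right] \tag{4}$$

However, in this setting, two functions, $Q^{\pi_\theta}(s, a)$ and $V(s)$, and their parameters need to be learned. In order to reduce the number of required parameters and improve stability, the temporal difference (TD) error is employed as an unbiased estimate of the advantage function,

$$\delta = r + \gamma V(s') - V(s) \tag{5}$$

where $r$ is the immediate reward and $s'$ is the next state. In this way, the policy gradient with the TD error can be computed as

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \delta\right] \tag{6}$$

The policy network is termed the actor, which yields a dialogue system action, and the advantage function is the critic, indicating whether executing an action in a given state is “good” or “bad”. The classic A2C architecture is shown in the bottom part of Figure 2, without the discriminator.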
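The TD-error update described above can be sketched for a small tabular problem as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the policy is a per-state softmax over logits instead of a neural network, and the names `a2c_step`, `theta`, `values`, the learning rates and the discount factor are all hypothetical choices.

```python
import math
from typing import Dict, List, Tuple

GAMMA = 0.9  # discount factor (illustrative value, not from the paper)

Transition = Tuple[int, int, float, int, bool]  # (s, a, r, s_next, done)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_step(theta: Dict[int, List[float]], values: Dict[int, float],
             transition: Transition, lr_actor: float = 0.1,
             lr_critic: float = 0.1) -> float:
    """One actor-critic update from a single transition; returns the TD error."""
    s, a, r, s_next, done = transition
    target = r + (0.0 if done else GAMMA * values[s_next])
    td_error = target - values[s]          # delta = r + gamma * V(s') - V(s)
    values[s] += lr_critic * td_error      # critic: move V(s) towards the target
    probs = softmax(theta[s])
    for k in range(len(probs)):
        # For a softmax policy, d log pi(a|s) / d logit_k = 1[k == a] - pi(k|s).
        grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
        theta[s][k] += lr_actor * td_error * grad_log_pi   # actor: ascend the gradient
    return td_error


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
values = {0: 0.0, 1: 0.0}
delta = a2c_step(theta, values, (0, 1, 1.0, 1, True))
```

A positive TD error raises the probability of the action that was taken and a negative one lowers it, which is the sense in which the critic labels an action "good" or "bad".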

2.2 Adversarial Model for Dialogue Policy Learning

Figure 2: Illustration of the proposed adversarial advantage actor-critic for dialogue policy learning.

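GAN is a minimax game between a generator and a discriminator. In our scenario, the actor can be viewed as a generator $G$, which aims to generate actions that can purposefully confuse a discriminator $D$. The discriminator is expected to identify a state-action pair $(s, a)$ as either an expert demonstration or a simulation experience. When $D$ cannot distinguish actions generated by the actor from those of the experts, we believe that $G$ has been improved from the previous state. Moreover, $D$ can be viewed as a reward function extracted from the experts’ trajectories. Figure 2 shows the discriminator training procedure using adversarial learning.

The training objective is to find a saddle point of

$$\min_{G} \max_{D}\; \mathbb{E}_{(s,a) \sim P_{sim}}\left[\log\left(1 - D(s, a)\right)\right] + \mathbb{E}_{(s,a) \sim P_{demo}}\left[\log D(s, a)\right] \tag{7}$$

More specifically, let $\theta_D$ denote the parameters of the discriminator $D$. The training objective of $D$ is simply to maximize the probability of correctly classifying each state-action pair:

$$\max_{\theta_D}\; \mathbb{E}_{(s,a) \sim P_{sim}}\left[\log\left(1 - D(s, a)\right)\right] + \mathbb{E}_{(s,a) \sim P_{demo}}\left[\log D(s, a)\right] \tag{8}$$

where $P_{sim}$ and $P_{demo}$ represent simulation experience and expert demonstration, respectively, and $D(s, a)$ is the estimated probability that the pair $(s, a)$ comes from an expert. Thus, the actor can be improved using actor-critic, with $D(s, a)$ as the reward function. The updated gradients can then be rewritten as:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A_D(s, a)\right] \tag{9}$$

where $A_D(s, a)$ is the advantage function under the reward $D(s, a)$. Similarly, we use the TD error as an unbiased estimate of the advantage function:

$$\delta_D = D(s, a) + \gamma V_D(s') - V_D(s) \tag{10}$$

where $V_D$ is the value function associated with the discriminator reward.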
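To make the role of the discriminator concrete, the sketch below trains a tabular binary classifier on expert versus simulated state-action pairs and then uses its output as the reward in a TD error. It assumes the convention that the discriminator outputs the probability that a pair comes from an expert; the class name, the one-logit-per-pair table (in place of the neural network used in the paper), the learning rate and the discount factor are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[int, int]  # (state id, action id); a tabular stand-in for real features


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class TabularDiscriminator:
    """D(s, a) = sigmoid(logit[s, a]), read as the probability of expert origin."""

    def __init__(self) -> None:
        self.logits: Dict[Pair, float] = defaultdict(float)

    def prob_expert(self, pair: Pair) -> float:
        return sigmoid(self.logits[pair])

    def update(self, expert_pairs: Iterable[Pair],
               simulated_pairs: Iterable[Pair], lr: float = 0.5) -> None:
        """One gradient-ascent step on log D(expert) + log(1 - D(simulated))."""
        for pair in expert_pairs:        # d/dz log sigmoid(z) = 1 - sigmoid(z)
            self.logits[pair] += lr * (1.0 - self.prob_expert(pair))
        for pair in simulated_pairs:     # d/dz log(1 - sigmoid(z)) = -sigmoid(z)
            self.logits[pair] -= lr * self.prob_expert(pair)


def discriminator_td_error(disc: TabularDiscriminator, v_d: List[float], s: int,
                           a: int, s_next: int, done: bool,
                           gamma: float = 0.9) -> float:
    """TD error in which the discriminator output replaces the environment reward."""
    reward = disc.prob_expert((s, a))
    target = reward + (0.0 if done else gamma * v_d[s_next])
    return target - v_d[s]


disc = TabularDiscriminator()
for _ in range(50):
    disc.update(expert_pairs=[(0, 1)], simulated_pairs=[(0, 0)])
delta_d = discriminator_td_error(disc, [0.0, 0.0], s=0, a=1, s_next=1, done=True)
```

Because the reward is available after every action, the actor receives a learning signal even in turns where the environment reward is zero.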


2.3 Adversarial Advantage Actor-Critic

Furthermore, training dialogue policy with a stand-alone adversarial model can be impractical, due to the high dimensionality of its state-and-action space. To address this issue, we propose the adversarial advantage actor-critic (Adversarial A2C) method as depicted in Figure 2, which combines A2C with a reward function learned from an adversarial model that serves as another additional critic for the actor . There are several ways to combine two critics, such as linear combination of two reward functions or alternately optimizing with each reward function. In our experiments, we use alternating optimization. Algorithm 1 outlines the full procedure of training the Adversarial A2C model. The goal is to encourage the actor to select better actions guided by a discriminator, in order to improve the efficiency and effectiveness of the exploration.

1:  Input: Expert demonstrations $P_{demo}$; initialize actor $\pi_\theta$, discriminator $D$, and two value functions $V$, $V_D$
2:  for $i = 1 : N$ do
3:     Restart the dialogue simulator, get state representation $s$, initialize transition tuple buffer $B$ = []
4:     while $s$ is not a terminal state do
5:         Perform the action $a$ according to the actor $\pi_\theta(a \mid s)$
6:         Receive the reward $r$ and switch to a new state $s'$
7:         Store $(s, a, r, s')$ to the transition tuple buffer $B$
8:         Set $s \leftarrow s'$
9:     end while
10:     Train the actor $\pi_\theta$ with gradients (6)
11:     Train value function $V$ by minimizing the TD error (5)
12:     Sample state-action pairs from the expert demonstrations $P_{demo}$
13:     Train the actor $\pi_\theta$ with gradients (9)
14:     Update the reward with $D(s, a)$ and train value function $V_D$ by minimizing the TD error (10)
15:     Update the discriminator parameters $\theta_D$ (Section 2.2)
16:  end for
Algorithm 1: Adversarial Advantage Actor-Critic Model
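The control flow of Algorithm 1 can be sketched on a toy problem as below. Everything specific here is an assumption made for illustration: the four-state chain simulator, the tabular softmax actor, the tabular value functions and discriminator, and all hyperparameters. The sketch only shows how the two critics alternate within one episode; it says nothing about the behaviour of the actual dialogue system.

```python
import math
import random

N_STATES, N_ACTIONS, GOAL, MAX_TURNS, GAMMA = 4, 2, 3, 8, 0.9


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def step(state, action):
    """Toy simulator: action 1 advances, action 0 stalls; reward only at the goal."""
    next_state = min(state + action, GOAL)
    done = next_state == GOAL
    return (1.0 if done else 0.0), next_state, done


def actor_critic_update(theta, value, batch, reward_fn, lr=0.1):
    """Update the actor and one value function from the stored transitions."""
    for s, a, r, s_next, done in batch:
        reward = reward_fn(s, a, r)
        delta = reward + (0.0 if done else GAMMA * value[s_next]) - value[s]
        value[s] += lr * delta
        probs = softmax(theta[s])
        for k in range(N_ACTIONS):
            theta[s][k] += lr * delta * ((1.0 if k == a else 0.0) - probs[k])


def train(expert_pairs, episodes=300, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    v_env, v_disc = [0.0] * N_STATES, [0.0] * N_STATES
    d_logit = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done, buffer = 0, False, []
        while not done and len(buffer) < MAX_TURNS:
            action = rng.choices(range(N_ACTIONS), weights=softmax(theta[state]))[0]
            reward, next_state, done = step(state, action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state
        # Critic 1: extrinsic reward returned by the simulator.
        actor_critic_update(theta, v_env, buffer, lambda s, a, r: r)
        # Critic 2: discriminator output used as the reward.
        actor_critic_update(theta, v_disc, buffer,
                            lambda s, a, r: sigmoid(d_logit[s][a]))
        # Discriminator: push expert pairs towards 1 and simulated pairs towards 0.
        for s, a in rng.sample(expert_pairs, k=min(len(buffer), len(expert_pairs))):
            d_logit[s][a] += 0.1 * (1.0 - sigmoid(d_logit[s][a]))
        for s, a, *_ in buffer:
            d_logit[s][a] -= 0.1 * sigmoid(d_logit[s][a])
    return theta, d_logit


theta, d_logit = train(expert_pairs=[(0, 1), (1, 1), (2, 1)])
```

Here the two reward functions are applied one after the other within each episode, which corresponds to the alternating optimization mentioned in the text; a linear combination of the two rewards would instead call `actor_critic_update` once with a mixed `reward_fn`.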

3 Experiments

To verify the performance of the proposed model, we evaluated it in a task-completion dialogue system for movie-ticket booking. In this system, the agent will gather information from users through conversations and eventually book the movie tickets for them. The environment then judges a binary outcome (success or failure) at the end of each conversation, based on: 1) whether a movie ticket is booked, and 2) whether the booked ticket satisfies the constraints requested by the user.
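The binary success criterion described above can be written as a small check. This is a sketch under assumptions: the function name, the dictionary representation of a booked ticket, and the slot values in the usage lines are hypothetical.

```python
from typing import Dict, Optional


def dialogue_succeeded(booked_ticket: Optional[Dict[str, str]],
                       user_constraints: Dict[str, str]) -> bool:
    """Return True only if a ticket was booked and it meets every user constraint."""
    if booked_ticket is None:   # criterion 1: a movie ticket must be booked
        return False
    # Criterion 2: the booked ticket satisfies all constraints requested by the user.
    return all(booked_ticket.get(slot) == value
               for slot, value in user_constraints.items())


constraints = {"genre": "action", "date": "today"}
ok = dialogue_succeeded({"genre": "action", "date": "today", "theater": "A"}, constraints)
wrong_date = dialogue_succeeded({"genre": "action", "date": "tomorrow"}, constraints)
no_booking = dialogue_succeeded(None, constraints)
```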

3.1 Experimental Setup

The dataset used in our experiment is raw conversational data collected via Amazon Mechanical Turk, annotated by domain experts [5]. This single-domain movie-ticket booking dataset contains 11 dialogue acts and 29 slots, including informable slots (users can use these to narrow down the search), and requestable slots (where users can ask the agent for more information). There are in total 280 labeled dialogues, with an average length of 11 turns.

In order to perform end-to-end training for the dialogue system, a user simulator is required to interact with the system in a natural way. We adopted a publicly available, user-agenda-based simulator in our experiments [26]. In a task-completion dialogue setting, the user simulator first generates a user goal, and the dialogue agent tries to help the user accomplish that goal in the course of the conversation, without explicitly knowing the user goal. A user goal normally consists of two parts: inform_slots representing slot-value pairs that serve as constraints from the user, and request_slots representing slots whose value the user has no information about, but wants to get information from the agent through the conversation. In our experiment, the user goals were generated from labeled conversational data.
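A user goal with its two parts might be represented as follows. The class, its method and the example slot values are hypothetical and chosen only to mirror the description above; the publicly available simulator may store goals differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserGoal:
    # Slot-value pairs that act as constraints from the user.
    inform_slots: Dict[str, str] = field(default_factory=dict)
    # Slots whose values the user wants to learn from the agent.
    request_slots: List[str] = field(default_factory=list)

    def remaining_requests(self, answered: Dict[str, str]) -> List[str]:
        """Requested slots that the agent has not yet provided a value for."""
        return [slot for slot in self.request_slots if slot not in answered]


goal = UserGoal(inform_slots={"genre": "action", "date": "today"},
                request_slots=["moviename", "theater"])
pending = goal.remaining_requests({"moviename": "some title"})
```

The agent never sees this object directly; it has to infer the constraints and requests from the simulated user's turns.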

3.2 Implementation

In Figure 2, the expert demonstrations can be collected from either humans or a pre-trained agent. In our experiment, we collected 50 successful dialogues from a pre-trained agent. The discriminator is a binary classifier implemented as a single-layer neural network with 80 hidden units. For the actor, we use a single-layer neural network with a hidden size of 80, pre-trained with rule-based examples in order to give an acceptable initialization. During the Adversarial A2C model training, the two critics (the critic and the discriminator in Figure 2) are applied alternately, where their value functions are single-layer neural networks with 80 hidden units. All parameters are optimized with RMSProp. During training, the model is updated at the end of each dialogue episode.
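For readers unfamiliar with the optimizer, a single RMSProp step on a flat parameter list looks roughly like this. The learning rate, decay and epsilon are common default-style values assumed for the example; the paper does not report its settings, and the function name is hypothetical.

```python
import math
from typing import List


def rmsprop_step(params: List[float], grads: List[float], cache: List[float],
                 lr: float = 0.001, decay: float = 0.9,
                 eps: float = 1e-8) -> List[float]:
    """Apply one in-place RMSProp update and return the updated parameters."""
    for i, g in enumerate(grads):
        cache[i] = decay * cache[i] + (1.0 - decay) * g * g  # running mean of g^2
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)    # scaled descent step
    return params


params, cache = [1.0], [0.0]
rmsprop_step(params, grads=[2.0], cache=cache)
```

With per-episode updates, the gradients passed in would be accumulated over all transitions of one dialogue before this step is applied.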

3.3 Evaluation Results

In the movie-ticket booking task, we benchmark the proposed Adversarial A2C model against three baseline models on three metrics: success rate, average rewards, and the average number of turns per dialogue session.

  • Rule Agent is a handcrafted rule-based policy that informs and requests a hand-picked subset of necessary slots.

  • A2C Agent is trained with a pre-defined reward function and a standard advantage actor-critic algorithm.

  • BBQN-Map Agent is the best agent among a set of BBQN variants (including BBQN-VIME) and DQN variants, which has demonstrated great efficiency for policy exploration in task-completion dialogue systems [11].

Figure 3: Learning curves of dialogue policies.

Agent             Success Rate (%)   Reward   Turns
Rule              41.34              0.26     16.00
A2C               81.24              5.08     15.43
BBQN-Map          81.56              5.00     18.75
Adversarial A2C   87.52              5.93     13.52

Table 1: Final agent performance on 5K simulated dialogues.

Figure 3 shows the learning curves of all the dialogue agents mentioned above, and Table 1 shows the evaluation performance of each agent, averaged over 5 runs. The learning curves in Figure 3 show that the Adversarial A2C agent learns much faster, with better exploration capability, and that its learning curve is more stable than those of the other agents. Table 1 suggests that the Adversarial A2C agent can yield better dialogue policies than other approaches, in terms of success rate, average reward, and average number of turns per dialogue.
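The three reported metrics can be computed from a list of evaluated dialogues as sketched below. The `Episode` record and the two example episodes are made up for illustration and are not data from the experiments.

```python
from typing import Dict, List, NamedTuple


class Episode(NamedTuple):
    success: bool   # binary outcome judged at the end of the dialogue
    reward: float   # cumulative reward of the dialogue
    turns: int      # number of turns in the dialogue


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Return success rate (in percent), average reward and average turns."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "success_rate": 100.0 * sum(e.success for e in episodes) / n,
        "avg_reward": sum(e.reward for e in episodes) / n,
        "avg_turns": sum(e.turns for e in episodes) / n,
    }


summary = summarize([Episode(True, 10.0, 12), Episode(False, -5.0, 20)])
```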

4 Conclusions

This paper presents an adversarial advantage actor-critic model that makes policy exploration in task-completion dialogue systems more efficient. The proposed model learns a discriminator from expert demonstrations and online experience, and then the learned discriminator serves as an additional critic to guide policy learning. Our experiments in a movie-ticket booking domain demonstrate the superiority and efficiency of the proposed model in policy learning, compared with state-of-the-art approaches. The promising results suggest several interesting future directions: 1) employing variance-reducing methods to stabilize the gradient calculation, in order to address the high variance issue in policy gradient estimation, 2) applying the model to more complicated dialogue tasks, such as composite task-completion dialogues [8], and 3) extending this work to other deep reinforcement learning benchmark tasks and other domains.