Integrating planning for task-completion dialogue policy learning

Training a task-completion dialogue agent with real users via reinforcement learning (RL) could be prohibitively expensive, because it requires many interactions with users. One alternative is to resort to a user simulator, but the discrepancy between simulated and real users makes the learned policy unreliable in practice. This paper addresses these challenges by integrating planning into dialogue policy learning based on the Dyna-Q framework, and provides a more sample-efficient approach to learning dialogue policies. The proposed agent consists of a planner, trained online with limited real user experience, that can generate large amounts of simulated experience to supplement the limited real user experience, and a policy model trained on these hybrid experiences. The effectiveness of our approach is validated on a movie-booking task in both a simulation setting and a human-in-the-loop setting.



1 Introduction

Learning policies for task-completion dialogue is often formulated as a reinforcement learning (RL) problem Young et al. (2013); Levin et al. (1997). However, applying RL to real-world dialogue systems can be challenging, due to the constraint that an RL learner needs an environment to operate in. In the dialogue setting, this requires a dialogue agent to interact with real users and adjust its policy in an online fashion, as illustrated in Figure 1(a). Unlike simulation-based games such as Atari games Mnih et al. (2015) and AlphaGo Silver et al. (2016a, 2017) where RL has made its greatest strides, task-completion dialogue systems may incur significant real-world cost in case of failure. Thus, except for very simple tasks Singh et al. (2002); Gašić et al. (2010, 2011); Pietquin et al. (2011); Li et al. (2016a); Su et al. (2016b), RL is too expensive to be applied to real users to train dialogue agents from scratch.

(a) Learning with real users
(b) Learning with user simulators
(c) Learning with real users via DDQ
Figure 1: Three strategies of learning task-completion dialogue policies via RL.

One strategy is to convert human-interacting dialogue to a simulation problem (similar to Atari games), by building a user simulator using human conversational data Schatzmann et al. (2007); Li et al. (2016b). In this way, the dialogue agent can learn its policy by interacting with the simulator instead of real users (Figure 1(b)). The simulator, in theory, does not incur any real-world cost and can provide unlimited simulated experience for reinforcement learning. The dialogue agent trained with such a user simulator can then be deployed to real users and further enhanced by only a small number of human interactions. Most recent studies in this area have adopted this strategy Su et al. (2016a); Lipton et al. (2016); Zhao and Eskenazi (2016); Williams et al. (2017); Dhingra et al. (2017); Li et al. (2017); Liu and Lane (2017); Peng et al. (2017b); Budzianowski et al. (2017); Peng et al. (2017a).

However, user simulators usually lack the conversational complexity of human interlocutors, and the trained agent is inevitably affected by biases in the design of the simulator. Dhingra et al. (2017) demonstrated a significant discrepancy in a simulator-trained dialogue agent when evaluated with simulators and with real users. Even more challenging is the fact that there is no universally accepted metric to evaluate a user simulator Pietquin and Hastie (2013). Thus, it remains controversial whether training task-completion dialogue agents via simulated users is a valid approach.

We propose a new strategy of learning dialogue policy by interacting with real users. Compared to previous works Singh et al. (2002); Li et al. (2016a); Su et al. (2016b); Papangelis (2012), our dialogue agent learns in a much more efficient way, using only a small number of real user interactions, which amounts to an affordable cost in many nontrivial dialogue tasks.

Our approach is based on the Dyna-Q framework Sutton (1990) where planning is integrated into policy learning for task-completion dialogue. Specifically, we incorporate a model of the environment, referred to as the world model, into the dialogue agent, which simulates the environment and generates simulated user experience. During the dialogue policy learning, real user experience plays two pivotal roles: first, it can be used to improve the world model and make it behave more like real users, via supervised learning; second, it can also be used to directly improve the dialogue policy via RL. The former is referred to as world model learning, and the latter direct reinforcement learning. Dialogue policy can be improved either using real experience directly (i.e., direct reinforcement learning) or via the world model indirectly (referred to as planning or indirect reinforcement learning). The interaction between world model learning, direct reinforcement learning and planning is illustrated in Figure 1(c), following the Dyna-Q framework Sutton (1990).

The original papers on Dyna-Q and most of its early extensions used tabular methods for both planning and learning Singh (1992); Peng and Williams (1993); Moore and Atkeson (1993); Kuvayev and Sutton (1996). This table-lookup representation limits its application to small problems only. Sutton et al. (2012) extended the Dyna architecture to linear function approximation, making it applicable to larger problems. In the dialogue setting, we are dealing with a much larger state-action space. Inspired by Mnih et al. (2015), we propose Deep Dyna-Q (DDQ) by combining Dyna-Q with deep learning approaches to representing the state-action space by neural networks (NN).

By employing the world model for planning, the DDQ method can be viewed as a model-based RL approach, which has drawn growing interest in the research community. However, most model-based RL methods Tamar et al. (2016); Silver et al. (2016b); Gu et al. (2016); Racanière et al. (2017) are developed for simulation-based, synthetic problems (e.g., games), but not for human-in-the-loop, real-world problems. To these ends, our main contributions in this work are two-fold:

  • We present Deep Dyna-Q, which to the best of our knowledge is the first deep RL framework that incorporates planning for task-completion dialogue policy learning.

  • We demonstrate that a task-completion dialogue agent can efficiently adapt its policy on the fly, by interacting with real users via RL. This results in a significant improvement in success rate on a nontrivial task.

2 Dialogue Policy Learning via Deep Dyna-Q (DDQ)

Our DDQ dialogue agent is illustrated in Figure 2, consisting of five modules: (1) an LSTM-based natural language understanding (NLU) module Hakkani-Tür et al. (2016) for identifying user intents and extracting associated slots; (2) a state tracker Mrkšić et al. (2016) for tracking the dialogue states; (3) a dialogue policy which selects the next action [2] based on the current state; (4) a model-based natural language generation (NLG) module for converting dialogue actions to natural language response Wen et al. (2015); and (5) a world model for generating simulated user actions and simulated rewards.

[2] In the dialogue scenario, actions are dialogue-acts, consisting of a single act and a (possibly empty) collection of (slot = value) pairs Schatzmann et al. (2007).

Figure 2: Illustration of the task-completion DDQ dialogue agent.
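The data flow through the agent-side modules in one dialogue turn can be sketched as below. This is only an illustrative skeleton: the class name `DialogueAgent`, its fields, and the toy lambdas are hypothetical stand-ins rather than the released implementation, and the world model is left out because it replaces the user during planning instead of taking part in the agent's own turn.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DialogueAgent:
    """Chains NLU, state tracking, policy and NLG for one turn (all stand-ins)."""
    nlu: Callable[[str], Dict]                   # utterance -> user dialogue-act
    update_state: Callable[[Dict, Dict], Dict]   # (state, user act) -> new state
    policy: Callable[[Dict], Dict]               # state -> agent dialogue-act
    nlg: Callable[[Dict], str]                   # agent dialogue-act -> utterance
    state: Dict = field(default_factory=dict)

    def respond(self, user_utterance: str) -> str:
        user_act = self.nlu(user_utterance)
        self.state = self.update_state(self.state, user_act)
        agent_act = self.policy(self.state)
        return self.nlg(agent_act)


# Toy stand-ins, only to show how the modules are wired together.
agent = DialogueAgent(
    nlu=lambda text: {"act": "request", "slot": "theater"},
    update_state=lambda state, act: {**state, "last_user_act": act},
    policy=lambda state: {"act": "request", "slot": "city"},
    nlg=lambda act: "Which city would you like?",
)
reply = agent.respond("Which theater will show batman?")
```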

As illustrated in Figure 1(c), starting with an initial dialogue policy and an initial world model (both trained with pre-collected human conversational data), the training of the DDQ agent consists of three processes: (1) direct reinforcement learning, where the agent interacts with a real user, collects real experience and improves the dialogue policy; (2) world model learning, where the world model is learned and refined using real experience; and (3) planning, where the agent improves the dialogue policy using simulated experience.

Although these three processes conceptually can occur simultaneously in the DDQ agent, we implement an iterative training procedure, as shown in Algorithm 1, where we specify the order in which they occur within each iteration. In what follows, we describe these processes in detail.

2.1 Direct Reinforcement Learning

In this process (lines 5-18 in Algorithm 1) we use the DQN method Mnih et al. (2015) to improve the dialogue policy based on real experience. We consider task-completion dialogue as a Markov Decision Process (MDP), where the agent interacts with a user in a sequence of actions to accomplish a user goal. In each step, the agent observes the dialogue state $s$, and chooses the action $a$ to execute, using an $\epsilon$-greedy policy that selects a random action with probability $\epsilon$ or otherwise follows the greedy policy $a = \arg\max_{a'} Q(s, a'; \theta_Q)$. Here $Q(s, a; \theta_Q)$ is the approximated value function, implemented as a Multi-Layer Perceptron (MLP) parameterized by $\theta_Q$. The agent then receives reward $r$ [3], observes next user response $a^u$, and updates the state to $s'$. Finally, we store the experience $(s, a, r, a^u, s')$ in the replay buffer $D^u$. The cycle continues until the dialogue terminates.

[3] In the dialogue scenario, reward is defined to measure the degree of success of a dialogue. In our experiment, for example, success corresponds to a reward of $80$, failure to a reward of $-40$, and the agent receives a reward of $-1$ at each turn so as to encourage shorter dialogues.
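The action-selection and storage steps of this cycle can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names `epsilon_greedy` and `store` are hypothetical, Q-values are passed in as a plain list, and the buffer capacity of 5000 follows the value given under Implementation Details.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Bounded replay buffer: the oldest experience is dropped once it is full.
real_buffer = deque(maxlen=5000)


def store(buffer, s, a, r, a_u, s_next):
    """Append one experience tuple (s, a, r, a^u, s') to the buffer."""
    buffer.append((s, a, r, a_u, s_next))


# Example: greedy choice over three actions, then store the transition.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
store(real_buffer, "s0", action, -1, "inform", "s1")
```

With `epsilon=0.0` the call always returns the index of the largest Q-value; raising `epsilon` trades some of that exploitation for random exploration.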

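We improve the value function $Q(\cdot)$ by adjusting $\theta_Q$ to minimize the mean-squared loss function, defined as follows:

$$\mathcal{L}(\theta_Q) = \mathbb{E}_{(s, a, r, s') \sim D^u}\left[\left(y - Q(s, a; \theta_Q)\right)^2\right], \qquad y = r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'}) \tag{1}$$

where $\gamma \in [0, 1]$ is a discount factor, and $Q'(\cdot)$ is the target value function that is only periodically updated (line 42 in Algorithm 1). By differentiating the loss function with respect to $\theta_Q$, we arrive at the following gradient (written, as is conventional for DQN, up to the constant factor $-2$, so that a descent step moves $\theta_Q$ along this direction):

$$\nabla_{\theta_Q} \mathcal{L}(\theta_Q) = \mathbb{E}_{(s, a, r, s') \sim D^u}\left[\left(r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'}) - Q(s, a; \theta_Q)\right) \nabla_{\theta_Q} Q(s, a; \theta_Q)\right] \tag{2}$$

As shown in lines 16-17 in Algorithm 1, in each iteration, we improve $Q(\cdot)$ using minibatch Deep Q-learning.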
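To make the loss concrete, the sketch below computes the bootstrapped targets, the per-sample temporal-difference errors, and the mean-squared loss for a toy minibatch. It assumes the standard DQN target (reward plus discounted maximum of the target network's Q-values); the function names are illustrative, and special handling of terminal transitions is omitted (the second toy row mimics one by using all-zero next-state values).

```python
def dqn_targets(rewards, next_q_target, gamma=0.95):
    """Target per transition: r + gamma * max over a' of Q'(s', a')."""
    return [r + gamma * max(q_row) for r, q_row in zip(rewards, next_q_target)]


def dqn_loss_and_td_errors(q_sa, rewards, next_q_target, gamma=0.95):
    """Return the mean-squared loss and the TD errors (target minus Q(s, a))."""
    targets = dqn_targets(rewards, next_q_target, gamma)
    td_errors = [y - q for y, q in zip(targets, q_sa)]
    loss = sum(e * e for e in td_errors) / len(td_errors)
    return loss, td_errors


# Toy minibatch: two transitions, three actions in each next state.
q_sa = [1.0, 0.5]                      # Q(s, a) of the actions actually taken
rewards = [-1.0, 80.0]                 # per-turn penalty, then a success reward
next_q = [[0.0, 2.0, 1.0],             # target-network values for s'
          [0.0, 0.0, 0.0]]
loss, td_errors = dqn_loss_and_td_errors(q_sa, rewards, next_q)
```

Each TD error is exactly the bracketed factor that multiplies the gradient of $Q$ in the update, so a large error on a transition produces a proportionally large parameter step.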

Require: $N$, $\epsilon$, $K$, $L$, $C$, $Z$
Ensure: $Q(s, a; \theta_Q)$, $M(s, a; \theta_M)$
1:  initialize $Q(s, a; \theta_Q)$ and $M(s, a; \theta_M)$ via pre-training on human conversational data
2:  initialize $Q'(s, a; \theta_{Q'})$ with $\theta_{Q'} = \theta_Q$
3:  initialize real experience replay buffer $D^u$ using Replay Buffer Spiking (RBS), and simulated experience replay buffer $D^s$ as empty
4:  for $n = 1 : N$ do
5:     # Direct Reinforcement Learning starts
6:     user starts a dialogue with user action $a^u$
7:     generate an initial dialogue state $s$
8:     while $s$ is not a terminal state do
9:         with probability $\epsilon$ select a random action $a$
10:         otherwise select $a = \arg\max_{a'} Q(s, a'; \theta_Q)$
11:         execute $a$, and observe user response $a^u$ and reward $r$
12:         update dialogue state to $s'$
13:         store $(s, a, r, a^u, s')$ to $D^u$
14:         $s = s'$
15:     end while
16:     sample random minibatches of $(s, a, r, s')$ from $D^u$
17:     update $\theta_Q$ via $Z$-step minibatch Q-learning according to Equation (2)
18:     # Direct Reinforcement Learning ends
19:     # World Model Learning starts
20:     sample random minibatches of training samples $(s, a, r, a^u, s')$ from $D^u$
21:     update $\theta_M$ via $Z$-step minibatch SGD of multi-task learning
22:     # World Model Learning ends
23:     # Planning starts
24:     for $k = 1 : K$ do
25:         $t$ = FALSE, $l = 0$
26:         sample a user goal $G$
27:         sample user action $a^u$ from $G$
28:         generate an initial dialogue state $s$
29:         while $t$ is FALSE $\wedge$ $l \leq L$ do
30:            with probability $\epsilon$ select a random action $a$
31:            otherwise select $a = \arg\max_{a'} Q(s, a'; \theta_Q)$
32:            execute $a$
33:            world model responds with $a^u$, $r$ and $t$
34:            update dialogue state to $s'$
35:            store $(s, a, r, s')$ to $D^s$
36:            $l = l + 1$, $s = s'$
37:         end while
38:         sample random minibatches of $(s, a, r, s')$ from $D^s$
39:         update $\theta_Q$ via $Z$-step minibatch Q-learning according to Equation (2)
40:     end for
41:     # Planning ends
42:     every $C$ steps reset $\theta_{Q'} = \theta_Q$
43:  end for
Algorithm 1: Deep Dyna-Q for Dialogue Policy Learning
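The control flow of one outer iteration of Algorithm 1 can be condensed into the skeleton below. It is a sketch under simplifying assumptions: dialogue collection and the two update routines are passed in as callables (all names here are hypothetical), and the periodic reset of the target network is left out.

```python
def ddq_iteration(collect_real, collect_simulated, update_policy,
                  update_world_model, real_buffer, sim_buffer, num_planning_steps):
    """Run one outer iteration: direct RL, world model learning, then planning."""
    # Direct reinforcement learning: one real dialogue, then a policy update.
    real_buffer.extend(collect_real())
    update_policy(real_buffer)
    # World model learning: refine the model on the same real experience.
    update_world_model(real_buffer)
    # Planning: K simulated dialogues, each followed by a policy update.
    for _ in range(num_planning_steps):
        sim_buffer.extend(collect_simulated())
        update_policy(sim_buffer)


# Toy stand-ins that only log the order of the update calls.
log = []
real, sim = [], []
ddq_iteration(
    collect_real=lambda: [("s", "a", -1, "a_u", "s_next")],
    collect_simulated=lambda: [("s", "a", -1, "s_next")],
    update_policy=lambda buf: log.append("policy:real" if buf is real else "policy:sim"),
    update_world_model=lambda buf: log.append("world_model"),
    real_buffer=real, sim_buffer=sim, num_planning_steps=2,
)
```

Setting `num_planning_steps=0` reduces the loop to direct reinforcement learning plus world model learning, which is how a plain DQN agent differs from a DDQ agent in terms of policy updates.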

2.2 Planning

In the planning process (lines 23-41 in Algorithm 1), the world model is employed to generate simulated experience that can be used to improve dialogue policy. $K$ in line 24 is the number of planning steps that the agent performs per step of direct reinforcement learning. If the world model is able to accurately simulate the environment, a big $K$ can be used to speed up the policy learning. In DDQ, we use two replay buffers, $D^u$ for storing real experience and $D^s$ for simulated experience. Learning and planning are accomplished by the same DQN algorithm, operating on real experience in $D^u$ for learning and on simulated experience in $D^s$ for planning. Thus, here we only describe the way the simulated experience is generated.

Similar to Schatzmann et al. (2007), at the beginning of each dialogue, we uniformly draw a user goal $G = (C, R)$, where $C$ is a set of constraints and $R$ is a set of requests (line 26 in Algorithm 1). For movie-ticket booking dialogues, constraints are typically the name and the date of the movie, the number of tickets to buy, etc. Requests can contain these slots as well as the location of the theater, its start time, etc. Table 3 presents some sampled user goals and dialogues generated by simulated and real users, respectively. The first user action $a^u$ (line 27) can be either a request or an inform dialogue-act. A request, such as request(theater; moviename=batman), consists of a request slot and multiple ($\geq 1$) constraint slots, uniformly sampled from $R$ and $C$, respectively. An inform contains constraint slots only. The user action can also be converted to natural language via NLG, e.g., "which theater will show batman?"
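Drawing the first user action from a user goal might look as follows. This is a hedged sketch: the dictionary layout mirrors the user goals shown in Table 3, while the function name, the returned dialogue-act format, and the 50/50 split between request and inform acts are assumptions made for illustration.

```python
import random


def sample_first_user_action(goal, rng=random):
    """Draw a request or inform dialogue-act from a user goal."""
    constraints, requests = goal["constraint_slots"], goal["request_slots"]
    n = rng.randint(1, len(constraints))            # at least one constraint slot
    informed = dict(rng.sample(sorted(constraints.items()), n))
    if rng.random() < 0.5:                          # assumed 50/50 split
        return {"act": "request",
                "request_slot": rng.choice(sorted(requests)),
                "inform_slots": informed}
    return {"act": "inform", "request_slot": None, "inform_slots": informed}


# User goal taken from the real-user sample in Table 3.
goal = {
    "request_slots": {"ticket": "?", "theater": "?", "starttime": "?"},
    "constraint_slots": {"date": "this weekend", "numberofpeople": "1",
                         "moviename": "batman"},
}
first_action = sample_first_user_action(goal, random.Random(0))
```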

In each dialogue turn, the world model takes as input the current dialogue state $s$ and the last agent action $a$ (represented as a one-hot vector), and generates user response $a^u$, reward $r$, and a binary variable $t$, which indicates whether the dialogue terminates (line 33). The generation is accomplished using the world model $M(s, a; \theta_M)$, an MLP shown in Figure 3, as follows:

$$\begin{aligned} h &= \tanh(W_h (s, a) + b_h) \\ r &= W_r h + b_r \\ a^u &= \mathrm{softmax}(W_a h + b_a) \\ t &= \mathrm{sigmoid}(W_t h + b_t) \end{aligned}$$

where $(s, a)$ is the concatenation of $s$ and $a$, and $W$ and $b$ are parameter matrices and vectors, respectively.
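A forward pass through such a world model can be sketched in plain Python. The sketch assumes a single shared tanh layer feeding three heads: a linear reward head, a softmax head over user actions, and a sigmoid termination head. Layer sizes, the parameter dictionary keys, and the random initialisation are illustrative choices, not values from the paper.

```python
import math
import random


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)                                   # subtract max for stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def world_model_forward(params, s, a_onehot):
    """Map (s, a) to (reward, user-action distribution, termination probability)."""
    x = list(s) + list(a_onehot)                 # concatenation of s and a
    h = [math.tanh(v) for v in affine(params["W_h"], params["b_h"], x)]
    r = affine(params["W_r"], params["b_r"], h)[0]            # regression head
    a_u = softmax(affine(params["W_a"], params["b_a"], h))    # classification head
    t = 1.0 / (1.0 + math.exp(-affine(params["W_t"], params["b_t"], h)[0]))
    return r, a_u, t


def random_params(n_in, n_hidden, n_user_actions, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W_h": mat(n_hidden, n_in), "b_h": [0.0] * n_hidden,
            "W_r": mat(1, n_hidden), "b_r": [0.0],
            "W_a": mat(n_user_actions, n_hidden), "b_a": [0.0] * n_user_actions,
            "W_t": mat(1, n_hidden), "b_t": [0.0]}


params = random_params(n_in=5, n_hidden=8, n_user_actions=4, rng=random.Random(0))
r, a_u, t = world_model_forward(params, s=[0.1, 0.2, 0.3], a_onehot=[0, 1])
```

In planning, the user action would be sampled (or taken as the argmax) from `a_u`, and the dialogue would be ended when `t` exceeds a chosen threshold.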

Figure 3: The world model architecture.

2.3 World Model Learning

In this process (lines 19-22 in Algorithm 1), $M(s, a; \theta_M)$ is refined via minibatch SGD using real experience in the replay buffer $D^u$. As shown in Figure 3, $M(s, a; \theta_M)$ is a multi-task neural network Liu et al. (2015) that combines two classification tasks of simulating $a^u$ and $t$, respectively, and one regression task of simulating $r$. The lower layers are shared across all tasks, while the top layers are task-specific.
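One plausible per-sample training objective for such a multi-task network is sketched below. The text does not specify the individual losses or their weighting, so this example assumes squared error for the reward, cross-entropy for the user action, binary cross-entropy for the termination flag, and an unweighted sum of the three.

```python
import math


def multitask_loss(pred_r, pred_a_u, pred_t, true_r, true_a_u_index, true_t):
    """Sum of squared error (reward), cross-entropy (user action), and
    binary cross-entropy (termination) for one training sample."""
    eps = 1e-12                                   # guards against log(0)
    reward_loss = (pred_r - true_r) ** 2
    action_loss = -math.log(pred_a_u[true_a_u_index] + eps)
    term_loss = -(true_t * math.log(pred_t + eps)
                  + (1 - true_t) * math.log(1 - pred_t + eps))
    return reward_loss + action_loss + term_loss


# A perfect prediction versus a poor one for the same real transition.
good = multitask_loss(-1.0, [0.0, 1.0], 0.0, true_r=-1.0, true_a_u_index=1, true_t=0)
bad = multitask_loss(5.0, [0.9, 0.1], 0.9, true_r=-1.0, true_a_u_index=1, true_t=0)
```

Because the lower layers are shared, the gradient of this summed loss updates the shared representation with signal from all three tasks at once.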

3 Experiments and Results

We evaluate the DDQ method on a movie-ticket booking task in both simulation and human-in-the-loop settings.

3.1 Dataset

Raw conversational data in the movie-ticket booking scenario was collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 4, which consists of 11 dialogue acts and 16 slots. In total, the dataset contains 280 annotated dialogues, the average length of which is approximately 11 turns.

3.2 Dialogue Agents for Comparison

To benchmark the performance of DDQ, we have developed different versions of task-completion dialogue agents, using variations of Algorithm 1.

  • A DQN agent is learned by standard DQN, implemented with direct reinforcement learning only (lines 5-18 in Algorithm 1) in each epoch.

  • The DDQ($K$) agents are learned by DDQ of Algorithm 1, with an initial world model pre-trained on human conversational data, as described in Section 3.1. $K$ is the number of planning steps. We trained different versions of DDQ($K$) with different values of $K$.

  • The DDQ($K$, rand-init $\theta_M$) agents are learned by the DDQ method with a randomly initialized world model.

  • The DDQ($K$, fixed $\theta_M$) agents are learned by DDQ with an initial world model pre-trained on human conversational data, but the world model is not updated afterwards. That is, the world model learning part in Algorithm 1 (lines 19-22) is removed. The DDQ($K$, fixed $\theta_M$) agents are evaluated in the simulation setting only.

  • The DQN($K$) agents are learned by DQN, but with $K$ times more real experiences than the DQN agent. DQN($K$) is evaluated in the simulation setting only. Its performance can be viewed as the upper bound of its DDQ($K$) counterpart, assuming that the world model in DDQ($K$) perfectly matches real users.

Agent                           Epoch = 100                Epoch = 200                Epoch = 300
                                Success  Reward   Turns    Success  Reward   Turns    Success  Reward   Turns
DQN                             .4260     -3.84   31.93    .5308     10.78   22.72    .6480     27.66   22.21
DDQ(5)                          .6056     20.35   26.65    .7128     36.76   19.55    .7372     39.97   18.99
DDQ(5, rand-init $\theta_M$)    .5904     18.75   26.21    .6888     33.47   20.36    .7032     36.06   18.64
DDQ(5, fixed $\theta_M$)        .5540     14.54   25.89    .6660     29.72   22.39    .6860     33.58   19.49
DQN(5)                          .6560     29.38   21.76    .7344     41.09   16.07    .7576     43.97   15.88
DDQ(10)                         .6624     28.18   24.62    .7664     42.46   21.01    .7840     45.11   19.94
DDQ(10, rand-init $\theta_M$)   .6132     21.50   26.16    .6864     32.43   21.86    .7628     42.37   20.32
DDQ(10, fixed $\theta_M$)       .5884     18.41   26.41    .6196     24.17   22.36    .6412     26.70   22.49
DQN(10)                         .7944     48.61   15.43    .8296     54.00   13.09    .8356     54.89   12.77
Table 1: Results of different agents at training epoch = {100, 200, 300}. Each number is averaged over 5 runs, each run tested on 2000 dialogues. Excluding DQN(5) and DQN(10), which serve as the upper bounds, the difference in mean success rate between any two agents evaluated at the same epoch is statistically significant with $p < 0.01$, except for three pairs: DDQ(5, rand-init $\theta_M$) and DDQ(10, fixed $\theta_M$) at epoch 100; DDQ(5, rand-init $\theta_M$) and DDQ(10, rand-init $\theta_M$) at epoch 200; and DQN and DDQ(10, fixed $\theta_M$) at epoch 300. (Success: success rate)

Implementation Details

All the models in these agents ($Q(s, a; \theta_Q)$, $M(s, a; \theta_M)$) are MLPs with tanh activations. Each policy network $Q(\cdot)$ has one hidden layer with 80 hidden nodes. As shown in Figure 3, the world model $M(\cdot)$ contains two shared hidden layers and three task-specific hidden layers, with 80 nodes in each. All the agents are trained by Algorithm 1 with the same set of hyper-parameters. $\epsilon$-greedy is always applied for exploration. We set the discount factor $\gamma = 0.95$. The buffer sizes of both $D^u$ and $D^s$ are set to 5000. The target value function is updated at the end of each epoch. In each epoch, $Q(\cdot)$ and $M(\cdot)$ are refined using one-step ($Z = 1$) 16-tuple-minibatch update [4]. In planning, the maximum length of a simulated dialogue is 40 ($L = 40$). In addition, to make the dialogue training efficient, we also applied a variant of imitation learning, called Replay Buffer Spiking (RBS) Lipton et al. (2016). We built a naive but occasionally successful rule-based agent based on the human conversational dataset (line 1 in Algorithm 1), and pre-filled the real experience replay buffer $D^u$ with 100 dialogues of experience (line 3) before training for all the variants of agents.

[4] We found in our experiments that setting $Z > 1$ improves the performance of all agents, but does not change the conclusion of this study: DDQ consistently outperforms DQN by a statistically significant margin. Conceptually, the optimal value of $Z$ used in planning is different from that in direct reinforcement learning, and should vary according to the quality of the world model. The better the world model is, the more aggressive an update (thus a bigger $Z$) can be used in planning. We leave it to future work to investigate how to optimize $Z$ for planning in DDQ.
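For reference, the settings stated in this subsection can be collected into a single configuration object, as in the sketch below. The class and field names are hypothetical. The exploration rate has no default because its value is not given in the text, and the default number of planning steps is simply one of the values compared in Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDQConfig:
    epsilon: float                       # exploration rate; not given in the text
    hidden_size: int = 80                # nodes per hidden layer
    policy_hidden_layers: int = 1
    world_model_shared_layers: int = 2
    world_model_task_layers: int = 3     # task-specific hidden layers
    activation: str = "tanh"
    gamma: float = 0.95                  # discount factor
    buffer_size: int = 5000              # for both real and simulated buffers
    updates_per_epoch: int = 1           # one-step update of policy and world model
    max_simulated_turns: int = 40        # maximum simulated dialogue length
    rbs_dialogues: int = 100             # dialogues used to pre-fill the real buffer
    planning_steps: int = 5              # number of planning steps, e.g. 5 or 10


config = DDQConfig(epsilon=0.1)          # 0.1 is an arbitrary illustrative value
```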

Figure 4: Learning curves of the DDQ($K$) agents with $K = 2, 5, 10, 20$. The DQN agent is identical to a DDQ($K$) agent with $K = 0$.

3.3 Simulated User Evaluation

In this setting the dialogue agents are optimized by interacting with user simulators, instead of real users. Thus, the world model is learned to mimic user simulators. Although the simulator-trained agents are sub-optimal when applied to real users due to the discrepancy between simulators and real users, the simulation setting allows us to perform a detailed analysis of DDQ without much cost and to reproduce the experimental results easily.

User Simulator

We adapted a publicly available user simulator Li et al. (2016b) to the task-completion dialogue setting. During training, the simulator provides the agent with a simulated user response in each dialogue turn and a reward signal at the end of the dialogue. A dialogue is considered successful only when a movie ticket is booked successfully and when the information provided by the agent satisfies all the user's constraints. At the end of each dialogue, the agent receives a positive reward of $2L$ for success, or a negative reward of $-L$ for failure, where $L$ is the maximum number of turns in each dialogue, and is set to $40$ in our experiments. Furthermore, in each turn, the agent receives a reward of $-1$, so that shorter dialogues are encouraged. Readers can refer to Appendix B for details on the user simulator.
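The reward scheme can be written as a small function. This sketch assumes a terminal reward of twice the maximum dialogue length on success, minus the maximum dialogue length on failure, a per-turn reward of minus one, and a maximum length of 40 turns (the limit also used for simulated dialogues). The function name and the convention for counting penalised turns are illustrative choices.

```python
def dialogue_reward(success, num_penalised_turns, max_turns=40):
    """Cumulative reward of one dialogue: terminal reward plus per-turn penalties."""
    terminal = 2 * max_turns if success else -max_turns
    return terminal - num_penalised_turns


short_success = dialogue_reward(True, 10)     # 80 - 10
long_success = dialogue_reward(True, 20)      # 80 - 20
timeout_failure = dialogue_reward(False, 40)  # -40 - 40
```

The gap between `short_success` and `long_success` shows how the per-turn penalty makes the agent prefer completing the same task in fewer turns.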

Figure 5: Learning curves of DQN, DDQ(10), DDQ(10, rand-init $\theta_M$), DDQ(10, fixed $\theta_M$), and DQN(10).
Agent                           Epoch = 100                Epoch = 150                Epoch = 200
                                Success  Reward   Turns    Success  Reward   Turns    Success  Reward   Turns
DQN                             .0000    -58.69   39.38    .4080     -5.73   30.38    .4545      0.35   30.38
DDQ(5)                          .4620      0.78   31.33    .5637     15.05   26.17    .6000     19.84   26.32
DDQ(5, rand-init $\theta_M$)    .3600    -11.67   31.74    .5500     13.71   26.58    .5752     16.84   26.37
DDQ(10)                         .5555     14.69   25.92    .6416     25.85   24.28    .7332     38.88   20.21
DDQ(10, rand-init $\theta_M$)   .5010      6.27   29.70    .6055     22.11   23.11    .7023     36.90   21.20
Table 2: The performance of different agents at training epoch = {100, 150, 200} in the human-in-the-loop experiments. The difference between the results of all agent pairs evaluated at the same epoch is statistically significant ($p < 0.01$). (Success: success rate)


The main simulation results are reported in Table 1 and Figures 4 and 5. For each agent, we report its results in terms of success rate, average reward, and average number of turns (averaged over 5 repetitions of the experiments). Results show that the DDQ agents consistently outperform DQN with a statistically significant margin. Figure 4 shows the learning curves of different DDQ agents trained using different planning steps. Since the training of all RL agents started with RBS using the same rule-based agent, their performance in the first few epochs is very close. After that, performance improved for all values of $K$, but much more rapidly for larger values. Recall that the DDQ($K$) agent with $K = 0$ is identical to the DQN agent, which does no planning but relies on direct reinforcement learning only. Without planning, the DQN agent took about 180 epochs (real dialogues) to reach the success rate of 50%, and DDQ(10) took only 50 epochs.

Intuitively, the optimal value of $K$ needs to be determined by seeking the best trade-off between the quality of the world model and the amount of simulated experience that is useful for improving the dialogue agent. This is a non-trivial optimization problem because both the dialogue agent and the world model are updated constantly during training and the optimal $K$ needs to be adjusted accordingly. For example, we find in our experiments that at the early stages of training, it is fine to perform planning aggressively by using large amounts of simulated experience even though they are of low quality, but in the late stages of training where the dialogue agent has been significantly improved, low-quality simulated experience is likely to hurt the performance. Thus, in our implementation of Algorithm 1, we use a heuristic [5] to reduce the value of $K$ in the late stages of training (e.g., after 150 epochs in Figure 4) to mitigate the negative impact of low-quality simulated experience. We leave it to future work how to optimize the planning step size during DDQ training in a principled way.

[5] The heuristic is not presented in Algorithm 1. Readers can refer to the released source code for details.
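Since the actual heuristic is only available in the released source code, the snippet below is a purely hypothetical stand-in that shows the general shape such a schedule could take: full planning early in training, reduced planning after a chosen epoch. The function name, the step shape, and all default values are assumptions made for illustration.

```python
def planning_steps(epoch, k_initial=10, k_late=2, decay_start=150):
    """Hypothetical step schedule: full planning early, reduced planning late."""
    if epoch < decay_start:
        return k_initial
    return min(k_initial, k_late)


# Number of planning steps used at a few points during training.
schedule = [planning_steps(epoch) for epoch in (0, 149, 150, 300)]
```

A smoother alternative would decay the number of planning steps gradually, or tie it to a measured error of the world model on held-out real experience.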

Figure 5 shows that the quality of the world model has a significant impact on the agent's performance. The learning curve of DQN(10) indicates the best performance we can expect with a perfect world model. With a pre-trained world model, the performance of the DDQ agent improves more rapidly, although eventually, the DDQ and DDQ(rand-init $\theta_M$) agents reach the same success rate after many epochs. The world model learning process is crucial to both the efficiency of dialogue policy learning and the final performance of the agent. For example, in the early stages (before 60 epochs), the performances of DDQ and DDQ(fixed $\theta_M$) remain very close to each other, but DDQ reaches a success rate almost 10% better than DDQ(fixed $\theta_M$) after 400 epochs.

Figure 6: Human-in-the-loop dialogue policy learning curves in four different agents.
Simulation Sample

movie-ticket booking user goal:
  "request_slots": {
    "ticket": "?"
    "theater": "?"
    "starttime": "?"
    "date": "?"
  }
  "constraint_slots": {
    "numberofpeople": "2"
    "moviename": "deadpool"
    "city": "seattle"
  }

DQN agent
   1 usr: Hi, what date can I book 2 tickets for deadpool?
   2 agt: Which city would you like?
   3 usr: I want to watch at seattle.
   4 agt: What time would you like to see it?
   5 usr: What start time is available?
   6 agt: Which theater would you like?
   7 usr: Which theater is available?
   8 agt: What time would you like to see it?
   9 usr: What start time is available?
  10 agt: Which theater would you like?
  11 usr: Which theater is available?
  12 agt: How many tickets do you need?
  13 usr: I want 2 tickets please!
  14 agt: Which theater would you like?
  15 usr: Which theater is available?
  16 agt: What time would you like to see it?
  ...
  39 agt: Which theater would you like?
  40 usr: Which theater is available?
  Failure: Agent does not book the right ticket within the max turns.

DDQ(10) agent
   1 usr: Hi, when is deadpool playing at seattle?
   2 agt: Tonight.
   3 usr: What is the start time?
   4 agt: Which theater would you like?
   5 usr: I want to watch it at amc pacific place 11.
   6 agt: 10pm is available.
   7 usr: I want 2 tickets please!
   8 agt: Great, here are 2 deadpool tickets for you at amc pacific place 11 in seattle tonight 10 pm.
   9 usr: Thank you.
  10 agt: Thank you.
  Success

Real User Sample

movie-ticket booking user goal:
  "request_slots": {
    "ticket": "?"
    "theater": "?"
    "starttime": "?"
  }
  "constraint_slots": {
    "date": "this weekend"
    "numberofpeople": "1"
    "moviename": "batman"
  }

DQN agent
   1 usr: Hi, which theater will show batman? I want 1 ticket.
   2 agt: Which city would you like?
   3 usr: Boston.
   4 agt: What time would you like to see it?
   5 usr: What start time is available?
   6 agt: What date would you like to see it?
   7 usr: This weekend.
   8 agt: Which theater would you like?
   9 usr: Which theater is available?
  10 agt: Which theater would you like?
  11 usr: Which theater is available?
  12 agt: What time would you like to see it?
  13 usr: What start time is available?
  14 agt: Which theater would you like?
  15 usr: Thanks, I do not need the ticket.
  16 agt: Thank you.
  Failure: User terminates the dialogue before reaching the max turns because the agent repeatedly asks the same question.

DDQ(10) agent
   1 usr: Hi, which theater will show batman? I want 1 ticket.
   2 agt: Century Eastport 16.
   3 usr: What is the start time?
   4 agt: 9PM.
   5 usr: Can I buy a ticket here?
   6 agt: This weekend?
   7 usr: Yes, one ticket please!
   8 agt: Sure, here it is - batman at Century Eastport 16, 9PM this weekend.
   9 usr: Thank you.
  10 agt: Thank you.
  Success

Table 3: Sample dialogue sessions by DQN and DDQ(10) agents trained at epoch 100, in the simulated user experiments (Simulation Sample) and in the human-in-the-loop experiments (Real User Sample). (agt: agent, usr: user)

3.4 Human-in-the-Loop Evaluation

In this setting, five dialogue agents (i.e., DQN, DDQ(10), DDQ(10, rand-init $\theta_M$), DDQ(5), and DDQ(5, rand-init $\theta_M$)) are trained via RL by interacting with real human users. In each dialogue session, one of the agents was randomly picked to converse with a user. The user was presented with a user goal sampled from the corpus, and was instructed to converse with the agent to complete the task. The user had the choice of abandoning the task and ending the dialogue at any time, if she or he believed that the dialogue was unlikely to succeed or simply because the dialogue dragged on for too many turns. In such cases, the dialogue session was considered a failure. At the end of each session, the user was asked to give explicit feedback on whether the dialogue succeeded (i.e., whether the movie tickets were booked with all the user constraints satisfied). Each learning curve is trained with two runs, with each run generating 150 dialogues (and additional simulated dialogues when planning is applied). In total, we collected 1500 dialogue sessions for training all five agents.

The main results are presented in Table 2 and Figure 6, with each agent averaged over two independent runs. The results confirm what we observed in the simulation experiments. The conclusions are summarized below:

  • The DDQ agent significantly outperforms DQN, as demonstrated by the comparison between DDQ(10) and DQN. Table 3 presents four example dialogues produced by two dialogue agents interacting with simulated and human users, respectively. The DQN agent, after being trained with 100 dialogues, still behaved like a naive rule-based agent that requested information bit by bit in a fixed order. When the user did not answer the request explicitly (e.g., usr: which theater is available?), the agent failed to respond properly. On the other hand, with planning, the DDQ agent trained with 100 real dialogues is much more robust and can complete 50% of user tasks successfully.

  • A larger $K$ leads to more aggressive planning and better results, as shown by DDQ(10) vs. DDQ(5).

  • Pre-training the world model with human conversational data improves the learning efficiency and the agent's performance, as shown by DDQ(5) vs. DDQ(5, rand-init $\theta_M$), and DDQ(10) vs. DDQ(10, rand-init $\theta_M$).

4 Conclusion

We propose a new strategy for a task-completion dialogue agent to learn its policy by interacting with real users. Compared to previous work, our agent learns in a much more efficient way, using only a small number of real user interactions, which amounts to an affordable cost in many nontrivial domains. Our strategy is based on the Deep Dyna-Q (DDQ) framework where planning is integrated into dialogue policy learning. The effectiveness of DDQ is validated by human-in-the-loop experiments, demonstrating that a dialogue agent can efficiently adapt its policy on the fly by interacting with real users via deep RL.

One interesting topic for future research is exploration in planning. We need to deal with the challenge of adapting the world model in a changing environment, as exemplified by the domain extension problem Lipton et al. (2016). As pointed out by Sutton and Barto (1998), the general problem here is a particular manifestation of the conflict between exploration and exploitation. In a planning context, exploration means trying actions that may improve the world model, whereas exploitation means trying to behave in the optimal way given the current model. To this end, we want the agent to explore in the environment, but not so much that the performance would be greatly degraded.


We would like to thank Chris Brockett, Yun-Nung Chen, Michel Galley and Lihong Li for their insightful comments on the paper. We would like to acknowledge the volunteers from Microsoft Research for helping us with the human-in-the-loop experiments. This work was done when Baolin Peng and Shang-Yu Su were visiting Microsoft. Baolin Peng is in part supported by Innovation and Technology Fund (6904333), and General Research Fund of Hong Kong (12183516).


Appendix A Dataset Annotation Schema

Table 4 lists all annotated dialogue acts and slots in detail.

Intent   request, inform, deny, confirm_question, confirm_answer, greeting, closing, not_sure, multiple_choice, thanks, welcome
Slot     city, closing, date, distanceconstraints, greeting, moviename, numberofpeople, price, starttime, state, taskcomplete, theater, theater_chain, ticket, video_format, zip
Table 4: The data annotation schema

Appendix B User Simulator

In the task-completion dialogue setting, the entire conversation revolves around a user goal that remains implicit: the agent knows nothing about the user goal explicitly, and its objective is to help the user accomplish this goal. Generally, the definition of a user goal contains two parts:

  • inform_slots contain a number of slot-value pairs which serve as constraints from the user.

  • request_slots contain a set of slots whose values the user does not know but wants to obtain from the agent during the conversation. ticket is a default slot which always appears in the request_slots part of the user goal.

To make the user goal more realistic, we add some constraints to the user goal: slots are split into two groups. Some slots must appear in the user goal; we call these Required slots. In the movie-booking scenario, they include moviename, theater, starttime, date, and numberofpeople. The remaining slots are Optional slots, for example, theater_chain, video_format, etc.

We generated the user goals from the labeled dataset mentioned in Section 3.1, using two mechanisms. One mechanism is to extract all the slots (known and unknown) from the first user turns (excluding the greeting user turn) in the data, since usually the first turn contains some or all of the required information from the user. The other mechanism is to extract all the slots (known and unknown) that first appear in all the user turns, and then aggregate them into one user goal. We dump these user goals into a file as the user-goal database. Every time we run a dialogue, we randomly sample one user goal from this user-goal database.
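The second mechanism, aggregating the slots that first appear across all user turns, might be implemented along the following lines. The input format (a list of per-turn dictionaries with `inform_slots` and `request_slots`) and the function name are assumptions made for this sketch; the default `ticket` request slot follows the description above.

```python
import random


def aggregate_user_goal(user_turns):
    """Build one user goal from the first appearance of each slot in the user turns."""
    goal = {"inform_slots": {}, "request_slots": {"ticket": "?"}}
    seen = {"ticket"}                              # ticket is always requested
    for turn in user_turns:
        for slot, value in turn.get("inform_slots", {}).items():
            if slot not in seen:                   # keep the first occurrence only
                seen.add(slot)
                goal["inform_slots"][slot] = value
        for slot in turn.get("request_slots", []):
            if slot not in seen:
                seen.add(slot)
                goal["request_slots"][slot] = "?"
    return goal


# Two annotated user turns from a hypothetical labeled dialogue.
turns = [
    {"inform_slots": {"moviename": "batman"}, "request_slots": ["theater"]},
    {"inform_slots": {"numberofpeople": "1", "moviename": "deadpool"},
     "request_slots": ["starttime", "theater"]},
]
goal_database = [aggregate_user_goal(turns)]
sampled_goal = random.Random(0).choice(goal_database)
```

The later mention of `deadpool` is ignored because `moviename` was already seen in the first turn, which matches the idea of taking each slot where it first appears.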