Subgoal Discovery for Hierarchical Dialogue Policy Learning

04/20/2018 ∙ by Da Tang, et al. ∙ Google, Microsoft, Columbia University

Developing conversational agents to engage in complex dialogues is challenging partly because the dialogue policy needs to explore a large state-action space. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given a set of successful dialogue sessions, we present a Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a hierarchical policy which consists of 1) a top-level policy that selects among subgoals, and 2) a low-level policy that selects primitive actions to accomplish the subgoal. We exemplify our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that an agent trained with automatically discovered subgoals performs competitively against an agent with human-defined subgoals, and significantly outperforms an agent without subgoals. Moreover, we show that learned subgoals are human comprehensible.






1 Introduction

Suppose we want to plan a trip to a distant city using a dialogue agent. The agent must make choices at each leg, e.g., whether to fly or to drive, and whether to book a hotel. Each of these steps in turn involves making a sequence of decisions all the way down to lower-level actions. For example, booking a hotel involves identifying the location, specifying the check-in date and time, negotiating the price, and so on.

The above process of the agent has a natural hierarchy: a top-level process selects which subgoal to complete, and a low-level process chooses primitive actions to accomplish the selected subgoal. Within the reinforcement learning (RL) paradigm, such a hierarchical decision making process can be formulated in the options framework Sutton et al. (1999), where subgoals with their own reward functions are used to learn policies for achieving these subgoals. These learned policies are then used as temporally extended actions, or options, for solving the entire task.

Based on the options framework, researchers have developed dialogue agents for complex tasks, such as travel planning, using hierarchical reinforcement learning (HRL) (Cuayáhuitl et al., 2010). Recently, Peng et al. (2017b) showed that the use of subgoals mitigates the reward sparsity and leads to more effective exploration for dialogue policy learning. However, these subgoals need to be human-defined, which limits the applicability of the approach in practice because the domain knowledge required to properly define subgoals is often unavailable.

In this paper, we propose a simple yet effective Subgoal Discovery Network (SDN) that discovers useful subgoals automatically for an RL-based dialogue agent. The SDN takes as input a collection of successful conversations, and identifies “hub” states as subgoals. Intuitively, a hub state is a region in the agent’s state space that the agent tends to visit frequently on successful paths to a goal but not on unsuccessful paths. Given the discovered subgoals, HRL can be applied to learn a hierarchical dialogue policy which consists of (1) a top-level policy that selects among subgoals, and (2) a low-level policy that chooses primitive actions to achieve selected subgoals.

We present the first study of learning dialogue agents with automatically discovered subgoals. We demonstrate the effectiveness of our approach by building a composite task-completion dialogue agent for travel planning. Experiments with both simulated and real users show that an agent learned with discovered subgoals performs competitively against an agent learned using expert-defined subgoals, and significantly outperforms an agent learned without subgoals. We also find that the subgoals discovered by SDN are often human comprehensible.

2 Background

A goal-oriented dialogue can be formulated as a Markov decision process, or MDP Levin et al. (2000), in which the agent interacts with its environment over a sequence of discrete steps. At each step $t$, the agent observes the current state $s_t$ of the conversation Henderson (2015); Mrkšić et al. (2017); Li et al. (2017), and chooses action $a_t$ according to a policy $\pi$. Here, the action may be a natural-language sentence or a speech act, among others. Then, the agent receives a numerical reward $r_t$ and switches to next state $s_{t+1}$. The process repeats until the dialogue terminates. The agent is to learn to choose optimal actions so as to maximize the total discounted reward $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$, where $\gamma$ is a discount factor. This learning paradigm is known as reinforcement learning, or RL Sutton and Barto (1998).
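As a small illustration of the objective described above, the total discounted reward of one dialogue can be computed from its per-step rewards. This is a minimal sketch: the function name, the example reward values, and the discount factor are assumptions chosen for illustration and do not come from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards, with the k-th reward (0-indexed) weighted by gamma**k."""
    total = 0.0
    # Accumulate from the last step backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical dialogue: a small per-turn penalty, then a final success reward.
example_rewards = [-1.0, -1.0, -1.0, 10.0]
print(discounted_return(example_rewards, gamma=0.9))
```

The backward accumulation avoids recomputing powers of the discount factor and gives the same value as the forward sum.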

When facing a complex task, it is often more efficient to divide it into multiple simpler sub-tasks, solve them, and combine the partial solutions into a full solution for the original task. Such an approach may be formalized as hierarchical RL (HRL) in the options framework Sutton et al. (1999). An option can be understood as a subgoal, which consists of an initiation condition (when the subgoal can be triggered), an option policy to solve the subgoal, and a termination condition (when the subgoal is considered finished).
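The three components of an option can be pictured as a small data structure. The sketch below is only an illustration under assumed names (`Option`, `run_option`, and the toy integer state space are hypothetical), not the representation used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable


@dataclass
class Option:
    can_start: Callable[[State], bool]    # initiation condition
    policy: Callable[[State], Action]     # option policy
    should_stop: Callable[[State], bool]  # termination condition


def run_option(option: Option, state: State,
               step: Callable[[State, Action], State],
               max_steps: int = 100) -> State:
    """Follow the option policy from `state` until it terminates; return the final state."""
    if not option.can_start(state):
        raise ValueError("option cannot be initiated in this state")
    for _ in range(max_steps):
        if option.should_stop(state):
            break
        state = step(state, option.policy(state))
    return state


# Toy example: states are integers, the only action adds 1, and the subgoal is state 3.
reach_three = Option(
    can_start=lambda s: s < 3,
    policy=lambda s: 1,
    should_stop=lambda s: s >= 3,
)
print(run_option(reach_three, 0, step=lambda s, a: s + a))
```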

When subgoals are given, there exist effective RL algorithms to learn a hierarchical policy. A major open challenge is the automatic discovery of subgoals from data, which is the main innovation of this work and is covered in the next section.

3 Subgoal Discovery for HRL

Figure 1 shows the overall workflow of our proposed method of using automatic subgoal discovery for HRL. First, a dialogue session is divided into several segments. Then, at the end of each segment (i.e., at each subgoal), we provide an intrinsic or extrinsic reward that the HRL algorithm uses to learn a hierarchical dialogue policy. Note that only the last segment has an extrinsic reward. The details of the segmentation algorithm and how to use subgoals for HRL are presented in Section 3.1 and Section 3.3.

Figure 1: The workflow for HRL with subgoal discovery. In addition to the extrinsic reward at the end of the dialogue session, HRL also uses intrinsic rewards induced by the subgoals (or the ends of dialogue segments). Section 3.2 details the reward design for HRL.

3.1 Subgoal Discovery Network

Assume that we have collected a set of successful state trajectories of a task, as shown in Figure 2. We want to find subgoal states, such as the three red states in the figure, which form the “hubs” of these trajectories. These hub states indicate the subgoals, and thus divide a state trajectory into several segments, each for an option¹.

¹ There are many ways of creating a new option for a discovered subgoal state. For example, when a subgoal state is identified at some time step, we add to the option's initiation set the states visited by the agent during a window of time steps ending at that step, where the window length is a pre-set parameter. The initiation set is therefore the union of all such states over all the state trajectories. The termination condition is set to 1 when the subgoal is reached or when the agent is no longer in the initiation set, and to 0 otherwise. In the deep RL setting where states are represented by continuous vectors, the termination condition is a probability whose value is proportional to the vector distance, e.g., between the current state and the subgoal state.


Figure 2: Illustration of “subgoals”. Assuming that there are three state trajectories , and . Then red states , , could be good candidates for “subgoals”.

Thus, discovering subgoals by identifying hubs in state trajectories is equivalent to segmenting state trajectories into options. In this work, we formulate subgoal discovery as a state trajectory segmentation problem, and address it using the Subgoal Discovery Network (SDN), inspired by the sequence segmentation model Wang et al. (2017).

The SDN architecture.

SDN repeats a two-stage process of generating a state trajectory segment until a trajectory termination symbol is generated. First, given all previous states, it either uses an initial segment hidden state to start a new segment or generates a trajectory termination symbol to terminate the trajectory. Second, if the trajectory is not terminated, it keeps generating the next state in the current segment given the previous states, until a segment termination symbol is generated. We illustrate this process in Figure 3.

Figure 3: Illustration of SDN for state trajectory with , and as subgoals. Symbol # is the termination. The top-level RNN (RNN1) models segments and the low-level RNN (RNN2) provides information about previous states from RNN1. The embedding matrix maps the outputs of RNN2 to low dimensional representations so as to be consistent with the input dimensionality of RNN1. Note that state is associated with two termination symbols #; one is for the termination of the last segment and the other is for the termination of the entire trajectory.

We model the likelihood of each segment using an RNN, denoted as RNN1. During training, at each time step, RNN1 predicts the next state with the current state as input, until it reaches the option termination symbol #. Since different options are under different conditions, it is not plausible to apply a fixed initial input to each segment. Therefore, we use another RNN (RNN2) to encode all previous states to provide relevant information, and we transform this information into low dimensional representations that serve as the initial inputs for the RNN1 instances. This is based on the causality assumption of the options framework Sutton et al. (1999): the agent should be able to determine the next option given all previous information, and this should not depend on information related to any later state. The low dimensional representations are obtained via a global subgoal embedding matrix $M \in \mathbb{R}^{d \times D}$, where $d$ and $D$ are the dimensionality of RNN1’s input layer and RNN2’s output layer, respectively. Mathematically, if the output of RNN2 at time step $t$ is $o_t$, then from time $t$ the RNN1 instance has $M \cdot \mathrm{softmax}(o_t)$ as its initial input². $D$ is the number of subgoals we aim to learn. Ideally, the vector $\mathrm{softmax}(o_t)$ in a well-trained SDN is close to a one-hot vector. Therefore, $M \cdot \mathrm{softmax}(o_t)$ should be close to one column in $M$, and we can view $M$ as providing at most $D$ different “embedding vectors” as inputs for RNN1, indicating at most $D$ different subgoals. Even in the case where $\mathrm{softmax}(o_t)$ is not close to any one-hot vector, choosing a small $D$ helps avoid overfitting.

² $\mathrm{softmax}(o_t)_i = \exp(o_{t,i}) / \sum_{i'=1}^{D} \exp(o_{t,i'})$ for $i = 1, \ldots, D$.
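To make the role of the embedding matrix concrete, the sketch below computes an initial input for RNN1 from an RNN2 output. It assumes that the weighting vector is a softmax over RNN2's output (an assumption made here for illustration); the function names and the toy numbers are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def initial_input(embedding: List[List[float]], rnn2_output: List[float]) -> List[float]:
    """Return embedding @ softmax(rnn2_output); embedding has d rows and D columns."""
    weights = softmax(rnn2_output)
    return [sum(m * w for m, w in zip(row, weights)) for row in embedding]


# Toy embedding matrix with d = 2 rows and D = 3 columns (one column per subgoal).
toy_embedding = [
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0],
]
# A strongly peaked RNN2 output selects (almost exactly) the second column.
print(initial_input(toy_embedding, [0.0, 20.0, 0.0]))
```

When the softmax is nearly one-hot, the result is nearly a single column of the matrix, which is why the number of columns bounds the number of distinct subgoal embeddings.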

Segmentation likelihood.

Given the state trajectory , assuming that , and are the discovered subgoal states, we model the conditional likelihood of a proposed segmentation as , where each probability term is based on an RNN1 instance. And for the whole trajectory , its likelihood is the sum over all possible segmentations.

Generally, for a state trajectory, we model its likelihood as follows³:

³ For notational convenience, we include the initial state in the observational sequence, though the initial state is always conditioned upon.


where is the set of all possible segmentations for the trajectory , denotes the segment in the segmentation , and is the concatenation operator. is an upper limit on the maximal number of segments. This parameter is important for learning subgoals in our setting since we usually prefer a small number of subgoals. This is different from Wang et al. (2017), where a maximum segment length is enforced.

We use maximum likelihood estimation with Eq. (1) for training. However, the number of possible segmentations is exponential in the trajectory length, and naive enumeration is intractable. Dynamic programming is therefore employed to compute the likelihood in Eq. (1) efficiently, via the following recursion over the likelihoods of sub-trajectories:

Here, denotes the likelihood of sub-trajectory with no more than segments and is an indicator function. is the likelihood segment given the previous history, where RNN1 models the segment and RNN2 models the history as shown in Figure 3. With this recursion, we can compute the likelihood for the trajectory in time.
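The dynamic program described above can be sketched as follows. This is an illustrative formulation, not necessarily the paper's exact recursion: it assumes that adjacent segments share their boundary state, that every segment spans at least one transition, and that a hypothetical callable `seg_prob(i, t)` returns the likelihood of the segment from state index `i` to state index `t` given the history up to `i` (in SDN this would come from RNN1 and RNN2). A brute-force enumeration is included only to check the result on small inputs.

```python
from itertools import combinations
from typing import Callable

SegmentProb = Callable[[int, int], float]  # p(segment s_i..s_t | history s_0..s_i)


def trajectory_likelihood(T: int, max_segments: int, seg_prob: SegmentProb) -> float:
    """Likelihood of s_0..s_T (T >= 1), summed over segmentations with 1..max_segments segments."""
    # exact[m][t]: total probability of s_0..s_t split into exactly m segments.
    exact = [[0.0] * (T + 1) for _ in range(max_segments + 1)]
    exact[0][0] = 1.0  # zero segments cover only the initial state
    for m in range(1, max_segments + 1):
        for t in range(1, T + 1):
            # The last segment runs from boundary state s_i to s_t.
            exact[m][t] = sum(exact[m - 1][i] * seg_prob(i, t) for i in range(t))
    return sum(exact[m][T] for m in range(1, max_segments + 1))


def brute_force_likelihood(T: int, max_segments: int, seg_prob: SegmentProb) -> float:
    """Reference implementation: enumerate every choice of interior boundaries."""
    total = 0.0
    for k in range(max_segments):  # k interior boundaries give k + 1 segments
        for cuts in combinations(range(1, T), k):
            bounds = (0, *cuts, T)
            prob = 1.0
            for i, t in zip(bounds, bounds[1:]):
                prob *= seg_prob(i, t)
            total += prob
    return total


def toy_seg_prob(i: int, t: int) -> float:
    """Hypothetical segment likelihood, used only to exercise the code."""
    return 0.5 ** (t - i) / (i + 1)


print(trajectory_likelihood(6, 3, toy_seg_prob))
print(brute_force_likelihood(6, 3, toy_seg_prob))
```

The table has on the order of (number of segments) × (trajectory length) entries and each entry sums over at most one trajectory length of terms, so the number of `seg_prob` evaluations grows quadratically in the trajectory length rather than exponentially.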

Learning algorithm.

We denote as the model parameter including the parameters of the embedding matrix , RNN1 and RNN2. We then parameterize the segment likelihood function as , and the trajectory likelihood function as .

Given a set of state trajectories, we optimize the model parameter by minimizing the negative mean log-likelihood plus a regularization term, using stochastic gradient descent:


Algorithm 1 outlines the training procedure for SDN using stochastic gradient descent.

Input:  A set of state trajectories, the limit on the number of segments, and an initial learning rate.
1:  Initialize the SDN parameter.
2:  while not converged do
3:     Compute the gradient of the loss as in Eq. (2).
4:     Update the SDN parameter with a gradient step.
5:     Update the learning rate.
6:  end while
Algorithm 1 Learning SDN

3.2 Hierarchical Dialogue Policy Learning

Before describing how we use a trained SDN model for HRL, we first present a short review of HRL for a task-oriented dialogue system. Following the options framework Sutton et al. (1999), assume that we have a state set , an option set and a finite primitive action set .

The HRL approach we take learns two Q-functions Peng et al. (2017b), parameterized by and , respectively:

  • The top-level measures the maximum total discounted extrinsic reward received by choosing subgoal in state and then following an optimal policy. These extrinsic rewards are the objective to be maximized by the entire dialogue policy.

  • The low-level measures the maximum total discounted intrinsic reward received to achieve a given subgoal , by choosing action in state and then following an optimal option policy. These intrinsic rewards are used to learn an option policy to achieve a given subgoal.

Suppose we have a dialogue session of turns: , which is segmented into a sequence of subgoals . Consider one of these subgoals which starts and ends in steps and , respectively.

The top-level Q-function is learned using Q-learning, by treating subgoals as temporally extended actions:


and is the step-size parameter, is a discount factor. In the above expression of , the first term refers to the total discounted reward during fulfillment of subgoal , and the second to the maximum total discounted after is fulfilled.
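The structure of the top-level update, a discounted sum of the extrinsic rewards collected while the subgoal is pursued plus a discounted bootstrap from the state where it ends, can be sketched in tabular form. The paper uses neural Q-functions; the table, the function name, and the toy values below are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def update_top_level_q(q: QTable, state: Hashable, subgoal: Hashable,
                       extrinsic_rewards: Sequence[float], next_state: Hashable,
                       subgoals: Sequence[Hashable], alpha: float, gamma: float,
                       terminal: bool = False) -> None:
    """One Q-learning step that treats a subgoal as a temporally extended action."""
    n = len(extrinsic_rewards)
    # First term: total discounted extrinsic reward while the subgoal is pursued.
    discounted = sum(gamma ** k * r for k, r in enumerate(extrinsic_rewards))
    # Second term: best value attainable after the subgoal is fulfilled.
    bootstrap = 0.0 if terminal else max(q[(next_state, g)] for g in subgoals)
    target = discounted + gamma ** n * bootstrap
    q[(state, subgoal)] += alpha * (target - q[(state, subgoal)])


# Toy usage with hypothetical states, subgoals, and rewards.
q_top: QTable = defaultdict(float)
q_top[("s1", "hotel")] = 2.0
update_top_level_q(q_top, "s0", "flight", [-1.0, -1.0], "s1",
                   subgoals=["flight", "hotel"], alpha=0.5, gamma=0.5)
print(q_top[("s0", "flight")])
```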

The low-level Q-function is learned in a similar way, and follows the standard Q-learning update, except that intrinsic rewards for subgoal are used. Specifically, for :


Here, the intrinsic reward is provided by the internal critic of dialogue manager. More details are in Appendix A.

In hierarchical policy learning, the combination of the extrinsic and intrinsic rewards is expected to help the agent to successfully accomplish a composite task as fast as possible while trying to avoid unnecessary subtask switches. Hence, we define the extrinsic and intrinsic rewards as follows:

Extrinsic Reward.

Let be the maximum number of turns of a dialogue, and the number of subgoals. At the end of a dialogue, the agent receives a positive extrinsic reward of for a success dialogue, or for a failure dialogue; for each turn, the agent receives an extrinsic reward of to encourage shorter dialogues.

Intrinsic Reward.

When a subgoal terminates, the agent receives a positive intrinsic reward of if a subgoal is completed successfully, or a negative intrinsic reward of otherwise; for each turn, the agent receives an intrinsic reward to encourage shorter dialogues.

3.3 Hierarchical Policy Learning with SDN

We use a trained SDN in HRL as follows. The agent starts from the initial state , keeps sampling the output from the distribution related to the top-level RNN (RNN1) until a termination symbol # is generated, which indicates the agent reaches a subgoal. In this process, intrinsic rewards are generated as specified in the previous subsection. After # is generated, the agent selects a new option, and repeats this process.

This type of naive sampling may allow an option to terminate at states where the termination probability is low. To stabilize the HRL training, we introduce a threshold, which directs the agent to terminate an option if and only if the probability of outputting # is at least the threshold. We found that this modification leads to better behavior of the HRL agent than the naive sampling method, since it normally has a smaller variance.
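The difference between the two termination rules can be shown in a few lines. The sketch below uses hypothetical function names and made-up termination probabilities; only the threshold value 0.2 is taken from the experiment settings reported later in the paper.

```python
import random


def terminate_by_sampling(p_term: float, rng: random.Random) -> bool:
    """Naive rule: terminate with probability p_term."""
    return rng.random() < p_term


def terminate_by_threshold(p_term: float, threshold: float) -> bool:
    """Thresholded rule: terminate if and only if p_term is at least the threshold."""
    return p_term >= threshold


# Hypothetical per-turn termination probabilities from a trained model.
probabilities = [0.01, 0.05, 0.60, 0.03, 0.90]
threshold = 0.2
decisions = [terminate_by_threshold(p, threshold) for p in probabilities]
print(decisions)
```

Under sampling, a turn with termination probability 0.05 still ends the option about once in twenty visits; under the threshold rule it never does, which makes the agent's behavior deterministic given the probabilities.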

In the HRL training, the agent only uses the probability of outputting # to decide subgoal termination. Algorithm 2 outlines the full procedure of one episode for hierarchical dialogue policies with a trained SDN in the composite task-completion dialogue system.

4 Experiments and Results

We evaluate the proposed model on a travel planning scenario for composite task-oriented dialogues Peng et al. (2017b). Over the exchange of a conversation, the agent gathers information about the user’s intent before booking a trip. The environment then assesses a binary outcome (success or failure) at the end of the conversation, based on (1) whether a trip is booked, and (2) whether the trip satisfies the user’s constraints.

Input:  A trained SDN, the initial state of an episode, a termination threshold, and the HRL agent.
1:  Initialize an RNN2 instance with parameters from the SDN and the initial state as the initial input.
2:  Initialize an RNN1 instance with parameters from the SDN, taking as its initial input the embedding (via the SDN's embedding matrix) of the initial output of the RNN2 instance.
3:  Set the current state to the initial state.
4:  Select an option using the agent.
5:  while the final goal is not reached do
6:     Select an action according to the current state and option using the agent. Get the reward and the next state from the environment.
7:     Feed the next state to the RNN2 instance, and pass the embedding of its latest output to the RNN1 instance as its new input. Compute the probability of outputting the termination symbol #.
8:     if this probability is at least the threshold then
9:         Select a new option using the agent.
10:         Re-initialize the RNN1 instance using the latest output from the RNN2 instance and the embedding matrix.
11:     end if
12:  end while
Algorithm 2 HRL episode with a trained SDN
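The control flow of Algorithm 2 can be sketched with stand-in components. Everything below is hypothetical scaffolding: `ToyEnv`, `ToyAgent`, and `ToySDN` replace the simulator, the HRL agent, and the trained RNNs, and the `max_turns` cap is an added safeguard rather than part of the algorithm. Only the loop structure mirrors the listing.

```python
class ToyEnv:
    """Hypothetical environment: the state is a turn counter; the goal is a fixed count."""

    def __init__(self, goal: int) -> None:
        self.state = 0
        self.goal = goal

    def step(self, action: str):
        self.state += 1
        return -1.0, self.state, self.state == self.goal  # reward, next state, done


class ToyAgent:
    """Hypothetical agent with fixed (not learned) top-level and low-level policies."""

    def select_option(self, state: int) -> str:
        return f"option-from-{state}"

    def select_action(self, state: int, option: str) -> str:
        return "advance"


class ToySDN:
    """Stand-in for a trained SDN: states 2 and 4 are treated as subgoals."""

    def __init__(self) -> None:
        self.segments_started = 0

    def reset(self, initial_state: int) -> None:
        self.segments_started = 0

    def termination_probability(self, state: int) -> float:
        return 0.9 if state in (2, 4) else 0.05

    def start_new_segment(self) -> None:
        self.segments_started += 1


def run_episode(env, agent, sdn, initial_state, threshold, max_turns):
    sdn.reset(initial_state)                     # steps 1-2: initialize the SDN state
    state = initial_state                        # step 3
    option = agent.select_option(state)          # step 4
    history = []
    for _ in range(max_turns):                   # step 5, with a turn cap
        action = agent.select_action(state, option)          # step 6
        reward, next_state, done = env.step(action)
        history.append((state, option, action, reward))
        state = next_state
        if done:
            break
        if sdn.termination_probability(state) >= threshold:  # steps 7-8
            option = agent.select_option(state)              # step 9
            sdn.start_new_segment()                          # step 10
    return history


sdn = ToySDN()
history = run_episode(ToyEnv(goal=6), ToyAgent(), sdn,
                      initial_state=0, threshold=0.2, max_turns=20)
for state, option, action, reward in history:
    print(state, option, action, reward)
```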


The raw dataset in our experiments is from a publicly available multi-domain dialogue corpus El Asri et al. (2017). Following Peng et al. (2017b), a few changes were made to introduce dependencies among subtasks. For example, the hotel check-in date should be the same as the arrival date of the departure flight. The data was mainly used to create simulated users, and to build the knowledge bases for the subtasks of booking flights and reserving hotels.

User Simulator.

In order to learn good policies, RL algorithms typically need an environment to interact with. In the dialogue research community, it is common to use simulated users for this purpose Schatzmann et al. (2007); Li et al. (2017); Liu and Lane (2017). In this work, we adapted a publicly available user simulator Li et al. (2016) to the composite task-completion dialogue setting with the dataset described above. During training, the simulator provides the agent with an (extrinsic) reward signal at the end of the dialogue. A dialogue is considered to be successful only when a travel plan is booked successfully, and the information provided by the agent satisfies the user’s constraints.

Baseline Agents.

We benchmarked the proposed agent (referred to as the m-HRL Agent) against three baseline agents:

  • A Rule Agent uses a sophisticated, hand-crafted dialogue policy, which requests and informs a hand-picked subset of necessary slots, and then confirms with the user about the reserved trip before booking the flight and hotel.

  • A flat RL Agent is trained with a standard deep reinforcement learning method, DQN Mnih et al. (2015), which learns a flat dialogue policy using extrinsic rewards only.

  • An h-HRL Agent is trained with hierarchical deep reinforcement learning (HDQN), which learns a hierarchical dialogue policy based on human-defined subgoals Peng et al. (2017b).

Collecting State Trajectories.

Recall that our subgoal discovery approach takes as input a set of state trajectories which lead to successful outcomes. In practice, one can collect a large set of successful state trajectories, either by asking human experts to demonstrate (e.g., in a call center), or by rolling out a reasonably good policy (e.g., a policy designed by human experts). In this paper, we obtain dialogue state trajectories from a rule-based agent handcrafted by a domain expert. This rule-based agent achieves a success rate of 32.2%, as shown in Figure 4 and Table 1. We collect only the successful dialogue sessions from the roll-outs of the rule-based agent, and learn the subgoals from these dialogue state trajectories.

Experiment Settings.

To train SDN, we use RMSProp Tieleman and Hinton (2012) to optimize the model parameters. For both RNN1 and RNN2, we use LSTM Hochreiter and Schmidhuber (1997) as hidden units and set the hidden size to . We set embedding matrix with columns. As we discussed in Section 3.1, captures the maximum number of subgoals that the model is expected to learn. Again, to prevent SDN from learning many unnecessary subgoals, we only allow segmentation with at most segments during subgoal training. The values for and are usually set to be a little larger than the expected number of subgoals (e.g., or for this task), since we expect a large proportion of the subgoals that SDN learns to be useful, but not necessarily all of them. As long as SDN discovers useful subgoals that guide the agent to learn policies faster, it is beneficial for HRL training, even if some imperfect subgoals are found. During the HRL training, we use the learned SDN to propose subgoal-completion queries. In our experiment, we set the maximum turn .

We collected successful, but imperfect, dialogue episodes from the rule-based agent in Table 1 and randomly choose of these dialogue state trajectories for training SDN. The remaining were used as a validation set.

As illustrated in Section 3.3, SDN starts a new RNN1 instance and issues a subgoal-completion query when the probability of outputting the termination symbol # is above a certain threshold (as in Algorithm 2). In our experiment, is set to be 0.2, which was manually picked according to the termination probability during SDN training.

In dialogue policy learning, for the baseline RL agent, we set the size of the hidden layer to . For the HRL agents, both top-level and low-level dialogue policies have a hidden layer size of . RMSprop was applied to optimize the parameters. We set the batch size to be . During training, we used an ε-greedy strategy for exploration with annealing and set . For each simulation epoch, we simulated dialogues and stored these state transition tuples in the experience replay buffers. At the end of each simulation epoch, the model was updated with all the transition tuples in the buffers in a batch manner.
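An ε-greedy strategy with annealing can be sketched as below. The text does not specify the annealing schedule, so the linear schedule, the function names, and all numeric values here are assumptions for illustration only.

```python
import random
from typing import Sequence


def annealed_epsilon(step: int, eps_start: float, eps_end: float, anneal_steps: int) -> float:
    """Linearly interpolate from eps_start to eps_end over anneal_steps, then hold."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a random action index, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical schedule and Q-values.
rng = random.Random(0)
for step in (0, 50, 100, 200):
    eps = annealed_epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=100)
    action = epsilon_greedy([0.1, 0.9, 0.3], eps, rng)
    print(step, round(eps, 3), action)
```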

Figure 4: Learning curves of agents under simulation.
Agent    Success Rate    Turns    Reward
Rule     .3220           46.23    -24.02
RL       .4440           45.50    -1.834
h-HRL    .6485           44.23    35.32
m-HRL    .6455           44.85    34.77
Table 1: Performance of agents with simulated user.

4.1 Simulated User Evaluation

In the composite task-completion dialogue scenario, we compared the proposed m-HRL agent with three baseline agents in terms of three metrics: success rate⁴, average reward, and average number of turns per dialogue session.

⁴ Success rate is the fraction of dialogues that accomplished the task successfully within the maximum number of turns.
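The three metrics are simple averages over dialogue sessions. The sketch below shows one way to compute them; the `DialogueResult` record, the `summarize` function, and the sample values are hypothetical and not taken from the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class DialogueResult:
    success: bool   # task accomplished within the maximum number of turns
    turns: int      # number of turns in the session
    reward: float   # total reward of the session


def summarize(results: Sequence[DialogueResult]) -> Dict[str, float]:
    """Return success rate, average turns, and average reward over the sessions."""
    if not results:
        raise ValueError("need at least one dialogue")
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "avg_turns": sum(r.turns for r in results) / n,
        "avg_reward": sum(r.reward for r in results) / n,
    }


# Made-up sessions, for illustration only.
sample = [DialogueResult(True, 40, 30.0), DialogueResult(False, 60, -60.0)]
print(summarize(sample))
```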

Figure 4 shows the learning curves of all four agents trained against the simulated user. Each learning curve was averaged over runs. Table 1 shows the test performance where each number was averaged over runs and each run generated simulated dialogues. We find that the HRL agents generated higher success rates and needed fewer conversation turns to achieve the users’ goals than the rule-based agent and the flat RL agent. The performance of the m-HRL agent is tied with that of the h-HRL agent, even though the latter requires high-quality subgoals designed by human experts.

Figure 5: Performance of three agents tested with real users: success rate, number of dialogues and p-value are indicated on each bar (difference in mean is significant with p < 0.05).

4.2 Human Evaluation

We further evaluated the agents that were trained on simulated users against real users, who were recruited from the authors’ organization. We conducted a study using the one RL agent and two HRL agents {RL, h-HRL, m-HRL}, and compared two pairs: {RL, m-HRL} and {h-HRL, m-HRL}. In each dialogue session, one agent was randomly selected from the pool to interact with a user. The user was not aware of which agent was selected to avoid systematic bias. The user was presented with a goal sampled from a user-goal corpus, then was instructed to converse with the agent to complete the given task. At the end of each dialogue session, the user was asked to give a rating on a scale from to based on the naturalness and coherence of the dialogue; here, is the worst rating and the best. In total, we collected dialogue sessions from human users.

Figure 5 summarizes the performances of these agents against real users in terms of success rate. Figure 6 shows the distribution of user ratings for each agent. For these two metrics, both HRL agents were significantly better than the flat RL agent. Another interesting observation is that the m-HRL agent performs similarly to the h-HRL agent in terms of success rate in the real user study, as shown in Figure 5. Meanwhile, in Figure 6, the h-HRL agent is significantly better than the m-HRL agent in terms of real user ratings. This may be caused by the probabilistic termination of subgoals: we used a threshold strategy to decide whether to terminate a subgoal. This could introduce variance, so the agent might behave less reasonably than one with human-defined subgoals, which terminate deterministically.

Figure 6: Distribution of user ratings for three agents in human evaluation

4.3 Subgoal Visualization

Table 2 shows the subgoals discovered by SDN in a sample dialogue by a rule-based agent interacting with the simulated user. The rule-based agent is equipped with a human-defined subtask structure, which always solves subtask flight (turns 1–15) before hotel (turns 16–23), as shown in the first column. At turn 10, the user starts to talk about hotel while the rule-based agent is still working on the pre-defined, unfinished flight subtask until subtask flight is finished at turn 15. At turn 16, the user switches to hotel, and so does the rule-based agent until the end of the dialogue. For this rule-based agent, the human-defined subgoal (flight) terminates at turn 15. Meanwhile, our SDN model detected two subgoals (except for the final goal): one terminating at turn 9 (Subgoal 1), and another terminating at turn 15 (Subgoal 2). Subgoal 2 is consistent with the human-defined subgoal. Subgoal 1 is also reasonable since the user tries to switch to hotel at turn 10. In Appendix B, Table 3 shows a sample dialogue session by m-HRL agent interacting with a real user.

h-Task ID Text
Flight 1 agt: Hi, may I know your departure city?
2 usr: I depart from Fort Lauderdale.
3 agt: And what is your destination city?
4 usr: I need economy seat to Pittsburgh.
5 agt: What is your departure date?
6 usr: 2016-09-12, what time is available for the flight?
7 agt: There is one flight at 21:00:00.
8 usr: What is the price for that flight?
9 agt: $423.
Subgoal 1: A detected subgoal terminates!
10 usr: I need to reserve a hotel for five people.
11 agt: What is your return date for the trip?
12 usr: 2016-09-18, I need to book 5 tickets for the flight.
13 agt: What time do you prefer for return flight?
14 usr: What time is available that day?
15 agt: There is one flight departing at 11:00:00AM.
Subgoal 2: A detected subgoal terminates!
Hotel 16 usr: What hotel is available?
17 agt: There is one hotel called Winter Hill Inn available.
18 usr: What is the price for hotel?
19 agt: $410.
20 usr: Great, could you book the trip for me?
21 agt: Okay. Trip tickets are issued.
22 usr: Thanks very much!
23 agt: You are welcome!
Table 2: Discovered subgoals (except for the final goal) in a sample dialogue by a rule-based agent interacting with user simulator. The left column (h-Task) shows the human-defined subtasks for the rule-based agent. SDN detects two subgoals that terminate at turn 9 and 15 respectively. (h-Task: human-defined subtask, ID: turn ID, agt: Agent, usr: User)

5 Related Work

Task-completion dialogue systems have attracted numerous research efforts, and there is growing interest in leveraging reinforcement learning for policy learning. One line of research is on single-domain task-completion dialogues with flat deep reinforcement learning algorithms such as DQN Zhao and Eskenazi (2016); Li et al. (2017); Peng et al. (2018), actor-critic Peng et al. (2017a); Liu and Lane (2017) and policy gradients Williams et al. (2017); Liu et al. (2017). Another line of research addresses multi-domain dialogues where each domain is handled by a separate agent Gašić et al. (2015, 2015); Cuayáhuitl et al. (2016). Recently, Peng et al. (2017b) presented a composite task-completion dialogue system. Unlike multi-domain dialogue systems, composite tasks introduce inter-subtask constraints. As a result, the completion of a set of individual subtasks does not guarantee the solution of the entire task.

Cuayáhuitl et al. (2010) applied HRL to dialogue policy learning, although they focus on problems with a small state space. Later, Budzianowski et al. (2017) used HRL in multi-domain dialogue systems. Peng et al. (2017b) first presented an HRL agent with a global state tracker to learn the dialogue policy in the composite task-completion dialogue systems. All these works are built on subgoals that were pre-defined with human domain knowledge for the specific tasks. The only job of the policy learner is to learn a hierarchical dialogue policy, which leaves the subgoal discovery problem unsolved. In addition to the applications in dialogue systems, subgoals are also widely studied in the linguistics research community Allwood (2000); Linell (2009).

In the literature, researchers have proposed algorithms to automatically discover subgoals for hierarchical RL. One large body of work is based on analyzing the spatial structure of the state transition graphs, by identifying bottleneck states or clusters, among others Stolle and Precup (2002); McGovern and Barto (2001); Mannor et al. (2004); Şimşek et al. (2005); Entezari et al. (2011); Bacon (2013). Another family of algorithms identifies commonalities of policies and extracts these partial policies as useful skills (Thrun and Schwartz, 1994; Pickett and Barto, 2002; Brunskill and Li, 2014). While similar in spirit to ours, these methods do not easily scale to continuous problems as in dialogue systems. More recently, researchers have proposed deep learning models to discover subgoals in continuous-state MDPs (Bacon et al., 2017; Machado et al., 2017; Vezhnevets et al., 2017). It would be interesting to see how effective they are for dialogue management.

Segmental structures are common in human languages. In the NLP community, some related research on segmentation includes word segmentation Gao et al. (2005); Zhang et al. (2016) to divide the words into meaningful units. Alternatively, topic detection and tracking Allan et al. (1998); Sun et al. (2007) segment a stream of data and identify stories or events in news or social text. In this work, we formulate subgoal discovery as a trajectory segmentation problem. Section 3.1 presents our approach to subgoal discovery which is inspired by a probabilistic sequence segmentation model Wang et al. (2017).

6 Discussion and Conclusion

We have proposed the Subgoal Discovery Network to learn subgoals automatically in an unsupervised fashion without human domain knowledge. Based on the discovered subgoals, we learn the dialogue policy for complex task-completion dialogue agents using HRL. Our experiments with both simulated and real users on a composite task of travel planning show that an agent trained with automatically discovered subgoals performs competitively against an agent with human-defined subgoals, and significantly outperforms an agent without subgoals. Through visualization, we find that SDN discovers reasonable, comprehensible subgoals given only a small number of suboptimal but successful dialogue state trajectories.

These promising results suggest several directions for future research. First, we want to integrate subgoal discovery into dialogue policy learning rather than treat them as two separate processes. Second, we would like to extend SDN to identify multi-level hierarchical structures among subgoals so that we can handle more complex tasks than those studied in this paper. Third, we would like to generalize SDN to a wide range of complex goal-oriented tasks beyond dialogue, such as the particularly challenging Atari game of Montezuma’s Revenge Kulkarni et al. (2016).


Acknowledgments

We would like to thank the anonymous reviewers, members of the xlab at the University of Washington, and Chris Brockett and Michel Galley for their insightful comments on the work. Most of this work was done while DT, CW & LL were with Microsoft.


Appendix A Hierarchical Dialogue Policy Learning

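This section provides more algorithmic details for Section 3.2. Again, assume a conversation of length $T$:

$$\tau = (s_0, a_0, r_0, \ldots, s_T).$$

Suppose an HRL agent segments the trajectory into a sequence of subgoals as $g_1, \ldots, g_K$, and the corresponding subgoal termination time steps as $t_1, \ldots, t_K$ (with $t_0 = 0$, so that subgoal $g_k$ is active for $t_{k-1} \le t < t_k$). Furthermore, denote the intrinsic reward at time step $t$ by $r^i_t$. The top-level and low-level Q-functions satisfy the following Bellman equations:

$$Q^*_g(s, g) = \mathbb{E}\Big[\sum_{t=t_{k-1}}^{t_k-1} \gamma^{\,t-t_{k-1}}\, r_t + \gamma^{\,t_k-t_{k-1}} \max_{g'} Q^*_g(s_{t_k}, g') \,\Big|\, s_{t_{k-1}} = s,\ g_k = g\Big],$$

$$Q^*_a(s, a, g) = \mathbb{E}\Big[r^i_t + \gamma \max_{a'} Q^*_a(s_{t+1}, a', g) \,\Big|\, s_t = s,\ a_t = a,\ g_k = g\Big].$$

Here $\gamma \in [0, 1]$ is a discount factor, and the expectations are taken over the randomness of the reward and the state transition.

We use deep neural networks to approximate the two Q-value functions as $Q^*_g(s, g) \approx Q_g(s, g; \theta_e)$ and $Q^*_a(s, a, g) \approx Q_a(s, a, g; \theta_i)$. The parameters $\theta_e$ and $\theta_i$ are optimized to minimize the following quadratic loss functions:

$$\mathcal{L}_e(\theta_e) = \frac{1}{2}\, \mathbb{E}_{\mathcal{D}_e}\Big[\big(y_e - Q_g(s, g; \theta_e)\big)^2\Big], \qquad y_e = \sum_{t=t_{k-1}}^{t_k-1} \gamma^{\,t-t_{k-1}}\, r_t + \gamma^{\,t_k-t_{k-1}} \max_{g'} Q_g(s_{t_k}, g'; \theta_e), \tag{3}$$

$$\mathcal{L}_i(\theta_i) = \frac{1}{2}\, \mathbb{E}_{\mathcal{D}_i}\Big[\big(y_i - Q_a(s, a, g; \theta_i)\big)^2\Big], \qquad y_i = r^i_t + \gamma \max_{a'} Q_a(s_{t+1}, a', g; \theta_i). \tag{4}$$

Here, $\mathcal{D}_e$ and $\mathcal{D}_i$ are the replay buffers storing dialogue experience for training the top-level and low-level policies, respectively.

Optimization of the parameters $\theta_e$ and $\theta_i$ can be done by stochastic gradient descent on the two loss functions in Equations (3) and (4). Treating the targets $y_e$ and $y_i$ as constants, the gradients of the two loss functions w.r.t. their parameters are

$$\nabla_{\theta_e} \mathcal{L}_e = -\,\mathbb{E}_{\mathcal{D}_e}\Big[\big(y_e - Q_g(s, g; \theta_e)\big)\, \nabla_{\theta_e} Q_g(s, g; \theta_e)\Big],$$

$$\nabla_{\theta_i} \mathcal{L}_i = -\,\mathbb{E}_{\mathcal{D}_i}\Big[\big(y_i - Q_a(s, a, g; \theta_i)\big)\, \nabla_{\theta_i} Q_a(s, a, g; \theta_i)\Big].$$

To avoid overfitting, we also add $L_2$-regularization to the objective functions above.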
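The two-level update described in this appendix can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Q-function is assumed to be linear in a feature vector rather than a deep network, and the top-level target is assumed to take the usual options-style form (discounted extrinsic return over the subgoal segment plus a bootstrapped value at the state where the subgoal ends).

```python
import numpy as np


def top_level_target(rewards, gamma, next_q):
    """Target for the top-level Q-function over one subgoal segment.

    rewards: extrinsic rewards collected while the subgoal was active.
    next_q:  top-level Q-values of all subgoals at the state where the
             subgoal terminated (assumed already computed).
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    # Discounted segment return plus discounted best next subgoal value.
    return float(np.dot(discounts, rewards) + gamma ** n * np.max(next_q))


def low_level_target(intrinsic_reward, gamma, next_q):
    """One-step target for the low-level Q-function under a fixed subgoal."""
    return float(intrinsic_reward + gamma * np.max(next_q))


def semi_gradient_step(theta, features, target, lr):
    """One SGD step on 0.5 * (target - theta @ features) ** 2.

    The target is held fixed (semi-gradient); a linear Q-function is an
    assumption made only to keep the sketch short.
    """
    td_error = target - float(theta @ features)
    return theta + lr * td_error * features


if __name__ == "__main__":
    y_e = top_level_target([-1.0, -1.0, 10.0], gamma=0.9, next_q=np.array([0.0, 2.0]))
    y_i = low_level_target(1.0, gamma=0.9, next_q=np.array([0.5, 3.0]))
    theta = semi_gradient_step(np.zeros(2), np.array([1.0, 0.0]), y_i, lr=0.1)
    print(y_e, y_i, theta)
```

In a full agent, the same step would be applied to minibatches sampled from the two replay buffers, with the regularization term added to each loss.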

Appendix B Sample Dialogue

User Goal
reserve-hotel subtask:
“request_slots”: {
  “hotel_price”: “?”
  “hotel_date_checkout”: “?”
  “hotel_name”: “?”
  “hotel_amenity_wifi”: “?”
}
“inform_slots”: {
  “hotel_date_checkin”: “2016-09-22”
  “hotel_city”: “Curitiba”
  “hotel_numberofpeople”: “4”
}
book-flight-ticket subtask:
“request_slots”: {
  “price”: “?”
  “return_time_dep”: “?”
  “return_date_dep”: “?”
  “depart_time_dep”: “?”
  “seat”: “?”
}
“inform_slots”: {
  “or_city”: “Lima”,
  “dst_city”: “Curitiba”,
  “numberofpeople”: “4”,
  “depart_date_dep”: “2016-09-22”
}
m-HRL agent
agt: Hi, may I know your departure city and destination city?
usr: I depart from Lima to Curitiba.
agt: What is your departure date?
usr: I am leaving on 2016-09-22.
agt: There is one flight available at 8:00:00AM.
usr: What date is available for return flight?
agt: There is one return flight on 2016-09-27.
usr: What is the time for the return flight?
agt: It is 00:00:00AM.
usr: What seat is available?
agt: Only Economy seat now.
usr: We need 4 tickets for the flight.
agt: Total is 1201 dollars for flight.
inform(subtask_complete): Subgoal reached!
usr: What hotel is available there?
agt: There is one hotel called Starlight.
usr: Is the wifi available for the hotel?
agt: Yes, it is free.
usr: What’s the price for the hotel?
agt: It is 1541 dollars for hotel.
usr: Could you please book the trip for me?
agt: Okay. Trip tickets are issued.
usr: Thanks!
agt: Thanks!
Table 3: Sample dialogue by the m-HRL agent interacting with real user: bolded slots are the joint constraints between two subtasks. (agt: Agent, usr: User)
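To make the notion of joint constraints concrete, the user goal in Table 3 can be written as a plain data structure and checked for cross-subtask consistency. This is only an illustrative sketch: the bold formatting that marks the joint slots is not preserved in this text, so the slot pairing in `JOINT_SLOTS` is an assumption based on which inform slots share values, and the helper name is hypothetical.

```python
# User goal from Table 3, transcribed as Python dicts.
user_goal = {
    "reserve-hotel": {
        "request_slots": {
            "hotel_price": "?",
            "hotel_date_checkout": "?",
            "hotel_name": "?",
            "hotel_amenity_wifi": "?",
        },
        "inform_slots": {
            "hotel_date_checkin": "2016-09-22",
            "hotel_city": "Curitiba",
            "hotel_numberofpeople": "4",
        },
    },
    "book-flight-ticket": {
        "request_slots": {
            "price": "?",
            "return_time_dep": "?",
            "return_date_dep": "?",
            "depart_time_dep": "?",
            "seat": "?",
        },
        "inform_slots": {
            "or_city": "Lima",
            "dst_city": "Curitiba",
            "numberofpeople": "4",
            "depart_date_dep": "2016-09-22",
        },
    },
}

# Assumed pairing of (hotel slot, flight slot) that must agree across subtasks.
JOINT_SLOTS = [
    ("hotel_city", "dst_city"),
    ("hotel_date_checkin", "depart_date_dep"),
    ("hotel_numberofpeople", "numberofpeople"),
]


def violated_joint_constraints(goal):
    """Return the slot pairs whose values disagree between the two subtasks."""
    hotel = goal["reserve-hotel"]["inform_slots"]
    flight = goal["book-flight-ticket"]["inform_slots"]
    return [(h, f) for h, f in JOINT_SLOTS if hotel[h] != flight[f]]


if __name__ == "__main__":
    print(violated_joint_constraints(user_goal))
```

A check of this kind shows why the two subtasks cannot be solved independently: a flight booked to one city with a hotel reserved in another would satisfy each subtask on its own yet fail the composite goal.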