Feudal Reinforcement Learning for Dialogue Management in Large Domains

by Iñigo Casanueva, et al.
University of Cambridge

Reinforcement learning (RL) is a promising approach to solve dialogue policy optimisation. Traditional RL algorithms, however, fail to scale to large domains due to the curse of dimensionality. We propose a novel Dialogue Management architecture, based on Feudal RL, which decomposes the decision into two steps; a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset. The structural information included in the domain ontology is used to abstract the dialogue state space, taking the decisions at each step using different parts of the abstracted state. This, combined with an information sharing mechanism between slots, increases the scalability to large domains. We show that an implementation of this approach, based on Deep-Q Networks, significantly outperforms previous state of the art in several dialogue domains and environments, without the need of any additional reward signal.



1 Introduction

Task-oriented Spoken Dialogue Systems (SDS), in the form of personal assistants, have recently gained much attention in both academia and industry. One of the most important modules of an SDS is the Dialogue Manager (DM) (or policy), the module in charge of deciding the next action in each dialogue turn. Reinforcement Learning (RL) has been studied for several years as a promising approach to model dialogue management (levin1998using; henderson2008hybrid; pietquin2011sample; young2013pomdp; casanueva2015knowledge; su2016continuously). However, as the dialogue state space increases, the number of possible trajectories that need to be explored grows exponentially, making traditional RL methods not scalable to large domains.

Hierarchical RL (HRL), in the form of temporal abstraction, has been proposed in order to mitigate this problem (cuayahuitl2010evaluation; cuayahuitl2016deep; budzianowski2017subdomain; peng2017composite). However, proposed HRL methods require that the task is defined in a hierarchical structure, which is usually handcrafted. In addition, they usually require additional rewards for each subtask. Space abstraction, instead, has been successfully applied to dialogue tasks such as Dialogue State Tracking (DST) (Henderson2014b), and policy transfer between domains (gavsic2013pomdp; gavsic2015policy; wang2015learning). For DST, a set of binary classifiers can be defined for each slot, with shared parameters, learning a general way to track slots. The policy transfer method presented in (wang2015learning), named Domain Independent Parametrisation (DIP), transforms the belief state into a slot-dependent fixed size representation using a handcrafted feature function. This idea could also be applied to large domains, since it can be used to learn a general way to act in any slot.

In slot-filling dialogues, an HRL method that relies on space abstraction, such as Feudal RL (FRL) (dayan1993feudal), should allow RL to scale to domains with a large number of slots. FRL divides a task spatially rather than temporally, decomposing the decisions in several steps and using different abstraction levels in each sub-decision. This framework is especially useful in RL tasks with large discrete action spaces, making it very attractive for large domain dialogue management.

In this paper, we introduce a Feudal Dialogue Policy which decomposes the decision in each turn into two steps. In a first step, the policy decides if it takes a slot independent or slot dependent action. Then, the state of each slot sub-policy is abstracted to account for features related to that slot, and a primitive action is chosen from the previously selected subset. Our model does not require any modification of the reward function and the hierarchical architecture is fully specified by the structured database representation of the system (i.e. the ontology), requiring no additional design.

2 Background

Dialogue management can be cast as a continuous MDP (young2013pomdp) composed of a continuous multivariate belief state space $\mathcal{B}$, a finite set of actions $\mathcal{A}$ and a reward function $R(b_t, a_t)$. At a given time $t$, the agent observes the belief state $b_t \in \mathcal{B}$, executes an action $a_t \in \mathcal{A}$ and receives a reward $r_t \in \mathbb{R}$ drawn from $R(b_t, a_t)$. The action taken, $a$, is decided by the policy, defined as the function $\pi(b) = a$. For any policy $\pi$ and $b \in \mathcal{B}$, the $Q$-value function can be defined as the expected (discounted) return $R$, starting from state $b$, taking action $a$, and then following policy $\pi$ until the end of the dialogue at time step $T$:

$$Q^{\pi}(b, a) = \mathbb{E}\{R \mid b_t = b, a_t = a\}$$

where $R = \sum_{\tau=t}^{T-1} \gamma^{(\tau - t)} r_\tau$ and $\gamma$ is a discount factor, with $0 \leq \gamma \leq 1$.

The objective of RL is to find an optimal policy $\pi^*$, i.e. a policy that maximises the expected return in each belief state. In value-based algorithms, the optimal policy can be found by greedily taking the action which maximises $Q^{\pi}(b, a)$.
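The discounted return and the greedy action choice described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the reward values and the Q-value table are all hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], i.e. the return seen from the first turn."""
    total = 0.0
    # Accumulate backwards so each earlier turn discounts everything after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(q_values):
    """Pick the action with the highest estimated Q-value."""
    return max(q_values, key=q_values.get)


# Hypothetical 3-turn dialogue: -1 per turn, with +20 added on the last turn.
rewards = [-1.0, -1.0, 19.0]
ret = discounted_return(rewards, gamma=0.5)   # -1 - 0.5 + 0.25 * 19 = 3.25

# Hypothetical Q-value estimates for one belief state.
best = greedy_action({"request(food)": 1.2, "inform()": 3.4, "hello()": -0.5})
```

With `gamma=1.0` the same function returns the undiscounted sum of the rewards.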

In slot-filling SDSs the belief state space $\mathcal{B}$ is defined by the ontology, a structured representation of a database of entities that the user can retrieve by talking to the system. Each entity has a set of properties, referred to as slots $\mathcal{S}$, where each of the slots $s \in \mathcal{S}$ can take a value from the set $\mathcal{V}_s$. The belief state $b$ is then defined as the concatenation of the probability distribution of each slot, plus a set of general features (e.g. the communication function used by the user, the database search method…) (Henderson2014a). The set $\mathcal{A}$ is defined as a set of summary actions, where the actions can be either slot dependent (e.g. request(food), confirm(area)…) or slot independent (e.g. hello(), inform()…). Summary actions that depend on all the slots, such as inform(), are included in the slot independent group.

The belief space $\mathcal{B}$ is defined by the ontology, therefore belief states of different domains will have different shapes. In order to transfer knowledge between domains, Domain Independent Parametrization (DIP) (wang2015learning) proposes to abstract the belief state $b$ into a fixed size representation. As each action is either slot independent or dependent on a slot $s$, a feature function $\phi_{dip}(b, s)$ can be defined, where $s \in \mathcal{S} \cup s_i$ and $s_i$ stands for slot independent actions. Therefore, in order to compute the policy, $Q(b, a)$ can be approximated as $Q(\phi_{dip}(b, s), a)$, where $s$ is the slot associated to action $a$.

wang2015learning presents a handcrafted feature function $\phi_{dip}(b, s)$. It includes the slot independent features of the belief state, a summarised representation of the joint belief state, and a summarised representation of the belief state of the slot $s$. Section 4 gives a more detailed description of the function used in this work.
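To make the idea of a fixed size, slot-dependent representation concrete, the sketch below maps a toy belief state to a vector whose length does not depend on how many values a slot has. It is a simplified illustration only: the belief layout, the choice of features (top 3 probabilities plus entropy) and the name `phi_dip` are assumptions, and the joint belief summary is omitted for brevity.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k(probs, k=3):
    """The k largest probabilities, zero-padded when the slot has fewer values."""
    return (sorted(probs, reverse=True) + [0.0] * k)[:k]


def phi_dip(belief, slot):
    """Fixed-size features for `slot`; slot=None stands for slot independent actions."""
    general = list(belief["general"])
    if slot is None:
        slot_part = [0.0] * 4          # no slot summary for slot independent actions
    else:
        probs = list(belief["slots"][slot].values())
        slot_part = top_k(probs) + [entropy(probs)]
    return general + slot_part


# Hypothetical belief state: two general features and two slots of different size.
belief = {
    "general": [1.0, 0.0],
    "slots": {
        "food": {"italian": 0.7, "chinese": 0.2, "**NONE**": 0.1},
        "area": {"north": 0.5, "**NONE**": 0.5},
    },
}
features = {s: phi_dip(belief, s) for s in ("food", "area", None)}
```

Because every slot yields a vector of the same length, a single Q-function can in principle be evaluated on any slot, which is what makes the representation transferable.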

3 Feudal dialogue management

Figure 1: Feudal dialogue architecture used in this work. The sub-policies surrounded by the dashed line have shared parameters. The simple lines show the data flow and the double lines the sub-policy decisions.

FRL decomposes the policy decision in each turn into several sub-decisions, using different abstracted parts of the belief state in each sub-decision. The objective of a task-oriented SDS is to fulfill the user's goal, but as the goal is not observable for the SDS, the SDS needs to gather enough information to correctly fulfill it. Therefore, in each turn, the DM can decompose its decision in two steps: first, decide between taking an action in order to gather information about the user goal (information gathering actions) or taking an action to fulfill the user goal or a part of it (information providing actions); and second, select a (primitive) action to execute from the previously selected subset. In a slot-filling dialogue, the set of information gathering actions can be defined as the set of slot dependent actions, while the set of information providing actions can be defined as the remaining actions.

The architecture of the feudal policy proposed by this work is represented schematically in Figure 1. The (primitive) actions are divided between two subsets: slot independent actions $\mathcal{A}_i$ (e.g. hello(), inform()) and slot dependent actions $\mathcal{A}_d$ (e.g. request(), confirm()). Note that the actions of $\mathcal{A}_d$ are composed just of the communication function of the slot dependent actions, thus reducing the number of actions compared to $\mathcal{A}$. In addition, a set of master actions $\mathcal{A}_m = (a^m_i, a^m_d)$ is defined, where $a^m_i$ corresponds to taking an action from $\mathcal{A}_i$ and $a^m_d$ to taking an action from $\mathcal{A}_d$. Then, a feature function $\phi_s(b) = b_s$ is defined for each slot $s \in \mathcal{S}$, as well as a slot independent feature function $\phi_i(b) = b_i$ and a master feature function $\phi_m(b) = b_m$. These feature functions can be handcrafted (e.g. the DIP feature function introduced in section 2) or any function approximator can be used (e.g. neural networks trained jointly with the policy).

Finally, a master policy $\pi_m(b_m) = a^m$, a slot independent policy $\pi_i(b_i) = a^i$ and a set of slot specific policies $\pi_s(b_s) = a^d$, one for each $s \in \mathcal{S}$, are defined, where $a^m \in \mathcal{A}_m$, $a^i \in \mathcal{A}_i$ and $a^d \in \mathcal{A}_d$. Contrary to other feudal policies, the slot specific sub-policies have shared parameters, in order to generalise between slots (following the idea used by Henderson2014b for DST). The differences between the slots (size, value distribution…) are accounted for by the feature function $\phi_s(b)$. Therefore $\pi_m(b_m)$ is defined as:

$$\pi_m(b_m) = \operatorname*{argmax}_{a^m \in \mathcal{A}_m} Q^m(b_m, a^m)$$

If $\pi_m(b_m) = a^m_i$, the sub-policy run is $\pi_i$:

$$\pi_i(b_i) = \operatorname*{argmax}_{a^i \in \mathcal{A}_i} Q^i(b_i, a^i)$$

Else, if $\pi_m(b_m) = a^m_d$, $\pi_d$ is selected. This policy runs each slot specific policy, $\pi_s$, for all $s \in \mathcal{S}$, choosing the action-slot pair that maximises the Q function over all the slot sub-policies.

$$\pi_d(b_s \; \forall s \in \mathcal{S}) = \operatorname*{argmax}_{a^d \in \mathcal{A}_d,\, s \in \mathcal{S}} Q^s(b_s, a^d)$$

Then, the summary action $a$ is constructed by joining $a^d$ and $s$ (e.g. if $a^d$=request() and $s$=food, then the summary action will be request(food)). A pseudo-code of the Feudal Dialogue Policy algorithm is given in Appendix A.
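The two-step decision described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the PyDial implementation: the Q-functions are toy stand-ins for trained networks, and the function and variable names (`feudal_select`, `q_master`, `q_slot`, and so on) are hypothetical.

```python
def argmax(values):
    """Index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda k: values[k])


def feudal_select(q_master, q_indep, q_slot, b_m, b_i, b_slots,
                  indep_actions, dep_actions):
    """Two-step decision: master choice first, then a primitive action."""
    # Step 1: the master policy picks a subset
    # (index 0 = slot independent, index 1 = slot dependent).
    if argmax(q_master(b_m)) == 0:
        # Step 2a: slot independent sub-policy.
        return indep_actions[argmax(q_indep(b_i))] + "()"
    # Step 2b: evaluate the SAME q_slot function on every slot's features and
    # keep the (action, slot) pair with the highest Q-value.
    best_q, best_act, best_slot = None, None, None
    for slot, b_s in b_slots.items():
        q_values = q_slot(b_s)
        k = argmax(q_values)
        if best_q is None or q_values[k] > best_q:
            best_q, best_act, best_slot = q_values[k], dep_actions[k], slot
    return f"{best_act}({best_slot})"


# Toy stand-ins for trained Q-networks (hypothetical numbers).
q_master_dep = lambda b: [0.1, 0.9]
q_master_ind = lambda b: [0.9, 0.1]
q_indep = lambda b: [0.3, 0.7]
q_slot = lambda b: [b[0], b[1]]          # one function shared by all slots
b_slots = {"food": [0.2, 0.5], "area": [0.8, 0.1]}

chosen_dep = feudal_select(q_master_dep, q_indep, q_slot, None, None, b_slots,
                           ["hello", "inform"], ["request", "confirm"])
chosen_ind = feudal_select(q_master_ind, q_indep, q_slot, None, None, b_slots,
                           ["hello", "inform"], ["request", "confirm"])
```

Sharing `q_slot` across slots means the number of network outputs stays fixed as slots are added; only the loop over slot features grows.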

4 Experimental setup

The models used in the experiments have been implemented using the PyDial toolkit (ultes2017pydial) (the implementation of the models can be obtained at www.pydial.org) and evaluated on the PyDial benchmarking environment (casanueva2017benchmarking). This environment presents a set of tasks which span domains of different sizes, different Semantic Error Rates (SER), and different configurations of action masks and user model parameters (Standard (Std.) or Unfriendly (Unf.)). Table 1 shows a summarised description of the tasks. The models developed in this paper are compared to the state-of-the-art RL algorithms and to the handcrafted policy presented in the benchmarks.

Domain                      Code   # constraint slots   # requests   # values
Cambridge Restaurants       CR     3                    9            268
San Francisco Restaurants   SFR    6                    11           636
Laptops                     LAP    11                   21           257

         Env. 1   Env. 2   Env. 3   Env. 4   Env. 5   Env. 6
SER      0%       0%       15%      15%      15%      30%
Masks    on       off      on       off      on       on
User     Std.     Std.     Std.     Std.     Unf.     Std.

Table 1: Summarised description of the domains and environments used in the experiments. Refer to (casanueva2017benchmarking) for a detailed description.

4.1 DIP-DQN baseline

DIP based on Deep-Q Networks (DQN) (mnih2013playing) is implemented as an additional baseline (papangelis2017single). This policy, named DIP-DQN, uses the same hyperparameters as the DQN implementation released in the PyDial benchmarks. A DIP feature function based on the description in (wang2015learning) is used, $\phi_{dip}(b, s) = \psi_0(b) \oplus \psi_j(b) \oplus \psi_d(b, s)$ for all $s \in \mathcal{S} \cup s_i$, where:

  • $\psi_0(b)$ accounts for general features of the belief state, such as the database search method.

  • $\psi_j(b)$ accounts for features of the joint belief state, such as the entropy of the joint belief.

  • $\psi_d(b, s)$ accounts for features of the marginal distribution of slot $s$, such as the entropy of $s$.

Appendix B shows a detailed description of the DIP features used in this work.

4.2 Feudal DQN policy

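A Feudal policy based on the architecture described in sec. 3 is implemented, named FDQN. Each sub-policy is constructed by a DQN policy (su2017sample). These policies have the same hyperparameters as the baseline DQN implementation, except for the two hidden layer sizes, which are reduced to 130 and 50 respectively. As feature functions, subsets of the DIP features are used:

$$\phi_m(b) = \phi_i(b) = \psi_0(b) \oplus \psi_j(b)$$

$$\phi_s(b) = \psi_0(b) \oplus \psi_j(b) \oplus \psi_d(b, s) \quad \forall s \in \mathcal{S}$$

where $\psi_0$, $\psi_j$ and $\psi_d$ are the general, joint and slot dependent subsets of the DIP features described in section 4.1.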

The original set of summary actions of the benchmarking environment, $\mathcal{A}$, has a size of $5 + 3|\mathcal{S}|$, where $|\mathcal{S}|$ is the number of slots. This set is divided in two subsets: $\mathcal{A}_i$ of size 6 and $\mathcal{A}_d$ of size 4. An additional pass() action is added to each subset, which is taken whenever the other sub-policy is executed; this simplifies the training algorithm. Each sub-policy (including $\pi_m$) is trained with the same sparse reward signal used in the baselines, getting a reward of $20$ if the dialogue is successful or $0$ otherwise, minus the dialogue length.
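The effect of the decomposition on the action space, and the shape of the sparse reward, can be illustrated with simple arithmetic. The sketch assumes 5 slot independent and 3 slot dependent summary actions in the flat set (consistent with sub-policy sets of size 6 and 4 once pass() is added) and a success reward of 20, as used in the PyDial benchmarks; the function names are hypothetical.

```python
def flat_action_count(num_slots, independent=5, per_slot=3):
    """Size of the flat summary action set: it grows linearly with the slots."""
    return independent + per_slot * num_slots


def dialogue_reward(success, num_turns, success_reward=20):
    """Sparse reward: a bonus on success, minus one per dialogue turn."""
    return (success_reward if success else 0) - num_turns


# Flat action set for a domain with 11 slots (the LAP row of Table 1 lists
# 11 constraint slots) versus the fixed output sizes of the feudal sub-policies.
flat_lap = flat_action_count(11)
feudal_heads = {"master": 2, "slot_independent": 6, "slot_dependent": 4}

reward_success = dialogue_reward(True, num_turns=6)
reward_failure = dialogue_reward(False, num_turns=25)
```

Under these assumptions the flat policy needs 38 outputs for 11 slots, while each feudal sub-policy keeps a fixed number of outputs regardless of the number of slots.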

5 Results

The results in the 18 tasks of the benchmarking environment after 4000 training dialogues are presented in Table 2. The same evaluation procedure of the benchmarks is used, presenting the mean over 10 different random seeds and testing every seed for 500 dialogues.

                Feudal-DQN        DIP-DQN           Bnch.   Hdc.
Task            Suc.     Rew.     Suc.     Rew.     Rew.    Rew.

Env. 1   CR     89.3%    11.7     48.8%    -2.8     13.5    14.0
         SFR    71.1%    7.1      25.8%    -7.4     11.7    12.4
         LAP    65.5%    5.7      26.6%    -8.8     10.5    11.7

Env. 2   CR     97.8%    13.1     85.5%    9.6      12.2    14.0
         SFR    95.4%    12.4     85.7%    8.4      9.6     12.4
         LAP    94.1%    12.0     89.5%    9.7      7.3     11.7

Env. 3   CR     92.6%    11.7     86.1%    8.9      11.9    11.0
         SFR    90.0%    9.7      59.3%    0.2      8.6     9.0
         LAP    89.6%    9.4      71.5%    3.1      6.7     8.7

Env. 4   CR     91.4%    11.2     82.6%    8.7      10.7    11.0
         SFR    90.3%    10.2     86.1%    9.2      7.7     9.0
         LAP    88.7%    9.8      74.8%    6.0      5.5     8.7

Env. 5   CR     96.3%    11.5     74.4%    2.9      10.5    9.3
         SFR    88.9%    7.9      75.5%    3.2      4.5     6.0
         LAP    78.8%    5.2      64.4%    -0.4     4.1     5.3

Env. 6   CR     90.6%    10.4     83.4%    8.1      10.0    9.7
         SFR    83.0%    7.1      71.9%    3.9      3.9     6.4
         LAP    78.5%    6.0      66.5%    2.7      3.6     5.5

Table 2: Success rate and reward for Feudal-DQN and DIP-DQN in the 18 benchmarking tasks, compared with the reward of the best performing algorithm in each task (Bnch.) and the handcrafted policy (Hdc.) presented in (casanueva2017benchmarking).

The FDQN policy substantially outperforms every other policy in all the environments except Env. 1. The performance increase is more considerable in the two largest domains (SFR and LAP), with gains up to 5 points in accumulated reward in the most challenging environments (e.g. Env. 4 LAP), compared to the best benchmarked RL policies (Bnch.). In addition, FDQN consistently outperforms the handcrafted policy (Hdc.) in environments 2 to 6, which traditional RL methods could not achieve. In Env. 1, however, the results for FDQN and DIP-DQN are rather low, especially for DIP-DQN. Surprisingly, the results in Env. 2, which only differs from Env. 1 in the absence of action masks (and is thus, in principle, a more complex environment), outperform every other algorithm. Analysing the dialogues individually, we observed that, in Env. 1, both policies are prone to “overfit” to an action (the model overestimates the value of an incorrect action, continuously repeating it until the user runs out of patience). The performance of FDQN and DIP-DQN in Env. 4 is also better than in Env. 3, and the difference between these environments also lies in the masks. This suggests that a specific action mask design can be helpful for some algorithms, but can harm the performance of others. This is especially severe in the DIP-DQN case, which shows good performance in some challenging environments but is more unstable and more prone to overfit than FDQN.
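As a quick arithmetic check of the reward gains discussed above, the sketch below subtracts the best benchmarked reward (Bnch.) from the Feudal-DQN reward for the LAP rows of Table 2. The numbers are copied from the table; the variable names are illustrative.

```python
# (Feudal-DQN reward, best benchmarked reward) for the LAP domain, per environment.
lap_rewards = {
    1: (5.7, 10.5),
    2: (12.0, 7.3),
    3: (9.4, 6.7),
    4: (9.8, 5.5),
    5: (5.2, 4.1),
    6: (6.0, 3.6),
}

# Gain of Feudal-DQN over the best benchmarked policy, rounded to one decimal.
gains = {env: round(fdqn - bnch, 1) for env, (fdqn, bnch) in lap_rewards.items()}
best_env = max(gains, key=gains.get)
```

On these rows the largest gain is about 4.7 points (Env. 2), followed by about 4.3 points in Env. 4, while in Env. 1 Feudal-DQN trails the benchmark by about 4.8 points.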

Figure 2: Learning curves for Feudal-DQN and DIP-DQN in Env. 4, compared to the two best performing algorithms in (casanueva2017benchmarking) (DQN and GP-Sarsa). The shaded area depicts the mean ± the standard deviation over ten random seeds.

However, the main purpose of action masks is to reduce the number of dialogues needed to train a policy. Observing the learning curves shown in Figure 2, the FDQN model can learn a near-optimal policy in large domains in about 1500 dialogues, even if no additional reward is used, making the action masks unnecessary.

6 Conclusions and future work

We have presented a novel dialogue management architecture, based on Feudal RL, which substantially outperforms the previous state of the art in several dialogue environments. By defining a set of slot dependent policies with shared parameters, the model is able to learn a general way to act in slots, increasing its scalability to large domains.

Unlike other HRL methods applied to dialogue, no additional reward signals are needed and the hierarchical structure can be derived from a flat ontology, substantially reducing the design effort.

A promising approach would be to substitute the handcrafted feature functions used in this work by neural feature extractors trained jointly with the policy. This would avoid the need to design the feature functions and could be potentially extended to other modules of the SDS, making text-to-action learning tractable. In addition, a single model can be potentially used in different domains (papangelis2017single), and different feudal architectures could make larger action spaces tractable (e.g. adding a third sub-policy to deal with actions dependent on 2 slots).


This research was funded by the EPSRC grant EP/M018946/1 Open Domain Statistical Spoken Dialogue Systems


Appendix A Feudal Dialogue Policy algorithm

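for each dialogue turn do
     observe $b$
     $b_m = \phi_m(b)$
     $a^m = \operatorname{argmax}_{a^m \in \mathcal{A}_m} Q^m(b_m, a^m)$
     if $a^m = a^m_i$ then    (drop to $\pi_i$)
          $b_i = \phi_i(b)$
          $a = \operatorname{argmax}_{a^i \in \mathcal{A}_i} Q^i(b_i, a^i)$
     else    ($a^m = a^m_d$, drop to $\pi_d$)
          $b_s = \phi_s(b) \; \forall s \in \mathcal{S}$
          $slot, act = \operatorname{argmax}_{s \in \mathcal{S},\, a^d \in \mathcal{A}_d} Q^s(b_s, a^d)$
          $a = join(slot, act)$
     end if
     execute $a$
end for
Algorithm 1 Feudal Dialogue Policy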

Appendix B DIP features

This section gives a detailed description of the DIP feature functions used in this work. The differences with the features used in (wang2015learning) and (papangelis2017single) are the following:

  • No priority or importance features are used.

  • No Potential contribution to DB search features are used.

  • The joint belief features are extended to account for large-domain aspects.

Feature function   Feature description                                  Feature size
$\psi_0(b)$        last user dialogue act (bin) *                       7
                   DB search method (bin) *                             6
                   # of requested slots (bin)                           5
                   offer happened *                                     1
                   last action was Inform no venue *                    1
$\psi_j(b)$        normalised # of slots (1/# of slots)                 1
                   normalised avg. slot length (1/avg. # of values)     1
                   prob. of the top 3 values of $b_j$                   3
                   prob. of *NONE* value of $b_j$                       1
                   entropy of $b_j$                                     1
                   diff. between top and 2nd value probs. (bin)         5
                   # of slots with top value not *NONE* (bin)           5
$\psi_d(b, s)$     prob. of the top 3 values of $s$                     3
                   prob. of *NONE* value of $s$                         1
                   diff. between top and 2nd value probs. (bin)         5
                   entropy of $s$                                       1
                   # of values of $s$ with prob. 0 (bin)                5
                   normalised slot length (1/# of values)               1
                   slot length (bin)                                    10
                   entropy of the distr. of values of $s$ in the DB     1
total                                                                   64

Table 3: List of features composing the DIP features. The tag (bin) denotes that a binary encoding is used for this feature. Some of the joint features are extracted from the joint belief $b_j$, computed as the Cartesian product of the beliefs of the individual slots. * denotes that these features exist in the original belief state $b$.
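The table does not spell out how the (bin) features are encoded. One plausible reading, shown here purely as an assumption, is a one-hot encoding over a fixed number of buckets. The bucket edges and the helper name `one_hot_bin` below are hypothetical. The last lines simply re-add the feature sizes listed in Table 3 to check the stated total.

```python
def one_hot_bin(value, edges):
    """One-hot vector over len(edges) + 1 buckets; edges must be ascending."""
    # The bucket index is the number of edges the value has reached or passed.
    index = sum(value >= edge for edge in edges)
    return [1.0 if k == index else 0.0 for k in range(len(edges) + 1)]


# Hypothetical bucket edges for the gap between the top two value probabilities.
GAP_EDGES = (0.2, 0.4, 0.6, 0.8)          # 4 edges -> 5 buckets

slot_probs = [0.7, 0.2, 0.1]
ranked = sorted(slot_probs, reverse=True)
gap_feature = one_hot_bin(ranked[0] - ranked[1], GAP_EDGES)

# Feature sizes from Table 3, in row order; their sum should match the total row.
sizes = [7, 6, 5, 1, 1, 1, 1, 3, 1, 1, 5, 5, 3, 1, 5, 1, 5, 1, 10, 1]
total_size = sum(sizes)
```

A bucketed encoding of this kind keeps the feature size constant (5 entries here) no matter how many values the slot has, which matches the fixed sizes reported in the table.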