A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management

by   Iñigo Casanueva, et al.
University of Cambridge

Dialogue assistants are rapidly becoming an indispensable daily aid. To avoid the significant effort needed to hand-craft the required dialogue flow, the Dialogue Management (DM) module can be cast as a continuous Markov Decision Process (MDP) and trained through Reinforcement Learning (RL). Several RL models have been investigated over recent years. However, the lack of a common benchmarking framework makes it difficult to perform a fair comparison between different models and their capability to generalise to different environments. Therefore, this paper proposes a set of challenging simulated environments for dialogue model development and evaluation. To provide some baselines, we investigate a number of representative parametric algorithms, namely the deep reinforcement learning algorithms DQN, A2C and Natural Actor-Critic, and compare them to a non-parametric model, GP-SARSA. Both the environments and policy models are implemented using the publicly available PyDial toolkit and released on-line, in order to establish a testbed framework for further experiments and to facilitate experimental reproducibility.


1 Introduction

In recent years, due to the large improvements achieved in Automatic Speech Recognition (ASR), Natural Language Understanding (NLU) and machine learning techniques, dialogue systems have gained much attention in both academia and industry. Two directions have been intensively researched: open-domain chat-based systems [vinyals2015neural, serban2016building] and task-oriented dialogue systems [POMDP_williams, young2013pomdp]. The former cover non-goal-driven dialogues about general topics. The latter aim to assist users in achieving specific goals via natural language, which makes them a very attractive interface for small electronic devices. Under a speech-driven scenario, Spoken Dialogue Systems (SDSs) are typically based on a modular architecture (Fig. 1), consisting of input processing modules (ASR and NLU modules), Dialogue Management (DM) modules (belief state tracking and policy) and output processing modules (Natural Language Generation (NLG) and speech synthesis). The domain of an SDS is defined by the ontology, a structured representation of the database of the system defining the requestable slots, informable slots and database entries (i.e. the type of entities users can interact with and their properties). Part of the dialogue flow in such an architecture is explained schematically in Figure 2 in Appendix A.

The DM module is the core component of a modular SDS, controlling the conversational flow of the dialogue. Traditional approaches have mostly been based on handcrafted decision trees covering all possible dialogue outcomes. However, this approach does not scale to larger domains and is not resilient to noisy inputs resulting from ASR or NLU errors. Therefore, data-driven methods have been proposed to learn the policy automatically, either from a corpus of dialogues or from direct interaction with human users [wen2016network, gasic2014gaussian].

Supervised learning can be used to learn a dialogue policy, training the policy model to "mimic" the responses observed in the training corpora [wen2016network]. This approach, however, has several shortcomings. In a spoken dialogue scenario, the training corpora cannot be guaranteed to represent optimal behaviour. The effect of selecting an action on the future course of the dialogue is not considered, which may result in sub-optimal behaviour. In addition, due to the large size of the dialogue state space, the training dataset may lack sufficient coverage.

To tackle the issues mentioned above, this task is frequently formulated as a planning (control) problem [young2013pomdp], solved using Reinforcement Learning (RL) [sutton1999between]. In this framework, the system learns by a trial-and-error process governed by a potentially delayed reward signal. Therefore, the DM module learns to plan actions in order to maximise the final outcome. Recent advances such as Gaussian Process (GP) based RL [gasic2014gaussian, casanueva2015knowledge] and deep RL methods [mnih2013playing, silver2016mastering] have led to significant progress in data-driven dialogue modelling, showing that general algorithms such as policy gradients and Q-learning can achieve good performance in challenging dialogue scenarios [fatemi2016policy, su2017sample].

However, in contrast to other RL domains, the lack of a common testbed for spoken dialogue has made it difficult to compare different algorithms. Recent RL advancements have been largely influenced by the release of benchmarking environments [bellemare2013arcade, duan2016benchmarking] which allow a fair comparison to be made of different RL algorithms operating under similar conditions.

In the same spirit, based on the recently released PyDial multi-domain SDS tool-kit [ultes2017pydial], this paper aims to provide a set of testbed environments for developing and evaluating dialogue models. To account for the large variability of different scenarios, these environments span different size domains, different user behaviours and different input channel performance. To provide some baselines, the evaluation of a set of the most representative reinforcement learning algorithms for DM is presented. The benchmark and environment implementations are available on-line (http://www.camdial.org/pydial/benchmarks/), allowing for the development, implementation, and evaluation of new algorithms and tasks.

Figure 1: Spoken dialogue system environment used in this benchmark.

2 Motivation and related work

During the last decade, several reinforcement learning algorithms have been applied to the task of dialogue policy optimization [levin1998using, henderson2005hybrid, POMDP_williams, pietquin2011sample, jurcicek2010natural, gasic2014gaussian, su2017sample]. However, the evaluations of these algorithms are hard to compare, mostly because of the lack of a common benchmark environment. In addition, they are usually evaluated in only a few environments, making it hard to assess their potential to generalise to different environments.

In other fields, such as video game playing [bellemare2013arcade, vinyals2017starcraft] and continuous control [duan2016benchmarking], the release of common benchmarking environments has been a great stimulus to research in that area, leading to achievements such as human level game playing [mnih2015human] or beating the world champion in the game of Go [silver2016mastering].

Historically, there has not been a common testbed for the dialogue policy optimisation task. There are several reasons for this. First of all, unlike in supervised learning tasks, a corpus of dialogues can be used to train an RL algorithm only in a bootstrapping phase. Moreover, a corpus cannot be used to evaluate the final outcome of a dialogue (although it can be used to evaluate the per-turn responses [fb_n2n]), because the learning of an RL agent involves sequential observation and feedback generated from its operating world. This feedback is conditioned on the actions of the agent itself, so two different policies will generate two different sequences of observations. Training and testing policies by direct interaction with real users has been proposed [milica_real_users]. However, system complexity, time and high cost make this approach infeasible for a large part of the research community. In addition, it can be very hard to control for extraneous factors that can modify the behaviour of users, such as mood or tiredness, making a fair assessment very difficult.

To cope with these problems, simulated users [userSim, pietquin2006probabilistic, asri2016sequence, keizer2010simuser] (and simulated input processing channels [pietquin2002asr, thomson2012n]) have been proposed. These models approximate the behaviour of real users along with input channel noise introduced by ASR or NLU errors. However, the development of the processing modules needed to create a simulated dialogue environment requires a lot of effort. Even though some simulated environments are publicly available [williams2010demonstration, li2016user, lison2016opendial], they cover very small dialogue domains and the lack of consistency across them prohibits wide-scale testing.

The need for a common testbed for the dialogue task is a known issue in the dialogue community, with initiatives such as the Dialogue State Tracking Challenges (DSTC) being the most prominent [williams2013dialog]. These challenges were possible thanks to a clear evaluation metric. Recently, the bAbI dialogue tasks [fb_n2n, li2016dialogue] and DSTC6 (renamed the Dialogue Systems Technology Challenge) have aimed to create a testbed for end-to-end text-based dialogue management. However, these tasks focus either on end-to-end supervised learning or on RL-based question answering tasks, where the reward signal is delayed by only a few steps in time.

3 Dialogue management through reinforcement learning

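Dialogue management can be cast as a continuous MDP [young2013pomdp] composed of a continuous multivariate belief state space $\mathcal{B}$, a finite set of actions $\mathcal{A}$ and a reward function $\mathcal{R}(b_t, a_t)$. The belief state $b$ is a probability distribution over all possible (discrete) states. At a given time $t$, the agent (policy) observes the belief state $b_t \in \mathcal{B}$ and executes an action $a_t \in \mathcal{A}$. The agent then receives a reward $r_t \in \mathbb{R}$ drawn from $\mathcal{R}(b_t, a_t)$.

The policy $\pi$ is defined as a function $\pi : \mathcal{B} \times \mathcal{A} \rightarrow [0, 1]$ that with probability $\pi(b, a)$ takes an action $a$ in a state $b$. For any policy $\pi$ and $b \in \mathcal{B}$, the value function $V^{\pi}$ corresponding to $\pi$ is defined as:

$$V^{\pi}(b) = \mathbb{E}\left\{ r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots \mid b_t = b, \pi \right\} \tag{1}$$

where $\gamma$, $0 \le \gamma \le 1$, is a discount factor and $r_t$ is the one-step reward.

The objective of reinforcement learning is to find an optimal policy $\pi^{*}$, i.e. a policy that maximizes the value function in each belief state. Equivalently, we can estimate the unique optimal value function $V^{*}$, which corresponds to an optimal policy. In both cases, the goal is to find an optimal policy that maximises the discounted total return

$$R = \sum_{t=0}^{T-1} \gamma^{t} r_t(b_t, a_t) \tag{2}$$

over a dialogue with $T$ turns, where $r_t(b_t, a_t)$ is the reward when taking action $a_t$ in dialogue state $b_t$ at turn $t$ and $\gamma$ is the discount factor.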
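As a minimal sketch of the discounted total return described above, the following Python function sums the per-turn rewards of one dialogue, each weighted by a power of the discount factor. The function name and the way rewards are passed in are illustrative assumptions and are not taken from the PyDial implementation.

```python
def discounted_return(rewards, gamma):
    """Discounted total return of one dialogue.

    rewards: per-turn rewards r_0, ..., r_{T-1} (hypothetical input format).
    gamma:   discount factor in [0, 1].
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for t, r in enumerate(rewards):
        # The reward at turn t is weighted by gamma**t.
        total += (gamma ** t) * r
    return total
```

With `gamma = 1` the return reduces to the plain sum of rewards, so discounting only matters when later turns should count for less than earlier ones.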

3.1 Task oriented dialogue management RL environment

In the RL framework, the environment encompasses every part of the system which is outside the control of the agent. In a modular SDS, the RL environment is every part of the system except the policy itself. In most classical approaches, the policy module acts as the agent and the rest of the modules constitute the environment (Figure 1). However, various ways have been proposed to train the policy jointly with other modules using RL. For example, the state tracker and the policy can be trained jointly [zhao2016towards, williams2017hybrid]. Other approaches train the policy and the NLG module jointly [wen2017latent], learn to query the database and the policy together [yang2017end] or train all the models of a (text based) system jointly [fb_n2n]. In this paper, we focus on the classical approach where only the policy is optimised through reinforcement learning.

There are other design features that have an impact on the environment. For example, to reduce the action space of the MDP, the full set of actions can be clustered into summary actions [young2013pomdp, thomson2013statistical, williams2008masks]. In addition, it is often desirable for SDSs to constrain the set of actions the system can take at each turn (e.g. to avoid attempting to book a hotel before the dates have been specified). This is usually done by defining a set of action masks, i.e. heuristics which reduce the number of actions the MDP can take in each dialogue state [williams2008masks, thomson2013statistical, gavsic2009masks]. The use of action masks also speeds up learning. However, these heuristics must be carefully defined by the system designer, since a poor design of summary actions or masks can lead to suboptimal policies.
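To illustrate how an action mask restricts the choice of the policy, the sketch below picks the highest-valued action among those the mask allows. It assumes the policy exposes one value estimate per action and that the mask is a boolean vector of the same length; both conventions, and the function name, are assumptions made for this example rather than the PyDial interface.

```python
import numpy as np

def masked_greedy_action(q_values, mask):
    """Return the index of the best action among those allowed by the mask.

    q_values: one value estimate per summary action (hypothetical layout).
    mask:     booleans, True where the action is allowed in the current state.
    """
    q = np.asarray(q_values, dtype=float)
    allowed = np.asarray(mask, dtype=bool)
    if q.shape != allowed.shape:
        raise ValueError("q_values and mask must have the same shape")
    if not allowed.any():
        raise ValueError("the mask rules out every action")
    # Masked actions get a value of -inf so they can never be selected.
    masked_q = np.where(allowed, q, -np.inf)
    return int(np.argmax(masked_q))
```

Because masked actions are never selected, the agent never gathers experience for them, which is one way to see why a poorly designed mask can lock the policy out of good behaviour.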

In addition, the domain (specified by the ontology) determines the state space size (input) and action set size (output) of the MDP, as well as influencing several other modules. See appendix A for a schematic example of the summary action mapping, action mask definition and slot based ontology.

In summary, the dialogue environment has several sources of variability - domain, user behaviour, input channel (i.e. the semantic error rate, N-bests and confidence score distributions, state tracker behaviour), output channel, action masks, summary actions or database access mechanism. A robust dialogue policy should be able to generalise to all of these conditions.

3.2 Benchmarked algorithms

In this section, the algorithms used for benchmarking are described. Detailed explanations of the methods and how they are adapted to dialogue management can be found in [gasic2014gaussian, su2017sample]. In general, the algorithms can be divided into two classes: value-based and policy gradient methods.

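Value-based methods.  Value-based methods usually try to estimate a $Q$-value function approximation given a belief state $b$ and an action $a$ with the form:

$$Q^{\pi}(b, a) = \mathbb{E}_{\pi}\left\{ \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k} \mid b_t = b, a_t = a \right\} \tag{3}$$

where $r_t$ is the one-step reward at time $t$. The policy can then be defined greedily as the action that maximizes $Q^{\pi}(b, a)$.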

Policy gradient methods.  Value-based models often suffer from divergence problems when using function approximation. This happens because they optimize in value space while following a greedy policy. Therefore a slight change in the value function estimate can lead to a large change in the policy space [sutton1999policy]. Instead, we can directly parametrize a policy $\pi_{\omega}$ and then adjust the parameters $\omega$ to maximize the expected reward (2):

$$J(\omega) = \mathbb{E}\left[ R \mid \pi_{\omega} \right] \tag{4}$$

where $R$ is the discounted total return (2) and the expectation is taken with respect to all possible dialogue trajectories that start in some initial belief state $b_0$.
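The idea of adjusting policy parameters along the gradient of the expected return can be sketched with the plain score-function (REINFORCE) estimator. This is not one of the benchmarked algorithms (A2C and eNAC add a critic and, for eNAC, a natural gradient). The linear-softmax policy, the trajectory format and all names below are assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(omega, trajectory, gamma):
    """Single-dialogue Monte Carlo estimate of the policy gradient.

    omega:      (n_actions, n_features) weights of a linear-softmax policy.
    trajectory: list of (belief_vector, action_index, reward) per turn.
    gamma:      discount factor.
    """
    # Discounted return of the whole dialogue.
    ret = sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))
    grad = np.zeros_like(omega, dtype=float)
    for belief, action, _ in trajectory:
        probs = softmax(omega @ belief)
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        # Gradient of log pi(action | belief) for a linear-softmax policy.
        grad += np.outer(one_hot - probs, belief) * ret
    return grad
```

A gradient-ascent step would then be `omega += step_size * reinforce_gradient(...)`, averaged over several dialogues in practice because single-dialogue estimates have high variance.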

To provide some baselines, we investigate a set of representative parametric algorithms, namely the deep reinforcement learning algorithms DQN [mnih2013playing], A2C [fatemi2016policy] and episodic Natural Actor-Critic (eNAC) [su2017sample], and compare them to a non-parametric algorithm, GP-SARSA [gasic2014gaussian]. Table 1 presents the main characteristics of the four algorithms.

| | GP-SARSA | DQN | A2C | eNAC |
|---|---|---|---|---|
| Model type | non-parametric | parametric | parametric | parametric |
| | value-based | value-based | policy-based | policy-based |
| Value function | | | | |
| Policy function | | | | |
| Experience replay | | | | |
| Trained by backpropagation | | | | |
| Computational complexity | cubic* | linear | linear | linear |

\* In the size of a set of representative points [gasic2014gaussian].

Table 1: General overview of the baseline algorithms analyzed in this benchmark.

3.3 PyDial

PyDial [ultes2017pydial] is an open-source statistical spoken dialogue system toolkit which provides domain-independent implementations of all the dialogue system modules shown in Figure 1, as well as simulated users and simulated error models. Therefore, this toolkit has the potential to create a set of benchmark environments to compare different RL algorithms in the same conditions. The main focus of PyDial is task-oriented dialogue where the user has to find a matching entity based on a number of constraints. For example the system needs to provide a user with a description of a laptop in a store that meets specific user requirements. In this work, PyDial is used to define different environments, and the configuration files which specify these environments are provided with the paper.

4 Benchmarking tasks

RL-based DM research is typically evaluated on only a single or a very small set of environments. Such tests do not reveal much about the capability of the algorithms to generalise to different settings, and may be prone to overfitting to specific cases. To test the capability of algorithms in different environments, a set of tasks has been defined that spans a wide range of environments across a number of dimensions:

Domain. The first dimension of variability between environments is the application domain. Three ontologies with databases of differing sizes are defined, representing information seeking tasks for restaurants in Cambridge and San Francisco and a generic shopping task for laptops [Mrksic15]. These are slot-based ontologies [Henderson2014b], where the dialogue state is factorised into slots (see appendix A for a factorised state space example). Table 2 provides a summary of the characteristics of each domain.

| Domain | Code | # constraint slots | # requests | # values |
|---|---|---|---|---|
| Cambridge Restaurants | CR | 3 | 9 | 268 |
| San Francisco Restaurants | SFR | 6 | 11 | 636 |
| Laptops | LAP | 11 | 21 | 257 |

Table 2: Description of the domains. The third column represents the number of database search constraints that the user can define, the fourth the number of information slots the user can request from a given database entry and the fifth the sum of the number of values of each requestable slot.

Input error. The second dimension of variability comes from the ASR and NLU channel simulation modelling. In PyDial, this is modelled at a semantic level whereby the true user act is corrupted by noise to generate an N-best-list with associated confidence scores [thomson2012n]. The Semantic Error Rate (SER) is set to three different values, simulating different noise levels in the speech understanding input channel.

User model. The third dimension of variability comes from the user model. Even if the parameters of the model are sampled at the beginning of each dialogue, the distribution from where these parameters are sampled can be different. In addition to the Standard parameter sampling distribution, we define an Unfriendly distribution, where the users barely provide any extra information to the system.

Masking mechanism. Finally, in order to test the learning capability of the algorithms, the action masking mechanism provided in PyDial is disabled in two of the tasks.
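The semantic-level input corruption described under "Input error" can be pictured with the toy simulator below: with probability equal to the SER, a wrong user act is placed at the top of the N-best list. This is only a sketch under strong assumptions. The act strings, the geometric confidence scores and the function name are hypothetical, and the error model used in the benchmark is learned from data rather than hand-specified like this.

```python
import random

def corrupt_user_act(true_act, confusable_acts, ser, n_best=3, rng=None):
    """Return a toy N-best list of (act, confidence) pairs.

    true_act:        the act the simulated user actually produced.
    confusable_acts: hypothetical pool of acts it may be confused with.
    ser:             semantic error rate, probability that the top
                     hypothesis is wrong.
    """
    rng = rng or random.Random()
    distractors = [a for a in confusable_acts if a != true_act]
    rng.shuffle(distractors)
    if distractors and rng.random() < ser:
        # Error case: a wrong act is ranked first, the true act second.
        hypotheses = [distractors[0], true_act] + distractors[1:]
    else:
        hypotheses = [true_act] + distractors
    hypotheses = hypotheses[:n_best]
    # Hypothetical confidence scores: geometric decay, normalised to 1.
    raw = [0.5 ** i for i in range(len(hypotheses))]
    total = sum(raw)
    return [(h, s / total) for h, s in zip(hypotheses, raw)]
```

Setting `ser` to `0.0`, `0.15` or `0.3` would mimic the three noise levels at the level of the top hypothesis, although a realistic model would also vary the confidence score distribution.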

In total, six user model/error model/action mask environments are defined, representing environments with 0%, 15% and 30% SER, with the masks deactivated in two of them. Moreover, the parameters of the user behaviour model [schatzmann2007agenda] are sampled at the beginning of each dialogue, simulating the situation in which every interaction is conducted with a unique user. Two parameter sampling distributions are defined, standard and unfriendly. Thus, as summarised in Table 3, a total of 18 different tasks are defined for evaluating each algorithm.

| | Env. 1 | Env. 2 | Env. 3 | Env. 4 | Env. 5 | Env. 6 |
|---|---|---|---|---|---|---|
| Task | T1.1, T1.2, T1.3 | T2.1, T2.2, T2.3 | T3.1, T3.2, T3.3 | T4.1, T4.2, T4.3 | T5.1, T5.2, T5.3 | T6.1, T6.2, T6.3 |
| SER | 0% | 0% | 15% | 15% | 15% | 30% |
| Masks | On | Off | On | Off | On | On |
| User | Standard | Standard | Standard | Standard | Unfriendly | Standard |

Table 3: The set of benchmarking tasks. Each user model/error model/action mask environment is evaluated in three different domains.
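The task grid of Table 3 can be written down programmatically, which is convenient when looping over all tasks in an experiment script. The sketch below encodes the six environments and crosses them with the three domains. The dictionary layout and key names are assumptions for illustration, and the domain order (CR, SFR, LAP for Y = 1, 2, 3) is inferred from the order of Table 2 and from the statement that Task 2.3 is env. 2 in LAP.

```python
# Environment settings transcribed from Table 3.
ENVIRONMENTS = {
    1: {"ser": 0.00, "masks": True,  "user": "standard"},
    2: {"ser": 0.00, "masks": False, "user": "standard"},
    3: {"ser": 0.15, "masks": True,  "user": "standard"},
    4: {"ser": 0.15, "masks": False, "user": "standard"},
    5: {"ser": 0.15, "masks": True,  "user": "unfriendly"},
    6: {"ser": 0.30, "masks": True,  "user": "standard"},
}

# Assumed domain order for the second task index.
DOMAINS = ["CR", "SFR", "LAP"]

def build_tasks():
    """Return a dict mapping task names such as 'T2.3' to their settings."""
    tasks = {}
    for env_id, settings in ENVIRONMENTS.items():
        for dom_idx, domain in enumerate(DOMAINS, start=1):
            tasks[f"T{env_id}.{dom_idx}"] = {"domain": domain, **settings}
    return tasks
```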

5 Experimental Setup

In this section, the experimental setup used to run the benchmarking tasks is explained.

5.1 Simulated user and input channel

The user behaviour is modelled by an agenda-based simulator which provides semantic-level interactions [schatzmann2007agenda]. The actions taken during each dialogue are conditioned by parameters sampled from a user model. These are re-sampled at the beginning of each dialogue to ensure a unique profile for every dialogue. The user model has parameters (e.g. probabilities determining the frequency of repetitions and confirmations), and the range over which the parameters are sampled is provided by a PyDial configuration file. The semantic error rate introduced by the noisy speech channel is simulated through an error model [casanueva2014adaptive, thomson2012n] with parameters learned from real NLU data [mrkvsic2016neural]. (In order to ensure variability, the parameters of environments with different error rates are trained from different data, i.e. the environments are grouped in (1, 2), (3, 4, 5) and (6), each group having different parameters.) This model has parameters (e.g. specifying the variability of confidence scores in the input N-best-list).

All tasks use the same rule-based dialogue state tracker. It factorises the dialogue state distribution into the different slots defined by the ontology, plus several general slots which track dialogue meta-data, e.g. whether or not the user has been presented with some entity. Each slot has values also defined by the ontology. For a more detailed description of the state tracker refer to [Henderson2014b].

5.2 Summary actions and action masks

The MDP action set is defined as a set of summary actions [thomson2013statistical, young2010hidden]. This set consists of five slot-independent actions (inform by constraints, inform requested, inform alternatives, bye and request more) and three slot-dependent actions (request, confirm and select), making a total of $5 + 3n$ actions, where $n$ is the number of requestable slots (see Tab. 2). The mapping between summary and master actions is based on simple heuristics dependent on the belief state (e.g. inform a venue matching the top values of each slot, confirm the top value of a slot, etc.).
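The structure of the summary action set, five slot-independent actions plus three actions per slot, can be made concrete with a short sketch. The action identifiers and the example slot names are hypothetical and do not reproduce the names used inside PyDial.

```python
# Hypothetical identifiers for the five slot-independent summary actions.
SLOT_INDEPENDENT = [
    "inform_byconstraints",
    "inform_requested",
    "inform_alternatives",
    "bye",
    "request_more",
]

# The three action types that are instantiated once per slot.
SLOT_DEPENDENT = ["request", "confirm", "select"]

def summary_actions(slots):
    """Build the summary action list for a given list of slot names."""
    per_slot = [f"{act}_{slot}" for slot in slots for act in SLOT_DEPENDENT]
    return SLOT_INDEPENDENT + per_slot
```

For a hypothetical list of three slots such as `["foodtype", "area", "pricerange"]`, this yields 5 + 3 × 3 = 14 summary actions, so the output layer of a policy network grows linearly with the number of slots.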

In the case of action masks, similar heuristics are used. For slot dependent actions, these heuristics depend on the distribution of the values of that slot (e.g. confirm foodtype is masked if all the probability mass of foodtype is in the "none" value). For the slot independent actions, the masks depend on the general method slot, which tracks the way the user is conducting the database search. The masks of the slot independent actions are dependent on the value of this slot (e.g. inform by constraints is only unmasked if the top value of the method slot is byconstraints).
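The two mask heuristics given as examples above can be sketched as simple predicates over the belief state. The sketch assumes that the belief for each slot is a dictionary from values to probabilities. This representation, the function names and the value "byname" are assumptions for illustration, while "none" and "byconstraints" are the values named in the text.

```python
def is_confirm_allowed(slot_belief, none_value="none", tol=1e-9):
    """Confirm is masked only if all probability mass is on the 'none' value."""
    return 1.0 - slot_belief.get(none_value, 0.0) > tol

def is_inform_byconstraints_allowed(method_belief):
    """Inform by constraints is unmasked only if 'byconstraints' is the top method."""
    top_method = max(method_belief, key=method_belief.get)
    return top_method == "byconstraints"
```

A full mask for one turn would evaluate one such predicate per summary action and pass the resulting boolean vector to the policy.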

5.3 Model hyperparameters

GP-SARSA uses a linear kernel for the state space and a delta kernel for the action space. The scale, responsible for the degree of exploration, is set . The remaining parameters are set as in [gasic2014gaussian]. Further improvements in overall performance can be obtained with a Gaussian kernel with optimized hyperparameters [chen2015hyper]; however, this was not explored here.

Unlike in GP-SARSA, the trade-off between exploration and exploitation is not handled automatically in deep-RL models and depends on the number of training dialogues. The exploration schedule is often a critical factor in obtaining good learning performance. The $\epsilon$-greedy policy used here follows a linear schedule in which $\epsilon$ is annealed from its initial value to its final value after 4000 dialogues; the optimal initial value of $\epsilon$ was found by grid search over three candidate values.
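A linearly annealed $\epsilon$-greedy exploration schedule of the kind described above can be sketched as follows. Only the annealing horizon of 4000 dialogues comes from the text. The start and end values are left as arguments because they are not reproduced here, and the function names are illustrative.

```python
import random

def linear_epsilon(dialogue_idx, eps_start, eps_end, anneal_dialogues=4000):
    """Exploration rate after `dialogue_idx` training dialogues."""
    # Fraction of the annealing period completed, clipped to [0, 1].
    frac = min(max(dialogue_idx / anneal_dialogues, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For instance, with a hypothetical `eps_start` of 0.3 and `eps_end` of 0.0, the exploration rate halves to 0.15 after 2000 dialogues and the policy becomes fully greedy from dialogue 4000 onwards.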

All deep-RL policy models are composed of hidden feedforward layers. As the objective of the paper is to see how these models generalise across environments, the hyper-parameters of all models across all the tasks are kept the same. The hyperparameters are set as in [su2017sample], with the exception of the size of the hidden layers and the initial $\epsilon$, which are tuned by grid search. Table 5 presents the hyperparameters of the best models across each domain for all deep-RL algorithms, selected through a grid search over combinations of hyperparameters. The Adam optimiser was used to train all the deep-RL models, with an initial learning rate of [kingma2014adam].

For a more detailed description, the hyperparameters of every implemented model are specified in the PyDial configuration files provided for each task.

5.4 Handcrafted policy

In addition to the RL algorithms described in Table 1, the performance of a classic handcrafted policy interacting with each environment is also evaluated. The actions taken by this policy are based on carefully designed heuristics, dependent on the belief state [thomson2013statistical].

5.5 Reward function and performance metrics

The maximum dialogue length was set to turns and the discount factor was . The metrics presented in next section are the average success rate and average reward for each evaluated policy model. Success rate is defined as the percentage of dialogues which are completed successfully – i.e. whether the dialogue manager is able to fulfill the user goal or not. Final reward is defined as , where is the success indicator and is the dialogue length in turns.

6 Results and discussion

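| Env. | Domain | GP-Sarsa Suc. | GP-Sarsa Rew. | DQN Suc. | DQN Rew. | A2C Suc. | A2C Rew. | eNAC Suc. | eNAC Rew. | Handcrafted Suc. | Handcrafted Rew. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Env. 1 | CR | 99.4% | **13.5** | 93.9% | 12.7 | 89.3% | 11.6 | 94.8% | 12.4 | 100.0% | 14.0 |
| Env. 1 | SFR | 96.1% | 11.4 | 65.0% | 5.9 | 58.3% | 4.0 | 94.0% | **11.7** | 98.2% | 12.4 |
| Env. 1 | LAP | 89.1% | 9.4 | 70.1% | 6.9 | 57.1% | 3.5 | 91.4% | **10.5** | 97.0% | 11.7 |
| Env. 2 | CR | 96.8% | **12.2** | 91.9% | 12.0 | 75.5% | 7.0 | 83.6% | 9.0 | 100.0% | 14.0 |
| Env. 2 | SFR | 91.9% | **9.6** | 84.3% | 9.2 | 45.5% | -0.3 | 65.6% | 3.7 | 98.2% | 12.4 |
| Env. 2 | LAP | 82.3% | **7.3** | 74.5% | 6.6 | 26.8% | -5.0 | 55.1% | 1.5 | 97.0% | 11.7 |
| Env. 3 | CR | 95.1% | 11.0 | 93.4% | **11.9** | 74.6% | 7.3 | 90.8% | 11.2 | 96.7% | 11.0 |
| Env. 3 | SFR | 81.6% | 6.9 | 60.9% | 4.0 | 39.1% | -2.0 | 84.6% | **8.6** | 90.9% | 9.0 |
| Env. 3 | LAP | 68.3% | 4.5 | 61.1% | 4.3 | 37.0% | -1.9 | 76.6% | **6.7** | 89.6% | 8.7 |
| Env. 4 | CR | 91.5% | 9.9 | 90.0% | **10.7** | 64.7% | 3.7 | 85.3% | 9.0 | 96.7% | 11.0 |
| Env. 4 | SFR | 81.6% | 7.2 | 77.8% | **7.7** | 38.8% | -3.1 | 61.7% | 2.0 | 90.9% | 9.0 |
| Env. 4 | LAP | 72.7% | 5.3 | 68.7% | **5.5** | 27.3% | -6.0 | 52.8% | -0.8 | 89.6% | 8.7 |
| Env. 5 | CR | 93.8% | 9.8 | 90.7% | 10.3 | 70.1% | 5.0 | 91.6% | **10.5** | 95.9% | 9.7 |
| Env. 5 | SFR | 74.7% | 3.6 | 62.8% | 2.9 | 20.2% | -5.9 | 74.4% | **4.5** | 87.7% | 6.4 |
| Env. 5 | LAP | 39.5% | -1.6 | 45.5% | 0.0 | 28.9% | -4.7 | 75.8% | **4.1** | 85.1% | 5.5 |
| Env. 6 | CR | 89.6% | 8.8 | 87.8% | **10.0** | 62.3% | 3.5 | 79.6% | 8.0 | 89.6% | 9.3 |
| Env. 6 | SFR | 64.2% | 2.7 | 47.2% | 0.4 | 27.5% | -5.1 | 66.7% | **3.9** | 79.0% | 6.0 |
| Env. 6 | LAP | 44.9% | -0.2 | 46.1% | 1.0 | 32.1% | -3.8 | 64.6% | **3.6** | 76.1% | 5.3 |
| Mean | CR | 94.4% | 10.9 | 91.3% | **11.3** | 72.8% | 6.4 | 87.6% | 10.0 | 96.5% | 11.5 |
| Mean | SFR | 81.7% | **6.9** | 66.3% | 5.0 | 38.2% | -2.1 | 74.5% | 5.7 | 90.8% | 9.2 |
| Mean | LAP | 66.1% | 4.1 | 61.0% | 4.1 | 34.9% | -3.0 | 69.4% | **4.3** | 89.1% | 8.6 |
| Mean | ALL | 80.7% | **7.3** | 72.9% | 6.8 | 48.6% | 0.4 | 77.2% | 6.7 | 92.1% | 9.8 |

Table 4: Reward and success rates after training dialogues for the five policy models considered in this benchmark. Each row represents one of the different tasks. The highest reward obtained by a data driven model in each row is highlighted in bold.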

The evaluation results for the 18 tasks are presented in Table 4. As shown in Table 3, we refer to the tasks as task X.Y, where X indicates the user/error/mask environment and Y the domain; e.g. Task 2 refers to env. 2 in the three domains, and Task 2.3 refers to env. 2 in LAP. For each task, every model is trained over ten different random seeds and evaluated after training dialogues. The models are evaluated over test dialogues and the results shown are averaged over all seeds. In addition, evaluation results after and training dialogues are shown in Appendix C and learning curves for task are shown in Appendix D.

The results clearly show that the domain complexity plays a crucial role in the overall performance. Value-based methods (GP-SARSA and DQN) achieve the best performance in the CR domain across all six environmental settings. Value-based methods are known to have a higher learning rate. While this might lead to overfitting in the two larger domains (SFR and LAP), in domains with small action and state spaces a higher learning rate helps to achieve a good policy faster than policy-gradient-based methods. On the other hand, eNAC provides the best performance on the SFR and LAP tasks, suggesting that policy-gradient methods scale robustly to larger state and action spaces.

Action masks significantly reduce the size of the action space and thus increase the policy learning rate. In the environments where the action masks are deactivated (Env. 2 and Env. 4), the policy-gradient methods learn much more slowly. In contrast, value-based approaches still maintain reasonable performance, indicating that they are more sample-efficient than policy-based methods. However, it is worth noting that DQN is highly unstable, especially with larger domains (see Figure 2(b)). As noted earlier, this instability is mainly due to optimisation being performed in value space rather than directly in policy space. Thanks to its non-parametric approach, this pattern is not observed with GP-SARSA. In addition, after dialogues, the performance of eNAC decreases in some environments. This might be because the hyperparameters were optimised for dialogues. A more extensive grid search could solve the problem.

The performance of every model drops substantially when noise is introduced to the semantic input. Results from tasks , , and show, however, that eNAC is more robust in these partially observed environments and thus degrades less than the other methods. One reason for this is that, contrary to other deep-RL methods, the natural gradient points more directly to the desired goal and is less prone to getting stuck on local plateaus, thereby learning better policies in noisy environments.

As might be expected, interacting with the unfriendly set of users in task 5 degrades the performance. However, the performance drop is smaller for eNAC than for the rest of the models. This suggests that this policy is able to learn faster how to guide the dialogue when the user is less inclined to provide information about his or her goal.

GP-SARSA consistently performs well, showing very stable performance and a fast learning rate (see Appendix D). Overall, it is the best model across all tasks and domains both in terms of the learning rate and the final performance, followed closely by DQN and eNAC. (Note, however, that the mean results for eNAC are degraded because of the very poor performance in unmasked environments. If this problem could be solved, e.g. by using techniques to increase the sample efficiency [wang2016sample], this would be the best performing model.) A2C shows the worst results of all and, contrary to other RL applications, the ability to perform asynchronous learning is less useful because it significantly raises the training costs with real users. It can also be observed that some deep-RL models are prone to overfitting. Furthermore, these algorithms are very sensitive to hyper-parameter values.

Lastly, it is worth noting that the handcrafted policy model outperforms the RL-based policies in almost all the tasks in the larger domains (SFR and LAP), showing that RL-based models have difficulty learning in large state spaces. To mitigate this issue, state space abstraction [wang2015learning, papangelis2017single] or hierarchical reinforcement learning [budzianowski2017subdomain] approaches can be used.

6.1 Cross-tasks evaluation

To further examine the generalisation capabilities of the various algorithms, we performed some cross-task evaluations. We chose three tasks, namely , and to test how algorithms trained in a noisy environment perform in a zero noise set-up and vice versa. Table 8 presents results for GP-SARSA, DQN and eNAC. For clarity, we omit A2C results since this algorithm performed substantially worse.

Results show that eNAC has the strongest generalisation capabilities, having the best performance in most of the cross-task environments. Value-based models perform well when trained on noisy data and tested on clean data, with DQN getting very close to the performance of eNAC. However, when they are trained on clean data and tested on noisy data, the performance decreases greatly, especially in the larger domains. This decrease in performance is more severe for GP-SARSA.

7 Conclusions and future work

To our knowledge, this is the first work to present a set of extensive simulated dialogue management environments along with a comparison of several RL algorithms using an open-domain toolkit. The results show that considerable improvement is still necessary for data-driven models to match the performance of handcrafted policies, especially in larger domains. The environments presented in this paper, however, are still very constrained compared to real world tasks (e.g. Siri, Alexa). In the future, we plan to include multi-domain environments (where the rewards are more delayed in time and which are thus more challenging) and word-level user simulations, which would enable the dialogue managers to be trained in more realistic environments. Also, these environments are implemented in an open-domain toolkit, allowing the research community to add new algorithms and new tasks.


This research was partly funded by the EPSRC grant EP/M018946/1 Open Domain Statistical Spoken Dialogue Systems. Paweł Budzianowski is supported by EPSRC Council and Toshiba Research Europe Ltd, Cambridge Research Laboratory. Pei-Hao Su is supported by Cambridge Trust and the Ministry of Education, Taiwan. The benchmark is available on-line at http://www.camdial.org/pydial/benchmarks/.


Appendix A Example of the dialogue flow in a modular SDS

Figure 2: Dialogue flow during a single turn in a modular SDS. The turn begins with the user action (1), which is processed by the input channel and converted into an N-best list with confidence scores. This N-best list is used by the state tracker (2) to update the dialogue (belief) state. The dialogue state is used by several modules of the system decision process (3). First, the action masks are defined based on the dialogue state. Then, the policy chooses the optimal summary action based on the dialogue state and the action masks. Finally, the summary action is converted into a master action using heuristics based on the dialogue state. The system then outputs the action (usually through an NLG+TTS channel) (4) and the cycle begins again (5).
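
The decision step (3) can be illustrated with a minimal sketch. Everything named below (the summary action names, the score list, the belief layout and both helper functions) is a hypothetical illustration and not PyDial's API; the sketch only assumes that the policy produces one score per summary action and that the mask is a list of booleans.

```python
# Hypothetical summary action set; real systems define their own.
SUMMARY_ACTIONS = ["request_food", "confirm_food", "inform", "bye"]


def select_summary_action(scores, mask):
    """Return the index of the best-scoring summary action allowed by the mask."""
    allowed = [i for i, ok in enumerate(mask) if ok]
    if not allowed:
        raise ValueError("action mask forbids every summary action")
    return max(allowed, key=lambda i: scores[i])


def to_master_action(summary_action, belief):
    """Illustrative heuristic: fill the slot with its most probable value."""
    if summary_action.startswith("confirm_"):
        slot = summary_action[len("confirm_"):]
        value = max(belief[slot], key=belief[slot].get)
        return f"confirm({slot}={value})"
    if summary_action.startswith("request_"):
        return f"request({summary_action[len('request_'):]})"
    return f"{summary_action}()"


# Assumed belief state: a distribution over values for each slot.
belief = {"food": {"italian": 0.7, "chinese": 0.2, "dontcare": 0.1}}
scores = [0.1, 0.4, 0.9, 0.0]        # one policy score per summary action
mask = [True, True, False, True]     # "inform" is masked out in this state

idx = select_summary_action(scores, mask)
master = to_master_action(SUMMARY_ACTIONS[idx], belief)
print(SUMMARY_ACTIONS[idx], "->", master)
```

Although "inform" has the highest score here, it is masked, so the sketch falls back to the best allowed action before the heuristic expands it into a master action.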

Appendix B Architecture details

Model   Hidden layer 1   Hidden layer 2   starting value
DQN     300              100              0.3
A2C     200              75               0.5
eNAC    130              50               0.3

Table 5: Hyperparameters of the deep-RL models

Appendix C Results after 1000 and 10000 training dialogues

              GP-Sarsa       DQN            A2C            ENAC           Handcrafted
Env.    Task  Suc.    Rew.   Suc.    Rew.   Suc.    Rew.   Suc.    Rew.   Suc.    Rew.
Env. 1  CR    98.0%   13.0   88.6%   11.6   83.4%   10.0   93.0%   12.2   100.0%  14.0
Env. 1  SFR   91.9%   10.0   48.0%   2.7    46.9%   1.7    85.8%   9.9    98.2%   12.4
Env. 1  LAP   78.9%   6.7    61.9%   5.5    41.1%   0.3    84.2%   8.8    97.0%   11.7
Env. 2  CR    91.1%   10.8   67.6%   6.4    58.6%   4.0    78.8%   6.6    100.0%  14.0
Env. 2  SFR   82.1%   7.4    64.2%   5.4    39.0%   -1.1   67.5%   2.7    98.2%   12.4
Env. 2  LAP   68.4%   3.1    70.8%   5.8    31.0%   -3.6   57.8%   -0.5   97.0%   11.7
Env. 3  CR    91.9%   10.4   79.5%   9.2    66.5%   6.0    85.7%   10.0   96.7%   11.0
Env. 3  SFR   76.6%   5.5    42.4%   1.0    34.5%   -2.1   73.6%   6.2    90.9%   9.0
Env. 3  LAP   65.0%   2.8    51.9%   3.1    32.7%   -2.2   71.0%   5.5    89.6%   8.7
Env. 4  CR    88.2%   9.3    73.5%   6.9    54.2%   2.2    73.6%   4.4    96.7%   11.0
Env. 4  SFR   73.6%   4.9    65.9%   4.5    27.2%   -3.7   60.4%   0.8    90.9%   9.0
Env. 4  LAP   61.3%   0.3    53.2%   2.7    28.1%   -3.7   46.9%   -2.9   89.6%   8.7
Env. 5  CR    90.2%   9.0    60.1%   4.1    49.0%   1.6    81.2%   8.1    95.9%   9.7
Env. 5  SFR   65.3%   1.3    32.5%   -2.0   14.0%   -6.2   54.0%   0.9    87.7%   6.4
Env. 5  LAP   44.9%   -2.8   31.4%   -1.8   17.8%   -5.5   61.3%   1.7    85.1%   5.5
Env. 6  CR    84.9%   8.3    72.3%   6.9    50.2%   2.1    73.6%   6.7    89.6%   9.3
Env. 6  SFR   59.7%   0.7    35.6%   -1.2   19.0%   -5.6   55.2%   1.4    79.0%   6.0
Env. 6  LAP   52.0%   -1.5   47.5%   1.4    20.7%   -5.3   56.3%   1.9    76.1%   5.3
Mean    CR    90.7%   10.1   73.6%   7.5    60.3%   4.3    81.0%   8.0    96.5%   11.5
Mean    SFR   74.9%   5.0    48.1%   1.7    30.1%   -2.8   66.1%   3.6    90.8%   9.2
Mean    LAP   61.7%   1.4    52.8%   2.8    28.6%   -3.3   62.9%   2.4    89.1%   8.6
Mean    ALL   75.8%   5.5    58.2%   4.0    39.7%   -0.6   70.0%   4.7    92.1%   9.8

Table 6: Reward and success rates after 1000 training dialogues.

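
The summary rows at the bottom of Table 6 are consistent with a plain arithmetic mean over the six environments and, for the ALL row, over the three domains. The paper does not state how they were computed, so the following is only a quick check under that assumption, using GP-Sarsa values copied from the table (the helper name is hypothetical).

```python
def mean_1dp(values):
    """Arithmetic mean rounded to one decimal place, as printed in the table."""
    return round(sum(values) / len(values), 1)


# GP-Sarsa on CR, Env. 1 to Env. 6 (Table 6).
gp_cr_success = [98.0, 91.1, 91.9, 88.2, 90.2, 84.9]
gp_cr_reward = [13.0, 10.8, 10.4, 9.3, 9.0, 8.3]

# GP-Sarsa per-domain means for CR, SFR and LAP (Table 6).
gp_domain_success = [90.7, 74.9, 61.7]
gp_domain_reward = [10.1, 5.0, 1.4]

cr_success_mean = mean_1dp(gp_cr_success)       # table reports 90.7
cr_reward_mean = mean_1dp(gp_cr_reward)         # table reports 10.1
all_success_mean = mean_1dp(gp_domain_success)  # table reports 75.8
all_reward_mean = mean_1dp(gp_domain_reward)    # table reports 5.5
print(cr_success_mean, cr_reward_mean, all_success_mean, all_reward_mean)
```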
              GP-Sarsa       DQN            A2C            ENAC           Handcrafted
Env.    Task  Suc.    Rew.   Suc.    Rew.   Suc.    Rew.   Suc.    Rew.   Suc.    Rew.
Env. 1  CR    99.4%   13.5   92.5%   12.4   86.3%   10.5   85.3%   10.5   100.0%  14.0
Env. 1  SFR   97.3%   11.7   79.5%   8.7    65.4%   5.4    97.0%   12.3   98.2%   12.4
Env. 1  LAP   90.3%   9.7    72.9%   7.3    56.0%   3.5    92.1%   11.0   97.0%   11.7
Env. 2  CR    97.9%   12.4   96.1%   12.7   66.3%   4.4    49.4%   2.3    100.0%  14.0
Env. 2  SFR   95.4%   10.1   84.2%   9.7    32.9%   -3.3   59.0%   3.0    98.2%   12.4
Env. 2  LAP   87.5%   8.4    83.9%   9.1    22.2%   -6.0   42.7%   -0.2   97.0%   11.7
Env. 3  CR    95.8%   10.9   94.7%   12.2   81.5%   8.4    76.0%   8.2    96.7%   11.0
Env. 3  SFR   81.2%   6.5    73.1%   6.2    37.8%   -2.9   84.1%   8.6    90.9%   9.0
Env. 3  LAP   64.3%   3.9    69.2%   5.6    48.5%   -0.3   73.3%   6.5    89.6%   8.7
Env. 4  CR    92.6%   10.0   91.9%   11.1   61.9%   2.3    51.6%   2.4    96.7%   11.0
Env. 4  SFR   81.0%   6.9    81.1%   8.2    34.1%   -4.9   28.0%   -3.5   90.9%   9.0
Env. 4  LAP   74.0%   5.8    69.3%   5.6    25.2%   -7.3   35.2%   -1.7   89.6%   8.7
Env. 5  CR    91.7%   8.8    92.6%   10.5   67.8%   3.9    78.9%   7.9    95.9%   9.7
Env. 5  SFR   68.6%   2.7    72.8%   4.5    23.8%   -6.3   82.3%   6.5    87.7%   6.4
Env. 5  LAP   36.9%   -1.4   53.3%   0.7    24.7%   -5.6   72.8%   3.8    85.1%   5.5
Env. 6  CR    89.6%   8.6    88.3%   9.9    62.3%   2.8    57.8%   3.9    89.6%   9.3
Env. 6  SFR   54.8%   1.3    64.8%   3.6    23.5%   -6.3   61.1%   3.1    79.0%   6.0
Env. 6  LAP   45.6%   0.3    52.1%   1.7    25.3%   -5.6   61.2%   3.2    76.1%   5.3
Mean    CR    94.5%   10.7   92.7%   11.5   71.0%   5.4    66.5%   5.9    96.5%   11.5
Mean    SFR   79.7%   6.5    75.9%   6.8    36.2%   -3.1   68.6%   5.0    90.8%   9.2
Mean    LAP   66.4%   4.4    66.8%   5.0    33.7%   -3.6   62.9%   3.8    89.1%   8.6
Mean    ALL   80.2%   7.2    78.5%   7.8    47.0%   -0.4   66.0%   4.9    92.1%   9.8

Table 7: Reward and success rates after 10000 training dialogues.

Appendix D Learning curves for task 3

Figure 3: Performance of the benchmarked algorithms as a function of the number of dialogues for three different domains: (a) Cambridge Restaurants, (b) San Francisco Restaurants, (c) Laptops. The shaded area depicts the mean ± the standard deviation over ten different random seeds.

Appendix E Cross task experiments

Evaluation:         Env. 1                  Env. 3                  Env. 6
Training  Model     CR      SFR     LAP     CR      SFR     LAP     CR      SFR     LAP
Env. 1    GP-SARSA                          0.6     -6.8    -5.9    -4.3    -12.3   -11.0
Env. 1    DQN                               8.0     0.2     3.4     4.6     -1.8    0.8
Env. 1    ENAC                              9.7*    8.0*    7.3*    7.0*    4.3*    4.1*
Env. 3    GP-SARSA  9.5     9.9     6.1                             7.0     0.7     -2.7
Env. 3    DQN       13.0    7.7     6.5                             9.1*    1.9     1.6
Env. 3    ENAC      13.1*   10.9*   10.3*                           7.7     4.7*    3.9*
Env. 6    GP-SARSA  11.9    9.1     5.1     10.6    6.1     2.6
Env. 6    DQN       13.8*   6.3     6.0     12.0*   3.7     3.7
Env. 6    ENAC      13.2    9.7*    10.2*   10.9    7.2*    7.0*

Table 8: Reward obtained by the three best performing algorithms in cross-tasks evaluation. The (wide) rows represent the environment in which the model is trained and the (wide) columns the environment where it is evaluated. The (thin) rows represent the model and the (thin) columns the domain where the model is trained and tested. The reward for the best performing model in each cross-environment setup and domain combination is marked with an asterisk.
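
The best-performing model in each cell of Table 8 can be recomputed directly from the rewards. The sketch below copies the table values and counts how often each model wins; the assignment of each row's six values to evaluation environments assumes that the matched train/test setups are the omitted cells, and the variable names are illustrative only.

```python
from collections import Counter

DOMAINS = ("CR", "SFR", "LAP")

# (training env, evaluation env) -> model -> rewards for CR, SFR, LAP (Table 8).
TABLE8 = {
    ("Env. 1", "Env. 3"): {"GP-SARSA": (0.6, -6.8, -5.9), "DQN": (8.0, 0.2, 3.4), "ENAC": (9.7, 8.0, 7.3)},
    ("Env. 1", "Env. 6"): {"GP-SARSA": (-4.3, -12.3, -11.0), "DQN": (4.6, -1.8, 0.8), "ENAC": (7.0, 4.3, 4.1)},
    ("Env. 3", "Env. 1"): {"GP-SARSA": (9.5, 9.9, 6.1), "DQN": (13.0, 7.7, 6.5), "ENAC": (13.1, 10.9, 10.3)},
    ("Env. 3", "Env. 6"): {"GP-SARSA": (7.0, 0.7, -2.7), "DQN": (9.1, 1.9, 1.6), "ENAC": (7.7, 4.7, 3.9)},
    ("Env. 6", "Env. 1"): {"GP-SARSA": (11.9, 9.1, 5.1), "DQN": (13.8, 6.3, 6.0), "ENAC": (13.2, 9.7, 10.2)},
    ("Env. 6", "Env. 3"): {"GP-SARSA": (10.6, 6.1, 2.6), "DQN": (12.0, 3.7, 3.7), "ENAC": (10.9, 7.2, 7.0)},
}

wins = Counter()
best = {}
for setup, per_model in TABLE8.items():
    for d, domain in enumerate(DOMAINS):
        # Model with the highest reward for this setup and domain.
        winner = max(per_model, key=lambda m: per_model[m][d])
        best[(setup, domain)] = winner
        wins[winner] += 1

print(dict(wins))
```

Counting wins this way matches the observation in Section 6.1 that eNAC is best in most cross-task settings, with DQN ahead only in some of the CR cells.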